Re: Process Million Binary Files

2018-10-11 Thread Nicolas PARIS
Hi Joel,

I built such a pipeline to transform PDF -> text: https://github.com/EDS-APHP/SparkPdfExtractor. You can take a look. It transforms 20M PDFs in 2 hours on a 5-node Spark cluster.

On 2018-10-10 23:56, Joel D wrote: > Hi, > > I need to process millions of PDFs in hdfs using spark. First I’m

Re: Process Million Binary Files

2018-10-11 Thread Jörn Franke
I believe your use case can be better covered with your own data source for reading PDF files. On Big Data platforms in general, you have the issue that individual PDF files are very small and there are a lot of them - this is not very efficient for those platforms. That could also be one source of your
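The small-files inefficiency Jörn describes is usually worked around by grouping many small PDFs into fewer, larger units of work, so each task reads a batch of files rather than one. A minimal sketch of such a grouping step (pure Python; `batch_by_size` and the 128 MB target are illustrative choices, not part of any Spark API):

```python
def batch_by_size(files, target_bytes=128 * 1024 * 1024):
    """Group (path, size) pairs into batches of roughly target_bytes each,
    so one task processes many small PDFs instead of a single tiny file."""
    batches, current, current_size = [], [], 0
    for path, size in files:
        # Start a new batch once adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Ten 40 KB files with a 100 KB target -> five batches of two files each.
files = [("f%d.pdf" % i, 40_000) for i in range(10)]
print(batch_by_size(files, target_bytes=100_000))
```

Each batch could then become one element of a parallelized RDD, with the task opening its files directly from HDFS.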

Process Million Binary Files

2018-10-10 Thread Joel D
Hi,

I need to process millions of PDFs in HDFS using Spark. First I’m trying with some 40k files. I’m using the binaryFiles API, with which I’m facing a couple of issues:

1. It creates only 4 tasks and I can’t seem to increase the parallelism there.
2. It took 2276 seconds and that means for millions
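On issue 1: `binaryFiles` accepts a `minPartitions` argument that hints at the split count; without it, Spark may pack everything into very few tasks. A minimal PySpark sketch, assuming the path `hdfs:///data/pdfs` and the 200-files-per-task target are placeholders (and noting that `minPartitions` is only a hint, so the actual partition count can differ):

```python
def partitions_for(num_files, files_per_task=200):
    """Pick a minPartitions hint so roughly files_per_task files
    land in each task; always at least 1."""
    return max(1, num_files // files_per_task)

if __name__ == "__main__":
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pdf-extract").getOrCreate()
    sc = spark.sparkContext

    # For ~40k files, hint at 200 partitions instead of the default 4 tasks.
    hint = partitions_for(40_000)
    rdd = sc.binaryFiles("hdfs:///data/pdfs", minPartitions=hint)
    print(rdd.getNumPartitions())
```

The RDD yields `(path, PortableDataStream)` pairs, so each task can open its PDFs lazily rather than loading all bytes up front.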