Re: Process Million Binary Files

2018-10-11 Thread Nicolas PARIS
Hi Joel,

I built such a pipeline to transform PDF -> text: https://github.com/EDS-APHP/SparkPdfExtractor
You can take a look. It transforms 20M PDFs in 2 hours on a 5-node Spark cluster.

On 2018-10-10 23:56, Joel D wrote:
> Hi,
>
> I need to process millions of PDFs in HDFS using Spark. First I’m

Re: Process Million Binary Files

2018-10-10 Thread Jörn Franke
I believe your use case would be better covered by a custom data source that reads PDF files. On Big Data platforms in general, you have the issue that individual PDF files are very small and there are a lot of them - this is not very efficient for those platforms. That could also be one source of your pe