Hi Joel
I built such a pipeline to transform PDF -> text:
https://github.com/EDS-APHP/SparkPdfExtractor
You can take a look
It transforms 20M PDFs in 2 hours on a 5-node Spark cluster.
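To put those numbers in perspective, here is a rough back-of-the-envelope sketch of the throughput they imply. It assumes the load is spread evenly across the nodes, which the figures above do not confirm; the function name `throughput` is only illustrative and is not part of SparkPdfExtractor.

```python
def throughput(total_docs: int, hours: float, nodes: int) -> tuple[float, float]:
    """Return (docs per second for the cluster, docs per second per node).

    Assumes an even spread of work across nodes (an assumption, not a
    measured property of the pipeline).
    """
    seconds = hours * 3600
    cluster_rate = total_docs / seconds
    per_node_rate = cluster_rate / nodes
    return cluster_rate, per_node_rate


# Figures quoted above: 20M PDFs, 2 hours, 5 nodes.
cluster_rate, per_node_rate = throughput(20_000_000, 2, 5)
print(f"cluster: {cluster_rate:.0f} PDFs/s, per node: {per_node_rate:.0f} PDFs/s")
```

So the cluster averages a bit under 3,000 PDFs per second, or roughly 2M PDFs per node per hour under the even-spread assumption.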
On 2018-10-10 23:56, Joel D wrote:
> Hi,
>
> I need to process millions of PDFs in hdfs using spark. First I’m
I believe your use case would be better covered by a custom data source that
reads PDF files.
On Big Data platforms in general, the issue is that individual PDF files are
very small and there are a lot of them - this is not very efficient for those
platforms. That could also be one source of your pe