Hi Joel

I built such a pipeline to transform PDF -> text:
https://github.com/EDS-APHP/SparkPdfExtractor
You can take a look.

It transforms 20M PDFs in 2 hours on a 5-node Spark cluster.
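Roughly, the idea is to distribute file *paths* rather than file
contents, and to do the extraction inside mapPartitions, so the driver
never has to enumerate or cache millions of binary blobs. A minimal
sketch of that pattern (not the repo's actual code; it assumes PDFBox
2.x on the classpath, and the listing file and output path here are
made up):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// Ship only path strings to the executors; each executor opens its
// own files straight from HDFS, so the driver stays small.
val paths = spark.sparkContext
  .textFile("hdfs:///tmp/pdf-paths.txt") // one PDF path per line
  .repartition(2000)                     // explicit, tunable parallelism

val texts = paths.mapPartitions { iter =>
  val fs = FileSystem.get(new Configuration())
  val stripper = new PDFTextStripper()   // reused within the partition
  iter.map { p =>
    val in = fs.open(new Path(p))
    try {
      val doc = PDDocument.load(in)
      try (p, stripper.getText(doc)) finally doc.close()
    } finally in.close()
  }
}

texts
  .map { case (p, t) => p + "\t" + t.replace('\n', ' ') }
  .saveAsTextFile("hdfs:///tmp/pdf-texts")

Because the parallelism comes from repartitioning a plain RDD of
strings, the task count no longer depends on how binaryFiles decides
to split its input.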

On 2018-10-10 23:56, Joel D wrote:
> Hi,
> 
> I need to process millions of PDFs in HDFS using Spark. First I’m
> trying with some 40k files. I’m using the binaryFiles API, with which
> I’m facing a couple of issues:
> 
> 1. It creates only 4 tasks, and I can’t seem to increase the
> parallelism there.
> 2. It took 2276 seconds, which means it will take ages to complete
> for millions of files. I’m also expecting it to fail for millions of
> records with some timeout or GC overhead exception.
> 
> val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache()
> 
> val fileContentRdd = files.map(file => myFunc(file))
> 
> Do you have any guidance on how I can process millions of files using
> the binaryFiles API?
> 
> How can I increase the number of tasks/parallelism during the
> creation of the files RDD?
> 
> Thanks
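For the parallelism question specifically: the minPartitions argument
to binaryFiles is only a hint, and in Spark 2.x it can be effectively
ignored (see SPARK-22357), which is likely why you are stuck at 4
tasks. A common workaround, sketched on your own snippet (the 200 is
illustrative, not tuned), is to repartition right after the read:

// Force the task count explicitly, since the hint is unreliable.
// Dropping .cache can also help: caching 40k binary payloads may
// evict or OOM executors before the real work starts.
val files = sparkSession.sparkContext
  .binaryFiles(filePath, 200)
  .repartition(200) // ~200 tasks for the map below

val fileContentRdd = files.map(file => myFunc(file))

If shuffling the raw bytes through repartition is too expensive, the
path-based approach sketched above avoids it entirely.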
