Hi Joel,

I built such a pipeline to transform PDF -> text: https://github.com/EDS-APHP/SparkPdfExtractor — take a look.
It transforms 20M PDFs in 2 hours on a 5-node Spark cluster.

On 2018-10-10 23:56, Joel D wrote:
> Hi,
>
> I need to process millions of PDFs in HDFS using Spark. First I'm
> trying with some 40k files. I'm using the binaryFiles API, with which
> I'm facing a couple of issues:
>
> 1. It creates only 4 tasks, and I can't seem to increase the
> parallelism there.
> 2. It took 2276 seconds, which means it will take ages to complete for
> millions of files. I'm also expecting it to fail for millions of
> records with some timeout or GC overhead exception.
>
> val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache
>
> val fileContentRdd = files.map(file => myFunc(file))
>
> Do you have any guidance on how I can process millions of files using
> the binaryFiles API?
>
> How can I increase the number of tasks/parallelism during the creation
> of the files RDD?
>
> Thanks

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
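One note on the parallelism issue: the minPartitions argument to binaryFiles is only a hint, and the read itself happens in however many tasks the input format produces, so calling repartition afterwards does not speed up the read stage. A common workaround is to list the file paths on the driver and parallelize the paths instead, so you control the task count directly and each task opens its own files. A minimal sketch of that idea (the input directory /data/pdfs and the stand-in for your myFunc are assumptions, not from your code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object PdfBatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("pdf-extract").getOrCreate()
    val sc = spark.sparkContext

    // Listing paths on the driver is cheap: it touches metadata only,
    // never the file contents.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val paths = fs.listStatus(new Path("/data/pdfs")) // assumed input dir
      .map(_.getPath.toString)

    // Parallelize the *paths* with an explicit partition count, so the
    // heavy per-file work runs in 200 tasks rather than 4.
    val results = sc.parallelize(paths.toSeq, 200).mapPartitions { it =>
      // Re-open the FileSystem on the executor side; driver-side
      // Hadoop objects are not serializable.
      val conf = new Configuration()
      it.map { p =>
        val path = new Path(p)
        val in = path.getFileSystem(conf).open(path)
        try {
          // Placeholder: read the bytes and record the size.
          // Replace this with your actual PDF-to-text call (myFunc).
          val bytes = org.apache.commons.io.IOUtils.toByteArray(in)
          (p, bytes.length)
        } finally in.close()
      }
    }

    results.take(5).foreach(println)
    spark.stop()
  }
}
```

For millions of files you would also want to batch the driver-side listing (listStatus on a single directory with millions of entries is slow) and avoid .cache on the binary contents, since caching millions of PDFs will exhaust executor memory.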