I believe your use case would be better served by a custom data source that 
reads PDF files.

On Big Data platforms in general, the problem is that individual PDF files 
are very small and there are a lot of them, which is not very efficient for 
those platforms. That could also be one source of your performance problems 
(not necessarily the parallelism). You would need to make 1 million requests to 
the namenode (this could also be interpreted as a denial-of-service attack). 
Historically, Hadoop Archives were introduced to address this problem: 

You can also try storing them first in HBase or, in the future, in Hadoop 
Ozone. That could make higher parallelism possible "out of the box". 
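
As a rough illustration of the "custom reading" idea, here is a minimal sketch 
that lists the directory once on the driver, spreads the paths over a chosen 
number of partitions, and opens each file through the Hadoop FileSystem API on 
the executors. It is not a full Spark data source implementation. The names 
PdfPathsExample, readAll, pdfSize, inputDir and numPartitions are hypothetical, 
pdfSize only stands in for your own processing function, and the sketch 
assumes a flat directory and executors that can pick up the Hadoop 
configuration from their classpath.

```scala
import java.io.ByteArrayOutputStream

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PdfPathsExample {
  // Hypothetical stand-in for the real PDF processing (myFunc in the question):
  // it only reports how many bytes were read for each path.
  def pdfSize(path: String, bytes: Array[Byte]): (String, Int) = (path, bytes.length)

  def readAll(spark: SparkSession, inputDir: String, numPartitions: Int): RDD[(String, Int)] = {
    val sc = spark.sparkContext

    // One directory listing on the driver (assumes a flat directory).
    val dir = new Path(inputDir)
    val fs = FileSystem.get(dir.toUri, sc.hadoopConfiguration)
    val paths = fs.listStatus(dir).filter(_.isFile).map(_.getPath.toString).toSeq

    // The number of tasks is now set explicitly by numPartitions.
    sc.parallelize(paths, numPartitions).mapPartitions { iter =>
      // Assumption: the default Configuration on the executor points at the cluster.
      val conf = new Configuration()
      iter.map { p =>
        val path = new Path(p)
        val in = path.getFileSystem(conf).open(path)
        try {
          val out = new ByteArrayOutputStream()
          IOUtils.copyBytes(in, out, 4096, false) // false: streams are closed below
          pdfSize(p, out.toByteArray)
        } finally in.close()
      }
    }
  }
}
```

Note that every file is still opened individually, so this only gives you 
control over the number of tasks. It does not reduce the load on the namenode 
the way an archive or HBase would.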

> On 10.10.2018 at 23:56, Joel D <games2013....@gmail.com> wrote:
> Hi,
> I need to process millions of PDFs in HDFS using Spark. First I’m trying with 
> some 40k files. I’m using the binaryFiles API, with which I’m facing a couple 
> of issues:
> 1. It creates only 4 tasks and I can’t seem to increase the parallelism 
> there. 
> 2. It took 2276 seconds and that means for millions of files it will take 
> ages to complete. I’m also expecting it to fail for million records with some 
> timeout or gc overhead exception.
> val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache
> val fileContentRdd = files.map(file => myFunc(file))
> Do you have any guidance on how I can process millions of files using 
> binaryFiles api?
> How can I increase the number of tasks/parallelism during the creation of 
> files rdd?
> Thanks
