Hi,

I need to process millions of PDFs stored in HDFS using Spark. As a first step I'm trying with about 40k files. I'm using the binaryFiles API, with which I'm facing a couple
of issues:

1. It creates only 4 tasks and I can't seem to increase the parallelism
there.
2. It took 2,276 seconds for the 40k files, which means millions of files will
take ages to complete. I'm also expecting it to fail at that scale with
some timeout or GC overhead exception.

val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache()

val fileContentRdd = files.map(file => myFunc(file))
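
For reference, here is a minimal sketch of the kind of change I have in mind, in case explicit repartitioning plus smaller input-split settings is the right direction. The two spark.files.* config names and the sizes are my assumption from the docs, not something I've verified:

import org.apache.spark.sql.SparkSession

// Assumption: smaller spark.files.maxPartitionBytes means binaryFiles packs
// fewer bytes per partition, so more input partitions are created.
val spark = SparkSession.builder()
  .appName("pdf-processing")
  .config("spark.files.maxPartitionBytes", 8L * 1024 * 1024)   // default is 128 MB
  .config("spark.files.openCostInBytes", 1L * 1024 * 1024)     // default is 4 MB
  .getOrCreate()

val files = spark.sparkContext.binaryFiles(filePath, 200)

// minPartitions seems to be only a hint, so also force the parallelism
// explicitly before the expensive per-file work.
val fileContentRdd = files
  .repartition(200)
  .map(file => myFunc(file))

My understanding is that the minPartitions argument to binaryFiles is only a hint, so the repartition would at least spread the myFunc work over 200 tasks, at the cost of shuffling the raw PDF bytes.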



Do you have any guidance on how I can process millions of files using the
binaryFiles API?

How can I increase the number of tasks/parallelism during the creation of
the files RDD?

Thanks
