Hi, I need to process millions of PDFs stored in HDFS using Spark. To start, I'm testing with about 40k files. I'm using the binaryFiles API, with which I'm facing a couple of issues:
1. It creates only 4 tasks, and I can't seem to increase the parallelism there.
2. It took 2276 seconds for the 40k files, which means millions of files will take ages to complete. I'm also expecting it to fail for millions of records with a timeout or GC overhead exception.

This is the code I'm running (a fuller, self-contained version is at the end of this post):

    val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache()
    val fileContentRdd = files.map(file => myFunc(file))

Do you have any guidance on how I can process millions of files using the binaryFiles API? How can I increase the number of tasks/parallelism during the creation of the files RDD?

Thanks
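
For completeness, here is a minimal self-contained sketch of roughly what I'm running. The app name, the HDFS path, and the body of myFunc are placeholders; my real myFunc does the actual PDF parsing.

    import org.apache.spark.input.PortableDataStream
    import org.apache.spark.sql.SparkSession

    object ProcessPdfs {

      // Placeholder for my real PDF-parsing logic; here it just returns
      // the file path and the number of bytes read.
      def myFunc(file: (String, PortableDataStream)): (String, Long) = {
        val bytes = file._2.toArray()
        (file._1, bytes.length.toLong)
      }

      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession.builder()
          .appName("pdf-binary-files") // placeholder app name
          .getOrCreate()

        // Placeholder HDFS path; the real job points at the directory with ~40k PDFs
        val filePath = "hdfs:///data/pdfs"

        // The 200 here is the minPartitions hint mentioned above,
        // yet the job still creates only 4 tasks.
        val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache()

        val fileContentRdd = files.map(file => myFunc(file))

        // Force evaluation so the full pass over the files gets timed.
        println(fileContentRdd.count())

        sparkSession.stop()
      }
    }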