Hi, I have millions of HTML files in a directory and am using the "wholeTextFiles" API to load them for further processing. Right now I am testing with 40k records, and the load step (wholeTextFiles) alone takes at least 8-9 minutes. What optimizations are recommended? Should I consider one of Spark's file stream APIs instead of "wholeTextFiles"?
Other info: running 1 master and 4 worker nodes (4 allocated). I also added a repartition: jsc.wholeTextFiles(filesDirPath).repartition(4);