Hi,

I have millions of HTML files in a directory and am using the "wholeTextFiles"
API to load them for further processing. Right now I am testing with 40k
records, and the loading step (wholeTextFiles) alone takes at least 8-9
minutes. What are some recommended optimizations? Should I consider any of
Spark's file stream APIs instead of "wholeTextFiles"?


Other info:
Running 1 master and 4 worker nodes (4 allocated).
Added a repartition: jsc.wholeTextFiles(filesDirPath).repartition(4);
