Optimization for Processing a million of HTML files

Reth RM Mon, 12 Dec 2016 15:07:58 -0800

Hi,

I have millions of HTML files in a directory and am using the
"wholeTextFiles" API to load them for further processing. Right now I am
testing with 40k records, and the loading step (wholeTextFiles) alone takes
at least 8-9 minutes. What are some recommended optimizations? Should I
consider one of Spark's file stream APIs instead of "wholeTextFiles"?



Other info:
Running 1 master and 4 worker nodes (4 allocated).
Added a repartition: jsc.wholeTextFiles(filesDirPath).repartition(4);
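
To put the 8-9 minute load time in context, it may help to know how long it takes just to read the files, without Spark involved. Below is a minimal plain-Java sketch of such a baseline. It assumes the files sit directly under a directory reachable through the local filesystem, which may not match the actual cluster storage. The class name HtmlReadBaseline and the method readAll are hypothetical and not part of the job described above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: a single-machine read baseline, no Spark involved.
public class HtmlReadBaseline {

    /** Reads every regular file directly under dir; returns {fileCount, totalBytes}. */
    public static long[] readAll(Path dir) throws IOException {
        List<Path> files;
        // Files.list holds a directory handle open, so close it explicitly.
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        long totalBytes = 0;
        for (Path file : files) {
            // Read the whole file, much as wholeTextFiles does per record.
            totalBytes += Files.readAllBytes(file).length;
        }
        return new long[] {files.size(), totalBytes};
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args[0]); // assumed: local path to the HTML directory
        long start = System.nanoTime();
        long[] stats = readAll(dir);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%d files, %d bytes, %d ms%n", stats[0], stats[1], elapsedMs);
    }
}
```

If this baseline finishes far faster than the Spark load on the same 40k files, the time is probably going to per-file overhead in the job rather than to raw disk reads.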
