Assuming the bottleneck is I/O, you could try saving your files to HDFS first. That will distribute the data across the cluster and allow for faster concurrent reads.
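As a rough sketch (the HDFS URI and partition count below are placeholders, not from your setup), once the files are copied into HDFS, the load could look like:

```java
// Sketch only: assumes the HTML files were first copied into HDFS,
// e.g.  hdfs dfs -put /local/html/dir /data/html
// Passing minPartitions directly avoids the extra shuffle that a
// separate repartition() call would add after the load.
JavaPairRDD<String, String> pages =
    jsc.wholeTextFiles("hdfs://namenode:8020/data/html", 16);
```

Note that millions of small files is a known weak spot for HDFS and for `wholeTextFiles`; if the load is still slow, packing the pages into a few large SequenceFiles (filename as key, content as value) and reading those instead is a common workaround.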
On Mon, Dec 12, 2016 at 3:06 PM, Reth RM <reth.ik...@gmail.com> wrote:
> Hi,
>
> I have millions of HTML files in a directory and am using the "wholeTextFiles"
> API to load them and process them further. Right now, testing it with 40k
> records, the loading step (wholeTextFiles) takes a minimum of 8-9 minutes.
> What are some recommended optimizations? Should I consider any of Spark's
> file stream APIs instead of "wholeTextFiles"?
>
> Other info:
> Running 1 master, 4 worker nodes, 4 allocated.
> Added repartition: jsc.wholeTextFiles(filesDirPath).repartition(4);