In Hadoop you should avoid having many small files. Pack them into a HAR (Hadoop Archive); a sketch of building and reading one follows below the thread.

> On 13 Dec 2016, at 05:42, Jakob Odersky <ja...@odersky.com> wrote:
>
> Assuming the bottleneck is IO, you could try saving your files to
> HDFS. This will distribute your data and allow for better concurrent
> reads.
>
>> On Mon, Dec 12, 2016 at 3:06 PM, Reth RM <reth.ik...@gmail.com> wrote:
>> Hi,
>>
>> I have millions of HTML files in a directory and am using the
>> "wholeTextFiles" API to load and process them. Testing with 40k
>> records, the loading step (wholeTextFiles) alone takes at least
>> 8-9 minutes. What optimizations are recommended? Should I consider
>> one of Spark's file stream APIs instead of "wholeTextFiles"?
>>
>> Other info:
>> Running 1 master and 4 worker nodes, all 4 allocated.
>> Added a repartition: jsc.wholeTextFiles(filesDirPath).repartition(4);
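A minimal sketch of the HAR approach, assuming the files already sit in HDFS under a placeholder path /user/me/html. The archive is built once with the hadoop archive tool; wholeTextFiles should then be able to read it through the har:// scheme, since it goes through Hadoop's FileSystem API (HarFileSystem resolves har:// paths). All names and paths here are illustrative, not from the thread:

    // Build the archive once from the shell (outside Spark):
    //   hadoop archive -archiveName pages.har -p /user/me/html /user/me/archives
    // This packs the many small files into a few large HDFS files plus an index.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadHar {
      public static void main(String[] args) {
        JavaSparkContext jsc =
            new JavaSparkContext(new SparkConf().setAppName("read-har"));

        // har:/// resolves against the default filesystem; the archive
        // behaves like a read-only directory of the original files.
        JavaPairRDD<String, String> pages =
            jsc.wholeTextFiles("har:///user/me/archives/pages.har");

        System.out.println("files: " + pages.count());
        jsc.stop();
      }
    }

Note that a HAR is immutable: adding files means rebuilding the archive, so it suits write-once corpora like a crawled HTML set.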
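On the original question: repartition(4) only shuffles the data after it has already been read, so it does nothing for the slow loading step itself. wholeTextFiles takes a second minPartitions argument that hints how many input splits to create, which spreads the read across the executors up front. A sketch, with a placeholder HDFS path and a partition count chosen purely for illustration (a common starting point is 2-3x the total core count):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LoadHtml {
      public static void main(String[] args) {
        JavaSparkContext jsc =
            new JavaSparkContext(new SparkConf().setAppName("load-html"));

        // Placeholder path; per the advice above, HDFS (or a HAR on HDFS)
        // handles concurrent reads far better than one local directory.
        String filesDirPath = "hdfs:///user/me/html";

        // minPartitions feeds into the input-split computation inside
        // wholeTextFiles, so the listing/reading work is parallelized
        // during the load instead of repartitioned afterwards.
        JavaPairRDD<String, String> docs =
            jsc.wholeTextFiles(filesDirPath, 32);

        System.out.println("loaded: " + docs.count());
        jsc.stop();
      }
    }

Each resulting pair is (file path, full file content), so the whole file must fit in memory at once; for millions of files it may also be worth comparing against binaryFiles or packing the corpus into sequence files or a HAR as suggested above.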