In Hadoop you should avoid keeping many small files; pack them into a HAR (Hadoop Archive) instead.
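For reference, a HAR is built with the `hadoop archive` tool. A minimal sketch, assuming the small files live under `/user/me/html` on HDFS and the archive should land in `/user/me/archives` (paths are illustrative, not from the thread):

```shell
# Pack everything under /user/me/html into pages.har.
# -p sets the parent path; the final argument is the output directory.
hadoop archive -archiveName pages.har -p /user/me html /user/me/archives

# The archived files are then readable through the har:// scheme, e.g.:
#   har:///user/me/archives/pages.har/html
hdfs dfs -ls har:///user/me/archives/pages.har/html
```

The archive is transparent to MapReduce and Spark input formats, so a `har://` URI can be passed wherever an HDFS path is expected.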
> On 13 Dec 2016, at 05:42, Jakob Odersky wrote:
>
> Assuming the bottleneck is IO, you could try saving your files to
> HDFS. This will distribute your data and allow for better concurrent
> reads.
>
On Mon, Dec 12, 2016 at 3:06 PM, Reth RM wrote:
> Hi,
>
> I have millions of html files in a directory, using the "wholeTextFiles" API