In Hadoop you should avoid keeping many small files; pack them into a HAR (Hadoop Archive) instead.
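For reference, a HAR is built with the `hadoop archive` tool. A minimal sketch, assuming the small files live under `/user/me/html` on HDFS and the archive should land in `/user/me/archives` (paths are illustrative, not from the thread):

```shell
# Pack everything under /user/me/html into pages.har.
# -p sets the parent path; the final argument is the output directory.
hadoop archive -archiveName pages.har -p /user/me html /user/me/archives

# The archived files are then readable through the har:// scheme, e.g.:
#   har:///user/me/archives/pages.har/html
hdfs dfs -ls har:///user/me/archives/pages.har/html
```

The archive is transparent to MapReduce and Spark input formats, so a `har://` URI can be passed wherever an HDFS path is expected.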
> On 13 Dec 2016, at 05:42, Jakob Odersky wrote:
>
> Assuming the bottleneck is IO, you could try saving your files to
> HDFS. This will distribute your data and allow for better concurrent
> reads.
>
On Mon, Dec 12, 2016 at 3:06 PM, Reth RM wrote:
> Hi,
>
> I have millions of html files in a directory, using the "wholeTextFiles" API