Hi,
I have millions of HTML files in a directory and am using the "wholeTextFiles"
API to load them for further processing. Right now I am testing with 40k
records, and just loading the files (wholeTextFiles) takes a minimum of 8-9
minutes. What are some recommended optimizations? Should I consider any
alternatives?
Assuming the bottleneck is IO, you could try saving your files to
HDFS. This will distribute your data and allow for better concurrent
reads.
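
As a minimal sketch (the HDFS URI, path, and minPartitions value below
are placeholders to adjust for your cluster; in spark-shell, `sc`
already exists):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("LoadHtmlFiles")
    val sc = new SparkContext(conf)

    // Each record is (filePath, fileContent); a larger minPartitions
    // hint lets more tasks read from HDFS concurrently.
    val pages = sc.wholeTextFiles("hdfs://namenode:8020/data/html",
                                  minPartitions = 512)
    println(pages.count())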
On Mon, Dec 12, 2016 at 3:06 PM, Reth RM wrote:
> Hi,
>
> I have millions of html files in a directory, using "wholeTextFiles" api to
> load them and process further.
In Hadoop you should avoid having many small files: the NameNode keeps
metadata for every file in memory, and each small file also becomes its own
tiny input split. Put them into a HAR (Hadoop Archive) instead.
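
A sketch, with hypothetical paths: create the archive once with

    hadoop archive -archiveName html.har -p /user/reth/html-input /user/reth/archives

and then read it from Spark through the har:// filesystem:

    // Each record is still (path, content); the HAR just collapses the
    // per-file NameNode metadata into a single archive.
    val pages = sc.wholeTextFiles("har:///user/reth/archives/html.har")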
> On 13 Dec 2016, at 05:42, Jakob Odersky wrote:
>
> Assuming the bottleneck is IO, you could try saving your files to
> HDFS. This will distribute your data and allow for better concurrent
> reads.