In Hadoop you should avoid having many small files: each file costs NameNode memory and a separate open/read, so millions of small HTML files are slow to load. Pack them into a HAR (Hadoop Archive) or a similar container format first.
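As a rough sketch, packing the directory into a HAR looks like this (the paths `/user/me/html` and `/user/me/archives` and the archive name are placeholders for your own layout; this needs a running Hadoop cluster):

```shell
# Create a Hadoop Archive from all files under /user/me/html,
# writing the resulting pages.har into /user/me/archives.
hadoop archive -archiveName pages.har -p /user/me/html /user/me/archives

# The archive is exposed as a filesystem via the har:// scheme, so
# Spark can read it like a directory, e.g. (Java, hypothetical paths):
#   jsc.wholeTextFiles("har:///user/me/archives/pages.har");
```

Note that `wholeTextFiles` also takes a `minPartitions` argument, which is usually a better way to control parallelism at load time than calling `repartition` afterwards, since `repartition` only runs after the slow read has already happened.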
> On 13 Dec 2016, at 05:42, Jakob Odersky wrote:
>
> Assuming the bottleneck is IO, you could try saving your files to
> HDFS. This will distribute your data and allow for better concurrent
> reads.
>
>> On Mon, Dec 12, 2016 at 3:06 PM, Reth RM wrote:
>> Hi,
>>
>> I have millions of html files in a directory, and I am using the "wholeTextFiles" api to
>> load them and process them further. Right now, testing with 40k records, just
>> loading the files (wholeTextFiles) takes a minimum of 8-9
>> minutes. What are some recommended optimizations? Should I consider any of
>> Spark's file stream apis instead of "wholeTextFiles"?
>>
>>
>> Other info:
>> Running 1 master and 4 worker nodes, all 4 allocated.
>> Added a repartition: jsc.wholeTextFiles(filesDirPath).repartition(4);
>>
>>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>