Assuming the bottleneck is I/O, you could try storing your files in
HDFS. This distributes the data across nodes and allows for better
concurrent reads.
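
A minimal sketch of what I mean, once the files are in HDFS. The HDFS
URL and the partition count here are placeholders for your setup; note
that passing minPartitions to wholeTextFiles directly avoids a separate
repartition(), which would shuffle all the data after it has already
been read:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LoadHtmlFromHdfs {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("load-html");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Reading from HDFS lets each executor pull its own splits
        // instead of every task contending on one local filesystem.
        // The second argument is a minPartitions hint, so the load is
        // parallelized at read time rather than repartitioned afterwards.
        // "hdfs://namenode:8020/html" and 16 are placeholder values.
        JavaPairRDD<String, String> files =
            jsc.wholeTextFiles("hdfs://namenode:8020/html", 16);

        System.out.println("files loaded: " + files.count());
        jsc.stop();
    }
}
```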

On Mon, Dec 12, 2016 at 3:06 PM, Reth RM <reth.ik...@gmail.com> wrote:
> Hi,
>
> I have millions of HTML files in a directory and am using the "wholeTextFiles"
> API to load them and process them further. Right now, testing with 40k records,
> loading the files (wholeTextFiles) alone takes a minimum of 8-9
> minutes. What are some recommended optimizations? Should I consider any of
> Spark's file stream APIs instead of "wholeTextFiles"?
>
>
> Other info
> Running 1 master and 4 worker nodes, all 4 allocated.
> Added a repartition: jsc.wholeTextFiles(filesDirPath).repartition(4);
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
