In Hadoop you should avoid having many small files. Pack them into a HAR (Hadoop Archive); a sketch of building and reading one follows below the thread.

> On 13 Dec 2016, at 05:42, Jakob Odersky <ja...@odersky.com> wrote:
>
> Assuming the bottleneck is IO, you could try saving your files to
> HDFS. This will distribute your data and allow for better concurrent
> reads.
>
>> On Mon, Dec 12, 2016 at 3:06 PM, Reth RM <reth.ik...@gmail.com> wrote:
>> Hi,
>>
>> I have millions of HTML files in a directory and am using the
>> "wholeTextFiles" API to load and process them. Testing with 40k
>> records, the loading step (wholeTextFiles) alone takes at least
>> 8-9 minutes. What optimizations are recommended? Should I consider
>> one of Spark's file stream APIs instead of "wholeTextFiles"?
>>
>> Other info:
>> Running 1 master and 4 worker nodes, all 4 allocated.
>> Added a repartition: jsc.wholeTextFiles(filesDirPath).repartition(4);
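A minimal sketch of the HAR approach, assuming the files already sit in HDFS under a placeholder path /user/me/html. The archive is built once with the hadoop archive tool; wholeTextFiles should then be able to read it through the har:// scheme, since it goes through Hadoop's FileSystem API (HarFileSystem resolves har:// paths). All names and paths here are illustrative, not from the thread:

    // Build the archive once from the shell (outside Spark):
    //   hadoop archive -archiveName pages.har -p /user/me/html /user/me/archives
    // This packs the many small files into a few large HDFS files plus an index.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadHar {
      public static void main(String[] args) {
        JavaSparkContext jsc =
            new JavaSparkContext(new SparkConf().setAppName("read-har"));

        // har:/// resolves against the default filesystem; the archive
        // behaves like a read-only directory of the original files.
        JavaPairRDD<String, String> pages =
            jsc.wholeTextFiles("har:///user/me/archives/pages.har");

        System.out.println("files: " + pages.count());
        jsc.stop();
      }
    }

Note that a HAR is immutable: adding files means rebuilding the archive, so it suits write-once corpora like a crawled HTML set.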
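On the original question: repartition(4) only shuffles the data after it has already been read, so it does nothing for the slow loading step itself. wholeTextFiles takes a second minPartitions argument that hints how many input splits to create, which spreads the read across the executors up front. A sketch, with a placeholder HDFS path and a partition count chosen purely for illustration (a common starting point is 2-3x the total core count):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LoadHtml {
      public static void main(String[] args) {
        JavaSparkContext jsc =
            new JavaSparkContext(new SparkConf().setAppName("load-html"));

        // Placeholder path; per the advice above, HDFS (or a HAR on HDFS)
        // handles concurrent reads far better than one local directory.
        String filesDirPath = "hdfs:///user/me/html";

        // minPartitions feeds into the input-split computation inside
        // wholeTextFiles, so the listing/reading work is parallelized
        // during the load instead of repartitioned afterwards.
        JavaPairRDD<String, String> docs =
            jsc.wholeTextFiles(filesDirPath, 32);

        System.out.println("loaded: " + docs.count());
        jsc.stop();
      }
    }

Each resulting pair is (file path, full file content), so the whole file must fit in memory at once; for millions of files it may also be worth comparing against binaryFiles or packing the corpus into sequence files or a HAR as suggested above.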