Assuming the bottleneck is I/O, you could try saving your files to HDFS first. That will distribute the data across the cluster and allow for faster concurrent reads.
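As a rough sketch (the HDFS URI and partition count below are placeholders, not from your setup), once the files are copied into HDFS, the load could look like:

```java
// Sketch only: assumes the HTML files were first copied into HDFS,
// e.g.  hdfs dfs -put /local/html/dir /data/html
// Passing minPartitions directly avoids the extra shuffle that a
// separate repartition() call would add after the load.
JavaPairRDD<String, String> pages =
    jsc.wholeTextFiles("hdfs://namenode:8020/data/html", 16);
```

Note that millions of small files is a known weak spot for HDFS and for `wholeTextFiles`; if the load is still slow, packing the pages into a few large SequenceFiles (filename as key, content as value) and reading those instead is a common workaround.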
On Mon, Dec 12, 2016 at 3:06 PM, Reth RM <reth.ik...@gmail.com> wrote:
> Hi,
>
> I have millions of HTML files in a directory and am using the "wholeTextFiles"
> API to load them and process them further. Right now, testing it with 40k
> records, the loading step (wholeTextFiles) takes a minimum of 8-9 minutes.
> What are some recommended optimizations? Should I consider any of Spark's
> file stream APIs instead of "wholeTextFiles"?
>
> Other info:
> Running 1 master, 4 worker nodes, 4 allocated.
> Added repartition: jsc.wholeTextFiles(filesDirPath).repartition(4);