Hi Junjie,

From my experience, HDFS is slow at reading a large number of small files, since every file requires its own round of metadata lookups against the namenode and data nodes. When the file size is below the HDFS default block size (usually 64MB or 128MB), you cannot take full advantage of Hadoop's optimizations for reading large amounts of data in a streaming fashion.
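For the small-files case, two common workarounds in the Spark 1.6 Scala API are wholeTextFiles, which packs many small files into each partition instead of creating one task per file, and coalescing after textFile. A minimal sketch -- the paths, app name, and partition counts below are illustrative assumptions, not from this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SmallFilesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-files-sketch"))

    // wholeTextFiles returns an RDD of (path, fileContent) pairs and groups
    // many small files into each partition, so you avoid one task per file.
    val files = sc.wholeTextFiles("hdfs:///data/small-files/*", minPartitions = 16)

    // Alternatively, read line-by-line and then coalesce to cut the number of
    // partitions (and tasks) when each file is far below the HDFS block size.
    val lines = sc.textFile("hdfs:///data/small-files/*").coalesce(16)

    println(s"files: ${files.count()}, lines: ${lines.count()}")
    sc.stop()
  }
}
```

A more durable fix is to compact the small files into fewer large ones (or a SequenceFile) before processing, so reads approach the HDFS block size.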
Also, when using DataFrames there is a large overhead from caching file metadata, as described in https://issues.apache.org/jira/browse/SPARK-11441

BR,
Arkadiusz Bicz
https://www.linkedin.com/in/arkadiuszbicz

On Thu, Feb 11, 2016 at 7:24 PM, Jakob Odersky <ja...@odersky.com> wrote:
> Hi Junjie,
>
> How do you access the files currently? Have you considered using HDFS? It's
> designed to be distributed across a cluster and Spark has built-in support.
>
> Best,
> --Jakob
>
> On Feb 11, 2016 9:33 AM, "Junjie Qian" <qian.jun...@outlook.com> wrote:
>>
>> Hi all,
>>
>> I am working with Spark 1.6 and Scala, and have a big dataset divided into
>> several small files.
>>
>> My question is: right now the read operation takes a really long time and
>> often produces RDD warnings. Is there a way I can read the files in
>> parallel, so that all nodes or workers read the files at the same time?
>>
>> Many thanks
>> Junjie