Re: How to parallel read files in a directory

2016-02-12 Thread Arkadiusz Bicz
Hi Junjie, From my experience HDFS is slow at reading a large number of small files, because every file requires an extra round of metadata lookups against the namenode and datanodes. When the file size is below the default HDFS block size (usually 64 MB or 128 MB), you cannot fully use Hadoop's optimizations for reading in a streamed way.
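A minimal Scala sketch of one common mitigation (not named in this message): SparkContext.wholeTextFiles packs many small files into each partition, amortizing the per-file namenode lookups. The path and partition count are hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}

    object SmallFilesRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("small-files"))
        // wholeTextFiles returns (path, content) pairs and groups many
        // small files into each partition, so one task handles many files.
        // "hdfs:///data/small" is a hypothetical directory of small files.
        val files = sc.wholeTextFiles("hdfs:///data/small", minPartitions = 16)
        val lineCount = files.values.flatMap(_.split("\n")).count()
        println(s"total lines: $lineCount")
        sc.stop()
      }
    }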

Re: How to parallel read files in a directory

2016-02-12 Thread Jörn Franke
Put many small files in a Hadoop Archive (HAR) to improve the performance of reading them. Alternatively, have a batch job concatenate them.
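For reference, a HAR is built outside Spark with the hadoop archive tool, e.g. hadoop archive -archiveName logs.har -p /data/small /data/archives, and then read back through the har:// filesystem. The batch-concatenation alternative can be a one-off Spark job; a minimal sketch with hypothetical paths:

    // Read all the small files once, then rewrite them as a few large
    // part files; later jobs read the consolidated copy instead.
    val small = sc.textFile("hdfs:///data/small/*")
    // coalesce(8) merges partitions without a shuffle, producing ~8 output files.
    small.coalesce(8).saveAsTextFile("hdfs:///data/consolidated")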

Re: How to parallel read files in a directory

2016-02-11 Thread Jakob Odersky
Hi Junjie, How do you access the files currently? Have you considered using HDFS? It's designed to be distributed across a cluster, and Spark has built-in support for it. Best, --Jakob
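A minimal sketch of the built-in HDFS support mentioned above; the namenode address and path are hypothetical placeholders:

    // Reading a dataset that already lives on HDFS.
    val rdd = sc.textFile("hdfs://namenode:8020/data/input")
    // Spark creates one partition per HDFS block and schedules tasks
    // near the datanodes holding each block, so reads run in parallel.
    println(rdd.count())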

How to parallel read files in a directory

2016-02-11 Thread Junjie Qian
Hi all, I am working with Spark 1.6 and Scala, and have a big dataset divided into several small files. My question is: right now the read operation takes a really long time and often produces RDD warnings. Is there a way I can read the files in parallel, so that all nodes or workers read the files at the same time?
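For readers skimming the thread: passing a directory (or glob) to sc.textFile already gives a parallel read, one task per split, and minPartitions can raise the parallelism further. A sketch, assuming an existing SparkContext sc and a hypothetical input directory:

    // Every worker reads its own splits of the directory concurrently;
    // minPartitions is a lower bound on the number of read tasks.
    val data = sc.textFile("hdfs:///data/input", minPartitions = sc.defaultParallelism * 2)
    println(data.count())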