Hi Junjie,
From my experience, HDFS is slow at reading large numbers of small files,
because every file read involves a round of metadata traffic with the
namenode and datanodes. When a file is smaller than the default HDFS block
size (usually 64MB or 128MB), you cannot take full advantage of Hadoop's
streaming-read optimizations.
Putting many small files into Hadoop Archives (HAR) can improve the
performance of reading them. Alternatively, run a batch job that
concatenates them into larger files.
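As a minimal sketch of the batch-concatenation idea in Spark 1.6 Scala
(assuming `sc` is an existing SparkContext, and the HDFS paths are
hypothetical placeholders):

```scala
// Read each small file as a (path, content) pair; Spark schedules the
// reads across the cluster in parallel. minPartitions is a tuning knob.
val smallFiles = sc.wholeTextFiles("hdfs:///data/small-files", minPartitions = 32)

// Drop the paths and compact the contents into a few large partitions,
// so downstream jobs pay namenode/datanode overhead far less often.
smallFiles.values
  .repartition(8)
  .saveAsTextFile("hdfs:///data/compacted")
```

`wholeTextFiles` is convenient here because, unlike `textFile`, it keeps
each file's contents together instead of splitting on lines, but it does
load each file fully into memory, so it is only suitable when the
individual files are small.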
> On 11 Feb 2016, at 18:33, Junjie Qian wrote:
>
> Hi all,
>
> I am working with Spark 1.6, scala and have a big dataset
Hi Junjie,
How do you access the files currently? Have you considered using HDFS? It's
designed to be distributed across a cluster, and Spark has built-in support
for it.
Best,
--Jakob
On Feb 11, 2016 9:33 AM, "Junjie Qian" wrote:
> Hi all,
>
> I am working with Spark 1.6,
Hi all,
I am working with Spark 1.6 and Scala, and have a big dataset divided into
several small files.
My question is: right now the read operation takes a really long time and
often produces RDD warnings. Is there a way I can read the files in
parallel, so that all nodes or workers read the file at the