Hi Junjie,

From my experience, HDFS is slow at reading a large number of small files, since every file requires its own round of metadata lookups against the namenode and data nodes. When the file size is below the HDFS default block size (usually 64MB or 128MB), you cannot take full advantage of Hadoop's optimizations for reading large amounts of data in a streaming fashion.
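For the small-files case, two common workarounds in the Spark 1.6 Scala API are wholeTextFiles, which packs many small files into each partition instead of creating one task per file, and coalescing after textFile. A minimal sketch -- the paths, app name, and partition counts below are illustrative assumptions, not from this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SmallFilesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-files-sketch"))

    // wholeTextFiles returns an RDD of (path, fileContent) pairs and groups
    // many small files into each partition, so you avoid one task per file.
    val files = sc.wholeTextFiles("hdfs:///data/small-files/*", minPartitions = 16)

    // Alternatively, read line-by-line and then coalesce to cut the number of
    // partitions (and tasks) when each file is far below the HDFS block size.
    val lines = sc.textFile("hdfs:///data/small-files/*").coalesce(16)

    println(s"files: ${files.count()}, lines: ${lines.count()}")
    sc.stop()
  }
}
```

A more durable fix is to compact the small files into fewer large ones (or a SequenceFile) before processing, so reads approach the HDFS block size.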
Also, when using DataFrames there is a large overhead from caching file metadata, as described in https://issues.apache.org/jira/browse/SPARK-11441

BR,
Arkadiusz Bicz
https://www.linkedin.com/in/arkadiuszbicz

On Thu, Feb 11, 2016 at 7:24 PM, Jakob Odersky <ja...@odersky.com> wrote:
> Hi Junjie,
>
> How do you access the files currently? Have you considered using HDFS? It's
> designed to be distributed across a cluster and Spark has built-in support.
>
> Best,
> --Jakob
>
> On Feb 11, 2016 9:33 AM, "Junjie Qian" <qian.jun...@outlook.com> wrote:
>>
>> Hi all,
>>
>> I am working with Spark 1.6 and Scala, and have a big dataset divided into
>> several small files.
>>
>> My question is: right now the read operation takes a really long time and
>> often produces RDD warnings. Is there a way I can read the files in
>> parallel, so that all nodes or workers read the files at the same time?
>>
>> Many thanks
>> Junjie