boxing/unboxing issue in Java?
Andy
From: Don Drake
Date: Monday, November 23, 2015 at 7:10 PM
To: Andrew Davidson
Cc: Xiao Li, Sabarish Sasidharan, "user @spark"
Subject: Re: newbie : why are thousands of empty files being created on HDFS?
> I'm seeing s
From: Sabarish Sasidharan
Date: Monday, November 23, 2015 at 7:57 PM
To: Andrew Davidson
Cc: Xiao Li, "user @spark"
Subject: Re: newbie : why are thousands of empty files being created on HDFS?
>
> Hi Andy
>
> You can try sc.wholeTextFiles() instead of sc.textFile()
>
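The difference behind Sabarish's suggestion can be sketched without a cluster: sc.textFile() produces one record per *line* (and typically at least one partition per file/block), while sc.wholeTextFiles() produces one (path, content) pair per *file*. The plain-Python sketch below only illustrates that record granularity; the directory layout and file contents are hypothetical.

```python
# Plain-Python illustration (no Spark needed) of the record granularity of
# sc.textFile() vs sc.wholeTextFiles(). The sample files are made up.
import os
import tempfile

d = tempfile.mkdtemp()
for name, body in [("a.json", '{"id": 1}\n{"id": 2}\n'),
                   ("b.json", '{"id": 3}\n')]:
    with open(os.path.join(d, name), "w") as f:
        f.write(body)

# sc.textFile(dir): one record per line; file boundaries are lost
text_file_records = [line
                     for name in sorted(os.listdir(d))
                     for line in open(os.path.join(d, name)).read().splitlines()]

# sc.wholeTextFiles(dir): one (path, content) record per file
whole_text_records = [(name, open(os.path.join(d, name)).read())
                      for name in sorted(os.listdir(d))]

print(len(text_file_records))   # 3 line records
print(len(whole_text_records))  # 2 file records
```

With many small input files, wholeTextFiles() also tends to create far fewer partitions than textFile(), which is why it can help here.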
> …coalesce() and that coalesce() is taking a lot of
> time. Any suggestions on how to choose the number of files?
>
> Kind regards
>
> Andy
>
>
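On choosing the number of output files: a common rule of thumb (my assumption here, not something stated in this thread) is to aim for output files near the HDFS block size, i.e. coalesce to roughly total_bytes / block_size partitions before writing. The helper below is hypothetical:

```python
# Hypothetical heuristic: one output file per ~HDFS block of data.
import math

def target_file_count(total_bytes, block_size=128 * 1024 * 1024):
    """Return a file count that keeps each output file near one HDFS block."""
    return max(1, math.ceil(total_bytes / block_size))

# A 1% sample of a small tweet data set is tiny, so one file is enough:
print(target_file_count(3 * 1024 * 1024))   # -> 1
# A 10 GB result would get ~80 files at a 128 MB block size:
print(target_file_count(10 * 1024 ** 3))    # -> 80
```

The result could then feed rdd.coalesce(n) before the save; coalesce(n) merges partitions without a full shuffle, unlike repartition(n), which is why it is the usual choice for shrinking output file counts.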
> From: Xiao Li
> Date: Monday, November 23, 2015 at 12:21 PM
> To: Andrew Davidson
> Cc: Sabarish Sasidharan, "user @spark"
> Subject: Re: newbie : why are thousands of empty files being created on HDFS?
> >…ed to my
> >Job. I would expect Spark to be smart enough not to create more
> >partitions than I have worker machines?
> >
> >Also, given I am not using any key/value operations like join() or doing
> >multiple scans, I would assume my app would not benefit from partitioning.
>
>
>Kind regards
>
>Andy
>
>
>From: Sabarish Sasidharan
>Date: Saturday, November 21, 2015 at 7:20 PM
>To: Andrew Davidson
>Cc: "user @spark"
>Subject: Re: newbie : why are thousands of empty files being created on HDFS?
Those are empty partitions. I don't see the number of partitions specified
in code. That implies the default parallelism config is being used and is
set to a very high number: the sum of empty + non-empty files.
Regards
Sab
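Sab's explanation can be reproduced without a cluster: when the partition count is far larger than the record count, spreading the records across partitions leaves most partitions empty, and each empty partition still becomes an (empty) output file on save. A toy round-robin simulation (the partition count below is an assumption standing in for a very high default parallelism):

```python
# Toy simulation of why a huge partition count yields thousands of empty
# output files. 100_000 is a made-up stand-in for a very high
# spark.default.parallelism; 4720 matches the record count from the thread.
from collections import defaultdict

num_partitions = 100_000
records = range(4720)

partitions = defaultdict(list)
for i in records:
    partitions[i % num_partitions].append(i)  # round-robin placement

non_empty = len(partitions)
empty = num_partitions - non_empty
print(non_empty, empty)  # 4720 non-empty files, 95280 empty files
```

Every partition, empty or not, is written as its own part-NNNNN file, so the file count equals the partition count, exactly as Sab describes.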
On 21-Nov-2015 11:59 pm, "Andy Davidson" wrote:
I started working on a very simple ETL pipeline for a POC. It reads in a
data set of tweets stored as JSON strings in HDFS, randomly selects 1% of
the observations, and writes them to HDFS. It seems to run very slowly.
E.g., writing 4720 observations takes 1:06:46.577795. I also noticed that R