Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-24 Thread Andy Davidson
boxing/unboxing issue in Java? Andy From: Don Drake Date: Monday, November 23, 2015 at 7:10 PM To: Andrew Davidson Cc: Xiao Li , Sabarish Sasidharan , "user @spark" Subject: Re: newbie : why are thousands of empty files being created on HDFS? > I'm seeing s

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-24 Thread Andy Davidson
: Sabarish Sasidharan Date: Monday, November 23, 2015 at 7:57 PM To: Andrew Davidson Cc: Xiao Li , "user @spark" Subject: Re: newbie : why are thousands of empty files being created on HDFS? > > Hi Andy > > You can try sc.wholeTextFiles() instead of sc.textFile() >

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Sabarish Sasidharan
esce() and that coalesce() is taking a lot of > time. Any suggestions on how to choose the final number of files? > > Kind regards > > Andy > > > From: Xiao Li > Date: Monday, November 23, 2015 at 12:21 PM > To: Andrew Davidson > Cc: Sabarish Sasidharan , "use

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Don Drake
ed to my > >Job. I would expect spark would be smart enough not create more > >partitions than I have worker machines? > > > >Also given I am not using any key/value operations like Join() or doing > >multiple scans I would assume my app would not benefit from partitio

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Andy Davidson
>multiple scans I would assume my app would not benefit from partitioning. > > >Kind regards > >Andy > > >From: Sabarish Sasidharan >Date: Saturday, November 21, 2015 at 7:20 PM >To: Andrew Davidson >Cc: "user @spark" >Subject: Re: newbie : why are t

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Xiao Li
> > From: Sabarish Sasidharan > Date: Saturday, November 21, 2015 at 7:20 PM > To: Andrew Davidson > Cc: "user @spark" > Subject: Re: newbie : why are thousands of empty files being created on > HDFS? > > Those are empty partitions. I don't see the numbe

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Andy Davidson
not benefit from partitioning. Kind regards Andy From: Sabarish Sasidharan Date: Saturday, November 21, 2015 at 7:20 PM To: Andrew Davidson Cc: "user @spark" Subject: Re: newbie : why are thousands of empty files being created on HDFS? > > Those are empty partitions.

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-21 Thread Sabarish Sasidharan
Those are empty partitions. I don't see the number of partitions specified in code. That then implies the default parallelism config is being used and is set to a very high number, the sum of empty + non empty files. Regards Sab On 21-Nov-2015 11:59 pm, "Andy Davidson" wrote: > I start working o
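
To illustrate the point about empty partitions, here is a minimal sketch in plain Python. It is a toy model, not Spark itself and not code from the thread: the partition count, records per partition, target of 8 output partitions, and the helper names `sample_partitions`, `count_empty`, and `coalesce` are all assumptions chosen for illustration. It shows how sampling 1% of the records from a large number of small partitions leaves many partitions with nothing in them (each of which would be written out as an empty part file), and how merging partitions before writing removes the empties.

```python
import random

# Hypothetical sizes, chosen only for illustration; the thread does not
# state the real partition count or records per partition.
NUM_PARTITIONS = 2000
RECORDS_PER_PARTITION = 50
SAMPLE_FRACTION = 0.01  # "randomly selects 1% of the observations"


def sample_partitions(partitions, fraction, seed=0):
    """Keep each record with probability `fraction`, partition by partition."""
    rng = random.Random(seed)
    return [[rec for rec in part if rng.random() < fraction]
            for part in partitions]


def count_empty(partitions):
    """Number of partitions holding no records (one empty file each)."""
    return sum(1 for part in partitions if not part)


def coalesce(partitions, num_partitions):
    """Toy stand-in for merging many partitions into `num_partitions`."""
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        merged[i % num_partitions].extend(part)
    return merged


partitions = [[(p, r) for r in range(RECORDS_PER_PARTITION)]
              for p in range(NUM_PARTITIONS)]

sampled = sample_partitions(partitions, SAMPLE_FRACTION)
empty_before = count_empty(sampled)

compacted = coalesce(sampled, 8)
empty_after = count_empty(compacted)

print("partitions after sampling:", len(sampled), "empty:", empty_before)
print("partitions after merging: ", len(compacted), "empty:", empty_after)
```

In this model the sampling step never changes the number of partitions, so the number of output files stays at the original partition count no matter how few records survive.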

newbie : why are thousands of empty files being created on HDFS?

2015-11-21 Thread Andy Davidson
I started working on a very simple ETL pipeline for a POC. It reads in a data set of tweets stored as JSON strings in HDFS, randomly selects 1% of the observations, and writes them to HDFS. It seems to run very slowly. E.g., to write 4720 observations takes 1:06:46.577795. I also noticed that R