Re: reducing number of output files

2015-01-23 Thread Sean Owen
Right, coalesce() does not necessarily shuffle. I believe it will not
if you are strictly reducing the number of partitions and do not force
a shuffle (the shuffle flag defaults to false). So I think the answer
is 'yes'.
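
To make the shuffle / no-shuffle distinction concrete, here is a minimal PySpark sketch. The helper name reduce_partitions and its force_shuffle argument are made up for illustration; coalesce(numPartitions, shuffle=False) and getNumPartitions() are the real RDD methods it relies on. It assumes you already have an RDD in hand.

```python
def reduce_partitions(rdd, target, force_shuffle=False):
    """Return `rdd` with at most `target` partitions (hypothetical helper)."""
    current = rdd.getNumPartitions()
    if target >= current and not force_shuffle:
        # Without a shuffle, coalesce() cannot increase the partition
        # count, so there is nothing to merge here.
        return rdd
    # shuffle=False (the default) merges existing partitions through a
    # narrow dependency, with no shuffle stage.
    # shuffle=True redistributes the records across the cluster.
    return rdd.coalesce(target, shuffle=force_shuffle)
```

For the cluster described in this thread, that would be something like reduce_partitions(rdd, 400). Whether every partition ends up on a node that already holds its data depends on where the input blocks live, so treat locality as best effort rather than guaranteed.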

If you have a huge number of small files, you can also consider
wholeTextFiles, which gives you the entire content of each file as a
single element of the RDD. It is not necessarily helpful, but thought
I'd mention it, as it could be of interest depending on what you do.
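
A rough PySpark sketch of that idea follows. The function name load_small_files and the input_dir argument are hypothetical; sc is assumed to be an existing SparkContext. wholeTextFiles() returns (path, content) pairs, one per file, and the sketch splits each file back into lines.

```python
def load_small_files(sc, input_dir):
    """Read every file under `input_dir` and return an RDD of lines.

    `sc` is assumed to be an existing SparkContext; `input_dir` is a
    placeholder for wherever the small files live.
    """
    # One (path, content) record per file.
    files = sc.wholeTextFiles(input_dir)
    # Drop the path and split each file's content into lines.
    return files.flatMap(lambda path_and_text: path_and_text[1].splitlines())
```

Each file is read whole into memory as a single record, so this only suits files that are individually small.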

On Fri, Jan 23, 2015 at 2:14 AM, Kane Kim kane.ist...@gmail.com wrote:
 Does it avoid reshuffling? I have 300 thousand output files. If I
 coalesce to the number of cores in the cluster, would it keep data
 local? (I have 100 nodes with 4 cores each. Does that mean that if I
 coalesce(400) it will use all cores and data will stay local?)

 On Thu, Jan 22, 2015 at 3:26 PM, Sean Owen so...@cloudera.com wrote:
 One output file is produced per partition. If you want fewer, use
 coalesce() before saving the RDD.

 On Thu, Jan 22, 2015 at 10:46 PM, Kane Kim kane.ist...@gmail.com wrote:
 How can I reduce the number of output files? Is there a parameter to
 saveAsTextFile?

 Thanks.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: reducing number of output files

2015-01-22 Thread Sean Owen
One output file is produced per partition. If you want fewer, use
coalesce() before saving the RDD.
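
As a minimal PySpark sketch of this (the helper name save_with_fewer_files, output_path and num_files are illustrative, not part of Spark; coalesce(), saveAsTextFile() and getNumPartitions() are the actual RDD methods):

```python
def save_with_fewer_files(rdd, output_path, num_files):
    """Write `rdd` as text using at most `num_files` part files.

    Hypothetical helper: `output_path` must not exist yet, since
    saveAsTextFile() refuses to overwrite an existing directory.
    """
    # coalesce() only merges partitions here (no shuffle requested), so
    # the result has min(num_files, current partition count) partitions.
    merged = rdd.coalesce(num_files)
    # One part-NNNNN file is written per partition.
    merged.saveAsTextFile(output_path)
    return merged.getNumPartitions()
```

The return value is the number of partitions actually written, which can be lower than num_files if the RDD had fewer partitions to begin with.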

On Thu, Jan 22, 2015 at 10:46 PM, Kane Kim kane.ist...@gmail.com wrote:
 How can I reduce the number of output files? Is there a parameter to
 saveAsTextFile?

 Thanks.






Re: reducing number of output files

2015-01-22 Thread DEVAN M.S.
rdd.coalesce(1) will coalesce the RDD into a single partition and give
only one output file. coalesce(2) will give 2, and so on.
On Jan 23, 2015 4:58 AM, Sean Owen so...@cloudera.com wrote:

 One output file is produced per partition. If you want fewer, use
 coalesce() before saving the RDD.

 On Thu, Jan 22, 2015 at 10:46 PM, Kane Kim kane.ist...@gmail.com wrote:
  How can I reduce the number of output files? Is there a parameter to
  saveAsTextFile?
 
  Thanks.
 
 
