Re: RDD coalesce or repartition by #records or #bytes?

2015-05-05 Thread Du Li
Hi, Spark experts: In my code, which generates RDDs in streamed batches, I did rdd.coalesce(numPartitions).saveAsSequenceFile(dir). It generates numPartitions files as expected, with names dir/part-x. However, the first couple of files (e.g., part-0, part-1) have many times of
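The skew Du Li observes is consistent with how coalesce (without a shuffle) merges contiguous parent partitions, so any imbalance in the parents lands directly in the output files. A minimal sketch of that effect, using a simplified contiguous-grouping rule (Spark's actual algorithm is locality-aware; the function name and sizes here are illustrative assumptions):

```python
# Simplified model of rdd.coalesce(numPartitions) with shuffle=False:
# parent partitions are merged in contiguous groups, so skew in the
# parents shows up directly in the output part-files.

def coalesce_sizes(parent_sizes, num_partitions):
    """Sum record counts of contiguous parent-partition groups."""
    n = len(parent_sizes)
    out = [0] * num_partitions
    for p, size in enumerate(parent_sizes):
        # Assign parent p to output group floor(p * m / n), a simplification
        # of Spark's grouping of consecutive parent partitions.
        out[p * num_partitions // n] += size
    return out

# Skewed parents produce skewed part-files, with the first ones largest:
print(coalesce_sizes([900, 800, 50, 40, 30, 20], 3))  # [1700, 90, 50]
```

Under this model, repartition (which forces a shuffle) rather than coalesce would be the way to get evenly sized output files.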

Re: RDD coalesce or repartition by #records or #bytes?

2015-03-04 Thread Zhan Zhang
It uses HashPartitioner to distribute the records to different partitions, but the key is just an integer assigned evenly across the output partitions. From the code, each resulting partition will get a very similar number of records. Thanks. Zhan Zhang On Mar 4, 2015, at 3:47 PM, Du Li
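The mechanism Zhan Zhang describes can be sketched as follows: each input partition tags its records with keys that cycle through 0..numPartitions-1, and HashPartitioner then routes key k to partition k mod numPartitions. This is a simplified model (Spark picks a random starting offset per partition; a deterministic start is assumed here, and the function name is illustrative):

```python
# Simplified model of how rdd.repartition(n) spreads records:
# round-robin integer keys, then HashPartitioner (k % n).

def repartition_counts(input_sizes, num_partitions):
    """Return record counts per output partition after round-robin keying."""
    counts = [0] * num_partitions
    for p, size in enumerate(input_sizes):
        start = p % num_partitions  # Spark uses a random start; fixed here
        for i in range(size):
            key = (start + i) % num_partitions
            counts[key % num_partitions] += 1  # HashPartitioner routing
    return counts

# Even with heavily skewed inputs, the outputs come out near-uniform:
print(repartition_counts([1000, 10, 500], 4))  # [377, 378, 378, 377]
```

Each source partition contributes at most one extra record to any output partition, which is why the resulting partitions end up with very similar record counts.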

RDD coalesce or repartition by #records or #bytes?

2015-03-04 Thread Du Li
Hi, My RDDs are created from a Kafka stream. After receiving an RDD, I want to coalesce/repartition it so that the data will be processed on a set of machines in parallel, as evenly as possible. The number of processing nodes is larger than the number of receiving nodes. My question is how the