Hi, Spark experts:
In my code, which generates RDDs in streamed batches, I call
rdd.coalesce(numPartitions).saveAsSequenceFile(dir). It generates numPartitions
files as expected, named dir/part-x. However, the first couple of files (e.g.,
part-0, part-1) have many times of
Spark uses a HashPartitioner to distribute the records to different partitions,
and since the keys are just integers, they spread evenly across the output
partitions.
From the code, each resulting partition should get a very similar number of
records.
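If it helps to see that hashing logic concretely, here is a minimal stand-alone sketch (plain Python, not Spark; the partition count and key range are made up for illustration) of how a HashPartitioner-style assignment spreads sequential integer keys:

```python
from collections import Counter

def hash_partition(key: int, num_partitions: int) -> int:
    # Mimics HashPartitioner: non-negative hash modulo partition count.
    # For non-negative integers the hash code is the value itself.
    return key % num_partitions

num_partitions = 8           # hypothetical numPartitions
keys = range(1000)           # sequential integer keys, as in the shuffle

counts = Counter(hash_partition(k, num_partitions) for k in keys)
# Sequential integers land round-robin, so every partition gets 125 keys.
print(sorted(counts.items()))
```

With sequential integer keys the modulo walks the partitions in order, which is why the resulting record counts come out nearly identical.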
Thanks.
Zhan Zhang
On Mar 4, 2015, at 3:47 PM, Du Li
Hi,
My RDDs are created from a Kafka stream. After receiving an RDD, I want to
coalesce/repartition it so that the data is processed in parallel on a set of
machines as evenly as possible. The number of processing nodes is larger than
the number of receiving nodes.
My question is how the
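(The quoted message is cut off in the archive.) For the scenario it describes, fewer receiving partitions than processing nodes, note that coalesce without a shuffle can only merge partitions; to increase the partition count you need repartition, which is coalesce(n, shuffle = true). Here is a minimal sketch (plain Python, not Spark; the partition counts are made up) of the round-robin redistribution that the shuffle performs:

```python
def redistribute(input_partitions, num_output):
    # Simulates repartition's shuffle: deal each input partition's
    # records round-robin across the output partitions.
    out = [[] for _ in range(num_output)]
    for part in input_partitions:
        for i, record in enumerate(part):
            out[i % num_output].append(record)
    return out

# Two receiver partitions (Kafka receivers), eight workers -- hypothetical sizes.
receivers = [list(range(0, 400)), list(range(400, 800))]
workers = redistribute(receivers, 8)
print([len(p) for p in workers])   # 100 records per output partition
```

Real Spark starts the round-robin at a random position within each input partition to avoid every partition piling onto the same outputs; that detail is omitted here for simplicity.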