Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Fei Hu
Hi Pradeep, That is a good idea. My customized RDDs are similar to the NewHadoopRDD. If we have billions of InputSplit, will it be bottlenecked for the performance? That is, will too many data need to be transferred from master node to computing nodes by networking? Thanks, Fei On Mon, Jan 16,

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Jasbir, Yes, you are right. Do you have any idea about my question? Thanks, Fei On Mon, Jan 16, 2017 at 12:37 AM, wrote: > Hi, > > > > Coalesce is used to decrease the number of partitions. If you give the > value of numPartitions greater than the current

RE: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread jasbir.sing
Hi, Coalesce is used to decrease the number of partitions. If you give the value of numPartitions greater than the current partition, I don’t think RDD number of partitions will be increased. Thanks, Jasbir From: Fei Hu [mailto:hufe...@gmail.com] Sent: Sunday, January 15, 2017 10:10 PM To:

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Anastasios, Thanks for your information. I will look into the CoalescedRDD code. Thanks, Fei On Sun, Jan 15, 2017 at 12:21 PM, Anastasios Zouzias wrote: > Hi Fei, > > I looked at the code of CoalescedRDD and probably what I suggested will > not work. > > Speaking of

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Anastasios Zouzias
Hi Fei, I looked at the code of CoalescedRDD and probably what I suggested will not work. Speaking of which, CoalescedRDD is private[spark]. If this was not the case, you could set balanceSlack to 1, and get what you requested, see

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Anastasios, Thanks for your reply. If I just increase the numPartitions to be twice larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps the data locality? Do I need to define my own Partitioner? Thanks, Fei On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Rishi, Thanks for your reply! The RDD has 24 partitions, and the cluster has a master node + 24 computing nodes (12 cores per node). Each node will have a partition, and I want to split each partition to two sub-partitions on the same node to improve the parallelism and achieve high data

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Anastasios Zouzias
Hi Fei, How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ? https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395 coalesce is mostly used for reducing the number of partitions before writing to HDFS, but it might still be a

Re: Equally split a RDD partition into two partition at the same node

2017-01-14 Thread Rishi Yadav
Can you provide some more details: 1. How many partitions does RDD have 2. How big is the cluster On Sat, Jan 14, 2017 at 3:59 PM Fei Hu wrote: > Dear all, > > I want to equally divide a RDD partition into two partitions. That means, > the first half of elements in the

Equally split a RDD partition into two partition at the same node

2017-01-14 Thread Fei Hu
Dear all, I want to equally divide a RDD partition into two partitions. That means, the first half of elements in the partition will create a new partition, and the second half of elements in the partition will generate another new partition. But the two new partitions are required to be at the