Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Pradeep Gollakota
This kind of thing can usually be done at a lower level, in the InputFormat, by specifying the max split size. Have you looked into that possibility with your InputFormat? On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu wrote: > Hi Jasbir, > > Yes, you are right. Do you have
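
For file-based formats, the effect of a max split size can be modeled as below. This is a plain-Python sketch (no Hadoop involved) of the rule Hadoop's FileInputFormat applies, splitSize = max(minSize, min(maxSize, blockSize)). The function name and the sample sizes are illustrative assumptions, and a custom InputFormat may compute its splits differently.

```python
def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Model of the split-size rule: clamp the block size between min and max."""
    return max(min_size, min(max_size, block_size))


MB = 1024 * 1024

# With no effective max split size, one split covers one 128 MB block.
default_split = compute_split_size(block_size=128 * MB, min_size=1, max_size=2**63 - 1)

# Capping the max split size at half a block yields splits half as large,
# so each block is covered by two splits instead of one.
halved_split = compute_split_size(block_size=128 * MB, min_size=1, max_size=64 * MB)

print(default_split // MB, halved_split // MB)
```

Because both splits of a block refer to the same block, they can report the same hosts as their locations, which is what makes this approach attractive for the locality requirement in this thread.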

Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Fei Hu
Hi Pradeep, That is a good idea. My customized RDDs are similar to the NewHadoopRDD. If we have billions of InputSplits, will that become a performance bottleneck? That is, will too much data need to be transferred over the network from the master node to the computing nodes? Thanks, Fei On Mon, Jan 16,

Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Fei Hu
Hi Liang-Chi, Yes, the logical split is needed in compute(). The preferred locations can be derived from the customized Partition class. Thanks for your help! Cheers, Fei On Mon, Jan 16, 2017 at 3:00 AM, Liang-Chi Hsieh wrote: > > Hi Fei, > > I think it should work. But you

Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Liang-Chi Hsieh
Hi Fei, I think it should work. But you may need to add some logic in compute() to decide which half of the parent partition to output. And you need to return the correct preferred locations for the partitions sharing the same parent partition. Fei Hu wrote > Hi Liang-Chi, > > Yes, you
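
The two points in this reply (picking a half in compute(), and reusing the parent's preferred locations) can be sketched as a plain-Python model. This is not Spark code: partitions are modeled as lists, and the names parent_of, compute_child and preferred_locations are hypothetical. It assumes child partition i is derived from parent partition i // 2, with even children taking the first half.

```python
from typing import List, Sequence, Tuple


def parent_of(child_index: int) -> Tuple[int, bool]:
    """Map a child partition index to (parent index, is_first_half)."""
    return child_index // 2, child_index % 2 == 0


def compute_child(parent_partitions: Sequence[Sequence[int]], child_index: int) -> List[int]:
    """Model of compute(): read the parent partition and keep only one half."""
    parent_index, first_half = parent_of(child_index)
    elements = list(parent_partitions[parent_index])  # materializes the whole parent
    mid = len(elements) // 2
    return elements[:mid] if first_half else elements[mid:]


def preferred_locations(parent_locations: Sequence[List[str]], child_index: int) -> List[str]:
    """Both children of a parent report the parent's locations, so they stay on its node."""
    parent_index, _ = parent_of(child_index)
    return parent_locations[parent_index]


parents = [[1, 2, 3, 4], [5, 6, 7]]
locations = [["node-a"], ["node-b"]]
children = [compute_child(parents, i) for i in range(2 * len(parents))]
child_locations = [preferred_locations(locations, i) for i in range(2 * len(parents))]
print(children, child_locations)
```

One cost is visible in the model: each child reads its entire parent partition to find the midpoint, so without caching the parent is computed twice.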

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Jasbir, Yes, you are right. Do you have any idea about my question? Thanks, Fei On Mon, Jan 16, 2017 at 12:37 AM, wrote: > Hi, > > > > Coalesce is used to decrease the number of partitions. If you give the > value of numPartitions greater than the current

RE: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread jasbir.sing
Hi, Coalesce is used to decrease the number of partitions. If you give a value of numPartitions greater than the current number of partitions, I don’t think the RDD's number of partitions will increase. Thanks, Jasbir From: Fei Hu [mailto:hufe...@gmail.com] Sent: Sunday, January 15, 2017 10:10 PM To:

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Liang-Chi, Yes, you are right. I implemented the following solution for this problem, and it works, but I am not sure whether it is efficient: I double the partitions of the parent RDD, and then use the new partitions and the parent RDD to construct the target RDD. In the compute() function of the

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Liang-Chi Hsieh
Hi, When calling `coalesce` with `shuffle = false`, it is going to produce at most min(numPartitions, previous RDD's number of partitions) partitions. So I think it can't be used to double the number of partitions. Anastasios Zouzias wrote > Hi Fei, > > Have you tried coalesce(numPartitions: Int,
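
The bound described in this reply can be written out as a small plain-Python model. It is not Spark code: the function name is hypothetical, and the shuffle branch reflects the documented behavior that `coalesce` with `shuffle = true` can increase the partition count, at the price of redistributing the data across nodes.

```python
def max_partitions_after_coalesce(requested: int, current: int, shuffle: bool = False) -> int:
    """Upper bound on the partition count after coalesce (model only)."""
    if shuffle:
        # A shuffle can produce any requested count, but gives up data locality.
        return requested
    # Without a shuffle, partitions can only be merged, never split.
    return min(requested, current)


# The case from this thread: 24 partitions, 48 requested, no shuffle.
print(max_partitions_after_coalesce(requested=48, current=24))
```

For the 24-partition RDD in this thread, asking for 48 partitions without a shuffle therefore still leaves at most 24.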

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Anastasios, Thanks for your information. I will look into the CoalescedRDD code. Thanks, Fei On Sun, Jan 15, 2017 at 12:21 PM, Anastasios Zouzias wrote: > Hi Fei, > > I looked at the code of CoalescedRDD and probably what I suggested will > not work. > > Speaking of

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Anastasios Zouzias
Hi Fei, I looked at the code of CoalescedRDD and probably what I suggested will not work. Speaking of which, CoalescedRDD is private[spark]. If this was not the case, you could set balanceSlack to 1, and get what you requested, see

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Anastasios, Thanks for your reply. If I just double numPartitions, how does coalesce(numPartitions: Int, shuffle: Boolean = false) preserve data locality? Do I need to define my own Partitioner? Thanks, Fei On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Rishi, Thanks for your reply! The RDD has 24 partitions, and the cluster has a master node + 24 computing nodes (12 cores per node). Each node will have one partition, and I want to split each partition into two sub-partitions on the same node to improve the parallelism and achieve high data

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Anastasios Zouzias
Hi Fei, Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)? https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395 coalesce is mostly used for reducing the number of partitions before writing to HDFS, but it might still be a

Equally split a RDD partition into two partition at the same node

2017-01-14 Thread Fei Hu
Dear all, I want to equally divide an RDD partition into two partitions. That means the first half of the elements in the partition will form one new partition, and the second half of the elements will form another new partition. But the two new partitions are required to be at the