Usually this kind of thing can be done at a lower level in the InputFormat,
typically by specifying the max split size. Have you looked into that
possibility with your InputFormat?
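As a sketch of the suggestion above (assuming a FileInputFormat-based source; a custom InputFormat may read a different property), the max split size can be capped in the Hadoop configuration before the RDD is created, so each file yields more, smaller splits:

```scala
// Illustrative configuration fragment, not code from this thread.
// Assumes an existing SparkContext `sc`; the property below is the
// standard FileInputFormat key, a custom InputFormat may use another.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (64 * 1024 * 1024).toString) // 64 MB cap: a 128 MB file gives two splits
```

Because the splits are produced by the InputFormat itself, locality comes for free from the block locations, with no custom RDD code.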
On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu wrote:
> Hi Jasbir,
>
> Yes, you are right. Do you have
Hi Pradeep,
That is a good idea. My customized RDDs are similar to NewHadoopRDD. If
we have billions of InputSplits, will that become a performance
bottleneck? That is, will too much data need to be transferred from the
master node to the computing nodes over the network?
Thanks,
Fei
On Mon, Jan 16,
Hi Liang-Chi,
Yes, the logic split is needed in compute(). The preferred locations can be
derived from the customized Partition class.
Thanks for your help!
Cheers,
Fei
On Mon, Jan 16, 2017 at 3:00 AM, Liang-Chi Hsieh wrote:
>
> Hi Fei,
>
> I think it should work. But you
Hi Fei,
I think it should work. But you may need to add some logic in compute() to
decide which half of the parent partition to output. And you need to get
the correct preferred locations for the partitions sharing the same parent
partition.
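The "which half to output" logic can be sketched in plain Scala (the function name and signature here are illustrative, not from the thread). Since an Iterator's length is not known up front, the parent iterator has to be materialized once to find the midpoint:

```scala
// Hypothetical helper for compute(): return the first or second half of
// a parent partition's iterator. Materializes the parent once, because
// an Iterator does not expose its size.
def computeHalf[T](parentIter: Iterator[T], firstHalf: Boolean): Iterator[T] = {
  val buf = parentIter.toArray      // one pass to find the midpoint
  val mid = buf.length / 2
  if (firstHalf) buf.iterator.take(mid) else buf.iterator.drop(mid)
}
```

For an odd-sized partition this puts the extra element in the second half; swapping `take`/`drop` boundaries changes that choice.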
Fei Hu wrote
> Hi Liang-Chi,
>
> Yes, you
Hi Jasbir,
Yes, you are right. Do you have any idea about my question?
Thanks,
Fei
On Mon, Jan 16, 2017 at 12:37 AM, wrote:
> Hi,
>
>
>
> Coalesce is used to decrease the number of partitions. If you give the
> value of numPartitions greater than the current
Hi,
Coalesce is used to decrease the number of partitions. If you give a value of
numPartitions greater than the current number of partitions, I don’t think the
RDD’s number of partitions will be increased.
Thanks,
Jasbir
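A minimal illustration of this behavior (assuming a live SparkContext `sc`; this is a sketch, not a standalone runnable program). With `shuffle = false`, `coalesce` cannot grow the partition count; passing `shuffle = true` (what `repartition` does) can, at the cost of a shuffle that discards locality:

```scala
// Sketch only: requires a running SparkContext `sc`.
val rdd = sc.parallelize(1 to 100, 4)
rdd.coalesce(8).getNumPartitions                  // still 4: no growth without a shuffle
rdd.coalesce(8, shuffle = true).getNumPartitions  // 8, but the data is shuffled
```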
From: Fei Hu [mailto:hufe...@gmail.com]
Sent: Sunday, January 15, 2017 10:10 PM
To:
Hi Liang-Chi,
Yes, you are right. I implemented the following solution for this problem,
and it works, but I am not sure if it is efficient:
I doubled the number of partitions of the parent RDD, and then used the new
partitions and the parent RDD to construct the target RDD. In the compute()
function of the
Hi,
When calling `coalesce` with `shuffle = false`, it produces at most
min(numPartitions, the previous RDD's number of partitions) partitions. So I
think it can't be used to double the number of partitions.
Anastasios Zouzias wrote
> Hi Fei,
>
> Have you tried coalesce(numPartitions: Int,
Hi Anastasios,
Thanks for your information. I will look into the CoalescedRDD code.
Thanks,
Fei
On Sun, Jan 15, 2017 at 12:21 PM, Anastasios Zouzias
wrote:
> Hi Fei,
>
> I looked at the code of CoalescedRDD and probably what I suggested will
> not work.
>
> Speaking of
Hi Fei,
I looked at the code of CoalescedRDD, and probably what I suggested will not
work.
Speaking of which, CoalescedRDD is private[spark]. If this were not the
case, you could set balanceSlack to 1 and get what you requested, see
Hi Anastasios,
Thanks for your reply. If I just increase numPartitions to twice the
current number, how does coalesce(numPartitions: Int, shuffle: Boolean = false)
keep the data locality? Do I need to define my own Partitioner?
Thanks,
Fei
On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias
Hi Rishi,
Thanks for your reply! The RDD has 24 partitions, and the cluster has a
master node + 24 computing nodes (12 cores per node). Each node will have a
partition, and I want to split each partition into two sub-partitions on the
same node to improve the parallelism and achieve high data
Hi Fei,
Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
coalesce is mostly used for reducing the number of partitions before
writing to HDFS, but it might still be a
Dear all,
I want to equally divide an RDD partition into two partitions. That means
the first half of the elements in the partition will create a new partition,
and the second half of the elements will generate another new partition.
But the two new partitions are required to be at the
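Pulling the thread's suggestions together, here is a hedged sketch of a custom RDD that yields two co-located halves per parent partition. All names (`SplitRDD`, `HalfPartition`) are hypothetical illustrations, not code from the thread; the narrow dependency maps child partition i to parent i / 2, and locality is preserved by delegating preferred locations to the parent:

```scala
// Illustrative sketch only; assumes Spark's developer API (RDD subclassing).
import org.apache.spark.{NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Child partition i covers half of parent partition i / 2.
case class HalfPartition(index: Int, parent: Partition, firstHalf: Boolean)
  extends Partition

class SplitRDD[T: ClassTag](parent: RDD[T])
  extends RDD[T](parent.sparkContext, Seq(new NarrowDependency[T](parent) {
    override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
  })) {

  override def getPartitions: Array[Partition] =
    parent.partitions.flatMap { p =>
      Seq(HalfPartition(2 * p.index, p, firstHalf = true),
          HalfPartition(2 * p.index + 1, p, firstHalf = false))
    }

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val hp = split.asInstanceOf[HalfPartition]
    // Materialize the parent partition once to find the midpoint.
    val buf = parent.iterator(hp.parent, context).toArray
    val mid = buf.length / 2
    if (hp.firstHalf) buf.iterator.take(mid) else buf.iterator.drop(mid)
  }

  // Both halves inherit the parent partition's locality, so they are
  // scheduled on the same node as the parent data.
  override def getPreferredLocations(split: Partition): Seq[String] =
    parent.preferredLocations(split.asInstanceOf[HalfPartition].parent)
}
```

Note the trade-off raised earlier in the thread: compute() buffers a whole parent partition in memory to find the midpoint, which is fine for modest partitions but worth reconsidering for very large ones.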