Hi Pradeep,
That is a good idea. My customized RDDs are similar to the NewHadoopRDD. If
we have billions of InputSplit, will it be bottlenecked for the
performance? That is, will too many data need to be transferred from master
node to computing nodes by networking?
Thanks,
Fei
On Mon, Jan 16,
Hi Jasbir,
Yes, you are right. Do you have any idea about my question?
Thanks,
Fei
On Mon, Jan 16, 2017 at 12:37 AM, wrote:
> Hi,
>
>
>
> Coalesce is used to decrease the number of partitions. If you give the
> value of numPartitions greater than the current
Hi,
Coalesce is used to decrease the number of partitions. If you give the value of
numPartitions greater than the current partition, I don’t think RDD number of
partitions will be increased.
Thanks,
Jasbir
From: Fei Hu [mailto:hufe...@gmail.com]
Sent: Sunday, January 15, 2017 10:10 PM
To:
Hi Anastasios,
Thanks for your information. I will look into the CoalescedRDD code.
Thanks,
Fei
On Sun, Jan 15, 2017 at 12:21 PM, Anastasios Zouzias
wrote:
> Hi Fei,
>
> I looked at the code of CoalescedRDD and probably what I suggested will
> not work.
>
> Speaking of
Hi Fei,
I looked at the code of CoalescedRDD and probably what I suggested will not
work.
Speaking of which, CoalescedRDD is private[spark]. If this was not the
case, you could set balanceSlack to 1, and get what you requested, see
Hi Anastasios,
Thanks for your reply. If I just increase the numPartitions to be twice
larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps
the data locality? Do I need to define my own Partitioner?
Thanks,
Fei
On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias
Hi Rishi,
Thanks for your reply! The RDD has 24 partitions, and the cluster has a
master node + 24 computing nodes (12 cores per node). Each node will have a
partition, and I want to split each partition to two sub-partitions on the
same node to improve the parallelism and achieve high data
Hi Fei,
How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
coalesce is mostly used for reducing the number of partitions before
writing to HDFS, but it might still be a
Can you provide some more details:
1. How many partitions does RDD have
2. How big is the cluster
On Sat, Jan 14, 2017 at 3:59 PM Fei Hu wrote:
> Dear all,
>
> I want to equally divide a RDD partition into two partitions. That means,
> the first half of elements in the
Dear all,
I want to equally divide a RDD partition into two partitions. That means,
the first half of elements in the partition will create a new partition,
and the second half of elements in the partition will generate another new
partition. But the two new partitions are required to be at the
10 matches
Mail list logo