Usually this kind of thing can be done at a lower level in the InputFormat usually by specifying the max split size. Have you looked into that possibility with your InputFormat?
On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu <hufe...@gmail.com> wrote: > Hi Jasbir, > > Yes, you are right. Do you have any idea about my question? > > Thanks, > Fei > > On Mon, Jan 16, 2017 at 12:37 AM, <jasbir.s...@accenture.com> wrote: > >> Hi, >> >> >> >> Coalesce is used to decrease the number of partitions. If you give the >> value of numPartitions greater than the current partition, I don’t think >> RDD number of partitions will be increased. >> >> >> >> Thanks, >> >> Jasbir >> >> >> >> *From:* Fei Hu [mailto:hufe...@gmail.com] >> *Sent:* Sunday, January 15, 2017 10:10 PM >> *To:* zouz...@cs.toronto.edu >> *Cc:* user @spark <user@spark.apache.org>; d...@spark.apache.org >> *Subject:* Re: Equally split a RDD partition into two partition at the >> same node >> >> >> >> Hi Anastasios, >> >> >> >> Thanks for your reply. If I just increase the numPartitions to be twice >> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps >> the data locality? Do I need to define my own Partitioner? >> >> >> >> Thanks, >> >> Fei >> >> >> >> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zouz...@gmail.com> >> wrote: >> >> Hi Fei, >> >> >> >> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ? >> >> >> >> https://github.com/apache/spark/blob/branch-1.6/core/src/ >> main/scala/org/apache/spark/rdd/RDD.scala#L395 >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_spark_blob_branch-2D1.6_core_src_main_scala_org_apache_spark_rdd_RDD.scala-23L395&d=DgMFaQ&c=eIGjsITfXP_y-DLLX0uEHXJvU8nOHrUK8IrwNKOtkVU&r=7scIIjM0jY9x3fjvY6a_yERLxMA2NwA8l0DnuyrL6yA&m=bFMBTBwSwMOFRd7Or6fF0sQOH87UIhmuUqEO9UkxPIY&s=qNa3MyvKhIDlXHtxm3s0DZJRZaSWIHpaNhcS86GEQow&e=> >> >> >> >> coalesce is mostly used for reducing the number of partitions before >> writing to HDFS, but it might still be a narrow dependency (satisfying your >> requirements) if you increase the # of partitions. >> >> >> >> Best, >> >> Anastasios >> >> >> >> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hufe...@gmail.com> wrote: >> >> Dear all, >> >> >> >> I want to equally divide a RDD partition into two partitions. That means, >> the first half of elements in the partition will create a new partition, >> and the second half of elements in the partition will generate another new >> partition. But the two new partitions are required to be at the same node >> with their parent partition, which can help get high data locality. >> >> >> >> Is there anyone who knows how to implement it or any hints for it? >> >> >> >> Thanks in advance, >> >> Fei >> >> >> >> >> >> >> >> -- >> >> -- Anastasios Zouzias >> >> >> >> ------------------------------ >> >> This message is for the designated recipient only and may contain >> privileged, proprietary, or otherwise confidential information. If you have >> received it in error, please notify the sender immediately and delete the >> original. Any other use of the e-mail by you is prohibited. Where allowed >> by local law, electronic communications with Accenture and its affiliates, >> including e-mail and instant messaging (including content), may be scanned >> by our systems for the purposes of information security and assessment of >> internal compliance with Accenture policy. >> ____________________________________________________________ >> __________________________ >> >> www.accenture.com >> > >