Thanks very much for Yong's help. Sorry, one more question: must different partitions be on different nodes? That is, in cluster mode, would each node hold only one partition?
On Wednesday, December 9, 2015 6:41 AM, "Young, Matthew T" <matthew.t.yo...@intel.com> wrote:

Shuffling large amounts of data over the network is expensive, yes. The cost is lower if you are just using a single node, where no networking is involved in the repartition (using Spark as a multithreading engine). In general you need to do performance testing to see whether a repartition is worth the shuffle time. A common model is to repartition the data once after ingest to achieve parallelism, and to avoid shuffles wherever possible afterwards.
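To see why repartition implies a shuffle, here is a toy sketch in plain Python (not Spark's actual implementation) of what a hash repartition does logically: every record is reassigned to a target partition by hashing its key, so in a cluster most records must cross partition, and usually node, boundaries over the network. The `repartition` helper below is purely illustrative.

```python
def repartition(records, num_partitions):
    """Toy hash repartition: route each (key, value) record to the
    partition chosen by hash(key) % num_partitions. In a real cluster
    this routing step is the network shuffle."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# Eight records spread across three partitions.
records = [(k, k * 10) for k in range(8)]
parts = repartition(records, 3)

# Every record lands in exactly one partition; the total is preserved.
assert sum(len(p) for p in parts) == len(records)
```

The point of the "repartition once after ingest" model above is that this record movement happens a single time, and later stages can then run in parallel over the already-placed partitions without paying the network cost again.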
From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
Sent: Tuesday, December 08, 2015 5:05 AM
To: User <user@spark.apache.org>
Subject: is repartition very cost

Hi All,

I need to optimize an objective function with some linear constraints using a genetic algorithm, and I would like to get as much parallelism as possible out of Spark. Repartition / shuffle may be used in it sometimes; however, is the repartition API very costly?

Thanks in advance!
Zhiliang