Thanks very much for Yong's help.
Sorry, one more issue: must different partitions be on different nodes? That is, in cluster mode, would each node hold only one partition ...
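
For context, here is roughly what I am checking from spark-shell (the input path and the multiplier are just examples, not my real data):

// Quick check from spark-shell; path and multiplier are made up for illustration.
val rdd = sc.textFile("hdfs:///data/input")
println(s"partitions: ${rdd.partitions.length}")
println(s"default parallelism: ${sc.defaultParallelism}")

// Requesting more partitions than executor cores is allowed; each partition is
// simply a unit of work that gets scheduled as a task.
val more = rdd.repartition(4 * sc.defaultParallelism)
println(s"after repartition: ${more.partitions.length}")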


    On Wednesday, December 9, 2015 6:41 AM, "Young, Matthew T" 
<matthew.t.yo...@intel.com> wrote:
 

Shuffling large amounts of data over the network is expensive, yes. The cost is lower if you are just using a single node, where no networking needs to be involved to do the repartition (using Spark as a multithreading engine).

In general you need to do performance testing to see if a repartition is worth the shuffle time.

A common model is to repartition the data once after ingest to achieve parallelism and avoid shuffles whenever possible later.
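
A minimal sketch of that pattern (the path, the partition count of 64, and the downstream step are assumptions for illustration, not something from this thread): repartition once right after ingest, cache the result, and let the later stages reuse the balanced partitions.

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionOnce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-once"))

    // Ingest: a handful of large input files can leave the RDD with few partitions.
    val raw = sc.textFile("hdfs:///data/input")        // hypothetical path

    // Pay the shuffle once, right after ingest, to reach the desired parallelism.
    val partitioned = raw.repartition(64).cache()      // 64 is an assumed target

    // Later work runs on the balanced, cached partitions with no further shuffle.
    val nonEmpty = partitioned.filter(_.nonEmpty).count()
    println(s"non-empty lines: $nonEmpty")

    sc.stop()
  }
}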
From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
Sent: Tuesday, December 08, 2015 5:05 AM
To: User <user@spark.apache.org>
Subject: is repartition very cost

Hi All,

I need to optimize an objective function with some linear constraints by a genetic algorithm, and I would like to get as much parallelism for it as possible with Spark.

Repartition / shuffle may be used in it at times; however, is the repartition API very costly?

Thanks in advance!
Zhiliang
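
As a rough illustration of where the parallelism would enter such a job (the individual encoding, the fitness function, and the partition count below are hypothetical), fitness evaluation can be a plain map over a well-partitioned population RDD, so no shuffle is needed after the initial distribution:

import org.apache.spark.{SparkConf, SparkContext}

object GaFitnessSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ga-fitness"))

    // Hypothetical population: 10,000 individuals of 20 decision variables each.
    val population: Seq[Array[Double]] =
      Seq.fill(10000)(Array.fill(20)(scala.util.Random.nextDouble()))

    // Hypothetical fitness: objective value to minimize (constraints omitted).
    val fitness = (ind: Array[Double]) => ind.map(x => x * x).sum

    // Choosing numSlices at creation time sets the parallelism up front;
    // calling repartition(n) later would cost one extra shuffle.
    val popRdd = sc.parallelize(population, numSlices = sc.defaultParallelism * 4)

    // Fitness evaluation is a narrow map, so it involves no shuffle at all.
    val best = popRdd.map(ind => (fitness(ind), ind))
                     .reduce((a, b) => if (a._1 <= b._1) a else b)

    println(s"best fitness: ${best._1}")
    sc.stop()
  }
}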

  
