RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-24 Thread Yong Zhang
in JIRA SPARK-13383
Thanks
Yong
From: java8...@hotmail.com
To: dav...@databricks.com
CC: user@spark.apache.org
Subject: RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step
Date: Wed, 23 Mar 2016 20:30:42 -0400
Sounds good. I will manually merge this patch on 1.6.1

RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Yong Zhang
Sounds good. I will manually merge this patch on 1.6.1, test it again for my case tomorrow in my environment, and update later.
Thanks
Yong
> Date: Wed, 23 Mar 2016 16:20:23 -0700
> Subject: Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Ted Yu
> [date_time#25L,visid_low#461L,visid_high#460L,account_id#976]
> > +- Project [soid_e1#30 AS account_id#976,visid_high#460L,visid_low#461L,date_time#25L,ip#127]
> > +- Filter (instr(event_list#105,202) > 0)
> > +-

Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Davies Liu
>> > "broadcast" join supposed to do, correct?
>> > In the last stage, it is very slow until it reaches and processes all the history data, shown below as "shuffle read" reaching 720G, before it finishes.

RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Yong Zhang
do anything with "broadcast" problem.
Thanks
Yong
> Date: Wed, 23 Mar 2016 10:14:19 -0700
> Subject: Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step
> From: dav...@databricks.com
> To: java8...@hotmail.com
> CC: user@spark.apache.org

Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Davies Liu
> tly same input. It is an interesting point; does anyone have any idea about this?
>
> Overall, for my test case, a "broadcast" join is exactly the most optimized approach I should use; but somehow, I cannot make it behave the same way as the "mapjoin" of Hive
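
For readers following the thread, here is a rough sketch of the idea behind a broadcast (map-side) join, and of why one would not expect a large "shuffle read" from it. It is plain Python, not Spark's API or its actual implementation. The function name, the sample rows and the "name" column are made up for illustration, and the other column names only echo the plan fragment quoted earlier.

```python
# Conceptual sketch only: plain Python, not Spark's API or implementation.
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Join every partition of the large side against one shared hash table."""
    # Build the hash table once from the small side; in a cluster this
    # table is what would be shipped ("broadcast") to every task.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition is joined where it already lives, so no rows of
        # the large side are moved between partitions (no shuffle).
        joined = []
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical data: two partitions of a large "history" table and a
# small "accounts" table.
history = [
    [{"account_id": 1, "visid_low": 10}, {"account_id": 2, "visid_low": 11}],
    [{"account_id": 1, "visid_low": 12}, {"account_id": 3, "visid_low": 13}],
]
accounts = [{"account_id": 1, "name": "a"}, {"account_id": 2, "name": "b"}]

result = broadcast_hash_join(history, accounts, "account_id")
```

The output keeps the same number of partitions as the large input, and only the small table is copied. That is the behaviour the original poster expected from both the Spark "broadcast" join and the Hive "mapjoin".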