Liang-Chi Hsieh wrote:

I created a draft pull request explaining the cases: https://github.com/apache/spark/pull/18652
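For intuition, here is a sketch of the kind of case in play (an illustration written for this digest, not taken from the PR, and assuming an integer col3 for simplicity). In a join condition, rand() would conceptually be evaluated once per candidate row pair, whereas after being pushed into a projection it is evaluated once per input row, so the set of matching rows can differ:

    -- Rejected by the analyzer in Spark 2.2.0: conceptually, each candidate
    -- (a, b) pair would see a fresh draw of rand(), so one row of `a` could
    -- be compared against different random values for different rows of `b`.
    SELECT a.col1
    FROM tbl1 a
    LEFT OUTER JOIN tbl2 b
      ON cast(rand() * 1000 as int) = b.col3;

    -- After pushing rand() below the join, each row of `a` carries a single
    -- frozen draw that every row of `b` is compared against, which generally
    -- yields a different set of matches.
    SELECT a.col1
    FROM (SELECT col1, cast(rand() * 1000 as int) AS salt FROM tbl1) a
    LEFT OUTER JOIN tbl2 b
      ON a.salt = b.col3;

The query in this thread avoids the divergence because its rand() branch fires only for NULL keys, and the generated salt is designed never to equal a real key, so those rows are unmatched under either evaluation order.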
Chang Chen wrote:

Hi All,

I don't understand the difference in semantics. I found that Spark does the same thing for non-deterministic GroupBy expressions; from a MapReduce point of view, a Join is essentially a GroupBy.

@Liang-Chi Hsieh: in which situations would the semantics change?

Thanks,
Chang

On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh wrote:

Thinking about it more, I think it changes the semantics only under certain scenarios. For the example SQL query shown in the previous discussion, the semantics look the same.

Xiao Li wrote:

If the join condition is non-deterministic, pushing it down to the underlying Project will change the semantics. Thus, we are unable to do it in PullOutNondeterministic. Users can do it manually if they do not care about the semantics difference.

Thanks,
Xiao

2017-07-16 20:07 GMT-07:00, Chang Chen wrote:

It is tedious, since we have lots of Hive SQL being migrated to Spark, and this workaround is equivalent to inserting a Project between the Join operator and its child. Why not do it in PullOutNondeterministic?

Thanks,
Chang

On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh wrote:

A possible workaround is to add the rand column to tbl1 with a projection before the join:

    SELECT a.col1
    FROM (
      SELECT col1,
             CASE
               WHEN col2 IS NULL
                 THEN cast(rand(9)*1000 - 9999999999 as string)
               ELSE col2
             END AS col2
      FROM tbl1) a
    LEFT OUTER JOIN tbl2 b
      ON a.col2 = b.col3;
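A hedged note on why this rewrite is safe here: the CASE branch only replaces NULL keys, and rand(9) lies in [0, 1), so the generated salt falls in [-9999999999, -9999998999), a range that should never collide with a genuine b.col3. NULL-keyed rows therefore still come out NULL-padded from the left outer join; they are just spread over many distinct keys instead of all hashing to one. A conservative collision check (assuming the default non-ANSI cast, where a non-numeric string casts to NULL and drops out of the comparison):

    -- Counts real keys that even parse into the salt range; string-equality
    -- collisions with a generated salt are a strict subset of these.
    SELECT count(*) AS potential_collisions
    FROM tbl2
    WHERE cast(col3 AS double) >= -9999999999.0
      AND cast(col3 AS double) <  -9999998999.0;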
Chang Chen wrote:

Hi Wenchen,

Yes, we also found that this error is caused by rand(). However, this is a classic way to handle data skew in Hive. Is there an equivalent way in Spark?

Thanks,
Chang

On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan wrote:

It's not about CASE WHEN, but about rand(): non-deterministic expressions are not allowed in join conditions.

On 13 Jul 2017, at 6:43 PM, wangshuang wrote:

I'm trying to execute Hive SQL on Spark SQL (also on the Spark Thrift Server). To mitigate data skew, we use CASE WHEN to handle NULL join keys. A simple example:

    SELECT a.col1
    FROM tbl1 a
    LEFT OUTER JOIN tbl2 b
      ON CASE
           WHEN a.col2 IS NULL
             THEN cast(rand(9)*1000 - 9999999999 as string)
           ELSE a.col2
         END = b.col3;

But I get this error:

    == Physical Plan ==
    org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in
    Project, Filter, Aggregate or Window, found:
    (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt` END = c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND (c.`cur_flag` = 1))
    in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
    ;;
    GlobalLimit 10
    +- LocalLimit 10
       +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN (cast(19596 as string),cast(20134 as string),cast(10997 as string)) && nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END], [date_id#7]
          +- Filter (date_id#7 = 2017-07-12)
             +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
                :- SubqueryAlias a
                :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
                :     +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7, chanl_id#8L, pltfm_id#9, city_id#10, sessn_id#11, gu_id#12, nav_refer_page_type_id#13, nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16, nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19, nav_page_value#20, nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25, nav_tcd#26, nav_tci#27, nav_tce#28, detl_refer_page_type_id#29, detl_refer_page_value#30, ... 33 more fields]
                +- SubqueryAlias c
                   +- SubqueryAlias dim_site_categ_ext
                      +- CatalogRelation `dw`.`dim_site_categ_ext`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [site_categ_skid#64L, site_categ_type#65, site_categ_code#66, site_categ_name#67, site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L, sort_seq#71L, site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L, updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L, site_categ_id#80, site_categ_parnt_id#81]

Does Spark SQL not support CASE WHEN syntax in a JOIN condition? Additionally, my Spark version is 2.2.0.
Any help would be greatly appreciated.
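For completeness, a sketch of how Liang-Chi's projection workaround maps onto the failing query from the plan above (only the join is shown; the original select list and aggregation are elided, and nav_tcdt_salted is a column name invented here for illustration):

    SELECT a.date_id  -- ...rest of the original select list and GROUP BY elided...
    FROM (
      SELECT t.*,
             CASE
               WHEN t.nav_tcdt IS NULL
                 THEN cast(rand(9)*1000 - 9999999999 as string)
               ELSE t.nav_tcdt
             END AS nav_tcdt_salted
      FROM tmp.tmp_lifan_trfc_tpa_hive t) a
    LEFT OUTER JOIN dw.dim_site_categ_ext c
      ON a.nav_tcdt_salted = c.site_categ_id
     AND cast(a.nav_tcd as int) = 9
     AND c.cur_flag = 1
    WHERE a.date_id = '2017-07-12';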