IIUC, the evaluation order of rows in a Join can differ between physical operators, e.g., sort-based vs. hash-based joins. But for non-deterministic expressions, a different evaluation order changes the results.
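To see this concretely, here is a minimal spark-shell sketch (toy data, not from the thread) showing that rand() with a fixed seed hands out values by evaluation position rather than by row identity:

    // Same data, same seed, different row order.
    import org.apache.spark.sql.functions.rand
    import spark.implicits._

    val df = Seq("a", "b", "c").toDF("col1").coalesce(1)

    // rand(9) draws from a stream seeded per partition: the i-th row
    // evaluated receives the i-th draw, so a different row order (or a
    // different partitioning) pairs the draws with different rows.
    df.withColumn("r", rand(9)).show()
    df.orderBy($"col1".desc).withColumn("r", rand(9)).show()

The two show() calls print different values of r for the same rows, which is exactly what happens when two physical join implementations feed rows to a non-deterministic join condition in different orders.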
Chang Chen wrote:
> I see the issue. I will try https://github.com/apache/spark/pull/18652. I think:
>
> 1. For the Join operator, the left and right plans can't be non-deterministic.
> 2. If Filter can support non-deterministic expressions, why not the join condition?
> 3. We can't push down or project a non-deterministic expression, since it may change semantics.
>
> Actually, the real problem is #2. If the join condition could be non-deterministic, then we wouldn't need to insert a Project.
>
> Thanks
> Chang

On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博 wrote:
> FYI, there has been a related discussion here:
> https://github.com/apache/spark/pull/15417#discussion_r85295977

2017-07-17 15:44 GMT+08:00, Chang Chen wrote:
> Hi All
>
> I don't understand the difference between the semantics. I found that Spark does the same thing for a non-deterministic GroupBy. From a Map-Reduce point of view, Join is essentially also a GroupBy.
>
> @Liang-Chi Hsieh, in which situation will the semantics change?
>
> Thanks
> Chang

On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh wrote:
> Thinking about it more, I think it changes the semantics only under certain scenarios.
>
> For the example SQL query shown in the previous discussion, it looks like the same semantics.

Xiao Li wrote:
> If the join condition is non-deterministic, pushing it down to the underlying Project will change the semantics. Thus, we are unable to do it in PullOutNondeterministic. Users can do it manually if they do not care about the semantics difference.
>
> Thanks,
>
> Xiao

2017-07-16 20:07 GMT-07:00, Chang Chen wrote:
> It is tedious, since we have lots of Hive SQL being migrated to Spark, and this workaround is equivalent to inserting a Project between the Join operator and its child.
>
> Why not do it in PullOutNondeterministic?
>
> Thanks
> Chang

On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh wrote:
> A possible workaround is to add the rand column into tbl1 with a projection before the join:
>
>     SELECT a.col1
>     FROM (
>       SELECT col1,
>              CASE
>                WHEN col2 IS NULL
>                  THEN cast(rand(9)*1000 - 9999999999 as string)
>                ELSE col2
>              END AS col2
>       FROM tbl1) a
>     LEFT OUTER JOIN tbl2 b
>       ON a.col2 = b.col3;

Chang Chen wrote:
> Hi Wenchen
>
> Yes, we also found that this error is caused by rand(). However, this is the classic way to solve data skew in Hive. Is there an equivalent way in Spark?
>
> Thanks
> Chang

On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan wrote:
> It's not about CASE WHEN, but about rand(). Non-deterministic expressions are not allowed in join conditions.
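Wenchen's point can be verified from a spark-shell session; Catalyst's Expression.deterministic is the flag the analyzer consults (internal API, shown here purely for illustration):

    import org.apache.spark.sql.functions.{col, rand, when}

    // rand() is flagged non-deterministic in Catalyst, which is why the
    // analyzer rejects it inside a Join condition.
    rand(9).expr.deterministic      // false
    col("col2").expr.deterministic  // true

    // The whole CASE WHEN inherits the flag from its rand() branch:
    when(col("col2").isNull, rand(9)).otherwise(col("col2")).expr.deterministic  // false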
On 13 Jul 2017, at 6:43 PM, wangshuang wrote:
> I'm trying to execute Hive SQL on Spark SQL (also on the Spark Thrift Server). To mitigate data skew, we use CASE WHEN to handle NULLs. A simple example:
>
>     SELECT a.col1
>     FROM tbl1 a
>     LEFT OUTER JOIN tbl2 b
>       ON CASE
>            WHEN a.col2 IS NULL
>              THEN cast(rand(9)*1000 - 9999999999 as string)
>            ELSE a.col2
>          END = b.col3;
>
> But I get this error:
>
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window, found:
> (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt` END = c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND (c.`cur_flag` = 1))
> in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
> ;;
> GlobalLimit 10
> +- LocalLimit 10
>    +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN (cast(19596 as string),cast(20134 as string),cast(10997 as string)) && nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END], [date_id#7]
>       +- Filter (date_id#7 = 2017-07-12)
>          +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
>             :- SubqueryAlias a
>             :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>             :     +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7, chanl_id#8L, pltfm_id#9, city_id#10, sessn_id#11, gu_id#12, nav_refer_page_type_id#13, nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16, nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19, nav_page_value#20, nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25, nav_tcd#26, nav_tci#27, nav_tce#28, detl_refer_page_type_id#29, detl_refer_page_value#30, ... 33 more fields]
>             +- SubqueryAlias c
>                +- SubqueryAlias dim_site_categ_ext
>                   +- CatalogRelation `dw`.`dim_site_categ_ext`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [site_categ_skid#64L, site_categ_type#65, site_categ_code#66, site_categ_name#67, site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L, sort_seq#71L, site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L, updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L, site_categ_id#80, site_categ_parnt_id#81]
>
> Does Spark SQL not support CASE WHEN syntax in a JOIN? Additionally, my Spark version is 2.2.0.
> Any help would be greatly appreciated.

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
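For DataFrame users, Liang-Chi's salting workaround can be written with an explicit projection before the join. A minimal spark-shell sketch (placeholder table and column names matching the SQL above, and a string-prefix salt in place of the thread's negative-number trick, assuming real keys never start with "salt_"):

    import org.apache.spark.sql.functions.{coalesce, col, concat, lit, rand}
    import spark.implicits._

    val tbl1 = Seq((1, Option.empty[String]), (2, Some("k"))).toDF("col1", "col2")
    val tbl2 = Seq(("k", "v")).toDF("col3", "colX")

    // withColumn materializes the salted key in a Project below the join,
    // so the join condition itself stays deterministic and passes
    // analysis; NULL keys receive a random salt that matches nothing.
    val salted = tbl1.withColumn(
      "col2",
      coalesce(col("col2"), concat(lit("salt_"), (rand(9) * 1000).cast("string"))))

    salted.join(tbl2, salted("col2") === tbl2("col3"), "left_outer")
      .select(salted("col1"))
      .show()

This has the same shape as the SQL workaround: the non-deterministic expression lives in a Project, one of the operators where the analyzer allows it, and the Join above it sees only a deterministic equality.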