I see. Actually, it isn't about evaluation order, which the user can't control anyway. It's about how many times we evaluate the non-deterministic expression for the same row.
For example, given this SQL:

SELECT a.col1
FROM tbl1 a
LEFT OUTER JOIN tbl2 b
  ON CASE WHEN a.col2 IS NULL THEN cast(rand(9)*1000 - 9999999999 as string) ELSE a.col2 END
   = CASE WHEN b.col3 IS NULL THEN cast(rand(9)*1000 - 9999999999 as string) ELSE b.col3 END;

I think that if we evaluate the join key exactly once for each row of a and b across the whole pipeline, the computation is correct even though the result is non-deterministic. (A rough DataFrame sketch of what I mean is appended below the quoted thread.)

Thanks
Chang

On Mon, Jul 17, 2017 at 10:49 PM, Liang-Chi Hsieh <vii...@gmail.com> wrote:

>
> IIUC, the evaluation order of rows in Join can be different in different
> physical operators, e.g., Sort-based and Hash-based.
>
> But for non-deterministic expressions, different evaluation orders change
> the results.
>
>
> Chang Chen wrote
> > I see the issue. I will try https://github.com/apache/spark/pull/18652,
> > I think:
> >
> > 1. For the Join operator, the left and right plans can't be non-deterministic.
> > 2. If Filter can support non-deterministic expressions, why not the join condition?
> > 3. We can't push down or project a non-deterministic expression, since it may
> >    change the semantics.
> >
> > Actually, the real problem is #2. If the join condition could be
> > non-deterministic, then we wouldn't need to insert a Project.
> >
> > Thanks
> > Chang
> >
> >
> > On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博 <jiangxb1987@> wrote:
> >
> >> FYI there has been a related discussion here:
> >> https://github.com/apache/spark/pull/15417#discussion_r85295977
> >>
> >> 2017-07-17 15:44 GMT+08:00 Chang Chen <baibaichen@>:
> >>
> >>> Hi All
> >>>
> >>> I don't understand the difference in semantics. I found that Spark does
> >>> the same thing for a non-deterministic GroupBy; from a Map-Reduce point of
> >>> view, Join is essentially a GroupBy.
> >>>
> >>> @Liang-Chi Hsieh
> >>> <https://plus.google.com/u/0/103179362592085650735?prsrc=4>
> >>>
> >>> In which situation would the semantics be changed?
> >>>
> >>> Thanks
> >>> Chang
> >>>
> >>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh <viirya@> wrote:
> >>>
> >>>>
> >>>> Thinking about it more, I think it changes the semantics only under
> >>>> certain scenarios.
> >>>>
> >>>> For the example SQL query shown in the previous discussion, the
> >>>> semantics look the same.
> >>>>
> >>>>
> >>>> Xiao Li wrote
> >>>> > If the join condition is non-deterministic, pushing it down to the
> >>>> > underlying project will change the semantics. Thus, we are unable to
> >>>> > do it in PullOutNondeterministic. Users can do it manually if they do
> >>>> > not care about the semantics difference.
> >>>> >
> >>>> > Thanks,
> >>>> >
> >>>> > Xiao
> >>>> >
> >>>> >
> >>>> > 2017-07-16 20:07 GMT-07:00 Chang Chen <baibaichen@>:
> >>>> >
> >>>> >> It is tedious, since we have lots of Hive SQL being migrated to Spark,
> >>>> >> and this workaround is equivalent to inserting a Project between the
> >>>> >> Join operator and its child.
> >>>> >>
> >>>> >> Why not do it in PullOutNondeterministic?
> >>>> >>
> >>>> >> Thanks
> >>>> >> Chang
> >>>> >>
> >>>> >>
> >>>> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh <viirya@> wrote:
> >>>> >>
> >>>> >>>
> >>>> >>> A possible workaround is to add the rand column into tbl1 with a
> >>>> >>> projection before the join.
> >>>> >>>
> >>>> >>> SELECT a.col1
> >>>> >>> FROM (
> >>>> >>>   SELECT col1,
> >>>> >>>          CASE
> >>>> >>>            WHEN col2 IS NULL
> >>>> >>>              THEN cast(rand(9)*1000 - 9999999999 as string)
> >>>> >>>            ELSE
> >>>> >>>              col2
> >>>> >>>          END AS col2
> >>>> >>>   FROM tbl1) a
> >>>> >>> LEFT OUTER JOIN tbl2 b
> >>>> >>> ON a.col2 = b.col3;
> >>>> >>>
> >>>> >>>
> >>>> >>> Chang Chen wrote
> >>>> >>> > Hi Wenchen
> >>>> >>> >
> >>>> >>> > Yes, we also found that this error is caused by Rand. However, this is
> >>>> >>> > the classic way to solve data skew in Hive. Is there an equivalent way
> >>>> >>> > in Spark?
> >>>> >>> >
> >>>> >>> > Thanks
> >>>> >>> > Chang
> >>>> >>> >
> >>>> >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan <cloud0fan@> wrote:
> >>>> >>> >
> >>>> >>> >> It’s not about case when, but about rand(). Non-deterministic
> >>>> >>> >> expressions are not allowed in join conditions.
> >>>> >>> >>
> >>>> >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang <cn_wss@> wrote:
> >>>> >>> >> >
> >>>> >>> >> > I'm trying to execute Hive SQL on Spark SQL (also on the Spark
> >>>> >>> >> > Thriftserver). To mitigate data skew, we use "case when" to handle
> >>>> >>> >> > NULLs. A simple SQL example follows:
> >>>> >>> >> >
> >>>> >>> >> >
> >>>> >>> >> > SELECT a.col1
> >>>> >>> >> > FROM tbl1 a
> >>>> >>> >> > LEFT OUTER JOIN tbl2 b
> >>>> >>> >> > ON
> >>>> >>> >> > *    CASE
> >>>> >>> >> >        WHEN a.col2 IS NULL
> >>>> >>> >> >          THEN cast(rand(9)*1000 - 9999999999 as string)
> >>>> >>> >> >        ELSE
> >>>> >>> >> >          a.col2 END *
> >>>> >>> >> > = b.col3;
> >>>> >>> >> >
> >>>> >>> >> >
> >>>> >>> >> > But I get this error:
> >>>> >>> >> >
> >>>> >>> >> > == Physical Plan ==
> >>>> >>> >> > *org.apache.spark.sql.AnalysisException: nondeterministic expressions are
> >>>> >>> >> > only allowed in
> >>>> >>> >> > Project, Filter, Aggregate or Window, found:*
> >>>> >>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS
> >>>> >>> >> > DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt` END
> >>>> >>> >> > = c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
> >>>> >>> >> > (c.`cur_flag` = 1))
> >>>> >>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
> >>>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as
> >>>> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
> >>>> >>> >> > int) = 9)) && (cur_flag#77 = 1))
> >>>> >>> >> > ;;
> >>>> >>> >> > GlobalLimit 10
> >>>> >>> >> > +- LocalLimit 10
> >>>> >>> >> >    +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
> >>>> >>> >> > (cast(19596 as string),cast(20134 as string),cast(10997 as string)) &&
> >>>> >>> >> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END],
> >>>> >>> >> > [date_id#7]
> >>>> >>> >> >       +- Filter (date_id#7 = 2017-07-12)
> >>>> >>> >> >          +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
> >>>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as
> >>>> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
> >>>> >>> >> > int) = 9)) && (cur_flag#77 = 1))
> >>>> >>> >> >             :- SubqueryAlias a
> >>>> >>> >> >             :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
> >>>> >>> >> >             :     +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`,
> >>>> >>> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
> >>>> >>> >> > chanl_id#8L, pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
> >>>> >>> >> > nav_refer_page_type_id#13, nav_refer_page_value#14, nav_refer_tpa#15,
> >>>> >>> >> > nav_refer_tpa_id#16, nav_refer_tpc#17, nav_refer_tpi#18,
> >>>> >>> >> > nav_page_type_id#19, nav_page_value#20, nav_tpa_id#21, nav_tpa#22,
> >>>> >>> >> > nav_tpc#23, nav_tpi#24, nav_tcdt#25, nav_tcd#26, nav_tci#27, nav_tce#28,
> >>>> >>> >> > detl_refer_page_type_id#29, detl_refer_page_value#30, ... 33 more fields]
> >>>> >>> >> >             +- SubqueryAlias c
> >>>> >>> >> >                +- SubqueryAlias dim_site_categ_ext
> >>>> >>> >> >                   +- CatalogRelation `dw`.`dim_site_categ_ext`,
> >>>> >>> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [site_categ_skid#64L,
> >>>> >>> >> > site_categ_type#65, site_categ_code#66, site_categ_name#67,
> >>>> >>> >> > site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L, sort_seq#71L,
> >>>> >>> >> > site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L,
> >>>> >>> >> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L,
> >>>> >>> >> > site_categ_id#80, site_categ_parnt_id#81]
> >>>> >>> >> >
> >>>> >>> >> > Does Spark SQL not support the "case when" syntax in JOIN conditions?
> >>>> >>> >> > Additionally, my Spark version is 2.2.0.
> >>>> >>> >> > Any help would be greatly appreciated.
> >>>> >>> >> >
> >>>> >>> >>
> >>>> >>>
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
>
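P.S. Here is a rough, untested DataFrame sketch (Scala) of the workaround Liang-Chi suggested: materialize the rand()-based key with a projection before the join, so the non-deterministic expression is evaluated exactly once per row of tbl1 and the join condition itself only references an ordinary column. The table and column names (tbl1, tbl2, col1, col2, col3) just follow the example above; the alias "join_key" is only illustrative.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, rand, when}

    val spark = SparkSession.builder().appName("rand-key-join-sketch").getOrCreate()

    // Compute the salted join key in a projection *before* the join,
    // so rand(9) runs exactly once per row of tbl1.
    val left = spark.table("tbl1").withColumn(
      "join_key",
      when(col("col2").isNull, (rand(9) * 1000 - 9999999999L).cast("string"))
        .otherwise(col("col2")))

    val right = spark.table("tbl2")

    // The join condition now only compares plain columns, so the analyzer
    // no longer sees a non-deterministic expression in the join condition.
    val result = left
      .join(right, left("join_key") === right("col3"), "left_outer")
      .select(left("col1"))

This is effectively the same as the SQL workaround with the inner subquery, just expressed through the DataFrame API.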