I see the issue. I will try https://github.com/apache/spark/pull/18652, I think.
1. For the Join operator, the left and right plans can't be non-deterministic.
2. If Filter can support non-deterministic expressions, why not the join condition?
3. We can't push down or project a non-deterministic expression, since that may change the semantics.

Actually, the real problem is #2. If the join condition could be non-deterministic, then we wouldn't need to insert a Project.

Thanks
Chang

On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博 <jiangxb1...@gmail.com> wrote:

> FYI, there has been a related discussion here:
> https://github.com/apache/spark/pull/15417#discussion_r85295977
>
> 2017-07-17 15:44 GMT+08:00 Chang Chen <baibaic...@gmail.com>:
>
>> Hi All,
>>
>> I don't understand the difference in semantics. I found that Spark does the
>> same thing for GROUP BY with non-deterministic expressions, and from a
>> Map-Reduce point of view, Join is essentially also a GroupBy.
>>
>> @Liang-Chi Hsieh
>> <https://plus.google.com/u/0/103179362592085650735?prsrc=4>
>>
>> In which situation would the semantics be changed?
>>
>> Thanks
>> Chang
>>
>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh <vii...@gmail.com> wrote:
>>
>>> Thinking about it more, I think it changes the semantics only under
>>> certain scenarios.
>>>
>>> For the example SQL query shown in the previous discussion, the semantics
>>> look the same.
>>>
>>> Xiao Li wrote:
>>> > If the join condition is non-deterministic, pushing it down to the
>>> > underlying Project will change the semantics. Thus, we are unable to do
>>> > it in PullOutNondeterministic. Users can do it manually if they do not
>>> > care about the semantics difference.
>>> >
>>> > Thanks,
>>> >
>>> > Xiao
>>> >
>>> > 2017-07-16 20:07 GMT-07:00 Chang Chen <baibaichen@>:
>>> >
>>> >> It is tedious, since we have lots of Hive SQL being migrated to Spark,
>>> >> and this workaround is equivalent to inserting a Project between the
>>> >> Join operator and its child.
>>> >>
>>> >> Why not do it in PullOutNondeterministic?
>>> >>
>>> >> Thanks
>>> >> Chang
>>> >>
>>> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh <viirya@> wrote:
>>> >>
>>> >>> A possible workaround is to add the rand column to tbl1 with a
>>> >>> projection before the join:
>>> >>>
>>> >>> SELECT a.col1
>>> >>> FROM (
>>> >>>   SELECT col1,
>>> >>>          CASE
>>> >>>            WHEN col2 IS NULL
>>> >>>              THEN cast(rand(9)*1000 - 9999999999 as string)
>>> >>>            ELSE
>>> >>>              col2
>>> >>>          END AS col2
>>> >>>   FROM tbl1) a
>>> >>> LEFT OUTER JOIN tbl2 b
>>> >>> ON a.col2 = b.col3;
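For reference, here is a rough DataFrame-API version of the same workaround. This is only a sketch: the table and column names are the ones from the example above, and it assumes a SparkSession named spark, as in spark-shell. The point is that rand() is evaluated in a Project over tbl1, once per input row, and the join condition only compares the already-materialized column:

import org.apache.spark.sql.functions.{col, rand, when}

// Materialize the salted key in a projection before the join, so the
// non-deterministic rand() sits in a Project rather than in the join
// condition.
val a = spark.table("tbl1").select(
  col("col1"),
  when(col("col2").isNull, (rand(9) * 1000 - 9999999999L).cast("string"))
    .otherwise(col("col2"))
    .as("col2"))

val b = spark.table("tbl2")

// Join on the pre-computed key; the join condition itself is deterministic.
val result = a.join(b, a("col2") === b("col3"), "left_outer")
  .select(a("col1"))

Note that this is exactly the manual rewrite Xiao described: it changes where (and how often) rand() is evaluated, which is why the analyzer will not perform it automatically.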
>>> >>>
>>> >>> Chang Chen wrote:
>>> >>> > Hi Wenchen,
>>> >>> >
>>> >>> > Yes, we also found that this error is caused by rand(). However, this
>>> >>> > is a classic way to handle data skew in Hive. Is there an equivalent
>>> >>> > way in Spark?
>>> >>> >
>>> >>> > Thanks
>>> >>> > Chang
>>> >>> >
>>> >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan <cloud0fan@> wrote:
>>> >>> >
>>> >>> >> It's not about "case when", but about rand(). Non-deterministic
>>> >>> >> expressions are not allowed in a join condition.
>>> >>> >>
>>> >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang <cn_wss@> wrote:
>>> >>> >> >
>>> >>> >> > I'm trying to execute Hive SQL on Spark SQL (also on the Spark
>>> >>> >> > Thrift Server). To mitigate data skew, we use "case when" to
>>> >>> >> > handle nulls. A simple SQL example is the following:
>>> >>> >> >
>>> >>> >> > SELECT a.col1
>>> >>> >> > FROM tbl1 a
>>> >>> >> > LEFT OUTER JOIN tbl2 b
>>> >>> >> > ON
>>> >>> >> >   CASE
>>> >>> >> >     WHEN a.col2 IS NULL
>>> >>> >> >       THEN cast(rand(9)*1000 - 9999999999 as string)
>>> >>> >> >     ELSE
>>> >>> >> >       a.col2
>>> >>> >> >   END
>>> >>> >> >   = b.col3;
>>> >>> >> >
>>> >>> >> > But I get this error:
>>> >>> >> >
>>> >>> >> > == Physical Plan ==
>>> >>> >> > org.apache.spark.sql.AnalysisException: nondeterministic expressions
>>> >>> >> > are only allowed in Project, Filter, Aggregate or Window, found:
>>> >>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000
>>> >>> >> > AS DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE
>>> >>> >> > a.`nav_tcdt` END = c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT)
>>> >>> >> > = 9)) AND (c.`cur_flag` = 1))
>>> >>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double))
>>> >>> >> > as string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>>> >>> >> > (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
>>> >>> >> > ;;
>>> >>> >> > GlobalLimit 10
>>> >>> >> > +- LocalLimit 10
>>> >>> >> >    +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
>>> >>> >> >       (cast(19596 as string),cast(20134 as string),cast(10997 as
>>> >>> >> >       string)) && nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE
>>> >>> >> >       nav_tpa_id#21 END], [date_id#7]
>>> >>> >> >       +- Filter (date_id#7 = 2017-07-12)
>>> >>> >> >          +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>>> >>> >> >             cast(((rand(9) * cast(1000 as double)) - cast(9999999999
>>> >>> >> >             as double)) as string) ELSE nav_tcdt#25 END =
>>> >>> >> >             site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) &&
>>> >>> >> >             (cur_flag#77 = 1))
>>> >>> >> >             :- SubqueryAlias a
>>> >>> >> >             :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>>> >>> >> >             :     +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`,
>>> >>> >> >                      org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>>> >>> >> >                      [date_id#7, chanl_id#8L, pltfm_id#9, city_id#10,
>>> >>> >> >                      sessn_id#11, gu_id#12, nav_refer_page_type_id#13,
>>> >>> >> >                      nav_refer_page_value#14, nav_refer_tpa#15,
>>> >>> >> >                      nav_refer_tpa_id#16, nav_refer_tpc#17,
>>> >>> >> >                      nav_refer_tpi#18, nav_page_type_id#19,
>>> >>> >> >                      nav_page_value#20, nav_tpa_id#21, nav_tpa#22,
>>> >>> >> >                      nav_tpc#23, nav_tpi#24, nav_tcdt#25, nav_tcd#26,
>>> >>> >> >                      nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
>>> >>> >> >                      detl_refer_page_value#30, ...
>>> >>> >> >                      33 more fields]
>>> >>> >> >             +- SubqueryAlias c
>>> >>> >> >                +- SubqueryAlias dim_site_categ_ext
>>> >>> >> >                   +- CatalogRelation `dw`.`dim_site_categ_ext`,
>>> >>> >> >                      org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>>> >>> >> >                      [site_categ_skid#64L, site_categ_type#65,
>>> >>> >> >                      site_categ_code#66, site_categ_name#67,
>>> >>> >> >                      site_categ_parnt_skid#68L, site_categ_kywrd#69,
>>> >>> >> >                      leaf_flg#70L, sort_seq#71L, site_categ_srch_name#72,
>>> >>> >> >                      vsbl_flg#73, delet_flag#74, etl_batch_id#75L,
>>> >>> >> >                      updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L,
>>> >>> >> >                      bkgrnd_categ_id#79L, site_categ_id#80,
>>> >>> >> >                      site_categ_parnt_id#81]
>>> >>> >> >
>>> >>> >> > Does Spark SQL not support the "case when" syntax in a JOIN
>>> >>> >> > condition? Additionally, my Spark version is 2.2.0.
>>> >>> >> > Any help would be greatly appreciated.
>>>
>>> -----
>>> Liang-Chi Hsieh | @viirya
>>> Spark Technology Center
>>> http://www.spark.tc/
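For anyone hitting the same AnalysisException, here is a minimal sketch of the restriction as reported above for Spark 2.2, using toy DataFrames (hypothetical names) and a SparkSession named spark, as in spark-shell. rand() passes analysis inside a Project or a Filter, but the same kind of expression inside a join condition is rejected:

import org.apache.spark.sql.functions.rand

val left  = spark.range(10).toDF("id")
val right = spark.range(10).toDF("id2")

// Allowed: non-deterministic expression in a Filter.
left.filter(rand(9) > 0.5).show()

// Allowed: non-deterministic expression in a Project.
left.select((rand(9) * 1000).cast("long").as("salt")).show()

// Rejected during analysis with "nondeterministic expressions are only
// allowed in Project, Filter, Aggregate or Window":
left.join(right, left("id") === (rand(9) * 10).cast("long")).show()

The workaround earlier in the thread sidesteps this by moving rand() into the projection that builds tbl1's join key, so the join condition itself stays deterministic.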