It is tedious, since we have a lot of Hive SQL being migrated to Spark, and this workaround is equivalent to inserting a Project between the Join operator and its child.
Why not do it in PullOutNondeterministic?

Thanks
Chang

On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh <vii...@gmail.com> wrote:

> A possible workaround is to add the rand column to tbl1 with a projection
> before the join:
>
>     SELECT a.col1
>     FROM (
>       SELECT col1,
>              CASE
>                WHEN col2 IS NULL
>                THEN cast(rand(9)*1000 - 9999999999 as string)
>                ELSE col2
>              END AS col2
>       FROM tbl1) a
>     LEFT OUTER JOIN tbl2 b
>     ON a.col2 = b.col3;
>
> Chang Chen wrote:
> > Hi Wenchen,
> >
> > Yes. We also found that this error is caused by rand(). However, this is
> > a classic way to solve data skew in Hive. Is there an equivalent way in
> > Spark?
> >
> > Thanks
> > Chang
> >
> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan <cloud0fan@...> wrote:
> >
> > > It's not about CASE WHEN, but about rand(). Non-deterministic
> > > expressions are not allowed in join conditions.
> > >
> > > > On 13 Jul 2017, at 6:43 PM, wangshuang <cn_wss@...> wrote:
> > > >
> > > > I'm trying to execute Hive SQL on Spark SQL (also on the Spark
> > > > Thrift server). To mitigate data skew, we use CASE WHEN to handle
> > > > NULLs.
> > > > A simple SQL example follows:
> > > >
> > > >     SELECT a.col1
> > > >     FROM tbl1 a
> > > >     LEFT OUTER JOIN tbl2 b
> > > >     ON CASE
> > > >          WHEN a.col2 IS NULL
> > > >          THEN cast(rand(9)*1000 - 9999999999 as string)
> > > >          ELSE a.col2
> > > >        END = b.col3;
> > > >
> > > > But I get this error:
> > > >
> > > >     == Physical Plan ==
> > > >     org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window, found:
> > > >     (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt` END = c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND (c.`cur_flag` = 1))
> > > >     in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
> > > >     ;;
> > > >     GlobalLimit 10
> > > >     +- LocalLimit 10
> > > >        +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN (cast(19596 as string),cast(20134 as string),cast(10997 as string)) && nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END], [date_id#7]
> > > >           +- Filter (date_id#7 = 2017-07-12)
> > > >              +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
> > > >                 :- SubqueryAlias a
> > > >                 :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
> > > >                 :     +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7, chanl_id#8L, pltfm_id#9, city_id#10, sessn_id#11, gu_id#12, nav_refer_page_type_id#13, nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16, nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19, nav_page_value#20, nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25, nav_tcd#26, nav_tci#27, nav_tce#28, detl_refer_page_type_id#29, detl_refer_page_value#30, ... 33 more fields]
> > > >                 +- SubqueryAlias c
> > > >                    +- SubqueryAlias dim_site_categ_ext
> > > >                       +- CatalogRelation `dw`.`dim_site_categ_ext`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [site_categ_skid#64L, site_categ_type#65, site_categ_code#66, site_categ_name#67, site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L, sort_seq#71L, site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L, updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L, site_categ_id#80, site_categ_parnt_id#81]
> > > >
> > > > Does Spark SQL not support CASE WHEN syntax in a JOIN condition? Additionally, my Spark version is 2.2.0.
> > > > Any help would be greatly appreciated.
> > > >
> > > > --
> > > > View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-supported-in-JOIN-tp21953.html
> > > > Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
>
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-supported-in-JOIN-tp21953p21961.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
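Stripped of Spark entirely, the projection workaround discussed in this thread boils down to: materialize the salted key once per row in a separate pass, then join on the now-deterministic column. Here is a minimal, Spark-free sketch of that idea in plain Python; the table contents and helper names are hypothetical and only illustrate why pulling rand() out of the join condition makes the join key stable:

```python
import random

# Hypothetical sample data mirroring the thread's scenario: tbl1 rows
# whose join key (col2) may be NULL, which causes skew on a real cluster.
tbl1 = [{"col1": 1, "col2": "A"},
        {"col1": 2, "col2": None},
        {"col1": 3, "col2": None}]
tbl2 = [{"col3": "A", "val": 10}]

rng = random.Random(9)  # fixed seed, like rand(9) in the SQL

def salt_null_keys(rows):
    """The 'projection' step: compute the salted key once per row,
    *before* the join, so the join condition itself stays deterministic."""
    out = []
    for r in rows:
        key = r["col2"]
        if key is None:
            # Salt NULLs with a value that cannot match any real key,
            # spreading them over many distinct keys (mirrors
            # cast(rand(9)*1000 - 9999999999 as string)).
            key = str(int(rng.random() * 1000) - 9999999999)
        out.append({**r, "col2": key})
    return out

def left_outer_join(left, right):
    """Plain left outer join on the now-fixed key; assumes col3 is
    unique in the right-hand table for simplicity."""
    index = {r["col3"]: r for r in right}
    return [(l, index.get(l["col2"])) for l in left]

projected = salt_null_keys(tbl1)   # the extra Project before the Join
result = left_outer_join(projected, tbl2)
```

Because the random value is computed once in the projection and frozen into the row, the join sees an ordinary deterministic key; evaluating rand() inside the join condition itself would let the key change between evaluations, which is exactly why the analyzer rejects it.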