Hi all,

I don't understand the difference between the semantics here: I found that Spark already does the same pull-out for non-deterministic expressions in GROUP BY, and from a Map-Reduce point of view a join is essentially a group-by on the join keys.
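As a concrete sketch of what I mean (hypothetical local tables, assuming a SparkSession named spark): the analyzer accepts a non-deterministic grouping key, but rejects the same expression as a join key.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("nondeterministic-pullout-sketch")
    .master("local[*]")
    .getOrCreate()

  // Hypothetical tables, only to make the sketch self-contained.
  spark.range(100)
    .selectExpr("id AS col1", "CAST(id % 7 AS STRING) AS col2")
    .createOrReplaceTempView("tbl1")
  spark.range(10)
    .selectExpr("CAST(id AS STRING) AS col3")
    .createOrReplaceTempView("tbl2")

  // Accepted: PullOutNondeterministic rewrites the rand(9)-based grouping
  // key into a Project below the Aggregate (visible in the analyzed plan).
  spark.sql("SELECT count(*) FROM tbl1 GROUP BY CAST(rand(9) * 1000 AS INT)")
    .explain(true)

  // Rejected: the same expression in a join condition fails analysis with
  // "nondeterministic expressions are only allowed in Project, Filter,
  // Aggregate or Window".
  // spark.sql("""SELECT a.col1 FROM tbl1 a LEFT OUTER JOIN tbl2 b
  //              ON CAST(rand(9) * 1000 AS STRING) = b.col3""")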
@Liang-Chi Hsieh, in which situation would the semantics change?

Thanks,
Chang

On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh <[email protected]> wrote:

> Thinking about it more, I think it changes the semantics only under
> certain scenarios. For the example SQL query shown in the previous
> discussion, the semantics look the same.

Xiao Li wrote:

> If the join condition is non-deterministic, pushing it down to the
> underlying Project will change the semantics. Thus, we are unable to do
> it in PullOutNondeterministic. Users can do it manually if they do not
> care about the semantic difference.
>
> Thanks,
> Xiao

2017-07-16 20:07 GMT-07:00 Chang Chen <baibaichen@>:

> It is tedious, since we have lots of Hive SQL being migrated to Spark,
> and this workaround is equivalent to inserting a Project between the
> Join operator and its child.
>
> Why not do it in PullOutNondeterministic?
>
> Thanks,
> Chang

On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh <viirya@> wrote:

> A possible workaround is to add the rand column to tbl1 with a
> projection before the join:
>
> SELECT a.col1
> FROM (
>   SELECT col1,
>          CASE
>            WHEN col2 IS NULL
>              THEN cast(rand(9) * 1000 - 9999999999 as string)
>            ELSE col2
>          END AS col2
>   FROM tbl1) a
> LEFT OUTER JOIN tbl2 b
>   ON a.col2 = b.col3;

Chang Chen wrote:

> Hi Wenchen,
>
> Yes, we also found that this error is caused by rand(). However, this is
> a classic way to handle data skew in Hive. Is there an equivalent way in
> Spark?
>
> Thanks,
> Chang

On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan <cloud0fan@> wrote:

> It's not about CASE WHEN, but about rand(). Non-deterministic
> expressions are not allowed in join conditions.
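For anyone migrating similar queries: Liang-Chi's projection workaround above maps to the DataFrame API roughly as follows. This is only a sketch against the same hypothetical tables as before, and it pins down the semantics Xiao describes: rand(9) is evaluated exactly once per left-hand row in the projection, rather than potentially once per row comparison inside the join.

  import org.apache.spark.sql.functions.{col, rand, when}

  val tbl1 = spark.table("tbl1")
  val tbl2 = spark.table("tbl2")

  // Salt NULL join keys in a projection *before* the join; the large
  // negative offset keeps the salted keys outside the real key space, so
  // formerly-NULL keys still match nothing but no longer pile up on one
  // partition.
  val salted = tbl1.withColumn(
    "col2",
    when(col("col2").isNull,
         (rand(9) * 1000 - 9999999999L).cast("string"))
      .otherwise(col("col2")))

  val result = salted
    .join(tbl2, salted("col2") === tbl2("col3"), "left_outer")
    .select(salted("col1"))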
On 13 Jul 2017, at 6:43 PM, wangshuang <cn_wss@> wrote:

> I'm trying to execute Hive SQL on Spark SQL (also on the Spark Thrift
> server). To work around data skew, we use CASE WHEN to handle NULL keys.
> A simplified query:
>
> SELECT a.col1
> FROM tbl1 a
> LEFT OUTER JOIN tbl2 b
>   ON CASE
>        WHEN a.col2 IS NULL
>          THEN cast(rand(9) * 1000 - 9999999999 as string)
>        ELSE a.col2
>      END = b.col3;
>
> But I get the following error:
>
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: nondeterministic expressions are
> only allowed in Project, Filter, Aggregate or Window, found:
> (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt` END = c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND (c.`cur_flag` = 1))
> in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1));;
> GlobalLimit 10
> +- LocalLimit 10
>    +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN (cast(19596 as string), cast(20134 as string), cast(10997 as string)) && nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END], [date_id#7]
>       +- Filter (date_id#7 = 2017-07-12)
>          +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double)) as string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
>             :- SubqueryAlias a
>             :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>             :     +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7, chanl_id#8L, pltfm_id#9, city_id#10, sessn_id#11, gu_id#12, nav_refer_page_type_id#13, nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16, nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19, nav_page_value#20, nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25, nav_tcd#26, nav_tci#27, nav_tce#28, detl_refer_page_type_id#29, detl_refer_page_value#30, ... 33 more fields]
>             +- SubqueryAlias c
>                +- SubqueryAlias dim_site_categ_ext
>                   +- CatalogRelation `dw`.`dim_site_categ_ext`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [site_categ_skid#64L, site_categ_type#65, site_categ_code#66, site_categ_name#67, site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L, sort_seq#71L, site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L, updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L, site_categ_id#80, site_categ_parnt_id#81]
>
> Does Spark SQL not support CASE WHEN in a JOIN condition? Additionally,
> my Spark version is 2.2.0.
>
> Any help would be greatly appreciated.
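To answer that last question directly: CASE WHEN itself is supported in a join condition as long as it is deterministic; it is only the rand() call that fails analysis. A quick sketch against the same hypothetical views:

  // Deterministic CASE WHEN in a join condition analyzes and runs fine;
  // the constant sentinel simply never matches a real b.col3 value.
  spark.sql("""
    SELECT a.col1
    FROM tbl1 a
    LEFT OUTER JOIN tbl2 b
      ON CASE WHEN a.col2 IS NULL THEN '-9999999999' ELSE a.col2 END = b.col3
  """).show()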
