Hi All

I don't understand the difference between the semantics, I found Spark does
the same thing for GroupBy non-deterministic. From Map-Reduce point of
view, Join is also GroupBy in essence .

@Liang Chi Hsieh <https://plus.google.com/u/0/103179362592085650735?prsrc=4>

in which situation,  semantics  will be changed?

Thanks
Chang

On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh <vii...@gmail.com> wrote:

>
> Thinking about it more, I think it changes the semantics only under certain
> scenarios.
>
> For the example SQL query shown in previous discussion, it looks the same
> semantics.
>
>
> Xiao Li wrote
> > If the join condition is non-deterministic, pushing it down to the
> > underlying project will change the semantics. Thus, we are unable to do
> it
> > in PullOutNondeterministic. Users can do it manually if they do not care
> > the semantics difference.
> >
> > Thanks,
> >
> > Xiao
> >
> >
> >
> > 2017-07-16 20:07 GMT-07:00 Chang Chen &lt;
>
> > baibaichen@
>
> > &gt;:
> >
> >> It is tedious since we have lots of Hive SQL being migrated to Spark.
> >> And
> >> this workaround is equivalent  to insert a Project between Join operator
> >> and its child.
> >>
> >> Why not do it in PullOutNondeterministic?
> >>
> >> Thanks
> >> Chang
> >>
> >>
> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh &lt;
>
> > viirya@
>
> > &gt; wrote:
> >>
> >>>
> >>> A possible workaround is to add the rand column into tbl1 with a
> >>> projection
> >>> before the join.
> >>>
> >>> SELECT a.col1
> >>> FROM (
> >>>   SELECT col1,
> >>>     CASE
> >>>          WHEN col2 IS NULL
> >>>            THEN cast(rand(9)*1000 - 9999999999 as string)
> >>>          ELSE
> >>>            col2
> >>>     END AS col2
> >>>     FROM tbl1) a
> >>> LEFT OUTER JOIN tbl2 b
> >>> ON a.col2 = b.col3;
> >>>
> >>>
> >>>
> >>> Chang Chen wrote
> >>> > Hi Wenchen
> >>> >
> >>> > Yes. We also find this error is caused by Rand. However, this is
> >>> classic
> >>> > way to solve data skew in Hive.  Is there any equivalent way in
> Spark?
> >>> >
> >>> > Thanks
> >>> > Chang
> >>> >
> >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan &lt;
> >>>
> >>> > cloud0fan@
> >>>
> >>> > &gt; wrote:
> >>> >
> >>> >> It’s not about case when, but about rand(). Non-deterministic
> >>> expressions
> >>> >> are not allowed in join condition.
> >>> >>
> >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang &lt;
> >>>
> >>> > cn_wss@
> >>>
> >>> > &gt; wrote:
> >>> >> >
> >>> >> > I'm trying to execute hive sql on spark sql (Also on spark
> >>> >> thriftserver), For
> >>> >> > optimizing data skew, we use "case when" to handle null.
> >>> >> > Simple sql as following:
> >>> >> >
> >>> >> >
> >>> >> > SELECT a.col1
> >>> >> > FROM tbl1 a
> >>> >> > LEFT OUTER JOIN tbl2 b
> >>> >> > ON
> >>> >> > *     CASE
> >>> >> >               WHEN a.col2 IS NULL
> >>> >> >                       TNEN cast(rand(9)*1000 - 9999999999 as
> >>> string)
> >>> >> >               ELSE
> >>> >> >                       a.col2 END *
> >>> >> >       = b.col3;
> >>> >> >
> >>> >> >
> >>> >> > But I get the error:
> >>> >> >
> >>> >> > == Physical Plan ==
> >>> >> > *org.apache.spark.sql.AnalysisException: nondeterministic
> >>> expressions
> >>> >> are
> >>> >> > only allowed in
> >>> >> > Project, Filter, Aggregate or Window, found:*
> >>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) *
> CAST(1000
> >>> AS
> >>> >> > DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE
> >>> a.`nav_tcdt`
> >>> >> END
> >>> >> =
> >>> >> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
> >>> >> (c.`cur_flag`
> >>> >> =
> >>> >> > 1))
> >>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as
> >>> double))
> >>> as
> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
> >>> (cast(nav_tcd#26
> >>> as
> >>> >> int)
> >>> >> > = 9)) && (cur_flag#77 = 1))
> >>> >> >               ;;
> >>> >> > GlobalLimit 10
> >>> >> > +- LocalLimit 10
> >>> >> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string)
> IN
> >>> >> > (cast(19596 as string),cast(20134 as string),cast(10997 as
> string))
> >>> &&
> >>> >> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21
> >>> END],
> >>> >> > [date_id#7]
> >>> >> >      +- Filter (date_id#7 = 2017-07-12)
> >>> >> >         +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as
> >>> double))
> >>> as
> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
> >>> (cast(nav_tcd#26
> >>> as
> >>> >> int)
> >>> >> > = 9)) && (cur_flag#77 = 1))
> >>> >> >            :- SubqueryAlias a
> >>> >> >            :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
> >>> >> >            :     +- CatalogRelation
> >>> `tmp`.`tmp_lifan_trfc_tpa_hive`,
> >>> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
> >>> >> chanl_id#8L,
> >>> >> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
> >>> >> nav_refer_page_type_id#13,
> >>> >> > nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16,
> >>> >> > nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19,
> >>> >> nav_page_value#20,
> >>> >> > nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25,
> >>> >> nav_tcd#26,
> >>> >> > nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
> >>> >> > detl_refer_page_value#30, ... 33 more fields]
> >>> >> >            +- SubqueryAlias c
> >>> >> >               +- SubqueryAlias dim_site_categ_ext
> >>> >> >                  +- CatalogRelation `dw`.`dim_site_categ_ext`,
> >>> >> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
> >>> >> [site_categ_skid#64L,
> >>> >> > site_categ_type#65, site_categ_code#66, site_categ_name#67,
> >>> >> > site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L,
> >>> >> sort_seq#71L,
> >>> >> > site_categ_srch_name#72, vsbl_flg#73, delet_flag#74,
> >>> etl_batch_id#75L,
> >>> >> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L,
> >>> bkgrnd_categ_id#79L,
> >>> >> > site_categ_id#80, site_categ_parnt_id#81]
> >>> >> >
> >>> >> > Does spark sql not support syntax "case when" in JOIN?
> Additional,
> >>> my
> >>> >> spark
> >>> >> > version is 2.2.0.
> >>> >> > Any help would be greatly appreciated.
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > --
> >>> >> > View this message in context: http://apache-spark-developers
> >>> >> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-
> >>> >> be-supported-in-JOIN-tp21953.html
> >>> >> > Sent from the Apache Spark Developers List mailing list archive at
> >>> >> Nabble.com.
> >>> >> >
> >>> >> > ------------------------------------------------------------
> >>> ---------
> >>> >> > To unsubscribe e-mail:
> >>>
> >>> > dev-unsubscribe@.apache
> >>>
> >>> >> >
> >>> >>
> >>> >>
> >>> >> ------------------------------------------------------------
> ---------
> >>> >> To unsubscribe e-mail:
> >>>
> >>> > dev-unsubscribe@.apache
> >>>
> >>> >>
> >>> >>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> -----
> >>> Liang-Chi Hsieh | @viirya
> >>> Spark Technology Center
> >>> http://www.spark.tc/
> >>> --
> >>> View this message in context: http://apache-spark-developers
> >>> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-
> >>> supported-in-JOIN-tp21953p21961.html
> >>> Sent from the Apache Spark Developers List mailing list archive at
> >>> Nabble.com.
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
> >>>
> >>>
> >>
>
>
>
>
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/SQL-Syntax-case-
> when-doesn-t-be-supported-in-JOIN-tp21953p21973.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

Reply via email to