I see the issue. I will try https://github.com/apache/spark/pull/18652, I think.
1. For the Join operator, the left and right plans can't be non-deterministic.
2. If Filter can support non-deterministic expressions, why not the join condition?
3. We can't push down or project a non-deterministic expression, since that may change the semantics.

Actually, the real problem is #2. If the join condition could be non-deterministic, then we wouldn't need to insert a Project.

Thanks
Chang

On Mon, Jul 17, 2017 at 3:59 PM, 蒋星博 <jiangxb1...@gmail.com> wrote:

> FYI, there has been a related discussion here:
> https://github.com/apache/spark/pull/15417#discussion_r85295977
>
> 2017-07-17 15:44 GMT+08:00 Chang Chen <baibaic...@gmail.com>:
>
>> Hi All,
>>
>> I don't understand the difference in semantics. I found that Spark does the
>> same thing for GROUP BY with non-deterministic expressions, and from a
>> Map-Reduce point of view, Join is essentially also a GroupBy.
>>
>> @Liang-Chi Hsieh
>> <https://plus.google.com/u/0/103179362592085650735?prsrc=4>
>>
>> In which situation would the semantics be changed?
>>
>> Thanks
>> Chang
>>
>> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh <vii...@gmail.com> wrote:
>>
>>> Thinking about it more, I think it changes the semantics only under
>>> certain scenarios.
>>>
>>> For the example SQL query shown in the previous discussion, the semantics
>>> look the same.
>>>
>>> Xiao Li wrote:
>>> > If the join condition is non-deterministic, pushing it down to the
>>> > underlying Project will change the semantics. Thus, we are unable to do
>>> > it in PullOutNondeterministic. Users can do it manually if they do not
>>> > care about the semantics difference.
>>> >
>>> > Thanks,
>>> >
>>> > Xiao
>>> >
>>> > 2017-07-16 20:07 GMT-07:00 Chang Chen <baibaichen@>:
>>> >
>>> >> It is tedious, since we have lots of Hive SQL being migrated to Spark,
>>> >> and this workaround is equivalent to inserting a Project between the
>>> >> Join operator and its child.
>>> >>
>>> >> Why not do it in PullOutNondeterministic?
>>> >>
>>> >> Thanks
>>> >> Chang
>>> >>
>>> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh <viirya@> wrote:
>>> >>
>>> >>> A possible workaround is to add the rand column to tbl1 with a
>>> >>> projection before the join:
>>> >>>
>>> >>> SELECT a.col1
>>> >>> FROM (
>>> >>>   SELECT col1,
>>> >>>          CASE
>>> >>>            WHEN col2 IS NULL
>>> >>>              THEN cast(rand(9)*1000 - 9999999999 as string)
>>> >>>            ELSE
>>> >>>              col2
>>> >>>          END AS col2
>>> >>>   FROM tbl1) a
>>> >>> LEFT OUTER JOIN tbl2 b
>>> >>> ON a.col2 = b.col3;
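For reference, here is a rough DataFrame-API version of the same workaround. This is only a sketch: the table and column names are the ones from the example above, and it assumes a SparkSession named spark, as in spark-shell. The point is that rand() is evaluated in a Project over tbl1, once per input row, and the join condition only compares the already-materialized column:

import org.apache.spark.sql.functions.{col, rand, when}

// Materialize the salted key in a projection before the join, so the
// non-deterministic rand() sits in a Project rather than in the join
// condition.
val a = spark.table("tbl1").select(
  col("col1"),
  when(col("col2").isNull, (rand(9) * 1000 - 9999999999L).cast("string"))
    .otherwise(col("col2"))
    .as("col2"))

val b = spark.table("tbl2")

// Join on the pre-computed key; the join condition itself is deterministic.
val result = a.join(b, a("col2") === b("col3"), "left_outer")
  .select(a("col1"))

Note that this is exactly the manual rewrite Xiao described: it changes where (and how often) rand() is evaluated, which is why the analyzer will not perform it automatically.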
>>> >>>
>>> >>> Chang Chen wrote:
>>> >>> > Hi Wenchen,
>>> >>> >
>>> >>> > Yes, we also found that this error is caused by rand(). However, this
>>> >>> > is a classic way to handle data skew in Hive. Is there an equivalent
>>> >>> > way in Spark?
>>> >>> >
>>> >>> > Thanks
>>> >>> > Chang
>>> >>> >
>>> >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan <cloud0fan@> wrote:
>>> >>> >
>>> >>> >> It's not about "case when", but about rand(). Non-deterministic
>>> >>> >> expressions are not allowed in a join condition.
>>> >>> >>
>>> >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang <cn_wss@> wrote:
>>> >>> >> >
>>> >>> >> > I'm trying to execute Hive SQL on Spark SQL (also on the Spark
>>> >>> >> > Thrift Server). To mitigate data skew, we use "case when" to
>>> >>> >> > handle nulls. A simple SQL example is the following:
>>> >>> >> >
>>> >>> >> > SELECT a.col1
>>> >>> >> > FROM tbl1 a
>>> >>> >> > LEFT OUTER JOIN tbl2 b
>>> >>> >> > ON
>>> >>> >> >   CASE
>>> >>> >> >     WHEN a.col2 IS NULL
>>> >>> >> >       THEN cast(rand(9)*1000 - 9999999999 as string)
>>> >>> >> >     ELSE
>>> >>> >> >       a.col2
>>> >>> >> >   END
>>> >>> >> >   = b.col3;
>>> >>> >> >
>>> >>> >> > But I get this error:
>>> >>> >> >
>>> >>> >> > == Physical Plan ==
>>> >>> >> > org.apache.spark.sql.AnalysisException: nondeterministic expressions
>>> >>> >> > are only allowed in Project, Filter, Aggregate or Window, found:
>>> >>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000
>>> >>> >> > AS DOUBLE)) - CAST(9999999999L AS DOUBLE)) AS STRING) ELSE
>>> >>> >> > a.`nav_tcdt` END = c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT)
>>> >>> >> > = 9)) AND (c.`cur_flag` = 1))
>>> >>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(9999999999 as double))
>>> >>> >> > as string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>>> >>> >> > (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
>>> >>> >> > ;;
>>> >>> >> > GlobalLimit 10
>>> >>> >> > +- LocalLimit 10
>>> >>> >> >    +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
>>> >>> >> >       (cast(19596 as string),cast(20134 as string),cast(10997 as
>>> >>> >> >       string)) && nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE
>>> >>> >> >       nav_tpa_id#21 END], [date_id#7]
>>> >>> >> >       +- Filter (date_id#7 = 2017-07-12)
>>> >>> >> >          +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>>> >>> >> >             cast(((rand(9) * cast(1000 as double)) - cast(9999999999
>>> >>> >> >             as double)) as string) ELSE nav_tcdt#25 END =
>>> >>> >> >             site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) &&
>>> >>> >> >             (cur_flag#77 = 1))
>>> >>> >> >             :- SubqueryAlias a
>>> >>> >> >             :  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>>> >>> >> >             :     +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`,
>>> >>> >> >                      org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>>> >>> >> >                      [date_id#7, chanl_id#8L, pltfm_id#9, city_id#10,
>>> >>> >> >                      sessn_id#11, gu_id#12, nav_refer_page_type_id#13,
>>> >>> >> >                      nav_refer_page_value#14, nav_refer_tpa#15,
>>> >>> >> >                      nav_refer_tpa_id#16, nav_refer_tpc#17,
>>> >>> >> >                      nav_refer_tpi#18, nav_page_type_id#19,
>>> >>> >> >                      nav_page_value#20, nav_tpa_id#21, nav_tpa#22,
>>> >>> >> >                      nav_tpc#23, nav_tpi#24, nav_tcdt#25, nav_tcd#26,
>>> >>> >> >                      nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
>>> >>> >> >                      detl_refer_page_value#30, ...
>>> >>> >> >                      33 more fields]
>>> >>> >> >             +- SubqueryAlias c
>>> >>> >> >                +- SubqueryAlias dim_site_categ_ext
>>> >>> >> >                   +- CatalogRelation `dw`.`dim_site_categ_ext`,
>>> >>> >> >                      org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>>> >>> >> >                      [site_categ_skid#64L, site_categ_type#65,
>>> >>> >> >                      site_categ_code#66, site_categ_name#67,
>>> >>> >> >                      site_categ_parnt_skid#68L, site_categ_kywrd#69,
>>> >>> >> >                      leaf_flg#70L, sort_seq#71L, site_categ_srch_name#72,
>>> >>> >> >                      vsbl_flg#73, delet_flag#74, etl_batch_id#75L,
>>> >>> >> >                      updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L,
>>> >>> >> >                      bkgrnd_categ_id#79L, site_categ_id#80,
>>> >>> >> >                      site_categ_parnt_id#81]
>>> >>> >> >
>>> >>> >> > Does Spark SQL not support the "case when" syntax in a JOIN
>>> >>> >> > condition? Additionally, my Spark version is 2.2.0.
>>> >>> >> > Any help would be greatly appreciated.
>>>
>>> -----
>>> Liang-Chi Hsieh | @viirya
>>> Spark Technology Center
>>> http://www.spark.tc/
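For anyone hitting the same AnalysisException, here is a minimal sketch of the restriction as reported above for Spark 2.2, using toy DataFrames (hypothetical names) and a SparkSession named spark, as in spark-shell. rand() passes analysis inside a Project or a Filter, but the same kind of expression inside a join condition is rejected:

import org.apache.spark.sql.functions.rand

val left  = spark.range(10).toDF("id")
val right = spark.range(10).toDF("id2")

// Allowed: non-deterministic expression in a Filter.
left.filter(rand(9) > 0.5).show()

// Allowed: non-deterministic expression in a Project.
left.select((rand(9) * 1000).cast("long").as("salt")).show()

// Rejected during analysis with "nondeterministic expressions are only
// allowed in Project, Filter, Aggregate or Window":
left.join(right, left("id") === (rand(9) * 10).cast("long")).show()

The workaround earlier in the thread sidesteps this by moving rand() into the projection that builds tbl1's join key, so the join condition itself stays deterministic.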