[jira] [Commented] (SPARK-15372) TPC-DS Qury 84 returns wrong results against TPC official

Frederick Reiss (JIRA) Tue, 17 May 2016 18:10:50 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288004#comment-15288004
 ]


Frederick Reiss commented on SPARK-15372:
-----------------------------------------

There is no CONCAT function in the SQL standard. The concatenation operator in 
the standard is ||. DB2 treats "CONCAT" as a synonym for "||"; other vendors do 
different things. The standard states that if either input to the concatenation 
operator is NULL, then the output is NULL (Section 6.28 <string value 
expression> in the SQL/Foundations document).

The TPC-DS benchmark spec doesn't say anything about the allowable semantics of 
CONCAT, only that "vendor specific syntax can be used".

> TPC-DS Qury 84 returns wrong results against TPC official
> ---------------------------------------------------------
>
>                 Key: SPARK-15372
>                 URL: https://issues.apache.org/jira/browse/SPARK-15372
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: JESSE CHEN
>            Assignee: Herman van Hovell
>            Priority: Critical
>              Labels: SPARK-15071
>             Fix For: 2.0.0
>
>
> The official TPC-DS query 84 returns wrong results when compared to its 
> official answer set.
> The query itself is:
> {noformat}
>   select  c_customer_id as customer_id
>        ,concat(c_last_name , ', ' , c_first_name) as customername
>  from customer
>      ,customer_address
>      ,customer_demographics
>      ,household_demographics
>      ,income_band
>      ,store_returns
>  where ca_city          =  'Edgewood'
>    and c_current_addr_sk = ca_address_sk
>    and ib_lower_bound   >=  38128
>    and ib_upper_bound   <=  38128 + 50000
>    and ib_income_band_sk = hd_income_band_sk
>    and cd_demo_sk = c_current_cdemo_sk
>    and hd_demo_sk = c_current_hdemo_sk
>    and sr_cdemo_sk = cd_demo_sk
>  order by c_customer_id
>   limit 100;
> {noformat}
> Spark 2.0 build 0517 returned the following result:
> {noformat}
> AAAAAAAAAIPGAAAA      Carter, Rodney
> AAAAAAAAAKMBBAAA      Mcarthur, Emma
> AAAAAAAACBNHBAAA      Wells, Ron
> AAAAAAAADBMEAAAA      Vera, Tina
> AAAAAAAADBMEAAAA      Vera, Tina
> AAAAAAAADHKGBAAA      Scott, Pamela
> AAAAAAAAEIIBBAAA      Atkins, Susan
> AAAAAAAAFKAHAAAA      Batiste, Ernest
> AAAAAAAAGHMAAAAA      Mitchell, Gregory
> AAAAAAAAIAODBAAA      Murray, Karen
> AAAAAAAAIEOKAAAA      Solomon, Clyde
> AAAAAAAAIIBOAAAA      Owens, David
> AAAAAAAAIPDCAAAA      Wallace, Eric
> AAAAAAAAIPIMAAAA      Hayward, Benjamin
> AAAAAAAAJCIKAAAA      Ramos, Donald
> AAAAAAAAKFJEAAAA      Roberts, Yvonne
> AAAAAAAAKPGBBAAA      NULL <------------------------ ??? questionable row
> AAAAAAAALCLABAAA      Whitaker, Lettie
> AAAAAAAAMGMEAAAA      Sharp, Michael
> AAAAAAAAMIGBBAAA      Montgomery, Jesenia
> AAAAAAAAMPDKAAAA      Lopez, Isabel
> AAAAAAAANEOMAAAA      Powell, Linda
> AAAAAAAANKPCAAAA      Shaffer, Sergio
> AAAAAAAANOCKAAAA      Vargas, James
> AAAAAAAAOGJEBAAA      Owens, Denice
> {noformat}
> Official answer set (which is correct!)
> {noformat}
> AAAAAAAAAIPGAAAA Carter                      , Rodney
> AAAAAAAAAKMBBAAA Mcarthur                    , Emma
> AAAAAAAACBNHBAAA Wells                               , Ron
> AAAAAAAADBMEAAAA Vera                        , Tina
> AAAAAAAADBMEAAAA Vera                        , Tina
> AAAAAAAADHKGBAAA Scott                               , Pamela
> AAAAAAAAEIIBBAAA Atkins                      , Susan
> AAAAAAAAFKAHAAAA Batiste                     , Ernest
> AAAAAAAAGHMAAAAA Mitchell                    , Gregory
> AAAAAAAAIAODBAAA Murray                      , Karen
> AAAAAAAAIEOKAAAA Solomon                     , Clyde
> AAAAAAAAIIBOAAAA Owens                               , David
> AAAAAAAAIPDCAAAA Wallace                     , Eric
> AAAAAAAAIPIMAAAA Hayward                     , Benjamin
> AAAAAAAAJCIKAAAA Ramos                               , Donald
> AAAAAAAAKFJEAAAA Roberts                     , Yvonne
> AAAAAAAAKPGBBAAA Moore                               ,
> AAAAAAAALCLABAAA Whitaker                    , Lettie
> AAAAAAAAMGMEAAAA Sharp                               , Michael
> AAAAAAAAMIGBBAAA Montgomery                  , Jesenia
> AAAAAAAAMPDKAAAA Lopez                               , Isabel
> AAAAAAAANEOMAAAA Powell                      , Linda
> AAAAAAAANKPCAAAA Shaffer                     , Sergio
> AAAAAAAANOCKAAAA Vargas                      , James
> AAAAAAAAOGJEBAAA Owens                               , Denice
> {noformat}
> The issue is with the "concat" function in Spark SQL (also behaves the same 
> in Hive). When 'concat' meets any NULL string, it returns NULL as the answer. 
> But is this right? When I concatenate a person's last name and first name, if 
> the first name is missing (empty string or NULL), I should see the last name 
> still, not NULL, i.e., "Smith" + "" = "Smith", not NULL. 
> Simplest repeatable test:
> {noformat}
> hive> select c_first_name, c_last_name from customer where c_customer_id = 
> 'AAAAAAAAKPGBBAAA';
> OK
> NULL Moore
> Time taken: 0.07 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_first_name) from customer where 
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.1 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, c_first_name) from customer where 
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.055 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_first_name) from customer where 
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.061 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_customer_id) from customer where 
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> Moore, AAAAAAAAKPGBBAAA
> Same in 'spark-sql' shell:
> ...
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48
> select concat(c_last_name, c_first_name) from customer where c_customer_id = 
> 'AAAAAAAAKPGBBAAA';
> NULL
> Time taken: 0.296 seconds, Fetched 1 row(s)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15372) TPC-DS Qury 84 returns wrong results against TPC official

Reply via email to