[ 
https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-15372:
-------------------------------
    Description: 
The official TPC-DS query 84 returns wrong results when compared to its 
official answer set.

The query itself is:
{noformat}
  select  c_customer_id as customer_id
       ,concat(c_last_name , ', ' , c_first_name) as customername
 from customer
     ,customer_address
     ,customer_demographics
     ,household_demographics
     ,income_band
     ,store_returns
 where ca_city          =  'Edgewood'
   and c_current_addr_sk = ca_address_sk
   and ib_lower_bound   >=  38128
   and ib_upper_bound   <=  38128 + 50000
   and ib_income_band_sk = hd_income_band_sk
   and cd_demo_sk = c_current_cdemo_sk
   and hd_demo_sk = c_current_hdemo_sk
   and sr_cdemo_sk = cd_demo_sk
 order by c_customer_id
  limit 100;

{noformat}

Spark 2.0 build 0517 returned the following result:
{noformat}
AAAAAAAAAIPGAAAA        Carter, Rodney
AAAAAAAAAKMBBAAA        Mcarthur, Emma
AAAAAAAACBNHBAAA        Wells, Ron
AAAAAAAADBMEAAAA        Vera, Tina
AAAAAAAADBMEAAAA        Vera, Tina
AAAAAAAADHKGBAAA        Scott, Pamela
AAAAAAAAEIIBBAAA        Atkins, Susan
AAAAAAAAFKAHAAAA        Batiste, Ernest
AAAAAAAAGHMAAAAA        Mitchell, Gregory
AAAAAAAAIAODBAAA        Murray, Karen
AAAAAAAAIEOKAAAA        Solomon, Clyde
AAAAAAAAIIBOAAAA        Owens, David
AAAAAAAAIPDCAAAA        Wallace, Eric
AAAAAAAAIPIMAAAA        Hayward, Benjamin
AAAAAAAAJCIKAAAA        Ramos, Donald
AAAAAAAAKFJEAAAA        Roberts, Yvonne
AAAAAAAAKPGBBAAA        NULL <------------------------ ??? questionable row
AAAAAAAALCLABAAA        Whitaker, Lettie
AAAAAAAAMGMEAAAA        Sharp, Michael
AAAAAAAAMIGBBAAA        Montgomery, Jesenia
AAAAAAAAMPDKAAAA        Lopez, Isabel
AAAAAAAANEOMAAAA        Powell, Linda
AAAAAAAANKPCAAAA        Shaffer, Sergio
AAAAAAAANOCKAAAA        Vargas, James
AAAAAAAAOGJEBAAA        Owens, Denice

{noformat}

Official answer set (which is correct!)
{noformat}
AAAAAAAAAIPGAAAA Carter                        , Rodney
AAAAAAAAAKMBBAAA Mcarthur                      , Emma
AAAAAAAACBNHBAAA Wells                         , Ron
AAAAAAAADBMEAAAA Vera                          , Tina
AAAAAAAADBMEAAAA Vera                          , Tina
AAAAAAAADHKGBAAA Scott                         , Pamela
AAAAAAAAEIIBBAAA Atkins                        , Susan
AAAAAAAAFKAHAAAA Batiste                       , Ernest
AAAAAAAAGHMAAAAA Mitchell                      , Gregory
AAAAAAAAIAODBAAA Murray                        , Karen
AAAAAAAAIEOKAAAA Solomon                       , Clyde
AAAAAAAAIIBOAAAA Owens                         , David
AAAAAAAAIPDCAAAA Wallace                       , Eric
AAAAAAAAIPIMAAAA Hayward                       , Benjamin
AAAAAAAAJCIKAAAA Ramos                         , Donald
AAAAAAAAKFJEAAAA Roberts                       , Yvonne
AAAAAAAAKPGBBAAA Moore                         ,
AAAAAAAALCLABAAA Whitaker                      , Lettie
AAAAAAAAMGMEAAAA Sharp                         , Michael
AAAAAAAAMIGBBAAA Montgomery                    , Jesenia
AAAAAAAAMPDKAAAA Lopez                         , Isabel
AAAAAAAANEOMAAAA Powell                        , Linda
AAAAAAAANKPCAAAA Shaffer                       , Sergio
AAAAAAAANOCKAAAA Vargas                        , James
AAAAAAAAOGJEBAAA Owens                         , Denice

{noformat}

The issue is with the "concat" function in Spark SQL (also behaves the same in 
Hive). When 'concat' meets any NULL string, it returns NULL as the answer. But 
is this right? When I concatenate a person's last name and first name, if the 
first name is missing (empty string or NULL), I should see the last name still, 
not NULL, i.e., "Smith" + "" = "Smith", not NULL. 

Simplest repeatable test:
{noformat}
hive> select c_first_name, c_last_name from customer where c_customer_id = 
'AAAAAAAAKPGBBAAA';
OK
NULL Moore
Time taken: 0.07 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where 
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
NULL
Time taken: 0.1 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, c_first_name) from customer where 
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
NULL
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where 
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
NULL
Time taken: 0.061 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_customer_id) from customer where 
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
Moore, AAAAAAAAKPGBBAAA

Same in 'spark-sql' shell:

...
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48
select concat(c_last_name, c_first_name) from customer where c_customer_id = 
'AAAAAAAAKPGBBAAA';
NULL
Time taken: 0.296 seconds, Fetched 1 row(s)
{noformat}



  was:
The official TPC-DS query 84 returns wrong results when compared to its 
official answer set.

The query itself is:
{noformat}
  select  c_customer_id as customer_id
       ,concat(c_last_name , ', ' , c_first_name) as customername
 from customer
     ,customer_address
     ,customer_demographics
     ,household_demographics
     ,income_band
     ,store_returns
 where ca_city          =  'Edgewood'
   and c_current_addr_sk = ca_address_sk
   and ib_lower_bound   >=  38128
   and ib_upper_bound   <=  38128 + 50000
   and ib_income_band_sk = hd_income_band_sk
   and cd_demo_sk = c_current_cdemo_sk
   and hd_demo_sk = c_current_hdemo_sk
   and sr_cdemo_sk = cd_demo_sk
 order by c_customer_id
  limit 100;

{noformat}

Spark 2.0 build 0517 returned the following result:
{noformat}
AAAAAAAAAIPGAAAA        Carter, Rodney
AAAAAAAAAKMBBAAA        Mcarthur, Emma
AAAAAAAACBNHBAAA        Wells, Ron
AAAAAAAADBMEAAAA        Vera, Tina
AAAAAAAADBMEAAAA        Vera, Tina
AAAAAAAADHKGBAAA        Scott, Pamela
AAAAAAAAEIIBBAAA        Atkins, Susan
AAAAAAAAFKAHAAAA        Batiste, Ernest
AAAAAAAAGHMAAAAA        Mitchell, Gregory
AAAAAAAAIAODBAAA        Murray, Karen
AAAAAAAAIEOKAAAA        Solomon, Clyde
AAAAAAAAIIBOAAAA        Owens, David
AAAAAAAAIPDCAAAA        Wallace, Eric
AAAAAAAAIPIMAAAA        Hayward, Benjamin
AAAAAAAAJCIKAAAA        Ramos, Donald
AAAAAAAAKFJEAAAA        Roberts, Yvonne
AAAAAAAAKPGBBAAA        NULL <------------------------ ??? questionable row
AAAAAAAALCLABAAA        Whitaker, Lettie
AAAAAAAAMGMEAAAA        Sharp, Michael
AAAAAAAAMIGBBAAA        Montgomery, Jesenia
AAAAAAAAMPDKAAAA        Lopez, Isabel
AAAAAAAANEOMAAAA        Powell, Linda
AAAAAAAANKPCAAAA        Shaffer, Sergio
AAAAAAAANOCKAAAA        Vargas, James
AAAAAAAAOGJEBAAA        Owens, Denice

{noformat}

Official answer set (which is correct!)
{noformat}
AAAAAAAAAIPGAAAA Carter                        , Rodney
AAAAAAAAAKMBBAAA Mcarthur                      , Emma
AAAAAAAACBNHBAAA Wells                         , Ron
AAAAAAAADBMEAAAA Vera                          , Tina
AAAAAAAADBMEAAAA Vera                          , Tina
AAAAAAAADHKGBAAA Scott                         , Pamela
AAAAAAAAEIIBBAAA Atkins                        , Susan
AAAAAAAAFKAHAAAA Batiste                       , Ernest
AAAAAAAAGHMAAAAA Mitchell                      , Gregory
AAAAAAAAIAODBAAA Murray                        , Karen
AAAAAAAAIEOKAAAA Solomon                       , Clyde
AAAAAAAAIIBOAAAA Owens                         , David
AAAAAAAAIPDCAAAA Wallace                       , Eric
AAAAAAAAIPIMAAAA Hayward                       , Benjamin
AAAAAAAAJCIKAAAA Ramos                         , Donald
AAAAAAAAKFJEAAAA Roberts                       , Yvonne
AAAAAAAAKPGBBAAA Moore                         ,
AAAAAAAALCLABAAA Whitaker                      , Lettie
AAAAAAAAMGMEAAAA Sharp                         , Michael
AAAAAAAAMIGBBAAA Montgomery                    , Jesenia
AAAAAAAAMPDKAAAA Lopez                         , Isabel
AAAAAAAANEOMAAAA Powell                        , Linda
AAAAAAAANKPCAAAA Shaffer                       , Sergio
AAAAAAAANOCKAAAA Vargas                        , James
AAAAAAAAOGJEBAAA Owens                         , Denice

{noformat}

The issue is with the "concat" function in Spark SQL (also behaves the same in 
Hive). When 'concat' meets any NULL string, it returns NULL as the answer. But 
is this right? When I concatenate a person's last name and first name, if the 
first name is missing (empty string or NULL), I should see the last name still, 
not NULL, i.e., "Smith" + "" = "Smith", not NULL. 

Simplest repeatable test:
hive> select c_first_name, c_last_name from customer where c_customer_id = 
'AAAAAAAAKPGBBAAA';
OK
NULL Moore
Time taken: 0.07 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where 
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
NULL
Time taken: 0.1 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, c_first_name) from customer where 
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
NULL
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where 
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
NULL
Time taken: 0.061 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_customer_id) from customer where 
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
Moore, AAAAAAAAKPGBBAAA

Same in 'spark-sql' shell:

...
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48
select concat(c_last_name, c_first_name) from customer where c_customer_id = 
'AAAAAAAAKPGBBAAA';
NULL
Time taken: 0.296 seconds, Fetched 1 row(s)




> TPC-DS Qury 84 returns wrong results against TPC official
> ---------------------------------------------------------
>
>                 Key: SPARK-15372
>                 URL: https://issues.apache.org/jira/browse/SPARK-15372
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: JESSE CHEN
>            Assignee: Herman van Hovell
>            Priority: Critical
>              Labels: SPARK-15071
>             Fix For: 2.0.0
>
>
> The official TPC-DS query 84 returns wrong results when compared to its 
> official answer set.
> The query itself is:
> {noformat}
>   select  c_customer_id as customer_id
>        ,concat(c_last_name , ', ' , c_first_name) as customername
>  from customer
>      ,customer_address
>      ,customer_demographics
>      ,household_demographics
>      ,income_band
>      ,store_returns
>  where ca_city          =  'Edgewood'
>    and c_current_addr_sk = ca_address_sk
>    and ib_lower_bound   >=  38128
>    and ib_upper_bound   <=  38128 + 50000
>    and ib_income_band_sk = hd_income_band_sk
>    and cd_demo_sk = c_current_cdemo_sk
>    and hd_demo_sk = c_current_hdemo_sk
>    and sr_cdemo_sk = cd_demo_sk
>  order by c_customer_id
>   limit 100;
> {noformat}
> Spark 2.0 build 0517 returned the following result:
> {noformat}
> AAAAAAAAAIPGAAAA      Carter, Rodney
> AAAAAAAAAKMBBAAA      Mcarthur, Emma
> AAAAAAAACBNHBAAA      Wells, Ron
> AAAAAAAADBMEAAAA      Vera, Tina
> AAAAAAAADBMEAAAA      Vera, Tina
> AAAAAAAADHKGBAAA      Scott, Pamela
> AAAAAAAAEIIBBAAA      Atkins, Susan
> AAAAAAAAFKAHAAAA      Batiste, Ernest
> AAAAAAAAGHMAAAAA      Mitchell, Gregory
> AAAAAAAAIAODBAAA      Murray, Karen
> AAAAAAAAIEOKAAAA      Solomon, Clyde
> AAAAAAAAIIBOAAAA      Owens, David
> AAAAAAAAIPDCAAAA      Wallace, Eric
> AAAAAAAAIPIMAAAA      Hayward, Benjamin
> AAAAAAAAJCIKAAAA      Ramos, Donald
> AAAAAAAAKFJEAAAA      Roberts, Yvonne
> AAAAAAAAKPGBBAAA      NULL <------------------------ ??? questionable row
> AAAAAAAALCLABAAA      Whitaker, Lettie
> AAAAAAAAMGMEAAAA      Sharp, Michael
> AAAAAAAAMIGBBAAA      Montgomery, Jesenia
> AAAAAAAAMPDKAAAA      Lopez, Isabel
> AAAAAAAANEOMAAAA      Powell, Linda
> AAAAAAAANKPCAAAA      Shaffer, Sergio
> AAAAAAAANOCKAAAA      Vargas, James
> AAAAAAAAOGJEBAAA      Owens, Denice
> {noformat}
> Official answer set (which is correct!)
> {noformat}
> AAAAAAAAAIPGAAAA Carter                      , Rodney
> AAAAAAAAAKMBBAAA Mcarthur                    , Emma
> AAAAAAAACBNHBAAA Wells                               , Ron
> AAAAAAAADBMEAAAA Vera                        , Tina
> AAAAAAAADBMEAAAA Vera                        , Tina
> AAAAAAAADHKGBAAA Scott                               , Pamela
> AAAAAAAAEIIBBAAA Atkins                      , Susan
> AAAAAAAAFKAHAAAA Batiste                     , Ernest
> AAAAAAAAGHMAAAAA Mitchell                    , Gregory
> AAAAAAAAIAODBAAA Murray                      , Karen
> AAAAAAAAIEOKAAAA Solomon                     , Clyde
> AAAAAAAAIIBOAAAA Owens                               , David
> AAAAAAAAIPDCAAAA Wallace                     , Eric
> AAAAAAAAIPIMAAAA Hayward                     , Benjamin
> AAAAAAAAJCIKAAAA Ramos                               , Donald
> AAAAAAAAKFJEAAAA Roberts                     , Yvonne
> AAAAAAAAKPGBBAAA Moore                               ,
> AAAAAAAALCLABAAA Whitaker                    , Lettie
> AAAAAAAAMGMEAAAA Sharp                               , Michael
> AAAAAAAAMIGBBAAA Montgomery                  , Jesenia
> AAAAAAAAMPDKAAAA Lopez                               , Isabel
> AAAAAAAANEOMAAAA Powell                      , Linda
> AAAAAAAANKPCAAAA Shaffer                     , Sergio
> AAAAAAAANOCKAAAA Vargas                      , James
> AAAAAAAAOGJEBAAA Owens                               , Denice
> {noformat}
> The issue is with the "concat" function in Spark SQL (also behaves the same 
> in Hive). When 'concat' meets any NULL string, it returns NULL as the answer. 
> But is this right? When I concatenate a person's last name and first name, if 
> the first name is missing (empty string or NULL), I should see the last name 
> still, not NULL, i.e., "Smith" + "" = "Smith", not NULL. 
> Simplest repeatable test:
> {noformat}
> hive> select c_first_name, c_last_name from customer where c_customer_id = 
> 'AAAAAAAAKPGBBAAA';
> OK
> NULL Moore
> Time taken: 0.07 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_first_name) from customer where 
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.1 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, c_first_name) from customer where 
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.055 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_first_name) from customer where 
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.061 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_customer_id) from customer where 
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> Moore, AAAAAAAAKPGBBAAA
> Same in 'spark-sql' shell:
> ...
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48
> select concat(c_last_name, c_first_name) from customer where c_customer_id = 
> 'AAAAAAAAKPGBBAAA';
> NULL
> Time taken: 0.296 seconds, Fetched 1 row(s)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to