[
https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
JESSE CHEN updated SPARK-15372:
-------------------------------
Description:
The official TPC-DS query 84 returns wrong results when compared to its
official answer set.
The query itself is:
{noformat}
select c_customer_id as customer_id
,concat(c_last_name , ', ' , c_first_name) as customername
from customer
,customer_address
,customer_demographics
,household_demographics
,income_band
,store_returns
where ca_city = 'Edgewood'
and c_current_addr_sk = ca_address_sk
and ib_lower_bound >= 38128
and ib_upper_bound <= 38128 + 50000
and ib_income_band_sk = hd_income_band_sk
and cd_demo_sk = c_current_cdemo_sk
and hd_demo_sk = c_current_hdemo_sk
and sr_cdemo_sk = cd_demo_sk
order by c_customer_id
limit 100;
{noformat}
Spark 2.0 build 0517 returned the following result:
{noformat}
AAAAAAAAAIPGAAAA Carter, Rodney
AAAAAAAAAKMBBAAA Mcarthur, Emma
AAAAAAAACBNHBAAA Wells, Ron
AAAAAAAADBMEAAAA Vera, Tina
AAAAAAAADBMEAAAA Vera, Tina
AAAAAAAADHKGBAAA Scott, Pamela
AAAAAAAAEIIBBAAA Atkins, Susan
AAAAAAAAFKAHAAAA Batiste, Ernest
AAAAAAAAGHMAAAAA Mitchell, Gregory
AAAAAAAAIAODBAAA Murray, Karen
AAAAAAAAIEOKAAAA Solomon, Clyde
AAAAAAAAIIBOAAAA Owens, David
AAAAAAAAIPDCAAAA Wallace, Eric
AAAAAAAAIPIMAAAA Hayward, Benjamin
AAAAAAAAJCIKAAAA Ramos, Donald
AAAAAAAAKFJEAAAA Roberts, Yvonne
AAAAAAAAKPGBBAAA NULL <------------------------ ??? questionable row
AAAAAAAALCLABAAA Whitaker, Lettie
AAAAAAAAMGMEAAAA Sharp, Michael
AAAAAAAAMIGBBAAA Montgomery, Jesenia
AAAAAAAAMPDKAAAA Lopez, Isabel
AAAAAAAANEOMAAAA Powell, Linda
AAAAAAAANKPCAAAA Shaffer, Sergio
AAAAAAAANOCKAAAA Vargas, James
AAAAAAAAOGJEBAAA Owens, Denice
{noformat}
Official answer set (which is correct!)
{noformat}
AAAAAAAAAIPGAAAA Carter , Rodney
AAAAAAAAAKMBBAAA Mcarthur , Emma
AAAAAAAACBNHBAAA Wells , Ron
AAAAAAAADBMEAAAA Vera , Tina
AAAAAAAADBMEAAAA Vera , Tina
AAAAAAAADHKGBAAA Scott , Pamela
AAAAAAAAEIIBBAAA Atkins , Susan
AAAAAAAAFKAHAAAA Batiste , Ernest
AAAAAAAAGHMAAAAA Mitchell , Gregory
AAAAAAAAIAODBAAA Murray , Karen
AAAAAAAAIEOKAAAA Solomon , Clyde
AAAAAAAAIIBOAAAA Owens , David
AAAAAAAAIPDCAAAA Wallace , Eric
AAAAAAAAIPIMAAAA Hayward , Benjamin
AAAAAAAAJCIKAAAA Ramos , Donald
AAAAAAAAKFJEAAAA Roberts , Yvonne
AAAAAAAAKPGBBAAA Moore ,
AAAAAAAALCLABAAA Whitaker , Lettie
AAAAAAAAMGMEAAAA Sharp , Michael
AAAAAAAAMIGBBAAA Montgomery , Jesenia
AAAAAAAAMPDKAAAA Lopez , Isabel
AAAAAAAANEOMAAAA Powell , Linda
AAAAAAAANKPCAAAA Shaffer , Sergio
AAAAAAAANOCKAAAA Vargas , James
AAAAAAAAOGJEBAAA Owens , Denice
{noformat}
The issue is with the "concat" function in Spark SQL (also behaves the same in
Hive). When 'concat' meets any NULL string, it returns NULL as the answer. But
is this right? When I concatenate a person's last name and first name, if the
first name is missing (empty string or NULL), I should see the last name still,
not NULL, i.e., "Smith" + "" = "Smith", not NULL.
Simplest repeatable test:
{noformat}
hive> select c_first_name, c_last_name from customer where c_customer_id =
'AAAAAAAAKPGBBAAA';
OK
NULL Moore
Time taken: 0.07 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
NULL
Time taken: 0.1 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, c_first_name) from customer where
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
NULL
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
NULL
Time taken: 0.061 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_customer_id) from customer where
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
Moore, AAAAAAAAKPGBBAAA
Same in 'spark-sql' shell:
...
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48
select concat(c_last_name, c_first_name) from customer where c_customer_id =
'AAAAAAAAKPGBBAAA';
NULL
Time taken: 0.296 seconds, Fetched 1 row(s)
{noformat}
was:
The official TPC-DS query 84 returns wrong results when compared to its
official answer set.
The query itself is:
{noformat}
select c_customer_id as customer_id
,concat(c_last_name , ', ' , c_first_name) as customername
from customer
,customer_address
,customer_demographics
,household_demographics
,income_band
,store_returns
where ca_city = 'Edgewood'
and c_current_addr_sk = ca_address_sk
and ib_lower_bound >= 38128
and ib_upper_bound <= 38128 + 50000
and ib_income_band_sk = hd_income_band_sk
and cd_demo_sk = c_current_cdemo_sk
and hd_demo_sk = c_current_hdemo_sk
and sr_cdemo_sk = cd_demo_sk
order by c_customer_id
limit 100;
{noformat}
Spark 2.0 build 0517 returned the following result:
{noformat}
AAAAAAAAAIPGAAAA Carter, Rodney
AAAAAAAAAKMBBAAA Mcarthur, Emma
AAAAAAAACBNHBAAA Wells, Ron
AAAAAAAADBMEAAAA Vera, Tina
AAAAAAAADBMEAAAA Vera, Tina
AAAAAAAADHKGBAAA Scott, Pamela
AAAAAAAAEIIBBAAA Atkins, Susan
AAAAAAAAFKAHAAAA Batiste, Ernest
AAAAAAAAGHMAAAAA Mitchell, Gregory
AAAAAAAAIAODBAAA Murray, Karen
AAAAAAAAIEOKAAAA Solomon, Clyde
AAAAAAAAIIBOAAAA Owens, David
AAAAAAAAIPDCAAAA Wallace, Eric
AAAAAAAAIPIMAAAA Hayward, Benjamin
AAAAAAAAJCIKAAAA Ramos, Donald
AAAAAAAAKFJEAAAA Roberts, Yvonne
AAAAAAAAKPGBBAAA NULL <------------------------ ??? questionable row
AAAAAAAALCLABAAA Whitaker, Lettie
AAAAAAAAMGMEAAAA Sharp, Michael
AAAAAAAAMIGBBAAA Montgomery, Jesenia
AAAAAAAAMPDKAAAA Lopez, Isabel
AAAAAAAANEOMAAAA Powell, Linda
AAAAAAAANKPCAAAA Shaffer, Sergio
AAAAAAAANOCKAAAA Vargas, James
AAAAAAAAOGJEBAAA Owens, Denice
{noformat}
Official answer set (which is correct!)
{noformat}
AAAAAAAAAIPGAAAA Carter , Rodney
AAAAAAAAAKMBBAAA Mcarthur , Emma
AAAAAAAACBNHBAAA Wells , Ron
AAAAAAAADBMEAAAA Vera , Tina
AAAAAAAADBMEAAAA Vera , Tina
AAAAAAAADHKGBAAA Scott , Pamela
AAAAAAAAEIIBBAAA Atkins , Susan
AAAAAAAAFKAHAAAA Batiste , Ernest
AAAAAAAAGHMAAAAA Mitchell , Gregory
AAAAAAAAIAODBAAA Murray , Karen
AAAAAAAAIEOKAAAA Solomon , Clyde
AAAAAAAAIIBOAAAA Owens , David
AAAAAAAAIPDCAAAA Wallace , Eric
AAAAAAAAIPIMAAAA Hayward , Benjamin
AAAAAAAAJCIKAAAA Ramos , Donald
AAAAAAAAKFJEAAAA Roberts , Yvonne
AAAAAAAAKPGBBAAA Moore ,
AAAAAAAALCLABAAA Whitaker , Lettie
AAAAAAAAMGMEAAAA Sharp , Michael
AAAAAAAAMIGBBAAA Montgomery , Jesenia
AAAAAAAAMPDKAAAA Lopez , Isabel
AAAAAAAANEOMAAAA Powell , Linda
AAAAAAAANKPCAAAA Shaffer , Sergio
AAAAAAAANOCKAAAA Vargas , James
AAAAAAAAOGJEBAAA Owens , Denice
{noformat}
The issue is with the "concat" function in Spark SQL (also behaves the same in
Hive). When 'concat' meets any NULL string, it returns NULL as the answer. But
is this right? When I concatenate a person's last name and first name, if the
first name is missing (empty string or NULL), I should see the last name still,
not NULL, i.e., "Smith" + "" = "Smith", not NULL.
Simplest repeatable test:
hive> select c_first_name, c_last_name from customer where c_customer_id =
'AAAAAAAAKPGBBAAA';
OK
NULL Moore
Time taken: 0.07 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
NULL
Time taken: 0.1 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, c_first_name) from customer where
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
NULL
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_first_name) from customer where
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
NULL
Time taken: 0.061 seconds, Fetched: 1 row(s)
hive> select concat(c_last_name, ', ', c_customer_id) from customer where
c_customer_id = 'AAAAAAAAKPGBBAAA';
OK
Moore, AAAAAAAAKPGBBAAA
Same in 'spark-sql' shell:
...
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47
16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48
select concat(c_last_name, c_first_name) from customer where c_customer_id =
'AAAAAAAAKPGBBAAA';
NULL
Time taken: 0.296 seconds, Fetched 1 row(s)
> TPC-DS Qury 84 returns wrong results against TPC official
> ---------------------------------------------------------
>
> Key: SPARK-15372
> URL: https://issues.apache.org/jira/browse/SPARK-15372
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: JESSE CHEN
> Assignee: Herman van Hovell
> Priority: Critical
> Labels: SPARK-15071
> Fix For: 2.0.0
>
>
> The official TPC-DS query 84 returns wrong results when compared to its
> official answer set.
> The query itself is:
> {noformat}
> select c_customer_id as customer_id
> ,concat(c_last_name , ', ' , c_first_name) as customername
> from customer
> ,customer_address
> ,customer_demographics
> ,household_demographics
> ,income_band
> ,store_returns
> where ca_city = 'Edgewood'
> and c_current_addr_sk = ca_address_sk
> and ib_lower_bound >= 38128
> and ib_upper_bound <= 38128 + 50000
> and ib_income_band_sk = hd_income_band_sk
> and cd_demo_sk = c_current_cdemo_sk
> and hd_demo_sk = c_current_hdemo_sk
> and sr_cdemo_sk = cd_demo_sk
> order by c_customer_id
> limit 100;
> {noformat}
> Spark 2.0 build 0517 returned the following result:
> {noformat}
> AAAAAAAAAIPGAAAA Carter, Rodney
> AAAAAAAAAKMBBAAA Mcarthur, Emma
> AAAAAAAACBNHBAAA Wells, Ron
> AAAAAAAADBMEAAAA Vera, Tina
> AAAAAAAADBMEAAAA Vera, Tina
> AAAAAAAADHKGBAAA Scott, Pamela
> AAAAAAAAEIIBBAAA Atkins, Susan
> AAAAAAAAFKAHAAAA Batiste, Ernest
> AAAAAAAAGHMAAAAA Mitchell, Gregory
> AAAAAAAAIAODBAAA Murray, Karen
> AAAAAAAAIEOKAAAA Solomon, Clyde
> AAAAAAAAIIBOAAAA Owens, David
> AAAAAAAAIPDCAAAA Wallace, Eric
> AAAAAAAAIPIMAAAA Hayward, Benjamin
> AAAAAAAAJCIKAAAA Ramos, Donald
> AAAAAAAAKFJEAAAA Roberts, Yvonne
> AAAAAAAAKPGBBAAA NULL <------------------------ ??? questionable row
> AAAAAAAALCLABAAA Whitaker, Lettie
> AAAAAAAAMGMEAAAA Sharp, Michael
> AAAAAAAAMIGBBAAA Montgomery, Jesenia
> AAAAAAAAMPDKAAAA Lopez, Isabel
> AAAAAAAANEOMAAAA Powell, Linda
> AAAAAAAANKPCAAAA Shaffer, Sergio
> AAAAAAAANOCKAAAA Vargas, James
> AAAAAAAAOGJEBAAA Owens, Denice
> {noformat}
> Official answer set (which is correct!)
> {noformat}
> AAAAAAAAAIPGAAAA Carter , Rodney
> AAAAAAAAAKMBBAAA Mcarthur , Emma
> AAAAAAAACBNHBAAA Wells , Ron
> AAAAAAAADBMEAAAA Vera , Tina
> AAAAAAAADBMEAAAA Vera , Tina
> AAAAAAAADHKGBAAA Scott , Pamela
> AAAAAAAAEIIBBAAA Atkins , Susan
> AAAAAAAAFKAHAAAA Batiste , Ernest
> AAAAAAAAGHMAAAAA Mitchell , Gregory
> AAAAAAAAIAODBAAA Murray , Karen
> AAAAAAAAIEOKAAAA Solomon , Clyde
> AAAAAAAAIIBOAAAA Owens , David
> AAAAAAAAIPDCAAAA Wallace , Eric
> AAAAAAAAIPIMAAAA Hayward , Benjamin
> AAAAAAAAJCIKAAAA Ramos , Donald
> AAAAAAAAKFJEAAAA Roberts , Yvonne
> AAAAAAAAKPGBBAAA Moore ,
> AAAAAAAALCLABAAA Whitaker , Lettie
> AAAAAAAAMGMEAAAA Sharp , Michael
> AAAAAAAAMIGBBAAA Montgomery , Jesenia
> AAAAAAAAMPDKAAAA Lopez , Isabel
> AAAAAAAANEOMAAAA Powell , Linda
> AAAAAAAANKPCAAAA Shaffer , Sergio
> AAAAAAAANOCKAAAA Vargas , James
> AAAAAAAAOGJEBAAA Owens , Denice
> {noformat}
> The issue is with the "concat" function in Spark SQL (also behaves the same
> in Hive). When 'concat' meets any NULL string, it returns NULL as the answer.
> But is this right? When I concatenate a person's last name and first name, if
> the first name is missing (empty string or NULL), I should see the last name
> still, not NULL, i.e., "Smith" + "" = "Smith", not NULL.
> Simplest repeatable test:
> {noformat}
> hive> select c_first_name, c_last_name from customer where c_customer_id =
> 'AAAAAAAAKPGBBAAA';
> OK
> NULL Moore
> Time taken: 0.07 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_first_name) from customer where
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.1 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, c_first_name) from customer where
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.055 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_first_name) from customer where
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.061 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_customer_id) from customer where
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> Moore, AAAAAAAAKPGBBAAA
> Same in 'spark-sql' shell:
> ...
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48
> select concat(c_last_name, c_first_name) from customer where c_customer_id =
> 'AAAAAAAAKPGBBAAA';
> NULL
> Time taken: 0.296 seconds, Fetched 1 row(s)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]