[ https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
JESSE CHEN updated SPARK-15372: ------------------------------- Description: The official TPC-DS query 84 returns wrong results when compared to its official answer set. The query itself is: {noformat} select c_customer_id as customer_id ,concat(c_last_name , ', ' , c_first_name) as customername from customer ,customer_address ,customer_demographics ,household_demographics ,income_band ,store_returns where ca_city = 'Edgewood' and c_current_addr_sk = ca_address_sk and ib_lower_bound >= 38128 and ib_upper_bound <= 38128 + 50000 and ib_income_band_sk = hd_income_band_sk and cd_demo_sk = c_current_cdemo_sk and hd_demo_sk = c_current_hdemo_sk and sr_cdemo_sk = cd_demo_sk order by c_customer_id limit 100; {noformat} Spark 2.0 build 0517 returned the following result: {noformat} AAAAAAAAAIPGAAAA Carter, Rodney AAAAAAAAAKMBBAAA Mcarthur, Emma AAAAAAAACBNHBAAA Wells, Ron AAAAAAAADBMEAAAA Vera, Tina AAAAAAAADBMEAAAA Vera, Tina AAAAAAAADHKGBAAA Scott, Pamela AAAAAAAAEIIBBAAA Atkins, Susan AAAAAAAAFKAHAAAA Batiste, Ernest AAAAAAAAGHMAAAAA Mitchell, Gregory AAAAAAAAIAODBAAA Murray, Karen AAAAAAAAIEOKAAAA Solomon, Clyde AAAAAAAAIIBOAAAA Owens, David AAAAAAAAIPDCAAAA Wallace, Eric AAAAAAAAIPIMAAAA Hayward, Benjamin AAAAAAAAJCIKAAAA Ramos, Donald AAAAAAAAKFJEAAAA Roberts, Yvonne AAAAAAAAKPGBBAAA NULL <------------------------ ??? questionable row AAAAAAAALCLABAAA Whitaker, Lettie AAAAAAAAMGMEAAAA Sharp, Michael AAAAAAAAMIGBBAAA Montgomery, Jesenia AAAAAAAAMPDKAAAA Lopez, Isabel AAAAAAAANEOMAAAA Powell, Linda AAAAAAAANKPCAAAA Shaffer, Sergio AAAAAAAANOCKAAAA Vargas, James AAAAAAAAOGJEBAAA Owens, Denice {noformat} Official answer set (which is correct!) {noformat} AAAAAAAAAIPGAAAA Carter , Rodney AAAAAAAAAKMBBAAA Mcarthur , Emma AAAAAAAACBNHBAAA Wells , Ron AAAAAAAADBMEAAAA Vera , Tina AAAAAAAADBMEAAAA Vera , Tina AAAAAAAADHKGBAAA Scott , Pamela AAAAAAAAEIIBBAAA Atkins , Susan AAAAAAAAFKAHAAAA Batiste , Ernest AAAAAAAAGHMAAAAA Mitchell , Gregory AAAAAAAAIAODBAAA Murray , Karen AAAAAAAAIEOKAAAA Solomon , Clyde AAAAAAAAIIBOAAAA Owens , David AAAAAAAAIPDCAAAA Wallace , Eric AAAAAAAAIPIMAAAA Hayward , Benjamin AAAAAAAAJCIKAAAA Ramos , Donald AAAAAAAAKFJEAAAA Roberts , Yvonne AAAAAAAAKPGBBAAA Moore , AAAAAAAALCLABAAA Whitaker , Lettie AAAAAAAAMGMEAAAA Sharp , Michael AAAAAAAAMIGBBAAA Montgomery , Jesenia AAAAAAAAMPDKAAAA Lopez , Isabel AAAAAAAANEOMAAAA Powell , Linda AAAAAAAANKPCAAAA Shaffer , Sergio AAAAAAAANOCKAAAA Vargas , James AAAAAAAAOGJEBAAA Owens , Denice {noformat} The issue is with the "concat" function in Spark SQL (also behaves the same in Hive). When 'concat' meets any NULL string, it returns NULL as the answer. But is this right? When I concatenate a person's last name and first name, if the first name is missing (empty string or NULL), I should see the last name still, not NULL, i.e., "Smith" + "" = "Smith", not NULL. Simplest repeatable test: {noformat} hive> select c_first_name, c_last_name from customer where c_customer_id = 'AAAAAAAAKPGBBAAA'; OK NULL Moore Time taken: 0.07 seconds, Fetched: 1 row(s) hive> select concat(c_last_name, ', ', c_first_name) from customer where c_customer_id = 'AAAAAAAAKPGBBAAA'; OK NULL Time taken: 0.1 seconds, Fetched: 1 row(s) hive> select concat(c_last_name, c_first_name) from customer where c_customer_id = 'AAAAAAAAKPGBBAAA'; OK NULL Time taken: 0.055 seconds, Fetched: 1 row(s) hive> select concat(c_last_name, ', ', c_first_name) from customer where c_customer_id = 'AAAAAAAAKPGBBAAA'; OK NULL Time taken: 0.061 seconds, Fetched: 1 row(s) hive> select concat(c_last_name, ', ', c_customer_id) from customer where c_customer_id = 'AAAAAAAAKPGBBAAA'; OK Moore, AAAAAAAAKPGBBAAA Same in 'spark-sql' shell: ... 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48 select concat(c_last_name, c_first_name) from customer where c_customer_id = 'AAAAAAAAKPGBBAAA'; NULL Time taken: 0.296 seconds, Fetched 1 row(s) {noformat} was: The official TPC-DS query 84 returns wrong results when compared to its official answer set. The query itself is: {noformat} select c_customer_id as customer_id ,concat(c_last_name , ', ' , c_first_name) as customername from customer ,customer_address ,customer_demographics ,household_demographics ,income_band ,store_returns where ca_city = 'Edgewood' and c_current_addr_sk = ca_address_sk and ib_lower_bound >= 38128 and ib_upper_bound <= 38128 + 50000 and ib_income_band_sk = hd_income_band_sk and cd_demo_sk = c_current_cdemo_sk and hd_demo_sk = c_current_hdemo_sk and sr_cdemo_sk = cd_demo_sk order by c_customer_id limit 100; {noformat} Spark 2.0 build 0517 returned the following result: {noformat} AAAAAAAAAIPGAAAA Carter, Rodney AAAAAAAAAKMBBAAA Mcarthur, Emma AAAAAAAACBNHBAAA Wells, Ron AAAAAAAADBMEAAAA Vera, Tina AAAAAAAADBMEAAAA Vera, Tina AAAAAAAADHKGBAAA Scott, Pamela AAAAAAAAEIIBBAAA Atkins, Susan AAAAAAAAFKAHAAAA Batiste, Ernest AAAAAAAAGHMAAAAA Mitchell, Gregory AAAAAAAAIAODBAAA Murray, Karen AAAAAAAAIEOKAAAA Solomon, Clyde AAAAAAAAIIBOAAAA Owens, David AAAAAAAAIPDCAAAA Wallace, Eric AAAAAAAAIPIMAAAA Hayward, Benjamin AAAAAAAAJCIKAAAA Ramos, Donald AAAAAAAAKFJEAAAA Roberts, Yvonne AAAAAAAAKPGBBAAA NULL <------------------------ ??? questionable row AAAAAAAALCLABAAA Whitaker, Lettie AAAAAAAAMGMEAAAA Sharp, Michael AAAAAAAAMIGBBAAA Montgomery, Jesenia AAAAAAAAMPDKAAAA Lopez, Isabel AAAAAAAANEOMAAAA Powell, Linda AAAAAAAANKPCAAAA Shaffer, Sergio AAAAAAAANOCKAAAA Vargas, James AAAAAAAAOGJEBAAA Owens, Denice {noformat} Official answer set (which is correct!) {noformat} AAAAAAAAAIPGAAAA Carter , Rodney AAAAAAAAAKMBBAAA Mcarthur , Emma AAAAAAAACBNHBAAA Wells , Ron AAAAAAAADBMEAAAA Vera , Tina AAAAAAAADBMEAAAA Vera , Tina AAAAAAAADHKGBAAA Scott , Pamela AAAAAAAAEIIBBAAA Atkins , Susan AAAAAAAAFKAHAAAA Batiste , Ernest AAAAAAAAGHMAAAAA Mitchell , Gregory AAAAAAAAIAODBAAA Murray , Karen AAAAAAAAIEOKAAAA Solomon , Clyde AAAAAAAAIIBOAAAA Owens , David AAAAAAAAIPDCAAAA Wallace , Eric AAAAAAAAIPIMAAAA Hayward , Benjamin AAAAAAAAJCIKAAAA Ramos , Donald AAAAAAAAKFJEAAAA Roberts , Yvonne AAAAAAAAKPGBBAAA Moore , AAAAAAAALCLABAAA Whitaker , Lettie AAAAAAAAMGMEAAAA Sharp , Michael AAAAAAAAMIGBBAAA Montgomery , Jesenia AAAAAAAAMPDKAAAA Lopez , Isabel AAAAAAAANEOMAAAA Powell , Linda AAAAAAAANKPCAAAA Shaffer , Sergio AAAAAAAANOCKAAAA Vargas , James AAAAAAAAOGJEBAAA Owens , Denice {noformat} The issue is with the "concat" function in Spark SQL (also behaves the same in Hive). When 'concat' meets any NULL string, it returns NULL as the answer. But is this right? When I concatenate a person's last name and first name, if the first name is missing (empty string or NULL), I should see the last name still, not NULL, i.e., "Smith" + "" = "Smith", not NULL. Simplest repeatable test: hive> select c_first_name, c_last_name from customer where c_customer_id = 'AAAAAAAAKPGBBAAA'; OK NULL Moore Time taken: 0.07 seconds, Fetched: 1 row(s) hive> select concat(c_last_name, ', ', c_first_name) from customer where c_customer_id = 'AAAAAAAAKPGBBAAA'; OK NULL Time taken: 0.1 seconds, Fetched: 1 row(s) hive> select concat(c_last_name, c_first_name) from customer where c_customer_id = 'AAAAAAAAKPGBBAAA'; OK NULL Time taken: 0.055 seconds, Fetched: 1 row(s) hive> select concat(c_last_name, ', ', c_first_name) from customer where c_customer_id = 'AAAAAAAAKPGBBAAA'; OK NULL Time taken: 0.061 seconds, Fetched: 1 row(s) hive> select concat(c_last_name, ', ', c_customer_id) from customer where c_customer_id = 'AAAAAAAAKPGBBAAA'; OK Moore, AAAAAAAAKPGBBAAA Same in 'spark-sql' shell: ... 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48 select concat(c_last_name, c_first_name) from customer where c_customer_id = 'AAAAAAAAKPGBBAAA'; NULL Time taken: 0.296 seconds, Fetched 1 row(s) > TPC-DS Qury 84 returns wrong results against TPC official > --------------------------------------------------------- > > Key: SPARK-15372 > URL: https://issues.apache.org/jira/browse/SPARK-15372 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0 > Reporter: JESSE CHEN > Assignee: Herman van Hovell > Priority: Critical > Labels: SPARK-15071 > Fix For: 2.0.0 > > > The official TPC-DS query 84 returns wrong results when compared to its > official answer set. > The query itself is: > {noformat} > select c_customer_id as customer_id > ,concat(c_last_name , ', ' , c_first_name) as customername > from customer > ,customer_address > ,customer_demographics > ,household_demographics > ,income_band > ,store_returns > where ca_city = 'Edgewood' > and c_current_addr_sk = ca_address_sk > and ib_lower_bound >= 38128 > and ib_upper_bound <= 38128 + 50000 > and ib_income_band_sk = hd_income_band_sk > and cd_demo_sk = c_current_cdemo_sk > and hd_demo_sk = c_current_hdemo_sk > and sr_cdemo_sk = cd_demo_sk > order by c_customer_id > limit 100; > {noformat} > Spark 2.0 build 0517 returned the following result: > {noformat} > AAAAAAAAAIPGAAAA Carter, Rodney > AAAAAAAAAKMBBAAA Mcarthur, Emma > AAAAAAAACBNHBAAA Wells, Ron > AAAAAAAADBMEAAAA Vera, Tina > AAAAAAAADBMEAAAA Vera, Tina > AAAAAAAADHKGBAAA Scott, Pamela > AAAAAAAAEIIBBAAA Atkins, Susan > AAAAAAAAFKAHAAAA Batiste, Ernest > AAAAAAAAGHMAAAAA Mitchell, Gregory > AAAAAAAAIAODBAAA Murray, Karen > AAAAAAAAIEOKAAAA Solomon, Clyde > AAAAAAAAIIBOAAAA Owens, David > AAAAAAAAIPDCAAAA Wallace, Eric > AAAAAAAAIPIMAAAA Hayward, Benjamin > AAAAAAAAJCIKAAAA Ramos, Donald > AAAAAAAAKFJEAAAA Roberts, Yvonne > AAAAAAAAKPGBBAAA NULL <------------------------ ??? questionable row > AAAAAAAALCLABAAA Whitaker, Lettie > AAAAAAAAMGMEAAAA Sharp, Michael > AAAAAAAAMIGBBAAA Montgomery, Jesenia > AAAAAAAAMPDKAAAA Lopez, Isabel > AAAAAAAANEOMAAAA Powell, Linda > AAAAAAAANKPCAAAA Shaffer, Sergio > AAAAAAAANOCKAAAA Vargas, James > AAAAAAAAOGJEBAAA Owens, Denice > {noformat} > Official answer set (which is correct!) > {noformat} > AAAAAAAAAIPGAAAA Carter , Rodney > AAAAAAAAAKMBBAAA Mcarthur , Emma > AAAAAAAACBNHBAAA Wells , Ron > AAAAAAAADBMEAAAA Vera , Tina > AAAAAAAADBMEAAAA Vera , Tina > AAAAAAAADHKGBAAA Scott , Pamela > AAAAAAAAEIIBBAAA Atkins , Susan > AAAAAAAAFKAHAAAA Batiste , Ernest > AAAAAAAAGHMAAAAA Mitchell , Gregory > AAAAAAAAIAODBAAA Murray , Karen > AAAAAAAAIEOKAAAA Solomon , Clyde > AAAAAAAAIIBOAAAA Owens , David > AAAAAAAAIPDCAAAA Wallace , Eric > AAAAAAAAIPIMAAAA Hayward , Benjamin > AAAAAAAAJCIKAAAA Ramos , Donald > AAAAAAAAKFJEAAAA Roberts , Yvonne > AAAAAAAAKPGBBAAA Moore , > AAAAAAAALCLABAAA Whitaker , Lettie > AAAAAAAAMGMEAAAA Sharp , Michael > AAAAAAAAMIGBBAAA Montgomery , Jesenia > AAAAAAAAMPDKAAAA Lopez , Isabel > AAAAAAAANEOMAAAA Powell , Linda > AAAAAAAANKPCAAAA Shaffer , Sergio > AAAAAAAANOCKAAAA Vargas , James > AAAAAAAAOGJEBAAA Owens , Denice > {noformat} > The issue is with the "concat" function in Spark SQL (also behaves the same > in Hive). When 'concat' meets any NULL string, it returns NULL as the answer. > But is this right? When I concatenate a person's last name and first name, if > the first name is missing (empty string or NULL), I should see the last name > still, not NULL, i.e., "Smith" + "" = "Smith", not NULL. > Simplest repeatable test: > {noformat} > hive> select c_first_name, c_last_name from customer where c_customer_id = > 'AAAAAAAAKPGBBAAA'; > OK > NULL Moore > Time taken: 0.07 seconds, Fetched: 1 row(s) > hive> select concat(c_last_name, ', ', c_first_name) from customer where > c_customer_id = 'AAAAAAAAKPGBBAAA'; > OK > NULL > Time taken: 0.1 seconds, Fetched: 1 row(s) > hive> select concat(c_last_name, c_first_name) from customer where > c_customer_id = 'AAAAAAAAKPGBBAAA'; > OK > NULL > Time taken: 0.055 seconds, Fetched: 1 row(s) > hive> select concat(c_last_name, ', ', c_first_name) from customer where > c_customer_id = 'AAAAAAAAKPGBBAAA'; > OK > NULL > Time taken: 0.061 seconds, Fetched: 1 row(s) > hive> select concat(c_last_name, ', ', c_customer_id) from customer where > c_customer_id = 'AAAAAAAAKPGBBAAA'; > OK > Moore, AAAAAAAAKPGBBAAA > Same in 'spark-sql' shell: > ... > 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45 > 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46 > 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47 > 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48 > select concat(c_last_name, c_first_name) from customer where c_customer_id = > 'AAAAAAAAKPGBBAAA'; > NULL > Time taken: 0.296 seconds, Fetched 1 row(s) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org