[ https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288004#comment-15288004 ]
Frederick Reiss commented on SPARK-15372: ----------------------------------------- There is no CONCAT function in the SQL standard. The concatenation operator in the standard is ||. DB2 treats "CONCAT" as a synonym for "||"; other vendors do different things. The standard states that if either input to the concatenation operator is NULL, then the output is NULL (Section 6.28 <string value expression> in the SQL/Foundations document). The TPC-DS benchmark spec doesn't say anything about the allowable semantics of CONCAT, only that "vendor specific syntax can be used". > TPC-DS Qury 84 returns wrong results against TPC official > --------------------------------------------------------- > > Key: SPARK-15372 > URL: https://issues.apache.org/jira/browse/SPARK-15372 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0 > Reporter: JESSE CHEN > Assignee: Herman van Hovell > Priority: Critical > Labels: SPARK-15071 > Fix For: 2.0.0 > > > The official TPC-DS query 84 returns wrong results when compared to its > official answer set. > The query itself is: > {noformat} > select c_customer_id as customer_id > ,concat(c_last_name , ', ' , c_first_name) as customername > from customer > ,customer_address > ,customer_demographics > ,household_demographics > ,income_band > ,store_returns > where ca_city = 'Edgewood' > and c_current_addr_sk = ca_address_sk > and ib_lower_bound >= 38128 > and ib_upper_bound <= 38128 + 50000 > and ib_income_band_sk = hd_income_band_sk > and cd_demo_sk = c_current_cdemo_sk > and hd_demo_sk = c_current_hdemo_sk > and sr_cdemo_sk = cd_demo_sk > order by c_customer_id > limit 100; > {noformat} > Spark 2.0 build 0517 returned the following result: > {noformat} > AAAAAAAAAIPGAAAA Carter, Rodney > AAAAAAAAAKMBBAAA Mcarthur, Emma > AAAAAAAACBNHBAAA Wells, Ron > AAAAAAAADBMEAAAA Vera, Tina > AAAAAAAADBMEAAAA Vera, Tina > AAAAAAAADHKGBAAA Scott, Pamela > AAAAAAAAEIIBBAAA Atkins, Susan > AAAAAAAAFKAHAAAA Batiste, Ernest > AAAAAAAAGHMAAAAA Mitchell, Gregory > AAAAAAAAIAODBAAA Murray, Karen > AAAAAAAAIEOKAAAA Solomon, Clyde > AAAAAAAAIIBOAAAA Owens, David > AAAAAAAAIPDCAAAA Wallace, Eric > AAAAAAAAIPIMAAAA Hayward, Benjamin > AAAAAAAAJCIKAAAA Ramos, Donald > AAAAAAAAKFJEAAAA Roberts, Yvonne > AAAAAAAAKPGBBAAA NULL <------------------------ ??? questionable row > AAAAAAAALCLABAAA Whitaker, Lettie > AAAAAAAAMGMEAAAA Sharp, Michael > AAAAAAAAMIGBBAAA Montgomery, Jesenia > AAAAAAAAMPDKAAAA Lopez, Isabel > AAAAAAAANEOMAAAA Powell, Linda > AAAAAAAANKPCAAAA Shaffer, Sergio > AAAAAAAANOCKAAAA Vargas, James > AAAAAAAAOGJEBAAA Owens, Denice > {noformat} > Official answer set (which is correct!) > {noformat} > AAAAAAAAAIPGAAAA Carter , Rodney > AAAAAAAAAKMBBAAA Mcarthur , Emma > AAAAAAAACBNHBAAA Wells , Ron > AAAAAAAADBMEAAAA Vera , Tina > AAAAAAAADBMEAAAA Vera , Tina > AAAAAAAADHKGBAAA Scott , Pamela > AAAAAAAAEIIBBAAA Atkins , Susan > AAAAAAAAFKAHAAAA Batiste , Ernest > AAAAAAAAGHMAAAAA Mitchell , Gregory > AAAAAAAAIAODBAAA Murray , Karen > AAAAAAAAIEOKAAAA Solomon , Clyde > AAAAAAAAIIBOAAAA Owens , David > AAAAAAAAIPDCAAAA Wallace , Eric > AAAAAAAAIPIMAAAA Hayward , Benjamin > AAAAAAAAJCIKAAAA Ramos , Donald > AAAAAAAAKFJEAAAA Roberts , Yvonne > AAAAAAAAKPGBBAAA Moore , > AAAAAAAALCLABAAA Whitaker , Lettie > AAAAAAAAMGMEAAAA Sharp , Michael > AAAAAAAAMIGBBAAA Montgomery , Jesenia > AAAAAAAAMPDKAAAA Lopez , Isabel > AAAAAAAANEOMAAAA Powell , Linda > AAAAAAAANKPCAAAA Shaffer , Sergio > AAAAAAAANOCKAAAA Vargas , James > AAAAAAAAOGJEBAAA Owens , Denice > {noformat} > The issue is with the "concat" function in Spark SQL (also behaves the same > in Hive). When 'concat' meets any NULL string, it returns NULL as the answer. > But is this right? When I concatenate a person's last name and first name, if > the first name is missing (empty string or NULL), I should see the last name > still, not NULL, i.e., "Smith" + "" = "Smith", not NULL. > Simplest repeatable test: > {noformat} > hive> select c_first_name, c_last_name from customer where c_customer_id = > 'AAAAAAAAKPGBBAAA'; > OK > NULL Moore > Time taken: 0.07 seconds, Fetched: 1 row(s) > hive> select concat(c_last_name, ', ', c_first_name) from customer where > c_customer_id = 'AAAAAAAAKPGBBAAA'; > OK > NULL > Time taken: 0.1 seconds, Fetched: 1 row(s) > hive> select concat(c_last_name, c_first_name) from customer where > c_customer_id = 'AAAAAAAAKPGBBAAA'; > OK > NULL > Time taken: 0.055 seconds, Fetched: 1 row(s) > hive> select concat(c_last_name, ', ', c_first_name) from customer where > c_customer_id = 'AAAAAAAAKPGBBAAA'; > OK > NULL > Time taken: 0.061 seconds, Fetched: 1 row(s) > hive> select concat(c_last_name, ', ', c_customer_id) from customer where > c_customer_id = 'AAAAAAAAKPGBBAAA'; > OK > Moore, AAAAAAAAKPGBBAAA > Same in 'spark-sql' shell: > ... > 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45 > 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46 > 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47 > 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48 > select concat(c_last_name, c_first_name) from customer where c_customer_id = > 'AAAAAAAAKPGBBAAA'; > NULL > Time taken: 0.296 seconds, Fetched 1 row(s) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org