[ https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200465#comment-15200465 ]
Dilip Biswal commented on SPARK-13859:
--------------------------------------

Hello,

I just checked the original spec for this query on the TPC-DS website. Here is the template for Q38:

{code}
[_LIMITA] select [_LIMITB] count(*) from (
    select distinct c_last_name, c_first_name, d_date
    from store_sales, date_dim, customer
    where store_sales.ss_sold_date_sk = date_dim.d_date_sk
      and store_sales.ss_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from catalog_sales, date_dim, customer
    where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
      and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from web_sales, date_dim, customer
    where web_sales.ws_sold_date_sk = date_dim.d_date_sk
      and web_sales.ws_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
) hot_cust
[_LIMITC];
{code}

The query in the spec uses the intersect operator, for which the implicitly generated join conditions use null-safe comparison. In other words, if we had run the query exactly as written in the spec, it would have worked. However, the query in this JIRA has user-supplied join conditions and uses "=", which never matches a NULL against another NULL. As far as I know, the semantics of the equality operator are well defined in SQL, so I don't think this is a Spark SQL issue.

[~rxin] [~marmbrus] Please let us know your thoughts.

> TPCDS query 38 returns wrong results compared to TPC official result set
> -------------------------------------------------------------------------
>
>                 Key: SPARK-13859
>                 URL: https://issues.apache.org/jira/browse/SPARK-13859
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: JESSE CHEN
>              Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 38 returns wrong results compared
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 0, answer set reports 107.
> Actual results:
> {noformat}
> [0]
> {noformat}
> Expected:
> {noformat}
> +-----+
> |   1 |
> +-----+
> | 107 |
> +-----+
> {noformat}
> query used:
> {noformat}
> -- start query 38 in stream 0 using template query38.tpl and seed QUALIFICATION
> select count(*) from (
>   select distinct c_last_name, c_first_name, d_date
>   from store_sales
>   JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
>   JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>   where d_month_seq between 1200 and 1200 + 11) tmp1
> JOIN
>   (select distinct c_last_name, c_first_name, d_date
>   from catalog_sales
>   JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
>   JOIN customer ON catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
>   where d_month_seq between 1200 and 1200 + 11) tmp2
>   ON (tmp1.c_last_name = tmp2.c_last_name)
>   and (tmp1.c_first_name = tmp2.c_first_name)
>   and (tmp1.d_date = tmp2.d_date)
> JOIN
>   (
>   select distinct c_last_name, c_first_name, d_date
>   from web_sales
>   JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
>   JOIN customer ON web_sales.ws_bill_customer_sk = customer.c_customer_sk
>   where d_month_seq between 1200 and 1200 + 11) tmp3
>   ON (tmp1.c_last_name = tmp3.c_last_name)
>   and (tmp1.c_first_name = tmp3.c_first_name)
>   and (tmp1.d_date = tmp3.d_date)
> limit 100
> ;
> -- end query 38 in stream 0 using template query38.tpl
> {noformat}
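
The difference between intersect and an "=" join that the comment describes can be reproduced on a toy data set. The following is only a minimal sketch under stated assumptions: it runs on SQLite through Python's `sqlite3` module rather than on Spark, the tables `store_cust` and `web_cust` and their rows are hypothetical, and SQLite's `IS` operator stands in for a null-safe comparison (in Spark SQL the null-safe equality operator is `<=>`).

```python
import sqlite3

# Hypothetical toy data (not TPC-DS): the customer "Jones" has a NULL
# first name and appears in both tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_cust (c_last_name TEXT, c_first_name TEXT);
    CREATE TABLE web_cust   (c_last_name TEXT, c_first_name TEXT);
    INSERT INTO store_cust VALUES ('Smith', 'Ann'), ('Jones', NULL);
    INSERT INTO web_cust   VALUES ('Smith', 'Ann'), ('Jones', NULL);
""")


def count(sql):
    """Run a query that returns a single count and return it as an int."""
    return conn.execute(sql).fetchone()[0]


# INTERSECT treats two NULLs as the same value, so the 'Jones' row is kept.
intersect_count = count("""
    SELECT count(*) FROM (
        SELECT c_last_name, c_first_name FROM store_cust
        INTERSECT
        SELECT c_last_name, c_first_name FROM web_cust
    )
""")

# A join on "=" drops the 'Jones' row: NULL = NULL does not evaluate to true.
equals_join_count = count("""
    SELECT count(*) FROM store_cust s JOIN web_cust w
      ON s.c_last_name = w.c_last_name
     AND s.c_first_name = w.c_first_name
""")

# A null-safe comparison (IS in SQLite) matches NULL with NULL again.
null_safe_join_count = count("""
    SELECT count(*) FROM store_cust s JOIN web_cust w
      ON s.c_last_name IS w.c_last_name
     AND s.c_first_name IS w.c_first_name
""")

print(intersect_count, equals_join_count, null_safe_join_count)  # 2 1 2
```

If the same effect holds for the 1GB validation data, rows with a NULL c_last_name, c_first_name, or d_date would be lost by the "=" joins in the reported query but kept by the intersect form in the spec.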