[ 
https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200177#comment-15200177
 ] 

JESSE CHEN commented on SPARK-13859:
------------------------------------

Tested both q87 and q38 on the lab's cluster. 

With this modification (i.e., null-safe equals), both q87 and q38 returned 
correct results (per TPC) on both text and parquet.
Without this modification, both queries returned the wrong results.
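
For readers unfamiliar with null-safe equals, here is a minimal sketch of the 
difference, not the actual test run. It assumes SQLite (via Python's sqlite3 
module) as a stand-in for Spark SQL: SQLite's IS operator plays the role of 
Spark's "<=>" operator. The tables store_names and catalog_names and the helper 
count_matches are hypothetical; only the column names come from the query.

```python
import sqlite3


def count_matches(null_safe: bool) -> int:
    """Count joined rows using plain "=" or a null-safe comparison."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical two-row tables; one customer has a NULL first name.
    conn.executescript(
        """
        CREATE TABLE store_names (c_last_name TEXT, c_first_name TEXT);
        CREATE TABLE catalog_names (c_last_name TEXT, c_first_name TEXT);
        INSERT INTO store_names VALUES ('Smith', 'Ann'), ('Jones', NULL);
        INSERT INTO catalog_names VALUES ('Smith', 'Ann'), ('Jones', NULL);
        """
    )
    # Assumption: SQLite's IS stands in for Spark SQL's <=> operator.
    op = "IS" if null_safe else "="
    sql = f"""
        SELECT count(*)
        FROM store_names s
        JOIN catalog_names c
          ON s.c_last_name {op} c.c_last_name
         AND s.c_first_name {op} c.c_first_name
    """
    (n,) = conn.execute(sql).fetchone()
    conn.close()
    return n


print(count_matches(null_safe=False))  # 1: NULL = NULL is not true, row dropped
print(count_matches(null_safe=True))   # 2: NULLs compare as equal, row kept
```

With plain "=", the ('Jones', NULL) pair never joins because NULL = NULL 
evaluates to NULL rather than true; the null-safe form keeps it.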

Per TPC rules on vendor-specific syntax:

4.2.3.4 The following query modifications are minor: 
c) Operators
2. Relational operators - Relational operators used in queries such as "<", 
">", "<>", "<=", and "=", may be replaced by equivalent vendor-specific 
operators, for example ".LT.", ".GT.", "!=" or "^=", ".LE.", and "==", 
respectively. 

This proposed modification, however, seems to fall outside the allowed 
modifications because it is a workaround for an issue where 
"Spark does not deal with nulls correctly under certain conditions."  If you 
look at the other TPC queries (72 of which 
returned correct results), this type of equals is used all over. 

So there is an inherently unsafe null operation in Spark that is **not related** 
to a) a wrong table definition, b) wrong query
syntax, or c) the file format. Spark should handle this "=" correctly and automatically.

These two queries provide excellent test cases for finding that bug and fixing 
it.

Jesse 









> TPCDS query 38 returns wrong results compared to TPC official result set 
> -------------------------------------------------------------------------
>
>                 Key: SPARK-13859
>                 URL: https://issues.apache.org/jira/browse/SPARK-13859
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: JESSE CHEN
>              Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 38 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 0, answer set reports 107.
> Actual results:
> {noformat}
> [0]
> {noformat}
> Expected:
> {noformat}
> +-----+
> |   1 |
> +-----+
> | 107 |
> +-----+
> {noformat}
> query used:
> {noformat}
> -- start query 38 in stream 0 using template query38.tpl and seed QUALIFICATION
>  select  count(*) from (
>     select distinct c_last_name, c_first_name, d_date
>     from store_sales
>          JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
>          JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>     where d_month_seq between 1200 and 1200 + 11) tmp1
>   JOIN
>     (select distinct c_last_name, c_first_name, d_date
>     from catalog_sales
>          JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
>          JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>     where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = 
> tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and 
> (tmp1.d_date = tmp2.d_date) 
>   JOIN
>     (
>     select distinct c_last_name, c_first_name, d_date
>     from web_sales
>          JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
>          JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>     where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = 
> tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and 
> (tmp1.d_date = tmp3.d_date) 
>   limit 100
>  ;
> -- end query 38 in stream 0 using template query38.tpl
> {noformat}


