[ 
https://issues.apache.org/jira/browse/SPARK-36926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36926:
---------------------------------
    Priority: Critical  (was: Major)

> Discrepancy in Q22 of TPCH for Spark 3.2
> ----------------------------------------
>
>                 Key: SPARK-36926
>                 URL: https://issues.apache.org/jira/browse/SPARK-36926
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Aravind Patnam
>            Priority: Critical
>
> When running TPCH scale 100 against 3.2, Query 22 has a discrepancy in the 
> number of rows returned by the query. This was tested with both AQE on and 
> off. All the other queries were matching in results. Below is the results 
> that we got when testing Q22 on 3.2: 
>  
> {code:java}
>   "results": [
>     {
>       "name": "Q22",
>       "mode": "collect",
>       "parameters": {},
>       "joinTypes": [
>         "SortMergeJoin"
>       ],
>       "tables": [
>         "customer"
>       ],
>       "parsingTime": 0.016522,
>       "analysisTime": 0.004132,
>       "optimizationTime": 39.173868,
>       "planningTime": 23.10939,
>       "executionTime": 13762.183844,
>       "result": 0,
>       "breakDown": [],
>       "queryExecution": "== Parsed Logical Plan ==\n'Sort ['cntrycode ASC 
> NULLS FIRST], true\n+- 'Aggregate ['cntrycode], ['cntrycode, 'count(1) AS 
> numcust#150, 'sum('c_acctbal) AS totacctbal#151]\n   +- 'SubqueryAlias 
> custsale\n      +- 'Project ['substring('c_phone, 1, 2) AS cntrycode#147, 
> 'c_acctbal]\n         +- 'Filter (('substring('c_phone, 1, 2) IN 
> (13,31,23,29,30,18,17) AND ('c_acctbal > scalar-subquery#148 [])) AND NOT 
> exists#149 [])\n            :  :- 'Project [unresolvedalias('avg('c_acctbal), 
> None)]\n            :  :  +- 'Filter (('c_acctbal > 0.00) AND 
> 'substring('c_phone, 1, 2) IN (13,31,23,29,30,18,17))\n            :  :     
> +- 'UnresolvedRelation [customer], [], false\n            :  +- 'Project 
> [*]\n            :     +- 'Filter ('o_custkey = 'c_custkey)\n            :    
>     +- 'UnresolvedRelation [orders], [], false\n            +- 
> 'UnresolvedRelation [customer], [], false\n\n== Analyzed Logical Plan 
> ==\ncntrycode: string, numcust: bigint, totacctbal: decimal(22,2)\nSort 
> [cntrycode#147 ASC NULLS FIRST], true\n+- Aggregate [cntrycode#147], 
> [cntrycode#147, count(1) AS numcust#150L, sum(c_acctbal#11) AS 
> totacctbal#151]\n   +- SubqueryAlias custsale\n      +- Project 
> [substring(c_phone#10, 1, 2) AS cntrycode#147, c_acctbal#11]\n         +- 
> Filter ((substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17) AND 
> (cast(c_acctbal#11 as decimal(16,6)) > cast(scalar-subquery#148 [] as 
> decimal(16,6)))) AND NOT exists#149 [c_custkey#6L])\n            :  :- 
> Aggregate [avg(c_acctbal#160) AS avg(c_acctbal)#154]\n            :  :  +- 
> Filter ((cast(c_acctbal#160 as decimal(12,2)) > cast(0.00 as decimal(12,2))) 
> AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17))\n            :  : 
>     +- SubqueryAlias spark_catalog.tpch_data_orc_100.customer\n            :  
> :        +- Relation 
> tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162]
>  orc\n            :  +- Project [o_orderkey#16L, o_custkey#17L, 
> o_orderstatus#18, o_totalprice#19, o_orderpriority#20, o_clerk#21, 
> o_shippriority#22, o_comment#23, o_orderdate#24]\n            :     +- Filter 
> (o_custkey#17L = outer(c_custkey#6L))\n            :        +- SubqueryAlias 
> spark_catalog.tpch_data_orc_100.orders\n            :           +- Relation 
> tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24]
>  orc\n            +- SubqueryAlias spark_catalog.tpch_data_orc_100.customer\n 
>               +- Relation 
> tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13]
>  orc\n\n== Optimized Logical Plan ==\nSort [cntrycode#147 ASC NULLS FIRST], 
> true\n+- Aggregate [cntrycode#147], [cntrycode#147, count(1) AS numcust#150L, 
> sum(c_acctbal#11) AS totacctbal#151]\n   +- Project [substring(c_phone#10, 1, 
> 2) AS cntrycode#147, c_acctbal#11]\n      +- Join LeftAnti, (o_custkey#17L = 
> c_custkey#6L)\n         :- Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n 
>         :  +- Filter ((isnotnull(c_acctbal#11) AND substring(c_phone#10, 1, 
> 2) IN (13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 as decimal(16,6)) > 
> scalar-subquery#148 []))\n         :     :  +- Aggregate [avg(c_acctbal#160) 
> AS avg(c_acctbal)#154]\n         :     :     +- Project [c_acctbal#160]\n     
>     :     :        +- Filter (isnotnull(c_acctbal#160) AND ((c_acctbal#160 > 
> 0.00) AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17)))\n         
> :     :           +- Relation 
> tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162]
>  orc\n         :     +- Relation 
> tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13]
>  orc\n         +- Project [o_custkey#17L]\n            +- Relation 
> tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24]
>  orc\n\n== Physical Plan ==\n*(7) Sort [cntrycode#147 ASC NULLS FIRST], true, 
> 0\n+- Exchange rangepartitioning(cntrycode#147 ASC NULLS FIRST, 200), 
> ENSURE_REQUIREMENTS, [id=#314]\n   +- *(6) 
> HashAggregate(keys=[cntrycode#147], functions=[count(1), sum(c_acctbal#11)], 
> output=[cntrycode#147, numcust#150L, totacctbal#151])\n      +- Exchange 
> hashpartitioning(cntrycode#147, 200), ENSURE_REQUIREMENTS, [id=#310]\n        
>  +- *(5) HashAggregate(keys=[cntrycode#147], functions=[partial_count(1), 
> partial_sum(c_acctbal#11)], output=[cntrycode#147, count#170L, sum#171, 
> isEmpty#172])\n            +- *(5) Project [substring(c_phone#10, 1, 2) AS 
> cntrycode#147, c_acctbal#11]\n               +- *(5) SortMergeJoin 
> [c_custkey#6L], [o_custkey#17L], LeftAnti\n                  :- *(2) Sort 
> [c_custkey#6L ASC NULLS FIRST], false, 0\n                  :  +- Exchange 
> hashpartitioning(c_custkey#6L, 200), ENSURE_REQUIREMENTS, [id=#292]\n         
>          :     +- *(1) Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n     
>              :        +- *(1) Filter ((isnotnull(c_acctbal#11) AND 
> substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 
> as decimal(16,6)) > Subquery scalar-subquery#148, [id=#249]))\n               
>    :           :  +- Subquery scalar-subquery#148, [id=#249]\n                
>   :           :     +- *(2) HashAggregate(keys=[], 
> functions=[avg(c_acctbal#160)], output=[avg(c_acctbal)#154])\n                
>   :           :        +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [id=#245]\n                  :           :           +- *(1) 
> HashAggregate(keys=[], functions=[partial_avg(c_acctbal#160)], 
> output=[sum#175, count#176L])\n                  :           :              
> +- *(1) Project [c_acctbal#160]\n                  :           :              
>    +- *(1) Filter ((isnotnull(c_acctbal#160) AND (c_acctbal#160 > 0.00)) AND 
> substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17))\n                  :  
>          :                    +- *(1) ColumnarToRow\n                  :      
>      :                       +- FileScan orc 
> tpch_data_orc_100.customer[c_phone#159,c_acctbal#160,c_mktsegment#162] 
> Batched: true, DataFilters: [isnotnull(c_acctbal#160), (c_acctbal#160 > 
> 0.00), substring(c_phone#159, 1, 2) IN (13,31,23,29,3..., Format: ORC, 
> Location: CatalogFileIndex(0 paths)[], PartitionFilters: [], PushedFilters: 
> [IsNotNull(c_acctbal), GreaterThan(c_acctbal,0.00)], ReadSchema: 
> struct<c_phone:string,c_acctbal:decimal(12,2)>\n                  :           
> +- *(1) ColumnarToRow\n                  :              +- FileScan orc 
> tpch_data_orc_100.customer[c_custkey#6L,c_phone#10,c_acctbal#11,c_mktsegment#13]
>  Batched: true, DataFilters: [isnotnull(c_acctbal#11), substring(c_phone#10, 
> 1, 2) IN (13,31,23,29,30,18,17)], Format: ORC, Location: CatalogFileIndex(0 
> paths)[], PartitionFilters: [], PushedFilters: [IsNotNull(c_acctbal)], 
> ReadSchema: struct<c_custkey:bigint,c_phone:string,c_acctbal:decimal(12,2)>\n 
>                  +- *(4) Sort [o_custkey#17L ASC NULLS FIRST], false, 0\n     
>                 +- Exchange hashpartitioning(o_custkey#17L, 200), 
> ENSURE_REQUIREMENTS, [id=#301]\n                        +- *(3) Project 
> [o_custkey#17L]\n                           +- *(3) ColumnarToRow\n           
>                    +- FileScan orc 
> tpch_data_orc_100.orders[o_custkey#17L,o_orderdate#24] Batched: true, 
> DataFilters: [], Format: ORC, Location: CatalogFileIndex(0 paths)[], 
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<o_custkey:bigint>\n"
>     }
>   ]
> }
> {code}
> The correct result should be 7 rows returned, not 0 as shown above. This can 
> be confirmed by testing the same exact query against the hive table using 
> Presto/Hive, which both return 7. Additionally, this link also shows that it 
> should be 7 
> [https://github.com/apache/impala/blob/master/testdata/workloads/tpch/queries/tpch-q22.test.]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to