Aravind Patnam created SPARK-36926:
--------------------------------------

             Summary: Discrepancy in Q22 of TPCH for Spark 3.2
                 Key: SPARK-36926
                 URL: https://issues.apache.org/jira/browse/SPARK-36926
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Aravind Patnam


When running TPC-H at scale factor 100 against Spark 3.2, Query 22 returns the
wrong number of rows. This was tested with AQE both on and off (see the sketch
after the plan dump below); all of the other queries produced matching results.
Below are the results we got when testing Q22 on 3.2:

{code:json}
  "results": [
    {
      "name": "Q22",
      "mode": "collect",
      "parameters": {},
      "joinTypes": [
        "SortMergeJoin"
      ],
      "tables": [
        "customer"
      ],
      "parsingTime": 0.016522,
      "analysisTime": 0.004132,
      "optimizationTime": 39.173868,
      "planningTime": 23.10939,
      "executionTime": 13762.183844,
      "result": 0,
      "breakDown": [],
      "queryExecution": "== Parsed Logical Plan ==\n'Sort ['cntrycode ASC NULLS 
FIRST], true\n+- 'Aggregate ['cntrycode], ['cntrycode, 'count(1) AS 
numcust#150, 'sum('c_acctbal) AS totacctbal#151]\n   +- 'SubqueryAlias 
custsale\n      +- 'Project ['substring('c_phone, 1, 2) AS cntrycode#147, 
'c_acctbal]\n         +- 'Filter (('substring('c_phone, 1, 2) IN 
(13,31,23,29,30,18,17) AND ('c_acctbal > scalar-subquery#148 [])) AND NOT 
exists#149 [])\n            :  :- 'Project [unresolvedalias('avg('c_acctbal), 
None)]\n            :  :  +- 'Filter (('c_acctbal > 0.00) AND 
'substring('c_phone, 1, 2) IN (13,31,23,29,30,18,17))\n            :  :     +- 
'UnresolvedRelation [customer], [], false\n            :  +- 'Project [*]\n     
       :     +- 'Filter ('o_custkey = 'c_custkey)\n            :        +- 
'UnresolvedRelation [orders], [], false\n            +- 'UnresolvedRelation 
[customer], [], false\n\n== Analyzed Logical Plan ==\ncntrycode: string, 
numcust: bigint, totacctbal: decimal(22,2)\nSort [cntrycode#147 ASC NULLS 
FIRST], true\n+- Aggregate [cntrycode#147], [cntrycode#147, count(1) AS 
numcust#150L, sum(c_acctbal#11) AS totacctbal#151]\n   +- SubqueryAlias 
custsale\n      +- Project [substring(c_phone#10, 1, 2) AS cntrycode#147, 
c_acctbal#11]\n         +- Filter ((substring(c_phone#10, 1, 2) IN 
(13,31,23,29,30,18,17) AND (cast(c_acctbal#11 as decimal(16,6)) > 
cast(scalar-subquery#148 [] as decimal(16,6)))) AND NOT exists#149 
[c_custkey#6L])\n            :  :- Aggregate [avg(c_acctbal#160) AS 
avg(c_acctbal)#154]\n            :  :  +- Filter ((cast(c_acctbal#160 as 
decimal(12,2)) > cast(0.00 as decimal(12,2))) AND substring(c_phone#159, 1, 2) 
IN (13,31,23,29,30,18,17))\n            :  :     +- SubqueryAlias 
spark_catalog.tpch_data_orc_100.customer\n            :  :        +- Relation 
tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162]
 orc\n            :  +- Project [o_orderkey#16L, o_custkey#17L, 
o_orderstatus#18, o_totalprice#19, o_orderpriority#20, o_clerk#21, 
o_shippriority#22, o_comment#23, o_orderdate#24]\n            :     +- Filter 
(o_custkey#17L = outer(c_custkey#6L))\n            :        +- SubqueryAlias 
spark_catalog.tpch_data_orc_100.orders\n            :           +- Relation 
tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24]
 orc\n            +- SubqueryAlias spark_catalog.tpch_data_orc_100.customer\n   
            +- Relation 
tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13]
 orc\n\n== Optimized Logical Plan ==\nSort [cntrycode#147 ASC NULLS FIRST], 
true\n+- Aggregate [cntrycode#147], [cntrycode#147, count(1) AS numcust#150L, 
sum(c_acctbal#11) AS totacctbal#151]\n   +- Project [substring(c_phone#10, 1, 
2) AS cntrycode#147, c_acctbal#11]\n      +- Join LeftAnti, (o_custkey#17L = 
c_custkey#6L)\n         :- Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n   
      :  +- Filter ((isnotnull(c_acctbal#11) AND substring(c_phone#10, 1, 2) IN 
(13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 as decimal(16,6)) > 
scalar-subquery#148 []))\n         :     :  +- Aggregate [avg(c_acctbal#160) AS 
avg(c_acctbal)#154]\n         :     :     +- Project [c_acctbal#160]\n         
:     :        +- Filter (isnotnull(c_acctbal#160) AND ((c_acctbal#160 > 0.00) 
AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17)))\n         :     :  
         +- Relation 
tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162]
 orc\n         :     +- Relation 
tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13]
 orc\n         +- Project [o_custkey#17L]\n            +- Relation 
tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24]
 orc\n\n== Physical Plan ==\n*(7) Sort [cntrycode#147 ASC NULLS FIRST], true, 
0\n+- Exchange rangepartitioning(cntrycode#147 ASC NULLS FIRST, 200), 
ENSURE_REQUIREMENTS, [id=#314]\n   +- *(6) HashAggregate(keys=[cntrycode#147], 
functions=[count(1), sum(c_acctbal#11)], output=[cntrycode#147, numcust#150L, 
totacctbal#151])\n      +- Exchange hashpartitioning(cntrycode#147, 200), 
ENSURE_REQUIREMENTS, [id=#310]\n         +- *(5) 
HashAggregate(keys=[cntrycode#147], functions=[partial_count(1), 
partial_sum(c_acctbal#11)], output=[cntrycode#147, count#170L, sum#171, 
isEmpty#172])\n            +- *(5) Project [substring(c_phone#10, 1, 2) AS 
cntrycode#147, c_acctbal#11]\n               +- *(5) SortMergeJoin 
[c_custkey#6L], [o_custkey#17L], LeftAnti\n                  :- *(2) Sort 
[c_custkey#6L ASC NULLS FIRST], false, 0\n                  :  +- Exchange 
hashpartitioning(c_custkey#6L, 200), ENSURE_REQUIREMENTS, [id=#292]\n           
       :     +- *(1) Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n         
         :        +- *(1) Filter ((isnotnull(c_acctbal#11) AND 
substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 
as decimal(16,6)) > Subquery scalar-subquery#148, [id=#249]))\n                 
 :           :  +- Subquery scalar-subquery#148, [id=#249]\n                  : 
          :     +- *(2) HashAggregate(keys=[], functions=[avg(c_acctbal#160)], 
output=[avg(c_acctbal)#154])\n                  :           :        +- 
Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#245]\n                  :   
        :           +- *(1) HashAggregate(keys=[], 
functions=[partial_avg(c_acctbal#160)], output=[sum#175, count#176L])\n         
         :           :              +- *(1) Project [c_acctbal#160]\n           
       :           :                 +- *(1) Filter ((isnotnull(c_acctbal#160) 
AND (c_acctbal#160 > 0.00)) AND substring(c_phone#159, 1, 2) IN 
(13,31,23,29,30,18,17))\n                  :           :                    +- 
*(1) ColumnarToRow\n                  :           :                       +- 
FileScan orc 
tpch_data_orc_100.customer[c_phone#159,c_acctbal#160,c_mktsegment#162] Batched: 
true, DataFilters: [isnotnull(c_acctbal#160), (c_acctbal#160 > 0.00), 
substring(c_phone#159, 1, 2) IN (13,31,23,29,3..., Format: ORC, Location: 
CatalogFileIndex(0 paths)[], PartitionFilters: [], PushedFilters: 
[IsNotNull(c_acctbal), GreaterThan(c_acctbal,0.00)], ReadSchema: 
struct<c_phone:string,c_acctbal:decimal(12,2)>\n                  :           
+- *(1) ColumnarToRow\n                  :              +- FileScan orc 
tpch_data_orc_100.customer[c_custkey#6L,c_phone#10,c_acctbal#11,c_mktsegment#13]
 Batched: true, DataFilters: [isnotnull(c_acctbal#11), substring(c_phone#10, 1, 
2) IN (13,31,23,29,30,18,17)], Format: ORC, Location: CatalogFileIndex(0 
paths)[], PartitionFilters: [], PushedFilters: [IsNotNull(c_acctbal)], 
ReadSchema: struct<c_custkey:bigint,c_phone:string,c_acctbal:decimal(12,2)>\n   
               +- *(4) Sort [o_custkey#17L ASC NULLS FIRST], false, 0\n         
            +- Exchange hashpartitioning(o_custkey#17L, 200), 
ENSURE_REQUIREMENTS, [id=#301]\n                        +- *(3) Project 
[o_custkey#17L]\n                           +- *(3) ColumnarToRow\n             
                 +- FileScan orc 
tpch_data_orc_100.orders[o_custkey#17L,o_orderdate#24] Batched: true, 
DataFilters: [], Format: ORC, Location: CatalogFileIndex(0 paths)[], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<o_custkey:bigint>\n"
    }
  ]
}
{code}
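
For clarity, here is a minimal sketch of the two test runs, assuming the
standard spark.sql.adaptive.enabled session config as the AQE toggle and the
tpch_data_orc_100 database named in the plans above (not the exact harness
commands we used):

{code:sql}
-- Sketch only: toggle AQE via the standard session config and re-run Q22
-- (the full query text is quoted at the end of this report).
USE tpch_data_orc_100;

-- Run 1: AQE enabled
SET spark.sql.adaptive.enabled=true;
-- ... run Q22 ...

-- Run 2: AQE disabled
SET spark.sql.adaptive.enabled=false;
-- ... run Q22 again; both runs return 0 rows on 3.2.0 ...
{code}
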
The correct result is 7 rows, not the 0 shown above. This can be confirmed by
running the exact same query against the same Hive table with Presto and with
Hive, both of which return 7 rows. The Impala TPC-H test file also shows that
the expected result is 7:
[https://github.com/apache/impala/blob/master/testdata/workloads/tpch/queries/tpch-q22.test].
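
For reference, the query under test as reconstructed from the parsed plan
above; this is the standard TPC-H Q22 text with the substitution values
visible in the plan (the literal quoting in the generated query may differ):

{code:sql}
-- TPC-H Q22, reconstructed from the parsed logical plan above.
select
    cntrycode,
    count(*) as numcust,
    sum(c_acctbal) as totacctbal
from (
    select
        substring(c_phone, 1, 2) as cntrycode,
        c_acctbal
    from customer
    where substring(c_phone, 1, 2) in ('13', '31', '23', '29', '30', '18', '17')
      and c_acctbal > (
          select avg(c_acctbal)
          from customer
          where c_acctbal > 0.00
            and substring(c_phone, 1, 2) in ('13', '31', '23', '29', '30', '18', '17')
      )
      and not exists (
          select *
          from orders
          where o_custkey = c_custkey
      )
) as custsale
group by cntrycode
order by cntrycode;
{code}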