[ 
https://issues.apache.org/jira/browse/SPARK-36926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkata krishnan Sowrirajan updated SPARK-36926:
------------------------------------------------
    Description: 
When running TPCH scale 100 against 3.2, Query 22 has a discrepancy in the 
number of rows returned by the query. This was tested with both AQE on and off. 
All the other queries were matching in results. Below is the results that we 
got when testing Q22 on 3.2: 

 
{code:java}
  "results": [
    {
      "name": "Q22",
      "mode": "collect",
      "parameters": {},
      "joinTypes": [
        "SortMergeJoin"
      ],
      "tables": [
        "customer"
      ],
      "parsingTime": 0.016522,
      "analysisTime": 0.004132,
      "optimizationTime": 39.173868,
      "planningTime": 23.10939,
      "executionTime": 13762.183844,
      "result": 0,
      "breakDown": [],
      "queryExecution": "== Parsed Logical Plan ==\n'Sort ['cntrycode ASC NULLS 
FIRST], true\n+- 'Aggregate ['cntrycode], ['cntrycode, 'count(1) AS 
numcust#150, 'sum('c_acctbal) AS totacctbal#151]\n   +- 'SubqueryAlias 
custsale\n      +- 'Project ['substring('c_phone, 1, 2) AS cntrycode#147, 
'c_acctbal]\n         +- 'Filter (('substring('c_phone, 1, 2) IN 
(13,31,23,29,30,18,17) AND ('c_acctbal > scalar-subquery#148 [])) AND NOT 
exists#149 [])\n            :  :- 'Project [unresolvedalias('avg('c_acctbal), 
None)]\n            :  :  +- 'Filter (('c_acctbal > 0.00) AND 
'substring('c_phone, 1, 2) IN (13,31,23,29,30,18,17))\n            :  :     +- 
'UnresolvedRelation [customer], [], false\n            :  +- 'Project [*]\n     
       :     +- 'Filter ('o_custkey = 'c_custkey)\n            :        +- 
'UnresolvedRelation [orders], [], false\n            +- 'UnresolvedRelation 
[customer], [], false\n\n== Analyzed Logical Plan ==\ncntrycode: string, 
numcust: bigint, totacctbal: decimal(22,2)\nSort [cntrycode#147 ASC NULLS 
FIRST], true\n+- Aggregate [cntrycode#147], [cntrycode#147, count(1) AS 
numcust#150L, sum(c_acctbal#11) AS totacctbal#151]\n   +- SubqueryAlias 
custsale\n      +- Project [substring(c_phone#10, 1, 2) AS cntrycode#147, 
c_acctbal#11]\n         +- Filter ((substring(c_phone#10, 1, 2) IN 
(13,31,23,29,30,18,17) AND (cast(c_acctbal#11 as decimal(16,6)) > 
cast(scalar-subquery#148 [] as decimal(16,6)))) AND NOT exists#149 
[c_custkey#6L])\n            :  :- Aggregate [avg(c_acctbal#160) AS 
avg(c_acctbal)#154]\n            :  :  +- Filter ((cast(c_acctbal#160 as 
decimal(12,2)) > cast(0.00 as decimal(12,2))) AND substring(c_phone#159, 1, 2) 
IN (13,31,23,29,30,18,17))\n            :  :     +- SubqueryAlias 
spark_catalog.tpch_data_orc_100.customer\n            :  :        +- Relation 
tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162]
 orc\n            :  +- Project [o_orderkey#16L, o_custkey#17L, 
o_orderstatus#18, o_totalprice#19, o_orderpriority#20, o_clerk#21, 
o_shippriority#22, o_comment#23, o_orderdate#24]\n            :     +- Filter 
(o_custkey#17L = outer(c_custkey#6L))\n            :        +- SubqueryAlias 
spark_catalog.tpch_data_orc_100.orders\n            :           +- Relation 
tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24]
 orc\n            +- SubqueryAlias spark_catalog.tpch_data_orc_100.customer\n   
            +- Relation 
tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13]
 orc\n\n== Optimized Logical Plan ==\nSort [cntrycode#147 ASC NULLS FIRST], 
true\n+- Aggregate [cntrycode#147], [cntrycode#147, count(1) AS numcust#150L, 
sum(c_acctbal#11) AS totacctbal#151]\n   +- Project [substring(c_phone#10, 1, 
2) AS cntrycode#147, c_acctbal#11]\n      +- Join LeftAnti, (o_custkey#17L = 
c_custkey#6L)\n         :- Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n   
      :  +- Filter ((isnotnull(c_acctbal#11) AND substring(c_phone#10, 1, 2) IN 
(13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 as decimal(16,6)) > 
scalar-subquery#148 []))\n         :     :  +- Aggregate [avg(c_acctbal#160) AS 
avg(c_acctbal)#154]\n         :     :     +- Project [c_acctbal#160]\n         
:     :        +- Filter (isnotnull(c_acctbal#160) AND ((c_acctbal#160 > 0.00) 
AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17)))\n         :     :  
         +- Relation 
tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162]
 orc\n         :     +- Relation 
tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13]
 orc\n         +- Project [o_custkey#17L]\n            +- Relation 
tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24]
 orc\n\n== Physical Plan ==\n*(7) Sort [cntrycode#147 ASC NULLS FIRST], true, 
0\n+- Exchange rangepartitioning(cntrycode#147 ASC NULLS FIRST, 200), 
ENSURE_REQUIREMENTS, [id=#314]\n   +- *(6) HashAggregate(keys=[cntrycode#147], 
functions=[count(1), sum(c_acctbal#11)], output=[cntrycode#147, numcust#150L, 
totacctbal#151])\n      +- Exchange hashpartitioning(cntrycode#147, 200), 
ENSURE_REQUIREMENTS, [id=#310]\n         +- *(5) 
HashAggregate(keys=[cntrycode#147], functions=[partial_count(1), 
partial_sum(c_acctbal#11)], output=[cntrycode#147, count#170L, sum#171, 
isEmpty#172])\n            +- *(5) Project [substring(c_phone#10, 1, 2) AS 
cntrycode#147, c_acctbal#11]\n               +- *(5) SortMergeJoin 
[c_custkey#6L], [o_custkey#17L], LeftAnti\n                  :- *(2) Sort 
[c_custkey#6L ASC NULLS FIRST], false, 0\n                  :  +- Exchange 
hashpartitioning(c_custkey#6L, 200), ENSURE_REQUIREMENTS, [id=#292]\n           
       :     +- *(1) Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n         
         :        +- *(1) Filter ((isnotnull(c_acctbal#11) AND 
substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 
as decimal(16,6)) > Subquery scalar-subquery#148, [id=#249]))\n                 
 :           :  +- Subquery scalar-subquery#148, [id=#249]\n                  : 
          :     +- *(2) HashAggregate(keys=[], functions=[avg(c_acctbal#160)], 
output=[avg(c_acctbal)#154])\n                  :           :        +- 
Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#245]\n                  :   
        :           +- *(1) HashAggregate(keys=[], 
functions=[partial_avg(c_acctbal#160)], output=[sum#175, count#176L])\n         
         :           :              +- *(1) Project [c_acctbal#160]\n           
       :           :                 +- *(1) Filter ((isnotnull(c_acctbal#160) 
AND (c_acctbal#160 > 0.00)) AND substring(c_phone#159, 1, 2) IN 
(13,31,23,29,30,18,17))\n                  :           :                    +- 
*(1) ColumnarToRow\n                  :           :                       +- 
FileScan orc 
tpch_data_orc_100.customer[c_phone#159,c_acctbal#160,c_mktsegment#162] Batched: 
true, DataFilters: [isnotnull(c_acctbal#160), (c_acctbal#160 > 0.00), 
substring(c_phone#159, 1, 2) IN (13,31,23,29,3..., Format: ORC, Location: 
CatalogFileIndex(0 paths)[], PartitionFilters: [], PushedFilters: 
[IsNotNull(c_acctbal), GreaterThan(c_acctbal,0.00)], ReadSchema: 
struct<c_phone:string,c_acctbal:decimal(12,2)>\n                  :           
+- *(1) ColumnarToRow\n                  :              +- FileScan orc 
tpch_data_orc_100.customer[c_custkey#6L,c_phone#10,c_acctbal#11,c_mktsegment#13]
 Batched: true, DataFilters: [isnotnull(c_acctbal#11), substring(c_phone#10, 1, 
2) IN (13,31,23,29,30,18,17)], Format: ORC, Location: CatalogFileIndex(0 
paths)[], PartitionFilters: [], PushedFilters: [IsNotNull(c_acctbal)], 
ReadSchema: struct<c_custkey:bigint,c_phone:string,c_acctbal:decimal(12,2)>\n   
               +- *(4) Sort [o_custkey#17L ASC NULLS FIRST], false, 0\n         
            +- Exchange hashpartitioning(o_custkey#17L, 200), 
ENSURE_REQUIREMENTS, [id=#301]\n                        +- *(3) Project 
[o_custkey#17L]\n                           +- *(3) ColumnarToRow\n             
                 +- FileScan orc 
tpch_data_orc_100.orders[o_custkey#17L,o_orderdate#24] Batched: true, 
DataFilters: [], Format: ORC, Location: CatalogFileIndex(0 paths)[], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<o_custkey:bigint>\n"
    }
  ]
}
{code}
The correct result should be 7 rows returned, not 0 as shown above. This can be 
confirmed by testing the same exact query against the hive table using 
Presto/Hive, which both return 7. Additionally, this link also shows that it 
should be 7 
[https://github.com/apache/impala/blob/master/testdata/workloads/tpch/queries/tpch-q22.test.]
 

  was:
When running TPCH scale 100 against 3.2, Query 22 has a discrepancy in the 
number of rows returned by the query. This was tested with both AQE on and off. 
All the other queries were matching in results. Below is the results that we 
got when testing Q22 on 3.2: 

 
{code:java}
  "results": [
    {
      "name": "Q22",
      "mode": "collect",
      "parameters": {},
      "joinTypes": [
        "SortMergeJoin"
      ],
      "tables": [
        "customer"
      ],
      "parsingTime": 0.016522,
      "analysisTime": 0.004132,
      "optimizationTime": 39.173868,
      "planningTime": 23.10939,
      "executionTime": 13762.183844,
      "result": 0,
      "breakDown": [],
      "queryExecution": "== Parsed Logical Plan ==\n'Sort ['cntrycode ASC NULLS 
FIRST], true\n+- 'Aggregate ['cntrycode], ['cntrycode, 'count(1) AS 
numcust#150, 'sum('c_acctbal) AS totacctbal#151]\n   +- 'SubqueryAlias 
custsale\n      +- 'Project ['substring('c_phone, 1, 2) AS cntrycode#147, 
'c_acctbal]\n         +- 'Filter (('substring('c_phone, 1, 2) IN 
(13,31,23,29,30,18,17) AND ('c_acctbal > scalar-subquery#148 [])) AND NOT 
exists#149 [])\n            :  :- 'Project [unresolvedalias('avg('c_acctbal), 
None)]\n            :  :  +- 'Filter (('c_acctbal > 0.00) AND 
'substring('c_phone, 1, 2) IN (13,31,23,29,30,18,17))\n            :  :     +- 
'UnresolvedRelation [customer], [], false\n            :  +- 'Project [*]\n     
       :     +- 'Filter ('o_custkey = 'c_custkey)\n            :        +- 
'UnresolvedRelation [orders], [], false\n            +- 'UnresolvedRelation 
[customer], [], false\n\n== Analyzed Logical Plan ==\ncntrycode: string, 
numcust: bigint, totacctbal: decimal(22,2)\nSort [cntrycode#147 ASC NULLS 
FIRST], true\n+- Aggregate [cntrycode#147], [cntrycode#147, count(1) AS 
numcust#150L, sum(c_acctbal#11) AS totacctbal#151]\n   +- SubqueryAlias 
custsale\n      +- Project [substring(c_phone#10, 1, 2) AS cntrycode#147, 
c_acctbal#11]\n         +- Filter ((substring(c_phone#10, 1, 2) IN 
(13,31,23,29,30,18,17) AND (cast(c_acctbal#11 as decimal(16,6)) > 
cast(scalar-subquery#148 [] as decimal(16,6)))) AND NOT exists#149 
[c_custkey#6L])\n            :  :- Aggregate [avg(c_acctbal#160) AS 
avg(c_acctbal)#154]\n            :  :  +- Filter ((cast(c_acctbal#160 as 
decimal(12,2)) > cast(0.00 as decimal(12,2))) AND substring(c_phone#159, 1, 2) 
IN (13,31,23,29,30,18,17))\n            :  :     +- SubqueryAlias 
spark_catalog.tpch_data_orc_100.customer\n            :  :        +- Relation 
tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162]
 orc\n            :  +- Project [o_orderkey#16L, o_custkey#17L, 
o_orderstatus#18, o_totalprice#19, o_orderpriority#20, o_clerk#21, 
o_shippriority#22, o_comment#23, o_orderdate#24]\n            :     +- Filter 
(o_custkey#17L = outer(c_custkey#6L))\n            :        +- SubqueryAlias 
spark_catalog.tpch_data_orc_100.orders\n            :           +- Relation 
tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24]
 orc\n            +- SubqueryAlias spark_catalog.tpch_data_orc_100.customer\n   
            +- Relation 
tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13]
 orc\n\n== Optimized Logical Plan ==\nSort [cntrycode#147 ASC NULLS FIRST], 
true\n+- Aggregate [cntrycode#147], [cntrycode#147, count(1) AS numcust#150L, 
sum(c_acctbal#11) AS totacctbal#151]\n   +- Project [substring(c_phone#10, 1, 
2) AS cntrycode#147, c_acctbal#11]\n      +- Join LeftAnti, (o_custkey#17L = 
c_custkey#6L)\n         :- Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n   
      :  +- Filter ((isnotnull(c_acctbal#11) AND substring(c_phone#10, 1, 2) IN 
(13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 as decimal(16,6)) > 
scalar-subquery#148 []))\n         :     :  +- Aggregate [avg(c_acctbal#160) AS 
avg(c_acctbal)#154]\n         :     :     +- Project [c_acctbal#160]\n         
:     :        +- Filter (isnotnull(c_acctbal#160) AND ((c_acctbal#160 > 0.00) 
AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17)))\n         :     :  
         +- Relation 
tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162]
 orc\n         :     +- Relation 
tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13]
 orc\n         +- Project [o_custkey#17L]\n            +- Relation 
tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24]
 orc\n\n== Physical Plan ==\n*(7) Sort [cntrycode#147 ASC NULLS FIRST], true, 
0\n+- Exchange rangepartitioning(cntrycode#147 ASC NULLS FIRST, 200), 
ENSURE_REQUIREMENTS, [id=#314]\n   +- *(6) HashAggregate(keys=[cntrycode#147], 
functions=[count(1), sum(c_acctbal#11)], output=[cntrycode#147, numcust#150L, 
totacctbal#151])\n      +- Exchange hashpartitioning(cntrycode#147, 200), 
ENSURE_REQUIREMENTS, [id=#310]\n         +- *(5) 
HashAggregate(keys=[cntrycode#147], functions=[partial_count(1), 
partial_sum(c_acctbal#11)], output=[cntrycode#147, count#170L, sum#171, 
isEmpty#172])\n            +- *(5) Project [substring(c_phone#10, 1, 2) AS 
cntrycode#147, c_acctbal#11]\n               +- *(5) SortMergeJoin 
[c_custkey#6L], [o_custkey#17L], LeftAnti\n                  :- *(2) Sort 
[c_custkey#6L ASC NULLS FIRST], false, 0\n                  :  +- Exchange 
hashpartitioning(c_custkey#6L, 200), ENSURE_REQUIREMENTS, [id=#292]\n           
       :     +- *(1) Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n         
         :        +- *(1) Filter ((isnotnull(c_acctbal#11) AND 
substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 
as decimal(16,6)) > Subquery scalar-subquery#148, [id=#249]))\n                 
 :           :  +- Subquery scalar-subquery#148, [id=#249]\n                  : 
          :     +- *(2) HashAggregate(keys=[], functions=[avg(c_acctbal#160)], 
output=[avg(c_acctbal)#154])\n                  :           :        +- 
Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#245]\n                  :   
        :           +- *(1) HashAggregate(keys=[], 
functions=[partial_avg(c_acctbal#160)], output=[sum#175, count#176L])\n         
         :           :              +- *(1) Project [c_acctbal#160]\n           
       :           :                 +- *(1) Filter ((isnotnull(c_acctbal#160) 
AND (c_acctbal#160 > 0.00)) AND substring(c_phone#159, 1, 2) IN 
(13,31,23,29,30,18,17))\n                  :           :                    +- 
*(1) ColumnarToRow\n                  :           :                       +- 
FileScan orc 
tpch_data_orc_100.customer[c_phone#159,c_acctbal#160,c_mktsegment#162] Batched: 
true, DataFilters: [isnotnull(c_acctbal#160), (c_acctbal#160 > 0.00), 
substring(c_phone#159, 1, 2) IN (13,31,23,29,3..., Format: ORC, Location: 
CatalogFileIndex(0 paths)[], PartitionFilters: [], PushedFilters: 
[IsNotNull(c_acctbal), GreaterThan(c_acctbal,0.00)], ReadSchema: 
struct<c_phone:string,c_acctbal:decimal(12,2)>\n                  :           
+- *(1) ColumnarToRow\n                  :              +- FileScan orc 
tpch_data_orc_100.customer[c_custkey#6L,c_phone#10,c_acctbal#11,c_mktsegment#13]
 Batched: true, DataFilters: [isnotnull(c_acctbal#11), substring(c_phone#10, 1, 
2) IN (13,31,23,29,30,18,17)], Format: ORC, Location: CatalogFileIndex(0 
paths)[], PartitionFilters: [], PushedFilters: [IsNotNull(c_acctbal)], 
ReadSchema: struct<c_custkey:bigint,c_phone:string,c_acctbal:decimal(12,2)>\n   
               +- *(4) Sort [o_custkey#17L ASC NULLS FIRST], false, 0\n         
            +- Exchange hashpartitioning(o_custkey#17L, 200), 
ENSURE_REQUIREMENTS, [id=#301]\n                        +- *(3) Project 
[o_custkey#17L]\n                           +- *(3) ColumnarToRow\n             
                 +- FileScan orc 
tpch_data_orc_100.orders[o_custkey#17L,o_orderdate#24] Batched: true, 
DataFilters: [], Format: ORC, Location: CatalogFileIndex(0 paths)[], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<o_custkey:bigint>\n"
    }
  ]
}
{code}
The correct result should be 7 rows returned, not 0 as shown above. This can be 
confirmed by testing the same exact query against the hive table using 
Presto/Hive, which both return 7. Additionally, this this link also shows that 
it should be 7 
[https://github.com/apache/impala/blob/master/testdata/workloads/tpch/queries/tpch-q22.test.]
 


> Discrepancy in Q22 of TPCH for Spark 3.2
> ----------------------------------------
>
>                 Key: SPARK-36926
>                 URL: https://issues.apache.org/jira/browse/SPARK-36926
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Aravind Patnam
>            Priority: Major
>
> When running TPCH scale 100 against 3.2, Query 22 has a discrepancy in the 
> number of rows returned by the query. This was tested with both AQE on and 
> off. All the other queries were matching in results. Below is the results 
> that we got when testing Q22 on 3.2: 
>  
> {code:java}
>   "results": [
>     {
>       "name": "Q22",
>       "mode": "collect",
>       "parameters": {},
>       "joinTypes": [
>         "SortMergeJoin"
>       ],
>       "tables": [
>         "customer"
>       ],
>       "parsingTime": 0.016522,
>       "analysisTime": 0.004132,
>       "optimizationTime": 39.173868,
>       "planningTime": 23.10939,
>       "executionTime": 13762.183844,
>       "result": 0,
>       "breakDown": [],
>       "queryExecution": "== Parsed Logical Plan ==\n'Sort ['cntrycode ASC 
> NULLS FIRST], true\n+- 'Aggregate ['cntrycode], ['cntrycode, 'count(1) AS 
> numcust#150, 'sum('c_acctbal) AS totacctbal#151]\n   +- 'SubqueryAlias 
> custsale\n      +- 'Project ['substring('c_phone, 1, 2) AS cntrycode#147, 
> 'c_acctbal]\n         +- 'Filter (('substring('c_phone, 1, 2) IN 
> (13,31,23,29,30,18,17) AND ('c_acctbal > scalar-subquery#148 [])) AND NOT 
> exists#149 [])\n            :  :- 'Project [unresolvedalias('avg('c_acctbal), 
> None)]\n            :  :  +- 'Filter (('c_acctbal > 0.00) AND 
> 'substring('c_phone, 1, 2) IN (13,31,23,29,30,18,17))\n            :  :     
> +- 'UnresolvedRelation [customer], [], false\n            :  +- 'Project 
> [*]\n            :     +- 'Filter ('o_custkey = 'c_custkey)\n            :    
>     +- 'UnresolvedRelation [orders], [], false\n            +- 
> 'UnresolvedRelation [customer], [], false\n\n== Analyzed Logical Plan 
> ==\ncntrycode: string, numcust: bigint, totacctbal: decimal(22,2)\nSort 
> [cntrycode#147 ASC NULLS FIRST], true\n+- Aggregate [cntrycode#147], 
> [cntrycode#147, count(1) AS numcust#150L, sum(c_acctbal#11) AS 
> totacctbal#151]\n   +- SubqueryAlias custsale\n      +- Project 
> [substring(c_phone#10, 1, 2) AS cntrycode#147, c_acctbal#11]\n         +- 
> Filter ((substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17) AND 
> (cast(c_acctbal#11 as decimal(16,6)) > cast(scalar-subquery#148 [] as 
> decimal(16,6)))) AND NOT exists#149 [c_custkey#6L])\n            :  :- 
> Aggregate [avg(c_acctbal#160) AS avg(c_acctbal)#154]\n            :  :  +- 
> Filter ((cast(c_acctbal#160 as decimal(12,2)) > cast(0.00 as decimal(12,2))) 
> AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17))\n            :  : 
>     +- SubqueryAlias spark_catalog.tpch_data_orc_100.customer\n            :  
> :        +- Relation 
> tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162]
>  orc\n            :  +- Project [o_orderkey#16L, o_custkey#17L, 
> o_orderstatus#18, o_totalprice#19, o_orderpriority#20, o_clerk#21, 
> o_shippriority#22, o_comment#23, o_orderdate#24]\n            :     +- Filter 
> (o_custkey#17L = outer(c_custkey#6L))\n            :        +- SubqueryAlias 
> spark_catalog.tpch_data_orc_100.orders\n            :           +- Relation 
> tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24]
>  orc\n            +- SubqueryAlias spark_catalog.tpch_data_orc_100.customer\n 
>               +- Relation 
> tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13]
>  orc\n\n== Optimized Logical Plan ==\nSort [cntrycode#147 ASC NULLS FIRST], 
> true\n+- Aggregate [cntrycode#147], [cntrycode#147, count(1) AS numcust#150L, 
> sum(c_acctbal#11) AS totacctbal#151]\n   +- Project [substring(c_phone#10, 1, 
> 2) AS cntrycode#147, c_acctbal#11]\n      +- Join LeftAnti, (o_custkey#17L = 
> c_custkey#6L)\n         :- Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n 
>         :  +- Filter ((isnotnull(c_acctbal#11) AND substring(c_phone#10, 1, 
> 2) IN (13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 as decimal(16,6)) > 
> scalar-subquery#148 []))\n         :     :  +- Aggregate [avg(c_acctbal#160) 
> AS avg(c_acctbal)#154]\n         :     :     +- Project [c_acctbal#160]\n     
>     :     :        +- Filter (isnotnull(c_acctbal#160) AND ((c_acctbal#160 > 
> 0.00) AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17)))\n         
> :     :           +- Relation 
> tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162]
>  orc\n         :     +- Relation 
> tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13]
>  orc\n         +- Project [o_custkey#17L]\n            +- Relation 
> tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24]
>  orc\n\n== Physical Plan ==\n*(7) Sort [cntrycode#147 ASC NULLS FIRST], true, 
> 0\n+- Exchange rangepartitioning(cntrycode#147 ASC NULLS FIRST, 200), 
> ENSURE_REQUIREMENTS, [id=#314]\n   +- *(6) 
> HashAggregate(keys=[cntrycode#147], functions=[count(1), sum(c_acctbal#11)], 
> output=[cntrycode#147, numcust#150L, totacctbal#151])\n      +- Exchange 
> hashpartitioning(cntrycode#147, 200), ENSURE_REQUIREMENTS, [id=#310]\n        
>  +- *(5) HashAggregate(keys=[cntrycode#147], functions=[partial_count(1), 
> partial_sum(c_acctbal#11)], output=[cntrycode#147, count#170L, sum#171, 
> isEmpty#172])\n            +- *(5) Project [substring(c_phone#10, 1, 2) AS 
> cntrycode#147, c_acctbal#11]\n               +- *(5) SortMergeJoin 
> [c_custkey#6L], [o_custkey#17L], LeftAnti\n                  :- *(2) Sort 
> [c_custkey#6L ASC NULLS FIRST], false, 0\n                  :  +- Exchange 
> hashpartitioning(c_custkey#6L, 200), ENSURE_REQUIREMENTS, [id=#292]\n         
>          :     +- *(1) Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n     
>              :        +- *(1) Filter ((isnotnull(c_acctbal#11) AND 
> substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 
> as decimal(16,6)) > Subquery scalar-subquery#148, [id=#249]))\n               
>    :           :  +- Subquery scalar-subquery#148, [id=#249]\n                
>   :           :     +- *(2) HashAggregate(keys=[], 
> functions=[avg(c_acctbal#160)], output=[avg(c_acctbal)#154])\n                
>   :           :        +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [id=#245]\n                  :           :           +- *(1) 
> HashAggregate(keys=[], functions=[partial_avg(c_acctbal#160)], 
> output=[sum#175, count#176L])\n                  :           :              
> +- *(1) Project [c_acctbal#160]\n                  :           :              
>    +- *(1) Filter ((isnotnull(c_acctbal#160) AND (c_acctbal#160 > 0.00)) AND 
> substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17))\n                  :  
>          :                    +- *(1) ColumnarToRow\n                  :      
>      :                       +- FileScan orc 
> tpch_data_orc_100.customer[c_phone#159,c_acctbal#160,c_mktsegment#162] 
> Batched: true, DataFilters: [isnotnull(c_acctbal#160), (c_acctbal#160 > 
> 0.00), substring(c_phone#159, 1, 2) IN (13,31,23,29,3..., Format: ORC, 
> Location: CatalogFileIndex(0 paths)[], PartitionFilters: [], PushedFilters: 
> [IsNotNull(c_acctbal), GreaterThan(c_acctbal,0.00)], ReadSchema: 
> struct<c_phone:string,c_acctbal:decimal(12,2)>\n                  :           
> +- *(1) ColumnarToRow\n                  :              +- FileScan orc 
> tpch_data_orc_100.customer[c_custkey#6L,c_phone#10,c_acctbal#11,c_mktsegment#13]
>  Batched: true, DataFilters: [isnotnull(c_acctbal#11), substring(c_phone#10, 
> 1, 2) IN (13,31,23,29,30,18,17)], Format: ORC, Location: CatalogFileIndex(0 
> paths)[], PartitionFilters: [], PushedFilters: [IsNotNull(c_acctbal)], 
> ReadSchema: struct<c_custkey:bigint,c_phone:string,c_acctbal:decimal(12,2)>\n 
>                  +- *(4) Sort [o_custkey#17L ASC NULLS FIRST], false, 0\n     
>                 +- Exchange hashpartitioning(o_custkey#17L, 200), 
> ENSURE_REQUIREMENTS, [id=#301]\n                        +- *(3) Project 
> [o_custkey#17L]\n                           +- *(3) ColumnarToRow\n           
>                    +- FileScan orc 
> tpch_data_orc_100.orders[o_custkey#17L,o_orderdate#24] Batched: true, 
> DataFilters: [], Format: ORC, Location: CatalogFileIndex(0 paths)[], 
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<o_custkey:bigint>\n"
>     }
>   ]
> }
> {code}
> The correct result should be 7 rows returned, not 0 as shown above. This can 
> be confirmed by testing the same exact query against the hive table using 
> Presto/Hive, which both return 7. Additionally, this link also shows that it 
> should be 7 
> [https://github.com/apache/impala/blob/master/testdata/workloads/tpch/queries/tpch-q22.test.]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to