[ https://issues.apache.org/jira/browse/SPARK-36926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424293#comment-17424293 ]
Apache Spark commented on SPARK-36926: -------------------------------------- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/34180 > Discrepancy in Q22 of TPCH for Spark 3.2 > ---------------------------------------- > > Key: SPARK-36926 > URL: https://issues.apache.org/jira/browse/SPARK-36926 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.2.0 > Reporter: Aravind Patnam > Priority: Critical > > When running TPCH scale 100 against 3.2, Query 22 has a discrepancy in the > number of rows returned by the query. This was tested with both AQE on and > off. All the other queries were matching in results. Below is the results > that we got when testing Q22 on 3.2: > > {code:java} > "results": [ > { > "name": "Q22", > "mode": "collect", > "parameters": {}, > "joinTypes": [ > "SortMergeJoin" > ], > "tables": [ > "customer" > ], > "parsingTime": 0.016522, > "analysisTime": 0.004132, > "optimizationTime": 39.173868, > "planningTime": 23.10939, > "executionTime": 13762.183844, > "result": 0, > "breakDown": [], > "queryExecution": "== Parsed Logical Plan ==\n'Sort ['cntrycode ASC > NULLS FIRST], true\n+- 'Aggregate ['cntrycode], ['cntrycode, 'count(1) AS > numcust#150, 'sum('c_acctbal) AS totacctbal#151]\n +- 'SubqueryAlias > custsale\n +- 'Project ['substring('c_phone, 1, 2) AS cntrycode#147, > 'c_acctbal]\n +- 'Filter (('substring('c_phone, 1, 2) IN > (13,31,23,29,30,18,17) AND ('c_acctbal > scalar-subquery#148 [])) AND NOT > exists#149 [])\n : :- 'Project [unresolvedalias('avg('c_acctbal), > None)]\n : : +- 'Filter (('c_acctbal > 0.00) AND > 'substring('c_phone, 1, 2) IN (13,31,23,29,30,18,17))\n : : > +- 'UnresolvedRelation [customer], [], false\n : +- 'Project > [*]\n : +- 'Filter ('o_custkey = 'c_custkey)\n : > +- 'UnresolvedRelation [orders], [], false\n +- > 'UnresolvedRelation [customer], [], false\n\n== Analyzed Logical Plan > ==\ncntrycode: string, numcust: bigint, totacctbal: decimal(22,2)\nSort > [cntrycode#147 ASC NULLS FIRST], true\n+- Aggregate [cntrycode#147], > [cntrycode#147, count(1) AS numcust#150L, sum(c_acctbal#11) AS > totacctbal#151]\n +- SubqueryAlias custsale\n +- Project > [substring(c_phone#10, 1, 2) AS cntrycode#147, c_acctbal#11]\n +- > Filter ((substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17) AND > (cast(c_acctbal#11 as decimal(16,6)) > cast(scalar-subquery#148 [] as > decimal(16,6)))) AND NOT exists#149 [c_custkey#6L])\n : :- > Aggregate [avg(c_acctbal#160) AS avg(c_acctbal)#154]\n : : +- > Filter ((cast(c_acctbal#160 as decimal(12,2)) > cast(0.00 as decimal(12,2))) > AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17))\n : : > +- SubqueryAlias spark_catalog.tpch_data_orc_100.customer\n : > : +- Relation > tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162] > orc\n : +- Project [o_orderkey#16L, o_custkey#17L, > o_orderstatus#18, o_totalprice#19, o_orderpriority#20, o_clerk#21, > o_shippriority#22, o_comment#23, o_orderdate#24]\n : +- Filter > (o_custkey#17L = outer(c_custkey#6L))\n : +- SubqueryAlias > spark_catalog.tpch_data_orc_100.orders\n : +- Relation > tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24] > orc\n +- SubqueryAlias spark_catalog.tpch_data_orc_100.customer\n > +- Relation > tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13] > orc\n\n== Optimized Logical Plan ==\nSort [cntrycode#147 ASC NULLS FIRST], > true\n+- Aggregate [cntrycode#147], [cntrycode#147, count(1) AS numcust#150L, > sum(c_acctbal#11) AS totacctbal#151]\n +- Project [substring(c_phone#10, 1, > 2) AS cntrycode#147, c_acctbal#11]\n +- Join LeftAnti, (o_custkey#17L = > c_custkey#6L)\n :- Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n > : +- Filter ((isnotnull(c_acctbal#11) AND substring(c_phone#10, 1, > 2) IN (13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 as decimal(16,6)) > > scalar-subquery#148 []))\n : : +- Aggregate [avg(c_acctbal#160) > AS avg(c_acctbal)#154]\n : : +- Project [c_acctbal#160]\n > : : +- Filter (isnotnull(c_acctbal#160) AND ((c_acctbal#160 > > 0.00) AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17)))\n > : : +- Relation > tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162] > orc\n : +- Relation > tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13] > orc\n +- Project [o_custkey#17L]\n +- Relation > tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24] > orc\n\n== Physical Plan ==\n*(7) Sort [cntrycode#147 ASC NULLS FIRST], true, > 0\n+- Exchange rangepartitioning(cntrycode#147 ASC NULLS FIRST, 200), > ENSURE_REQUIREMENTS, [id=#314]\n +- *(6) > HashAggregate(keys=[cntrycode#147], functions=[count(1), sum(c_acctbal#11)], > output=[cntrycode#147, numcust#150L, totacctbal#151])\n +- Exchange > hashpartitioning(cntrycode#147, 200), ENSURE_REQUIREMENTS, [id=#310]\n > +- *(5) HashAggregate(keys=[cntrycode#147], functions=[partial_count(1), > partial_sum(c_acctbal#11)], output=[cntrycode#147, count#170L, sum#171, > isEmpty#172])\n +- *(5) Project [substring(c_phone#10, 1, 2) AS > cntrycode#147, c_acctbal#11]\n +- *(5) SortMergeJoin > [c_custkey#6L], [o_custkey#17L], LeftAnti\n :- *(2) Sort > [c_custkey#6L ASC NULLS FIRST], false, 0\n : +- Exchange > hashpartitioning(c_custkey#6L, 200), ENSURE_REQUIREMENTS, [id=#292]\n > : +- *(1) Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n > : +- *(1) Filter ((isnotnull(c_acctbal#11) AND > substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 > as decimal(16,6)) > Subquery scalar-subquery#148, [id=#249]))\n > : : +- Subquery scalar-subquery#148, [id=#249]\n > : : +- *(2) HashAggregate(keys=[], > functions=[avg(c_acctbal#160)], output=[avg(c_acctbal)#154])\n > : : +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [id=#245]\n : : +- *(1) > HashAggregate(keys=[], functions=[partial_avg(c_acctbal#160)], > output=[sum#175, count#176L])\n : : > +- *(1) Project [c_acctbal#160]\n : : > +- *(1) Filter ((isnotnull(c_acctbal#160) AND (c_acctbal#160 > 0.00)) AND > substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17))\n : > : +- *(1) ColumnarToRow\n : > : +- FileScan orc > tpch_data_orc_100.customer[c_phone#159,c_acctbal#160,c_mktsegment#162] > Batched: true, DataFilters: [isnotnull(c_acctbal#160), (c_acctbal#160 > > 0.00), substring(c_phone#159, 1, 2) IN (13,31,23,29,3..., Format: ORC, > Location: CatalogFileIndex(0 paths)[], PartitionFilters: [], PushedFilters: > [IsNotNull(c_acctbal), GreaterThan(c_acctbal,0.00)], ReadSchema: > struct<c_phone:string,c_acctbal:decimal(12,2)>\n : > +- *(1) ColumnarToRow\n : +- FileScan orc > tpch_data_orc_100.customer[c_custkey#6L,c_phone#10,c_acctbal#11,c_mktsegment#13] > Batched: true, DataFilters: [isnotnull(c_acctbal#11), substring(c_phone#10, > 1, 2) IN (13,31,23,29,30,18,17)], Format: ORC, Location: CatalogFileIndex(0 > paths)[], PartitionFilters: [], PushedFilters: [IsNotNull(c_acctbal)], > ReadSchema: struct<c_custkey:bigint,c_phone:string,c_acctbal:decimal(12,2)>\n > +- *(4) Sort [o_custkey#17L ASC NULLS FIRST], false, 0\n > +- Exchange hashpartitioning(o_custkey#17L, 200), > ENSURE_REQUIREMENTS, [id=#301]\n +- *(3) Project > [o_custkey#17L]\n +- *(3) ColumnarToRow\n > +- FileScan orc > tpch_data_orc_100.orders[o_custkey#17L,o_orderdate#24] Batched: true, > DataFilters: [], Format: ORC, Location: CatalogFileIndex(0 paths)[], > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct<o_custkey:bigint>\n" > } > ] > } > {code} > The correct result should be 7 rows returned, not 0 as shown above. This can > be confirmed by testing the same exact query against the hive table using > Presto/Hive, which both return 7. Additionally, this link also shows that it > should be 7 > [https://github.com/apache/impala/blob/master/testdata/workloads/tpch/queries/tpch-q22.test.] > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org