[ https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-39584.
---------------------------------
    Fix Version/s: 3.4.0
         Assignee: Kazuyuki Tanimura
       Resolution: Fixed

> Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
> --------------------------------------------------------------------
>
>                 Key: SPARK-39584
>                 URL: https://issues.apache.org/jira/browse/SPARK-39584
>             Project: Spark
>          Issue Type: Test
>          Components: Tests
>    Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0, 3.4.0
>            Reporter: Kazuyuki Tanimura
>            Assignee: Kazuyuki Tanimura
>            Priority: Minor
>             Fix For: 3.4.0
>
>
> GenTPCDSData uses the schema defined in `TPCDSSchema`, which contains char(N)
> columns. When GenTPCDSData generates Parquet files, it pads strings shorter
> than N with trailing spaces.
> When TPCDSQueryBenchmark reads the Parquet data generated by GenTPCDSData,
> it uses the schema from the Parquet files and keeps the padding. Because of
> the extra spaces, the string filters in TPC-DS queries fail to match. For
> example, the q13 results are all nulls, and the query returns too quickly
> because its string filters match no rows.
> As a result, TPCDSQueryBenchmark is benchmarking queries that return wrong
> results, which inflates some of the performance numbers.
> I am exploring two possible solutions:
> 1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before
> reading. This is what the Spark TPC-DS unit tests do.
> 2. Change char to string in the schema. This is what the [databricks data
> generator|https://github.com/databricks/spark-sql-perf] does.
> TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in
> https://issues.apache.org/jira/browse/SPARK-35192
> History of the related char issue:
> [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
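
The padding problem described in the issue can be illustrated with a minimal sketch in plain Python. It does not use Spark or Parquet and is not the benchmark's code. The width `N`, the helper `pad_char`, and the sample values are all hypothetical, chosen only to show why an equality filter finds no rows once values are stored with char(N)-style padding.

```python
# Plain-Python simulation of char(N) padding; no Spark or Parquet involved.
N = 10  # hypothetical column width, standing in for char(N)


def pad_char(value: str, n: int) -> str:
    """Right-pad value with spaces to length n (values already >= n are unchanged)."""
    return value.ljust(n)


# Hypothetical column values before they are written out.
raw_values = ["M", "S", "D", "M"]

# What ends up stored once each value is padded to N characters.
stored = [pad_char(v, N) for v in raw_values]

# An equality filter against the unpadded literal matches nothing,
# because "M" followed by nine spaces is not equal to "M".
matches_padded = [v for v in stored if v == "M"]

# Stripping the trailing spaces before comparing restores the matches.
matches_trimmed = [v for v in stored if v.rstrip(" ") == "M"]

print(len(matches_padded))   # 0
print(len(matches_trimmed))  # 2
```

A filter that matches no rows does almost no downstream work, which is consistent with the report that the affected queries return too quickly.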