[ 
https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-39584:
--------------------------------------
    Description: 
GenTPCDSData uses the schema defined in `TPCDSSchema`, which contains
varchar(N)/char(N) columns. When GenTPCDSData generates parquet, strings whose
lengths are < N are padded with spaces.

When TPCDSQueryBenchmark reads the parquet data generated by GenTPCDSData, it
uses the schema from the parquet files and keeps the padding. Because of the
extra spaces, string filters in TPC-DS queries fail to match. For example, q13
returns all nulls and finishes too quickly because its string filters do not
match any rows.
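
The mismatch can be illustrated with a small plain-Python simulation. This is
only a sketch of the effect, not Spark code: `pad_char` and the sample values
are hypothetical, and the real padding is done by Spark when it writes
char(N) columns.

```python
# Plain-Python simulation of the padding mismatch; this is not Spark code.
# pad_char and the sample values below are hypothetical illustrations.

def pad_char(value: str, n: int) -> str:
    """Right-pad value with spaces to length n, as char(N) padding would."""
    return value.ljust(n)

# Hypothetical values of a char(10) column, stored with padding.
stored = [pad_char(v, 10) for v in ["M", "S", "D"]]

# A filter literal without padding matches no stored value.
unpadded_matches = [v for v in stored if v == "M"]

# The same filter matches once the padding is accounted for,
# either by padding the literal or by trimming the stored value.
padded_matches = [v for v in stored if v == pad_char("M", 10)]
trimmed_matches = [v for v in stored if v.rstrip(" ") == "M"]

print(len(unpadded_matches), len(padded_matches), len(trimmed_matches))
```

A filter that matches no rows leaves almost nothing for the rest of the query
to process, which is consistent with q13 returning quickly with all nulls.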

Therefore, TPCDSQueryBenchmark is benchmarking queries that produce wrong
results, which inflates some of the performance numbers.

I am exploring two possible solutions:
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before
reading. This is what the Spark TPC-DS unit tests do.
2. Change varchar to string in the schema. This is what the [databricks data
generator|https://github.com/databricks/spark-sql-perf] does.
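
As a rough sketch of what each option amounts to, the helpers below build the
statement for option 1 and rewrite a schema string for option 2. They are
plain string manipulation in Python, not the actual change: the function
names, the sample schema, and the path are hypothetical, and the parentheses
around the schema and the quoting of the path are assumptions about how the
statement would be spelled out.

```python
import re

# Hypothetical helpers; the names, sample schema and path are illustrative only.

def create_table_statement(table_name: str, schema: str, path: str) -> str:
    """Option 1: build a CREATE TABLE statement that declares the schema."""
    return (
        f"CREATE TABLE {table_name} ({schema}) "
        f"USING parquet LOCATION '{path}'"
    )

def varchar_to_string(schema: str) -> str:
    """Option 2: replace every char(N)/varchar(N) type with string."""
    return re.sub(
        r"\b(?:var)?char\(\s*\d+\s*\)", "string", schema, flags=re.IGNORECASE
    )

# Sample column definitions, assumed for illustration.
schema = "cd_demo_sk int, cd_marital_status char(1), cd_education_status char(20)"

print(create_table_statement(
    "customer_demographics", schema, "/tmp/tpcds/customer_demographics"))
print(varchar_to_string(schema))
```

Option 1 keeps the char(N)/varchar(N) types and makes them known before the
data is read, while option 2 removes those types from the schema altogether.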

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
https://issues.apache.org/jira/browse/SPARK-35192

History of the related varchar issue:
[https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]

  was:
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains 
varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for 
strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it 
uses schema from the parquet file and keeps the paddings. Due to the extra 
spaces, string filter queries of TPC-DS fail to match. For example, q13 query 
results are all nulls and returns too fast because string filter does not meet 
any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and 
that is inflating some performance results.

I am exploring two possible solutions now
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before 
reading. This is what Spark unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data 
generator|https://github.com/databricks/spark-sql-perf] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
https://issues.apache.org/jira/browse/SPARK-35192

History related varchar issue 
[https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]


> Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
> --------------------------------------------------------------------
>
>                 Key: SPARK-39584
>                 URL: https://issues.apache.org/jira/browse/SPARK-39584
>             Project: Spark
>          Issue Type: Test
>          Components: Tests
>    Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0, 3.4.0
>            Reporter: Kazuyuki Tanimura
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
