[GitHub] [arrow-datafusion] tustvold opened a new pull request #1738: Add parquet SQL benchmarks

GitBox Thu, 03 Feb 2022 07:42:33 -0800


tustvold opened a new pull request #1738:
URL: https://github.com/apache/arrow-datafusion/pull/1738



   # Which issue does this PR close?
   
   Closes #TBD.
   
   # Rationale for this change
   
   Benchmarks good, more benchmarks more good :smile: 
   
   # What changes are included in this PR?
   
   This adds a benchmark that optionally generates a large-ish parquet file, or 
uses a file specified by an environment variable, and then runs through a list 
of queries against this file.
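
   The flow described above (pick an input file, then run each query in turn) can be sketched roughly as follows. This is a std-only illustration, not the code in this PR: the variable name `BENCH_PARQUET_FILE`, the `Input` enum, and both helper functions are hypothetical, and the query engine is stood in for by a closure.

```rust
use std::env;
use std::path::PathBuf;
use std::time::{Duration, Instant};

/// Hypothetical variable name; the PR text does not say which variable is read.
const PARQUET_FILE_VAR: &str = "BENCH_PARQUET_FILE";

/// Where the benchmark input comes from.
#[derive(Debug, PartialEq)]
enum Input {
    /// A file supplied by the user through the environment.
    Existing(PathBuf),
    /// Nothing supplied: the caller should generate a file at this path.
    Generate(PathBuf),
}

/// Decide on the input from an optional override value
/// (for example `env::var(PARQUET_FILE_VAR).ok()`).
fn resolve_input(override_path: Option<String>) -> Input {
    match override_path {
        Some(p) if !p.is_empty() => Input::Existing(PathBuf::from(p)),
        // The file name below is an assumption made for this sketch.
        _ => Input::Generate(env::temp_dir().join("bench_generated.parquet")),
    }
}

/// Run every query once through `run` and record the wall-clock time of each.
/// `run` stands in for whatever actually executes SQL against the file.
fn time_queries<F>(queries: &[&str], mut run: F) -> Vec<(String, Duration)>
where
    F: FnMut(&str),
{
    queries
        .iter()
        .map(|&q| {
            let start = Instant::now();
            run(q);
            (q.to_string(), start.elapsed())
        })
        .collect()
}
```

   A caller would pass `env::var(PARQUET_FILE_VAR).ok()` to `resolve_input`, generate the file only in the `Generate` case, and then hand the query list to `time_queries`. Keeping the environment lookup outside `resolve_input` makes the decision easy to test without touching process state.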
   
   My hope is that this will supplement the TPCH benchmark with one that is 
perhaps easier for people to set up and run, and that can be more easily adapted 
to test different data shapes and queries.
   
   In particular, as currently configured, this will test:
   
   * Dictionary arrays
   * Nullable arrays
   * Large-ish parquet files (~200 MB)
   * Basic table scans with filters and aggregates
   * ...Suggestions welcome :smile: 
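
   For readers unfamiliar with the first two data shapes in the list, the sketch below shows what a dictionary-encoded, nullable string column looks like in principle. It is a std-only illustration and deliberately not the arrow crate's `DictionaryArray`; the `DictColumn` type and its methods are hypothetical names used only here.

```rust
use std::collections::HashMap;

/// Std-only stand-in for a dictionary-encoded, nullable string column.
/// This is NOT the arrow crate's `DictionaryArray`; it only mirrors the idea.
#[derive(Debug, Default)]
struct DictColumn {
    /// Distinct values, each stored once.
    values: Vec<String>,
    /// One entry per row: `None` for null, otherwise an index into `values`.
    keys: Vec<Option<u32>>,
}

impl DictColumn {
    /// Encode a column of optional strings.
    fn encode(rows: &[Option<&str>]) -> Self {
        let mut index: HashMap<String, u32> = HashMap::new();
        let mut col = DictColumn::default();
        for row in rows {
            // Null rows stay `None`; non-null rows map to a dictionary key.
            let key = row.map(|s| match index.get(s).copied() {
                Some(k) => k,
                None => {
                    let k = col.values.len() as u32;
                    col.values.push(s.to_string());
                    index.insert(s.to_string(), k);
                    k
                }
            });
            col.keys.push(key);
        }
        col
    }

    /// Look up the value of one row, or `None` if the row is null.
    fn get(&self, row: usize) -> Option<&str> {
        self.keys[row].map(|k| self.values[k as usize].as_str())
    }
}
```

   Repeated strings cost one small key per row rather than a full copy, which is why low-cardinality columns are a useful shape to benchmark separately from plain ones.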
   
   It could theoretically be extended to incorporate joins. However, as I don't 
currently have a real-world use case that produces these, I'd rather leave it 
to someone with such a workload to model a representative benchmark.
   
   _Unfortunately, the generation portion needs 
https://github.com/apache/arrow-rs/pull/1214, but arrow 9, which will contain 
this, should be out soon. I will keep this as a draft until then._
   
   # Are there any user-facing changes?
   
   No


