tustvold opened a new pull request #1738:
URL: https://github.com/apache/arrow-datafusion/pull/1738
# Which issue does this PR close?
Closes #TBD.
# Rationale for this change
Benchmarks good, more benchmarks more good :smile:
# What changes are included in this PR?
This adds a benchmark that optionally generates a large-ish parquet file, or
uses a file specified by an environment variable, and then runs through a list
of queries against this file.
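
As a rough illustration of the file-selection step described above, here is a minimal sketch using only the standard library. The environment variable name `PARQUET_BENCH_FILE`, the `BenchInput` type, the `resolve_input` helper, and the placeholder queries are all hypothetical and are not taken from this PR. Actual file generation and query execution are left out.

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical environment variable name (an assumption, not from the PR).
const BENCH_FILE_ENV: &str = "PARQUET_BENCH_FILE";

/// Where the benchmark input comes from.
#[derive(Debug, PartialEq)]
enum BenchInput {
    /// Use an existing file supplied by the user.
    Existing(PathBuf),
    /// Generate a fresh file at this path before running the queries.
    Generate(PathBuf),
}

/// Pick the input: a non-empty override wins, otherwise fall back to generation.
fn resolve_input(override_path: Option<String>, default_path: PathBuf) -> BenchInput {
    match override_path {
        Some(p) if !p.is_empty() => BenchInput::Existing(PathBuf::from(p)),
        _ => BenchInput::Generate(default_path),
    }
}

fn main() {
    // Default location for a generated file (illustrative only).
    let default_path = env::temp_dir().join("bench.parquet");
    let input = resolve_input(env::var(BENCH_FILE_ENV).ok(), default_path);

    // Placeholder queries; the table name `t` is made up for this sketch.
    let queries = [
        "select count(*) from t",
        "select count(*) from t where 1 = 1",
    ];

    match &input {
        BenchInput::Existing(path) => println!("using existing file {}", path.display()),
        BenchInput::Generate(path) => println!("would generate {}", path.display()),
    }
    for query in queries.iter() {
        println!("would run: {}", query);
    }
}
```

Keeping the decision in a small pure function makes the override behaviour easy to check without touching the process environment.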
My hope is that this will supplement the TPC-H benchmark with one that is
perhaps easier for people to set up and run, and that can be more easily adapted
to test different data shapes and queries.
In particular, as currently configured, this will test:
* Dictionary arrays
* Nullable arrays
* Large-ish parquet files (~200 MB)
* Basic table scans with filters and aggregates
* ...Suggestions welcome :smile:
It could theoretically be extended to incorporate joins. However, as I don't
currently have a real-world use case that produces these, I'd rather leave it
to someone with such a workload to model a representative benchmark.
_Unfortunately, the generation portion needs
https://github.com/apache/arrow-rs/pull/1214, but arrow 9, which will contain
this, should be out soon. Will keep this as a draft until then._
# Are there any user-facing changes?
No