Piyush-08-bot commented on issue #1845:
URL: 
https://github.com/apache/datafusion-ballista/issues/1845#issuecomment-4694753552

   Thanks for confirming @milenkovicm! I checked apache/datafusion-benchmarks 
and found that TPC-DS is actually mostly ready there:
   
   - All 99 query files (q1.sql-q99.sql) already exist and can be copied over 
directly
   - There's a generic Rust runner pattern (datafusion-rust/main.rs) that 
registers tables and runs queries by number, similar in spirit to tpch.rs
   
   The main gap is data generation. TPC-H uses tpchgen-cli, a simple 
cargo-installable binary, but TPC-DS data generation in datafusion-benchmarks 
requires manually downloading tpc-ds-tool.zip from TPC.org and building dsdgen 
via Docker. That's heavier and harder to fully automate in CI than the TPC-H 
workflow.
   
   Given that, would it be reasonable to scope this PR as:
   
   - Add the TPC-DS query files + a benchmark binary (adapted from the tpch.rs 
pattern) to run against a Ballista cluster
   - Start with a smaller subset of queries to keep things manageable
   - For the CI workflow, either use a small pre-generated dataset stored in the 
CI cache or document a manual data-gen step for now; the data-gen part can be 
fully automated in a follow-up once we find a lighter-weight TPC-DS generator 
(similar to tpchgen-cli)?
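
   To make the "smaller subset" idea concrete, here is a rough sketch of how a 
runner could pick queries by number and resolve them to the q1.sql-q99.sql 
files. It uses only the Rust standard library. The selection syntax 
("1,3,5-7"), the helper names and the queries/tpcds directory are all 
hypothetical, not taken from tpch.rs or datafusion-benchmarks:

```rust
use std::path::{Path, PathBuf};

/// Parse a selection such as "1,3,5-7" into sorted, de-duplicated query numbers.
/// Only 1..=99 is accepted, matching the q1.sql-q99.sql file names.
/// (Hypothetical helper; the selection syntax is an assumption.)
fn parse_query_selection(spec: &str) -> Result<Vec<u32>, String> {
    let mut out = Vec::new();
    for part in spec.split(',').map(str::trim).filter(|p| !p.is_empty()) {
        // A part is either a single number ("3") or an inclusive range ("5-7").
        let (lo, hi) = match part.split_once('-') {
            Some((a, b)) => (parse_num(a)?, parse_num(b)?),
            None => {
                let n = parse_num(part)?;
                (n, n)
            }
        };
        if lo > hi {
            return Err(format!("invalid range: {part}"));
        }
        out.extend(lo..=hi);
    }
    out.sort_unstable();
    out.dedup();
    Ok(out)
}

/// Parse one query number and check that it is within 1..=99.
fn parse_num(s: &str) -> Result<u32, String> {
    let n: u32 = s.trim().parse().map_err(|_| format!("not a number: {s}"))?;
    if (1..=99).contains(&n) {
        Ok(n)
    } else {
        Err(format!("query {n} out of range 1..=99"))
    }
}

/// Build the path of query `n` inside `dir` (the directory layout is an assumption).
fn query_path(dir: &Path, n: u32) -> PathBuf {
    dir.join(format!("q{n}.sql"))
}

fn main() {
    let queries = parse_query_selection("1,3,5-7").expect("valid selection");
    for n in &queries {
        // The real runner would read this file and submit the SQL to the cluster.
        println!("{}", query_path(Path::new("queries/tpcds"), *n).display());
    }
}
```

   Starting with a short explicit list like this would keep the first CI run 
small, and widening it to all 99 queries later is just a change to the 
selection string.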
   
   Let me know if this scope works or if you'd prefer a different approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

