[jira] [Created] (ARROW-18418) [WEBSITE] do not delete /datafusion-python
Andy Grove created ARROW-18418: -- Summary: [WEBSITE] do not delete /datafusion-python Key: ARROW-18418 URL: https://issues.apache.org/jira/browse/ARROW-18418 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Andy Grove Assignee: Andy Grove do not delete /datafusion-python when publishing -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17878) [Website] Exclude Ballista docs from being deleted
Andy Grove created ARROW-17878: -- Summary: [Website] Exclude Ballista docs from being deleted Key: ARROW-17878 URL: https://issues.apache.org/jira/browse/ARROW-17878 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Andy Grove Exclude Ballista docs from being deleted -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17325) AQE should use available column statistics from completed query stages
Andy Grove created ARROW-17325: -- Summary: AQE should use available column statistics from completed query stages Key: ARROW-17325 URL: https://issues.apache.org/jira/browse/ARROW-17325 Project: Apache Arrow Issue Type: Improvement Components: SQL Reporter: Andy Grove In QueryStageExec.computeStats we copy partial statistics from materialized query stages by calling QueryStageExec#getRuntimeStatistics, which in turn calls ShuffleExchangeLike#runtimeStatistics or BroadcastExchangeLike#runtimeStatistics. Only dataSize and numOutputRows are copied into the new Statistics object:
{code:scala}
def computeStats(): Option[Statistics] = if (isMaterialized) {
  val runtimeStats = getRuntimeStatistics
  val dataSize = runtimeStats.sizeInBytes.max(0)
  val numOutputRows = runtimeStats.rowCount.map(_.max(0))
  Some(Statistics(dataSize, numOutputRows, isRuntime = true))
} else {
  None
}
{code}
I would like to also copy over the column statistics stored in Statistics.attributeMap so that they can be fed back into the logical plan optimization phase. The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do not currently provide such column statistics, but other custom implementations can. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-16595) [WEBSITE] DataFusion 8.0.0 Release Blog Post
Andy Grove created ARROW-16595: -- Summary: [WEBSITE] DataFusion 8.0.0 Release Blog Post Key: ARROW-16595 URL: https://issues.apache.org/jira/browse/ARROW-16595 Project: Apache Arrow Issue Type: Task Components: Website Reporter: Andy Grove DataFusion 8.0.0 Release Blog Post -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-13656) [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts
Andy Grove created ARROW-13656: -- Summary: [Website] [Rust] DataFusion 5.0.0 and Ballista 0.5.0 Blog Posts Key: ARROW-13656 URL: https://issues.apache.org/jira/browse/ARROW-13656 Project: Apache Arrow Issue Type: Task Components: Website Reporter: Andy Grove Assignee: Andy Grove https://github.com/apache/arrow-datafusion/issues/881 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12437) [Rust] [Ballista] Ballista plans must not include RepartitionExec
Andy Grove created ARROW-12437: -- Summary: [Rust] [Ballista] Ballista plans must not include RepartitionExec Key: ARROW-12437 URL: https://issues.apache.org/jira/browse/ARROW-12437 Project: Apache Arrow Issue Type: Bug Components: Rust - Ballista Reporter: Andy Grove Ballista plans must not include RepartitionExec because it produces incorrect results. Ballista needs to manage its own repartitioning in a distributed-aware way later on. For now we just need to configure the DataFusion context to disable repartitioning. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12434) [Rust] [Ballista] Show executed plans with metrics
Andy Grove created ARROW-12434: -- Summary: [Rust] [Ballista] Show executed plans with metrics Key: ARROW-12434 URL: https://issues.apache.org/jira/browse/ARROW-12434 Project: Apache Arrow Issue Type: New Feature Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 5.0.0 Show executed plans with metrics to help with debugging and performance tuning -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12433) [Rust] Builds failing due to new flatbuffer release introducing const generics
Andy Grove created ARROW-12433: -- Summary: [Rust] Builds failing due to new flatbuffer release introducing const generics Key: ARROW-12433 URL: https://issues.apache.org/jira/browse/ARROW-12433 Project: Apache Arrow Issue Type: Bug Affects Versions: 4.0.0 Reporter: Andy Grove I filed [https://github.com/google/flatbuffers/issues/6572] but for now we should pin the dependency to 0.8.3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12432) [Rust] [DataFusion] Add metrics for SortExec
Andy Grove created ARROW-12432: -- Summary: [Rust] [DataFusion] Add metrics for SortExec Key: ARROW-12432 URL: https://issues.apache.org/jira/browse/ARROW-12432 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Fix For: 5.0.0 Add metrics for SortExec -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
Andy Grove created ARROW-12421: -- Summary: [Rust] [DataFusion] topk_query test fails in master Key: ARROW-12421 URL: https://issues.apache.org/jira/browse/ARROW-12421 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Andy Grove
{code:java}
Running target/debug/deps/user_defined_plan-6b63acb904117235

running 3 tests
test topk_plan ... ok
test topk_query ... FAILED
test normal_query ... ok

failures:

---- topk_query stdout ----
thread 'topk_query' panicked at 'assertion failed: `(left == right)`
  left: `["+-+-+", "| customer_id | revenue |", "+-+-+", "| paul| 300 |", "| jorge | 200 |", "| andy| 150 |", "+-+-+"]`,
 right: `["++", "||", "++", "++"]`: output mismatch for Topk context.
Expected:
+-+-+
| customer_id | revenue |
+-+-+
| paul| 300 |
| jorge | 200 |
| andy| 150 |
+-+-+
Actual:
++
||
++
++
', datafusion/tests/user_defined_plan.rs:133:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12403) [Rust] [Ballista] Integration tests should check that query results are correct
Andy Grove created ARROW-12403: -- Summary: [Rust] [Ballista] Integration tests should check that query results are correct Key: ARROW-12403 URL: https://issues.apache.org/jira/browse/ARROW-12403 Project: Apache Arrow Issue Type: Improvement Reporter: Andy Grove Fix For: 5.0.0 The integration tests currently only check that the benchmark queries run without error; they do not verify that the results are correct. I think some work already happened in DataFusion to check the TPC-H results, so hopefully we can re-use that work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12402) [Rust] [DataFusion] Implement SQL metrics framework
Andy Grove created ARROW-12402: -- Summary: [Rust] [DataFusion] Implement SQL metrics framework Key: ARROW-12402 URL: https://issues.apache.org/jira/browse/ARROW-12402 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 As a user, I would like the ability to inspect metrics for an executed plan to help with debugging and performance tuning. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12362) [Rust] [DataFusion] topk_query test failure
Andy Grove created ARROW-12362: -- Summary: [Rust] [DataFusion] topk_query test failure Key: ARROW-12362 URL: https://issues.apache.org/jira/browse/ARROW-12362 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Fix For: 4.0.0 I'm seeing this locally with latest from master.
{code:java}
---- topk_query stdout ----
thread 'topk_query' panicked at 'assertion failed: `(left == right)`
  left: `["+-+-+", "| customer_id | revenue |", "+-+-+", "| paul| 300 |", "| jorge | 200 |", "| andy| 150 |", "+-+-+"]`,
 right: `["++", "||", "++", "++"]`: output mismatch for Topk context.
Expected:
+-+-+
| customer_id | revenue |
+-+-+
| paul| 300 |
| jorge | 200 |
| andy| 150 |
+-+-+
Actual:
++
||
++
++
', datafusion/tests/user_defined_plan.rs:133:5
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12361) [Rust] [DataFusion] Allow users to override physical optimization rules
Andy Grove created ARROW-12361: -- Summary: [Rust] [DataFusion] Allow users to override physical optimization rules Key: ARROW-12361 URL: https://issues.apache.org/jira/browse/ARROW-12361 Project: Apache Arrow Issue Type: Improvement Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 As a user of DataFusion (in Ballista) I would like to override the list of physical optimization rules. It is currently possible to add new rules but not to remove existing rules. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12335) [Rust] [Ballista] Bump DataFusion version
Andy Grove created ARROW-12335: -- Summary: [Rust] [Ballista] Bump DataFusion version Key: ARROW-12335 URL: https://issues.apache.org/jira/browse/ARROW-12335 Project: Apache Arrow Issue Type: Task Components: Rust - Ballista Reporter: Andy Grove Fix For: 4.0.0 Update Ballista to use latest DataFusion version -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12334) [Rust] [Ballista] Aggregate queries producing incorrect results
Andy Grove created ARROW-12334: -- Summary: [Rust] [Ballista] Aggregate queries producing incorrect results Key: ARROW-12334 URL: https://issues.apache.org/jira/browse/ARROW-12334 Project: Apache Arrow Issue Type: Bug Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 I just ran benchmarks for the first time in a while and I see duplicate entries for group by keys. For example, query 1 has "group by l_returnflag, l_linestatus" and I see multiple results with l_returnflag = 'A' and l_linestatus = 'F'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12331) [Rust] [Ballista] Make CI build work with snmalloc
Andy Grove created ARROW-12331: -- Summary: [Rust] [Ballista] Make CI build work with snmalloc Key: ARROW-12331 URL: https://issues.apache.org/jira/browse/ARROW-12331 Project: Apache Arrow Issue Type: Improvement Components: Rust - Ballista Reporter: Andy Grove Fix For: 4.0.0 Ballista was added to CI in [https://github.com/apache/arrow/pull/9979] but is building without default features due to snmalloc requiring cmake. An alternative approach would be to build with cc instead of cmake. See the above PR for conversation about this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12329) [Rust] [Ballista] Add README
Andy Grove created ARROW-12329: -- Summary: [Rust] [Ballista] Add README Key: ARROW-12329 URL: https://issues.apache.org/jira/browse/ARROW-12329 Project: Apache Arrow Issue Type: Task Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 We did not bring a README over in the donation, and we need to write a new one anyway now that this is part of Arrow. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12328) [Rust] [Ballista] Fix code formatting
Andy Grove created ARROW-12328: -- Summary: [Rust] [Ballista] Fix code formatting Key: ARROW-12328 URL: https://issues.apache.org/jira/browse/ARROW-12328 Project: Apache Arrow Issue Type: Improvement Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12313) [Rust] [Ballista] Benchmark documentation out of date
Andy Grove created ARROW-12313: -- Summary: [Rust] [Ballista] Benchmark documentation out of date Key: ARROW-12313 URL: https://issues.apache.org/jira/browse/ARROW-12313 Project: Apache Arrow Issue Type: Bug Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 The scheduler/executor were refactored and the documentation for the benchmarks now needs updating. I plan on fixing this over the weekend. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12284) [Rust] [DataFusion] Review the contract between DataFusion and Arrow
Andy Grove created ARROW-12284: -- Summary: [Rust] [DataFusion] Review the contract between DataFusion and Arrow Key: ARROW-12284 URL: https://issues.apache.org/jira/browse/ARROW-12284 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove I am creating this issue based on the discussion at the sync call earlier today. Apparently DataFusion is not only using the high-level Arrow API but is also accessing Arrow internals directly and this would be one challenge in moving to a majorly refactored Arrow implementation. Perhaps we need to review what the public Arrow API should be and which APIs DataFusion should or should not be using. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12261) [Rust] [Ballista] Ballista should not have its own DataFrame API
Andy Grove created ARROW-12261: -- Summary: [Rust] [Ballista] Ballista should not have its own DataFrame API Key: ARROW-12261 URL: https://issues.apache.org/jira/browse/ARROW-12261 Project: Apache Arrow Issue Type: Task Components: Rust - Ballista Reporter: Andy Grove Fix For: 5.0.0 When building the Ballista POC it was necessary to implement a new DataFrame API that wrapped the DataFusion API. One issue is that it wasn't possible to override the behavior of the collect method to make it use the Ballista context rather than the DataFusion context. Now that the projects are in the same repo it should be easier to fix this and have users always use the DataFusion DataFrame API. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12260) [Website] [Rust] Announce Ballista donation
Andy Grove created ARROW-12260: -- Summary: [Website] [Rust] Announce Ballista donation Key: ARROW-12260 URL: https://issues.apache.org/jira/browse/ARROW-12260 Project: Apache Arrow Issue Type: Task Components: Website Reporter: Andy Grove Assignee: Andy Grove Once the IP clearance vote passes and the PR has been merged, we should announce the donation on the Arrow blog. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12257) [Rust] [Ballista] Publish user guide to Arrow site
Andy Grove created ARROW-12257: -- Summary: [Rust] [Ballista] Publish user guide to Arrow site Key: ARROW-12257 URL: https://issues.apache.org/jira/browse/ARROW-12257 Project: Apache Arrow Issue Type: New Feature Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 5.0.0 Ballista has a user guide in mdbook format and we need to figure out how to get this published to the arrow site (it was previously hosted at https://ballistacompute.org/docs/) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12256) [Rust] [Ballista] Add DataFrame support
Andy Grove created ARROW-12256: -- Summary: [Rust] [Ballista] Add DataFrame support Key: ARROW-12256 URL: https://issues.apache.org/jira/browse/ARROW-12256 Project: Apache Arrow Issue Type: New Feature Components: Rust - Ballista Reporter: Andy Grove Fix For: 5.0.0 Ballista has so far been focused on SQL support rather than DataFrame support. DataFrame support is partially implemented but needs more work to complete. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12255) [Rust] [Ballista] Integrate scheduler with DataFusion
Andy Grove created ARROW-12255: -- Summary: [Rust] [Ballista] Integrate scheduler with DataFusion Key: ARROW-12255 URL: https://issues.apache.org/jira/browse/ARROW-12255 Project: Apache Arrow Issue Type: New Feature Components: Rust - Ballista, Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 5.0.0 The Ballista scheduler breaks a query down into stages based on changes in partitioning in the plan, where each stage is broken down into tasks that can be executed concurrently. Rather than trying to run all the partitions at once, Ballista executors process n concurrent tasks at a time and then request new tasks from the scheduler. This approach would help DataFusion scale better, and it would be ideal to use the same scheduler to scale across cores in DataFusion and across nodes in Ballista. -- This message was sent by Atlassian Jira (v8.3.4#803005)
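The pull-based model described above can be sketched as follows. This is an illustrative toy, not the actual Ballista API; the `Task` and `Scheduler` types and the `poll_tasks` method are hypothetical names chosen for the example.

```rust
// Hypothetical sketch: an executor pulls at most `n` tasks at a time from the
// scheduler instead of trying to run every partition at once.

#[derive(Debug)]
struct Task {
    stage_id: usize,
    partition_id: usize,
}

struct Scheduler {
    pending: Vec<Task>,
}

impl Scheduler {
    /// Hand out up to `n` tasks; the executor calls this again when it finishes.
    fn poll_tasks(&mut self, n: usize) -> Vec<Task> {
        let n = n.min(self.pending.len());
        self.pending.drain(..n).collect()
    }
}

fn main() {
    // One stage with 10 partitions, i.e. 10 tasks.
    let mut scheduler = Scheduler {
        pending: (0..10).map(|p| Task { stage_id: 1, partition_id: p }).collect(),
    };
    // An executor configured for 4 concurrent tasks keeps polling until done.
    let mut completed = 0;
    loop {
        let batch = scheduler.poll_tasks(4);
        if batch.is_empty() {
            break;
        }
        completed += batch.len();
    }
    assert_eq!(completed, 10);
}
```

The point of the design is back-pressure: the executor's concurrency limit, not the number of partitions in the plan, bounds how much work runs at once.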
[jira] [Created] (ARROW-12253) [Rust] [Ballista] Implement scalable joins
Andy Grove created ARROW-12253: -- Summary: [Rust] [Ballista] Implement scalable joins Key: ARROW-12253 URL: https://issues.apache.org/jira/browse/ARROW-12253 Project: Apache Arrow Issue Type: New Feature Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 5.0.0 The main issue limiting scalability in Ballista today is that joins are implemented as hash joins where each partition of the probe side causes the entire left side to be loaded into memory. To make this scalable we need to hash partition left and right inputs so that we can join the left and right partitions in parallel. There is already work underway in DataFusion to implement this that we can leverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
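The hash-partitioning idea can be illustrated with a small sketch (not DataFusion's implementation; `partition_for` and `hash_partition` are made-up helper names): rows with equal join keys always hash to the same partition index, so partition i of the left side only ever needs to be joined with partition i of the right side, and the partition pairs can be processed in parallel.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map a join key to a partition index.
fn partition_for<K: Hash>(key: &K, num_partitions: usize) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() as usize) % num_partitions
}

// Split rows into `num_partitions` buckets by hash of the join key.
fn hash_partition<K: Hash + Clone, V: Clone>(
    rows: &[(K, V)],
    num_partitions: usize,
) -> Vec<Vec<(K, V)>> {
    let mut parts = vec![Vec::new(); num_partitions];
    for (k, v) in rows {
        parts[partition_for(k, num_partitions)].push((k.clone(), v.clone()));
    }
    parts
}

fn main() {
    let left = vec![("a", 1), ("b", 2), ("a", 3)];
    let right = vec![("a", 10), ("c", 20)];
    let lp = hash_partition(&left, 4);
    let rp = hash_partition(&right, 4);
    // All "a" rows from both sides land in the same partition index, so a
    // hash join of lp[p] against rp[p] sees every matching pair.
    let p = partition_for(&"a", 4);
    assert_eq!(lp[p].iter().filter(|(k, _)| *k == "a").count(), 2);
    assert_eq!(rp[p].iter().filter(|(k, _)| *k == "a").count(), 1);
}
```

With this scheme, each worker only builds a hash table for its own left-side partition rather than the entire left input, which is what removes the single-node memory bound.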
[jira] [Created] (ARROW-12252) [Rust] [Ballista] How to continue "This week in Ballista"?
Andy Grove created ARROW-12252: -- Summary: [Rust] [Ballista] How to continue "This week in Ballista"? Key: ARROW-12252 URL: https://issues.apache.org/jira/browse/ARROW-12252 Project: Apache Arrow Issue Type: Task Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove The Ballista project published a weekly newsletter and this has been very effective at building a community around the project. We need to determine how we can continue with something like this, while following the Apache way. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12251) [Rust] [Ballista] Add Ballista tests to CI
Andy Grove created ARROW-12251: -- Summary: [Rust] [Ballista] Add Ballista tests to CI Key: ARROW-12251 URL: https://issues.apache.org/jira/browse/ARROW-12251 Project: Apache Arrow Issue Type: Bug Components: Rust - Ballista Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 Ballista is a standalone project (not part of the Arrow Rust workspace) and therefore the tests will not run in CI without additional work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12250) [Rust] Failing test arrow::arrow_writer::tests::fixed_size_binary_single_column
Andy Grove created ARROW-12250: -- Summary: [Rust] Failing test arrow::arrow_writer::tests::fixed_size_binary_single_column Key: ARROW-12250 URL: https://issues.apache.org/jira/browse/ARROW-12250 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andy Grove Fix For: 4.0.0 I just pulled latest from master (commit d95c72f7f8e61b90c935ecb4e64d3e77648ef6d5) and updated submodules, then ran `cargo clean` followed by `cargo test`. One test fails:
{code:java}
---- arrow::arrow_writer::tests::fixed_size_binary_single_column stdout ----
thread 'arrow::arrow_writer::tests::fixed_size_binary_single_column' panicked at 'called `Result::unwrap()` on an `Err` value: General("Could not parse metadata: protocol error")', parquet/src/arrow/arrow_writer.rs:920:54
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12064) [Rust] [DataFusion] Make DataFrame extensible
Andy Grove created ARROW-12064: -- Summary: [Rust] [DataFusion] Make DataFrame extensible Key: ARROW-12064 URL: https://issues.apache.org/jira/browse/ARROW-12064 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove The DataFrame implementation currently has two types of logic: # Logic for building a logical query plan # Logic for executing a query using the DataFusion context We can make DataFrame more extensible by having it always delegate to the context for execution, allowing the same DataFrame logic to be used for local and distributed execution. We will likely need to introduce a new ExecutionContext trait with different implementations for DataFusion and Ballista. -- This message was sent by Atlassian Jira (v8.3.4#803005)
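The proposed direction can be sketched as below. This is a hypothetical illustration, not an actual DataFusion API: the trait and struct names (`ExecutionContext`, `LocalContext`, `DistributedContext`) and the string-based "plan" are stand-ins chosen for the example.

```rust
// A DataFrame delegates execution to a context trait, so the same DataFrame
// logic can run locally (DataFusion) or distributed (Ballista).

trait ExecutionContext {
    /// Execute a (stand-in) plan and return the result rows.
    fn collect(&self, plan: &str) -> Vec<String>;
}

struct LocalContext;
struct DistributedContext;

impl ExecutionContext for LocalContext {
    fn collect(&self, plan: &str) -> Vec<String> {
        vec![format!("local result of {plan}")]
    }
}

impl ExecutionContext for DistributedContext {
    fn collect(&self, plan: &str) -> Vec<String> {
        vec![format!("distributed result of {plan}")]
    }
}

struct DataFrame {
    plan: String,
    ctx: Box<dyn ExecutionContext>,
}

impl DataFrame {
    fn collect(&self) -> Vec<String> {
        // The DataFrame never executes anything itself; it only builds the
        // plan and hands it to whichever context it was created with.
        self.ctx.collect(&self.plan)
    }
}

fn main() {
    let local = DataFrame { plan: "SELECT 1".into(), ctx: Box::new(LocalContext) };
    let remote = DataFrame { plan: "SELECT 1".into(), ctx: Box::new(DistributedContext) };
    assert_eq!(local.collect(), vec!["local result of SELECT 1"]);
    assert_eq!(remote.collect(), vec!["distributed result of SELECT 1"]);
}
```

The key property is that plan-building logic lives in one place while the execution strategy is swappable behind the trait object.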
[jira] [Created] (ARROW-11982) [Rust] Donate Ballista Distributed Compute Platform
Andy Grove created ARROW-11982: -- Summary: [Rust] Donate Ballista Distributed Compute Platform Key: ARROW-11982 URL: https://issues.apache.org/jira/browse/ARROW-11982 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 See [PR|https://github.com/apache/arrow/pull/9723] for details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11948) [Rust] 3.0.1 patch release
Andy Grove created ARROW-11948: -- Summary: [Rust] 3.0.1 patch release Key: ARROW-11948 URL: https://issues.apache.org/jira/browse/ARROW-11948 Project: Apache Arrow Issue Type: Task Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.1 Spreadsheet where I am tracking the fixes that get merged to maint-3.0.x https://docs.google.com/spreadsheets/d/111k0PGEVzxg1k7Q_d_1kV7E24VRB3DVJP1MnQImVrCc/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11934) [Rust] Document patch release process
Andy Grove created ARROW-11934: -- Summary: [Rust] Document patch release process Key: ARROW-11934 URL: https://issues.apache.org/jira/browse/ARROW-11934 Project: Apache Arrow Issue Type: Task Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.1 Now that we moved to voting on source releases for patch releases, we need to document the process for doing so in the Rust implementation. Google doc for discussion / collaboration: https://docs.google.com/document/d/1i2Elk6J0H4nhPeQZdLDyqvHoRbsabx2iOTXLHxxNqRE/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11852) [Documentation] Update CONTRIBUTING to explain how to self-assign tickets
Andy Grove created ARROW-11852: -- Summary: [Documentation] Update CONTRIBUTING to explain how to self-assign tickets Key: ARROW-11852 URL: https://issues.apache.org/jira/browse/ARROW-11852 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 In CONTRIBUTING.md [1] we should explain that new users need to ask someone to assign them the "Contributing" role so that they can self-assign issues in JIRA. It would also be useful to add notes for committers on how to assign this role to people. [1] https://github.com/apache/arrow/blob/master/CONTRIBUTING.md -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11669) [Rust] [DataFusion] Remove concurrency field from GlobalLimitExec
Andy Grove created ARROW-11669: -- Summary: [Rust] [DataFusion] Remove concurrency field from GlobalLimitExec Key: ARROW-11669 URL: https://issues.apache.org/jira/browse/ARROW-11669 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Fix For: 4.0.0 GlobalLimitExec has an unused concurrency field that can now be removed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11625) [Rust] [DataFusion] Move SortExec partition check to constructor
Andy Grove created ARROW-11625: -- Summary: [Rust] [DataFusion] Move SortExec partition check to constructor Key: ARROW-11625 URL: https://issues.apache.org/jira/browse/ARROW-11625 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Fix For: 4.0.0 SortExec has the following error check at execution time and this could be moved into the try_new constructor so the error check happens at planning time instead.
{code:java}
if 1 != self.input.output_partitioning().partition_count() {
    return Err(DataFusionError::Internal(
        "SortExec requires a single input partition".to_owned(),
    ));
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
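A minimal sketch of the proposed change, using simplified stand-in types (`Input`, `PlanError`) rather than the real SortExec and DataFusionError: performing the partition-count check in `try_new` surfaces the error when the plan is built rather than when it is executed.

```rust
#[derive(Debug)]
struct PlanError(String);

// Stand-in for the input ExecutionPlan; only the partition count matters here.
struct Input {
    partition_count: usize,
}

struct SortExec {
    input: Input,
}

impl SortExec {
    // The same check the ticket quotes from execute(), moved to construction.
    fn try_new(input: Input) -> Result<Self, PlanError> {
        if input.partition_count != 1 {
            return Err(PlanError(
                "SortExec requires a single input partition".to_owned(),
            ));
        }
        Ok(SortExec { input })
    }
}

fn main() {
    // Planning now fails fast instead of deferring the error to execution.
    assert!(SortExec::try_new(Input { partition_count: 1 }).is_ok());
    assert!(SortExec::try_new(Input { partition_count: 4 }).is_err());
}
```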
[jira] [Created] (ARROW-11622) [Rust] [DataFusion] AggregateExpression name inconsistency
Andy Grove created ARROW-11622: -- Summary: [Rust] [DataFusion] AggregateExpression name inconsistency Key: ARROW-11622 URL: https://issues.apache.org/jira/browse/ARROW-11622 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Andy Grove I have an aggregate query and the AggregateExpr has this name:
{code:java}
SUM(l_extendedprice Multiply Int64(1))
{code}
This is hiding the fact that the expression has a CAST operation:
{code:java}
expr: BinaryExpr { left: Column { name: "l_extendedprice" }, op: Multiply, right: CastExpr { expr: Literal { value: Int64(1) }, cast_type: Float64 } }, nullable: true }
{code}
In Ballista, this causes a problem with serde because after a roundtrip the expression has a name that includes the CAST, and this causes a schema mismatch:
{code:java}
SUM(l_extendedprice Multiply CAST(Int64(1) AS Float64))
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11606) [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction
Andy Grove created ARROW-11606: -- Summary: [Rust] [DataFusion] Need guidance on HashAggregateExec reconstruction Key: ARROW-11606 URL: https://issues.apache.org/jira/browse/ARROW-11606 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove We have run into an issue in the Ballista project where we are reconstructing the Final and Partial HashAggregateExec operators [1] for distributed execution and we need some guidance. The Partial HashAggregateExec gets created OK and executes correctly. However, when we create the Final HashAggregateExec, it does not find the expected schema in the input operator. The partial exec outputs field names ending with "[sum]", "[count]", and so on, but the final aggregate doesn't seem to be looking for those names. It is also worth noting that the Final and Partial executors are not connected directly in this usage. The Partial exec is executed and its output streamed to disk; the Final exec then runs against the output from the Partial exec. We may need to make changes in DataFusion to allow other crates to support this kind of use case. [1] https://github.com/ballista-compute/ballista/pull/491 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11545) [Rust] [DataFusion] SendableRecordBatchStream should implement Sync
Andy Grove created ARROW-11545: -- Summary: [Rust] [DataFusion] SendableRecordBatchStream should implement Sync Key: ARROW-11545 URL: https://issues.apache.org/jira/browse/ARROW-11545 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove SendableRecordBatchStream should implement Sync. I ran into issues in Ballista when trying to use this type in some circumstances. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11544) [Rust] [DataFusion] Implement as_any for AggregateExpr
Andy Grove created ARROW-11544: -- Summary: [Rust] [DataFusion] Implement as_any for AggregateExpr Key: ARROW-11544 URL: https://issues.apache.org/jira/browse/ARROW-11544 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 Implement as_any for AggregateExpr so it can be downcast to a known implementation. Ballista needs this for serde. -- This message was sent by Atlassian Jira (v8.3.4#803005)
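The `as_any` pattern the ticket asks for looks roughly like this. The types here (`Sum` and its field) are illustrative stand-ins, not DataFusion's actual definitions: a trait object exposes `&dyn Any` so callers such as Ballista's serde can downcast to the concrete implementation.

```rust
use std::any::Any;

trait AggregateExpr {
    /// Expose self as `Any` so callers can downcast to the concrete type.
    fn as_any(&self) -> &dyn Any;
    fn name(&self) -> &str;
}

// A stand-in concrete aggregate expression.
struct Sum {
    column: String,
}

impl AggregateExpr for Sum {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn name(&self) -> &str {
        "SUM"
    }
}

fn main() {
    let expr: Box<dyn AggregateExpr> = Box::new(Sum { column: "l_extendedprice".into() });
    // Serde code can now recover the concrete implementation and its fields,
    // which is impossible through the trait interface alone.
    let sum = expr.as_any().downcast_ref::<Sum>().expect("expected a Sum");
    assert_eq!(sum.column, "l_extendedprice");
    assert_eq!(expr.name(), "SUM");
}
```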
[jira] [Created] (ARROW-11454) [Website] [Rust] 3.0.0 Blog Post
Andy Grove created ARROW-11454: -- Summary: [Website] [Rust] 3.0.0 Blog Post Key: ARROW-11454 URL: https://issues.apache.org/jira/browse/ARROW-11454 Project: Apache Arrow Issue Type: Improvement Components: Rust, Website Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11440) [Rust] [DataFusion] Add method to CsvExec to get CSV schema
Andy Grove created ARROW-11440: -- Summary: [Rust] [DataFusion] Add method to CsvExec to get CSV schema Key: ARROW-11440 URL: https://issues.apache.org/jira/browse/ARROW-11440 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 4.0.0 It is not possible to inspect an already created CsvExec and determine what the underlying schema is -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11360) [Rust] [DataFusion] Improve CSV "No files found" error message
Andy Grove created ARROW-11360: -- Summary: [Rust] [DataFusion] Improve CSV "No files found" error message Key: ARROW-11360 URL: https://issues.apache.org/jira/browse/ARROW-11360 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Fix For: 4.0.0 There are two places in DataFusion where the error message "No files found" is returned if no CSV files can be found in the specified directory with the specified file extension. It would be much easier to debug issues if this error message stated the directory and file extension. -- This message was sent by Atlassian Jira (v8.3.4#803005)
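A sketch of the suggested improvement (the helper name and message wording are illustrative, not DataFusion code): the error carries the directory and file extension that were searched, instead of a bare "No files found".

```rust
// Build the improved error message with the search context included.
fn build_error(dir: &str, ext: &str) -> String {
    format!("No files found in directory '{dir}' with file extension '{ext}'")
}

fn main() {
    let msg = build_error("/data/tpch", ".csv");
    // The message now tells the user exactly where and for what the search failed.
    assert!(msg.contains("/data/tpch"));
    assert!(msg.contains(".csv"));
}
```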
[jira] [Created] (ARROW-11276) [Rust] [DataFusion] Make MemoryStream public
Andy Grove created ARROW-11276: -- Summary: [Rust] [DataFusion] Make MemoryStream public Key: ARROW-11276 URL: https://issues.apache.org/jira/browse/ARROW-11276 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove I found the need to take a copy of MemoryStream for use in another project. It would be nice if we could expose this as a supported public API so that other projects building physical operators can re-use it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11252) [Rust] [DataFusion] Add DataFusion to h2oai/db-benchmark
Andy Grove created ARROW-11252: -- Summary: [Rust] [DataFusion] Add DataFusion to h2oai/db-benchmark Key: ARROW-11252 URL: https://issues.apache.org/jira/browse/ARROW-11252 Project: Apache Arrow Issue Type: Task Components: Rust - DataFusion Reporter: Andy Grove Fix For: 4.0.0 I would like to see DataFusion added to h2oai/db-benchmark so that we can see how we compare to other solutions (including Pandas, Spark, cuDF, and Polars). Since Polars (another Rust DataFrame library that uses Arrow) has already been added, I am hoping that we can learn from their scripts. There is an issue filed against db-benchmark for adding DataFusion: https://github.com/h2oai/db-benchmark/issues/107 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11200) [Rust] [DataFusion] Physical operators and expressions should have public accessor methods
Andy Grove created ARROW-11200: -- Summary: [Rust] [DataFusion] Physical operators and expressions should have public accessor methods Key: ARROW-11200 URL: https://issues.apache.org/jira/browse/ARROW-11200 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 Physical operators and expressions should have public accessor methods that expose the values used to construct them. This allows projects that use DataFusion to serialize a physical plan. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11195) [Rust] [DataFusiion] Built-in table providers should expose relevant fields
Andy Grove created ARROW-11195: -- Summary: [Rust] [DataFusion] Built-in table providers should expose relevant fields Key: ARROW-11195 URL: https://issues.apache.org/jira/browse/ARROW-11195 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 For projects that depend on the DataFusion logical plan, there is a breaking change currently compared to 2.0.0 where it is not possible to introspect the TableScan node for CSV and Parquet files to get the file path and other meta-data (such as the delimiter for CSV). I propose adding public methods to the specific providers so that other crates can downcast to specific implementations in order to access this data. My specific use case is the ability to serialize a DataFusion logical plan. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11150) [Rust] Set up bi-weekly Rust sync call and update website
Andy Grove created ARROW-11150: -- Summary: [Rust] Set up bi-weekly Rust sync call and update website Key: ARROW-11150 URL: https://issues.apache.org/jira/browse/ARROW-11150 Project: Apache Arrow Issue Type: Task Components: Rust Reporter: Andy Grove Assignee: Andy Grove Given the momentum on the Rust implementation, I am going to set up a bi-weekly sync call on Google Meet most likely. The call will be at the same time as the current sync call but on alternate weeks. I will update the web site to list both calls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11094) [Rust] [DataFusion] Implement Sort-Merge Join
Andy Grove created ARROW-11094: -- Summary: [Rust] [DataFusion] Implement Sort-Merge Join Key: ARROW-11094 URL: https://issues.apache.org/jira/browse/ARROW-11094 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Fix For: 4.0.0 The current hash join works well when one side of the join can be loaded into memory but cannot scale beyond the available RAM. The advantage of implementing SMJ (Sort-Merge Join) is that we can sort the left and right partitions in parallel and then stream both sides of the join by merging these sorted partitions and we do not need to load one side into memory. At most, we need to load all batches from both sides that contain the current join key values. https://en.wikipedia.org/wiki/Sort-merge_join -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11093) [Rust] [DataFusion] RFC Roadmap for 2021
Andy Grove created ARROW-11093: -- Summary: [Rust] [DataFusion] RFC Roadmap for 2021 Key: ARROW-11093 URL: https://issues.apache.org/jira/browse/ARROW-11093 Project: Apache Arrow Issue Type: Task Components: Rust Reporter: Andy Grove Assignee: Andy Grove Given the momentum and number of contributors involved in the Rust implementation, I think it would be useful to crowdsource a roadmap for the next few releases that we expect to release in 2021. We have a small number of active committers on the project currently and it is hard for us to keep up with all the PRs sometimes, especially when so many different areas are being contributed to. It would be helpful if we can co-ordinate to prioritize work for the release. Of course, this is open source, and anyone can contribute anything at any time, but it would be nice to have some areas that we all agree are the main priorities. I will create a PR to kick start this discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11068) [Rust] [DataFusion] Wrap HashJoinExec in CoalesceBatchExec
Andy Grove created ARROW-11068: -- Summary: [Rust] [DataFusion] Wrap HashJoinExec in CoalesceBatchExec Key: ARROW-11068 URL: https://issues.apache.org/jira/browse/ARROW-11068 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 Once [https://github.com/apache/arrow/pull/9043] is merged, we should extend this to wrap join output as well. Rather than hard-code a list of operators that need to be wrapped, we should find a more generic mechanism so that plans can declare if their input and/or output batches should be coalesced (similar to how we handle partitioning) and this would allow custom operators outside of DataFusion to benefit from this optimization. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11059) [Rust] [DataFusion] Implement extensible configuration mechanism
Andy Grove created ARROW-11059: -- Summary: [Rust] [DataFusion] Implement extensible configuration mechanism Key: ARROW-11059 URL: https://issues.apache.org/jira/browse/ARROW-11059 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 We are getting to the point where there are multiple settings we could add to operators to fine-tune performance. Custom operators provided by crates that extend DataFusion may also need this capability. I propose that we add support for key-value configuration options so that we don't need to plumb through each new configuration setting that we add. For example. I am about to start on a "coalesce batches" operator and I would like a setting such as "coalesce.batch.size". For built-in settings like this we can provide information such as documentation and default values and generate documentation from this. For example, here is how Spark defines configs: {code:java} val PARQUET_VECTORIZED_READER_ENABLED = buildConf("spark.sql.parquet.enableVectorizedReader") .doc("Enables vectorized parquet decoding.") .version("2.0.0") .booleanConf .createWithDefault(true) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
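The key-value mechanism proposed above can be sketched as a thin registry with typed getters and defaults. The `ConfigOptions` API below is hypothetical; only the option name "coalesce.batch.size" comes from the proposal.

```rust
use std::collections::HashMap;

// Sketch of an extensible key-value configuration registry.
struct ConfigOptions {
    settings: HashMap<String, String>,
}

impl ConfigOptions {
    fn new() -> Self {
        Self { settings: HashMap::new() }
    }

    fn set(&mut self, key: &str, value: &str) {
        self.settings.insert(key.to_string(), value.to_string());
    }

    // Typed getter with a documented default, so new settings need no plumbing.
    fn get_u64(&self, key: &str, default: u64) -> u64 {
        self.settings
            .get(key)
            .and_then(|v| v.parse().ok())
            .unwrap_or(default)
    }
}

fn main() {
    let mut config = ConfigOptions::new();
    // Unset keys fall back to the built-in default.
    assert_eq!(config.get_u64("coalesce.batch.size", 4096), 4096);
    config.set("coalesce.batch.size", "8192");
    assert_eq!(config.get_u64("coalesce.batch.size", 4096), 8192);
}
```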
[jira] [Created] (ARROW-11058) [Rust] [DataFusion] Implement "coalesce batches" operator
Andy Grove created ARROW-11058: -- Summary: [Rust] [DataFusion] Implement "coalesce batches" operator Key: ARROW-11058 URL: https://issues.apache.org/jira/browse/ARROW-11058 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 When we have a FilterExec in the plan, it can produce lots of small batches and we therefore lose efficiency of vectorized operations. We should implement a new CoalesceBatchExec and wrap every FilterExec with one of these so that small batches can be recombined into larger batches to improve the efficiency of upstream operators. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11056) [Rust] [DataFusion] Allow ParquetExec to parallelize work based on row groups
Andy Grove created ARROW-11056: -- Summary: [Rust] [DataFusion] Allow ParquetExec to parallelize work based on row groups Key: ARROW-11056 URL: https://issues.apache.org/jira/browse/ARROW-11056 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove ParquetExec currently parallelizes work by passing individual files to threads. It would be nice to be able to do this in a finer-grained way by assigning row groups and/or column chunks instead. This will be especially important in distributed systems built on DataFusion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11053) [Rust] [DataFusion] Optimize joins with dynamic capacity for output batches
Andy Grove created ARROW-11053: -- Summary: [Rust] [DataFusion] Optimize joins with dynamic capacity for output batches Key: ARROW-11053 URL: https://issues.apache.org/jira/browse/ARROW-11053 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 Rather than using the size of the left or right batches to determine the capacity of the output batches we can use the average size of previous output batches. -- This message was sent by Atlassian Jira (v8.3.4#803005)
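The averaging idea above can be sketched as a small estimator that records previous output batch sizes and suggests the next allocation. The struct and method names are illustrative, not DataFusion API.

```rust
// Track the average size of previous output batches and use it to size the
// next one, instead of sizing from the left or right input batches.
struct OutputSizeEstimator {
    total_rows: usize,
    batches: usize,
}

impl OutputSizeEstimator {
    fn new() -> Self {
        Self { total_rows: 0, batches: 0 }
    }

    fn record(&mut self, rows: usize) {
        self.total_rows += rows;
        self.batches += 1;
    }

    // Capacity for the next batch: the running average, or a default before
    // any output has been produced.
    fn next_capacity(&self, default: usize) -> usize {
        if self.batches == 0 {
            default
        } else {
            self.total_rows / self.batches
        }
    }
}

fn main() {
    let mut est = OutputSizeEstimator::new();
    assert_eq!(est.next_capacity(1024), 1024); // no history yet
    est.record(100);
    est.record(300);
    assert_eq!(est.next_capacity(1024), 200); // average of 100 and 300
}
```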
[jira] [Created] (ARROW-11052) [Rust] [DataFusion] Implement metrics in join operator
Andy Grove created ARROW-11052: -- Summary: [Rust] [DataFusion] Implement metrics in join operator Key: ARROW-11052 URL: https://issues.apache.org/jira/browse/ARROW-11052 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 Implement metrics in join operator to make it easier to debug performance issues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11047) [Rust] [DataFusion] ParquetTable should avoid scanning all files twice
Andy Grove created ARROW-11047: -- Summary: [Rust] [DataFusion] ParquetTable should avoid scanning all files twice Key: ARROW-11047 URL: https://issues.apache.org/jira/browse/ARROW-11047 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove ParquetTable currently reads the metadata for all files once in the constructor in order to get the schema, and does it again each time scan() is called. We could read the metadata once and cache it instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11031) [Rust] Update assignees in JIRA where missing
Andy Grove created ARROW-11031: -- Summary: [Rust] Update assignees in JIRA where missing Key: ARROW-11031 URL: https://issues.apache.org/jira/browse/ARROW-11031 Project: Apache Arrow Issue Type: Task Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 I have started updating tasks in Jira that were missing assignees. I see that [~nevi_me] [~alamb] [~jorgecarleitao] have asked about how to do this in the past so I will briefly explain here. First you need admin permissions to Jira (I will grant these to you now). Next you need to add new contributors to the "contributors" role in JIRA. To do this, go to the project setting for Apache Arrow. Select "users and roles" on the left. Click "add user to role" in top right. Then enter the user name and choose the "contributor" role. After that you can assign issues to that user (and they can self-assign too) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11030) [Rust] [DataFusion] TPC-H q12 does not finish with SF=100 Parquet input
Andy Grove created ARROW-11030: -- Summary: [Rust] [DataFusion] TPC-H q12 does not finish with SF=100 Parquet input Key: ARROW-11030 URL: https://issues.apache.org/jira/browse/ARROW-11030 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 TPC-H q12 does not finish with SF=100 Parquet input. Either I have something wrong with my local setup or there is a major performance regression. I will investigate more after the holidays. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11029) [Rust] [DataFusion] Join order optimization does not work with filter pushdown
Andy Grove created ARROW-11029: -- Summary: [Rust] [DataFusion] Join order optimization does not work with filter pushdown Key: ARROW-11029 URL: https://issues.apache.org/jira/browse/ARROW-11029 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Andy Grove Fix For: 3.0.0 When I run TPC-H query 12, I see that the join order is not optimized to put the smaller table on the left. I added some debug logging and see that the optimization sees row count of None on the left side because there is a filter wrapping the table scan. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11022) [Rust] [DataFusion] Upgrade to tokio 1.0
Andy Grove created ARROW-11022: -- Summary: [Rust] [DataFusion] Upgrade to tokio 1.0 Key: ARROW-11022 URL: https://issues.apache.org/jira/browse/ARROW-11022 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Fix For: 3.0.0 https://tokio.rs/blog/2020-12-tokio-1-0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11020) [Rust] [DataFusion] Implement better tests for ParquetExec
Andy Grove created ARROW-11020: -- Summary: [Rust] [DataFusion] Implement better tests for ParquetExec Key: ARROW-11020 URL: https://issues.apache.org/jira/browse/ARROW-11020 Project: Apache Arrow Issue Type: Test Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 Implement better tests for ParquetExec -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11019) [Rust] [DataFusion] Add support for reading partitioned Parquet files
Andy Grove created ARROW-11019: -- Summary: [Rust] [DataFusion] Add support for reading partitioned Parquet files Key: ARROW-11019 URL: https://issues.apache.org/jira/browse/ARROW-11019 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Add support for reading Parquet files that are partitioned by key where the files are under a directory structure based on partition keys and values. /path/to/files/KEY1=value/KEY2=value/files -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11017) [Rust] [DataFusion] Add support for Parquet schema merging
Andy Grove created ARROW-11017: -- Summary: [Rust] [DataFusion] Add support for Parquet schema merging Key: ARROW-11017 URL: https://issues.apache.org/jira/browse/ARROW-11017 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Add support for Parquet schema merging so that we can read data sets where some files have additional columns. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11016) [Rust] Parquet ArrayReader should allow reading a subset of row groups
Andy Grove created ARROW-11016: -- Summary: [Rust] Parquet ArrayReader should allow reading a subset of row groups Key: ARROW-11016 URL: https://issues.apache.org/jira/browse/ARROW-11016 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Andy Grove Parquet ArrayReader currently only supports reading an entire file from start to finish and does not allow selectively reading a subset of row groups. This prevents us from parallelizing work across threads when processing a single parquet file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11014) [Rust] [DataFusion] ParquetExec reports incorrect statistics
Andy Grove created ARROW-11014: -- Summary: [Rust] [DataFusion] ParquetExec reports incorrect statistics Key: ARROW-11014 URL: https://issues.apache.org/jira/browse/ARROW-11014 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove ParquetExec represents one or more Parquet files and currently only returns statistics based on the first file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11012) [Rust] [DataFusion] Make write_csv and write_parquet concurrent
Andy Grove created ARROW-11012: -- Summary: [Rust] [DataFusion] Make write_csv and write_parquet concurrent Key: ARROW-11012 URL: https://issues.apache.org/jira/browse/ARROW-11012 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove ExecutionContext.write_csv and write_parquet currently iterate over the output partitions and execute one at a time and write the results out. We should run these as tokio tasks so they can run concurrently. This should, in theory, help with memory usage when the plan contains repartition operators. We may want to add a configuration option so we can choose between serial and parallel writes? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11011) [Rust] [DataFusion] Implement hash partitioning
Andy Grove created ARROW-11011: -- Summary: [Rust] [DataFusion] Implement hash partitioning Key: ARROW-11011 URL: https://issues.apache.org/jira/browse/ARROW-11011 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Once https://issues.apache.org/jira/browse/ARROW-10582 is implemented, we should add support for hash partitioning. The logical and physical plans already support creating a plan with hash partitioning, but the execution needs implementing in repartition.rs -- This message was sent by Atlassian Jira (v8.3.4#803005)
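The core of the repartition step is routing each row to an output partition by hashing its key. A sketch using the standard library's hasher (the real implementation would hash Arrow array values batch-wise):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Route a row to one of `num_partitions` outputs by hashing its key.
fn hash_partition<T: Hash>(key: &T, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % num_partitions
}

fn main() {
    let p = hash_partition(&42i32, 8);
    assert!(p < 8);
    // The same key always lands in the same partition, which is what makes
    // hash-partitioned joins and aggregations correct.
    assert_eq!(p, hash_partition(&42i32, 8));
}
```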
[jira] [Created] (ARROW-10999) [Rust] TPC-H parquet files cannot be read by Apache Spark
Andy Grove created ARROW-10999: -- Summary: [Rust] TPC-H parquet files cannot be read by Apache Spark Key: ARROW-10999 URL: https://issues.apache.org/jira/browse/ARROW-10999 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 The TPC-H parquet files generated by the benchmark crate cannot be read by Apache Spark because they use unsigned ints, which cannot be read in Spark (I am guessing because Java only has signed ints). I would like to use the same data sets for benchmarking DataFusion, Apache Spark, and other tools. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10995) [Rust] [DataFusion] Improve parallelism when reading Parquet files
Andy Grove created ARROW-10995: -- Summary: [Rust] [DataFusion] Improve parallelism when reading Parquet files Key: ARROW-10995 URL: https://issues.apache.org/jira/browse/ARROW-10995 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Currently the unit of parallelism is the number of parquet files being read. For example, if we run a query against a Parquet table that consists of 8 partitions then we will attempt to run 8 async tasks in parallel and if there is a single Parquet file then we will only try and run 1 async task so this does not scale well. A better approach would be to have one parallel task per "chunk" in each parquet file. This would involve an upfront step in the planner to scan the parquet meta-data to get a list of chunks and then split these up between the configured number of parallel tasks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10994) [Rust] Fix bugs in TPC-H file conversion
Andy Grove created ARROW-10994: -- Summary: [Rust] Fix bugs in TPC-H file conversion Key: ARROW-10994 URL: https://issues.apache.org/jira/browse/ARROW-10994 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 The utility in tpch for converting tbl files to CSV/Parquet no longer works now that we have multiple tables. The UX is also terrible and the tool only supports generating uncompressed parquet files. I'm going to create one PR to fix these things to make this more usable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10985) [Rust] Update unsafe guidelines for adding JIRA references
Andy Grove created ARROW-10985: -- Summary: [Rust] Update unsafe guidelines for adding JIRA references Key: ARROW-10985 URL: https://issues.apache.org/jira/browse/ARROW-10985 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 We should document known soundness issues in Jira issues and reference them from the source code. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10984) [Rust] Document use of unsafe in parquet crate
Andy Grove created ARROW-10984: -- Summary: [Rust] Document use of unsafe in parquet crate Key: ARROW-10984 URL: https://issues.apache.org/jira/browse/ARROW-10984 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Andy Grove Fix For: 3.0.0 There are ~64 uses of unsafe in the parquet crate {code:java} ./parquet/src/util/hash_util.rs:6 ./parquet/src/util/bit_packing.rs:34 ./parquet/src/util/bit_util.rs:1 ./parquet/src/data_type.rs:12 ./parquet/src/arrow/record_reader.rs:5 ./parquet/src/arrow/array_reader.rs:8 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10983) [Rust] Document use of unsafe in datatypes.rs
Andy Grove created ARROW-10983: -- Summary: [Rust] Document use of unsafe in datatypes.rs Key: ARROW-10983 URL: https://issues.apache.org/jira/browse/ARROW-10983 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Andy Grove Fix For: 3.0.0 There are ~12 uses of unsafe in datatypes and we should document them according to the guidelines in the Arrow crate README {code:java} // JUSTIFICATION // Benefit // Describe the benefit of using unsafe. E.g. // "30% performance degradation if the safe counterpart is used, see bench X." // Soundness // Describe why the code remains sound (according to the definition of rust's unsafe code guidelines). E.g. // "We bounded check these values at initialization and the array is immutable." let ... = unsafe { ... }; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10981) [Rust] Document use of unsafe in remaining files
Andy Grove created ARROW-10981: -- Summary: [Rust] Document use of unsafe in remaining files Key: ARROW-10981 URL: https://issues.apache.org/jira/browse/ARROW-10981 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Andy Grove Fix For: 3.0.0 ./arrow/src/bytes.rs:5 ./arrow/src/arch/avx512.rs:4 ./arrow/src/bitmap.rs:1 ./arrow/src/zz_memory_check.rs:1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10982) [Rust] Document use of unsafe in ffi.rs
Andy Grove created ARROW-10982: -- Summary: [Rust] Document use of unsafe in ffi.rs Key: ARROW-10982 URL: https://issues.apache.org/jira/browse/ARROW-10982 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Andy Grove Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10980) [Rust] Document use of unsafe in src/arrays
Andy Grove created ARROW-10980: -- Summary: [Rust] Document use of unsafe in src/arrays Key: ARROW-10980 URL: https://issues.apache.org/jira/browse/ARROW-10980 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Andy Grove Fix For: 3.0.0 This needs to be broken down into smaller tasks -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10978) [Rust] Document use of unsafe in utils
Andy Grove created ARROW-10978: -- Summary: [Rust] Document use of unsafe in utils Key: ARROW-10978 URL: https://issues.apache.org/jira/browse/ARROW-10978 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Andy Grove Fix For: 3.0.0 There are ~18 uses of unsafe in src/utils and we should document them according to the guidelines in the Arrow crate README {code:java} // JUSTIFICATION // Benefit // Describe the benefit of using unsafe. E.g. // "30% performance degradation if the safe counterpart is used, see bench X." // Soundness // Describe why the code remains sound (according to the definition of rust's unsafe code guidelines). E.g. // "We bounded check these values at initialization and the array is immutable." let ... = unsafe { ... }; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10977) [Rust] Document use of unsafe in builder.rs
Andy Grove created ARROW-10977: -- Summary: [Rust] Document use of unsafe in builder.rs Key: ARROW-10977 URL: https://issues.apache.org/jira/browse/ARROW-10977 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Andy Grove Fix For: 3.0.0 There are ~6 uses of unsafe in builder.rs and we should document them according to the guidelines in the Arrow crate README {code:java} // JUSTIFICATION // Benefit // Describe the benefit of using unsafe. E.g. // "30% performance degradation if the safe counterpart is used, see bench X." // Soundness // Describe why the code remains sound (according to the definition of rust's unsafe code guidelines). E.g. // "We bounded check these values at initialization and the array is immutable." let ... = unsafe { ... }; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10976) [Rust] Document use of unsafe in buffer.rs
Andy Grove created ARROW-10976: -- Summary: [Rust] Document use of unsafe in buffer.rs Key: ARROW-10976 URL: https://issues.apache.org/jira/browse/ARROW-10976 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Andy Grove Fix For: 3.0.0 There are ~32 uses of unsafe in buffer.rs and we should document them according to the guidelines in the Arrow crate README {code:java} // JUSTIFICATION // Benefit // Describe the benefit of using unsafe. E.g. // "30% performance degradation if the safe counterpart is used, see bench X." // Soundness // Describe why the code remains sound (according to the definition of rust's unsafe code guidelines). E.g. // "We bounded check these values at initialization and the array is immutable." let ... = unsafe { ... }; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10975) [Rust] Document use of unsafe in IPC
Andy Grove created ARROW-10975: -- Summary: [Rust] Document use of unsafe in IPC Key: ARROW-10975 URL: https://issues.apache.org/jira/browse/ARROW-10975 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Andy Grove Fix For: 3.0.0 There are ~6 uses of unsafe in IPC and we should document them according to the guidelines in the Arrow crate README {code:java} // JUSTIFICATION // Benefit // Describe the benefit of using unsafe. E.g. // "30% performance degradation if the safe counterpart is used, see bench X." // Soundness // Describe why the code remains sound (according to the definition of rust's unsafe code guidelines). E.g. // "We bounded check these values at initialization and the array is immutable." let ... = unsafe { ... }; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10974) [Rust] Document use of unsafe in memory.rs
Andy Grove created ARROW-10974: -- Summary: [Rust] Document use of unsafe in memory.rs Key: ARROW-10974 URL: https://issues.apache.org/jira/browse/ARROW-10974 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Andy Grove Fix For: 3.0.0 There are ~10 uses of unsafe in memory.rs and we should document them according to the guidelines in the Arrow crate README {code:java} // JUSTIFICATION // Benefit // Describe the benefit of using unsafe. E.g. // "30% performance degradation if the safe counterpart is used, see bench X." // Soundness // Describe why the code remains sound (according to the definition of rust's unsafe code guidelines). E.g. // "We bounded check these values at initialization and the array is immutable." let ... = unsafe { ... }; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10973) [Rust] Document use of unsafe in compute kernels
Andy Grove created ARROW-10973: -- Summary: [Rust] Document use of unsafe in compute kernels Key: ARROW-10973 URL: https://issues.apache.org/jira/browse/ARROW-10973 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Andy Grove Fix For: 3.0.0 There are ~7 uses of unsafe in the compute kernels and we should document them according to the guidelines in the Arrow crate README {code:java} // JUSTIFICATION // Benefit // Describe the benefit of using unsafe. E.g. // "30% performance degradation if the safe counterpart is used, see bench X." // Soundness // Describe why the code remains sound (according to the definition of rust's unsafe code guidelines). E.g. // "We bounded check these values at initialization and the array is immutable." let ... = unsafe { ... }; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10972) [Rust] RFC Roadmap for 4.0.0
Andy Grove created ARROW-10972: -- Summary: [Rust] RFC Roadmap for 4.0.0 Key: ARROW-10972 URL: https://issues.apache.org/jira/browse/ARROW-10972 Project: Apache Arrow Issue Type: Task Components: Rust Reporter: Andy Grove Assignee: Andy Grove Given the momentum and number of contributors involved in the Rust implementation, I think it would be useful to crowdsource a roadmap for the next release. We have a small number of active committers on the project currently and it is hard for us to keep up with all the PRs sometimes, especially when so many different areas are being contributed to. It would be helpful if we can co-ordinate to prioritize work for the release. Of course, this is open source, and anyone can contribute anything at any time, but it would be nice to have some areas that we all agree are the main priorities. I will create a PR to kick start this discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10964) [Rust] [DataFusion] Optimize nested joins
Andy Grove created ARROW-10964: -- Summary: [Rust] [DataFusion] Optimize nested joins Key: ARROW-10964 URL: https://issues.apache.org/jira/browse/ARROW-10964 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Once [https://github.com/apache/arrow/pull/8961] is merged, we have an optimization for a JOIN that operates on two tables. The next step is to extend this optimization to work with nested joins, and this is not trivial. See discussion in [https://github.com/apache/arrow/pull/8961] for context. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10943) [Rust] Intermittent build failure in parquet encoding
Andy Grove created ARROW-10943: -- Summary: [Rust] Intermittent build failure in parquet encoding Key: ARROW-10943 URL: https://issues.apache.org/jira/browse/ARROW-10943 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andy Grove I saw this test failure locally {code:java} encodings::encoding::tests::test_bool stdout thread 'encodings::encoding::tests::test_bool' panicked at 'Invalid byte when reading bool', parquet/src/util/bit_util.rs:73:18 {code} I ran "cargo test" again and it passed -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10920) [Rust] Segmentation fault in Arrow Parquet writer with huge arrays
Andy Grove created ARROW-10920: -- Summary: [Rust] Segmentation fault in Arrow Parquet writer with huge arrays Key: ARROW-10920 URL: https://issues.apache.org/jira/browse/ARROW-10920 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andy Grove I stumbled across this by chance. I am not too surprised that this fails but I would expect it to fail gracefully and not with a segmentation fault. {code:java} use std::fs::File; use std::sync::Arc; use arrow::array::StringBuilder; use arrow::datatypes::{DataType, Field, Schema}; use arrow::error::Result; use arrow::record_batch::RecordBatch; use parquet::arrow::ArrowWriter; fn main() -> Result<()> { let schema = Schema::new(vec![ Field::new("c0", DataType::Utf8, false), Field::new("c1", DataType::Utf8, true), ]); let batch_size = 250; let repeat_count = 140; let file = File::create("/tmp/test.parquet")?; let mut writer = ArrowWriter::try_new(file, Arc::new(schema.clone()), None).unwrap(); let mut c0_builder = StringBuilder::new(batch_size); let mut c1_builder = StringBuilder::new(batch_size); println!("Start of loop"); for i in 0..batch_size { let c0_value = format!("{:032}", i); let c1_value = c0_value.repeat(repeat_count); c0_builder.append_value(&c0_value)?; c1_builder.append_value(&c1_value)?; } println!("Finish building c0"); let c0 = Arc::new(c0_builder.finish()); println!("Finish building c1"); let c1 = Arc::new(c1_builder.finish()); println!("Creating RecordBatch"); let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![c0, c1])?; // write the batch to parquet println!("Writing RecordBatch"); writer.write(&batch).unwrap(); println!("Closing writer"); writer.close().unwrap(); Ok(()) } {code} output: {code:java} Start of loop Finish building c0 Finish building c1 Creating RecordBatch Writing RecordBatch Segmentation fault (core dumped) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10894) [Rust] [DataFusion] Optimizer rules should work with qualified column names
Andy Grove created ARROW-10894: -- Summary: [Rust] [DataFusion] Optimizer rules should work with qualified column names Key: ARROW-10894 URL: https://issues.apache.org/jira/browse/ARROW-10894 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove We have a number of optimization rules that deal with column names represented by strings. In order to support qualified field names in queries, we need to update these rules to work with a data structure representing optionally qualified column names instead. Suggested data structure:
{code:java}
struct ColumnName {
    qualifier: Option<String>,
    name: String,
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
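A minimal standalone sketch of how such a structure might behave. The `flat_name` helper, the constructor, and the derives are illustrative assumptions, not DataFusion API:

```rust
// Hypothetical sketch of an optionally qualified column name, as suggested
// in the issue. `flat_name` is an illustrative helper, not a DataFusion method.
#[derive(Debug, Clone, PartialEq)]
struct ColumnName {
    qualifier: Option<String>,
    name: String,
}

impl ColumnName {
    fn new(qualifier: Option<&str>, name: &str) -> Self {
        ColumnName {
            qualifier: qualifier.map(|q| q.to_string()),
            name: name.to_string(),
        }
    }

    /// Render as "qualifier.name" when qualified, or just "name" otherwise.
    fn flat_name(&self) -> String {
        match &self.qualifier {
            Some(q) => format!("{}.{}", q, self.name),
            None => self.name.clone(),
        }
    }
}

fn main() {
    let qualified = ColumnName::new(Some("t1"), "id");
    let bare = ColumnName::new(None, "id");
    println!("{}", qualified.flat_name()); // t1.id
    println!("{}", bare.flat_name()); // id
    // A qualified and an unqualified reference are distinct values, which is
    // what would let optimizer rules disambiguate columns from different tables.
    assert_ne!(qualified, bare);
}
```

Comparing structured values rather than formatted strings is the point of the change: two columns named `id` from different tables no longer collide.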
[jira] [Created] (ARROW-10890) [Rust] [DataFusion] JOIN support
Andy Grove created ARROW-10890: -- Summary: [Rust] [DataFusion] JOIN support Key: ARROW-10890 URL: https://issues.apache.org/jira/browse/ARROW-10890 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Fix For: 3.0.0 This is the tracking issue for JOIN support. See sub-tasks for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10889) [Rust] Document our approach to unsafe code in README
Andy Grove created ARROW-10889: -- Summary: [Rust] Document our approach to unsafe code in README Key: ARROW-10889 URL: https://issues.apache.org/jira/browse/ARROW-10889 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Andy Grove Fix For: 3.0.0 It would be helpful to document the project's stance on the use of unsafe code in a prominent location, such as the top-level README, so that we can refer people to it when questions arise. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10888) [Rust] Tracking issue for safety issues
Andy Grove created ARROW-10888: -- Summary: [Rust] Tracking issue for safety issues Key: ARROW-10888 URL: https://issues.apache.org/jira/browse/ARROW-10888 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andy Grove We have a number of functions in the Rust code base that use unsafe code but are neither declared as unsafe nor documented to explain why they are actually safe. I have seen recent criticism of the project on social media because of this, and it is something we should address. If anyone is interested in working on this, please create sub-tasks under this JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005)
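A generic illustration of the convention being asked for (this is not code from the Arrow crate): declare the function `unsafe` and document, under a `# Safety` heading, the invariant the caller must uphold.

```rust
/// Returns the element at `index` without bounds checking.
///
/// # Safety
///
/// The caller must guarantee that `index < values.len()`; otherwise the
/// behavior is undefined.
unsafe fn value_unchecked(values: &[i32], index: usize) -> i32 {
    *values.get_unchecked(index)
}

fn main() {
    let values = vec![10, 20, 30];
    // The call site takes responsibility for the invariant: we verified the
    // bound ourselves, so the unsafe block is justified.
    assert!(1 < values.len());
    let v = unsafe { value_unchecked(&values, 1) };
    println!("{}", v); // 20
}
```

Declaring the function `unsafe` moves the proof obligation to the caller, which is exactly what an audit of the code base can then check.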
[jira] [Created] (ARROW-10884) [Rust] [DataFusion] Benchmark crate does not have a SIMD feature
Andy Grove created ARROW-10884: -- Summary: [Rust] [DataFusion] Benchmark crate does not have a SIMD feature Key: ARROW-10884 URL: https://issues.apache.org/jira/browse/ARROW-10884 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Andy Grove Fix For: 3.0.0 The benchmarks run without SIMD by default. We need to add a feature to the Cargo.toml to enable SIMD. -- This message was sent by Atlassian Jira (v8.3.4#803005)
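A sketch of what this might look like in the benchmark crate's Cargo.toml, assuming the crate depends on `datafusion` and that `simd` is the pass-through feature name (both are assumptions):

```toml
[features]
# Forward SIMD support to the underlying crates when benchmarking.
simd = ["datafusion/simd"]
```

Benchmarks could then be run with `cargo bench --features simd` to compare against the non-SIMD default.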
[jira] [Created] (ARROW-10877) [Rust] [DataFusion] Add benchmark based on kaggle movies
Andy Grove created ARROW-10877: -- Summary: [Rust] [DataFusion] Add benchmark based on kaggle movies Key: ARROW-10877 URL: https://issues.apache.org/jira/browse/ARROW-10877 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10829) [Rust] [DataFusion] Implement Into for DFSchema
Andy Grove created ARROW-10829: -- Summary: [Rust] [DataFusion] Implement Into for DFSchema Key: ARROW-10829 URL: https://issues.apache.org/jira/browse/ARROW-10829 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Implement Into for DFSchema -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10815) [Rust] [DataFusion] Finish integrating SQL relation names
Andy Grove created ARROW-10815: -- Summary: [Rust] [DataFusion] Finish integrating SQL relation names Key: ARROW-10815 URL: https://issues.apache.org/jira/browse/ARROW-10815 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10813) [Rust] [DataFusion] Implement DFSchema
Andy Grove created ARROW-10813: -- Summary: [Rust] [DataFusion] Implement DFSchema Key: ARROW-10813 URL: https://issues.apache.org/jira/browse/ARROW-10813 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Implement DFSchema -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10793) [Rust] [DataFusion] Decide on CAST behaviour for invalid inputs
Andy Grove created ARROW-10793: -- Summary: [Rust] [DataFusion] Decide on CAST behaviour for invalid inputs Key: ARROW-10793 URL: https://issues.apache.org/jira/browse/ARROW-10793 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove This is a placeholder for now. See discussion on [https://github.com/apache/arrow/pull/8794] Briefly, the issue is whether CAST should return null for invalid inputs or raise an error. Spark has different behavior depending on whether ANSI mode is enabled. I'm not sure yet whether this is DataFusion-specific or a more general Arrow issue; it needs discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
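The two candidate semantics can be illustrated with a toy string-to-integer cast (these functions are illustrative, not DataFusion code): the "safe" variant maps invalid input to NULL, while the ANSI-style variant raises an error.

```rust
// "Safe" CAST: invalid input becomes NULL (modelled here as None).
fn cast_to_i32_null_on_error(s: &str) -> Option<i32> {
    s.parse::<i32>().ok()
}

// ANSI-style CAST: invalid input is an error the query must surface.
fn cast_to_i32_error(s: &str) -> Result<i32, String> {
    s.parse::<i32>()
        .map_err(|e| format!("invalid input for CAST: '{}': {}", s, e))
}

fn main() {
    assert_eq!(cast_to_i32_null_on_error("123"), Some(123));
    assert_eq!(cast_to_i32_null_on_error("abc"), None); // silently NULL
    assert!(cast_to_i32_error("abc").is_err()); // query fails loudly
    println!("ok");
}
```

The trade-off is the usual one: NULL-on-error keeps long-running queries alive but can silently discard data, while erroring is strict but surfaces bad input immediately, which is why Spark gates the choice behind ANSI mode.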
[jira] [Created] (ARROW-10783) [Rust] [DataFusion] Implement row count statistics for Parquet TableProvider
Andy Grove created ARROW-10783: -- Summary: [Rust] [DataFusion] Implement row count statistics for Parquet TableProvider Key: ARROW-10783 URL: https://issues.apache.org/jira/browse/ARROW-10783 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andy Grove Following on from https://issues.apache.org/jira/browse/ARROW-10781 we should implement statistics for Parquet data sources. -- This message was sent by Atlassian Jira (v8.3.4#803005)
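Parquet footers record a row count per row group, so a table-level row count is just the sum over row groups and files. A hypothetical sketch of that aggregation (`RowGroupMeta` stands in for the real parquet metadata types):

```rust
// Stand-in for the per-row-group metadata stored in a Parquet file footer.
struct RowGroupMeta {
    num_rows: i64,
}

/// Sum the row counts of all row groups to get the file's total row count.
fn file_row_count(row_groups: &[RowGroupMeta]) -> i64 {
    row_groups.iter().map(|rg| rg.num_rows).sum()
}

fn main() {
    let groups = vec![
        RowGroupMeta { num_rows: 1000 },
        RowGroupMeta { num_rows: 250 },
    ];
    assert_eq!(file_row_count(&groups), 1250);
    println!("ok");
}
```

Because this comes from footer metadata alone, the statistic is available without scanning any data pages.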
[jira] [Created] (ARROW-10782) [Rust] [DataFusion] Optimize hash join to use smaller relation as build side
Andy Grove created ARROW-10782: -- Summary: [Rust] [DataFusion] Optimize hash join to use smaller relation as build side Key: ARROW-10782 URL: https://issues.apache.org/jira/browse/ARROW-10782 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove When performing an inner join using the hash join algorithm, it is more efficient to load the smaller table into memory and stream the larger table. We should use the statistics made available in https://issues.apache.org/jira/browse/ARROW-10781 to build an optimizer rule that determines the smaller side of a join and uses it as the build/hash side. -- This message was sent by Atlassian Jira (v8.3.4#803005)
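A minimal sketch of the rule's core decision, under stated assumptions: `Stats` and `choose_build_side` are hypothetical names, not DataFusion types, and row count stands in for whatever size estimate the statistics provide.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Side {
    Left,
    Right,
}

// Stand-in for the per-relation statistics from ARROW-10781.
struct Stats {
    num_rows: Option<usize>,
}

/// Pick the build (hashed, in-memory) side of a hash join: the smaller
/// relation is built, the larger one is streamed as the probe side. If
/// either side lacks statistics, fall back to the default (build left).
fn choose_build_side(left: &Stats, right: &Stats) -> Side {
    match (left.num_rows, right.num_rows) {
        (Some(l), Some(r)) if r < l => Side::Right,
        _ => Side::Left,
    }
}

fn main() {
    let big = Stats { num_rows: Some(1_000_000) };
    let small = Stats { num_rows: Some(10_000) };
    let unknown = Stats { num_rows: None };

    assert_eq!(choose_build_side(&big, &small), Side::Right);
    assert_eq!(choose_build_side(&small, &big), Side::Left);
    // Without statistics we keep the default build side.
    assert_eq!(choose_build_side(&unknown, &big), Side::Left);
    println!("ok");
}
```

A real rule would rewrite the plan (swapping join inputs and flipping the join keys) rather than just returning a flag, but the comparison above is the heart of it.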