[jira] [Created] (ARROW-17909) [Website] Arbitrarily Nested Data in Parquet and Arrow: Part 2: Encoding Structs and Lists
Andrew Lamb created ARROW-17909: --- Summary: [Website] Arbitrarily Nested Data in Parquet and Arrow: Part 2: Encoding Structs and Lists Key: ARROW-17909 URL: https://issues.apache.org/jira/browse/ARROW-17909 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17910) [Website] Arbitrarily Nested Data in Parquet and Arrow: Part
Andrew Lamb created ARROW-17910: --- Summary: [Website] Arbitrarily Nested Data in Parquet and Arrow: Part Key: ARROW-17910 URL: https://issues.apache.org/jira/browse/ARROW-17910 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17908) [Website] Arbitrarily Nested Data in Parquet and Arrow: Part 1: Introduction
Andrew Lamb created ARROW-17908: --- Summary: [Website] Arbitrarily Nested Data in Parquet and Arrow: Part 1: Introduction Key: ARROW-17908 URL: https://issues.apache.org/jira/browse/ARROW-17908 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17907) [Website] Blog about Arrow <--> Parquet translation and structured representation
Andrew Lamb created ARROW-17907: --- Summary: [Website] Blog about Arrow <--> Parquet translation and structured representation Key: ARROW-17907 URL: https://issues.apache.org/jira/browse/ARROW-17907 Project: Apache Arrow Issue Type: Task Reporter: Andrew Lamb Assignee: Andrew Lamb @tustvold has spent a significant amount of time fixing the Rust implementation of the parquet <–> arrow conversion logic for all the corner cases of nulls, etc. During that process, he observed there was a relative lack of information on the topic to be found, so we would like to write some blog posts to remedy that and explain the format and parquet -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-16846) [Rust] Write blog post with Rust release highlights
Andrew Lamb created ARROW-16846: --- Summary: [Rust] Write blog post with Rust release highlights Key: ARROW-16846 URL: https://issues.apache.org/jira/browse/ARROW-16846 Project: Apache Arrow Issue Type: Task Components: Website Reporter: Andrew Lamb Assignee: Andrew Lamb See details here https://github.com/apache/arrow-rs/issues/1808 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-15902) [Website] Add new committers: Raphael Taylor-Davies, Wang Xudong, Yijie Shen, Kun Liu
Andrew Lamb created ARROW-15902: --- Summary: [Website] Add new committers: Raphael Taylor-Davies, Wang Xudong, Yijie Shen, Kun Liu Key: ARROW-15902 URL: https://issues.apache.org/jira/browse/ARROW-15902 Project: Apache Arrow Issue Type: Task Components: Website Reporter: Andrew Lamb Reference: [https://lists.apache.org/thread/n26odmwlv7vgxvp9xboql0txk00nyypx] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15683) [Rust] [DataFusion] Make a 7.0.0 release announcement blog
Andrew Lamb created ARROW-15683: --- Summary: [Rust] [DataFusion] Make a 7.0.0 release announcement blog Key: ARROW-15683 URL: https://issues.apache.org/jira/browse/ARROW-15683 Project: Apache Arrow Issue Type: Task Components: Rust, Rust - DataFusion, Website Reporter: Andrew Lamb Assignee: Andrew Lamb -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15675) [Rust] Blog post for versions 7-9
Andrew Lamb created ARROW-15675: --- Summary: [Rust] Blog post for versions 7-9 Key: ARROW-15675 URL: https://issues.apache.org/jira/browse/ARROW-15675 Project: Apache Arrow Issue Type: Task Components: Website Reporter: Andrew Lamb It would be good to tell the world about the progress we have made -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-12427) [Rust][DataFusion] Reenable physical_optimizer::repartition::Repartition;
Andrew Lamb created ARROW-12427: --- Summary: [Rust][DataFusion] Reenable physical_optimizer::repartition::Repartition; Key: ARROW-12427 URL: https://issues.apache.org/jira/browse/ARROW-12427 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb To fix https://issues.apache.org/jira/browse/ARROW-12421 we disabled the physical_optimizer::repartition::Repartition rule in https://github.com/apache/arrow/pull/10069. This ticket tracks finding the root cause of the CI test failure and re-enabling physical_optimizer::repartition::Repartition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12411) [Rust] Add Builder interface for adding Arrays to record batches
Andrew Lamb created ARROW-12411: --- Summary: [Rust] Add Builder interface for adding Arrays to record batches Key: ARROW-12411 URL: https://issues.apache.org/jira/browse/ARROW-12411 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andrew Lamb Assignee: Andrew Lamb

Use case: While writing tests (both in IOx and in DataFusion) where I need a single `RecordBatch`, I often find myself doing something like this:

```
let schema = Arc::new(Schema::new(vec![
    ArrowField::new("float_field", ArrowDataType::Float64, true),
    ArrowField::new("time", ArrowDataType::Int64, true),
]));

let float_array: ArrayRef = Arc::new(Float64Array::from(vec![10.1, 20.1, 30.1, 40.1]));
let timestamp_array: ArrayRef = Arc::new(Int64Array::from(vec![1000, 2000, 3000, 4000]));

let batch = RecordBatch::try_new(schema, vec![float_array, timestamp_array])
    .expect("created new record batch");
```

This is annoying because the information that `float_field` is a float is encoded both in the Schema and in the `Float64Array`.

I would much rather be able to construct RecordBatches in a builder style, to avoid the redundancy and reduce the amount of typing:

```
let float_array: ArrayRef = Arc::new(Float64Array::from(vec![10.1, 20.1, 30.1, 40.1]));
let timestamp_array: ArrayRef = Arc::new(Int64Array::from(vec![1000, 2000, 3000, 4000]));

let batch = RecordBatch::empty()
    .append("float_field", float_array).unwrap()
    .append("time", timestamp_array).unwrap();
```

The proposal is to add a method to `RecordBatch` like

```
impl RecordBatch {
    ...
    fn append(self, field_name: &str, field_values: ArrayRef) -> Result<Self>
}
```

That would append a field with the given name to the current schema, returning an error if field_name was already present. The nullability of the field would be set based on the actual null count of the field_values.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
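The builder proposed in ARROW-12411 can be sketched without the arrow crate. Everything below is a hypothetical stand-in for illustration: `Column`, `Field` and `Batch` are simplified types invented for this sketch (not the real `ArrayRef`, `Field` or `RecordBatch`), and the error type is a plain `String`. It only shows the two rules stated in the ticket: reject a duplicate field name, and derive nullability from the null count.

```rust
/// Hypothetical stand-in for an Arrow array: one column of optional values.
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub values: Vec<Option<f64>>,
}

impl Column {
    /// Number of null (None) slots in the column.
    pub fn null_count(&self) -> usize {
        self.values.iter().filter(|v| v.is_none()).count()
    }
}

/// Hypothetical stand-in for a schema field.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub nullable: bool,
}

/// Hypothetical stand-in for a record batch: fields plus their columns.
#[derive(Debug, Default)]
pub struct Batch {
    pub fields: Vec<Field>,
    pub columns: Vec<Column>,
}

impl Batch {
    /// Start with no fields and no columns.
    pub fn empty() -> Self {
        Self::default()
    }

    /// Append a named column, consuming and returning the batch so that
    /// calls can be chained.
    pub fn append(mut self, field_name: &str, values: Column) -> Result<Self, String> {
        // Rule from the ticket: a duplicate field name is an error.
        if self.fields.iter().any(|f| f.name == field_name) {
            return Err(format!("field '{}' is already present", field_name));
        }
        // Rule from the ticket: nullability follows the actual null count.
        let nullable = values.null_count() > 0;
        self.fields.push(Field {
            name: field_name.to_string(),
            nullable,
        });
        self.columns.push(values);
        Ok(self)
    }
}
```

With this shape, `Batch::empty().append("float_field", col_a)?.append("time", col_b)?` names each column exactly once, which is the redundancy the ticket wants to remove.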
[jira] [Created] (ARROW-12397) [Rust] [DataFusion] Simplify readme example #10038
Andrew Lamb created ARROW-12397: --- Summary: [Rust] [DataFusion] Simplify readme example #10038 Key: ARROW-12397 URL: https://issues.apache.org/jira/browse/ARROW-12397 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andrew Lamb MINOR: [Rust] [DataFusion] Simplify readme example #10038 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12339) [Rust][DataFusion] COUNT DISTINCT is not supported for `Boolean`
Andrew Lamb created ARROW-12339: --- Summary: [Rust][DataFusion] COUNT DISTINCT is not supported for `Boolean` Key: ARROW-12339 URL: https://issues.apache.org/jira/browse/ARROW-12339 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andrew Lamb

If you try to run a `COUNT (DISTINCT ..)` query on a float column you get the following error:

thread 'tokio-runtime-worker' panicked at 'Unexpected DataType for list', datafusion/src/scalar.rs:342:22

Reproducer:

{code}
echo "foo,1.23" > /tmp/foo.csv
./target/debug/datafusion-cli
> CREATE EXTERNAL TABLE t (a varchar, b float) STORED AS CSV LOCATION '/tmp/foo.csv';
0 rows in set. Query took 0 seconds.
> select count(distinct a) from t;
+-------------------+
| COUNT(DISTINCT a) |
+-------------------+
| 1                 |
+-------------------+
1 rows in set. Query took 0 seconds.
> select count(distinct b) from t;
thread 'tokio-runtime-worker' panicked at 'Unexpected DataType for list', datafusion/src/scalar.rs:342:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
ArrowError(ExternalError(Canceled))
{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12319) [Rust][DataFusion] Improve the errors that result when an aggregate type is not supported
Andrew Lamb created ARROW-12319: --- Summary: [Rust][DataFusion] Improve the errors that result when an aggregate type is not supported Key: ARROW-12319 URL: https://issues.apache.org/jira/browse/ARROW-12319 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andrew Lamb When you try to run a query such as {code} select AVG(ts_column) from t; {code} where ts_column has `DataType::Timestamp` type, you get a pretty unintelligible error message: "Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed." This error should be improved to say something more like: AVG is not supported for {datatype}, try an explicit cast. -- This message was sent by Atlassian Jira (v8.3.4#803005)
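A minimal sketch of the kind of message ARROW-12319 asks for. The function name, its parameters and the exact wording are assumptions made for this example; none of it is DataFusion API. Type names are passed as plain strings to keep the sketch self-contained.

```rust
/// Hypothetical helper: build a readable error for an aggregate that does
/// not accept the given input type. All names and wording are illustrative.
pub fn unsupported_aggregate_message(func: &str, data_type: &str, supported: &[&str]) -> String {
    format!(
        "{} is not supported for {}; try an explicit cast to one of: {}",
        func,
        data_type,
        supported.join(", ")
    )
}
```

Naming the function and the offending type first puts the actionable part of the message ahead of the full list of accepted types.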
[jira] [Created] (ARROW-12318) [Rust][DataFusion] Add support for AVG(Timestamp) types
Andrew Lamb created ARROW-12318: --- Summary: [Rust][DataFusion] Add support for AVG(Timestamp) types Key: ARROW-12318 URL: https://issues.apache.org/jira/browse/ARROW-12318 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andrew Lamb This is a follow on to ARROW-12277 Background: Support for Min/Max/Sum/Count were added for DataType::Timestamp(*) types in https://github.com/apache/arrow/pull/9970. This ticket tracks adding support for Avg, which is slightly more involved as currently Avg assumes the output type is always F64, and in this case I think Avg(timestamp) should also be (timestamp). We should double check what postgres does in this case. -- This message was sent by Atlassian Jira (v8.3.4#803005)
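To illustrate the point in ARROW-12318 that Avg(timestamp) could stay a timestamp rather than become an F64, here is a small sketch over raw nanosecond values. It is not DataFusion's accumulator; the function name is made up, nulls are modelled as `None`, and the sum is kept in `i128` on the assumption that overflow should be avoided rather than reported.

```rust
/// Average of nanosecond timestamps, returned as a nanosecond timestamp.
/// Nulls are skipped; an input with no non-null values yields None.
pub fn avg_timestamp_nanos(values: &[Option<i64>]) -> Option<i64> {
    let mut sum: i128 = 0;
    let mut count: i128 = 0;
    for v in values.iter().flatten() {
        // Accumulate in i128 so that summing many large i64 values cannot overflow.
        sum += i128::from(*v);
        count += 1;
    }
    if count == 0 {
        None
    } else {
        // The mean of i64 values always fits in i64, so the cast is lossless.
        // Integer division truncates toward zero.
        Some((sum / count) as i64)
    }
}
```

Keeping the result as an integer avoids the precision loss a Float64 average would introduce for present-day nanosecond timestamps, which exceed 2^53.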
[jira] [Created] (ARROW-12317) [Rust] JSON writer does not support time, date or interval types
Andrew Lamb created ARROW-12317: --- Summary: [Rust] JSON writer does not support time, date or interval types Key: ARROW-12317 URL: https://issues.apache.org/jira/browse/ARROW-12317 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andrew Lamb

While working on https://issues.apache.org/jira/browse/ARROW-12267 , adding support for writing Timestamp types, I noticed we were also lacking support for other time types. An example of adding support for timestamps is on https://github.com/apache/arrow/pull/9968

Specifically, if you try to write an array with any of the following types as JSON it will panic:

```
pub type Date32Array = PrimitiveArray<Date32Type>;
pub type Date64Array = PrimitiveArray<Date64Type>;
pub type Time32SecondArray = PrimitiveArray<Time32SecondType>;
pub type Time32MillisecondArray = PrimitiveArray<Time32MillisecondType>;
pub type Time64MicrosecondArray = PrimitiveArray<Time64MicrosecondType>;
pub type Time64NanosecondArray = PrimitiveArray<Time64NanosecondType>;
pub type IntervalYearMonthArray = PrimitiveArray<IntervalYearMonthType>;
pub type IntervalDayTimeArray = PrimitiveArray<IntervalDayTimeType>;
pub type DurationSecondArray = PrimitiveArray<DurationSecondType>;
pub type DurationMillisecondArray = PrimitiveArray<DurationMillisecondType>;
pub type DurationMicrosecondArray = PrimitiveArray<DurationMicrosecondType>;
pub type DurationNanosecondArray = PrimitiveArray<DurationNanosecondType>;
```

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12312) [Rust][DataFusion] COUNT DISTINCT is not supported for `Float64`
Andrew Lamb created ARROW-12312: --- Summary: [Rust][DataFusion] COUNT DISTINCT is not supported for `Float64` Key: ARROW-12312 URL: https://issues.apache.org/jira/browse/ARROW-12312 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Andrew Lamb

If you try to run a `COUNT (DISTINCT ..)` query on a float column you get the following error:

thread 'tokio-runtime-worker' panicked at 'Unexpected DataType for list', datafusion/src/scalar.rs:342:22

Reproducer:

{code}
echo "foo,1.23" > /tmp/foo.csv
./target/debug/datafusion-cli
> CREATE EXTERNAL TABLE t (a varchar, b float) STORED AS CSV LOCATION '/tmp/foo.csv';
0 rows in set. Query took 0 seconds.
> select count(distinct a) from t;
+-------------------+
| COUNT(DISTINCT a) |
+-------------------+
| 1                 |
+-------------------+
1 rows in set. Query took 0 seconds.
> select count(distinct b) from t;
thread 'tokio-runtime-worker' panicked at 'Unexpected DataType for list', datafusion/src/scalar.rs:342:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
ArrowError(ExternalError(Canceled))
{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
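As background for ARROW-12312: Rust's `f64` implements neither `Eq` nor `Hash`, so float values cannot go straight into a `HashSet`. The sketch below shows one general workaround, hashing a canonical bit pattern. It is an illustration only, not the fix applied in DataFusion; the function name and the choice to treat all NaNs as one value and `-0.0` as `0.0` are assumptions of this example.

```rust
use std::collections::HashSet;

/// Count distinct non-null f64 values by hashing their bit patterns.
/// Assumed semantics for this sketch: every NaN counts as the same value,
/// and -0.0 counts as the same value as 0.0.
pub fn count_distinct_f64(values: &[Option<f64>]) -> usize {
    let mut seen: HashSet<u64> = HashSet::new();
    for v in values.iter().flatten() {
        let canonical = if v.is_nan() {
            f64::NAN // collapse all NaN payloads into one bit pattern
        } else if *v == 0.0 {
            0.0 // -0.0 == 0.0 compares true, so this branch also normalises -0.0
        } else {
            *v
        };
        seen.insert(canonical.to_bits());
    }
    seen.len()
}
```

Nulls are skipped, matching the usual SQL behaviour of `COUNT(DISTINCT ..)`.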
[jira] [Created] (ARROW-12278) [Rust][DataFusion] Use Timestamp(Nanosecond, None) for SQL TIMESTAMP Type
Andrew Lamb created ARROW-12278: --- Summary: [Rust][DataFusion] Use Timestamp(Nanosecond, None) for SQL TIMESTAMP Type Key: ARROW-12278 URL: https://issues.apache.org/jira/browse/ARROW-12278 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andrew Lamb Assignee: Andrew Lamb

# Rationale

Running the query `CREATE EXTERNAL TABLE .. (c TIMESTAMP)` today in DataFusion will result in a data type of "Date64", which means that anything more specific than the date will be ignored. This leads to strange behavior such as

{code}
echo "Andrew,2018-11-13T17:11:10.011" > /tmp/foo.csv
echo "Jorge,2018-12-13T12:12:10.011" >> /tmp/foo.csv
cargo run -p datafusion --bin datafusion-cli
    Finished dev [unoptimized + debuginfo] target(s) in 0.23s
     Running `target/debug/datafusion-cli`
> CREATE EXTERNAL TABLE t(a varchar, b TIMESTAMP) STORED AS CSV LOCATION '/tmp/foo.csv';
0 rows in set. Query took 0 seconds.
> select * from t;
+--------+------------+
| a      | b          |
+--------+------------+
| Andrew | 2018-11-13 |
| Jorge  | 2018-12-13 |
+--------+------------+
{code}

(note how it is only a date, not a timestamp)

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12277) [Rust][DataFusion] Aggregates are not supported for timestamp types
Andrew Lamb created ARROW-12277: --- Summary: [Rust][DataFusion] Aggregates are not supported for timestamp types Key: ARROW-12277 URL: https://issues.apache.org/jira/browse/ARROW-12277 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Andrew Lamb

If you try to aggregate (via SUM, for example) a column of a timestamp type, it generates an error:

```
Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed.
```

For example:

{code}
> show columns from t;
+---------------+--------------+------------+-------------+-----------------------------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type                   | is_nullable |
+---------------+--------------+------------+-------------+-----------------------------+-------------+
| datafusion    | public       | t          | a           | Utf8                        | NO          |
| datafusion    | public       | t          | b           | Timestamp(Nanosecond, None) | NO          |
+---------------+--------------+------------+-------------+-----------------------------+-------------+
2 row in set. Query took 0 seconds.
> select sum(b) from t;
Plan("Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed.")
{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12267) [Rust] JSON writer does not support timestamp types
Andrew Lamb created ARROW-12267: --- Summary: [Rust] JSON writer does not support timestamp types Key: ARROW-12267 URL: https://issues.apache.org/jira/browse/ARROW-12267 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Assignee: Andrew Lamb Looks like the json writer.rs code in arrow doesn't support writing out timestamps. When I tried to write out a `TimestampNanosecondArray` I got the following error: ``` thread 'influxdb_ioxd::http::tests::test_query_json' panicked at 'Unsupported datatype: Timestamp( Nanosecond, None, )', /Users/alamb/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/3e825a7/rust/arrow/src/json/writer.rs:326:13 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace ``` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12254) [Rust][DataFusion] Limit keeps polling input after limit is reached
Andrew Lamb created ARROW-12254: --- Summary: [Rust][DataFusion] Limit keeps polling input after limit is reached Key: ARROW-12254 URL: https://issues.apache.org/jira/browse/ARROW-12254 Project: Apache Arrow Issue Type: Bug Reporter: Andrew Lamb Assignee: Andrew Lamb Once the number of rows needed for a limit query has been produced, any further work done to read values from its input is wasted. The current implementation of LimitStream will keep polling its input for the next value, and returning Poll::Ready(None), even once the limit has been reached. This is wasteful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
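The behaviour requested in ARROW-12254 can be sketched with a plain synchronous iterator instead of an async stream. This is a simplified analogue, not the actual `LimitStream`: the `Limit` type and the `input_dropped` helper are names invented for this example, and it counts individual items rather than rows inside record batches. The idea it shows is to drop the input as soon as the limit is reached, so that later calls never touch it again.

```rust
/// Hypothetical limit adapter that releases its input once it is done.
pub struct Limit<I: Iterator> {
    /// `None` once the limit has been reached or the input is exhausted.
    input: Option<I>,
    remaining: usize,
}

impl<I: Iterator> Limit<I> {
    pub fn new(input: I, limit: usize) -> Self {
        // A limit of zero never needs to poll the input at all.
        let input = if limit == 0 { None } else { Some(input) };
        Self { input, remaining: limit }
    }

    /// True once the input has been dropped and will not be polled again.
    pub fn input_dropped(&self) -> bool {
        self.input.is_none()
    }
}

impl<I: Iterator> Iterator for Limit<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        // If the input is already gone, return None without doing any work.
        let item = self.input.as_mut()?.next();
        match item {
            Some(value) => {
                self.remaining -= 1;
                if self.remaining == 0 {
                    // Limit reached: drop the input so it is never polled again.
                    self.input = None;
                }
                Some(value)
            }
            None => {
                // Input exhausted before the limit was reached.
                self.input = None;
                None
            }
        }
    }
}
```

Dropping the input, rather than only remembering that the limit was hit, also lets the input release whatever resources it holds as early as possible.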
[jira] [Created] (ARROW-12235) [Rust][DataFusion] LIMIT returns incorrect results when used with several small partitions
Andrew Lamb created ARROW-12235: --- Summary: [Rust][DataFusion] LIMIT returns incorrect results when used with several small partitions Key: ARROW-12235 URL: https://issues.apache.org/jira/browse/ARROW-12235 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Andrew Lamb Assignee: Andrew Lamb

I noticed when I was running some queries locally that `LIMIT` was not behaving correctly. For my case, a query with `LIMIT 10` was always returning zero rows. I spent some time and I have found a self-contained reproducer. If you put the following test in `rust/src/datafusion/execution/context.rs` it will fail.

{code}
/// Return a RecordBatch with a single Int32 array with values (0..sz)
fn make_partition(sz: i32) -> RecordBatch {
    let seq_start = 0;
    let seq_end = sz;
    let values = (seq_start..seq_end).collect::<Vec<_>>();
    let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, true)]));
    let arr = Arc::new(Int32Array::from(values));
    let arr = arr as ArrayRef;

    RecordBatch::try_new(schema.clone(), vec![arr]).unwrap()
}

#[tokio::test]
async fn limit_multi_partitions() -> Result<()> {
    let tmp_dir = TempDir::new()?;
    let mut ctx = create_ctx(&tmp_dir, 1)?;

    let partitions = vec![
        vec![make_partition(0)],
        vec![make_partition(1)],
        vec![make_partition(2)],
        vec![make_partition(3)],
        vec![make_partition(4)],
        vec![make_partition(5)],
    ];
    let schema = partitions[0][0].schema();
    let provider = Arc::new(MemTable::try_new(schema, partitions).unwrap());

    ctx.register_table("t", provider).unwrap();

    // select all rows
    let results = plan_and_collect(&mut ctx, "SELECT i FROM t")
        .await
        .unwrap();

    let num_rows: usize = results.into_iter().map(|b| b.num_rows()).sum();
    assert_eq!(num_rows, 15);

    for limit in 1..10 {
        let query = format!("SELECT i FROM t limit {}", limit);
        let results = plan_and_collect(&mut ctx, &query)
            .await
            .unwrap();

        let num_rows: usize = results.into_iter().map(|b| b.num_rows()).sum();
        assert_eq!(num_rows, limit, "mismatch with query {}", query);
    }

    Ok(())
}
{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
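The property that the ARROW-12235 reproducer checks can be stated as a few lines of plain Rust. This is only a model of the expected result, using nested `Vec`s in place of partitions of record batches; the function name is invented here and nothing below reflects how DataFusion plans or executes `LIMIT`.

```rust
/// Expected semantics of a global LIMIT over partitioned input: take the
/// first `limit` rows across all partitions, in partition order. Empty or
/// small partitions must not end the result early.
pub fn limit_across_partitions<T: Clone>(partitions: &[Vec<T>], limit: usize) -> Vec<T> {
    partitions
        .iter()
        .flatten() // walk every row of every partition in order
        .take(limit) // stop after `limit` rows in total
        .cloned()
        .collect()
}
```

With partitions of 0, 1, 2, 3, 4 and 5 rows (15 rows in total, as in the test above), any limit from 1 to 9 should return exactly that many rows, even though the first partition is empty.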
[jira] [Created] (ARROW-12234) [Rust][DataFusion] Can't subtract timestamps
Andrew Lamb created ARROW-12234: --- Summary: [Rust][DataFusion] Can't subtract timestamps Key: ARROW-12234 URL: https://issues.apache.org/jira/browse/ARROW-12234 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb

I have two columns, time_of_last_write and time_of_first_write, that have type `Timestamp(Nanosecond, None)`. When I try to subtract them I get an error that there isn't a common type to coerce the types to:

{code}
> select id, partition_key, storage, estimated_bytes, time_of_last_write - time_of_first_write as time_open from chunks where database_name = '844910ece80be8bc_7be09b71c487d5d3' order by id;
Plan("\'Timestamp(Nanosecond, None) - Timestamp(Nanosecond, None)\' can\'t be evaluated because there isn\'t a common type to coerce the types to")
>
{code}

Expected behavior: The query works (the resulting column should be a duration)

The data looks like this:

{code}
> select * from chunks where database_name = '844910ece80be8bc_7be09b71c487d5d3' order by id;
+-----------------------------------+-----+---------------------+---------------------+-----------------+-------------------------------+-------------------------------+-------------------------------+
| database_name                     | id  | partition_key       | storage             | estimated_bytes | time_of_first_write           | time_of_last_write            | time_closing                  |
+-----------------------------------+-----+---------------------+---------------------+-----------------+-------------------------------+-------------------------------+-------------------------------+
| 844910ece80be8bc_7be09b71c487d5d3 | 452 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 10746690        | 2021-04-06 18:46:52.356380931 | 2021-04-06 18:47:09.065541747 | 2021-04-06 18:47:09.098939917 |
| 844910ece80be8bc_7be09b71c487d5d3 | 453 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11248853        | 2021-04-06 18:47:09.495662420 | 2021-04-06 18:47:13.032639050 | 2021-04-06 18:47:13.058829814 |
| 844910ece80be8bc_7be09b71c487d5d3 | 454 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11249404        | 2021-04-06 18:47:13.594526676 | 2021-04-06 18:47:16.697048218 | 2021-04-06 18:47:16.723124402 |
| 844910ece80be8bc_7be09b71c487d5d3 | 455 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11248972        | 2021-04-06 18:47:17.128724226 | 2021-04-06 18:47:20.055123319 | 2021-04-06 18:47:20.081196973 |
| 844910ece80be8bc_7be09b71c487d5d3 | 456 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11248778        | 2021-04-06 18:47:20.609498175 | 2021-04-06 18:47:24.196610989 | 2021-04-06 18:47:24.233891509 |
| 844910ece80be8bc_7be09b71c487d5d3 | 457 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11249297        | 2021-04-06 18:47:24.660687691 | 2021-04-06 18:47:27.734848138 | 2021-04-06 18:47:27.762860931 |
| 844910ece80be8bc_7be09b71c487d5d3 | 458 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11249046        | 2021-04-06 18:47:28.128078919 | 2021-04-06 18:47:31.652250155 | 2021-04-06 18:47:31.690460702 |
| 844910ece80be8bc_7be09b71c487d5d3 | 459 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11249824        | 2021-04-06 18:47:32.286068833 | 2021-04-06 18:47:36.461676369 | 2021-04-06 18:47:36.486294829 |
| 844910ece80be8bc_7be09b71c487d5d3 | 460 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11249913        | 2021-04-06 18:47:36.944984769 | 2021-04-06 18:47:40.162251810 | 2021-04-06 18:47:40.188262747 |
| 844910ece80be8bc_7be09b71c487d5d3 | 461 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11248237        | 2021-04-06 18:47:40.719734516 | 2021-04-06 18:47:44.370867837 | 2021-04-06 18:47:44.397872698 |
| 844910ece80be8bc_7be09b71c487d5d3 | 462 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11602754        | 2021-04-06 18:47:44.844728218 | 2021-04-06 18:48:24.309093588 | 2021-04-06 18:48:24.339811197 |
| 844910ece80be8bc_7be09b71c487d5d3 | 463 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11249162        | 2021-04-06 18:48:24.847852183 | 2021-04-06 18:48:30.529014754 | 2021-04-06 18:48:30.556962859 |
| 844910ece80be8bc_7be09b71c487d5d3 | 464 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11248908        | 2021-04-06 18:48:31.148468537 | 2021-04-06 18:48:36.805296070 | 2021-04-06 18:48:36.830190418 |
| 844910ece80be8bc_7be09b71c487d5d3 | 465 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11250833        | 2021-04-06 18:48:37.258673133 | 2021-04-06 18:48:39.849493178 | 2021-04-06 18:48:39.875272790 |
| 844910ece80be8bc_7be09b71c487d5d3 | 466 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11248570        | 2021-04-06 18:48:40.304598973 | 2021-04-06 18:48:43.572838266 | 2021-04-06 18:48:43.597973739 |
| 844910ece80be8bc_7be09b71c487d5d3 | 467 | 2021-04-06 18:00:00 | ClosedMutableBuffer | 11248882        | 2021-04-06 18:48:44.086791040 | 2021-04-06 18:48:46.746045462 | 2021-04-06
[jira] [Created] (ARROW-12224) [Rust] Use stable rust for no default test, clean up CI tests
Andrew Lamb created ARROW-12224: --- Summary: [Rust] Use stable rust for no default test, clean up CI tests Key: ARROW-12224 URL: https://issues.apache.org/jira/browse/ARROW-12224 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andrew Lamb Assignee: Andrew Lamb # Rationale 1. As @jorgecarleitao noted on https://github.com/apache/arrow/pull/9889#discussion_r607720790, we should be running the check if arrow compiles with stable rust as that is what we target for the arrow crate 2. I noticed that there were several redundant settings of `RUSTFLAGS` 3. The titles of many of the tests are confusing (to me) as they have a lot of detailed architecture / rust version information rather than the test title -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12214) [Rust][DataFusion] Add some tests for limit
Andrew Lamb created ARROW-12214: --- Summary: [Rust][DataFusion] Add some tests for limit Key: ARROW-12214 URL: https://issues.apache.org/jira/browse/ARROW-12214 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12210) [Rust][DataFusion] Document SHOW TABLES / SHOW COLUMNS / InformationSchema
Andrew Lamb created ARROW-12210: --- Summary: [Rust][DataFusion] Document SHOW TABLES / SHOW COLUMNS / InformationSchema Key: ARROW-12210 URL: https://issues.apache.org/jira/browse/ARROW-12210 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12204) [Rust][CI]
Andrew Lamb created ARROW-12204: --- Summary: [Rust][CI] Key: ARROW-12204 URL: https://issues.apache.org/jira/browse/ARROW-12204 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb # Rationale The [integration test](https://github.com/apache/arrow/pull/9884/checks?check_run_id=2263730460) has a fixed size builder docker image and has builds from several Arrow implementations. The Rust build artifacts (compiled binaries) in the integration tests still consume ~ 1GB of space even after https://github.com/apache/arrow/pull/9879 (see @pitrou 's comment on https://github.com/apache/arrow/pull/9884#issuecomment-813037756). It would be nice to reduce this even more (and speed up integration test while we are at it) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12194) [Rust] [Parquet] Update zstd version
Andrew Lamb created ARROW-12194: --- Summary: [Rust] [Parquet] Update zstd version Key: ARROW-12194 URL: https://issues.apache.org/jira/browse/ARROW-12194 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andrew Lamb Assignee: Andrew Lamb updates zstd version used by parquet crate to zstd = "0.7.0+zstd.1.4.9". -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12171) [Rust] Clippy error
Andrew Lamb created ARROW-12171: --- Summary: [Rust] Clippy error Key: ARROW-12171 URL: https://issues.apache.org/jira/browse/ARROW-12171 Project: Apache Arrow Issue Type: New Feature Reporter: Andrew Lamb Assignee: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12159) [Rust][DataFusion] Support grouping on expressions
Andrew Lamb created ARROW-12159: --- Summary: [Rust][DataFusion] Support grouping on expressions Key: ARROW-12159 URL: https://issues.apache.org/jira/browse/ARROW-12159 Project: Apache Arrow Issue Type: New Feature Reporter: Andrew Lamb

Usecase: I want to group based on time windows (as defined by the `date_trunc` function). For example, given the table:

{code}
+------+-------------------+-------------+-------------+------------------+-------------------+--------------+-----------+------------+---------------+-------------+--------------------+--------------------+
| cpu  | host              | time        | usage_guest | usage_guest_nice | usage_idle        | usage_iowait | usage_irq | usage_nice | usage_softirq | usage_steal | usage_system       | usage_user         |
+------+-------------------+-------------+-------------+------------------+-------------------+--------------+-----------+------------+---------------+-------------+--------------------+--------------------+
| cpu0 | MacBook-Pro.local | 16171301300 | 0           | 0                | 65.30408773649165 | 0            | 0         | 0          | 0             | 0           | 18.444666002000673 | 16.251246261217506 |
| cpu1 | MacBook-Pro.local | 16171301300 | 0           | 0                | 84.43113772402216 | 0            | 0         | 0          | 0             | 0           | 3.193612774446795  | 12.37524950097282  |
| cpu2 | MacBook-Pro.local | 16171301300 | 0           | 0                | 65.96806387199344 | 0            | 0         | 0          | 0             | 0           | 15.469061876247794 | 18.56287425146831  |
| cpu3 | MacBook-Pro.local | 16171301300 | 0           | 0                | 84.0478564307993  | 0            | 0         | 0          | 0             | 0           | 3.0907278165770684 | 12.861415752863932 |
| cpu4 | MacBook-Pro.local | 16171301300 | 0           | 0                | 63.21036889281897 | 0            | 0         | 0          | 0             | 0           | 13.758723828377473 | 23.030907278223218 |
| cpu5 | MacBook-Pro.local | 16171301300 | 0           | 0                | 83.94815553242313 | 0            | 0         | 0          | 0             | 0           | 2.991026919231221  | 13.0608175473346   |
| cpu6 | MacBook-Pro.local | 16171301300 | 0           | 0                | 70.85828343276965 | 0            | 0         | 0          | 0             | 0           | 12.87425149699077  | 16.26746506987651  |
| cpu7 | MacBook-Pro.local | 16171301300 | 0           | 0                | 83.9321357287122  | 0            | 0         | 0          | 0             | 0           | 3.093812375243205  | 12.974051896176206 |
| cpu8 | MacBook-Pro.local | 16171301300 | 0           | 0                | 74.80079681313936 | 0            | 0         | 0          | 0             | 0           | 10.756972111708253 | 14.442231075949556 |
| cpu9 | MacBook-Pro.local | 16171301300 | 0           | 0                | 83.84845463618315 | 0            | 0         | 0          | 0             | 0           | 3.0907278165434624 | 13.060817547316466 |
+------+-------------------+-------------+-------------+------------------+-------------------+--------------+-----------+------------+---------------+-------------+--------------------+--------------------+
{code}

I want to be able to find the min and max usage time grouped by minute

{code}
select date_trunc('minute', cast (time as timestamp)), min(usage_user), max(usage_user) from cpu group by date_trunc('minute', cast (time as timestamp))
{code}

Or alternately

{code}
select date_trunc('minute', cast (time as timestamp)), min(usage_user), max(usage_user) from cpu group by 1
{code}

Instead, as of now I get a planning error:

{code}
Error preparing query
Error during planning: Projection references non-aggregate values
{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
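What the queries in ARROW-12159 are meant to compute can be sketched in plain Rust: group rows by an expression of the timestamp (truncation to the minute) and keep the min and max per group. The function names are made up for this example, and the timestamp unit is assumed to be whole seconds since the Unix epoch, which the ticket does not state.

```rust
use std::collections::BTreeMap;

/// Truncate a timestamp to the start of its minute.
/// Assumption of this sketch: the timestamp is in whole seconds since the epoch.
pub fn trunc_to_minute(ts_secs: i64) -> i64 {
    // rem_euclid keeps the result correct for timestamps before the epoch.
    ts_secs - ts_secs.rem_euclid(60)
}

/// Group (timestamp, usage_user) rows by truncated minute and return the
/// (min, max) usage for each minute, ordered by minute.
pub fn min_max_by_minute(rows: &[(i64, f64)]) -> BTreeMap<i64, (f64, f64)> {
    let mut groups: BTreeMap<i64, (f64, f64)> = BTreeMap::new();
    for &(ts, usage) in rows {
        // The group key is an expression over the row, not a raw column.
        let entry = groups.entry(trunc_to_minute(ts)).or_insert((usage, usage));
        entry.0 = f64::min(entry.0, usage);
        entry.1 = f64::max(entry.1, usage);
    }
    groups
}
```

The group key here is computed per row, which is the capability the ticket asks the planner to support for `GROUP BY`.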
[jira] [Created] (ARROW-12158) [Rust][DataFusion]: Implement support for the `now()` sql function
Andrew Lamb created ARROW-12158: --- Summary: [Rust][DataFusion]: Implement support for the `now()` sql function Key: ARROW-12158 URL: https://issues.apache.org/jira/browse/ARROW-12158 Project: Apache Arrow Issue Type: New Feature Reporter: Andrew Lamb Assignee: Andrew Lamb Usecase: selecting the last 5 minutes of data I would like to be able to run queries like this: ``` select * from cpu where time > now() - interval '3' minute; ``` Proposed implementation: follow postgres functions: https://www.postgresql.org/docs/current/functions-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
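The filter `time > now() - interval '3' minute` from ARROW-12158 can be modelled with the standard library alone. This is a sketch of the semantics, not DataFusion code: the function name is invented, and `now` is passed in as a parameter on the assumption that `now()` is evaluated once per query so that every row is compared against the same cutoff.

```rust
use std::time::{Duration, SystemTime};

/// Keep only the rows whose timestamp is strictly newer than `now - window`.
pub fn rows_newer_than<T>(
    rows: Vec<(SystemTime, T)>,
    now: SystemTime,
    window: Duration,
) -> Vec<(SystemTime, T)> {
    match now.checked_sub(window) {
        // Compute the cutoff once, then compare every row against it.
        Some(cutoff) => rows.into_iter().filter(|(t, _)| *t > cutoff).collect(),
        // The window reaches back further than SystemTime can represent,
        // so every row qualifies.
        None => rows,
    }
}
```

Passing `now` explicitly also makes the filter deterministic in tests, since the result no longer depends on the wall clock.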
[jira] [Created] (ARROW-12108) [Rust][DataFusion] Support `SHOW TABLES`
Andrew Lamb created ARROW-12108: --- Summary: [Rust][DataFusion] Support `SHOW TABLES` Key: ARROW-12108 URL: https://issues.apache.org/jira/browse/ARROW-12108 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12109) [Rust][DataFusion] Support `SHOW COLUMNS`
Andrew Lamb created ARROW-12109: --- Summary: [Rust][DataFusion] Support `SHOW COLUMNS` Key: ARROW-12109 URL: https://issues.apache.org/jira/browse/ARROW-12109 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12107) [Rust][DataFusion] Support `SELECT * from information_schema.columns`
Andrew Lamb created ARROW-12107: --- Summary: [Rust][DataFusion] Support `SELECT * from information_schema.columns` Key: ARROW-12107 URL: https://issues.apache.org/jira/browse/ARROW-12107 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb Assignee: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12106) [Rust][DataFusion] Support `SELECT * from information_schema.tables`
Andrew Lamb created ARROW-12106: --- Summary: [Rust][DataFusion] Support `SELECT * from information_schema.tables` Key: ARROW-12106 URL: https://issues.apache.org/jira/browse/ARROW-12106 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12076) Fix build
Andrew Lamb created ARROW-12076: --- Summary: Fix build Key: ARROW-12076 URL: https://issues.apache.org/jira/browse/ARROW-12076 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Assignee: Andrew Lamb There was a logical conflict between https://github.com/apache/arrow/commit/eebf64b00e3a26f61c4bebec7241a0b24d27ec67 which removed the Arc in `ArrayData` and https://github.com/apache/arrow/commit/8dd6abbb72b6b8958f3b2f35512bdadcaf43066f which optimized the compute kernels. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12075) [Rust][DataFusion] Add CTE to list of supported features
Andrew Lamb created ARROW-12075: --- Summary: [Rust][DataFusion] Add CTE to list of supported features Key: ARROW-12075 URL: https://issues.apache.org/jira/browse/ARROW-12075 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andrew Lamb Assignee: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12024) [Rust] Rust 1.52 has additional clippy lint failure
Andrew Lamb created ARROW-12024: --- Summary: [Rust] Rust 1.52 has additional clippy lint failure Key: ARROW-12024 URL: https://issues.apache.org/jira/browse/ARROW-12024 Project: Apache Arrow Issue Type: Bug Reporter: Andrew Lamb Assignee: Andrew Lamb Rust 1.52 was released yesterday: info: latest update on 2021-03-19, rust version 1.52.0-nightly (1705a7d64 2021-03-18) Resulting in lint failures such as https://github.com/apache/arrow/pull/9749/checks?check_run_id=2144048180 {code} error: this `else { if .. }` block can be collapsed --> arrow/src/array/array_binary.rs:427:20 | 427 | } else { | ^ 428 | | if let Some(size) = size { 429 | | buffer.extend_zeros(size); 430 | | } else { 431 | | prepend += 1; 432 | | } 433 | | } | |_^ | = note: `-D clippy::collapsible-if` implied by `-D warnings` = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#collapsible_if help: collapse nested if block | 427 | } else if let Some(size) = size { 428 | buffer.extend_zeros(size); 429 | } else { 430 | prepend += 1; 431 | } | {code} Reproduce via running `rustup` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12020) [Rust][DataFusion] Adding SHOW TABLES and SHOW COLUMNS + partial information_schema support to DataFusion
Andrew Lamb created ARROW-12020: --- Summary: [Rust][DataFusion] Adding SHOW TABLES and SHOW COLUMNS + partial information_schema support to DataFusion Key: ARROW-12020 URL: https://issues.apache.org/jira/browse/ARROW-12020 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Andrew Lamb Assignee: Andrew Lamb See proposal here: https://docs.google.com/document/d/12cpZUSNPqVH9Z0BBx6O8REu7TFqL-NPPAYCUPpDls1k/edit# -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11992) [Rust][Parquet] Add upgrade notes on 4.0 rename of LogicalType #9731
Andrew Lamb created ARROW-11992: --- Summary: [Rust][Parquet] Add upgrade notes on 4.0 rename of LogicalType #9731 Key: ARROW-11992 URL: https://issues.apache.org/jira/browse/ARROW-11992 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andrew Lamb Assignee: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11979) [Rust] Combine limit into SortOptions
Andrew Lamb created ARROW-11979: --- Summary: [Rust] Combine limit into SortOptions Key: ARROW-11979 URL: https://issues.apache.org/jira/browse/ARROW-11979 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Andrew Lamb The `sort_limit` kernel was added by @sundy-li in https://github.com/apache/arrow/pull/9602 While writing some doc examples in https://github.com/apache/arrow/pull/9721, it occurred to me we could potentially simplify the API, so I figured I would offer a proposed PR for comment # Rationale Since we already have a `SortOptions` structure that controls sorting options, we could also add the `limit` to that structure rather than adding a new `sort_limit` function and still avoid changing the API # Changes Move the `limit` option to `SortOptions` -- This message was sent by Atlassian Jira (v8.3.4#803005)
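A minimal sketch of the idea in ARROW-11979, assuming a stand-in `SortOptions` struct and a hypothetical `sort_with_options` helper over a plain slice (neither is the arrow crate's actual kernel signature). It only illustrates how an optional `limit` field can ride along with the existing options instead of requiring a separate function:

```rust
use std::cmp::Ordering;

// Stand-in for the options struct; the `limit` field is the proposed addition.
#[derive(Debug, Clone, Copy, Default)]
struct SortOptions {
    descending: bool,
    nulls_first: bool,
    // Hypothetical: `None` means "return everything".
    limit: Option<usize>,
}

// Hypothetical helper: sorts nullable values, then applies the limit if any.
fn sort_with_options(values: &[Option<i32>], options: SortOptions) -> Vec<Option<i32>> {
    let mut out = values.to_vec();
    out.sort_by(|a, b| match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if options.nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if options.nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        (Some(x), Some(y)) => {
            if options.descending { y.cmp(x) } else { x.cmp(y) }
        }
    });
    if let Some(limit) = options.limit {
        out.truncate(limit);
    }
    out
}
```

Callers that do not care about a limit keep using `SortOptions::default()`, which is the sense in which existing call sites stay unchanged in this sketch.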
[jira] [Created] (ARROW-11977) [Rust] Add documentation examples for sort kernel
Andrew Lamb created ARROW-11977: --- Summary: [Rust] Add documentation examples for sort kernel Key: ARROW-11977 URL: https://issues.apache.org/jira/browse/ARROW-11977 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11969) [Rust][DataFusion] Improve Examples in documentation
Andrew Lamb created ARROW-11969: --- Summary: [Rust][DataFusion] Improve Examples in documentation Key: ARROW-11969 URL: https://issues.apache.org/jira/browse/ARROW-11969 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Andrew Lamb It would be cool to have an example on the main README.md of datafusion (that appears on the crates.io homepage) that shows a prospective user what DataFusion offers (e.g. look at how tokio does it: https://crates.io/crates/tokio). I plan to lift the nice example from https://docs.rs/datafusion/3.0.0/datafusion/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11962) [Rust][DataFusion] Update Datafusion Docs / readme
Andrew Lamb created ARROW-11962: --- Summary: [Rust][DataFusion] Update Datafusion Docs / readme Key: ARROW-11962 URL: https://issues.apache.org/jira/browse/ARROW-11962 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11952) [Rust] Make ArrayData --> GenericListArray fallible instead of `panic!`
Andrew Lamb created ARROW-11952: --- Summary: [Rust] Make ArrayData --> GenericListArray fallible instead of `panic!` Key: ARROW-11952 URL: https://issues.apache.org/jira/browse/ARROW-11952 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11951) [Rust] Remove OffsetSize::prefix
Andrew Lamb created ARROW-11951: --- Summary: [Rust] Remove OffsetSize::prefix Key: ARROW-11951 URL: https://issues.apache.org/jira/browse/ARROW-11951 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Andrew Lamb Background: Leftover cleanups suggested by @sunchao on https://github.com/apache/arrow/pull/9425 Broken out from https://github.com/apache/arrow/pull/9508 Rationale: This function is redundant with `OffsetSize::is_large` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11908) [Rust] Intermittent Flight integration test failures
Andrew Lamb created ARROW-11908: --- Summary: [Rust] Intermittent Flight integration test failures Key: ARROW-11908 URL: https://issues.apache.org/jira/browse/ARROW-11908 Project: Apache Arrow Issue Type: Bug Reporter: Andrew Lamb Assignee: Andrew Lamb This is similar to the symptoms seen in ARROW-11717 but it is still happening intermittently. On two separate PRs I see similar failures: https://github.com/apache/arrow/pull/9645/checks?check_run_id=2052183132 https://github.com/apache/arrow/pull/9647/checks?check_run_id=2051946608 Example failure: {code} subprocess.CalledProcessError: Command '['/build/cpp/debug/flight-test-integration-client', '-host', 'localhost', '-port=41743', '-scenario', 'auth:basic_proto']' died with . During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/arrow/dev/archery/archery/integration/runner.py", line 308, in _run_flight_test_case consumer.flight_request(port, **client_args) File "/arrow/dev/archery/archery/integration/tester_cpp.py", line 116, in flight_request run_cmd(cmd) File "/arrow/dev/archery/archery/integration/util.py", line 148, in run_cmd raise RuntimeError(sio.getvalue()) RuntimeError: Command failed: /build/cpp/debug/flight-test-integration-client -host localhost -port=41743 -scenario auth:basic_proto With output: -- -- Arrow Fatal Error -- Invalid: Expected UNAUTHENTICATED but got Unavailable -- # FAILURES # FAILED TEST: auth:basic_proto Rust producing, C++ consuming {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11896) [Rust] Hang in CI
Andrew Lamb created ARROW-11896: --- Summary: [Rust] Hang in CI Key: ARROW-11896 URL: https://issues.apache.org/jira/browse/ARROW-11896 Project: Apache Arrow Issue Type: Bug Reporter: Andrew Lamb As observed first by [~nevi_me] on https://github.com/apache/arrow/pull/9592#issuecomment-791901636 The Rust CI tests seem to be failing due to a timeout. For example: https://github.com/apache/arrow/runs/2045186826 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11882) [Rust] Implement Debug printing "kernel"
Andrew Lamb created ARROW-11882: --- Summary: [Rust] Implement Debug printing "kernel" Key: ARROW-11882 URL: https://issues.apache.org/jira/browse/ARROW-11882 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andrew Lamb [~jorgecarleitao] offered a great way to improve the Debug/Display implementations for various Array implementations on https://github.com/apache/arrow/pull/9624#issuecomment-790976766 The only reason we are implementing to_isize/to_usize on NativeType is because we have a function to represent an array (for Display) that accepts a generic physical type T, and then tries to convert it to a isize depending on a logical type (DataType::Date). However, there is already a Many to one relationship between logical and physical types. Thus, a solution for this is to have the `Debug` function branch off depending on the (logical) datatype, implementing the custom string representation depending on it, instead of having a loop of native type T and then branching off according to the DataType inside the loop. I.e. instead of {code} for i in ... { match data_type { DataType::Date32 => represent array[i] as date DataType::Int32 => represent array[i] as int } } {code} imo we should have {code} match data_type { DataType::Date32 => for i in ... {represent array[i] as date} DataType::Int32 => for i in ... {represent array[i] as int} } {code} i.e. treat the Display as any other "kernel", where behavior is logical, not physical, type-dependent. -- This message was sent by Atlassian Jira (v8.3.4#803005)
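The second shape in ARROW-11882 (branch once on the logical type, then loop) can be sketched as below. This is only an illustration: `DataType` here is a two-variant stand-in, `display_values` is a hypothetical name, and the date rendering is a deliberately simple placeholder rather than real calendar formatting:

```rust
// Stand-in for the logical type; both variants share the physical type i32.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Date32,
}

// Hypothetical display "kernel": the match on the logical type happens once,
// outside the per-element loop, so no per-element to_isize/to_usize is needed.
fn display_values(data_type: DataType, values: &[i32]) -> Vec<String> {
    match data_type {
        DataType::Int32 => values.iter().map(|v| v.to_string()).collect(),
        // Placeholder rendering: Date32 counts days since the UNIX epoch.
        DataType::Date32 => values
            .iter()
            .map(|v| format!("1970-01-01 + {} days", v))
            .collect(),
    }
}
```

Each arm owns its loop, so the formatting logic is chosen by logical type and the physical values are never converted through a generic intermediate.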
[jira] [Created] (ARROW-11881) [Rust][DataFusion] Fix Clippy Lint
Andrew Lamb created ARROW-11881: --- Summary: [Rust][DataFusion] Fix Clippy Lint Key: ARROW-11881 URL: https://issues.apache.org/jira/browse/ARROW-11881 Project: Apache Arrow Issue Type: Bug Reporter: Andrew Lamb A linter error has appeared on master somehow: ``` error: unnecessary parentheses around `for` iterator expression --> datafusion/src/physical_plan/merge.rs:124:31 | 124 | for part_i in (0..input_partitions) { | ^ help: remove these parentheses | = note: `-D unused-parens` implied by `-D warnings` error: aborting due to previous error ``` Seen on these PRs: https://github.com/apache/arrow/pull/9612/checks?check_run_id=2042047472 https://github.com/apache/arrow/pull/9639/checks?check_run_id=2042649120 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11863) [Rust][DataFusion] No way to get to the examples from docs.rs
Andrew Lamb created ARROW-11863: --- Summary: [Rust][DataFusion] No way to get to the examples from docs.rs Key: ARROW-11863 URL: https://issues.apache.org/jira/browse/ARROW-11863 Project: Apache Arrow Issue Type: Bug Reporter: Andrew Lamb Attachments: Screen Shot 2021-03-04 at 2.51.54 PM.png https://docs.rs/datafusion/3.0.0/datafusion/ has a tantalizing piece of text about the examples, but no link or explanation of how to find them !Screen Shot 2021-03-04 at 2.51.54 PM.png! The examples are at https://github.com/apache/arrow/tree/master/rust/datafusion/examples The ideal outcome would be to point people somehow at the examples directory for the version of the docs they are looking at in docs.rs. The acceptable outcome would be to always point the docs on docs.rs at master. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11862) [Rust] String and BinaryArray created from iterators that don't accurately report size can lead to undefined behavior
Andrew Lamb created ARROW-11862: --- Summary: [Rust] String and BinaryArray created from iterators that don't accurately report size can lead to undefined behavior Key: ARROW-11862 URL: https://issues.apache.org/jira/browse/ARROW-11862 Project: Apache Arrow Issue Type: Bug Reporter: Andrew Lamb As [~jorgecarleitao] says on https://github.com/apache/arrow/pull/9588#discussion_r584290701 The (Rust) Iterator spec recommends, but does not require, that the iterator reports a correct length. Consumers that trigger undefined behavior from an incorrect size_hint are the causers of said undefined behavior. The only case where consumers can trust the iterator's length is when the iterator implements the unsafe trait TrustedLen. Unfortunately, TrustedLen is still unstable. For that reason, we have been exposing unsafe Buffer::from_trusted_len_iter and the like for those cases. So the code should be updated to handle the case where the reported `size_hint` turns out to be incorrect -- This message was sent by Atlassian Jira (v8.3.4#803005)
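To make the `size_hint` concern in ARROW-11862 concrete, here is a small sketch using only the standard library. `LyingIter` and `collect_safely` are hypothetical names for illustration, not arrow APIs; the point is that the hint is used purely as a capacity suggestion and correctness never depends on it:

```rust
// An iterator whose size_hint is deliberately wrong: it claims to be empty
// but actually yields `remaining` items. Safe Rust permits this.
struct LyingIter {
    remaining: usize,
}

impl Iterator for LyingIter {
    type Item = u8;

    fn next(&mut self) -> Option<u8> {
        if self.remaining == 0 {
            None
        } else {
            self.remaining -= 1;
            Some(1)
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        (0, Some(0)) // incorrect on purpose
    }
}

// Treats the hint as an optimization only: `push` grows the buffer as needed,
// so an under- or over-reported length cannot cause out-of-bounds writes.
fn collect_safely<I: Iterator<Item = u8>>(iter: I) -> Vec<u8> {
    let (lower, _) = iter.size_hint();
    let mut out = Vec::with_capacity(lower);
    for value in iter {
        out.push(value);
    }
    out
}
```

A consumer that instead wrote through a raw pointer into a buffer sized from the hint would be the party introducing undefined behavior, which is the situation the issue asks to eliminate.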
[jira] [Created] (ARROW-11860) [Rust] [DataFusion] Add DataFusion logos
Andrew Lamb created ARROW-11860: --- Summary: [Rust] [DataFusion] Add DataFusion logos Key: ARROW-11860 URL: https://issues.apache.org/jira/browse/ARROW-11860 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [9630|https://github.com/apache/arrow/pull/9630] I don't think this needs a JIRA? These are the DataFusion logos that I had created before the project was donated to Apache Arrow. They weren't part of the source code repo so didn't get donated at the time. https://user-images.githubusercontent.com/934084/109990656-d55ddf80-7cc6-11eb-8bbc-f21946fd1dfc.png https://user-images.githubusercontent.com/934084/109990665-d68f0c80-7cc6-11eb-891c-bf367cb5f447.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11851) [Rust][DataFusion] Add coercion support for `NULL` literals
Andrew Lamb created ARROW-11851: --- Summary: [Rust][DataFusion] Add coercion support for `NULL` literals Key: ARROW-11851 URL: https://issues.apache.org/jira/browse/ARROW-11851 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andrew Lamb As we observed in https://github.com/apache/arrow/pull/9565#discussion_r586347165 datafusion won't coerce null literals, forcing strange syntax such as: ``` rpad('hi', CAST(NULL AS INT), 'xy') ``` We should add automatic coercion logic from the null literal to any type and this expression should work just fine (produce a NULL output) ``` rpad('hi', NULL, 'xy') ``` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11821) [Rust] Edit Rust README
Andrew Lamb created ARROW-11821: --- Summary: [Rust] Edit Rust README Key: ARROW-11821 URL: https://issues.apache.org/jira/browse/ARROW-11821 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [9576|https://github.com/apache/arrow/pull/9576] Edits and fixes for some missing words, punctuation, and wording. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11819) [Rust] Add link to the doc
Andrew Lamb created ARROW-11819: --- Summary: [Rust] Add link to the doc Key: ARROW-11819 URL: https://issues.apache.org/jira/browse/ARROW-11819 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [9594|https://github.com/apache/arrow/pull/9594] This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11818) [Rust] Add link to the doc
Andrew Lamb created ARROW-11818: --- Summary: [Rust] Add link to the doc Key: ARROW-11818 URL: https://issues.apache.org/jira/browse/ARROW-11818 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [9594|https://github.com/apache/arrow/pull/9594] This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11817) TESTING PLEASE IGNORE[Rust] Add link to the doc
Andrew Lamb created ARROW-11817: --- Summary: TESTING PLEASE IGNORE[Rust] Add link to the doc Key: ARROW-11817 URL: https://issues.apache.org/jira/browse/ARROW-11817 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [9594|https://github.com/apache/arrow/pull/9594] This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11816) TESTING PLEASE IGNORE[Rust] Add link to the doc
Andrew Lamb created ARROW-11816: --- Summary: TESTING PLEASE IGNORE[Rust] Add link to the doc Key: ARROW-11816 URL: https://issues.apache.org/jira/browse/ARROW-11816 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [9594|https://github.com/apache/arrow/pull/9594] This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11815) TESTING PLEASE IGNORE[Rust] Add link to the doc
Andrew Lamb created ARROW-11815: --- Summary: TESTING PLEASE IGNORE[Rust] Add link to the doc Key: ARROW-11815 URL: https://issues.apache.org/jira/browse/ARROW-11815 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [9594|https://github.com/apache/arrow/pull/9594] This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11814) TESTING PLEASE IGNORE[Rust] Add link to the doc
Andrew Lamb created ARROW-11814: --- Summary: TESTING PLEASE IGNORE[Rust] Add link to the doc Key: ARROW-11814 URL: https://issues.apache.org/jira/browse/ARROW-11814 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [9594|https://github.com/apache/arrow/pull/9594] This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11813) TESTING PLEASE IGNORE[Rust] Add link to the doc
Andrew Lamb created ARROW-11813: --- Summary: TESTING PLEASE IGNORE[Rust] Add link to the doc Key: ARROW-11813 URL: https://issues.apache.org/jira/browse/ARROW-11813 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [9594|https://github.com/apache/arrow/pull/9594] This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11812) TESTING PLEASE IGNORE[Rust] Add link to the doc
Andrew Lamb created ARROW-11812: --- Summary: TESTING PLEASE IGNORE[Rust] Add link to the doc Key: ARROW-11812 URL: https://issues.apache.org/jira/browse/ARROW-11812 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [9594|https://github.com/apache/arrow/pull/9594] This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11811) TESTING PLEASE IGNORE[Rust] Add link to the doc
Andrew Lamb created ARROW-11811: --- Summary: TESTING PLEASE IGNORE[Rust] Add link to the doc Key: ARROW-11811 URL: https://issues.apache.org/jira/browse/ARROW-11811 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [9594|https://github.com/apache/arrow/pull/9594] This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11810) TESTING PLEASE IGNORE[Rust] Add link to the doc
Andrew Lamb created ARROW-11810: --- Summary: TESTING PLEASE IGNORE[Rust] Add link to the doc Key: ARROW-11810 URL: https://issues.apache.org/jira/browse/ARROW-11810 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [9594|https://github.com/apache/arrow/pull/9594] This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11809) TESTING PLEASE IGNORE[Rust] Add link to the doc
Andrew Lamb created ARROW-11809: --- Summary: TESTING PLEASE IGNORE[Rust] Add link to the doc Key: ARROW-11809 URL: https://issues.apache.org/jira/browse/ARROW-11809 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [9594|https://github.com/apache/arrow/pull/9594] This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11808) TESTING PLEASE IGNORE[Rust] Add link to the doc
Andrew Lamb created ARROW-11808: --- Summary: TESTING PLEASE IGNORE[Rust] Add link to the doc Key: ARROW-11808 URL: https://issues.apache.org/jira/browse/ARROW-11808 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb Issue automatically created from Pull Request [PRNUM|https://github.com/apache/arrow/pull/9594] This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11807) TESTING PLEASE IGNORE[Rust] Add link to the doc
Andrew Lamb created ARROW-11807: --- Summary: TESTING PLEASE IGNORE[Rust] Add link to the doc Key: ARROW-11807 URL: https://issues.apache.org/jira/browse/ARROW-11807 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andrew Lamb This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11805) [Rust] Add link to the doc
Andrew Lamb created ARROW-11805: --- Summary: [Rust] Add link to the doc Key: ARROW-11805 URL: https://issues.apache.org/jira/browse/ARROW-11805 Project: Apache Arrow Issue Type: Bug Reporter: Andrew Lamb This is a test PR with a minor fix, that has no JIRA issue, to automatically create the issue -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11804) [Developer] Add option to auto-create JIRA issue for PRs which don't have it
Andrew Lamb created ARROW-11804: --- Summary: [Developer] Add option to auto-create JIRA issue for PRs which don't have it Key: ARROW-11804 URL: https://issues.apache.org/jira/browse/ARROW-11804 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Andrew Lamb Improve dev workflow by automatically creating JIRA tickets if requested, as discussed here; https://lists.apache.org/thread.html/rd4533c7f882adbfc51061aceafebe8d84ea194fa5108d6cebc3621e1%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11802) [Rust][DataFusion] Mixing of crossbeam channel and async tasks can lead to deadlock
Andrew Lamb created ARROW-11802: --- Summary: [Rust][DataFusion] Mixing of crossbeam channel and async tasks can lead to deadlock Key: ARROW-11802 URL: https://issues.apache.org/jira/browse/ARROW-11802 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Andrew Lamb [~edrevo] noticed, on https://github.com/apache/arrow/pull/9523#issuecomment-786237494, that the use of crossbeam channels can potentially deadlock datafusion The use of crossbeam channels is left over from earlier, non-`async` implementations and has been implicated in some hangs that [~MikeSeddonAU] has observed in DataFusion. Specifically, the crossbeam channel can block a thread when the channel is full or empty, which can result in blocking all the tokio executor threads and deadlocking the system The proposal is to use tokio's mpsc channels instead of crossbeam, which can properly yield back to tokio to run another task when the channel is either full or empty. -- This message was sent by Atlassian Jira (v8.3.4#803005)
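The blocking condition described in ARROW-11802 can be illustrated with the standard library alone. This sketch uses `std::sync::mpsc::sync_channel` as a stand-in for a bounded crossbeam channel (an assumption made so the example has no external dependencies), and `fill_until_full` is a hypothetical helper. It uses the non-blocking `try_send` to show the point at which a blocking `send` would park the calling thread:

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Sends into a bounded channel that nobody is draining and reports how many
// messages fit before the channel is full.
fn fill_until_full(capacity: usize) -> usize {
    // `_rx` is kept alive so the channel stays connected but is never read.
    let (tx, _rx) = sync_channel::<u32>(capacity);
    let mut sent = 0;
    loop {
        match tx.try_send(sent as u32) {
            Ok(()) => sent += 1,
            // A blocking `send` here would park this OS thread until a
            // receiver makes room. If the thread is an executor worker and
            // the receiving task is waiting for a worker, nothing progresses.
            Err(TrySendError::Full(_)) => return sent,
            Err(TrySendError::Disconnected(_)) => return sent,
        }
    }
}
```

An async channel differs in that sending to a full channel suspends only the task and returns the worker thread to the executor, which is the property the proposal relies on.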
[jira] [Created] (ARROW-11790) [Rust][DataFusion] Change plan builder signature to take Vec<Expr> rather than &[Expr]
Andrew Lamb created ARROW-11790: --- Summary: [Rust][DataFusion] Change plan builder signature to take Vec<Expr> rather than &[Expr] Key: ARROW-11790 URL: https://issues.apache.org/jira/browse/ARROW-11790 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb Another thing to do is to change the signature of LogicalPlanBuilder from taking slices of owned things &[Expr] to just taking Vec<Expr> entirely The rationale is that at all callsites you need to have an owned vec and DataFusion is going to copy anyway, so it would be better to allow the caller to give up ownership -- This message was sent by Atlassian Jira (v8.3.4#803005)
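The ownership argument in ARROW-11790 can be sketched with stand-in types. `Expr`, `Projection`, `project_from_slice` and `project_from_vec` below are hypothetical and only mirror the shape of the two signatures, not DataFusion's actual builder:

```rust
// Minimal stand-ins for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
}

#[derive(Debug, PartialEq)]
struct Projection {
    exprs: Vec<Expr>,
}

// Slice-taking shape: the callee must clone every expression to own it,
// even when the caller already holds an owned Vec it no longer needs.
fn project_from_slice(exprs: &[Expr]) -> Projection {
    Projection { exprs: exprs.to_vec() }
}

// Vec-taking shape: the caller hands over ownership and nothing is copied.
fn project_from_vec(exprs: Vec<Expr>) -> Projection {
    Projection { exprs }
}
```

A caller that still needs its expressions afterwards can pass `exprs.clone()` to the second form, so the copy becomes explicit and opt-in rather than unconditional.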
[jira] [Created] (ARROW-11773) [Rust] Allow json writer to write out JSON arrays as well as newline formatted objects
Andrew Lamb created ARROW-11773: --- Summary: [Rust] Allow json writer to write out JSON arrays as well as newline formatted objects Key: ARROW-11773 URL: https://issues.apache.org/jira/browse/ARROW-11773 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andrew Lamb Currently the arrow json writer makes JSON that looks like this (one record per line): ``` {"foo":1} {"bar":1} ``` Whereas a JSON array looks like this ``` [ {"foo":1}, {"bar":1} ] ``` It would be nice to write out json in a streaming fashion (we added such a feature in IOx via https://github.com/influxdata/influxdb_iox/pull/870/files) /// Writes out well formed JSON arrays in a streaming fashion /// /// [{"foo": "bar"}, {"foo": "baz"}] /// /// This is based on the arrow JSON writer (json::writer::Writer) -- This message was sent by Atlassian Jira (v8.3.4#803005)
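A minimal sketch of streaming a JSON array as requested in ARROW-11773, assuming records arrive already serialized as JSON object strings. `JsonArrayWriter` is a hypothetical name and is neither the arrow writer nor the IOx implementation; it only shows the bookkeeping needed to place `[`, `,` and `]` without buffering all records:

```rust
use std::io::{self, Write};

// Hypothetical streaming writer: emits "[" before the first record, ","
// between records, and "]" on finish.
struct JsonArrayWriter<W: Write> {
    inner: W,
    started: bool,
}

impl<W: Write> JsonArrayWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, started: false }
    }

    // `record_json` is assumed to be one already-serialized JSON object.
    fn write_record(&mut self, record_json: &str) -> io::Result<()> {
        let prefix = if self.started { "," } else { "[" };
        self.inner.write_all(prefix.as_bytes())?;
        self.started = true;
        self.inner.write_all(record_json.as_bytes())
    }

    // Closes the array and returns the underlying writer.
    fn finish(mut self) -> io::Result<W> {
        let closing = if self.started { "]" } else { "[]" };
        self.inner.write_all(closing.as_bytes())?;
        Ok(self.inner)
    }
}
```

Because only the `started` flag is kept between calls, memory use stays constant regardless of how many records are written.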
[jira] [Created] (ARROW-11753) [Rust][DataFusion] Add test for Join Statement: Schema contains duplicate unqualified field name
Andrew Lamb created ARROW-11753: --- Summary: [Rust][DataFusion] Add test for Join Statement: Schema contains duplicate unqualified field name Key: ARROW-11753 URL: https://issues.apache.org/jira/browse/ARROW-11753 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb PR to add a test for this ticket -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11742) [Rust] [DataFusion] Add Expr::is_null and Expr::is_not_null functions
Andrew Lamb created ARROW-11742: --- Summary: [Rust] [DataFusion] Add Expr::is_null and Expr::is_not_null functions Key: ARROW-11742 URL: https://issues.apache.org/jira/browse/ARROW-11742 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb There are functions such as `Expr::lt` for building up expression trees more simply. I recently noticed that there is no `Expr::is_null()` or `Expr::is_not_null()` for easily creating `Expr::IsNull(..)` and `Expr::IsNotNull(..)`, respectively. Instead users must currently do something like: ``` let tag_name_is_not_null = Expr::IsNotNull(Box::new(col(tag_name))); ``` -- This message was sent by Atlassian Jira (v8.3.4#803005)
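The helpers requested in ARROW-11742 could look roughly like the sketch below. The `Expr` enum and `col` function here are cut-down stand-ins defined for the example, so the method bodies show the intended shape rather than DataFusion's real definitions:

```rust
// Cut-down stand-in for the expression type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    IsNull(Box<Expr>),
    IsNotNull(Box<Expr>),
}

impl Expr {
    // Proposed helper: wraps `self` so callers avoid the explicit Box::new.
    fn is_null(self) -> Expr {
        Expr::IsNull(Box::new(self))
    }

    // Proposed helper: the negated counterpart.
    fn is_not_null(self) -> Expr {
        Expr::IsNotNull(Box::new(self))
    }
}

// Stand-in for the column-reference constructor used in the issue text.
fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}
```

With these in place, the example from the issue would shorten to `col(tag_name).is_not_null()`.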
[jira] [Created] (ARROW-11717) [Integration] Intermittent (but frequent) flight integration failures with auth:basic_proto
Andrew Lamb created ARROW-11717: --- Summary: [Integration] Intermittent (but frequent) flight integration failures with auth:basic_proto Key: ARROW-11717 URL: https://issues.apache.org/jira/browse/ARROW-11717 Project: Apache Arrow Issue Type: Bug Components: Integration Reporter: Andrew Lamb Link to discussion on list: https://lists.apache.org/thread.html/r0dcdc2b6334e7f067a828634cf7584406ed859ff4d3fb622fef1bdd7%40%3Cdev.arrow.apache.org%3E I noticed that the Rust/CPP integration tests are failing seemingly intermittently on master (and on Rust PRs). The tests pass if they are re-run (enough) There are several commits that have the little red `X`, meaning that CI didn't pass on master https://github.com/apache/arrow/commits/master Here are some example CI runs that are failing https://github.com/apache/arrow/runs/1935673508 https://github.com/apache/arrow/runs/1926705212 Here is another example: https://github.com/apache/arrow/pull/9359/checks?check_run_id=1941967422 Example failure: {code} == Testing file auth:basic_proto == Traceback (most recent call last): File "/arrow/dev/archery/archery/integration/util.py", line 139, in run_cmd output = subprocess.check_output(cmd, stderr=subprocess.STDOUT) File "/opt/conda/envs/arrow/lib/python3.8/subprocess.py", line 411, in check_output return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, File "/opt/conda/envs/arrow/lib/python3.8/subprocess.py", line 512, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['/build/cpp/debug/flight-test-integration-client', '-host', 'localhost', '-port=33569', '-scenario', 'auth:basic_proto']' died with . 
During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/arrow/dev/archery/archery/integration/runner.py", line 308, in _run_flight_test_case consumer.flight_request(port, **client_args) File "/arrow/dev/archery/archery/integration/tester_cpp.py", line 116, in flight_request run_cmd(cmd) File "/arrow/dev/archery/archery/integration/util.py", line 148, in run_cmd raise RuntimeError(sio.getvalue()) RuntimeError: Command failed: /build/cpp/debug/flight-test-integration-client -host localhost -port=33569 -scenario auth:basic_proto With output: -- -- Arrow Fatal Error -- Invalid: Expected UNAUTHENTICATED but got Unavailable {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11716) [Rust][DataFusion] Change tests in sql.rs to use `assert_batch`
Andrew Lamb created ARROW-11716: --- Summary: [Rust][DataFusion] Change tests in sql.rs to use `assert_batch` Key: ARROW-11716 URL: https://issues.apache.org/jira/browse/ARROW-11716 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb The idea is to make the tests in [sql.rs|https://github.com/apache/arrow/blob/master/rust/datafusion/tests/sql.rs#L103] more maintainable by using the `assert_batches_eq` macro that was introduced here: https://github.com/apache/arrow/pull/9264 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11715) [Rust] Ensure a successful MIRI Run on CI
Andrew Lamb created ARROW-11715: --- Summary: [Rust] Ensure a successful MIRI Run on CI Key: ARROW-11715 URL: https://issues.apache.org/jira/browse/ARROW-11715 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb Currently the MIRI check is set up to pass even if `cargo miri` returns an error. https://github.com/apache/arrow/blob/master/.github/workflows/rust.yml#L263-L264 {code} # Ignore MIRI errors until we can get a clean run cargo miri test || true {code} The goal is to make MIRI pass and then remove this workaround from CI -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11714) [Rust] Fix MIRI build on CI
Andrew Lamb created ARROW-11714: --- Summary: [Rust] Fix MIRI build on CI Key: ARROW-11714 URL: https://issues.apache.org/jira/browse/ARROW-11714 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb The MIRI check doesn't even compile anymore: {code} Compiling criterion v0.3.4 Compiling h2 v0.3.0 Compiling tower v0.4.5 Compiling hyper v0.14.4 error[E0463]: can't find crate for `tracing` --> /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.4/src/lib.rs:68:1 | 68 | extern crate tracing; | ^ can't find crate error: aborting due to previous error {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11713) [Rust] Get MIRI running again
Andrew Lamb created ARROW-11713: --- Summary: [Rust] Get MIRI running again Key: ARROW-11713 URL: https://issues.apache.org/jira/browse/ARROW-11713 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb

Rust's MIRI (https://github.com/rust-lang/miri) can help detect logical errors in programs. The Rust arrow implementation now runs the MIRI checks as part of CI, but the run does not pass cleanly. For example: https://github.com/apache/arrow/pull/9535/checks?check_run_id=1941313240

{code}
Compiling criterion v0.3.4
Compiling h2 v0.3.0
Compiling tower v0.4.5
Compiling hyper v0.14.4
error[E0463]: can't find crate for `tracing`
  --> /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.4/src/lib.rs:68:1
   |
68 | extern crate tracing;
   | ^ can't find crate

error: aborting due to previous error
{code}

Previously MIRI ran, but the check failed somewhere in FFI.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11712) [Rust][DataFusion] Introduce PlanRewriter for rewriting plans
Andrew Lamb created ARROW-11712: --- Summary: [Rust][DataFusion] Introduce PlanRewriter for rewriting plans Key: ARROW-11712 URL: https://issues.apache.org/jira/browse/ARROW-11712 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb Introduce a PlanRewriter to encapsulate visiting all logical plan nodes and rewriting them bottom up (and get rid of utils::inputs, utils::exprs, etc) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11711) [Rust][DataFusion] Rename ExpressionVisitor --> ExprVisitor and standardize input
Andrew Lamb created ARROW-11711: --- Summary: [Rust][DataFusion] Rename ExpressionVisitor --> ExprVisitor and standardize input Key: ARROW-11711 URL: https://issues.apache.org/jira/browse/ARROW-11711 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb Rename ExpressionVisitor to ExprVisitor for consistency, and change it to use ` self` rather than consuming the visitor, for consistency with `PlanVisitor` (as well as the soon to be created `ExprVisitor`). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11710) [Rust][DataFusion] Implement ExprRewriter to avoid tree traversal redundancy
Andrew Lamb created ARROW-11710: --- Summary: [Rust][DataFusion] Implement ExprRewriter to avoid tree traversal redundancy Key: ARROW-11710 URL: https://issues.apache.org/jira/browse/ARROW-11710 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb

The idea is to:
1. Reduce the amount of repetition in optimizer rules to make them easier to implement
2. Reduce the amount of repetition to make it easier to see the actual logic (rather than having it intertwined with the code needed to do recursion)
3. Set the stage for a more general `PlanRewriter` that doesn't have to clone its input, and can instead take its input by value and consume it

The plan is to make an ExprRewriter, the mutable counterpart to `ExpressionVisitor`, and demonstrate its usefulness by rewriting several expression transformation passes using it.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
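The rewriter idea described in ARROW-11710 might look roughly like the sketch below. This is an illustration under assumptions, not DataFusion's actual API: the `Expr` enum is a toy stand-in, and the `ExprRewriter` trait shape, the `mutate` and `rewrite` method names, and `ConstantFolder` are all hypothetical. It shows the recursion living in one place (`rewrite`) while each rule only supplies the per-node logic, with expressions passed by value so nothing needs to be cloned.

```rust
// Toy expression type; a stand-in for DataFusion's `Expr`, not its real definition.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

// Hypothetical rewriter trait: `mutate` receives a node by value
// (its children already rewritten) and returns the replacement node.
trait ExprRewriter {
    fn mutate(&mut self, expr: Expr) -> Expr;
}

impl Expr {
    // Rewrites children first, then the node itself (bottom up).
    // The recursion is written once here instead of in every rule.
    fn rewrite<R: ExprRewriter>(self, rewriter: &mut R) -> Expr {
        let expr = match self {
            Expr::Add(l, r) => Expr::Add(
                Box::new(l.rewrite(rewriter)),
                Box::new(r.rewrite(rewriter)),
            ),
            other => other,
        };
        rewriter.mutate(expr)
    }
}

// Example rule: fold `literal + literal` into a single literal.
struct ConstantFolder;

impl ExprRewriter for ConstantFolder {
    fn mutate(&mut self, expr: Expr) -> Expr {
        match expr {
            Expr::Add(l, r) => match (*l, *r) {
                (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a + b),
                (l, r) => Expr::Add(Box::new(l), Box::new(r)),
            },
            other => other,
        }
    }
}

// Builds `(1 + 2) + x` and folds the constant part.
fn fold_example() -> Expr {
    let expr = Expr::Add(
        Box::new(Expr::Add(
            Box::new(Expr::Literal(1)),
            Box::new(Expr::Literal(2)),
        )),
        Box::new(Expr::Column("x".to_string())),
    );
    expr.rewrite(&mut ConstantFolder)
}

fn main() {
    println!("{:?}", fold_example());
}
```

Because `rewrite` consumes `self`, a rule that leaves a node unchanged simply hands the same value back, which is the property the issue hopes to carry over to a `PlanRewriter`.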
[jira] [Created] (ARROW-11709) [Rust][DataFusion] Move `expressions` and `inputs` into LogicalPlan rather than helpers in util
Andrew Lamb created ARROW-11709: --- Summary: [Rust][DataFusion] Move `expressions` and `inputs` into LogicalPlan rather than helpers in util Key: ARROW-11709 URL: https://issues.apache.org/jira/browse/ARROW-11709 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb Move `expressions` and `inputs` into LogicalPlan rather than keeping them as helpers in util, and use a Visitor rather than a hard-coded list. The goal is to consolidate the expression walking in one place. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11692) [Rust][DataFusion] Improve documentation on Optimizer
Andrew Lamb created ARROW-11692: --- Summary: [Rust][DataFusion] Improve documentation on Optimizer Key: ARROW-11692 URL: https://issues.apache.org/jira/browse/ARROW-11692 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11690) [Rust][DataFusion] Avoid Expr::clone in Expr builder methods
Andrew Lamb created ARROW-11690: --- Summary: [Rust][DataFusion] Avoid Expr::clone in Expr builder methods Key: ARROW-11690 URL: https://issues.apache.org/jira/browse/ARROW-11690 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11689) [Rust][DataFusion] Reduce copies in DataFusion LogicalPlan and Expr creation
Andrew Lamb created ARROW-11689: --- Summary: [Rust][DataFusion] Reduce copies in DataFusion LogicalPlan and Expr creation Key: ARROW-11689 URL: https://issues.apache.org/jira/browse/ARROW-11689 Project: Apache Arrow Issue Type: New Feature Reporter: Andrew Lamb

The theme of this overall epic is to make the plan and expression rewriting phases of DataFusion more efficient by avoiding copies, leveraging the Rust type system.

Benefits:
* More standard / idiomatic Rust usage
* Faster / more efficient (I don't have numbers to back this up)

Downsides:
* These will be backwards incompatible changes

h1. Background

Many things in DataFusion look like

Input --transformation--> output

and the input is not used again. In Rust, you can model this by giving ownership to the transformation.

At a high level, the idea is to avoid so much cloning in DataFusion. The basic principle is that if a function needs to `clone` one of its arguments, the caller should be given the choice of when to do that. Often, the caller can give up ownership without issue.

I envision at least the following items:
1. Optimizer passes that take `` and produce a new `LogicalPlan` even though most callsites do not need the original
2. Expr builder calls that take `` and return a new `Expr`
3. An expression rewriter (TODO) while running down optimizer passes

I think this style takes advantage of Rust's ownership model and will let us avoid a lot of copying and allocations, and avoid the need for something like slab allocators.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
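The principle in ARROW-11689 (let the caller decide when to clone) can be sketched with a toy builder method. Everything here is hypothetical and for illustration only: the `Expr` enum, the `col` helper, and the `and` / `and_borrowed` methods are not DataFusion's real signatures.

```rust
// Toy expression type; a stand-in, not DataFusion's real `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    And(Box<Expr>, Box<Expr>),
}

impl Expr {
    // Borrowing builder: always clones `self`, even when the caller
    // would never use the original again.
    fn and_borrowed(&self, other: Expr) -> Expr {
        Expr::And(Box::new(self.clone()), Box::new(other))
    }

    // Owning builder: takes `self` by value, so no clone happens here.
    // A caller that still needs the original clones at the call site.
    fn and(self, other: Expr) -> Expr {
        Expr::And(Box::new(self), Box::new(other))
    }
}

// Hypothetical helper for building a column reference.
fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn main() {
    let a = col("a");
    // The caller wants to keep `a`, so it pays for one explicit clone.
    let kept = a.clone().and(col("b"));
    // The caller is done with `a`, so it is moved and nothing is copied.
    let moved = a.and(col("c"));
    println!("{:?}\n{:?}", kept, moved);
}
```

Both builders produce the same expression; the difference is only who decides whether a copy is made. Switching an existing method from `&self` to `self` is the kind of backwards incompatible change the issue lists as a downside.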
[jira] [Created] (ARROW-11671) [Arrow][DataFusion
Andrew Lamb created ARROW-11671: --- Summary: [Arrow][DataFusion Key: ARROW-11671 URL: https://issues.apache.org/jira/browse/ARROW-11671 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11667) [Rust] Add docs for utf8 comparison functions
Andrew Lamb created ARROW-11667: --- Summary: [Rust] Add docs for utf8 comparison functions Key: ARROW-11667 URL: https://issues.apache.org/jira/browse/ARROW-11667 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11635) [Rust] [DataFusion] Improve performance for grouping/hashing on dictionary encoded data
Andrew Lamb created ARROW-11635: --- Summary: [Rust] [DataFusion] Improve performance for grouping/hashing on dictionary encoded data Key: ARROW-11635 URL: https://issues.apache.org/jira/browse/ARROW-11635 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb

I am recording this for posterity, and in case someone else wants to help.

While adding support for GROUP BY hash, [~jorgecarleitao] had some great suggestions: https://github.com/apache/arrow/pull/9233#issuecomment-762174671

The initial GROUP BY implementation hashes the actual value of the dictionary (that is, it looks up the underlying value). For the common case, such as when the dictionary contains strings, this will likely do much more work than is necessary.

In the common case we should be able to hash the dictionary indexes directly, or possibly skip hashing entirely and build an aggregate table directly from the indexes -- this would work incredibly well for low cardinality string columns.

What makes it tricky is that we would have to handle the case where the dictionary itself is not the same across all record batches (and thus indexes in one record batch may not correspond to the same value in another).

Some possible implementation ideas are:
* Implement a special case for a shared dictionary across all input record batches, and have code to switch back to the more general case (hash table) if the dictionary ever changes.
* Alternately, we could hold a hash table (or equivalent) for each distinct dictionary we saw and merge them all at the end.

The second approach would likely be the fastest, but would also potentially consume the most resources.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
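A minimal sketch of the "skip hashing entirely" idea from ARROW-11635, assuming a single dictionary shared by all rows. `DictColumn` and `count_by_key` are hypothetical names, the column is modelled with plain vectors rather than Arrow's `DictionaryArray`, and nulls are ignored. It does not address the hard part named in the issue (dictionaries that differ between record batches).

```rust
// Hypothetical stand-in for a dictionary-encoded string column:
// `keys[i]` is the index into `values` for row `i`.
struct DictColumn {
    values: Vec<String>,
    keys: Vec<usize>,
}

// COUNT(*) grouped by the column, using each dictionary key directly as a
// slot in a dense table. No string is hashed or compared per row.
fn count_by_key(col: &DictColumn) -> Vec<(String, u64)> {
    let mut counts = vec![0u64; col.values.len()];
    for &key in &col.keys {
        counts[key] += 1;
    }
    // Resolve keys back to values once per group rather than once per row,
    // and skip dictionary entries that no row refers to.
    col.values
        .iter()
        .cloned()
        .zip(counts)
        .filter(|(_, n)| *n > 0)
        .collect()
}

fn main() {
    let col = DictColumn {
        values: vec!["a".to_string(), "b".to_string(), "c".to_string()],
        keys: vec![0, 1, 0, 0, 1],
    };
    for (value, count) in count_by_key(&col) {
        println!("{}: {}", value, count);
    }
}
```

The dense table has one slot per dictionary entry, which is why the approach suits low cardinality columns in particular.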
[jira] [Created] (ARROW-11620) [Rust] [DataFusion] Inconsistent use of Box and Arc for TableProvider
Andrew Lamb created ARROW-11620: --- Summary: [Rust] [DataFusion] Inconsistent use of Box and Arc for TableProvider Key: ARROW-11620 URL: https://issues.apache.org/jira/browse/ARROW-11620 Project: Apache Arrow Issue Type: Bug Reporter: Andrew Lamb Assignee: Andrew Lamb The API inconsistently uses Box and Arc -- we should standardize on Arc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11602) [Rust] Clippy CI is failing
Andrew Lamb created ARROW-11602: --- Summary: [Rust] Clippy CI is failing Key: ARROW-11602 URL: https://issues.apache.org/jira/browse/ARROW-11602 Project: Apache Arrow Issue Type: Bug Reporter: Andrew Lamb Assignee: Andrew Lamb CI uses "stable" Rust, and stable was updated to 1.50 today: https://blog.rust-lang.org/2021/02/11/Rust-1.50.0.html The new clippy is pickier, resulting in many clippy warnings such as https://github.com/apache/arrow/pull/9469/checks?check_run_id=1881854256 We need to get CI back to green. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11594) [Rust] Support pretty printing with NullArrays
Andrew Lamb created ARROW-11594: --- Summary: [Rust] Support pretty printing with NullArrays Key: ARROW-11594 URL: https://issues.apache.org/jira/browse/ARROW-11594 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andrew Lamb

The whole point of `NullArray::new_with_type` is to be able to cheaply construct entirely null columns, with a smaller memory footprint. Currently, trying to print them out causes a panic:

{code}
#[test]
fn test_pretty_format_null() -> Result<()> {
    // define a schema.
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Utf8, true),
        Field::new("b", DataType::Int32, true),
    ]));

    let num_rows = 4;

    // define data (null)
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(NullArray::new_with_type(num_rows, DataType::Utf8)),
            Arc::new(NullArray::new_with_type(num_rows, DataType::Int32)),
        ],
    )?;

    // formatting the all-null batch is the call that panics
    let table = pretty_format_batches(&[batch])?;
    Ok(())
}
{code}

Panics:

{code}
failures:

---- util::pretty::tests::test_pretty_format_null stdout ----
thread 'util::pretty::tests::test_pretty_format_null' panicked at 'called `Option::unwrap()` on a `None` value', arrow/src/util/display.rs:201:27
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11576) [Rust] Remove unused variable in example
Andrew Lamb created ARROW-11576: --- Summary: [Rust] Remove unused variable in example Key: ARROW-11576 URL: https://issues.apache.org/jira/browse/ARROW-11576 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Andrew Lamb As shown in https://github.com/apache/arrow/commit/3a380a4c4193c6683a71ba72dc31f8456bc661d5 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11489) [Rust][DataFusion] Make DataFrame should be Send+Sync
Andrew Lamb created ARROW-11489: --- Summary: [Rust][DataFusion] Make DataFrame should be Send+Sync Key: ARROW-11489 URL: https://issues.apache.org/jira/browse/ARROW-11489 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andrew Lamb Assignee: Andrew Lamb Inspired by a question on the mailing list: https://lists.apache.org/thread.html/r8f81fae08346817fa283804037ed79a4309bb54aa8ed77c354d7baf0%40%3Cuser.arrow.apache.org%3E Things need to be `Send + Sync` in order to be sent between threads (or async tasks). Thus we should make DataFrame require Send + Sync as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
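To illustrate why the bound in ARROW-11489 matters, here is a small sketch using only the standard library. The `DataFrame` trait below is a hypothetical stand-in with a single made-up method, not DataFusion's real trait, and `ToyFrame` and `count_on_other_thread` exist only for this example. An `Arc<dyn DataFrame>` can be moved into another thread only because the trait requires `Send + Sync`.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for DataFusion's `DataFrame` trait. The
// `Send + Sync` supertraits are the change the issue asks for.
trait DataFrame: Send + Sync {
    fn num_columns(&self) -> usize;
}

// Toy implementation used only for this example.
struct ToyFrame {
    columns: Vec<String>,
}

impl DataFrame for ToyFrame {
    fn num_columns(&self) -> usize {
        self.columns.len()
    }
}

// Moves the shared handle into a new thread. `Arc<T>` is `Send` only when
// `T: Send + Sync`, so without the bounds on the trait this does not compile.
fn count_on_other_thread(df: Arc<dyn DataFrame>) -> usize {
    thread::spawn(move || df.num_columns())
        .join()
        .expect("worker thread panicked")
}

fn main() {
    let df: Arc<dyn DataFrame> = Arc::new(ToyFrame {
        columns: vec!["a".to_string(), "b".to_string()],
    });
    println!("{}", count_on_other_thread(df));
}
```

Removing `Send + Sync` from the trait definition makes `thread::spawn` reject the closure at compile time, which is the kind of error that prompted the mailing list question.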
[jira] [Created] (ARROW-11457) [Rust] Make string comparisson kernels generic over Utf8 and LargeUtf8
Andrew Lamb created ARROW-11457: --- Summary: [Rust] Make string comparisson kernels generic over Utf8 and LargeUtf8 Key: ARROW-11457 URL: https://issues.apache.org/jira/browse/ARROW-11457 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Ritchie -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11414) [Rust] Reduce copies in Schema::try_merge
Andrew Lamb created ARROW-11414: --- Summary: [Rust] Reduce copies in Schema::try_merge Key: ARROW-11414 URL: https://issues.apache.org/jira/browse/ARROW-11414 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb https://github.com/apache/arrow/blob/ab5fc979c69ccc5dde07e1bc1467b02951b4b7e9/rust/arrow/src/datatypes.rs#L1832-L1860 I was looking at this code yesterday while using it in IOx -- https://github.com/influxdata/influxdb_iox/pull/703 Even though Schema::try_merge requires a slice of Schemas (not schema refs), it copies all of its fields. This is not ideal in the common case where most of the fields in the Schema will be the same -- This message was sent by Atlassian Jira (v8.3.4#803005)
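One possible shape for a merge that avoids the copies described in ARROW-11414 is sketched below. It is an assumption-laden toy: `Field` and `Schema` are simplified stand-ins (no nullability or metadata), the `try_merge` signature taking owned schemas is hypothetical, and the conflict rule is reduced to a data type comparison. The point is only that fields from owned input can be moved into the result rather than cloned.

```rust
// Simplified stand-ins for Arrow's `Field` and `Schema`.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

// Hypothetical merge that takes ownership of its inputs, so each field is
// moved into the result instead of being copied.
fn try_merge(schemas: impl IntoIterator<Item = Schema>) -> Result<Schema, String> {
    let mut merged: Vec<Field> = Vec::new();
    for schema in schemas {
        for field in schema.fields {
            // Check whether a field with this name was already merged,
            // and if so whether its type agrees.
            let same_type = merged
                .iter()
                .find(|f| f.name == field.name)
                .map(|f| f.data_type == field.data_type);
            match same_type {
                Some(true) => {} // duplicate field: drop it, no copy needed
                Some(false) => {
                    return Err(format!("conflicting types for field '{}'", field.name));
                }
                None => merged.push(field), // moved, not cloned
            }
        }
    }
    Ok(Schema { fields: merged })
}

// Hypothetical helper for building a field.
fn field(name: &str, data_type: &str) -> Field {
    Field {
        name: name.to_string(),
        data_type: data_type.to_string(),
    }
}

fn main() {
    let first = Schema { fields: vec![field("a", "Int32"), field("b", "Utf8")] };
    let second = Schema { fields: vec![field("b", "Utf8"), field("c", "Int32")] };
    println!("{:?}", try_merge(vec![first, second]));
}
```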
[jira] [Created] (ARROW-11375) [Rust] CI fails due to deprecation warning in clippy
Andrew Lamb created ARROW-11375: --- Summary: [Rust] CI fails due to deprecation warning in clippy Key: ARROW-11375 URL: https://issues.apache.org/jira/browse/ARROW-11375 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andrew Lamb Assignee: Jorge Leitão

Rust clippy lint test on CI started failing with this error:

{code}
Compiling arrow-flight v3.0.0-SNAPSHOT (/__w/arrow/arrow/rust/arrow-flight)
error: use of deprecated struct `criterion::Benchmark`: Please use BenchmarkGroups instead.
  --> arrow/benches/builder.rs:39:9
   |
39 | Benchmark::new("bench_primitive", move |b| {
   | ^^
   |
   = note: `-D deprecated` implied by `-D warnings`

error: use of deprecated struct `criterion::Benchmark`: Please use BenchmarkGroups instead.
  --> arrow/benches/builder.rs:62:9
   |
62 | Benchmark::new("bench_bool", move |b| {
   | ^^

error: use of deprecated associated function `criterion::Criterion::<M>::bench`: Please use BenchmarkGroups instead.
  --> arrow/benches/builder.rs:37:7
   |
37 | c.bench(
   | ^

error: use of deprecated associated function `criterion::Criterion::<M>::bench`: Please use BenchmarkGroups instead.
  --> arrow/benches/builder.rs:60:7
   |
60 | c.bench(
   | ^
{code}

It appears related to the latest release of criterion: https://crates.io/crates/criterion/0.3.4 (Jan 24, 2021)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11330) [Rust][DataFusion] Add ExpressionVisitor pattern
Andrew Lamb created ARROW-11330: --- Summary: [Rust][DataFusion] Add ExpressionVisitor pattern Key: ARROW-11330 URL: https://issues.apache.org/jira/browse/ARROW-11330 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Andrew Lamb -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11327) [Rust] [DataFusion] Add DictionaryArray support for create_batch_empty
Andrew Lamb created ARROW-11327: --- Summary: [Rust] [DataFusion] Add DictionaryArray support for create_batch_empty Key: ARROW-11327 URL: https://issues.apache.org/jira/browse/ARROW-11327 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Andrew Lamb The create_batch_empty function is used for creating output during aggregation. As part of my plan for better dictionary support, it needs to support DictionaryArray as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11323) [Rust][DataFusion] with queries with ORDER BY or GROUP BY that return no
Andrew Lamb created ARROW-11323: --- Summary: [Rust][DataFusion] with queries with ORDER BY or GROUP BY that return no Key: ARROW-11323 URL: https://issues.apache.org/jira/browse/ARROW-11323 Project: Apache Arrow Issue Type: Bug Reporter: Andrew Lamb

If you run a SQL query in DataFusion whose predicates produce no rows and which also includes a GROUP BY or ORDER BY clause, you get the following error:

ArrowError(ComputeError("concat requires input of at least one array"))

Here are two test cases that show the problem: https://github.com/apache/arrow/blob/master/rust/datafusion/src/execution/context.rs#L889

{code}
#[tokio::test]
async fn sort_empty() -> Result<()> {
    // The predicate on this query purposely generates no results
    let results = execute("SELECT c1, c2 FROM test WHERE c1 > 10 ORDER BY c1 DESC, c2 ASC", 4).await?;
    assert_eq!(results.len(), 0);
    Ok(())
}

#[tokio::test]
async fn aggregate_empty() -> Result<()> {
    // The predicate on this query purposely generates no results
    let results = execute("SELECT SUM(c1), SUM(c2) FROM test where c1 > 10", 4).await?;
    assert_eq!(results.len(), 0);
    Ok(())
}
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)