[jira] [Created] (ARROW-8914) [C++][Gandiva] Decimal128 related test failed on big-endian platforms
Kazuaki Ishizaki created ARROW-8914: --- Summary: [C++][Gandiva] Decimal128 related test failed on big-endian platforms Key: ARROW-8914 URL: https://issues.apache.org/jira/browse/ARROW-8914 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva Reporter: Kazuaki Ishizaki
The following Gandiva tests fail on big-endian platforms. An example from https://travis-ci.org/github/apache/arrow/jobs/690006107#L2306
{code}
...
[==] 17 tests from 1 test case ran. (2334 ms total)
[ PASSED ] 7 tests.
[ FAILED ] 10 tests, listed below:
[ FAILED ] TestDecimal.TestSimple
[ FAILED ] TestDecimal.TestLiteral
[ FAILED ] TestDecimal.TestCompare
[ FAILED ] TestDecimal.TestRoundFunctions
[ FAILED ] TestDecimal.TestCastFunctions
[ FAILED ] TestDecimal.TestIsDistinct
[ FAILED ] TestDecimal.TestCastVarCharDecimal
[ FAILED ] TestDecimal.TestCastDecimalVarChar
[ FAILED ] TestDecimal.TestVarCharDecimalNestedCast
[ FAILED ] TestDecimal.TestCastDecimalOverflow
...
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8913) [Ruby] Use "field" instead of "child"
Kouhei Sutou created ARROW-8913: --- Summary: [Ruby] Use "field" instead of "child" Key: ARROW-8913 URL: https://issues.apache.org/jira/browse/ARROW-8913 Project: Apache Arrow Issue Type: Improvement Components: Ruby Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8913) [Ruby] Use "field" instead of "child"
[ https://issues.apache.org/jira/browse/ARROW-8913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8913: -- Labels: pull-request-available (was: ) > [Ruby] Use "field" instead of "child" > - > > Key: ARROW-8913 > URL: https://issues.apache.org/jira/browse/ARROW-8913 > Project: Apache Arrow > Issue Type: Improvement > Components: Ruby >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8912) [Ruby] Keep reference of Arrow::Buffer's data for GC
[ https://issues.apache.org/jira/browse/ARROW-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8912: -- Labels: pull-request-available (was: ) > [Ruby] Keep reference of Arrow::Buffer's data for GC > > > Key: ARROW-8912 > URL: https://issues.apache.org/jira/browse/ARROW-8912 > Project: Apache Arrow > Issue Type: Improvement > Components: Ruby >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8912) [Ruby] Keep reference of Arrow::Buffer's data for GC
Kouhei Sutou created ARROW-8912: --- Summary: [Ruby] Keep reference of Arrow::Buffer's data for GC Key: ARROW-8912 URL: https://issues.apache.org/jira/browse/ARROW-8912 Project: Apache Arrow Issue Type: Improvement Components: Ruby Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8911) An empty ChunkedArray created by `filter` can crash.
A. Coady created ARROW-8911: --- Summary: An empty ChunkedArray created by `filter` can crash. Key: ARROW-8911 URL: https://issues.apache.org/jira/browse/ARROW-8911 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.17.1 Environment: macOS, ubuntu Reporter: A. Coady
{code:python}
import pyarrow as pa

arr = pa.chunked_array([[1]])
empty = arr.filter(pa.array([False]))
print(empty)
print(empty[:])  # <- crash
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8910) [Rust] [DataFusion] Add support for explicit casts between signed and unsigned ints
[ https://issues.apache.org/jira/browse/ARROW-8910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-8910: Summary: [Rust] [DataFusion] Add support for explicit casts between signed and unsigned ints (was: [Rus5t] [DataFusion] Add support for explicit casts between signed and unsigned ints) > [Rust] [DataFusion] Add support for explicit casts between signed and > unsigned ints > --- > > Key: ARROW-8910 > URL: https://issues.apache.org/jira/browse/ARROW-8910 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Priority: Major > Fix For: 1.0.0 > > > Add support for explicit casts between signed and unsigned ints. > Note that the type coercion optimizer rule should never implicitly perform > casts between types when data would be lost, e.g. from a negative value to an > unsigned type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8910) [Rus5t] [DataFusion] Add support for explicit casts between signed and unsigned ints
Andy Grove created ARROW-8910: - Summary: [Rus5t] [DataFusion] Add support for explicit casts between signed and unsigned ints Key: ARROW-8910 URL: https://issues.apache.org/jira/browse/ARROW-8910 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Andy Grove Fix For: 1.0.0 Add support for explicit casts between signed and unsigned ints. Note that the type coercion optimizer rule should never implicitly perform casts between types when data would be lost, e.g. from a negative value to an unsigned type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
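To make the distinction in ARROW-8910 concrete, here is a minimal sketch in plain Rust, using no Arrow or DataFusion code. The helper names `lossless_i32_to_u32` and `explicit_i32_to_u32` are hypothetical. The explicit path simply uses Rust's own `as` semantics. How DataFusion should treat out-of-range values in an explicit cast (wrap, error, or null) is not stated in the issue.

```rust
use std::convert::TryFrom;

/// Lossless conversion: the kind of check an implicit coercion rule would
/// need to pass before converting on the user's behalf (hypothetical helper).
fn lossless_i32_to_u32(v: i32) -> Option<u32> {
    u32::try_from(v).ok()
}

/// Explicit cast with Rust's `as` semantics, which reinterpret the
/// two's-complement bits. This is an assumption for illustration only;
/// the issue does not say which semantics DataFusion should adopt.
fn explicit_i32_to_u32(v: i32) -> u32 {
    v as u32
}

fn main() {
    for &v in [7i32, -1].iter() {
        println!(
            "{:>2}: lossless = {:?}, explicit = {}",
            v,
            lossless_i32_to_u32(v),
            explicit_i32_to_u32(v)
        );
    }
}
```

For `-1` the lossless check yields `None`, which is the case the coercion rule must never paper over, while the explicit cast yields a large positive number because the caller asked for it.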
[jira] [Resolved] (ARROW-8869) [Rust] [DataFusion] Type Coercion optimizer rule does not support new scan nodes
[ https://issues.apache.org/jira/browse/ARROW-8869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-8869. --- Resolution: Fixed Issue resolved by pull request 7230 [https://github.com/apache/arrow/pull/7230] > [Rust] [DataFusion] Type Coercion optimizer rule does not support new scan > nodes > > > Key: ARROW-8869 > URL: https://issues.apache.org/jira/browse/ARROW-8869 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Affects Versions: 1.0.0 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Type Coercion optimizer rule does not support new scan nodes -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8907) [Rust] implement scalar comparison operations
[ https://issues.apache.org/jira/browse/ARROW-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114893#comment-17114893 ] Yordan Pavlov commented on ARROW-8907: -- Sounds good [~andygrove], I think it makes sense to have efficient comparison to scalar values as they are often used in real world queries; I already have some work in progress for adding scalar comparison functions to the comparison kernel of arrow and hope to submit a pull request within the next few days. Hopefully this can later be used to increase Data Fusion performance with scalar values. > [Rust] implement scalar comparison operations > - > > Key: ARROW-8907 > URL: https://issues.apache.org/jira/browse/ARROW-8907 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Yordan Pavlov >Priority: Major > > Currently comparing an array to a scalar / literal value using the comparison > operations defined in the comparison kernel here: > https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs > is very inefficient because: > (1) an array with the scalar value repeated has to be created, taking time > and wasting memory > (2) time is spent during comparison to load the same literal values over and > over > Initial benchmarking of a specialized scalar comparison function indicates > good performance gains: > eq Float32 time: [938.54 us 950.28 us 962.65 us] > eq scalar Float32 time: [836.47 us 838.47 us 840.78 us] > eq Float32 simd time: [75.836 us 76.389 us 77.185 us] > eq scalar Float32 simd time: [61.551 us 61.605 us 61.671 us] > The benchmark results above show that the scalar comparison function is about > 12% faster for non-SIMD and about 20% faster for SIMD comparison operations. > And this is before accounting for creating the literal array. > In a more complex benchmark, the scalar comparison version is about 40% > faster overall when we account for not having to create arrays of scalar / > literal values. 
> Here are the benchmark results: > filter/filter with arrow SIMD (array) time: [647.77 us 675.12 us 706.69 us] > filter/filter with arrow SIMD (scalar) time: [402.19 us 404.23 us 407.22 us] > And here is the code for the benchmark: > https://github.com/yordan-pavlov/arrow-benchmark/blob/master/rust/arrow_benchmark/src/main.rs#L230 > My only concern is that I can't see an easy way to use scalar comparison > operations in Data Fusion as it is currently designed to only work on arrays. > [~paddyhoran] [~andygrove] let me know what you think, would there be value > in implementing scalar comparison operations? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8909) Out of order writes using setSafe
[ https://issues.apache.org/jira/browse/ARROW-8909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saurabh updated ARROW-8909: --- Priority: Minor (was: Major) > Out of order writes using setSafe > - > > Key: ARROW-8909 > URL: https://issues.apache.org/jira/browse/ARROW-8909 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Saurabh >Priority: Minor
>
> I noticed that calling setSafe on a VarCharVector with indices not in
> increasing order causes the lastIndex to be set to the index in the last call
> to setSafe.
> Is this documented and expected behavior?
> Sample code:
> {code:java}
> import java.util.Collections;
> import lombok.extern.slf4j.Slf4j;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VarCharVector;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.types.pojo.ArrowType;
> import org.apache.arrow.vector.types.pojo.Field;
> import org.apache.arrow.vector.types.pojo.Schema;
> import org.apache.arrow.vector.util.Text;
>
> @Slf4j
> public class ATest {
>   public static void main(String[] args) {
>     Schema schema = new Schema(Collections.singletonList(Field.nullable("Data", new ArrowType.Utf8())));
>     try (VectorSchemaRoot vroot = VectorSchemaRoot.create(schema, new RootAllocator())) {
>       VarCharVector vec = (VarCharVector) vroot.getVector("Data");
>       // Fill indices 0..9 in increasing order.
>       for (int i = 0; i < 10; i++) {
>         vec.setSafe(i, new Text(Integer.toString(i) + "_mtest"));
>       }
>       // Out-of-order write to an index that has already been set.
>       // vec.setSafe(0, new Text(Integer.toString(0) + "_new"));
>       vec.setSafe(7, new Text(Integer.toString(7) + "_new"));
>       vroot.setRowCount(10);
>       log.info(vroot.contentToTSVString());
>     }
>   }
> }
> {code}
>
> If I don't set index 0 or 7 after the loop, I get all of the 0_mtest, 1_mtest,
> ..., 9_mtest entries.
> If I set index 0 after the loop, I only see the 0_new entry; the other entries are "".
> If I set index 7 after the loop, I see 0_mtest, ..., 5_mtest, 7_new; the other
> entries are "".
>
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8909) Out of order writes using setSafe
Saurabh created ARROW-8909: -- Summary: Out of order writes using setSafe Key: ARROW-8909 URL: https://issues.apache.org/jira/browse/ARROW-8909 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Saurabh
I noticed that calling setSafe on a VarCharVector with indices not in increasing order causes the lastIndex to be set to the index in the last call to setSafe.
Is this documented and expected behavior?
Sample code:
{code:java}
import java.util.Collections;
import lombok.extern.slf4j.Slf4j;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;
import org.apache.arrow.vector.util.Text;

@Slf4j
public class ATest {
  public static void main(String[] args) {
    Schema schema = new Schema(Collections.singletonList(Field.nullable("Data", new ArrowType.Utf8())));
    try (VectorSchemaRoot vroot = VectorSchemaRoot.create(schema, new RootAllocator())) {
      VarCharVector vec = (VarCharVector) vroot.getVector("Data");
      // Fill indices 0..9 in increasing order.
      for (int i = 0; i < 10; i++) {
        vec.setSafe(i, new Text(Integer.toString(i) + "_mtest"));
      }
      // Out-of-order write to an index that has already been set.
      // vec.setSafe(0, new Text(Integer.toString(0) + "_new"));
      vec.setSafe(7, new Text(Integer.toString(7) + "_new"));
      vroot.setRowCount(10);
      log.info(vroot.contentToTSVString());
    }
  }
}
{code}
If I don't set index 0 or 7 after the loop, I get all of the 0_mtest, 1_mtest, ..., 9_mtest entries.
If I set index 0 after the loop, I only see the 0_new entry; the other entries are "".
If I set index 7 after the loop, I see 0_mtest, ..., 5_mtest, 7_new; the other entries are "".
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8869) [Rust] [DataFusion] Type Coercion optimizer rule does not support new scan nodes
[ https://issues.apache.org/jira/browse/ARROW-8869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove reassigned ARROW-8869: - Assignee: Andy Grove > [Rust] [DataFusion] Type Coercion optimizer rule does not support new scan > nodes > > > Key: ARROW-8869 > URL: https://issues.apache.org/jira/browse/ARROW-8869 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Affects Versions: 1.0.0 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Type Coercion optimizer rule does not support new scan nodes -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8907) [Rust] implement scalar comparison operations
[ https://issues.apache.org/jira/browse/ARROW-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114741#comment-17114741 ] Andy Grove commented on ARROW-8907: --- Thanks [~yordan-pavlov]. What I would really like is for DataFusion to use a specialized version of RecordBatch that can contain both Arrays and Scalar values, something like this:
{code:java}
enum ColumnarValue {
    Array(ArrayRef),
    Scalar(ScalarValue)
}

struct ColumnarBatch {
    columns: Vec<ColumnarValue>
}
{code}
> [Rust] implement scalar comparison operations > - > > Key: ARROW-8907 > URL: https://issues.apache.org/jira/browse/ARROW-8907 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Yordan Pavlov >Priority: Major > > Currently comparing an array to a scalar / literal value using the comparison > operations defined in the comparison kernel here: > https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs > is very inefficient because: > (1) an array with the scalar value repeated has to be created, taking time > and wasting memory > (2) time is spent during comparison to load the same literal values over and > over > Initial benchmarking of a specialized scalar comparison function indicates > good performance gains: > eq Float32 time: [938.54 us 950.28 us 962.65 us] > eq scalar Float32 time: [836.47 us 838.47 us 840.78 us] > eq Float32 simd time: [75.836 us 76.389 us 77.185 us] > eq scalar Float32 simd time: [61.551 us 61.605 us 61.671 us] > The benchmark results above show that the scalar comparison function is about > 12% faster for non-SIMD and about 20% faster for SIMD comparison operations. > And this is before accounting for creating the literal array. > In a more complex benchmark, the scalar comparison version is about 40% > faster overall when we account for not having to create arrays of scalar / > literal values. 
> Here are the benchmark results: > filter/filter with arrow SIMD (array) time: [647.77 us 675.12 us 706.69 us] > filter/filter with arrow SIMD (scalar) time: [402.19 us 404.23 us 407.22 us] > And here is the code for the benchmark: > https://github.com/yordan-pavlov/arrow-benchmark/blob/master/rust/arrow_benchmark/src/main.rs#L230 > My only concern is that I can't see an easy way to use scalar comparison > operations in Data Fusion as it is currently designed to only work on arrays. > [~paddyhoran] [~andygrove] let me know what you think, would there be value > in implementing scalar comparison operations? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8902) [rust][datafusion] optimize count(*) queries on parquet sources
[ https://issues.apache.org/jira/browse/ARROW-8902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan updated ARROW-8902: --- Component/s: Rust > [rust][datafusion] optimize count(*) queries on parquet sources > --- > > Key: ARROW-8902 > URL: https://issues.apache.org/jira/browse/ARROW-8902 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Alex Gaynor >Priority: Minor > > Currently, as far as I can tell, when you perform a `select count(*) from > dataset` in datafusion against a parquet dataset, the way this is implemented > is by doing a scan on column 0, and counting up all of the rows (specifically > I think it counts the # of rows in each batch). > > However, for the specific case of just counting _everything_ in a parquet > file, you can just read the rowcount from the footer metadata, so it's O(1) > instead of O(n) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8902) [rust][datafusion] optimize count(*) queries on parquet sources
[ https://issues.apache.org/jira/browse/ARROW-8902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paddy Horan updated ARROW-8902: --- Issue Type: Improvement (was: Bug) > [rust][datafusion] optimize count(*) queries on parquet sources > --- > > Key: ARROW-8902 > URL: https://issues.apache.org/jira/browse/ARROW-8902 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Alex Gaynor >Priority: Minor > > Currently, as far as I can tell, when you perform a `select count(*) from > dataset` in datafusion against a parquet dataset, the way this is implemented > is by doing a scan on column 0, and counting up all of the rows (specifically > I think it counts the # of rows in each batch). > > However, for the specific case of just counting _everything_ in a parquet > file, you can just read the rowcount from the footer metadata, so it's O(1) > instead of O(n) -- This message was sent by Atlassian Jira (v8.3.4#803005)
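The two counting strategies described in ARROW-8902 can be sketched with a toy model. Everything here is hypothetical (`ToyParquetFile`, `count_by_scan`, `count_from_footer`, `example_file`) and does not use the parquet or DataFusion crates. It only illustrates why reading a stored row count avoids touching the data.

```rust
/// Toy stand-in for a Parquet file (hypothetical type, not the parquet crate):
/// the footer records the total row count, and column 0 arrives in batches.
struct ToyParquetFile {
    footer_num_rows: usize,
    column0_batches: Vec<Vec<i64>>,
}

/// Scan-based count: visit every batch of column 0 and add up the batch
/// lengths. In a real scan each batch must first be read and decoded,
/// which is where the O(n) cost comes from.
fn count_by_scan(file: &ToyParquetFile) -> usize {
    file.column0_batches.iter().map(|batch| batch.len()).sum()
}

/// Metadata-based count: return the row count stored in the footer
/// without looking at any column data.
fn count_from_footer(file: &ToyParquetFile) -> usize {
    file.footer_num_rows
}

/// Build a small example file whose footer agrees with its data.
fn example_file() -> ToyParquetFile {
    let column0_batches = vec![vec![1, 2, 3], vec![4, 5], vec![6]];
    let footer_num_rows: usize = column0_batches.iter().map(Vec::len).sum();
    ToyParquetFile { footer_num_rows, column0_batches }
}

fn main() {
    let file = example_file();
    assert_eq!(count_by_scan(&file), count_from_footer(&file));
    println!("count(*) = {}", count_from_footer(&file));
}
```

The shortcut only applies to an unfiltered `count(*)`; once a predicate is involved, the data has to be read.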
[jira] [Created] (ARROW-8908) [Rust][DataFusion] improve performance of building literal arrays
Yordan Pavlov created ARROW-8908: Summary: [Rust][DataFusion] improve performance of building literal arrays Key: ARROW-8908 URL: https://issues.apache.org/jira/browse/ARROW-8908 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Yordan Pavlov
[~andygrove] I was doing some profiling and noticed a potential performance improvement, described below.
NOTE: The issue described below would be irrelevant if it were possible to use scalar comparison operations in DataFusion as described here: https://issues.apache.org/jira/browse/ARROW-8907
The `build_literal_array` function defined here https://github.com/apache/arrow/blob/master/rust/datafusion/src/execution/physical_plan/expressions.rs#L1204 creates an array of literal values using a loop, but benchmarks suggest that creating the array from a vec is much faster (about 58 times faster when building an array with 10 values).
Here are the benchmark results:
array builder/array from vec: time: [25.644 us 25.883 us 26.214 us]
array builder/array from values: time: [1.4985 ms 1.5090 ms 1.5213 ms]
Here is the benchmark code:
```
use arrow::array::{Array, PrimitiveArray, PrimitiveBuilder};
use arrow::datatypes::Float32Type;
use criterion::Criterion;

fn bench_array_builder(c: &mut Criterion) {
    let array_len = 10;
    let mut count = 0;
    let mut group = c.benchmark_group("array builder");

    // Build the array in one step from a Vec.
    group.bench_function("array from vec", |b| b.iter(|| {
        let float_array: PrimitiveArray<Float32Type> = vec![1.0; array_len].into();
        count = float_array.len();
    }));
    println!("built array with {} values", count);

    // Build the array by appending one value at a time, as build_literal_array does.
    group.bench_function("array from values", |b| b.iter(|| {
        // let float_array: PrimitiveArray<Float32Type> = build_literal_array(1.0, array_len);
        let mut builder = PrimitiveBuilder::<Float32Type>::new(array_len);
        for _ in 0..count {
            builder.append_value(1.0);
        }
        let float_array = builder.finish();
        count = float_array.len();
    }));
    println!("built array with {} values", count);
}
```
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8907) [Rust] implement scalar comparison operations
Yordan Pavlov created ARROW-8907: Summary: [Rust] implement scalar comparison operations Key: ARROW-8907 URL: https://issues.apache.org/jira/browse/ARROW-8907 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Yordan Pavlov
Currently comparing an array to a scalar / literal value using the comparison operations defined in the comparison kernel here: https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs is very inefficient because:
(1) an array with the scalar value repeated has to be created, taking time and wasting memory
(2) time is spent during comparison to load the same literal values over and over
Initial benchmarking of a specialized scalar comparison function indicates good performance gains:
eq Float32 time: [938.54 us 950.28 us 962.65 us]
eq scalar Float32 time: [836.47 us 838.47 us 840.78 us]
eq Float32 simd time: [75.836 us 76.389 us 77.185 us]
eq scalar Float32 simd time: [61.551 us 61.605 us 61.671 us]
The benchmark results above show that the scalar comparison function is about 12% faster for non-SIMD and about 20% faster for SIMD comparison operations. And this is before accounting for creating the literal array.
In a more complex benchmark, the scalar comparison version is about 40% faster overall when we account for not having to create arrays of scalar / literal values. Here are the benchmark results:
filter/filter with arrow SIMD (array) time: [647.77 us 675.12 us 706.69 us]
filter/filter with arrow SIMD (scalar) time: [402.19 us 404.23 us 407.22 us]
And here is the code for the benchmark: https://github.com/yordan-pavlov/arrow-benchmark/blob/master/rust/arrow_benchmark/src/main.rs#L230
My only concern is that I can't see an easy way to use scalar comparison operations in Data Fusion as it is currently designed to only work on arrays.
[~paddyhoran] [~andygrove] let me know what you think, would there be value in implementing scalar comparison operations?
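As a rough illustration of points (1) and (2) above, the sketch below contrasts the two paths on plain slices. It is not the Arrow comparison kernel: `eq_array`, `eq_via_literal_array` and `eq_scalar` are hypothetical names, and no SIMD or null handling is modelled.

```rust
/// Array-vs-array equality, element by element.
fn eq_array(left: &[f32], right: &[f32]) -> Vec<bool> {
    assert_eq!(left.len(), right.len());
    left.iter().zip(right.iter()).map(|(l, r)| l == r).collect()
}

/// Array-vs-scalar through the array path: the scalar is first repeated
/// into a temporary array (point 1), and the comparison then loads the
/// same value from that array for every element (point 2).
fn eq_via_literal_array(left: &[f32], scalar: f32) -> Vec<bool> {
    let literal = vec![scalar; left.len()];
    eq_array(left, &literal)
}

/// Specialized array-vs-scalar comparison: no temporary array, and the
/// scalar is passed by value to every comparison.
fn eq_scalar(left: &[f32], scalar: f32) -> Vec<bool> {
    left.iter().map(|l| *l == scalar).collect()
}

fn main() {
    let values = [1.0f32, 2.0, 1.0, 3.0];
    let via_array = eq_via_literal_array(&values, 1.0);
    let direct = eq_scalar(&values, 1.0);
    // Both paths must agree; only the amount of work differs.
    assert_eq!(via_array, direct);
    println!("{:?}", direct);
}
```

Both functions return the same mask, so a caller could switch to the scalar variant without changing results, which is what makes the optimization attractive.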