date:20200523

[jira] [Created] (ARROW-8914) [C++][Gandiva] Decimal128 related test failed on big-endian platforms

2020-05-23 Thread Kazuaki Ishizaki (Jira)

Kazuaki Ishizaki created ARROW-8914:
---

 Summary: [C++][Gandiva] Decimal128 related test failed on 
big-endian platforms
 Key: ARROW-8914
 URL: https://issues.apache.org/jira/browse/ARROW-8914
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: Kazuaki Ishizaki


The following Gandiva test failures occur on big-endian platforms. An example 
from https://travis-ci.org/github/apache/arrow/jobs/690006107#L2306:

{code}
...
[==========] 17 tests from 1 test case ran. (2334 ms total)
[  PASSED  ] 7 tests.
[  FAILED  ] 10 tests, listed below:
[  FAILED  ] TestDecimal.TestSimple
[  FAILED  ] TestDecimal.TestLiteral
[  FAILED  ] TestDecimal.TestCompare
[  FAILED  ] TestDecimal.TestRoundFunctions
[  FAILED  ] TestDecimal.TestCastFunctions
[  FAILED  ] TestDecimal.TestIsDistinct
[  FAILED  ] TestDecimal.TestCastVarCharDecimal
[  FAILED  ] TestDecimal.TestCastDecimalVarChar
[  FAILED  ] TestDecimal.TestVarCharDecimalNestedCast
[  FAILED  ] TestDecimal.TestCastDecimalOverflow
...
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-8913) [Ruby] Use "field" instead of "child"

2020-05-23 Thread Kouhei Sutou (Jira)

Kouhei Sutou created ARROW-8913:
---

 Summary: [Ruby] Use "field" instead of "child"
 Key: ARROW-8913
 URL: https://issues.apache.org/jira/browse/ARROW-8913
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou







[jira] [Updated] (ARROW-8913) [Ruby] Use "field" instead of "child"

2020-05-23 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-8913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8913:
--
Labels: pull-request-available  (was: )

> [Ruby] Use "field" instead of "child"
> -
>
> Key: ARROW-8913
> URL: https://issues.apache.org/jira/browse/ARROW-8913
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>





[jira] [Updated] (ARROW-8912) [Ruby] Keep reference of Arrow::Buffer's data for GC

2020-05-23 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8912:
--
Labels: pull-request-available  (was: )

> [Ruby] Keep reference of Arrow::Buffer's data for GC
> 
>
> Key: ARROW-8912
> URL: https://issues.apache.org/jira/browse/ARROW-8912
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>





[jira] [Created] (ARROW-8912) [Ruby] Keep reference of Arrow::Buffer's data for GC

2020-05-23 Thread Kouhei Sutou (Jira)

Kouhei Sutou created ARROW-8912:
---

 Summary: [Ruby] Keep reference of Arrow::Buffer's data for GC
 Key: ARROW-8912
 URL: https://issues.apache.org/jira/browse/ARROW-8912
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou







[jira] [Created] (ARROW-8911) An empty ChunkedArray created by `filter` can crash.

2020-05-23 Thread A. Coady (Jira)

A. Coady created ARROW-8911:
---

 Summary: An empty ChunkedArray created by `filter` can crash.
 Key: ARROW-8911
 URL: https://issues.apache.org/jira/browse/ARROW-8911
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.17.1
 Environment: macOS, ubuntu
Reporter: A. Coady


{code:python}
import pyarrow as pa
arr = pa.chunked_array([[1]])
empty = arr.filter(pa.array([False]))
print(empty)
print(empty[:]) # <- crash
{code}




[jira] [Updated] (ARROW-8910) [Rust] [DataFusion] Add support for explicit casts between signed and unsigned ints

2020-05-23 Thread Kouhei Sutou (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-8910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-8910:

Summary: [Rust] [DataFusion] Add support for explicit casts between signed 
and unsigned ints  (was: [Rus5t] [DataFusion] Add support for explicit casts 
between signed and unsigned ints)

> [Rust] [DataFusion] Add support for explicit casts between signed and 
> unsigned ints
> ---
>
> Key: ARROW-8910
> URL: https://issues.apache.org/jira/browse/ARROW-8910
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> Add support for explicit casts between signed and unsigned ints.
> Note that the type coercion optimizer rule should never implicitly perform 
> casts between types when data would be lost, e.g. from a negative value to an 
> unsigned type.




[jira] [Created] (ARROW-8910) [Rus5t] [DataFusion] Add support for explicit casts between signed and unsigned ints

2020-05-23 Thread Andy Grove (Jira)

Andy Grove created ARROW-8910:
-

 Summary: [Rus5t] [DataFusion] Add support for explicit casts 
between signed and unsigned ints
 Key: ARROW-8910
 URL: https://issues.apache.org/jira/browse/ARROW-8910
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
 Fix For: 1.0.0


Add support for explicit casts between signed and unsigned ints.

Note that the type coercion optimizer rule should never implicitly perform casts 
between types when data would be lost, e.g. from a negative value to an unsigned 
type.
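
To illustrate the distinction, here is a minimal sketch in plain Rust (standard library only, not DataFusion code). The function names are hypothetical, and mapping out-of-range values to None is only one possible policy for an explicit cast; it stands in for a null or an error.

```rust
use std::convert::TryFrom;

/// Checked explicit cast: values that do not fit in u32 become None
/// (standing in for a null or an error) instead of wrapping around.
fn cast_i32_to_u32_checked(values: &[i32]) -> Vec<Option<u32>> {
    values.iter().map(|&v| u32::try_from(v).ok()).collect()
}

/// Unchecked cast with `as`: a negative input is reinterpreted as a large
/// positive number. This is the kind of silent data loss that an implicit
/// coercion should never introduce.
fn cast_i32_to_u32_wrapping(values: &[i32]) -> Vec<u32> {
    values.iter().map(|&v| v as u32).collect()
}
```

With the input [7, -1], the checked version yields [Some(7), None], while the wrapping version turns -1 into 4294967295.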




[jira] [Resolved] (ARROW-8869) [Rust] [DataFusion] Type Coercion optimizer rule does not support new scan nodes

2020-05-23 Thread Andy Grove (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-8869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-8869.
---
Resolution: Fixed

Issue resolved by pull request 7230
[https://github.com/apache/arrow/pull/7230]

> [Rust] [DataFusion] Type Coercion optimizer rule does not support new scan 
> nodes
> 
>
> Key: ARROW-8869
> URL: https://issues.apache.org/jira/browse/ARROW-8869
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Affects Versions: 1.0.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Type Coercion optimizer rule does not support new scan nodes




[jira] [Commented] (ARROW-8907) [Rust] implement scalar comparison operations

2020-05-23 Thread Yordan Pavlov (Jira)



[ 
https://issues.apache.org/jira/browse/ARROW-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114893#comment-17114893
 ] 

Yordan Pavlov commented on ARROW-8907:
--

Sounds good [~andygrove], I think it makes sense to have efficient comparison 
to scalar values as they are often used in real world queries; I already have 
some work in progress for adding scalar comparison functions to the comparison 
kernel of arrow and hope to submit a pull request within the next few days. 
Hopefully this can later be used to increase Data Fusion performance with 
scalar values.

> [Rust] implement scalar comparison operations
> -
>
> Key: ARROW-8907
> URL: https://issues.apache.org/jira/browse/ARROW-8907
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Yordan Pavlov
>Priority: Major
>
> Currently comparing an array to a scalar / literal value using the comparison 
> operations defined in the comparison kernel here:
> https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs
> is very inefficient because:
> (1) an array with the scalar value repeated has to be created, taking time 
> and wasting memory
> (2) time is spent during comparison to load the same literal values over and 
> over
> Initial benchmarking of a specialized scalar comparison function indicates 
> good performance gains:
> eq Float32 time: [938.54 us 950.28 us 962.65 us]
> eq scalar Float32 time: [836.47 us 838.47 us 840.78 us]
> eq Float32 simd time: [75.836 us 76.389 us 77.185 us]
> eq scalar Float32 simd time: [61.551 us 61.605 us 61.671 us]
> The benchmark results above show that the scalar comparison function is about 
> 12% faster for non-SIMD and about 20% faster for SIMD comparison operations.
> And this is before accounting for creating the literal array. 
> In a more complex benchmark, the scalar comparison version is about 40% 
> faster overall when we account for not having to create arrays of scalar / 
> literal values.
> Here are the benchmark results:
> filter/filter with arrow SIMD (array) time: [647.77 us 675.12 us 706.69 us]
> filter/filter with arrow SIMD (scalar) time: [402.19 us 404.23 us 407.22 us]
> And here is the code for the benchmark:
> https://github.com/yordan-pavlov/arrow-benchmark/blob/master/rust/arrow_benchmark/src/main.rs#L230
> My only concern is that I can't see an easy way to use scalar comparison 
> operations in Data Fusion as it is currently designed to only work on arrays.
> [~paddyhoran] [~andygrove]  let me know what you think, would there be value 
> in implementing scalar comparison operations?




[jira] [Updated] (ARROW-8909) Out of order writes using setSafe

2020-05-23 Thread Saurabh (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-8909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh updated ARROW-8909:
---
Priority: Minor  (was: Major)

> Out of order writes using setSafe
> -
>
> Key: ARROW-8909
> URL: https://issues.apache.org/jira/browse/ARROW-8909
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Saurabh
>Priority: Minor
>
> I noticed that calling setSafe on a VarCharVector with indices not in 
> increasing order causes the lastIndex to be set to the index in the last call 
> to setSafe.
> Is this documented and expected behavior?
> Sample code:
> {code:java}
> import java.util.Collections;
> import lombok.extern.slf4j.Slf4j;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VarCharVector;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.types.pojo.ArrowType;
> import org.apache.arrow.vector.types.pojo.Field;
> import org.apache.arrow.vector.types.pojo.Schema;
> import org.apache.arrow.vector.util.Text;
> @Slf4j
> public class ATest {
>   public static void main(String[] args) {
>     Schema schema = new Schema(
>         Collections.singletonList(Field.nullable("Data", new ArrowType.Utf8())));
>     try (VectorSchemaRoot vroot = VectorSchemaRoot.create(schema, new RootAllocator())) {
>       VarCharVector vec = (VarCharVector) vroot.getVector("Data");
>       // Write indices 0..9 in increasing order.
>       for (int i = 0; i < 10; i++) {
>         vec.setSafe(i, new Text(Integer.toString(i) + "_mtest"));
>       }
>       // Then write to an earlier index again (out of order).
>       // vec.setSafe(0, new Text(Integer.toString(0) + "_new"));
>       vec.setSafe(7, new Text(Integer.toString(7) + "_new"));
>       vroot.setRowCount(10);
>       log.info(vroot.contentToTSVString());
>     }
>   }
> }
> {code}
>  
> If I set neither index 0 nor index 7 after the loop, I get all the entries 
> 0_mtest, 1_mtest, ..., 9_mtest.
> If I set index 0 after the loop, I see only the 0_new entry; the other entries 
> are "".
> If I set index 7 after the loop, I see 0_mtest, ..., 5_mtest, 7_new; the other 
> entries are "".
>  




[jira] [Created] (ARROW-8909) Out of order writes using setSafe

2020-05-23 Thread Saurabh (Jira)

Saurabh created ARROW-8909:
--

 Summary: Out of order writes using setSafe
 Key: ARROW-8909
 URL: https://issues.apache.org/jira/browse/ARROW-8909
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Saurabh


I noticed that calling setSafe on a VarCharVector with indices not in 
increasing order causes the lastIndex to be set to the index in the last call 
to setSafe.

Is this documented and expected behavior?

Sample code:
{code:java}
import java.util.Collections;
import lombok.extern.slf4j.Slf4j;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;
import org.apache.arrow.vector.util.Text;

@Slf4j
public class ATest {

  public static void main(String[] args) {
    Schema schema = new Schema(
        Collections.singletonList(Field.nullable("Data", new ArrowType.Utf8())));
    try (VectorSchemaRoot vroot = VectorSchemaRoot.create(schema, new RootAllocator())) {
      VarCharVector vec = (VarCharVector) vroot.getVector("Data");

      // Write indices 0..9 in increasing order.
      for (int i = 0; i < 10; i++) {
        vec.setSafe(i, new Text(Integer.toString(i) + "_mtest"));
      }
      // Then write to an earlier index again (out of order).
      // vec.setSafe(0, new Text(Integer.toString(0) + "_new"));
      vec.setSafe(7, new Text(Integer.toString(7) + "_new"));

      vroot.setRowCount(10);
      log.info(vroot.contentToTSVString());
    }
  }
}
{code}
 

If I set neither index 0 nor index 7 after the loop, I get all the entries 
0_mtest, 1_mtest, ..., 9_mtest.

If I set index 0 after the loop, I see only the 0_new entry; the other entries 
are "".

If I set index 7 after the loop, I see 0_mtest, ..., 5_mtest, 7_new; the other 
entries are "".

 




[jira] [Assigned] (ARROW-8869) [Rust] [DataFusion] Type Coercion optimizer rule does not support new scan nodes

2020-05-23 Thread Andy Grove (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-8869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-8869:
-

Assignee: Andy Grove

> [Rust] [DataFusion] Type Coercion optimizer rule does not support new scan 
> nodes
> 
>
> Key: ARROW-8869
> URL: https://issues.apache.org/jira/browse/ARROW-8869
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Affects Versions: 1.0.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Type Coercion optimizer rule does not support new scan nodes




[jira] [Commented] (ARROW-8907) [Rust] implement scalar comparison operations

2020-05-23 Thread Andy Grove (Jira)



[ 
https://issues.apache.org/jira/browse/ARROW-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114741#comment-17114741
 ] 

Andy Grove commented on ARROW-8907:
---

Thanks [~yordan-pavlov] . What I would really like is for DataFusion to use a 
specialized version of RecordBatch that can contain both Arrays and Scalar 
values, something like this:

 
{code:java}
enum ColumnarValue {
  Array(ArrayRef),
  Scalar(ScalarValue)
}

struct ColumnarBatch {
  columns: Vec<ColumnarValue>
}
{code}
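
As an illustration only, a kernel could dispatch on such an enum roughly as sketched below. This is plain Rust using only the standard library, not DataFusion code: Vec<f32> stands in for ArrayRef, f32 for ScalarValue, and the function name columnar_eq is hypothetical. Null handling is omitted, and returning a one-element result for the scalar/scalar case is an arbitrary choice made for the sketch.

```rust
/// Stand-ins for the real types: Vec<f32> plays the role of ArrayRef and
/// f32 the role of ScalarValue (both are assumptions for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum ColumnarValue {
    Array(Vec<f32>),
    Scalar(f32),
}

/// Element-wise equality that never expands a scalar into a full array.
fn columnar_eq(left: &ColumnarValue, right: &ColumnarValue) -> Vec<bool> {
    match (left, right) {
        // Array vs array: compare position by position.
        (ColumnarValue::Array(l), ColumnarValue::Array(r)) => {
            assert_eq!(l.len(), r.len(), "arrays must have equal length");
            l.iter().zip(r.iter()).map(|(a, b)| a == b).collect()
        }
        // Array vs scalar (either order): compare each element to the scalar.
        (ColumnarValue::Array(l), ColumnarValue::Scalar(s))
        | (ColumnarValue::Scalar(s), ColumnarValue::Array(l)) => {
            l.iter().map(|a| a == s).collect()
        }
        // Scalar vs scalar: a single comparison.
        (ColumnarValue::Scalar(a), ColumnarValue::Scalar(b)) => vec![a == b],
    }
}
```

The point of the design is that a literal in a query stays a Scalar all the way into the kernel, so no repeated-value array has to be built.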
 

 

> [Rust] implement scalar comparison operations
> -
>
> Key: ARROW-8907
> URL: https://issues.apache.org/jira/browse/ARROW-8907
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Yordan Pavlov
>Priority: Major
>
> Currently comparing an array to a scalar / literal value using the comparison 
> operations defined in the comparison kernel here:
> https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs
> is very inefficient because:
> (1) an array with the scalar value repeated has to be created, taking time 
> and wasting memory
> (2) time is spent during comparison to load the same literal values over and 
> over
> Initial benchmarking of a specialized scalar comparison function indicates 
> good performance gains:
> eq Float32 time: [938.54 us 950.28 us 962.65 us]
> eq scalar Float32 time: [836.47 us 838.47 us 840.78 us]
> eq Float32 simd time: [75.836 us 76.389 us 77.185 us]
> eq scalar Float32 simd time: [61.551 us 61.605 us 61.671 us]
> The benchmark results above show that the scalar comparison function is about 
> 12% faster for non-SIMD and about 20% faster for SIMD comparison operations.
> And this is before accounting for creating the literal array. 
> In a more complex benchmark, the scalar comparison version is about 40% 
> faster overall when we account for not having to create arrays of scalar / 
> literal values.
> Here are the benchmark results:
> filter/filter with arrow SIMD (array) time: [647.77 us 675.12 us 706.69 us]
> filter/filter with arrow SIMD (scalar) time: [402.19 us 404.23 us 407.22 us]
> And here is the code for the benchmark:
> https://github.com/yordan-pavlov/arrow-benchmark/blob/master/rust/arrow_benchmark/src/main.rs#L230
> My only concern is that I can't see an easy way to use scalar comparison 
> operations in Data Fusion as it is currently designed to only work on arrays.
> [~paddyhoran] [~andygrove]  let me know what you think, would there be value 
> in implementing scalar comparison operations?




[jira] [Updated] (ARROW-8902) [rust][datafusion] optimize count(*) queries on parquet sources

2020-05-23 Thread Paddy Horan (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-8902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paddy Horan updated ARROW-8902:
---
Component/s: Rust

> [rust][datafusion] optimize count(*) queries on parquet sources
> ---
>
> Key: ARROW-8902
> URL: https://issues.apache.org/jira/browse/ARROW-8902
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Alex Gaynor
>Priority: Minor
>
> Currently, as far as I can tell, when you perform a `select count(*) from 
> dataset` in datafusion against a parquet dataset, the way this is implemented 
> is by doing a scan on column 0, and counting up all of the rows (specifically 
> I think it counts the # of rows in each batch).
>  
> However, for the specific case of just counting _everything_ in a parquet 
> file, you can just read the row count from the footer metadata, so it's O(1) 
> instead of O(n).
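
The difference between the two approaches can be sketched in plain Rust as follows. The types below are hypothetical, simplified stand-ins for the Parquet footer and for decoded record batches; they are not the parquet crate's API. The sketch assumes the footer records a row count per row group, and the shortcut only applies when the query has no filter.

```rust
/// Hypothetical, simplified view of a Parquet footer: one entry per row group.
struct RowGroupMeta {
    num_rows: i64,
}

struct FileMeta {
    row_groups: Vec<RowGroupMeta>,
}

/// Count by scanning: every batch of column 0 has to be read and decoded
/// first, so the work grows with the number of rows.
fn count_by_scan(batches: &[Vec<i32>]) -> i64 {
    batches.iter().map(|batch| batch.len() as i64).sum()
}

/// Count from metadata: only the footer is consulted, so the work is
/// independent of the number of rows (it grows only with the number of
/// row groups).
fn count_from_footer(meta: &FileMeta) -> i64 {
    meta.row_groups.iter().map(|rg| rg.num_rows).sum()
}
```

Both functions return the same number for the same file; only the amount of data touched differs.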




[jira] [Updated] (ARROW-8902) [rust][datafusion] optimize count(*) queries on parquet sources

2020-05-23 Thread Paddy Horan (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-8902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paddy Horan updated ARROW-8902:
---
Issue Type: Improvement  (was: Bug)

> [rust][datafusion] optimize count(*) queries on parquet sources
> ---
>
> Key: ARROW-8902
> URL: https://issues.apache.org/jira/browse/ARROW-8902
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Alex Gaynor
>Priority: Minor
>
> Currently, as far as I can tell, when you perform a `select count(*) from 
> dataset` in datafusion against a parquet dataset, the way this is implemented 
> is by doing a scan on column 0, and counting up all of the rows (specifically 
> I think it counts the # of rows in each batch).
>  
> However, for the specific case of just counting _everything_ in a parquet 
> file, you can just read the row count from the footer metadata, so it's O(1) 
> instead of O(n).




[jira] [Created] (ARROW-8908) [Rust][DataFusion] improve performance of building literal arrays

2020-05-23 Thread Yordan Pavlov (Jira)

Yordan Pavlov created ARROW-8908:


 Summary: [Rust][DataFusion] improve performance of building 
literal arrays
 Key: ARROW-8908
 URL: https://issues.apache.org/jira/browse/ARROW-8908
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Yordan Pavlov


[~andygrove] I was doing some profiling and noticed a potential performance 
improvement, described below.


NOTE: The issue described below would be irrelevant if it were possible to use 
scalar comparison operations in DataFusion as described here:
https://issues.apache.org/jira/browse/ARROW-8907


The `build_literal_array` function defined here 
https://github.com/apache/arrow/blob/master/rust/datafusion/src/execution/physical_plan/expressions.rs#L1204
creates an array of literal values using a loop, but benchmarks suggest that 
creating the array from a vec is much faster 
(about 58 times faster when building an array with 10 values).
Here are the benchmark results:

array builder/array from vec: time: [25.644 us 25.883 us 26.214 us]
array builder/array from values: time: [1.4985 ms 1.5090 ms 1.5213 ms]

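Here is the benchmark code:
```
use arrow::array::{Array, PrimitiveArray, PrimitiveBuilder};
use arrow::datatypes::Float32Type; // element type assumed to be Float32Type
use criterion::Criterion;

fn bench_array_builder(c: &mut Criterion) {
    let array_len = 10;
    let mut count = 0;
    let mut group = c.benchmark_group("array builder");

    // Build the array in one step from a vec of repeated values.
    group.bench_function("array from vec", |b| b.iter(|| {
        let float_array: PrimitiveArray<Float32Type> = vec![1.0; array_len].into();
        count = float_array.len();
    }));
    println!("built array with {} values", count);

    // Build the array by appending one value at a time in a loop.
    group.bench_function("array from values", |b| b.iter(|| {
        // let float_array: PrimitiveArray<Float32Type> = build_literal_array(1.0, array_len);
        let mut builder = PrimitiveBuilder::<Float32Type>::new(array_len);
        for _ in 0..count {
            builder.append_value(1.0).unwrap();
        }
        let float_array = builder.finish();
        count = float_array.len();
    }));
    println!("built array with {} values", count);
}
```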




[jira] [Created] (ARROW-8907) [Rust] implement scalar comparison operations

2020-05-23 Thread Yordan Pavlov (Jira)

Yordan Pavlov created ARROW-8907:


 Summary: [Rust] implement scalar comparison operations
 Key: ARROW-8907
 URL: https://issues.apache.org/jira/browse/ARROW-8907
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Yordan Pavlov


Currently comparing an array to a scalar / literal value using the comparison 
operations defined in the comparison kernel here:
https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs
is very inefficient because:
(1) an array with the scalar value repeated has to be created, taking time and 
wasting memory
(2) time is spent during comparison to load the same literal values over and 
over

Initial benchmarking of a specialized scalar comparison function indicates good 
performance gains:

eq Float32 time: [938.54 us 950.28 us 962.65 us]
eq scalar Float32 time: [836.47 us 838.47 us 840.78 us]
eq Float32 simd time: [75.836 us 76.389 us 77.185 us]
eq scalar Float32 simd time: [61.551 us 61.605 us 61.671 us]

The benchmark results above show that the scalar comparison function is about 
12% faster for non-SIMD and about 20% faster for SIMD comparison operations.
And this is before accounting for creating the literal array. 
In a more complex benchmark, the scalar comparison version is about 40% faster 
overall when we account for not having to create arrays of scalar / literal 
values.
Here are the benchmark results:

filter/filter with arrow SIMD (array) time: [647.77 us 675.12 us 706.69 us]
filter/filter with arrow SIMD (scalar) time: [402.19 us 404.23 us 407.22 us]

And here is the code for the benchmark:
https://github.com/yordan-pavlov/arrow-benchmark/blob/master/rust/arrow_benchmark/src/main.rs#L230

My only concern is that I can't see an easy way to use scalar comparison 
operations in Data Fusion as it is currently designed to only work on arrays.

[~paddyhoran] [~andygrove]  let me know what you think, would there be value in 
implementing scalar comparison operations?
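
For illustration, the two approaches can be sketched in plain Rust over slices. This is not the arrow crate's kernel code: slices stand in for Arrow arrays, the function names eq_array and eq_scalar are hypothetical, and null handling and SIMD are left out.

```rust
/// Array-vs-array equality. To compare against a literal, the caller first
/// has to build a second array with the literal repeated once per row.
fn eq_array(left: &[f32], right: &[f32]) -> Vec<bool> {
    assert_eq!(left.len(), right.len(), "arrays must have equal length");
    left.iter().zip(right.iter()).map(|(l, r)| l == r).collect()
}

/// Array-vs-scalar equality. The literal is passed by value, so no repeated
/// array is allocated and the same value is not reloaded from memory for
/// every row.
fn eq_scalar(left: &[f32], right: f32) -> Vec<bool> {
    left.iter().map(|l| *l == right).collect()
}
```

Given the data [1.0, 2.0, 1.0] and the literal 1.0, eq_array needs a second array [1.0, 1.0, 1.0], while eq_scalar produces the same result, [true, false, true], directly from the literal.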



