[jira] [Created] (ARROW-11897) [Rust][Parquet] Use iterators to increase performance of creating Arrow arrays

2021-03-06 Thread Yordan Pavlov (Jira)
Yordan Pavlov created ARROW-11897:
-

 Summary: [Rust][Parquet] Use iterators to increase performance of 
creating Arrow arrays
 Key: ARROW-11897
 URL: https://issues.apache.org/jira/browse/ARROW-11897
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Yordan Pavlov


The overall goal is to create an efficient pipeline from Parquet page data into 
Arrow arrays, with as little intermediate conversion and memory allocation as 
possible. It is assumed, that for best performance, we favor doing fewer but 
larger copy operations (rather than more but smaller). 

Such a pipeline would need to be flexible in order to enable high performance 
implementations in several different cases:
(1) In some cases, such as plain-encoded number array, it might even be 
possible to copy / create the array from a single contiguous section from a 
page buffer. 
(2) In other cases, such as plain-encoded string array, since values are 
encoded in non-contiguous slices (where value bytes are separated by length 
bytes) in a page buffer contains multiple values, individual values will have 
to be copied separately and it's not obvious how this can be avoided.
(3) Finally, in the case of bit-packing encoding and smaller numeric values, 
page buffer data has to be decoded / expanded before it is ready to copy into 
an arrow arrow, so a `Vec` will have to be returned instead of a slice 
pointing to a page buffer.

A decoder output abstraction that enables all of the above cases and minimizes 
intermediate memory allocation is `Iterator)>`.
Then in case (1) above, where a numeric array could be created from a single 
contiguous byte slice, such an iterator could return a single item such as 
`(1024, &[u8])`. 
In case (2) above, where each string value is encoded as an individual byte 
slice, but it is still possible to copy directly from a page buffer, a decoder 
iterator could return a sequence of items such as `(1, &[u8])`. 
And finally in case (3) above, where bit-packed values have to be 
unpacked/expanded, and it's NOT possible to copy value bytes directly from a 
page buffer, a decoder iterator could return items representing chunks of 
values such as `(32, Vec)` where bit-packed values have been unpacked and  
the chunk size is configured for best performance.

Another benefit of an `Iterator`-based abstraction is that it would prepare the 
parquet crate for  migration to `async` `Stream`s (my understanding is that a 
`Stream` is effectively an async `Iterator`).

Then a higher level iterator could combine a value iterator and a (def) level 
iterator to produce a sequence of `ValueSequence(count, AsRef<[u8]>)` and 
`NullSequence(count)` items from which an arrow array can be created 
efficiently.

In future, a higher level iterator (for the keys) could be combined with a 
dictionary value iterator to create a dictionary array.

Finally, Arrow arrays would be created from a (generic) higher-level iterator, 
using a layer of array converters that know what the value bytes and nulls mean 
for each type of array.

[~nevime] , [~Dandandan] , [~jorgecarleitao] let me know what you think

Next steps:
* split work into smaller tasks that could be done over time



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11896) [Rust] Hang in CI

2021-03-06 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-11896:
---

 Summary: [Rust] Hang in CI
 Key: ARROW-11896
 URL: https://issues.apache.org/jira/browse/ARROW-11896
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Andrew Lamb


As observed first by [~nevi_me] on 
https://github.com/apache/arrow/pull/9592#issuecomment-791901636

The Rust CI tests seem to be failing due to a timeout, due to a timeout . For 
example: https://github.com/apache/arrow/runs/2045186826 





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11895) [Rust]DataFusion] Add support for extra column statistics

2021-03-06 Thread Jira
Daniël Heres created ARROW-11895:


 Summary: [Rust]DataFusion] Add support for extra column statistics
 Key: ARROW-11895
 URL: https://issues.apache.org/jira/browse/ARROW-11895
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Daniël Heres
Assignee: Daniël Heres






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11894) [Rust][DataFusion] Change flight server example to use DataFrame API

2021-03-06 Thread Jira
Daniël Heres created ARROW-11894:


 Summary: [Rust][DataFusion] Change flight server example to use 
DataFrame API
 Key: ARROW-11894
 URL: https://issues.apache.org/jira/browse/ARROW-11894
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres






--
This message was sent by Atlassian Jira
(v8.3.4#803005)