[jira] [Resolved] (ARROW-7276) [Ruby] Add support for building Arrow::ListArray from [[...]]
[ https://issues.apache.org/jira/browse/ARROW-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yosuke Shiro resolved ARROW-7276.
---------------------------------
Fix Version/s: 1.0.0
Resolution: Fixed

Issue resolved by pull request 5925
[https://github.com/apache/arrow/pull/5925]

> [Ruby] Add support for building Arrow::ListArray from [[...]]
> --------------------------------------------------------------
>
> Key: ARROW-7276
> URL: https://issues.apache.org/jira/browse/ARROW-7276
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Ruby
> Reporter: Kouhei Sutou
> Assignee: Kouhei Sutou
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Resolved] (ARROW-7275) [Ruby] Add support for Arrow::ListDataType.new(data_type)
[ https://issues.apache.org/jira/browse/ARROW-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yosuke Shiro resolved ARROW-7275.
---------------------------------
Fix Version/s: 1.0.0
Resolution: Fixed

Issue resolved by pull request 5924
[https://github.com/apache/arrow/pull/5924]

> [Ruby] Add support for Arrow::ListDataType.new(data_type)
> ----------------------------------------------------------
>
> Key: ARROW-7275
> URL: https://issues.apache.org/jira/browse/ARROW-7275
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Ruby
> Reporter: Kouhei Sutou
> Assignee: Kouhei Sutou
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
[jira] [Updated] (ARROW-7281) AdaptiveIntBuilder::length() does not consider pending_pos_.
[ https://issues.apache.org/jira/browse/ARROW-7281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Hooper updated ARROW-7281:
-------------------------------
Description:
{code:c++}
arrow::AdaptiveIntBuilder builder(arrow::default_memory_pool());
builder.Append(1);
std::cout << builder.length() << std::endl;
{code}

Expected output: {{1}}

Actual output: {{0}}

I imagine this regression came with https://github.com/apache/arrow/pull/3040

My use case: I'm building a JSON parser that appends "records" (JSON Objects mapping key=>value) to Arrow columns (each key gets an ArrayBuilder). Not all JSON Objects contain all keys; so {{builder.Append()}} isn't always called. So on a subsequent row, I want to add nulls for every append that was skipped: {{builder.AppendNulls(row - builder.length()); builder.Append(value)}}. This fails because {{builder.length()}} is wrong.

Annoying but simple workaround: I maintain a separate {{length}} value alongside {{builder}}.

was:
{code:c++}
arrow::AdaptiveIntBuilder builder(arrow::default_memory_pool());
builder.Append(1);
std::cout << builder.length() << std::endl;
{code}

Expected output: {{1}}

Actual output: {{0}}

I imagine this regression came with https://github.com/apache/arrow/pull/3040

My use case: I'm building a JSON parser that appends "records" (JSON Objects mapping key=>value) to Arrow columns (each key gets an ArrayBuilder). Not all JSON Objects contain all keys; so {{builder.Append()}} isn't always called. So on a subsequent row, I want to add nulls for every append that was skipped: {{builder.AppendNulls(builder.length() - row); builder.Append(value)}}. This fails because {{builder.length()}} is wrong.

Annoying but simple workaround: I maintain a separate {{length}} value alongside {{builder}}.

> AdaptiveIntBuilder::length() does not consider pending_pos_.
> -------------------------------------------------------------
>
> Key: ARROW-7281
> URL: https://issues.apache.org/jira/browse/ARROW-7281
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.15.1
> Reporter: Adam Hooper
> Priority: Major
>
> {code:c++}
> arrow::AdaptiveIntBuilder builder(arrow::default_memory_pool());
> builder.Append(1);
> std::cout << builder.length() << std::endl;
> {code}
> Expected output: {{1}}
> Actual output: {{0}}
> I imagine this regression came with https://github.com/apache/arrow/pull/3040
> My use case: I'm building a JSON parser that appends "records" (JSON Objects mapping key=>value) to Arrow columns (each key gets an ArrayBuilder). Not all JSON Objects contain all keys; so {{builder.Append()}} isn't always called. So on a subsequent row, I want to add nulls for every append that was skipped: {{builder.AppendNulls(row - builder.length()); builder.Append(value)}}. This fails because {{builder.length()}} is wrong.
> Annoying but simple workaround: I maintain a separate {{length}} value alongside {{builder}}.
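The null-backfilling pattern described in this report can be sketched without Arrow at all. The following is only an illustration in plain Python, where a list stands in for each ArrayBuilder and `len(column)` plays the role that `builder.length()` is expected to play; the function name `records_to_columns` is hypothetical and is not part of any Arrow API.

```python
from typing import Any, Dict, List, Optional


def records_to_columns(
    records: List[Dict[str, Any]],
) -> Dict[str, List[Optional[Any]]]:
    """Turn row-oriented records into columns, padding skipped rows with None."""
    columns: Dict[str, List[Optional[Any]]] = {}
    for row, record in enumerate(records):
        for key, value in record.items():
            # A plain list stands in for the per-key ArrayBuilder (assumption).
            column = columns.setdefault(key, [])
            # Backfill one None per earlier row in which this key was absent.
            # The operand order matters: row - len(column), not the reverse.
            column.extend([None] * (row - len(column)))
            column.append(value)
    # Pad columns whose key was missing from the trailing rows.
    for column in columns.values():
        column.extend([None] * (len(records) - len(column)))
    return columns


records = [{"a": 1}, {"b": 2}, {"a": 3, "b": 4}]
columns = records_to_columns(records)
print(columns)
```

The sketch only works because `len(column)` always reflects every appended value. That is the property the report says `AdaptiveIntBuilder::length()` lacks while values are still pending.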
[jira] [Created] (ARROW-7281) AdaptiveIntBuilder::length() does not consider pending_pos_.
Adam Hooper created ARROW-7281:
----------------------------------

Summary: AdaptiveIntBuilder::length() does not consider pending_pos_.
Key: ARROW-7281
URL: https://issues.apache.org/jira/browse/ARROW-7281
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 0.15.1
Reporter: Adam Hooper

{code:c++}
arrow::AdaptiveIntBuilder builder(arrow::default_memory_pool());
builder.Append(1);
std::cout << builder.length() << std::endl;
{code}

Expected output: {{1}}

Actual output: {{0}}

I imagine this regression came with https://github.com/apache/arrow/pull/3040

My use case: I'm building a JSON parser that appends "records" (JSON Objects mapping key=>value) to Arrow columns (each key gets an ArrayBuilder). Not all JSON Objects contain all keys; so {{builder.Append()}} isn't always called. So on a subsequent row, I want to add nulls for every append that was skipped: {{builder.AppendNulls(builder.length() - row); builder.Append(value)}}. This fails because {{builder.length()}} is wrong.

Annoying but simple workaround: I maintain a separate {{length}} value alongside {{builder}}.
[jira] [Created] (ARROW-7280) Flight Data Store : memory mapped file [JAVA and Python]
Vinay created ARROW-7280:
----------------------------

Summary: Flight Data Store : memory mapped file [JAVA and Python]
Key: ARROW-7280
URL: https://issues.apache.org/jira/browse/ARROW-7280
Project: Apache Arrow
Issue Type: Test
Reporter: Vinay

There are few reference implementations or examples of data stores for Arrow Flight. Holding very large data sets may require a memory-mapped file or the file system instead of an unsafe memory buffer. It would be great to have a reference in this direction for the Java and Python APIs, along with an indication of whether this is feasible with acceptable performance.
[jira] [Resolved] (ARROW-6515) [C++] Clean type_traits.h definitions
[ https://issues.apache.org/jira/browse/ARROW-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francois Saint-Jacques resolved ARROW-6515.
-------------------------------------------
Resolution: Fixed

Issue resolved by pull request 5885
[https://github.com/apache/arrow/pull/5885]

> [C++] Clean type_traits.h definitions
> -------------------------------------
>
> Key: ARROW-6515
> URL: https://issues.apache.org/jira/browse/ARROW-6515
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Ben Kietzman
> Assignee: Francois Saint-Jacques
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> {{IsSignedInt}} takes either an array or a type as a type argument, which is surprisingly atypical for traits. Furthermore whereas {{is_signed_integer}} returns false for date and other types which are represented by but not identical to integers {{IsSignedInt}} returns true by checking only the {{c_type}}, which leads to {{static_assert(IsSignedInt::value, "")}}. Finally the declaration of {{IsSignedInt}} is far from readable due to nested macro usage.
[jira] [Commented] (ARROW-7279) [C++] Rename UnionArray::type_ids to UnionArray::type_codes
[ https://issues.apache.org/jira/browse/ARROW-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984974#comment-16984974 ]

Wes McKinney commented on ARROW-7279:
-------------------------------------

It's fine with me

> [C++] Rename UnionArray::type_ids to UnionArray::type_codes
> ------------------------------------------------------------
>
> Key: ARROW-7279
> URL: https://issues.apache.org/jira/browse/ARROW-7279
> Project: Apache Arrow
> Issue Type: Wish
> Components: C++
> Affects Versions: 1.0.0
> Reporter: Antoine Pitrou
> Priority: Minor
>
> This would be consistent with {{UnionType::type_codes}}. Furthermore, "type_id" already means something else in the C++ API, so it would be less confusing as well.
[jira] [Updated] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5949:
----------------------------------
Labels: pull-request-available  (was: )

> [Rust] Implement DictionaryArray
> --------------------------------
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Rust
> Reporter: David Atienza
> Priority: Major
> Labels: pull-request-available
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is there any blocker?
>
> The specification is a bit [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] or even [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding], so I am not sure how to implement it myself.
[jira] [Commented] (ARROW-7279) [C++] Rename UnionArray::type_ids to UnionArray::type_codes
[ https://issues.apache.org/jira/browse/ARROW-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984883#comment-16984883 ]

Antoine Pitrou commented on ARROW-7279:
---------------------------------------

[~wesm] [~emkornfield] Thoughts?

> [C++] Rename UnionArray::type_ids to UnionArray::type_codes
> ------------------------------------------------------------
>
> Key: ARROW-7279
> URL: https://issues.apache.org/jira/browse/ARROW-7279
> Project: Apache Arrow
> Issue Type: Wish
> Components: C++
> Affects Versions: 1.0.0
> Reporter: Antoine Pitrou
> Priority: Minor
>
> This would be consistent with {{UnionType::type_codes}}. Furthermore, "type_id" already means something else in the C++ API, so it would be less confusing as well.
[jira] [Created] (ARROW-7279) [C++] Rename UnionArray::type_ids to UnionArray::type_codes
Antoine Pitrou created ARROW-7279:
-------------------------------------

Summary: [C++] Rename UnionArray::type_ids to UnionArray::type_codes
Key: ARROW-7279
URL: https://issues.apache.org/jira/browse/ARROW-7279
Project: Apache Arrow
Issue Type: Wish
Components: C++
Affects Versions: 1.0.0
Reporter: Antoine Pitrou

This would be consistent with {{UnionType::type_codes}}. Furthermore, "type_id" already means something else in the C++ API, so it would be less confusing as well.
[jira] [Resolved] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults
[ https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-6157.
-----------------------------------
Resolution: Fixed

Issue resolved by pull request 5892
[https://github.com/apache/arrow/pull/5892]

> [Python][C++] UnionArray with invalid data passes validation / leads to segfaults
> ---------------------------------------------------------------------------------
>
> Key: ARROW-6157
> URL: https://issues.apache.org/jira/browse/ARROW-6157
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Joris Van den Bossche
> Assignee: Antoine Pitrou
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> From the Python side, you can create an "invalid" UnionArray:
> {code}
> binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
> int64 = pa.array([1, 2, 3], type='int64')
> types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')  # <- value of 2 is out of bounds for the number of children
> value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
> a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
> {code}
> E.g. on conversion to Python this leads to a segfault:
> {code}
> In [7]: a.to_pylist()
> Segmentation fault (core dumped)
> {code}
> On the other hand, doing an explicit validation does not give an error:
> {code}
> In [8]: a.validate()
> {code}
> Should the validation raise errors for this case? (the C++ {{ValidateVisitor}} for UnionArray does nothing)
> (so that this can be called from the Python API to avoid creating invalid arrays / segfaults there)
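The kind of bounds check the report asks for can be illustrated in plain Python. This is a hedged sketch only: the function `validate_dense_union` and its error messages are hypothetical, it does not reproduce what pull request 5892 actually implements, and it assumes that each type id is used directly as the index of a child array.

```python
from typing import Sequence


def validate_dense_union(
    type_ids: Sequence[int],
    value_offsets: Sequence[int],
    child_lengths: Sequence[int],
) -> None:
    """Raise ValueError if any slot refers to a missing child or child element."""
    if len(type_ids) != len(value_offsets):
        raise ValueError("type_ids and value_offsets differ in length")
    for i, (type_id, offset) in enumerate(zip(type_ids, value_offsets)):
        # Assumption: type ids index the children directly (0 .. n_children - 1).
        if not 0 <= type_id < len(child_lengths):
            raise ValueError(f"slot {i}: type id {type_id} has no matching child")
        if not 0 <= offset < child_lengths[type_id]:
            raise ValueError(
                f"slot {i}: offset {offset} is out of bounds for child {type_id}"
            )


# Same values as in the report: two children (4 binary values, 3 int64 values).
types = [0, 1, 0, 0, 2, 1, 0]
value_offsets = [0, 0, 2, 1, 1, 2, 3]
child_lengths = [4, 3]

try:
    validate_dense_union(types, value_offsets, child_lengths)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```

Rejecting the array at validation time turns the out-of-range type id into an ordinary exception rather than an invalid memory access during conversion.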
[jira] [Comment Edited] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984836#comment-16984836 ]

Andy Thomason edited comment on ARROW-5949 at 11/29/19 9:27 AM:
----------------------------------------------------------------

We should discuss the design for a dictionary type and the necessary serialisation.

For example, start by adding
{code:java}
Dictionary((Box, Box)),{code}
To DataType (key and value types)

We may not need the extra Schema dictionary field as this is integral in the DataType.
{code:java}
pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec,
}
{code}
Note that to support multiple dictionary batches, we need a vector of values, although in the majority of our use cases, we have only used a single dictionary. An option to concatenate dictionaries might be useful.

Access is similar to ListArray except that the index is a variable type.

For example, we often have a "chromosome" column which is "1", .. "X" and reduces to a byte.

Fast access to dictionary components is essential - returning slices for key and value per recordbatch.

It would be very useful for all types to have a rb.get_slice("name") function to get a named, typed slice for an array.

Andy.

was (Author: andy-thomason):
We should discuss the design for a dictionary type and the necessary serialisation.

For example, start by adding

Dictionary((Box, Box)),

To DataType (key and value types)

We may not need the extra Schema dictionary field as this is integral in the DataType.
{code:java}
pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec,
}
{code}
Note that to support multiple dictionary batches, we need a vector of values, although in the majority of our use cases, we have only used a single dictionary. An option to concatenate dictionaries might be useful.

Access is similar to ListArray except that the index is a variable type.

For example, we often have a "chromosome" column which is "1", .. "X" and reduces to a byte.

Fast access to dictionary components is essential - returning slices for key and value per recordbatch.

It would be very useful for all types to have a rb.get_slice("name") function to get a named, typed slice for an array.

Andy.

> [Rust] Implement DictionaryArray
> --------------------------------
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Rust
> Reporter: David Atienza
> Priority: Major
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is there any blocker?
>
> The specification is a bit [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] or even [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding], so I am not sure how to implement it myself.
[jira] [Comment Edited] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984836#comment-16984836 ]

Andy Thomason edited comment on ARROW-5949 at 11/29/19 9:26 AM:
----------------------------------------------------------------

We should discuss the design for a dictionary type and the necessary serialisation.

For example, start by adding

Dictionary((Box, Box)),

To DataType (key and value types)

We may not need the extra Schema dictionary field as this is integral in the DataType.
{code:java}
pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec,
}
{code}
Note that to support multiple dictionary batches, we need a vector of values, although in the majority of our use cases, we have only used a single dictionary. An option to concatenate dictionaries might be useful.

Access is similar to ListArray except that the index is a variable type.

For example, we often have a "chromosome" column which is "1", .. "X" and reduces to a byte.

Fast access to dictionary components is essential - returning slices for key and value per recordbatch.

It would be very useful for all types to have a rb.get_slice("name") function to get a named, typed slice for an array.

Andy.

was (Author: andy-thomason):
We should discuss the design for a dictionary type and the necessary serialisation.

For example, start by adding

Dictionary((Box, Box)),

To DataType (key and value types)

We may not need the extra Schema dictionary field as this is integral in the DataType.

pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec,
}

Note that to support multiple dictionary batches, we need a vector of values, although in the majority of our use cases, we have only used a single dictionary. An option to concatenate dictionaries might be useful.

Access is similar to ListArray except that the index is a variable type.

For example, we often have a "chromosome" column which is "1", .. "X" and reduces to a byte.

Fast access to dictionary components is essential - returning slices for key and value per recordbatch.

It would be very useful for all types to have a rb.get_slice("name") function to get a named, typed slice for an array.

Andy.

> [Rust] Implement DictionaryArray
> --------------------------------
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Rust
> Reporter: David Atienza
> Priority: Major
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is there any blocker?
>
> The specification is a bit [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] or even [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding], so I am not sure how to implement it myself.
[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984836#comment-16984836 ]

Andy Thomason commented on ARROW-5949:
--------------------------------------

We should discuss the design for a dictionary type and the necessary serialisation.

For example, start by adding

Dictionary((Box, Box)),

To DataType (key and value types)

We may not need the extra Schema dictionary field as this is integral in the DataType.

pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec,
}

Note that to support multiple dictionary batches, we need a vector of values, although in the majority of our use cases, we have only used a single dictionary. An option to concatenate dictionaries might be useful.

Access is similar to ListArray except that the index is a variable type.

For example, we often have a "chromosome" column which is "1", .. "X" and reduces to a byte.

Fast access to dictionary components is essential - returning slices for key and value per recordbatch.

It would be very useful for all types to have a rb.get_slice("name") function to get a named, typed slice for an array.

Andy.

> [Rust] Implement DictionaryArray
> --------------------------------
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Rust
> Reporter: David Atienza
> Priority: Major
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is there any blocker?
>
> The specification is a bit [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] or even [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding], so I am not sure how to implement it myself.
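The idea behind the proposed keys-plus-values layout can be sketched in a few lines. The example below is plain Python rather than Rust and does not use any Arrow API. The helper names `dictionary_encode` and `dictionary_decode` are hypothetical, and only a single dictionary is modelled, not the multiple dictionary batches mentioned in the comment.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def dictionary_encode(
    values: Sequence[Hashable],
) -> Tuple[List[int], List[Hashable]]:
    """Split values into integer keys plus a dictionary of distinct values."""
    dictionary: List[Hashable] = []
    positions: Dict[Hashable, int] = {}
    keys: List[int] = []
    for value in values:
        if value not in positions:
            # First occurrence: assign the next free dictionary slot.
            positions[value] = len(dictionary)
            dictionary.append(value)
        keys.append(positions[value])
    return keys, dictionary


def dictionary_decode(keys: Sequence[int], dictionary: Sequence[Hashable]) -> list:
    """Recover the original values by looking each key up in the dictionary."""
    return [dictionary[key] for key in keys]


chromosomes = ["1", "2", "X", "1", "X", "2", "1"]
keys, dictionary = dictionary_encode(chromosomes)
print(keys, dictionary)
```

With a small set of distinct values, as in the "chromosome" column from the comment, every key fits in a single byte, and the strings are stored once in the dictionary.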