[jira] [Resolved] (ARROW-7276) [Ruby] Add support for building Arrow::ListArray from [[...]]

2019-11-29 Thread Yosuke Shiro (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yosuke Shiro resolved ARROW-7276.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5925
[https://github.com/apache/arrow/pull/5925]

> [Ruby] Add support for building Arrow::ListArray from [[...]]
> -
>
> Key: ARROW-7276
> URL: https://issues.apache.org/jira/browse/ARROW-7276
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7275) [Ruby] Add support for Arrow::ListDataType.new(data_type)

2019-11-29 Thread Yosuke Shiro (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yosuke Shiro resolved ARROW-7275.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5924
[https://github.com/apache/arrow/pull/5924]

> [Ruby] Add support for Arrow::ListDataType.new(data_type)
> -
>
> Key: ARROW-7275
> URL: https://issues.apache.org/jira/browse/ARROW-7275
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-7281) AdaptiveIntBuilder::length() does not consider pending_pos_.

2019-11-29 Thread Adam Hooper (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Hooper updated ARROW-7281:
---
Description: 
{code:c++}
arrow::AdaptiveIntBuilder builder(arrow::default_memory_pool());
builder.Append(1);
std::cout << builder.length() << std::endl;
{code}

Expected output: {{1}}
Actual output: {{0}}

I imagine this regression came with https://github.com/apache/arrow/pull/3040

My use case: I'm building a JSON parser that appends "records" (JSON Objects 
mapping key=>value) to Arrow columns (each key gets an ArrayBuilder). Not all 
JSON Objects contain all keys; so {{builder.Append()}} isn't always called. So 
on a subsequent row, I want to add nulls for every append that was skipped: 
{{builder.AppendNulls(row - builder.length()); builder.Append(value)}}. This 
fails because {{builder.length()}} is wrong.

Annoying but simple workaround: I maintain a separate {{length}} value 
alongside {{builder}}.

  was:
{code:c++}
arrow::AdaptiveIntBuilder builder(arrow::default_memory_pool());
builder.Append(1);
std::cout << builder.length() << std::endl;
{code}

Expected output: {{1}}
Actual output: {{0}}

I imagine this regression came with https://github.com/apache/arrow/pull/3040

My use case: I'm building a JSON parser that appends "records" (JSON Objects 
mapping key=>value) to Arrow columns (each key gets an ArrayBuilder). Not all 
JSON Objects contain all keys; so {{builder.Append()}} isn't always called. So 
on a subsequent row, I want to add nulls for every append that was skipped: 
{{builder.AppendNulls(builder.length() - row); builder.Append(value)}}. This 
fails because {{builder.length()}} is wrong.

Annoying but simple workaround: I maintain a separate {{length}} value 
alongside {{builder}}.


> AdaptiveIntBuilder::length() does not consider pending_pos_.
> 
>
> Key: ARROW-7281
> URL: https://issues.apache.org/jira/browse/ARROW-7281
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Adam Hooper
>Priority: Major
>
> {code:c++}
> arrow::AdaptiveIntBuilder builder(arrow::default_memory_pool());
> builder.Append(1);
> std::cout << builder.length() << std::endl;
> {code}
> Expected output: {{1}}
> Actual output: {{0}}
> I imagine this regression came with https://github.com/apache/arrow/pull/3040
> My use case: I'm building a JSON parser that appends "records" (JSON Objects 
> mapping key=>value) to Arrow columns (each key gets an ArrayBuilder). Not all 
> JSON Objects contain all keys; so {{builder.Append()}} isn't always called. 
> So on a subsequent row, I want to add nulls for every append that was 
> skipped: {{builder.AppendNulls(row - builder.length()); 
> builder.Append(value)}}. This fails because {{builder.length()}} is wrong.
> Annoying but simple workaround: I maintain a separate {{length}} value 
> alongside {{builder}}.
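
A minimal sketch of the workaround described above, written in plain Python rather than against the Arrow C++ API. PaddedColumn and records_to_columns are hypothetical names invented for this illustration; the list inside PaddedColumn merely stands in for an ArrayBuilder, and the separate length counter is the value that is maintained alongside it instead of being read back from the builder.

```python
class PaddedColumn:
    """Hypothetical stand-in for one column builder plus its own length counter."""

    def __init__(self):
        self.values = []  # stands in for the builder's storage
        self.length = 0   # tracked separately, never read back from the builder

    def append_at(self, row, value):
        # Add one null for every row that was skipped since the last append.
        missing = row - self.length
        if missing < 0:
            raise ValueError("row %d was already written" % row)
        self.values.extend([None] * missing)
        self.values.append(value)
        self.length = row + 1


def records_to_columns(records):
    """Turn a list of JSON-like dicts into equally long columns, padded with None."""
    columns = {}
    for row, record in enumerate(records):
        for key, value in record.items():
            columns.setdefault(key, PaddedColumn()).append_at(row, value)
    # Pad columns whose key was missing from the trailing records.
    for column in columns.values():
        column.values.extend([None] * (len(records) - column.length))
        column.length = len(records)
    return {key: column.values for key, column in columns.items()}
```

Note the order of the subtraction: the number of nulls to add is the current row minus the tracked length, matching the corrected description.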





[jira] [Created] (ARROW-7281) AdaptiveIntBuilder::length() does not consider pending_pos_.

2019-11-29 Thread Adam Hooper (Jira)
Adam Hooper created ARROW-7281:
--

 Summary: AdaptiveIntBuilder::length() does not consider 
pending_pos_.
 Key: ARROW-7281
 URL: https://issues.apache.org/jira/browse/ARROW-7281
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.15.1
Reporter: Adam Hooper


{code:c++}
arrow::AdaptiveIntBuilder builder(arrow::default_memory_pool());
builder.Append(1);
std::cout << builder.length() << std::endl;
{code}

Expected output: {{1}}
Actual output: {{0}}

I imagine this regression came with https://github.com/apache/arrow/pull/3040

My use case: I'm building a JSON parser that appends "records" (JSON Objects 
mapping key=>value) to Arrow columns (each key gets an ArrayBuilder). Not all 
JSON Objects contain all keys; so {{builder.Append()}} isn't always called. So 
on a subsequent row, I want to add nulls for every append that was skipped: 
{{builder.AppendNulls(builder.length() - row); builder.Append(value)}}. This 
fails because {{builder.length()}} is wrong.

Annoying but simple workaround: I maintain a separate {{length}} value 
alongside {{builder}}.





[jira] [Created] (ARROW-7280) Flight Data Store : memory mapped file [JAVA and Python]

2019-11-29 Thread Vinay (Jira)
Vinay created ARROW-7280:


 Summary: Flight Data Store : memory mapped file [JAVA and Python]
 Key: ARROW-7280
 URL: https://issues.apache.org/jira/browse/ARROW-7280
 Project: Apache Arrow
  Issue Type: Test
Reporter: Vinay


There are few reference implementations or examples of Arrow Flight data 
stores.

Holding very large data sets may require a memory-mapped file or the file 
system instead of an unsafe memory buffer.

It would be great to have a reference in this direction for the Java and 
Python APIs.

It would also help to know whether this is feasible with acceptable 
performance.

 





[jira] [Resolved] (ARROW-6515) [C++] Clean type_traits.h definitions

2019-11-29 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6515.
---
Resolution: Fixed

Issue resolved by pull request 5885
[https://github.com/apache/arrow/pull/5885]

> [C++] Clean type_traits.h definitions
> -
>
> Key: ARROW-6515
> URL: https://issues.apache.org/jira/browse/ARROW-6515
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> {{IsSignedInt}} takes either an array or a type as a type argument, which is 
> surprising and atypical for traits. Furthermore, whereas 
> {{is_signed_integer}} returns false for date and other types that are 
> represented by, but not identical to, integers, {{IsSignedInt}} returns true 
> because it checks only the {{c_type}}, which leads to 
> {{static_assert(IsSignedInt::value, "")}}. Finally, the declaration of 
> {{IsSignedInt}} is far from readable due to nested macro usage.





[jira] [Commented] (ARROW-7279) [C++] Rename UnionArray::type_ids to UnionArray::type_codes

2019-11-29 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984974#comment-16984974
 ] 

Wes McKinney commented on ARROW-7279:
-

It's fine with me

> [C++] Rename UnionArray::type_ids to UnionArray::type_codes
> ---
>
> Key: ARROW-7279
> URL: https://issues.apache.org/jira/browse/ARROW-7279
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Antoine Pitrou
>Priority: Minor
>
> This would be consistent with {{UnionType::type_codes}}. Furthermore, 
> "type_id" already means something else in the C++ API, so it would be less 
> confusing as well.





[jira] [Updated] (ARROW-5949) [Rust] Implement DictionaryArray

2019-11-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5949:
--
Labels: pull-request-available  (was: )

> [Rust] Implement DictionaryArray
> 
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: David Atienza
>Priority: Major
>  Labels: pull-request-available
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not 
> implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is 
> there any blocker?
>  
> The specification is a bit 
> [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] 
> or even 
> [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding],
>  so I am not sure how to implement it myself.





[jira] [Commented] (ARROW-7279) [C++] Rename UnionArray::type_ids to UnionArray::type_codes

2019-11-29 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984883#comment-16984883
 ] 

Antoine Pitrou commented on ARROW-7279:
---

[~wesm] [~emkornfield] Thoughts?

> [C++] Rename UnionArray::type_ids to UnionArray::type_codes
> ---
>
> Key: ARROW-7279
> URL: https://issues.apache.org/jira/browse/ARROW-7279
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Antoine Pitrou
>Priority: Minor
>
> This would be consistent with {{UnionType::type_codes}}. Furthermore, 
> "type_id" already means something else in the C++ API, so it would be less 
> confusing as well.





[jira] [Created] (ARROW-7279) [C++] Rename UnionArray::type_ids to UnionArray::type_codes

2019-11-29 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7279:
-

 Summary: [C++] Rename UnionArray::type_ids to 
UnionArray::type_codes
 Key: ARROW-7279
 URL: https://issues.apache.org/jira/browse/ARROW-7279
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Affects Versions: 1.0.0
Reporter: Antoine Pitrou


This would be consistent with {{UnionType::type_codes}}. Furthermore, "type_id" 
already means something else in the C++ API, so it would be less confusing as 
well.





[jira] [Resolved] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-11-29 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-6157.
---
Resolution: Fixed

Issue resolved by pull request 5892
[https://github.com/apache/arrow/pull/5892]

> [Python][C++] UnionArray with invalid data passes validation / leads to 
> segfaults
> -
>
> Key: ARROW-6157
> URL: https://issues.apache.org/jira/browse/ARROW-6157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> From the Python side, you can create an "invalid" UnionArray:
> {code}
> import pyarrow as pa
>
> binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
> int64 = pa.array([1, 2, 3], type='int64')
> # The type id 2 is out of bounds: there are only two children (ids 0 and 1).
> types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')
> value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
> a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
> {code}
> For example, converting the array to a Python list leads to a segfault:
> {code}
> In [7]: a.to_pylist()
> Segmentation fault (core dumped)
> {code}
> On the other hand, doing an explicit validation does not give an error:
> {code}
> In [8]: a.validate()
> {code}
> Should the validation raise an error in this case? The C++ 
> {{ValidateVisitor}} for UnionArray currently does nothing. If it raised an 
> error, validation could be called from the Python API to avoid creating 
> invalid arrays, and the resulting segfaults, there.
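
A rough sketch of the kind of check asked for above, in plain Python. It is not the code from pull request 5892 and does not use the pyarrow API: validate_dense_union is a hypothetical helper that takes plain lists, and it assumes that each type id is simply the index of a child array.

```python
def validate_dense_union(type_ids, value_offsets, child_lengths):
    """Raise ValueError if a dense union slot points outside its child array.

    Assumption for this sketch: type id N refers to child number N, and
    child_lengths[N] is the length of that child.
    """
    if len(type_ids) != len(value_offsets):
        raise ValueError("type ids and value offsets differ in length")
    for slot, (type_id, offset) in enumerate(zip(type_ids, value_offsets)):
        if not 0 <= type_id < len(child_lengths):
            raise ValueError(
                "slot %d: type id %d is out of bounds for %d children"
                % (slot, type_id, len(child_lengths))
            )
        if not 0 <= offset < child_lengths[type_id]:
            raise ValueError(
                "slot %d: offset %d is out of bounds for child %d of length %d"
                % (slot, offset, type_id, child_lengths[type_id])
            )
```

With the data from the report (child lengths 4 and 3), the check fails at the slot holding type id 2 instead of letting the array reach a conversion that dereferences it.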





[jira] [Comment Edited] (ARROW-5949) [Rust] Implement DictionaryArray

2019-11-29 Thread Andy Thomason (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984836#comment-16984836
 ] 

Andy Thomason edited comment on ARROW-5949 at 11/29/19 9:27 AM:


We should discuss the design for a dictionary type and the necessary 
serialisation.

For example, start by adding
  
{code:java}
 Dictionary((Box, Box)),{code}

To DataType (key and value types)
  
 We may not need the extra Schema dictionary field as this is integral in the 
DataType.
  
{code:java}
pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec,
}
{code}
 
Note that to support multiple dictionary batches, we need a vector of values, 
although in the majority of our use cases we have only used a single 
dictionary. An option to concatenate dictionaries might be useful.

Access is similar to ListArray, except that the index has a variable type. For 
example, we often have a "chromosome" column whose values are "1", ..., "X" 
and which reduces to a byte.

Fast access to dictionary components is essential: returning slices for key 
and value per record batch. It would be very useful for all types to have a 
rb.get_slice("name") function to get a named, typed slice for an array.
  
 Andy.
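
To make the proposal above concrete, here is a small illustration of dictionary encoding and of concatenating two dictionaries. It is written in plain Python, not Rust, and is not the Arrow implementation: the DictionaryArray class, encode and concatenate are hypothetical names for this sketch, and string values are assumed for simplicity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DictionaryArray:
    keys: List[Optional[int]]  # one small integer index per row (None = null)
    values: List[str]          # the dictionary: each distinct value stored once

    def get(self, row):
        key = self.keys[row]
        return None if key is None else self.values[key]


def encode(column):
    """Dictionary-encode a column, assigning keys in order of first appearance."""
    index = {}
    keys = []
    for item in column:
        if item is None:
            keys.append(None)
        else:
            keys.append(index.setdefault(item, len(index)))
    return DictionaryArray(keys, list(index))


def concatenate(first, second):
    """Merge two arrays into one that uses a single unified dictionary."""
    values = list(first.values)
    index = {value: i for i, value in enumerate(values)}
    remap = []  # remap[old key in second] -> key in the unified dictionary
    for value in second.values:
        if value not in index:
            index[value] = len(values)
            values.append(value)
        remap.append(index[value])
    keys = list(first.keys) + [None if k is None else remap[k] for k in second.keys]
    return DictionaryArray(keys, values)
```

For a "chromosome" column, encode(["1", "X", "1"]) stores the two distinct strings once and keeps one small key per row, which is why such a column can reduce to a byte per value.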
  
  

 



> [Rust] Implement DictionaryArray
> 
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: David Atienza
>Priority: Major
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not 
> implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is 
> there any blocker?
>  
> The specification is a bit 
> [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] 
> or even 
> [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding],
>  so I am not sure how to implement it myself.





[jira] [Comment Edited] (ARROW-5949) [Rust] Implement DictionaryArray

2019-11-29 Thread Andy Thomason (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984836#comment-16984836
 ] 

Andy Thomason edited comment on ARROW-5949 at 11/29/19 9:26 AM:


We should discuss the design for a dictionary type and the necessary 
serialisation.

For example, start by adding
  
 Dictionary((Box, Box)),
 To DataType (key and value types)
  
 We may not need the extra Schema dictionary field as this is integral in the 
DataType.
  
{code:java}
pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec,
}
{code}
 
Note that to support multiple dictionary batches, we need a vector of values, 
although in the majority of our use cases we have only used a single 
dictionary. An option to concatenate dictionaries might be useful.

Access is similar to ListArray, except that the index has a variable type. For 
example, we often have a "chromosome" column whose values are "1", ..., "X" 
and which reduces to a byte.

Fast access to dictionary components is essential: returning slices for key 
and value per record batch. It would be very useful for all types to have a 
rb.get_slice("name") function to get a named, typed slice for an array.
  
 Andy.
  
  

 



> [Rust] Implement DictionaryArray
> 
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: David Atienza
>Priority: Major
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not 
> implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is 
> there any blocker?
>  
> The specification is a bit 
> [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] 
> or even 
> [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding],
>  so I am not sure how to implement it myself.





[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2019-11-29 Thread Andy Thomason (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984836#comment-16984836
 ] 

Andy Thomason commented on ARROW-5949:
--

We should discuss the design for a dictionary type and the necessary 
serialisation.

For example, start by adding
 
Dictionary((Box, Box)),
To DataType (key and value types)
 
We may not need the extra Schema dictionary field as this is integral in the 
DataType.
 
pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec,
}
 
Note that to support multiple dictionary batches, we need a vector of values, 
although in the majority of our use cases we have only used a single 
dictionary. An option to concatenate dictionaries might be useful.

Access is similar to ListArray, except that the index has a variable type. For 
example, we often have a "chromosome" column whose values are "1", ..., "X" 
and which reduces to a byte.

Fast access to dictionary components is essential: returning slices for key 
and value per record batch. It would be very useful for all types to have a 
rb.get_slice("name") function to get a named, typed slice for an array.
 
Andy.
 
 

 

> [Rust] Implement DictionaryArray
> 
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: David Atienza
>Priority: Major
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not 
> implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is 
> there any blocker?
>  
> The specification is a bit 
> [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] 
> or even 
> [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding],
>  so I am not sure how to implement it myself.


