[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2020-04-24 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091640#comment-17091640
 ] 

Neville Dipale commented on ARROW-5949:
---

I think not providing more convenient ways of using DictionaryArray potentially 
defeats the purpose of having it. I've already mentioned the need for compute 
kernel support on dictionaries, some of which would require access to the 
array's keys as a primitive array (e.g. sort, take), and others which would 
need both keys and values (filter).

I would rather have DictionaryArray::keys() return ArrayRef instead of 
NullableIter, and then support iterating on arrays in general.

Yes, building the primitive array is a bit expensive, and more importantly, 
it's opaque to a casual Arrow user; so I'd support providing that option.

Look at the below, for example:
{code:java}
impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
    pub fn decode_dictionary(&self) -> Result<ArrayRef> {
        // convert the keys into an array
        let keys = Arc::new(PrimitiveArray::<K>::from(self.data.clone())) as ArrayRef;
        // cast keys to a uint32 array
        let keys = crate::compute::cast(&keys, &DataType::UInt32)?;
        let keys = UInt32Array::from(keys.data());
        // index into the values of the dictionary, with keys
        // (assumes self.values() returns the dictionary values as an ArrayRef)
        crate::compute::take(&self.values(), &keys, None)
    }
}{code}
This is how I'd convert a dictionary to a 'normal' array of an unknown type.
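
To make the idea concrete without depending on the Arrow crate, here is a minimal, self-contained sketch of the same decode step. The `Dict` type and its `decode` method are hypothetical stand-ins written for illustration; they are not part of the Arrow API, and nulls are modelled with `Option` rather than a validity bitmap.

```rust
/// Hypothetical stand-in for a dictionary array: nullable keys that
/// index into a vector of distinct values.
pub struct Dict<V> {
    pub keys: Vec<Option<u32>>,
    pub values: Vec<V>,
}

impl<V: Clone> Dict<V> {
    /// Decode into a plain nullable vector, i.e. a `take` of `values`
    /// by `keys`. Null keys stay null; a key that points outside the
    /// dictionary is reported as an error.
    pub fn decode(&self) -> Result<Vec<Option<V>>, String> {
        self.keys
            .iter()
            .map(|key| match key {
                None => Ok(None),
                Some(k) => self
                    .values
                    .get(*k as usize)
                    .cloned()
                    .map(Some)
                    .ok_or_else(|| format!("key {} out of bounds", k)),
            })
            .collect()
    }
}
```

With keys `[0, 0, null, 1, null]` and values `[1, 2]`, `decode` yields `[1, 1, null, 2, null]`.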

Perhaps this could be a discussion for the mailing list? I'm interested in 
simplifying the dictionary API, and widening dictionary support; this could be 
a good starting point to do this. CC [~paddyhoran] [~andygrove]

> [Rust] Implement DictionaryArray
> 
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: David Atienza
>Assignee: David Atienza
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 18h
>  Remaining Estimate: 0h
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not 
> implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is 
> there any blocker?
>  
> The specification is a bit 
> [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] 
> or even 
> [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding],
>  so I am not sure how to implement it myself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2020-04-24 Thread Mahmut Bulut (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091600#comment-17091600
 ] 

Mahmut Bulut commented on ARROW-5949:
-

Sorry, yes, it is exactly like that; it is OK and valid. I gave that example to 
show that we can leave the indices as they are where -1 is masked out 
(unfortunately that won't work with unsigned values, which I think is why the 
bit-masking approach is better). Thanks for the links, they were fruitful.

 

I am more inclined not to build the primitive array. The user should neither 
have to collect the result from the iterator nor look for the Some(_) values 
one by one. That said, I tend towards having a slice given back from the array, 
which will most probably help users who want to use SIMD later. Though it is 
also nice to have a PrimitiveArray API given to users. The current stable SIMD 
instructions and packed_simd are both fill free, so I need contiguous scalars 
for dict-encoded operations, which are crucial for my use case (repacking the 
Arrow array is an overhead for me). So I have started to make a vectorized 
slice implementation over the current dictionary array. Is it OK to include a 
slice kind of approach in Arrow? With chunked offsets, we can even use Rust 
arrays too. Wdyt?
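
As a rough illustration of the slice-style access discussed here, the following sketch keeps the keys in one contiguous buffer and the nulls in a separate bitmap. `KeyBuffer` and `count_key` are hypothetical names, not Arrow APIs, and the bitmap layout (one bit per slot, least-significant bit first) is an assumption modelled on Arrow's validity bitmaps.

```rust
/// Hypothetical key buffer: contiguous keys plus a validity bitmap
/// (bit i set means slot i is non-null, least-significant bit first).
pub struct KeyBuffer {
    pub keys: Vec<i32>,
    pub validity: Vec<u8>,
}

impl KeyBuffer {
    /// Borrow the keys as a contiguous slice, with no repacking.
    pub fn as_slice(&self) -> &[i32] {
        &self.keys
    }

    /// Check the validity bit for slot `i`.
    pub fn is_valid(&self, i: usize) -> bool {
        self.validity[i / 8] & (1u8 << (i % 8)) != 0
    }
}

/// Count the valid slots whose key equals `target`. The slice is walked
/// in chunks of 8 so that each chunk lines up with one validity byte,
/// and the inner loop has no null-checking branch.
pub fn count_key(buf: &KeyBuffer, target: i32) -> usize {
    let mut count = 0;
    for (chunk, bits) in buf.as_slice().chunks(8).zip(&buf.validity) {
        for (lane, key) in chunk.iter().enumerate() {
            let valid = (*bits >> lane) & 1;
            count += (valid as usize) & ((*key == target) as usize);
        }
    }
    count
}
```

The value stored in a masked slot (for example -1) never contributes to the count, because its validity bit is 0.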



[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2020-04-24 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091569#comment-17091569
 ] 

Neville Dipale commented on ARROW-5949:
---

Thanks. Having looked at the implementation, I think nulls are handled the same 
way in Rust (if we exclude the iterator interface).
 
{code:java}
  std::vector raw_indices = {0, 1, 2, -1, 3};
  std::vector is_valid = {1, 1, 1, 0, 1};{code}
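
For comparison, a small Rust sketch of the same pairing: raw indices travel alongside a validity vector, and a masked slot is treated as null whatever raw value it holds. The function name `to_nullable` and the `i64`/`u8` element types are assumptions made for this example.

```rust
/// Pair raw indices with a validity vector. A slot whose validity
/// entry is 0 becomes `None`, regardless of the raw index stored there.
pub fn to_nullable(raw_indices: &[i64], is_valid: &[u8]) -> Vec<Option<i64>> {
    raw_indices
        .iter()
        .zip(is_valid)
        .map(|(&idx, &valid)| if valid != 0 { Some(idx) } else { None })
        .collect()
}
```

Applied to the two vectors above, the fourth slot comes back as `None`, so the -1 is never interpreted as a dictionary index.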
 



[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2020-04-24 Thread Mahmut Bulut (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091555#comment-17091555
 ] 

Mahmut Bulut commented on ARROW-5949:
-

For the reference implementation that I am talking about, please take a look at 
`TestStringDictionaryAppendIndices` in the C++ implementation to see how nulls 
are handled there.



[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2020-04-24 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091547#comment-17091547
 ] 

Neville Dipale commented on ARROW-5949:
---

Hi [~vertexclique], there was some discussion around using sentinel values 
instead of a bitmask 
([https://github.com/apache/arrow/pull/6095#discussion_r367760573]), 
and I believe it was a matter of sentinel values not being spec-compliant.

We never resolved the following point, but I was of the opinion that it'd be 
better to provide methods/functions that allow converting a dictionary array 
into a primitive array. 
My opinion was mainly informed by my concern that we don't have a way of using 
dictionary arrays in compute kernels, so at the time I preferred something to 
convert `dict(i32)[` to `i32<1, 1, null, 
2, null>`.

The contributor of the PR provided a valid use case, which led them down the 
route of providing iterator access, so we eventually merged the PR on the 
premise that more work could be done in future to provide other access methods.

Regarding the 2 reasons:

R1: what do you mean by "rebuilding from that lookup"? Do you mean rebuilding a 
primitive array from the dictionary's iterator? If so, would a method that 
converts a dict(i32) into a primitive(i32) suffice for your needs?

R2: could you please provide an example of what you mean by parallel 
comparison? My knowledge of SIMD and auto-vectorization is a bit limited, but 
what we noticed in the Rust implementation is that we can often forgo explicit 
SIMD in some compute kernels if we relegate null handling to bitmask 
manipulation, and operate on arrays without branching to check nulls 
([https://github.com/apache/arrow/pull/6086]).
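
A minimal sketch of that idea, not taken from the linked PR: the values are combined unconditionally, and nulls are resolved separately by ANDing the two validity bitmaps. The function name `add_nullable` and the byte-per-8-slots bitmap layout are assumptions for illustration.

```rust
/// Add two nullable i32 columns without branching on nulls.
/// Values are added for every slot (wrapping on overflow), and the
/// result is valid only where both inputs are valid.
pub fn add_nullable(
    a: &[i32],
    a_valid: &[u8],
    b: &[i32],
    b_valid: &[u8],
) -> (Vec<i32>, Vec<u8>) {
    // The tight loop over values has no null check, which leaves the
    // compiler free to vectorize it.
    let values: Vec<i32> = a.iter().zip(b).map(|(x, y)| x.wrapping_add(*y)).collect();
    // Null handling is a separate pass over the bitmaps.
    let validity: Vec<u8> = a_valid.iter().zip(b_valid).map(|(m, n)| m & n).collect();
    (values, validity)
}
```

Slots that are null in either input still hold a computed value in the output buffer, but their validity bit is 0, so readers ignore them.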



[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2020-04-24 Thread Mahmut Bulut (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091512#comment-17091512
 ] 

Mahmut Bulut commented on ARROW-5949:
-

Hi, I've just seen this. Is there any reason why we provide a custom iterator 
over keys, which basically resolves each key into an Option (Some or None)? Can 
we use 0 as a null identifier?



[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2020-02-27 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17047269#comment-17047269
 ] 

Neville Dipale commented on ARROW-5949:
---

I'm unable to assign this to andy-thomason, I don't have permission.

> [Rust] Implement DictionaryArray
> 
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: David Atienza
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 18h
>  Remaining Estimate: 0h
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not 
> implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is 
> there any blocker?
>  
> The specification is a bit 
> [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] 
> or even 
> [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding],
>  so I am not sure how to implement it myself.





[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2019-11-29 Thread Andy Thomason (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984836#comment-16984836
 ] 

Andy Thomason commented on ARROW-5949:
--

We should discuss the design for a dictionary type and the necessary 
serialisation.

For example, start by adding
 
Dictionary((Box<DataType>, Box<DataType>)),
 
to DataType (key and value types).
 
We may not need the extra Schema dictionary field, as this is integral to the 
DataType.
 
pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec<ArrayRef>,
}
 
Note that to support multiple dictionary batches, we need a vector of values, 
although in the majority of our use cases we have only used a single 
dictionary. An option to concatenate dictionaries might be useful.
 
Access is similar to ListArray except that the index is a variable type. For 
example, we often have a "chromosome" column which is "1", .. "X" and reduces 
to a byte.
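
To illustrate how such a column reduces to a byte per row, here is a small self-contained sketch. The function `encode_u8` is hypothetical and not an Arrow API; it assigns one-byte keys in first-seen order and gives up if the column has more distinct values than a `u8` key can address.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Dictionary-encode a string column into one-byte keys plus the
/// distinct values in first-seen order. Returns `None` if the column
/// has more than 256 distinct values.
pub fn encode_u8(column: &[&str]) -> Option<(Vec<u8>, Vec<String>)> {
    let mut lookup: HashMap<&str, u8> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys = Vec::with_capacity(column.len());
    for &item in column {
        let key = match lookup.get(item) {
            Some(&k) => k,
            None => {
                // The next free key is the current dictionary size.
                let k = u8::try_from(values.len()).ok()?;
                lookup.insert(item, k);
                values.push(item.to_string());
                k
            }
        };
        keys.push(key);
    }
    Some((keys, values))
}
```

For the column `["1", "X", "1", "2", "X"]` this gives keys `[0, 1, 0, 2, 1]` and values `["1", "X", "2"]`.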
 
Fast access to dictionary components is essential - returning slices for key 
and value per recordbatch. It would be very useful for all types to have a 
rb.get_slice("name") function to get a named, typed slice for an array.
 
Andy.
 
 

 

> [Rust] Implement DictionaryArray
> 
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: David Atienza
>Priority: Major
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not 
> implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is 
> there any blocker?
>  
> The specification is a bit 
> [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] 
> or even 
> [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding],
>  so I am not sure how to implement it myself.





[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2019-11-28 Thread Andy Thomason (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984568#comment-16984568
 ] 

Andy Thomason commented on ARROW-5949:
--

I've implemented this in two of our internal I/O libraries at work and should 
be able to help out if I get the time. I've sent a test generator to Andy, 
which should help. We have a huge repository of Arrow files to test it on.

 



[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2019-07-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885228#comment-16885228
 ] 

Wes McKinney commented on ARROW-5949:
-

I'd recommend looking at what we've done in C++; the implementation and usage 
are fairly mature there.

> [Rust] Implement DictionaryArray
> 
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: David Atienza
>Assignee: Andy Grove
>Priority: Major
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not 
> implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is 
> there any blocker?
>  
> The specification is a bit 
> [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] 
> or even 
> [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding],
>  so I am not sure how to implement it myself.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2019-07-15 Thread Andy Grove (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885214#comment-16885214
 ] 

Andy Grove commented on ARROW-5949:
---

I'm not aware of any blockers. I expect this is just a case of nobody needing 
the feature yet.
