[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091640#comment-17091640 ]

Neville Dipale commented on ARROW-5949:
---------------------------------------

I think that not providing more convenient ways of using DictionaryArray potentially defeats the purpose of having it. I've already mentioned the need for compute kernel support on dictionaries: some kernels would need access to the array's keys as a primitive array (e.g. sort, take), and others would need both keys and values (filter).

I would rather have DictionaryArray::keys() return an ArrayRef instead of a NullableIter, and then support iterating over arrays in general.

Yes, building the primitive array is a bit expensive and, more importantly, it's opaque to a casual Arrow user, so I'd support providing that option. Look at the below, for example:

{code:java}
impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
    pub fn decode_dictionary(&self) -> Result<ArrayRef> {
        // convert the keys into an array
        let keys = Arc::new(PrimitiveArray::<K>::from(self.data.clone())) as ArrayRef;
        // cast keys to an uint32 array
        let keys = crate::compute::cast(&keys, &DataType::UInt32)?;
        let keys = UInt32Array::from(keys.data());
        // index into the values of the dictionary, with keys
        crate::compute::take(&self.values(), &keys, None)
    }
}
{code}

This is how I'd convert a dictionary to a 'normal' array of an unknown type.

Perhaps this could be a discussion for the mailing list? I'm interested in simplifying the dictionary API and widening dictionary support; this could be a good starting point for doing so.

CC [~paddyhoran] [~andygrove]

> [Rust] Implement DictionaryArray
> --------------------------------
>
>                 Key: ARROW-5949
>                 URL: https://issues.apache.org/jira/browse/ARROW-5949
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Rust
>            Reporter: David Atienza
>            Assignee: David Atienza
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.17.0
>
>          Time Spent: 18h
>  Remaining Estimate: 0h
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is there any blocker?
>
> The specification is a bit [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] or even [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding], so I am not sure how to implement it myself.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
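As a rough illustration of what the decode step in the comment above amounts to, here is a minimal sketch that does not depend on the arrow crate. Each key indexes into the dictionary values, and a null key yields a null output. The free function `decode_dictionary` over plain slices is hypothetical (it is not the Arrow API), and nulls are modelled with `Option` rather than a validity bitmap.

```rust
/// Hypothetical stand-in for the decode step: look each key up in `values`.
/// A `None` key (null slot) produces a `None` output; an out-of-range key
/// is reported as an error instead of panicking.
pub fn decode_dictionary<T: Clone>(
    keys: &[Option<u32>],
    values: &[T],
) -> Result<Vec<Option<T>>, String> {
    keys.iter()
        .map(|key| match key {
            // Null slot: nothing to look up.
            None => Ok(None),
            // Valid slot: index into the dictionary values.
            Some(k) => values
                .get(*k as usize)
                .cloned()
                .map(Some)
                .ok_or_else(|| format!("key {} out of range for {} values", k, values.len())),
        })
        .collect()
}
```

Calling it with keys `[Some(0), Some(2), None, Some(1)]` over the values `["a", "b", "c"]` gives back a 'normal' nullable column with the strings materialised in key order. The real kernel would work on Arrow buffers and avoid the per-element `Option`.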
[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091600#comment-17091600 ]

Mahmut Bulut commented on ARROW-5949:
-------------------------------------

Sorry, yes, it is exactly like that; it is OK and valid. I gave that example to show that we could leave the indices as they are, the way -1 is masked out (unfortunately that won't work with unsigned values, which I think is why the bit-masking approach is better). Thanks for the links, they were fruitful.

I am more inclined not to build the primitive array. The user should neither have to collect the result from the iterator nor look for each Some(_) one by one. Instead, I would prefer to have a slice given back from the array, which would most probably help users who want to use SIMD later. That said, it would also be nice to offer users a PrimitiveArray API.

The current stable SIMD instructions, and packed_simd as well, are fill-free, so I need contiguous scalars for dictionary-encoded operations, which are crucial for my use case (repacking the Arrow array is an overhead for me).

So I have started on a vectorized slice implementation over the current dictionary array. Is it OK to include a slice-based approach in Arrow? With chunked offsets, we could even use Rust arrays too. What do you think?
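The slice-based access proposed above could look roughly like the following sketch. It assumes unsigned keys stored contiguously plus an Arrow-style validity bitmap (least-significant bit first within each byte). `KeysView`, `is_valid`, and `count_equal` are hypothetical names for illustration, not part of the Arrow crate.

```rust
/// Hypothetical borrowed view over dictionary keys: a contiguous slice of
/// raw keys plus a validity bitmap (bit i set means slot i is non-null).
pub struct KeysView<'a> {
    pub keys: &'a [u32],
    pub validity: &'a [u8],
}

impl<'a> KeysView<'a> {
    /// Returns true when slot `i` holds a non-null key.
    pub fn is_valid(&self, i: usize) -> bool {
        (self.validity[i / 8] >> (i % 8)) & 1 == 1
    }

    /// Counts valid slots whose key equals `target`. The comparison runs
    /// over the contiguous slice; the bitmap only masks the result.
    pub fn count_equal(&self, target: u32) -> usize {
        self.keys
            .iter()
            .enumerate()
            .filter(|(i, k)| **k == target && self.is_valid(*i))
            .count()
    }
}
```

With keys `[0, 1, 0, 0, 1]` and the bitmap `0b0001_0111`, slot 3 is null even though its raw key is 0. This is the situation where a sentinel cannot work for unsigned keys: every bit pattern is a legitimate key, so only the bitmap can tell the null apart.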
[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091569#comment-17091569 ]

Neville Dipale commented on ARROW-5949:
---------------------------------------

Thanks. Having looked at the implementation, I think they're handled the same way in Rust (if we exclude the iterator interface).

{code:java}
std::vector raw_indices = {0, 1, 2, -1, 3};
std::vector is_valid = {1, 1, 1, 0, 1};
{code}
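To show how the two vectors in the C++ test data relate, here is a small Rust sketch that packs the validity flags into an Arrow-style bitmap (least-significant bit first within each byte) and then reads the keys back as `Option`s. The helper names `pack_validity` and `nullable_keys` are made up for this example.

```rust
/// Packs per-slot validity flags into a bitmap, least-significant bit
/// first within each byte.
pub fn pack_validity(is_valid: &[bool]) -> Vec<u8> {
    let mut bitmap = vec![0u8; (is_valid.len() + 7) / 8];
    for (i, &valid) in is_valid.iter().enumerate() {
        if valid {
            bitmap[i / 8] |= 1u8 << (i % 8);
        }
    }
    bitmap
}

/// Reads the keys back as `Option`s; the raw value stored in a null slot
/// (here -1) is never observed by the caller.
pub fn nullable_keys(raw: &[i32], bitmap: &[u8]) -> Vec<Option<i32>> {
    raw.iter()
        .enumerate()
        .map(|(i, &k)| {
            if (bitmap[i / 8] >> (i % 8)) & 1 == 1 {
                Some(k)
            } else {
                None
            }
        })
        .collect()
}
```

For the data above, the five flags pack into the single byte `0b0001_0111`, and the -1 placeholder at index 3 comes back as `None`. The raw value in a null slot is arbitrary; the bitmap alone decides whether the slot is null.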
[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091555#comment-17091555 ]

Mahmut Bulut commented on ARROW-5949:
-------------------------------------

For the reference implementation I am talking about, please take a look at `TestStringDictionaryAppendIndices` in the C++ implementation to see how nulls are handled there.
[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091547#comment-17091547 ]

Neville Dipale commented on ARROW-5949:
---------------------------------------

Hi [~vertexclique], there was some discussion around using sentinel values instead of a bitmask ([https://github.com/apache/arrow/pull/6095#discussion_r367760573]), and I believe the deciding factor was that sentinel values are not spec-compliant.

We never resolved the following point, but I was of the opinion that it would be better to provide methods/functions that allow converting a dictionary array into a primitive array. My opinion was mainly informed by my concern that we don't have a way of using dictionary arrays in compute kernels, so at the time I preferred something that converts a `dict(i32)` array to `i32<1, 1, null, 2, null>`. The contributor of the PR provided a valid use case, which led them down the route of providing iterator access, so we eventually merged the PR on the premise that more work could be done in future to provide other access methods.

Regarding the 2 reasons:

R1: What do you mean by "rebuilding from that lookup"? Do you mean rebuilding a primitive array from the dictionary's iterator? If so, would a method that converts a dict(i32) into a primitive(i32) suffice for your needs?

R2: Could you please provide an example of what you mean by parallel comparison? My knowledge of SIMD and auto-vectorization is a bit limited, but what we noticed in the Rust implementation is that we can often forgo explicit SIMD in some compute kernels if we relegate null handling to bitmask manipulation and operate on arrays without branching to check nulls ([https://github.com/apache/arrow/pull/6086]).
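The point in R2 about relegating null handling to bitmask manipulation can be sketched as follows. This is only an illustration of the idea, not the kernel from the linked PR: `add_nullable` is a hypothetical function, and it assumes both inputs have the same length and use Arrow-style validity bitmaps.

```rust
/// Adds two equally sized columns. Values are summed for every slot without
/// checking nulls; the output validity is the byte-wise AND of the inputs.
pub fn add_nullable(
    left: &[i32],
    left_validity: &[u8],
    right: &[i32],
    right_validity: &[u8],
) -> (Vec<i32>, Vec<u8>) {
    assert_eq!(left.len(), right.len());
    assert_eq!(left_validity.len(), right_validity.len());

    // No per-element null branch, so the compiler is free to vectorize.
    let values: Vec<i32> = left
        .iter()
        .zip(right)
        .map(|(a, b)| a.wrapping_add(*b))
        .collect();

    // A slot is valid only if it is valid in both inputs.
    let validity: Vec<u8> = left_validity
        .iter()
        .zip(right_validity)
        .map(|(a, b)| a & b)
        .collect();

    (values, validity)
}
```

The sums computed for null slots are wasted work, but they are masked out by the resulting bitmap. Keeping the value loop branch-free is what gives auto-vectorization a chance.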
[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091512#comment-17091512 ]

Mahmut Bulut commented on ARROW-5949:
-------------------------------------

Hi, I've just seen this. Is there any reason why we provide a custom iterator over the keys, which basically resolves each key into an Option (Some or None)? Can we use 0 as a null identifier?
[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17047269#comment-17047269 ]

Neville Dipale commented on ARROW-5949:
---------------------------------------

I'm unable to assign this to andy-thomason; I don't have permission.

> [Rust] Implement DictionaryArray
> --------------------------------
>
>                 Key: ARROW-5949
>                 URL: https://issues.apache.org/jira/browse/ARROW-5949
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Rust
>            Reporter: David Atienza
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 18h
>  Remaining Estimate: 0h
[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984836#comment-16984836 ]

Andy Thomason commented on ARROW-5949:
--------------------------------------

We should discuss the design for a dictionary type and the necessary serialisation.

For example, start by adding

Dictionary((Box<DataType>, Box<DataType>)),

to DataType (key and value types).

We may not need the extra Schema dictionary field, as this is integral to the DataType.

pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec<ArrayRef>,
}

Note that to support multiple dictionary batches, we need a vector of values, although in the majority of our use cases we have only used a single dictionary. An option to concatenate dictionaries might be useful.

Access is similar to ListArray, except that the index is a variable type. For example, we often have a "chromosome" column which is "1", .. "X" and reduces to a byte.

Fast access to dictionary components is essential: returning slices for key and value per record batch.

It would be very useful for all types to have a rb.get_slice("name") function to get a named, typed slice for an array.

Andy.

> [Rust] Implement DictionaryArray
> --------------------------------
>
>                 Key: ARROW-5949
>                 URL: https://issues.apache.org/jira/browse/ARROW-5949
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Rust
>            Reporter: David Atienza
>            Priority: Major
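The design proposed above can be sketched in a self-contained form as below. This is an illustration under several assumptions, not the eventual Arrow implementation: `DataType` is cut down to three variants, keys are fixed to `u8` and values to `String` (matching the "chromosome" example), and keys are assumed to index into the concatenation of all dictionary batches. The method names `concat_values`, `key_slice`, and `value` are hypothetical.

```rust
/// Simplified stand-in for Arrow's DataType with the proposed variant.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    UInt8,
    Utf8,
    /// Key type and value type.
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Hypothetical dictionary column: one byte-sized key per row, plus one
/// list of values per dictionary batch.
pub struct DictionaryArray {
    pub keys: Vec<u8>,
    pub values: Vec<Vec<String>>,
}

impl DictionaryArray {
    /// The key and value types are carried by the data type itself.
    pub fn data_type(&self) -> DataType {
        DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8))
    }

    /// Concatenates all dictionary batches into a single dictionary.
    pub fn concat_values(&self) -> Vec<String> {
        self.values.iter().flatten().cloned().collect()
    }

    /// Fast, typed slice access to the keys.
    pub fn key_slice(&self) -> &[u8] {
        &self.keys
    }

    /// Resolves the value for `row`, treating the key as an index into
    /// the concatenated batches (an assumption of this sketch).
    pub fn value(&self, row: usize) -> Option<&str> {
        let key = *self.keys.get(row)? as usize;
        self.values.iter().flatten().nth(key).map(|s| s.as_str())
    }
}
```

A column with keys `[0, 2, 1, 2]` and two dictionary batches `["1", "2"]` and `["X"]` resolves row 1 to `"X"`. A production version would index the batches directly rather than walking them with `nth`.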
[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984568#comment-16984568 ]

Andy Thomason commented on ARROW-5949:
--------------------------------------

I've implemented this in two of our internal I/O libraries at work and should be able to help out if I get the time.

I've sent a test generator to Andy which should help. We have a huge repository of Arrow files to test it on.
[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885228#comment-16885228 ]

Wes McKinney commented on ARROW-5949:
-------------------------------------

I'd recommend looking at what we've done in C++; the implementation and usage are fairly mature there.

> [Rust] Implement DictionaryArray
> --------------------------------
>
>                 Key: ARROW-5949
>                 URL: https://issues.apache.org/jira/browse/ARROW-5949
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Rust
>            Reporter: David Atienza
>            Assignee: Andy Grove
>            Priority: Major

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885214#comment-16885214 ]

Andy Grove commented on ARROW-5949:
-----------------------------------

I'm not aware of any blockers. I expect this is just a case of nobody needing the feature yet.