[ https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984836#comment-16984836 ]
Andy Thomason edited comment on ARROW-5949 at 11/29/19 9:27 AM:
----------------------------------------------------------------

We should discuss the design for a dictionary type and the necessary serialisation.

For example, start by adding

{code:java}
Dictionary((Box<DataType>, Box<DataType>)),
{code}

to DataType (key and value types).

We may not need the extra Schema dictionary field, as this is integral to the DataType.

{code:java}
pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec<ArrayDataRef>,
}
{code}

Note that to support multiple dictionary batches we need a vector of values, although in the majority of our use cases we have only used a single dictionary. An option to concatenate dictionaries might be useful.

Access is similar to ListArray, except that the index is a variable type. For example, we often have a "chromosome" column which is "1", ..., "X" and reduces to a byte.

Fast access to dictionary components is essential: returning slices for key and value per record batch.

It would be very useful for all types to have a rb.get_slice<T>("name") function to get a named, typed slice for an array.

Andy.

> [Rust] Implement DictionaryArray
> --------------------------------
>
>                 Key: ARROW-5949
>                 URL: https://issues.apache.org/jira/browse/ARROW-5949
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Rust
>            Reporter: David Atienza
>            Priority: Major
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not
> implemented in the Rust implementation.
> I went through the list of issues and I could not see any work on this. Is
> there any blocker?
>
> The specification is a bit
> [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding]
> or even
> [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding],
> so I am not sure how to implement it myself.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
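
To make the proposal in the comment above more concrete, here is a minimal, self-contained sketch of the idea. It is not the Arrow API: plain Vec<u8> and Vec<String> stand in for ArrayRef and ArrayDataRef, the key type is fixed to a byte, and the method names (encode, keys, values, get, concatenate) are hypothetical. It also assumes that keys index the dictionary batches as if they were concatenated in order.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the proposed DictionaryArray: one key per row,
/// plus one values vector per dictionary batch.
#[derive(Debug, Clone, PartialEq)]
pub struct DictionaryArray {
    /// Index into the (concatenated) dictionary for each row.
    keys: Vec<u8>,
    /// One entry per dictionary batch; most use cases have a single batch.
    values: Vec<Vec<String>>,
}

impl DictionaryArray {
    /// Dictionary-encode a column, assigning keys in order of first appearance.
    /// Returns None if the column has more than 256 distinct values.
    pub fn encode(column: &[&str]) -> Option<Self> {
        let mut lookup: HashMap<&str, u8> = HashMap::new();
        let mut dictionary: Vec<String> = Vec::new();
        let mut keys = Vec::with_capacity(column.len());
        for &item in column {
            let existing = lookup.get(item).copied();
            let key = match existing {
                Some(k) => k,
                None => {
                    if dictionary.len() > u8::MAX as usize {
                        return None; // a byte key cannot address this value
                    }
                    let k = dictionary.len() as u8;
                    lookup.insert(item, k);
                    dictionary.push(item.to_string());
                    k
                }
            };
            keys.push(key);
        }
        Some(DictionaryArray { keys, values: vec![dictionary] })
    }

    /// Per-row keys as a slice.
    pub fn keys(&self) -> &[u8] {
        &self.keys
    }

    /// Values of one dictionary batch as a slice.
    pub fn values(&self, batch: usize) -> Option<&[String]> {
        self.values.get(batch).map(|v| v.as_slice())
    }

    /// Decode one row, treating the batches as one concatenated dictionary.
    pub fn get(&self, row: usize) -> Option<&str> {
        let mut key = *self.keys.get(row)? as usize;
        for batch in &self.values {
            if key < batch.len() {
                return Some(batch[key].as_str());
            }
            key -= batch.len();
        }
        None
    }

    /// Merge all dictionary batches into one. Keys stay unchanged because
    /// they already index the concatenated dictionary.
    pub fn concatenate(&mut self) {
        let batches = std::mem::take(&mut self.values);
        let merged: Vec<String> = batches.into_iter().flatten().collect();
        self.values = vec![merged];
    }
}

/// Example: a "chromosome" column reduces to one byte per row.
pub fn chromosome_example() -> DictionaryArray {
    DictionaryArray::encode(&["1", "2", "X", "1", "X"])
        .expect("at most 256 distinct values")
}
```

In this sketch keys() and values(batch) return borrowed slices, which is the kind of cheap per-batch access the comment asks for. A real implementation would need a variable key type and Arrow's own buffers rather than owned vectors.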