[ 
https://issues.apache.org/jira/browse/ARROW-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734990#comment-16734990
 ] 

Wes McKinney commented on ARROW-4083:
-------------------------------------

{{ChunkedArray}} is probably the wrong abstraction for this. Eventually we are 
going to be forced to address this issue in cases where dictionary encoding is 
used for data compression. But I think it can be handled on a case-by-case 
basis. For example, when performing a hash aggregation, it would not make sense 
to materialize all dictionary-encoded data to dense form and then hash it again 
during aggregation. So it would be up to the hash aggregation implementation to 
treat both {{string}} and {{dictionary<string>}} as being "the same" from an 
analysis point of view. 
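
As a rough illustration of that idea, here is a minimal sketch of a count-by-value 
aggregation that accepts a stream mixing dense and dictionary-encoded string 
chunks. The types {{DenseStrings}}, {{DictStrings}}, {{Chunk}} and the functions 
{{Accumulate}} and {{CountValues}} are hypothetical stand-ins invented for this 
example; they are not the Arrow C++ API. The sketch assumes C++17, no nulls, and 
in-range dictionary indices.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

// Hypothetical stand-ins for array chunks; NOT the Arrow C++ API.
struct DenseStrings {
  std::vector<std::string> values;  // one string per row
};

struct DictStrings {
  std::vector<std::string> dictionary;  // distinct values
  std::vector<int32_t> indices;         // one index into `dictionary` per row
};

// A stream may contain either representation of the same logical type.
using Chunk = std::variant<DenseStrings, DictStrings>;
using Counts = std::unordered_map<std::string, int64_t>;

// Dense path: every row is hashed.
void Accumulate(const DenseStrings& chunk, Counts* counts) {
  for (const auto& value : chunk.values) {
    ++(*counts)[value];
  }
}

// Dictionary path: tally rows per index first, then hash each dictionary
// entry at most once. Nothing is materialized to dense form.
void Accumulate(const DictStrings& chunk, Counts* counts) {
  std::vector<int64_t> per_index(chunk.dictionary.size(), 0);
  for (int32_t index : chunk.indices) {
    ++per_index[static_cast<std::size_t>(index)];  // assumes a valid index
  }
  for (std::size_t i = 0; i < per_index.size(); ++i) {
    if (per_index[i] > 0) {
      (*counts)[chunk.dictionary[i]] += per_index[i];
    }
  }
}

// The aggregation treats both representations as "the same" logical type.
Counts CountValues(const std::vector<Chunk>& stream) {
  Counts counts;
  for (const auto& chunk : stream) {
    std::visit([&counts](const auto& c) { Accumulate(c, &counts); }, chunk);
  }
  return counts;
}
```

The point of the design is that the choice of code path stays inside the 
aggregation, so callers can pass chunks in whichever encoding they arrived in.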

FWIW, my presumption of Arrow "users" is that they are system developers, not 
end users, so we can expect a certain level of sophistication that would not be 
expected from, say, a pandas user.

> [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense 
> Array (of the dictionary type)
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-4083
>                 URL: https://issues.apache.org/jira/browse/ARROW-4083
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 0.13.0
>
>
> In some applications we may receive a stream of some dictionary-encoded data 
> followed by some non-dictionary-encoded data. For example, this happens in 
> Parquet files when the dictionary reaches a certain configurable size 
> threshold.
> We should think about how we can model this in our in-memory data structures, 
> and how it can flow through to relevant computational components (e.g. 
> certain data flow observers -- like an Aggregation -- might need to be able 
> to process either a dense or dictionary-encoded version of a particular array 
> in the same stream).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
