[ 
https://issues.apache.org/jira/browse/DRILL-7096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791928#comment-16791928
 ] 

Igor Guzenko commented on DRILL-7096:
-------------------------------------

Hello [~Paul.Rogers],

I have a few thoughts on the related concerns:

1) About the problem described in the Jira (efficient get by key): most 
probably we will sort the keys before writing maps into the new vector. 
Since the Map vector will be used for Hive maps, and MapObjectInspector 
gives us a Map<?,?> for each row in the column, sorting the keys before 
writing won't be an issue.
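
To make the idea concrete, here is a minimal sketch of sorting keys on write 
and using binary search on read. It is not Drill code: the class 
SortedMapColumn and its methods are hypothetical, plain Java lists stand in 
for the keys, values and offsets vectors, and keys are assumed to be strings 
with integer values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the proposed vector; not part of Drill's API. */
class SortedMapColumn {
    private final List<String> keys = new ArrayList<>();     // plays the role of keysVector
    private final List<Integer> values = new ArrayList<>();  // plays the role of valuesVector
    private final List<Integer> offsets = new ArrayList<>(); // plays the role of offsetsVector

    SortedMapColumn() {
        offsets.add(0); // row i occupies [offsets[i], offsets[i + 1])
    }

    /** Appends one row; TreeMap iterates entries in ascending key order. */
    void writeRow(Map<String, Integer> row) {
        for (Map.Entry<String, Integer> e : new TreeMap<>(row).entrySet()) {
            keys.add(e.getKey());
            values.add(e.getValue());
        }
        offsets.add(keys.size());
    }

    /** Binary search inside the slice of one row; returns null if the key is absent. */
    Integer get(int row, String key) {
        int lo = offsets.get(row);
        int hi = offsets.get(row + 1) - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = keys.get(mid).compareTo(key);
            if (cmp == 0) {
                return values.get(mid);
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SortedMapColumn column = new SortedMapColumn();
        Map<String, Integer> row = new HashMap<>();
        row.put("b", 2);
        row.put("a", 1);
        column.writeRow(row);
        System.out.println(column.get(0, "b"));
    }
}
```

With sorted keys a lookup costs O(log n) comparisons per row instead of a 
linear scan, at the price of one sort per row on write.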


2) The unnest functionality that you mentioned may be implemented as a 
conversion from the new MapVector to the current Map(*Struct*)Vector. 
All keys across all rows will be converted to strings, and whenever a new 
key is met, a new vector will be created to hold the values assigned to 
that key. Of course, users should be aware that their rows must share a 
limited set of keys; otherwise, when all keys in all rows are unique, we 
will hit an OOM error very quickly. I guess we can calculate the rate of 
new unique key additions while converting each row and detect the key 
uniqueness problem early.
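
A rough sketch of that conversion with an early uniqueness check might look 
like the following. Everything here is an assumption for illustration: the 
class UnnestSketch, the parameters maxKeysPerRow and minRowsBeforeCheck, and 
the use of plain lists as columns are hypothetical, and the check (distinct 
keys divided by rows seen) is only one possible way to measure the rate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of map-to-struct unnesting; not part of Drill's API. */
class UnnestSketch {

    /**
     * Builds one column per distinct key (keys are converted to strings).
     * Fails fast once the number of distinct keys per converted row exceeds
     * maxKeysPerRow, checked only after minRowsBeforeCheck rows.
     */
    static Map<String, List<Object>> unnest(List<? extends Map<?, ?>> rows,
                                            double maxKeysPerRow,
                                            int minRowsBeforeCheck) {
        Map<String, List<Object>> columns = new LinkedHashMap<>();
        for (int rowIdx = 0; rowIdx < rows.size(); rowIdx++) {
            for (Map.Entry<?, ?> e : rows.get(rowIdx).entrySet()) {
                String name = String.valueOf(e.getKey());
                List<Object> col = columns.computeIfAbsent(name, k -> new ArrayList<>());
                // Back-fill nulls for rows that did not contain this key.
                while (col.size() < rowIdx) {
                    col.add(null);
                }
                if (col.size() == rowIdx) {
                    col.add(e.getValue());
                } else {
                    col.set(rowIdx, e.getValue()); // two keys mapped to the same string
                }
            }
            int rowsSeen = rowIdx + 1;
            if (rowsSeen >= minRowsBeforeCheck
                    && (double) columns.size() / rowsSeen > maxKeysPerRow) {
                throw new IllegalStateException("Too many distinct keys: "
                        + columns.size() + " after " + rowsSeen + " rows");
            }
        }
        // Pad every column to the full row count.
        for (List<Object> col : columns.values()) {
            while (col.size() < rows.size()) {
                col.add(null);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = new HashMap<>();
        first.put("a", 1);
        Map<String, Integer> second = new HashMap<>();
        second.put("b", 2);
        System.out.println(unnest(Arrays.asList(first, second), 2.0, 2));
    }
}
```

When rows share keys, the ratio drops as more rows are converted; when every 
row brings new keys, it stays high and the conversion stops before memory is 
exhausted.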


3) Regarding use cases, the first place where the new vector will be used is 
reading map columns from Hive, and it looks reasonable to follow Hive's 
restriction on keys (primitives only). At a minimum, we also need to 
support all existing functionality related to the Map datatype. I have 
started listing use cases in the [Hive Complex Types design 
document|https://docs.google.com/document/d/1yEcaJi9dyksfMs4w5_GsZCQH_Pffe-HLeLVNNKsV7CA/edit?usp=sharing],
which is in progress now and will later be attached to DRILL-3290. Please feel 
free to add comments in the design doc; everything will be useful to me because 
I'm writing such a document for the first time.

4) About using unions for values: I guess you're thinking in terms of 
supporting the flexibility of JSON maps. In that case I'd rather go with 
all-text mode for map values than pollute memory and code with unions. For 
cases where the type of map values is clearly determined (as in Hive), we 
have a rich set of datatype-specific vectors. Hive unions may also be used 
as map values, but at least we will clearly know the number of types needed 
for them.

5) [~KazydubB] is now working on the new vector design, and he'll contribute 
his results to the design document mentioned above.

Thanks, Igor Guzenko

 

> Develop vector for canonical Map<K,V>
> -------------------------------------
>
>                 Key: DRILL-7096
>                 URL: https://issues.apache.org/jira/browse/DRILL-7096
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Igor Guzenko
>            Assignee: Bohdan Kazydub
>            Priority: Major
>
> The canonical Map<K,V> datatype can be represented using a combination of 
> three value vectors:
> keysVector - vector for storing the keys of each map
> valuesVector - vector for storing the values of each map
> offsetsVector - vector for storing the start index of each map
> So it's not very hard to create such a Map vector, but there is a major 
> issue with this representation: it's hard to search map values by key in 
> such a vector. We need to investigate some advanced techniques to make such 
> a search efficient, or find other, more suitable options to represent the 
> map datatype in the world of vectors.
> When asked about maps, the Apache Arrow developers responded that for Java 
> they don't have a real Map vector; for now they just have a logical Map type 
> definition where they define Map as: List< Struct<key:key_type, 
> value:value_type> >. So an implementation of the value vector would be 
> useful for Arrow too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
