[jira] [Commented] (DRILL-7096) Develop vector for canonical Map

ASF GitHub Bot (JIRA) Mon, 29 Jul 2019 08:02:47 -0700


    [ 
https://issues.apache.org/jira/browse/DRILL-7096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895319#comment-16895319
 ]


ASF GitHub Bot commented on DRILL-7096:
---------------------------------------

KazydubB commented on issue #1829: DRILL-7096: Develop vector for canonical 
Map<K,V>
URL: https://github.com/apache/drill/pull/1829#issuecomment-516003913
 
 
   Hi @paul-rogers, thank you for your detailed comment.
   This map is intended to be a map in the Java sense, with strict key and 
value types. The key type was indeed meant to be primitive only, while the 
value could be of any type, primitive or complex, but I accidentally removed 
the check when finalizing the changes :(. Just to be clear, the key type is 
not limited to `VARCHAR`; it can be any other primitive type, such as `INT`, 
`BIGINT`, `FLOAT4` etc.
   
   This new type is intended for sources that support this kind of map, 
e.g. Parquet files, Hive tables etc.
   I did not consider EVF support because I thought EVF did not cover readers 
for sources with such a map type, but thank you for the guidance, I'll look 
into it. Of course, it should be added.
   
   The vector `extends` `RepeatedMapVector` and is essentially a repeated map 
vector (as you've noted) with constraints on its children and on the type of 
one child, 'key' (it also `@Override`s 
`RepeatedMapVector.Accessor#getObject(int)`). The constraint on children is 
implemented in `TrueMapVector#putChild(String, ValueVector)`, which disallows 
any other child fields. Additionally, a new reader and writer were introduced, 
which `extend` (inherit from) the repeated map's reader and writer 
respectively. The new writer differs from its parent in how it handles 
children's offsets and introduces two essential methods, `startKeyValuePair()` 
and `endKeyValuePair()`, to separate entries. The new type's reader adds 
methods to find the index of a given key and to read the value for a given key 
into a passed `ValueHolder`.
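
   To make the paragraph above concrete, here is a minimal sketch of the idea 
using plain Java collections instead of Drill's vectors. Only the names 
`startKeyValuePair()`, `endKeyValuePair()`, `"key"` and `"value"` come from 
this comment; the class `DictVectorSketch` and the methods `child`, `endRow`, 
`find` and `read` are hypothetical and are not the code in the pull request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the vector described above; NOT Drill's TrueMapVector.
final class DictVectorSketch {
    static final String KEY = "key";
    static final String VALUE = "value";

    final List<Object> keys = new ArrayList<>();     // the 'key' child
    final List<Object> values = new ArrayList<>();   // the 'value' child
    final List<Integer> offsets = new ArrayList<>(); // row i spans offsets[i]..offsets[i+1]
    private int pairStart = -1; // entry count when the open pair started, -1 if none

    DictVectorSketch() {
        offsets.add(0); // the first row starts at entry 0
    }

    // Mirrors the putChild constraint: only "key" and "value" children exist.
    List<Object> child(String name) {
        if (KEY.equals(name)) return keys;
        if (VALUE.equals(name)) return values;
        throw new IllegalArgumentException("Unknown child: " + name);
    }

    void startKeyValuePair() {
        if (pairStart >= 0) throw new IllegalStateException("previous pair not closed");
        pairStart = keys.size();
    }

    void endKeyValuePair() {
        if (pairStart < 0) throw new IllegalStateException("no open pair");
        if (keys.size() != pairStart + 1 || values.size() != pairStart + 1) {
            throw new IllegalStateException("a pair needs exactly one key and one value");
        }
        pairStart = -1;
    }

    // Closes the current row by recording where the next row starts.
    void endRow() {
        if (pairStart >= 0) throw new IllegalStateException("pair still open");
        offsets.add(keys.size());
    }

    // Returns the index of the key within the row, or -1 if it is absent.
    int find(int row, Object key) {
        int start = offsets.get(row);
        int end = offsets.get(row + 1);
        for (int i = start; i < end; i++) {
            if (keys.get(i).equals(key)) return i - start;
        }
        return -1;
    }

    // Returns the value stored for the key in the row, or null if it is absent.
    Object read(int row, Object key) {
        int index = find(row, key);
        return index < 0 ? null : values.get(offsets.get(row) + index);
    }
}
```

   In this sketch an entry is written as `startKeyValuePair()`, one add to 
`child("key")`, one add to `child("value")`, then `endKeyValuePair()`; asking 
for any other child fails, which is the behavior the constraint is meant to 
enforce.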
   With this approach, writing data into the new map vector is no different 
from writing data into a repeated map (provided, of course, that the developer 
does not try to access an "unknown" field, other than the defined `"key"` and 
`"value"`, on the writer for the new type). Flatten, for example, behaves 
the same for the two vectors. Where the two vectors (`TrueMapVector` and 
`RepeatedMapVector`) behave differently, the difference is handled explicitly 
in code; everywhere else the common logic is used.
   That being said, I do not see a reason to create a separate 
`AbstractRepeatedMap`, as I think extending `RepeatedMapVector` suffices 
(unless we really need to separate these two vectors).
   
   I am going to add documentation to the code so that developers do not have 
to reverse engineer it (sorry for making you go through this!), add unit tests 
for the new `ValueVector` and its writer, and slightly refactor the code to 
implement the aforementioned constraints. Also, I'll rename the type to `DICT`.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Develop vector for canonical Map<K,V>
> -------------------------------------
>
>                 Key: DRILL-7096
>                 URL: https://issues.apache.org/jira/browse/DRILL-7096
>             Project: Apache Drill
>          Issue Type: Sub-task
>            Reporter: Igor Guzenko
>            Assignee: Bohdan Kazydub
>            Priority: Major
>             Fix For: 1.17.0
>
>
> The canonical Map<K,V> datatype can be represented using a combination of 
> three value vectors:
> keysVector - vector for storing the keys of each map
> valuesVector - vector for storing the values of each map
> offsetsVector - vector for storing the start index of each next map
> So it's not very hard to create such a Map vector, but there is a major issue 
> with this representation: it's hard to search map values by key in such a 
> vector. We need to investigate some advanced techniques to make such a search 
> efficient, or find other, more suitable options to represent the map datatype 
> in the world of vectors.
> After a question about maps, Apache Arrow developers responded that for Java 
> they don't have a real Map vector; for now they just have a logical Map type 
> definition where they define Map as: List< Struct<key:key_type, 
> value:value_type> >. So an implementation of such a value vector would be 
> useful for Arrow too.
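
The three-vector layout in the description above can be illustrated with a 
small toy model. This is only a sketch under stated assumptions: Java lists 
stand in for value vectors, keys are strings and values are ints, and the class 
name `CanonicalMapLayoutSketch` with its methods `addRow` and `get` is 
hypothetical, not part of Drill or Arrow. The lookup is a plain linear scan, 
which shows why searching by key is the costly part of this representation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the keys/values/offsets layout; plain lists, not Drill vectors.
final class CanonicalMapLayoutSketch {
    final List<String> keys = new ArrayList<>();     // keysVector: keys of all rows, back to back
    final List<Integer> values = new ArrayList<>();  // valuesVector: parallel to keys
    final List<Integer> offsets = new ArrayList<>(); // offsetsVector: row i is [offsets[i], offsets[i+1])

    CanonicalMapLayoutSketch() {
        offsets.add(0); // the first map starts at entry 0
    }

    // Appends one map (one row) given parallel key and value arrays.
    void addRow(String[] rowKeys, int[] rowValues) {
        if (rowKeys.length != rowValues.length) {
            throw new IllegalArgumentException("keys and values must have equal length");
        }
        for (int i = 0; i < rowKeys.length; i++) {
            keys.add(rowKeys[i]);
            values.add(rowValues[i]);
        }
        offsets.add(keys.size()); // start index of the next map
    }

    // Linear scan over the row's slice of the keys; returns null if the key is absent.
    Integer get(int row, String key) {
        int start = offsets.get(row);
        int end = offsets.get(row + 1);
        for (int i = start; i < end; i++) {
            if (keys.get(i).equals(key)) return values.get(i);
        }
        return null;
    }
}
```

Each lookup here costs time proportional to the number of entries in the row, 
so any faster search would need extra structure on top of the three vectors.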



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
