Re: [DISCUSSION] DRILL-7097 Rename MapVector to StructVector

Bohdan Kazydub Tue, 04 Jun 2019 01:58:28 -0700

Hi Paul,

If I understood you correctly, you are talking about implementing a
"true map" as a list of STRUCT (which is currently named MAP in Drill). While
this implementation is viable, we still need to introduce a new type for
such a "true map", because REPEATED MAP remains a different data type. That is,
while a "true map" can be implemented using REPEATED MAP under the hood,
the two are not the same type (e.g., what if a user wants to use REPEATED MAP,
AKA repeated struct, and not a "true map"?).
Is my understanding correct?
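
To illustrate the distinction, here is a minimal sketch in plain Java. It is
not Drill code: the Entry class and both helper methods are hypothetical names
made up for this example. A repeated struct is an ordered list that may
legally hold the same key twice, while a "true map" has to enforce key
uniqueness.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RepeatedStructVsDict {

    // Hypothetical stand-in for one (key, value) struct; not a Drill class.
    static final class Entry {
        final int key;
        final String value;

        Entry(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // A repeated struct is just an ordered list of entries: duplicate keys are legal.
    static List<Entry> repeatedStruct(Entry... entries) {
        return new ArrayList<>(Arrays.asList(entries));
    }

    // A "true map" view of the same entries: rejects a key that appears twice.
    static Map<Integer, String> toDict(List<Entry> entries) {
        Map<Integer, String> dict = new LinkedHashMap<>();
        for (Entry e : entries) {
            if (dict.putIfAbsent(e.key, e.value) != null) {
                throw new IllegalArgumentException("duplicate key: " + e.key);
            }
        }
        return dict;
    }

    public static void main(String[] args) {
        List<Entry> rows = repeatedStruct(new Entry(1, "v1"), new Entry(1, "v2"));
        System.out.println(rows.size()); // the repeated struct keeps both entries
        try {
            toDict(rows);
        } catch (IllegalArgumentException ex) {
            System.out.println("not a valid dict: " + ex.getMessage());
        }
    }
}
```

The same data is valid as a repeated struct but invalid as a dictionary,
which is why the two cannot share a single type even if they share storage.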


The approach found in [1] was modeled on the one used in Hive [2]: I find it
clearer, and it does not meddle with MAP's innards.

It is also worth mentioning that there is an open [WIP] PR [3] against Apache
Arrow (opened a few days ago) which introduces MapVector and uses the approach
you suggested.

[1]
https://docs.google.com/presentation/d/1FG4swOrkFIRL7qjiP7PSOPy8a1vnxs5Z9PM3ZfRPRYo/edit#slide=id.p
[2]
https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/MapColumnVector.java#L30
[3]
https://github.com/apache/arrow/pull/4444

On Tue, Jun 4, 2019 at 2:59 AM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi Igor,
>
> Glad the community was able to provide a bit of help.
>
> Let's talk about another topic. You said: "And main purpose will be
> hiding of repeated map meta keys ("key","value") and simulation of real
> map functionality."
>
> On the one hand, we are all accustomed to thinking of a Java (or Python)
> map as a black box: store (key, value) pairs, retrieve values by key. This
> is the programming view. I wonder, however, if it is the best SQL view.
>
> Drill is, of course, SQL-based. It may be easier to bring the data to SQL
> than to bring SQL to the data. SQL works on tables (relations) and is very
> powerful when doing so. Standard SQL does not, however, provide tools to
> work with dictionaries. (There is an extension, SQL++, that might provide
> such tools. But even if Drill supported SQL++, no front-end tool
> provides such support AFAIK.)
>
> So, how do we bring the DICT type to SQL? We do so by noting that a DICT
> is really a little table of (key, value) pairs (with a uniqueness
> constraint on the key.) Once we adopt this view, we can apply (I hope!) the
> nested table mechanism recently added to Drill.
>
> This means that the user DOES want to know the name of the key and value
> columns: they are columns in a tuple (relation) that can be joined and
> filtered. Suppose each customer has a DICT of contact information with keys
> as "office", "home", "cell",... and values as the phone number. You can use
> SQL to find the office numbers:
>
>
> SELECT custName, contactInfo.value AS phone WHERE contactInfo.key =
> 'office'...
>
>
> So, rather than wanting to hide the (key, value) structure of a DICT, we
> could argue that exposing that structure allows the DICT to look like a
> relation, and thus exploit existing Drill features. In fact, this may make
> Drill more powerful when working with Hive maps than Hive itself is (if
> Hive treats maps as opaque objects).
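
The relational view of a DICT described above can be sketched in plain Java.
This is only an illustration under stated assumptions, not Drill's nested
table mechanism: the Customer and Contact classes, the valuesForKey helper,
and the sample data are all made up for this example. Each contactInfo is
treated as a small table of (key, value) rows that is filtered on key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DictAsRelation {

    // Hypothetical (key, value) row of the nested contactInfo table.
    static final class Contact {
        final String key;
        final String value;

        Contact(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Hypothetical customer row that carries the nested table.
    static final class Customer {
        final String custName;
        final List<Contact> contactInfo;

        Customer(String custName, Contact... contactInfo) {
            this.custName = custName;
            this.contactInfo = Arrays.asList(contactInfo);
        }
    }

    // Returns {custName, value} pairs for every contactInfo row whose key matches.
    static List<String[]> valuesForKey(List<Customer> customers, String key) {
        List<String[]> result = new ArrayList<>();
        for (Customer c : customers) {
            for (Contact ci : c.contactInfo) {
                if (ci.key.equals(key)) {
                    result.add(new String[] {c.custName, ci.value});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        List<Customer> customers = Arrays.asList(
            new Customer("alice", new Contact("office", "555-0100"), new Contact("cell", "555-0101")),
            new Customer("bob", new Contact("home", "555-0102")));
        for (String[] row : valuesForKey(customers, "office")) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

The filter only works because the "key" and "value" column names are visible
to the caller, which is the point being argued for keeping them exposed.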
>
>
> You also showed the SQLLine output you would like for a DICT column. This
> example exposes a "lie" (a shortcut) that SQLLine exploits. SQLLine asks
> Drill to convert a column to a Java Object of some sort, then SQLLine calls
> toString() on that object to produce the value you see in SQLLine output.
>
> Some examples. An array (repeated) column is a set of values. Drill
> converts the repeated value to a Java array, which toString() converts to
> something like "[1, 2, 3]". The same is true of MAP: Drill converts it to a
> Java Map, which toString() converts to a JSON-like presentation.
>
> So, your DICT (or repeated map) type should provide a getObject() method
> that converts the repeated map to a Java Map. SQLLine will convert the map
> object to the display format you showed in your example. (My guess is that
> a repeated map today produces an array of Java Map objects: you want a
> single Java Map built from the key/value pairs.)
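
The conversion suggested above might look roughly like the following sketch.
It uses only java.util collections, not Drill's vector or reader classes; the
toJavaMap and entry helpers are hypothetical names, and the printed format is
that of the plain JDK toString(), which may differ from what Drill's own
collection classes print.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RepeatedMapToDict {

    // Collapse [{"key": k1, "value": v1}, ...] into a single {k1: v1, ...} map.
    // A later entry with the same key overwrites an earlier one.
    static Map<Object, Object> toJavaMap(List<Map<String, Object>> repeatedMap) {
        Map<Object, Object> dict = new LinkedHashMap<>();
        for (Map<String, Object> entry : repeatedMap) {
            dict.put(entry.get("key"), entry.get("value"));
        }
        return dict;
    }

    // Builds one {"key": ..., "value": ...} element of the repeated map.
    static Map<String, Object> entry(Object key, Object value) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("key", key);
        m.put("value", value);
        return m;
    }

    public static void main(String[] args) {
        // Row 1 of the example output quoted later in this thread.
        List<Map<String, Object>> repeated = new ArrayList<>();
        repeated.add(entry(1, "v1"));
        repeated.add(entry(2, "v2"));

        System.out.println(repeated);            // [{key=1, value=v1}, {key=2, value=v2}]
        System.out.println(toJavaMap(repeated)); // {1=v1, 2=v2}
    }
}
```

The first line is what an array of per-entry maps looks like once toString()
is applied; the second is the single-map form with the meta keys hidden.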
>
>
> A JDBC user can use the getObject() method to retrieve a Java Map
> representation of a Drill DICT. (This functionality is not available in
> ODBC AFAIK.) The same is true for anyone brave enough to use the native
> Drill client API.
>
>
> Thanks,
> - Paul
>
>
>
>     On Monday, June 3, 2019, 7:08:42 AM PDT, Igor Guzenko <
> ihor.huzenko....@gmail.com> wrote:
>
>  Hi all,
>
> So finally, I'm going to abandon the renaming ticket DRILL-7097 and the
> related PR (1803).
>
> Next, DRILL-7096 should be rewritten to cover the addition of a new DICT
> type. But, if I understand correctly, if the new type is based on a
> repeated vector, its results will be returned like this:
>
> row | dict_column MAP<INT, STRING>
> ----+------------------------------------------------------------------------------
>   1 | [{"key":1, "value":"v1"}, {"key":2, "value":"v2"}]
>   2 | [{"key":0, "value":"v7"}, {"key":2, "value":"v2"}, {"key":4, "value":"v4"}]
>   3 | [{"key":-1, "value":"o"}]
>
> And main purpose will be hiding of repeated map meta keys
> ("key","value") and simulation of real map functionality.
>
> I believe that it actually won't be so easy to reuse all the existing
> functionality for repeated maps to return logically correct
> results for DICT, because this uses a repeated map in an unexpected
> way. Also, I'd like to hear Bohdan's thoughts on
> using repeated maps this way instead of a new vector.
>
> Thanks, Igor
>
>
