Hi Bohdan,

As you note, the two constraints we have are 1) avoiding breaking
compatibility, and 2) providing the true map (DICT) type.

As we noted, the DICT type will allow SqlLine to present the type as a map.
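
To make that concrete, here is a rough sketch (plain Java, not Drill's actual
vector or reader API) of how a getObject()-style conversion might fold a
repeated map of (key, value) structs into a single Java Map, which SqlLine
would then display via toString(). The names DictSketch and toDict are
hypothetical and used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DictSketch {

    // Hypothetical helper: folds a repeated-map value, modeled here as a list
    // of {"key": ..., "value": ...} structs, into one Java Map.
    // Insertion order is preserved; a later duplicate key overwrites an earlier one.
    static Map<Object, Object> toDict(List<Map<String, Object>> entries) {
        Map<Object, Object> dict = new LinkedHashMap<>();
        for (Map<String, Object> entry : entries) {
            dict.put(entry.get("key"), entry.get("value"));
        }
        return dict;
    }

    public static void main(String[] args) {
        // One row of a repeated map column, as it would appear today.
        List<Map<String, Object>> row = List.of(
            Map.of("key", 1, "value", "v1"),
            Map.of("key", 2, "value", "v2"));

        // toString() on the folded Map is what a SqlLine-like tool would show.
        System.out.println(toDict(row)); // prints {1=v1, 2=v2}
    }
}
```

The "key" and "value" field names are assumed to match the repeated map
layout discussed in this thread.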

There seem to be advantages to reusing existing work where possible.

You have looked at the issue longer than the rest of us - can you see a path to 
building the feature this way? What issues would need resolution?

Does it make sense to try working out the proposed approach by revising your 
spec or Jira ticket description so we can see if we are on the right track?

Thanks,

- Paul


> On Jun 4, 2019, at 1:57 AM, Bohdan Kazydub <bohdan.kazy...@gmail.com> wrote:
> 
> Hi Paul,
> 
> If I understood you correctly, you are talking about implementing a
> "true map" as a list of STRUCT (which is currently named MAP in Drill). While
> this implementation is viable, we still need to introduce a new type for
> such a "true map", because REPEATED MAP is a different data type. That is,
> although a "true map" can be implemented using REPEATED MAP under the hood,
> the two are not the same type (e.g., a user may want to use REPEATED MAP,
> AKA repeated struct, and not a "true map").
> Is my understanding correct?
> 
> The approach in [1] follows the one used in Hive [2], as I find it
> clearer and it does not meddle with MAP's innards.
> 
> It is also worth mentioning that there is an open [WIP] PR [3] against
> Apache Arrow (opened a few days ago) that introduces MapVector using the
> approach you suggested.
> 
> [1]
> https://docs.google.com/presentation/d/1FG4swOrkFIRL7qjiP7PSOPy8a1vnxs5Z9PM3ZfRPRYo/edit#slide=id.p
> [2]
> https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/MapColumnVector.java#L30
> [3]
> https://github.com/apache/arrow/pull/4444
> 
> On Tue, Jun 4, 2019 at 2:59 AM Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
> 
>> Hi Igor,
>> 
>> Glad the community was able to provide a bit of help.
>> 
>> Let's talk about another topic. You said: "And main purpose will be
>> hiding of repeated map meta keys
>> ("key","value") and simulation of real map functionality."
>> 
>> On the one hand, we are all accustomed to thinking of a Java (or Python)
>> map as a black box: store (key, value) pairs, retrieve values by key. This
>> is the programming view. I wonder, however, if it is the best SQL view.
>> 
>> Drill is, of course, SQL-based. It may be easier to bring the data to SQL
>> than to bring SQL to the data. SQL works on tables (relations) and is very
>> powerful when doing so. Standard SQL does not, however, provide tools to
>> work with dictionaries. (There is an extension, SQL++, that might provide
>> such tools. But even if Drill supported SQL++, no front-end tool
>> provides such support AFAIK.)
>> 
>> So, how do we bring the DICT type to SQL? We do so by noting that a DICT
>> is really a little table of (key, value) pairs (with a uniqueness
>> constraint on the key.) Once we adopt this view, we can apply (I hope!) the
>> nested table mechanism recently added to Drill.
>> 
>> This means that the user DOES want to know the name of the key and value
>> columns: they are columns in a tuple (relation) that can be joined and
>> filtered. Suppose each customer has a DICT of contact information with keys
>> as "office", "home", "cell",... and values as the phone number. You can use
>> SQL to find the office numbers:
>> 
>> 
>> SELECT custName, contactInfo.value as phone WHERE contactInfo.key =
>> 'office'...
>> 
>> 
>> So, rather than wanting to hide the (key, value) structure of a DICT, we
>> could argue that exposing that structure allows the DICT to look like a
>> relation, and thus exploit existing Drill features. In fact, this may make
>> Drill more powerful when working with Hive maps than Hive itself is (if
>> Hive treats maps as opaque objects).
>> 
>> 
>> You also showed the SqlLine output you would like for a DICT column. This
>> example exposes a "lie" (a short-cut) that SqlLine exploits. SqlLine asks
>> Drill to convert a column to a Java Object of some sort, then SqlLine calls
>> toString() on that object to produce the value you see in SqlLine output.
>> 
>> Some examples. An array (repeated) column is a set of values. Drill
>> converts the repeated value to a Java array, which toString() converts to
>> something like "[1, 2, 3]". The same is true of MAP: Drill converts it to a
>> Java Map, which toString() converts to a JSON-like presentation.
>> 
>> So, your DICT (or repeated map) type should provide a getObject() method
>> that converts the repeated map to a Java Map. SqlLine will convert the map
>> object to the display format you showed in your example. (My guess is that
>> a repeated map today produces an array of Java Map objects: you want a
>> single Java Map built from the key/value pairs.)
>> 
>> 
>> A JDBC user can use the getObject() method to retrieve a Java Map
>> representation of a Drill DICT. (This functionality is not available in
>> ODBC AFAIK.) The same is true for anyone brave enough to use the native
>> Drill client API.
>> 
>> 
>> Thanks,
>> - Paul
>> 
>> 
>> 
>>    On Monday, June 3, 2019, 7:08:42 AM PDT, Igor Guzenko <
>> ihor.huzenko....@gmail.com> wrote:
>> 
>> Hi all,
>> 
>> So finally, I'm going to abandon the renaming ticket DRILL-7097 and
>> related PR (1803).
>> 
>> Next, DRILL-7096 should be rewritten to cover the addition of the new DICT
>> type. But, if I understand correctly, since it is based on a repeated
>> vector, the result for the new type will now be returned like:
>> 
>> row | dict_column MAP<INT, STRING>
>> ----+---------------------------------------------------------------------------------
>>  1  | [{"key":1, "value":"v1"}, {"key":2, "value":"v2"} ]
>>  2  | [{"key":0, "value":"v7"}, {"key":2, "value":"v2"}, {"key":4, "value":"v4"} ]
>>  3  | [{"key":-1, "value":"o"}]
>> 
>> And main purpose will be hiding of repeated map meta keys
>> ("key","value") and simulation of real map functionality.
>> 
>> I believe that it actually won't be so easy to reuse all the existing
>> functionality for repeated maps to return logically correct
>> results for DICT, because this uses repeated maps in an unexpected
>> way. I'd also like to hear Bohdan's thoughts about
>> such an application of repeated maps instead of a new vector.
>> 
>> Thanks, Igor
>> 
>> 
