Re: [DISCUSSION] DRILL-7097 Rename MapVector to StructVector

Paul Rogers Fri, 31 May 2019 12:12:05 -0700

Ted, you found the simple, elegant solution, as usual!

As it turns out, this project could be made far simpler if we just look at it 
from the right angle. Bear with me.

Ted and Charles each have, on occasion, suggested the value of a correlated 
list, along with functions to zip/unzip values in various ways. Drill has kv 
functions. Charles (I believe) suggested other, more general cases such as that 
zip function.

The new, proposed "DICT" type is, in one sense a hash map. But, the unique-name 
aspect does not fit well with write-once vectors. So, the DICT type is really a 
list of pairs (k, v). We provide functions to add a (k, v) pair, to find a 
value given a key, etc.

Here's the thing, to quote Monty Python, "we already got one": it is called a 
Map Array (AKA repeated map). Consider a Map of two fields (k VARCHAR, v 
VARCHAR). This is exactly the data structure proposed for the new DICT type. 
All that is missing are the functions to give it DICT-like behavior.

Why limit a DICT to VARCHAR keys or values? With a Map array, we can have (k 
INT, v DATE[]) (making up an array notation.) Want a "variant" key? Then it is 
(k VARCHAR, v UNION) (assuming we get UNION to work everywhere.)

We don't need to limit ourselves to a single value, either, since a MAP allows 
any number of fields. Maybe (id INT, name VARCHAR, age INT) or even (k INT, 
MAP(name VARCHAR, age INT)).

A quick read of the Hive MAP presentation (Igor's [3]) shows this is what the 
new Map is supposed to do.

You get the idea. In this world a DICT is just an alias for a map array, along 
with new functions.

This gives four big benefits:

1. No renaming needed, so no compatibility issues.
2. We can use the existing "complex" and "row set" readers and writers, etc.
3. No new low-level vectors, mutators, accessors etc. Just use the existing 
(fully tested) map array.
4. Done right, with a DICT as a correlated list (a map array), we can offer the 
additional functions that Ted and Charles suggested, killing two birds with one 
stone (to be a bit politically incorrect.)

To read/write a DICT, just use the existing Map Array features of the row set 
mechanism. Or, if you are old-school, the complex readers/writers for map 
arrays. (Or, if you are even older school, and have a strong stomach, you can 
diddle with the vectors directly.)

Just to be very clear here. If we introduce DICT as an alias for repeated MAP, 
then you IMMEDIATELY can use all our existing tools to create a Hive Map, and 
clients can already read the results. You must make it fancier and easier to 
use by adding DICT-specific features.

Thanks,
- Paul

    On Friday, May 31, 2019, 10:54:29 AM PDT, Ted Dunning 
<ted.dunn...@gmail.com> wrote:  

 Would it be possible to call the new structure a Dict (following Python's
inspiration)?

That would avoid the large disruption of renaming Map*.

On Fri, May 31, 2019 at 10:10 AM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi Igor,
>
> Thank you for finally addressing a long-running irritation: that the Drill
> Map type is not a map, it is a tuple.
>
> Perhaps you can divide the discussion into three parts.
>
> 1. Renaming classes, enums and other items internal to the Drill source
> code.
>
> 2. Renaming classes that are part of a public or ad-hoc API.
>
> 3. Renaming items visible to users.
>
> Changing items in part 1 causes a one-time disruption to anyone who has a
> working branch. However, a rebase onto master would easily resolve any
> issues. So, changes in this group are pretty safe.
>
>
> The PR also seems to change symbols visible to the anyone who has code in
> a repo separate from Drill, but that builds against Drill. All UDFs and
> plugins that use the former map classes must change. This means that those
> contributions can support only Drill before your PR or after; the
> maintainer would need two separate branches to support both versions of
> Drill.
>
> Such breaking of (implied) API compatibility is often considered a "bad
> thing." We may not want to complicate the lives of those who have
> graciously created Drill extensions and integrations.
>
> Finally, if we change anything visible from SqlLine, we break running
> applications, which we almost certainly do not want to do. See the changes
> to Types.java as an example.
>
> Can you make the change in a way that all your changes fall only into
> group 1, provide a gradual migration for group 2, and do not change
> anything in group 3?
>
> For example; the MinorType enum is a de-facto public API and must retain
> MAP with its current meaning, at least for some number of releases. You
> could add a STRUCT enum and mark MAP deprecated, and encourage third-party
> code to migrate. But we must still support MAP for some period of time to
> provide time for the migration. Then, add the new "map" as, say KVMAP,
> TRUEMAP, KVPAIRS, HIVEMAP, MAP2 or whatever. (Awkward, yes, but necessary.)
> In the future, when the old MAP enum value is retired, it can be repurposed
> as an alias for KVMAP (or whatever), and the KVMAP enum marked as
> deprecated, to be removed after several more releases.
>
>
> Similarly the SQL "MAP" type keyword cannot change, nor can the name of
> any SQL function (UDF) that use the "map" term. These changes will break
> SQL created by users which generally does not end well. Again, you can add
> a new alias, and encourage use of that alias.
>
> One could certainly argue that making a breaking change will impact a
> limited number of people, and that the benefit justifies the cost. I'll
> leave that debate to others, focusing here on the mechanics.
>
>
> Thanks,
> - Paul
>
>
>
>    On Friday, May 31, 2019, 12:06:35 AM PDT, Igor Guzenko <
> ihor.huzenko....@gmail.com> wrote:
>
>  Hello Drillers,
>
> I'm working on the renaming of Map vector[1] and related stuff to make
> space for new canonical Map<K,V> vector [2] [3]. I believe this
> renaming causes big impact on Drill and related client's code
> (ODBC/JDBC).
>
> So I'd like to be sure that this renaming is really necessary and
> everybody agrees with the changes. Please check the draft PR [4] and
> reply on the email.
>
> Alternative solution is simply leave current map vector as is and name
> newly created Map<K,V> vector (+readers, writers etc.)  differently.
>
> [1] https://issues.apache.org/jira/browse/DRILL-7097
> [2] https://issues.apache.org/jira/browse/DRILL-7096
> [3]
> https://docs.google.com/presentation/d/1FG4swOrkFIRL7qjiP7PSOPy8a1vnxs5Z9PM3ZfRPRYo/edit#slide=id.p
> [4] https://github.com/apache/drill/pull/1803
>
> Thanks, Igor Guzenko
>

Re: [DISCUSSION] DRILL-7097 Rename MapVector to StructVector

Reply via email to