Re: [DISCUSSION] DRILL-7097 Rename MapVector to StructVector

Paul Rogers Sat, 01 Jun 2019 17:25:45 -0700

Hi All,

TL;DR: Drill already provides a number of powerful features that give us 80-90%
of what we need for the DICT type. Much time could be saved by using them and
focusing efforts on adding the remaining bits specific to DICT.


We can divide the DICT problem into two categories:

1. Internal representation, the topic of the previous note, which suggested
that a DICT is really just a repeated MAP.

2. DICT semantics, which is the topic here.
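
To make item 1 concrete, here is a rough conceptual sketch in plain Python (not
Drill vector code). It models one row's DICT value as a repeated MAP: an array
of small maps, each holding one entry. The field names "key" and "value" and
both helper functions are made up for this illustration.

```python
# Conceptual model only: plain Python, not Drill's vector implementation.
# The field names "key" and "value" are assumptions for illustration.

def dict_to_repeated_map(d):
    """View a dictionary as a repeated MAP: a list of {key, value} maps."""
    return [{"key": k, "value": v} for k, v in d.items()]


def repeated_map_to_dict(entries):
    """Inverse view: rebuild the dictionary, rejecting duplicate keys."""
    out = {}
    for entry in entries:
        if entry["key"] in out:
            raise ValueError(f"duplicate key: {entry['key']!r}")
        out[entry["key"]] = entry["value"]
    return out


row_value = {"a": 1, "b": 2}
entries = dict_to_repeated_map(row_value)
round_trip = repeated_map_to_dict(entries)
```

The only thing the second function adds over a plain map array is the
uniqueness check on the key, which is the DICT-specific part.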

Item 2, semantics, can itself be further divided into two groups:

3. Functionality already in Drill that can be extended/repurposed for the DICT 
type, if DICT is implemented as a repeated MAP.

4. New functionality which must be added.

Existing functionality includes things like:

* The flatten() function which, essentially, joins a DICT with its containing 
row.
* The powerful nested table functionality (added by Parth, Aman and others over 
the last year) that lets users treat a map array (hence a DICT) as a nested 
table and allows sorting, filtering, aggregation and many other SQL operations.
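
As a rough illustration of the "join with its containing row" idea, here is a
small plain-Python model (not Drill code). The flatten() helper, the column
names "id" and "d", and the "key"/"value" field names are all hypothetical; the
sketch only shows the shape of the result, one output row per map-array entry.

```python
# Conceptual model only: plain Python, not Drill's flatten() implementation.
# Column and field names ("id", "d", "key", "value") are assumptions.

def flatten(rows, column):
    """Yield one output row per element of row[column].

    Each output row copies the other columns of the containing row and
    replaces the array column with a single element of that array.
    """
    for row in rows:
        for entry in row[column]:
            out = dict(row)      # shallow copy of the containing row
            out[column] = entry  # one map-array element per output row
            yield out


rows = [
    {"id": 1, "d": [{"key": "a", "value": 10}, {"key": "b", "value": 20}]},
    {"id": 2, "d": [{"key": "a", "value": 30}]},
]
flat = list(flatten(rows, "d"))
```

Once the entries are laid out as rows like this, ordinary filtering, sorting
and aggregation apply to them without any DICT-specific code.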

For item 4, Igor probably has a list of new functionality. Some might include:

* A DICT data type which is a repeated map with the addition of identifying the 
key column. (Add a column property in ColumnMetadata, a field in 
MaterializedField.)

* Using the implied uniqueness constraint on the key column to plan nested 
table operations (some operations might be simpler if we know the key is unique 
within each map array.)

* Providing DICT functions such as extracting a value by key (noting that this 
can be done via a SELECT on the nested table.)

* And so on.
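
To sketch the last two ideas together: extracting a value by key is just a
selection over the map array, and the uniqueness constraint lets that selection
stop at the first match. This is a plain-Python model, not Drill code; the
get_by_key() helper and the "key"/"value" field names are made up for this
note.

```python
# Conceptual model only: plain Python, not a Drill function.
# get_by_key() and the field names "key"/"value" are assumptions.

def get_by_key(entries, key, default=None):
    """Return the value stored under `key` in a map array, or `default`.

    Because the key is assumed unique within the array, the first match
    is the only match, so the scan can stop early.
    """
    for entry in entries:
        if entry["key"] == key:
            return entry["value"]
    return default


entries = [{"key": "a", "value": 10}, {"key": "b", "value": 20}]
found = get_by_key(entries, "b")
missing = get_by_key(entries, "z", default=-1)
```

Without the uniqueness guarantee, the same lookup would have to scan the whole
array and decide what to do with multiple matches.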


Leveraging functionality Drill already has should reduce the cost of 
implementation, and should avoid the compatibility issues that started this 
discussion.

Thanks,
- Paul
