Hi All,

We've been having a lively discussion about the pros and cons of Arrow. What 
are people's thoughts about another difficult topic: how we support complex 
types?

Drill is a columnar engine. Columnar works best when data can be split into 
simple columns. The whole purpose of columnar is to group like values: all my 
INT "prodQuantity" values in this vector, all my VARCHAR "prodCode" values in 
that vector, and so on.
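To make the idea concrete, here is a minimal sketch (plain Java, not Drill code; the field names are just illustrations) of the columnar layout: each field lives in its own contiguous, single-type array, analogous to Drill's value vectors.

```java
// Hypothetical sketch (not Drill code): the columnar idea in plain Java.
public class ColumnarSketch {
    public static void main(String[] args) {
        // All INT "prodQuantity" values together in one typed array
        // (think IntVector)...
        int[] prodQuantity = {3, 7, 2};
        // ...and all VARCHAR "prodCode" values together in another
        // (think VarCharVector).
        String[] prodCode = {"A-100", "B-200", "C-300"};

        // An operator such as SUM scans one contiguous, single-type
        // array at a time -- this is where columnar wins.
        int total = 0;
        for (int q : prodQuantity) {
            total += q;
        }
        System.out.println("total quantity = " + total); // 12
    }
}
```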


Complex types, on the other hand, want to be wild & crazy like JSON. An 
arbitrary (or evolving) map here, an array (perhaps of heterogeneous types) 
there. Maybe a deeply nested structure, as in Suricata logs [1] or some REST 
responses. This is exactly what columnar does NOT support.


Data frames, in R and Pandas for example [2], are a hybrid: columnar, but each 
value can be any type. The whole purpose of a data frame is for a human to muck
about with the values, performing data cleansing, etc. But Drill is not R:
there is no human in the loop in a SQL engine, so it is not clear the data
frame's "anything goes" model works when it is software, not a human, messing
with the values.


We've talked many times, and at length, about the challenges of mapping
"anything goes" data structures into the confines of a fixed-type columnar
structure. We've got code that goes to heroic lengths to make this work: MAP,
DICT, UNION (which has never fully worked), LIST (which can act as a Repeated
UNION, among other roles, and which also has never fully worked), and so on.

Does anyone else get the sense that we're doing something wrong here?

We've struggled for years to map these structures to/from SQL relational column 
vectors. At the same time, Java has gotten pretty good at representing 
arbitrary JSON (or XML) structures. Maybe we should just use those Java 
structures.
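As an illustration (a sketch using plain JDK collections; in practice this might be something like Jackson's JsonNode, and the event shape below is made up), an arbitrary nested structure is trivial to hold as Java objects, with no fixed schema required:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an "anything goes" JSON-like record held as plain
// Java objects (Map / List / String / Number), with no schema declared.
public class JsonAsObjects {
    public static void main(String[] args) {
        // Roughly the shape of a nested, Suricata-style event.
        Map<String, Object> event = Map.of(
            "event_type", "alert",
            "src_ip", "10.0.0.5",
            "alert", Map.of(
                "signature", "test rule",
                "metadata", Map.of("tags", List.of("dns", "suspicious"))
            )
        );

        // Navigating the structure is ordinary Java; no vectors involved.
        @SuppressWarnings("unchecked")
        Map<String, Object> alert = (Map<String, Object>) event.get("alert");
        System.out.println(alert.get("signature")); // test rule
    }
}
```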

We still want columnar for basic queries, such as the traditional Parquet data 
lake use cases. So maybe we need a hybrid scheme. If we know the type, and have 
simple values, we store them in single-type vectors and get the goodness that 
results. But, if the user has non-relational data (complex structures, varying 
types, varying structures), represent those as Java objects. The user trades 
speed (simple columnar types) for flexibility (any data structure you want.)


As noted in an earlier post, Drill does, in fact, support Java object vectors. 
At present, they are only used for System tables, and only for queries that run 
in the root fragment (no network exchanges, so no serialization.)  Maybe we can 
offer the user: use any Java structure you want, just make sure it is 
serializable (for exchanges). You're responsible for providing the functions to 
understand the structure. Drill can even provide common functions for typical
tasks (get some value out of a complex structure, convert it to some given type
and store it in a vector, say.)

Furthermore, UDFs for complex types would be vastly simpler: no Holders, no
Readers, no obscure tags: just pass your Java object to a normal Java function.
Heck, you could even debug your functions!
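Something like the following sketch, say (a hypothetical helper, not an existing Drill API): the "UDF" is just an ordinary static method over ordinary objects, which you can call, unit test, and step through in a debugger like any other Java code.

```java
import java.util.Map;

// Hypothetical sketch: a complex-type "UDF" as a plain Java method --
// no Holders, no Readers, just objects in and a value out.
public class ComplexUdfSketch {

    // Pull a nested value out of a Map-based structure by a dotted path,
    // e.g. "alert.severity". Returns null if the path doesn't resolve.
    @SuppressWarnings("unchecked")
    static Object getPath(Map<String, Object> obj, String path) {
        Object current = obj;
        for (String key : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null; // path runs past a leaf value
            }
            current = ((Map<String, Object>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> row = Map.of(
            "alert", Map.of("severity", 2));
        System.out.println(getPath(row, "alert.severity")); // 2
    }
}
```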

The user would have to ensure that the query output works for the target use 
case. If the consumer is ODBC, then complex types must be reduced to simple 
relational structures (a restriction we already have today.) If the consumer is 
JDBC, then we can ship Java objects (as long as the classes are known to the 
client), and the client can figure out what to do with those objects.

The key benefits would be:

* Vector code would become much simpler (which would make conversion to Arrow 
easier, should we go that route).

* Classic Parquet queries would go fast (maybe faster, because we strip out 
complexity.)

* Drill can support arbitrary JSON structures in a simple, clean, standard way 
-- without the current complexity.

Thoughts?

Thanks,
- Paul

[1] 
https://suricata.readthedocs.io/en/suricata-5.0.1/output/eve/eve-json-format.html

[2] 
https://towardsdatascience.com/preventing-the-death-of-the-dataframe-8bca1c0f83c8
