Hi Igor,

Thanks for the note. My post was in response to the work I'm doing to convert 
JSON to EVF. The work seems fractal: I need x hours to support the scalar 
vectors. Then, I need x hours for each of the complex data types. As I dive 
into the details of UNION and LIST, I need x hours for each of those cases. 
Then, when I think of a UNION that includes a LIST of DICT types which include 
UNIONs... It seems the amount of work becomes exponential. All for something 
that may never actually be used.

On the other hand, we know that folks want to use arbitrary types for JSON, 
REST, etc. So, the idea is to cap the work at the first x hours: the main 
vector types which everyone uses all the time and which are a core part of SQL.

This is not a well-developed idea, but we can brainstorm as we go along. We 
might divide types into two broad groups: SQL and "JSON". The SQL group 
represents anything that fits into the (extended) Relational model: tables of 
tuples (possibly nested.) For example:

* Scalars (nullable, non-nullable)

* Nested tables: arrays, including arrays of maps. (A repeated map is now just 
a subtable, supported via the lateral join feature.)

* Nested tuples: well-defined nested structures, typically called a STRUCT in 
other tools.

The semantics and mechanism of these types are well defined; they fit well into 
the vector format.

On the other hand are the "JSON" types: anything that can be described as JSON. 
For such types, rather than an endlessly expanding set of complex cases, define 
a single simple mechanism that uses Java objects. The simplest case is just to 
use objects that mimic JSON: List, Map, Long, Double, String. Multiple JSON 
serializers exist for text (JSON) and binary (BSON).
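
For instance, here is a minimal sketch, using Jackson (one such serializer), 
of what binding arbitrary JSON to plain Java objects looks like; nothing here 
is Drill code:

import com.fasterxml.jackson.databind.ObjectMapper;

// Binding to Object.class yields exactly the List/Map/Number/String
// representation described above.
ObjectMapper mapper = new ObjectMapper();
Object value = mapper.readValue("{\"a\": [1, 2.5, \"x\"]}", Object.class);
// value is now a Map containing a List of Number and String values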

Again, making things up as we go along, the JSON vector would have two forms: 
serialized (as a VARBINARY vector) and deserialized (as Java objects). The 
serialized form would be the one sent over the wire on exchanges and to the 
client. The deserialized form would be produced by readers, consumed and 
produced by functions, and so on.

I suspect we'd need some schema hint to say, "these five fields are SQL types, 
but this other field is a JSON type." 

Here is a simple example. Suppose we have this JSON:

{ name: "Fred", balance: 123.45, city: "Bedrock",
  shoppingCard: [ {prodId: 1234, quantity: 1}, {prodId: 2345, quantity: 10} ],
  contactInfo: { phones: [ "123-4567", "345-5678 x123" ], notes: "Call in early 
AM" }}

The first three fields are simple SQL types that appear in all records. The 
next field is a nested table of two columns. The last bit is an arbitrary JSON 
structure that can vary for each record.
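
To make this concrete, the hint might look something like the following EVF 
schema. This is a sketch only: SchemaBuilder and MinorType are real EVF 
classes, but the JSON minor type is invented for this discussion:

import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.exec.record.metadata.TupleMetadata;

TupleMetadata schema = new SchemaBuilder()
    .add("name", MinorType.VARCHAR)
    .add("balance", MinorType.FLOAT8)
    .add("city", MinorType.VARCHAR)
    // shoppingCart stays a repeated map (nested table), omitted here
    .add("contactInfo", MinorType.JSON)  // hypothetical new "JSON" type
    .buildSchema();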

If the user can tell us that, we can store the "contactInfo" as a JSON 
structure: a Map that contains a List of String and a String. (The trick is 
that there is no way, without a hint, to know which bits of the above JSON are 
SQL types and which are JSON types.)
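
In plain Java, the stored value for this record's "contactInfo" would be no 
more than this (a sketch):

import java.util.List;
import java.util.Map;

Map<String, Object> contactInfo = Map.of(
    "phones", List.of("123-4567", "345-5678 x123"),
    "notes", "Call in early AM");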

A query can then use something like JSON Path [1] to work with this structure:

SELECT name, jsonPath(contactInfo, '$.phones[0]') AS phone FROM ...
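
Behind that function could sit little more than a call into the JsonPath 
library. A sketch, assuming the Jayway implementation from [1]; the name 
jsonPath is just the one used in the query above:

import com.jayway.jsonpath.JsonPath;

// JsonPath.read() walks plain Map/List structures directly, so the
// deserialized form needs no conversion before evaluation.
public static Object jsonPath(Object jsonValue, String path) {
  return JsonPath.read(jsonValue, path);
}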

Technically, we'd probably need some rules for serialization and 
deserialization. For example, a JSON vector might have both a VARBINARY and an 
Object component; if a query asks for an object that is not yet available, it 
is deserialized on the fly. Vectors are write-once: a new vector will hold 
only objects, and these are serialized to VARBINARY before a network exchange.
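
As a sketch of those rules (every name here is invented, and Jackson stands in 
for whichever codec we'd pick):

import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;

public class JsonVector {
  private static final ObjectMapper MAPPER = new ObjectMapper();
  private final byte[][] serialized;  // VARBINARY form, one entry per row
  private final Object[] objects;     // deserialized form (Map/List/etc.)

  public JsonVector(int rowCount) {
    serialized = new byte[rowCount][];
    objects = new Object[rowCount];
  }

  // A query that asks for an object triggers deserialization on the fly.
  public Object getObject(int row) throws IOException {
    if (objects[row] == null && serialized[row] != null) {
      objects[row] = MAPPER.readValue(serialized[row], Object.class);
    }
    return objects[row];
  }

  // Before a network exchange, produce the VARBINARY form.
  public byte[] getBytes(int row) throws IOException {
    if (serialized[row] == null && objects[row] != null) {
      serialized[row] = MAPPER.writeValueAsBytes(objects[row]);
    }
    return serialized[row];
  }
}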

Seems we can come up with a design. How does this fit with the stuff you've 
been looking at?

Thanks,
- Paul

[1] https://github.com/json-path/JsonPath

    On Monday, January 20, 2020, 02:22:09 AM PST, Igor Guzenko 
<ihor.huzenko....@gmail.com> wrote:  
 
 Hello Paul,

It's an exciting idea to get rid of all the complexity added in an attempt
to store complex types in a columnar way. I would like to understand more
about the alternative. Could you please show, using examples, how the new
representation would work for complex columns? I'm interested in how to
represent each row value for structs while preserving a low memory footprint
and avoiding an increase in network workload.

Thanks,
Igor

On Mon, Jan 20, 2020 at 9:04 AM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi All,
>
> We've been having a lively discussion about the pros and cons of Arrow.
> What are people's thoughts about another difficult topic: how we support
> complex types?
>
> Drill is a columnar engine. Columnar works best when data can be split
> into simple columns. The whole purpose of columnar is to group like values:
> all my INT "prodQuantity" values in this vector, all my VARCHAR "prodCode"
> values in that vector, and so on.
>
>
> Complex types, on the other hand, want to be wild & crazy like JSON. An
> arbitrary (or evolving) map here, an array (perhaps of heterogeneous types)
> there. Maybe a deeply nested structure, as in Suricata logs [1] or some
> REST responses. This is exactly what columnar does NOT support.
>
>
> Data frames, in R and Pandas for example [2], are a hybrid: columnar, but
> each value can be any type. The whole purpose of a data frame is for a
> human to muck about with the values, performing data cleansing, etc. But,
> Drill is not R; there is no human in a SQL engine, so it is not clear that
> the data frame's anything-goes model works when it is software, not a human,
> messing with the values.
>
>
> We've talked many times, and at length, about the challenges of mapping
> "anything goes" data structures into the confines of a fixed-type columnar
> structure. We've got code that goes to heroic lengths to make this work:
> MAP, DICT, UNION (which has never fully worked), LIST (which can act as a
> repeated UNION, among other roles, and which also has never fully worked),
> and so on.
>
> Does anyone else get the sense that we're doing something wrong here?
>
> We've struggled for years to map these structures to/from SQL relational
> column vectors. At the same time, Java has gotten pretty good at
> representing arbitrary JSON (or XML) structures. Maybe we should just use
> those Java structures.
>
> We still want columnar for basic queries, such as the traditional Parquet
> data lake use cases. So maybe we need a hybrid scheme. If we know the type,
> and have simple values, we store them in single-type vectors and get the
> goodness that results. But, if the user has non-relational data (complex
> structures, varying types, varying structures), represent those as Java
> objects. The user trades speed (simple columnar types) for flexibility (any
> data structure you want.)
>
>
> As noted in an earlier post, Drill does, in fact, support Java object
> vectors. At present, they are only used for System tables, and only for
> queries that run in the root fragment (no network exchanges, so no
> serialization.)  Maybe we can offer the user: use any Java structure you
> want, just make sure it is serializable (for exchanges). You're responsible
> for providing the functions to understand the structure. Drill can even
> provide common functions for typical tasks (get some value out of a complex
> structure, or convert it to some given type and store it in a vector, say.)
>
> Furthermore, UDFs for complex types would be vastly simpler: no Holders,
> no Readers, no obscure tags: just pass your Java object to a normal Java
> function. Heck, you could even debug your functions!
>
> The user would have to ensure that the query output works for the target
> use case. If the consumer is ODBC, then complex types must be reduced to
> simple relational structures (a restriction we already have today.) If the
> consumer is JDBC, then we can ship Java objects (as long as the classes are
> known to the client), and the client can figure out what to do with those
> objects.
>
> The key benefits would be:
>
> * Vector code would become much simpler (which would make conversion to
> Arrow easier, should we go that route).
>
> * Classic Parquet queries would go fast (maybe faster, because we strip
> out complexity.)
>
> * Drill can support arbitrary JSON structures in a simple, clean, standard
> way -- without the current complexity.
>
>
> Thoughts?
>
> Thanks,
> - Paul
>
> [1]
> https://suricata.readthedocs.io/en/suricata-5.0.1/output/eve/eve-json-format.html
>
> [2]
> https://towardsdatascience.com/preventing-the-death-of-the-dataframe-8bca1c0f83c8