Hi Igor,

Thanks for the note. My post was in response to the work I'm doing to convert JSON to EVF. The work seems fractal: I need x hours to support the scalar vectors. Then I need x hours for each of the complex data types. As I dive into the details of UNION and LIST, I need x hours for each of those cases. Then, when I think of a UNION that includes a LIST of DICT types which include UNIONs... it seems the amount of work becomes exponential. All for something that may never actually be used.
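For contrast, here's a throwaway sketch (purely illustrative, nothing from the actual EVF code) of a value of roughly that shape held as plain Java objects rather than vectors:

```java
import java.util.List;
import java.util.Map;

public class NestedValueSketch {
    public static void main(String[] args) {
        // A value whose columnar type would be something like a LIST of
        // DICT entries whose values are themselves a UNION of types:
        // each level needs its own vector, writer, and reader support,
        // but as plain Java objects it is a single expression.
        Object value = List.of(
            Map.of("a", 1L, "b", "text"),            // mixed value types
            Map.of("a", 2.5, "b", List.of(1L, 2L))); // a nested list
        System.out.println(value);
    }
}
```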
On the other hand, we know that folks want to use arbitrary types for JSON, REST, etc. So the idea is to cap the work at the first x hours: the main vector types which everyone uses all the time and which are a core part of SQL.

This is not a well-developed idea, but we can brainstorm as we go along. We might divide types into two broad groups: SQL and "JSON". The SQL group represents anything that fits into the (extended) relational model: tables of tuples (possibly nested). For example:

* Scalars (nullable, non-nullable)
* Nested tables: arrays, including arrays of maps. (A repeated map is now just a subtable, supported via the lateral join feature.)
* Nested tuples: well-defined nested structures, typically called a STRUCT in other tools.

The semantics and mechanisms of these types are well defined; they fit well into the vector format.

On the other hand are the "JSON" types: anything that can be described as JSON. For such types, rather than an endlessly expanding set of complex cases, define a single simple mechanism that uses Java objects. The simplest case is just to use objects that mimic JSON: List, Map, Long, Double, String. Multiple JSON serializers exist for text (JSON) and binary (BSON).

Again, making things up as we go along, the JSON vector would have two forms: serialized (as a VARBINARY vector) and deserialized (as Java objects). The serialized form would be the one sent over the wire on exchanges and to the client. The deserialized form would be produced by readers, and consumed and produced by functions, etc. I suspect we'd need some schema hint to say, "these five fields are SQL types, but this other field is a JSON type."

Here is a simple example. Suppose we have this JSON:

{ name: "Fred",
  balance: 123.45,
  city: "Bedrock",
  shoppingCart: [
    {prodId: 1234, quantity: 1},
    {prodId: 2345, quantity: 10} ],
  contactInfo: {
    phones: [ "123-4567", "345-5678 x123" ],
    notes: "Call in early AM" }}

The first three fields are simple SQL types that appear in all records.
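To make the "deserialized" form concrete, here's a rough sketch, with the caveat that everything in it is made up for illustration: the contactInfo field above held as plain Java objects, with path navigation done as ordinary collection access (which is roughly what a jsonPath()-style function would do internally):

```java
import java.util.List;
import java.util.Map;

public class JsonObjectVectorSketch {
    public static void main(String[] args) {
        // Hypothetical "deserialized" form of the contactInfo field:
        // plain Java objects mirroring JSON (Map, List, Long, Double, String).
        Map<String, Object> contactInfo = Map.of(
            "phones", List.of("123-4567", "345-5678 x123"),
            "notes", "Call in early AM");

        // A path expression such as "$.phones[0]" reduces to ordinary
        // Java navigation through the object graph:
        List<?> phones = (List<?>) contactInfo.get("phones");
        String phone = (String) phones.get(0);
        System.out.println(phone); // prints 123-4567
    }
}
```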
The next field is a nested table of two columns. The last bit is an arbitrary JSON structure that can vary for each record. If the user can tell us that, we can store "contactInfo" as a JSON structure: a Map that contains a List of Strings and a String. (The trick is that there is no way, without a hint, to know which bits of the above JSON are SQL types and which are JSON types.) A query can then use something like JSON Path [1] to work with this structure:

SELECT name, jsonPath(contactInfo, '$.phones[0]') AS phone FROM ...

Technically, we'd probably need some rules for serialization/deserialization. For example, a JSON vector might have both a VARBINARY and an Object component; if a query asks for an object, it is deserialized on the fly if not yet available. Vectors are write-once: a new vector will have only objects, and these are serialized to VARBINARY before a network exchange.

Seems we can come up with a design. How does this fit with the stuff you've been looking at?

Thanks,
- Paul

[1] https://github.com/json-path/JsonPath

On Monday, January 20, 2020, 02:22:09 AM PST, Igor Guzenko <ihor.huzenko....@gmail.com> wrote:

Hello Paul,

It's an exciting idea to get rid of all the complexity added in an attempt to store complex types in a columnar way. I would like to understand more about the alternative. Could you please show, using examples, how the new representation will work for complex columns? I'm interested in how to represent each row value for structs while preserving a low memory footprint and avoiding an increase in network workload.

Thanks,
Igor

On Mon, Jan 20, 2020 at 9:04 AM Paul Rogers <par0...@yahoo.com.invalid> wrote:
> Hi All,
>
> We've been having a lively discussion about the pros and cons of Arrow.
> What are people's thoughts about another difficult topic: how we support
> complex types?
>
> Drill is a columnar engine. Columnar works best when data can be split
> into simple columns.
> The whole purpose of columnar is to group like values: all my INT
> "prodQuantity" values in this vector, all my VARCHAR "prodCode" values in
> that vector, and so on.
>
> Complex types, on the other hand, want to be wild & crazy like JSON. An
> arbitrary (or evolving) map here, an array (perhaps of heterogeneous
> types) there. Maybe a deeply nested structure, as in Suricata logs [1] or
> some REST responses. This is exactly what columnar does NOT support.
>
> Data frames, in R and Pandas for example [2], are a hybrid: columnar, but
> each value can be any type. The whole purpose of a data frame is for a
> human to muck about with the values, performing data cleansing, etc. But
> Drill is not R; there is no human in a SQL engine, so it is not clear the
> data frame ("anything goes") model works when it is software, not a
> human, messing with the values.
>
> We've talked many times, and at length, about the challenges of mapping
> "anything goes" data structures into the confines of a fixed-type
> columnar structure. We've got code that goes to heroic lengths to make
> this work: MAP, DICT, UNION (which has never fully worked), LIST (which
> can be a repeated UNION, among other roles, and which also has never
> fully worked), and so on.
>
> Does anyone else get the sense that we're doing something wrong here?
>
> We've struggled for years to map these structures to/from SQL relational
> column vectors. At the same time, Java has gotten pretty good at
> representing arbitrary JSON (or XML) structures. Maybe we should just use
> those Java structures.
>
> We still want columnar for basic queries, such as the traditional
> Parquet data lake use cases. So maybe we need a hybrid scheme. If we know
> the type, and have simple values, we store them in single-type vectors
> and get the goodness that results. But, if the user has non-relational
> data (complex structures, varying types, varying structures), represent
> those as Java objects.
> The user trades speed (simple columnar types) for flexibility (any data
> structure you want).
>
> As noted in an earlier post, Drill does, in fact, support Java object
> vectors. At present, they are only used for system tables, and only for
> queries that run in the root fragment (no network exchanges, so no
> serialization). Maybe we can offer the user: use any Java structure you
> want, just make sure it is serializable (for exchanges). You're
> responsible for providing the functions to understand the structure.
> Drill can even provide common functions for typical tasks (get some value
> out of a complex structure, convert it to some given type and store it in
> a vector, say).
>
> Furthermore, UDFs for complex types would be vastly simpler: no Holders,
> no Readers, no obscure tags: just pass your Java object to a normal Java
> function. Heck, you could even debug your functions!
>
> The user would have to ensure that the query output works for the target
> use case. If the consumer is ODBC, then complex types must be reduced to
> simple relational structures (a restriction we already have today). If
> the consumer is JDBC, then we can ship Java objects (as long as the
> classes are known to the client), and the client can figure out what to
> do with those objects.
>
> The key benefits would be:
>
> * Vector code would become much simpler (which would make conversion to
> Arrow easier, should we go that route).
>
> * Classic Parquet queries would go fast (maybe faster, because we strip
> out complexity.)
>
> * Drill can support arbitrary JSON structures in a simple, clean,
> standard way -- without the current complexity.
>
> Thoughts?
>
> Thanks,
> - Paul
>
> [1]
> https://suricata.readthedocs.io/en/suricata-5.0.1/output/eve/eve-json-format.html
>
> [2]
> https://towardsdatascience.com/preventing-the-death-of-the-dataframe-8bca1c0f83c8