Hi All, We've been having a lively discussion about the pros and cons of Arrow. What are people's thoughts about another difficult topic: how we support complex types?
Drill is a columnar engine. Columnar works best when data can be split into simple columns. The whole purpose of columnar is to group like values: all my INT "prodQuantity" values in this vector, all my VARCHAR "prodCode" values in that vector, and so on.

Complex types, on the other hand, want to be wild & crazy like JSON. An arbitrary (or evolving) map here, an array (perhaps of heterogeneous types) there. Maybe a deeply nested structure, as in Suricata logs [1] or some REST responses. This is exactly what columnar does NOT support.

Data frames, in R and Pandas for example [2], are a hybrid: columnar, but each value can be any type. The whole purpose of a data frame is for a human to muck about with the values, performing data cleansing, etc. But Drill is not R; there is no human in a SQL engine, so it is not clear the data frame ("anything goes") model works when it is software, not a human, messing with the values.

We've talked many times, and at length, about the challenges of mapping "anything goes" data structures into the confines of a fixed-type columnar structure. We've got code that goes to heroic lengths to make this work: MAP, DICT, UNION (which has never fully worked), LIST (which can be a Repeated UNION, among other roles, and which also has never fully worked), and so on.

Does anyone else get the sense that we're doing something wrong here? We've struggled for years to map these structures to/from SQL relational column vectors. At the same time, Java has gotten pretty good at representing arbitrary JSON (or XML) structures. Maybe we should just use those Java structures.

We still want columnar for basic queries, such as the traditional Parquet data lake use cases. So maybe we need a hybrid scheme. If we know the type, and have simple values, we store them in single-type vectors and get the goodness that results. But, if the user has non-relational data (complex structures, varying types, varying structures), we represent those values as Java objects.
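To make the hybrid idea concrete, here is a minimal, hypothetical sketch in plain Java (nothing here is existing Drill code; all names are invented): simple values live in a primitive array, exactly the layout a typed vector gives us, while complex values are ordinary Java objects (Map/List nesting, mirroring JSON), one per row.

```java
import java.util.Map;

// Hypothetical hybrid batch: one fixed-type column, one "object" column.
public class HybridBatch {
    // Simple column: fixed-type values stored contiguously (the columnar win).
    final int[] prodQuantity;
    // Complex column: each row is an arbitrary Java structure
    // (Maps and Lists nested however the source JSON happened to nest).
    final Object[] details;

    HybridBatch(int[] prodQuantity, Object[] details) {
        this.prodQuantity = prodQuantity;
        this.details = details;
    }

    // Columnar path: a tight loop over primitive values, no boxing.
    long totalQuantity() {
        long sum = 0;
        for (int q : prodQuantity) sum += q;
        return sum;
    }

    // Object path: navigate whatever structure the row happens to have.
    static Object field(Object row, String name) {
        return (row instanceof Map) ? ((Map<?, ?>) row).get(name) : null;
    }
}
```

The point of the sketch is the asymmetry: the typed column keeps the vectorized execution we want for Parquet-style queries, while the object column absorbs all the structural weirdness that MAP/UNION/LIST try, heroically, to flatten today.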
The user trades speed (simple columnar types) for flexibility (any data structure you want).

As noted in an earlier post, Drill does, in fact, support Java object vectors. At present, they are only used for system tables, and only for queries that run in the root fragment (no network exchanges, so no serialization). Maybe we can offer the user: use any Java structure you want, just make sure it is serializable (for exchanges). You're responsible for providing the functions that understand the structure. Drill can even provide common functions for typical tasks (get some value out of a complex structure, convert it to a given type and store it in a vector, say).

Furthermore, UDFs for complex types would be vastly simpler: no Holders, no Readers, no obscure tags; just pass your Java object to a normal Java function. Heck, you could even debug your functions!

The user would have to ensure that the query output works for the target use case. If the consumer is ODBC, then complex types must be reduced to simple relational structures (a restriction we already have today). If the consumer is JDBC, then we can ship Java objects (as long as the classes are known to the client), and the client can figure out what to do with them.

The key benefits would be:

* Vector code would become much simpler (which would make conversion to Arrow easier, should we go that route).
* Classic Parquet queries would run as fast as ever (maybe faster, because we strip out complexity).
* Drill could support arbitrary JSON structures in a simple, clean, standard way -- without the current complexity.

Thoughts?

Thanks,
- Paul

[1] https://suricata.readthedocs.io/en/suricata-5.0.1/output/eve/eve-json-format.html
[2] https://towardsdatascience.com/preventing-the-death-of-the-dataframe-8bca1c0f83c8
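P.S. To make the "normal Java function" UDF point concrete, here is a hypothetical sketch (names invented; this is not existing Drill API): a complex-type function becomes a plain static method over Map/List trees that you can call, unit test, and step through in a debugger -- no Holders, no Readers.

```java
import java.util.Map;

// Hypothetical complex-type "UDFs" as ordinary, debuggable Java methods.
public class ComplexFns {
    // Walk a nested Map structure (a JSON-like tree) by a path of keys.
    @SuppressWarnings("unchecked")
    public static Object getPath(Object node, String... path) {
        for (String key : path) {
            if (node instanceof Map) {
                node = ((Map<String, Object>) node).get(key);
            } else {
                return null; // structure does not match the requested path
            }
        }
        return node;
    }

    // Reduce a leaf to a simple type so it could land in a typed vector.
    public static Integer asInt(Object leaf) {
        if (leaf instanceof Number) return ((Number) leaf).intValue();
        if (leaf instanceof String) {
            try { return Integer.parseInt((String) leaf); }
            catch (NumberFormatException e) { return null; }
        }
        return null;
    }
}
```

Functions like these are also the "reduce to relational" escape hatch for ODBC consumers: pull a leaf out of the object column, convert it, and store it in an ordinary vector.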