Hello Paul, It's an exciting idea to get rid of all the complexity added in an attempt to store complex types in a columnar way. I would like to understand more about the alternative. Could you please show, using examples, how new representation will work for complex columns? I'm interested in how to represent each row value for structs preserving low memory footprint and avoid an increase in networking workload.
Thanks, Igor On Mon, Jan 20, 2020 at 9:04 AM Paul Rogers <par0...@yahoo.com.invalid> wrote: > Hi All, > > We've been having a lively discussion about the pros and cons of Arrow. > What are people's thoughts about another difficult topic: how we support > complex types? > > Drill is a columnar engine. Columnar works best when data can be split > into simple columns. The whole purpose of columnar is to group like values: > all my INT "prodQuantity" values in this vector, all my VARCHAR "prodCode" > values in that vector, and so on. > > > Complex types, on the other hand, want to be wild & crazy like JSON. An > arbitrary (or evolving) map here, an array (perhaps of heterogeneous types) > there. Maybe a deeply nested structure, as in Suricata logs [1] or some > REST responses. This is exactly what columnar does NOT support. > > > Data frames, in R and Pandas for example [2], are a hybrid: columnar, but > each value can be any type. The whole purpose of a data frame is for a > human to muck about with the values, performing data cleansing, etc. But, > Drill is not R; there is no human in a SQL engine, so it is not clear the > data frame (anything goes model) works when it is software, not a human, > messing with the values. > > > We've talked many times, and at length about the challenges of mapping > "anything goes" data structures into the confines of a fixed-type columnar > structure. We've got code that goes to heroic lengths to make this work: > MAP, DICT, UNION (which has never fully worked), LIST (which can be a > Repeated UNION, among other tasks, which also has never fully worked) and > so on. > > Does anyone else get the sense that we're doing something wrong here? > > We've struggled for years to map these structures to/from SQL relational > column vectors. At the same time, Java has gotten pretty good at > representing arbitrary JSON (or XML) structures. Maybe we should just use > those Java structures. > > We still want columnar for basic queries, such as the traditional Parquet > data lake use cases. So maybe we need a hybrid scheme. If we know the type, > and have simple values, we store them in single-type vectors and get the > goodness that results. But, if the user has non-relational data (complex > structures, varying types, varying structures), represent those as Java > objects. The user trades speed (simple columnar types) for flexibility (any > data structure you want.) > > > As noted in an earlier post, Drill does, in fact, support Java object > vectors. At present, they are only used for System tables, and only for > queries that run in the root fragment (no network exchanges, so no > serialization.) Maybe we can offer the user: use any Java structure you > want, just make sure it is serializable (for exchanges). You're responsible > for providing the functions to understand the structure. Drill can even > provide common functions typical tasks (get some value out of a complex > structure, convert it to some given type an store it in a vector, say.) > > Furthermore, UDFs for complex types would be vastly simpler: no Holders, > no Readers no obscure tags: just pass your Java object to a normal Java > function. Heck, you could even debug your functions! > > The user would have to ensure that the query output works for the target > use case. If the consumer is ODBC, then complex types must be reduced to > simple relational structures (a restriction we already have today.) If the > consumer is JDBC, then we can ship Java objects (as long as the classes are > known to the client), and the client can figure out what to do with those > objects. > > The key benefits would be: > > * Vector code would become much simpler (which would make conversion to > Arrow easier, should we go that route). > > * Classic Parquet queries would go fast (maybe faster, because we strip > out complexity.) > > * Drill can support arbitrary JSON structures in a simple, clean, standard > way -- without the current complexity. > > > Thoughts? > > Thanks, > - Paul > > [1] > https://suricata.readthedocs.io/en/suricata-5.0.1/output/eve/eve-json-format.html > > [2] > https://towardsdatascience.com/preventing-the-death-of-the-dataframe-8bca1c0f83c8 > > > > >