Great stuff, Julian. I have some answers.
On Wed, Sep 19, 2012 at 1:33 PM, Julian Hyde <julianh...@gmail.com> wrote: > ... > It would be really useful to establish a list of queries. Minimal, but > demonstrate the key features (e.g. "COUNT(a.b) WITHIN RECORD", the ability > to aggregate over nested collections. > Very good idea. > > - is it possible to change Optiq operators and optimization rules > without a > > recompile? > > Yes. (From Optiq's perspective, anyway, it is very straightforward. Optiq > starts from scratch each time a query is prepared, and takes whatever > metadata sources, operators and rules it is given.) > Cool. > > - what has to happen to make Optiq handle nested data? > > I am still mulling the answer to that question. The list of queries I > mentioned above will help me understand what exactly is "nested data" and > the operations on it. > The Drill/Dremel idea of nested data is essentially different from SQL's. The basic idea in Dremel is that the data being scanned consists of trees. The . operator always qualifies a collection of trees. In a where clause, the subset of trees is returned that meet the where clause has any true evaluation. Aggregates over sub-trees qualified as above does the natural tree kind of thing. This is very different from the SQL intuition that nested collections are really a shorthand for a normalized relational set of records. The difference is important to the users of Dremel since it matches their problem statement very well. > I have been using > http://developers.google.com/bigquery/docs/query-reference as my main > resource for BigQuery/DrQL. Let me know if there is a more authoritative > source. > This is as good as it gets. > For the record (I'll only make this whine once, I promise) the BigQuery > language has some usability flaws, particularly inconsistencies with SQL. Noted. I would like it if you could repeat this whine at regular intervals since we need to record and document these differences. > * The paper has an expression like "COUNT(x > 5)". SQL's COUNT operator > counts not-NULL values, not boolean TRUE values, so in SQL this expression > would not have the desired effect > Yowza. I knew this. Kind of. But the Dremel interpretation seemed so much more natural to me that I didn't catch the diff. * My jaw dropped when I read "Unlike many other SQL-based systems, BigQuery > uses the comma syntax to indicate table unions, not joins". I see how > commas are a convenient abbreviation for UNION, but that's like redefining > "if" in a programming language. > Hmm... don't understand why they did this. Of course, since Dremel explicitly isn't doing real joins, it kind of makes the join syntax available. But this is a bit like redefining "if" as you say. > * "Note that the HAVING clause can only refer to fields defined in your > SELECT clause (if the field has an alias, you must use it; if it doesn't, > use the aggregate field name instead)." This kind of design decision makes > the language difficult to generate for. > Hmm.... sounds like the post-parse injection point just got more important. So did the true SQL subset. > * Apparently string literals can be enclosed in either single or double > quotes. This may be for the convenience of the user, but is at odds with > SQL, which allows only single quotes. (Many SQL dialects use double quotes > to quote identifiers that contain mixed case or spaces.) > My sympathies are with Dremel on this. Many customers will be coming from Python or Perl backgrounds where this alternative quoting syntax is ubiquitous. > * BigQuery allows table names to be prefixed with an optional > "projectname:". > Is this bad? Or just lexically not SQL? > * The syntax for accessing nested collections at depth 2 or more. If "a" > is a table alias, and b is a collection-valued field, then "a.b" is the > collection of sub-records, and "a.b.c" is the union of the collections of > the "c" field of all records in the "a.b" collection. In 'a.b.c', the first > '.' is the conventional operator that accesses a field of a record. But the > second '.' is a strange beast that operates on a set of records. > This is only inconsistent if you view the input as a sequence of records. If it is a forest, then it makes sense and the "." operator has a single meaning. > Most of these I offer as evidence that DrQL is not a superset of SQL. I > suppose we can vive la difference, implement separate parsers/validators > for the two languages, including implementing DrQL "features" we think are > ill-advised. > Well, from my point of view matching Dremel is key (since the mission statement is to replicate Dremel). Your arguments also make matching SQL an interesting additional goal. But I was looking to DrQL to find out how to query nested collections, and > I didn't find much depth. How, for instance, to convert a "dept" relation > with nested "emp" records into a "flat" relation? Or given an > "shipment.order.lineitem" nested relation, sum up lineitem.discount only in > orders that have "order.prepaid=true"? Are nested collections ordered? > These aren't nested collections, they are a forest of trees. And the children will be ordered according to the original scan ordering. SQL's support for nested collections is powerful & consistent. (But not > very concise, or intuitive for the non-expert.) It has operators like > UNNEST that convert a nested collection into a relation (note that > collections and relations are different beasts), and operators such as > CARDINALITY to aggregate them. (My friend John Sichi wrote a great overview > here: http://farrago.sourceforge.net/design/CollectionTypes.html.) > To be clear, I am not advocating adding SQL's collection constructs to > Drill's query language. But I am drawing inspiration from them when > designing how Optiq would represent collections, because a query planner > needs semantics to be very, very precise. I can see how most of DrQL's > constructs would map onto SQL's constructs. > I am very behind on SQL semantics and have never used nested collections. We do have to be precise here. > I'll have a better answer to the "what has to happen to make Optiq handle > nested data" question in a day or two. > My worry here is that we need to let go of the "record" concept a bit and replace it with the "tree" concept.