Great stuff, Julian.

I have some answers.

On Wed, Sep 19, 2012 at 1:33 PM, Julian Hyde <julianh...@gmail.com> wrote:

> ...
> It would be really useful to establish a list of queries. Minimal, but
> demonstrate the key features (e.g. "COUNT(a.b) WITHIN RECORD", the ability
> to aggregate over nested collections.
>

Very good idea.


> > - is it possible to change Optiq operators and optimization rules
> without a
> > recompile?
>
> Yes. (From Optiq's perspective, anyway, it is very straightforward. Optiq
> starts from scratch each time a query is prepared, and takes whatever
> metadata sources, operators and rules it is given.)
>

Cool.


> > - what has to happen to make Optiq handle nested data?
>
> I am still mulling the answer to that question. The list of queries I
> mentioned above will help me understand what exactly is "nested data" and
> the operations on it.
>

The Drill/Dremel idea of nested data is essentially different from SQL's.
 The basic idea in Dremel is that the data being scanned consists of trees.
 The . operator always qualifies a collection of trees.  In a where clause,
the subset of trees is returned that meet the where clause has any true
evaluation.  Aggregates over sub-trees qualified as above does the natural
tree kind of thing.

This is very different from the SQL intuition that nested collections are
really a shorthand for a normalized relational set of records.  The
difference is important to the users of Dremel since it matches their
problem statement very well.


> I have been using
> http://developers.google.com/bigquery/docs/query-reference as my main
> resource for BigQuery/DrQL. Let me know if there is a more authoritative
> source.
>

This is as good as it gets.


> For the record (I'll only make this whine once, I promise) the BigQuery
> language has some usability flaws, particularly inconsistencies with SQL.


Noted.  I would like it if you could repeat this whine at regular intervals
since we need to record and document these differences.


> * The paper has an expression like "COUNT(x > 5)". SQL's COUNT operator
> counts not-NULL values, not boolean TRUE values, so in SQL this expression
> would not have the desired effect
>

Yowza.  I knew this.  Kind of.  But the Dremel interpretation seemed so
much more natural to me that I didn't catch the diff.

* My jaw dropped when I read "Unlike many other SQL-based systems, BigQuery
> uses the comma syntax to indicate table unions, not joins". I see how
> commas are a convenient abbreviation for UNION, but that's like redefining
> "if" in a programming language.
>

Hmm... don't understand why they did this.

Of course, since Dremel explicitly isn't doing real joins, it kind of makes
the join syntax available.  But this is a bit like redefining "if" as you
say.


> * "Note that the HAVING clause can only refer to fields defined in your
> SELECT clause (if the field has an alias, you must use it; if it doesn't,
> use the aggregate field name instead)." This kind of design decision makes
> the language difficult to generate for.
>

Hmm....  sounds like the post-parse injection point just got more
important.  So did the true SQL subset.


> * Apparently string literals can be enclosed in either single or double
> quotes. This may be for the convenience of the user, but is at odds with
> SQL, which allows only single quotes. (Many SQL dialects use double quotes
> to quote identifiers that contain mixed case or spaces.)
>

My sympathies are with Dremel on this.  Many customers will be coming from
Python or Perl backgrounds where this alternative quoting syntax is
ubiquitous.


> * BigQuery allows table names to be prefixed with an optional
> "projectname:".
>

Is this bad?  Or just lexically not SQL?


> * The syntax for accessing nested collections at depth 2 or more. If "a"
> is a table alias, and b is a collection-valued field, then "a.b" is the
> collection of sub-records, and "a.b.c" is the union of the collections of
> the "c" field of all records in the "a.b" collection. In 'a.b.c', the first
> '.' is the conventional operator that accesses a field of a record. But the
> second '.' is a strange beast that operates on a set of records.
>

This is only inconsistent if you view the input as a sequence of records.
 If it is a forest, then it makes sense and the "." operator has a single
meaning.


> Most of these I offer as evidence that DrQL is not a superset of SQL. I
> suppose we can vive la difference, implement separate parsers/validators
> for the two languages, including implementing DrQL "features" we think are
> ill-advised.
>

Well, from my point of view matching Dremel is key (since the mission
statement is to replicate Dremel).  Your arguments also make matching SQL
an interesting additional goal.

But I was looking to DrQL to find out how to query nested collections, and
> I didn't find much depth. How, for instance, to convert a "dept" relation
> with nested "emp" records into a "flat" relation? Or given an
> "shipment.order.lineitem" nested relation, sum up lineitem.discount only in
> orders that have "order.prepaid=true"? Are nested collections ordered?
>

These aren't nested collections, they are a forest of trees.  And the
children will be ordered according to the original scan ordering.

SQL's support for nested collections is powerful & consistent. (But not
> very concise, or intuitive for the non-expert.) It has operators like
> UNNEST that convert a nested collection into a relation (note that
> collections and relations are different beasts), and operators such as
> CARDINALITY to aggregate them. (My friend John Sichi wrote a great overview
> here: http://farrago.sourceforge.net/design/CollectionTypes.html.)


> To be clear, I am not advocating adding SQL's collection constructs to
> Drill's query language. But I am drawing inspiration from them when
> designing how Optiq would represent collections, because a query planner
> needs semantics to be very, very precise. I can see how most of DrQL's
> constructs would map onto SQL's constructs.
>

I am very behind on SQL semantics and have never used nested collections.

We do have to be precise here.


> I'll have a better answer to the "what has to happen to make Optiq handle
> nested data" question in a day or two.
>

My worry here is that we need to let go of the "record" concept a bit and
replace it with the "tree" concept.

Reply via email to