On Thu, 19 Feb 2015 07:21:17 -0800
Roger Binns <rogerb at rogerbinns.com> wrote:

> There is a lot that would have to be done with it:
> 
> - - make the IR stable across releases
> - - add more general instructions beyond only what is needed for SQL
> - - expose an API that takes the IR
> - - possibly redesign it to make static verification possible, like
> Java bytecode did.  

As it happens, I'm working on a project exactly like that, and can
attest that it's plenty of work!  

You might be tempted to think -- I was -- that it would be nice to have
a C library of relational algebra operators on a transactional store.
No SQL, just C functions like ra_project, ra_select, and ra_join.  What
could be nicer than, 

        t3 = ra_join(db, t1, t2)

?   Well, there's the little matter of "join on what?", so you get 

        t3 = ra_join(db, t1, t2, how)

and then there's the question of "what goes in 'how'?".  IOW, how to
specify the join criteria without syntax, using only data?  In fact,
consider a simpler problem, 

        t2 = ra_select(db, t1, "where id > 2")

and suddenly you're back in Parserville whether you wanted to be or
not.  You're inventing a language, a very limited, incomplete version
of SQL.  Limited, that is, until you want to say something like "where
exists (select 1 from t5 where t1.id < t5.id)".  Before you know it,
you've reinvented more than half of SQL, except that the ratio of SQL
speakers to your-language speakers approaches infinity.  

Now, you could avoid adding complexity to your language by moving more
of the relational algebra stuff out of it, 

                t3 = ra_semijoin( db, t1, 
                                ra_join(db, t1, t2, ra_natural) )

or something like that.  To Richard's point, the more you do that, the
*less* there is for the optimizer to do, and the more the application
comes to control (for good and ill) the order of operation.  It's very
difficult to maintain the relational abstraction *and* be efficient,
because the C language defines an order of operation, specifically
"greedy evaluation".  That means inner functions are evaluated before
outer ones, inviting production of temporaries that a query optimizer
would just laugh at.  

None of this is news.  Query optimizers have (for decades) been
"lowering the filter" to minimize the processed data.  DBMSs and vector
languages (R, but also APL, NumPy, Matlab, inter alia) are *languages*
precisely so that they can *effect* high-level instructions without
interpreting them literally.  That's not an option in C.  (It's also
what makes things like BLAS such a bear to write *and* use.)  

I'm happy to discuss this off-list in more detail with anyone
interested.  Bringing it back to the topic, I would only say, "be
careful what you wish for".  You want SQL, or something very like it,
to provide abtraction and execution efficiency, because C can't
do that kind of interpretation.  

--jkl

Reply via email to