On Thu, 19 Feb 2015 07:21:17 -0800 Roger Binns <rogerb at rogerbinns.com> wrote:
> There is a lot that would have to be done with it: > > - - make the IR stable across releases > - - add more general instructions beyond only what is needed for SQL > - - expose an API that takes the IR > - - possibly redesign it to make static verification possible, like > Java bytecode did. As it happens, I'm working on a project exactly like that, and can attest that it's plenty of work! You might be tempted to think -- I was -- that it would be nice to have a C library of relational algebra operators on a transactional store. No SQL, just C functions like ra_project, ra_select, and ra_join. What could be nicer than, t3 = ra_join(db, t1, t2) ? Well, there's the little matter of "join on what?", so you get t3 = ra_join(db, t1, t2, how) and then there's the question of "what goes in 'how'?". IOW, how to specify the join criteria without syntax, using only data? In fact, consider a simpler problem, t2 = ra_select(db, t1, "where id > 2") and suddenly you're back in Parserville whether you wanted to be or not. You're inventing a language, a very limited, incomplete version of SQL. Limited, that is, until you want to say something like "where exists (select 1 from t5 where t1.id < t5.id)". Before you know it, you've reinvented more than half of SQL, except that the ratio of SQL speakers to your-language speakers approaches infinity. Now, you could avoid adding complexity to your language by moving more of the relational algebra stuff out of it, t3 = ra_semijoin( db, t1, ra_join(db, t1, t2, ra_natural) ) or something like that. To Richard's point, the more you do that, the *less* there is for the optimizer to do, and the more the application comes to control (for good and ill) the order of operation. It's very difficult to maintain the relational abstraction *and* be efficient, because the C language defines an order of operation, specifically "greedy evaluation". That means inner functions are evaluated before outer ones, inviting production of temporaries that a query optimizer would just laugh at. None of this is news. Query optimizers have (for decades) been "lowering the filter" to minimize the processed data. DBMSs and vector languages (R, but also APL, NumPy, Matlab, inter alia) are *languages* precisely so that they can *effect* high-level instructions without interpreting them literally. That's not an option in C. (It's also what makes things like BLAS such a bear to write *and* use.) I'm happy to discuss this off-list in more detail with anyone interested. Bringing it back to the topic, I would only say, "be careful what you wish for". You want SQL, or something very like it, to provide abtraction and execution efficiency, because C can't do that kind of interpretation. --jkl