On Tue, Jul 02, 2013 at 11:40:34AM +0100, Alex Bowden scratched on the wall:
>
> The SQL standard has always been a moving feast, chasing the field
> implementations, perfectly capable of going back on it's earlier
> mistakes, the main purpose of which, on a good day, is to promote
> standardisation of SQL implementations and try and keep to the
> Relational Theory model where practical considerations allow.
>
> So, if the SQL standard has drifted toward requiring "in the order
> in which they are defined in the table definition" to be meaningful,
> then this is an oversight that would likely be corrected when somebody
> has an in the field SQL database which, correctly, enforces no such
> concept.
I realize that historically the SQL standards have been somewhat liquid,
and that the standards often follow the implementations (although not
so much in the last 20 years). This is not some small misstep, however.
The concept that columns have order, and are defined and referenced by
that order, is baked deep, deep into the SQL language, environment,
and every SQL operation. It isn't just some after-thought in a newer
version of the standard.
As I said before, SQL has very poor and highly ambiguous naming conventions
(nevermind standards) for result set columns. Trying to define
columns by their names in an SQL environment would be extremely
difficult and require a massive overhaul of SQL and every database
out there. While it may be a bit ugly and non-Relational to have a
defined column order, it rarely matters. Unlike row order, there
aren't performance concerns in a row-based database-- the query
optimizer has little to gain by allowing arbitrary column ordering.
This is a reason nearly every database API provides access to
results by column index. In the SQLite API, column index is the
*ONLY* way to retrieve a value (or, for that matter, a column name).
Even if it wasn't part of the standard, reordering the column results
of a "*" would break almost every application written in the last 35
years that uses that type of query.
> People should not be encouraged to become more dependent on the use of
> such temporary misfeatures.
I agree that good database engineers should have a strong
understanding of Relational Model and-- given the choice-- should
tend to default to doing things "the Relational way", but back in
reality, that's not the world we live in. SQL is not the Relational
Model, and the Relational Model is not SQL. Every practical RDBMS out
there uses SQL, so every good database engineer should be equally
versed in SQL, and the differences between SQL and the Relational Model.
Sometimes you can approach a problem in a very Relational way, but still
express it in SQL. Many times you cannot. Fighting SQL and going
against its design concepts just to satisfy some desire to do things
the theoretically pretty way is likely to result in making thing
overly complex and poor performance. If you're working in SQL, doing
it the "SQL way" isn't inherently evil, especially if you know enough
to understand why you're diverging from the Model, and why it makes
sense in SQL.
If you try to make your C++ look and act exactly like Smalltalk, you'll
end up writing something that is even uglier than standard C++. Using
ideas and concepts from Smalltalk to better your C++ programs and OOP
designs is good engineering. Forcing something that isn't meant to
be is not.
The same applies to SQL and the Relational Model.
> In context, the particular focus of your objection to the
> relational approach, seems irrelevant.
Yes and no. I have great respect for the Relational Model, and think
every database programmer worth anything should have a deep
understanding of it... Not the Relational Model through SQL, but the
Model itself, unencumbered by everything SQL brings to the table.
They should also know SQL, and where the environments differ. If anyone
cares, I can provide a list of good books (most by C.J. Date).
But in the bigger context of this discussion, I think SQLite might
consider a row_number() function, or (my personal preference) some
type of virtual column, such as "_row_number_". I suggest that term,
since that's what Oracle, MS SQL Server, and PostgreSQL use. I
prefer the idea of a virtual column because, unlike a function, it is
difficult to misuse a virtual column in some other context. If it
can only be expressed in a result set selector (the SELECT clause),
it will only work there.
It is easy to say such a feature it isn't Relational. But guess what...
result sets aren't Relational, even in a pure Relational Model environment.
In fact, that's why they're called "result sets" rather than
"result relations" or (in SQL) "result tables." Yes, the name "set"
is not the best choice, but the difference in terms is no casual accident.
Results specifically have a different name, because the object that is
returned by a query result is not a relation or table... it has
different properties, including a defined column order and
possibly defined row order.
If you want to claim a result set should be strongly Relational,
that's fine, but then you need to get rid of both ORDER BY and LIMIT.
While getting rid of ORDER BY and LIMIT is all nice and pure and full
of theory, I think the people that actually use SQL to build real
applications and get stuff done might have a slightly different set
of thoughts on why those features should stay.
Sorry, but "it's not Relational" doesn't cut it for me, even as
someone that really likes the Model and finds SQL pretty darn ugly.
Results aren't Relational in many, many ways; this is not the first.
Other points:
-- There are other ways of doing this in SQL.
Umm, exactly? This isn't adding anything new. There are ways you
can bend over backwards to do this in SQL already, and people will
use them. Might as well provide users with a very straight-forward
way of dealing with things, so that the query optimizer can understand
what the heck you're *really* trying to do and deal with it appropriately
(which is to say, mostly ignore it).
If you want to take the high road on pure theory over performance,
that's fine, have fun with your pure theory and math. Meanwhile,
back in the Real World...
-- You can just do it in code.
Well sure. I can "just do" ORDER BY in code as well, but that's not
really the point, is it? SQL is an API-independent query expression
language for a reason. Most of the database logic and information
management is supposed to happen in the SQL, which is a pretty rich
language for such things, allowing complex sub-queries, views, and
all kinds of crazy things. If an application needs to manipulate
database data as part of the data management pipeline, it should
likely be able to do it in SQL-- not use some quirk of the API to
glue the desired information into the result set afterwords.
Otherwise ORDER BY and LIMIT are out again, as is selecting specific
columns, or even WHERE limits... we can go on and on, saying "just do
it in code" until you've devolved into a very simple and basic
key-value store. Then you have what's known as a "NoSQL" database.
Finally, I'm going to make the argument that having a ranking
function is actually more Relational, not less. Here's why:
Here is a query. I would say it is a very reasonable query for a
database to perform-- and I specifically mean "database", as in SQL,
not a full application pipeline. Interactive, at the prompt, type
stuff.
"Give me a list of the top 10 sales regions for
last quarter, ranked by total number of orders."
Seems pretty reasonable. JOIN together all the data we need, filter
by dates, group by sales regions, total them all up, and sort by the
number of orders. Very traditional type of query, very easy to
express in SQL.
Of course the result is very non-Relational, since the inherent
information I asked for is dependent on row order, but I'll get back
to that.
So anyways, I don't actually care about the actual number of orders,
which is mostly likely what my SQL query returns, I just want the
ranking-- who is first, second, and third. I can get that from an
ORDER BY query, but the data that makes up the result set doesn't
actually contain the information I want. The information I want is
encoded into the row order of the result set, not the data values of
the result. In other words, the row order is extremely relevant, and
part of the desired result itself. That's about as non-Relational
as you can get.
Now consider a ranking function that lets me label the result of the
ORDER BY explicitly.
Now we have an actual column value that represents the data I want.
This is important, because once that column exists, I can once again
treat the rows as a proper set. That is, once the ranking is
explicitly encoded as a column-- a data value within that unique
row, rather than inferred from the structure of the rows-- then the row
order of the result is not important. The data set represents the
same answer to the same query, even if the row order is scrambled, or
undefined. That is a very, very Relational like thing.
"But, but, but..." you scream. Yes, I know. I had to sort the rows
into an explicit order before I could extract the ranking data, so
after the first pass, the ranking data will always be in order and
that doesn't really add much. Yes and no.
There are plenty of cases when an SQL result set gets turned back into
a relation, dropping the defined row order. Views and sub-queries are
the most obvious case.
Views are a touchy subject, since some people will say that views can
be ordered. Personally, I don't buy that. If a view is supposed to
look like a table, smell like a table, and act like a table, then the
view cannot have a pre-defined row order. It breaks way too many
Relational ideas. Ordered views are ugly from a Relational
standpoint, not just an SQL one. So if you want a view that provides
any type of ordering, you need to be able to provide a rank column as
part of the view result.
Subqueries are usually a bit easier to understand, since it is
obvious to see how a very complex query with many subqueries is not
going to preserve ORDER BY output through a tree of subqueries. And,
in the case of a ranking, it makes sense to order the result of a
subquery, attach the query information, and then devolve the result
set back into an arbitrary relation/table and let the rows scramble
themselves for the next stage of the root query. More to the point,
if I have a ranking functionality available, I'm happy to let the
query engine do this, since I've explicitly defined the information I
want to carry through the larger query.
Additionally, it should be pointed out that most other databases
provide a ranking function as part of their windowing function set,
meaning a ranking could be applied explicitly over some data set;
i.e. "RANK( total( orders ) )", rather than having ORDER BY applied
over some data set and then pulling the rank information back out,
i.e. "SELECT total( orders ), _row_number_ [...] ORDER BY 1". That's
clearly perfectly Relational, since row order never comes into it.
It is also likely to be higher performance, since the query engine
understands what you want and may not have to do a full sort in order
to provide it. In such a case the rows can always be treated as a
proper set (i.e. Relationally). Of course, such a thing would
require a large and complex update to the SQL syntax, so having a
simple rank function that is dependent on ORDER BY to generate
meaningful results makes some sense in the world of theory vs
practicality.
So, yes... overall I think some type of ranking functionality-- be it
an explicit function or a virtual column or something else-- is a
good idea. I think it actually improves the Relational aspects of
these queries, since it allows the SQL programmer to explicitly
define orders and ranking as data values, rather than having them
implied in row order. This makes the result set much more
Relational, since the defined row order can be dropped without losing
query information.
And, yes, it will be abused by people that don't really understand
how to use it, or how it fits into the overall SQL language. You
know, like every other aspect of the SQL language-- or any
programming language-- is misused by clueless people. I don't care.
It lets me sell more books.
> Sort order isn't necessarily deterministic even if we know the column
> order. So the possibility that we may not know it, makes life no worse.
True, you can sort by random(). The sort *process* must be
deterministic, however, or there isn't much point in having a sort.
-j
--
Jay A. Kreibich < J A Y @ K R E I B I.C H >
"Intelligence is like underwear: it is important that you have it,
but showing it to the wrong people has the tendency to make them
feel uncomfortable." -- Angela Johnson
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users