On 2018/06/30 3:12 PM, Luuk wrote:
Ok ,my highway to hell start here (regargind the use of SQL)
In SQLite3 you are allowed to do this:
SELECT a,b,c
FROM t1
GROUP BY a
The values of 'b' and 'c' will be taken from a 'random' row...
But if we rewrite this in SQL, i am getting something like this:
SELECT
a,
(SELECT MIN(b) FROM T1 WHERE a=t.a) AS B,
(SELECT MIN(c) FROM T1 WHERE a=t.a) AS C
FROM t1 t
GROUP BY a
QUESTION: How does one get the proper relationship between 'B' and 'C'?,
This question comes up from time to time, and is sometimes misunderstood
by SQL users everywhere (or from every Engine I should say).
There is no easy solution and it isn't really an aggregate Set-Algebra
problem, so mostly not handled specifically by Engines for aggregate
queries.
To just state the problem more fully:
Consider this Table "Fruit" in a Grocer's shop:
SELECT * FROM Fruit;
-- | | | DaysOn-
-- Name | BasketNo | Qty | Shelf
-- ------- | ------------ | --- | ---------
-- Apples | 1 | 40 | 12
-- Apples | 2 | 40 | 14
-- Apples | 3 | 23 | 16
-- Oranges | 4 | 40 | 12
-- Oranges | 5 | 40 | 14
-- Oranges | 6 | 11 | 16
-- Apples | 7 | 7 | 22
An unopened basket contains 40 fruit. We try to use the oldest first, etc.
Now suppose we want to know what is the oldest unopened basket for every
kind of fruit. The simplest query in *SQLite* would be:
SELECT Name, MAX(DaysOnShelf) AS MaxAge, BasketNo
FROM Fruit
WHERE Qty = 40
GROUP BY Name
;
-- Name | MaxAge | BasketNo
-- ------- | ------ | ------------
-- Apples | 14 | 2
-- Oranges | 14 | 5
Note that this ONLY works due to a peculiarity in SQLite, and it is
forewarned in the SQLite documents that the non-aggregate field could
return ANY random value that qualifies to be in a selected row (i.e. any
row that is chosen for aggregation by the WHERE clause filter).
To say exactly what that means: In the query, for the Apples aggregate,
two rows "pass" the filter where Qty = 40, namely those with baskets 1
and 2. MAX(DaysOnShelf) for those two rows correctly returns 14, but
there is no guarantee that the returned BasketNo will be from THAT
specific row, it could have returned (still mathematically correct)
Basket 1 in stead of Basket 2. And, in a next release, it might even do
so. You cannot assume what it would be.
In fact, if we change the query to enclose BasketNo too in an aggregate
function (MIN() in this case):
SELECT Name, MAX(DaysOnShelf) AS MaxAge, MIN(BasketNo) AS BasketNo
FROM Fruit
WHERE Qty = 40
GROUP BY Name
;
-- Name | MaxAge | BasketNo
-- ------- | ------ | ------------
-- Apples | 14 | 1
-- Oranges | 14 | 4
Now the 14 and 1 next to Apples are definitely NOT from the same row -
which is the entire point of aggregate functions, we want to know the
truth over the whole set for each function, it is not intended to pick
out values.
Thing is, most developers are precisely interested in the specific row
containing the qualifying aggregate.
There are two ways to solve this (using standard SQL in any SQL engine -
there might be more ways in specific Engines, such as SQLite).
1 - First find the target aggregate, then look that up against the
original (non aggregate) set of records.
2 - Construct a custom taxonomy and deconstruct after.
1 is by far the most used and the easiest to write. It's downside is
that it requires a double look-up loop, which in most Engines is fairly
efficient anyway.
An example of this:
SELECT X.Name, X.MaxAge, F.BasketNo
FROM (
SELECT Name, MAX(DaysOnShelf) AS MaxAge, MAX(Qty) AS Qty
FROM Fruit
WHERE Qty = 40
GROUP BY Name
) AS X
JOIN Fruit AS F ON F.Name = X.Name AND F.DaysOnShelf = X.MaxAge AND
F.Qty = X.Qty
;
-- Name | MaxAge | BasketNo
-- ------- | ------ | ------------
-- Apples | 14 | 2
-- Oranges | 14 | 5
-- The above example shows ALL rows that match (as is the nature of a
JOIN). To only show a single row, the following can work:
SELECT X.Name, X.MaxAge, (
SELECT F.BasketNo FROM Fruit AS F WHERE F.Name = X.Name AND
F.DaysOnShelf = X.MaxAge AND F.Qty = X.Qty LIMIT 1
) AS BasketNo
FROM (
SELECT Name, MAX(DaysOnShelf) AS MaxAge, MAX(Qty) AS Qty
FROM Fruit
WHERE Qty = 40
GROUP BY Name
) AS X
;
-- Name | MaxAge | BasketNo
-- ------- | ------ | ------------
-- Apples | 14 | 2
-- Oranges | 14 | 5
-- And lastly, an example of achieving the first query with a CTE:
WITH X(Name, MaxAge, Qty) AS (
SELECT Name, MAX(DaysOnShelf), MAX(Qty)
FROM Fruit
WHERE Qty = 40
GROUP BY Name
)
SELECT X.Name, X.MaxAge, F.BasketNo
FROM X
JOIN Fruit AS F ON F.Name = X.Name AND F.DaysOnShelf = X.MaxAge AND
F.Qty = X.Qty
;
-- Name | MaxAge | BasketNo
-- ------- | ------ | ------------
-- Apples | 14 | 2
-- Oranges | 14 | 5
The second way to do it is the custom taxonomy. This simply means we
fuse together some fields so they are treated as if they are 1 unit,
with the most significant field first. If the Engine sees them as 1
value, they must always be regarded together, and the aggregate function
typically acts upon the most significant part of the value[1]. After we
get the aggregate returned, we can take them apart again and have the
details we want.
This is rather easy with TEXT values, so I'm going to use the same
example as above to show how it can also be done with Numeric values.
The only real advantage of this method is that it requires only a single
lookup loop, though it does come with some string-handling overhead - as
always, it's best to test the speed with the target DB.
This first example shows only half the solution for clarity (I am not
taking it apart yet) so that you can see what the fused-together values
look like[2]. We simply make the Age value into a known-length number by
adding it (I chose 1000 since fruit older than 1000 days is, well,
unlikely, but you can use much larger values) and then find the MAX()
for it:
SELECT X.Name, X.MaxAge_Basket
FROM (
SELECT Name, MAX((1000+DaysOnShelf)||':'||BasketNo) AS MaxAge_Basket
FROM Fruit
WHERE Qty = 40
GROUP BY Name
) AS X
;
-- | MaxAge_B-
-- Name | asket
-- ------- | ---------
-- Apples | 1014:2
-- Oranges | 1014:5
[1] "Fusing-together"typically results in a string, though you can do a
value-based taxonomy. An example of it is referring to Months by their
(Year*100+Month) value, so that September 2017 is 201709 and 202012 is
December 2020. Aggregate functions on strings consider the most
significant part always to the left. i.e. "012" > "0099999999999" and if
you want to compare numbers when they are in strings, you have to make
sure the left leading length is similar: "00012" < "00014", etc.
[2] The colon is not technically needed, it's just added here for clarity.
And finally here taking that fused-together bit apart (and subtracting
the 1000 we added) to show the two actual values. This guarantees that
the shown Basket belongs to that specific MaxAge:
SELECT X.Name, CAST(substr(X.MaxAge_Basket,1,4) AS INT)-1000 AS MaxAge,
substr(X.MaxAge_Basket,6) AS Basket
FROM (
SELECT Name, MAX((1000+DaysOnShelf)||':'||BasketNo) AS MaxAge_Basket
FROM Fruit
WHERE Qty = 40
GROUP BY Name
) AS X
;
-- Name | MaxAge | Basket
-- ------- | ------ | ------
-- Apples | 14 | 2
-- Oranges | 14 | 5
This last example achieves the same using an Integer taxonomy:
SELECT X.Name, X.MaxAge_Basket / 100 AS MaxAge, X.MaxAge_Basket % 100 AS
Basket
FROM (
SELECT Name, 100 * DaysOnShelf + BasketNo AS MaxAge_Basket
FROM Fruit
WHERE Qty = 40
GROUP BY Name
) AS X
;
-- Name | MaxAge | Basket
-- ------- | ------ | ------------
-- Apples | 14 | 2
-- Oranges | 14 | 5
These work in any Engine.
Hope that all makes sense!
Ryan
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users