Re: [sqlite] GROUP BY [Re: unique values from a subset of data based on two fields]

R Smith Sat, 30 Jun 2018 09:48:40 -0700

On 2018/06/30 3:12 PM, Luuk wrote:

Ok ,my highway to hell start here (regargind the use of SQL)


In SQLite3 you are allowed to do this:
SELECT a,b,c
FROM t1
GROUP BY a

The values of 'b' and 'c' will be taken from a 'random' row...

But if we rewrite this in SQL, i am getting something like this:
SELECT
    a,
    (SELECT MIN(b) FROM T1 WHERE a=t.a) AS B,
    (SELECT MIN(c) FROM T1 WHERE a=t.a) AS C
FROM t1 t
GROUP BY a

QUESTION: How does one get the proper relationship between 'B' and 'C'?,

This question comes up from time to time, and is sometimes misunderstoodby SQL users everywhere (or from every Engine I should say).

There is no easy solution and it isn't really an aggregate Set-Algebraproblem, so mostly not handled specifically by Engines for aggregatequeries.


To just state the problem more fully:

Consider this Table "Fruit" in a Grocer's shop:
SELECT * FROM Fruit;
  --         |              |     |  DaysOn-
  -- Name    |   BasketNo   | Qty |   Shelf
  -- ------- | ------------ | --- | ---------
  -- Apples  |       1      |  40 |     12
  -- Apples  |       2      |  40 |     14
  -- Apples  |       3      |  23 |     16
  -- Oranges |       4      |  40 |     12
  -- Oranges |       5      |  40 |     14
  -- Oranges |       6      |  11 |     16
  -- Apples  |       7      |  7  |     22

An unopened basket contains 40 fruit. We try to use the oldest first, etc.

Now suppose we want to know what is the oldest unopened basket for everykind of fruit. The simplest query in *SQLite* would be:


SELECT Name, MAX(DaysOnShelf) AS MaxAge, BasketNo
  FROM Fruit
 WHERE Qty = 40
 GROUP BY Name
;

  -- Name    | MaxAge |   BasketNo
  -- ------- | ------ | ------------
  -- Apples  |   14   |       2
  -- Oranges |   14   |       5

Note that this ONLY works due to a peculiarity in SQLite, and it isforewarned in the SQLite documents that the non-aggregate field couldreturn ANY random value that qualifies to be in a selected row (i.e. anyrow that is chosen for aggregation by the WHERE clause filter).

To say exactly what that means: In the query, for the Apples aggregate,two rows "pass" the filter where Qty = 40, namely those with baskets 1and 2. MAX(DaysOnShelf) for those two rows correctly returns 14, butthere is no guarantee that the returned BasketNo will be from THATspecific row, it could have returned (still mathematically correct)Basket 1 in stead of Basket 2. And, in a next release, it might even doso. You cannot assume what it would be.

In fact, if we change the query to enclose BasketNo too in an aggregatefunction (MIN() in this case):

SELECT Name, MAX(DaysOnShelf) AS MaxAge, MIN(BasketNo) AS BasketNo
  FROM Fruit
 WHERE Qty = 40
 GROUP BY Name
;

  -- Name    | MaxAge |   BasketNo
  -- ------- | ------ | ------------
  -- Apples  |   14   |       1
  -- Oranges |   14   |       4

Now the 14 and 1 next to Apples are definitely NOT from the same row -which is the entire point of aggregate functions, we want to know thetruth over the whole set for each function, it is not intended to pickout values.

Thing is, most developers are precisely interested in the specific rowcontaining the qualifying aggregate.

There are two ways to solve this (using standard SQL in any SQL engine -there might be more ways in specific Engines, such as SQLite).

1 - First find the target aggregate, then look that up against theoriginal (non aggregate) set of records.

2 - Construct a custom taxonomy and deconstruct after.

1 is by far the most used and the easiest to write. It's downside isthat it requires a double look-up loop, which in most Engines is fairlyefficient anyway.

An example of this:

SELECT X.Name, X.MaxAge, F.BasketNo
  FROM (
    SELECT Name, MAX(DaysOnShelf) AS MaxAge, MAX(Qty) AS Qty
      FROM Fruit
     WHERE Qty = 40
     GROUP BY Name
  ) AS X

JOIN Fruit AS F ON F.Name = X.Name AND F.DaysOnShelf = X.MaxAge ANDF.Qty = X.Qty

;
  -- Name    | MaxAge |   BasketNo
  -- ------- | ------ | ------------
  -- Apples  |   14   |       2
  -- Oranges |   14   |       5

-- The above example shows ALL rows that match (as is the nature of aJOIN). To only show a single row, the following can work:


SELECT X.Name, X.MaxAge, (

SELECT F.BasketNo FROM Fruit AS F WHERE F.Name = X.Name ANDF.DaysOnShelf = X.MaxAge AND F.Qty = X.Qty LIMIT 1

       ) AS BasketNo
  FROM (
    SELECT Name, MAX(DaysOnShelf) AS MaxAge, MAX(Qty) AS Qty
      FROM Fruit
     WHERE Qty = 40
     GROUP BY Name
  ) AS X
;
  -- Name    | MaxAge |   BasketNo
  -- ------- | ------ | ------------
  -- Apples  |   14   |       2
  -- Oranges |   14   |       5


-- And lastly, an example of achieving the first query with a CTE:

WITH X(Name, MaxAge, Qty) AS (
    SELECT Name, MAX(DaysOnShelf), MAX(Qty)
      FROM Fruit
     WHERE Qty = 40
     GROUP BY Name
)
SELECT X.Name, X.MaxAge, F.BasketNo
  FROM X

JOIN Fruit AS F ON F.Name = X.Name AND F.DaysOnShelf = X.MaxAge ANDF.Qty = X.Qty

;
  -- Name    | MaxAge |   BasketNo
  -- ------- | ------ | ------------
  -- Apples  |   14   |       2
  -- Oranges |   14   |       5

The second way to do it is the custom taxonomy. This simply means wefuse together some fields so they are treated as if they are 1 unit,with the most significant field first. If the Engine sees them as 1value, they must always be regarded together, and the aggregate functiontypically acts upon the most significant part of the value[1]. After weget the aggregate returned, we can take them apart again and have thedetails we want.This is rather easy with TEXT values, so I'm going to use the sameexample as above to show how it can also be done with Numeric values.

The only real advantage of this method is that it requires only a singlelookup loop, though it does come with some string-handling overhead - asalways, it's best to test the speed with the target DB.

This first example shows only half the solution for clarity (I am nottaking it apart yet) so that you can see what the fused-together valueslook like[2]. We simply make the Age value into a known-length number byadding it (I chose 1000 since fruit older than 1000 days is, well,unlikely, but you can use much larger values) and then find the MAX()for it:


SELECT X.Name, X.MaxAge_Basket
  FROM (
    SELECT Name, MAX((1000+DaysOnShelf)||':'||BasketNo) AS MaxAge_Basket
      FROM Fruit
     WHERE Qty = 40
     GROUP BY Name
  ) AS X
;
  --         | MaxAge_B-
  -- Name    | asket
  -- ------- | ---------
  -- Apples  | 1014:2
  -- Oranges | 1014:5

[1] "Fusing-together"typically results in a string, though you can do avalue-based taxonomy. An example of it is referring to Months by their(Year*100+Month) value, so that September 2017 is 201709 and 202012 isDecember 2020. Aggregate functions on strings consider the mostsignificant part always to the left. i.e. "012" > "0099999999999" and ifyou want to compare numbers when they are in strings, you have to makesure the left leading length is similar: "00012" < "00014", etc.


[2] The colon is not technically needed, it's just added here for clarity.

And finally here taking that fused-together bit apart (and subtractingthe 1000 we added) to show the two actual values. This guarantees thatthe shown Basket belongs to that specific MaxAge:

SELECT X.Name, CAST(substr(X.MaxAge_Basket,1,4) AS INT)-1000 AS MaxAge,substr(X.MaxAge_Basket,6) AS Basket

  FROM (
    SELECT Name, MAX((1000+DaysOnShelf)||':'||BasketNo) AS MaxAge_Basket
      FROM Fruit
     WHERE Qty = 40
     GROUP BY Name
  ) AS X
;
  -- Name    | MaxAge | Basket
  -- ------- | ------ | ------
  -- Apples  |   14   |   2
  -- Oranges |   14   |   5


This last example achieves the same using an Integer taxonomy:

SELECT X.Name, X.MaxAge_Basket / 100 AS MaxAge, X.MaxAge_Basket % 100 ASBasket

  FROM (
    SELECT Name, 100 * DaysOnShelf + BasketNo AS MaxAge_Basket
      FROM Fruit
     WHERE Qty = 40
     GROUP BY Name
  ) AS X
;
  -- Name    | MaxAge |    Basket
  -- ------- | ------ | ------------
  -- Apples  |   14   |       2
  -- Oranges |   14   |       5


These work in any Engine.

Hope that all makes sense!
Ryan







_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] GROUP BY [Re: unique values from a subset of data based on two fields]

Reply via email to