Re: [sqlite] GROUP BY and ICU collation

David Raymond Fri, 08 Feb 2019 06:12:27 -0800

Remember that fancy collations don't just look at 1 character at a time, they 
look at the whole thing, and can do surprising stuff based on that. In this 
case the order of preference for the collation looks like "when it's part of a 
larger word, then treating 'S' and 'Š' the same is more important than 
separating them. But when it's just 1 character then they're different.


So 'S' might be before 'Š', but in words it might go

'Sam'
'Šam'
'Skunk'
'Škunk'
'Sudden'
'Šudden'

rather than a simple character-at-a-time order of

'Sam'
'Skunk'
'Sudden'
'Šam'
'Škunk'
'Šudden'


-----Original Message-----
From: sqlite-users [mailto:[email protected]] On 
Behalf Of Michael Herger
Sent: Friday, February 08, 2019 4:13 AM
To: [email protected]
Subject: Re: [sqlite] GROUP BY and ICU collation

Hi Keith,

thanks for your response (which partly goes beyond my understanding, but 
I'm learning :-)).

> But you are not using the same "expression" for selecting, sorting, and 
> grouping.  That is, you need to specify:
> 
>    SELECT expression, count(distinct id)
>      FROM artists
> GROUP BY expression
> ORDER BY expression;
> 
> where expression is the expression that you want to use
> 
>    SELECT substr(name collate de_DE, 1, 1), count(distinct id)
>      FROM artists
> GROUP BY substr(name collate de_DE, 1, 1)
> ORDER BY substr(name collate de_DE, 1, 1);
> 
> If you do not do so then you cannot expect the "GROUP BY" to be using the 
> same ordering as the "ORDER BY" nor the select to be returning the same value 
> that was used to do the grouping and sorting.

Ok, tried that:

SELECT SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)
FROM contributors
GROUP BY SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)
ORDER BY SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)

Resulted in:

"A"
"C"
"D"
"P"
"R"
"S"
"Š"
"T"
"W"

Whereas this:

SELECT contributors.name, contributors.namesort COLLATE de_DE
FROM contributors
ORDER BY contributors.namesort COLLATE de_DE

...resulted in this:

...
   ["Roddy Woomble & Band", "RODDY WOOMBLE BAND"],
   ["Samuel Yirga", "SAMUEL YIRGA"],
   ["Stephin Merritt", "STEPHIN MERRITT"],
   ["Šuma Čovjek", "ŠUMA ČOVJEK"],
   ["Syriana", "SYRIANA"],
   ["Tom Griesgraber", "TOM GRIESGRABER"],
...

So despite my using the same expression for the GROUP as for the ORDER, 
the grouping seems to ignore the collation, whereas the sorting alone 
would not.

I'm using the ICU extension (otherwise the sorting wouldn't work either, 
right?). Could that extension be responsible for this odd behaviour?

I've put a small sample .db file in my dropbox:
https://www.dropbox.com/s/w4h6n3cyamt6yc0/library.db?dl=0


> 
> (This substitutes the collation NOCASE for the de_DE since I no have a de_DE 
> collation:
> 
> sqlite> select substr(name collate nocase, 1, 1), count(distinct id)
>     ...>   FROM artists
>     ...> group by substr(name collate nocase, 1, 1)
>     ...> order by substr(name collate nocase, 1, 1);
> QUERY PLAN
> |--SCAN TABLE artists (~1048576 rows)
> `--USE TEMP B-TREE FOR GROUP BY
> addr  opcode         p1    p2    p3    p4             p5  comment
> ----  -------------  ----  ----  ----  -------------  --  -------------
> 0     Init           0     57    0                    00  Start at 57
> 1     Noop           1     4     0                    00
> 2     SorterOpen     3     3     0     k(1,NOCASE)    00
> 3     Integer        0     5     0                    00  r[5]=0; clear abort 
> flag
> 4     Null           0     8     8                    00  r[8..8]=NULL
> 5     Gosub          7     52    0                    00
> 6     OpenRead       0     3     0     2              00  root=3 iDb=0; 
> artists
> 7     ColumnsUsed    0     0     0     3              00
> 8     Explain        8     0     0     SCAN TABLE artists (~1048576 rows)  00
> 9     Noop           0     0     0                    00  Begin WHERE-loop0: 
> artists
> 10    Rewind         0     20    0                    00
> 11      Noop           0     0     0                    00  Begin WHERE-core
> 12      Column         0     1     13                   00  r[13]=artists.name
> 13      Function0      6     13    10    substr(3)      03  
> r[10]=func(r[13..15])
> 14      Column         0     1     11                   00  r[11]=artists.name
> 15      Column         0     0     12                   00  r[12]=artists.id
> 16      MakeRecord     10    3     16                   00  
> r[16]=mkrec(r[10..12])
> 17      SorterInsert   3     16    0                    00  key=r[16]
> 18      Noop           0     0     0                    00  End WHERE-core
> 19    Next           0     11    0                    01
> 20    Noop           0     0     0                    00  End WHERE-loop0: 
> artists
> 21    OpenPseudo     4     16    3                    00  3 columns in r[16]
> 22    SorterSort     3     56    0                    00  GROUP BY sort
> 23      SorterData     3     16    4                    00  r[16]=data
> 24      Column         4     0     9                    00  r[9]=
> 25      Compare        8     9     1     k(1,NOCASE)    00  r[8] <-> r[9]
> 26      Jump           27    31    27                   00
> 27      Move           9     8     1                    00  r[8]=r[9]
> 28      Gosub          6     44    0                    00  output one row
> 29      IfPos          5     56    0                    00  if r[5]>0 then 
> r[5]-=0, goto 56; check abort flag
> 30      Gosub          7     52    0                    00  reset accumulator
> 31      Column         4     2     17                   00  r[17]=
> 32      Found          2     36    17    1              00  key=r[17]
> 33      MakeRecord     17    1     18                   00  r[18]=mkrec(r[17])
> 34      IdxInsert      2     18    17    1              10  key=r[18]
> 35      AggStep        0     17    2     count(1)       01  accum=r[2] 
> step(r[17])
> 36      If             4     38    0                    00
> 37      Column         4     1     1                    00  r[1]=
> 38      Integer        1     4     0                    00  r[4]=1; indicate 
> data in accumulator
> 39    SorterNext     3     23    0                    00
> 40    Gosub          6     44    0                    00  output final row
> 41    Goto           0     56    0                    00
> 42    Integer        1     5     0                    00  r[5]=1; set abort 
> flag
> 43    Return         6     0     0                    00
> 44    IfPos          4     46    0                    00  if r[4]>0 then 
> r[4]-=0, goto 46; Groupby result generator entry point
> 45    Return         6     0     0                    00
> 46    AggFinal       2     1     0     count(1)       00  accum=r[2] N=1
> 47    Copy           1     21    0                    00  r[21]=r[1]
> 48    Function0      6     21    19    substr(3)      03  
> r[19]=func(r[21..23])
> 49    Copy           2     20    0                    00  r[20]=r[2]
> 50    ResultRow      19    2     0                    00  output=r[19..20]
> 51    Return         6     0     0                    00  end groupby result 
> generator
> 52    Null           0     1     3                    00  r[1..3]=NULL
> 53    OpenEphemeral  2     0     0     k(1,B)         00  nColumn=0
> 54    Integer        0     4     0                    00  r[4]=0; indicate 
> accumulator empty
> 55    Return         7     0     0                    00
> 56    Halt           0     0     0                    00
> 57    Transaction    0     0     2     0              01  usesStmtJournal=0
> 58    Integer        1     14    0                    00  r[14]=1
> 59    Integer        1     15    0                    00  r[15]=1
> 60    Integer        1     22    0                    00  r[22]=1
> 61    Integer        1     23    0                    00  r[23]=1
> 62    Goto           0     1     0                    00
> sqlite>
> 
>>
>> Michael
>>
>>
>>>
>>> That is
>>>
>>> select name collate nocase, count(distinct id) from x group by name
>> collate nocase order by name collate nocase
>>>
>>> whill produce cased output not the value that was used for the
>> sorting.
>>>
>>>
>>> select lower(name collate nocase), count(distinct id) from x group
>> by name collate nocase order by name collate nocase;
>>>
>>> to transmorgificate name into a "caseless" representation.  So you
>> would need to do something like this:
>>>
>>> select de_DE(substr(name collate de_DE,1,1)), count(distinct id)
>>from artists
>>> group by substr(name collate de_DE,1,1)
>>> order by by substr(name collate de_DE,1,1)
>>>
>>> and the function de_DE would have to transmorgificate its value to
>> the result you want to see.
>>>
>>> ---
>>> The fact that there's a Highway to Hell but only a Stairway to
>> Heaven says a lot about anticipated traffic volume.
>>>
>>>> -----Original Message-----
>>>> From: sqlite-users [mailto:sqlite-users-
>>>> [email protected]] On Behalf Of [email protected]
>>>> Sent: Thursday, 7 February, 2019 05:12
>>>> To: [email protected]
>>>> Subject: [sqlite] GROUP BY and ICU collation
>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I'm trying to create a list with an index list. Eg. I have
>>>> artists:
>>>>>>
>>>>>> Sting
>>>>>> Šuma Čovjek
>>>>>> Suzanne Vega
>>>>>>
>>>>>> That's the sort order I'd get using an ICU collation. "Šuma
>>>> Čovjek"
>>>>>> would be sorted as "Suma..." as expected.
>>>>>>
>>>>>> Now I'd like to create an index bar by providing groups of the
>>>> first
>>>>>> character:
>>>>>>
>>>>>> SELECT SUBSTR(name,1,1), count(distinct id) FROM artists GROUP
>> BY
>>>>>> SUBSTR(name,1,1) ORDER BY name COLLATE de_DE>>
>>>>> Aren't you missing a COLLATE clause after the GROUP BY term?
>>>>>
>>>>>        ... GROUP BY SUBSTR(name,1,1) COLLATE de_DE ...
>>>>
>>>> TBH: I didn't even know about this. I thought the COLLATE at the
>> end
>>>> of
>>>> the statement would do it for all.
>>>>
>>>> Alas, tried again to no avail. No matter whether I add it after
>> the
>>>> GROUP BY or not, the result is the same.
>>>>
>>>> I should probably have added some version information: I'm using
>> the
>>>> Perl DBD::SQLite 1.58 (sqlite 3.22.0). I didn't see any mention of
>>>> related changes in the changelog for SQLite. What would be the
>>>> easiest
>>>> (and most reliable) way to try to reproduce this without Perl? Is
>>>> there
>>>> a HowTo use collations with the CLI sqlite?
>>>>
>>>> --
>>>>
>>>> Michael
> 
> ---
> The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
> lot about anticipated traffic volume.
> 
> 
> 
> 
> _______________________________________________
> sqlite-users mailing list
> [email protected]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> 
_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] GROUP BY and ICU collation

Reply via email to