Re: optimisation? collation "C" sorting for GroupAggregate for all deterministic collations

2020-03-23 Thread Maxim Ivanov


> Perhaps this is what you mean by "deterministic", but isn't it
> possible for some collations to treat multiple byte sequences as equal
> values? And those multiple byte sequences wouldn't necessarily occur
> sequentially in C collation, so it wouldn't be possible to work around
> that by having the grouping node use one collation but the sorting
> node use the C one.
>
> If my memory is incorrect, then this sounds like an intriguing idea.


Yes, as per doc 
(https://www.postgresql.org/docs/12/collation.html#COLLATION-NONDETERMINISTIC) 
some collations can result in symbols(chars? codes? runes?) to be equal, while 
their byte representations is not. This optimisation should check for source 
table collation and do not change sorting collation if columns being sorted use 
non deterministic collation.

Luckily in practice it is probably to be very rare, all builtin collations are 
deterministic.






Re: optimisation? collation "C" sorting for GroupAggregate for all deterministic collations

2020-03-22 Thread Corey Huinker
>
> Perhaps this is what you mean by "deterministic", but isn't it
> possible for some collations to treat multiple byte sequences as equal
> values? And those multiple byte sequences wouldn't necessarily occur
> sequentially in C collation, so it wouldn't be possible to work around
> that by having the grouping node use one collation but the sorting
> node use the C one.
>
> If my memory is incorrect, then this sounds like an intriguing idea.
>
>
I could see the value in a hash aggregate on C-collation that then passes
itself as a partial aggregate up to another step which applies the
collation and then finalizes the aggregation before sorting


Re: optimisation? collation "C" sorting for GroupAggregate for all deterministic collations

2020-03-22 Thread James Coleman
On Sun, Mar 22, 2020 at 5:33 AM Pavel Stehule  wrote:
>
> Hi
>
> ne 22. 3. 2020 v 10:12 odesílatel Maxim Ivanov  napsal:
>>
>> Hi All,
>>
>> It is known, that  collation "C" significantly speeds up string comparisons 
>> and as a result sorting. I was wondering, whether it is possible to use it 
>> regardless of collation set on a column in sorts not visible to users?
>>
>> Example I have in  mind is sorting performed for GroupAggregate. Purpose of 
>> that sort is to bring equal values next to each other, so as long as:
>>1) user didn't request ORDER BY in addition to GROUP BY
>>2) source column has any deterministic collation (as per docs all builtin 
>> collations are deterministic)
>>
>> it seems to be possible to do sorting with any deterministic collation, 
>> regardless of what user specifid for the column being sorted. "C" collation 
>> is deterministic and fastest.
>>
>> In other words, couldn't PostgreSQL convert this:
>>
>> ->  GroupAggregate  (cost=15726557.87..22944558.69 rows=721 width=176) 
>> (actual time=490103.209..771536.389 rows=3600 loops=1)
>>   Group Key: ec_180days.msn, ec_180days.to_date_time
>>   ->  Sort  (cost=15726557.87..15906557.89 rows=7208 width=113) 
>> (actual time=490094.849..524854.662 rows=7200 loops=1)
>> Sort Key: ec_180days.msn, ec_180days.to_date_time
>> Sort Method: external merge  Disk: 7679136kB
>>
>> To this:
>>
>> ->  GroupAggregate  (cost=14988274.87..22206275.69 rows=721 width=155) 
>> (actual time=140497.729..421510.001 rows=3600 loops=1)
>>   Group Key: ec_180days.msn, ec_180days.to_date_time
>>   ->  Sort  (cost=14988274.87..15168274.89 rows=7208 width=92) 
>> (actual time=140489.807..174228.722 rows=7200 loops=1)
>> Sort Key: ec_180days.msn COLLATE "C", ec_180days.to_date_time
>> Sort Method: external merge  Disk: 7679136kB
>>
>>
>> which is 3 times faster in my tests.
>
>
> I had a same idea. It is possible only if default collation is deterministic. 
> Probably it will be less important if abbreviate sort will be enabled, but it 
> is disabled now.
>
> p.s. can be interesting repeat your tests with ICU locale where abbreviate 
> sort is enabled.

Perhaps this is what you mean by "deterministic", but isn't it
possible for some collations to treat multiple byte sequences as equal
values? And those multiple byte sequences wouldn't necessarily occur
sequentially in C collation, so it wouldn't be possible to work around
that by having the grouping node use one collation but the sorting
node use the C one.

If my memory is incorrect, then this sounds like an intriguing idea.

James




Re: optimisation? collation "C" sorting for GroupAggregate for all deterministic collations

2020-03-22 Thread Pavel Stehule
Hi

ne 22. 3. 2020 v 10:12 odesílatel Maxim Ivanov  napsal:

> Hi All,
>
> It is known, that  collation "C" significantly speeds up string
> comparisons and as a result sorting. I was wondering, whether it is
> possible to use it regardless of collation set on a column in sorts not
> visible to users?
>
> Example I have in  mind is sorting performed for GroupAggregate. Purpose
> of that sort is to bring equal values next to each other, so as long as:
>1) user didn't request ORDER BY in addition to GROUP BY
>2) source column has any deterministic collation (as per docs all
> builtin collations are deterministic)
>
> it seems to be possible to do sorting with any deterministic collation,
> regardless of what user specifid for the column being sorted. "C" collation
> is deterministic and fastest.
>
> In other words, couldn't PostgreSQL convert this:
>
> ->  GroupAggregate  (cost=15726557.87..22944558.69 rows=721 width=176)
> (actual time=490103.209..771536.389 rows=3600 loops=1)
>   Group Key: ec_180days.msn, ec_180days.to_date_time
>   ->  Sort  (cost=15726557.87..15906557.89 rows=7208 width=113)
> (actual time=490094.849..524854.662 rows=7200 loops=1)
> Sort Key: ec_180days.msn, ec_180days.to_date_time
> Sort Method: external merge  Disk: 7679136kB
>
> To this:
>
> ->  GroupAggregate  (cost=14988274.87..22206275.69 rows=721 width=155)
> (actual time=140497.729..421510.001 rows=3600 loops=1)
>   Group Key: ec_180days.msn, ec_180days.to_date_time
>   ->  Sort  (cost=14988274.87..15168274.89 rows=7208 width=92)
> (actual time=140489.807..174228.722 rows=7200 loops=1)
> Sort Key: ec_180days.msn COLLATE "C", ec_180days.to_date_time
> Sort Method: external merge  Disk: 7679136kB
>
>
> which is 3 times faster in my tests.
>

I had a same idea. It is possible only if default collation is
deterministic. Probably it will be less important if abbreviate sort will
be enabled, but it is disabled now.

p.s. can be interesting repeat your tests with ICU locale where abbreviate
sort is enabled.

Regards

Pavel


optimisation? collation "C" sorting for GroupAggregate for all deterministic collations

2020-03-22 Thread Maxim Ivanov
Hi All,

It is known, that  collation "C" significantly speeds up string comparisons and 
as a result sorting. I was wondering, whether it is possible to use it 
regardless of collation set on a column in sorts not visible to users?

Example I have in  mind is sorting performed for GroupAggregate. Purpose of 
that sort is to bring equal values next to each other, so as long as:
   1) user didn't request ORDER BY in addition to GROUP BY
   2) source column has any deterministic collation (as per docs all builtin 
collations are deterministic)

it seems to be possible to do sorting with any deterministic collation, 
regardless of what user specifid for the column being sorted. "C" collation is 
deterministic and fastest.

In other words, couldn't PostgreSQL convert this:
->  GroupAggregate  (cost=15726557.87..22944558.69 rows=721 width=176) 
(actual time=490103.209..771536.389 rows=3600 loops=1)
  Group Key: ec_180days.msn, ec_180days.to_date_time
  ->  Sort  (cost=15726557.87..15906557.89 rows=7208 width=113) (actual 
time=490094.849..524854.662 rows=7200 loops=1)
Sort Key: ec_180days.msn, ec_180days.to_date_time
Sort Method: external merge  Disk: 7679136kB

To this:
->  GroupAggregate  (cost=14988274.87..22206275.69 rows=721 width=155) 
(actual time=140497.729..421510.001 rows=3600 loops=1)

  Group Key: ec_180days.msn, ec_180days.to_date_time

  ->  Sort  (cost=14988274.87..15168274.89 rows=7208 width=92) (actual 
time=140489.807..174228.722 rows=7200 loops=1)

Sort Key: ec_180days.msn COLLATE "C", ec_180days.to_date_time

Sort Method: external merge  Disk: 7679136kB

which is 3 times faster in my tests.