Re: [Maria-developers] [Maria-discuss] Collations, trailing spaces and unique indexes

2016-03-19 Thread Binarus
Alexander,

I'd like to continue our conversation from Maria-discuss.

>> - I have found my_strnxfrm_unicode in the same file and will need more time 
>> to make my opinion of how difficult it will be
>> (I don't know what a weight is, so I currently try to understand what the 
>> function does at all).
> 
> This function is used to create sort keys for non-indexed ORDER BY,
> for these cases:
> 
> - ORDER BY on an expression
> - ORDER BY on a column that does not have an index
> 
> 
> The idea is exactly the same with the C function strxfrm.
> See "man strxfrm".
> 
> The code implements non-indexed ORDER BY in filesort.cc
> as follows:
> 
> 1. It calls *_strnxfrm_* functions for all records and converts
> CHAR/VARCHAR/TEXT values into their fixed length binary sortable keys.
> 
> 2 . Then executes binary sorting on these keys.
> 
> By the way, fixing this function might be tricky.
> 
> 
> Currently my_strnxfrm_unicode() pads the tail using weights of the SPACE
> character.
> The NO PAD version will need to pad the tail using a weight which is
> less than the weight of the smallest possible character.
> 
> This should be easy for UCA bases collations (e.g.
> utf8_unicode_nopad_ci), because the smallest possible
> character in UCA collations is "U+0009 HORIZONTAL TABULATION",
> and its weight is 0x0201. So we can just pad the sort key
> using a smaller value 0x0200.
> 
> 
> But I'm not sure yet what to do with 8-bit collations,
> which usually use 0x00 as weight for the smallest character.
> So we don't have a smaller value.
> There are two options here:
> 
> 1. Pad with 0x00. But this will mean that 'aaa' and just 'aaa'
> will  have unpredictable order when doing ORDER BY without an index
> (where  is the smallest possible character in the collation).
> 
> As the smallest character in non-UCA collations is usually
> "U+ NULL", this will mean that 'aaa\0' and just 'aaa'
> will have unpredictable order.
> 
> 2. Reserve extra bytes at the end of the key, to store the true length, so
> - 'aaa\0' will have the key '4141410004'
> - 'aaa'   will have the key '4141410003', and therefore will always
>be sorted before 'aaa\0'.
> 
> I'm inclined towards #2, to have consistent ORDER BY behavior
> with and without indexes.

Wow, thanks for explaining. I now have learned that I probably can't assist you 
in solving that problem (I haven't been in unicode, weights, sorting and 
character sets very deep until now).

But if it helps, you eventually could break the problem into two steps:

1) Fix the problem with unique indexes

2) Fix ordering for any possible situation

The problems we have encountered in real life so far are all related to 1), 
i.e. we would be happy if unique indexes would consider 'a' and 'a ' being 
different values. In contrast, ordering is not that important for us. I don't 
know what others are thinking about that, though.

>>
>> - My main problem: I did not find my_strnncollsp_utf8_general_ci anywhere 
>> (nor in the same neither in any other file). Where is it?
>
> The function name is just "my_strnncollsp_utf8".

OK, thank you very much. I must have misunderstood something.

>> Furthermore, studying the code has led to some questions; for example, there 
>> already seems to be a #define which controls the padding-when-comparing 
>> mode, but only for the _cs collations?
> 
> Can you please clarify which lines do you mean?

I am referring to the file ctype-utf8.c, line 5698. There is:

#ifndef VARCHAR_WITH_DIFF_ENDSPACE_ARE_DIFFERENT_FOR_UNIQUE
  diff_if_only_endspace_difference= 0;
#endif

I haven't been able yet to thoroughly analyze where and how this is used, but 
at least somebody obviously already had some thoughts whether or not trailing 
spaces should make a difference in UNIQUEs :-)

Regards,

Binarus

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


Re: [Maria-developers] [Maria-discuss] Collations, trailing spaces and unique indexes

2016-03-19 Thread Alexander Barkov
Hi Binarus,


On 03/16/2016 07:02 PM, Binarus wrote:
> Alexander,
> 
> I'd like to continue our conversation from Maria-discuss.

Sure.

> 
>>> - I have found my_strnxfrm_unicode in the same file and will need more time 
>>> to make my opinion of how difficult it will be
>>> (I don't know what a weight is, so I currently try to understand what the 
>>> function does at all).
>>
>> This function is used to create sort keys for non-indexed ORDER BY,
>> for these cases:
>>
>> - ORDER BY on an expression
>> - ORDER BY on a column that does not have an index
>>
>>
>> The idea is exactly the same with the C function strxfrm.
>> See "man strxfrm".
>>
>> The code implements non-indexed ORDER BY in filesort.cc
>> as follows:
>>
>> 1. It calls *_strnxfrm_* functions for all records and converts
>> CHAR/VARCHAR/TEXT values into their fixed length binary sortable keys.
>>
>> 2 . Then executes binary sorting on these keys.
>>
>> By the way, fixing this function might be tricky.
>>
>>
>> Currently my_strnxfrm_unicode() pads the tail using weights of the SPACE
>> character.
>> The NO PAD version will need to pad the tail using a weight which is
>> less than the weight of the smallest possible character.
>>
>> This should be easy for UCA bases collations (e.g.
>> utf8_unicode_nopad_ci), because the smallest possible
>> character in UCA collations is "U+0009 HORIZONTAL TABULATION",
>> and its weight is 0x0201. So we can just pad the sort key
>> using a smaller value 0x0200.
>>
>>
>> But I'm not sure yet what to do with 8-bit collations,
>> which usually use 0x00 as weight for the smallest character.
>> So we don't have a smaller value.
>> There are two options here:
>>
>> 1. Pad with 0x00. But this will mean that 'aaa' and just 'aaa'
>> will  have unpredictable order when doing ORDER BY without an index
>> (where  is the smallest possible character in the collation).
>>
>> As the smallest character in non-UCA collations is usually
>> "U+ NULL", this will mean that 'aaa\0' and just 'aaa'
>> will have unpredictable order.
>>
>> 2. Reserve extra bytes at the end of the key, to store the true length, so
>> - 'aaa\0' will have the key '4141410004'
>> - 'aaa'   will have the key '4141410003', and therefore will always
>>be sorted before 'aaa\0'.
>>
>> I'm inclined towards #2, to have consistent ORDER BY behavior
>> with and without indexes.
> 
> Wow, thanks for explaining. I now have learned that I probably can't assist 
> you in solving that problem (I haven't been in unicode, weights, sorting and 
> character sets very deep until now).
> 
> But if it helps, you eventually could break the problem into two steps:
> 
> 1) Fix the problem with unique indexes
> 
> 2) Fix ordering for any possible situation
> 
> The problems we have encountered in real life so far are all related to 1), 
> i.e. we would be happy if unique indexes would consider 'a' and 'a ' being 
> different values. In contrast, ordering is not that important for us. I don't 
> know what others are thinking about that, though.
> 

Thanks for the explanation. That makes understand your needs better.

When we add new collations into the main tree, I prefer to have
both steps already addressed, to have consistent collation behavior.
But you're right that in a development tree it can be done in two
separate steps, for simplicity.


>>>
>>> - My main problem: I did not find my_strnncollsp_utf8_general_ci anywhere 
>>> (nor in the same neither in any other file). Where is it?
>>
>> The function name is just "my_strnncollsp_utf8".
> 
> OK, thank you very much. I must have misunderstood something.
> 
>>> Furthermore, studying the code has led to some questions; for example, 
>>> there already seems to be a #define which controls the 
>>> padding-when-comparing mode, but only for the _cs collations?
>>
>> Can you please clarify which lines do you mean?
> 
> I am referring to the file ctype-utf8.c, line 5698. There is:
> 
> #ifndef VARCHAR_WITH_DIFF_ENDSPACE_ARE_DIFFERENT_FOR_UNIQUE
>   diff_if_only_endspace_difference= 0;
> #endif
> 
> I haven't been able yet to thoroughly analyze where and how this is used, but 
> at least somebody obviously already had some thoughts whether or not trailing 
> spaces should make a difference in UNIQUEs :-)


Thanks. This is a kind of dead code. You correctly guessed that it was
supposed to switch between NO PAD and PAD SPACE, but the exact behavior
was supposed to be chosen and compile time. Then we understood that
we want NO PAD and PAD SPACE collations to co-exist, so this approach
with conditional compilation appeared not to be a way to go.

So this code has never been used in the NO PAD mode actually.
We should refactor it when working on NO PAD collations,
by removing #ifndef and #else and moving the relevant pieces into
different functions, for PAD SPACE and NO PAD handlers.


> 
> Regards,
> 
> Binarus
> 
> ___
> Mailing list: https://launchpad.net/~maria-developers
> Post to 

Re: [Maria-developers] [Maria-discuss] Collations, trailing spaces and unique indexes

2016-03-19 Thread Alexander Barkov
Hi Binarus,

On 03/16/2016 06:13 PM, Binarus wrote:
>> I just wanted to split the task into these steps:
>>
>> 1. Default and _bin collations.
>>This is what MDEV-9711 is about.
>>
>> 2. Unicode Collation Algorithm (UCA) based collations
>>(e.g. utf8_unicode_nopad_ci)
>>
>>
>> 3. The rest (not covered by the previous steps)
>>This will include latin1_general_nopad_cs, latin2_czech_nopad_cs
>>
>>
>> This is why asked in the previous letter if utf8_unicode_nopad_ci would
>> work for you.
>> Also, which MariaDB version do you need this for? Will 10.1 work?
>>
> 
> Thank you very much for this explanation.
> 
> Regarding the version: Our O/S usually is Debian which generally uses 
> software which is quite old. This means that I will have to compile MariaDB 
> myself in every case.
> The current project where this new collations are needed will probably take 
> about three months before being finished. It would be very nice if we could 
> have the new collations until then.
> In the meantime, we'll continue development with MySQL and switch to MariaDB 
> as soon as the new collations are available.

Excellent. The fact that you compile from sources makes the things
easier. I guess we can add new collations officially into 10.2.

Then after 3 month if you'll feel that 10.2 is not mature enough
for you, we can make a custom non-official 10.1 tree with the new
collations back-ported (it should be trivial to do).

> 
> ci collations would be sufficient as a starting point. Nevertheless, I think 
> there must be a nopad collation for every collation which already exists.
> Personally, since we are using utf8 (and no other encoding) in every part of 
> every project, I would prefer to have *all utf8* nopad collations before
>having any other nopad collation. But that's just my personal opinion.
>We will be grateful for any progress regardless of the order of things
being implemented.

Thanks. Then we're flexible to implement #2 first, and wait for #1
to be done under terms of Google Summer Of Code 2016.


> 
> Binarus
> 
> ___
> Mailing list: https://launchpad.net/~maria-discuss
> Post to : maria-disc...@lists.launchpad.net
> Unsubscribe : https://launchpad.net/~maria-discuss
> More help   : https://help.launchpad.net/ListHelp
> 

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp