Jim Starkey wrote:
> Jay Pipes wrote:
>> Jim Starkey wrote:
>>  
>>> Brian Aker wrote:
>>>    
>>>> Hi!
>>>>
>>>> On Sep 3, 2008, at 7:27 AM, Jim Starkey wrote:
>>>>
>>>>      
>>>>> For timestamps, standard syntax.  For numbers, "<name> number [ scale
>>>>> <decimalScaleFactor> ]".  For strings, "<name> text".  All text is
>>>>> internally UTF-8, though that isn't externally visible.  Blobs/clobs
>>>>> are a threshold for separate storage rather than distinct types.  The
>>>>> legacy SQL types will be recognized as quaint, but acceptable.
>>>>>         
>>>> We are on the same page then. I haven't pulled the TINY/SMALL code
>>>> yet, but it will be gone soon. All of the int(12) is gone. Medium has
>>>> been gone for a while (I mean... 24bit int?).
>>>>
>>>> By external on UTF8... what do you mean? How are you handling your
>>>> collations on UTF8? At this point I have yet to find a good free
>>>> library to do it. Very soon Drizzle will be end to end UTF8. Clients
>>>> will only accept it, and I am just requiring clients to handle the
>>>> conversion themselves. This means native conversion should work for
>>>> almost everything.
>>>>       
>>> I'm planning to use ICU (IBM's International Components for Unicode) for
>>> the actual collations.  It's licensed under MIT's X11 license, and is
>>> GPL compatible.
>>>     
>>
>> We investigated ICU a month or so ago.  Looks good, but the number one
>> reason we decided not to go forward was that it is natively
>> UTF16, and we weren't willing to go down that route at the moment.
>>
>> The current plan is to remove charset support entirely within drizzle
>> (krow is already working on that) and use the current collation system
>> as-is, while fixing any bugs we find.
>>
>>   
> Thanks for the heads up on ICU -- I hadn't picked that up.  Lacking a
> good alternative, I'll probably stick with it.

Understood, and it's still a choice we may end up with. :)  I noted
today that Google's Chromium web browser includes ICU as its
character set/collation facility... who knows, it may turn out to be the
right road.  I think our current roadmap, scrapping charsets in favor
of only 4-byte utf8 and using the existing collations, is a good
in-between step.

> I neglected to mention that I'm dithering on whether to translate index
> keys to byte strings that can be compared naturally or to invoke
> collation specific comparisons during index traversal.  

Please expand on this for me.  The way I see it, a byte string cannot be
compared correctly without a collation (unless of course the collation
is binary...).  Otherwise, two strings stored in a binary format could
be sorted differently depending on whether a specific locale puts
character 0xXXXX before 0xYYYY, where XXXX and YYYY are arbitrary utf8
character codes.
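
To make the concern concrete, here is a minimal Python sketch.  The
toy_collation_key() function is a hypothetical stand-in that just
ignores case and accents; it is not the Drizzle collation code or ICU,
and a real collation is far more involved.  It only shows that raw
UTF-8 byte order and a collation's order can disagree on the same data.

```python
import unicodedata

def toy_collation_key(s: str) -> bytes:
    """Hypothetical collation: ignore case and accents (illustration only)."""
    # Decompose accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", s)
    # Drop the combining marks, then fold case.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().encode("utf-8")

# "\u00e9clair" is "eclair" with an acute accent on the first letter.
words = ["zebra", "Zebra", "apple", "\u00e9clair"]

# Binary collation: compare the stored UTF-8 bytes directly.
binary_order = sorted(words, key=lambda w: w.encode("utf-8"))

# Collation-aware: compare the keys produced by the (toy) collation.
collated_order = sorted(words, key=toy_collation_key)

print(binary_order)
print(collated_order)
```

With the byte comparison, "Zebra" lands before "apple" and the accented
word lands last; with the toy collation, the list comes out in
dictionary order.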

I see a couple options:

1) Store all index keys as binary strings and do all lookups and
comparisons at runtime
2) Have the collation set at create/alter time and store keys in collation
order, with the ability to fall back to a filesort if the needed
collation differs from the stored index collation.
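
A rough Python sketch of the two options, as I read them.  Everything
here is hypothetical: fold() stands in for a real collation,
collation_compare() for a collation-specific comparison done at
runtime, and expanded_key() for a key translated once into bytes that
can be compared naturally.  A sorted list stands in for the index.

```python
import functools
import unicodedata

def fold(s: str) -> str:
    """Hypothetical collation rule: ignore case and accents."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def collation_compare(a: str, b: str) -> int:
    """Option 1: collation logic runs on every comparison."""
    fa, fb = fold(a), fold(b)
    return (fa > fb) - (fa < fb)

def expanded_key(s: str) -> bytes:
    """Option 2: collation logic runs once, when the key is stored."""
    return fold(s).encode("utf-8")

names = ["delta", "Alpha", "\u00e9cho", "Bravo", "charlie"]

# Option 1: store the strings as-is, invoke the collation while ordering.
index_runtime = sorted(names, key=functools.cmp_to_key(collation_compare))

# Option 2: store (expanded key, original value); plain byte comparison
# of the stored keys then yields collation order.
index_expanded = sorted((expanded_key(n), n) for n in names)

print(index_runtime)
print([n for _, n in index_expanded])
```

Both lists come out in the same order.  The difference is where the
cost goes: option 1 pays in CPU on each comparison, option 2 pays in
storage for the translated keys and ties the index to one collation.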

Agree?  Any other ways to do it?

Cheers,

Jay

> For Falcon, we
> use expanded keys for our btrees on the assumption that collation
> specific compares would be too expensive in performance.  Nimbus uses
> AVL (balanced) trees, so the trade-off is quite different.  I may punt
> and let the user make the CPU/memory trade-off at index creation time
> (for which I will expect -- and deserve -- a great deal of heckling).
> 
> Nothing is better for an engineer's soul than whacking out obsoleted
> code.  Go, Brian, go!
> 
