Jim Starkey wrote:
> Jay Pipes wrote:
>> Jim Starkey wrote:
>>
>>> Brian Aker wrote:
>>>
>>>> Hi!
>>>>
>>>> On Sep 3, 2008, at 7:27 AM, Jim Starkey wrote:
>>>>
>>>>> For timestamps, standard syntax. For numbers, "<name> number [ scale
>>>>> <decimalScaleFactor> ]". For strings, "<name> text". All text is
>>>>> internally UTF-8, though that isn't externally visible. Blobs/clobs
>>>>> are a threshold for separate storage rather than distinct types. The
>>>>> legacy SQL types will be recognized as quaint, but acceptable.
>>>>>
>>>> We are on the same page then. I haven't pulled the TINY/SMALL code
>>>> yet, but it will be gone soon. All of the int(12) is gone. Medium has
>>>> been gone for a while (I mean... 24bit int?).
>>>>
>>>> By external on UTF8... what do you mean? How are you handling your
>>>> collations on UTF8? At this point I have yet to find a good free
>>>> library to do it. Very soon Drizzle will be end to end UTF8. Clients
>>>> will only accept it, and I am just requiring clients to handle the
>>>> conversion themselves. This means native conversion should work for
>>>> almost everything.
>>>>
>>> I'm planning to use ICU (IBM's International Components for Unicode) for
>>> the actual collations. It's licensed under MIT's X11 license, and is
>>> GPL compatible.
>>>
>>
>> We investigated ICU a month or so ago. Looks good, but the number one
>> reason it was decided not to go forward was because it is natively
>> UTF16, and we weren't willing to go down that route at the moment.
>>
>> The current plan is to remove charset support entirely within drizzle
>> (krow is already working on that) and use the current collation system
>> as-is, while fixing any bugs we find.
>>
>
> Thanks for the heads up on ICU -- I hadn't picked that up. Lacking a
> good alternative, I'll probably stick with it.

Understood, and it's still a choice we may end up with. :) I noted today
that Google's Chromium web browser includes ICU as its character set and
collation facility... who knows, it may turn out to be the right road. I
think our current roadmap, scrapping charsets in favor of only 4-byte
utf8 and using the existing collations, is a good in-between step.

> I neglected to mention that I'm dithering on whether to translate index
> keys to byte strings that can be compared naturally or to invoke
> collation specific comparisons during index traversal.

Please expand on this for me. The way I see it, a byte string cannot be
compared correctly without a collation (unless of course the collation is
binary...). Otherwise, two strings stored in a binary format could be
sorted differently depending on whether a specific locale determines char
0xXXXX to be before 0xXXXX, where each XXXX is some arbitrary utf8
character code.

I see a couple of options:

1) Store all index keys as binary strings and do all lookups and
   comparisons at runtime
2) Have the collation set at create/alter time and store keys in
   collation order, with the ability to pull into a filesort if the
   needed collation is different from the stored index collation.

Agree? Any other ways to do it?

Cheers,

Jay

> For Falcon, we
> use expanded keys for our btrees on the assumption that collation
> specific compares would be too expensive in performance. Nimbus uses
> AVL (balanced) trees, so the trade-off is quite different. I may punt
> and let the user make the CPU/memory trade-off at index creation time
> (for which I will expect -- and deserve -- a great deal of heckling).
>
> Nothing is better for an engineer's soul than whacking out obsoleted
> code. Go, Brian, go!
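
P.S. To make the trade-off concrete, here is a rough Python sketch of the
two approaches under discussion: storing an expanded key that can be
compared bytewise, versus calling a collation-aware comparison at lookup
time. Everything in it is an assumption made for illustration. The "toy
collation" (ignore case and accents, break ties on the raw bytes) and the
names toy_collation_key and toy_collation_compare are hypothetical; this
is not ICU, and it is not how Falcon, Nimbus or Drizzle implement it.

```python
import functools
import unicodedata


def toy_collation_key(text: str) -> bytes:
    """Hypothetical expanded key: bytewise order matches the toy collation."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    primary = stripped.casefold().encode("utf-8")
    # Assumes the text contains no NUL; the raw bytes only break ties.
    return primary + b"\x00" + text.encode("utf-8")


def toy_collation_compare(a: str, b: str) -> int:
    """Collation-aware comparison done at runtime, once per comparison."""
    ka, kb = toy_collation_key(a), toy_collation_key(b)
    return (ka > kb) - (ka < kb)


def binary_order(words):
    """Plain binary collation: compare the UTF-8 bytes directly."""
    return sorted(words, key=lambda w: w.encode("utf-8"))


def expanded_key_order(words):
    """Compute each key once (as at index insert time), then compare bytes."""
    keyed = [(toy_collation_key(w), w) for w in words]
    keyed.sort(key=lambda pair: pair[0])
    return [w for _, w in keyed]


def runtime_compare_order(words):
    """Keep the raw strings and invoke the collation on every comparison."""
    return sorted(words, key=functools.cmp_to_key(toy_collation_compare))


if __name__ == "__main__":
    words = ["Zebra", "apple", "\u00c9clair", "eclair", "Apple"]
    print("binary:      ", binary_order(words))
    print("expanded key:", expanded_key_order(words))
    print("runtime cmp: ", runtime_compare_order(words))
```

The expanded-key and runtime-compare orderings agree with each other and
differ from the binary one. The difference between the two is where the
cost lands: expanded keys take more space per entry but compare with a
plain byte comparison, while runtime comparison keeps entries small and
pays for the collation on every step of the traversal.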

