Re: [GENERAL] Collation rules and multi-lingual databases
NO No no, do not mess with set locale, You will need to hack your own version of strxfrm which takes an aragument for locale. If you mess with locale,you will corrupt the indexes I've been told. Greg Stark wrote: Dennis Gearon <[EMAIL PROTECTED]> writes: I was thinking of INGNORING locale, since it is basically fixed for a DB for long periods of time. If a table/column HAD it's own locale, that could be used, but I was more interested in a function taht would allow the explicit declaration of the encoding(s) to look for. Indeed for my purposes that's what I'll have to do. but the strxfrm function uses the current application locale, so I'll have to call setlocale to set it, call strxfrm, then call setlocale to set it back. I fear that some implementations might do a lot of work when setlocale is called loading large data files and might leak memory expecting it to only be called once at program initialization. That would suck BTW, what is l10n l10n = localization i18n = internationalization arguably i should have said i18n actually. ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [GENERAL] Collation rules and multi-lingual databases
Dennis Gearon <[EMAIL PROTECTED]> writes: > I was thinking of INGNORING locale, since it is basically fixed for a DB for > long periods of time. > > If a table/column HAD it's own locale, that could be used, > but I was more interested in a function taht would allow the explicit > declaration of the encoding(s) to look for. Indeed for my purposes that's what I'll have to do. but the strxfrm function uses the current application locale, so I'll have to call setlocale to set it, call strxfrm, then call setlocale to set it back. I fear that some implementations might do a lot of work when setlocale is called loading large data files and might leak memory expecting it to only be called once at program initialization. That would suck > BTW, what is l10n l10n = localization i18n = internationalization arguably i should have said i18n actually. -- greg ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [GENERAL] Collation rules and multi-lingual databases
I was thinking of INGNORING locale, since it is basically fixed for a DB for long periods of time. If a table/column HAD it's own locale, that could be used, but I was more interested in a function taht would allow the explicit declaration of the encoding(s) to look for. BTW, what is l10n Greg Stark wrote: Greg Stark <[EMAIL PROTECTED]> writes: Dennis Gearon <[EMAIL PROTECTED]> writes: I think it would be nice, and I may write it eventually, to have a function called: COLLATION_VALUE( 'string', 'encoding' ) Indeed that would be really nice. I wish I had that and a pony. Unfortunately my understanding is that the collation rules are simply too complex to allow such a function in general. It's too bad because it would indeed eliminate a lot of the problems in a single swoop. Uh, so apparently I'm on crack and this is *precisely* how the l10n collation rules work. Sorry for jumping in with an uninformed opinion. Effectively, the way these functions work is by applying a mapping to transform the characters in a string to a byte sequence that represents the string's position in the collating sequence of the current locale. Comparing two such byte sequences in a simple fashion is equivalent to comparing the strings with the locale's collating sequence. The functions `strcoll' and `wcscoll' perform this translation implicitly, in order to do one comparison. By contrast, `strxfrm' and `wcsxfrm' perform the mapping explicitly. If you are making multiple comparisons using the same string or set of strings, it is likely to be more efficient to use `strxfrm' or `wcsxfrm' to transform all the strings just once, and subsequently compare the transformed strings with `strcmp' or `wcscmp'. Given this it should be easy to write a collation_value(string,locale) C function that switches the collation order, calls strxfrm and then restores the collation order. I fear memory leaks or performance losses on frequent locale switches like this but it should be easy enough to try out. I don't see any problems with postgres as long as it's possible to ensure the locale is always switched back properly. It might not be thread-safe though. At worst I could always call strxfrm in the application for each locale I care about when inserting the data. That would bloat my tables for nothing though. So it's looking like I might get my pony after all. ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [GENERAL] Collation rules and multi-lingual databases
Greg Stark <[EMAIL PROTECTED]> writes: > Dennis Gearon <[EMAIL PROTECTED]> writes: > > > I think it would be nice, and I may write it eventually, to have a function > > called: > > > > COLLATION_VALUE( 'string', 'encoding' ) > > Indeed that would be really nice. I wish I had that and a pony. > > Unfortunately my understanding is that the collation rules are simply too > complex to allow such a function in general. It's too bad because it would > indeed eliminate a lot of the problems in a single swoop. Uh, so apparently I'm on crack and this is *precisely* how the l10n collation rules work. Sorry for jumping in with an uninformed opinion. >Effectively, the way these functions work is by applying a mapping to > transform the characters in a string to a byte sequence that represents > the string's position in the collating sequence of the current locale. > Comparing two such byte sequences in a simple fashion is equivalent to > comparing the strings with the locale's collating sequence. > >The functions `strcoll' and `wcscoll' perform this translation > implicitly, in order to do one comparison. By contrast, `strxfrm' and > `wcsxfrm' perform the mapping explicitly. If you are making multiple > comparisons using the same string or set of strings, it is likely to be > more efficient to use `strxfrm' or `wcsxfrm' to transform all the > strings just once, and subsequently compare the transformed strings > with `strcmp' or `wcscmp'. Given this it should be easy to write a collation_value(string,locale) C function that switches the collation order, calls strxfrm and then restores the collation order. I fear memory leaks or performance losses on frequent locale switches like this but it should be easy enough to try out. I don't see any problems with postgres as long as it's possible to ensure the locale is always switched back properly. It might not be thread-safe though. At worst I could always call strxfrm in the application for each locale I care about when inserting the data. That would bloat my tables for nothing though. So it's looking like I might get my pony after all. -- greg ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [GENERAL] Collation rules and multi-lingual databases
I think it would be nice, and I may write it eventually, to have a function called: COLLATION_VALUE( 'string', 'encoding' ) Which could be used like: SELECT field_a, field_b FROM table_a GROUP BY COLLATION_VALUE( field_a ) ORDER BY COLLATION_VALUE( field_b ); or in other creative ways. Greg Stark wrote: Tom Lane <[EMAIL PROTECTED]> writes: Greg Stark <[EMAIL PROTECTED]> writes: My understanding is that the entire set of localization parameters needs to be decided upon when the initdb is done and can never be changed later. Is that right? No, not all of them are frozen. Unfortunately, the one you care about (LC_COLLATE) is. The reason for this is that it determines index ordering for textual columns, and so changing LC_COLLATE on the fly produces instant corrupt indexes :-( Yeah, I really would be perfectly happy to have indexes be in "C" order and have my queries have to specifically specify the sort order in the ORDER BY clause, knowing they won't use the index. Well, ok, not quite perfectly happy. But in this case there are no indexes on the columns anyways so... A solution for this is on the TODO list, but don't hold your breath ... ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [GENERAL] Collation rules and multi-lingual databases
Tom Lane <[EMAIL PROTECTED]> writes: > Greg Stark <[EMAIL PROTECTED]> writes: > > My understanding is that the entire set of localization parameters needs to be > > decided upon when the initdb is done and can never be changed later. Is that > > right? > > No, not all of them are frozen. Unfortunately, the one you care about > (LC_COLLATE) is. The reason for this is that it determines index > ordering for textual columns, and so changing LC_COLLATE on the fly > produces instant corrupt indexes :-( Yeah, I really would be perfectly happy to have indexes be in "C" order and have my queries have to specifically specify the sort order in the ORDER BY clause, knowing they won't use the index. Well, ok, not quite perfectly happy. But in this case there are no indexes on the columns anyways so... > A solution for this is on the TODO list, but don't hold your breath ... -- greg ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [GENERAL] Collation rules and multi-lingual databases
Greg Stark <[EMAIL PROTECTED]> writes: > My understanding is that the entire set of localization parameters needs to be > decided upon when the initdb is done and can never be changed later. Is that > right? No, not all of them are frozen. Unfortunately, the one you care about (LC_COLLATE) is. The reason for this is that it determines index ordering for textual columns, and so changing LC_COLLATE on the fly produces instant corrupt indexes :-( A solution for this is on the TODO list, but don't hold your breath ... regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend