Re: [GENERAL] Collation rules and multi-lingual databases

2003-08-22 Thread Dennis Gearon
NO No no, do not mess with set locale,
   You will need to hack your own version of strxfrm which takes an 
aragument for locale. If you mess with locale,you will corrupt the 
indexes I've been told.

Greg Stark wrote:

Dennis Gearon <[EMAIL PROTECTED]> writes:

 

I was thinking of INGNORING locale, since it is basically fixed for a DB for
long periods of time.
If a table/column HAD it's own locale, that could be used,
but I was more interested in a function taht would allow the explicit
declaration of the encoding(s) to look for.
   

Indeed for my purposes that's what I'll have to do.

but the strxfrm function uses the current application locale, so I'll have to
call setlocale to set it, call strxfrm, then call setlocale to set it back.
I fear that some implementations might do a lot of work when setlocale is
called loading large data files and might leak memory expecting it to only be
called once at program initialization. That would suck
 

BTW, what is l10n
   

l10n = localization
i18n = internationalization
arguably i should have said i18n actually.

 



---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [GENERAL] Collation rules and multi-lingual databases

2003-08-22 Thread Greg Stark

Dennis Gearon <[EMAIL PROTECTED]> writes:

> I was thinking of INGNORING locale, since it is basically fixed for a DB for
> long periods of time.
> 
> If a table/column HAD it's own locale, that could be used,
> but I was more interested in a function taht would allow the explicit
> declaration of the encoding(s) to look for.

Indeed for my purposes that's what I'll have to do.

but the strxfrm function uses the current application locale, so I'll have to
call setlocale to set it, call strxfrm, then call setlocale to set it back.

I fear that some implementations might do a lot of work when setlocale is
called loading large data files and might leak memory expecting it to only be
called once at program initialization. That would suck

> BTW, what is l10n

l10n = localization
i18n = internationalization

arguably i should have said i18n actually.

-- 
greg


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] Collation rules and multi-lingual databases

2003-08-22 Thread Dennis Gearon
I was thinking of INGNORING locale, since it is basically fixed for a DB 
for long periods of time.

If a table/column HAD it's own locale, that could be used,
but I was more interested in a function taht would allow the explicit 
declaration of the encoding(s) to look for.

BTW, what is l10n

Greg Stark wrote:

Greg Stark <[EMAIL PROTECTED]> writes:

 

Dennis Gearon <[EMAIL PROTECTED]> writes:

   

I think it would be nice, and I may write it eventually, to have a function
called:
COLLATION_VALUE( 'string', 'encoding' )
 

Indeed that would be really nice. I wish I had that and a pony.

Unfortunately my understanding is that the collation rules are simply too
complex to allow such a function in general. It's too bad because it would
indeed eliminate a lot of the problems in a single swoop.
   

Uh, so apparently I'm on crack and this is *precisely* how the l10n collation
rules work. Sorry for jumping in with an uninformed opinion.
 

  Effectively, the way these functions work is by applying a mapping to
transform the characters in a string to a byte sequence that represents
the string's position in the collating sequence of the current locale.
Comparing two such byte sequences in a simple fashion is equivalent to
comparing the strings with the locale's collating sequence.
  The functions `strcoll' and `wcscoll' perform this translation
implicitly, in order to do one comparison.  By contrast, `strxfrm' and
`wcsxfrm' perform the mapping explicitly.  If you are making multiple
comparisons using the same string or set of strings, it is likely to be
more efficient to use `strxfrm' or `wcsxfrm' to transform all the
strings just once, and subsequently compare the transformed strings
with `strcmp' or `wcscmp'.
   

Given this it should be easy to write a collation_value(string,locale) C
function that switches the collation order, calls strxfrm and then restores
the collation order. 

I fear memory leaks or performance losses on frequent locale switches like
this but it should be easy enough to try out. I don't see any problems with
postgres as long as it's possible to ensure the locale is always switched back
properly. It might not be thread-safe though.
At worst I could always call strxfrm in the application for each locale I care
about when inserting the data. That would bloat my tables for nothing though.
So it's looking like I might get my pony after all.

 



---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [GENERAL] Collation rules and multi-lingual databases

2003-08-22 Thread Greg Stark
Greg Stark <[EMAIL PROTECTED]> writes:

> Dennis Gearon <[EMAIL PROTECTED]> writes:
> 
> > I think it would be nice, and I may write it eventually, to have a function
> > called:
> > 
> > COLLATION_VALUE( 'string', 'encoding' )
> 
> Indeed that would be really nice. I wish I had that and a pony.
> 
> Unfortunately my understanding is that the collation rules are simply too
> complex to allow such a function in general. It's too bad because it would
> indeed eliminate a lot of the problems in a single swoop.

Uh, so apparently I'm on crack and this is *precisely* how the l10n collation
rules work. Sorry for jumping in with an uninformed opinion.

>Effectively, the way these functions work is by applying a mapping to
> transform the characters in a string to a byte sequence that represents
> the string's position in the collating sequence of the current locale.
> Comparing two such byte sequences in a simple fashion is equivalent to
> comparing the strings with the locale's collating sequence.
> 
>The functions `strcoll' and `wcscoll' perform this translation
> implicitly, in order to do one comparison.  By contrast, `strxfrm' and
> `wcsxfrm' perform the mapping explicitly.  If you are making multiple
> comparisons using the same string or set of strings, it is likely to be
> more efficient to use `strxfrm' or `wcsxfrm' to transform all the
> strings just once, and subsequently compare the transformed strings
> with `strcmp' or `wcscmp'.

Given this it should be easy to write a collation_value(string,locale) C
function that switches the collation order, calls strxfrm and then restores
the collation order. 

I fear memory leaks or performance losses on frequent locale switches like
this but it should be easy enough to try out. I don't see any problems with
postgres as long as it's possible to ensure the locale is always switched back
properly. It might not be thread-safe though.

At worst I could always call strxfrm in the application for each locale I care
about when inserting the data. That would bloat my tables for nothing though.

So it's looking like I might get my pony after all.

-- 
greg


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] Collation rules and multi-lingual databases

2003-08-20 Thread Dennis Gearon
I think it would be nice, and I may write it eventually, to have a 
function called:

COLLATION_VALUE( 'string', 'encoding' )

Which could be used like:

SELECT field_a, field_b
FROM table_a
GROUP BY COLLATION_VALUE( field_a )
ORDER BY COLLATION_VALUE( field_b );
or in other creative ways.

Greg Stark wrote:

Tom Lane <[EMAIL PROTECTED]> writes:

 

Greg Stark <[EMAIL PROTECTED]> writes:
   

My understanding is that the entire set of localization parameters needs to be
decided upon when the initdb is done and can never be changed later. Is that
right?
 

No, not all of them are frozen.  Unfortunately, the one you care about
(LC_COLLATE) is.  The reason for this is that it determines index
ordering for textual columns, and so changing LC_COLLATE on the fly
produces instant corrupt indexes :-(
   

Yeah, I really would be perfectly happy to have indexes be in "C" order and
have my queries have to specifically specify the sort order in the ORDER BY
clause, knowing they won't use the index.
Well, ok, not quite perfectly happy. But in this case there are no indexes on
the columns anyways so...
 

A solution for this is on the TODO list, but don't hold your breath ...
   

 



---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [GENERAL] Collation rules and multi-lingual databases

2003-08-20 Thread Greg Stark
Tom Lane <[EMAIL PROTECTED]> writes:

> Greg Stark <[EMAIL PROTECTED]> writes:
> > My understanding is that the entire set of localization parameters needs to be
> > decided upon when the initdb is done and can never be changed later. Is that
> > right?
> 
> No, not all of them are frozen.  Unfortunately, the one you care about
> (LC_COLLATE) is.  The reason for this is that it determines index
> ordering for textual columns, and so changing LC_COLLATE on the fly
> produces instant corrupt indexes :-(

Yeah, I really would be perfectly happy to have indexes be in "C" order and
have my queries have to specifically specify the sort order in the ORDER BY
clause, knowing they won't use the index.

Well, ok, not quite perfectly happy. But in this case there are no indexes on
the columns anyways so...

> A solution for this is on the TODO list, but don't hold your breath ...

-- 
greg


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [GENERAL] Collation rules and multi-lingual databases

2003-08-20 Thread Tom Lane
Greg Stark <[EMAIL PROTECTED]> writes:
> My understanding is that the entire set of localization parameters needs to be
> decided upon when the initdb is done and can never be changed later. Is that
> right?

No, not all of them are frozen.  Unfortunately, the one you care about
(LC_COLLATE) is.  The reason for this is that it determines index
ordering for textual columns, and so changing LC_COLLATE on the fly
produces instant corrupt indexes :-(

A solution for this is on the TODO list, but don't hold your breath ...

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend