
Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-23 Thread Greg Stark

On Mon, Nov 22, 2010 at 12:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 No, especially if it results in queries that used to work breaking,
 which it well could.  But I'm not sure where to go with it from there,
 beyond throwing up my hands.

 Well, that's why there's been no movement on this since 2004 :-(.  The
 amount of work needed for a better solution seems far out of proportion
 to the benefits.

We could extend the existing logic to handle multibyte characters
though, couldn't we? It's not going to fix all the problems but at
least it'll do something sane.




-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-23 Thread Tom Lane

Greg Stark gsst...@mit.edu writes:
 On Mon, Nov 22, 2010 at 12:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Well, that's why there's been no movement on this since 2004 :-(.  The
 amount of work needed for a better solution seems far out of proportion
 to the benefits.

 We could extend the existing logic to handle multibyte characters
 though, couldn't we? It's not going to fix all the problems but at
 least it'll do something sane.

Not easily, cheaply, or portably.  The closest you could get in that
line would be to use towlower(), which doesn't exist everywhere
(though I grant probably most platforms have it by now).  The much much
bigger problem though is that we don't know what character representation
towlower() deals in.  We recently kluged the regex code to assume that
the wchar_t representation for UTF8 locales is the standardized Unicode
code point.  I haven't heard of that breaking, but 9.0 hasn't been out
that long.  In other multibyte encodings we have no idea how to use that
function, short of invoking mbstowcs/wcstombs or local equivalent, which
is expensive and doesn't readily allow a short-circuit for ASCII.

And, after you've hacked your way through all that, you still end up
with case-folding behavior that depends on the prevailing locale.
Which is dangerous for the previously cited reasons, and arguably not
spec-compliant.

regards, tom lane
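
For readers following along, the mbstowcs/towlower/wcstombs round trip mentioned above can be sketched roughly as follows. This is only an illustration, not PostgreSQL code: the helper name lower_via_wchar() is made up, it relies on the POSIX behavior of mbstowcs()/wcstombs() returning the required length when the destination is NULL, and its result depends entirely on the LC_CTYPE locale in effect, which is exactly the concern raised in the message.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not PostgreSQL code): lowercase a multibyte string
 * by converting it to wchar_t, applying towlower() to each wide character,
 * and converting back.  The mapping used is whatever the current LC_CTYPE
 * locale says it is.  Returns a malloc'd string, or NULL if the input is
 * not valid in the locale's encoding or memory runs out.
 */
char *
lower_via_wchar(const char *src)
{
    size_t      nwide;
    size_t      nbytes;
    size_t      i;
    wchar_t    *wide;
    char       *out;

    assert(src != NULL);

    /* POSIX: a NULL destination just counts the wide characters needed. */
    nwide = mbstowcs(NULL, src, 0);
    if (nwide == (size_t) -1)
        return NULL;

    wide = malloc((nwide + 1) * sizeof(wchar_t));
    if (wide == NULL)
        return NULL;
    mbstowcs(wide, src, nwide + 1);

    /* Every character pays for the conversion, even plain ASCII. */
    for (i = 0; i < nwide; i++)
        wide[i] = (wchar_t) towlower((wint_t) wide[i]);

    /* The lowercased form may need a different number of bytes. */
    nbytes = wcstombs(NULL, wide, 0);
    if (nbytes == (size_t) -1)
    {
        free(wide);
        return NULL;
    }

    out = malloc(nbytes + 1);
    if (out != NULL)
        wcstombs(out, wide, nbytes + 1);
    free(wide);
    return out;
}
```

Note that the whole string is converted twice and two buffers are allocated, which is the cost referred to above; nothing here is skipped for identifiers that are pure ASCII.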


Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-23 Thread Andrew Dunstan




On 11/23/2010 11:14 AM, Greg Stark wrote:

On Mon, Nov 22, 2010 at 12:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:

No, especially if it results in queries that used to work breaking,
which it well could.  But I'm not sure where to go with it from there,
beyond throwing up my hands.

Well, that's why there's been no movement on this since 2004 :-(.  The
amount of work needed for a better solution seems far out of proportion
to the benefits.

We could extend the existing logic to handle multibyte characters
though, couldn't we? It's not going to fix all the problems but at
least it'll do something sane.


What casing rules will you apply? How will you know what is an upper 
case character and what its lower case character is? The sad, short 
answer is that there are no simple rules beyond ASCII. See the URL I 
posted upthread.


cheers

andrew


Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-23 Thread Greg Stark

On Tue, Nov 23, 2010 at 5:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 And, after you've hacked your way through all that, you still end up
 with case-folding behavior that depends on the prevailing locale.
 Which is dangerous for the previously cited reasons, and arguably not
 spec-compliant.


So I thought the problem with the Turkish locale definition was that
it redefined how a capital ASCII character which was present in
standard SQL identifiers was lowercased, resulting in standard SQL
syntax not working.

I'm not sure I understand the danger if a user creates an object in a
database with a particular encoding and locale using that locale for
downcasing in the future. We don't currently support changing the
locale of a database or using different locales in the future. Even
with Peter's patch I think we can reasonably require the user to
specify a single locale which controls how the SQL identifiers are
interpreted regardless of the collations used in the operations.

The points about the C API being limited and nonportable are a
different issue. I guess I would need to do research to see if we're
missing something which would help here. Otherwise I might be
beginning to see the value in that /other/ library which I've argued
against in the past.

-- 
greg


Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-23 Thread Tom Lane

Greg Stark gsst...@mit.edu writes:
 I'm not sure I understand the danger if a user creates an object in a
 database with a particular encoding and locale using that locale for
 downcasing in the future.

The case I was worried about is dumping from one database and reloading
into another one with a different locale.  Although I suppose there are
enough *other* reasons why that might fail that adding changes of
downcasing behavior might not be a big deal.

regards, tom lane


Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-23 Thread Greg Stark

On Tue, Nov 23, 2010 at 5:51 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The case I was worried about is dumping from one database and reloading
 into another one with a different locale.  Although I suppose there are
 enough *other* reasons why that might fail that adding changes of
 downcasing behavior might not be a big deal.

If you dump the whole database then pg_dump would create the new
database with the correct encoding and locale. If you change it then
that can already cause it to fail if the data can't be converted to
the new encoding.  And as you point out there are all kinds of ways
you can cause that to fail by making the context incompatible with the
definitions you're loading.

The lesson we learned in the past is that we have to ignore the locale
for all the characters present in the standard identifiers. Beyond
that, I think this is just an implementation problem, which may be a
show stopper in itself, but if we can do anything with multibyte
characters it's probably an improvement over what we do now.

-- 
greg


Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-23 Thread Peter Eisentraut

On Sun, 2010-11-21 at 18:48 -0500, Tom Lane wrote:
 Yeah.  I'm actually not sure that the SQL committee has thought very
 hard about this, because the spec is worded as though they think that
 Unicode case normalization is all they have to say to uniquely
 define what to do.  The Unicode guys recognize that case mapping is
 locale-specific, which puts us right back at square one.  But leaving
 spec compliance aside, we know from bitter experience that we cannot
 use a definition that lets the Turkish locale fool with the mapping of
 i/I. I suspect that locale-dependent mappings of any other characters
 are just as bad, we simply haven't had enough users burnt by such
 cases to have an institutional memory of it.

The number of locale-dependent case mappings in the entire universe of
Unicode is actually limited to 7 cases for Lithuanian and 8 cases for
Turkish. (ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt)  So it
would be fair to say that there is a default case mapping, and that is
what the SQL standard presumably refers to.

One thing that we could do is let the user declare that he thinks his
current locale is consistent with the Unicode case normalization, and
apply the full Unicode conversion if so.



Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-21 Thread Robert Haas

On Wed, Jul 7, 2010 at 10:07 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Rajanikant Chirmade rajanikant.chirm...@enterprisedb.com writes:
 Every identifier is downcased and truncated by the function
 downcase_truncate_identifier() before it is used.

 But since downcase_truncate_identifier() is not multibyte-character aware,
 it cannot downcase some special characters in an identifier such as
 my_SchemÄ.

 IIRC this is intentional.  Please consult the archives for previous
 discussions.

Why would this be intentional?

One concern I have about this approach is that I am guessing that the
current implementation of str_tolower() is a lot slower than the
current implementation of downcase_truncate_identifier().  It would be
nice to have an implementation that is capable of handling wide
characters but doesn't actually incur the speed penalty unless a wide
character is actually present.
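
A minimal sketch of the "only pay when a wide character is present" idea might look like the following. It is not the actual PostgreSQL function: downcase_fast_path() and the slow_downcase_fn callback type are names invented for illustration, and the slow routine is assumed to write a complete NUL-terminated result into a buffer the caller has sized appropriately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hook for a slower, multibyte-aware downcasing routine. */
typedef void (*slow_downcase_fn) (const char *src, size_t len, char *dst);

/*
 * Downcase len bytes of src into dst (dst must hold at least len + 1 bytes).
 * Pure-ASCII input is folded inline.  As soon as a byte with the high bit
 * set is seen, the whole string is handed to the slow routine instead; if
 * no slow routine is given, the remaining bytes are copied unchanged.
 * Returns true if a non-ASCII byte was encountered.
 */
bool
downcase_fast_path(const char *src, size_t len, char *dst,
                   slow_downcase_fn slow)
{
    size_t      i;

    assert(src != NULL && dst != NULL);

    for (i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) src[i];

        if (ch & 0x80)
        {
            if (slow != NULL)
                slow(src, len, dst);    /* redo the whole string */
            else
            {
                memcpy(dst + i, src + i, len - i);
                dst[len] = '\0';
            }
            return true;
        }

        /* Locale-independent folding for 7-bit characters. */
        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        dst[i] = (char) ch;
    }
    dst[len] = '\0';
    return false;
}
```

With this shape, an identifier such as My_TABLE never touches the slow routine, while my_SchemÄ falls through to it at the first byte of the Ä.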

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-21 Thread Tom Lane

Robert Haas robertmh...@gmail.com writes:
 On Wed, Jul 7, 2010 at 10:07 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 IIRC this is intentional.  Please consult the archives for previous
 discussions.

 Why would this be intentional?

Well, it's intentional for lack of any infrastructure that would allow
a more spec-compliant approach.  As you say, calling str_tolower here
is probably a non-starter for performance reasons.  Another big problem
is that str_tolower produces a locale-specific downcasing conversion.
This (a) is going to create portability headaches of the first magnitude,
and (b) is not really an advance in terms of spec compliance.  The SQL
spec says that identifier case folding should be done according to the
Unicode standard, but it's not safe to assume that any random
platform-specific locale is going to act that way.  A specific example
of a locale that is known to NOT behave acceptably is Turkish: they have
weird ideas about i versus I, which in fact broke things back when we
used to use tolower for this purpose.  See the archives from early 2004,
and in particular commit 59f9a0b9df0d224bb62ff8ec5b65e0b187655742, which
removed the exact same logic (though not wide-character-aware) that this
patch proposes to put back.

I think the given patch can be rejected out of hand.  If the OP has any
ideas about doing non-locale-dependent case folding at an acceptable
speed, I'm happy to listen.

regards, tom lane


Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-21 Thread Robert Haas

On Sun, Nov 21, 2010 at 4:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Jul 7, 2010 at 10:07 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 IIRC this is intentional.  Please consult the archives for previous
 discussions.

 Why would this be intentional?

 Well, it's intentional for lack of any infrastructure that would allow
 a more spec-compliant approach.  As you say, calling str_tolower here
 is probably a non-starter for performance reasons.  Another big problem
 is that str_tolower produces a locale-specific downcasing conversion.
 This (a) is going to create portability headaches of the first magnitude,
 and (b) is not really an advance in terms of spec compliance.  The SQL
 spec says that identifier case folding should be done according to the
 Unicode standard, but it's not safe to assume that any random
 platform-specific locale is going to act that way.  A specific example
 of a locale that is known to NOT behave acceptably is Turkish: they have
 weird ideas about i versus I, which in fact broke things back when we
 used to use tolower for this purpose.  See the archives from early 2004,
 and in particular commit 59f9a0b9df0d224bb62ff8ec5b65e0b187655742, which
 removed the exact same logic (though not wide-character-aware) that this
 patch proposes to put back.

 I think the given patch can be rejected out of hand.  If the OP has any
 ideas about doing non-locale-dependent case folding at an acceptable
 speed, I'm happy to listen.

I think that's fair.  It actually doesn't seem like it should be that
hard if we knew that the server encoding were UTF8 - it's just a big
translation table somewhere, no?  We use heuristics to copy as many
characters as possible without detailed examination and consult the
lookup table for the rest.  However, that's not very practical in the
face of more than one encoding that must be handled.  What sort of
infrastructure would actually be useful for dealing with this problem?
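
To make the "translation table plus heuristics" idea concrete, here is a rough sketch under strong assumptions: the input is valid UTF-8, only two-byte sequences are looked up, and the table holds three hand-picked entries whose lowercase forms encode to the same number of bytes. The names case_table, lookup_lower() and utf8_downcase_simple() are hypothetical, and a real table would have to be generated from the Unicode data files. The replies in this thread explain why a simple one-to-one table is not sufficient in general.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tiny illustrative table; NOT a complete Unicode case mapping. */
static const struct
{
    unsigned    upper;
    unsigned    lower;
}           case_table[] =
{
    {0x00C4, 0x00E4},           /* A with diaeresis */
    {0x00D6, 0x00F6},           /* O with diaeresis */
    {0x00DC, 0x00FC},           /* U with diaeresis */
};

/* Return the table's lowercase mapping for cp, or cp itself if none. */
static unsigned
lookup_lower(unsigned cp)
{
    size_t      i;

    for (i = 0; i < sizeof(case_table) / sizeof(case_table[0]); i++)
    {
        if (case_table[i].upper == cp)
            return case_table[i].lower;
    }
    return cp;
}

/*
 * Downcase len bytes of valid UTF-8 from src into dst (len + 1 bytes).
 * ASCII bytes are folded directly, two-byte sequences are decoded and
 * looked up, and all other bytes are copied unchanged.  Output length
 * equals input length only because every table entry stays two bytes.
 */
void
utf8_downcase_simple(const char *src, size_t len, char *dst)
{
    size_t      i = 0;

    assert(src != NULL && dst != NULL);

    while (i < len)
    {
        unsigned char b0 = (unsigned char) src[i];

        if (b0 < 0x80)
        {
            /* Cheap path: no table lookup for 7-bit characters. */
            if (b0 >= 'A' && b0 <= 'Z')
                b0 += 'a' - 'A';
            dst[i++] = (char) b0;
        }
        else if ((b0 & 0xE0) == 0xC0 && i + 1 < len)
        {
            unsigned char b1 = (unsigned char) src[i + 1];
            unsigned    cp = ((unsigned) (b0 & 0x1F) << 6) | (b1 & 0x3F);

            cp = lookup_lower(cp);
            dst[i] = (char) (0xC0 | (cp >> 6));
            dst[i + 1] = (char) (0x80 | (cp & 0x3F));
            i += 2;
        }
        else
        {
            /* Longer sequences and stray bytes pass through untouched. */
            dst[i] = src[i];
            i++;
        }
    }
    dst[len] = '\0';
}
```

Applied to the identifier from the original report, this turns the UTF-8 bytes of my_SchemÄ into those of my_schemä, but only because that one character happens to be in the table.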

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-21 Thread Andrew Dunstan




On 11/21/2010 06:09 PM, Robert Haas wrote:

I think that's fair.  It actually doesn't seem like it should be that
hard if we knew that the server encoding were UTF8 - it's just a big
translation table somewhere, no?


No, it's far more complex. See for example 
http://unicode.org/reports/tr21/tr21-3.html, which says:


   There are a number of complications to case mappings that occur once
   the repertoire of characters is expanded beyond ASCII.

   * Because of the inclusion of certain composite characters for
     compatibility, such as 01F1 DZ /capital dz/, there is a
     third case, called /titlecase/, which is used where the first
     letter of a word is to be capitalized (e.g. Titlecase, vs.
     UPPERCASE, or lowercase).
       o For example, the title case of the example character is
         01F2 Dz /capital d with small z/.
   * Case mappings may produce strings of different length than the
     original.
       o For example, the German character 00DF ß /small letter
         sharp s/ expands when uppercased to the sequence of two
         characters SS. This also occurs where there is no
         precomposed character corresponding to a case mapping,
         such as with 0149 ŉ /latin small letter n preceded by
         apostrophe/.
   * Characters may also have different case mappings, depending on
     the context.
       o For example, 03A3 Σ /capital sigma/ lowercases to 03C3
         σ /small sigma/ if it is followed by another letter,
         but lowercases to 03C2 ς /small final sigma/ if it is not.
   * Characters may have case mappings that depend on the locale.
       o For example, in Turkish the letter 0049 I /capital
         letter i/ lowercases to 0131 ı /small dotless i/.
   * Case mappings are not, in general, reversible.
       o For example, once the string McGowan has been
         uppercased, lowercased or titlecased, the original
         cannot be recovered by applying another uppercase,
         lowercase, or titlecase operation.
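
Two of the complications quoted above (context-dependent mapping and length-changing mapping) can be sketched on arrays of code points as follows. This is an illustration only: the function names are invented, is_letter() is a crude stand-in for a real letter test, and the sigma rule implemented is the simplified one stated in the quoted text rather than the full rule from the Unicode data files.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Crude stand-in for a real "is this a letter?" test (an assumption). */
static bool
is_letter(unsigned cp)
{
    return (cp >= 'A' && cp <= 'Z') ||
        (cp >= 'a' && cp <= 'z') ||
        (cp >= 0x0391 && cp <= 0x03C9); /* roughly the basic Greek letters */
}

/*
 * Lowercase capital sigma by context, following the quoted rule:
 * U+03C3 if another letter follows, otherwise U+03C2 (final sigma).
 * All other code points are copied unchanged.  out must hold n entries.
 */
void
lower_sigma_in_context(const unsigned *in, size_t n, unsigned *out)
{
    size_t      i;

    assert(in != NULL && out != NULL);

    for (i = 0; i < n; i++)
    {
        if (in[i] == 0x03A3)
            out[i] = (i + 1 < n && is_letter(in[i + 1])) ? 0x03C3 : 0x03C2;
        else
            out[i] = in[i];
    }
}

/*
 * Uppercase with the one expanding mapping from the quote: U+00DF becomes
 * the two code points 'S' 'S'.  ASCII a-z is folded and everything else is
 * copied.  out must hold 2 * n entries; returns how many were written.
 */
size_t
upper_with_expansion(const unsigned *in, size_t n, unsigned *out)
{
    size_t      i;
    size_t      j = 0;

    assert(in != NULL && out != NULL);

    for (i = 0; i < n; i++)
    {
        if (in[i] == 0x00DF)
        {
            out[j++] = 'S';
            out[j++] = 'S';
        }
        else if (in[i] >= 'a' && in[i] <= 'z')
            out[j++] = in[i] - ('a' - 'A');
        else
            out[j++] = in[i];
    }
    return j;
}
```

Neither function fits a per-character lookup table: the first needs to look at the following character, and the second cannot write its result into a buffer the same size as the input.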


cheers

andrew

Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-21 Thread Robert Haas

On Sun, Nov 21, 2010 at 6:22 PM, Andrew Dunstan and...@dunslane.net wrote:


 On 11/21/2010 06:09 PM, Robert Haas wrote:

 I think that's fair.  It actually doesn't seem like it should be that
 hard if we knew that the server encoding were UTF8 - it's just a big
 translation table somewhere, no?

 No, it's far more complex. See for example
 http://unicode.org/reports/tr21/tr21-3.html, which says:

 There are a number of complications to case mappings that occur once the
 repertoire of characters is expanded beyond ASCII.

 Because of the inclusion of certain composite characters for compatibility,
 such as 01F1 DZ capital dz, there is a third case, called titlecase, which
 is used where the first letter of a word is to be capitalized (e.g.
 Titlecase, vs. UPPERCASE, or lowercase).

 For example, the title case of the example character is 01F2 Dz capital d
 with small z.

 Case mappings may produce strings of different length than the original.

 For example, the German character 00DF ß small letter sharp s expands when
 uppercased to the sequence of two characters SS. This also occurs where
 there is no precomposed character corresponding to a case mapping, such as
 with 0149 ŉ latin small letter n preceded by apostrophe.

 Characters may also have different case mappings, depending on the context.

 For example, 03A3 Σ capital sigma lowercases to 03C3 σ small sigma if it
 is followed by another letter, but lowercases to 03C2 ς small final sigma
 if it is not.

 Characters may have case mappings that depend on the locale.

 For example, in Turkish the letter 0049 I capital letter i lowercases to
 0131 ı small dotless i.

 Case mappings are not, in general, reversible.

 For example, once the string McGowan has been uppercased, lowercased or
 titlecased, the original cannot be recovered by applying another uppercase,
 lowercase, or titlecase operation.

Yikes.  So what do people do about this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-21 Thread Tom Lane

Andrew Dunstan and...@dunslane.net writes:
 On 11/21/2010 06:09 PM, Robert Haas wrote:
 I think that's fair.  It actually doesn't seem like it should be that
 hard if we knew that the server encoding were UTF8 - it's just a big
 translation table somewhere, no?

 No, it's far more complex. See for example 
 http://unicode.org/reports/tr21/tr21-3.html, which says:

Yeah.  I'm actually not sure that the SQL committee has thought very
hard about this, because the spec is worded as though they think that
Unicode case normalization is all they have to say to uniquely define
what to do.  The Unicode guys recognize that case mapping is
locale-specific, which puts us right back at square one.  But leaving
spec compliance aside, we know from bitter experience that we cannot use
a definition that lets the Turkish locale fool with the mapping of i/I.
I suspect that locale-dependent mappings of any other characters are
just as bad, we simply haven't had enough users burnt by such cases to
have an institutional memory of it.  But for example do you really think
it's a good idea if pg_dump and reload into a DB with a different locale
results in changing the normalized form of SQL identifiers?

regards, tom lane


Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-21 Thread Robert Haas

On Sun, Nov 21, 2010 at 6:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Andrew Dunstan and...@dunslane.net writes:
 On 11/21/2010 06:09 PM, Robert Haas wrote:
 I think that's fair.  It actually doesn't seem like it should be that
 hard if we knew that the server encoding were UTF8 - it's just a big
 translation table somewhere, no?

 No, it's far more complex. See for example
 http://unicode.org/reports/tr21/tr21-3.html, which says:

 Yeah.  I'm actually not sure that the SQL committee has thought very
 hard about this, because the spec is worded as though they think that
 Unicode case normalization is all they have to say to uniquely define
 what to do.  The Unicode guys recognize that case mapping is
 locale-specific, which puts us right back at square one.  But leaving
 spec compliance aside, we know from bitter experience that we cannot use
 a definition that lets the Turkish locale fool with the mapping of i/I.
 I suspect that locale-dependent mappings of any other characters are
 just as bad, we simply haven't had enough users burnt by such cases to
 have an institutional memory of it.  But for example do you really think
 it's a good idea if pg_dump and reload into a DB with a different locale
 results in changing the normalized form of SQL identifiers?

No, especially if it results in queries that used to work breaking,
which it well could.  But I'm not sure where to go with it from there,
beyond throwing up my hands.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-11-21 Thread Tom Lane

Robert Haas robertmh...@gmail.com writes:
 On Sun, Nov 21, 2010 at 6:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 spec compliance aside, we know from bitter experience that we cannot use
 a definition that lets the Turkish locale fool with the mapping of i/I.
 I suspect that locale-dependent mappings of any other characters are
 just as bad, we simply haven't had enough users burnt by such cases to
 have an institutional memory of it.  But for example do you really think
 it's a good idea if pg_dump and reload into a DB with a different locale
 results in changing the normalized form of SQL identifiers?

 No, especially if it results in queries that used to work breaking,
 which it well could.  But I'm not sure where to go with it from there,
 beyond throwing up my hands.

Well, that's why there's been no movement on this since 2004 :-(.  The
amount of work needed for a better solution seems far out of proportion
to the benefits.

regards, tom lane


Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-07-26 Thread Robert Haas

On Mon, Jul 26, 2010 at 12:40 AM, Rajanikant Chirmade
rajanikant.chirm...@enterprisedb.com wrote:
 Since discussion stopped in discussion thread

 http://archives.postgresql.org/pgsql-bugs/2006-09/msg00128.php

 Are there any implications of this change in handling identifiers ?

 Thanks & Regards,
 Rajanikant Chirmade

An even more relevant message appears to be this one:

http://archives.postgresql.org/pgsql-bugs/2006-09/msg00133.php

Both this and the comment in downcase_truncate_identifier() suggest
that the current method is attributable to lack of support for
Unicode-aware case normalization and is known not to work correctly in
all locales.  Locale and encoding stuff isn't really my area of
expertise, but if we now have support for Unicode-aware case
normalization, shouldn't we be using it here, too?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-07-25 Thread Rajanikant Chirmade

Since discussion stopped in discussion thread

http://archives.postgresql.org/pgsql-bugs/2006-09/msg00128.php

Are there any implications of this change in handling identifiers ?

Thanks & Regards,
Rajanikant Chirmade

On Tue, Jul 13, 2010 at 12:10 AM, Rajanikant Chirmade 
rajanikant.chirm...@enterprisedb.com wrote:



 On Wed, Jul 7, 2010 at 7:37 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Rajanikant Chirmade rajanikant.chirm...@enterprisedb.com writes:
  Every identifier is downcased and truncated by the function
  downcase_truncate_identifier() before it is used.

  But since downcase_truncate_identifier() is not multibyte-character aware,
  it cannot downcase some special characters in an identifier such as
  my_SchemÄ.






 IIRC this is intentional.  Please consult the archives for previous
 discussions.

regards, tom lane




 I found one discussion thread on the same issue, but it stopped without
 any conclusion.

 http://archives.postgresql.org/pgsql-bugs/2006-09/msg00128.php

 Thanks & Regards,
 Rajanikant Chirmade.

Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-07-13 Thread Rajanikant Chirmade

On Wed, Jul 7, 2010 at 7:37 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Rajanikant Chirmade rajanikant.chirm...@enterprisedb.com writes:
  Every identifier is downcased and truncated by the function
  downcase_truncate_identifier() before it is used.

  But since downcase_truncate_identifier() is not multibyte-character aware,
  it cannot downcase some special characters in an identifier such as
  my_SchemÄ.






 IIRC this is intentional.  Please consult the archives for previous
 discussions.

regards, tom lane




I found one discussion thread on the same issue, but it stopped without
any conclusion.

http://archives.postgresql.org/pgsql-bugs/2006-09/msg00128.php

Thanks & Regards,
Rajanikant Chirmade.

Re: [HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-07-07 Thread Tom Lane

Rajanikant Chirmade rajanikant.chirm...@enterprisedb.com writes:
 Every identifier is downcased and truncated by the function
 downcase_truncate_identifier() before it is used.

 But since downcase_truncate_identifier() is not multibyte-character aware,
 it cannot downcase some special characters in an identifier such as
 my_SchemÄ.

IIRC this is intentional.  Please consult the archives for previous
discussions.

regards, tom lane


[HACKERS] multibyte-character aware support for function downcase_truncate_identifier()

2010-07-06 Thread Rajanikant Chirmade

Hi All,

Every identifier is downcased and truncated by the function
downcase_truncate_identifier() before it is used.

But since downcase_truncate_identifier() is not multibyte-character aware,
it cannot downcase some special characters in an identifier such as
my_SchemÄ.

If a schema is created with the name my_SchemÄ, pg_namespace shows the
entry as my_schemÄ.

An example is below:

postgres=# create schema my_SchemÄ;
CREATE SCHEMA
postgres=# select nspname from pg_namespace;
      nspname
--------------------
 pg_toast
 pg_temp_1
 pg_toast_temp_1
 pg_catalog
 public
 information_schema
 my_schemÄ
(7 rows)

postgres=#

Actually, it should be downcased to my_schemä, in the same
multibyte-character-aware way that lower() works:

postgres=# select lower('my_SchemÄ');
   lower
-----------
 my_schemä
(1 row)

The function str_tolower() is multibyte-character aware. Using that same
function wherever downcasing is required would make down-casing uniform
everywhere.

I have identified two places where wide-character-aware downcasing needs
to be added:

1. downcase_truncate_identifier();
   -  I am attaching a patch for this change and a small test case.

The following functions should also be kept in sync with
downcase_truncate_identifier():

2. pg_strcasecmp();
3. pg_strncasecmp();

- To fix these functions (2, 3), str_tolower() would need to move from
formatting.c in the backend to some common location (perhaps src/port)
where it can be used by the client as well as the server.
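
The point about keeping comparison in sync with downcasing can be illustrated with a small sketch. It is not the real pg_strcasecmp(): fold_byte() and ascii_strcasecmp() are hypothetical names, and the folding rule here covers only 7-bit ASCII. What it shows is that a case-insensitive comparison agrees with identifier downcasing only if both use the same folding rule.

```c
#include <assert.h>
#include <stddef.h>

/* Fold one byte using an ASCII-only rule; other bytes are unchanged. */
static unsigned char
fold_byte(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        return (unsigned char) (ch + ('a' - 'A'));
    return ch;
}

/*
 * Hypothetical case-insensitive comparison using fold_byte() on both
 * sides.  Returns <0, 0 or >0 like strcmp().  Bytes with the high bit
 * set are compared as-is, so two spellings that differ only in the case
 * of a non-ASCII letter compare as different.
 */
int
ascii_strcasecmp(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);

    for (;;)
    {
        unsigned char c1 = fold_byte((unsigned char) *s1++);
        unsigned char c2 = fold_byte((unsigned char) *s2++);

        if (c1 != c2)
            return (int) c1 - (int) c2;
        if (c1 == '\0')
            return 0;
    }
}
```

If the identifier downcasing were changed to fold Ä to ä while a comparison like this one kept the ASCII-only rule, the two would disagree about whether my_SchemÄ and my_schemä are the same name.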

Thanks & Regards,
Rajanikant Chirmade.
diff --git a/orig/postgresql-9.0beta2/src/backend/parser/scansup.c b/postgresql-9.0beta2/src/backend/parser/scansup.c
index 94082f7..179b37e 100644
--- a/orig/postgresql-9.0beta2/src/backend/parser/scansup.c
+++ b/postgresql-9.0beta2/src/backend/parser/scansup.c
@@ -129,33 +129,11 @@ char *
 downcase_truncate_identifier(const char *ident, int len, bool warn)
 {
 	char	   *result;
-	int			i;
-
-	result = palloc(len + 1);
-
-	/*
-	 * SQL99 specifies Unicode-aware case normalization, which we don't yet
-	 * have the infrastructure for.  Instead we use tolower() to provide a
-	 * locale-aware translation.  However, there are some locales where this
-	 * is not right either (eg, Turkish may do strange things with 'i' and
-	 * 'I').  Our current compromise is to use tolower() for characters with
-	 * the high bit set, and use an ASCII-only downcasing for 7-bit
-	 * characters.
-	 */
-	for (i = 0; i < len; i++)
-	{
-		unsigned char ch = (unsigned char) ident[i];
 
-		if (ch >= 'A' && ch <= 'Z')
-			ch += 'a' - 'A';
-		else if (IS_HIGHBIT_SET(ch) && isupper(ch))
-			ch = tolower(ch);
-		result[i] = (char) ch;
-	}
-	result[i] = '\0';
+	result = str_tolower(ident, len);
 
-	if (i >= NAMEDATALEN)
-		truncate_identifier(result, i, warn);
+	if (len >= NAMEDATALEN)
+		truncate_identifier(result, len, warn);
+		truncate_identifier(result, len, warn);
 
 	return result;
 }
--- This tests whether an identifier with special characters is downcased in a wide-character-aware way.

create schema my_SchemÄ;

--- Since we smash identifiers to lower we try to find schema name
--- by downcasing nspname.
select count(nspname) from pg_namespace where nspname=LOWER('my_SchemÄ');

drop schema my_SchemÄ;



wide-charecter_aware_downcase.out
Description: Binary data
