subject:"Re\: \[HACKERS\] PATCH\: CITEXT 2.0"

Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-22 Thread David E. Wheeler


On Jul 18, 2008, at 01:39, Michael Paesold wrote:

Calling regex functions with the case-insensitivity option would be  
great. It should also be possible to rewrite replace() into  
regexp_replace() by first escaping the regex meta characters.


Actually re-implementing those functions in a case insensitive way  
would still be an option, but of course some amount of work. The  
question is, how much use case there is.


I've figured out how to make all the functions work using SQL function  
workarounds, converting things and re-dispatching to the text versions  
as appropriate. They work quite well, and can be converted to C later  
if that becomes a requirement.


Meanwhile, on the question of whether or not regular expression and  
LIKE comparisons *should* match case-insensitively, I have a couple  
more observations:


* Thinking about how a true case-insensitive collation would work, I'm  
quite certain that it would match case-insensitively. Anything else  
would just be unexpected, because in a case-insensitive collation,  
lowercase characters are, in practice, identical to uppercase  
characters. As far as matching is concerned, there is no difference  
between them. So the matching operators and functions against CITEXT  
should follow that assumption.


* I tried a few matches on MySQL, where the collation is case- 
insensitive by default, and it confirms my impression:


mysql> select 'Foo' regexp 'o$';
+---+
| 'Foo' regexp 'o$' |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

mysql> select 'Foo' regexp 'O$';
+---+
| 'Foo' regexp 'O$' |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

mysql> select 'Foo' like '%o';
+-+
| 'Foo' like '%o' |
+-+
|   1 |
+-+
1 row in set (0.00 sec)

mysql> select 'Foo' like '%O';
+-+
| 'Foo' like '%O' |
+-+
|   1 |
+-+
1 row in set (0.00 sec)

I'll grant that MySQL may not be the best model for how things should  
work, but it's something, at least. Anyone else got access to another  
database with case-insensitive collations to see how LIKE and regular  
expressions work?


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-21 Thread David E. Wheeler


On Jul 18, 2008, at 09:53, David E. Wheeler wrote:

However, if someone with a lot more C and Pg core knowledge wanted  
to sit down with me for a couple hours next week and help me bang  
out these functions, that would be great. I'd love to have the  
implementation be that much more complete.


I've implemented fixes for the regexp_* functions and strpos() in pure  
SQL, like so:


CREATE OR REPLACE FUNCTION regexp_matches( citext, citext ) RETURNS  
TEXT[] AS '

SELECT regexp_matches( $1::text, $2::text, ''i'' );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_matches( citext, citext, text )  
RETURNS TEXT[] AS '
SELECT regexp_matches( $1::text, $2::text, CASE WHEN strpos($3,  
''c'') = 0 THEN  $3 || ''i'' ELSE $3 END );

' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text )  
returns TEXT AS '

SELECT regexp_replace( $1::text, $2::text, $3, ''i'');
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text,  
text ) returns TEXT AS '
SELECT regexp_replace( $1::text, $2::text, $3, CASE WHEN  
strpos($4, ''c'') = 0 THEN  $4 || ''i'' ELSE $4 END);

' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext )  
RETURNS TEXT[] AS '

SELECT regexp_split_to_array( $1::text, $2::text, ''i'' );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext,  
text ) RETURNS TEXT[] AS '
SELECT regexp_split_to_array( $1::text, $2::text, CASE WHEN  
strpos($3, ''c'') = 0 THEN  $3 || ''i'' ELSE $3 END );

' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext )  
RETURNS SETOF TEXT AS '

SELECT regexp_split_to_table( $1::text, $2::text, ''i'' );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext,  
text ) RETURNS SETOF TEXT AS '
SELECT regexp_split_to_table( $1::text, $2::text, CASE WHEN  
strpos($3, ''c'') = 0 THEN  $3 || ''i'' ELSE $3 END );

' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION strpos( citext, citext ) RETURNS INT AS '
SELECT strpos( LOWER( $1::text ), LOWER( $2::text ) );
' LANGUAGE SQL IMMUTABLE STRICT;

Not so bad, though it'd be nice to have C functions that just did  
these things. Still not case-insensitive are:


-- replace()
-- split_part()
-- translate()

So, anyone at OSCON this week want to help me with these? Or to  
convert the above functions to C? Greg? Bruce?


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-18 Thread David E. Wheeler


On Jul 18, 2008, at 01:39, Michael Paesold wrote:

Calling regex functions with the case-insensitivity option would be  
great. It should also be possible to rewrite replace() into  
regexp_replace() by first escaping the regex meta characters.


Actually re-implementing those functions in a case insensitive way  
would still be an option, but of course some amount of work. The  
question is, how much use case there is.


Not much for me. I might use the regex functions, but would be happy  
to manually pass the "i" flag.


However, if someone with a lot more C and Pg core knowledge wanted to  
sit down with me for a couple hours next week and help me bang out  
these functions, that would be great. I'd love to have the  
implementation be that much more complete.


I do believe that, as it stands now in the v4 patch, citext is pretty  
close to ready, and certainly commit-able.


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-18 Thread Michael Paesold


David E. Wheeler writes:


On Jul 17, 2008, at 03:45, Michael Paesold wrote:

Wouldn't it be possible to create a variant of regexp_replace, i.e.  
regexp_replace(citext,citext,text), which would again lower-case  
the first two arguments before passing the input to  
regexp_replace(text,text,text)?


Sure, but then you end up with this:

template1=# select regexp_replace( 'Fxx'::citext, 'X'::citext, 'o');
regexp_replace

foo
(1 row)


Yeah, you are right, I see. :-)


Which is just wrong. I'm going to look at the regex C functions  
today and see if there's an easy way to just always pass them the  
'i' flag, which would do the trick. That still won't help replace(),  
split_part(), or translate(), however.


Calling regex functions with the case-insensitivity option would be  
great. It should also be possible to rewrite replace() into  
regexp_replace() by first escaping the regex meta characters.


Actually re-implementing those functions in a case insensitive way  
would still be an option, but of course some amount of work. The  
question is, how much use case there is.


Best Regards
Michael Paesold

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-17 Thread David E. Wheeler


On Jul 17, 2008, at 03:45, Michael Paesold wrote:

Wouldn't it be possible to create a variant of regexp_replace, i.e.  
regexp_replace(citext,citext,text), which would again lower-case the  
first two arguments before passing the input to  
regexp_replace(text,text,text)?


Sure, but then you end up with this:

template1=# select regexp_replace( 'Fxx'::citext, 'X'::citext, 'o');
regexp_replace

foo
(1 row)

Which is just wrong. I'm going to look at the regex C functions today  
and see if there's an easy way to just always pass them the 'i' flag,  
which would do the trick. That still won't help replace(),  
split_part(), or translate(), however.


Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-17 Thread Michael Paesold



Am 16.07.2008 um 20:38 schrieb David E. Wheeler:


The trouble is that, right now:

template1=# select regexp_replace( 'fxx'::citext, 'X'::citext, 'o');
regexp_replace

fxx
(1 row)

So there's an inconsistency there. I don't know how to make that  
work case-insensitively.


Wouldn't it be possible to create a variant of regexp_replace, i.e.  
regexp_replace(citext,citext,text), which would again lower-case the  
first two arguments before passing the input to  
regexp_replace(text,text,text)?


Best Regards
Michael Paesold

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-16 Thread David E. Wheeler


On Jul 16, 2008, at 11:20, Robert Treat wrote:


I was thinking about this a bit last night and wanted to fill things
out a bit.

As a programmer, I find Donald Fraser's hindsight to be more
appealing, because at least that way I have the option to do matching
against CITEXT strings case-sensitively when I want to.

OTOH, if what we want is to have CITEXT work more like a case-
insensitive collation, then the expectation from the matching
operators and functions might be different. Does anyone have any idea
whether regex and LIKE matching against a string in a case- 
insensitive
collation would match case-insensitively or not? If so, then maybe  
the

regex and LIKE ops and funcs *should* match case-insensitively? If
not, or if only for some collations, then I would think not.

Either way, I know of no way, currently, to allow functions like
replace(), split_part(), strpos(), and translate() to match case-
insensitiely, even if we wanted to. Anyone have any ideas?


* If the answer is "no", how can I make those functions behave case-
insensitively? (See the "TODO" tests.)

* Should there be any other casts? To and from name, perhaps?




AIUI, your propsing the following:

select 'x'::citext = 'X'::citext;
?column?
--
t
(1 row)

select 'x'::citext ~ 'X'::citext;
?column?
--
f
(1 row)

I understand the desire for flexibility, but the above seems wierd  
to me.


That's what Donald Fraser suggested, and I see some value in that, but  
wanted to get some other opinions. And you're right, that does seem a  
bit weird.


The trouble is that, right now:

template1=# select regexp_replace( 'fxx'::citext, 'X'::citext, 'o');
 regexp_replace

 fxx
(1 row)

So there's an inconsistency there. I don't know how to make that work  
case-insensitively.


Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-16 Thread Robert Treat

On Wednesday 16 July 2008 13:54:25 David E. Wheeler wrote:
> On Jul 15, 2008, at 22:23, David E. Wheeler wrote:
> > * The README for citext 1.0 on pgFoundry says:
> >> I had to make a decision on casting between types for regular
> >> expressions and
> >> decided that if any parameter is of citext type then case
> >> insensitive applies.
> >> For example applying regular expressions with a varchar and a
> >> citext will
> >> produce a case-insensitive result.
> >>
> >> Having thought about this afterwards I realised that since we have
> >> the option
> >> to use case-insensitive results with regular expressions I should
> >> have left the
> >> behaviour exactly as text and then you have the best of both
> >> worlds... oh well
> >> not hard to change for any of you perfectionists!
> >
> > I followed the original and made all the regex and LIKE comparisons
> > case-insensitive. But maybe I should not have? Especially since the
> > regular expression functions (e.g., regexp_replace()) and a few non-
> > regex functions (e.g., replace()) still don't behave case-
> > insensitively?
>
> I was thinking about this a bit last night and wanted to fill things
> out a bit.
>
> As a programmer, I find Donald Fraser's hindsight to be more
> appealing, because at least that way I have the option to do matching
> against CITEXT strings case-sensitively when I want to.
>
> OTOH, if what we want is to have CITEXT work more like a case-
> insensitive collation, then the expectation from the matching
> operators and functions might be different. Does anyone have any idea
> whether regex and LIKE matching against a string in a case-insensitive
> collation would match case-insensitively or not? If so, then maybe the
> regex and LIKE ops and funcs *should* match case-insensitively? If
> not, or if only for some collations, then I would think not.
>
> Either way, I know of no way, currently, to allow functions like
> replace(), split_part(), strpos(), and translate() to match case-
> insensitiely, even if we wanted to. Anyone have any ideas?
>
> > * If the answer is "no", how can I make those functions behave case-
> > insensitively? (See the "TODO" tests.)
> >
> > * Should there be any other casts? To and from name, perhaps?
>

AIUI, your propsing the following:

select 'x'::citext = 'X'::citext;
 ?column?
--
 t
(1 row)

select 'x'::citext ~ 'X'::citext;
 ?column?
--
 f
(1 row)

I understand the desire for flexibility, but the above seems wierd to me. 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v4

2008-07-16 Thread David E. Wheeler


On Jul 15, 2008, at 22:23, David E. Wheeler wrote:


* The README for citext 1.0 on pgFoundry says:

I had to make a decision on casting between types for regular  
expressions and
decided that if any parameter is of citext type then case  
insensitive applies.
For example applying regular expressions with a varchar and a  
citext will

produce a case-insensitive result.

Having thought about this afterwards I realised that since we have  
the option
to use case-insensitive results with regular expressions I should  
have left the
behaviour exactly as text and then you have the best of both  
worlds... oh well

not hard to change for any of you perfectionists!


I followed the original and made all the regex and LIKE comparisons  
case-insensitive. But maybe I should not have? Especially since the  
regular expression functions (e.g., regexp_replace()) and a few non- 
regex functions (e.g., replace()) still don't behave case- 
insensitively?


I was thinking about this a bit last night and wanted to fill things  
out a bit.


As a programmer, I find Donald Fraser's hindsight to be more  
appealing, because at least that way I have the option to do matching  
against CITEXT strings case-sensitively when I want to.


OTOH, if what we want is to have CITEXT work more like a case- 
insensitive collation, then the expectation from the matching  
operators and functions might be different. Does anyone have any idea  
whether regex and LIKE matching against a string in a case-insensitive  
collation would match case-insensitively or not? If so, then maybe the  
regex and LIKE ops and funcs *should* match case-insensitively? If  
not, or if only for some collations, then I would think not.


Either way, I know of no way, currently, to allow functions like  
replace(), split_part(), strpos(), and translate() to match case- 
insensitiely, even if we wanted to. Anyone have any ideas?


* If the answer is "no", how can I make those functions behave case- 
insensitively? (See the "TODO" tests.)


* Should there be any other casts? To and from name, perhaps?


Thanks,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-15 Thread David E. Wheeler


On Jul 15, 2008, at 20:26, Tom Lane wrote:


"David E. Wheeler" <[EMAIL PROTECTED]> writes:

So I guess my question is: what is wrong with the  properties for
citextsend/citextrecv


[ checks catalogs... ] textsend and textrecv are marked STABLE not
IMMUTABLE.  I am not totally sure about the reasoning offhand --- it
might be because their behavior depends on client_encoding.


Thanks. Looks like maybe the xtypes docs need to be updated?

  http://www.postgresql.org/docs/8.3/static/xtypes.html

Anyway, changing them to "STABLE STRICT" appears to have done the  
trick (diff attached).



and what else might these failures be indicating
is wrong?


I think the other diffs are okay, they just reflect the fact that  
you're

depending on binary equivalence of text and citext.


Great, thanks. And with that, I think I'm just about ready to submit a  
new version of the patch, coming up shortly.


Best,

David


regression.diffs
Description: Binary data



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-15 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> So I guess my question is: what is wrong with the  properties for  
> citextsend/citextrecv

[ checks catalogs... ] textsend and textrecv are marked STABLE not
IMMUTABLE.  I am not totally sure about the reasoning offhand --- it
might be because their behavior depends on client_encoding.

> and what else might these failures be indicating  
> is wrong?

I think the other diffs are okay, they just reflect the fact that you're
depending on binary equivalence of text and citext.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-15 Thread David E. Wheeler


On Jul 15, 2008, at 12:56, Tom Lane wrote:


Don't run the tests in a read-only directory, perhaps.


Yes, I changed the owner to the postgres system user and that did the  
trick.





Or do they matter for sanity-checking citext?


Hard to tell --- I'd suggest trying to get a clean run.  As for what  
you

have, the first diff hunk suggests you've got the wrong function
properties for citextsend/citextrecv.


Here's the new diff:

*** ./expected/opr_sanity.out   Mon Jul 14 21:55:49 2008
--- ./results/opr_sanity.outTue Jul 15 17:41:03 2008
***
*** 87,94 
   p1.provolatile != p2.provolatile OR
   p1.pronargs != p2.pronargs);
   oid | proname | oid | proname
! -+-+-+-
! (0 rows)

  -- Look for uses of different type OIDs in the argument/result type  
fields

  -- for different aliases of the same built-in function.
--- 87,96 
   p1.provolatile != p2.provolatile OR
   p1.pronargs != p2.pronargs);
   oid  | proname  |  oid  |  proname
! --+--+---+
!  2414 | textrecv | 87258 | citextrecv
!  2415 | textsend | 87259 | citextsend
! (2 rows)

  -- Look for uses of different type OIDs in the argument/result type  
fields

  -- for different aliases of the same built-in function.
***
*** 110,117 
   prorettype | prorettype
  +
   25 |   1043
 1114 |   1184
! (2 rows)

  SELECT DISTINCT p1.proargtypes[0], p2.proargtypes[0]
  FROM pg_proc AS p1, pg_proc AS p2
--- 112,120 
   prorettype | prorettype
  +
   25 |   1043
+  25 |  87255
 1114 |   1184
! (3 rows)

  SELECT DISTINCT p1.proargtypes[0], p2.proargtypes[0]
  FROM pg_proc AS p1, pg_proc AS p2
***
*** 124,133 
  -+-
25 |1042
25 |1043
  1114 |1184
  1560 |1562
  2277 |2283
! (5 rows)

  SELECT DISTINCT p1.proargtypes[1], p2.proargtypes[1]
  FROM pg_proc AS p1, pg_proc AS p2
--- 127,138 
  -+-
25 |1042
25 |1043
+   25 |   87255
+ 1042 |   87255
  1114 |1184
  1560 |1562
  2277 |2283
! (7 rows)

  SELECT DISTINCT p1.proargtypes[1], p2.proargtypes[1]
  FROM pg_proc AS p1, pg_proc AS p2
***
*** 139,148 
   proargtypes | proargtypes
  -+-
23 |  28
  1114 |1184
  1560 |1562
  2277 |2283
! (4 rows)

  SELECT DISTINCT p1.proargtypes[2], p2.proargtypes[2]
  FROM pg_proc AS p1, pg_proc AS p2
--- 144,154 
   proargtypes | proargtypes
  -+-
23 |  28
+   25 |   87255
  1114 |1184
  1560 |1562
  2277 |2283
! (5 rows)

  SELECT DISTINCT p1.proargtypes[2], p2.proargtypes[2]
  FROM pg_proc AS p1, pg_proc AS p2
***
*** 305,311 
  142 | 25 |0 | a
  142 |   1043 |0 | a
  142 |   1042 |0 | a
! (6 rows)

  --  pg_operator 
  -- Look for illegal values in pg_operator fields.
--- 311,318 
  142 | 25 |0 | a
  142 |   1043 |0 | a
  142 |   1042 |0 | a
!   87255 |   1042 |0 | a
! (7 rows)

  --  pg_operator 
  -- Look for illegal values in pg_operator fields.

==

So I guess my question is: what is wrong with the  properties for  
citextsend/citextrecv and what else might these failures be indicating  
is wrong?


CREATE OR REPLACE FUNCTION citextrecv(internal)
RETURNS citext
AS 'textrecv'
LANGUAGE 'internal' IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION citextsend(citext)
RETURNS bytea
AS 'textsend'
LANGUAGE 'internal' IMMUTABLE STRICT;

CREATE TYPE citext (
INPUT  = citextin,
OUTPUT = citextout,
RECEIVE= citextrecv,
SEND   = citextsend,
INTERNALLENGTH = VARIABLE,
STORAGE= extended
);

Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-15 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> Well now that was cool to see. I got some failures, of course, but  
> nothing stands out to me as an obvious bug. I attach the diffs file  
> (with the citext.sql failure removed) for your perusal. What would be  
> the best way for me to resolve those permission issues?

Don't run the tests in a read-only directory, perhaps.

> Or do they matter for sanity-checking citext?

Hard to tell --- I'd suggest trying to get a clean run.  As for what you
have, the first diff hunk suggests you've got the wrong function
properties for citextsend/citextrecv.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-15 Thread David E. Wheeler


On Jul 15, 2008, at 07:09, Tom Lane wrote:


Yeah, probably.  I don't think the "make check" path will support it
because it doesn't install contrib into the temp installation.
(You'd also need to have put the extra entry in parallel_schedule
not serial_schedule, but it's not gonna work anyway.)


Well now that was cool to see. I got some failures, of course, but  
nothing stands out to me as an obvious bug. I attach the diffs file  
(with the citext.sql failure removed) for your perusal. What would be  
the best way for me to resolve those permission issues? Or do they  
matter for sanity-checking citext?


Thanks,

David


regression.diffs
Description: Binary data




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-15 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> Okay, I copied citext.sql into src/test/regress/sql and then added  
> "test: citext" to the top of src/test/regress/serial_schedule. Then I  
> ran `make check`. All tests passed, but I don't think that citext was  
> tested.
> Do I need to install the server, build a cluster, and run `make  
> installcheck`?

Yeah, probably.  I don't think the "make check" path will support it
because it doesn't install contrib into the temp installation.
(You'd also need to have put the extra entry in parallel_schedule
not serial_schedule, but it's not gonna work anyway.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread David E. Wheeler


On Jul 14, 2008, at 07:08, Tom Lane wrote:


You're really making it into another test.  Just copy the citext.sql
file into the sql/ subdirectory and add a "citext" entry to the  
schedule

file.

The last time I did this, I had to at least "touch" a corresponding
expected file (expected/citext.out) or pg_regress would go belly-up
without running the rest of the tests.  There's no special need to
make the expected file match, of course, since you can just ignore
the "failure" report from that one test.


Okay, I copied citext.sql into src/test/regress/sql and then added  
"test: citext" to the top of src/test/regress/serial_schedule. Then I  
ran `make check`. All tests passed, but I don't think that citext was  
tested.


Do I need to install the server, build a cluster, and run `make  
installcheck`?


Thanks for the hand-holding.

Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread David E. Wheeler


On Jul 12, 2008, at 15:13, David E. Wheeler wrote:


2. It's ridiculously slow; at least a factor of ten slower than doing
equivalent tests directly in SQL.  This is a very bad thing.  Speed  
of
regression tests matters a lot to those of us who run them a dozen  
times

per day --- and I do not wish to discourage any developers who don't
work that way from learning better habits ;-)


Hrm. I'm wonder why it's so slow? The test functions don't really do  
a lot. Anyway, I agree that they should perform well.


Just as an FYI, I've just moved all the tests to regular SQL instead  
of using pgTAP. The difference in runtime is:


psql -Xd try -f sql/citext.sql  0.03s user 0.02s system 19% cpu 0.253  
total
psql -Xd try -f sql/citext.sql  0.03s user 0.02s system  4% cpu 1.298  
total


So it's close to a factor of five, though subtract .125 for the time  
to load the pgTAP functions. The pgTAP tests *are* doing a lot more  
work, but I'm sure that they could be made a lot more efficient,  
though of course the TAP functions will always introduce some  
overhead. One just needs to decide whether the tradeoffs are worth it.


Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread David E. Wheeler


On Jul 14, 2008, at 11:49, Tom Lane wrote:


You don't seem to understand what I'm suggesting: I'm not talking
about testing strcoll. I'm talking about making sure that citext
*uses* strcoll.


This seems pointless, as well as essentially impossible to enforce by
black-box testing.  Nobody is likely to consider a patch that removes
such obviously basic functionality of the module.


Never underestimate the human capacity for incompetence. One of my  
mottos.



I think you're
worrying about testing the wrong things --- and as evidence I'd note  
the
serious errors that slipped by your testing (including the fact that  
the

last submitted version would still claim that 'a' = 'ab').


Um, say what? I had that problem at one time, but it was when I was  
writing my own lower() function, which was shitty (I wasn't creating a  
nul-terminated string). I don't recall that being in any patch I sent  
to the list, and indeed you wrote that very test in the revised test  
script you sent in.


At any rate, I make no claims that my tests properly covered  
everything. There is a lot I didn't know to test, but I'm learning  
more all the time. That's why this code review has been so immensely  
valuable, both for me and for CITEXT.



What we
generally ask a regression test to do is catch portability problems
(which is certainly a real-enough issue here, but we don't know how
to address it well) or catch foreseeable breakage (such as someone
introducing a cast that breaks resolution of citext function calls).
The methodology can't catch everything, and trying to push it beyond
what it can do is just a time sink for effort that can more usefully
be spent other ways, such as code-reading.


I guess what I'm thinking of is not regression tests but unit tests.  
And with unit testing, one of the goals is coverage. Good coverage has  
saved my ass many times in the past, even catching changes that, in  
hindsight, should obviously never have been attempted. Perhaps it'd be  
worth thinking about creating a project for unit testing PostgreSQL in  
controlled environments? Maybe having tests that can contain proper  
conditionals?


Anyway, it seems that, given the limitations of the current testing  
system with regard to testing multibyte characters, the issue is moot.  
I'll throw in a few tests that test multibyte characters, but I'll  
comment them out and just uncomment them for individual test runs on  
my box for the purposes of my own sanity going forward.


Thanks,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> On Jul 14, 2008, at 11:06, Tom Lane wrote:
>> Let me put it another way: if we test on another platform and find out
>> that strcoll's behavior is different there, are you going to fix that
>> version of strcoll?  No, you're not.  So you might as well just test  
>> the
>> behavior of the code that's actually under your control.

> You don't seem to understand what I'm suggesting: I'm not talking  
> about testing strcoll. I'm talking about making sure that citext  
> *uses* strcoll.

This seems pointless, as well as essentially impossible to enforce by
black-box testing.  Nobody is likely to consider a patch that removes
such obviously basic functionality of the module.  I think you're
worrying about testing the wrong things --- and as evidence I'd note the
serious errors that slipped by your testing (including the fact that the
last submitted version would still claim that 'a' = 'ab').  What we
generally ask a regression test to do is catch portability problems
(which is certainly a real-enough issue here, but we don't know how
to address it well) or catch foreseeable breakage (such as someone
introducing a cast that breaks resolution of citext function calls).
The methodology can't catch everything, and trying to push it beyond
what it can do is just a time sink for effort that can more usefully
be spent other ways, such as code-reading.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread David E. Wheeler


On Jul 14, 2008, at 11:06, Tom Lane wrote:

[ shrug... ]  Seems pretty useless to me: we already know that it  
works
for you.  The point of a regression test in my mind is to make sure  
that

it works for everybody.  Given the platform variations involved in
strcoll's behavior, the definition of "works for everybody" is going  
to

be pretty darn narrowly circumscribed anyway, and thus I don't have a
big problem with restricting the tests to ASCII cases.


Neither do I, as long as there is *some* context to ensure that the  
type remains locale-aware. We only know that it works for me because  
I've written the tests.



Let me put it another way: if we test on another platform and find out
that strcoll's behavior is different there, are you going to fix that
version of strcoll?  No, you're not.  So you might as well just test  
the

behavior of the code that's actually under your control.


You don't seem to understand what I'm suggesting: I'm not talking  
about testing strcoll. I'm talking about making sure that citext  
*uses* strcoll. Whether or not strcoll actually works properly is not  
my concern.


Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> On Jul 14, 2008, at 07:24, Tom Lane wrote:
>> The fallacy in that proposal is the assumption that there are only two
>> behaviors out there.

> Well, no, that's not the assumption at all. The assumption is that the  
> type works properly with multibyte characters under multibyte-aware  
> locales. So I want to have tests to ensure that such is true by having  
> multibyte characters run under a very specific locale and platform.

[ shrug... ]  Seems pretty useless to me: we already know that it works
for you.  The point of a regression test in my mind is to make sure that
it works for everybody.  Given the platform variations involved in
strcoll's behavior, the definition of "works for everybody" is going to
be pretty darn narrowly circumscribed anyway, and thus I don't have a
big problem with restricting the tests to ASCII cases.

Let me put it another way: if we test on another platform and find out
that strcoll's behavior is different there, are you going to fix that
version of strcoll?  No, you're not.  So you might as well just test the
behavior of the code that's actually under your control.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread David E. Wheeler


On Jul 14, 2008, at 10:58, Tom Lane wrote:


Well, unless the display of the characters would be broken (the build
farm databases use UTF-8, no?),


No, they use C.


Um, you mean SQL_ASCII?

Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> Well, unless the display of the characters would be broken (the build  
> farm databases use UTF-8, no?),

No, they use C.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread David E. Wheeler


On Jul 14, 2008, at 10:55, Andrew Dunstan wrote:


No.

--no-locale is the same as --locale=C which means the encoding is  
SQL_ASCII.


I've used --no-locale --encoding UTF-8 many times. But if the  
regressions run with SQL_ASCII…well, I'm just amazed that there  
haven't been more unexpected side-effects to source code changes  
without controlling for such variables in tests. Seems like a lot to  
keep in one's head.


Anyway, are tests for contrib modules run on the build farms? If so, I  
could still potentially get around this issue by turning off ECHO.


Best,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread Andrew Dunstan




David E. Wheeler wrote:

On Jul 14, 2008, at 06:05, Andrew Dunstan wrote:

You can certainly add any tests you like. But the buildfarm does all 
its regression tests using an install done with --no-locale. Any test 
that requires some locale to work, or make sense, should probably not 
be in your Makefile's REGRESS target.


Well, unless the display of the characters would be broken (the build 
farm databases use UTF-8, no?),


No.

--no-locale is the same as --locale=C which means the encoding is SQL_ASCII.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread David E. Wheeler


On Jul 14, 2008, at 07:26, Tom Lane wrote:


I'd like to keep these tests, since they ensure not just that the
functions work but that they work with citext.


It might be reasonable to test a couple of them for that purpose.
If your agenda is "test every function in the system that comes or
might come in a bpchar variant", I think that's pointless.


Or a varchar variant, or where such a variant might be added in the  
future. To my mind, it's important to have good coverage in my unit  
tests to ensure that things continue to work exactly the same over time.


So, since the tests are already written, and are unlikely to add more  
than a few milliseconds to test runtime, can you at least agree that  
such tests are harmless?


Updated patch later today.

Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread David E. Wheeler


On Jul 14, 2008, at 07:24, Tom Lane wrote:


"David E. Wheeler" <[EMAIL PROTECTED]> writes:
Could I supply two comparison files, one for Mac OS X with  
en_US.UTF-8
and one for everything else, as described in the last three  
paragraphs

here?


The fallacy in that proposal is the assumption that there are only two
behaviors out there.


Well, no, that's not the assumption at all. The assumption is that the  
type works properly with multibyte characters under multibyte-aware  
locales. So I want to have tests to ensure that such is true by having  
multibyte characters run under a very specific locale and platform. I  
don't really care what platform or locale; the point is to make sure  
that the type is actually multibyte-aware.



Let me recalibrate your thoughts a bit: so far
I have tried citext on three different machines (Mac, Fedora 8, HPUX),
and I got three different answers from those tests.  That's despite
endeavoring to make the database locales match ... which is less than
trivial in itself because they use three slightly different  
spellings of

"en_US.UTF8".



This is a truly pitiful state of affairs. Rhetorical question: Why is  
there no standardization of locales? I'm sure there are a lot of  
opinions out there (should all uppercase chars should precede all  
lowercase chars or be mixed in with lowercase chars), but I should  
think that, in this day and age, there would be some sort of standard  
defining locales and how they work -- and to allow such opinions to be  
expressed by different locales, not in the same locale names on  
different platforms.




Given that you were more or less deliberately testing corner cases,
I think it's quite likely that the number of observable reactions from
N platforms would be more nearly O(N) than O(1).


To me they're not corner cases. To me it is just, "given a specific  
platform/locale, does CITEXT respect the locale's rules?" I don't care  
to test all platforms and locales (I'm not *that* stupid :-)).


In the real world, to the extent that we are able to control the  
locale

of the regression tests, we make it "C" --- and to a large extent we
can't control it at all, which means you have another uncontrolled
variable besides platform.  So in the current universe there is
absolutely no value in submitting locale-specific tests for a contrib
module.


Then how do we know that it will continue to be locale-aware over  
time? Someone could replace the comparison function with one that just  
lowercases ASCII characters, like CITEXT 1 does, and no tests would  
fail. How do you prevent that from happening without being hyper- 
vigilant (and never leaving the project, I might add)?



I see some discussion in the thread about improving the situation, but
until we are able to decouple database locale from environment locale,
I doubt we'll be able to do a whole lot about automating this kind
of test.  There are too many variables at the moment.


Is the decoupling of database locale from environment locale likely to  
happen anytime soon? Now that I've written CITEXT, I dare say that  
such might become my top-desired feature (aside from replication).


Thanks for the discussion, much appreciated, and I'm learning a ton. I  
retain the right to be opinionated, however. ;-)


Best,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread David E. Wheeler


On Jul 14, 2008, at 06:05, Andrew Dunstan wrote:

You can certainly add any tests you like. But the buildfarm does all  
its regression tests using an install done with --no-locale. Any  
test that requires some locale to work, or make sense, should  
probably not be in your Makefile's REGRESS target.


Well, unless the display of the characters would be broken (the build  
farm databases use UTF-8, no?), the output would look like this, which  
should more or less work (I'm hoping):


SELECT citext_larger( 'Â'::citext, 'ç'::citext ) = 'ç' AS t WHERE  
false and test_multibyte();

 t
---
(0 rows)


Some time ago I raised the question of doing locale- and encoding- 
aware regression tests, and IIRC the consensus seemed to be that it  
was not worth the effort, but maybe we need to revisit that. We  
certainly seem to be doing a lot more work now to get Postgres to  
where it needs to be w.r.t. locale support, and we've cleaned up a  
lot of the encoding issues we had, so maybe we need to reflect that  
in our testing regime more.


Makes sense to me.

Best,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> On Jul 12, 2008, at 14:57, Tom Lane wrote:
>> 4. A lot of the later test cases are equally uselessly testing whether
>> piggybacking over text functions works.

> I'd like to keep these tests, since they ensure not just that the  
> functions work but that they work with citext.

It might be reasonable to test a couple of them for that purpose.
If your agenda is "test every function in the system that comes or
might come in a bpchar variant", I think that's pointless.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> Could I supply two comparison files, one for Mac OS X with en_US.UTF-8  
> and one for everything else, as described in the last three paragraphs  
> here?

The fallacy in that proposal is the assumption that there are only two
behaviors out there.  Let me recalibrate your thoughts a bit: so far
I have tried citext on three different machines (Mac, Fedora 8, HPUX),
and I got three different answers from those tests.  That's despite
endeavoring to make the database locales match ... which is less than
trivial in itself because they use three slightly different spellings of
"en_US.UTF8".

Given that you were more or less deliberately testing corner cases,
I think it's quite likely that the number of observable reactions from
N platforms would be more nearly O(N) than O(1).

In the real world, to the extent that we are able to control the locale
of the regression tests, we make it "C" --- and to a large extent we
can't control it at all, which means you have another uncontrolled
variable besides platform.  So in the current universe there is
absolutely no value in submitting locale-specific tests for a contrib
module.

I see some discussion in the thread about improving the situation, but
until we are able to decouple database locale from environment locale,
I doubt we'll be able to do a whole lot about automating this kind
of test.  There are too many variables at the moment.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> On Jul 13, 2008, at 10:19, Tom Lane wrote:
>> BTW, actually a better idea would be to put citext.sql at the front of
>> the list and just run the whole main regression series with it  
>> present.
>> typ_sanity and oidjoins might possibly find issues too.

> Um, stupid question (sorry, I'm not familiar with how the regression  
> tests are organized): serial_schedule doesn't look like a file into  
> which I can insert an SQL file. How would I do that?

You're really making it into another test.  Just copy the citext.sql
file into the sql/ subdirectory and add a "citext" entry to the schedule
file.

The last time I did this, I had to at least "touch" a corresponding
expected file (expected/citext.out) or pg_regress would go belly-up
without running the rest of the tests.  There's no special need to
make the expected file match, of course, since you can just ignore
the "failure" report from that one test.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread Andrew Dunstan




David E. Wheeler wrote:


(If we get to having per-database collations in 8.4 then integrating a
test with a non-default collation would get a lot easier, of course;
but for the moment I'm afraid you've got to work with what's there.)


Could I supply two comparison files, one for Mac OS X with en_US.UTF-8 
and one for everything else, as described in the last three paragraphs 
here?


  http://www.postgresql.org/docs/current/static/regress-variant.html

That way, I can at least make sure that the multibyte stuff *does* 
work. Even if it's tested on only one platform, the purpose is not to 
test a particular collation, but to test that CITEXT is actually 
sensitive to locale.





You can certainly add any tests you like. But the buildfarm does all its 
regression tests using an install done with --no-locale. Any test that 
requires some locale to work, or make sense, should probably not be in 
your Makefile's REGRESS target.


Some time ago I raised the question of doing locale- and encoding-aware 
regression tests, and IIRC the consensus seemed to be that it was not 
worth the effort, but maybe we need to revisit that. We certainly seem 
to be doing a lot more work now to get Postgres to where it needs to be 
w.r.t. locale support, and we've cleaned up a lot of the encoding issues 
we had, so maybe we need to reflect that in our testing regime more.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-14 Thread David E. Wheeler


On Jul 12, 2008, at 14:57, Tom Lane wrote:


4. A lot of the later test cases are equally uselessly testing whether
piggybacking over text functions works.  The odds of ever finding
anything with those tests are not distinguishable from zero (unless  
the
underlying text function is busted, which is not your responsibility  
to

test).  So I don't see any point in putting them into the standard
regression package.  (What maybe *would* be useful to test, but you
didn't, is whether the result of a function is considered citext  
rather

than text.)


I'd like to keep these tests, since they ensure not just that the  
functions work but that they work with citext. Given what we found  
with length() and friends not working when there was an implicit cast  
to bpchar, I think it's valuable to continue to ensure that these  
functions work as expected with citext. Otherwise someone in the  
future might come along and make the cast to bpchar implicit again,  
and no tests would fail to tell him/her otherwise.


These tests make good regressions.

Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-13 Thread David E. Wheeler


On Jul 13, 2008, at 10:31, Tom Lane wrote:


Grr. Kind of defeats the purpose. Is there no infrastructure for
testing multibyte functionality?


There's some stuff under src/test/locale and src/test/mb, though it
doesn't get exercised regularly.  The contrib tests are a particularly
bad place for trying to do any locale-dependent testing, because we
only support "make installcheck" which means there is no way to set
the database locale --- you *have to* work with whatever locale is
predetermined by the postmaster you're pointed at.

I don't recall the reason for not supporting "make check" in the  
contrib

modules; maybe it was just that preparing a test installation for each
contrib module sounded too slow?  In any case, what you'd need to  
pursue

is having some additional tests available that are not executed by the
standard contrib regression test sequence.

(If we get to having per-database collations in 8.4 then integrating a
test with a non-default collation would get a lot easier, of course;
but for the moment I'm afraid you've got to work with what's there.)


Could I supply two comparison files, one for Mac OS X with en_US.UTF-8  
and one for everything else, as described in the last three paragraphs  
here?


  http://www.postgresql.org/docs/current/static/regress-variant.html

That way, I can at least make sure that the multibyte stuff *does*  
work. Even if it's tested on only one platform, the purpose is not to  
test a particular collation, but to test that CITEXT is actually  
sensitive to locale.


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-13 Thread David E. Wheeler


On Jul 13, 2008, at 10:19, Tom Lane wrote:


You might try running the
opr_sanity regression test on this module to see if it finds any
other silliness.  (Procedure: insert the citext definition script
into the serial_schedule list just ahead of opr_sanity, run tests,
make sure you understand the reason for any diffs in the opr_sanity
result.  There will be at least one from the uses of text-related
functions for citext.)



Thanks. Added to my list.


BTW, actually a better idea would be to put citext.sql at the front of
the list and just run the whole main regression series with it  
present.

typ_sanity and oidjoins might possibly find issues too.


Um, stupid question (sorry, I'm not familiar with how the regression  
tests are organized): serial_schedule doesn't look like a file into  
which I can insert an SQL file. How would I do that?


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-13 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> To judge by the User-Defined Types docs, I was close. :-) I just  
> changed the argument to citextrecv to "internal".

Right.  The APIs for send and recv aren't inverses.  (Oh, the sins
we'll commit in the name of performance ...)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-13 Thread David E. Wheeler


On Jul 13, 2008, at 16:06, David E. Wheeler wrote:

Should those return bytea and citext, respectively? IOW, are these  
right?


CREATE OR REPLACE FUNCTION citextrecv(bytea)
RETURNS citext
AS 'textrecv'
LANGUAGE 'internal' IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION citextsend(citext)
RETURNS bytea
AS 'textsend'
LANGUAGE 'internal' IMMUTABLE STRICT;


To judge by the User-Defined Types docs, I was close. :-) I just  
changed the argument to citextrecv to "internal".


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-13 Thread David E. Wheeler


On Jul 13, 2008, at 10:19, Tom Lane wrote:

I'm confused. Is that not what the citextin and citextout functions  
are?


No, those are text I/O.  You need analogues of textsend and textrecv
too.


Should those return bytea and citext, respectively? IOW, are these  
right?


CREATE OR REPLACE FUNCTION citextrecv(bytea)
RETURNS citext
AS 'textrecv'
LANGUAGE 'internal' IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION citextsend(citext)
RETURNS bytea
AS 'textsend'
LANGUAGE 'internal' IMMUTABLE STRICT;

Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-13 Thread David E. Wheeler


On Jul 13, 2008, at 10:19, Tom Lane wrote:


"David E. Wheeler" <[EMAIL PROTECTED]> writes:

On Jul 12, 2008, at 12:17, Tom Lane wrote:
* You should provide binary I/O (send/receive) functions, if you  
want

this to be an industrial-strength module.  It's easy since you can
piggyback on text's.


I'm confused. Is that not what the citextin and citextout functions  
are?


No, those are text I/O.  You need analogues of textsend and textrecv
too.


Okay.


Thanks. Added to my list.


BTW, actually a better idea would be to put citext.sql at the front of
the list and just run the whole main regression series with it  
present.

typ_sanity and oidjoins might possibly find issues too.


Also added to my list. :-)


Some (not all) of your CREATE OPERATOR commands have things like

   NEGATOR   = OPERATOR(!~),

which seems unnecessary, and is certainly inconsistent.


Oh, I hadn't even noticed those; I'd just copied them from citext 1.  
Fixed!


Best,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-13 Thread David E. Wheeler


On Jul 13, 2008, at 10:16, Tom Lane wrote:


Hmm.  I think what that actually means is that the cast from citext to
bpchar should be AS ASSIGNMENT rather than IMPLICIT.  What is  
happening

is that the system can't figure out whether to use length(text) or
length(bpchar) when presented with a citext argument.  I had been
thinking yesterday that it would automatically prefer length(text)
because text is a "preferred type", but after tracing through it I see
that that doesn't happen because citext is not thought to be of the
string category.  (We really need a way to let user-defined types
specify their category...)


That'd be nice.


The fact that you need all these piggyback functions is a red flag
because what it implies is that citext will not work nicely for any
situation where both text and bpchar functions have been provided
--- and that includes user-added functions, so it's hopeless to think
that you can get to a solution this way.  Downgrading the cast seems
like the right thing to me.


Yes, that works for me. I've downgraded it and can now remove the size  
functions and all the tests still pass.



The implicit cast to varchar is a bit worrisome because it creates the
same issue if someone has provided both varchar and text versions of a
function.  However, that seems a bit pointless given the lack of
semantic difference, and I suspect that a lot of user-written  
functions
come only in varchar variants --- so on balance my recommendation is  
to

keep that one as implicit.


Yes, I agree. Thanks for tracing this out, Tom, it never would have  
ocurred to me -- and now I know more than I did before!


Best,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-13 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> On Jul 12, 2008, at 14:57, Tom Lane wrote:
>> Sadly, I think you have to give up
>> attempts to check the interesting multibyte cases and confine yourself
>> to tests using ASCII strings.

> Grr. Kind of defeats the purpose. Is there no infrastructure for  
> testing multibyte functionality?

There's some stuff under src/test/locale and src/test/mb, though it
doesn't get exercised regularly.  The contrib tests are a particularly
bad place for trying to do any locale-dependent testing, because we
only support "make installcheck" which means there is no way to set
the database locale --- you *have to* work with whatever locale is
predetermined by the postmaster you're pointed at.

I don't recall the reason for not supporting "make check" in the contrib
modules; maybe it was just that preparing a test installation for each
contrib module sounded too slow?  In any case, what you'd need to pursue
is having some additional tests available that are not executed by the
standard contrib regression test sequence.

(If we get to having per-database collations in 8.4 then integrating a
test with a non-default collation would get a lot easier, of course;
but for the moment I'm afraid you've got to work with what's there.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-13 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> On Jul 12, 2008, at 12:17, Tom Lane wrote:
>> * You should provide binary I/O (send/receive) functions, if you want
>> this to be an industrial-strength module.  It's easy since you can
>> piggyback on text's.

> I'm confused. Is that not what the citextin and citextout functions are?

No, those are text I/O.  You need analogues of textsend and textrecv
too.

>> You might try running the
>> opr_sanity regression test on this module to see if it finds any
>> other silliness.  (Procedure: insert the citext definition script
>> into the serial_schedule list just ahead of opr_sanity, run tests,
>> make sure you understand the reason for any diffs in the opr_sanity
>> result.  There will be at least one from the uses of text-related
>> functions for citext.)

> Thanks. Added to my list.

BTW, actually a better idea would be to put citext.sql at the front of
the list and just run the whole main regression series with it present.
typ_sanity and oidjoins might possibly find issues too.

>> * Don't use the OPERATOR() notation when you don't need to.
>> It's just clutter.

> Sorry, don't know what you're referring to here.

Some (not all) of your CREATE OPERATOR commands have things like

NEGATOR   = OPERATOR(!~),

which seems unnecessary, and is certainly inconsistent.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-13 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> When I deleted any of the others, I got errors like this:

> psql:sql/citext.sql:865: ERROR:  function length(citext) is not unique
> LINE 1: SELECT is( length( name ), length(name::text), 'length("' ||...
> ^
> HINT:  Could not choose a best candidate function. You might need to  
> add explicit type casts.

Hmm.  I think what that actually means is that the cast from citext to
bpchar should be AS ASSIGNMENT rather than IMPLICIT.  What is happening
is that the system can't figure out whether to use length(text) or
length(bpchar) when presented with a citext argument.  I had been
thinking yesterday that it would automatically prefer length(text)
because text is a "preferred type", but after tracing through it I see
that that doesn't happen because citext is not thought to be of the
string category.  (We really need a way to let user-defined types
specify their category...)

The fact that you need all these piggyback functions is a red flag
because what it implies is that citext will not work nicely for any
situation where both text and bpchar functions have been provided
--- and that includes user-added functions, so it's hopeless to think
that you can get to a solution this way.  Downgrading the cast seems
like the right thing to me.

The implicit cast to varchar is a bit worrisome because it creates the
same issue if someone has provided both varchar and text versions of a
function.  However, that seems a bit pointless given the lack of
semantic difference, and I suspect that a lot of user-written functions
come only in varchar variants --- so on balance my recommendation is to
keep that one as implicit.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-12 Thread David E. Wheeler


On Jul 12, 2008, at 14:50, David E. Wheeler wrote:


* An explicit comment explaining that you're piggybacking on the I/O
functions (and some others) for "text" wouldn't be out of place.


I've added SQL comments. Were you talking about a COMMENT?


* Lose the GRANT EXECUTEs on the I/O functions; they're redundant.
(If you needed them, you'd need them on a lot more than these two.)
I'd be inclined to lose the COMMENTs on the functions too; again
these are about the least useful ones to comment on out of the
whole module.


I wondered about that; those were copied from CITEXT 1. I've removed  
all GRANTs and COMMENTs.



* You should provide binary I/O (send/receive) functions, if you want
this to be an industrial-strength module.  It's easy since you can
piggyback on text's.


I'm confused. Is that not what the citextin and citextout functions  
are?



* The type declaration needs to say storage = extended, else the type
won't be toastable.


Ah, good, thanks.


* The cast from bpchar to citext cannot be WITHOUT FUNCTION;
it needs to invoke rtrim1.  Compare the bpchar to text cast.


Where do I find that? I have trouble finding the SQL that creates  
the core types. :-(


Duh, you just told me. Added, thanks.


* <= is surely not its own commutator.


Changed to >=.


You might try running the
opr_sanity regression test on this module to see if it finds any
other silliness.  (Procedure: insert the citext definition script
into the serial_schedule list just ahead of opr_sanity, run tests,
make sure you understand the reason for any diffs in the opr_sanity
result.  There will be at least one from the uses of text-related
functions for citext.)


Thanks. Added to my list.


* I think you can and should drop all of the "size" functions and
a lot of the "miscellaneous" functions: anyplace where the implicit
coercion to text would serve, you don't need a piggyback function,
and introducing one just creates risks of
can't-resolve-ambiguous-function failures.  The overloaded  
miscellaneous

functions are only justifiable to the extent that it's important to
preserve "citext-ness" of the result of a function, which seems at
best dubious for many of these.  It is likewise pretty pointless  
AFAICS

to introduce regex functions taking citext pattern arguments.


I added most of those as I wrote tests and they failed to find the  
functions. Once I added the functions, they worked. But I'll do an  
audit to make sure that I didn't inadvertantly leave in any unneeded  
ones (I'm happy to have less code :-)).


Of the size functions, I was able to remove only this one and keep all  
of my pgTAP tests passing:


CREATE FUNCTION textlen(citext)
RETURNS int4 AS 'textlen'
LANGUAGE 'internal' IMMUTABLE STRICT;

When I deleted any of the others, I got errors like this:

psql:sql/citext.sql:865: ERROR:  function length(citext) is not unique
LINE 1: SELECT is( length( name ), length(name::text), 'length("' ||...
   ^
HINT:  Could not choose a best candidate function. You might need to  
add explicit type casts.


I think you can see now why I wrote the tests: I wanted to ensure that  
CITEXT really does work (almost) just like TEXT.


I was able to eliminate *all* of the miscellaneous functions, but the  
upper() and lower() functions now return TEXT instead of CITEXT, which  
I don't think is exactly what we want, is it? For now, I'e left  
upper() and lower() in. It just seems like more expected functionality.



* Don't use the OPERATOR() notation when you don't need to.
It's just clutter.


Sorry, don't know what you're referring to here. CREATE OPERATOR  
appears to require parens…


Best,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-12 Thread David E. Wheeler


On Jul 12, 2008, at 15:13, David E. Wheeler wrote:


Sadly, I think you have to give up
attempts to check the interesting multibyte cases and confine  
yourself

to tests using ASCII strings.


Grr. Kind of defeats the purpose. Is there no infrastructure for  
testing multibyte functionality? Are test database clusters all  
built using SQL_ASCII and the C locale?


I just tried to take your modified tests and add multibyte tests that  
run only on OS X with en_US.UTF-8. They worked like this:


CREATE OR REPLACE FUNCTION test_multibyte ()
RETURNS BOOLEAN AS $$
SELECT version() ~ 'apple-darwin'
   AND (select setting from pg_settings where name = 'lc_collate')
 = 'en_US.UTF-8';
$$ LANGUAGE SQL IMMUTABLE;

SELECT 'À'::citext = 'À'::citext WHERE test_multibyte() = true;
SELECT 'À'::citext = 'à'::citext WHERE test_multibyte() = true;

But then I realized that this would change the expected output  
depending on the platform, and thus make the tests fail. This is one  
reason why the inflexibility of the existing regression test model is  
a drag: it limits you to testing only what works everywhere!


Grrr.

I'll remove all the multibyte character tests, but I have to say that,  
as a result, the CITEXT 1 module would likely pass all such tests,   
but it still isn't locale-aware. How can one add regressions to ensure  
that something truly is locale-aware?


Thanks,

David
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-12 Thread David E. Wheeler


On Jul 12, 2008, at 14:57, Tom Lane wrote:

There was some discussion earlier about whether the proposed  
regression
tests for citext are suitable for use in contrib or not.  After  
playing

with them for awhile, I have to come down very firmly on the side of
"not".  I have these gripes:


You're nothing if not thorough, Tom.

1. The style is gratuitously different from every other regression  
test

in the system.  This is not a good thing.  If it were an amazingly
better way to do things, then maybe, but as far as I can tell the  
style

pgTAP forces on you is really pretty darn poorly suited for SQL tests.
You have to contort what could naturally be expressed in SQL as a  
table
result into a scalar. Plus it's redundant with the expected-output  
file.


Sure. I wasn't aware of the existing regression test methodology when  
I wrote pgTAP and those tests. I'm fine to change them and use pgTAP  
for testing my app code, rather than PostgreSQL itself.



2. It's ridiculously slow; at least a factor of ten slower than doing
equivalent tests directly in SQL.  This is a very bad thing.  Speed of
regression tests matters a lot to those of us who run them a dozen  
times

per day --- and I do not wish to discourage any developers who don't
work that way from learning better habits ;-)


Hrm. I'm wonder why it's so slow? The test functions don't really do a  
lot. Anyway, I agree that they should perform well.



Because of #1 and #2 I find the use of pgTAP to be a nonstarter.
I have a couple of gripes about the substance of the tests as well:

3. What you are mostly testing is not the behavior of citext itself,
but the behavior of the underlying strcoll function.  This is pretty
much doomed to failure, first because the regression tests are  
intended

to work in multiple locales (and we are *not* giving that up for
citext), and second because the behavior of strcoll isn't all that
portable across platforms even given allegedly similar locale settings
(as we already found out yesterday).


Yes, I *just* ran the tests on Ubuntu and opened my mail to ask about  
the bizarre differences when I saw this message. Thanks for answering  
my question before I asked it. :-)



Sadly, I think you have to give up
attempts to check the interesting multibyte cases and confine yourself
to tests using ASCII strings.


Grr. Kind of defeats the purpose. Is there no infrastructure for  
testing multibyte functionality? Are test database clusters all built  
using SQL_ASCII and the C locale?



4. A lot of the later test cases are equally uselessly testing whether
piggybacking over text functions works.  The odds of ever finding
anything with those tests are not distinguishable from zero (unless  
the
underlying text function is busted, which is not your responsibility  
to

test).  So I don't see any point in putting them into the standard
regression package.  (What maybe *would* be useful to test, but you
didn't, is whether the result of a function is considered citext  
rather

than text.)


Well, I was doing test-driven development: I wrote the tests to ensure  
that I had added the functions for CITEXT properly, and when they  
passed, I could move on. As a unit tester, it'd feel weird for me to  
drop these. I'm not testing the underlying functions; I'm making sure  
that they work properly with CITEXT.



I suggest something more like the attached as a suitable regression
test.  I got bored about halfway through and started to skim, so there
might be a few tests toward the end of the original set that would be
worth transposing into this one, but nothing jumped out at me.


Thanks! I'll work this in.

Best,

David

PS: I confirmed late yesterday that the memory leak I saw was with my  
version for 8.3, where I had my own str_tolower(). It clearly has a  
leak that I'll have to fix, but it has no bearing on the contribution  
to 8.4, which has no such leak. Thanks for running the test yourself  
to confirm.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-12 Thread Tom Lane

There was some discussion earlier about whether the proposed regression
tests for citext are suitable for use in contrib or not.  After playing
with them for awhile, I have to come down very firmly on the side of
"not".  I have these gripes:

1. The style is gratuitously different from every other regression test
in the system.  This is not a good thing.  If it were an amazingly
better way to do things, then maybe, but as far as I can tell the style
pgTAP forces on you is really pretty darn poorly suited for SQL tests.
You have to contort what could naturally be expressed in SQL as a table
result into a scalar. Plus it's redundant with the expected-output file.

2. It's ridiculously slow; at least a factor of ten slower than doing
equivalent tests directly in SQL.  This is a very bad thing.  Speed of
regression tests matters a lot to those of us who run them a dozen times
per day --- and I do not wish to discourage any developers who don't
work that way from learning better habits ;-)

Because of #1 and #2 I find the use of pgTAP to be a nonstarter.
I have a couple of gripes about the substance of the tests as well:

3. What you are mostly testing is not the behavior of citext itself,
but the behavior of the underlying strcoll function.  This is pretty
much doomed to failure, first because the regression tests are intended
to work in multiple locales (and we are *not* giving that up for
citext), and second because the behavior of strcoll isn't all that
portable across platforms even given allegedly similar locale settings
(as we already found out yesterday).  Sadly, I think you have to give up
attempts to check the interesting multibyte cases and confine yourself
to tests using ASCII strings.

4. A lot of the later test cases are equally uselessly testing whether
piggybacking over text functions works.  The odds of ever finding
anything with those tests are not distinguishable from zero (unless the
underlying text function is busted, which is not your responsibility to
test).  So I don't see any point in putting them into the standard
regression package.  (What maybe *would* be useful to test, but you
didn't, is whether the result of a function is considered citext rather
than text.)


I suggest something more like the attached as a suitable regression
test.  I got bored about halfway through and started to skim, so there
might be a few tests toward the end of the original set that would be
worth transposing into this one, but nothing jumped out at me.

regards, tom lane

--
--  Test citext datatype
--

--
-- first, define the datatype.  Turn off echoing so that expected file
-- does not depend on contents of citext.sql.
--
SET client_min_messages = warning;
\set ECHO none
\i citext.sql
\set ECHO all
RESET client_min_messages;

-- Test the operators and indexing functions

-- Test = and <>.
SELECT 'a'::citext = 'a'::citext as t;
SELECT 'a'::citext = 'A'::citext as t;
SELECT 'a'::citext = 'A'::text as f;-- text wins the discussion
SELECT 'a'::citext = 'b'::citext as f;
SELECT 'a'::citext = 'ab'::citext as f;
SELECT 'a'::citext <> 'ab'::citext as t;

-- Test > and >=
SELECT 'B'::citext > 'a'::citext as t;
SELECT 'b'::citext > 'A'::citext as t;
SELECT 'B'::citext > 'b'::citext as f;
SELECT 'B'::citext >= 'b'::citext as t;

-- Test < and <=
SELECT 'a'::citext < 'B'::citext as t;
SELECT 'a'::citext <= 'B'::citext as t;

-- Test implicit casting. citext casts to text, but not vice-versa.
SELECT 'a'::citext = 'a'::text as t;
SELECT 'A'::text <> 'a'::citext as t;

SELECT 'aardvark'::citext = 'aardvark'::citext as t;
SELECT 'aardvark'::citext = 'aardVark'::citext as t;

-- Check the citext_cmp() function explicitly.
SELECT citext_cmp('aardvark'::citext, 'aardvark'::citext) as zero;
SELECT citext_cmp('aardvark'::citext, 'aardVark'::citext) as zero;
SELECT citext_cmp('B'::citext, 'a'::citext) as one;

-- Do some tests using a table and index.

CREATE TEMP TABLE try (
name citext PRIMARY KEY
);

INSERT INTO try (name)
VALUES ('a'), ('ab'), ('aba'), ('b'), ('ba'), ('bab'), ('AZ');

SELECT name, 'a' = name from try;
SELECT name, 'a' = name from try where name = 'a';
SELECT name, 'A' = name from try;
SELECT name, 'A' = name from try where name = 'A';
SELECT name, 'A' = name from try where name = 'A';

-- expected failures on duplicate key
INSERT INTO try (name) VALUES ('a');
INSERT INTO try (name) VALUES ('A');
INSERT INTO try (name) VALUES ('aB');

-- Test aggregate functions and sort ordering

CREATE TEMP TABLE srt (
name CITEXT
);

INSERT INTO srt (name)
VALUES ('aardvark'),
   ('AAA'),
   ('aba'),
   ('ABC'),
   ('abd');

-- Check the min() and max() aggregates, with and without index.
set enable_seqscan = off;
SELECT MIN(name) from srt;
SELECT MAX(name) from srt;
reset enable_seqscan;
set enable_indexscan = off;
SELECT MIN(name) from srt;
SELECT MAX(name) from srt;
reset enable_indexscan;

-- Check sorting likewise
set enable_seqscan = off;
SELECT name FROM

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-12 Thread David E. Wheeler


On Jul 12, 2008, at 12:17, Tom Lane wrote:


BTW, I looked at the SQL file (citext.sql.in) a bit.  Some comments:


Thanks a million for these, Tom. I greatly appreciate it.


* An explicit comment explaining that you're piggybacking on the I/O
functions (and some others) for "text" wouldn't be out of place.


I've added SQL comments. Were you talking about a COMMENT?


* Lose the GRANT EXECUTEs on the I/O functions; they're redundant.
(If you needed them, you'd need them on a lot more than these two.)
I'd be inclined to lose the COMMENTs on the functions too; again
these are about the least useful ones to comment on out of the
whole module.


I wondered about that; those were copied from CITEXT 1. I've removed  
all GRANTs and COMMENTs.



* You should provide binary I/O (send/receive) functions, if you want
this to be an industrial-strength module.  It's easy since you can
piggyback on text's.


I'm confused. Is that not what the citextin and citextout functions are?


* The type declaration needs to say storage = extended, else the type
won't be toastable.


Ah, good, thanks.


* The cast from bpchar to citext cannot be WITHOUT FUNCTION;
it needs to invoke rtrim1.  Compare the bpchar to text cast.


Where do I find that? I have trouble finding the SQL that creates the  
core types. :-(



* <= is surely not its own commutator.


Changed to >=.


You might try running the
opr_sanity regression test on this module to see if it finds any
other silliness.  (Procedure: insert the citext definition script
into the serial_schedule list just ahead of opr_sanity, run tests,
make sure you understand the reason for any diffs in the opr_sanity
result.  There will be at least one from the uses of text-related
functions for citext.)


Thanks. Added to my list.


* I think you can and should drop all of the "size" functions and
a lot of the "miscellaneous" functions: anyplace where the implicit
coercion to text would serve, you don't need a piggyback function,
and introducing one just creates risks of
can't-resolve-ambiguous-function failures.  The overloaded  
miscellaneous

functions are only justifiable to the extent that it's important to
preserve "citext-ness" of the result of a function, which seems at
best dubious for many of these.  It is likewise pretty pointless  
AFAICS

to introduce regex functions taking citext pattern arguments.


I added most of those as I wrote tests and they failed to find the  
functions. Once I added the functions, they worked. But I'll do an  
audit to make sure that I didn't inadvertantly leave in any unneeded  
ones (I'm happy to have less code :-)).



* Don't use the OPERATOR() notation when you don't need to.
It's just clutter.


Sorry, don't know what you're referring to here. CREATE OPERATOR  
appears to require parens…


Thanks for the great feedback! I'll work on getting things all  
straightened out and a new patch in tonight.


Best,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-12 Thread Tom Lane

BTW, I looked at the SQL file (citext.sql.in) a bit.  Some comments:

* An explicit comment explaining that you're piggybacking on the I/O
functions (and some others) for "text" wouldn't be out of place.

* Lose the GRANT EXECUTEs on the I/O functions; they're redundant.
(If you needed them, you'd need them on a lot more than these two.)
I'd be inclined to lose the COMMENTs on the functions too; again
these are about the least useful ones to comment on out of the 
whole module.

* You should provide binary I/O (send/receive) functions, if you want
this to be an industrial-strength module.  It's easy since you can
piggyback on text's.

* The type declaration needs to say storage = extended, else the type
won't be toastable.

* The cast from bpchar to citext cannot be WITHOUT FUNCTION;
it needs to invoke rtrim1.  Compare the bpchar to text cast.

* <= is surely not its own commutator.  You might try running the
opr_sanity regression test on this module to see if it finds any
other silliness.  (Procedure: insert the citext definition script
into the serial_schedule list just ahead of opr_sanity, run tests,
make sure you understand the reason for any diffs in the opr_sanity
result.  There will be at least one from the uses of text-related
functions for citext.)

* I think you can and should drop all of the "size" functions and
a lot of the "miscellaneous" functions: anyplace where the implicit
coercion to text would serve, you don't need a piggyback function,
and introducing one just creates risks of
can't-resolve-ambiguous-function failures.  The overloaded miscellaneous
functions are only justifiable to the extent that it's important to
preserve "citext-ness" of the result of a function, which seems at
best dubious for many of these.  It is likewise pretty pointless AFAICS
to introduce regex functions taking citext pattern arguments.

* Don't use the OPERATOR() notation when you don't need to.
It's just clutter.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-11 Thread David E. Wheeler


On Jul 11, 2008, at 20:22, Tom Lane wrote:


Here's an updated version of citext.c.  Some changes cosmetic, some
not so much ...


Thanks! Good catch on my missing all of the s/int/int32/ stuff related  
to citextcmp(). Stupid oversites on my part. I'll submit a new patch  
with these changes shortly.


Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-11 Thread David E. Wheeler


On Jul 11, 2008, at 20:10, Tom Lane wrote:


Oh, got one of those too ... [ time passes ... ]  Nope, no leak
seen here either, though you could possibly fry an egg on my laptop
right now ...


Yeah, gotta love that, right?

So the issue must be with my version for 8.3.x, meaning, likely, my  
custom cilower() function. I'll have another look at it. Thanks for  
giving it a whirl on your end.



Also, I notice that the Mac is the *only* one of the three systems
on which your regression tests pass.  You're depending way too much
on platform-specific behavior there, I fear.


Hrm. I wonder what that could be. I'll have to give it a shot on my  
Ubuntu box and find out. Thanks.


David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-11 Thread Tom Lane

Here's an updated version of citext.c.  Some changes cosmetic, some
not so much ...

regards, tom lane

/*
 * PostgreSQL type definitions for CITEXT 2.0.
 */

#include "postgres.h"

#include "access/hash.h"
#include "fmgr.h"
#include "utils/builtins.h"
#include "utils/formatting.h"

#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

/*
 *  
 *  FORWARD DECLARATIONS
 *  
 */

static int32  citextcmp  (text *left, text *right);
extern Datum  citext_cmp (PG_FUNCTION_ARGS);
extern Datum  citext_hash(PG_FUNCTION_ARGS);
extern Datum  citext_eq  (PG_FUNCTION_ARGS);
extern Datum  citext_ne  (PG_FUNCTION_ARGS);
extern Datum  citext_gt  (PG_FUNCTION_ARGS);
extern Datum  citext_ge  (PG_FUNCTION_ARGS);
extern Datum  citext_lt  (PG_FUNCTION_ARGS);
extern Datum  citext_le  (PG_FUNCTION_ARGS);
extern Datum  citext_smaller (PG_FUNCTION_ARGS);
extern Datum  citext_larger  (PG_FUNCTION_ARGS);

/*
 *  =
 *  UTILITY FUNCTIONS
 *  =
 */


/* citextcmp()
 * Internal comparison function for citext strings.
 * Returns int32 negative, zero, or positive.
 */

static int32
citextcmp (text *left, text *right)
{
char   *lcstr, *rcstr;
int32   result;

lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

result = varstr_cmp(lcstr, strlen(lcstr),
rcstr, strlen(rcstr));

pfree(lcstr);
pfree(rcstr);

return result;
}

/*
 *  ==
 *  INDEXING FUNCTIONS
 *  ==
 */

PG_FUNCTION_INFO_V1(citext_cmp);

Datum
citext_cmp(PG_FUNCTION_ARGS)
{
text *left  = PG_GETARG_TEXT_PP(0);
text *right = PG_GETARG_TEXT_PP(1);
int32 result;

result = citextcmp(left, right);

PG_FREE_IF_COPY(left, 0);
PG_FREE_IF_COPY(right, 1);

PG_RETURN_INT32(result);
}

PG_FUNCTION_INFO_V1(citext_hash);

Datum
citext_hash(PG_FUNCTION_ARGS)
{
text   *txt = PG_GETARG_TEXT_PP(0);
char   *str;
Datum   result;

str= str_tolower(VARDATA_ANY(txt), VARSIZE_ANY_EXHDR(txt));
result = hash_any((unsigned char *) str, strlen(str));
pfree(str);

/* Avoid leaking memory for toasted inputs */
PG_FREE_IF_COPY(txt, 0);

PG_RETURN_DATUM(result);
}

/*
 *  ==
 *  OPERATOR FUNCTIONS
 *  ==
 */

PG_FUNCTION_INFO_V1(citext_eq);

Datum
citext_eq(PG_FUNCTION_ARGS)
{
text *left  = PG_GETARG_TEXT_PP(0);
text *right = PG_GETARG_TEXT_PP(1);
char *lcstr, *rcstr;
bool  result;

lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

/*
 * We can't do the length-comparison optimization here, as is done for the
 * text type in varlena.c, because sometimes the lengths can be different.
 * The canonical example is the turkish dotted i: the lowercase version is
 * the standard ASCII i, but the uppercase version is multibyte.
 * Otherwise, since we only care about equality or not-equality, we can
 * avoid all the expense of strcoll() here, and just do bitwise
 * comparison.
 */
result = (strcmp(lcstr, rcstr) == 0);

pfree(lcstr);
pfree(rcstr);
PG_FREE_IF_COPY(left, 0);
PG_FREE_IF_COPY(right, 1);

PG_RETURN_BOOL(result);
}

PG_FUNCTION_INFO_V1(citext_ne);

Datum
citext_ne(PG_FUNCTION_ARGS)
{
text *left  = PG_GETARG_TEXT_PP(0);
text *right = PG_GETARG_TEXT_PP(1);
char *lcstr, *rcstr;
bool  result;

lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

/*
 * We can't do the length-comparison optimization here, as is done for the
 * text type in varlena.c, because sometimes the lengths can be different.
 * The canonical example is the turkish dotted i: the lowercase version is
 * the standard ASCII i, but the uppercase version is multibyte.
 * Otherwise, since we only care about equality or not-equality, we can
 * avoid all the expense of strcoll() here, and just do bitwise
 * comparison.
 */
result = (strcmp(lcstr, rcstr) != 0);

pfree(lcstr);
pfree(rcstr);
PG_FREE_IF_COPY(left, 0);
PG_FREE_IF_COPY(right, 1);

PG_RETURN_BOOL(result);
}

PG_FUNCTION_INFO_V1(citext_lt);

Datum
citext_lt(PG_FUNCTION_ARGS)
{
text *left  = PG_GETARG_TEXT_PP(0);
text *right = PG_GETARG_TEXT_PP(1);
bool  result;

result = citextcmp(left, right) < 0;

PG_FREE_IF_COPY(left, 0);
PG_FREE_IF_COPY(right, 1);

PG_RETURN_BOOL(result);
}

PG_FUNCTION_INFO_V1(citext_le);

Datum
citext_le(PG_FUNCTION_ARGS)
{
text *left  = PG_GETARG_TEXT_PP(0);
text *right = PG_GETARG_TEXT_PP(1);
bool  result;

resul

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-11 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> On Jul 11, 2008, at 19:18, Tom Lane wrote:
>> I tried it here and didn't see any apparent memory leak, on two
>> different systems.  What platform are you testing on, and with
>> what encoding/locale settings?

> That's Mac OS X 10.5.4 "Leopard." Using en_US.UTF-8.

Oh, got one of those too ... [ time passes ... ]  Nope, no leak
seen here either, though you could possibly fry an egg on my laptop
right now ...

Also, I notice that the Mac is the *only* one of the three systems
on which your regression tests pass.  You're depending way too much
on platform-specific behavior there, I fear.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-11 Thread David E. Wheeler


On Jul 11, 2008, at 19:18, Tom Lane wrote:


I tried it here and didn't see any apparent memory leak, on two
different systems.  What platform are you testing on, and with
what encoding/locale settings?


 PostgreSQL 8.3.3 on i386-apple-darwin9.4.0, compiled by GCC i686- 
apple-darwin9-gcc-4.0.1 (G


That's Mac OS X 10.5.4 "Leopard." Using en_US.UTF-8. This is using my  
version for 8.3.3 from svn.


  https://svn.kineticode.com/citext/trunk/

I'll give it a try myself with the patch against CVS in a bit.

Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-11 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> Unfortunately, CITEXT seems to have a memory leak somewhere,  

I tried it here and didn't see any apparent memory leak, on two
different systems.  What platform are you testing on, and with
what encoding/locale settings?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-11 Thread David E. Wheeler


On Jul 11, 2008, at 16:45, Tom Lane wrote:


Not sure about a memory leak, but this code is clearly wrong if
str_tolower results in a change of physical string length (clearly
possible in Turkish, for instance, and probably in some other
locales too).  You need to do strlen's on the lowercased strings.


Ah, great point. Thanks.

Anyone else got an idea on the memory leak?

Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-11 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
>  lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
>  rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

>  result = varstr_cmp(
>  lcstr,
>  VARSIZE_ANY_EXHDR(left),
>  rcstr,
>  VARSIZE_ANY_EXHDR(right)
>  );

Not sure about a memory leak, but this code is clearly wrong if
str_tolower results in a change of physical string length (clearly
possible in Turkish, for instance, and probably in some other
locales too).  You need to do strlen's on the lowercased strings.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-11 Thread David E. Wheeler


On Jul 11, 2008, at 13:02, Zdenek Kotala wrote:

Thank you, Zdenek. Have you had a chance to try citext yet? Or did  
you just read the source?


I tested version two on Solaris/SPARC and Sun studio compiler. I  
checked last version  only quickly (comparing your changes).


Thanks.

I just updated my performance test script (attached) by increasing the  
number of rows tested by an order of magnitude. So now it creates  
1,000,000 rows, queries them, adds indexes, and then queries them  
again. Unfortunately, CITEXT seems to have a memory leak somewhere,  
because when I index the CITEXT column, it fails with "ERROR:  out of  
memory". So obviously something's not getting cleaned up. Here's the  
btree indexing function:


Datum
citext_cmp(PG_FUNCTION_ARGS)
{
text *left  = PG_GETARG_TEXT_PP(0);
text *right = PG_GETARG_TEXT_PP(1);
int32 result;

result = citextcmp(left, right);

PG_FREE_IF_COPY(left, 0);
PG_FREE_IF_COPY(right, 1);

PG_RETURN_INT32(result);
}

And here's citextcmp():

citextcmp (text *left, text *right)
{
char *lcstr, *rcstr;
int   result;

lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

result = varstr_cmp(
lcstr,
VARSIZE_ANY_EXHDR(left),
rcstr,
VARSIZE_ANY_EXHDR(right)
);

pfree(lcstr);
pfree(rcstr);
return result;
}

Can anyone see where I'm failing to free up memory? Might it be in  
some other function?


Thanks!

David



try.sql
Description: Binary data





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-11 Thread Zdenek Kotala


David E. Wheeler napsal(a):

On Jul 11, 2008, at 12:37, Zdenek Kotala wrote:

I'm sorry for late response. I'm little bit busy this week. I quickly 
look on your latest changes and it seems to me that everything is OK. 
I'm going to change status to ready for commit. After discussion with 
Pavel Stehule I changed meaning about pgFoundry vs. contrib. Contrib 
is OK for me.


Thank you, Zdenek. Have you had a chance to try citext yet? Or did you 
just read the source?


I tested version two on Solaris/SPARC and Sun studio compiler. I checked last 
version  only quickly (comparing your changes).


With regards Zdenek


--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-11 Thread David E. Wheeler


On Jul 11, 2008, at 12:37, Zdenek Kotala wrote:

I'm sorry for late response. I'm little bit busy this week. I  
quickly look on your latest changes and it seems to me that  
everything is OK. I'm going to change status to ready for commit.  
After discussion with Pavel Stehule I changed meaning about  
pgFoundry vs. contrib. Contrib is OK for me.


Thank you, Zdenek. Have you had a chance to try citext yet? Or did you  
just read the source?


Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-11 Thread Zdenek Kotala


David E. Wheeler napsal(a):
Attached is a new version of a patch to add a CITEXT contrib module. 
Changes since v2:


* Optimized citext_eq() and citext_ne() (the = and <> operators, 
respectively) by having them use strncmp() instead of varstr_cmp(). Per 
discussion.


* Added `RESTRICT` and `JOIN` clauses to the comparison operators (=, 
<>, <, >, <=, >=). These improve statistics estimations, increasing the 
liklihood that indices will be used.


I'm sorry for late response. I'm little bit busy this week. I quickly look on 
your latest changes and it seems to me that everything is OK. I'm going to 
change status to ready for commit. After discussion with Pavel Stehule I changed 
meaning about pgFoundry vs. contrib. Contrib is OK for me.


Maybe some native speaker should review documentation.

Zdenek


--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-09 Thread Alvaro Herrera

David E. Wheeler wrote:
> On Jul 9, 2008, at 13:40, Alvaro Herrera wrote:

>> One thing that jumps at me is pgTAP usage, as Zdenek said.  I
>> understand that it's neat and all that, but we can't include the
>> tests because  they won't run unless one installs pgTAP which seems a
>> nonstarter.  So if  you want the tests in the repository along the
>> rest of the stuff, they really should use pg_regress.
>
> It does use pg_regress. The test just load the included pgtap.sql file
> to get the tap functions, and then away they go. If you run `make  
> installcheck` it works.

Uh-huh, OK, that's fine then I guess.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-09 Thread David E. Wheeler


On Jul 9, 2008, at 13:40, Alvaro Herrera wrote:

The problem is that we're in the middle of a commitfest, so  
everybody is

busy reviewing other patches (in theory at least).


Of course.

One thing that jumps at me is pgTAP usage, as Zdenek said.  I  
understand
that it's neat and all that, but we can't include the tests because  
they
won't run unless one installs pgTAP which seems a nonstarter.  So if  
you

want the tests in the repository along the rest of the stuff, they
really should use pg_regress.


It does use pg_regress. The test just load the included pgtap.sql file  
to get the tap functions, and then away they go. If you run `make  
installcheck` it works.



It's not even difficult to use.  Have a look at contrib/ltree/sql and
contrib/ltree/expected for examples.

If you want to push for pgTAP in core, that's fine, but it's a  
separate

discussion.


Agreed. I've sent a couple of messages in a thread about that, the  
latest this morning.


The other possibility being, of course, that you are proposing  
citext to

live on pgFoundry.


I'm not, actually. I mean, I have an updated version for 8.3, but it'd  
be quite a pita to maintain them both, since the api for lowercasing  
text is so much simpler in 8.4.


Best,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-09 Thread Alvaro Herrera

David E. Wheeler wrote:
> I guess you're all just blown away by the perfection of this patch? ;-)

The problem is that we're in the middle of a commitfest, so everybody is
busy reviewing other patches (in theory at least).

One thing that jumps at me is pgTAP usage, as Zdenek said.  I understand
that it's neat and all that, but we can't include the tests because they
won't run unless one installs pgTAP which seems a nonstarter.  So if you
want the tests in the repository along the rest of the stuff, they
really should use pg_regress.

It's not even difficult to use.  Have a look at contrib/ltree/sql and
contrib/ltree/expected for examples.

If you want to push for pgTAP in core, that's fine, but it's a separate
discussion.

The other possibility being, of course, that you are proposing citext to
live on pgFoundry.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v3

2008-07-09 Thread David E. Wheeler


I guess you're all just blown away by the perfection of this patch? ;-)

Best,

David

On Jul 7, 2008, at 18:03, David E. Wheeler wrote:

Attached is a new version of a patch to add a CITEXT contrib module.  
Changes since v2:


* Optimized citext_eq() and citext_ne() (the = and <> operators,  
respectively) by having them use strncmp() instead of varstr_cmp().  
Per discussion.


* Added `RESTRICT` and `JOIN` clauses to the comparison operators  
(=, <>, <, >, <=, >=). These improve statistics estimations,  
increasing the liklihood that indices will be used.


Thanks!

David




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-09 Thread David E. Wheeler


On Jul 7, 2008, at 12:06, David E. Wheeler wrote:

I understand it but there is parallel project which should solve  
this problem completely I guess in "close" future (2-3years).  
Afterward this module will be obsolete and it will takes times to  
remove it from contrib. It seems to me that have citext in contrib  
only for two releases is not optimal solution.


I guess that'd be the reason to keep it on pgFoundry, but I have two  
comments:


* 2-3 years is a *long* time in Internet time.

* There is on guarantee that it will be finished in that time or,  
indeed ever (no disrespect to Radek Strnad, it's just there are  
always unforeseen issues).


One other thing that occurred to me yesterday: Given that the feature  
will ultimately support column-level collations, I suspect that it  
will be much easier to migrate CITEXT to a case-insensitive collation  
(perhaps using an updated CITEXT contrib module that just does so  
transparently) than to migrate application code from using LOWER() all  
over the place to not using. One transition requires a change to the  
schema, the other requires a change to application code, of which  
there is generally a lot more.


Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-08 Thread Zdenek Kotala


Martijn van Oosterhout napsal(a):

On Mon, Jul 07, 2008 at 12:06:08PM -0700, David E. Wheeler wrote:
I guess that'd be the reason to keep it on pgFoundry, but I have two  
comments:


* 2-3 years is a *long* time in Internet time.


There have been patches over the years, but they tend not to get looked
at. Recently someone pulled up the COLLATE patch from a couple of years
ago but it didn't get much feedback either. (I can't find the link
right now).


I know about it. I have printed your proposal on my desk. I think It is linked 
from TODO list.



It's disappointing that the discussions get hung up on the ICU library
when it's not required or even needed for COLLATE support. My original
patch never even mentioned it.

I note that Firebird added COLLATE using ICU a few years back now. I
think PostgreSQL is the only large DBMS to not support it.


Complete agree. Collation missing support is big problem for many users.

* There is on guarantee that it will be finished in that time or,  
indeed ever (no disrespect to Radek Strnad, it's just there are always  
unforeseen issues).


I think that with concerted coding effort it could be done in 2-3
months (judging by how long it took to write the first version). The
problem is it needs some planner kung-fu which not many people have.


I agree that 2-3 months on fulltime is good estimation, problem is that you need 
kung-fu master which has time to do it :(. What we currently have is student 
which works on it in free time.


Zdenek

--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread Martijn van Oosterhout

On Mon, Jul 07, 2008 at 12:06:08PM -0700, David E. Wheeler wrote:
> I guess that'd be the reason to keep it on pgFoundry, but I have two  
> comments:
> 
> * 2-3 years is a *long* time in Internet time.

There have been patches over the years, but they tend not to get looked
at. Recently someone pulled up the COLLATE patch from a couple of years
ago but it didn't get much feedback either. (I can't find the link
right now).

It's disappointing that the discussions get hung up on the ICU library
when it's not required or even needed for COLLATE support. My original
patch never even mentioned it.

I note that Firebird added COLLATE using ICU a few years back now. I
think PostgreSQL is the only large DBMS to not support it.

> * There is on guarantee that it will be finished in that time or,  
> indeed ever (no disrespect to Radek Strnad, it's just there are always  
> unforeseen issues).

I think that with concerted coding effort it could be done in 2-3
months (judging by how long it took to write the first version). The
problem is it needs some planner kung-fu which not many people have.

Have a nice day,
-- 
Martijn van Oosterhout   <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while 
> boarding. Thank you for flying nlogn airlines.

signature.asc
Description: Digital signature

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler


Thanks to help from RhodiumToad on IRC, I got some things improved here:

On Jul 7, 2008, at 16:24, David E. Wheeler wrote:

So for some reason, after adding the indexes, the queries against  
the CITEXT column aren't using them. Furthermore, the `lower(text)  
LIKE lower(?)` query isn't using *its* index. Huh?


I never knew what one needed to use the text_pattern_ops operator  
class to index a column for use with LIKE! I had no clue. Would that  
work for a citext column, too, since it's essentially the same as TEXT?



So this leaves me with two questions:

1. For what reason would the query against the citext column *not*  
use the index?


It turns out that it did use the index if I put `SET enable_seqscan =  
false;` into my script. So with RhodiumToad's direction, I added some  
`RESTRICT` and `JOIN` clauses to my comparison operators (copying them  
from ip4r). So now I have:


CREATE OPERATOR = (
LEFTARG= CITEXT,
RIGHTARG   = CITEXT,
COMMUTATOR = =,
NEGATOR= <>,
PROCEDURE  = citext_eq,
RESTRICT   = eqsel,
JOIN   = eqjoinsel,
HASHES,
MERGES
);

CREATE OPERATOR <> (
LEFTARG= CITEXT,
RIGHTARG   = CITEXT,
NEGATOR= =,
COMMUTATOR = <>,
PROCEDURE  = citext_ne,
RESTRICT   = neqsel,
JOIN   = neqjoinsel
);

CREATE OPERATOR < (
LEFTARG= CITEXT,
RIGHTARG   = CITEXT,
NEGATOR= >=,
COMMUTATOR = >,
PROCEDURE  = citext_lt,
RESTRICT   = scalarltsel,
JOIN   = scalarltjoinsel
);

CREATE OPERATOR <= (
LEFTARG= CITEXT,
RIGHTARG   = CITEXT,
NEGATOR= >,
COMMUTATOR = <=,
PROCEDURE  = citext_le,
RESTRICT   = scalarltsel,
JOIN   = scalarltjoinsel
);

CREATE OPERATOR >= (
LEFTARG= CITEXT,
RIGHTARG   = CITEXT,
NEGATOR= <,
COMMUTATOR = <=,
PROCEDURE  = citext_ge,
RESTRICT   = scalargtsel,
JOIN   = scalargtjoinsel
);

CREATE OPERATOR > (
LEFTARG= CITEXT,
RIGHTARG   = CITEXT,
NEGATOR= <=,
COMMUTATOR = <,
PROCEDURE  = citext_gt,
RESTRICT   = scalargtsel,
JOIN   = scalargtjoinsel
);

With this change, the index was used:

Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 261.295 ms
SELECT * FROM try WHERE citext = 'food';
Time: 289.304 ms
Time: 1228.961 ms

Adding indexes...

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 2.018 ms
SELECT * FROM try WHERE citext = 'food';
Time: 0.788 ms

Seems to be faster than the LOWER() version, too, which makes me  
happy. The output from EXPLAIN ANALYZE:


try=# EXPLAIN ANALYZE SELECT * FROM try WHERE citext = 'food';

  QUERY PLAN
--
 Index Scan using idx_try_citext on try  (cost=0.00..8.31 rows=1  
width=119) (actual time=0.324..0.324 rows=0 loops=1)

   Index Cond: (citext = 'food'::citext)
 Total runtime: 0.377 ms
(3 rows)

try=# EXPLAIN ANALYZE SELECT * FROM try WHERE LOWER(text) =  
LOWER('food');


   QUERY PLAN

 Bitmap Heap Scan on try  (cost=28.17..1336.10 rows=500 width=119)  
(actual time=0.170..0.170 rows=0 loops=1)

   Recheck Cond: (lower(text) = 'food'::text)
   ->  Bitmap Index Scan on idx_try_text  (cost=0.00..28.04 rows=500  
width=0) (actual time=0.168..0.168 rows=0 loops=1)

 Index Cond: (lower(text) = 'food'::text)
 Total runtime: 0.211 ms
(5 rows)

So my only other question related to this is:

* Are the above RESTRICT and JOIN functions the ones to use, or is  
there some way to make use of those used by the TEXT type that would  
be more appropriate?


2. Is there some way to get the CITEXT index to behave like a  
LOWER() index, that is, so that its value is stored using the result  
of the str_tolower() function, thus removing some of the overhead of  
converting the values for each row fetched from the index? (Does  
this question make any sense?)


Given the performance with an index, I think that this is moot, yes?  
There is, of course, much more overhead for a table scan.


Best,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 17:18, Tom Lane wrote:


No, but you were: you proposed using strncmp for everything.


Yes, that's right. I was trying to understand why I wouldn't use just  
one or the other. And the answer turned out to be, you wouldn't,  
except that strncmp() is an desirable optimization for = and <>. So  
I've changed only those.


Phew, I think I'm clear now. Thanks!

DAvid

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> On Jul 7, 2008, at 16:58, Tom Lane wrote:
>> If that's so, you certainly can't use strncmp, because that would  
>> result
>> in sort orderings totally different from lower()'s result.  Even  
>> without
>> that argument, for most multibyte cases you'd get a pretty arbitrary,
>> user-unfriendly sort ordering.

> Now I'm confused again. :-( Whether or not I use strncmp() or  
> varstr_cmp(), I first lowercase the value to be compared using  
> str_tolower(). What Zdenek has said is, that aside, just as for the  
> TEXT type, I should use strncmp() for = and <>, and varstr_cmp() for  
> everything else. Are you saying something different?

No, but you were: you proposed using strncmp for everything.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 16:58, Tom Lane wrote:


"David E. Wheeler" <[EMAIL PROTECTED]> writes:

Hrm. So in your opinion, strncmp() could be used for all comparisons
by citext, rather than varstr_cmp()?


I thought the charter of this type was to work like lower(textcol).


Correct.

If that's so, you certainly can't use strncmp, because that would  
result
in sort orderings totally different from lower()'s result.  Even  
without

that argument, for most multibyte cases you'd get a pretty arbitrary,
user-unfriendly sort ordering.


Now I'm confused again. :-( Whether or not I use strncmp() or  
varstr_cmp(), I first lowercase the value to be compared using  
str_tolower(). What Zdenek has said is, that aside, just as for the  
TEXT type, I should use strncmp() for = and <>, and varstr_cmp() for  
everything else. Are you saying something different?


I could use some examples to play with in order to ensure that things  
are behaving as they should. I'll add regression tests for them.


Thanks,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> Hrm. So in your opinion, strncmp() could be used for all comparisons  
> by citext, rather than varstr_cmp()?

I thought the charter of this type was to work like lower(textcol).
If that's so, you certainly can't use strncmp, because that would result
in sort orderings totally different from lower()'s result.  Even without
that argument, for most multibyte cases you'd get a pretty arbitrary,
user-unfriendly sort ordering.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler


No, *really*

Sheesh, sorry.

David



try.sql
Description: Binary data




On Jul 7, 2008, at 16:26, David E. Wheeler wrote:


And here is the script. D'oh!

Thanks,

David




On Jul 7, 2008, at 16:24, David E. Wheeler wrote:


On Jul 7, 2008, at 08:01, Andrew Dunstan wrote:

What does still bother me is its performance. I'd like to know if  
any measurement has been done of using citext vs. a functional  
index on lower(foo).


Okay, here's a start. The attached script inserts random strings of  
1-10 space-delimited words into text and citext columns, and then  
compares the performance of queries with and without indexes. The  
output for me is as follows:


Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 254.254 ms
SELECT * FROM try WHERE citext = 'food';
Time: 288.535 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.385 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 236.186 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 235.818 ms

Adding indexes...

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 1.260 ms
SELECT * FROM try WHERE citext = 'food';
Time: 277.755 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.073 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 238.430 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 238.685 ms
benedict%

So for some reason, after adding the indexes, the queries against  
the CITEXT column aren't using them. Furthermore, the `lower(text)  
LIKE lower(?)` query isn't using *its* index. Huh?


So this leaves me with two questions:

1. For what reason would the query against the citext column *not*  
use the index?


2. Is there some way to get the CITEXT index to behave like a  
LOWER() index, that is, so that its value is stored using the  
result of the str_tolower() function, thus removing some of the  
overhead of converting the values for each row fetched from the  
index? (Does this question make any sense?)


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler


And here is the script. D'oh!

Thanks,

David



try.sql
Description: Binary data




On Jul 7, 2008, at 16:24, David E. Wheeler wrote:


On Jul 7, 2008, at 08:01, Andrew Dunstan wrote:

What does still bother me is its performance. I'd like to know if  
any measurement has been done of using citext vs. a functional  
index on lower(foo).


Okay, here's a start. The attached script inserts random strings of  
1-10 space-delimited words into text and citext columns, and then  
compares the performance of queries with and without indexes. The  
output for me is as follows:


Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 254.254 ms
SELECT * FROM try WHERE citext = 'food';
Time: 288.535 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.385 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 236.186 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 235.818 ms

Adding indexes...

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 1.260 ms
SELECT * FROM try WHERE citext = 'food';
Time: 277.755 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.073 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 238.430 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 238.685 ms
benedict%

So for some reason, after adding the indexes, the queries against  
the CITEXT column aren't using them. Furthermore, the `lower(text)  
LIKE lower(?)` query isn't using *its* index. Huh?


So this leaves me with two questions:

1. For what reason would the query against the citext column *not*  
use the index?


2. Is there some way to get the CITEXT index to behave like a  
LOWER() index, that is, so that its value is stored using the result  
of the str_tolower() function, thus removing some of the overhead of  
converting the values for each row fetched from the index? (Does  
this question make any sense?)


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 08:01, Andrew Dunstan wrote:

What does still bother me is its performance. I'd like to know if  
any measurement has been done of using citext vs. a functional index  
on lower(foo).


Okay, here's a start. The attached script inserts random strings of  
1-10 space-delimited words into text and citext columns, and then  
compares the performance of queries with and without indexes. The  
output for me is as follows:


Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 254.254 ms
SELECT * FROM try WHERE citext = 'food';
Time: 288.535 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.385 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 236.186 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 235.818 ms

Adding indexes...

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 1.260 ms
SELECT * FROM try WHERE citext = 'food';
Time: 277.755 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.073 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 238.430 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 238.685 ms
benedict%

So for some reason, after adding the indexes, the queries against the  
CITEXT column aren't using them. Furthermore, the `lower(text) LIKE  
lower(?)` query isn't using *its* index. Huh?


So this leaves me with two questions:

1. For what reason would the query against the citext column *not* use  
the index?


2. Is there some way to get the CITEXT index to behave like a LOWER()  
index, that is, so that its value is stored using the result of the  
str_tolower() function, thus removing some of the overhead of  
converting the values for each row fetched from the index? (Does this  
question make any sense?)


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 13:59, Gregory Stark wrote:

Of course the obvious case of two equivalent strings with different  
bytes
would be two strings which differ only in case in a collation which  
doesn't
distinguish based on case. So you obviously can't take this route  
for citext.


Well, to be fair, citext isn't imposing a collation. It's just calling  
str_tolower() on strings before passing them on to varstr_cmp() or  
strncmp() to compare.


I don't think you have to worry about the problem that cause  
Postgres to make
this change. IIRC it was someone comparing strings like paths and  
usernames
and getting false positives because they were in a Turkish locale  
which found
certain sequences of characters to be insignificant for ordering.  
Someone
who's using a citext data type has obviously decided that's  
precisely the kind

of behaviour they want.


Hrm. So in your opinion, strncmp() could be used for all comparisons  
by citext, rather than varstr_cmp()?


Thanks,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread Gregory Stark

"David E. Wheeler" <[EMAIL PROTECTED]> writes:

> On Jul 7, 2008, at 12:21, David E. Wheeler wrote:
>
>> My question is: why? Shouldn't they all use the same function for
>> comparison? I'm happy to dupe this implementation for citext, but I  don't
>> understand it. Should not all comparisons be executed  consistently?
>
> Let me try to answer my own question by citing this comment:
>
>   /*
>* Since we only care about equality or not-equality, we can avoid  all
> the
>* expense of strcoll() here, and just do bitwise comparison.
>*/
>
> So, the upshot is that the = and <> operators are not locale-aware,  yes? They
> just do byte comparisons. Is that really the way it should  be? I mean, could
> there not be strings that are equivalent but have  different bytes?

There could be strings that strcoll returns 0 for even though they're not
identical. However that caused problems in Postgres so we decided that only
equal strings should actually compare equal. So if strcoll returns 0 then we
do a bytewise comparison to impose an arbitrary ordering.

Of course the obvious case of two equivalent strings with different bytes
would be two strings which differ only in case in a collation which doesn't
distinguish based on case. So you obviously can't take this route for citext.

I don't think you have to worry about the problem that cause Postgres to make
this change. IIRC it was someone comparing strings like paths and usernames
and getting false positives because they were in a Turkish locale which found
certain sequences of characters to be insignificant for ordering. Someone
who's using a citext data type has obviously decided that's precisely the kind
of behaviour they want.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 12:46, Zdenek Kotala wrote:

So, the upshot is that the = and <> operators are not locale-aware,  
yes? They just do byte comparisons. Is that really the way it  
should be? I mean, could there not be strings that are equivalent  
but have different bytes?


Correct. The problem is complex. It works fine only for normalized  
string. But postgres now assume that all utf8 strings are normalized.


I see. So binary equivalence is okay, in that case.

If you need to implement < <= >= > operators you need to use strcol  
which take care of locale collation.


Which varstr_cmp() does, I guess. It's what textlt uses, for example.


See unicode collation algorithm http://www.unicode.org/reports/tr10/


Wow, that looks like a fun read.

Best,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 13:10, Tom Lane wrote:


We intentionally force such strings to be considered non-equal.
See varstr_cmp, and if you like see the archives from back when
that was put in.

The = and <> operators are in fact consistent with the behavior of
varstr_cmp (and had better be!); they're just optimized a bit.


Thank you, Tom. I'll post my updated patch shortly.

In the meantime, can anyone suggest an easy way to populate a table  
full of random strings so that I can do a bit of benchmarking?


Thanks,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread Tom Lane

"David E. Wheeler" <[EMAIL PROTECTED]> writes:
> So, the upshot is that the = and <> operators are not locale-aware,  
> yes? They just do byte comparisons. Is that really the way it should  
> be? I mean, could there not be strings that are equivalent but have  
> different bytes?

We intentionally force such strings to be considered non-equal.
See varstr_cmp, and if you like see the archives from back when
that was put in.

The = and <> operators are in fact consistent with the behavior of
varstr_cmp (and had better be!); they're just optimized a bit.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread Zdenek Kotala


David E. Wheeler napsal(a):

On Jul 7, 2008, at 12:21, David E. Wheeler wrote:

My question is: why? Shouldn't they all use the same function for 
comparison? I'm happy to dupe this implementation for citext, but I 
don't understand it. Should not all comparisons be executed consistently?


Let me try to answer my own question by citing this comment:

/*
 * Since we only care about equality or not-equality, we can avoid 
all the

 * expense of strcoll() here, and just do bitwise comparison.
 */

So, the upshot is that the = and <> operators are not locale-aware, yes? 
They just do byte comparisons. Is that really the way it should be? I 
mean, could there not be strings that are equivalent but have different 
bytes?


Correct. The problem is complex. It works fine only for normalized string. But 
postgres now assume that all utf8 strings are normalized.


If you need to implement < <= >= > operators you need to use strcol which take 
care of locale collation.


See unicode collation algorithm http://www.unicode.org/reports/tr10/

Zdenek




--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread Pavel Stehule

2008/7/7 David E. Wheeler <[EMAIL PROTECTED]>:
> On Jul 7, 2008, at 12:36, Pavel Stehule wrote:
>
>>> * Does it need to be locale-aware or not?
>>
>> yes!
>
> texteq() in varlena.c does not seem to be.
>

it's case sensitive and it's +/- binary compare

Regards
Pavel Stehule

> Best,
>
> David
>

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 12:36, Pavel Stehule wrote:


* Does it need to be locale-aware or not?


yes!


texteq() in varlena.c does not seem to be.

Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread Pavel Stehule

2008/7/7 David E. Wheeler <[EMAIL PROTECTED]>:
> On Jul 7, 2008, at 11:54, Alvaro Herrera wrote:
>
>>> Then why shouldn't I use strncmp() for all comparisons?
>>
>> I have no idea :-) -- because it's not locale-aware perhaps.
>
> Could someone who does have an idea answer these questions:
>
> * Does it need to be locale-aware or not?

yes!

> * Should I use strncmp() or varstr_cmp() to compare strings?
> * Shouldn't it use one or the other, but not both?
>
> Sorry, I'm just confused about the "correct" thing to do here. If someone
> who knows the definitive answers could weigh in, I'd be happy to make the
> adjustment.
>
> Thanks,
>
> David
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 12:21, David E. Wheeler wrote:

My question is: why? Shouldn't they all use the same function for  
comparison? I'm happy to dupe this implementation for citext, but I  
don't understand it. Should not all comparisons be executed  
consistently?


Let me try to answer my own question by citing this comment:

/*
	 * Since we only care about equality or not-equality, we can avoid  
all the

 * expense of strcoll() here, and just do bitwise comparison.
 */

So, the upshot is that the = and <> operators are not locale-aware,  
yes? They just do byte comparisons. Is that really the way it should  
be? I mean, could there not be strings that are equivalent but have  
different bytes?


Thanks,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 12:10, Pavel Stehule wrote:


Maybe we can have some "locale" test outside our regress tests -


I think that would be useful.

Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 12:13, Zdenek Kotala wrote:


I'm sorry. I meant bttext() 
http://doxygen.postgresql.org/varlena_8c-source.html#l01270

bttext should use in citextcmp function. It uses strcol function.


I've no idea what bttext has to do with anything. Sorry if I'm being  
slow here.



And citext_eq should be implemented as texteq:

http://doxygen.postgresql.org/varlena_8c-source.html#l01164

I'm sorry for confusion I'm exchange bttext and varstr_cmp. :(


Okay, I see that text_cmp() uses varstr_cmp():

  http://doxygen.postgresql.org/varlena_8c-source.html#l01139

While texteq() and textne() use strncmp():

  http://doxygen.postgresql.org/varlena_8c-source.html#l01164
  http://doxygen.postgresql.org/varlena_8c-source.html#l01187

The other operator functions, like text_lt(), use text_cmp() and  
therefore varstr_cmp():


  http://doxygen.postgresql.org/varlena_8c-source.html#01210

My question is: why? Shouldn't they all use the same function for  
comparison? I'm happy to dupe this implementation for citext, but I  
don't understand it. Should not all comparisons be executed  
consistently?


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread Zdenek Kotala


David E. Wheeler napsal(a):

On Jul 7, 2008, at 11:54, Alvaro Herrera wrote:


Then why shouldn't I use strncmp() for all comparisons?


I have no idea :-) -- because it's not locale-aware perhaps.


Could someone who does have an idea answer these questions:

* Does it need to be locale-aware or not?
* Should I use strncmp() or varstr_cmp() to compare strings?
* Shouldn't it use one or the other, but not both?

Sorry, I'm just confused about the "correct" thing to do here. If 
someone who knows the definitive answers could weigh in, I'd be happy to 
make the adjustment.



I'm sorry. I meant bttext() 
http://doxygen.postgresql.org/varlena_8c-source.html#l01270


bttext should use in citextcmp function. It uses strcol function.

And citext_eq should be implemented as texteq:

http://doxygen.postgresql.org/varlena_8c-source.html#l01164

I'm sorry for confusion I'm exchange bttext and varstr_cmp. :(

Zdenek




--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread Pavel Stehule

2008/7/7 Zdenek Kotala <[EMAIL PROTECTED]>:
> David E. Wheeler napsal(a):
>>
>> On Jul 7, 2008, at 07:41, Zdenek Kotala wrote:
>>
>>> However, It seems to me that code is ok now (exclude citex_eq). I think
>>> there two open problems/questions:
>>>
>>> 1) regression test -
>>>
>>>  a) I think that regresion is not correct. It depends on en_US locale,
>>> but it uses characters which is not in related character repertoire. It
>>> means comparing is not defined and I guess it could generate different
>>> result on different OS - depends on locale implementation.
>>
>> That I don't know about. The test requires en_US.UTF-8, at least at this
>> point. How are tests run on the build farm? And how else could I ensure that
>> comparisons are case-insensitive for non-ASCII characters other than
>> requiring a Unicode locale? Or is it just an issue for the sort order tests?
>> For those, I could potentially remove accented characters, just as long as
>> I'm verifying in other tests that comparisons are case-insensitive (without
>> worrying about collation).
>
> Hmm, it is complex area and I'm not sure if anybody on planet know correct
> answer :-). I think that best approach is to keep it as is and when a
> problem occur then it will be fixed.
>

Maybe we can have some "locale" test outside our regress tests -

>>>  b) pgTap is something new. Need make a decision if this framework is
>>> acceptable or not.
>>
>> Well, from the point of view of `make installcheck`, it's invisible. I've
>> submitted a talk proposal for pgDay.US on ptTAP. I'm happy to discuss it
>> further here though, if folks are interested.
>
> Yeah, it is invisible, but question is why don't put it as a framework to
> common place. But it is for another discussion.
>
>>> 2) contrib vs. pgFoundry
>>>
>>> There is unresolved answer if we want to have this in contrib or not.
>>> Good to mention that citext type will be obsoleted with full collation
>>> implementation in a future. I personally prefer to keep it on pgFoundry
>>> because it is temporally workaround (by my opinion), but I can live with
>>> contrib module as well.
>>
>> I second what Andrew has said in reply to this issue. I'll also say that,
>> since people *so* often end up using `WHERE LOWER(col) = LOWER(?)`, that
>> it'd be very valuable to have citext in contrib, especially since so few
>> folks seem to even know about pgFoundry, let alone be able to find it. I
>> mean, look at these search results:
>
> I understand it but there is parallel project which should solve this
> problem completely I guess in "close" future (2-3years). Afterward this
> module will be obsolete and it will takes times to remove it from contrib.
> It seems to me that have citext in contrib only for two releases is not
> optimal solution.
>
>Zdenek
>
>
Regards
Pavel

>
> --
> Zdenek Kotala  Sun Microsystems
> Prague, Czech Republic http://sun.com/postgresql
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 11:54, Zdenek Kotala wrote:

Hmm, it is complex area and I'm not sure if anybody on planet know  
correct answer :-). I think that best approach is to keep it as is  
and when a problem occur then it will be fixed.


Regression tests are really important, though.

b) pgTap is something new. Need make a decision if this framework  
is acceptable or not.
Well, from the point of view of `make installcheck`, it's  
invisible. I've submitted a talk proposal for pgDay.US on ptTAP.  
I'm happy to discuss it further here though, if folks are interested.


Yeah, it is invisible, but question is why don't put it as a  
framework to common place. But it is for another discussion.


It is in a common place as a project, on pgFoundry. Whether the core  
team wants to use it for testing other parts of PostgreSQL (core or  
contrib) and therefore put it in a central location is, as you say, a  
separate conversation. It'd be easy to move it in such a case, of  
course.


I understand it but there is parallel project which should solve  
this problem completely I guess in "close" future (2-3years).  
Afterward this module will be obsolete and it will takes times to  
remove it from contrib. It seems to me that have citext in contrib  
only for two releases is not optimal solution.


I guess that'd be the reason to keep it on pgFoundry, but I have two  
comments:


* 2-3 years is a *long* time in Internet time.

* There is on guarantee that it will be finished in that time or,  
indeed ever (no disrespect to Radek Strnad, it's just there are always  
unforeseen issues).


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 11:54, Alvaro Herrera wrote:


Then why shouldn't I use strncmp() for all comparisons?


I have no idea :-) -- because it's not locale-aware perhaps.


Could someone who does have an idea answer these questions:

* Does it need to be locale-aware or not?
* Should I use strncmp() or varstr_cmp() to compare strings?
* Shouldn't it use one or the other, but not both?

Sorry, I'm just confused about the "correct" thing to do here. If  
someone who knows the definitive answers could weigh in, I'd be happy  
to make the adjustment.


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread Zdenek Kotala


Andrew Dunstan napsal(a):



Zdenek Kotala wrote:



2) contrib vs. pgFoundry

There is unresolved answer if we want to have this in contrib or not. 
Good to mention that citext type will be obsoleted with full collation 
implementation in a future. I personally prefer to keep it on 
pgFoundry because it is temporally workaround (by my opinion), but I 
can live with contrib module as well.



  


Is there a concrete plan to get to full collation support (i.e. per 
column) in any known time frame? If not, then I think a citext module 
would be acceptable.


Collation per database should be part of 8.4. Radek Strnad works on it as a SoC 
project. He plans to continue on this work and implement full support as his 
Master of Thesis in 2-3 years time frame.


Zdenek


--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread Zdenek Kotala


David E. Wheeler napsal(a):

On Jul 7, 2008, at 07:41, Zdenek Kotala wrote:

However, It seems to me that code is ok now (exclude citex_eq). I 
think there two open problems/questions:


1) regression test -

 a) I think that regresion is not correct. It depends on en_US locale, 
but it uses characters which is not in related character repertoire. 
It means comparing is not defined and I guess it could generate 
different result on different OS - depends on locale implementation.


That I don't know about. The test requires en_US.UTF-8, at least at this 
point. How are tests run on the build farm? And how else could I ensure 
that comparisons are case-insensitive for non-ASCII characters other 
than requiring a Unicode locale? Or is it just an issue for the sort 
order tests? For those, I could potentially remove accented characters, 
just as long as I'm verifying in other tests that comparisons are 
case-insensitive (without worrying about collation).


Hmm, it is complex area and I'm not sure if anybody on planet know correct 
answer :-). I think that best approach is to keep it as is and when a problem 
occur then it will be fixed.


 b) pgTap is something new. Need make a decision if this framework is 
acceptable or not.


Well, from the point of view of `make installcheck`, it's invisible. 
I've submitted a talk proposal for pgDay.US on ptTAP. I'm happy to 
discuss it further here though, if folks are interested.


Yeah, it is invisible, but question is why don't put it as a framework to common 
place. But it is for another discussion.



2) contrib vs. pgFoundry

There is unresolved answer if we want to have this in contrib or not. 
Good to mention that citext type will be obsoleted with full collation 
implementation in a future. I personally prefer to keep it on 
pgFoundry because it is temporally workaround (by my opinion), but I 
can live with contrib module as well.


I second what Andrew has said in reply to this issue. I'll also say 
that, since people *so* often end up using `WHERE LOWER(col) = 
LOWER(?)`, that it'd be very valuable to have citext in contrib, 
especially since so few folks seem to even know about pgFoundry, let 
alone be able to find it. I mean, look at these search results:


I understand it but there is parallel project which should solve this problem 
completely I guess in "close" future (2-3years). Afterward this module will be 
obsolete and it will takes times to remove it from contrib. It seems to me that 
have citext in contrib only for two releases is not optimal solution.


Zdenek



--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread Alvaro Herrera

David E. Wheeler wrote:
> On Jul 7, 2008, at 11:44, Alvaro Herrera wrote:
>
>>> I'm confused. What's the difference between strncmp() and  
>>> varstr_cmp()?
>>> And why would I use one in one place and the other elsewhere?  
>>> Shouldn't
>>> they be used consistently?
>>
>> Not necessarily -- see texteq.  I think the point is that varstr_cmp()
>> is a lot slower.
>
> Then why shouldn't I use strncmp() for all comparisons?

I have no idea :-) -- because it's not locale-aware perhaps.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 11:44, Alvaro Herrera wrote:

I'm confused. What's the difference between strncmp() and  
varstr_cmp()?
And why would I use one in one place and the other elsewhere?  
Shouldn't

they be used consistently?


Not necessarily -- see texteq.  I think the point is that varstr_cmp()
is a lot slower.


Then why shouldn't I use strncmp() for all comparisons?

No, I was wondering before what locale was used for initdb on the  
build

farm. I mean, how are locale-aware things really tested?


They aren't AFAIK.


Ow. Pity.

Best,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread Alvaro Herrera

David E. Wheeler wrote:
> On Jul 7, 2008, at 00:46, Zdenek Kotala wrote:
>
>> You have to use varstr_cmp  in citextcmp. Your code is correct,  
>> because for
>> < <= >= > operators you need collation sensible function.
>>
>> You need to change only citext_cmp function to use strncmp() or call  
>> texteq function.
>
> I'm confused. What's the difference between strncmp() and varstr_cmp()? 
> And why would I use one in one place and the other elsewhere? Shouldn't 
> they be used consistently?

Not necessarily -- see texteq.  I think the point is that varstr_cmp()
is a lot slower.

>> I'm think that this test will work correctly for en_US.UTF-8 at any  
>> time. I guess the test make sense only when Czech collation  
>> (cs_CZ.UTF-8) is selected, but unfortunately, you cannot change  
>> collation during your test :(.
>
> No, I was wondering before what locale was used for initdb on the build 
> farm. I mean, how are locale-aware things really tested?

They aren't AFAIK.


-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 08:01, Andrew Dunstan wrote:

What does still bother me is its performance. I'd like to know if  
any measurement has been done of using citext vs. a functional index  
on lower(foo).


That's a good question. I can whip up a quick test by populating a  
column full of data and trying both, but I'm no performance testing  
expert. Anyone else know more about such performance testing who can  
help out?


Many Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0 v2

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 07:41, Zdenek Kotala wrote:

However, It seems to me that code is ok now (exclude citex_eq). I  
think there two open problems/questions:


1) regression test -

 a) I think that regresion is not correct. It depends on en_US  
locale, but it uses characters which is not in related character  
repertoire. It means comparing is not defined and I guess it could  
generate different result on different OS - depends on locale  
implementation.


That I don't know about. The test requires en_US.UTF-8, at least at  
this point. How are tests run on the build farm? And how else could I  
ensure that comparisons are case-insensitive for non-ASCII characters  
other than requiring a Unicode locale? Or is it just an issue for the  
sort order tests? For those, I could potentially remove accented  
characters, just as long as I'm verifying in other tests that  
comparisons are case-insensitive (without worrying about collation).


 b) pgTap is something new. Need make a decision if this framework  
is acceptable or not.


Well, from the point of view of `make installcheck`, it's invisible.  
I've submitted a talk proposal for pgDay.US on ptTAP. I'm happy to  
discuss it further here though, if folks are interested.



2) contrib vs. pgFoundry

There is unresolved answer if we want to have this in contrib or  
not. Good to mention that citext type will be obsoleted with full  
collation implementation in a future. I personally prefer to keep it  
on pgFoundry because it is temporally workaround (by my opinion),  
but I can live with contrib module as well.


I second what Andrew has said in reply to this issue. I'll also say  
that, since people *so* often end up using `WHERE LOWER(col) =  
LOWER(?)`, that it'd be very valuable to have citext in contrib,  
especially since so few folks seem to even know about pgFoundry, let  
alone be able to find it. I mean, look at these search results:


  http://www.google.com/search?q=PostgreSQL%20case-insensitive%20text

My blog entry about this patch is hit #3. pgFoundry (and CITEXT 1) is  
#7. Last time I did a query like this, it didn't turn up at all.


Belive me, I'd love for pgFoundry (or something like it) to become the  
CPAN for PostgreSQL. But without some love and SEO, I don't think  
that's gonna happen.


Besides, CITEXT 2 would be a PITA to maintain for both 8.3 and 8.4,  
given the changes in the string comparison API in CVS.


Thanks,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: CITEXT 2.0

2008-07-07 Thread David E. Wheeler


On Jul 7, 2008, at 00:46, Zdenek Kotala wrote:

You have to use varstr_cmp  in citextcmp. Your code is correct,  
because for

< <= >= > operators you need collation sensible function.

You need to change only citext_cmp function to use strncmp() or call  
texteq function.


I'm confused. What's the difference between strncmp() and  
varstr_cmp()? And why would I use one in one place and the other  
elsewhere? Shouldn't they be used consistently?


I'm think that this test will work correctly for en_US.UTF-8 at any  
time. I guess the test make sense only when Czech collation  
(cs_CZ.UTF-8) is selected, but unfortunately, you cannot change  
collation during your test :(.


No, I was wondering before what locale was used for initdb on the  
build farm. I mean, how are locale-aware things really tested?


I think, Best solution for now is to keep the test and add comment  
about recommended collation for this test.


Yep, that's what it does.

Thanks,

David


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

1 2 >

1 - 100 of 131 matches

Mail list logo