Re: [HACKERS] PATCH: CITEXT 2.0 v4
On Jul 18, 2008, at 01:39, Michael Paesold wrote: Calling regex functions with the case-insensitivity option would be great. It should also be possible to rewrite replace() into regexp_replace() by first escaping the regex meta characters. Actually re-implementing those functions in a case insensitive way would still be an option, but of course some amount of work. The question is, how much use case there is. I've figured out how to make all the functions work using SQL function workarounds, converting things and re-dispatching to the text versions as appropriate. They work quite well, and can be converted to C later if that becomes a requirement. Meanwhile, on the question of whether or not regular expression and LIKE comparisons *should* match case-insensitively, I have a couple more observations: * Thinking about how a true case-insensitive collation would work, I'm quite certain that it would match case-insensitively. Anything else would just be unexpected, because in a case-insensitive collation, lowercase characters are, in practice, identical to uppercase characters. As far as matching is concerned, there is no difference between them. So the matching operators and functions against CITEXT should follow that assumption. * I tried a few matches on MySQL, where the collation is case- insensitive by default, and it confirms my impression: mysql> select 'Foo' regexp 'o$'; +---+ | 'Foo' regexp 'o$' | +---+ | 1 | +---+ 1 row in set (0.00 sec) mysql> select 'Foo' regexp 'O$'; +---+ | 'Foo' regexp 'O$' | +---+ | 1 | +---+ 1 row in set (0.00 sec) mysql> select 'Foo' like '%o'; +-+ | 'Foo' like '%o' | +-+ | 1 | +-+ 1 row in set (0.00 sec) mysql> select 'Foo' like '%O'; +-+ | 'Foo' like '%O' | +-+ | 1 | +-+ 1 row in set (0.00 sec) I'll grant that MySQL may not be the best model for how things should work, but it's something, at least. Anyone else got access to another database with case-insensitive collations to see how LIKE and regular expressions work? Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v4
On Jul 18, 2008, at 09:53, David E. Wheeler wrote: However, if someone with a lot more C and Pg core knowledge wanted to sit down with me for a couple hours next week and help me bang out these functions, that would be great. I'd love to have the implementation be that much more complete. I've implemented fixes for the regexp_* functions and strpos() in pure SQL, like so: CREATE OR REPLACE FUNCTION regexp_matches( citext, citext ) RETURNS TEXT[] AS ' SELECT regexp_matches( $1::text, $2::text, ''i'' ); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_matches( citext, citext, text ) RETURNS TEXT[] AS ' SELECT regexp_matches( $1::text, $2::text, CASE WHEN strpos($3, ''c'') = 0 THEN $3 || ''i'' ELSE $3 END ); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text ) returns TEXT AS ' SELECT regexp_replace( $1::text, $2::text, $3, ''i''); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text, text ) returns TEXT AS ' SELECT regexp_replace( $1::text, $2::text, $3, CASE WHEN strpos($4, ''c'') = 0 THEN $4 || ''i'' ELSE $4 END); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext ) RETURNS TEXT[] AS ' SELECT regexp_split_to_array( $1::text, $2::text, ''i'' ); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext, text ) RETURNS TEXT[] AS ' SELECT regexp_split_to_array( $1::text, $2::text, CASE WHEN strpos($3, ''c'') = 0 THEN $3 || ''i'' ELSE $3 END ); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext ) RETURNS SETOF TEXT AS ' SELECT regexp_split_to_table( $1::text, $2::text, ''i'' ); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext, text ) RETURNS SETOF TEXT AS ' SELECT regexp_split_to_table( $1::text, $2::text, CASE WHEN strpos($3, ''c'') = 0 THEN $3 || ''i'' ELSE $3 END ); ' LANGUAGE SQL IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION strpos( citext, citext ) RETURNS INT AS ' SELECT strpos( LOWER( $1::text ), LOWER( $2::text ) ); ' LANGUAGE SQL IMMUTABLE STRICT; Not so bad, though it'd be nice to have C functions that just did these things. Still not case-insensitive are: -- replace() -- split_part() -- translate() So, anyone at OSCON this week want to help me with these? Or to convert the above functions to C? Greg? Bruce? Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v4
On Jul 18, 2008, at 01:39, Michael Paesold wrote: Calling regex functions with the case-insensitivity option would be great. It should also be possible to rewrite replace() into regexp_replace() by first escaping the regex meta characters. Actually re-implementing those functions in a case insensitive way would still be an option, but of course some amount of work. The question is, how much use case there is. Not much for me. I might use the regex functions, but would be happy to manually pass the "i" flag. However, if someone with a lot more C and Pg core knowledge wanted to sit down with me for a couple hours next week and help me bang out these functions, that would be great. I'd love to have the implementation be that much more complete. I do believe that, as it stands now in the v4 patch, citext is pretty close to ready, and certainly commit-able. Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v4
David E. Wheeler writes: On Jul 17, 2008, at 03:45, Michael Paesold wrote: Wouldn't it be possible to create a variant of regexp_replace, i.e. regexp_replace(citext,citext,text), which would again lower-case the first two arguments before passing the input to regexp_replace(text,text,text)? Sure, but then you end up with this: template1=# select regexp_replace( 'Fxx'::citext, 'X'::citext, 'o'); regexp_replace foo (1 row) Yeah, you are right, I see. :-) Which is just wrong. I'm going to look at the regex C functions today and see if there's an easy way to just always pass them the 'i' flag, which would do the trick. That still won't help replace(), split_part(), or translate(), however. Calling regex functions with the case-insensitivity option would be great. It should also be possible to rewrite replace() into regexp_replace() by first escaping the regex meta characters. Actually re-implementing those functions in a case insensitive way would still be an option, but of course some amount of work. The question is, how much use case there is. Best Regards Michael Paesold -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v4
On Jul 17, 2008, at 03:45, Michael Paesold wrote: Wouldn't it be possible to create a variant of regexp_replace, i.e. regexp_replace(citext,citext,text), which would again lower-case the first two arguments before passing the input to regexp_replace(text,text,text)? Sure, but then you end up with this: template1=# select regexp_replace( 'Fxx'::citext, 'X'::citext, 'o'); regexp_replace foo (1 row) Which is just wrong. I'm going to look at the regex C functions today and see if there's an easy way to just always pass them the 'i' flag, which would do the trick. That still won't help replace(), split_part(), or translate(), however. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v4
Am 16.07.2008 um 20:38 schrieb David E. Wheeler: The trouble is that, right now: template1=# select regexp_replace( 'fxx'::citext, 'X'::citext, 'o'); regexp_replace fxx (1 row) So there's an inconsistency there. I don't know how to make that work case-insensitively. Wouldn't it be possible to create a variant of regexp_replace, i.e. regexp_replace(citext,citext,text), which would again lower-case the first two arguments before passing the input to regexp_replace(text,text,text)? Best Regards Michael Paesold -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v4
On Jul 16, 2008, at 11:20, Robert Treat wrote: I was thinking about this a bit last night and wanted to fill things out a bit. As a programmer, I find Donald Fraser's hindsight to be more appealing, because at least that way I have the option to do matching against CITEXT strings case-sensitively when I want to. OTOH, if what we want is to have CITEXT work more like a case- insensitive collation, then the expectation from the matching operators and functions might be different. Does anyone have any idea whether regex and LIKE matching against a string in a case- insensitive collation would match case-insensitively or not? If so, then maybe the regex and LIKE ops and funcs *should* match case-insensitively? If not, or if only for some collations, then I would think not. Either way, I know of no way, currently, to allow functions like replace(), split_part(), strpos(), and translate() to match case- insensitiely, even if we wanted to. Anyone have any ideas? * If the answer is "no", how can I make those functions behave case- insensitively? (See the "TODO" tests.) * Should there be any other casts? To and from name, perhaps? AIUI, your propsing the following: select 'x'::citext = 'X'::citext; ?column? -- t (1 row) select 'x'::citext ~ 'X'::citext; ?column? -- f (1 row) I understand the desire for flexibility, but the above seems wierd to me. That's what Donald Fraser suggested, and I see some value in that, but wanted to get some other opinions. And you're right, that does seem a bit weird. The trouble is that, right now: template1=# select regexp_replace( 'fxx'::citext, 'X'::citext, 'o'); regexp_replace fxx (1 row) So there's an inconsistency there. I don't know how to make that work case-insensitively. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v4
On Wednesday 16 July 2008 13:54:25 David E. Wheeler wrote: > On Jul 15, 2008, at 22:23, David E. Wheeler wrote: > > * The README for citext 1.0 on pgFoundry says: > >> I had to make a decision on casting between types for regular > >> expressions and > >> decided that if any parameter is of citext type then case > >> insensitive applies. > >> For example applying regular expressions with a varchar and a > >> citext will > >> produce a case-insensitive result. > >> > >> Having thought about this afterwards I realised that since we have > >> the option > >> to use case-insensitive results with regular expressions I should > >> have left the > >> behaviour exactly as text and then you have the best of both > >> worlds... oh well > >> not hard to change for any of you perfectionists! > > > > I followed the original and made all the regex and LIKE comparisons > > case-insensitive. But maybe I should not have? Especially since the > > regular expression functions (e.g., regexp_replace()) and a few non- > > regex functions (e.g., replace()) still don't behave case- > > insensitively? > > I was thinking about this a bit last night and wanted to fill things > out a bit. > > As a programmer, I find Donald Fraser's hindsight to be more > appealing, because at least that way I have the option to do matching > against CITEXT strings case-sensitively when I want to. > > OTOH, if what we want is to have CITEXT work more like a case- > insensitive collation, then the expectation from the matching > operators and functions might be different. Does anyone have any idea > whether regex and LIKE matching against a string in a case-insensitive > collation would match case-insensitively or not? If so, then maybe the > regex and LIKE ops and funcs *should* match case-insensitively? If > not, or if only for some collations, then I would think not. > > Either way, I know of no way, currently, to allow functions like > replace(), split_part(), strpos(), and translate() to match case- > insensitiely, even if we wanted to. Anyone have any ideas? > > > * If the answer is "no", how can I make those functions behave case- > > insensitively? (See the "TODO" tests.) > > > > * Should there be any other casts? To and from name, perhaps? > AIUI, your propsing the following: select 'x'::citext = 'X'::citext; ?column? -- t (1 row) select 'x'::citext ~ 'X'::citext; ?column? -- f (1 row) I understand the desire for flexibility, but the above seems wierd to me. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v4
On Jul 15, 2008, at 22:23, David E. Wheeler wrote: * The README for citext 1.0 on pgFoundry says: I had to make a decision on casting between types for regular expressions and decided that if any parameter is of citext type then case insensitive applies. For example applying regular expressions with a varchar and a citext will produce a case-insensitive result. Having thought about this afterwards I realised that since we have the option to use case-insensitive results with regular expressions I should have left the behaviour exactly as text and then you have the best of both worlds... oh well not hard to change for any of you perfectionists! I followed the original and made all the regex and LIKE comparisons case-insensitive. But maybe I should not have? Especially since the regular expression functions (e.g., regexp_replace()) and a few non- regex functions (e.g., replace()) still don't behave case- insensitively? I was thinking about this a bit last night and wanted to fill things out a bit. As a programmer, I find Donald Fraser's hindsight to be more appealing, because at least that way I have the option to do matching against CITEXT strings case-sensitively when I want to. OTOH, if what we want is to have CITEXT work more like a case- insensitive collation, then the expectation from the matching operators and functions might be different. Does anyone have any idea whether regex and LIKE matching against a string in a case-insensitive collation would match case-insensitively or not? If so, then maybe the regex and LIKE ops and funcs *should* match case-insensitively? If not, or if only for some collations, then I would think not. Either way, I know of no way, currently, to allow functions like replace(), split_part(), strpos(), and translate() to match case- insensitiely, even if we wanted to. Anyone have any ideas? * If the answer is "no", how can I make those functions behave case- insensitively? (See the "TODO" tests.) * Should there be any other casts? To and from name, perhaps? Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 15, 2008, at 20:26, Tom Lane wrote: "David E. Wheeler" <[EMAIL PROTECTED]> writes: So I guess my question is: what is wrong with the properties for citextsend/citextrecv [ checks catalogs... ] textsend and textrecv are marked STABLE not IMMUTABLE. I am not totally sure about the reasoning offhand --- it might be because their behavior depends on client_encoding. Thanks. Looks like maybe the xtypes docs need to be updated? http://www.postgresql.org/docs/8.3/static/xtypes.html Anyway, changing them to "STABLE STRICT" appears to have done the trick (diff attached). and what else might these failures be indicating is wrong? I think the other diffs are okay, they just reflect the fact that you're depending on binary equivalence of text and citext. Great, thanks. And with that, I think I'm just about ready to submit a new version of the patch, coming up shortly. Best, David regression.diffs Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > So I guess my question is: what is wrong with the properties for > citextsend/citextrecv [ checks catalogs... ] textsend and textrecv are marked STABLE not IMMUTABLE. I am not totally sure about the reasoning offhand --- it might be because their behavior depends on client_encoding. > and what else might these failures be indicating > is wrong? I think the other diffs are okay, they just reflect the fact that you're depending on binary equivalence of text and citext. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 15, 2008, at 12:56, Tom Lane wrote: Don't run the tests in a read-only directory, perhaps. Yes, I changed the owner to the postgres system user and that did the trick. Or do they matter for sanity-checking citext? Hard to tell --- I'd suggest trying to get a clean run. As for what you have, the first diff hunk suggests you've got the wrong function properties for citextsend/citextrecv. Here's the new diff: *** ./expected/opr_sanity.out Mon Jul 14 21:55:49 2008 --- ./results/opr_sanity.outTue Jul 15 17:41:03 2008 *** *** 87,94 p1.provolatile != p2.provolatile OR p1.pronargs != p2.pronargs); oid | proname | oid | proname ! -+-+-+- ! (0 rows) -- Look for uses of different type OIDs in the argument/result type fields -- for different aliases of the same built-in function. --- 87,96 p1.provolatile != p2.provolatile OR p1.pronargs != p2.pronargs); oid | proname | oid | proname ! --+--+---+ ! 2414 | textrecv | 87258 | citextrecv ! 2415 | textsend | 87259 | citextsend ! (2 rows) -- Look for uses of different type OIDs in the argument/result type fields -- for different aliases of the same built-in function. *** *** 110,117 prorettype | prorettype + 25 | 1043 1114 | 1184 ! (2 rows) SELECT DISTINCT p1.proargtypes[0], p2.proargtypes[0] FROM pg_proc AS p1, pg_proc AS p2 --- 112,120 prorettype | prorettype + 25 | 1043 + 25 | 87255 1114 | 1184 ! (3 rows) SELECT DISTINCT p1.proargtypes[0], p2.proargtypes[0] FROM pg_proc AS p1, pg_proc AS p2 *** *** 124,133 -+- 25 |1042 25 |1043 1114 |1184 1560 |1562 2277 |2283 ! (5 rows) SELECT DISTINCT p1.proargtypes[1], p2.proargtypes[1] FROM pg_proc AS p1, pg_proc AS p2 --- 127,138 -+- 25 |1042 25 |1043 + 25 | 87255 + 1042 | 87255 1114 |1184 1560 |1562 2277 |2283 ! (7 rows) SELECT DISTINCT p1.proargtypes[1], p2.proargtypes[1] FROM pg_proc AS p1, pg_proc AS p2 *** *** 139,148 proargtypes | proargtypes -+- 23 | 28 1114 |1184 1560 |1562 2277 |2283 ! (4 rows) SELECT DISTINCT p1.proargtypes[2], p2.proargtypes[2] FROM pg_proc AS p1, pg_proc AS p2 --- 144,154 proargtypes | proargtypes -+- 23 | 28 + 25 | 87255 1114 |1184 1560 |1562 2277 |2283 ! (5 rows) SELECT DISTINCT p1.proargtypes[2], p2.proargtypes[2] FROM pg_proc AS p1, pg_proc AS p2 *** *** 305,311 142 | 25 |0 | a 142 | 1043 |0 | a 142 | 1042 |0 | a ! (6 rows) -- pg_operator -- Look for illegal values in pg_operator fields. --- 311,318 142 | 25 |0 | a 142 | 1043 |0 | a 142 | 1042 |0 | a ! 87255 | 1042 |0 | a ! (7 rows) -- pg_operator -- Look for illegal values in pg_operator fields. == So I guess my question is: what is wrong with the properties for citextsend/citextrecv and what else might these failures be indicating is wrong? CREATE OR REPLACE FUNCTION citextrecv(internal) RETURNS citext AS 'textrecv' LANGUAGE 'internal' IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION citextsend(citext) RETURNS bytea AS 'textsend' LANGUAGE 'internal' IMMUTABLE STRICT; CREATE TYPE citext ( INPUT = citextin, OUTPUT = citextout, RECEIVE= citextrecv, SEND = citextsend, INTERNALLENGTH = VARIABLE, STORAGE= extended ); Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > Well now that was cool to see. I got some failures, of course, but > nothing stands out to me as an obvious bug. I attach the diffs file > (with the citext.sql failure removed) for your perusal. What would be > the best way for me to resolve those permission issues? Don't run the tests in a read-only directory, perhaps. > Or do they matter for sanity-checking citext? Hard to tell --- I'd suggest trying to get a clean run. As for what you have, the first diff hunk suggests you've got the wrong function properties for citextsend/citextrecv. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 15, 2008, at 07:09, Tom Lane wrote: Yeah, probably. I don't think the "make check" path will support it because it doesn't install contrib into the temp installation. (You'd also need to have put the extra entry in parallel_schedule not serial_schedule, but it's not gonna work anyway.) Well now that was cool to see. I got some failures, of course, but nothing stands out to me as an obvious bug. I attach the diffs file (with the citext.sql failure removed) for your perusal. What would be the best way for me to resolve those permission issues? Or do they matter for sanity-checking citext? Thanks, David regression.diffs Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > Okay, I copied citext.sql into src/test/regress/sql and then added > "test: citext" to the top of src/test/regress/serial_schedule. Then I > ran `make check`. All tests passed, but I don't think that citext was > tested. > Do I need to install the server, build a cluster, and run `make > installcheck`? Yeah, probably. I don't think the "make check" path will support it because it doesn't install contrib into the temp installation. (You'd also need to have put the extra entry in parallel_schedule not serial_schedule, but it's not gonna work anyway.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 14, 2008, at 07:08, Tom Lane wrote: You're really making it into another test. Just copy the citext.sql file into the sql/ subdirectory and add a "citext" entry to the schedule file. The last time I did this, I had to at least "touch" a corresponding expected file (expected/citext.out) or pg_regress would go belly-up without running the rest of the tests. There's no special need to make the expected file match, of course, since you can just ignore the "failure" report from that one test. Okay, I copied citext.sql into src/test/regress/sql and then added "test: citext" to the top of src/test/regress/serial_schedule. Then I ran `make check`. All tests passed, but I don't think that citext was tested. Do I need to install the server, build a cluster, and run `make installcheck`? Thanks for the hand-holding. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 12, 2008, at 15:13, David E. Wheeler wrote: 2. It's ridiculously slow; at least a factor of ten slower than doing equivalent tests directly in SQL. This is a very bad thing. Speed of regression tests matters a lot to those of us who run them a dozen times per day --- and I do not wish to discourage any developers who don't work that way from learning better habits ;-) Hrm. I'm wonder why it's so slow? The test functions don't really do a lot. Anyway, I agree that they should perform well. Just as an FYI, I've just moved all the tests to regular SQL instead of using pgTAP. The difference in runtime is: psql -Xd try -f sql/citext.sql 0.03s user 0.02s system 19% cpu 0.253 total psql -Xd try -f sql/citext.sql 0.03s user 0.02s system 4% cpu 1.298 total So it's close to a factor of five, though subtract .125 for the time to load the pgTAP functions. The pgTAP tests *are* doing a lot more work, but I'm sure that they could be made a lot more efficient, though of course the TAP functions will always introduce some overhead. One just needs to decide whether the tradeoffs are worth it. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 14, 2008, at 11:49, Tom Lane wrote: You don't seem to understand what I'm suggesting: I'm not talking about testing strcoll. I'm talking about making sure that citext *uses* strcoll. This seems pointless, as well as essentially impossible to enforce by black-box testing. Nobody is likely to consider a patch that removes such obviously basic functionality of the module. Never underestimate the human capacity for incompetence. One of my mottos. I think you're worrying about testing the wrong things --- and as evidence I'd note the serious errors that slipped by your testing (including the fact that the last submitted version would still claim that 'a' = 'ab'). Um, say what? I had that problem at one time, but it was when I was writing my own lower() function, which was shitty (I wasn't creating a nul-terminated string). I don't recall that being in any patch I sent to the list, and indeed you wrote that very test in the revised test script you sent in. At any rate, I make no claims that my tests properly covered everything. There is a lot I didn't know to test, but I'm learning more all the time. That's why this code review has been so immensely valuable, both for me and for CITEXT. What we generally ask a regression test to do is catch portability problems (which is certainly a real-enough issue here, but we don't know how to address it well) or catch foreseeable breakage (such as someone introducing a cast that breaks resolution of citext function calls). The methodology can't catch everything, and trying to push it beyond what it can do is just a time sink for effort that can more usefully be spent other ways, such as code-reading. I guess what I'm thinking of is not regression tests but unit tests. And with unit testing, one of the goals is coverage. Good coverage has saved my ass many times in the past, even catching changes that, in hindsight, should obviously never have been attempted. Perhaps it'd be worth thinking about creating a project for unit testing PostgreSQL in controlled environments? Maybe having tests that can contain proper conditionals? Anyway, it seems that, given the limitations of the current testing system with regard to testing multibyte characters, the issue is moot. I'll throw in a few tests that test multibyte characters, but I'll comment them out and just uncomment them for individual test runs on my box for the purposes of my own sanity going forward. Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > On Jul 14, 2008, at 11:06, Tom Lane wrote: >> Let me put it another way: if we test on another platform and find out >> that strcoll's behavior is different there, are you going to fix that >> version of strcoll? No, you're not. So you might as well just test >> the >> behavior of the code that's actually under your control. > You don't seem to understand what I'm suggesting: I'm not talking > about testing strcoll. I'm talking about making sure that citext > *uses* strcoll. This seems pointless, as well as essentially impossible to enforce by black-box testing. Nobody is likely to consider a patch that removes such obviously basic functionality of the module. I think you're worrying about testing the wrong things --- and as evidence I'd note the serious errors that slipped by your testing (including the fact that the last submitted version would still claim that 'a' = 'ab'). What we generally ask a regression test to do is catch portability problems (which is certainly a real-enough issue here, but we don't know how to address it well) or catch foreseeable breakage (such as someone introducing a cast that breaks resolution of citext function calls). The methodology can't catch everything, and trying to push it beyond what it can do is just a time sink for effort that can more usefully be spent other ways, such as code-reading. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 14, 2008, at 11:06, Tom Lane wrote: [ shrug... ] Seems pretty useless to me: we already know that it works for you. The point of a regression test in my mind is to make sure that it works for everybody. Given the platform variations involved in strcoll's behavior, the definition of "works for everybody" is going to be pretty darn narrowly circumscribed anyway, and thus I don't have a big problem with restricting the tests to ASCII cases. Neither do I, as long as there is *some* context to ensure that the type remains locale-aware. We only know that it works for me because I've written the tests. Let me put it another way: if we test on another platform and find out that strcoll's behavior is different there, are you going to fix that version of strcoll? No, you're not. So you might as well just test the behavior of the code that's actually under your control. You don't seem to understand what I'm suggesting: I'm not talking about testing strcoll. I'm talking about making sure that citext *uses* strcoll. Whether or not strcoll actually works properly is not my concern. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > On Jul 14, 2008, at 07:24, Tom Lane wrote: >> The fallacy in that proposal is the assumption that there are only two >> behaviors out there. > Well, no, that's not the assumption at all. The assumption is that the > type works properly with multibyte characters under multibyte-aware > locales. So I want to have tests to ensure that such is true by having > multibyte characters run under a very specific locale and platform. [ shrug... ] Seems pretty useless to me: we already know that it works for you. The point of a regression test in my mind is to make sure that it works for everybody. Given the platform variations involved in strcoll's behavior, the definition of "works for everybody" is going to be pretty darn narrowly circumscribed anyway, and thus I don't have a big problem with restricting the tests to ASCII cases. Let me put it another way: if we test on another platform and find out that strcoll's behavior is different there, are you going to fix that version of strcoll? No, you're not. So you might as well just test the behavior of the code that's actually under your control. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 14, 2008, at 10:58, Tom Lane wrote: Well, unless the display of the characters would be broken (the build farm databases use UTF-8, no?), No, they use C. Um, you mean SQL_ASCII? Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > Well, unless the display of the characters would be broken (the build > farm databases use UTF-8, no?), No, they use C. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 14, 2008, at 10:55, Andrew Dunstan wrote: No. --no-locale is the same as --locale=C which means the encoding is SQL_ASCII. I've used --no-locale --encoding UTF-8 many times. But if the regressions run with SQL_ASCII…well, I'm just amazed that there haven't been more unexpected side-effects to source code changes without controlling for such variables in tests. Seems like a lot to keep in one's head. Anyway, are tests for contrib modules run on the build farms? If so, I could still potentially get around this issue by turning off ECHO. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
David E. Wheeler wrote: On Jul 14, 2008, at 06:05, Andrew Dunstan wrote: You can certainly add any tests you like. But the buildfarm does all its regression tests using an install done with --no-locale. Any test that requires some locale to work, or make sense, should probably not be in your Makefile's REGRESS target. Well, unless the display of the characters would be broken (the build farm databases use UTF-8, no?), No. --no-locale is the same as --locale=C which means the encoding is SQL_ASCII. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 14, 2008, at 07:26, Tom Lane wrote: I'd like to keep these tests, since they ensure not just that the functions work but that they work with citext. It might be reasonable to test a couple of them for that purpose. If your agenda is "test every function in the system that comes or might come in a bpchar variant", I think that's pointless. Or a varchar variant, or where such a variant might be added in the future. To my mind, it's important to have good coverage in my unit tests to ensure that things continue to work exactly the same over time. So, since the tests are already written, and are unlikely to add more than a few milliseconds to test runtime, can you at least agree that such tests are harmless? Updated patch later today. Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 14, 2008, at 07:24, Tom Lane wrote: "David E. Wheeler" <[EMAIL PROTECTED]> writes: Could I supply two comparison files, one for Mac OS X with en_US.UTF-8 and one for everything else, as described in the last three paragraphs here? The fallacy in that proposal is the assumption that there are only two behaviors out there. Well, no, that's not the assumption at all. The assumption is that the type works properly with multibyte characters under multibyte-aware locales. So I want to have tests to ensure that such is true by having multibyte characters run under a very specific locale and platform. I don't really care what platform or locale; the point is to make sure that the type is actually multibyte-aware. Let me recalibrate your thoughts a bit: so far I have tried citext on three different machines (Mac, Fedora 8, HPUX), and I got three different answers from those tests. That's despite endeavoring to make the database locales match ... which is less than trivial in itself because they use three slightly different spellings of "en_US.UTF8". This is a truly pitiful state of affairs. Rhetorical question: Why is there no standardization of locales? I'm sure there are a lot of opinions out there (should all uppercase chars should precede all lowercase chars or be mixed in with lowercase chars), but I should think that, in this day and age, there would be some sort of standard defining locales and how they work -- and to allow such opinions to be expressed by different locales, not in the same locale names on different platforms. Given that you were more or less deliberately testing corner cases, I think it's quite likely that the number of observable reactions from N platforms would be more nearly O(N) than O(1). To me they're not corner cases. To me it is just, "given a specific platform/locale, does CITEXT respect the locale's rules?" I don't care to test all platforms and locales (I'm not *that* stupid :-)). In the real world, to the extent that we are able to control the locale of the regression tests, we make it "C" --- and to a large extent we can't control it at all, which means you have another uncontrolled variable besides platform. So in the current universe there is absolutely no value in submitting locale-specific tests for a contrib module. Then how do we know that it will continue to be locale-aware over time? Someone could replace the comparison function with one that just lowercases ASCII characters, like CITEXT 1 does, and no tests would fail. How do you prevent that from happening without being hyper- vigilant (and never leaving the project, I might add)? I see some discussion in the thread about improving the situation, but until we are able to decouple database locale from environment locale, I doubt we'll be able to do a whole lot about automating this kind of test. There are too many variables at the moment. Is the decoupling of database locale from environment locale likely to happen anytime soon? Now that I've written CITEXT, I dare say that such might become my top-desired feature (aside from replication). Thanks for the discussion, much appreciated, and I'm learning a ton. I retain the right to be opinionated, however. ;-) Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 14, 2008, at 06:05, Andrew Dunstan wrote: You can certainly add any tests you like. But the buildfarm does all its regression tests using an install done with --no-locale. Any test that requires some locale to work, or make sense, should probably not be in your Makefile's REGRESS target. Well, unless the display of the characters would be broken (the build farm databases use UTF-8, no?), the output would look like this, which should more or less work (I'm hoping): SELECT citext_larger( 'Â'::citext, 'ç'::citext ) = 'ç' AS t WHERE false and test_multibyte(); t --- (0 rows) Some time ago I raised the question of doing locale- and encoding- aware regression tests, and IIRC the consensus seemed to be that it was not worth the effort, but maybe we need to revisit that. We certainly seem to be doing a lot more work now to get Postgres to where it needs to be w.r.t. locale support, and we've cleaned up a lot of the encoding issues we had, so maybe we need to reflect that in our testing regime more. Makes sense to me. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > On Jul 12, 2008, at 14:57, Tom Lane wrote: >> 4. A lot of the later test cases are equally uselessly testing whether >> piggybacking over text functions works. > I'd like to keep these tests, since they ensure not just that the > functions work but that they work with citext. It might be reasonable to test a couple of them for that purpose. If your agenda is "test every function in the system that comes or might come in a bpchar variant", I think that's pointless. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > Could I supply two comparison files, one for Mac OS X with en_US.UTF-8 > and one for everything else, as described in the last three paragraphs > here? The fallacy in that proposal is the assumption that there are only two behaviors out there. Let me recalibrate your thoughts a bit: so far I have tried citext on three different machines (Mac, Fedora 8, HPUX), and I got three different answers from those tests. That's despite endeavoring to make the database locales match ... which is less than trivial in itself because they use three slightly different spellings of "en_US.UTF8". Given that you were more or less deliberately testing corner cases, I think it's quite likely that the number of observable reactions from N platforms would be more nearly O(N) than O(1). In the real world, to the extent that we are able to control the locale of the regression tests, we make it "C" --- and to a large extent we can't control it at all, which means you have another uncontrolled variable besides platform. So in the current universe there is absolutely no value in submitting locale-specific tests for a contrib module. I see some discussion in the thread about improving the situation, but until we are able to decouple database locale from environment locale, I doubt we'll be able to do a whole lot about automating this kind of test. There are too many variables at the moment. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > On Jul 13, 2008, at 10:19, Tom Lane wrote: >> BTW, actually a better idea would be to put citext.sql at the front of >> the list and just run the whole main regression series with it >> present. >> typ_sanity and oidjoins might possibly find issues too. > Um, stupid question (sorry, I'm not familiar with how the regression > tests are organized): serial_schedule doesn't look like a file into > which I can insert an SQL file. How would I do that? You're really making it into another test. Just copy the citext.sql file into the sql/ subdirectory and add a "citext" entry to the schedule file. The last time I did this, I had to at least "touch" a corresponding expected file (expected/citext.out) or pg_regress would go belly-up without running the rest of the tests. There's no special need to make the expected file match, of course, since you can just ignore the "failure" report from that one test. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
David E. Wheeler wrote: (If we get to having per-database collations in 8.4 then integrating a test with a non-default collation would get a lot easier, of course; but for the moment I'm afraid you've got to work with what's there.) Could I supply two comparison files, one for Mac OS X with en_US.UTF-8 and one for everything else, as described in the last three paragraphs here? http://www.postgresql.org/docs/current/static/regress-variant.html That way, I can at least make sure that the multibyte stuff *does* work. Even if it's tested on only one platform, the purpose is not to test a particular collation, but to test that CITEXT is actually sensitive to locale. You can certainly add any tests you like. But the buildfarm does all its regression tests using an install done with --no-locale. Any test that requires some locale to work, or make sense, should probably not be in your Makefile's REGRESS target. Some time ago I raised the question of doing locale- and encoding-aware regression tests, and IIRC the consensus seemed to be that it was not worth the effort, but maybe we need to revisit that. We certainly seem to be doing a lot more work now to get Postgres to where it needs to be w.r.t. locale support, and we've cleaned up a lot of the encoding issues we had, so maybe we need to reflect that in our testing regime more. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 12, 2008, at 14:57, Tom Lane wrote: 4. A lot of the later test cases are equally uselessly testing whether piggybacking over text functions works. The odds of ever finding anything with those tests are not distinguishable from zero (unless the underlying text function is busted, which is not your responsibility to test). So I don't see any point in putting them into the standard regression package. (What maybe *would* be useful to test, but you didn't, is whether the result of a function is considered citext rather than text.) I'd like to keep these tests, since they ensure not just that the functions work but that they work with citext. Given what we found with length() and friends not working when there was an implicit cast to bpchar, I think it's valuable to continue to ensure that these functions work as expected with citext. Otherwise someone in the future might come along and make the cast to bpchar implicit again, and no tests would fail to tell him/her otherwise. These tests make good regressions. Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 13, 2008, at 10:31, Tom Lane wrote: Grr. Kind of defeats the purpose. Is there no infrastructure for testing multibyte functionality? There's some stuff under src/test/locale and src/test/mb, though it doesn't get exercised regularly. The contrib tests are a particularly bad place for trying to do any locale-dependent testing, because we only support "make installcheck" which means there is no way to set the database locale --- you *have to* work with whatever locale is predetermined by the postmaster you're pointed at. I don't recall the reason for not supporting "make check" in the contrib modules; maybe it was just that preparing a test installation for each contrib module sounded too slow? In any case, what you'd need to pursue is having some additional tests available that are not executed by the standard contrib regression test sequence. (If we get to having per-database collations in 8.4 then integrating a test with a non-default collation would get a lot easier, of course; but for the moment I'm afraid you've got to work with what's there.) Could I supply two comparison files, one for Mac OS X with en_US.UTF-8 and one for everything else, as described in the last three paragraphs here? http://www.postgresql.org/docs/current/static/regress-variant.html That way, I can at least make sure that the multibyte stuff *does* work. Even if it's tested on only one platform, the purpose is not to test a particular collation, but to test that CITEXT is actually sensitive to locale. Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 13, 2008, at 10:19, Tom Lane wrote: You might try running the opr_sanity regression test on this module to see if it finds any other silliness. (Procedure: insert the citext definition script into the serial_schedule list just ahead of opr_sanity, run tests, make sure you understand the reason for any diffs in the opr_sanity result. There will be at least one from the uses of text-related functions for citext.) Thanks. Added to my list. BTW, actually a better idea would be to put citext.sql at the front of the list and just run the whole main regression series with it present. typ_sanity and oidjoins might possibly find issues too. Um, stupid question (sorry, I'm not familiar with how the regression tests are organized): serial_schedule doesn't look like a file into which I can insert an SQL file. How would I do that? Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > To judge by the User-Defined Types docs, I was close. :-) I just > changed the argument to citextrecv to "internal". Right. The APIs for send and recv aren't inverses. (Oh, the sins we'll commit in the name of performance ...) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 13, 2008, at 16:06, David E. Wheeler wrote: Should those return bytea and citext, respectively? IOW, are these right? CREATE OR REPLACE FUNCTION citextrecv(bytea) RETURNS citext AS 'textrecv' LANGUAGE 'internal' IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION citextsend(citext) RETURNS bytea AS 'textsend' LANGUAGE 'internal' IMMUTABLE STRICT; To judge by the User-Defined Types docs, I was close. :-) I just changed the argument to citextrecv to "internal". Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 13, 2008, at 10:19, Tom Lane wrote: I'm confused. Is that not what the citextin and citextout functions are? No, those are text I/O. You need analogues of textsend and textrecv too. Should those return bytea and citext, respectively? IOW, are these right? CREATE OR REPLACE FUNCTION citextrecv(bytea) RETURNS citext AS 'textrecv' LANGUAGE 'internal' IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION citextsend(citext) RETURNS bytea AS 'textsend' LANGUAGE 'internal' IMMUTABLE STRICT; Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 13, 2008, at 10:19, Tom Lane wrote: "David E. Wheeler" <[EMAIL PROTECTED]> writes: On Jul 12, 2008, at 12:17, Tom Lane wrote: * You should provide binary I/O (send/receive) functions, if you want this to be an industrial-strength module. It's easy since you can piggyback on text's. I'm confused. Is that not what the citextin and citextout functions are? No, those are text I/O. You need analogues of textsend and textrecv too. Okay. Thanks. Added to my list. BTW, actually a better idea would be to put citext.sql at the front of the list and just run the whole main regression series with it present. typ_sanity and oidjoins might possibly find issues too. Also added to my list. :-) Some (not all) of your CREATE OPERATOR commands have things like NEGATOR = OPERATOR(!~), which seems unnecessary, and is certainly inconsistent. Oh, I hadn't even noticed those; I'd just copied them from citext 1. Fixed! Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 13, 2008, at 10:16, Tom Lane wrote: Hmm. I think what that actually means is that the cast from citext to bpchar should be AS ASSIGNMENT rather than IMPLICIT. What is happening is that the system can't figure out whether to use length(text) or length(bpchar) when presented with a citext argument. I had been thinking yesterday that it would automatically prefer length(text) because text is a "preferred type", but after tracing through it I see that that doesn't happen because citext is not thought to be of the string category. (We really need a way to let user-defined types specify their category...) That'd be nice. The fact that you need all these piggyback functions is a red flag because what it implies is that citext will not work nicely for any situation where both text and bpchar functions have been provided --- and that includes user-added functions, so it's hopeless to think that you can get to a solution this way. Downgrading the cast seems like the right thing to me. Yes, that works for me. I've downgraded it and can now remove the size functions and all the tests still pass. The implicit cast to varchar is a bit worrisome because it creates the same issue if someone has provided both varchar and text versions of a function. However, that seems a bit pointless given the lack of semantic difference, and I suspect that a lot of user-written functions come only in varchar variants --- so on balance my recommendation is to keep that one as implicit. Yes, I agree. Thanks for tracing this out, Tom, it never would have ocurred to me -- and now I know more than I did before! Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > On Jul 12, 2008, at 14:57, Tom Lane wrote: >> Sadly, I think you have to give up >> attempts to check the interesting multibyte cases and confine yourself >> to tests using ASCII strings. > Grr. Kind of defeats the purpose. Is there no infrastructure for > testing multibyte functionality? There's some stuff under src/test/locale and src/test/mb, though it doesn't get exercised regularly. The contrib tests are a particularly bad place for trying to do any locale-dependent testing, because we only support "make installcheck" which means there is no way to set the database locale --- you *have to* work with whatever locale is predetermined by the postmaster you're pointed at. I don't recall the reason for not supporting "make check" in the contrib modules; maybe it was just that preparing a test installation for each contrib module sounded too slow? In any case, what you'd need to pursue is having some additional tests available that are not executed by the standard contrib regression test sequence. (If we get to having per-database collations in 8.4 then integrating a test with a non-default collation would get a lot easier, of course; but for the moment I'm afraid you've got to work with what's there.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > On Jul 12, 2008, at 12:17, Tom Lane wrote: >> * You should provide binary I/O (send/receive) functions, if you want >> this to be an industrial-strength module. It's easy since you can >> piggyback on text's. > I'm confused. Is that not what the citextin and citextout functions are? No, those are text I/O. You need analogues of textsend and textrecv too. >> You might try running the >> opr_sanity regression test on this module to see if it finds any >> other silliness. (Procedure: insert the citext definition script >> into the serial_schedule list just ahead of opr_sanity, run tests, >> make sure you understand the reason for any diffs in the opr_sanity >> result. There will be at least one from the uses of text-related >> functions for citext.) > Thanks. Added to my list. BTW, actually a better idea would be to put citext.sql at the front of the list and just run the whole main regression series with it present. typ_sanity and oidjoins might possibly find issues too. >> * Don't use the OPERATOR() notation when you don't need to. >> It's just clutter. > Sorry, don't know what you're referring to here. Some (not all) of your CREATE OPERATOR commands have things like NEGATOR = OPERATOR(!~), which seems unnecessary, and is certainly inconsistent. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > When I deleted any of the others, I got errors like this: > psql:sql/citext.sql:865: ERROR: function length(citext) is not unique > LINE 1: SELECT is( length( name ), length(name::text), 'length("' ||... > ^ > HINT: Could not choose a best candidate function. You might need to > add explicit type casts. Hmm. I think what that actually means is that the cast from citext to bpchar should be AS ASSIGNMENT rather than IMPLICIT. What is happening is that the system can't figure out whether to use length(text) or length(bpchar) when presented with a citext argument. I had been thinking yesterday that it would automatically prefer length(text) because text is a "preferred type", but after tracing through it I see that that doesn't happen because citext is not thought to be of the string category. (We really need a way to let user-defined types specify their category...) The fact that you need all these piggyback functions is a red flag because what it implies is that citext will not work nicely for any situation where both text and bpchar functions have been provided --- and that includes user-added functions, so it's hopeless to think that you can get to a solution this way. Downgrading the cast seems like the right thing to me. The implicit cast to varchar is a bit worrisome because it creates the same issue if someone has provided both varchar and text versions of a function. However, that seems a bit pointless given the lack of semantic difference, and I suspect that a lot of user-written functions come only in varchar variants --- so on balance my recommendation is to keep that one as implicit. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 12, 2008, at 14:50, David E. Wheeler wrote: * An explicit comment explaining that you're piggybacking on the I/O functions (and some others) for "text" wouldn't be out of place. I've added SQL comments. Were you talking about a COMMENT? * Lose the GRANT EXECUTEs on the I/O functions; they're redundant. (If you needed them, you'd need them on a lot more than these two.) I'd be inclined to lose the COMMENTs on the functions too; again these are about the least useful ones to comment on out of the whole module. I wondered about that; those were copied from CITEXT 1. I've removed all GRANTs and COMMENTs. * You should provide binary I/O (send/receive) functions, if you want this to be an industrial-strength module. It's easy since you can piggyback on text's. I'm confused. Is that not what the citextin and citextout functions are? * The type declaration needs to say storage = extended, else the type won't be toastable. Ah, good, thanks. * The cast from bpchar to citext cannot be WITHOUT FUNCTION; it needs to invoke rtrim1. Compare the bpchar to text cast. Where do I find that? I have trouble finding the SQL that creates the core types. :-( Duh, you just told me. Added, thanks. * <= is surely not its own commutator. Changed to >=. You might try running the opr_sanity regression test on this module to see if it finds any other silliness. (Procedure: insert the citext definition script into the serial_schedule list just ahead of opr_sanity, run tests, make sure you understand the reason for any diffs in the opr_sanity result. There will be at least one from the uses of text-related functions for citext.) Thanks. Added to my list. * I think you can and should drop all of the "size" functions and a lot of the "miscellaneous" functions: anyplace where the implicit coercion to text would serve, you don't need a piggyback function, and introducing one just creates risks of can't-resolve-ambiguous-function failures. The overloaded miscellaneous functions are only justifiable to the extent that it's important to preserve "citext-ness" of the result of a function, which seems at best dubious for many of these. It is likewise pretty pointless AFAICS to introduce regex functions taking citext pattern arguments. I added most of those as I wrote tests and they failed to find the functions. Once I added the functions, they worked. But I'll do an audit to make sure that I didn't inadvertantly leave in any unneeded ones (I'm happy to have less code :-)). Of the size functions, I was able to remove only this one and keep all of my pgTAP tests passing: CREATE FUNCTION textlen(citext) RETURNS int4 AS 'textlen' LANGUAGE 'internal' IMMUTABLE STRICT; When I deleted any of the others, I got errors like this: psql:sql/citext.sql:865: ERROR: function length(citext) is not unique LINE 1: SELECT is( length( name ), length(name::text), 'length("' ||... ^ HINT: Could not choose a best candidate function. You might need to add explicit type casts. I think you can see now why I wrote the tests: I wanted to ensure that CITEXT really does work (almost) just like TEXT. I was able to eliminate *all* of the miscellaneous functions, but the upper() and lower() functions now return TEXT instead of CITEXT, which I don't think is exactly what we want, is it? For now, I'e left upper() and lower() in. It just seems like more expected functionality. * Don't use the OPERATOR() notation when you don't need to. It's just clutter. Sorry, don't know what you're referring to here. CREATE OPERATOR appears to require parens… Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 12, 2008, at 15:13, David E. Wheeler wrote: Sadly, I think you have to give up attempts to check the interesting multibyte cases and confine yourself to tests using ASCII strings. Grr. Kind of defeats the purpose. Is there no infrastructure for testing multibyte functionality? Are test database clusters all built using SQL_ASCII and the C locale? I just tried to take your modified tests and add multibyte tests that run only on OS X with en_US.UTF-8. They worked like this: CREATE OR REPLACE FUNCTION test_multibyte () RETURNS BOOLEAN AS $$ SELECT version() ~ 'apple-darwin' AND (select setting from pg_settings where name = 'lc_collate') = 'en_US.UTF-8'; $$ LANGUAGE SQL IMMUTABLE; SELECT 'À'::citext = 'À'::citext WHERE test_multibyte() = true; SELECT 'À'::citext = 'à'::citext WHERE test_multibyte() = true; But then I realized that this would change the expected output depending on the platform, and thus make the tests fail. This is one reason why the inflexibility of the existing regression test model is a drag: it limits you to testing only what works everywhere! Grrr. I'll remove all the multibyte character tests, but I have to say that, as a result, the CITEXT 1 module would likely pass all such tests, but it still isn't locale-aware. How can one add regressions to ensure that something truly is locale-aware? Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 12, 2008, at 14:57, Tom Lane wrote: There was some discussion earlier about whether the proposed regression tests for citext are suitable for use in contrib or not. After playing with them for awhile, I have to come down very firmly on the side of "not". I have these gripes: You're nothing if not thorough, Tom. 1. The style is gratuitously different from every other regression test in the system. This is not a good thing. If it were an amazingly better way to do things, then maybe, but as far as I can tell the style pgTAP forces on you is really pretty darn poorly suited for SQL tests. You have to contort what could naturally be expressed in SQL as a table result into a scalar. Plus it's redundant with the expected-output file. Sure. I wasn't aware of the existing regression test methodology when I wrote pgTAP and those tests. I'm fine to change them and use pgTAP for testing my app code, rather than PostgreSQL itself. 2. It's ridiculously slow; at least a factor of ten slower than doing equivalent tests directly in SQL. This is a very bad thing. Speed of regression tests matters a lot to those of us who run them a dozen times per day --- and I do not wish to discourage any developers who don't work that way from learning better habits ;-) Hrm. I'm wonder why it's so slow? The test functions don't really do a lot. Anyway, I agree that they should perform well. Because of #1 and #2 I find the use of pgTAP to be a nonstarter. I have a couple of gripes about the substance of the tests as well: 3. What you are mostly testing is not the behavior of citext itself, but the behavior of the underlying strcoll function. This is pretty much doomed to failure, first because the regression tests are intended to work in multiple locales (and we are *not* giving that up for citext), and second because the behavior of strcoll isn't all that portable across platforms even given allegedly similar locale settings (as we already found out yesterday). Yes, I *just* ran the tests on Ubuntu and opened my mail to ask about the bizarre differences when I saw this message. Thanks for answering my question before I asked it. :-) Sadly, I think you have to give up attempts to check the interesting multibyte cases and confine yourself to tests using ASCII strings. Grr. Kind of defeats the purpose. Is there no infrastructure for testing multibyte functionality? Are test database clusters all built using SQL_ASCII and the C locale? 4. A lot of the later test cases are equally uselessly testing whether piggybacking over text functions works. The odds of ever finding anything with those tests are not distinguishable from zero (unless the underlying text function is busted, which is not your responsibility to test). So I don't see any point in putting them into the standard regression package. (What maybe *would* be useful to test, but you didn't, is whether the result of a function is considered citext rather than text.) Well, I was doing test-driven development: I wrote the tests to ensure that I had added the functions for CITEXT properly, and when they passed, I could move on. As a unit tester, it'd feel weird for me to drop these. I'm not testing the underlying functions; I'm making sure that they work properly with CITEXT. I suggest something more like the attached as a suitable regression test. I got bored about halfway through and started to skim, so there might be a few tests toward the end of the original set that would be worth transposing into this one, but nothing jumped out at me. Thanks! I'll work this in. Best, David PS: I confirmed late yesterday that the memory leak I saw was with my version for 8.3, where I had my own str_tolower(). It clearly has a leak that I'll have to fix, but it has no bearing on the contribution to 8.4, which has no such leak. Thanks for running the test yourself to confirm. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
There was some discussion earlier about whether the proposed regression tests for citext are suitable for use in contrib or not. After playing with them for awhile, I have to come down very firmly on the side of "not". I have these gripes: 1. The style is gratuitously different from every other regression test in the system. This is not a good thing. If it were an amazingly better way to do things, then maybe, but as far as I can tell the style pgTAP forces on you is really pretty darn poorly suited for SQL tests. You have to contort what could naturally be expressed in SQL as a table result into a scalar. Plus it's redundant with the expected-output file. 2. It's ridiculously slow; at least a factor of ten slower than doing equivalent tests directly in SQL. This is a very bad thing. Speed of regression tests matters a lot to those of us who run them a dozen times per day --- and I do not wish to discourage any developers who don't work that way from learning better habits ;-) Because of #1 and #2 I find the use of pgTAP to be a nonstarter. I have a couple of gripes about the substance of the tests as well: 3. What you are mostly testing is not the behavior of citext itself, but the behavior of the underlying strcoll function. This is pretty much doomed to failure, first because the regression tests are intended to work in multiple locales (and we are *not* giving that up for citext), and second because the behavior of strcoll isn't all that portable across platforms even given allegedly similar locale settings (as we already found out yesterday). Sadly, I think you have to give up attempts to check the interesting multibyte cases and confine yourself to tests using ASCII strings. 4. A lot of the later test cases are equally uselessly testing whether piggybacking over text functions works. The odds of ever finding anything with those tests are not distinguishable from zero (unless the underlying text function is busted, which is not your responsibility to test). So I don't see any point in putting them into the standard regression package. (What maybe *would* be useful to test, but you didn't, is whether the result of a function is considered citext rather than text.) I suggest something more like the attached as a suitable regression test. I got bored about halfway through and started to skim, so there might be a few tests toward the end of the original set that would be worth transposing into this one, but nothing jumped out at me. regards, tom lane -- -- Test citext datatype -- -- -- first, define the datatype. Turn off echoing so that expected file -- does not depend on contents of citext.sql. -- SET client_min_messages = warning; \set ECHO none \i citext.sql \set ECHO all RESET client_min_messages; -- Test the operators and indexing functions -- Test = and <>. SELECT 'a'::citext = 'a'::citext as t; SELECT 'a'::citext = 'A'::citext as t; SELECT 'a'::citext = 'A'::text as f;-- text wins the discussion SELECT 'a'::citext = 'b'::citext as f; SELECT 'a'::citext = 'ab'::citext as f; SELECT 'a'::citext <> 'ab'::citext as t; -- Test > and >= SELECT 'B'::citext > 'a'::citext as t; SELECT 'b'::citext > 'A'::citext as t; SELECT 'B'::citext > 'b'::citext as f; SELECT 'B'::citext >= 'b'::citext as t; -- Test < and <= SELECT 'a'::citext < 'B'::citext as t; SELECT 'a'::citext <= 'B'::citext as t; -- Test implicit casting. citext casts to text, but not vice-versa. SELECT 'a'::citext = 'a'::text as t; SELECT 'A'::text <> 'a'::citext as t; SELECT 'aardvark'::citext = 'aardvark'::citext as t; SELECT 'aardvark'::citext = 'aardVark'::citext as t; -- Check the citext_cmp() function explicitly. SELECT citext_cmp('aardvark'::citext, 'aardvark'::citext) as zero; SELECT citext_cmp('aardvark'::citext, 'aardVark'::citext) as zero; SELECT citext_cmp('B'::citext, 'a'::citext) as one; -- Do some tests using a table and index. CREATE TEMP TABLE try ( name citext PRIMARY KEY ); INSERT INTO try (name) VALUES ('a'), ('ab'), ('aba'), ('b'), ('ba'), ('bab'), ('AZ'); SELECT name, 'a' = name from try; SELECT name, 'a' = name from try where name = 'a'; SELECT name, 'A' = name from try; SELECT name, 'A' = name from try where name = 'A'; SELECT name, 'A' = name from try where name = 'A'; -- expected failures on duplicate key INSERT INTO try (name) VALUES ('a'); INSERT INTO try (name) VALUES ('A'); INSERT INTO try (name) VALUES ('aB'); -- Test aggregate functions and sort ordering CREATE TEMP TABLE srt ( name CITEXT ); INSERT INTO srt (name) VALUES ('aardvark'), ('AAA'), ('aba'), ('ABC'), ('abd'); -- Check the min() and max() aggregates, with and without index. set enable_seqscan = off; SELECT MIN(name) from srt; SELECT MAX(name) from srt; reset enable_seqscan; set enable_indexscan = off; SELECT MIN(name) from srt; SELECT MAX(name) from srt; reset enable_indexscan; -- Check sorting likewise set enable_seqscan = off; SELECT name FROM
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 12, 2008, at 12:17, Tom Lane wrote: BTW, I looked at the SQL file (citext.sql.in) a bit. Some comments: Thanks a million for these, Tom. I greatly appreciate it. * An explicit comment explaining that you're piggybacking on the I/O functions (and some others) for "text" wouldn't be out of place. I've added SQL comments. Were you talking about a COMMENT? * Lose the GRANT EXECUTEs on the I/O functions; they're redundant. (If you needed them, you'd need them on a lot more than these two.) I'd be inclined to lose the COMMENTs on the functions too; again these are about the least useful ones to comment on out of the whole module. I wondered about that; those were copied from CITEXT 1. I've removed all GRANTs and COMMENTs. * You should provide binary I/O (send/receive) functions, if you want this to be an industrial-strength module. It's easy since you can piggyback on text's. I'm confused. Is that not what the citextin and citextout functions are? * The type declaration needs to say storage = extended, else the type won't be toastable. Ah, good, thanks. * The cast from bpchar to citext cannot be WITHOUT FUNCTION; it needs to invoke rtrim1. Compare the bpchar to text cast. Where do I find that? I have trouble finding the SQL that creates the core types. :-( * <= is surely not its own commutator. Changed to >=. You might try running the opr_sanity regression test on this module to see if it finds any other silliness. (Procedure: insert the citext definition script into the serial_schedule list just ahead of opr_sanity, run tests, make sure you understand the reason for any diffs in the opr_sanity result. There will be at least one from the uses of text-related functions for citext.) Thanks. Added to my list. * I think you can and should drop all of the "size" functions and a lot of the "miscellaneous" functions: anyplace where the implicit coercion to text would serve, you don't need a piggyback function, and introducing one just creates risks of can't-resolve-ambiguous-function failures. The overloaded miscellaneous functions are only justifiable to the extent that it's important to preserve "citext-ness" of the result of a function, which seems at best dubious for many of these. It is likewise pretty pointless AFAICS to introduce regex functions taking citext pattern arguments. I added most of those as I wrote tests and they failed to find the functions. Once I added the functions, they worked. But I'll do an audit to make sure that I didn't inadvertantly leave in any unneeded ones (I'm happy to have less code :-)). * Don't use the OPERATOR() notation when you don't need to. It's just clutter. Sorry, don't know what you're referring to here. CREATE OPERATOR appears to require parens… Thanks for the great feedback! I'll work on getting things all straightened out and a new patch in tonight. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
BTW, I looked at the SQL file (citext.sql.in) a bit. Some comments: * An explicit comment explaining that you're piggybacking on the I/O functions (and some others) for "text" wouldn't be out of place. * Lose the GRANT EXECUTEs on the I/O functions; they're redundant. (If you needed them, you'd need them on a lot more than these two.) I'd be inclined to lose the COMMENTs on the functions too; again these are about the least useful ones to comment on out of the whole module. * You should provide binary I/O (send/receive) functions, if you want this to be an industrial-strength module. It's easy since you can piggyback on text's. * The type declaration needs to say storage = extended, else the type won't be toastable. * The cast from bpchar to citext cannot be WITHOUT FUNCTION; it needs to invoke rtrim1. Compare the bpchar to text cast. * <= is surely not its own commutator. You might try running the opr_sanity regression test on this module to see if it finds any other silliness. (Procedure: insert the citext definition script into the serial_schedule list just ahead of opr_sanity, run tests, make sure you understand the reason for any diffs in the opr_sanity result. There will be at least one from the uses of text-related functions for citext.) * I think you can and should drop all of the "size" functions and a lot of the "miscellaneous" functions: anyplace where the implicit coercion to text would serve, you don't need a piggyback function, and introducing one just creates risks of can't-resolve-ambiguous-function failures. The overloaded miscellaneous functions are only justifiable to the extent that it's important to preserve "citext-ness" of the result of a function, which seems at best dubious for many of these. It is likewise pretty pointless AFAICS to introduce regex functions taking citext pattern arguments. * Don't use the OPERATOR() notation when you don't need to. It's just clutter. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 11, 2008, at 20:22, Tom Lane wrote: Here's an updated version of citext.c. Some changes cosmetic, some not so much ... Thanks! Good catch on my missing all of the s/int/int32/ stuff related to citextcmp(). Stupid oversites on my part. I'll submit a new patch with these changes shortly. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 11, 2008, at 20:10, Tom Lane wrote: Oh, got one of those too ... [ time passes ... ] Nope, no leak seen here either, though you could possibly fry an egg on my laptop right now ... Yeah, gotta love that, right? So the issue must be with my version for 8.3.x, meaning, likely, my custom cilower() function. I'll have another look at it. Thanks for giving it a whirl on your end. Also, I notice that the Mac is the *only* one of the three systems on which your regression tests pass. You're depending way too much on platform-specific behavior there, I fear. Hrm. I wonder what that could be. I'll have to give it a shot on my Ubuntu box and find out. Thanks. David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
Here's an updated version of citext.c. Some changes cosmetic, some not so much ... regards, tom lane /* * PostgreSQL type definitions for CITEXT 2.0. */ #include "postgres.h" #include "access/hash.h" #include "fmgr.h" #include "utils/builtins.h" #include "utils/formatting.h" #ifdef PG_MODULE_MAGIC PG_MODULE_MAGIC; #endif /* * * FORWARD DECLARATIONS * */ static int32 citextcmp (text *left, text *right); extern Datum citext_cmp (PG_FUNCTION_ARGS); extern Datum citext_hash(PG_FUNCTION_ARGS); extern Datum citext_eq (PG_FUNCTION_ARGS); extern Datum citext_ne (PG_FUNCTION_ARGS); extern Datum citext_gt (PG_FUNCTION_ARGS); extern Datum citext_ge (PG_FUNCTION_ARGS); extern Datum citext_lt (PG_FUNCTION_ARGS); extern Datum citext_le (PG_FUNCTION_ARGS); extern Datum citext_smaller (PG_FUNCTION_ARGS); extern Datum citext_larger (PG_FUNCTION_ARGS); /* * = * UTILITY FUNCTIONS * = */ /* citextcmp() * Internal comparison function for citext strings. * Returns int32 negative, zero, or positive. */ static int32 citextcmp (text *left, text *right) { char *lcstr, *rcstr; int32 result; lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left)); rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right)); result = varstr_cmp(lcstr, strlen(lcstr), rcstr, strlen(rcstr)); pfree(lcstr); pfree(rcstr); return result; } /* * == * INDEXING FUNCTIONS * == */ PG_FUNCTION_INFO_V1(citext_cmp); Datum citext_cmp(PG_FUNCTION_ARGS) { text *left = PG_GETARG_TEXT_PP(0); text *right = PG_GETARG_TEXT_PP(1); int32 result; result = citextcmp(left, right); PG_FREE_IF_COPY(left, 0); PG_FREE_IF_COPY(right, 1); PG_RETURN_INT32(result); } PG_FUNCTION_INFO_V1(citext_hash); Datum citext_hash(PG_FUNCTION_ARGS) { text *txt = PG_GETARG_TEXT_PP(0); char *str; Datum result; str= str_tolower(VARDATA_ANY(txt), VARSIZE_ANY_EXHDR(txt)); result = hash_any((unsigned char *) str, strlen(str)); pfree(str); /* Avoid leaking memory for toasted inputs */ PG_FREE_IF_COPY(txt, 0); PG_RETURN_DATUM(result); } /* * == * OPERATOR FUNCTIONS * == */ PG_FUNCTION_INFO_V1(citext_eq); Datum citext_eq(PG_FUNCTION_ARGS) { text *left = PG_GETARG_TEXT_PP(0); text *right = PG_GETARG_TEXT_PP(1); char *lcstr, *rcstr; bool result; lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left)); rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right)); /* * We can't do the length-comparison optimization here, as is done for the * text type in varlena.c, because sometimes the lengths can be different. * The canonical example is the turkish dotted i: the lowercase version is * the standard ASCII i, but the uppercase version is multibyte. * Otherwise, since we only care about equality or not-equality, we can * avoid all the expense of strcoll() here, and just do bitwise * comparison. */ result = (strcmp(lcstr, rcstr) == 0); pfree(lcstr); pfree(rcstr); PG_FREE_IF_COPY(left, 0); PG_FREE_IF_COPY(right, 1); PG_RETURN_BOOL(result); } PG_FUNCTION_INFO_V1(citext_ne); Datum citext_ne(PG_FUNCTION_ARGS) { text *left = PG_GETARG_TEXT_PP(0); text *right = PG_GETARG_TEXT_PP(1); char *lcstr, *rcstr; bool result; lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left)); rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right)); /* * We can't do the length-comparison optimization here, as is done for the * text type in varlena.c, because sometimes the lengths can be different. * The canonical example is the turkish dotted i: the lowercase version is * the standard ASCII i, but the uppercase version is multibyte. * Otherwise, since we only care about equality or not-equality, we can * avoid all the expense of strcoll() here, and just do bitwise * comparison. */ result = (strcmp(lcstr, rcstr) != 0); pfree(lcstr); pfree(rcstr); PG_FREE_IF_COPY(left, 0); PG_FREE_IF_COPY(right, 1); PG_RETURN_BOOL(result); } PG_FUNCTION_INFO_V1(citext_lt); Datum citext_lt(PG_FUNCTION_ARGS) { text *left = PG_GETARG_TEXT_PP(0); text *right = PG_GETARG_TEXT_PP(1); bool result; result = citextcmp(left, right) < 0; PG_FREE_IF_COPY(left, 0); PG_FREE_IF_COPY(right, 1); PG_RETURN_BOOL(result); } PG_FUNCTION_INFO_V1(citext_le); Datum citext_le(PG_FUNCTION_ARGS) { text *left = PG_GETARG_TEXT_PP(0); text *right = PG_GETARG_TEXT_PP(1); bool result; resul
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > On Jul 11, 2008, at 19:18, Tom Lane wrote: >> I tried it here and didn't see any apparent memory leak, on two >> different systems. What platform are you testing on, and with >> what encoding/locale settings? > That's Mac OS X 10.5.4 "Leopard." Using en_US.UTF-8. Oh, got one of those too ... [ time passes ... ] Nope, no leak seen here either, though you could possibly fry an egg on my laptop right now ... Also, I notice that the Mac is the *only* one of the three systems on which your regression tests pass. You're depending way too much on platform-specific behavior there, I fear. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 11, 2008, at 19:18, Tom Lane wrote: I tried it here and didn't see any apparent memory leak, on two different systems. What platform are you testing on, and with what encoding/locale settings? PostgreSQL 8.3.3 on i386-apple-darwin9.4.0, compiled by GCC i686- apple-darwin9-gcc-4.0.1 (G That's Mac OS X 10.5.4 "Leopard." Using en_US.UTF-8. This is using my version for 8.3.3 from svn. https://svn.kineticode.com/citext/trunk/ I'll give it a try myself with the patch against CVS in a bit. Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > Unfortunately, CITEXT seems to have a memory leak somewhere, I tried it here and didn't see any apparent memory leak, on two different systems. What platform are you testing on, and with what encoding/locale settings? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 11, 2008, at 16:45, Tom Lane wrote: Not sure about a memory leak, but this code is clearly wrong if str_tolower results in a change of physical string length (clearly possible in Turkish, for instance, and probably in some other locales too). You need to do strlen's on the lowercased strings. Ah, great point. Thanks. Anyone else got an idea on the memory leak? Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left)); > rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right)); > result = varstr_cmp( > lcstr, > VARSIZE_ANY_EXHDR(left), > rcstr, > VARSIZE_ANY_EXHDR(right) > ); Not sure about a memory leak, but this code is clearly wrong if str_tolower results in a change of physical string length (clearly possible in Turkish, for instance, and probably in some other locales too). You need to do strlen's on the lowercased strings. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 11, 2008, at 13:02, Zdenek Kotala wrote: Thank you, Zdenek. Have you had a chance to try citext yet? Or did you just read the source? I tested version two on Solaris/SPARC and Sun studio compiler. I checked last version only quickly (comparing your changes). Thanks. I just updated my performance test script (attached) by increasing the number of rows tested by an order of magnitude. So now it creates 1,000,000 rows, queries them, adds indexes, and then queries them again. Unfortunately, CITEXT seems to have a memory leak somewhere, because when I index the CITEXT column, it fails with "ERROR: out of memory". So obviously something's not getting cleaned up. Here's the btree indexing function: Datum citext_cmp(PG_FUNCTION_ARGS) { text *left = PG_GETARG_TEXT_PP(0); text *right = PG_GETARG_TEXT_PP(1); int32 result; result = citextcmp(left, right); PG_FREE_IF_COPY(left, 0); PG_FREE_IF_COPY(right, 1); PG_RETURN_INT32(result); } And here's citextcmp(): citextcmp (text *left, text *right) { char *lcstr, *rcstr; int result; lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left)); rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right)); result = varstr_cmp( lcstr, VARSIZE_ANY_EXHDR(left), rcstr, VARSIZE_ANY_EXHDR(right) ); pfree(lcstr); pfree(rcstr); return result; } Can anyone see where I'm failing to free up memory? Might it be in some other function? Thanks! David try.sql Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
David E. Wheeler napsal(a): On Jul 11, 2008, at 12:37, Zdenek Kotala wrote: I'm sorry for late response. I'm little bit busy this week. I quickly look on your latest changes and it seems to me that everything is OK. I'm going to change status to ready for commit. After discussion with Pavel Stehule I changed meaning about pgFoundry vs. contrib. Contrib is OK for me. Thank you, Zdenek. Have you had a chance to try citext yet? Or did you just read the source? I tested version two on Solaris/SPARC and Sun studio compiler. I checked last version only quickly (comparing your changes). With regards Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 11, 2008, at 12:37, Zdenek Kotala wrote: I'm sorry for late response. I'm little bit busy this week. I quickly look on your latest changes and it seems to me that everything is OK. I'm going to change status to ready for commit. After discussion with Pavel Stehule I changed meaning about pgFoundry vs. contrib. Contrib is OK for me. Thank you, Zdenek. Have you had a chance to try citext yet? Or did you just read the source? Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
David E. Wheeler napsal(a): Attached is a new version of a patch to add a CITEXT contrib module. Changes since v2: * Optimized citext_eq() and citext_ne() (the = and <> operators, respectively) by having them use strncmp() instead of varstr_cmp(). Per discussion. * Added `RESTRICT` and `JOIN` clauses to the comparison operators (=, <>, <, >, <=, >=). These improve statistics estimations, increasing the liklihood that indices will be used. I'm sorry for late response. I'm little bit busy this week. I quickly look on your latest changes and it seems to me that everything is OK. I'm going to change status to ready for commit. After discussion with Pavel Stehule I changed meaning about pgFoundry vs. contrib. Contrib is OK for me. Maybe some native speaker should review documentation. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
David E. Wheeler wrote: > On Jul 9, 2008, at 13:40, Alvaro Herrera wrote: >> One thing that jumps at me is pgTAP usage, as Zdenek said. I >> understand that it's neat and all that, but we can't include the >> tests because they won't run unless one installs pgTAP which seems a >> nonstarter. So if you want the tests in the repository along the >> rest of the stuff, they really should use pg_regress. > > It does use pg_regress. The test just load the included pgtap.sql file > to get the tap functions, and then away they go. If you run `make > installcheck` it works. Uh-huh, OK, that's fine then I guess. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
On Jul 9, 2008, at 13:40, Alvaro Herrera wrote: The problem is that we're in the middle of a commitfest, so everybody is busy reviewing other patches (in theory at least). Of course. One thing that jumps at me is pgTAP usage, as Zdenek said. I understand that it's neat and all that, but we can't include the tests because they won't run unless one installs pgTAP which seems a nonstarter. So if you want the tests in the repository along the rest of the stuff, they really should use pg_regress. It does use pg_regress. The test just load the included pgtap.sql file to get the tap functions, and then away they go. If you run `make installcheck` it works. It's not even difficult to use. Have a look at contrib/ltree/sql and contrib/ltree/expected for examples. If you want to push for pgTAP in core, that's fine, but it's a separate discussion. Agreed. I've sent a couple of messages in a thread about that, the latest this morning. The other possibility being, of course, that you are proposing citext to live on pgFoundry. I'm not, actually. I mean, I have an updated version for 8.3, but it'd be quite a pita to maintain them both, since the api for lowercasing text is so much simpler in 8.4. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
David E. Wheeler wrote: > I guess you're all just blown away by the perfection of this patch? ;-) The problem is that we're in the middle of a commitfest, so everybody is busy reviewing other patches (in theory at least). One thing that jumps at me is pgTAP usage, as Zdenek said. I understand that it's neat and all that, but we can't include the tests because they won't run unless one installs pgTAP which seems a nonstarter. So if you want the tests in the repository along the rest of the stuff, they really should use pg_regress. It's not even difficult to use. Have a look at contrib/ltree/sql and contrib/ltree/expected for examples. If you want to push for pgTAP in core, that's fine, but it's a separate discussion. The other possibility being, of course, that you are proposing citext to live on pgFoundry. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v3
I guess you're all just blown away by the perfection of this patch? ;-) Best, David On Jul 7, 2008, at 18:03, David E. Wheeler wrote: Attached is a new version of a patch to add a CITEXT contrib module. Changes since v2: * Optimized citext_eq() and citext_ne() (the = and <> operators, respectively) by having them use strncmp() instead of varstr_cmp(). Per discussion. * Added `RESTRICT` and `JOIN` clauses to the comparison operators (=, <>, <, >, <=, >=). These improve statistics estimations, increasing the liklihood that indices will be used. Thanks! David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v2
On Jul 7, 2008, at 12:06, David E. Wheeler wrote: I understand it but there is parallel project which should solve this problem completely I guess in "close" future (2-3years). Afterward this module will be obsolete and it will takes times to remove it from contrib. It seems to me that have citext in contrib only for two releases is not optimal solution. I guess that'd be the reason to keep it on pgFoundry, but I have two comments: * 2-3 years is a *long* time in Internet time. * There is on guarantee that it will be finished in that time or, indeed ever (no disrespect to Radek Strnad, it's just there are always unforeseen issues). One other thing that occurred to me yesterday: Given that the feature will ultimately support column-level collations, I suspect that it will be much easier to migrate CITEXT to a case-insensitive collation (perhaps using an updated CITEXT contrib module that just does so transparently) than to migrate application code from using LOWER() all over the place to not using. One transition requires a change to the schema, the other requires a change to application code, of which there is generally a lot more. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v2
Martijn van Oosterhout napsal(a): On Mon, Jul 07, 2008 at 12:06:08PM -0700, David E. Wheeler wrote: I guess that'd be the reason to keep it on pgFoundry, but I have two comments: * 2-3 years is a *long* time in Internet time. There have been patches over the years, but they tend not to get looked at. Recently someone pulled up the COLLATE patch from a couple of years ago but it didn't get much feedback either. (I can't find the link right now). I know about it. I have printed your proposal on my desk. I think It is linked from TODO list. It's disappointing that the discussions get hung up on the ICU library when it's not required or even needed for COLLATE support. My original patch never even mentioned it. I note that Firebird added COLLATE using ICU a few years back now. I think PostgreSQL is the only large DBMS to not support it. Complete agree. Collation missing support is big problem for many users. * There is on guarantee that it will be finished in that time or, indeed ever (no disrespect to Radek Strnad, it's just there are always unforeseen issues). I think that with concerted coding effort it could be done in 2-3 months (judging by how long it took to write the first version). The problem is it needs some planner kung-fu which not many people have. I agree that 2-3 months on fulltime is good estimation, problem is that you need kung-fu master which has time to do it :(. What we currently have is student which works on it in free time. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v2
On Mon, Jul 07, 2008 at 12:06:08PM -0700, David E. Wheeler wrote: > I guess that'd be the reason to keep it on pgFoundry, but I have two > comments: > > * 2-3 years is a *long* time in Internet time. There have been patches over the years, but they tend not to get looked at. Recently someone pulled up the COLLATE patch from a couple of years ago but it didn't get much feedback either. (I can't find the link right now). It's disappointing that the discussions get hung up on the ICU library when it's not required or even needed for COLLATE support. My original patch never even mentioned it. I note that Firebird added COLLATE using ICU a few years back now. I think PostgreSQL is the only large DBMS to not support it. > * There is on guarantee that it will be finished in that time or, > indeed ever (no disrespect to Radek Strnad, it's just there are always > unforeseen issues). I think that with concerted coding effort it could be done in 2-3 months (judging by how long it took to write the first version). The problem is it needs some planner kung-fu which not many people have. Have a nice day, -- Martijn van Oosterhout <[EMAIL PROTECTED]> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] PATCH: CITEXT 2.0 v2
Thanks to help from RhodiumToad on IRC, I got some things improved here: On Jul 7, 2008, at 16:24, David E. Wheeler wrote: So for some reason, after adding the indexes, the queries against the CITEXT column aren't using them. Furthermore, the `lower(text) LIKE lower(?)` query isn't using *its* index. Huh? I never knew what one needed to use the text_pattern_ops operator class to index a column for use with LIKE! I had no clue. Would that work for a citext column, too, since it's essentially the same as TEXT? So this leaves me with two questions: 1. For what reason would the query against the citext column *not* use the index? It turns out that it did use the index if I put `SET enable_seqscan = false;` into my script. So with RhodiumToad's direction, I added some `RESTRICT` and `JOIN` clauses to my comparison operators (copying them from ip4r). So now I have: CREATE OPERATOR = ( LEFTARG= CITEXT, RIGHTARG = CITEXT, COMMUTATOR = =, NEGATOR= <>, PROCEDURE = citext_eq, RESTRICT = eqsel, JOIN = eqjoinsel, HASHES, MERGES ); CREATE OPERATOR <> ( LEFTARG= CITEXT, RIGHTARG = CITEXT, NEGATOR= =, COMMUTATOR = <>, PROCEDURE = citext_ne, RESTRICT = neqsel, JOIN = neqjoinsel ); CREATE OPERATOR < ( LEFTARG= CITEXT, RIGHTARG = CITEXT, NEGATOR= >=, COMMUTATOR = >, PROCEDURE = citext_lt, RESTRICT = scalarltsel, JOIN = scalarltjoinsel ); CREATE OPERATOR <= ( LEFTARG= CITEXT, RIGHTARG = CITEXT, NEGATOR= >, COMMUTATOR = <=, PROCEDURE = citext_le, RESTRICT = scalarltsel, JOIN = scalarltjoinsel ); CREATE OPERATOR >= ( LEFTARG= CITEXT, RIGHTARG = CITEXT, NEGATOR= <, COMMUTATOR = <=, PROCEDURE = citext_ge, RESTRICT = scalargtsel, JOIN = scalargtjoinsel ); CREATE OPERATOR > ( LEFTARG= CITEXT, RIGHTARG = CITEXT, NEGATOR= <=, COMMUTATOR = <, PROCEDURE = citext_gt, RESTRICT = scalargtsel, JOIN = scalargtjoinsel ); With this change, the index was used: Loading words from dictionary. Inserting into the table. Test =. SELECT * FROM try WHERE LOWER(text) = LOWER('food'); Time: 261.295 ms SELECT * FROM try WHERE citext = 'food'; Time: 289.304 ms Time: 1228.961 ms Adding indexes... Test =. SELECT * FROM try WHERE LOWER(text) = LOWER('food'); Time: 2.018 ms SELECT * FROM try WHERE citext = 'food'; Time: 0.788 ms Seems to be faster than the LOWER() version, too, which makes me happy. The output from EXPLAIN ANALYZE: try=# EXPLAIN ANALYZE SELECT * FROM try WHERE citext = 'food'; QUERY PLAN -- Index Scan using idx_try_citext on try (cost=0.00..8.31 rows=1 width=119) (actual time=0.324..0.324 rows=0 loops=1) Index Cond: (citext = 'food'::citext) Total runtime: 0.377 ms (3 rows) try=# EXPLAIN ANALYZE SELECT * FROM try WHERE LOWER(text) = LOWER('food'); QUERY PLAN Bitmap Heap Scan on try (cost=28.17..1336.10 rows=500 width=119) (actual time=0.170..0.170 rows=0 loops=1) Recheck Cond: (lower(text) = 'food'::text) -> Bitmap Index Scan on idx_try_text (cost=0.00..28.04 rows=500 width=0) (actual time=0.168..0.168 rows=0 loops=1) Index Cond: (lower(text) = 'food'::text) Total runtime: 0.211 ms (5 rows) So my only other question related to this is: * Are the above RESTRICT and JOIN functions the ones to use, or is there some way to make use of those used by the TEXT type that would be more appropriate? 2. Is there some way to get the CITEXT index to behave like a LOWER() index, that is, so that its value is stored using the result of the str_tolower() function, thus removing some of the overhead of converting the values for each row fetched from the index? (Does this question make any sense?) Given the performance with an index, I think that this is moot, yes? There is, of course, much more overhead for a table scan. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
On Jul 7, 2008, at 17:18, Tom Lane wrote: No, but you were: you proposed using strncmp for everything. Yes, that's right. I was trying to understand why I wouldn't use just one or the other. And the answer turned out to be, you wouldn't, except that strncmp() is an desirable optimization for = and <>. So I've changed only those. Phew, I think I'm clear now. Thanks! DAvid -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > On Jul 7, 2008, at 16:58, Tom Lane wrote: >> If that's so, you certainly can't use strncmp, because that would >> result >> in sort orderings totally different from lower()'s result. Even >> without >> that argument, for most multibyte cases you'd get a pretty arbitrary, >> user-unfriendly sort ordering. > Now I'm confused again. :-( Whether or not I use strncmp() or > varstr_cmp(), I first lowercase the value to be compared using > str_tolower(). What Zdenek has said is, that aside, just as for the > TEXT type, I should use strncmp() for = and <>, and varstr_cmp() for > everything else. Are you saying something different? No, but you were: you proposed using strncmp for everything. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
On Jul 7, 2008, at 16:58, Tom Lane wrote: "David E. Wheeler" <[EMAIL PROTECTED]> writes: Hrm. So in your opinion, strncmp() could be used for all comparisons by citext, rather than varstr_cmp()? I thought the charter of this type was to work like lower(textcol). Correct. If that's so, you certainly can't use strncmp, because that would result in sort orderings totally different from lower()'s result. Even without that argument, for most multibyte cases you'd get a pretty arbitrary, user-unfriendly sort ordering. Now I'm confused again. :-( Whether or not I use strncmp() or varstr_cmp(), I first lowercase the value to be compared using str_tolower(). What Zdenek has said is, that aside, just as for the TEXT type, I should use strncmp() for = and <>, and varstr_cmp() for everything else. Are you saying something different? I could use some examples to play with in order to ensure that things are behaving as they should. I'll add regression tests for them. Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > Hrm. So in your opinion, strncmp() could be used for all comparisons > by citext, rather than varstr_cmp()? I thought the charter of this type was to work like lower(textcol). If that's so, you certainly can't use strncmp, because that would result in sort orderings totally different from lower()'s result. Even without that argument, for most multibyte cases you'd get a pretty arbitrary, user-unfriendly sort ordering. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v2
No, *really* Sheesh, sorry. David try.sql Description: Binary data On Jul 7, 2008, at 16:26, David E. Wheeler wrote: And here is the script. D'oh! Thanks, David On Jul 7, 2008, at 16:24, David E. Wheeler wrote: On Jul 7, 2008, at 08:01, Andrew Dunstan wrote: What does still bother me is its performance. I'd like to know if any measurement has been done of using citext vs. a functional index on lower(foo). Okay, here's a start. The attached script inserts random strings of 1-10 space-delimited words into text and citext columns, and then compares the performance of queries with and without indexes. The output for me is as follows: Loading words from dictionary. Inserting into the table. Test =. SELECT * FROM try WHERE LOWER(text) = LOWER('food'); Time: 254.254 ms SELECT * FROM try WHERE citext = 'food'; Time: 288.535 ms Test LIKE and ILIKE SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); Time: 209.385 ms SELECT * FROM try WHERE citext ILIKE 'C%'; Time: 236.186 ms SELECT * FROM try WHERE citext LIKE 'C%'; Time: 235.818 ms Adding indexes... Test =. SELECT * FROM try WHERE LOWER(text) = LOWER('food'); Time: 1.260 ms SELECT * FROM try WHERE citext = 'food'; Time: 277.755 ms Test LIKE and ILIKE SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); Time: 209.073 ms SELECT * FROM try WHERE citext ILIKE 'C%'; Time: 238.430 ms SELECT * FROM try WHERE citext LIKE 'C%'; Time: 238.685 ms benedict% So for some reason, after adding the indexes, the queries against the CITEXT column aren't using them. Furthermore, the `lower(text) LIKE lower(?)` query isn't using *its* index. Huh? So this leaves me with two questions: 1. For what reason would the query against the citext column *not* use the index? 2. Is there some way to get the CITEXT index to behave like a LOWER() index, that is, so that its value is stored using the result of the str_tolower() function, thus removing some of the overhead of converting the values for each row fetched from the index? (Does this question make any sense?) Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v2
And here is the script. D'oh! Thanks, David try.sql Description: Binary data On Jul 7, 2008, at 16:24, David E. Wheeler wrote: On Jul 7, 2008, at 08:01, Andrew Dunstan wrote: What does still bother me is its performance. I'd like to know if any measurement has been done of using citext vs. a functional index on lower(foo). Okay, here's a start. The attached script inserts random strings of 1-10 space-delimited words into text and citext columns, and then compares the performance of queries with and without indexes. The output for me is as follows: Loading words from dictionary. Inserting into the table. Test =. SELECT * FROM try WHERE LOWER(text) = LOWER('food'); Time: 254.254 ms SELECT * FROM try WHERE citext = 'food'; Time: 288.535 ms Test LIKE and ILIKE SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); Time: 209.385 ms SELECT * FROM try WHERE citext ILIKE 'C%'; Time: 236.186 ms SELECT * FROM try WHERE citext LIKE 'C%'; Time: 235.818 ms Adding indexes... Test =. SELECT * FROM try WHERE LOWER(text) = LOWER('food'); Time: 1.260 ms SELECT * FROM try WHERE citext = 'food'; Time: 277.755 ms Test LIKE and ILIKE SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); Time: 209.073 ms SELECT * FROM try WHERE citext ILIKE 'C%'; Time: 238.430 ms SELECT * FROM try WHERE citext LIKE 'C%'; Time: 238.685 ms benedict% So for some reason, after adding the indexes, the queries against the CITEXT column aren't using them. Furthermore, the `lower(text) LIKE lower(?)` query isn't using *its* index. Huh? So this leaves me with two questions: 1. For what reason would the query against the citext column *not* use the index? 2. Is there some way to get the CITEXT index to behave like a LOWER() index, that is, so that its value is stored using the result of the str_tolower() function, thus removing some of the overhead of converting the values for each row fetched from the index? (Does this question make any sense?) Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v2
On Jul 7, 2008, at 08:01, Andrew Dunstan wrote: What does still bother me is its performance. I'd like to know if any measurement has been done of using citext vs. a functional index on lower(foo). Okay, here's a start. The attached script inserts random strings of 1-10 space-delimited words into text and citext columns, and then compares the performance of queries with and without indexes. The output for me is as follows: Loading words from dictionary. Inserting into the table. Test =. SELECT * FROM try WHERE LOWER(text) = LOWER('food'); Time: 254.254 ms SELECT * FROM try WHERE citext = 'food'; Time: 288.535 ms Test LIKE and ILIKE SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); Time: 209.385 ms SELECT * FROM try WHERE citext ILIKE 'C%'; Time: 236.186 ms SELECT * FROM try WHERE citext LIKE 'C%'; Time: 235.818 ms Adding indexes... Test =. SELECT * FROM try WHERE LOWER(text) = LOWER('food'); Time: 1.260 ms SELECT * FROM try WHERE citext = 'food'; Time: 277.755 ms Test LIKE and ILIKE SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); Time: 209.073 ms SELECT * FROM try WHERE citext ILIKE 'C%'; Time: 238.430 ms SELECT * FROM try WHERE citext LIKE 'C%'; Time: 238.685 ms benedict% So for some reason, after adding the indexes, the queries against the CITEXT column aren't using them. Furthermore, the `lower(text) LIKE lower(?)` query isn't using *its* index. Huh? So this leaves me with two questions: 1. For what reason would the query against the citext column *not* use the index? 2. Is there some way to get the CITEXT index to behave like a LOWER() index, that is, so that its value is stored using the result of the str_tolower() function, thus removing some of the overhead of converting the values for each row fetched from the index? (Does this question make any sense?) Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
On Jul 7, 2008, at 13:59, Gregory Stark wrote: Of course the obvious case of two equivalent strings with different bytes would be two strings which differ only in case in a collation which doesn't distinguish based on case. So you obviously can't take this route for citext. Well, to be fair, citext isn't imposing a collation. It's just calling str_tolower() on strings before passing them on to varstr_cmp() or strncmp() to compare. I don't think you have to worry about the problem that cause Postgres to make this change. IIRC it was someone comparing strings like paths and usernames and getting false positives because they were in a Turkish locale which found certain sequences of characters to be insignificant for ordering. Someone who's using a citext data type has obviously decided that's precisely the kind of behaviour they want. Hrm. So in your opinion, strncmp() could be used for all comparisons by citext, rather than varstr_cmp()? Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > On Jul 7, 2008, at 12:21, David E. Wheeler wrote: > >> My question is: why? Shouldn't they all use the same function for >> comparison? I'm happy to dupe this implementation for citext, but I don't >> understand it. Should not all comparisons be executed consistently? > > Let me try to answer my own question by citing this comment: > > /* >* Since we only care about equality or not-equality, we can avoid all > the >* expense of strcoll() here, and just do bitwise comparison. >*/ > > So, the upshot is that the = and <> operators are not locale-aware, yes? They > just do byte comparisons. Is that really the way it should be? I mean, could > there not be strings that are equivalent but have different bytes? There could be strings that strcoll returns 0 for even though they're not identical. However that caused problems in Postgres so we decided that only equal strings should actually compare equal. So if strcoll returns 0 then we do a bytewise comparison to impose an arbitrary ordering. Of course the obvious case of two equivalent strings with different bytes would be two strings which differ only in case in a collation which doesn't distinguish based on case. So you obviously can't take this route for citext. I don't think you have to worry about the problem that cause Postgres to make this change. IIRC it was someone comparing strings like paths and usernames and getting false positives because they were in a Turkish locale which found certain sequences of characters to be insignificant for ordering. Someone who's using a citext data type has obviously decided that's precisely the kind of behaviour they want. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
On Jul 7, 2008, at 12:46, Zdenek Kotala wrote: So, the upshot is that the = and <> operators are not locale-aware, yes? They just do byte comparisons. Is that really the way it should be? I mean, could there not be strings that are equivalent but have different bytes? Correct. The problem is complex. It works fine only for normalized string. But postgres now assume that all utf8 strings are normalized. I see. So binary equivalence is okay, in that case. If you need to implement < <= >= > operators you need to use strcol which take care of locale collation. Which varstr_cmp() does, I guess. It's what textlt uses, for example. See unicode collation algorithm http://www.unicode.org/reports/tr10/ Wow, that looks like a fun read. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
On Jul 7, 2008, at 13:10, Tom Lane wrote: We intentionally force such strings to be considered non-equal. See varstr_cmp, and if you like see the archives from back when that was put in. The = and <> operators are in fact consistent with the behavior of varstr_cmp (and had better be!); they're just optimized a bit. Thank you, Tom. I'll post my updated patch shortly. In the meantime, can anyone suggest an easy way to populate a table full of random strings so that I can do a bit of benchmarking? Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
"David E. Wheeler" <[EMAIL PROTECTED]> writes: > So, the upshot is that the = and <> operators are not locale-aware, > yes? They just do byte comparisons. Is that really the way it should > be? I mean, could there not be strings that are equivalent but have > different bytes? We intentionally force such strings to be considered non-equal. See varstr_cmp, and if you like see the archives from back when that was put in. The = and <> operators are in fact consistent with the behavior of varstr_cmp (and had better be!); they're just optimized a bit. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
David E. Wheeler napsal(a): On Jul 7, 2008, at 12:21, David E. Wheeler wrote: My question is: why? Shouldn't they all use the same function for comparison? I'm happy to dupe this implementation for citext, but I don't understand it. Should not all comparisons be executed consistently? Let me try to answer my own question by citing this comment: /* * Since we only care about equality or not-equality, we can avoid all the * expense of strcoll() here, and just do bitwise comparison. */ So, the upshot is that the = and <> operators are not locale-aware, yes? They just do byte comparisons. Is that really the way it should be? I mean, could there not be strings that are equivalent but have different bytes? Correct. The problem is complex. It works fine only for normalized string. But postgres now assume that all utf8 strings are normalized. If you need to implement < <= >= > operators you need to use strcol which take care of locale collation. See unicode collation algorithm http://www.unicode.org/reports/tr10/ Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
2008/7/7 David E. Wheeler <[EMAIL PROTECTED]>: > On Jul 7, 2008, at 12:36, Pavel Stehule wrote: > >>> * Does it need to be locale-aware or not? >> >> yes! > > texteq() in varlena.c does not seem to be. > it's case sensitive and it's +/- binary compare Regards Pavel Stehule > Best, > > David > -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
On Jul 7, 2008, at 12:36, Pavel Stehule wrote: * Does it need to be locale-aware or not? yes! texteq() in varlena.c does not seem to be. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
2008/7/7 David E. Wheeler <[EMAIL PROTECTED]>: > On Jul 7, 2008, at 11:54, Alvaro Herrera wrote: > >>> Then why shouldn't I use strncmp() for all comparisons? >> >> I have no idea :-) -- because it's not locale-aware perhaps. > > Could someone who does have an idea answer these questions: > > * Does it need to be locale-aware or not? yes! > * Should I use strncmp() or varstr_cmp() to compare strings? > * Shouldn't it use one or the other, but not both? > > Sorry, I'm just confused about the "correct" thing to do here. If someone > who knows the definitive answers could weigh in, I'd be happy to make the > adjustment. > > Thanks, > > David > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
On Jul 7, 2008, at 12:21, David E. Wheeler wrote: My question is: why? Shouldn't they all use the same function for comparison? I'm happy to dupe this implementation for citext, but I don't understand it. Should not all comparisons be executed consistently? Let me try to answer my own question by citing this comment: /* * Since we only care about equality or not-equality, we can avoid all the * expense of strcoll() here, and just do bitwise comparison. */ So, the upshot is that the = and <> operators are not locale-aware, yes? They just do byte comparisons. Is that really the way it should be? I mean, could there not be strings that are equivalent but have different bytes? Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v2
On Jul 7, 2008, at 12:10, Pavel Stehule wrote: Maybe we can have some "locale" test outside our regress tests - I think that would be useful. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
On Jul 7, 2008, at 12:13, Zdenek Kotala wrote: I'm sorry. I meant bttext() http://doxygen.postgresql.org/varlena_8c-source.html#l01270 bttext should use in citextcmp function. It uses strcol function. I've no idea what bttext has to do with anything. Sorry if I'm being slow here. And citext_eq should be implemented as texteq: http://doxygen.postgresql.org/varlena_8c-source.html#l01164 I'm sorry for confusion I'm exchange bttext and varstr_cmp. :( Okay, I see that text_cmp() uses varstr_cmp(): http://doxygen.postgresql.org/varlena_8c-source.html#l01139 While texteq() and textne() use strncmp(): http://doxygen.postgresql.org/varlena_8c-source.html#l01164 http://doxygen.postgresql.org/varlena_8c-source.html#l01187 The other operator functions, like text_lt(), use text_cmp() and therefore varstr_cmp(): http://doxygen.postgresql.org/varlena_8c-source.html#01210 My question is: why? Shouldn't they all use the same function for comparison? I'm happy to dupe this implementation for citext, but I don't understand it. Should not all comparisons be executed consistently? Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
David E. Wheeler napsal(a): On Jul 7, 2008, at 11:54, Alvaro Herrera wrote: Then why shouldn't I use strncmp() for all comparisons? I have no idea :-) -- because it's not locale-aware perhaps. Could someone who does have an idea answer these questions: * Does it need to be locale-aware or not? * Should I use strncmp() or varstr_cmp() to compare strings? * Shouldn't it use one or the other, but not both? Sorry, I'm just confused about the "correct" thing to do here. If someone who knows the definitive answers could weigh in, I'd be happy to make the adjustment. I'm sorry. I meant bttext() http://doxygen.postgresql.org/varlena_8c-source.html#l01270 bttext should use in citextcmp function. It uses strcol function. And citext_eq should be implemented as texteq: http://doxygen.postgresql.org/varlena_8c-source.html#l01164 I'm sorry for confusion I'm exchange bttext and varstr_cmp. :( Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v2
2008/7/7 Zdenek Kotala <[EMAIL PROTECTED]>: > David E. Wheeler napsal(a): >> >> On Jul 7, 2008, at 07:41, Zdenek Kotala wrote: >> >>> However, It seems to me that code is ok now (exclude citex_eq). I think >>> there two open problems/questions: >>> >>> 1) regression test - >>> >>> a) I think that regresion is not correct. It depends on en_US locale, >>> but it uses characters which is not in related character repertoire. It >>> means comparing is not defined and I guess it could generate different >>> result on different OS - depends on locale implementation. >> >> That I don't know about. The test requires en_US.UTF-8, at least at this >> point. How are tests run on the build farm? And how else could I ensure that >> comparisons are case-insensitive for non-ASCII characters other than >> requiring a Unicode locale? Or is it just an issue for the sort order tests? >> For those, I could potentially remove accented characters, just as long as >> I'm verifying in other tests that comparisons are case-insensitive (without >> worrying about collation). > > Hmm, it is complex area and I'm not sure if anybody on planet know correct > answer :-). I think that best approach is to keep it as is and when a > problem occur then it will be fixed. > Maybe we can have some "locale" test outside our regress tests - >>> b) pgTap is something new. Need make a decision if this framework is >>> acceptable or not. >> >> Well, from the point of view of `make installcheck`, it's invisible. I've >> submitted a talk proposal for pgDay.US on ptTAP. I'm happy to discuss it >> further here though, if folks are interested. > > Yeah, it is invisible, but question is why don't put it as a framework to > common place. But it is for another discussion. > >>> 2) contrib vs. pgFoundry >>> >>> There is unresolved answer if we want to have this in contrib or not. >>> Good to mention that citext type will be obsoleted with full collation >>> implementation in a future. I personally prefer to keep it on pgFoundry >>> because it is temporally workaround (by my opinion), but I can live with >>> contrib module as well. >> >> I second what Andrew has said in reply to this issue. I'll also say that, >> since people *so* often end up using `WHERE LOWER(col) = LOWER(?)`, that >> it'd be very valuable to have citext in contrib, especially since so few >> folks seem to even know about pgFoundry, let alone be able to find it. I >> mean, look at these search results: > > I understand it but there is parallel project which should solve this > problem completely I guess in "close" future (2-3years). Afterward this > module will be obsolete and it will takes times to remove it from contrib. > It seems to me that have citext in contrib only for two releases is not > optimal solution. > >Zdenek > > Regards Pavel > > -- > Zdenek Kotala Sun Microsystems > Prague, Czech Republic http://sun.com/postgresql > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v2
On Jul 7, 2008, at 11:54, Zdenek Kotala wrote: Hmm, it is complex area and I'm not sure if anybody on planet know correct answer :-). I think that best approach is to keep it as is and when a problem occur then it will be fixed. Regression tests are really important, though. b) pgTap is something new. Need make a decision if this framework is acceptable or not. Well, from the point of view of `make installcheck`, it's invisible. I've submitted a talk proposal for pgDay.US on ptTAP. I'm happy to discuss it further here though, if folks are interested. Yeah, it is invisible, but question is why don't put it as a framework to common place. But it is for another discussion. It is in a common place as a project, on pgFoundry. Whether the core team wants to use it for testing other parts of PostgreSQL (core or contrib) and therefore put it in a central location is, as you say, a separate conversation. It'd be easy to move it in such a case, of course. I understand it but there is parallel project which should solve this problem completely I guess in "close" future (2-3years). Afterward this module will be obsolete and it will takes times to remove it from contrib. It seems to me that have citext in contrib only for two releases is not optimal solution. I guess that'd be the reason to keep it on pgFoundry, but I have two comments: * 2-3 years is a *long* time in Internet time. * There is on guarantee that it will be finished in that time or, indeed ever (no disrespect to Radek Strnad, it's just there are always unforeseen issues). Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
On Jul 7, 2008, at 11:54, Alvaro Herrera wrote: Then why shouldn't I use strncmp() for all comparisons? I have no idea :-) -- because it's not locale-aware perhaps. Could someone who does have an idea answer these questions: * Does it need to be locale-aware or not? * Should I use strncmp() or varstr_cmp() to compare strings? * Shouldn't it use one or the other, but not both? Sorry, I'm just confused about the "correct" thing to do here. If someone who knows the definitive answers could weigh in, I'd be happy to make the adjustment. Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v2
Andrew Dunstan napsal(a): Zdenek Kotala wrote: 2) contrib vs. pgFoundry There is unresolved answer if we want to have this in contrib or not. Good to mention that citext type will be obsoleted with full collation implementation in a future. I personally prefer to keep it on pgFoundry because it is temporally workaround (by my opinion), but I can live with contrib module as well. Is there a concrete plan to get to full collation support (i.e. per column) in any known time frame? If not, then I think a citext module would be acceptable. Collation per database should be part of 8.4. Radek Strnad works on it as a SoC project. He plans to continue on this work and implement full support as his Master of Thesis in 2-3 years time frame. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v2
David E. Wheeler napsal(a): On Jul 7, 2008, at 07:41, Zdenek Kotala wrote: However, It seems to me that code is ok now (exclude citex_eq). I think there two open problems/questions: 1) regression test - a) I think that regresion is not correct. It depends on en_US locale, but it uses characters which is not in related character repertoire. It means comparing is not defined and I guess it could generate different result on different OS - depends on locale implementation. That I don't know about. The test requires en_US.UTF-8, at least at this point. How are tests run on the build farm? And how else could I ensure that comparisons are case-insensitive for non-ASCII characters other than requiring a Unicode locale? Or is it just an issue for the sort order tests? For those, I could potentially remove accented characters, just as long as I'm verifying in other tests that comparisons are case-insensitive (without worrying about collation). Hmm, it is complex area and I'm not sure if anybody on planet know correct answer :-). I think that best approach is to keep it as is and when a problem occur then it will be fixed. b) pgTap is something new. Need make a decision if this framework is acceptable or not. Well, from the point of view of `make installcheck`, it's invisible. I've submitted a talk proposal for pgDay.US on ptTAP. I'm happy to discuss it further here though, if folks are interested. Yeah, it is invisible, but question is why don't put it as a framework to common place. But it is for another discussion. 2) contrib vs. pgFoundry There is unresolved answer if we want to have this in contrib or not. Good to mention that citext type will be obsoleted with full collation implementation in a future. I personally prefer to keep it on pgFoundry because it is temporally workaround (by my opinion), but I can live with contrib module as well. I second what Andrew has said in reply to this issue. I'll also say that, since people *so* often end up using `WHERE LOWER(col) = LOWER(?)`, that it'd be very valuable to have citext in contrib, especially since so few folks seem to even know about pgFoundry, let alone be able to find it. I mean, look at these search results: I understand it but there is parallel project which should solve this problem completely I guess in "close" future (2-3years). Afterward this module will be obsolete and it will takes times to remove it from contrib. It seems to me that have citext in contrib only for two releases is not optimal solution. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
David E. Wheeler wrote: > On Jul 7, 2008, at 11:44, Alvaro Herrera wrote: > >>> I'm confused. What's the difference between strncmp() and >>> varstr_cmp()? >>> And why would I use one in one place and the other elsewhere? >>> Shouldn't >>> they be used consistently? >> >> Not necessarily -- see texteq. I think the point is that varstr_cmp() >> is a lot slower. > > Then why shouldn't I use strncmp() for all comparisons? I have no idea :-) -- because it's not locale-aware perhaps. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
On Jul 7, 2008, at 11:44, Alvaro Herrera wrote: I'm confused. What's the difference between strncmp() and varstr_cmp()? And why would I use one in one place and the other elsewhere? Shouldn't they be used consistently? Not necessarily -- see texteq. I think the point is that varstr_cmp() is a lot slower. Then why shouldn't I use strncmp() for all comparisons? No, I was wondering before what locale was used for initdb on the build farm. I mean, how are locale-aware things really tested? They aren't AFAIK. Ow. Pity. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
David E. Wheeler wrote: > On Jul 7, 2008, at 00:46, Zdenek Kotala wrote: > >> You have to use varstr_cmp in citextcmp. Your code is correct, >> because for >> < <= >= > operators you need collation sensible function. >> >> You need to change only citext_cmp function to use strncmp() or call >> texteq function. > > I'm confused. What's the difference between strncmp() and varstr_cmp()? > And why would I use one in one place and the other elsewhere? Shouldn't > they be used consistently? Not necessarily -- see texteq. I think the point is that varstr_cmp() is a lot slower. >> I'm think that this test will work correctly for en_US.UTF-8 at any >> time. I guess the test make sense only when Czech collation >> (cs_CZ.UTF-8) is selected, but unfortunately, you cannot change >> collation during your test :(. > > No, I was wondering before what locale was used for initdb on the build > farm. I mean, how are locale-aware things really tested? They aren't AFAIK. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v2
On Jul 7, 2008, at 08:01, Andrew Dunstan wrote: What does still bother me is its performance. I'd like to know if any measurement has been done of using citext vs. a functional index on lower(foo). That's a good question. I can whip up a quick test by populating a column full of data and trying both, but I'm no performance testing expert. Anyone else know more about such performance testing who can help out? Many Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0 v2
On Jul 7, 2008, at 07:41, Zdenek Kotala wrote: However, It seems to me that code is ok now (exclude citex_eq). I think there two open problems/questions: 1) regression test - a) I think that regresion is not correct. It depends on en_US locale, but it uses characters which is not in related character repertoire. It means comparing is not defined and I guess it could generate different result on different OS - depends on locale implementation. That I don't know about. The test requires en_US.UTF-8, at least at this point. How are tests run on the build farm? And how else could I ensure that comparisons are case-insensitive for non-ASCII characters other than requiring a Unicode locale? Or is it just an issue for the sort order tests? For those, I could potentially remove accented characters, just as long as I'm verifying in other tests that comparisons are case-insensitive (without worrying about collation). b) pgTap is something new. Need make a decision if this framework is acceptable or not. Well, from the point of view of `make installcheck`, it's invisible. I've submitted a talk proposal for pgDay.US on ptTAP. I'm happy to discuss it further here though, if folks are interested. 2) contrib vs. pgFoundry There is unresolved answer if we want to have this in contrib or not. Good to mention that citext type will be obsoleted with full collation implementation in a future. I personally prefer to keep it on pgFoundry because it is temporally workaround (by my opinion), but I can live with contrib module as well. I second what Andrew has said in reply to this issue. I'll also say that, since people *so* often end up using `WHERE LOWER(col) = LOWER(?)`, that it'd be very valuable to have citext in contrib, especially since so few folks seem to even know about pgFoundry, let alone be able to find it. I mean, look at these search results: http://www.google.com/search?q=PostgreSQL%20case-insensitive%20text My blog entry about this patch is hit #3. pgFoundry (and CITEXT 1) is #7. Last time I did a query like this, it didn't turn up at all. Belive me, I'd love for pgFoundry (or something like it) to become the CPAN for PostgreSQL. But without some love and SEO, I don't think that's gonna happen. Besides, CITEXT 2 would be a PITA to maintain for both 8.3 and 8.4, given the changes in the string comparison API in CVS. Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: CITEXT 2.0
On Jul 7, 2008, at 00:46, Zdenek Kotala wrote: You have to use varstr_cmp in citextcmp. Your code is correct, because for < <= >= > operators you need collation sensible function. You need to change only citext_cmp function to use strncmp() or call texteq function. I'm confused. What's the difference between strncmp() and varstr_cmp()? And why would I use one in one place and the other elsewhere? Shouldn't they be used consistently? I'm think that this test will work correctly for en_US.UTF-8 at any time. I guess the test make sense only when Czech collation (cs_CZ.UTF-8) is selected, but unfortunately, you cannot change collation during your test :(. No, I was wondering before what locale was used for initdb on the build farm. I mean, how are locale-aware things really tested? I think, Best solution for now is to keep the test and add comment about recommended collation for this test. Yep, that's what it does. Thanks, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers