[HACKERS] proposal: UTF8 to_ascii function
Hello, combination functions to_ascii and convert_to is broken now. Problem is in convert_to function. It doesn't support 8bit output encoding. Current workaround: CREATE FUNCTION to_ascii(bytea, name) RETURNS text AS 'to_ascii_encname' LANGUAGE internal; SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2'); I don't expect column collate for 8.4, so we need to have workable to_ascii function. I propose function to_ascii(text, name) that internally convert text from utf8 encoding when it's necessary. Regards Pavel Stehule -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: PageLayout footprint
Zdenek Kotala wrote: Current content of control file is insufficient to check if database is compatible with postgres server. It is? Do you have an example of where it's insufficient? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Question regarding the database page layout.
Hello all, I have been digging into the database page layout (specifically the tuples) to ensure the unsigned integer types were consuming the proper storage. While digging around, I found one thing surprising: It appears the heap tuples are padded at the end to the MAXALIGN distance. Below is my data that I used to come to this conclusion. (This test was performed on a 64-bit system with --with-blocksize=32). The goal was to compare data from comparable type sizes. The first column indicates the type (char, uint1, int2, uint2, int4, and uint4), the number in () indicates the number of columns in the table. The Length is from the .lp_off field in the ItemId structure. The Offset is from the .lp_len field in the ItemId structure. The Size is the offset difference. char (1)Length Offset Sizechar (9) Length Offset Size 25 32736 32 33 32728 40 25 32704 32 33 32688 40 25 32672 32 33 32648 40 25 32640 33 32608 uint1 (1) Length Offset Sizeuint1 (9) Length Offset Size 25 32736 3233 32728 40 25 32704 3233 32688 40 25 32672 3233 32648 40 25 32640 33 32608 int2 (1) Length Offset Sizeint2 (5) Length Offset Size 26 32736 3234 32728 40 26 32704 3234 32688 40 26 32672 3234 32648 40 26 32640 34 32608 uint2 (1)Length Offset Sizeunt2 (5) Length Offset Size 26 32736 3234 32728 40 26 32704 3234 32688 40 26 32672 3234 32648 40 26 32640 34 32608 int4 (1) Length Offset Sizeint4 (3) Length Offset Size 28 32736 32 36 32728 40 28 32704 32 36 32688 40 28 32672 32 36 32648 40 28 32640 36 32608 uint4 (1) Length Offset Sizeuint4 (3) Length Offset Size 28 32736 32 36 32728 40 28 32704 32 36 32688 40 28 32672 32 36 32648 40 28 32640 36 32608 From the documentation at: http://www.postgresql.org/docs/8.3/static/storage-page-layout.html and from the comments in src/include/access/htup.h I understand the user data (indicated by t_hoff) must by a multiple of MAXALIGN distance, but I did not find anything suggesting the heap tuple itself had this requirement. After a cursory glance at the HeapTupleHeaderData structure, it appears it could be aligned with INTALIGN instead of MAXALIGN. The one structure I was worried about was the 6 byte t_ctid structure. The comments in src/include/storage/itemptr.h file indicate the ItemPointerData structure is composed of 3 int16 fields. So everthing in the HeapTupleHeaderData structure is 32-bits or less. I am interested in attempting to generate a patch if this idea appears feasible. The current data set I am playing with it would save over 3GB of disk space. (Back of the envelope calculations indicate that 5% of my current storage is consumed by this padding. My tuple length is 44 bytes.) Thanks, - Ryan
Re: [HACKERS] gsoc, oprrest function for text search take 2
Jan Urbański wrote: Heikki Linnakangas wrote: Jan Urbański wrote: Another thing are cstring_to_text_with_len calls. I'm doing them so I can use bttextcmp in bsearch(). I think I could come up with a dedicated function to return text Datums and WordEntries (read: non-NULL terminated strings with a given length). Just keep them as cstrings and use strcmp. We're only keeping the array sorted so that we can binary search it, so we don't need proper locale-dependent collation. Note that we already assume that two strings ('text's) are equal if and only if their binary representations are equal (texteq() uses strcmp). OK, I got rid of cstring-text calls and memory contexts as I went through it. The only tiny ugliness is that there's one function used for qsort() and another for bsearch(), because I'm sorting an array of texts (from pg_statistic) and I'm binary searching for a lexeme (non-NULL terminated string with length). It would be nice to clean that up a bit. I think you could convert the lexeme to a TextFreq, or make the TextFreq.element a text * instead of Datum (ie., detoast it with PG_DETOAST_DATUM while you build the array for qsort). My medicore gprof skills got me: 0.000.22 5/5 OidFunctionCall4 [37] [38]28.40.000.22 5 tssel [38] 0.000.17 5/5 get_restriction_variable [40] 0.030.01 5/10 pg_qsort [60] 0.000.00 5/5 get_attstatsslot [139] Hopefully that says that the qsort() overhead is small compared to munging through the planner Node. I'd like to see a little bit more testing of that. I can't read gprof myself, so the above doesn't give me much confidence. I use oprofile, which I find is much simpler to use. I think the worst case scenario is with statistics_target set to maximum, with a simplest possible query and simplest possible tsquery. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parsing of pg_hba.conf and authentication inconsistencies
Stephen Frost wrote: Magnus, * Magnus Hagander ([EMAIL PROTECTED]) wrote: Yeah. I think the question there is just - how likely is it that the same installation actually uses 1 authentication method. Personally, I think it's not very uncommon at all, but fact remains that as long as you only use one of them at a time, using a shared file doesn't matter. We use multiple authentication types *alot*.. ident, md5, kerberos, and gssapi are all currently in use on our systems. ident for local unix logins, md5 for 'role' accounts and software the doesn't understand kerberos, kerberos/gssapi depending on the age of the client library connecting. Oh, and we use pam too.. We use some mappings now with ident, which I'd expect to continue to do, and I've got definite plans for mappings under Kerberos/GSSAPI once it's supported.. Ok. Good to know - if you want to use it, there are bound to be a number of others who would like it as well :) It wouldn't be very easy/clean to do that w/o breaking the existing structure of pg_ident though, which makes me feel like using seperate files is probably the way to go. Yeah, thats my feeling as well. Now, can someone figure out a way to do that without parsing the file in the postmaster? (And if we do parse it, there's no point in not storing the parsed version, IMHO). And if not, the question it comes down to is which is most important - keeping the parsing away, or being able to do this ;-) I don't have an answer wrt the parsing issue, but I definitely want to be able to do this. :) Right. I guess one option would be to load the map file at runtime in the backend, and not pre-load/cache it from the postmaster. But that seems rahter sub-optimal to me. Other thoughts? //Magnus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Question regarding the database page layout.
Ryan Bradetich [EMAIL PROTECTED] writes: After a cursory glance at the HeapTupleHeaderData structure, it appears it could be aligned with INTALIGN instead of MAXALIGN. The one structure I was worried about was the 6 byte t_ctid structure. The comments in src/include/storage/itemptr.h file indicate the ItemPointerData structure is composed of 3 int16 fields. So everthing in the HeapTupleHeaderData structure is 32-bits or less. Sure, but the tuple itself could contain something with double alignment. If you have a bigint or double in the tuple then heap_form_tuple needs to know where to put it so it ends up at right alignment. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Multiple anyelement types
I'm working on improvements of orafce. http://pgfoundry.org/projects/orafce I found postgres supports only one type of anyelement at one time when I added nvl2() and decode(). I'd like to use multiple types of anyelement something like: template typename Expr, typename Ret CREATE FUNCTION nvl(Expr, Ret, Ret) RETURNS Ret template typename Expr, typename Ret CREATE FUNCTION decode(Expr, Expr, Ret, ..., Ret) RETURNS Ret I don't mean to propose the above C++-like syntax, but such feature is important to develop a generic porting tool. Two independent anyelements are enough for me, but three or more might be better for general use. What syntax is suitable for postgres? Comments welcome. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: PageLayout footprint
Heikki Linnakangas napsal(a): Zdenek Kotala wrote: Current content of control file is insufficient to check if database is compatible with postgres server. It is? Do you have an example of where it's insufficient? Current control file contain following information (related to page layout): maxAlign blcksz toast_max_chunk_size But you don't have control how aligned is each member of data structure. By my opinion -fipa-struct-reorg GCC option could break structure. And maybe there are more compiler magic switches and optimization on different platforms which can modify structure alignment or member order. It probably does not happen often but footprint should protect people to shot himself. Zdenek PS: And of course toast_max_chunk_size is not insufficient as well. There are more constants like MaxHeapTupleSize and so on, but it is different story. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multiple anyelement types
He 2008/8/11 ITAGAKI Takahiro [EMAIL PROTECTED]: I'm working on improvements of orafce. http://pgfoundry.org/projects/orafce I found postgres supports only one type of anyelement at one time when I added nvl2() and decode(). I'd like to use multiple types of anyelement something like: template typename Expr, typename Ret CREATE FUNCTION nvl(Expr, Ret, Ret) RETURNS Ret template typename Expr, typename Ret CREATE FUNCTION decode(Expr, Expr, Ret, ..., Ret) RETURNS Ret I don't mean to propose the above C++-like syntax, but such feature is important to develop a generic porting tool. Two independent anyelements are enough for me, but three or more might be better for general use. it's good idea - I though so 2 independent types are enough: anyelement2, enyarray2. If you are C coder, you should use any type. regards Pavel Stehule What syntax is suitable for postgres? Comments welcome. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: PageLayout footprint
Zdenek Kotala wrote: By my opinion -fipa-struct-reorg GCC option could break structure. That option would probably break a lot of things. Like our variable-sized array as last field of a struct hacks. And maybe there are more compiler magic switches and optimization on different platforms which can modify structure alignment or member order. It probably does not happen often but footprint should protect people to shot himself. We depend on a certain member order and alignment rules. If we're worried about that, we should add checks in configure instead, to barf if you try to use such options. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: PageLayout footprint
Zdenek Kotala [EMAIL PROTECTED] writes: By my opinion -fipa-struct-reorg GCC option could break structure. And maybe there are more compiler magic switches and optimization on different platforms which can modify structure alignment or member order. It probably does not happen often but footprint should protect people to shot himself. My version of GCC doesn't have that option, what does it do? If structure members aren't in the order they're defined and padded to the alignment they're declared to have in pg_type then Postgres catalogs won't work anyways. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: PageLayout footprint
Gregory Stark wrote: Zdenek Kotala [EMAIL PROTECTED] writes: By my opinion -fipa-struct-reorg GCC option could break structure. And maybe there are more compiler magic switches and optimization on different platforms which can modify structure alignment or member order. It probably does not happen often but footprint should protect people to shot himself. My version of GCC doesn't have that option, what does it do? From gcc man page: -fipa-struct-reorg Perform structure reorganization optimization, that change C-like structures layout in order to better utilize spatial locality. This transformation is affective for programs containing arrays of structures. Available in two compilation modes: profile-based (enabled with -fprofile-generate) or static (which uses built-in heuristics). Require -fipa-type-escape to provide the safety of this transformation. It works only in whole program mode, so it requires -fwhole-program and -combine to be enabled. Structures considered cold by this transformation are not affected (see --param struct-reorg-cold-struct-ratio=value). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
Pavel Stehule wrote: Hello, combination functions to_ascii and convert_to is broken now. Problem is in convert_to function. It doesn't support 8bit output encoding. Current workaround: CREATE FUNCTION to_ascii(bytea, name) RETURNS text AS 'to_ascii_encname' LANGUAGE internal; SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2'); I don't expect column collate for 8.4, so we need to have workable to_ascii function. I propose function to_ascii(text, name) that internally convert text from utf8 encoding when it's necessary.cheers convert_to is not broken. It returns a bytea, and it is up to you to de-escape it if you get the text representation. We are surely not going to go back to a situation where we have functions returning text in any encoding other than the database encoding. That becomes a vehicle for storing wrongly encoded data in the database, and we have just gone through the exercise of plugging those holes. I privately predicted when we did this work that it might motivate people who had been abusing convert_to to get proper support for multiple encodings done. That is the right way to go, not re-opening holes we have just very deliberately plugged. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: PageLayout footprint
Heikki Linnakangas napsal(a): Zdenek Kotala wrote: By my opinion -fipa-struct-reorg GCC option could break structure. That option would probably break a lot of things. Like our variable-sized array as last field of a struct hacks. Yes, it is extreme case. And maybe there are more compiler magic switches and optimization on different platforms which can modify structure alignment or member order. It probably does not happen often but footprint should protect people to shot himself. We depend on a certain member order and alignment rules. If we're worried about that, we should add checks in configure instead, to barf if you try to use such options. You are able to check order, but you cannot complain about structure member alignment during configure time. But if you have two binaries which you get from two sources then you need to verify that both binaries has same structure footprint. Similar is 32/64 bit compilation. It is handled on x86 by MAXALIGN but MAXALIGN is same on SPARC for both binaries, but I'm not sure if it works correctly. Any other usage is to protect developers to make a mistake and break silently compatibility, but it should be caught by --footprint switch. Zdenek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Skimming icc warnings on mongoose
I happened to be skimming the compile warnings on mongoose, an icc buildfarm member and noticed two interesting warnings. icc -O3 -xN -parallel -ip -mp1 -fno-strict-aliasing -g -fpic -I../../../src/include/snowball -I../../../src/include/snowball/libstemmer -I../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/et -c -o stem_UTF_8_swedish.o ./libstemmer/stem_UTF_8_swedish.c icc -O3 -xN -parallel -ip -mp1 -fno-strict-aliasing -g -fpic -I../../../src/include/snowball -I../../../src/include/snowball/libstemmer -I../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/et -c -o stem_UTF_8_turkish.o ./libstemmer/stem_UTF_8_turkish.c icc -O3 -xN -parallel -ip -mp1 -fno-strict-aliasing -g -fpic -shared dict_snowball.o api.o utilities.o stem_ISO_8859_1_danish.o stem_ISO_8859_1_dutch.o stem_ISO_8859_1_english.o stem_ISO_8859_1_finnish.o stem_ISO_8859_1_french.o stem_ISO_8859_1_german.o stem_ISO_8859_1_hungarian.o stem_ISO_8859_1_italian.o stem_ISO_8859_1_norwegian.o stem_ISO_8859_1_porter.o stem_ISO_8859_1_portuguese.o stem_ISO_8859_1_spanish.o stem_ISO_8859_1_swedish.o stem_ISO_8859_2_romanian.o stem_KOI8_R_russian.o stem_UTF_8_danish.o stem_UTF_8_dutch.o stem_UTF_8_english.o stem_UTF_8_finnish.o stem_UTF_8_french.o stem_UTF_8_german.o stem_UTF_8_hungarian.o stem_UTF_8_italian.o stem_UTF_8_norwegian.o stem_UTF_8_porter.o stem_UTF_8_portuguese.o stem_UTF_8_romanian.o stem_UTF_8_russian.o stem_UTF_8_spanish.o stem_UTF_8_swedish.o stem_UTF_8_turkish.o -L../../../src/port -o dict_snowball.so ld: warning: creating a DT_TEXTREL in object. There are a few instances of this. I think it indicates we are either spelling -fpic wrong or including some .o object that was not compiled with -fpic (perhaps something from the ports directory?). This can result in a performance drain when the .so is linked in since the text segment can't be mapped directly from the file and instead needs to be read in and adjusted to the base address to which it was loaded. It isn't unique to tsearch stuff either, plpgsql has the same warning: make[4]: Entering directory `/home/data/local/jeremyd/postgres/buildfarm/root/HEAD/pgsql.4376/src/pl/plpgsql/src' bison -y -d gram.y mv -f y.tab.c ./pl_gram.c mv -f y.tab.h ./pl.tab.h LC_CTYPE=C /usr/bin/flex -o'pl_scan.c' scan.l icc -O3 -xN -parallel -ip -mp1 -fno-strict-aliasing -g -fpic -I. -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/et -c -o pl_gram.o pl_gram.c pl_scan.c(1944) : (col. 2) remark: LOOP WAS VECTORIZED. pl_scan.c(2495) : (col. 2) remark: LOOP WAS VECTORIZED. icc -O3 -xN -parallel -ip -mp1 -fno-strict-aliasing -g -fpic -I. -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/et -c -o pl_handler.o pl_handler.c pl_handler.c(199) : (col. 3) remark: LOOP WAS VECTORIZED. pl_handler.c(200) : (col. 3) remark: LOOP WAS VECTORIZED. pl_handler.c(206) : (col. 4) remark: LOOP WAS VECTORIZED. icc -O3 -xN -parallel -ip -mp1 -fno-strict-aliasing -g -fpic -I. -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/et -c -o pl_comp.o pl_comp.c pl_comp.c(157) : (col. 3) remark: LOOP WAS VECTORIZED. pl_comp.c(211) : (col. 4) remark: LOOP WAS VECTORIZED. icc -O3 -xN -parallel -ip -mp1 -fno-strict-aliasing -g -fpic -I. -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/et -c -o pl_exec.o pl_exec.c icc -O3 -xN -parallel -ip -mp1 -fno-strict-aliasing -g -fpic -I. -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/et -c -o pl_funcs.o pl_funcs.c icc -O3 -xN -parallel -ip -mp1 -fno-strict-aliasing -g -fpic -shared pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o -L../../../../src/port -o plpgsql.so ld: warning: creating a DT_TEXTREL in object. And then there's this -- plen is a size_t... plpython.c(1420): warning #186: pointless comparison of unsigned integer with zero Assert(plen = 0 plen mlen); -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
Hello 2008/8/11 Andrew Dunstan [EMAIL PROTECTED]: Pavel Stehule wrote: Hello, combination functions to_ascii and convert_to is broken now. Problem is in convert_to function. It doesn't support 8bit output encoding. Current workaround: CREATE FUNCTION to_ascii(bytea, name) RETURNS text AS 'to_ascii_encname' LANGUAGE internal; SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2'); I don't expect column collate for 8.4, so we need to have workable to_ascii function. I propose function to_ascii(text, name) that internally convert text from utf8 encoding when it's necessary.cheers convert_to is not broken. It returns a bytea, and it is up to you to de-escape it if you get the text representation. ok, I talked about combination convert_to and to_ascii. to_ascii doesn't support bytea, what is probably correct. We cannot use descape, because it remove 8bit. This issue was noticed more times - http://archives.postgresql.org/pgsql-general/2008-06/msg00495.php We are surely not going to go back to a situation where we have functions returning text in any encoding other than the database encoding. That becomes a vehicle for storing wrongly encoded data in the database, and we have just gone through the exercise of plugging those holes. I privately predicted when we did this work that it might motivate people who had been abusing convert_to to get proper support for multiple encodings done. That is the right way to go, not re-opening holes we have just very deliberately plugged. to_ascii isn't related to multiple encodings. And actually there is only one man who works on it. We will be happy for database collation in 8.4. So without any change this feature will be broken more than two years. Regards Pavel cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Skimming icc warnings on mongoose
Gregory Stark [EMAIL PROTECTED] writes: I happened to be skimming the compile warnings on mongoose, an icc buildfarm member and noticed two interesting warnings. ld: warning: creating a DT_TEXTREL in object. There are a few instances of this. I think it indicates we are either spelling -fpic wrong or including some .o object that was not compiled with -fpic (perhaps something from the ports directory?). If it were the latter, quite a number of platforms would just fail outright, I believe. As for spelling -fpic wrong, that seems unlikely too considering that icc generally tries to pretend it's gcc. Any other theories? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: PageLayout footprint
Zdenek Kotala wrote: Similar is 32/64 bit compilation. It is handled on x86 by MAXALIGN but MAXALIGN is same on SPARC for both binaries, but I'm not sure if it works correctly. Are 32/64-bit binaries on Sparc incompatible, then? Does our control file check catch it? If not, what's causing it, and would the --footprint switch catch it? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
2008/8/11 Andrew Dunstan [EMAIL PROTECTED]: Pavel Stehule wrote: Hello, combination functions to_ascii and convert_to is broken now. Problem is in convert_to function. It doesn't support 8bit output encoding. Current workaround: CREATE FUNCTION to_ascii(bytea, name) RETURNS text AS 'to_ascii_encname' LANGUAGE internal; SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2'); I don't expect column collate for 8.4, so we need to have workable to_ascii function. I propose function to_ascii(text, name) that internally convert text from utf8 encoding when it's necessary.cheers convert_to is not broken. It returns a bytea, and it is up to you to de-escape it if you get the text representation. One note - convert_to is correct. But we have to use to_ascii without decode functions. It has same behave - convert from bytea to text. Text in incorrect encoding is dafacto bytea. So correct to_ascii function prototypes are: to_ascii(text) to_ascii(bytea, integer); to_ascii(bytea, name); Regards Pavel Stehule We are surely not going to go back to a situation where we have functions returning text in any encoding other than the database encoding. That becomes a vehicle for storing wrongly encoded data in the database, and we have just gone through the exercise of plugging those holes. I privately predicted when we did this work that it might motivate people who had been abusing convert_to to get proper support for multiple encodings done. That is the right way to go, not re-opening holes we have just very deliberately plugged. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
Pavel Stehule wrote: One note - convert_to is correct. But we have to use to_ascii without decode functions. It has same behave - convert from bytea to text. Text in incorrect encoding is dafacto bytea. So correct to_ascii function prototypes are: to_ascii(text) to_ascii(bytea, integer); to_ascii(bytea, name); What you have not said is how you propose to convert UTF8 to ASCII. Currently to_ascii() converts a small number of single byte charsets to ASCII by folding the chars with high bits set, so what we get is a pure ASCII result which is safe in any server encoding, as they are all ASCII supersets. But what conversion rule will you use for the gazillions of Unicode characters? I honestly do not understand the use case for this at all. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
Andrew Dunstan wrote: Pavel Stehule wrote: One note - convert_to is correct. But we have to use to_ascii without decode functions. It has same behave - convert from bytea to text. Text in incorrect encoding is dafacto bytea. So correct to_ascii function prototypes are: to_ascii(text) to_ascii(bytea, integer); to_ascii(bytea, name); What you have not said is how you propose to convert UTF8 to ASCII. Currently to_ascii() converts a small number of single byte charsets to ASCII by folding the chars with high bits set, so what we get is a pure ASCII result which is safe in any server encoding, as they are all ASCII supersets. But what conversion rule will you use for the gazillions of Unicode characters? I honestly do not understand the use case for this at all. I do. Often clients want their searches to be accented-or-language-specific letters insensitive. So searching for 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of such facility made me consider not upgrading particular client to 8.3...). Or maybe there's a better way to do it? Cheers, Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
Jan Urbański wrote: Andrew Dunstan wrote: Pavel Stehule wrote: One note - convert_to is correct. But we have to use to_ascii without decode functions. It has same behave - convert from bytea to text. Text in incorrect encoding is dafacto bytea. So correct to_ascii function prototypes are: to_ascii(text) to_ascii(bytea, integer); to_ascii(bytea, name); What you have not said is how you propose to convert UTF8 to ASCII. Currently to_ascii() converts a small number of single byte charsets to ASCII by folding the chars with high bits set, so what we get is a pure ASCII result which is safe in any server encoding, as they are all ASCII supersets. But what conversion rule will you use for the gazillions of Unicode characters? I honestly do not understand the use case for this at all. I do. Often clients want their searches to be accented-or-language-specific letters insensitive. So searching for 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of such facility made me consider not upgrading particular client to 8.3...). Or maybe there's a better way to do it? Well, my first question would be Why aren't you using a database encoding that supports to_ascii()? However, I suppose that your use case would support this signature: to_ascii(bytea, name) where it would just error out if the encoding name were something other than LATIN1, LATIN2, LATIN9, or WIN1250. But what would be the meaning of this?: to_ascii(bytea, integer) cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: PageLayout footprint
Heikki Linnakangas napsal(a): Zdenek Kotala wrote: Similar is 32/64 bit compilation. It is handled on x86 by MAXALIGN but MAXALIGN is same on SPARC for both binaries, but I'm not sure if it works correctly. Are 32/64-bit binaries on Sparc incompatible, then? Does our control file check catch it? If not, what's causing it, and would the --footprint switch catch it? It is what I don't know. controlfile only says incorrect checksum in control file. It seems to me that CRC computing does not work correctly. I check pg_control footprint and it is same for 32/64. It seems to me a false positive complain, but ... Footprint switch should help show you why it is incompatible and also protect you during development to break some structure. Zdenek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: PageLayout footprint
Zdenek Kotala wrote: It is what I don't know. controlfile only says incorrect checksum in control file. It seems to me that CRC computing does not work correctly. I check pg_control footprint and it is same for 32/64. It seems to me a false positive complain, but ... Endianness perhaps? Per comment in ControlFileData: * This data is used to check for hardware-architecture compatibility of * the database and the backend executable. We need not check endianness * explicitly, since the pg_control version will surely look wrong to a * machine of different endianness, but we do need to worry about MAXALIGN * and floating-point format. (Note: storage layout nominally also * depends on SHORTALIGN and INTALIGN, but in practice these are the same * on all architectures of interest.) Footprint switch should help show you why it is incompatible and also protect you during development to break some structure. Apparently the footprint switch didn't provide any insight into the Sparc 32/64-bit issue, so I'm not too impressed.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
Andrew Dunstan wrote: Jan Urbański wrote: Andrew Dunstan wrote: Pavel Stehule wrote: What you have not said is how you propose to convert UTF8 to ASCII. Currently to_ascii() converts a small number of single byte charsets to ASCII by folding the chars with high bits set, so what we get is a pure ASCII result which is safe in any server encoding, as they are all ASCII supersets. But what conversion rule will you use for the gazillions of Unicode characters? I honestly do not understand the use case for this at all. I do. Often clients want their searches to be accented-or-language-specific letters insensitive. So searching for 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of such facility made me consider not upgrading particular client to 8.3...). Or maybe there's a better way to do it? Well, my first question would be Why aren't you using a database encoding that supports to_ascii()? Because I want UTF-8 in it ;) It's mostly LATIN2, but clients sometimes input Cyrillic, Greek or Hebrew letters, and sometimes use Unicode characters like (U+2026) HORIZONTAL ELLIPSIS. I'd like to have to_ascii(text, [error_handling]) returns text So no bytea, to_ascii would accept text that's legal in my current database encoding and return text in that encoding. And error_handling would be something like: - 'error' (the default, throw an error if a character is untranslable to ASCII) - 'ignore' (omit untranslable characters) - 'transliterate' (do your best to transliterate the character, or leave it as it is if impossible). Examples would include (assuming UTF-8 database) to_ascii('łódź') - 'lodz' to_ascii('china is written 中國') - ERROR to_ascii('china is written 中國', 'ignore') - 'china is written ' to_ascii('china is written 中國', 'transliterate') - 'china is written zhong guo' (in an ideal world) to_ascii('china is written 中國', 'transliterate') - 'china is written 中國' (in reality)\ These would have the property, that: to_ascii(X, 'ignore') is always pure ASCII data and never throws an error to_ascii(X, 'transliterate') is sometimes non-ASCII data and never throws an error to_ascii(X) is sometimes non-ASCII data and sometimes throws an error It's something like PHP's iconv that can have //TRANSLIT or somesuch (forgive me for giving PHP as an example...). Now I'd love to hear people punch holes in my daydreaming design ;) Cheers, Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: PageLayout footprint
Heikki Linnakangas napsal(a): Zdenek Kotala wrote: It is what I don't know. controlfile only says incorrect checksum in control file. It seems to me that CRC computing does not work correctly. I check pg_control footprint and it is same for 32/64. It seems to me a false positive complain, but ... Endianness perhaps? Per comment in ControlFileData: No. It is same Endianness. SPARC allows to map memory with different endian, but it needs extra magic. Zdenek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
2008/8/11 Andrew Dunstan [EMAIL PROTECTED]: Pavel Stehule wrote: One note - convert_to is correct. But we have to use to_ascii without decode functions. It has same behave - convert from bytea to text. Text in incorrect encoding is dafacto bytea. So correct to_ascii function prototypes are: to_ascii(text) to_ascii(bytea, integer); to_ascii(bytea, name); What you have not said is how you propose to convert UTF8 to ASCII. Currently to_ascii() converts a small number of single byte charsets to ASCII by folding the chars with high bits set, so what we get is a pure ASCII result which is safe in any server encoding, as they are all ASCII supersets. But what conversion rule will you use for the gazillions of Unicode characters? I honestly do not understand the use case for this at all. It's typical case in czech language, where some searchings are accents insensitive - Stěhule, Stehule, Novotný, Novotny. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
2008/8/11 Andrew Dunstan [EMAIL PROTECTED]: Jan Urbański wrote: Andrew Dunstan wrote: Pavel Stehule wrote: One note - convert_to is correct. But we have to use to_ascii without decode functions. It has same behave - convert from bytea to text. Text in incorrect encoding is dafacto bytea. So correct to_ascii function prototypes are: to_ascii(text) to_ascii(bytea, integer); to_ascii(bytea, name); What you have not said is how you propose to convert UTF8 to ASCII. Currently to_ascii() converts a small number of single byte charsets to ASCII by folding the chars with high bits set, so what we get is a pure ASCII result which is safe in any server encoding, as they are all ASCII supersets. But what conversion rule will you use for the gazillions of Unicode characters? I honestly do not understand the use case for this at all. I do. Often clients want their searches to be accented-or-language-specific letters insensitive. So searching for 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of such facility made me consider not upgrading particular client to 8.3...). Or maybe there's a better way to do it? Well, my first question would be Why aren't you using a database encoding that supports to_ascii()? However, I suppose that your use case would support this signature: to_ascii(bytea, name) where it would just error out if the encoding name were something other than LATIN1, LATIN2, LATIN9, or WIN1250. But what would be the meaning of this?: to_ascii(bytea, integer) it's symmetric. Nothing more. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
Pavel Stehule wrote: But what would be the meaning of this?: to_ascii(bytea, integer) it's symmetric. Nothing more. Symmetric to what? What is the second argument supposed to be? cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
2008/8/11 Andrew Dunstan [EMAIL PROTECTED]: Pavel Stehule wrote: But what would be the meaning of this?: to_ascii(bytea, integer) it's symmetric. Nothing more. Symmetric to what? What is the second argument supposed to be? postgres=# \df to_ascii List of functions Schema | Name | Result data type | Argument data types +--+--+- pg_catalog | to_ascii | text | text pg_catalog | to_ascii | text | text, integer pg_catalog | to_ascii | text | text, name postgres=# select to_ascii('pavel',8); to_ascii -- pavel (1 row) cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
Pavel Stehule wrote: 2008/8/11 Andrew Dunstan [EMAIL PROTECTED]: Pavel Stehule wrote: But what would be the meaning of this?: to_ascii(bytea, integer) it's symmetric. Nothing more. Symmetric to what? What is the second argument supposed to be? postgres=# \df to_ascii List of functions Schema | Name | Result data type | Argument data types +--+--+- pg_catalog | to_ascii | text | text pg_catalog | to_ascii | text | text, integer pg_catalog | to_ascii | text | text, name postgres=# select to_ascii('pavel',8); to_ascii -- pavel (1 row) Hmm. That's not documented, and I suspect shouldn't be there. Everywhere else pretty much that I can think of we pass the encoding as a name, and I think we should be consistent about it. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Question regarding the database page layout.
Hello Greg, On Mon, Aug 11, 2008 at 2:24 AM, Gregory Stark [EMAIL PROTECTED]wrote: Ryan Bradetich [EMAIL PROTECTED] writes: After a cursory glance at the HeapTupleHeaderData structure, it appears it could be aligned with INTALIGN instead of MAXALIGN. The one structure I was worried about was the 6 byte t_ctid structure. The comments in src/include/storage/itemptr.h file indicate the ItemPointerData structure is composed of 3 int16 fields. So everthing in the HeapTupleHeaderData structure is 32-bits or less. Sure, but the tuple itself could contain something with double alignment. If you have a bigint or double in the tuple then heap_form_tuple needs to know where to put it so it ends up at right alignment. My first thought was we can still figure this out because the user data is already forced to be MAXALIGN. Then I realized oh, since that is true then I am still going to eat the padding anyhow. The padding would just move to one of two places: 1. To MAXALIGN the user data or 2. To MAXALIGN the heap tuple header. Thanks for the sanity check. I was wrapped up in looking for alignment requirements in the HeapTupleHeaderData structure, I overlooked the significance of the MAXALIGN on the user data. Thanks, - Ryan P.S. Just for the archive, it seems this idea still may be workable (need to look at the heap_form_tuple in significant more detail) if/when someone implements the proposal to separate the physical storage from the column order. That solution is a bit more than I am ready to tackle at the moment :) Maybe in the future.
Re: [HACKERS] proposal: UTF8 to_ascii function
2008/8/11 Andrew Dunstan [EMAIL PROTECTED]: Pavel Stehule wrote: 2008/8/11 Andrew Dunstan [EMAIL PROTECTED]: Pavel Stehule wrote: But what would be the meaning of this?: to_ascii(bytea, integer) it's symmetric. Nothing more. Symmetric to what? What is the second argument supposed to be? postgres=# \df to_ascii List of functions Schema | Name | Result data type | Argument data types +--+--+- pg_catalog | to_ascii | text | text pg_catalog | to_ascii | text | text, integer pg_catalog | to_ascii | text | text, name postgres=# select to_ascii('pavel',8); to_ascii -- pavel (1 row) Hmm. That's not documented, and I suspect shouldn't be there. Everywhere else pretty much that I can think of we pass the encoding as a name, and I think we should be consistent about it. I don't need it regards Pavel cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] WIP: New Page API
I finished first prototype of new page API. It contains several new functions like: Pointer PageGetUpperPointer(Page page); void PageClearPrunable(Page page); bool PageIsComprimable(Page page); void PageReserveLinp(Page page); void PageReleaseLinp(Page page); LocationIndex PageGetLower(Page page); LocationIndex PageGetUpper(Page page); LocationIndex PageGetSpecial(Page page); void PageSetLower(Page page, LocationIndex lower); void PageSetUpper(Page page, LocationIndex lower); Page PageGetTempPage(Page page, bool copy); Page PageGetTempPageCopySpecial(Page page); void PageRestoreTempPage(Page tempPage, Page oldPage); Size PageGetSpecialSize(Page page); Size PageGetDataSize(Page page); bool PageLayoutIsValid(Page page); The main point of the new API is to handle multi page layout versions. The current implementation don't uses any speed optimization and performance gap is about 5% (big thanks to Paul van den Bogaard for benchmarking). Finally I plan to implement hottest function like macro (or inline fn ;-) ) and because most structure members are located on same place I will remove extra switch in the code. I also grab number of calls by DTrace and I got following result: snip PageGetHeapFreeSpace984 PageGetSpecialSize 1170 PageGetFreeSpace 1200 PageGetExactFreeSpace 1399 PageGetUpper 1419 PageGetLower 1618 PageGetLSN 2936 PageGetMaxOffsetNumber 5504 PageGetSpecialPointer 13534 PageGetItemId 71074 PageGetItem 76629 I plan to remove PageGetItemId and replace it with any other function like PageGetHeapTuple and so on. The reason is that ItemId flags has been changed between version 3 and 4. And on many times it is called like itemId = PageGetItemId(); PageGetItem(itemId); I'm also thinking about add following function: PageSetXLOG(page,TLI,LSN) - it should replace PageSetLSN();PageSetTLI(); sequence in code I'm not happy with PageSetLower() function which is used in nbtree, but no idea yet how to improve it. Please, let me know your comments. I attached prototype patch. Thanks Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql diff -r 879ef18ad9cb -r 3e71cf34dced src/backend/access/gin/gindatapage.c --- a/src/backend/access/gin/gindatapage.c Tue Aug 05 21:28:29 2008 + +++ b/src/backend/access/gin/gindatapage.c Mon Aug 11 15:58:27 2008 +0200 @@ -445,7 +445,7 @@ char *ptr; OffsetNumber separator; ItemPointer bound; - Page lpage = GinPageGetCopyPage(BufferGetPage(lbuf)); + Page lpage = PageGetTempPage(BufferGetPage(lbuf), true); ItemPointerData oldbound = *GinDataPageGetRightBound(lpage); int sizeofitem = GinSizeOfItem(lpage); OffsetNumber maxoff = GinPageGetOpaque(lpage)-maxoff; diff -r 879ef18ad9cb -r 3e71cf34dced src/backend/access/gin/ginentrypage.c --- a/src/backend/access/gin/ginentrypage.c Tue Aug 05 21:28:29 2008 + +++ b/src/backend/access/gin/ginentrypage.c Mon Aug 11 15:58:27 2008 +0200 @@ -458,7 +458,7 @@ leftrightmost = NULL; static ginxlogSplit data; Page page; - Page lpage = GinPageGetCopyPage(BufferGetPage(lbuf)); + Page lpage = PageGetTempPage(BufferGetPage(lbuf), true); Page rpage = BufferGetPage(rbuf); Size pageSize = PageGetPageSize(lpage); diff -r 879ef18ad9cb -r 3e71cf34dced src/backend/access/gin/ginutil.c --- a/src/backend/access/gin/ginutil.c Tue Aug 05 21:28:29 2008 + +++ b/src/backend/access/gin/ginutil.c Mon Aug 11 15:58:27 2008 +0200 @@ -309,21 +309,6 @@ return entries; } -/* - * It's analog of PageGetTempPage(), but copies whole page - */ -Page -GinPageGetCopyPage(Page page) -{ - Size pageSize = PageGetPageSize(page); - Page tmppage; - - tmppage = (Page) palloc(pageSize); - memcpy(tmppage, page, pageSize); - - return tmppage; -} - Datum ginoptions(PG_FUNCTION_ARGS) { diff -r 879ef18ad9cb -r 3e71cf34dced src/backend/access/gin/ginvacuum.c --- a/src/backend/access/gin/ginvacuum.c Tue Aug 05 21:28:29 2008 + +++ b/src/backend/access/gin/ginvacuum.c Mon Aug 11 15:58:27 2008 +0200 @@ -529,7 +529,7 @@ * On first difference we create temporary page in memory * and copies content in to it. */ - tmppage = GinPageGetCopyPage(origpage); + tmppage = PageGetTempPage(origpage, true); if (newN 0) { diff -r 879ef18ad9cb -r 3e71cf34dced src/backend/access/gist/gist.c --- a/src/backend/access/gist/gist.c Tue Aug 05 21:28:29 2008 + +++ b/src/backend/access/gist/gist.c Mon Aug
Re: [HACKERS] IN vs EXISTS equivalence
Kevin Grittner [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] wrote: I believe that the optimizable cases for EXISTS are those where the EXISTS() is either at the top level of WHERE, or just underneath a NOT, The rest of the plan makes sense to me, but this part seems narrow. There's probably a good reason for that which is beyond my depth, but attached is a view that is used for calculating statistics for a database which is primarily used for case management purposes. If EXISTS could also be optimized in the contexts used there, it would be great. I chewed on that for awhile. We can certainly optimize EXISTS that's appearing in the ON-condition of a regular JOIN, because that's not really semantically different from a WHERE-condition. But I don't think it's going to be reasonable to improve EXISTS in outer-JOIN ON conditions. There are a couple of problems. Consider t1 LEFT JOIN t2 ON (t1.f1 = t2.f2 AND EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx AND t3.f4 = t2.fy)) To implement this with the correct semantics, we'd have to have the t1/t2 join and the t1/t2/t3 join going on in the same execution node, with two different join behaviors (LEFT and SEMI). There isn't any way AFAICS to factor it into two separate steps. That's unreasonably complicated, and it's not clear that you'd get any better performance anyway than the current implementation (which'd treat the EXISTS as a subplan). Even worse, let the EXISTS be a degenerate case: t1 LEFT JOIN t2 ON (t1.f1 = t2.f2 AND EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx)); We can't actually treat this EXISTS as a semijoin between t1 and t3, either before or after the LEFT JOIN; because then the behavior would be to drop t1 rows that have no t3 match, which is not what this query specifies. (Note: the other degenerate case, where the EXISTS depends only on t2, *could* be optimized since we could just let the semijoin be performed within the RHS of the LEFT JOIN.) So this is not something I'm going to tackle; at least not this devel cycle. One small step we can take in this direction, though, is to improve the planner's internal handling of the qual conditions for IN and EXISTS. Right now the process is just to throw the sub-select into the main range table and put the IN join conditions into the same place in WHERE that the IN-clause was to start with. The trouble with this is that the distribute_quals_to_rels processing has no idea that there's anything special about the IN join conditions. We got away with that for the limited case of IN clauses at the top level of WHERE, but it's become clear to me over the weekend that this has no hope of working for NOT EXISTS --- since that's effectively an outer join, it absolutely has to have the same kinds of qual-scheduling constraints as ordinary outer joins do. So we need a data structure that distribute_quals_to_rels can work with. What I think needs to happen is that the initial pass that pulls up optimizable IN/EXISTS sub-selects should not merge the SubLink's replacement qual clauses seamlessly, but put them underneath a new node type, say FlattenedSubLink, that retains knowledge of the join it's representing. The FlattenedSubLink would survive only as far as distribute_quals_to_rels, which would distribute out the contained qual conditions instead of the FlattenedSubLink itself --- but only after marking them properly for the outer-join restrictions. This representation would make it feasible to work with IN/EXISTS that are inside JOIN ON conditions, which the present representation using a single in_info_list really can't do. The semantic issues are still there but at least the representation isn't getting in the way ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Type Categories for User-Defined Types
On Aug 10, 2008, at 20:58, David E. Wheeler wrote: Just realized that I forgot to add the DROP FUNCTION statements to the uninstall script. New patch attached. And this one also includes the casts I added. :-) Best, David citext_casting3.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
On Monday 11 August 2008 16:23:29 Jan Urbański wrote: Often clients want their searches to be accented-or-language-specific letters insensitive. So searching for 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of such facility made me consider not upgrading particular client to 8.3...). These are valid ideas, but then please design a new function that addresses your use case in a well-defined way, and don't overload questionable old interfaces for new purposes. In the Unicode standard you can find well-defined methods to decompose characters into diacritic marks, and then you could strip them off. But this has nothing to do with ASCII or UTF8 or encodings. Cyrillic characters can have diacritic marks as well, for example. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IN vs EXISTS equivalence
On Aug 8, 2008, at 3:23 PM, Tom Lane wrote: * has no set operations (UNION etc), grouping, set-returning functions in the SELECT list, LIMIT, or a few other funny cases Couldn't union/union all be treated as EXISTS(a) OR EXISTS(b) ... Or am I missing some detail with NULLS? Personally, I'd rather write it as separate EXISTS clauses rather than using UNION, but perhaps others have a different preference... -- Decibel!, aka Jim C. Nasby, Database Architect [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 smime.p7s Description: S/MIME cryptographic signature
Re: [HACKERS] Visibility Groups
On Aug 7, 2008, at 9:49 AM, Robert Haas wrote: This proposal sounds like it would target batch jobs, because those are the kinds of jobs that where you can predict in advance what tables will be needed. I don't know whether my personal set of problems with MVCC syncs up with anyone else's, but this is rarely how I get bitten. Usually, what happens is that a user session (psql or web server connection) gets left in a transaction for days or weeks. Now the batch jobs (which are doing lots of updates) start creating tons of bloat, but it's not their snapshot that is causing the problem. We have some cron'd scripts that check for long-running queries, idle in transaction and just plain idle. The scripts will kill sessions when the sit in different states for too long. It would be nice if the database could handle this (no, statement_timeout won't work, because the user can just change it to whatever they want), but I don't know how many other people have this need. I guess I could at least put the scripts up on pgFoundry... -- Decibel!, aka Jim C. Nasby, Database Architect [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 smime.p7s Description: S/MIME cryptographic signature
Re: [HACKERS] Mini improvement: statement_cost_limit
On Aug 3, 2008, at 9:57 PM, Robert Treat wrote: I think a variation on this could be very useful in development and test environments. Suppose it raised a warning or notice if the cost was over the limit. Then one could set a limit of a few million on the development and test servers and developers would at least have a clue that they needed to look at explain for that query. As it is now, one can exhort them to run explain, but it has no effect. Instead we later see queries killed by a 24 hour timeout with estimated costs ranging from until they unplug the machine and dump it to until the sun turns into a red giant. Great argument. So that's 4 in favour at least. Not such a great argument. Cost models on development servers can and often are quite different from those on production, so you might be putting an artifical limit on top of your developers. We should have an approved API for dumping stats from one database and loading them into another. pg_dump needs this as well, IMO. -- Decibel!, aka Jim C. Nasby, Database Architect [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 smime.p7s Description: S/MIME cryptographic signature
Re: [HACKERS] Mini improvement: statement_cost_limit
On Aug 4, 2008, at 3:49 PM, Simon Riggs wrote: On Mon, 2008-08-04 at 14:35 -0400, Robert Treat wrote: On Monday 04 August 2008 03:50:40 daveg wrote: And you'll note, I specifically said that a crude tool is better than nothing. But your completely ignoring that a crude tool can often end-up as a foot-gun once relased into the wild. The proposal is for an option with no consequences when turned off. We respect your right not to use it. What is the danger exactly? If we cancel stupid queries before people run them, everybody is a winner. Even the person who submitted the stupid query, since they find out faster. I could *really* use this. Unfortunately, we have a lot of folks writing some horrible queries and killing our slave databases. I'd *love* to be able to throw out any queries that had insane limits... We'll have to do something with enable_seqscan, BTW, chaps. My thought would be to back the cost penalty out if we end up with a seqscan anyway. Speaking of which, there is a semi-related issue... if you have a large enough table the fixed-size cost we add to a seqscan won't be enough to make an alternative plan come out cheaper. Instead of adding a fixed cost, I think we should multiply by the estimated number of rows. -- Decibel!, aka Jim C. Nasby, Database Architect [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 smime.p7s Description: S/MIME cryptographic signature
Re: [HACKERS] IN vs EXISTS equivalence
Our Internet connectivity failed as this was being sent. It looks like at least the list didn't get it, so here goes another try. Apologies for any duplication. -Kevin Tom Lane [EMAIL PROTECTED] wrote: I chewed on that for awhile. We can certainly optimize EXISTS that's appearing in the ON-condition of a regular JOIN, because that's not really semantically different from a WHERE-condition. Good to hear. I thought that might be doable. :-) But I don't think it's going to be reasonable to improve EXISTS in outer-JOIN ON conditions. There are a couple of problems. Consider The discussion did make the difficulties clear. So this is not something I'm going to tackle; at least not this devel cycle. Fair enough. One small step we can take in this direction, though, is to improve the planner's internal handling of the qual conditions for IN and EXISTS. Right now the process is just to throw the sub-select into the main range table and put the IN join conditions into the same place in WHERE that the IN-clause was to start with. The trouble with this is that the distribute_quals_to_rels processing has no idea that there's anything special about the IN join conditions. We got away with that for the limited case of IN clauses at the top level of WHERE, but it's become clear to me over the weekend that this has no hope of working for NOT EXISTS --- since that's effectively an outer join, it absolutely has to have the same kinds of qual-scheduling constraints as ordinary outer joins do. So we need a data structure that distribute_quals_to_rels can work with. What I think needs to happen is that the initial pass that pulls up optimizable IN/EXISTS sub-selects should not merge the SubLink's replacement qual clauses seamlessly, but put them underneath a new node type, say FlattenedSubLink, that retains knowledge of the join it's representing. The FlattenedSubLink would survive only as far as distribute_quals_to_rels, which would distribute out the contained qual conditions instead of the FlattenedSubLink itself --- but only after marking them properly for the outer-join restrictions. This representation would make it feasible to work with IN/EXISTS that are inside JOIN ON conditions, which the present representation using a single in_info_list really can't do. The semantic issues are still there but at least the representation isn't getting in the way ... Just curious, is that something for this cycle, or a TODO item? Thanks for looking at this. The one part I'm not sure about is where the CASE/EXISTS in the SELECT value list fits into this discussion. It seems conceptually similar to the OUTER JOIN, but sort of a special case, so I'm not sure what you had in mind there. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Column level privileges was:(Re: [HACKERS] Extending grant insert on tables to sequences)
On 7/25/08, Stephen Frost [EMAIL PROTECTED] wrote: * Jaime Casanova ([EMAIL PROTECTED]) wrote: ok, seems this is the last one for column level patch http://archives.postgresql.org/pgsql-patches/2008-04/msg00417.php any one working it... Yes, I'm working on it hi, any work on it? may i help? -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Guayaquil - Ecuador Cel. (593) 87171157 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IN vs EXISTS equivalence
Decibel! [EMAIL PROTECTED] writes: On Aug 8, 2008, at 3:23 PM, Tom Lane wrote: * has no set operations (UNION etc), grouping, set-returning functions in the SELECT list, LIMIT, or a few other funny cases Couldn't union/union all be treated as EXISTS(a) OR EXISTS(b) Kind of confused by what you mean here. Can you give an example? The usual transformation to consider with UNION is to transform SELECT ... WHERE x OR y into SELECT ... WHERE x UNION ALL SELECT ... WHERE y AND NOT x (modulo handling NULLs properly) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] autovacuum and TOAST tables
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Note that this patch allows a toast table to be vacuumed by the user: I don't have a problem with that, but if anyone thinks this is not a good idea, please speak up. The permissions on pg_toast will prevent anyone but a superuser from doing that anyway, so it's no big deal. Possibly more interesting is what happens if someone drops the parent table while VACUUM is working independently on the toast table. Does DROP take exclusive lock on a toast table? Probably, but it needs to be checked. Yes, it does. So the autovacuum process working on the TOAST table would get cancelled by the DROP TABLE, TRUNCATE, CLUSTER. The one ALTER TABLE variant that I think needs to handle the TOAST table is ALTER TYPE, but I think it should work that it is being vacuumed concurrently. REINDEX TABLE should perhaps also be concerned because it does reindex the toast table, but it grabs the lock before actually doing the reindexing so I don't think there's a problem here. BTW only now I notice that CLUSTER leaves the toast table name in bad shape: if you create a table with OID X its TOAST table is named pg_toast_X. If you then cluster this table, a new transient table gets created with OID Y; the TOAST table for Y is named pg_toast_Y, and then this new TOAST table is used as the new TOAST table for the original table X. So you end up with table OID X having TOAST table pg_toast_Y. This is not a concern from the system standpoint because it doesn't use this name for anything, but people looking at the catalogs manually may be taken by surprise. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] autovacuum and TOAST tables
Alvaro Herrera [EMAIL PROTECTED] writes: BTW only now I notice that CLUSTER leaves the toast table name in bad shape: if you create a table with OID X its TOAST table is named pg_toast_X. If you then cluster this table, a new transient table gets created with OID Y; the TOAST table for Y is named pg_toast_Y, and then this new TOAST table is used as the new TOAST table for the original table X. So you end up with table OID X having TOAST table pg_toast_Y. Hmm, we could probably fix that if we made the cluster operation swap the physical storage of the two toast tables, rather than swapping the tables altogether. I agree it's not critical but it could be confusing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] autovacuum and TOAST tables
I wrote: Hmm, we could probably fix that if we made the cluster operation swap the physical storage of the two toast tables, rather than swapping the tables altogether. I agree it's not critical but it could be confusing. On second thought, I think it *could* lead to a visible failure. Suppose the OID counter wraps around and the OID that had been used for the temporary CLUSTER table gets assigned to a new table. If that table needs a toast table, it'll try to create one using the name that is already in use. We have defenses against picking an OID that's in use, but none for toast table names. So I think it's indeed worth fixing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] autovacuum and TOAST tables
Tom Lane wrote: On second thought, I think it *could* lead to a visible failure. Suppose the OID counter wraps around and the OID that had been used for the temporary CLUSTER table gets assigned to a new table. If that table needs a toast table, it'll try to create one using the name that is already in use. We have defenses against picking an OID that's in use, but none for toast table names. So I think it's indeed worth fixing. Okay, I'll see to it after committing this autovacuum stuff. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Mini improvement: statement_cost_limit
On Aug 4, 2008, at 1:04 PM, daveg wrote: Ok, that is a different use case where an error seems very useful. What about slightly extending the proposal to have the severity of exceeding the limit configurable too. Something like: costestimate_limit = 10 # default 0 to ignore limit costestimate_limit_severity = error # debug, notice, warning, error I very much like this idea, and I would definitely use something like this on our production oltp app. We had a case recently where a query joining two large tables was very fast 99.9% of the time (i.e., a few ms), but for particular, rare key combinations the planner would make a poor choice turning into a multi-minute monster. It ran longer than the web server timeout, and the client was programmed to retry on error, essentially causing a database DoS. The monster version of the plan had an outrageous cost estimate, many orders of magnitude higher than any regular app query, and would be easy to peg using even a crudely chosen limit value. The problem was first mitigated by setting a query timeout a little longer than the web server timeout (since the query results are discarded for anything running longer), but even this was not a solution, since the client would retry on timeout, still keeping the db too busy. The real solution was to not do the query, but it would have been better to identify this via ERRORs in the logs than by the database becoming saturated in the middle of the day. For our application it is far better for an expensive query to be rejected outright than to attempt to run it in vain. Just thought I'd throw that out as anecdotal support. -Casey -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IN vs EXISTS equivalence
Decibel! [EMAIL PROTECTED] writes: On Aug 8, 2008, at 3:23 PM, Tom Lane wrote: * has no set operations (UNION etc), grouping, set-returning functions in the SELECT list, LIMIT, or a few other funny cases Couldn't union/union all be treated as EXISTS(a) OR EXISTS(b) Perhaps, but that would end up defeating the optimization anyway, because as soon as the EXISTS is underneath an OR, it's no longer representing a potential join clause. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Plans for 8.4
Added to TODO: * Add 'hostgss' pg_hba.conf option to allow GSS link-level encryption http://archives.postgresql.org/pgsql-hackers/2008-07/msg01454.php --- Henry B. Hotz wrote: What's the time frame for 8.4? I'm making no promises, but what would people think of a hostgss hba option? Using it would imply the gssapi/sspi authentication option. It would be mutually exclusive of the ssl link-encryption option. It would support strong encryption of the whole connection without the need to get X509 certs deployed (which would be a big win if you're using gssapi/sspi authentication anyway). The thing that prevented me from including it in the gssapi patches I did for 8.3 was that I couldn't disentangle the program logic to the point of inserting the gssapi security layer code above the SSL code and below everything else. I'm thinking that doing both is pretty much an edge case, so I propose to do gssapi security layers instead of SSL. The mods are a lot more obvious. I'm *NOT* proposing to make build support of gssapi security layers exclusive of SSL. You might, for example, configure a server to support username/password over SSL for intra-net addresses, but support gssapi for Internet addresses. -- The opinions expressed in this message are mine, not those of Caltech, JPL, NASA, or the US Government. [EMAIL PROTECTED], or [EMAIL PROTECTED] -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian [EMAIL PROTECTED]http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Plugin system like Firefox
Matthew T. O'Connor wrote: A few random thoughts... The application that comes to mind first for me when you talk plugins is Firefox. They make it very easy to browse for plugins and to install, update, remove them. Their plug-in system also tries to account for Firefox version and OS platform which we would need to do also. Perhaps one thing that would help PostgreSQL plug-ins is a nice GUI plug-in browser and management application. The logical place to add this IMHO is PGAdmin since it is GUI, already talks to the DB and is cross platform. I'm not saying a GUI should be required to manage plug-ins, a fully CLI option should be made available too. I feel the above comment about Firefox is the crux of the plugin issue. Having per-operating system options for each plugin is never going to scale. Having users compile/install these plugins works, but the effort required discourages their use, and the fewer people that try them the fewer people use them and contribute back to the Postgres plugin ecosystem. This is similar to the problem of installing Postgres before we had per-OS installs and Postgres software bundles like the Win32 installer and Postgres Plus. The Firefox plugins are successful because it is easy to install stuff. I run Ubuntu on my laptop so I can easily install software --- I can compile/install software from source, but I prefer to use the Ubuntu software install system so I can concentrate on my major task. So, ideally, if we do a plug-in system, I think we need some way to have these plugins be very easily installed, perhaps by choosing object files pre-compile by the build farm for each operating system. -- Bruce Momjian [EMAIL PROTECTED]http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pltcl_*mod commands are broken on Solaris 10
Zdenek Kotala wrote: Tom Lane napsal(a): Zdenek Kotala [EMAIL PROTECTED] writes: I understand. However I have another dumb idea/question - It seems to me that it is client code. I think that it should be integrated into psql command. That doesn't seem like a particularly appropriate thing to do ... nor do I see the argument for calling it client-side code. I think that best thing at this moment is to add item to the TODO list about cleanup. I can add a TODO item but I am unsure anyone really cares --- seeing how long it took us to realize the poor state of the code. -- Bruce Momjian [EMAIL PROTECTED]http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: UTF8 to_ascii function
2008/8/11 Peter Eisentraut [EMAIL PROTECTED]: On Monday 11 August 2008 16:23:29 Jan Urbański wrote: Often clients want their searches to be accented-or-language-specific letters insensitive. So searching for 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of such facility made me consider not upgrading particular client to 8.3...). These are valid ideas, but then please design a new function that addresses your use case in a well-defined way, and don't overload questionable old interfaces for new purposes. In the Unicode standard you can find well-defined methods to decompose characters into diacritic marks, and then you could strip them off. But this has nothing to do with ASCII or UTF8 or encodings. Cyrillic characters can have diacritic marks as well, for example. Hi Peter, changes to_ascii from text to bytea is more bugfix than new feature and should be done immediately. Correct conversions are related to colum collation - is not necessary repeat same work and same code from some unicode libs. Regards Pavel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers