Re: [GENERAL] text and bytea
Tom Lane wrote: hernan gonzalez [EMAIL PROTECTED] writes: test=# create view vchartest as select encode(convert_to(c,'LATIN9'),'escape') as c1 from chartest; Hmm. This isn't a very sensible combination that you've written here, but I see the point: encode(..., 'escape') is broken in that it fails to convert high-bit-set bytes into \nnn sequences. At least in multibyte backend encodings, we *must* do that to produce valid textual output. I suspect that for consistency we should do it regardless of backend encoding. With Tom's encoding() patch applied I assume there is no TODO item here. -- Bruce Momjian [EMAIL PROTECTED]http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [GENERAL] text and bytea
Umm, I think all you showed was that the to_ascii() function was broken. Postgres knows exactly what encoding the string is in, the backend encoding: in your case UTF-8. That would be fine, if it were true; then, one could assume that every postgresql function that returns a text gets ALWAYS the standard backend encoding (again: as in Java). But consider the result postgresql gets from this (from my example): encode(convert_to(c,'LATIN9'),'escape') That's something of type text (a strign), postgresql believes it's UTF8, but it's not (it probably woud not even validate as a valid utf8 sequence). IMHO, the semantics of encode() and decode() are correct (the bridge between bytea and text ... in the backend encoding; they should be the only bridge), convert() is also ok (deals with bytes), but convert_to() and convert_from() are dubious if not broken: they imply texts in arbitrary encodings (for output or input) , lead to anomalities and shouldnt be necessary at all. Cheers Hernán J. González http://hjg.com.ar/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [GENERAL] text and bytea
hernan gonzalez [EMAIL PROTECTED] writes: IMHO, the semantics of encode() and decode() are correct (the bridge between bytea and text ... in the backend encoding; they should be the only bridge), convert() is also ok (deals with bytes), but convert_to() and convert_from() are dubious if not broken: they imply texts in arbitrary encodings (for output or input) , lead to anomalities and shouldnt be necessary at all. postgres=# \df convert_from List of functions Schema | Name | Result data type | Argument data types +--+--+- pg_catalog | convert_from | text | bytea, name (1 row) postgres=# \df convert_to List of functions Schema |Name| Result data type | Argument data types ++--+- pg_catalog | convert_to | bytea| text, name (1 row) Looks like they produce and consume byteas to me. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services! ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [GENERAL] text and bytea
IMHO, the semantics of encode() and decode() are correct (the bridge between bytea and text ... in the backend encoding; they should be the only bridge), convert() is also ok (deals with bytes), but convert_to() and convert_from() are dubious if not broken: they imply texts in arbitrary encodings (for output or input) , lead to anomalities and shouldnt be necessary at all. Sorry, my mistake. I meant the opposite: convert_to() and convert_from() are the correct bridge (text = bytea) functions. Also convert() is ok. The objetionable ones IMHO are decode()/encode(), which can consume/produce a non-utf8 string (I mean, not the backend encoding) Going back to the line: encode(convert_to(c,'LATIN9'),'escape') Here we have: c = text (ut8) convert_to(..). = bytea (represents a char sequence in latin9 encoding) encode(...) = text (in latin9 encoding?) Cheers Hernán J. González http://hjg.com.ar/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org/
Re: [GENERAL] text and bytea
Another example (Psotgresql 8.3.0, UTF-8 server/client encoding) test=# create table chartest ( c text); test=# insert into chartest (c) values ('¡Hasta mañana!'); test=# create view vchartest as select encode(convert_to(c,'LATIN9'),'escape') as c1 from chartest; test=# select c,octet_length(c) from chartest ; c| octet_length +-- ¡Hasta mañana! | 16 test=# select c1,octet_length(c1) from vchartest ; c1 | octet_length --+-- Hasta maana! | 14 (the field is seen as as text by postgresql, with the default encoding.. UTF8; it is actually not) test=# select * from vchartest where c1 like '%a%'; c1 -- Hasta maana! (1 row) test=# select * from vchartest where c1 ilike '%a%'; ERROR: invalid byte sequence for encoding UTF8: 0xa1 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by client_encoding. That ilike breaks and like is rather random, it seems that the later has some optimization does not check the validty of the utf8 stream. But thats not the point. The point is that IMO postgresql should always handle text in the backend encoding, there should no exists funcions that are designed to produce/consume texts in other encodings. Perhaps the encode function is ill defined, and should be rethinked. Two alternatives: 1. For special binary-to-ascii encodings (base64,hex). Keep its present signature but remove/deprecate the escape type. It returns a text in the backend encoding. 2 For arbitrary binary encodings. Change its signature so that it returns bytea. Of course, all this applies symmetrically to decode(). Appart, and in accordance with this, I think to_ascii() should accept only one text argument. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [GENERAL] text and bytea
hernan gonzalez [EMAIL PROTECTED] writes: The objetionable ones IMHO are decode()/encode(), which can consume/produce a non-utf8 string (I mean, not the backend encoding) Huh? Those deal with bytea too --- in fact, they've got nothing at all to do with multibyte character representations. They're for handling hex and base64 and suchlike representations of binary data. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [GENERAL] text and bytea
hernan gonzalez [EMAIL PROTECTED] writes: test=# create view vchartest as select encode(convert_to(c,'LATIN9'),'escape') as c1 from chartest; Hmm. This isn't a very sensible combination that you've written here, but I see the point: encode(..., 'escape') is broken in that it fails to convert high-bit-set bytes into \nnn sequences. At least in multibyte backend encodings, we *must* do that to produce valid textual output. I suspect that for consistency we should do it regardless of backend encoding. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org/
Re: [GENERAL] text and bytea
It seems to me that postgres is trying to do as you suggest: text is characters and bytea is bytes, like in Java. But the big difference is that, for text type, postgresql knows this is a text but doesnt know the encoding, as my example showed. This goes against the concept of text vs bytes distintion, which per se is very useful and powerful (specially in this Unicode world) and leads to a dubious/clumsy string api (IMHO, as always). You don't indicate what version you are using, this area was rejigged recently. Sorry, I forget to say that my examples are for last version (8.3) Cheers -- Hernán J. González ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [GENERAL] text and bytea
On Fri, Feb 22, 2008 at 01:54:46PM -0200, hernan gonzalez wrote: It seems to me that postgres is trying to do as you suggest: text is characters and bytea is bytes, like in Java. But the big difference is that, for text type, postgresql knows this is a text but doesnt know the encoding, as my example showed. This goes against the concept of text vs bytes distintion, which per se is very useful and powerful (specially in this Unicode world) and leads to a dubious/clumsy string api (IMHO, as always). Umm, I think all you showed was that the to_ascii() function was broken. Postgres knows exactly what encoding the string is in, the backend encoding: in your case UTF-8. Now, it would be nice if postgres could handle other encodings in the backend, but there's no agreement on how to implement that feature so it isn't implemented. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Those who make peaceful revolution impossible will make violent revolution inevitable. -- John F Kennedy signature.asc Description: Digital signature
Re: [GENERAL] text and bytea
On Thu, Feb 21, 2008 at 02:34:15PM -0200, hernan gonzalez wrote: (After dealing a while with this, and learning a little, I though of post this as comment in the docs, but perhaps someone who knows better can correct or clarify) It seems to me that postgres is trying to do as you suggest: text is characters and bytea is bytes, like in Java. You don't indicate what version you are using, this area was rejigged recently. The most surprising this is that to_ascii won't accept a bytea. TBH the whole to_ascii function seems somewhat half-baked. If what you're trying to do is remove accents, there are perl functions around that do that. Basically, the switch to a different normal form then drop all the accent characters. Essentially, Postgres supports UTF-8, but doesn't understand Unicode characters much at all. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Those who make peaceful revolution impossible will make violent revolution inevitable. -- John F Kennedy signature.asc Description: Digital signature
Re: [GENERAL] text and bytea
Martijn van Oosterhout escribió: The most surprising this is that to_ascii won't accept a bytea. TBH the whole to_ascii function seems somewhat half-baked. If what you're trying to do is remove accents, there are perl functions around that do that. Basically, the switch to a different normal form then drop all the accent characters. Yeah, it's been a common suggestion to use convert() in combination with to_ascii on UTF-8 databases, and I didn't notice that the convert() shuffling would take that ability away :-( I don't think requiring plperl is nice however. Perhaps we could get around the problem by using byteaout/textin. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster