Re: [GENERAL] How to convert HEX to ASCII?
Hello all just note 9.1 will have a bytea_agg aggregate regards Pavel Stehule 2011/12/2 Marti Raudsepp ma...@juffo.org: On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff f...@meisterderspiele.de wrote: But i clearly have a missunderstanding of other chars, like umlauts or utf-8 chars. This, for example, should return a 'ö': # SELECT chr(x'C3B6'::int); chr - 쎶 (1 row) That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded and actually decodes to the codepoint 00F6. There is a fundamental problem that a decoded URL may actually be a binary string -- it might not have a textual representation at all. But if text is what you want, RFC3986 strongly suggests using UTF-8 for encoding text strings in URLs, and that works almost always in the real world. So the *right* way is to first convert the URL to a binary bytea type by fixing all the % escapes, then convert that to UTF-8 encoding to handle multibyte characters. What I came up with is far from elegant because PostgreSQL lacks convenient functions for bytea manipulation (no bytea_agg, etc). Stealing a little from Merlin, this is what it looks like: CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text LANGUAGE plpgsql IMMUTABLE STRICT AS $$ DECLARE bin bytea = ''; byte text; BEGIN FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP IF length(byte) = 3 THEN bin = bin || decode(substring(byte, 2, 2), 'hex'); ELSE bin = bin || byte::bytea; END IF; END LOOP; RETURN convert_from(bin, 'utf8'); END $$; db=# select url_decode('Hell%C3%B6%20World%21'); url_decode -- Hellö World! db=# select url_decode('%EC%8E%B6'); url_decode 쎶 This will break for binary-encoded data in URLs, though. db=# select url_decode('%fa%fa%fa'); ERROR: invalid byte sequence for encoding UTF8: 0xfa CONTEXT: PL/pgSQL function url_decode line 13 at RETURN On Fri, Dec 2, 2011 at 17:46, Merlin Moncure mmonc...@gmail.com wrote: set client_encoding to latin1; postgres=# select unencode('Hell%C3%B6%20World%21'); unencode --- Hellö World! (1 row) Sorry, but AFAICT this makes a mess of encodings and only works by pure luck. The server thinks it's sending the client LATIN1 text, but it's actually UTF8-encoded and the last decoding step is done by your terminal. Regards, Marti -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] How to convert HEX to ASCII?
Marti Raudsepp schrieb: On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff f...@meisterderspiele.de wrote: But i clearly have a missunderstanding of other chars, like umlauts or utf-8 chars. This, for example, should return a 'ö': # SELECT chr(x'C3B6'::int); chr - 쎶 (1 row) That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded and actually decodes to the codepoint 00F6. There is a fundamental problem that a decoded URL may actually be a binary string -- it might not have a textual representation at all. But if text is what you want, RFC3986 strongly suggests using UTF-8 for encoding text strings in URLs, and that works almost always in the real world. Text is what i want. :) I've created a highly specialiced CMS, which handle a bunch of big sites (in meaning of a great numbers of users and content). It has a build-in traffic-analyze and with this function it creates a real time analyze of the keywords, a user used to find the sites in search engines. This is very needful if you try to do SEO for websites with more than 20.000 unique content-pages. :) CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text LANGUAGE plpgsql IMMUTABLE STRICT AS $$ DECLARE bin bytea = ''; byte text; BEGIN FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP IF length(byte) = 3 THEN bin = bin || decode(substring(byte, 2, 2), 'hex'); ELSE bin = bin || byte::bytea; END IF; END LOOP; RETURN convert_from(bin, 'utf8'); END $$; Hey, this function looks similar to my encoding function :) Thank you very munch! This will break for binary-encoded data in URLs, though. Thats no problem, i just have text. Big thanks to all of you, Torsten -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
[GENERAL] How to convert HEX to ASCII?
Hello, after several attempts I have finally succeeded in developing a urlencode() function to encode text correctly like defined in RFC 1738. Now i have a big problem: how to decode the text? Example: # SELECT urlencode('Hellö World!'); urlencode --- Hell%C3%B6%20World%21 Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'? Thanks for your help and greetings from Germany, Torsten -- http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8 verschiedenen Datenbanksystemen abstrahiert, Queries von Applikationen trennt und automatisch die Query-Ergebnisse auswerten kann. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] How to convert HEX to ASCII?
On 2 December 2011 13:18, Torsten Zuehlsdorff f...@meisterderspiele.de wrote: Hello, after several attempts I have finally succeeded in developing a urlencode() function to encode text correctly like defined in RFC 1738. Now i have a big problem: how to decode the text? Example: # SELECT urlencode('Hellö World!'); urlencode --- Hell%C3%B6%20World%21 Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'? I've extracted the unquote method [0] from urllib in the python stdlib that decodes urlencoded strings. Hopefully be some use! [0] http://pastie.org/2954968 -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] How to convert HEX to ASCII?
Damien Churchill schrieb: after several attempts I have finally succeeded in developing a urlencode() function to encode text correctly like defined in RFC 1738. Now i have a big problem: how to decode the text? Example: # SELECT urlencode('Hellö World!'); urlencode --- Hell%C3%B6%20World%21 Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'? I've extracted the unquote method [0] from urllib in the python stdlib that decodes urlencoded strings. Hopefully be some use! Not directly, but it gives me some helpful hints. For example i'm now able to decode some basic characters, for example: # SELECT chr(x'21'::int); chr - ! (1 row) But i clearly have a missunderstanding of other chars, like umlauts or utf-8 chars. This, for example, should return a 'ö': # SELECT chr(x'C3B6'::int); chr - 쎶 (1 row) Also i'm not sure how to figure out, when to decode '%C3' and when to decode '%C3%B6'. Thanks for your help, Torsten -- http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8 verschiedenen Datenbanksystemen abstrahiert, Queries von Applikationen trennt und automatisch die Query-Ergebnisse auswerten kann. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] How to convert HEX to ASCII?
On Fri, Dec 2, 2011 at 8:16 AM, Torsten Zuehlsdorff f...@meisterderspiele.de wrote: Damien Churchill schrieb: after several attempts I have finally succeeded in developing a urlencode() function to encode text correctly like defined in RFC 1738. Now i have a big problem: how to decode the text? Example: # SELECT urlencode('Hellö World!'); urlencode --- Hell%C3%B6%20World%21 Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'? I've extracted the unquote method [0] from urllib in the python stdlib that decodes urlencoded strings. Hopefully be some use! Not directly, but it gives me some helpful hints. For example i'm now able to decode some basic characters, for example: # SELECT chr(x'21'::int); chr - ! (1 row) But i clearly have a missunderstanding of other chars, like umlauts or utf-8 chars. This, for example, should return a 'ö': # SELECT chr(x'C3B6'::int); chr - 쎶 (1 row) Also i'm not sure how to figure out, when to decode '%C3' and when to decode '%C3%B6'. Thanks for your help, You're welcome. get ready for some seriously abusive sql: create or replace function unencode(text) returns text as $$ with q as ( select (regexp_matches($1, '(%..|.)', 'g'))[1] as v ) select string_agg(case when length(v) = 3 then chr(replace(v, '%', 'x')::bit(8)::int) else v end, '') from q; $$ language sql immutable; set client_encoding to latin1; SET postgres=# select unencode('Hell%C3%B6%20World%21'); unencode --- Hellö World! (1 row) Time: 1.908 ms (maybe this isn't really an immutable function, but oh well). merlin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] How to convert HEX to ASCII?
On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff f...@meisterderspiele.de wrote: But i clearly have a missunderstanding of other chars, like umlauts or utf-8 chars. This, for example, should return a 'ö': # SELECT chr(x'C3B6'::int); chr - 쎶 (1 row) That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded and actually decodes to the codepoint 00F6. There is a fundamental problem that a decoded URL may actually be a binary string -- it might not have a textual representation at all. But if text is what you want, RFC3986 strongly suggests using UTF-8 for encoding text strings in URLs, and that works almost always in the real world. So the *right* way is to first convert the URL to a binary bytea type by fixing all the % escapes, then convert that to UTF-8 encoding to handle multibyte characters. What I came up with is far from elegant because PostgreSQL lacks convenient functions for bytea manipulation (no bytea_agg, etc). Stealing a little from Merlin, this is what it looks like: CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text LANGUAGE plpgsql IMMUTABLE STRICT AS $$ DECLARE bin bytea = ''; byte text; BEGIN FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP IF length(byte) = 3 THEN bin = bin || decode(substring(byte, 2, 2), 'hex'); ELSE bin = bin || byte::bytea; END IF; END LOOP; RETURN convert_from(bin, 'utf8'); END $$; db=# select url_decode('Hell%C3%B6%20World%21'); url_decode -- Hellö World! db=# select url_decode('%EC%8E%B6'); url_decode 쎶 This will break for binary-encoded data in URLs, though. db=# select url_decode('%fa%fa%fa'); ERROR: invalid byte sequence for encoding UTF8: 0xfa CONTEXT: PL/pgSQL function url_decode line 13 at RETURN On Fri, Dec 2, 2011 at 17:46, Merlin Moncure mmonc...@gmail.com wrote: set client_encoding to latin1; postgres=# select unencode('Hell%C3%B6%20World%21'); unencode --- Hellö World! (1 row) Sorry, but AFAICT this makes a mess of encodings and only works by pure luck. The server thinks it's sending the client LATIN1 text, but it's actually UTF8-encoded and the last decoding step is done by your terminal. Regards, Marti -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] How to convert HEX to ASCII?
On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp ma...@juffo.org wrote: Sorry, but AFAICT this makes a mess of encodings and only works by pure luck. The server thinks it's sending the client LATIN1 text, but it's actually UTF8-encoded and the last decoding step is done by your terminal. yup -- your're right -- what a coincidence! I still prefer the 1 liner sql variant vs plpgsql loop though. nicely done. merlin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] How to convert HEX to ASCII?
2011/12/2 Merlin Moncure mmonc...@gmail.com: On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp ma...@juffo.org wrote: Sorry, but AFAICT this makes a mess of encodings and only works by pure luck. The server thinks it's sending the client LATIN1 text, but it's actually UTF8-encoded and the last decoding step is done by your terminal. yup -- your're right -- what a coincidence! I still prefer the 1 liner sql variant vs plpgsql loop though. nicely done. so bytea_agg - one param aggregate has sense it's very easy to implement it Pavel merlin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] How to convert HEX to ASCII?
On Fri, Dec 2, 2011 at 11:15 AM, Pavel Stehule pavel.steh...@gmail.com wrote: 2011/12/2 Merlin Moncure mmonc...@gmail.com: On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp ma...@juffo.org wrote: Sorry, but AFAICT this makes a mess of encodings and only works by pure luck. The server thinks it's sending the client LATIN1 text, but it's actually UTF8-encoded and the last decoding step is done by your terminal. yup -- your're right -- what a coincidence! I still prefer the 1 liner sql variant vs plpgsql loop though. nicely done. so bytea_agg - one param aggregate has sense it's very easy to implement it yup: create aggregate bytea_agg (bytea) ( sfunc=byteacat, stype=bytea ); merlin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] How to convert HEX to ASCII?
2011/12/2 Merlin Moncure mmonc...@gmail.com: On Fri, Dec 2, 2011 at 11:15 AM, Pavel Stehule pavel.steh...@gmail.com wrote: 2011/12/2 Merlin Moncure mmonc...@gmail.com: On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp ma...@juffo.org wrote: Sorry, but AFAICT this makes a mess of encodings and only works by pure luck. The server thinks it's sending the client LATIN1 text, but it's actually UTF8-encoded and the last decoding step is done by your terminal. yup -- your're right -- what a coincidence! I still prefer the 1 liner sql variant vs plpgsql loop though. nicely done. so bytea_agg - one param aggregate has sense it's very easy to implement it yup: create aggregate bytea_agg (bytea) ( sfunc=byteacat, stype=bytea ); this is workaround :) without a memory preallocating it has same speed like cycle in plpgsql. Regards Pavel merlin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] How to convert HEX to ASCII?
On Fri, Dec 2, 2011 at 2:01 PM, Pavel Stehule pavel.steh...@gmail.com wrote: so bytea_agg - one param aggregate has sense it's very easy to implement it yup: create aggregate bytea_agg (bytea) ( sfunc=byteacat, stype=bytea ); this is workaround :) without a memory preallocating it has same speed like cycle in plpgsql. sure, but I prefer to code against the workaround because it's cleaner and it makes things easier to port over when such a feature makes it into core. also, one liner sql has better chance of inlining as a general rule. merlin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] How to convert HEX to ASCII?
2011/12/2 Merlin Moncure mmonc...@gmail.com: On Fri, Dec 2, 2011 at 2:01 PM, Pavel Stehule pavel.steh...@gmail.com wrote: so bytea_agg - one param aggregate has sense it's very easy to implement it yup: create aggregate bytea_agg (bytea) ( sfunc=byteacat, stype=bytea ); this is workaround :) without a memory preallocating it has same speed like cycle in plpgsql. sure, but I prefer to code against the workaround because it's cleaner and it makes things easier to port over when such a feature makes it into core. also, one liner sql has better chance of inlining as a general rule. ook Pavel merlin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general