Re: [GENERAL] How to convert HEX to ASCII?

2011-12-23 Thread Pavel Stehule
Hello all

just note

9.1 will have a bytea_agg aggregate

regards

Pavel Stehule

2011/12/2 Marti Raudsepp ma...@juffo.org:
 On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
 f...@meisterderspiele.de wrote:
 But i clearly have a missunderstanding of other chars, like umlauts or utf-8
 chars. This, for example, should return a 'ö':

 # SELECT chr(x'C3B6'::int);
  chr
 -
  쎶
 (1 row)

 That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
 and actually decodes to the codepoint 00F6.

 There is a fundamental problem that a decoded URL may actually be a
 binary string -- it might not have a textual representation at all.
 But if text is what you want, RFC3986 strongly suggests using UTF-8
 for encoding text strings in URLs, and that works almost always in the
 real world.

 So the *right* way is to first convert the URL to a binary bytea
 type by fixing all the % escapes, then convert that to UTF-8 encoding
 to handle multibyte characters.

 What I came up with is far from elegant because PostgreSQL lacks
 convenient functions for bytea manipulation (no bytea_agg, etc).
 Stealing a little from Merlin, this is what it looks like:

 CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
 LANGUAGE plpgsql IMMUTABLE STRICT AS $$
 DECLARE
  bin bytea = '';
  byte text;
 BEGIN
  FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
    IF length(byte) = 3 THEN
      bin = bin || decode(substring(byte, 2, 2), 'hex');
    ELSE
      bin = bin || byte::bytea;
    END IF;
  END LOOP;
  RETURN convert_from(bin, 'utf8');
 END
 $$;

 db=# select url_decode('Hell%C3%B6%20World%21');
  url_decode
 --
  Hellö World!

 db=# select url_decode('%EC%8E%B6');
  url_decode
 
  쎶

 This will break for binary-encoded data in URLs, though.
 db=# select url_decode('%fa%fa%fa');
 ERROR:  invalid byte sequence for encoding UTF8: 0xfa
 CONTEXT:  PL/pgSQL function url_decode line 13 at RETURN

 

 On Fri, Dec 2, 2011 at 17:46, Merlin Moncure mmonc...@gmail.com wrote:
 set client_encoding to latin1;

 postgres=# select unencode('Hell%C3%B6%20World%21');
   unencode
 ---
  Hellö World!
 (1 row)

 Sorry, but AFAICT this makes a mess of encodings and only works by
 pure luck. The server thinks it's sending the client LATIN1 text, but
 it's actually UTF8-encoded and the last decoding step is done by your
 terminal.

 Regards,
 Marti

 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to convert HEX to ASCII?

2011-12-03 Thread Torsten Zühlsdorff

Marti Raudsepp schrieb:

On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
f...@meisterderspiele.de wrote:

But i clearly have a missunderstanding of other chars, like umlauts or utf-8
chars. This, for example, should return a 'ö':

# SELECT chr(x'C3B6'::int);
 chr
-
 쎶
(1 row)


That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
and actually decodes to the codepoint 00F6.

There is a fundamental problem that a decoded URL may actually be a
binary string -- it might not have a textual representation at all.
But if text is what you want, RFC3986 strongly suggests using UTF-8
for encoding text strings in URLs, and that works almost always in the
real world.


Text is what i want. :) I've created a highly specialiced CMS, which 
handle a bunch of big sites (in meaning of a great numbers of users and 
content). It has a build-in traffic-analyze and with this function it 
creates a real time analyze of the keywords, a user used to find the 
sites in search engines. This is very needful if you try to do SEO for 
websites with more than 20.000 unique content-pages. :)



CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
  bin bytea = '';
  byte text;
BEGIN
  FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
IF length(byte) = 3 THEN
  bin = bin || decode(substring(byte, 2, 2), 'hex');
ELSE
  bin = bin || byte::bytea;
END IF;
  END LOOP;
  RETURN convert_from(bin, 'utf8');
END
$$;


Hey, this function looks similar to my encoding function :) Thank you 
very munch!



This will break for binary-encoded data in URLs, though.


Thats no problem, i just have text.

Big thanks to all of you,
Torsten

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] How to convert HEX to ASCII?

2011-12-02 Thread Torsten Zuehlsdorff

Hello,

after several attempts I have finally succeeded in developing a 
urlencode() function to encode text correctly like defined in RFC 1738.


Now i have a big problem: how to decode the text?

Example:
# SELECT urlencode('Hellö World!');
   urlencode
---
 Hell%C3%B6%20World%21

Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?

Thanks for your help and greetings from Germany,
Torsten
--
http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8 
verschiedenen Datenbanksystemen abstrahiert,
Queries von Applikationen trennt und automatisch die Query-Ergebnisse 
auswerten kann.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to convert HEX to ASCII?

2011-12-02 Thread Damien Churchill
On 2 December 2011 13:18, Torsten Zuehlsdorff f...@meisterderspiele.de wrote:
 Hello,

 after several attempts I have finally succeeded in developing a urlencode()
 function to encode text correctly like defined in RFC 1738.

 Now i have a big problem: how to decode the text?

 Example:
 # SELECT urlencode('Hellö World!');
       urlencode
 ---
  Hell%C3%B6%20World%21

 Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?


I've extracted the unquote method [0] from urllib in the python stdlib
that decodes urlencoded strings. Hopefully be some use!

[0] http://pastie.org/2954968

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to convert HEX to ASCII?

2011-12-02 Thread Torsten Zuehlsdorff

Damien Churchill schrieb:


after several attempts I have finally succeeded in developing a urlencode()
function to encode text correctly like defined in RFC 1738.

Now i have a big problem: how to decode the text?

Example:
# SELECT urlencode('Hellö World!');
  urlencode
---
 Hell%C3%B6%20World%21

Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?


I've extracted the unquote method [0] from urllib in the python stdlib
that decodes urlencoded strings. Hopefully be some use!


Not directly, but it gives me some helpful hints. For example i'm now 
able to decode some basic characters, for example:


# SELECT chr(x'21'::int);
 chr
-
 !
(1 row)

But i clearly have a missunderstanding of other chars, like umlauts or 
utf-8 chars. This, for example, should return a 'ö':


# SELECT chr(x'C3B6'::int);
 chr
-
 쎶
(1 row)

Also i'm not sure how to figure out, when to decode '%C3' and when to 
decode '%C3%B6'.


Thanks for your help,
Torsten
--
http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8 
verschiedenen Datenbanksystemen abstrahiert,
Queries von Applikationen trennt und automatisch die Query-Ergebnisse 
auswerten kann.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to convert HEX to ASCII?

2011-12-02 Thread Merlin Moncure
On Fri, Dec 2, 2011 at 8:16 AM, Torsten Zuehlsdorff
f...@meisterderspiele.de wrote:
 Damien Churchill schrieb:


 after several attempts I have finally succeeded in developing a
 urlencode()
 function to encode text correctly like defined in RFC 1738.

 Now i have a big problem: how to decode the text?

 Example:
 # SELECT urlencode('Hellö World!');
      urlencode
 ---
  Hell%C3%B6%20World%21

 Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?


 I've extracted the unquote method [0] from urllib in the python stdlib
 that decodes urlencoded strings. Hopefully be some use!


 Not directly, but it gives me some helpful hints. For example i'm now able
 to decode some basic characters, for example:

 # SELECT chr(x'21'::int);
  chr
 -
  !
 (1 row)

 But i clearly have a missunderstanding of other chars, like umlauts or utf-8
 chars. This, for example, should return a 'ö':

 # SELECT chr(x'C3B6'::int);
  chr
 -
  쎶
 (1 row)

 Also i'm not sure how to figure out, when to decode '%C3' and when to decode
 '%C3%B6'.

 Thanks for your help,

You're welcome.  get ready for some seriously abusive sql:

create or replace function unencode(text) returns text as
$$
with q as
(
  select (regexp_matches($1, '(%..|.)', 'g'))[1] as v
)
select string_agg(case when length(v) = 3 then chr(replace(v, '%',
'x')::bit(8)::int) else v end, '') from q;
$$ language sql immutable;

set client_encoding to latin1;
SET

postgres=# select unencode('Hell%C3%B6%20World%21');
   unencode
---
 Hellö World!
(1 row)

Time: 1.908 ms

(maybe this isn't really an immutable function, but oh well).

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to convert HEX to ASCII?

2011-12-02 Thread Marti Raudsepp
On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
f...@meisterderspiele.de wrote:
 But i clearly have a missunderstanding of other chars, like umlauts or utf-8
 chars. This, for example, should return a 'ö':

 # SELECT chr(x'C3B6'::int);
  chr
 -
  쎶
 (1 row)

That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
and actually decodes to the codepoint 00F6.

There is a fundamental problem that a decoded URL may actually be a
binary string -- it might not have a textual representation at all.
But if text is what you want, RFC3986 strongly suggests using UTF-8
for encoding text strings in URLs, and that works almost always in the
real world.

So the *right* way is to first convert the URL to a binary bytea
type by fixing all the % escapes, then convert that to UTF-8 encoding
to handle multibyte characters.

What I came up with is far from elegant because PostgreSQL lacks
convenient functions for bytea manipulation (no bytea_agg, etc).
Stealing a little from Merlin, this is what it looks like:

CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
  bin bytea = '';
  byte text;
BEGIN
  FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
IF length(byte) = 3 THEN
  bin = bin || decode(substring(byte, 2, 2), 'hex');
ELSE
  bin = bin || byte::bytea;
END IF;
  END LOOP;
  RETURN convert_from(bin, 'utf8');
END
$$;

db=# select url_decode('Hell%C3%B6%20World%21');
  url_decode
--
 Hellö World!

db=# select url_decode('%EC%8E%B6');
 url_decode

 쎶

This will break for binary-encoded data in URLs, though.
db=# select url_decode('%fa%fa%fa');
ERROR:  invalid byte sequence for encoding UTF8: 0xfa
CONTEXT:  PL/pgSQL function url_decode line 13 at RETURN



On Fri, Dec 2, 2011 at 17:46, Merlin Moncure mmonc...@gmail.com wrote:
 set client_encoding to latin1;

 postgres=# select unencode('Hell%C3%B6%20World%21');
   unencode
 ---
  Hellö World!
 (1 row)

Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.

Regards,
Marti

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to convert HEX to ASCII?

2011-12-02 Thread Merlin Moncure
On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp ma...@juffo.org wrote:
 Sorry, but AFAICT this makes a mess of encodings and only works by
 pure luck. The server thinks it's sending the client LATIN1 text, but
 it's actually UTF8-encoded and the last decoding step is done by your
 terminal.

yup -- your're right -- what a coincidence!  I still prefer the 1
liner sql variant vs plpgsql loop though.  nicely done.

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to convert HEX to ASCII?

2011-12-02 Thread Pavel Stehule
2011/12/2 Merlin Moncure mmonc...@gmail.com:
 On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp ma...@juffo.org wrote:
 Sorry, but AFAICT this makes a mess of encodings and only works by
 pure luck. The server thinks it's sending the client LATIN1 text, but
 it's actually UTF8-encoded and the last decoding step is done by your
 terminal.

 yup -- your're right -- what a coincidence!  I still prefer the 1
 liner sql variant vs plpgsql loop though.  nicely done.

so bytea_agg - one param aggregate has sense

it's very easy to implement it

Pavel


 merlin

 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to convert HEX to ASCII?

2011-12-02 Thread Merlin Moncure
On Fri, Dec 2, 2011 at 11:15 AM, Pavel Stehule pavel.steh...@gmail.com wrote:
 2011/12/2 Merlin Moncure mmonc...@gmail.com:
 On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp ma...@juffo.org wrote:
 Sorry, but AFAICT this makes a mess of encodings and only works by
 pure luck. The server thinks it's sending the client LATIN1 text, but
 it's actually UTF8-encoded and the last decoding step is done by your
 terminal.

 yup -- your're right -- what a coincidence!  I still prefer the 1
 liner sql variant vs plpgsql loop though.  nicely done.

 so bytea_agg - one param aggregate has sense

 it's very easy to implement it

yup:

create aggregate bytea_agg (bytea)
(
  sfunc=byteacat,
  stype=bytea
);

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to convert HEX to ASCII?

2011-12-02 Thread Pavel Stehule
2011/12/2 Merlin Moncure mmonc...@gmail.com:
 On Fri, Dec 2, 2011 at 11:15 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 2011/12/2 Merlin Moncure mmonc...@gmail.com:
 On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp ma...@juffo.org wrote:
 Sorry, but AFAICT this makes a mess of encodings and only works by
 pure luck. The server thinks it's sending the client LATIN1 text, but
 it's actually UTF8-encoded and the last decoding step is done by your
 terminal.

 yup -- your're right -- what a coincidence!  I still prefer the 1
 liner sql variant vs plpgsql loop though.  nicely done.

 so bytea_agg - one param aggregate has sense

 it's very easy to implement it

 yup:

 create aggregate bytea_agg (bytea)
 (
  sfunc=byteacat,
  stype=bytea
 );

this is workaround :)

without a memory preallocating it has same speed like cycle in plpgsql.

Regards

Pavel


 merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to convert HEX to ASCII?

2011-12-02 Thread Merlin Moncure
On Fri, Dec 2, 2011 at 2:01 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
 so bytea_agg - one param aggregate has sense

 it's very easy to implement it

 yup:

 create aggregate bytea_agg (bytea)
 (
  sfunc=byteacat,
  stype=bytea
 );

 this is workaround :)

 without a memory preallocating it has same speed like cycle in plpgsql.

sure, but I prefer to code against the workaround because it's cleaner
and it makes things easier to port over when such a feature makes it
into core.  also, one liner sql has better chance of inlining as a
general rule.

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to convert HEX to ASCII?

2011-12-02 Thread Pavel Stehule
2011/12/2 Merlin Moncure mmonc...@gmail.com:
 On Fri, Dec 2, 2011 at 2:01 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
 so bytea_agg - one param aggregate has sense

 it's very easy to implement it

 yup:

 create aggregate bytea_agg (bytea)
 (
  sfunc=byteacat,
  stype=bytea
 );

 this is workaround :)

 without a memory preallocating it has same speed like cycle in plpgsql.

 sure, but I prefer to code against the workaround because it's cleaner
 and it makes things easier to port over when such a feature makes it
 into core.  also, one liner sql has better chance of inlining as a
 general rule.

ook

Pavel


 merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general