Re: [GENERAL] How to find freak UTF-8 character?

2011-10-04 Thread Daniele Varrazzo
On Sat, Oct 1, 2011 at 10:16 PM, Leif Biberg Kristensen
l...@solumslekt.org wrote:
 Yes I know that this is a perfectly legal UTF-8
 character. It crept into my database as a result of a copy-and-paste job from
 a web site. The point is that it doesn't have a counterpart in ISO-8859-1 to
 which I regularly have to export the data.

If your database is utf8 and the output latin1, it sounds more robust
to connect to the database with utf8 client encoding, pull data from
the database as unicode and at output time use data.encode('latin1',
'replace') in python to convert data without bombing on non-latin1
chars.

-- Daniele

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to find freak UTF-8 character?

2011-10-03 Thread Andrew Sullivan
On Sat, Oct 01, 2011 at 11:16:06PM +0200, Leif Biberg Kristensen wrote:
 But thank you for the idea, I think that I will strip out at least any lrm; 
 entities from text entered into the database.

If you're getting lrm, you might want to check for ZWJ and ZWNJ code
points too.  They're nasty because by definition they don't display
(ZW stands for Zero Width.  If you ever actually have to spend
time understanding how things got this way in Unicode, you will have
my sympathy).

 By the way, is there a setting in psql that will output unprintable 
 characters 
 as question marks or something?

Not to my knowledge, although I'd expect the terminal driver to have
control over this, no?

A

-- 
Andrew Sullivan
a...@crankycanuck.ca

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to find freak UTF-8 character?

2011-10-02 Thread pasman pasmański
Its simple to remove strange chars  with regex_replace.

2011/10/1, Leif Biberg Kristensen l...@solumslekt.org:
 On Saturday 1. October 2011 21.29.45 Andrew Sullivan wrote:
 I see you found it, but note that it's _not_ a spurious UTF-8
 character: it's a right-to-left mark, ans is a perfectly ok UTF-8 code
 point.

 Andrew,
 thank you for your reply. Yes I know that this is a perfectly legal UTF-8
 character. It crept into my database as a result of a copy-and-paste job
 from
 a web site. The point is that it doesn't have a counterpart in ISO-8859-1 to
 which I regularly have to export the data.

 The offending character came from this URL:
 http://www.soge.kviteseid.no/individual.php?pid=I2914ged=Kviteseid.GEDtab=0

 and the text that I copied and pasted from the page looks like this in the
 source code:

 Aslaug Steinarsdotter Fjågesundnbsp;nbsp;lrm;(I2914)lrm;

 I'm going to write to the webmaster of the site and ask why that character,
 represented in the HTML as the lrm; entity, has to appear in a Norwegian
 web
 site which never should have to display text in anything but left-to-right
 order.

 If you need a subset of the UTF-8 character set, you want to make sure
 you have some sort of constraint in your application or your database
 that prevents insertion of anything at all in UTF-8.  This is a need
 people often forget when working in an internationalized setting,
 because there's a lot of crap that comes from the client side in a
 UTF-8 setting that might not come in other settings (like LATIN1).

 I don't want any constraint of that sort. I'm perfectly happy with UTF-8.
 And
 now that I've found out how to spot problematic characters that will crash
 my
 export script, it's really not an issue anymore. The character didn't print
 neither in psql nor in my PHP frontend, so I just removed the problematic
 text
 and re-entered it by hand. Problem solved.

 But thank you for the idea, I think that I will strip out at least any lrm;
 entities from text entered into the database.

 By the way, is there a setting in psql that will output unprintable
 characters
 as question marks or something?

 regards, Leif.

 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general



-- 

pasman

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to find freak UTF-8 character?

2011-10-02 Thread Leif Biberg Kristensen
On Sunday 2. October 2011 15.53.50 pasman pasmański wrote:
 Its simple to remove strange chars  with regex_replace.

True, but first you have to know how to represent a «strange char» in 
Postgresql :P

It isn't all that obvious, and it's difficult to search for the solution. I 
tried a lot of different search terms in Google, and none of them turned up 
anything near what I needed.

regards, Leif

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to find freak UTF-8 character?

2011-10-02 Thread Cédric Villemain
2011/10/2 Leif Biberg Kristensen l...@solumslekt.org:
 On Sunday 2. October 2011 15.53.50 pasman pasmański wrote:
 Its simple to remove strange chars  with regex_replace.

 True, but first you have to know how to represent a «strange char» in
 Postgresql :P

 It isn't all that obvious, and it's difficult to search for the solution. I
 tried a lot of different search terms in Google, and none of them turned up
 anything near what I needed.

you may have miss this one :
http://tapoueh.org/blog/2010/02/23-getting-out-of-sql_ascii-part-2.html



 regards, Leif

 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general




-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to find freak UTF-8 character?

2011-10-02 Thread Leif Biberg Kristensen
On Sunday 2. October 2011 16.34.27 Cédric Villemain wrote:
 you may have miss this one :
 http://tapoueh.org/blog/2010/02/23-getting-out-of-sql_ascii-part-2.html

That's an, uh, interesting article, but as far as I can see, it doesn't tell 
anything about how to find a perfectly legal three-byte UTF-8 character that 
doesn't have a counterpart in LATIN1, given that all I know about it is its 
hexadecimal value.

I know how to do it now, and I consider the problem solved. Hopefully, this 
thread may help others who stumbles upon the same issue.

regards, Leif

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to find freak UTF-8 character?

2011-10-02 Thread Raymond O'Donnell
On 02/10/2011 15:55, Leif Biberg Kristensen wrote:
 On Sunday 2. October 2011 16.34.27 Cédric Villemain wrote:
 you may have miss this one :
 http://tapoueh.org/blog/2010/02/23-getting-out-of-sql_ascii-part-2.html
 
 That's an, uh, interesting article, but as far as I can see, it doesn't tell 
 anything about how to find a perfectly legal three-byte UTF-8 character that 
 doesn't have a counterpart in LATIN1, given that all I know about it is its 
 hexadecimal value.
 
 I know how to do it now, and I consider the problem solved. Hopefully, this 
 thread may help others who stumbles upon the same issue.

I may have missed it upthread, but if you haven't already would you
consider writing up your solution for the benefit of the archives?

Ray.


-- 
Raymond O'Donnell :: Galway :: Ireland
r...@iol.ie

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to find freak UTF-8 character?

2011-10-02 Thread Leif Biberg Kristensen
On Sunday 2. October 2011 17.54.52 Raymond O'Donnell wrote:
 I may have missed it upthread, but if you haven't already would you
 consider writing up your solution for the benefit of the archives?

I did, in my own first reply to the original message:

SELECT * FROM foo WHERE bar LIKE E'%\xe2\x80\x8e%';

The trick is obviously to escape each byte in the sequence.

Maybe I'll write a blog post about it. It appears to be weakly documented, or 
at least very hard to find. Or maybe it's just me being dense.

regards, Leif.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] How to find freak UTF-8 character?

2011-10-01 Thread Leif Biberg Kristensen
I've somehow introduced a spurious UTF-8 character in my database. When I try 
to export to an application that requires LATIN1 encoding, my export script 
bombs out with this message:

psycopg2.DataError: character 0xe2808e of encoding UTF8 has no equivalent in 
LATIN1

I figure that it should be easy to find the offending character in the database 
by doing a SELECT * FROM foo WHERE bar like '%\0xe2808e%' or something like 
that, but I can't find the correct syntax, I can't find a relevant section in 
the manual, and I can't figure out how to google this problem.

regards, Leif



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to find freak UTF-8 character?

2011-10-01 Thread Leif Biberg Kristensen
On Saturday 1. October 2011 07.55.01 Leif Biberg Kristensen wrote:
 I've somehow introduced a spurious UTF-8 character in my database. When I
 try to export to an application that requires LATIN1 encoding, my export
 script bombs out with this message:
 
 psycopg2.DataError: character 0xe2808e of encoding UTF8 has no equivalent
 in LATIN1

I finally figured it out, with a little help from maatb's unicode database 
(http://vazor.com/unicode/c200E.html):

SELECT * FROM foo WHERE bar LIKE E'%\xe2\x80\x8e%';

regards, Leif


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to find freak UTF-8 character?

2011-10-01 Thread Andrew Sullivan
On Sat, Oct 01, 2011 at 07:55:01AM +0200, Leif Biberg Kristensen wrote:
 I've somehow introduced a spurious UTF-8 character in my database. When I try 
 to export to an application that requires LATIN1 encoding, my export script 
 bombs out with this message:
 
 psycopg2.DataError: character 0xe2808e of encoding UTF8 has no equivalent 
 in 
 LATIN1

I see you found it, but note that it's _not_ a spurious UTF-8
character: it's a right-to-left mark, ans is a perfectly ok UTF-8 code
point.

If you need a subset of the UTF-8 character set, you want to make sure
you have some sort of constraint in your application or your database
that prevents insertion of anything at all in UTF-8.  This is a need
people often forget when working in an internationalized setting,
because there's a lot of crap that comes from the client side in a
UTF-8 setting that might not come in other settings (like LATIN1).

Best,

A

-- 
Andrew Sullivan
a...@crankycanuck.ca

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to find freak UTF-8 character?

2011-10-01 Thread Leif Biberg Kristensen
On Saturday 1. October 2011 21.29.45 Andrew Sullivan wrote:
 I see you found it, but note that it's _not_ a spurious UTF-8
 character: it's a right-to-left mark, ans is a perfectly ok UTF-8 code
 point.

Andrew,
thank you for your reply. Yes I know that this is a perfectly legal UTF-8 
character. It crept into my database as a result of a copy-and-paste job from 
a web site. The point is that it doesn't have a counterpart in ISO-8859-1 to 
which I regularly have to export the data.

The offending character came from this URL: 
http://www.soge.kviteseid.no/individual.php?pid=I2914ged=Kviteseid.GEDtab=0

and the text that I copied and pasted from the page looks like this in the 
source code:

Aslaug Steinarsdotter Fjågesundnbsp;nbsp;lrm;(I2914)lrm;

I'm going to write to the webmaster of the site and ask why that character, 
represented in the HTML as the lrm; entity, has to appear in a Norwegian web 
site which never should have to display text in anything but left-to-right 
order.
 
 If you need a subset of the UTF-8 character set, you want to make sure
 you have some sort of constraint in your application or your database
 that prevents insertion of anything at all in UTF-8.  This is a need
 people often forget when working in an internationalized setting,
 because there's a lot of crap that comes from the client side in a
 UTF-8 setting that might not come in other settings (like LATIN1).

I don't want any constraint of that sort. I'm perfectly happy with UTF-8. And 
now that I've found out how to spot problematic characters that will crash my 
export script, it's really not an issue anymore. The character didn't print 
neither in psql nor in my PHP frontend, so I just removed the problematic text 
and re-entered it by hand. Problem solved.

But thank you for the idea, I think that I will strip out at least any lrm; 
entities from text entered into the database.

By the way, is there a setting in psql that will output unprintable characters 
as question marks or something?

regards, Leif.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general