Re: [GENERAL] Need magic to clean strings from unconvertible UTF8

2010-11-08 Thread Dimitri Fontaine
Andreas  writes:
> I can find the problematic rows.
> How could I delete every char in a string that can't be converted to
> WIN1252?

  http://tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_1.html
  http://tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_2.html

That's using an hand-crafted translate expression, you could also use
the recode library that does a pretty good job. Maybe the easiest way
here would be using some plpythonu procedure using librecode?

  http://packages.debian.org/sid/python-bibtex

Well or the same in plperl… or even easier, process the source files
before importing them?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Need magic to clean strings from unconvertible UTF8

2010-11-07 Thread Andreas

Am 07.11.2010 06:54, schrieb John R Pierce:

On 11/06/10 9:35 PM, Andreas wrote:

somehow there have unconvertible characters sneaked into my DB.
Very probaply they came in via Imports from MS-Access.

Access doesn't complain but when I try to export stuff with pgAdmin 
to csv I get an error that some char is not representable in the 
local charset.


I can find the problematic rows.
How could I delete every char in a string that can't be converted to 
WIN1252?



One idea that comes to my mind  issue a

SET CLIENT_ENCODING 'C';

then find and fix any problems with SQL. The C aka Posix encoding 
lets you directly manipulate the characters as binary.


or set the client_encoding to whatever the database encoding is, and 
find the characters that you know aren't compatible with WIN1252 and 
change them


Actually that's the problem.
How would I delete everything, that is not win1252 compatible.
I'm certain those are junk in my case so I'd rather get rid of them and 
not convert them to anything.
In some cases they even are illegal UTF8 codes that must have been 
created on the way when the data was transferred between a couple of 
file formats among which was excel and who knows what.


As I said, I can find the rows with such junk but I don't know how to 
clean up the text-fields without doing it manually.



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Need magic to clean strings from unconvertible UTF8

2010-11-06 Thread John R Pierce

On 11/06/10 9:35 PM, Andreas wrote:

Hi,

somehow there have unconvertible characters sneaked into my DB.
Very probaply they came in via Imports from MS-Access.

Access doesn't complain but when I try to export stuff with pgAdmin to 
csv I get an error that some char is not representable in the local 
charset.


I can find the problematic rows.
How could I delete every char in a string that can't be converted to 
WIN1252?



One idea that comes to my mind  issue a

SET CLIENT_ENCODING 'C';

then find and fix any problems with SQL. The C aka Posix encoding 
lets you directly manipulate the characters as binary.


or set the client_encoding to whatever the database encoding is, and 
find the characters that you know aren't compatible with WIN1252 and 
change them





--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general