Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-08 Thread Oliver Jowett
Madison Kelly wrote:
  Is there a way to store the name in raw binary? 
Yes: bytea.
-O
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faq


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-08 Thread Karsten Hilbert
> There is a table t1(member_id integer primary key, member_name text,
> address text, phone text, email text). I have to reach each member by
> either adress, phone or email. Unfortunately some of address field
> have wrong encoded data. In this case I will use phone or email to
> reach them. 
> 
> Now I need to upgrade to newer PostgreSQL within 1 day. I know I have
> to fix wrong encoded field but it will take more than 1 day. So I
> would like to import the data first then fix wrong encoded field on
> running database since I can reach members by phone or email even with
> wrong encoded address field...
You could add a bytea field in the migration process which
holds the original data and put some "sloppily corrected" data
inside the real field for the time being. Then, once corrected
set the additional bytea field to NULL. Pretty much the
concept of staging tables for data import.

Karsten
-- 
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-08 Thread Karsten Hilbert
>   Is there a way to store the name in raw binary? If so, would this not 
> be safe because to postgresql it should no longer matter what data is or 
> represents, right? Maybe there is a third option I am not yet concidering?
In the backup rename the file and add another file
 .README
which explains the issue, details the steps taken (eg
renaming) and offers some sort of raw binary value of the
original file name.

Karsten
-- 
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-08 Thread John Hansen
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 3:31 PM
> To: John Hansen
> Cc: [EMAIL PROTECTED]; pgsql-general@postgresql.org; 
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
> 
> > Tatsuo Ishii wrote:
> > > Sent: Sunday, May 08, 2005 12:01 PM
> > > To: [EMAIL PROTECTED]
> > > Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
> > > Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
> > > 
> > > We have developed patches which relaxes the character 
> validation so 
> > > that PostgreSQL accepts invalid characters. It works like this:
> > 
> > That is just plain 100% wrong!!
> >
> > Under no circumstances should there be invalid data in a database.
> > And if you're trying to make a database of invalid data, 
> then at least 
> > encode it using a valid encoding.
> > 
> > In fact, I've proposed strengthening the validation 
> routines for UTF-8.
> 
> Actually I myself thought as you are before. Later I found 
> that it was not so good idea. People already have invalid 
> encoded data in their precious database and have very hard 
> time to migrate to newer version of PostgreSQL because of 
> encoding validation.
> 
> Think about this kind of situation:
> 
> There is a table t1(member_id integer primary key, 
> member_name text, address text, phone text, email text). I 
> have to reach each member by either adress, phone or email. 
> Unfortunately some of address field have wrong encoded data. 
> In this case I will use phone or email to reach them. 
> 
> Now I need to upgrade to newer PostgreSQL within 1 day. I 
> know I have to fix wrong encoded field but it will take more 
> than 1 day. So I would like to import the data first then fix 
> wrong encoded field on running database since I can reach 
> members by phone or email even with wrong encoded address field...

Actually would be very simple, create function isvalidutf8(text) in your
preferred language.
C source is available from unicode.org.
Create function converttoutf8(text) using whatever code is required to
transform the _wrong_ encoding (SQL_ASCII -> UTF8 for instance) to
utf-8.
Update table set field=converttoutf8(field) where !isvalidutf8(field);

Now sit back and relax while your invalid data is converted to utf-8.

When done, pg_dump the database, upgrade, and reload.

This should take less than a day.

> I saw this kind of situation in the real world and that's why 
> we developed the patches.
> --
> Tatsuo Ishii
> 
> 

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-08 Thread Tino Wildenhain
Am Sonntag, den 08.05.2005, 14:30 +0900 schrieb Tatsuo Ishii:
...
> Actually I myself thought as you are before. Later I found that it was
> not so good idea. People already have invalid encoded data in their
> precious database and have very hard time to migrate to newer version
> of PostgreSQL because of encoding validation.
...
> Now I need to upgrade to newer PostgreSQL within 1 day. I know I have
> to fix wrong encoded field but it will take more than 1 day. So I
> would like to import the data first then fix wrong encoded field on
> running database since I can reach members by phone or email even with
> wrong encoded address field...

Well, if you are so in a hurry you better not migrate. Postgres is
proud of validating the input and to have no invalid data. So if
you have invalid data, better fix it.

> I saw this kind of situation in the real world and that's why we
> developed the patches.

Why not developing a helper for contrib to help reencoding the
database instead?

Regards
Tino


---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread Tatsuo Ishii
> Tatsuo Ishii wrote:
> > Sent: Sunday, May 08, 2005 12:01 PM
> > To: [EMAIL PROTECTED]
> > Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
> > 
> > We have developed patches which relaxes the character 
> > validation so that PostgreSQL accepts invalid characters. It 
> > works like this:
> 
> That is just plain 100% wrong!!
>
> Under no circumstances should there be invalid data in a database.
> And if you're trying to make a database of invalid data, then at 
> least encode it using a valid encoding.
> 
> In fact, I've proposed strengthening the validation routines for UTF-8.

Actually I myself thought as you are before. Later I found that it was
not so good idea. People already have invalid encoded data in their
precious database and have very hard time to migrate to newer version
of PostgreSQL because of encoding validation.

Think about this kind of situation:

There is a table t1(member_id integer primary key, member_name text,
address text, phone text, email text). I have to reach each member by
either adress, phone or email. Unfortunately some of address field
have wrong encoded data. In this case I will use phone or email to
reach them. 

Now I need to upgrade to newer PostgreSQL within 1 day. I know I have
to fix wrong encoded field but it will take more than 1 day. So I
would like to import the data first then fix wrong encoded field on
running database since I can reach members by phone or email even with
wrong encoded address field...

I saw this kind of situation in the real world and that's why we
developed the patches.
--
Tatsuo Ishii

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread Tom Lane
"John Hansen" <[EMAIL PROTECTED]> writes:
> Tatsuo Ishii wrote:
>> We have developed patches which relaxes the character 
>> validation so that PostgreSQL accepts invalid characters.

> That is just plain 100% wrong!!

That was my first reaction too.  Why would this be a good idea?
If someone does want an encoding-agnostic database, they can
set it as SQL_ASCII.

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread John Hansen
Madison Kelly wrote:
>Under most circumstances I would agree with you completely. In my 
> case though I have to decide between risking a loss of a 
> user's data or 
> attempt to store the file name in some manner that would 
> return the same 
> name used by the file system.
> 
>The user (or one of his/her users in the case of an admin) may be 
> completely unaware of the file name being an invalid unicode 
> name. The 
> file itself though may still be quite valid and contain information 
> worthy of backing up. I could notify the user/admin that the 
> name is not 
> valid but there is no way I could rely on the name being 
> changed. Given 
> the choices, I would prefer to attempt to store/use the file 
> name with 
> the invalid unicode character than simply ignore the file.
> 
>Is there a way to store the name in raw binary? If so, 
> would this not 
> be safe because to postgresql it should no longer matter what 
> data is or 
> represents, right? Maybe there is a third option I am not yet 
> concidering?

Set the client_encoding to ascii when storing that name, and again when
retrieving it.
Or, use a bytea column.

> 
> Madison

... John

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread Madison Kelly
John Hansen wrote:
Tatsuo Ishii wrote:
We have developed patches which relaxes the character 
validation so that PostgreSQL accepts invalid characters. It 
works like this:

That is just plain 100% wrong!!
Under no circumstances should there be invalid data in a database.
And if you're trying to make a database of invalid data, then at 
least encode it using a valid encoding.

In fact, I've proposed strengthening the validation routines for UTF-8.
... John
  Under most circumstances I would agree with you completely. In my 
case though I have to decide between risking a loss of a user's data or 
attempt to store the file name in some manner that would return the same 
name used by the file system.

  The user (or one of his/her users in the case of an admin) may be 
completely unaware of the file name being an invalid unicode name. The 
file itself though may still be quite valid and contain information 
worthy of backing up. I could notify the user/admin that the name is not 
valid but there is no way I could rely on the name being changed. Given 
the choices, I would prefer to attempt to store/use the file name with 
the invalid unicode character than simply ignore the file.

  Is there a way to store the name in raw binary? If so, would this not 
be safe because to postgresql it should no longer matter what data is or 
represents, right? Maybe there is a third option I am not yet concidering?

Madison
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread John Hansen
Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 12:01 PM
> To: [EMAIL PROTECTED]
> Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
> 
> We have developed patches which relaxes the character 
> validation so that PostgreSQL accepts invalid characters. It 
> works like this:

That is just plain 100% wrong!!
Under no circumstances should there be invalid data in a database.
And if you're trying to make a database of invalid data, then at 
least encode it using a valid encoding.

In fact, I've proposed strengthening the validation routines for UTF-8.

... John

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match