Re: [GENERAL] converting databases form SQL_ASCII to UTF8

2011-05-03 Thread Jasen Betts
On 2011-04-22, Geoffrey Myers g...@serioustechnology.com wrote:
 Vick Khera wrote:
 On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers 
 li...@serioustechnology.com mailto:li...@serioustechnology.com wrote:
 
 Here's our problem.  We planned on moving databases a few at a time.
 Problem is, there is a process that pushes data from one database to
 another.  If this process attempts to push data from a SQL_ASCII
 database to a new UTF8 database and it has one of these characters
 mentioned above, the process fails.
 
 
 The database's enforcement of the encoding should be the last layer that 
 does so.  Your applications should be enforcing strict utf-8 encoding 
 from start to finish.  Once this is done, and the old data already in 
 the DB is properly encoded as utf-8, then there should be no problems 
 switching on the utf-8 encoding in postgres to get that final layer of 
 verification.

 Totally agree.  Still, the question remains, why not leave it as SQL_ASCII?

perhaps you want sorted output in some locale other than 'C'?
or maybe want to take a substring in the database...

utf8 in SQL-ASCII is just a string of octets

utf8 in a utf8 database is a string of unicode characters.

-- 
⚂⚃ 100% natural


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] converting databases form SQL_ASCII to UTF8

2011-05-03 Thread Geoffrey Myers

Jasen Betts wrote:

On 2011-04-22, Geoffrey Myers g...@serioustechnology.com wrote:

Vick Khera wrote:
On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers 
li...@serioustechnology.com mailto:li...@serioustechnology.com wrote:


Here's our problem.  We planned on moving databases a few at a time.
Problem is, there is a process that pushes data from one database to
another.  If this process attempts to push data from a SQL_ASCII
database to a new UTF8 database and it has one of these characters
mentioned above, the process fails.


The database's enforcement of the encoding should be the last layer that 
does so.  Your applications should be enforcing strict utf-8 encoding 
from start to finish.  Once this is done, and the old data already in 
the DB is properly encoded as utf-8, then there should be no problems 
switching on the utf-8 encoding in postgres to get that final layer of 
verification.

Totally agree.  Still, the question remains, why not leave it as SQL_ASCII?


perhaps you want sorted output in some locale other than 'C'?
or maybe want to take a substring in the database...

utf8 in SQL-ASCII is just a string of octets

utf8 in a utf8 database is a string of unicode characters.



We finally have a solution in place. A bug in my code was making the 
problem bigger then it really is.  Gotta love those bugs.



--
Until later, Geoffrey

I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them.
- Thomas Jefferson

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] converting databases form SQL_ASCII to UTF8

2011-04-30 Thread Dimitri Fontaine
Geoffrey Myers li...@serioustechnology.com writes:
 So, now the question is, is this effort even worth our effort?
 What is the harm in leaving our databases SQL_ASCII encoded?

You're declaring bankruptcy on being able to make any sense of the data
you stored.  Is that really what you think you need?

For converting, you might be interested into those two blog entries:

  http://tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_1.html
  http://tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_2.html

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] converting databases form SQL_ASCII to UTF8

2011-04-22 Thread Geoffrey Myers
We are moving our databases to new hardware soon, so we felt it would be 
a good time to get the encoding correct.  Our databases are currently 
SQL_ASCII and we plan to move them to UTF8.


So, as previously noted, there are certain characters that won't load 
into a UTF8 database from a dump of the SQL_ASCII database.


Here's our problem.  We planned on moving databases a few at a time. 
Problem is, there is a process that pushes data from one database to 
another.  If this process attempts to push data from a SQL_ASCII 
database to a new UTF8 database and it has one of these characters 
mentioned above, the process fails.


So, now the question is, is this effort even worth our effort?

What is the harm in leaving our databases SQL_ASCII encoded?

Thanks for any insights.

--
Until later, Geoffrey

I predict future happiness for America if they can prevent
the government from wasting the labors of the people under
the pretense of taking care of them.
- Thomas Jefferson

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] converting databases form SQL_ASCII to UTF8

2011-04-22 Thread Vick Khera
On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers 
li...@serioustechnology.com wrote:

 Here's our problem.  We planned on moving databases a few at a time.
 Problem is, there is a process that pushes data from one database to
 another.  If this process attempts to push data from a SQL_ASCII database to
 a new UTF8 database and it has one of these characters mentioned above, the
 process fails.


The database's enforcement of the encoding should be the last layer that
does so.  Your applications should be enforcing strict utf-8 encoding from
start to finish.  Once this is done, and the old data already in the DB is
properly encoded as utf-8, then there should be no problems switching on the
utf-8 encoding in postgres to get that final layer of verification.


Re: [GENERAL] converting databases form SQL_ASCII to UTF8

2011-04-22 Thread Adrian Klaver
On Friday, April 22, 2011 8:00:08 am Geoffrey Myers wrote:

 
 What is the harm in leaving our databases SQL_ASCII encoded?

SQL_ASCII is basically no encoding. The world is slowly but surely moving to 
Unicode, sooner or later you are going to hit the unknown encoding/Unicode 
wall. 
Probably better to get it over with now.

 
 Thanks for any insights.

-- 
Adrian Klaver
adrian.kla...@gmail.com

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] converting databases form SQL_ASCII to UTF8

2011-04-22 Thread Vick Khera
On Fri, Apr 22, 2011 at 11:16 AM, Geoffrey Myers g...@serioustechnology.com
 wrote:

 Totally agree.  Still, the question remains, why not leave it as SQL_ASCII?


you have no guarantees that the data stored within is utf-8. that is all.
 if you can make such guarantees from within your application, then you have
some confidence, but never 100%.  but it depends on what access you permit
to the DB. it is kind of like removing FKs and enforcing them from the
application layer.  you can do it, but you just don't have a 100% guarantee
that the DB is consistent.


Re: [GENERAL] converting databases form SQL_ASCII to UTF8

2011-04-22 Thread Geoffrey Myers

Vick Khera wrote:
On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers 
li...@serioustechnology.com mailto:li...@serioustechnology.com wrote:


Here's our problem.  We planned on moving databases a few at a time.
Problem is, there is a process that pushes data from one database to
another.  If this process attempts to push data from a SQL_ASCII
database to a new UTF8 database and it has one of these characters
mentioned above, the process fails.


The database's enforcement of the encoding should be the last layer that 
does so.  Your applications should be enforcing strict utf-8 encoding 
from start to finish.  Once this is done, and the old data already in 
the DB is properly encoded as utf-8, then there should be no problems 
switching on the utf-8 encoding in postgres to get that final layer of 
verification.


Totally agree.  Still, the question remains, why not leave it as SQL_ASCII?

--
Geoffrey Myers
Myers Consulting Inc.
770.592.1651

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] converting databases form SQL_ASCII to UTF8

2011-04-22 Thread Rob Sargent



On 04/22/2011 09:16 AM, Geoffrey Myers wrote:

Vick Khera wrote:

On Fri, Apr 22, 2011 at 11:00 AM, Geoffrey Myers
li...@serioustechnology.com mailto:li...@serioustechnology.com wrote:

Here's our problem. We planned on moving databases a few at a time.
Problem is, there is a process that pushes data from one database to
another. If this process attempts to push data from a SQL_ASCII
database to a new UTF8 database and it has one of these characters
mentioned above, the process fails.


The database's enforcement of the encoding should be the last layer
that does so. Your applications should be enforcing strict utf-8
encoding from start to finish. Once this is done, and the old data
already in the DB is properly encoded as utf-8, then there should be
no problems switching on the utf-8 encoding in postgres to get that
final layer of verification.


Totally agree. Still, the question remains, why not leave it as SQL_ASCII?

Maybe because you'll have to consistently remember that you're doing 
non-standard stuff?


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] converting databases form SQL_ASCII to UTF8

2011-04-22 Thread Eric McKeeth
On Fri, Apr 22, 2011 at 9:16 AM, Geoffrey Myers
g...@serioustechnology.comwrote:

 Vick Khera wrote:

 The database's enforcement of the encoding should be the last layer that
 does so.  Your applications should be enforcing strict utf-8 encoding from
 start to finish.  Once this is done, and the old data already in the DB is
 properly encoded as utf-8, then there should be no problems switching on the
 utf-8 encoding in postgres to get that final layer of verification.


 Totally agree.  Still, the question remains, why not leave it as SQL_ASCII?


Well, if your data is supposed to be UTF-8 encoded, then any of those
characters with invalid encoding in UTF-8 could reasonably be viewed as data
errors. Leaving the database in SQL-ASCII allows those errors to continue
accumulating, which will make a switch in the future even harder. If the
lack of being able to check encoding errors at the database level doesn't
bother you, and you're fine with risking bigger pain later in order to avoid
pain now, then I see no compelling reason to move away from SQL_ASCII.

-Eric


Re: [GENERAL] converting databases form SQL_ASCII to UTF8

2011-04-22 Thread Steve Crawford

On 04/22/2011 08:00 AM, Geoffrey Myers wrote:
We are moving our databases to new hardware soon, so we felt it would 
be a good time to get the encoding correct.  Our databases are 
currently SQL_ASCII and we plan to move them to UTF8.
We are in the same boat, fortunately only on one older server we are 
upgrading and fortunately for internal apps



So, as previously noted, there are certain characters that won't load 
into a UTF8 database from a dump of the SQL_ASCII database.


Here's our problem.  We planned on moving databases a few at a time. 
Problem is, there is a process that pushes data from one database to 
another.  If this process attempts to push data from a SQL_ASCII 
database to a new UTF8 database and it has one of these characters 
mentioned above, the process fails.


So, now the question is, is this effort even worth our effort?

What is the harm in leaving our databases SQL_ASCII encoded?


SQL_ASCII is a synonym for no encoding. You put in a stream of bytes 
and that's what you get out. That's OK if the byte-stream has exactly 
the same meaning to every application and user. If that's not the case 
then you have bytes in your database but you don't know what those bytes 
are supposed to represent.


In a way, it's like having a generic integer column but depending on the 
user or the application, that column might represent a unix epoch 
timestamp, an age in years, a salary, a weight in grams, furlongs per 
fortnight, etc. And there is no indicator to say which it is. Not good.


We are in the final stages of cleaning up our last bit of non-utf8 data 
and the above some what silly example is actually not far from the 
truth. Due to data that arrived from web-inputs, spreadsheet imports, 
command-line, internal-apps, etc. we have been faced with cleaning 
tables where one row has only basic ASCII data, another has UTF8, while 
others have various different Microsoft encodings. With data like that 
it's pretty much impossible to guarantee that even a simple web-report 
will display fully correctly.


The longer you wait, the worse it gets. Even though it is only a tiny 
percentage of our data, cleaning it is still a pain.


Is it worth it? Dunno - you have to calculate the cost/benefit. For us 
it was a no-brainer to bite-the-bullet and do it.


Cheers,
Steve


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general