Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-24 Thread Arnaud Lesauvage

Tomi NA a écrit :

2006/11/23, Arnaud Lesauvage [EMAIL PROTECTED]:

Arnaud Lesauvage a écrit :
 Brandon Aiken a écrit :
 It also might be a big/little endian problem, although I always thought that 
was platform specific, not locale specific.

 Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which should 
use the two-byte or four-byte versions of UCS encoding using the system's default 
endian setting.


 Guys, it worked 
 UCS-4-INTERNAL was the right choice !!!

 I love you all !

 (now I just have an out of memory problem, but that's going
 to be a new thread)

Guys, it did not work !!! :(
I thought it worked because postgres seemed to be loading
the file and failing at the end with an out of memory
error, but in fact I think the conversion remove all
end-of-line characters (one line of 1.5GB was too much for
COPY...).

Still searching !


It will take you a day or two to get started, and then a day or two to
get the job done, but you really might want to look into kettle or
some other ETL tool to do the job.
It looks to me like you're trying to screw in a screw using a hammer.


Yes, I might try something else.
I was thinking that others would probably run into this 
problem sometime, and that our investigations might help them.
I think I'll forget about this COPY stuff and just export 
with DTS through PostgreSQL ODBC Unicode.


--
Arnaud

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-23 Thread Arnaud Lesauvage

Brandon Aiken a écrit :

It also might be a big/little endian problem, although I always thought that 
was platform specific, not locale specific.

Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which should use 
the two-byte or four-byte versions of UCS encoding using the system's default 
endian setting.



Guys, it worked 
UCS-4-INTERNAL was the right choice !!!

I love you all !

(now I just have an out of memory problem, but that's going
to be a new thread)

--
Arnaud


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-23 Thread Arnaud Lesauvage

Arnaud Lesauvage a écrit :

Brandon Aiken a écrit :

It also might be a big/little endian problem, although I always thought that 
was platform specific, not locale specific.

Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which should use 
the two-byte or four-byte versions of UCS encoding using the system's default 
endian setting.



Guys, it worked 
UCS-4-INTERNAL was the right choice !!!

I love you all !

(now I just have an out of memory problem, but that's going
to be a new thread)


Guys, it did not work !!! :(
I thought it worked because postgres seemed to be loading 
the file and failing at the end with an out of memory 
error, but in fact I think the conversion remove all 
end-of-line characters (one line of 1.5GB was too much for 
COPY...).


Still searching !

--
Arnaud

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-23 Thread Tomi NA

2006/11/23, Arnaud Lesauvage [EMAIL PROTECTED]:

Arnaud Lesauvage a écrit :
 Brandon Aiken a écrit :
 It also might be a big/little endian problem, although I always thought that 
was platform specific, not locale specific.

 Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which should 
use the two-byte or four-byte versions of UCS encoding using the system's default 
endian setting.


 Guys, it worked 
 UCS-4-INTERNAL was the right choice !!!

 I love you all !

 (now I just have an out of memory problem, but that's going
 to be a new thread)

Guys, it did not work !!! :(
I thought it worked because postgres seemed to be loading
the file and failing at the end with an out of memory
error, but in fact I think the conversion remove all
end-of-line characters (one line of 1.5GB was too much for
COPY...).

Still searching !


It will take you a day or two to get started, and then a day or two to
get the job done, but you really might want to look into kettle or
some other ETL tool to do the job.
It looks to me like you're trying to screw in a screw using a hammer.

t.n.a.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Arnaud Lesauvage

Richard Huxton a écrit :

Arnaud Lesauvage wrote:

Hi list !

I already posted this as COPY FROM encoding error, but I have been 
doing some more tests since then.


I'm trying to export data from MS SQL Server to PostgreSQL.
The tables are quite big (20M rows), so a CSV export and a COPY FROM3 
import seems to be the only reasonable solution.


Or go via MS-Access/Perl and ODBC/DBI perhaps?


Yes, I think it would work. The problem is that the DB is 
too big for this king of export. Using DTS from MSSQL to 
export directly to PostgreSQL using psqlODBC Unicode Driver, 
I exported ~1000 rows per second in a 2-columns table with 
~20M rows. That means several days just for this table, and 
I have bigger ones !



In DTS, I have 3 options to export a table as a text file : ANSI, OEM 
and UNICODE.

I tried all these options (and I have three files, one for each).


Well, what character-set is your database in?



Collation in MSSQL is SQL_Latin1_General_CP1_Cl_AS.
DTS documentation tells me that exporting in ANSI should 
export using the current codepage.

According to my local setting, my codepage is Windows-1252.
This file is not correctly read by COPY when using 
client_encoding of WIN1252 though...



I then try to import into PostgreSQL. The farther I can get is when 
using the UNICODE export, and importing it using a client_encoding set 
to UTF8 (I tried WIN1252, LATIN9, LATIN1, ...).

The copy then stops with an error :
ERROR: invalid byte sequence for encoding UTF8: 0xff
État SQL :22021

The problematic character is the euro currency symbol.


You'll want UTF-8 or LATIN9 for the euro symbol. LATIN1 supports that 
character-number but it is used for a different symbol.


Your first step needs to be to find out what character-set your data is in.
Your second is then to decide what char-set you want to use to store it 
in PG.

Then you can decide how to get there.


In PG, UTF8 was my choice (the DB already exists, I am just 
adding some tables that are still stored in MSSQL), and 
according to what you say this was the right choice.

The problem is really about reading this file I think.

I thought that given the character sets available in 
PostgreSQL, I would be able to COPY directly from my 
exported files.
If I have to convert them using some third party tool, I'll 
do that, but that's a bit more painful...


--
Arnaud

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Arnaud Lesauvage

Tomi NA a écrit :

I think I'll go this way... No other choice, actually !
The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
I don't really understand what this is. It supports the euro
symbol, so it is probably not pure LATIN1, right ?


I suppose you'd have to look at the latin1 codepage character table
somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
question. :)


Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that 
LATIN1 is missing the euro sign...

Gr I hate this !!!

--
Arnaud

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Tomi NA

2006/11/21, Arnaud Lesauvage [EMAIL PROTECTED]:

Hi list !

I already posted this as COPY FROM encoding error, but I have
been doing some more tests since then.

I'm trying to export data from MS SQL Server to PostgreSQL.
The tables are quite big (20M rows), so a CSV export and a COPY
FROM3 import seems to be the only reasonable solution.


I believe you might have more luck working without files altogether.
Use an ETL tool like kettle or even DTS with the pgsql ODBC driver.
That's exactly what those tools are for.

You still have to get the encodings right, though.
I suggest unicode for pgsql, but only you know how the MSSQL database
is encoded.

t.n.a.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Arnaud Lesauvage

Richard Huxton a écrit :

Arnaud Lesauvage wrote:

Richard Huxton a écrit :


Or go via MS-Access/Perl and ODBC/DBI perhaps?


Yes, I think it would work. The problem is that the DB is too big for 
this king of export. Using DTS from MSSQL to export directly to 
PostgreSQL using psqlODBC Unicode Driver, I exported ~1000 rows per 
second in a 2-columns table with ~20M rows. That means several days just 
for this table, and I have bigger ones !


Well it's about 0.25 days, but if it's too long, it's too long.


Sure, sorry for the confusion, the problem is with the other 
tables (same number of rows but a lot of columns, some very 
large).



 Collation in MSSQL is SQL_Latin1_General_CP1_Cl_AS.
 DTS documentation tells me that exporting in ANSI should export using 
 the current codepage.

 According to my local setting, my codepage is Windows-1252.
 This file is not correctly read by COPY when using client_encoding of 
 WIN1252 though...


Hmm. Odd that they don't agree on what WIN1252 is. I'm not sure how to
check the file and confirm one way or the other. Anyone else on the list 
got an idea?


I just downloaded the GnuWin32 version of iconv.
I'm giving it a try and I'll tell you haw it went.


--
Arnaud

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Alvaro Herrera
Arnaud Lesauvage wrote:
 Tomi NA a écrit :
 I think I'll go this way... No other choice, actually !
 The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
 I don't really understand what this is. It supports the euro
 symbol, so it is probably not pure LATIN1, right ?
 
 I suppose you'd have to look at the latin1 codepage character table
 somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
 question. :)
 
 Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that 
 LATIN1 is missing the euro sign...
 Gr I hate this !!!

So use Latin9 ...

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Tomi NA

2006/11/22, Arnaud Lesauvage [EMAIL PROTECTED]:

Tomi NA a écrit :
 2006/11/21, Arnaud Lesauvage [EMAIL PROTECTED]:
 Hi list !

 I already posted this as COPY FROM encoding error, but I have
 been doing some more tests since then.

 I'm trying to export data from MS SQL Server to PostgreSQL.
 The tables are quite big (20M rows), so a CSV export and a COPY
 FROM3 import seems to be the only reasonable solution.

 I believe you might have more luck working without files altogether.
 Use an ETL tool like kettle or even DTS with the pgsql ODBC driver.
 That's exactly what those tools are for.

 You still have to get the encodings right, though.
 I suggest unicode for pgsql, but only you know how the MSSQL database
 is encoded.

I think I'll go this way... No other choice, actually !
The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
I don't really understand what this is. It supports the euro
symbol, so it is probably not pure LATIN1, right ?


I suppose you'd have to look at the latin1 codepage character table
somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
question. :)

t.n.a.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Magnus Hagander

  I have done this in Delphi using it's built in UTF8 encoding and 
  decoding routines.   You can get a free copy of Delphi 
 Turbo Explorer 
  which includes components for MS SQL server and ODBC, so it 
 would be 
  pretty straight forward to get this working.
  
  The actual method in Delphi is system.UTF8Encode(widestring).  This 
  will encode unicode to UTF8 which is compatible with a 
 Postgresql UTF8 database.
 
 Ah, that's useful to know. Windows just doesn't have the same 
 quantity of tools installed as a *nix platform.

If your file is small enough, you can just open it up in Notepad and
re-save it as UTF8. It might play funny with the BOMs though
(byte-order-marks).

There is also, IIRC, an iconv binary available for Windows that should
be able to do such a conversion. Can't rememebr where thuogh :-)

//Magnus

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Magnus Hagander
  I already posted this as COPY FROM encoding error, but I 
 have been 
  doing some more tests since then.
  
  I'm trying to export data from MS SQL Server to PostgreSQL.
  The tables are quite big (20M rows), so a CSV export and a COPY 
  FROM3 import seems to be the only reasonable solution.
  
  Or go via MS-Access/Perl and ODBC/DBI perhaps?
 
 Yes, I think it would work. The problem is that the DB is too 
 big for this king of export. Using DTS from MSSQL to export 
 directly to PostgreSQL using psqlODBC Unicode Driver, I 
 exported ~1000 rows per second in a 2-columns table with ~20M 
 rows. That means several days just for this table, and I have 
 bigger ones !
 

Interesting. What did you set the Inser batch size to? (I think that's
available for all transformatino tasks). And did you remember to check
the box for use transactions?

While it's never as fast as a COPY, it should be possible to make it
faster than that, Ithink.


Another option is to just BCP the file out, and then COPY it into
postgresql. No nice GUI, but you can tune almost everything with BCP.

//Magnus

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org/


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Arnaud Lesauvage

Alvaro Herrera a écrit :

Arnaud Lesauvage wrote:

Tomi NA a écrit :
I think I'll go this way... No other choice, actually !
The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
I don't really understand what this is. It supports the euro
symbol, so it is probably not pure LATIN1, right ?

I suppose you'd have to look at the latin1 codepage character table
somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
question. :)

Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that 
LATIN1 is missing the euro sign...

Gr I hate this !!!


So use Latin9 ...


Of course, but it doesn't work !!!
Whatever client encoding I choose in postgresql before 
COPYing, I get the 'invalid byte sequence error'.


The farther I can get is exporting to UNICODE and importing 
as UTF8. Then COPY only breaks on the euro symbol (otherwise 
it breaks very early, I think on the first non-ascii 
character).


--
Arnaud

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Arnaud Lesauvage

Tomi NA a écrit :

2006/11/21, Arnaud Lesauvage [EMAIL PROTECTED]:

Hi list !

I already posted this as COPY FROM encoding error, but I have
been doing some more tests since then.

I'm trying to export data from MS SQL Server to PostgreSQL.
The tables are quite big (20M rows), so a CSV export and a COPY
FROM3 import seems to be the only reasonable solution.


I believe you might have more luck working without files altogether.
Use an ETL tool like kettle or even DTS with the pgsql ODBC driver.
That's exactly what those tools are for.

You still have to get the encodings right, though.
I suggest unicode for pgsql, but only you know how the MSSQL database
is encoded.


I think I'll go this way... No other choice, actually !
The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
I don't really understand what this is. It supports the euro 
symbol, so it is probably not pure LATIN1, right ?


--
Arnaud

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Alvaro Herrera
Arnaud Lesauvage wrote:
 Alvaro Herrera a écrit :
 Arnaud Lesauvage wrote:
 Tomi NA a écrit :
 I think I'll go this way... No other choice, actually !
 The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
 I don't really understand what this is. It supports the euro
 symbol, so it is probably not pure LATIN1, right ?
 
 I suppose you'd have to look at the latin1 codepage character table
 somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
 question. :)
 
 Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that 
 LATIN1 is missing the euro sign...
 Gr I hate this !!!
 
 So use Latin9 ...
 
 Of course, but it doesn't work !!!
 Whatever client encoding I choose in postgresql before 
 COPYing, I get the 'invalid byte sequence error'.

Humm ... how are you choosing the client encoding?  Is it actually
working?  I don't see how choosing Latin1 or Latin9 and feeding whatever
byte sequence would give you an invalid byte sequence.  These charsets
don't have any way to validate the bytes, as opposed to what UTF-8 can
do.  So you could end up with invalid bytes if you choose the wrong
client encoding, but that's a different error.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Richard Huxton

Arnaud Lesauvage wrote:

Richard Huxton a écrit :


Or go via MS-Access/Perl and ODBC/DBI perhaps?


Yes, I think it would work. The problem is that the DB is too big for 
this king of export. Using DTS from MSSQL to export directly to 
PostgreSQL using psqlODBC Unicode Driver, I exported ~1000 rows per 
second in a 2-columns table with ~20M rows. That means several days just 
for this table, and I have bigger ones !


Well it's about 0.25 days, but if it's too long, it's too long.

In DTS, I have 3 options to export a table as a text file : ANSI, OEM 
and UNICODE.

I tried all these options (and I have three files, one for each).


Well, what character-set is your database in?



Collation in MSSQL is SQL_Latin1_General_CP1_Cl_AS.
DTS documentation tells me that exporting in ANSI should export using 
the current codepage.

According to my local setting, my codepage is Windows-1252.
This file is not correctly read by COPY when using client_encoding of 
WIN1252 though...


Hmm. Odd that they don't agree on what WIN1252 is. I'm not sure how to 
check the file and confirm one way or the other. Anyone else on the list 
got an idea?


--
  Richard Huxton
  Archonet Ltd


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Tony Caduto




Of course, but it doesn't work !!!
Whatever client encoding I choose in postgresql before COPYing, I get 
the 'invalid byte sequence error'.


The farther I can get is exporting to UNICODE and importing as UTF8. 
Then COPY only breaks on the euro symbol (otherwise it breaks very 
early, I think on the first non-ascii character).


--
Like I said before UNICODE  UTF8  That's why the COPY command breaks 
on the Euro symbol.
You will have to export as UNICODE, then encode it as UTF8, then you 
won't get the breakage.


UTF8 is simply a means to store UNICODE pretty much as ASCII text.

You could grab a copy of Delphi TurboExplorer and create a import 
routine using the dbGO ADO components and the PG ODBC driver.
Basicly you need to encode any UNICODE data going to the PG server with 
the system.utf8encode function:


[Delphi] function *UTF8Encode*(const WS: WideString): UTF8String;
Call Utf8Encode to convert a Unicode string to UTF-8. WS is the Unicode 
string to convert. Utf8Encode returns the corresponding UTF-8 string.


I would imagine that Perl also has such routines, but I don't know for 
sure. These routines might be in FreePascal as well.


--
Tony Caduto
AM Software Design
http://www.amsoftwaredesign.com
Home of PG Lightning Admin for Postgresql
Your best bet for Postgresql Administration 



---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Arnaud Lesauvage

Magnus Hagander a écrit :
 I have done this in Delphi using it's built in UTF8 encoding and 
 decoding routines.   You can get a free copy of Delphi 
Turbo Explorer 
 which includes components for MS SQL server and ODBC, so it 
would be 
 pretty straight forward to get this working.
 
 The actual method in Delphi is system.UTF8Encode(widestring).  This 
 will encode unicode to UTF8 which is compatible with a 
Postgresql UTF8 database.


Ah, that's useful to know. Windows just doesn't have the same 
quantity of tools installed as a *nix platform.


If your file is small enough, you can just open it up in Notepad and
re-save it as UTF8. It might play funny with the BOMs though
(byte-order-marks).

There is also, IIRC, an iconv binary available for Windows that should
be able to do such a conversion. Can't rememebr where thuogh :-)


The file is way too big for notepad. It is even too big for 
notepad++.


I do have the GnuWin32 version of iconv (*great* software 
collection, BTW), but still no go...
I tried iconv -f CP1252 -t UTF-8 
detailrecherche_ansi.csv detailrecherche_cp1252utf8.csv
and iconv -f LATIN-9 -t UTF-8 detailrecherche_ansi.csv 
detailrecherche_latin9utf8.csv


Both don't want to load as UTF8 (invalid byte sequence x00).

I am desperate...
--
Arnaud

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Arnaud Lesauvage

Alvaro Herrera a écrit :

Arnaud Lesauvage wrote:

Alvaro Herrera a écrit :
Arnaud Lesauvage wrote:
Tomi NA a écrit :
I think I'll go this way... No other choice, actually !
The MSSQL database is in SQL_Latin1_General_CP1_Cl_AS.
I don't really understand what this is. It supports the euro
symbol, so it is probably not pure LATIN1, right ?

I suppose you'd have to look at the latin1 codepage character table
somewhere...I'm a UTF-8 guy so I'm not well suited to respond to the
question. :)

Yep, http://en.wikipedia.org/wiki/Latin-1 tells me that 
LATIN1 is missing the euro sign...

Gr I hate this !!!

So use Latin9 ...

Of course, but it doesn't work !!!
Whatever client encoding I choose in postgresql before 
COPYing, I get the 'invalid byte sequence error'.


Humm ... how are you choosing the client encoding?  Is it actually
working?  I don't see how choosing Latin1 or Latin9 and feeding whatever
byte sequence would give you an invalid byte sequence.  These charsets
don't have any way to validate the bytes, as opposed to what UTF-8 can
do.  So you could end up with invalid bytes if you choose the wrong
client encoding, but that's a different error.



mydb=# SET client_encoding TO LATIN9;
SET
mydb=# COPY statistiques.detailrecherche (log_gid, 
champrecherche, valeurrecherche) FROM 
'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;

ERROR:  invalid byte sequence for encoding LATIN9: 0x00
HINT:  This error can also happen if the byte sequence does 
not match the encoding expected by the server, which is 
controlled by client_encoding.

CONTEXT:  COPY detailrecherche, line 9212
mydb=# SET client_encoding TO WIN1252;
SET
mydb=# COPY statistiques.detailrecherche (log_gid, 
champrecherche, valeurrecherche) FROM 
'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;

ERROR:  invalid byte sequence for encoding WIN1252: 0x00
HINT:  This error can also happen if the byte sequence does 
not match the encoding expected by the server, which is 
controlled by client_encoding.

CONTEXT:  COPY detailrecherche, line 9212


Really, I'd rather have another error, but this is all I can 
get.

This is with the ANSI export.
With the UNICODE export :

mydb=# SET client_encoding TO UTF8;
SET
mydb=# COPY statistiques.detailrecherche (log_gid, 
champrecherche, valeurrecherche) FROM 
'E:\\Production\\Temp\\detailrecherche_unicode.csv' CSV;

ERROR:  invalid byte sequence for encoding UTF8: 0xff
HINT:  This error can also happen if the byte sequence does 
not match the encoding expected by the server, which is 
controlled by client_encoding.

CONTEXT:  COPY detailrecherche, line 592680

So, line 592680 is *a lot* better, but it is still not good!

--
Arnaud


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Thomas H.



Or go via MS-Access/Perl and ODBC/DBI perhaps?


Yes, I think it would work. The problem is that the DB is too big for 
this king of export. Using DTS from MSSQL to export directly to 
PostgreSQL using psqlODBC Unicode Driver, I exported ~1000 rows per 
second in a 2-columns table with ~20M rows. That means several days just 
for this table, and I have bigger ones !


Well it's about 0.25 days, but if it's too long, it's too long.


Sure, sorry for the confusion, the problem is with the other tables (same 
number of rows but a lot of columns, some very large).




well, if its too slow, then you will have to dump the db to a textfile (DTS 
does this for you) and then convert the textfile to utf8 manually before 
importing it to pgsql. iconv for win32 will help you there. i found tho it 
removes some wanted special characters, so watch out.
a less scientific approach would be using an unicode-aware texteditor to 
convert it (ultraedit does this pretty nicely, for example). have had good 
results with it.


loading several million rows will always take some time, tho.

- thomas 




---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Alvaro Herrera
Arnaud Lesauvage wrote:

 mydb=# SET client_encoding TO LATIN9;
 SET
 mydb=# COPY statistiques.detailrecherche (log_gid, 
 champrecherche, valeurrecherche) FROM 
 'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;
 ERROR:  invalid byte sequence for encoding LATIN9: 0x00
 HINT:  This error can also happen if the byte sequence does 
 not match the encoding expected by the server, which is 
 controlled by client_encoding.

Huh, why do you have a 0x00 byte in there?  That's certainly not
Latin9 (nor UTF8 as far as I know).

Is the file actually Latin-something or did you convert it to something
else at some point?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org/


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Alvaro Herrera
Arnaud Lesauvage wrote:
 Alvaro Herrera a écrit :
 Arnaud Lesauvage wrote:
 
 mydb=# SET client_encoding TO LATIN9;
 SET
 mydb=# COPY statistiques.detailrecherche (log_gid, 
 champrecherche, valeurrecherche) FROM 
 'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;
 ERROR:  invalid byte sequence for encoding LATIN9: 0x00
 HINT:  This error can also happen if the byte sequence does 
 not match the encoding expected by the server, which is 
 controlled by client_encoding.
 
 Huh, why do you have a 0x00 byte in there?  That's certainly not
 Latin9 (nor UTF8 as far as I know).
 
 Is the file actually Latin-something or did you convert it to something
 else at some point?
 
 This is the file generated by DTS with ANSI encoding. It 
 was not altered in any way after that !
 The doc states that ANSI exports with the local codepage 
 (which is Win1252). That's all I know. :(

I thought Win1252 was supposed to be almost the same as Latin1.  While
I'd expect certain differences, I wouldn't expect it to use 0x00 as
data!

Maybe you could have DTS export Unicode, which would presumably be
UTF-16, then recode that to something else (possibly UTF-8) with GNU
iconv.

FWIW, I think the preferred way to set the client encoding on psql is
\encoding.  I'm not sure if it does anything different from the SET
command though.

-- 
Alvaro Herrera   Valdivia, Chile   ICBM: S 39º 49' 18.1, W 73º 13' 56.4
In Europe they call me Niklaus Wirth; in the US they call me Nickel's worth.
 That's because in Europe they call me by name, and in the US by value!

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Arnaud Lesauvage

Alvaro Herrera a écrit :

Arnaud Lesauvage wrote:

Alvaro Herrera a écrit :
Arnaud Lesauvage wrote:

mydb=# SET client_encoding TO LATIN9;
SET
mydb=# COPY statistiques.detailrecherche (log_gid, 
champrecherche, valeurrecherche) FROM 
'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;

ERROR:  invalid byte sequence for encoding LATIN9: 0x00
HINT:  This error can also happen if the byte sequence does 
not match the encoding expected by the server, which is 
controlled by client_encoding.


Huh, why do you have a 0x00 byte in there?  That's certainly not
Latin9 (nor UTF8 as far as I know).

Is the file actually Latin-something or did you convert it to something
else at some point?

This is the file generated by DTS with ANSI encoding. It 
was not altered in any way after that !
The doc states that ANSI exports with the local codepage 
(which is Win1252). That's all I know. :(


I thought Win1252 was supposed to be almost the same as Latin1.  While
I'd expect certain differences, I wouldn't expect it to use 0x00 as
data!

Maybe you could have DTS export Unicode, which would presumably be
UTF-16, then recode that to something else (possibly UTF-8) with GNU
iconv.


UTF-16 ! That's something I haven't tried !
I'll try an iconv conversion tomorrow from UTF16 to UTF8 !

--
Arnaud

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Bruce Momjian
Arnaud Lesauvage wrote:
  I thought Win1252 was supposed to be almost the same as Latin1.  While
  I'd expect certain differences, I wouldn't expect it to use 0x00 as
  data!
  
  Maybe you could have DTS export Unicode, which would presumably be
  UTF-16, then recode that to something else (possibly UTF-8) with GNU
  iconv.
 
 UTF-16 ! That's something I haven't tried !
 I'll try an iconv conversion tomorrow from UTF16 to UTF8 !

Right!  To clarify, Unicode is the character set, and UTF8 and UTF16 are
ways of representing that characters set in 8-bit and 16-bit segments,
respectively.  PostgreSQL only suports UTF8, and Win32 only supports
UTF16 in the operating system.  And 0x00 is not a valid value in any of
those, that I know of, but perhaps it is in UTF16.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Magnus Hagander
   I thought Win1252 was supposed to be almost the same as Latin1.  
   While I'd expect certain differences, I wouldn't expect it to use 
   0x00 as data!
   
   Maybe you could have DTS export Unicode, which would 
 presumably be 
   UTF-16, then recode that to something else (possibly 
 UTF-8) with GNU 
   iconv.
  
  UTF-16 ! That's something I haven't tried !
  I'll try an iconv conversion tomorrow from UTF16 to UTF8 !
 
 Right!  To clarify, Unicode is the character set, and UTF8 
 and UTF16 are ways of representing that characters set in 
 8-bit and 16-bit segments, respectively.  PostgreSQL only 
 suports UTF8, and Win32 only supports
 UTF16 in the operating system.  And 0x00 is not a valid value 
 in any of those, that I know of, but perhaps it is in UTF16.

Actually, Win32 supports UTF8 as well. There are a few operations that
aren't supported on it, but you can certainly read and write files in it
from most builtin apps.

One other problem is that in most (all) win32 documentation talks about
UNICODE when they mean UTF16 (in = NT4, UCS-2). And PostgreSQL used to
say UNICODE when we meant UTF8. Adds to the confusion.

Finally, UTF-8 does not represent the characters in 8-bit segments - it
can use anything from 8 to 32 bits. UTF-16 always uses 16 bits. This
also means that you acn't talk about 0x00 being valid in UTF-16,
because all characters are 16-bit. It would be 0x or 0x00 0x00.
But that requires an application that knows UTF16, which postgresql
doesn't, so it reports on the first 0x00.

//Magnus

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Brandon Aiken
It also might be a big/little endian problem, although I always thought that 
was platform specific, not locale specific.

Try the UCS-2-INTERNAL and UCS-4-INTERNAL codepages in iconv, which should use 
the two-byte or four-byte versions of UCS encoding using the system's default 
endian setting.

There's many Unicode codepage formats that iconv supports:
UTF-8
ISO-10646-UCS-2 UCS-2 CSUNICODE
UCS-2BE UNICODE-1-1 UNICODEBIG CSUNICODE11
UCS-2LE UNICODELITTLE
ISO-10646-UCS-4 UCS-4 CSUCS4
UCS-4BE
UCS-4LE
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UNICODE-1-1-UTF-7 UTF-7 CSUNICODE11UTF7
UCS-2-INTERNAL
UCS-2-SWAPPED
UCS-4-INTERNAL
UCS-4-SWAPPED

Gee, didn't Unicode just so simplify this codepage mess?  Remember when it was 
just ASCII, EBCDIC, ANSI, and localized codepages?

--
Brandon Aiken
CS/IT Systems Engineer
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Arnaud Lesauvage
Sent: Wednesday, November 22, 2006 12:38 PM
To: Arnaud Lesauvage; General
Subject: Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

Alvaro Herrera a écrit :
 Arnaud Lesauvage wrote:
 Alvaro Herrera a écrit :
 Arnaud Lesauvage wrote:
 
 mydb=# SET client_encoding TO LATIN9;
 SET
 mydb=# COPY statistiques.detailrecherche (log_gid, 
 champrecherche, valeurrecherche) FROM 
 'E:\\Production\\Temp\\detailrecherche_ansi.csv' CSV;
 ERROR:  invalid byte sequence for encoding LATIN9: 0x00
 HINT:  This error can also happen if the byte sequence does 
 not match the encoding expected by the server, which is 
 controlled by client_encoding.
 
 Huh, why do you have a 0x00 byte in there?  That's certainly not
 Latin9 (nor UTF8 as far as I know).
 
 Is the file actually Latin-something or did you convert it to something
 else at some point?
 
 This is the file generated by DTS with ANSI encoding. It 
 was not altered in any way after that !
 The doc states that ANSI exports with the local codepage 
 (which is Win1252). That's all I know. :(
 
 I thought Win1252 was supposed to be almost the same as Latin1.  While
 I'd expect certain differences, I wouldn't expect it to use 0x00 as
 data!
 
 Maybe you could have DTS export Unicode, which would presumably be
 UTF-16, then recode that to something else (possibly UTF-8) with GNU
 iconv.

UTF-16 ! That's something I haven't tried !
I'll try an iconv conversion tomorrow from UTF16 to UTF8 !

--
Arnaud

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Tomi NA

2006/11/22, Brandon Aiken [EMAIL PROTECTED]:


Gee, didn't Unicode just so simplify this codepage mess?  Remember when it was 
just ASCII, EBCDIC, ANSI, and localized codepages?


Unicode is a heaven sent, compared to 3 or 4 codepages representing
any given (obviously non-English) language, and 3 or 4 more for every
other language you have to deal with in your application. Perfect?
Hardly. But then again, much more so than natural languages.
I'd say we'd deliver products 10-20% faster (in the company I work in)
if people looked ahead a couple of decades ago and decided upon
something along the lines of unicode instead of ASCII.

Cheers,
t.n.a.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-22 Thread Martijn van Oosterhout
On Wed, Nov 22, 2006 at 01:55:55PM -0500, Brandon Aiken wrote:
 Gee, didn't Unicode just so simplify this codepage mess?  Remember
 when it was just ASCII, EBCDIC, ANSI, and localized codepages?

I think that's one reason why Unix has standardised on UTF-8 rather
than one of the other Unicode variants. For transmission between
systems it's the easiest to get right...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-21 Thread Tony Caduto

Arnaud Lesauvage wrote:



I then try to import into PostgreSQL. The farther I can get is when 
using the UNICODE export, and importing it using a client_encoding 
set to UTF8 (I tried WIN1252, LATIN9, LATIN1, ...).

The copy then stops with an error :
ERROR: invalid byte sequence for encoding UTF8: 0xff
État SQL :22021

The problematic character is the euro currency symbol.



Exporting from MS SQL server as unicode is going to give you full 
Unicode, not UTF8.  Full unicde is 2 bytes per character and UTF8 is 1, 
same as ASCII.

You will have to encode the Unicode data to UTF8

I have done this in Delphi using it's built in UTF8 encoding and 
decoding routines.   You can get a free copy of Delphi Turbo Explorer 
which includes components for MS SQL server and ODBC, so it would be 
pretty straight forward to get this working.


The actual method in Delphi is system.UTF8Encode(widestring).  This will 
encode unicode to UTF8 which is compatible with a Postgresql UTF8 database.


I am sure Perl could do it also.

Hope this helps.

--
Tony Caduto
AM Software Design
http://www.amsoftwaredesign.com
Home of PG Lightning Admin for Postgresql
Your best bet for Postgresql Administration 



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] MSSQL to PostgreSQL : Encoding problem

2006-11-21 Thread Richard Huxton

Tony Caduto wrote:

Arnaud Lesauvage wrote:



I then try to import into PostgreSQL. The farther I can get is when 
using the UNICODE export, and importing it using a client_encoding 
set to UTF8 (I tried WIN1252, LATIN9, LATIN1, ...).

The copy then stops with an error :
ERROR: invalid byte sequence for encoding UTF8: 0xff
État SQL :22021

The problematic character is the euro currency symbol.



Exporting from MS SQL server as unicode is going to give you full 
Unicode, not UTF8.  Full unicde is 2 bytes per character and UTF8 is 1, 
same as ASCII.

You will have to encode the Unicode data to UTF8


Well, UTF8 is a minimum of one byte, but can be longer for non-ASCII 
characters. The idea being that chars below 128 map to ASCII. There's 
also UTF16 and I believe UTF32 with 2+ and four byte characters.


I have done this in Delphi using it's built in UTF8 encoding and 
decoding routines.   You can get a free copy of Delphi Turbo Explorer 
which includes components for MS SQL server and ODBC, so it would be 
pretty straight forward to get this working.


The actual method in Delphi is system.UTF8Encode(widestring).  This will 
encode unicode to UTF8 which is compatible with a Postgresql UTF8 database.


Ah, that's useful to know. Windows just doesn't have the same quantity 
of tools installed as a *nix platform.



I am sure Perl could do it also.


And in one line if you're clever enough no doubt ;-)

--
  Richard Huxton
  Archonet Ltd


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly