Reading Windows CSV file with LCID entries under Linux.

2008-09-22 Thread Thomas Troeger

Dear all,

I've stumbled over a problem with Windows Locale ID information and
code pages. I'm writing a Python application that parses a CSV file
in which each line has the format LCID;Text1;Text2. Each line can
contain a different locale ID (LCID), and the text fields contain data
that is encoded in some code page associated with this LCID. My
current data file contains the codes 1031 for German and 1033 for
English US (as listed in
http://www.microsoft.com/globaldev/reference/lcid-all.mspx).
Unfortunately, I cannot find out which code page (like cp1252 or
whatever) belongs to which LCID.


My question is: How can I convert this data into something more 
reasonable like unicode? Basically, what I want is something like 
Text1;Text2, both fields encoded as UTF-8. Can this be done with 
Python? How can I find out which codepage I have to use for 1033 and 1031?


Any help appreciated,
Thomas.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Reading Windows CSV file with LCID entries under Linux.

2008-09-22 Thread skip

    Thomas> My question is: How can I convert this data into something more
    Thomas> reasonable like unicode? Basically, what I want is something
    Thomas> like Text1;Text2, both fields encoded as UTF-8. Can this be
    Thomas> done with Python? How can I find out which codepage I have to
    Thomas> use for 1033 and 1031?

There are examples at the end of the csv module documentation which show
how to create Unicode readers and writers.  You can extend the
UnicodeReader class to peek at the LCID field and select the corresponding
code page for the remainder of the line.  (This assumes your CSV files
contain no fields with embedded newlines, so that each line read is a
complete record.)
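
A minimal sketch of that idea, written for Python 3, where the csv module
works on text and the UnicodeReader recipe is no longer needed. The names
LCID_TO_CODEC, read_lcid_csv and convert are made up for illustration, and
the table rests on the assumption that both LCIDs in the sample file
(1031 and 1033) use Windows-1252 as their ANSI code page. Unknown LCIDs
fall back to UTF-8 here, which is a guess rather than a rule.

```python
import csv

# Assumption: German (1031) and English US (1033) both use Windows-1252.
# Extend this table for any other LCIDs that occur in your data.
LCID_TO_CODEC = {1031: "cp1252", 1033: "cp1252"}


def read_lcid_csv(raw_lines, fallback="utf-8"):
    """Yield (lcid, text1, text2) as str for each raw byte line."""
    for raw in raw_lines:
        raw = raw.rstrip(b"\r\n")
        if not raw:
            continue  # skip empty lines
        # The LCID field is plain ASCII digits, so it can be read
        # before the code page of the rest of the line is known.
        lcid_bytes, _, _ = raw.partition(b";")
        lcid = int(lcid_bytes.decode("ascii"))
        codec = LCID_TO_CODEC.get(lcid, fallback)
        # Decode the whole line, then let csv handle quoting.
        row = next(csv.reader([raw.decode(codec)], delimiter=";"))
        yield lcid, row[1], row[2]


def convert(src_path, dst_path):
    """Write Text1;Text2 for every input line, encoded as UTF-8."""
    with open(src_path, "rb") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst, delimiter=";")
        for _lcid, text1, text2 in read_lcid_csv(src):
            writer.writerow([text1, text2])
```

Opening the input in binary mode matters: each line has to stay as bytes
until its LCID has been read. As noted, this only works if no field
contains an embedded newline.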

Skip


Re: Reading Windows CSV file with LCID entries under Linux.

2008-09-22 Thread Tim Golden

Thomas Troeger wrote:
> I've stumbled over a problem with Windows Locale ID information and
> code pages. I'm writing a Python application that parses a CSV file
> in which each line has the format LCID;Text1;Text2. Each line can
> contain a different locale ID (LCID), and the text fields contain data
> that is encoded in some code page associated with this LCID. My
> current data file contains the codes 1031 for German and 1033 for
> English US (as listed in
> http://www.microsoft.com/globaldev/reference/lcid-all.mspx).
> Unfortunately, I cannot find out which code page (like cp1252 or
> whatever) belongs to which LCID.
>
> My question is: How can I convert this data into something more
> reasonable like unicode? Basically, what I want is something like
> Text1;Text2, both fields encoded as UTF-8. Can this be done with
> Python? How can I find out which codepage I have to use for 1033 and 1031?



The GetLocaleInfo API call, asked for LOCALE_IDEFAULTANSICODEPAGE,
returns the ANSI code page for an LCID:

http://msdn.microsoft.com/en-us/library/ms776270(VS.85).aspx

You'll need to use ctypes (or write a C extension) to
use it, and since it is a Win32 call it is only available
on Windows. Be aware that if it doesn't succeed you may need
to fall back on cp 65001 -- utf8.
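
A rough ctypes sketch of that lookup, for Python 3. It only does real
work on Windows, so on Linux the practical use is to run it once on a
Windows machine and copy the results into a lookup table. The function
names ansi_codepage_for_lcid and codec_name are made up for
illustration. Treating a returned "0" as "no ANSI code page, use the
fallback" is my reading of the API documentation, so check it against
your data.

```python
import ctypes
import sys

# LCTYPE constant from the Windows SDK headers.
LOCALE_IDEFAULTANSICODEPAGE = 0x1004


def ansi_codepage_for_lcid(lcid, fallback=65001):
    """Return the ANSI code page number for an LCID (Windows only)."""
    if sys.platform != "win32":
        raise OSError("GetLocaleInfoW is only available on Windows")
    buf = ctypes.create_unicode_buffer(16)
    n = ctypes.windll.kernel32.GetLocaleInfoW(
        lcid, LOCALE_IDEFAULTANSICODEPAGE, buf, len(buf))
    # 0 characters written means the call failed; "0" is assumed to
    # mean the locale has no ANSI code page of its own.
    if n == 0 or buf.value == "0":
        return fallback
    return int(buf.value)


def codec_name(codepage):
    """Map a Windows code page number to a Python codec name."""
    return "utf-8" if codepage == 65001 else "cp%d" % codepage
```

On Windows, codec_name(ansi_codepage_for_lcid(1031)) should give a name
that can be passed straight to bytes.decode().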

TJG