character sets? unicode?

2005-02-03 Thread Michael
I'm trying to import text from email I've received, run some regular 
expressions on it, and save the text into a database. I'm trying to 
figure out how to handle the issue of character sets. I've had some 
problems with my regular expressions on email that has interesting 
character sets. Korean text seems to be filled with a lot of '=3D=21' 
type of stuff. This doesn't look like Unicode (or am I wrong?), so does 
anyone know how I should handle it? Do I need to do anything special 
when passing text with non-ASCII characters to re, MySQLdb, or any other 
libraries? Is it better to save the text as-is in my db and save the 
character set type too or should I try to convert all text to some 
default format like UTF-8? Any advice? Thanks.

--
Michael [EMAIL PROTECTED]
http://kavlon.org
--
http://mail.python.org/mailman/listinfo/python-list


Re: character sets? unicode?

2005-02-03 Thread Fredrik Lundh
Michael wrote:

> I'm trying to import text from email I've received, run some regular
> expressions on it, and save the text into a database. I'm trying to
> figure out how to handle the issue of character sets. I've had some
> problems with my regular expressions on email that has interesting
> character sets. Korean text seems to be filled with a lot of '=3D=21'
> type of stuff.

that looks like quoted-printable encoding, which the quopri module handles:

http://python.org/doc/lib/module-quopri.html

plus perhaps some character encoding for the underlying text.

rather than rolling your own message handling code, consider using the
email package:

http://python.org/doc/lib/module-email.html

in either case, the MIME specification is required reading here (for a link,
see the quopri page above).
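
a minimal sketch of both routes, assuming Python 3 (which postdates this
thread) and a made-up sample message.  the helper names decode_qp and
body_text, the EUC-KR charset and the sample text are assumptions for
illustration only; quopri.decodestring and email.message_from_bytes are the
standard library calls.

```python
import quopri
from email import message_from_bytes, policy

# The '=3D=21' from the question is quoted-printable for the two bytes "=!".
print(quopri.decodestring(b"=3D=21"))  # b'=!'


def decode_qp(data: bytes, charset: str) -> str:
    """Undo quoted-printable, then decode the bytes with the given charset."""
    return quopri.decodestring(data).decode(charset)


def body_text(raw_message: bytes) -> str:
    """Let the email package handle the transfer encoding and the charset."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


# Hypothetical sample: Korean text in EUC-KR, sent as quoted-printable.
KOREAN = "안녕"
QP_BODY = quopri.encodestring(KOREAN.encode("euc-kr"))
SAMPLE = (
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=euc-kr\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n" + QP_BODY + b"\r\n"
)

manual = decode_qp(QP_BODY, "euc-kr")   # you must know the charset yourself
parsed = body_text(SAMPLE).strip()      # charset is read from the headers
```

the second route is usually less work, since the email package reads the
charset and transfer encoding from the message headers instead of making you
track them by hand.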

> Do I need to do anything special when passing text with non-ascii
> characters to re

depends on your patterns.  by default, RE operators like \w and \s assume
ASCII.  to make them match non-ASCII characters, convert your text to Unicode
before passing it to the RE module, and use the (?u) flag.
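
a small sketch of the difference, assuming Python 3, where str patterns are
already Unicode-aware (so the (?u) flag is accepted but redundant) and bytes
patterns behave like the ASCII-only default described above.  the sample text
and the EUC-KR charset are assumptions for illustration.

```python
import re

# Hypothetical input: mixed Korean and ASCII text as it might arrive, in EUC-KR.
RAW = "서울 Seoul 2005".encode("euc-kr")

# Convert to Unicode first, then match: \w now covers Hangul as well.
text = RAW.decode("euc-kr")
unicode_words = re.findall(r"(?u)\w+", text)

# Matching on the undecoded bytes: \w only recognises ASCII word characters,
# so the Korean word is skipped entirely.
byte_words = re.findall(rb"\w+", RAW)
```

here unicode_words contains all three words, while byte_words only contains
b"Seoul" and b"2005".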

> Is it better to save the text as-is in my db and save the character set type
> too or should I try to convert all text to some default format like UTF-8?

depends on your application; using a standard encoding has many advantages,
but storing the original text as is guarantees that no information is lost,
even if you have bugs in your conversion code.  when in doubt, save the
original and do the conversion on the way out.
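
a rough sketch of "save the original, convert on the way out", assuming
Python 3 and using the standard library's sqlite3 as a stand-in for MySQLdb.
the table layout and the helper names store and load_as_utf8 are hypothetical.

```python
import sqlite3


def store(conn: sqlite3.Connection, raw: bytes, charset: str) -> int:
    """Save the undecoded bytes together with the charset they arrived in."""
    cur = conn.execute(
        "INSERT INTO messages (raw, charset) VALUES (?, ?)", (raw, charset)
    )
    return cur.lastrowid


def load_as_utf8(conn: sqlite3.Connection, row_id: int) -> bytes:
    """Convert on the way out; the stored original is never modified."""
    raw, charset = conn.execute(
        "SELECT raw, charset FROM messages WHERE id = ?", (row_id,)
    ).fetchone()
    # errors="replace" keeps a bad byte from aborting the read; since the
    # original is still in the table, the conversion can be redone later.
    return raw.decode(charset, errors="replace").encode("utf-8")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, raw BLOB, charset TEXT)"
)
row_id = store(conn, "안녕".encode("euc-kr"), "euc-kr")
utf8_text = load_as_utf8(conn, row_id)
```

if the conversion code turns out to be wrong, fixing it only means changing
load_as_utf8; nothing in the table has to be repaired.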

/F 


