On 08.08.2013 18:37, Chris Angelico wrote:
> On Thu, Aug 8, 2013 at 5:16 PM, Kurt Mueller
> <kurt.alfred.muel...@gmail.com> wrote:
>> On 08.08.2013 17:44, Peter Otten wrote:
>>> Kurt Mueller wrote:
>>>> What do I do, when input_strings/output_list has other codings like
>>>> iso-8859-1?
>>> You have to know the actual encoding. With that information it's easy:
>>> >>> output_list
>>> ['\xc3\xb6', '\xc3\xbc', 'i', 's', 'f']
>>> >>> encoding = "utf-8"
>>> >>> output_list = [s.decode(encoding) for s in output_list]
>>> >>> print output_list
>>> [u'\xf6', u'\xfc', u'i', u's', u'f']
>> How do I get to know the actual encoding?
>> I read from stdin. There can be different encodings.
>> Usually utf8 but also iso-8859-1/latin9 are to be expected.
>> But sys.stdin.encoding always says 'None'.
>
> If you can switch to Python 3, life becomes a LOT easier. The Python 3
> input() function (which does the same job as raw_input() from Python
> 2) returns a Unicode string, meaning that it takes care of encodings
> for you.

Because I cannot switch to Python 3 for now, my life is not so easy :-)

For some text manipulation tasks I need a template to split lines
from stdin into a list of strings the way shlex.split() does it.
The encoding of the input can vary.
For further processing in Python I need the list of strings to be in unicode.

Here is template.py:

##############################################################################################################
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
# split lines from stdin into a list of unicode strings
# Muk 2013-08-23
# Python 2.7.3

from __future__ import print_function
import sys
import shlex
import chardet

bool_cmnt = True  # shlex: skip comments
bool_posx = True  # shlex: posix mode (strings in quotes)

for inpt_line in sys.stdin:
    print( 'inpt_line=' + repr( inpt_line ) )
    # chardet.detect() returns e.g. {'encoding': 'EUC-JP', 'confidence': 0.99};
    # 'encoding' is None when nothing can be detected (e.g. an empty line),
    # so fall back to the usual case, utf-8
    enco_type = chardet.detect( inpt_line )[ 'encoding' ] or 'utf-8'
    print( 'enco_type=' + repr( enco_type ) )
    try:
        strg_inpt = shlex.split( inpt_line, bool_cmnt, bool_posx, )  # shlex does not work on unicode
    except Exception as errr:  # usually 'No closing quotation'
        print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
        continue
    print( 'strg_inpt=' + repr( strg_inpt ) )  # list of strings
    strg_unic = [ strg.decode( enco_type ) for strg in strg_inpt ]  # decode the strings into unicode
    print( 'strg_unic=' + repr( strg_unic ) )  # list of unicode strings
##############################################################################################################

$ cat <some-file> | template.py

Comments are welcome.

TIA
-- 
Kurt Mueller
-- 
http://mail.python.org/mailman/listinfo/python-list
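
A minimal sketch of an alternative to detecting the encoding line by line, assuming the only encodings to expect are the two named in the thread (utf8, otherwise iso-8859-1/latin9). It is not part of template.py: the function name decode_with_fallback and its default encoding list are made up for illustration. The idea is to try a strict UTF-8 decode first and fall back to a single-byte codec only when that fails.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper, not from template.py: decode a byte string by
# trying a fixed list of candidate encodings in order.


def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-15")):
    """Return (text, encoding_name) for the first encoding that decodes raw.

    raw must be a byte string (str in Python 2, bytes in Python 3).
    The default list is an assumption based on the thread: UTF-8 first,
    then latin9 (iso-8859-15).
    """
    for name in encodings:
        try:
            # bytes.decode() is strict by default, so invalid input raises
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode the input" % (encodings,))


if __name__ == "__main__":
    # the same two bytes as in Peter Otten's example: 'oe' umlaut in UTF-8
    print(repr(decode_with_fallback(b"\xc3\xb6")))
    # a lone 0xf6 byte is not valid UTF-8, so the fallback is used
    print(repr(decode_with_fallback(b"\xf6")))
```

The order matters: a single-byte codec such as latin9 accepts almost any byte sequence, so it only makes sense as the last candidate, and its result is a guess rather than a detection. Text that is really latin9 but happens to be valid UTF-8 would be decoded as UTF-8; for typical text with umlauts this is rare, but it is a heuristic all the same.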