Re: How to turn a string into a list of integers?
Am 06.09.2014 um 20:19 schrieb Steven D'Aprano:
> Kurt Mueller wrote:
> [...]
>> Now the part of the two Python builds is still somewhat unclear to me.
> [...]
>> In Python 2.7:
>> As I learned from the ord() manual:
>> If a unicode argument is given and Python was built with UCS2 Unicode,
> Where does the manual mention UCS-2? As far as I know, no version of Python
> uses that.
https://docs.python.org/2/library/functions.html?highlight=ord#ord
[snip] very detailed explanation of narrow/wide builds, UCS-2/UCS-4, UTF-16/UTF-32
> Remember, though, these internal representations are (nearly) irrelevant to
> Python code. In Python code, you just consider that a Unicode string is an
> array of ordinal values from 0x0 to 0x10FFFF, each representing a single
> code point U+0000 to U+10FFFF. The only reason I say "nearly" is that
> narrow builds don't *quite* work right if the string contains surrogate
> pairs.
So I can interpret your last section: Processing any Unicode string will work with narrow and wide Python 2.7 builds and also with Python >= 3.3?
( parts of narrow-build Python will not work with values over 0xFFFF )
( strings with surrogate pairs will not work correctly on narrow-build Python )
Many thanks for your detailed answer!
-- Kurt Mueller, kurt.alfred.muel...@gmail.com
-- https://mail.python.org/mailman/listinfo/python-list
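[Editor's aside: the guarantee quoted above can be checked directly in Python 3, where PEP 393 removed the narrow/wide distinction; this is a minimal sketch in Python 3 terms, not Python 2.7.]

```python
import sys

# Since Python 3.3 (PEP 393) every build exposes the full code point range.
print(sys.maxunicode)      # 1114111 == 0x10FFFF

s = '\U0001F40D'           # a code point above 0xFFFF (snake emoji)
print(len(s))              # 1 -- one code point, never a surrogate pair
print(ord(s))              # 128013 == 0x1F40D
```

On a Python 2.7 narrow build, the same string would have len() 2 and ord() would fail on each surrogate half; that is exactly the "(nearly)" caveat in the quoted answer.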
Re: How to turn a string into a list of integers?
Am 06.09.2014 um 07:47 schrieb Steven D'Aprano:
> Kurt Mueller wrote:
>> Could someone please explain the following behavior to me:
>> Python 2.7.7, MacOS 10.9 Mavericks
[snip]
Thanks for the detailed explanation. I think I understand a bit better now. Now the part of the two Python builds is still somewhat unclear to me.
> If you could peer under the hood, and see what implementation Python uses to
> store that string, you would see something version dependent. In Python
> 2.7, you would see an object more or less something vaguely like this:
>
> [object header containing various fields]
> [length = 2]
> [array of bytes = 0x0041 0x00C4]
>
> That's for a so-called "narrow build" of Python. If you have a "wide build",
> it will be something like this:
>
> [object header containing various fields]
> [length = 2]
> [array of bytes = 0x00000041 0x000000C4]
>
> In Python 3.3, "narrow builds" and "wide builds" are gone, and you'll have
> something conceptually like this:
>
> [object header containing various fields]
> [length = 2]
> [tag = one byte per character]
> [array of bytes = 0x41 0xC4]
>
> Some other implementations of Python could use UTF-8 internally:
>
> [object header containing various fields]
> [length = 2]
> [array of bytes = 0x41 0xC3 0x84]
>
> or even something more complex. But the important thing is, regardless of
> the internal implementation, Python guarantees that a Unicode string is
> treated as a fixed array of code points. Each code point has a value
> between 0 and, not 127, not 255, not 65535, but 1114111.
In Python 2.7: As I learned from the ord() manual: If a unicode argument is given and Python was built with UCS2 Unicode (I suppose this is the narrow build in your terms), then the character's code point must be in the range [0..65535] inclusive.
I understand: in a UCS2 build each character of a Unicode string uses 16 bits and can represent code points from U+0000..U+FFFF.
From the unichr(i) manual I learn: The valid range for the argument depends how Python was configured -- it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF]. I understand: narrow build is UCS2, wide build is UCS4.
- In a UCS2 build each character of a Unicode string uses 16 bits and has code points from U+0000..U+FFFF (0..65535)
- In a UCS4 build each character of a Unicode string uses 32 bits and has code points from U+0000..U+10FFFF (0..1114111)
Am I right?
-- Kurt Mueller, kurt.alfred.muel...@gmail.com
-- https://mail.python.org/mailman/listinfo/python-list
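[Editor's aside: in Python 3, chr() plays the role of unichr() and the valid range no longer depends on the build; a quick sketch of the fixed boundary:]

```python
# chr() accepts any code point up to U+10FFFF on every Python 3 build.
print(chr(0x41))                 # 'A'
print(ascii(chr(0x10FFFF)))      # highest valid code point, shown escaped

try:
    chr(0x110000)                # one past U+10FFFF
    raised = False
except ValueError:
    raised = True
print(raised)                    # True -- out-of-range is always rejected
```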
Re: How to turn a string into a list of integers?
Am 05.09.2014 um 21:16 schrieb Kurt Mueller:
> Am 05.09.2014 um 20:25 schrieb Chris "Kwpolska" Warrick:
>> On Sep 5, 2014 7:57 PM, "Kurt Mueller" wrote:
>>> Could someone please explain the following behavior to me:
>>> Python 2.7.7, MacOS 10.9 Mavericks
>>>
>>> >>> import sys
>>> >>> sys.getdefaultencoding()
>>> 'ascii'
>>> >>> [ord(c) for c in 'AÄ']
>>> [65, 195, 132]
>>> >>> [ord(c) for c in u'AÄ']
>>> [65, 196]
>>>
>>> My obviously wrong understanding:
>>> 'AÄ' in 'ascii' are two characters, one with ord A=65 and
>>> one with ord Ä=196 (ISO 8859-1)
>>> ---> why [65, 195, 132]?
>>> u'AÄ' is a Unicode string
>>> ---> why [65, 196]?
>>> It is just the other way round from what I would expect.
>> Basically, the first string is just a bunch of bytes, as provided by your
>> terminal -- which sounds like UTF-8 (perfectly logical in 2014). The second
>> one is converted into a real Unicode representation. The codepoint for Ä is
>> U+00C4 (196 decimal). It's just a coincidence that it also matches latin1
>> aka ISO 8859-1, as Unicode starts with all 256 latin1 codepoints. Please
>> kindly forget encodings other than UTF-8.
> So:
> 'AÄ' is a UTF-8 string represented by 3 bytes:
> A -> 41 -> 65 first byte decimal
> Ä -> c3 84 -> 195 and 132, second and third byte decimal
> u'AÄ' is a Unicode string represented by 2 bytes?:
> A -> U+0041 -> 65 first byte decimal, 00 is omitted or not yielded by ord()?
> Ä -> U+00C4 -> 196 second byte decimal, 00 is omitted or not yielded by ord()?
After reading the ord() manual, the second case should read:
u'AÄ' is a Unicode string represented by 2 unicode characters. If Python was built with UCS2 Unicode, then each character's code point must be in the range [0..65535], i.e. 16 bits, U+0000..U+FFFF.
A -> U+0041 -> 65 first character decimal (code point)
Ä -> U+00C4 -> 196 second character decimal (code point)
Am I right now?
-- Kurt Mueller, kurt.alfred.muel...@gmail.com
-- https://mail.python.org/mailman/listinfo/python-list
Re: How to turn a string into a list of integers?
Am 05.09.2014 um 20:25 schrieb Chris "Kwpolska" Warrick:
> On Sep 5, 2014 7:57 PM, "Kurt Mueller" wrote:
>> Could someone please explain the following behavior to me:
>> Python 2.7.7, MacOS 10.9 Mavericks
>>
>> >>> import sys
>> >>> sys.getdefaultencoding()
>> 'ascii'
>> >>> [ord(c) for c in 'AÄ']
>> [65, 195, 132]
>> >>> [ord(c) for c in u'AÄ']
>> [65, 196]
>>
>> My obviously wrong understanding:
>> 'AÄ' in 'ascii' are two characters, one with ord A=65 and
>> one with ord Ä=196 (ISO 8859-1)
>> ---> why [65, 195, 132]?
>> u'AÄ' is a Unicode string
>> ---> why [65, 196]?
>> It is just the other way round from what I would expect.
> Basically, the first string is just a bunch of bytes, as provided by your
> terminal -- which sounds like UTF-8 (perfectly logical in 2014). The second
> one is converted into a real Unicode representation. The codepoint for Ä is
> U+00C4 (196 decimal). It's just a coincidence that it also matches latin1
> aka ISO 8859-1, as Unicode starts with all 256 latin1 codepoints. Please
> kindly forget encodings other than UTF-8.
So:
'AÄ' is a UTF-8 string represented by 3 bytes:
A -> 41 -> 65 first byte decimal
Ä -> c3 84 -> 195 and 132, second and third byte decimal
u'AÄ' is a Unicode string represented by 2 bytes?:
A -> U+0041 -> 65 first byte decimal, 00 is omitted or not yielded by ord()?
Ä -> U+00C4 -> 196 second byte decimal, 00 is omitted or not yielded by ord()?
> BTW: ASCII covers only the first 128 bytes.
ACK
-- Kurt Mueller, kurt.alfred.muel...@gmail.com
-- https://mail.python.org/mailman/listinfo/python-list
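[Editor's aside: the bytes-versus-code-points distinction Chris describes is explicit in Python 3, where str and bytes are separate types; the same two integer lists can be reproduced like this:]

```python
# str is a sequence of code points; bytes is a sequence of byte values.
text = 'AÄ'

utf8_bytes = text.encode('utf-8')
print(list(utf8_bytes))            # [65, 195, 132] -- like Py2 'AÄ' from a UTF-8 terminal
print([ord(c) for c in text])      # [65, 196]      -- like Py2 u'AÄ'

# Decoding the UTF-8 bytes recovers exactly the two code points.
print(utf8_bytes.decode('utf-8') == text)   # True
```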
Re: How to turn a string into a list of integers?
Am 05.09.2014 um 10:42 schrieb c...@isbd.net:
> Joshua Landau wrote:
>> On 3 September 2014 15:48, wrote:
>>> Peter Otten <__pete...@web.de> wrote:
>>>> >>> [ord(c) for c in "This is a string"]
>>>> [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103]
>>>>
>>>> There are other ways, but you have to describe the use case and your Python
>>>> version for us to recommend the most appropriate.
>>> That looks OK to me. It's just for outputting a string to the block
>>> write command in python-smbus which expects an integer array.
>> Just be careful about Unicode characters.
> I have to avoid them completely because I'm sending the string to a
> character LCD with a limited 8-bit only character set.
Could someone please explain the following behavior to me:
Python 2.7.7, MacOS 10.9 Mavericks
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> [ord(c) for c in 'AÄ']
[65, 195, 132]
>>> [ord(c) for c in u'AÄ']
[65, 196]
My obviously wrong understanding:
'AÄ' in 'ascii' are two characters, one with ord A=65 and one with ord Ä=196 (ISO 8859-1)
---> why [65, 195, 132]?
u'AÄ' is a Unicode string
---> why [65, 196]?
It is just the other way round from what I would expect.
Thank you
-- Kurt Mueller, kurt.alfred.muel...@gmail.com
-- https://mail.python.org/mailman/listinfo/python-list
Re: split lines from stdin into a list of unicode strings
Am 05.09.2013 10:33, schrieb Peter Otten:
> Kurt Mueller wrote:
>> Am 29.08.2013 11:12, schrieb Peter Otten:
>>> kurt.alfred.muel...@gmail.com wrote:
>>>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>>>> For some text manipulation tasks I need a template to split lines
>>>>>> from stdin into a list of strings the way shlex.split() does it.
>>>>>> The encoding of the input can vary.
>> I took your script as a template. But I used the libmagic library
>> (python-magic) instead of chardet.
>> See http://linux.die.net/man/3/libmagic
>> and https://github.com/ahupp/python-magic
>> ( I made tests with files of different sizes, up to 1.2 [GB] )
>> I had the following issues:
>> - In a real file, the encoding was detected as 'ascii' for
>> detect_lines=1000. In line 1002 there was an umlaut character, so the
>> line.decode(encoding) failed. I think to add the errors parameter,
>> line.decode(encoding, errors='replace')
> Tough luck ;) You could try and tackle the problem by skipping leading
> ascii-only lines. Untested:
>
> def detect_encoding(instream, encoding, detect_lines, skip_ascii=True):
>     if encoding is None:
>         encoding = instream.encoding
>     if encoding is None:
>         if skip_ascii:
>             try:
>                 for line in instream:
>                     yield line.decode("ascii")
>             except UnicodeDecodeError:
>                 pass
>             else:
>                 return
>         head = [line]
>         head.extend(islice(instream, detect_lines-1))
>         encoding = chardet.detect("".join(head))["encoding"]
>         instream = chain(head, instream)
>     for line in instream:
>         yield line.decode(encoding)
I find this solution as a generator very nice. With just some small modifications it runs fine for now. ( line is undefined if skip_ascii is False. ) For ascii-only files chardet or libmagic will not be bothered, and detect_lines does not come into play until there are some non-ascii characters.
--
def decode_stream_lines( inpt_strm, enco_type, numb_inpt, skip_asci=True, ):
    if enco_type is None:
        enco_type = inpt_strm.encoding
    if enco_type is None:
        line_head = []
        if skip_asci:
            try:
                for line in inpt_strm:
                    yield line.decode( 'ascii' )
            except UnicodeDecodeError:
                line_head = [ line ]    # last line was not ascii
            else:
                return                  # all lines were ascii
        line_head.extend( islice( inpt_strm, numb_inpt - 1 ) )
        magc_enco = magic.open( magic.MAGIC_MIME_ENCODING )
        magc_enco.load()
        enco_type = magc_enco.buffer( "".join( line_head ) )
        magc_enco.close()
        print( I_AM + '-ERROR: enco_type=' + repr( enco_type ), file=sys.stderr, )
        if enco_type.rfind( 'binary' ) >= 0:
            # binary, application/msword, application/vnd.ms-excel and the like
            return
        inpt_strm = chain( line_head, inpt_strm )
    for line in inpt_strm:
        yield line.decode( enco_type, errors='replace' )
--
Thank you very much!
-- Kurt Mueller
-- https://mail.python.org/mailman/listinfo/python-list
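[Editor's aside: the pattern in this generator (decode as ASCII until a line fails, then detect an encoding from a bounded head and switch) can be reduced to a dependency-free sketch; `detect` below is a stand-in assumption for chardet/libmagic, not a real detector:]

```python
from itertools import chain, islice

def decode_lines(byte_lines, detect, sniff_lines=1000):
    """Yield text lines: assume ASCII until a line fails to decode, then
    call detect(head_bytes) -> encoding name and decode the rest with it.
    `detect` stands in for chardet/libmagic here (an assumption)."""
    stream = iter(byte_lines)
    head = []
    try:
        for line in stream:
            yield line.decode('ascii')
    except UnicodeDecodeError:
        head = [line]                       # first non-ASCII line
    else:
        return                              # whole input was ASCII
    head.extend(islice(stream, sniff_lines - 1))
    encoding = detect(b''.join(head))
    for line in chain(head, stream):
        yield line.decode(encoding, errors='replace')

lines = [b'plain ascii\n', 'umlaut \xe4\n'.encode('latin-1')]
out = list(decode_lines(lines, detect=lambda buf: 'latin-1'))
print(out)      # ['plain ascii\n', 'umlaut ä\n']
```

The bounded head (`sniff_lines`) is what keeps memory usage flat on large files, which was the point of Peter's original suggestion.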
Re: split lines from stdin into a list of unicode strings
Am 29.08.2013 11:12, schrieb Peter Otten:
> kurt.alfred.muel...@gmail.com wrote:
>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>> For some text manipulation tasks I need a template to split lines
>>>> from stdin into a list of strings the way shlex.split() does it.
>>>> The encoding of the input can vary.
> You can compromise and read ahead a limited number of lines. Here's my demo
> script (The interesting part is detect_encoding(), I got a bit distracted by
> unrelated stuff...). The script does one extra decode/encode cycle -- it
> should be easy to avoid that if you run into performance issues.
I took your script as a template. But I used the libmagic library (python-magic) instead of chardet.
See http://linux.die.net/man/3/libmagic and https://github.com/ahupp/python-magic
( I made tests with files of different sizes, up to 1.2 [GB] )
I had the following issues:
- In a real file, the encoding was detected as 'ascii' for detect_lines=1000. In line 1002 there was an umlaut character, so the line.decode(encoding) failed. I think to add the errors parameter, line.decode(encoding, errors='replace')
- If the buffer was bigger than about some megabytes, the encoding returned from libmagic was always None. The big files had very long lines ( more than 4k per line ), so with detect_lines=1000 this limit was exceeded.
- magic.buffer() ( the equivalent of chardet.detect() ) takes about 2 seconds per megabyte of buffer.
-- Kurt Mueller
-- https://mail.python.org/mailman/listinfo/python-list
Re: split lines from stdin into a list of unicode strings
Am 29.08.2013 11:12, schrieb Peter Otten:
> kurt.alfred.muel...@gmail.com wrote:
>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>> For some text manipulation tasks I need a template to split lines
>>>> from stdin into a list of strings the way shlex.split() does it.
>>>> The encoding of the input can vary.
> You can compromise and read ahead a limited number of lines. Here's my demo
> script (The interesting part is detect_encoding(), I got a bit distracted by
> unrelated stuff...). The script does one extra decode/encode cycle -- it
> should be easy to avoid that if you run into performance issues.
Thanks Peter! I see the idea. It limits the buffer size / memory usage for the detection.
I have to say that I am a bit disappointed by the chardet library. The encoding for the single character 'ü' is detected as {'confidence': 0.99, 'encoding': 'EUC-JP'}, whereas "file" says:
$ echo "ü" | file -i -
/dev/stdin: text/plain; charset=utf-8
$
"ü" is a character I use very often, as it is in my name: "Müller" :-)
I try to use the "python-magic" library, which has similar functionality to chardet, is used by the "file" unix command, and is expandable with a magic file, see "man file".
My magic_test script:
---
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
from __future__ import print_function
import magic

strg_chck = 'ü'
magc_enco = magic.open( magic.MAGIC_MIME_ENCODING )
magc_enco.load()
print( strg_chck + ' encoding=' + magc_enco.buffer( strg_chck ) )
magc_enco.close()
---
$ magic_test
ü encoding=utf-8
python-magic seems to me a bit more reliable.
Cheers
-- Kurt Mueller
-- http://mail.python.org/mailman/listinfo/python-list
split lines from stdin into a list of unicode strings
This is a follow-up to the Subject "right adjusted strings containing umlauts".
For some text manipulation tasks I need a template to split lines from stdin into a list of strings the way shlex.split() does it. The encoding of the input can vary. For further processing in Python I need the list of strings to be in unicode.
Here is template.py:
##
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
# split lines from stdin into a list of unicode strings
# Muk 2013-08-23
# Python 2.7.3
from __future__ import print_function
import sys
import shlex
import chardet

bool_cmnt = True    # shlex: skip comments
bool_posx = True    # shlex: posix mode (strings in quotes)

for inpt_line in sys.stdin:
    print( 'inpt_line=' + repr( inpt_line ) )
    enco_type = chardet.detect( inpt_line )[ 'encoding' ]    # {'encoding': 'EUC-JP', 'confidence': 0.99}
    print( 'enco_type=' + repr( enco_type ) )
    try:
        strg_inpt = shlex.split( inpt_line, bool_cmnt, bool_posx, )    # shlex does not work on unicode
    except Exception, errr:    # usually 'No closing quotation'
        print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
        continue
    print( 'strg_inpt=' + repr( strg_inpt ) )    # list of strings
    strg_unic = [ strg.decode( enco_type ) for strg in strg_inpt ]    # decode the strings into unicode
    print( 'strg_unic=' + repr( strg_unic ) )    # list of unicode strings
##
$ cat | template.py
Comments are welcome. TIA
-- Kurt Mueller
-- http://mail.python.org/mailman/listinfo/python-list
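[Editor's aside: the comment "shlex does not work on unicode" is a Python 2 limitation; in Python 3, shlex.split() operates on str directly, so the decode step after splitting disappears. A small sketch of the same splitting options:]

```python
import shlex

# comments=True drops '#'-to-end-of-line; posix=True honours quoting.
line = 'one "two words" three  # trailing comment'
parts = shlex.split(line, comments=True, posix=True)
print(parts)                       # ['one', 'two words', 'three']

# The failure mode handled in template.py above:
try:
    shlex.split('no closing "quote')
except ValueError as exc:
    print('split failed:', exc)    # 'No closing quotation'
```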
Re: right adjusted strings containing umlauts
Am 08.08.2013 18:37, schrieb Chris Angelico:
> On Thu, Aug 8, 2013 at 5:16 PM, Kurt Mueller wrote:
>> Am 08.08.2013 17:44, schrieb Peter Otten:
>>> Kurt Mueller wrote:
>>>> What do I do, when input_strings/output_list has other codings like
>>>> iso-8859-1?
>>> You have to know the actual encoding. With that information it's easy:
>>> >>> output_list
>>> ['\xc3\xb6', '\xc3\xbc', 'i', 's', 'f']
>>> >>> encoding = "utf-8"
>>> >>> output_list = [s.decode(encoding) for s in output_list]
>>> >>> print output_list
>>> [u'\xf6', u'\xfc', u'i', u's', u'f']
>> How do I get to know the actual encoding? I read from stdin. There can
>> be different encodings. Usually utf8, but also iso-8859-1/latin9 are
>> to be expected. But sys.stdin.encoding always says 'None'.
> If you can switch to Python 3, life becomes a LOT easier. The Python 3
> input() function (which does the same job as raw_input() from Python
> 2) returns a Unicode string, meaning that it takes care of encodings
> for you.
Because I cannot switch to Python 3 for now, my life is not so easy :-)
For some text manipulation tasks I need a template to split lines from stdin into a list of strings the way shlex.split() does it. The encoding of the input can vary. For further processing in Python I need the list of strings to be in unicode.
Here is template.py:
##
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
# split lines from stdin into a list of unicode strings
# Muk 2013-08-23
# Python 2.7.3
from __future__ import print_function
import sys
import shlex
import chardet

bool_cmnt = True    # shlex: skip comments
bool_posx = True    # shlex: posix mode (strings in quotes)

for inpt_line in sys.stdin:
    print( 'inpt_line=' + repr( inpt_line ) )
    enco_type = chardet.detect( inpt_line )[ 'encoding' ]    # {'encoding': 'EUC-JP', 'confidence': 0.99}
    print( 'enco_type=' + repr( enco_type ) )
    try:
        strg_inpt = shlex.split( inpt_line, bool_cmnt, bool_posx, )    # shlex does not work on unicode
    except Exception, errr:    # usually 'No closing quotation'
        print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
        continue
    print( 'strg_inpt=' + repr( strg_inpt ) )    # list of strings
    strg_unic = [ strg.decode( enco_type ) for strg in strg_inpt ]    # decode the strings into unicode
    print( 'strg_unic=' + repr( strg_unic ) )    # list of unicode strings
##
$ cat | template.py
Comments are welcome. TIA
-- Kurt Mueller
-- http://mail.python.org/mailman/listinfo/python-list
Re: right adjusted strings containing umlauts
Now I have this small example:
--
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
from __future__ import print_function
import sys, shlex

print( repr( sys.stdin.encoding ) )
strg_form = u'{0:>3} {1:>3} {2:>3} {3:>3} {4:>3}'
for inpt_line in sys.stdin:
    proc_line = shlex.split( inpt_line, False, True, )
    encoding = "utf-8"
    proc_line = [ strg.decode( encoding ) for strg in proc_line ]
    print( strg_form.format( *proc_line ) )
--
$ echo -e "a b c d e\na ö u 1 2" | file -
/dev/stdin: UTF-8 Unicode text
$ echo -e "a b c d e\na ö u 1 2" | ./align_compact.py
None
  a   b   c   d   e
  a   ö   u   1   2
$ echo -e "a b c d e\na ö u 1 2" | recode utf8..latin9 | file -
/dev/stdin: ISO-8859 text
$ echo -e "a b c d e\na ö u 1 2" | recode utf8..latin9 | ./align_compact.py
None
  a   b   c   d   e
Traceback (most recent call last):
  File "./align_compact.py", line 13, in <module>
    proc_line = [ strg.decode( encoding ) for strg in proc_line ]
  File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 0: invalid start byte
muk@mcp20:/sw/prog/scripts/text_manip>
How do I handle these two inputs? TIA
-- Kurt Mueller
-- http://mail.python.org/mailman/listinfo/python-list
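[Editor's aside: the hard-coded `encoding = "utf-8"` is what breaks on the recoded input, since 0xF6 is not a valid UTF-8 start byte. One common workaround, sketched here in Python 3 terms as a heuristic rather than real detection: try UTF-8 first and fall back to Latin-1, which can decode any byte sequence.]

```python
def decode_guess(raw, encodings=('utf-8', 'latin-1')):
    """Try candidate encodings in order; latin-1 never raises, so it
    acts as a last-resort fallback. A heuristic, not detection."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue

print(decode_guess(b'a \xc3\xb6 u'))   # ('a ö u', 'utf-8')   -- UTF-8 input
print(decode_guess(b'a \xf6 u'))       # ('a ö u', 'latin-1') -- Latin-1/Latin-9 input
```

The trade-off: Latin-1 output is always *something*, but mojibake is possible if the real encoding was neither candidate.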
Re: right adjusted strings containing umlauts
Am 08.08.2013 17:44, schrieb Peter Otten:
> Kurt Mueller wrote:
>> What do I do, when input_strings/output_list has other codings like
>> iso-8859-1?
> You have to know the actual encoding. With that information it's easy:
> >>> output_list
> ['\xc3\xb6', '\xc3\xbc', 'i', 's', 'f']
> >>> encoding = "utf-8"
> >>> output_list = [s.decode(encoding) for s in output_list]
> >>> print output_list
> [u'\xf6', u'\xfc', u'i', u's', u'f']
How do I get to know the actual encoding? I read from stdin. There can be different encodings. Usually utf8, but also iso-8859-1/latin9 are to be expected. But sys.stdin.encoding always says 'None'.
TIA
-- Kurt Mueller
-- http://mail.python.org/mailman/listinfo/python-list
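[Editor's aside: `sys.stdin.encoding` is None in Python 2 whenever stdin is a pipe rather than a terminal. The Python 3 remedy is to wrap the underlying byte stream with an explicit encoding; the same idea works on any binary file object, demonstrated here with an in-memory stream standing in for `sys.stdin.buffer`:]

```python
import io

raw = io.BytesIO('a ö u\n'.encode('utf-8'))        # stands in for sys.stdin.buffer
text_in = io.TextIOWrapper(raw, encoding='utf-8')  # explicit, pipe-independent
line = text_in.readline()
print(repr(line))       # 'a ö u\n' -- decoded text, regardless of tty/pipe
```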
Re: right adjusted strings containing umlauts
Am 08.08.2013 16:43, schrieb jfhar...@gmail.com:
> On Thursday, 8 August 2013 15:23:46 UTC+1, Kurt Mueller wrote:
>> I'd like to print strings right adjusted.
>> print( '>{0:>3}<'.format( 'ä' ) )
> Make both strings unicode
> print( u'>{0:>3}<'.format( u'ä' ) )
> Why not use rjust for it though?
> u'ä'.rjust(3)
In real life there is a list of strings in output_list from a command like:
output_list = shlex.split( input_string, bool_cmnt, bool_posi, )
input_string is from a file, bool_* are either True or False.
repr( output_list )
['\xc3\xb6', '\xc3\xbc', 'i', 's', 'f']
which should be printed right aligned, using:
print( u'{0:>3} {1:>3} {2:>3} {3:>3} {4:>3}'.format( *output_list ) )
( In real life, the alignment and the width are variable. )
How do I prepare output_list the pythonic way to be unicode strings? What do I do when input_strings/output_list has other codings like iso-8859-1?
TIA
-- Kurt Mueller
-- http://mail.python.org/mailman/listinfo/python-list
right adjusted strings containing umlauts
I'd like to print strings right adjusted. ( Python 2.7.3, Linux 3.4.47-2.38-desktop )
from __future__ import print_function
print( '>{0:>3}<'.format( 'a' ) )
>  a<
But if the string contains an umlaut:
print( '>{0:>3}<'.format( 'ä' ) )
> ä<
Same with % notation:
print( '>%3s<' % ( 'a' ) )
>  a<
print( '>%3s<' % ( 'ä' ) )
> ä<
For a string with no umlaut it uses 3 characters, but for an umlaut it uses only 2 characters. I guess it has to do with unicode. How do I get it right?
TIA
-- Kurt Mueller
-- http://mail.python.org/mailman/listinfo/python-list
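[Editor's aside: the misalignment happens because the Python 2 byte string 'ä' is two UTF-8 bytes and the width counts bytes. In Python 3 (or with u'' strings in Python 2) widths count code points, so the umlaut aligns correctly; a sketch in Python 3 terms:]

```python
# str formatting counts code points, so 'ä' pads like any other character.
print('>' + 'ä'.rjust(3) + '<')       # >  ä<  -- two pad spaces
print('>{0:>3}<'.format('ä'))         # >  ä<  -- same via format()

# The Python 2 symptom reappears if you align the UTF-8 *bytes* instead:
b = 'ä'.encode('utf-8')               # two bytes: 0xC3 0xA4
print(b.rjust(3))                     # only one pad byte added
```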
Re: how to detect the character encoding in a web page ?
Am 24.12.2012 um 04:03 schrieb iMath:
> but how to let python do it for you ?
> such as these 2 pages
> http://python.org/
> http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
> how to detect the character encoding in these 2 pages by python ?
If you have the html code, let chardetect.py make an educated guess for you. http://pypi.python.org/pypi/chardet
Example:
$ wget -q -O - http://python.org/ | chardetect.py
stdin: ISO-8859-2 with confidence 0.803579722043
$
$ wget -q -O - 'http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx' | chardetect.py
stdin: utf-8 with confidence 0.87625
$
Grüessli
-- kurt.alfred.muel...@gmail.com
-- http://mail.python.org/mailman/listinfo/python-list
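[Editor's aside: for web pages specifically, the charset is often declared in the HTML itself (or in the HTTP Content-Type header), which is cheaper and more reliable than statistical guessing. A stdlib-only sketch; the regex is a simplified assumption, not the full HTML5 encoding-sniffing algorithm:]

```python
import re

def declared_charset(html_bytes):
    """Look for a charset declaration in the first 2 KiB of the page.
    Matches both <meta charset="..."> and the charset= inside an
    http-equiv Content-Type value. Returns None if nothing is declared."""
    head = html_bytes[:2048]
    m = re.search(rb"<meta[^>]+charset=[\"']?([\w.-]+)", head, re.I)
    return m.group(1).decode('ascii').lower() if m else None

page = b'<html><head><meta charset="UTF-8"></head><body>...</body></html>'
print(declared_charset(page))      # utf-8
```

Falling back to chardet only when no declaration is found keeps the expensive statistical pass for the rare undeclared pages.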
Re: Donald E. Knuth in Python, cont'd
Am 11.04.2012 16:38, schrieb mwil...@the-wire.com:
> Antti J Ylikoski wrote:
>> I wrote about a straightforward way to program D. E. Knuth in Python,
> Maybe it's godly coded...
> $ python Easter.py
> The year: 2012
> (20, 'th ', 'APRIL, ', 2012)
> $
AFAIK it was the 8th of April 2012?
Grüessli
-- Kurt Müller, m...@problemlos.ch
-- http://mail.python.org/mailman/listinfo/python-list
Re: while True or while 1
Am 14.12.2010 11:33, schrieb Hans-Peter Jansen:
> On Tuesday 14 December 2010, 10:19:04 Gregory Ewing wrote:
>> Steven D'Aprano wrote:
>>> >>> while True:
>>> ...     print "Looping"
>>> ...     True = 0
>> Just remember that if you use that inside a function, you'll
>> have to initialise True to True before... er, wait a moment,
>> that won't work... ah, I know:
>>
>> def f(true = True):
>>     True = true
>>     while True:
>>         ...
>>         True = False
> Thankfully, with Python 3 this code falls flat on its face.
> If I would have to _consume_ code like that more often,
> it would require me to also use a vomit resistant keyboard cover..
> Pete
True yesterday, today and in the future:
Yesterday: "Pilate said to him, True? what is true? Having said this he went out again to the Jews and said to them, I see no wrong in him."
Today: We are so thankful that today we are free to define "True" ourselves using Python 2.x.
Future: Be warned, the future gets darker! ;-)
Grüessli
-- Kurt Mueller
-- http://mail.python.org/mailman/listinfo/python-list
Re: Glob in python which supports the ** wildcard
Hi,
Am 22.11.2010 um 23:05 schrieb Stefan Sonnenberg-Carstens:
> Am 22.11.2010 22:43, schrieb Martin Lundberg:
>> I want to be able to let the user enter paths like this:
>> apps/name/**/*.js
>> and then find all the matching files in apps/name and all its
>> subdirectories. However I found out that Python's glob function
>> doesn't support the recursive ** wildcard. Is there any 3rd party glob
>> function which does support **?
> os.walk() or os.path.walk() can be used.
> You need to traverse the file system.
> AFAIK there is no support for this.
If you are a lucky Unix/Linux/MacOS user:
---
#!/usr/bin/env python
# find files
import os

cmd = 'find apps/name/ -type f -name "*.js" -print'  # find is a standard Unix tool
for filename in os.popen(cmd).readlines():           # run find command
    pass  # do something with filename
---
find is very powerful, really.
Have a nice day
-- kurt.alfred.muel...@gmail.com
-- http://mail.python.org/mailman/listinfo/python-list
Re: Glob in python which supports the ** wildcard
Hi,
Am 22.11.2010 um 23:05 schrieb Stefan Sonnenberg-Carstens:
> Am 22.11.2010 22:43, schrieb Martin Lundberg:
>> I want to be able to let the user enter paths like this:
>> apps/name/**/*.js
>> and then find all the matching files in apps/name and all its
>> subdirectories. However I found out that Python's glob function
>> doesn't support the recursive ** wildcard. Is there any 3rd party glob
>> function which does support **?
> os.walk() or os.path.walk() can be used.
> You need to traverse the file system.
> AFAIK there is no support for this.
Or Python only:
--
#!/usr/bin/env python
import os, fnmatch

# generator:
def find_files(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                filename = os.path.join(root, basename)
                yield filename

# process each file as it is found:
for filename in find_files('apps/name', '*.js'):
    print 'found java source:', filename
--
Found at http://stackoverflow.com/questions/2186525/use-a-glob-to-find-files-recursively-in-python
Have a nice day
-- kurt.alfred.muel...@gmail.com
-- http://mail.python.org/mailman/listinfo/python-list
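[Editor's aside: this thread predates Python 3.5, where glob itself gained support for the `**` wildcard via `recursive=True`, and pathlib offers the same; a sketch against a throwaway directory tree:]

```python
import glob, os, pathlib, shutil, tempfile

# Build a small tree to search in.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'apps', 'name', 'sub'))
for rel in ('apps/name/a.js', 'apps/name/sub/b.js', 'apps/name/c.txt'):
    open(os.path.join(root, *rel.split('/')), 'w').close()

# '**' matches zero or more directories when recursive=True,
# so both apps/name/a.js and apps/name/sub/b.js are found.
hits = sorted(glob.glob(os.path.join(root, 'apps', 'name', '**', '*.js'),
                        recursive=True))
print([os.path.relpath(p, root) for p in hits])

# pathlib spells the same search as Path.glob with '**':
hits2 = sorted(str(p) for p in pathlib.Path(root).glob('apps/name/**/*.js'))
print(hits == hits2)               # True

shutil.rmtree(root, ignore_errors=True)   # clean up the throwaway tree
```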
Re: An assessment of the Unicode standard
Am 01.09.2009 um 09:39 schrieb Terry Reedy:
> But this same problem also extends into monies, nation states, units of
> measure, etc. There is, of course, an international system of measure. The
> US is the only major holdout. (I recall Burma, or somesuch, is another.)
> An interesting proposition would be for the US to adopt the metric system
> in exchange for the rest of the world adopting simplified basic English
> as a common language.
The SI system is nearly universally employed. Three principal exceptions are Burma (Myanmar), Liberia, and the United States. The United Kingdom has officially adopted the International System of Units, but not with the intention of replacing customary measures entirely.
When I was a student, they told us that in a couple of years there would be the SI system only, because most countries had accepted it in their laws. So we should adopt it. That was in the early 70s. Only this year we had to deliver results of technical processes to British and US companies. They still want them in their "crazy outdated" units.
The other thing would be for the US to adopt a "simplified basic English". I would not be astonished if British people stated that they already do :-)
Grüessli
-- http://mail.python.org/mailman/listinfo/python-list
Re: string processing question
Sion Arrowsmith wrote:
> Kurt Mueller wrote:
>> :> python -c 'print unicode("ä", "utf8")'
>> ä
>> :> python -c 'print unicode("ä", "utf8")' | cat
>> Traceback (most recent call last):
>>   File "<string>", line 1, in <module>
>> UnicodeEncodeError: 'ascii' codec can't encode characters in position
>> 0-1: ordinal not in range(128)
> $ python -c 'import sys; print sys.stdout.encoding'
> UTF-8
> $ python -c 'import sys; print sys.stdout.encoding' | cat
> None
>
> If print gets a Unicode string, it does an implicit
> .encode(sys.stdout.encoding or sys.getdefaultencoding()) on it.
> If you want your output to be guaranteed UTF-8, you'll need to
> explicitly .encode("utf8") it yourself.
This now works correctly with and without piping:
python -c 'a=unicode("ä", "utf8") ; print (a.encode("utf8"))'
In my python source code I have these two lines first:
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
So the source code itself and the strings in the source code are interpreted as utf-8. But from the command line python interprets the code as 'latin_1', I presume. That is why I have to convert the "ä" with unicode(). Am I right?
> (I dare say this is slightly different in 3.x .)
I heard about it, but I will wait to move to 3.x until it's time to...
Thanks
-- Kurt Müller, m...@problemlos.ch
-- http://mail.python.org/mailman/listinfo/python-list
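[Editor's aside: Sion's advice (encode explicitly rather than rely on what stdout is connected to) can be sketched in Python 3 terms; an in-memory byte stream stands in for the pipe here, since a real pipe is not available inside a snippet:]

```python
import io

pipe = io.BytesIO()                   # stands in for a piped stdout
text = 'a ä z\n'

# Encoding explicitly guarantees UTF-8 bytes no matter where they go;
# in a real script this would be sys.stdout.buffer.write(...) in Py3.
pipe.write(text.encode('utf-8'))
print(pipe.getvalue())                # b'a \xc3\xa4 z\n'
```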
Re: string processing question
Scott David Daniels schrieb:
> To discover what is happening, try something like:
>   python -c 'for a in "ä", unicode("ä"): print len(a), a'
>
> I suspect that in your encoding, "ä" is two bytes long, and in
> unicode it is converted to a single character.

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
2 ä
1 ä
:>

Yes it is. That is one of the two problems I see. The solution for this is to unicode(<string>, <encoding>) each string. I'd like to have my Python programs unicode-enabled.

:> python -c 'for a in "ä", unicode("ä"): print len(a), a'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

It seems that the default encoding is "ascii", so unicode() cannot cope with "ä". If I specify "utf8" for the encoding, unicode() works:

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
2 ä
1 ä
:>

But the print statement yields a UnicodeEncodeError if I pipe the output to a program or a file:

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
2 ä
1

So it seems to me that piping the output changes the behavior of the print statement:

:> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a, len(a), type(a)'
ä 2 <type 'str'>
ä 1 <type 'unicode'>
:> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a, len(a), type(a)' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
ä 2 <type 'str'>

How can I achieve that my Python programs are unicode-enabled?
- Input strings can have different encodings (mostly ascii, latin_1 or utf8).
- My Python programs should always output "utf8".
Is that a good idea?
TIA
--
Kurt Müller, m...@problemlos.ch
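The two-bytes-versus-one-character effect discussed above is easy to reproduce in Python 3, where the byte/text split is explicit (an illustrative sketch, not part of the original thread):

```python
# In Python 3 a source literal "ä" is text (str); its UTF-8 encoding
# is a separate bytes object.
raw = "ä".encode("utf-8")    # b'\xc3\xa4' -- two bytes
print(len(raw))              # → 2

text = raw.decode("utf-8")   # one code point, U+00E4
print(len(text))             # → 1
```

This is exactly the `len()` difference Scott's one-liner exposed: `len()` on the byte string counts bytes, on the decoded string it counts code points.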
Re: string processing question
Paul McGuire schrieb:
> --
> Weird. What happens if you change the second print statement to:
>   print b.center(6,u"-")

Same behavior. I have an even more minimal example:

:> python -c 'print unicode("ä", "utf8")'
ä
:> python -c 'print unicode("ä", "utf8")' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

The only difference is whether the output is piped to another program or to a file. Maybe we leave the other issue, the different centering, for the moment. My goal is to have my Python programs unicode-enabled.

TIA
--
Kurt Müller, m...@problemlos.ch
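The pipe-changes-everything effect has a simple cause that can be demonstrated in any Python version: when output goes to a pipe, the child process's stdout is not a terminal, and (in Python 2) `sys.stdout.encoding` was then `None`, so `print` fell back to ASCII. A small sketch, run against the current interpreter:

```python
import subprocess
import sys

# Run a child whose stdout is captured through a pipe. The child sees
# a non-tty stdout, which is the condition that made Python 2's print
# fall back to the ASCII codec.
code = "import sys; print(sys.stdout.isatty())"
result = subprocess.run([sys.executable, "-c", code],
                        capture_output=True, text=True)
print(result.stdout.strip())   # → False
```

Run the same `-c` snippet directly in a terminal and it prints `True`; the program's behaviour genuinely depends on where its output goes.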
string processing question
Hi,

On a Linux system with Python 2.5.1 I see the following behaviour, which I do not understand:

Case 1:
> python -c 'a="ä"; print a ; print a.center(6,"-") ; b=unicode(a, "utf8"); print b.center(6,"-")'
ä
--ä--
--ä---
>

Case 2 - a UnicodeEncodeError in this case:
> python -c 'a="ä"; print a ; print a.center(20,"-") ; b=unicode(a, "utf8"); print b.center(20,"-")' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 9: ordinal not in range(128)
ä
---------ä---------
>

The behaviour changes if I pipe the output to another program or to a file. Also, the centering with the string a is not correct, but with the string b it is.

Could somebody please explain this to me?

Thanks in advance
--
Kurt Müller, m...@problemlos.ch
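The centering asymmetry in the question above carries over to Python 3, where the byte/text distinction is explicit (a sketch for illustration, not the original 2.5.1 session):

```python
text = "ä"                    # one character
raw = text.encode("utf-8")    # two bytes: b'\xc3\xa4'

# str.center counts characters, bytes.center counts bytes, so the
# byte-string result comes out one display column short.
print(text.center(6, "-"))    # → --ä---
print(raw.center(6, b"-"))    # → b'--\xc3\xa4--'  (6 bytes, 5 columns)
```

This is why, in the original case 1, the byte string `a` centered "incorrectly" while the unicode string `b` centered as expected: `center()` was padding to a byte count, not a character count.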
Re: Minor Typo in doc
Steve Holden schrieb:
> Kurt Mueller wrote:
>> Hi
>> There is a minor typo in the new doc in:
>> http://www.python.org/doc/2.6/library/signal.html
>> --
>> signal.SIG_DFL
>> This is one of two standard signal handling options;
>> it will simply perform the default function for the signal.
>> For example, on most systems the default action for SIGQUIT
>> is to dump core and exit, while the default action for SIGCLD
>> is to simply ignore it.
>> --
>> SIGCLD should be SIGCHLD.
>> Should I make a bug report in http://bugs.python.org?
> Yes. The documentation gives you a link to follow from the bottom of
> each page.
> regards
> Steve

Yesterday evening I made a bug report on bugs.python.org (I opened an issue). This morning it had already been fixed. (It was assigned to benjamin.peterson, he fixed it, and the issue was closed.)

> Benjamin Peterson added the comment:
> Thanks for the report! Fixed in r67848.
> --
> nosy: +benjamin.peterson
> resolution: -> fixed
> status: open -> closed

So, that is a great community, isn't it? Thanks to all who make this possible.

Grüessli
--
Kurt Müller, m...@problemlos.ch
Minor Typo in doc
Hi

There is a minor typo in the new doc in:
http://www.python.org/doc/2.6/library/signal.html
--
signal.SIG_DFL
This is one of two standard signal handling options;
it will simply perform the default function for the signal.
For example, on most systems the default action for SIGQUIT
is to dump core and exit, while the default action for SIGCLD
is to simply ignore it.
--
SIGCLD should be SIGCHLD.

Should I make a bug report in http://bugs.python.org?

Grüessli
--
Kurt Müller, m...@problemlos.ch
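For readers skimming the quoted doc passage, the two standard handler options it describes can be exercised in a few lines (a sketch assuming a POSIX system, where SIGCHLD exists):

```python
import signal

# SIG_IGN ignores the signal; SIG_DFL restores the default action
# (for SIGCHLD the default is to ignore it, per the quoted docs).
signal.signal(signal.SIGCHLD, signal.SIG_IGN)
signal.signal(signal.SIGCHLD, signal.SIG_DFL)

print(signal.getsignal(signal.SIGCHLD) is signal.SIG_DFL)   # → True
```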
Re: HARD REAL TIME PYTHON
Am 08.10.2008 um 06:59 schrieb Hendrik van Rooyen:
> "Blubaugh, David A." wrote:
>> I have done some additional research into the possibility of utilizing
>> Python for hard real-time development. I have seen on various websites
>> where this has been discussed before on the internet. However, I was
>> wondering how successful anyone has truly been in developing a program
>> project, either in Windows or in Linux, that was at or extremely close
>> to real-time constraints? For example, is it possible to develop a
>> Python program that can address an interrupt or execute an operation
>> within 70 Hz or less? Are there any additional considerations that I
>> should investigate first regarding this matter?
> [...]
> If I run it between 2 PCs, faking the I/O by writing to a disk, I
> sometimes get up to 250 such "pings" per second. Going full duplex would
> pass a lot more info, but you lose the stimulus-response nature, which
> is kind of nice to have in a control environment. It's not real time,
> but it's not exactly yesterday's stuff either.

OK, this gives an impression of SPEED.

> We have also used Python to do the HMI for an injection moulding
> machine, talking to a custom controller with an 8031 and an ARM on it
> via an RS-422 link running at 115200 - and the Python keeps the link
> fully occupied.

In your application the REAL-TIME requirements are isolated and implemented in the custom controller. And the HMI, which has to be fast enough (SPEED, not HARD REAL TIME), you implemented in Python, probably on standard hardware/OS. That is exactly what I suggested the OP consider.

> So don't be afraid - go for it!

As long as we do not know more details of the requirements for the OP's application, I think it is a bit early to give this advice, although Python is a great programming language in many respects.
The only requirements we know of the OP's program project are:
- able to address interrupts (Python can)
- execute an operation within 70 Hz or less (Python can)
But we do not know what operation has to be done at this rate.

My concern is to point out that the terms SPEED, REAL-TIME and HARD REAL TIME should not be misused or misunderstood. Read: http://en.wikipedia.org/wiki/Real-time

Grüessli
--
Kurt Müller, [EMAIL PROTECTED]
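To make the SPEED-versus-HARD-REAL-TIME distinction concrete: a plain Python loop on a general-purpose OS can be timed against a 70 Hz deadline like this (a rough sketch; it measures typical lateness, not a guaranteed bound, and the absence of any guaranteed bound is precisely why this is not hard real time):

```python
import time

PERIOD = 1.0 / 70            # target: one operation every ~14.3 ms
worst = 0.0

deadline = time.perf_counter()
for _ in range(70):          # run for roughly one second
    deadline += PERIOD
    # ... the periodic operation would go here ...
    time.sleep(max(0.0, deadline - time.perf_counter()))
    lateness = time.perf_counter() - deadline
    worst = max(worst, lateness)

# Usually a fraction of a millisecond on an idle box, but the OS
# scheduler promises nothing - a page fault or a busy neighbour can
# blow any single deadline.
print("worst lateness: %.3f ms" % (worst * 1e3))
```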
Re: HARD REAL TIME PYTHON
Am 07.10.2008 um 11:44 schrieb Diez B. Roggisch:
> Kurt Mueller wrote:
>> David,
>> As others mentioned before, Python is not the right tool for "HARD REAL
>> TIME". But: maybe you can isolate the part of your application that
>> needs "HARD REAL TIME". Then implement this part in an appropriate
>> environment (language, OS, HW). Then implement the rest of your
>> application, which is not "HARD REAL TIME", in Python.
> I've done this using RTAI + ctypes. Of course the hard realtime tasks
> are written in C - but only the absolutely minimal core. Works like a
> charm.

(Btw, what is this application like?)

Yes. The key is to !*isolate*! the part of the application that needs "HARD REAL TIME". That is what you have done with the "absolutely minimal core" in RTAI + ctypes.

Sometimes it is even questionable whether an application really needs "HARD REAL TIME". Sometimes "soft real time" is enough, or even "fast enough" is enough. Good enough is good enough! "HARD REAL TIME" is mostly expensive.

Grüessli
--
Kurt Müller, [EMAIL PROTECTED]
Re: HARD REAL TIME PYTHON
David,

Am 07.10.2008 um 01:25 schrieb Blubaugh, David A.:
> I have done some additional research into the possibility of utilizing
> Python for hard real-time development. I have seen on various websites
> where this has been discussed before on the internet. However, I was
> wondering how successful anyone has truly been in developing a program
> project, either in Windows or in Linux, that was at or extremely close
> to real-time constraints? For example, is it possible to develop a
> Python program that can address an interrupt or execute an operation
> within 70 Hz or less? Are there any additional considerations that I
> should investigate first regarding this matter?

As others mentioned before, Python is not the right tool for "HARD REAL TIME". But: maybe you can isolate the part of your application that needs "HARD REAL TIME". Then implement this part in an appropriate environment (language, OS, HW). Then implement the rest of your application, which is not "HARD REAL TIME", in Python.

To be more helpful, we should know what you mean by "HARD REAL TIME". Do you mean:
- handle at least 70 interrupts per second ("SPEED")
- if one fails, this is catastrophic for the application ("HARD")
- deliver a response to an interrupt within 5-10 ms ("REAL TIME")
See http://en.wikipedia.org/wiki/Real-time

Grüessli
--
Kurt Müller, [EMAIL PROTECTED]
Re: Getting pid of a remote process
srinivasan srinivas schrieb:
> Thanks a lot. But I am wondering: will it return the correct PID if more
> than one instance runs on the remote machine?
> Thanks, Srini

On a UNIX-like OS: if you start the process in the background, you can get the PID with:

:~> ssh <host> 'ls -l & echo PID=$!' | grep PID
PID=30596
:~>

See: man bash -> Special Parameters

Grüessli
--
Kurt Müller, [EMAIL PROTECTED]
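The same `$!` trick can be driven from Python. A hedged sketch, with the `ssh <host>` part replaced by a local `sh -c` purely so it runs anywhere; the shell expands `$!` to the PID of the last background job, so the answer is specific to the instance just started, no matter how many others are running:

```python
import subprocess

# Start a command in the background and echo its PID on stdout,
# exactly as in the ssh one-liner above (sleep stands in for ls -l).
cmd = "sleep 0.2 & echo PID=$!"
out = subprocess.run(["sh", "-c", cmd], capture_output=True, text=True)

# Fish the PID=... line out of whatever else the command printed.
line = next(l for l in out.stdout.splitlines() if l.startswith("PID="))
pid = int(line.split("=", 1)[1])
print(pid)   # the PID of that specific background process
```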
Re: String split with " and/or ' and/or \
Peter Otten schrieb:
> Kurt Mueller wrote:
>> How to (super)split a string (literal) containing " and/or ' and/or \.
>> example:
>> ' a " b b " c\ c '.supersplit(' ')
>> -> ['a', ' b b ', 'c c']
>
> >>> import shlex
> >>> shlex.split(' a " b b " c\ c ')
> ['a', ' b b ', 'c c']

Thanks Peter, thanks Paul. shlex is what I was looking for.

Grüessli
--
Kurt Müller, [EMAIL PROTECTED]
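A self-contained version of the shlex answer quoted above, with the shell-like rules spelled out (note the doubled backslash, needed inside an ordinary Python string literal):

```python
import shlex

# shlex.split applies shell-style lexing: double quotes group
# " b b " into one token and preserve its inner spaces, while the
# backslash escapes the space between the two c's.
tokens = shlex.split(' a " b b " c\\ c ')
print(tokens)   # → ['a', ' b b ', 'c c']
```

The same rules cover single quotes, which is why it answers the original `"` and/or `'` and/or `\` question in one call.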
String split with " and/or ' and/or \
How to (super)split a string (literal) containing " and/or ' and/or \?

Example:
' a " b b " c\ c '.supersplit(' ')
-> ['a', ' b b ', 'c c']

Thanks and Grüessli
--
Kurt Müller: [EMAIL PROTECTED]