Re: Case-insensitive sorting of strings (Python newbie)
John Sampson wrote:

> I notice that the string method 'lower' seems to convert some strings
> (input from a text file) to Unicode but not others. This messes up
> sorting if it is used on arguments of 'sorted' since Unicode strings
> come before ordinary ones.
>
> Is there a better way of case-insensitive sorting of strings in a list?
> Is it necessary to convert strings read from a plaintext file to
> Unicode? If so, how? This is Python 2.7.8.

The standard recommendation is to convert bytes to unicode as early as possible and only manipulate unicode. This is more likely to give correct results when slicing or converting a string.

$ cat tmp.txt
ähnlich
üblich
nötig
möglich
Maß
Maße
Masse
ÄHNLICH
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> for line in open("tmp.txt"):
...     line = line.strip()
...     print line, line.lower()
...
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH Ähnlich

Now the same with unicode. To read text with a specific encoding use either codecs.open() or io.open() instead of the built-in open() (replace utf-8 with your actual encoding):

>>> import io
>>> for line in io.open("tmp.txt", encoding="utf-8"):
...     line = line.strip()
...     print line, line.lower()
...
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH ähnlich

Unfortunately this will not give the order that you (or a German speaker in the example below) will probably expect:

>>> print "".join(sorted(io.open("tmp.txt"), key=unicode.lower))
Masse
Maß
Maße
möglich
nötig
ähnlich
ÄHNLICH
üblich

For case-insensitive sorting you get better results with locale.strxfrm(), but this doesn't accept unicode:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> print "".join(sorted(io.open("tmp.txt"), key=locale.strxfrm))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

As a workaround you can sort the byte strings first:

>>> print "".join(sorted(open("tmp.txt"), key=locale.strxfrm))
ähnlich
ÄHNLICH
Maß
Masse
Maße
möglich
nötig
üblich

You should still convert the result to unicode if you want to do further processing in Python.

--
https://mail.python.org/mailman/listinfo/python-list
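For comparison, here is a rough Python 3 sketch of the locale-based sort described above. In Python 3, locale.strxfrm() takes text strings, so no bytes workaround is needed. The helper name collation_sorted and the fallback to str.casefold are assumptions made for illustration, and the result depends on which locales are installed on the machine, so no output is shown.

```python
import locale

WORDS = ["ähnlich", "üblich", "nötig", "möglich", "Maß", "Maße", "Masse", "ÄHNLICH"]


def collation_sorted(words, locale_name="de_DE.UTF-8"):
    """Sort words using the collation rules of the named locale.

    Hypothetical helper: falls back to a plain case-folded sort when the
    requested locale is not installed on this system.
    """
    # Calling setlocale() with only a category queries the current setting.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return sorted(words, key=str.casefold)
    try:
        # In Python 3, strxfrm() accepts str, unlike the 2.7 session above.
        return sorted(words, key=locale.strxfrm)
    finally:
        # Restore the previous collation setting; setlocale() is process-wide.
        locale.setlocale(locale.LC_COLLATE, previous)


result = collation_sorted(WORDS)
print("\n".join(result))
```

Because setlocale() changes process-wide state, the sketch restores the previous setting after sorting. Code that sorts from several threads would need a different approach.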
Re: Case-insensitive sorting of strings (Python newbie)
John Sampson wrote:

> I notice that the string method 'lower' seems to convert some strings
> (input from a text file) to Unicode but not others.

I don't think so. You're going to have to show an example.

I *think* what you might be running into is an artifact of printing to a terminal, which may (or may not) interpret some byte sequences as UTF-8 characters, but I can't replicate it. So I'll have to see an example.

Please state what OS you are running on, and what encoding your terminal is set to. Also, are you opening the file in text mode or binary mode?

> This messes up sorting if it is used on arguments of 'sorted' since
> Unicode strings come before ordinary ones.
>
> Is there a better way of case-insensitive sorting of strings in a list?
> Is it necessary to convert strings read from a plaintext file to
> Unicode? If so, how? This is Python 2.7.8.

Best practice is to always convert to Unicode, even if you know your text is pure ASCII. You *may* be able to get away with not doing so if you know you have ASCII, but that's still the lazy way. And of course you need to know what encoding has been used.

There is some overhead with decoding to Unicode, so if performance really is critical, *and* your needs are quite low, you may be able to get away with just treating the strings as ASCII byte strings:

    with open("my file.txt") as f:
        for line in f:
            print line.lower()

will correctly lowercase ASCII strings. It won't lowercase non-ASCII letters, and there's a good chance that they may display as raw bytes in some encoding.

Otherwise, I think the best way to approach this may be:

    import io
    with io.open("my file.txt", encoding='utf-8') as f:
        for line in f:
            print line.lower()

Assuming the file actually is encoded with UTF-8, that ought to work perfectly.

But to really know what is going on we will need more information.

--
Steven
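The difference between lowercasing bytes and lowercasing decoded text can be seen in a small sketch. It is written for Python 3, where the two types are distinct (bytes and str), which is an assumption that differs from the 2.7 setup in the thread. The variable names are illustrative only.

```python
# UTF-8 bytes as they would come from a file opened in binary mode.
raw = "ÄHNLICH".encode("utf-8")

# bytes.lower() only touches ASCII letters; the two bytes encoding "Ä"
# are left unchanged.
lowered_bytes = raw.lower()

# Decoding first lets str.lower() apply Unicode case mappings.
lowered_text = raw.decode("utf-8").lower()

print(lowered_bytes.decode("utf-8"))  # Ähnlich
print(lowered_text)                   # ähnlich
```

This matches the "ÄHNLICH Ähnlich" line that shows up elsewhere in the thread when the file is read as byte strings.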
Case-insensitive sorting of strings (Python newbie)
I notice that the string method 'lower' seems to convert some strings (input from a text file) to Unicode but not others. This messes up sorting if it is used on arguments of 'sorted' since Unicode strings come before ordinary ones.

Is there a better way of case-insensitive sorting of strings in a list? Is it necessary to convert strings read from a plaintext file to Unicode? If so, how? This is Python 2.7.8.

Regards

John Sampson
Re: Case-insensitive sorting of strings (Python newbie)
John Sampson wrote:

> I notice that the string method 'lower' seems to convert some strings
> (input from a text file) to Unicode but not others. This messes up
> sorting if it is used on arguments of 'sorted' since Unicode strings
> come before ordinary ones.

I doubt that. Can you provide a short example?

> Is there a better way of case-insensitive sorting of strings in a list?
> Is it necessary to convert strings read from a plaintext file to
> Unicode? If so, how? This is Python 2.7.8.

Well, if you have non-ASCII chars, str.lower() won't give reasonable results for many Unicode characters. So byte strings containing encoded Unicode characters should be decoded to Unicode strings first.

Ciao, Michael.
Re: Case-insensitive sorting of strings (Python newbie)
On Sat, Jan 24, 2015 at 4:53 AM, Peter Otten __pete...@web.de wrote:
> Now the same with unicode. To read text with a specific encoding use
> either codecs.open() or io.open() instead of the built-in (replace
> utf-8 with your actual encoding):
>
> >>> import io
> >>> for line in io.open("tmp.txt", encoding="utf-8"):
> ...     line = line.strip()
> ...     print line, line.lower()

In Python 3, the built-in open() function sports a fancy encoding= parameter like that.

for line in open("tmp.txt", encoding="utf-8"):

If you can, I would recommend using Python 3 for all this kind of thing. The difference may not be huge, but there are all sorts of little differences here and there that mean that Unicode support is generally better; most of it stems from the fact that the default quoted string literal is a Unicode string rather than a byte string, which means that basically every function ever written for Py3 has been written to be Unicode-compatible.

ChrisA
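A self-contained Python 3 sketch of that approach might look like the following. The helper name read_sorted and the use of a temporary copy of tmp.txt are assumptions made so the example can run anywhere. str.casefold is used as the sort key instead of lower(), which is a swapped-in choice and not something the thread prescribes.

```python
import os
import tempfile

WORDS = ["ähnlich", "üblich", "nötig", "möglich", "Maß", "Maße", "Masse", "ÄHNLICH"]


def read_sorted(path, encoding="utf-8"):
    """Read one word per line as text and sort case-insensitively."""
    with open(path, encoding=encoding) as f:
        words = [line.strip() for line in f]
    # casefold() is a more aggressive lower(): for example "ß" becomes "ss".
    return sorted(words, key=str.casefold)


# Write a throwaway copy of the sample data so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tmp.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(WORDS) + "\n")
    result = read_sorted(path)

print("\n".join(result))
```

With this key, "Maß" folds to "mass" and sorts ahead of "Maße" and "Masse", but the umlauts still land after "z" because the comparison is by code point. Getting dictionary order for a particular language still needs locale-aware collation.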
Re: Case-insensitive sorting of strings (Python newbie)
On Sat, Jan 24, 2015 at 6:14 AM, Marko Rauhamaa ma...@pacujo.net wrote:
> Well, if Python can't, then who can? Probably nobody in the world, not
> generically, anyway.
>
> Example:
>
>     >>> print("re\u0301sume\u0301")
>     résumé
>     >>> print("r\u00e9sum\u00e9")
>     résumé
>     >>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
>     False
>     >>> print("\ufb01nd")
>     ﬁnd
>     >>> print("find")
>     find
>     >>> print("\ufb01nd" == "find")
>     False
>
> If equality can't be determined, words really can't be sorted.

Ah, that's a bit easier to deal with. Just use Unicode normalization.

>>> import unicodedata
>>> print(unicodedata.normalize("NFC","re\u0301sume\u0301") == unicodedata.normalize("NFC","r\u00e9sum\u00e9"))
True

It's a bit verbose, but if you're doing a lot of comparisons, you probably want to make a key-function that folds together everything that you want to be treated the same way, for instance:

def key(s):
    """Normalize a Unicode string for comparison purposes.

    Composes, case-folds, and trims excess spaces.
    """
    return unicodedata.normalize("NFC",s).strip().casefold()

Then it's much tidier:

>>> print(key("re\u0301sume\u0301") == key("r\u00e9sum\u00e9"))
True
>>> print(key("\ufb01nd") == key("find"))
True

You may want to go further, too; for search comparisons, you'll want to use NFKC normalization, and probably translate all strings of Unicode whitespace into single U+0020s, or completely strip out zero-width non-breaking spaces (and maybe zero-width breaking spaces, too), etc, etc. It all depends on what you mean by equality. But certainly a basic NFC or NFD normalization is safe for general work.

ChrisA
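The "go further" variant for search comparisons could be sketched roughly as follows, in Python 3. The function name search_key, the exact set of zero-width characters removed (U+200B and U+FEFF), and the second normalization pass after case folding are all assumptions for illustration. What counts as "equal" depends on the application.

```python
import re
import unicodedata

# Assumed set of invisible characters to drop: ZERO WIDTH SPACE and
# ZERO WIDTH NO-BREAK SPACE.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\ufeff"))


def search_key(s):
    """Build a loose comparison key for search-style matching (a sketch)."""
    # NFKC also folds compatibility characters, e.g. the "ﬁ" ligature.
    s = unicodedata.normalize("NFKC", s)
    s = s.translate(_ZERO_WIDTH)
    # Collapse every run of Unicode whitespace into a single U+0020.
    s = re.sub(r"\s+", " ", s).strip()
    # Case folding can produce unnormalized text, so normalize once more.
    return unicodedata.normalize("NFKC", s.casefold())


print(search_key("\ufb01nd") == search_key("find"))
print(search_key("  Re\u0301sume\u0301\u00a0 ") == search_key("r\u00e9sum\u00e9"))
print(search_key("fi\u200bnd") == search_key("FIND"))
```

All three comparisons print True. A key this loose is fine for matching, but it discards distinctions that may matter when the text is displayed or stored, so it is best kept separate from the original strings.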
Re: Case-insensitive sorting of strings (Python newbie)
Peter Otten __pete...@web.de:

> The standard recommendation is to convert bytes to unicode as early as
> possible and only manipulate unicode.

Unicode doesn't get you off the hook (as you explain later in your post). Upper/lowercase as well as collation order is ambiguous. Python even with decent locale support can't be expected to do it all for you.

Well, if Python can't, then who can? Probably nobody in the world, not generically, anyway.

Example:

    >>> print("re\u0301sume\u0301")
    résumé
    >>> print("r\u00e9sum\u00e9")
    résumé
    >>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
    False
    >>> print("\ufb01nd")
    ﬁnd
    >>> print("find")
    find
    >>> print("\ufb01nd" == "find")
    False

If equality can't be determined, words really can't be sorted.

Marko