Re: Case-insensitive sorting of strings (Python newbie)

2015-01-23 Thread Peter Otten
John Sampson wrote:

> I notice that the string method 'lower' seems to convert some strings
> (input from a text file) to Unicode but not others.
> This messes up sorting if it is used on arguments of 'sorted' since
> Unicode strings come before ordinary ones.
>
> Is there a better way of case-insensitive sorting of strings in a list?
> Is it necessary to convert strings read from a plaintext file
> to Unicode? If so, how? This is Python 2.7.8.

The standard recommendation is to convert bytes to unicode as early as 
possible and only manipulate unicode. This is more likely to give correct 
results when slicing or converting a string.

$ cat tmp.txt
ähnlich
üblich
nötig
möglich
Maß
Maße
Masse
ÄHNLICH
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> for line in open("tmp.txt"):
...     line = line.strip()
...     print line, line.lower()
... 
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH Ähnlich

Now the same with unicode. To read text with a specific encoding use either 
codecs.open() or io.open() instead of the built-in (replace utf-8 with your 
actual encoding):

>>> import io
>>> for line in io.open("tmp.txt", encoding="utf-8"):
...     line = line.strip()
...     print line, line.lower()
... 
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH ähnlich
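
For comparison, here is a minimal Python 3 sketch of the same difference. It rests on an assumption of mine: the sessions above are Python 2.7, where str is a byte string, while in Python 3 the byte string type is bytes and its lower() only touches ASCII letters. The variable names are made up for the example.

```python
# Python 3 sketch: lowercase the same word as bytes and as text.
word = "ÄHNLICH"

# bytes.lower() only changes ASCII letters, so the two bytes that encode
# "Ä" in UTF-8 are left alone.
as_bytes = word.encode("utf-8").lower().decode("utf-8")

# str.lower() works on code points and lowercases "Ä" as well.
as_text = word.lower()

print(as_bytes)
print(as_text)
```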

Unfortunately this will not give the order that you (or a German speaker in 
the example below) will probably expect:

>>> print "".join(sorted(io.open("tmp.txt"), key=unicode.lower))
Masse
Maß
Maße
möglich
nötig
ähnlich
ÄHNLICH
üblich
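
The reason is that sorted() compares code points, and every umlaut has a higher code point than any ASCII letter. A small Python 3 sketch with the same words (the list name words is only for illustration; on 2.7 the key would be unicode.lower as above):

```python
words = ["ähnlich", "üblich", "nötig", "möglich",
         "Maß", "Maße", "Masse", "ÄHNLICH"]

# "z" is U+007A and "ä" is U+00E4, so "ä" sorts after every ASCII letter.
print(ord("z") < ord("ä"))

# Sorting on the lowercased text compares plain code points.
by_code_point = sorted(words, key=str.lower)
print("\n".join(by_code_point))
```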

For case-insensitive sorting you get better results with locale.strxfrm() -- 
but this doesn't accept unicode:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> print "".join(sorted(io.open("tmp.txt"), key=locale.strxfrm))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

As a workaround you can sort the byte strings first:

>>> print "".join(sorted(open("tmp.txt"), key=locale.strxfrm))
ähnlich
ÄHNLICH
Maß
Masse
Maße
möglich
nötig
üblich

You should still convert the result to unicode if you want to do further 
processing in Python.
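
A rough Python 3 counterpart, as a sketch only: there locale.strxfrm() accepts text directly, so no detour via byte strings is needed. The helper name locale_sorted is made up for this example, and the resulting order depends entirely on which locale the environment provides.

```python
import locale

def locale_sorted(words):
    """Sort text strings using the collation rules of the current locale."""
    try:
        # Pick up the collation locale from the environment (e.g. de_DE.UTF-8).
        locale.setlocale(locale.LC_COLLATE, "")
    except locale.Error:
        # The configured locale is not installed; keep the current one.
        pass
    return sorted(words, key=locale.strxfrm)

words = ["ähnlich", "üblich", "nötig", "möglich",
         "Maß", "Maße", "Masse", "ÄHNLICH"]
print("\n".join(locale_sorted(words)))
```

With a German locale installed this should put the umlauts next to their base letters; in the plain C locale it falls back to code point order.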

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Case-insensitive sorting of strings (Python newbie)

2015-01-23 Thread Steven D'Aprano
John Sampson wrote:

> I notice that the string method 'lower' seems to convert some strings
> (input from a text file) to Unicode but not others.

I don't think so. You're going to have to show an example.

I *think* what you might be running into is an artifact of printing to a
terminal, which may (or may not) interpret some byte sequences as UTF-8
characters, but I can't replicate it. So I'll have to see an example.
Please state what OS you are running on, and what encoding your terminal is
set to. Also, are you opening the file in text mode or binary mode?


> This messes up sorting if it is used on arguments of 'sorted' since
> Unicode strings come before ordinary ones.
>
> Is there a better way of case-insensitive sorting of strings in a list?
> Is it necessary to convert strings read from a plaintext file
> to Unicode? If so, how? This is Python 2.7.8.

Best practice is to always convert to Unicode, even if you know your text is
pure ASCII. You *may* be able to get away with not doing so if you know you
have ASCII, but that's still the lazy way. And of course you need to know
what encoding has been used.

There is some overhead with decoding to Unicode, so if performance really is
critical, *and* your needs are quite low, you may be able to get away with
just treating the strings as ASCII byte strings:

with open("my file.txt") as f:
    for line in f:
        print line.lower()


will correctly lowercase ASCII strings. It won't lowercase non-ASCII
letters, and there's a good chance that they may display as raw bytes in
some encoding. Otherwise, I think the best way to approach this may be:


import io
with io.open("my file.txt", encoding='utf-8') as f:
    for line in f:
        print line.lower()


Assuming the file actually is encoded with UTF-8, that ought to work
perfectly.

But to really know what is going on we will need more information.


-- 
Steven



Case-insensitive sorting of strings (Python newbie)

2015-01-23 Thread John Sampson
I notice that the string method 'lower' seems to convert some strings 
(input from a text file) to Unicode but not others.
This messes up sorting if it is used on arguments of 'sorted' since 
Unicode strings come before ordinary ones.


Is there a better way of case-insensitive sorting of strings in a list? 
Is it necessary to convert strings read from a plaintext file
to Unicode? If so, how? This is Python 2.7.8.

Regards

John Sampson


Re: Case-insensitive sorting of strings (Python newbie)

2015-01-23 Thread Michael Ströder
John Sampson wrote:
> I notice that the string method 'lower' seems to convert some strings (input
> from a text file) to Unicode but not others.
> This messes up sorting if it is used on arguments of 'sorted' since Unicode
> strings come before ordinary ones.

I doubt that. Can you provide a short example?

> Is there a better way of case-insensitive sorting of strings in a list? Is it
> necessary to convert strings read from a plaintext file
> to Unicode? If so, how? This is Python 2.7.8.

Well, if you have non-ASCII chars, str.lower() won't give reasonable results
for many Unicode characters. So byte strings containing encoded Unicode
characters should be decoded to Unicode strings first.

Ciao, Michael.



Re: Case-insensitive sorting of strings (Python newbie)

2015-01-23 Thread Chris Angelico
On Sat, Jan 24, 2015 at 4:53 AM, Peter Otten __pete...@web.de wrote:
> Now the same with unicode. To read text with a specific encoding use either
> codecs.open() or io.open() instead of the built-in (replace utf-8 with your
> actual encoding):
>
> >>> import io
> >>> for line in io.open("tmp.txt", encoding="utf-8"):
> ...     line = line.strip()
> ...     print line, line.lower()

In Python 3, the built-in open() function sports a fancy encoding=
parameter like that.

for line in open("tmp.txt", encoding="utf-8"):

If you can, I would recommend using Python 3 for all this kind of
thing. The difference may not be huge, but there are all sorts of
little differences here and there that mean that Unicode support is
generally better; most of it stems from the fact that the default
quoted string literal is a Unicode string rather than a byte string,
which means that basically every function ever written for Py3 has
been written to be Unicode-compatible.
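
A minimal Python 3 sketch of reading a file that way and sorting it without regard to case. The function name read_sorted and the temporary file are made up for this example, and UTF-8 is an assumption about the data:

```python
import os
import tempfile

def read_sorted(path, encoding="utf-8"):
    """Return the stripped lines of a text file, sorted ignoring case."""
    with open(path, encoding=encoding) as f:
        lines = [line.strip() for line in f]
    # str.casefold() is a more aggressive lower() meant for caseless matching.
    return sorted(lines, key=str.casefold)

# Write a small sample file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("banana\nApple\ncherry\nÄpfel\n")

result = read_sorted(tmp.name)
os.remove(tmp.name)
print(result)
```

Note that this still sorts by code point, so "Äpfel" lands after "cherry"; locale-aware collation is a separate step.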

ChrisA


Re: Case-insensitive sorting of strings (Python newbie)

2015-01-23 Thread Chris Angelico
On Sat, Jan 24, 2015 at 6:14 AM, Marko Rauhamaa ma...@pacujo.net wrote:
> Well, if Python can't, then who can? Probably nobody in the world, not
> generically, anyway.
>
> Example:
>
> >>> print("re\u0301sume\u0301")
> résumé
> >>> print("r\u00e9sum\u00e9")
> résumé
> >>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
> False
> >>> print("\ufb01nd")
> ﬁnd
> >>> print("find")
> find
> >>> print("\ufb01nd" == "find")
> False
>
> If equality can't be determined, words really can't be sorted.

Ah, that's a bit easier to deal with. Just use Unicode normalization.

>>> import unicodedata
>>> print(unicodedata.normalize("NFC","re\u0301sume\u0301") ==
...       unicodedata.normalize("NFC","r\u00e9sum\u00e9"))
True

It's a bit verbose, but if you're doing a lot of comparisons, you
probably want to make a key-function that folds together everything
that you want to be treated the same way, for instance:

def key(s):
    """Normalize a Unicode string for comparison purposes.

    Composes, case-folds, and trims excess spaces.
    """
    return unicodedata.normalize("NFC",s).strip().casefold()

Then it's much tidier:

>>> print(key("re\u0301sume\u0301") == key("r\u00e9sum\u00e9"))
True
>>> print(key("\ufb01nd") == key("find"))
True
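
The same function can be handed straight to sorted(), which is what the original question was after. A short Python 3 sketch, with a word list invented for the example:

```python
import unicodedata

def key(s):
    """Compose, trim and case-fold a string for comparison."""
    return unicodedata.normalize("NFC", s).strip().casefold()

words = ["Zebra", "r\u00e9sum\u00e9", "apple", "RE\u0301SUME\u0301"]

# Both spellings of "résumé" get the same key, so they end up adjacent.
ordered = sorted(words, key=key)
print(ordered)
```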

You may want to go further, too; for search comparisons, you'll want
to use NFKC normalization, and probably translate all strings of
Unicode whitespace into single U+0020s, or completely strip out
zero-width non-breaking spaces (and maybe zero-width breaking spaces,
too), etc, etc. It all depends on what you mean by equality. But
certainly a basic NFC or NFD normalization is safe for general work.
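
A sketch of what such a stricter key might look like in Python 3. The name search_key and the exact set of zero-width characters removed are my own choices for illustration, not a recommendation from any standard:

```python
import re
import unicodedata

# Zero-width characters to drop: ZERO WIDTH SPACE, WORD JOINER and
# ZERO WIDTH NO-BREAK SPACE. This particular set is an illustrative choice.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u2060\ufeff"))

def search_key(s):
    """Fold a string aggressively for search-style comparisons."""
    s = unicodedata.normalize("NFKC", s)   # compatibility composition
    s = s.translate(_ZERO_WIDTH)           # delete zero-width characters
    s = re.sub(r"\s+", " ", s).strip()     # collapse runs of whitespace
    return s.casefold()

print(search_key("re\u0301sume\u0301") == search_key("r\u00e9sum\u00e9"))
print(search_key("\ufb01nd") == search_key("find"))
```

Whether this much folding is appropriate depends on the application; for display or round-tripping, the original strings should be kept alongside the keys.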

ChrisA


Re: Case-insensitive sorting of strings (Python newbie)

2015-01-23 Thread Marko Rauhamaa
Peter Otten __pete...@web.de:

> The standard recommendation is to convert bytes to unicode as early as
> possible and only manipulate unicode.

Unicode doesn't get you off the hook (as you explain later in your
post). Upper/lowercase as well as collation order is ambiguous. Python
even with decent locale support can't be expected to do it all for you.

Well, if Python can't, then who can? Probably nobody in the world, not
generically, anyway.

Example:

>>> print("re\u0301sume\u0301")
résumé
>>> print("r\u00e9sum\u00e9")
résumé
>>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
False
>>> print("\ufb01nd")
ﬁnd
>>> print("find")
find
>>> print("\ufb01nd" == "find")
False

If equality can't be determined, words really can't be sorted.


Marko