>> Unicode interoperability is a pain, though, and I find it depressing
>> to work with in Python2.x, because it never seems to behave
>> predictably. I still have no idea why tokenizing Hungarian text and
>> tokenizing German text are not fundamentally the same operation
>
> I have no idea why they're not:
>
> <code - untested>
> import codecs
>
> with codecs.open("german.txt", "rb", encoding="utf8") as f:
>     german_text = f.read()
>
> with codecs.open("hungarian.txt", "rb", encoding="utf8") as f:
>     hungarian_text = f.read()
>
> # do_stuff_with(german_text)
> # do_stuff_with(hungarian_text)
>
> </code>
>
> Of course, I'm assuming that you know what encoding has been
> used to serialise the text, but if you don't then it's not
> Python's fault ;)
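To illustrate the quoted reply's point: once the bytes have been decoded to unicode, tokenizing German and Hungarian really is the same operation. A minimal sketch (the sample strings and the `tokenize` helper are made up for illustration; real input would come from the decoded files above):

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample strings standing in for the decoded file contents.
german = u"Stra\u00dfe f\u00fchrt zur Br\u00fccke"   # "Straße führt zur Brücke"
hungarian = u"Sz\u00e9p z\u00f6ld erd\u0151"         # "Szép zöld erdő"

def tokenize(text):
    # On unicode text, \w with re.UNICODE matches accented letters too,
    # so the same pattern handles either language identically.
    return re.findall(r"\w+", text, re.UNICODE)

print(tokenize(german))      # ['Straße', 'führt', 'zur', 'Brücke']
print(tokenize(hungarian))   # ['Szép', 'zöld', 'erdő']
```

The language only matters while the data is still bytes; after decoding, both texts are just sequences of Unicode word characters.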
Thanks, I'll try that. I seem to remember that 'file' on Linux can
detect encodings, but it's also a matter of calling the encoding by
the exact same name...

Alec

_______________________________________________
python-uk mailing list
python-uk@python.org
http://mail.python.org/mailman/listinfo/python-uk
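Following up on the detection point: the stdlib has no equivalent of `file`'s encoding guessing, but a common fallback is to try candidate encodings in order. A sketch (the helper name and the candidate list are assumptions, not a real API; note that latin-1 accepts any byte sequence, so placing it last makes it a catch-all rather than genuine detection):

```python
def decode_with_fallbacks(data, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in turn; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")

# UTF-8 bytes decode on the first try...
print(decode_with_fallbacks(b"Stra\xc3\x9fe"))   # ('Straße', 'utf-8')
# ...while latin-1 bytes fail UTF-8 and fall through to the second.
print(decode_with_fallbacks(b"Stra\xdfe"))       # ('Straße', 'latin-1')
```

For real detection you'd want a third-party library such as chardet; this sketch only covers the "known short list of likely encodings" case.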