>> Unicode
>> interoperability is a pain, though, and I find it depressing to work
>> with in Python2.x, because it never seems to behave predictably. I
>> still have no idea why tokenizing Hungarian text and tokenizing German
>> text are not fundamentally the same operation
>
> I have no idea why they're not:
>
> <code - untested>
> import codecs
>
> with codecs.open("german.txt", "rb", encoding="utf8") as f:
>     german_text = f.read()
>
> with codecs.open("hungarian.txt", "rb", encoding="utf8") as f:
>     hungarian_text = f.read()
>
> # do_stuff_with (german_text)
> # do_stuff_with (hungarian_text)
>
> </code>
>
> Of course, I'm assuming that you know what encoding has been
> used to serialise the text, but if you don't then it's not
> Python's fault ;)

Thanks. I'll try that.

I seem to remember that 'file' on Linux can detect encodings, but then
it's also a matter of Python recognising the encoding under the exact
same name...
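Failing that, a cheap fallback when the encoding is unknown is to try a
list of candidate encodings in order and keep the first that decodes
cleanly. An untested sketch (the function name and candidate list are my
own choices, not anything standard):

```python
def guess_encoding(data, candidates=("utf-8", "latin-1")):
    # Try each candidate in turn; return the first one that decodes
    # without error.  Note that latin-1 accepts *any* byte sequence,
    # so it only makes sense as the last resort in the list.
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None
```

It's crude (latin-1 will "succeed" on mojibake), but it beats guessing
by hand when all you have is a pile of files of unknown provenance.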

Alec
_______________________________________________
python-uk mailing list
python-uk@python.org
http://mail.python.org/mailman/listinfo/python-uk
