I'd like to point out the historical reason: Python predates Unicode, so the byte string type has many convenience operations that you would only expect of a character string.
We have come up with a transition strategy, allowing existing libraries to widen their support from byte strings to character strings. This isn't a simple task, so many libraries still expect and return byte strings when they should process character strings. Instead of breaking the libraries right away, we have defined a transitional mechanism, which allows Unicode support to be added to libraries as the need arises. This transition is still in progress.
I understand; I wasn't yelling "why can't Python be more like Java". On the other hand, I also want to point out that making an individual decision for each string isn't practical and is very error-prone. The fact that unicode and 8-bit strings look alike and work alike in common situations, and only run into problems with non-ASCII data, is very confusing for most people.
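To make the confusion concrete, here is a minimal sketch. It assumes that an 8-bit string mixed with a unicode string is decoded as ASCII behind the scenes; the explicit decode() call imitates that hidden step so the snippet also runs on interpreters that refuse to mix the two types. The helper name mix() and the sample data are made up for illustration.

```python
# Sketch only: mix() is a hypothetical helper that imitates the implicit
# ASCII decoding applied when an 8-bit string meets a unicode string.
def mix(text, raw):
    """Join a unicode string with an 8-bit string, decoding the latter as ASCII."""
    return text + raw.decode("ascii")


# With ASCII-only data the two string types appear interchangeable.
print(mix(u"drink: ", b"tea"))

# The same code breaks as soon as the bytes contain non-ASCII data
# (here: the UTF-8 encoding of an accented letter).
try:
    mix(u"drink: ", b"caf\xc3\xa9")
except UnicodeDecodeError as exc:
    print("fails only for non-ASCII data:", exc)
```

The point is that nothing in the first call hints that the second one will fail, so the bug only shows up once real-world data arrives.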
Eventually, the primary string type should be the Unicode string. If you are curious how far we are still off that goal, just try running your program with the -U option.
Lots of errors. Among them are gzip (binary?!) and strftime??
I actually quite appreciate Python's power in processing binary data as 8-bit strings. But perhaps we should transition to using unicode as the text string and treat binary strings as the exception. Right now we have
'' - 8bit string; u'' unicode string
How about
b'' - 8bit string; '' unicode string
and no automatic conversion. Perhaps this can be activated by something like the encoding declarations, so that the transition can happen module by module.
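A minimal sketch of how the b'' / '' split with no automatic conversion could behave. It is written against Python 3, where literals happen to work this way; the variable names and the choice of UTF-8 are assumptions for illustration, not part of the proposal.

```python
# Sketch only: b'' is an 8-bit string, '' is a unicode string.
raw = b"caf\xc3\xa9"    # 8-bit string holding UTF-8 encoded bytes
text = "caf\u00e9"      # unicode string holding characters

# No automatic conversion: mixing the two types is rejected outright,
# whether or not the data happens to be ASCII.
try:
    text + raw
except TypeError as exc:
    print("mixing is rejected:", exc)

# Conversion has to be spelled out, with the encoding named explicitly.
decoded = raw.decode("utf-8")
encoded = text.encode("utf-8")
print(decoded == text, encoded == raw)
```

The error now appears at the point where the types are mixed, not later when non-ASCII data shows up.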
Regards, Martin