[Tutor] unicode: % & __str__ & str()

spir Fri, 30 Oct 2009 08:51:40 -0700

[back to the list after a rather long break]

Hello,


I stepped on a unicode issue ;-) (one more).
Below is an illustration:

===============================
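# -*- coding: utf-8 -*-
# (Python 2; the source file is assumed to be saved as UTF-8)
class U(unicode):
    def __str__(self):
        return self

# if you can't properly see the string below,
# its characters have ordinals between 128 and 255
c0 = "¶ÿµ"                    # plain str (bytes)
c1 = U("¶ÿµ", "utf8")         # instance of the subclass U
c2 = unicode("¶ÿµ", "utf8")   # regular unicode

for c in (c0, c1, c2):
    # 1. % interpolation
    try:
        print "%s" % c,
    except UnicodeEncodeError:
        print "***",
    # 2. explicit call to __str__
    try:
        print c.__str__(),
    except UnicodeEncodeError:
        print "***",
    # 3. str()
    try:
        print str(c)
    except UnicodeEncodeError:
        print "***"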

==>

¶ÿµ ¶ÿµ ¶ÿµ
¶ÿµ ¶ÿµ ***
¶ÿµ *** ***
================================

The last line shows that a regular unicode object cannot be passed to str()
(more or less OK), and that calling __str__() on it fails as well (not OK at
all). Maybe I am overlooking some obvious point (again). If not, then there
are in fact two issues:

-1- The old ambiguity of str(), which means both "create an instance of type
str from the given data" and "build a textual representation of the given
object through __str__", has always been a semantic flaw to me. It becomes a
concrete problem once we have text that is not str.
I am very surprised by this. How come this point does not seem to be well
known? How is it even possible to use unicode without stepping on this
problem? I guess this breaks years or even decades of habit for coders used
to writing str() when they mean __str__().

-2- How is it possible that __str__ does not work on a unicode object? It
seems that the method is simply not implemented on the unicode type, and
neither is __repr__, so that it falls back to str().
Strangely enough, % interpolation works, which means that for both types of
text a shortcut is used, namely returning the text itself as is. I would have
bet my last cent that % simply delegates to __str__, or even that the two
were the same function, but obviously I was wrong!
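
A minimal sketch of what may be going on, assuming that Python 2's str()
encodes a unicode object with the default ASCII codec (the UnicodeEncodeError
in the output above is consistent with that). The names text, ascii_ok and
utf8_bytes are illustrative only; the explicit .encode() calls stand in for
the implicit step:

```python
# -*- coding: utf-8 -*-
# The same three characters as in the example above, written with escapes
# so that the snippet does not depend on the file's encoding.
text = u"\xb6\xff\xb5"

# Assumption: this is roughly the conversion str() attempts on a unicode
# object in Python 2. None of these characters exists in ASCII.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Naming the codec explicitly avoids the error and yields a byte string.
utf8_bytes = text.encode("utf-8")

print(ascii_ok)
print(repr(utf8_bytes))
```

If that assumption holds, the failure comes from the choice of codec rather
than from a missing method, and encoding explicitly sidesteps it.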

Looking for workarounds, I first tried to override (or rather define) __str__
as in the U type above. But this solution is far from ideal, because we still
cannot use str() (which my fingers may type while my head is
who-knows-where). It is also unusable in practice, for the following reason:
===================================
print c1.__class__
print c1[1].__class__
c3 = c1 ; print (c1+c3).__class__
==>
<class '__main__.U'>
<type 'unicode'>
<type 'unicode'>
====================================
Any operation returns a plain unicode instead of the original type. The type
would therefore have to override every possible text operation in order to
convert the results back, which is a lot of work indeed, not to mention the
performance cost.
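
The same effect can be reproduced in a small, version-neutral sketch. The
alias text_type and the names sliced, joined and rewrapped are hypothetical
and only serve the illustration; U stands in for the type defined earlier:

```python
# text_type is unicode on Python 2 and str on Python 3.
text_type = type(u"")

class U(text_type):
    """Stand-in for the U type above."""
    def __str__(self):
        return self

c1 = U(u"\xb6\xff\xb5")

sliced = c1[1]       # indexing
joined = c1 + c1     # concatenation

# Both results are plain text objects, not U instances.
print(type(c1).__name__)
print(type(sliced) is text_type)
print(type(joined) is text_type)

# Getting a U back requires wrapping each result again by hand.
rewrapped = U(c1 + c1)
print(type(rewrapped) is U)
```

Every method that returns text would need the same re-wrapping, which is the
burden described above.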

So the only solution seems to be to use % everywhere, and to hunt down every
str, __str__, __repr__ and the like in all the code.
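
As a rough sketch of that workaround, the % formatting could be funnelled
through one helper so that it is written only once. The name to_text is
hypothetical, and byte strings with non-ASCII content are a separate matter
that this helper does not try to handle:

```python
def to_text(obj):
    """Hypothetical helper: build text through % instead of str()."""
    # The one-element tuple keeps % from unpacking obj if it is a tuple.
    return u"%s" % (obj,)

print(to_text(u"\xb6\xff\xb5") == u"\xb6\xff\xb5")
print(to_text(42) == u"42")
```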

I hope I'm wrong on this. Please, give me a better solution ;-)



------
la vita e estrany


_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
