On Mon, Nov 19, 2012 at 04:36:12AM -0500, Zdenek Pavlas wrote:
> > $ python -c "import yum.misc; print yum.misc.to_xml('Skytt\xe4')"
> > Skyttä
> >
> > After this patch:
> >
> > $ python -c "import yum.misc; print yum.misc.to_xml('Skytt\xe4')"
> > Skytt�
> >
> > That'd be a regression in my opinion.
>
> I see, and agree.
>
Note that I would encourage this to be behaviour that you mark as
deprecated and schedule to get rid of at some defined point in the future.
People using latin-1 can switch to utf-8 with extremely limited
repercussions (compared to, say, people who use big5 or shift-jis who get
hit with 1) more of their characters being outside of the ascii subset, and
2) more extra bytes being needed to represent those characters in utf-8 than
to represent latin-1 characters in utf-8).

Doing this is more confusing to users of other encodings (for instance,
the large number of people who use shift-jis and big5).  All of them
will get gibberish instead of a replacement character.

We can never be 100% correct with this.  For instance:

    "Driver for SKYTT\xc4\xae Brand Video Cards"  (SKYTTÄ®)

would be interpreted as valid UTF-8 and thus rendered as gibberish.  I do
note that these cases are pretty rare.  They would need to have a
character whose byte is in the range 0xC0-0xDF followed by characters
whose bytes are in the range 0xA0-0xBF.  If you look at a latin-1
character chart, you can see that sequences of the latin-1 characters
that map to these bytes aren't impossible, but they are very rare:

    http://en.wikipedia.org/wiki/Latin1#Codepage_layout

Also note that upstream python has been unsympathetic to arguments about
latin-1 locales.  (They're unsympathetic to non-utf-8 locales in general,
but the space savings and widespread use of shift-jis and big5 make them
slightly more sympathetic to issues there than to latin-1.)

>     # check if valid utf8
>     try: unicode(item, 'utf-8')
>     except UnicodeDecodeError:
> -        # replace invalid bytes with \ufffd
> -        item = unicode(item, 'utf-8', 'replace').encode('utf-8')
> +        # assume iso-8859-1
> +        item = unicode(item, 'iso-8859-1').encode('utf-8')
>     elif type(item) is unicode:
>         item = item.encode('utf-8')
>     elif item is None:

ACK.

Also a bit of bad news... I re-read

    http://en.wikipedia.org/wiki/XML#Valid_characters

while researching this.  It would appear that we should also be removing
the C1 control codes from the output, not just the C0 control codes.
Unfortunately, the C1 control codes fall outside of the ascii subset, and
that means that we can't use str.translate to remove them.
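To make the trade-off concrete, here is a Python 3 rendering of the patched fallback (yum itself is Python 2, and `to_xml_bytes` is an illustrative name, not yum's actual function):

```python
def to_xml_bytes(item):
    """Return UTF-8 bytes, reinterpreting invalid UTF-8 input as latin-1."""
    if isinstance(item, bytes):
        try:
            item.decode('utf-8')        # already valid UTF-8: keep as-is
        except UnicodeDecodeError:
            # assume iso-8859-1 rather than substituting U+FFFD
            item = item.decode('iso-8859-1').encode('utf-8')
    elif isinstance(item, str):
        item = item.encode('utf-8')
    return item

# The regression case from the top of the thread now round-trips:
print(to_xml_bytes(b'Skytt\xe4'))       # b'Skytt\xc3\xa4', i.e. "Skyttä"

# ...but the false positive described above slips through: the latin-1
# bytes for "Ä®" (\xc4\xae) happen to form a valid UTF-8 sequence for
# U+012E, so they pass the validity check unchanged and render as "Į".
print(to_xml_bytes(b'SKYTT\xc4\xae').decode('utf-8'))
```

The second call is exactly the SKYTTÄ® case: no decoding error is raised, so the latin-1 fallback never fires and the user sees gibberish.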
In kitchen, where I'm already taking the hit of transforming to unicode
and using unicode.translate(), I can extend it to delete these bytes just
by modifying the translation table.  But if we want to avoid that with
the yum code....  Options:

* Don't do anything to the C1 control codes: the current yum code does
  not handle C1 control codes and we haven't seen any problems yet.  This
  probably means that the C1 codes are not being used in the wild.  We
  might also want to try passing some C1 control codes into libxml2 and
  seeing what happens -- perhaps libxml2 only barfs on C0 codes.

* Convert to unicode and use unicode.translate() -- more correct, and we
  still get a significant speedup over the original code.

  * A subset of this is that it would be possible to translate the
    control codes to their escaped equivalents, since XML 1.1 specifies
    this as valid.  Probably not needed for yum, though.

* Write our own loop that keeps track of multibyte sequences to decide if
  a sequence is a control code -- this is the kind of code we're trying
  to remove, so I'd be highly against this.

-Toshio
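As a postscript, the unicode.translate() option above can be sketched like this (shown with Python 3's str.translate, which accepts the same ordinal-to-None mapping as Python 2's unicode.translate; the table and function names are illustrative):

```python
# Build a translation table that deletes the C0 control codes (keeping
# tab, LF, and CR, which XML allows) and the C1 range U+0080-U+009F,
# which byte-oriented str.translate in Python 2 cannot reach.
_CTRL_TABLE = {c: None for c in range(0x00, 0x20)
               if c not in (0x09, 0x0A, 0x0D)}
_CTRL_TABLE.update({c: None for c in range(0x80, 0xA0)})

def strip_control_chars(text):
    """Remove C0/C1 control characters from already-decoded text."""
    return text.translate(_CTRL_TABLE)

print(strip_control_chars('ok\x01\x85fine'))   # 'okfine'
print(strip_control_chars('a\tb\nc'))          # tab and newline survive
```

Since the table is built once and translate() runs in C, this keeps most of the speedup over a hand-written per-character loop.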
_______________________________________________
Yum-devel mailing list
[email protected]
http://lists.baseurl.org/mailman/listinfo/yum-devel
