Logic looks good. James should test that this doesn't cause the performance regression that he mentioned yesterday. Since it doesn't use saxutils.escape anymore, it shouldn't but we never tracked down why it was doing that so it's better to check.
I have some style comments. I'll mention them inline
On Fri, Nov 16, 2012 at 02:47:12PM +0100, Zdeněk Pavlas wrote:
> diff --git a/yum/misc.py b/yum/misc.py
> index a0bac7b..9f403db 100644
> --- a/yum/misc.py
> +++ b/yum/misc.py
> +_deletechars = ''.join(chr(i) for i in range(32) if i not in (9, 10, 13))
>
> def to_xml(item, attrib=False):
> + if type(item) is str:
type checking is better done with isinstance
if isinstance(item, str):
> + # check if valid utf8
> + try: unicode(item, 'utf8')
I benchmarked 'utf8', 'UTF8', 'utf-8', 'UTF-8', and unicode() vs
str.decode() at one point (with both python2.6 and python-2.7)
at one point. unicode('item', 'utf-8') was a runaway winner. There was
a thread about it on upstream's python-dev list at one point. I believe the
reason was that 'utf-8' had been hardcoded in somewhere but the others had
to do a lookup in the codecs module when they were used.
> + except UnicodeDecodeError:
> + item = unicode(item, 'utf8', 'replace').encode('utf8')
> + elif type(item) is unicode:
> + item = item.encode('utf8')
> + elif item is None:
> + return ''
>
Should probably have a final else: clause here. It can be as simple as
throwing an TypeError exception or complex and retrieve the string
representation of item (if you do more complex: the output should then go
through the above steps if: else: steps to make sure it is valid utf-8).
> + # compat cruft...
> item = item.rstrip()
> +
> + # kill ivalid low bytes
> + item = item.translate(None, _deletechars)
> +
> + # quote reserved XML characters
> + item = item.replace("&", "&")
> + item = item.replace("<", "<")
> + item = item.replace(">", ">")
> if attrib:
> + item = item.replace('"', '"')
> +
> return item
In certain circumstances (if you don't know whether the string will be used
in an attribute that uses single or double quotes) you'll need to do this
replacement as well:
item = item.replace("'", ''')
If you had a unicode string here, you could do all of the replacing and
killing of bad bytes with a single call to item.translate() [because unicode
string's translate takes a dict]. You'd need to benchmark with some actual
data to see whether the function call overhead of multiple item.replace()
calls or the cost of encoding everything [note: the cost of decoding strings
is already paid for as that was part of the test at the top for being valid
utf-8] is more expensive.
(Function calls are surprisingly expensive in python so I wouldn't be
surprised either way).
-Toshio
pgpIELKqlUiUo.pgp
Description: PGP signature
_______________________________________________ Yum-devel mailing list [email protected] http://lists.baseurl.org/mailman/listinfo/yum-devel
