On Oct 26, 12:05 am, Steven D'Aprano <[EMAIL PROTECTED]cybersource.com.au> wrote:
> On Thu, 25 Oct 2007 23:23:37 +0200, Michal Bozon wrote:
> >> Repeatedly adding strings together in this way is about the most
> >> inefficient, slow way of building up a long string. (Although I'm sure
> >> somebody can come up with a worse way if they try hard enough.)
>
> >> Even though recent versions of CPython have a local optimization that
> >> improves the performance hit of string concatenation somewhat, it is
> >> better to use ''.join() rather than add many strings together:
>
> > String appending is not tragically slower; for strings tens of MB long,
> > the speed difference I measure is a few tens of percent, so it is not
> > several times slower.
>
> That is a half-truth.
>
> Because strings are immutable, when you concat two strings Python has to
> duplicate both of them. This leads to quadratic-time behaviour, where the
> time taken is proportional to the SQUARE of the number of characters.
> This rapidly becomes very slow.
>
> *However*, as of Python 2.4, CPython has an optimization that can detect
> some types of string concatenation and do them in-place, giving (almost)
> linear-time performance. But that depends on:
>
> (1) the implementation: it only works for CPython, not Jython or
> IronPython or other Python implementations;
>
> (2) the version: it is an implementation detail introduced in Python 2.4,
> and is not guaranteed to remain in future versions;
>
> (3) the specific details of how you concat strings: s = s + t will get
> the optimization, but s = t + s or s = s + t1 + t2 will not.
>
> In other words: while having that optimization in place is a win, you
> cannot rely on it. If you care about portable code, the advice to use
> join() still stands.
>
> [snip]
>
> > Nice, I did not know that string translation exists, but Abandoned has
> > defined the allowed characters, so making a translation table for the
> > disallowed characters, which would cover nearly the complete Unicode
> > character table, would be inefficient.
>
> The cost of building the Unicode translation table is minimal: about 1.5
> seconds ONCE, and it is only a couple of megabytes of data:
>
> >>> allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
> ...           u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
> >>> timer = timeit.Timer(
> ...     'not_allowed = [i for i in range(0x110000) if unichr(i) not in allowed]; '
> ...     'TABLE = dict(zip(not_allowed, u" " * len(not_allowed)))',
> ...     'from __main__ import allowed')
> >>> timer.repeat(3, 10)
> [18.267689228057861, 16.495684862136841, 16.785034894943237]
>
> The translate method runs about ten times faster than anything you can
> write in pure Python. If Abandoned has got as much data as he keeps
> saying he has, he will save a lot more than 1.5 seconds by using
> translate compared to relatively slow Python code.
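To put rough numbers on the quadratic claim, here is a minimal benchmark
sketch; the build_by_concat/build_by_join names and the sizes are invented
for illustration, and it deliberately uses the s = t + s form, which the
CPython in-place optimization does not cover:

# Rough benchmark: repeated concatenation vs. ''.join().
# build_by_concat uses s = t + s, so it stays quadratic even
# on CPython 2.4+; build_by_join is (roughly) linear.
import timeit

def build_by_concat(n):
    s = ''
    for i in xrange(n):
        s = 'x' + s
    return s

def build_by_join(n):
    return ''.join('x' for i in xrange(n))

for n in (1000, 4000, 16000):
    concat = timeit.Timer('build_by_concat(%d)' % n,
                          'from __main__ import build_by_concat').timeit(1)
    join = timeit.Timer('build_by_join(%d)' % n,
                        'from __main__ import build_by_join').timeit(1)
    print n, concat, join

Each fourfold increase in n should cost roughly sixteen times as much in
the concat column, but only about four times as much in the join column.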
String translate may run 10 times faster than pure Python, but unicode
translate isn't anywhere near as fast: it has to look up each character
in a mapping dict.

import timeit

timer = timeit.Timer("a.translate(m)",
    setup="a = u'abc' * 1000; m = dict((x, x) for x in range(256))")
print timer.repeat(3, 10000)
[2.4009871482849121, 2.4191598892211914, 2.3641388416290283]

timer = timeit.Timer("a.translate(m)",
    setup="a = 'abc' * 1000; m = ''.join(chr(x) for x in range(256))")
print timer.repeat(3, 10000)
[0.12261486053466797, 0.12225103378295898, 0.12217879295349121]

Also, the unicode translation dict as given doesn't work on characters
that aren't allowed: it should map ints to ints rather than ints to
strings.

Anyway, there's no need to pay the cost of building a full mapping dict
when most of the entries are the same. Something like this can work:

from collections import defaultdict

def clear(message):
    allowed = u'abc...'
    # Allowed ordinals map to themselves; any other ordinal falls
    # through to __missing__, which yields a space.
    clear_translate = defaultdict(lambda: ord(u' '))
    clear_translate.update((c, c) for c in map(ord, allowed))
    return message.translate(clear_translate)

--
Paul Hankin
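For anyone who wants to try the defaultdict approach above, here is a
self-contained sketch; the allowed set and the sample input are stand-ins
invented for illustration, and the table is built once at module level
rather than on every call:

from collections import defaultdict

# Stand-in for the real allowed set discussed in the thread.
allowed = u'+0123456789 abcdefghijklmnopqrstuvwxyz'

# Ordinals of allowed characters map to themselves; any other
# ordinal falls through to __missing__, which yields a space.
clear_translate = defaultdict(lambda: ord(u' '))
clear_translate.update((c, c) for c in map(ord, allowed))

def clear(message):
    return message.translate(clear_translate)

print clear(u'abc, def!')   # the ',' and '!' come out as spaces

One side effect worth knowing: defaultdict's __missing__ inserts each new
key it is asked about, so the table gradually grows toward a full mapping
as more distinct characters pass through. A plain dict subclass whose
__missing__ returns the space ordinal without storing it would avoid that.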