Re: utils.text.wrap is O(n^2)

Michael Radziej Sun, 17 Dec 2006 13:53:50 -0800

Hi,

I stayed interested in this and found a better solution. Splitting the
whole text into words first is only the best approach when the line
width is close to the average word length. In the more typical case
there are several words per line, and then it's faster to go line by
line, searching backwards for the last space from the rightmost possible
end of each line. That gives this function:


def wrap(text, width):
    def _generate():
        # `start` is the start position of the next line
        # `nl_pos` the position of the next newline character
        # `space_pos` the position of the space or newline
        #   where a line break could occur.
        start = 0
        nl_pos = text.find('\n')
        if nl_pos == -1:
            nl_pos = len(text)
        while start < len(text):
            if nl_pos <= start + width:
                space_pos = nl_pos
            else:
                space_pos = text.rfind(' ', start, start + width + 1)
                if space_pos == -1:
                    space_pos = text.find(' ', start, nl_pos)
                    if space_pos == -1:
                        space_pos = len(text)
            if nl_pos <= space_pos:
                yield text[start:nl_pos]
                start = nl_pos + 1
                nl_pos = text.rfind('\n', start, start + width + 1)
                if nl_pos == -1:
                    # search forward for the first newline past the window
                    nl_pos = text.find('\n', start + width + 1)
                    if nl_pos == -1:
                        nl_pos = len(text)
            else:
                yield text[start:space_pos]
                start = space_pos + 1
    return "\n".join(_generate())

For the test text in the activestate URL and a line width of 40, this is
5 times faster than my previous approach. But for width=7 it takes a
little longer, and for width=0 about 70% longer. I still think the code
above is an overall improvement in speed. It should also use less memory
since it doesn't need to build a list of words (but I haven't done any
measurements).
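
A minimal sketch of how timings like these could be reproduced with the
standard timeit module. The helper name time_wrap, the sample text and
the repeat counts are assumptions for illustration, not the setup behind
the numbers above. textwrap.fill only stands in as a callable taking
(text, width); substitute the wrap() variants being compared.

```python
import textwrap
import timeit


def time_wrap(func, text, width, number=20, repeat=3):
    """Return the best time in seconds for `number` calls of func(text, width)."""
    timer = timeit.Timer(lambda: func(text, width))
    # Take the minimum over several runs to reduce noise from other processes.
    return min(timer.repeat(repeat=repeat, number=number))


# Hypothetical sample text; the measurements quoted above used a different one.
SAMPLE = "The quick brown fox jumps over the lazy dog.\n" * 50

if __name__ == "__main__":
    for width in (7, 40):
        seconds = time_wrap(textwrap.fill, SAMPLE, width)
        print("width=%d: %.4f s" % (width, seconds))
```

Note that textwrap.fill rejects width=0, so that case can only be timed
with a wrap() implementation that accepts it.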

This function and all the previous implementations of wrap() share one
buglet: if the line break occurs on the first space of a sequence of
spaces, the next line will start with a space, even when there's no
indentation in the original text. Is this a bug to be fixed, or should
we rather stay bug-compatible here? This (handwritten) diff fixes the
buglet in the code above:


                 yield text[start:space_pos]
                 start = space_pos + 1
+                # remove spaces after "soft" line break:
+                while start < nl_pos and text[start]==" ":
+                    start += 1
     return "\n".join(_generate())

For width=40, the runtime increases by 17 %. The penalty is less for
width=7 or width=0.
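
To make the buglet concrete, here is a sketch of the function with the
diff applied. The strip_soft_spaces flag is hypothetical and exists only
so both behaviours can be compared side by side. The sketch also assumes
that the fallback search for the next newline is a forward text.find().

```python
def wrap(text, width, strip_soft_spaces=True):
    """Wrap `text` at `width` columns, keeping existing newlines."""
    def _generate():
        start = 0  # start position of the next line
        nl_pos = text.find('\n')  # position of the next newline
        if nl_pos == -1:
            nl_pos = len(text)
        while start < len(text):
            if nl_pos <= start + width:
                space_pos = nl_pos
            else:
                # last space that still fits into the line
                space_pos = text.rfind(' ', start, start + width + 1)
                if space_pos == -1:
                    # overlong word: break at the first space after it
                    space_pos = text.find(' ', start, nl_pos)
                    if space_pos == -1:
                        space_pos = len(text)
            if nl_pos <= space_pos:
                yield text[start:nl_pos]
                start = nl_pos + 1
                nl_pos = text.rfind('\n', start, start + width + 1)
                if nl_pos == -1:
                    # assumption: forward search for the next newline
                    nl_pos = text.find('\n', start + width + 1)
                    if nl_pos == -1:
                        nl_pos = len(text)
            else:
                yield text[start:space_pos]
                start = space_pos + 1
                if strip_soft_spaces:
                    # remove spaces after a "soft" line break
                    while start < nl_pos and text[start] == " ":
                        start += 1
    return "\n".join(_generate())


# Two spaces after "aaaa": the break falls on the first of them.
print(repr(wrap("aaaa  bb cc", 4, strip_soft_spaces=False)))  # 'aaaa\n bb\ncc'
print(repr(wrap("aaaa  bb cc", 4)))                           # 'aaaa\nbb\ncc'
```

Without the fix the second line comes out as " bb", with it as "bb".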


Sure, I'm going to contribute this back to activestate, but first I'd
like to see your comments and hear whether I missed something.

Michael

