[issue22687] horrible performance of textwrap.wrap() with a long word

2015-03-24 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 7bd87a219813 by Serhiy Storchaka in branch 'default':
Issue #22687: Fixed some corner cases in breaking words in tetxtwrap.
https://hg.python.org/cpython/rev/7bd87a219813

--
nosy: +python-dev

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2015-03-24 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
resolution:  - fixed
stage: patch review - resolved
status: open - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2015-02-15 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

wordsplit_3.patch is wordsplit_2.patch with few added comments in tests. Is it 
enough?

--
Added file: http://bugs.python.org/file38149/wordsplit_3.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2015-01-28 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Ping. What can I do to move this issue forward?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-21 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

So what the patch (with mitigated tests) is more preferable?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Here is a patch which is closer to current code but solves complexity issue and 
also fixes some bugs in current code.

$ ./python -c import textwrap; print(textwrap.wrap('this-is-a-useful-feature', 
width=1, break_long_words=False))
['this-', 'is-a', '-useful-', 'feature']
$ ./python -c import textwrap; print(textwrap.wrap('what-d\x27you-call-it.', 
width=1, break_long_words=False))
['what-d', 'you-, 'call-', 'it.']

--
versions: +Python 2.7, Python 3.4
Added file: http://bugs.python.org/file37188/wordsplit.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread Georg Brandl

Georg Brandl added the comment:

LGTM.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread Antoine Pitrou

Antoine Pitrou added the comment:

I don't understand:

+expect = (this-|is-a-useful-|feature-|for-|
+  reformatting-|posts-|from-|tim-|peters'ly).split('|')
+self.check_wrap(text, 1, expect, break_long_words=False)
+self.check_split(text, expect)

Why would is-a-useful remain unsplit? It looks like you're making up new 
rules.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread Antoine Pitrou

Changes by Antoine Pitrou pit...@free.fr:


--
versions:  -Python 2.7, Python 3.4

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

This is old rule. \w{2,}-(?=\w{2,} -- single letter shouldn't be separated. But 
there was a bug in such simple regex, it splits a word after non-word character 
(in particular apostrophe or hyphen) if it followed by word characters and 
hyphen. There were attempts to fix this bug in issue596434 and issue965425 but 
they missed a cases when non-word character is occurred inside a word.

Originally I had assigned this issue only to 3.5 because I supposed that the 
solution needs either new features in re or backward-incompatible changes to 
word splitting algorithm. But found solution doesn't require 3.5-only features, 
doesn't change interface, and fixes performance and behavior bugs. So I think 
it should be applied to maintained releases too.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread Antoine Pitrou

Antoine Pitrou added the comment:

 This is old rule. \w{2,}-(?=\w{2,} -- single letter shouldn't be separated.

I don't agree. This was an implementation detail. There was no test, and it 
wasn't specified anywhere.
If you think single letter shouldn't be separated, there should be some 
grammatical or typographical reference on the Internet to prove it.

 There were attempts to fix this bug in issue596434 and issue965425 

Those don't seem related to single letters between hyphens.

 But found solution doesn't require 3.5-only features, doesn't change 
 interface, and fixes performance and behavior bugs.

It does change behaviour in ways that could break existing code. The textwrap 
behaviour is underspecified so it's not ok to assume that previous behaviour 
was obviously buggy.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread R. David Murray

R. David Murray added the comment:

https://owl.english.purdue.edu/owl/resource/576/01/

Rule 8.

So, no, in the middle of the word single letters aren't a problem, only at the 
beginning or the end of the word.

--
nosy: +r.david.murray

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Thank you David. If splitting single letter surrounded with hyphens is 
desirable, here is more complicated patch which does this. It deviates from 
original code more, but it doesn't look break any reasonable example.

 The textwrap behaviour is underspecified so it's not ok to assume that 
 previous behaviour was obviously buggy.

Aren't ['this-', 'is-a', '-useful-', 'feature'] and ['what-d', 'you-, 
'call-', 'it.'] obvious bugs?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Added file: http://bugs.python.org/file37190/wordsplit_2.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread Antoine Pitrou

Antoine Pitrou added the comment:

To clarify, I would be fine with the previous patch if it didn't add the tests.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread Antoine Pitrou

Antoine Pitrou added the comment:

 Aren't ['this-', 'is-a', '-useful-', 'feature'] and
 ['what-d', 'you-, 'call-', 'it.'] obvious bugs?

Obvious according to which rules?

If we want to improve the behaviour of textwrap, IMHO it should be in a 
separate issue. And someone would have to study the word-wrapping rules of the 
English language :-)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread R. David Murray

R. David Murray added the comment:

What I usually do in cases like this is to add the tests but mark them with 
comments saying that the tests test current behavior but are not testing parts 
of the (currently defined) API.  That way you know if a change changes behavior 
and then can decide if that is a problem or not, as opposed to inadvertently 
changing behavior and only finding out when the bug reports roll in :)

But yeah, defining the rules textwrap should follow is a different issue than 
the performance issue.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

 To clarify, I would be fine with the previous patch if it didn't add the 
 tests.

The absent of tests could cause introducing new non-detected bugs and 
reappearing old bugs.

 Obvious according to which rules?

If you think a word should be splitted before hyphen or apostrophe, there 
should be some grammatical or typographical reference on the Internet to prove 
it.

I would be fine with moving the fix of textwrap behavior to a separate issue, 
but what to do with this issue then? We have not a patch which only fixes 
performance complexity and doesn't change the behavior.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-13 Thread Antoine Pitrou

Antoine Pitrou added the comment:

 What I usually do in cases like this is to add the tests but mark
 them with comments saying that the tests test current behavior but
 are not testing parts of the (currently defined) API.  That way
 you know if a change changes behavior and then can decide if that is
 a problem or not, as opposed to inadvertently changing behavior
 and only finding out when the bug reports roll in :)

That's a good idea!

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-12 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Current regex produces insane result.

$ ./python -c import textwrap; print(textwrap.wrap('this-is-a-useful-feature', 
width=1, break_long_words=False))
['this-', 'is-a', '-useful-', 'feature']

Antoine's regex produces more correct result for this case: ['this-', 'is-', 
'a-', 'useful-', 'feature']. But this is not totally correct, one-letter word 
should not be separated. This can be easy fixed.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-12 Thread Antoine Pitrou

Antoine Pitrou added the comment:

 But this is not totally correct, one-letter word should not be
 separated.

Why not? I guess it depends on English's rules for word splitting, which I 
don't know.
In any case, this issue is not about improving correctness, only performance.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-12 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

 Why not? I guess it depends on English's rules for word splitting, which I
 don't know.

I suppose this is common rule in many languages. And current code supports it 
(there is a special code in the regex to ensure this rule).

 In any case, this issue is not about improving correctness,
 only performance.

But the patch shouldn't add a regression.

$ ./python -c import textwrap; print(textwrap.wrap('this-is-a-useful', 
width=1, break_long_words=False))

Current code: ['this-', 'is-a-useful']
Patched: ['this-', 'is-', 'a-', 'useful']

Just use lookahead assertion to ensure that the hyphen is followed by at least 
two letters.

My previous message is about that current code is not always correct so it is 
acceptable to replace it with not absolutely equivalent code.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-12 Thread Antoine Pitrou

Antoine Pitrou added the comment:

 I suppose this is common rule in many languages.

I frankly don't know about this rule. And the tests don't check for it, so for 
me it's not broken.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-12 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Tests are not perfect. But this is intentional design. The part of initial 
regex:

r'\w{2,}-(?=\w{2,})|' # hyphenated words

Now it is more complicated. Note '(?=\w{2,})'.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-11 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Here is a patch which solves the algorithmic complexity issue by using a 
different scheme: instead of splitting, match words incrementally.

--
keywords: +patch
nosy: +pitrou
stage: needs patch - patch review
versions:  -Python 2.7, Python 3.4
Added file: http://bugs.python.org/file37179/wordsplit_complexity.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-11 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Actually, it is enough to change the regexp while still using re.split(). 
Updated patch attached.

--
Added file: http://bugs.python.org/file37180/wordsplit_complexity2.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-11 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Unfortunately there are two disadvantages:

1. wordsep_re and wordsep_simple_re are public attributes and user code can 
depend on this. Changing their is a way to customize TextWrapper.

2. This is slowdown common case (no abnormally long words):

$ ./python -m timeit -s 'import textwrap; s = abcde  * 10**4' -- 
'textwrap.wrap(s)'

Unpatched: 178 msec per loop
Patched: 285 msec per loop

First reason stopped me from writing a patch.

When change the way how to split words, I suggest to use undocumented re 
scanner.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-11 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Are you sure? I get the reverse results here (second patch):

Unpatched:
$ ./python -m timeit -s 'import textwrap; s = abcde  * 10**4' -- 
'textwrap.wrap(s)'
10 loops, best of 3: 27 msec per loop

Patched:
$ ./python -m timeit -s 'import textwrap; s = abcde  * 10**4' -- 
'textwrap.wrap(s)'
10 loops, best of 3: 19.2 msec per loop

 wordsep_re and wordsep_simple_re are public attributes and user code can 
 depend on this. Changing their is a way to customize TextWrapper.

With my second patch, that shouldn't be a problem.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-11 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Oh, sorry, I tested your first patch. Your second patch is faster than current 
code to me. But it changes behavior.

 textwrap.wrap('1a-2b', width=5)
['1a-', '2b']

With the patch the result is ['1a-2', 'b'].

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-11 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Yes... but in both cases the result is nonsensical, and untested.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-11 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Possessive quantifiers (issue433030) is not a panacea. They allow to speed up 
regular expressions, but the complexity is still quadratic. Antoine's patch 
makes the complexity linear.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-11-08 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

May be atomic grouping or possessive quantifiers (issue433030) will help with 
this issue.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-10-21 Thread Piotr Engelking

New submission from Piotr Engelking:

Wrapping a paragraph containing a long word takes a lot of time:

$ time python3 -c 'import textwrap; textwrap.wrap(a * 2 ** 16)'

real3m14.923s
user3m14.792s
sys 0m0.016s
$

A straightforward replacement is 5000 times faster:

$ time python3 -c '(.join(x) for x in zip(*[iter(a * 2 ** 16)] * 70))'

real0m0.053s
user0m0.032s
sys 0m0.016s
$

Tested on Debian with python3.4 3.4.2-1 and python2.7 2.7.8-10.

--
messages: 229768
nosy: inkerman
priority: normal
severity: normal
status: open
title: horrible performance of textwrap.wrap() with a long word
type: performance
versions: Python 2.7, Python 3.4

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-10-21 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
nosy: +georg.brandl, serhiy.storchaka
versions: +Python 3.5

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-10-21 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

This particular case is related to the behavior of the wordsep_re regular 
expression in worst case. When text contains long sequence of words characters 
which is not ended by a hypen, or long sequence of non-word and non-space 
characters (and in some other cases), computational complexity of this regular 
expression matching is quadratic. This is a peculiarity of current 
implementation of regular expression engine. May be it is possible to rewrite 
the regular expression so that quadratic complexity will gone, but this is not 
so easy.

The workaround -- use break_on_hyphens=False.

--
assignee:  - serhiy.storchaka
priority: normal - low
stage:  - needs patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22687] horrible performance of textwrap.wrap() with a long word

2014-10-21 Thread Ben Roberts

Changes by Ben Roberts bjr.robe...@gmail.com:


--
nosy: +roippi

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22687
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com