[issue15045] Make textwrap.dedent() and textwrap.indent() handle whitespace consistently

Nick Coghlan Tue, 22 Jan 2019 02:54:06 -0800


Nick Coghlan <[email protected]> added the comment:


Putting some of my comments here rather than on the PR, as they're design 
questions related to "Is the current behaviour actually wrong?" and "Even if 
the current behaviour is deemed technically incorrect, is it worth the risk of 
changing it after all these years?".

I've also retitled the issue to describe the desired outcome of increased 
consistency between textwrap.dedent() and textwrap.indent(), without expressing 
a preference one way or the other.

1. Correctness & consistency

Quoting the textwrap.py comment about ASCII whitespace that Serhiy mentioned:

```
# Hardcode the recognized whitespace characters to the US-ASCII
# whitespace characters.  The main reason for doing this is that
# some Unicode spaces (like \u00a0) are non-breaking whitespaces.
_whitespace = '\t\n\x0b\x0c\r '
```

It's not clear whether a line containing only non-breaking whitespace should 
actually be counted as an empty line, as a no-break space wouldn't be a 
candidate break point for line wrapping if it appeared between two words (e.g. 
"first\N{NO-BREAK SPACE}second"), even though str.split() would split on it, 
and str.strip() would remove it from the beginning or end of a string (and that 
discrepancy is by design, since it's what "non-breaking" *means*).
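
The discrepancy described above can be observed directly. The snippet below is 
only an illustrative sketch: the variable names are made up for the example, 
and `break_long_words=False` is passed so that textwrap doesn't split inside 
the over-long chunk.

```python
import textwrap

NBSP = "\N{NO-BREAK SPACE}"
text = f"first{NBSP}second"

# str.split() with no arguments splits on any Unicode whitespace, NBSP included.
split_result = text.split()

# str.strip() with no arguments likewise removes NBSP from both ends.
strip_result = f"{NBSP}word{NBSP}".strip()

# textwrap only breaks on ASCII whitespace, so the NBSP-joined pair stays on
# one line even though it is wider than the requested width.
nbsp_wrapped = textwrap.wrap(text, width=8, break_long_words=False)

# With an ordinary space the same text is wrapped onto two lines.
space_wrapped = textwrap.wrap("first second", width=8, break_long_words=False)

print(split_result)
print(repr(strip_result))
print(nbsp_wrapped)
print(space_wrapped)
```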

So the interesting test cases from that perspective would be strings like:

"""\
    4 space indent
\N{NO-BREAK SPACE}   4 spaces, but first is no-break
    4 space indent
"""
-> textwrap.dedent() should do nothing

"""\
    4 space indent
\N{NO-BREAK SPACE}
    Previous line is just a single no-break space
"""
-> textwrap.dedent() should do nothing

"""\
    4 space indent
    \N{NO-BREAK SPACE}5 spaces, but last is no-break
    4 space indent
"""
-> textwrap.dedent() should strip the common 4-space prefix

"""\
    4 space indent
    \N{NO-BREAK SPACE}
    Previous line is indented with 4 spaces
"""
-> textwrap.dedent() should strip the common 4-space prefix
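
To make those four expectations concrete, here is a minimal sketch of them as 
executable checks. `dedent_ascii()` is a hypothetical stand-in written for this 
example, not the stdlib function: it assumes that only space and tab count as 
indentation, which is the interpretation the expectations above rely on. What 
textwrap.dedent() itself does with these strings may depend on the Python 
version.

```python
import os.path

NBSP = "\N{NO-BREAK SPACE}"


def dedent_ascii(text):
    """Hypothetical dedent that treats only space and tab as indentation."""
    lines = text.split("\n")
    # Leading space/tab run of every line that has any other content.
    margins = [
        line[: len(line) - len(line.lstrip(" \t"))]
        for line in lines
        if line.strip(" \t")
    ]
    # commonprefix() compares character by character, so it works on any
    # list of strings, not just paths.
    margin = os.path.commonprefix(margins)
    # Lines holding only spaces/tabs are normalised to empty strings.
    return "\n".join(
        line[len(margin):] if line.strip(" \t") else "" for line in lines
    )


case1 = f"    4 space indent\n{NBSP}   4 spaces, but first is no-break\n    4 space indent\n"
case2 = f"    4 space indent\n{NBSP}\n    Previous line is just a single no-break space\n"
case3 = f"    4 space indent\n    {NBSP}5 spaces, but last is no-break\n    4 space indent\n"
case4 = f"    4 space indent\n    {NBSP}\n    Previous line is indented with 4 spaces\n"

# Cases 1 and 2: a line starting with NBSP has no space/tab margin at all.
assert dedent_ascii(case1) == case1
assert dedent_ascii(case2) == case2

# Cases 3 and 4: the common 4-space prefix goes, the NBSP itself stays.
assert dedent_ascii(case3) == (
    f"4 space indent\n{NBSP}5 spaces, but last is no-break\n4 space indent\n"
)
assert dedent_ascii(case4) == (
    f"4 space indent\n{NBSP}\nPrevious line is indented with 4 spaces\n"
)
```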

The potential inconsistency I cite above is with the then-new textwrap.indent() 
which *does* consider all lines consisting solely of whitespace (whether 
non-breaking or not) to be blank lines, and hence doesn't indent them. This 
means that the following test string wouldn't round trip correctly through a 
textwrap.indent/textwrap.dedent pair:

"""\
4 space indent
\N{NO-BREAK SPACE}
Previous line is indented as usual
"""

indent() would skip adding leading whitespace to the second line, which means 
dedent() would subsequently fail to detect a common leading prefix to be 
stripped.

However, that can easily be considered a bug in textwrap.indent() - it's the 
newer function, so it was a design error to make it inconsistent with the 
textwrap.dedent() precedent.
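
A short sketch of that round trip, using a simplified test string. Only the 
indent() step is commented with its result here; the exact output of the 
dedent() step is left to be printed, since it may differ between Python 
versions.

```python
import textwrap

NBSP = "\N{NO-BREAK SPACE}"
original = f"first line\n{NBSP}\nthird line\n"

# indent() treats the NBSP-only line as blank and leaves it without a prefix.
indented = textwrap.indent(original, "    ")

# dedent() is then applied to text whose second line lacks the 4-space prefix.
round_tripped = textwrap.dedent(indented)

print(repr(indented))
print(repr(round_tripped))
print("round trip preserved the text:", round_tripped == original)
```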

2. Performance

Since this issue was opened purely on design consistency grounds, it needs to 
offer really compelling benefits if we're going to risk a change that might 
make textwrap.dedent() slower.

I don't think we've reached that bar. With universal newlines as the default 
text reading behaviour, it's going to be fairly rare for `textwrap.dedent()` to 
be applied to strings containing `\r` or `\r\n`, and if it is, converting both 
of those to `\n` beforehand is a pretty straightforward normalisation step: 
`text.replace('\r\n', '\n').replace('\r', '\n')` (or a comparable regex based on 
https://stackoverflow.com/questions/1331815/regular-expression-to-match-cross-platform-newline-characters).
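
As a rough sketch of that normalisation step, using str.replace() and, as an 
alternative, a single re.sub() call. The helper names are invented for this 
example, and the regex is one plausible choice rather than the one from the 
linked answer.

```python
import re
import textwrap


def normalize_newlines(text):
    # Handle "\r\n" first so that it isn't turned into two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalize_newlines_re(text):
    # "\r\n?" matches a carriage return with an optional following "\n".
    return re.sub(r"\r\n?", "\n", text)


raw = "    first\r\n\r\n    second\r    third\n"

normalized = normalize_newlines(raw)
dedented = textwrap.dedent(normalized)

print(repr(normalized))  # '    first\n\n    second\n    third\n'
print(repr(dedented))    # 'first\n\nsecond\nthird\n'
```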

So I think the most that can be argued for when it comes to the newline 
handling in dedent() is a *documentation* change that notes that 
textwrap.dedent() splits lines strictly on `\n`, and other line endings need to 
be normalized to `\n` before it will work correctly.

That leaves indent(), and I think the case can plausibly be made there that it 
should be using _whitespace_only_re for its empty line detection (the same as 
dedent()), instead of using str.strip().
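
For comparison, the effect of such a change can be approximated today through 
indent()'s existing `predicate` argument. This is only a sketch: 
`is_not_ascii_blank()` and `indent_ascii_blank()` are hypothetical helpers, and 
they assume that "blank" means a line made up solely of spaces, tabs and its 
line terminator.

```python
import textwrap

NBSP = "\N{NO-BREAK SPACE}"


def is_not_ascii_blank(line):
    # indent() passes each line with its terminator attached, so the
    # terminator characters are stripped along with spaces and tabs.
    return bool(line.strip(" \t\r\n"))


def indent_ascii_blank(text, prefix):
    """Indent every line except those that are blank in the ASCII sense."""
    return textwrap.indent(text, prefix, predicate=is_not_ascii_blank)


text = f"first line\n{NBSP}\n\nthird line\n"

default_result = textwrap.indent(text, "    ")
ascii_result = indent_ascii_blank(text, "    ")

print(repr(default_result))  # NBSP-only line is left without a prefix
print(repr(ascii_result))    # NBSP-only line gets the prefix as well
```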

----------
title: Make textwrap.dedent() consistent with str.splitlines(True) and 
str.strip() -> Make textwrap.dedent() and textwrap.indent() handle whitespace 
consistently

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue15045>
_______________________________________