Textwrapping with paragraphs, was RE: Confusing textwrap parameters, and request for RE help

Peter Otten Mon, 30 Mar 2020 01:08:11 -0700

Steve Smith wrote:

> I am having the same issue. I can either get the text to wrap, which makes
> all the text wrap, or I can get the text to ignore independent '\n'
> characters, so that all the blank space is removed. I'd like to set up my
> code, so that only 1 blank space is remaining (I'll settle for none at
> this point), and the text wraps up to 100 chars or so out per line. Does
> anyone have any thoughts on the attached code? And what I'm not doing
> correctly?
> 
> 
> #import statements
> import textwrap
> import requests
> from bs4 import BeautifulSoup
> 
> #class extension of textwrapper
> class DocumentWrapper(textwrap.TextWrapper):
> 
>     def wrap(self, text):
>         split_text = text.split('\n')
>         lines = [line for para in split_text
>                  for line in textwrap.TextWrapper.wrap(self, para)]
>         return lines
> 
> #import statement of text.
> page = requests.get("http://classics.mit.edu/Aristotle/rhetoric.mb.txt")
> soup = BeautifulSoup(page.text, "html.parser")
> 
> #instantiation of extension of textwrap.wrap.
> d = DocumentWrapper(width=110, initial_indent='', fix_sentence_endings=True)
> new_string = d.fill(page.text)
> 
> #set up an optional variable, even attempted applying BOTH the extended
> #method and the original method to the issue... nothing has worked.
> #new_string_2 = textwrap.wrap(new_string,90)
> 
> #with loop with JUST the class extension of textwrapper.
> with open("Art_of_Rhetoric.txt", "w") as f:
>     f.writelines(new_string)
> 
> #with loop with JUST the standard textwrapper.text method applied to it.
> with open("Art_of_Rhetoric2.txt", "w") as f:
>     f.writelines(textwrap.wrap(page.text,90))


I think the problem in your case is that single newlines in the source text
do not mark paragraph boundaries, so you should not keep them. Instead, try
interpreting empty lines as paragraph separators:

$ cat tmp.py                
import sys
import textwrap
import itertools

import requests
from bs4 import BeautifulSoup


class DocumentWrapper(textwrap.TextWrapper):
    def wrap(self, text):
        paras = (
            "".join(group) for non_empty, group in itertools.groupby(
                text.splitlines(True),
                key=lambda line: bool(line.strip())
            ) if non_empty
        )
        wrap = super().wrap
        lines = [line for para in paras for line in wrap(para)]
        return lines


page = requests.get("http://classics.mit.edu/Aristotle/rhetoric.mb.txt").text

d = DocumentWrapper(width=110, initial_indent='', fix_sentence_endings=True)
new_string = d.fill(page)

sys.stdout.write(new_string)
$ python3 tmp.py | head -n10
Provided by The Internet Classics Archive.  See bottom for copyright.  Available online at
http://classics.mit.edu//Aristotle/rhetoric.html
Rhetoric By Aristotle
Translated by W. Rhys Roberts
----------------------------------------------------------------------
BOOK I
Part 1
Rhetoric is the counterpart of Dialectic.  Both alike are concerned with such things as come, more or less,
within the general ken of all men and belong to no definite science.  Accordingly all men make use, more or
less, of both; for to a certain extent all men attempt to discuss statements and to maintain them, to defend
Traceback (most recent call last):
  File "tmp.py", line 27, in <module>
    sys.stdout.write(new_string)
BrokenPipeError: [Errno 32] Broken pipe
$
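
The script above drops the blank lines between paragraphs altogether. If
you would rather keep exactly one blank line between them, as the original
question asked, something like the following might work. It is only a
sketch: split_paragraphs() and fill_paragraphs() are names made up for this
example, the sample text is invented, and a plain function is used instead
of the TextWrapper subclass so that it runs without network access.

```python
import itertools
import textwrap


def split_paragraphs(text):
    """Yield each run of consecutive non-blank lines as one string."""
    groups = itertools.groupby(
        text.splitlines(True),
        key=lambda line: bool(line.strip()),
    )
    for non_empty, group in groups:
        if non_empty:
            yield "".join(group)


def fill_paragraphs(text, width=100, **kwargs):
    """Wrap every paragraph on its own, then rejoin with one blank line."""
    wrapper = textwrap.TextWrapper(width=width, **kwargs)
    return "\n\n".join(wrapper.fill(para) for para in split_paragraphs(text))


# Invented sample: two paragraphs separated by two blank lines.
sample = (
    "Rhetoric is the counterpart\n"
    "of Dialectic.\n"
    "\n"
    "\n"
    "Both alike are concerned\n"
    "with such things.\n"
)

result = fill_paragraphs(sample, width=60)
print(result)
```

Newlines inside a paragraph are turned into spaces by TextWrapper's default
replace_whitespace=True, so only the empty lines survive as separators, and
any run of them collapses to a single blank line.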

