Hello community,

here is the log from the commit of package python-html2text for openSUSE:Leap:15.2 checked in at 2020-04-15 20:07:22
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Leap:15.2/python-html2text (Old)
 and      /work/SRC/openSUSE:Leap:15.2/.python-html2text.new.2738 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Package is "python-html2text"

Wed Apr 15 20:07:22 2020 rev:18 rq: version:2019.8.11

Changes:
--------
--- /work/SRC/openSUSE:Leap:15.2/python-html2text/python-html2text.changes 2020-04-14 14:24:16.617422083 +0200
+++ /work/SRC/openSUSE:Leap:15.2/.python-html2text.new.2738/python-html2text.changes 2020-04-15 20:08:28.638186050 +0200
@@ -2,21 +1,0 @@
-Thu Apr 9 11:17:36 UTC 2020 - Marketa Calabkova <mcalabk...@suse.com>
-
-- Update to 2020.1.16
-  * Add type annotations.
-  * Add support for Python 3.8.
-  * Performance improvements when ``wrap_links`` is ``False`` (the default).
-  * Configure setuptools using setup.cfg.
-
--------------------------------------------------------------------
-Fri Dec 13 13:43:47 UTC 2019 - Matthias Fehring <buschman...@opensuse.org>
-
-- Update to 2019.9.26:
-  * Fix long blockquotes wrapping.
-  * Remove the trailing whitespaces that were added after wrapping list items & blockquotes.
-  * Remove support for Python <= 3.4. Now requires Python 3.5+.
-  * Fix memory leak when processing a document containing a <abbr> tag.
-  * Fix AttributeError when reading text from stdin.
-  * Fix UnicodeEncodeError when writing output to stdout.
-- Disable build for Python 2
-
--------------------------------------------------------------------

Old:
----
  html2text-2020.1.16.tar.gz

New:
----
  html2text-2019.8.11.tar.gz

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ python-html2text.spec ++++++
--- /var/tmp/diff_new_pack.iuv4gt/_old 2020-04-15 20:08:29.010186303 +0200
+++ /var/tmp/diff_new_pack.iuv4gt/_new 2020-04-15 20:08:29.014186306 +0200
@@ -1,7 +1,7 @@
 #
 # spec file for package python-html2text
 #
-# Copyright (c) 2020 SUSE LLC
+# Copyright (c) 2019 SUSE LINUX GmbH, Nuernberg, Germany.
# # All modifications and additions to the file contributed by third parties # remain the property of their copyright owners, unless otherwise agreed @@ -17,10 +17,9 @@ %define upname html2text -%define skip_python2 1 %{?!python_module:%define python_module() python-%{**} python3-%{**}} Name: python-%{upname} -Version: 2020.1.16 +Version: 2019.8.11 Release: 0 Summary: Python script for turning HTML into Markdown text License: GPL-3.0-only @@ -64,8 +63,6 @@ %python_uninstall_alternative html2text %check -# otherwise python 3.6 does not automatically select UTF-8 for console output -export LANG=en_US.UTF-8 %pytest %files %{python_files} ++++++ html2text-2020.1.16.tar.gz -> html2text-2019.8.11.tar.gz ++++++ diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/ChangeLog.rst new/html2text-2019.8.11/ChangeLog.rst --- old/html2text-2020.1.16/ChangeLog.rst 2020-01-16 15:20:17.000000000 +0100 +++ new/html2text-2019.8.11/ChangeLog.rst 2019-08-11 21:33:38.000000000 +0200 @@ -1,25 +1,3 @@ -2020.1.16 -========= ----- - -* Add type annotations. -* Add support for Python 3.8. -* Performance improvements when ``wrap_links`` is ``False`` (the default). -* Configure setuptools using setup.cfg. - - -2019.9.26 -========= ----- - -* Fix long blockquotes wrapping. -* Remove the trailing whitespaces that were added after wrapping list items & blockquotes. -* Remove support for Python ≤ 3.4. Now requires Python 3.5+. -* Fix memory leak when processing a document containing a ``<abbr>`` tag. -* Fix ``AttributeError`` when reading text from stdin. -* Fix ``UnicodeEncodeError`` when writing output to stdout. - - 2019.8.11 ========= ---- @@ -32,16 +10,13 @@ * Add ``__main__.py`` module to allow running the CLI using ``python -m html2text ...``. * Fix #238: correct spacing when a HTML entity follows a non-stressed tags which follow a stressed tag. 
* Remove unused or deprecated: - * ``html2text.compat.escape()`` * ``html2text.config.RE_UNESCAPE`` * ``html2text.HTML2Text.replaceEntities()`` * ``html2text.HTML2Text.unescape()`` * ``html2text.unescape()`` - * Fix #208: handle LEFT-TO-RIGHT MARK after a stressed tag. - 2018.1.9 ======== ---- diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/PKG-INFO new/html2text-2019.8.11/PKG-INFO --- old/html2text-2020.1.16/PKG-INFO 2020-01-16 15:21:10.000000000 +0100 +++ new/html2text-2019.8.11/PKG-INFO 2019-08-11 21:36:00.000000000 +0200 @@ -1,6 +1,6 @@ Metadata-Version: 2.1 Name: html2text -Version: 2020.1.16 +Version: 2019.8.11 Summary: Turn HTML into equivalent Markdown-structured text. Home-page: https://github.com/Alir3z4/html2text/ Author: Aaron Swartz @@ -101,13 +101,14 @@ Classifier: License :: OSI Approved :: GNU General Public License (GPL) Classifier: Operating System :: OS Independent Classifier: Programming Language :: Python +Classifier: Programming Language :: Python :: 2 +Classifier: Programming Language :: Python :: 2.7 Classifier: Programming Language :: Python :: 3 +Classifier: Programming Language :: Python :: 3.4 Classifier: Programming Language :: Python :: 3.5 Classifier: Programming Language :: Python :: 3.6 Classifier: Programming Language :: Python :: 3.7 -Classifier: Programming Language :: Python :: 3.8 -Classifier: Programming Language :: Python :: 3 :: Only Classifier: Programming Language :: Python :: Implementation :: CPython Classifier: Programming Language :: Python :: Implementation :: PyPy -Requires-Python: >=3.5 +Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.* Description-Content-Type: text/markdown diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/html2text/__init__.py new/html2text-2019.8.11/html2text/__init__.py --- old/html2text-2020.1.16/html2text/__init__.py 2020-01-16 15:20:17.000000000 +0100 +++ 
new/html2text-2019.8.11/html2text/__init__.py 2019-08-11 21:35:55.000000000 +0200 @@ -1,16 +1,14 @@ +# coding: utf-8 """html2text: Turn HTML into equivalent Markdown-structured text.""" +from __future__ import division, unicode_literals -import html.entities -import html.parser import re -import urllib.parse as urlparse +import sys from textwrap import wrap -from typing import Dict, List, Optional, Tuple, Union -from . import config -from .elements import AnchorElement, ListElement -from .typing import OutCallback -from .utils import ( +from html2text import config +from html2text.compat import HTMLParser, urlparse +from html2text.utils import ( dumb_css_parser, element_style, escape_md, @@ -21,32 +19,38 @@ google_text_emphasis, hn, list_numbering_start, + name2cp, pad_tables_in_text, skipwrap, unifiable_n, ) -__version__ = (2020, 1, 16) +try: + chr = unichr + nochr = unicode("") +except NameError: + # python3 uses chr + nochr = str("") + +__version__ = (2019, 8, 11) # TODO: # Support decoded entities with UNIFIABLE. -class HTML2Text(html.parser.HTMLParser): - def __init__( - self, - out: Optional[OutCallback] = None, - baseurl: str = "", - bodywidth: int = config.BODY_WIDTH, - ) -> None: +class HTML2Text(HTMLParser.HTMLParser): + def __init__(self, out=None, baseurl="", bodywidth=config.BODY_WIDTH): """ Input parameters: out: possible custom replacement for self.outtextf (which appends lines of text). 
baseurl: base URL of the document we process """ - super().__init__(convert_charrefs=False) + kwargs = {} + if sys.version_info >= (3, 4): + kwargs["convert_charrefs"] = False + HTMLParser.HTMLParser.__init__(self, **kwargs) # Config options self.split_next_td = False @@ -90,20 +94,20 @@ self.out = out # empty list to store output characters before they are "joined" - self.outtextlist = [] # type: List[str] + self.outtextlist = [] self.quiet = 0 self.p_p = 0 # number of newline character to print before next output self.outcount = 0 self.start = True self.space = False - self.a = [] # type: List[AnchorElement] - self.astack = [] # type: List[Optional[Dict[str, Optional[str]]]] - self.maybe_automatic_link = None # type: Optional[str] + self.a = [] + self.astack = [] + self.maybe_automatic_link = None self.empty_link = False self.absolute_url_matcher = re.compile(r"^[a-zA-Z+]+://") self.acount = 0 - self.list = [] # type: List[ListElement] + self.list = [] self.blockquote = 0 self.pre = False self.startpre = False @@ -113,57 +117,52 @@ self.lastWasNL = False self.lastWasList = False self.style = 0 - self.style_def = {} # type: Dict[str, Dict[str, str]] - self.tag_stack = ( - [] - ) # type: List[Tuple[str, Dict[str, Optional[str]], Dict[str, str]]] + self.style_def = {} + self.tag_stack = [] self.emphasis = 0 self.drop_white_space = 0 self.inheader = False - # Current abbreviation definition - self.abbr_title = None # type: Optional[str] - # Last inner HTML (for abbr being defined) - self.abbr_data = None # type: Optional[str] - # Stack of abbreviations to write later - self.abbr_list = {} # type: Dict[str, str] + self.abbr_title = None # current abbreviation definition + self.abbr_data = None # last inner HTML (for abbr being defined) + self.abbr_list = {} # stack of abbreviations to write later self.baseurl = baseurl self.stressed = False self.preceding_stressed = False - self.preceding_data = "" - self.current_tag = "" + self.preceding_data = None + 
self.current_tag = None config.UNIFIABLE["nbsp"] = " _place_holder;" - def feed(self, data: str) -> None: + def feed(self, data): data = data.replace("</' + 'script>", "</ignore>") - super().feed(data) + HTMLParser.HTMLParser.feed(self, data) - def handle(self, data: str) -> str: + def handle(self, data): self.feed(data) self.feed("") - markdown = self.optwrap(self.finish()) + markdown = self.optwrap(self.close()) if self.pad_tables: return pad_tables_in_text(markdown) else: return markdown - def outtextf(self, s: str) -> None: + def outtextf(self, s): self.outtextlist.append(s) if s: self.lastWasNL = s[-1] == "\n" - def finish(self) -> str: - self.close() + def close(self): + HTMLParser.HTMLParser.close(self) self.pbr() self.o("", force="end") - outtext = "".join(self.outtextlist) + outtext = nochr.join(self.outtextlist) if self.unicode_snob: - nbsp = html.entities.html5["nbsp;"] + nbsp = chr(name2cp("nbsp")) else: - nbsp = " " + nbsp = chr(32) outtext = outtext.replace(" _place_holder;", nbsp) # Clear self.outtextlist to avoid memory leak of its content to @@ -172,10 +171,10 @@ return outtext - def handle_charref(self, c: str) -> None: + def handle_charref(self, c): self.handle_data(self.charref(c), True) - def handle_entityref(self, c: str) -> None: + def handle_entityref(self, c): ref = self.entityref(c) # ref may be an empty string (e.g. 
for ‎/‏ markers that should @@ -187,13 +186,13 @@ if ref: self.handle_data(ref, True) - def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None: - self.handle_tag(tag, dict(attrs), start=True) + def handle_starttag(self, tag, attrs): + self.handle_tag(tag, attrs, 1) - def handle_endtag(self, tag: str) -> None: - self.handle_tag(tag, {}, start=False) + def handle_endtag(self, tag): + self.handle_tag(tag, None, 0) - def previousIndex(self, attrs: Dict[str, Optional[str]]) -> Optional[int]: + def previousIndex(self, attrs): """ :type attrs: dict @@ -203,15 +202,17 @@ """ if "href" not in attrs: return None + i = -1 + for a in self.a: + i += 1 + match = False - match = False - for i, a in enumerate(self.a): - if "href" in a.attrs and a.attrs["href"] == attrs["href"]: - if "title" in a.attrs or "title" in attrs: + if "href" in a and a["href"] == attrs["href"]: + if "title" in a or "title" in attrs: if ( - "title" in a.attrs + "title" in a and "title" in attrs - and a.attrs["title"] == attrs["title"] + and a["title"] == attrs["title"] ): match = True else: @@ -219,11 +220,8 @@ if match: return i - return None - def handle_emphasis( - self, start: bool, tag_style: Dict[str, str], parent_style: Dict[str, str] - ) -> None: + def handle_emphasis(self, start, tag_style, parent_style): """ Handles various text emphases """ @@ -294,10 +292,13 @@ if strikethrough: self.quiet -= 1 - def handle_tag( - self, tag: str, attrs: Dict[str, Optional[str]], start: bool - ) -> None: + def handle_tag(self, tag, attrs, start): self.current_tag = tag + # attrs is None for endtags + if attrs is None: + attrs = {} + else: + attrs = dict(attrs) if self.tag_callback is not None: if self.tag_callback(self, tag, attrs, start) is True: @@ -320,7 +321,7 @@ # need the attributes of the parent nodes in order to get a # complete style description for the current element. we assume # that google docs export well formed html. 
- parent_style = {} # type: Dict[str, str] + parent_style = {} if start: if self.tag_stack: parent_style = self.tag_stack[-1][2] @@ -389,10 +390,8 @@ self.blockquote -= 1 self.p() - def no_preceding_space(self: HTML2Text) -> bool: - return bool( - self.preceding_data and re.match(r"[^\s]", self.preceding_data[-1]) - ) + def no_preceding_space(self): + return self.preceding_data and re.match(r"[^\s]", self.preceding_data[-1]) if tag in ["em", "i", "u"] and not self.ignore_emphasis: if start and no_preceding_space(self): @@ -441,10 +440,9 @@ self.abbr_title = attrs["title"] else: if self.abbr_title is not None: - assert self.abbr_data is not None self.abbr_list[self.abbr_data] = self.abbr_title self.abbr_title = None - self.abbr_data = None + self.abbr_data = "" if tag == "q": if not self.quote: @@ -453,7 +451,7 @@ self.o(self.close_quote) self.quote = not self.quote - def link_url(self: HTML2Text, link: str, title: str = "") -> None: + def link_url(self, link, title=""): url = urlparse.urljoin(self.baseurl, link) title = ' "{}"'.format(title) if title.strip() else "" self.o("]({url}{title})".format(url=escape_md(url), title=title)) @@ -478,28 +476,31 @@ if self.maybe_automatic_link and not self.empty_link: self.maybe_automatic_link = None elif a: - assert a["href"] is not None if self.empty_link: self.o("[") self.empty_link = False self.maybe_automatic_link = None if self.inline_links: - title = a.get("title") or "" - title = escape_md(title) - link_url(self, a["href"], title) + try: + title = a["title"] if a["title"] else "" + title = escape_md(title) + except KeyError: + link_url(self, a["href"], "") + else: + link_url(self, a["href"], title) else: i = self.previousIndex(a) if i is not None: - a_props = self.a[i] + a = self.a[i] else: self.acount += 1 - a_props = AnchorElement(a, self.acount, self.outcount) - self.a.append(a_props) - self.o("][" + str(a_props.count) + "]") + a["count"] = self.acount + a["outcount"] = self.outcount + self.a.append(a) + self.o("][" 
+ str(a["count"]) + "]") if tag == "img" and start and not self.ignore_images: if "src" in attrs: - assert attrs["src"] is not None if not self.images_to_alt: attrs["href"] = attrs["src"] alt = attrs.get("alt") or self.default_image_alt @@ -511,10 +512,8 @@ ): self.o("<img src='" + attrs["src"] + "' ") if "width" in attrs: - assert attrs["width"] is not None self.o("width='" + attrs["width"] + "' ") if "height" in attrs: - assert attrs["height"] is not None self.o("height='" + attrs["height"] + "' ") if alt: self.o("alt='" + alt + "' ") @@ -551,12 +550,13 @@ else: i = self.previousIndex(attrs) if i is not None: - a_props = self.a[i] + attrs = self.a[i] else: self.acount += 1 - a_props = AnchorElement(attrs, self.acount, self.outcount) - self.a.append(a_props) - self.o("[" + str(a_props.count) + "]") + attrs["count"] = self.acount + attrs["outcount"] = self.outcount + self.a.append(attrs) + self.o("[" + str(attrs["count"]) + "]") if tag == "dl" and start: self.p() @@ -569,7 +569,7 @@ if tag in ["ol", "ul"]: # Google Docs create sub lists as top level lists - if not self.list and not self.lastWasList: + if (not self.list) and (not self.lastWasList): self.p() if start: if self.google_doc: @@ -577,11 +577,11 @@ else: list_style = tag numbering_start = list_numbering_start(attrs) - self.list.append(ListElement(list_style, numbering_start)) + self.list.append({"name": list_style, "num": numbering_start}) else: if self.list: self.list.pop() - if not self.google_doc and not self.list: + if (not self.google_doc) and (not self.list): self.o("\n") self.lastWasList = True else: @@ -593,18 +593,18 @@ if self.list: li = self.list[-1] else: - li = ListElement("ul", 0) + li = {"name": "ul", "num": 0} if self.google_doc: nest_count = self.google_nest_count(tag_style) else: nest_count = len(self.list) # TODO: line up <ol><li>s > 9 correctly. 
self.o(" " * nest_count) - if li.name == "ul": + if li["name"] == "ul": self.o(self.ul_item_mark + " ") - elif li.name == "ol": - li.num += 1 - self.o(str(li.num) + ". ") + elif li["name"] == "ol": + li["num"] += 1 + self.o(str(li["num"]) + ". ") self.start = True if tag in ["table", "tr", "td", "th"]: @@ -671,23 +671,21 @@ self.p() # TODO: Add docstring for these one letter functions - def pbr(self) -> None: + def pbr(self): "Pretty print has a line break" if self.p_p == 0: self.p_p = 1 - def p(self) -> None: + def p(self): "Set pretty print to 1 or 2 lines" self.p_p = 1 if self.single_line_break else 2 - def soft_br(self) -> None: + def soft_br(self): "Soft breaks" self.pbr() self.br_toggle = " " - def o( - self, data: str, puredata: bool = False, force: Union[bool, str] = False - ) -> None: + def o(self, data, puredata=False, force=False): """ Deal with indentation and whitespace """ @@ -732,7 +730,8 @@ if not self.list: bq += " " # else: list content is already partially indented - bq += " " * len(self.list) + for i in range(len(self.list)): + bq += " " data = data.replace("\n", "\n" + bq) if self.startpre: @@ -770,16 +769,15 @@ newa = [] for link in self.a: - if self.outcount > link.outcount: + if self.outcount > link["outcount"]: self.out( " [" - + str(link.count) + + str(link["count"]) + "]: " - + urlparse.urljoin(self.baseurl, link.attrs["href"]) + + urlparse.urljoin(self.baseurl, link["href"]) ) - if "title" in link.attrs: - assert link.attrs["title"] is not None - self.out(" (" + link.attrs["title"] + ")") + if "title" in link: + self.out(" (" + link["title"] + ")") self.out("\n") else: newa.append(link) @@ -798,7 +796,7 @@ self.out(data) self.outcount += 1 - def handle_data(self, data: str, entity_char: bool = False) -> None: + def handle_data(self, data, entity_char=False): if not data: # Data may be empty for some HTML entities. For example, # LEFT-TO-RIGHT MARK. 
@@ -841,7 +839,7 @@ self.preceding_data = data self.o(data, puredata=True) - def charref(self, name: str) -> str: + def charref(self, name): if name[0] in ["x", "X"]: c = int(name[1:], 16) else: @@ -855,16 +853,21 @@ except ValueError: # invalid unicode return "" - def entityref(self, c: str) -> str: + def entityref(self, c): if not self.unicode_snob and c in config.UNIFIABLE: return config.UNIFIABLE[c] - try: - ch = html.entities.html5[c + ";"] - except KeyError: - return "&" + c + ";" - return config.UNIFIABLE[c] if c == "nbsp" else ch + else: + try: + cp = name2cp(c) + except KeyError: + return "&" + c + ";" + else: + if c == "nbsp": + return config.UNIFIABLE[c] + else: + return chr(cp) - def google_nest_count(self, style: Dict[str, str]) -> int: + def google_nest_count(self, style): """ Calculate the nesting count of google doc lists @@ -878,7 +881,7 @@ return nest_count - def optwrap(self, text: str) -> str: + def optwrap(self, text): """ Wrap all paragraphs in the provided text. @@ -901,13 +904,7 @@ if not skipwrap(para, self.wrap_links, self.wrap_list_items): indent = "" if para.startswith(" " + self.ul_item_mark): - # list item continuation: add a double indent to the - # new lines - indent = " " - elif para.startswith("> "): - # blockquote continuation: add the greater than symbol - # to the new lines - indent = "> " + indent = " " # For list items. 
wrapped = wrap( para, self.body_width, @@ -915,12 +912,9 @@ subsequent_indent=indent, ) result += "\n".join(wrapped) - if para.endswith(" "): + if indent or para.endswith(" "): result += " \n" newlines = 1 - elif indent: - result += "\n" - newlines = 1 else: result += "\n\n" newlines = 2 @@ -939,7 +933,7 @@ return result -def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str: +def html2text(html, baseurl="", bodywidth=None): if bodywidth is None: bodywidth = config.BODY_WIDTH h = HTML2Text(baseurl=baseurl, bodywidth=bodywidth) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/html2text/__main__.py new/html2text-2019.8.11/html2text/__main__.py --- old/html2text-2020.1.16/html2text/__main__.py 2019-10-12 17:55:30.000000000 +0200 +++ new/html2text-2019.8.11/html2text/__main__.py 2019-08-11 21:27:39.000000000 +0200 @@ -1,3 +1,3 @@ -from .cli import main +from html2text.cli import main main() diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/html2text/cli.py new/html2text-2019.8.11/html2text/cli.py --- old/html2text-2020.1.16/html2text/cli.py 2019-10-12 18:20:41.000000000 +0200 +++ new/html2text-2019.8.11/html2text/cli.py 2019-08-11 21:27:39.000000000 +0200 @@ -1,10 +1,10 @@ import argparse -import sys -from . 
import HTML2Text, __version__, config +from html2text import HTML2Text, __version__, config +from html2text.utils import wrap_read, wrapwrite -def main() -> None: +def main(): baseurl = "" class bcolors: @@ -256,10 +256,10 @@ with open(args.filename, "rb") as fp: data = fp.read() else: - data = sys.stdin.buffer.read() + data = wrap_read() try: - html = data.decode(args.encoding, args.decode_errors) + data = data.decode(args.encoding, args.decode_errors) except UnicodeDecodeError as err: warning = bcolors.WARNING + "Warning:" + bcolors.ENDC warning += " Use the " + bcolors.OKGREEN @@ -303,4 +303,4 @@ h.open_quote = args.open_quote h.close_quote = args.close_quote - sys.stdout.write(h.handle(html)) + wrapwrite(h.handle(data)) diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/html2text/compat.py new/html2text-2019.8.11/html2text/compat.py --- old/html2text-2020.1.16/html2text/compat.py 1970-01-01 01:00:00.000000000 +0100 +++ new/html2text-2019.8.11/html2text/compat.py 2019-08-11 21:27:39.000000000 +0200 @@ -0,0 +1,12 @@ +import sys + +if sys.version_info[0] == 2: + import htmlentitydefs + import urlparse + import HTMLParser +else: + import urllib.parse as urlparse + import html.entities as htmlentitydefs + import html.parser as HTMLParser + +__all__ = ["HTMLParser", "htmlentitydefs", "urlparse"] diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/html2text/config.py new/html2text-2019.8.11/html2text/config.py --- old/html2text-2020.1.16/html2text/config.py 2019-08-15 12:56:54.000000000 +0200 +++ new/html2text-2019.8.11/html2text/config.py 2019-08-11 21:27:39.000000000 +0200 @@ -1,3 +1,5 @@ +from __future__ import unicode_literals + import re # Use Unicode characters instead of their ascii pseudo-replacements diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/html2text/elements.py 
new/html2text-2019.8.11/html2text/elements.py --- old/html2text-2020.1.16/html2text/elements.py 2019-10-12 18:20:41.000000000 +0200 +++ new/html2text-2019.8.11/html2text/elements.py 1970-01-01 01:00:00.000000000 +0100 @@ -1,18 +0,0 @@ -from typing import Dict, Optional - - -class AnchorElement: - __slots__ = ["attrs", "count", "outcount"] - - def __init__(self, attrs: Dict[str, Optional[str]], count: int, outcount: int): - self.attrs = attrs - self.count = count - self.outcount = outcount - - -class ListElement: - __slots__ = ["name", "num"] - - def __init__(self, name: str, num: int): - self.name = name - self.num = num diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/html2text/typing.py new/html2text-2019.8.11/html2text/typing.py --- old/html2text-2020.1.16/html2text/typing.py 2019-10-12 18:20:41.000000000 +0200 +++ new/html2text-2019.8.11/html2text/typing.py 1970-01-01 01:00:00.000000000 +0100 @@ -1,3 +0,0 @@ -class OutCallback: - def __call__(self, s: str) -> None: - ... diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/html2text/utils.py new/html2text-2019.8.11/html2text/utils.py --- old/html2text-2020.1.16/html2text/utils.py 2020-01-16 15:08:28.000000000 +0100 +++ new/html2text-2019.8.11/html2text/utils.py 2019-08-11 21:27:39.000000000 +0200 @@ -1,16 +1,20 @@ -import html.entities -from typing import Dict, List, Optional +import sys -from . 
import config +from html2text import config +from html2text.compat import htmlentitydefs -unifiable_n = { - html.entities.name2codepoint[k]: v - for k, v in config.UNIFIABLE.items() - if k != "nbsp" -} +def name2cp(k): + """Return sname to codepoint""" + if k == "apos": + return ord("'") + return htmlentitydefs.name2codepoint[k] -def hn(tag: str) -> int: + +unifiable_n = {name2cp(k): v for k, v in config.UNIFIABLE.items() if k != "nbsp"} + + +def hn(tag): if tag[0] == "h" and len(tag) == 2: n = tag[1] if "0" < n <= "9": @@ -18,7 +22,7 @@ return 0 -def dumb_property_dict(style: str) -> Dict[str, str]: +def dumb_property_dict(style): """ :returns: A hash of css attributes """ @@ -28,7 +32,7 @@ } -def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]: +def dumb_css_parser(data): """ :type data: str @@ -45,20 +49,16 @@ # parse the css. reverted from dictionary comprehension in order to # support older pythons - pairs = [x.split("{") for x in data.split("}") if "{" in x.strip()] + elements = [x.split("{") for x in data.split("}") if "{" in x.strip()] try: - elements = {a.strip(): dumb_property_dict(b) for a, b in pairs} + elements = {a.strip(): dumb_property_dict(b) for a, b in elements} except ValueError: elements = {} # not that important return elements -def element_style( - attrs: Dict[str, Optional[str]], - style_def: Dict[str, Dict[str, str]], - parent_style: Dict[str, str], -) -> Dict[str, str]: +def element_style(attrs, style_def, parent_style): """ :type attrs: dict :type style_def: dict @@ -69,19 +69,17 @@ """ style = parent_style.copy() if "class" in attrs: - assert attrs["class"] is not None for css_class in attrs["class"].split(): css_style = style_def.get("." 
+ css_class, {}) style.update(css_style) if "style" in attrs: - assert attrs["style"] is not None immediate_style = dumb_property_dict(attrs["style"]) style.update(immediate_style) return style -def google_list_style(style: Dict[str, str]) -> str: +def google_list_style(style): """ Finds out whether this is an ordered or unordered list @@ -97,7 +95,7 @@ return "ol" -def google_has_height(style: Dict[str, str]) -> bool: +def google_has_height(style): """ Check if the style of the element has the 'height' attribute explicitly defined @@ -109,7 +107,7 @@ return "height" in style -def google_text_emphasis(style: Dict[str, str]) -> List[str]: +def google_text_emphasis(style): """ :type style: dict @@ -127,7 +125,7 @@ return emphasis -def google_fixed_width_font(style: Dict[str, str]) -> bool: +def google_fixed_width_font(style): """ Check if the css of the current element defines a fixed width font @@ -141,7 +139,7 @@ return "courier new" == font_family or "consolas" == font_family -def list_numbering_start(attrs: Dict[str, Optional[str]]) -> int: +def list_numbering_start(attrs): """ Extract numbering from list element attributes @@ -150,7 +148,6 @@ :rtype: int or None """ if "start" in attrs: - assert attrs["start"] is not None try: return int(attrs["start"]) - 1 except ValueError: @@ -159,10 +156,10 @@ return 0 -def skipwrap(para: str, wrap_links: bool, wrap_list_items: bool) -> bool: +def skipwrap(para, wrap_links, wrap_list_items): # If it appears to contain a link # don't wrap - if not wrap_links and config.RE_LINK.search(para): + if (len(config.RE_LINK.findall(para)) > 0) and not wrap_links: return True # If the text begins with four spaces or one tab, it's a code block; # don't wrap @@ -190,7 +187,25 @@ ) -def escape_md(text: str) -> str: +def wrapwrite(text): + text = text.encode("utf-8") + try: # Python3 + sys.stdout.buffer.write(text) + except AttributeError: + sys.stdout.write(text) + + +def wrap_read(): + """ + :rtype: str + """ + try: + return 
sys.stdin.read() + except AttributeError: + return sys.stdin.buffer.read() + + +def escape_md(text): """ Escapes markdown-sensitive characters within other markdown constructs. @@ -198,7 +213,7 @@ return config.RE_MD_CHARS_MATCHER.sub(r"\\\1", text) -def escape_md_section(text: str, snob: bool = False) -> str: +def escape_md_section(text, snob=False): """ Escapes markdown-sensitive characters across whole document sections. """ @@ -214,7 +229,7 @@ return text -def reformat_table(lines: List[str], right_margin: int) -> List[str]: +def reformat_table(lines, right_margin): """ Given the lines of a table padds the cells and returns the new lines @@ -257,13 +272,12 @@ return new_lines -def pad_tables_in_text(text: str, right_margin: int = 1) -> str: +def pad_tables_in_text(text, right_margin=1): """ Provide padding for tables in the text """ lines = text.split("\n") - table_buffer = [] # type: List[str] - table_started = False + table_buffer, table_started = [], False new_lines = [] for line in lines: # Toggle table started diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/html2text.egg-info/PKG-INFO new/html2text-2019.8.11/html2text.egg-info/PKG-INFO --- old/html2text-2020.1.16/html2text.egg-info/PKG-INFO 2020-01-16 15:21:10.000000000 +0100 +++ new/html2text-2019.8.11/html2text.egg-info/PKG-INFO 2019-08-11 21:35:58.000000000 +0200 @@ -1,6 +1,6 @@ Metadata-Version: 2.1 Name: html2text -Version: 2020.1.16 +Version: 2019.8.11 Summary: Turn HTML into equivalent Markdown-structured text. 
Home-page: https://github.com/Alir3z4/html2text/ Author: Aaron Swartz @@ -101,13 +101,14 @@ Classifier: License :: OSI Approved :: GNU General Public License (GPL) Classifier: Operating System :: OS Independent Classifier: Programming Language :: Python +Classifier: Programming Language :: Python :: 2 +Classifier: Programming Language :: Python :: 2.7 Classifier: Programming Language :: Python :: 3 +Classifier: Programming Language :: Python :: 3.4 Classifier: Programming Language :: Python :: 3.5 Classifier: Programming Language :: Python :: 3.6 Classifier: Programming Language :: Python :: 3.7 -Classifier: Programming Language :: Python :: 3.8 -Classifier: Programming Language :: Python :: 3 :: Only Classifier: Programming Language :: Python :: Implementation :: CPython Classifier: Programming Language :: Python :: Implementation :: PyPy -Requires-Python: >=3.5 +Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.* Description-Content-Type: text/markdown diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/html2text.egg-info/SOURCES.txt new/html2text-2019.8.11/html2text.egg-info/SOURCES.txt --- old/html2text-2020.1.16/html2text.egg-info/SOURCES.txt 2020-01-16 15:21:10.000000000 +0100 +++ new/html2text-2019.8.11/html2text.egg-info/SOURCES.txt 2019-08-11 21:35:59.000000000 +0200 @@ -9,16 +9,13 @@ html2text/__init__.py html2text/__main__.py html2text/cli.py +html2text/compat.py html2text/config.py -html2text/elements.py -html2text/py.typed -html2text/typing.py html2text/utils.py html2text.egg-info/PKG-INFO html2text.egg-info/SOURCES.txt html2text.egg-info/dependency_links.txt html2text.egg-info/entry_points.txt -html2text.egg-info/not-zip-safe html2text.egg-info/top_level.txt test/GoogleDocMassDownload.html test/GoogleDocMassDownload.md diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/html2text.egg-info/not-zip-safe 
new/html2text-2019.8.11/html2text.egg-info/not-zip-safe
--- old/html2text-2020.1.16/html2text.egg-info/not-zip-safe 2020-01-16 15:21:10.000000000 +0100
+++ new/html2text-2019.8.11/html2text.egg-info/not-zip-safe 1970-01-01 01:00:00.000000000 +0100
@@ -1 +0,0 @@
-
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/setup.cfg new/html2text-2019.8.11/setup.cfg
--- old/html2text-2020.1.16/setup.cfg 2020-01-16 15:21:10.000000000 +0100
+++ new/html2text-2019.8.11/setup.cfg 2019-08-11 21:36:00.000000000 +0200
@@ -1,42 +1,5 @@
-[metadata]
-name = html2text
-version = attr: html2text.__version__
-description = Turn HTML into equivalent Markdown-structured text.
-long_description = file: README.md
-long_description_content_type = text/markdown
-url = https://github.com/Alir3z4/html2text/
-author = Aaron Swartz
-author_email = m...@aaronsw.com
-maintainer = Alireza Savand
-maintainer_email = alireza.sav...@gmail.com
-license = GNU GPL 3
-classifiers =
-    Development Status :: 5 - Production/Stable
-    Intended Audience :: Developers
-    License :: OSI Approved :: GNU General Public License (GPL)
-    Operating System :: OS Independent
-    Programming Language :: Python
-    Programming Language :: Python :: 3
-    Programming Language :: Python :: 3.5
-    Programming Language :: Python :: 3.6
-    Programming Language :: Python :: 3.7
-    Programming Language :: Python :: 3.8
-    Programming Language :: Python :: 3 :: Only
-    Programming Language :: Python :: Implementation :: CPython
-    Programming Language :: Python :: Implementation :: PyPy
-platform = OS Independent
-
-[options]
-zip_safe = False
-packages = html2text
-python_requires = >=3.5
-
-[options.entry_points]
-console_scripts =
-    html2text = html2text.cli:main
-
-[options.package_data]
-html2text = py.typed
+[bdist_wheel]
+universal = 1

 [flake8]
 max_line_length = 88
@@ -50,9 +13,6 @@
 line_length = 88
 multi_line_output = 3

-[mypy]
-python_version = 3.5
-
 [egg_info]
 tag_build =
 tag_date = 0
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/setup.py new/html2text-2019.8.11/setup.py
--- old/html2text-2020.1.16/setup.py 2019-10-31 19:37:31.000000000 +0100
+++ new/html2text-2019.8.11/setup.py 2019-08-11 21:27:39.000000000 +0200
@@ -1,3 +1,42 @@
+# coding: utf-8
 from setuptools import setup

-setup()
+
+def readall(f):
+    with open(f) as fp:
+        return fp.read()
+
+
+setup(
+    name="html2text",
+    version=".".join(map(str, __import__("html2text").__version__)),
+    description="Turn HTML into equivalent Markdown-structured text.",
+    long_description=readall("README.md"),
+    long_description_content_type="text/markdown",
+    author="Aaron Swartz",
+    author_email="m...@aaronsw.com",
+    maintainer="Alireza Savand",
+    maintainer_email="alireza.sav...@gmail.com",
+    url="https://github.com/Alir3z4/html2text/",
+    platforms="OS Independent",
+    classifiers=[
+        "Development Status :: 5 - Production/Stable",
+        "Intended Audience :: Developers",
+        "License :: OSI Approved :: GNU General Public License (GPL)",
+        "Operating System :: OS Independent",
+        "Programming Language :: Python",
+        "Programming Language :: Python :: 2",
+        "Programming Language :: Python :: 2.7",
+        "Programming Language :: Python :: 3",
+        "Programming Language :: Python :: 3.4",
+        "Programming Language :: Python :: 3.5",
+        "Programming Language :: Python :: 3.6",
+        "Programming Language :: Python :: 3.7",
+        "Programming Language :: Python :: Implementation :: CPython",
+        "Programming Language :: Python :: Implementation :: PyPy",
+    ],
+    python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*",
+    entry_points={"console_scripts": ["html2text = html2text.cli:main"]},
+    license="GNU GPL 3",
+    packages=["html2text"],
+)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/test/blockquote_example.html new/html2text-2019.8.11/test/blockquote_example.html
---
old/html2text-2020.1.16/test/blockquote_example.html 2019-08-15 12:56:54.000000000 +0200
+++ new/html2text-2019.8.11/test/blockquote_example.html 2019-08-11 21:27:39.000000000 +0200
@@ -1,3 +1,3 @@
 <blockquote>
-"The time has come", the Walrus said, "To talk of many things: Of shoes - and ships - and sealing wax - Of cabbages - and kings- And why the sea is boiling hot - And whether pigs have wings."
+The time has come, the Walrus said, to speak of many things.
 </blockquote>
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/test/blockquote_example.md new/html2text-2019.8.11/test/blockquote_example.md
--- old/html2text-2020.1.16/test/blockquote_example.md 2019-09-25 10:07:55.000000000 +0200
+++ new/html2text-2019.8.11/test/blockquote_example.md 2019-08-11 21:27:39.000000000 +0200
@@ -1,4 +1,2 @@
-> "The time has come", the Walrus said, "To talk of many things: Of shoes -
-> and ships - and sealing wax - Of cabbages - and kings- And why the sea is
-> boiling hot - And whether pigs have wings."
+> The time has come, the Walrus said, to speak of many things.

diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/test/test_html2text.py new/html2text-2019.8.11/test/test_html2text.py
--- old/html2text-2020.1.16/test/test_html2text.py 2020-01-16 15:08:28.000000000 +0100
+++ new/html2text-2019.8.11/test/test_html2text.py 2019-08-11 21:27:39.000000000 +0200
@@ -1,3 +1,4 @@
+import codecs
 import glob
 import os
 import re
@@ -40,7 +41,8 @@

         if base_fn.find("unicode") >= 0:
             module_args["unicode_snob"] = True
-            cmdline_args.append("--unicode-snob")
+            # There is no command-line option to control unicode_snob.
+            cmdline_args = skip
             func_args = skip

         if base_fn.find("flip_emphasis") >= 0:
@@ -187,7 +189,7 @@
     result = get_baseline(fn)
     out = subprocess.check_output(cmd)

-    actual = out.decode()
+    actual = out.decode("utf8")

     actual = cleanup_eol(actual)

@@ -208,7 +210,7 @@

 def get_baseline(fn):
     name = get_baseline_name(fn)
-    with open(name, encoding="utf-8") as f:
+    with codecs.open(name, mode="r", encoding="utf8") as f:
         out = f.read()
     return cleanup_eol(out)

diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/test/test_memleak.py new/html2text-2019.8.11/test/test_memleak.py
--- old/html2text-2020.1.16/test/test_memleak.py 2019-09-25 09:41:57.000000000 +0200
+++ new/html2text-2019.8.11/test/test_memleak.py 2019-08-11 21:27:39.000000000 +0200
@@ -17,10 +17,3 @@
     h2t.handle(INSTR)
     # And even less when the input is empty.
     assert h2t.handle("") == "\n\n"
-
-
-def test_abbr_data():
-    h2t = html2text.HTML2Text()
-    result = h2t.handle('<p>foo <abbr title="Three Letter Acronym">TLA</abbr> bar</p>')
-    assert result == "foo TLA bar\n\n *[TLA]: Three Letter Acronym\n\n"
-    assert h2t.abbr_data is None
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/test/wrap_list_items_example.md new/html2text-2019.8.11/test/wrap_list_items_example.md
--- old/html2text-2020.1.16/test/wrap_list_items_example.md 2019-09-25 10:07:55.000000000 +0200
+++ new/html2text-2019.8.11/test/wrap_list_items_example.md 2019-08-11 21:27:39.000000000 +0200
@@ -1,14 +1,14 @@
   * One two three four five six seven eight nine ten eleven twelve thirteen
-    fourteen fifteen sixteen seventeen eighteen nineteen twenty.
+    fourteen fifteen sixteen seventeen eighteen nineteen twenty.
   * One two three four five six seven eight nine ten eleven twelve thirteen
-    fourteen fifteen sixteen seventeen eighteen nineteen twenty.
+    fourteen fifteen sixteen seventeen eighteen nineteen twenty.

 Text between lists.

   * One two three four five six seven eight nine ten eleven twelve thirteen
-    fourteen fifteen sixteen seventeen eighteen nineteen twenty.
+    fourteen fifteen sixteen seventeen eighteen nineteen twenty.
   * One two three four five six seven eight nine ten eleven twelve thirteen
-    fourteen fifteen sixteen seventeen eighteen nineteen twenty.
+    fourteen fifteen sixteen seventeen eighteen nineteen twenty.

 Text after list.

diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/html2text-2020.1.16/tox.ini new/html2text-2019.8.11/tox.ini
--- old/html2text-2020.1.16/tox.ini 2019-10-31 19:37:31.000000000 +0100
+++ new/html2text-2019.8.11/tox.ini 2019-08-11 21:27:39.000000000 +0200
@@ -3,8 +3,7 @@
     black
     flake8
     isort
-    mypy
-    py{35,36,37,38,py3}
+    py{27,34,35,36,37,py,py3}
 minversion = 1.9

 [testenv]
@@ -17,7 +16,7 @@
 [testenv:black]
 basepython = python3
 commands =
-    black --target-version py35 --check --diff .
+    black --check --diff .
 deps =
     black
 skip_install = true
@@ -37,8 +36,3 @@
 deps =
     isort
 skip_install = true
-
-[testenv:mypy]
-commands = mypy --strict html2text
-deps = mypy
-skip_install = true
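
Note on the setup.py hunk above: the restored 2019.8.11 setup.py builds its version string with ".".join(map(str, __import__("html2text").__version__)). The sketch below shows what that expression evaluates to, assuming __version__ is a tuple of integers such as (2019, 8, 11). That tuple and the helper name version_string are illustrative assumptions, not taken from the package.

```python
# Hypothetical stand-in for html2text.__version__; assumed here to be a
# tuple of integers, which is what the join/map expression implies.
ASSUMED_VERSION = (2019, 8, 11)


def version_string(parts):
    """Join version components with dots, mirroring the setup.py expression."""
    return ".".join(map(str, parts))


if __name__ == "__main__":
    print(version_string(ASSUMED_VERSION))
```

The 2020.1.16 side of the diff drops this expression in favour of "version = attr: html2text.__version__" in setup.cfg, which leaves the lookup to setuptools.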