Package: wnpp
Severity: wishlist
Owner: Christian Marillat <maril...@debian.org>
X-Debbugs-Cc: debian-de...@lists.debian.org
* Package name    : python-html-text
  Version         : 0.5.2
  Upstream Author : Scrapinghub Inc
* URL             : https://github.com/TeamHG-Memex/html-text
* License         : MIT
  Programming Lang: Python
  Description     : extract text from HTML

How is html_text different from .xpath('//text()') from LXML or
.get_text() from Beautiful Soup?

 * Text extracted with html_text does not contain inline styles,
   JavaScript, comments and other text that is not normally visible
   to users;
 * html_text normalizes whitespace, but in a way smarter than
   .xpath('normalize-space()'), adding spaces around inline elements
   (which are often used as block elements in HTML markup), and trying
   to avoid adding extra spaces for punctuation;
 * html_text can add newlines (e.g. after headers or paragraphs), so
   that the output text looks more like how it is rendered in browsers.

This package is a dependency for python-extruct.
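
To make the description above concrete, here is a rough standard-library sketch of the same idea: drop text that is not visible, collapse whitespace, and start a new line at block-level elements. It is an illustration only, not html_text's implementation or API; the names `VisibleText`, `visible_text`, `HIDDEN_TAGS` and `BLOCK_TAGS` are hypothetical, and the tag lists are deliberately short.

```python
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive set of tags whose content is not shown.
HIDDEN_TAGS = {"script", "style", "head", "noscript", "template"}
# Assumption: a small, non-exhaustive set of tags that start or end a line.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "br"}


class VisibleText(HTMLParser):
    """Collect visible words, grouped into lines at block boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = [[]]      # each entry is the list of words on one line
        self.hidden_depth = 0  # > 0 while inside a hidden element

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1
        elif tag in BLOCK_TAGS:
            self.lines.append([])

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth:
            self.hidden_depth -= 1
        elif tag in BLOCK_TAGS:
            self.lines.append([])

    def handle_data(self, data):
        # Comments never reach this method; hidden content is skipped here.
        if not self.hidden_depth:
            # split() collapses every run of whitespace, including newlines.
            self.lines[-1].extend(data.split())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(" ".join(words) for words in parser.lines if words)


sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Title</h1><!-- note --><p>Hello   <b>big</b>\n world</p>"
    "<script>var x = 1;</script></body></html>"
)
print(visible_text(sample))
```

Because every text chunk is split into words and rejoined with single spaces, this sketch always puts a space around inline elements, so `<b>big</b>, world` would come out as `big , world`. Avoiding that extra space before punctuation is one of the refinements the description attributes to html_text.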