Bug#1022911: ITP: python-html-text -- extract text from HTML

Christian Marillat Thu, 27 Oct 2022 08:09:33 -0700

Package: wnpp
Severity: wishlist
Owner: Christian Marillat <maril...@debian.org>
X-Debbugs-Cc: debian-de...@lists.debian.org


* Package name    : python-html-text
  Version         : 0.5.2
  Upstream Author : Scrapinghub Inc
* URL             : https://github.com/TeamHG-Memex/html-text
* License         : MIT
  Programming Lang: Python
  Description     : extract text from HTML

  How is html_text different from .xpath('//text()') from lxml or
  .get_text() from Beautiful Soup?

  Text extracted with html_text does not contain inline styles,
  JavaScript, comments and other text that is not normally visible to users;

  html_text normalizes whitespace, but in a way smarter than
  .xpath('normalize-space()'), adding spaces around inline elements
  (which are often used as block elements in HTML markup), and trying
  to avoid adding extra spaces for punctuation;

  html_text can add newlines (e.g. after headers or paragraphs), so
  that the output text looks more like how it is rendered in
  browsers.
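
  To illustrate the three points above, here is a rough sketch that uses
  only the Python standard library. It is not html_text's implementation:
  the names HIDDEN_TAGS, BLOCK_TAGS, VisibleText and visible_text are
  made up for this example, and the tag lists are assumptions. It skips
  script, style and comment content, collapses whitespace, and adds a
  newline after a few block-level elements. It does not attempt the
  spacing heuristics around inline elements and punctuation.

```python
from html.parser import HTMLParser

# Assumption for this sketch: tags whose contents are never shown to users.
HIDDEN_TAGS = {"script", "style"}
# Assumption for this sketch: tags after which a line break is emitted.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class VisibleText(HTMLParser):
    """Collect text a reader would normally see, with rough line breaks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS:
            self.hidden_depth = max(0, self.hidden_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Comments never reach this method: HTMLParser routes them to
        # handle_comment(), which does nothing by default.
        if self.hidden_depth == 0:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)


sample = (
    "<h1>Title</h1><style>p {color: red}</style>"
    "<p>Hello   <b>world</b>!</p><!-- note --><script>var x = 1;</script>"
)
print(visible_text(sample))
```

  With the packaged library itself, the upstream README describes the
  corresponding call as html_text.extract_text(html).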


  This package is a dependency of python-extruct.
