Re: [Django] #30686: Improve utils.text.Truncator &co to use a full HTML parser. (was: Truncator.chars splits HTML entities)

Django Wed, 07 Aug 2019 02:18:16 -0700

#30686: Improve utils.text.Truncator &co to use a full HTML parser.
-------------------------------+------------------------------------
     Reporter:  Thomas Hooper  |                    Owner:  nobody
         Type:  Bug            |                   Status:  new
    Component:  Utilities      |                  Version:  master
     Severity:  Normal         |               Resolution:
     Keywords:                 |             Triage Stage:  Accepted
    Has patch:  0              |      Needs documentation:  0
  Needs tests:  0              |  Patch needs improvement:  0
Easy pickings:  0              |                    UI/UX:  0
-------------------------------+------------------------------------
Changes (by Carlton Gibson):


 * version:  2.2 => master
 * stage:  Unreviewed => Accepted


Old description:

> I'm using Truncator.chars to truncate wikis, and it sometimes truncates
> in the middle of &quot; entities, resulting in '<p>some text &qu</p>'

New description:

 Original description:

 > I'm using Truncator.chars to truncate wikis, and it sometimes truncates
 > in the middle of &quot; entities, resulting in '<p>some text &qu</p>'

 This is a limitation of the regex-based implementation (which has had
 security issues, and presents an intractable problem).

 Better to move to using an HTML parser for Truncator and strip_tags(), via
 html5lib and bleach.

--

Comment:

 Right, good news is this isn't a regression from
 7f65974f8219729c047fbbf8cd5cc9d80faefe77.

 * The new example case fails on v2.2.3 &co.
 * The suggested regex change is in a part of the code that
 7f65974f8219729c047fbbf8cd5cc9d80faefe77 did not touch. (Which is why the
 new case fails, I suppose. :)

 I don't want to accept a tweaking of the regex here. Rather, we should
 move to using `html5lib` as Florian suggests.
 Possibly this would entail small changes in behaviour around edge cases,
 to be called out in release notes, but it would be a big win overall.

 This has previously been discussed by the Security Team as the required
 way forward.
 I've updated the title/description and will Accept accordingly.

 I've attached an initial WIP patch by Florian of an `html5lib`
 implementation of the core `_truncate_html()` method.
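
 To illustrate why a parser sidesteps the entity-splitting problem, here is
 a minimal sketch. It is not Florian's patch and it does not use `html5lib`:
 to stay self-contained it assumes the standard library's
 `html.parser.HTMLParser` instead. The names `truncate_html_chars`,
 `_CharTruncator` and `VOID_TAGS` are hypothetical. Each entity is counted
 as one character and emitted whole, and tags still open at the cut-off are
 closed.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (illustrative, not exhaustive).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _CharTruncator(HTMLParser):
    """Copy markup through until `limit` visible characters are emitted."""

    def __init__(self, limit):
        # convert_charrefs=False makes entities arrive through
        # handle_entityref/handle_charref rather than decoded in the text.
        super().__init__(convert_charrefs=False)
        self.remaining = limit
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if self.remaining <= 0:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Only well-nested closing tags are handled in this sketch.
        if self.remaining > 0 and self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.remaining <= 0:
            return
        piece = data[:self.remaining]
        self.out.append(piece)
        self.remaining -= len(piece)

    def handle_entityref(self, name):
        # A named entity such as &quot; counts as one character.
        if self.remaining > 0:
            self.out.append("&%s;" % name)
            self.remaining -= 1

    def handle_charref(self, name):
        # A numeric reference such as &#34; counts as one character.
        if self.remaining > 0:
            self.out.append("&#%s;" % name)
            self.remaining -= 1

    def result(self):
        # Close whatever was still open when the limit was reached.
        closing = "".join("</%s>" % t for t in reversed(self.open_tags))
        return "".join(self.out) + closing


def truncate_html_chars(html, limit):
    parser = _CharTruncator(limit)
    parser.feed(html)
    parser.close()
    return parser.result()
```

 With the input `<p>some text &quot;quoted&quot;</p>` and a limit of 11, this
 keeps `&quot;` intact instead of cutting it to `&qu`. The sketch omits the
 ellipsis handling and word-based truncation that `Truncator` provides, and
 a real implementation would need to deal with malformed markup.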

 An implementation of `strip_tags()` using `bleach` would go something
 like:

 {{{
 import bleach
 bleach.clean(text, tags=[], strip=True, strip_comments=True)
 }}}



 Thomas, would you be willing/keen to take on changes like these? If so,
 I'm very happy to help in any way.
 :)

-- 
Ticket URL: <https://code.djangoproject.com/ticket/30686#comment:7>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

