On Sunday 09 May 2010 14:17:38 Evan Daniel wrote:
> On Sun, May 9, 2010 at 7:36 AM, Florent Daigniere
> <nextgens at freenetproject.org> wrote:
> >> >> > Depending how much cleaning of the HTML filtering system you want
> >> >> > to do... Has using something like JTidy
> >> >> > ( http://jtidy.sourceforge.net/ ) been discussed? That way you
> >> >> > wouldn't have to worry about what's valid or invalid HTML, merely
> >> >> > the security aspects of valid HTML that are unique to Freenet.
> >> >
> >> > That might be nice... but wouldn't we have the same problem in that
> >> > it would be hard to diff the output of the filter against the input
> >> > for debugging purposes? What do other people think about this? It
> >> > would make life much easier...
> >>
> >> I don't see why it would be a problem. I haven't used tidy much,
> >> honestly. I don't see how to make it stop changing line breaks and
> >> such in my page. However, I don't mind running it locally before
> >> inserting, so that nothing changes when the filter runs it. I don't
> >> need the filter to never change anything; I just need to know what to
> >> do so that I can get a diff that shows only the changes made by the
> >> filter. If I need to run tidy on the original, and then diff that vs
> >> the filtered output, that's fine by me.
> >>
> >> And anything that makes the filtering more robust and less work is a
> >> big win, imho.
> >>
> >> Evan Daniel
> >
> > No way. We have a filter which works (security-wise), why would we
> > change? Auditing upstream changes is going to be more time-expensive
> > than maintaining our own, because it implements only a subset of the
> > features.
>
> As I see it, there are three parts to the filter:
> 1) Parse the HTML / XHTML, and build a parse tree / DOM. Handle
>    invalid markup, invalid character escapes, etc.
> 2) Remove any elements that present a security risk.
> 3) Turn the resulting DOM back into HTML output.
>
> The goal of using something like JTidy would be to make part 1 more
> robust and easier to maintain. Part 2 would be the same filter we
> have now.
>
> At present, we allow a large amount of invalid markup through the
> filter. I don't like this for a variety of reasons, but the relevant
> one is that browser behavior when presented with invalid markup is
> not well defined, and therefore has a lot of potential for security
> risks. OTOH, we can't just ban invalid markup, because so many
> freesites use it. Using something like JTidy gets the best of both
> worlds: it cleans up invalid markup and produces something that is
> valid and likely to do what the freesite author wanted. That means we
> can be certain that the browser will interpret the document in the
> same fashion our filter does, which is a win for security.
>
> Reasons to change:
> - Our filter works security-wise, but is more restrictive than
>   required on content. Loosening those restrictions will be less work
>   if we can assume that the filtering process starts with a completely
>   valid DOM.
> - We don't have to maintain a parser, just a filter.
> - Our current filter breaks valid markup; fixing this is probably
>   easier if we use something like JTidy to handle the DOM rather than
>   rolling our own HTML / XHTML parser.
>
> Reasons not to change:
> - Changing takes work, and has the potential to introduce new bugs.
> - We have to worry about upstream changes.
>
> I'm not overly convinced about the upstream changes piece. The
> upstream code in question is a port of the W3C reference parser.
> Since we'd be using a whitelist filter on the DOM it produces, we
> don't need to worry about new features and supported markup, only
> new bugs. How much auditing do we currently do on upstream code?
>
> I'm not trying to advocate spending lots of time changing everything
> over. We have better things to work on. I'm asking whether it's
> easier to fix the current code, or to refactor it to use a more
> robust parser. (And which is easier to maintain long term -- though
> if switching is a lot of work, but easier long term, then imho we
> should keep the current code for now and revisit the question later,
> like after 0.8 is out.)
IMHO it would be a lot of work for little benefit, and involve pulling in
lots of third party code. And we would still need a whitelist filter -
IIRC the parser isn't the bulk of the filtering code?
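
For reference, a minimal sketch of the pipeline discussed above: JTidy's
parseDOM() for part 1, a whitelist walk over the resulting org.w3c.dom
tree for part 2, and pprint() for part 3. This assumes JTidy's DOM
supports standard org.w3c.dom mutation, and the element/attribute lists
are purely illustrative, nothing like Freenet's real tables:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class TidyFilterSketch {

    // Illustrative whitelists only, not Freenet's actual tables.
    private static final Set<String> ALLOWED_ELEMENTS = new HashSet<String>(
        Arrays.asList("html", "head", "title", "body", "p", "a", "b", "i",
                      "ul", "li"));
    private static final Set<String> ALLOWED_ATTRIBUTES = new HashSet<String>(
        Arrays.asList("href", "title"));

    public static String filter(String dirtyHtml) throws Exception {
        // Part 1: JTidy turns arbitrary input into a valid DOM.
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        Document doc = tidy.parseDOM(
            new ByteArrayInputStream(dirtyHtml.getBytes("UTF-8")), null);

        // Part 2: the whitelist filter walks the DOM.
        clean(doc.getDocumentElement());

        // Part 3: serialize the cleaned DOM back to markup.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.pprint(doc, out);
        return out.toString("UTF-8");
    }

    private static void clean(Node node) {
        NodeList children = node.getChildNodes();
        // Iterate backwards so removals don't shift unvisited siblings.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE)
                continue;
            if (!ALLOWED_ELEMENTS.contains(child.getNodeName().toLowerCase())) {
                node.removeChild(child); // drops the whole subtree
                continue;
            }
            stripAttributes((Element) child);
            clean(child);
        }
    }

    private static void stripAttributes(Element e) {
        NamedNodeMap attrs = e.getAttributes();
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            String name = attrs.item(i).getNodeName();
            if (!ALLOWED_ATTRIBUTES.contains(name.toLowerCase()))
                e.removeAttribute(name);
        }
    }
}

Even in this sketch, most of the interesting decisions live in the
whitelist walk: whether to drop a banned element's subtree or hoist its
children, and how to validate attribute values (e.g. href targets),
which is consistent with the point that the parser isn't the bulk of
the filtering code.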
