On Sun, May 9, 2010 at 7:36 AM, Florent Daigniere
<nextgens at freenetproject.org> wrote:
>> >> > Depending how much cleaning of the HTML filtering system you want to
>> >> > do...  Has using something like JTidy ( http://jtidy.sourceforge.net/ )
>> >> > been discussed?  That way you wouldn't have to worry about what's
>> >> > valid or invalid HTML, merely the security aspects of valid HTML that
>> >> > are unique to Freenet.
>> >
>> > That might be nice... but wouldn't we have the same problem in that it
>> > would be hard to diff the output of the filter against the input for
>> > debugging purposes? What do other people think about this? It would
>> > make life much easier...
>>
>> I don't see why it would be a problem.  I haven't used tidy much,
>> honestly.  I don't see how to make it stop changing line breaks and
>> such in my page.  However, I don't mind running it locally before
>> inserting, so that nothing changes when the filter runs it.  I don't
>> need the filter to never change anything; I just need to know what to
>> do so that I can get a diff that shows only the changes made by the
>> filter.  If I need to run tidy on the original, and then diff that vs
>> the filtered output, that's fine by me.
>>
>> And anything that makes the filtering more robust and less work is a
>> big win, imho.
>>
>> Evan Daniel
>
> No way. We have a filter which works (security-wise), why would we change?
>
> Auditing upstream changes is going to be more time-expensive than
> maintaining our own, because ours implements only a subset of the features.

As I see it, there are three parts to the filter:
1) Parse the HTML / XHTML, and build a parse tree / DOM.  Handle
invalid markup, invalid character escapes, etc.
2) Remove any elements that present a security risk.
3) Turn the resulting DOM back into HTML output.
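To make the split concrete, here is a minimal sketch of what the pipeline
could look like with JTidy handling parts 1 and 3. The Tidy calls are the
ones JTidy documents as far as I know, but the option choices and the
whitelistFilter() hook are placeholders for illustration, not a worked-out
design:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class FilterPipelineSketch {

    /**
     * Part 1: let JTidy parse (and repair) the markup into a W3C DOM.
     * Part 2: run our existing whitelist filter over that DOM.
     * Part 3: let JTidy serialize the cleaned DOM back to markup.
     */
    public static String filterHtml(String dirtyHtml) throws IOException {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        tidy.setXHTML(true); // emit well-formed XHTML

        // Part 1: parse possibly-invalid markup; Tidy repairs as it goes.
        Document doc = tidy.parseDOM(
                new ByteArrayInputStream(dirtyHtml.getBytes("UTF-8")), null);

        // Part 2: the same security filter we have now, applied to the
        // DOM instead of a token stream (placeholder, sketched further down).
        whitelistFilter(doc);

        // Part 3: serialize the cleaned DOM back to markup.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.pprint(doc, out);
        return out.toString("UTF-8");
    }

    // Stand-in for our existing whitelist logic, operating on the DOM.
    private static void whitelistFilter(Document doc) {
        // drop non-whitelisted elements and attributes here
    }
}

The point being that the code we have to audit ourselves shrinks to the
whitelist pass in the middle.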

The goal of using something like JTidy would be to make part 1 more
robust and easier to maintain.  Part 2 would be the same whitelist filter
we have now, just applied to the DOM it hands back.
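For comparison, part 2 on a W3C DOM might look roughly like the following.
The element and attribute sets here are obviously made up for the example;
the real ones would come from the tables our filter already has:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DomWhitelistSketch {

    // Toy whitelists, for illustration only.
    private static final Set<String> ALLOWED_ELEMENTS = new HashSet<String>(
            Arrays.asList("html", "head", "title", "body", "p", "a", "img",
                    "b", "i", "ul", "ol", "li"));
    private static final Set<String> ALLOWED_ATTRIBUTES = new HashSet<String>(
            Arrays.asList("href", "src", "alt", "title"));

    /** Recursively drop any element or attribute not on the whitelist. */
    public static void filter(Node node) {
        NodeList children = node.getChildNodes();
        // Walk backwards so removals don't shift the indices we still visit.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE)
                continue;
            String name = child.getNodeName().toLowerCase();
            if (!ALLOWED_ELEMENTS.contains(name)) {
                node.removeChild(child); // drop the whole subtree
                continue;
            }
            stripAttributes((Element) child);
            filter(child); // recurse into elements we keep
        }
    }

    private static void stripAttributes(Element element) {
        NamedNodeMap attrs = element.getAttributes();
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            String name = attrs.item(i).getNodeName();
            if (!ALLOWED_ATTRIBUTES.contains(name.toLowerCase()))
                element.removeAttribute(name);
        }
    }
}

A real version would also sanitize the values of the attributes it keeps
(href targets and so on), exactly as the current filter does; the sketch
only shows where that logic would plug in.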

At present, we allow a large amount of invalid markup through the
filter.  I don't like this for a variety of reasons, but the relevant
one is that browser behavior when presented with invalid markup is not
well defined, and therefore has a lot of potential for security risks.
 OTOH, we can't just ban invalid markup, because so many freesites use
it.  Using something like JTidy gets the best of both worlds: it
cleans up invalid markup and produces something that is valid and
likely to do what the freesite author wanted.  That means we can be
certain that the browser will interpret the document in the same
fashion our filter does, which is a win for security.
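As a toy example of the kind of ambiguity I mean: mis-nested tags like
<b><i>...</b>...</i> get error-corrected by every browser, but not
necessarily in the same way, and not necessarily the way our filter
guesses.  Running the page through Tidy first settles the question before
the filter ever sees it.  (Again, just an illustration; the exact output
depends on Tidy's repair rules.)

import java.io.ByteArrayInputStream;

import org.w3c.tidy.Tidy;

public class RepairExample {
    public static void main(String[] args) throws Exception {
        // Mis-nested inline tags: technically invalid, but very common.
        String broken = "<p><b><i>bold italic</b> still italic?</i></p>";

        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);

        // Parse and print in one pass; the repaired, valid document
        // goes straight to System.out.
        tidy.parse(new ByteArrayInputStream(broken.getBytes("UTF-8")),
                System.out);
    }
}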

Reasons to change:
- Our filter works security-wise, but is more restrictive than
required on content.  Loosening those restrictions will be less work
if we can assume that the filtering process starts with a completely
valid DOM.
- We don't have to maintain a parser, just a filter.
- Our current filter breaks valid markup; fixing this is probably
easier if we use something like JTidy to handle the DOM rather than
rolling our own HTML / XHTML parser.

Reasons not to change:
- Changing takes work, and has the potential to introduce new bugs.
- We have to worry about upstream changes.

I'm not overly convinced about the upstream changes piece.  The
upstream code in question is a port of the W3C reference parser.
Since we'd be using a whitelist filter on the DOM it produces, we
don't need to worry about new features and supported markup, only new
bugs.  How much auditing do we currently do on upstream code?

I'm not trying to advocate spending lots of time changing everything
over.  We have better things to work on.  I'm asking whether it's
easier to fix the current code, or to refactor it to use a more robust
parser.  (And which is easier to maintain long term -- though if
switching is a lot of work, but easier long term, then imho we should
keep the current code for now and revisit the question later, like
after 0.8 is out.)

Evan Daniel
