On Sunday 09 May 2010 14:17:38 Evan Daniel wrote:
> On Sun, May 9, 2010 at 7:36 AM, Florent Daigniere
> <nextgens at freenetproject.org> wrote:
> >> >> > Depending how much cleaning of the HTML filtering system you want
> >> >> > to do... Has using something like JTidy
> >> >> > ( http://jtidy.sourceforge.net/ ) been discussed? That way you
> >> >> > wouldn't have to worry about what's valid or invalid HTML, merely
> >> >> > the security aspects of valid HTML that are unique to Freenet.
> >> >
> >> > That might be nice... but wouldn't we have the same problem in that
> >> > it would be hard to diff the output of the filter against the input
> >> > for debugging purposes? What do other people think about this? It
> >> > would make life much easier...
> >>
> >> I don't see why it would be a problem. I haven't used tidy much,
> >> honestly. I don't see how to make it stop changing line breaks and
> >> such in my page. However, I don't mind running it locally before
> >> inserting, so that nothing changes when the filter runs it. I don't
> >> need the filter to never change anything; I just need to know what to
> >> do so that I can get a diff that shows only the changes made by the
> >> filter. If I need to run tidy on the original, and then diff that vs
> >> the filtered output, that's fine by me.
> >>
> >> And anything that makes the filtering more robust and less work is a
> >> big win, imho.
> >>
> >> Evan Daniel
> >
> > No way. We have a filter which works (security-wise), why would we
> > change? Auditing upstream changes is going to be more time-expensive
> > than maintaining our own, because it implements only a subset of the
> > features.
>
> As I see it, there are three parts to the filter:
> 1) Parse the HTML / XHTML, and build a parse tree / DOM. Handle
>    invalid markup, invalid character escapes, etc.
> 2) Remove any elements that present a security risk.
> 3) Turn the resulting DOM back into HTML output.
>
> The goal of using something like JTidy would be to make part 1 more
> robust and easier to maintain. Part 2 would be the same filter we
> have now.
>
> At present, we allow a large amount of invalid markup through the
> filter. I don't like this for a variety of reasons, but the relevant
> one is that browser behavior when presented with invalid markup is
> not well defined, and therefore has a lot of potential for security
> risks. OTOH, we can't just ban invalid markup, because so many
> freesites use it. Using something like JTidy gets the best of both
> worlds: it cleans up invalid markup and produces something that is
> valid and likely to do what the freesite author wanted. That means we
> can be certain that the browser will interpret the document in the
> same fashion our filter does, which is a win for security.
>
> Reasons to change:
> - Our filter works security-wise, but is more restrictive than
>   required on content. Loosening those restrictions will be less work
>   if we can assume that the filtering process starts with a completely
>   valid DOM.
> - We don't have to maintain a parser, just a filter.
> - Our current filter breaks valid markup; fixing this is probably
>   easier if we use something like JTidy to handle the DOM rather than
>   rolling our own HTML / XHTML parser.
>
> Reasons not to change:
> - Changing takes work, and has the potential to introduce new bugs.
> - We have to worry about upstream changes.
>
> I'm not overly convinced about the upstream changes piece. The
> upstream code in question is a port of the W3C reference parser.
> Since we'd be using a whitelist filter on the DOM it produces, we
> don't need to worry about new features and supported markup, only
> new bugs. How much auditing do we currently do on upstream code?
>
> I'm not trying to advocate spending lots of time changing everything
> over. We have better things to work on. I'm asking whether it's
> easier to fix the current code, or to refactor it to use a more
> robust parser. (And which is easier to maintain long term -- though
> if switching is a lot of work, but easier long term, then imho we
> should keep the current code for now and revisit the question later,
> like after 0.8 is out.)
IMHO it would be a lot of work for little benefit, and involve pulling in
lots of third party code. And we would still need a whitelist filter -
IIRC the parser isn't the bulk of the filtering code?
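
For reference, a minimal sketch of the pipeline discussed above: JTidy's
parseDOM() for part 1, a whitelist walk over the resulting org.w3c.dom
tree for part 2, and pprint() for part 3. This assumes JTidy's DOM
supports standard org.w3c.dom mutation, and the element/attribute lists
are purely illustrative, nothing like Freenet's real tables:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class TidyFilterSketch {

    // Illustrative whitelists only, not Freenet's actual tables.
    private static final Set<String> ALLOWED_ELEMENTS = new HashSet<String>(
        Arrays.asList("html", "head", "title", "body", "p", "a", "b", "i",
                      "ul", "li"));
    private static final Set<String> ALLOWED_ATTRIBUTES = new HashSet<String>(
        Arrays.asList("href", "title"));

    public static String filter(String dirtyHtml) throws Exception {
        // Part 1: JTidy turns arbitrary input into a valid DOM.
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        Document doc = tidy.parseDOM(
            new ByteArrayInputStream(dirtyHtml.getBytes("UTF-8")), null);

        // Part 2: the whitelist filter walks the DOM.
        clean(doc.getDocumentElement());

        // Part 3: serialize the cleaned DOM back to markup.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.pprint(doc, out);
        return out.toString("UTF-8");
    }

    private static void clean(Node node) {
        NodeList children = node.getChildNodes();
        // Iterate backwards so removals don't shift unvisited siblings.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE)
                continue;
            if (!ALLOWED_ELEMENTS.contains(child.getNodeName().toLowerCase())) {
                node.removeChild(child); // drops the whole subtree
                continue;
            }
            stripAttributes((Element) child);
            clean(child);
        }
    }

    private static void stripAttributes(Element e) {
        NamedNodeMap attrs = e.getAttributes();
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            String name = attrs.item(i).getNodeName();
            if (!ALLOWED_ATTRIBUTES.contains(name.toLowerCase()))
                e.removeAttribute(name);
        }
    }
}

Even in this sketch, most of the interesting decisions live in the
whitelist walk: whether to drop a banned element's subtree or hoist its
children, and how to validate attribute values (e.g. href targets),
which is consistent with the point that the parser isn't the bulk of
the filtering code.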
