[freenet-dev] Attribute reordering in HTML filter

2010-05-08 Thread Matthew Toseland
On Saturday 08 May 2010 05:09:07 Evan Daniel wrote:
> On Fri, May 7, 2010 at 11:43 PM, Spencer Jackson
> <spencerandrewjack...@gmail.com> wrote:
> > On Fri, 2010-05-07 at 12:40 +0100, Matthew Toseland wrote:
> >> On Thursday 06 May 2010 20:40:03 Spencer Jackson wrote:
> >> > Hi guys, just wanted to touch base. Anyway, I'm working on resolving bug
> >> > number 3571( https://bugs.freenetproject.org/view.php?id=3571 ). To
> >> > summarize, the filter tends to reorder attributes at semirandom when
> >> > they get parsed. While the structure which holds the parsed attribute is
> >> > a LinkedHashMap, meaning we should be able to stuff in values and pull
> >> > them out in the same order, the put functions are called in the derived
> >> > verifier's overrided sanitizeHash methods. These methods extract an
> >> > attribute, sanitize it, then place it in the Map. The problem is, they
> >> > are extracted out of the original order, meaning they get pulled out of
> >> > the Map in the wrong order. To fix this, I created a callback object
> >> > which the derived classes pass to the baseclass. The baseclass may then
> >> > parse all of the attributes in order, invoking the callback to
> >> > sanitize.If an attribute's contents fails to be processed, an exception
> >> > may be thrown, so that the attribute will not be included in the final
> >> > tag.
> >>
> >> It is important that only attributes that are explicitly parsed and 
> >> understood are passed on, and that it doesn't take extra per-sanitiser 
> >> work to achieve this. Will this be the case?
> >>
> >
> > Yeah, this should be the case.  Attributes which don't have a callback
> > stored simply aren't parsed. I am starting, however, to think this
> > approach might be overkill.  Here I have a different take:
> > http://github.com/spencerjackson/fred-staging/tree/HTMLAttributeReorder
> > Instead of running a callback in the base class, I simply create the
> > attributes, in order, with null content. Then, in the overloaded methods
> > on the child classes I repopulate them with the correct data. This
> > preserves the original order of the attributes, while minimizing the
> > amount of new code that needs to be written. What do you think? Which
> > solution do you think is preferable?
> 
> Do attributes without content still get written?  Is that always
> valid?  Not writing them isn't always valid; see eg bug 4125: current
> code happily removes required attributes from <meta> tags, thus
> breaking valid pages.
> 
> Depending how much cleaning of the HTML filtering system you want to
> do...  Has using something like JTidy ( http://jtidy.sourceforge.net/
> ) been discussed?  That way you wouldn't have to worry about what's
> valid or invalid HTML, merely the security aspects of valid HTML that
> are unique to Freenet.

IMHO sajack's solution is acceptable, you will have to just use null to 
indicate no attribute and "" to indicate an attribute with no value? Or is 
there a difference between attributes with an empty value and attributes with 
no value?
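
For illustration, the null-placeholder idea discussed above can be sketched in Java roughly as follows. This is only a sketch under stated assumptions: the class and method names are hypothetical and do not come from the Freenet source. It relies solely on the documented behaviour of java.util.LinkedHashMap, where re-inserting an existing key does not change its position, and it uses null for "no attribute" and "" for "attribute with an empty value" as suggested in this message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Freenet code base.
class AttributeOrderSketch {

    /** Pre-registers every attribute name, in source order, with a null value. */
    static Map<String, String> placeholders(String... namesInSourceOrder) {
        Map<String, String> sanitized = new LinkedHashMap<>();
        for (String name : namesInSourceOrder) {
            sanitized.put(name, null); // null means "no attribute (yet)"
        }
        return sanitized;
    }

    public static void main(String[] args) {
        Map<String, String> sanitized = placeholders("href", "title", "class");

        // A verifier may fill in values in whatever order it processes them.
        sanitized.put("class", "nav");
        sanitized.put("href", "/index.html");
        sanitized.put("title", ""); // "" means "attribute present, empty value"

        // Re-inserting an existing key keeps its original position.
        System.out.println(sanitized.keySet()); // prints [href, title, class]
    }
}
```

Because the keys are registered before any verifier runs, the order in which a verifier later calls put() no longer affects the order of the output.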


[freenet-dev] Resource usage up since last version

2010-05-08 Thread Matthew Toseland
On Wednesday 05 May 2010 20:04:23 Christian Funder Sommerlund wrote:
> Matthew Toseland wrote:
> > On Monday 03 May 2010 22:13:17 Christian Funder Sommerlund wrote:
> >> Random observation:
> >>
> >> Since the last version of Freenet my 24/7 node has seen an increase in 
> >> CPU usage, network traffic and Disk IO of about 50%. I'm not sure if 
> >> this is intentional or not, but graphs of all of the above mentioned 
> >> show a significant jump up around April 24th.
> > 
> > Is this 1245 or trunk changes since then (e.g. merging web-pushing was 
> > causing a lot of unnecessary work for a while, but this would not have 
> > increased network traffic)?
> 
> Node is running standard versions with autoupdate, so most likely this 
> was caused by an automatic update to 1245, which you released on April 
> 23rd according to the mailing list :).

Hmmm, strange, nothing obvious - leak fixes, internal stuff...


[freenet-dev] Attribute reordering in HTML filter

2010-05-08 Thread Evan Daniel
On Fri, May 7, 2010 at 11:43 PM, Spencer Jackson
<spencerandrewjack...@gmail.com> wrote:
> On Fri, 2010-05-07 at 12:40 +0100, Matthew Toseland wrote:
>> On Thursday 06 May 2010 20:40:03 Spencer Jackson wrote:
>> > Hi guys, just wanted to touch base. Anyway, I'm working on resolving bug
>> > number 3571( https://bugs.freenetproject.org/view.php?id=3571 ). To
>> > summarize, the filter tends to reorder attributes at semirandom when
>> > they get parsed. While the structure which holds the parsed attribute is
>> > a LinkedHashMap, meaning we should be able to stuff in values and pull
>> > them out in the same order, the put functions are called in the derived
>> > verifier's overrided sanitizeHash methods. These methods extract an
>> > attribute, sanitize it, then place it in the Map. The problem is, they
>> > are extracted out of the original order, meaning they get pulled out of
>> > the Map in the wrong order. To fix this, I created a callback object
>> > which the derived classes pass to the baseclass. The baseclass may then
>> > parse all of the attributes in order, invoking the callback to
>> > sanitize.If an attribute's contents fails to be processed, an exception
>> > may be thrown, so that the attribute will not be included in the final
>> > tag.
>>
>> It is important that only attributes that are explicitly parsed and 
>> understood are passed on, and that it doesn't take extra per-sanitiser work 
>> to achieve this. Will this be the case?
>>
>
> Yeah, this should be the case.  Attributes which don't have a callback
> stored simply aren't parsed. I am starting, however, to think this
> approach might be overkill.  Here I have a different take:
> http://github.com/spencerjackson/fred-staging/tree/HTMLAttributeReorder
> Instead of running a callback in the base class, I simply create the
> attributes, in order, with null content. Then, in the overloaded methods
> on the child classes I repopulate them with the correct data. This
> preserves the original order of the attributes, while minimizing the
> amount of new code that needs to be written. What do you think? Which
> solution do you think is preferable?

Do attributes without content still get written?  Is that always
valid?  Not writing them isn't always valid; see eg bug 4125: current
code happily removes required attributes from <meta> tags, thus
breaking valid pages.

Depending how much cleaning of the HTML filtering system you want to
do...  Has using something like JTidy ( http://jtidy.sourceforge.net/
) been discussed?  That way you wouldn't have to worry about what's
valid or invalid HTML, merely the security aspects of valid HTML that
are unique to Freenet.

Evan Daniel



___
Devl mailing list
Devl@freenetproject.org
http://osprey.vm.bytemark.co.uk/cgi-bin/mailman/listinfo/devl


Re: [freenet-dev] Attribute reordering in HTML filter

2010-05-08 Thread Spencer Jackson
On Sat, May 8, 2010 at 10:38 AM, Matthew Toseland
<t...@amphibian.dyndns.org> wrote:

> On Saturday 08 May 2010 05:09:07 Evan Daniel wrote:
> > On Fri, May 7, 2010 at 11:43 PM, Spencer Jackson
> > <spencerandrewjack...@gmail.com> wrote:
> > > On Fri, 2010-05-07 at 12:40 +0100, Matthew Toseland wrote:
> > >> On Thursday 06 May 2010 20:40:03 Spencer Jackson wrote:
> > >> > Hi guys, just wanted to touch base. Anyway, I'm working on resolving bug
> > >> > number 3571( https://bugs.freenetproject.org/view.php?id=3571 ). To
> > >> > summarize, the filter tends to reorder attributes at semirandom when
> > >> > they get parsed. While the structure which holds the parsed attribute is
> > >> > a LinkedHashMap, meaning we should be able to stuff in values and pull
> > >> > them out in the same order, the put functions are called in the derived
> > >> > verifier's overrided sanitizeHash methods. These methods extract an
> > >> > attribute, sanitize it, then place it in the Map. The problem is, they
> > >> > are extracted out of the original order, meaning they get pulled out of
> > >> > the Map in the wrong order. To fix this, I created a callback object
> > >> > which the derived classes pass to the baseclass. The baseclass may then
> > >> > parse all of the attributes in order, invoking the callback to
> > >> > sanitize.If an attribute's contents fails to be processed, an exception
> > >> > may be thrown, so that the attribute will not be included in the final
> > >> > tag.
> > >>
> > >> It is important that only attributes that are explicitly parsed and
> > >> understood are passed on, and that it doesn't take extra per-sanitiser
> > >> work to achieve this. Will this be the case?
> > >>
> > >
> > > Yeah, this should be the case.  Attributes which don't have a callback
> > > stored simply aren't parsed. I am starting, however, to think this
> > > approach might be overkill.  Here I have a different take:
> > > http://github.com/spencerjackson/fred-staging/tree/HTMLAttributeReorder
> > > Instead of running a callback in the base class, I simply create the
> > > attributes, in order, with null content. Then, in the overloaded methods
> > > on the child classes I repopulate them with the correct data. This
> > > preserves the original order of the attributes, while minimizing the
> > > amount of new code that needs to be written. What do you think? Which
> > > solution do you think is preferable?
> >
> > Do attributes without content still get written?  Is that always
> > valid?  Not writing them isn't always valid; see eg bug 4125: current
> > code happily removes required attributes from <meta> tags, thus
> > breaking valid pages.


Odd. I'm looking at the code for MetaTagVerifier, and I can't see any code
branches in which, if the 'content' attribute is defined, it is failed to be
added to the LinkedHashMap unless nothing else is added either... I'm not on
my home computer, so I'll have to test this tomorrow. Does it happen to all
meta tags? Oh. Do you mean, if there are no attributes, the tag will still
exist, but be empty? I could alter MetaTagVerifier to return null if this is
the case, and remove the tag from the final output. Would that fix this?
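
As a rough illustration of the "return null and remove the tag" idea in the paragraph above, here is a minimal Java sketch. It is not the real MetaTagVerifier: the class name, the method name and its signature are hypothetical, attribute values are assumed to have been sanitized and escaped already, and dropping a tag that ends up with no attributes is assumed to be appropriate only for elements such as meta, where a bare tag carries no information.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; this is not the actual MetaTagVerifier API.
class EmptyTagSketch {

    /**
     * Renders the tag from the sanitized attributes, or returns null when no
     * attribute survived, so that the caller can omit the tag entirely.
     * Values are assumed to be sanitized and escaped already.
     */
    static String renderOrDrop(String element, Map<String, String> sanitized) {
        StringBuilder out = new StringBuilder("<").append(element);
        int written = 0;
        for (Map.Entry<String, String> e : sanitized.entrySet()) {
            if (e.getValue() == null) {
                continue; // placeholder that was never filled in: discard
            }
            out.append(' ').append(e.getKey())
               .append("=\"").append(e.getValue()).append('"');
            written++;
        }
        return written == 0 ? null : out.append(" />").toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("http-equiv", null);
        attrs.put("content", null);
        System.out.println(renderOrDrop("meta", attrs)); // prints null: drop the tag

        attrs.put("http-equiv", "Content-Type");
        attrs.put("content", "text/html; charset=UTF-8");
        System.out.println(renderOrDrop("meta", attrs));
        // prints <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    }
}
```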


 
> > Depending how much cleaning of the HTML filtering system you want to
> > do...  Has using something like JTidy ( http://jtidy.sourceforge.net/
> > ) been discussed?  That way you wouldn't have to worry about what's
> > valid or invalid HTML, merely the security aspects of valid HTML that
> > are unique to Freenet.

That might be nice... but wouldn't we have the same problem in that it would
be hard to diff the output of the filter against the input for debugging
purposes? What do other people think about this? It would make life much
easier...


> IMHO sajack's solution is acceptable, you will have to just use null to
> indicate no attribute and "" to indicate an attribute with no value? Or is
> there a difference between attributes with an empty value and attributes
> with no value?


Apparently, HTML supports attribute minimization, but XHTML does not. In
other words, 'compact' is valid HTML, but not valid XHTML, which needs
'compact=compact'. ( http://www.w3.org/TR/xhtml1/#h-4.5 ) For boolean
values, according to (
http://www.w3.org/TR/html401/intro/sgmltut.html#h-3.3.4.2 ) attributes
should either exist, without an '=', or be equal to the attribute's name if
true, and nonexistent if false. XHTML will require the attribute be equal to
its name, if true. So yes, there is a difference.

Okay, how's this? Step one: for every attribute in the tag, create the same
attribute, in the same order, in the sanitized tag, with its value set to
null. Step two: parse the tag, replacing the null values where new values
exist. Finally, iterate through all the attributes in the parsed map. If an
attribute's value is null, discard it. If the value is simply empty, check
whether the HTML parse context says we're parsing XHTML. If not, pass through
the minimized attribute. If so, discard it.
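
The final pass described above could look roughly like the following Java sketch. The names (AttributeWriterSketch, writeAttributes, the xhtml flag) are hypothetical stand-ins rather than the actual filter API, and attribute values are assumed to have been sanitized and escaped before this step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the final pass; not the actual Freenet filter code.
class AttributeWriterSketch {

    /** Serializes the parsed attributes in map order, applying the null/empty rules. */
    static String writeAttributes(Map<String, String> parsed, boolean xhtml) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : parsed.entrySet()) {
            String name = e.getKey();
            String value = e.getValue();
            if (value == null) {
                continue; // placeholder never replaced: discard
            }
            if (value.isEmpty()) {
                if (!xhtml) {
                    out.append(' ').append(name); // HTML: pass through minimized form
                }
                continue; // XHTML: minimized attributes are not valid, so discard
            }
            out.append(' ').append(name).append("=\"").append(value).append('"');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("compact", "");   // minimized boolean attribute
        parsed.put("class", "list"); // ordinary attribute
        parsed.put("onclick", null); // placeholder the verifier never filled

        System.out.println("<ul" + writeAttributes(parsed, false) + ">"); // <ul compact class="list">
        System.out.println("<ul" + writeAttributes(parsed, true) + ">");  // <ul class="list">
    }
}
```

An alternative for the XHTML branch would be to write the expanded form (compact="compact") instead of discarding the attribute, which matches the XHTML rule cited earlier in this message; the sketch sticks to the discard behaviour as proposed.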

Re: [freenet-dev] Attribute reordering in HTML filter

2010-05-08 Thread Evan Daniel
On Sat, May 8, 2010 at 11:38 AM, Matthew Toseland
<t...@amphibian.dyndns.org> wrote:
> On Saturday 08 May 2010 05:09:07 Evan Daniel wrote:
> > On Fri, May 7, 2010 at 11:43 PM, Spencer Jackson
> > <spencerandrewjack...@gmail.com> wrote:
> > > On Fri, 2010-05-07 at 12:40 +0100, Matthew Toseland wrote:
> > >> On Thursday 06 May 2010 20:40:03 Spencer Jackson wrote:
> > >> > Hi guys, just wanted to touch base. Anyway, I'm working on resolving bug
> > >> > number 3571( https://bugs.freenetproject.org/view.php?id=3571 ). To
> > >> > summarize, the filter tends to reorder attributes at semirandom when
> > >> > they get parsed. While the structure which holds the parsed attribute is
> > >> > a LinkedHashMap, meaning we should be able to stuff in values and pull
> > >> > them out in the same order, the put functions are called in the derived
> > >> > verifier's overrided sanitizeHash methods. These methods extract an
> > >> > attribute, sanitize it, then place it in the Map. The problem is, they
> > >> > are extracted out of the original order, meaning they get pulled out of
> > >> > the Map in the wrong order. To fix this, I created a callback object
> > >> > which the derived classes pass to the baseclass. The baseclass may then
> > >> > parse all of the attributes in order, invoking the callback to
> > >> > sanitize.If an attribute's contents fails to be processed, an exception
> > >> > may be thrown, so that the attribute will not be included in the final
> > >> > tag.
> > >>
> > >> It is important that only attributes that are explicitly parsed and
> > >> understood are passed on, and that it doesn't take extra per-sanitiser
> > >> work to achieve this. Will this be the case?
> > >>
> > >
> > > Yeah, this should be the case.  Attributes which don't have a callback
> > > stored simply aren't parsed. I am starting, however, to think this
> > > approach might be overkill.  Here I have a different take:
> > > http://github.com/spencerjackson/fred-staging/tree/HTMLAttributeReorder
> > > Instead of running a callback in the base class, I simply create the
> > > attributes, in order, with null content. Then, in the overloaded methods
> > > on the child classes I repopulate them with the correct data. This
> > > preserves the original order of the attributes, while minimizing the
> > > amount of new code that needs to be written. What do you think? Which
> > > solution do you think is preferable?
> >
> > Do attributes without content still get written?  Is that always
> > valid?  Not writing them isn't always valid; see eg bug 4125: current
> > code happily removes required attributes from <meta> tags, thus
> > breaking valid pages.
> >
> > Depending how much cleaning of the HTML filtering system you want to
> > do...  Has using something like JTidy ( http://jtidy.sourceforge.net/
> > ) been discussed?  That way you wouldn't have to worry about what's
> > valid or invalid HTML, merely the security aspects of valid HTML that
> > are unique to Freenet.
>
> IMHO sajack's solution is acceptable, you will have to just use null to
> indicate no attribute and "" to indicate an attribute with no value? Or is
> there a difference between attributes with an empty value and attributes with
> no value?


It sounds fine to me, provided it doesn't take validating html and
make it stop validating.  Or at least does so no more than the current
code.

I'm asking what will happen when the attribute has null content
because the filter couldn't find anything to fill it with; does that
get written as <tag attribute=""> or <tag> or something else?
Whichever it is, do we know that the result will be valid html?

The current filter turns eg
<meta http-equiv="Content-type" content="application/xhtml+xml;charset=UTF-8" />
into
<meta />

The first is valid xhtml, the second is not.  Run the w3c validator
against my flog, both filtered and unfiltered, for details.  So, how
will the new filter handle cases like this, where filter code hasn't
been completely written for all relevant aspects?

Evan Daniel


Re: [freenet-dev] Attribute reordering in HTML filter

2010-05-08 Thread Evan Daniel
On Sat, May 8, 2010 at 9:35 PM, Spencer Jackson
<spencerandrewjack...@gmail.com> wrote:
> On Sat, May 8, 2010 at 10:38 AM, Matthew Toseland
> <t...@amphibian.dyndns.org> wrote:
>
> > On Saturday 08 May 2010 05:09:07 Evan Daniel wrote:
> > > On Fri, May 7, 2010 at 11:43 PM, Spencer Jackson
> > > <spencerandrewjack...@gmail.com> wrote:
> > > > On Fri, 2010-05-07 at 12:40 +0100, Matthew Toseland wrote:
> > > >> On Thursday 06 May 2010 20:40:03 Spencer Jackson wrote:
> > > >> > Hi guys, just wanted to touch base. Anyway, I'm working on resolving bug
> > > >> > number 3571( https://bugs.freenetproject.org/view.php?id=3571 ). To
> > > >> > summarize, the filter tends to reorder attributes at semirandom when
> > > >> > they get parsed. While the structure which holds the parsed attribute is
> > > >> > a LinkedHashMap, meaning we should be able to stuff in values and pull
> > > >> > them out in the same order, the put functions are called in the derived
> > > >> > verifier's overrided sanitizeHash methods. These methods extract an
> > > >> > attribute, sanitize it, then place it in the Map. The problem is, they
> > > >> > are extracted out of the original order, meaning they get pulled out of
> > > >> > the Map in the wrong order. To fix this, I created a callback object
> > > >> > which the derived classes pass to the baseclass. The baseclass may then
> > > >> > parse all of the attributes in order, invoking the callback to
> > > >> > sanitize.If an attribute's contents fails to be processed, an exception
> > > >> > may be thrown, so that the attribute will not be included in the final
> > > >> > tag.
> > > >>
> > > >> It is important that only attributes that are explicitly parsed and
> > > >> understood are passed on, and that it doesn't take extra per-sanitiser
> > > >> work to achieve this. Will this be the case?
> > > >>
> > > >
> > > > Yeah, this should be the case.  Attributes which don't have a callback
> > > > stored simply aren't parsed. I am starting, however, to think this
> > > > approach might be overkill.  Here I have a different take:
> > > > http://github.com/spencerjackson/fred-staging/tree/HTMLAttributeReorder
> > > > Instead of running a callback in the base class, I simply create the
> > > > attributes, in order, with null content. Then, in the overloaded methods
> > > > on the child classes I repopulate them with the correct data. This
> > > > preserves the original order of the attributes, while minimizing the
> > > > amount of new code that needs to be written. What do you think? Which
> > > > solution do you think is preferable?
> > >
> > > Do attributes without content still get written?  Is that always
> > > valid?  Not writing them isn't always valid; see eg bug 4125: current
> > > code happily removes required attributes from <meta> tags, thus
> > > breaking valid pages.
>
> Odd. I'm looking at the code for MetaTagVerifier, and I can't see any code
> branches in which, if the 'content' attribute is defined, it is failed to be
> added to the LinkedHashMap unless nothing else is added either... I'm not on
> my home computer, so I'll have to test this tomorrow. Does it happen to all
> meta tags? Oh. Do you mean, if there are no attributes, the tag will still
> exist, but be empty? I could alter MetaTagVerifier to return null if this is
> the case, and remove the tag from the final output. Would that fix this?

As mentioned in the other reply, the content filter alters my flog from
<meta http-equiv="Content-type" content="application/xhtml+xml;charset=UTF-8" />
to
<meta />

I haven't done a detailed analysis of why.



 
> > > Depending how much cleaning of the HTML filtering system you want to
> > > do...  Has using something like JTidy ( http://jtidy.sourceforge.net/
> > > ) been discussed?  That way you wouldn't have to worry about what's
> > > valid or invalid HTML, merely the security aspects of valid HTML that
> > > are unique to Freenet.
>
> That might be nice... but wouldn't we have the same problem in that it would
> be hard to diff the output of the filter against the input for debugging
> purposes? What do other people think about this? It would make life much
> easier...

I don't see why it would be a problem.  I haven't used tidy much,
honestly.  I don't see how to make it stop changing line breaks and
such in my page.  However, I don't mind running it locally before
inserting, so that nothing changes when the filter runs it.  I don't
need the filter to never change anything; I just need to know what to
do so that I can get a diff that shows only the changes made by the
filter.  If I need to run tidy on the original, and then diff that vs
the filtered output, that's fine by me.

And anything that makes the filtering more robust and less work is a
big win, imho.

Evan Daniel