[freenet-dev] [GSoC 2010] Improving Content Filters

Evan Daniel Thu, 8 Apr 2010 11:33:23 -0400

On Thu, Apr 8, 2010 at 12:29 AM, Spencer Jackson
<spencerandrewjackson at gmail.com> wrote:
> Hi
>
> I've been hanging out on IRC as of late under the nick 'sajack'. I'm going
> to be submitting an application to work with Freenet during Google Summer of
> Code. Anyway, here's the proposal I'm going to be uploading, if anyone has
> any thoughts. Thanks for looking at it.
>
>
>
> Proposal: Improve Implementation and Functionality of Content Filtration and
> Add Support for Additional Formats
> Proposer: Spencer Jackson (sajack)
>
> Introduction
> Content is an important part of the Freenet experience. Good, plentiful
> content attracts users, which attracts donations and creates more nodes,
> both of which, directly or indirectly, improve performance and security of
> the network. As such, to make Freenet better, we must make the process of
> getting information from the network to the user quick, easy, and safe. I am
> proposing a series of changes to the ContentFilter and adjacent systems so
> as to realize this. Below are the general steps I will take.
>
>
> Modify content filters to act as streams
>
> Presently, Freenet's data filters are passed Bucket objects containing all
> of the data they need to process. This is suboptimal. Ideally, the data
> filters should have a stream interface. This will reduce duplication of data
> and increase performance by removing the need for vast amounts of disk I/O,
> as less data will need to be cached on disk. This will be very easy to
> implement, as most of the filters deal with streams internally.
>
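
A minimal sketch of what a stream-based filter interface could look like,
assuming nothing about Freenet's real classes. StreamFilter,
ControlCharFilter and StreamFilterDemo are hypothetical names invented for
illustration, and the toy filter only drops ASCII control characters; it is
not one of the existing filters.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical interface, not Freenet's actual filter API.
interface StreamFilter {
    // Reads unfiltered data from 'in' and writes filtered data to 'out',
    // without ever needing the whole file in memory or on disk.
    void filter(InputStream in, OutputStream out) throws IOException;
}

// Toy filter: drops control characters below 0x20 except tab, LF and CR.
class ControlCharFilter implements StreamFilter {
    @Override
    public void filter(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                int b = buf[i] & 0xFF;
                boolean control = b < 0x20 && b != '\t' && b != '\n' && b != '\r';
                if (!control) {
                    out.write(b);
                }
            }
        }
        out.flush();
    }
}

class StreamFilterDemo {
    // Convenience wrapper for in-memory data; real callers would pass the
    // network and client streams straight to filter().
    static byte[] run(StreamFilter f, byte[] data) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            f.filter(new ByteArrayInputStream(data), out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With an interface of this shape, a caller holding a whole file can still use
the filter by wrapping the data in streams, as StreamFilterDemo.run does.
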
> Right now, filters are a part of FProxy, and are invoked by it when a file
> is downloaded. Really though, most clients probably desire filtered data, so
> filtration should be done earlier, with FProxy simply using general
> functionality. I will therefore move filters into the client layer, and
> invoke them there.
>
> Now while here, it would be useful to add some new functionality. First off,
> I'll add the ability to filter files being saved to the hard drive. Right
> now, this doesn't happen, and it's something of a weak spot in our armor.
> Later on especially, when there are Ogg filters, users may be downloading
> large files directly to their hard drive. We will want them to be filtered.
>
> Another thing while I'm working with filters in the client layer: I will
> implement filtration of inserts. This will help prevent metadata in a file
> uploaded by the user from breaking his or her anonymity. For example, EXIF
> data in jpegs may reveal the serial number of the camera which took the
> picture, or even the GPS coordinates from where the picture was taken.
>
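
To illustrate the insert-filtering idea, here is a rough sketch that drops
APP1 segments (where EXIF data is stored) from a JPEG. The class
JpegMetadataStripper is a hypothetical name, not an existing Freenet class.
The sketch assumes every marker before the start-of-scan marker carries a
length field, and it ignores other metadata carriers such as comment
segments, so a real filter would need to be stricter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only.
class JpegMetadataStripper {
    private static final int SOI = 0xFFD8;   // start of image
    private static final int APP1 = 0xFFE1;  // EXIF lives in APP1 segments
    private static final int SOS = 0xFFDA;   // start of scan

    static byte[] stripApp1(byte[] jpeg) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(jpeg));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (in.readUnsignedShort() != SOI) {
                throw new IOException("Not a JPEG: missing SOI marker");
            }
            out.write(0xFF);
            out.write(0xD8);
            while (true) {
                int marker = in.readUnsignedShort();
                if ((marker & 0xFF00) != 0xFF00) {
                    throw new IOException("Malformed marker");
                }
                // The big-endian length counts its own two bytes.
                int length = in.readUnsignedShort();
                if (length < 2) {
                    throw new IOException("Malformed segment length");
                }
                byte[] body = new byte[length - 2];
                in.readFully(body);
                if (marker != APP1) {
                    out.write(marker >> 8);
                    out.write(marker & 0xFF);
                    out.write(length >> 8);
                    out.write(length & 0xFF);
                    out.write(body);
                }
                if (marker == SOS) {
                    break; // entropy-coded image data follows
                }
            }
            // Copy the compressed image data and the end marker unchanged.
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Truncated or malformed input raises an exception rather than passing
through, which seems like the safer default for a filter.
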
> Of course, there are some use cases, such as during debugging, where it
> may be undesirable for a request to be filtered. It must therefore be
> possible to disable filtering. To accomplish this, I will prevent the data
> from being filtered when a configuration setting in the request's context
> has been set. Support for disabling filters will need to be added to FCP.
> All of this will then need to be supported in the web interface. I will add
> support for complementary GET and POST variables in FProxy which would be
> used to trigger this setting. Next, I'll add UI elements to the download and
> insert queue pages and any other pertinent locations, such as the
> 'Downloading a page' page, which would enable these variables. These
> elements should only be visible when the user is in 'Advanced mode,' and,
> even then, should be tagged with a Big Fat Warning about the risks of
> turning off filtering.
>
> Another feature I will implement is the ability to run data through a filter
> without placing it on the network. This would be useful for debugging
> content filters, and for freesite writers who want to see what their site
> will look like after it's been parsed. This should be pretty easy to
> implement. I'll create an FCP message which will take data, filter it, and
> return it. I will also create a way to do this through FProxy, by uploading
> a file, and receiving the filtered version.
>
> The next thing I will implement is stream-friendly Compressors.
> Essentially, we should be able to have a filter and a decompressor running
> on separate threads, and have data be passable between them transparently
> using piped streams.
>
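
The decompressor-plus-filter arrangement can be sketched with java.io piped
streams. This is only an illustration under stated assumptions: gzip from
java.util.zip stands in for whichever compressor is in use, toyFilter is a
placeholder that upper-cases ASCII letters, and PipedDecompressDemo is a
hypothetical class name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class PipedDecompressDemo {
    // Placeholder for a real content filter: upper-cases ASCII letters.
    static void toyFilter(InputStream in, OutputStream out) throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            out.write(b >= 'a' && b <= 'z' ? b - 32 : b);
        }
    }

    static byte[] decompressAndFilter(final byte[] gzipped) {
        final AtomicReference<IOException> failure = new AtomicReference<>();
        try {
            final PipedOutputStream pipeOut = new PipedOutputStream();
            PipedInputStream pipeIn = new PipedInputStream(pipeOut, 8192);

            // Decompressor thread: writes decompressed bytes into the pipe.
            // pipeOut is listed first so it is closed even if the gzip
            // header is invalid; otherwise the reader could block forever.
            Thread decompressor = new Thread(() -> {
                try (OutputStream o = pipeOut;
                     InputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
                    byte[] buf = new byte[4096];
                    int n;
                    while ((n = gz.read(buf)) != -1) {
                        o.write(buf, 0, n);
                    }
                } catch (IOException e) {
                    failure.set(e);
                }
            });
            decompressor.start();

            // The filter runs on the calling thread and reads from the pipe.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (PipedInputStream in = pipeIn) {
                toyFilter(in, out);
            }
            decompressor.join();
            if (failure.get() != null) {
                throw failure.get(); // do not return silently truncated data
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Helper that produces test input.
    static byte[] gzip(byte[] data) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(data);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point worth noting is error propagation: a failure on the decompressor
thread looks like a normal end of stream to the reader, so the sketch records
the exception and rethrows it after the join.
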
>
> Implementation of Ogg container format
>
> Right now, Freenet has filters for
> HTML and some forms of image files. More filters means more types of content
> which may be safely viewed by the user. This will allow the network to be
> used in ways which are currently not safe. After I have implemented the new
> stream based content filters, I shall implement more of them.
>
> The first type of filter which I will implement is for the Ogg container
> format. This is technically interesting, as it encapsulates other types of
> data. A generic Ogg parser will be written, which will need to validate the
> Ogg container, identify the bitstreams it contains, identify the codec used
> inside these bitstreams, and process the streams using a second (or nth,
> really, depending on how many bitstreams are in the container)
> codec-specific filter. It should be possible to use this filter to filter
> either just the beginning of the file or the whole thing. This will make it
> possible to preview a partially downloaded file, at some point in the
> future. One thing which will need to be taken into consideration is the
> possibility of Ogg pages being concealed inside of other Ogg pages. This
> will be checked for, and a fatal error will be raised if it occurs.
>
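
As a starting point for the generic Ogg parser, here is a small sketch that
reads one Ogg page header and reports which logical bitstream the page
belongs to. The field layout follows the published Ogg page format;
OggPageHeader is a hypothetical class name. The sketch does not verify the
page CRC and does not look for pages concealed inside payloads, both of
which a real filter would need to do.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical class for illustration only.
class OggPageHeader {
    final int headerType;       // flag bits, see the accessors below
    final long granulePosition; // codec-specific position marker
    final int serialNumber;     // identifies the logical bitstream
    final int sequenceNumber;   // page counter within that bitstream
    final int payloadLength;    // sum of the lacing values

    private OggPageHeader(int headerType, long granulePosition, int serialNumber,
                          int sequenceNumber, int payloadLength) {
        this.headerType = headerType;
        this.granulePosition = granulePosition;
        this.serialNumber = serialNumber;
        this.sequenceNumber = sequenceNumber;
        this.payloadLength = payloadLength;
    }

    boolean isContinuation() { return (headerType & 0x01) != 0; }
    boolean isBeginningOfStream() { return (headerType & 0x02) != 0; }
    boolean isEndOfStream() { return (headerType & 0x04) != 0; }

    // Reads the 27-byte fixed header and the segment table; the payload
    // itself is left unread on the stream.
    static OggPageHeader read(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in);
        byte[] fixed = new byte[27];
        din.readFully(fixed);
        if (fixed[0] != 'O' || fixed[1] != 'g' || fixed[2] != 'g' || fixed[3] != 'S') {
            throw new IOException("Missing OggS capture pattern");
        }
        if (fixed[4] != 0) {
            throw new IOException("Unsupported Ogg stream structure version");
        }
        // Multi-byte header fields are little-endian.
        ByteBuffer bb = ByteBuffer.wrap(fixed).order(ByteOrder.LITTLE_ENDIAN);
        int headerType = fixed[5] & 0xFF;
        long granule = bb.getLong(6);
        int serial = bb.getInt(14);
        int sequence = bb.getInt(18);
        // Bytes 22-25 hold the CRC, which this sketch does not check.
        int segments = fixed[26] & 0xFF;
        byte[] lacing = new byte[segments];
        din.readFully(lacing);
        int payload = 0;
        for (byte l : lacing) {
            payload += l & 0xFF;
        }
        return new OggPageHeader(headerType, granule, serial, sequence, payload);
    }

    // Convenience wrapper for in-memory data.
    static OggPageHeader parse(byte[] bytes) {
        try {
            return read(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A demultiplexing filter could key a map of codec-specific filters on
serialNumber and feed each page's payload to the matching entry.
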
> The Ogg codecs which I will initially add support for are, in order, Vorbis,
> Theora, and FLAC.
>
>
> More content filters
>
> The more filters the better. In the time remaining, I will implement as many
> different content filters as possible. While this step is very important,
> these codecs individually are of a lower priority than previous steps. I will
> implement ATOM/RSS, mp3, and the rudiments of pdf.
>
>
>
> Milestones
> Here are clear milestones which may be used to evaluate my performance. The
> following is a list of the goals which should be met to signify
> completion, along with very rough estimates as to how long each step should
> take:
>
> *Stream based filters (3 days)
> *Filters are moved to the client layer, with (disableable) support for
> filtering files going to the hard drive, and inserts (9 days)
> *Filters can be tested on data, without inserting it into the network (3
> days)
> *Compressors can be interacted with through streams (4 days)
> *An Ogg content filter is implemented, supporting the following codecs: (3
> days)
>   -The Vorbis codec (2 days)
>   -The Theora codec (2 days)
>   -The FLAC codec (2 days)
> *Content filters for ATOM/RSS are implemented (5 days)
> *A content filter for MP3 is implemented (6 days)
> *A basic content filter for pdf is implemented (Remaining time)
>
>
>
> Biography
> I initially became interested in Freenet because I am something of a
> cypherpunk, in that I believe the ability to hold pseudonymous discourse to
> be a major cornerstone of free speech and the free flow of information. I've
> skulked around Freenet occasionally, even helping pre-alpha test version
> 0.7. But I'd like to do more. I want to put my time and energy where my
> mouth is and spend my summer making the world, in some small way, safer for
> freedom.
> Starry-eyed idealism aside, I am an 18 year old American high school senior,
> who will be studying Computer Science after I graduate. While C/C++ is my
> 'first language', so to speak, I am also fluent in Java and Python. Last
> year, I personally rewrote my high school's web page in Python and Django.
> This year, I've been working on an editor for Model United Nations
> resolutions, as time permits. This project is licensed under the GPLv3, and
> is available on GitHub, at http://github.com/spencerjackson/resolute. It's
> written in C++, and uses GTKmm for the GUI.
>
>
> _______________________________________________
> Devl mailing list
> Devl at freenetproject.org
> http://osprey.vm.bytemark.co.uk/cgi-bin/mailman/listinfo/devl
>


IMHO this looks good.

My one concern is that your suggested timeline looks aggressive.  It
looks to me more like a timeline for writing the code, as opposed to a
timeline for writing the code, documenting it, writing unit tests, and
debugging it.  I know that writing copious documentation and unit
tests as we go isn't how Freenet normally does things, but it would be
nice to improve on that standard :)  I think adding 1 day's worth of
documentation and unit tests after each of your listed steps would
make a meaningful improvement to the resultant body of work.  Of
course, others might disagree, and it's not a big concern.  Like I
said, this looks good.

Evan Daniel
