Hi

I've been hanging out on IRC as of late under the nick 'sajack'. I'm going
to be submitting an application to work with Freenet during Google Summer of
Code. Anyway, here's the proposal I'm going to be uploading, if anyone has
any thoughts. Thanks for looking at it.



Proposal: Improve Implementation and Functionality of Content Filtration and
Add Support for Additional Formats
Proposer: Spencer Jackson (sajack)

Introduction
Content is an important part of the Freenet experience. Good, plentiful
content attracts users, which attracts donations and creates more nodes,
both of which, directly or indirectly, improve performance and security of
the network. As such, to make Freenet better, we must make the process of
getting information from the network to the user quick, easy, and safe. I am
proposing a series of changes to the ContentFilter and adjacent systems so
as to realize this. Below are the general steps I will take.


Modify content filters to act as streams

Presently, Freenet's data filters are passed Bucket objects containing all
of the data they need to process. This is suboptimal. Ideally, the data
filters should have a stream interface. This will reduce duplication of data
and increase performance by removing the need for vast amounts of disk I/O,
as less data will need to be cached on disk. This will be very easy to
implement, as most of the filters deal with streams internally.
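To illustrate the idea, here is a minimal sketch of what a stream-oriented
filter might look like (the interface and class names are hypothetical, not
Freenet's actual API): data flows from input to output in small chunks
rather than being materialized in a Bucket first.

```java
import java.io.*;

// Hypothetical stream-oriented filter interface (illustrative names):
// data is processed as it flows through, without full buffering.
interface StreamContentFilter {
    // Read raw data from input, write sanitized data to output.
    void filter(InputStream input, OutputStream output) throws IOException;
}

// Trivial example filter that passes data through unchanged; a real
// filter would validate and rewrite the bytes as they stream past.
class PassThroughFilter implements StreamContentFilter {
    public void filter(InputStream input, OutputStream output) throws IOException {
        byte[] buf = new byte[4096];
        int n;
        while ((n = input.read(buf)) != -1) {
            output.write(buf, 0, n);
        }
    }
}
```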

Right now, filters are a part of FProxy, and are invoked by it when a file
is downloaded. Most clients, though, probably want filtered data, so
filtration should be done earlier, with FProxy simply calling into that
shared functionality. I will therefore move filters into the client layer,
and invoke them there.

While I'm working in this area, it would be useful to add some new
functionality. First off, I'll add the ability to filter files being saved
to the hard drive. Right now, this doesn't happen, and it's something of a
weak spot in our armor. Later on especially, once there are Ogg filters,
users may be downloading large files directly to their hard drive. We will
want them to be filtered.

Another thing I will do while working with filters in the client layer is
implement filtration of inserts. This will help prevent metadata in a file
uploaded by the user from breaking his or her anonymity. For example, EXIF
data in JPEGs may reveal the serial number of the camera which took the
picture, or even the GPS coordinates of where the picture was taken.
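As a sketch of the kind of sanitization an insert filter would do (this is
illustrative, not Freenet's actual JPEG filter), the following copies a
JPEG stream while dropping APP1 segments, which is where EXIF metadata such
as camera serial numbers and GPS coordinates lives:

```java
import java.io.*;

// Illustrative sketch: strip APP1 (EXIF/XMP) segments from a JPEG stream.
class ExifStripper {
    static void strip(InputStream in, OutputStream out) throws IOException {
        DataInputStream din = new DataInputStream(in);
        // A JPEG begins with the SOI marker 0xFFD8.
        if (din.readUnsignedShort() != 0xFFD8)
            throw new IOException("Not a JPEG");
        out.write(0xFF); out.write(0xD8);
        while (true) {
            int marker = din.readUnsignedShort();
            if (marker == 0xFFDA) { // SOS: entropy-coded image data follows
                out.write(0xFF); out.write(0xDA);
                byte[] buf = new byte[4096];
                int n;
                while ((n = din.read(buf)) != -1) out.write(buf, 0, n);
                return;
            }
            // Segment length includes its own two bytes.
            int len = din.readUnsignedShort();
            byte[] payload = new byte[len - 2];
            din.readFully(payload);
            if (marker == 0xFFE1) continue; // drop APP1 (EXIF) entirely
            out.write(marker >> 8); out.write(marker & 0xFF);
            out.write(len >> 8); out.write(len & 0xFF);
            out.write(payload);
        }
    }
}
```

A production filter would also need to handle fill bytes and decide which
of the other segment types to keep, but the principle is the same.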

Of course, there are some use cases, such as debugging, where it may be
undesirable for a request to be filtered. It must therefore be
possible to disable filtering. To accomplish this, I will prevent the data
from being filtered when a configuration setting in the request's context
has been set. Support for disabling filters will need to be added to FCP.
All of this will then need to be supported in the web interface. I will add
support for complementary GET and POST variables in FProxy which would be
used to trigger this setting. Next, I'll add UI elements to the download and
insert queue pages and any other pertinent locations, such as the
'Downloading a page' page, which would enable these variables. These
elements should only be visible when the user is in 'Advanced mode,' and,
even then, should be tagged with a Big Fat Warning about the risks of
turning off filtering.

Another feature I will implement is the ability to run data through a filter
without placing it on the network. This would be useful for debugging
content filters, and for freesite writers who want to see what their site
will look like after it's been parsed. This should be pretty easy to
implement. I'll create an FCP message which will take data, filter it, and
return it. I will also create a way to do this through FProxy, by uploading
a file, and receiving the filtered version.

The next thing I will implement is stream-friendly Compressors.
Essentially, we should be able to have a filter and a decompressor running
on separate threads, and have data passed between them transparently
using piped streams.
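The piped-stream idea can be sketched with the standard java.io classes (a
toy example, using gzip rather than Freenet's actual compressors): one
thread decompresses into a PipedOutputStream while the consuming side, here
standing in for a filter, reads from the connected PipedInputStream, so
neither thread ever holds the whole file.

```java
import java.io.*;
import java.util.zip.*;

// Sketch: a "decompressor" thread and a "filter" side connected by a pipe.
public class PipedFilterDemo {
    public static String run(byte[] gzipped) throws Exception {
        PipedOutputStream pipeOut = new PipedOutputStream();
        PipedInputStream pipeIn = new PipedInputStream(pipeOut);

        Thread decompressor = new Thread(() -> {
            try (GZIPInputStream gz =
                     new GZIPInputStream(new ByteArrayInputStream(gzipped));
                 OutputStream out = pipeOut) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = gz.read(buf)) != -1) out.write(buf, 0, n);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
        decompressor.start();

        // The "filter" side simply collects the decompressed bytes here.
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = pipeIn.read(buf)) != -1) result.write(buf, 0, n);
        decompressor.join();
        return result.toString("UTF-8");
    }
}
```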


Implementation of Ogg container format

Right now, Freenet has filters for HTML and some forms of image files. More
filters mean more types of content which may be safely viewed by the user.
This will allow the network to be used in ways which are currently not
safe. After I have implemented the new stream-based content filters, I
shall implement more of them.

The first type of filter which I will implement is for the Ogg container
format. This is technically interesting, as it encapsulates other types of
data. A generic Ogg parser will be written, which will need to validate the
Ogg container, identify the bitstreams it contains, identify the codec used
inside these bitstreams, and process the streams using a second (or nth,
really, depending on how many bitstreams are in the container)
codec-specific filter. It should be possible to use this filter to filter
either just the beginning of the file or the whole thing. This will make it
possible, at some point in the future, to preview a partially downloaded
file. One thing which will need to be taken into consideration is the
possibility of Ogg pages being concealed inside other Ogg pages. This will
be checked for, and a fatal error will be raised if it occurs.
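The first steps of that parser can be sketched as follows (a minimal
illustration, not the full filter): every Ogg page opens with the "OggS"
capture pattern, and the little-endian bitstream serial number in the page
header identifies which logical stream, and hence which codec-specific
filter, the payload belongs to.

```java
import java.io.*;

// Minimal sketch of Ogg page-header validation.
class OggPageReader {
    // Validates the page header and returns its bitstream serial number.
    static long readPageSerial(DataInputStream in) throws IOException {
        byte[] magic = new byte[4];
        in.readFully(magic);
        if (magic[0] != 'O' || magic[1] != 'g' || magic[2] != 'g' || magic[3] != 'S')
            throw new IOException("Missing OggS capture pattern");
        int version = in.readUnsignedByte();
        if (version != 0)
            throw new IOException("Unknown Ogg version: " + version);
        in.readUnsignedByte(); // header type flags
        in.skipBytes(8);       // granule position
        // The serial number is stored little-endian.
        long serial = 0;
        for (int i = 0; i < 4; i++)
            serial |= ((long) in.readUnsignedByte()) << (8 * i);
        return serial;
    }
}
```

The real filter would go on to read the page sequence number, CRC, and
segment table, and hand each page's payload to the right codec filter.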

The Ogg codecs which I will initially add support for are, in order, Vorbis,
Theora, and FLAC.


More content filters

The more filters the better. In the time remaining, I will implement as
many different content filters as possible. While this step is very
important, these codecs individually are of a lower priority than previous
steps. I will implement Atom/RSS, MP3, and the rudiments of PDF.



Milestones
Here are clear milestones which may be used to evaluate my progress. The
following is a list of goals which should be met to signify completion,
along with very rough estimates as to how long each step should take:

*Stream based filters (3 days)
*Filters are moved to the client layer, with support for (disableable)
filtering of files going to the hard drive, and of inserts (9 days)
*Filters can be tested on data, without inserting it into the network (3
days)
*Compressors can be interacted with through streams (4 days)
*An Ogg content filter is implemented, supporting the following codecs: (3
days)
 -The Vorbis codec (2 days)
 -The Theora codec (2 days)
 -The FLAC codec (2 days)
*Content filters for Atom/RSS are implemented (5 days)
*A content filter for MP3 is implemented (6 days)
*A basic content filter for PDF is implemented (Remaining time)



Biography
I initially became interested in Freenet because I am something of a
cypherpunk, in that I believe the ability to hold pseudonymous discourse to
be a major cornerstone of free speech and the free flow of information. I've
skulked around Freenet occasionally, even helping pre-alpha test version
0.7. But I'd like to do more. I want to put my time and energy where my
mouth is and spend my summer making the world, in some small way, safer for
freedom.
Starry-eyed idealism aside, I am an 18-year-old American high school
senior who will be studying Computer Science after graduation. While C/C++
is my 'first language', so to speak, I am also fluent in Java and Python. Last
year, I personally rewrote my high school's web page in Python and Django.
This year, I've been working on an editor for Model United Nations
resolutions, as time permits. This project is licensed under the GPLv3, and
is available on GitHub, at http://github.com/spencerjackson/resolute. It's
written in C++, and uses GTKmm for the GUI.