On Saturday 06 February 2010 16:39:53 Karol Pysniak wrote:
> Hi, 
> I would like to take part in the development of the Freenet Project as my
> contribution to Google Summer of Code. I am especially interested in
> databases, algorithms and AI. I think that it would be great to try to
> create a new, 'intelligent' search engine.
> 
> I can program in Java/C++/C/Haskell/Assembler(IA-32).

Great. You will need to apply via the Summer of Code web interface. If you want 
to send us your proposal so we can look at it and suggest improvements, please 
do so here.

Before we accept you we will need you to demonstrate some basic coding ability 
by making some small change (bugfix or feature) to Freenet. See the bug tracker 
at https://bugs.freenetproject.org/ for ideas.

You should apply for at least two tasks within Freenet, so that we are able to 
choose both students and projects, and don't have to drop a good student 
because we already have somebody else for that project.

Please read up on Freenet first, or this may not make much sense. The most 
fundamental thing is that Freenet only provides "insert" (publish data to a 
key) and "request" (fetch a key) operations; everything else, including 
searching, is built on top of them. There have been proposals for distributed 
searching, but doing it in a secure and spam-proof way is remarkably difficult, 
so for now we will probably continue to build search on top of inserts and 
requests.
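To make the two primitives concrete, here is a minimal sketch of the kind of 
interface a client program sees. The interface and names are hypothetical, not 
Freenet's actual client API:

import java.io.IOException;

// Hypothetical client interface: everything in Freenet is layered on these
// two operations. Real keys would be CHKs/SSKs/USKs, not plain strings.
public interface FreenetClient {
    /** Publish data under a key ("insert"). */
    void insert(String key, byte[] data) throws IOException;

    /** Fetch the data stored under a key ("request"). */
    byte[] request(String key) throws IOException;
}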

As regards searching, here is the current situation. 
- A spider (XMLSpider plugin) spiders Freenet freesites and generates an index, 
which is inserted into Freenet in the same way as a freesite would be.
- Another plugin downloads the relevant parts of the index when you do a 
search, combines them and displays the matching URIs.
- The old XMLLibrarian plugin implemented a simple XML-based search index 
format. One file lists the sub-indexes (split by the md5 of the word being 
looked up; see the sketch after this list), and there is one file per 
sub-index. Within each sub-index there is a list of URIs, a list of words, and 
a record of which URIs each word appears in. For an example of this format, 
have a look here (install Freenet first):
USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/24/index.xml
USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/24/index_00.xml
- This format is still supported by the new Library plugin, which replaced 
XMLLibrarian for the frontend some time ago.
- Unfortunately there are severe scaling problems with this format: The 
sub-indexes can get huge, and inserting them all as a single freesite also 
causes problems.
- Plus, the spider's architecture, which keeps a database of terms and URIs, 
also doesn't scale: it takes a week or more to write the index out from the 
database. It would be better to maintain the index on the fly, rewriting the 
affected parts every few hours as new keys are spidered.
- We have a new format, created by infinity0, a Summer of Code student last 
year, which should scale much better. It is based on b-trees, so data can be 
loaded into it progressively and only the changed nodes are re-uploaded. 
However, the spider currently uses the old format, so the first task would be 
to make the spider use the new format and load data into it progressively (on 
the fly). Also, once the new format is on Freenet it is forkable: not only can 
the original author of the index add data and re-insert only the affected 
nodes (including their parents), but anyone else can add their own changes, 
which do not affect the original btree and reuse its existing nodes. In other 
words, it is a "copy on write btree" (a sketch follows this list), although I 
don't think it has all the tweaks that the COW b-trees paper describes. This 
may have many wider applications, e.g. merging other people's indexes, and 
maybe eventually distributing the spidering process.
- The other half of infinity0's Summer of Code project was to be distributed 
search. This would allow each user to publish indexes and to link to other 
users' indexes; it is described on the wiki.
- Most likely infinity0 would be your mentor.
- There is limited support for basic page ranking based on word frequencies 
(a toy scoring sketch follows this list). I don't think the current search 
indexes support the metadata needed, but the spider should, if I remember 
correctly.
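
As promised above, here is a sketch of the word-to-sub-index mapping in the 
old XML format. It assumes the sub-index is named after the first two hex 
digits of the word's md5 (as the index_00.xml example suggests); the real 
XMLLibrarian code may use a different prefix length or normalisation.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SubIndexLookup {
    // Map a search term to the sub-index file that would contain it.
    static String subIndexFor(String word) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(
                word.toLowerCase().getBytes(StandardCharsets.UTF_8));
        // Assumed: split on the first two hex digits of the hash,
        // e.g. "index_3a.xml".
        return String.format("index_%02x.xml", digest[0]);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(subIndexFor("freenet"));
    }
}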
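And here is the copy-on-write sketch mentioned above. To keep it short it 
uses a plain binary search tree rather than a real b-tree, but the 
path-copying idea is the same: an insert copies only the nodes between the 
root and the change, so the old root still describes the old tree, both trees 
share all unchanged nodes, and on Freenet only the copied nodes would need to 
be re-inserted. The names are illustrative, not from the Library plugin.

final class CowTree {
    static final class Node {
        final String key;
        final String value;   // e.g. a serialised list of URIs for this term
        final Node left, right;
        Node(String key, String value, Node left, Node right) {
            this.key = key; this.value = value;
            this.left = left; this.right = right;
        }
    }

    // Returns the root of a *new* tree; the tree rooted at 'node' is
    // untouched, and all nodes off the insertion path are shared.
    static Node insert(Node node, String key, String value) {
        if (node == null) return new Node(key, value, null, null);
        int c = key.compareTo(node.key);
        if (c < 0)
            return new Node(node.key, node.value,
                            insert(node.left, key, value), node.right);
        if (c > 0)
            return new Node(node.key, node.value,
                            node.left, insert(node.right, key, value));
        return new Node(key, value, node.left, node.right); // replace value
    }

    public static void main(String[] args) {
        Node v1 = insert(null, "freenet", "URI-1");
        Node v2 = insert(v1, "search", "URI-2"); // fork: v1 is still valid
        System.out.println(v1.right == null);    // true: old tree unchanged
        System.out.println(v2.right.key);        // "search"
    }
}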
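Finally, the toy word-frequency scoring sketch mentioned above. It assumes 
the index records per-URI term counts and document lengths, which is exactly 
the metadata in question; none of these names come from the actual plugins.

import java.util.*;

public class FrequencyRank {
    // Rank URIs for one query term by term frequency: occurrences of the
    // term divided by the document's total word count.
    static List<String> rank(Map<String, Integer> termCounts,
                             Map<String, Integer> docLengths) {
        List<String> uris = new ArrayList<>(termCounts.keySet());
        uris.sort(Comparator.comparingDouble((String uri) ->
                -(double) termCounts.get(uri) / docLengths.get(uri)));
        return uris;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("uri-a", 3, "uri-b", 10);
        Map<String, Integer> lengths = Map.of("uri-a", 100, "uri-b", 2000);
        System.out.println(rank(counts, lengths)); // [uri-a, uri-b]
    }
}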
You should read:
http://new-wiki.freenetproject.org/Library
http://new-wiki.freenetproject.org/B-tree_indexes
http://new-wiki.freenetproject.org/Web_of_Trust
And of course some of the papers/videos/introductory stuff on our web page.