On Saturday 06 February 2010 16:39:53 Karol Pysniak wrote:
> Hi,
> I would like to take part in the development of the Freenet Project as my
> contribution to Google Summer of Code. I am especially interested in
> databases, algorithms and AI. I think it would be great to try to
> create a new, 'intelligent' search engine.
>
> I can program in Java/C++/C/Haskell/Assembler (IA-32).
Great. You will need to apply via the Summer of Code web interface. If you want to send us your proposal so we can look at it and suggest improvements, please do so here. Before we accept you, we will need you to demonstrate some basic coding ability by making a small change (bugfix or feature) to Freenet; see the bug tracker at https://bugs.freenetproject.org/ for ideas. You should apply for at least two tasks within Freenet, so that we are able to choose both students and projects, and don't have to drop a good student because we already have somebody else for that project.

Please read up on Freenet first or this may not make much sense. The most fundamental thing is that Freenet only provides "insert" (publish data to a key) and "request" (fetch a key) operations; everything else, including searching, is built on top of them. There have been proposals for distributed searching, but doing it in a secure and spam-proof way is remarkably difficult, so for now we will probably continue to build search on top of inserts and requests.
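To make that concrete, here is a minimal Java sketch of the two primitives. The KeyValueNetwork interface and FakeNetwork class are hypothetical, purely for illustration; the real node exposes insert and request through FCP and the plugin API, with key types, priorities and callbacks that this sketch ignores:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical model of Freenet's two primitives; not the real API. */
    interface KeyValueNetwork {
        /** Publish data under a key (a CHK, SSK or USK in real Freenet). */
        void insert(String key, byte[] data);
        /** Fetch the data published under a key, or null if nothing is found. */
        byte[] request(String key);
    }

    /** In-memory stand-in so the interface can be exercised locally. */
    class FakeNetwork implements KeyValueNetwork {
        private final Map<String, byte[]> store = new ConcurrentHashMap<>();
        public void insert(String key, byte[] data) { store.put(key, data); }
        public byte[] request(String key) { return store.get(key); }
    }

    public class TwoPrimitives {
        public static void main(String[] args) {
            KeyValueNetwork net = new FakeNetwork();
            net.insert("KSK@hello", "hello world".getBytes());
            System.out.println(new String(net.request("KSK@hello")));
        }
    }

Everything discussed below - freesites, the spider, the search indexes - ultimately reduces to sequences of these two operations.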
As regards searching, here is the current situation:

- A spider (the XMLSpider plugin) crawls Freenet freesites and generates an index, which is inserted into Freenet in the same way as a freesite would be.
- Another plugin downloads the relevant parts of the index when you do a search, combines them, and displays the matching URIs.
- The old XMLLibrarian plugin implemented a simple XML-based search index format. One file contains the list of sub-indexes (split by the MD5 hash of the word being looked up), and then there is one file for each sub-index. Within each sub-index, we have a list of URIs, a list of words, and which URIs each word is contained in. (There is a sketch of the sub-index lookup after this list.) For an example of this format, please have a look here (install Freenet first):
  USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/24/index.xml
  USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/24/index_00.xml
- This format is still supported by the new Library plugin, which replaced XMLLibrarian as the frontend some time ago.
- Unfortunately there are severe scaling problems with this format: the sub-indexes can get huge, and inserting them all as a single freesite also causes problems.
- The spider's architecture, involving a database of terms and URIs, doesn't scale either: it takes a week or more to write the index out from the database. It would be better to maintain the index on the fly, rewriting affected parts every few hours as new keys are spidered.
- We have a new format, created by infinity0, a Summer of Code student last year, which should scale much better. It is based on b-trees, so data can be loaded into it progressively and just the changed nodes are re-uploaded. However, the spider currently uses the old format, so the first task would be to make the spider use the new format and load data into it progressively (on the fly). Also, once the new format is on Freenet it is forkable: not only can the original author of the index add data and re-insert only the affected nodes (including their parents), but anyone else can add their own changes, which do not affect the original b-tree, while reusing the existing nodes. In other words, it is a "copy-on-write b-tree", although I don't think it has all the tweaks that the COW b-trees paper describes. (There is a sketch of the path-copying idea after this list.) This may have many wider applications, e.g. merging others' indexes, and maybe eventually distributing the spidering process.
- The other half of infinity0's Summer of Code project was to be distributed search. This would allow each user to publish indexes and to link to other users' indexes; it is described on the wiki.
- Most likely infinity0 would be your mentor.
- There is limited support for basic page ranking based on word frequencies. I don't think the current search indexes support the metadata needed, but the spider should, if I remember correctly. (There is a sketch of this kind of ranking after this list.)
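As promised above, here is a sketch of how a search term is routed to its sub-index in the old XML format. The two-character hex prefix (giving names like index_00.xml) is an assumption for illustration; the real index.xml lists the actual sub-indexes:

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class SubIndexLookup {
        /** Hash a search term with MD5 and map it to a sub-index file name. */
        static String subIndexFor(String word, int prefixLen) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(word.toLowerCase().getBytes(StandardCharsets.UTF_8));
            String hex = String.format("%032x", new BigInteger(1, digest));
            return "index_" + hex.substring(0, prefixLen) + ".xml";
        }

        public static void main(String[] args) throws Exception {
            // Fetch index.xml first, then only the one sub-index file we need.
            System.out.println(subIndexFor("freenet", 2));
        }
    }

The point of the split is that a client never downloads the whole index, only index.xml plus the one sub-index its search term hashes to - which is also why oversized sub-indexes hurt so much.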
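Here is the promised sketch of the copy-on-write idea behind the new format. I use a binary tree for brevity where the real format uses b-tree nodes with many keys each, but the principle is identical: an insert copies only the root-to-leaf path, so only those few nodes need to be re-inserted into Freenet, while untouched subtrees keep their existing CHKs and can be shared between forks:

    /** Immutable tree with path-copying inserts (simplified stand-in for the b-tree). */
    final class CowTree {
        final String key;
        final CowTree left, right;

        CowTree(String key, CowTree left, CowTree right) {
            this.key = key; this.left = left; this.right = right;
        }

        /** Returns a new root; the original tree is untouched and still valid. */
        static CowTree insert(CowTree node, String key) {
            if (node == null) return new CowTree(key, null, null);
            int cmp = key.compareTo(node.key);
            if (cmp < 0) return new CowTree(node.key, insert(node.left, key), node.right);
            if (cmp > 0) return new CowTree(node.key, node.left, insert(node.right, key));
            return node; // already present: share the whole subtree
        }
    }

    public class CowDemo {
        public static void main(String[] args) {
            CowTree v1 = null;
            for (String w : new String[] { "m", "f", "t" }) v1 = CowTree.insert(v1, w);
            CowTree v2 = CowTree.insert(v1, "a"); // a fork adding one word
            System.out.println(v1.right == v2.right); // true: subtree shared, no re-insert
        }
    }

This is also why progressive (on-the-fly) loading is cheap: each batch of newly spidered terms touches a few paths, not the whole index.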
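And a sketch of ranking by word frequency. The helper names are hypothetical; the point is only that the spider would have to record per-URI term counts as metadata, so that the search plugin could sort matching URIs by a score like this:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Pattern;

    public class FrequencyRank {
        private static final Pattern NON_WORD = Pattern.compile("\\W+");

        /** Count how often each word occurs in a page's text (done while spidering). */
        static Map<String, Integer> termFrequencies(String text) {
            Map<String, Integer> tf = new HashMap<>();
            for (String w : NON_WORD.split(text.toLowerCase()))
                if (!w.isEmpty()) tf.merge(w, 1, Integer::sum);
            return tf;
        }

        /** Score a page for a query by summing its query-term frequencies. */
        static int score(Map<String, Integer> tf, List<String> query) {
            int s = 0;
            for (String q : query) s += tf.getOrDefault(q.toLowerCase(), 0);
            return s;
        }

        public static void main(String[] args) {
            Map<String, Integer> page = termFrequencies("Freenet is a free network. Freenet stores data.");
            System.out.println(score(page, Arrays.asList("freenet", "data"))); // 3
        }
    }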
You should read:

http://new-wiki.freenetproject.org/Library
http://new-wiki.freenetproject.org/B-tree_indexes
http://new-wiki.freenetproject.org/Web_of_Trust

And of course some of the papers/videos/introductory stuff on our web page.