On Friday 02 April 2010 13:41:12 Ximin Luo wrote:
> > On Saturday 06 February 2010 16:39:53 Karol Pysniak wrote:
> >> Hi, I would like to take part in the development of Free Net Project as my
> >> contribution to Google Summer of Code. I am especially interested in Data
> >> Bases, Algorithms and AI. I think that it would be great to try to create
> >> new, 'intelligent' search engine.
>
> what do you mean by "intelligent" search engine?

Google is now getting into social search. The WoT stuff might deliver something
similar...

> there is another student, lusha on IRC, lushawang at gmail.com, interested in
> doing a project on search. you two should talk to each other. if we take you
> both, then you will need to be doing two distinct things - google doesn't
> allow students to collaborate on a single proposal, but two related (but
> distinct) proposals are fine.

We did this last year, it worked fine.

> On 04/02/2010 01:11 PM, Matthew Toseland wrote:
> > - Unfortunately there are severe scaling problems with this format: The
> > sub-indexes can get huge, and inserting them all as a single freesite also
> > causes problems.
> > - Plus, the spider's architecture, involving a database of terms and URIs,
> > also doesn't scale. It takes a week or more to write the index from the
> > database. It would be better to maintain the database on the fly, rewriting
> > affected parts every few hours as new keys are spidered.

s/database/index. Maintain the index on the fly. The database then is just a
list of URLs that we have and have not spidered yet.

> the work to do in this area is described in some detail at
> http://new-wiki.freenetproject.org/Talk:Library
>
> you'll need to understand how SkeletonBTreeMap.update() works first; if the
> source code is too unintuitive (async so components are scattered), then ask
> me for an explanation.
>
> > - We have a new format, created by infinity0, a Summer of Code student last
> > year, which should scale much better. This is based on b-trees, and
> > therefore data can be loaded into it progressively and just the changed
> > nodes are re-uploaded. However, currently the spider uses the old format. So
> > the first task would be to make the spider use the new format and load data
> > into it progressively (on the fly).
> > Also, when the new format is on Freenet, it is forkable - meaning that not
> > only can the original author of the index add data and only insert those
> > nodes affected (including their parents), but anyone else can also add
> > their own changes - which do not affect the original btree - and reuse the
> > existing ones. In other words, it is a "copy on write btree", although I
> > don't think it has all the tweaks that the COW btrees paper talks about.
> > This may have many wider applications, e.g. merging others' indexes, and
> > maybe eventually distributing the spidering process.
>
> there is a COW btree paper? do you have a link?

http://en.wikipedia.org/wiki/BTRFS#History

The core data structure of Btrfs, the copy-on-write B-tree, was originally
proposed by IBM researcher Ohad Rodeh at a presentation at USENIX 2007. Rodeh
suggested adding reference counts and certain relaxations to the balancing
algorithms of standard B-trees that would make them suitable for a
high-performance object store with copy-on-write snapshots, yet maintain good
concurrency.

https://www.usenix.org/events/lsf07/tech/rodeh.pdf

(Did I mention that COW btrees are incredibly awesome? ;) )

> > - The other half of infinity0's Summer of Code project was to be
> > distributed search. This would allow each user to publish indexes and to
> > link to other users' indexes; it is described on the wiki.
>
> i'm working on this atm as part of my uni course; i've coded a prototype and
> atm i'm collecting data to test it with (hoop-jumping for dissertation :/). my
> deadline is mid-may so if you/lusha want to pick it up afterwards and work on
> it for GSoC then i'm happy to explain how it works.
>
> however there is plenty of stuff to work on Library already, IMO it would be
> better to get the basics working before trying to graft a more complex system
> onto it.
>
> > - Most likely infinity0 would be your mentor.
>
> that's me, btw :) i'm on irc under that nick.
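
For anyone new to the idea, the copy-on-write behaviour discussed above (only
the affected nodes and their parents are rewritten, and a fork reuses the rest
of the original tree) can be sketched in a few lines of Python. This is only an
illustration under stated assumptions, not the Library plugin's code: the names
Node, insert, contains and all_nodes and the tiny MAX_KEYS capacity are all
hypothetical, nodes live in memory rather than being inserted into Freenet, and
none of the refinements from Rodeh's paper (reference counts, relaxed
balancing) are attempted.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Tuple

MAX_KEYS = 3  # hypothetical, tiny capacity so that splits are easy to see


@dataclass(frozen=True)
class Node:
    """Immutable B-tree node; a leaf has no children."""
    keys: Tuple[str, ...]
    children: Tuple["Node", ...] = ()


def _insert(node, key):
    """Return (node,) or, if the rebuilt node overflowed, (left, sep, right).

    The input node is never modified: every changed node is a fresh copy.
    """
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return (node,)  # key already present: reuse the node as-is
    if not node.children:  # leaf: copy the keys with the new one added
        keys = node.keys[:i] + (key,) + node.keys[i:]
        children = ()
    else:  # internal node: recurse, then copy only this node
        result = _insert(node.children[i], key)
        if len(result) == 1:
            if result[0] is node.children[i]:
                return (node,)  # nothing changed below
            keys = node.keys
            children = node.children[:i] + result + node.children[i + 1:]
        else:
            left, sep, right = result
            keys = node.keys[:i] + (sep,) + node.keys[i:]
            children = node.children[:i] + (left, right) + node.children[i + 1:]
    if len(keys) <= MAX_KEYS:
        return (Node(keys, children),)
    mid = len(keys) // 2  # overflow: split around the middle key
    left = Node(keys[:mid], children[:mid + 1])
    right = Node(keys[mid + 1:], children[mid + 1:])
    return (left, keys[mid], right)


def insert(root, key):
    """Return the root of a new tree containing key; root stays valid."""
    result = _insert(root, key)
    if len(result) == 1:
        return result[0]
    left, sep, right = result
    return Node((sep,), (left, right))


def contains(node, key):
    """Ordinary B-tree lookup."""
    while True:
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False
        node = node.children[i]


def all_nodes(node):
    """Identities of every node object reachable from node."""
    ids = {id(node)}
    for child in node.children:
        ids |= all_nodes(child)
    return ids


# The "original author" builds an index of terms.
original = Node(())
for term in ["anonymity", "btree", "darknet", "freesite",
             "index", "opennet", "routing", "spider"]:
    original = insert(original, term)

# Someone else forks it by adding a term; the original tree is untouched.
fork = insert(original, "library")
new_nodes = all_nodes(fork) - all_nodes(original)
shared_nodes = all_nodes(fork) & all_nodes(original)
print(len(new_nodes), "new nodes,", len(shared_nodes), "reused from the original")
```

In this toy run the fork consists of a new leaf plus a new root, and the other
two leaves are shared with the original; on Freenet the analogue would be
inserting only those new nodes and pointing at the existing keys for the rest.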
>
> > You should read:
> > http://new-wiki.freenetproject.org/Library
> > http://new-wiki.freenetproject.org/B-tree_indexes
> > http://new-wiki.freenetproject.org/Web_of_Trust
> > And of course some of the papers/videos/introductory stuff on our web page.
>
> source code is at
> http://github.com/infinity0/plugin-Library-staging
>
> there are some notes in ./doc/ and ./TODO - you can pick things out of that.
> if anything is unclear, ask me.
