Hi Kyriakos (and everyone in nutch-dev), As Doug mentioned, I have been working on our distributed Web database project. I just committed a lot of the code on Friday.
The WebDB is our custom database that we use to keep track of all known content and URLs. Until now, this database could only run on a single machine, which has been a real bottleneck for us, keeping us from moving past 200M docs. I think the new code should allow us to scale the WebDB to an arbitrary number of machines (and hence, pages). However, we haven't really exercised it yet. I've tested it on my laptop, but that's not much of a test.

It would be great for more people to take a look at this code. [Note this goes for the CMU folks as well as all Nutch contributors.] You want to look in src/java/net/nutch/db. The uniprocessor versions are WebDBWriter and WebDBReader; the new ones are DistributedWebDBWriter and DistributedWebDBReader.

The new code relies on a makeshift distributed filesystem I wrote, called NutchFS. You can find the support code for NutchFS in src/java/net/nutch/util. This isn't a "real" fs. It's more like a very simple file namespace that exists across multiple machines, plus some mechanisms for copying files to the machines that request files within that namespace. It allows the DistributedWebDB code to assume that all machines mount the same disk, even though this might not be true.

There's no authoritative documentation yet on how the WebDB works. Sorry; I'll write it up when I can.

For people who are interested in Nutch's distributed-computing angles, I would say the following might be interesting topics:

-- Studying and improving the DistributedWebDB code.

-- Making NutchFS into a more "real" system. Maybe replace NutchFS entirely, but keep the interface to DistributedWebDB. I think the big search engines benefit a lot from in-house distributed filesystems that make every project easier to complete.
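To give a feel for the namespace idea, here is a minimal sketch in Java of what a NutchFS-style interface could look like: logical names are machine-independent, and a file is copied to the requesting machine on first access. This is purely illustrative — the class and method names (NutchFSSketch, put, get) are invented for this sketch and are not the actual API in src/java/net/nutch/util.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a tiny file namespace that spans machines in
// spirit, with on-demand copying, so callers can pretend every machine
// mounts the same disk. Not the real NutchFS code.
public class NutchFSSketch {
    // Maps machine-independent logical names to their source locations.
    private final Map<String, Path> namespace = new HashMap<>();
    // This machine's private area for local copies.
    private final Path localDir;

    public NutchFSSketch(Path localDir) {
        this.localDir = localDir;
    }

    // Publish a file under a logical name visible to all machines.
    public void put(String logicalName, Path source) {
        namespace.put(logicalName, source);
    }

    // Resolve a logical name to a local path, copying the file in on
    // first access; later calls hit the existing local copy.
    public Path get(String logicalName) throws IOException {
        Path local = localDir.resolve(logicalName);
        if (!Files.exists(local)) {
            Files.copy(namespace.get(logicalName), local);
        }
        return local;
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("webdb", ".seg");
        Files.writeString(source, "webdb segment data");

        NutchFSSketch fs = new NutchFSSketch(Files.createTempDirectory("nutchfs"));
        fs.put("segment0", source);

        // The caller never needs to know which machine held the original.
        System.out.println(Files.readString(fs.get("segment0")));
        // prints "webdb segment data"
    }
}
```

The point of keeping the interface this narrow is exactly what's suggested above: a better distributed filesystem could replace the implementation later without disturbing the DistributedWebDB code that calls it.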
Such a filesystem also makes system administration easier, which is critical for these massively-distributed systems.

-- Possibly taking a look at the scoring algorithms we're using and making them faster and/or more effective in distributed computations.

-- Getting some solid performance numbers for a real multi-machine deployment.

Thanks,
--Mike

On Mon, 2004-01-26 at 16:09, Doug Cutting wrote:
> Kyriakos T. Fourniadis <Carnegie Mellon University> wrote:
> > We are a team of four Carnegie Mellon graduate students, from the
> > Information Networking Institute, the Computer Science Department, and
> > the Electrical & Computer Engineering Department of Carnegie Mellon
> > University, and we are interested in contributing to the Nutch project.
> > We find your work inspirational, so we would like to do our share:
> > apply and extend it in our distributed systems project. This project is
> > completely open source, and any modifications to Nutch will be
> > available to you.
>
> That's wonderful news. Welcome to the project!
>
> > *So what is this project about:*
> > In essence, we are interested in deploying your web crawler
> > successfully, so we are going to focus on efficient load balancing as
> > well as effective (& meaningful) web crawling, along with security on
> > the server side. This will be a complete project from the top down, and
> > it will be deployed by the end of April. We are also considering
> > variations of the different channels through which Nutch can be
> > distributed and used.
> >
> > *Why are we contacting you:*
> > We are interested in your recommendations, comments, concerns, or
> > whatever would help. Furthermore, if you have any good ideas, or
> > sections that might need work...
>
> The best thing to do is join the nutch-developers list and post your
> experiences and questions there. If you have code contributions, these
> can either be posted to the list or attached to bug reports. Patch
> files generated with 'cvs diff -Nu' are preferred.
>
> Mike Cafarella is about to commit an extensive change to the web db
> code, a critical component of our crawler. Currently the web db is
> single-threaded, but with Mike's changes it will be distributed. This
> should enable the crawler to efficiently scale to billions of pages.
> Perhaps you can help Mike to further develop, debug and tune these
> changes.
>
> First, I recommend downloading the existing code and working through
> the tutorial. Then start asking questions.
>
> Cheers,
>
> Doug

-------------------------------------------------------
The SF.Net email is sponsored by EclipseCon 2004
Premiere Conference on Open Tools Development and Integration
See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
http://www.eclipsecon.org/osdn
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
