Re: [Nutch-dev] Experience with a big index

2004-11-30 Thread sg
> I would like to give my small contribution too. Great! So the only question is whether the boxes are still available and whether Doug gives an OK. Stefan

Re: [Nutch-dev] Experience with a big index

2004-11-30 Thread Antonio Gulli
[EMAIL PROTECTED] wrote: Antonio, Am I missing something? Mike mentioned that the index was highly customized for a named entity extraction task. The WebDB and the WebGraph contained in it are still a gold mine. As I remember, there was an offer by archive.org to use some boxes there. @Doug does t

Re: [Nutch-dev] Experience with a big index

2004-11-30 Thread sg
Antonio, Am I missing something? Mike mentioned that the index was highly customized for a named entity extraction task. As I remember, there was an offer by archive.org to use some boxes there. @Doug does this offer still exist? I would love to offer to set up Nutch on these boxes in case another pe

Re: [Nutch-dev] Experience with a big index

2004-11-30 Thread Andrzej Bialecki
Michael Cafarella wrote: Andrzej, I think you make an excellent point here: On Mon, 2004-11-29 at 07:48, Andrzej Bialecki wrote: That was also the general idea of my ramblings about modifying NDFS so More on that in the thread about a month ago, titled "NDFS, DistributedSearch - redundant dep

Re: [Nutch-dev] Experience with a big index

2004-11-30 Thread Antonio Gulli
Hi Mike, you are doing a great job and I am really impatient to read your paper as soon as it is published. A question: do you think that this big index could be made available to the research community? It is a gold mine. The largest dataset was made by Stanford in 2001 and it is outdated. It would

Re: [Nutch-dev] Experience with a big index

2004-11-29 Thread Michael Cafarella
Andrzej, I think you make an excellent point here: On Mon, 2004-11-29 at 07:48, Andrzej Bialecki wrote: > That was also the general idea of my ramblings about modifying NDFS so > that it works with data blocks that make sense for all parts of the > system, i.e. with segment slices. Then yo

RE: [Nutch-dev] Experience with a big index

2004-11-29 Thread Nick Lothian
> One of the key points, for me at least, though not really the focus of the MapReduce paper was that their job distribution and workload management system works together closely with their filesystem[2]. By doing this, they can distribute jobs in such a way so that most if no

Re: [Nutch-dev] Experience with a big index

2004-11-29 Thread Andrzej Bialecki
Luke Baker wrote: I know everyone working on Nutch probably gets sick of hearing people say, "Do it like Google." However, it seems like we could certainly borrow some ideas from them regarding this as well. I'm primarily thinking of their implementation called MapReduce. Take a look at their

Re: [Nutch-dev] Experience with a big index

2004-11-29 Thread Luke Baker
On 11/28/2004 08:33 PM, Michael Cafarella wrote: [snip] 3) Job distribution and workload management is a really big problem that Nutch currently leaves to the end user. This was a lot of work for me, and I understand Nutch pretty well. It would be very hard for someone who doesn't know as many

Re: [Nutch-dev] Experience with a big index

2004-11-28 Thread Yousef Ourabi
Hey, Congrats on such a big crawl! Could you share the steps you took to distribute your crawl over the 35 nodes? Were there any issues? Do you know how Nutch manages the fetchlist in such a setup? Thanks, and awesome job! Yousef

Re: [Nutch-dev] Experience with a big index

2004-11-28 Thread Michael Cafarella
Hi John, Thanks for the note. 1) The whole project lasted several months, not all of them full-time. Most of the work was centered on my Lucene mods and experiments based on them. Computationally, crawling took maybe 3-4 days, and WebDB updates took about the same. 2) The paper has be

Re: [Nutch-dev] Experience with a big index

2004-11-28 Thread John X
Hi, Mike, It's a great report. A few questions: (1) how long did it take you to do the whole thing? (2) where can we get a copy of your paper(s)? (3) besides text/html, what other MIME types have you crawled/parsed? (4) could you elaborate a bit more on your Lucene extension? That's for now, I mi

[Nutch-dev] Experience with a big index

2004-11-28 Thread Michael Cafarella
Hi everyone, A few weeks ago I completed a research project that involved building a 50-100m page Nutch crawl. I've been working on Nutch as a programmer for two years (!) now, but this was my first stab at such a large index. I thought I would write up my experience in case people find it