> I would like to give my small contribution too.
Great!
So the only question is whether the boxes are still available and whether Doug gives the OK.
Stefan
[EMAIL PROTECTED] wrote:
Antonio,
Am I missing something? Mike mentioned that the index was highly customized for a
named entity extraction task.
The WebDB and the WebGraph contained in it are still a gold mine.
As I remember, there was an offer by archive.org to use some boxes there.
@Doug, does this offer still exist?
Antonio,
Am I missing something? Mike mentioned that the index was highly customized for a
named entity extraction task.
As I remember, there was an offer by archive.org to use some boxes there.
@Doug, does this offer still exist?
I would love to offer to set up Nutch on these boxes in case another pe…
Michael Cafarella wrote:
Andrzej,
I think you make an excellent point here:
On Mon, 2004-11-29 at 07:48, Andrzej Bialecki wrote:
That was also the general idea of my ramblings about modifying NDFS so…
More on that in the thread about a month ago, titled "NDFS,
DistributedSearch - redundant dep…"
Hi Mike,
you are doing a great job, and I am really impatient to read your paper as
soon as it is published.
A question: do you think this big index could be made available to the
research community?
It is a gold mine. The largest dataset was made by Stanford in 2001, and
it is outdated.
It would…
Andrzej,
I think you make an excellent point here:
On Mon, 2004-11-29 at 07:48, Andrzej Bialecki wrote:
> That was also the general idea of my ramblings about modifying NDFS so
> that it works with data blocks that make sense for all parts of the
> system, i.e. with segment slices. Then yo…
>
> One of the key points, for me at least, though not really the focus of
> the MapReduce paper, was that their job distribution and workload
> management system works together closely with their filesystem [2]. By
> doing this, they can distribute jobs in such a way that most if no…
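To make the locality point concrete, here is a minimal sketch (hypothetical names, not actual NDFS or MapReduce code) of a scheduler that places each task on a node already holding a replica of its input block, falling back to the least-loaded node otherwise:

    // Hedged sketch, not real NDFS/MapReduce code: assign each task to a
    // node that already stores a replica of its input block, so most
    // reads stay local; fall back to the least-loaded node otherwise.
    import java.util.*;

    public class LocalityScheduler {
        /** replicas maps a block id to the nodes holding a copy of it. */
        public static Map<String, String> assign(
                Map<String, List<String>> replicas, List<String> nodes) {
            Map<String, Integer> load = new HashMap<String, Integer>();
            for (String n : nodes) load.put(n, 0);
            Map<String, String> plan = new HashMap<String, String>();
            for (Map.Entry<String, List<String>> e : replicas.entrySet()) {
                String best = null;
                for (String n : e.getValue())     // prefer a replica holder
                    if (load.containsKey(n)
                            && (best == null || load.get(n) < load.get(best)))
                        best = n;
                if (best == null)                 // no replica: least loaded
                    for (String n : nodes)
                        if (best == null || load.get(n) < load.get(best))
                            best = n;
                plan.put(e.getKey(), best);
                load.put(best, load.get(best) + 1);
            }
            return plan;
        }
    }

The only real requirement is that the scheduler can see the filesystem's block-to-node map, which is exactly the coupling described above.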
Luke Baker wrote:
I know everyone working on Nutch probably gets sick of hearing people
say, "Do it like Google." However, it seems like we could certainly
borrow some ideas from them regarding this as well. I'm primarily
thinking of their implementation, called MapReduce. Take a look at their…
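The programming model in that paper is small enough to sketch. A toy, single-machine version in Java (the interfaces below are hypothetical, not Google's API or Nutch's), doing a word count:

    // Toy single-machine sketch of the MapReduce model (hypothetical
    // interfaces, not Google's code): map emits (key, value) pairs, the
    // framework groups by key, and reduce folds each group.
    import java.util.*;

    public class ToyMapReduce {
        interface Mapper  { void map(String key, String val, Map<String, List<String>> out); }
        interface Reducer { String reduce(String key, List<String> vals); }

        static Map<String, String> run(Map<String, String> input, Mapper m, Reducer r) {
            Map<String, List<String>> grouped = new TreeMap<String, List<String>>();
            for (Map.Entry<String, String> e : input.entrySet())
                m.map(e.getKey(), e.getValue(), grouped);                 // map phase
            Map<String, String> out = new TreeMap<String, String>();
            for (Map.Entry<String, List<String>> e : grouped.entrySet())
                out.put(e.getKey(), r.reduce(e.getKey(), e.getValue())); // reduce phase
            return out;
        }

        public static void main(String[] args) {
            Map<String, String> docs = new HashMap<String, String>();
            docs.put("d1", "the quick brown fox");
            docs.put("d2", "the lazy brown dog");
            Mapper wordMap = (k, v, out) -> {
                for (String w : v.split("\\s+")) {
                    if (!out.containsKey(w)) out.put(w, new ArrayList<String>());
                    out.get(w).add("1");                                  // emit (word, "1")
                }
            };
            Reducer sum = (k, vals) -> Integer.toString(vals.size());
            System.out.println(run(docs, wordMap, sum));                  // {brown=2, dog=1, ...}
        }
    }

In the real system the map and reduce tasks run on many machines and the grouping step becomes the distributed shuffle, which is where the scheduler/filesystem coupling discussed above comes in.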
On 11/28/2004 08:33 PM, Michael Cafarella wrote:
[snip]
3) Job distribution and workload management is a really big problem
that Nutch currently leaves to the end user. This was a lot of work for
me, and I understand Nutch pretty well. It would be very hard for
someone who doesn't know as many…
Hey,
Congrats on such a big crawl! Could you share the steps
you took to distribute your crawl over the 35 nodes?
Were there any issues? Do you know how Nutch manages
the fetchlist in such a setup?
Thanks, and awesome job!
Yousef
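On the fetchlist question, one common approach (sketched below with a hypothetical helper, not the actual Nutch classes) is to partition URLs by host, so a single node owns all URLs for a given host and can enforce per-host politeness delays locally:

    // Hedged sketch (hypothetical helper, not Nutch's actual code):
    // route each URL to a node by hashing its host, so one node owns
    // all of a host's URLs and can enforce politeness delays itself.
    import java.net.MalformedURLException;
    import java.net.URL;

    public class FetchlistPartitioner {
        /** Returns the index of the node that should fetch this URL. */
        public static int nodeFor(String url, int numNodes)
                throws MalformedURLException {
            String host = new URL(url).getHost().toLowerCase();
            // Mask the sign bit; Math.abs(Integer.MIN_VALUE) is negative.
            return (host.hashCode() & Integer.MAX_VALUE) % numNodes;
        }

        public static void main(String[] args) throws MalformedURLException {
            String[] urls = { "http://example.com/a", "http://example.com/b",
                              "http://archive.org/x" };
            for (String u : urls)         // 35 nodes, as in the crawl above
                System.out.println(u + " -> node " + nodeFor(u, 35));
        }
    }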
Hi John,
Thanks for the note.
1) The whole project lasted several months, not all of them
full-time. Most of the work was centered on my Lucene mods and
experiments based on them. Computationally, crawling took maybe 3-4
days, and WebDB updates took about the same (rough throughput arithmetic below).
2) The paper has be…
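A rough sense of scale from the numbers in point 1, hedged since both figures are approximate: taking the upper bound of 100M pages over 4 days,

    100,000,000 pages / (4 days x 86,400 s/day) ≈ 290 pages/s aggregate
    290 pages/s / 35 nodes ≈ 8 pages/s per node

using the 35-node figure mentioned earlier in the thread.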
Hi, Mike,
It's a great report. A few questions:
(1) How long did the whole thing take you?
(2) Where can we get a copy of your paper(s)?
(3) Besides text/html, what other MIME types have you crawled/parsed?
(4) Could you elaborate a bit more on your Lucene extension?
That's all for now; I mi…
Hi everyone,
A few weeks ago I completed a research project that involved building
a 50-100m page Nutch crawl. I've been working on Nutch as a programmer
for two years (!) now, but this was my first stab at such a large
index. I thought I would write up my experience in case people find it…