Hi,
I am still wondering how the similarity or relevance measure is computed in
Nutch. Google uses LSI, which encompasses SVD.
If someone has some info, please enlighten me.
Thanks
Satmeet
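As far as I can tell, Nutch's relevance ranking comes from Lucene's vector-space (TF-IDF) scoring, not from LSI. For reference, LSI computes a rank-k truncated SVD of the term-document matrix and ranks by cosine similarity in the reduced "concept" space; a minimal sketch of the math, independent of anything Nutch actually does:

A \approx U_k \Sigma_k V_k^{\top}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q, \qquad
\mathrm{sim}(\hat{q}, \hat{d}_j) = \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}

where A is the term-document matrix, q the query's term vector, and \hat{d}_j the j-th row of V_k (document j's coordinates in the concept space).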
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
[
It's committed.
On Mon, Nov 15, 2004 at 11:53:34PM -0800, John X wrote:
> Hi, All,
>
> I have tried this plugin. It is quite useful. Thanks, Mike.
> If no one objects, I will commit it with a few modifications
> late this week.
>
> John
>
> On Tue, Nov 09, 2004 at 08:41:05PM -0800, michael j pan
Selva,
I don't believe that Nutch, as yet, has any capability to deal with
HTTP authentication at all. Nor cookies, which many
authenticated sites require.
If you can find an HTTP proxy that will handle authentication w/o a
browser's intervention, you might want to try running Nutch's craw
OTOH, using the last-modified header would allow the fetcher to do
conditional GETs (which would be a considerable bandwidth saving).
Nick
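To make the conditional-GET point concrete, here is a minimal sketch in plain Java; the class and method names are made up, and this is not how the Nutch fetcher is actually structured. The idea is to remember the Last-Modified value from the previous fetch and send it back as If-Modified-Since; an HTTP 304 reply means the page does not need to be downloaded again.

import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative sketch only; not the Nutch fetcher.
public class ConditionalGet {

    /**
     * Re-fetch a page only if it changed since lastModified (milliseconds
     * since the epoch, saved from the previous fetch). Returns true if new
     * content was downloaded, false if the server answered HTTP 304.
     */
    public static boolean fetchIfModified(String url, long lastModified) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setIfModifiedSince(lastModified);   // sends the If-Modified-Since header
        if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return false;                        // cached copy is still valid
        }
        // ... read conn.getInputStream() and store conn.getLastModified()
        //     for the next fetch cycle ...
        return true;
    }
}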
> -----Original Message-----
> From: Matt Kangas [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 30 November 2004 4:06 PM
> To: [EMAIL PROTECTED]
> Subject: Re: [N
Doug, et al,
It may be useful to index two dates: (1) what is known explicitly from
HTTP and/or the fetch time, and (2) what can be deduced by parsing the
document. Hence, index both Doug's and Matthias' suggestions
separately. These can have different meanings in different contexts,
so just recor
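A sketch of what indexing the two dates as separate fields could look like; it uses a recent Lucene API and made-up field names ("fetchDate", "contentDate") purely for illustration, and is not the actual Nutch indexing code:

import java.util.Date;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

// Illustrative sketch, not Nutch code; field names are hypothetical.
public class DateFields {

    /** Index both dates separately so queries can pick whichever meaning they need. */
    public static void addDates(Document doc, Date httpOrFetchDate, Date parsedDate) {
        doc.add(new StringField("fetchDate",
                DateTools.dateToString(httpOrFetchDate, DateTools.Resolution.SECOND),
                Field.Store.YES));
        if (parsedDate != null) {   // a date deduced from the document body, if any
            doc.add(new StringField("contentDate",
                    DateTools.dateToString(parsedDate, DateTools.Resolution.SECOND),
                    Field.Store.YES));
        }
    }
}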
Andrzej,
I think you make an excellent point here:
On Mon, 2004-11-29 at 07:48, Andrzej Bialecki wrote:
> That was also the general idea of my ramblings about modifying NDFS so
> that it works with data blocks that make sense for all parts of the
> system, i.e. with segment slices. Then yo
I don't see any reference to it in the code.
Every once in a while, I run into sites like:
http://spodzone.org.uk/cesspit.jl
... that seem designed to ensnare crawlers like Nutch. I e-mailed the
website owner because he seems to have made a half-hearted attempt in
the robots.txt file to be nice t
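For illustration only (Nutch's real URL filtering is handled by its filter plugins and configuration), here is the kind of simple heuristic guard a crawler could apply against such traps:

import java.net.URL;

// Illustrative heuristic, not Nutch's URL filter; the limits are arbitrary.
public class TrapGuard {

    private static final int MAX_URL_LENGTH    = 512;
    private static final int MAX_PATH_SEGMENTS = 12;

    /** Reject URLs that look like they come from an endlessly deep generated site. */
    public static boolean looksLikeTrap(String urlString) {
        try {
            if (urlString.length() > MAX_URL_LENGTH) return true;
            String[] segments = new URL(urlString).getPath().split("/");
            if (segments.length > MAX_PATH_SEGMENTS) return true;
            // Repeated identical path segments are a common trap signature.
            for (int i = 1; i < segments.length; i++) {
                if (!segments[i].isEmpty() && segments[i].equals(segments[i - 1])) return true;
            }
            return false;
        } catch (Exception e) {
            return true;    // unparsable URLs get skipped too
        }
    }
}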
The intranet tutorial's crawling section explains how to
crawl a single URL or a list of URLs. How would one
specify a single directory to be crawled periodically?
Does Nutch have the capacity to crawl and index XML
content in the same way that it does HTML?
Thanks for your time.
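On the XML question: I can't say off-hand what parse plugins Nutch ships with, but the core of a text extractor for XML is small; periodic re-crawling is usually just a matter of re-running the crawl from an external scheduler such as cron. A sketch in plain Java (JAXP, Java 5+ DOM); the class is hypothetical and not a Nutch parse plugin:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical example, not a Nutch parse plugin.
public class XmlTextExtractor {

    /** Return the concatenated text content of an XML file, suitable for indexing. */
    public static String extractText(File xmlFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xmlFile);
        // getTextContent() walks the whole tree and concatenates its text nodes.
        return doc.getDocumentElement().getTextContent();
    }
}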
>
> One of the key points for me at least, though not really the focus of
> the MapReduce paper, was that their job distribution and workload
> management system works together closely with their filesystem[2]. By
> doing this, they can distribute jobs in such a way that most if no
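To make the locality point concrete, a toy sketch (neither Nutch nor Google code, just the idea) of assigning a task to a worker that already holds the relevant data block, falling back to a remote read otherwise:

import java.util.Collections;
import java.util.List;
import java.util.Map;

// Toy illustration of data-local task assignment.
public class LocalityScheduler {

    /**
     * blockHosts maps each data block to the workers storing a replica of it.
     * Prefer an idle worker that already has the block, so the task reads the
     * data locally instead of pulling it over the network.
     */
    public static String assignWorker(String blockId,
                                      Map<String, List<String>> blockHosts,
                                      List<String> idleWorkers) {
        List<String> hosts = blockHosts.getOrDefault(blockId, Collections.emptyList());
        for (String worker : idleWorkers) {
            if (hosts.contains(worker)) {
                return worker;                              // data-local assignment
            }
        }
        return idleWorkers.isEmpty() ? null : idleWorkers.get(0);   // remote read fallback
    }
}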
I'm trying to use NDFS in order to have multiple
machines performing my crawl, but I'm having some
problems getting it to work. This is what I'm doing:
First, I launch a namenode on machine #1:
# bin/nutch namenode 9500 ndfs_ns
041129 164742 Server listener on port 9500: starting
041129 164742 S
Luke Baker wrote:
I know everyone working on Nutch probably gets sick of hearing people
say, "Do it like Google." However, it seems like we could certainly
borrow some ideas from them regarding this as well. I'm primarily
thinking of their implementation called MapReduce. Take a look at their
On 11/28/2004 08:33 PM, Michael Cafarella wrote:
[snip]
3) Job distribution and workload management is a really big problem
that Nutch currently leaves to the end user. This was a lot of work for
me, and I understand Nutch pretty well. It would be very hard for
someone who doesn't know as many
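For anyone who has not read the paper: the programming model itself is tiny (a map function emits key/value pairs, a reduce function folds the values for each key); the hard part is the distribution, scheduling, and fault tolerance built around it. A single-machine toy sketch of the canonical word-count example, in plain Java and unrelated to Nutch's internals:

import java.util.HashMap;
import java.util.Map;

// Toy single-machine illustration of the MapReduce word-count example.
public class ToyWordCount {

    /** "map": emit (word, 1) for every word; "reduce": sum the counts per word. */
    public static Map<String, Integer> wordCount(Iterable<String> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {                       // map over input records
            for (String word : doc.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                counts.merge(word, 1, Integer::sum);         // reduce: sum per key
            }
        }
        return counts;
    }
}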