[Nutch-dev] distributed SVD in LSI

2004-11-29 Thread Satmeet
Hi, I am still wondering how the similarity or relevance measure is performed in Nutch? In Google, one uses LSI, which incorporates SVD. If someone has some info, please enlighten me. Thanks, Satmeet -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [

Re: [Nutch-dev] ontology query refinement

2004-11-29 Thread John X
It's committed. On Mon, Nov 15, 2004 at 11:53:34PM -0800, John X wrote: > Hi, All, > > I have tried this plugin. It is quite useful. thanks mike. > If no one objects, I will commit it with a few modifications > late this week. > > John > > On Tue, Nov 09, 2004 at 08:41:05PM -0800, michael j pan

Re: [Nutch-dev] Help Needed on Crawling the Authenticated sites Using Nutch!

2004-11-29 Thread Matt Kangas
Selva, I don't believe that Nutch, as yet, has any capability to deal with HTTP authentication at all. Nor cookies either, which many authenticated sites require. If you can find an HTTP proxy that will handle authentication w/o a browser's intervention, you might want to try running Nutch's craw

RE: [Nutch-dev] last-modified

2004-11-29 Thread Nick Lothian
OTOH, using the last-modified header would allow the fetcher to do conditional gets (which would be a considerable bandwidth saving). Nick > -Original Message- > From: Matt Kangas [mailto:[EMAIL PROTECTED] > Sent: Tuesday, 30 November 2004 4:06 PM > To: [EMAIL PROTECTED] > Subject: Re: [N
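
The conditional GET mentioned above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not Nutch's fetcher code; the helper names `conditional_get_request` and `needs_refetch` are hypothetical. The idea is that the fetcher sends the time of its last fetch in an `If-Modified-Since` header, and a server that honors it answers `304 Not Modified` with no body when the page is unchanged, which is where the bandwidth saving comes from.

```python
from email.utils import formatdate
from urllib.request import Request


def conditional_get_request(url, last_fetch_epoch):
    """Build a GET request that asks the server to send the body only if
    the page changed after last_fetch_epoch (seconds since the Unix epoch).

    Hypothetical helper for illustration; not part of Nutch.
    """
    req = Request(url)
    # HTTP dates are expressed in GMT, e.g. "Mon, 29 Nov 2004 00:00:00 GMT".
    req.add_header("If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True))
    return req


def needs_refetch(status_code):
    """A 304 response means the stored copy is still current; any other
    status is treated here as a reason to process the response."""
    return status_code != 304
```

A crawler would pass the stored fetch time for each URL and skip parsing and indexing whenever `needs_refetch` returns `False`. Servers that ignore the header simply return the full page, so the request stays safe to send.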

Re: [Nutch-dev] last-modified

2004-11-29 Thread Matt Kangas
Doug, et al, It may be useful to index two dates: (1) what is known explicitly from HTTP and/or the fetch time, and (2) what can be deduced by parsing the document. Hence, index both Doug's and Matthias' suggestions separately. These can have different meanings in different contexts, so just recor

Re: [Nutch-dev] Experience with a big index

2004-11-29 Thread Michael Cafarella
Andrzej, I think you make an excellent point here: On Mon, 2004-11-29 at 07:48, Andrzej Bialecki wrote: > That was also the general idea of my ramblings about modifying NDFS so > that it works with data blocks that make sense for all parts of the > system, i.e. with segment slices. Then yo

[Nutch-dev] does banned-hosts.txt still work?

2004-11-29 Thread Andrew Chen
I don't see any reference to it in the code. Every once in a while, I run into sites like: http://spodzone.org.uk/cesspit.jl ... that seem designed to ensnare crawlers like Nutch. I e-mailed the website owner because he seems to have made a half-hearted attempt in the robots.txt file to be nice t

[Nutch-dev] Crawling specific directories on server / XML reading

2004-11-29 Thread Yousef Ourabi
The intranet tutorial crawling section explains how to crawl a single URL, or a list of URLs. How would one specify a single directory to be crawled periodically? Does Nutch have the capacity to crawl and index XML content in the same way that it does HTML? Thanks for your time. -

RE: [Nutch-dev] Experience with a big index

2004-11-29 Thread Nick Lothian
> > One of the key points, for me at least, though not really the > focus of > the MapReduce paper was that their job distribution and workload > management system works together closely with their > filesystem[2]. By > doing this, they can distribute jobs in such a way so that > most if no

[Nutch-dev] Help with NDFS

2004-11-29 Thread Xin-Yi Liu
I'm trying to use NDFS in order to have multiple machines performing my crawl, but I'm having some problems getting it to work. This is what I'm doing: First, I launch a namenode on machine #1: # bin/nutch namenode 9500 ndfs_ns 041129 164742 Server listener on port 9500: starting 041129 164742 S

Re: [Nutch-dev] Experience with a big index

2004-11-29 Thread Andrzej Bialecki
Luke Baker wrote: I know everyone working on Nutch probably gets sick of hearing people say, "Do it like Google." However, it seems like we could certainly borrow some ideas from them regarding this as well. I'm primarily thinking of their implementation called MapReduce. Take a look at their

Re: [Nutch-dev] Experience with a big index

2004-11-29 Thread Luke Baker
On 11/28/2004 08:33 PM, Michael Cafarella wrote: [snip] 3) Job distribution and workload management is a really big problem that Nutch currently leaves to the end user. This was a lot of work for me, and I understand Nutch pretty well. It would be very hard for someone who doesn't know as many