[Nutch-dev] Re: Per-page crawling policy

2006-01-17 Thread Ken Krugler
Hi Andrzej, 1. Even with a pretty broad area of interest, you wind up focusing on a subset of all domains, which then means that the max-threads-per-host limit (for polite crawling) starts killing your efficiency. The "policies" approach that I described is able to follow and distribute the...
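
To illustrate the bottleneck Ken describes: a polite fetcher serializes requests per host, so a crawl concentrated on a handful of domains cannot keep its thread pool busy. A minimal sketch in Java, assuming a per-host semaphore (all names are illustrative, not Nutch's actual Fetcher code):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Semaphore;

    // Sketch: cap concurrent fetches per host. When most queued URLs
    // live on a few hosts, almost every worker thread ends up blocked
    // in acquire(), which is the efficiency loss described above.
    class PoliteFetchGate {
        private final int maxThreadsPerHost;
        private final Map<String, Semaphore> perHost =
            new ConcurrentHashMap<String, Semaphore>();

        PoliteFetchGate(int maxThreadsPerHost) {
            this.maxThreadsPerHost = maxThreadsPerHost;
        }

        void fetch(String host, Runnable doFetch) throws InterruptedException {
            Semaphore gate = perHost.computeIfAbsent(
                host, h -> new Semaphore(maxThreadsPerHost));
            gate.acquire();        // blocks once the host is saturated
            try {
                doFetch.run();     // the actual HTTP fetch would go here
            } finally {
                gate.release();
            }
        }
    }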

[Nutch-dev] Re: Per-page crawling policy

2006-01-16 Thread Andrzej Bialecki
Hi Ken, First of all, thanks for sharing your insights, that's a very interesting read. Ken Krugler wrote: This sounds like the TrustRank algorithm. See http://www.vldb.org/conf/2004/RS15P3.PDF. This talks about trust attenuation via trust dampening (reducing the trust level as you get further...
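
For reference, trust dampening in the TrustRank paper attenuates trust with link distance from the trusted seed set; a page inherits only a fraction of the trust of the page linking to it. A toy Java illustration (the damping value is invented for the example):

    // Toy illustration of trust dampening: each link hop multiplies
    // trust by a damping factor beta in (0,1). beta = 0.85 is invented.
    class TrustDampening {
        static double dampenedTrust(double parentTrust, double beta) {
            return beta * parentTrust;
        }

        public static void main(String[] args) {
            double trust = 1.0;                     // a trusted seed page
            for (int dist = 1; dist <= 3; dist++) {
                trust = dampenedTrust(trust, 0.85); // one hop further out
                System.out.println("distance " + dist + ": trust " + trust);
            }
        }
    }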

[Nutch-dev] Re: Per-page crawling policy

2006-01-16 Thread Ken Krugler
Hi Andrzej, I've been toying with the following idea, which is an extension of the existing URLFilter mechanism and the concept of a "crawl frontier". Let's suppose we have several initial seed URLs, each with a different subjective quality. We would like to crawl these, and expand the "crawl frontier"...
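
One rough shape such an extension could take: Nutch's existing URLFilter sees only the URL string, while a policy-aware variant would also see the quality and policy inherited from the page the link was found on. A hypothetical sketch, not an actual Nutch interface:

    // Hypothetical extension of the URLFilter idea: the filter sees the
    // outlink plus the policy and quality inherited from the page it was
    // found on, and may reject or rewrite the URL accordingly.
    interface PolicyAwareURLFilter {
        // Returns null to reject the URL, or the (possibly rewritten)
        // URL to keep following under the given policy.
        String filter(String url, int policyId, float sourceQuality);
    }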

[Nutch-dev] Re: Per-page crawling policy

2006-01-07 Thread Neal Whitley
For others working in a vertical search scenario, I am having some good luck with the following steps. It begins with a bit of a manual process to obtain a good seed starting point. For my current business I already had a basic seed list of about 7,500 unique links to home pages...
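
For anyone reproducing this kind of setup: in Nutch the seed list is just a flat text file of URLs that gets injected into the crawl db. A minimal example (the URLs are invented, and the exact inject syntax depends on the Nutch version; the mapred-based trunk takes a crawldb path and a directory of URL files):

    # urls/seeds.txt -- one seed URL per line
    http://www.example-homepage-one.com/
    http://www.example-homepage-two.org/

    $ bin/nutch inject crawl/crawldb urls/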

[Nutch-dev] Re: Per-page crawling policy

2006-01-06 Thread Andrzej Bialecki
Jack Tang wrote: Hi Andrzej The idea brings vertical search into Nutch, and it is definitely great :) I think Nutch should add an information-retrieval layer into the whole architecture, and export some abstract interfaces, say UrlBasedInformationRetrieve (you can implement your URL grouping idea here?)...

[Nutch-dev] Re: Per-page crawling policy

2006-01-06 Thread Jack Tang
Hi Andrzej The idea brings vertical search into Nutch, and it is definitely great :) I think Nutch should add an information-retrieval layer into the whole architecture, and export some abstract interfaces, say UrlBasedInformationRetrieve (you can implement your URL grouping idea here?), TextBasedInformat...
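
A sketch of what this extraction layer might look like as Java interfaces. The first name is taken from the message; the second completes the truncated "TextBasedInformat..." and is a guess, as are both signatures:

    import java.util.Map;

    // Sketch of the abstract extraction layer proposed above; the
    // method signatures are invented for illustration.
    interface UrlBasedInformationRetrieve {
        // Decide, from URL grouping rules, whether a URL is of interest.
        boolean accepts(String url);
    }

    interface TextBasedInformationRetrieve {
        // Extract structured fields from the text of a fetched page.
        Map<String, String> extract(String url, String pageText);
    }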

[Nutch-dev] Re: Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
Doug Cutting wrote: Stefan Groschupf wrote: Before we start adding metadata and more metadata, why not add generic metadata to the CrawlDatum once and for all, so that we can have any kind of plugin that adds and processes metadata belonging to a URL. +1 This feature strikes me as something...

[Nutch-dev] Re: Per-page crawling policy

2006-01-05 Thread Neal Whitley
Andrzej, This sounds like another great way to build a vertical search application. By defining trusted seed sources we can limit the scope of the crawl to a more suitable set of links. Also, being able to apply additional rules by domain/host or by trusted source would be...

[Nutch-dev] Re: Per-page crawling policy

2006-01-05 Thread Doug Cutting
Stefan Groschupf wrote: Before we start adding metadata and more metadata, why not add generic metadata to the CrawlDatum once and for all, so that we can have any kind of plugin that adds and processes metadata belonging to a URL. +1 This feature strikes me as something that might prove very useful...
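
Structurally, "generic metadata on the CrawlDatum" could be as simple as the following sketch. This is not the actual CrawlDatum class; the real one implements Writable and would need matching serialization for the map:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a CrawlDatum carrying a generic metadata map that any
    // plugin can read and write. The field layout is invented.
    class CrawlDatumSketch {
        byte status;
        long fetchTime;
        float score;
        Map<String, String> metaData = new HashMap<String, String>();
    }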

[Nutch-dev] Re: Per-page crawling policy

2006-01-05 Thread Stefan Groschupf
Hehe... That was what I advocated from the beginning. There is a cost associated with this, though, i.e. any change in CrawlDatum size has a significant impact on most operations' performance. Sure, if you ever had a look at the 0.7 metadata patch, there I had implemented things in a way that...
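
One way to reconcile per-URL metadata with the size concern, in the spirit of what the 0.7 patch is described as doing (a sketch under that assumption; the actual patch may differ): serialize the map only when it is non-empty, so the common record pays a single count field.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch: pay-as-you-go metadata serialization. A record with no
    // metadata costs only one int (a zero count) on disk.
    class MetaDataSketch {
        Map<String, String> map = new HashMap<String, String>();

        void write(DataOutput out) throws IOException {
            out.writeInt(map.size());            // 0 in the common case
            for (Map.Entry<String, String> e : map.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        }

        void readFields(DataInput in) throws IOException {
            map.clear();
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                map.put(in.readUTF(), in.readUTF());
            }
        }
    }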

[Nutch-dev] Re: Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
Stefan Groschupf wrote: I like the idea and it is another step in the direction of vertical search, where I personally see the biggest chance for Nutch. How to implement it? Surprisingly, I think that it's very simple - just adding a CrawlDatum.policyId field would suffice, assuming we have...
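
The indirection this implies could be as small as the following sketch (all names invented): the CrawlDatum stores only a small id, and fetch/parse code resolves it against a registry of policy definitions loaded from configuration.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the policyId indirection. CrawlDatum would store only
    // the byte id; the crawler resolves it here at fetch/parse time.
    class CrawlPolicy {
        String[] urlPatterns;   // which outlinks this policy follows
        int maxDepth;           // how far it lets the frontier expand
    }

    class PolicyRegistry {
        private final Map<Byte, CrawlPolicy> policies =
            new HashMap<Byte, CrawlPolicy>();

        void register(byte id, CrawlPolicy p) { policies.put(id, p); }

        CrawlPolicy lookup(byte policyId) {
            return policies.get(policyId);   // autoboxed key lookup
        }
    }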

[Nutch-dev] Re: Per-page crawling policy

2006-01-05 Thread Jack Tang
BTW: if Nutch is going to support vertical searching, I think page URLs should be grouped into three types: fetchable URLs (just fetch them), extractable URLs (fetch them and extract information from the page) and pagination URLs. /Jack On 1/5/06, Jack Tang <[EMAIL PROTECTED]> wrote: > Hi Andrzej > > The...
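
These three roles map naturally onto a small enum (a sketch; Nutch itself does not define this type):

    // Sketch of the three URL roles proposed above; not a Nutch type.
    enum UrlRole {
        FETCHABLE,    // fetch the page and follow its links, no extraction
        EXTRACTABLE,  // fetch the page and run structured extraction on it
        PAGINATION    // "next page" links that enumerate listing pages
    }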

[Nutch-dev] Re: Per-page crawling policy

2006-01-05 Thread Stefan Groschupf
I like the idea and it is another step in the direction of vertical search, where I personally see the biggest chance for Nutch. How to implement it? Surprisingly, I think that it's very simple - just adding a CrawlDatum.policyId field would suffice, assuming we have a means to store and retrieve...

[Nutch-dev] Re: Per-page crawling policy

2006-01-05 Thread Byron Miller
Excellent ideas, and that is what I'm hoping to do: use some of the social-bookmarking ideas to build the starter sites and link maps from. I hope to work with Simpy or other bookmarking projects to build something of a popularity map (human-edited authority) to merge and calculate against a comp...
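
Merging a human-edited authority signal with a computed link score, as suggested here, could start as a simple weighted blend (the weight is invented for illustration and would need tuning):

    // Toy blend of a human-edited authority score (e.g. derived from
    // social-bookmarking counts) with a computed link score.
    class ScoreMerge {
        static float mergedScore(float humanAuthority, float computedScore) {
            final float alpha = 0.3f;   // invented weight
            return alpha * humanAuthority + (1 - alpha) * computedScore;
        }
    }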