Hi Andrzej,
1. Even with a pretty broad area of interest, you wind up focusing
on a subset of all domains, which then means that the max-threads-per-host
limit (for polite crawling) starts killing your efficiency.
The "policies" approach that I described is able to follow and
distribute the…
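To make the bottleneck concrete, here is a minimal Java sketch (hypothetical names, not Nutch's actual fetcher code) of a per-host politeness cap. With many worker threads but only a handful of distinct hosts, at most hosts * maxPerHost fetches can run at once; the remaining threads sit idle.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical sketch, not Nutch code: cap concurrent fetches per host.
public class PoliteFetchScheduler {
    private final int maxPerHost;
    private final Map<String, Semaphore> perHost = new ConcurrentHashMap<>();

    public PoliteFetchScheduler(int maxPerHost) {
        this.maxPerHost = maxPerHost;
    }

    // Block until this host has a free fetch slot.
    public void acquireSlot(String host) throws InterruptedException {
        perHost.computeIfAbsent(host, h -> new Semaphore(maxPerHost)).acquire();
    }

    public void releaseSlot(String host) {
        perHost.get(host).release();
    }
}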
Hi Ken,
First of all, thanks for sharing your insights, that's a very
interesting read.
Ken Krugler wrote:
This sounds like the TrustRank algorithm. See
http://www.vldb.org/conf/2004/RS15P3.PDF. It talks about trust
attenuation via trust dampening (reducing the trust level as you get
further…
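For readers who haven't seen the paper, the dampening idea fits in a few lines of Java. This is just my reading of it, not code from the paper or from Nutch:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of trust dampening: pages inherit the trust of the page that
// linked to them, attenuated by a constant factor per hop from the seeds.
public class TrustDampening {
    public static Map<String, Double> propagate(Map<String, List<String>> outlinks,
                                                Set<String> trustedSeeds,
                                                double dampingFactor) {
        Map<String, Double> trust = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        for (String seed : trustedSeeds) {
            trust.put(seed, 1.0);          // seeds are fully trusted
            queue.add(seed);
        }
        while (!queue.isEmpty()) {
            String page = queue.poll();
            double dampened = trust.get(page) * dampingFactor;
            for (String link : outlinks.getOrDefault(page, List.of())) {
                // keep the highest trust reached so far for each page
                if (dampened > trust.getOrDefault(link, 0.0)) {
                    trust.put(link, dampened);
                    queue.add(link);
                }
            }
        }
        return trust;
    }
}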
Hi Andrzej,
I've been toying with the following idea, which is an extension of
the existing URLFilter mechanism and the concept of a "crawl
frontier".
Let's suppose we have several initial seed URLs, each with a
different subjective quality. We would like to crawl these, and
expand the "crawl frontier"…
For others working in a vertical-search scenario, I am having some
good luck with the following steps.
For starters, it begins with a bit of a manual process to obtain a
good seed starting point. For my current business I already had a
basic seed list of about 7,500 unique links to home pages…
Jack Tang wrote:
Hi Andrzej
The idea brings vertical search into Nutch, and it is definitely great :)
I think Nutch should add an information-retrieval layer into the whole
architecture, and export some abstract interfaces, say
UrlBasedInformationRetrieve (you can implement your URL-grouping idea
here?)…
Hi Andrzej
The idea brings vertical search into Nutch, and it is definitely great :)
I think Nutch should add an information-retrieval layer into the whole
architecture, and export some abstract interfaces, say
UrlBasedInformationRetrieve (you can implement your URL-grouping idea
here?), TextBasedInformationRetrieve…
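Filling out Jack's names as a rough sketch (the interface names are his; the signatures are only my guess, and each would live in its own file in a real plugin):

import java.util.Map;

// Hypothetical shape for the url-grouping side of the layer.
interface UrlBasedInformationRetrieve {
    // Classify/group a URL before fetching.
    String groupOf(String url);
}

// Hypothetical shape for the text side of the layer.
interface TextBasedInformationRetrieve {
    // Extract structured fields from the text of a fetched page.
    Map<String, String> extract(String pageText);
}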
Doug Cutting wrote:
Stefan Groschupf wrote:
Before we start adding metadata and more metadata, why not add
general metadata support to the CrawlDatum once? Then we can have any
kind of plugins that add and process metadata that belongs to a URL.
+1
This feature strikes me as something…
Andrzej,
This sounds like another great way to create more of a vertical
search application as well. By defining trusted seed sources we can
limit the scope of the crawl to a more suitable set of links.
Also, being able to apply additional rules by domain/host or by
trusted source would be…
Stefan Groschupf wrote:
Before we start adding metadata and more metadata, why not add
general metadata support to the CrawlDatum once? Then we can have any
kind of plugins that add and process metadata that belongs to a URL.
+1
This feature strikes me as something that might prove very useful…
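A minimal sketch of what general metadata on the datum could look like (hypothetical API, not the actual CrawlDatum class):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the datum carries an open key/value map, and any
// number of plugins can read and write entries that belong to the URL.
class MetadataDatum {
    private final Map<String, String> metadata = new HashMap<>();

    public String getMeta(String key) { return metadata.get(key); }
    public void setMeta(String key, String value) { metadata.put(key, value); }
}

// Plugin hook: each plugin adds or processes the metadata it cares about.
interface MetadataPlugin {
    void process(String url, MetadataDatum datum);
}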
Hehe... That was what I advocated from the beginning. There is a
cost associated with this, though, i.e. any change in CrawlDatum
size has a significant impact on the performance of most operations.
Sure, if you ever had a look at the 0.7 metadata patch, there I had
implemented things in a way that…
Stefan Groschupf wrote:
I like the idea, and it is another step in the direction of vertical
search, where I personally see the biggest chance for Nutch.
How to implement it? Surprisingly, I think that it's very simple -
just adding a CrawlDatum.policyId field would suffice, assuming we
have…
BTW: if Nutch is going to support vertical searching, I think page
URLs should be grouped into three types: fetchable URLs (just fetch
them), extractable URLs (fetch them and extract information from the
page), and pagination URLs; see the sketch after the quote below.
/Jack
On 1/5/06, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi Andrzej
>
> The
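Jack's three groups, written down as a simple enum (hypothetical names, just a sketch):

// Hypothetical sketch of the three URL groups Jack suggests above.
enum PageUrlType {
    FETCHABLE,   // just fetch it (e.g. to harvest outlinks)
    EXTRACTABLE, // fetch it and extract structured information from it
    PAGINATION   // "next page" links that enumerate further URLs
}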
I like the idea, and it is another step in the direction of vertical
search, where I personally see the biggest chance for Nutch.
How to implement it? Surprisingly, I think that it's very simple -
just adding a CrawlDatum.policyId field would suffice, assuming we
have a means to store and retrieve…
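A sketch of how small that change could stay (hypothetical code, assuming the policy definitions live outside the datum):

// Hypothetical sketch: the datum stores only a tiny fixed-size id, so
// the per-URL storage cost stays at one byte; the full policy is
// looked up by id elsewhere.
class CrawlDatumSketch {
    private byte policyId;

    public byte getPolicyId() { return policyId; }
    public void setPolicyId(byte id) { this.policyId = id; }
}

// The "means to store and retrieve" the policies themselves.
interface PolicyStore {
    CrawlPolicy lookup(byte policyId);
}

// Placeholder for whatever a policy carries (filters, depth, quality...).
class CrawlPolicy {}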
Excellent ideas, and that is what I'm hoping to do: use some of the
social-bookmarking ideas to build the starter sites and link maps
from. I hope to work with Simpy or other bookmarking projects to
build somewhat of a popularity map (human-edited authority) to merge
and calculate against a comp…