Re: [Nutch-general] focussed crawling

Apache Lucene Wed, 04 Oct 2006 08:27:44 -0700

On 10/3/06, Jim Wilson <[EMAIL PROTECTED]> wrote:


PageRank is a Trademark of Google, and a source of great revenue for them -
you'll have to call it something else. :(



I was referring to the link score Nutch assigns.

Determining whether a page is relevant to a topic (with any degree of
accuracy) is a harder problem than it may appear - though your opening post
says to assume you have a way to do it.  For example, suppose the word
"buffalo" appears on a page.  Does this mean the animal, the NFL team,
the city, or the spicy sauce?



My case is fairly simple. The case you are mentioning is definitely a harder
problem.

Another concern is the assumption that documents linked-to by relevant
documents are themselves at all relevant.  Take Wikipedia for example -
there are lots of links on every page that have nothing to do with the
article (such as Main Page, Community Portal, Privacy Policy, etc).  If N is
any more than 1 or 2, you'll probably be swamped with non-relevant pages.



The reason I wanted this architecture was to get as many links as possible.
The link database would contain the non-relevant links; the final index
would not. The non-relevant links would be used only to discover more
relevant links (if any). At the end of the first crawl the database
will have the following:

Links: Relevant links + Small number of non-relevant links
Pages/Content: Relevant pages only
Index: Relevant pages only

In subsequent recrawls I would only have to revisit a small subset of
links, rather than the entire set I would have had if I had stored every
link. If I don't store the non-relevant links, I might miss the relevant
links that appear only on non-relevant pages. I am willing to have false
positives in the database, but will keep them low by setting N
appropriately. I hope this makes sense. Maybe the term "focussed crawling"
was wrong..:)
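
A minimal sketch of the scheme described above, in plain Python rather than
Nutch code. Everything named here is an assumption made for illustration:
`focused_crawl`, `fetch`, `is_relevant`, and the toy `WEB` graph are
hypothetical, and `max_irrelevant_hops` stands in for N, read as the number
of consecutive non-relevant pages the crawler may pass through before it
stops following links. Every fetched URL goes into the link database, but
only relevant pages keep their content.

```python
from collections import deque

def focused_crawl(seeds, fetch, is_relevant, max_irrelevant_hops):
    """Breadth-first crawl that tunnels through at most
    max_irrelevant_hops consecutive non-relevant pages."""
    link_db = {}     # url -> True/False relevance verdict (all fetched URLs)
    content_db = {}  # url -> page text (relevant pages only)
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, hops = queue.popleft()
        text, outlinks = fetch(url)
        relevant = is_relevant(text)
        link_db[url] = relevant
        if relevant:
            content_db[url] = text
            next_hops = 0            # a relevant page resets the budget
        else:
            next_hops = hops + 1
            if next_hops > max_irrelevant_hops:
                continue             # budget spent: do not follow outlinks
        for link in outlinks:
            # Simplification: a URL keeps the hop count of whichever
            # path reached it first, even if a shorter one exists.
            if link not in seen:
                seen.add(link)
                queue.append((link, next_hops))
    return link_db, content_db

# Toy in-memory "web": url -> (page text, outlinks). Purely illustrative.
WEB = {
    "a": ("buffalo wings recipe", ["b", "c"]),
    "b": ("privacy policy", ["d"]),
    "c": ("buffalo sauce", []),
    "d": ("about us", ["e"]),
    "e": ("more buffalo wings", []),
}

def fetch(url):
    return WEB[url]

def is_relevant(text):
    # Stand-in for whatever relevance test is actually available.
    return "buffalo" in text

link_db, content_db = focused_crawl(["a"], fetch, is_relevant, 1)
print(sorted(content_db))  # ['a', 'c']
```

With N = 1 the relevant page "e" is never reached, because it sits behind
two non-relevant pages in a row ("b" then "d"); raising N to 2 finds it at
the cost of following more non-relevant links. On a recrawl only the URLs
in `link_db` would need revisiting.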

In researching the problem, you might want to check out the Carrot
Clustering Engine (http://demo.carrot-search.com/carrot2-webapp/main).  It
may do what you want OOTB.



If I am right, Carrot is useful in clustering the search results and helpful
in crawling. I will take a fresh look at it for crawling.


