Besides the two popular approaches, 1) crawling a predetermined list
of web sites, and 2) generic crawling followed by classifying the
fetched pages into topics or domains, there is a third approach:
3) focused crawling.

Let's say you want to crawl 20 million pages. Coming up with the list
by hand would be impractical. Crawling the web in general and then
using a classifier to pick the 20 million that match your target
subject may also be impractical, especially if only a small percentage
(e.g., 5%) of all web pages fall into your domain. A focused crawler
starts with a seed set of pages (say 100-1000 pages) that are manually
collected. From this seed set, the crawler extracts URLs and downloads
pages. However, a page is downloaded only if there is a good chance
that it will be relevant to the subject. A naïve Bayes classifier, or
another classifier, can be used to make the prediction. This approach,
with some more detail, is described in:

http://www2006.org/programme/item.php?id=4512
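
To make the idea concrete, here is a rough Python sketch of the loop
described above. It is not the method from the linked paper and it is
not Nutch code. Everything in it is my own assumption for illustration:
the names NaiveBayes, focused_crawl and FAKE_WEB, the choice to score a
link by its anchor text before downloading the target, the 0.5
threshold, and the tiny in-memory "web" that stands in for real HTTP
fetching.

```python
import math
from collections import Counter, deque


class NaiveBayes:
    """Tiny two-class naive Bayes over words, with add-one smoothing.

    Assumes train() has been called at least once for each class.
    """

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, relevant):
        self.word_counts[relevant].update(text.lower().split())
        self.doc_counts[relevant] += 1

    def prob_relevant(self, text):
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label in (True, False):
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Turn the two log scores into a probability for the relevant class.
        top = max(log_score.values())
        rel = math.exp(log_score[True] - top)
        irr = math.exp(log_score[False] - top)
        return rel / (rel + irr)


def focused_crawl(seeds, fetch, classifier, threshold=0.5, max_pages=100):
    """Crawl from the seeds, following only links predicted to be relevant."""
    queue = deque(seeds)
    seen = set(seeds)
    downloaded = []
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        page = fetch(url)  # hypothetical fetcher; returns None on failure
        if page is None:
            continue
        downloaded.append(url)
        for link_url, anchor_text in page["links"]:
            if link_url in seen:
                continue
            # Simplification: a URL is judged once, by the first anchor seen.
            seen.add(link_url)
            if classifier.prob_relevant(anchor_text) >= threshold:
                queue.append(link_url)
    return downloaded


# Made-up pages standing in for the web: URL -> outgoing (URL, anchor text).
FAKE_WEB = {
    "seed": {"links": [("cars", "used car engine reviews"),
                       ("food", "pasta recipe ideas")]},
    "cars": {"links": [("tires", "car tire brake guide"), ("seed", "home")]},
    "tires": {"links": []},
    "food": {"links": []},
}

classifier = NaiveBayes()
classifier.train("car engine tire brake reviews", relevant=True)
classifier.train("used car dealer guide", relevant=True)
classifier.train("pasta recipe cooking ideas", relevant=False)
classifier.train("movie celebrity gossip", relevant=False)

crawled = focused_crawl(["seed"], FAKE_WEB.get, classifier)
print(crawled)
```

In this toy run the automotive links are followed and the recipe page
is never fetched. A real crawler would train on the text of the seed
pages, use more link context than the anchor text alone, and probably
keep a priority queue ordered by score instead of a plain FIFO queue.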

Cheers,

Tony A.A.

On 1/30/07, Dennis Kubes wrote:
> It means searching a specific domain such as automotive, health, etc.
>
> How to do it is another story. The short answer: you could either index
> only specific sites that you know are in the domain, or you could create
> ways to determine automatically whether a page is in a domain.
>
> Dennis Kubes
>
> Reddeppa Naidu wrote:
> > Hi,
> > I am new to Nutch search; I have been working with it for the past
> > month. Can anyone tell me what is meant by vertical search, and
> > suggest how I can do it?
> >
> > Thanks
> > pandu
> >
>

Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general