In addition to the two popular approaches, 1) crawling a predetermined list of web sites, and 2) generic crawling augmented with classifying web pages into topics or domains, there is a third approach: 3) focused crawling.
Let's say you want to crawl 20 million pages. Coming up with the list by hand would be impractical. Crawling the web in general and then using a classifier to pick the 20 million that match your target subject may also be impractical, especially if only a small percentage (e.g., 5%) of all web pages falls into your domain: at 5%, you would have to download roughly 400 million pages to keep 20 million.

A focused crawler starts with a seed set of pages (say 100-1000 pages) that are manually collected. From this seed set, the crawler extracts URLs and downloads pages. However, a page is downloaded only if there is a good chance that it will be relevant to the subject. A naïve Bayes classifier, or another classifier, can be used to make the prediction.

This approach, with some more detail, is described in:
http://www2006.org/programme/item.php?id=4512

Cheers,
Tony A.A.

On 1/30/07, Dennis Kubes wrote:
> It means searching a specific domain such as automotive, health, etc.
>
> How to do it is another story, short answer you could either index only
> specific sites that you know are in the domain or you could create ways
> to determine automatically if a page is in a domain.
>
> Dennis Kubes
>
> Reddeppa Naidu wrote:
> > Hi,
> > i am new to Nutch search, i am working from past one month.Any one can
> > tell what is ment by Vertical search.any one can suggest how can i do it.
> >
> > Thanks
> > pandu
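
P.S. As a rough illustration of the focused crawling loop described in the reply above, here is a minimal Python sketch. It is not Nutch code and not the method from the linked paper. Everything in it is an assumption made for illustration: the `TOY_WEB` dictionary stands in for real fetching and link extraction, the `fetch` interface is hypothetical, the tiny hand-rolled naive Bayes is trained on a handful of made-up phrases, and relevance is predicted from a link's anchor text (since the page itself has not been downloaded yet when the decision is made).

```python
import math
import re
from collections import Counter, deque


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Tiny two-class multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, relevant):
        self.word_counts[relevant].update(tokenize(text))
        self.doc_counts[relevant] += 1

    def prob_relevant(self, text):
        # Assumes at least one training example per class.
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label in (True, False):
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / (total + len(vocab)))
            log_score[label] = score
        # Normalise the two log scores into a probability.
        top = max(log_score.values())
        weights = {k: math.exp(v - top) for k, v in log_score.items()}
        return weights[True] / (weights[True] + weights[False])


def focused_crawl(seeds, fetch, classifier, threshold=0.5, max_pages=100):
    """Breadth-first crawl that only follows links predicted to be relevant.

    fetch(url) is a hypothetical callback returning (link_url, anchor_text)
    pairs for the page at url.
    """
    frontier = deque(seeds)
    seen = set(seeds)
    downloaded = []
    while frontier and len(downloaded) < max_pages:
        url = frontier.popleft()
        downloaded.append(url)
        for link, anchor in fetch(url):
            if link in seen:
                continue
            # Simplification: a link rejected once is never reconsidered.
            seen.add(link)
            if classifier.prob_relevant(anchor) >= threshold:
                frontier.append(link)
    return downloaded


# Made-up stand-in for the web: url -> [(link_url, anchor_text), ...]
TOY_WEB = {
    "seed/cars": [("a/engine", "engine repair guide"), ("b/recipes", "cake recipes")],
    "a/engine": [("c/brakes", "brake pads and tires"), ("d/knit", "knitting patterns")],
    "b/recipes": [],
    "c/brakes": [],
    "d/knit": [],
}

clf = NaiveBayes()
for phrase in ("car engine repair", "brake tires wheels", "automotive engine brake pads"):
    clf.train(phrase, relevant=True)
for phrase in ("cake recipes baking", "knitting patterns yarn", "cooking recipes"):
    clf.train(phrase, relevant=False)

pages = focused_crawl(["seed/cars"], lambda url: TOY_WEB.get(url, []), clf)
print(pages)
```

In a real crawl the seed list would be the 100-1000 hand-picked pages, and the classifier would be trained on far more data. The threshold controls the trade-off between how many off-topic pages get downloaded and how many relevant ones are missed.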
