Makes perfect sense and articulates very well what I was planning to do for my vertical Nutch/Solr implementation.
Guy McDowell
[email protected]
http://www.GuyMcDowell.com

On Wed, Jun 18, 2014 at 4:27 PM, John McCormac <[email protected]> wrote:

> On 18/06/2014 13:27, Vishal Tomar wrote:
>
>> Hi,
>>
>> I am new to Apache Nutch and web crawlers in general, I am trying to build
>> a vertical search engine for real estate.
>>
>> Now, how do I implement the crawler? Probably use Nutch for the crawling
>> and modify it to only extract links from a page if the page contents are
>> relevant to real estate. I'd probably need to write some kind of relevancy
>> scoring function which uses a mixture of keywords, ontology and some kind
>> of similarity detection based on sites I know to be relevant.
>>
>
> I think that you might be jumping ahead a few steps. Building a vertical
> search engine is quite different from building an ordinary crawl based
> search engine. With a vertical, the new sites are not so much detected as
> added. It is the same as building a web directory.
> You need to identify the relevant websites and then add them to the crawl
> schedule. Otherwise you will end up with having to clean the index after it
> has included a lot of junk websites. By controlling the websites that you
> add, you also make it a lot easier to deal with compromised websites.
>
> Though Nutch is impressive, I am not exactly up to speed on using it for
> crawling and search as my main work is with domain names and
> website/IP/country mapping.
>
> A better strategy, (rather than running a full crawl on all sites), would
> be to use the index page only and then analyse that for real estate
> keywords and phrases. That could be a faster way of building a list of
> candidate sites for crawling. (Effectively you break your site acquisition
> process into three parts: Collection, Detection, Selection.) It might sound
> like a convoluted way of doing things but for vertical search, it is a lot
> simpler than cleaning an index. :)
>
> Regards...jmcc
> --
> **********************************************************
> John McCormac  * e-mail: [email protected]
> MC2            * web: http://www.hosterstats.com/
> 22 Viewmount   * Domain Registrations Statistics
> Waterford      * And Historical DNS Database.
> Ireland        * Over 392 Million Domains Tracked.
> IE             * http://www.hosterstats.com/blog
> **********************************************************
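For what it's worth, here is a rough sketch of how the Collection / Detection / Selection split John describes might look outside of Nutch, using only the Python standard library. Everything in it is my own assumption rather than anything from the thread: the keyword list, the "hits per 100 words" score, the threshold of 2.0, and the function names are all hypothetical placeholders that would need tuning against sites you already know to be relevant.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen

# Hypothetical keyword list; a real vertical would tune this by hand.
KEYWORDS = ("real estate", "property for sale", "for rent",
            "mortgage", "realtor", "listings")


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def fetch_index_page(url, timeout=10):
    """Collection: download only the site's index page, not the whole site."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def keyword_score(html, keywords=KEYWORDS):
    """Detection: keyword/phrase hits per 100 words of visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    text = re.sub(r"\s+", " ", " ".join(parser.parts)).lower()
    words = text.split()
    if not words:
        return 0.0
    hits = sum(text.count(k) for k in keywords)
    return 100.0 * hits / len(words)


def select_sites(pages, threshold=2.0):
    """Selection: keep sites whose index page scores at or above threshold.

    `pages` maps a site URL to the HTML of its index page.
    """
    scored = {url: keyword_score(html) for url, html in pages.items()}
    return sorted(url for url, score in scored.items() if score >= threshold)
```

The idea would be to run `fetch_index_page` over a list of candidate domains, pass the resulting `{url: html}` dict to `select_sites`, review the survivors by hand, and only then add them to the Nutch seed list. Keeping the fetch step separate from the scoring step also means the scoring can be tested on saved pages without touching the network.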

