Hey Vishal, I'm attempting to do a very similar thing, but not with real estate. I'm only about one step ahead of you in this process though, so I can't offer much help.
I think you are on the right path in having Nutch crawl only websites related to real estate. A whole-web crawl starting from seed URLs outside that vertical would probably be a waste of your time, so you might as well start with seeds in the vertical.

If you're using Nutch with Solr as the front-end search for your users, I think Solr will rank your results by relevance to the keywords entered in the search. I'm focusing on learning Nutch right now, so I'm not certain of everything Solr does.

From the research I've done, Nutch 1.x is better than 2.x, as it is more stable and has more features. I could be wrong, so I think that's worth double-checking.

I look forward to following your progress and learning from you. Hopefully my progress will be able to help you as well.

Cheers!

Guy McDowell
[email protected]
http://www.GuyMcDowell.com

On Wed, Jun 18, 2014 at 9:27 AM, Vishal Tomar <[email protected]> wrote:

> Hi,
>
> I am new to apache nutch and web crawlers in general, I am trying to build
> a vertical search engine for real estate.
>
> Now, How do I implement the crawler? Probably use Nutch for the crawling
> and modify it to only extract links from a page if the page contents are
> relevant to real estate. I'd probably need to write some kind of relevancy
> scoring function which uses a mixture of keywords, ontology and some kind
> of similarity detection based on sites I know to be relevant.
>
> Now is there any way by which I can configure Nutch to use my relevancy
> scoring function or do I need to change the source code, Also I would
> prefer working in python over java as I am much more familiar with it, so
> is there any library in python for nutch.
>
> Apart from this I would really appreciate any more pointers regarding nutch
> in general.
>
> Thanks
> Vishal
>

