On 18/06/2014 13:27, Vishal Tomar wrote:
Hi,
I am new to Apache Nutch and web crawlers in general, and I am trying to
build a vertical search engine for real estate.
Now, how do I implement the crawler? Probably use Nutch for the crawling
and modify it to only extract links from a page if the page contents are
relevant to real estate. I'd probably need to write some kind of relevancy
scoring function which uses a mixture of keywords, ontology and some kind
of similarity detection based on sites I know to be relevant.
I think that you might be jumping ahead a few steps. Building a vertical
search engine is quite different from building an ordinary crawl-based
search engine. With a vertical, the new sites are not so much detected
as added. It is the same as building a web directory.
You need to identify the relevant websites and then add them to the
crawl schedule. Otherwise you will end up having to clean the index
after it has included a lot of junk websites. By controlling the
websites that you add, you also make it a lot easier to deal with
compromised websites.
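As a rough illustration of controlling which websites are crawled, the sketch below checks a URL against a hand-maintained list of approved sites before it would be added to the crawl schedule. The site names and the function name are hypothetical, and this is plain Python rather than anything Nutch itself provides.

```python
from urllib.parse import urlsplit

# Hypothetical, hand-curated list of sites approved for the vertical.
ALLOWED_SITES = {"example-estates.test", "homes.example"}

def in_crawl_list(url, allowed=ALLOWED_SITES):
    """Return True if the URL's host is an approved site or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == site or host.endswith("." + site) for site in allowed)
```

Dropping a compromised site then only means removing one entry from the list.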
Though Nutch is impressive, I am not exactly up to speed on using it for
crawling and search as my main work is with domain names and
website/IP/country mapping.
A better strategy (rather than running a full crawl on all sites)
would be to fetch only the index page of each site and then analyse that
for real estate keywords and phrases. That could be a faster way of
building a list of candidate sites for crawling. (Effectively you break
your site acquisition process into three parts: Collection, Detection,
Selection.)
It might sound like a convoluted way of doing things but for vertical
search, it is a lot simpler than cleaning an index. :)
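The Collection, Detection, Selection split described above could be sketched roughly as follows. This is a minimal illustration only: the keyword list, the threshold and all function names are assumptions made for the example, and a real detector would need a much better phrase list than this one.

```python
import re
from urllib.request import urlopen

# Hypothetical phrase list; tune it for the market being covered.
KEYWORDS = ("real estate", "property for sale", "estate agent", "mortgage")

def fetch_index_page(site):
    """Collection: download only the index page of a site (not a full crawl)."""
    with urlopen("http://" + site + "/", timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def strip_tags(html):
    """Crude tag removal so that phrases are matched against visible text only."""
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    return re.sub(r"(?s)<[^>]+>", " ", html)

def keyword_score(html, keywords=KEYWORDS):
    """Detection: count how many distinct phrases appear on the index page."""
    text = re.sub(r"\s+", " ", strip_tags(html)).lower()
    return sum(1 for phrase in keywords if phrase in text)

def select_candidates(pages, threshold=2):
    """Selection: keep sites whose index page scores at or above the threshold."""
    return sorted(site for site, html in pages.items()
                  if keyword_score(html) >= threshold)
```

Here `pages` is a dictionary mapping each site name to the HTML of its index page, for example one built by calling `fetch_index_page` on every collected domain. The sites returned by `select_candidates` would still be reviewed before being added to the crawl schedule.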
Regards...jmcc
--
**********************************************************
John McCormac * e-mail: [email protected]
MC2 * web: http://www.hosterstats.com/
22 Viewmount * Domain Registrations Statistics
Waterford * And Historical DNS Database.
Ireland * Over 392 Million Domains Tracked.
IE * http://www.hosterstats.com/blog
**********************************************************