Makes perfect sense and articulates very well what I was planning to do for
my vertical Nutch/Solr implementation.

Guy McDowell
[email protected]
http://www.GuyMcDowell.com





On Wed, Jun 18, 2014 at 4:27 PM, John McCormac <[email protected]> wrote:

> On 18/06/2014 13:27, Vishal Tomar wrote:
>
>> Hi,
>>
>> I am new to Apache Nutch and web crawlers in general. I am trying to build
>> a vertical search engine for real estate.
>>
>> Now, how do I implement the crawler? Probably use Nutch for the crawling
>> and modify it to only extract links from a page if the page contents are
>> relevant to real estate. I'd probably need to write some kind of relevancy
>> scoring function which uses a mixture of keywords, ontology and some kind
>> of similarity detection based on sites I know to be relevant.
>>
>
> I think that you might be jumping ahead a few steps. Building a vertical
> search engine is quite different from building an ordinary crawl-based
> search engine. With a vertical, the new sites are not so much detected as
> added. It is the same as building a web directory.
> You need to identify the relevant websites and then add them to the crawl
> schedule. Otherwise you will end up having to clean the index after it
> has included a lot of junk websites. By controlling the websites that you
> add, you also make it a lot easier to deal with compromised websites.
>
> Though Nutch is impressive, I am not exactly up to speed on using it for
> crawling and search as my main work is with domain names and
> website/IP/country mapping.
>
> A better strategy (rather than running a full crawl on all sites) would
> be to fetch only the index page of each site and then analyse that for
> real estate keywords and phrases. That could be a faster way of building
> a list of candidate sites for crawling. (Effectively you break your site
> acquisition process into three parts: Collection, Detection, Selection.)
> It might sound like a convoluted way of doing things but for vertical
> search, it is a lot simpler than cleaning an index. :)
>
> Regards...jmcc
> --
> **********************************************************
> John McCormac  *  e-mail: [email protected]
> MC2            *  web: http://www.hosterstats.com/
> 22 Viewmount   *  Domain Registrations Statistics
> Waterford      *  And Historical DNS Database.
> Ireland        *  Over 392 Million Domains Tracked.
> IE             *  http://www.hosterstats.com/blog
> **********************************************************
>
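
For what it's worth, the Detection and Selection steps described above could
be sketched roughly like this in Python. This is only a minimal sketch under
my own assumptions: the keyword list, the 0.4 threshold, and the function
names (keyword_score, select_candidates) are hypothetical, and Collection
(fetching each site's index page) is assumed to have happened already, with
the results held in a dict of site name to HTML. None of this is Nutch code.

```python
import re
from html.parser import HTMLParser

# Hypothetical keyword list; a real one would be tuned for the vertical.
KEYWORDS = ("real estate", "for sale", "bedroom", "property", "mortgage")


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def keyword_score(html, keywords=KEYWORDS):
    """Detection: fraction of keywords found in the page's visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Normalise whitespace and case before matching.
    text = re.sub(r"\s+", " ", " ".join(parser.parts)).lower()
    hits = sum(1 for kw in keywords if kw in text)
    return hits / len(keywords)


def select_candidates(index_pages, threshold=0.4):
    """Selection: keep sites whose index page scores at or above threshold."""
    return sorted(
        site for site, html in index_pages.items()
        if keyword_score(html) >= threshold
    )
```

The sites returned by select_candidates would still want a manual look
before being added to the crawl schedule, since plain keyword matching will
let some junk through and miss some relevant sites.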
