Re: Going Beyond the Prototype

Dietrich Thu, 12 May 2011 11:19:58 -0700

Why do you think Nutch is not suited for vertical search? I am in the
process of building just that, and am planning to use a Hadoop cluster
(most likely on AWS) for crawling.




On Tue, May 10, 2011 at 12:05 PM, J. Delgado <[email protected]> wrote:
> Nutch was never meant for vertical or enterprise search. Solr is a
> great engine, but obviously you need to get to the documents first. In
> order for me to state any further opinion I should ask the following:
>
> 1) What kind of documents/repositories are you trying to provide search for?
> 2) Are security and user access/permissions important for you?
> 3) What is the typical size of the document universe you want your
> software to handle (in number of documents + avg size and/or total
> GB)?
>
> -- J
>
> On Tue, May 10, 2011 at 7:37 AM, webdev1977 <[email protected]> wrote:
>> I have been working on and off for about a year now on developing a prototype
>> for Enterprise Search using Nutch and Solr.  I have also incorporated a
>> plugin using the hive-mrc Google Code project for automatic tagging based on a
>> custom taxonomy that my customer uses.  I have been slowly migrating up the
>> chain of machines available and I have been given one machine for my
>> "prototype" that is fairly powerful.
>>
>> Some problems still remain that I believe can be fixed and others make me
>> question my decision to use Nutch.
>>
>> One problem has to do with the fact that I am doing vertical searching.  The
>> side effect of this is that the crawl process is SO slow.  It took about 48
>> hours to crawl about 350,000 URLs, all from the same website.  I am
>> crawling a shared file system and I am sure that constitutes vertical
>> crawling.  The other web crawling I am doing also only comes from a handful
>> of URLs.  Maybe Nutch is not the solution to use based on this?
>>
>> The other problem is that I would like to use the
>> AdaptiveFetchSchedule, but the developers I work with refuse to use caching
>> and Last-Modified headers for our PHP pages.  This should be a nightmare :-(
>>
>> I love the Solr aspect of our prototype.  It is very fast and reliable and I
>> have not had many issues.
>>
>> In the real world, how do production environments use Nutch?  Do they have a
>> separate custom script that runs each of the crawl commands separately?  Do
>> they run this script once a day?  What about vertical crawling, are there
>> any special settings that could help Nutch run faster?
>>
>>
>>
>>
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Going-Beyond-the-Prototype-tp2923289p2923289.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
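
For a sense of scale, the crawl figures quoted above (about 350,000 URLs in
about 48 hours, all from one site) can be turned into a fetch rate. The sketch
below is only back-of-the-envelope arithmetic, not a Nutch API: the function
names are hypothetical, and the single-host model assumes requests to one host
are issued one at a time with a fixed pause between them, which is the kind of
behavior a per-host politeness delay (for example Nutch's fetcher.server.delay
property) produces. The delay value used is an illustrative assumption, not a
value taken from this thread.

```python
def crawl_rate(urls: int, hours: float) -> float:
    """Average fetch rate in URLs per second over the whole crawl."""
    return urls / (hours * 3600.0)


def min_hours_single_host(urls: int, delay_seconds: float) -> float:
    """Lower bound on crawl time, in hours, under the assumed model:
    one request at a time to a single host, with delay_seconds between
    consecutive requests (download time itself is ignored)."""
    return urls * delay_seconds / 3600.0


if __name__ == "__main__":
    # Figures from the original message: ~350,000 URLs in ~48 hours.
    rate = crawl_rate(350_000, 48)
    print(f"observed rate: {rate:.2f} URLs/s")

    # Hypothetical delay of 0.5 s between requests to the same host.
    bound = min_hours_single_host(350_000, 0.5)
    print(f"lower bound at 0.5 s delay: {bound:.1f} hours")
```

Under these assumptions, a single-host crawl is bounded by the per-host delay
rather than by the number of machines, so adding cluster capacity alone would
not be expected to shorten it much.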
