Why do you think Nutch is not suited for vertical search? I am in the process of building just that, and am planning to use a Hadoop cluster (most likely on AWS) for crawling.
On Tue, May 10, 2011 at 12:05 PM, J. Delgado <[email protected]> wrote:
> Nutch was never meant for vertical or enterprise search. Solr is a
> great engine, but obviously you need to get to the documents first. In
> order for me to state any further opinion I should ask the following:
>
> 1) What kind of documents/repositories are you trying to provide search for?
> 2) Are security and user access/permissions important for you?
> 3) What is the typical size of the document universe you wish your
> software to handle (in number of documents + avg size and/or total
> GB)?
>
> -- J
>
> On Tue, May 10, 2011 at 7:37 AM, webdev1977 <[email protected]> wrote:
>> I have been working on and off for about a year now on developing a prototype
>> for Enterprise Search using Nutch and Solr. I have also incorporated a
>> plugin using the hive-mrc google code for automatic tagging based on a
>> custom taxonomy that my customer uses. I have been slowly migrating up the
>> chain of machines available and I have been given one machine for my
>> "prototype" that is fairly powerful.
>>
>> Some problems still remain that I believe can be fixed and others make me
>> question my decision to use Nutch.
>>
>> One problem has to do with the fact that I am doing vertical searching. The
>> side effect of this is that the crawl process is SO slow. It took about 48
>> hours to crawl about 350,000 urls all from the same website. I am
>> crawling a shared file system and I am sure that constitutes vertical
>> crawling. The other web crawling I am doing also only comes from a handful
>> of urls. Maybe nutch is not the solution to use based on this?
>>
>> The other problem is the fact that I would like to use the
>> AdaptiveFetchSchedule and the developers I work with refuse to use caching
>> and Last Modified time for our PHP pages. This should be a nightmare :-(
>>
>> I love the solr aspect of our prototype. It is very fast and reliable and I
>> have not had lots of issues.
>>
>> In the real world, how do production environments use Nutch? Do they have a
>> separate custom script that runs each of the crawl commands separately? Do
>> they run this script once a day? What about vertical crawling, are there
>> any special settings that could help Nutch run faster?
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Going-Beyond-the-Prototype-tp2923289p2923289.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>

