I have been working on and off for about a year now on a prototype for Enterprise Search using Nutch and Solr. I have also incorporated a plugin, based on the hive-mrc project from Google Code, for automatic tagging against a custom taxonomy that my customer uses. I have been slowly migrating up the chain of available machines, and I have now been given one fairly powerful machine for my "prototype".
Some problems still remain. Some I believe can be fixed; others make me question my decision to use Nutch.

One problem has to do with the fact that I am doing vertical searching. The side effect of this is that the crawl process is SO slow. It took about 48 hours to crawl about 350,000 URLs, all from the same website. I am crawling a shared file system, and I am sure that constitutes vertical crawling. The other web crawling I am doing also only comes from a handful of URLs. Maybe Nutch is not the right solution given this?

The other problem is that I would like to use the AdaptiveFetchSchedule, but the developers I work with refuse to use caching and Last-Modified time for our PHP pages. That is going to be a nightmare :-(

I love the Solr aspect of our prototype. It is very fast and reliable, and I have not had many issues.

In the real world, how do production environments use Nutch? Do they have a separate custom script that runs each of the crawl commands separately? Do they run this script once a day? As for vertical crawling, are there any special settings that could help Nutch run faster?

--
View this message in context: http://lucene.472066.n3.nabble.com/Going-Beyond-the-Prototype-tp2923289p2923289.html
Sent from the Nutch - User mailing list archive at Nabble.com.

