I have been working on and off for about a year now on a prototype
for Enterprise Search using Nutch and Solr.  I have also incorporated a
plugin using the hive-mrc project from Google Code for automatic tagging
based on a custom taxonomy that my customer uses.  I have been slowly
migrating up the chain of machines available, and I have now been given one
fairly powerful machine for my "prototype".

Some problems remain that I believe can be fixed, but others make me
question my decision to use Nutch.

One problem has to do with the fact that I am doing vertical searching.  The
side effect of this is that the crawl process is SO slow.  It took about 48
hours to crawl about 350,000 URLs, all from the same website.  I am
crawling a shared file system, and I am sure that constitutes vertical
crawling.  The other web crawling I am doing also comes from only a handful
of URLs.  Maybe Nutch is not the right solution given this?
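
To put that crawl time in perspective, here is a rough back-of-the-envelope calculation. It uses only the two approximate figures above (48 hours, 350,000 URLs); the variable names are just for illustration, and nothing in it is measured from Nutch itself.

```python
# Rough throughput for the crawl described above.
# Both inputs are the approximate figures quoted in this message.
urls_crawled = 350_000
elapsed_hours = 48

elapsed_seconds = elapsed_hours * 3600
urls_per_second = urls_crawled / elapsed_seconds
seconds_per_url = elapsed_seconds / urls_crawled

print(f"{urls_per_second:.2f} URLs/s, {seconds_per_url:.2f} s per URL")
```

That works out to roughly two URLs per second, or about half a second per URL, sustained over the whole crawl.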

The other problem is that I would like to use the AdaptiveFetchSchedule,
but the developers I work with refuse to use caching or to send a
Last-Modified time for our PHP pages.  This is going to be a nightmare :-(

I love the Solr aspect of our prototype.  It is very fast and reliable, and I
have not had many issues with it.

In the real world, how do production environments use Nutch?  Do they have a
custom script that runs each of the crawl commands separately?  Do
they run this script once a day?  As for vertical crawling, are there
any special settings that could help Nutch run faster?



