Thanks for your reply. I was kind of afraid someone was going to say that
:-( I have invested so much time in developing plugins for Nutch that I am
deathly afraid of moving on to something else.

To answer your questions:
1) What kind of documents/repositories are you trying to provide search for?

I have several internal websites I am crawling (most of which are web front
ends for database info), and I am also crawling a local shared file system.
The document types run the gamut: html, pdf, word, excel, powerpoint, txt,
images, etc. (and any other crap the users throw on the file system)
  
2) Are security and user access/permissions important for you?

Somewhat, but not as much as you would think.  I actually have/had more
trouble accessing sites that required SSL certificates.  But I fixed
that by modifying protocol-httpclient to use a Java keystore and present a
client cert while fetching the page.
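For anyone wanting to do something similar, the core of that kind of fix is
building an SSLContext from a keystore that holds the client cert and wiring
it into the fetcher. This is just a sketch of the keystore-to-SSLContext
part, not the actual patch to protocol-httpclient -- the class name, password,
and in-memory keystore here are placeholders for illustration:

```java
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;

public class ClientCertSetup {

    // Builds an SSLContext whose key managers come from the given keystore,
    // so the client cert inside it is presented during the TLS handshake.
    static SSLContext contextFromKeystore(KeyStore ks, char[] password)
            throws Exception {
        KeyManagerFactory kmf =
            KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(ks, password);
        SSLContext ctx = SSLContext.getInstance("TLS");
        // Null trust managers / random source fall back to JVM defaults.
        ctx.init(kmf.getKeyManagers(), null, null);
        return ctx;
    }

    public static void main(String[] args) throws Exception {
        char[] pw = "changeit".toCharArray();  // placeholder password
        KeyStore ks = KeyStore.getInstance("JKS");
        // Empty in-memory store for the demo; in practice you would do
        // ks.load(new FileInputStream("client.jks"), pw) with your keystore.
        ks.load(null, pw);
        SSLContext ctx = contextFromKeystore(ks, pw);
        System.out.println(ctx.getProtocol());  // prints "TLS"
    }
}
```

In the real plugin you would hand the resulting socket factory to the HTTP
client doing the fetch, so every request to the SSL-protected sites carries
the client cert.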

3) What is the typical size of the document universe you wish your
software to handle (in number of documents + avg size and/or total
GB)?

The documents are all under 200 MB or so.  Most of them are html or pdf files
of a normal size.  The total size of the documents to be crawled is
fairly large, about 500 GB.  The other stuff, maybe about 100 GB total.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Going-Beyond-the-Prototype-tp2923289p2923807.html
Sent from the Nutch - User mailing list archive at Nabble.com.
