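I *think* you have to fetch the *full* content of MS Word docs (and PDFs, RTFs, and so on) if you want the parsers for those formats to be able to parse them. A truncated MS Word/PDF/RTF document is considered invalid/broken. Try opening one with MS Word, for example: it will not work. So with 15MB files, file.content.limit would have to be larger than 15MB; a 5MB limit cuts each file off partway, and that is most likely why the parser gives up.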
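For what it's worth, here is a rough sketch of the override I would try in conf/nutch-site.xml. The property name is the one from your message; the value -1 is based on my reading of the description in nutch-default.xml (a negative value means no truncation), so please check that against the version you are running before relying on it.

```xml
<!-- conf/nutch-site.xml: local overrides for nutch-default.xml -->
<configuration>
  <property>
    <name>file.content.limit</name>
    <!-- Limit in bytes for content fetched via file://.
         Assumption: a negative value disables truncation, as described
         in nutch-default.xml. Verify this for your Nutch version.
         Alternatively, use a byte count above your largest file,
         e.g. 20971520 for 20MB. -->
    <value>-1</value>
  </property>
</configuration>
```

If the documents are fetched over HTTP rather than from the file system, I believe the corresponding setting is http.content.limit, but that is also something to confirm in your nutch-default.xml.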
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: m.harig <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Thursday, June 5, 2008 3:27:18 AM
> Subject: Re: nutch file content limit
>
> thanks
>
> my situation is this.. i've 100 MS-WORD files . each has 15MB in size...
>
> if i set file.content.limit as 5MB. when nutch goes for fetching it can't
> parse the content. it says Can't handle as Microsoft document. and its
> failed.. how do i index partial content of those documents. any1 help me out
> of this
>
> this is my error
>
> Can't be handled as Microsoft document. java.io.IOException: Cannot remove
> block[ 20839 ]; out of range
> --
> View this message in context:
> http://www.nabble.com/nutch-file-content-limit-tp17640376p17663787.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.