Right, that is what I was trying to say below. I don't think you can index partially fetched doc/xls/rtf documents.
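
If the aim is simply to get those Word files parsed and indexed, the usual route is to raise file.content.limit so that the whole file is fetched. A minimal sketch, assuming the override goes in conf/nutch-site.xml and that a negative value disables truncation, as the property description in nutch-default.xml states (worth checking against your Nutch version):

```xml
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <property>
    <name>file.content.limit</name>
    <!-- Assumption: a negative value means "do not truncate".
         A byte count larger than your biggest file (15MB here) should also work. -->
    <value>-1</value>
  </property>
</configuration>
```

With the limit lifted, fetching 100 files of 15MB each will take more memory and time, so it may be worth trying a small batch first.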

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: m.harig <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Friday, June 6, 2008 3:56:30 AM
> Subject: Re: nutch file content limit
>
> is there any way to index partial content of doc/xls/rtf . if its not
> possible let me know.
>
> ogjunk-nutch wrote:
> >
> > I *think* you have to fetch the *full* content of MS Word docs (and PDFs
> > and RTFs and ...) if you want parsers that handle those documents to be
> > able to parse them. A partial MS Word/PDF/RTF/... document is considered
> > invalid/broken. Try opening it with MS Word, for example -- it will not
> > work.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> >> From: m.harig
> >> To: nutch-dev@lucene.apache.org
> >> Sent: Thursday, June 5, 2008 3:27:18 AM
> >> Subject: Re: nutch file content limit
> >>
> >> thanks
> >>
> >> my situation is this.. i've 100 MS-WORD files . each has 15MB in size...
> >>
> >> if i set file.content.limit as 5MB. when nutch goes for fetching it can't
> >> parse the content. it says Can't handle as Microsoft document. and its
> >> failed.. how do i index partial content of those documents. any1 help me
> >> out of this
> >>
> >> this is my error
> >>
> >> Can't be handled as Microsoft document. java.io.IOException: Cannot
> >> remove block[ 20839 ]; out of range
> >> --
> >> View this message in context:
> >> http://www.nabble.com/nutch-file-content-limit-tp17640376p17663787.html
> >> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
> --
> View this message in context:
> http://www.nabble.com/nutch-file-content-limit-tp17640376p17686729.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.