Right, that is what I was trying to say below. I don't think you can index
partially fetched doc/xls/rtf documents.
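
If the goal is simply to get those files indexed, the usual way around this is to stop truncating them rather than to parse a truncated copy. A minimal sketch, assuming the override goes in conf/nutch-site.xml and that the files are read over the file:// protocol (http.content.limit is the counterpart for HTTP fetches). The 20 MB figure is only an illustration chosen to sit above the 15 MB files mentioned below:

```xml
<!-- conf/nutch-site.xml: override of the default in nutch-default.xml -->
<property>
  <name>file.content.limit</name>
  <!-- Limit in bytes. 20971520 = 20 MB, an assumed value picked to exceed
       the 15 MB documents. A negative value such as -1 disables
       truncation entirely. -->
  <value>20971520</value>
</property>
```

Fetching whole 15 MB files costs more memory and time per document, so this is a trade-off and not a free fix.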

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: m.harig <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Friday, June 6, 2008 3:56:30 AM
> Subject: Re: nutch file content limit
> 
> 
> Is there any way to index partial content of doc/xls/rtf files? If it's not
> possible, let me know.
> 
> 
> ogjunk-nutch wrote:
> > 
> > I *think* you have to fetch the *full* content of MS Word docs (and PDFs
> > and RTFs and ...) if you want parsers that handle those documents to be
> > able to parse them.  A partial MS Word/PDF/RTF/... document is considered
> > invalid/broken.  Try opening it with MS Word, for example -- it will not
> > work.
> > 
> > 
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
> > 
> > ----- Original Message ----
> >> From: m.harig 
> >> To: nutch-dev@lucene.apache.org
> >> Sent: Thursday, June 5, 2008 3:27:18 AM
> >> Subject: Re: nutch file content limit
> >> 
> >> 
> >> thanks
> >> 
> >> My situation is this: I have 100 MS Word files, each 15 MB in size.
> >> 
> >> If I set file.content.limit to 5 MB, Nutch can't parse the content when
> >> it fetches the files. It says it can't handle them as Microsoft documents,
> >> and it fails. How do I index partial content of those documents?
> >> Can anyone help me out with this?
> >> 
> >> 
> >> This is my error:
> >> 
> >> Can't be handled as Microsoft document. java.io.IOException: Cannot remove block[ 20839 ]; out of range
> >> -- 
> >> View this message in context: 
> >> http://www.nabble.com/nutch-file-content-limit-tp17640376p17663787.html
> >> Sent from the Nutch - Dev mailing list archive at Nabble.com.
> > 
> > 
> > 
> 
> -- 
> View this message in context: 
> http://www.nabble.com/nutch-file-content-limit-tp17640376p17686729.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
