Re: nutch file content limit

ogjunk-nutch Thu, 05 Jun 2008 09:42:04 -0700

I *think* you have to fetch the *full* content of MS Word docs (and PDFs and 
RTFs and ...) if you want parsers that handle those documents to be able to 
parse them.  A partial MS Word/PDF/RTF/... document is considered 
invalid/broken.  Try opening it with MS Word, for example -- it will not work.
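
For what it's worth, here is a minimal sketch of the override I would try in
conf/nutch-site.xml. It assumes file.content.limit behaves as described in
nutch-default.xml (the value is in bytes, and a negative value disables
truncation). The value shown is my assumption and is not tested against your
setup:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: a negative limit means "do not truncate", so each
       15 MB Word file is fetched whole and the parser sees a
       complete document. -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <!-- Alternative (also an assumption): an explicit cap larger than
       your biggest file, e.g. 20971520 bytes (20 MB). -->
</configuration>
```

Keep in mind that the whole file is then fetched and handed to the parser, so
the fetcher will probably need enough heap for 15 MB documents.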



Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: m.harig <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Thursday, June 5, 2008 3:27:18 AM
> Subject: Re: nutch file content limit
> 
> 
> thanks
> 
> My situation is this: I have 100 MS Word files, each about 15 MB in size.
> 
> If I set file.content.limit to 5 MB, Nutch fetches the files but cannot
> parse the content. It says "Can't be handled as Microsoft document" and
> the parse fails. How do I index partial content of those documents? Can
> anyone help me out with this?
> 
> 
> This is my error:
> 
> Can't be handled as Microsoft document. java.io.IOException: Cannot remove
> block[ 20839 ]; out of range
