Then how do I index files that are larger than 15 MB? Please let me know.
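
For reference, here is a minimal sketch of the setting in question. It assumes the documents are fetched over the file:// protocol and that file.content.limit behaves as described in the stock nutch-default.xml (a length limit in bytes, with a negative value meaning no truncation). The override would go in conf/nutch-site.xml. The 20 MB figure is only an illustrative assumption, chosen to sit above the 15 MB files mentioned in this thread.

```xml
<!-- conf/nutch-site.xml: overrides the default from nutch-default.xml -->
<configuration>
  <property>
    <name>file.content.limit</name>
    <!-- Limit in bytes. 20971520 = 20 * 1024 * 1024 (illustrative value).
         A negative value such as -1 should disable truncation entirely. -->
    <value>20971520</value>
  </property>
</configuration>
```

With the limit above the largest file size, each document should be fetched whole, so the parser sees a complete file. The fetched content is held in memory, so a large or unlimited value may need more heap. For documents fetched over HTTP, the corresponding property is http.content.limit.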




ogjunk-nutch wrote:
> 
> Right, that is what I tried saying below.  I don't think you can index
> partially fetched doc/xls/rtf documents.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
>> From: m.harig <[EMAIL PROTECTED]>
>> To: nutch-dev@lucene.apache.org
>> Sent: Friday, June 6, 2008 3:56:30 AM
>> Subject: Re: nutch file content limit
>> 
>> 
>> Is there any way to index the partial content of doc/xls/rtf files? If it's
>> not possible, let me know.
>> 
>> 
>> ogjunk-nutch wrote:
>> > 
>> > I *think* you have to fetch the *full* content of MS Word docs (and PDFs
>> > and RTFs and ...) if you want parsers that handle those documents to be
>> > able to parse them.  A partial MS Word/PDF/RTF/... document is considered
>> > invalid/broken.  Try opening it with MS Word, for example -- it will not
>> > work.
>> > 
>> > 
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> > 
>> > 
>> > ----- Original Message ----
>> >> From: m.harig 
>> >> To: nutch-dev@lucene.apache.org
>> >> Sent: Thursday, June 5, 2008 3:27:18 AM
>> >> Subject: Re: nutch file content limit
>> >> 
>> >> 
>> >> thanks
>> >> 
>> >> My situation is this: I have 100 MS Word files, each 15 MB in size.
>> >> 
>> >> If I set file.content.limit to 5 MB, Nutch fetches the files but can't
>> >> parse the content. It says "Can't be handled as Microsoft document" and
>> >> the parse fails. How do I index the partial content of those documents?
>> >> Can anyone help me out with this?
>> >> 
>> >> 
>> >> This is my error:
>> >> 
>> >> Can't be handled as Microsoft document. java.io.IOException: Cannot
>> >> remove block[ 20839 ]; out of range
>> >> -- 
>> >> View this message in context: 
>> >> http://www.nabble.com/nutch-file-content-limit-tp17640376p17663787.html
>> >> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>> > 
>> > 
>> > 
>> 
> 
> 
> 

