Then how do I index files that are larger than 15 MB? Please let me know.
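
For reference, the only route the quoted reply below leaves open is to fetch each document in full, which means raising file.content.limit above the largest file rather than below it. A minimal sketch of that override, assuming it goes in conf/nutch-site.xml and that 20 MB is enough headroom for the 15 MB files (the value is an illustrative guess, not a recommendation):

```xml
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>file.content.limit</name>
    <!-- Limit in bytes. 20971520 = 20 * 1024 * 1024 (20 MB), an assumed
         value chosen only so that it exceeds the 15 MB documents. -->
    <value>20971520</value>
  </property>
</configuration>
```

As far as I understand the property description in nutch-default.xml, a negative value disables truncation entirely, and http.content.limit is the corresponding setting for documents fetched over HTTP. Larger limits mean more memory per fetched document, so that is worth checking before a big crawl.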
ogjunk-nutch wrote:
>
> Right, that is what I tried saying below. I don't think you can index
> partially fetched doc/xls/rtf documents.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
>> From: m.harig <[EMAIL PROTECTED]>
>> To: nutch-dev@lucene.apache.org
>> Sent: Friday, June 6, 2008 3:56:30 AM
>> Subject: Re: nutch file content limit
>>
>>
>> is there any way to index partial content of doc/xls/rtf . if its not
>> possible let me know.
>>
>>
>> ogjunk-nutch wrote:
>> >
>> > I *think* you have to fetch the *full* content of MS Word docs (and PDFs
>> > and RTFs and ...) if you want parsers that handle those documents to be
>> > able to parse them. A partial MS Word/PDF/RTF/... document is considered
>> > invalid/broken. Try opening it with MS Word, for example -- it will not
>> > work.
>> >
>> >
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> >
>> > ----- Original Message ----
>> >> From: m.harig
>> >> To: nutch-dev@lucene.apache.org
>> >> Sent: Thursday, June 5, 2008 3:27:18 AM
>> >> Subject: Re: nutch file content limit
>> >>
>> >>
>> >> thanks
>> >>
>> >> my situation is this.. i've 100 MS-WORD files . each has 15MB in size...
>> >>
>> >> if i set file.content.limit as 5MB. when nutch goes for fetching it can't
>> >> parse the content. it says Can't handle as Microsoft document. and its
>> >> failed.. how do i index partial content of those documents. any1 help me
>> >> out of this
>> >>
>> >>
>> >> this is my error
>> >>
>> >> Can't be handled as Microsoft document. java.io.IOException: Cannot remove
>> >> block[ 20839 ]; out of range
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/nutch-file-content-limit-tp17640376p17663787.html
>> >> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/nutch-file-content-limit-tp17640376p17686729.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
>
>

--
View this message in context: http://www.nabble.com/nutch-file-content-limit-tp17640376p17727247.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.