I checked the code. It will extract and parse all documents in the zip file and concatenate all extracted text.
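A minimal sketch of that behaviour using plain `java.util.zip` (an assumption for illustration: the real parse-zip plugin delegates each entry to the Nutch parser matching its MIME type, whereas this sketch simply treats every entry as UTF-8 text):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipTextConcat {

    // Walk every entry in the zip and concatenate the extracted text,
    // one entry after another, the way parse-zip builds a single
    // parse text for the whole archive.
    static String concatenate(byte[] zipBytes) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (ZipInputStream zin =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            byte[] buf = new byte[4096];
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                while ((n = zin.read(buf)) > 0) out.write(buf, 0, n);
                // Real parse-zip would hand out.toByteArray() to the
                // parser for the entry's content type here.
                sb.append(out.toString("UTF-8")).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny zip in memory with two "documents" to demo.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bos)) {
            zout.putNextEntry(new ZipEntry("def.txt"));
            zout.write("first doc".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
            zout.putNextEntry(new ZipEntry("lmn.txt"));
            zout.write("second doc".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        }
        System.out.print(concatenate(bos.toByteArray()));
    }
}
```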
Markus

-----Original message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: Monday 9th May 2016 11:37
> To: user@nutch.apache.org
> Subject: RE: Nutch 1.x crawl Zip file URLs
>
> Content of size 17027128 was truncated to
>
> This means your http.size (or whichever limit applies) is too low. Increase
> the setting and try again. By the way, I am not sure how the indexing
> behaviour will be; I don't think it will handle multiple files just like that.
>
> -----Original message-----
> > From: A Laxmi <a.lakshmi...@gmail.com>
> > Sent: Friday 6th May 2016 20:55
> > To: user@nutch.apache.org
> > Subject: Re: Nutch 1.x crawl Zip file URLs
> >
> > Hi Lewis,
> >
> > I tried what you suggested, but still no change. Please see the log
> > messages below. I put parse-zip under the plugins directory and also edited
> > nutch-site.xml to include parse-zip under plugin.includes. I highlighted
> > the parse log message below which I think might be the one that didn't go
> > through.
> >
> > Please help!
> >
> > 2016-05-06 14:47:32,226 INFO fetcher.Fetcher - Fetcher: finished at 2016-05-06 14:47:32, elapsed: 00:00:27
> > 2016-05-06 14:47:33,127 INFO parse.ParseSegment - ParseSegment: starting at 2016-05-06 14:47:33
> > 2016-05-06 14:47:33,127 INFO parse.ParseSegment - ParseSegment: segment: crawl_dir/crawl_zip2-sd/segments/20160506144702
> > 2016-05-06 14:47:33,497 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2016-05-06 14:47:34,366 INFO parse.ParseSegment - https://www.xyz.xyz/sites/production/files/2016/policyarchive.zip skipped. Content of size 17027128 was truncated to 5242760
> > 2016-05-06 14:47:34,896 INFO parse.ParseSegment - ParseSegment: finished at 2016-05-06 14:47:34, elapsed: 00:00:01
> > 2016-05-06 14:47:36,010 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2016-05-06 14:47:36,042 INFO crawl.CrawlDb - CrawlDb update: starting at 2016-05-06 14:47:36
> > 2016-05-06 14:47:36,042 INFO crawl.CrawlDb - CrawlDb update: db: crawl_dir/crawl_zip2-sd/crawldb
> > 2016-05-06 14:47:36,042 INFO crawl.CrawlDb - CrawlDb update: segments: [crawl_dir/crawl_zip2-sd/segments/20160506144702]
> > 2016-05-06 14:47:36,042 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> > 2016-05-06 14:47:36,042 INFO crawl.CrawlDb - *
> >
> > Regards,
> > AL
> >
> > On Thu, May 5, 2016 at 10:48 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> >
> > > Hi AL,
> > >
> > > Yes, please see the parse-zip plugin:
> > > https://github.com/apache/nutch/tree/master/src/plugin/parse-zip
> > > You can register this within the plugin.includes property in
> > > nutch-site.xml.
> > > Thanks
> > >
> > > On Thu, May 5, 2016 at 7:00 PM, <user-digest-h...@nutch.apache.org> wrote:
> > >
> > > > From: A Laxmi <a.lakshmi...@gmail.com>
> > > > To: "user@nutch.apache.org" <user@nutch.apache.org>
> > > > Date: Thu, 5 May 2016 21:59:34 -0400
> > > > Subject: Nutch 1.x crawl Zip file URLs
> > > >
> > > > Hi,
> > > >
> > > > (a) Is it possible to crawl the URL of a Zip file using Nutch and index
> > > > it in Solr? (Please see the example below.)
> > > >
> > > > (b) Also, if a Zip file URL has PDF files inside it, is it possible to
> > > > use Nutch to crawl the Zip file URL and also the PDF files inside it?
> > > >
> > > > E.g.
> > > > https://www.abc123.xxx/sites/docs/testing.zip
> > > >
> > > > When I unzip the above URL, I would have the following files:
> > > >
> > > > def.pdf
> > > > lmn.pdf
> > > > reg.pdf
> > > >
> > > > Please advise.
> > > >
> > > > Thanks!
> > > >
> > > > AL
> > >
> > > --
> > > Lewis
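For readers hitting the same two issues discussed in this thread (the parse-zip plugin not being active, and the fetched content being truncated before parsing), the relevant nutch-site.xml settings would look roughly like the following. This is an illustrative fragment, not the thread author's actual config: `plugin.includes` and `http.content.limit` are standard Nutch properties, but the exact plugin list shown here is an example and should be merged with whatever plugins your crawl already uses.

```xml
<!-- nutch-site.xml: illustrative fragment, adapt to your existing config -->
<property>
  <name>plugin.includes</name>
  <!-- add parse-zip to your existing plugin list; parse-(html|tika|zip)
       matches the three parser plugins -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika|zip)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>http.content.limit</name>
  <!-- The log above shows a 17027128-byte zip truncated to 5242760 bytes,
       so the configured limit must exceed the archive size; -1 disables
       the limit entirely. -->
  <value>-1</value>
</property>
```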