I checked the code. It will extract and parse all documents in the zip file and 
concatenate all extracted text.
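Roughly, the shape of it is as in the sketch below (an illustration from memory, not the plugin's actual source; extractText() is a hypothetical stand-in for the per-format parser):

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class ZipTextConcat {

        public static String parse(byte[] zipBytes) throws IOException {
            StringBuilder allText = new StringBuilder();
            try (ZipInputStream zin =
                     new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
                ZipEntry entry;
                while ((entry = zin.getNextEntry()) != null) {
                    if (entry.isDirectory()) {
                        continue;
                    }
                    // read() on ZipInputStream stops at the end of the current
                    // entry, so readAllBytes() (Java 9+) yields just this file.
                    byte[] content = zin.readAllBytes();
                    allText.append(extractText(entry.getName(), content));
                    allText.append('\n');
                }
            }
            return allText.toString();
        }

        // Hypothetical stand-in: in Nutch the entry would be routed to the
        // parser registered for its content type (e.g. Tika for PDFs).
        private static String extractText(String name, byte[] content) {
            return new String(content);
        }
    }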

Markus

 
 
-----Original message-----
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Monday 9th May 2016 11:37
> To: user@nutch.apache.org
> Subject: RE: Nutch 1.x crawl Zip file URLs
> 
> Content of size 17027128 was truncated to 5242760
> 
> this means your http.content.limit is too low. Increase the setting
> and try again. By the way, I am not sure what the indexing behaviour will be; I
> don't think it will handle multiple files just like that.
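> For example, in nutch-site.xml (the value just has to exceed the 17027128
> bytes from your log; -1 disables the limit entirely):
> 
> <property>
>   <name>http.content.limit</name>
>   <!-- -1 = no truncation; any value above 17027128 would also do -->
>   <value>-1</value>
> </property>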
> 
>  
> -----Original message-----
> > From:A Laxmi <a.lakshmi...@gmail.com>
> > Sent: Friday 6th May 2016 20:55
> > To: user@nutch.apache.org
> > Subject: Re: Nutch 1.x crawl Zip file URLs
> > 
> > Hi Lewis,
> > 
> > I tried what you suggested but still no change. Please see the log message
> > below. I put parse-zip under the plugins directory and also edited
> > nutch-site.xml to include parse-zip under plugin.includes. I think the
> > parse log message below about the skipped/truncated zip is the one that
> > didn't go through.
> > 
> > Please help!
> > 
> > 2016-05-06 14:47:32,226 INFO  fetcher.Fetcher - Fetcher: finished at 2016-05-06 14:47:32, elapsed: 00:00:27
> > 2016-05-06 14:47:33,127 INFO  parse.ParseSegment - ParseSegment: starting at 2016-05-06 14:47:33
> > 2016-05-06 14:47:33,127 INFO  parse.ParseSegment - ParseSegment: segment: crawl_dir/crawl_zip2-sd/segments/20160506144702
> > 2016-05-06 14:47:33,497 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2016-05-06 14:47:34,366 INFO  parse.ParseSegment - https://www.xyz.xyz/sites/production/files/2016/policyarchive.zip skipped. Content of size 17027128 was truncated to 5242760
> > 2016-05-06 14:47:34,896 INFO  parse.ParseSegment - ParseSegment: finished at 2016-05-06 14:47:34, elapsed: 00:00:01
> > 2016-05-06 14:47:36,010 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2016-05-06 14:47:36,042 INFO  crawl.CrawlDb - CrawlDb update: starting at 2016-05-06 14:47:36
> > 2016-05-06 14:47:36,042 INFO  crawl.CrawlDb - CrawlDb update: db: crawl_dir/crawl_zip2-sd/crawldb
> > 2016-05-06 14:47:36,042 INFO  crawl.CrawlDb - CrawlDb update: segments: [crawl_dir/crawl_zip2-sd/segments/20160506144702]
> > 2016-05-06 14:47:36,042 INFO  crawl.CrawlDb - CrawlDb update: additions allowed: true
> > 2016-05-06 14:47:36,042 INFO  crawl.CrawlDb -
> > 
> > Regards,
> > AL
> > 
> > On Thu, May 5, 2016 at 10:48 PM, Lewis John Mcgibbney <
> > lewis.mcgibb...@gmail.com> wrote:
> > 
> > > Hi AL,
> > >
> > > Yes please see parse-zip plugin
> > > https://github.com/apache/nutch/tree/master/src/plugin/parse-zip
> > > You can register this within the plugin.includes property in
> > > nutch-site.xml.
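> > > For example (one possible value based on the default includes; keep
> > > whatever plugins you already use and just add parse-zip to the parse
> > > group):
> > >
> > > <property>
> > >   <name>plugin.includes</name>
> > >   <value>protocol-http|urlfilter-regex|parse-(html|tika|zip)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > > </property>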
> > > Thanks
> > >
> > > On Thu, May 5, 2016 at 7:00 PM, <user-digest-h...@nutch.apache.org> wrote:
> > >
> > > > From: A Laxmi <a.lakshmi...@gmail.com>
> > > > To: "user@nutch.apache.org" <user@nutch.apache.org>
> > > > Cc:
> > > > Date: Thu, 5 May 2016 21:59:34 -0400
> > > > Subject: Nutch 1.x crawl Zip file URLs
> > > > Hi,
> > > >
> > > > (a) Is it possible to crawl the URL of a zip file using Nutch and index
> > > > it in Solr? (Please see the example below.)
> > > >
> > > > (b) Also, if a zip file contains PDF files, is it possible to use
> > > > Nutch to crawl the zip file URL and also parse the PDF files inside it?
> > > >
> > > >
> > > > E.g.
> > > > https://www.abc123.xxx/sites/docs/testing.zip
> > > > When I unzip the file at the above URL, I get the following:
> > > >
> > > >
> > > > def.pdf
> > > > lmn.pdf
> > > > reg.pdf
> > > >
> > > >
> > > > Please advise.
> > > >
> > > > Thanks!
> > > >
> > > > AL
> > > >
> > > >
> > >
> > >
> > > --
> > > *Lewis*
> > >
> > 
