Thanks, Lewis! Will give that a try. Appreciate your help!!

On Thu, May 5, 2016 at 10:48 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

> Hi AL,
>
> Yes please see parse-zip plugin
> https://github.com/apache/nutch/tree/master/src/plugin/parse-zip
> You can register this within the plugin.includes property in nutch-site.xml
> Thanks
>
> On Thu, May 5, 2016 at 7:00 PM, <user-digest-h...@nutch.apache.org> wrote:
>
> > From: A Laxmi <a.lakshmi...@gmail.com>
> > To: "user@nutch.apache.org" <user@nutch.apache.org>
> > Cc:
> > Date: Thu, 5 May 2016 21:59:34 -0400
> > Subject: Nutch 1.x crawl Zip file URLs
> >
> > Hi,
> >
> > (a) Is it possible to crawl URL of a Zip file using Nutch and index in
> > Solr? (pls see example below)
> >
> > (b) Also, if a zip file URL has PDF files in them, is it possible to use
> > Nutch to crawl the Zip file URL and also the PDF file inside the Zip file
> > URL?
> >
> > E.g.
> > https://www.abc123.xxx/sites/docs/testing.zip
> >
> > When I unzip above URL - I would have the following:
> >
> > def.pdf
> > lmn.pdf
> > reg.pdf
> >
> > Please advise.
> >
> > Thanks!
> >
> > AL
>
> --
> Lewis
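
For reference, registering parse-zip in plugin.includes as Lewis describes might look roughly like the fragment below. This is only a sketch: the other plugin names in the value are an assumption modelled on a typical Nutch 1.x default and are not taken from this thread. Keep whatever value your own nutch-default.xml or nutch-site.xml already has and just add `zip` to the `parse-(...)` group.

```xml
<!-- conf/nutch-site.xml (sketch; merge into your existing <configuration>) -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumption: everything except "zip" mirrors a typical Nutch 1.x default.
         The only change relevant to this thread is adding zip to parse-(...). -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika|zip)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

If .zip URLs are still not fetched or parsed after this change, it may be worth checking that the URL filter rules do not exclude the zip suffix and that the fetch content limit is large enough for the archive. Neither point was raised in the thread, so treat both as things to verify rather than confirmed requirements.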