Hi David

I've created NUTCH-2255 <https://issues.apache.org/jira/browse/NUTCH-2255> to
track this (as well as https://github.com/DigitalPebble/sc-warc/issues/1 for
StormCrawler). Not sure if/when I'll find the time to work on this but at
least it is now in JIRA.

Best

Julien

On 18 April 2016 at 23:25, Davíð Steinn Geirsson <da...@dsg.is> wrote:

> Hi Julien,
>
> Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> > Hi David
> >
> > > the resulting file contains no matching request records, or even a
> > > warcinfo record for that matter.
> >
> >
> > It wouldn't be too difficult to add at least the request records to
> > WARCExporter - please open a JIRA; contributions are welcome as always.
>
> Thanks for the info, I'll open a ticket. I'm not familiar enough
> with Java to take a crack at that unfortunately.
>
> I did manage to fix the response record output of the
> CommonCrawlDataDumper, since it was only a tiny change. But given
> this bug, I'm leery of trusting its WARC output and I think I'll
> need to find some good WARC test suite to run it through. If I
> do, I'll submit a patch.
>
> >
> > > I'm willing to move to Nutch v2.x if it makes a difference.
> >
> >
> > 2.x has neither of these tools; you're better off staying on 1.x.
>
> Good to know, thanks.
>
> Best regards,
> Davíð
>
>
>
> >
> > Julien
> >
> >
> > On 14 April 2016 at 16:51, Davíð Steinn Geirsson <da...@dsg.is>
> > wrote:
> >
> > > Hi all,
> > >
> > > I'm trying to use Nutch v1.11 for an archival crawl and export
> > > the results to WARC files.
> > >
> > > > It seems there are at least two separate WARC exporters in Nutch,
> > > but both have some problems.
> > >
> > > The first one is org.apache.nutch.tools.CommonCrawlDataDumper
> > > > (invoked with 'nutch commoncrawldump'), which can export a WARC
> > > file with the appropriate option. The resulting WARC file looks
> > > good, except that the HTTP response body seems to have been
> > > mangled by removing the CR-LF between the HTTP response headers
> > > and the HTTP response body. The result is that it's not really
> > > possible to tell where the headers end and the body begins.
> > >
> > > The second one is org.apache.nutch.tools.warc.WARCExporter
> > > (invoked with 'nutch warc'). That one writes WARC response
> > > > records properly, with the header separator. Unfortunately,
> > > that's *all* it writes - the resulting file contains no matching
> > > request records, or even a warcinfo record for that matter.
> > >
> > > So my question is, is it possible to use Nutch in its present
> > > state to export working WARC files containing both request and
> > > > response records? I'm willing to move to Nutch v2.x if it makes a
> > > difference.
> > >
> > > Best regards,
> > > Davíð
> >
> >
> >
> >
>
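
For anyone who wants to check exporter output for the header/body problem
described in the quoted message, here is a minimal sketch in Python. It is
not Nutch code: the function name split_http_payload and the two sample
payloads are made up for illustration. It assumes you have already pulled
the raw HTTP response bytes out of a WARC response record by some other
means, and it only looks for the blank line (CR-LF CR-LF) that should end
the HTTP header block.

```python
def split_http_payload(payload: bytes):
    """Split raw HTTP response bytes into (header_block, body).

    The split happens at the first CR-LF CR-LF sequence. Returns None
    when that separator is missing, which is the symptom described for
    the mangled records.
    """
    head, sep, body = payload.partition(b"\r\n\r\n")
    if not sep:
        return None
    return head, body


# Hypothetical sample payloads, not taken from a real crawl.
good = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
mangled = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n<html></html>"

for name, payload in (("good", good), ("mangled", mangled)):
    parts = split_http_payload(payload)
    if parts is None:
        print(name, "-> no header/body separator found")
    else:
        print(name, "-> header bytes:", len(parts[0]), "body bytes:", len(parts[1]))
```

This is only a rough check: a body that happens to contain a blank line
of its own would still pass, so it can flag broken records but cannot
prove a record is well formed.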



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
