Hi David,

I've created NUTCH-2255 <https://issues.apache.org/jira/browse/NUTCH-2255> to track this (as well as https://github.com/DigitalPebble/sc-warc/issues/1 for StormCrawler). Not sure if/when I'll find the time to work on this, but at least it is now in JIRA.

Best

Julien

On 18 April 2016 at 23:25, Davíð Steinn Geirsson <da...@dsg.is> wrote:

> Hi Julien,
>
> Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> > Hi David
> >
> > > the resulting file contains no matching request records, or even a
> > > warcinfo record for that matter.
> >
> > It wouldn't be too difficult to add at least the request records to
> > WARCExporter - please open a JIRA + contributions are welcome as
> > always.
>
> Thanks for the info, I'll open a ticket. Unfortunately, I'm not
> familiar enough with Java to take a crack at that.
>
> I did manage to fix the response record output of the
> CommonCrawlDataDumper, since it was only a tiny change. But given
> this bug, I'm leery of trusting its WARC output and I think I'll
> need to find some good WARC test suite to run it through. If I
> do, I'll submit a patch.
>
> > > I'm willing to move to nutch v2.x if it makes a difference.
> >
> > 2.x has neither of these resources; you're better off staying on 1.x
>
> Good to know, thanks.
>
> Best regards,
> Davíð
>
> > Julien
> >
> > On 14 April 2016 at 16:51, Davíð Steinn Geirsson <da...@dsg.is>
> > wrote:
> >
> > > Hi all,
> > >
> > > I'm trying to use Nutch v1.11 for an archival crawl and export
> > > the results to WARC files.
> > >
> > > It seems there are at least two separate WARC exporters in Nutch,
> > > but both have some problems.
> > >
> > > The first one is org.apache.nutch.tools.CommonCrawlDataDumper
> > > (invoked with 'nutch commoncrawldump'), which can export a WARC
> > > file with the appropriate option. The resulting WARC file looks
> > > good, except that the HTTP response body seems to have been
> > > mangled by removing the CR-LF between the HTTP response headers
> > > and the HTTP response body. The result is that it's not really
> > > possible to tell where the headers end and the body begins.
> > >
> > > The second one is org.apache.nutch.tools.warc.WARCExporter
> > > (invoked with 'nutch warc'). That one writes WARC response
> > > records properly, with the header separator. Unfortunately,
> > > that's *all* it writes - the resulting file contains no matching
> > > request records, or even a warcinfo record for that matter.
> > >
> > > So my question is, is it possible to use Nutch in its present
> > > state to export working WARC files containing both request and
> > > response records? I'm willing to move to nutch v2.x if it makes a
> > > difference.
> > >
> > > Best regards,
> > > Davíð

--
*Open Source Solutions for Text Engineering*
http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>