Hi Julien, hi David,

We could also try to merge the WARC generation code
of both tools, so that we do not have to apply fixes
twice (now and in the future).  The substantial
difference between them,
 commoncrawldump - runs only locally, while
 warc - is scalable via Hadoop,
is not easy to reconcile.  But sharing the
representation / generation of a WARC record
(or request - response pair) should be doable.
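
To make the idea a bit more concrete, here is a minimal sketch of
what such a shared record representation might look like.  It is
written in Python for brevity and is not Nutch code: WarcRecord,
http_message, split_http_message and make_pair are hypothetical
names invented for this illustration, and the header fields are
the ones I would expect from the WARC 1.0 format.  It also shows
why the empty line (CR-LF CR-LF) between HTTP headers and body
matters: without it the two cannot be told apart.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

CRLF = b"\r\n"


def http_message(start_line: str, headers: Dict[str, str], body: bytes = b"") -> bytes:
    """Build a raw HTTP message; an empty line (CRLF CRLF) separates headers and body."""
    lines = [start_line] + [f"{name}: {value}" for name, value in headers.items()]
    return CRLF.join(line.encode("ascii") for line in lines) + CRLF + CRLF + body


def split_http_message(raw: bytes) -> Tuple[bytes, bytes]:
    """Split at the first empty line; without it the boundary cannot be found."""
    head, sep, body = raw.partition(CRLF + CRLF)
    if not sep:
        raise ValueError("no CRLF CRLF between HTTP headers and body")
    return head, body


@dataclass
class WarcRecord:
    """Hypothetical shared representation of one WARC record."""
    record_type: str    # "request", "response", "warcinfo", ...
    target_uri: str
    date: str           # UTC timestamp, e.g. "2016-04-14T16:51:00Z"
    content_type: str
    block: bytes        # record payload, here a raw HTTP message
    record_id: str = field(default_factory=lambda: f"<urn:uuid:{uuid.uuid4()}>")
    concurrent_to: Optional[str] = None  # links a request to its response

    def serialize(self) -> bytes:
        fields = [
            "WARC/1.0",
            f"WARC-Type: {self.record_type}",
            f"WARC-Record-ID: {self.record_id}",
            f"WARC-Date: {self.date}",
            f"WARC-Target-URI: {self.target_uri}",
        ]
        if self.concurrent_to:
            fields.append(f"WARC-Concurrent-To: {self.concurrent_to}")
        fields.append(f"Content-Type: {self.content_type}")
        fields.append(f"Content-Length: {len(self.block)}")
        header = CRLF.join(f.encode("utf-8") for f in fields)
        # Header, empty line, block, then two CRLF closing the record.
        return header + CRLF + CRLF + self.block + CRLF + CRLF


def make_pair(uri: str, date: str, request: bytes,
              response: bytes) -> Tuple[WarcRecord, WarcRecord]:
    """Create a request/response pair that refers to the same capture."""
    resp = WarcRecord("response", uri, date,
                      "application/http; msgtype=response", response)
    req = WarcRecord("request", uri, date,
                     "application/http; msgtype=request", request,
                     concurrent_to=resp.record_id)
    return req, resp


request = http_message("GET / HTTP/1.1", {"Host": "example.org"})
response = http_message("HTTP/1.1 200 OK", {"Content-Type": "text/html"},
                        b"<html></html>")
req_rec, resp_rec = make_pair("http://example.org/", "2016-04-14T16:51:00Z",
                              request, response)
warc_bytes = req_rec.serialize() + resp_rec.serialize()
```

Both tools could then build such records in their own way (locally
or in a Hadoop job) and only share the serialization, so a fix
like the missing header separator would have to be made once.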

Cheers,
Sebastian

On 04/14/2016 09:50 PM, Julien Nioche wrote:
> Hi David
> 
>> the resulting file contains no matching request records, or even a
>> warcinfo record for that matter.
> 
> 
> It wouldn't be too difficult to add at least the request records to
> WARCExporter - please open a JIRA issue; contributions are welcome
> as always.
> 
>> I'm willing to move to nutch v2.x if it makes a difference.
> 
> 
> 2.x has neither exporter, so you're better off staying on 1.x
> 
> Julien
> 
> 
> On 14 April 2016 at 16:51, Davíð Steinn Geirsson <da...@dsg.is> wrote:
> 
>> Hi all,
>>
>> I'm trying to use Nutch v1.11 for an archival crawl and export
>> the results to WARC files.
>>
>> It seems there are at least two separate WARC exporters in Nutch,
>> but both have some problems.
>>
>> The first one is org.apache.nutch.tools.CommonCrawlDataDumper
>> (invoked with 'nutch commoncrawldump'), which can export a WARC
>> file with the appropriate option. The resulting WARC file looks
>> good, except that the HTTP response body seems to have been
>> mangled by removing the CR-LF between the HTTP response headers
>> and the HTTP response body. The result is that it's not really
>> possible to tell where the headers end and the body begins.
>>
>> The second one is org.apache.nutch.tools.warc.WARCExporter
>> (invoked with 'nutch warc'). That one writes WARC response
>> records properly, with the header separator. Unfortunately,
>> that's *all* it writes - the resulting file contains no matching
>> request records, or even a warcinfo record for that matter.
>>
>> So my question is, is it possible to use Nutch in its present
>> state to export working WARC files containing both request and
>> response records? I'm willing to move to nutch v2.x if it makes a
>> difference.
>>
>> Best regards,
>> Davíð
> 
> 
> 
> 
