Markus Jelsma created NUTCH-2865:
------------------------------------

             Summary: WARC exporter support for metadata and dropping empty responses
                 Key: NUTCH-2865
                 URL: https://issues.apache.org/jira/browse/NUTCH-2865
             Project: Nutch
          Issue Type: Improvement
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.19


WARCExporter is a handy tool to dump the segments. Unfortunately it also emits 
WARC records for statuses other than success or notmodified, which account for 
a decent number of records in each crawl cycle. It also doesn't emit parsed 
metadata or extracted text. With this patch it can do both.

 

This patch adds three switches:
 * -omitEmptyResponses to only emit records for responses with status success or notmodified
 * -includeParseData to also emit parse metadata as a WARC metadata record
 * -includeParseText to also emit extracted text as a WARC metadata record

Both metadata objects are stored in the same WARC metadata record to save space.
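
To illustrate how the three switches interact, here is a minimal sketch in Python. It is not the patch itself (WARCExporter is Java) and does not use the Nutch API. The Page class, the export_records function, the status strings and the "key: value" payload layout are all hypothetical, chosen only to show the filtering and the shared metadata record.

```python
from dataclasses import dataclass, field

# Assumption: statuses are modelled as plain strings. The issue only names
# "success" and "notmodified" as the statuses worth keeping.
KEEP_STATUSES = {"success", "notmodified"}


@dataclass
class Page:
    """Hypothetical stand-in for one fetched entry of a segment."""
    url: str
    status: str
    content: bytes = b""
    parse_data: dict = field(default_factory=dict)
    parse_text: str = ""


def export_records(pages, omit_empty_responses=False,
                   include_parse_data=False, include_parse_text=False):
    """Return a list of (record_type, url, payload) tuples."""
    records = []
    for page in pages:
        # -omitEmptyResponses: skip anything that is not success/notmodified.
        if omit_empty_responses and page.status not in KEEP_STATUSES:
            continue
        records.append(("response", page.url, page.content))

        # -includeParseData / -includeParseText: both kinds of metadata go
        # into ONE metadata record. The field layout here is illustrative only.
        fields = []
        if include_parse_data:
            fields.extend(f"{k}: {v}" for k, v in sorted(page.parse_data.items()))
        if include_parse_text and page.parse_text:
            fields.append(f"text: {page.parse_text}")
        if fields:
            records.append(("metadata", page.url, "\n".join(fields).encode("utf-8")))
    return records


if __name__ == "__main__":
    demo = [
        Page("http://example.org/", "success", b"<html></html>",
             {"title": "Example"}, "Example page"),
        Page("http://example.org/missing", "gone"),
    ]
    for record in export_records(demo, True, True, True):
        print(record)
```

With all three options enabled, the failed fetch produces no record at all, and the successful page produces one response record followed by a single metadata record carrying both the parse metadata and the extracted text.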



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
