Markus Jelsma created NUTCH-2865:
------------------------------------

             Summary: WARC exporter support for metadata and dropping empty responses
                 Key: NUTCH-2865
                 URL: https://issues.apache.org/jira/browse/NUTCH-2865
             Project: Nutch
          Issue Type: Improvement
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.19

WARCExporter is a handy tool for dumping segments. Unfortunately, it also emits WARC records for statuses other than success or notmodified, which account for a decent number of records in each crawl cycle. It also does not emit parse metadata or extracted text. With this patch it does.

This patch adds three switches:
* -omitEmptyResponses to only emit records with status success or notmodified
* -includeParseData to also emit parse metadata as a WARC metadata record
* -includeParseText to also emit extracted text as a WARC metadata record

Both metadata objects are stored in the same WARC metadata record to save space.
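
The behaviour of the three switches can be illustrated with a small standalone sketch. This is not the patch itself and does not use Nutch classes: `WarcExportSketch`, `FetchStatus`, `shouldEmit`, `buildMetadataPayload` and the `parseText` field name are all hypothetical names chosen for illustration, and the "key: value" payload layout is an assumption about how both objects might share one metadata record.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only; none of these names come from the Nutch code base.
final class WarcExportSketch {

    // Simplified stand-in for the protocol status of a fetched page.
    enum FetchStatus { SUCCESS, NOTMODIFIED, GONE, MOVED, EXCEPTION }

    // Models -omitEmptyResponses: when the switch is set, only success and
    // notmodified responses produce WARC records; otherwise everything does.
    static boolean shouldEmit(FetchStatus status, boolean omitEmptyResponses) {
        if (!omitEmptyResponses) {
            return true;
        }
        return status == FetchStatus.SUCCESS || status == FetchStatus.NOTMODIFIED;
    }

    // Models -includeParseData and -includeParseText: both objects are written
    // into a single payload so that only one metadata record is needed.
    // An empty result means no metadata record would be written at all.
    static String buildMetadataPayload(Map<String, String> parseData, String parseText,
                                       boolean includeParseData, boolean includeParseText) {
        StringBuilder payload = new StringBuilder();
        if (includeParseData && parseData != null) {
            for (Map.Entry<String, String> entry : parseData.entrySet()) {
                payload.append(entry.getKey()).append(": ")
                       .append(entry.getValue()).append("\r\n");
            }
        }
        if (includeParseText && parseText != null) {
            // Assumed field name; the real exporter may label the text differently.
            payload.append("parseText: ").append(parseText).append("\r\n");
        }
        return payload.toString();
    }
}
```

With both include switches set, parse metadata and extracted text end up in one payload; with neither set, the payload is empty and the sketch would skip the metadata record.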