[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447404#comment-17447404 ]
ASF GitHub Bot commented on NUTCH-2865:
---------------------------------------

sebastian-nagel opened a new pull request #706:
URL: https://github.com/apache/nutch/pull/706

(PR based on patch contributed by Markus Jelsma, see [NUTCH-2865](https://issues.apache.org/jira/browse/NUTCH-2865))

> WARC exporter support for metadata and dropping empty responses
> ---------------------------------------------------------------
>
>                 Key: NUTCH-2865
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2865
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.19
>
>         Attachments: NUTCH-2865.patch, NUTCH-2865.patch, NUTCH-2865.patch
>
>
> WARCExporter is a handy tool to dump the segments. Unfortunately it also
> emits WARC records for statuses other than success or notmodified, which
> account for a decent number of records in each crawl cycle. It also did not
> emit parse metadata or extracted text. With this patch it does.
>
> This patch adds three switches:
> * -includeOnlySuccessfulResponses to only emit records of success or
> notmodified
> * -includeParseData to also emit parse metadata as WARC metadata record
> * -includeParseText to also emit extracted text as WARC metadata
> Both metadata objects are stored in the same WARC metadata record to save
> space.
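
The behaviour of the three switches described above can be illustrated with a minimal, self-contained sketch. This is not the code from the patch or from WARCExporter: the class `WarcExportOptionsSketch`, the `FetchStatus` enum and the method names are hypothetical and exist only for illustration. Only the three switch names are taken from the issue description.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; hypothetical class, not part of Nutch. */
public class WarcExportOptionsSketch {

  /** Hypothetical stand-in for a fetch status, not Nutch's own status type. */
  enum FetchStatus { SUCCESS, NOTMODIFIED, GONE, REDIRECT, EXCEPTION }

  boolean onlySuccessfulResponses = false;
  boolean includeParseData = false;
  boolean includeParseText = false;

  /** Arguments that are not one of the three switches (e.g. output, segments). */
  final List<String> positional = new ArrayList<>();

  /** Recognises the three switches named in the issue; all default to off. */
  static WarcExportOptionsSketch parse(String[] args) {
    WarcExportOptionsSketch options = new WarcExportOptionsSketch();
    for (String arg : args) {
      switch (arg) {
        case "-includeOnlySuccessfulResponses":
          options.onlySuccessfulResponses = true;
          break;
        case "-includeParseData":
          options.includeParseData = true;
          break;
        case "-includeParseText":
          options.includeParseText = true;
          break;
        default:
          options.positional.add(arg);
      }
    }
    return options;
  }

  /** Without the filter every status is emitted; with it, only success or notmodified. */
  boolean shouldEmit(FetchStatus status) {
    if (!onlySuccessfulResponses) {
      return true;
    }
    return status == FetchStatus.SUCCESS || status == FetchStatus.NOTMODIFIED;
  }

  /** Parse data and parse text share one metadata record, so one is needed if either is on. */
  boolean needsMetadataRecord() {
    return includeParseData || includeParseText;
  }
}
```

With no switches given, the sketch emits a record for every status and writes no metadata record, which matches the behaviour the issue describes as the situation before the patch.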