[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354530#comment-17354530 ]
Sebastian Nagel edited comment on NUTCH-2865 at 5/31/21, 3:57 PM:
------------------------------------------------------------------

Hi [~markus17],

there are still print statements in the second patch.

Yes, good idea. Some comments:

bq. Both metadata objects are stored in the same WARC metadata record to save space.

To save space it would be better to compress the WARC files, but doing this properly (per record) would require another WARC writing library (e.g. [jwarc|https://github.com/iipc/jwarc]).

Having everything in a single record makes it hard to read back:
{noformat}
parseText=text starts
continues
continues
ends
parseData=key=value
key=value
...
{noformat}

Why not put text and parse data into two records with well-defined and suitable content types?
- for the text:
{noformat}
WARC/1.0
WARC-Type: conversion
WARC-Date: ...
WARC-Refers-To: <urn:uuid:points_to_response_record>
Content-Type: text/plain
Content-Length: ...

text starts
continues
continues
ends
{noformat}
- for the parse metadata (btw., what about content metadata?):
{noformat}
WARC/1.0
WARC-Type: metadata
WARC-Date: ...
WARC-Record-ID: ...
WARC-Refers-To: <urn:uuid:points_to_response_record>
Content-Type: application/json
Content-Length: ...

{"key": "value", ...}
{noformat}

Ideally, the conversion and metadata records are linked via UUID with the response records.

bq. -omitEmptyResponses

Sometimes (if not frequently) 404s and redirects are not empty but include a payload (a customized error page). Maybe "-includeOnlySuccessfulResponses"?

bq. notmodified

Do you mean not-modified as detected via signature comparison? HTTP 304 responses (ProtocolStatus.NOTMODIFIED) definitely have no payload (response body). See also [WARC revisit records|https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#revisit] ([revisit example|https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#example-of-revisit-record]).
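The two-record layout proposed above can be sketched in a few lines of Python. This is only an illustration of the record format under stated assumptions, not Nutch's WARCExporter code and not the jwarc API: the helper names `warc_record`, `parse_records` and `gzip_per_record` are hypothetical, and every record gets a WARC-Record-ID here although the text example above omits it. The last helper shows what per-record compression means: each record becomes its own gzip member.

```python
import gzip
import json
import uuid
from datetime import datetime, timezone


def warc_record(warc_type, content_type, payload, refers_to=None):
    """Serialize one WARC/1.0 record (hypothetical helper).

    Returns (record_id, record_bytes). `payload` must be bytes so that
    Content-Length counts bytes, not characters.
    """
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        f"WARC-Type: {warc_type}",
        f"WARC-Date: {date}",
        f"WARC-Record-ID: {record_id}",
    ]
    if refers_to is not None:
        # Link back to the response record this record was derived from.
        headers.append(f"WARC-Refers-To: {refers_to}")
    headers.append(f"Content-Type: {content_type}")
    headers.append(f"Content-Length: {len(payload)}")
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLFs after the content block.
    return record_id, head + payload + b"\r\n\r\n"


def parse_records(response_id, parse_text, parse_data):
    """Build a conversion record (plain text) and a metadata record (JSON)."""
    _, text_rec = warc_record(
        "conversion", "text/plain", parse_text.encode("utf-8"),
        refers_to=response_id)
    _, meta_rec = warc_record(
        "metadata", "application/json", json.dumps(parse_data).encode("utf-8"),
        refers_to=response_id)
    return text_rec, meta_rec


def gzip_per_record(records):
    """Compress every record as a separate gzip member and concatenate them."""
    return b"".join(gzip.compress(rec) for rec in records)
```

With this layout a reader can pick the extracted text or the JSON metadata by WARC-Type and Content-Type alone, and does not have to split a combined `parseText=... parseData=...` block.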
> WARC exporter support for metadata and dropping empty responses
> ---------------------------------------------------------------
>
> Key: NUTCH-2865
> URL: https://issues.apache.org/jira/browse/NUTCH-2865
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.19
>
> Attachments: NUTCH-2865.patch, NUTCH-2865.patch
>
>
> WARCExporter is a handy tool to dump the segments. Unfortunately it also
> emits WARC records for statuses other than success or notmodified, which
> account for a decent number of records in each crawl cycle. It also doesn't
> emit parsed metadata or extracted text. It does now.
>
> This patch adds three switches:
> * -omitEmptyResponses to only emit records of success or notmodified
> * -includeParseData to also emit parse metadata as a WARC metadata record
> * -includeParseText to also emit extracted text as a WARC metadata record
> Both metadata objects are stored in the same WARC metadata record to save
> space.
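The description defines -omitEmptyResponses by fetch status, while its name suggests a check on the payload. A small sketch shows that the two readings select different records. Everything here is hypothetical: the `Response` class, the plain string status labels and both predicates are illustrations only, not Nutch's ProtocolStatus API or the patch's code.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: str     # hypothetical label: "success", "notmodified", "notfound", ...
    payload: bytes  # response body, possibly empty


def keep_by_status(resp, allowed=("success", "notmodified")):
    """Status-based filter, following the wording of the description."""
    return resp.status in allowed


def keep_non_empty(resp):
    """Payload-based filter, following what the switch name suggests."""
    return len(resp.payload) > 0


responses = [
    Response("success", b"<html>page</html>"),
    Response("notmodified", b""),                      # HTTP 304: no body
    Response("notfound", b"<html>custom 404</html>"),  # error page with a body
]

# Statuses of the records each reading of the switch would keep.
by_status = [r.status for r in responses if keep_by_status(r)]
non_empty = [r.status for r in responses if keep_non_empty(r)]
```

The status filter keeps the body-less 304 and drops the 404 that carries an error page; the payload filter does the opposite, which is why a name that states the status criterion would be less ambiguous.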