[jira] [Commented] (NUTCH-2866) MetaData.toString() should return "key=value ..."
[ https://issues.apache.org/jira/browse/NUTCH-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354539#comment-17354539 ]

ASF GitHub Bot commented on NUTCH-2866:
---------------------------------------

sebastian-nagel opened a new pull request #648:
URL: https://github.com/apache/nutch/pull/648

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> MetaData.toString() should return "key=value ..."
> -------------------------------------------------
>
>                 Key: NUTCH-2866
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2866
>             Project: Nutch
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 1.17
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.19
>
>
> The default implementation of Metadata.toString() returns {{key1 value1=key2
> value2}}. This should be {{key1=value1 key2=value2}}, of course. Introduced
> with NUTCH-2788 (commit
> [e3f7725|https://github.com/apache/nutch/commit/e3f7725#diff-d3d9833dabc24485aae013c756fc1c5266fcd782fa16caf6c5e0e9f109ab1eeeR237])
> - :(, seen while reviewing NUTCH-2865.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[GitHub] [nutch] sebastian-nagel opened a new pull request #648: NUTCH-2866 Fix MetaData.toString() to return "key=value ..."
sebastian-nagel opened a new pull request #648:
URL: https://github.com/apache/nutch/pull/648
[jira] [Comment Edited] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354530#comment-17354530 ]

Sebastian Nagel edited comment on NUTCH-2865 at 5/31/21, 3:57 PM:
------------------------------------------------------------------

Hi [~markus17],

there are still print statements in the second patch.

Yes, good idea. Some comments:

bq. Both metadata objects are stored in the same WARC metadata record to save space.

To save space it would be better to compress the WARC files, but doing this properly (per record) would require using another WARC writing library (e.g. [jwarc|https://github.com/iipc/jwarc]).

Having everything in a single record isn't easy to read again:
{noformat}
parseText=text starts
continues
continues
ends
parseData=key=value key=value ...
{noformat}

Why not put text and parse data into two records with well-defined and suitable content types?
- for the text:
{noformat}
WARC/1.0
WARC-Type: conversion
WARC-Date: ...
WARC-Refers-To:
Content-Type: text/plain
Content-Length: ...

text starts
continues
continues
ends
{noformat}
- parse metadata (btw. what about content metadata?):
{noformat}
WARC/1.0
WARC-Type: metadata
WARC-Date: ...
WARC-Record-ID: ...
WARC-Refers-To:
Content-Type: application/json
Content-Length: ...

{"key": "value", ...}
{noformat}

Ideally, the conversion and metadata records are linked via UUID with the response records.

bq. -omitEmptyResponses

Sometimes (if not frequently) 404s and redirects are not empty but include a payload (a customized error page). Maybe "-includeOnlySuccessfulResponses"?

bq. notmodified

You mean not-modified detected via signature comparison? HTTP 304 responses (ProtocolStatus.NOTMODIFIED) definitely have no payload (response body). See also [WARC revisit records|https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#revisit] ([revisit example|https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#example-of-revisit-record]).
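The two-record layout suggested in the comment above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not code from the NUTCH-2865 patch or from any WARC library: the class `WarcRecordSketch`, its methods, the sample date and the sample payloads are all hypothetical. It adds a `WARC-Record-ID` to the conversion record as well, since the WARC format requires that header on every record, and it writes record IDs in the common `<urn:uuid:...>` form.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical helper that renders WARC records as strings; not part of Nutch. */
class WarcRecordSketch {
  static final String CRLF = "\r\n";

  /** Creates a record ID in the widely used "<urn:uuid:...>" form. */
  static String newRecordId() {
    return "<urn:uuid:" + UUID.randomUUID() + ">";
  }

  /** Builds one record: header lines, an empty line, the payload, two CRLFs. */
  static String record(String type, String date, String recordId,
      String refersTo, String contentType, String payload) {
    // Content-Length counts bytes, not characters.
    int length = payload.getBytes(StandardCharsets.UTF_8).length;
    return "WARC/1.0" + CRLF
        + "WARC-Type: " + type + CRLF
        + "WARC-Date: " + date + CRLF
        + "WARC-Record-ID: " + recordId + CRLF
        + "WARC-Refers-To: " + refersTo + CRLF
        + "Content-Type: " + contentType + CRLF
        + "Content-Length: " + length + CRLF
        + CRLF
        + payload
        + CRLF + CRLF;
  }

  public static void main(String[] args) {
    // ID of the response record that both derived records point back to.
    String responseId = newRecordId();
    String date = "2021-05-31T15:57:00Z"; // sample value

    // Extracted text goes into a "conversion" record as plain text.
    System.out.print(record("conversion", date, newRecordId(), responseId,
        "text/plain", "text starts\ncontinues\nends"));

    // Parse metadata goes into a separate "metadata" record as JSON.
    System.out.print(record("metadata", date, newRecordId(), responseId,
        "application/json", "{\"key\": \"value\"}"));
  }
}
```

Because each record carries its own content type and length, a reader can pick out the text or the JSON without having to parse a combined `parseText=... parseData=...` block.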
> WARC exporter support for metadata and dropping empty responses
> ---------------------------------------------------------------
>
>                 Key: NUTCH-2865
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2865
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.19
>
>         Attachments: NUTCH-2865.patch, NUTCH-2865.patch
>
>
> WARCExporter is a handy tool to dump the segments. Unfortunately it also
> emits WARC records for statuses other than success or notmodified, which
> accounts for a decent number in each crawl cycle. It also doesn't emit parsed
> metadata or extracted text. It does now.
>
> This patch adds three switches:
> * -omitEmptyResponses to only emit records of success or notmodified
> * -includeParseData to also emit parse metadata as WARC metadata record
> * -includeParseText to also emit extracted text as WARC metadata
> Both metadata objects are stored in the same WARC metadata record to save
> space.
[jira] [Commented] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses
[ https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354530#comment-17354530 ]

Sebastian Nagel commented on NUTCH-2865:
----------------------------------------

Hi [~markus17],

there are still print statements in the second patch.

Yes, good idea.

bq. Both metadata objects are stored in the same WARC metadata record to save space.

To save space it would be better to compress the WARC files, but doing this properly (per record) would require using another WARC writing library (e.g. [jwarc|https://github.com/iipc/jwarc]).

Having everything in a single record isn't easy to read again:
{noformat}
parseText=text starts
continues
continues
ends
parseData=key=value key=value ...
{noformat}

Why not put text and parse data into two records with well-defined and suitable content types?
- for the text:
{noformat}
WARC/1.0
WARC-Type: conversion
WARC-Date: ...
WARC-Refers-To:
Content-Type: text/plain
Content-Length: ...

text starts
continues
continues
ends
{noformat}
- parse metadata (btw. what about content metadata?):
{noformat}
WARC/1.0
WARC-Type: metadata
WARC-Date: ...
WARC-Record-ID: ...
WARC-Refers-To:
Content-Type: application/json
Content-Length: ...

{"key": "value", ...}
{noformat}

Ideally, the conversion and metadata records are linked via UUID with the response records.

bq. -omitEmptyResponses

Sometimes (if not frequently) 404s and redirects are not empty but include a payload (a customized error page). Maybe "-includeOnlySuccessfulResponses"?

bq. notmodified

You mean not-modified detected via signature comparison? HTTP 304 responses (ProtocolStatus.NOTMODIFIED) definitely have no payload (response body). See also [WARC revisit records|https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#revisit] ([revisit example|https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#example-of-revisit-record]).
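For readers unfamiliar with the "per record" compression mentioned in the comment: the usual convention for compressed WARC files is that every record is written as its own gzip member and the members are simply concatenated, so a reader can seek to a record's offset and decompress just that record. The sketch below shows the idea with plain `java.util.zip`. It is not how Nutch's WARCExporter or jwarc implement it, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical illustration of per-record gzip compression; not Nutch code. */
class PerRecordGzipSketch {

  /** Compresses each record into its own gzip member and concatenates them. */
  static byte[] compressRecords(List<String> records) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      for (String record : records) {
        ByteArrayOutputStream member = new ByteArrayOutputStream();
        // Closing the stream writes the gzip trailer and ends this member.
        try (GZIPOutputStream gzip = new GZIPOutputStream(member)) {
          gzip.write(record.getBytes(StandardCharsets.UTF_8));
        }
        out.write(member.toByteArray());
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads all concatenated gzip members back as one string. */
  static String decompressAll(byte[] data) {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] file = compressRecords(List.of("record one\r\n\r\n", "record two\r\n\r\n"));
    System.out.print(decompressAll(file));
  }
}
```

A file written this way is still a valid gzip stream as a whole, which is why ordinary tools can decompress it end to end while WARC-aware tools can jump to individual records.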
> WARC exporter support for metadata and dropping empty responses
> ---------------------------------------------------------------
>
>                 Key: NUTCH-2865
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2865
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.19
>
>         Attachments: NUTCH-2865.patch, NUTCH-2865.patch
>
>
> WARCExporter is a handy tool to dump the segments. Unfortunately it also
> emits WARC records for statuses other than success or notmodified, which
> accounts for a decent number in each crawl cycle. It also doesn't emit parsed
> metadata or extracted text. It does now.
>
> This patch adds three switches:
> * -omitEmptyResponses to only emit records of success or notmodified
> * -includeParseData to also emit parse metadata as WARC metadata record
> * -includeParseText to also emit extracted text as WARC metadata
> Both metadata objects are stored in the same WARC metadata record to save
> space.
[jira] [Created] (NUTCH-2866) MetaData.toString() should return "key=value ..."
Sebastian Nagel created NUTCH-2866:
--------------------------------------

             Summary: MetaData.toString() should return "key=value ..."
                 Key: NUTCH-2866
                 URL: https://issues.apache.org/jira/browse/NUTCH-2866
             Project: Nutch
          Issue Type: Bug
          Components: metadata
    Affects Versions: 1.17
            Reporter: Sebastian Nagel
             Fix For: 1.19

The default implementation of Metadata.toString() returns {{key1 value1=key2 value2}}. This should be {{key1=value1 key2=value2}}, of course. Introduced with NUTCH-2788 (commit [e3f7725|https://github.com/apache/nutch/commit/e3f7725#diff-d3d9833dabc24485aae013c756fc1c5266fcd782fa16caf6c5e0e9f109ab1eeeR237]) - :(, seen while reviewing NUTCH-2865.
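To make the intended output format concrete, here is a minimal Java sketch of a `key=value ...` rendering for a multi-valued metadata container. It is not the Nutch `Metadata` class and not the change in pull request #648: the class `MetadataToStringSketch` and its `format` method are hypothetical stand-ins built on a plain map. The reported output `key1 value1=key2 value2` is what such a loop would presumably produce if the two separators were swapped.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Simplified stand-in for a multi-valued metadata container; not the Nutch class. */
class MetadataToStringSketch {

  /** Renders every key/value pair as "key=value", with pairs separated by one space. */
  static String format(Map<String, List<String>> metadata) {
    StringJoiner pairs = new StringJoiner(" "); // separator between pairs
    for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
      for (String value : entry.getValue()) {
        pairs.add(entry.getKey() + "=" + value); // separator inside a pair
      }
    }
    return pairs.toString();
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps insertion order, so the output order is predictable.
    Map<String, List<String>> metadata = new LinkedHashMap<>();
    metadata.put("key1", List.of("value1"));
    metadata.put("key2", List.of("value2"));
    System.out.println(format(metadata)); // prints "key1=value1 key2=value2"
  }
}
```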