[jira] [Commented] (NUTCH-2866) MetaData.toString() should return "key=value ..."

2021-05-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354539#comment-17354539
 ] 

ASF GitHub Bot commented on NUTCH-2866:
---

sebastian-nagel opened a new pull request #648:
URL: https://github.com/apache/nutch/pull/648


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> MetaData.toString() should return "key=value ..."
> -
>
> Key: NUTCH-2866
> URL: https://issues.apache.org/jira/browse/NUTCH-2866
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.17
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.19
>
>
> The default implementation of Metadata.toString() returns {{key1 value1=key2 
> value2}}. This should be {{key1=value1 key2=value2}}, of course. Introduced 
> with NUTCH-2788 (commit 
> [e3f7725|https://github.com/apache/nutch/commit/e3f7725#diff-d3d9833dabc24485aae013c756fc1c5266fcd782fa16caf6c5e0e9f109ab1eeeR237])
>  - :(, seen while reviewing NUTCH-2865.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [nutch] sebastian-nagel opened a new pull request #648: NUTCH-2866 Fix MetaData.toString() to return "key=value ..."

2021-05-31 Thread GitBox


sebastian-nagel opened a new pull request #648:
URL: https://github.com/apache/nutch/pull/648


   






[jira] [Comment Edited] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-05-31 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354530#comment-17354530
 ] 

Sebastian Nagel edited comment on NUTCH-2865 at 5/31/21, 3:57 PM:
--

Hi [~markus17], there are still print statements in the second patch. Yes, good 
idea. Some comments: 

bq. Both metadata objects are stored in the same WARC metadata record to save 
space.

To save space it would be better to compress the WARC files, but doing this 
properly (per record) would require another WARC writing library (e.g. 
[jwarc|https://github.com/iipc/jwarc]).
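
To illustrate what "per record" means here: each WARC record is written as its own gzip member and the members are concatenated, so a reader can still decompress the file front to back. The following is only a minimal sketch using java.util.zip, not the jwarc API and not what WARCExporter does; the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only.
class PerRecordGzipSketch {

  /** Compresses every record as a separate gzip member and concatenates the members. */
  static byte[] compressPerRecord(List<byte[]> records) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] record : records) {
      // Closing the GZIPOutputStream writes the gzip trailer of this member.
      // ByteArrayOutputStream.close() is a no-op, so 'out' stays usable.
      try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
        gz.write(record);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return out.toByteArray();
  }

  /** Reads all concatenated gzip members back into one byte array. */
  static byte[] decompressAll(byte[] compressed) {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With a real file stream the output must not be closed after each member; GZIPOutputStream.finish() would be used instead of close() in that case.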

With everything in a single record, the content isn't easy to read back:
{noformat}
parseText=text starts
continues
continues
ends

parseData=key=value key=value ...
{noformat}

Why not put text and parse data into two records with well-defined and suitable 
content types?

- for the text:
{noformat}
WARC/1.0
WARC-Type: conversion
WARC-Date: ...
WARC-Refers-To: 
Content-Type: text/plain
Content-Length: ...

text starts
continues
continues
ends
{noformat}

- parse metadata (btw. what about content metadata?):
{noformat}
WARC/1.0
WARC-Type: metadata
WARC-Date: ...
WARC-Record-ID: ...
WARC-Refers-To: 
Content-Type: application/json
Content-Length: ...

{"key": "value", ...}
{noformat}

Ideally, the conversion and metadata records are linked via UUID with the 
response records.
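
A rough sketch of how such records could be assembled and linked, assuming the response record's ID is already known. This is not the WARCExporter code; the class name, method names and parameters are hypothetical. Only the header field names and the urn:uuid form of the record ID follow the WARC format.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.UUID;

// Hypothetical helper, for illustration only.
class WarcRecordSketch {
  private static final String CRLF = "\r\n";

  /** Creates a new record ID of the form <urn:uuid:...>. */
  static String newRecordId() {
    return "<urn:uuid:" + UUID.randomUUID() + ">";
  }

  /**
   * Builds one WARC record: header lines, an empty line, the payload and
   * two trailing CRLFs. refersTo is the WARC-Record-ID of the response record.
   */
  static byte[] buildRecord(String warcType, String recordId, String refersTo,
      String contentType, Instant date, String body) {
    byte[] payload = body.getBytes(StandardCharsets.UTF_8);
    StringBuilder header = new StringBuilder();
    header.append("WARC/1.0").append(CRLF);
    header.append("WARC-Type: ").append(warcType).append(CRLF);
    // Instant.toString() yields e.g. 2021-05-31T15:57:00Z once truncated to seconds
    header.append("WARC-Date: ").append(date.truncatedTo(ChronoUnit.SECONDS)).append(CRLF);
    header.append("WARC-Record-ID: ").append(recordId).append(CRLF);
    header.append("WARC-Refers-To: ").append(refersTo).append(CRLF);
    header.append("Content-Type: ").append(contentType).append(CRLF);
    // Content-Length counts bytes, not characters
    header.append("Content-Length: ").append(payload.length).append(CRLF);
    header.append(CRLF);

    byte[] head = header.toString().getBytes(StandardCharsets.UTF_8);
    byte[] end = (CRLF + CRLF).getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(head, 0, head.length);
    out.write(payload, 0, payload.length);
    out.write(end, 0, end.length);
    return out.toByteArray();
  }
}
```

The text would then go into a record built with warcType "conversion" and contentType "text/plain", the parse metadata into one with "metadata" and "application/json", both passing the same refersTo value.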

bq. -omitEmptyResponses

Sometimes (if not frequently) 404s and redirects are not empty but include a 
payload (a customized error page). Maybe "-includeOnlySuccessfulResponses"?

bq. notmodified

You mean not-modified detected via signature comparison? HTTP 304 responses 
(ProtocolStatus.NOTMODIFIED) definitely have no payload (response body). See 
also [WARC revisit records|https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#revisit] 
([revisit example|https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#example-of-revisit-record]).




> WARC exporter support for metadata and dropping empty responses
> ---
>
> Key: NUTCH-2865
> URL: https://issues.apache.org/jira/browse/NUTCH-2865
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.19
>
> Attachments: NUTCH-2865.patch, NUTCH-2865.patch
>
>
> WARCExporter is a handy tool to dump the segments. Unfortunately it also 
> emits WARC records for statuses other than success or notmodified, which 
> account for a decent number of records in each crawl cycle. It also didn't 
> emit parse metadata or extracted text; with this patch it does.
>  
> This patch adds three switches:
>  * -omitEmptyResponses to only emit records of success or notmodified
>  * -includeParseData to also emit parse metadata as WARC metadata record
>  * -includeParseText to also emit extracted text as WARC metadata
> Both metadata objects are stored in the same WARC metadata record to save 
> space.




[jira] [Commented] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-05-31 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354530#comment-17354530
 ] 

Sebastian Nagel commented on NUTCH-2865:





[jira] [Created] (NUTCH-2866) MetaData.toString() should return "key=value ..."

2021-05-31 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2866:
--

 Summary: MetaData.toString() should return "key=value ..."
 Key: NUTCH-2866
 URL: https://issues.apache.org/jira/browse/NUTCH-2866
 Project: Nutch
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.17
Reporter: Sebastian Nagel
 Fix For: 1.19


The default implementation of Metadata.toString() returns {{key1 value1=key2 
value2}}. This should be {{key1=value1 key2=value2}}, of course. Introduced 
with NUTCH-2788 (commit 
[e3f7725|https://github.com/apache/nutch/commit/e3f7725#diff-d3d9833dabc24485aae013c756fc1c5266fcd782fa16caf6c5e0e9f109ab1eeeR237])
 - :(, seen while reviewing NUTCH-2865.
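
For illustration, a minimal sketch of the intended output format. This is a simplified stand-in class, not the actual org.apache.nutch.metadata.Metadata implementation and not the fix from the pull request; it only assumes that each name maps to one or more values.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Simplified stand-in, for illustration only.
class MetadataToStringSketch {
  // LinkedHashMap keeps insertion order so the output is predictable
  private final Map<String, String[]> metadata = new LinkedHashMap<>();

  /** Adds a value for the given name, keeping any values already stored. */
  void add(String name, String value) {
    String[] old = metadata.get(name);
    if (old == null) {
      metadata.put(name, new String[] { value });
    } else {
      String[] extended = Arrays.copyOf(old, old.length + 1);
      extended[old.length] = value;
      metadata.put(name, extended);
    }
  }

  /** Joins name and value with "=" and separates the pairs with a space. */
  @Override
  public String toString() {
    StringJoiner joiner = new StringJoiner(" ");
    for (Map.Entry<String, String[]> entry : metadata.entrySet()) {
      for (String value : entry.getValue()) {
        joiner.add(entry.getKey() + "=" + value);
      }
    }
    return joiner.toString();
  }
}
```

The reported bug amounts to the two separators being swapped: a space between name and value, and "=" between the pairs.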


