[jira] [Commented] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773184#comment-17773184
 ] 

ASF GitHub Bot commented on NUTCH-3012:
---

sebastian-nagel opened a new pull request, #787:
URL: https://github.com/apache/nutch/pull/787

   Use UTF-8 as fall-back encoding when stringifying the content of unparsed 
documents.
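
   The fall-back described above might look roughly like the following
   sketch. It is an illustration only, not the code from the pull request:
   the class FallbackDecode and the helper decode are hypothetical names,
   and the sketch assumes the charset of an unparsed document is simply
   null at the point where the content is stringified.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FallbackDecode {

  /**
   * Decodes raw bytes with the given charset, or with UTF-8 when no
   * charset is known. Hypothetical helper for illustration only.
   */
  public static String decode(byte[] raw, Charset charset) {
    // Assumption: a null charset means "no encoding was detected".
    Charset effective = (charset != null) ? charset : StandardCharsets.UTF_8;
    return new String(raw, effective);
  }

  public static void main(String[] args) {
    byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
    // No charset known: decoded as UTF-8 instead of failing.
    System.out.println(decode(raw, null));
    // An explicitly supplied charset is still honoured.
    System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
  }
}
```

   A fall-back of this kind keeps the dump running; the decoded text may
   still be wrong if the document is not actually UTF-8.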
   




> SegmentReader when dumping with option -recode: NPE on unparsed documents
> -
>
> Key: NUTCH-3012
> URL: https://issues.apache.org/jira/browse/NUTCH-3012
> Project: Nutch
> Issue Type: Bug
> Components: segment
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>
> SegmentReader when called with the flag {{-recode}} fails with an NPE when 
> trying to stringify the raw content of unparsed documents:
> {noformat}
> $> bin/nutch readseg -dump crawl/segments/20231009065431 crawl/segreader/20231009065431 -recode
> ...
> 2023-10-09 07:55:18,451 INFO mapreduce.Job: Task Id : attempt_1696825862783_0005_r_00_0, Status : FAILED
> Error: java.lang.NullPointerException: charset
>     at java.base/java.lang.String.<init>(String.java:504)
>     at java.base/java.lang.String.<init>(String.java:561)
>     at org.apache.nutch.protocol.Content.toString(Content.java:297)
>     at org.apache.nutch.segment.SegmentReader$InputCompatReducer.reduce(SegmentReader.java:189)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] NUTCH-3012 SegmentReader when dumping with option -recode: NPE on unparsed documents [nutch]

2023-10-09 Thread via GitHub


sebastian-nagel opened a new pull request, #787:
URL: https://github.com/apache/nutch/pull/787

   Use UTF-8 as fall-back encoding when stringifying the content of unparsed 
documents.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3012:
---
Description: 
SegmentReader when called with the flag {{-recode}} fails with an NPE when 
trying to stringify the raw content of unparsed documents:
{noformat}
$> bin/nutch readseg -dump crawl/segments/20231009065431 crawl/segreader/20231009065431 -recode
...
2023-10-09 07:55:18,451 INFO mapreduce.Job: Task Id : attempt_1696825862783_0005_r_00_0, Status : FAILED
Error: java.lang.NullPointerException: charset
    at java.base/java.lang.String.<init>(String.java:504)
    at java.base/java.lang.String.<init>(String.java:561)
    at org.apache.nutch.protocol.Content.toString(Content.java:297)
    at org.apache.nutch.segment.SegmentReader$InputCompatReducer.reduce(SegmentReader.java:189)
{noformat}
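
The failure in the trace above can be reproduced outside Nutch with a small sketch. It assumes, based on the exception message "charset" and the two String constructor frames, that the content bytes are passed to the String constructor that takes a Charset, and that the charset is null for unparsed documents. The class NullCharsetDemo and the helper failsWithNpe are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NullCharsetDemo {

  /** Returns true if decoding with the given charset throws a NullPointerException. */
  public static boolean failsWithNpe(byte[] raw, Charset charset) {
    try {
      // String(byte[], Charset) rejects a null charset with an NPE.
      new String(raw, charset);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    byte[] raw = "<html></html>".getBytes(StandardCharsets.UTF_8);
    System.out.println(failsWithNpe(raw, null));                   // prints "true"
    System.out.println(failsWithNpe(raw, StandardCharsets.UTF_8)); // prints "false"
  }
}
```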






[jira] [Updated] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-09 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3012:
---
Summary: SegmentReader when dumping with option -recode: NPE on unparsed 
documents  (was: SegmentReader when dumping with option -recode: NPE on 
documents without charset defined)

>






[jira] [Created] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on documents without charset defined

2023-10-09 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3012:
--

 Summary: SegmentReader when dumping with option -recode: NPE on 
documents without charset defined
 Key: NUTCH-3012
 URL: https://issues.apache.org/jira/browse/NUTCH-3012
 Project: Nutch
  Issue Type: Bug
  Components: segment
Affects Versions: 1.19
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.20





