[jira] [Commented] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents
[ https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773184#comment-17773184 ]

ASF GitHub Bot commented on NUTCH-3012:
---

sebastian-nagel opened a new pull request, #787:
URL: https://github.com/apache/nutch/pull/787

   Use UTF-8 as fall-back encoding when stringifying the content of unparsed documents.

> SegmentReader when dumping with option -recode: NPE on unparsed documents
> -
>
> Key: NUTCH-3012
> URL: https://issues.apache.org/jira/browse/NUTCH-3012
> Project: Nutch
> Issue Type: Bug
> Components: segment
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>
> SegmentReader, when called with the flag {{-recode}}, fails with an NPE when
> trying to stringify the raw content of unparsed documents:
> {noformat}
> $> bin/nutch readseg -dump crawl/segments/20231009065431
> crawl/segreader/20231009065431 -recode
> ...
> 2023-10-09 07:55:18,451 INFO mapreduce.Job: Task Id :
> attempt_1696825862783_0005_r_00_0, Status : FAILED
> Error: java.lang.NullPointerException: charset
> at java.base/java.lang.String.<init>(String.java:504)
> at java.base/java.lang.String.<init>(String.java:561)
> at org.apache.nutch.protocol.Content.toString(Content.java:297)
> at
> org.apache.nutch.segment.SegmentReader$InputCompatReducer.reduce(SegmentReader.java:189)
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[PR] NUTCH-3012 SegmentReader when dumping with option -recode: NPE on unparsed documents [nutch]
sebastian-nagel opened a new pull request, #787:
URL: https://github.com/apache/nutch/pull/787

   Use UTF-8 as fall-back encoding when stringifying the content of unparsed documents.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
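For illustration, the failure and the fall-back described in this pull request can be sketched in plain Java. This is not the patch from #787: the class RecodeFallbackSketch and its methods stringify and failsWithoutFallback are hypothetical names, and the sketch assumes only what the stack trace in the issue shows, namely that a null charset reaches a String constructor.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the Nutch code: it mimics stringifying raw
// content for which no charset was detected (e.g. an unparsed document).
final class RecodeFallbackSketch {

  // Decode raw bytes; fall back to UTF-8 when no charset is known.
  static String stringify(byte[] raw, Charset detected) {
    Charset charset = (detected != null) ? detected : StandardCharsets.UTF_8;
    return new String(raw, charset);
  }

  // Returns true if decoding with a null Charset throws a
  // NullPointerException, as in the stack trace reported in the issue.
  static boolean failsWithoutFallback(byte[] raw) {
    try {
      new String(raw, (Charset) null);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
    // Without a fall-back, the String constructor rejects the null charset.
    System.out.println("NPE without fall-back: " + failsWithoutFallback(raw));
    // With the fall-back, the bytes are decoded as UTF-8.
    System.out.println("decoded length: " + stringify(raw, null).length());
  }
}
```

A charset that is actually known still takes precedence in this sketch; UTF-8 is used only when nothing was detected.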
[jira] [Updated] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents
[ https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-3012:
---
Description:
SegmentReader, when called with the flag {{-recode}}, fails with an NPE when trying to stringify the raw content of unparsed documents:
{noformat}
$> bin/nutch readseg -dump crawl/segments/20231009065431 crawl/segreader/20231009065431 -recode
...
2023-10-09 07:55:18,451 INFO mapreduce.Job: Task Id : attempt_1696825862783_0005_r_00_0, Status : FAILED
Error: java.lang.NullPointerException: charset
at java.base/java.lang.String.<init>(String.java:504)
at java.base/java.lang.String.<init>(String.java:561)
at org.apache.nutch.protocol.Content.toString(Content.java:297)
at org.apache.nutch.segment.SegmentReader$InputCompatReducer.reduce(SegmentReader.java:189)
{noformat}
[jira] [Updated] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents
[ https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-3012:
---
Summary: SegmentReader when dumping with option -recode: NPE on unparsed documents (was: SegmentReader when dumping with option -recode: NPE on documents without charset defined)
[jira] [Created] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on documents without charset defined
Sebastian Nagel created NUTCH-3012:
--

Summary: SegmentReader when dumping with option -recode: NPE on documents without charset defined
Key: NUTCH-3012
URL: https://issues.apache.org/jira/browse/NUTCH-3012
Project: Nutch
Issue Type: Bug
Components: segment
Affects Versions: 1.19
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.20