[jira] [Commented] (CONNECTORS-1655) Web connector - UnsupportedEncodingException utf-8

2020-10-16 Thread Julien Massiera (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215375#comment-17215375
 ] 

Julien Massiera commented on CONNECTORS-1655:
-

Thanks for the fix!

> Web connector - UnsupportedEncodingException utf-8
> --
>
> Key: CONNECTORS-1655
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1655
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.17
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
> Fix For: ManifoldCF 2.18
>
>
> When crawling some sites (for instance this one: 
> [http://www.antibes-juanlespins.com/] ) the job manages to index some 
> documents, but then stops with the following error:
> Error: IO error: utf-8; filename=rseventspro_rss20_56.xml
> Here is one of the MCF stack traces:
> Exception tossed: IO error: utf-8; filename=rseventspro_rss20_56.xml
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: IO error: utf-8; filename=rseventspro_rss20_56.xml
>     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4203) ~[?:?]
>     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:3855) ~[?:?]
>     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:746) ~[?:?]
>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
> Caused by: java.io.UnsupportedEncodingException: utf-8; filename=rseventspro_rss20_56.xml
>     at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71) ~[?:1.8.0_212]
>     at java.io.InputStreamReader.<init>(InputStreamReader.java:100) ~[?:1.8.0_212]
>     at org.apache.manifoldcf.connectorcommon.fuzzyml.DecodingByteReceiver.dealWithBytes(DecodingByteReceiver.java:47) ~[?:?]
>     at org.apache.manifoldcf.connectorcommon.fuzzyml.BOMEncodingDetector.dealWithRemainder(BOMEncodingDetector.java:250) ~[?:?]
>     at org.apache.manifoldcf.connectorcommon.fuzzyml.SingleByteReceiver.dealWithBytes(SingleByteReceiver.java:52) ~[?:?]
>     at org.apache.manifoldcf.connectorcommon.fuzzyml.Parser.parseWithCharsetDetection(Parser.java:74) ~[?:?]
>     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4174) ~[?:?]
>     ... 3 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CONNECTORS-1655) Web connector - UnsupportedEncodingException utf-8

2020-10-16 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215311#comment-17215311
 ] 

Karl Wright commented on CONNECTORS-1655:
-

Ah, but wait a minute: the issue is that the document in question has an 
illegal content-type:

"utf-8; filename=rseventspro_rss20_56.xml"

A patch for that is possible.  
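
For illustration only, here is a minimal sketch of the kind of defensive handling such a patch could apply. It is not the actual ManifoldCF fix: the class name CharsetNameSanitizer, the method resolve, and the choice to fall back to UTF-8 are all assumptions made for this example. The idea is to drop everything after the first ';' in the declared charset value and to fall back to a default when what remains is not a charset the JVM knows.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ManifoldCF.
final class CharsetNameSanitizer {
    private CharsetNameSanitizer() {}

    /**
     * Turns a possibly malformed charset value such as
     * "utf-8; filename=rseventspro_rss20_56.xml" into a usable Charset.
     * Falls back to UTF-8 (an assumption of this sketch) when nothing usable remains.
     */
    static Charset resolve(String rawCharset) {
        if (rawCharset == null) {
            return StandardCharsets.UTF_8;
        }
        // Keep only the part before the first ';' (trailing header parameters).
        int semicolon = rawCharset.indexOf(';');
        String name = (semicolon >= 0 ? rawCharset.substring(0, semicolon) : rawCharset).trim();
        // Strip one pair of surrounding double quotes, if present.
        if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
            name = name.substring(1, name.length() - 1).trim();
        }
        try {
            if (!name.isEmpty() && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are illegal in a charset name; use the fallback.
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        // The value from this issue resolves to the UTF-8 charset.
        System.out.println(resolve("utf-8; filename=rseventspro_rss20_56.xml"));
    }
}
```

Returning a Charset rather than a name lets the caller use the InputStreamReader(InputStream, Charset) constructor, which does not throw UnsupportedEncodingException.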




[jira] [Commented] (CONNECTORS-1655) Web connector - UnsupportedEncodingException utf-8

2020-10-16 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215309#comment-17215309
 ] 

Karl Wright commented on CONNECTORS-1655:
-

Basically, what is failing is the use of the character encoding "utf-8".  As you 
know, this is a very standard charset and almost nothing will work without it.  
It is not on the list of things removed from JDK 11, as far as I am aware.  
Perhaps its name has changed and we therefore need to add a list of names that 
map to it somewhere.  But usage would be strewn throughout ManifoldCF in any case.

But the official Oracle doc says it should be there, and isn't case sensitive 
either:

https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/nio/charset/Charset.html

I'm afraid it's up to you to do research as to why it's not found in your setup.
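
One way to do that research is a small standalone check, sketched below. The class name CharsetLookupCheck and the method canDecodeWith are made up for this example. It attempts to construct an InputStreamReader with a given charset name, which is the same JDK call that appears in the stack trace, and reports whether the name was accepted.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;

// Hypothetical diagnostic, not part of ManifoldCF.
final class CharsetLookupCheck {
    private CharsetLookupCheck() {}

    /** Returns true if the JVM accepts charsetName when constructing an InputStreamReader. */
    static boolean canDecodeWith(String charsetName) {
        try {
            new InputStreamReader(new ByteArrayInputStream(new byte[0]), charsetName).close();
            return true;
        } catch (UnsupportedEncodingException e) {
            // Unknown or syntactically illegal charset name.
            return false;
        } catch (IOException e) {
            // close() declares IOException; it is not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] names = {"utf-8", "UTF-8", "Utf-8", "utf-8; filename=rseventspro_rss20_56.xml"};
        for (String name : names) {
            System.out.println(name + " -> " + canDecodeWith(name));
        }
    }
}
```

On a standard JVM the first three names are accepted regardless of case, while the last one is rejected because the whole string, including the "; filename=..." suffix, is treated as the charset name.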




[jira] [Commented] (CONNECTORS-1655) Web connector - UnsupportedEncodingException utf-8

2020-10-16 Thread Julien Massiera (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215238#comment-17215238
 ] 

Julien Massiera commented on CONNECTORS-1655:
-

Hi [~kwri...@metacarta.com], I am using the official OpenJDK 11 installed from 
the Debian repo:
openjdk version "11.0.8" 2020-07-14
OpenJDK Runtime Environment 18.9 (build 11.0.8+10)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.8+10, mixed mode)



[jira] [Commented] (CONNECTORS-1655) Web connector - UnsupportedEncodingException utf-8

2020-10-15 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214873#comment-17214873
 ] 

Karl Wright commented on CONNECTORS-1655:
-

So you are using a non-standard JVM that doesn't understand utf-8 character 
encoding.
Sorry, you don't get a fix for that. o_O  Use a standard JVM please.

