Julien Massiera created CONNECTORS-1655:
-------------------------------------------
Summary: Web connector - UnsupportedEncodingException utf-8
Key: CONNECTORS-1655
URL: https://issues.apache.org/jira/browse/CONNECTORS-1655
Project: ManifoldCF
Issue Type: Bug
Components: Web connector
Affects Versions: ManifoldCF 2.17
Reporter: Julien Massiera
When crawling some sites (for instance
http://www.antibes-juanlespins.com/), the job manages to index some
documents, but then stops with the following error:
Error: IO error: utf-8; filename=rseventspro_rss20_56.xml
Here is one of the MCF stack traces:
Exception tossed: IO error: utf-8; filename=rseventspro_rss20_56.xml
org.apache.manifoldcf.core.interfaces.ManifoldCFException: IO error: utf-8; filename=rseventspro_rss20_56.xml
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4203) ~[?:?]
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:3855) ~[?:?]
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:746) ~[?:?]
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
Caused by: java.io.UnsupportedEncodingException: utf-8; filename=rseventspro_rss20_56.xml
at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71) ~[?:1.8.0_212]
at java.io.InputStreamReader.<init>(InputStreamReader.java:100) ~[?:1.8.0_212]
at org.apache.manifoldcf.connectorcommon.fuzzyml.DecodingByteReceiver.dealWithBytes(DecodingByteReceiver.java:47) ~[?:?]
at org.apache.manifoldcf.connectorcommon.fuzzyml.BOMEncodingDetector.dealWithRemainder(BOMEncodingDetector.java:250) ~[?:?]
at org.apache.manifoldcf.connectorcommon.fuzzyml.SingleByteReceiver.dealWithBytes(SingleByteReceiver.java:52) ~[?:?]
at org.apache.manifoldcf.connectorcommon.fuzzyml.Parser.parseWithCharsetDetection(Parser.java:74) ~[?:?]
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4174) ~[?:?]
... 3 more
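The "Caused by" message suggests that the charset name passed to InputStreamReader is the whole string "utf-8; filename=rseventspro_rss20_56.xml" rather than just "utf-8". Where the extra "; filename=..." parameter comes from is not visible in the trace; it may be a response header on the site, but that is an assumption. The standalone sketch below reproduces the exception with that string and shows one possible way to trim it. The class CharsetNameSketch and the helper cleanCharsetName are hypothetical names used for illustration only; they are not part of ManifoldCF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class CharsetNameSketch {

    // Hypothetical helper (not ManifoldCF code): keep only the text before
    // the first ';' and trim surrounding whitespace.
    static String cleanCharsetName(String raw) {
        int semi = raw.indexOf(';');
        String name = (semi >= 0) ? raw.substring(0, semi) : raw;
        return name.trim();
    }

    // Returns true if InputStreamReader accepts the given charset name,
    // false if it rejects it with UnsupportedEncodingException.
    static boolean readerAccepts(String charsetName) {
        byte[] data = "<rss/>".getBytes(StandardCharsets.UTF_8);
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(data), charsetName)) {
            return true;
        } catch (UnsupportedEncodingException e) {
            return false;
        } catch (IOException e) {
            // Only reachable if close() fails; treated as a failure here.
            return false;
        }
    }

    public static void main(String[] args) {
        // The exact string reported in the exception message above.
        String raw = "utf-8; filename=rseventspro_rss20_56.xml";
        System.out.println("raw accepted:     " + readerAccepts(raw));
        System.out.println("cleaned name:     " + cleanCharsetName(raw));
        System.out.println("cleaned accepted: " + readerAccepts(cleanCharsetName(raw)));
    }
}
```

Trimming at the first ';' is only one option; an actual fix in the connector might instead parse the header parameters properly or fall back to a default encoding when the name is not recognized.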