[ 
https://issues.apache.org/jira/browse/CONNECTORS-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215311#comment-17215311
 ] 

Karl Wright commented on CONNECTORS-1655:
-----------------------------------------

Ah, but wait a minute: the issue is that the document in question has an 
illegal content-type:

"utf-8; filename=rseventspro_rss20_56.xml"

A patch for that is possible.  
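
As a rough illustration of what such a patch could look like (a sketch only, 
not the actual ManifoldCF change): the charset string could be cleaned up 
before it ever reaches InputStreamReader. The class name CharsetSanitizer and 
its sanitize method are hypothetical and do not exist in ManifoldCF. The 
sketch assumes that the real charset name is whatever precedes the first ';' 
and that falling back to a caller-supplied default is acceptable when the 
name is still unusable.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper for illustration; not part of ManifoldCF.
final class CharsetSanitizer {
  private CharsetSanitizer() {}

  /**
   * Turns a possibly malformed charset value, such as
   * "utf-8; filename=rseventspro_rss20_56.xml", into a usable Charset.
   * Returns the fallback if no supported charset name can be recovered.
   */
  static Charset sanitize(String raw, Charset fallback) {
    if (raw == null) {
      return fallback;
    }
    String name = raw;
    // Assumption: anything after the first ';' is an unrelated parameter.
    int semi = name.indexOf(';');
    if (semi >= 0) {
      name = name.substring(0, semi);
    }
    name = name.trim();
    // Strip one pair of surrounding double quotes, if present.
    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
      name = name.substring(1, name.length() - 1).trim();
    }
    if (name.isEmpty()) {
      return fallback;
    }
    try {
      return Charset.forName(name);
    } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
      // The name is syntactically illegal or unknown to this JVM.
      return fallback;
    }
  }
}
```

A caller could then build the reader with 
new InputStreamReader(stream, CharsetSanitizer.sanitize(value, StandardCharsets.UTF_8)), 
which uses the Charset-based constructor and so cannot throw 
UnsupportedEncodingException.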


> Web connector - UnsupportedEncodingException utf-8
> --------------------------------------------------
>
>                 Key: CONNECTORS-1655
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1655
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.17
>            Reporter: Julien Massiera
>            Assignee: Karl Wright
>            Priority: Critical
>
> When crawling some sites (for instance this one: 
> [http://www.antibes-juanlespins.com/] ) the job manages to index some 
> documents, but then stops with the following error:
> Error: IO error: utf-8; filename=rseventspro_rss20_56.xml
> Here is the MCF stack trace: 
> Exception tossed: IO error: utf-8; filename=rseventspro_rss20_56.xml
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: IO error: utf-8; filename=rseventspro_rss20_56.xml
> at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4203) ~[?:?]
> at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:3855) ~[?:?]
> at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:746) ~[?:?]
> at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
> Caused by: java.io.UnsupportedEncodingException: utf-8; filename=rseventspro_rss20_56.xml
> at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71) ~[?:1.8.0_212]
> at java.io.InputStreamReader.<init>(InputStreamReader.java:100) ~[?:1.8.0_212]
> at org.apache.manifoldcf.connectorcommon.fuzzyml.DecodingByteReceiver.dealWithBytes(DecodingByteReceiver.java:47) ~[?:?]
> at org.apache.manifoldcf.connectorcommon.fuzzyml.BOMEncodingDetector.dealWithRemainder(BOMEncodingDetector.java:250) ~[?:?]
> at org.apache.manifoldcf.connectorcommon.fuzzyml.SingleByteReceiver.dealWithBytes(SingleByteReceiver.java:52) ~[?:?]
> at org.apache.manifoldcf.connectorcommon.fuzzyml.Parser.parseWithCharsetDetection(Parser.java:74) ~[?:?]
> at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4174) ~[?:?]
> ... 3 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
