[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738231#comment-16738231
 ] 

Donald Van den Driessche commented on CONNECTORS-1562:
------------------------------------------------------

Kar

After resolving the issues with the API creation repository connections, I 
retested our crawling locally. With a docker which contains a ManifoldCF and an 
Elasticsearch container.

I used No Bandwith throttles and a max connection count of 25.

This on the seedmap existing of 1 page, our whitelist: 
[https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=nl&html=true]
I still get the Stream Closed I/O exception:.

Do you have any more ideas on how to keep the connection open, so that the 
whole whitelist can be processed?

 

Printscreen Simple Report

!image-2019-01-09-14-20-50-616.png!

Stacktrace
{code:java}
ERROR 2019-01-09T13:08:37,876 (Worker thread '22') - Exception tossed: Repeated 
service interruptions - failure processing document: Stream Closed 
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service 
interruptions - failure processing document: Stream Closed at 
org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:489) 
[mcf-pull-agent.jar:?] Caused by: java.io.IOException: Stream Closed at 
java.io.FileInputStream.readBytes(Native Method) ~[?:1.8.0_191] at 
java.io.FileInputStream.read(FileInputStream.java:255) ~[?:1.8.0_191] at 
sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_191] at 
sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_191] at 
sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_191] at 
java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_191] at 
org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(ElasticSearchIndex.java:221)
 ~[?:?] at 
org.apache.http.impl.execchain.RequestEntityProxy.writeTo(RequestEntityProxy.java:121)
 ~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156)
 ~[httpcore-4.4.10.jar:4.4.10] at 
org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160) 
~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238)
 ~[httpcore-4.4.10.jar:4.4.10] at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
 ~[httpcore-4.4.10.jar:4.4.10] at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) 
~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) 
~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) 
~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
 ~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
 ~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
 ~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
 ~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection$CallThread.run(ElasticSearchConnection.java:133)
 ~[?:?] ERROR 2019-01-09T13:08:37,883 (Worker thread '7') - Exception tossed: 
Repeated service interruptions - failure processing document: Stream Closed 
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service 
interruptions - failure processing document: Stream Closed at 
org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:489) 
[mcf-pull-agent.jar:?] Caused by: java.io.IOException: Stream Closed at 
java.io.FileInputStream.readBytes(Native Method) ~[?:1.8.0_191] at 
java.io.FileInputStream.read(FileInputStream.java:255) ~[?:1.8.0_191] at 
sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_191] at 
sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_191] at 
sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_191] at 
java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_191] at 
org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(ElasticSearchIndex.java:221)
 ~[?:?] at 
org.apache.http.impl.execchain.RequestEntityProxy.writeTo(RequestEntityProxy.java:121)
 ~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156)
 ~[httpcore-4.4.10.jar:4.4.10] at 
org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160) 
~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238)
 ~[httpcore-4.4.10.jar:4.4.10] at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
 ~[httpcore-4.4.10.jar:4.4.10] at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) 
~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) 
~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) 
~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
 ~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
 ~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
 ~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
 ~[httpclient-4.5.6.jar:4.5.6] at 
org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection$CallThread.run(ElasticSearchConnection.java:133)
 ~[?:?]
 {code}

> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web inputconnector
> elastic outputconnecotr
> Job crawls website input and outputs content to elastic
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>             Fix For: ManifoldCF 2.12
>
>         Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> image-2019-01-09-14-20-50-616.png, manifoldcf.log.cleanup, 
> manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from ElasticSearch index after rerunning the 
> changed seeds
> I update my job to change the seedmap and rerun it or use the schedualer to 
> keep it runneng even after updating it.
> After the rerun the unreachable documents don't get deleted.
> It only adds doucments when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to