[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738231#comment-16738231 ]
Donald Van den Driessche commented on CONNECTORS-1562: ------------------------------------------------------ Kar After resolving the issues with the API creation repository connections, I retested our crawling locally. With a docker which contains a ManifoldCF and an Elasticsearch container. I used No Bandwith throttles and a max connection count of 25. This on the seedmap existing of 1 page, our whitelist: [https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=nl&html=true] I still get the Stream Closed I/O exception:. Do you have any more ideas on how to keep the connection open, so that the whole whitelist can be processed? Printscreen Simple Report !image-2019-01-09-14-20-50-616.png! Stacktrace {code:java} ERROR 2019-01-09T13:08:37,876 (Worker thread '22') - Exception tossed: Repeated service interruptions - failure processing document: Stream Closed org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Stream Closed at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:489) [mcf-pull-agent.jar:?] Caused by: java.io.IOException: Stream Closed at java.io.FileInputStream.readBytes(Native Method) ~[?:1.8.0_191] at java.io.FileInputStream.read(FileInputStream.java:255) ~[?:1.8.0_191] at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_191] at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_191] at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_191] at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_191] at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(ElasticSearchIndex.java:221) ~[?:?] at org.apache.http.impl.execchain.RequestEntityProxy.writeTo(RequestEntityProxy.java:121) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156) ~[httpcore-4.4.10.jar:4.4.10] at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238) ~[httpcore-4.4.10.jar:4.4.10] at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123) ~[httpcore-4.4.10.jar:4.4.10] at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection$CallThread.run(ElasticSearchConnection.java:133) ~[?:?] ERROR 2019-01-09T13:08:37,883 (Worker thread '7') - Exception tossed: Repeated service interruptions - failure processing document: Stream Closed org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Stream Closed at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:489) [mcf-pull-agent.jar:?] Caused by: java.io.IOException: Stream Closed at java.io.FileInputStream.readBytes(Native Method) ~[?:1.8.0_191] at java.io.FileInputStream.read(FileInputStream.java:255) ~[?:1.8.0_191] at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_191] at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_191] at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_191] at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_191] at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(ElasticSearchIndex.java:221) ~[?:?] at org.apache.http.impl.execchain.RequestEntityProxy.writeTo(RequestEntityProxy.java:121) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156) ~[httpcore-4.4.10.jar:4.4.10] at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238) ~[httpcore-4.4.10.jar:4.4.10] at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123) ~[httpcore-4.4.10.jar:4.4.10] at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6] at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection$CallThread.run(ElasticSearchConnection.java:133) ~[?:?] {code} > Documents unreachable due to hopcount are not considered unreachable on > cleanup pass > ------------------------------------------------------------------------------------ > > Key: CONNECTORS-1562 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 > Project: ManifoldCF > Issue Type: Bug > Components: Elastic Search connector, Web connector > Affects Versions: ManifoldCF 2.11 > Environment: Manifoldcf 2.11 > Elasticsearch 6.3.2 > Web inputconnector > elastic outputconnecotr > Job crawls website input and outputs content to elastic > Reporter: Tim Steenbeke > Assignee: Karl Wright > Priority: Critical > Labels: starter > Fix For: ManifoldCF 2.12 > > Attachments: Screenshot from 2018-12-31 11-17-29.png, > image-2019-01-09-14-20-50-616.png, manifoldcf.log.cleanup, > manifoldcf.log.init, manifoldcf.log.reduced > > Original Estimate: 4h > Remaining Estimate: 4h > > My documents aren't removed from ElasticSearch index after rerunning the > changed seeds > I update my job to change the seedmap and rerun it or use the schedualer to > keep it runneng even after updating it. > After the rerun the unreachable documents don't get deleted. > It only adds doucments when they can be reached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)