[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-31 - Karl Wright (JIRA)


[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731338#comment-16731338 ]

Karl Wright commented on CONNECTORS-1562:
-----------------------------------------

Yes, that's the error.  Specifically:

{code}
Caused by: java.io.IOException: Stream Closed
    at java.io.FileInputStream.readBytes(Native Method) ~[?:1.8.0_191]
    at java.io.FileInputStream.read(FileInputStream.java:255) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_191]
    at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_191]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(ElasticSearchIndex.java:221) ~[?:?]
{code}

What's happening is that a document is being streamed to ElasticSearch, and the document's input stream is being read to do that.  But the stream is being closed early by the web connector, for some reason, before it's entirely read.  It's not clear why; it could be a mismatch between the reported content length and the actual number of bytes being read, or it could be the web service itself closing the stream early at some point.

At any rate, it is *one* specific document doing this.  If you can figure out which document it is, I may be able to come up with a solution.  Is it a very large document?  When you try to fetch the document using (say) curl, does it fetch completely?  And so on.


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> ----------------------------------------------------------------------
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to Elasticsearch
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-31 - Tim Steenbeke (JIRA)


[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731315#comment-16731315 ]

Tim Steenbeke commented on CONNECTORS-1562:
-------------------------------------------

Is this the error?
{code:java}
 WARN 2018-12-31T08:24:46,453 (Worker thread '32') - Service interruption reported for job 1546241012417 connection 'repo_website-en': IO exception: Stream Closed
 WARN 2018-12-31T08:28:52,471 (Worker thread '6') - Service interruption reported for job 1546241012417 connection 'repo_website-en': IO exception: Stream Closed
 WARN 2018-12-31T08:32:10,699 (Worker thread '13') - Service interruption reported for job 1546241012417 connection 'repo_website-en': IO exception: Stream Closed
ERROR 2018-12-31T08:32:10,750 (Worker thread '13') - Exception tossed: Repeated service interruptions - failure processing document: Stream Closed
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Stream Closed
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:489) [mcf-pull-agent.jar:?]
Caused by: java.io.IOException: Stream Closed
    at java.io.FileInputStream.readBytes(Native Method) ~[?:1.8.0_191]
    at java.io.FileInputStream.read(FileInputStream.java:255) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_191]
    at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_191]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(ElasticSearchIndex.java:221) ~[?:?]
    at org.apache.http.impl.execchain.RequestEntityProxy.writeTo(RequestEntityProxy.java:121) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection$CallThread.run(ElasticSearchConnection.java:133) ~[?:?]
 WARN 2018-12-31T08:33:35,958 (Job notification thread) - ES: Commit failed: {"error":"Incorrect HTTP method for uri [/website-en/_optimize] and method [GET], allowed: [POST]","status":405}
 WARN 2018-12-31T08:34:46,024 (Job notification thread) - ES: Commit failed: {"error":"Incorrect HTTP method for uri [/pintra/_optimize] and method [GET], allowed: [POST]","status":405}
{code}
The timestamps are off by one hour; the job is running in a Docker container that currently has a different timezone.

[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-31 - Karl Wright (JIRA)


[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731290#comment-16731290 ]

Karl Wright commented on CONNECTORS-1562:
-----------------------------------------

{code}
Error: Repeated service interruptions - failure processing document: Stream Closed
{code}

This is not a crash; this just means that the job aborts.  It also comes with 
numerous stack traces, one for each time the document retries.  That stack 
trace would be very helpful to have.  Thanks!




[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-31 - Karl Wright (JIRA)


[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731290#comment-16731290 ]

Karl Wright edited comment on CONNECTORS-1562 at 12/31/18 12:06 PM:


{code}
Error: Repeated service interruptions - failure processing document: Stream Closed
{code}

This is not a crash; it just means that the job aborts.  It also comes with numerous stack traces in the manifoldcf log, one for each time the document retries.  That stack trace would be very helpful to have.  Thanks!







[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-31 - Tim Steenbeke (JIRA)


[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731261#comment-16731261 ]

Tim Steenbeke commented on CONNECTORS-1562:
-------------------------------------------

[~kwri...@metacarta.com] I think there was some miscommunication:

The issue with the "stopped working" was found by my colleague Donald Van den Driessche, so I didn't have any more info than what he gave me.
I recreated the issue, and this is the error:
{code:java}
Error: Repeated service interruptions - failure processing document: Stream Closed
{code}
!Screenshot from 2018-12-31 11-17-29.png!

The question I wanted answered is: how are we supposed to set up the job with the data we have?  What you see as the best solution might not be the right one.
I asked this, and you only responded to the other ManifoldCF issue, so it looked like you avoided the question.
You suggested using the sitemap URL with excludes, but that is simply not possible: the exclude list is too big, and no regular expression can cover it because of the randomness of the links.
Because of that, I also thought you were looking into this and had found a fix or edited the code.

I'm sorry if my text came across as blunt, but I'm just trying to get information, and I didn't know any other way to draw your attention to the full picture of the comment.
English is not my first language, so I'm sorry for my limited vocabulary; Google Translate doesn't help much here either.
I hope we can continue this communication and get to a solution, hopefully one that works for both of us.



[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-31 - Tim Steenbeke (JIRA)


[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Steenbeke updated CONNECTORS-1562:
--------------------------------------
Attachment: Screenshot from 2018-12-31 11-17-29.png
