[jira] [Commented] (CONNECTORS-899) Consider/ignore HTTP header fields when checking for document change
[ https://issues.apache.org/jira/browse/CONNECTORS-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13908414#comment-13908414 ]

Florian Schmedding commented on CONNECTORS-899:
---
Perhaps there is a more minimal solution, as indicated in [CONNECTORS-850|https://issues.apache.org/jira/browse/CONNECTORS-850?focusedCommentId=13901754page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13901754].

Consider/ignore HTTP header fields when checking for document change

Key: CONNECTORS-899
URL: https://issues.apache.org/jira/browse/CONNECTORS-899
Project: ManifoldCF
Issue Type: Improvement
Components: Web connector
Affects Versions: ManifoldCF 1.6
Reporter: Florian Schmedding
Assignee: Karl Wright
Priority: Minor
Labels: http
Fix For: ManifoldCF 1.6

The web connector already ignores certain HTTP header fields that change on every request when checking for document changes. However, this list is hardcoded. Some web servers are not properly configured and return a new last-modified date on each request even though the document remains the same. This leads to many unnecessary re-ingestions. It would be nice to be able to configure which header fields should be considered and which should be ignored.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
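The request above can be illustrated with a minimal sketch. It is not the web connector's code: the function name `version_string`, the default ignore list, and the use of SHA-256 are all assumptions made for illustration. It only shows how a configurable ignore list changes whether two fetches of the same content compare as equal.

```python
import hashlib

# Hypothetical default: header names that tend to change on every request.
DEFAULT_IGNORED = frozenset({"date", "age"})


def version_string(headers, body, ignored=DEFAULT_IGNORED):
    """Build a change-detection fingerprint from the kept headers and the body."""
    kept = sorted(
        (name.lower(), value)
        for name, value in headers.items()
        if name.lower() not in ignored
    )
    digest = hashlib.sha256()
    for name, value in kept:
        digest.update(f"{name}={value}\n".encode("utf-8"))
    digest.update(body)
    return digest.hexdigest()


# Two fetches of the same body from a server that rewrites Last-Modified each time.
first = {"Last-Modified": "Wed, 12 Feb 2014 17:09:01 GMT", "Content-Type": "application/xml"}
second = {"Last-Modified": "Wed, 12 Feb 2014 17:13:02 GMT", "Content-Type": "application/xml"}
body = b"<doc/>"

# With the default list, the differing Last-Modified counts as a change.
changed_by_default = version_string(first, body) != version_string(second, body)

# Adding last-modified to the ignore list makes the two fetches compare equal.
ignore_lm = DEFAULT_IGNORED | {"last-modified"}
changed_when_ignored = (
    version_string(first, body, ignore_lm) != version_string(second, body, ignore_lm)
)
```

An include list would work the same way with the membership test inverted; which of the two is more convenient depends on the server, as the comment on CONNECTORS-850 notes.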
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13901726#comment-13901726 ]

Florian Schmedding commented on CONNECTORS-850:
---
Would it be possible to add header include and exclude lists to the configuration options of a web repository? Some web servers even update the last-modified date on each access although nothing changed. Which header fields should be considered when checking for changes depends on the content and the server.

Maximum interval in dynamic crawling

Key: CONNECTORS-850
URL: https://issues.apache.org/jira/browse/CONNECTORS-850
Project: ManifoldCF
Issue Type: New Feature
Components: Framework crawler agent
Affects Versions: ManifoldCF 1.4.1
Reporter: Florian Schmedding
Assignee: Karl Wright
Priority: Minor
Labels: features
Fix For: ManifoldCF 1.5

Currently, the dynamic crawling method used for a continuous job extends the reseed and recrawl intervals when no changes are found in a checked document. However, it should be possible to restrict this extension to a maximum value in order to make sure that new documents are discovered within a certain interval.
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900198#comment-13900198 ]

Florian Schmedding commented on CONNECTORS-850:
---
What contributes to a document change - anything besides the content, e.g., HTTP header fields? The content was only changed at the time indicated by the *** note. The document is served by an Apache HTTP server on localhost. I used a modified web crawler connector that recognizes links in a custom XML format (it parses the XML and extracts the links and a document ID, nothing else).
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900307#comment-13900307 ]

Florian Schmedding commented on CONNECTORS-850:
---
Normally, the HTTP header contains a date value. It looks like this field is removed before computing the changes. A few values from ingeststatus:
{noformat}
lastingest: 12 Feb 2014 18:27:45
firstingest: 12 Feb 2014 17:00:25 (doesn't match exactly the history entry above, same for another document)
lastoutputversion: 0+0++
lastversion: 0+-8+header-Accept-Ranges=bytes=+header-Connection=Keep-Alive=+header-Content-Length=7559=+header-Content-Type=application/xml=+header-ETag=14200039b75-1d87-4f238a1156aaf=+header-Keep-Alive=timeout\\=5, max\\=100=+header-Last-Modified=Wed, 12 Feb 2014 17:09:01 GMT=+header-Server=Apache/2.2.22 (Win32) PHP/5.4.5 mod_jk/1.2.37=+845393346261438975+.*+
changecount: 22
{noformat}
Not considering the header date would explain the above fetches without ingests. Hope this makes sense.
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900504#comment-13900504 ]

Florian Schmedding commented on CONNECTORS-850:
---
Yes, there may be some other header that should be removed but I could not see any changing one for the above example crawl except date and age. Btw, the fetch times coincide exactly with the times logged by the Apache server.
[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time
[ https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899260#comment-13899260 ]

Florian Schmedding commented on CONNECTORS-880:
---
I believe that in my case the job repetition was caused by the wrong collation. When a job is run with a case-insensitive collation in MySQL, it gets started again without a previous job end. The same job runs as expected with a correctly configured database. However, I think your fix is not intended to remedy completely inconsistent status values resulting from the wrong collation. So my setup isn't a test case for it.

Under the right conditions, job aborts do not update last checked time

Key: CONNECTORS-880
URL: https://issues.apache.org/jira/browse/CONNECTORS-880
Project: ManifoldCF
Issue Type: Bug
Components: Framework crawler agent
Affects Versions: ManifoldCF 1.4.1
Reporter: Karl Wright
Assignee: Karl Wright
Fix For: ManifoldCF 1.6

When a scheduled job is being considered to be started, MCF updates the last-check field ONLY if the job didn't start. It relies on the job's completion to set the last-check field in the case where the job does start. But if the job aborts, in at least one case the last-check field is NOT updated. This leads to the job being run over and over again within the schedule window.
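Why a case-insensitive collation could produce inconsistent status values can be sketched as follows. The example assumes that job states are stored as one-character codes, some of which differ only by letter case. The thread mentions the values 's' and 'n' but does not list the full code set, so the rows below and the helper `select_by_status` are hypothetical.

```python
# Hypothetical job rows: (job id, one-character status code).
# Assumption: upper- and lower-case letters denote different states.
ROWS = [("job-1", "N"), ("job-2", "n"), ("job-3", "S"), ("job-4", "s")]


def select_by_status(rows, code, case_sensitive):
    """Mimic `WHERE status = code` under the given collation behaviour."""
    if case_sensitive:
        return [job for job, status in rows if status == code]
    # A case-insensitive collation treats 'n' and 'N' as equal.
    return [job for job, status in rows if status.lower() == code.lower()]


exact = select_by_status(ROWS, "n", case_sensitive=True)
folded = select_by_status(ROWS, "n", case_sensitive=False)
```

Under the case-insensitive comparison, a query meant for one state also returns jobs in the other state, so any update driven by that query would touch the wrong rows.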
[jira] [Comment Edited] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time
[ https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899260#comment-13899260 ]

Florian Schmedding edited comment on CONNECTORS-880 at 2/12/14 4:55 PM:

I believe that in my case the job repetition was caused by the wrong collation. When a job is run with a case-insensitive collation in MySQL, it gets started again without a previous job end. The same job runs as expected with a correctly configured database. However, I think your fix is not intended to remedy completely inconsistent status values resulting from the wrong collation. So my setup isn't a test case for it.

was (Author: florianschmedding):
I believe that in my case the job repetition was depending on the wrong collation. When running a job with a case-insensitive collation in MySQL it get started again without a previous job end. The same job runs as expected with a correctly configured database. However, I think your fix does not intend to remedy completely inconsistent status values resulting from the wrong collation. So my setup inn't a test case for it.
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899304#comment-13899304 ]

Florian Schmedding commented on CONNECTORS-850:
---
I have now tested the feature with Manifold 1.6 (built with ant). The maximum interval is respected. Should the interval drop below the maximum value after a document change is recognized? I would expect such behavior, but in a test this does not seem to be the case.
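The behavior expected in this comment can be written down as a small sketch. It is not ManifoldCF's scheduling code: the doubling factor, the reset to the minimum on a change, and the function name `next_interval` are assumptions chosen only to make the expectation concrete.

```python
def next_interval(current, changed, minimum, maximum, growth=2.0):
    """Return the next recrawl interval in minutes under a hypothetical policy.

    Assumed policy: a detected change resets the interval to the minimum;
    otherwise the interval grows by `growth` but never exceeds the maximum.
    """
    if changed:
        return minimum
    return min(current * growth, maximum)


# Minimum 2 min and maximum 4 min, as in the test described in this thread.
interval = 2.0
observed = []
for changed in [False, False, False, True, False]:
    interval = next_interval(interval, changed, minimum=2.0, maximum=4.0)
    observed.append(interval)
```

With these assumptions the interval stays capped at 4 minutes while nothing changes and falls back to 2 minutes right after a change, which is the drop the comment asks about.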
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899490#comment-13899490 ]

Florian Schmedding commented on CONNECTORS-850:
---
minimum interval: 2 min
maximum interval: 4 min
The job was run a few times before it was set to dynamic crawling.

02-12-2014 18:27:45.475 fetch
02-12-2014 18:23:45.702 document ingest (solr localhost)
02-12-2014 18:23:44.921 fetch
02-12-2014 18:19:44.451 document ingest (solr localhost)
02-12-2014 18:19:43.837 fetch
02-12-2014 18:15:42.929 fetch
02-12-2014 18:11:41.582 document ingest (solr localhost)
02-12-2014 18:11:41.058 fetch
*** document changed
02-12-2014 18:07:40.744 document ingest (solr localhost)
02-12-2014 18:07:40.249 fetch
02-12-2014 18:03:37.546 fetch
02-12-2014 17:59:36.426 fetch
02-12-2014 17:55:34.297 fetch
02-12-2014 17:51:33.431 document ingest (solr localhost)
02-12-2014 17:51:32.973 fetch
*** job changed from scheduled to dynamic crawling
02-12-2014 17:24:24.560 document ingest (solr localhost)
02-12-2014 17:24:24.413 fetch
02-12-2014 17:21:17.042 document ingest (solr localhost)
02-12-2014 17:21:16.919 fetch
02-12-2014 17:18:15.892 document ingest (solr localhost)
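To check the history above against the configured intervals, the gaps between consecutive fetches after the switch to dynamic crawling can be computed directly. The sketch below copies the fetch timestamps from the history; the variable names are illustrative only.

```python
from datetime import datetime

# Fetch timestamps after the job was switched to dynamic crawling (newest first).
FETCHES = [
    "02-12-2014 18:27:45.475",
    "02-12-2014 18:23:44.921",
    "02-12-2014 18:19:43.837",
    "02-12-2014 18:15:42.929",
    "02-12-2014 18:11:41.058",
    "02-12-2014 18:07:40.249",
    "02-12-2014 18:03:37.546",
    "02-12-2014 17:59:36.426",
    "02-12-2014 17:55:34.297",
    "02-12-2014 17:51:32.973",
]

# Parse and sort into chronological order.
times = sorted(datetime.strptime(t, "%m-%d-%Y %H:%M:%S.%f") for t in FETCHES)

# Gap in seconds between each fetch and the next one.
gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]

for gap in gaps:
    print(f"{gap:.3f} s")
```

Every gap is slightly above 240 seconds, i.e. the 4-minute maximum, including the gaps right after the document change. This matches the observation that the interval did not drop below the maximum after a change was recognized.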
[jira] [Comment Edited] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899490#comment-13899490 ]

Florian Schmedding edited comment on CONNECTORS-850 at 2/12/14 7:44 PM:

minimum interval: 2 min
maximum interval: 4 min
The job was run a few times before it was set to dynamic crawling.
{noformat}
02-12-2014 18:27:45.475 fetch
02-12-2014 18:23:45.702 document ingest (solr localhost)
02-12-2014 18:23:44.921 fetch
02-12-2014 18:19:44.451 document ingest (solr localhost)
02-12-2014 18:19:43.837 fetch
02-12-2014 18:15:42.929 fetch
02-12-2014 18:11:41.582 document ingest (solr localhost)
02-12-2014 18:11:41.058 fetch
*** document changed
02-12-2014 18:07:40.744 document ingest (solr localhost)
02-12-2014 18:07:40.249 fetch
02-12-2014 18:03:37.546 fetch
02-12-2014 17:59:36.426 fetch
02-12-2014 17:55:34.297 fetch
02-12-2014 17:51:33.431 document ingest (solr localhost)
02-12-2014 17:51:32.973 fetch
*** job changed from scheduled to dynamic crawling
02-12-2014 17:24:24.560 document ingest (solr localhost)
02-12-2014 17:24:24.413 fetch
02-12-2014 17:21:17.042 document ingest (solr localhost)
02-12-2014 17:21:16.919 fetch
02-12-2014 17:18:15.892 document ingest (solr localhost)
{noformat}

was (Author: florianschmedding):
minimum interval: 2 min
maximum interval: 4 min
The job was run a few times before it was set to dynamic crawling.

02-12-2014 18:27:45.475 fetch
02-12-2014 18:23:45.702 document ingest (solr localhost)
02-12-2014 18:23:44.921 fetch
02-12-2014 18:19:44.451 document ingest (solr localhost)
02-12-2014 18:19:43.837 fetch
02-12-2014 18:15:42.929 fetch
02-12-2014 18:11:41.582 document ingest (solr localhost)
02-12-2014 18:11:41.058 fetch
*** document changed
02-12-2014 18:07:40.744 document ingest (solr localhost)
02-12-2014 18:07:40.249 fetch
02-12-2014 18:03:37.546 fetch
02-12-2014 17:59:36.426 fetch
02-12-2014 17:55:34.297 fetch
02-12-2014 17:51:33.431 document ingest (solr localhost)
02-12-2014 17:51:32.973 fetch
*** job changed from scheduled to dynamic crawling
02-12-2014 17:24:24.560 document ingest (solr localhost)
02-12-2014 17:24:24.413 fetch
02-12-2014 17:21:17.042 document ingest (solr localhost)
02-12-2014 17:21:16.919 fetch
02-12-2014 17:18:15.892 document ingest (solr localhost)
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899529#comment-13899529 ]

Florian Schmedding commented on CONNECTORS-850:
---
The numbers from the second history result page were missing in my previous comment. Starting with a document ingest without a fetch didn't make sense.
[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time
[ https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13897667#comment-13897667 ]

Florian Schmedding commented on CONNECTORS-880:
---
There are some errors in the manifold log:

DEBUG 2014-02-11 10:11:11,989 (Thread-20602) - Actual query: [SELECT status,connectionname,outputname FROM jobs WHERE id=? FOR UPDATE]
DEBUG 2014-02-11 10:11:11,989 (Thread-20602) - Parameter 0: '1392051994515'
DEBUG 2014-02-11 10:11:11,989 (Thread-20602) - Done actual query (0ms): [SELECT status,connectionname,outputname FROM jobs WHERE id=? FOR UPDATE]
DEBUG 2014-02-11 10:11:11,989 (Job reset thread) - Ending transaction
DEBUG 2014-02-11 10:11:11,989 (Job reset thread) - Rolling transaction back!
DEBUG 2014-02-11 10:11:11,992 (Thread-20603) - Actual query: [ROLLBACK]
DEBUG 2014-02-11 10:11:11,992 (Thread-20603) - Done actual query (0ms): [ROLLBACK]
ERROR 2014-02-11 10:11:11,992 (Job reset thread) - Exception tossed: Unexpected job status encountered: 33
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected job status encountered: 33
    at org.apache.manifoldcf.crawler.jobs.Jobs.returnJobToActive(Jobs.java:1726)
    at org.apache.manifoldcf.crawler.jobs.JobManager.resetJobs(JobManager.java:7427)
    at org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:91)

There is a similar exception with "Unexpected job status encountered: 34". When looking into the database, the status field of all jobs is constantly changing between 's' and 'n'.
[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time
[ https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13897672#comment-13897672 ]

Florian Schmedding commented on CONNECTORS-880:
---
I'm using a Solr output connection. Manually sending a document to its update handler does not raise any problems. However, Manifold seems to receive only service interruptions. No document gets indexed.

WARN 2014-02-11 10:17:36,592 (Job notification thread) - IO exception during commit: The target server failed to respond
org.apache.http.NoHttpResponseException: The target server failed to respond
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:61)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
    at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:291)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1219)
WARN 2014-02-11 10:17:36,592 (Job notification thread) - Service interruption notifying connection - retrying: IO exception during commit: The target server failed to respond
org.apache.manifoldcf.agents.interfaces.ServiceInterruption: IO exception during commit: The target server failed to respond
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(HttpPoster.java:477)
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.handleSolrServerException(HttpPoster.java:357)
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.commitPost(HttpPoster.java:304)
    at org.apache.manifoldcf.agents.output.solr.SolrConnector.noteJobComplete(SolrConnector.java:744)
    at org.apache.manifoldcf.crawler.system.JobNotificationThread.run(JobNotificationThread.java:121)
[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time
[ https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13897703#comment-13897703 ]

Florian Schmedding commented on CONNECTORS-880:
---
A job with a null output connection works fine (same repository).
[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time
[ https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13897868#comment-13897868 ]

Florian Schmedding commented on CONNECTORS-880:
---
Unfortunately, there are still the same errors with the new revision. I'm using the manifoldcf-combined-service.war under Tomcat 7 but I don't know if there are multiple agents processes running (according to the build doc I think there should be only one) nor how to check that.

About the Solr connection:
Connection status: Connection working (View Output Connection Status)
I cannot find any wrong parameter, Solr admin is working fine. The ping request from manifold is visible in the access log:
127.0.0.1 - - [11/Feb/2014:15:08:52 +0100] GET /solr/default/admin/ping?wt=xmlversion=2.2 HTTP/1.1 200 1329
Other manually executed requests work as well:
0:0:0:0:0:0:0:1 - - [11/Feb/2014:15:12:21 +0100] GET /solr/default/update?commit=true HTTP/1.1 200 160
However, no further requests from manifold are logged. Did the Solr connection handler change? I'm using Solr 4.3.1.

*
DEBUG 2014-02-11 14:54:17,774 (Job reset thread) - Job 1385456433981 now completed
ERROR 2014-02-11 14:54:17,801 (Job reset thread) - Exception tossed: Unexpected job status encountered: 33
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected job status encountered: 33
    at org.apache.manifoldcf.crawler.jobs.Jobs.returnJobToActive(Jobs.java:1901)
    at org.apache.manifoldcf.crawler.jobs.JobManager.resetJobs(JobManager.java:7726)
    at org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:91)
DEBUG 2014-02-11 14:54:17,857 (Job notification thread) - Found job 1385456433981 in need of notification
DEBUG 2014-02-11 14:54:17,862 (Job notification thread) - Found job 1392051994515 in need of notification
DEBUG 2014-02-11 14:54:17,867 (Job notification thread) - Found job 1392109738731 in need of notification
DEBUG 2014-02-11 14:54:17,871 (Job notification thread) - Found job 1392112746052 in need of notification
DEBUG 2014-02-11 14:54:17,891 (Job reset thread) - Job 1385456433981 now completed
ERROR 2014-02-11 14:54:17,928 (Job reset thread) - Exception tossed: Unexpected job status encountered: 34
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected job status encountered: 34
    at org.apache.manifoldcf.crawler.jobs.Jobs.returnJobToActive(Jobs.java:1901)
    at org.apache.manifoldcf.crawler.jobs.JobManager.resetJobs(JobManager.java:7726)
    at org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:91)
**
WARN 2014-02-11 15:02:28,180 (Job notification thread) - Notification service interruption reported for job 1392112746052 output connection 'solr localhost': IO exception during commit: The target server failed to respond
org.apache.manifoldcf.agents.interfaces.ServiceInterruption: IO exception during commit: The target server failed to respond
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(HttpPoster.java:477)
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.handleSolrServerException(HttpPoster.java:357)
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.commitPost(HttpPoster.java:304)
    at org.apache.manifoldcf.agents.output.solr.SolrConnector.noteJobComplete(SolrConnector.java:744)
    at org.apache.manifoldcf.crawler.system.JobNotificationThread.run(JobNotificationThread.java:118)
Caused by: org.apache.http.NoHttpResponseException: The target server failed to respond
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:61)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
    at
[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time
[ https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13898001#comment-13898001 ]

Florian Schmedding commented on CONNECTORS-880:
---
Replacing the solj-connector with an older version does not seem to solve the problem. During ingestion the job gets aborted.
[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time
[ https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13898413#comment-13898413 ]

Florian Schmedding commented on CONNECTORS-880:
---
I just replaced the httpclient library (without re-building) and now the HTTP POST works fine. I noticed that all connections except those from Manifold appeared with the address 0:0:0:0:0:0:0:1 instead of 127.0.0.1. There is a resolved issue about IPv6 addresses: https://issues.apache.org/jira/browse/HTTPCLIENT-1317. I don't know if this was really the cause of the trouble, but anyway, the new version works.

mcf-combined-service.war:
httpclient.jar - httpclient-4.3.2.jar
httpcore.jar - httpcore-4.3.1.jar

connector-lib:
httpmime.jar - httpmime-4.3.2.jar (perhaps not important)

(binary from http://ftp.fau.de/apache//httpcomponents/httpclient/binary/httpcomponents-client-4.3.2-bin.zip at http://hc.apache.org/downloads.cgi)

After restarting Tomcat, all documents get indexed by Solr. I have now configured a schedule to check the fix you provided. Thanks for your help!
[jira] [Created] (CONNECTORS-887) Database schema not updated properly
Florian Schmedding created CONNECTORS-887: - Summary: Database schema not updated properly Key: CONNECTORS-887 URL: https://issues.apache.org/jira/browse/CONNECTORS-887 Project: ManifoldCF Issue Type: Bug Components: Framework core Affects Versions: ManifoldCF 1.6 Environment: MySQL database Reporter: Florian Schmedding Priority: Minor
When running Manifold 1.6 for the first time with a database schema from Manifold 1.3, the schema is not updated properly. The following SQL command fails:
ALTER TABLE authconnections MODIFY groupname VARCHAR(32) NOT NULL REFERENCES authgroups(groupname) ON DELETE RESTRICT
It should instead add the column:
ALTER TABLE authconnections ADD groupname VARCHAR(32) NOT NULL REFERENCES authgroups(groupname) ON DELETE RESTRICT
The next startup after executing the corrected SQL statement succeeds.
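One way to picture the requested behaviour is an upgrade step that picks ADD or MODIFY depending on whether the column already exists. The following is a minimal sketch under that assumption; `upgrade_statement` and the example column lists are hypothetical and do not reflect how ManifoldCF actually performs schema upgrades.

```python
from typing import Iterable

# Column definition taken from the statements quoted in the report.
COLUMN_DEF = ("groupname VARCHAR(32) NOT NULL "
              "REFERENCES authgroups(groupname) ON DELETE RESTRICT")


def upgrade_statement(existing_columns: Iterable[str]) -> str:
    """Return an ADD statement when the column is missing, MODIFY when it exists."""
    columns = {name.lower() for name in existing_columns}
    verb = "MODIFY" if "groupname" in columns else "ADD"
    return f"ALTER TABLE authconnections {verb} {COLUMN_DEF}"


# Hypothetical column lists; the real 1.3 schema is not shown in the report.
print(upgrade_statement(["connectionname", "description"]))
print(upgrade_statement(["connectionname", "groupname"]))
```

In a real upgrade the list of existing columns would come from the database's own metadata rather than a hard-coded list.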
[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time
[ https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13896778#comment-13896778 ] Florian Schmedding commented on CONNECTORS-880: --- I installed mcf-combined-service.war version 1.6 on a Tomcat server. However, the test job is not working correctly. I used an existing database from Manifold 1.3 with one existing output connection, one existing repository connection and one existing job. On the job status page there were no buttons to control the job, and its status was End notification. Therefore I copied the job. The copy did have buttons to start it, but it only got errors from the output connection when ingesting documents. After I aborted the job, it got stuck in the status End notification and does not leave it. Would it be better to create new connections and jobs, or are there other problems in version 1.6? Under the right conditions, job aborts do not update last checked time Key: CONNECTORS-880 URL: https://issues.apache.org/jira/browse/CONNECTORS-880 Project: ManifoldCF Issue Type: Bug Components: Framework crawler agent Affects Versions: ManifoldCF 1.4.1 Reporter: Karl Wright Assignee: Karl Wright Fix For: ManifoldCF 1.6 When a scheduled job is being considered to be started, MCF updates the last-check field ONLY if the job didn't start. It relies on the job's completion to set the last-check field in the case where the job does start. But if the job aborts, in at least one case the last-check field is NOT updated. This leads to the job being run over and over again within the schedule window.
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867820#comment-13867820 ] Florian Schmedding commented on CONNECTORS-850: --- Tomcat did not start up correctly. There might be a conflict between mcf-combined-service-1.5-SNAPSHOT.war and the libraries and connector in the folder referenced in org.apache.manifoldcf.configfile. The log contains: INFO: validateJarFile(C:\PROGRA~1\APACHE~2\TOMCAT~1.0\webapps\mcf-combined-service-1.5\WEB-INF\lib\jsp-api-2.1-glassfish-2.1.v20091210.jar) - jar not loaded. See Servlet Spec 2.3, section 9.7.2. Offending class: javax/el/Expression.class. Maximum interval in dynamic crawling Key: CONNECTORS-850 URL: https://issues.apache.org/jira/browse/CONNECTORS-850 Project: ManifoldCF Issue Type: New Feature Components: Framework crawler agent Affects Versions: ManifoldCF 1.4.1 Reporter: Florian Schmedding Assignee: Karl Wright Priority: Minor Labels: features Fix For: ManifoldCF 1.5 Currently, the dynamic crawling method used for a continuous job extends the reseed and recrawl intervals when no changes are found in a checked document. However, it should be possible to restrict this extension to a maximum value in order to make sure that new documents are discovered within a certain interval.
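To find out which bundled jars trigger a validateJarFile message like the one quoted in the comment, the WAR can be scanned for jars that contain the offending class. The sketch below is a generic illustration using Python's standard `zipfile` module; `jars_containing`, `make_zip`, and the tiny in-memory demo archive are hypothetical helpers, not part of Tomcat or ManifoldCF.

```python
import io
import zipfile
from typing import Dict, List


def jars_containing(war_bytes: bytes, class_entry: str) -> List[str]:
    """List the WEB-INF/lib jars inside a WAR that contain the given class entry."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(war_bytes)) as war:
        for name in war.namelist():
            if name.startswith("WEB-INF/lib/") and name.endswith(".jar"):
                # Each jar is itself a zip archive nested inside the WAR.
                with zipfile.ZipFile(io.BytesIO(war.read(name))) as jar:
                    if class_entry in jar.namelist():
                        hits.append(name)
    return hits


def make_zip(entries: Dict[str, bytes]) -> bytes:
    """Build an in-memory zip archive from a name-to-content mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()


# Made-up demo archive; a real check would read the bytes of the actual WAR file.
demo_war = make_zip({
    "WEB-INF/lib/jsp-api.jar": make_zip({"javax/el/Expression.class": b""}),
    "WEB-INF/lib/other.jar": make_zip({"org/example/Foo.class": b""}),
})
print(jars_containing(demo_war, "javax/el/Expression.class"))
```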
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866521#comment-13866521 ] Florian Schmedding commented on CONNECTORS-850: --- Is there a nightly build available? I couldn't run the mcf-combined-service-1.5-SNAPSHOT.war that I built with maven. Maximum interval in dynamic crawling Key: CONNECTORS-850 URL: https://issues.apache.org/jira/browse/CONNECTORS-850 Project: ManifoldCF Issue Type: New Feature Components: Framework crawler agent Affects Versions: ManifoldCF 1.4.1 Reporter: Florian Schmedding Assignee: Karl Wright Priority: Minor Labels: features Fix For: ManifoldCF 1.5 Currently, the dynamic crawling method used for a continuous job extends the reseed and recrawl intervals when no changes are found in a checked document. However, it should be possible to restrict this extension to a maximum value in order to make sure that new documents are discovered within a certain interval.
[jira] [Created] (CONNECTORS-850) Maximum interval in dynamic crawling
Florian Schmedding created CONNECTORS-850: - Summary: Maximum interval in dynamic crawling Key: CONNECTORS-850 URL: https://issues.apache.org/jira/browse/CONNECTORS-850 Project: ManifoldCF Issue Type: New Feature Components: Framework crawler agent Reporter: Florian Schmedding Priority: Minor Currently, the dynamic crawling method used for a continuous job extends the reseed and recrawl intervals when no changes are found in a checked document. However, it should be possible to restrict this extension to a maximum value in order to make sure that new documents are discovered within a certain interval.
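The requested cap can be illustrated with a small sketch. The issue only states that intervals are extended when no changes are found; the doubling factor, the reset to a base interval on change, and the name `next_interval` are assumptions made for this example and are not taken from ManifoldCF.

```python
def next_interval(current: float, changed: bool, base: float,
                  maximum: float, factor: float = 2.0) -> float:
    """Reset to the base interval on change; otherwise extend, but never past maximum."""
    if changed:
        return base
    return min(current * factor, maximum)


# Hypothetical sequence of change checks for one document.
interval = 1.0
history = []
for changed in [False, False, False, False, True, False]:
    interval = next_interval(interval, changed, base=1.0, maximum=6.0)
    history.append(interval)

print(history)  # intervals grow until they reach the cap, then reset on a change
```

Without the `min(...)` the interval would keep growing for as long as a document stays unchanged, which is the situation the issue wants to bound.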