[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time
[ https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899260#comment-13899260 ]

Florian Schmedding commented on CONNECTORS-880:

I believe that in my case the job repetition was due to the wrong collation. When a job runs with a case-insensitive collation in MySQL, it gets started again without a previous job end. The same job runs as expected with a correctly configured database. However, I think your fix is not intended to remedy completely inconsistent status values resulting from the wrong collation, so my setup isn't a test case for it.

Under the right conditions, job aborts do not update last checked time

Key: CONNECTORS-880
URL: https://issues.apache.org/jira/browse/CONNECTORS-880
Project: ManifoldCF
Issue Type: Bug
Components: Framework crawler agent
Affects Versions: ManifoldCF 1.4.1
Reporter: Karl Wright
Assignee: Karl Wright
Fix For: ManifoldCF 1.6

When a scheduled job is being considered for starting, MCF updates the last-check field ONLY if the job didn't start. It relies on the job's completion to set the last-check field in the case where the job does start. But if the job aborts, in at least one case the last-check field is NOT updated. This leads to the job being run over and over again within the schedule window.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
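The failure mode in the issue description can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the `Job` record, the function names, and the integer clock are all hypothetical and do not reflect ManifoldCF's actual classes or schema. It shows why a job that aborts without touching the last-check field is started again on every scheduler pass inside the window.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    # Hypothetical, simplified job record; not ManifoldCF's real schema.
    window_start: int
    window_end: int
    last_checked: Optional[int] = None
    running: bool = False
    runs: int = 0


def consider_start(job: Job, now: int) -> None:
    """One scheduler pass: start the job, or record that it was checked."""
    if job.running:
        return
    in_window = job.window_start <= now < job.window_end
    already_checked = (
        job.last_checked is not None and job.last_checked >= job.window_start
    )
    if in_window and not already_checked:
        job.running = True
        job.runs += 1
    else:
        # The last-check field is updated ONLY when the job did not start.
        job.last_checked = now


def abort(job: Job, now: int, update_last_checked: bool) -> None:
    """Abort the job; the flag models the buggy and the fixed behaviour."""
    job.running = False
    if update_last_checked:
        job.last_checked = now


def simulate(update_on_abort: bool) -> int:
    """Run ten scheduler passes in which every started job aborts at once."""
    job = Job(window_start=0, window_end=10)
    for now in range(10):
        consider_start(job, now)
        if job.running:
            abort(job, now, update_on_abort)
    return job.runs


print("runs without last-check update on abort:", simulate(False))
print("runs with last-check update on abort:   ", simulate(True))
```

With the update skipped on abort, the job is restarted on every pass inside the window; with the update applied, it runs once.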
[jira] [Comment Edited] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time
[ https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899260#comment-13899260 ]

Florian Schmedding edited comment on CONNECTORS-880 at 2/12/14 4:55 PM:

I believe that in my case the job repetition was due to the wrong collation. When a job runs with a case-insensitive collation in MySQL, it gets started again without a previous job end. The same job runs as expected with a correctly configured database. However, I think your fix is not intended to remedy completely inconsistent status values resulting from the wrong collation, so my setup isn't a test case for it.
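One way a case-insensitive collation could produce inconsistent status values is sketched below. It assumes, purely for illustration, that job states are stored as short codes that differ only in letter case; the codes `A` and `a` and the helper `rows_matching` are hypothetical and are not taken from the comment above. The helper emulates a `WHERE status = ?` comparison under each kind of collation.

```python
def rows_matching(rows, status, case_sensitive):
    """Emulate `WHERE status = ?` under a given collation (illustrative only)."""
    if case_sensitive:
        return [r for r in rows if r["status"] == status]
    return [r for r in rows if r["status"].lower() == status.lower()]


# Hypothetical job rows whose status codes differ only in case.
jobs = [
    {"id": 1, "status": "A"},
    {"id": 2, "status": "a"},
]

sensitive = [r["id"] for r in rows_matching(jobs, "A", case_sensitive=True)]
insensitive = [r["id"] for r in rows_matching(jobs, "A", case_sensitive=False)]

print("case-sensitive match:  ", sensitive)
print("case-insensitive match:", insensitive)
```

Under the case-insensitive comparison both rows match the same query, so two distinct states become indistinguishable to the scheduler.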
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899304#comment-13899304 ]

Florian Schmedding commented on CONNECTORS-850:

I have now tested the feature with ManifoldCF 1.6 (built with ant). The maximum interval is respected. Should the interval drop below the maximum value after a document change has been recognized? I would expect such behavior, but in a test this does not seem to be the case.

Maximum interval in dynamic crawling

Key: CONNECTORS-850
URL: https://issues.apache.org/jira/browse/CONNECTORS-850
Project: ManifoldCF
Issue Type: New Feature
Components: Framework crawler agent
Affects Versions: ManifoldCF 1.4.1
Reporter: Florian Schmedding
Assignee: Karl Wright
Priority: Minor
Labels: features
Fix For: ManifoldCF 1.5

Currently, the dynamic crawling method used for a continuous job extends the reseed and recrawl intervals when no changes are found in a checked document. However, it should be possible to restrict this extension to a maximum value in order to make sure that new documents are discovered within a certain interval.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899490#comment-13899490 ]

Florian Schmedding commented on CONNECTORS-850:

minimum interval: 2 min
maximum interval: 4 min

The job was run a few times before it was set to dynamic crawling.

02-12-2014 18:27:45.475 fetch
02-12-2014 18:23:45.702 document ingest (solr localhost)
02-12-2014 18:23:44.921 fetch
02-12-2014 18:19:44.451 document ingest (solr localhost)
02-12-2014 18:19:43.837 fetch
02-12-2014 18:15:42.929 fetch
02-12-2014 18:11:41.582 document ingest (solr localhost)
02-12-2014 18:11:41.058 fetch
*** document changed
02-12-2014 18:07:40.744 document ingest (solr localhost)
02-12-2014 18:07:40.249 fetch
02-12-2014 18:03:37.546 fetch
02-12-2014 17:59:36.426 fetch
02-12-2014 17:55:34.297 fetch
02-12-2014 17:51:33.431 document ingest (solr localhost)
02-12-2014 17:51:32.973 fetch
*** job changed from scheduled to dynamic crawling
02-12-2014 17:24:24.560 document ingest (solr localhost)
02-12-2014 17:24:24.413 fetch
02-12-2014 17:21:17.042 document ingest (solr localhost)
02-12-2014 17:21:16.919 fetch
02-12-2014 17:18:15.892 document ingest (solr localhost)
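The spacing between fetches in the log above can be checked with a few lines of Python. This is only a sketch: the timestamps are copied from the log (fetch entries after the switch to dynamic crawling), while the function name `fetch_intervals` and the month-day-year reading of the date format are assumptions.

```python
from datetime import datetime

# Fetch timestamps copied from the activity log, newest first.
FETCHES = [
    "02-12-2014 18:27:45.475",
    "02-12-2014 18:23:44.921",
    "02-12-2014 18:19:43.837",
    "02-12-2014 18:15:42.929",
    "02-12-2014 18:11:41.058",
    "02-12-2014 18:07:40.249",
    "02-12-2014 18:03:37.546",
    "02-12-2014 17:59:36.426",
    "02-12-2014 17:55:34.297",
    "02-12-2014 17:51:32.973",
]


def fetch_intervals(stamps):
    """Return the seconds between consecutive fetches, oldest to newest."""
    # Assumes month-day-year ordering; all entries fall on the same day,
    # so the intervals do not depend on that choice.
    times = sorted(datetime.strptime(s, "%m-%d-%Y %H:%M:%S.%f") for s in stamps)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


intervals = fetch_intervals(FETCHES)
for seconds in intervals:
    print(f"{seconds:7.3f} s")
```

Every interval comes out between 240 and 243 seconds, including the ones after the change detected at 18:11:41, which matches the observation that the interval stays at the 4 minute maximum.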
[jira] [Comment Edited] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899490#comment-13899490 ]

Florian Schmedding edited comment on CONNECTORS-850 at 2/12/14 7:44 PM:

minimum interval: 2 min
maximum interval: 4 min

The job was run a few times before it was set to dynamic crawling.

{noformat}
02-12-2014 18:27:45.475 fetch
02-12-2014 18:23:45.702 document ingest (solr localhost)
02-12-2014 18:23:44.921 fetch
02-12-2014 18:19:44.451 document ingest (solr localhost)
02-12-2014 18:19:43.837 fetch
02-12-2014 18:15:42.929 fetch
02-12-2014 18:11:41.582 document ingest (solr localhost)
02-12-2014 18:11:41.058 fetch
*** document changed
02-12-2014 18:07:40.744 document ingest (solr localhost)
02-12-2014 18:07:40.249 fetch
02-12-2014 18:03:37.546 fetch
02-12-2014 17:59:36.426 fetch
02-12-2014 17:55:34.297 fetch
02-12-2014 17:51:33.431 document ingest (solr localhost)
02-12-2014 17:51:32.973 fetch
*** job changed from scheduled to dynamic crawling
02-12-2014 17:24:24.560 document ingest (solr localhost)
02-12-2014 17:24:24.413 fetch
02-12-2014 17:21:17.042 document ingest (solr localhost)
02-12-2014 17:21:16.919 fetch
02-12-2014 17:18:15.892 document ingest (solr localhost)
{noformat}
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899497#comment-13899497 ]

Karl Wright commented on CONNECTORS-850:

Thanks -- I'll make sure this makes sense first chance I get.

In general, the algorithm relies heavily on the time that has elapsed since the last fetch. If I recall correctly, it keeps a weighted average of the interval between changes and uses that to estimate the next time it should attempt a fetch. The limits that are applied don't affect the calculation; they just limit the actual result used.
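The behaviour described above can be sketched as follows. This is not ManifoldCF's implementation: the exponential weighting, the weight of 0.5, and the function names are assumptions chosen only to illustrate a weighted average whose result is clamped without the clamp feeding back into the estimate.

```python
def update_estimate(prev_estimate, observed_interval, weight=0.5):
    """Blend a newly observed change interval into the running estimate.

    The exponential weighting is an assumption made for this sketch.
    """
    return weight * observed_interval + (1 - weight) * prev_estimate


def next_fetch_delay(estimate, min_interval, max_interval):
    """Clamp the estimate to the configured limits.

    The limits only bound the value that is used; the estimate itself
    is left untouched.
    """
    return max(min_interval, min(estimate, max_interval))


# Illustrative numbers in seconds: limits of 2 and 4 minutes, and an
# estimate that has already grown to 10 minutes.
estimate = 600.0
estimate = update_estimate(estimate, observed_interval=300.0)
delay = next_fetch_delay(estimate, min_interval=120.0, max_interval=240.0)
print("estimate:", estimate, "delay used:", delay)
```

Under these assumptions a single detected change lowers the estimate but not necessarily the delay that is actually used: as long as the unclamped estimate stays above the maximum, the crawler keeps fetching at the maximum interval. That would be one possible explanation for the interval not dropping after a change.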
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899529#comment-13899529 ]

Florian Schmedding commented on CONNECTORS-850:

The numbers from the second history result page were missing from my previous comment. Starting with a document ingest without a fetch didn't make sense.
Jenkins build is back to normal : ManifoldCF-mvn #152
See https://builds.apache.org/job/ManifoldCF-mvn/152/changes
[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling
[ https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899907#comment-13899907 ]

Karl Wright commented on CONNECTORS-850:

Hi Florian,

According to your *** notes:

First fetch: 17:00:33.256
First detected change to the document: 18:11:41.058

But the ingestions are inconsistent with this; they seem to reflect much more frequent changes. Ingestions should not happen unless a change has been noted. The change could be to any feature of the document that would change how it is indexed. Can you explain?
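The expectation that an ingestion follows a fetch only when a change has been noted can be sketched as below. This is a simplified model, not ManifoldCF code: `crawl_pass`, the version strings, and the `seen_versions` dictionary are hypothetical names, and the version string stands in for any feature of the document that affects how it is indexed.

```python
def crawl_pass(doc_id, new_version, seen_versions):
    """Return the activities recorded for one fetch of a document.

    An ingest is recorded only when the fetched version differs from
    the version seen on the previous fetch.
    """
    activities = ["fetch"]
    if seen_versions.get(doc_id) != new_version:
        activities.append("document ingest")
        seen_versions[doc_id] = new_version
    return activities


seen = {}
first = crawl_pass("doc-1", "v1", seen)    # new document
second = crawl_pass("doc-1", "v1", seen)   # unchanged document
third = crawl_pass("doc-1", "v2", seen)    # changed document
print(first, second, third)
```

In this model a fetch followed by an ingest indicates a detected change, which is why a log with frequent ingests suggests more changes than the two *** notes account for.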