[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time

2014-02-12 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899260#comment-13899260
 ] 

Florian Schmedding commented on CONNECTORS-880:
---

I believe that in my case the job repetition was caused by the wrong 
collation. When a job runs with a case-insensitive collation in MySQL, it gets 
started again without a previous job end. The same job runs as expected with a 
correctly configured database. However, I think your fix is not intended to 
remedy completely inconsistent status values resulting from the wrong 
collation. So my setup isn't a test case for it.
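
A minimal sketch of how a case-insensitive collation could lead to inconsistent status values, assuming the job status is stored as a short code whose letter case is significant. The codes "A" and "a", the class name, and the `select` helper are all hypothetical and chosen only for illustration; they are not taken from the ManifoldCF schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: two status codes that differ only by case.
class CollationSketch {
    static final String STATUS_X = "A"; // assumed code for one job state
    static final String STATUS_Y = "a"; // assumed code for a different job state

    // Emulates a "WHERE status = ?" filter under a case-sensitive or a
    // case-insensitive comparison.
    static List<String> select(List<String> rows, String wanted, boolean caseInsensitive) {
        List<String> out = new ArrayList<>();
        for (String status : rows) {
            boolean match = caseInsensitive
                ? status.equalsIgnoreCase(wanted)
                : status.equals(wanted);
            if (match) {
                out.add(status);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(STATUS_X, STATUS_Y, STATUS_Y);
        // Case-sensitive: only the row that really holds "A" matches.
        System.out.println(select(rows, STATUS_X, false).size());
        // Case-insensitive: rows holding "a" are matched as well.
        System.out.println(select(rows, STATUS_X, true).size());
    }
}
```

Under the case-insensitive comparison, a query for one state also returns jobs in the other state, which is the kind of mix-up that could make a job look startable again.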

 Under the right conditions, job aborts do not update last checked time
 

                 Key: CONNECTORS-880
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-880
             Project: ManifoldCF
          Issue Type: Bug
          Components: Framework crawler agent
    Affects Versions: ManifoldCF 1.4.1
            Reporter: Karl Wright
            Assignee: Karl Wright
             Fix For: ManifoldCF 1.6


 When a scheduled job is being considered to be started, MCF updates the 
 last-check field ONLY if the job didn't start.  It relies on the job's 
 completion to set the last-check field in the case where the job does start.  
 But if the job aborts, in at least one case the last-check field is NOT 
 updated.  This leads to the job being run over and over again within the 
 schedule window.
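
The behavior described above can be sketched as a small model. This is a hypothetical illustration, not ManifoldCF code: the class, the field names, and the rule that a job is due when it has not been checked since the window opened are all assumptions made for the example.

```java
// Hypothetical model of the scheduling behavior described above; not ManifoldCF code.
class LastCheckSketch {
    long lastCheck = 0L;   // time the schedule was last evaluated
    boolean running = false;

    // Called repeatedly while the schedule window is open.
    // Assumption: a job is due when it has not been checked since the window opened.
    boolean consider(long now, long windowStart) {
        if (running) {
            return false;
        }
        if (lastCheck < windowStart) {
            running = true;      // last-check is left for the job's end to set
            return true;
        }
        lastCheck = now;         // updated only because the job did not start
        return false;
    }

    // Normal completion records the last-check time.
    void complete(long now) {
        running = false;
        lastCheck = now;
    }

    // updateOnAbort == false models the reported bug.
    void abort(long now, boolean updateOnAbort) {
        running = false;
        if (updateOnAbort) {
            lastCheck = now;
        }
    }

    // Counts how often the job starts during five checks if every run aborts at once.
    static int startsInWindow(boolean updateOnAbort) {
        LastCheckSketch job = new LastCheckSketch();
        long windowStart = 100L;
        int starts = 0;
        for (long now = windowStart; now < windowStart + 5; now++) {
            if (job.consider(now, windowStart)) {
                starts++;
                job.abort(now, updateOnAbort);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println("starts without update on abort: " + startsInWindow(false));
        System.out.println("starts with update on abort:    " + startsInWindow(true));
    }
}
```

In this model the job restarts on every check when the abort path leaves the last-check field untouched, and starts only once when the abort path records it.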



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time

2014-02-12 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899260#comment-13899260
 ] 

Florian Schmedding edited comment on CONNECTORS-880 at 2/12/14 4:55 PM:


I believe that in my case the job repetition was caused by the wrong 
collation. When a job runs with a case-insensitive collation in MySQL, it gets 
started again without a previous job end. The same job runs as expected with a 
correctly configured database. However, I think your fix is not intended to 
remedy completely inconsistent status values resulting from the wrong 
collation. So my setup isn't a test case for it.


was (Author: florianschmedding):
I believe that in my case the job repetition was depending on the wrong 
collation. When running a job with a case-insensitive collation in MySQL it get 
started again without a previous job end. The same job runs as expected with a 
correctly configured database. However, I think your fix does not intend to 
remedy completely inconsistent status values resulting from the wrong 
collation. So my setup inn't a test case for it.



[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-12 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899304#comment-13899304
 ] 

Florian Schmedding commented on CONNECTORS-850:
---

I have now tested the feature with ManifoldCF 1.6 (built with ant). The maximum 
interval is respected. Should the interval drop below the maximum value after a 
document change has been recognized? I would expect such behavior, but in a test 
this does not seem to be the case.

 Maximum interval in dynamic crawling
 

                 Key: CONNECTORS-850
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-850
             Project: ManifoldCF
          Issue Type: New Feature
          Components: Framework crawler agent
    Affects Versions: ManifoldCF 1.4.1
            Reporter: Florian Schmedding
            Assignee: Karl Wright
            Priority: Minor
              Labels: features
             Fix For: ManifoldCF 1.5


 Currently, the dynamic crawling method used for a continuous job extends the 
 reseed and recrawl intervals when no changes are found in a checked document. 
 However, it should be possible to restrict this extension to a maximum value 
 in order to make sure that new documents are discovered within a certain 
 interval.





[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-12 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899490#comment-13899490
 ] 

Florian Schmedding commented on CONNECTORS-850:
---

minimum interval: 2 min
maximum interval: 4 min
The job was run a few times before it was set to dynamic crawling. 

02-12-2014 18:27:45.475 fetch
02-12-2014 18:23:45.702 document ingest (solr localhost)
02-12-2014 18:23:44.921 fetch
02-12-2014 18:19:44.451 document ingest (solr localhost)
02-12-2014 18:19:43.837 fetch
02-12-2014 18:15:42.929 fetch
02-12-2014 18:11:41.582 document ingest (solr localhost)
02-12-2014 18:11:41.058 fetch
*** document changed
02-12-2014 18:07:40.744 document ingest (solr localhost)
02-12-2014 18:07:40.249 fetch
02-12-2014 18:03:37.546 fetch
02-12-2014 17:59:36.426 fetch
02-12-2014 17:55:34.297 fetch
02-12-2014 17:51:33.431 document ingest (solr localhost)
02-12-2014 17:51:32.973 fetch
*** job changed from scheduled to dynamic crawling
02-12-2014 17:24:24.560 document ingest (solr localhost)
02-12-2014 17:24:24.413 fetch
02-12-2014 17:21:17.042 document ingest (solr localhost)
02-12-2014 17:21:16.919 fetch
02-12-2014 17:18:15.892 document ingest (solr localhost)




[jira] [Comment Edited] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-12 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899490#comment-13899490
 ] 

Florian Schmedding edited comment on CONNECTORS-850 at 2/12/14 7:44 PM:


minimum interval: 2 min
maximum interval: 4 min
The job was run a few times before it was set to dynamic crawling. 

{noformat}
02-12-2014 18:27:45.475 fetch
02-12-2014 18:23:45.702 document ingest (solr localhost)
02-12-2014 18:23:44.921 fetch
02-12-2014 18:19:44.451 document ingest (solr localhost)
02-12-2014 18:19:43.837 fetch
02-12-2014 18:15:42.929 fetch
02-12-2014 18:11:41.582 document ingest (solr localhost)
02-12-2014 18:11:41.058 fetch
*** document changed
02-12-2014 18:07:40.744 document ingest (solr localhost)
02-12-2014 18:07:40.249 fetch
02-12-2014 18:03:37.546 fetch
02-12-2014 17:59:36.426 fetch
02-12-2014 17:55:34.297 fetch
02-12-2014 17:51:33.431 document ingest (solr localhost)
02-12-2014 17:51:32.973 fetch
*** job changed from scheduled to dynamic crawling
02-12-2014 17:24:24.560 document ingest (solr localhost)
02-12-2014 17:24:24.413 fetch
02-12-2014 17:21:17.042 document ingest (solr localhost)
02-12-2014 17:21:16.919 fetch
02-12-2014 17:18:15.892 document ingest (solr localhost)
{noformat}
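
To make the spacing in the history above easier to read, here is a small sketch that computes the gaps between consecutive fetch entries after the switch to dynamic crawling. The class and method names are hypothetical; the timestamps are copied from the log, listed oldest first, and the pattern `MM-dd-yyyy HH:mm:ss.SSS` is assumed from the way the entries are printed.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for reading the history above; not part of ManifoldCF.
class FetchGaps {
    static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("MM-dd-yyyy HH:mm:ss.SSS");

    // Returns the gaps in milliseconds between consecutive timestamps (oldest first).
    static List<Long> gapsMillis(List<String> stamps) {
        List<Long> gaps = new ArrayList<>();
        for (int i = 1; i < stamps.size(); i++) {
            LocalDateTime prev = LocalDateTime.parse(stamps.get(i - 1), FMT);
            LocalDateTime next = LocalDateTime.parse(stamps.get(i), FMT);
            gaps.add(Duration.between(prev, next).toMillis());
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Fetch entries after the job was changed to dynamic crawling, oldest first.
        List<String> fetches = Arrays.asList(
            "02-12-2014 17:51:32.973",
            "02-12-2014 17:55:34.297",
            "02-12-2014 17:59:36.426",
            "02-12-2014 18:03:37.546",
            "02-12-2014 18:07:40.249",
            "02-12-2014 18:11:41.058",
            "02-12-2014 18:15:42.929",
            "02-12-2014 18:19:43.837",
            "02-12-2014 18:23:44.921",
            "02-12-2014 18:27:45.475");
        for (long ms : gapsMillis(fetches)) {
            System.out.println(ms / 1000.0 + " s");
        }
    }
}
```

Every gap comes out at roughly four minutes, including the gaps after the marked document change, which matches the configured maximum interval rather than the minimum.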






[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-12 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899497#comment-13899497
 ] 

Karl Wright commented on CONNECTORS-850:


Thanks -- I'll make sure this makes sense first chance I get.

In general, the algorithm relies heavily on the time that has elapsed since the 
last fetch.  If I recall correctly, it keeps a weighted average of the interval 
between changes and uses that to estimate the next time it should attempt a 
fetch.  The limits that are applied don't affect the calculation; they just 
limit the actual result used.
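
A rough sketch of the idea described above, under stated assumptions: the estimate is modeled as an exponentially weighted average, and the smoothing weight, the class, and the method names are hypothetical. It is not the actual ManifoldCF calculation; it only shows how clamping the result, rather than the running average, behaves.

```java
// Hypothetical sketch of "limits apply to the result, not to the calculation".
class DynamicIntervalSketch {
    final double weight;    // assumed weight given to the newest observation
    final long minMillis;   // configured minimum interval
    final long maxMillis;   // configured maximum interval
    double average;         // running estimate; never clamped

    DynamicIntervalSketch(double weight, long minMillis, long maxMillis, double initialAverage) {
        this.weight = weight;
        this.minMillis = minMillis;
        this.maxMillis = maxMillis;
        this.average = initialAverage;
    }

    // Folds a newly observed interval between changes into the running average.
    void observe(long observedMillis) {
        average = weight * observedMillis + (1.0 - weight) * average;
    }

    // Only the returned value is limited; the average keeps its unclamped value.
    long nextInterval() {
        return Math.max(minMillis, Math.min(maxMillis, Math.round(average)));
    }

    public static void main(String[] args) {
        // Limits of 2 and 4 minutes; the average starts well above the maximum.
        DynamicIntervalSketch s = new DynamicIntervalSketch(0.5, 120_000L, 240_000L, 600_000.0);
        System.out.println(s.nextInterval());
        for (int i = 0; i < 3; i++) {
            s.observe(120_000L);
            System.out.println(s.nextInterval());
        }
    }
}
```

In a model like this, an average that has drifted far above the maximum needs several observed changes before the clamped result drops below the maximum, which might be one reason a single document change does not shorten the interval right away.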




[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-12 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899529#comment-13899529
 ] 

Florian Schmedding commented on CONNECTORS-850:
---

The entries from the second history result page were missing from my 
previous comment. Starting with a document ingest without a fetch didn't make 
sense. 



Jenkins build is back to normal : ManifoldCF-mvn #152

2014-02-12 Thread Apache Jenkins Server
See https://builds.apache.org/job/ManifoldCF-mvn/152/changes



[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-12 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899907#comment-13899907
 ] 

Karl Wright commented on CONNECTORS-850:


Hi Florian,

According to your *** notes:

First fetch: 17:00:33.256
First detected change to the document: 18:11:41.058

But: ingestions are inconsistent with this; they seem to reflect much more 
frequent changes.  Ingestions should not happen unless there has been a change 
noted.  The change would be to any feature of the document that would change 
how it was indexed.  Can you explain?


