[jira] [Commented] (CONNECTORS-899) Consider/ignore HTTP header fields when checking for document change

2014-02-21 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908414#comment-13908414
 ] 

Florian Schmedding commented on CONNECTORS-899:
---

Perhaps there is a more minimal solution, as indicated in 
[CONNECTORS-850|https://issues.apache.org/jira/browse/CONNECTORS-850?focusedCommentId=13901754&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13901754].

 Consider/ignore HTTP header fields when checking for document change
 

 Key: CONNECTORS-899
 URL: https://issues.apache.org/jira/browse/CONNECTORS-899
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Web connector
Affects Versions: ManifoldCF 1.6
Reporter: Florian Schmedding
Assignee: Karl Wright
Priority: Minor
  Labels: http
 Fix For: ManifoldCF 1.6


 The web connector already ignores certain HTTP header fields that change 
 on every request when checking for document changes. However, this list is 
 hardcoded. Some web servers are not properly configured and return a new 
 Last-Modified date on each request even though the document remains the same. 
 This leads to many unnecessary re-ingestions. It would be nice to be able to 
 configure which header fields should be considered and which should be 
 ignored.
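
The request above can be illustrated with a minimal sketch. It is not the web connector's actual code: the function name `version_string`, the default ignore set, and the use of a SHA-256 body digest are all assumptions made for illustration. It shows how a configurable include/exclude list would keep a changing Last-Modified value from registering as a document change.

```python
import hashlib

# Hypothetical default: header names assumed to vary on every request.
DEFAULT_IGNORED = frozenset({"date", "age", "set-cookie"})


def version_string(headers, body, include=None, exclude=DEFAULT_IGNORED):
    """Build a change-detection fingerprint from selected headers plus the body.

    headers: dict of header name -> value
    body:    response body as bytes
    include: if given, only these header names are considered
    exclude: header names that are never considered
    """
    exclude = {name.lower() for name in exclude}
    include = None if include is None else {name.lower() for name in include}
    parts = []
    for name in sorted(headers, key=str.lower):
        key = name.lower()
        if key in exclude:
            continue
        if include is not None and key not in include:
            continue
        parts.append(f"{key}={headers[name]}")
    # The body digest makes real content changes visible regardless of headers.
    parts.append(hashlib.sha256(body).hexdigest())
    return "+".join(parts)
```

With `exclude={"date", "last-modified"}`, two responses that differ only in those two fields yield the same string, so no re-ingestion would be triggered.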



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-14 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901726#comment-13901726
 ] 

Florian Schmedding commented on CONNECTORS-850:
---

Would it be possible to add header include and exclude lists to the 
configuration options of a web repository? Some web servers update the 
Last-Modified date on each access even though nothing has changed. Which header 
fields should be considered when checking for changes depends on the content 
and the server.

 Maximum interval in dynamic crawling
 

 Key: CONNECTORS-850
 URL: https://issues.apache.org/jira/browse/CONNECTORS-850
 Project: ManifoldCF
  Issue Type: New Feature
  Components: Framework crawler agent
Affects Versions: ManifoldCF 1.4.1
Reporter: Florian Schmedding
Assignee: Karl Wright
Priority: Minor
  Labels: features
 Fix For: ManifoldCF 1.5


 Currently, the dynamic crawling method used for a continuous job extends the 
 reseed and recrawl intervals when no changes are found in a checked document. 
 However, it should be possible to restrict this extension to a maximum value 
 in order to make sure that new documents are discovered within a certain 
 interval.
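
The capped extension described above can be sketched as follows. This is an illustration only, not the framework's scheduling code: the function name, the doubling factor, and the reset to the minimum after a change are assumptions.

```python
def next_interval(current, changed, minimum, maximum, growth=2.0):
    """Return the next recrawl interval in seconds.

    current: interval used for the last check
    changed: whether the last check found a document change
    minimum, maximum: configured bounds
    growth:  hypothetical factor by which an unchanged document's interval grows
    """
    if changed:
        # Assumption: a detected change resets the interval to the minimum.
        return minimum
    # Without a change the interval grows, but never beyond the maximum.
    return min(current * growth, maximum)
```

For example, with a minimum of 120 s and a maximum of 240 s, an unchanged document moves from 120 s to 240 s and then stays there.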





[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-13 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900198#comment-13900198
 ] 

Florian Schmedding commented on CONNECTORS-850:
---

What contributes to a document change - anything besides the content, e.g., 
HTTP header fields? The content was only changed at the time indicated by the 
*** note. The document is served by an Apache http server on localhost. I 
used a modified webcrawler connector that recognizes links in a custom xml 
format (it parses the xml and extracts the links and a document id, nothing 
else).



[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-13 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900307#comment-13900307
 ] 

Florian Schmedding commented on CONNECTORS-850:
---

Normally, the HTTP header contains a date value. It looks like this field is 
removed before computing the changes. A few values from ingeststatus:
{noformat}
lastingest: 12 Feb 2014 18:27:45
firstingest: 12 Feb 2014 17:00:25 (doesn't match exactly the history entry 
above, same for another document)
lastoutputversion: 0+0++
lastversion: 
0+-8+header-Accept-Ranges=bytes=+header-Connection=Keep-Alive=+header-Content-Length=7559=+header-Content-Type=application/xml=+header-ETag=14200039b75-1d87-4f238a1156aaf=+header-Keep-Alive=timeout\\=5,
 max\\=100=+header-Last-Modified=Wed, 12 Feb 2014 17:09:01 
GMT=+header-Server=Apache/2.2.22 (Win32) PHP/5.4.5 
mod_jk/1.2.37=+845393346261438975+.*+
changecount: 22
{noformat}

Not considering the header date would explain the above fetches without 
ingests. I hope this makes sense.
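
To check which header fields ended up in a stored version string, a small parser helps. The sketch below assumes, purely from the sample above, that each field appears as `header-NAME=VALUE=+`; it does not handle values that themselves contain `=+`, and the function name is hypothetical.

```python
import re

# Assumed layout, inferred from the sample only: header-NAME=VALUE=+
HEADER_FIELD = re.compile(r"header-([A-Za-z0-9-]+)=(.*?)=\+")


def header_fields(version):
    """Return the header name -> value pairs embedded in a version string."""
    return {name: value for name, value in HEADER_FIELD.findall(version)}


# Shortened copy of the lastversion value quoted above.
sample = (
    "0+-8+header-Accept-Ranges=bytes=+header-Content-Length=7559=+"
    "header-Last-Modified=Wed, 12 Feb 2014 17:09:01 GMT=+"
    "header-Server=Apache/2.2.22 (Win32) PHP/5.4.5 mod_jk/1.2.37=+"
    "845393346261438975+.*+"
)
fields = header_fields(sample)
```

Here `fields` contains Last-Modified but no Date entry, which is consistent with the Date header being dropped before the comparison.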



[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-13 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900504#comment-13900504
 ] 

Florian Schmedding commented on CONNECTORS-850:
---

Yes, there may be some other header that should be removed, but I could not see 
any changing one in the above example crawl except Date and Age. Btw, the 
fetch times coincide exactly with the times logged by the Apache server.



[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time

2014-02-12 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899260#comment-13899260
 ] 

Florian Schmedding commented on CONNECTORS-880:
---

I believe that in my case the job repetition was caused by the wrong 
collation. When a job runs with a case-insensitive collation in MySQL, it gets 
started again without a previous job end. The same job runs as expected with a 
correctly configured database. However, I think your fix is not intended to 
remedy the completely inconsistent status values that result from the wrong 
collation, so my setup isn't a test case for it.

 Under the right conditions, job aborts do not update last checked time
 

 Key: CONNECTORS-880
 URL: https://issues.apache.org/jira/browse/CONNECTORS-880
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework crawler agent
Affects Versions: ManifoldCF 1.4.1
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 1.6


 When a scheduled job is being considered to be started, MCF updates the 
 last-check field ONLY if the job didn't start.  It relies on the job's 
 completion to set the last-check field in the case where the job does start.  
 But if the job aborts, in at least one case the last-check field is NOT 
 updated.  This leads to the job being run over and over again within the 
 schedule window.
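
The failure mode described above, and one way to avoid it, can be sketched like this. The sketch is not the actual MCF fix: the class, the function, and the field name `last_check` are hypothetical. The point is only that recording the check time in a `finally` clause covers the abort path as well.

```python
import time


class ScheduledJob:
    """Hypothetical stand-in for a scheduled job record."""

    def __init__(self, run):
        self.run = run          # callable performing the crawl; raises on abort
        self.last_check = None  # when the schedule was last evaluated


def check_and_maybe_start(job, should_start, now=time.time):
    """Evaluate the schedule once and always record the check time."""
    started = False
    try:
        if should_start(job):
            started = True
            job.run()
    finally:
        # Runs whether the job completed, aborted, or never started, so the
        # same schedule window cannot trigger the job over and over again.
        job.last_check = now()
    return started
```

An aborting job still propagates its exception to the caller, but `last_check` has been set by then.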





[jira] [Comment Edited] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time

2014-02-12 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899260#comment-13899260
 ] 

Florian Schmedding edited comment on CONNECTORS-880 at 2/12/14 4:55 PM:


I believe that in my case the job repetition was caused by the wrong 
collation. When a job runs with a case-insensitive collation in MySQL, it gets 
started again without a previous job end. The same job runs as expected with a 
correctly configured database. However, I think your fix is not intended to 
remedy the completely inconsistent status values that result from the wrong 
collation, so my setup isn't a test case for it.




[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-12 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899304#comment-13899304
 ] 

Florian Schmedding commented on CONNECTORS-850:
---

I have now tested the feature with ManifoldCF 1.6 (built with ant). The maximum 
interval is respected. Should the interval drop below the maximum value after a 
document change is recognized? I would expect such behavior, but in a test 
this does not seem to be the case.



[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-12 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899490#comment-13899490
 ] 

Florian Schmedding commented on CONNECTORS-850:
---

minimum interval: 2 min
maximum interval: 4 min
The job was run a few times before it was set to dynamic crawling. 

02-12-2014 18:27:45.475 fetch
02-12-2014 18:23:45.702 document ingest (solr localhost)
02-12-2014 18:23:44.921 fetch
02-12-2014 18:19:44.451 document ingest (solr localhost)
02-12-2014 18:19:43.837 fetch
02-12-2014 18:15:42.929 fetch
02-12-2014 18:11:41.582 document ingest (solr localhost)
02-12-2014 18:11:41.058 fetch
*** document changed
02-12-2014 18:07:40.744 document ingest (solr localhost)
02-12-2014 18:07:40.249 fetch
02-12-2014 18:03:37.546 fetch
02-12-2014 17:59:36.426 fetch
02-12-2014 17:55:34.297 fetch
02-12-2014 17:51:33.431 document ingest (solr localhost)
02-12-2014 17:51:32.973 fetch
*** job changed from scheduled to dynamic crawling
02-12-2014 17:24:24.560 document ingest (solr localhost)
02-12-2014 17:24:24.413 fetch
02-12-2014 17:21:17.042 document ingest (solr localhost)
02-12-2014 17:21:16.919 fetch
02-12-2014 17:18:15.892 document ingest (solr localhost)




[jira] [Comment Edited] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-12 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899490#comment-13899490
 ] 

Florian Schmedding edited comment on CONNECTORS-850 at 2/12/14 7:44 PM:


minimum interval: 2 min
maximum interval: 4 min
The job was run a few times before it was set to dynamic crawling. 

{noformat}
02-12-2014 18:27:45.475 fetch
02-12-2014 18:23:45.702 document ingest (solr localhost)
02-12-2014 18:23:44.921 fetch
02-12-2014 18:19:44.451 document ingest (solr localhost)
02-12-2014 18:19:43.837 fetch
02-12-2014 18:15:42.929 fetch
02-12-2014 18:11:41.582 document ingest (solr localhost)
02-12-2014 18:11:41.058 fetch
*** document changed
02-12-2014 18:07:40.744 document ingest (solr localhost)
02-12-2014 18:07:40.249 fetch
02-12-2014 18:03:37.546 fetch
02-12-2014 17:59:36.426 fetch
02-12-2014 17:55:34.297 fetch
02-12-2014 17:51:33.431 document ingest (solr localhost)
02-12-2014 17:51:32.973 fetch
*** job changed from scheduled to dynamic crawling
02-12-2014 17:24:24.560 document ingest (solr localhost)
02-12-2014 17:24:24.413 fetch
02-12-2014 17:21:17.042 document ingest (solr localhost)
02-12-2014 17:21:16.919 fetch
02-12-2014 17:18:15.892 document ingest (solr localhost)
{noformat}
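
The spacing between fetches in a history like the one above can be checked mechanically. The following sketch assumes the `MM-DD-YYYY HH:MM:SS.mmm activity` line layout shown in the listing; the function name is hypothetical and only a few of the lines are copied in.

```python
from datetime import datetime

# A few lines copied from the history listing above.
HISTORY = """\
02-12-2014 18:27:45.475 fetch
02-12-2014 18:23:45.702 document ingest (solr localhost)
02-12-2014 18:23:44.921 fetch
02-12-2014 18:19:44.451 document ingest (solr localhost)
02-12-2014 18:19:43.837 fetch
02-12-2014 18:15:42.929 fetch
02-12-2014 18:11:41.058 fetch
*** document changed
"""


def fetch_intervals(history):
    """Return the seconds between consecutive fetch entries, oldest first."""
    times = []
    for line in history.splitlines():
        parts = line.split(" ", 2)
        # Marker lines and ingest entries are skipped; only fetches count.
        if len(parts) == 3 and parts[2] == "fetch":
            stamp = parts[0] + " " + parts[1]
            times.append(datetime.strptime(stamp, "%m-%d-%Y %H:%M:%S.%f"))
    times.sort()
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


intervals = fetch_intervals(HISTORY)
```

For these lines every interval is slightly above 240 seconds, which matches the configured maximum interval of 4 minutes.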





[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-02-12 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899529#comment-13899529
 ] 

Florian Schmedding commented on CONNECTORS-850:
---

The entries from the second history result page were missing from my 
previous comment. Starting with a document ingest without a fetch didn't make 
sense. 



[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time

2014-02-11 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897667#comment-13897667
 ] 

Florian Schmedding commented on CONNECTORS-880:
---

There are some errors in the ManifoldCF log:

DEBUG 2014-02-11 10:11:11,989 (Thread-20602) - Actual query: [SELECT status,connectionname,outputname FROM jobs WHERE id=? FOR UPDATE]
DEBUG 2014-02-11 10:11:11,989 (Thread-20602) -   Parameter 0: '1392051994515'
DEBUG 2014-02-11 10:11:11,989 (Thread-20602) - Done actual query (0ms): [SELECT status,connectionname,outputname FROM jobs WHERE id=? FOR UPDATE]
DEBUG 2014-02-11 10:11:11,989 (Job reset thread) - Ending transaction
DEBUG 2014-02-11 10:11:11,989 (Job reset thread) - Rolling transaction back!
DEBUG 2014-02-11 10:11:11,992 (Thread-20603) - Actual query: [ROLLBACK]
DEBUG 2014-02-11 10:11:11,992 (Thread-20603) - Done actual query (0ms): [ROLLBACK]
ERROR 2014-02-11 10:11:11,992 (Job reset thread) - Exception tossed: Unexpected job status encountered: 33
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected job status encountered: 33
    at org.apache.manifoldcf.crawler.jobs.Jobs.returnJobToActive(Jobs.java:1726)
    at org.apache.manifoldcf.crawler.jobs.JobManager.resetJobs(JobManager.java:7427)
    at org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:91)

There is a similar exception with Unexpected job status encountered: 34. When 
looking into the database, the status field of all jobs is constantly changing 
between 's' and 'n'.
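
Why a case-insensitive collation can scramble such a status column is easy to show in isolation. The sketch assumes that the status is stored as single-letter codes in which upper and lower case denote different states; the job ids and codes below are made up for illustration.

```python
def matches(stored, wanted, case_sensitive):
    """Mimic a string equality test under the two kinds of collation."""
    if case_sensitive:
        return stored == wanted
    return stored.casefold() == wanted.casefold()


def select_jobs(jobs, wanted_status, case_sensitive):
    """Return the ids of all jobs whose status equals wanted_status."""
    return [job_id for job_id, status in jobs.items()
            if matches(status, wanted_status, case_sensitive)]


# Hypothetical job table: id -> status code.
jobs = {"job-a": "n", "job-b": "N", "job-c": "s"}
```

A lookup for status `N` returns only `job-b` with a case-sensitive comparison, but also `job-a` with a case-insensitive one, so a state transition meant for one state would be applied to jobs in another.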



[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time

2014-02-11 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897672#comment-13897672
 ] 

Florian Schmedding commented on CONNECTORS-880:
---

I'm using a Solr output connection. Manually sending a document to its update 
handler does not cause any problems; however, ManifoldCF seems to receive only 
service interruptions. No document gets indexed.

 WARN 2014-02-11 10:17:36,592 (Job notification thread) - IO exception during commit: The target server failed to respond
org.apache.http.NoHttpResponseException: The target server failed to respond
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:61)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
    at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:291)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1219)
 WARN 2014-02-11 10:17:36,592 (Job notification thread) - Service interruption notifying connection - retrying: IO exception during commit: The target server failed to respond
org.apache.manifoldcf.agents.interfaces.ServiceInterruption: IO exception during commit: The target server failed to respond
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(HttpPoster.java:477)
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.handleSolrServerException(HttpPoster.java:357)
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.commitPost(HttpPoster.java:304)
    at org.apache.manifoldcf.agents.output.solr.SolrConnector.noteJobComplete(SolrConnector.java:744)
    at org.apache.manifoldcf.crawler.system.JobNotificationThread.run(JobNotificationThread.java:121)



[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time

2014-02-11 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897703#comment-13897703
 ] 

Florian Schmedding commented on CONNECTORS-880:
---

A job with a null output connection works fine (same repository).



[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time

2014-02-11 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897868#comment-13897868
 ] 

Florian Schmedding commented on CONNECTORS-880:
---

Unfortunately, the same errors still occur with the new revision. I'm using 
the manifoldcf-combined-service.war under Tomcat 7, but I don't know whether 
multiple agents processes are running (according to the build doc I think there 
should be only one), nor how to check that. 

About the Solr connection: 
Connection status:  Connection working (View Output Connection Status)
I cannot find any wrong parameter, Solr admin is working fine. The ping request 
from manifold is visible in the access log:
127.0.0.1 - - [11/Feb/2014:15:08:52 +0100] GET /solr/default/admin/ping?wt=xml&version=2.2 HTTP/1.1 200 1329

Other manually executed requests work as well:
0:0:0:0:0:0:0:1 - - [11/Feb/2014:15:12:21 +0100] GET /solr/default/update?commit=true HTTP/1.1 200 160

However, no further requests from manifold are logged. Did the Solr connection 
handler change? I'm using Solr 4.3.1.
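
The manual requests mentioned above can be reproduced with a few lines of standard-library Python. This is only a sketch: the base URL (host and port) is an assumption, the helper names are hypothetical, and only the core name `default` and the two request paths come from the access log.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def solr_url(base, core, handler, **params):
    """Assemble a Solr request URL from its parts."""
    url = f"{base.rstrip('/')}/{core}/{handler.lstrip('/')}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


def fetch_status(url, timeout=10):
    """Send a GET request and return the HTTP status code."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


# Hypothetical base URL; adjust host and port to the local installation.
BASE = "http://localhost:8983/solr"
ping_url = solr_url(BASE, "default", "admin/ping", wt="xml")
commit_url = solr_url(BASE, "default", "update", commit="true")
```

Calling `fetch_status(commit_url)` against a running server should then show up in the access log in the same way as the manually executed commit request.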

*

DEBUG 2014-02-11 14:54:17,774 (Job reset thread) - Job 1385456433981 now completed
ERROR 2014-02-11 14:54:17,801 (Job reset thread) - Exception tossed: Unexpected job status encountered: 33
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected job status encountered: 33
    at org.apache.manifoldcf.crawler.jobs.Jobs.returnJobToActive(Jobs.java:1901)
    at org.apache.manifoldcf.crawler.jobs.JobManager.resetJobs(JobManager.java:7726)
    at org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:91)
DEBUG 2014-02-11 14:54:17,857 (Job notification thread) - Found job 1385456433981 in need of notification
DEBUG 2014-02-11 14:54:17,862 (Job notification thread) - Found job 1392051994515 in need of notification
DEBUG 2014-02-11 14:54:17,867 (Job notification thread) - Found job 1392109738731 in need of notification
DEBUG 2014-02-11 14:54:17,871 (Job notification thread) - Found job 1392112746052 in need of notification
DEBUG 2014-02-11 14:54:17,891 (Job reset thread) - Job 1385456433981 now completed
ERROR 2014-02-11 14:54:17,928 (Job reset thread) - Exception tossed: Unexpected job status encountered: 34
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected job status encountered: 34
    at org.apache.manifoldcf.crawler.jobs.Jobs.returnJobToActive(Jobs.java:1901)
    at org.apache.manifoldcf.crawler.jobs.JobManager.resetJobs(JobManager.java:7726)
    at org.apache.manifoldcf.crawler.system.JobResetThread.run(JobResetThread.java:91)

**

 WARN 2014-02-11 15:02:28,180 (Job notification thread) - Notification service interruption reported for job 1392112746052 output connection 'solr localhost': IO exception during commit: The target server failed to respond
org.apache.manifoldcf.agents.interfaces.ServiceInterruption: IO exception during commit: The target server failed to respond
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.handleIOException(HttpPoster.java:477)
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.handleSolrServerException(HttpPoster.java:357)
    at org.apache.manifoldcf.agents.output.solr.HttpPoster.commitPost(HttpPoster.java:304)
    at org.apache.manifoldcf.agents.output.solr.SolrConnector.noteJobComplete(SolrConnector.java:744)
    at org.apache.manifoldcf.crawler.system.JobNotificationThread.run(JobNotificationThread.java:118)
Caused by: org.apache.http.NoHttpResponseException: The target server failed to respond
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:61)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at 

[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time

2014-02-11 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13898001#comment-13898001
 ] 

Florian Schmedding commented on CONNECTORS-880:
---

Replacing the solrj-connector with an older version does not seem to solve the 
problem. The job still gets aborted during ingestion.

 Under the right conditions, job aborts do not update last checked time
 

 Key: CONNECTORS-880
 URL: https://issues.apache.org/jira/browse/CONNECTORS-880
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework crawler agent
Affects Versions: ManifoldCF 1.4.1
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 1.6


 When a scheduled job is being considered to be started, MCF updates the 
 last-check field ONLY if the job didn't start.  It relies on the job's 
 completion to set the last-check field in the case where the job does start.  
 But if the job aborts, in at least one case the last-check field is NOT 
 updated.  This leads to the job being run over and over again within the 
 schedule window.
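
The scheduling behaviour described above can be modelled roughly as follows. This is a simplified sketch, not ManifoldCF's actual code: the class LastCheckSketch, the Outcome enum, and all method names are hypothetical, and the buggy variant assumes that an abort never records the last-check time, whereas the issue only says this happens in at least one case.

```java
// Simplified, hypothetical model of the last-check bookkeeping; not MCF code.
final class LastCheckSketch {
    enum Outcome { NOT_STARTED, COMPLETED, ABORTED }

    /** Hypothetical job record; only the last-check timestamp matters here. */
    static final class Job {
        long lastCheckTime = -1L; // -1 means "never checked"
    }

    /** Behaviour as described in the issue: the abort path records nothing. */
    static void recordBuggy(Job job, Outcome outcome, long now) {
        if (outcome == Outcome.NOT_STARTED || outcome == Outcome.COMPLETED) {
            job.lastCheckTime = now;
        }
    }

    /** Assumed fix: every outcome, including an abort, records the check. */
    static void recordFixed(Job job, Outcome outcome, long now) {
        job.lastCheckTime = now;
    }

    /** A job may start if the window is open and it was not checked since the window opened. */
    static boolean eligible(Job job, long windowStart, long windowEnd, long now) {
        return now >= windowStart && now < windowEnd && job.lastCheckTime < windowStart;
    }
}
```

In this model a job that aborts at time 150 inside the window [100, 200) is still eligible at time 160 with recordBuggy, so it would be started again, while recordFixed makes it ineligible until the next window.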





[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time

2014-02-11 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13898413#comment-13898413
 ] 

Florian Schmedding commented on CONNECTORS-880:
---

I just replaced the httpclient library (without rebuilding) and now the HTTP POST 
works fine. I noticed that all connections except those from Manifold appeared 
with the address 0:0:0:0:0:0:0:1 instead of 127.0.0.1. There is a resolved 
issue about IPv6 addresses: 
https://issues.apache.org/jira/browse/HTTPCLIENT-1317. I don't know whether this was 
really the cause of the trouble, but in any case the new version works.
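
For reference, 0:0:0:0:0:0:0:1 is the IPv6 loopback address and 127.0.0.1 is an IPv4 loopback address, so both refer to the local machine. A minimal sketch using java.net.InetAddress shows this; the class and method names (LoopbackSketch, isLoopbackLiteral) are made up for illustration and nothing here is taken from ManifoldCF or HttpClient.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Illustrative helper only; names are hypothetical.
final class LoopbackSketch {
    /** Returns true if the given IP literal denotes a loopback address. */
    static boolean isLoopbackLiteral(String literal) {
        try {
            // For an IP literal, getByName only validates the format; no DNS lookup is done.
            return InetAddress.getByName(literal).isLoopbackAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not a valid address: " + literal, e);
        }
    }
}
```

Both isLoopbackLiteral("127.0.0.1") and isLoopbackLiteral("0:0:0:0:0:0:0:1") return true, so the different notation in the logs only reflects which IP protocol family a client used.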

mcf-combined-service.war: 
httpclient.jar - httpclient-4.3.2.jar
httpcore.jar - httpcore-4.3.1.jar

connector-lib:
httpmime.jar - httpmime-4.3.2.jar (perhaps not important)

(binary from 
http://ftp.fau.de/apache//httpcomponents/httpclient/binary/httpcomponents-client-4.3.2-bin.zip
 at http://hc.apache.org/downloads.cgi)

After restarting Tomcat, all documents are indexed by Solr. I have now configured a 
schedule to check the fix you provided. Thanks for your help!








[jira] [Created] (CONNECTORS-887) Database schema not updated properly

2014-02-10 Thread Florian Schmedding (JIRA)
Florian Schmedding created CONNECTORS-887:
-

 Summary: Database schema not updated properly
 Key: CONNECTORS-887
 URL: https://issues.apache.org/jira/browse/CONNECTORS-887
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework core
Affects Versions: ManifoldCF 1.6
 Environment: MySQL database
Reporter: Florian Schmedding
Priority: Minor


When running Manifold 1.6 for the first time with a database schema from Manifold 
1.3, the schema is not updated properly. The SQL command 

ALTER TABLE authconnections MODIFY groupname VARCHAR(32) NOT NULL REFERENCES 
authgroups(groupname) ON DELETE RESTRICT

fails. It should instead add the column:

ALTER TABLE authconnections ADD groupname VARCHAR(32) NOT NULL REFERENCES 
authgroups(groupname) ON DELETE RESTRICT

The next startup after executing the corrected SQL statement succeeds.
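
One way to picture the suggested behaviour is to choose between ADD and MODIFY depending on whether the column already exists. The following is only a sketch under that assumption: the class SchemaUpgradeSketch and the method groupnameStatement are hypothetical and do not reflect how ManifoldCF's upgrade code is structured.

```java
import java.util.Collection;

// Hypothetical illustration of picking ADD vs. MODIFY; not ManifoldCF code.
final class SchemaUpgradeSketch {
    /**
     * Builds the ALTER TABLE statement for authconnections.groupname.
     * existingColumns is assumed to hold the column names currently present
     * in the authconnections table.
     */
    static String groupnameStatement(Collection<String> existingColumns) {
        boolean present = false;
        for (String column : existingColumns) {
            if (column.equalsIgnoreCase("groupname")) {
                present = true;
                break;
            }
        }
        // A 1.3 schema lacks the column, so it must be added rather than modified.
        String verb = present ? "MODIFY" : "ADD";
        return "ALTER TABLE authconnections " + verb
            + " groupname VARCHAR(32) NOT NULL REFERENCES authgroups(groupname) ON DELETE RESTRICT";
    }
}
```

The set of existing column names could, for example, be read through JDBC's DatabaseMetaData.getColumns before deciding which statement to run.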





[jira] [Commented] (CONNECTORS-880) Under the right conditions, job aborts do not update last checked time

2014-02-10 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13896778#comment-13896778
 ] 

Florian Schmedding commented on CONNECTORS-880:
---

I installed mcf-combined-service.war version 1.6 on a Tomcat server. However, 
the test job is not working correctly. I used an existing database from 
Manifold 1.3 with one existing output connection, one existing repository 
connection and one existing job. On the job status page there were no buttons 
to control the job, and its status was "End notification". I therefore copied the 
job. The new job did have start buttons, but it got only errors from 
the output connection when ingesting documents. After I aborted the job, it got 
stuck in the status "End notification" and has not left it. Would it be better to 
create new connections and jobs, or are there other problems in version 1.6?






[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-01-10 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867820#comment-13867820
 ] 

Florian Schmedding commented on CONNECTORS-850:
---

Tomcat did not start up correctly. There might be a conflict between 
mcf-combined-service-1.5-SNAPSHOT.war and the libraries and connectors in the 
folder referenced in org.apache.manifoldcf.configfile. The log contains: 

INFO: validateJarFile(C:\PROGRA~1\APACHE~2\TOMCAT~1.0\webapps\mcf-combined-service-1.5\WEB-INF\lib\jsp-api-2.1-glassfish-2.1.v20091210.jar) - jar not loaded. See Servlet Spec 2.3, section 9.7.2. Offending class: javax/el/Expression.class.

 Maximum interval in dynamic crawling
 

 Key: CONNECTORS-850
 URL: https://issues.apache.org/jira/browse/CONNECTORS-850
 Project: ManifoldCF
  Issue Type: New Feature
  Components: Framework crawler agent
Affects Versions: ManifoldCF 1.4.1
Reporter: Florian Schmedding
Assignee: Karl Wright
Priority: Minor
  Labels: features
 Fix For: ManifoldCF 1.5


 Currently, the dynamic crawling method used for a continuous job extends the 
 reseed and recrawl intervals when no changes are found in a checked document. 
 However, it should be possible to restrict this extension to a maximum value 
 in order to make sure that new documents are discovered within a certain 
 interval.





[jira] [Commented] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-01-09 Thread Florian Schmedding (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866521#comment-13866521
 ] 

Florian Schmedding commented on CONNECTORS-850:
---

Is there a nightly build available? I couldn't run the 
mcf-combined-service-1.5-SNAPSHOT.war that I built with Maven.






[jira] [Created] (CONNECTORS-850) Maximum interval in dynamic crawling

2014-01-05 Thread Florian Schmedding (JIRA)
Florian Schmedding created CONNECTORS-850:
-

 Summary: Maximum interval in dynamic crawling
 Key: CONNECTORS-850
 URL: https://issues.apache.org/jira/browse/CONNECTORS-850
 Project: ManifoldCF
  Issue Type: New Feature
  Components: Framework crawler agent
Reporter: Florian Schmedding
Priority: Minor


Currently, the dynamic crawling method used for a continuous job extends the 
reseed and recrawl intervals when no changes are found in a checked document. 
However, it should be possible to restrict this extension to a maximum value in 
order to make sure that new documents are discovered within a certain interval.
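
The requested restriction amounts to clamping the extended interval at a configured maximum. A minimal sketch, assuming the interval grows by a multiplicative factor when a document is unchanged (the real growth rule in ManifoldCF may differ); CappedIntervalSketch, nextInterval, and the parameter names are hypothetical.

```java
// Hypothetical illustration of a capped interval extension; not ManifoldCF code.
final class CappedIntervalSketch {
    /**
     * Returns the next recrawl or reseed interval in milliseconds after an
     * unchanged document: the current interval times growthFactor, but never
     * more than maxMs.
     */
    static long nextInterval(long currentMs, double growthFactor, long maxMs) {
        if (currentMs <= 0 || maxMs <= 0 || growthFactor < 1.0) {
            throw new IllegalArgumentException("intervals must be positive and growthFactor >= 1");
        }
        double extended = currentMs * growthFactor;
        // Clamp so that new documents are still discovered within maxMs.
        return extended >= maxMs ? maxMs : (long) extended;
    }
}
```

With a factor of 2.0 and a maximum of 300000 ms, an interval of 60000 ms grows to 120000 ms, while an interval of 200000 ms is capped at 300000 ms instead of growing to 400000 ms.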


