Hi Karl,
I have attached a file to CONNECTORS-1395 which shows an
excerpt from the Zookeeper console log. Unfortunately I don’t have enough
history in the console to see much before the event and may have lost some
important information.
This Zookeeper issue happened in the middle of the night and no one would have
manually instigated it.
Best Regards,
Guy
From: Karl Wright [mailto:[email protected]]
Sent: 08 March 2017 13:45
To: [email protected]
Subject: Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
Hi Guy,
If nobody recycled Zookeeper intentionally at 00:26:00, is it possible that
some automated process "cleaned up" Zookeeper temporary files out from under it
at around this time? Really, we're seeing a catastrophic failure of Zookeeper
to retain anything beyond that point, and I have no explanation for that
behavior; I have never seen it before.
This could also potentially explain the apparent postgresql transaction
integrity issues, because in some cases we rely on local Zookeeper locks to
prevent threads from interfering with one another.
Karl
On Wed, Mar 8, 2017 at 8:37 AM, Karl Wright
<[email protected]<mailto:[email protected]>> wrote:
Right, sorry, I overlooked this attachment in your original mail. Have a look
at the ticket for updated status of the research, or later posts in this thread.
Karl
On Wed, Mar 8, 2017 at 8:06 AM, Standen Guy
<[email protected]<mailto:[email protected]>> wrote:
Hi Karl,
Attached is the MCF trace that includes all the logging pertaining to the FATAL
error (the last entry is at 00:36:56,798; there was no further logging until
08:00, when I looked at the jobs in the morning). This is the same trace I sent
with the original mail; I'm afraid I have no more trace than this.
I'll try to reproduce the problem with forensic logging on and append the
traces to CONNECTORS-1395.
Best Regards,
Guy
From: Karl Wright [mailto:[email protected]<mailto:[email protected]>]
Sent: 08 March 2017 12:32
To: [email protected]<mailto:[email protected]>
Subject: Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
Hi Guy,
I've now had a look at everything available to me on your issue. I've created
CONNECTORS-1395 to track this issue; please attach any new materials to that
ticket.
There's an exception trace you didn't include, specifically for this FATAL
exception:
FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't
release lock we don't hold
java.lang.IllegalStateException: Can't release lock we don't hold
This is very probably the smoking gun for the stuck lock. I'd really like that
exception trace from the log if you wouldn't mind.
What has happened is this:
(1) The document fetch cycle runs like this: (a) "stuff" the document (by
changing its state to "active"), (b) fetch the document, (c) mark the document
completed. For a specific document, after it's been fetched, when we're trying
to mark the document as "completed" we do not see the expected "active"
status. Instead we see a status of "pending purgatory", which means that
either some other thread changed the document's status out from under us, or
that the document's status was never in fact set to "active" during the stuffing
phase. Neither of these is possible given the code paths available, but we can
prove it one way or another by turning on "forensic" debugging, as I described
above.
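The stuff/fetch/complete check in (1) can be sketched as follows; the status
constants and method shape here are illustrative, not MCF's actual JobQueue
code:

```java
// Illustrative sketch of the jobqueue status check described above.
// The numeric constants are hypothetical; MCF's real values may differ.
public class JobQueueSketch {
    public static final int STATUS_ACTIVE = 2;            // set during "stuffing"
    public static final int STATUS_PENDINGPURGATORY = 4;  // the unexpected status from the log

    // Marking a document completed requires the record to still be "active".
    public static void updateCompletedRecord(long recordId, int currentStatus) {
        if (currentStatus != STATUS_ACTIVE)
            throw new IllegalStateException(
                "Unexpected jobqueue status - record id " + recordId +
                ", expecting active status, saw " + currentStatus);
        // ...otherwise, update the record's status to "completed"...
    }

    public static void main(String[] args) {
        // Normal path: the record is active, so no exception is thrown.
        updateCompletedRecord(1488898668325L, STATUS_ACTIVE);
        System.out.println("record marked completed");
    }
}
```

The log excerpt below shows exactly this guard firing: the record was found in
state 4 ("pending purgatory") when "active" was expected.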
(2) Once the failure happens, the job in question should abort. But it seems
like that abort does not complete because there's a second problem with lock
management (which generates the FATAL message above). This should be readily
fixed if I can get that trace.
Thanks,
Karl
On Wed, Mar 8, 2017 at 6:52 AM, Karl Wright
<[email protected]<mailto:[email protected]>> wrote:
Hi Guy,
The agents thread dump shows that there's a lock stuck from somewhere; I expect
it's from the UI. Next time this happens, could you get a thread dump for the
UI process as well as from the agents process? Thanks!!
Karl
On Wed, Mar 8, 2017 at 6:12 AM, Karl Wright
<[email protected]<mailto:[email protected]>> wrote:
Hi Guy,
See https://issues.apache.org/jira/browse/CONNECTORS-590.
When you see "unexpected" this is not a good sign:
>>>>>>
ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed:
Unexpected jobqueue status - record id 1488898668325, expecting active status,
saw 4
org.apache.manifoldcf.core.interfaces.ManifoldCFException:
Unexpected jobqueue status - record id 1488898668325, expecting active status,
saw 4
at
org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:1019)
at
org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3271)
at
org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:710)
<<<<<<
I've spent weeks chasing this issue in the past. I have not seen it at all
except in certain installations: two of them, counting yours. I introduced
transaction forensics into the MCF code to figure out what was going on, and I
definitively proved (on PostgreSQL 9.1) that we were seeing a transactional
integrity problem with PostgreSQL itself. I never logged a ticket against
PostgreSQL because they would need a reproducible test case against the latest
version, and the problem did not occur on 9.3 at the time.
You also seem to be seeing a deadlock in MCF locks. When the PostgreSQL bug
happens it means that the database is in a state that MCF really can't figure
out and thus doesn't know how to deal with, so this may just be a downstream
result of that occurrence. But I can't be sure without further analysis.
I'm very curious now about the details of your setup. (1) How is your
PostgreSQL set up? What version did you decide to use? (2) How many agents
processes do you have? Just one? (3) What OS? (4) What JDK?
Other things you can try:
(1) Running the postgresql LT tests (ant run-LT-postgresql) against your
postgresql installation; you will need to change the test code itself to allow
it to create an instance for testing in that case;
(2) Turning on database transaction forensics (property name
"org.apache.manifoldcf.diagnostics" value "DEBUG")
(3) Turning on lock debugging (property name "org.apache.manifoldcf.lock" value
"DEBUG")
Transaction forensics uses memory but only generates lots of output when
something goes wrong, so that may be the best place to start. Turning on lock
debugging will generate a large amount of log output but may be necessary,
especially if we want to figure out if the lock issue is a downstream problem
caused by the database integrity issue.
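For reference, in a standard ManifoldCF deployment these two properties would
go into properties.xml; a sketch (the surrounding configuration element is the
usual layout, and only the two property names/values are from the advice
above):

```xml
<configuration>
  <!-- ...existing properties... -->
  <!-- transaction forensics: low overhead until something goes wrong -->
  <property name="org.apache.manifoldcf.diagnostics" value="DEBUG"/>
  <!-- lock debugging: verbose; enable only while chasing the lock issue -->
  <property name="org.apache.manifoldcf.lock" value="DEBUG"/>
</configuration>
```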
Thanks,
Karl
On Wed, Mar 8, 2017 at 5:11 AM, Standen Guy
<[email protected]<mailto:[email protected]>> wrote:
Hi Karl,
I have upgraded to MCF 2.6 on Windows using Zookeeper for
synchronisation and PostgreSQL 9.3.16 for the database.
I re-ran the pair of jobs (one web crawl of a local intranet site every 15
minutes, and one crawl of a JDBC-connected database every 3 minutes) that would
previously lock up after a couple of iterations. These seemed to run much
better together, but after 6 hours of running both jobs locked up (both
indicating they were starting up).
Attempting to restart or abort the jobs through the MCF webapp failed.
After restarting the MCF agents process everything sprang into life and the
jobs aborted/restarted as expected.
In the MCF log the following error was found:
“ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed:
Unexpected jobqueue status - record id 1488898668325, expecting active status,
saw 4
org.apache.manifoldcf.core.interfaces.ManifoldCFException:
Unexpected jobqueue status - record id 1488898668325, expecting active status,
saw 4
at
org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:1019)
at
org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3271)
at
org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:710)
WARN 2017-03-08 00:25:30,449 (Worker thread '23') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:30,449 (Worker thread '24') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:30,464 (Worker thread '9') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:30,464 (Worker thread '0') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:31,900 (Worker thread '11') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:31,900 (Worker thread '29') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:32,867 (Worker thread '10') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:32,867 (Worker thread '2') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:33,335 (Worker thread '8') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:36,642 (Worker thread '20') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:37,422 (Worker thread '21') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:38,280 (Worker thread '22') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:38,280 (Worker thread '3') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:38,280 (Worker thread '5') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:38,826 (Worker thread '28') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:39,045 (Worker thread '13') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:45,425 (Worker thread '4') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:45,425 (Worker thread '15') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:45,425 (Worker thread '17') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:46,392 (Worker thread '25') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:25:46,392 (Worker thread '27') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:26:11,043 (Worker thread '1') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:26:35,817 (Worker thread '19') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:26:35,817 (Worker thread '26') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:26:36,753 (Worker thread '7') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:26:39,248 (Worker thread '6') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:26:39,248 (Worker thread '18') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
WARN 2017-03-08 00:26:43,129 (Worker thread '16') - Service interruption
reported for job 1488898090224 connection 'web': Job no longer active
FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't
release lock we don't hold
java.lang.IllegalStateException: Can't release lock we don't hold”
ManifoldCF does not seem to have recovered from this error, and the jobs can't
be re-started without restarting the ManifoldCF agents process.
The ManifoldCF, PostgreSQL, and agents JVM stack traces are attached.
Best Regards,
Guy
From: Karl Wright [mailto:[email protected]<mailto:[email protected]>]
Sent: 06 March 2017 09:24
To: [email protected]<mailto:[email protected]>
Subject: Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
Hi Guy,
(1) I have no experience with PostgreSQL versions beyond 9.3, but I doubt you
would have problems.
(2) If you are using multiple processes, even if there's only one agents
process, you must use synchronization. I would recommend Zookeeper;
file-system-based synchronization is deprecated.
(3) Windows has many ways of interfering with file-based sync, including
path-length issues. I have seen Windows fail to unlock files and need a reboot
to release the lock. This is one reason why file-system-based locking is
deprecated.
Thanks,
Karl
On Mon, Mar 6, 2017 at 4:03 AM, Standen Guy
<[email protected]<mailto:[email protected]>> wrote:
Hi Karl,
Thanks for that; I will try version 2.6. While moving to MCF 2.6,
I would potentially like to upgrade my backend PostgreSQL from version
9.3.5.
1) Do you have a recommendation for which PostgreSQL to use with MCF 2.6 e.g.
PostgreSQL 9.3.16 or PostgreSQL 9.6.2?
2) For a production system on a single server running a single MCF agents
process, would you recommend file-based synchronisation locking or
Zookeeper-based synchronisation locking? With the file-based synchronisation
locking mechanism I have sometimes seen errors of the form:
'D:\Apps\ManifoldCF\apache-manifoldcf-2.0.1\multiprocess-file-example\.\.\syncharea\475\708\lock-_POOLTARGET__OUTPUTCONNECTORPOOL_Solr
COLL1 osp_unstruct.lock' failed: Access is denied
(I have ensured that the syncharea directory is not scanned by AV or indexed by
Windows Search, and that all MCF processes run as the same user.)
What could cause these errors?
Many Thanks,
Guy
From: Karl Wright [mailto:[email protected]<mailto:[email protected]>]
Sent: 03 March 2017 17:35
To: [email protected]<mailto:[email protected]>
Subject: Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
Hi Guy:
It is expected that sometimes database deadlock will develop, and the
transaction will need to be retried. There is code in MCF for doing this:
>>>>>>
if (sqlState != null && sqlState.equals("40001"))
return new
ManifoldCFException(message,e,ManifoldCFException.DATABASE_TRANSACTION_ABORT);
<<<<<<
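The caller of code like this is expected to re-run the whole transaction when
it receives the DATABASE_TRANSACTION_ABORT signal. A minimal sketch of that
retry pattern, with illustrative names rather than MCF's real API:

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    // Stand-in for MCF's retryable abort (the 40001 case mapped above).
    public static class TransactionAbortException extends Exception {
        public TransactionAbortException(String m) { super(m); }
    }

    // Re-run the whole transaction body until it commits or we give up.
    public static <T> T runWithRetry(Callable<T> tx, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return tx.call();
            } catch (TransactionAbortException e) {
                if (attempt >= maxAttempts)
                    throw e;  // too many serialization aborts: give up
                // otherwise loop and retry the entire transaction
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulate a transaction that hits a serialization failure twice
        // before succeeding on the third attempt.
        String result = runWithRetry(() -> {
            calls[0]++;
            if (calls[0] < 3)
                throw new TransactionAbortException("40001: could not serialize access");
            return "committed";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The key point is that the retry must wrap the whole transaction, not just the
failing statement, since the database has rolled everything back.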
I suspect that your version of MCF is old enough that this particular error
mapping and the associated retry are not taking place. Upgrading to 2.6 will
definitely help there.
Karl
On Fri, Mar 3, 2017 at 11:17 AM, Standen Guy
<[email protected]<mailto:[email protected]>> wrote:
Hi Karl,
Thanks for coming back so quickly. Unfortunately I wasn't using
a JCIFS connection. One of the issues I was seeing was between a crawl of an
intranet site (no explicit throttling other than the number of connections) and
a scheduled crawl (every 5 minutes) of a relational DB via the JDBC connector,
again with no explicit throttling. To simplify things both jobs are using a
NULL output connection. Sometimes both the web crawl and the JDBC crawl can run
together, but at other times one or both jobs will appear to lock up with just
a few active documents showing. When I get a lock-up the MCF log contains
errors like:
“DEBUG 2017-03-03 15:28:20,466 (Worker thread '5') - Exception Database
exception: SQLException doing query (40001): ERROR: could not serialize access
due to read/write dependencies among transactions”
See the attached log extract for a little more detail.
Any view why this might be happening?
Best Regards,
Guy
From: Karl Wright [mailto:[email protected]<mailto:[email protected]>]
Sent: 03 March 2017 11:27
To: [email protected]<mailto:[email protected]>
Subject: Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
Hi Guy,
An issue with concurrent jobs is known for jobs sharing the same JCIFS
connection. Is that what you are using? This has nothing to do with the
version of PostgreSQL you are using; it has to do with what "bins" documents
are thought to come from. There has been a recent improvement for this issue,
which will be released in April. See
https://issues.apache.org/jira/browse/CONNECTORS-1364.
The current version of MCF (2.6) supports Solr 6.x.
Thanks,
Karl
On Fri, Mar 3, 2017 at 5:27 AM, Standen Guy
<[email protected]<mailto:[email protected]>> wrote:
Hi Karl,
I am currently using MCF 2.0.1 with PostgreSQL 9.3.5 on Windows and have had
some issues with multiple jobs running concurrently.
I am considering upgrading to MCF 2.6 and to a newer version of PostgreSQL.
Would you be able to advise which version of PostgreSQL I should consider using
with MCF 2.6 (e.g. PostgreSQL 9.3.16 or 9.6.2)?
I am also considering upgrading from Solr 4.10.3 to a newer version. The MCF
compatibility matrix mentions that compatibility has been tested up to Solr
4.5.1. Do you have any advice about compatibility with newer versions of
Solr, e.g. 6.4.1?
Best Regards
Guy
Unless otherwise stated, this email has been sent from Fujitsu Services Limited
(registered in England No 96056); Fujitsu EMEA PLC (registered in England No
2216100) both with registered offices at: 22 Baker Street, London W1U 3BW; PFU
(EMEA) Limited, (registered in England No 1578652) and Fujitsu Laboratories of
Europe Limited (registered in England No. 4153469) both with registered offices
at: Hayes Park Central, Hayes End Road, Hayes, Middlesex, UB4 8FE.
This email is only for the use of its intended recipient. Its contents are
subject to a duty of confidence and may be privileged. Fujitsu does not
guarantee that this email has not been intercepted and amended or that it is
virus-free.