Hi Guy,

It looks like the lock unhappiness in the log may be due to recycling Zookeeper while the MCF agents process was still running. This isn't a good idea and will cause carnage of this kind: the agents process keeps its lock state in Zookeeper sessions, so restarting Zookeeper out from under it leaves the process believing it holds locks that no longer exist.
So I have to conclude at the moment that the lock problems seem to be due either to a deadlock between the UI and the agents process (which I can determine if I see thread dumps from both of these processes when this happens), or to a downstream effect of the database integrity problem described earlier.

For future reference, the only right way to deal with a hang of this kind is:

(1) Get a thread dump (a minimal example follows below)
(2) Get a snapshot of the log at that point
(3) Shut down the agents process and the UI process
(4) Start up the agents process and the UI process

You should *not* need to recycle Zookeeper, ever.
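For step (1), the JDK's standard jps and jstack tools are all you need. A minimal sketch, assuming a full JDK (not just a JRE) is on the PATH, with <agents-pid> and <ui-pid> standing in for the real process ids:

    jps -l
    jstack <agents-pid> > agents-threads.txt
    jstack <ui-pid> > ui-threads.txt

jps -l lists the running JVMs with their main classes, which is how you tell the agents process apart from the UI process; attach both dump files to the ticket.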
Thanks,
Karl

On Wed, Mar 8, 2017 at 8:16 AM, Karl Wright <daddy...@gmail.com> wrote:

> Ah, the trace is in one of the attachments you provided, so that's enough to look into it. Will get back to you shortly.
>
> Thanks,
> Karl
>
> On Wed, Mar 8, 2017 at 7:31 AM, Karl Wright <daddy...@gmail.com> wrote:
>
>> Hi Guy,
>>
>> I've now had a look at everything available to me on your issue. I've created CONNECTORS-1395 to track it; please attach any new materials to that ticket.
>>
>> There's an exception trace you didn't include, specifically for this FATAL exception:
>>
>> FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't release lock we don't hold
>> java.lang.IllegalStateException: Can't release lock we don't hold
>>
>> This is very probably the smoking gun for the stuck lock. I'd really like that exception trace from the log if you wouldn't mind.
>>
>> What has happened is this:
>>
>> (1) The document fetch cycle runs like this: (a) "stuff" the document (by changing its state to "active"), (b) fetch the document, (c) mark the document completed. For a specific document, after it had been fetched, the attempt to mark it "completed" did not see the expected "active" status. Instead it saw a status of "pending purgatory", which means that either some other thread changed the document's status out from under us, or the document's status was never in fact set to "active" during the stuffing phase. Neither of these is possible given the code paths available, but we can prove it one way or another by turning on "forensic" debugging, as I described above.
>>
>> (2) Once the failure happens, the job in question should abort. But it seems that the abort does not complete, because there's a second problem with lock management (which generates the FATAL message above). This should be readily fixed if I can get that trace.
>>
>> Thanks,
>> Karl
>>
>> On Wed, Mar 8, 2017 at 6:52 AM, Karl Wright <daddy...@gmail.com> wrote:
>>
>>> Hi Guy,
>>>
>>> The agents thread dump shows that there's a lock stuck from somewhere; I expect it's from the UI. Next time this happens, could you get a thread dump for the UI process as well as for the agents process? Thanks!!
>>>
>>> Karl
>>>
>>> On Wed, Mar 8, 2017 at 6:12 AM, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> Hi Guy,
>>>>
>>>> See https://issues.apache.org/jira/browse/CONNECTORS-590.
>>>>
>>>> When you see "unexpected" this is not a good sign:
>>>>
>>>> ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed: Unexpected jobqueue status - record id 1488898668325, expecting active status, saw 4
>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected jobqueue status - record id 1488898668325, expecting active status, saw 4
>>>>         at org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:1019)
>>>>         at org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3271)
>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:710)
>>>>
>>>> I've spent weeks chasing this issue in the past. I have not seen it at all except in certain installations: two of them, counting yours. I introduced transaction forensics into the MCF code to figure out what was going on, and I definitively proved (on Postgresql 9.1) that we were seeing a transactional integrity problem with Postgresql itself. I did not log a ticket against Postgresql because they would need a reproducible test case on the latest version, and the problem did not occur on 9.3 at the time.
>>>>
>>>> You also seem to be seeing a deadlock in MCF locks. When the Postgresql bug happens, it leaves the database in a state that MCF can't figure out and doesn't know how to deal with, so the deadlock may just be a downstream result of that occurrence. But I can't be sure without further analysis.
>>>>
>>>> I'm very curious now about the details of your setup. (1) How is your postgresql set up? What version did you decide to use? (2) How many agents processes do you have? Just one? (3) What OS? (4) What JDK?
>>>>
>>>> Other things you can try:
>>>> (1) Running the postgresql LT tests (ant run-LT-postgresql) against your postgresql installation; you will need to change the test code itself to allow it to create an instance for testing in that case;
>>>> (2) Turning on database transaction forensics (property name "org.apache.manifoldcf.diagnostics", value "DEBUG");
>>>> (3) Turning on lock debugging (property name "org.apache.manifoldcf.lock", value "DEBUG").
>>>>
>>>> Transaction forensics uses memory but only generates lots of output when something goes wrong, so that may be the best place to start. Turning on lock debugging will generate a large amount of log output, but may be necessary, especially if we want to figure out whether the lock issue is a downstream problem caused by the database integrity issue.
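>>>> For (2) and (3), the entries go into properties.xml alongside your existing properties. A sketch showing just the two added lines (everything else in the file stays as it is):
>>>>
>>>>   <property name="org.apache.manifoldcf.diagnostics" value="DEBUG"/>
>>>>   <property name="org.apache.manifoldcf.lock" value="DEBUG"/>
>>>>
>>>> The processes read properties.xml at startup, so restart the agents and UI processes after editing it.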
>>>> Thanks,
>>>> Karl
>>>>
>>>> On Wed, Mar 8, 2017 at 5:11 AM, Standen Guy <guy.stan...@uk.fujitsu.com> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> I have upgraded to MCF 2.6 on Windows, using Zookeeper for synchronisation and PostgreSQL 9.3.16 for the database.
>>>>>
>>>>> I re-ran the pair of jobs (one web crawl of a local intranet site every 15 minutes, and one crawl of a JDBC-connected database every 3 minutes) that would previously lock up after a couple of iterations. These seemed to run much better together, but after 6 hours of running there was a lock-up of both jobs (both indicating they were starting up).
>>>>>
>>>>> Attempting to restart or abort the jobs through the MCF webapp failed.
>>>>>
>>>>> After restarting the MCF agents process, all sprang into life and the jobs aborted/restarted as expected.
>>>>>
>>>>> In the MCF log the following error was found:
>>>>>
>>>>> ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed: Unexpected jobqueue status - record id 1488898668325, expecting active status, saw 4
>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected jobqueue status - record id 1488898668325, expecting active status, saw 4
>>>>>         at org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:1019)
>>>>>         at org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3271)
>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:710)
>>>>> WARN 2017-03-08 00:25:30,449 (Worker thread '23') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:30,449 (Worker thread '24') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:30,464 (Worker thread '9') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:30,464 (Worker thread '0') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:31,900 (Worker thread '11') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:31,900 (Worker thread '29') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:32,867 (Worker thread '10') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:32,867 (Worker thread '2') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:33,335 (Worker thread '8') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:36,642 (Worker thread '20') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:37,422 (Worker thread '21') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:38,280 (Worker thread '22') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:38,280 (Worker thread '3') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:38,280 (Worker thread '5') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:38,826 (Worker thread '28') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:39,045 (Worker thread '13') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:45,425 (Worker thread '4') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:45,425 (Worker thread '15') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:45,425 (Worker thread '17') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:46,392 (Worker thread '25') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:25:46,392 (Worker thread '27') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:26:11,043 (Worker thread '1') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:26:35,817 (Worker thread '19') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:26:35,817 (Worker thread '26') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:26:36,753 (Worker thread '7') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:26:39,248 (Worker thread '6') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:26:39,248 (Worker thread '18') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> WARN 2017-03-08 00:26:43,129 (Worker thread '16') - Service interruption reported for job 1488898090224 connection 'web': Job no longer active
>>>>> FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't release lock we don't hold
>>>>> java.lang.IllegalStateException: Can't release lock we don't hold
>>>>>
>>>>> ManifoldCF does not seem to have recovered from this error, and the jobs can't be re-started without restarting the ManifoldCF agents process.
>>>>>
>>>>> The manifoldcf log, the PostgreSQL log, and the agents JVM stack trace are attached.
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Guy
>>>>>
>>>>> *From:* Karl Wright [mailto:daddy...@gmail.com]
>>>>> *Sent:* 06 March 2017 09:24
>>>>> *To:* user@manifoldcf.apache.org
>>>>> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>>>>>
>>>>> Hi Guy,
>>>>>
>>>>> (1) I have no experience with PostgreSQL versions beyond 9.3, but I doubt you would have problems.
>>>>>
>>>>> (2) If you are using multiple processes, even if there's only one agents process, you must use synchronization. I would recommend Zookeeper (a sketch of the relevant configuration follows below); file-system-based synchronization is deprecated.
>>>>>
>>>>> (3) Windows has many ways of interfering with file-based sync, including path-length issues. I have seen Windows fail to unlock files and need a reboot to release the lock. This is one reason why file-system-based locking is deprecated.
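>>>>> For reference, Zookeeper-based synchronization is selected in properties.xml with entries along these lines. This is a sketch from memory; the multiprocess-zk example shipped with MCF has the authoritative set of entries, so copy from there:
>>>>>
>>>>>   <property name="org.apache.manifoldcf.lockmanagerclass" value="org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager"/>
>>>>>   <property name="org.apache.manifoldcf.zookeeper.connectstring" value="localhost:2181"/>
>>>>>   <property name="org.apache.manifoldcf.zookeeper.sessiontimeout" value="300000"/>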
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>> On Mon, Mar 6, 2017 at 4:03 AM, Standen Guy <guy.stan...@uk.fujitsu.com> wrote:
>>>>>
>>>>> Hi Karl,
>>>>>
>>>>> Thanks for that; I will try version 2.6. Whilst moving to MCF 2.6 I would potentially like to upgrade my backend PostgreSQL version from 9.3.5.
>>>>>
>>>>> 1) Do you have a recommendation for which PostgreSQL to use with MCF 2.6, e.g. PostgreSQL 9.3.16 or PostgreSQL 9.6.2?
>>>>>
>>>>> 2) For a production system on a single server running a single MCF agents process, would you recommend file-based synchronisation locking or Zookeeper-based synchronisation locking? With the file-based synchronisation locking mechanism I have sometimes seen errors of the form:
>>>>>
>>>>> 'D:\Apps\ManifoldCF\apache-manifoldcf-2.0.1\multiprocess-file-example\.\.\syncharea\475\708\lock-_POOLTARGET__OUTPUTCONNECTORPOOL_Solr COLL1 osp_unstruct.lock' failed: Access is denied
>>>>>
>>>>> (I have ensured that the SYNCHAREA is not scanned by AV or indexed by Windows Search, and that all MCF processes run as the same user.)
>>>>>
>>>>> What could cause these errors?
>>>>>
>>>>> Many Thanks,
>>>>>
>>>>> Guy
>>>>>
>>>>> *From:* Karl Wright [mailto:daddy...@gmail.com]
>>>>> *Sent:* 03 March 2017 17:35
>>>>> *To:* user@manifoldcf.apache.org
>>>>> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>>>>>
>>>>> Hi Guy:
>>>>>
>>>>> It is expected that database deadlock will sometimes develop, and that the transaction will need to be retried. There is code in MCF for doing this:
>>>>>
>>>>> if (sqlState != null && sqlState.equals("40001"))
>>>>>   return new ManifoldCFException(message, e, ManifoldCFException.DATABASE_TRANSACTION_ABORT);
>>>>>
>>>>> I suspect that your version of MCF is old enough that this particular error and the associated retry are not taking place. Upgrading to 2.6 will definitely help there.
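>>>>> Schematically, the retry behaves like this (a sketch of the pattern only, not the actual MCF source; performTransaction is a hypothetical stand-in for one complete database transaction):
>>>>>
>>>>>   while (true)
>>>>>   {
>>>>>     try
>>>>>     {
>>>>>       performTransaction();  // the whole transaction, so a replay is safe
>>>>>       break;                 // success: stop retrying
>>>>>     }
>>>>>     catch (ManifoldCFException e)
>>>>>     {
>>>>>       // Retry only the serialization failure (SQLSTATE 40001, mapped to
>>>>>       // DATABASE_TRANSACTION_ABORT above); rethrow everything else.
>>>>>       if (e.getErrorCode() != ManifoldCFException.DATABASE_TRANSACTION_ABORT)
>>>>>         throw e;
>>>>>     }
>>>>>   }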
>>>>> Karl
>>>>>
>>>>> On Fri, Mar 3, 2017 at 11:17 AM, Standen Guy <guy.stan...@uk.fujitsu.com> wrote:
>>>>>
>>>>> Hi Karl,
>>>>>
>>>>> Thanks for coming back so quickly. Unfortunately I wasn't using a JCIFS connection. One of the issues I was seeing was between a crawl of an intranet site (no explicit throttling other than the number of connections) and a scheduled crawl (every 5 minutes) of a relational DB via the JDBC connector, again with no explicit throttling. To simplify things, both jobs are using a NULL output connection. Sometimes both the web crawl and the JDBC crawl can run together, but at other times one or both jobs will appear to lock up with just a few active documents showing. When I get a lock-up the MCF log contains errors like:
>>>>>
>>>>> DEBUG 2017-03-03 15:28:20,466 (Worker thread '5') - Exception Database exception: SQLException doing query (40001): ERROR: could not serialize access due to read/write dependencies among transactions
>>>>>
>>>>> See the attached log extract for a little more detail.
>>>>>
>>>>> Any view on why this might be happening?
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Guy
>>>>>
>>>>> *From:* Karl Wright [mailto:daddy...@gmail.com]
>>>>> *Sent:* 03 March 2017 11:27
>>>>> *To:* user@manifoldcf.apache.org
>>>>> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>>>>>
>>>>> Hi Guy,
>>>>>
>>>>> An issue with concurrent jobs is known for jobs sharing the same JCIFS connection. Is that what you are using? This has nothing to do with the version of Postgresql you are using; it has to do with which "bins" documents are thought to come from. There has been a recent improvement for this issue, which will be released in April. See https://issues.apache.org/jira/browse/CONNECTORS-1364.
>>>>>
>>>>> The current version of MCF (2.6) supports Solr 6.x.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>> On Fri, Mar 3, 2017 at 5:27 AM, Standen Guy <guy.stan...@uk.fujitsu.com> wrote:
>>>>>
>>>>> Hi Karl,
>>>>>
>>>>> I am currently using MCF 2.0.1 with PostgreSQL 9.3.5 on Windows and have had some issues with multiple jobs running concurrently.
>>>>>
>>>>> I am considering upgrading to MCF 2.6 and to a newer version of PostgreSQL. Would you be able to advise which version of PostgreSQL I should consider using with MCF 2.6 (e.g. PostgreSQL 9.3.16 or 9.6.2)?
>>>>>
>>>>> I am also considering upgrading from SOLR 4.10.3 to a newer version. The MCF compatibility matrix mentions that compatibility has been tested up to SOLR version 4.5.1. Do you have any advice about compatibility with newer versions of SOLR, e.g. 6.4.1?
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Guy