Right, sorry, I overlooked this attachment in your original mail. Have a look at the ticket for updated status of the research, or later posts in this thread.
Karl On Wed, Mar 8, 2017 at 8:06 AM, Standen Guy <guy.stan...@uk.fujitsu.com> wrote: > Hi Karl, > > Attached is the MCF trace that includes all the logging pertaining to the > FATAL error ( the last entry is at 00:36:56,798 there was no further > logging until 08:00 when I looked at the jobs in the morning). This is the > same trace I sent on the original mail, I’m afraid I have no more trace > than this. > > > > I’ll try and reproduce the problem with forensic logging on and append the > traces to connectors-1395. > > > > Best Regards, > > > > Guy > > > > *From:* Karl Wright [mailto:daddy...@gmail.com] > *Sent:* 08 March 2017 12:32 > > *To:* user@manifoldcf.apache.org > *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6 > > > > Hi Guy, > > > > I've now had a look at everything available to me on your issue. I've > created CONNECTORS-1395 to track this issue; please attach any new > materials to that ticket. > > > > There's an exception trace you didn't include, specifically for this FATAL > exception: > > FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't > release lock we don't hold > > java.lang.IllegalStateException: Can't release lock we don't hold” > > > > This is very probably the smoking gun for the stuck lock. I'd really like > that exception trace from the log if you wouldn't mind. > > > > What has happened is this: > > > > (1) The document fetch cycle runs like this: (a) "stuff" the document (by > changing its state to "active", (b) fetch the document, (c) mark the > document completed. For a specific document, after it's been fetched, when > we're trying to mark the document as being "completed" we do not see the > expected "active" status. Instead we see a status of "pending purgatory", > which means that either some other thread changed the document's status out > from under us, or that document's status was never in fact set to "active" > during the stuffing phase. Neither of these is possible given the code > paths available, but we can prove it one way or another by turning on > "forensic" debugging, as I described above. > > > > (2) Once the failure happens, the job in question should abort. But it > seems like that abort does not complete because there's a second problem > with lock management (which generates the FATAL message above). This > should be readily fixed if I can get that trace. > > > > Thanks, > > Karl > > > > On Wed, Mar 8, 2017 at 6:52 AM, Karl Wright <daddy...@gmail.com> wrote: > > Hi Guy, > > > > The agents thread dump shows that there's a lock stuck from somewhere; I > expect it's from the UI. Next time this happens, could you get a thread > dump for the UI process as well as from the agents process? Thanks!! > > > > Karl > > > > > > On Wed, Mar 8, 2017 at 6:12 AM, Karl Wright <daddy...@gmail.com> wrote: > > Hi Guy, > > > > See https://issues.apache.org/jira/browse/CONNECTORS-590. > > > > When you see "unexpected" this is not a good sign: > > > > >>>>>> > > “ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed: > Unexpected jobqueue status - record id 1488898668325, expecting active > status, saw 4 > > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected > jobqueue status - record id 1488898668325, expecting active status, saw 4 > > at org.apache.manifoldcf.crawler.jobs.JobQueue. > updateCompletedRecord(JobQueue.java:1019) > > at org.apache.manifoldcf.crawler.jobs.JobManager. > markDocumentCompletedMultiple(JobManager.java:3271) > > at org.apache.manifoldcf.crawler.system.WorkerThread.run( > WorkerThread.java:710) > > <<<<<< > > > > I've spent weeks chasing this issue in the past. I have not seen it at > all except in certain installations: 2 of them, counting yours. I > introduced transaction forensics into the MCF code to figure out what was > going on and I definitively proved (on Postgresql 9.1) that we were seeing > a transactional integrity problem with Postgresql itself. A ticket against > Postgresql was not logged because they'd need a reproducible test case and > also the latest version, and it didn't happen on 9.3 at the time. > > > > You also seem to be seeing a deadlock in MCF locks as well. When the > Postgresql bug happens it means that the database is in a state that MCF > really can't figure out and thus doesn't know how to deal with, so this may > just be a downstream result of that occurrence. But I can't be sure > without further analysis. > > > > I'm very curious now about the details of your setup. (1) How is your > postgresql set up? What version did you decide to use? (2) How many > agents processes do you have? Just one? (3) What OS? (4) What JDK? > > > > Other things you can try: > > (1) Running the postgresql LT tests (ant run-LT-postgresql) against your > postgresql installation; you will need to change the test code itself to > allow it to create an instance for testing in that case; > > (2) Turning on database transaction forensics (property name > "org.apache.manifoldcf.diagnostics" value "DEBUG") > > (3) Turning on lock debugging (property name "org.apache.manifoldcf.lock" > value "DEBUG") > > > > Transaction forensics uses memory but only generates lots of output when > something goes wrong, so that may be the best place to start. Turning on > lock debugging will generate a large amount of log output but may be > necessary, especially if we want to figure out if the lock issue is a > downstream problem caused by the database integrity issue. > > > > > > Thanks, > > Karl > > > > > > On Wed, Mar 8, 2017 at 5:11 AM, Standen Guy <guy.stan...@uk.fujitsu.com> > wrote: > > Hi Karl, > > I have upgraded to MCF 2.6 on Windows using Zookeeper for > synchronisation and PostgreSQL 9.3.16 for the database. > > I re-ran the pair of jobs ( 1 web crawl of local intranet site every 15 > minutes and 1 crawl of JDBC connected database every 3 minutes) that > would previously lock up after a couple of iterations. These seemed to > run much better together but after 6 hours running there was a lock up for > both these jobs (both indicating they were starting up). > > > > Attempting to restart or abort the jobs through the MCF webapp failed. > > > > After restarting the MCF agents process all sprang into life and the jobs > aborted/restarted as expected. > > > > In the MCF log the following error was found > > “ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed: > Unexpected jobqueue status - record id 1488898668325, expecting active > status, saw 4 > > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected > jobqueue status - record id 1488898668325, expecting active status, saw 4 > > at org.apache.manifoldcf.crawler.jobs.JobQueue. > updateCompletedRecord(JobQueue.java:1019) > > at org.apache.manifoldcf.crawler.jobs.JobManager. > markDocumentCompletedMultiple(JobManager.java:3271) > > at org.apache.manifoldcf.crawler.system.WorkerThread.run( > WorkerThread.java:710) > > WARN 2017-03-08 00:25:30,449 (Worker thread '23') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:30,449 (Worker thread '24') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:30,464 (Worker thread '9') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:30,464 (Worker thread '0') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:31,900 (Worker thread '11') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:31,900 (Worker thread '29') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:32,867 (Worker thread '10') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:32,867 (Worker thread '2') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:33,335 (Worker thread '8') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:36,642 (Worker thread '20') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:37,422 (Worker thread '21') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:38,280 (Worker thread '22') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:38,280 (Worker thread '3') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:38,280 (Worker thread '5') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:38,826 (Worker thread '28') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:39,045 (Worker thread '13') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:45,425 (Worker thread '4') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:45,425 (Worker thread '15') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:45,425 (Worker thread '17') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:46,392 (Worker thread '25') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:25:46,392 (Worker thread '27') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:26:11,043 (Worker thread '1') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:26:35,817 (Worker thread '19') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:26:35,817 (Worker thread '26') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:26:36,753 (Worker thread '7') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:26:39,248 (Worker thread '6') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:26:39,248 (Worker thread '18') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > WARN 2017-03-08 00:26:43,129 (Worker thread '16') - Service interruption > reported for job 1488898090224 connection 'web': Job no longer active > > FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't > release lock we don't hold > > java.lang.IllegalStateException: Can't release lock we don't hold” > > > > > > ManifoldCF does not seem to have recovered from this error and the jobs > can’t be re-started without restarting ManifoldCF agents. > > > > The manifoldcf, PostgreSQL and Agents JVM stack trace are attached > > > > Best Regards, > > > > Guy > > > > *From:* Karl Wright [mailto:daddy...@gmail.com] > *Sent:* 06 March 2017 09:24 > > *To:* user@manifoldcf.apache.org > *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6 > > > > Hi Guy, > > > > (1) I have no experience with PostgresSQL versions beyond 9.3, but I doubt > you would have problems. > > (2) If you are using multiple processes, even if there's only one agents > process, you must use synchronization. I would recommend Zookeeper; > file-system-based synchronization is deprecated. > > (3) Windows has many ways of interfering with file-based sync, including > path-length issues. I have seen Windows fail to unlock files and need a > reboot to release the lock. This is one reason why file-system-based > locking is deprecated. > > > > Thanks, > Karl > > > > > > On Mon, Mar 6, 2017 at 4:03 AM, Standen Guy <guy.stan...@uk.fujitsu.com> > wrote: > > Hi Karl, > > Thanks for that I will try version 2.6. Whilst moving to > MCF 2.6 I would potentially like to upgrade my backend PostgreSQL version > from 9.3.5. > > 1) Do you have a recommendation for which PostgreSQL to use with MCF 2.6 > e.g. PostgreSQL 9.3.16 or PostgreSQL 9.6.2? > > 2) For a production system on a single server running a single MCF agents > process would you recommend the file based synchronisation locking or > zookeeper based synchronisation locking. With the file based > synchronisation locking mechanism I have sometimes seen errors of the form : > > 'D:\Apps\ManifoldCF\apache-manifoldcf-2.0.1\multiprocess- > file-example\.\.\syncharea\475\708\lock-_POOLTARGET__OUTPUTCONNECTORPOOL_Solr > COLL1 osp_unstruct.lock' failed: Access is denied’ ( I have ensured that > the SYNCHAREA is not scanned by AV or Indexed by Windows Search and all > MCF processes run as the same user) > > What could cause these errors? > > > > Many Thanks, > > > > Guy > > > > *From:* Karl Wright [mailto:daddy...@gmail.com] > *Sent:* 03 March 2017 17:35 > > > *To:* user@manifoldcf.apache.org > *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6 > > > > Hi Guy: > > > > It is expected that sometimes database deadlock will develop, and the > transaction will need to be retried. There is code in MCF for doing this: > > > > >>>>>> > > if (sqlState != null && sqlState.equals("40001")) > > return new ManifoldCFException(message,e, > ManifoldCFException.DATABASE_TRANSACTION_ABORT); > > <<<<<< > > > > I suspect that your version of MCF is old enough so that this particular > error and the associated retry are not taking place. Upgrading to 2.6 will > definitely help there. > > > > Karl > > > > > > On Fri, Mar 3, 2017 at 11:17 AM, Standen Guy <guy.stan...@uk.fujitsu.com> > wrote: > > Hi Karl, > > Thanks for coming back so quickly. Unfortunately I wasn’t > using a JCIFS connection. One of the issues I was seeing was between a > crawl of an intranet site (no explicit throttling other than number of > connections) and scheduled crawl (every 5 mins) to a relational DB via > JDBC connector again no explicit throttling. To simplify things both jobs > are using a NULL output connection. Sometimes both the Web crawl and the > JDBC connection can run together but at other times 1 or both jobs will > appear to lock up with just a few active documents showing. When I get a > lock up the mcf log contains errors like: > > > > “DEBUG 2017-03-03 15:28:20,466 (Worker thread '5') - Exception Database > exception: SQLException doing query (40001): ERROR: could not serialize > access due to read/write dependencies among transactions” > > > > See the attached log extract for a little more detail. > > > > Any view why this might be happening? > > > > Best Regards, > > > > Guy > > > > > > > > > > > > > > *From:* Karl Wright [mailto:daddy...@gmail.com] > *Sent:* 03 March 2017 11:27 > *To:* user@manifoldcf.apache.org > *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6 > > > > Hi Guy, > > > > A issue with concurrent jobs is known for jobs sharing the same JCIFS > connection. Is that what you are using? This has nothing to do with the > version of Postgresql you are using; it has to do with what "bins" > documents are thought to come from. There has been a recent improvement > for this issue, which will be released in April. See > https://issues.apache.org/jira/browse/CONNECTORS-1364. > > > > The current version of MCF (2.6) supports Solr 6.x. > > > > Thanks, > > Karl > > > > > > On Fri, Mar 3, 2017 at 5:27 AM, Standen Guy <guy.stan...@uk.fujitsu.com> > wrote: > > Hi Karl, > > I am currently using MCF 2.0.1 with PostgreSQL 9.3.5 on Windows and have > had some issues with multiple jobs running concurrently. > > I am considering upgrading to MCF 2.6 and to a newer version of > PostgreSQL. Would you be able to advise which version of PostgreSQL I > should consider using with MCF 2.6 (e.g. PostgreSQL 9.3.16 or 9.6.2) > > > > I am also considering upgrading from SOLR 4.10.3 to a newer version. The > MCF compatibility matrix mentions that compatibility has been tested to > SOLR version 4.5.1. Do you have any advice about compatibility with the > newer versions of SOLR e.g. 6.4.1. > > > > Best Regards > > > > Guy > > > > > Unless otherwise stated, this email has been sent from Fujitsu Services > Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in > England No 2216100) both with registered offices at: 22 Baker Street, > London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and > Fujitsu Laboratories of Europe Limited (registered in England No. 4153469) > both with registered offices at: Hayes Park Central, Hayes End Road, Hayes, > Middlesex, UB4 8FE. > This email is only for the use of its intended recipient. Its contents are > subject to a duty of confidence and may be privileged. Fujitsu does not > guarantee that this email has not been intercepted and amended or that it > is virus-free. > > > > > Unless otherwise stated, this email has been sent from Fujitsu Services > Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in > England No 2216100) both with registered offices at: 22 Baker Street, > London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and > Fujitsu Laboratories of Europe Limited (registered in England No. 4153469) > both with registered offices at: Hayes Park Central, Hayes End Road, Hayes, > Middlesex, UB4 8FE. > This email is only for the use of its intended recipient. Its contents are > subject to a duty of confidence and may be privileged. Fujitsu does not > guarantee that this email has not been intercepted and amended or that it > is virus-free. > > > > > Unless otherwise stated, this email has been sent from Fujitsu Services > Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in > England No 2216100) both with registered offices at: 22 Baker Street, > London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and > Fujitsu Laboratories of Europe Limited (registered in England No. 4153469) > both with registered offices at: Hayes Park Central, Hayes End Road, Hayes, > Middlesex, UB4 8FE. > This email is only for the use of its intended recipient. Its contents are > subject to a duty of confidence and may be privileged. Fujitsu does not > guarantee that this email has not been intercepted and amended or that it > is virus-free. > > > > > Unless otherwise stated, this email has been sent from Fujitsu Services > Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in > England No 2216100) both with registered offices at: 22 Baker Street, > London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and > Fujitsu Laboratories of Europe Limited (registered in England No. 4153469) > both with registered offices at: Hayes Park Central, Hayes End Road, Hayes, > Middlesex, UB4 8FE. > This email is only for the use of its intended recipient. Its contents are > subject to a duty of confidence and may be privileged. Fujitsu does not > guarantee that this email has not been intercepted and amended or that it > is virus-free. > > > > > > > > Unless otherwise stated, this email has been sent from Fujitsu Services > Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in > England No 2216100) both with registered offices at: 22 Baker Street, > London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and > Fujitsu Laboratories of Europe Limited (registered in England No. 4153469) > both with registered offices at: Hayes Park Central, Hayes End Road, Hayes, > Middlesex, UB4 8FE. > This email is only for the use of its intended recipient. Its contents are > subject to a duty of confidence and may be privileged. Fujitsu does not > guarantee that this email has not been intercepted and amended or that it > is virus-free. >