Re: Advice on which PostgreSQL to use with ManifoldCF 2.6

Karl Wright Wed, 08 Mar 2017 05:38:26 -0800

Right, sorry, I overlooked this attachment in your original mail.  Have a
look at the ticket for updated status of the research, or later posts in
this thread.


Karl


On Wed, Mar 8, 2017 at 8:06 AM, Standen Guy <guy.stan...@uk.fujitsu.com>
wrote:

> Hi Karl,
>
> Attached is the MCF trace that includes all the logging pertaining to the
> FATAL error  ( the last entry is at 00:36:56,798  there was no further
> logging until  08:00 when I looked at the jobs in the morning). This is the
> same trace I sent on the original mail, I’m afraid I have no more trace
> than this.
>
>
>
> I’ll try and reproduce the problem with forensic logging on and append the
> traces to  connectors-1395.
>
>
>
> Best Regards,
>
>
>
> Guy
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 08 March 2017 12:32
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>
>
>
> Hi Guy,
>
>
>
> I've now had a look at everything available to me on your issue.  I've
> created CONNECTORS-1395 to track this issue; please attach any new
> materials to that ticket.
>
>
>
> There's an exception trace you didn't include, specifically for this FATAL
> exception:
>
> FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't 
> release lock we don't hold
>
> java.lang.IllegalStateException: Can't release lock we don't hold”
>
>
>
> This is very probably the smoking gun for the stuck lock.  I'd really like
> that exception trace from the log if you wouldn't mind.
>
>
>
> What has happened is this:
>
>
>
> (1) The document fetch cycle runs like this: (a) "stuff" the document (by
> changing its state to "active", (b) fetch the document, (c) mark the
> document completed.  For a specific document, after it's been fetched, when
> we're trying to mark the document as being "completed" we do not see the
> expected "active" status.  Instead we see a status of "pending purgatory",
> which means that either some other thread changed the document's status out
> from under us, or that document's status was never in fact set to "active"
> during the stuffing phase.  Neither of these is possible given the code
> paths available, but we can prove it one way or another by turning on
> "forensic" debugging, as I described above.
>
>
>
> (2) Once the failure happens, the job in question should abort.  But it
> seems like that abort does not complete because there's a second problem
> with lock management (which generates the FATAL message above).  This
> should be readily fixed if I can get that trace.
>
>
>
> Thanks,
>
> Karl
>
>
>
> On Wed, Mar 8, 2017 at 6:52 AM, Karl Wright <daddy...@gmail.com> wrote:
>
> Hi Guy,
>
>
>
> The agents thread dump shows that there's a lock stuck from somewhere; I
> expect it's from the UI.  Next time this happens, could you get a thread
> dump for the UI process as well as from the agents process?  Thanks!!
>
>
>
> Karl
>
>
>
>
>
> On Wed, Mar 8, 2017 at 6:12 AM, Karl Wright <daddy...@gmail.com> wrote:
>
> Hi Guy,
>
>
>
> See https://issues.apache.org/jira/browse/CONNECTORS-590.
>
>
>
> When you see "unexpected" this is not a good sign:
>
>
>
> >>>>>>
>
> “ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed:
> Unexpected jobqueue status - record id 1488898668325, expecting active
> status, saw 4
>
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
> jobqueue status - record id 1488898668325, expecting active status, saw 4
>
>                 at org.apache.manifoldcf.crawler.jobs.JobQueue.
> updateCompletedRecord(JobQueue.java:1019)
>
>                 at org.apache.manifoldcf.crawler.jobs.JobManager.
> markDocumentCompletedMultiple(JobManager.java:3271)
>
>                 at org.apache.manifoldcf.crawler.system.WorkerThread.run(
> WorkerThread.java:710)
>
> <<<<<<
>
>
>
> I've spent weeks chasing this issue in the past.  I have not seen it at
> all except in certain installations: 2 of them, counting yours.  I
> introduced transaction forensics into the MCF code to figure out what was
> going on and I definitively proved (on Postgresql 9.1) that we were seeing
> a transactional integrity problem with Postgresql itself.  A ticket against
> Postgresql was not logged because they'd need a reproducible test case and
> also the latest version, and it didn't happen on 9.3 at the time.
>
>
>
> You also seem to be seeing a deadlock in MCF locks as well.  When the
> Postgresql bug happens it means that the database is in a state that MCF
> really can't figure out and thus doesn't know how to deal with, so this may
> just be a downstream result of that occurrence.  But I can't be sure
> without further analysis.
>
>
>
> I'm very curious now about the details of your setup.  (1) How is your
> postgresql set up?  What version did you decide to use?  (2) How many
> agents processes do you have?  Just one?  (3) What OS? (4) What JDK?
>
>
>
> Other things you can try:
>
> (1) Running the postgresql LT tests (ant run-LT-postgresql) against your
> postgresql installation; you will need to change the test code itself to
> allow it to create an instance for testing in that case;
>
> (2) Turning on database transaction forensics (property name
> "org.apache.manifoldcf.diagnostics" value "DEBUG")
>
> (3) Turning on lock debugging (property name "org.apache.manifoldcf.lock"
> value "DEBUG")
>
>
>
> Transaction forensics uses memory but only generates lots of output when
> something goes wrong, so that may be the best place to start.  Turning on
> lock debugging will generate a large amount of log output but may be
> necessary, especially if we want to figure out if the lock issue is a
> downstream problem caused by the database integrity issue.
>
>
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Wed, Mar 8, 2017 at 5:11 AM, Standen Guy <guy.stan...@uk.fujitsu.com>
> wrote:
>
> Hi Karl,
>
>                 I have upgraded to MCF 2.6 on Windows using Zookeeper for
> synchronisation and PostgreSQL 9.3.16 for the database.
>
> I re-ran  the pair of jobs ( 1 web crawl of local intranet site every 15
> minutes  and  1  crawl of  JDBC connected database every 3 minutes) that
> would  previously lock up after a couple of iterations.  These seemed to
> run much better together but after 6 hours running there was a lock up for
> both these jobs (both indicating they were starting up).
>
>
>
> Attempting to restart or abort the jobs through the MCF webapp failed.
>
>
>
> After restarting the MCF agents process all sprang into life and the jobs
> aborted/restarted as expected.
>
>
>
> In the  MCF log the following error was found
>
> “ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed:
> Unexpected jobqueue status - record id 1488898668325, expecting active
> status, saw 4
>
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
> jobqueue status - record id 1488898668325, expecting active status, saw 4
>
>                 at org.apache.manifoldcf.crawler.jobs.JobQueue.
> updateCompletedRecord(JobQueue.java:1019)
>
>                 at org.apache.manifoldcf.crawler.jobs.JobManager.
> markDocumentCompletedMultiple(JobManager.java:3271)
>
>                 at org.apache.manifoldcf.crawler.system.WorkerThread.run(
> WorkerThread.java:710)
>
> WARN 2017-03-08 00:25:30,449 (Worker thread '23') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:30,449 (Worker thread '24') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:30,464 (Worker thread '9') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:30,464 (Worker thread '0') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:31,900 (Worker thread '11') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:31,900 (Worker thread '29') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:32,867 (Worker thread '10') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:32,867 (Worker thread '2') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:33,335 (Worker thread '8') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:36,642 (Worker thread '20') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:37,422 (Worker thread '21') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:38,280 (Worker thread '22') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:38,280 (Worker thread '3') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:38,280 (Worker thread '5') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:38,826 (Worker thread '28') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:39,045 (Worker thread '13') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:45,425 (Worker thread '4') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:45,425 (Worker thread '15') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:45,425 (Worker thread '17') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:46,392 (Worker thread '25') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:46,392 (Worker thread '27') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:11,043 (Worker thread '1') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:35,817 (Worker thread '19') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:35,817 (Worker thread '26') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:36,753 (Worker thread '7') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:39,248 (Worker thread '6') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:39,248 (Worker thread '18') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:43,129 (Worker thread '16') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't
> release lock we don't hold
>
> java.lang.IllegalStateException: Can't release lock we don't hold”
>
>
>
>
>
> ManifoldCF does not seem to have recovered from this error and  the jobs
> can’t be re-started without restarting ManifoldCF agents.
>
>
>
> The manifoldcf, PostgreSQL and Agents JVM stack trace are attached
>
>
>
> Best Regards,
>
>
>
> Guy
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 06 March 2017 09:24
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>
>
>
> Hi Guy,
>
>
>
> (1) I have no experience with PostgresSQL versions beyond 9.3, but I doubt
> you would have problems.
>
> (2) If you are using multiple processes, even if there's only one agents
> process, you must use synchronization.  I would recommend Zookeeper;
> file-system-based synchronization is deprecated.
>
> (3) Windows has many ways of interfering with file-based sync, including
> path-length issues.  I have seen Windows fail to unlock files and need a
> reboot to release the lock.  This is one reason why file-system-based
> locking is deprecated.
>
>
>
> Thanks,
> Karl
>
>
>
>
>
> On Mon, Mar 6, 2017 at 4:03 AM, Standen Guy <guy.stan...@uk.fujitsu.com>
> wrote:
>
> Hi Karl,
>
>                 Thanks for that I will try version 2.6.  Whilst moving to
> MCF 2.6 I would potentially like to upgrade my backend PostgreSQL version
> from 9.3.5.
>
> 1) Do you have a recommendation for which PostgreSQL to use with MCF 2.6
>   e.g. PostgreSQL 9.3.16 or PostgreSQL 9.6.2?
>
> 2) For a production system on a single server running a single MCF agents
> process would you recommend the file based synchronisation locking or
> zookeeper based synchronisation locking.  With the file based
> synchronisation locking mechanism I have sometimes seen errors of the form :
>
>                 'D:\Apps\ManifoldCF\apache-manifoldcf-2.0.1\multiprocess-
> file-example\.\.\syncharea\475\708\lock-_POOLTARGET__OUTPUTCONNECTORPOOL_Solr
> COLL1 osp_unstruct.lock' failed: Access is denied’ ( I have ensured that
> the SYNCHAREA  is not  scanned by AV or Indexed by Windows Search  and all
> MCF processes run as the same user)
>
> What could cause these errors?
>
>
>
> Many Thanks,
>
>
>
> Guy
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 03 March 2017 17:35
>
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>
>
>
> Hi Guy:
>
>
>
> It is expected that sometimes database deadlock will develop, and the
> transaction will need to be retried.  There is code in MCF for doing this:
>
>
>
> >>>>>>
>
>     if (sqlState != null && sqlState.equals("40001"))
>
>       return new ManifoldCFException(message,e,
> ManifoldCFException.DATABASE_TRANSACTION_ABORT);
>
> <<<<<<
>
>
>
> I suspect that your version of MCF is old enough so that this particular
> error and the associated retry are not taking place.  Upgrading to 2.6 will
> definitely help there.
>
>
>
> Karl
>
>
>
>
>
> On Fri, Mar 3, 2017 at 11:17 AM, Standen Guy <guy.stan...@uk.fujitsu.com>
> wrote:
>
> Hi Karl,
>
>                 Thanks for coming back so quickly. Unfortunately I wasn’t
> using a JCIFS connection. One of the issues I was seeing was between a
> crawl of an intranet site (no explicit throttling other than number of
> connections) and   scheduled crawl (every 5 mins) to a relational DB via
> JDBC connector again no explicit throttling.  To simplify things both jobs
> are using a NULL output connection. Sometimes both the Web crawl and the
> JDBC connection can run together but at other times 1 or both jobs will
> appear to lock up with just a few active documents showing. When I get a
> lock up the mcf log contains errors like:
>
>
>
> “DEBUG 2017-03-03 15:28:20,466 (Worker thread '5') - Exception Database
> exception: SQLException doing query (40001): ERROR: could not serialize
> access due to read/write dependencies among transactions”
>
>
>
> See the attached log extract for a little more detail.
>
>
>
> Any view why this might be happening?
>
>
>
> Best Regards,
>
>
>
> Guy
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 03 March 2017 11:27
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>
>
>
> Hi Guy,
>
>
>
> A issue with concurrent jobs is known for jobs sharing the same JCIFS
> connection.  Is that what you are using?  This has nothing to do with the
> version of Postgresql you are using; it has to do with what "bins"
> documents are thought to come from.  There has been a recent improvement
> for this issue, which will be released in April.  See
> https://issues.apache.org/jira/browse/CONNECTORS-1364.
>
>
>
> The current version of MCF (2.6) supports Solr 6.x.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Fri, Mar 3, 2017 at 5:27 AM, Standen Guy <guy.stan...@uk.fujitsu.com>
> wrote:
>
> Hi Karl,
>
> I am currently using MCF 2.0.1 with PostgreSQL 9.3.5 on Windows and have
> had some issues with multiple jobs running concurrently.
>
> I am considering upgrading to MCF 2.6 and to a newer version of
> PostgreSQL. Would you be able to advise which version of PostgreSQL I
> should consider using with MCF 2.6 (e.g.  PostgreSQL  9.3.16 or 9.6.2)
>
>
>
> I am also considering upgrading from SOLR 4.10.3 to a newer version. The
> MCF compatibility matrix mentions that compatibility has been tested to
> SOLR version 4.5.1. Do you have any advice about compatibility with the
> newer versions of SOLR e.g. 6.4.1.
>
>
>
> Best Regards
>
>
>
> Guy
>
>
>
>
> Unless otherwise stated, this email has been sent from Fujitsu Services
> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in
> England No 2216100) both with registered offices at: 22 Baker Street,
> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and
> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469)
> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes,
> Middlesex, UB4 8FE.
> This email is only for the use of its intended recipient. Its contents are
> subject to a duty of confidence and may be privileged. Fujitsu does not
> guarantee that this email has not been intercepted and amended or that it
> is virus-free.
>
>
>
>
> Unless otherwise stated, this email has been sent from Fujitsu Services
> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in
> England No 2216100) both with registered offices at: 22 Baker Street,
> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and
> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469)
> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes,
> Middlesex, UB4 8FE.
> This email is only for the use of its intended recipient. Its contents are
> subject to a duty of confidence and may be privileged. Fujitsu does not
> guarantee that this email has not been intercepted and amended or that it
> is virus-free.
>
>
>
>
> Unless otherwise stated, this email has been sent from Fujitsu Services
> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in
> England No 2216100) both with registered offices at: 22 Baker Street,
> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and
> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469)
> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes,
> Middlesex, UB4 8FE.
> This email is only for the use of its intended recipient. Its contents are
> subject to a duty of confidence and may be privileged. Fujitsu does not
> guarantee that this email has not been intercepted and amended or that it
> is virus-free.
>
>
>
>
> Unless otherwise stated, this email has been sent from Fujitsu Services
> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in
> England No 2216100) both with registered offices at: 22 Baker Street,
> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and
> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469)
> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes,
> Middlesex, UB4 8FE.
> This email is only for the use of its intended recipient. Its contents are
> subject to a duty of confidence and may be privileged. Fujitsu does not
> guarantee that this email has not been intercepted and amended or that it
> is virus-free.
>
>
>
>
>
>
>
> Unless otherwise stated, this email has been sent from Fujitsu Services
> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in
> England No 2216100) both with registered offices at: 22 Baker Street,
> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and
> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469)
> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes,
> Middlesex, UB4 8FE.
> This email is only for the use of its intended recipient. Its contents are
> subject to a duty of confidence and may be privileged. Fujitsu does not
> guarantee that this email has not been intercepted and amended or that it
> is virus-free.
>

Re: Advice on which PostgreSQL to use with ManifoldCF 2.6

Reply via email to