Hi Guy,

See https://issues.apache.org/jira/browse/CONNECTORS-590.

When you see "unexpected" this is not a good sign:

>>>>>>

“ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed:
Unexpected jobqueue status - record id 1488898668325, expecting active
status, saw 4

org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
jobqueue status - record id 1488898668325, expecting active status, saw 4

                at org.apache.manifoldcf.crawler.jobs.JobQueue.
updateCompletedRecord(JobQueue.java:1019)

                at org.apache.manifoldcf.crawler.jobs.JobManager.
markDocumentCompletedMultiple(JobManager.java:3271)

                at org.apache.manifoldcf.crawler.system.WorkerThread.run(
WorkerThread.java:710)

<<<<<<


I've spent weeks chasing this issue in the past.  I have not seen it at all
except in certain installations: 2 of them, counting yours.  I introduced
transaction forensics into the MCF code to figure out what was going on and
I definitively proved (on Postgresql 9.1) that we were seeing a
transactional integrity problem with Postgresql itself.  A ticket against
Postgresql was not logged because they'd need a reproducible test case and
also the latest version, and it didn't happen on 9.3 at the time.

You also seem to be seeing a deadlock in MCF locks as well.  When the
Postgresql bug happens it means that the database is in a state that MCF
really can't figure out and thus doesn't know how to deal with, so this may
just be a downstream result of that occurrence.  But I can't be sure
without further analysis.

I'm very curious now about the details of your setup.  (1) How is your
postgresql set up?  What version did you decide to use?  (2) How many
agents processes do you have?  Just one?  (3) What OS? (4) What JDK?

Other things you can try:
(1) Running the postgresql LT tests (ant run-LT-postgresql) against your
postgresql installation; you will need to change the test code itself to
allow it to create an instance for testing in that case;
(2) Turning on database transaction forensics (property name
"org.apache.manifoldcf.diagnostics" value "DEBUG")
(3) Turning on lock debugging (property name "org.apache.manifoldcf.lock"
value "DEBUG")

Transaction forensics uses memory but only generates lots of output when
something goes wrong, so that may be the best place to start.  Turning on
lock debugging will generate a large amount of log output but may be
necessary, especially if we want to figure out if the lock issue is a
downstream problem caused by the database integrity issue.


Thanks,
Karl


On Wed, Mar 8, 2017 at 5:11 AM, Standen Guy <guy.stan...@uk.fujitsu.com>
wrote:

> Hi Karl,
>
>                 I have upgraded to MCF 2.6 on Windows using Zookeeper for
> synchronisation and PostgreSQL 9.3.16 for the database.
>
> I re-ran  the pair of jobs ( 1 web crawl of local intranet site every 15
> minutes  and  1  crawl of  JDBC connected database every 3 minutes) that
> would  previously lock up after a couple of iterations.  These seemed to
> run much better together but after 6 hours running there was a lock up for
> both these jobs (both indicating they were starting up).
>
>
>
> Attempting to restart or abort the jobs through the MCF webapp failed.
>
>
>
> After restarting the MCF agents process all sprang into life and the jobs
> aborted/restarted as expected.
>
>
>
> In the  MCF log the following error was found
>
> “ERROR 2017-03-08 00:25:30,433 (Worker thread '14') - Exception tossed:
> Unexpected jobqueue status - record id 1488898668325, expecting active
> status, saw 4
>
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected
> jobqueue status - record id 1488898668325, expecting active status, saw 4
>
>                 at org.apache.manifoldcf.crawler.jobs.JobQueue.
> updateCompletedRecord(JobQueue.java:1019)
>
>                 at org.apache.manifoldcf.crawler.jobs.JobManager.
> markDocumentCompletedMultiple(JobManager.java:3271)
>
>                 at org.apache.manifoldcf.crawler.system.WorkerThread.run(
> WorkerThread.java:710)
>
> WARN 2017-03-08 00:25:30,449 (Worker thread '23') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:30,449 (Worker thread '24') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:30,464 (Worker thread '9') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:30,464 (Worker thread '0') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:31,900 (Worker thread '11') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:31,900 (Worker thread '29') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:32,867 (Worker thread '10') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:32,867 (Worker thread '2') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:33,335 (Worker thread '8') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:36,642 (Worker thread '20') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:37,422 (Worker thread '21') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:38,280 (Worker thread '22') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:38,280 (Worker thread '3') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:38,280 (Worker thread '5') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:38,826 (Worker thread '28') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:39,045 (Worker thread '13') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:45,425 (Worker thread '4') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:45,425 (Worker thread '15') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:45,425 (Worker thread '17') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:46,392 (Worker thread '25') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:25:46,392 (Worker thread '27') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:11,043 (Worker thread '1') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:35,817 (Worker thread '19') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:35,817 (Worker thread '26') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:36,753 (Worker thread '7') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:39,248 (Worker thread '6') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:39,248 (Worker thread '18') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> WARN 2017-03-08 00:26:43,129 (Worker thread '16') - Service interruption
> reported for job 1488898090224 connection 'web': Job no longer active
>
> FATAL 2017-03-08 00:32:24,819 (Idle cleanup thread) - Error tossed: Can't
> release lock we don't hold
>
> java.lang.IllegalStateException: Can't release lock we don't hold”
>
>
>
>
>
> ManifoldCF does not seem to have recovered from this error and  the jobs
> can’t be re-started without restarting ManifoldCF agents.
>
>
>
> The manifoldcf, PostgreSQL and Agents JVM stack trace are attached
>
>
>
> Best Regards,
>
>
>
> Guy
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 06 March 2017 09:24
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>
>
>
> Hi Guy,
>
>
>
> (1) I have no experience with PostgresSQL versions beyond 9.3, but I doubt
> you would have problems.
>
> (2) If you are using multiple processes, even if there's only one agents
> process, you must use synchronization.  I would recommend Zookeeper;
> file-system-based synchronization is deprecated.
>
> (3) Windows has many ways of interfering with file-based sync, including
> path-length issues.  I have seen Windows fail to unlock files and need a
> reboot to release the lock.  This is one reason why file-system-based
> locking is deprecated.
>
>
>
> Thanks,
> Karl
>
>
>
>
>
> On Mon, Mar 6, 2017 at 4:03 AM, Standen Guy <guy.stan...@uk.fujitsu.com>
> wrote:
>
> Hi Karl,
>
>                 Thanks for that I will try version 2.6.  Whilst moving to
> MCF 2.6 I would potentially like to upgrade my backend PostgreSQL version
> from 9.3.5.
>
> 1) Do you have a recommendation for which PostgreSQL to use with MCF 2.6
>   e.g. PostgreSQL 9.3.16 or PostgreSQL 9.6.2?
>
> 2) For a production system on a single server running a single MCF agents
> process would you recommend the file based synchronisation locking or
> zookeeper based synchronisation locking.  With the file based
> synchronisation locking mechanism I have sometimes seen errors of the form :
>
>                 'D:\Apps\ManifoldCF\apache-manifoldcf-2.0.1\multiprocess-
> file-example\.\.\syncharea\475\708\lock-_POOLTARGET__OUTPUTCONNECTORPOOL_Solr
> COLL1 osp_unstruct.lock' failed: Access is denied’ ( I have ensured that
> the SYNCHAREA  is not  scanned by AV or Indexed by Windows Search  and all
> MCF processes run as the same user)
>
> What could cause these errors?
>
>
>
> Many Thanks,
>
>
>
> Guy
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 03 March 2017 17:35
>
>
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>
>
>
> Hi Guy:
>
>
>
> It is expected that sometimes database deadlock will develop, and the
> transaction will need to be retried.  There is code in MCF for doing this:
>
>
>
> >>>>>>
>
>     if (sqlState != null && sqlState.equals("40001"))
>
>       return new ManifoldCFException(message,e,
> ManifoldCFException.DATABASE_TRANSACTION_ABORT);
>
> <<<<<<
>
>
>
> I suspect that your version of MCF is old enough so that this particular
> error and the associated retry are not taking place.  Upgrading to 2.6 will
> definitely help there.
>
>
>
> Karl
>
>
>
>
>
> On Fri, Mar 3, 2017 at 11:17 AM, Standen Guy <guy.stan...@uk.fujitsu.com>
> wrote:
>
> Hi Karl,
>
>                 Thanks for coming back so quickly. Unfortunately I wasn’t
> using a JCIFS connection. One of the issues I was seeing was between a
> crawl of an intranet site (no explicit throttling other than number of
> connections) and   scheduled crawl (every 5 mins) to a relational DB via
> JDBC connector again no explicit throttling.  To simplify things both jobs
> are using a NULL output connection. Sometimes both the Web crawl and the
> JDBC connection can run together but at other times 1 or both jobs will
> appear to lock up with just a few active documents showing. When I get a
> lock up the mcf log contains errors like:
>
>
>
> “DEBUG 2017-03-03 15:28:20,466 (Worker thread '5') - Exception Database
> exception: SQLException doing query (40001): ERROR: could not serialize
> access due to read/write dependencies among transactions”
>
>
>
> See the attached log extract for a little more detail.
>
>
>
> Any view why this might be happening?
>
>
>
> Best Regards,
>
>
>
> Guy
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* 03 March 2017 11:27
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Advice on which PostgreSQL to use with ManifoldCF 2.6
>
>
>
> Hi Guy,
>
>
>
> A issue with concurrent jobs is known for jobs sharing the same JCIFS
> connection.  Is that what you are using?  This has nothing to do with the
> version of Postgresql you are using; it has to do with what "bins"
> documents are thought to come from.  There has been a recent improvement
> for this issue, which will be released in April.  See
> https://issues.apache.org/jira/browse/CONNECTORS-1364.
>
>
>
> The current version of MCF (2.6) supports Solr 6.x.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Fri, Mar 3, 2017 at 5:27 AM, Standen Guy <guy.stan...@uk.fujitsu.com>
> wrote:
>
> Hi Karl,
>
> I am currently using MCF 2.0.1 with PostgreSQL 9.3.5 on Windows and have
> had some issues with multiple jobs running concurrently.
>
> I am considering upgrading to MCF 2.6 and to a newer version of
> PostgreSQL. Would you be able to advise which version of PostgreSQL I
> should consider using with MCF 2.6 (e.g.  PostgreSQL  9.3.16 or 9.6.2)
>
>
>
> I am also considering upgrading from SOLR 4.10.3 to a newer version. The
> MCF compatibility matrix mentions that compatibility has been tested to
> SOLR version 4.5.1. Do you have any advice about compatibility with the
> newer versions of SOLR e.g. 6.4.1.
>
>
>
> Best Regards
>
>
>
> Guy
>
>
>
>
> Unless otherwise stated, this email has been sent from Fujitsu Services
> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in
> England No 2216100) both with registered offices at: 22 Baker Street,
> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and
> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469)
> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes,
> Middlesex, UB4 8FE.
> This email is only for the use of its intended recipient. Its contents are
> subject to a duty of confidence and may be privileged. Fujitsu does not
> guarantee that this email has not been intercepted and amended or that it
> is virus-free.
>
>
>
>
> Unless otherwise stated, this email has been sent from Fujitsu Services
> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in
> England No 2216100) both with registered offices at: 22 Baker Street,
> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and
> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469)
> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes,
> Middlesex, UB4 8FE.
> This email is only for the use of its intended recipient. Its contents are
> subject to a duty of confidence and may be privileged. Fujitsu does not
> guarantee that this email has not been intercepted and amended or that it
> is virus-free.
>
>
>
>
> Unless otherwise stated, this email has been sent from Fujitsu Services
> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in
> England No 2216100) both with registered offices at: 22 Baker Street,
> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and
> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469)
> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes,
> Middlesex, UB4 8FE.
> This email is only for the use of its intended recipient. Its contents are
> subject to a duty of confidence and may be privileged. Fujitsu does not
> guarantee that this email has not been intercepted and amended or that it
> is virus-free.
>
>
>
> Unless otherwise stated, this email has been sent from Fujitsu Services
> Limited (registered in England No 96056); Fujitsu EMEA PLC (registered in
> England No 2216100) both with registered offices at: 22 Baker Street,
> London W1U 3BW; PFU (EMEA) Limited, (registered in England No 1578652) and
> Fujitsu Laboratories of Europe Limited (registered in England No. 4153469)
> both with registered offices at: Hayes Park Central, Hayes End Road, Hayes,
> Middlesex, UB4 8FE.
> This email is only for the use of its intended recipient. Its contents are
> subject to a duty of confidence and may be privileged. Fujitsu does not
> guarantee that this email has not been intercepted and amended or that it
> is virus-free.
>

Reply via email to