And yes, we'd also need to hand the mysql folks a similar test case. Karl
On Sat, Jun 8, 2019 at 1:53 AM Karl Wright <daddy...@gmail.com> wrote: > Here's an explanation for Postgresql about what is supposed to happen in > this case. See slide 7. > > https://www.postgresql.org/files/developer/concurrency.pdf > > The explanation should apply to all MVCC databases as well. The sequence > we're seeing is: > > Thread 1: > <start transaction> > <do_expensive_stuffer_query> > <update_jobqueue_status> > <end transaction> > <hand records to worker threads> > > > Worker thread: > <process document> > <begin transaction> > <change jobqueue status from active to completed> > <end transaction > > The worker thread, however, is finding that the jobqueue status is not > what was expected. It's as if transactions didn't actually serialize, once > in a great while, and either the original stuffer update never became > visible by the time the worker thread completed its work, or some other > thread modified it while the worker thread was processing the document. > But I've got every jobqueue job status write instrumented so if that were > what was happening we'd see it in the diagnostic dump, and we do not. > > I wish it were reasonably possible to hand the postgresql folks a test > case that demonstrated this behavior. But, as you know, the problem is > quite rare, and you have to pound on the database for an extended period of > time before you get this to happen. :-( > > Karl > > On Fri, Jun 7, 2019 at 3:28 AM Karl Wright <daddy...@gmail.com> wrote: > >> I just reviewed the places where JobQueue status is updated to be sure >> that all places this happens are included in the diagnostics trackiing. >> They were, save for when a cluster gets restarted, and I added that to the >> code in trunk as well. >> >> So the analysis, above, stands. >> >> :-( >> >> Karl >> >> >> On Thu, Jun 6, 2019 at 4:28 PM Karl Wright <daddy...@gmail.com> wrote: >> >>> I just looked at this briefly. >>> >>> For >>> https://gist.githubusercontent.com/schuch/16f617eed064b004cb69a2f929b08ca8/raw/46b9585bf73bb7eaf5e3b28aeb87610525afc9ca/Unexpected%2520jobqueue%2520status%2520-%2520record%2520id%25201522147023170,%2520expecting%2520active%2520status,%2520saw%25204, >>> look for the following record ID: >>> >>> 1522147023170 >>> >>> This is the one when we look at it after the document is processed that it >>> has status 4: >>> >>> >>>>>> >>> >>> manifoldcf_1 | 2019-06-06T02:01:05.380401275Z 2019-06-06T04:01:05,380 >>> DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics: ==== >>> Forensics for record 1522147023170, current status: 4 ==== >>> >>> manifoldcf_1 | 2019-06-06T02:01:05.380512360Z 2019-06-06T04:01:05,380 >>> DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics: === Current >>> stack trace === >>> manifoldcf_1 | 2019-06-06T02:01:05.380544192Z java.lang.Exception: >>> Forensics stack trace >>> manifoldcf_1 | 2019-06-06T02:01:05.380548494Z at >>> org.apache.manifoldcf.crawler.jobs.TrackerClass.printForensics(TrackerClass.java:216) >>> [mcf-pull-agent.jar:?] >>> manifoldcf_1 | 2019-06-06T02:01:05.380552564Z at >>> org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:1028) >>> [mcf-pull-agent.jar:?] >>> manifoldcf_1 | 2019-06-06T02:01:05.380568210Z at >>> org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3414) >>> [mcf-pull-agent.jar:?] >>> >>> <<<<<< >>> >>> It expects it to have status 5, though, because the only way the document >>> could be being worked on at all is if the document had been put into status >>> 5 by the stuffer thread: >>> >>> >>>>>> >>> >>> manifoldcf_1 | 2019-06-06T02:01:05.387210733Z 2019-06-06T04:01:05,381 >>> DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics: == About to >>> commit transaction by 'Stuffer thread' at 1559786464972 == >>> manifoldcf_1 | 2019-06-06T02:01:05.387250827Z java.lang.Exception: >>> Precommit stack trace >>> manifoldcf_1 | 2019-06-06T02:01:05.387256464Z at >>> org.apache.manifoldcf.crawler.jobs.TrackerClass.notePrecommit(TrackerClass.java:108) >>> ~[mcf-pull-agent.jar:?] >>> manifoldcf_1 | 2019-06-06T02:01:05.387260987Z at >>> org.apache.manifoldcf.crawler.jobs.JobManager.fetchAndProcessDocuments(JobManager.java:3295) >>> ~[mcf-pull-agent.jar:?] >>> manifoldcf_1 | 2019-06-06T02:01:05.387265034Z at >>> org.apache.manifoldcf.crawler.jobs.JobManager.getNextDocuments(JobManager.java:3014) >>> ~[mcf-pull-agent.jar:?] >>> manifoldcf_1 | 2019-06-06T02:01:05.387269073Z at >>> org.apache.manifoldcf.crawler.system.StufferThread.run(StufferThread.java:186) >>> ~[mcf-pull-agent.jar:?] >>> manifoldcf_1 | 2019-06-06T02:01:05.387285663Z 2019-06-06T04:01:05,387 >>> DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics: >>> Transaction includes: >>> manifoldcf_1 | 2019-06-06T02:01:05.387290117Z 2019-06-06T04:01:05,387 >>> DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics: Record >>> 1522147023170 status modified to 5: Make active >>> >>> ... >>> >>> <<<<<< >>> >>> The actual commit follows a short while later to actually set the status, >>> and only AFTER that are the documents queued to be handed to the worker >>> threads. >>> >>> So what this is saying is that the stuffer thread sets the document status >>> to 5 and then a worker thread gets it and when the worker thread is done it >>> tries to set the status back to 4 -- but it's ALREADY at 4 and that is not >>> possible. >>> >>> Now, it may be worth examining the possibility that some other job gets to >>> the document while it is waiting to be queued, and does stuff to the >>> document's jobqueue record. There are a couple of background threads which >>> (for example) reprioritize documents that might conceivably change the >>> jobqueue record's status. They generally work by changing the record >>> status in a transaction to some other value that would not make the record >>> eligible for queuing, and then changing it back when done. But unless >>> completely overlooked, these would, however, be visible in the diagnostics, >>> and furthermore it should not be possible for one thread in a transaction >>> to change the status at the same time as another thread in a transaction >>> changes the status. >>> >>> So either way I'm forced to conclude that a record gets changed and >>> committed, but the change is not visible to another thread later trying to >>> reset the status. >>> >>> Karl >>> >>> >>> >>> >>> On Thu, Jun 6, 2019 at 5:58 AM Karl Wright <daddy...@gmail.com> wrote: >>> >>>> I can look at the log output but not until this weekend. >>>> >>>> Karl >>>> >>>> >>>> On Thu, Jun 6, 2019 at 3:59 AM Markus Schuch < >>>> markus.sch...@deutschebahn.com> wrote: >>>> >>>>> Hi Olivier, >>>>> >>>>> >>>>> >>>>> we were not able to fix this yet. >>>>> >>>>> >>>>> >>>>> But now we have new diagnostics log data, after the error occurred >>>>> again yesterday: >>>>> >>>>> >>>>> >>>>> Unexpected jobqueue status - record id 1522147023170, expecting active >>>>> status, saw 4 >>>>> >>>>> >>>>> https://gist.githubusercontent.com/schuch/16f617eed064b004cb69a2f929b08ca8/raw/46b9585bf73bb7eaf5e3b28aeb87610525afc9ca/Unexpected%2520jobqueue%2520status%2520-%2520record%2520id%25201522147023170,%2520expecting%2520active%2520status,%2520saw%25204 >>>>> >>>>> >>>>> >>>>> Unexpected jobqueue status - record id 1541529864711, expecting active >>>>> status, saw 4 >>>>> >>>>> >>>>> https://gist.githubusercontent.com/schuch/d73672e44b0edd5a9184170c9ee417a5/raw/f55b1c023056ef8e310d99489a0b6f9dfcaaa585/Unexpected%2520jobqueue%2520status%2520-%2520record%2520id%25201541529864711,%2520expecting%2520active%2520status,%2520saw%25204 >>>>> >>>>> >>>>> >>>>> @Karl, maybe you can get some ideas from that log data? >>>>> >>>>> >>>>> >>>>> Cheers, >>>>> >>>>> Markus >>>>> >>>>> >>>>> >>>>> *Von:* Olivier Tavard <olivier.tav...@francelabs.com> >>>>> *Gesendet:* Dienstag, 21. Mai 2019 11:00 >>>>> *An:* user@manifoldcf.apache.org >>>>> *Betreff:* Re: Error: Unexpected jobqueue status - record id X, >>>>> expecting active status, saw 4 (MySQL compatible Database) >>>>> >>>>> >>>>> >>>>> Hi Markus, >>>>> >>>>> >>>>> >>>>> We have the same error (with postgresql database). Did the error occur >>>>> again since your last mail ? >>>>> >>>>> Did you change something on your MCF configuration to fix this ? >>>>> >>>>> >>>>> >>>>> Thanks, >>>>> >>>>> Best regards, >>>>> >>>>> >>>>> >>>>> Olivier >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Le 13 févr. 2019 à 13:58, Markus Schuch < >>>>> markus.sch...@deutschebahn.com> a écrit : >>>>> >>>>> >>>>> >>>>> Hi Karl, >>>>> >>>>> >>>>> >>>>> we set the diagnostigs logger to level debug. >>>>> >>>>> >>>>> >>>>> I will get back when the error occurs again. >>>>> >>>>> >>>>> >>>>> Cheers, >>>>> >>>>> Markus >>>>> >>>>> >>>>> ------------------------------ >>>>> >>>>> >>>>> Pflichtangaben anzeigen >>>>> <http://www.deutschebahn.com/pflichtangaben/20190211> >>>>> >>>>> Nähere Informationen zur Datenverarbeitung im DB-Konzern finden Sie >>>>> hier: http://www.deutschebahn.com/de/konzern/datenschutz >>>>> >>>>> >>>>> >>>>> Le 12 févr. 2019 à 17:41, Karl Wright <daddy...@gmail.com> a écrit : >>>>> >>>>> >>>>> >>>>> Hi Marcus, >>>>> >>>>> >>>>> >>>>> There's a properties.xml debugging logger you can enable that will >>>>> keep track of what's happening with transactions, so that when an error of >>>>> this kinds is reported, information about why the situation is unexpected >>>>> is dumped to the log. The logger is called "diagnostics" e.g. >>>>> "org.apache.manifoldcf.diagnostics". >>>>> >>>>> >>>>> >>>>> Karl >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Tue, Feb 12, 2019 at 10:53 AM Markus Schuch < >>>>> markus.sch...@deutschebahn.com> wrote: >>>>> >>>>> Hi, >>>>> >>>>> we are seeing "Error: Unexpected jobqueue status - record id >>>>> 1484612513829, expecting active status, saw 4" from time to time. >>>>> I didn’t report it, because we were running on an old MCF version, and >>>>> there were some bugreports relating to this error that are resolved in >>>>> newer versions. >>>>> >>>>> Now we have upgraded to latest and greatest and the error still >>>>> occurred. So want to start to track this down. >>>>> >>>>> Out setup is: >>>>> >>>>> - ManifoldCF 2.12, running in a Docker Container based on Redhat >>>>> Linux, OpenJDK 8 >>>>> - AWS RDS Database (Aurora MySQL -> 5.6 compatible) >>>>> - Single Process Setup >>>>> >>>>> >>>>> We run several Jobs (run once mode) over night with schedule windows >>>>> and max runtime settings. Some of the time windows are overlapping. >>>>> At normal days some Jobs finish during their time window, some go to >>>>> WAITING and will go on in the next night. >>>>> >>>>> The error mostly hits at a Sharepoint Repository Job. >>>>> We have the impression, that the error is somehow related to >>>>> situations, when service interruptions (e.g. connection issues) occur in >>>>> other jobs. >>>>> >>>>> >>>>> >>>>> Maybe all this is related to fundamental problems with the database or >>>>> the JDBC driver >>>>> >>>>> (Karl expressed his concerns about that in the maybe-related >>>>> https://issues.apache.org/jira/browse/CONNECTORS-1581 ticket) >>>>> >>>>> On the other hand, there were similar bugreports in the past >>>>> concerning unexpected jobqueue status and there were resolved. >>>>> >>>>> Mayby there is a chance this can also be analyzed and fixed, but I do >>>>> not really know where to start. >>>>> >>>>> >>>>> Has anybody ideas how we can track this down / debug this to get to >>>>> the bottom of the problem? >>>>> What could be the first step in the analysis? >>>>> >>>>> Many thanks in advance >>>>> Markus >>>>> >>>>> May be related bugreports: >>>>> >>>>> >>>>> >>>>> [0] https://issues.apache.org/jira/browse/CONNECTORS-1395 >>>>> [1] https://issues.apache.org/jira/browse/CONNECTORS-1180 >>>>> [2] https://issues.apache.org/jira/browse/CONNECTORS-590 >>>>> [3] https://issues.apache.org/jira/browse/CONNECTORS-246 >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Markus Schuch >>>>> >>>>> Web Business (T.IPB 26) >>>>> >>>>> >>>>> >>>>> DB Systel GmbH >>>>> >>>>> Jürgen-Ponto-Platz 1, 60329 Frankfurt a. Main >>>>> >>>>> >>>>> ------------------------------ >>>>> >>>>> >>>>> Pflichtangaben anzeigen >>>>> <http://www.deutschebahn.com/pflichtangaben/20190211> >>>>> >>>>> Nähere Informationen zur Datenverarbeitung im DB-Konzern finden Sie >>>>> hier: http://www.deutschebahn.com/de/konzern/datenschutz >>>>> >>>>> >>>>> >>>>> >>>>> ------------------------------ >>>>> >>>>> Pflichtangaben anzeigen >>>>> <http://www.deutschebahn.com/pflichtangaben/20190529> >>>>> >>>>> Nähere Informationen zur Datenverarbeitung im DB-Konzern finden Sie >>>>> hier: http://www.deutschebahn.com/de/konzern/datenschutz >>>>> >>>>