And yes, we'd also need to hand the mysql folks a similar test case.
Karl

On Sat, Jun 8, 2019 at 1:53 AM Karl Wright <daddy...@gmail.com> wrote:

> Here's an explanation for Postgresql about what is supposed to happen in
> this case.  See slide 7.
>
> https://www.postgresql.org/files/developer/concurrency.pdf
>
> The explanation should apply to all MVCC databases as well.  The sequence
> we're seeing is:
>
> Thread 1:
> <start transaction>
> <do_expensive_stuffer_query>
> <update_jobqueue_status>
> <end transaction>
> <hand records to worker threads>
>
>
> Worker thread:
> <process document>
> <begin transaction>
> <change jobqueue status from active to completed>
> <end transaction
>
> The worker thread, however, is finding that the jobqueue status is not
> what was expected.  It's as if transactions didn't actually serialize, once
> in a great while, and either the original stuffer update never became
> visible by the time the worker thread completed its work, or some other
> thread modified it while the worker thread was processing the document.
> But I've got every jobqueue job status write instrumented so if that were
> what was happening we'd see it in the diagnostic dump, and we do not.
>
> I wish it were reasonably possible to hand the postgresql folks a test
> case that demonstrated this behavior.  But, as you know, the problem is
> quite rare, and you have to pound on the database for an extended period of
> time before you get this to happen. :-(
>
> Karl
>
> On Fri, Jun 7, 2019 at 3:28 AM Karl Wright <daddy...@gmail.com> wrote:
>
>> I just reviewed the places where JobQueue status is updated to be sure
>> that all places this happens are included in the diagnostics trackiing.
>> They were, save for when a cluster gets restarted, and I added that to the
>> code in trunk as well.
>>
>> So the analysis, above, stands.
>>
>> :-(
>>
>> Karl
>>
>>
>> On Thu, Jun 6, 2019 at 4:28 PM Karl Wright <daddy...@gmail.com> wrote:
>>
>>> I just looked at this briefly.
>>>
>>> For
>>> https://gist.githubusercontent.com/schuch/16f617eed064b004cb69a2f929b08ca8/raw/46b9585bf73bb7eaf5e3b28aeb87610525afc9ca/Unexpected%2520jobqueue%2520status%2520-%2520record%2520id%25201522147023170,%2520expecting%2520active%2520status,%2520saw%25204,
>>> look for the following record ID:
>>>
>>> 1522147023170
>>>
>>> This is the one when we look at it after the document is processed that it 
>>> has status 4:
>>>
>>> >>>>>>
>>>
>>> manifoldcf_1     | 2019-06-06T02:01:05.380401275Z 2019-06-06T04:01:05,380 
>>> DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics: ==== 
>>> Forensics for record 1522147023170, current status: 4 ====
>>>
>>> manifoldcf_1     | 2019-06-06T02:01:05.380512360Z 2019-06-06T04:01:05,380 
>>> DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics: === Current 
>>> stack trace ===
>>> manifoldcf_1     | 2019-06-06T02:01:05.380544192Z java.lang.Exception: 
>>> Forensics stack trace
>>> manifoldcf_1     | 2019-06-06T02:01:05.380548494Z   at 
>>> org.apache.manifoldcf.crawler.jobs.TrackerClass.printForensics(TrackerClass.java:216)
>>>  [mcf-pull-agent.jar:?]
>>> manifoldcf_1     | 2019-06-06T02:01:05.380552564Z   at 
>>> org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:1028)
>>>  [mcf-pull-agent.jar:?]
>>> manifoldcf_1     | 2019-06-06T02:01:05.380568210Z   at 
>>> org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:3414)
>>>  [mcf-pull-agent.jar:?]
>>>
>>> <<<<<<
>>>
>>> It expects it to have status 5, though, because the only way the document 
>>> could be being worked on at all is if the document had been put into status 
>>> 5 by the stuffer thread:
>>>
>>> >>>>>>
>>>
>>> manifoldcf_1     | 2019-06-06T02:01:05.387210733Z 2019-06-06T04:01:05,381 
>>> DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics: == About to 
>>> commit transaction by 'Stuffer thread' at 1559786464972 ==
>>> manifoldcf_1     | 2019-06-06T02:01:05.387250827Z java.lang.Exception: 
>>> Precommit stack trace
>>> manifoldcf_1     | 2019-06-06T02:01:05.387256464Z   at 
>>> org.apache.manifoldcf.crawler.jobs.TrackerClass.notePrecommit(TrackerClass.java:108)
>>>  ~[mcf-pull-agent.jar:?]
>>> manifoldcf_1     | 2019-06-06T02:01:05.387260987Z   at 
>>> org.apache.manifoldcf.crawler.jobs.JobManager.fetchAndProcessDocuments(JobManager.java:3295)
>>>  ~[mcf-pull-agent.jar:?]
>>> manifoldcf_1     | 2019-06-06T02:01:05.387265034Z   at 
>>> org.apache.manifoldcf.crawler.jobs.JobManager.getNextDocuments(JobManager.java:3014)
>>>  ~[mcf-pull-agent.jar:?]
>>> manifoldcf_1     | 2019-06-06T02:01:05.387269073Z   at 
>>> org.apache.manifoldcf.crawler.system.StufferThread.run(StufferThread.java:186)
>>>  ~[mcf-pull-agent.jar:?]
>>> manifoldcf_1     | 2019-06-06T02:01:05.387285663Z 2019-06-06T04:01:05,387 
>>> DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics:     
>>> Transaction includes:
>>> manifoldcf_1     | 2019-06-06T02:01:05.387290117Z 2019-06-06T04:01:05,387 
>>> DEBUG [Worker thread '37'] org.apache.manifoldcf.diagnostics:       Record 
>>> 1522147023170 status modified to 5: Make active
>>>
>>> ...
>>>
>>> <<<<<<
>>>
>>> The actual commit follows a short while later to actually set the status, 
>>> and only AFTER that are the documents queued to be handed to the worker 
>>> threads.
>>>
>>> So what this is saying is that the stuffer thread sets the document status 
>>> to 5 and then a worker thread gets it and when the worker thread is done it 
>>> tries to set the status back to 4 -- but it's ALREADY at 4 and that is not 
>>> possible.
>>>
>>> Now, it may be worth examining the possibility that some other job gets to 
>>> the document while it is waiting to be queued, and does stuff to the 
>>> document's jobqueue record.  There are a couple of background threads which 
>>> (for example) reprioritize documents that might conceivably change the 
>>> jobqueue record's status.  They generally work by changing the record 
>>> status in a transaction to some other value that would not make the record 
>>> eligible for queuing, and then changing it back when done.  But unless 
>>> completely overlooked, these would, however, be visible in the diagnostics, 
>>> and furthermore it should not be possible for one thread in a transaction 
>>> to change the status at the same time as another thread in a transaction 
>>> changes the status.
>>>
>>> So either way I'm forced to conclude that a record gets changed and 
>>> committed, but the change is not visible to another thread later trying to 
>>> reset the status.
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>> On Thu, Jun 6, 2019 at 5:58 AM Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> I can look at the log output but not until this weekend.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Jun 6, 2019 at 3:59 AM Markus Schuch <
>>>> markus.sch...@deutschebahn.com> wrote:
>>>>
>>>>> Hi Olivier,
>>>>>
>>>>>
>>>>>
>>>>> we were not able to fix this yet.
>>>>>
>>>>>
>>>>>
>>>>> But now we have new diagnostics log data, after the error occurred
>>>>> again yesterday:
>>>>>
>>>>>
>>>>>
>>>>> Unexpected jobqueue status - record id 1522147023170, expecting active
>>>>> status, saw 4
>>>>>
>>>>>
>>>>> https://gist.githubusercontent.com/schuch/16f617eed064b004cb69a2f929b08ca8/raw/46b9585bf73bb7eaf5e3b28aeb87610525afc9ca/Unexpected%2520jobqueue%2520status%2520-%2520record%2520id%25201522147023170,%2520expecting%2520active%2520status,%2520saw%25204
>>>>>
>>>>>
>>>>>
>>>>> Unexpected jobqueue status - record id 1541529864711, expecting active
>>>>> status, saw 4
>>>>>
>>>>>
>>>>> https://gist.githubusercontent.com/schuch/d73672e44b0edd5a9184170c9ee417a5/raw/f55b1c023056ef8e310d99489a0b6f9dfcaaa585/Unexpected%2520jobqueue%2520status%2520-%2520record%2520id%25201541529864711,%2520expecting%2520active%2520status,%2520saw%25204
>>>>>
>>>>>
>>>>>
>>>>> @Karl, maybe you can get some ideas from that log data?
>>>>>
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>>>
>>>>> *Von:* Olivier Tavard <olivier.tav...@francelabs.com>
>>>>> *Gesendet:* Dienstag, 21. Mai 2019 11:00
>>>>> *An:* user@manifoldcf.apache.org
>>>>> *Betreff:* Re: Error: Unexpected jobqueue status - record id X,
>>>>> expecting active status, saw 4 (MySQL compatible Database)
>>>>>
>>>>>
>>>>>
>>>>> Hi Markus,
>>>>>
>>>>>
>>>>>
>>>>> We have the same error (with postgresql database). Did the error occur
>>>>> again since your last mail ?
>>>>>
>>>>> Did you change something on your MCF configuration to fix this ?
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Best regards,
>>>>>
>>>>>
>>>>>
>>>>> Olivier
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Le 13 févr. 2019 à 13:58, Markus Schuch <
>>>>> markus.sch...@deutschebahn.com> a écrit :
>>>>>
>>>>>
>>>>>
>>>>> Hi Karl,
>>>>>
>>>>>
>>>>>
>>>>> we set the diagnostigs logger to level debug.
>>>>>
>>>>>
>>>>>
>>>>> I will get back when the error occurs again.
>>>>>
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>>
>>>>> Pflichtangaben anzeigen
>>>>> <http://www.deutschebahn.com/pflichtangaben/20190211>
>>>>>
>>>>> Nähere Informationen zur Datenverarbeitung im DB-Konzern finden Sie
>>>>> hier: http://www.deutschebahn.com/de/konzern/datenschutz
>>>>>
>>>>>
>>>>>
>>>>> Le 12 févr. 2019 à 17:41, Karl Wright <daddy...@gmail.com> a écrit :
>>>>>
>>>>>
>>>>>
>>>>> Hi Marcus,
>>>>>
>>>>>
>>>>>
>>>>> There's a properties.xml debugging logger you can enable that will
>>>>> keep track of what's happening with transactions, so that when an error of
>>>>> this kinds is reported, information about why the situation is unexpected
>>>>> is dumped to the log.  The logger is called "diagnostics" e.g.
>>>>> "org.apache.manifoldcf.diagnostics".
>>>>>
>>>>>
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 12, 2019 at 10:53 AM Markus Schuch <
>>>>> markus.sch...@deutschebahn.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> we are seeing "Error: Unexpected jobqueue status - record id
>>>>> 1484612513829, expecting active status, saw 4" from time to time.
>>>>> I didn’t report it, because we were running on an old MCF version, and
>>>>> there were some bugreports relating to this error that are resolved in
>>>>> newer versions.
>>>>>
>>>>> Now we have upgraded to latest and greatest and the error still
>>>>> occurred. So want to start to track this down.
>>>>>
>>>>> Out setup is:
>>>>>
>>>>>    - ManifoldCF 2.12, running in a Docker Container based on Redhat
>>>>>    Linux, OpenJDK 8
>>>>>    - AWS RDS Database (Aurora MySQL -> 5.6 compatible)
>>>>>    - Single Process Setup
>>>>>
>>>>>
>>>>> We run several Jobs (run once mode) over night with schedule windows
>>>>> and max runtime settings. Some of the time windows are overlapping.
>>>>> At normal days some Jobs finish during their time window, some go to
>>>>> WAITING and will go on in the next night.
>>>>>
>>>>> The error mostly hits at a Sharepoint Repository Job.
>>>>> We have the impression, that the error is somehow related to
>>>>> situations, when service interruptions (e.g. connection issues) occur in
>>>>> other jobs.
>>>>>
>>>>>
>>>>>
>>>>> Maybe all this is related to fundamental problems with the database or
>>>>> the JDBC driver
>>>>>
>>>>> (Karl expressed his concerns about that in the maybe-related
>>>>> https://issues.apache.org/jira/browse/CONNECTORS-1581 ticket)
>>>>>
>>>>> On the other hand, there were similar bugreports in the past
>>>>> concerning unexpected jobqueue status and there were resolved.
>>>>>
>>>>> Mayby there is a chance this can also be analyzed and fixed, but I do
>>>>> not really know where to start.
>>>>>
>>>>>
>>>>> Has anybody ideas how we can track this down / debug this to get to
>>>>> the bottom of the problem?
>>>>> What could be the first step in the analysis?
>>>>>
>>>>> Many thanks in advance
>>>>> Markus
>>>>>
>>>>> May be related bugreports:
>>>>>
>>>>>
>>>>>
>>>>> [0] https://issues.apache.org/jira/browse/CONNECTORS-1395
>>>>> [1] https://issues.apache.org/jira/browse/CONNECTORS-1180
>>>>> [2] https://issues.apache.org/jira/browse/CONNECTORS-590
>>>>> [3] https://issues.apache.org/jira/browse/CONNECTORS-246
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Markus Schuch
>>>>>
>>>>> Web Business (T.IPB 26)
>>>>>
>>>>>
>>>>>
>>>>> DB Systel GmbH
>>>>>
>>>>> Jürgen-Ponto-Platz 1, 60329 Frankfurt a. Main
>>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>>
>>>>> Pflichtangaben anzeigen
>>>>> <http://www.deutschebahn.com/pflichtangaben/20190211>
>>>>>
>>>>> Nähere Informationen zur Datenverarbeitung im DB-Konzern finden Sie
>>>>> hier: http://www.deutschebahn.com/de/konzern/datenschutz
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> Pflichtangaben anzeigen
>>>>> <http://www.deutschebahn.com/pflichtangaben/20190529>
>>>>>
>>>>> Nähere Informationen zur Datenverarbeitung im DB-Konzern finden Sie
>>>>> hier: http://www.deutschebahn.com/de/konzern/datenschutz
>>>>>
>>>>

Reply via email to