Jira (PDB-4932) Sync summary queries can hang

2020-12-07 Thread zendesk.jira (Jira)

zendesk.jira updated an issue

PuppetDB / PDB-4932
Sync summary queries can hang

Change By: zendesk.jira
Zendesk Ticket Count: 2 (previously 1)
Zendesk Ticket IDs: 41188, 42072

This message was sent by Atlassian Jira (v8.5.2#805002-sha1:a66f935)

-- 
You received this message because you are subscribed to the Google Groups "Puppet Bugs" group.
To unsubscribe from this group and stop receiving emails from it, send an email to puppet-bugs+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-bugs/JIRA.375125.1602801356000.93159.1607403600091%40Atlassian.JIRA.


Jira (PDB-4932) Sync summary queries can hang

2020-12-07 Thread Zachary Kent (Jira)

Zachary Kent commented on PDB-4932

Re: Sync summary queries can hang
 
We're planning to add additional logging and a way to automatically capture stack traces from the sync thread if we're unable to interrupt sync. This should allow us to better diagnose where sync can hang and rule out possible issues caused by the recent sync interruption changes. For now I'm moving this ticket into the suspended column, but I will link it to the ticket Rob Browning is planning to create for the sync interrupt logging/stack trace capture work.
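
As an illustration of the stack-capture idea only: PuppetDB runs on the JVM, so this Python sketch is not its implementation. It waits on a worker thread with a timeout and, if the thread is still alive afterwards, records the thread's current stack. The names capture_stack, join_or_dump, and blocked_fetch are hypothetical.

```python
import sys
import threading
import traceback


def capture_stack(thread):
    """Return the current stack of `thread` as text, or None if it has no frame."""
    frame = sys._current_frames().get(thread.ident)
    if frame is None:
        return None
    return "".join(traceback.format_stack(frame))


def join_or_dump(thread, timeout):
    """Wait up to `timeout` seconds; if the thread is still running, return its stack."""
    thread.join(timeout)
    if thread.is_alive():
        return capture_stack(thread)
    return None


def demo():
    started = threading.Event()
    release = threading.Event()

    def blocked_fetch():
        # Stands in for a sync step that waits on a result that never arrives.
        started.set()
        release.wait()

    worker = threading.Thread(target=blocked_fetch, daemon=True)
    worker.start()
    started.wait()  # make sure the worker is inside blocked_fetch
    dump = join_or_dump(worker, timeout=0.1)
    release.set()  # let the worker finish so the demo exits cleanly
    worker.join()
    return dump


if __name__ == "__main__":
    print(demo())
```

The captured text names the function the worker is blocked in, which is the kind of information a thread dump provides when a sync cannot be interrupted.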
 

  
 
 
 
 

 
 
 

 
 

-- 
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-bugs/JIRA.375125.1602801356000.93065.1607388180035%40Atlassian.JIRA.


Jira (PDB-4932) Sync summary queries can hang

2020-12-02 Thread Zachary Kent (Jira)

Zachary Kent updated an issue

PuppetDB / PDB-4932
Sync summary queries can hang

Change By: Zachary Kent
Story Points: 8

-- 
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-bugs/JIRA.375125.1602801356000.89897.1606937280032%40Atlassian.JIRA.


Jira (PDB-4932) Sync summary queries can hang

2020-12-01 Thread Zachary Kent (Jira)

Zachary Kent commented on PDB-4932

Re: Sync summary queries can hang
 
A customer recently hit an issue where PDB sync appeared to get stuck in a way similar to what's described in this ticket. In the new case the problem presented differently, but we believe the difference is due to some of the mitigations recently added to address long-running PDB sync queries.

In the more recent case PDB was reloaded after the service was seen to be intermittently unreachable. After the reload, the startup sync completed successfully, but every subsequent periodic sync reported that it was unable to run, with the following log message:

    [sync] Refusing to sync from ... Sync already in progress

This was caused by the global currently-syncing atom being set to true and never being reset because a periodic sync got stuck. We were able to get a thread dump from PDB while it was in this state, and it confirmed some of our suspicions.

It appears that the shutdown of the at-at thread pool we use to schedule tasks was still waiting to gracefully shut down a periodic sync thread that got stuck at some point before the reload. Evidence of this can be seen in the thread dump. at-at uses a future to shut down and reset the thread pool we use to schedule sync (see the at-at stop-and-reset-pool! call). PDB calls this function when stopping the PDB TK sync service (seen here). Because this happens in a future in the at-at library, it doesn't block the caller, and it waits until all threads running in the pool finish the job they're working on. The elapsed time of this thread (121 hrs) roughly lines up with the time between the reload of the PuppetDB service and when the thread dump was taken afterwards, which is evidence that the at-at stop-and-reset-pool! call was stuck waiting on a thread in the pool to finish (link: at-at pool shutdown thread from dump).

We also see evidence of a stuck sync thread in the thread dump, where it looks like it's waiting to deref a promise in puppetlabs/clj-http-client. The elapsed time of this thread (485 hrs) indicates that a periodic sync got stuck sometime before PDB was reloaded and was never cleared during the reload. Previously, a periodic sync stuck like this would have left a long-running query held open, but because of the 2hr statement_timeout the customer had in place for the pe-puppetdb user, Postgres was able to continue functioning even though sync was stuck (link: stuck sync thread from dump).

iiuc, when TK services get a SIGHUP, the signal is intercepted and the stop/start methods of the service and its TK deps are called, but the JVM is not fully shut down. This could have caused the behaviour we noticed, where the global currently-syncing atom was never reset due to the hung sync and at-at shutdown, and as a result any periodic sync after the reload reported that another sync was already in progress. The full start/stop (SIGTERM) of the PDB service corrected this issue because it forced the stuck threads to shut down and reset the state in the
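
The "Sync already in progress" behaviour described in this comment can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not PuppetDB's Clojure code: SyncGuard stands in for the global currently-syncing atom, run_sync for a periodic sync attempt, and stuck_work for a sync that blocks on a result that never arrives. All of these names are hypothetical.

```python
import threading


class SyncGuard:
    """Hypothetical stand-in for a global currently-syncing flag."""

    def __init__(self):
        self._lock = threading.Lock()
        self._syncing = False

    def try_start(self):
        # Atomically claim the flag; refuse if another sync holds it.
        with self._lock:
            if self._syncing:
                return False
            self._syncing = True
            return True

    def finish(self):
        with self._lock:
            self._syncing = False


def run_sync(guard, work):
    """Run `work` under the guard; the flag is reset only when `work` returns."""
    if not guard.try_start():
        return "refused: sync already in progress"
    try:
        work()
        return "completed"
    finally:
        guard.finish()


def demo():
    guard = SyncGuard()
    started = threading.Event()
    release = threading.Event()
    first_result = []

    def stuck_work():
        # Stands in for a sync that blocks waiting on a remote response.
        started.set()
        release.wait()

    worker = threading.Thread(
        target=lambda: first_result.append(run_sync(guard, stuck_work)),
        daemon=True,
    )
    worker.start()
    started.wait()  # the first sync now holds the flag and is blocked

    second = run_sync(guard, lambda: None)  # periodic sync while the first is stuck
    release.set()  # unblock the first sync so the flag gets reset
    worker.join()
    third = run_sync(guard, lambda: None)  # periodic sync after the reset
    return [second, first_result[0], third]


if __name__ == "__main__":
    print(demo())
```

While the first sync is blocked, every later attempt is refused; once the blocked work returns and the flag is reset, syncs run again. In the sketch the block is released by hand, whereas in the case described above only a full restart cleared it.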

Jira (PDB-4932) Sync summary queries can hang

2020-12-01 Thread Zachary Kent (Jira)

Zachary Kent assigned an issue to Zachary Kent

PuppetDB / PDB-4932
Sync summary queries can hang

Change By: Zachary Kent
Assignee: Zachary Kent

-- 
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-bugs/JIRA.375125.1602801356000.88379.1606841640235%40Atlassian.JIRA.


Jira (PDB-4932) Sync summary queries can hang

2020-12-01 Thread Zachary Kent (Jira)

Zachary Kent updated an issue

PuppetDB / PDB-4932
Sync summary queries can hang

Change By: Zachary Kent
Epic Link: PDB-4969

-- 
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-bugs/JIRA.375125.1602801356000.88378.1606841640192%40Atlassian.JIRA.


Jira (PDB-4932) Sync summary queries can hang

2020-10-18 Thread zendesk.jira (Jira)

zendesk.jira updated an issue

PuppetDB / PDB-4932
Sync summary queries can hang

Change By: zendesk.jira
Labels: jira_escalated

-- 
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-bugs/JIRA.375125.1602801356000.57641.1603070280104%40Atlassian.JIRA.


Jira (PDB-4932) Sync summary queries can hang

2020-10-18 Thread zendesk.jira (Jira)

zendesk.jira updated an issue

PuppetDB / PDB-4932
Sync summary queries can hang

Change By: zendesk.jira
Zendesk Ticket Count: 1
Zendesk Ticket IDs: 41188

-- 
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-bugs/JIRA.375125.1602801356000.57640.1603070280048%40Atlassian.JIRA.


Jira (PDB-4932) Sync summary queries can hang

2020-10-15 Thread Zachary Kent (Jira)

Zachary Kent created an issue

PuppetDB / PDB-4932
Sync summary queries can hang

Issue Type: Bug
Assignee: Unassigned
Created: 2020/10/15 3:35 PM
Priority: Major
Reporter: Zachary Kent
 
There seem to be situations where PDB sync summary query transactions can remain open while PDB sync stops logging and hangs. When this happens sync will stop until PDB is restarted. It's also possible that running SELECT pg_cancel_backend(pid); against the query's backend will restore sync, but this is less certain to work than a full PDB restart.

We recently saw this issue when a PDB was in the middle of pulling reports from a replica and the replica was upgraded (link to related slack msgs). Order of events:

  primary started its report sync at: 2020-10-13T02:26:55.984Z
  replica received a shutdown signal at: 2020-10-13T02:27:23.876Z
  replica saw the errors in the following gist during shutdown: shutdown-error-gist

After the replica was shut down and upgraded, the sync on the primary never logged again, and an open sync summary query was observed in pg_stat_activity that stayed open, idle in transaction, waiting on ClientRead. We recently added a thread interrupter for sync in PDB-4909, but it seems there are still edge cases that this work didn't cover.

It's possible that adding a statement_timeout to sync queries would help avoid this issue.

We'll work towards reproducing this error in the coming days and will update the ticket with what we find.
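
As a rough sketch of how such sessions could be picked out, the Python function below filters rows shaped like pg_stat_activity (using its pid, state, xact_start, and query columns) for transactions that have been idle for longer than a threshold. The function name stuck_sessions and the sample rows are hypothetical, and fetching real rows from Postgres is left out.

```python
from datetime import datetime, timedelta, timezone


def stuck_sessions(rows, now, max_age):
    """Return pids of sessions idle in a transaction for longer than `max_age`."""
    return [
        row["pid"]
        for row in rows
        if row["state"] == "idle in transaction"
        and row["xact_start"] is not None
        and now - row["xact_start"] > max_age
    ]


# Hypothetical sample rows; the pids, times, and query text are made up.
NOW = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
SAMPLE_ROWS = [
    {"pid": 101, "state": "active",
     "xact_start": NOW - timedelta(seconds=5), "query": "SELECT 1"},
    {"pid": 102, "state": "idle in transaction",
     "xact_start": NOW - timedelta(hours=3), "query": "-- placeholder for a sync summary query"},
    {"pid": 103, "state": "idle in transaction",
     "xact_start": NOW - timedelta(minutes=1), "query": "SELECT 2"},
    {"pid": 104, "state": "idle", "xact_start": None, "query": ""},
]

if __name__ == "__main__":
    print(stuck_sessions(SAMPLE_ROWS, NOW, timedelta(hours=2)))
```

A pid found this way is what would be passed to pg_cancel_backend(pid). A statement_timeout, as suggested above, would instead have Postgres end over-long statements without manual intervention.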