Hi,

We run a dev/alpha stack of our application in Azure Kubernetes. Persistent
storage is on Azure Files NAS volumes, one per server node.

We upgraded Kubernetes today (from 1.24.9 to 1.26.3). During the upgrade
various pods were stopped and restarted, as is normal. This included the
nodes running the dev/alpha stack.

At least one of the four server nodes in the cluster failed to restart
after the upgrade, with the following log output:

  2023-07-18 01:23:55.171 [1] INF    Restoring checkpoint after logical recovery, will start physical recovery from back pointer: WALPointer [idx=2431, fileOff=209031823, len=29]
  2023-07-18 01:23:55.205  [28] ERR    Failed to apply page delta. rec=[PagesListRemovePageRecord [rmvdPageId=0101000100000057, pageId=0101000100000004, grpId=-1476359018, super=PageDeltaRecord [grpId=-1476359018, pageId=0101000100000004, super=WALRecord [size=41, chainSize=0, pos=WALPointer [idx=2431, fileOff=209169155, len=41], type=PAGES_LIST_REMOVE_PAGE]]]]
  2023-07-18 01:23:55.217 [1] INF    Cleanup cache stores [total=0, left=0, cleanFiles=false]
  2023-07-18 01:23:55.218 [1] ERR    Got exception while starting (will rollback startup routine).
  2023-07-18 01:23:55.218 [1] ERR    Exception during start processors, node will be stopped and close connections

I know Apache Ignite is very good at surviving 'Big Red Switch' scenarios,
and we have our data regions configured with the strictest update protocol
(full sync after each write). However, it's possible that the Azure Files
NAS implementation does not honour those sync semantics the way a local
disk would.
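
One rough way to compare how the Azure Files mount and a local disk respond
to sync-after-write is to time fsync calls on each. The sketch below is only
that: the helper name fsync_latencies is hypothetical, the block size and
write count are arbitrary assumptions, and the timings can at best hint at a
difference. They cannot prove whether data actually reached durable storage.

```python
import os
import tempfile
import time


def fsync_latencies(directory, writes=50, payload=b"x" * 4096):
    """Write `writes` blocks to a temp file in `directory`, calling fsync
    after each one, and return the per-write latencies in seconds."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to flush this file to the backing store
            latencies.append(time.perf_counter() - start)
    finally:
        # Always clean up the probe file, even if a write fails.
        os.close(fd)
        os.remove(path)
    return latencies
```

Running it once against a directory on the Azure Files volume and once
against a local disk, then comparing the median latencies, would show
whether the two behave noticeably differently under per-write syncs.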

I think that if we delete the WAL files from the node that won't restart,
it may start cleanly, though we would lose any updates made since the last
checkpoint. The stack sees little use and checkpoints run every 30-45
seconds or so, so the loss should not be significant.
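
A more cautious variant of this would be to move the WAL files aside rather
than delete them, so they can be put back if the node still refuses to
start. A minimal Python sketch, assuming the node is stopped first; the
helper name move_wal_aside and any paths passed to it are hypothetical and
depend on how the WAL path is configured on each node:

```python
import shutil
import time
from pathlib import Path


def move_wal_aside(wal_dir, backup_root):
    """Move every entry in wal_dir into a new timestamped folder under
    backup_root and return that folder. Nothing is deleted."""
    wal_dir = Path(wal_dir)
    if not wal_dir.is_dir():
        raise FileNotFoundError(f"WAL directory not found: {wal_dir}")

    stamp = time.strftime("%Y%m%dT%H%M%S")
    backup_dir = Path(backup_root) / f"wal-backup-{stamp}"
    # Fail rather than mix files into an existing backup folder.
    backup_dir.mkdir(parents=True, exist_ok=False)

    for entry in sorted(wal_dir.iterdir()):
        shutil.move(str(entry), str(backup_dir / entry.name))
    return backup_dir
```

The backup folder should live outside the WAL directory itself. The sketch
only relocates files; whether the node then starts cleanly is exactly the
open question above.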

Is this an error anyone else has noticed?
Has anyone else had similar issues with Azure Files when using strict
update/sync semantics?

Thanks,
Raymond.

-- 
Raymond Wilson
Trimble Distinguished Engineer, Civil Construction Software (CCS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wil...@trimble.com