Also, if snapshotting multiple filesets, it's important to group them into a single mmcrsnapshot command. That way you get a single quiesce instead of one per fileset.

I.e. do:

    snapname=$(date --utc +@GMT-%Y.%m.%d-%H.%M.%S)
    mmcrsnapshot gpfs0 fileset1:$snapname,fileset2:$snapname,fileset3:$snapname

instead of:

    mmcrsnapshot gpfs0 fileset1:$snapname
    mmcrsnapshot gpfs0 fileset2:$snapname
    mmcrsnapshot gpfs0 fileset3:$snapname

  -jf
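For a longer list of filesets, the same idea can be scripted. A minimal sketch, reusing the device and fileset names from the example above as placeholders (error handling around mmcrsnapshot is left out):

    #!/bin/bash
    # Snapshot a list of filesets with a single quiesce by passing all
    # fileset:snapshot pairs to one mmcrsnapshot invocation.
    fs=gpfs0                               # filesystem device (placeholder)
    filesets="fileset1 fileset2 fileset3"  # filesets to snapshot (placeholders)
    snapname=$(date --utc +@GMT-%Y.%m.%d-%H.%M.%S)

    # Build "fileset1:SNAP,fileset2:SNAP,..."
    list=""
    for f in $filesets; do
        list="${list:+$list,}$f:$snapname"
    done

    # One command, one quiesce, one snapshot per fileset.
    mmcrsnapshot "$fs" "$list"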
On Wed, Feb 2, 2022 at 12:07 PM Jordi Caubet Serrabou <jordi.cau...@es.ibm.com> wrote:

> Ivano,
>
> if it happens frequently, I would recommend opening a support case.
>
> The creation or deletion of a snapshot requires a quiesce of the nodes to
> obtain a consistent point-in-time image of the file system and/or update
> some internal structures, afaik. A quiesce is required for the nodes of
> the storage cluster but also for remote clusters. Quiesce means stopping
> activities (incl. I/O) for a short period of time to get such a consistent
> image, and also waiting to flush to disk any in-flight data that would
> otherwise prevent a consistent point-in-time image.
>
> Nodes receive a quiesce request and acknowledge when ready. When all nodes
> have acknowledged, the snapshot operation can proceed and I/O can resume
> immediately. It usually takes a few seconds at most and the operation
> itself is short, but the time I/O is stopped depends on how long it takes
> to quiesce the nodes. If some nodes take longer to agree to stop their
> activities, they delay the completion of the quiesce and keep I/O paused
> on the rest. There can be many reasons why some nodes are slow to
> acknowledge the quiesce.
>
> The larger the cluster, the more difficult it gets. The more network
> congestion or I/O load, the more difficult it gets. I recommend opening a
> ticket for support to try to identify which nodes do not acknowledge the
> quiesce and maybe find the root cause. If I recall a previous thread
> correctly, the default timeout was 60 seconds, which matches your log
> message. After such a timeout, the snapshot is considered to have failed
> to complete.
>
> Support might help you understand the root cause and provide some
> recommendations if it happens frequently.
>
> Best Regards,
> --
> Jordi Caubet Serrabou
> IBM Storage Client Technical Specialist (IBM Spain)
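To get a first idea of which nodes are slow to acknowledge the quiesce, the cluster-wide waiters can be captured while a snapshot command is hanging. A rough sketch, assuming mmdsh works between your nodes and that the wait time sits in the third field of the mmdiag --waiters output on your release (adjust otherwise); remote client clusters take part in the quiesce too and would need the same check from one of their own nodes:

    #!/bin/bash
    # Capture the current waiters from every node of the local cluster while
    # a snapshot create/delete is hanging, so that nodes which are slow to
    # acknowledge the quiesce stand out.
    # Assumes the GPFS admin commands live under /usr/lpp/mmfs/bin and that
    # mmdsh (i.e. passwordless ssh between the nodes) works.
    export PATH="$PATH:/usr/lpp/mmfs/bin"

    out="/tmp/waiters-$(date --utc +%Y%m%d-%H%M%S).txt"

    # '-N all' runs the command on every node of the local cluster.
    mmdsh -N all "mmdiag --waiters" > "$out" 2>&1
    echo "Waiters captured in $out"

    # Longest waiters first; the field holding the wait time may differ
    # between releases, so adjust the sort key if needed.
    grep -i waiting "$out" | sort -k3 -rn | head -20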
>
> ----- Original message -----
> From: "Talamo Ivano Giuseppe (PSI)" <ivano.tal...@psi.ch>
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
> To: "gpfsug main discussion list" <gpfsug-discuss@spectrumscale.org>
> Cc:
> Subject: [EXTERNAL] Re: [gpfsug-discuss] snapshots causing filesystem quiesce
> Date: Wed, Feb 2, 2022 11:45 AM
>
> Hello Andrew,
>
> Thanks for your questions.
>
> We're not experiencing any other issue/slowness during normal activity.
> The storage is a Lenovo DSS appliance with a dedicated SSD enclosure/pool
> for metadata only.
>
> The two NSD servers have 750 GB of RAM, and 618 GB are configured as
> pagepool.
>
> The issue we see is happening on both of the filesystems we have:
>
> - perf filesystem:
>   - 1.8 PB size (71% in use)
>   - 570 million inodes (24% in use)
>
> - tiered filesystem:
>   - 400 TB size (34% in use)
>   - 230 million files (60% in use)
>
> Cheers,
> Ivano
>
> __________________________________________
> Paul Scherrer Institut
> Ivano Talamo
> WHGA/038
> Forschungsstrasse 111
> 5232 Villigen PSI
> Schweiz
>
> Telefon: +41 56 310 47 11
> E-Mail: ivano.tal...@psi.ch
>
> ------------------------------
> From: gpfsug-discuss-boun...@spectrumscale.org
> <gpfsug-discuss-boun...@spectrumscale.org> on behalf of Andrew Beattie
> <abeat...@au1.ibm.com>
> Sent: Wednesday, February 2, 2022 10:33 AM
> To: gpfsug main discussion list
> Subject: Re: [gpfsug-discuss] snapshots causing filesystem quiesce
>
> Ivano,
>
> How big is the filesystem in terms of number of files?
> How big is the filesystem in terms of capacity?
> Is the metadata on flash or spinning disk?
> Do you see issues when users do an ls of the filesystem, or only when you
> are doing snapshots?
>
> How much memory do the NSD servers have?
> How much is allocated to the OS / Spectrum Scale pagepool?
>
> Regards
>
> Andrew Beattie
> Technical Specialist - Storage for Big Data & AI
> IBM Technology Group
> IBM Australia & New Zealand
> P. +61 421 337 927
> E. abeat...@au1.ibm.com
>
> On 2 Feb 2022, at 19:14, Talamo Ivano Giuseppe (PSI) <ivano.tal...@psi.ch> wrote:
>
> Dear all,
>
> For a while now we have been experiencing an issue when dealing with
> snapshots. Basically what happens is that when deleting a fileset snapshot
> (and maybe also when creating new ones) the filesystem becomes
> inaccessible on the clients for the duration of the operation (which can
> take a few minutes).
>
> The clients and the storage are on two different clusters, using a remote
> cluster mount for access.
>
> On the log files many lines like the following appear (on both clusters):
> Snapshot whole quiesce of SG perf from xbldssio1 on this node lasted 60166 msec
>
> By looking around I see we're not the first ones. I am wondering if that's
> considered an unavoidable part of snapshotting and if there's any tunable
> that can improve the situation, since when this occurs all the clients are
> stuck and users are very quick to complain.
>
> If it can help, the clients are running GPFS 5.1.2-1 while the storage
> cluster is on 5.1.1-0.
>
> Thanks,
> Ivano
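To quantify how often these stalls happen and how long they last, the "Snapshot whole quiesce ... lasted N msec" messages quoted above can be summarized from the GPFS log. A small sketch, assuming the default log location /var/adm/ras/mmfs.log.latest (adjust the path, and include rotated logs, if needed):

    #!/bin/bash
    # Summarize 'Snapshot whole quiesce ... lasted N msec' events from the
    # local GPFS log: how many there were, the longest one, and the average.
    log=/var/adm/ras/mmfs.log.latest   # default location; adjust if different

    grep "Snapshot whole quiesce" "$log" |
      awk '{ for (i = 1; i < NF; i++) if ($i == "lasted") print $(i + 1) }' |
      sort -rn |
      awk 'NR == 1 { max = $1 } { n++; sum += $1 }
           END { if (n) printf "events=%d  max=%d msec  avg=%d msec\n", n, max, sum / n }'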
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss