Dear Chavdar, Michael and others,

Just an update on the issue I raised a few days ago concerning the failed backups.
Our ISP server uses a separate Windows server holding about 100 TB of containers, spread over 28 disks. Backups started crashing about a week ago: at some point the system would get stuck for five minutes or so.

Of course, a lot of different things happened at the same time:
- The problem started after Windows Server updates, which forced us to reboot most of our systems.
- One user dumped about a terabyte of mostly small files on our system.
- Our Spectrum Protect system manager was on holiday.
- As always, there are the other usual suspects: antivirus etc.

Our container server runs on VMware ESXi infrastructure. We opened a call with VMware and sent them the logs of the ESXi server. They found a very simple cause of the problem: disks had filled up, and the system froze.

When checking the logs, I found that the backup containers opened in write mode were on disks without any space left, while other disks were less than half full.

So here is my solution: set the container directories that are full to read-only, move containers, and wait until the containers are deleted.

My question is: why is this process not managed automatically by ISP? Why are disks with a lot of free space not prioritized for writing?

Thanks for your help!

David de Leeuw

-----Original Message-----
From: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU> On Behalf Of Chavdar Cholev
Sent: Monday, August 21, 2023 6:11 PM
To: ADSM-L@VM.MARIST.EDU
Subject: Re: [ADSM-L] INCR backups fail ! TSM 8.1.17 Windows Server and client

Hi David,
Just make sure that containers are excluded from anti-virus scan.

On Sunday, August 20, 2023, David L.A. De Leeuw <da...@bgu.ac.il> wrote:

> Hi all,
>
> Apparently, this has nothing to do with SP at all !
>
> The (Windows Server 2019 on ESXi) system holding the containers just disconnects for 5 minutes !
>
> No pings to the server.
>
> When access is restored, later on, a message appears in the events:
> "The system time has changed to 2023-08-20T19:05:05 from 2023-08-20T19:01:04"
> This is not even a warning, just "information".
>
> I have no idea why this should happen, but we will find it.
> Thanks for your support !
>
> David
>
> -----Original Message-----
> From: דוד דה ליאו
> Sent: Sunday, August 20, 2023 9:37 PM
> To: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU>
> Subject: RE: [ADSM-L] INCR backups fail ! TSM 8.1.17 Windows Server and client
>
> Hi Michael,
>
> Thanks a lot.
>
> The SP Server is not on VM, just the storage. I am not the manager of the server.
> We just get a lot of backup storage if we provide the space for the containers.
>
> Sure, we run a lot of sessions in parallel, as you said. I will try a run according to your recommendations.
> One other thought I am testing is that over a year ago we also had crashes. The 10 Gb optical network had hiccups. Our 1 Gb line worked fine.
> I just switched back to the 1 Gb line and will see what happens.
>
> Will keep you posted !
>
> David
>
> -----Original Message-----
> From: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU> On Behalf Of Michael Prix
> Sent: Sunday, August 20, 2023 9:04 PM
> To: ADSM-L@VM.MARIST.EDU
> Subject: Re: [ADSM-L] INCR backups fail ! TSM 8.1.17 Windows Server and client
>
> Hello David,
>
> An *SP-Server in a VM is not the best setup, but nevertheless it should work, and has proven so in the past.
>
> For the client: please show the dsm.opt. I suspect you are running several sessions from this client in parallel during a backup -> stop it for the moment.
> Start with a basic dsm.opt, disable the option "resourceutilization", if set, and set "memoryefficient yes" (or "diskcachem" if you like).
> If it still crashes with a plain dsm.opt, you should open a ticket with IBM.
>
> --
> Michael Prix
>
> August 20, 2023 at 7:25 PM, "David L.A. De Leeuw" <da...@bgu.ac.il> wrote:
>
> > Hi Chavdar and Michael,
> >
> > Thanks for your thoughts and help.
> >
> > I added "memoryefficientbackup".
> >
> > But still the sessions keep crashing. Once the session crashes, I get a whole lot of errors for storage pool directories, and in fact the whole pool becomes unavailable.
> > I run "update stgpooldir ... access=readwrite" and all is accessible again.
> > Some of the containers are in unavailable state and need an audit.
> >
> > Our container storage is on a Dell PowerEdge R730xd, with 24 CPUs allocated, 64 GB memory, 110 TB disk. The disks are declared as VMDKs. Network is on a 10Gb Intel 82588 card.
> > Nothing I can see points to a lack of resources.
> >
> > Everything worked fine till 4 days ago. That is why I thought of a problem with Windows updates, but as I rolled them back, that does not make sense.
> >
> > I am quite at a loss where to look next ...
> >
> > Thanks
> >
> > David
> >
> > [Server side]
> >
> > 20-08-2023 19:47:22 ANR0839I Session 197902 started for node MEDFS2 (WinNT) (SSL medspice.bgu.ac.il[132.72.73.246]:53184) on STOREWARE13.auth.ad.bgu.ac.il:1502. (SESSION: 197902)
> > 20-08-2023 19:47:26 ANR8592I Session 197903 connection is using protocol TLSV13, cipher specification TLS_AES_256_GCM_SHA384, certificate TSM Self-Signed Certificate. (SESSION: 197903)
> > 20-08-2023 19:47:26 ANR0839I Session 197903 started for node MEDFS2 (WinNT) (SSL medspice.bgu.ac.il[132.72.73.246]:53185) on STOREWARE13.auth.ad.bgu.ac.il:1502. (SESSION: 197903)
> > 20-08-2023 19:47:55 ANR2012W Error encountered for storage pool directory: \\medbackup.med.ad.bgu.ac.il\tsmc20 in storage pool: CPOOL. (SESSION: 197881)
> > 20-08-2023 19:47:55 ANR1181E sdtxn.c(1404): Data storage transaction 0:83236375 was aborted. (SESSION: 197881)
> > 20-08-2023 19:47:55 ANR0204I The container state for \\medbackup.med.ad.bgu.ac.il\tsmc17\18\0000000000001853.ncf is updated from AVAILABLE to UNAVAILABLE. (SESSION: 197883)
> > 20-08-2023 19:47:55 ANR3660E An unexpected error occurred while opening or writing to the container. Container \\medbackup.med.ad.bgu.ac.il\tsmc17\18\0000000000001853.ncf in stgpool CPOOL has been marked as UNAVAILABLE and should be audited to validate accessibility and content. (SESSION: 197883)
> >
> > [From the client side]
> >
> > During the incr of a large filespace:
> >
> > Normal File--> 7.132.827 \\medfs2\e$\medusers14\angel\17.8.23 BU - E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx ** Unsuccessful **
> > ANS1228E Sending of object '\\medfs2\e$\medusers14\angel\17.8.23 BU - E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx' failed.
> > ANS1311E Server out of data storage space
> >
> > [I ran sel of the latest file. It failed because all containerdirs were unavailable.]
> >
> > ANS1804E Selective Backup processing of '\\medfs2\e$\medusers14\angel\17.8.23 BU - E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx' finished with failures.
> >
> > Total number of objects inspected: 1
> > Total number of objects backed up: 0
> > Total number of objects updated: 0
> > Total number of objects rebound: 0
> > Total number of objects deleted: 0
> > Total number of objects expired: 0
> > Total number of objects failed: 1
> > ...
> > Network data transfer rate: 148.306,35 KB/sec
> > Aggregate data transfer rate: 211,50 KB/sec
> > Objects compressed by: 0%
> > Total data reduction ratio: 0.23%
> > Subfile objects reduced by: 0%
> > Elapsed processing time: 00:00:32
> > ANS1311E Server out of data storage space
> >
> > [Then I updated the containerdirs to readwrite and ran the selective backup. No problem.]
> >
> > ------------------------------------------------------------
> > Protect> sel '\\medfs2\e$\medusers14\angel\17.8.23 BU - E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx'
> > Selective Backup function invoked.
> >
> > Normal File--> 7.132.827 \\medfs2\e$\medusers14\angel\17.8.23 BU - E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx [Sent]
> > Selective Backup processing of '\\medfs2\e$\medusers14\angel\17.8.23 BU - E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx' finished without failure.
> >
> > -----Original Message-----
> > From: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU> On Behalf Of Chavdar Cholev
> > Sent: Sunday, August 20, 2023 3:43 PM
> > To: ADSM-L@VM.MARIST.EDU
> > Subject: Re: [ADSM-L] INCR backups fail ! TSM 8.1.17 Windows Server and client
> >
> > Just to make sure that we are on the same page...
> > You have TSM installed on a VM running on VMware. This VM has a few LUNs presented, and those LUNs are used for containers?
> >
> > Shot in the dark:
> > 1. Check whether the VM resources match the IBM TSM Blueprint.
> > 2. Check the LUN/HDD response time in perf. monitor. The response time should be around 20-30 ms during the backup operation.
> > 3. Do you know if those HDDs for the LUNs are .vmdk or RDM (raw device mapping)?
> >
> > Thank you!
> > Chavdar
> >
> > On Saturday, August 19, 2023, David L.A. De Leeuw <da...@bgu.ac.il> wrote:
> >
> > > Hi TSM experts,
> > >
> > > Our incr backup has been failing consistently for the last few days. It starts all right, but after a few gigabytes we get this error on the client:
> > >
> > > ANS1301E This operation cannot continue due to an error on the IBM Spectrum Protect server. See your IBM Spectrum Protect server administrator for assistance.
> > >
> > > On the server side we see:
> > >
> > > 18-08-2023 22:57:25 ANR2012W Error encountered for storage pool directory: \\medbackup.med.ad.bgu.ac.il\tsmc1 in storage pool: CPOOL. (SESSION: 194578)
> > > 18-08-2023 22:57:25 ANR0530W Transaction failed for session 194578 for node MEDFS2 (WinNT) - internal server error detected. (SESSION: 194578)
> > > 18-08-2023 22:57:26 ANR2012W Error encountered for storage pool directory: \\medbackup.med.ad.bgu.ac.il\tsmc1 in storage pool: CPOOL. (SESSION: 194578)
> > >
> > > Then we find one or more containers unavailable. We fix the containers with "audit container ... action=scanall".
> > > No errors are found. But the next backup will fail again.
> > >
> > > The server is on 8.1.17, the client as well.
> > > The containers are on a number of disks on a shared Windows Server 2019.
> > > There have been some updates on the Windows server recently (KB5029247, KB5029647).
> > >
> > > The audits are fine, data is accessible, but backups fail.
> > > Any ideas ?
> > >
> > > David de Leeuw
> > > Ben-Gurion University of the Negev
> > > Beer Sheva, Israel
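
For anyone who lands on this thread with the same symptom, the workaround from the top message (find the container directories whose disks are full and set them to read-only) could be sketched roughly as below. This is only an illustration, not an IBM tool: the function names, the 5% free-space threshold and the sample free-space numbers are all made up for the example, and the printed command simply mirrors the "update stgpooldir ... access=readwrite" form quoted in the thread with access=readonly swapped in, so please check it against your own server's command help before using it.

```python
import shutil
from typing import Dict, List, Tuple


def free_fraction(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of free space on the volume holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def split_by_free_space(
    free_by_dir: Dict[str, float], min_free: float = 0.05
) -> Tuple[List[str], List[str]]:
    """Split directories into (full, ok) using a hypothetical free-space threshold."""
    full = sorted(d for d, f in free_by_dir.items() if f < min_free)
    ok = sorted(d for d, f in free_by_dir.items() if f >= min_free)
    return full, ok


def readonly_commands(pool: str, full_dirs: List[str]) -> List[str]:
    """Build admin commands to review by hand; nothing is executed here.

    Assumption: the command form follows the one quoted in the thread,
    with access=readonly instead of access=readwrite.
    """
    return [f"update stgpooldir {pool} {d} access=readonly" for d in full_dirs]


# Made-up free-space fractions for three directories named in the logs above.
# On the container server itself you would fill this in with free_fraction(path).
sample = {
    r"\\medbackup.med.ad.bgu.ac.il\tsmc17": 0.00,
    r"\\medbackup.med.ad.bgu.ac.il\tsmc20": 0.01,
    r"\\medbackup.med.ad.bgu.ac.il\tsmc1": 0.55,
}

full, ok = split_by_free_space(sample)
for cmd in readonly_commands("CPOOL", full):
    print(cmd)
```

The script only prints candidate commands so that an administrator can review them first; it deliberately does not talk to the server.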