Hi List, Disabling the buggy TCP window size scaling in RHEL 5.4 (on the TSM server), appears to be an effective work around for this part of the problem. (The change can be made on the fly, and can be added to sysctl.conf so that it persists across reboots.)
(net.ipv4.tcp_window_scaling=0 is the flag.) The bug has been fixed in at least one level of RHEL 5.6, and we will be upgrading to that level, when the logistics can be worked out. Thanks, [RC] ----- Forwarded by Robert A Clark/OR/USB on 09/21/2011 08:36 AM ----- From: Robert A Clark/OR/USB To: "ADSM: Dist Stor Manager" <ADSM-L@VM.MARIST.EDU> Date: 09/15/2011 01:50 PM Subject: Re: [ADSM-L] The continuing saga of TSM 5.5.5.0 on RedHat 5.4 on x86_64. Hi Andrew, The start of this problem is obscure. We did have serious problems when Linux was upgraded from RHEL 5.3 => 5.4 on the TSM servers, requiring we set "TCPWINDOWsize 0" in dsmserv.opt on the Linux servers. Log pinning has been a problem for us for at least two or three months, and it has been getting worse lately. Starting at 02:25:53.536 this morning, this particular client (which demonstrates the problem pretty reliably) goes into "Producer is waiting on consumer". It stays this way until we used "dsmcutil stop /name:"TSM Scheduler" once we arrived at work. Somewhere in all the semantics between the TSM client and the TSM server, the little FSA state machine has entered the "I've fallen and I can't get up!" state. The client and server don't seem to do anything about this. Network capture during this same timeframe shows the client sends 39 bytes of data every 5 seconds which the server ACKS. This behavior repeats until the end of the network capture at ~04:00. Personally, I'm looking at " https://bugzilla.redhat.com/show_bug.cgi?id=627496", although its not clear whether this patch for kernel-2.6.18-194 would apply to RHEL 5.4 or not. (We're running 2.6.18-164.11.1.el5 at the moment.) Tonight, I'm changing from "TCPWINDOWSize 63" to "TCPWINDOWSize 0" in the dsm.opt on some of the nodes, to see if it helps the client recover from whatever is happening. 09/15/2011 02:25:52.942 [000736] [1880] : DccTaskStatus.cpp ( 818): DccTaskletMsg.msgID ==> 1 09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp ( 641): fifoQgetNextWaitNoTS(1794b24): cbCanProceed=True. 09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp ( 690): fifoQgetNextWaitNoTS(1794b24): Next object is available. 09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp ( 342): fifoQchangeWaitTime(1794b24): wait time 1644, change 0 09/15/2011 02:25:52.942 [000736] [1880] : circq.cpp ( 248): Q 17a3eb0, poped 3d9eef0, count 1, top index 434 - 1773b78, bot index 435 - 3e66328 09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp ( 745): fifoQgetNext(1794b24): Giving entry 3d9eef0, rc 0. 09/15/2011 02:25:52.942 [000736] [1880] : DccTaskStatus.cpp ( 818): DccTaskletMsg.msgID ==> 2 09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp ( 641): fifoQgetNextWaitNoTS(1794b24): cbCanProceed=True. 09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp ( 690): fifoQgetNextWaitNoTS(1794b24): Next object is available. 09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp ( 342): fifoQchangeWaitTime(1794b24): wait time 1644, change 0 09/15/2011 02:25:52.942 [000736] [1880] : circq.cpp ( 248): Q 17a3eb0, poped 1773b78, count 0, top index 435 - 3e66328, bot index 435 - 3e66328 09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp ( 735): fifoQgetNext(1794b24): Giving entry 1773b78, but next entry is NULL. 09/15/2011 02:25:52.942 [000736] [1880] : dsfifo.cpp ( 745): fifoQgetNext(1794b24): Giving entry 1773b78, rc 0. 09/15/2011 02:25:52.942 [000736] [1880] : DccTaskStatus.cpp ( 818): DccTaskletMsg.msgID ==> 1 B/A Performance thread 4================> 09/15/2011 02:25:53.536 [000736] [16312] : pkthread.cpp ( 678): After TimedWait rc = 964, milliSecondsLeft = 0, cbCanProceed = 0. B/A Txn Producer thread 2================> 09/15/2011 02:25:53.536 [000736] [14136] : dsfifo.cpp ( 431): fifoQinsert(3d531b0): Producer is done waiting, rc is 0 09/15/2011 02:25:53.536 [000736] [14136] : dsfifo.cpp ( 433): fifoQinsert() Thread 14136 returned from wait on queue 3D531B0 . 09/15/2011 02:25:53.536 [000736] [14136] : dsfifo.cpp ( 342): fifoQchangeWaitTime(3d531b0): wait time -19640, change -1000 09/15/2011 02:25:53.536 [000736] [14136] : dsfifo.cpp ( 420): fifoQinsert(3d531b0): Producer is waiting on consumer 09/15/2011 02:25:53.536 [000736] [14136] : dsfifo.cpp ( 422): fifoQinsert() Thread 14136 waiting on queue 3D531B0 . main thread 0================> 09/15/2011 02:25:53.942 [000736] [1880] : pkthread.cpp ( 678): After TimedWait rc = 964, milliSecondsLeft = 0, cbCanProceed = 0. B/A Performance thread 4================> 09/15/2011 02:25:54.536 [000736] [16312] : pkthread.cpp ( 678): After TimedWait rc = 964, milliSecondsLeft = 0, cbCanProceed = 0. B/A Txn Producer thread 2================> 09/15/2011 02:25:54.536 [000736] [14136] : dsfifo.cpp ( 431): fifoQinsert(3d531b0): Producer is done waiting, rc is 0 09/15/2011 02:25:54.536 [000736] [14136] : dsfifo.cpp ( 433): fifoQinsert() Thread 14136 returned from wait on queue 3D531B0 . 09/15/2011 02:25:54.536 [000736] [14136] : dsfifo.cpp ( 342): fifoQchangeWaitTime(3d531b0): wait time -20640, change -1000 09/15/2011 02:25:54.536 [000736] [14136] : dsfifo.cpp ( 420): fifoQinsert(3d531b0): Producer is waiting on consumer 09/15/2011 02:25:54.536 [000736] [14136] : dsfifo.cpp ( 422): fifoQinsert() Thread 14136 waiting on queue 3D531B0 . Thanks, [RC] From: Andrew Raibeck <stor...@us.ibm.com> To: ADSM-L@VM.MARIST.EDU Date: 09/14/2011 04:05 PM Subject: Re: [ADSM-L] The continuing saga of TSM 5.5.5.0 on RedHat 5.4 on x86_64. Sent by: "ADSM: Dist Stor Manager" <ADSM-L@VM.MARIST.EDU> Hi Robert, The RecvW state suggests that the TSM server is waiting on data from the TSM client, and isn't receiving it. Either the server sent something to the client and is awaiting a response but the client never got the information; or the client sent something back, but the server didn't get it. Off-hand this sounds like an underlying networking-related issue. I know you are a long-time TSM user, so I imagine that it hasn't been like this for years. :-) Do you have any history on when this problem started to manifest? And if so, what environmental changes might have coincided with that time? The first thing that came to mind for me is an issue with the Windows Scalable Network Pack (SNP) that is installed and activated when you install Windows 2003 SP2. See this document: https://www-304.ibm.com/support/docview.wss?uid=swg21460285 In addition, do you specify in dsm.opt the TCPWINDOWSIZE option? And if so, do you deviate from the default value of 63? If so, I recommend commenting it out, restarting the scheduler service, and see how that works. Best regards, Andy Raibeck IBM Software Group Tivoli Storage Manager Client Product Development Level 3 Team Lead Internal Notes e-mail: Andrew Raibeck/Hartford/IBM@IBMUS Internet e-mail: stor...@us.ibm.com IBM Tivoli Storage Manager support web page: http://www.ibm.com/support/entry/portal/Overview/Software/Tivoli/Tivoli_Storage_Manager "ADSM: Dist Stor Manager" <ADSM-L@vm.marist.edu> wrote on 2011-09-14 17:05:35: > From: Robert Clark <robert.cla...@usbank.com> > To: ADSM-L@vm.marist.edu > Date: 2011-09-14 17:07 > Subject: The continuing saga of TSM 5.5.5.0 on RedHat 5.4 on x86_64. > Sent by: "ADSM: Dist Stor Manager" <ADSM-L@vm.marist.edu> > > Hi List, > > We have a number of TSM clients (5.5.2.0 on mostly Windows 2003 SP2) that > stop making forward progress in the busy part of the night, and are just > sitting in RecvW when we check in the morning. > > These clients that aren't making progress appear to be involved as cause > or effect in the log pinning and getting to 80% full, and then dsmserv > starts killing the oldest session. (The cynic in me wonders if this pain > was added to get people to move from 5.5 to 6.x.) > > For the problem sessions below, there is only one session per node, and > based on the received bytes count, appear to correspond to the data > session and not the meta-data session. > > Stopping the backup on the client side 'dsmcutil q /name:"TSM Scheduler"', > and then running a manual "dsmc i" finishes a manual incremental ( > typically) in under 5 minutes. > > So far, I've upped the MAXNUMMP value for all the effected nodes from 2 to > 5, and have remove "queryschedperiod 12" and "quiet yes" from the client > option set. > > I have a PMR open with Tivoli support, but wanted to check with the list > to see if anyone has seen this particular weirdness before. > > > tsm: TSMXX15>q ses > > Sess Comm. Sess Wait Bytes Bytes Sess > Platform Client Name > Number Method State Time Sent Recvd Type > ------ ------ ------ ------ ------- ------- ----- > -------- -------------------- > > THESE SYSTEMS WERE EXPERIENCING THE PROBLEM ON THE MORNING THIS OUTPUT WAS > GATHERED: > > 56,511 Tcp/Ip RecvW 0 S 7.4 K 135.8 G Node > WinNT FRED > 56,844 Tcp/Ip RecvW 40.5 M 760 640.1 M Node > WinNT BARNEY > 58,047 Tcp/Ip RecvW 46.3 M 1.5 K 150.7 M Node > WinNT WILMA > 58,907 Tcp/Ip RecvW 9.1 M 4.7 K 56.0 G Node > WinNT DINO > 59,677 Tcp/Ip RecvW 33.1 M 736 946.8 M Node > WinNT BAMBAM > > THESE SYSTEMS WERE NOT EXPERIENCING THE PROBLEM ON THE MORNING THIS OUTPUT > WAS GATHERED: > > 64,606 Tcp/Ip RecvW 37.7 M 1.8 K 156.3 M Node > SQLLite- TRURL_SQL > Speed > 64,725 Tcp/Ip RecvW 10.0 M 2.5 K 2.2 G Node > SQLLite- TRURL_SQL > Speed > 64,759 Tcp/Ip IdleW 21 S 8.1 M 530 Node > AIX KLAPAUCIUS > 64,760 Tcp/Ip RecvW 0 S 352 1.1 G Node > AIX KLAPAUCIUS > > THE NODE NAMES IN THIS EMAIL WERE CHANGED TO CONFORM TO POLICY. > > Thanks, > [RC] > U.S. BANCORP made the following annotations > --------------------------------------------------------------------- > Electronic Privacy Notice. This e-mail, and any attachments, > contains information that is, or may be, covered by electronic > communications privacy laws, and is also confidential and > proprietary in nature. If you are not the intended recipient, please > be advised that you are legally prohibited from retaining, using, > copying, distributing, or otherwise disclosing this information in > any manner. Instead, please reply to the sender that you have > received this communication in error, and then immediately delete > it. Thank you in advance for your cooperation. > > > > --------------------------------------------------------------------- > U.S. BANCORP made the following annotations --------------------------------------------------------------------- Electronic Privacy Notice. This e-mail, and any attachments, contains information that is, or may be, covered by electronic communications privacy laws, and is also confidential and proprietary in nature. If you are not the intended recipient, please be advised that you are legally prohibited from retaining, using, copying, distributing, or otherwise disclosing this information in any manner. Instead, please reply to the sender that you have received this communication in error, and then immediately delete it. Thank you in advance for your cooperation. ---------------------------------------------------------------------