Here is a twist! Today I was connected to the console of the file server at the very moment the problem occurred. The problem seems to be the drive array, as the System volume responded just fine during the outage, but the internal RAID 5 drive array went to a non-responding state for FOUR MINUTES!
I have opened a ticket with Dell, as it's a Dell PowerEdge 2950 server which is fully under warranty. The tech who answered did not see anything wrong in the DSET report and has escalated the issue to a supervisor. So I think our network guys are right: it's not a network issue, it's inside the box. This is a fairly new server that runs as a file server only, with no other roles installed, so it 'should' be fairly easy to diagnose.

At the time of the problem, every Windows Explorer window showing anything on the RAID 5 array goes dormant, with "Not Responding" at the top. Any Windows Explorer window displaying something on the system volume responds as normal; I am able to open and close files, modify and save them, etc. The taskbar also goes dormant and does not respond to any clicking. When the server returned to normal it very quickly processed all the clicks I had made to switch windows, flashing them on the screen as though it had been storing my mouse clicks.

The event logs don't record anything during or after the problem. The next entries in the Application, Security, and System logs are well after it started to respond and have nothing to do with 'anything'.

So now I await a return call from Dell. Thought I'd provide a follow-up since several of you have sent me messages on what to look for!

Thanks again!

-----Original Message-----
From: Kim Longenbaugh [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 22, 2008 3:49 PM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

Have the network guys look at the flow-control settings on your switches. If flow control is on (as it should be in most cases), ports may be getting overwhelmed with traffic, resulting in pause frames. Flow control pausing a connection will not result in TCP retransmits. Also, some switches may run out of buffer for the paused frames, although that condition would cause you to start seeing TCP retransmits.
Some switches allow broadcast and unicast throttling. If they're turned on, they may be shutting down connections until the traffic goes below the thresholds again.

An obvious thing to check is the speed/duplex settings. If there's a mismatch, the resulting degradation may only become noticeable under heavy traffic loads.

Can you identify the source and destination for the SMB traffic? If so, you could try to find what's causing it.

-----Original Message-----
From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 22, 2008 2:16 PM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

This just gets more fun...

Our network team came out to our building to perform an on-site network sniff. There are no TCP retries, so there are no lost packets. Follow that with the statement: "There is a lot of SMB traffic, and SMB wouldn't attempt a resend, so there might be some lost network packets." He has taken the network capture away to research the SMB traffic.

In the meantime, we find that some machines drop their connection at the same time that other machines don't. We have a test script running on several machines that appends to a text file every fifteen seconds and records failures.

-----Original Message-----
From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 17, 2008 8:24 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

When we ping the file server and any server in the same network, a 'normal' reply would be either =1 ms or =2 ms. At the time of these problems we are getting well over 100 ms for approximately two minutes!

Our network department has looked at Wireshark traces from both workstation and server and has merely pointed out that there is SMB traffic happening at the time of the problem. (I would think that to be rather 'normal' when you run an application from a file share.) I asked why they brought it up and whether it is unusual; they said that they did not know and would need to do more research.
So now we are waiting on them to review more log files.

-----Original Message-----
From: Terry Dickson [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 2:45 PM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

So have you tried something simple like a ping to that server to see if the pings time out, or are slower at the time of the slowdowns? It just might help to figure out if it is network related or not.

-----Original Message-----
From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 1:34 PM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

We will "un-team" in the next couple of days as a test; but keep in mind the SQL Server is teamed using the same NICs with no issues, which is why teaming hasn't been suspect yet. I'm going to look into the firmware tomorrow morning when we have scheduled downtime; thanks for mentioning it.

As for software firewalls: we normally run the Windows Firewall, but turned it off for testing with no change.

The problem occurred again today at 1:15 PM. It seems that Windows Explorer 'freezes' on almost all domain computers and no one can access their file shares for a few seconds, until a reconnect can be established. One diagnostic script we have running appends to a text file on the server every 15 seconds, and during the outage it could not append for a full five minutes!

Network ports are not ours to swap; they belong to our network team. Once they give the word we could try that. There are hardware firewalls at play as well; the firewall team is looking into those to determine possible issues with load balancing, etc.

Thanks for your suggestions!

-----Original Message-----
From: Miller Bonnie L. [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 1:42 PM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

Hmm.. sounds like it's already been set then, but I don't know, as I've always done both the reg entry and the RSS setting on the Bcom NIC itself.
We also are not using teaming at the moment, so I don't know if that might have a separate issue.

Just re-read your post. I see you mentioned all drivers updated, but how about firmware? Are you able to swap the network port the file server is using with that of the SQL server that works? What else is running on your file servers that is the same across both--any software firewalls?

-----Original Message-----
From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 8:23 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

All the registry entries are as you have them... Although, my "Broadcom BCM5708C NetXtreme II GigE" cards had 'Receive Side Scaling' set to 'Enable'. I changed them to 'Disable'. Each card disabled for a moment, then automatically re-enabled, so I assume this does not need a restart.

These servers have teamed NICs; all our servers do. The BACS (Broadcom Advanced Control Suite) is set up for switch failover, as each NIC is physically plugged into a different switch.

-----Original Message-----
From: Miller Bonnie L. [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 10:29 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

They're in the same area of the registry--my .reg file that I import looks like this:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"EnableTCPA"=dword:00000000
"EnableRSS"=dword:00000000
"EnableTCPChimney"=dword:00000000

Also, on the Broadcom NIC(s) properties, look at the Advanced tab. Make sure "Receive Side Scaling" is set to Disable. I haven't done the netsh method, but I understand that can change it w/out needing a server reboot.

-Bonnie

-----Original Message-----
From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 7:23 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

Thanks Bonnie! The TCP Chimney options are off!
(I had to look, at HKLM\System\CurrentControlSet\Services\Tcpip\Parameters\EnableTCPChimney = 0. I've never configured them either way!)

The SNP I don't know how to check. I see where I can use netsh to set it to disabled, but how would I see its current state?

-----Original Message-----
From: Miller Bonnie L. [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 8:56 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

Any kind of backup or snapshot taking place at those times? Although I can't say this would happen like clockwork, have you already disabled the Chimney/SNP network options on those servers?

-Bonnie

From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 5:51 AM
To: NT System Admin Issues
Subject: Disconnected on a schedule???

We have workstations that appear to be losing connection to the file share on the server at almost precise times, every six hours: 7 AM, 1 PM, 7 PM, 1 AM; repeat. The event logs on the workstations and servers (domain controllers and the file share server) are clean, so I assume the loss is not long enough for the OS to recognize it. However, we have a custom application running on many machines that can't seem to handle the brief outage and fails like clockwork. The application vendor tells us it has a sixty-second timeout before it will fail; certainly long enough to handle any brief disconnect.

Network traces (using Wireshark) from the server to workstation and workstation to server do not show any sign of failure. A script that updates a text file on the server every fifteen seconds does show the failure: it fails to update the text file on the server for up to four _minutes_ at a time! It is able to update once or twice during that four-minute period, though, so it's not a total blackout.

Workstations map a drive to the file share using a DFS path, i.e. \\domain\share.
So we tested a direct mapping using \\server\share, and we get the same result. We mapped drives to two different file servers; each file server is in a different building at opposite ends of campus. The workstations used four test drive mappings: one DFS and one direct mapping for each server. All four drive mappings failed at the same time.

The connection to the SQL server is never lost. The SQL server is plugged into the same network switch as the file server.

The Windows domain has no trusts; it's a single-domain forest. There are no services on any server with a six-hour schedule that we know of. Backup runs daily at midnight and completes prior to 7 AM. Virus scan is still running at the 7 AM hour, but is long since complete by the 1 PM hour.

Both file servers are Dell PE 2950s running Windows Server 2003 R2; all drivers seem up to date per Dell's support site. Workstations are a variety of makes, running Windows XP Pro SP2, Windows XP Pro SP3, or Windows Vista SP1, and are scattered all over campus on different network subnets.

Our network department is telling us that the network is fine; it's either a workstation or a server issue.

Anyone seen this type of thing before???

Thanks!

~ Upgrade to Next Generation Antispam/Antivirus with Ninja! ~
~ <http://www.sunbelt-software.com/SunbeltMessagingNinja.cfm> ~
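For anyone who wants to reproduce the fifteen-second append test mentioned several times in this thread, here is a rough sketch of what such a probe might look like. It is not the script that was actually used (that was never posted); the function names, the local failure log, and the idea of timing each append are all assumptions made for illustration. The timing is there because a hung array may make the write stall for minutes rather than fail outright, in which case the elapsed seconds are more telling than the pass/fail flag.

```python
import os
import tempfile
import time
from datetime import datetime


def append_heartbeat(target_path, stamp):
    """Try to append one timestamp line to target_path.

    Returns (succeeded, elapsed_seconds). A stalled share shows up as a
    large elapsed time even when the write eventually succeeds.
    """
    start = time.monotonic()
    try:
        with open(target_path, "a", encoding="utf-8") as fh:
            fh.write(stamp + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # ask the OS to push the write to storage
        ok = True
    except OSError:
        ok = False
    return ok, time.monotonic() - start


def run_probe(target_path, failure_log, interval=15.0, count=4, sleep=time.sleep):
    """Append a heartbeat `count` times, pausing `interval` seconds between tries.

    Failures are written to failure_log, which should sit on a local disk so
    it stays writable while the share is unreachable.
    Returns a list of (timestamp, succeeded, elapsed_seconds) tuples.
    """
    results = []
    for i in range(count):
        stamp = datetime.now().isoformat(timespec="seconds")
        ok, elapsed = append_heartbeat(target_path, stamp)
        results.append((stamp, ok, elapsed))
        if not ok:
            with open(failure_log, "a", encoding="utf-8") as log:
                log.write(stamp + " append failed: " + target_path + "\n")
        if i < count - 1:
            sleep(interval)
    return results


if __name__ == "__main__":
    # Demo against a temporary directory. On a workstation the target would be
    # a file on the mapped share (hypothetical path, e.g. a file under
    # \\server\share), and the failure log would stay on the local disk.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "heartbeat.txt")
        log = os.path.join(tmp, "failures.log")
        for stamp, ok, elapsed in run_probe(target, log, interval=0.1, count=3):
            print(stamp, "ok" if ok else "FAILED", round(elapsed, 3))
```

Running one copy against a file on the RAID 5 volume and another against a file on the system volume of the same server would be one way to confirm the console observation that only the array stalls.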