Kevin,

On the SQL server - is there multiple instances, or dedicated?
Single SQL instance dedicated to SCOM 2012.

Physical or virtual?
Physical.

How much memory and CPU resources are available to the OS?
32 GB RAM, 2 CPUs (Intel E5-2620).

Have you inspected Disk I/O on the SQL server? This is the most common cause of serious performance backlogs and issues.
I have looked at disk I/O on the SQL server, and it did not seem to be an issue when I checked right after the SCOM server was started up (but I have not been able to look at the exact time the 2115s are logged). I realized the SCOM Agent was not installed on the SQL server, so I have installed it to try to capture some performance data. I don't know when that data will start to be collected, though, because its State in the Windows Computers view is the open green circle icon ("Not monitored"). This is also something I noticed last week: four new servers on which I installed the SCOM agent were still in the Not monitored state two hours after I installed the Agent on them, because of whatever is going on...

Does the management server have enough memory?
The management server also has 32 GB RAM, of which only 9 GB is in use.

Performance requirements for SCOM 2012 in your sized environment should be very similar to SCOM 2007 R2.... Since you are seeing 2115's for BOTH databases - I'd suspect significant SQL performance issues, either not anywhere near enough memory, or most likely a disk I/O capability issue. What is the disk subsystem for the SQL database? What is the avg disk sec/read and avg disk sec/write during the time that 2115's begin?
1 RAID 10 virtual disk (using 8 physical disks), on which there are two logical disks: one for the OS and one for SQL Server. I restarted the SCOM server about 30 minutes ago, and there have not been any 2115s yet. I'll try to get the avg disk sec/read and avg disk sec/write during the time that 2115's begin, but wanted to reply back with this info for now...

Thankx,
Geoff

From: [email protected] [mailto:[email protected]] On Behalf Of Nelson, Geoffrey D
Sent: Monday, May 13, 2013 10:55 AM
To: [email protected]
Subject: [msmom] SCOM 2012 issues - not sure what is going on

I am having issues with the health of my new SCOM 2012 Management Group, and was hoping someone might have some ideas on what is going on. I am pretty new to SCOM. We are running SCOM on two physical servers: one Management Server and one Database server. We are in the process of migrating from SCOM 2007 to SCOM 2012. Everything had been working pretty well until last week, and we have a pretty small environment (only 222 Windows computers). The first symptom I noticed was that alerts were not being received or were delayed: we reseated a hard drive in a server and did not receive the alert that a disk had been removed until two hours later... At first I dismissed it as an issue with the particular Agent. Then SCOM alerting from 2012 just seemed to stop.

Looking in the Monitoring pane of the SCOM Console, under Operations Manager - Management Group Health, every component (for both Management Group Functions and Management Group Infrastructure) has a grey checkmark. Looking in Management Server - Management Server State, the State shows as critical (and is greyed out). Opening the alert view, I see "Event data collection process unable to store data in the Data Warehouse in a timely manner", "Performance data collection process unable to store data in the Data Warehouse in a timely manner", and "Object Health State data collection process unable to store data in the Data Warehouse in a timely manner." Looking at the Services in Server Manager, I verified the System Center Management service status was Paused.

The Operations Manager log has various warnings and errors, including a lot of Event ID 2115s, a few 5300s (Local health service is not healthy), and a few 26017s (The Windows Event Log Provider monitoring the Operations Manager Event Log is 88 minutes behind in processing events).

Most of what I found online suggested performance issues on the DB server, but CPU and RAM usage on the database server is pretty low. I have also read that running out of space in the database can cause 2115s, but that is not the case here (OperationsManager DB size is 24 GB with only 4 GB used; DW database size is 356 GB with only 22 GB used). I also checked the Disk Usage by Top Tables report and found that the table consuming the most disk space in the OperationsManager database was dbo.StateChangeEvent, which has 707,065 records in it. That seems like a lot for only 222 servers being monitored. The next tables using the most disk space are dbo.PerformanceData_21, ..._20, _18, _19, _17, _16, _15, and _24.

Among the things I have already tried:
- Removed the Management Packs I had imported over the past week.
- Deleted a few overrides I created that enabled alerts on a few monitors where they are disabled by default
- Cleared the Operations Manager log, since it was so far behind in processing events
- Stopped the System Center Management service on the Management Server and deleted the Health Service State folder, letting it rebuild itself
- Removed the Exchange 2010 Management Pack (I had imported this about two months ago and didn't have any issues, but since it is such a complex MP, I figured it was worth a try.)

Another test I did to check on alert delay was to shut down a test server to see how long it would take to get the Failed To Connect to Computer alert. I was able to see Event ID 20022 (entity not heartbeating) in the Operations Manager log a minute later, but it takes about an hour to get any kind of alert for this.

I just checked the Operations Manager log, looking at the time corresponding to when the last SCOM alerts were received by email, and saw the following:
- Event ID 11411: Alert subscription data source module encountered alert subscriptions that were waiting for a long time to receive an acknowledgement.
  Alert subscription ruleid, Alert subscription query low watermark, Alert subscription query high watermark: 5fcdbf15-4f5b-29db-ffdc-f2088a0f33b7,05/12/2013 01:53:15, 05/12/2013 02:25:15
- Event ID 22402: Forced to terminate the following PowerShell script because it ran past the configured timeout 300 seconds.
  Script Name: GetMGAlertsCount.ps1
  One or more workflows were affected by this.
  Workflow name: ManagementGroupCollectionAlertsCountRule
  Instance name: All Management Servers Resource Pool
  Instance ID: {4932D8F0-C8E2-2F4B-288E-3ED98A340B9F}
- Event ID 22402: Forced to terminate the following PowerShell script because it ran past the configured timeout 300 seconds.
  Script Name: AgentStateRollup.ps1
  One or more workflows were affected by this.
  Workflow name: ManagementGroupCollectionAgentHealthStatesRule
  Instance name: All Management Servers Resource Pool
  Instance ID: {4932D8F0-C8E2-2F4B-288E-3ED98A340B9F}
- A lot (287) of Event ID 2115s: A Bind Data Source in Management Group ITS-Systems has posted items to the workflow, but has not received a response in 1979 seconds. This indicates a performance or functional problem with the workflow.
  Workflow Id : Microsoft.SystemCenter.CollectSignatureData
  Instance : <SCOM Management Server Name was here>
  Instance Id : {0CE7C182-DC86-D1A2-79CB-DE3E4A745350}
- Several Event ID 8000s: A subscriber data source in management group <Management Group Name was here> has posted items to the workflow, but has not received a response in 6 minutes. Data will be queued to disk until a response has been received. This indicates a performance or functional problem with the workflow.
  Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData
  Instance : <SCOM Management Server Name was here>
  Instance Id : {0CE7C182-DC86-D1A2-79CB-DE3E4A745350}
- Event ID 10000: A scheduled discovery task was not started because the previous task for that discovery was still executing.
  Discovery name: Microsoft.SystemCenter.DiscoveryHealthServiceCommunication
  Instance name: <SCOM Management Server Name was here>
  Management group name: <Management Group Name was here>

The many Event ID 2115s referenced the following workflow IDs, repeating these same workflows over and over. Each entry just has a different number of seconds reported in "has not received a response in ### seconds", even though all the 2115s are stamped with the exact same time.
- Microsoft.SystemCenter.CollectSignatureData
- Microsoft.SystemCenter.CollectPublishedEntityState
- Microsoft.SystemCenter.CollectDiscoveryData
- Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange
- Microsoft.SystemCenter.DataWarehouse.CollectEventData
- Microsoft.SystemCenter.CollectAlerts
- Microsoft.SystemCenter.CollectEventData
- Microsoft.SystemCenter.CollectPerformanceData
- Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData
- Microsoft.SystemCenter.CollectSignatureData

Whenever I restart the Management Server, or restart the System Center Management service, I get a bunch of new/closed alerts in my inbox right away (and pretty much no alerts after that), and the flood of Event ID 2115s seems to come in within an hour. I'm thinking maybe there is a lot of activity cached somewhere that SCOM is trying to process, and maybe clearing that out will resolve things...

Does anyone have any ideas on what I can try to identify what exactly is causing this, or to resolve this issue?

Thankx,
Geoff
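
P.S. Since the 2115s repeat the same workflows with different wait times, it may help to tally them per workflow rather than read them one by one. Below is a rough Python sketch of that idea, assuming the event descriptions have already been exported from the Operations Manager log as plain strings (how they are exported is left open). The function names summarize_2115 and is_warehouse_workflow are made up for illustration; only the message wording and the workflow IDs come from the events quoted above.

```python
import re
from collections import defaultdict

# Patterns for the two fields of interest in an Event ID 2115 description,
# based on the message wording quoted in this thread.
SECONDS_RE = re.compile(r"has not received a response in (\d+) seconds")
WORKFLOW_RE = re.compile(r"Workflow Id\s*:\s*(\S+)")


def summarize_2115(descriptions):
    """Return {workflow_id: (event_count, longest_wait_in_seconds)}."""
    counts = defaultdict(int)
    worst = defaultdict(int)
    for text in descriptions:
        seconds = SECONDS_RE.search(text)
        workflow = WORKFLOW_RE.search(text)
        if not (seconds and workflow):
            continue  # not a 2115-style description, so skip it
        workflow_id = workflow.group(1)
        counts[workflow_id] += 1
        worst[workflow_id] = max(worst[workflow_id], int(seconds.group(1)))
    return {wf: (counts[wf], worst[wf]) for wf in counts}


def is_warehouse_workflow(workflow_id):
    """True when the workflow ID names a Data Warehouse collection workflow."""
    return ".DataWarehouse." in workflow_id


if __name__ == "__main__":
    # Hypothetical input: one exported event description per string.
    exported = [
        "A Bind Data Source in Management Group ITS-Systems has posted items "
        "to the workflow, but has not received a response in 1979 seconds. "
        "Workflow Id : Microsoft.SystemCenter.CollectSignatureData "
        "Instance : (server name)",
    ]
    for wf, (count, longest) in sorted(summarize_2115(exported).items()):
        target = "DW" if is_warehouse_workflow(wf) else "OpsDB"
        print(f"{target:5} {wf}: {count} event(s), longest wait {longest}s")
```

Splitting the totals between the Data Warehouse workflows and the rest should show whether writes to both databases are backing up or only to one of them, and whether the longest wait keeps growing after a restart.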
