I am having issues with the health of my new SCOM 2012 Management Group, and
was hoping someone might have some ideas on what is going on.
I am pretty new to SCOM. We are running SCOM on two physical servers: one
Management Server and one Database server.
We are in the process of migrating from SCOM 2007 to SCOM 2012. Everything had
been working pretty well until last week, and we have a pretty small
environment (only 222 Windows computers). The first symptom I noticed was that
alerts were not being received or were delayed: we reseated a hard drive in a
server and did not receive the alert that a disk had been removed until two
hours later... At first I dismissed it as an issue with that particular Agent.
Then alerting from SCOM 2012 just seemed to stop. In the Monitoring pane of
the SCOM Console, under Operations Manager - Management Group Health, every
component (for both Management Group Functions and Management Group
Infrastructure) has a grey checkmark. Under Management Server - Management
Server State, the State shows as Critical (and is greyed out). Opening the
alert view, I see "Event data collection process unable to store data in the
Data Warehouse in a timely manner", "Performance data collection process
unable to store data in the Data Warehouse in a timely manner", and "Object
Health State data collection process unable to store data in the Data Warehouse
in a timely manner." In the Services list in Server Manager, I verified that
the System Center Management service status was Paused.
The Operations Manager log has various warnings and errors, including a lot of
Event ID 2115s, a few 5300s (Local health service is not healthy), and a few
26017s (The Windows Event Log Provider monitoring the Operations Manager Event
Log is 88 minutes behind in processing events).
Most of what I found online suggested performance issues on the DB server, but
CPU and RAM usage on the database server is pretty low.
I have also read that running out of space in the database can cause 2115s, but
that is not the case here (OperationsManager DB size is 24 GB with only 4 GB
used; DW database size is 356 GB with only 22 GB used.)
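I have also read that grooming of the operational database is controlled by the
dbo.PartitionAndGroomingSettings table, so checking whether StateChangeEvent
grooming is configured and has actually been running recently might be worth a
look. A rough sketch of how I'd check it (untested here; it uses Invoke-Sqlcmd
from the SQL Server PowerShell module, and 'DBSERVER' is a placeholder for our
database server name):

```powershell
# Inspect the grooming settings and last-groomed timestamps in the
# OperationsManager DB. 'DBSERVER' is a placeholder -- substitute your own.
Invoke-Sqlcmd -ServerInstance 'DBSERVER' -Database 'OperationsManager' -Query @'
SELECT * FROM dbo.PartitionAndGroomingSettings WITH (NOLOCK)
'@ | Format-Table -AutoSize
```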
I also checked the Disk Usage by Top Tables report and found that the table
consuming the most disk space in the OperationsManager database was
dbo.StateChangeEvent, which has 707,065 records in it. That seems like a lot
for only 222 servers being monitored. The next tables using the most disk
space are dbo.PerformanceData_21, ..._20, _18, _19, _17, _16, _15, and _24.
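For what it's worth, this is the kind of query I've seen suggested in SCOM
blog posts for finding which monitors are producing all those StateChangeEvent
rows (a "noisy monitor" flapping between states could explain the volume). The
table and column names here are taken from community examples rather than
verified against my own schema, so treat them as assumptions, and 'DBSERVER'
is again a placeholder:

```powershell
# List the 20 unit monitors generating the most state changes.
# Table/join names are from community examples -- verify against your schema.
Invoke-Sqlcmd -ServerInstance 'DBSERVER' -Database 'OperationsManager' -Query @'
SELECT TOP 20
       COUNT(sce.StateId) AS NumStateChanges,
       m.DisplayName      AS MonitorDisplayName,
       m.Name             AS MonitorIdName
FROM   dbo.StateChangeEvent sce WITH (NOLOCK)
JOIN   dbo.State s       WITH (NOLOCK) ON sce.StateId = s.StateId
JOIN   dbo.MonitorView m WITH (NOLOCK) ON s.MonitorId = m.Id
WHERE  m.IsUnitMonitor = 1
GROUP BY m.DisplayName, m.Name
ORDER BY NumStateChanges DESC
'@ | Format-Table -AutoSize
```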
Among the things I have already tried:
- Removed the latest Management Packs I had recently imported over the
past week.
- Deleted a few overrides I had created to enable alerts on a few
monitors where alerting is disabled by default
- Cleared the Operations Manager log, since it was so far behind in
processing events
- Stopped the System Center Management service on the Management
Server and deleted the Health Service State folder, letting it rebuild itself
- Removed the Exchange 2010 Management Pack (I had imported this about
two months ago, and didn't have any issues, but since it is such a complex MP,
I figured it was worth a try.)
As another test of the alert delay, I shut down a test server to see how long
it would take to get the Failed To Connect to Computer alert. I saw Event ID
20022 (entity not heartbeating) in the Operations Manager log a minute later,
but it took about an hour to get any kind of alert for it.
I just checked the Operations Manager log around the time the last SCOM alerts
were received by email, and saw the following:
- Event ID 11411: Alert subscription data source module encountered
alert subscriptions that were waiting for a long time to receive an
acknowledgement. Alert subscription ruleid, Alert subscription query low
watermark, Alert subscription query high watermark:
5fcdbf15-4f5b-29db-ffdc-f2088a0f33b7,05/12/2013 01:53:15, 05/12/2013 02:25:15
- Event ID 22402: Forced to terminate the following PowerShell script
because it ran past the configured timeout 300 seconds.
Script Name: GetMGAlertsCount.ps1
One or more workflows were affected by this.
Workflow name: ManagementGroupCollectionAlertsCountRule
Instance name: All Management Servers Resource Pool
Instance ID: {4932D8F0-C8E2-2F4B-288E-3ED98A340B9F}
- Event ID 22402: Forced to terminate the following PowerShell script
because it ran past the configured timeout 300 seconds.
Script Name: AgentStateRollup.ps1
One or more workflows were affected by this.
Workflow name: ManagementGroupCollectionAgentHealthStatesRule
Instance name: All Management Servers Resource Pool
Instance ID: {4932D8F0-C8E2-2F4B-288E-3ED98A340B9F}
- A lot (287) of Event ID 2115s: A Bind Data Source in Management Group
ITS-Systems has posted items to the workflow, but has not received a response
in 1979 seconds. This indicates a performance or functional problem with the
workflow.
Workflow Id : Microsoft.SystemCenter.CollectSignatureData
Instance : <SCOM Management Server Name was here>
Instance Id : {0CE7C182-DC86-D1A2-79CB-DE3E4A745350}
- Several Event ID 8000s: A subscriber data source in management group
<Management Group Name was here> has posted items to the workflow, but has not
received a response in 6 minutes. Data will be queued to disk until a response
has been received. This indicates a performance or functional problem with the
workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData
Instance : <SCOM Management Server Name was here>
Instance Id : {0CE7C182-DC86-D1A2-79CB-DE3E4A745350}
- Event ID 10000: A scheduled discovery task was not started because
the previous task for that discovery was still executing.
Discovery name: Microsoft.SystemCenter.DiscoveryHealthServiceCommunication
Instance name: <SCOM Management Server Name was here>
Management group name: <Management Group Name was here>
The flood of Event ID 2115s referenced the following Workflow IDs, repeating
the same workflows over and over, with each entry just reporting a different
number of seconds in "has not received a response in ### seconds", even though
all the 2115s have the exact same timestamp:
- Microsoft.SystemCenter.CollectSignatureData
- Microsoft.SystemCenter.CollectPublishedEntityState
- Microsoft.SystemCenter.CollectDiscoveryData
- Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange
- Microsoft.SystemCenter.DataWarehouse.CollectEventData
- Microsoft.SystemCenter.CollectAlerts
- Microsoft.SystemCenter.CollectEventData
- Microsoft.SystemCenter.CollectPerformanceData
- Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData
- Microsoft.SystemCenter.CollectSignatureData
Whenever I restart the Management Server, or restart the System Center
Management service, I get a bunch of new/closed alerts in my inbox right away
(and pretty much no alerts after that), and the flood of Event ID 2115s starts
again within an hour.
I'm thinking maybe there is a lot of activity cached somewhere that SCOM is
trying to process, and maybe clearing that out will resolve things...
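One way I think the "cached activity" theory could be tested: as far as I know
the Health Service exposes performance counters for its queues, so something
like the following on the Management Server should show whether the send
queues are backed up (the counter path is from memory, so treat it as an
assumption and adjust if it doesn't resolve):

```powershell
# Check how full the Health Service send queues are on the Management Server.
# Counter path is from memory -- adjust if it doesn't resolve on your system.
Get-Counter -Counter '\Health Service Management Groups(*)\Send Queue % Used' |
    Select-Object -ExpandProperty CounterSamples |
    Select-Object InstanceName, CookedValue
```

A value stuck near 100% would suggest the queue is full and data is being
dropped or delayed rather than just slowly draining.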
Does anyone have any ideas on what I can try to identify what exactly is
causing this or to try to resolve this issue?
Thanks,
Geoff