I am having issues with the health of my new SCOM 2012 Management Group, and 
was hoping someone might have some ideas on what is going on.
I am pretty new to SCOM.  We are running SCOM on two physical servers: one 
Management Server and one Database server.

We are in the process of migrating from SCOM 2007 to SCOM 2012.  Everything had 
been working pretty well until last week, and we have a pretty small 
environment (only 222 Windows computers.)  The first symptom I noticed was that 
alerts were not being received or were delayed: we reseated a hard drive in a 
server and did not receive the alert that a disk had been removed until two 
hours later.  At first I dismissed it as an issue with that particular Agent.  
Then SCOM alerting from 2012 just seemed to stop.  In the Monitoring pane of 
the SCOM Console, under Operations Manager - Management Group Health, every 
component (for both Management Group Functions and Management Group 
Infrastructure) has a grey checkmark.  Under Management Server - Management 
Server State, the State shows as critical (and is greyed out).  Opening the 
alert view, I see "Event data collection process unable to store data in the 
Data Warehouse in a timely manner", "Performance data collection process 
unable to store data in the Data Warehouse in a timely manner", and "Object 
Health State data collection process unable to store data in the Data 
Warehouse in a timely manner."  In Server Manager, I verified that the System 
Center Management service status was Paused.

The Operations Manager log has various warnings and errors including a lot of 
Event ID 2115s, and a few 5300s (Local health service is not healthy) and a few 
26017s (The Windows Event Log Provider monitoring the Operations Manager Event 
Log is 88 minutes behind in processing events.)

Most of what I found online suggested issues with performance on the DB server 
- but CPU and RAM usage on the database server is pretty low.
I have also read that running out of space in the database can cause 2115s, but 
that is not the case here (OperationsManager DB size is 24 GB with only 4 GB 
used; DW database size is 356 GB with only 22 GB used.)
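In case it helps narrow things down, here is a sketch of the checks I can run 
against the OperationsManager database to see whether grooming is keeping up 
(this assumes the standard OperationsManager schema; table names are from 
common troubleshooting write-ups, so treat them as a starting point rather 
than gospel):

```sql
-- Configured retention and last grooming run per data type.
-- A very old GroomingRunTime would suggest grooming has stalled.
SELECT *
FROM dbo.PartitionAndGroomingSettings WITH (NOLOCK);

-- Recent internal (grooming/partitioning) job history.
-- Look for jobs that are failing or never finishing.
SELECT TOP 20 *
FROM dbo.InternalJobHistory WITH (NOLOCK)
ORDER BY InternalJobHistoryId DESC;
```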
I also checked the Disk Usage by Top Tables report and found that the table 
consuming the most disk space in the OperationsManager database was 
dbo.StateChangeEvent, which has 707,065 records in it.  That seems like a lot 
(for only 222 servers being monitored.)  The next-largest tables are 
dbo.PerformanceData_21, ..._20, _18, _19, _17, _16, _15, and _24.
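To figure out what is flooding dbo.StateChangeEvent, something like the 
following widely shared troubleshooting query should list the noisiest 
monitors (again assuming the standard OperationsManager schema; I have not 
verified every column name, so it may need minor adjustment):

```sql
-- Top 20 unit monitors by number of state changes, to identify
-- "flip-flopping" monitors that inflate dbo.StateChangeEvent.
SELECT DISTINCT TOP 20
    count(sce.StateId) AS NumStateChanges,
    m.DisplayName      AS MonitorDisplayName,
    m.Name             AS MonitorIdName
FROM StateChangeEvent sce WITH (NOLOCK)
JOIN State s WITH (NOLOCK) ON sce.StateId = s.StateId
JOIN MonitorView m WITH (NOLOCK) ON s.MonitorId = m.Id
WHERE m.IsUnitMonitor = 1
GROUP BY m.DisplayName, m.Name
ORDER BY NumStateChanges DESC;
```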

Among the things I have already tried:

- Removed the Management Packs I had imported over the past week.

- Deleted a few overrides I had created to enable alerts on monitors where 
they are disabled by default.

- Cleared the Operations Manager log, since it was so far behind in 
processing events.

- Stopped the System Center Management service on the Management Server and 
deleted the Health Service State folder, letting it rebuild itself.

- Removed the Exchange 2010 Management Pack (I imported this about two months 
ago and didn't have any issues, but since it is such a complex MP, I figured 
it was worth a try.)

As another test of the alert delay, I shut down a test server to see how long 
it would take for the Failed To Connect to Computer alert to arrive.  Event 
ID 20022 (entity not heartbeating) appeared in the OperationsManager log a 
minute later, but it took about an hour to get any kind of alert for this.

I just checked the Operations Manager log, looking at the time corresponding 
to when the last SCOM alerts were received by email, and saw the following:

- Event ID 11411: Alert subscription data source module encountered 
alert subscriptions that were waiting for a long time to receive an 
acknowledgement.  Alert subscription ruleid, Alert subscription query low 
watermark, Alert subscription query high watermark: 
5fcdbf15-4f5b-29db-ffdc-f2088a0f33b7,05/12/2013 01:53:15, 05/12/2013 02:25:15

- Event ID 22402: Forced to terminate the following PowerShell script 
because it ran past the configured timeout 300 seconds.
Script Name:          GetMGAlertsCount.ps1
One or more workflows were affected by this.
Workflow name: ManagementGroupCollectionAlertsCountRule
Instance name: All Management Servers Resource Pool
Instance ID: {4932D8F0-C8E2-2F4B-288E-3ED98A340B9F}

- Event ID 22402: Forced to terminate the following PowerShell script 
because it ran past the configured timeout 300 seconds.
Script Name:          AgentStateRollup.ps1
One or more workflows were affected by this.
Workflow name: ManagementGroupCollectionAgentHealthStatesRule
Instance name: All Management Servers Resource Pool
Instance ID: {4932D8F0-C8E2-2F4B-288E-3ED98A340B9F}

- A lot (287) of Event ID 2115s: A Bind Data Source in Management Group 
ITS-Systems has posted items to the workflow, but has not received a response 
in 1979 seconds.  This indicates a performance or functional problem with the 
workflow.
Workflow Id : Microsoft.SystemCenter.CollectSignatureData
Instance    : <SCOM Management Server Name was here>
Instance Id : {0CE7C182-DC86-D1A2-79CB-DE3E4A745350}

- Several Event ID 8000s: A subscriber data source in management group 
<Management Group Name was here> has posted items to the workflow, but has not 
received a response in 6 minutes.  Data will be queued to disk until a response 
has been received.  This indicates a performance or functional problem with the 
workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData
Instance    : <SCOM Management Server Name was here>
Instance Id : {0CE7C182-DC86-D1A2-79CB-DE3E4A745350}

- Event ID 10000: A scheduled discovery task was not started because 
the previous task for that discovery was still executing.
Discovery name: Microsoft.SystemCenter.DiscoveryHealthServiceCommunication
Instance name: <SCOM Management Server Name was here>
Management group name: <Management Group Name was here>

The flood of Event ID 2115s referenced the following Workflow IDs, repeating 
the same workflows over and over; each entry reports a different number of 
seconds in "has not received a response in ### seconds" even though all of 
the 2115s are stamped with the exact same time.

- Microsoft.SystemCenter.CollectSignatureData

- Microsoft.SystemCenter.CollectPublishedEntityState

- Microsoft.SystemCenter.CollectDiscoveryData

- Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange

- Microsoft.SystemCenter.DataWarehouse.CollectEventData

- Microsoft.SystemCenter.CollectAlerts

- Microsoft.SystemCenter.CollectEventData

- Microsoft.SystemCenter.CollectPerformanceData

- Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData

- Microsoft.SystemCenter.CollectSignatureData

Whenever I restart the Management Server, or restart the System Center 
Management service, I get a bunch of new/closed alerts in my inbox right away 
(and pretty much no alerts after that), and the flood of Event ID 2115s 
returns within an hour.

I'm thinking maybe there is a lot of activity cached somewhere that SCOM is 
trying to process, and maybe clearing that out will resolve things...

Does anyone have any ideas on what I can try to identify what exactly is 
causing this or to try to resolve this issue?

Thanks,
Geoff


