[msmom] RE: SCOM 2012 issues - not sure what is going on

Nelson, Geoffrey D Mon, 13 May 2013 14:50:29 -0700

Kevin,

On the SQL server - are there multiple instances, or is it dedicated?
Single SQL instance dedicated to SCOM 2012.
Physical or virtual?
Physical
How much memory and CPU resources are available to the OS?
32 GB RAM, 2 CPUs (Intel E5-2620)
Have you inspected Disk I/O on the SQL server?  This is the most common cause 
of serious performance backlogs and issues.
I have looked at disk I/O on the SQL server, and it does not seem to be an 
issue when I have looked right after the SCOM server is started up (but I have 
not been able to look at the exact time the 2115s are logged.)  I realized the 
SCOM Agent is not installed on the SQL server so I have installed it to try to 
capture some performance data - but I don't know when that will start to get 
collected, because the State when looking at it in the Windows Computers view 
is the open green circle icon - "Not monitored" - this is also something I 
noticed last week; four new servers I installed the SCOM agent on were still in 
the not monitored state two hours after I installed the Agent on them because 
of whatever is going on...


Does the management server have enough memory?
The management server also has 32 GB RAM, of which only 9 GB is in use.

Performance requirements for SCOM 2012 in your sized environment should be very 
similar to SCOM 2007 R2.... Since you are seeing 2115's for BOTH databases - 
I'd suspect significant SQL performance issues, either not anywhere near enough 
memory, or most likely a disk I/O capability issue.

What is the disk subsystem for the SQL database?  What is the avg disk sec/read 
and avg disk sec/write during the time that 2115's begin?
1 RAID 10 virtual disk (using 8 physical disks) (on which there are two logical 
disks: one for the OS and one for SQL server)
I restarted SCOM server about 30 minutes ago, and there have not been any 2115s 
yet.

I'll try to get the avg disk sec/read and avg disk sec/write during the time 
that 2115's begin, but wanted to reply back with this info for now...
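
Once the disk counter samples are captured, something along these lines could help line them up against the times the 2115s are logged. This is only a sketch in Python: the sample tuples, the `flag_slow_samples` name, and the 0.025-second threshold are assumptions made for illustration, not values from this thread.

```python
from datetime import datetime

# Hypothetical threshold in seconds; substitute whatever your storage
# guidance recommends.
SLOW_SECONDS = 0.025


def flag_slow_samples(samples, threshold=SLOW_SECONDS):
    """Return samples whose read or write latency exceeds the threshold.

    Each sample is (timestamp, avg_disk_sec_read, avg_disk_sec_write).
    """
    return [s for s in samples if s[1] > threshold or s[2] > threshold]


# Made-up sample values, purely to show the shape of the data.
samples = [
    (datetime(2013, 5, 13, 14, 0), 0.004, 0.006),
    (datetime(2013, 5, 13, 14, 5), 0.041, 0.012),
    (datetime(2013, 5, 13, 14, 10), 0.008, 0.090),
]

for ts, read_s, write_s in flag_slow_samples(samples):
    print(f"{ts:%H:%M} read={read_s * 1000:.0f} ms write={write_s * 1000:.0f} ms")
```

Comparing the flagged timestamps with the 2115 timestamps in the Operations Manager log would show whether the backlog coincides with slow disk response.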

Thankx,
Geoff


From: [email protected] [mailto:[email protected]] On 
Behalf Of Nelson, Geoffrey D
Sent: Monday, May 13, 2013 10:55 AM
To: [email protected]
Subject: [msmom] SCOM 2012 issues - not sure what is going on

I am having issues with the health of my new SCOM 2012 Management Group, and 
was hoping someone might have some ideas on what is going on.
I am pretty new to SCOM.  We are running SCOM on two physical servers: one 
Management Server and one Database server.

We are in the process of migrating from SCOM 2007 to SCOM 2012.  Everything had 
been working pretty well until last week, and we have a pretty small 
environment (only 222 Windows computers.)  The first symptom I noticed was that 
alerts were not being received or were delayed: we reseated a hard drive in a 
server and did not receive the alert that a disk had been removed until two 
hours later...  At first I dismissed it as an issue with the particular Agent.  
Then SCOM alerting from 2012 just seemed to stop.  Looking in the Monitoring 
pane of the SCOM Console, under Operations Manager - Management Group Health 
every component (for both Management Group Functions and Management Group 
Infrastructure) has a grey checkmark.  Looking in Management Server - 
Management Server State the State shows as critical (and is greyed out) - 
Opening alert view I see "Event data collection process unable to store data in 
the Data Warehouse in a timely manner", "Performance data collection process 
unable to store data in the Data Warehouse in a timely manner", and "Object 
Health State data collection process unable to store data in the Data Warehouse 
in a timely manner."  Looking at the Services in Server Manager, I verified the 
System Center Management service status was Paused.

The Operations Manager log has various warnings and errors including a lot of 
Event ID 2115s, and a few 5300s (Local health service is not healthy) and a few 
26017s (The Windows Event Log Provider monitoring the Operations Manager Event 
Log is 88 minutes behind in processing events.)

Most of what I found online suggested issues with performance on the DB server 
- but CPU and RAM usage on the database server is pretty low.
I have also read that running out of space in the database can cause 2115s, but 
that is not the case here (OperationsManager DB size is 24 GB with only 4 GB 
used; DW database size is 356 GB with only 22 GB used.)
I also checked the Disk Usage by Top Tables report and found that the table 
consuming the most disk space in the OperationsManager database was 
dbo.StateChangeEvent which has 707,065 records in it.  That seems like a lot 
(for only 222 servers being monitored.)  The next tables using the most Disk 
space are dbo.PerformanceData_21, ..._20, _18, _19, _17, _16, _15, and _24.
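
To put the figures above in proportion, here is a small Python sketch of the arithmetic. It uses only the numbers quoted in this message; the `percent_free` helper name is just for illustration.

```python
def percent_free(size_gb, used_gb):
    """Percentage of the allocated database size that is still unused."""
    return 100.0 * (size_gb - used_gb) / size_gb


# Figures quoted above.
ops_free = percent_free(24, 4)      # OperationsManager DB
dw_free = percent_free(356, 22)     # Data Warehouse DB
rows_per_computer = 707_065 / 222   # StateChangeEvent rows per monitored computer

print(f"OperationsManager free: {ops_free:.1f}%")
print(f"Data Warehouse free:    {dw_free:.1f}%")
print(f"StateChangeEvent rows per computer: {rows_per_computer:.0f}")
```

That works out to roughly 83% and 94% free space, and about 3,185 state change rows per monitored computer.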

Among the things I have already tried:

-          Removed the latest Management Packs I had recently imported over the 
past week.

-          Deleted a few overrides I created that enabled alerts on a few 
monitors where they are disabled by default

-          Cleared the Operations Manager log, since it was so far behind in 
processing events

-          Stopped the System Center Management service on the Management 
Server and deleted the Health Service State folder, letting it rebuild itself

-          Removed the Exchange 2010 Management Pack (I had imported this about 
two months ago, and didn't have any issues, but since it is such a complex MP, 
I figured it was worth a try.)

Another test I did to check on alert delay was I shut down a test server to see 
how long it would take to get the Failed To Connect to Computer alert.  I was 
able to see Event ID  20022 (entity not heartbeating) in the OperationsManager 
log a minute later, but it takes about an hour to get any kind of alert for 
this.

I just checked the Operations Manager log, looking at the time corresponding 
to when the last SCOM alerts were received by email, and saw the following:

-          Event ID 11411: Alert subscription data source module encountered 
alert subscriptions that were waiting for a long time to receive an 
acknowledgement.  Alert subscription ruleid, Alert subscription query low 
watermark, Alert subscription query high watermark: 
5fcdbf15-4f5b-29db-ffdc-f2088a0f33b7,05/12/2013 01:53:15, 05/12/2013 02:25:15

-          Event ID 22402: Forced to terminate the following PowerShell script 
because it ran past the configured timeout 300 seconds.
Script Name:          GetMGAlertsCount.ps1
One or more workflows were affected by this.
Workflow name: ManagementGroupCollectionAlertsCountRule
Instance name: All Management Servers Resource Pool
Instance ID: {4932D8F0-C8E2-2F4B-288E-3ED98A340B9F}

-          Event ID 22402: Forced to terminate the following PowerShell script 
because it ran past the configured timeout 300 seconds.
Script Name:          AgentStateRollup.ps1
One or more workflows were affected by this.
Workflow name: ManagementGroupCollectionAgentHealthStatesRule
Instance name: All Management Servers Resource Pool
Instance ID: {4932D8F0-C8E2-2F4B-288E-3ED98A340B9F}

-          A lot (287) of Event ID 2115s: A Bind Data Source in Management Group 
ITS-Systems has posted items to the workflow, but has not received a response 
in 1979 seconds.  This indicates a performance or functional problem with the 
workflow.
Workflow Id : Microsoft.SystemCenter.CollectSignatureData
Instance    : <SCOM Management Server Name was here>
Instance Id : {0CE7C182-DC86-D1A2-79CB-DE3E4A745350}

-          Several Event ID 8000s: A subscriber data source in management group 
<Management Group Name was here> has posted items to the workflow, but has not 
received a response in 6 minutes.  Data will be queued to disk until a response 
has been received.  This indicates a performance or functional problem with the 
workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData
Instance    : <SCOM Management Server Name was here>
Instance Id : {0CE7C182-DC86-D1A2-79CB-DE3E4A745350}

-          Event ID 10000: A scheduled discovery task was not started because 
the previous task for that discovery was still executing.
Discovery name: Microsoft.SystemCenter.DiscoveryHealthServiceCommunication
Instance name: <SCOM Management Server Name was here>
Management group name: <Management Group Name was here>

The many Event ID 2115s referenced the following Workflow IDs, repeating these 
same workflows over and over, with each entry just having a different number of 
seconds reported in "has not received a response in ### seconds" even though 
all 2115s are stamped with the exact same time.

-          Microsoft.SystemCenter.CollectSignatureData

-          Microsoft.SystemCenter.CollectPublishedEntityState

-          Microsoft.SystemCenter.CollectDiscoveryData

-          Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange

-          Microsoft.SystemCenter.DataWarehouse.CollectEventData

-          Microsoft.SystemCenter.CollectAlerts

-          Microsoft.SystemCenter.CollectEventData

-          Microsoft.SystemCenter.CollectPerformanceData

-          Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData

-          Microsoft.SystemCenter.CollectSignatureData
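
A tally like the one above could be produced with a short script rather than by reading the log by hand. The following Python sketch assumes the 2115 message texts have already been exported into a list of strings (how they are exported is left out), and the `summarize_2115` name is hypothetical. The patterns match only the wording quoted in this message.

```python
import re
from collections import defaultdict

# Patterns matching the 2115 text as quoted in this message.
SECONDS_RE = re.compile(r"has not received a response\s+in\s+(\d+)\s+seconds")
WORKFLOW_RE = re.compile(r"Workflow Id\s*:\s*(\S+)")


def summarize_2115(messages):
    """Map each workflow ID to (event count, longest wait in seconds)."""
    waits = defaultdict(list)
    for text in messages:
        seconds = SECONDS_RE.search(text)
        workflow = WORKFLOW_RE.search(text)
        # Skip any message that does not carry both fields.
        if seconds and workflow:
            waits[workflow.group(1)].append(int(seconds.group(1)))
    return {wf: (len(v), max(v)) for wf, v in waits.items()}


# The one event quoted in full earlier in this message.
example = (
    "A Bind Data Source in Management Group ITS-Systems has posted items to "
    "the workflow, but has not received a response in 1979 seconds.  This "
    "indicates a performance or functional problem with the workflow.\n"
    "Workflow Id : Microsoft.SystemCenter.CollectSignatureData\n"
)

for workflow, (count, worst) in summarize_2115([example]).items():
    print(f"{workflow}: {count} event(s), longest wait {worst} s")
```

Sorting the result by longest wait would show which workflows are furthest behind, for example whether the Data Warehouse writes lag more than the OperationsManager ones.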

Whenever I restart the Management Server, or restart the System Center 
Management service, I get a bunch of new/closed alerts in my inbox right away 
(and pretty much no alerts after that) and the flood of Event ID 2115s seems to 
come in within an hour.

I'm thinking maybe there is a lot of activity cached somewhere that SCOM is 
trying to process, and maybe clearing that out will resolve things...

Does anyone have any ideas on what I can try to identify what exactly is 
causing this or to try to resolve this issue?

Thankx,
Geoff
