Ok, let's go through this one at a time. See inserted comments.
Alan Ackerman wrote:
Creating new thread.
1. The folks that receive the data at my shop are z/OS folks. Historically
the capture ratio of MVS was really poor. The notion was that you should
use SMF data and never RMF data. I don't know if z/OS has cleaned up its
act or not.
But I have heard the same thing from VM folks. (I've said it myself.)
As Barton says, the capture ratio in VM has always been quite high, due t
the way the data is captured in the VMDBK. However, Barton computes this
(I think) by comparing different record types in the monitor data, not by
comparing monitor to accounting data.
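To make the metric being debated concrete: a capture ratio is just the CPU
time attributed to identifiable users divided by the total CPU time the
system actually consumed. A minimal sketch (the numbers and field layout
are invented for illustration, not actual monitor or accounting record
formats):

```python
def capture_ratio(user_cpu_seconds, total_cpu_seconds):
    """Fraction of total consumed CPU time that was attributed to users.

    A ratio of 1.0 (100%) means no CPU time went unaccounted for;
    anything below that is uncaptured "overhead"."""
    if total_cpu_seconds == 0:
        return 0.0
    return sum(user_cpu_seconds) / total_cpu_seconds

# Example: 59.5 user-attributed seconds out of 60 consumed in the interval.
ratio = capture_ratio([30.0, 20.0, 9.5], 60.0)
print(f"{ratio:.1%}")
```

Whether the denominator comes from other monitor record types or from
accounting records is exactly the methodological difference noted above.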
There is system overhead, but it is captured in the SYSTEM VMDBK block.
Accounting data and monitor data draw on the same underlying counters, so
they should get the same results. Of course, some time gets charged to the
wrong user, for example between the time an interrupt comes in and the new
user is identified. But it shows up the same in the monitor and the
accounting data. (User CPU time is more reproducible than total CPU time,
for this reason.)
Is "some time gets charged to the wrong user" a validated and relevant
issue? I've not seen any "overhead" issues in accounting or monitor data
in MANY years.
2. Monitor sample data is taken at one-minute intervals. It used to be
that data for users that logged on or off between samples was dropped for
the partial minutes. Is this still true? Was it ever true? Or is it urban
folklore?
Transaction records are cut at logon/logoff, that is how we get 100.00%
capture ratio. Nothing is lost.
3. On our systems, we sometimes see messages from CP that say the monitor
data has been thrown away because the user connected to *MONITOR did not
respond in time. This happens when the system is overloaded, either in CP
or storage. So we lose some minutes of monitor data, but not, I think,
accounting data.
Often you can fix this by increasing the segment sizes or giving
MONWRITE/ESAWRITE a bigger SHARE. Not always, though. In some cases the
monitor segments get paged out. (We reported it to Velocity, who said it
was a CP problem.) I think IBM could do things to make collection of
monitor data more reliable in the extreme cases.
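For example, the two knobs mentioned above are the writer's scheduler
SHARE and the monitor sample configuration area. The commands below are
standard CP syntax, but the user ID and values are illustrative only;
check them against your z/VM level before use:

```
CP SET SHARE MONWRITE RELATIVE 1500
CP MONITOR SAMPLE CONFIG SIZE 241
```

A higher relative SHARE makes it likelier the writer gets dispatched in
time to drain the monitor DCSS before CP discards data.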
Unfortunately, I'm not responsible for this and it is "only performance
data". I think this can be dealt with, but it does take diligence and work
to keep your monitor data accurate. You don't have to do this work for
accounting data.
I think IBM could do things to make collection of monitor data easier.
This still does happen occasionally when systems are thrashing so much
that everything stops. At that point, accounting is probably a lower
priority. Capacity planning and performance tuning are still needed on
this platform.
IBM could stop the DCSS from being paged out when the system starts to
thrash.
4. On our systems, we switch files (I think hourly) to keep them from
getting too big. We lose a minute or two of data each time.
ESALPS does not lose data each hour; the capture ratio is 100%.
5. The default for ESAWRITE is to collect User history records only for
userids using more than 0.5% CPU. So when we go back to process CPU
utilization for users, we get smaller totals for monitor than from
accounting data. I assume this could be fixed by setting the threshold to
zero.
I don't know which of these, if any, affect the ESALPS data collection
that Barton mentioned. We have tested ESALPS, but are not yet licensed.
The default for ESAWRITE is a 100% capture ratio. ALL USER DATA is
captured and retained for capacity planning and accounting. The
thresholds only apply to current performance data. This has been the
case for 20 years. I'll repeat: the capture ratio for user data is ALWAYS
100.00%. You can't look at the interval data collected for performance
and use it for accounting. The summary data for each hour is 100% and
is what one would use for accounting and capacity planning.
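The distinction being drawn here, that thresholded interval records
undercount while unfiltered summary records do not, can be sketched as
follows (the user names, numbers, and a seconds-based threshold are all
invented for illustration; the real ESAWRITE threshold is expressed as a
percentage of CPU):

```python
# Hypothetical per-user CPU seconds for one interval.
users = {"MAINT": 30.0, "LINUX01": 20.0, "SMALLSVM": 0.2, "IDLEUSR": 0.1}

THRESHOLD = 0.5  # only users above this appear in performance records

# Performance (interval) view: small users are filtered out.
perf_total = sum(t for t in users.values() if t > THRESHOLD)

# Summary (accounting) view: every user is retained.
summary_total = sum(users.values())

print(f"interval view: {perf_total:.1f}s, summary view: {summary_total:.1f}s")
```

The interval view comes up short by exactly the small users' time, which
is why the summary data, not the thresholded interval data, is what you
reconcile against accounting.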
Alan Ackerman
Alan (dot) Ackerman (at) Bank of America (dot) com