Could you please move these questions to fm -> discuss, since I am not able to do that myself.
Hello,

Generally, I have two questions regarding FMA (memory-related) on the M-Series (here: SE M9000).

(1) Does page retirement have any (significant) influence on system performance?

We recently suffered a DIMM degradation (more than 500 retired pages and an entry in "fmadm faulty"), which seemed to have an impact on system performance (an Oracle database): it resulted in several temporary "hangs" of the database. Unfortunately, we did not panic the system to take a crash dump for later analysis, so we have no proof of that assumption. In a blog on another website I read that somebody else seems to have suffered from similar symptoms (although very likely on a different kernel patch level), which might even be severe under heavy load.

So my questions are: is there any proof (knowledge) that there is a direct relation between page retirement and system performance (does U5 / U7 behave significantly differently)? Has anybody experienced a similar phenomenon?

To illustrate our symptoms, here is some information about the domain.

1195 entries like the following over a period of 11.5 hours (excerpt from fmdump -a):

May 03 10:17:50.1529 b634cf6e-cc33-e46d-ab6d-8d800ef05865 SUN4U-8000-1A

High run-queue values (excerpt from sar):

          runq-sz %runocc swpq-sz %swpocc
09:00:15    104.6      93     0.0       0
09:10:13    140.7      93     0.0       0
09:20:37    180.0      97     0.0       0
09:30:38    126.3      96     0.0       0
09:40:22    133.6      96     0.0       0
09:50:33    185.3      95     0.0       0
10:00:35    121.1      94     0.0       0
10:10:20     49.7      92     0.0       0
10:20:10    177.2      98     0.0       0
10:30:16     32.0      94     0.0       0
10:40:16      3.1      86     0.0       0

The FMA entries stopped at 10:24, and this is soon reflected in the normalizing run-queue size.

(2) Are there any publicly available tools/packages/commands (XSCF, Solaris onboard, or downloadable) to simulate a page retirement? We need an actual page retirement (done by the memory controller), not an FMA simulation.
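To line the fault log up against the sar samples, one way is to bucket the retired-page events into the same 10-minute intervals. A small sketch, assuming fmdump output shaped like the excerpt above (the sample file below contains made-up stand-in lines; on the real domain you would pipe `fmdump -a` straight into the awk script):

```shell
# Hypothetical stand-in for `fmdump -a` output on the affected domain.
cat > fmdump.sample <<'EOF'
May 03 10:12:50.1529 b634cf6e-cc33-e46d-ab6d-8d800ef05865 SUN4U-8000-1A
May 03 10:17:03.0214 b634cf6e-cc33-e46d-ab6d-8d800ef05865 SUN4U-8000-1A
May 03 10:23:41.7781 b634cf6e-cc33-e46d-ab6d-8d800ef05865 SUN4U-8000-1A
EOF

# Count SUN4U-8000-1A events per 10-minute bucket, matching sar's grid.
awk '/SUN4U-8000-1A/ {
        split($3, t, ":")
        bucket = t[1] ":" sprintf("%02d", int(t[2] / 10) * 10)
        count[bucket]++
     }
     END { for (b in count) print b, count[b] }' fmdump.sample | sort
```

Plotting those per-bucket counts next to runq-sz would show whether the run-queue spikes really track the retirement bursts.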
So far we have considered three methods:

- the Solaris FMA demo kit, which does not work because there is no implementation for the M-Series
- fminject: we were told that this only simulates an error within FMA, but does not initiate an actual page retirement on the DIMM
- mtst: this appears to be part of a package that is not publicly available, and there is no public documentation

Our goal is to initiate several page retirements while performing constant writes to memory under significant system load. We want to see whether there is a significant performance impact, and whether there is a positive change in behavior from Solaris 10 U5 to U7(8). Any suggestions on how to achieve this?

---------------------------------------------------------------------------

Bonus section. I have a basic understanding of FMA, but some questions remain unanswered. My life does not depend on getting these answers, but if you have any input based on knowledge of the FMA retire agents or long-time experience, especially with the M-Series and Solaris 10, it is welcome.

- Why do Solaris and the XSCF evaluate memory errors differently (is this intended, and what is the benefit)? A DIMM may be degraded according to the XSCF but not according to Solaris, and vice versa.
- As far as I understand the algorithm, page retirement for a DIMM is stopped at a certain threshold (which is not only the 0.1 % of the total amount of memory). Is this true, and what is (or could be) the reason for it (perhaps the performance issue)?
- If there is a direct relation between excessive page retirement and system performance, is there a safe (!) way to stop page retirement on the fly? I know I would risk a UE, but that risk exists even with page retirement in place (although it is reduced).

Any input regarding my questions is very much appreciated!
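For the "constant writes to memory" half of the test, a crude load generator can be built from stock commands alone (a sketch, not a substitute for mtst): on Solaris 10, /tmp is swap-backed tmpfs, so repeatedly rewriting files there keeps physical pages being written while the retirements happen. Sizes and counts below are illustrative placeholders; scale them up to cover a meaningful fraction of the domain's memory, and run under ksh or bash (the old Solaris /bin/sh lacks $(( )) arithmetic).

```shell
writers=2        # parallel writer processes (assumption: tune per domain)
passes=4         # bounded here; raise it, or loop forever, for a real soak test

i=0
while [ "$i" -lt "$writers" ]; do
    (
        # Each writer rewrites its own tmpfs file over and over,
        # forcing fresh writes to the backing memory pages.
        j=0
        while [ "$j" -lt "$passes" ]; do
            dd if=/dev/zero of=/tmp/memwr.$i bs=1024k count=16 2>/dev/null
            j=$((j + 1))
        done
        rm -f /tmp/memwr.$i
    ) &
    i=$((i + 1))
done
wait
echo "memory write load finished"
```

Running several such writers alongside the database-like load, and then triggering the retirements, would let you compare U5 against U7(8) under comparable write pressure.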
--
This message posted from opensolaris.org
_______________________________________________
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org