Can we start to prep a spare to be shipped to site?
> On Jan 7, 2015, at 10:27 AM, David MacMahon <[email protected]> wrote:
>
> Hi, Matthys,
>
> Thank you very much for that detailed report! I agree with your assessment
> that pf4 is behaving strangely. It is a useful piece of new data that pf4
> actually powers off spontaneously. That had been suspected, but now it is
> confirmed. I don't think we have a spare ROACH2 on site. We might have to
> carry on without it (PSA112?) for a while.
>
> Here's what the logfile for pf4 shows since (but not including) Jan 5
> 11:14:53. These pairs of lines are the syslogd restart line followed by the
> last log line from that power cycle.
>
> A couple of spontaneous power cycles(?!)...
>
>> Jan 6 12:45:16 syslogd 1.5.0#6: restart.
>> Jan 6 12:45:17 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>>
>> Jan 7 02:08:04 syslogd 1.5.0#6: restart.
>> Jan 7 02:08:05 r2d020671 sshd[542]: Server listening on 0.0.0.0 port 22.
>
> The start of your work...
>
>> Jan 7 09:53:16 syslogd 1.5.0#6: restart.
>> Jan 7 09:53:17 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>>
>> Jan 7 10:05:49 syslogd 1.5.0#6: restart.
>> Jan 7 10:05:51 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>>
>> Jan 7 10:12:26 syslogd 1.5.0#6: restart.
>> Jan 7 10:12:27 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>
> You left at 10:20 with pf4 ON, but we didn't get the 10:32:27 "MARK" message
> so it must have powered OFF between 10:20 and 10:32.
>
> Here is when you found pf4 OFF and powered it up a few times...
>
>> Jan 7 11:02:16 syslogd 1.5.0#6: restart.
>> Jan 7 11:02:18 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>>
>> Jan 7 11:06:28 syslogd 1.5.0#6: restart.
>> Jan 7 11:06:29 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>>
>> Jan 7 11:09:46 syslogd 1.5.0#6: restart.
>> Jan 7 20:09:46 r2d020671 -- MARK --
>
> It's currently still running, but I don't know how long it will last.
>
> Thanks again,
> Dave
>
>> On Jan 7, 2015, at 3:42 AM, Matthys Maree wrote:
>>
>> OK
>>
>> 9:43:
>> - Enter PAPER Container.
>> - All internal lights OFF. Wall sockets OFF. This was strange yes, but I
>>   do not think this must be linked with the Roach4 problem. Earth leakage
>>   tripped inside DB on the lights and wall sockets. (We had a dip in
>>   electricity supply from ESKOM yesterday afternoon for some reason, which
>>   might have to do with this.)
>> - All racks ON except for PF#4 which was OFF again.
>>
>> 9:54:
>> - Pushed kettle plug hard at back of unit, and it turned ON automatically.
>>
>> 10:00:
>> - Swopped kettle plugs (only the ends that connect into the ROACHes),
>>   between units #3 and #4.
>> - #3 turned ON without problems.
>> - #4 needed some hard pushing on the kettle plug before turning ON, and
>>   died again after some seconds.
>>
>> 10:05:
>> - Pulled kettle plug out again from #4, while in the OFF state. Pushed back
>>   in lightly. Did not come ON auto, but did respond to the PWR button on
>>   front. Strange YES?
>>
>> 10:10:
>> - Fiddled with the kettle plug on #4 (while ON), to see if it might turn
>>   OFF due to the fiddling. This did not affect the status at all, and it
>>   remained ON.
>> - Pulled out kettle plug again.
>>
>> 10:12:
>> - Pushed kettle plug back into #4 normally (not with much force, to see if
>>   normal operation works fine). Unit came ON immediately.
>>
>> 10:20:
>> - #4 still ON.
>> - Leave container.
>>
>> 11:00:
>> - Re-enter container.
>> - #4 OFF.
>>
>> 11:05:
>> First tried PWR button with no response.
>> Then RESET button, and unit came ON with the FAULT light ON as RED.
>> RESET button seems to turn this unit OFF when pressed now (maybe part of
>> the RESET cycle?), and the PWR button seems to get it ON now. Now this unit
>> confuses me.....
>>
>> 11:08:
>> - Remove kettle plug from unit in attempt to HARD RESET it.
>>
>> 11:10:
>> - Re-connect kettle plug.
>> - Unit turned ON (like it should, without failure/fault).
>>
>> 11:15:
>> - #4 still ON without fault.
>> - Leave container.
>>
>> My conclusion is that something is behaving strangely with this Roach#4,
>> and not the power supplies/kettle plugs.
>> It looks like it turns OFF by itself after a while, maybe because of
>> something heating up?
>> I would suggest swopping this unit out with a spare one if there is a spare?
>> Maybe you can try a power cycle on the PDU for this unit in an attempt to
>> get it back ON again if you have difficulty.
>>
>> (Remember the kettle plugs are still swopped between #3 and #4, only on the
>> Roach side.)
>>
>> Let me know if I can assist further, maybe with a swop or so.
>>
>> Me and Jasper plan to add more gas to the cooling unit on Friday 9 Jan, in
>> an attempt to keep the cooling unit running, until the fault/leak or
>> whatever is fixed on it later this month hopefully.
>>
>> Groete
>>
>> Matthys Maree
>> SKA South Africa – Carnarvon
>>
>> Tel: 021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
>> Web : www.ska.ac.za
>>
>> -----Original Message-----
>> From: David MacMahon [mailto:[email protected]] On Behalf Of David
>> MacMahon
>> Sent: 06 January 2015 07:13 PM
>> To: Matthys Maree
>> Cc: 'danny jacobs'; 'David DeBoer'; 'PAPER List'; 'Matt Dexter'
>> Subject: Re: recalcitrant roach
>>
>> Thanks, Matthys, that's very helpful! The ROACH2s are configured to power
>> on automatically when VAC power is applied. Since you had to push the power
>> button to turn it on then I suspect something internal to that ROACH2 is
>> unwell. It sounds like the power cables are (and were) securely connected.
>>
>> It would be great if you could please check in on it again tomorrow (Jan 7).
>> The things that will be of most interest to us are:
>>
>> 1) Is pf4 currently powered on when you arrive at the container? This will
>> tell us whether it is a power problem or communication problem.
>> Depending on the current "powered on" status, do either 2A or 2B...
>>
>> 2A) If pf4 is currently off, does pushing the power button turn it on?
>>
>> 2B) If pf4 is currently on, unplug its kettle plug, wait a few seconds,
>> reconnect the kettle plug. Does it turn on automatically when the power
>> cable is reconnected? If not, does pushing the power button turn it on?
>>
>> Assuming that pf4 is powered up after doing 2A or 2B, please wait a few
>> minutes for it to boot. I'm not sure how you can tell that the boot has
>> completed (maybe the network LEDs will stop their rapid blinking?), but I
>> think 3 minutes should be adequate.
>>
>> 3) If it is not too difficult to access, you could try swapping the kettle
>> plugs for pf4 and pf5. That way if the symptom moves to pf5 we will know it
>> is a problem in the PDU (or power cable?). If the symptom stays with pf4
>> then we'll know it's not the PDU. If it's easier, you could instead swap
>> pf4's kettle plug with pf3's. If you do this swap, please let us know which
>> two you swapped. This is an optional step.
>>
>> 4) If you could check that the RJ-45 network cable is securely attached to
>> the back of pf4 that would be reassuring. This is also an optional step.
>>
>> 5) So that we can correlate your actions with what we see in the log files,
>> it would be great if you could record the times when things power on and
>> when you leave the container.
>>
>> 6) Anything else you observe that might be relevant to why pf4 is behaving
>> differently from the other ROACH2s.
>>
>> Thanks again for your assistance!!!
>>
>> Cheers,
>> Dave
>>
>>> On Jan 6, 2015, at 4:57 AM, Matthys Maree wrote:
>>>
>>> Sorry, I only read this mail now that I am already back from site for
>>> today.
>>>
>>> What I did yesterday 5 January, was around that time you mentioned.
>>> Unfortunately I did not check the exact time.
>>> I first tried the "kettle plug" directly on the ROACH#4 machine.
>>> Tried to push it in properly (even if it was not out). I did not succeed.
>>> I traced it down to where it gets its power supplied from. (For this I
>>> had to bend over and under some cables! Could easily have pulled a cable
>>> slightly off something with this attempt.)
>>> On the Power supply unit where all the kettle plugs get power from, I
>>> did the same by ensuring proper connection.
>>> Still not successful.
>>> I went back to Roach #4 power inlet, pushed again, and tried Power
>>> button on front of Roach. Now it turned ON.
>>> So I assumed it was either on the bottom PDU unit or top connection.
>>>
>>> I was probably in the container for +/- 20 minutes.
>>>
>>> Please let me know if you need me to try something in there again. I
>>> can have a look tomorrow (7 Jan).
>>>
>>> Groete
>>>
>>> Matthys Maree
>>> SKA South Africa – Carnarvon
>>>
>>> Tel: 021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
>>> Web : www.ska.ac.za
>>>
>>> -----Original Message-----
>>> From: David MacMahon [mailto:[email protected]] On Behalf Of David
>>> MacMahon
>>> Sent: 06 January 2015 08:12 AM
>>> To: Matthys Maree
>>> Cc: 'danny jacobs'; 'David DeBoer'; 'PAPER List'; 'Matt Dexter'
>>> Subject: Re: recalcitrant roach
>>>
>>> Thanks and Happy New Year, Matthys! We really appreciate having your
>>> on-site support!!!
>>>
>>> Unfortunately, we're still not able to access this machine
>>> ("r2d020671", aka "pf4") via the network. Here is what we see in the
>>> log file for that system:
>>>
>>>> Dec 26 21:13:58 r2d020671 -- MARK --
>>>> Dec 28 02:26:52 syslogd 1.5.0#6: restart.
>>>> [...]
>>>> Dec 28 03:06:52 r2d020671 -- MARK --
>>>> Dec 30 21:24:08 syslogd 1.5.0#6: restart.
>>>> [...]
>>>> Dec 30 21:44:08 r2d020671 -- MARK --
>>>> Jan 2 03:01:47 syslogd 1.5.0#6: restart.
>>>> [...]
>>>> Jan 2 03:41:47 r2d020671 -- MARK --
>>>> Jan 5 11:14:52 syslogd 1.5.0#6: restart.
>>>> [...]
>>>> Jan 5 11:14:53 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>>>
>>> The "MARK" messages get logged after 20 minutes of logging inactivity
>>> and the "syslogd ... restart" lines get logged when the machine
>>> reboots. The final "sshd" line is the last line in the log file. The
>>> timestamps are SAST (UTC+2). Since we didn't get the expected "MARK"
>>> line at 11:34 I can only assume that connectivity was lost sometime
>>> between 11:14:53 and 11:34:53.
>>>
>>> It would really help our understanding of the problem if you could
>>> please provide some more details of your visit to the PAPER container
>>> (e.g. time of day, duration of visit, actions taken, etc). I suspect
>>> it's either a power problem, a network problem, or a system problem
>>> (e.g. bad RAM). The problem is isolated to "pf4" (or its associated
>>> cables); all the other ROACH2s seem fine.
>>>
>>> Thanks again,
>>> Dave
>>>
>>>> On Jan 5, 2015, at 2:36 AM, Matthys Maree wrote:
>>>>
>>>> Hi
>>>>
>>>> Roach#4 back ON.
>>>>
>>>> Probably the power cable.
>>>>
>>>> Cooling still fine inside container.
>>>>
>>>> Groete
>>>>
>>>> Matthys Maree
>>>> SKA South Africa – Carnarvon
>>>>
>>>> Tel: 021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
>>>> Web : www.ska.ac.za
>>>>
>>>> From: danny jacobs [mailto:[email protected]]
>>>> Sent: 31 December 2014 06:57 AM
>>>> To: David DeBoer; PAPER List; Matthys Maree; Matt Dexter
>>>> Subject: Fwd: recalcitrant roach
>>>>
>>>> Hi Matthys (cc PAPER),
>>>>
>>>> One of our ROACHs has stopped responding. A power issue seems most
>>>> likely. What with the heat cycling, it's possible that its power cable
>>>> has loosened (or maybe even the ethernet). A failing power supply is
>>>> also possible. Could you, or someone like you, double check that
>>>> ROACH #4 is getting power and shows an ethernet light?
>>>>
>>>> Thanks,
>>>>
>>>> ~Danny
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: David MacMahon <[email protected]>
>>>> Date: Tue, Dec 30, 2014 at 12:21 PM
>>>> Subject: Re: recalcitrant roach
>>>> To: danny jacobs <[email protected]>
>>>> Cc: Matt Dexter <[email protected]>
>>>>
>>>> Hi, Danny,
>>>>
>>>> pf4 seems to be having problems. These problems seem to have started
>>>> on December 19. The roach2s log a "syslog restart" line when they boot.
>>>> I've extracted the December restart messages from the log files:
>>>>
>>>> pf1:2014 Dec 19 08:46:09 syslogd 1.5.0#6: restart.
>>>> pf2:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
>>>> pf3:2014 Dec 19 08:46:17 syslogd 1.5.0#6: restart.
>>>> pf5:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
>>>> pf6:2014 Dec 19 08:46:16 syslogd 1.5.0#6: restart.
>>>> pf7:2014 Dec 19 08:46:16 syslogd 1.5.0#6: restart.
>>>> pf8:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
>>>>
>>>> pf4:2014 Dec 19 09:45:58 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 19 10:28:00 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 19 11:36:52 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 19 15:10:14 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 19 16:28:23 syslogd 1.5.0#6: restart.
>>>>
>>>> pf4:2014 Dec 20 23:17:49 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 23 23:55:54 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 26 20:33:59 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 28 02:26:52 syslogd 1.5.0#6: restart.
>>>>
>>>> pf1:2014 Dec 30 18:26:41 syslogd 1.5.0#6: restart.
>>>> pf2:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
>>>> pf3:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
>>>> pf5:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
>>>> pf6:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
>>>> pf7:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
>>>> pf8:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
>>>>
>>>> As you can see, pf4 did not restart on Dec 19 with the rest of the
>>>> roach2s at 08:46.
>>>> It restarted almost an hour later at 9:45. It then restarted several
>>>> times throughout the day on the 19th. It has also restarted
>>>> sporadically on a few days since then, with the most recent being on
>>>> Dec 28 at 02:26. The last log message for pf4 was Dec 28 03:06. It
>>>> went down sometime in the next 20 minutes after that.
>>>>
>>>> I'm guessing it's a flaky power issue. Hopefully just a power cord
>>>> that got loose at one end or the other during the shutdown. If it's
>>>> not that then I'd guess it's something internal to the power supply?
>>>>
>>>> I've CC'd Matt in case he has any other ideas.
>>>>
>>>> It would probably be a good idea to have someone check on the power
>>>> cables.
>>>>
>>>> Thanks,
>>>> Dave
>>>>
>>>>> On Dec 30, 2014, at 8:32 AM, danny jacobs wrote:
>>>>>
>>>>> Hi Dave,
>>>>>
>>>>> I thought I'd give PAPER a boot up and see if we could break the A/C
>>>>> but it looks like we may have a dead roach. #4 doesn't respond to
>>>>> pings even after power cycling. Just in case there was some
>>>>> mislabeling on the roachpdu apc page I even rebooted all of them.
>>>>> All go down, all come back... except for #4.
>>>>>
>>>>> Could you maybe take a look and confirm?
>>>>>
>>>>> Thanks,
>>>>> ~Danny
>>>>>
>>>>> --
>>>>>
>>>>> National Science Foundation Fellow
>>>>> Arizona State University
>>>>> School of Earth and Space Exploration Low Frequency Cosmology
>>>>> Phone: (505) 500 4521
>>>>> Homepage: http://loco.lab.asu.edu/danny_jacobs/
>>>>
>>>> --
>>>>
>>>> National Science Foundation Fellow
>>>> Arizona State University
>>>> School of Earth and Space Exploration Low Frequency Cosmology
>>>> Phone: (505) 500 4521
>>>> Homepage: http://loco.lab.asu.edu/danny_jacobs/
>
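The "MARK" reasoning used throughout the thread (a unit's last log line plus the 20-minute MARK interval bounds when it went down) can be sketched in a few lines of Python. This is only an illustration, not something anyone on the thread ran. The function names are made up, the 20-minute interval is taken from Dave's description, the year has to be supplied by hand because these log lines do not carry one, and every "restart" line is treated as the start of a new power cycle even though an ordinary reboot would log the same thing.

```python
from datetime import datetime, timedelta

# Assumption from the thread: syslogd writes "-- MARK --" after
# 20 minutes of logging inactivity.
MARK_INTERVAL = timedelta(minutes=20)


def parse_stamp(line, year=2015):
    """Parse the leading 'Mon D HH:MM:SS' stamp; the year is not in the log."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def outage_windows(lines, year=2015):
    """Return (last_seen, latest_possible_off) for each power cycle that ended.

    A cycle is considered ended when the next "restart" line appears. The
    unit must have gone down before the MARK that would otherwise have
    followed its last log line.
    """
    windows = []
    previous = None  # timestamp of the most recent log line seen so far
    for line in lines:
        stamp = parse_stamp(line, year)
        is_restart = "syslogd" in line and "restart" in line
        if is_restart and previous is not None:
            windows.append((previous, previous + MARK_INTERVAL))
        previous = stamp
    return windows


# Two of the Jan 7 power cycles quoted in the thread.
sample = [
    "Jan 7 10:12:26 syslogd 1.5.0#6: restart.",
    "Jan 7 10:12:27 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.",
    "Jan 7 11:02:16 syslogd 1.5.0#6: restart.",
    "Jan 7 11:02:18 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.",
]

for last_seen, latest_off in outage_windows(sample):
    print(f"went down between {last_seen:%H:%M:%S} and {latest_off:%H:%M:%S}")
```

For the sample lines this reports a window from 10:12:27 to 10:32:27, which lines up with the missing 10:32:27 MARK. Matthys's note that pf4 was still ON at 10:20 narrows it further, which the log alone cannot do.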

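The comparison in the oldest message (pf4 restarting on its own while the other ROACH2s restart together) could be automated along these lines. Again a hedged sketch: the function names and the 60-second grouping tolerance are assumptions, and the input is assumed to be in the `host:YYYY Mon D HH:MM:SS ...` form shown in that message.

```python
from datetime import datetime, timedelta


def parse_restart(line):
    """Split 'pf4:2014 Dec 19 09:45:58 syslogd ...' into (host, datetime)."""
    host, rest = line.split(":", 1)
    year, month, day, clock = rest.split()[:4]
    stamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return host, stamp


def cluster_restarts(lines, tolerance=timedelta(seconds=60)):
    """Group restarts that follow the previous restart within `tolerance`."""
    events = sorted((parse_restart(line) for line in lines), key=lambda e: e[1])
    clusters = []
    for host, stamp in events:
        if clusters and stamp - clusters[-1][-1][1] <= tolerance:
            clusters[-1].append((host, stamp))
        else:
            clusters.append([(host, stamp)])
    return clusters


def lone_restarts(lines, tolerance=timedelta(seconds=60)):
    """Restarts where no other host restarted at about the same time."""
    return [c[0] for c in cluster_restarts(lines, tolerance) if len(c) == 1]


# A subset of the Dec 19 restart lines quoted in the thread.
sample = [
    "pf1:2014 Dec 19 08:46:09 syslogd 1.5.0#6: restart.",
    "pf2:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.",
    "pf3:2014 Dec 19 08:46:17 syslogd 1.5.0#6: restart.",
    "pf4:2014 Dec 19 09:45:58 syslogd 1.5.0#6: restart.",
    "pf4:2014 Dec 19 10:28:00 syslogd 1.5.0#6: restart.",
]

for host, stamp in lone_restarts(sample):
    print(f"{host} restarted alone at {stamp:%b %d %H:%M:%S}")
```

A restart that appears in a cluster of one is not proof of a hardware fault, but a host that keeps showing up alone, as pf4 does here, is the pattern Dave picked out by eye.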