Hi, Matthys,

Thank you very much for that detailed report! I agree with your assessment that pf4 is behaving strangely. It is a useful piece of new data that pf4 actually powers off spontaneously. That had been suspected, but now it is confirmed. I don't think we have a spare ROACH2 on site. We might have to carry on without it (PSA112?) for a while.

Here's what the logfile for pf4 shows since (but not including) Jan 5 11:14:53. These pairs of lines are the syslogd restart line followed by the last log line from that power cycle.

A couple of spontaneous power cycles(?!)...

> Jan 6 12:45:16 syslogd 1.5.0#6: restart.
> Jan 6 12:45:17 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>
> Jan 7 02:08:04 syslogd 1.5.0#6: restart.
> Jan 7 02:08:05 r2d020671 sshd[542]: Server listening on 0.0.0.0 port 22.

The start of your work...

> Jan 7 09:53:16 syslogd 1.5.0#6: restart.
> Jan 7 09:53:17 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>
> Jan 7 10:05:49 syslogd 1.5.0#6: restart.
> Jan 7 10:05:51 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>
> Jan 7 10:12:26 syslogd 1.5.0#6: restart.
> Jan 7 10:12:27 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.

You left at 10:20 with pf4 ON, but we didn't get the 10:32:27 "MARK" message so it must have powered OFF between 10:20 and 10:32.

Here is when you found pf4 OFF and powered it up a few times...

> Jan 7 11:02:16 syslogd 1.5.0#6: restart.
> Jan 7 11:02:18 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>
> Jan 7 11:06:28 syslogd 1.5.0#6: restart.
> Jan 7 11:06:29 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>
> Jan 7 11:09:46 syslogd 1.5.0#6: restart.
> Jan 7 20:09:46 r2d020671 -- MARK --

It's currently still running, but I don't know how long it will last.

Thanks again,
Dave

On Jan 7, 2015, at 3:42 AM, Matthys Maree wrote:

> OK
>
> 9:43:
> - Enter PAPER Container.
> - All internal lights OFF. Wall sockets OFF. This was strange yes, but I do not think this must be linked with the Roach4 problem. Earth leakage tripped inside DB on the lights and wall sockets. (We had a dip in electricity supply from ESKOM yesterday afternoon for some reason, and might have to do with this)
> - All racks ON except for PF#4 which was OFF again.
>
> 9:54:
> - Pushed kettle plug hard at back of unit, and it turned ON automatically.
>
> 10:00:
> - Swopped kettle plugs (only the ends that connect into the ROACHes), between units #3 and #4.
> - #3 turned ON without problems.
> - #4 needed some hard pushing from the kettle plug before turning ON, and died again after some seconds.
>
> 10:05:
> - Pulled kettle plug out again from #4, while in the OFF state. Pushed back in lightly. Did not come ON auto, but did respond to the PWR button on front. Strange YES?
>
> 10:10:
> - Fiddle with the kettle plug on #4 (while ON), to see if it might turn OFF due to the fiddling. This did not affect the status at all, and it remained ON.
> - Pulled out kettle plug again.
>
> 10:12:
> - Pushed back in kettle plug in #4 normally (not with much force, to see if normal operation work fine). Unit came ON immediately.
>
> 10:20:
> - #4 Still ON.
> - Leave container.
>
> 11:00:
> - Re-enter container.
> - #4 OFF.
>
> 11:05:
> First tried PWR button with no response.
> Then RESET button, and unit came ON with the FAULT light ON as RED.
> RESET button seem to turn this unit OFF when pressed now (Maybe part of the RESET cycle?), and the PWR button seem to get it ON now. Now this unit confuse me.....
>
> 11:08:
> - Remove kettle plug from unit in attempt to HARD RESET it.
>
> 11:10:
> - Re-connect kettle plug.
> - Unit turn ON (Like it should without failure/fault)
>
> 11:15:
> - #4 still ON without fault.
> - Leave container.
>
>
> My conclusion is that something is behaving strange with this Roach#4, and not the power supplies/kettle plugs.
> It looks like it turn OFF by itself after a while, maybe because of something heating up?
> I would suggest swopping this unit out with a spare one if there is a spare?
> Maybe you can try a power cycle on the PDU for this unit in attempt to get it back ON again if you have difficulty.
>
>
> (Remember the kettle plugs are still swopped between #3 and #4, only on the Roach side.)
>
> Let me know if I can assist further, maybe with a swop or so.
>
>
>
> Me and Jasper plan to add more gas to the cooling unit on Friday 9 Jan, in an attempt to keep the cooling unit running, until the fault/leak or whatever is fixed on it later this month hopefully.
>
>
> Groete
>
> Matthys Maree
> SKA South Africa – Carnarvon
>
> Tel: 021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
> Web : www.ska.ac.za
>
>
> -----Original Message-----
> From: David MacMahon [mailto:[email protected]] On Behalf Of David MacMahon
> Sent: 06 January 2015 07:13 PM
> To: Matthys Maree
> Cc: 'danny jacobs'; 'David DeBoer'; 'PAPER List'; 'Matt Dexter'
> Subject: Re: recalcitrant roach
>
> Thanks, Matthys, that's very helpful! The ROACH2s are configured to power on automatically when VAC power is applied. Since you had to push the power button to turn it on then I suspect something internal to that ROACH2 is unwell. It sounds like the power cables are (and were) securely connected.
>
> It would be great if you could please check in on it again tomorrow (Jan 7). The things that will be of most interest to us are:
>
> 1) Is pf4 currently powered on when you arrive at the container? This will tell us whether it is a power problem or communication problem. Depending on the current "powered on" status, do either 2A or 2B...
>
> 2A) If pf4 is currently off, does pushing the power button turn it on?
>
> 2B) If pf4 is currently on, unplug its kettle plug, wait a few seconds, reconnect the kettle plug. Does it turn on automatically when the power cable is reconnected? If not, does pushing the power button turn it on?
>
> Assuming that pf4 is powered up after doing 2A or 2B, please wait a few minutes for it to boot. I'm not sure how you can tell that the boot has completed (maybe the network LEDs will stop their rapid blinking?), but I think 3 minutes should be adequate.
>
> 3) If it is not too difficult to access, you could try swapping the kettle plugs for pf4 and pf5. That way if the symptom moves to pf5 we will know it is a problem in the PDU (or power cable?). If the symptom stays with pf4 then we'll know it's not the PDU. If it's easier, you could instead swap pf4's kettle plug with pf3's. If you do this swap, please let us know which two you swapped. This is an optional step.
>
> 4) If you could check that the RJ-45 network cable is securely attached to the back of pf4 that would be reassuring. This is also an optional step.
>
> 5) So that we can correlate your actions with what we see in the log files, it would be great if you could record the times when things power on and when you leave the container.
>
> 6) Anything else you observe that might be relevant to why pf4 is behaving differently from the other ROACH2s.
>
> Thanks again for your assistance!!!
>
> Cheers,
> Dave
>
> On Jan 6, 2015, at 4:57 AM, Matthys Maree wrote:
>
>> Sorry, I only read this mail now that I am already back from site for today.
>>
>> What I did yesterday 5 January, was around that time you mentioned.
>> Unfortunately I did not check the exact time.
>> I first tried the "kettle plug" directly on the ROACH#4 machine.
>> Tried to push it in properly (even if it was not out). I did not succeed.
>> I traced it down to where it get power supplied from. (For this I had to bend over and under some cables! Could easily have pulled a cable slightly off something with this attempt.)
>> On the Power supply unit where all the kettle plugs get power from, I did the same by ensuring proper connection.
>> Still not successful.
>> I went back to Roach #4 power inlet, pushed again, and tried Power button on front of Roach. Now it turned ON.
>> So I assumed it was either on the bottom PDU unit or top connection.
>>
>> I was probably in the container for +/- 20 minutes.
>>
>> Please let me know if you need me to try something in there again. I can have a look tomorrow (7 Jan).
>>
>>
>> Groete
>>
>> Matthys Maree
>> SKA South Africa – Carnarvon
>>
>> Tel: 021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
>> Web : www.ska.ac.za
>>
>> -----Original Message-----
>> From: David MacMahon [mailto:[email protected]] On Behalf Of David MacMahon
>> Sent: 06 January 2015 08:12 AM
>> To: Matthys Maree
>> Cc: 'danny jacobs'; 'David DeBoer'; 'PAPER List'; 'Matt Dexter'
>> Subject: Re: recalcitrant roach
>>
>> Thanks and Happy New Year, Matthys! We really appreciate having your on-site support!!!
>>
>> Unfortunately, we're still not able to access this machine ("r2d020671", aka "pf4") via the network. Here is what we see in the log file for that system:
>>
>>> Dec 26 21:13:58 r2d020671 -- MARK --
>>> Dec 28 02:26:52 syslogd 1.5.0#6: restart.
>>> [...]
>>> Dec 28 03:06:52 r2d020671 -- MARK --
>>> Dec 30 21:24:08 syslogd 1.5.0#6: restart.
>>> [...]
>>> Dec 30 21:44:08 r2d020671 -- MARK --
>>> Jan 2 03:01:47 syslogd 1.5.0#6: restart.
>>> [...]
>>> Jan 2 03:41:47 r2d020671 -- MARK --
>>> Jan 5 11:14:52 syslogd 1.5.0#6: restart.
>>> [...]
>>> Jan 5 11:14:53 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>>
>> The "MARK" messages get logged after 20 minutes of logging inactivity and the "syslogd ... restart" lines get logged when the machine reboots. The final "sshd" line is the last line in the log file. The timestamps are SAST (UTC+2). Since we didn't get the expected "MARK" line at 11:34 I can only assume that connectivity was lost sometime between 11:14:53 and 11:34:53.
>>
>> It would really help our understanding of the problem if you could please provide some more details of your visit to the PAPER container (e.g. time of day, duration of visit, actions taken, etc). I suspect it's either a power problem, a network problem, or a system problem (e.g. bad RAM).
>> The problem is isolated to "pf4" (or its associated cables); all the other ROACH2s seem fine.
>>
>> Thanks again,
>> Dave
>>
>> On Jan 5, 2015, at 2:36 AM, Matthys Maree wrote:
>>
>>> Hi
>>>
>>> Roach#4 back ON.
>>>
>>> Probably the power cable.
>>>
>>> Cooling still fine inside container.
>>>
>>> Groete
>>>
>>> Matthys Maree
>>> SKA South Africa – Carnarvon
>>>
>>> Tel: 021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
>>> Web : www.ska.ac.za
>>>
>>> From: danny jacobs [mailto:[email protected]]
>>> Sent: 31 December 2014 06:57 AM
>>> To: David DeBoer; PAPER List; Matthys Maree; Matt Dexter
>>> Subject: Fwd: recalcitrant roach
>>>
>>> Hi Matthys (cc PAPER),
>>>
>>> One of our ROACHs has stopped responding. A power issue seems most likely. What with the heat cycling, it's possible that its power cable has loosened (or maybe even the ethernet). A failing power supply is also possible. Could you, or someone like you, double check that ROACH #4 is getting power and shows an ethernet light?
>>>
>>> Thanks,
>>>
>>> ~Danny
>>>
>>>
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: David MacMahon <[email protected]>
>>> Date: Tue, Dec 30, 2014 at 12:21 PM
>>> Subject: Re: recalcitrant roach
>>> To: danny jacobs <[email protected]>
>>> Cc: Matt Dexter <[email protected]>
>>>
>>>
>>> Hi, Danny,
>>>
>>> pf4 seems to be having problems. These problems seem to have started on December 19. The roach2s log a "syslog restart" line when they boot. I've extracted the December restart messages from the log files:
>>>
>>> pf1:2014 Dec 19 08:46:09 syslogd 1.5.0#6: restart.
>>> pf2:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
>>> pf3:2014 Dec 19 08:46:17 syslogd 1.5.0#6: restart.
>>> pf5:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
>>> pf6:2014 Dec 19 08:46:16 syslogd 1.5.0#6: restart.
>>> pf7:2014 Dec 19 08:46:16 syslogd 1.5.0#6: restart.
>>> pf8:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
>>>
>>> pf4:2014 Dec 19 09:45:58 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 19 10:28:00 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 19 11:36:52 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 19 15:10:14 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 19 16:28:23 syslogd 1.5.0#6: restart.
>>>
>>> pf4:2014 Dec 20 23:17:49 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 23 23:55:54 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 26 20:33:59 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 28 02:26:52 syslogd 1.5.0#6: restart.
>>>
>>> pf1:2014 Dec 30 18:26:41 syslogd 1.5.0#6: restart.
>>> pf2:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
>>> pf3:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
>>> pf5:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
>>> pf6:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
>>> pf7:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
>>> pf8:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
>>>
>>> As you can see, pf4 did not restart on Dec 19 with the rest of the roach2s at 08:46. It restarted almost an hour later at 9:45. It then restarted several times throughout the day on the 19th. It also restarted sporadically a few days since then with the most recent being on Dec 28 at 02:26. The last log message for pf4 was Dec 28 03:06. It went down sometime in the next 20 minutes after that.
>>>
>>> I'm guessing it's a flaky power issue. Hopefully just power cord that got loose at one end or the other during the shutdown. If it's not that then I'd guess it's something internal to the power supply?
>>>
>>> I've CC'd Matt in case he has any other ideas.
>>>
>>> It would probably be a good idea to have someone check on the power cables.
>>>
>>> Thanks,
>>> Dave
>>>
>>> On Dec 30, 2014, at 8:32 AM, danny jacobs wrote:
>>>
>>>> Hi Dave,
>>>>
>>>> I thought I'd give PAPER a boot up and see if we could break the A/C but it looks like we may have a dead roach. #4 doesn't respond to pings even after power cycling.
>>>> Just in case there was some mislabeling on the roachpdu apc page I even rebooted all of them. All go down, all come back... except for #4.
>>>>
>>>> Could you maybe take a look and confirm?
>>>>
>>>> Thanks,
>>>> ~Danny
>>>>
>>>>
>>>> --
>>>>
>>>> National Science Foundation Fellow
>>>> Arizona State University
>>>> School of Earth and Space Exploration Low Frequency Cosmology
>>>> Phone: (505) 500 4521
>>>> Homepage: http://loco.lab.asu.edu/danny_jacobs/
>>>
>>>
>>>
>>>
>>> --
>>>
>>> National Science Foundation Fellow
>>> Arizona State University
>>> School of Earth and Space Exploration Low Frequency Cosmology
>>> Phone: (505) 500 4521
>>> Homepage: http://loco.lab.asu.edu/danny_jacobs/
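
The log reasoning used in this thread (a "syslogd ... restart" line marks a boot, and a "-- MARK --" line is expected after 20 minutes of logging inactivity) could be sketched in Python roughly as follows. This is only an illustration, not a script anyone in the thread actually ran: the function names `parse_timestamp`, `boot_sessions`, and `outage_window` are hypothetical, and the year has to be supplied by hand because the log lines do not carry one.

```python
from datetime import datetime, timedelta

# Per the thread: a "-- MARK --" line is logged after 20 minutes of inactivity.
MARK_INTERVAL = timedelta(minutes=20)


def parse_timestamp(line, year):
    """Parse the leading "Mon D HH:MM:SS" of a syslog line (year is assumed)."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def boot_sessions(lines, year):
    """Return (boot_time, last_log_time) for each "syslogd ... restart" line."""
    sessions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stamp = parse_timestamp(line, year)
        if "syslogd" in line and "restart" in line:
            sessions.append([stamp, stamp])   # a new boot starts a new session
        elif sessions:
            sessions[-1][1] = stamp           # extend the current session
    return [tuple(s) for s in sessions]


def outage_window(last_log_time):
    """If no MARK followed, the unit went quiet within one MARK interval."""
    return last_log_time, last_log_time + MARK_INTERVAL


# Two of the pf4 boots quoted above (Jan 7, 2015, SAST).
log = [
    "Jan 7 10:12:26 syslogd 1.5.0#6: restart.",
    "Jan 7 10:12:27 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.",
    "Jan 7 11:02:16 syslogd 1.5.0#6: restart.",
    "Jan 7 11:02:18 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.",
]

sessions = boot_sessions(log, 2015)
for (boot, last), following in zip(sessions, sessions[1:]):
    if following[0] - last > MARK_INTERVAL:
        start, end = outage_window(last)
        print(f"booted {boot}, went quiet between {start} and {end}")
```

For the 10:12:26 boot, the upper bound of the window is the missing 10:32:27 "MARK" mentioned above. The on-site observation that pf4 was still ON at 10:20 narrows the window further, which the logs alone cannot do.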
