Hi, Matthys,

Thank you very much for that detailed report!  I agree with your assessment 
that pf4 is behaving strangely.  It is a useful piece of new data that pf4 
actually powers off spontaneously.  That had been suspected, but now it is 
confirmed.  I don't think we have a spare ROACH2 on site.  We might have to 
carry on without it (PSA112?) for a while.

Here's what the logfile for pf4 shows since (but not including) Jan 5 11:14:53. 
 These pairs of lines are the syslogd restart line followed by the last log 
line from that power cycle.

A couple of spontaneous power cycles(?!)...

> Jan  6 12:45:16 syslogd 1.5.0#6: restart.
> Jan  6 12:45:17 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
> 
> Jan  7 02:08:04 syslogd 1.5.0#6: restart.
> Jan  7 02:08:05 r2d020671 sshd[542]: Server listening on 0.0.0.0 port 22.

The start of your work...

> Jan  7 09:53:16 syslogd 1.5.0#6: restart.
> Jan  7 09:53:17 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
> 
> Jan  7 10:05:49 syslogd 1.5.0#6: restart.
> Jan  7 10:05:51 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
> 
> Jan  7 10:12:26 syslogd 1.5.0#6: restart.
> Jan  7 10:12:27 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.

You left at 10:20 with pf4 ON, but we didn't get the 10:32:27 "MARK" message so 
it must have powered OFF between 10:20 and 10:32.
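
The arithmetic behind that window can be written out as a minimal Python sketch. It assumes the 20-minute "MARK" interval described in the Jan 6 message quoted below, and that a MARK would have been logged exactly 20 minutes after the last log line. The function name `power_off_window` and its arguments are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: syslogd logs "-- MARK --" after 20 minutes of logging inactivity.
MARK_INTERVAL = timedelta(minutes=20)

def power_off_window(last_log_time, last_seen_on=None):
    """Return (earliest, latest) times at which the machine could have gone down.

    last_log_time: timestamp of the final log line before the gap.
    last_seen_on:  optional later time at which someone saw the unit powered on.
    """
    # The machine must have been down by the time the MARK was due.
    latest = last_log_time + MARK_INTERVAL
    earliest = last_log_time
    if last_seen_on is not None and last_seen_on > earliest:
        earliest = last_seen_on
    return earliest, latest

last_line = datetime(2015, 1, 7, 10, 12, 27)  # sshd line from the 10:12 boot
seen_on = datetime(2015, 1, 7, 10, 20, 0)     # Matthys left with pf4 ON
earliest, latest = power_off_window(last_line, seen_on)
print(earliest.time(), latest.time())  # prints: 10:20:00 10:32:27
```

Without an on-site observation the window is simply the 20 minutes after the last log line, which is how the Jan 5 estimate (11:14:53 to 11:34:53) comes out.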

Here is when you found pf4 OFF and powered it up a few times...

> Jan  7 11:02:16 syslogd 1.5.0#6: restart.
> Jan  7 11:02:18 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
> 
> Jan  7 11:06:28 syslogd 1.5.0#6: restart.
> Jan  7 11:06:29 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
> 
> Jan  7 11:09:46 syslogd 1.5.0#6: restart.
> Jan  7 20:09:46 r2d020671 -- MARK --

It's currently still running, but I don't know how long it will last.
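
For anyone who wants to produce restart/last-line pairs like the ones above from a raw log, here is one possible way, as a minimal Python sketch. It assumes every boot starts with a "syslogd ... restart." line; the function name `boot_cycles` is hypothetical, and the sample uses only lines already shown above (the lines in between are omitted).

```python
def boot_cycles(lines):
    """Yield (restart_line, last_line) for each boot found in the log lines."""
    restart = last = None
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        if "syslogd" in line and line.endswith("restart."):
            # A new boot begins; emit the previous one, if any.
            if restart is not None:
                yield restart, last
            restart = last = line
        elif restart is not None:
            last = line  # keep track of the most recent line of this boot
    if restart is not None:
        yield restart, last

sample = """\
Jan  7 11:06:28 syslogd 1.5.0#6: restart.
Jan  7 11:06:29 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
Jan  7 11:09:46 syslogd 1.5.0#6: restart.
Jan  7 20:09:46 r2d020671 -- MARK --
""".splitlines()

for restart_line, last_line in boot_cycles(sample):
    print(restart_line)
    print(last_line)
    print()
```

A boot whose last line is an "sshd" line rather than a "MARK" is one that stayed up for less than 20 minutes.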

Thanks again,
Dave

On Jan 7, 2015, at 3:42 AM, Matthys Maree wrote:

> OK
> 
> 9:43:
> - Enter PAPER Container.
> - All internal lights OFF.  Wall sockets OFF.  This was strange yes, but I
> do not think this must be linked with the Roach4 problem.  Earth leakage
> tripped inside DB on the lights and wall sockets.(We had a dip in
> electricity supply from ESKOM yesterday afternoon for some reason, and might
> have to do with this)
> - All racks ON except for PF#4 which was OFF again.
> 
> 9:54:
> - Pushed kettle plug hard at back of unit, and it turned ON automatically.
> 
> 10:00:
> - Swopped kettle plugs (only the ends that connect into the ROACHes),
> between units #3 and #4.
> - #3 turned ON without problems.
> - #4 needed some hard pushing on the kettle plug before turning ON, and
> died again after some seconds.
> 
> 10:05:
> - Pulled kettle plug out again from #4, while in the OFF state.  Pushed back
> in lightly.  Did not come ON auto, but did respond to the PWR button on
> front.  Strange YES?
> 
> 10:10:
> - Fiddle with the kettle plug on #4(while ON), to see if it might turn OFF
> due to the fiddling.  This did not affect the status at all, and it remained
> ON.
> - Pulled out kettle plug again.
> 
> 10:12:
> - Pushed back in kettle plug in #4 normally(not with much force to see if
> normal operation work fine).  Unit came ON immediately.
> 
> 10:20:
> - #4 Still ON.  
> - Leave container.
> 
> 11:00:
> - Re-enter container.
> - #4 OFF.
> 
> 11:05:
> First tried PWR button with no response.
> Then RESET button, and unit came ON with the FAULT light ON as RED.
> RESET button seem to turn this unit OFF when pressed now(Maybe part of the
> RESET cycle?), and the PWR button seem to get it ON now.  Now this unit
> confuse me.....
> 
> 11:08:
> - Remove kettle plug from unit in attempt to HARD RESET it.
> 
> 11:10:
> - Re-connect kettle plug.
> - Unit turn ON (Like it should without failure/fault)
> 
> 11:15:
> - #4 still ON without fault.
> - Leave container.
> 
> 
> My conclusion is that something is behaving strange with this Roach#4, and
> not the power supplies/kettle plugs.
> It looks like it turn OFF by itself after a while, maybe because of
> something heating up?
> I would suggest swopping this unit out with a spare one if there is a spare?
> Maybe you can try a power cycle on the PDU for this unit in attempt to get
> it back ON again if you have difficulty.
> 
> 
> (Remember the kettle plugs are still swopped between #3 and #4, only on the
> Roach side.)
> 
> Let me know if I can assist further, maybe with a swop or so.
> 
> 
> 
> Me and Jasper plan to add more gas to the cooling unit on Friday 9 Jan, in
> an attempt to keep the cooling unit running, until the fault/leak or
> whatever is fixed on it later this month hopefully. 
> 
> 
> Groete
> 
> Matthys Maree
> SKA South Africa – Carnarvon
> 
> Tel:       021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
> Web :  www.ska.ac.za
> 
> 
> -----Original Message-----
> From: David MacMahon [mailto:[email protected]] On Behalf Of David
> MacMahon
> Sent: 06 January 2015 07:13 PM
> To: Matthys Maree
> Cc: 'danny jacobs'; 'David DeBoer'; 'PAPER List'; 'Matt Dexter'
> Subject: Re: recalcitrant roach
> 
> Thanks, Matthys, that's very helpful!  The ROACH2s are configured to power
> on automatically when VAC power is applied.  Since you had to push the power
> button to turn it on then I suspect something internal to that ROACH2 is
> unwell.  It sounds like the power cables are (and were) securely connected.
> 
> It would be great if you could please check in on it again tomorrow (Jan 7).
> The things that will be of most interest to us are:
> 
> 1) Is pf4 currently powered on when you arrive at the container?  This will
> tell us whether it is a power problem or a communication problem.  Depending
> on the current "powered on" status, do either 2A or 2B...
> 
> 2A) If pf4 is currently off, does pushing the power button turn it on?
> 
> 2B) If pf4 is currently on, unplug its kettle plug, wait a few seconds,
> reconnect the kettle plug.  Does it turn on automatically when the power
> cable is reconnected?  If not, does pushing the power button turn it on?
> 
> Assuming that pf4 is powered up after doing 2A or 2B, please wait a few
> minutes for it to boot.  I'm not sure how you can tell that the boot has
> completed (maybe the network LEDs will stop their rapid blinking?), but I
> think 3 minutes should be adequate.
> 
> 3) If it is not too difficult to access, you could try swapping the kettle
> plugs for pf4 and pf5.  That way if the symptom moves to pf5 we will know it
> is a problem in the PDU (or power cable?).  If the symptom stays with pf4
> then we'll know it's not the PDU.  If it's easier, you could instead swap
> pf4's kettle plug with pf3's.  If you do this swap, please let us know which
> two you swapped.  This is an optional step.
> 
> 4) If you could check that the RJ-45 network cable is securely attached to
> the back of pf4 that would be reassuring.  This is also an optional step.
> 
> 5) So that we can correlate your actions with what we see in the log files,
> it would be great if you could record the times when things power on and
> when you leave the container.
> 
> 6) Anything else you observe that might be relevant to why pf4 is behaving
> differently from the other ROACH2s.
> 
> Thanks again for your assistance!!!
> 
> Cheers,
> Dave
> 
> On Jan 6, 2015, at 4:57 AM, Matthys Maree wrote:
> 
>> Sorry, I only read this mail now that I am already back from site for
>> today.
>> 
>> What I did yesterday 5 January, was around that time you mentioned.
>> Unfortunately I did not check the exact time.
>> I first tried the "kettle plug" directly on the ROACH#4 machine.  
>> Tried to push it in properly (even if it was not out).  I did not succeed.
>> I traced it down to where it gets its power from (for this I had
>> to bend over and under some cables!  Could easily have pulled a cable
>> slightly off something with this attempt).
>> On the Power supply unit where all the kettle plugs get power from, I 
>> did the same by ensuring proper connection.
>> Still not successful.
>> I went back to Roach #4 power inlet, pushed again, and tried Power 
>> button on front of Roach.  Now it turned ON.
>> So I assumed it was either on the bottom PDU unit or top connection.
>> 
>> I was probably in the container  for +/- 20minutes.
>> 
>> Please let me know if you need me to try something in there again.  I 
>> can have a look tomorrow(7 Jan).
>> 
>> 
>> Groete
>> 
>> Matthys Maree
>> SKA South Africa – Carnarvon
>> 
>> Tel:       021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
>> Web :  www.ska.ac.za
>> 
>> -----Original Message-----
>> From: David MacMahon [mailto:[email protected]] On Behalf Of David 
>> MacMahon
>> Sent: 06 January 2015 08:12 AM
>> To: Matthys Maree
>> Cc: 'danny jacobs'; 'David DeBoer'; 'PAPER List'; 'Matt Dexter'
>> Subject: Re: recalcitrant roach
>> 
>> Thanks and Happy New Year, Matthys!  We really appreciate having your 
>> on-site support!!!
>> 
>> Unfortunately, we're still not able to access this machine 
>> ("r2d020671", aka
>> "pf4") via the network.  Here is what we see in the log file for that
>> system:
>> 
>>> Dec 26 21:13:58 r2d020671 -- MARK --
>>> Dec 28 02:26:52 syslogd 1.5.0#6: restart.
>>> [...]
>>> Dec 28 03:06:52 r2d020671 -- MARK --
>>> Dec 30 21:24:08 syslogd 1.5.0#6: restart.
>>> [...]
>>> Dec 30 21:44:08 r2d020671 -- MARK --
>>> Jan  2 03:01:47 syslogd 1.5.0#6: restart.
>>> [...]
>>> Jan  2 03:41:47 r2d020671 -- MARK --
>>> Jan  5 11:14:52 syslogd 1.5.0#6: restart.
>>> [...]
>>> Jan  5 11:14:53 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>> 
>> The "MARK" messages get logged after 20 minutes of logging inactivity
>> and the "syslogd ... restart" lines get logged when the machine
>> reboots.  The final "sshd" line is the last line in the log file.  The
>> timestamps are SAST (UTC+2).  Since we didn't get the expected "MARK"
>> line at 11:34 I can only assume that connectivity was lost sometime
>> between 11:14:53 and 11:34:53.
>> 
>> It would really help our understanding of the problem if you could 
>> please provide some more details of your visit to the PAPER container 
>> (e.g. time of day, duration of visit, actions taken, etc).  I suspect 
>> it's either a power problem, a network problem, or a system problem 
>> (e.g. bad RAM).  The problem is isolated to "pf4" (or its associated 
>> cables); all the other ROACH2s seem fine.
>> 
>> Thanks again,
>> Dave
>> 
>> On Jan 5, 2015, at 2:36 AM, Matthys Maree wrote:
>> 
>>> Hi
>>> 
>>> Roach#4 back ON.
>>> 
>>> Probably the power cable.
>>> 
>>> Cooling still fine inside container.
>>> 
>>> Groete
>>> 
>>> Matthys Maree
>>> SKA South Africa – Carnarvon
>>> 
>>> Tel:       021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
>>> Web :  www.ska.ac.za
>>> 
>>> From: danny jacobs [mailto:[email protected]]
>>> Sent: 31 December 2014 06:57 AM
>>> To: David DeBoer; PAPER List; Matthys Maree; Matt Dexter
>>> Subject: Fwd: recalcitrant roach
>>> 
>>> Hi Matthys (cc PAPER),
>>> 
>>> One of our ROACHs has stopped responding.  A power issue seems most
>>> likely. What with the heat cycling, it's possible that its power cable
>>> has loosened (or maybe even the ethernet). A failing power supply is
>>> also possible.  Could you, or someone like you, double check that
>>> ROACH #4 is getting power and shows an ethernet light?
>>> 
>>> Thanks,
>>> 
>>> ~Danny
>>> 
>>> 
>>> 
>>> 
>>> ---------- Forwarded message ----------
>>> From: David MacMahon <[email protected]>
>>> Date: Tue, Dec 30, 2014 at 12:21 PM
>>> Subject: Re: recalcitrant roach
>>> To: danny jacobs <[email protected]>
>>> Cc: Matt Dexter <[email protected]>
>>> 
>>> 
>>> Hi, Danny,
>>> 
>>> pf4 seems to be having problems.  These problems seem to have started
>>> on December 19.  The roach2s log a "syslog restart" line when they boot.
>>> I've extracted the December restart messages from the log files:
>>> 
>>> pf1:2014 Dec 19 08:46:09 syslogd 1.5.0#6: restart.
>>> pf2:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
>>> pf3:2014 Dec 19 08:46:17 syslogd 1.5.0#6: restart.
>>> pf5:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
>>> pf6:2014 Dec 19 08:46:16 syslogd 1.5.0#6: restart.
>>> pf7:2014 Dec 19 08:46:16 syslogd 1.5.0#6: restart.
>>> pf8:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
>>> 
>>> pf4:2014 Dec 19 09:45:58 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 19 10:28:00 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 19 11:36:52 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 19 15:10:14 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 19 16:28:23 syslogd 1.5.0#6: restart.
>>> 
>>> pf4:2014 Dec 20 23:17:49 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 23 23:55:54 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 26 20:33:59 syslogd 1.5.0#6: restart.
>>> pf4:2014 Dec 28 02:26:52 syslogd 1.5.0#6: restart.
>>> 
>>> pf1:2014 Dec 30 18:26:41 syslogd 1.5.0#6: restart.
>>> pf2:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
>>> pf3:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
>>> pf5:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
>>> pf6:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
>>> pf7:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
>>> pf8:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
>>> 
>>> As you can see, pf4 did not restart on Dec 19 with the rest of the
>>> roach2s at 08:46.  It restarted almost an hour later at 09:45.  It then
>>> restarted several times throughout the day on the 19th.  It also
>>> restarted sporadically on a few days since then, with the most recent
>>> being on Dec 28 at 02:26.  The last log message for pf4 was Dec 28
>>> 03:06.  It went down sometime in the next 20 minutes after that.
>>> 
>>> I'm guessing it's a flaky power issue.  Hopefully just a power cord
>>> that got loose at one end or the other during the shutdown.  If it's not
>>> that then I'd guess it's something internal to the power supply?
>>> 
>>> I've CC'd Matt in case he has any other ideas.
>>> 
>>> It would probably be a good idea to have someone check on the power
>>> cables.
>>> 
>>> Thanks,
>>> Dave
>>> 
>>> On Dec 30, 2014, at 8:32 AM, danny jacobs wrote:
>>> 
>>>> Hi Dave,
>>>> 
>>>> I thought I'd give PAPER a boot up and see if we could break the A/C
>>>> but it looks like we may have a dead roach.  #4 doesn't respond to pings
>>>> even after power cycling. Just in case there was some mislabeling on
>>>> the roachpdu apc page I even rebooted all of them.  All go down, all
>>>> come back... except for #4.
>>>> 
>>>> Could you maybe take a look and confirm?
>>>> 
>>>> Thanks,
>>>> ~Danny
>>>> 
>>>> 
>>>> --
>>>> 
>>>> National Science Foundation Fellow
>>>> Arizona State University
>>>> School of Earth and Space Exploration Low Frequency Cosmology
>>>> Phone:           (505) 500 4521
>>>> Homepage:     http://loco.lab.asu.edu/danny_jacobs/
>>> 
>>> 
>>> 
>>> 
>>> --
>>> 
>>> National Science Foundation Fellow
>>> Arizona State University
>>> School of Earth and Space Exploration Low Frequency Cosmology
>>> Phone:           (505) 500 4521
>>> Homepage:     http://loco.lab.asu.edu/danny_jacobs/
>> 
> 

