Can we start to prep a spare to be shipped to site?

Sent from my iPhone

> On Jan 7, 2015, at 10:27 AM, David MacMahon <[email protected]> wrote:
> 
> Hi, Matthys,
> 
> Thank you very much for that detailed report!  I agree with your assessment 
> that pf4 is behaving strangely.  It is a useful piece of new data that pf4 
> actually powers off spontaneously.  That had been suspected, but now it is 
> confirmed.  I don't think we have a spare ROACH2 on site.  We might have to 
> carry on without it (PSA112?) for a while.
> 
> Here's what the logfile for pf4 shows since (but not including) Jan 5 
> 11:14:53.  These pairs of lines are the syslogd restart line followed by the 
> last log line from that power cycle.
> 
> A couple of spontaneous power cycles(?!)...
> 
>> Jan  6 12:45:16 syslogd 1.5.0#6: restart.
>> Jan  6 12:45:17 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>> 
>> Jan  7 02:08:04 syslogd 1.5.0#6: restart.
>> Jan  7 02:08:05 r2d020671 sshd[542]: Server listening on 0.0.0.0 port 22.
> 
> The start of your work...
> 
>> Jan  7 09:53:16 syslogd 1.5.0#6: restart.
>> Jan  7 09:53:17 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>> 
>> Jan  7 10:05:49 syslogd 1.5.0#6: restart.
>> Jan  7 10:05:51 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>> 
>> Jan  7 10:12:26 syslogd 1.5.0#6: restart.
>> Jan  7 10:12:27 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
> 
> You left at 10:20 with pf4 ON, but we didn't get the 10:32:27 "MARK" message 
> so it must have powered OFF between 10:20 and 10:32.
> 
> Here is when you found pf4 OFF and powered it up a few times...
> 
>> Jan  7 11:02:16 syslogd 1.5.0#6: restart.
>> Jan  7 11:02:18 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>> 
>> Jan  7 11:06:28 syslogd 1.5.0#6: restart.
>> Jan  7 11:06:29 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>> 
>> Jan  7 11:09:46 syslogd 1.5.0#6: restart.
>> Jan  7 20:09:46 r2d020671 -- MARK --
> 
> It's currently still running, but I don't know how long it will last.
> 
> Thanks again,
> Dave
> 
>> On Jan 7, 2015, at 3:42 AM, Matthys Maree wrote:
>> 
>> OK
>> 
>> 9:43:
>> - Enter PAPER Container.
>> - All internal lights OFF.  Wall sockets OFF.  This was strange, yes, but I
>> do not think it is linked to the Roach4 problem.  Earth leakage had
>> tripped inside the DB on the lights and wall sockets.  (We had a dip in
>> electricity supply from ESKOM yesterday afternoon for some reason, which
>> might have to do with this.)
>> - All racks ON except for PF#4 which was OFF again.
>> 
>> 9:54:
>> - Pushed kettle plug hard at back of unit, and it turned ON automatically.
>> 
>> 10:00:
>> - Swopped kettle plugs (only the ends that connect into the ROACHes),
>> between units #3 and #4.
>> - #3 turned ON without problems.
>> - #4 needed some hard pushing from the kettle plug before turning ON, and
>> died again after some seconds.
>> 
>> 10:05:
>> - Pulled kettle plug out again from #4, while in the OFF state.  Pushed back
>> in lightly.  Did not come ON auto, but did respond to the PWR button on
>> front.  Strange YES?
>> 
>> 10:10:
>> - Fiddle with the kettle plug on #4(while ON), to see if it might turn OFF
>> due to the fiddling.  This did not affect the status at all, and it remained
>> ON.
>> - Pulled out kettle plug again.
>> 
>> 10:12:
>> - Pushed the kettle plug back into #4 normally (without much force, to see
>> if normal operation works fine).  Unit came ON immediately.
>> 
>> 10:20:
>> - #4 Still ON.  
>> - Leave container.
>> 
>> 11:00:
>> - Re-enter container.
>> - #4 OFF.
>> 
>> 11:05:
>> First tried PWR button with no response.
>> Then RESET button, and unit came ON with the FAULT light ON as RED.
>> RESET button seems to turn this unit OFF when pressed now (maybe part of the
>> RESET cycle?), and the PWR button seems to get it ON now.  Now this unit
>> confuses me.....
>> 
>> 11:08:
>> - Remove kettle plug from unit in attempt to HARD RESET it.
>> 
>> 11:10:
>> - Re-connect kettle plug.
>> - Unit turn ON (Like it should without failure/fault)
>> 
>> 11:15:
>> - #4 still ON without fault.
>> - Leave container.
>> 
>> 
>> My conclusion is that something is behaving strangely with this Roach#4
>> itself, and not with the power supplies/kettle plugs.
>> It looks like it turns OFF by itself after a while, maybe because of
>> something heating up?
>> I would suggest swopping this unit out with a spare one, if there is a spare.
>> Maybe you can try a power cycle on the PDU for this unit in attempt to get
>> it back ON again if you have difficulty.
>> 
>> 
>> (Remember the kettle plugs are still swopped between #3 and #4, only on the
>> Roach side.)
>> 
>> Let me know if I can assist further, maybe with a swop or so.
>> 
>> 
>> 
>> Me and Jasper plan to add more gas to the cooling unit on Friday 9 Jan, in
>> an attempt to keep the cooling unit running, until the fault/leak or
>> whatever is fixed on it later this month hopefully. 
>> 
>> 
>> Groete
>> 
>> Matthys Maree
>> SKA South Africa – Carnarvon
>> 
>> Tel:       021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
>> Web :  www.ska.ac.za
>> 
>> 
>> -----Original Message-----
>> From: David MacMahon [mailto:[email protected]] On Behalf Of David
>> MacMahon
>> Sent: 06 January 2015 07:13 PM
>> To: Matthys Maree
>> Cc: 'danny jacobs'; 'David DeBoer'; 'PAPER List'; 'Matt Dexter'
>> Subject: Re: recalcitrant roach
>> 
>> Thanks, Matthys, that's very helpful!  The ROACH2s are configured to power
>> on automatically when VAC power is applied.  Since you had to push the power
>> button to turn it on then I suspect something internal to that ROACH2 is
>> unwell.  It sounds like the power cables are (and were) securely connected.
>> 
>> It would be great if you could please check in on it again tomorrow (Jan 7).
>> The things that will be of most interest to us are:
>> 
>> 1) Is pf4 currently powered on when you arrive at the container?  This will
>> tell us whether it is a power problem or a communication problem.  Depending
>> on the current "powered on" status, do either 2A or 2B...
>> 
>> 2A) If pf4 is currently off, does pushing the power button turn it on?
>> 
>> 2B) If pf4 is currently on, unplug its kettle plug, wait a few seconds,
>> reconnect the kettle plug.  Does it turn on automatically when the power
>> cable is reconnected?  If not, does pushing the power button turn it on?
>> 
>> Assuming that pf4 is powered up after doing 2A or 2B, please wait a few
>> minutes for it to boot.  I'm not sure how you can tell that the boot has
>> completed (maybe the network LEDs will stop their rapid blinking?), but I
>> think 3 minutes should be adequate.
>> 
>> 3) If it is not too difficult to access, you could try swapping the kettle
>> plugs for pf4 and pf5.  That way if the symptom moves to pf5 we will know it
>> is a problem in the PDU (or power cable?).  If the symptom stays with pf4
>> then we'll know it's not the PDU.  If it's easier, you could instead swap
>> pf4's kettle plug with pf3's.  If you do this swap, please let us know which
>> two you swapped.  This is an optional step.
>> 
>> 4) If you could check that the RJ-45 network cable is securely attached to
>> the back of pf4 that would be reassuring.  This is also an optional step.
>> 
>> 5) So that we can correlate your actions with what we see in the log files,
>> it would be great if you could record the times when things power on and
>> when you leave the container.
>> 
>> 6) Anything else you observe that might be relevant to why pf4 is behaving
>> differently from the other ROACH2s.
>> 
>> Thanks again for your assistance!!!
>> 
>> Cheers,
>> Dave
>> 
>>> On Jan 6, 2015, at 4:57 AM, Matthys Maree wrote:
>>> 
>>> Sorry, I only read this mail now that I am already back from site for
>>> today.
>>> 
>>> What I did yesterday 5 January, was around that time you mentioned.
>>> Unfortunately I did not check the exact time.
>>> I first tried the "kettle plug" directly on the ROACH#4 machine.
>>> Tried to push it in properly (even if it was not out).  I did not succeed.
>>> I traced it down to where it gets its power from.  (For this I had
>>> to bend over and under some cables!  I could easily have pulled a cable
>>> slightly off something in the process.)
>>> On the Power supply unit where all the kettle plugs get power from, I 
>>> did the same by ensuring proper connection.
>>> Still not successful.
>>> I went back to Roach #4 power inlet, pushed again, and tried Power 
>>> button on front of Roach.  Now it turned ON.
>>> So I assumed it was either on the bottom PDU unit or top connection.
>>> 
>>> I was probably in the container  for +/- 20minutes.
>>> 
>>> Please let me know if you need me to try something in there again.  I 
>>> can have a look tomorrow(7 Jan).
>>> 
>>> 
>>> Groete
>>> 
>>> Matthys Maree
>>> SKA South Africa – Carnarvon
>>> 
>>> Tel:       021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
>>> Web :  www.ska.ac.za
>>> 
>>> -----Original Message-----
>>> From: David MacMahon [mailto:[email protected]] On Behalf Of David 
>>> MacMahon
>>> Sent: 06 January 2015 08:12 AM
>>> To: Matthys Maree
>>> Cc: 'danny jacobs'; 'David DeBoer'; 'PAPER List'; 'Matt Dexter'
>>> Subject: Re: recalcitrant roach
>>> 
>>> Thanks and Happy New Year, Matthys!  We really appreciate having your 
>>> on-site support!!!
>>> 
>>> Unfortunately, we're still not able to access this machine 
>>> ("r2d020671", aka
>>> "pf4") via the network.  Here is what we see in the log file for that
>>> system:
>>> 
>>>> Dec 26 21:13:58 r2d020671 -- MARK --
>>>> Dec 28 02:26:52 syslogd 1.5.0#6: restart.
>>>> [...]
>>>> Dec 28 03:06:52 r2d020671 -- MARK --
>>>> Dec 30 21:24:08 syslogd 1.5.0#6: restart.
>>>> [...]
>>>> Dec 30 21:44:08 r2d020671 -- MARK --
>>>> Jan  2 03:01:47 syslogd 1.5.0#6: restart.
>>>> [...]
>>>> Jan  2 03:41:47 r2d020671 -- MARK --
>>>> Jan  5 11:14:52 syslogd 1.5.0#6: restart.
>>>> [...]
>>>> Jan  5 11:14:53 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.
>>> 
>>> The "MARK" messages get logged after 20 minutes of logging inactivity 
>>> and the "syslogd ... restart" lines get logged when the machine 
>>> reboots.  The final "sshd" line is the last line in the log file.  The 
>>> timestamps are SAST (UTC+2).  Since we didn't get the expected "MARK" 
>>> line at 11:34 I can only assume that connectivity was lost sometime
>>> between 11:14:53 and 11:34:53.
>>> 
>>> It would really help our understanding of the problem if you could 
>>> please provide some more details of your visit to the PAPER container 
>>> (e.g. time of day, duration of visit, actions taken, etc).  I suspect 
>>> it's either a power problem, a network problem, or a system problem 
>>> (e.g. bad RAM).  The problem is isolated to "pf4" (or its associated 
>>> cables); all the other ROACH2s seem fine.
>>> 
>>> Thanks again,
>>> Dave
>>> 
>>>> On Jan 5, 2015, at 2:36 AM, Matthys Maree wrote:
>>>> 
>>>> Hi
>>>> 
>>>> Roach#4 back ON.
>>>> 
>>>> Probably the power cable.
>>>> 
>>>> Cooling still fine inside container.
>>>> 
>>>> Groete
>>>> 
>>>> Matthys Maree
>>>> SKA South Africa – Carnarvon
>>>> 
>>>> Tel:       021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
>>>> Web :  www.ska.ac.za
>>>> 
>>>> From: danny jacobs [mailto:[email protected]]
>>>> Sent: 31 December 2014 06:57 AM
>>>> To: David DeBoer; PAPER List; Matthys Maree; Matt Dexter
>>>> Subject: Fwd: recalcitrant roach
>>>> 
>>>> Hi Matthys (cc PAPER),
>>>> 
>>>> One of our ROACHs has stopped responding.  A power issue seems most
>>>> likely. What with the heat cycling, it's possible that its power cable
>>>> has loosened (or maybe even the ethernet). A failing power supply is
>>>> also possible.  Could you, or someone like you, double check that
>>>> ROACH #4 is getting power and shows an ethernet light?
>>>> 
>>>> Thanks,
>>>> 
>>>> ~Danny
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ---------- Forwarded message ----------
>>>> From: David MacMahon <[email protected]>
>>>> Date: Tue, Dec 30, 2014 at 12:21 PM
>>>> Subject: Re: recalcitrant roach
>>>> To: danny jacobs <[email protected]>
>>>> Cc: Matt Dexter <[email protected]>
>>>> 
>>>> 
>>>> Hi, Danny,
>>>> 
>>>> pf4 seems to be having problems.  These problems seem to have started
>>>> on December 19.  The roach2s log a "syslog restart" line when they boot.
>>>> I've extracted the December restart messages from the log files:
>>>> 
>>>> pf1:2014 Dec 19 08:46:09 syslogd 1.5.0#6: restart.
>>>> pf2:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
>>>> pf3:2014 Dec 19 08:46:17 syslogd 1.5.0#6: restart.
>>>> pf5:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
>>>> pf6:2014 Dec 19 08:46:16 syslogd 1.5.0#6: restart.
>>>> pf7:2014 Dec 19 08:46:16 syslogd 1.5.0#6: restart.
>>>> pf8:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
>>>> 
>>>> pf4:2014 Dec 19 09:45:58 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 19 10:28:00 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 19 11:36:52 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 19 15:10:14 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 19 16:28:23 syslogd 1.5.0#6: restart.
>>>> 
>>>> pf4:2014 Dec 20 23:17:49 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 23 23:55:54 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 26 20:33:59 syslogd 1.5.0#6: restart.
>>>> pf4:2014 Dec 28 02:26:52 syslogd 1.5.0#6: restart.
>>>> 
>>>> pf1:2014 Dec 30 18:26:41 syslogd 1.5.0#6: restart.
>>>> pf2:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
>>>> pf3:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
>>>> pf5:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
>>>> pf6:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
>>>> pf7:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
>>>> pf8:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
>>>> 
>>>> As you can see, pf4 did not restart on Dec 19 with the rest of the
>>>> roach2s at 08:46.  It restarted almost an hour later at 9:45.  It then
>>>> restarted several times throughout the day on the 19th.  It also
>>>> restarted sporadically a few days since then with the most recent
>>>> being on Dec 28 at 02:26.  The last log message for pf4 was Dec 28
>>>> 03:06.  It went down sometime in the next 20 minutes after that.
>>>> 
>>>> I'm guessing it's a flaky power issue.  Hopefully just a power cord
>>>> that got loose at one end or the other during the shutdown.  If it's not
>>>> that then I'd guess it's something internal to the power supply?
>>>> 
>>>> I've CC'd Matt in case he has any other ideas.
>>>> 
>>>> It would probably be a good idea to have someone check on the power
>>>> cables.
>>>> 
>>>> Thanks,
>>>> Dave
>>>> 
>>>>> On Dec 30, 2014, at 8:32 AM, danny jacobs wrote:
>>>>> 
>>>>> Hi Dave,
>>>>> 
>>>>> I thought I'd give PAPER a boot up and see if we could break the A/C
>>>>> but it looks like we may have a dead roach.  #4 doesn't respond to pings
>>>>> even after power cycling. Just in case there was some mislabeling on
>>>>> the roachpdu apc page I even rebooted all of them.  All go down, all
>>>>> come back... except for #4.
>>>>> 
>>>>> Could you maybe take a look and confirm?
>>>>> 
>>>>> Thanks,
>>>>> ~Danny
>>>>> 
>>>>> 
>>>>> --
>>>>> 
>>>>> National Science Foundation Fellow
>>>>> Arizona State University
>>>>> School of Earth and Space Exploration Low Frequency Cosmology
>>>>> Phone:           (505) 500 4521
>>>>> Homepage:     http://loco.lab.asu.edu/danny_jacobs/
>>>> 
> 
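The log reasoning used throughout this thread (a "syslogd ... restart" line marks a boot, and a "MARK" line is expected after 20 minutes of logging inactivity) could be automated roughly as follows. This is only a sketch, not the script anyone on the thread actually used: the function names and the `year` argument are made up for illustration (the log lines carry no year), and the sample lines are copied from the pf4 excerpts quoted above.

```python
from datetime import datetime, timedelta

# From the thread: a "-- MARK --" line is logged after 20 minutes
# of logging inactivity.
MARK_INTERVAL = timedelta(minutes=20)

# Lines copied from the pf4 excerpts quoted in the thread.
SAMPLE_LOG = [
    "Jan  5 11:14:52 syslogd 1.5.0#6: restart.",
    "Jan  5 11:14:53 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.",
    "Jan  7 10:12:26 syslogd 1.5.0#6: restart.",
    "Jan  7 10:12:27 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.",
]


def parse_line(line, year):
    """Split a syslog line into (timestamp, message).

    The year is supplied by the caller (an assumption), because
    these log lines do not record one.
    """
    month, day, clock, message = line.split(None, 3)
    when = datetime.strptime(f"{year} {month} {day} {clock}",
                             "%Y %b %d %H:%M:%S")
    return when, message


def power_cycles(lines, year):
    """Return one (boot_time, last_log_time) pair per syslogd restart."""
    cycles = []
    for line in lines:
        if not line.strip():
            continue
        when, message = parse_line(line, year)
        if message.startswith("syslogd") and message.rstrip().endswith("restart."):
            cycles.append([when, when])   # a restart line opens a new cycle
        elif cycles:
            cycles[-1][1] = when          # later lines extend the current cycle
    return [tuple(cycle) for cycle in cycles]


def silence_window(last_log_time):
    """Window in which the machine went quiet, given that the MARK line
    expected MARK_INTERVAL after the last log line never arrived."""
    return last_log_time, last_log_time + MARK_INTERVAL


if __name__ == "__main__":
    for boot, last in power_cycles(SAMPLE_LOG, 2015):
        start, end = silence_window(last)
        print(f"booted {boot}, last log {last}, went quiet between {start} and {end}")
```

The window is only meaningful for a power cycle whose expected MARK line never showed up; for a cycle that is still logging, the last log time simply keeps moving forward.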
