Hi Matthys,

thanks for all your help with the system.
And great notes.  This is really appreciated.

Here I have nothing too insightful to offer.

Instead, I will share my notes on Roach2 rev2 D01020671:

2013apr29-2013may01
  I was at Digicom and Alex Rust was still in SA helping me test.
  We used a few new BOF files to exercise the QDRs and so on.
  With one of the revised BOF files, D01020671 passed AOK
  and was released to be installed into a chassis and so on.
  D01020671 had no troubles with powering on.

2013sep26 in the Evans Hall Dig Lab
  D01020671 was tested with 2 ADC16x250-8 coax rev2 boards.

2013oct02
  D01020671 has 2 ADC16x250-8 coax rev2 boards and
  has passed all tests and has been released to be part of PAPER.

But I note it wasn't until 2013oct14 that the very troublesome
mismatch between the PCB footprint for U33 and the component U33
was finally identified.  This mismatch caused all sorts of other
Roach2 rev2 boards to have flaky power issues because U33 controls
the power supplies and the mismatch could lead to all sorts of
electrical shorts between critical control signals.

I have no records of Roach2 rev2 D01020671 being sent back to
Digicom to have a new component installed at U33 after a layer
of Kapton tape was installed to prevent the unwanted electrical
shorts.  Thus, perhaps whatever was providing the insulation
(soldermask, dirt, ...) has finally failed and from now on
Roach2 rev2 D01020671 will continue to have power issues.

The U33 rework isn't all that hard for any assembly house
to perform.  The component package is leadless so
a hot air system of some sort is required along with
an experienced operator.  It's probably necessary to
remove the PCB from the chassis.  This isn't hard to
do but it is a nuisance because the SFP+ Mezzanine cards
and the ADC16x250-8 coax rev 2 boards get in the way.

I have no spare Roach2 rev2s.

I have some concerns that others of PAPER's Roach2 rev2s
will start to have problems similar to those D01020671 is
exhibiting.  Fingers crossed that my concerns prove
overly pessimistic.

Matt

On Wed, 7 Jan 2015, Matthys Maree wrote:

Date: Wed, 7 Jan 2015 13:42:47 +0200
From: Matthys Maree <[email protected]>
To: 'David MacMahon' <[email protected]>
Cc: 'danny jacobs' <[email protected]>,
    'David DeBoer' <[email protected]>,
    'PAPER List' <[email protected]>,
    'Matt Dexter' <[email protected]>
Subject: RE: recalcitrant roach

OK

9:43:
- Enter PAPER Container.
- All internal lights OFF.  Wall sockets OFF.  This was strange, yes, but I
do not think it is linked to the Roach4 problem.  The earth leakage breaker
inside the DB had tripped on the lights and wall sockets.  (We had a dip in
the electricity supply from ESKOM yesterday afternoon for some reason, which
might have to do with this.)
- All racks ON except for PF#4 which was OFF again.

9:54:
- Pushed kettle plug hard at back of unit, and it turned ON automatically.

10:00:
- Swopped kettle plugs (only the ends that connect into the ROACHes),
between units #3 and #4.
- #3 turned ON without problems.
- #4 needed some hard pushing of the kettle plug before turning ON, and
died again after some seconds.

10:05:
- Pulled kettle plug out again from #4, while in the OFF state.  Pushed back
in lightly.  Did not come ON auto, but did respond to the PWR button on
front.  Strange YES?

10:10:
- Fiddle with the kettle plug on #4(while ON), to see if it might turn OFF
due to the fiddling.  This did not affect the status at all, and it remained
ON.
- Pulled out kettle plug again.

10:12:
- Pushed kettle plug back into #4 normally (without much force, to see if
normal operation works fine).  Unit came ON immediately.

10:20:
- #4 Still ON.
- Leave container.

11:00:
- Re-enter container.
- #4 OFF.

11:05:
- First tried PWR button, with no response.
- Then RESET button, and unit came ON with the FAULT light ON as RED.
- RESET button seems to turn this unit OFF when pressed now (maybe part of the
RESET cycle?), and the PWR button seems to get it ON now.  Now this unit
confuses me.....

11:08:
- Remove kettle plug from unit in attempt to HARD RESET it.

11:10:
- Re-connect kettle plug.
- Unit turned ON (like it should, without failure/fault).

11:15:
- #4 still ON without fault.
- Leave container.


My conclusion is that something is behaving strangely with this Roach#4, and
not with the power supplies/kettle plugs.
It looks like it turns OFF by itself after a while, maybe because of
something heating up?
I would suggest swopping this unit out with a spare one, if there is a spare.
Maybe you can try a power cycle on the PDU for this unit in an attempt to get
it back ON again if you have difficulty.


(Remember the kettle plugs are still swopped between #3 and #4, only on the
Roach side.)

Let me know if I can assist further, maybe with a swop or so.



Me and Jasper plan to add more gas to the cooling unit on Friday 9 Jan, in
an attempt to keep the cooling unit running, until the fault/leak or
whatever is fixed on it later this month hopefully.


Groete

Matthys Maree
SKA South Africa - Carnarvon

Tel:       021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
Web :  www.ska.ac.za


-----Original Message-----
From: David MacMahon [mailto:[email protected]] On Behalf Of David
MacMahon
Sent: 06 January 2015 07:13 PM
To: Matthys Maree
Cc: 'danny jacobs'; 'David DeBoer'; 'PAPER List'; 'Matt Dexter'
Subject: Re: recalcitrant roach

Thanks, Matthys, that's very helpful!  The ROACH2s are configured to power
on automatically when VAC power is applied.  Since you had to push the power
button to turn it on then I suspect something internal to that ROACH2 is
unwell.  It sounds like the power cables are (and were) securely connected.

It would be great if you could please check in on it again tomorrow (Jan 7).
The things that will be of most interest to us are:

1) Is pf4 currently powered on when you arrive at the container?  This will
tell us whether it is a power problem or a communication problem.  Depending
on the current "powered on" status, do either 2A or 2B...

2A) If pf4 is currently off, does pushing the power button turn it on?

2B) If pf4 is currently on, unplug its kettle plug, wait a few seconds,
reconnect the kettle plug.  Does it turn on automatically when the power
cable is reconnected?  If not, does pushing the power button turn it on?

Assuming that pf4 is powered up after doing 2A or 2B, please wait a few
minutes for it to boot.  I'm not sure how you can tell that the boot has
completed (maybe the network LEDs will stop their rapid blinking?), but I
think 3 minutes should be adequate.

3) If it is not too difficult to access, you could try swapping the kettle
plugs for pf4 and pf5.  That way if the symptom moves to pf5 we will know it
is a problem in the PDU (or power cable?).  If the symptom stays with pf4
then we'll know it's not the PDU.  If it's easier, you could instead swap
pf4's kettle plug with pf3's.  If you do this swap, please let us know which
two you swapped.  This is an optional step.

4) If you could check that the RJ-45 network cable is securely attached to
the back of pf4 that would be reassuring.  This is also an optional step.

5) So that we can correlate your actions with what we see in the log files,
it would be great if you could record the times when things power on and
when you leave the container.

6) Anything else you observe that might be relevant to why pf4 is behaving
differently from the other ROACH2s.

Thanks again for your assistance!!!

Cheers,
Dave

On Jan 6, 2015, at 4:57 AM, Matthys Maree wrote:

Sorry, I only read this mail now that I am already back from site for
today.

What I did yesterday 5 January, was around that time you mentioned.
Unfortunately I did not check the exact time.
I first tried the "kettle plug" directly on the ROACH#4 machine.
Tried to push it in properly (even if it was not out).  I did not succeed.
I traced it down to where it gets its power from.  (For this I had
to bend over and under some cables!  Could easily have pulled a cable
slightly off something with this attempt.)
On the Power supply unit where all the kettle plugs get power from, I
did the same by ensuring proper connection.
Still not successful.
I went back to Roach #4 power inlet, pushed again, and tried Power
button on front of Roach.  Now it turned ON.
So I assumed it was either on the bottom PDU unit or top connection.

I was probably in the container  for +/- 20minutes.

Please let me know if you need me to try something in there again.  I
can have a look tomorrow(7 Jan).


Groete

Matthys Maree
SKA South Africa - Carnarvon

Tel:       021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
Web :  www.ska.ac.za

-----Original Message-----
From: David MacMahon [mailto:[email protected]] On Behalf Of David
MacMahon
Sent: 06 January 2015 08:12 AM
To: Matthys Maree
Cc: 'danny jacobs'; 'David DeBoer'; 'PAPER List'; 'Matt Dexter'
Subject: Re: recalcitrant roach

Thanks and Happy New Year, Matthys!  We really appreciate having your
on-site support!!!

Unfortunately, we're still not able to access this machine ("r2d020671",
aka "pf4") via the network.  Here is what we see in the log file for that
system:

Dec 26 21:13:58 r2d020671 -- MARK --
Dec 28 02:26:52 syslogd 1.5.0#6: restart.
[...]
Dec 28 03:06:52 r2d020671 -- MARK --
Dec 30 21:24:08 syslogd 1.5.0#6: restart.
[...]
Dec 30 21:44:08 r2d020671 -- MARK --
Jan  2 03:01:47 syslogd 1.5.0#6: restart.
[...]
Jan  2 03:41:47 r2d020671 -- MARK --
Jan  5 11:14:52 syslogd 1.5.0#6: restart.
[...]
Jan  5 11:14:53 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22.

The "MARK" messages get logged after 20 minutes of logging inactivity
and the "syslogd ... restart" lines get logged when the machine
reboots.  The final "sshd" line is the last line in the log file.  The
timestamps are SAST (UTC+2).  Since we didn't get the expected "MARK"
line at 11:34 I can only assume that connectivity was lost sometime
between 11:14:53 and 11:34:53.
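
The reasoning above can be sketched in a few lines of Python.  This is
only an illustration, not the tooling used on the PAPER hosts: the
function names are made up for this example, the 20-minute MARK interval
is taken from the description above, and the year has to be supplied
because these syslog lines do not carry one.

```python
from datetime import datetime, timedelta

# Assumption taken from the email: a "-- MARK --" line is written after
# 20 minutes without any other log traffic.
MARK_INTERVAL = timedelta(minutes=20)


def parse_timestamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    The stamp occupies the first 15 characters; the year is passed in
    because the log line itself omits it.
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def outage_window(last_line, year):
    """Return (earliest, latest) time the host can have gone silent.

    The host was alive when it wrote its last line, and it would have
    written a MARK one interval later if it had still been running and
    reachable.
    """
    last_seen = parse_timestamp(last_line, year)
    return last_seen, last_seen + MARK_INTERVAL


# Last line in the pf4 log quoted above (timestamps are SAST).
last = "Jan  5 11:14:53 r2d020671 sshd[543]: Server listening on 0.0.0.0 port 22."
start, end = outage_window(last, 2015)
print(start.time(), "to", end.time())  # prints "11:14:53 to 11:34:53"
```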

It would really help our understanding of the problem if you could
please provide some more details of your visit to the PAPER container
(e.g. time of day, duration of visit, actions taken, etc).  I suspect
it's either a power problem, a network problem, or a system problem
(e.g. bad RAM).  The problem is isolated to "pf4" (or its associated
cables); all the other ROACH2s seem fine.

Thanks again,
Dave

On Jan 5, 2015, at 2:36 AM, Matthys Maree wrote:

Hi

Roach#4 back ON.

Probably the power cable.

Cooling still fine inside container.

Groete

Matthys Maree
SKA South Africa - Carnarvon

Tel:       021 506 7300 ext.#1035 (Carnarvon, Klerefontein)
Web :  www.ska.ac.za

From: danny jacobs [mailto:[email protected]]
Sent: 31 December 2014 06:57 AM
To: David DeBoer; PAPER List; Matthys Maree; Matt Dexter
Subject: Fwd: recalcitrant roach

Hi Matthys (cc PAPER),

One of our ROACHs has stopped responding.  A power issue seems most
likely. What with the heat cycling, it's possible that its power cable
has loosened (or maybe even the ethernet). A failing power supply is
also possible.  Could you, or someone like you, double check that
ROACH #4 is getting power and shows an ethernet light?

Thanks,

~Danny




---------- Forwarded message ----------
From: David MacMahon <[email protected]>
Date: Tue, Dec 30, 2014 at 12:21 PM
Subject: Re: recalcitrant roach
To: danny jacobs <[email protected]>
Cc: Matt Dexter <[email protected]>


Hi, Danny,

pf4 seems to be having problems.  These problems seem to have started on
December 19.  The roach2s log a "syslog restart" line when they boot.
I've extracted the December restart messages from the log files:

pf1:2014 Dec 19 08:46:09 syslogd 1.5.0#6: restart.
pf2:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
pf3:2014 Dec 19 08:46:17 syslogd 1.5.0#6: restart.
pf5:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.
pf6:2014 Dec 19 08:46:16 syslogd 1.5.0#6: restart.
pf7:2014 Dec 19 08:46:16 syslogd 1.5.0#6: restart.
pf8:2014 Dec 19 08:46:15 syslogd 1.5.0#6: restart.

pf4:2014 Dec 19 09:45:58 syslogd 1.5.0#6: restart.
pf4:2014 Dec 19 10:28:00 syslogd 1.5.0#6: restart.
pf4:2014 Dec 19 11:36:52 syslogd 1.5.0#6: restart.
pf4:2014 Dec 19 15:10:14 syslogd 1.5.0#6: restart.
pf4:2014 Dec 19 16:28:23 syslogd 1.5.0#6: restart.

pf4:2014 Dec 20 23:17:49 syslogd 1.5.0#6: restart.
pf4:2014 Dec 23 23:55:54 syslogd 1.5.0#6: restart.
pf4:2014 Dec 26 20:33:59 syslogd 1.5.0#6: restart.
pf4:2014 Dec 28 02:26:52 syslogd 1.5.0#6: restart.

pf1:2014 Dec 30 18:26:41 syslogd 1.5.0#6: restart.
pf2:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
pf3:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
pf5:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
pf6:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.
pf7:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.
pf8:2014 Dec 30 18:26:36 syslogd 1.5.0#6: restart.

As you can see, pf4 did not restart on Dec 19 with the rest of the roach2s
at 08:46.  It restarted almost an hour later at 09:45.  It then
restarted several times throughout the day on the 19th.  It has also
restarted sporadically on a few days since then, with the most recent
being on Dec 28 at 02:26.  The last log message for pf4 was Dec 28
03:06.  It went down sometime in the next 20 minutes after that.
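
For anyone who wants to repeat this comparison, here is a rough Python
sketch of the idea: treat a restart as "solo" when no other roach2
restarted within a minute of it.  The function names and the one-minute
tolerance are assumptions made for this example, and the sample lines
are only an excerpt of the listing above.

```python
from datetime import datetime, timedelta


def parse(line):
    """Split 'pfN:YYYY Mon DD HH:MM:SS ...' into (host, timestamp)."""
    host, rest = line.split(":", 1)
    # The timestamp, including the year, is the first 20 characters.
    stamp = datetime.strptime(rest[:20], "%Y %b %d %H:%M:%S")
    return host, stamp


def solo_restarts(lines, tolerance=timedelta(minutes=1)):
    """Return restarts that no other host shared within `tolerance`.

    Fleet-wide restarts (e.g. a PDU power cycle) show up on every host
    at nearly the same time, so a restart with no peers points at the
    individual unit.
    """
    events = [parse(line) for line in lines]
    solo = []
    for host, stamp in events:
        has_peer = any(
            other != host and abs(t - stamp) <= tolerance
            for other, t in events
        )
        if not has_peer:
            solo.append((host, stamp))
    return solo


# Excerpt of the restart listing above.
sample = [
    "pf1:2014 Dec 19 08:46:09 syslogd 1.5.0#6: restart.",
    "pf3:2014 Dec 19 08:46:17 syslogd 1.5.0#6: restart.",
    "pf4:2014 Dec 19 09:45:58 syslogd 1.5.0#6: restart.",
    "pf4:2014 Dec 19 10:28:00 syslogd 1.5.0#6: restart.",
    "pf1:2014 Dec 30 18:26:41 syslogd 1.5.0#6: restart.",
    "pf3:2014 Dec 30 18:26:37 syslogd 1.5.0#6: restart.",
]

solo = solo_restarts(sample)
for host, stamp in solo:
    print(host, stamp)
```

On this excerpt only the two pf4 restarts are reported, which matches
the pattern described above.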

I'm guessing it's a flaky power issue.  Hopefully just a power cord that got
loose at one end or the other during the shutdown.  If it's not that
then I'd guess it's something internal to the power supply?

I've CC'd Matt in case he has any other ideas.

It would probably be a good idea to have someone check on the power cables.

Thanks,
Dave

On Dec 30, 2014, at 8:32 AM, danny jacobs wrote:

Hi Dave,

I thought I'd give PAPER a boot up and see if we could break the A/C but
it looks like we may have a dead roach.  #4 doesn't respond to pings
even after power cycling. Just in case there was some mislabeling on
the roachpdu apc page I even rebooted all of them.  All go down, all
come back... except for #4.

Could you maybe take a look and confirm?

Thanks,
~Danny


--

National Science Foundation Fellow
Arizona State University
School of Earth and Space Exploration Low Frequency Cosmology
Phone:           (505) 500 4521
Homepage:     http://loco.lab.asu.edu/danny_jacobs/





