Dear kind CASPER Colleagues,

To offer a little more feedback on this:

—We reiterate that all advice is appreciated and useful.  It may well be 
relevant to prior weird experiences; however, in the current case . . .

— . . . after assorted power cycles removing all inputs, confirmation that the 
unit has an approved FSP power supply, and swapping in spares both at the LRU 
and NIC level, we are now convinced that our current issues with one 10GigE 
port of 8 going down are not ROACH2 hardware related, but rather something to 
do with the environment in which it is installed (i.e. related to external 
stimuli).  Still investigating.

—One unusual aspect of this application is we are using all 8 SFP+ ports on the 
ROACH2, though we are not stressing the rates. It is a long shot, but are there 
any insights into possible stresses or snafus we might run into when fully 
utilizing the ROACH 10GigE NIC ports?

Thanks again.

Jonathan & crew




> On Apr 18, 2018, at 10:13 AM, Jonathan Weintroub <jweintr...@cfa.harvard.edu> 
> wrote:
> 
> Hi Jonathon,
> 
> Your important input here warrants cc to the mailing list, hereby 
> accomplished.
> 
> We have switched to the FSP power supplies for new builds, and have repaired 
> older ROACH2s, a number of which have had failing XEALs (mostly), by replacing 
> same with FSPs.  We have, I think, done some prophylactic FSP replacements in 
> offline spare stock.  But we’ve ordered and deployed probably over 100 
> ROACH2s over about a four, perhaps even five year period, they are used at 
> SMA for SWARM, and also distributed all over the world for the EHT.  So we 
> have NOT retrofitted every unit out there with FSP power supplies.  
> 
> While the XEAL are known to be not reliable, when a unit is working, it's not 
> that straightforward to recall it for a power supply replacement—ain’t broke 
> don’t fix applies.
> 
> Thanks for your input. Thanks also for input from Dan, Jason, Matt and Mike, 
> which is valuable and relevant advice.  I was holding off on responding, 
> we’re at SMA running tests, and don’t yet know the resolution for the units 
> in question.  
> 
> Jonathon’s email triggered this interim response. I’ll let all know the 
> outcome on the lightning damage when we have one.
> 
> Thanks,
> 
> Jonathan
> 
> 
> 
>> On Apr 18, 2018, at 9:44 AM, Jonathon Kocz <jxk...@gmail.com> wrote:
>> 
>> Hi Jonathan,
>> 
>> I think you've already addressed this, but to double check, are these R2s 
>> after you switched to the SP25-60FAG power supply?
>> 
>> I've had a lot of trouble with R2s using istar/xeal supplies getting into 
>> strange situations that always seem fixable with a new power supply. 
>> 
>> Cheers,
>> Jonathon
>> 
>> On 17 April 2018 at 16:22, Jonathan Weintroub <jweintr...@cfa.harvard.edu> 
>> wrote:
>> Hi CASPERites,
>> 
>> We have experience with quite a few ROACH2s in the lab and in the field over 
>> some years, and a pattern has emerged which warrants a question to the ROACH2 
>> experts on this list. The SAO team has seen strange faults happen on 
>> multiple ROACH2 units after power failures, dips and lightning storms.   
>> I’ll list the various weirdnesses below, but the key point is that a full 
>> power cycle, including removing power from the line input, does not reset 
>> and cure the units, whereas an extended power down (overnight, 24 hours, or 
>> more) does seem to bring the units back to life again.  This was discovered 
>> serendipitously, and has happened often enough that the pattern seems 
>> repeatable (though controlled experiments aren’t really possible, we try not 
>> to stress our equipment this way).
>> 
>> Has anyone else seen this, and does someone perhaps have a suggestion as to 
>> root cause, or some way to accelerate the reset?
>> 
>> Example faults have included:
>> 
>> —ADC5G clock not being correctly received, or not being transmitted to FPGA, 
>> or being transmitted at incorrect speed.
>> 
>> —A particular ADC would refuse to calibrate its digital interface to the 
>> FPGA.
>> 
>> —QDRs which don’t calibrate
>> 
>> —After a lightning storm on Maunakea we have two units with a single SFP+ 
>> port among 8 failing to transmit packets, though we have yet to see if an 
>> extended power down will cure this.
>> 
>> Again these faults have been distributed across multiple units, and in all 
>> cases have eventually been cleared, after extended power down.  Which is 
>> good, but the pathology worries us.
>> 
>> Thanks in advance for any light that might be cast on this issue.
>> 
>> Jonathan and André
>> EHT/SMA
>> 
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "casper@lists.berkeley.edu" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to casper+unsubscr...@lists.berkeley.edu.
>> To post to this group, send email to casper@lists.berkeley.edu.
>> 
> 
