Re: [casper] wideband conversion and correlation

Dan Werthimer Wed, 29 Dec 2010 10:58:57 -0800


hi bob,

i agree with dave that option #2 in your email is
best, because you only have to oversample each
PFB sub-band by 15% to 20%, instead of 50% overlap,
so the F engine computing cost is only slightly increased, and you
can still get the flat passband response you want in your FFX correlator.

here's a strawperson FFX correlator using oversampling:

1)  divide a 16 GHz band up into two 8 GHz bands
     using analog techniques.   if your ADC has 16GHz bandwidth,
     then you can use a diplexer - no mixers are needed.

2)   digitize each 8 GHz band using a 16Gsps ADC board and a Roach II.

(you will need to design this ADC board or perhaps Dave H. will do this).


3)  break the 8 GHz bands up into 8 sub-bands of 1 GHz each
     using an 8 tap PFB or 8 DDC's.
     oversample the data by 20% (1.2 GHz bandwidth per sub-band)
     and transmit the eight sub-bands using  Roach II's eight ports.
     (use 4 bit real, 4 bit imaginary, XAUI protocol).

4)  feed the above data into 8 FX correlators, each with 1.2 GHz bandwidth:

4a)  each 1.2 GHz FX correlator consists of 8 Roach II  boards
        and a 16 port 10Gbit switch and is implemented as follows:

4b) each Roach II board in the 1.2 GHz FX correlator serves as both a
1.2 GHz bandwidth dual pol F single antenna engine and a 125 MHz
bandwidth eight antenna dual pol X engine.
the libraries for these F and X blocks are available from the
casper packetized correlator designs.

the roach II board receives 1.2 GHz sub-bands via two xaui links
from two polarizations from one antenna (from step 3 above),
then breaks the two 1.2 GHz bands up into about 4K channels,
packetizes the data and transmits 1 GHz of the 1.2 GHz band out
over a pair of 10Gbit ethernet links to a 10Gbe switch
(4 bit real, 4 bit imaginary data).

(there's no need to transmit or correlate the overlapping parts
of the 1.2 GHz band, so 100 MHz on each side is discarded).

the switch implements the FX corner turn, and sends 125 MHz bands
back to each Roach II for correlation over the same pair of
10Gbit ethernet links.

each roach II FPGA also contains an 8 antenna X engine for
125 MHz bandwidth.
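
The link budget implied by the strawperson design above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the figures are the ones quoted in the email, the helper name `link_rate_gbps` is made up for illustration, and protocol overhead is ignored.

```python
# Link budget for the strawperson FFX correlator (payload only, no overhead).
BITS_PER_SAMPLE = 4 + 4   # 4 bit real + 4 bit imaginary

def link_rate_gbps(bandwidth_ghz, n_streams=1, bits=BITS_PER_SAMPLE):
    """Payload rate for complex samples at one sample per Hz of bandwidth."""
    return bandwidth_ghz * bits * n_streams

# Step 3: each oversampled 1.2 GHz sub-band leaves the digitizer on one XAUI port.
f_in_per_xaui = link_rate_gbps(1.2)

# Step 4b: only the central 1 GHz of each polarization is sent to the switch,
# shared across a pair of 10Gbit ethernet links.
f_out_total = link_rate_gbps(1.0, n_streams=2)
f_out_per_link = f_out_total / 2

# The corner turn splits the retained 1 GHz across the 8 boards' X engines.
x_engine_bw_mhz = 1000.0 / 8

# Overlap discarded on each side of a 1.2 GHz sub-band.
discarded_per_edge_mhz = (1200.0 - 1000.0) / 2

print(f"XAUI input per sub-band : {f_in_per_xaui:.1f} Gbit/s")
print(f"10GbE output per link   : {f_out_per_link:.1f} Gbit/s")
print(f"X engine bandwidth      : {x_engine_bw_mhz:.0f} MHz")
print(f"discarded per edge      : {discarded_per_edge_mhz:.0f} MHz")
```

Both the XAUI input and the 10GbE output stay under 10 Gbit/s of payload per link, which is the margin the oversampling is trading against.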



if you'd like to discuss some time,  please give me a call.


best wishes,

dan

On 12/29/2010 9:09 AM, David Hawkins wrote:

Hi Bob,

I believe the CASPER implementations of the PFB resample
the (output) channel data streams at a rate consistent with
the channel spacing, resulting in an overall output data rate
that matches the input data rate.

The PFB output channels can alternatively be resampled at a
higher data rate. This allows for a wider channel transition
on the individual channels, but would result in a higher total
output data rate than input data rate. This higher data rate
would need to be accommodated between the output of the
coarse channelizer PFB, and the second fine channelizer PFB.
The transition channels would be discarded before sending the
data to the cross-correlator.

Given that the FFX F-to-F path is point-to-point, there would
be no need to use packetized data, so having to deal with
higher-bandwidth might be accommodated by operating FPGA-to-FPGA
transceiver links as synchronous links, eg. given a XAUI lane
nominally operated at 3.125Gbps, operate it as a synchronous
link at say 6.5Gbps. The maximum bandwidth of the F-to-F path
would help determine what your output channel resample rate
should be.
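
A rough way to see how the resample ratio and the lane rate constrain each other is sketched below. The 8 coarse channels over a 9 GHz band, the 4+4 bit complex samples, the 4 lanes per link and the 8b/10b line coding are all assumptions made for illustration (a synchronous link need not use 8b/10b), and the function names are hypothetical.

```python
# Sketch: coarse-PFB output resample ratio versus F-to-F lane rate.
# Assumed: 4 lanes per link and 8b/10b coding (10 line bits per 8 data bits).

def lane_rate_gbps(band_ghz, n_channels, oversample,
                   bits=8, lanes=4, coding=10 / 8):
    """Line rate per lane needed to carry one coarse channel."""
    channel_spacing = band_ghz / n_channels          # GHz
    payload = channel_spacing * oversample * bits    # Gbit/s of sample data
    return payload * coding / lanes

def max_oversample(lane_gbps, band_ghz, n_channels,
                   bits=8, lanes=4, coding=10 / 8):
    """Largest resample ratio that a given lane rate can carry."""
    payload_capacity = lane_gbps * lanes / coding
    return payload_capacity / (band_ghz / n_channels * bits)

for ratio in (1.0, 1.15, 1.2, 2.0):
    rate = lane_rate_gbps(9.0, 8, ratio)
    print(f"oversample {ratio:4.2f}: {rate:.3f} Gbit/s per lane")

for lane in (3.125, 6.5):
    limit = max_oversample(lane, 9.0, 8)
    print(f"lane at {lane} Gbit/s: resample ratio up to {limit:.2f}")
```

Under these assumptions a nominal 3.125 Gbps lane leaves little room for oversampling, while a faster synchronous lane leaves considerably more.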

I believe Fred Harris' book [1] has a discussion on this, in
Chapter 9 'Polyphase Channelizers'.

Cheers,
Dave

[1] F. J. Harris, "Multirate Signal Processing for
    Communications Systems", 2004.

Robert Wilson wrote:
Dear Dan et al.,

Seasons Greetings.  I hope that you have had less rain than
we had snow.

Previously I have suggested stacking two layers of PFBs in
the case where the sampler runs faster than a single FPGA can
be a complete F engine for.  Although I can't find an email
with  your earlier suggestion, I believe that this is what
you are calling an FFX correlator.  A couple of weeks ago,
I was thinking about this and realized that, at least for our
purposes it will not work.

We would like to cover a broad band with relatively fine
spectral resolution and do not want holes in our spectral
coverage unless there is a big penalty for avoiding them.
Think about the first PFB which divides up the original band
into, say, 8 blocks.  Suppose we use the Micram ADC30 and
convert 9 GHz at a time.  Then each block is 1.125 GHz wide.
We will want to be able to divide that into 32K (or perhaps
even 64K) channels.  Now consider the edge of the band covered
by that block.  The PFB can be designed to have a very sharp
cutoff at the edge, perhaps 30 dB.  This will avoid aliasing
from adjacent bands, but the channels near each edge will be ~
30 dB down and effectively useless.  The block will be sampled
at the Nyquist rate, so one can not design the filter a bit
wider and throw away the edge channels in the second stage.
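
The edge problem can be illustrated numerically. The sketch below evaluates the response of one coarse channel's prototype filter at several offsets from the channel centre. The Hamming-windowed sinc with 8 taps per channel is an arbitrary choice made for illustration, not the CASPER PFB's filter, so the exact attenuation figures will differ from any real design.

```python
import cmath
import math

def prototype_filter(n_channels, taps_per_channel):
    """Hamming-windowed sinc low-pass, cutoff at half the channel spacing."""
    n = n_channels * taps_per_channel
    mid = (n - 1) / 2
    h = []
    for i in range(n):
        x = (i - mid) / n_channels
        sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))
        h.append(sinc * window)
    dc = sum(h)
    return [c / dc for c in h]   # unity gain at the channel centre

def gain_db(h, freq):
    """Filter gain in dB at `freq`, in cycles per input sample."""
    resp = sum(c * cmath.exp(-2j * math.pi * freq * i) for i, c in enumerate(h))
    return 20 * math.log10(abs(resp))

M = 8                         # number of coarse channels
h = prototype_filter(M, taps_per_channel=8)
spacing = 1.0 / M             # channel spacing in cycles per input sample

# Offsets are fractions of the channel spacing; 0.5 is the block edge.
for offset in (0.0, 0.25, 0.4, 0.5, 0.6):
    print(f"{offset:4.2f} of spacing from centre: "
          f"{gain_db(h, offset * spacing):7.2f} dB")
```

With critical sampling everything beyond 0.5 of the spacing aliases back into the block, so the filter has to be falling by then and the fine channels near the edge are attenuated. If the block were resampled at a higher rate, the same roll-off could sit outside the part of the band that is kept.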

I have seen two designs which I believe are attempts to deal
with this problem:

Mark Torres described a design in which the first stage
PFB is actually duplicated as two PFBs shifted by half of a
channel width.  The PFBs can have simple FIR filters as only
the central half of each will be used after the second stages.
This certainly solves the problem, but it looks to me as though
it requires almost twice as much computing in the F engines
and the data rate out of the first stage will be twice the
input rate.

I believe that I have also seen designs in which the first
stage is done with overlapping FIR filters.  I don't know how
much more computing that requires than the PFB, but the data
rate is only modestly increased as is the computing in the
second stage PFBs.

The latter is probably the preferred solution, but there are
two places where the CASPER PFB could be split.  After the
FIR filter and between the two stages of the FFT.  This would
allow sharing the load with up to three FPGAs.  There would
be no increase in data communications rate in this option.

I have discussed this with Alan Rogers who offered to
think about efficient solutions to the problem.  In their
VLBI processors, they split the input band into separate
channels with a PFB to emulate the original analog filters.
Apparently they have not worried about complete spectral
coverage with the multitap off-line cross correlator.

I wonder if there are other solutions to this problem.

Regards,
Bob Wilson

On Fri, 24 Dec 2010, Dan Werthimer wrote:
hi jason,  jonathan,

regarding jason's concerns below about corner turns
and 10Gbit links:

in the FFX model that i propose, where the
first FPGA breaks the 9 GHz band up into 8 pieces of
1.25 GHz each, there is no corner turner needed, as
the 8 frequency bands emerge from the PFB in eight
parallel paths and each path goes separately to its own
XAUI or 10Gbit ethernet port.

the FFX design doesn't require any block ram or QDR:
all the coefficients in the 8 channel PFB/FFT are constants,
and there are no BRAM delays in the FFT, only registers,
as the FFT is implemented with fully parallel inputs and outputs.
(the FFT is implemented like a text book diagram of an
8 input FFT with all the butterflies done in parallel).
or instead of an 8 channel PFB, the channelization can
be implemented as 8 DDC's, again with no BRAM's.
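
As a software illustration of the text book 8 input butterfly diagram, here is a minimal radix-2 sketch in Python. It only shows the data flow and the fixed twiddle constants. In an FPGA each butterfly would be its own piece of hardware with pipeline registers, and the real-input and PFB FIR parts are left out. The names `fft8`, `W` and `BIT_REVERSED` are illustrative.

```python
import cmath

N = 8
# Twiddle factors are constants, so no coefficient memory is needed.
W = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N)]
BIT_REVERSED = [0, 4, 2, 6, 1, 5, 3, 7]

def fft8(x):
    """Radix-2 decimation-in-time FFT of 8 samples in three butterfly stages."""
    a = [complex(x[i]) for i in BIT_REVERSED]

    # Stage 1: four 2-point butterflies, twiddle W^0 = 1 only.
    b = [0j] * N
    for i in range(0, N, 2):
        b[i], b[i + 1] = a[i] + a[i + 1], a[i] - a[i + 1]

    # Stage 2: two 4-point groups, twiddles W^0 and W^2 (= -j).
    c = [0j] * N
    for base in (0, 4):
        for k in range(2):
            t = W[2 * k] * b[base + k + 2]
            c[base + k], c[base + k + 2] = b[base + k] + t, b[base + k] - t

    # Stage 3: one 8-point group, twiddles W^0 .. W^3.
    d = [0j] * N
    for k in range(4):
        t = W[k] * c[k + 4]
        d[k], d[k + 4] = c[k] + t, c[k] - t
    return d

if __name__ == "__main__":
    # A tone in bin 1 should put all of its energy in output 1.
    tone = [cmath.exp(2j * cmath.pi * n / N) for n in range(N)]
    print([round(abs(v), 6) for v in fft8(tone)])
```

All twelve butterflies depend only on the previous stage, which is why the whole thing can be laid out in parallel with registers between stages and no BRAM delays.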

i think the Roach II's eight 10Gbit links can just barely
support 9 GHz of bandwidth, with 4 bit real, 4 bit imaginary data:
(1.25 GHz each * 8 bits = 10Gbits/sec on each link).
this will work with XAUI, but for 10Gbe, the extra overhead
from headers,  time stamps, etc will reduce the  bandwidth slightly.

jonathan,

suraj's concern about achieving high clock rates at high demux values
is for large FFT's   (you asked about 32K points).
if you are just doing an 8 point PFB or FFT, or implementing 8 DDC's,
for an FFX correlator, the routing is pretty straightforward -
you won't be using the CASPER PFB or FFT blocks,
it's all fully parallel implementation.


best wishes,

dan





On 12/23/2010 11:17 PM, Jason Manley wrote:
To the best of my knowledge, nobody's built a CASPER correlator
that processes such high bandwidths. I took a closer look at
bringing 20Gsps into a ROACH2 for MeerKAT use a few months back. My
conclusion was that this would be possible with current libraries
with minimal changes. However, we weren't aiming for 32k PFBs and
we weren't aiming to process the entire 10GHz band (we'd use a DDC
and only process a couple of GHz).

I believe that you could put a PFB on the whole band if you tweak
and optimise the library block as Billy has just done with the FFT.
Though the FPGA might not run at very high speeds and you might not
get the spectral resolution that you want directly due to resource
consumption of pipelining, I think it would be possible to break
this up into subbands (FFX approach) on a single board.

I will highlight the following limitations with processing such
large bandwidths on the ROACH2 platform:

1) QDR corner-turn bandwidth. You don't mention how many inputs
you're planning and so you might not need the packetised
infrastructure at all (perhaps you're considering something like
Billy's point-to-point 3GHz 3-input correlator). ROACH2 will have
four 36-bit QDR interfaces. These can be ganged together and
demuxed to produce a single 288bit SDR interface so that your
limits would be:
32 parallel_streams * 4bit * 2complex = need 256-bit interface.
These 32 parallel streams are complex, post-FFT (after the imag
half of spectrum has been tossed) so that it would represent
~300MHz*32=9.6GHz of real band. So you might be OK here.

2) QDR capacity for the corner turn is much less of a concern.
With a packet length of 128 (what everyone's using right now), you
can have up to 64 antennas:
128pkt_len * 32768chan * 4bit * 2complex = 32Mbit per
dual-pol antenna.

*) With the huge BRAM reserves on the V6, it might even be
possible to bypass the QDR and do the whole corner-turn in BRAM,
especially if you opt for smaller packet sizes (which'd result in
smaller buffers and potentially faster dump rates but with reduced
network efficiencies due to smaller payload/header ratio).

3) Another consideration, and possible deal-breaker, is the
interconnect: ROACH-II will have 8 10GbE links (or maybe later two
40Gbps links) which could carry a little over 7GHz bandwidth after
network overhead. Again, if you're not aiming for a packetised
system, then you can do a little better. If your ADC is going to
use some of the SERDES lines though (as many of the new high speed
samplers do), then you might have to forfeit some of this
interconnect. But basically, I think you're going to run out of
bandwidth to get 10GHz out.

WRT clock rates, I think that 300MHz should be achievable on
ROACH-II with a little tweaking. ROACH-1 is able to do 250MHz with
much less fiddling than the iBOBs at these speeds. The iBOB with
the old libraries used to start choking around just 200MHz. So the
clock rates are improving a little generation-to-generation and I
don't think it's unreasonable to hope for 300MHz from V6 but I'm
conservatively banking on at least 250MHz.

My conservative conclusion after going through this whole exercise
for KAT was that ROACH-2 could comfortably handle 4GHz bandwidth
chunks at ~8Gsps (8000/32=250MHz clk rate) and that we'd start
hitting various limits not long after that. So I would say that if
you're considering ROACH-2 as a platform, you'd be safe if aiming
for IF chunks around 4 or 5 GHz.
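
The sizing arithmetic in this message can be reproduced with a short script. It simply restates the formulas quoted above; the variable names are illustrative and nothing here comes from the ROACH2 documentation.

```python
# Restating the ROACH2 sizing arithmetic from the message above.
parallel_streams = 32
bits_per_component = 4
components = 2                      # real + imaginary

# 1) QDR corner-turn width: four 36-bit DDR interfaces ganged into one SDR word.
qdr_width_available = 4 * 36 * 2
qdr_width_needed = parallel_streams * bits_per_component * components

# Real bandwidth represented by 32 complex streams at roughly 300 MHz each.
band_ghz = 0.3 * parallel_streams

# 2) Corner-turn buffer, using the formula exactly as written in the message.
pkt_len = 128
n_chan = 32768
buffer_bits = pkt_len * n_chan * bits_per_component * components
buffer_mbit = buffer_bits / 2**20   # binary megabits

# FPGA clock for the "comfortable" case: 8 Gsps demuxed by 32.
clk_mhz = 8000 / 32

print(f"QDR width: need {qdr_width_needed} of {qdr_width_available} bits")
print(f"band covered: {band_ghz:.1f} GHz")
print(f"corner-turn buffer: {buffer_mbit:.0f} Mbit")
print(f"clock: {clk_mhz:.0f} MHz")
```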
Jason

On 24 Dec 2010, at 07:51, Dan Werthimer wrote:
On 2. it seems to me that if we are digitizing a 9 GHz band and using
20 Gsps, one still needs substantial demux (at least 64) no
matter how small the PFB. As Suraj points out this is far in
excess of practical limits. This stacks with what we have found:
BW is the difficult part, large PFB for high res less so.
hi jonathan,
i agree you need to demux 20 Gsps by 64 or 128, but i don't think
this will be a problem.
20 Gsps should fit pretty easily into an FPGA in an FFX correlator:

in my example of the FFX, you'd need to implement an 8 point PFB
on the first FPGA to break the 10 GHz band into 8 sub-bands.
let's assume you do demux of 64, and clock the FPGA at 312.5 MHz:
you'd need 64*8 multipliers to implement the FIR part of an 8 tap PFB,
and 64 * 16 multipliers to implement the real to complex FFT part
of the PFB.
all the multipliers have fixed coefficients - no need to use
block rams to store coefficients - no block rams are needed for
delays or coefficients, as you'd implement the butterfly diagram directly.

so there's no coefficient routing, but there is data routing.
the data paths can all be 8 bit, and you can add pipeline registers
where needed, so you should be able to get to 312.5 MHz.

if you can't get the FPGA to route at 312.5 MHz, then you'd have
to demux by 128, and you'd need twice as many multipliers.
(instead of 1536 multipliers, it would take 3072 multipliers).
you can use block rams for many of the multipliers, as most of the
computations are multiplying 8 bit data by a fixed coefficient,
so an 8 input, 8 output look up table is all you need.
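
The multiplier counts and the look up table idea can be sketched as follows. The per-lane counts (8 for the FIR, 16 for the FFT) are taken from the text. The rounding and saturation behaviour of the table and the example coefficient are assumptions made for illustration, as are all the function names.

```python
def clock_mhz(sample_rate_gsps, demux):
    """FPGA clock needed for a given ADC rate and demux factor."""
    return sample_rate_gsps * 1000.0 / demux

def multiplier_count(demux, fir_per_lane=8, fft_per_lane=16):
    """Fixed-coefficient multipliers for the 8 channel PFB front end."""
    return demux * fir_per_lane + demux * fft_per_lane

for demux in (64, 128):
    print(f"demux {demux:3d}: {clock_mhz(20, demux):6.2f} MHz, "
          f"{multiplier_count(demux)} multipliers")

def make_const_mult_lut(coeff, out_bits=8):
    """256-entry table: signed 8 bit input -> round(coeff * input), saturated."""
    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    table = []
    for addr in range(256):
        value = addr - 256 if addr >= 128 else addr   # two's complement decode
        table.append(max(lo, min(hi, round(coeff * value))))
    return table

def lut_multiply(table, sample):
    """Multiply a signed 8 bit sample by the table's fixed coefficient."""
    return table[sample & 0xFF]

# Example: one fixed coefficient, stored as an 8-in, 8-out table.
lut = make_const_mult_lut(0.7071)
print(lut_multiply(lut, 100), lut_multiply(lut, -100))
```

Each table replaces one multiplier, so the trade is block ram against DSP48 slices.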

if you don't want to implement an 8 channel PFB,
you could also implement this as eight DDC's running in parallel
from the same ADC data, each DDC with a different downmix frequency.
the mixer coefficients are fixed, and many of the coefficients are
0, 1, -1.
the DDC's low pass filter coefficients are fixed as well - you can
use look up tables for the low pass filter multipliers and the
mixer multipliers if you are short on DSP48's.
best wishes,

dan
BTW I realize as I write that my 6 GHz BW demux 32 case suggested
in response to Suraj still requires > 400 MHz FPGA clock, thus
not so practical. Can one gain a factor of 2 in demux doing
quadrature sampling, and having I and Q inputs to a complex input
PFB each at 1/2 the rate?
Jonathan


On Dec 23, 2010, at 5:24 PM, Dan Werthimer wrote:
hi jonathan,

some ideas for your correlator:

1)
300 MHz is a good target, especially for V6.
suraj has shown how to  achieve 375 MHz for V5
by using floor planning and auto-placing.
suraj or i can send you his draft paper on this if you'd like.

2)
you might want to consider FFX instead of FX:
eg: digitizing your 9 GHz band and using a PFB to break it up
into eight sub-bands of 1.25 GHz each, and then sending the
sub-bands into eight 1.25 GHz FX correlators.
this will simplify your switch requirements and each correlator
now has only 4K channels, which is better suited for corner
turning in a roach II.
3)
also, be sure to use billy's latest FFT, (recently checked in),
which moves all the adders and multipliers into DSP48's and makes
routing easier.
you should also consider bit growth FFT's and PFB's, which start
out with the 4 or 5 or 8 bits from your ADC, and add bits gradually
as you move into the frequency domain.   dave mcmahon and hong chen
have done work on this.

best wishes,

dan

On 12/23/2010 1:47 PM, Jonathan Weintroub wrote:
Hi CASPERites,
Here's a somewhat fluffy RFI which I hope might start a little
thought and/or discussion over the season (acknowledging that
not all in the global collaboration celebrate the traditional
Western winter holidays):

At SMA we are looking into the use of CASPER methods to build an
ultra wideband high spectral resolution correlator. Typical
specs are, say, 18 GHz bandwidth with roughly 300 kHz spectral
resolution, by two polarizations, full Stokes. We are
considering using a standard CASPER packetized FX architecture
(FX much better for high res than XF), but in the relatively
unexplored "small number of antennas, wide bandwidth" regime.
If the entire 18 GHz were eaten by one ADC, this would require
a sample rate of 40 Gsps and 64 kpoint PFB. Perhaps more
reasonable would be two 9 GHz BW blocks and a 32 k PFB sampled
at about 20 Gsps, or three 6 GHz / 16 or 32 k PFB / 14 Gsps.

To start we are looking closely at the FPGA resource
utilization of large PFBs. Something that probably is common
knowledge amongst those experienced in FX correlator design is
that the demux factor drives the utilization much faster than
the size of the PFB. In that sense bandwidth is far more
expensive than spectral resolution. We've put some effort into
accurately quantifying the utilization, at least as far as
multipliers and adders are concerned, and are expanding this
analysis to block ram and other resources. And demux factor is
typically radix 2, so it is very much quantized.

For example at 20 Gsps one might consider a demux factor of 128
resulting in an FPGA clock rate of 156 MHz, which is quite
comfortable for the FPGA. Alternatively a demux factor of 64
with corresponding FPGA clock of twice that, or over 300 MHz.
Traditionally a rather uncomfortable regime for CASPER (we're
unusual, I believe, in running iBOBs at 256 MHz for the VLBI
phased array). The trouble is our analysis shows that the
difference between these two demux settings in the size of PFB
one can fit in a Virtex 6 is really quite large, and 128
definitely won't allow us to do what we need to do.

So we are increasingly highly motivated to run the FPGAs faster
still. Just a 20% increment from the 256 MHz which we
currently view as a practical upper limit allows us to cross a
clock rate threshold which then enables a factor of two
decrease in demux factor, and consequent even larger increment
in the realizable PFB size.
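
The relationship between bandwidth, resolution, demux factor and clock rate described above can be written down in a few lines. This is a sketch of the arithmetic only: the power-of-two rounding reflects the radix 2 remark, and the function names are made up.

```python
import math

def pfb_size(bandwidth_hz, resolution_hz):
    """Smallest power-of-two channel count that meets the resolution."""
    return 2 ** math.ceil(math.log2(bandwidth_hz / resolution_hz))

def clock_mhz(sample_rate_gsps, demux):
    """FPGA clock needed for a given ADC rate and demux factor."""
    return sample_rate_gsps * 1e3 / demux

def max_sample_rate_gsps(max_clock_mhz, demux):
    """Highest ADC rate a given clock limit supports at this demux factor."""
    return max_clock_mhz * demux / 1e3

for bw_ghz in (18, 9, 6):
    print(f"{bw_ghz:2d} GHz at 300 kHz: {pfb_size(bw_ghz * 1e9, 300e3)} channels")

for demux in (128, 64):
    print(f"20 Gsps, demux {demux:3d}: {clock_mhz(20, demux):.2f} MHz")

# Effect of a 20% clock increase over 256 MHz at a demux factor of 64.
for clk in (256.0, 256.0 * 1.2):
    print(f"{clk:.1f} MHz, demux 64: up to "
          f"{max_sample_rate_gsps(clk, 64):.2f} Gsps")
```

Because the demux factor moves in powers of two, a modest clock increase can be the difference between needing a demux of 128 and getting away with 64.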

Which is just a long winded way of asking if there are any
others in the collaboration motivated to run the FPGAs faster,
and whether any tricks can be shared? In particular, does the
CASPER toolflow support multiple clock domains? Our
understanding is not yet, but that's based on incomplete
information. We know that there exist Virtex 5 (?) IP FFT
cores which supposedly run at greater than 500 MHz rates, using
the enhanced interconnect between DSP slices.

While on this topic of high demux factors, the tool flow
largely chokes on demux factors of 32 or greater. Any tips
here would also be appreciated.

If anyone can cast light on this general topic and related
concerns it would be very much appreciated.

Jonathan Weintroub
SAO
