hi bob,

i agree with dave that option #2 in your email is
best, because you only have to oversample each
PFB sub-band by 15% to 20%, instead of 50% overlap,
so the F engine computing cost is only slightly increased, and you
can still get the flat passband response you want in your FFX correlator.

here's a strawperson FFX correlator using oversampling:

1)  divide a 16 GHz band up into two 8 GHz bands
     using analog techniques.   if your ADC has 16GHz bandwidth,
     then you can use a diplexer - no mixers are needed.

2)   digitize each 8 GHz band using a 16Gsps ADC board and a Roach II.
(you will need to design this ADC board or perhaps Dave H. will do this).

3)  break the 8 GHz bands up into 8 sub-bands of 1 GHz each
     using an 8 tap PFB or 8 DDC's.
     oversample the data by 20% (1.2 GHz bandwidth per sub-band)
     and transmit the eight sub-bands using  Roach II's eight ports.
     (use 4 bit real, 4 bit imaginary, XAUI protocol).

4)  feed the above data into 8 FX correlators, each with 1.2 GHz bandwidth:

4a)  each 1.2 GHz FX correlator consists of 8 Roach II  boards
        and a 16 port 10Gbit switch and is implemented as follows:

4b) each Roach II board in the 1.2 GHz FX correlator serves as both a 1.2 GHz bandwidth dual pol F single antenna engine and a 125 MHz bandwidth eight antenna dual pol X engine. the libraries for these F and X blocks are available from the casper packetized correlator designs.

the roach II board receives 1.2 GHz sub-bands via two xaui links from two polarizations from one antenna (from step 3 above), then breaks the two 1.2 GHz bands up into about 4K channels, packetizes the data and transmits 1GHz of the 1.2 GHz band out over a pair of 10Gbit ethernet links
     to a 10Gbe switch (4 bit real, 4 bit imaginary data).
(there's no need to transmit or correlate the overlapping parts of the 1.2 GHz band, so 100 MHz on each side is discarded). the switch implements the FX corner turn, and sends 125 MHz bands back to each Roach II for correlation
     over the same pair of 10Gbit ethernet links.
each roach II FPGA also contains an 8 antenna X engine for 125 MHz bandwidth.
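
The numbers in this strawperson design can be sanity-checked with a short Python sketch. The variable names are made up for illustration, and the F engine size is assumed to be exactly 4096 channels (the text only says "about 4K"), so the channel counts are approximate.

```python
# Rough budget for one 1.2 GHz FX correlator in the strawperson FFX design.
# Names are illustrative; the figures come from steps 3) and 4) above.

SUBBAND_USED_GHZ = 1.0      # bandwidth kept per sub-band
SUBBAND_SAMPLED_GHZ = 1.2   # 20% oversampled bandwidth per sub-band
BITS_PER_SAMPLE = 4 + 4     # 4 bit real + 4 bit imaginary
N_BOARDS = 8                # Roach II boards per 1.2 GHz FX correlator
N_CHANNELS = 4096           # assumed exact value for "about 4K channels"

# Input rate on one XAUI link (one polarization, complex samples).
xaui_gbps = SUBBAND_SAMPLED_GHZ * BITS_PER_SAMPLE

# Channel spacing, and the channels left after the overlap is dropped.
chan_khz = SUBBAND_SAMPLED_GHZ * 1e6 / N_CHANNELS
kept_channels = int(N_CHANNELS * SUBBAND_USED_GHZ / SUBBAND_SAMPLED_GHZ)
discard_mhz_per_side = (SUBBAND_SAMPLED_GHZ - SUBBAND_USED_GHZ) * 1e3 / 2

# Corner turn: the kept band is shared evenly among the X engines.
x_engine_mhz = SUBBAND_USED_GHZ * 1e3 / N_BOARDS
out_gbps_per_pol = SUBBAND_USED_GHZ * BITS_PER_SAMPLE

print(f"XAUI input per polarization: {xaui_gbps:.1f} Gbit/s")
print(f"channel spacing: {chan_khz:.1f} kHz, channels kept: {kept_channels}")
print(f"discarded per side: {discard_mhz_per_side:.0f} MHz")
print(f"X engine bandwidth per board: {x_engine_mhz:.0f} MHz")
print(f"10GbE payload per polarization: {out_gbps_per_pol:.1f} Gbit/s")
```

Under these assumptions the XAUI input fits within a 10 Gbit/s link, and the 10GbE output leaves some room for packet headers.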


if you'd like to discuss some time,  please give me a call.


best wishes,

dan

On 12/29/2010 9:09 AM, David Hawkins wrote:
Hi Bob,

I believe the CASPER implementations of the PFB resamples
the (output) channel data streams at a rate consistent with
the channel spacing, resulting in an overall output data rate
that matches the input data rate.

The PFB output channels can alternatively be resampled at a
higher data rate. This allows for a wider channel transition
on the individual channels, but would result in a higher total
output data rate than input data rate. This higher data rate
would need to be accommodated between the output of the
coarse channelizer PFB, and the second fine channelizer PFB.
The transition channels would be discarded before sending the
data to the cross-correlator.

Given that the FFX F-to-F path is point-to-point, there would
be no need to use packetized data, so having to deal with
higher-bandwidth might be accommodated by operating FPGA-to-FPGA
transceiver links as synchronous links, eg. given a XAUI lane
nominally operated at 3.125Gbps, operate it as a synchronous
link at say 6.5Gbps. The maximum bandwidth of the F-to-F path
would help determine what your output channel resample rate
should be.
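
As a rough illustration of how the F-to-F link rate bounds the resample rate, here is a small Python sketch. It assumes four lanes per link, 8b/10b line coding (as in standard XAUI; a custom synchronous link may differ), 4+4 bit complex samples, and a 1.125 GHz channel spacing (9 GHz split 8 ways). All of these are placeholders, and the function names are hypothetical.

```python
# Sketch: what oversampling ratio can one point-to-point F-to-F link carry?
# Assumptions: 4 lanes per link, 8b/10b coding (80% efficient),
# 8 bits per complex sample (4 real + 4 imaginary).

LANES = 4
CODING_EFFICIENCY = 0.8
BITS_PER_SAMPLE = 8

def max_oversample_ratio(lane_gbps, channel_spacing_ghz):
    """Largest (output sample rate / channel spacing) one link can carry."""
    payload_gbps = LANES * lane_gbps * CODING_EFFICIENCY
    max_sample_rate_gsps = payload_gbps / BITS_PER_SAMPLE
    return max_sample_rate_gsps / channel_spacing_ghz

def total_output_rate_gsps(input_bandwidth_ghz, ratio):
    """Aggregate complex sample rate out of the coarse PFB, all channels."""
    return input_bandwidth_ghz * ratio

for lane_gbps in (3.125, 6.5):
    r = max_oversample_ratio(lane_gbps, channel_spacing_ghz=1.125)
    print(f"{lane_gbps} Gbps lanes: oversample ratio up to {r:.2f}")
```

With these placeholder numbers, nominal 3.125 Gbps lanes leave only about 11% headroom over critical sampling, while 6.5 Gbps lanes would allow more than 2x.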

I believe Fred Harris' book [1] has a discussion on this, in
Chapter 9 'Polyphase Channelizers'.

Cheers,
Dave

[1] F. J. Harris, "Multirate Signal Processing for
    Communications Systems", 2004.

Robert Wilson wrote:
Dear Dan et al.,

Seasons Greetings.  I hope that you have had less rain than
we had snow.

Previously I have suggested stacking two layers of PFBs in
the case where the sampler runs faster than a single FPGA can
be a complete F engine for.  Although I can't find an email
with  your earlier suggestion, I believe that this is what
you are calling an FFX correlator.  A couple of weeks ago,
I was thinking about this and realized that, at least for our
purposes it will not work.

We would like to cover a broad band with relatively fine
spectral resolution and do not want holes in our spectral
coverage unless there is a big penalty for avoiding them.
Think about the first PFB which divides up the original band
into, say, 8 blocks.  Suppose we use the Micram ADC30 and
convert 9 GHz at a time.  Then each block is 1.125 GHz wide.
We will want to be able to divide that into 32K (or perhaps
even 64K) channels.  Now consider the edge of the band covered
by that block.  The PFB can be designed to have a very sharp
cutoff at the edge, perhaps 30 dB.  This will avoid aliasing
from adjacent bands, but the channels near each edge will be ~
30 dB down and effectively useless.  The block will be sampled
at the Nyquist rate, so one can not design the filter a bit
wider and throw away the edge channels in the second stage.
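
The edge roll-off can be illustrated with a minimal Python sketch. It uses a Hamming-windowed sinc as a stand-in prototype filter with 8 channels and 8 taps per branch; this is an assumed example, not the CASPER PFB design, and the exact attenuation depends on the prototype chosen. Frequencies are in cycles per input sample for a complex channelizer with spacing 1/M.

```python
import cmath
import math

M = 8          # number of coarse channels (blocks)
TAPS = 8       # taps per polyphase branch (assumed)
N = M * TAPS   # prototype filter length

def prototype():
    """Hamming-windowed sinc low-pass, cutoff at half a channel spacing."""
    h = []
    for n in range(N):
        x = (n - (N - 1) / 2) / M
        sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))
        h.append(sinc * window)
    return h

def gain_db(h, f):
    """Filter gain in dB at frequency f, relative to the gain at DC."""
    resp = abs(sum(c * cmath.exp(-2j * math.pi * f * n) for n, c in enumerate(h)))
    dc = abs(sum(h))
    return 20 * math.log10(resp / dc)

h = prototype()
edge_db = gain_db(h, 1 / (2 * M))   # boundary between adjacent blocks
adjacent_db = gain_db(h, 1 / M)     # centre of the neighbouring block

print(f"gain at block edge:        {edge_db:.1f} dB")
print(f"gain at neighbour's centre: {adjacent_db:.1f} dB")
```

This prototype keeps the edge only several dB down, at the cost of a transition band that extends into the neighbouring block. Pushing the stopband in to the block edge to suppress aliasing, as described above, attenuates the edge channels much more.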

I have seen two designs which I believe are attempts to deal
with this problem:

Mark Torres described a design in which the first stage
PFB is actually duplicated as two PFBs shifted by half of a
channel width.  The PFBs can have simple FIR filters as only
the central half of each will be used after the second stages.
This certainly solves the problem, but it looks to me as though
it requires almost twice as much computing in the F engines
and the data rate out of the first stage will be twice the
input rate.

I believe that I have also seen designs in which the first
stage is done with overlapping FIR filters.  I don't know how
much more computing that requires than the PFB, but the data
rate is only modestly increased as is the computing in the
second stage PFBs.

The latter is probably the preferred solution, but there are
two places where the CASPER PFB could be split.  After the
FIR filter and between the two stages of the FFT.  This would
allow sharing the load with up to three FPGAs.  There would
be no increase in data communications rate in this option.

I have discussed this with Alan Rogers who offered to
think about efficient solutions to the problem.  In their
VLBI processors, they split the input band into separate
channels with a PFB to emulate the original analog filters.
Apparently they have not worried about complete spectral
coverage with the multitap off-line cross correlator.

I wonder if there are other solutions to this problem.

Regards,
Bob Wilson

On Fri, 24 Dec 2010, Dan Werthimer wrote:


hi jason,  jonathan,

regarding jason's concerns below about corner turns
and 10Gbit links:

in the FFX model that i propose, where the
first FPGA breaks the 9 GHz band up into 8 pieces of
1.25 GHz each, there is no corner turner needed, as
the 8 frequency bands emerge from the PFB in eight
parallel paths and each path goes separately to its own
XAUI or 10Gbit ethernet port.

the FFX design doesn't require any block ram or QDR:
all the coefficients in the 8 channel PFB/FFT are constants,
and there are no BRAM delays in the FFT, only registers,
as the FFT is implemented with fully parallel inputs and outputs.
(the FFT is implemented like a text book diagram of an
8 input FFT with all the butterflies done in parallel).
or instead of an 8 channel PFB, the channelization can
be implemented as 8 DDC's, again with no BRAM's.

i think the Roach II's eight 10Gbit links can just barely
support 9 GHz of bandwidth, with 4 bit real, 4 bit imaginary data:
(1.25 GHz each * 8 bits = 10Gbits/sec on each link).
this will work with XAUI, but for 10Gbe, the extra overhead
from headers,  time stamps, etc will reduce the  bandwidth slightly.
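
To see roughly how much the 10GbE overhead costs, here is a hedged Python sketch. The Ethernet, IPv4 and UDP overheads are the standard sizes; the use of UDP, the payload sizes and the 8-byte application header (time stamp etc.) are assumptions for illustration and are not taken from the CASPER packet format.

```python
# Sketch of the per-link budget: XAUI carries raw samples, 10GbE adds headers.

LINK_GBPS = 10.0
BITS_PER_SAMPLE = 8              # 4 bit real + 4 bit imaginary

ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble+SFD, MAC header, FCS, inter-frame gap
IP_UDP_OVERHEAD = 20 + 8         # IPv4 header + UDP header
APP_HEADER = 8                   # hypothetical time stamp / sequence header

def usable_bandwidth_ghz(payload_bytes):
    """Complex bandwidth (GHz) one link can carry for a given payload size."""
    wire_bytes = payload_bytes + ETH_OVERHEAD + IP_UDP_OVERHEAD + APP_HEADER
    efficiency = payload_bytes / wire_bytes
    return LINK_GBPS * efficiency / BITS_PER_SAMPLE

xaui_ghz = LINK_GBPS / BITS_PER_SAMPLE   # no packet overhead
print(f"XAUI: {xaui_ghz:.3f} GHz per link")
for payload in (1024, 8192):
    print(f"10GbE, {payload} byte payload: {usable_bandwidth_ghz(payload):.3f} GHz per link")
```

Larger payloads push the usable bandwidth closer to the XAUI figure, but never quite reach it.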

jonathan,

suraj's concern about achieving high clock rates at high demux values
is for large FFT's   (you asked about 32K points).
if you are just doing an 8 point PFB or FFT, or implementing 8 DDC's,
for an FFX correlator, the routing is pretty straightforward -
you won't be using the CASPER PFB or FFT blocks,
it's all a fully parallel implementation.


best wishes,

dan





On 12/23/2010 11:17 PM, Jason Manley wrote:
To the best of my knowledge, nobody's built a CASPER correlator that processes such high bandwidths. I took a closer look at bringing 20Gsps into a ROACH2 for MeerKAT use a few months back. My conclusion was that this would be possible with current libraries with minimal changes. However, we weren't aiming for 32k PFBs and we weren't aiming to process the entire 10GHz band (we'd use a DDC and only process a couple of GHz).

I believe that you could put a PFB on the whole band if you tweak and optimise the library block as Billy has just done with the FFT. Though the FPGA might not run at very high speeds and you might not get the spectral resolution that you want directly due to resource consumption of pipelining, I think it would be possible to break this up into subbands (FFX approach) on a single board.

I will highlight the following limitations with processing such large bandwidths on the ROACH2 platform:

1) QDR corner-turn bandwidth. You don't mention how many inputs you're planning and so you might not need the packetised infrastructure at all (perhaps you're considering something like Billy's point-to-point 3GHz 3-input correlator). ROACH2 will have four 36-bit QDR interfaces. These can be ganged together and demuxed to produce a single 288bit SDR interface so that your limits would be: 32 parallel_streams * 4bit * 2complex = need 256-bit interface. These 32 parallel streams are complex, post-FFT (after the imag half of spectrum has been tossed) so that it would represent ~300MHz*32=9.6GHz of real band. So you might be OK here.

2) QDR capacity for the corner turn is much less of a concern. With a packet length of 128 (what everyone's using right now), you can have up to 64 antennas: 128pkt_len * 32768chan * 4bit * 2complex = 32Mbit per dual-pol antenna.

*) With the huge BRAM reserves on the V6, it might even be possible to bypass the QDR and do the whole corner-turn in BRAM, especially if you opt for smaller packet sizes (which'd result in smaller buffers and potentially faster dump rates but with reduced network efficiencies due to smaller payload/header ratio).

3) Another consideration, and possible deal-breaker, is the interconnect: ROACH-II will have 8 10GbE links (or maybe later two 40Gbps links) which could carry a little over 7GHz bandwidth after network overhead. Again, if you're not aiming for a packetised system, then you can do a little better. If your ADC is going to use some of the SERDES lines though (as many of the new high speed samplers do), then you might have to forfeit some of this interconnect. But basically, I think you're going to run out of bandwidth to get 10GHz out.
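
The QDR arithmetic in items 1) and 2) can be written out as a short Python sketch. The constant names are illustrative, and the factor of two in the ganged width is an interpretation of "demuxed" (double-data-rate ports widened to single data rate).

```python
# Sketch of the QDR corner-turn checks in items 1) and 2) above.

N_QDR = 4                 # QDR interfaces on ROACH2
QDR_WIDTH_BITS = 36       # width of each interface
PARALLEL_STREAMS = 32     # complex post-FFT streams
BITS_PER_COMPONENT = 4    # bits for each of real and imaginary
FPGA_CLOCK_MHZ = 300      # approximate clock used in item 1)

# Item 1): interface width available versus width needed.
sdr_width_bits = N_QDR * QDR_WIDTH_BITS * 2
needed_width_bits = PARALLEL_STREAMS * BITS_PER_COMPONENT * 2
band_ghz = FPGA_CLOCK_MHZ * PARALLEL_STREAMS / 1000

# Item 2): corner-turn buffer per antenna.
PKT_LEN = 128
N_CHAN = 32768
bits_per_antenna = PKT_LEN * N_CHAN * BITS_PER_COMPONENT * 2
mbit_per_antenna = bits_per_antenna / 2**20

print(f"QDR width: {sdr_width_bits} bits available, {needed_width_bits} needed")
print(f"real band represented: {band_ghz} GHz")
print(f"corner-turn buffer: {mbit_per_antenna:.0f} Mbit per antenna")
```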

WRT clock rates, I think that 300MHz should be achievable on ROACH-II with a little tweaking. ROACH-1 is able to do 250MHz with much less fiddling than the iBOBs at these speeds. The iBOB with the old libraries used to start choking around just 200MHz. So the clock rates are improving a little generation-to-generation and I don't think it's unreasonable to hope for 300MHz from V6 but I'm conservatively banking on at least 250MHz.

My conservative conclusion after going through this whole exercise for KAT was that ROACH-2 could comfortably handle 4GHz bandwidth chunks at ~8Gsps (8000/32=250MHz clk rate) and that we'd start hitting various limits not long after that. So I would say that if you're considering ROACH-2 as a platform, you'd be safe if aiming for IF chunks around 4 or 5 GHz.

Jason

On 24 Dec 2010, at 07:51, Dan Werthimer wrote:

On 2. it seems to me that if we are digitizing a 9 GHz band and using 20 Gsps, one still needs substantial demux (at least 64) no matter how small the PFB. As Suraj points out this is far in excess of practical limits. This stacks with what we have found: BW is the difficult part, large PFB for high res less so.
hi jonathan,

i agree you need to demux 20 Gsps by 64 or 128, but i don't think this will be a problem.
20 Gsps should fit pretty easily into an FPGA in an FFX correlator:

in my example of the FFX, you'd need to implement an 8 point PFB
on the first FPGA to break the 10 GHz band into 8 sub-bands.
let's assume you do demux of 64, and clock the FPGA at 312.5 MHz:
you'd need 64*8 multipliers to implement the FIR part of an 8 tap PFB. and 64 * 16 multipliers to implement the real to complex FFT part of the PFB. all the multipliers have fixed coefficients - no need to use block rams to store coefficients - no block rams are needed for delays or coefficients, as you'd
implement the butterfly diagram directly.
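
The multiplier count above follows from simple scaling with the demux factor, sketched here in Python. The function names are hypothetical, and the per-input figures (8 for the FIR, 16 for the real-to-complex FFT) are taken directly from the text rather than derived.

```python
def pfb_multipliers(demux, fir_taps=8, fft_mults_per_input=16):
    """Multipliers for a fully parallel 8 channel PFB at a given demux factor."""
    fir = demux * fir_taps
    fft = demux * fft_mults_per_input
    return fir, fft, fir + fft

def fpga_clock_mhz(sample_rate_gsps, demux):
    """FPGA clock needed to absorb the ADC stream at this demux factor."""
    return sample_rate_gsps * 1000 / demux

for demux in (64, 128):
    fir, fft, total = pfb_multipliers(demux)
    clock = fpga_clock_mhz(20, demux)
    print(f"demux {demux}: {fir} FIR + {fft} FFT = {total} multipliers at {clock} MHz")
```

Doubling the demux factor halves the clock rate and doubles the multiplier count.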

so there's no coefficient routing, but there is data routing.
the data paths can all be 8 bit, and you can add pipeline registers
where needed, so you should be able to get to 312.5 MHz.

if you can't get the FPGA to route at 312.5 MHz, then you'd have
to demux by 128, and you'd need twice as many multipliers.
(instead of 1536 multipliers, it would take 3072 multipliers).
you can use block rams for many of the multipliers, as most of the
computations are multiplying 8 bit data by a fixed coefficient,
so an 8 input, 8 output look up table is all you need.
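
A fixed-coefficient multiply by table lookup can be modelled in a few lines of Python. This is only a software sketch of the idea: the rounding and saturation choices and the function names are assumptions, and a real block RAM implementation would fix its own number format.

```python
def build_lut(coefficient, data_bits=8, out_bits=8):
    """Table of round(x * coefficient) for every signed data_bits input x."""
    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    table = []
    for address in range(1 << data_bits):
        # Interpret the address as a two's complement signed sample.
        if address >= (1 << (data_bits - 1)):
            x = address - (1 << data_bits)
        else:
            x = address
        product = round(x * coefficient)
        table.append(max(lo, min(hi, product)))   # saturate to out_bits
    return table

def lut_multiply(table, x, data_bits=8):
    """Multiply signed sample x by the table's coefficient with one lookup."""
    return table[x & ((1 << data_bits) - 1)]

lut = build_lut(0.7071)          # example coefficient, roughly 1/sqrt(2)
print(len(lut), lut_multiply(lut, 100), lut_multiply(lut, -100))
```

Each distinct coefficient needs its own 256-entry table, which is the sense in which block RAM is traded for multipliers.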

if you don't want to implement an 8 channel PFB,
you could also implement this as eight DDC's running in parallel
from the same ADC data, each DDC with a different downmix frequency.
the mixer coefficients are fixed, and many of the coefficients are 0, 1, -1. the DDC's low pass filter coefficients are fixed as well - you can use look up tables for the low pass filter multipliers and the mixer multipliers if you are short on DSP48's.
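
Why many mixer coefficients are trivial can be seen in a short Python sketch. It assumes downmix frequencies that are multiples of fs/8, chosen purely for illustration; the actual DDC centre frequencies depend on the band plan, and the function names are hypothetical.

```python
import math

def mixer_coefficients(k, n_points=8):
    """One period of exp(-2*pi*j*k*n/n_points) as (real, imag) pairs."""
    coeffs = []
    for n in range(n_points):
        angle = 2 * math.pi * k * n / n_points
        # Round away floating point dust; "+ 0.0" turns -0.0 into 0.0.
        real = round(math.cos(angle), 6) + 0.0
        imag = round(-math.sin(angle), 6) + 0.0
        coeffs.append((real, imag))
    return coeffs

def trivial_fraction(coeffs):
    """Fraction of real/imag coefficients that are exactly 0, +1 or -1."""
    values = [v for pair in coeffs for v in pair]
    return sum(v in (0.0, 1.0, -1.0) for v in values) / len(values)

for k in (1, 2):
    c = mixer_coefficients(k)
    print(f"LO at {k}*fs/8: {trivial_fraction(c):.0%} of coefficients need no multiplier")
```

At fs/4 every coefficient is 0 or +/-1, so the mixer reduces to sign changes and routing. At fs/8 half of them are, and the remainder share a single magnitude.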

best wishes,

dan


BTW I realize as I write that my 6 GHz BW demux 32 case suggested in response to Suraj still requires > 400 MHz FPGA clock, thus not so practical. Can one gain a factor of 2 in demux doing quadrature sampling, and having I and Q inputs to a complex input PFB each at 1/2 the rate?

Jonathan


On Dec 23, 2010, at 5:24 PM, Dan Werthimer wrote:

hi jonathan,

some ideas for your correlator:

1)
300 MHz is a good target, especially for V6.
suraj has shown how to  achieve 375 MHz for V5
by using floor planning and auto-placing.
suraj or i can send you his draft paper on this if you'd like.

2)
you might want to consider FFX instead of FX:
eg: digitizing your 9 GHz band and using a PFB to break it up into eight sub-bands of 1.25 GHz each, and then sending the sub-bands into eight 1.25 GHz FX correlators. this will simplify your switch requirements and each correlator now has only 4K channels, which is better suited for corner turning in a roach II.

3)
also, be sure to use billy's latest FFT (recently checked in),
which moves all the adders and multipliers into DSP48's and makes routing easier.
you should also consider bit growth FFT's and PFB's, which start
out with the 4 or 5 or 8 bits from your ADC, and add bits gradually
as you move into the frequency domain.   dave mcmahon and hong chen
have done work on this.

best wishes,

dan

On 12/23/2010 1:47 PM, Jonathan Weintroub wrote:
Hi CASPERites,

Here's a somewhat fluffy RFI which I hope might start a little thought and/or discussion over the season (acknowledging that not all in the global collaboration celebrate the traditional Western winter holidays):

At SMA we are looking into the use of CASPER methods to build an ultra wideband high spectral resolution correlator. Typical specs are, say, 18 GHz bandwidth with roughly 300 kHz spectral resolution, by two polarizations, full Stokes. We are considering using a standard CASPER packetized FX architecture (FX much better for high res than XF), but in the relatively unexplored "small number of antennas, wide bandwidth" regime. If the entire 18 GHz were eaten by one ADC, this would require a sample rate of 40 Gsps and 64 kpoint PFB. Perhaps more reasonable would be two 9 GHz BW blocks and a 32 k PFB sampled at about 20 Gsps, or three 6 GHz / 16 or 32 k PFB / 14 Gsps.
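
The PFB sizes quoted above follow from the bandwidth and resolution targets, as this short Python sketch shows. It treats the PFB size as the number of output channels and rounds up to a power of two; both are simplifying assumptions, and the function names are hypothetical.

```python
import math

def channels_needed(bandwidth_ghz, target_khz=300.0):
    """Smallest power-of-two channel count meeting the resolution target."""
    exact = bandwidth_ghz * 1e6 / target_khz
    return 1 << math.ceil(math.log2(exact))

def channel_width_khz(bandwidth_ghz, n_channels):
    """Channel spacing in kHz for a given band and channel count."""
    return bandwidth_ghz * 1e6 / n_channels

for bw in (18, 9, 6):
    n = channels_needed(bw)
    print(f"{bw} GHz: {n} channels, {channel_width_khz(bw, n):.0f} kHz each")
```

For the 6 GHz case, 16k channels would give a spacing somewhat coarser than the 300 kHz target, which is presumably why both 16k and 32k are listed.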

To start we are looking closely at the FPGA resource utilization of large PFBs. Something that probably is common knowledge amongst those experienced in FX correlator design is that the demux factor drives the utilization much faster than the size of the PFB. In that sense bandwidth is far more expensive than spectral resolution. We've put some effort into accurately quantifying the utilization, at least as far as multipliers and adders are concerned, and are expanding this analysis to block ram and other resources. And demux factor is typically radix 2, so it is very much quantized.

For example at 20 Gsps one might consider a demux factor of 128 resulting in an FPGA clock rate of 156 MHz, which is quite comfortable for the FPGA. Alternatively a demux factor of 64 with corresponding FPGA clock of twice that, or over 300 MHz. Traditionally a rather uncomfortable regime for CASPER (we're unusual, I believe, in running iBOBs at 256 MHz for the VLBI phased array). The trouble is our analysis shows that the difference between these two demux settings in the size of PFB one can fit in a Virtex 6 is really quite large, and 128 definitely won't allow us to do what we need to do.

So we are increasingly highly motivated to run the FPGAs faster still. Just a 20% increment from the 256 MHz which we currently view as a practical upper limit allows us to cross a clock rate threshold which then enables a factor of two decrease in demux factor, and consequent even larger increment in the realizable PFB size.
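
The quantized trade-off between demux factor and clock rate can be sketched in Python. The function name is hypothetical, and the clock ceilings passed in are example values only.

```python
def min_radix2_demux(sample_rate_gsps, max_clock_mhz):
    """Smallest power-of-two demux that keeps the FPGA clock at or below the ceiling."""
    demux = 1
    while sample_rate_gsps * 1000 / demux > max_clock_mhz:
        demux *= 2
    return demux

for ceiling in (256, 312.5):
    d = min_radix2_demux(20, ceiling)
    print(f"clock ceiling {ceiling} MHz: demux {d}, clock {20 * 1000 / d} MHz")
```

Because the demux factor moves in powers of two, a modest increase in the achievable clock rate can halve it.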

Which is just a long winded way of asking if there are any others in the collaboration motivated to run the FPGAs faster, and whether any tricks can be shared? In particular, does the CASPER toolflow support multiple clock domains? Our understanding is not yet, but that's based on incomplete information. We know that there exist Virtex 5 (?) IP FFT cores which supposedly run at greater than 500 MHz rates, using the enhanced interconnect between DSP slices.

While on this topic of high demux factors, the tool flow largely chokes on demux factors of 32 or greater. Any tips here would also be appreciated.

If anyone can cast light on this general topic and related concerns it would be very much appreciated.

Jonathan Weintroub
SAO








