Re: [casper] wideband conversion and correlation

2010-12-29 Thread Dan Werthimer



hi bob,

i agree with dave that option #2 in your email is
best, because you only have to oversample each
PFB sub-band by 15% to 20%, instead of 50% overlap,
so the F engine computing cost is only slightly increased, and you
can still get the flat passband response you want in your FFX correlator.

here's a strawperson FFX correlator using oversampling:

1)  divide a 16 GHz band up into two 8 GHz bands
 using analog techniques.   if your ADC has 16GHz bandwidth,
 then you can use a diplexer - no mixers are needed.

2)   digitize each 8 GHz band using a 16Gsps ADC board and a Roach II.
   (you will need to design this ADC board or perhaps Dave H. will do this).


3)  break the 8 GHz bands up into 8 sub-bands of 1 GHz each
 using an 8 tap PFB or 8 DDC's.
 oversample the data by 20% (1.2 GHz bandwidth per sub-band)
 and transmit the eight sub-bands using  Roach II's eight ports.
 (use 4 bit real, 4 bit imaginary, XAUI protocol).

4)  feed the above data into 8 FX correlators, each with 1.2 GHz bandwidth:

4a)  each 1.2 GHz FX correlator consists of 8 Roach II  boards
and a 16 port 10Gbit switch and is implemented as follows:

4b) each Roach II board in the 1.2 GHz FX correlator serves as both a
1.2 GHz bandwidth dual pol F single antenna engine and a 125 MHz bandwidth
eight antenna dual pol X engine.
the libraries for these F and X blocks are available from the
casper packetized correlator designs.


the roach II board receives 1.2 GHz sub-bands via two xaui links
from two polarizations from one antenna (from step 3 above),
then breaks the two 1.2 GHz bands up into about 4K channels,
packetizes the data and transmits 1 GHz of the 1.2 GHz band out
over a pair of 10Gbit ethernet links to a 10Gbe switch
(4 bit real, 4 bit imaginary data).
(there's no need to transmit or correlate the overlapping parts
of the 1.2 GHz band, so 100 MHz on each side is discarded).
the switch implements the FX corner turn, and sends 125 MHz bands
back to each Roach II for correlation over the same pair of
10Gbit ethernet links.
each roach II FPGA also contains an 8 antenna X engine for 125 MHz
bandwidth.
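
As a rough check on the numbers in steps 3 and 4, the sketch below works out the per-link data rates and the X engine bandwidth for one sub-band. It assumes complex sampling (sample rate equal to bandwidth) and ignores link protocol overhead; the variable names are illustrative only and do not come from any CASPER library.

```python
# Sizing sketch for one 1 GHz sub-band of the strawperson FFX design.
# Figures are taken from steps 3 and 4; names are illustrative.
# Assumes complex samples, so sample rate (Msps) equals bandwidth (MHz).

SUBBAND_MHZ = 1000         # useful bandwidth per sub-band
OVERSAMPLE_PERCENT = 20    # oversampling in the first-stage PFB
BITS_PER_SAMPLE = 4 + 4    # 4 bit real + 4 bit imaginary
N_POLS = 2                 # one XAUI link per polarization
N_BOARDS = 8               # Roach II boards per 1.2 GHz FX correlator
N_10GBE_LINKS = 2          # 10GbE links from each board to the switch

oversampled_mhz = SUBBAND_MHZ * (100 + OVERSAMPLE_PERCENT) // 100
xaui_mbps = oversampled_mhz * BITS_PER_SAMPLE          # per XAUI link (one pol)
discard_per_side_mhz = (oversampled_mhz - SUBBAND_MHZ) // 2
tengbe_mbps = SUBBAND_MHZ * N_POLS * BITS_PER_SAMPLE // N_10GBE_LINKS
x_engine_mhz = SUBBAND_MHZ // N_BOARDS

print("oversampled sub-band :", oversampled_mhz, "MHz")
print("XAUI payload per link:", xaui_mbps, "Mbit/s")
print("discarded per side   :", discard_per_side_mhz, "MHz")
print("10GbE payload / link :", tengbe_mbps, "Mbit/s (before headers)")
print("X engine bandwidth   :", x_engine_mhz, "MHz")
```

Under these assumptions both the XAUI input and the 10GbE output stay below 10 Gbit/s per link, which is why the 20% oversampling fits the eight-port layout.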



if you'd like to discuss some time,  please give me a call.


best wishes,

dan

Re: [casper] wideband conversion and correlation

2010-12-29 Thread David Hawkins

Hi Bob,

I believe the CASPER implementations of the PFB resample
the (output) channel data streams at a rate consistent with
the channel spacing, resulting in an overall output data rate
that matches the input data rate.

The PFB output channels can alternatively be resampled at a
higher data rate. This allows for a wider channel transition
on the individual channels, but would result in a higher total
output data rate than input data rate. This higher data rate
would need to be accommodated between the output of the
coarse channelizer PFB, and the second fine channelizer PFB.
The transition channels would be discarded before sending the
data to the cross-correlator.

Given that the FFX F-to-F path is point-to-point, there would
be no need to use packetized data, so the higher bandwidth
might be accommodated by operating FPGA-to-FPGA
transceiver links as synchronous links, e.g. given a XAUI lane
nominally operated at 3.125Gbps, operate it as a synchronous
link at say 6.5Gbps. The maximum bandwidth of the F-to-F path
would help determine what your output channel resample rate
should be.
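
The relationship between resample rate and link capacity can be sketched as below. This is only an illustration: the 8 GHz band, 8 channels and 8 bits per complex sample are example figures from elsewhere in this thread, and the four-lane, 8b/10b-coded link model is an assumption (it matches standard XAUI at 3.125 Gbaud per lane, but a custom synchronous link need not use it).

```python
from fractions import Fraction

def pfb_output_rates(band_mhz, n_channels, oversample):
    """Return (per-channel, total) complex sample rates in Msps."""
    per_channel = Fraction(band_mhz, n_channels) * oversample
    return per_channel, per_channel * n_channels

def link_payload_mbps(lane_mbaud, n_lanes=4, coding=Fraction(8, 10)):
    """Payload rate of a multi-lane serial link (8b/10b coding assumed)."""
    return lane_mbaud * n_lanes * coding

def max_oversample(link_mbps, channel_spacing_mhz, bits_per_sample=8):
    """Largest oversampling ratio one channel can use on one link."""
    return Fraction(link_mbps) / (channel_spacing_mhz * bits_per_sample)

# Critically sampled: total output rate equals the input bandwidth.
critical = pfb_output_rates(8000, 8, Fraction(1))
# Oversampled by 20%: total output rate exceeds the input bandwidth.
oversampled = pfb_output_rates(8000, 8, Fraction(6, 5))

xaui = link_payload_mbps(3125)   # nominal XAUI lane rate
fast = link_payload_mbps(6500)   # hypothetical faster synchronous link

print("critical   :", critical)
print("oversampled:", oversampled)
print("max ratio at 3.125 Gbaud:", float(max_oversample(xaui, 1000)))
print("max ratio at 6.5 Gbaud  :", float(max_oversample(fast, 1000)))
```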

I believe Fred Harris' book [1] has a discussion on this, in
Chapter 9 'Polyphase Channelizers'.

Cheers,
Dave

[1] F. J. Harris, "Multirate Signal Processing for
Communications Systems", 2004.

Re: [casper] wideband conversion and correlation

2010-12-29 Thread Robert Wilson

Dear Dan et al.,

Season's Greetings.  I hope that you have had less rain than
we had snow.

Previously I have suggested stacking two layers of PFBs in
the case where the sampler runs too fast for a single FPGA
to serve as a complete F engine.  Although I can't find an email
with your earlier suggestion, I believe that this is what
you are calling an FFX correlator.  A couple of weeks ago,
I was thinking about this and realized that, at least for our
purposes, it will not work.

We would like to cover a broad band with relatively fine
spectral resolution and do not want holes in our spectral
coverage unless there is a big penalty for avoiding them.
Think about the first PFB which divides up the original band
into, say, 8 blocks.  Suppose we use the Micram ADC30 and
convert 9 GHz at a time.  Then each block is 1.125 GHz wide.
We will want to be able to divide that into 32K (or perhaps
even 64K) channels.  Now consider the edge of the band covered
by that block.  The PFB can be designed to have a very sharp
cutoff at the edge, perhaps 30 dB.  This will avoid aliasing
from adjacent bands, but the channels near each edge will be
~30 dB down and effectively useless.  The block will be sampled
at the Nyquist rate, so one cannot design the filter a bit
wider and throw away the edge channels in the second stage.
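
For reference, the block width and fine channel spacing implied by these numbers can be worked out as follows. This is a small arithmetic sketch only; it takes 32K and 64K to mean 32768 and 65536 channels, and the names are illustrative.

```python
# Block width and fine channel spacing for a 9 GHz band split into 8 blocks.
BAND_HZ = 9e9
N_BLOCKS = 8

block_bw_hz = BAND_HZ / N_BLOCKS          # width of each first-stage block

# Fine channel spacing for the two second-stage sizes under discussion.
resolution_hz = {n: block_bw_hz / n for n in (32 * 1024, 64 * 1024)}

print("block width: %.3f GHz" % (block_bw_hz / 1e9))
for n_chan, res in resolution_hz.items():
    print("%6d channels: %.2f kHz per channel" % (n_chan, res / 1e3))
```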

I have seen two designs which I believe are attempts to deal
with this problem:

Mark Torres described a design in which the first stage
PFB is actually duplicated as two PFBs shifted by half of a
channel width.  The PFBs can have simple FIR filters as only
the central half of each will be used after the second stages.
This certainly solves the problem, but it looks to me as though
it requires almost twice as much computing in the F engines
and the data rate out of the first stage will be twice the
input rate.

I believe that I have also seen designs in which the first
stage is done with overlapping FIR filters.  I don't know how
much more computing that requires than the PFB, but the data
rate is only modestly increased as is the computing in the
second stage PFBs.

The latter is probably the preferred solution, but there are
two places where the CASPER PFB could be split: after the
FIR filter and between the two stages of the FFT.  This would
allow sharing the load among up to three FPGAs.  There would
be no increase in data communications rate in this option.

I have discussed this with Alan Rogers who offered to
think about efficient solutions to the problem.  In their
VLBI processors, they split the input band into separate
channels with a PFB to emulate the original analog filters.
Apparently they have not worried about complete spectral
coverage with the multitap off-line cross correlator.

I wonder if there are other solutions to this problem.

Regards,
Bob Wilson


Re: [casper] wideband conversion and correlation

2010-12-24 Thread Dan Werthimer



hi jason,  jonathan,

regarding jason's concerns below about corner turns
and 10Gbit links:

in the FFX model that i propose, where the
first FPGA breaks the 9 GHz band up into 8 pieces of
1.25 GHz each, there is no corner turner needed, as
the 8 frequency bands emerge from the PFB in eight
parallel paths and each path goes separately to its own
XAUI or 10Gbit ethernet port.

the FFX design doesn't require any block ram or QDR:
all the coefficients in the 8 channel PFB/FFT are constants,
and there are no BRAM delays in the FFT, only registers,
as the FFT is implemented with fully parallel inputs and outputs.
(the FFT is implemented like a text book diagram of an
8 input FFT with all the butterflies done in parallel).
or instead of an 8 channel PFB, the channelization can
be implemented as 8 DDC's, again with no BRAM's.
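
To illustrate why a fully parallel 8 point FFT needs no coefficient storage, here is a minimal software sketch of the text book butterfly diagram. It is a plain complex radix-2 decimation-in-time FFT, not the real-input PFB/FFT described above and not a CASPER block; the function names are made up for this example. All twiddle factors are the four constants W8^0..W8^3, and only W8^1 and W8^3 need a real multiplier (W8^0 = 1 and W8^2 = -j are trivial).

```python
import cmath

# Constant twiddle factors W8^k = exp(-2*pi*j*k/8) for k = 0..3.
W8 = [cmath.exp(-2j * cmath.pi * k / 8) for k in range(4)]

def fft8(x):
    """Unrolled 8 point radix-2 DIT FFT: three stages of four butterflies."""
    # Inputs in bit-reversed order.
    a = [x[0], x[4], x[2], x[6], x[1], x[5], x[3], x[7]]
    # Stage 1: 2 point butterflies, twiddle is 1.
    b = [a[0] + a[1], a[0] - a[1], a[2] + a[3], a[2] - a[3],
         a[4] + a[5], a[4] - a[5], a[6] + a[7], a[6] - a[7]]
    # Stage 2: 4 point butterflies, twiddles W8^0 and W8^2.
    c = [b[0] + b[2], b[1] + W8[2] * b[3], b[0] - b[2], b[1] - W8[2] * b[3],
         b[4] + b[6], b[5] + W8[2] * b[7], b[4] - b[6], b[5] - W8[2] * b[7]]
    # Stage 3: 8 point butterflies, twiddles W8^0..W8^3.
    top = [c[k] + W8[k] * c[k + 4] for k in range(4)]
    bottom = [c[k] - W8[k] * c[k + 4] for k in range(4)]
    return top + bottom

def dft8(x):
    """Direct 8 point DFT, used only as a reference."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / 8) for n in range(8))
            for k in range(8)]

samples = [1, 2 - 1j, 3, 4 + 2j, 5, 6, 7 - 3j, 8]
max_err = max(abs(p - q) for p, q in zip(fft8(samples), dft8(samples)))
print("max difference from direct DFT:", max_err)
```

In hardware each line of the three stages would map to adders and constant multipliers operating in parallel, with registers between stages.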

i think the Roach II's eight 10Gbit links can just barely
support 9 GHz of bandwidth, with 4 bit real, 4 bit imaginary data:
(1.25 GHz each * 8 bits = 10Gbits/sec on each link).
this will work with XAUI, but for 10Gbe, the extra overhead
from headers, time stamps, etc will reduce the bandwidth slightly.
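
A rough way to estimate the 10GbE penalty is sketched below. It assumes UDP over IPv4 over standard Ethernet framing and uses a hypothetical payload size and application header (for time stamps and the like); neither of those two values comes from this thread.

```python
LINE_RATE_BPS = 10e9
BITS_PER_SAMPLE = 8                       # 4 bit real + 4 bit imaginary

# Standard per-frame overheads in bytes.
ETH_OVERHEAD_BYTES = 7 + 1 + 14 + 4 + 12  # preamble, SFD, MAC header, FCS, gap
IP_UDP_OVERHEAD_BYTES = 20 + 8            # IPv4 header + UDP header

def usable_bandwidth_hz(payload_bytes, app_header_bytes):
    """Complex bandwidth one 10GbE link can carry for a given packet layout."""
    wire_bytes = (payload_bytes + app_header_bytes
                  + IP_UDP_OVERHEAD_BYTES + ETH_OVERHEAD_BYTES)
    efficiency = payload_bytes / wire_bytes
    return LINE_RATE_BPS * efficiency / BITS_PER_SAMPLE

# Raw link with no packet overhead: the 1.25 GHz figure quoted above.
raw_hz = LINE_RATE_BPS / BITS_PER_SAMPLE

# Hypothetical packet: 1024 byte payload, 8 byte application header.
packetized_hz = usable_bandwidth_hz(1024, 8)

print("raw link  : %.3f GHz" % (raw_hz / 1e9))
print("packetized: %.3f GHz" % (packetized_hz / 1e9))
```

Smaller payloads push the usable bandwidth down further, so the exact figure depends on the packet format chosen.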

jonathan,

suraj's concern about achieving high clock rates at high demux values
is for large FFT's (you asked about 32K points).
if you are just doing an 8 point PFB or FFT, or implementing 8 DDC's,
for an FFX correlator, the routing is pretty straightforward -
you won't be using the CASPER PFB or FFT blocks,
it's all a fully parallel implementation.


best wishes,

dan






Re: [casper] wideband conversion and correlation

2010-12-24 Thread Jonathan Weintroub

Dear all who responded,

First, I apologize for inadvertently cc'ing the entire list with a  
message to my internal team.  A consequence of using autocomplete in  
the cc field to make sure I got the list address right.  Thankfully I  
think I only said nice things ;)


Second, I really appreciate all the responses which are very  
enlightening.  I have little time to read carefully, and less time to  
respond, as I leave with my family to Cape Town this morning, and  
don't expect to surface for a  good few days.  I do look forward to  
connecting with the SA SKA/KAT group, probably in January to discuss  
this and other things in person.  Perhaps the discussion will continue  
nonetheless.


A few quick comments, based only on a scan of the responses.

--the SMA is an 8 antenna array, with two active receivers per
antenna.  In particular, these might be dual pol, thus 16 "ant-pols".


--we are certainly open to distributing the processing in the manner  
suggested by Dan, Mel, and possibly others.  Even in such a scheme,  
though, an understanding of PFB fit and limits, and, related,  
increasing clock rates to improve performance is warranted.  We are  
also open to not packetizing (on-board corner turn).


--I made mention of 500 MHz FFT cores, those were advertised by  
industry DSP specialists we have had discussions with.   Not designed  
with CASPER methods.   Multiple clock domains are required, and  
perhaps we could "black box" one of these cores.  I don't think anyone  
has commented on multiple clock domains in CASPER yet, Billy, anyone?   
(may have missed it on scanning).


--We need to understand memory util, including bram, qdr, ddr, amount  
and bandwidth.  Will read your comments carefully.


--Andrew, our finding is that *both* multipliers and adders scale as
D log2(D), where D is the demux factor (there are other terms, but
this one dominates).  If N is the size of the PFB, they scale only as
log(N) (I may misremember whether this is the dominant term).  I don't
understand the implication of the condition "(for
large FFT sizes, i.e not doing straight butterfly)".  I would very much
like to discuss all of this with you, and others who might be
interested, in CT if possible.
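
To make the difference between the two scalings concrete, the sketch below compares the growth from doubling D with the growth from doubling N. Only the dominant terms quoted above are modelled, constants are omitted, and the function names are illustrative.

```python
import math

def demux_cost(d):
    """Dominant resource term for demux factor d: d * log2(d)."""
    return d * math.log2(d)

def size_cost(n):
    """Dominant resource term for PFB size n: log2(n)."""
    return math.log2(n)

# Doubling the demux factor (bandwidth) from 64 to 128.
growth_demux = demux_cost(128) / demux_cost(64)
# Doubling the PFB size (resolution) from 32K to 64K.
growth_size = size_cost(65536) / size_cost(32768)

print("doubling D: x%.3f" % growth_demux)
print("doubling N: x%.3f" % growth_size)
```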


--Dan your statement that D=64 or 128 would be possible is very  
encouraging, but appears to contradict what Suraj said.  Would very  
much like to resolve this.




Thanks to all who contributed.  In a huge rush, please excuse
misstatements or typos, or questions on matters already addressed.  Merry
Christmas to those who celebrate it.  And look forward to picking up
this thread again.


Jonathan



Re: [casper] wideband conversion and correlation

2010-12-24 Thread Andrew Martens

Hi Jonathan

> To start we are looking closely at the FPGA resource utilization of large
> PFBs.  Something that probably is common knowledge amongst those experienced
> in FX correlator design is that the demux factor drives the utilization much
> faster than the size of the PFB.  In that sense bandwidth is far more
> expensive than spectral resolution.  We've put some effort into accurately
> quantifying the utilization, at least as far as multipliers and adders are
> concerned, and are expanding this analysis to block ram and other resources.
>  And demux factor is typically radix 2, so it is very much quantized.
>

Some thoughts on resource usage with the CASPER pfb_fir (for large FFT
sizes, i.e. not doing straight butterfly);

complex multiplier usage;
  - scales linearly with the demux factor (often bandwidth)
  - scales linearly with number of FIR taps
  - is not affected by the FFT size

adder usage (the final adder tree);
  - scales by nlogn with the demux factor. Will dominate adder usage for
large demux factors
  - scales by nlogn with the number of FIR taps
  - is not affected by the FFT size

BRAM usage;
  - currently scales linearly with demux factor, but should not be affected
by it (barring constraints set by underlying hardware). (BRAMs are currently
not used efficiently - a separate set of coefficient and data storage BRAMs
is not needed for each data input. The storage requirements should be
completely dependent on FFT size and number of FIR taps).
  - scales linearly with the number of FIR taps. The current design could be
improved so that BRAMs are more efficiently used though.
  - scales linearly with FFT size.

Routing constraints;
The design is simple, highly pipelined (almost no feedback) with very low
fanout. Major constraints are BRAM to DSP slice, DSP slice to DSP slice and
rounding, all of which are parameterised.

Optimisations possible;
The efficiency of BRAM use can be improved with some small logic savings.

Resource usage in the CASPER FFT (when using the biplex FFT, e.g.
fft_wideband_real and fft for 'large' FFTs);

complex multiplier usage;
  - dominated by (n/2)*log2n (n = demux factor) needed in fft_direct for
large FFTs.
  - scales linearly with increase in FFT size.

BRAM usage;
  - scales linearly with bandwidth for large FFTs if FFT size kept constant.
  - for constant (large) FFT size, unaffected by demux factor. Biplex cores
shrink in length by one stage while doubling in number for each doubling in
demux factor.
  - scales roughly like n^2 with increase in FFT size.
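
The (n/2)*log2n term for the direct (fully parallel) part of the FFT is just the butterfly count of an n point radix-2 FFT, with one complex multiply per butterfly before any trivial twiddles are optimised away. A small sketch, with an illustrative function name that is not part of the CASPER library:

```python
def fft_direct_butterflies(n):
    """Butterflies in a fully parallel radix-2 FFT across n parallel inputs."""
    if n < 2 or n & (n - 1):
        raise ValueError("demux factor must be a power of two >= 2")
    stages = n.bit_length() - 1          # log2(n) for a power of two
    return (n // 2) * stages

for demux in (8, 16, 32, 64, 128):
    print("demux %3d: %3d butterflies" % (demux, fft_direct_butterflies(demux)))
```

Each doubling of the demux factor slightly more than doubles this count, which is consistent with the point that bandwidth costs more than resolution.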

Routing constraints;
  The FFT is highly pipelined with low fanout except for in the unscrambler
(although some work has been done here and the unscrambler is now optional).
Major constraints are BRAM to DSP slice, DSP slice to DSP slice and
rounding, all of which are parameterised.

Optimisations possible;
Various optimisations are still possible;
  - Coefficients could be shared between twiddles, reducing the number of
BRAMs required by the demux factor. This would be significant for large
demux factor designs at the expense of some fanout.
  - BRAMs used for delaying data could be shared between input streams,
saving some BRAMs at the expense of extra routing constraints.
  - As Dan has suggested, grow the bits in the FFT at each stage as needed
to reduce logic (and BRAM) use and probably help timing. Care should be
taken however, as data quality is directly related to the width of the data
path through the FFT.

As noted by Jason, please also remember that other constraints such as QDR
SRAM and XAUI bandwidth need to be considered when building such a large
system.

Dan's suggestion of FFX is worth considering. It is upgradeable, allowing
the addition of newer, more capable boards as they come online until you end
up with a simple FX correlator again.

I would love to see a correlator like that in action.

Regards
Andrew


Re: [casper] wideband conversion and correlation

2010-12-23 Thread Jason Manley

To the best of my knowledge, nobody's built a CASPER correlator that processes 
such high bandwidths. I took a closer look at bringing 20Gsps into a ROACH2 for 
MeerKAT use a few months back. My conclusion was that this would be possible 
with current libraries with minimal changes. However, we weren't aiming for 32k 
PFBs and we weren't aiming to process the entire 10GHz band (we'd use a DDC and 
only process a couple of GHz). 

I believe that you could put a PFB on the whole band if you tweak and optimise 
the library block as Billy has just done with the FFT. Though the FPGA might 
not run at very high speeds and you might not get the spectral resolution that 
you want directly due to resource consumption of pipelining, I think it would 
be possible to break this up into subbands (FFX approach) on a single board.

I will highlight the following limitations with processing such large 
bandwidths on the ROACH2 platform:

  1) QDR corner-turn bandwidth. You don't mention how many inputs you're 
planning and so you might not need the packetised infrastructure at all 
(perhaps you're considering something like Billy's point-to-point 3GHz 3-input 
correlator). ROACH2 will have four 36-bit QDR interfaces. These can be ganged 
together and demuxed to produce a single 288bit SDR interface so that your 
limits would be:
32 parallel_streams * 4bit * 2complex = need 256-bit interface.
These 32 parallel streams are complex, post-FFT (after the imag half of 
spectrum has been tossed) so that it would represent ~300MHz*32=9.6GHz of real 
band. So you might be OK here.

  2) QDR capacity for the corner turn is much less of a concern. With a packet 
length of 128 (what everyone's using right now), you can have up to 64 antennas:
128pkt_len * 32768chan * 4bit * 2complex = 32Mbit per dual-pol antenna.

  *) With the huge BRAM reserves on the V6, it might even be possible to bypass 
the QDR and do the whole corner-turn in BRAM, especially if you opt for smaller 
packet sizes (which'd result in smaller buffers and potentially faster dump 
rates but with reduced network efficiencies due to smaller payload/header 
ratio).

  3) Another consideration, and possible deal-breaker, is the interconnect: 
ROACH-II will have 8 10GbE links (or maybe later two 40Gbps links) which could 
carry a little over 7GHz bandwidth after network overhead. Again, if you're not 
aiming for a packetised system, then you can do a little better. If your ADC is 
going to use some of the SERDES lines though (as many of the new high speed 
samplers do), then you might have to forfeit some of this interconnect. But 
basically, I think you're going to run out of bandwidth to get 10GHz out.
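
The QDR figures in points 1 and 2 can be reproduced with the short sketch below. The factor of two for turning the double-data-rate QDR interface into a single-data-rate bus is my reading of "demuxed" above (it gives the quoted 288 bits); the names are illustrative.

```python
# Point 1: QDR corner-turn interface width.
QDR_INTERFACES = 4
QDR_WIDTH_BITS = 36
DDR_TO_SDR = 2               # assumed: two transfers per clock, demuxed to one
PARALLEL_STREAMS = 32
BITS_PER_COMPONENT = 4
COMPONENTS = 2               # real + imaginary
FPGA_CLOCK_MHZ = 300

sdr_width_bits = QDR_INTERFACES * QDR_WIDTH_BITS * DDR_TO_SDR
needed_width_bits = PARALLEL_STREAMS * BITS_PER_COMPONENT * COMPONENTS
real_band_mhz = FPGA_CLOCK_MHZ * PARALLEL_STREAMS

# Point 2: corner-turn buffer size, using the product exactly as written.
PKT_LEN = 128
N_CHAN = 32768
buffer_bits = PKT_LEN * N_CHAN * BITS_PER_COMPONENT * COMPONENTS
buffer_mbit = buffer_bits // 2**20

print("available width:", sdr_width_bits, "bits; needed:", needed_width_bits)
print("real band covered:", real_band_mhz, "MHz")
print("corner-turn buffer:", buffer_mbit, "Mbit")
```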

WRT clock rates, I think that 300MHz should be achievable on ROACH-II with a 
little tweaking. ROACH-1 is able to do 250MHz with much less fiddling than the 
iBOBs at these speeds. The iBOB with the old libraries used to start choking 
around just 200MHz. So the clock rates are improving a little 
generation-to-generation and I don't think it's unreasonable to hope for 300MHz 
from V6 but I'm conservatively banking on at least 250MHz.

My conservative conclusion after going through this whole exercise for KAT was 
that ROACH-2 could comfortably handle 4GHz bandwidth chunks at ~8Gsps 
(8000/32=250MHz clk rate) and that we'd start hitting various limits not long 
after that. So I would say that if you're considering ROACH-2 as a platform, 
you'd be safe if aiming for IF chunks around 4 or 5 GHz. 
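
As a sanity check on the "4 or 5 GHz" conclusion, here is a minimal sketch that runs the 8000/32=250MHz relation backwards: given an FPGA clock and a demux factor, what real bandwidth can be digitized? It assumes real sampling at exactly the Nyquist rate (bandwidth = sample rate / 2), which is a simplification; the function names are made up for illustration.

```python
# Invert "sample_rate / demux = FPGA clock" to get the bandwidth per IF chunk.
# Assumes Nyquist-rate real sampling; names are illustrative only.

def max_sample_rate_msps(fpga_clock_mhz, demux):
    """Highest ADC sample rate the FPGA can absorb at this clock and demux."""
    return fpga_clock_mhz * demux


def real_bandwidth_mhz(sample_rate_msps):
    """Nyquist bandwidth of a real-sampled signal."""
    return sample_rate_msps / 2


if __name__ == "__main__":
    for clk in (250, 300):
        rate = max_sample_rate_msps(clk, demux=32)
        print(f"{clk} MHz clock, demux 32: {rate / 1000} Gsps, "
              f"{real_bandwidth_mhz(rate) / 1000} GHz bandwidth")
```

With demux 32, the conservative 250 MHz clock gives 4 GHz chunks and the hoped-for 300 MHz clock gives 4.8 GHz, which brackets the range suggested above.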

Jason

On 24 Dec 2010, at 07:51, Dan Werthimer wrote:

> 
>> On 2. it seems to me that if we are digitizing a 9 GHz and using 20 Gsps, 
>> one still needs substantial demux (at least 64) no matter how small the PFB. 
>> As Sura points out this is far in excess of practical limits.  This stacks 
>> with what we have found: BW is the difficult part, large PFB for high res 
>> less so.
> 
> 
> hi jonathan,
> 
> i agree you need to demux 20 Gsps by 64 or 128, but i don't think this will 
> be a problem.
> 20 Gsps should fit pretty easily into an FPGA an FFX correlator:
> 
> in my example of the FFX, you'd need to implement an 8 point PFB
> on the first FPGA to break the 10 GHz band into 8 sub-bands.
> let's assume you do demux of 64, and clock the FPGA at 312.5 MHz:
> you'd need 64*8 multipliers to implement the FIR part of an 8 tap PFB.
> and 64 * 16 multipliers to implement the real to complex FFT part of the PFB.
> all the multipliers have fixed coefficients  - no need to use block rams to
> store coefficients - no block rams are needed for delays or coefficients, as 
> you'd
> implement the butterfly diagram directly.
> 
> so there's no coefficient routing, but there is data routing.
> the data paths can all be 8 bit, and you can add pipeline registers
> where needed, so you should be able to get to 312.5 MHz.
> 
> if you can't get the FPGA to route at 312.5 MHz, then you'd have
> to demux by 128, and you'd need twice as many multipliers.
> (instead of 1536 multi

Re: [casper] wideband conversion and correlation

2010-12-23 Thread Dan Werthimer


> On 2. it seems to me that if we are digitizing a 9 GHz and using 20 
> Gsps, one still needs substantial demux (at least 64) no matter how 
> small the PFB. 
> As Sura points out this is far in excess of practical limits.  This 
> stacks with what we have found: BW is the difficult part, large PFB 
> for high res less so.



hi jonathan,

i agree you need to demux 20 Gsps by 64 or 128, but i don't think this 
will be a problem.

20 Gsps should fit pretty easily into an FPGA in an FFX correlator:

in my example of the FFX, you'd need to implement an 8 point PFB
on the first FPGA to break the 10 GHz band into 8 sub-bands.
let's assume you do demux of 64, and clock the FPGA at 312.5 MHz:
you'd need 64*8 multipliers to implement the FIR part of an 8 tap PFB.
and 64 * 16 multipliers to implement the real to complex FFT part of the 
PFB.

all the multipliers have fixed coefficients  - no need to use block rams to
store coefficients - no block rams are needed for delays or 
coefficients, as you'd

implement the butterfly diagram directly.

so there's no coefficient routing, but there is data routing.
the data paths can all be 8 bit, and you can add pipeline registers
where needed, so you should be able to get to 312.5 MHz.

if you can't get the FPGA to route at 312.5 MHz, then you'd have
to demux by 128, and you'd need twice as many multipliers.
(instead of 1536 multipliers, it would take 3072 multipliers).
you can use block rams for many of the multipliers, as most of the
computations are multiplying 8 bit data by a fixed coefficient,
so an 8 input, 8 output look up table is all you need.
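
the multiplier counts above can be reproduced with a short python sketch. the per-input costs (8 FIR multipliers for the 8 taps, 16 multipliers for the real to complex FFT part) are the figures from this email, not derived independently, and the function names are only for illustration.

```python
# Reproduce the multiplier estimate for an 8-tap, 8-point PFB at a given demux.
# The cost per parallel input is taken from the email; names are hypothetical.

def fpga_clock_mhz(sample_rate_msps, demux):
    """FPGA clock required when the ADC stream is split into `demux` lanes."""
    return sample_rate_msps / demux


def pfb_multipliers(demux, taps=8, fft_mults_per_input=16):
    """Return (FIR, FFT, total) multiplier counts for the first-stage PFB."""
    fir = demux * taps
    fft = demux * fft_mults_per_input
    return fir, fft, fir + fft


def lut_bits(input_bits=8, output_bits=8):
    """Size of a lookup table that multiplies by one fixed coefficient."""
    return (2 ** input_bits) * output_bits


if __name__ == "__main__":
    for demux in (64, 128):
        fir, fft, total = pfb_multipliers(demux)
        clk = fpga_clock_mhz(20000, demux)
        print(f"demux {demux}: {clk} MHz clock, FIR {fir}, FFT {fft}, total {total}")
    print("fixed-coefficient LUT:", lut_bits(), "bits")
```

each fixed-coefficient lookup table is 256 entries of 8 bits, i.e. 2048 bits, so a block ram can hold several of them.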

if you don't want to implement an 8 channel PFB,
you could also implement this as eight DDC's running in parallel
from the same ADC data, each DDC with a different downmix frequency.
the mixer coefficients are fixed, and many of the coefficients are 0, 1, 
-1.
the DDC's low pass filter coefficients are fixed as well - you can use 
look up tables for the
low pass filters multipliers and the mixer multipliers if you are short 
on DSP48's.
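
to illustrate why fixed mixer coefficients can be cheap, here is a small sketch that tabulates one period of a complex mixer for an LO at a rational fraction of the sample rate and counts how many entries have real and imaginary parts in {0, 1, -1}. the LO fractions used (fs/4 and fs/8) are hypothetical examples chosen for clarity, not the actual sub-band centres of the design discussed here.

```python
# Tabulate one period of a fixed complex mixer exp(-2*pi*j*f_lo/fs*n) and
# count the "free" coefficients (parts equal to 0, +1 or -1).
# The LO fractions below are illustrative assumptions.
import cmath
from fractions import Fraction


def mixer_table(lo_over_fs):
    """One full period of mixer coefficients for f_lo/fs given as a Fraction."""
    period = lo_over_fs.denominator
    table = []
    for n in range(period):
        c = cmath.exp(-2j * cmath.pi * lo_over_fs.numerator * n / period)
        # round away floating point noise so exact 0 and +/-1 are recognisable
        table.append(complex(round(c.real, 12), round(c.imag, 12)))
    return table


def count_trivial(table):
    """Number of coefficients that need no real multiplier at all."""
    trivial = {-1.0, 0.0, 1.0}
    return sum(1 for c in table if c.real in trivial and c.imag in trivial)


if __name__ == "__main__":
    for frac in (Fraction(1, 4), Fraction(1, 8)):
        table = mixer_table(frac)
        print(f"f_lo = fs*{frac}: {count_trivial(table)} of {len(table)} trivial")
```

an LO at fs/4 needs no multipliers at all, while fs/8 needs them for half of the coefficients; other LO choices will have fewer trivial entries, so the saving depends on where the sub-band centres fall.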


best wishes,

dan




BTW I realize as I write that my 6 GHz BW demux 32 case suggested in 
response to Suraj still requires > 400 MHz FPGA clock, thus not so 
practical.   Can one gain a factor of 2 in demux doing quadrature 
sampling, and having I and Q inputs to a complex input PFB each at 1/2 
the rate?


Jonathan


On Dec 23, 2010, at 5:24 PM, Dan Werthimer wrote:




hi jonathan,

some ideas for your correlator:

1)
300 MHz is a good target, especially for V6.
suraj has shown how to  achieve 375 MHz for V5
by using floor planning and auto-placing.
suraj or i can send you his draft paper on this if you'd like.

2)
you might want to consider FFX instead of FX:
eg: digitizing your 9 GHz band and using a PFB to break it up into 
eight sub-bands

of 1.25 GHz each, and then sending the sub-bands into eight 1.25 GHz
FX correlators.   this will simplify your switch requirements and 
each correlator
now has only 4K channels, which is better suited for cornering turn 
in a roach II.


3)
also, be sure to use billy's latest FFT, (recently checked in),
which moves all the adders and multipliers into DSP48's makes routing 
easier.

you should also consider bit growth FFT's and PFB's, which start
out with the 4 or 5 or 8 bits from your ADC, and add bits gradually
as you move the frequency domain.   dave mcmahon and hong chen
have done work on this.

best wishes,

dan

On 12/23/2010 1:47 PM, Jonathan Weintroub wrote:

Hi CASPERites,

Here's a somewhat fluffy RFI which I hope might start a little 
thought and/or discussion over the season (acknowledging that not 
all in the global collaboration celebrate the traditional Western 
winter holidays):


At SMA we are looking into the use of CASPER methods to build a 
ultra wideband high spectral resolution correlator.  Typical specs 
are, say, 18 GHz bandwidth with roughly 300 KHz spectral resolution, 
by two polarizations, full Stokes.   We are considering using a 
standard CASPER packetized FX architecture (FX much better for high 
res than XF), but in the relatively unexplored "small number of 
antennas, wide bandwidth" regime.   If the entire 18 GHz were eaten 
by one ADC, this would require a sample rate of 40 Gsps and 64 
kpoint PFB.   Perhaps more reasonable would be two 9 GHz BW blocks 
and a 32 k PFB sampled at about 20 Gsps, or three 6 GHz / 16 or 32 k 
PFB / 14 Gsps.


To start we are looking closely at the FPGA resource utilization of 
large PFBs.  Something that probably is common knowledge amongst 
those experienced in FX correlator design is that the demux factor 
drives the utilization much faster than the size of the PFB.  In 
that sense bandwidth is far more expensive than spectral 
resolution.  We've put some effort into accurately quantifying the 
utilization, at least as far as multipliers and adders are 
concerned, and are expanding this analysis to block ram and other 
resources.  And demux factor is typically radix 2, so it is very 
much quan

Re: [casper] wideband conversion and correlation

2010-12-23 Thread melvyn wright
Another reason to consider FFX is more flexible selection of  how
many channels you want in each subband.

ii)  Equalization across a 16 GHz band may be an issue.  If there is too much
slope across the band then, in an extreme case, a change at the low-gain end
might make no change in the digitized signal.
Again, there might be some specification on bandpass flatness requirements.


mel

On 12/23/10, Dan Werthimer  wrote:
>
>
> hi jonathan,
>
> some ideas for your correlator:
>
> 1)
> 300 MHz is a good target, especially for V6.
> suraj has shown how to  achieve 375 MHz for V5
> by using floor planning and auto-placing.
> suraj or i can send you his draft paper on this if you'd like.
>
> 2)
> you might want to consider FFX instead of FX:
> eg: digitizing your 9 GHz band and using a PFB to break it up into eight
> sub-bands
> of 1.25 GHz each, and then sending the sub-bands into eight 1.25 GHz
> FX correlators.   this will simplify your switch requirements and each
> correlator
> now has only 4K channels, which is better suited for cornering turn in a
> roach II.
>
> 3)
> also, be sure to use billy's latest FFT, (recently checked in),
> which moves all the adders and multipliers into DSP48's makes routing
> easier.
> you should also consider bit growth FFT's and PFB's, which start
> out with the 4 or 5 or 8 bits from your ADC, and add bits gradually
> as you move the frequency domain.   dave mcmahon and hong chen
> have done work on this.
>
> best wishes,
>
> dan
>
> On 12/23/2010 1:47 PM, Jonathan Weintroub wrote:
>> Hi CASPERites,
>>
>> Here's a somewhat fluffy RFI which I hope might start a little thought
>> and/or discussion over the season (acknowledging that not all in the
>> global collaboration celebrate the traditional Western winter holidays):
>>
>> At SMA we are looking into the use of CASPER methods to build a ultra
>> wideband high spectral resolution correlator.  Typical specs are, say,
>> 18 GHz bandwidth with roughly 300 KHz spectral resolution, by two
>> polarizations, full Stokes.   We are considering using a standard
>> CASPER packetized FX architecture (FX much better for high res than
>> XF), but in the relatively unexplored "small number of antennas, wide
>> bandwidth" regime.   If the entire 18 GHz were eaten by one ADC, this
>> would require a sample rate of 40 Gsps and 64 kpoint PFB.   Perhaps
>> more reasonable would be two 9 GHz BW blocks and a 32 k PFB sampled at
>> about 20 Gsps, or three 6 GHz / 16 or 32 k PFB / 14 Gsps.
>>
>> To start we are looking closely at the FPGA resource utilization of
>> large PFBs.  Something that probably is common knowledge amongst those
>> experienced in FX correlator design is that the demux factor drives
>> the utilization much faster than the size of the PFB.  In that sense
>> bandwidth is far more expensive than spectral resolution.  We've put
>> some effort into accurately quantifying the utilization, at least as
>> far as multipliers and adders are concerned, and are expanding this
>> analysis to block ram and other resources.  And demux factor is
>> typically radix 2, so it is very much quantized.
>>
>> For example at 20 Gsps one might consider a demux factor of 128
>> resulting in an FPGA clock rate of 156 MHz, which is quite comfortable
>> for the FPGA.  Alternatively a demux factor of 64 with corresponding
>> FPGA clock of twice that, or over 300 MHz.   Traditionally a rather
>> uncomfortable regime for CASPER (we're unusual, I believe, in running
>> iBOBs at 256 MHz for the VLBI phased array).  The trouble is our
>> analysis shows that the difference between these two demux setting in
>> the size of PFB one can fit in a Virtex 6 is really quite large, and
>> 128 definitely won't allow us to do what we need to do.
>>
>> So we are increasingly highly motivated to run the FPGAs faster
>> still.  Just a 20% increment from the 256 MHz which we currently view
>> as a practical upper limit allows us to cross a clock rate threshold
>> which then enables a factor of two decrease in demux factor, and
>> consequent even larger increment in the realizable PFB size.
>>
>> Which is just a long winded way of asking if there are any others in
>> the collaboration motivated to run the FPGAs faster, and whether any
>> tricks can be shared?  In particular, does the CASPER toolflow support
>> multiple clock domains? Our understanding is not yet, but that's based
>> on incomplete information.   We know that there exists Virtex 5 (?) IP
>> FFT cores which supposably run at greater than 500 MHz rates, using
>> the enhanced interconnect between DSP slices.
>>
>> While on this topic of high demux factors, the tool flow largely
>> chokes on demux factors of 32 or greater.  Any tips here would also be
>> appreciated.
>>
>> If anyone can cast light on this general topic and related concerns it
>> would be very much appreciated.
>>
>> Jonathan Weintroub
>> SAO
>>
>>
>>
>>
>
>
>



Re: [casper] wideband conversion and correlation

2010-12-23 Thread Jonathan Weintroub

Hi Dan,

Thanks for the input.  As you see, Suraj has responded, and I will  
explore his techniques with him.  Yes, very interested in any papers.


3. is good advice.

On 2. it seems to me that if we are digitizing a 9 GHz band and using 20  
Gsps, one still needs substantial demux (at least 64) no matter how  
small the PFB.  As Suraj points out this is far in excess of practical  
limits.  This stacks with what we have found: BW is the difficult  
part, large PFB for high res less so.


BTW I realize as I write that my 6 GHz BW demux 32 case suggested in  
response to Suraj still requires > 400 MHz FPGA clock, thus not so  
practical.   Can one gain a factor of 2 in demux doing quadrature  
sampling, and having I and Q inputs to a complex input PFB each at 1/2  
the rate?
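
For reference, the clock rates behind these cases can be tabulated with a small Python sketch. The sample rates and demux factors are the ones mentioned in this thread; the quadrature case simply assumes two ADC streams (I and Q) each at half the rate, and the function names are illustrative only.

```python
# FPGA clock and total parallel sample count for the cases discussed.
# The I/Q split is modelled as two streams at half the sample rate each.

def fpga_clock_mhz(sample_rate_gsps, demux):
    """FPGA clock needed to absorb one ADC stream split into `demux` lanes."""
    return sample_rate_gsps * 1000.0 / demux


def values_per_clock(n_streams, demux):
    """Total number of real-valued samples arriving per FPGA clock."""
    return n_streams * demux


if __name__ == "__main__":
    cases = [
        ("9 GHz, 20 Gsps, demux 128", 20, 128, 1),
        ("9 GHz, 20 Gsps, demux 64", 20, 64, 1),
        ("6 GHz, 14 Gsps, demux 32", 14, 32, 1),
        ("6 GHz, I and Q at 7 Gsps, demux 32 each", 7, 32, 2),
    ]
    for label, rate, demux, streams in cases:
        print(f"{label}: {fpga_clock_mhz(rate, demux)} MHz, "
              f"{values_per_clock(streams, demux)} values per clock")
```

The 14 Gsps, demux 32 case comes out at 437.5 MHz, consistent with the "> 400 MHz" remark. Splitting into I and Q halves the clock per stream to 218.75 MHz, but the two streams together still deliver 64 real values per clock, so this sketch alone does not show whether the complex-input PFB ends up cheaper.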


Jonathan


On Dec 23, 2010, at 5:24 PM, Dan Werthimer wrote:




hi jonathan,

some ideas for your correlator:

1)
300 MHz is a good target, especially for V6.
suraj has shown how to  achieve 375 MHz for V5
by using floor planning and auto-placing.
suraj or i can send you his draft paper on this if you'd like.

2)
you might want to consider FFX instead of FX:
eg: digitizing your 9 GHz band and using a PFB to break it up into  
eight sub-bands

of 1.25 GHz each, and then sending the sub-bands into eight 1.25 GHz
FX correlators.   this will simplify your switch requirements and  
each correlator
now has only 4K channels, which is better suited for cornering turn  
in a roach II.


3)
also, be sure to use billy's latest FFT, (recently checked in),
which moves all the adders and multipliers into DSP48's makes  
routing easier.

you should also consider bit growth FFT's and PFB's, which start
out with the 4 or 5 or 8 bits from your ADC, and add bits gradually
as you move the frequency domain.   dave mcmahon and hong chen
have done work on this.

best wishes,

dan

On 12/23/2010 1:47 PM, Jonathan Weintroub wrote:

Hi CASPERites,

Here's a somewhat fluffy RFI which I hope might start a little  
thought and/or discussion over the season (acknowledging that not  
all in the global collaboration celebrate the traditional Western  
winter holidays):


At SMA we are looking into the use of CASPER methods to build a  
ultra wideband high spectral resolution correlator.  Typical specs  
are, say, 18 GHz bandwidth with roughly 300 KHz spectral  
resolution, by two polarizations, full Stokes.   We are considering  
using a standard CASPER packetized FX architecture (FX much better  
for high res than XF), but in the relatively unexplored "small  
number of antennas, wide bandwidth" regime.   If the entire 18 GHz  
were eaten by one ADC, this would require a sample rate of 40 Gsps  
and 64 kpoint PFB.   Perhaps more reasonable would be two 9 GHz BW  
blocks and a 32 k PFB sampled at about 20 Gsps, or three 6 GHz / 16  
or 32 k PFB / 14 Gsps.


To start we are looking closely at the FPGA resource utilization of  
large PFBs.  Something that probably is common knowledge amongst  
those experienced in FX correlator design is that the demux factor  
drives the utilization much faster than the size of the PFB.  In  
that sense bandwidth is far more expensive than spectral  
resolution.  We've put some effort into accurately quantifying the  
utilization, at least as far as multipliers and adders are  
concerned, and are expanding this analysis to block ram and other  
resources.  And demux factor is typically radix 2, so it is very  
much quantized.


For example at 20 Gsps one might consider a demux factor of 128  
resulting in an FPGA clock rate of 156 MHz, which is quite  
comfortable for the FPGA.  Alternatively a demux factor of 64 with  
corresponding FPGA clock of twice that, or over 300 MHz.
Traditionally a rather uncomfortable regime for CASPER (we're  
unusual, I believe, in running iBOBs at 256 MHz for the VLBI phased  
array).  The trouble is our analysis shows that the difference  
between these two demux setting in the size of PFB one can fit in a  
Virtex 6 is really quite large, and 128 definitely won't allow us  
to do what we need to do.


So we are increasingly highly motivated to run the FPGAs faster  
still.  Just a 20% increment from the 256 MHz which we currently  
view as a practical upper limit allows us to cross a clock rate  
threshold which then enables a factor of two decrease in demux  
factor, and consequent even larger increment in the realizable PFB  
size.


Which is just a long winded way of asking if there are any others  
in the collaboration motivated to run the FPGAs faster, and whether  
any tricks can be shared?  In particular, does the CASPER toolflow  
support multiple clock domains? Our understanding is not yet, but  
that's based on incomplete information.   We know that there exists  
Virtex 5 (?) IP FFT cores which supposably run at greater than 500  
MHz rates, using the enhanced interconnect between DSP slices.


While on this topic of high demux factors, the tool flow largely  
chokes on demux factors of 3

Re: [casper] wideband conversion and correlation

2010-12-23 Thread Jonathan Weintroub
Thanks, Suraj, it is good to hear of your experiences.  I may ask more in  
time about the details of your implementation.


Also, on the practical limit: our analysis so far is purely the number of  
multiplies and adds, and does not yet look at routing.  However, is your  
practical limit for Virtex 5, and might a demux 32 work on a Virtex  
6?  Demux 32 is an interesting case for us (6 GHz blocks and 14 GSa/s  
or so).


Jonathan



On Dec 23, 2010, at 5:09 PM, Suraj Gowda wrote:


Hi Jonathan,

I have been able to build spectrometers (FFT only) that operate at  
375 MHz FPGA clock rate (3 GHz bandwidth).  I don't know of anyone  
who has operated faster designs.


16x is a practical limit for demux factors for the FFT.  The reason  
is that the fft_direct block for 16 inputs uses 32 butterflies,  
which can be fit in 3 DSP48E columns.  A demux factor of 32x would  
substantially increase the routing complexity, probably reducing the  
overall speed.  But this is only my best guess, I haven't actually  
tried.


-Suraj

On Dec 23, 2010, at 4:47 PM, Jonathan Weintroub wrote:


Hi CASPERites,

Here's a somewhat fluffy RFI which I hope might start a little  
thought and/or discussion over the season (acknowledging that not  
all in the global collaboration celebrate the traditional Western  
winter holidays):


At SMA we are looking into the use of CASPER methods to build a  
ultra wideband high spectral resolution correlator.  Typical specs  
are, say, 18 GHz bandwidth with roughly 300 KHz spectral  
resolution, by two polarizations, full Stokes.   We are considering  
using a standard CASPER packetized FX architecture (FX much better  
for high res than XF), but in the relatively unexplored "small  
number of antennas, wide bandwidth" regime.   If the entire 18 GHz  
were eaten by one ADC, this would require a sample rate of 40 Gsps  
and 64 kpoint PFB.   Perhaps more reasonable would be two 9 GHz BW  
blocks and a 32 k PFB sampled at about 20 Gsps, or three 6 GHz / 16  
or 32 k PFB / 14 Gsps.


To start we are looking closely at the FPGA resource utilization of  
large PFBs.  Something that probably is common knowledge amongst  
those experienced in FX correlator design is that the demux factor  
drives the utilization much faster than the size of the PFB.  In  
that sense bandwidth is far more expensive than spectral  
resolution.  We've put some effort into accurately quantifying the  
utilization, at least as far as multipliers and adders are  
concerned, and are expanding this analysis to block ram and other  
resources.  And demux factor is typically radix 2, so it is very  
much quantized.


For example at 20 Gsps one might consider a demux factor of 128  
resulting in an FPGA clock rate of 156 MHz, which is quite  
comfortable for the FPGA.  Alternatively a demux factor of 64 with  
corresponding FPGA clock of twice that, or over 300 MHz.
Traditionally a rather uncomfortable regime for CASPER (we're  
unusual, I believe, in running iBOBs at 256 MHz for the VLBI phased  
array).  The trouble is our analysis shows that the difference  
between these two demux setting in the size of PFB one can fit in a  
Virtex 6 is really quite large, and 128 definitely won't allow us  
to do what we need to do.


So we are increasingly highly motivated to run the FPGAs faster  
still.  Just a 20% increment from the 256 MHz which we currently  
view as a practical upper limit allows us to cross a clock rate  
threshold which then enables a factor of two decrease in demux  
factor, and consequent even larger increment in the realizable PFB  
size.


Which is just a long winded way of asking if there are any others  
in the collaboration motivated to run the FPGAs faster, and whether  
any tricks can be shared?  In particular, does the CASPER toolflow  
support multiple clock domains? Our understanding is not yet, but  
that's based on incomplete information.   We know that there exists  
Virtex 5 (?) IP FFT cores which supposably run at greater than 500  
MHz rates, using the enhanced interconnect between DSP slices.


While on this topic of high demux factors, the tool flow largely  
chokes on demux factors of 32 or greater.  Any tips here would also  
be appreciated.


If anyone can cast light on this general topic and related concerns  
it would be very much appreciated.


Jonathan Weintroub
SAO











Re: [casper] wideband conversion and correlation

2010-12-23 Thread Dan Werthimer



hi jonathan,

some ideas for your correlator:

1)
300 MHz is a good target, especially for V6.
suraj has shown how to  achieve 375 MHz for V5
by using floor planning and auto-placing.
suraj or i can send you his draft paper on this if you'd like.

2)
you might want to consider FFX instead of FX:
eg: digitizing your 9 GHz band and using a PFB to break it up into eight 
sub-bands

of 1.25 GHz each, and then sending the sub-bands into eight 1.25 GHz
FX correlators.   this will simplify your switch requirements and each 
correlator
now has only 4K channels, which is better suited for corner turning in a 
roach II.


3)
also, be sure to use billy's latest FFT (recently checked in),
which moves all the adders and multipliers into DSP48's and makes routing 
easier.

you should also consider bit growth FFT's and PFB's, which start
out with the 4 or 5 or 8 bits from your ADC, and add bits gradually
as you move into the frequency domain.   dave mcmahon and hong chen
have done work on this.
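
a quick check of the FFX numbers in point 2), as a python sketch. it assumes the 20 Gsps sample rate from jonathan's message, so the eight sub-bands tile the full 10 GHz Nyquist band rather than just the 9 GHz of interest; the function names are illustrative only.

```python
# Compare channel width for FFX (8 sub-bands x 4K channels) with a single
# 32K-channel PFB over the same Nyquist band. Assumes 20 Gsps real sampling.

def subband_width_mhz(sample_rate_msps, n_subbands):
    """Width of each critically sampled sub-band of the Nyquist band."""
    return sample_rate_msps / 2 / n_subbands


def channel_width_khz(bandwidth_mhz, n_chan):
    """Spectral resolution when `bandwidth_mhz` is split into n_chan channels."""
    return bandwidth_mhz * 1000 / n_chan


if __name__ == "__main__":
    sub = subband_width_mhz(20000, 8)
    print("sub-band width      :", sub, "MHz")
    print("FFX channel width   :", channel_width_khz(sub, 4096), "kHz")
    print("single 32K PFB width:", channel_width_khz(10000, 32768), "kHz")
```

both arrangements give the same channel width of about 305 kHz, close to the roughly 300 kHz in the original request, so the FFX split does not cost spectral resolution under these assumptions.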

best wishes,

dan

On 12/23/2010 1:47 PM, Jonathan Weintroub wrote:

Hi CASPERites,

Here's a somewhat fluffy RFI which I hope might start a little thought 
and/or discussion over the season (acknowledging that not all in the 
global collaboration celebrate the traditional Western winter holidays):


At SMA we are looking into the use of CASPER methods to build a ultra 
wideband high spectral resolution correlator.  Typical specs are, say, 
18 GHz bandwidth with roughly 300 KHz spectral resolution, by two 
polarizations, full Stokes.   We are considering using a standard 
CASPER packetized FX architecture (FX much better for high res than 
XF), but in the relatively unexplored "small number of antennas, wide 
bandwidth" regime.   If the entire 18 GHz were eaten by one ADC, this 
would require a sample rate of 40 Gsps and 64 kpoint PFB.   Perhaps 
more reasonable would be two 9 GHz BW blocks and a 32 k PFB sampled at 
about 20 Gsps, or three 6 GHz / 16 or 32 k PFB / 14 Gsps.


To start we are looking closely at the FPGA resource utilization of 
large PFBs.  Something that probably is common knowledge amongst those 
experienced in FX correlator design is that the demux factor drives 
the utilization much faster than the size of the PFB.  In that sense 
bandwidth is far more expensive than spectral resolution.  We've put 
some effort into accurately quantifying the utilization, at least as 
far as multipliers and adders are concerned, and are expanding this 
analysis to block ram and other resources.  And demux factor is 
typically radix 2, so it is very much quantized.


For example at 20 Gsps one might consider a demux factor of 128 
resulting in an FPGA clock rate of 156 MHz, which is quite comfortable 
for the FPGA.  Alternatively a demux factor of 64 with corresponding 
FPGA clock of twice that, or over 300 MHz.   Traditionally a rather 
uncomfortable regime for CASPER (we're unusual, I believe, in running 
iBOBs at 256 MHz for the VLBI phased array).  The trouble is our 
analysis shows that the difference between these two demux setting in 
the size of PFB one can fit in a Virtex 6 is really quite large, and 
128 definitely won't allow us to do what we need to do.


So we are increasingly highly motivated to run the FPGAs faster 
still.  Just a 20% increment from the 256 MHz which we currently view 
as a practical upper limit allows us to cross a clock rate threshold 
which then enables a factor of two decrease in demux factor, and 
consequent even larger increment in the realizable PFB size.


Which is just a long winded way of asking if there are any others in 
the collaboration motivated to run the FPGAs faster, and whether any 
tricks can be shared?  In particular, does the CASPER toolflow support 
multiple clock domains? Our understanding is not yet, but that's based 
on incomplete information.   We know that there exists Virtex 5 (?) IP 
FFT cores which supposably run at greater than 500 MHz rates, using 
the enhanced interconnect between DSP slices.


While on this topic of high demux factors, the tool flow largely 
chokes on demux factors of 32 or greater.  Any tips here would also be 
appreciated.


If anyone can cast light on this general topic and related concerns it 
would be very much appreciated.


Jonathan Weintroub
SAO









Re: [casper] wideband conversion and correlation

2010-12-23 Thread Suraj Gowda

Hi Jonathan,

I have been able to build spectrometers (FFT only) that operate at 375  
MHz FPGA clock rate (3 GHz bandwidth).  I don't know of anyone who has  
operated faster designs.


16x is a practical limit for demux factors for the FFT.  The reason is  
that the fft_direct block for 16 inputs uses 32 butterflies, which can  
be fit in 3 DSP48E columns.  A demux factor of 32x would substantially  
increase the routing complexity, probably reducing the overall speed.   
But this is only my best guess, I haven't actually tried.
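
The butterfly count quoted above matches the textbook radix-2 figure of (N/2)*log2(N) butterflies for an N-point FFT. The sketch below uses that formula to compare 16 and 32 inputs; whether fft_direct is built exactly this way is an assumption, and the relation between 3 GHz bandwidth and demux 16 assumes Nyquist-rate real sampling (6 Gsps).

```python
# Radix-2 butterfly count for a fully parallel N-point FFT, and the demux
# implied by a given bandwidth and FPGA clock. Names are illustrative only.

def direct_fft_butterflies(n_inputs):
    """(N/2) * log2(N) butterflies for a power-of-two N."""
    if n_inputs < 2 or n_inputs & (n_inputs - 1):
        raise ValueError("n_inputs must be a power of two >= 2")
    stages = n_inputs.bit_length() - 1
    return (n_inputs // 2) * stages


def demux_factor(bandwidth_mhz, fpga_clock_mhz):
    """Parallel samples per clock for Nyquist-rate real sampling."""
    return 2 * bandwidth_mhz / fpga_clock_mhz


if __name__ == "__main__":
    print("demux at 3 GHz / 375 MHz:", demux_factor(3000, 375))
    for n in (16, 32):
        print(f"{n} inputs: {direct_fft_butterflies(n)} butterflies")
```

Going from 16 to 32 inputs raises the count from 32 to 80 butterflies, a factor of 2.5, which is one way to see why routing is expected to get harder.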


-Suraj

On Dec 23, 2010, at 4:47 PM, Jonathan Weintroub wrote:


Hi CASPERites,

Here's a somewhat fluffy RFI which I hope might start a little  
thought and/or discussion over the season (acknowledging that not  
all in the global collaboration celebrate the traditional Western  
winter holidays):


At SMA we are looking into the use of CASPER methods to build a  
ultra wideband high spectral resolution correlator.  Typical specs  
are, say, 18 GHz bandwidth with roughly 300 KHz spectral resolution,  
by two polarizations, full Stokes.   We are considering using a  
standard CASPER packetized FX architecture (FX much better for high  
res than XF), but in the relatively unexplored "small number of  
antennas, wide bandwidth" regime.   If the entire 18 GHz were eaten  
by one ADC, this would require a sample rate of 40 Gsps and 64  
kpoint PFB.   Perhaps more reasonable would be two 9 GHz BW blocks  
and a 32 k PFB sampled at about 20 Gsps, or three 6 GHz / 16 or 32 k  
PFB / 14 Gsps.


To start we are looking closely at the FPGA resource utilization of  
large PFBs.  Something that probably is common knowledge amongst  
those experienced in FX correlator design is that the demux factor  
drives the utilization much faster than the size of the PFB.  In  
that sense bandwidth is far more expensive than spectral  
resolution.  We've put some effort into accurately quantifying the  
utilization, at least as far as multipliers and adders are  
concerned, and are expanding this analysis to block ram and other  
resources.  And demux factor is typically radix 2, so it is very  
much quantized.


For example at 20 Gsps one might consider a demux factor of 128  
resulting in an FPGA clock rate of 156 MHz, which is quite  
comfortable for the FPGA.  Alternatively a demux factor of 64 with  
corresponding FPGA clock of twice that, or over 300 MHz.
Traditionally a rather uncomfortable regime for CASPER (we're  
unusual, I believe, in running iBOBs at 256 MHz for the VLBI phased  
array).  The trouble is our analysis shows that the difference  
between these two demux setting in the size of PFB one can fit in a  
Virtex 6 is really quite large, and 128 definitely won't allow us to  
do what we need to do.


So we are increasingly highly motivated to run the FPGAs faster  
still.  Just a 20% increment from the 256 MHz which we currently  
view as a practical upper limit allows us to cross a clock rate  
threshold which then enables a factor of two decrease in demux  
factor, and consequent even larger increment in the realizable PFB  
size.


Which is just a long winded way of asking if there are any others in  
the collaboration motivated to run the FPGAs faster, and whether any  
tricks can be shared?  In particular, does the CASPER toolflow  
support multiple clock domains? Our understanding is not yet, but  
that's based on incomplete information.   We know that there exists  
Virtex 5 (?) IP FFT cores which supposably run at greater than 500  
MHz rates, using the enhanced interconnect between DSP slices.


While on this topic of high demux factors, the tool flow largely  
chokes on demux factors of 32 or greater.  Any tips here would also  
be appreciated.


If anyone can cast light on this general topic and related concerns  
it would be very much appreciated.


Jonathan Weintroub
SAO









Re: [casper] wideband conversion and correlation

2010-12-23 Thread melvyn wright
Hi Jonathan,

Other specs ?

For SMA  8 or 10 antennas x 2 pols  analog input possible
For CARMA 15 antennas x 2 pols, or 23 ants x 1 pol, or 23 ants x 2 pols

Output sample and accumulation time for cross correlation:  typical
10sec, fast for longer baselines 1 sec.
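
To give a feel for how these specs scale, here is a small sketch counting cross-correlation baselines and visibilities per dump. It assumes 60000 channels (18 GHz at 300 kHz, from the message quoted below), four polarization products for dual-pol inputs and one for single-pol, and it ignores autocorrelations; the function names are illustrative only.

```python
# Baselines and visibilities per accumulation dump for the antenna counts
# listed above. Channel count and polarization products are assumptions.

def n_baselines(n_ant):
    """Number of cross-correlation pairs among n_ant antennas."""
    return n_ant * (n_ant - 1) // 2


def visibilities_per_dump(n_ant, n_chan=60000, n_pol_products=4):
    """Complex visibilities produced each accumulation period."""
    return n_baselines(n_ant) * n_pol_products * n_chan


if __name__ == "__main__":
    for name, n_ant, pols in (("SMA", 8, 4), ("SMA", 10, 4),
                              ("CARMA", 15, 4), ("CARMA", 23, 1),
                              ("CARMA", 23, 4)):
        print(f"{name} {n_ant} ants, {pols} pol products: "
              f"{n_baselines(n_ant)} baselines, "
              f"{visibilities_per_dump(n_ant, n_pol_products=pols)} vis/dump")
```

Dividing the visibilities per dump by the accumulation time (10 sec typical, 1 sec fast) gives the output rate in visibilities per second.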

Mel.

On Thu, Dec 23, 2010 at 1:47 PM, Jonathan Weintroub
 wrote:
> Hi CASPERites,
>
> Here's a somewhat fluffy RFI which I hope might start a little thought
> and/or discussion over the season (acknowledging that not all in the global
> collaboration celebrate the traditional Western winter holidays):
>
> At SMA we are looking into the use of CASPER methods to build a ultra
> wideband high spectral resolution correlator.  Typical specs are, say, 18
> GHz bandwidth with roughly 300 KHz spectral resolution, by two
> polarizations, full Stokes.   We are considering using a standard CASPER
> packetized FX architecture (FX much better for high res than XF), but in the
> relatively unexplored "small number of antennas, wide bandwidth" regime.
> If the entire 18 GHz were eaten by one ADC, this would require a sample rate
> of 40 Gsps and 64 kpoint PFB.   Perhaps more reasonable would be two 9 GHz
> BW blocks and a 32 k PFB sampled at about 20 Gsps, or three 6 GHz / 16 or 32
> k PFB / 14 Gsps.
>
> To start we are looking closely at the FPGA resource utilization of large
> PFBs.  Something that probably is common knowledge amongst those experienced
> in FX correlator design is that the demux factor drives the utilization much
> faster than the size of the PFB.  In that sense bandwidth is far more
> expensive than spectral resolution.  We've put some effort into accurately
> quantifying the utilization, at least as far as multipliers and adders are
> concerned, and are expanding this analysis to block ram and other resources.
>  And demux factor is typically radix 2, so it is very much quantized.
>
> For example at 20 Gsps one might consider a demux factor of 128 resulting in
> an FPGA clock rate of 156 MHz, which is quite comfortable for the FPGA.
>  Alternatively a demux factor of 64 with corresponding FPGA clock of twice
> that, or over 300 MHz.   Traditionally a rather uncomfortable regime for
> CASPER (we're unusual, I believe, in running iBOBs at 256 MHz for the VLBI
> phased array).  The trouble is our analysis shows that the difference
> between these two demux setting in the size of PFB one can fit in a Virtex 6
> is really quite large, and 128 definitely won't allow us to do what we need
> to do.
>
> So we are increasingly highly motivated to run the FPGAs faster still.  Just
> a 20% increment from the 256 MHz which we currently view as a practical
> upper limit allows us to cross a clock rate threshold which then enables a
> factor of two decrease in demux factor, and consequent even larger increment
> in the realizable PFB size.
>
> Which is just a long winded way of asking if there are any others in the
> collaboration motivated to run the FPGAs faster, and whether any tricks can
> be shared?  In particular, does the CASPER toolflow support multiple clock
> domains? Our understanding is not yet, but that's based on incomplete
> information.   We know that there exists Virtex 5 (?) IP FFT cores which
> supposably run at greater than 500 MHz rates, using the enhanced
> interconnect between DSP slices.
>
> While on this topic of high demux factors, the tool flow largely chokes on
> demux factors of 32 or greater.  Any tips here would also be appreciated.
>
> If anyone can cast light on this general topic and related concerns it would
> be very much appreciated.
>
> Jonathan Weintroub
> SAO
>
>
>
>
>