Re: [casper] building 300-receiver channel cross-correlator

Neil Salmon Tue, 22 Dec 2015 08:26:00 -0800

Dave,

Many thanks to you and others for help.


Perhaps I’d better start from scratch and design my own large n 
cross-correlator, with single bit samplers, looking to minimise risk.

Cheers,
Neil

From: casper-boun...@lists.berkeley.edu 
[mailto:casper-boun...@lists.berkeley.edu] On Behalf Of David Hawkins
Sent: 19 December 2015 19:46
To: casper@lists.berkeley.edu
Subject: Re: [casper] building 300-receiver channel cross-correlator

Hi Neil,

Here are my back-of-the-envelope calculations:

1. Complex-valued multiplier (CMULT)

   a = a_re + j*a_im
   b = b_re + j*b_im
   c = a*conj(b) = (a_re + j*a_im)*(b_re - j*b_im)

   Given 1-bit sampled I+Q components, there are 4 inputs, so the logic will map
   nicely to a 4-input LUT. The number of LUTs required depends on the number of
   bits out of the product table.

   If the 1-bit values are assigned -1, 1, then each of the products takes on
   the values -1 or 1. The sums in the complex product each take on the values
   -2, 0, or 2, i.e., three possible output values. Divide by 2 and these three
   values can be -1, 0, 1, or the signed 2-bit codes 11, 00, 01, or you could
   add 1 to get the products 0, 1, and 2, and use codes 00, 01, 10.

   Given the fact that each complex-valued product component has a 2-bit output,
   a complex-valued multiplier requires 4 x 4-LUTs.

   Your correlator needs 300x299/2 = 44850 of these CMULTs, i.e., 179400 4-LUTs.

   The 2-bit plus 2-bit output of these CMULTs would feed accumulators.
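
As a rough illustration of the CMULT product table described above, here is a minimal Python sketch. The function and table names (cmult_1bit, to_pm1, LUT) are made up for this example, and the mapping of bit 0 to -1 and bit 1 to +1 is an assumption; a real design would express the same table in HDL.

```python
from itertools import product

def to_pm1(bit):
    """Map a 1-bit sample to -1/+1 (assumed encoding: 0 -> -1, 1 -> +1)."""
    return 1 if bit else -1

def cmult_1bit(a_re, a_im, b_re, b_im):
    """Return (re, im) of a*conj(b)/2 for inputs in {-1, +1}."""
    re = (a_re * b_re + a_im * b_im) // 2   # sum is -2, 0 or 2
    im = (a_im * b_re - a_re * b_im) // 2   # sum is -2, 0 or 2
    return re, im

# 16-entry table indexed by the four input bits (a_re, a_im, b_re, b_im).
# Each entry holds the offset codes 0, 1, 2 for the real and imaginary parts.
LUT = {}
for bits in product((0, 1), repeat=4):
    re, im = cmult_1bit(*(to_pm1(b) for b in bits))
    LUT[bits] = (re + 1, im + 1)
```

Each output component needs 2 bits, and each output bit is a function of the same 4 inputs, which is where the 4 x 4-LUT estimate comes from.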

2. Accumulators

   If your 1-bit ADCs get stuck in one state, e.g., always outputting -1 or 1,
   then you will get a static product out of your CMULT. For an integration time
   of 10ms at 300MHz, you have 3M samples, so your accumulator will have a bit
   growth of log2(3M) = 22-bits, i.e., each accumulator would need to have a
   worst-case bit-width of around 24-bits. For random noise the bit-growth
   would be half this. So your solution will depend on whether you expect to
   have to handle coherent signals, e.g., RFI.
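
The bit-growth figures above can be checked with a few lines of arithmetic. This is only a rough sketch: the variable names are illustrative, and the noise estimate assumes signed codes -1, 0, 1 so that the sum behaves like a random walk growing as sqrt(N).

```python
import math

sample_rate_hz = 300e6
integration_s = 10e-3
n_samples = round(sample_rate_hz * integration_s)   # samples per integration

# Bit growth for a static (coherent) input of magnitude 1.
coherent_bits = math.ceil(math.log2(n_samples))

# Worst case when accumulating the unsigned code 2 on every clock.
max_code = 2
worst_case_bits = math.ceil(math.log2(max_code * n_samples + 1))

# Random-walk growth for zero-mean noise: roughly half the coherent growth.
noise_bits = math.ceil(math.log2(math.sqrt(n_samples)))

print(n_samples, coherent_bits, worst_case_bits, noise_bits)
```

The worst-case figure lands a bit under the "around 24 bits" quoted above, which leaves some margin for a sign bit or a longer integration.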

   The first thing to realize is that you do not need to use LUTs to implement
   these accumulators. You can use a combination of all the resources
   available to you on your selected FPGA. For example, you can use DSP blocks
   split into sub-accumulators, or you can use short counters implemented in
   LUTs, and then use RAM for long-term accumulators.

   For example, let's say you have an FPGA with a 48-bit DSP block, and you
   consider that 48-bit DSP block as 4 x 12-bit counters. If your CMULT outputs
   the unsigned codes 0, 1, 2, then the input to your DSP block accumulator for
   two complex-valued products would be:

   0000_0000_00aa_0000_0000_00bb_0000_0000_00cc_0000_0000_00dd

   where aa, bb, cc, dd are the 2-bit outputs of two CMULTs.

   The unsigned 0, 1, 2 values can be accumulated for 2^12/2 = 2048 clocks
   before they overflow (i.e., overflow into the next 12-bit data word and
   corrupt the contents of the 48-bit DSP block accumulator).

   You could dump the DSP blocks every 2048 clocks into RAM, and then a
   long-term accumulator could read the RAM and accumulate the data further.
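
To make the packed-accumulator idea concrete, here is a small Python model of a 48-bit word treated as four 12-bit lanes. It is a behavioural sketch only; the names (pack, unpack, accumulate) are invented for illustration, and lane 0 is assumed to be the least-significant 12 bits.

```python
LANE_BITS = 12
LANES = 4
LANE_MASK = (1 << LANE_BITS) - 1
WORD_MASK = (1 << (LANE_BITS * LANES)) - 1

def pack(codes):
    """Place four 2-bit codes in the low bits of four 12-bit lanes."""
    word = 0
    for lane, code in enumerate(codes):
        word |= (code & 0b11) << (lane * LANE_BITS)
    return word

def unpack(word):
    """Split a 48-bit word back into four 12-bit lane values."""
    return [(word >> (lane * LANE_BITS)) & LANE_MASK for lane in range(LANES)]

def accumulate(acc, codes):
    """One clock of a plain 48-bit add; carries are NOT blocked between lanes."""
    return (acc + pack(codes)) & WORD_MASK

acc = 0
for _ in range(2047):
    acc = accumulate(acc, [2, 1, 0, 2])
safe = unpack(acc)       # all lanes still hold valid counts

acc = accumulate(acc, [2, 1, 0, 2])
spilled = unpack(acc)    # lane 0 has carried into lane 1; lane 3 wrapped
```

With the worst-case code of 2 on every clock, the lanes are still intact after 2047 clocks and the 2048th add carries into the neighbouring lane, so the dump to RAM has to happen before that point.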

   The 44850 CMULTs could feed into 22425 of these 48-bit accumulators.
   Assuming a device with say 4000 DSP blocks, you'd need to implement
   the other accumulators in the fabric, e.g., 18425 x 48-bits = 880k
   registers.
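
The resource counts above follow from simple arithmetic, reproduced here as a sketch. The figure of 4000 DSP blocks is the hypothetical device size used in the text, not a specific part.

```python
n_rx = 300
n_cmult = n_rx * (n_rx - 1) // 2        # unique receiver pairs
luts = 4 * n_cmult                       # 4 x 4-LUTs per CMULT

dsp_accs = n_cmult // 2                  # two CMULTs share one 48-bit accumulator
dsp_available = 4000                     # assumed device size from the text
fabric_accs = dsp_accs - dsp_available   # accumulators left for the fabric
fabric_regs = fabric_accs * 48           # registers if each is a full 48 bits

print(n_cmult, luts, dsp_accs, fabric_accs, fabric_regs)
```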

Clearly your dominant resource is going to be your accumulation logic, so you'll
want to carefully investigate methods of performing a low-bit-width accumulation
of the CMULT output, and then use RAM-based long-term accumulation, and then
get the data off the chip. What is the advantage of RAM accumulation? If the
fast accumulation occurs for say 1000 clocks, your RAM accumulation logic has
1000 clocks in which to do its work, so one accumulator can read two RAMs and
accumulate the two values for 1000 different partial products, i.e., you save
999 accumulators by reusing one.
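
A behavioural sketch of the RAM-based long-term accumulation might look like the following. It is not a hardware description: the function name, the sizes, and the constant test pattern are all assumptions made for illustration. The point is that a single adder, stepping through one RAM address per clock, can service 1000 partial products while the fast accumulators run for their next 1000 clocks.

```python
N_PRODUCTS = 1000    # partial products served by one shared adder (assumed)
FAST_CLOCKS = 1000   # clocks between dumps of the fast accumulators (assumed)

def long_term_accumulate(ram, dump):
    """One shared adder visits each address once: read, add, write back."""
    for addr, partial in enumerate(dump):
        ram[addr] += partial
    return ram

ram = [0] * N_PRODUCTS
n_dumps = 3
for _ in range(n_dumps):
    # Test pattern: product `addr` outputs the constant code addr % 3, so its
    # fast accumulator holds (addr % 3) * FAST_CLOCKS at each dump.
    dump = [(addr % 3) * FAST_CLOCKS for addr in range(N_PRODUCTS)]
    ram = long_term_accumulate(ram, dump)

# The shared adder needs N_PRODUCTS clocks per dump, which fits in the window.
fits_in_window = N_PRODUCTS <= FAST_CLOCKS
```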

This system sounds like it could fit into one FPGA. You'd have to figure out
how to get all 300 inputs onto the FPGA, e.g., perhaps using 600 LVDS receivers
(assuming you could find a device with that many). The other option to consider
is several lower-cost FPGAs, e.g., 4 x FPGAs with 150 LVDS pairs each, linked
together with enough serdes links to take the data between the devices that
need it.

You should start by prototyping a simple design in HDL and testing it using
an existing development kit.

Cheers,
Dave

On 12/18/2015 8:23 AM, Neil Salmon wrote:
Thank you for your response. The system is part of a generic microwave/mm-wave 
aperture synthesis imaging system, so there’s an array of front-end heterodyne 
receivers with an IF earmarked at a centre frequency of 3 GHz (away from Wi-Fi 
and mobile comms), but the bandwidth is 300 MHz. Front-end initially may be 
receiving at a centre frequency of 20 GHz, but I could change this to 10 GHz or 
go up to 35 GHz. (I’ll be taking a single polarisation say horizontal or 
right-hand circular – I’ve not decided yet)

Either way, I’ll need to digitise this 300 MHz bandwidth on each channel, and 
I’m quite happy with the loss in SNR from using single-bit digitisation. To 
satisfy the Nyquist criterion there will be I & Q channels, each generating 
a data stream at 300 M samples per second, i.e., a total of 600 Mbps for both 
I & Q per receiver channel, giving a total data rate of 180 Gbps. (The sampling 
clock and mixing LOs will be synched to a master oscillator.)

(as for the 3 GHz centre frequency the I&Q digitisation could be bandpass 
sampling/digital down conversion or a second analogue downshift using a matched 
pair of mixers and then comparators in each section to generate I and Q digits)

So there will be this huge rate of I & Q data from 300 channels that needs to 
be cross-correlated in real time with 95% duty cycle to avoid loss of SNR. 
(Software correlation would generate just too much data for hard disk, and a GPU 
PCIe bus solution couldn’t cope with the data rate, or at least I’d be 
uncomfortable about working close to the data rate ceilings of PCIe.) That leaves 
the FPGA solution. So I need some high speed data bus to get the data into the 
FPGAs for cross-correlation. As I’m working with single bits, XOR gates will do 
nicely for the cross-multiplies, and I want to store the four components of the 
cross-multiply in separate registers, just for diagnostics / troubleshooting. 
This gives an XOR op rate of 54 T ops/sec and a requirement for 180,000 
accumulation registers.
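
For reference, the quoted data-rate and operation-rate figures can be reproduced with the short sketch below. The variable names are illustrative, and the encoding of bit 0 as -1 and bit 1 as +1 is an assumption; with that encoding an XOR output of 0 corresponds to a product of +1.

```python
n_rx = 300
sample_rate = 300e6                      # samples/s on each of I and Q

bits_per_rx = 2 * sample_rate            # 1-bit I plus 1-bit Q
total_rate = n_rx * bits_per_rx          # aggregate input data rate, bits/s

n_pairs = n_rx * (n_rx - 1) // 2         # unique receiver pairs
n_acc = 4 * n_pairs                      # four cross-multiply components each
xor_rate = n_acc * sample_rate           # XOR operations per second

def xor_product(x, y):
    """1-bit multiply: bits 0/1 encode -1/+1, result is the +/-1 product."""
    return 1 - 2 * (x ^ y)

print(total_rate / 1e9, n_acc, xor_rate / 1e12)
```

The exact counts are 179,400 accumulators and about 53.8 T XOR ops/sec, which round to the 180,000 and 54 T figures quoted above.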

For me the challenges will be getting the arrays of single bit digitisers and 
linking them to the cross-correlators and doing the cross-correlation at this 
huge rate. Build of analogue front end heterodyne array and image formation 
algorithms I’ve done before. It’s just the digital hardware I need to sort. 
I’ve got a few researchers and postgrads around me in the engineering 
department who have general educational interests in FPGA technologies and the 
Xilinx and Altera University representative support under the university 
agreement. So I’m just wondering if I can do this with the Casper tools or 
others if necessary.

Hope this extra information helps. Many thanks for your help.
Neil

From: James Smith [mailto:jsm...@ska.ac.za]
Sent: 18 December 2015 14:25
To: Neil Salmon
Cc: casper@lists.berkeley.edu<mailto:casper@lists.berkeley.edu>
Subject: Re: [casper] building 300-receiver channel cross-correlator

Hello Neil,

CASPER tools could probably do what you're looking for, but I found your 
description a bit confusing. You're going to need to clarify somewhat.

Regards,
James


On Fri, Dec 18, 2015 at 4:15 PM, Neil Salmon 
<n.sal...@mmu.ac.uk<mailto:n.sal...@mmu.ac.uk>> wrote:
Anyone help?

I’m working in academia and need to build a 300-receiver channel single-bit 
digitiser / cross-correlator with a single frequency channel having a bandwidth 
of 300 MHz, centre frequency ~3 GHz. The single bit digitisers sample I&Q 
giving a total data rate of 180 Gbps and using XOR gates to do the 
cross-correlations, the total computation rate is 54 T XOR operations per 
second. I need to accumulate cross-correlations typically for times ranging 
from 10 ms to a few seconds. The system would comprise an array of single bit 
digitisers linked via a high speed data bus to FPGA boards for the 
cross-correlation/accumulation. I’ve no skills in board design but could 
probably learn VHDL. I don’t have funding to commission a design and build but 
wondered if anyone in this community could advise how I should go about 
building this system at our university.

Thank you for any help you can provide.
