Re: [casper] unable to load my boffiles and to configure my roach2

2013-03-12 Thread Guy kenfack
Hi Marc,
thank you for your answer.

Now I have another question from my 1st email that did not get an answer. It
is about the process for generating a boffile compatible with roach2 and
tcpborph3 using the 11.5 toolflow.
I'd like to know if there is a trick that can be used to generate a
boffile with the current casper_astro_mlib or the old ska_sa_lib that
will be compatible with the roach2.

regards,





 Roach2s which run tcpborphserver3 do not execute bof files;
 they use tcpborphserver3 to program the FPGA using a device
 file. So this is not an error. Use progdev to program the FPGA,
 or, if you wish to use the command line:

 ~ # kcpcmd progdev r2_tut1_2013_Mar_07_0837.bof

 regards

 marc
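
For reference, the kcpcmd call above boils down to sending a KATCP progdev request to tcpborphserver3. Here is a minimal Python sketch of that request. It assumes the usual KATCP text framing (a '?' prefix, space-separated arguments, newline terminator) and the customary KATCP port 7147. The helper names and the host name are hypothetical, and this is not how kcpcmd itself is implemented.

```python
import socket


def katcp_request(name, *args):
    """Format one newline-terminated KATCP request line as bytes."""
    # In KATCP framing a literal backslash is written '\\' and a space
    # inside an argument is written '\_'.
    escaped = [a.replace("\\", "\\\\").replace(" ", "\\_") for a in args]
    return (" ".join(["?" + name] + escaped) + "\n").encode("ascii")


def progdev(host, boffile, port=7147, timeout=5.0):
    """Send a progdev request and return the first raw bytes of the reply.

    Assumes tcpborphserver3 is listening on `port` and that `boffile`
    is already visible to the server (hypothetical setup).
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(katcp_request("progdev", boffile))
        return sock.recv(4096)


if __name__ == "__main__":
    # Only formats the request; no ROACH2 is contacted here.
    print(katcp_request("progdev", "r2_tut1_2013_Mar_07_0837.bof"))
```

Calling progdev("roach2.example", "r2_tut1_2013_Mar_07_0837.bof") would need a reachable board; the reply may start with inform lines before the actual !progdev response.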



Re: [casper] unable to load my boffiles and to configure my roach2

2013-03-12 Thread G Jones
Hi Wes,
Can you elaborate on this MMCM stability problem? Which version of ISE was
it fixed in?
Thanks,
Glenn
On Mar 12, 2013 5:40 AM, Wesley New wes...@ska.ac.za wrote:

 Hi Guy,

 The current casper_astro_mlib doesn't support ROACH2, as the ROACH2
 changes don't seem to have been pulled from SKA-SA. Why would you want to
 build with old tools anyway?

 The older Xilinx tools have issues with the clock managers (MMCMs) where
 you can get instability problems when deploying your design. I would highly
 recommend upgrading your design machine to 14.4 and matlab2012b.

 Regards

 Wes




[casper] ADC 1x5000-8 1:2 boards

2013-03-12 Thread Homin Jiang
Dear Ross:
I will check our inventory. I guess we still have some. I am interested to
know why not 1:1 ?

cheers
homin


Re: [casper] unable to load my boffiles and to configure my roach2

2013-03-12 Thread Henno Kriel
Hi Glenn

Xilinx tightened up the constraining and configuration options on the MMCM in
ISE versions 12 and 13 due to lock issues. We have also run into poor timing
results on 11.5.

ISE 11.5 was the first version to support Virtex 6, so moving to the latest
(ISE 14) is recommended for Roach 2 compiles.

Regards
Henno


-- 
Henno Kriel

DSP Engineer
Digital Back End
meerKAT

SKA South Africa
Third Floor
The Park
Park Road (off Alexandra Road)
Pinelands
7405
Western Cape
South Africa

Latitude: -33.94329 (South); Longitude: 18.48945 (East).

(p) +27 (0)21 506 7300
(p) +27 (0)21 506 7365 (direct)
(f) +27 (0)21 506 7375
(m) +27 (0)84 504 5050


Re: [casper] ADC1X5000-8 correlator question

2013-03-12 Thread Jonathan Weintroub
Hi Ross,

I am commenting on this older thread, prompted by your most recent message re
DMUX1:2 ADCs, to which Homin responded today.

Your approach, and the advice received from Dan and Glenn, is sound---in
principle.  With suitable bit codes, you can run DMUX1:2 4-bit versions of the
ASIAA ADC at full rate, in dual channel 2.5 GSa/s mode, hosted on a ROACH1.
And with two ADC boards installed you can enable four such 2.5 GSa/s input
channels on ROACH1.  Whether the ROACH1 can process the four channels with
Virtex 5 resources of course depends on what you want to do; Dan gave one
benchmark.  We have a PFB resource utilization memo which may be able to help
you.

There is a caveat, however:  my group is among a few CASPER collaborators who 
are doing wideband correlator design using ASIAA 5 GSa/s ADCs.  We decided some 
time ago to standardize on using the 8-bit DMUX 1:1 version of the ADC 
supported by ROACH2.  In our case we without question needed the computational 
resources of Virtex6-SX475, so the higher GPIO interface speed (translates to 
Z-DOK speed) in some sense came for free.  Though we could probably meet our 
requirements with fewer than 8-bits, there was no penalty for enabling them by 
using the DMUX1:1 board.

A consequence of this is that all our developments and contributions to the 
CASPER open source infrastructure support this configuration (8-bit and 
ROACH2).  In brief we have contributed yellow block work, ADC characterization, 
and resource calculations for PFBs.  Now, we are not the only ones doing this 
kind of work, and much of what we do is probably transferable to the ROACH1 and 
4-bit case.  However, at least from my biased perspective, you may find the
ROACH2 8-bit case to be better supported, and thus the hardware cost savings
through using ROACH1 may prove to be a false economy. The 4-bit and 8-bit ADCs
are equal in cost.

(By the way the 8-bit ADC board can be used on ROACH1 but you would be limited 
to 1.8 GSa/s per channel or 3.6 GSa/s per ADC board.)
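
To keep the rates straight, here is a small Python sketch of the arithmetic behind the numbers in this message. It assumes only what is stated above (dual-channel mode gives each channel half the board rate) plus the Nyquist relation for real sampling. The function name is illustrative, and it says nothing about Z-DOK or per-pin limits.

```python
def adc_modes(board_rate_gsps, n_boards=1):
    """Per-channel sample rate and Nyquist bandwidth for each ADC mode.

    Assumption: in dual-channel mode each channel gets half the board rate;
    usable bandwidth of a real-sampled channel is half its sample rate.
    """
    dual_rate = board_rate_gsps / 2.0
    return {
        "interleaved": {
            "channels": n_boards,
            "rate_gsps": board_rate_gsps,
            "nyquist_ghz": board_rate_gsps / 2.0,
        },
        "dual": {
            "channels": 2 * n_boards,
            "rate_gsps": dual_rate,
            "nyquist_ghz": dual_rate / 2.0,
        },
    }


if __name__ == "__main__":
    # DMUX1:2 4-bit board at full rate, two boards on one ROACH1
    print(adc_modes(5.0, n_boards=2))
    # 8-bit DMUX1:1 board on ROACH1, limited to 3.6 GSa/s per board
    print(adc_modes(3.6))
```

With two 5 GSa/s boards this gives either two interleaved inputs of 2.5 GHz bandwidth each, or four dual-mode inputs of 1.25 GHz each.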

All that said, I have a few DMUX1:2 ADCs sitting in the lab fully assembled, 
working, and unused, and I would be happy to loan these to you.  Actually they 
probably belong to Homin / ASIAA, so I guess subject to his ok.

Our wideband developments are documented on a public Wiki:

https://www.cfa.harvard.edu/twiki/bin/view/SMAwideband

Dig down for ADC characterization and our resource memo.

Cheers, stay in touch.

Jonathan Weintroub
CfA
+1-617-495-7319



On Feb 12, 2013, at 4:09 PM, Ross Williamson wrote:

 Hi Dan and Glenn,
 
 All sounds good - I think having the ROACH-1 design will help us out a lot here
 to get started.  We really don't need a very high spectral response as we are
 just trying to measure the amplitude and phase of a PSF's wings referenced to
 the central core of the PSF.
 
 Cheers,
 
 Ross
 
 On Tue, Feb 12, 2013 at 1:05 PM, G Jones glenn.calt...@gmail.com wrote:
 Note that the design Dan is referring to currently has only been
 tested to ~1.5 GHz BW. Getting to higher bandwidths will likely
 require some work to meet timing. Also the data comes out over 10
 gigabit Ethernet so requires something to catch the data.
 
 On Tue, Feb 12, 2013 at 4:02 PM, Dan Werthimer d...@ssl.berkeley.edu wrote:
 
  hi ross,
 
  how many frequency channels do you need in your two input correlator?
 
  we have a full stokes VEGAS 1K channel spectrometer design
   that uses a pair of ADC08-5000's.
  (a full stokes spectrometer is the same thing as a two input correlator)
 
  our current design is for roach2, but we had tested working designs
  for roach1 last year, and we can probably dig these up.
 
  best wishes,
 
  dan
 
 
  On Tue, Feb 12, 2013 at 12:53 PM, Ross Williamson
  rwilliam...@astro.caltech.edu wrote:
 
  Hi All,
 
  I'm new to CASPER/ROACH boards and so apologies if this is obvious.  We
  are hoping to build a simple 2-channel correlator  with a fairly high
  bandwidth.  We are in the process of ordering a ROACH-1 board and I was
  thinking of also purchasing the ADC1X5000-8 ADC - A few questions:
 
  1) Should I even be looking at an ADC1X5000-8 for use as a correlator?
  2) I believe I need the DMUX 1:2 version to work with ROACH-1 board - is
  that correct?
  3) I should be able to run 2 channels at 2.5GHz 4bits with a single board
  - correct?
  4) If I wanted to could I purchase another ADC1X5000-8 and try and run
  with 5GHz bandwidth on each channel using the ROACH-1?
 
  Best regards,
 
  Ross
 
  --
  Ross Williamson
  Research Scientist - Sub-mm Group
  California Institute of Technology
  626-395-2647 (office)
  312-504-3051 (Cell)
 
 
 
 
 
 -- 
 Ross Williamson
 Research Scientist - Sub-mm Group
 California Institute of Technology
 626-395-2647 (office)
 312-504-3051 (Cell)




Re: [casper] unable to load my boffiles and to configure my roach2

2013-03-12 Thread David MacMahon
Hi, Wes,

On Mar 12, 2013, at 10:35 AM, Wesley New wrote:

 There
 might be 1 or 2 tweaks to the tools that you would have to do.

The only tweak that I've needed to do so far is to export model files (e.g. 
xps_library.mdl) that have been saved in the latest Simulink version to an 
older Simulink model file format.  The latest version stores the mask 
parameters in a way that is incompatible with earlier versions.  What was 
Mathworks thinking?!

Dave




Re: [casper] unable to load my boffiles and to configure my roach2

2013-03-12 Thread David MacMahon
I've not tried that far back.  I go back and forth between 13.x and 14.x tools.

I have not backported your shared bram changes yet.  I very much like the
concept and think that it should probably be used even for shared brams that
have data width of 32 so that we can use the BRAM output register option.  The
Xilinx BRAM pcore that is currently used for data width of 32 does not support
that option, but it would greatly help with timing (and therefore placement) as
the clock-to-out time for the UNregistered BRAM is 2 ns!
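
To put the 2 ns in perspective, here is a rough Python sketch of how much of each clock period is left for routing and logic after the BRAM clock-to-out. The 250 MHz fabric clock is a hypothetical example, not a number from this thread, and setup time and clock skew are ignored.

```python
def remaining_budget_ns(fabric_clock_mhz, clock_to_out_ns=2.0):
    """Time left in one clock period after the BRAM clock-to-out delay."""
    period_ns = 1000.0 / fabric_clock_mhz
    return period_ns - clock_to_out_ns


if __name__ == "__main__":
    # Hypothetical 250 MHz design: 4 ns period, so half of it is gone
    # before the data has even left the unregistered BRAM.
    print(remaining_budget_ns(250.0))
```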

Dave

On Mar 12, 2013, at 10:54 AM, Wesley New wrote:

 That is pretty cool, thanks for the info dave. :)
 
 Have you tried building with the most recent ska-sa libraries on 11.5?
 I updated the custom shared bram block to use the latest coregen. I am
 not sure how an older version of coregen would behave. This is only
 for brams using a data width that is not 32 bits wide.
 




Re: [casper] Error reading registers with ROACH2

2013-03-12 Thread Andrew Martens
Hi Jason

 I'm having a hard time updating the .mdl file for tutorial1 after I
 tried to update to the newest CASPER tools.  

By 'CASPER tools' are you referring to recent git commits of the
Simulink libraries from the ska-sa repo? If so, I know what your problem
below is...

 All the links to the yellow blocks are broken.  Should the .mdl files
 work with the updated CASPER tools or do I have to go in and replace
 those blocks that have broken links?  I tried doing this as well, but
 I can't find similar blocks in the simulink explorer.  Could you send
 me the .mdl file for the .bof file that you sent me?  Would that help?

It seems that older versions of Matlab/Simulink are not forwards-
compatible with the .mdl files generated by the latest incarnation of
Matlab/Simulink. When the library is loading you will notice a bunch of
error messages scrolling past. These are not the usual Warnings that can
be ignored but result in a crippled Simulink library with much of it
missing (as seen in the simulink explorer). Your .mdl file has links
that link back to this crippled library and much of it is missing (hence
broken links).

Your solution is to install the latest compatible spawn from Mathworks.
We may have to look into saving our libraries so that they are
compatible with older Matlab/Simulink versions as continually upgrading
Matlab/Simulink is expensive, time intensive, and does not usually come
with any improvements that relate to us. 

Regards
Andrew






Re: [casper] what should I see?

2013-03-12 Thread G Jones
I think your clock source should be  1 GHz right?  Also, your signal
generator is set to way too high of a level I think. Start with -40
dBm and increase from there.
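
As a quick check on the level, here is a small Python sketch converting a sine amplitude in Vpp to dBm. It assumes a 50 ohm load, which is not stated in the thread, and the function name is illustrative.

```python
import math


def vpp_to_dbm(vpp, load_ohms=50.0):
    """Power in dBm of a sine wave of peak-to-peak voltage `vpp` into a load."""
    v_rms = vpp / (2.0 * math.sqrt(2.0))   # sine: Vrms = Vpeak / sqrt(2)
    power_w = v_rms ** 2 / load_ohms
    return 10.0 * math.log10(power_w / 1e-3)


if __name__ == "__main__":
    print(vpp_to_dbm(2.0))
```

Under that assumption a 2 Vpp sine is about +10 dBm, i.e. roughly 50 dB above the -40 dBm starting point suggested here.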

On Tue, Mar 12, 2013 at 4:10 PM, katherine viviana cortes urbina
kattycort...@gmail.com wrote:
 Dear Casperites,

 1.- I compiled tut3 with 2048 channels and a 1 GHz ADC. Control is via the
 katcp server (Python library). I run the script tut3.py, and the clock source
 (Valon 5007) is connected to clk_i of the ADC with f_out = 2425 MHz. The
 resulting spectrum is in the attachment tut3_clock_2048canales.png. When I
 connect a signal generator (frequency 120 MHz, 2 Vpp, sinusoidal signal) to
 SMA +i of the ADC, the spectrum is as in the attachment
 tut3_2048canales_120MHz.png. Is this correct?

 2.- Now, when I compile tut3 with 32 channels, the .py script changes; see
 the attachment tut3_32channel.py. Is this correct? The spectrum
 generated (with only the clock source) is tut3_32kcanales_2012_Dec_17_2104.png.
 Is this correct?

 My question is: what should I see in the spectrum?
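
 One way to predict where the tone should land is sketched below in Python. It assumes a real-input spectrometer whose channels evenly span 0 to half the sample rate, and it uses a hypothetical 800 MHz sample clock, which is not a value taken from this thread. The function name is illustrative.

```python
def expected_channel(f_signal_hz, f_sample_hz, n_channels):
    """Channel index in which a tone should peak, after aliasing."""
    nyquist = f_sample_hz / 2.0
    # Fold the tone into the first Nyquist zone.
    f = f_signal_hz % f_sample_hz
    if f > nyquist:
        f = f_sample_hz - f
    channel = int(f / (nyquist / n_channels))
    return min(channel, n_channels - 1)


if __name__ == "__main__":
    # Hypothetical 800 MHz sample clock, 120 MHz test tone
    print(expected_channel(120e6, 800e6, 2048))
    print(expected_channel(120e6, 800e6, 32))
```

 Under those assumptions a clean 120 MHz tone should show up as a single narrow peak near channel 614 of 2048, or channel 9 of 32, with a flat noise floor elsewhere.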


 any ideas?

 cheers

 katty








Re: [casper] ADC 1x5000-8 1:2 boards

2013-03-12 Thread Ross Williamson
Hi Homin,

Thanks - My understanding is that we cannot achieve 5 GSPS at 8 bits on a
ROACH-1 - we need a ROACH-2 for that. We gain a lot more in our
correlator design from a larger bandwidth than from more bits.
We may be able to push the 1:1 up to about 3.6 GSPS, which would be
ok, but we can do better with the 1:2.

Best regards,

Ross





-- 
Ross Williamson
Research Scientist - Sub-mm Group
California Institute of Technology
626-395-2647 (office)
312-504-3051 (Cell)



Re: [casper] ADC1X5000-8 correlator question

2013-03-12 Thread Ross Williamson
Hi Jonathan,

Thank you for the detailed response (which I read after replying to
Homin - doh). We already have our ROACH-1 here and are trying to keep
costs down so I'm going to push ahead a little more with trying to
figure out if we can get this to work with a ROACH-1 and 1:2 ADCs.  I
do have a couple of questions/points though before I ask to borrow the
1:2 boards.

1) The nominal plan would be to run 2 channels at 5GSa/s, 4bit on a
ROACH-1, i.e. using 2 ADC boards.  My understanding is that the ROACH-1
in practice should be ok to do this (depending on number of channels
etc in the PFB).
2)  Am I being very naive (I suspect I am) in thinking that with some
work (but not a total rewrite) I could just take the correlator
tutorial, replace the ADC with the yellow block for the 1x5000-8
boards and tweak the number of channels in the PFB (we don't need many
at all, which should help fitting on the VIRTEX 5)?

If 1+2 are within the bounds of sanity then it might be worth trying
to see if we can get this up and running - We can always fall back to
the 3.6 GS/s if it's proving to be a nightmare. I'll check out the
memo and website too.

Cheers,

Ross


[casper] Purpose of FFT-Direct

2013-03-12 Thread Ryan Monroe
Hey all,
Luke Madden was asking me about what's going on in the FFT-direct today.
I'm pretty sure we have basically zero documentation on this lying around,
so it's a good time to fix that.  I'm going to share what I know, but I'd
appreciate it if other people could add/correct me as needed.

So, you can split the CASPER FFTs into streaming and parallel FFTs:

streaming: fft_biplex, fft_biplex_real, fft_biplex_real_4x
These FFTs have several independent ports.  Each of these ports is fed with
normal-order, serial time-domain data and produces normal-order, serial
frequency-domain data.  If you know something about how pipelined FFTs
work, you'll probably call it a Radix 2, Delay-Commutator FFT, or R2DC.
 In the fft_biplex, we follow the R2DC FFT with an
inverse-delay-commutator stage to un-scramble the data (the casper
implementation doesn't have the same structure as an
inverse-delay-commutator, but they do the same thing).  In
fft_biplex_real, we do the same R2DC FFT, but we treat real and imag as
separate inputs, making four inputs.

parallel: fft_direct
If map_tail is not set, then the fft_direct block accepts all the inputs
for an fft on *each clock cycle*.  Natural order in, Natural order out.
If map_tail *is* set, it's a bit more complicated.  Then, this block is
being used with a number of streaming FFTs to achieve a wideband FFT.
Imagine a standard DIT FFT.  The early stages of the FFT only use a few
coefficients.  In fact, they are each FFTs in their own right, only on a
subset of the data.  These streaming FFTs are just that:  for as long as we
can still process the data in a serial fashion, we process each sample
sequentially.  Then, we do the last 1-4 (typically) stages in a massive
parallel format.  Here, the same structure is drawn as in the map_tail=0
fft_direct... but the coefficients now change (specifically, their phases
are incrementing).
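
To make the picture concrete, here is a small pure-Python sketch of the textbook radix-2 DIT recursion, stopped early so that the first stages are whole sub-FFTs on decimated data and only the last log2(n_parallel) stages are written out as butterflies. It illustrates the arithmetic only, not the CASPER block itself. The names dft, dit_fft and n_parallel are made up for this example, and the direct DFT merely stands in for a streaming FFT.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT; stands in for one streaming sub-FFT."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def dit_fft(x, n_parallel):
    """Radix-2 DIT FFT with the last log2(n_parallel) stages made explicit.

    n_parallel must be a power of two that divides len(x).
    """
    if n_parallel == 1:
        return dft(x)                      # an independent sub-FFT
    n = len(x)
    even = dit_fft(x[0::2], n_parallel // 2)
    odd = dit_fft(x[1::2], n_parallel // 2)
    out = [0j] * n
    for k in range(n // 2):
        # The twiddle phase advances with the frequency index k.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t               # butterfly, upper output
        out[k + n // 2] = even[k] - t      # butterfly, lower output
    return out


if __name__ == "__main__":
    x = [complex(i % 7, (3 * i) % 5) for i in range(32)]
    err = max(abs(a - b) for a, b in zip(dit_fft(x, 4), dft(x)))
    print("max error vs direct DFT:", err)
```

With n_parallel=4 there are four 8-point sub-FFTs on x[0::4], x[1::4], x[2::4] and x[3::4], followed by two butterfly stages whose twiddles step through k.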

This is where my understanding gets a bit hazy, but it looks like the last
stages of the FFT are being literally enumerated here.  *If someone wants
to chime in, here is the place to do it*.

In any case, you could actually do these mixed streaming/parallel FFTs
(which are fft, fft_wideband_real) in a different fashion, by re-casting
them as a split-radix FFT (look it up).  Doing this is computationally
about the same, but saves resources and memory... and is simpler if the
size of fft_direct is greater than 2^2.


I hope this helps, Luke (and everyone else)!


--Ryan Monroe


Re: [casper] Purpose of FFT-Direct

2013-03-12 Thread Aaron Parsons
Hi Ryan,

I wrote the various forms of the CASPER FFT, including this one.  The broad
idea of the architecture was described in:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4840623tag=1

Basically (as far as I can tell from a brief perusal of split-radix
FFTs), I think this *is* a split-radix FFT.   The mix of serial and
parallel FFTs is used to evaluate a radix-2 Cooley-Tukey FFT that is
decomposed into several smaller FFTs that can be computed independently
(without inter-communication of samples), followed by a direct FFT that
cycles through twiddle coefficients (i.e. it is not truly a stand-alone
direct FFT) and does the remaining butterflies, drawing on
samples from all the sub-FFTs.  Data permutation is a bit of a headache in
these architectures, so I invented a permuting buffer that uses basic group
theory to automatically generate in-place permuters that do the necessary
data reordering.

I think you may have been misunderstanding how the architecture worked, and
that is why you perhaps thought it was inefficient.  The total buffering is
only 50% higher than the minimum of buffering possible (i.e. only storing
each sample once), and the multipliers are all used at 100% efficiency.
 Higher radices can produce some savings if you are doing more FFTs in
parallel, but barring that, I'd be surprised if there is another
architecture that substantially outperforms this one (but you are welcome
to try!  :)

I'm happy you're documenting.

All the best,
Aaron


-- 
Aaron Parsons
510-306-4322
Hearst Field Annex B54, UCB


Re: [casper] Purpose of FFT-Direct

2013-03-12 Thread Ryan Monroe
Hey Aaron!

My understanding may be imperfect, but I thought that a split-radix FFT
would have a bank of phase rotations (one for each input to fft-direct)
after the biplex FFTs.  If you chose your phase rotation coefficients
correctly, you'd be able to finish the larger FFT with a simple fft-direct
(map_tail=0).  That's the split-radix FFT which I was talking about.  It
simplifies things (all the coefficient storage goes in one place, reduces
routing, counters can be shared more easily, coefficients shared more
easily, etc) but I think the multiplier usage ends up the same.  The
difference would really start to show if you were trying to do like, a
2^21-point FFT... where you'd do the corner turns in QDR and generate
phase-rotate coefficients.  If you had the same coefficient schedule that
is used in fft_direct your FPGA would not be able to hold them all.
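
Here is a pure-Python sketch of the factorization I mean, as arithmetic only: independent sub-FFTs on decimated streams, then one bank of phase rotations per stream, then a plain fixed-coefficient direct DFT across the streams. The names dft and fft_with_phase_bank are made up for this example and nothing here is taken from the CASPER library; the direct DFT stands in for a biplex FFT.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT; stands in for a biplex sub-FFT or the direct FFT."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def fft_with_phase_bank(x, n_streams):
    """N-point DFT from n_streams sub-FFTs, a phase bank and a direct DFT."""
    n = len(x)
    m = n // n_streams
    # 1) Independent M-point FFTs, one per decimated input stream.
    sub = [dft(x[p::n_streams]) for p in range(n_streams)]
    out = [0j] * n
    for k1 in range(m):
        # 2) Phase rotation bank: stream p is rotated by W_N^(p*k1).
        rotated = [sub[p][k1] * cmath.exp(-2j * cmath.pi * p * k1 / n)
                   for p in range(n_streams)]
        # 3) Fixed-coefficient direct DFT across the streams.
        direct = dft(rotated)
        for k2 in range(n_streams):
            out[k1 + m * k2] = direct[k2]
    return out


if __name__ == "__main__":
    x = [complex(i % 7, (3 * i) % 5) for i in range(32)]
    err = max(abs(a - b) for a, b in zip(fft_with_phase_bank(x, 4), dft(x)))
    print("max error vs direct DFT:", err)
```

All the k1-dependent coefficients live in step 2, so step 3 never changes from one clock to the next.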

Either way, hats off to you in a serious way; I would never have been able
to design this madness on my own :-)  Finally, as far as I can tell, your
memory utilization is the best that anyone can achieve under the constraint
of normal output order (you can do a bit better if you're okay with taking
a bit-reversal, though).  Ultimately these are all factorizations of the same
basic algorithm.  If you do a bit of mental gymnastics I guess it all looks
pretty similar.

I have a radix-4 fft_wideband_real which uses 65%-85% as many multipliers
and better coefficient sharing, but as you say, you'll need to be doing
many parallel FFTs to take advantage of it (one R4MDC block can eat an
entire KATADC's worth of signal!).  No improvement to memory utilization
though.

*correction on my last post:  *When I said R4DC (radix-4, Delay
Commutator), I should have said R4MDC (radix-4, multi-delay commutator),
to distinguish it from streaming FFTs which only process one FFT's worth of
data at a time.

--Ryan


Re: [casper] Purpose of FFT-Direct

2013-03-12 Thread Aaron Parsons
 My understanding may be imperfect, but I thought that a split-radix FFT
 would have a bank of phase rotations (one for each input to fft-direct)
 after the biplex FFTs.  If you chose your phase rotation coefficients
 correctly, you'd be able to finish the larger FFT with a simple fft-direct
 (map_tail=0).


Hm. I think you have a point.  I'd missed that if you do one phasing for
each biplex stream, you could have the last stages all use the same direct
FFT (which would be a true direct FFT).  Cute!  I can see how in some
applications this could be helpful.  I'd generally assumed that
coefficients weren't that important for memory usage, because of all the
sample buffering.  However, if lots of that is happening off-chip, I guess
maybe you start caring about coefficient storage.
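
A minimal NumPy sketch of the arrangement discussed here, offered as an illustration of the arithmetic only and not as the CASPER implementation: P interleaved streams each go through their own serial FFT (standing in for the biplex cores), each stream gets one phase rotation per output bin, and a plain P-point direct FFT across the streams finishes the job. The names wideband_fft and n_streams are made up for this example, and hardware details such as bit widths and data reordering are ignored.

```python
import numpy as np

def wideband_fft(x, n_streams):
    """N-point FFT from n_streams serial FFTs, a phase-rotation bank,
    and a plain n_streams-point direct FFT (hypothetical model)."""
    x = np.asarray(x, dtype=complex)
    n = x.size
    assert n % n_streams == 0, "FFT length must be a multiple of n_streams"
    m = n // n_streams

    # Stream p carries samples p, p + P, p + 2P, ... (P = n_streams).
    streams = x.reshape(m, n_streams).T          # shape (P, M)

    # P independent M-point FFTs; no samples are exchanged between streams.
    sub = np.fft.fft(streams, axis=1)            # indexed [p, k1]

    # One phase rotation per stream and per serial output slot: W_N^(p*k1).
    p = np.arange(n_streams)[:, None]
    k1 = np.arange(m)[None, :]
    rotated = sub * np.exp(-2j * np.pi * p * k1 / n)

    # Plain P-point direct FFT across the streams, the same for every k1.
    out = np.fft.fft(rotated, axis=0)            # indexed [k2, k1]

    # Output bin k = k1 + M*k2, so a row-major flatten gives natural order.
    return out.reshape(n)
```

Comparing the result against np.fft.fft on random complex input is a quick way to check that the per-stream rotations are the only coefficients that depend on the output slot.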


  Finally, as far as I can read your memory utilization is the best that
 anyone can achieve under the constraint of normal output order (you can do
 a bit better if you're okay with taking a bit-reversal tho)


Don't say this too loudly around Dan.  He always suggests pulling out the
bit reversal at the drop of a hat.  I think that'd be a nightmare from a
system integration perspective, and constantly have to rein him in.  :)


 I have a radix-4 fft_wideband_real which uses 65%-85% as many multipliers
 and better coefficient sharing, but as you say, you'll need to be doing
 many parallel FFTs to take advantage of it (one R4MDC block can eat an
 entire KATADC's worth of signal!).  No improvement to memory utilization
 though.


For very wideband FFTs (>= 4 samples in parallel), using a single radix-4
biplex core for the set of streaming FFTs could be advantageous...

Aaron



 *correction on my last post:  *When I said R4DC (radix-4, delay
 commutator), I should have said R4MDC (radix-4, multi-delay commutator),
 to distinguish it from streaming FFTs which only process one FFT's worth of
 data at a time.

 --Ryan


 On Tue, Mar 12, 2013 at 5:44 PM, Aaron Parsons 
 apars...@astron.berkeley.edu wrote:

 Hi Ryan,

 I wrote the various forms of the CASPER FFT, including this one.  The
 broad idea of the architecture was described in:
 http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4840623tag=1

 Basically, (as far as I can tell from the brief perusal of split-radix
 ffts), I think this *is* a split radix FFT.   The mix of serial and
 parallel FFTs is used to evaluate a radix-2 Cooley-Tukey FFT that is
 decomposed into several smaller FFTs that can be computed independently
 (without inter-communication of samples), followed by a direct FFT that
 cycles through twiddle coefficients (i.e. it is not truly a stand-alone
 direct FFT) that does the remaining butterflies, drawing on
 samples from all the sub-FFTs.  Data permutation is a bit of a headache in
 these architectures, so I invented a permuting buffer that uses basic group
 theory to automatically generate in-place permuters that do the necessary
 data reordering.
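
One way to read "a direct FFT that cycles through twiddle coefficients" is sketched here in NumPy, under the assumption that the final stage can be modelled as a P x P matrix whose entries change with each serial output slot. This is an illustration of the math, not the CASPER block itself; folded_tail and its arguments are hypothetical names, and sub is assumed to hold the outputs of the P independent sub-FFTs, indexed [stream, bin].

```python
import numpy as np

def folded_tail(sub, n):
    """Final butterflies with the twiddles folded into the direct stage.

    sub : array of shape (P, M), outputs of P independent M-point FFTs,
          where stream p was fed samples p, p + P, p + 2P, ...
    n   : total FFT length, n = P * M
    """
    n_streams, m = sub.shape
    assert n_streams * m == n
    p = np.arange(n_streams)                 # input (stream) index
    k2 = np.arange(n_streams)[:, None]       # output index of the direct stage
    out = np.empty(n, dtype=complex)
    for k1 in range(m):                      # one serial output slot per pass
        # Coefficients for this slot: W_N^(p * (k1 + M*k2)).  They depend on
        # k1, so the stage steps through a new coefficient set every slot.
        coeffs = np.exp(-2j * np.pi * p * (k1 + m * k2) / n)
        out[k1 + m * np.arange(n_streams)] = coeffs @ sub[:, k1]
    return out
```

With k1 fixed at 0 the coefficient matrix reduces to an ordinary P-point DFT matrix, which is the sense in which the stage is a direct FFT that is not stand-alone.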

 I think you may have been misunderstanding how the architecture worked,
 and that is why you perhaps thought it was inefficient.  The total
 buffering is only 50% higher than the minimum of buffering possible (i.e.
 only storing each sample once), and the multipliers are all used at 100%
 efficiency.  Higher radices can produce some savings if you are doing more
 FFTs in parallel, but barring that, I'd be surprised if there is another
 architecture that substantially outperforms this one (but you are welcome
 to try!  :)

 I'm happy you're documenting.

 All the best,
 Aaron


 On Tue, Mar 12, 2013 at 3:39 PM, Ryan Monroe ryan.m.mon...@gmail.com wrote:

 Hey all,
 Luke Madden was asking me about what's going on in the FFT-direct today.
  I'm pretty sure we have basically zero documentation on this lying around,
 so it's a good time to fix that.  I'm going to share what I know, but I'd
 appreciate it if other people could add/correct me as needed.

 So, you can split the CASPER FFTs into streaming and parallel FFTs:

 streaming: fft_biplex, fft_biplex_real, fft_biplex_real_4x
 These FFTs have several independent ports.  Each of these ports is fed
 with normal-order, serial time-domain data and produces normal-order,
 serial frequency-domain data.  If you know something about how pipelined
 FFTs work, you'll probably call it a Radix 2, Delay-Commutator FFT, or
 R2DC.  In the fft_biplex, we follow the R2DC FFT with an
 inverse-delay-commutator stage to un-scramble the data (the casper
 implementation doesn't have the same structure as an
 inverse-delay-commutator, but they do the same thing).  In
 fft_biplex_real, we do the same R2DC FFT, but we treat real and imag as
 separate inputs, making four inputs.

 parallel: fft_direct
 If map_tail is not set, then the fft_direct block accepts all the inputs
 for an fft on *each clock cycle*.  Natural order in, Natural order out.
 If map_tail *is* set, it's a bit more complicated.  Then, this block is
 being used with a number of streaming FFTs to achieve a wideband 

Re: [casper] Purpose of FFT-Direct

2013-03-12 Thread Ryan Monroe

That makes two of us!  Viva la revolution!

On 03/12/2013 06:35 PM, Dan Werthimer wrote:


it's pretty loud where i'm sitting.





Re: [casper] Purpose of FFT-Direct

2013-03-12 Thread Aaron Parsons
ay dios mio

On Tue, Mar 12, 2013 at 6:40 PM, Ryan Monroe ryan.m.mon...@gmail.com wrote:

 That makes two of us!  Viva la revolution!


 On 03/12/2013 06:35 PM, Dan Werthimer wrote:


 it's pretty loud where i'm sitting.





-- 
Aaron Parsons
510-306-4322
Hearst Field Annex B54, UCB