[casper] Purpose of FFT-Direct

2013-03-12 Thread Ryan Monroe
Hey all,
Luke Madden was asking me about what's going on in the FFT-direct today.
 I'm pretty sure we have basically zero documentation on this lying around,
so it's a good time to fix that.  I'm going share what I know, but I'd
appreciate it if other people could add/correct me as needed.

So, you can split the CASPER FFTs into streaming and parallel FFTs:

streaming: fft_biplex, fft_biplex_real, fft_biplex_real_4x
These FFTs have several independent ports.  Each of these ports is fed with
normal-order, serial time-domain data and produces normal-order, serial
frequency-domain data.  If you know something about how pipelined FFTs
work, you'll probably call it a Radix 2, Delay-Commutator FFT, or R2DC.
 In the fft_biplex, we follow the R2DC FFT with an
inverse-delay-commutator stage to un-scramble the data (the casper
implementation doesn't have the same structure as an
inverse-delay-commutator, but they do the same thing).  In
fft_biplex_real, we do the same R2DC FFT, but we treat real and imag as
separate inputs, making four inputs.

parallel: fft_direct
If map_tail is not set, then the fft_direct block accepts all the inputs
for an fft on *each clock cycle*.  Natural order in, Natural order out.
If map_tail *is* set, it's a bit more complicated.  Then, this block is
being used with a number of streaming FFTs to achieve a wideband FFT.
Imagine a standard DIT FFT.  The early stages of the FFT only use a few
coefficients.  In fact, they are each FFTs in their own rights, only on a
subset of the data.  These streaming FFTs are just that:  for as long as we
can still process the data in a serial fashion, we process each sample
sequentially.  Then, we do the last 1-4 (typically) stages in a massive
parallel format.  Here, the same structure is drawn as in the map_tail=0
fft_direct... but the coefficients now change (specifically, their phases
are incrementing).

This is where my understanding gets a bit hazy, but it looks like the last
stages of the FFT are being literally enumerated here.  *If someone wants
to chime in, here is the place to do it*.

In any case, you could actually do these mixed streaming/parallel FFTs
(which are fft, fft_wideband_real) in a different fashion, by re-casting
them as a split-radix FFT (look it up).  Doing this is computationally
about the same, but saves resources and memory... and is simpler if the
size of fft_direct is greater than 2^2.


I hope this helps, Luke (and everyone else)!


--Ryan Monroe


Re: [casper] Purpose of FFT-Direct

2013-03-12 Thread Aaron Parsons
Hi Ryan,

I wrote the various forms of the CASPER FFT, including this one.  The broad
idea of the architecture was described in:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4840623tag=1

Basically, (as far as I can tell from the brief perusal of split-radix
ffts), I think this *is* a split radix FFT.   The mix of serial and
parallell FFTs is used to evaluate a radix-2 Cooley Tukey FFT that is
decomposed into several smaller FFTs that can be computed independently
(without inter-communication of samples), followed by a direct FFT that
cycles through twiddle coefficients (i.e. it is not truly a stand-alone
direct FFT) that combines does the remaining butterflies, drawing on
samples from all the sub-FFTs.  Data permutation is a bit of a headache in
these architectures, so I invented a permuting buffer that uses basic group
theory to automatically generate in-place permuters that do the necessary
data reordering.

I think you may have been misunderstanding how the architecture worked, and
that is why you perhaps thought it was inefficient.  The total buffering is
only 50% higher than the minimum of buffering possible (i.e. only storing
each sample once), and the multipliers are all used at 100% efficiency.
 Higher radices can produce some savings if you are doing more FFTs in
parallel, but barring that, I'd be surprised if there is another
architecture that substantially outperforms this one (but you are welcome
to try!  :)

I'm happy you're documenting.

All the best,
Aaron

On Tue, Mar 12, 2013 at 3:39 PM, Ryan Monroe ryan.m.mon...@gmail.comwrote:

 Hey all,
 Luke Madden was asking me about what's going on in the FFT-direct today.
  I'm pretty sure we have basically zero documentation on this lying around,
 so it's a good time to fix that.  I'm going share what I know, but I'd
 appreciate it if other people could add/correct me as needed.

 So, you can split the CASPER FFTs into streaming and parallel FFTs:

 streaming: fft_biplex, fft_biplex_real, fft_biplex_real_4x
 These FFTs have several independent ports.  Each of these ports is fed
 with normal-order, serial time-domain data and produces normal-order,
 serial frequency-domain data.  If you know something about how pipelined
 FFTs work, you'll probably call it a Radix 2, Delay-Commutator FFT, or
 R2DC.  In the fft_biplex, we follow the R2DC FFT with an
 inverse-delay-commutator stage to un-scramble the data (the casper
 implementation doesn't have the same structure as an
 inverse-delay-commutator, but they do the same thing).  In
 fft_biplex_real, we do the same R2DC FFT, but we treat real and imag as
 separate inputs, making four inputs.

 parallel: fft_direct
 If map_tail is not set, then the fft_direct block accepts all the inputs
 for an fft on *each clock cycle*.  Natural order in, Natural order out.
 If map_tail *is* set, it's a bit more complicated.  Then, this block is
 being used with a number of streaming FFTs to achieve a wideband FFT.
 Imagine a standard DIT FFT.  The early stages of the FFT only use a few
 coefficients.  In fact, they are each FFTs in their own rights, only on a
 subset of the data.  These streaming FFTs are just that:  for as long as we
 can still process the data in a serial fashion, we process each sample
 sequentially.  Then, we do the last 1-4 (typically) stages in a massive
 parallel format.  Here, the same structure is drawn as in the map_tail=0
 fft_direct... but the coefficients now change (specifically, their phases
 are incrementing).

 This is where my understanding gets a bit hazy, but it looks like the last
 stages of the FFT are being literally enumerated here.  *If someone wants
 to chime in, here is the place to do it*.

 In any case, you could actually do these mixed streaming/parallel FFTs
 (which are fft, fft_wideband_real) in a different fashion, by re-casting
 them as a split-radix FFT (look it up).  Doing this is computationally
 about the same, but saves resources and memory... and is simpler if the
 size of fft_direct is greater than 2^2.


 I hope this helps, Luke (and everyone else)!


 --Ryan Monroe




-- 
Aaron Parsons
510-306-4322
Hearst Field Annex B54, UCB


Re: [casper] Purpose of FFT-Direct

2013-03-12 Thread Ryan Monroe
Hey Aaron!

My understanding may be imperfect, but I thought that a split-radix FFT
would have a bank of phase rotations (one for each input to fft-direct)
after the biplex FFTs.  If you chose your phase rotation coefficients
correctly, you'd be able to finish the larger FFT with a simple fft-direct
(map_tail=0).  That's the split-radix FFT which I was talking about.  It
simplifies things (all the coefficient storage goes in one place, reduces
routing, counters can be shared more easily, coefficients shared more
easily, etc) but I think the multiplier usage ends up the same.  The
difference would really start to show if you were trying to do like, a
2^21-point FFT... where you'd do the corner turns in QDR and generate
phase-rotate coefficients.  If you had the same coefficient schedule that
is used in fft_direct your FPGA would not be able to hold them all.

Either way, hat's off to you in a serious way, I would never have been able
to design this madness on my own :-)  Finally, as far as I can read your
memory utilization is the best that anyone can achieve under the constraint
of normal output order (you can do a bit better if you're okay with taking
a bit-reversal tho)  Ultimately these are all factorizations of the same
basic algorithm.  If you do a bit of mental gymnastics I guess it all looks
pretty similar

I have a radix-4 fft_wideband_real which uses 65%-85% as many multipliers
and better coefficient sharing, but as you say, you'll need to be doing
many parallel FFTs to take advantage of it (one R4MDC block can eat an
entire KATADC's worth of signal!).  No improvement to memory utilization
though.

*correction on my last post:  *When I said R4DC (radix-4, Delay
Commutator), I should have said R4MDC (radix-4, multi-delay commutator),
to distinguish it from streaming FFTs which only process FFT's worth of
data at a time.

--Ryan

On Tue, Mar 12, 2013 at 5:44 PM, Aaron Parsons apars...@astron.berkeley.edu
 wrote:

 Hi Ryan,

 I wrote the various forms of the CASPER FFT, including this one.  The
 broad idea of the architecture was described in:
 http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4840623tag=1

 Basically, (as far as I can tell from the brief perusal of split-radix
 ffts), I think this *is* a split radix FFT.   The mix of serial and
 parallell FFTs is used to evaluate a radix-2 Cooley Tukey FFT that is
 decomposed into several smaller FFTs that can be computed independently
 (without inter-communication of samples), followed by a direct FFT that
 cycles through twiddle coefficients (i.e. it is not truly a stand-alone
 direct FFT) that combines does the remaining butterflies, drawing on
 samples from all the sub-FFTs.  Data permutation is a bit of a headache in
 these architectures, so I invented a permuting buffer that uses basic group
 theory to automatically generate in-place permuters that do the necessary
 data reordering.

 I think you may have been misunderstanding how the architecture worked,
 and that is why you perhaps thought it was inefficient.  The total
 buffering is only 50% higher than the minimum of buffering possible (i.e.
 only storing each sample once), and the multipliers are all used at 100%
 efficiency.  Higher radices can produce some savings if you are doing more
 FFTs in parallel, but barring that, I'd be surprised if there is another
 architecture that substantially outperforms this one (but you are welcome
 to try!  :)

 I'm happy you're documenting.

 All the best,
 Aaron


 On Tue, Mar 12, 2013 at 3:39 PM, Ryan Monroe ryan.m.mon...@gmail.comwrote:

 Hey all,
 Luke Madden was asking me about what's going on in the FFT-direct today.
  I'm pretty sure we have basically zero documentation on this lying around,
 so it's a good time to fix that.  I'm going share what I know, but I'd
 appreciate it if other people could add/correct me as needed.

 So, you can split the CASPER FFTs into streaming and parallel FFTs:

 streaming: fft_biplex, fft_biplex_real, fft_biplex_real_4x
 These FFTs have several independent ports.  Each of these ports is fed
 with normal-order, serial time-domain data and produces normal-order,
 serial frequency-domain data.  If you know something about how pipelined
 FFTs work, you'll probably call it a Radix 2, Delay-Commutator FFT, or
 R2DC.  In the fft_biplex, we follow the R2DC FFT with an
 inverse-delay-commutator stage to un-scramble the data (the casper
 implementation doesn't have the same structure as an
 inverse-delay-commutator, but they do the same thing).  In
 fft_biplex_real, we do the same R2DC FFT, but we treat real and imag as
 separate inputs, making four inputs.

 parallel: fft_direct
 If map_tail is not set, then the fft_direct block accepts all the inputs
 for an fft on *each clock cycle*.  Natural order in, Natural order out.
 If map_tail *is* set, it's a bit more complicated.  Then, this block is
 being used with a number of streaming FFTs to achieve a wideband FFT.
 Imagine a standard DIT FFT.  The early stages of 

Re: [casper] Purpose of FFT-Direct

2013-03-12 Thread Aaron Parsons
 My understanding may be imperfect, but I thought that a split-radix FFT
 would have a bank of phase rotations (one for each input to fft-direct)
 after the biplex FFTs.  If you chose your phase rotation coefficients
 correctly, you'd be able to finish the larger FFT with a simple fft-direct
 (map_tail=0).


Hm. I think you have a point.  I'd missed that if you do one phasing for
each biplex stream, you could have the last stages all use the same direct
FFT (which would be a true direct FFT).  Cute!  I can see how in some
applications this could be helpful.  I'd generally assumed that
coefficients weren't that important for memory usage, because of all the
sample buffering.  However, if lots of that is happening off-chip, I guess
maybe you start caring about coefficient storage.


  Finally, as far as I can read your memory utilization is the best that
 anyone can achieve under the constraint of normal output order (you can do
 a bit better if you're okay with taking a bit-reversal tho)


Don't say this too loudly around Dan.  He always suggests pulling out the
bit reversal at the drop of a hat.  I think that'd be a nightmare from a
system integration perspective, and constantly have to rein him in.  :)


 I have a radix-4 fft_wideband_real which uses 65%-85% as many multipliers
 and better coefficient sharing, but as you say, you'll need to be doing
 many parallel FFTs to take advantage of it (one R4MDC block can eat an
 entire KATADC's worth of signal!).  No improvement to memory utilization
 though.


For very wideband FFTs (= 4 samples in parallel), using a single radix-4
biplex core for the set of streaming FFTs could be advantageous...

Aaron



 *correction on my last post:  *When I said R4DC (radix-4, Delay
 Commutator), I should have said R4MDC (radix-4, multi-delay commutator),
 to distinguish it from streaming FFTs which only process FFT's worth of
 data at a time.

 --Ryan


 On Tue, Mar 12, 2013 at 5:44 PM, Aaron Parsons 
 apars...@astron.berkeley.edu wrote:

 Hi Ryan,

 I wrote the various forms of the CASPER FFT, including this one.  The
 broad idea of the architecture was described in:
 http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4840623tag=1

 Basically, (as far as I can tell from the brief perusal of split-radix
 ffts), I think this *is* a split radix FFT.   The mix of serial and
 parallell FFTs is used to evaluate a radix-2 Cooley Tukey FFT that is
 decomposed into several smaller FFTs that can be computed independently
 (without inter-communication of samples), followed by a direct FFT that
 cycles through twiddle coefficients (i.e. it is not truly a stand-alone
 direct FFT) that combines does the remaining butterflies, drawing on
 samples from all the sub-FFTs.  Data permutation is a bit of a headache in
 these architectures, so I invented a permuting buffer that uses basic group
 theory to automatically generate in-place permuters that do the necessary
 data reordering.

 I think you may have been misunderstanding how the architecture worked,
 and that is why you perhaps thought it was inefficient.  The total
 buffering is only 50% higher than the minimum of buffering possible (i.e.
 only storing each sample once), and the multipliers are all used at 100%
 efficiency.  Higher radices can produce some savings if you are doing more
 FFTs in parallel, but barring that, I'd be surprised if there is another
 architecture that substantially outperforms this one (but you are welcome
 to try!  :)

 I'm happy you're documenting.

 All the best,
 Aaron


 On Tue, Mar 12, 2013 at 3:39 PM, Ryan Monroe ryan.m.mon...@gmail.comwrote:

 Hey all,
 Luke Madden was asking me about what's going on in the FFT-direct today.
  I'm pretty sure we have basically zero documentation on this lying around,
 so it's a good time to fix that.  I'm going share what I know, but I'd
 appreciate it if other people could add/correct me as needed.

 So, you can split the CASPER FFTs into streaming and parallel FFTs:

 streaming: fft_biplex, fft_biplex_real, fft_biplex_real_4x
 These FFTs have several independent ports.  Each of these ports is fed
 with normal-order, serial time-domain data and produces normal-order,
 serial frequency-domain data.  If you know something about how pipelined
 FFTs work, you'll probably call it a Radix 2, Delay-Commutator FFT, or
 R2DC.  In the fft_biplex, we follow the R2DC FFT with an
 inverse-delay-commutator stage to un-scramble the data (the casper
 implementation doesn't have the same structure as an
 inverse-delay-commutator, but they do the same thing).  In
 fft_biplex_real, we do the same R2DC FFT, but we treat real and imag as
 separate inputs, making four inputs.

 parallel: fft_direct
 If map_tail is not set, then the fft_direct block accepts all the inputs
 for an fft on *each clock cycle*.  Natural order in, Natural order out.
 If map_tail *is* set, it's a bit more complicated.  Then, this block is
 being used with a number of streaming FFTs to achieve a wideband 

Re: [casper] Purpose of FFT-Direct

2013-03-12 Thread Ryan Monroe

That makes two of us!  Viva la revolution!

On 03/12/2013 06:35 PM, Dan Werthimer wrote:


it's pretty loud where i'm sitting.





Re: [casper] Purpose of FFT-Direct

2013-03-12 Thread Aaron Parsons
ay dios mio

On Tue, Mar 12, 2013 at 6:40 PM, Ryan Monroe ryan.m.mon...@gmail.comwrote:

 That makes two of us!  Viva la revolution!


 On 03/12/2013 06:35 PM, Dan Werthimer wrote:


 it's pretty loud where i'm sitting.





-- 
Aaron Parsons
510-306-4322
Hearst Field Annex B54, UCB