Re: [casper] unable to load my boffiles and to configure my roach2
Hi Marc, thank you for your answer. I have another question from my first email that I didn't get an answer to. It is about the process for generating a boffile compatible with roach2 and tcpborph3 using the 11.5 toolflow. I'd like to know if there is a trick that can be used to generate a boffile, with the current casper_astro_mlib or the old ska_sa_lib, that will be compatible with the roach2. regards,
Roach2s which run tcpborphserver3 do not execute bof files; they use tcpborphserver3 to program the FPGA using a device file. So this is not an error. Use progdev to program the FPGA, or if you wish to use the command line:
~ # kcpcmd progdev r2_tut1_2013_Mar_07_0837.bof
regards marc
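For scripted use, the same progdev request can be issued from Python. The sketch below is only an illustration: it assumes a client object with `is_connected()` and `progdev()` methods, which is the interface assumed here for `corr.katcp_wrapper.FpgaClient` from the `corr` package used by the CASPER tutorial scripts. `program_fpga` and `FakeClient` are hypothetical names introduced for this example; `FakeClient` exists only so the helper can be dry-run without a board.

```python
def program_fpga(client, boffile):
    """Ask the katcp server to program the FPGA with `boffile`.

    `client` is any object exposing is_connected() and progdev(); this is
    the interface assumed for corr.katcp_wrapper.FpgaClient.
    """
    if not client.is_connected():
        raise RuntimeError("not connected to the ROACH2 katcp server")
    client.progdev(boffile)
    return boffile


class FakeClient:
    """Hypothetical stand-in, used only to dry-run the helper offline."""

    def __init__(self):
        self.programmed = None

    def is_connected(self):
        return True

    def progdev(self, boffile):
        # A real client would send the progdev request to tcpborphserver3.
        self.programmed = boffile


if __name__ == "__main__":
    fake = FakeClient()
    print(program_fpga(fake, "r2_tut1_2013_Mar_07_0837.bof"))
```

On real hardware the `FakeClient` would be replaced by something like `corr.katcp_wrapper.FpgaClient("roach2-hostname")`, where the host name is a placeholder.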
Re: [casper] unable to load my boffiles and to configure my roach2
Hi Wes, Can you elaborate on this MMCM stability problem? Which version of ISE was it fixed in? Thanks, Glenn
On Mar 12, 2013 5:40 AM, Wesley New wes...@ska.ac.za wrote: Hi Guy, The current casper_astro_mlib doesn't support ROACH2, as the ROACH2 changes don't seem to have been pulled from SKA-SA. Why would you want to build with old tools anyway? The older Xilinx tools have issues with the clock managers (MMCMs) where you can get instability problems when deploying your design. I would highly recommend upgrading your design machine to 14.4 and matlab2012b. Regards Wes
[casper] ADC 1x5000-8 1:2 boards
Dear Ross: I will check our inventory. I guess we still have some. I am interested to know why not 1:1 ? cheers homin
Re: [casper] unable to load my boffiles and to configure my roach2
Hi Glenn, Xilinx tightened up the constraining and configuration options on the MMCM in ISE versions 12 and 13 due to lock issues. We have also run into poor timing results on 11.5. ISE 11.5 was the first version to support Virtex 6, so moving to the latest (ISE 14) is recommended for ROACH2 compiles. Regards Henno
-- Henno Kriel DSP Engineer Digital Back End meerKAT SKA South Africa
Re: [casper] ADC1X5000-8 correlator question
Hi Ross, I am commenting on this older thread, prompted by your most recent message re DMUX1:2 ADCs, to which Homin responded today. Your approach, and the advice received from Dan and Glenn, is sound---in principle. With suitable bit codes, you can run DMUX1:2 4-bit versions of the ASIAA ADC at full rate, in dual channel 2.5 GSa/s mode, hosted on a ROACH1. And with two ADC boards installed you can enable four such 2.5 GSa/s input channels on ROACH1. Whether the ROACH1 can process the four channels with Virtex 5 resources of course depends on what you want to do; Dan gave one benchmark. We have a PFB resource utilization memo which may be able to help you.
There is a caveat, however: my group is among a few CASPER collaborators who are doing wideband correlator design using ASIAA 5 GSa/s ADCs. We decided some time ago to standardize on using the 8-bit DMUX 1:1 version of the ADC supported by ROACH2. In our case we without question needed the computational resources of Virtex6-SX475, so the higher GPIO interface speed (translates to Z-DOK speed) in some sense came for free. Though we could probably meet our requirements with fewer than 8 bits, there was no penalty for enabling them by using the DMUX1:1 board. A consequence of this is that all our developments and contributions to the CASPER open source infrastructure support this configuration (8-bit and ROACH2). In brief we have contributed yellow block work, ADC characterization, and resource calculations for PFBs. Now, we are not the only ones doing this kind of work, and much of what we do is probably transferable to the ROACH1 and 4-bit case. However, at least from my biased perspective, you may find the ROACH2 8-bit case to be better supported, and thus the hardware cost savings through using ROACH1 may prove to be false economy. The 4-bit and 8-bit ADCs are equal in cost. (By the way, the 8-bit ADC board can be used on ROACH1 but you would be limited to 1.8 GSa/s per channel or 3.6 GSa/s per ADC board.)
All that said, I have a few DMUX1:2 ADCs sitting in the lab fully assembled, working, and unused, and I would be happy to loan these to you. Actually they probably belong to Homin / ASIAA, so I guess subject to his OK. Our wideband developments are documented on a public Wiki: https://www.cfa.harvard.edu/twiki/bin/view/SMAwideband Dig down for ADC characterization and our resource memo. Cheers, stay in touch. Jonathan Weintroub CfA +1-617-495-7319
On Feb 12, 2013, at 4:09 PM, Ross Williamson wrote: Hi Dan and Glenn, All sounds good - I think having the ROACH-1 design will help us out a lot here to get started. We really don't need a very high spectral response as we are just trying to measure the amplitude and phase of a PSF's wings referenced to the central core of the PSF. Cheers, Ross
On Tue, Feb 12, 2013 at 1:05 PM, G Jones glenn.calt...@gmail.com wrote: Note that the design Dan is referring to currently has only been tested to ~1.5 GHz BW. Getting to higher bandwidths will likely require some work to meet timing. Also the data comes out over 10 gigabit Ethernet so requires something to catch the data.
On Tue, Feb 12, 2013 at 4:02 PM, Dan Werthimer d...@ssl.berkeley.edu wrote: hi ross, how many frequency channels do you need in your two input correlator? we have a full stokes VEGAS 1K channel spectrometer design that uses a pair of ADC08-5000's. (a full stokes spectrometer is the same thing as a two input correlator) our current design is for roach2, but we had tested working designs for roach1 last year, and we can probably dig these up. best wishes, dan
On Tue, Feb 12, 2013 at 12:53 PM, Ross Williamson rwilliam...@astro.caltech.edu wrote: Hi All, I'm new to CASPER/ROACH boards and so apologies if this is obvious. We are hoping to build a simple 2-channel correlator with a fairly high bandwidth. We are in the process of ordering a ROACH-1 board and I was thinking of also purchasing the ADC1X5000-8 ADC. A few questions:
1) Should I even be looking at an ADC1X5000-8 for use as a correlator?
2) I believe I need the DMUX 1:2 version to work with the ROACH-1 board - is that correct?
3) I should be able to run 2 channels at 2.5GHz 4bits with a single board - correct?
4) If I wanted to, could I purchase another ADC1X5000-8 and try to run with 5GHz bandwidth on each channel using the ROACH-1?
Best regards, Ross -- Ross Williamson Research Scientist - Sub-mm Group California Institute of Technology 626-395-2647 (office) 312-504-3051 (Cell)
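Dan's remark that a full-Stokes spectrometer is the same thing as a two-input correlator can be illustrated with a short sketch: both accumulate, per frequency channel, the two auto-powers and the real and imaginary parts of the cross product. The function names below are hypothetical and the code is a plain-Python illustration, not any CASPER design. Stokes parameters are linear combinations of these four products, with signs that depend on convention, so they are not computed here.

```python
import math


def correlate(x_spectra, y_spectra):
    """Accumulate the four real products of a two-input correlator.

    x_spectra, y_spectra: lists of spectra (one per accumulation), each a
    list of complex channel values for inputs X and Y.
    Returns one (xx, yy, re_xy, im_xy) tuple per channel.
    """
    n_chan = len(x_spectra[0])
    acc = [[0.0, 0.0, 0.0, 0.0] for _ in range(n_chan)]
    for xs, ys in zip(x_spectra, y_spectra):
        for ch in range(n_chan):
            x, y = xs[ch], ys[ch]
            cross = x * y.conjugate()      # X times conj(Y)
            acc[ch][0] += abs(x) ** 2      # auto-power of X
            acc[ch][1] += abs(y) ** 2      # auto-power of Y
            acc[ch][2] += cross.real
            acc[ch][3] += cross.imag
    return [tuple(a) for a in acc]


def amplitude_and_phase(re_xy, im_xy):
    """Amplitude and phase (radians) of an accumulated cross product."""
    return math.hypot(re_xy, im_xy), math.atan2(im_xy, re_xy)
```

The cross-product amplitude and phase are the quantities Ross mentions wanting for the PSF wings relative to the core.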
Re: [casper] unable to load my boffiles and to configure my roach2
Hi, Wes, On Mar 12, 2013, at 10:35 AM, Wesley New wrote: There might be 1 or 2 tweaks to the tools that you would have to do. The only tweak that I've needed to do so far is to export model files (e.g. xps_library.mdl) that have been saved in the latest Simulink version to an older Simulink model file format. The latest version stores the mask parameters in a way that is incompatible with earlier versions. What was Mathworks thinking?! Dave
Re: [casper] unable to load my boffiles and to configure my roach2
I've not tried that far back. I go back and forth between 13.x and 14.x tools. I have not yet backported your shared bram changes. I very much like the concept and think that it should probably be used even for shared brams that have a data width of 32 so that we can use the BRAM output register option. The Xilinx BRAM pcore that is currently used for a data width of 32 does not support that option, but it would greatly help with timing (and therefore placement), as the clock-to-out time for the UNregistered BRAM is 2 ns! Dave
On Mar 12, 2013, at 10:54 AM, Wesley New wrote: That is pretty cool, thanks for the info Dave. :) Have you tried building with the most recent ska-sa libraries on 11.5? I updated the custom shared bram block to use the latest coregen. I am not sure how an older version of coregen would behave. This is only for brams using a data width that is not 32 bits wide.
Re: [casper] Error reading registers with ROACH2
Hi Jason I'm having a hard time updating the .mdl file for tutorial1 after I tried to update to the newest CASPER tools. By 'CASPER tools' are you referring to recent git commits of the Simulink libraries from the ska-sa repo? If so, I know what your problem below is... All the links to the yellow blocks are broken. Should the .mdl files work with the updated CASPER tools or do I have to go in and replace those blocks that have broken links? I tried doing this as well, but I can't find similar blocks in the simulink explorer. Could you send me the .mdl file for the .bof file that you sent me? Would that help? It seems that older versions of Matlab/Simulink are not forwards-compatible with the .mdl files generated by the latest incarnation of Matlab/Simulink. When the library is loading you will notice a bunch of error messages scrolling past. These are not the usual warnings that can be ignored; they result in a crippled Simulink library with much of it missing (as seen in the simulink explorer). Your .mdl file has links that point back to this crippled library, and much of it is missing (hence the broken links). Your solution is to install the latest compatible spawn from Mathworks. We may have to look into saving our libraries so that they are compatible with older Matlab/Simulink versions, as continually upgrading Matlab/Simulink is expensive, time intensive, and does not usually come with any improvements that relate to us. Regards Andrew
Re: [casper] what should I see?
I think your clock source should be 1 GHz, right? Also, I think your signal generator is set to way too high a level. Start with -40 dBm and increase from there.
On Tue, Mar 12, 2013 at 4:10 PM, katherine viviana cortes urbina kattycort...@gmail.com wrote: Dear Casperites,
1.- I compiled tut3 with 2048 channels, ADC 1 GHz. Control is via the katcp server (Python library). I run the script tut3.py, and the clock source (valon 5007) is connected to clk_i of the ADC with f_out = 2425 MHz. The resulting spectrum is in the attachment tut3_clock_2048canales.png. When I connect a signal generator (frequency 120 MHz, 2 Vpp, sinusoidal signal) to SMA +i of the ADC, the spectrum is in the attachment tut3_2048canales_120MHz.png. Is this correct?
2.- When I compile tut3 with 32 channels, the .py script changes; see the attachment tut3_32channel.py. Is this correct? The spectrum generated (with only the clock source) is tut3_32kcanales_2012_Dec_17_2104.png. Is this correct?
My question is: what should I see in the spectrum? Any ideas? cheers katty
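To put the two levels in this exchange on the same scale, the sketch below converts between peak-to-peak voltage and dBm and estimates which spectrometer channel a test tone should fall in. It is a back-of-the-envelope illustration with loudly stated assumptions: a sine wave into a 50 ohm load, and channels that evenly span 0 to half the sample rate. The sample rate passed to `tone_channel` is an assumption of the caller, not something stated in the thread, and all function names are hypothetical.

```python
import math


def vpp_to_dbm(vpp, load_ohms=50.0):
    """Power in dBm of a sine wave of `vpp` volts peak-to-peak (assumed load)."""
    vrms = vpp / (2.0 * math.sqrt(2.0))
    power_mw = (vrms ** 2 / load_ohms) * 1000.0
    return 10.0 * math.log10(power_mw)


def dbm_to_vpp(dbm, load_ohms=50.0):
    """Peak-to-peak voltage of a sine wave with power `dbm` (assumed load)."""
    power_w = 10.0 ** (dbm / 10.0) / 1000.0
    return 2.0 * math.sqrt(2.0) * math.sqrt(power_w * load_ohms)


def tone_channel(f_tone_hz, f_sample_hz, n_chan):
    """Channel index of a real tone, for n_chan channels spanning 0..fs/2."""
    if not 0.0 <= f_tone_hz < f_sample_hz / 2.0:
        raise ValueError("tone is outside the first Nyquist zone and would alias")
    return int(round(f_tone_hz / (f_sample_hz / 2.0) * n_chan))


if __name__ == "__main__":
    print(round(vpp_to_dbm(2.0), 3))          # level of a 2 Vpp sine, in dBm
    print(round(dbm_to_vpp(-40.0) * 1e3, 3))  # -40 dBm expressed in mVpp
    print(tone_channel(120e6, 1e9, 2048))     # assumes a 1 GSa/s sample rate
```

Under these assumptions a 2 Vpp sine is about +10 dBm, roughly 50 dB above the suggested -40 dBm starting point.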
Re: [casper] ADC 1x5000-8 1:2 boards
Hi Homin, Thanks - My understanding is that we cannot achieve 5 GSPS at 8 bits on a ROACH-1; we need a ROACH-2 for that. We gain a lot more in our correlator design from a larger bandwidth than from a larger number of bits. We may be able to push the 1:1 up to about 3.6 GSPS, which would be OK, but we can do better with the 1:2. Best regards, Ross -- Ross Williamson Research Scientist - Sub-mm Group California Institute of Technology 626-395-2647 (office) 312-504-3051 (Cell)
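The trade-off between bandwidth and bit depth can be put in rough numbers by comparing the aggregate sample payload of the configurations mentioned in these threads. The sketch below only multiplies sample rate by bits per sample; it deliberately does not model the Z-DOK per-pin rates, which the thread names as the limiting factor on ROACH-1 but does not quantify. The function name and the configuration labels are hypothetical.

```python
def raw_rate_gbps(sample_rate_gsps, bits_per_sample):
    """Aggregate ADC payload in Gbit/s (sample rate times bits per sample)."""
    return sample_rate_gsps * bits_per_sample


# Configurations mentioned in the threads, per ADC board.
CONFIGS = [
    ("DMUX 1:1, 8-bit at 5 GSa/s", 5.0, 8),
    ("DMUX 1:1, 8-bit at 3.6 GSa/s", 3.6, 8),
    ("DMUX 1:2, 4-bit at 5 GSa/s", 5.0, 4),
]

if __name__ == "__main__":
    for label, rate, bits in CONFIGS:
        print(f"{label}: {raw_rate_gbps(rate, bits):.1f} Gbit/s")
```

By this simple measure the 4-bit full-rate mode carries half the payload of the 8-bit full-rate mode while keeping the full sampled bandwidth.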
Re: [casper] ADC1X5000-8 correlator question
Hi Jonathan, Thank you for the detailed response (which I read after replying to Homin - doh). We already have our ROACH-1 here and are trying to keep costs down, so I'm going to push ahead a little more with trying to figure out if we can get this to work with a ROACH-1 and 1:2 ADCs. I do have a couple of questions/points, though, before I ask to borrow the 1:2 boards.
1) The nominal plan would be to run 2 channels at 5GSa/s, 4bit on a ROACH-1, i.e. using 2 ADC boards. My understanding is that the ROACH-1 in practice should be OK to do this (depending on number of channels etc. in the PFB).
2) Am I being very naive (I suspect I am) in thinking that with some work (but not a total rewrite) I could just take the correlator tutorial, replace the ADC with the yellow block for the 1x5000-8 boards and tweak the number of channels in the PFB (we don't need many at all, which should help fitting on the Virtex 5)?
If 1+2 are within the bounds of sanity then it might be worth trying to see if we can get this up and running. We can always fall back to the 3.6 GS/s if it's proving to be a nightmare. I'll check out the memo and website too. Cheers, Ross
[casper] Purpose of FFT-Direct
Hey all,

Luke Madden was asking me about what's going on in the FFT-direct today. I'm pretty sure we have basically zero documentation on this lying around, so it's a good time to fix that. I'm going to share what I know, but I'd appreciate it if other people could add to this or correct me as needed.

So, you can split the CASPER FFTs into streaming and parallel FFTs:

streaming: fft_biplex, fft_biplex_real, fft_biplex_real_4x

These FFTs have several independent ports. Each of these ports is fed with normal-order, serial time-domain data and produces normal-order, serial frequency-domain data. If you know something about how pipelined FFTs work, you'll probably call it a Radix 2, Delay-Commutator FFT, or R2DC. In the fft_biplex, we follow the R2DC FFT with an inverse-delay-commutator stage to un-scramble the data (the casper implementation doesn't have the same structure as an inverse-delay-commutator, but they do the same thing). In fft_biplex_real, we do the same R2DC FFT, but we treat real and imag as separate inputs, making four inputs.

parallel: fft_direct

If map_tail is not set, then the fft_direct block accepts all the inputs for an fft on *each clock cycle*. Natural order in, natural order out.

If map_tail *is* set, it's a bit more complicated. Then, this block is being used with a number of streaming FFTs to achieve a wideband FFT. Imagine a standard DIT FFT. The early stages of the FFT only use a few coefficients. In fact, they are each FFTs in their own right, only on a subset of the data. These streaming FFTs are just that: for as long as we can still process the data in a serial fashion, we process each sample sequentially. Then, we do the last 1-4 (typically) stages in a massive parallel format. Here, the same structure is drawn as in the map_tail=0 fft_direct... but the coefficients now change (specifically, their phases are incrementing). This is where my understanding gets a bit hazy, but it looks like the last stages of the FFT are being literally enumerated here. *If someone wants to chime in, here is the place to do it*.

In any case, you could actually do these mixed streaming/parallel FFTs (which are fft, fft_wideband_real) in a different fashion, by re-casting them as a split-radix FFT (look it up). Doing this is computationally about the same, but saves resources and memory... and is simpler if the size of fft_direct is greater than 2^2.

I hope this helps, Luke (and everyone else)!

--Ryan Monroe
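The map_tail idea described above can be checked numerically with a small model. The sketch below is not the CASPER block itself; it is a plain-Python illustration of a radix-2 decimation-in-time decomposition in which P independent sub-FFTs (standing in for the streaming FFTs) are followed by log2(P) butterfly stages whose twiddle phase advances with the serial index k. The names `dft`, `bit_reverse` and `wideband_fft` are hypothetical, and a naive DFT stands in for each sub-FFT.

```python
import cmath


def dft(x):
    """Naive O(N^2) DFT, used as the reference and as each sub-FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def wideband_fft(x, log2_parallel):
    """N-point FFT from P = 2**log2_parallel sub-FFTs plus tail butterflies."""
    p = 1 << log2_parallel
    # Input stream s carries samples x[s], x[s+P], ... (P samples per clock).
    subs = [dft(x[s::p]) for s in range(p)]
    # Bit-reversed order makes neighbours pair up as even/odd halves (DIT).
    blocks = [subs[bit_reverse(i, log2_parallel)] for i in range(p)]
    while len(blocks) > 1:
        merged = []
        for even, odd in zip(blocks[0::2], blocks[1::2]):
            half = len(even)
            top, bottom = [], []
            for k in range(half):
                # The twiddle phase increments with k, i.e. with each
                # successive serial sample leaving the sub-FFTs.
                w = cmath.exp(-2j * cmath.pi * k / (2 * half))
                top.append(even[k] + w * odd[k])
                bottom.append(even[k] - w * odd[k])
            merged.append(top + bottom)
        blocks = merged
    return blocks[0]
```

Comparing `wideband_fft(x, 2)` against `dft(x)` for a 16-sample input is a quick way to confirm that the tail stages only need the sub-FFT outputs and a twiddle that steps with k.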
Re: [casper] Purpose of FFT-Direct
Hi Ryan, I wrote the various forms of the CASPER FFT, including this one. The broad idea of the architecture was described in: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4840623&tag=1 Basically (as far as I can tell from a brief perusal of split-radix FFTs), I think this *is* a split-radix FFT. The mix of serial and parallel FFTs is used to evaluate a radix-2 Cooley-Tukey FFT that is decomposed into several smaller FFTs that can be computed independently (without inter-communication of samples), followed by a direct FFT that cycles through twiddle coefficients (i.e. it is not truly a stand-alone direct FFT) and does the remaining butterflies, drawing on samples from all the sub-FFTs. Data permutation is a bit of a headache in these architectures, so I invented a permuting buffer that uses basic group theory to automatically generate in-place permuters that do the necessary data reordering. I think you may have been misunderstanding how the architecture worked, and that is why you perhaps thought it was inefficient. The total buffering is only 50% higher than the minimum buffering possible (i.e. only storing each sample once), and the multipliers are all used at 100% efficiency. Higher radices can produce some savings if you are doing more FFTs in parallel, but barring that, I'd be surprised if there is another architecture that substantially outperforms this one (but you are welcome to try! :) ). I'm happy you're documenting. All the best, Aaron -- Aaron Parsons 510-306-4322 Hearst Field Annex B54, UCB
Re: [casper] Purpose of FFT-Direct
Hey Aaron!

My understanding may be imperfect, but I thought that a split-radix FFT would have a bank of phase rotations (one for each input to fft-direct) after the biplex FFTs. If you chose your phase rotation coefficients correctly, you'd be able to finish the larger FFT with a simple fft-direct (map_tail=0). That's the split-radix FFT which I was talking about. It simplifies things (all the coefficient storage goes in one place, routing is reduced, counters and coefficients can be shared more easily, etc.), but I think the multiplier usage ends up the same. The difference would really start to show if you were trying to do, say, a 2^21-point FFT, where you'd do the corner turns in QDR and generate the phase-rotate coefficients. If you had the same coefficient schedule that is used in fft_direct, your FPGA would not be able to hold them all.

Either way, hats off to you in a serious way; I would never have been able to design this madness on my own :-)

Finally, as far as I can tell, your memory utilization is the best that anyone can achieve under the constraint of normal output order (you can do a bit better if you're okay with taking a bit-reversal, though). Ultimately these are all factorizations of the same basic algorithm. If you do a bit of mental gymnastics, I guess it all looks pretty similar.

I have a radix-4 fft_wideband_real which uses 65%-85% as many multipliers and has better coefficient sharing, but as you say, you'll need to be doing many parallel FFTs to take advantage of it (one R4MDC block can eat an entire KATADC's worth of signal!). No improvement to memory utilization though.

*Correction on my last post:* When I said R4DC (radix-4, Delay Commutator), I should have said R4MDC (radix-4, multi-delay commutator), to distinguish it from streaming FFTs which only process one FFT's worth of data at a time.

--Ryan
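The alternative arrangement Ryan describes, a bank of phase rotations after the streaming FFTs followed by a plain fixed-coefficient direct FFT, can also be modelled in a few lines. The sketch below is a numerical illustration only, with hypothetical names (`dft`, `rotated_wideband_fft`) and a naive DFT standing in for both the streaming FFTs and the final direct FFT. It makes no claim about resource usage.

```python
import cmath


def dft(x):
    """Naive O(N^2) DFT, used for the sub-FFTs and the final direct FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def rotated_wideband_fft(x, n_parallel):
    """N-point FFT as sub-FFTs, per-stream phase rotation, then a direct FFT."""
    n = len(x)
    if n % n_parallel:
        raise ValueError("length must be a multiple of n_parallel")
    m = n // n_parallel
    # Stream s carries x[s], x[s+P], ...; each stream gets its own M-point FFT.
    subs = [dft(x[s::n_parallel]) for s in range(n_parallel)]
    out = [0j] * n
    for k in range(m):
        # Bank of phase rotations: one coefficient per stream, stepping with k.
        column = [subs[s][k] * cmath.exp(-2j * cmath.pi * s * k / n)
                  for s in range(n_parallel)]
        # Fixed-coefficient P-point direct FFT across the streams.
        spectrum = dft(column)
        for q in range(n_parallel):
            out[k + q * m] = spectrum[q]   # natural-order output index
    return out
```

In this form all k-dependent coefficients sit in the rotation bank, while the final P-point transform uses the same coefficients on every cycle, which is the simplification discussed in the message above.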
Re: [casper] Purpose of FFT-Direct
My understanding may be imperfect, but I thought that a split-radix FFT would have a bank of phase rotations (one for each input to fft-direct) after the biplex FFTs. If you chose your phase rotation coefficients correctly, you'd be able to finish the larger FFT with a simple fft-direct (map_tail=0). Hm. I think you have a point. I'd missed that if you do one phasing for each biplex stream, you could have the last stages all use the same direct FFT (which would be a true direct FFT). Cute! I can see how in some applications this could be helpful. I'd generally assumed that coefficients weren't that important for memory usage, because of all the sample buffering. However, if lots of that is happening off-chip, I guess maybe you start caring about coefficient storage. Finally, as far as I can read your memory utilization is the best that anyone can achieve under the constraint of normal output order (you can do a bit better if you're okay with taking a bit-reversal tho) Don't say this too loudly around Dan. He always suggests pulling out the bit reversal at the drop of a hat. I think that'd be a nightmare from a system integration perspective, and constantly have to rein him in. :) I have a radix-4 fft_wideband_real which uses 65%-85% as many multipliers and better coefficient sharing, but as you say, you'll need to be doing many parallel FFTs to take advantage of it (one R4MDC block can eat an entire KATADC's worth of signal!). No improvement to memory utilization though. For very wideband FFTs (= 4 samples in parallel), using a single radix-4 biplex core for the set of streaming FFTs could be advantageous... Aaron *correction on my last post: *When I said R4DC (radix-4, Delay Commutator), I should have said R4MDC (radix-4, multi-delay commutator), to distinguish it from streaming FFTs which only process FFT's worth of data at a time. 
--Ryan

On Tue, Mar 12, 2013 at 5:44 PM, Aaron Parsons apars...@astron.berkeley.edu wrote:

Hi Ryan,

I wrote the various forms of the CASPER FFT, including this one. The broad idea of the architecture was described in: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4840623tag=1

Basically (as far as I can tell from a brief perusal of split-radix FFTs), I think this *is* a split-radix FFT. The mix of serial and parallel FFTs is used to evaluate a radix-2 Cooley-Tukey FFT that is decomposed into several smaller FFTs that can be computed independently (without inter-communication of samples), followed by a direct FFT that cycles through twiddle coefficients (i.e. it is not truly a stand-alone direct FFT) and does the remaining butterflies, drawing on samples from all the sub-FFTs. Data permutation is a bit of a headache in these architectures, so I invented a permuting buffer that uses basic group theory to automatically generate in-place permuters that do the necessary data reordering.

I think you may have been misunderstanding how the architecture worked, and that is why you perhaps thought it was inefficient. The total buffering is only 50% higher than the minimum buffering possible (i.e. only storing each sample once), and the multipliers are all used at 100% efficiency. Higher radices can produce some savings if you are doing more FFTs in parallel, but barring that, I'd be surprised if there is another architecture that substantially outperforms this one (but you are welcome to try! :)

I'm happy you're documenting.

All the best,
Aaron

On Tue, Mar 12, 2013 at 3:39 PM, Ryan Monroe ryan.m.mon...@gmail.com wrote:

Hey all,

Luke Madden was asking me about what's going on in the FFT-direct today. I'm pretty sure we have basically zero documentation on this lying around, so it's a good time to fix that. I'm going to share what I know, but I'd appreciate it if other people could add/correct me as needed.
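For readers following the thread, the decomposition discussed above (several independent smaller FFTs, a phase rotation per stream, then a small FFT across the streams) can be checked numerically. The following is only a sketch of the arithmetic, not the CASPER blocks themselves: the names dft and split_fft and the parameter p (number of parallel streams) are made up for illustration, the naive O(N^2) DFT stands in for both the streaming and the direct FFT, and the assignment of sample r, r+p, r+2p, ... to stream r is an assumption. The twiddles are pulled out into a separate bank so that the final p-point transform uses fixed coefficients; folding them into the final stage instead gives the same result.

```python
import cmath


def dft(x):
    """Naive O(N^2) DFT; a stand-in for any FFT core (hypothetical helper)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
        for k in range(n)
    ]


def split_fft(x, p):
    """N-point DFT built from p independent M-point DFTs (N = p * M),
    one phase rotation per stream, and a p-point DFT across the streams."""
    n = len(x)
    m = n // p
    assert p * m == n, "p must divide the transform length"

    # Stream r receives samples r, r + p, r + 2p, ... (assumed lane mapping).
    # The p sub-transforms never exchange samples with each other.
    streams = [dft(x[r::p]) for r in range(p)]

    out = [0j] * n
    for k in range(m):
        # Phase-rotation bank: stream r, channel k is multiplied by W_N^(r*k).
        rotated = [
            streams[r][k] * cmath.exp(-2j * cmath.pi * r * k / n)
            for r in range(p)
        ]
        # Plain p-point DFT across the streams finishes the butterflies.
        tail = dft(rotated)
        for q in range(p):
            out[k + m * q] = tail[q]  # output q of the tail is bin k + M*q
    return out


if __name__ == "__main__":
    x = [complex((3 * t) % 7, (5 * t) % 3) for t in range(16)]
    err = max(abs(a - b) for a, b in zip(split_fft(x, 4), dft(x)))
    print("max error vs. direct DFT:", err)
```

Note that output q of the final transform carries bin k + M*q, so the bins do not emerge in natural order without the kind of reordering the thread mentions.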
Re: [casper] Purpose of FFT-Direct
That makes two of us! Viva la revolution! On 03/12/2013 06:35 PM, Dan Werthimer wrote: it's pretty loud where i'm sitting.
Re: [casper] Purpose of FFT-Direct
ay dios mio

On Tue, Mar 12, 2013 at 6:40 PM, Ryan Monroe ryan.m.mon...@gmail.com wrote:
That makes two of us! Viva la revolution!
On 03/12/2013 06:35 PM, Dan Werthimer wrote:
it's pretty loud where i'm sitting.

--
Aaron Parsons
510-306-4322
Hearst Field Annex B54, UCB