For long FFTs, you could also use two BRAM18s (as lookup tables) and two complex multiplies (3 DSPs each on V6, 4 DSPs each on V5) to get a coefficient with 17 bits of accuracy and enough resolution for a 2^19-point FFT.
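
One common way to build a long twiddle table out of small lookup tables is to split the index into a coarse part and a fine part and multiply the two looked-up phasors. The sketch below is only an illustration of that idea, not the design described above: the 10/9 bit split, the names `coarse`, `fine` and `twiddle`, and the use of floating point instead of 18-bit fixed point are all assumptions, and it uses one complex multiply rather than two.

```python
import cmath
import math

N = 1 << 19        # FFT length mentioned in the message
COARSE_BITS = 10   # assumed split: 1024-entry coarse table
FINE_BITS = 9      # assumed split: 512-entry fine table

# coarse[i] covers steps of 2**FINE_BITS points; fine[k] covers single points.
coarse = [cmath.exp(-2j * math.pi * (i << FINE_BITS) / N)
          for i in range(1 << COARSE_BITS)]
fine = [cmath.exp(-2j * math.pi * k / N) for k in range(1 << FINE_BITS)]

def twiddle(n):
    """Return exp(-2*pi*j*n/N) from two table lookups and one complex multiply."""
    n %= N
    hi = n >> FINE_BITS                # upper index bits select the coarse phasor
    lo = n & ((1 << FINE_BITS) - 1)    # lower index bits select the fine phasor
    return coarse[hi] * fine[lo]
```

The identity used is exp(j(a+b)) = exp(ja) * exp(jb), so the two tables together address all 2^19 angles while storing only 1024 + 512 entries.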

On 01/24/2013 05:49 PM, Dan Werthimer wrote:

hi ryan, andrew,

we used to use CORDIC for generating coefficients.
not sure how cordic compares to goertzel.
there are a few open source VHDL cordics.

i think dave macmahon or someone developed
a radix4 version of the casper streaming FFT.

dan


On Thu, Jan 24, 2013 at 5:44 PM, Ryan Monroe <ryan.m.mon...@gmail.com> wrote:

    Hey Andrew, thanks for the designs! I'll have to spend some time
    looking them over later, there's some good stuff there.

    Nice idea (I think the Goertzel algorithm is often used with this
    technique?). I have considered this for the DDC, it allows almost
    arbitrary frequency and phase resolution. The only cost is a fair
    amount of multipliers. For most applications at the moment we are
    BRAM limited so this is not a problem (the very wide bandwidth
    instruments might be multiplier limited at some point). It would be
    good as an option to trade off multipliers for BRAM.

    I haven't seen the Goertzel algorithm before, but it looks like a
    great idea for this: we might be able to produce a coefficient DDS
    in just two DSPs!
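
For reference, the recurrence behind Goertzel-style oscillators can be sketched as below. This is a floating-point illustration only; the function name `resonator` is made up here, and running one recurrence each for cos and sin (one multiply apiece) is just one possible reading of the "two DSPs" remark, not a verified resource count.

```python
import math

def resonator(omega, count):
    """Generate (cos(omega*n), sin(omega*n)) for n = 0..count-1 using the
    second-order recurrence y[n] = 2*cos(omega)*y[n-1] - y[n-2]."""
    k = 2.0 * math.cos(omega)            # the only stored constant
    c_prev, c = math.cos(-omega), 1.0    # cos at n = -1 and n = 0
    s_prev, s = math.sin(-omega), 0.0    # sin at n = -1 and n = 0
    out = []
    for _ in range(count):
        out.append((c, s))
        c_prev, c = c, k * c - c_prev    # one multiply per new cos sample
        s_prev, s = s, k * s - s_prev    # one multiply per new sin sample
    return out
```

Because each output feeds the next, rounding error accumulates over time, so a fixed-point version would probably need to be re-seeded periodically (for example once per FFT frame).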

    For my applications, I'm *totally* DSP limited, but I agree that
    we should try to cater to the greater CASPER community of course.

    Coefficient reuse (as you describe between phases) would be nice (at
    the cost of some register stages I guess).

    The CASPER libraries *hemorrhage* pipeline stages.  A few more
    won't hurt, and you'll be saving the RAM addressing logic.  Not so
    bad.

    I think the reuse of control logic, coefficients etc would potentially
    be the biggest saver assuming wide bandwidth systems. Ideally the
    compiler would do this for us implicitly, but in the meantime explicit
    reuse with optional register stages to reduce fanout would be awesome.

    You can change a setting on pipeline registers (and maybe other
    places too) which allows it to do this.  It's called "Implement
    using behavioral HDL" in simulink, or "allow_register_retiming" in
    the xBlock interface.  I had a bad experience with it though:
    It'll try to optimize EVERYTHING.  Got two identical registers
    which you intend to place on opposite sides of the chip? They're
    now the same register.  In my experience, the only good way to
    control the sharing (or lack thereof) was to do it manually..... YMMV.

    I've got another idea we can consider too.  This one is farther
    away.  I'm building radix-4 versions of my FFTs (1/2 as much
    fabric, 85% as much DSP and 100% as much coeff).  Now, for radix
    4, you get three coefficient banks per butterfly stage, and while
    the sum total (# coefficients stored) is the same, the
    coefficients are actually in trios of (x^1; x^2; x^3 and an
    implicit x^0).  You could, in principle, store just the x^1 and
    square/cube it into x^2 and x^3.  I haven't tried this (just
    thought of it), so no idea regarding performance.  In addition,
    while Dan and I are working with JPL legal to get my library
    open-sourced, it's looking pretty clear that I won't be able to
    share the really new stuff, so you'd have to do radix-4 on your
    own :-(
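
A rough way to explore the "store x^1, derive x^2 and x^3" idea numerically is sketched below. It is untested against any real radix-4 core: the 18-bit coefficient width, the rounding after each multiply, and the names `quantize` and `twiddle_trio` are assumptions for illustration.

```python
import cmath
import math

def quantize(z, bits=18):
    """Round real and imaginary parts to signed fixed point with `bits` bits."""
    scale = (1 << (bits - 1)) - 1
    return complex(round(z.real * scale), round(z.imag * scale)) / scale

def twiddle_trio(k, n, bits=18):
    """Return (x^1, x^2, x^3) for x = exp(-2*pi*j*k/n), storing only x^1."""
    w1 = quantize(cmath.exp(-2j * math.pi * k / n), bits)  # the stored value
    w2 = quantize(w1 * w1, bits)   # square it to get x^2
    w3 = quantize(w2 * w1, bits)   # one more multiply to get x^3
    return w1, w2, w3
```

The rounding error in the stored x^1 is carried into x^2 and x^3, so the derived coefficients are somewhat noisier than directly stored ones; comparing them against `cmath.exp` gives a quick feel for how much.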

    --Ryan

    On 01/22/2013 04:41 AM, Andrew Martens wrote:

        Hi all


            It would work well for the PFB, but what we *really* need is a
            solid "Direct Digital Synth (DDS) coefficient generator".
            ...

        Nice idea (I think the Goertzel algorithm is often used with this
        technique?). I have considered this for the DDC, it allows almost
        arbitrary frequency and phase resolution. The only cost is a fair
        amount of multipliers. For most applications at the moment we are
        BRAM limited so this is not a problem (the very wide bandwidth
        instruments might be multiplier limited at some point). It would be
        good as an option to trade off multipliers for BRAM.

        Coefficient reuse (as you describe between phases) would be nice
        (at the cost of some register stages I guess).

        I think the reuse of control logic, coefficients etc would
        potentially be the biggest saver assuming wide bandwidth systems.
        Ideally the compiler would do this for us implicitly, but in the
        meantime explicit reuse with optional register stages to reduce
        fanout would be awesome.

            PS, Sorry, I'm a bit busy right now so I can't implement a
            coefficient interpolator for you guys right now. I'll write
            back when I'm more free

        Got a bit carried away and implemented one. Attached is a model
        that allows the comparison between ideal, interpolator, and Dan's
        reduced storage idea. The interpolator uses a multiplier, cruder
        versions might not at the cost of noise and/or more logic.

            PS2.  I'm a bit anal about noise performance so I usually use
            a couple more bits than Dan prescribes, but as he demonstrated
            in the ASIC talks, his comments about bit widths are 100%
            correct.   I would recommend them as a general design practice
            as well.

        I have also seen papers that show that FFT performance is more
        dependent on data path bit width than coefficient bit width. We
        need a proper study on how many bits are required for different
        performance levels.

                but for long transforms, perhaps >4K points or so, then
                BRAMs might be in short supply, and then one could
                consider storing fewer coefficients (and also taking
                advantage of sin/cos and mirror symmetries, which don't
                degrade SNR at all).

        Did some work a while back. Attached is a model
        (sincos_upgrade.mdl) that implements BRAM saving in different ways
        when generating FFT twiddle factors (or DDC coefficients):

        1. For very small numbers of coefficients, store them in the same
        word (can output up to 36 bits from a BRAM so can store 18 bit sin
        and cos values next to each other in the same word) so that we use
        1 instead of (current) 2 BRAMs. (see sincos_single_bram in the
        design)

        2. Store only a quarter of a sinusoid and generate the complex
        exponential via clever address generation and inversion of the
        output. This uses 1 BRAM instead of (current, assuming a 'large'
        FFT) 8 at the cost of logic (and multipliers) (see sincos_min_ram
        in the design)

        3. Store half a sinusoid and generate the complex exponential via
        clever address generation. Uses 1 BRAM instead of the (current,
        assuming a 'large' FFT) 4 at the cost of some logic. (see
        sincos_med_ram in the design).
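
The quarter-sinusoid scheme in item 2 can be sketched roughly as follows. This is not taken from sincos_upgrade.mdl: the table length, the extra end-point entry (N/4 + 1 values), floating-point storage, and the names `QUARTER` and `sincos` are assumptions made to keep the example short.

```python
import math

N = 4096      # assumed number of points per full cycle
Q = N // 4    # points per quadrant

# Only the first quarter of the sine wave is stored (end point included).
QUARTER = [math.sin(2 * math.pi * i / N) for i in range(Q + 1)]

def sincos(n):
    """Return (sin, cos) of 2*pi*n/N using only the quarter-wave table."""
    quadrant, r = divmod(n % N, Q)
    a = QUARTER[r]        # sin(phi): forward address
    b = QUARTER[Q - r]    # cos(phi): mirrored address
    if quadrant == 0:
        return (a, b)
    if quadrant == 1:
        return (b, -a)
    if quadrant == 2:
        return (-a, -b)
    return (-b, a)
```

The top two index bits pick the quadrant, which decides whether the address is mirrored and whether the output is negated; the remaining bits address the table.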

        The interpolator could be integrated into these to use even less
        BRAM.

        I will upgrade the library at some point this year to include these
        (and the interpolator).

            I did some tests this morning with a simple moving average
            filter to turn 256 BRAM coefficients into 1024 (see attached
            model file), and it looks pretty promising: errors are a max
            of about 2.5%.
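
The attached model is not reproduced in this thread, so the following is only a guess at the general approach: hold each of the 256 stored values for four samples, then run a length-4 moving average over the result. The names `table` and `interpolate`, the circular wrap-around, and the use of floating point are assumptions, and no attempt is made to reproduce the error figure quoted above.

```python
import math

COARSE = 256   # stored coefficients
FACTOR = 4     # 256 -> 1024

table = [math.sin(2 * math.pi * k / COARSE) for k in range(COARSE)]

def interpolate(values, factor):
    """Sample-and-hold upsampling followed by a circular moving average."""
    held = [v for v in values for _ in range(factor)]   # repeat each value
    n = len(held)
    # Average the current sample and the previous factor-1 samples.
    return [sum(held[(i - j) % n] for j in range(factor)) / factor
            for i in range(n)]
```

Hold plus a boxcar of the same length works out to linear interpolation between neighbouring stored values, delayed by `factor - 1` output samples, so any comparison against an ideal sinusoid needs to account for that delay.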

        Could you send me this file? I would like to see how you did your
        interpolation.

        Regards
        Andrew








