On 2 September 2015 at 02:58, Michael D'Cruze <
michael.dcr...@postgrad.manchester.ac.uk> wrote:

> Hi Jack,
>
>
>
> Thanks for your suggestions! It’s taken me a little while to work through
> it all.
>
>
>
> I did manage to get a 2-pol, 32k channel design to compile eventually. I
> think it was a combination of a) sufficient latency in the Vacc and PFB
> blocks (as you suggest) and b) splitting the two pols into two separate
> chains, rather than using one block for the PFB and FFT in biplex mode. It
> was found by just iterating through latency combinations until the timing
> started to converge, but there were still several hundred errors until I
> split up the design into two chains. Now, I don’t really understand why
> using two separate PFB/FFT blocks should make a difference (surely sysgen
> sorts such detail?), but the latency settings were:
>

Two separate blocks have two main implications. The first is that
coefficients are duplicated for each block, rather than being stored in one
place and fanned out. Sharing a single copy is good for ram savings, but in
general bad for timing. The blocks have fanout latency and max fanout
settings to try and solve those timing problems.
The second implication is that the bram delays your signals go through are
separated for your two streams. I.e., the delay buffers are half the size
but you have twice as many of them. Where your delays are short (e.g. in
FIRs with fewer than ~2k channels, depending on how many simultaneous
samples per FPGA clock you get and on the specific implementation) a single
large buffer is better than two buffers of half the size, since any buffer
will always use at least 1 bram. If you're making loads of channels, all
the buffers are many brams in size. In this case whether or not you split
them up doesn't make any difference, bram-usage-wise, since a large buffer
just uses twice as many brams as a buffer of half the size. *However*
separate buffers of smaller sizes should be easier to run fast, which may
be a contributing factor to your design working better with separate
blocks...
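
As a rough illustration of the bram-count argument above, here is a minimal
Python sketch. It assumes a simplified capacity model in which a bram is just
36 bits x 1k words of storage that can be reshaped freely, and it ignores
coefficient storage, port-width limits and anything the mapper does. The
function names and the 18-bit sample width are hypothetical, chosen only for
the example.

```python
import math

# Assumed capacity of one bram: 36 bits wide x 1k deep (simplified model).
BRAM_BITS = 36 * 1024


def brams_for_bits(n_bits):
    """Estimate brams for a buffer; any buffer uses at least one bram."""
    return max(1, math.ceil(n_bits / BRAM_BITS))


def shared_vs_split(depth_words, width_bits_per_pol):
    """Return (shared, split) bram counts for a two-pol delay buffer.

    shared: one buffer holding both pols (twice the bits).
    split:  two buffers, each holding one pol.
    """
    bits_per_pol = depth_words * width_bits_per_pol
    shared = brams_for_bits(2 * bits_per_pol)
    split = 2 * brams_for_bits(bits_per_pol)
    return shared, split


if __name__ == "__main__":
    # Hypothetical 18-bit samples at a few delay depths.
    for depth in (256, 1024, 16384):
        shared, split = shared_vs_split(depth, 18)
        print(f"depth={depth}: shared={shared} brams, split={split} brams")
```

Under this model the short delays cost one extra bram when split, while the
deep delay costs the same either way, so any benefit from splitting a large
design would have to come from timing rather than from bram usage.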



>
>
> PFB: Add Latency 1; Mult latency 2; BRAM latency 3; Fanout latency 1;
> Convert latency 1 (4 taps).
>
> Vacc: Add latency 2; BRAM latency 6 (overkill?); Mux latency 0,
>
>
>

BRAM latency up to 3 certainly helps. I would have thought anything above 4
just turns into a bram with latency 3 plus a shift register, but you'd
have to dig into the mapped design to see. In any case, if it works, it
works!


> where DSP48 has been used wherever possible in the design.
>
>
>
> Now I need to see if it still works if I use more FIR taps ;-)
>

I admire your boldness.

Jack

PS.
If your design is on github, it might be a nice reference for people trying
to build models with relatively large numbers of channels....


>
>
> Best wishes
>
> Michael
>
>
>
> *From:* Jack Hickish [mailto:jackhick...@gmail.com]
> *Sent:* 30 August 2015 21:57
> *To:* Michael D'Cruze
> *Cc:* casper@lists.berkeley.edu
> *Subject:* Re: [casper] Mystery timing errors
>
>
>
> Hi Michael,
>
> As one of the Xilinx timing closure documents helpfully articulates, it's
> very difficult to give specific recipes for solving timing problems, other
> than reading the timing report, looking at things in PlanAhead/FPGA Editor,
> iterating compiles and developing some intuition.
>
> That said, there are a few things you could try, of varying levels of
> complexity --
> Since you have quite a lot of channels in your design, I suspect your vacc
> problems are related to the size of the bram buffer (especially if your
> samples are wide). Keep in mind that a bram in a Virtex6 is a 36 bit wide x
> 1k deep memory block, so building a vacc to accommodate many thousands of
> channels means banking together lots of brams -- this is liable to cause
> fanout issues on your address/control signals. The same is probably true of
> the large delays in the PFB blocks, though I'm not sure how these are
> implemented in the various block versions which exist in mlib_devel.
>
> Setting brams to optimize for speed and giving them at least 3 cycles
> latency may help. You could also try manually controlling bram control
> signals -- I believe the casper_library_bus/bus_single_port_ram block will
> do some of this for you if you set to optimize for speed and include some
> fanout latency. If you replace the bram delay in the vacc with one of these
> (and associated address counter logic to make the block act as a delay)
> then perhaps this will help. Be careful to get your latencies right so the
> block still functions properly.
> You might also find that implementing the adder in a DSP slice (if you
> haven't already) might help.
>
> A more involved option is to go about manually constraining the placement
> of components. You might find setting up pblocks in PlanAhead to place
> major parts of your design might free up the compiler to do a little better
> with the remainder of the logic. Adding pblocks for the pfb, and the
> various FFT stages can be very helpful in this regard. Or perhaps just
> constraining the vacc that's causing you problems might be enough.
> Once you have some constraints from planahead which work, you can
> auto-include them in your simulink compiles using various methods, my
> favourite being the 'UCF' yellow block, which is in the current
> casper-astro mlib_devel.
>
> In general, I suspect that you would be well served by trying to reduce
> your bram use in your model - the resource utilisation report can probably
> either confirm or contest that this is the case. You could do this by
> trying to use alternatives to brams where blocks allow it (for example the
> FFT will allow coefficients to be generated using DSPs in some cases), or
> by reducing the number of FIR taps, or implementing something using QDR
> rather than bram if appropriate.
>
> The unfortunate truth is that some, none or all of these suggestions may
> help, or might make things worse. I would say that in my experience
> overzealous use of pipeline/register stages throughout a design can
> often do more harm than good, and a sensibly placed PFB can result in
> incredible timing improvements.
>
> Hope that helps, please email back with any more info and/or updates -- I
> suspect if you solve this problem documenting your method may prove very
> useful for others encountering similar problems.
>
> Cheers, and good luck!
> Jack
>
> On 30 August 2015 at 10:42, Michael D'Cruze <
> michael.dcr...@postgrad.manchester.ac.uk> wrote:
> >
> > Hi everyone,
> >
> >
> >
> > I’m having quite a lot of problems getting a particular design to
> > compile, with XPS consistently reporting timing errors. The design is a
> > wideband spectrometer, somewhat similar to the tutorial 3 design. I’m
> > running an iADC at 1024 MHz, and the FPGA at 256 MHz. It is a
> > two-polarisation design, with a 64k-point PFB and FFT (for 32k channels)
> > in each polarisation. I am running the PFB and FFT blocks in
> > “two-polarisation” mode, so there is just one of each block. I am using
> > the Vacc block from the xBlocks repository.
> >
> >
> >
> > I should add at this point that the same design with 16k channels
> > compiles successfully at this speed, and that the 32k channel design
> > compiles at slower speeds (e.g. 200 MHz FPGA).
> >
> >
> >
> > The timing reports suggest negative slack in the PFB and Vacc blocks.
> > I’ve tried adding various combinations of latency in each; however, no
> > permutations have resulted in substantial improvements. The overwhelming
> > majority of the errors are reported by the Vacc block. I have also tried
> > replacing the xBlocks Vacc block with the wide_bram_vacc block in 64-bit
> > mode; however, this results in even more timing errors (and without the
> > option to adjust latencies).
> >
> >
> >
> > I’ve tried various combinations of latencies suggested previously on the
> > mail archive, but again nothing has really given any improvement.
> >
> >
> >
> > Am I simply pushing such a large design too hard? Is the only option to
> > slow it down or is there some strategic method I can adopt to get closer
> > to timing closure? Suggestions much appreciated!
> >
> >
> >
> > Thanks
> >
> > Michael
>
