I had gone through and characterized your design looking for anomalies, and also looked at the previous issue you had reported, the one Tristan responded to about ghdl's slow-ish concatenation. The idea was to look for things before going to the effort of profiling.
I found that those 5,204 input vectors (bytes) that weren't finishing loading in any reasonable time take 5-6 seconds with another VHDL simulator (not quite as feature rich as ghdl), and the entire simulation takes 12 minutes and a few seconds (completed successfully, around 784 cycles/second) on the same platform (late 2008 aluminum MacBook). The idea was to see what to expect from ghdl.

As part of the characterization I looked at both versions of the generated design:

    Design            # of "&" operators   Lines of code   stdin port width (top.vhd)
    vhdl5082 (2014)    9,822                9,440           32
    vhdl7L5y (2016)    2,474               16,190            8

While the earlier one doesn't look complete by line count, you still noticeably reduced the number of concatenation operators. I take it that was a result of the repeat function; I didn't really investigate the impact of the different bus sizes during characterization.

The two assignment statements Tristan notes as large contain 974 of those remaining 2,474 concatenation operators between the two of them (an editor that can count them helps).

I performed various easy experiments, like commenting out the write textio stuff, and determined that wasn't responsible for the slowdown nor the memory leakage. Commenting out the two assignments (providing them with a dummy value) gives a much faster simulation - too fast to look for memory leaks, and utterly failing the testbench.

It looks like the concatenation performance problem you originally reported may be responsible in the present model, in a poke-the-kerchief-into-your-fist-and-it-sticks-out-somewhere-else sort of way: you have fewer concatenations, but some big expressions. Code generation changes might alleviate the immediate crisis.
It's possible to get rid of the large number of concatenation operators in those two statements by producing an extend function to match the repeat function:

    signal mux_291 : std_logic_vector(40 downto 0);

    mux_291 <=
        ( repeat(41, sig_1468) and
          -- sig_1642(31) & sig_1642(31) & sig_1642(31) & sig_1642(31) & sig_1642(31) &
          -- sig_1642(31) & sig_1642(31) & sig_1642(31) & sig_1642(31) & sig_1642
          std_logic_vector(resize(signed(sig_1642), 41))  -- equivalent
        ) or
        ( repeat(41, sig_1505) and
          -- huff_make_dhuff_tb_ac_j(31) & huff_make_dhuff_tb_ac_j(31) &
          -- huff_make_dhuff_tb_ac_j(31) & huff_make_dhuff_tb_ac_j(31) &
          -- huff_make_dhuff_tb_ac_j(31) & huff_make_dhuff_tb_ac_j(31) &
          -- huff_make_dhuff_tb_ac_j(31) & huff_make_dhuff_tb_ac_j(31) &
          -- huff_make_dhuff_tb_ac_j(31) & huff_make_dhuff_tb_ac_j
          std_logic_vector(resize(signed(huff_make_dhuff_tb_ac_j), 41))  -- equivalent
        ) or
        ...

A resize-type extend function could take advantage of assumptions: you don't call it when you don't need it, it doesn't clip, and the arrays it deals with are in descending order.

> On 15/03/2016, at 1:43 pm, Brian Drummond <br...@shapes.demon.co.uk> wrote:
>
> On Tue, 2016-03-15 at 00:40 +0100, Adrien Prost-Boucle wrote:
>> Hi,
>>
>> I had another idea.
>>
>> Re-evaluating the entire mux expressions is a heavy task.
>> So would it be possible to re-evaluate only the sub-expressions that
>> have changed? Including, for function calls, only those that have no
>> side-effects ("pure" functions), and whose arguments have changed? If
>> such an optimization is possible, it may improve speed for many other
>> designs.

This seems more akin to mapping in FPGAs during synthesis, where duplicates will fall out. The shortcoming seems to be in the generated code.

> Seems to me an easy way to do this would be to generate a tree of
> sqrt(N) sized muxes (for a tree of depth 2) then you are evaluating 2
> small muxes instead of one huge one.
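For concreteness, here's a minimal sketch of what such an extend function could look like. The name and signature are my invention, chosen to mirror the repeat function's argument order, and it bakes in the assumptions above: the requested width n is never smaller than the input, the input is a descending-range std_logic_vector, and extension is by the sign bit.

    -- Hypothetical sign-extending counterpart to the generated repeat function.
    -- Assumes n >= v'length and a descending-range argument; no clipping.
    function extend(n : natural; v : std_logic_vector) return std_logic_vector is
        -- fill every bit with the sign bit (v'left is the MSB for descending ranges)
        variable result : std_logic_vector(n - 1 downto 0) := (others => v(v'left));
    begin
        -- then overlay the original value in the low bits
        result(v'length - 1 downto 0) := v;
        return result;
    end function;

With that, extend(41, sig_1642) would replace the nine-concatenation sign-extension chain in the commented-out lines above, just as resize does, but without pulling in the numeric_std signed conversions.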
> Might be easy to change this part of the code generator to test the
> idea (for muxes of large N) and tune the heuristics (depth, when to
> start splitting muxes) later.

I didn't receive any of Adrien's posts today either, after switching email accounts for ghdl-discuss. Turns out gmail had them as SPAM; I saw the DKIM stuff and marked them as not SPAM. I can't imagine any possible resolution with my ISP outsourcing to Yahoo. Gmail has strong sender address keying, here's hoping.

sig_1468 and sig_1505 in the above code snippet are individual decoded selects in a big mux, as Brian characterizes it. There are 50 "repeat(41," invocations in the assignment to mux_291 (gotta love TextMate counting things). I'm tempted to reformat and retarget the two assignments (as in using resize or an extend). It'd give a strong indication whether the concatenation operators are responsible.

After having traversed some of these two statements by hand, I had been wondering if something along the lines of longest-static-prefix thinking might also serve. You seem to have some target elements that are solely targets of array expansion to match length. The one in mux_291 expands length from 32 to 41, the other from 32 to 39.

There are also 'bit'-wise combinatorics that could take advantage of separate control steering, by generating processes with if statements instead of using the repeat function and 'bit'-wise combinatorics and evaluating once. It fits in with Brian's lots-of-littler-muxes idea, except the control path is not embedded on a 'bit'-wise basis (while it would still synthesize that way).

    process (...)
    begin
        if sig_1468 = '1' then
            mux_291 <= std_logic_vector(resize(signed(sig_1642), 41));
        elsif sig_1505 = '1' then
            mux_291 <= std_logic_vector(resize(signed(huff_make_dhuff_tb_ac_j), 41));
        elsif ...
        ...
    end process;

Notice you'd evaluate the conditions first, stopping where you find one true, then evaluate one assignment statement. No real need for an extend here.
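For reference, I'm assuming the generated repeat function is something along these lines - a guess on my part, since I haven't quoted the tool's actual definition here:

    -- Presumed shape of the generated repeat: n copies of a single bit,
    -- built with one aggregate rather than a chain of "&" operators.
    function repeat(n : natural; b : std_logic) return std_logic_vector is
        variable result : std_logic_vector(n - 1 downto 0) := (others => b);
    begin
        return result;
    end function;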
I don't think anything in your repeat function should be causing a slowdown; it's using an aggregate, which is a basic operation. I'd really think you're being bitten by concatenation. And of course us armchair quarterbacks haven't seen what you feed the code generator. While we're seeing the results filtered through a ghdl shortcoming, you have to wonder how fast it could go with a bit more sophistication in the code generator (on ghdl or any other simulator). And the big question is: if there's a memory leak, where does it come from?

> On 15/03/2016, at 9:30 am, Tristan Gingold <tging...@free.fr> wrote:
>
> On 09/03/16 21:37, Adrien Prost-Boucle wrote:
>> Hi,
>>
>> I have a VHDL design that GHDL simulates at a speed of only one or 2
>> clock cycles per second. The same design, same VHDL files, is simulated
>> by Xilinx Vivado 2015.3 at a speed of 400 clock cycles per second.
>>
>> For all other designs I have and/or generate, GHDL is always noticeably
>> faster than Vivado, so I think my current app exhibits a GHDL behaviour
>> that at least could be improved.
>>
>> Note that the attached VHDL was generated by a HLS tool so it's not
>> easy to read... and unfortunately I can't reduce the VHDL code to
>> something simpler that exhibits the slowdown.
>>
>> The top-level entity is the testbench entity "tb". The simulation is
>> supposed to stop after around 550k clock cycles.
>
> I had a look at your design.
> More than 95% of the time is spent in two processes of top.vhd:
> -- Behaviour of component 'mux_266' model 'mux'
> and
> -- Behaviour of component 'mux_291' model 'mux'
> They are quite large...
> GHDL could be able to optimize them better, but if your tool
> could simplify them, that would be helpful too!
>
> Tristan.
> _______________________________________________
> Ghdl-discuss mailing list
> Ghdl-discuss@gna.org
> https://mail.gna.org/listinfo/ghdl-discuss