I had gone through and characterized your design looking for anomalies, and also looked at the previous issue you had reported, the one Tristan responded to about ghdl's slow-ish concatenation. The idea was to look for things before going to the effort of profiling.
I found that those 5,204 input vectors (bytes) that weren't finishing loading in any reasonable time take 5-6 seconds with another VHDL simulator (not quite as feature rich as ghdl), and the entire simulation takes 12 minutes and a few seconds (completed successfully, around 784 cycles/second) on the same platform (late 2008 aluminum MacBook). The idea was to see what to expect from ghdl.

As part of the characterization I looked at both versions of the generated design:

    Design            # of "&" operators   Lines of code   stdin port width (top.vhd)
    vhdl5082 (2014)    9,822                9,440           32
    vhdl7L5y (2016)    2,474               16,190            8

While the earlier one doesn't look complete by line count, you still noticeably reduced the number of concatenation operators. I take it that was a result of the repeat function; I didn't really investigate the impact of the different bus sizes during characterization.

The two assignment statements Tristan notes as large contain 974 of those remaining 2,474 concatenation operators between the two of them (an editor that can count them helps).

I performed various easy experiments, like commenting out the write textio stuff, and determined that wasn't responsible for the slowdown nor the memory leakage. Commenting out the two assignments (providing them with a dummy value) gives a much faster simulation - too fast to look for memory leaks, and utterly failing the testbench.

It looks like the concatenation performance problem you originally reported may be responsible in the present model, in a poke-the-kerchief-into-your-fist-and-it-sticks-out-somewhere-else sort of way: you have fewer concatenations, but some big expressions. Code generation changes might alleviate the immediate crisis.
It's possible to get rid of the large number of concatenation operators in those two statements by producing an extend function to match the repeat function:

    signal mux_291 : std_logic_vector(40 downto 0);

    mux_291 <=
        ( repeat(41, sig_1468) and
          -- sig_1642(31) & sig_1642(31) & sig_1642(31) & sig_1642(31) & sig_1642(31) &
          -- sig_1642(31) & sig_1642(31) & sig_1642(31) & sig_1642(31) & sig_1642
          std_logic_vector(resize(signed(sig_1642), 41))  -- equivalent
        ) or
        ( repeat(41, sig_1505) and
          -- huff_make_dhuff_tb_ac_j(31) & huff_make_dhuff_tb_ac_j(31) &
          -- huff_make_dhuff_tb_ac_j(31) & huff_make_dhuff_tb_ac_j(31) &
          -- huff_make_dhuff_tb_ac_j(31) & huff_make_dhuff_tb_ac_j(31) &
          -- huff_make_dhuff_tb_ac_j(31) & huff_make_dhuff_tb_ac_j(31) &
          -- huff_make_dhuff_tb_ac_j(31) & huff_make_dhuff_tb_ac_j
          std_logic_vector(resize(signed(huff_make_dhuff_tb_ac_j), 41))  -- equivalent
        ) or
        ...

A resize-type extend function could take advantage of assumptions: you don't call it when you don't need it, it doesn't clip, and the arrays it deals with are in descending order.

> On 15/03/2016, at 1:43 pm, Brian Drummond <br...@shapes.demon.co.uk> wrote:
>
> On Tue, 2016-03-15 at 00:40 +0100, Adrien Prost-Boucle wrote:
>> Hi,
>>
>> I had another idea.
>>
>> Re-evaluating the entire mux expressions is a heavy task.
>> So would it be possible to re-evaluate only the sub-expressions that
>> have changed? Including, for function calls, only those that have no
>> side-effects ("pure" functions), and whose arguments have changed? If
>> such an optimization is possible, it may improve speed for many other
>> designs.

This seems more akin to mapping in FPGAs during synthesis, where duplicates will fall out. The shortcoming seems to be in the generated code.

> Seems to me an easy way to do this would be to generate a tree of
> sqrt(N) sized muxes (for a tree of depth 2) then you are evaluating 2
> small muxes instead of one huge one.
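For concreteness, here's a minimal sketch of what such an extend function could look like. The name and signature are my invention, chosen to mirror the repeat function's argument order, and it bakes in the assumptions above: the requested width n is never smaller than the input, the input is a descending-range std_logic_vector, and extension is by the sign bit.

    -- Hypothetical sign-extending counterpart to the generated repeat function.
    -- Assumes n >= v'length and a descending-range argument; no clipping.
    function extend(n : natural; v : std_logic_vector) return std_logic_vector is
        -- fill every bit with the sign bit (v'left is the MSB for descending ranges)
        variable result : std_logic_vector(n - 1 downto 0) := (others => v(v'left));
    begin
        -- then overlay the original value in the low bits
        result(v'length - 1 downto 0) := v;
        return result;
    end function;

With that, extend(41, sig_1642) would replace the nine-concatenation sign-extension chain in the commented-out lines above, just as resize does, but without pulling in the numeric_std signed conversions.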
> Might be easy to change this part of the code generator to test the
> idea (for muxes of large N) and tune the heuristics (depth, when to
> start splitting muxes) later.

I didn't receive any of Adrien's posts today either, after switching email accounts for ghdl-discuss. Turns out gmail had them as SPAM; I saw the DKIM stuff and marked them as not SPAM. I can't imagine any possible resolution with my ISP outsourcing to Yahoo. Gmail has strong sender address keying, here's hoping.

sig_1468 and sig_1505 in the above code snippet are individual decoded selects in a big mux, as Brian characterizes it. There are 50 "repeat(41," invocations in the assignment to mux_291 (gotta love TextMate counting things). I'm tempted to reformat and retarget the two assignments (as in using resize or an extend). It'd give a strong indication whether the concatenation operators are responsible.

After having traversed some of these two statements by hand, I had been wondering if something along the lines of longest-static-prefix thinking might also serve. You seem to have some target elements that are solely targets of array expansion to match length. The one in mux_291 expands length from 32 to 41, the other from 32 to 39.

There are also 'bit'-wise combinatorics that could take advantage of separate control steering, by generating processes with if statements instead of using the repeat function and 'bit'-wise combinatorics and evaluating once. It fits in with Brian's lots-of-littler-muxes idea, except the control path is not embedded on a 'bit'-wise basis (while it would still synthesize that way).

    process (...)
    begin
        if sig_1468 = '1' then
            mux_291 <= std_logic_vector(resize(signed(sig_1642), 41));
        elsif sig_1505 = '1' then
            mux_291 <= std_logic_vector(resize(signed(huff_make_dhuff_tb_ac_j), 41));
        elsif ...
        ...
    end process;

Notice you'd evaluate the conditions first, stopping where you find one true, then evaluate one assignment statement. No real need for an extend here.
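For reference, I'm assuming the generated repeat function is something along these lines - a guess on my part, since I haven't quoted the tool's actual definition here:

    -- Presumed shape of the generated repeat: n copies of a single bit,
    -- built with one aggregate rather than a chain of "&" operators.
    function repeat(n : natural; b : std_logic) return std_logic_vector is
        variable result : std_logic_vector(n - 1 downto 0) := (others => b);
    begin
        return result;
    end function;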
I don't think anything in your repeat function should be causing a slowdown; it's using an aggregate, which is a basic operation. I'd really think you're being bitten by concatenation. And of course us armchair quarterbacks haven't seen what you feed the code generator. While we're seeing the results filtered through a ghdl shortcoming, you have to wonder how fast it could go with a bit more sophistication in the code generator (on ghdl or any other simulator). And the big question is: if there's a memory leak, where does it come from?

> On 15/03/2016, at 9:30 am, Tristan Gingold <tging...@free.fr> wrote:
>
> On 09/03/16 21:37, Adrien Prost-Boucle wrote:
>> Hi,
>>
>> I have a VHDL design that GHDL simulates at a speed of only one or 2
>> clock cycles per second. The same design, same VHDL files, is simulated
>> by Xilinx Vivado 2015.3 at a speed of 400 clock cycles per second.
>>
>> For all other designs I have and/or generate, GHDL is always noticeably
>> faster than Vivado, so I think my current app exhibits a GHDL behaviour
>> that at least could be improved.
>>
>> Note that the attached VHDL was generated by a HLS tool so it's not
>> easy to read... and unfortunately I can't reduce the VHDL code to
>> something simpler that exhibits the slowdown.
>>
>> The top-level entity is the testbench entity "tb". The simulation is
>> supposed to stop after around 550k clock cycles.
>
> I had a look at your design.
> More than 95% of the time is spent in two processes of top.vhd:
> -- Behaviour of component 'mux_266' model 'mux'
> and
> -- Behaviour of component 'mux_291' model 'mux'
> They are quite large...
> GHDL could be able to optimize them better, but if your tool
> could simplify them, that would be helpful too!
>
> Tristan.
> _______________________________________________
> Ghdl-discuss mailing list
> Ghdl-discuss@gna.org
> https://mail.gna.org/listinfo/ghdl-discuss