On Wed, Feb 26, 2014 at 07:34:24PM -0500, al davis wrote:
> Very simple .. Identify certain loops that can run in parallel.
> That is really all.
>
> You should look at the output of the "status" command to see
> where the time is spent, which will show where parallelism could
> be of benefit and how much benefit to expect.

the status command says that a lot of time is spent in "advance". advance
apparently just shifts the history for each device. so why does every
device store this individually? is this bound to a condition that is not
implemented, something like 'dormant subcircuits'?

also, "review" takes quite some time. i have started to implement an audio
processor, mostly consisting of behavioural models connecting to jack.
here, the sophisticated time step control is of no use, and a simplified
variant of the tran command runs significantly faster. it reaches realtime
on simple circuits.

> I think the overhead of parallelizing the dot product would be
> too high, thinking of the multi-thread model. The dot product
> might be a candidate for GPU type processing, but look at
> "status" to judge whether there is enough potential benefit
> before doing this.

when i read this article [1], i had the impression that with some hints to
the compiler, the dot product can be computed faster on recent hardware. i
haven't tried. maybe alignment hints help in other places as well?

> No .. that would probably make it slower. The speed gain of a
> better order would be offset by the overhead of ordering and the
> more complex access.

there's already a permutation matrix involved in matrix allocation,
_sim->_nm (u_sim_data). what i do not understand is why the permutation is
computed before iwant_matrix is called on the circuit. i have changed this
(in gnucap-uf [2]). the current use case is weeding out unconnected nodes,
but it is also possible to compute a permutation from the adjacency matrix
(which is constructed in BSMATRIX), from the netlist hierarchy, or from
anything else. i have implemented some simple examples.

> Also .. remember that gnucap does incremental updates and
> partial solutions. The order that is optimal for this is
> different from the ordering optimal for solving the entire
> matrix.

i expect that global bandwidth optimization (amd or rcm) will break the
incremental and partial stuff (i haven't gotten around to trying it). a
local optimizer, tied to the subcircuit hierarchy, might still improve the
order, particularly if the netlist has not been written very carefully.

regards
felix

[1] http://locklessinc.com/articles/vectorize
[2] http://tool.em.cs.uni-frankfurt.de/~felix/gnucap-uf
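
a minimal sketch of the kind of compiler hint meant in the dot product
remark, assuming gcc or clang. dot_aligned and kAlign are hypothetical
names for illustration, not gnucap code. __restrict and
__builtin_assume_aligned are compiler extensions, and whether they buy
anything in gnucap's dot product is untested.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical alignment in bytes (32 matches AVX registers); an assumption,
// not something gnucap guarantees for its matrix storage.
constexpr std::size_t kAlign = 32;

// Hypothetical helper, not part of gnucap: dot product of two arrays that
// the caller promises are kAlign-byte aligned and do not overlap.
double dot_aligned(const double* __restrict a,
                   const double* __restrict b,
                   std::size_t n)
{
  // Check the promise in debug builds; violating it is undefined behaviour.
  assert(reinterpret_cast<std::uintptr_t>(a) % kAlign == 0);
  assert(reinterpret_cast<std::uintptr_t>(b) % kAlign == 0);

  // Tell the optimizer about the alignment so it can use aligned loads.
  const double* pa =
      static_cast<const double*>(__builtin_assume_aligned(a, kAlign));
  const double* pb =
      static_cast<const double*>(__builtin_assume_aligned(b, kAlign));

  double sum = 0.;
  for (std::size_t i = 0; i < n; ++i) {
    sum += pa[i] * pb[i];
  }
  return sum;
}
```

the caller has to guarantee the alignment, for example with alignas(32) on
the arrays. note that gcc typically does not vectorize a floating-point sum
like this unless it is allowed to reorder the additions (-ffast-math or
-fassociative-math), because that changes the rounding.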
