On Tuesday 04 March 2014, beranger six wrote:
> Regarding CUDA licensing, I referred to the FAQ
> #SystemLibraryException pointing to section 1 of the
> GNU GPL-3.
> Based on that, I assume we could distribute the software with
> CUDA code in it, but without the actual library.
For us to distribute it, it must be licensed GPL-3. To include the code, the license must allow us to relicense it GPL-3. Having said that ... sometimes it makes sense to write trial code you don't expect to keep, as a learning experience.

> Using our projects (big system with low accuracy), a lot of
> time is spent inside lu_decomp().
>
>>> In the LU decomposition, running the outermost loop in
>>> parallel should be all that is needed there. But to get
>>> enough benefit, model evaluation also should be parallel,
>>> and is likely more important.
>
> I think that we can not parallelize the outermost loop,
> because we noticed that most elements are dependent on
> previous computations. (A block is dependent on all blocks
> between its lownode (bn) and its row value (mm).)

25 years later ... I look at the code again ....

Yes .. the inner loop uses previously calculated values in L and U. But it is in blocks. It is a bordered-block-diagonal form. You need to solve each block as a unit, but blocks can be solved in parallel with other blocks, and all blocks must be solved before the border can be solved. The border is the set of connections between blocks.

I considered moving the block decomposition into the subckt code, perhaps doing the decomposition as part of the load pass. Then the final LU would only need to do the main border. Block loading can be done in parallel, but I question how effective that would be because of the non-parallel hardware data transfer path.

This move makes parallelizing matrix blocks easy, but that is not the biggest advantage. A much bigger advantage is that it makes it easier to skip whole blocks when doing partial solutions. Another big advantage is that when there are multiple instantiations of linear or quasi-linear subckt blocks, it becomes possible to solve the block once and reuse the decomposed blocks.

> To bypass this issue, we thought about using a dependence
> tree.
> This tree would be created when we get the system, and
> updated each time the lownodes change, so that maintaining
> the dependence tree would not be time consuming. Please
> refer to the PDF attached to this mail explaining it with
> some figures. Please keep in mind this document is a draft
> for now.

The 'lownode' array is a dependence tree. It is set up once when the matrix is allocated, then remains constant.

The list filters do not pass attachments. Can you post a link? If you need a place to upload it, I can set up an account for you on the wiki and you can upload it there.

> You are right, my tests show that the number of data
> transfers between host (CPU) and device (GPU) makes it not
> suitable for our kind of application. It still seems like a
> good way for very large systems, though we need to do some
> testing to determine the matrix size at which this becomes
> interesting.

For math, the CPU is speed limited by memory access. The operations add, subtract, and multiply all take essentially no time, because the time spent transferring between memory and CPU registers dominates, and it overlaps the actual calculation.

Since we are discussing speed ... There are two reasons the matrix solver is not a plugin. The first is speed. Plugins use the virtual function mechanism, which for speed is equivalent to a real function call. Having the access functions in the header file, not a plugin, allows inline code generation without function call overhead. It makes about a 2:1 difference in run speed for the load step.

The other reason is template generation. The matrix is a template class. Gnucap already uses double and complex in the release version. I have done experiments using other types, including polynomials. This would not be possible with the plugin mechanism.

On the pdf ....

That particular matrix structure is unlikely. It shows two circuit blocks with no connection to each other.

One paper referenced states the run time of NGSpice as O(n^1.2), or something like that. My tests show it as O(n^2).
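As an aside on the template point above: a minimal sketch (hypothetical code, not gnucap's actual matrix class) of why a template matrix with inline accessors can't go through a plugin boundary. The same source instantiates for double (DC) and complex (AC); a virtual accessor in a plugin would block both the inlining and the per-type instantiation.

```cpp
// Hypothetical minimal template solver, for illustration only.
// One source, instantiated per numeric type, accessors inlined.
#include <complex>
#include <cstddef>
#include <vector>

template <class T>
class DenseLU {
  std::size_t _n;
  std::vector<T> _a;          // row-major dense storage
public:
  explicit DenseLU(std::size_t n) : _n(n), _a(n * n) {}
  // Inline accessor: through a plugin this would be a virtual
  // call, with real-function-call overhead on every access.
  T& m(std::size_t i, std::size_t j) { return _a[i * _n + j]; }
  // In-place LU (Doolittle, no pivoting) over the whole matrix.
  void lu() {
    for (std::size_t k = 0; k < _n; ++k)
      for (std::size_t i = k + 1; i < _n; ++i) {
        m(i, k) /= m(k, k);                  // L factor
        for (std::size_t j = k + 1; j < _n; ++j)
          m(i, j) -= m(i, k) * m(k, j);      // U update
      }
  }
};

// The same code serves real and complex analyses:
template class DenseLU<double>;
template class DenseLU<std::complex<double>>;
```

A plugin interface would have to pick one concrete type and one virtual signature up front; the template keeps the type open, which is what made the polynomial experiments possible.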
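Returning to the bordered-block-diagonal point earlier in this message: since the diagonal blocks are mutually independent, each can be factored on its own thread, with the join acting as the barrier before the border solve. A sketch (hypothetical layout and names, not gnucap code):

```cpp
// Sketch: factor independent diagonal blocks of a
// bordered-block-diagonal (BBD) matrix in parallel.
#include <cstddef>
#include <thread>
#include <vector>

// In-place LU (Doolittle, no pivoting) of one dense n x n block,
// stored row-major in a flat vector.
void lu_block(std::vector<double>& a, std::size_t n) {
  for (std::size_t k = 0; k < n; ++k)
    for (std::size_t i = k + 1; i < n; ++i) {
      a[i*n + k] /= a[k*n + k];                 // L factor
      for (std::size_t j = k + 1; j < n; ++j)
        a[i*n + j] -= a[i*n + k] * a[k*n + j];  // U update
    }
}

// The diagonal blocks do not depend on each other, only the
// border depends on all of them -- so factor them concurrently.
void factor_blocks_parallel(std::vector<std::vector<double>>& blocks,
                            const std::vector<std::size_t>& sizes) {
  std::vector<std::thread> workers;
  for (std::size_t b = 0; b < blocks.size(); ++b)
    workers.emplace_back(lu_block, std::ref(blocks[b]), sizes[b]);
  for (auto& w : workers) w.join();  // barrier: border solve comes after
}
```

The border elimination would follow the join, using the factored blocks; reusing a decomposed block across multiple instantiations of the same linear subckt would just mean pointing several entries of `blocks` at one shared factorization.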
Gnucap is somewhere between O(n) and O(n log(n)) for the same set of circuits. I believe that if the authors of that paper didn't see spice at O(n^2), they used circuits that are too small.

It depends on node ordering. I know that the gnucap ordering is not optimal, particularly when fed big scrambled flat circuits. Working on it can be very frustrating, particularly if you don't understand why the patterns come out as they do.

_______________________________________________
Gnucap-devel mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/gnucap-devel
