Thanks for the mini example and timings; very instructive. By making CMat an object encapsulated in CachedCMat, you avoided the 'self destruct' problem: the self object is doomed once it reaches __dealloc__, but the inner CMat is not.

The only annoying thing about that approach is that one either has to carry the CachedCMat around along with the CMat object while it's in use, or rewrite the CMat class itself to do the extra indirection in operations such as add, multiply, QR, etc.

The only Python object in my cmat class is an ndarray, so just copying a reference to that object, plus a couple of C pointers, into a container class stored in the hash presumably shouldn't be that bad, and accomplishes nearly the same thing. It does add the overhead of creating the container class whenever a cmat gets hashed, but my Cython timings suggest that's not such a big deal.
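Roughly what I have in mind (just a sketch; _CacheEntry, _free_cache, and the block/view pointer names are placeholders, not my actual code):

cdef class _CacheEntry:
    # holds what must outlive the cmat: the ndarray reference (which
    # keeps the buffer alive) plus the raw C pointers
    cdef object arr
    cdef void* block
    cdef void* view

_free_cache = {}   # shape tuple -> _CacheEntry

cdef class cmat:
    cdef object arr
    cdef void* block
    cdef void* view

    def __dealloc__(self):
        # self is doomed here, but the entry is a fresh, independent
        # object: copying the reference bumps arr's refcount, so the
        # underlying memory survives for reuse.
        # Note: one slot per shape, so a later entry evicts the
        # previous one, which is where my misses come from.
        cdef _CacheEntry e = _CacheEntry()
        e.arr = self.arr
        e.block = self.block
        e.view = self.view
        _free_cache[self.arr.shape] = e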

By the way, I did some more experiments and found a cache miss rate of 38%, which is still too high. I either need to cache multiple matrices of the same size or move to a different memory management scheme. I've used a circular buffer for temp matrices somewhat successfully in the past, but you have to size the buffer carefully to avoid overwriting live objects.
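If I go the multiple-matrices-per-size route, I'm picturing a small free list per shape, something like this sketch (made-up names, arbitrary bound):

_MAX_PER_SHAPE = 4        # tuning knob: how many spares to keep per shape
_free_lists = {}          # shape tuple -> list of cached entries

def _release(shape, entry):
    # called from __dealloc__: keep up to _MAX_PER_SHAPE spares alive,
    # otherwise drop the entry and let vsipl free the memory for real
    lst = _free_lists.setdefault(shape, [])
    if len(lst) < _MAX_PER_SHAPE:
        lst.append(entry)

def _acquire(shape):
    # called from __cinit__: reuse a spare if one exists (cache hit),
    # return None on a miss so the caller does the expensive cblockbind()
    lst = _free_lists.get(shape)
    if lst:
        return lst.pop()
    return None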

I'm not certain, but it may be that Cython 0.12 runs the code a bit faster as well. More tests on Monday, I guess.

-Matt

On 1/8/2010 2:07 PM, Robert Bradshaw wrote:
On Jan 7, 2010, at 2:04 PM, Matthew wrote:

OK, I'm going to give the __new__ hack a try from
http://trac.cython.org/cython_trac/ticket/238
I don't really need to overload __new__, do I, so I don't have to change matrix.pxd?

Sorry, this is the ticket I meant to refer to: http://trac.cython.org/cython_trac/ticket/443, though that one takes no arguments, so it may not apply to you.


The vsipl vendor tells me that the only really expensive operation is the cblockbind() within __cinit__(). However, the __dealloc__() routine is very expensive as well, given the number of times it's being called. It would be nice if I could profile on a line-by-line basis; I'm not sure whether the Python cProfile tool supports this or not.

It does, but we don't have that implemented in Cython yet. Given that it's a deterministic rather than (external) probabilistic profiler, the profiling itself may significantly impact the speed and results. Try commenting stuff out, or factoring it into an (inline) function.
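For example, something along these lines; the extern declaration is just a stand-in, since the real cblockbind signature lives in the vsipl headers:

cdef extern from "vsip.h":
    void* cblockbind(void* data)   # stand-in signature, see vsipl docs

cdef inline void* bind_block(void* data):
    # the suspect call hoisted into its own function, so it can be
    # commented out at the single call site, or timed on its own
    return cblockbind(data)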

I can't just chalk this result up to the vsipl code, since the hash routine is not giving me any performance gain and seems to be making things worse. (Though I probably need to do some more debugging to see whether I have a lot of cache misses, or some bug in my logic.)

Well, maybe hashing is slightly more expensive than the vsipl call. On a completely unrelated note, getting data to/from a GPU can be a bottleneck as well, and due to its asynchronous nature it may not show up as obviously in the main CPU profiling results.

For the life of me I could not figure out how to just put the matrix object itself into my hash-indexed memory cache. It seemed like my Python objects were always being garbage collected once I hit the __dealloc__ routine (the self.arr ndarray, for example). Later I found out that the Cython class gets stripped of its attributes if it's stored in a dictionary. Only those attributes written to the class's internal dictionary in the __init__() method seem to get saved, as far as I can tell from my experiments. Of course, I'd actually like to avoid calling __init__(). I really didn't intend to learn the internals of Python, or Cython for that matter, but I do need to figure out how to optimize this code.

I think you're trying to make things way more complicated than necessary. The easiest approach is to only expose wrapper classes, and cache the expensive initialization in an internal class. See

http://sage.math.washington.edu/home/robertwb/cython/mat.html
http://sage.math.washington.edu/home/robertwb/cython/mat.pyx
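In outline, the pattern is roughly this (a stripped-down sketch, not the actual mat.pyx; the internals here are placeholders):

import numpy as np

cdef class CMat:
    # the expensive object: owns the array and any C-level setup
    cdef object arr
    def __cinit__(self, shape):
        self.arr = np.empty(shape)   # stands in for the costly init

_pool = {}   # shape tuple -> list of spare CMats

cdef class CachedCMat:
    # the cheap wrapper that user code actually sees
    cdef CMat mat
    cdef object shape
    def __cinit__(self, shape):
        self.shape = shape
        spares = _pool.get(shape)
        if spares:
            self.mat = spares.pop()    # cache hit: reuse the spare
        else:
            self.mat = CMat(shape)     # cache miss: pay the full price
    def __dealloc__(self):
        # the wrapper dies, but the expensive CMat goes back to the pool
        _pool.setdefault(self.shape, []).append(self.mat)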

(I'm sure there's some more room for optimization, and the caching algorithm could be improved as well.) Also, note that creating the numpy arrays themselves is expensive.

In [1]: from mat import *

In [2]: %time make_np(10**5)
CPU times: user 0.56 s, sys: 0.43 s, total: 0.99 s
Wall time: 0.99 s

In [4]: %time make_CMat(10**5)
CPU times: user 0.68 s, sys: 0.45 s, total: 1.13 s
Wall time: 1.14 s

In [6]: %time make_CachedCMat(10**5)
CPU times: user 0.14 s, sys: 0.00 s, total: 0.14 s
Wall time: 0.14 s

In [8]: %time make_Empty(10**5)
CPU times: user 0.02 s, sys: 0.00 s, total: 0.02 s
Wall time: 0.02 s

- Robert


