Thanks for the mini example and timings. Very instructive. By making
CMat an object encapsulated in CachedCMat, you avoided the
'self-destruct' problem, where the self object is doomed once it
reaches __dealloc__.
The only annoying thing about that is that one has to carry the
CachedCMat around along with the CMat object while it's in use, or
else rewrite the CMat class itself to do the extra indirection for
operations such as add, multiply, qr, etc.
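To make that indirection concrete, here's a minimal sketch; all the
names and the add() signature are invented, since the real CMat API
isn't shown in this thread:

    cdef class CMat:
        cdef double scale                 # stand-in for the real matrix state
        cdef CMat add(self, CMat other):
            cdef CMat r = CMat()
            r.scale = self.scale + other.scale
            return r

    cdef class CachedCMat:
        cdef CMat inner                   # the wrapped, recyclable object
        def __cinit__(self):
            self.inner = CMat()
        def add(self, CachedCMat other):
            cdef CachedCMat r = CachedCMat()
            r.inner = self.inner.add(other.inner)   # the extra hop per operation
            return r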
The only Python object in my cmat class is an ndarray, so just copying
a reference to that object, plus a couple of C pointers, into a
container class stored in the hash presumably shouldn't be that bad,
and it accomplishes nearly the same thing. It does add the overhead of
creating the container class whenever a cmat gets hashed, but my
Cython timings seem to suggest that's not such a big deal.
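Roughly, the container idea looks like the following sketch; 'block'
and 'view' are made-up stand-ins for the two vsipl C pointers, since
the real names aren't shown here:

    cimport numpy as np

    cdef class BlockHolder:
        # Holding the ndarray keeps its buffer (and hence the C
        # pointers into it) alive after the owning cmat is gone.
        cdef np.ndarray arr
        cdef void *block      # hypothetical vsipl block pointer
        cdef void *view       # hypothetical vsipl view pointer

    cdef dict _cache = {}

    cdef void stash(tuple shape, np.ndarray arr, void *block, void *view):
        # Called from cmat.__dealloc__ with copies of the doomed
        # object's fields; self itself is never stored, so it can
        # die normally.
        cdef BlockHolder h = BlockHolder()
        h.arr = arr
        h.block = block
        h.view = view
        _cache[shape] = h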
I did some more experiments, by the way, and discovered that my
cache-miss probability was 38%. That's still too high.
I either need to cache multiple matrices of the same size or move to a
different memory-management scheme. I've used a circular buffer for
temp matrices somewhat successfully in the past, but you have to size
the buffer pretty well to avoid overwriting live objects.
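Caching several matrices per size could be as simple as replacing the
single-slot stash above with a short free list per shape, so two live
temporaries of the same size don't evict each other. A rough sketch,
with a made-up holder object:

    from collections import defaultdict

    _MAX_PER_SHAPE = 4            # tuning knob; too small reintroduces misses
    _cache = defaultdict(list)    # shape -> list of spare holders

    def stash(shape, holder):
        spares = _cache[shape]
        if len(spares) < _MAX_PER_SHAPE:
            spares.append(holder)          # keep the spare for reuse
        # otherwise drop it and let normal deallocation happen

    def fetch(shape):
        spares = _cache[shape]
        return spares.pop() if spares else None   # None signals a miss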
I'm not certain, but it may be that Cython 0.12 runs the code a bit
faster as well. More tests on Monday, I guess.
-Matt
On 1/8/2010 2:07 PM, Robert Bradshaw wrote:
On Jan 7, 2010, at 2:04 PM, Matthew wrote:
OK, I'm going to give the __new__ hack a try from
http://trac.cython.org/cython_trac/ticket/238
I don't really need to overload __new__, do I, so I don't have to
change matrix.pxd?
Sorry, this is the ticket number that I meant to refer to:
http://trac.cython.org/cython_trac/ticket/443 , though that one takes
no arguments, so it may not apply to you.
The vsipl vendor tells me that the only really expensive operation is
the cblockbind() within __cinit__(). However, the __dealloc__()
routine is very expensive as well (given the number of times it's
being called). It would be nice if I could profile on a line-by-line
basis. I'm not sure whether the Python cProfile tool supports this
or not.
It does, but we don't have that implemented in Cython yet. Given that
it's a deterministic rather than (external) probabilistic profiler,
the profiling itself may significantly impact the speed and results.
Try commenting stuff out, or factoring it into an (inline) function.
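Something like the sketch below, say, assuming a Cython recent enough
to honor the profile directive; the _bind_block wrapper and the cmat
fields here are invented:

    # cython: profile=True
    # The directive above (it must be at the top of the .pyx) adds
    # cProfile hooks to the generated C code, so each factored-out
    # piece gets its own entry in the profile.

    cdef class cmat:
        cdef int rows, cols

        def __cinit__(self, int rows, int cols):
            self.rows = rows
            self.cols = cols
            self._bind_block()        # now timed as a separate entry

        cdef void _bind_block(self):
            # hypothetical wrapper around the expensive vsipl cblockbind()
            pass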
I can't just chalk this result up to the vsipl code, since the hash
routine isn't giving me any performance gain and seemed to be making
things worse. (Though I probably need to do some more debugging to see
whether I have a lot of cache misses, or some bug in my logic.)
Well, maybe hashing is slightly more expensive than the vsipl call. On
a completely unrelated note, getting data to/from a GPU can be a
bottleneck as well, and due to its asynchronous nature it may not show
up as obviously in the main CPU profiling results.
For the life of me I could not figure out how to just put the matrix
object itself into my hash-indexed memory cache. It seemed like my
Python objects were always being garbage collected once I hit the
__dealloc__ routine (the self.arr ndarray, for example). Later I
found out that the Cython class gets stripped of its attributes when
it's stored in a dictionary. Only those attributes written to the
class's internal dictionary in the __init__() method seem to get
saved, as far as I can tell from my experiments. Of course, I'd
actually like to avoid calling __init__(). I really didn't intend to
learn the internals of Python, or Cython for that matter, but I do
need to figure out how to optimize this code.
I think you're trying to make things way more complicated than
necessary. The easiest approach is to only expose wrapper classes, and
cache the expensive initialization in an internal class. See
http://sage.math.washington.edu/home/robertwb/cython/mat.html
http://sage.math.washington.edu/home/robertwb/cython/mat.pyx
(I'm sure there's some more room for optimization, and the caching
algorithm could be improved as well.) Also, note that creating the
numpy arrays is expensive as well.
In [1]: from mat import *
In [2]: %time make_np(10**5)
CPU times: user 0.56 s, sys: 0.43 s, total: 0.99 s
Wall time: 0.99 s
In [4]: %time make_CMat(10**5)
CPU times: user 0.68 s, sys: 0.45 s, total: 1.13 s
Wall time: 1.14 s
In [6]: %time make_CachedCMat(10**5)
CPU times: user 0.14 s, sys: 0.00 s, total: 0.14 s
Wall time: 0.14 s
In [8]: %time make_Empty(10**5)
CPU times: user 0.02 s, sys: 0.00 s, total: 0.02 s
Wall time: 0.02 s
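Roughly, the pattern looks like the following; this is a from-memory
sketch, not the actual mat.pyx linked above, with a placeholder for
the expensive setup:

    cimport numpy as np
    import numpy as np

    cdef list _pool = []                 # recycled _CMatData instances

    cdef class _CMatData:
        # Internal class: owns the expensive state (ndarray, bindings).
        cdef np.ndarray arr
        def __cinit__(self):
            self.arr = np.empty((100, 100))   # stand-in for real setup

    cdef class CachedCMat:
        # Public wrapper: construction and destruction just move one
        # reference to or from the pool, so it's cheap after warm-up.
        cdef _CMatData data
        def __cinit__(self):
            if _pool:
                self.data = _pool.pop()       # hit: reuse old state
            else:
                self.data = _CMatData()       # miss: pay the full cost
        def __dealloc__(self):
            _pool.append(self.data)           # self dies; data survives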
- Robert