[Numpy-discussion] Datarray BoF, part2
About a dozen people attended what was billed as a continuation of the SciPy 2010 datarray BoF. We met at UC Berkeley on July 19 as part of the py4science series.

A datarray is a subclass of a Numpy array that adds the ability to label the axes and to label the elements (ticks) along each axis.

We spent most of the time discussing how to index with tick labels. The main issue is with integers: is an integer index a tick name or a position index? At the top level, datarrays always use regular Numpy indexing: an int is a position, never a label, so darr[0] always returns the first element of the datarray. The ambiguity occurs in specialized indexing methods that allow indexing by tick label name (because the name could be an int). To break the ambiguity, the proposal was to provide several tick indexing methods [1]:

1. Integers are always treated as labels
2. Integers are never treated as labels
3. Try 1, then fall back to 2

We also discussed allowing axis labels to be any hashable object (currently only strings are allowed). The main problem: integers. Currently, if an axis is labeled, say, time, you can do darr.sum(axis=time). What happens when an axis is labeled with an int? What does the 2 in darr.sum(axis=2) refer to: a position or a label? The same problem exists for floats, since a float is (currently) a valid axis for Numpy arrays.

References:
[1] http://github.com/fperez/datarray/commit/3c5151baa233675b355058eb3ba028d2629bece5

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
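[Editorial note: the three proposed tick-indexing rules can be sketched with a plain dictionary standing in for an axis's tick map. The names `ticks` and `resolve` below are hypothetical illustrations, not part of the datarray API.]

```python
# Minimal sketch of the int-ambiguity problem: a plain dict stands in
# for an axis's tick map. `ticks` and `resolve` are hypothetical names.
ticks = {2: 0, 5: 1, 7: 2}  # tick label -> position

def resolve(key, mode):
    """Map an index key to a position under one of the three proposed rules."""
    if mode == "labels":          # rule 1: ints are always labels
        return ticks[key]
    if mode == "positions":       # rule 2: ints are never labels
        return key
    if mode == "try_labels":      # rule 3: try rule 1, fall back to rule 2
        return ticks.get(key, key)

# The same key means different things under each rule:
assert resolve(2, "labels") == 0      # 2 is the tick labeled 2 -> position 0
assert resolve(2, "positions") == 2   # 2 is simply position 2
assert resolve(2, "try_labels") == 0  # label lookup wins when the label exists
```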
Re: [Numpy-discussion] Datarray BoF, part2
I don't really know much about this topic, but what about a flag at array creation time (or whenever you define labels) that says whether valid indexes will be treated as labels or indexes for that array?

On Wed, Jul 21, 2010 at 9:37 AM, Keith Goodman kwgood...@gmail.com wrote:
We spent most of the time discussing how to index with tick labels. The main issue is with integers: is an integer index a tick name or a position index?
Re: [Numpy-discussion] Datarray BoF, part2
On Wed, Jul 21, 2010 at 10:08 AM, Keith Goodman kwgood...@gmail.com wrote:
On Wed, Jul 21, 2010 at 9:56 AM, John Salvatier jsalv...@u.washington.edu wrote:
I don't really know much about this topic, but what about a flag at array creation time (or whenever you define labels) that says whether valid indexes will be treated as labels or indexes for that array?

It's an interesting idea. My guess is that you'd end up having to check the attribute all the time when writing code:

if dar.intaslabel:
    dar2 = dar[tickmap(i)]
else:
    dar2 = dar[i]

My thoughts too. It's dangerous to simply have a toggle that changes what darr.slicingobject[2] means. Much safer to have slicing options that always approach slicing with consistent rules.

Separately, regarding the permissible axis labels, I think we must not allow any enumerated axis labels (i.e., ints and floats). I don't remember if there was a consensus about that yesterday. We don't have the flexibility in the ndarray API to allow the expression darr.method(axis=2) to mean not the 2nd dimension, but the Axis with label==2.

Mike
Re: [Numpy-discussion] Datarray BoF, part2
On Wed, Jul 21, 2010 at 10:58 AM, M Trumpis mtrum...@berkeley.edu wrote:
Separately, regarding the permissible axis labels, I think we must not allow any enumerated axis labels (i.e., ints and floats). We don't have the flexibility in the ndarray API to allow the expression darr.method(axis=2) to mean not the 2nd dimension, but the Axis with label==2.

So the axis label rule could be either:

1. str only
2. Any hashable object except int or float

#1 is looking better and better. Plus you already coded it :)
Re: [Numpy-discussion] Datarray BoF, part2
On Wed, Jul 21, 2010 at 11:41 AM, Vincent Davis vinc...@vincentdavis.net wrote:

Obviously there are several aspects of labels that need to be considered. An important one is whether an operation breaks the meaning of the labels. I like the idea of tickmap(i); it could have a lot of features like grouping... Maybe even work on structured arrays (maybe). The flag could connect the tickmap() to the array. Then if an operation performed on the array would result in the labels no longer being meaningful, the flag would change. In this way tickmap(i) checks the flags, and each axis could have a flag. (I am sure there is lots I am missing)

Each axis currently has a tick map: Axis._tick_dict. From a datarray you can access it like this:

from datarray import DataArray
x = DataArray([1,2,3], labels=[('time', ['n1', 'n2', 'n3'])])
x.axis.time._tick_dict
{'n1': 0, 'n2': 1, 'n3': 2}
x.axes[0]._tick_dict
{'n1': 0, 'n2': 1, 'n3': 2}
x.axis['time']._tick_dict
{'n1': 0, 'n2': 1, 'n3': 2}

On a separate note, I think having both axis and axes attributes is confusing. Would it be possible to have only one of them? Here's a proposal:

http://github.com/fperez/datarray/commit/01b2d3d2082ade38ec89dbca0c070dd4fc9d9ca3#L1R330
[Numpy-discussion] recarray slow?
I have a recarray -- the first column is date. I have the following function to compute the number of unique dates in my data set:

def byName():
    return len(list(set(d['Date'])))

Question: is the string 'Date' looked up at each iteration? If so, this is dumb, but explains my horrible performance. Or, is there a better way to code the above? Can I convert this to something indexed by column number and convert 'Date' to column number 0 upfront? Would this help with speed?

W
Re: [Numpy-discussion] recarray slow?
On Wed, Jul 21, 2010 at 15:12, wheres pythonmonks wherespythonmo...@gmail.com wrote:
Question: is the string 'Date' looked up at each iteration? If so, this is dumb, but explains my horrible performance. Or, is there a better way to code the above?

len(np.unique(d['Date']))

If you can come up with a self-contained example that we can benchmark, it would help. In my examples, I don't see any hideous performance, but my examples may be missing some crucially important detail about your data that is causing your performance problems.

--
Robert Kern

I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco
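[Editorial note: a self-contained comparison of the two counting approaches; the recarray here is a made-up stand-in for the poster's data.]

```python
import numpy as np

# Hypothetical stand-in for the poster's recarray: one 'Date' field.
d = np.array([('2010-07-19',), ('2010-07-20',), ('2010-07-19',)],
             dtype=[('Date', 'U10')])

# Original approach: boxes every element into a Python object, then counts.
n_set = len(set(d['Date']))

# Robert's suggestion: stays inside NumPy, no per-element boxing.
n_unique = len(np.unique(d['Date']))

assert n_set == n_unique == 2
```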
Re: [Numpy-discussion] recarray slow?
Wed, 21 Jul 2010 15:12:14 -0400, wheres pythonmonks wrote:
I have the following function to compute the number of unique dates in my data set: def byName(): return len(list(set(d['Date'])))

What this code does is:

1. d['Date']
Extract an array slice containing the dates. This is fast.

2. set(d['Date'])
Make copies of each array item, and box them into Python objects. This is slow. Insert each of the objects into the set. This is also somewhat slow.

3. list(set(d['Date']))
Get each item in the set, and insert them into a new list. This is somewhat slow, and unnecessary if you only want to count.

4. len(list(set(d['Date'])))

So the slowness arises because the code is copying data around, and boxing it into Python objects. You should try using Numpy functions (these don't re-box the data) to do this:

http://docs.scipy.org/doc/numpy/reference/routines.set.html

-- Pauli Virtanen
Re: [Numpy-discussion] recarray slow?
Thank you very much. Better crack open a numpy reference manual instead of relying on my python intuition.

On Wed, Jul 21, 2010 at 3:44 PM, Pauli Virtanen p...@iki.fi wrote:
So the slowness arises because the code is copying data around, and boxing it into Python objects. You should try using Numpy functions (these don't re-box the data) to do this.
Re: [Numpy-discussion] doc string linalg.solve -- works for b is matrix
I am using 1.3.0. Glad to hear it is correct in 1.4.0. Sorry for bothering you with an old version, but I am very happy with this feature!

Mark

What version of numpy are you using? That docstring was updated in that fashion about 8 mo. ago (at least in the Wiki; I'm not sure exactly when it was merged, but it does appear that way in version 1.4.0). DG

I am using linalg.solve to solve a system of linear equations. As I have to solve multiple systems with the same matrix, but different right-hand sides, I tried to make the right-hand side a matrix and that works fine. So the docstring should say:

Solve the equation ``a x = b`` for ``x``.

Parameters
----------
a : array_like, shape (M, M)
    Input equation coefficients.
b : array_like, shape (M,) or (M, N)
    Equation target values. N can be of arbitrary size.

Returns
-------
x : array, shape (M,) or (M, N)

Thanks, Mark
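[Editorial note: the matrix right-hand-side behavior described above is easy to verify directly; the numbers below are made up for illustration.]

```python
import numpy as np

a = np.array([[3.0, 1.0],
              [1.0, 2.0]])
# Two right-hand sides, stacked as the columns of an (M, N) array.
b = np.array([[9.0, 1.0],
              [8.0, 0.0]])

x = np.linalg.solve(a, b)   # x has shape (M, N)
assert x.shape == (2, 2)

# Each column of x solves a @ x[:, j] = b[:, j].
assert np.allclose(a @ x, b)
```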
Re: [Numpy-discussion] recarray slow?
However: is there an automatic way to convert a named index to a position? What about looping over tuples of my recarray:

for t in d:
    date = t['Date']

I guess that the above does have to look up 'Date' each time. But the following does not need the hash lookup for each tuple:

for t in d:
    date = t[0]

Should I create a map from dtype.names, and use that to look up the index based on the name in advance? (if I really really want to factor out the lookup of 'Date')

On Wed, Jul 21, 2010 at 3:47 PM, wheres pythonmonks wherespythonmo...@gmail.com wrote:
Thank you very much. Better crack open a numpy reference manual instead of relying on my python intuition.
Re: [Numpy-discussion] recarray slow?
On Jul 21, 2010, at 4:22 PM, wheres pythonmonks wrote:
However: is there an automatic way to convert a named index to a position? What about looping over tuples of my recarray: for t in d: date = t['Date']

Why don't you use zip?

for (date, t) in zip(d['Date'], d):
    ...

That way, you save repetitive calls to __getitem__.

Should I create a map from dtype.names, and use that to look up the index based on the name in advance? (if I really really want to factor out the lookup of 'Date')

Meh. I have a bad feeling that it won't be really performant.
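[Editorial note: the zip suggestion, written out as runnable code; the two-field recarray is a made-up stand-in for the poster's data.]

```python
import numpy as np

d = np.array([('2010-07-19', 1.0), ('2010-07-20', 2.0)],
             dtype=[('Date', 'U10'), ('Value', 'f8')])

# One field lookup up front, then plain tuple unpacking in the loop --
# no per-row d['Date'] hash lookup.
dates = []
for date, row in zip(d['Date'], d):
    dates.append(date)

assert list(dates) == ['2010-07-19', '2010-07-20']
```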
Re: [Numpy-discussion] recarray slow?
What about:

idx_by_name = dict(enumerate(d.dtype.names))

Then I can look up the index of the columns I want before the loop, and then access by the index during the loop.

- W

On Wed, Jul 21, 2010 at 4:29 PM, Pierre GM pgmdevl...@gmail.com wrote:
Why don't you use zip? That way, you save repetitive calls to __getitem__.
Re: [Numpy-discussion] recarray slow?
On Jul 21, 2010, at 4:35 PM, wheres pythonmonks wrote:
What about: idx_by_name = dict(enumerate(d.dtype.names)) Then I can look up the index of the columns I want before the loop, and then access by the index during the loop.

Sure. Why don't you try both approaches, time them and document it? I still bet that manipulating tuples of numbers might be easier and more performant than juggling w/ the fields of a numpy.void, but that's a gut feeling only...
Re: [Numpy-discussion] recarray slow?
My code had a bug:

idx_by_name = dict((n, i) for i, n in enumerate(d.dtype.names))

On Wed, Jul 21, 2010 at 4:49 PM, Pauli Virtanen p...@iki.fi wrote:
Wed, 21 Jul 2010 16:22:37 -0400, wheres pythonmonks wrote:
However: is there an automatic way to convert a named index to a position?

It's not really a named index -- it's a field name. Since the fields of an array element can be of different size, they cannot be referred to with an array index (in the sense that Numpy understands the concept).

What about looping over tuples of my recarray: for t in d: date = t['Date'] I guess that the above does have to look up 'Date' each time.

As Pierre said, you can move the lookups outside the loop:

for date in d['Date']:
    ...

If you want to iterate over multiple fields, it may be best to use itertools.izip so that you unbox a single element at a time. However, I'd be quite surprised if the hash lookups would actually take a significant part of the run time:

1) Python dictionaries are ubiquitous and the implementation appears heavily optimized to be fast with strings.
2) The hash of a Python string is cached, and computed only once.
3) String literals are interned, and represented by a single object only:

'Date' is 'Date'
True

So when running the above Python code, the hash of 'Date' is computed exactly once.
4) For small dictionaries containing strings, such as the fields dictionary, I'd expect 1-3) to be dwarfed by the overhead involved in making Python function calls (PyArg_*) and interpreting the bytecode.

So the usual optimization mantra applies here: measure first :) Of course, if you measure and show that the expectations 1-4) are actually wrong, that's fine.

-- Pauli Virtanen
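[Editorial note: the corrected mapping in action, with a hypothetical two-field recarray standing in for the poster's data.]

```python
import numpy as np

d = np.array([('2010-07-19', 1.0)],
             dtype=[('Date', 'U10'), ('Value', 'f8')])

# Map each field name to its positional index, computed once up front.
idx_by_name = dict((n, i) for i, n in enumerate(d.dtype.names))
assert idx_by_name == {'Date': 0, 'Value': 1}

# Rows can then be indexed by position instead of by field name.
date_idx = idx_by_name['Date']
assert d[0][date_idx] == '2010-07-19'
```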
Re: [Numpy-discussion] Datarray BoF, part2
I agree with the idea that axis labels must be strings. Yes, this is the opposite of my position on tick labels (names), but there's a reason: ticks are often defined by whatever data you happen to be working with, but axis labels will in the vast majority of situations be defined by the programmer as they're writing the code. If the programmer wants to name something, they'll certainly be able to do so with a string.

-- Rob

On Wed, Jul 21, 2010 at 2:08 PM, Keith Goodman kwgood...@gmail.com wrote:
So the axis label rule could be either: 1. str only 2. Any hashable object except int or float. #1 is looking better and better. Plus you already coded it :)
Re: [Numpy-discussion] Datarray BoF, part2
On Wed, Jul 21, 2010 at 5:32 PM, Rob Speer rsp...@mit.edu wrote:
I agree with the idea that axis labels must be strings.

+1 This is what I was thinking as well.

Skipper
Re: [Numpy-discussion] Datarray BoF, part2
Make that +2.

--Josh

On Wed, Jul 21, 2010 at 1:38 PM, Skipper Seabold jsseab...@gmail.com wrote:
On Wed, Jul 21, 2010 at 5:32 PM, Rob Speer rsp...@mit.edu wrote:
I agree with the idea that axis labels must be strings.

+1 This is what I was thinking as well.

Skipper
Re: [Numpy-discussion] Datarray BoF, part2
On Wed, Jul 21, 2010 at 2:32 PM, Rob Speer rsp...@mit.edu wrote:
I agree with the idea that axis labels must be strings.

What started the discussion was that someone wanted to have more than one label name for one axis. So I suggested that if we allow any hashable object as an axis label then, for example, a tuple could be used to hold multiple names. That would also allow a datarray to be flattened to 1d, since the axis labels could be combined into a tuple. So a 2d datarray with axis names 'time' and 'distance' and ticks 't1', 't2' and 'd1', 'd2' could flatten to:

axis -- ('time', 'distance')
ticks -- [('t1', 'd1'), ('t1', 'd2'), ('t2', 'd1'), ('t2', 'd2')]

An unflatten function along with a fill value could unflatten the datarray.
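[Editorial note: the combined-tick scheme described above is just the Cartesian product of the per-axis ticks. A sketch with plain Python containers, not datarray API.]

```python
from itertools import product

# Per-axis names and ticks for the 2d example above.
axis_names = ('time', 'distance')
time_ticks = ['t1', 't2']
distance_ticks = ['d1', 'd2']

# Flattening to 1d: one tuple-valued axis label, tuple-valued ticks.
flat_axis = axis_names
flat_ticks = list(product(time_ticks, distance_ticks))

assert flat_axis == ('time', 'distance')
assert flat_ticks == [('t1', 'd1'), ('t1', 'd2'), ('t2', 'd1'), ('t2', 'd2')]
```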
Re: [Numpy-discussion] doc string linalg.solve -- works for b is matrix
Take it as a reminder: when reporting an error or problem, even if it doesn't seem relevant, always provide the version number. :-)

DG

On Wed, Jul 21, 2010 at 12:50 PM, Mark Bakker mark...@gmail.com wrote:
I am using 1.3.0. Glad to hear it is correct in 1.4.0. Sorry for bothering you with an old version, but I am very happy with this feature! Mark

--
Mathematician: noun, someone who disavows certainty when their uncertainty set is non-empty, even if that set has measure zero.
Hope: noun, that delusive spirit which escaped Pandora's jar and, with her lies, prevents mankind from committing a general suicide. (As interpreted by Robert Graves)
Re: [Numpy-discussion] lstsq functionality
On 2010-07-20, at 10:16 PM, Skipper Seabold wrote:
Out of curiosity, is there an explicit way to check if these share memory?

You could do the exact calculation (I think), but this isn't actually implemented in NumPy, though np.may_share_memory is a conservative test for it that will err on the side of false positives.

David
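[Editorial note: np.may_share_memory in action, including the conservative false-positive behavior David mentions.]

```python
import numpy as np

a = np.arange(10)
view = a[2:5]          # a view into a's buffer
copy = a[2:5].copy()   # an independent buffer

assert np.may_share_memory(a, view)       # true overlap
assert not np.may_share_memory(a, copy)   # disjoint buffers

# Conservative: interleaved slices never touch the same elements,
# but may_share_memory only compares memory bounds, so it says True.
assert np.may_share_memory(a[::2], a[1::2])
```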
[Numpy-discussion] (no subject)
Hello everyone,

I'm currently planning to use a Python-based infrastructure for our HPC project. I've previously used NumPy and SciPy for basic scientific computing tasks, but performance hasn't been quite an issue for me until now. At the moment I'm not too sure as to what to do next though, and I was hoping that someone with more experience in performance-related issues could point me to a way out of this. The trouble lies in the following piece of code:

===
w = 2 * math.pi * f
M = A - (1j*w*E)
n = M.shape[1]
B1 = numpy.zeros(n)
B2 = numpy.zeros(n)
B1[n-2] = 1.0
B2[n-1] = 1.0
# slow part starts here
umfpack.numeric(M)
x1 = umfpack.solve(um.UMFPACK_A, M, B1, autoTranspose=False)
x2 = umfpack.solve(um.UMFPACK_A, M, B2, autoTranspose=False)
solution = scipy.array([[x1[n-2], x2[n-2]],
                        [x1[n-1], x2[n-1]]])
return solution

This isn't really too much -- it's generating a small
[Numpy-discussion] UMFPACK interface is unexpectedly slow
Hello everyone,

First of all, let me apologize for my earlier message; I made the mistake of trying to indent my code using SquirrelMail's horrible interface -- and pressing Tab and Space resulted in sending my (incomplete) e-mail to the list. Cursed be Opera's keyboard shortcuts now :-).

I'm currently planning to use a Python-based infrastructure for our HPC project. I've previously used NumPy and SciPy for basic scientific computing tasks, so performance hasn't really been an issue for me until now. At the moment I'm not too sure what to do next, and I was hoping that someone with more experience in performance-related issues could point me to a way out of this. The trouble lies in the following piece of code:

===
w = 2 * math.pi * f
M = A - (1j*w*E)
n = M.shape[1]
B1 = numpy.zeros(n)
B2 = numpy.zeros(n)
B1[n-2] = 1.0
B2[n-1] = 1.0
# slow part starts here
umfpack.numeric(M)
x1 = umfpack.solve(um.UMFPACK_A, M, B1, autoTranspose=False)
x2 = umfpack.solve(um.UMFPACK_A, M, B2, autoTranspose=False)
solution = scipy.array([[x1[n-2], x2[n-2]],
                        [x1[n-1], x2[n-1]]])
return solution

This isn't really too much -- it's generating a system matrix via operations that take little time, as I was expecting. Trouble is, the solve part takes significantly more time than Octave -- about 4 times. I'm using the stock version of UMFPACK in Ubuntu's repository; it's compiled against standard BLAS, so it's fairly slow, but so is Octave -- so the problem isn't there. I'm obviously doing something wrong related to memory management here, because the memory consumption is also rocketing, but I'm not sure what exactly it is that I'm doing wrong. Could you point me towards some relevant documentation describing what I could do in order to improve the performance, or give me some hint related to that?

Best regards,
Alexandru Lazar
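[Editorial note: one common pattern for the "same matrix, many right-hand sides" case is to factorize once and reuse the factorization for each solve. A sketch using scipy.sparse.linalg (not the poster's umfpack wrapper); the small complex matrix below is a made-up stand-in for M = A - 1j*w*E.]

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import factorized

# Hypothetical small complex system standing in for the poster's M.
M = csc_matrix(np.array([[2.0, 0.5, 0.0],
                         [0.0, 2.0, 0.5],
                         [0.0, 0.0, 2.0]]) - 0.3j * np.eye(3))

n = M.shape[1]
B1 = np.zeros(n, dtype=complex)
B2 = np.zeros(n, dtype=complex)
B1[n-2] = 1.0
B2[n-1] = 1.0

solve = factorized(M)   # LU-factorize once (expects CSC format)
x1 = solve(B1)          # each call reuses the factorization
x2 = solve(B2)

assert np.allclose(M @ x1, B1)
assert np.allclose(M @ x2, B2)
```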
Re: [Numpy-discussion] UMFPACK interface is unexpectedly slow
I hope I won't get identified as a spam bot :-). While I have not resolved the problem itself, this is an issue that I cannot reproduce on our cluster. I wanted to get back with some actual timings from the real hardware we are going to be using and some details about the matrices, so as not to chase ghosts, but this proved to be a headache saver. It's still baffling, because on the cluster I have also used stock packages (albeit from Fedora, which is what our system administrator insists on using) rather than my hand-compiled and optimized GotoBLAS and UMFPACK. It didn't even occur to me to try to reproduce this on another system in the last 4 hours I've been struggling with this, because I assumed that using stock packages was giving me the uniformity I required. It seems I was wrong. Nonetheless, I think it's safe to assume in this case that the problem is not in NumPy or my code, and it would be wiser to bring this up in Ubuntu's Launchpad.

Thanks for your patience,
Alexandru

On Thu, July 22, 2010 4:10 am, Ioan-Alexandru Lazar wrote:
Hello everyone, First of all, let me apologize for my earlier message; I made the mistake of trying to indent my code using SquirrelMail's horrible interface.
Re: [Numpy-discussion] lstsq functionality
On Wed, Jul 21, 2010 at 5:44 PM, Keith Goodman kwgood...@gmail.com wrote:
On Wed, Jul 21, 2010 at 4:35 PM, Skipper Seabold jsseab...@gmail.com wrote:
On Tue, Jul 20, 2010 at 10:24 PM, Keith Goodman kwgood...@gmail.com wrote:
Good point. Looks like we can get rid of 2 copies!

I didn't get rid of the second copy but I did cut things down just to see what the timing was like. I also threw out the ability to handle complex numbers (only saves an if iscomplex statement):

Just for timing purposes though, right?

Yes. Can someone confirm that the copy in np.linalg.lstsq

bstar[:b.shape[0], :n_rhs] = b.copy()

is not needed? I'm assuming that ndarray.__setitem__ never gives a view of the right-hand side.

bstar doesn't share memory with b, so there is no need for a copy. The whole first part of lstsq could use a cleanup; there are a lot of other things that could be made cleaner and some other bits that don't look right. Anyone want to take a shot at it?

Chuck
Re: [Numpy-discussion] lstsq functionality
On Wed, Jul 21, 2010 at 8:24 PM, Charles R Harris charlesr.har...@gmail.com wrote:
> On Wed, Jul 21, 2010 at 5:44 PM, Keith Goodman kwgood...@gmail.com wrote:
>> Can someone confirm that the copy in np.linalg.lstsq
>>
>>     bstar[:b.shape[0], :n_rhs] = b.copy()
>>
>> is not needed? I'm assuming that ndarray.__setitem__ never gives a
>> view of the right-hand side.
>
> bstar doesn't share memory with b, so there is no need for a copy. The
> whole first part of lstsq could use a cleanup; there are a lot of
> other things that could be made cleaner and some other bits that don't
> look right. Anyone want to take a shot at it?
>
> Chuck

Some minor tweak suggestions:

    >> b = np.random.rand(1000)

    >> timeit b[:,newaxis]      # actual
    100 loops, best of 3: 1.03 us per loop
    >> timeit b.reshape(-1,1)   # alternative
    100 loops, best of 3: 445 ns per loop

    >> timeit len(b.shape)      # actual
    1000 loops, best of 3: 145 ns per loop
    >> timeit b.ndim            # alternative
    1000 loops, best of 3: 74.1 ns per loop
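Both tweaks are drop-in replacements, not just faster near-equivalents; a quick sketch confirming they produce the same results (the test array is arbitrary):

```python
# Check that the faster alternatives match the originals exactly.
import numpy as np

b = np.random.rand(1000)

# b[:, np.newaxis] and b.reshape(-1, 1) both yield a (1000, 1) view
# onto the same underlying data.
assert b[:, np.newaxis].shape == b.reshape(-1, 1).shape == (1000, 1)
assert np.shares_memory(b.reshape(-1, 1), b)

# b.ndim is identical to len(b.shape).
assert b.ndim == len(b.shape) == 1
```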
[Numpy-discussion] summarizing blocks of an array using a moving window
Hello all,

The short version: For a given NxN array, is there an efficient way to use a moving window to collect a summary statistic on each chunk of the array and insert it into another array?

The long version: I am trying to resample an image loaded with GDAL into an NxN array. Note that this is for statistical purposes, so image quality doesn't matter. For the curious, the image is derived from satellite imagery and shows a map of hotspots of tropical deforestation at 500m resolution. I need to assign a count of these deforestation 'hits' to each pixel in an existing array of 1000m pixels.

Let's say the image looks like this:

    >>> np.random.randint(0, 2, 16).reshape(4, 4)
    array([[0, 0, 0, 1],
           [0, 0, 1, 1],
           [1, 1, 0, 0],
           [0, 0, 0, 0]])

I want to use a square, non-overlapping moving window for resampling, so that I get a count of all the 1's in each 2x2 window:

    0, 0, 0, 1
    0, 0, 1, 1        0 3
    1, 1, 0, 0   =    2 0
    0, 0, 0, 0

In another situation with similar data I'll need the average, or the maximum value, etc.

My brute-force method is to loop through the rows and columns to get little chunks of data to process. But there must be a more elegant way to do this -- it's probably obvious too ... (inelegant way further down).

Another way to do it would be to use np.tile to create an array called arr filled with blocks of [[0,1],[2,3]]. I could then use something like this to get the stats I want:

    d0 = img[np.where(arr==0)]
    d1 = img[np.where(arr==1)]
    d2 = img[np.where(arr==2)]
    d3 = img[np.where(arr==3)]
    img_out = d0 + d1 + d2 + d3

This would be quite snappy if I could create arr efficiently. Unfortunately np.tile does something akin to np.hstack to create this array, so it isn't square and can't be reshaped appropriately (np.tile(np.arange(2**2).reshape(2, 2), 4)):

    array([[0, 1, 0, 1, 0, 1, 0, 1],
           [2, 3, 2, 3, 2, 3, 2, 3]])

Inefficient sample code below. Advice greatly appreciated!
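For what it's worth, np.tile does tile in two dimensions when given a tuple of repetition counts instead of a scalar, which makes the label-array idea above workable. A sketch of that approach (the 4x4 example image is the one from the post):

```python
# np.tile with reps=(2, 2) repeats the 2x2 label block in both
# dimensions, producing the square label array the post asks for.
import numpy as np

arr = np.tile(np.arange(4).reshape(2, 2), (2, 2))
# arr is:
# [[0 1 0 1]
#  [2 3 2 3]
#  [0 1 0 1]
#  [2 3 2 3]]

img = np.array([[0, 0, 0, 1],
                [0, 0, 1, 1],
                [1, 1, 0, 0],
                [0, 0, 0, 0]])

# Each img[arr == k] pulls out, in row-major block order, the elements
# at position k of every 2x2 window; summing across k gives the counts.
d = [img[arr == k] for k in range(4)]
img_out = (d[0] + d[1] + d[2] + d[3]).reshape(2, 2)
# img_out is [[0, 3], [2, 0]]
```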
-Robin

    import numpy as np
    from math import sqrt
    from time import time

    def rs(img_size=16, window_size=2):
        w = window_size
        # making sure the numbers work
        if img_size % sqrt(img_size) != 0:
            print "Square root of image size must be an integer."
            print "Sqrt =", sqrt(img_size)
            return
        elif (img_size / sqrt(img_size)) % w != 0:
            print "Square root of image size must be evenly divisible by", w
            print "Sqrt =", sqrt(img_size)
            print sqrt(img_size), "/", w, "=", sqrt(img_size) / w
            return
        else:
            rows = int(sqrt(img_size))
            cols = int(sqrt(img_size))

        # create fake data: ones and zeros
        img = np.random.randint(0, 2, img_size).reshape(rows, cols)
        # create empty array for resampled data
        img_out = np.zeros((rows / w, cols / w), dtype=np.int)

        t = time()
        # retrieve blocks of pixels in img and
        # insert the summary into img_out
        for r in xrange(0, rows, w):      # skip rows based on window size
            for c in xrange(0, cols, w):  # skip columns based on window size
                # calculate the sum of the values in the moving window and
                # insert the value into the corresponding pixel in img_out
                img_out[r / w, c / w] = np.int(np.sum(img[r:r+w, c:c+w]))
        t1 = time()
        elapsed = t1 - t

        print "img shape:", img.shape
        print img, "\n"
        print "img_out shape:", img_out.shape
        print img_out
        print "%.7f seconds" % elapsed

    rs(img_size=16, window_size=2)
    rs(81, 3)
    rs(100)
    #rs(4800*4800)  # very slow
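The double loop above can be replaced by a pure-reshape reduction: give every w-by-w block its own pair of axes, then reduce over those axes. This handles sum, mean, max, and so on with one function. A sketch under the post's assumptions (block_reduce is a made-up helper name; written for Python 3 and modern NumPy):

```python
# Vectorized non-overlapping block summary: reshape so that axes 1 and 3
# index positions inside each w-by-w window, then reduce over them.
import numpy as np

def block_reduce(img, w, func=np.sum):
    rows, cols = img.shape
    assert rows % w == 0 and cols % w == 0
    blocks = img.reshape(rows // w, w, cols // w, w)
    # reduce within each window: first over axis 3, then over axis 1
    return func(func(blocks, axis=3), axis=1)

img = np.array([[0, 0, 0, 1],
                [0, 0, 1, 1],
                [1, 1, 0, 0],
                [0, 0, 0, 0]])

counts = block_reduce(img, 2)          # 1's per 2x2 block: [[0 3] [2 0]]
maxes = block_reduce(img, 2, np.max)   # max per 2x2 block
```

No Python loop runs over pixels, so this stays fast even for a 4800x4800 image; the reshape itself is free (a view) because the blocks tile the array exactly.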