Re: [Numpy-discussion] performance matrix multiplication vs. matlab
I also tried to install numpy with Intel MKL 9.1. I still used gfortran for the numpy installation, as Intel MKL 9.1 supports the GNU compiler.

I would suggest using GotoBLAS instead of ATLAS. It is easier to build than ATLAS (basically no configuration), and has even better performance than MKL. http://www.tacc.utexas.edu/tacc-projects/

S.M.
Re: [Numpy-discussion] Repeated dot products
On 12/12/2009 22:55, T J wrote:
> Hi, Suppose I have an array of shape: (n, k, k). In this case, I have n k-by-k matrices. My goal is to compute the product of a (potentially large) user-specified selection (with replacement) of these matrices. For example,
> x = [0,1,2,1,3,3,2,1,3,2,1,5,3,2,3,5,2,5,3,2,1,3,5,6]

TJ, what are your n, k, len(x)? _dotblas.dot is fast: dot( 10x10 matrices ) takes ~ 22 usec on my G4 PPC, which is ~ 15 clock cycles (700 MHz) per mem access * +. A hack to find repeated pairs (or triples ...) follows. Your sequence above has only (3,2) 4 times, no win. (Can someone give a probabilistic estimate of the number of non-overlapping pairs in N letters from an alphabet of size A?)

#!/usr/bin/env python
# numpy-discuss 2009 12dec TJ repeated dot products
from __future__ import division
from collections import defaultdict
import numpy as np

__version__ = "2010 7jan denis"

def pairs( s, Len=2 ):
    """ repeated non-overlapping pairs (substrings, subwords)
        abracadabra -> ab ra [[0 7] [2 9]], not br
        Len=3: triples, 4 ...
    """
    # bruteforce
    # grow repeated 2 3 ... ?
    pairs = defaultdict(list)
    for j in range(len(s)-Len+1):
        pairs[ s[j:j+Len] ].append(j)
    min2 = filter( lambda x: len(x) > 1, pairs.values() )
    min2.sort( key = lambda x: len(x), reverse=True )
        # remove overlaps --
        # (if many, during init scan would be faster)
    runs = np.zeros( len(s), np.uint8 )
    run = np.ones( Len, np.uint8 )
    run[0] = Len
    chains = []
    for ovchain in min2:
        chain = []
        for c in ovchain:
            if not runs[c:c+Len].any():
                runs[c:c+Len] = run
                chain.append(c)
        if len(chain) > 1:
            chains.append(chain)
    return (chains, runs)

#...
if __name__ == "__main__":
    import sys
    abra = "abracadabra"
    alph = 5
    randlen = 100
    randseed = 1
    exec( "\n".join( sys.argv[1:] ))  # Test= ...

    print "pairs( %s ) --" % abra
    print pairs( abra )     # ab [0, 7], br [2, 9]]
    print pairs( abra, 3 )  # abr [0, 7]

    np.random.seed( randseed )
    r = np.random.random_integers( 1, alph, randlen )
    chains, runs = pairs( tuple(r) )
    npair = sum([ len(c) for c in chains ])
    print "%d repeated pairs in %d random %d" % (npair, randlen, alph)
    # 35 repeated pairs in 100 random 5 (prob estimate this ?)
    # 25 repeated pairs in 100 random 10
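For concreteness, here is the baseline such a pairs-cache would be trying to beat -- a sketch, not from the thread; `mats` and `x` are assumed from TJ's description (Python 2, where reduce is a builtin):

import numpy as np

n, k = 7, 10
mats = np.random.randn(n, k, k)   # n k-by-k matrices, shape (n, k, k)
x = [0,1,2,1,3,3,2,1,3,2,1,5,3,2,3,5,2,5,3,2,1,3,5,6]

# naive chained product: one dot per index in the selection
prod = reduce(np.dot, (mats[i] for i in x))

# caching the product of a repeated adjacent pair, e.g. (3, 2), saves one
# dot per extra occurrence -- which is what pairs() above tries to find
cache = {}
def pairprod(i, j):
    if (i, j) not in cache:
        cache[(i, j)] = np.dot(mats[i], mats[j])
    return cache[(i, j)]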
Re: [Numpy-discussion] performance matrix multiplication vs. matlab
Sturla Molden wrote:
> I would suggest using GotoBLAS instead of ATLAS. http://www.tacc.utexas.edu/tacc-projects/

That does look promising -- any idea what the license is? They don't make it clear on the site (maybe it appears once you set up a user account and download, but I'd rather know up front). The only reference I could find is from 2006: http://www.utexas.edu/news/2006/04/12/tacc/ and in that, they refer to one of those annoying "free for academic and scientific use" clauses.

-Chris
Re: [Numpy-discussion] performance matrix multiplication vs. matlab
Sturla Molden wrote:
> I would suggest using GotoBLAS instead of ATLAS. http://www.tacc.utexas.edu/tacc-projects/

Christopher Barker wrote:
> That does look promising -- any idea what the license is? They don't make it clear on the site

UT TACC Research License (Source Code)

The Texas Advanced Computing Center of The University of Texas at Austin has developed certain software and documentation that it desires to make available without charge to anyone for academic, research, experimental or personal use. This license is designed to guarantee freedom to use the software for these purposes. If you wish to distribute or make other use of the software, you may purchase a license to do so from the University of Texas.

The accompanying source code is made available to you under the terms of this UT TACC Research License (this "UTTRL"). By clicking the "ACCEPT" button, or by installing or using the code, you are consenting to be bound by this UTTRL. If you do not agree to the terms and conditions of this license, do not click the "ACCEPT" button, and do not install or use any part of the code.

The terms and conditions in this UTTRL not only apply to the source code made available by UT TACC, but also to any improvements to, or derivative works of, that source code made by you and to any object code compiled from such source code, improvements or derivative works.

1. DEFINITIONS.

1.1 "Commercial Use" shall mean use of Software or Documentation by Licensee for direct or indirect financial, commercial or strategic gain or advantage, including without limitation: (a) bundling or integrating the Software with any hardware product or another software product for transfer, sale or license to a third party (even if distributing the Software on separate media and not charging for the Software); (b) providing customers with a link to the Software or a copy of the Software for use with hardware or another software product purchased by that customer; or (c) use in connection with the performance of services for which Licensee is compensated.

1.2 "Derivative Products" means any improvements to, or other derivative works of, the Software made by Licensee.

1.3 "Documentation" shall mean all manuals, user documentation, and other related materials pertaining to the Software that are made available to Licensee in connection with the Software.

1.4 "Licensor" shall mean The University of Texas.

1.5 "Licensee" shall mean the person or entity that has agreed to the terms hereof and is exercising rights granted hereunder.

1.6 "Software" shall mean the computer program(s) referred to as GotoBLAS2 made available under this UTTRL in source code form, including any error corrections, bug fixes, patches, updates or other modifications that Licensor may in its sole discretion make available to Licensee from time to time, and any object code compiled from such source code.

2. GRANT OF RIGHTS.

Subject to the terms and conditions hereunder, Licensor hereby grants to Licensee a worldwide, non-transferable, non-exclusive license to (a) install, use and reproduce the Software for academic, research, experimental and personal use (but specifically excluding Commercial Use); (b) use and modify the Software to create Derivative Products, subject to Section 3.2; and (c) use the Documentation, if any, solely in connection with Licensee's authorized use of the Software.

3. RESTRICTIONS; COVENANTS.
3.1 Licensee may not: (a) distribute, sub-license or otherwise transfer copies or rights to the Software (or any portion thereof) or the Documentation; (b) use the Software (or any portion thereof) or Documentation for Commercial Use, or for any other use except as described in Section 2; (c) copy the Software or Documentation other than for archival and backup purposes; or (d) remove any product identification, copyright, proprietary notices or labels from the Software and Documentation. This UTTRL confers no rights upon Licensee except those expressly granted herein.

3.2 Licensee hereby agrees that it will provide a copy of all Derivative Products to Licensor and that its use of the Derivative Products will be subject to all of the same terms, conditions, restrictions and limitations on use imposed on the Software under this UTTRL. Licensee hereby grants Licensor a worldwide, non-exclusive, royalty-free license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense and distribute Derivative Products. Licensee also hereby grants Licensor a worldwide, non-exclusive, royalty-free patent license to make, have made, use, offer to sell, sell, import and otherwise transfer the Derivative Products under those patent claims licensable by Licensee that are necessarily infringed by the Derivative Products.

4. PROTECTION OF SOFTWARE.

4.1 Confidentiality. The Software and Documentation are the confidential and proprietary information of Licensor. Licensee agrees to take adequate steps to protect the Software and Documentation from unauthorized
[Numpy-discussion] Behaviour of vdot(array2d, array1d)
Hi, I am new to this list, but I have been using scipy for a couple of months now with great satisfaction. Currently I have a problem: I diagonalize a hermitian complex matrix using the eigh routine from scipy.linalg (this is still a numpy question, see below). This returns the eigenvectors as columns of a 2d array. Now I would like to project a vector onto this new basis. I could either do:

initial_state = array(...)   # dtype=complex, shape=(dim,)
coefficients = zeros(shape=(dim,), dtype=complex)
matrix = array(...)          # dtype=complex, shape=(dim, dim)
eigenvalues, eigenvectors = eigh(matrix)
for i in xrange(dim):
    coefficients[i] = vdot(eigenvectors[:, i], initial_state)

But it seems to me, after reading the documentation for vdot, that it should also be possible to do this without a loop:

initial_state = array(...)   # dtype=complex, shape=(dim,)
matrix = array(...)          # dtype=complex, shape=(dim, dim)
eigenvalues, eigenvectors = eigh(matrix)
coefficients = vdot(eigenvectors.transpose(), initial_state)

However, when I do this, vdot raises a ValueError complaining that the vectors have different lengths. It seems that vdot (as opposed to dot) cannot handle arguments with different shape, although the documentation suggests otherwise. I am using numpy version 1.3.0. Is this a bug, or am I missing something?

Regards,
Nikolas
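For reference: np.vdot flattens its arguments to 1-D before taking the conjugated dot product, which is why the 2-d call above fails with a length mismatch (dim*dim vs. dim). A loop-free sketch (not from the thread, using a small hermitian test matrix in place of the elided array(...) calls) is an explicit conjugate with plain dot:

import numpy as np
from scipy.linalg import eigh

dim = 4
m = np.random.randn(dim, dim) + 1j * np.random.randn(dim, dim)
matrix = m + m.conj().T                      # hermitian test matrix
initial_state = np.random.randn(dim) + 1j * np.random.randn(dim)

eigenvalues, eigenvectors = eigh(matrix)

# vdot(a, b) == dot(a.conj(), b) for 1-D a and b, so projecting onto
# every eigenvector at once is a conjugate-transpose matrix product:
coefficients = np.dot(eigenvectors.conj().T, initial_state)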
Re: [Numpy-discussion] [Pythonmac-SIG] 1.4.0 installer fails on OSX 10.6.2
David Cournapeau wrote:
> On Thu, Jan 7, 2010 at 1:35 AM, Christopher Barker wrote:
>> In the past, I think folks have used the default name provided by bdist_mpkg, and those are not always clear. Something like: numpy1.4-osx10.4-python.org2.6-32bit.dmg
> The 32 bits is redundant - we support all archs supported by the official python binary, so python.org is enough.

True, though I was anticipating that there may be 32 and 64 bit builds some day.

> About osx10.4,

As for that -- I put that in 'cause I remembered that in the past it has said 10.5 when, in fact, 10.4 was supported. Thinking more, I think it's like 32 bit -- the python.org build supports 10.4, so that's all the information folks need.

> still don't know how to make sure we do work there with distutils. The whole MACOSX_DEPLOYMENT_TARGET confuses me quite a lot.

distutils should do it right, and indeed, I just tested the py2.5 and py2.6 binaries on my 10.4 PPC machine, and most of the tests pass on both (though see the note below). I think distutils does do it right, at least if you use the latest version of 2.6 -- a bug was fixed there. What OS/architecture were those built with?

> Other than that, the numpy 1.4.0 follows your advice, and contains the python.org part.

I should have looked first -- thanks, I think that will be helpful.

NOTE: When I first installed the binary, I got a whole bunch of errors because 'matrix' wasn't found. I recalled this issue from testing, and cleared out the install, then re-installed, and all was fine. I wonder if it's possible to have a mpkg remove anything?

Other failed tests:

==
FAIL: test_umath.test_nextafterl
...
    return _test_nextafter(np.longdouble)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/numpy/core/tests/test_umath.py", line 852, in _test_nextafter
    assert np.nextafter(one, two) - one == eps
AssertionError

==
FAIL: test_umath.test_spacingl
--
...
Traceback (most recent call last):
  line 887, in test_spacingl
    return _test_spacing(np.longdouble)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/numpy/core/tests/test_umath.py", line 873, in _test_spacing
    assert np.spacing(one) == eps
AssertionError

I think both of those are known issues, and not a big deal.

-Chris
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Pauli Virtanen wrote:
> ma, 2010-01-04 kello 17:05 -0800, Christopher Barker kirjoitti:
>> it also does odd things with spaces embedded in the separator: ", $ #" matches all of: ",$#", ", $#", ",$ #"
> That's a documented feature:

Fair enough.

OK, I've written a patch that allows newlines to be interpreted as separators in addition to whatever is specified in sep. In the process of testing, I found again these issues, which are still marked as "needs decision": http://projects.scipy.org/numpy/ticket/883

In short: what to do with missing values? I'd like to address this bug, but I need a decision to do so.

My proposal: Raise a ValueError with missing values.

Justification: No function should EVER return data that is not there. Period. It is simply asking for hard to find bugs. Therefore:

fromstring("3, 4,,5", sep=",")

should never, ever, return:

array([ 3., 4., 0., 5.])

Which is what it does now. bad. bad. bad.

Alternatives:

A) Raising a ValueError is the easiest way to get proper behavior. Folks can use a more sophisticated file reader if they want missing values handled. I'm willing to contribute this patch.

B) If the dtype is a floating point type, NaN could fill in the missing values -- a fine idea, but you can't use it for integers, and zero is a really bad replacement!

C) The user could specify what they want filled in for missing values. This is a fine idea, though I'm not sure I want to take the time to implement it.

Oh, and this is a bug too, with probably the same solution:

In [20]: np.fromstring("hjba", sep=',')
Out[20]: array([ 0.])

In [26]: np.fromstring("34gytf39", sep=',')
Out[26]: array([ 34.])

One more unresolved question: what should:

np.fromstring("3, 4, 5,", sep=",")

return? It currently returns:

array([ 3., 4., 5.])

which seems a bit inconsistent with missing value handling. I also found a bug:

In [6]: np.fromstring("3, 4, 5 , ", sep=",")
Out[6]: array([ 3., 4., 5., 0.])

so if there is some extra whitespace in there, it does return a missing value. With my proposal, that wouldn't happen, but you might get an exception. I think you should, but it'll be easier to implement my "allow newlines" code if not.

So, should I do (A)?

Another question: I've got a patch mostly working (except for the above issues) that will allow fromfile/string to read multiline, non-whitespace-separated data in one shot:

In [15]: str
Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'

In [16]: np.fromstring(str, sep=',', allow_newlines=True)
Out[16]: array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.])

I think this is a very helpful enhancement, and, as it is a new kwarg, backward compatible:

1) Might it be accepted for inclusion?
2) Is the name for the flag OK: allow_newlines? It's pretty explicit, but also long -- I used it for the flag name in the C code, too.
3) What C datatype should I use for a boolean flag? I used a char, but I don't know what the numpy standard is.

-Chris
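As a user-level stopgap until one of (A)-(C) is decided, one can map empty fields to NaN explicitly before handing anything to numpy -- a sketch, not part of the proposed patch:

import numpy as np

s = "3, 4,,5"
fields = [f.strip() for f in s.split(",")]
# empty field -> explicit NaN, instead of fromstring's silent zero
arr = np.array([float(f) if f else np.nan for f in fields])
# array([  3.,   4.,  NaN,   5.])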
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 3:08 PM, Christopher Barker chris.bar...@noaa.gov wrote:
[... Christopher Barker's message quoted in full; see above ...]

I don't know much about this, just a few more test cases:

comma and newline:
str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12'

extra comma at end of file:
str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,'

extra newlines at end of file:
str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'

It would be nice if these cases would go through without missing values or exception, but I don't often have files that are clean enough for fromfile(). I'm in favor of nan for missing values with floating point numbers.
It would make it easy to read correctly formatted csv files, even if the data is not complete.

Josef
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 2:32 PM, josef.p...@gmail.com wrote:
[... Christopher Barker's message and Josef's test cases quoted in full; see above ...]
> It would make it easy to read correctly formatted csv files, even if the data is not complete.

Using the numpy NaN or similar (noting R's approach to missing values, which in turn allows it to have the above functionality) is just a very bad idea for missing values, because you always have to check which NaN is a missing value and which was due to some numerical calculation. It is a very bad idea because we have masked arrays that nicely, but slowly, handle this situation.

From what I can see, you expect that fromfile() should only split at the supplied delimiters, optionally(?) strip any whitespace, and force a specific dtype. I would agree that the failure of any one of these should create an exception by default rather than making the best guess. So 'missing data' would potentially fail with forcing the specified dtype. Thus, you should either create an exception for invalid data (with the appropriate location) or use masked arrays.

Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12' actually assumes multiple delimiters, because there is no comma between 4 and 5 or between 8 and 9. So I think it would be better if fromfile accepted multiple delimiters. In Josef's last case, how many 'missing values' should there be?

Bruce
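For illustration, the masked-array route alluded to above -- a minimal sketch using numpy.ma, with -999. as an assumed missing-value code:

import numpy.ma as ma

data = ma.masked_values([3., 4., -999., 5.], -999.)
print data.mean()    # 4.0 -- the masked entry is excluded, no NaN contamination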
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Jan 7, 2010, at 2:32 PM, josef.p...@gmail.com wrote:
[... Christopher Barker's message and Josef's test cases quoted in full; see above ...]
> It would be nice if these cases would go through without missing values or exception, but I don't often have files that are clean enough for fromfile().

+1 (ignoring new-lines transparently is a nice feature).
You can also use sscanf with weave to read most files.

> I'm in favor of nan for missing values with floating point numbers. It would make it easy to read correctly formatted csv files, even if the data is not complete.

+1 (much preferable to insert NaN or other user value than raise ValueError, in my opinion)

-Travis
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Bruce Southey wrote:
> Using the numpy NaN or similar (noting R's approach to missing values, which in turn allows it to have the above functionality) is just a very bad idea for missing values because you always have to check which NaN is a missing value and which was due to some numerical calculation.

well, this is specific to reading files, so you know where it came from. And the principle of fromfile() is that it is fast and simple; if you want masked arrays, use slower, but more full-featured methods. However, in this case:

In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
Out[9]: array([ 3., 4., NaN, 5.])

An actual NaN is read from the file, rather than a missing value. Perhaps the user does want the distinction, so maybe it should really only fill it in if the user asks for it, by specifying missing_value=np.nan or something.

> From what I can see is that you expect that fromfile() should only split at the supplied delimiters, optionally(?) strip any whitespace

whitespace stripping is not optional.

> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12' actually assumes multiple delimiters because there is no comma between 4 and 5 and 8 and 9.

Yes, that's the point. I thought about allowing arbitrary multiple delimiters, but I think '\n' is a special case -- for instance, a comma at the end of some numbers might mean missing data, but a '\n' would not. And I couldn't really think of a useful use-case for arbitrary multiple delimiters.

> In Josef's last case how many 'missing values' should there be?
> extra newlines at end of file:
> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'

none -- exactly why I think \n is a special case. What about extra newlines in the middle of the file?

str = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'

I think they should be ignored, but I hope I'm not making something that is too specific to my personal needs.

Travis Oliphant wrote:
> +1 (ignoring new-lines transparently is a nice feature). You can also use sscanf with weave to read most files.

right -- but that requires weave. In fact, MATLAB has a fscanf function that allows you to pass in a C format string, and it vectorizes it to use the same one over and over again until it's done. It's actually quite powerful and flexible. I once started with that in mind, but didn't have the C chops to do it. I ended up with a tool that only did doubles (come to think of it, MATLAB only does doubles, anyway...). I may some day write a whole new C (or, more likely, Cython) function that does something like that, but for now, I'm just trying to get fromfile to be useful for me.

> +1 (much preferable to insert NaN or other user value than raise ValueError in my opinion)

But raise an error for integer types? I guess this is still up in the air -- no consensus yet.

Thanks,
-Chris
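A rough pure-Python analogue of that vectorized-fscanf idea -- a sketch, far slower than a C loop but tolerant of arbitrary separators:

import re
import numpy as np

_float_re = re.compile(r'[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?')

def scan_doubles(text):
    # pull out every float-looking token, whatever the separators are
    return np.array([float(t) for t in _float_re.findall(text)])

print scan_doubles('1, 2, 3, 4\n5, 6, 7, 8')   # [ 1.  2. ... 8.]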
Re: [Numpy-discussion] cPickle/unPickle across archs
On Thu, Jan 7, 2010 at 15:54, James Mazer james.ma...@yale.edu wrote:
> Hi, I've got some Numeric arrays that were created without an explicit byte size in the initial declaration and pickled. Something like this:
>
> cPickle.dump(array(ones((3,3,)), 'f'), open('foo.pic', 'w'))
>
> as opposed to:
>
> cPickle.dump(array(ones((3,3,)), Float32), open('foo.pic', 'w'))
>
> This works as long as the word size doesn't change between the reading and writing machines. The data were generated under a 32-bit linux kernel and now I'm trying to read them under a 64-bit kernel, so the word size has changed (and Numeric assumes that the 'f' type is the NATIVE float

Please note that 'f' is always a 32-bit float on any machine. Only integers may change size.

> and 'l' type is the NATIVE long) and dies miserably when the native types don't match the actual types (which defeats the whole point of pickling, to some extent -- I thought that cPickle.save/load were ensured to be invertible...)

I don't think cPickle ensures much at all. It's actually rather fragile for persisting data over long times and between different environments. It works better as a wire format for communication between similar codebases when thoroughly tested on both ends. Using a standard scientific file format for storing your important data has always been de rigueur.

That said, it is a deficiency in Numeric that it records the native typecode instead of a platform-neutral, explicitly sized typecode. Unfortunately, Numeric has been deprecated for many years now, and is not maintained. Numeric's replacement, numpy, does not have this problem.

> I've got terabytes of data that need to be read by both 32-bit and 64-bit machines (and it's not really feasible to scan all the files into new structures with explicit types on a 32-bit machine). Anybody have hints for addressing this problem? I found similar questions, but no answers, so I'm not completely alone with this problem.

What you can do is monkeypatch the function Numeric.array_constructor() to do the right thing for your case when it sees a platform-specific integer typecode. Something like the following (untested; you may need to generalize it to handle the unsigned integer typecodes, too, if you have that kind of data):

import Numeric

i_size = Numeric.empty(0, 'i').itemsize()

def patched_array_constructor(shape, typecode, thestr,
                              Endian=Numeric.LittleEndian):
    if typecode == "l":
        # Ensure that the length of the data matches our expectations.
        size = Numeric.product(shape)
        itemsize = len(thestr) // size
        if itemsize == i_size:
            typecode = 'i'
    if typecode == "O":
        x = Numeric.array(thestr, "O")
    else:
        x = Numeric.fromstring(thestr, typecode)
    x.shape = shape
    if Numeric.LittleEndian != Endian:
        return x.byteswapped()
    else:
        return x

Numeric.array_constructor = patched_array_constructor

After you have done that, cPickle.load() will use that patched function to reconstruct the arrays and make sure that the appropriate typecode is used to interpret the data.

-- Robert Kern
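Usage would then just be the normal unpickling call -- assuming the patch above has been applied first, and with 'foo.pic' standing in for one of the files from the question:

import cPickle

# array_constructor is looked up at unpickling time, so the patched
# version above is what reconstructs the array
arr = cPickle.load(open('foo.pic', 'rb'))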
Re: [Numpy-discussion] [Pythonmac-SIG] 1.4.0 installer fails on OSX 10.6.2
On Wed, Jan 6, 2010 at 11:35 AM, Christopher Barker chris.bar...@noaa.gov wrote:
> It's worse to have a binary you expect to work fail for you than to not have one available. In the past, I think folks have used the default name provided by bdist_mpkg, and those are not always clear. Something like: numpy1.4-osx10.4-python.org2.6-32bit.dmg or something -- even better, with a bit more text -- would help a lot.

I agree here. Better labeling of the .dmg would indeed help, I think. And thanks to everyone for all of the responses. I joined the mailing list, posted my question, and then went back to dissertation writing for a few days. When I looked up, there were 18 answers. I'll try getting python from python.org and/or building it all from scratch.

Thanks again,
Neil
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 4:45 PM, Christopher Barker chris.bar...@noaa.gov wrote:
[... Christopher Barker's message quoted in full; see above ...]
> But raise an error for integer types? I guess this is still up in the air -- no consensus yet.

raise an exception, I hate the silent cast of nan to integer zero, too much debugging and useless if there are real zeros. (or use some -999 kind of thing if user defined nan codes are allowed, but I just work with float if I expect nans/missing values.)

Josef
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
josef.p...@gmail.com wrote:
>>> +1 (much preferable to insert NaN or other user value than raise ValueError in my opinion)
>> But raise an error for integer types? I guess this is still up in the air -- no consensus yet.
> raise an exception, I hate the silent cast of nan to integer zero,

me too -- I'm sorry, I wasn't clear -- I'm not going to write any code that returns a zero for a missing value. These are the options I'd consider:

1) Have the user specify what to use for missing values; otherwise, raise an exception.
2) Insert a NaN for floating point types, and raise an exception for integer types.

What's not clear is whether (2) is a good idea. As for (1), I just don't know if I'm going to get around to writing the code, and maybe more kwargs is a bad idea -- though maybe not.

Enough talk: I've got ugly C code to wade through...

-Chris
[Numpy-discussion] Numpy MKL
I understand that Intel MKL uses the openMP parallel model. Therefore I set the environment variable:

os.environ['OMP_NUM_THREADS'] = '4'

With the same test example, however, still only one CPU is used. Do I need any specifications when I run numpy with Intel MKL (MKL 9.1)? numpy developers would be able to answer this question? I changed the name of the numpy-discussion thread to "Numpy MKL", attempting to draw attention from a wide range of readers.

Thanks! Sue

On Thu, Jan 7, 2010 at 11:20 AM, Xue (Sue) Yang x.y...@physics.usyd.edu.au wrote:
>> This time, only one cpu was used. Does it mean that our installed intel mkl 9.1 is not threaded?
> You would have to consult the MKL documentation - I believe you can control how many threads are used from an environment variable. Also, the exact build commands depend on the version of the MKL, as its libraries often change between versions.
> David

Thank you for the reply, which is useful. I also tried to install numpy with Intel MKL 9.1. I still used gfortran for the numpy installation, as Intel MKL 9.1 supports the GNU compiler. I only uncommented these lines in site.cfg (from site.cfg.example):

[mkl]
library_dirs = /usr/physics/intel/mkl/lib/32
include_dirs = /usr/physics/intel/mkl/include
lapack_libs = mkl_lapack

Then I tested numpy with:

import numpy
a = numpy.random.randn(6000, 6000)
numpy.dot(a, a)

This time, only one CPU was used. Does it mean that our installed Intel MKL 9.1 is not threaded? I don't think so. We have used it for openMP parallelization for quite a while.

Thanks! Sue
Re: [Numpy-discussion] Numpy MKL
On 7-Jan-10, at 6:58 PM, Xue (Sue) Yang wrote:
> Do I need any specifications when I run numpy with intel MKL (MKL9.1)? numpy developers would be able to answer this question?

Are you sure you've compiled against MKL properly? What is printed by numpy.show_config()?

David
[Numpy-discussion] Numpy MKL
This is what I had (when I built numpy, I chose the GNU compilers instead of the Intel compilers):

>>> numpy.show_config()
lapack_opt_info:
    libraries = ['mkl_lapack', 'mkl', 'vml', 'guide', 'pthread']
    library_dirs = ['/usr/physics/intel/mkl/lib/32']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/usr/physics/intel/mkl/include']
blas_opt_info:
    libraries = ['mkl', 'vml', 'guide', 'pthread']
    library_dirs = ['/usr/physics/intel/mkl/lib/32']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/usr/physics/intel/mkl/include']
lapack_mkl_info:
    libraries = ['mkl_lapack', 'mkl', 'vml', 'guide', 'pthread']
    library_dirs = ['/usr/physics/intel/mkl/lib/32']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/usr/physics/intel/mkl/include']
blas_mkl_info:
    libraries = ['mkl', 'vml', 'guide', 'pthread']
    library_dirs = ['/usr/physics/intel/mkl/lib/32']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/usr/physics/intel/mkl/include']
mkl_info:
    libraries = ['mkl', 'vml', 'guide', 'pthread']
    library_dirs = ['/usr/physics/intel/mkl/lib/32']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/usr/physics/intel/mkl/include']

Thanks! Sue

>> Do I need any specifications when I run numpy with intel MKL (MKL9.1)? numpy developers would be able to answer this question?
> Are you sure you've compiled against MKL properly? What is printed by numpy.show_config()?
> David
[Numpy-discussion] fromfile() -- help!
OK, I'm trying to dig into the code and figure out how to get it to stop putting in zeros for missing data with fromfile()/fromstring() text reading. It looks like the culprit is this, in arraytypes.c.src:

static int
@fn...@_scan(FILE *fp, @type@ *ip, void *NPY_UNUSED(ignore),
             PyArray_Descr *NPY_UNUSED(ignored))
{
    double result;
    int ret;

    ret = NumPyOS_ascii_ftolf(fp, &result);
    *ip = (@type@) result;
    return ret;
}

If I'm reading this right, this gets called for the datatype of interest, and it is passed a pointer to the file that is being read. If I have NumPyOS_ascii_ftolf right, it should return 0 if it doesn't successfully read a number. However, this looks like it sets the data in *ip even if the return value is zero. It does pass on that return value, but, from ctors.c:

static int
fromfile_next_element(FILE **fp, void *dptr, PyArray_Descr *dtype,
                      void *NPY_UNUSED(stream_data))
{
    /* the NULL argument is for backwards-compatibility */
    return dtype->f->scanfunc(*fp, dptr, NULL, dtype);
}

just moves it on through. This is called from here:

if (next(stream, dptr, dtype, stream_data) < 0) {
    break;
}

which is checking for < 0, so if a zero is returned, it will just go on its merry way...

So, have I got that right? Should this get fixed at that last point? One more point: this is a bit different for fromfile and fromstring, so I'm getting really confused!

-Chris
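For orientation, the Python-level symptom this code path produces -- the same behavior reported earlier in the thread for numpy-1.4-era fromstring:

import numpy as np

# the scan function reports failure (returns 0), but *ip has already been
# written and the caller only breaks on < 0 -- so a zero lands in the output:
print np.fromstring("3, 4,,5", sep=",")    # [ 3.  4.  0.  5.]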
Re: [Numpy-discussion] [Pythonmac-SIG] 1.4.0 installer fails on OSX 10.6.2
Christopher Barker wrote:
[... quoting the earlier exchange; see above ...]
> True, though I was anticipating that there may be 32 and 64 bit builds some day.

I suspect it will be exactly as today, i.e. a universal build with 64 bits. I have not followed closely the discussion on python-dev on that topic, but I believe python 2.7 will contain 64 bits as an arch.

> What OS/architecture were those built with?

Snow Leopard.

> When I first installed the binary, I got a whole bunch of errors because 'matrix' wasn't found. I recalled this issue from testing, and cleared out the install, then re-installed, and all was fine. I wonder if it's possible to have a mpkg remove anything?

pkg does not have an uninstaller - I don't think Apple provides one; that's a known limitation of Mac OS X installers (although I believe there are 3rd party ones).

> I think both of those are known issues, and not a big deal.

Maybe the spacing function is wrong on PPC. The underlying implementation is highly architecture dependent.

David
Re: [Numpy-discussion] Numpy MKL
On 7-Jan-10, at 8:13 PM, Xue (Sue) Yang wrote:
> This is what I had (when I built numpy, I chose gnu compilers instead of intel compilers),
> [... numpy.show_config() output quoted in full; see above ...]

That looks right to me... And you're sure you've set the environment variable before Python is run and NumPy is loaded? Try running:

import os; print os.environ['OMP_NUM_THREADS']

and verify it's the right number.

David
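A quick sanity check (a sketch, not from the thread): time a large matmul and watch CPU usage while it runs; a threaded MKL should show several busy cores and a clear speedup.

import os
os.environ.setdefault('OMP_NUM_THREADS', '4')   # set before numpy/MKL is loaded

import time
import numpy as np

a = np.random.randn(3000, 3000)
t0 = time.time()
np.dot(a, a)
print 'dot took %.2f s' % (time.time() - t0)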
Re: [Numpy-discussion] 1.4.0 installer fails on OSX 10.6.2
On 5-Jan-10, at 7:18 PM, Christopher Barker wrote:
> If distutils/setuptools could identify the python version properly, then binary eggs and easy-install could be a solution -- but that's a mess, too.

Long live toydist! :)

David
Re: [Numpy-discussion] 1.4.0 installer fails on OSX 10.6.2
On 5-Jan-10, at 7:02 PM, Christopher Barker wrote:
>> Pretty sure the python.org binaries are 32-bit only.
> I still think it's sensible to prefer the waiting the rest of this sentence.. ;-)

I had meant to say 'sensible to prefer the Python.org version', though in reality I'm a little miffed that Python.org isn't providing Ron's 4-way binaries, since he went to the trouble of adding support for building them. Grumble grumble.

>> I'm not really a fan of packages polluting /usr/local, I'd rather the tree appear /opt/packagename
> well, /opt has kind of been co-opted by macports.

I'd forgotten about that.

>> or /usr/local/packagename instead, for ease of removal
> wxPython gets put entirely into: /usr/local/lib/wxPython-unicode-2.10.8 which isn't bad.

Ah, yeah, that isn't bad either.

> but the general approach of stash somewhere and put a .pth in both site-packages seems fine to me. OK -- what about simply punting and doing two builds: one 32 bit, and one 64 bit. I wonder if we need 64bit PPC at all? I know I'm running 64 bit hardware, but never ran a 64 bit OS on it -- I wonder if anyone is?

I've built for ppc64 before, and in fact discovered a long-standing bug in the way ppc64 was detected. The fact that nobody found it before me is probably evidence that it is nearly never used. It could be useful in a minority of situations, but I don't think it's going to be worth it for most people.

David
Re: [Numpy-discussion] 1.4.0 installer fails on OSX 10.6.2
On 2010-01-07, David Warde-Farley d...@cs.toronto.edu wrote:
> On 5-Jan-10, at 7:02 PM, Christopher Barker wrote:
>> I'm not really a fan of packages polluting /usr/local, I'd rather the tree appear /opt/packagename
> well, /opt has kind of been co-opted by macports.
> I'd forgotten about that.

It's not really true, though. MacPorts took /opt/local/, but /opt/yourbrandnamehere/ probably hasn't been.

-- Robert Kern
Re: [Numpy-discussion] 1.4.0 installer fails on OSX 10.6.2
On Fri, Jan 8, 2010 at 11:24 AM, David Warde-Farley d...@cs.toronto.edu wrote:
> On 5-Jan-10, at 7:18 PM, Christopher Barker wrote:
>> If distutils/setuptools could identify the python version properly, then binary eggs and easy-install could be a solution -- but that's a mess, too.
> Long live toydist! :)

Toydist will not solve anything here. Versioning info is useless here if it does not translate to a compatible ABI. What is required is to be able to identify a precise python ABI: python makes that hard, mac os x harder, and universal builds even harder. Things like PEP 384 may help in the future - as it is written by someone who actually knows about this stuff, it will hopefully be useful.

David
[Numpy-discussion] FIY: a (new ?) practical profiling tool on linux
Hi,

I don't know if many people are aware of it, but I have recently discovered perf, a tool available from the kernel sources. It is extremely simple to use, and very useful when looking at numpy/scipy perf issues in compiled code. For example, I can get this kind of result for looking at the numpy neighborhood iterator performance in one simple command, without special compilation flags:

    44.69%  python  /home/david/local/stow/scipy.git/lib/python2.6/site-packages/scipy/signal/sigtools.so  [.] _imp_correlate_nd_double
    39.47%  python  /home/david/local/stow/numpy-1.4.0/lib/python2.6/site-packages/numpy/core/multiarray.so  [.] get_ptr_constant
     9.98%  python  /home/david/local/stow/numpy-1.4.0/lib/python2.6/site-packages/numpy/core/multiarray.so  [.] get_ptr_simple
     0.65%  python  /usr/bin/python2.6   [.] 0x12b8a0
     0.40%  python  /usr/bin/python2.6   [.] 0x0a6662
     0.37%  python  /usr/bin/python2.6   [.] 0x04c10d
     0.32%  python  /usr/bin/python2.6   [.] PyEval_EvalFrameEx
     0.15%  python  [kernel]             [k] __d_lookup
     0.14%  python  /lib/libc-2.10.1.so  [.] _int_malloc
     0.12%  python  /usr/bin/python2.6   [.] 0x04f90e
     0.10%  python  [kernel]             [k] __link_path_walk
     0.09%  python  /usr/bin/python2.6   [.] PyObject_Malloc
     0.09%  python  /lib/ld-2.10.1.so    [.] do_lookup_x
     0.09%  python  /lib/libc-2.10.1.so  [.] __GI_memcpy
     0.08%  python  [kernel]             [k] __ticket_spin_lock
     0.07%  python  /usr/bin/python2.6   [.] PyParser_AddToken

And even cooler, annotated sources:

 Percent |  Source code & Disassembly of multiarray.so
------------------------------------------------
         :
         :  Disassembly of section .text:
         :
         :  0001d8a0 <get_ptr_constant>:
         :          _coordinates[c] = bd;
         :
         :  /* set the dataptr from its current coordinates */
         :  static char*
         :  get_ptr_constant(PyArrayIterObject* _iter, npy_intp *coordinates)
         :  {
   15.69 :    1d8a0:  48 81 ec 08 01 00 00   sub    $0x108,%rsp
         :      int i;
         :      npy_intp bd, _coordinates[NPY_MAXDIMS];
         :      PyArrayNeighborhoodIterObject *niter = (PyArrayNeighborhoodIterObject*)_iter;
         :      PyArrayIterObject *p = niter->_internal_iter;
         :
         :      for(i = 0; i < niter->nd; ++i) {
    0.02 :    1d8a7:  48 83 bf 48 0a 00 00   cmpq   $0x0,0xa48(%rdi)
    0.00 :    1d8ae:  00
         :  get_ptr_constant(PyArrayIterObject* _iter, npy_intp *coordinates)
         :  {
         :      int i;
         :      npy_intp bd, _coordinates[NPY_MAXDIMS];
         :      PyArrayNeighborhoodIterObject *niter = (PyArrayNeighborhoodIterObject*)_iter;
         :      PyArrayIterObject *p = niter->_internal_iter;
    0.01 :    1d8af:  48 8b 87 50 0b 00 00   mov    0xb50(%rdi),%rax
         :
         :      for(i = 0; i < niter->nd; ++i) {
    7.92 :    1d8b6:  7e 64                  jle    1d91c <get_ptr_constant+0x7c>
         :          _INF_SET_PTR(i)
    0.01 :    1d8b8:  48 8b 0e               mov    (%rsi),%rcx
    0.00 :    1d8bb:  48 03 48 28            add    0x28(%rax),%rcx
    0.03 :    1d8bf:  48 3b 88 40 07 00 00   cmp    0x740(%rax),%rcx
    7.97 :    1d8c6:  7c 68                  jl     1d930 <get_ptr_constant+0x90>
    0.02 :    1d8c8:  45 31 c9               xor    %r9d,%r9d
    0.00 :    1d8cb:  31 d2                  xor    %edx,%edx
    0.00 :    1d8cd:  48 3b 88 48 07 00 00   cmp    0x748(%rax),%rcx
    7.75 :    1d8d4:  7e 32                  jle    1d908 <get_ptr_constant+0x68>
    0.00 :    1d8d6:  eb 58                  jmp    1d930 <get_ptr_constant+0x90>
    0.00 :    1d8d8:  0f 1f 84 00 00 00 00   nopl   0x0(%rax,%rax,1)
    0.00 :    1d8df:  00
    7.68 :    1d8e0:  4c 8d 42 74            lea    0x74(%rdx),%r8
    0.00 :    1d8e4:  48 8b 0c d6            mov    (%rsi,%rdx,8),%rcx
    0.00 :    1d8e8:
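A note on invocation (assumed here, since the post doesn't show the exact commands; "bench.py" stands in for whatever script exercises the iterator):

    perf record python bench.py      # sample the run
    perf report                      # per-symbol table as above
    perf annotate get_ptr_constant   # annotated source/assembly as above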
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 3:45 PM, Christopher Barker chris.bar...@noaa.gov wrote:
> Bruce Southey wrote:
>> Using the numpy NaN or similar (noting R's approach to missing values which in turn allows it to have the above functionality) is just a very bad idea for missing values because you always have to check which NaN is a missing value and which was due to some numerical calculation.
> well, this is specific to reading files, so you know where it came from.

You can only know where it came from when you compare the original array to the transformed one. Also, a user has to check for missing values, or numpy has to warn the user that missing values are present immediately after reading the data, so the appropriate action can be taken (like using functions that handle missing values appropriately). That is my second problem with using codes (NaN, -9, etc.) for missing values.

> And the principle of fromfile() is that it is fast and simple, if you want masked arrays, use slower, but more full-featured methods.

So in that case it should fail with missing data.

> However, in this case:
> In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
> Out[9]: array([ 3., 4., NaN, 5.])
> An actual NaN is read from the file, rather than a missing value. Perhaps the user does want the distinction, so maybe it should really only fill it in if the user asks for it, by specifying missing_value=np.nan or something.

Yes, that is my first problem with using predefined codes for missing values: you do not always know what is going to occur in the data.

> Yes, that's the point. I thought about allowing arbitrary multiple delimiters, but I think '\n' is a special case - for instance, a comma at the end of some numbers might mean missing data, but a '\n' would not. And I couldn't really think of a useful use-case for arbitrary multiple delimiters.
> extra newlines at end of file:
> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
> none -- exactly why I think \n is a special case.

What about '\r' and '\n\r'?

> What about extra newlines in the middle of the file?
> str = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'
> I think they should be ignored, but I hope I'm not making something that is too specific to my personal needs.

Not really, it is more that I am being somewhat difficult to ensure I understand what you actually need. My problem with this is that you are reading one huge 1-D array (that you can resize later) rather than a 2-D array with rows and columns (which is what I deal with). But I agree that you can have an option to say treat '\n' or '\r' as a delimiter, but I think it should be turned off by default.

[... remainder of Christopher Barker's message quoted; see above ...]
I ended up with a tool that only did doubles (come to think of it, MATLAB only does doubles, anyway...). I may some day write a whole new C (or, more likely, Cython) function that does something like that, but for now, I'm just trying to get fromfile() to be useful for me. +1 (much preferable to insert NaN or another user value than to raise ValueError, in my opinion) But raise an error for integer types? I guess this is still up in the air -- no consensus yet. Thanks, -Chris You should have a corresponding value for ints, because raising an exception would be inconsistent with allowing floats to have a value. If you must keep the user-defined dtype then, as Josef suggests, just use some code, be it -999 or the most negative number supported for the defined dtype, or just convert the ints into floats if the user does not define a missing-value code. It would be nice to either return the number of missing values or display a warning indicating how many occurred. Bruce ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
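To make the semantics being debated concrete, here is a minimal pure-Python sketch (the name fromtext and its signature are made up; this only illustrates the proposed behaviour, it is not the actual fromfile() code): '\r' is stripped, blank lines from extra newlines produce no values, '\n' otherwise acts as a delimiter, and an empty field between separators becomes a user-supplied missing value.

    import numpy as np

    def fromtext(text, sep=",", missing=np.nan):
        # hypothetical reference implementation of the behaviour discussed above
        values = []
        for line in text.replace("\r", "").split("\n"):
            if not line.strip():
                continue                      # extra newlines contribute nothing
            for field in line.split(sep):
                field = field.strip()         # whitespace stripping is not optional
                values.append(missing if not field else float(field))
        return np.array(values)

    # >>> fromtext('1, 2, 3, 4\n5, , 7, 8\n\n')
    # array([  1.,   2.,   3.,   4.,   5.,  NaN,   7.,   8.])

Note that this always returns floats; for an integer dtype there is no natural missing-value code, which is exactly the open question in this thread.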
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 11:10 PM, Bruce Southey bsout...@gmail.com wrote: What about '\r' and '\n\r'? Yes, I forgot about this, and it will be the most common case for Windows users like myself. I think \r should be stripped automatically, like in non-binary reading of files in Python. 
You should have a corresponding value for ints, because raising an exception would be inconsistent with allowing floats to have a value. No, I think different nan/missing-value handling for integers and floats is a natural distinction. There is no default nan code for integers, but nan (and inf) are valid floating point numbers (even if nan is not a number). And the default treatment of nans in numpy is getting pretty good (e.g. I like the new (nan)sort). If you must keep the user-defined dtype then, as Josef suggests, just use some code, be it -999 or the most negative number supported for the defined dtype, or just convert the ints