Re: [Numpy-discussion] "import numpy" performance
On 10 July 2012 09:05, Andrew Dalke wrote:
> On Jul 8, 2012, at 9:22 AM, Scott Sinclair wrote:
>> On 6 July 2012 15:48, Andrew Dalke wrote:
>>> I followed the instructions at
>>> http://docs.scipy.org/doc/numpy/dev/gitwash/patching.html
>>> and added Ticket #2181 (with patch) ...
>>
>> Those instructions need to be updated to reflect the current preferred
>> practice. You'll make code review easier and increase the chances of
>> getting your patch accepted by submitting the patch as a Github pull
>> request instead (see
>> http://docs.scipy.org/doc/numpy/dev/gitwash/development_workflow.html
>> for a how-to). It's not very much extra work.
>
> Both of those URLs point to related documentation under the same
> root, so I assumed that both are equally valid.

That's a valid assumption.

> I did look at the development_workflow documentation, and am already
> bewildered by the terms 'rebase', 'fast-forward', etc. It seems that
> last week I made a mistake because I did a "git pull" on my local copy
> (which is what I do with Mercurial to get the current trunk code)
> instead of:
>
>     git fetch followed by git rebase, git merge --ff-only or
>     git merge --no-ff, depending on what you intend.
>
> I don't know if I made a "common mistake", and I don't know "what [I]
> intend."

Fair enough, new terminology is seldom fun. Using git pull wasn't
necessary in your case, and neither was git rebase.

> I realize that for someone who plans to be a long-term contributor,
> understanding git, github, and the NumPy development model is
> "not very much extra work", but in terms of extra work for me,
> or at least minimizing my level of confusion, I would rather do
> what the documentation suggests and continue with the submitted
> patch.

By "not very much extra work" I assumed that you'd already done most of
the legwork towards submitting a pull request (Github account, forking
the numpy repo, etc.). My mistake, I now retract that statement :) and
have submitted your patch as https://github.com/numpy/numpy/pull/334
as a peace offering.

Cheers,
Scott
Re: [Numpy-discussion] "import numpy" performance
Andrew,

Thank you for your comments. I agree it's confusing coming to github at
first. I still have to refer to the jargon file to understand what
everything means. There are a lot of unfamiliar terms.

Thank you for your patches. Plain patches do imply more work for the
NumPy developers, which is why we prefer the github pull-request
mechanism, but having patches is better than not having them. Having an
easy way to upload a patch somewhere is something to think about with
the intended move to the github issue tracker.

Best,

-Travis

On Jul 10, 2012, at 2:05 AM, Andrew Dalke wrote:
> On Jul 8, 2012, at 9:22 AM, Scott Sinclair wrote:
>> On 6 July 2012 15:48, Andrew Dalke wrote:
>>> I followed the instructions at
>>> http://docs.scipy.org/doc/numpy/dev/gitwash/patching.html
>>> and added Ticket #2181 (with patch) ...
>>
>> Those instructions need to be updated to reflect the current preferred
>> practice. You'll make code review easier and increase the chances of
>> getting your patch accepted by submitting the patch as a Github pull
>> request instead (see
>> http://docs.scipy.org/doc/numpy/dev/gitwash/development_workflow.html
>> for a how-to). It's not very much extra work.
>
> Both of those URLs point to related documentation under the same
> root, so I assumed that both are equally valid. The 'patching' one I
> linked to says:
>
>     Making a patch is the simplest and quickest, but if you're going to be
>     doing anything more than simple quick things, please consider following
>     the Git for development model instead.
>
> That really fits me the best, because I don't know git or github, and
> I don't plan to get involved in numpy development other than two patches
> (one already posted, and the other, after my holiday, to get rid of
> the required numpy.testing import).
>
> I did look at the development_workflow documentation, and am already
> bewildered by the terms 'rebase', 'fast-forward', etc. It seems that
> last week I made a mistake because I did a "git pull" on my local copy
> (which is what I do with Mercurial to get the current trunk code)
> instead of:
>
>     git fetch followed by git rebase, git merge --ff-only or
>     git merge --no-ff, depending on what you intend.
>
> I don't know if I made a "common mistake", and I don't know "what [I]
> intend."
>
> I realize that for someone who plans to be a long-term contributor,
> understanding git, github, and the NumPy development model is
> "not very much extra work", but in terms of extra work for me,
> or at least minimizing my level of confusion, I would rather do
> what the documentation suggests and continue with the submitted
> patch.
>
> Andrew
> da...@dalkescientific.com
Re: [Numpy-discussion] "import numpy" performance
On Jul 8, 2012, at 9:22 AM, Scott Sinclair wrote:
> On 6 July 2012 15:48, Andrew Dalke wrote:
>> I followed the instructions at
>> http://docs.scipy.org/doc/numpy/dev/gitwash/patching.html
>> and added Ticket #2181 (with patch) ...
>
> Those instructions need to be updated to reflect the current preferred
> practice. You'll make code review easier and increase the chances of
> getting your patch accepted by submitting the patch as a Github pull
> request instead (see
> http://docs.scipy.org/doc/numpy/dev/gitwash/development_workflow.html
> for a how-to). It's not very much extra work.

Both of those URLs point to related documentation under the same
root, so I assumed that both are equally valid. The 'patching' one I
linked to says:

    Making a patch is the simplest and quickest, but if you're going to be
    doing anything more than simple quick things, please consider following
    the Git for development model instead.

That really fits me the best, because I don't know git or github, and
I don't plan to get involved in numpy development other than two patches
(one already posted, and the other, after my holiday, to get rid of
the required numpy.testing import).

I did look at the development_workflow documentation, and am already
bewildered by the terms 'rebase', 'fast-forward', etc. It seems that
last week I made a mistake because I did a "git pull" on my local copy
(which is what I do with Mercurial to get the current trunk code)
instead of:

    git fetch followed by git rebase, git merge --ff-only or
    git merge --no-ff, depending on what you intend.

I don't know if I made a "common mistake", and I don't know "what [I]
intend."

I realize that for someone who plans to be a long-term contributor,
understanding git, github, and the NumPy development model is
"not very much extra work", but in terms of extra work for me,
or at least minimizing my level of confusion, I would rather do
what the documentation suggests and continue with the submitted
patch.

Andrew
da...@dalkescientific.com
Re: [Numpy-discussion] "import numpy" performance
On 6 July 2012 15:48, Andrew Dalke wrote:
> I followed the instructions at
> http://docs.scipy.org/doc/numpy/dev/gitwash/patching.html
> and added Ticket #2181 (with patch) at
> http://projects.scipy.org/numpy/ticket/2181

Those instructions need to be updated to reflect the current preferred
practice. You'll make code review easier and increase the chances of
getting your patch accepted by submitting the patch as a Github pull
request instead (see
http://docs.scipy.org/doc/numpy/dev/gitwash/development_workflow.html
for a how-to). It's not very much extra work.

Cheers,
Scott
Re: [Numpy-discussion] "import numpy" performance
I followed the instructions at
http://docs.scipy.org/doc/numpy/dev/gitwash/patching.html
and added Ticket #2181 (with patch) at
http://projects.scipy.org/numpy/ticket/2181

This removes the five 'exec' calls from polynomial/*.py and improves
the 'import numpy' time by about 25-30%. That is, on my laptop

    python -c 'import time; t1=time.time(); import numpy; print time.time()-t1'

goes from 0.079 seconds to 0.057 (best of 10 for both cases).

The patch does mean that if someone edits the template then they will
need to run the template expansion script manually. I think it's well
worth the effort.

Cheers,
Andrew
da...@dalkescientific.com
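(For anyone repeating this kind of measurement: the one-liner above
times a single hot-cache import. A small harness in the same spirit --
a sketch, not necessarily Andrew's exact methodology -- spawns a fresh
interpreter per run, since a second "import numpy" in the same process
is nearly free, and reports the best of N runs:)

    import subprocess, sys

    def best_import_time(module, repeats=10):
        # Each run gets a fresh interpreter so that sys.modules
        # caching doesn't hide the real import cost.
        code = ("import time; t1 = time.time(); import %s; "
                "print time.time() - t1" % module)
        # subprocess.check_output requires Python 2.7+.
        times = [float(subprocess.check_output([sys.executable, "-c", code]))
                 for _ in range(repeats)]
        return min(times)

    print best_import_time("numpy")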
Re: [Numpy-discussion] "import numpy" performance
On Tue, Jul 3, 2012 at 1:16 AM, Andrew Dalke wrote:
> On Jul 3, 2012, at 12:46 AM, David Cournapeau wrote:
>> It is indeed irrelevant to your end goal, but it does affect the
>> interpretation of what import_array does, and thus of your benchmark
>
> Indeed.
>
>> Focusing on polynomial seems the only sensible action. Except for
>> test, all the other stuff seems difficult to change without breaking
>> anything.
>
> I confirm that when I comment out numpy/__init__.py's "import polynomial"
> then the import time for numpy.core.multiarray goes from
>
>     0.084u 0.031s 0:00.11 100.0%  0+0k 0+0io 0pf+0w
>
> to
>
>     0.058u 0.028s 0:00.08 87.5%  0+0k 0+0io 0pf+0w
>
> numpy/polynomial imports:
>
>     from polynomial import Polynomial
>     from chebyshev import Chebyshev
>     from legendre import Legendre
>     from hermite import Hermite
>     from hermite_e import HermiteE
>     from laguerre import Laguerre
>
> and there's no easy way to make these be lazy imports.
>
> Strange! The bottom of hermite.py has:
>
>     exec polytemplate.substitute(name='Hermite', nick='herm', domain='[-1,1]')
>
> as well as similar code in laguerre.py, chebyshev.py, hermite_e.py,
> and polynomial.py.
>
> I bet there's a lot of overhead generating and exec'ing
> those for each import!

Looks like it. That could easily be done at build time though. Making
that change and your proposed change to the test functions, which seems
fine, will likely be enough to reach your 40% target. No need for new
imports or lazy loading then, I hope.

Ralf
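(To make "done at build time" concrete: a hedged sketch of a one-off
generation step -- script and output file names here are hypothetical,
and it assumes only the polytemplate.substitute() interface shown in
the exec lines quoted above; the domain values are illustrative:)

    # expand_polytemplate.py -- hypothetical build-time expansion step.
    from numpy.polynomial.polytemplate import polytemplate

    # (class name, nickname, domain) triples, one per generated module.
    CLASSES = [
        ('Polynomial', 'poly',  '[-1,1]'),
        ('Chebyshev',  'cheb',  '[-1,1]'),
        ('Legendre',   'leg',   '[-1,1]'),
        ('Hermite',    'herm',  '[-1,1]'),
        ('HermiteE',   'herme', '[-1,1]'),
        ('Laguerre',   'lag',   '[0,1]'),
    ]

    for name, nick, domain in CLASSES:
        code = polytemplate.substitute(name=name, nick=nick, domain=domain)
        with open('_%s_generated.py' % nick, 'w') as out:
            out.write(code)

Each class module then becomes a thin "from _herm_generated import
Hermite"-style import instead of generating and exec'ing the template
on every startup.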
Re: [Numpy-discussion] "import numpy" performance
On Mon, Jul 2, 2012 at 2:59 PM, Nathaniel Smith wrote:
> On Mon, Jul 2, 2012 at 10:06 PM, Robert Kern wrote:
>> On Mon, Jul 2, 2012 at 9:43 PM, Benjamin Root wrote:
>>> On Mon, Jul 2, 2012 at 4:34 PM, Nathaniel Smith wrote:
>>>> I think this ship has sailed, but it'd be worth looking into lazy
>>>> importing, where 'numpy.fft' isn't actually imported until someone
>>>> starts using it. There are a bunch of libraries that do this, and one
>>>> would have to fiddle to get compatibility with all the different
>>>> python versions and make sure you're not killing performance (might
>>>> have to be in C), but something along the lines of
>>>>
>>>>     class _FFTModule(object):
>>>>         def __getattribute__(self, name):
>>>>             mod = importlib.import_module("numpy.fft")
>>>>             _FFTModule.__getattribute__ = mod.__getattribute__
>>>>             return getattr(mod, name)
>>>>     fft = _FFTModule()
>>>
>>> Not sure how this would impact projects like ipython that do
>>> tab-completion support, but I know that that would drive me nuts in
>>> my basic tab-completion setup I have for my regular python terminal.
>>> Of course, in the grand scheme of things, that really isn't all that
>>> important, I don't think.
>>
>> We used to do it for scipy. It did interfere with tab completion. It
>> did drive many people nuts.
>
> Sounds like a bug in your old code, or else the REPLs have gotten
> better? I just pasted the above code into both ipython and python
> prompts, and typing 'fft.<TAB>' worked fine in both cases. dir(fft)
> works first try as well.
>
> (If you try this, don't forget to 'import importlib' first, and note
> importlib is 2.7+ only. Obviously importlib is not necessary, but it
> makes the minimal example less tedious.)

For anyone interested, I worked out a small lazy-loading class that we
use in nitime [1], which does not need importlib and thus works on
python versions before 2.7, and which also has a bit of repr pretty
printing.

I wrote about this to Scipy-Dev [2], and in the original nitime PR [3]
tested that it works in python 2.5, 2.6, 2.7, 3.0, 3.1 and 3.2. Since
that time, we've only changed how we deal with the one known
limitation: reloading a lazily-loaded module was a no-op in that PR,
but now throws an error (there's one line commented out if the no-op
behavior is preferred).

Here's a link to the rendered docs [4], but if you just grab the
LazyImport class from [1], you can do

    fft = LazyImport('numpy.fft')

1. https://github.com/nipy/nitime/blob/master/nitime/lazyimports.py
2. http://mail.scipy.org/pipermail/scipy-dev/2011-September/016606.html
3. https://github.com/nipy/nitime/pull/88
4. http://nipy.sourceforge.net/nitime/api/generated/nitime.lazyimports.html#module-nitime.lazyimports

best,
--
Paul Ivanov
314 address only used for lists, off-list direct email at:
http://pirsquared.org | GPG/PGP key id: 0x0F3E28F7
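(The actual nitime class is linked above; purely for illustration, a
stripped-down sketch of the same idea -- not the nitime code, and
without its repr pretty-printing or reload handling -- looks like
this:)

    import types

    class LazyImport(types.ModuleType):
        """Defer the real import until the first attribute access."""
        def __init__(self, modname):
            super(LazyImport, self).__init__(modname)
            self._modname = modname

        def __getattr__(self, name):
            # Only reached for attributes not on this stub, i.e. for
            # everything the real module defines. After the first call,
            # __import__ is just a cheap sys.modules lookup.
            module = __import__(self._modname, fromlist=[''])
            return getattr(module, name)

    fft = LazyImport('numpy.fft')
    fft.ifft  # first access triggers the actual import of numpy.fft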
Re: [Numpy-discussion] "import numpy" performance
On Mon, Jul 2, 2012 at 12:17 PM, Andrew Dalke wrote:
> In this email I propose a few changes which I think are minor
> and which don't really affect the external NumPy API but which
> I think could improve the "import numpy" performance by at
> least 40%.

+1 -- I think I remember that thread -- at the time, I was experiencing
some really, really slow import times myself -- it turned out to be
something really weird with my system (though I don't remember exactly
what), but numpy still is too big an import.

Another note -- I ship stuff with py2exe and friends a fair bit --
numpy's "import a whole bunch of stuff you may well not be using"
approach means I have to include all that stuff, or hack the heck out
of numpy -- not ideal.

> 1) remove "add_newdocs" and put the docstrings in the C code
>    'add_newdocs' still needs to be there,
>
> The code says:
>
>     # This is only meant to add docs to objects defined in C-extension modules.
>     # The purpose is to allow easier editing of the docstrings without
>     # requiring a re-compile.

+1 -- isn't it better for the docs to be with the code, anyway?

> 2) Don't optimistically assume that all submodules are
>    needed. For example, some current code uses
>
>     import numpy
>     numpy.fft.ifft

+1 -- see above. Really, what fraction of code uses fft and polynomial,
and so on? "Namespaces are one honking great idea." I appreciate the
legacy, and the ease of use at the interpreter, but it sure would be
nice to clean this up -- maybe keep the legacy behavior by having a new
import (see the sketch after this message):

    import just_numpy as np

that would import the core stuff, and offer the "extra" packages as
specific imports -- ideally, we'd deprecate the old way, recommend the
extra importing for the future, and some day have "numpy" and
"numpy_plus". (Kind of like pylab, I suppose.)

Lazy importing may work OK, too, though it's more awkward for py2exe
and friends, and perhaps a bit "magic" for my taste.

> 3) Especially: don't always import 'numpy.testing'

+1

> I have not worried about numpy import performance for
> 4 years. While I have been developing scientific software
> for 20 years, and in Python for 15 years, it has been
> in areas of biology and chemistry which don't use arrays.

Remarkable -- I use arrays for everything! Most of which are not
classic big arrays you process with lapack-type stuff ;-)

> yeah, it's just using the homogeneous array most of the time.

Exactly -- I know Travis says "if you're going to use numpy arrays,
use numpy", but they really are pretty darn handy even if you just use
them as containers.

Ben Root wrote:
> Not sure how this would impact projects like ipython that do
> tab-completion support, but I know that that would drive me nuts in my
> basic tab-completion setup I have for my regular python terminal. Of
> course, in the grand scheme of things, that really isn't all that
> important, I don't think.

I do think it's important to support easy interactive use, IPython,
etc. -- with nice tab completion, easy access to docstrings, etc. But
it should also be possible to not have all that where it isn't required
-- hence my "import numpy_plus" type proposal.

I never did get why the polynomial stuff was added to core numpy.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959  voice
7600 Sand Point Way NE   (206) 526-6329  fax
Seattle, WA 98115        (206) 526-6317  main reception

chris.bar...@noaa.gov
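(Purely to illustrate the shape of the just_numpy proposal above --
the module is hypothetical and does not exist; note the caveat in the
comment:)

    # just_numpy.py -- hypothetical trimmed namespace, for illustration
    # only. Caveat: with numpy's current layout, importing *any* numpy
    # submodule first runs numpy/__init__.py (which pulls in fft,
    # polynomial, testing, ...), so a real just_numpy would require
    # moving those imports out of numpy/__init__.py rather than simply
    # wrapping it like this.
    from numpy.core import (array, asarray, zeros, ones, empty,
                            arange, dot, ndarray, dtype)

The caveat is the real work: the import-time win only materializes once
numpy/__init__.py itself stops importing the extras eagerly.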
Re: [Numpy-discussion] "import numpy" performance
On Jul 3, 2012, at 12:46 AM, David Cournapeau wrote:
> It is indeed irrelevant to your end goal, but it does affect the
> interpretation of what import_array does, and thus of your benchmark

Indeed.

> Focusing on polynomial seems the only sensible action. Except for
> test, all the other stuff seems difficult to change without breaking
> anything.

I confirm that when I comment out numpy/__init__.py's "import polynomial"
then the import time for numpy.core.multiarray goes from

    0.084u 0.031s 0:00.11 100.0%  0+0k 0+0io 0pf+0w

to

    0.058u 0.028s 0:00.08 87.5%  0+0k 0+0io 0pf+0w

numpy/polynomial imports:

    from polynomial import Polynomial
    from chebyshev import Chebyshev
    from legendre import Legendre
    from hermite import Hermite
    from hermite_e import HermiteE
    from laguerre import Laguerre

and there's no easy way to make these be lazy imports.

Strange! The bottom of hermite.py has:

    exec polytemplate.substitute(name='Hermite', nick='herm', domain='[-1,1]')

as well as similar code in laguerre.py, chebyshev.py, hermite_e.py,
and polynomial.py.

I bet there's a lot of overhead generating and exec'ing
those for each import!

Andrew
da...@dalkescientific.com
Re: [Numpy-discussion] "import numpy" performance
On Jul 3, 2012, at 12:21 AM, Nathaniel Smith wrote:
> Yes, but for a proper benchmark we need to compare this to the number
> that we would get with some other implementation... I'm assuming you
> aren't proposing we just delete the docstrings :-).

I suspect that we have a different meaning of the term 'benchmark'. A
benchmark first establishes the baseline by which future
implementations are measured. Which is what I did. Once there are
changes, the benchmark, rerun, helps judge the usefulness of those
changes. This I did not do. I do not believe that a benchmark requires
the changed code as well before it can be considered a "proper
benchmark".

>> This says that 'add_newdocs', which is imported from
>> numpy.core.multiarray (though there may be other importers)
>> takes 0.038 seconds to go through __import__, including
>> all of its children module imports.
>
> There are no "children modules", all these modules refer to each
> other, and you're assuming that whichever module you happen to load
> first is responsible for all the other modules it happens to
> reference.

While I believe there is an "import tree" analogous to a "call tree",
and Python's import scheme helps ensure that it's a DAG (so that
'children modules' has a real meaning), you are correct in identifying
that I was only pointing out the first parent, and not all of the
parents.

add_newdocs is the first module to import 'numpy.lib', but after
further testing (I stubbed out the import and made a fake function), I
see that other modules import numpy.lib and there's no measurable
performance increase. I therefore retract my proposal to move the
documentation which is currently in add_newdocs into the C code.

>> With instrumentation I found that 0.083s of the 0.119s
>> is spent loading numpy.core.multiarray.
>
> The number 0.083 doesn't appear anywhere in that profile you pasted,
> so I don't know where this comes from...

I did not save the output of the run which I used for my original
email. It's easy to generate, so I just ran it again.

Cheers,
Andrew
da...@dalkescientific.com
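(A note on the "stubbed out the import" test mentioned above: one
common way to do this kind of experiment -- shown here as a
hypothetical sketch, not Andrew's actual code -- is to plant a
stand-in object in sys.modules before the real import can happen:)

    import sys
    import time

    class _Stub(object):
        # Empty __all__ keeps "from numpy.lib import *" harmless;
        # any other attribute lookup returns a do-nothing function,
        # so callers of stubbed-out functions keep running.
        __all__ = []
        def __getattr__(self, name):
            return lambda *args, **kwargs: None

    # Python consults sys.modules first, so the real numpy.lib
    # is never loaded and its import cost disappears.
    sys.modules['numpy.lib'] = _Stub()

    t0 = time.time()
    import numpy
    print time.time() - t0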
Re: [Numpy-discussion] "import numpy" performance
On Mon, Jul 2, 2012 at 11:15 PM, Andrew Dalke wrote:
> On Jul 2, 2012, at 11:38 PM, Fernando Perez wrote:
>> No, that's the wrong thing to test, because it effectively amounts to
>> 'import numpy', since the numpy __init__ file is still executed. As
>> David indicated, you must import multiarray.so by itself.
>
> I understand that clarification. However, it does not affect me.

It is indeed irrelevant to your end goal, but it does affect the
interpretation of what import_array does, and thus of your benchmark.

polynomial is definitely the big new overhead (I don't remember it
being significant last time I optimized numpy import times); it is
roughly 30% of the total cost of importing numpy (95 -> 70 ms total
time, of which numpy went from 70 to 50 ms). Then ctypeslib and test
are the two other significant ones.

I use profile_imports.py from bzr as follows:

    import sys
    import profile_imports
    profile_imports.install()
    import numpy
    profile_imports.log_stack_info(sys.stdout)

Focusing on polynomial seems the only sensible action. Except for
test, all the other stuff seems difficult to change without breaking
anything.

David
Re: [Numpy-discussion] "import numpy" performance
On Mon, Jul 2, 2012 at 11:15 PM, Andrew Dalke wrote:
> On Jul 2, 2012, at 11:38 PM, Fernando Perez wrote:
>> No, that's the wrong thing to test, because it effectively amounts to
>> 'import numpy', since the numpy __init__ file is still executed. As
>> David indicated, you must import multiarray.so by itself.
>
> I understand that clarification. However, it does not affect me.
>
> I do "import rdkit.Chem". This is all I really care about.
>
> That imports "rdkit.Chem.rdchem" which is a shared library.
>
> That shared library calls the C function/macro "import_array", which
> appears to be:
>
>     #define import_array() { if (_import_array() < 0) {PyErr_Print(); PyErr_SetString(PyExc_ImportError, "numpy.core.multiarray failed to import"); } }
>
> The _import_array looks to be defined via
> numpy/core/code_generators/generate_numpy_api.py
> which contains
>
>     static int
>     _import_array(void)
>     {
>       int st;
>       PyObject *numpy = PyImport_ImportModule("numpy.core.multiarray");
>       PyObject *c_api = NULL;
>       ...
>
> Thus, I don't see any way that I can import 'multiarray' directly,
> because the underlying C code is the one which imports
> 'numpy.core.multiarray' and by design it is inaccessible to change
> from Python code.
>
> Thus, the correct reference benchmark is "import numpy.core.multiarray"

Oh, I see. I withdraw my comment about how you shouldn't import
numpy.core.multiarray directly; I forgot import_array() does that.

-n
Re: [Numpy-discussion] "import numpy" performance
On Mon, Jul 2, 2012 at 3:15 PM, Andrew Dalke wrote:
> Thus, I don't see any way that I can import 'multiarray' directly,
> because the underlying C code is the one which imports
> 'numpy.core.multiarray' and by design it is inaccessible to change
> from Python code.

I was just referring to how David was benchmarking the cost of
multiarray in isolation, which can indeed be done, and is useful for
understanding the cumulative effect. Indeed, for your case it's the sum
total of what import_array does that ultimately matters, but it's still
useful to be able to understand these pieces in isolation.

Cheers,
f
Re: [Numpy-discussion] "import numpy" performance
On Mon, Jul 2, 2012 at 10:44 PM, Andrew Dalke wrote:
> On Jul 2, 2012, at 10:34 PM, Nathaniel Smith wrote:
>> I don't have any opinion on how acceptable this would be, but I also
>> don't see a benchmark showing how much this would help?
>
> The profile output was lower in that email. The relevant line is
>
>     0.038 add_newdocs (numpy.core.multiarray)

Yes, but for a proper benchmark we need to compare this to the number
that we would get with some other implementation... I'm assuming you
aren't proposing we just delete the docstrings :-).

> This says that 'add_newdocs', which is imported from
> numpy.core.multiarray (though there may be other importers)
> takes 0.038 seconds to go through __import__, including
> all of its children module imports.

There are no "children modules"; all these modules refer to each
other, and you're assuming that whichever module you happen to load
first is responsible for all the other modules it happens to
reference.

>     add_newdocs: 0.067 (numpy.core.multiarray)
>     numpy.lib: 0.061 (add_newdocs)

I'm pretty sure that what these two lines say is that the actual
add_newdocs code only takes 0.006 seconds?

>     numpy.testing: 0.041 (numpy.core.numeric)

However, it does look like numpy.testing is responsible for something
like 35% of our startup overhead and for pulling in a ton of extra
modules (with associated disk seeks), which is pretty dumb.

>>> With instrumentation I found that 0.083s of the 0.119s
>>> is spent loading numpy.core.multiarray.

The number 0.083 doesn't appear anywhere in that profile you pasted,
so I don't know where this comes from... Anyway, it sounds like the
answer is that importing numpy.core.multiarray doesn't take that long;
you're measuring the total time to do 'import numpy', and it just
happens that numpy.core.multiarray is the first module you load.

(BTW, you probably shouldn't be importing numpy.core.multiarray
directly at all, just do 'import numpy'.)

-N
Re: [Numpy-discussion] "import numpy" performance
On Jul 2, 2012, at 11:38 PM, Fernando Perez wrote:
> No, that's the wrong thing to test, because it effectively amounts to
> 'import numpy', since the numpy __init__ file is still executed. As
> David indicated, you must import multiarray.so by itself.

I understand that clarification. However, it does not affect me.

I do "import rdkit.Chem". This is all I really care about.

That imports "rdkit.Chem.rdchem" which is a shared library.

That shared library calls the C function/macro "import_array", which
appears to be:

    #define import_array() { if (_import_array() < 0) {PyErr_Print(); PyErr_SetString(PyExc_ImportError, "numpy.core.multiarray failed to import"); } }

The _import_array looks to be defined via
numpy/core/code_generators/generate_numpy_api.py
which contains

    static int
    _import_array(void)
    {
      int st;
      PyObject *numpy = PyImport_ImportModule("numpy.core.multiarray");
      PyObject *c_api = NULL;
      ...

Thus, I don't see any way that I can import 'multiarray' directly,
because the underlying C code is the one which imports
'numpy.core.multiarray' and by design it is inaccessible to change
from Python code.

Thus, the correct reference benchmark is "import numpy.core.multiarray".

Unless I'm lost in a set of header files?

Cheers,
Andrew
da...@dalkescientific.com
Re: [Numpy-discussion] "import numpy" performance
On Mon, Jul 2, 2012 at 10:06 PM, Robert Kern wrote:
> On Mon, Jul 2, 2012 at 9:43 PM, Benjamin Root wrote:
>> On Mon, Jul 2, 2012 at 4:34 PM, Nathaniel Smith wrote:
>>> I think this ship has sailed, but it'd be worth looking into lazy
>>> importing, where 'numpy.fft' isn't actually imported until someone
>>> starts using it. There are a bunch of libraries that do this, and one
>>> would have to fiddle to get compatibility with all the different
>>> python versions and make sure you're not killing performance (might
>>> have to be in C), but something along the lines of
>>>
>>>     class _FFTModule(object):
>>>         def __getattribute__(self, name):
>>>             mod = importlib.import_module("numpy.fft")
>>>             _FFTModule.__getattribute__ = mod.__getattribute__
>>>             return getattr(mod, name)
>>>     fft = _FFTModule()
>>
>> Not sure how this would impact projects like ipython that do
>> tab-completion support, but I know that that would drive me nuts in my
>> basic tab-completion setup I have for my regular python terminal. Of
>> course, in the grand scheme of things, that really isn't all that
>> important, I don't think.
>
> We used to do it for scipy. It did interfere with tab completion. It
> did drive many people nuts.

Sounds like a bug in your old code, or else the REPLs have gotten
better? I just pasted the above code into both ipython and python
prompts, and typing 'fft.<TAB>' worked fine in both cases. dir(fft)
works first try as well.

(If you try this, don't forget to 'import importlib' first, and note
importlib is 2.7+ only. Obviously importlib is not necessary, but it
makes the minimal example less tedious.)

-n
Re: [Numpy-discussion] "import numpy" performance
On Jul 2, 2012, at 10:34 PM, Nathaniel Smith wrote:
> I don't have any opinion on how acceptable this would be, but I also
> don't see a benchmark showing how much this would help?

The profile output was lower in that email. The relevant line is

    0.038 add_newdocs (numpy.core.multiarray)

This says that 'add_newdocs', which is imported from
numpy.core.multiarray (though there may be other importers),
takes 0.038 seconds to go through __import__, including
all of its children module imports.

I have attached my import profile script. It has only minor changes
since the one I posted on this list 4 years ago. Its output is here,
showing the import dependency tree first, and then the list of the
slowest modules to import; each line is "module: cumulative import
time (first importer)".

== Tree ==

    rdkit: 0.150 (None)
    os: 0.000 (rdkit)
    sys: 0.000 (rdkit)
    exceptions: 0.000 (rdkit)
    sqlite3: 0.003 (pyPgSQL)
    dbapi2: 0.002 (sqlite3)
    datetime: 0.001 (dbapi2)
    time: 0.000 (dbapi2)
    _sqlite3: 0.001 (dbapi2)
    cDataStructs: 0.008 (pyPgSQL)
    rdkit.Geometry: 0.003 (pyPgSQL)
    rdGeometry: 0.003 (rdkit.Geometry)
    PeriodicTable: 0.002 (pyPgSQL)
    re: 0.000 (PeriodicTable)
    rdchem: 0.116 (pyPgSQL)
    numpy.core.multiarray: 0.109 (rdchem)
    numpy.__config__: 0.000 (numpy.core.multiarray)
    version: 0.000 (numpy.core.multiarray)
    _import_tools: 0.000 (numpy.core.multiarray)
    add_newdocs: 0.067 (numpy.core.multiarray)
    numpy.lib: 0.061 (add_newdocs)
    info: 0.000 (numpy.lib)
    numpy.version: 0.000 (numpy.lib)
    type_check: 0.053 (numpy.lib)
    numpy.core.numeric: 0.053 (type_check)
    multiarray: 0.001 (numpy.core.numeric)
    umath: 0.000 (numpy.core.numeric)
    _internal: 0.004 (numpy.core.numeric)
    warnings: 0.000 (_internal)
    numpy.compat: 0.000 (_internal)
    _inspect: 0.000 (numpy.compat)
    types: 0.000 (_inspect)
    py3k: 0.000 (numpy.compat)
    numerictypes: 0.001 (numpy.core.numeric)
    __builtin__: 0.000 (numerictypes)
    _sort: 0.000 (numpy.core.numeric)
    numeric: 0.003 (numpy.core.numeric)
    _dotblas: 0.000 (numeric)
    arrayprint: 0.001 (numeric)
    fromnumeric: 0.000 (arrayprint)
    cPickle: 0.001 (numeric)
    copy_reg: 0.000 (cPickle)
    cStringIO: 0.000 (cPickle)
    defchararray: 0.001 (numpy.core.numeric)
    numpy: 0.000 (defchararray)
    records: 0.000 (numpy.core.numeric)
    memmap: 0.000 (numpy.core.numeric)
    scalarmath: 0.000 (numpy.core.numeric)
    numpy.core.umath: 0.000 (scalarmath)
    function_base: 0.000 (numpy.core.numeric)
    machar: 0.000 (numpy.core.numeric)
    numpy.core.fromnumeric: 0.000 (machar)
    getlimits: 0.000 (numpy.core.numeric)
    shape_base: 0.000 (numpy.core.numeric)
    numpy.testing: 0.041 (numpy.core.numeric)
    unittest: 0.039 (numpy.testing)
    result: 0.002 (unittest)
    traceback: 0.000 (result)
    linecache: 0.000 (traceback)
    StringIO: 0.000 (result)
    errno: 0.000 (StringIO)
    : 0.000 (result)
    functools: 0.001 (result)
    _functools: 0.000 (functools)
    case: 0.035 (unittest)
    difflib: 0.034 (case)
    heapq: 0.031 (difflib)
    itertools: 0.029 (heapq)
    operator: 0.001 (heapq)
    bisect: 0.001 (heapq)
    _bisect: 0.000 (bisect)
    _heapq: 0.000 (heapq)
    collections: 0.001 (difflib)
    _abcoll: 0.000 (collections)
    _collections: 0.000 (collections)
    keyword: 0.000 (collections)
    thread: 0.000 (collections)
    pprint: 0.000 (case)
    util: 0.000 (case)
    suite: 0.000 (unittest)
    loader: 0.001 (unittest)
    fnmatch: 0.000 (loader)
    main: 0.001 (unittest)
    signals: 0.001 (main)
    signal: 0.000 (signals)
    weakref: 0.001 (signals)
    UserDict: 0.000 (weakref)
    _weakref: 0.000 (weakref)
    _weakrefset: 0.000 (weakref)
    runner: 0.000 (unittest)
    decorators: 0.001 (numpy.testing)
    numpy.testing.utils: 0.001 (decorators)
    nosetester: 0.000 (numpy.testing.utils)
    utils: 0.000 (numpy.testing)
    numpytest: 0.000 (numpy.testing)
    ufunclike: 0.000 (type_check)
    index_tricks: 0.003 (numpy.lib)
    numpy.core.numerictypes: 0.000 (index_tricks)
    math: 0.000 (index_tricks)
    numpy.core: 0.000 (index_tricks)
    numpy.lib.twodim_base: 0.000 (index_tricks)
    _compiled_base: 0.000 (index_tricks)
    arraysetops: 0.001 (index_tricks)
    numpy.lib.utils: 0.001 (arraysetops)
    numpy.matrixlib: 0.001 (index_tricks)
    defmatrix: 0.001 (numpy.matrixlib)
    numpy.lib._compiled_base: 0.000 (index_tricks)
    stride_tricks: 0.000 (numpy.lib)
    twodim_base: 0.000 (numpy.lib)
    scimath: 0.000 (numpy.lib)
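(Andrew's attached script itself isn't reproduced in the archive. As a
hedged illustration of the general technique such import profilers use
-- not his actual code -- one can wrap the builtin __import__ and time
each module load, indenting by nesting depth; the real script
additionally aggregates the raw output into a tree like the one above:)

    import __builtin__
    import time

    _real_import = __builtin__.__import__
    _depth = [0]

    def _timed_import(name, *args, **kwargs):
        # Time every call to __import__. Modules already cached in
        # sys.modules come back in ~0.000s, so first loads stand out.
        _depth[0] += 1
        t0 = time.time()
        try:
            return _real_import(name, *args, **kwargs)
        finally:
            _depth[0] -= 1
            print '%s%s: %.3f' % ('  ' * _depth[0], name,
                                  time.time() - t0)

    __builtin__.__import__ = _timed_import

    import numpy  # prints a (noisy) per-import timing tree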
Re: [Numpy-discussion] "import numpy" performance
On Mon, Jul 2, 2012 at 2:26 PM, Andrew Dalke wrote:
> so the relevant timing test is more likely:
>
>     % time python -c 'import numpy.core.multiarray'
>     0.086u 0.031s 0:00.12 91.6%  0+0k 0+0io 0pf+0w

No, that's the wrong thing to test, because it effectively amounts to
'import numpy', since the numpy __init__ file is still executed. As
David indicated, you must import multiarray.so by itself.

> I do not know how to run the timing test you did, as I get:
>
>     % python -c "import multiarray"
>     Traceback (most recent call last):
>       File "<string>", line 1, in <module>
>     ImportError: No module named multiarray

You just have to cd to the directory where multiarray.so lives. I get
the same numbers as David:

    longs[core]> time python -c ''
    real    0m0.038s
    user    0m0.032s
    sys     0m0.000s

    longs[core]> time python -c 'import multiarray'
    real    0m0.035s
    user    0m0.020s
    sys     0m0.012s

    longs[core]> pwd
    /usr/lib/python2.7/dist-packages/numpy/core

Cheers,
f
Re: [Numpy-discussion] "import numpy" performance
On Jul 2, 2012, at 10:33 PM, David Cournapeau wrote:
> On Mon, Jul 2, 2012 at 8:17 PM, Andrew Dalke wrote:
>> In July of 2008 I started a thread about how "import numpy"
>> was noticeably slow for one of my customers. ...
>> I managed to get the import time down from 0.21 seconds to
>> 0.08 seconds.
>
> I will answer your other remarks later, but 0.21 sec to import
> numpy is very slow, especially on a recent computer. It is 0.095 sec
> on my mac, and 0.075 sec on a linux VM on the same computer (both hot
> cache of course).

That quote was a historical review from 4 years ago. I described the
problems I had then, the work-around solution I implemented, and my
additional work to see if I could identify ways which would have kept
me from needing to find a work-around solution.

I then described why I have not worked on this problem for the last
four years, and what has changed to make me interested in it again.
That included current details, such as how "import numpy" with a warm
cache takes 0.083 seconds on my Mac.

> importing multiarray.so only is negligible for me (i.e. difference
> between python -c "import multiarray" and python -c "" is
> statistically insignificant).

The NumPy initialization is being done in C++ code through
"import_array()". That C function does (among other things)

    PyObject *numpy = PyImport_ImportModule("numpy.core.multiarray");

so the relevant timing test is more likely:

    % time python -c 'import numpy.core.multiarray'
    0.086u 0.031s 0:00.12 91.6%  0+0k 0+0io 0pf+0w
    % time python -c 'import numpy.core.multiarray'
    0.083u 0.031s 0:00.11 100.0%  0+0k 0+0io 0pf+0w
    % time python -c 'import numpy.core.multiarray'
    0.083u 0.030s 0:00.12 91.6%  0+0k 0+0io 0pf+0w

I do not know how to run the timing test you did, as I get:

    % python -c "import multiarray"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ImportError: No module named multiarray

> I would check external factors, like the size of your sys.path as well.

I have checked that, and inspected the output of python -v -v.

Andrew
da...@dalkescientific.com
Re: [Numpy-discussion] "import numpy" performance
On 02.07.2012 21:17, Andrew Dalke wrote:
[clip]
> 1) remove "add_newdocs" and put the docstrings in the C code
>    'add_newdocs' still needs to be there,

The docstrings need to be in an easily parseable format, because of the
online documentation editor. Keeping the current format may be the
easiest, as that already works. Moving them into the middle of other C
code won't do, but a header file generated at build time, for example,
should work. This is how it's currently done with the ufunc docstrings,
and it should work for everything else as well.

The commit statistics for add_newdocs.py are somewhat misleading ---
since 2008, many of the documentation edits went in via the online
editor, and these only show up in a single large commit, usually before
releases.

--
Pauli Virtanen
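(As a hedged sketch of what such a build-time step might look like --
the script and file names here are hypothetical, and the real ufunc
version lives under numpy/core/code_generators -- the docstrings can
stay in an easily parsed Python file while being compiled into C:)

    # gen_docstrings.py -- hypothetical build step: turn (name, docstring)
    # pairs, kept in an editable Python file, into a C header of macros
    # that the C sources can use for tp_doc / method docstrings.
    DOCSTRINGS = [
        ('ndarray_all',
         'a.all(axis=None, out=None)\n\n'
         'Returns True if all elements evaluate to True.'),
        # ... one entry per current add_newdoc() call ...
    ]

    def c_escape(text):
        # Escape backslashes, quotes and newlines for a C string literal.
        return (text.replace('\\', '\\\\')
                    .replace('"', '\\"')
                    .replace('\n', '\\n'))

    with open('generated_docstrings.h', 'w') as header:
        for name, doc in DOCSTRINGS:
            header.write('#define DOC_%s "%s"\n'
                         % (name.upper(), c_escape(doc)))

This keeps the "easier editing without a re-compile" property for the
online editor's output while removing the runtime add_newdocs import.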
Re: [Numpy-discussion] "import numpy" performance
On Mon, Jul 2, 2012 at 9:43 PM, Benjamin Root wrote:
> On Mon, Jul 2, 2012 at 4:34 PM, Nathaniel Smith wrote:
>> I think this ship has sailed, but it'd be worth looking into lazy
>> importing, where 'numpy.fft' isn't actually imported until someone
>> starts using it. There are a bunch of libraries that do this, and one
>> would have to fiddle to get compatibility with all the different
>> python versions and make sure you're not killing performance (might
>> have to be in C), but something along the lines of
>>
>>     class _FFTModule(object):
>>         def __getattribute__(self, name):
>>             mod = importlib.import_module("numpy.fft")
>>             _FFTModule.__getattribute__ = mod.__getattribute__
>>             return getattr(mod, name)
>>     fft = _FFTModule()
>
> Not sure how this would impact projects like ipython that do
> tab-completion support, but I know that that would drive me nuts in my
> basic tab-completion setup I have for my regular python terminal. Of
> course, in the grand scheme of things, that really isn't all that
> important, I don't think.

We used to do it for scipy. It did interfere with tab completion. It
did drive many people nuts.

--
Robert Kern
Re: [Numpy-discussion] "import numpy" performance
On Mon, Jul 2, 2012 at 4:34 PM, Nathaniel Smith wrote:
> On Mon, Jul 2, 2012 at 8:17 PM, Andrew Dalke wrote:
>> In this email I propose a few changes which I think are minor
>> and which don't really affect the external NumPy API but which
>> I think could improve the "import numpy" performance by at
>> least 40%. This affects me because I and my clients use a
>> chemistry toolkit which uses only NumPy arrays, and where
>> we run short programs often on the command-line.
>>
>> In July of 2008 I started a thread about how "import numpy"
>> was noticeably slow for one of my customers. They had
>> chemical analysis software, often even run on a single
>> molecular structure using command-line tools, and the
>> several invocations with 0.1 seconds overhead was one of
>> the dominant costs even when numpy wasn't needed.
>>
>> I fixed most of their problems by deferring numpy imports
>> until needed. I remember well the Steve Jobs anecdote at
>> http://folklore.org/StoryView.py?project=Macintosh&story=Saving_Lives.txt
>> and spent another day of my time in 2008 to identify the
>> parts of the numpy import sequence which seemed excessive.
>> I managed to get the import time down from 0.21 seconds to
>> 0.08 seconds.
>>
>> Very little of that made it into NumPy.
>>
>> The three biggest changes I would like are:
>>
>> 1) remove "add_newdocs" and put the docstrings in the C code
>>    'add_newdocs' still needs to be there,
>>
>> The code says:
>>
>>     # This is only meant to add docs to objects defined in C-extension modules.
>>     # The purpose is to allow easier editing of the docstrings without
>>     # requiring a re-compile.
>>
>> However, the change log shows that there are relatively few commits
>> to this module
>>
>>     Year    Number of commits
>>     ====    =================
>>     2012     8
>>     2011    62
>>     2010     9
>>     2009    18
>>     2008    17
>>
>> so I propose moving the docstrings to the C code, and perhaps
>> leaving 'add_newdocs' there, but only used when testing new
>> docstrings.
>
> I don't have any opinion on how acceptable this would be, but I also
> don't see a benchmark showing how much this would help?
>
>> 2) Don't optimistically assume that all submodules are
>>    needed. For example, some current code uses
>>
>>     import numpy
>>     numpy.fft.ifft
>>
>> (See a real-world example at
>> http://stackoverflow.com/questions/10222812/python-numpy-fft-and-inverse-fft
>> )
>>
>> IMO, this optimizes the needs of the interactive-shell
>> NumPy author over the needs of the many-fold more people
>> who don't spend their time in the REPL and/or don't need
>> those extra features added to every NumPy startup. Please
>> bear in mind that NumPy users of the first category will
>> be active on the mailing list, go to SciPy conferences,
>> etc., while members of the second category are less visible.
>>
>> I recognize that this is backwards incompatible, and will
>> not change. However, I understand that "NumPy 2.0" is a
>> glimmer in the future, which might be a natural place for
>> a transition to the more standard Python style of
>>
>>     from numpy import fft
>>
>> Personally, I think the documentation now (if it doesn't
>> already) should transition to use this form.
>
> I think this ship has sailed, but it'd be worth looking into lazy
> importing, where 'numpy.fft' isn't actually imported until someone
> starts using it. There are a bunch of libraries that do this, and one
> would have to fiddle to get compatibility with all the different
> python versions and make sure you're not killing performance (might
> have to be in C), but something along the lines of
>
>     class _FFTModule(object):
>         def __getattribute__(self, name):
>             mod = importlib.import_module("numpy.fft")
>             _FFTModule.__getattribute__ = mod.__getattribute__
>             return getattr(mod, name)
>     fft = _FFTModule()

Not sure how this would impact projects like ipython that do
tab-completion support, but I know that that would drive me nuts in my
basic tab-completion setup I have for my regular python terminal. Of
course, in the grand scheme of things, that really isn't all that
important, I don't think.

Ben Root
Re: [Numpy-discussion] "import numpy" performance
On Mon, Jul 2, 2012 at 8:17 PM, Andrew Dalke wrote:
> In this email I propose a few changes which I think are minor
> and which don't really affect the external NumPy API but which
> I think could improve the "import numpy" performance by at
> least 40%. This affects me because I and my clients use a
> chemistry toolkit which uses only NumPy arrays, and where
> we run short programs often on the command-line.
>
> In July of 2008 I started a thread about how "import numpy"
> was noticeably slow for one of my customers. They had
> chemical analysis software, often even run on a single
> molecular structure using command-line tools, and the
> several invocations with 0.1 seconds overhead was one of
> the dominant costs even when numpy wasn't needed.
>
> I fixed most of their problems by deferring numpy imports
> until needed. I remember well the Steve Jobs anecdote at
> http://folklore.org/StoryView.py?project=Macintosh&story=Saving_Lives.txt
> and spent another day of my time in 2008 to identify the
> parts of the numpy import sequence which seemed excessive.
> I managed to get the import time down from 0.21 seconds to
> 0.08 seconds.
>
> Very little of that made it into NumPy.
>
> The three biggest changes I would like are:
>
> 1) remove "add_newdocs" and put the docstrings in the C code
>    'add_newdocs' still needs to be there,
>
> The code says:
>
>     # This is only meant to add docs to objects defined in C-extension modules.
>     # The purpose is to allow easier editing of the docstrings without
>     # requiring a re-compile.
>
> However, the change log shows that there are relatively few commits
> to this module
>
>     Year    Number of commits
>     ====    =================
>     2012     8
>     2011    62
>     2010     9
>     2009    18
>     2008    17
>
> so I propose moving the docstrings to the C code, and perhaps
> leaving 'add_newdocs' there, but only used when testing new
> docstrings.

I don't have any opinion on how acceptable this would be, but I also
don't see a benchmark showing how much this would help?

> 2) Don't optimistically assume that all submodules are
>    needed. For example, some current code uses
>
>     import numpy
>     numpy.fft.ifft
>
> (See a real-world example at
> http://stackoverflow.com/questions/10222812/python-numpy-fft-and-inverse-fft
> )
>
> IMO, this optimizes the needs of the interactive-shell
> NumPy author over the needs of the many-fold more people
> who don't spend their time in the REPL and/or don't need
> those extra features added to every NumPy startup. Please
> bear in mind that NumPy users of the first category will
> be active on the mailing list, go to SciPy conferences,
> etc., while members of the second category are less visible.
>
> I recognize that this is backwards incompatible, and will
> not change. However, I understand that "NumPy 2.0" is a
> glimmer in the future, which might be a natural place for
> a transition to the more standard Python style of
>
>     from numpy import fft
>
> Personally, I think the documentation now (if it doesn't
> already) should transition to use this form.

I think this ship has sailed, but it'd be worth looking into lazy
importing, where 'numpy.fft' isn't actually imported until someone
starts using it. There are a bunch of libraries that do this, and one
would have to fiddle to get compatibility with all the different
python versions and make sure you're not killing performance (might
have to be in C), but something along the lines of

    class _FFTModule(object):
        def __getattribute__(self, name):
            mod = importlib.import_module("numpy.fft")
            _FFTModule.__getattribute__ = mod.__getattribute__
            return getattr(mod, name)
    fft = _FFTModule()

> 3) Especially: don't always import 'numpy.testing'
>
> As far as I can tell, automatic import of this module
> is not needed, so it is pure overhead for the vast majority
> of NumPy users. Unfortunately, there's a large number
> of user-facing 'test' and 'bench' bound methods acting
> as functions:
>
>     from numpy.testing import Tester
>     test = Tester().test
>     bench = Tester().bench
>
> They seem rather pointless to me but could be replaced
> with per-module functions like
>
>     def test(...):
>         from numpy.testing import Tester
>         Tester().test(...)
>
> I have not worried about numpy import performance for
> 4 years. While I have been developing scientific software
> for 20 years, and in Python for 15 years, it has been
> in areas of biology and chemistry which don't use arrays.
> I use numpy for a day about once every two years, and
> so far I have had no reason to use scipy.
>
> This has changed.
>
> I talked with one of my clients last week. They (and I)
> use a chemistry toolkit called "RDKit". RDKit uses
> numpy as a way to store coordinate data for molecules.
> I checked with the package author and he confirms:
>
>     yeah, it's just using the homogeneous array most of the time.
>
> My client complained about RDKit's high startup cost,
> due to the NumPy dependency ...
Re: [Numpy-discussion] "import numpy" performance
On Mon, Jul 2, 2012 at 8:17 PM, Andrew Dalke wrote:
> In this email I propose a few changes which I think are minor
> and which don't really affect the external NumPy API but which
> I think could improve the "import numpy" performance by at
> least 40%. This affects me because I and my clients use a
> chemistry toolkit which uses only NumPy arrays, and where
> we run short programs often on the command-line.
>
> In July of 2008 I started a thread about how "import numpy"
> was noticeably slow for one of my customers. They had
> chemical analysis software, often even run on a single
> molecular structure using command-line tools, and the
> several invocations with 0.1 seconds overhead was one of
> the dominant costs even when numpy wasn't needed.
>
> I fixed most of their problems by deferring numpy imports
> until needed. I remember well the Steve Jobs anecdote at
> http://folklore.org/StoryView.py?project=Macintosh&story=Saving_Lives.txt
> and spent another day of my time in 2008 to identify the
> parts of the numpy import sequence which seemed excessive.
> I managed to get the import time down from 0.21 seconds to
> 0.08 seconds.

I will answer your other remarks later, but 0.21 sec to import numpy
is very slow, especially on a recent computer. It is 0.095 sec on my
mac, and 0.075 sec on a linux VM on the same computer (both hot cache,
of course).

Importing multiarray.so only is negligible for me (i.e. the difference
between python -c "import multiarray" and python -c "" is
statistically insignificant).

I would check external factors, like the size of your sys.path, as
well.

David