[Numpy-discussion] Log Arrays
Hi,

For precision reasons, I almost always need to work with arrays whose elements are log values. My thought was that it would be really neat to have a 'logarray' class implemented in C or as a subclass of the standard array class. Here is a sample of how I'd like to work with these objects:

x = array([-2,-2,-3], base=2)
y = array([-1,-2,-inf], base=2)
z = x + y
z
array([-0.415037499279, -1.0, -3])
z = x * y
z
array([-3, -4, -inf])
z[:2].sum()
-2.41503749928

This would do a lot for the code I write, and some of the numerical stability issues would be handled more appropriately. For example, the logsum function is frequently implemented as:

log(x + y) == log(x) + log(1 + exp(log(y) - log(x)))   when log(x) >= log(y)

So the goal of the above function is to return log(x + y) using only the logarithms of x and y.

Does this sound like a good idea?
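For concreteness, the pairwise operation described above can be sketched in a few lines of plain NumPy; a minimal sketch, assuming natural logs rather than the base-2 convention of the proposal (logsum_pair is an illustrative name):

    import numpy as np

    def logsum_pair(lx, ly):
        # Return log(x + y) given lx = log(x) and ly = log(y), using
        # log(x + y) = log(x) + log1p(exp(log(y) - log(x))),
        # with the arguments ordered so that the exp() never overflows.
        if lx == -np.inf:       # x == 0, so log(x + y) == log(y)
            return ly
        if lx < ly:             # ensure lx >= ly
            lx, ly = ly, lx
        return lx + np.log1p(np.exp(ly - lx))

For example, logsum_pair(np.log(0.25), np.log(0.5)) gives log(0.75), matching the first element of the z = x + y example above up to the change of base.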
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 12:26 AM, T J [EMAIL PROTECTED] wrote:
> x = array([-2,-2,-3], base=2)
> y = array([-1,-2,-inf], base=2)
> z = x + y
> z
> array([-0.415037499279, -1.0, -3])
> z = x * y
> z
> array([-3, -4, -inf])
> z[:2].sum()
> -2.41503749928

Whoops, s/array/logarray/
[Numpy-discussion] First steps with f2py and first problems...
Hi,

I've tried to follow the example given at:
http://www.scipy.org/Cookbook/Theoretical_Ecology/Hastings_and_Powell
but I've got errors when compiling the fortran file:

------ errors ------
12:53 loic:~ % f2py -c -m hastings hastings.f90 --fcompiler=gnu95
running build
running config_cc
unifing config_cc, config, build_clib, build_ext, build commands --compiler options
running config_fc
unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
running build_src
building extension "hastings" sources
f2py options: []
f2py: /tmp/tmpDRL9Gh/src.linux-i686-2.5/hastingsmodule.c
creating /tmp/tmpDRL9Gh
creating /tmp/tmpDRL9Gh/src.linux-i686-2.5
Reading fortran codes...
        Reading file 'hastings.f90' (format:free)
Post-processing...
        Block: hastings
                Block: model
        Block: fweb
Post-processing (stage 2)...
        Block: hastings
                Block: unknown_interface
                        Block: model
        Block: fweb
Building modules...
        Building module "hastings"...
                Constructing F90 module support for "model"...
                        Variables: a1 a2 b1 b2 d2 d1
                Constructing wrapper function "model.fweb"...
                        yprime = fweb(y,t)
        Wrote C/API module "hastings" to file /tmp/tmpDRL9Gh/src.linux-i686-2.5/hastingsmodule.c
Traceback (most recent call last):
  File "/usr/bin/f2py", line 26, in <module>
    main()
  File "/usr/lib/python2.5/site-packages/numpy/f2py/f2py2e.py", line 558, in main
    run_compile()
  File "/usr/lib/python2.5/site-packages/numpy/f2py/f2py2e.py", line 545, in run_compile
    setup(ext_modules = [ext])
  File "/usr/lib/python2.5/site-packages/numpy/distutils/core.py", line 176, in setup
    return old_setup(**new_attr)
  File "/usr/lib/python2.5/distutils/core.py", line 151, in setup
    dist.run_commands()
  File "/usr/lib/python2.5/distutils/dist.py", line 974, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python2.5/distutils/dist.py", line 994, in run_command
    cmd_obj.run()
  File "/usr/lib/python2.5/distutils/command/build.py", line 113, in run
    self.run_command(cmd_name)
  File "/usr/lib/python2.5/distutils/cmd.py", line 333, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python2.5/distutils/dist.py", line 994, in run_command
    cmd_obj.run()
  File "/usr/lib/python2.5/site-packages/numpy/distutils/command/build_src.py", line 130, in run
    self.build_sources()
  File "/usr/lib/python2.5/site-packages/numpy/distutils/command/build_src.py", line 147, in build_sources
    self.build_extension_sources(ext)
  File "/usr/lib/python2.5/site-packages/numpy/distutils/command/build_src.py", line 256, in build_extension_sources
    sources = self.f2py_sources(sources, ext)
  File "/usr/lib/python2.5/site-packages/numpy/distutils/command/build_src.py", line 513, in f2py_sources
    ['-m',ext_name]+f_sources)
  File "/usr/lib/python2.5/site-packages/numpy/f2py/f2py2e.py", line 367, in run_main
    ret=buildmodules(postlist)
  File "/usr/lib/python2.5/site-packages/numpy/f2py/f2py2e.py", line 319, in buildmodules
    dict_append(ret[mnames[i]],rules.buildmodule(modules[i],um))
  File "/usr/lib/python2.5/site-packages/numpy/f2py/rules.py", line 1222, in buildmodule
    for l in '\n\n'.join(funcwrappers2)+'\n'.split('\n'):
TypeError: cannot concatenate 'str' and 'list' objects
zsh: exit 1     f2py -c -m hastings hastings.f90 --fcompiler=gnu95

------ configuration ------
I'm using debian testing, and I got the following information at the bottom of `f2py -h`:

Version: 2_4422
numpy Version: 1.0.4
Requires: Python 2.3 or higher.
License: NumPy license (see LICENSE.txt in the NumPy source code)

Have you got any clue to solve this problem?
--
LB
Re: [Numpy-discussion] First steps with f2py and first problems...
On Thu, May 8, 2008 2:06 pm, LB wrote:
> Hi,
>
> I've tried to follow the example given at:
> http://www.scipy.org/Cookbook/Theoretical_Ecology/Hastings_and_Powell
> but I've got errors when compiling the fortran file:
>
> 12:53 loic:~ % f2py -c -m hastings hastings.f90 --fcompiler=gnu95
> ...
>   File "/usr/lib/python2.5/site-packages/numpy/f2py/rules.py", line 1222, in buildmodule
>     for l in '\n\n'.join(funcwrappers2)+'\n'.split('\n'):
> TypeError: cannot concatenate 'str' and 'list' objects
> zsh: exit 1     f2py -c -m hastings hastings.f90 --fcompiler=gnu95
> ...
> Have you got any clue to solve this problem?

This issue is fixed in SVN. So, either use numpy from svn, or wait a bit until numpy 1.0.5 is released, or change the line #1222 in numpy/f2py/rules.py to

    for l in ('\n\n'.join(funcwrappers2)+'\n').split('\n'):

HTH,
Pearu
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 1:26 AM, T J [EMAIL PROTECTED] wrote:
> Hi, For precision reasons, I almost always need to work with arrays whose
> elements are log values. My thought was that it would be really neat to
> have a 'logarray' class implemented in C or as a subclass of the standard
> array class. Here is a sample of how I'd like to work with these objects:

Floating point numbers are essentially logs to base 2, i.e., integer exponent and mantissa between 1 and 2. What does using the log buy you?

Chuck
[Numpy-discussion] Problems with Trac
There seem to be several problems with the Trac system.

1) Submitting/modifying tickets yields "500 internal server error", even though the changes are made. This leads to duplicate tickets.

2) Mail isn't sent to the numpy-ticket mailing list.

3) It still always takes two tries to log in.

Chuck
[Numpy-discussion] Test
This is a test to see if the list is working for me.

-teo
Re: [Numpy-discussion] Log Arrays
2008/5/8 David Cournapeau [EMAIL PROTECTED]:
> On Thu, May 8, 2008 at 10:20 PM, Charles R Harris [EMAIL PROTECTED] wrote:
>> Floating point numbers are essentially logs to base 2, i.e., integer
>> exponent and mantissa between 1 and 2. What does using the log buy you?
>
> Precision, of course. I am not sure I understand the notation base = 2,
> but doing computation in the so-called log-domain is a must in many
> statistical computations. In particular, in machine learning with large
> datasets, it is common to have some points whose pdf is extremely small,
> and well below the precision of double. Typically, internally, the
> computations of my EM toolbox are done in the log domain, and use the
> logsumexp trick to compute likelihood given some data:

I'm not sure I'd describe this as precision, exactly; it's an issue of numerical range. But yes, I've come across this while doing maximum-likelihood fitting, and a coworker ran into it doing Bayesian statistics. It definitely comes up.

Is logarray really the way to handle it, though? It seems like you could probably get away with providing a logsum ufunc that did the right thing. I mean, what operations does one want to do on logarrays?

add -> logsum
subtract -> ?
multiply -> add
mean -> logsum/N
median -> median
exponentiate to recover normal-space values -> exp
str -> ?

I suppose numerical integration is also valuable, so it would help to have a numerical integrator that was reasonably smart about working with logs. (Though really it's just a question of rescaling and exponentiating, I think.)

A logarray type would help by keeping track of the fact that its contents were in log space, and would make expressions a little less cumbersome, I guess. How much effort would it take to write it so that it got all the corner cases right?

Anne
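On the integration point, a minimal sketch of a log-domain trapezoidal rule, under the assumption of 1-D inputs (logtrapz is an illustrative name, not an existing NumPy function):

    import numpy as np

    def logtrapz(log_y, x):
        # log of trapz(exp(log_y), x): rescale by the max so the
        # exponentials stay in range, integrate, then undo the rescaling.
        m = log_y.max()
        if m == -np.inf:        # integrand is identically zero
            return -np.inf
        return m + np.log(np.trapz(np.exp(log_y - m), x))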
Re: [Numpy-discussion] Log Arrays
2008/5/8 Charles R Harris [EMAIL PROTECTED]:
> What realistic probability is in the range exp(-1000) ?

Well, I ran into it while doing a maximum-likelihood fit - my early guesses had exceedingly low probabilities, but I needed to know which way the probabilities were increasing.

Anne
Re: [Numpy-discussion] Log Arrays
On Fri, May 9, 2008 at 1:04 AM, Charles R Harris [EMAIL PROTECTED] wrote:
> 1e-308 ?

Yes, all the time. I mean, if it was not, why would people bother with long double and co? Why would denormals exist? I don't consider the comparison with the number of particles to be really relevant here. We are talking about implementation problems.

> Yes, logs can be useful there, but I still fail to see any precision
> advantage. As I say, to all intents and purposes, IEEE floating point
> *is* a logarithm. You will see that if you look at how log is implemented
> in hardware. I'm less sure of the C floating point library because it
> needs to be portable.

a = np.array([-1000., -1001.])
np.log(np.sum(np.exp(a)))                 -> -inf
-1000 + np.log(np.sum([1 + np.exp(-1)]))  -> correct result

> What realistic probability is in the range exp(-1000) ?

Realistic as significant, none of course. Realistic as it happens in computation? Certainly. Typically, when you do clustering, and you have distant clusters, when you compute the probabilities of the points from one cluster relatively to another one, you can quickly get several units from the mean. Add a small variance, and you can quickly get (x-mu)**2/sigma**2 around 1000. You cannot just clip to 0, specially in online settings.

> If you have a hammer... It's portable, but there are wasted cpu cycles in
> there. If speed was important, I suspect you could do better writing a
> low level function that assumed IEEE doubles and twiddled the bits.

When you call a function in python, you waste a thousand cycles at every call. Yet, you use python, and not assembly :) The above procedure is extremely standard, and used in all robust implementations of machine learning algorithms I am aware of; it is implemented in HTK, a widely used toolkit for HMMs for speech recognition, for example. Twiddling bits is all fun, but it takes time and is extremely error prone. Also, I don't see what kind of method you have in mind here, exactly: how would you do a logsumexp algorithm with bit twiddling?

cheers,

David
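For concreteness, the logsumexp trick referred to above, as a small sketch (assumes a 1-D array; the shift by the maximum is what keeps the exponentials in range):

    import numpy as np

    def logsumexp(a):
        # Return log(sum(exp(a))) without underflowing to -inf.
        amax = a.max()
        if amax == -np.inf:     # every term is log(0)
            return -np.inf
        return amax + np.log(np.sum(np.exp(a - amax)))

    logsumexp(np.array([-1000., -1001.]))   # ~ -999.687 rather than -inf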
Re: [Numpy-discussion] Log Arrays
On Fri, May 9, 2008 at 1:25 AM, Charles R Harris [EMAIL PROTECTED] wrote:
> But to expand on David's computation... If the numbers are stored without
> using logs, i.e., as the exponentials, then the sum is of the form:
>
> x_1*2**y_1 + ... + x_i*2**y_i

You missed the part on parametric models: in parametric settings, your x_i are often exponential, so it makes sense to compute in the log domain (you don't compute more log/exp than the naive implementation). Of course, if we were not interested in log x_i in the first place, the thing would not have made any sense.

David
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 10:42 AM, David Cournapeau [EMAIL PROTECTED] wrote:
> On Fri, May 9, 2008 at 1:04 AM, Charles R Harris [EMAIL PROTECTED] wrote:
>> 1e-308 ?
>
> Yes, all the time. I mean, if it was not, why would people bother with
> long double and co? Why would denormals exist? I don't consider the
> comparison with the number of particles to be really relevant here. We
> are talking about implementation problems.
>
>> Yes, logs can be useful there, but I still fail to see any precision
>> advantage. As I say, to all intents and purposes, IEEE floating point
>> *is* a logarithm. You will see that if you look at how log is
>> implemented in hardware. I'm less sure of the C floating point library
>> because it needs to be portable.
>
> a = np.array([-1000., -1001.])
> np.log(np.sum(np.exp(a)))                 -> -inf
> -1000 + np.log(np.sum([1 + np.exp(-1)]))  -> correct result
>
>> What realistic probability is in the range exp(-1000) ?
>
> Realistic as significant, none of course. Realistic as it happens in
> computation? Certainly. Typically, when you do clustering, and you have
> distant clusters, when you compute the probabilities of the points from
> one cluster relatively to another one, you can quickly get several units
> from the mean. Add a small variance, and you can quickly get
> (x-mu)**2/sigma**2 around 1000.

Yes, and Gaussians are a delusion beyond a few sigma. One of my pet peeves. If you have more than 8 standard deviations, then something is fundamentally wrong in the concept and formulation. It is more likely that a particle-sized black hole has whacked out some component of the experiment.

> You cannot just clip to 0, specially in online settings.
>
>> If you have a hammer... It's portable, but there are wasted cpu cycles
>> in there. If speed was important, I suspect you could do better writing
>> a low level function that assumed IEEE doubles and twiddled the bits.
>
> When you call a function in python, you waste a thousand cycles at every
> call. Yet, you use python, and not assembly :) The above procedure is
> extremely standard, and used in all robust implementations of machine
> learning algorithms I am aware of; it is implemented in HTK, a widely
> used toolkit for HMMs for speech recognition, for example. Twiddling
> bits is all fun, but it takes time and is extremely error prone. Also, I
> don't see what kind of method you have in mind here, exactly: how would
> you do a logsumexp algorithm with bit twiddling?

You are complaining of inadequate range, but that is what scale factors are for. Why compute exponentials and logs when all you need to do is store an exponent in an integer?

Chuck
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 11:25 AM, Charles R Harris [EMAIL PROTECTED] wrote:
> On Thu, May 8, 2008 at 10:11 AM, Anne Archibald [EMAIL PROTECTED] wrote:
>> 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
>>> What realistic probability is in the range exp(-1000) ?
>>
>> Well, I ran into it while doing a maximum-likelihood fit - my early
>> guesses had exceedingly low probabilities, but I needed to know which
>> way the probabilities were increasing.
>
> The number of bosons in the universe is only on the order of 1e-42.
> Exp(-1000) may be convenient, but as a probability it is a delusion. The
> hypothesis "none of the above" would have a much larger prior.

When you're running an optimizer over a PDF, you will be stuck in the region of exp(-1000) for a substantial amount of time before you get to the peak. If you don't use the log representation, you will never get to the peak because all of the gradient information is lost to floating point error. You can consult any book on computational statistics for many more examples. This is a long-established best practice in statistics.

--
Robert Kern
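The loss of gradient information is easy to demonstrate (a quick sketch; the exact warnings emitted depend on NumPy's error state):

    import numpy as np

    np.exp(-1000.0)                      # underflows to 0.0
    np.exp(-1000.0) / np.exp(-1001.0)    # 0.0/0.0 -> nan: the ratio is gone
    # In the log domain the same ratio is a subtraction,
    # -1000 - (-1001) = 1, i.e. the first value is e times the second --
    # exactly the direction information an optimizer needs.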
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 10:52 AM, David Cournapeau [EMAIL PROTECTED] wrote:
> On Fri, May 9, 2008 at 1:25 AM, Charles R Harris [EMAIL PROTECTED] wrote:
>> But to expand on David's computation... If the numbers are stored
>> without using logs, i.e., as the exponentials, then the sum is of the
>> form:
>>
>> x_1*2**y_1 + ... + x_i*2**y_i
>
> You missed the part on parametric models: in parametric settings, your
> x_i are often exponential,

I'm talking IEEE floating point. The x_i are never exponentials.

Chuck
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED] wrote:
> On Thu, May 8, 2008 at 11:25 AM, Charles R Harris [EMAIL PROTECTED] wrote:
>> On Thu, May 8, 2008 at 10:11 AM, Anne Archibald [EMAIL PROTECTED] wrote:
>>> 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
>>>> What realistic probability is in the range exp(-1000) ?
>>>
>>> Well, I ran into it while doing a maximum-likelihood fit - my early
>>> guesses had exceedingly low probabilities, but I needed to know which
>>> way the probabilities were increasing.
>>
>> The number of bosons in the universe is only on the order of 1e-42.
>> Exp(-1000) may be convenient, but as a probability it is a delusion.
>> The hypothesis "none of the above" would have a much larger prior.
>
> When you're running an optimizer over a PDF, you will be stuck in the
> region of exp(-1000) for a substantial amount of time before you get to
> the peak. If you don't use the log representation, you will never get to
> the peak because all of the gradient information is lost to floating
> point error. You can consult any book on computational statistics for
> many more examples. This is a long-established best practice in
> statistics.

But IEEE is already a log representation. You aren't gaining precision, you are gaining more bits in the exponent at the expense of fewer bits in the mantissa, i.e., less precision. As I say, it may be convenient, but if cpu cycles matter, it isn't efficient.

Chuck
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 12:02 PM, Charles R Harris [EMAIL PROTECTED] wrote:
> On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED] wrote:
>> On Thu, May 8, 2008 at 11:25 AM, Charles R Harris [EMAIL PROTECTED] wrote:
>>> On Thu, May 8, 2008 at 10:11 AM, Anne Archibald [EMAIL PROTECTED] wrote:
>>>> 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
>>>>> What realistic probability is in the range exp(-1000) ?
>>>>
>>>> Well, I ran into it while doing a maximum-likelihood fit - my early
>>>> guesses had exceedingly low probabilities, but I needed to know which
>>>> way the probabilities were increasing.
>>>
>>> The number of bosons in the universe is only on the order of 1e-42.
>>> Exp(-1000) may be convenient, but as a probability it is a delusion.
>>> The hypothesis "none of the above" would have a much larger prior.
>>
>> When you're running an optimizer over a PDF, you will be stuck in the
>> region of exp(-1000) for a substantial amount of time before you get to
>> the peak. If you don't use the log representation, you will never get
>> to the peak because all of the gradient information is lost to floating
>> point error. You can consult any book on computational statistics for
>> many more examples. This is a long-established best practice in
>> statistics.
>
> But IEEE is already a log representation. You aren't gaining precision,
> you are gaining more bits in the exponent at the expense of fewer bits in
> the mantissa, i.e., less precision.

*YES*. As David pointed out, many of these PDFs are in exponential form. Most of the meaningful variation is in the exponent, not the mantissa.

--
Robert Kern
Re: [Numpy-discussion] Log Arrays
Is the 80-bit float (float96 on IA32, float128 on AMD64) not enough? It has a 64-bit mantissa and can represent numbers up to nearly 1E(+-)5000.

Nadav.

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Charles R Harris
Sent: Thu 08-May-08 19:25
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] Log Arrays

On Thu, May 8, 2008 at 10:11 AM, Anne Archibald [EMAIL PROTECTED] wrote:
> 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
>> What realistic probability is in the range exp(-1000) ?
>
> Well, I ran into it while doing a maximum-likelihood fit - my early
> guesses had exceedingly low probabilities, but I needed to know which way
> the probabilities were increasing.

The number of bosons in the universe is only on the order of 1e-42. Exp(-1000) may be convenient, but as a probability it is a delusion. The hypothesis "none of the above" would have a much larger prior.

But to expand on David's computation... If the numbers are stored without using logs, i.e., as the exponentials, then the sum is of the form:

x_1*2**y_1 + ... + x_i*2**y_i

where 1 <= x_j < 2 and both x_i and y_i are available. When the numbers are all of the same general magnitude you get essentially the same result as David's formula simply by dividing out the first value.

Chuck
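What the extended type buys can be checked directly; a sketch, noting that the actual width and limits of np.longdouble are platform-dependent:

    import numpy as np

    info = np.finfo(np.longdouble)
    info.tiny, info.max              # roughly 3.4e-4932 .. 1.2e4932 for 80-bit x87
    np.exp(np.longdouble(-1000))     # ~ 5.1e-435: representable here,
                                     # though it underflows to 0.0 as float64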
Re: [Numpy-discussion] Log Arrays
On Fri, May 9, 2008 at 1:54 AM, Charles R Harris [EMAIL PROTECTED] wrote:
> Yes, and Gaussians are a delusion beyond a few sigma. One of my pet
> peeves. If you have more than 8 standard deviations, then something is
> fundamentally wrong in the concept and formulation.

If you have a mixture of Gaussians, and the components are not all mostly overlapping, you will get those ranges, and nothing is wrong in the formulation. I mean, it is not like EM algorithms are untested things and totally new. They are used in many different fields, and all successful implementations use the logsumexp trick. Look here for the formulas involved:

http://en.wikipedia.org/wiki/Expectation-maximization_algorithm

If you need to compute log(exp(-1000) + exp(-1001)), how would you do it? If you do it the naive way, you get -inf, and it propagates across all your computation quickly. Getting -1000 instead of -inf seems like a precision win to me. Of course you are trading precision for range, but when you are out of range for your number representation, the tradeoff is not a loss anymore. It is really like denormals: they are less precise than the normal format *for the usual range*, but in the range where denormals are used, they are much more precise; they are actually infinitely more precise, since the normal representation would be 0 :)

David
Re: [Numpy-discussion] Log Arrays
On Thu, 8 May 2008, Charles R Harris wrote:
> On Thu, May 8, 2008 at 10:11 AM, Anne Archibald [EMAIL PROTECTED] wrote:
>> 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
>>> What realistic probability is in the range exp(-1000) ?
>>
>> Well, I ran into it while doing a maximum-likelihood fit - my early
>> guesses had exceedingly low probabilities, but I needed to know which
>> way the probabilities were increasing.
>
> The number of bosons in the universe is only on the order of 1e-42.
> Exp(-1000) may be convenient, but as a probability it is a delusion. The
> hypothesis "none of the above" would have a much larger prior.

You might like to think so. Sadly, not. If you're doing a least-squares (or any other maximum-likelihood) fit to 2000 data points, exp(-1000) is the highest probability you can reasonably hope for. For a good fit. Chi-square is -2*ln(P). In the course of doing the fit, you will evaluate many parameter sets which are bad fits, and the probability will be much lower.

This has no real effect on the current discussion, but: the number of bosons in the universe (or any subset thereof) is not well-defined. It's not just a question of not knowing the number; there really is no answer to that question (well, ok, 'mu'). It's like asking which slit the particle went through in a double-slit interference experiment. The question is incorrect. Values < 1 will never be tenable, but I suspect that the minus sign was a typo. The estimates I hear for the number of baryons (protons, atoms) are ~ 1e80.

w
Re: [Numpy-discussion] Log Arrays
2008/5/8 Charles R Harris [EMAIL PROTECTED]:
> On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED] wrote:
>> When you're running an optimizer over a PDF, you will be stuck in the
>> region of exp(-1000) for a substantial amount of time before you get to
>> the peak. If you don't use the log representation, you will never get
>> to the peak because all of the gradient information is lost to floating
>> point error. You can consult any book on computational statistics for
>> many more examples. This is a long-established best practice in
>> statistics.
>
> But IEEE is already a log representation. You aren't gaining precision,
> you are gaining more bits in the exponent at the expense of fewer bits in
> the mantissa, i.e., less precision. As I say, it may be convenient, but
> if cpu cycles matter, it isn't efficient.

Efficiency is not the point here. IEEE floats simply cannot represent the difference between exp(-1000) and exp(-1001). This difference can matter in many contexts.

For example, suppose I have observed a source in X-rays and I want to fit a blackbody spectrum. I have, say, hundreds of spectral bins, with a number of photons in each. For any temperature T and amplitude A, I can compute the distribution of photons in each bin (Poisson with mean depending on T and A), and obtain the probability of obtaining the observed number of photons in each bin. Even if a blackbody with temperature T and amplitude A is a perfect fit I should expect this number to be very small, since the chance of obtaining *exactly* this sequence of integers is quite small. But when I start the process I need to guess a T and an A and evaluate the probability. If my values are far off, the probability will almost certainly be lower than exp(-1000). But I need to know whether, if I increase T, this probability will increase or decrease, so I cannot afford to treat it as zero, or throw up my hands and say "This is smaller than one over the number of baryons in the universe! My optimization problem doesn't make any sense!"

I could also point out that frequently when one obtains these crazy numbers, one is not working with probabilities at all, but probability densities. A probability density of exp(-1000) means nothing special.

Finally, the fact that *you* don't think this is a useful technique doesn't affect the fact that there is a substantial community of users who use it daily and who would like some support for it in scipy.

Anne
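As a sketch of how such a fit statistic stays in the log domain (assuming scipy is available; poisson_loglike is an illustrative name, and the means must be positive):

    import numpy as np
    from scipy.special import gammaln

    def poisson_loglike(counts, means):
        # log of prod_i Poisson(counts[i]; means[i]) =
        # sum_i (counts_i*log(means_i) - means_i - log(counts_i!)),
        # computed entirely in the log domain so nothing underflows to 0.
        counts = np.asarray(counts, dtype=float)
        means = np.asarray(means, dtype=float)
        return np.sum(counts * np.log(means) - means - gammaln(counts + 1))

A value like -1000 from this function is perfectly usable for comparing parameter sets, even though exp(-1000) itself underflows.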
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 11:18 AM, David Cournapeau [EMAIL PROTECTED] wrote:
> On Fri, May 9, 2008 at 2:06 AM, Nadav Horesh [EMAIL PROTECTED] wrote:
>> Is the 80-bit float (float96 on IA32, float128 on AMD64) not enough? It
>> has a 64-bit mantissa and can represent numbers up to nearly
>> 1E(+-)5000.
>
> It only makes the problem happen later, I think. If you have a GMM with
> millions of samples of high dimension with many clusters, any linear
> representation will fail, I think. In a sense, the IEEE format is not
> adequate for that kind of computation.

David, what you are using is a log(log(x)) representation internally. IEEE is *not* linear, it is logarithmic.

Chuck
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 12:53 PM, Charles R Harris [EMAIL PROTECTED] wrote:
> On Thu, May 8, 2008 at 11:18 AM, David Cournapeau [EMAIL PROTECTED] wrote:
>> On Fri, May 9, 2008 at 2:06 AM, Nadav Horesh [EMAIL PROTECTED] wrote:
>>> Is the 80-bit float (float96 on IA32, float128 on AMD64) not enough?
>>> It has a 64-bit mantissa and can represent numbers up to nearly
>>> 1E(+-)5000.
>>
>> It only makes the problem happen later, I think. If you have a GMM with
>> millions of samples of high dimension with many clusters, any linear
>> representation will fail, I think. In a sense, the IEEE format is not
>> adequate for that kind of computation.
>
> David, what you are using is a log(log(x)) representation internally.
> IEEE is *not* linear, it is logarithmic.

*YES*. That is precisely the point. I want 53 bits devoted to the x part of exp(-x). The straight IEEE representation is not logarithmic *enough*.

--
Robert Kern
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 11:31 AM, Warren Focke [EMAIL PROTECTED] wrote:
> On Thu, 8 May 2008, Charles R Harris wrote:
>> On Thu, May 8, 2008 at 10:11 AM, Anne Archibald [EMAIL PROTECTED] wrote:
>>> 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
>>>> What realistic probability is in the range exp(-1000) ?
>>>
>>> Well, I ran into it while doing a maximum-likelihood fit - my early
>>> guesses had exceedingly low probabilities, but I needed to know which
>>> way the probabilities were increasing.
>>
>> The number of bosons in the universe is only on the order of 1e-42.
>> Exp(-1000) may be convenient, but as a probability it is a delusion.
>> The hypothesis "none of the above" would have a much larger prior.
>
> You might like to think so. Sadly, not. If you're doing a least-squares
> (or any other maximum-likelihood) fit to 2000 data points, exp(-1000) is
> the highest probability you can reasonably hope for. For a good fit.
> Chi-square is -2*ln(P). In the course of doing the fit, you will evaluate
> many parameter sets which are bad fits, and the probability will be much
> lower.
>
> This has no real effect on the current discussion, but: the number of
> bosons in the universe (or any subset thereof) is not well-defined. It's
> not just a question of not knowing the number; there really is no answer
> to that question (well, ok, 'mu'). It's like asking which slit the
> particle went through in a double-slit interference experiment. The
> question is incorrect. Values < 1 will never be tenable, but I suspect
> that the minus sign was a typo. The estimates I hear for the number of
> baryons (protons, atoms) are ~ 1e80.

Say, mostly photons. Temperature (~2.7 K) determines density, multiply by volume. But I meant baryons, and the last number I saw was about 1e42.

Chuck
[Numpy-discussion] Changes to matrix class broke scipy
Developers, take note:

http://thread.gmane.org/gmane.comp.python.scientific.user/16297
Re: [Numpy-discussion] Log Arrays
2008/5/8 Charles R Harris [EMAIL PROTECTED]:
> David, what you are using is a log(log(x)) representation internally.
> IEEE is *not* linear, it is logarithmic.

As Robert Kern says, yes, this is exactly what the OP and all the rest of us want. But it's a strange thing to say that IEEE is logarithmic - 2.3*10**1 is not exactly logarithmic, since the 2.3 is not the logarithm of anything. IEEE floats work the same way, which is important, since it means they can exactly represent integers of moderate size. For example, 257 is represented as sign 0, exponent 135, (implied leading 1).0001b. The exponent is indeed the integer part of the log base 2 of the value, up to some fiddling, but the mantissa is not a logarithm of any kind.

Anyway, all this is immaterial. The point is, in spite of the fact that floating-point numbers can represent a very wide range of numbers, there are some important contexts in which this range is not wide enough. One could in principle store an additional power of two in an accompanying integer, but this would be less efficient in terms of space and time, and more cumbersome, when for the usual situations where this is applied, simply taking the logarithm works fine.

Anne
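The 257 example can be checked directly with np.frexp, which splits a float into its mantissa and exponent (a quick sketch):

    import numpy as np

    m, e = np.frexp(257.0)   # 257 == m * 2**e with 0.5 <= m < 1
    # m == 0.501953125, e == 9: the mantissa is an exact linear factor,
    # not the log of anything.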
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 12:39 PM, Anne Archibald [EMAIL PROTECTED] wrote:
> 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
>> David, what you are using is a log(log(x)) representation internally.
>> IEEE is *not* linear, it is logarithmic.
>
> As Robert Kern says, yes, this is exactly what the OP and all the rest
> of us want. But it's a strange thing to say that IEEE is logarithmic -
> 2.3*10**1 is not exactly logarithmic, since the 2.3 is not the logarithm
> of anything. IEEE floats work the same way, which is important, since it
> means they can exactly represent integers of moderate size. For example,
> 257 is represented as sign 0, exponent 135, (implied leading 1).0001b.
> The exponent is indeed the integer part of the log base 2 of the value,
> up to some fiddling, but the mantissa is not a logarithm of any kind.

First order in the Taylor series. The log computation uses the fact that it has small curvature over the range 1..2.

Chuck
Re: [Numpy-discussion] Changes to matrix class broke scipy
On Thu, May 8, 2008 at 12:25 PM, Stéfan van der Walt [EMAIL PROTECTED] wrote:
> Developers, take note:
> http://thread.gmane.org/gmane.comp.python.scientific.user/16297

Yep, not unexpected. I suppose the real question is whether it should have been in 1.1 or 1.2, but if we are making the change we have to bite the bullet sometime. So I can understand the irritation, even anger, up front. Now let's see how hard the needed changes are.

Chuck
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 2:12 PM, Charles R Harris [EMAIL PROTECTED] wrote:
> On Thu, May 8, 2008 at 11:46 AM, Anne Archibald [EMAIL PROTECTED] wrote:
>> 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
>>> On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED] wrote:
>>>> When you're running an optimizer over a PDF, you will be stuck in the
>>>> region of exp(-1000) for a substantial amount of time before you get
>>>> to the peak. If you don't use the log representation, you will never
>>>> get to the peak because all of the gradient information is lost to
>>>> floating point error. You can consult any book on computational
>>>> statistics for many more examples. This is a long-established best
>>>> practice in statistics.
>>>
>>> But IEEE is already a log representation. You aren't gaining
>>> precision, you are gaining more bits in the exponent at the expense of
>>> fewer bits in the mantissa, i.e., less precision. As I say, it may be
>>> convenient, but if cpu cycles matter, it isn't efficient.
>>
>> Efficiency is not the point here. IEEE floats simply cannot represent
>> the difference between exp(-1000) and exp(-1001). This difference can
>> matter in many contexts. For example, suppose I have observed a source
>> in X-rays and I want to fit a blackbody spectrum. I have, say, hundreds
>> of spectral bins, with a number of photons in each. For any temperature
>> T and amplitude A, I can compute the distribution of photons in each
>> bin (Poisson with mean depending on T and A), and obtain the
>> probability of obtaining the observed number of photons in each bin.
>> Even if a blackbody with temperature T and amplitude A is a perfect fit
>> I should expect this number to be very small, since the chance of
>> obtaining *exactly* this sequence of integers is quite small. But when
>> I start the process I need to guess a T and an A and evaluate the
>> probability. If my values are far off, the probability will almost
>> certainly be lower than exp(-1000). But I need to know whether, if I
>> increase T, this probability will increase or decrease, so I cannot
>> afford to treat it as zero, or throw up my hands and say "This is
>> smaller than one over the number of baryons in the universe! My
>> optimization problem doesn't make any sense!"
>
> You want the covariance of the parameters of the fit, which will be much
> more reasonable. And if not, then you have outliers. Mostly, folks are
> looking for the probability of classes of things, in this case the class
> of curves you are fitting; the probability of any given sequence, which
> approaches zero, is of much less interest. So the errors in the
> parameters of the model matter, the probability of an essentially unique
> sequence much less so.

Except that when you are doing practical computations, you will often be computing likelihoods as part of the larger computation. Yes, the final probabilities that you interpret won't be exp(-1000) (and if they are, then 0 is close enough), but the calculations *in between* do require that exp(-1000) and exp(-1001) be distinguishable from each other.

--
Robert Kern
Re: [Numpy-discussion] Log Arrays
On 5/8/08, Anne Archibald [EMAIL PROTECTED] wrote:
> Is logarray really the way to handle it, though? It seems like you
> could probably get away with providing a logsum ufunc that did the
> right thing. I mean, what operations does one want to do on logarrays?
>
> add -> logsum
> subtract -> ?
> multiply -> add
> mean -> logsum/N
> median -> median
> exponentiate to recover normal-space values -> exp
> str -> ?

That's about it, as far as my usage goes. Additionally, I would also benefit from:

logdot
logouter

In addition to the elementwise operations, it would be nice to have:

logsum along an axis
logprod along an axis
cumlogsum
cumlogprod

Whether these are through additional ufuncs or through a subclass is not so much of an issue for me---either would be a huge improvement to the current situation. One benefit of a subclass, IMO, is that it maintains the feel of non-log arrays. That is, when I want to multiply two logarrays, I just do x*y, rather than x+y, but I can understand arguments that this might not be desirable.
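For logdot, one workable approach is to shift by the maxima, take the dot product in linear space, and shift back; a minimal sketch assuming 2-D inputs with finite maxima (it guards against overflow, not against underflow of very small entries):

    import numpy as np

    def logdot(la, lb):
        # log(dot(exp(la), exp(lb))). dot() is bilinear, so global scale
        # factors pull out:
        #   dot(exp(la), exp(lb)) ==
        #       exp(amax + bmax) * dot(exp(la - amax), exp(lb - bmax))
        amax = la.max()
        bmax = lb.max()
        return amax + bmax + np.log(np.dot(np.exp(la - amax),
                                           np.exp(lb - bmax)))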
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 11:46 AM, Anne Archibald [EMAIL PROTECTED] wrote:
> 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
>> On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED] wrote:
>>> When you're running an optimizer over a PDF, you will be stuck in the
>>> region of exp(-1000) for a substantial amount of time before you get
>>> to the peak. If you don't use the log representation, you will never
>>> get to the peak because all of the gradient information is lost to
>>> floating point error. You can consult any book on computational
>>> statistics for many more examples. This is a long-established best
>>> practice in statistics.
>>
>> But IEEE is already a log representation. You aren't gaining precision,
>> you are gaining more bits in the exponent at the expense of fewer bits
>> in the mantissa, i.e., less precision. As I say, it may be convenient,
>> but if cpu cycles matter, it isn't efficient.
>
> Efficiency is not the point here. IEEE floats simply cannot represent
> the difference between exp(-1000) and exp(-1001). This difference can
> matter in many contexts. For example, suppose I have observed a source
> in X-rays and I want to fit a blackbody spectrum. I have, say, hundreds
> of spectral bins, with a number of photons in each. For any temperature
> T and amplitude A, I can compute the distribution of photons in each bin
> (Poisson with mean depending on T and A), and obtain the probability of
> obtaining the observed number of photons in each bin. Even if a
> blackbody with temperature T and amplitude A is a perfect fit I should
> expect this

This is an interesting problem, partly because I was one of the first guys to use synthetic spectra (NO, OH) to fit temperatures to spectral data. But also because it might look like the Poisson matters. But probably not: the curve fitting effectively aggregates data and the central limit theorem kicks in, so that the normal least squares approach will probably work. Also, if the number of photons in a bin is much more than about 10, the computation of the Poisson distribution probably uses a Gaussian, with perhaps a small adjustment for the tail. So I'm curious, did you find any significant difference trying both methods?

Chuck
Re: [Numpy-discussion] Log Arrays
2008/5/8 T J [EMAIL PROTECTED]:
> On 5/8/08, Anne Archibald [EMAIL PROTECTED] wrote:
>> Is logarray really the way to handle it, though? It seems like you
>> could probably get away with providing a logsum ufunc that did the
>> right thing. I mean, what operations does one want to do on logarrays?
>>
>> add -> logsum
>> subtract -> ?
>> multiply -> add
>> mean -> logsum/N
>> median -> median
>> exponentiate to recover normal-space values -> exp
>> str -> ?
>
> That's about it, as far as my usage goes. Additionally, I would also
> benefit from:
>
> logdot
> logouter
>
> In addition to the elementwise operations, it would be nice to have:
>
> logsum along an axis
> logprod along an axis
> cumlogsum
> cumlogprod
>
> Whether these are through additional ufuncs or through a subclass is not
> so much of an issue for me---either would be a huge improvement to the
> current situation. One benefit of a subclass, IMO, is that it maintains
> the feel of non-log arrays. That is, when I want to multiply two
> logarrays, I just do x*y, rather than x+y, but I can understand
> arguments that this might not be desirable.

Well, how about a two-step process? Let's come up with a nice implementation of each of the above, then we can discuss whether a subclass is warranted (and at that point each operation on the subclass will be very simple to implement).

Most of them could be implemented almost for free by providing a ufunc that did logsum(); then logsum along an axis becomes logsum.reduce(), and cumlogsum becomes logsum.accumulate(). logprod is of course just add. logouter is conveniently and efficiently written as logprod.outer. logdot can be written in terms of logsum() and logprod() at the cost of a sizable temporary, but should really have its own implementation.

It might be possible to test this by using vectorize() on a single-element version of logsum(). This would be slow, but if vectorize() makes a real ufunc object it should provide all the usual handy methods (reduce, accumulate, outer, etcetera). Alternatively, since logsum can be written in terms of existing ufuncs, it should be possible to implement a class providing all the ufunc methods by hand without too much pain.

Anne
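This is essentially the route NumPy later took: newer releases ship a binary np.logaddexp ufunc (and a base-2 np.logaddexp2), and the ufunc machinery then provides the rest, much as outlined above. A sketch, assuming a NumPy recent enough to have it:

    import numpy as np

    a = np.log(np.array([0.5, 0.25, 0.25]))

    np.logaddexp.reduce(a)       # logsum along an axis -> log(1.0) == 0.0
    np.logaddexp.accumulate(a)   # cumlogsum
    np.logaddexp.outer(a, a)     # pairwise log-sums
    np.add.reduce(a)             # logprod is just an ordinary sum of logs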
Re: [Numpy-discussion] Log Arrays
On Thu, 8 May 2008, Charles R Harris wrote:
> On Thu, May 8, 2008 at 11:46 AM, Anne Archibald [EMAIL PROTECTED] wrote:
>> 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
>>> On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED] wrote:
>>>> When you're running an optimizer over a PDF, you will be stuck in the
>>>> region of exp(-1000) for a substantial amount of time before you get
>>>> to the peak. If you don't use the log representation, you will never
>>>> get to the peak because all of the gradient information is lost to
>>>> floating point error. You can consult any book on computational
>>>> statistics for many more examples. This is a long-established best
>>>> practice in statistics.
>>>
>>> But IEEE is already a log representation. You aren't gaining
>>> precision, you are gaining more bits in the exponent at the expense of
>>> fewer bits in the mantissa, i.e., less precision. As I say, it may be
>>> convenient, but if cpu cycles matter, it isn't efficient.
>>
>> Efficiency is not the point here. IEEE floats simply cannot represent
>> the difference between exp(-1000) and exp(-1001). This difference can
>> matter in many contexts. For example, suppose I have observed a source
>> in X-rays and I want to fit a blackbody spectrum. I have, say, hundreds
>> of spectral bins, with a number of photons in each. For any temperature
>> T and amplitude A, I can compute the distribution of photons in each
>> bin (Poisson with mean depending on T and A), and obtain the
>> probability of obtaining the observed number of photons in each bin.
>> Even if a blackbody with temperature T and amplitude A is a perfect fit
>> I should expect this
>
> This is an interesting problem, partly because I was one of the first
> guys to use synthetic spectra (NO, OH) to fit temperatures to spectral
> data. But also because it might look like the Poisson matters. But
> probably not: the curve fitting effectively aggregates data and the
> central limit theorem kicks in, so that the normal least squares approach
> will probably work. Also, if the number of photons in a bin is much more
> than about 10, the computation of the Poisson distribution probably uses
> a Gaussian, with perhaps a small adjustment for the tail. So I'm curious,
> did you find any significant difference trying both methods?

I can't address Anne's results, but I've certainly found it to make a difference in my work, and it's pretty standard in high-energy astronomy. The central limit theorem does not get you out of having to use the right PDF to compare data to model. It does mean that, if you have enough events, the PDF of the fitness function is fairly chi-squarish, so much of the logic developed for least-squares fitting still applies (like for finding confidence intervals on the fitted parameters). Using Poisson statistics instead of a Gaussian approximation is actually pretty easy; see Cash (Astrophysical Journal, Part 1, vol. 228, Mar. 15, 1979, p. 939-947, http://adsabs.harvard.edu/abs/1979ApJ...228..939C).

w
[Numpy-discussion] Why are ufunc docstrings useless?
Hi,

I frequently use functions like np.add.reduce and np.add.outer, but their docstrings are totally uninformative. Would it be possible to provide proper docstrings for these ufunc methods? They need not be specific to np.add; just an explanation of what arguments to give (for example) reduce() (presumably it takes an axis= argument? what is the default behaviour?) and what it does would help.

Thanks,
Anne
Re: [Numpy-discussion] Why are ufunc docstrings useless?
On Thu, May 8, 2008 at 5:28 PM, Anne Archibald [EMAIL PROTECTED] wrote:
> Hi,
> I frequently use functions like np.add.reduce and np.add.outer, but
> their docstrings are totally uninformative. Would it be possible to
> provide proper docstrings for these ufunc methods? They need not be
> specific to np.add; just an explanation of what arguments to give (for
> example) reduce() (presumably it takes an axis= argument? what is the
> default behaviour?) and what it does would help.

Sure. The place to add them would be in the PyMethodDef ufunc_methods array on line 3953 of numpy/core/src/ufuncobject.c. Just extend each of the rows with a char* containing the docstring contents. A literal string will suffice.

--
Robert Kern
Re: [Numpy-discussion] online (1-shot) calculation of variance (complex)
On Thu, May 8, 2008 at 5:45 PM, Neal Becker [EMAIL PROTECTED] wrote:
> I saw some links on 1-pass recursive calculation of mean/variance. When I
> tried the algorithms, it did not seem to give correct results for complex
> values. Anyone know how to correctly implement this?

Well, exactly what did you try?

--
Robert Kern
Re: [Numpy-discussion] online (1-shot) calculation of variance (complex)
Robert Kern wrote:
> On Thu, May 8, 2008 at 5:45 PM, Neal Becker [EMAIL PROTECTED] wrote:
>> I saw some links on 1-pass recursive calculation of mean/variance. When
>> I tried the algorithms, it did not seem to give correct results for
>> complex values. Anyone know how to correctly implement this?
>
> Well, exactly what did you try?

See Algorithm III here:
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
Re: [Numpy-discussion] online (1-shot) calculation of variance (complex)
Robert Kern wrote:
> On Thu, May 8, 2008 at 5:45 PM, Neal Becker [EMAIL PROTECTED] wrote:
>> I saw some links on 1-pass recursive calculation of mean/variance. When
>> I tried the algorithms, it did not seem to give correct results for
>> complex values. Anyone know how to correctly implement this?
>
> Well, exactly what did you try?

Here's my python translation. It seems to give a variance that just converges to 0 given a vector of gaussian r.v.:

class stat2 (object):
    def __init__(self):
        self.n = 0
        self._mean = 0
        self.M2 = 0

    def __iadd__(self, x):
        self.n += 1
        delta = x - self._mean
        self._mean += delta/self.n
        # This expression uses the new value of mean
        self.M2 += delta*(x - self._mean)
        return self  # without this, `s += x` would rebind s to None

    def mean(self):
        return self._mean

    def var(self):
        return self.M2/(self.n - 1)
Re: [Numpy-discussion] online (1-shot) calculation of variance (complex)
On Thu, May 8, 2008 at 6:05 PM, Neal Becker [EMAIL PROTECTED] wrote:
> Robert Kern wrote:
>> Well, exactly what did you try?
>
> Here's my python translation. It seems to give a variance that just
> converges to 0 given a vector of gaussian r.v.:
>
> class stat2 (object):
>     def __init__(self):
>         self.n = 0
>         self._mean = 0
>         self.M2 = 0
>
>     def __iadd__(self, x):
>         self.n += 1
>         delta = x - self._mean
>         self._mean += delta/self.n
>         # This expression uses the new value of mean
>         self.M2 += delta*(x - self._mean)
>         return self
>
>     def mean(self):
>         return self._mean
>
>     def var(self):
>         return self.M2/(self.n - 1)

You may not be able to convert this one to apply to complex numbers. The recurrence relation may not hold. In the two-pass algorithm for complex numbers, remember that you are summing (x[i] - mean).conj() * (x[i] - mean), each of which is real.

--
Robert Kern
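Following that hint, a sketch of the one-pass update adapted for complex inputs; the class name is illustrative, and the only substantive change from the posted code is the conjugate in the M2 update, which keeps M2 real and non-negative:

    import numpy as np

    class stat2c(object):
        def __init__(self):
            self.n = 0
            self._mean = 0.0
            self.M2 = 0.0

        def __iadd__(self, x):
            self.n += 1
            delta = x - self._mean
            self._mean += delta / self.n
            # conj(delta)*(x - new_mean) == |delta|**2 * (n-1)/n, which is
            # real, so M2 accumulates sum(|x_i - mean|**2) as it should.
            self.M2 += (np.conj(delta) * (x - self._mean)).real
            return self

        def mean(self):
            return self._mean

        def var(self):
            return self.M2 / (self.n - 1)

Used as s = stat2c(); s += z for each sample z, s.var() then agrees with np.var(data, ddof=1) up to rounding, for real or complex data.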
Re: [Numpy-discussion] Status Report for NumPy 1.1.0
I don't think it is reasonable to say the trunk is in good shape when the power function does not work... Just my thoughts...

C.

Charles R Harris wrote:
> Hi Jarrod,
>
> On Tue, May 6, 2008 at 2:40 AM, Jarrod Millman [EMAIL PROTECTED] wrote:
>> Hey,
>> The trunk is in pretty good shape and it is about time that I put out
>> an official release. So tomorrow (in a little over twelve hours) I am
>> going to create a 1.1.x branch and the trunk will be officially open
>> for 1.2 development. If there are no major issues that show up at the
>> last minute, I will tag 1.1.0 twenty-four hours after I branch. As soon
>> as I tag the release I will ask David and Chris to create the official
>> Windows and Mac binaries. If nothing goes awry, you can expect the
>> official release announcement by Monday, May 12th. In order to help me
>> with the final touches, would everyone look over the release notes one
>> last time:
>> http://projects.scipy.org/scipy/numpy/milestone/1.1.0
>> Please let me know if there are any important omissions or errors ASAP.
>
> Scalar indexing of matrices has changed: 1D arrays are now returned
> instead of matrices. This has to be documented in the release notes.
>
> Chuck
Re: [Numpy-discussion] Status Report for NumPy 1.1.0
On Thu, May 8, 2008 at 6:38 PM, Charles Doutriaux [EMAIL PROTECTED] wrote:
> I don't think it is reasonable to say the trunk is in good shape when the
> power function does not work... Just my thoughts...

Is that ticket #301? What are you suggesting it do?

Chuck
Re: [Numpy-discussion] Log Arrays
On Thu, May 8, 2008 at 2:37 PM, Warren Focke [EMAIL PROTECTED] wrote:
> On Thu, 8 May 2008, Charles R Harris wrote:
>> On Thu, May 8, 2008 at 11:46 AM, Anne Archibald [EMAIL PROTECTED] wrote:
>>> 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
>>>> On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED] wrote:
>>>>> When you're running an optimizer over a PDF, you will be stuck in
>>>>> the region of exp(-1000) for a substantial amount of time before you
>>>>> get to the peak. If you don't use the log representation, you will
>>>>> never get to the peak because all of the gradient information is
>>>>> lost to floating point error. You can consult any book on
>>>>> computational statistics for many more examples. This is a
>>>>> long-established best practice in statistics.
>>>>
>>>> But IEEE is already a log representation. You aren't gaining
>>>> precision, you are gaining more bits in the exponent at the expense
>>>> of fewer bits in the mantissa, i.e., less precision. As I say, it may
>>>> be convenient, but if cpu cycles matter, it isn't efficient.
>>>
>>> Efficiency is not the point here. IEEE floats simply cannot represent
>>> the difference between exp(-1000) and exp(-1001). This difference can
>>> matter in many contexts. For example, suppose I have observed a source
>>> in X-rays and I want to fit a blackbody spectrum. I have, say,
>>> hundreds of spectral bins, with a number of photons in each. For any
>>> temperature T and amplitude A, I can compute the distribution of
>>> photons in each bin (Poisson with mean depending on T and A), and
>>> obtain the probability of obtaining the observed number of photons in
>>> each bin. Even if a blackbody with temperature T and amplitude A is a
>>> perfect fit I should expect this
>>
>> This is an interesting problem, partly because I was one of the first
>> guys to use synthetic spectra (NO, OH) to fit temperatures to spectral
>> data. But also because it might look like the Poisson matters. But
>> probably not: the curve fitting effectively aggregates data and the
>> central limit theorem kicks in, so that the normal least squares
>> approach will probably work. Also, if the number of photons in a bin is
>> much more than about 10, the computation of the Poisson distribution
>> probably uses a Gaussian, with perhaps a small adjustment for the tail.
>> So I'm curious, did you find any significant difference trying both
>> methods?
>
> I can't address Anne's results, but I've certainly found it to make a
> difference in my work, and it's pretty standard in high-energy astronomy.
> The central limit theorem does not get you out of having to use the right
> PDF to compare data to model. It does mean that, if you have enough
> events, the PDF of the fitness function is fairly chi-squarish, so much
> of the logic developed for least-squares fitting still applies (like for
> finding confidence intervals on the fitted parameters). Using Poisson
> statistics instead of a Gaussian approximation is actually pretty easy;
> see Cash (Astrophysical Journal, Part 1, vol. 228, Mar. 15, 1979,
> p. 939-947, http://adsabs.harvard.edu/abs/1979ApJ...228..939C)

Interesting paper. I've also been playing with a 64-bit floating format with a 31-bit offset binary exponent and 33-bit mantissa with a hidden bit, which can hold the probability of any given sequence of 2**30 - 1 coin tosses, or about 1e-323228496. It looks to be pretty efficient for multiplication and compare, and functions like log and exp aren't hard to do.

Chuck
Re: [Numpy-discussion] Why are ufunc docstrings useless?
2008/5/8 Robert Kern [EMAIL PROTECTED]: On Thu, May 8, 2008 at 5:28 PM, Anne Archibald [EMAIL PROTECTED] wrote: Hi, I frequently use functions like np.add.reduce and np.add.outer, but their docstrings are totally uninformative. Would it be possible to provide proper docstrings for these ufunc methods? They need not be specific to np.add; just an explanation of what arguments to give (for example) reduce() (presumably it takes an axis= argument? what is the default behaviour?) and what it does would help. Sure. The place to add them would be in the PyMethodDef ufunc_methods array on line 3953 of numpy/core/src/ufuncobject.c. Just extend each of the rows with a char* containing the docstring contents. A literal string will suffice. Thanks! Done: add, reduce, outer, and reduceat. What about __call__? Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Why are ufunc docstrings useless?
On Thu, May 8, 2008 at 9:52 PM, Anne Archibald [EMAIL PROTECTED] wrote: Thanks! Done: add, reduce, outer, and reduceat. What about __call__? If anyone knows enough to explicitly request a docstring from __call__, they already know what it does. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Why are ufunc docstrings useless?
On Thu, May 8, 2008 at 9:07 PM, Robert Kern [EMAIL PROTECTED] wrote: On Thu, May 8, 2008 at 9:52 PM, Anne Archibald [EMAIL PROTECTED] wrote: Thanks! Done: add, reduce, outer, and reduceat. What about __call__? If anyone knows enough to explicitly request a docstring from __call__, they already know what it does. -- It's easier/better to do this in numpy/add_newdocs.py. For example: In [14]: from numpy.lib import add_newdoc as add In [15]: add('numpy.core','ufunc',('reduce','hello world')) In [16]: ufunc.reduce.__doc__ Out[16]: 'hello world' You don't have to clutter up the C sources and you can use triple-quoted strings instead of putting all those newlines in. Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Why are ufunc docstrings useless?
On Thu, May 8, 2008 at 10:12 PM, Charles R Harris [EMAIL PROTECTED] wrote: It's easier/better to do this in numpy/add_newdocs.py. For example: In [14]: from numpy.lib import add_newdoc as add In [15]: add('numpy.core','ufunc',('reduce','hello world')) In [16]: ufunc.reduce.__doc__ Out[16]: 'hello world' You don't have to clutter up the C sources and you can use triple-quoted strings instead of putting all those newlines in. Ah good. I didn't realize it would handle builtin type methods. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Why are ufunc docstrings useless?
On Thu, May 8, 2008 at 9:12 PM, Charles R Harris [EMAIL PROTECTED] wrote: On Thu, May 8, 2008 at 9:07 PM, Robert Kern [EMAIL PROTECTED] wrote: On Thu, May 8, 2008 at 9:52 PM, Anne Archibald [EMAIL PROTECTED] wrote: Thanks! Done: add, reduce, outer, and reduceat. What about __call__? If anyone knows enough to explicitly request a docstring from __call__, they already know what it does. -- It's easier/better to do this in numpy/add_newdocs.py. For example: In [14]: from numpy.lib import add_newdoc as add In [15]: add('numpy.core','ufunc',('reduce','hello world')) In [16]: ufunc.reduce.__doc__ Out[16]: 'hello world' You don't have to clutter up the C sources and you can use triple-quoted strings instead of putting all those newlines in. Also, not all C compilers will join strings on separate lines. You need to add explicit continuation backslashes. Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Why are ufunc docstrings useless?
2008/5/8 Robert Kern [EMAIL PROTECTED]: On Thu, May 8, 2008 at 9:52 PM, Anne Archibald [EMAIL PROTECTED] wrote: Thanks! Done: add, reduce, outer, and reduceat. What about __call__? If anyone knows enough to explicitly request a docstring from __call__, they already know what it does. How exactly are they to find out? It does take additional arguments, for example dtype and out - I think. Also, help(np.add) displays the object, its docstring, its methods, and all their docstrings. So it provides a way to get a docstring out of __call__ without having to know what it is. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
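For readers wondering what those extra call arguments look like, a minimal sketch of the ufunc call signature under discussion (the optional output array is long-standing ufunc behavior; whether a dtype keyword is also accepted depends on the numpy version, as Anne's own hedge suggests):

    import numpy as np

    x = np.arange(5.0)
    y = np.ones(5)
    out = np.empty(5)

    # A ufunc accepts an optional output array as an extra positional
    # argument; the result is written into it in place.
    np.add(x, y, out)
    print(out)    # [ 1.  2.  3.  4.  5.]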
[Numpy-discussion] searchsorted() and memory cache
I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search, which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching were faster, since I'd make the index once but do the searching many times. Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm? Thanks, Andrew ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Why are ufunc docstrings useless?
On Thu, May 8, 2008 at 10:39 PM, Anne Archibald [EMAIL PROTECTED] wrote: 2008/5/8 Robert Kern [EMAIL PROTECTED]: On Thu, May 8, 2008 at 9:52 PM, Anne Archibald [EMAIL PROTECTED] wrote: Thanks! Done: add, reduce, outer, and reduceat. What about __call__? If anyone knows enough to explicitly request a docstring from __call__, they already know what it does. How exactly are they to find out? It does take additional arguments, for example dtype and out - I think. That should be recorded in the ufunc's main docstring, e.g. numpy.add.__doc__, since that is what people will actually be calling. No one will explicitly call numpy.add.__call__(x,y). Also, help(np.add) displays the object, its docstring, its methods, and all their docstrings. So it provides a way to get a docstring out of __call__ without having to know what it is. Meh. All it can usefully say is "Refer to the main docstring." Which is more or less what it currently says. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] searchsorted() and memory cache
On Thu, May 8, 2008 at 10:51 PM, Andrew Straw [EMAIL PROTECTED] wrote: I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search, Yes. which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching were faster, since I'd make the index once but do the searching many times. Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm? I'm no help. You seem to know more than I do. Sadly, the first few Google hits I get for "binary search minimize cache misses" are patents. I don't know what the substantive content of those patents is; I have a strict policy of not reading patents. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] searchsorted() and memory cache
On Thu, May 8, 2008 at 9:55 PM, Robert Kern [EMAIL PROTECTED] wrote: On Thu, May 8, 2008 at 10:51 PM, Andrew Straw [EMAIL PROTECTED] wrote: I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search, Yes. which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching were faster, since I'd make the index once but do the searching many times. Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm? I'm no help. You seem to know more than I do. Sadly, the first few Google hits I get for "binary search minimize cache misses" are patents. I don't know what the substantive content of those patents is; I have a strict policy of not reading patents. I would be interested in adding such a thing if it wasn't patent encumbered. A good start would be a prototype in python to show how it all went together and whether it needed a separate indexing/lookup function or could be fit into the current setup. Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Why are ufunc docstrings useless?
2008/5/8 Robert Kern [EMAIL PROTECTED]: On Thu, May 8, 2008 at 10:39 PM, Anne Archibald [EMAIL PROTECTED] wrote: 2008/5/8 Robert Kern [EMAIL PROTECTED]: If anyone knows enough to explicitly request a docstring from __call__, they already know what it does. How exactly are they to find out? It does take additional arguments, for example dtype and out - I think. That should be recorded in the ufunc's main docstring, e.g. numpy.add.__doc__, since that is what people will actually be calling. No one will explicitly call numpy.add.__call__(x,y). Also, help(np.add) displays the object, its docstring, its methods, and all their docstrings. So it provides a way to get a docstring out of __call__ without having to know what it is. Meh. All it can usefully say is "Refer to the main docstring." Which is more or less what it currently says. So is the custom that double-underscore methods get documented in the class docstring and normal methods get documented in their own docstrings? Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Why are ufunc docstrings useless?
On Thu, May 8, 2008 at 11:31 PM, Anne Archibald [EMAIL PROTECTED] wrote: 2008/5/8 Robert Kern [EMAIL PROTECTED]: On Thu, May 8, 2008 at 10:39 PM, Anne Archibald [EMAIL PROTECTED] wrote: 2008/5/8 Robert Kern [EMAIL PROTECTED]: If anyone knows enough to explicitly request a docstring from __call__, they already know what it does. How exactly are they to find out? It does take additional arguments, for example dtype and out - I think. That should be recorded in the ufunc's main docstring, e.g. numpy.add.__doc__, since that is what people will actually be calling. No one will explicitly call numpy.add.__call__(x,y). Also, help(np.add) displays the object, its docstring, its methods, and all their docstrings. So it provides a way to get a docstring out of __call__ without having to know what it is. Meh. All it can usefully say is "Refer to the main docstring." Which is more or less what it currently says. So is the custom that double-underscore methods get documented in the class docstring and normal methods get documented in their own docstrings? No. Objects whose main purpose is to be callable should document their call signature in the main docstring since that's where the objects they are mimicking (functions) have their docstring. In this case, there is also a technical reason: you can't override add.__call__.__doc__, just ufunc.__call__.__doc__, but the call signatures vary between ufuncs. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] searchsorted() and memory cache
On Thu, May 8, 2008 at 10:30 PM, Charles R Harris [EMAIL PROTECTED] wrote: On Thu, May 8, 2008 at 9:55 PM, Robert Kern [EMAIL PROTECTED] wrote: On Thu, May 8, 2008 at 10:51 PM, Andrew Straw [EMAIL PROTECTED] wrote: I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search, Yes. which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching were faster, since I'd make the index once but do the searching many times. Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm? I'm no help. You seem to know more than I do. Sadly, the first few Google hits I get for "binary search minimize cache misses" are patents. I don't know what the substantive content of those patents is; I have a strict policy of not reading patents. I would be interested in adding such a thing if it wasn't patent encumbered. A good start would be a prototype in python to show how it all went together and whether it needed a separate indexing/lookup function or could be fit into the current setup. One way I can think of doing this is to have two indices. One is the usual sorted list, the second consists of, say, every 1024'th entry in the first list. Then search the second list first to find the part of the first list to search. That won't get you into the very best cache, but it could buy you a factor of 2x-4x in speed. It's sort of splitting the binary tree into two levels. Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
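A rough Python prototype of the two-level scheme described above (the names are illustrative, and it assumes distinct values in the big array, since a run of duplicates spanning a block boundary would need extra care):

    import numpy as np

    STEP = 1024   # spacing of the coarse first-level index

    def build_coarse(sorted_arr):
        # Keep every STEP'th element; for 25 million int64s this coarse
        # index is only ~25k entries and fits comfortably in cache.
        return sorted_arr[::STEP]

    def two_level_searchsorted(sorted_arr, coarse, values):
        values = np.asarray(values)
        # Level 1: find the STEP-wide block each value lands in.
        block = np.searchsorted(coarse, values, side='right') - 1
        block = np.clip(block, 0, len(coarse) - 1)
        lo = block * STEP
        hi = np.minimum(lo + STEP, len(sorted_arr))
        # Level 2: an ordinary binary search, but confined to one
        # small, cache-friendly block per value.
        out = np.empty(len(values), dtype=np.intp)
        for i in range(len(values)):
            out[i] = lo[i] + np.searchsorted(sorted_arr[lo[i]:hi[i]], values[i])
        return out

The trade is one extra search of the small coarse index per value in exchange for a second search that touches only one small block of the big array.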
Re: [Numpy-discussion] searchsorted() and memory cache
On Thu, May 8, 2008 at 11:06 PM, Charles R Harris [EMAIL PROTECTED] wrote: On Thu, May 8, 2008 at 10:30 PM, Charles R Harris [EMAIL PROTECTED] wrote: On Thu, May 8, 2008 at 9:55 PM, Robert Kern [EMAIL PROTECTED] wrote: On Thu, May 8, 2008 at 10:51 PM, Andrew Straw [EMAIL PROTECTED] wrote: I've got a big element array (25 million int64s) that searchsorted() takes a long time to grind through. After a bit of digging in the literature and the numpy source code, I believe that searchsorted() is implementing a classic binary search, Yes. which is pretty bad in terms of cache misses. There are several modern implementations of binary search which arrange items in memory such that cache misses are much more rare. Clearly making such an indexing arrangement would take time, but in my particular case, I can spare the time to create an index if searching were faster, since I'd make the index once but do the searching many times. Is there an implementation of such an algorithm that works easily with numpy? Also, can you offer any advice, suggestions, and comments to me if I attempted to implement such an algorithm? I'm no help. You seem to know more than I do. Sadly, the first few Google hits I get for "binary search minimize cache misses" are patents. I don't know what the substantive content of those patents is; I have a strict policy of not reading patents. I would be interested in adding such a thing if it wasn't patent encumbered. A good start would be a prototype in python to show how it all went together and whether it needed a separate indexing/lookup function or could be fit into the current setup. One way I can think of doing this is to have two indices. One is the usual sorted list, the second consists of, say, every 1024'th entry in the first list. Then search the second list first to find the part of the first list to search. That won't get you into the very best cache, but it could buy you a factor of 2x-4x in speed. It's sort of splitting the binary tree into two levels. You could even do it with your data from python, generating the second list and calling searchsorted multiple times. If you are searching for a bunch of values, it's probably good to also sort them first so they bunch together in the same part of the big list. Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
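And a sketch of the query-sorting trick from the last paragraph (again illustrative): sort the values once, search in that order so successive binary searches walk through nearby parts of the big list, then scatter the results back:

    import numpy as np

    def searchsorted_sorted_queries(big_sorted, queries):
        queries = np.asarray(queries)
        order = np.argsort(queries)
        # Searching in sorted order keeps consecutive lookups in
        # neighboring regions of big_sorted, improving cache reuse.
        hits = np.searchsorted(big_sorted, queries[order])
        # Undo the permutation so results match the original order.
        out = np.empty_like(hits)
        out[order] = hits
        return out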