[Numpy-discussion] Log Arrays

2008-05-08 Thread T J
Hi,

For precision reasons, I almost always need to work with arrays whose
elements are log values.  My thought was that it would be really neat
to have a 'logarray' class implemented in C or as a subclass of the
standard array class.  Here is a sample of how I'd like to work with
these objects:

>>> x = array([-2,-2,-3], base=2)
>>> y = array([-1,-2,-inf], base=2)
>>> z = x + y
>>> z
array([-0.415037499279, -1.0, -3])
>>> z = x * y
>>> z
array([-3, -4, -inf])
>>> z[:2].sum()
-2.41503749928

This would do a lot for the code I write, and some of the numerical
stability issues would be handled more appropriately.  For example, the
logsum function is frequently handled as:

 log(x + y)  ==  log(x) + log(1 + exp(log(y) - log(x)) )

when log(x) > log(y).   So the goal of the above function is to return
log(x + y)  using only the logarithms of x and y.
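As a rough sketch (my own names, natural logs rather than a base= argument, and
no handling of the case where both inputs are log(0) == -inf), the trick above
looks like this in plain NumPy:

import numpy as np

def logadd(lx, ly):
    # return log(x + y) given lx = log(x) and ly = log(y), elementwise
    lx = np.asarray(lx, dtype=float)
    ly = np.asarray(ly, dtype=float)
    hi = np.maximum(lx, ly)            # factor out the larger term
    lo = np.minimum(lx, ly)
    return hi + np.log1p(np.exp(lo - hi))

For instance, logadd(-2.0, -3.0) gives about -1.6867, i.e. log(exp(-2) + exp(-3)),
without ever forming x or y themselves.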


Does this sound like a good idea?
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread T J
On Thu, May 8, 2008 at 12:26 AM, T J [EMAIL PROTECTED] wrote:

   x = array([-2,-2,-3], base=2)
   y = array([-1,-2,-inf], base=2)
   z = x + y
   z
  array([-0.415037499279, -1.0, -3])
   z = x * y
   z
  array([-3, -4, -inf])
   z[:2].sum()
  -2.41503749928


Whoops

s/array/logarray/
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] First steps with f2py and first problems...

2008-05-08 Thread LB
Hi,

I've tried to follow the example given at :
http://www.scipy.org/Cookbook/Theoretical_Ecology/Hastings_and_Powell
but I've got errors when compiling the fortran file :

-errors
--
12:53 loic:~ % f2py -c -m hastings hastings.f90 --fcompiler=gnu95
running build
running config_cc
unifing config_cc, config, build_clib, build_ext, build commands --
compiler options
running config_fc
unifing config_fc, config, build_clib, build_ext, build commands --
fcompiler options
running build_src
building extension hastings sources
f2py options: []
f2py: /tmp/tmpDRL9Gh/src.linux-i686-2.5/hastingsmodule.c
creating /tmp/tmpDRL9Gh
creating /tmp/tmpDRL9Gh/src.linux-i686-2.5
Reading fortran codes...
Reading file 'hastings.f90' (format:free)
Post-processing...
Block: hastings
Block: model
Block: fweb
Post-processing (stage 2)...
Block: hastings
Block: unknown_interface
Block: model
Block: fweb
Building modules...
Building module hastings...
Constructing F90 module support for model...
  Variables: a1 a2 b1 b2 d2 d1
Constructing wrapper function model.fweb...
  yprime = fweb(y,t)
Wrote C/API module hastings to file /tmp/tmpDRL9Gh/src.linux-
i686-2.5/hastingsmodule.c
Traceback (most recent call last):
  File /usr/bin/f2py, line 26, in module
main()
  File /usr/lib/python2.5/site-packages/numpy/f2py/f2py2e.py, line
558, in main
run_compile()
  File /usr/lib/python2.5/site-packages/numpy/f2py/f2py2e.py, line
545, in run_compile
setup(ext_modules = [ext])
  File /usr/lib/python2.5/site-packages/numpy/distutils/core.py,
line 176, in setup
return old_setup(**new_attr)
  File /usr/lib/python2.5/distutils/core.py, line 151, in setup
dist.run_commands()
  File /usr/lib/python2.5/distutils/dist.py, line 974, in
run_commands
self.run_command(cmd)
  File /usr/lib/python2.5/distutils/dist.py, line 994, in
run_command
cmd_obj.run()
  File /usr/lib/python2.5/distutils/command/build.py, line 113, in
run
self.run_command(cmd_name)
  File /usr/lib/python2.5/distutils/cmd.py, line 333, in run_command
self.distribution.run_command(command)
  File /usr/lib/python2.5/distutils/dist.py, line 994, in
run_command
cmd_obj.run()
  File /usr/lib/python2.5/site-packages/numpy/distutils/command/
build_src.py, line 130, in run
self.build_sources()
  File /usr/lib/python2.5/site-packages/numpy/distutils/command/
build_src.py, line 147, in build_sources
self.build_extension_sources(ext)
  File /usr/lib/python2.5/site-packages/numpy/distutils/command/
build_src.py, line 256, in build_extension_sources
sources = self.f2py_sources(sources, ext)
  File /usr/lib/python2.5/site-packages/numpy/distutils/command/
build_src.py, line 513, in f2py_sources
['-m',ext_name]+f_sources)
  File /usr/lib/python2.5/site-packages/numpy/f2py/f2py2e.py, line
367, in run_main
ret=buildmodules(postlist)
  File /usr/lib/python2.5/site-packages/numpy/f2py/f2py2e.py, line
319, in buildmodules
dict_append(ret[mnames[i]],rules.buildmodule(modules[i],um))
  File /usr/lib/python2.5/site-packages/numpy/f2py/rules.py, line
1222, in buildmodule
for l in '\n\n'.join(funcwrappers2)+'\n'.split('\n'):
TypeError: cannot concatenate 'str' and 'list' objects
zsh: exit 1 f2py -c -m hastings hastings.f90 --fcompiler=gnu95

-
configuration-
I'm using debian testing, and I got the following information at the
bottom of `f2py -h` :
Version: 2_4422
numpy Version: 1.0.4
Requires: Python 2.3 or higher.
License: NumPy license (see LICENSE.txt in the NumPy source code)


Have you got any clue how to solve this problem?
--
LB
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] First steps with f2py and first problems...

2008-05-08 Thread Pearu Peterson
On Thu, May 8, 2008 2:06 pm, LB wrote:
 Hi,

 I've tried to follow the example given at :
 http://www.scipy.org/Cookbook/Theoretical_Ecology/Hastings_and_Powell
 but I've got errors when compiling the fortran file :

 12:53 loic:~ % f2py -c -m hastings hastings.f90 --fcompiler=gnu95
...
   File /usr/lib/python2.5/site-packages/numpy/f2py/rules.py, line
 1222, in buildmodule
 for l in '\n\n'.join(funcwrappers2)+'\n'.split('\n'):
 TypeError: cannot concatenate 'str' and 'list' objects
 zsh: exit 1 f2py -c -m hastings hastings.f90 --fcompiler=gnu95
...
 Have you got any clue to solve this pb ?

This issue is fixed in SVN. So, either use numpy from svn,
or wait a bit until numpy 1.0.5 is released, or change the
line #1222 in numpy/f2py/rules.py to

  for l in ('\n\n'.join(funcwrappers2)+'\n').split('\n'):

HTH,
Pearu

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 1:26 AM, T J [EMAIL PROTECTED] wrote:

 Hi,

 For precision reasons, I almost always need to work with arrays whose
 elements are log values.  My thought was that it would be really neat
 to have a 'logarray' class implemented in C or as a subclass of the
 standard array class.  Here is a sample of how I'd like to work with
 these objects:


Floating point numbers are essentially logs to base 2, i.e., integer
exponent and mantissa between 1 and 2. What does using the log buy you?

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Problems with Trac

2008-05-08 Thread Charles R Harris
There seem to be several problems with the Trac system.

1) Submitting/modifying tickets yields 500 internal server error,
even though the changes are made. This leads to duplicate tickets.

2) Mail isn't sent to the numpy-ticket mailing list.

3) It still always takes two tries to log in.

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Test

2008-05-08 Thread Travis Oliphant
This is a test to see if the list is working for me.

-teo


___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Anne Archibald
2008/5/8 David Cournapeau [EMAIL PROTECTED]:
 On Thu, May 8, 2008 at 10:20 PM, Charles R Harris
  [EMAIL PROTECTED] wrote:
  
  
   Floating point numbers are essentially logs to base 2, i.e., integer
   exponent and mantissa between 1 and 2. What does using the log buy you?

  Precision, of course. I am not sure I understand the notation base =
  2, but doing computation in the so called log-domain is a must in many
  statistical computations. In particular, in machine learning with
  large datasets, it is common to have some points whose pdf is
  extremely small, and well below the precision of double. Typically,
  internally, the computation of my EM toolbox are done in the log
  domain, and use the logsumexp trick to compute likelihood given some
  data:

I'm not sure I'd describe this as precision, exactly; it's an issue of
numerical range. But yes, I've come across this while doing
maximum-likelihood fitting, and a coworker ran into it doing Bayesian
statistics. It definitely comes up.

Is logarray really the way to handle it, though? It seems like you
could probably get away with providing a logsum ufunc that did the
right thing. I mean, what operations does one want to do on logarrays?

add -> logsum
subtract -> ?
multiply -> add
mean -> logsum/N
median -> median
exponentiate to recover normal-space values -> exp
str -> ?

I suppose numerical integration is also valuable, so it would help to
have a numerical integrator that was reasonably smart about working
with logs. (Though really it's just a question of rescaling and
exponentiating, I think.)
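A rough sketch of the core operation, with logsum as a placeholder name and no
handling of slices that are entirely -inf, would be something like:

import numpy as np

def logsum(la, axis=None):
    # log(sum(exp(la))) along an axis, stabilized by factoring out the maximum
    la = np.asarray(la, dtype=float)
    shift = np.max(la, axis=axis)
    shifted = la - (shift if axis is None else np.expand_dims(shift, axis))
    return shift + np.log(np.sum(np.exp(shifted), axis=axis))

la = np.array([-1000.0, -1001.0, -1002.0])  # log-domain values whose exp() underflows
logsum(la)                                  # ~ -999.59, while log(exp(la).sum()) is -inf

Most of the other entries in the list above then fall out of this: multiply is
just +, mean is logsum(la, axis) - log(N), and so on.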

A logarray type would help by keeping track of the fact that its
contents were in log space, and would make expressions a little less
cumbersome, I guess. How much effort would it take to write it so that
it got all the corner cases right?

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Anne Archibald
2008/5/8 Charles R Harris [EMAIL PROTECTED]:

 What realistic probability is in the range exp(-1000) ?

Well, I ran into it while doing a maximum-likelihood fit - my early
guesses had exceedingly low probabilities, but I needed to know which
way the probabilities were increasing.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread David Cournapeau
On Fri, May 9, 2008 at 1:04 AM, Charles R Harris
[EMAIL PROTECTED] wrote:

   1e-308 ?

Yes, all the time. I mean, if it were not, why would people bother with
long double and co.? Why would denormals exist? I don't consider the
comparison with the number of particles to be really relevant here.
We are talking about implementation problems.


 Yes, logs can be useful there, but I still fail to see any precision
 advantage. As I say, to all intents and purposes, IEEE floating point *is* a
 logarithm. You will see that if you look at how log is implemented in
 hardware. I'm less sure of the C floating point library because it needs to
 be portable.


 a = np.array([-1000., -1001.])
 np.log(np.sum(np.exp(a))) -> -inf
 -1000 + np.log(np.sum([1 + np.exp(-1)])) -> correct result

 What realistic probability is in the range exp(-1000) ?

Realistic as in significant: none, of course. Realistic as in something that
happens in computation? Certainly. Typically, when you do clustering and you
have distant clusters, when you compute the probabilities of the
points from one cluster relative to another one, you can quickly get
several units from the mean. Add a small variance, and you can
quickly get (x-mu)**2/sigma**2 around 1000.

You cannot just clip to 0, specially in online settings.


 If you have a hammer... It's portable, but there are wasted cpu cycles in
 there. If speed was important, I suspect you could do better writing a low
 level function that assumed IEEE doubles and twiddled the bits.

When you call a function in python, you waste thousands of cycles at every
call. Yet, you use python, and not assembly :) The above procedure is
extremely standard, and used in all robust implementations of machine
learning algorithms I am aware of; it is implemented in HTK, a widely
used toolkit for HMMs in speech recognition, for example.

Twiddling bits is all fun, but it takes time and is extremely error
prone. Also, I don't see what kind of method you have in mind here,
exactly: how would you do a logsumexp algorithm with bit twiddling ?

cheers,

David
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread David Cournapeau
On Fri, May 9, 2008 at 1:25 AM, Charles R Harris
[EMAIL PROTECTED] wrote:



 But to expand on David's computation... If the numbers are stored without
 using logs, i.e., as the exponentials, then the sum is of the form:

 x_1*2**y_1 + ... + x_i*2**y_i

You missed the part on parametric models: in parametric settings, your
x_i are often exponential, so it makes sense to compute in the log
domain (you don't compute more log/exp than the naive implementation).

Of course, if we were not interested in log x_i in the first place,
the thing would not have made any sense.

David
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 10:42 AM, David Cournapeau [EMAIL PROTECTED]
wrote:

 On Fri, May 9, 2008 at 1:04 AM, Charles R Harris
 [EMAIL PROTECTED] wrote:

1e-308 ?

 Yes, all the time. I mean, if it was not, why people would bother with
 long double and co ? Why denormal would exist ? I don't consider the
 comparison with the number of particules to be really relevant here.
 We are talking about implementation problems.

 
  Yes, logs can be useful there, but I still fail to see any precision
  advantage. As I say, to all intents and purposes, IEEE floating point
 *is* a
  logarithm. You will see that if you look at how log is implemented in
  hardware. I'm less sure of the C floating point library because it needs
 to
  be portable.
 
 
  a = np.array([-1000., -1001.])
  np.log(np.sum(np.exp(a))) - -inf
  -1000 + np.log(np.sum([1 + np.exp(-1)])) - correct result
 
  What realistic probability is in the range exp(-1000) ?

 Realistic as significant, none of course. Realistic as it happens in
 computation ? Certainly. Typically, when you do clustering, and you
 have distant clusters, when you compute the probabilities of the
 points from one cluster relatively to another one, you can quickly get
 several units from the mean. Adds a small variance, and you can
 quickly get (x-mu)**2/sigma**2 around 1000.


Yes, and Gaussians are a delusion beyond a few sigma. One of my pet peeves.
If you have more than 8 standard deviations, then something is fundamentally
wrong in the concept and formulation. It is more likely that a particle-sized
black hole has whacked out some component of the experiment.


 You cannot just clip to 0, specially in online settings.

 
  If you have a hammer... It's portable, but there are wasted cpu cycles in
  there. If speed was important, I suspect you could do better writing a
 low
  level function that assumed IEEE doubles and twiddled the bits.

 When you call a function in python, you waste thousand cycles at every
 call. Yet, you use python, and not assembly :) The above procedure is
 extremely standard, and used in all robust implementations of machine
 learning algorithms I am aware of, it is implemented in HTK, a widely
 used toolkit for HMM for speech recognition for example.

 Twiddling bits is all fun, but it takes time and is extremely error
 prone. Also, I don't see what kind of method you have in mind here,
 exactly: how would you do a logsumexp algorithm with bit twiddling ?


You are complaining of inadequate range, but that is what scale factors are
for. Why compute exponentials and logs when all you need to do is store an
exponent in an integer?
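A sketch of the sort of thing I mean, just as an illustration (using
frexp/ldexp to carry an explicit integer exponent, not tuned in any way):

import numpy as np

def scaled_sum(x):
    # sum positive values spanning a huge dynamic range by tracking an explicit
    # integer exponent alongside the mantissas, instead of taking logs
    m, e = np.frexp(x)                       # x == m * 2**e, with 0.5 <= m < 1
    shift = e.max()
    total = np.sum(np.ldexp(m, e - shift))   # rescale so the largest term is O(1)
    return total, shift                      # the sum is total * 2**shift

The (mantissa, exponent) pair can be carried through further arithmetic and
only recombined with np.ldexp(total, shift) when the result fits in an
ordinary double.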

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Robert Kern
On Thu, May 8, 2008 at 11:25 AM, Charles R Harris
[EMAIL PROTECTED] wrote:

 On Thu, May 8, 2008 at 10:11 AM, Anne Archibald [EMAIL PROTECTED]
 wrote:

 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
 
  What realistic probability is in the range exp(-1000) ?

 Well, I ran into it while doing a maximum-likelihood fit - my early
 guesses had exceedingly low probabilities, but I needed to know which
 way the probabilities were increasing.

 The number of bosons in the universe is only on the order of 1e-42.
 Exp(-1000) may be convenient, but as a probability it is a delusion. The
hypothesis "none of the above" would have a much larger prior.

When you're running an optimizer over a PDF, you will be stuck in the
region of exp(-1000) for a substantial amount of time before you get
to the peak. If you don't use the log representation, you will never
get to the peak because all of the gradient information is lost to
floating point error. You can consult any book on computational
statistics for many more examples. This is a long-established best
practice in statistics.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 10:52 AM, David Cournapeau [EMAIL PROTECTED]
wrote:

 On Fri, May 9, 2008 at 1:25 AM, Charles R Harris
 [EMAIL PROTECTED] wrote:
 

 
  But to expand on David's computation... If the numbers are stored without
  using logs, i.e., as the exponentials, then the sum is of the form:
 
  x_1*2**y_1 + ... + x_i*2**y_i

 You missed the part on parametric models: in parametric settings, your
 x_i are often exponential,


I'm talking IEEE floating point. The x_i are never exponentials.

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED] wrote:

 On Thu, May 8, 2008 at 11:25 AM, Charles R Harris
 [EMAIL PROTECTED] wrote:
 
  On Thu, May 8, 2008 at 10:11 AM, Anne Archibald 
 [EMAIL PROTECTED]
  wrote:
 
  2008/5/8 Charles R Harris [EMAIL PROTECTED]:
  
   What realistic probability is in the range exp(-1000) ?
 
  Well, I ran into it while doing a maximum-likelihood fit - my early
  guesses had exceedingly low probabilities, but I needed to know which
  way the probabilities were increasing.
 
  The number of bosons in the universe is only on the order of 1e-42.
  Exp(-1000) may be convenient, but as a probability it is a delusion. The
  hypothesis none of the above would have a much larger prior.

 When you're running an optimizer over a PDF, you will be stuck in the
 region of exp(-1000) for a substantial amount of time before you get
 to the peak. If you don't use the log representation, you will never
 get to the peak because all of the gradient information is lost to
 floating point error. You can consult any book on computational
 statistics for many more examples. This is a long-established best
 practice in statistics.


But IEEE is already a log representation. You aren't gaining precision, you
are gaining more bits in the exponent at the expense of fewer bits in the
mantissa, i.e., less precision. As I say, it may be convenient, but if cpu
cycles matter, it isn't efficient.

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Robert Kern
On Thu, May 8, 2008 at 12:02 PM, Charles R Harris
[EMAIL PROTECTED] wrote:

 On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED] wrote:

 On Thu, May 8, 2008 at 11:25 AM, Charles R Harris
 [EMAIL PROTECTED] wrote:
 
  On Thu, May 8, 2008 at 10:11 AM, Anne Archibald
  [EMAIL PROTECTED]
  wrote:
 
  2008/5/8 Charles R Harris [EMAIL PROTECTED]:
  
   What realistic probability is in the range exp(-1000) ?
 
  Well, I ran into it while doing a maximum-likelihood fit - my early
  guesses had exceedingly low probabilities, but I needed to know which
  way the probabilities were increasing.
 
  The number of bosons in the universe is only on the order of 1e-42.
  Exp(-1000) may be convenient, but as a probability it is a delusion. The
  hypothesis none of the above would have a much larger prior.

 When you're running an optimizer over a PDF, you will be stuck in the
 region of exp(-1000) for a substantial amount of time before you get
 to the peak. If you don't use the log representation, you will never
 get to the peak because all of the gradient information is lost to
 floating point error. You can consult any book on computational
 statistics for many more examples. This is a long-established best
 practice in statistics.

 But IEEE is already a log representation. You aren't gaining precision, you
 are gaining more bits in the exponent at the expense of fewer bits in the
 mantissa, i.e., less precision.

*YES*. As David pointed out, many of these PDFs are in exponential
form. Most of the meaningful variation is in the exponent, not the
mantissa.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Nadav Horesh
Isn't the 80-bit float (float96 on IA32, float128 on AMD64) enough? It has
a 64-bit mantissa and can represent numbers up to nearly 1E(+-)5000.

  Nadav.


-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Charles R Harris
Sent: Thu 08-May-08 19:25
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] Log Arrays
 
On Thu, May 8, 2008 at 10:11 AM, Anne Archibald [EMAIL PROTECTED]
wrote:

 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
 
  What realistic probability is in the range exp(-1000) ?

 Well, I ran into it while doing a maximum-likelihood fit - my early
 guesses had exceedingly low probabilities, but I needed to know which
 way the probabilities were increasing.


The number of bosons in the universe is only on the order of 1e-42.
Exp(-1000) may be convenient, but as a probability it is a delusion. The
hypothesis "none of the above" would have a much larger prior.

But to expand on David's computation... If the numbers are stored without
using logs, i.e., as the exponentials, then the sum is of the form:

x_1*2**y_1 + ... + x_i*2**y_i

Where 1 <= x_j < 2 and both x_i and y_i are available. When the numbers are
all of the same general magnitude you get essentially the same result as
David's formula simply by dividing out the first value.

Chuck

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread David Cournapeau
On Fri, May 9, 2008 at 1:54 AM, Charles R Harris
[EMAIL PROTECTED] wrote:

 Yes, and Gaussians are a delusion beyond a few sigma. One of my pet peeves.
 If you have more than 8 standard deviations, then something is fundamentally
 wrong in the concept and formulation.

If you have a mixture of Gaussians, and the components are not all
mostly overlapping, you will get those ranges, and nothing is wrong in
the formulation. I mean, it is not like EM algorithms are untested,
totally new things. EM is used in many different fields, and all
its successful implementations use the logsumexp trick.

Look at here for the formula involved:

http://en.wikipedia.org/wiki/Expectation-maximization_algorithm

If you need to compute log(exp(-1000) + exp(-1001)), how would you
do it? If you do it the naive way, you get -inf, and it propagates
across all your computation quickly. Getting roughly -1000 instead of
-inf seems like a precision win to me. Of course you are trading
precision for range, but when you are out of range for your number
representation, the tradeoff is not a loss anymore. It is really like
denormals: they are less precise than the normal format *for the usual
range*, but in the range where denormals are used, they are much more
precise; they are actually infinitely more precise, since the normal
representation would be 0 :)
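Concretely, with the trick spelled out (a quick check, nothing more):

import numpy as np

a = np.array([-1000.0, -1001.0])
np.log(np.sum(np.exp(a)))                      # -inf: exp() underflows to 0
a.max() + np.log(np.sum(np.exp(a - a.max())))  # ~ -999.6867, the correct value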

David
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Warren Focke


On Thu, 8 May 2008, Charles R Harris wrote:

 On Thu, May 8, 2008 at 10:11 AM, Anne Archibald [EMAIL PROTECTED]
 wrote:

 2008/5/8 Charles R Harris [EMAIL PROTECTED]:

 What realistic probability is in the range exp(-1000) ?

 Well, I ran into it while doing a maximum-likelihood fit - my early
 guesses had exceedingly low probabilities, but I needed to know which
 way the probabilities were increasing.


 The number of bosons in the universe is only on the order of 1e-42.
 Exp(-1000) may be convenient, but as a probability it is a delusion. The
 hypothesis none of the above would have a much larger prior.

You might like to think so.  Sadly, not.

If you're doing a least-square (or any other maximum-likelihood) fit to 2000
data points, exp(-1000) is the highest probability you can reasonably hope for,
for a good fit.  Chi-square is -2*ln(P).  In the course of doing the fit, you
will evaluate many parameter sets which are bad fits, and the probability will
be much lower.

This has no real effect on the current discussion, but:

The number of bosons in the universe (or any subset thereof) is not
well-defined.  It's not just a question of not knowing the number; there really
is no answer to that question (well, ok, 'mu').  It's like asking which slit the
particle went through in a double-slit interference experiment.  The question is
incorrect.  Values > 1 will never be tenable, but I suspect that the minus sign
was a typo.  The estimates I hear for the number of baryons (protons, atoms) are
~ 1e80.

w

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Anne Archibald
2008/5/8 Charles R Harris [EMAIL PROTECTED]:

 On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED] wrote:
 
  When you're running an optimizer over a PDF, you will be stuck in the
  region of exp(-1000) for a substantial amount of time before you get
  to the peak. If you don't use the log representation, you will never
  get to the peak because all of the gradient information is lost to
  floating point error. You can consult any book on computational
  statistics for many more examples. This is a long-established best
  practice in statistics.

 But IEEE is already a log representation. You aren't gaining precision, you
 are gaining more bits in the exponent at the expense of fewer bits in the
 mantissa, i.e., less precision. As I say, it may be convenient, but if cpu
 cycles matter, it isn't efficient.

Efficiency is not the point here. IEEE floats simply cannot represent
the difference between exp(-1000) and exp(-1001). This difference can
matter in many contexts.

For example, suppose I have observed a source in X-rays and I want to
fit a blackbody spectrum. I have, say, hundreds of spectral bins, with
a number of photons in each. For any temperature T and amplitude A, I
can compute the distribution of photons in each bin (Poisson with mean
depending on T and A), and obtain the probability of obtaining the
observed number of photons in each bin. Even if a blackbody with
temperature T and amplitude A is a perfect fit I should expect this
number to be very small, since the chance of obtaining *exactly* this
sequence of integers is quite small. But when I start the process I
need to guess a T and an A and evaluate the probability. If my values
are far off, the probability will almost certainly be lower than
exp(-1000). But I need to know whether, if I increase T, this
probability will increase or decrease, so I cannot afford to treat it
as zero, or throw up my hands and say "This is smaller than one over
the number of baryons in the universe! My optimization problem doesn't
make any sense!"

I could also point out that frequently when one obtains these crazy
numbers, one is not working with probabilities at all, but probability
densities. A probability density of exp(-1000) means nothing special.

Finally, the fact that *you* don't think this is a useful technique
doesn't affect the fact that there is a substantial community of users
who use it daily and who would like some support for it in scipy.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 11:18 AM, David Cournapeau [EMAIL PROTECTED]
wrote:

 On Fri, May 9, 2008 at 2:06 AM, Nadav Horesh [EMAIL PROTECTED]
 wrote:
  Is the 80 bits float (float96 on IA32, float128 on AMD64) isn't enough?
 It has a 64 bits mantissa and can represent numbers up to nearly 1E(+-)5000.

 It only makes the problem happen later, I think. If you have a GMM with
 million of samples of high dimension with many clusters, any linear
 representation will fail I think. In a sense, the IEEE format is not
 adequate for that kind of computation.


David, what you are using is a log(log(x)) representation internally. IEEE
is *not* linear, it is logarithmic.

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Robert Kern
On Thu, May 8, 2008 at 12:53 PM, Charles R Harris
[EMAIL PROTECTED] wrote:

 On Thu, May 8, 2008 at 11:18 AM, David Cournapeau [EMAIL PROTECTED]
 wrote:

 On Fri, May 9, 2008 at 2:06 AM, Nadav Horesh [EMAIL PROTECTED]
 wrote:
  Is the 80 bits float (float96 on IA32, float128 on AMD64) isn't enough?
  It has a 64 bits mantissa and can represent numbers up to nearly 
  1E(+-)5000.

 It only make the problem happen later, I think. If you have a GMM with
 million of samples of high dimension with many clusters, any linear
 representation will fail I think. In a sense, the IEEE format is not
 adequate for that kind of computation.

 David, what you are using is a log(log(x)) representation internally. IEEE
 is *not* linear, it is logarithmic.

*YES*. That is precisely the point. I want 53 bits devoted to the x
part of exp(-x). The straight IEEE representation is not logarithmic
*enough*.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 11:31 AM, Warren Focke [EMAIL PROTECTED]
wrote:



 On Thu, 8 May 2008, Charles R Harris wrote:

  On Thu, May 8, 2008 at 10:11 AM, Anne Archibald 
 [EMAIL PROTECTED]
  wrote:
 
  2008/5/8 Charles R Harris [EMAIL PROTECTED]:
 
  What realistic probability is in the range exp(-1000) ?
 
  Well, I ran into it while doing a maximum-likelihood fit - my early
  guesses had exceedingly low probabilities, but I needed to know which
  way the probabilities were increasing.
 
 
  The number of bosons in the universe is only on the order of 1e-42.
  Exp(-1000) may be convenient, but as a probability it is a delusion. The
  hypothesis none of the above would have a much larger prior.

 You might like to think so.  Sadly, not.

 If you're doing a least-square (or any other maximum-likelihood) fit to
 2000
 data points, exp(-1000) is the highest probability you can reasonably hope
 for.
 For a good fit.  Chi-square is -2*ln(P).  In the course of doing the fit,
 you
 will evaluate many parameter sets which are bad fits, and the probablility
 will
 be much lower.

 This has no real effect on the current discussion, but:

 The number of bosons in the universe (or any subset thereof) is not
 well-defined.  It's not just a question of not knowing the number; there
 really
 is no answer to that question (well, ok, 'mu').  It's like asking which
 slit the
 particle went through in a double-slit interference experiment.  The
 question is
 incorrect.  Values > 1 will never be tenable, but I suspect that the minus
 sign
 was a typo.  The estimates I hear for the number of baryons (protons,
 atoms) are
 ~ 1e80.


Say, mostly photons. Temperature (~2.7 K) determines density, multiply by
volume. But I meant baryons and the last number I saw was about 1e42.

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Changes to matrix class broke scipy

2008-05-08 Thread Stéfan van der Walt
Developers, take note:

http://thread.gmane.org/gmane.comp.python.scientific.user/16297
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Anne Archibald
2008/5/8 Charles R Harris [EMAIL PROTECTED]:

 David, what you are using is a log(log(x)) representation internally. IEEE
 is *not* linear, it is logarithmic.

As Robert Kern says, yes, this is exactly what the OP and all the rest
of us want.

But it's a strange thing to say that IEEE is logarithmic - 2.3*10**1
is not exactly logarithmic, since the 2.3 is not the logarithm of
anything. IEEE floats work the same way, which is important, since it
means they can exactly represent integers of moderate size. For
example, 257 is represented as

sign 0, exponent 135, mantissa (implied leading 1).00000001b.

The exponent is indeed the integer part of the log base 2 of the
value, up to some fiddling, but the mantissa is not a logarithm of any
kind.
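If it helps to see it, the fields can be pulled apart directly (single
precision here, to match the exponent of 135 above; the struct calls are just
one way to peek at the bits):

import struct

bits = struct.unpack('<I', struct.pack('<f', 257.0))[0]
sign     = bits >> 31                  # 0
exponent = (bits >> 23) & 0xff         # 135, i.e. 127 + 8
fraction = bits & ((1 << 23) - 1)      # 2**15: only fraction bit 8 is set, the
                                       # 1 in 1.00000001b (the leading 1 is implied)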

Anyway, all this is immaterial. The point is, in spite of the fact
that floating-point numbers can represent a very wide range of
numbers, there are some important contexts in which this range is not
wide enough. One could in principle store an additional power of two
in an accompanying integer, but this would be less efficient in terms
of space and time, and more cumbersome, when for the usual situations
where this is applied, simply taking the logarithm works fine.

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 12:39 PM, Anne Archibald [EMAIL PROTECTED]
wrote:

 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
 
  David, what you are using is a log(log(x)) representation internally.
 IEEE
  is *not* linear, it is logarithmic.

 As Robert Kern says, yes, this is exactly what the OP and all the rest
 of us want.

 But it's a strange thing to say that IEEE is logarithmic - 2.3*10**1
 is not exactly logarithmic, since the 2.3 is not the logarithm of
 anything. IEEE floats work the same way, which is important, since it
 means they can exactly represent integers of moderate size. For
 example, 257 is represented as

 sign 0, exponent 135, mantissa (implied leading 1).00000001b.

 The exponent is indeed the integer part of the log base 2 of the
 value, up to some fiddling, but the mantissa is not a logarithm of any
 kind.


First order in the Taylor series. The log computation uses the fact that
the log has small curvature over the range 1..2.

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Changes to matrix class broke scipy

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 12:25 PM, Stéfan van der Walt [EMAIL PROTECTED]
wrote:

 Developers, take note:

 http://thread.gmane.org/gmane.comp.python.scientific.user/16297
 ___


Yep, not unexpected. I suppose the real question is whether it should have
been in 1.1 or 1.2, but if we are making the change we have to bite the
bullet sometime. So I can understand the irritation, even anger, up front.
Now let's see how hard the needed changes are.

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Robert Kern
On Thu, May 8, 2008 at 2:12 PM, Charles R Harris
[EMAIL PROTECTED] wrote:

 On Thu, May 8, 2008 at 11:46 AM, Anne Archibald [EMAIL PROTECTED]
 wrote:

 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
 
  On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED]
  wrote:
  
   When you're running an optimizer over a PDF, you will be stuck in the
   region of exp(-1000) for a substantial amount of time before you get
   to the peak. If you don't use the log representation, you will never
   get to the peak because all of the gradient information is lost to
   floating point error. You can consult any book on computational
   statistics for many more examples. This is a long-established best
   practice in statistics.
 
  But IEEE is already a log representation. You aren't gaining precision,
  you
  are gaining more bits in the exponent at the expense of fewer bits in
  the
  mantissa, i.e., less precision. As I say, it may be convenient, but if
  cpu
  cycles matter, it isn't efficient.

 Efficiency is not the point here. IEEE floats simply cannot represent
 the difference between exp(-1000) and exp(-1001). This difference can
 matter in many contexts.

 For example, suppose I have observed a source in X-rays and I want to
 fit a blackbody spectrum. I have, say, hundreds of spectral bins, with
 a number of photons in each. For any temperature T and amplitude A, I
 can compute the distribution of photons in each bin (Poisson with mean
 depending on T and A), and obtain the probability of obtaining the
 observed number of photons in each bin. Even if a blackbody with
 temperature T and amplitude A is a perfect fit I should expect this
 number to be very small, since the chance of obtaining *exactly* this
 sequence of integers is quite small. But when I start the process I
 need to guess a T and an A and evauate the probability. If my values
 are far off, the probability will almost certainly be lower than
 exp(-1000). But I need to know whether, if I increase T, this
 probability will increase or decrease, so I cannot afford to treat it
 as zero, or throw up my hands and say This is smaller than one over
 the number of baryons in the universe! My optimization problem doesn't
 make any sense!.

 You want the covariance of the parameters of the fit, which will be much
 more reasonable. And if not, then you have outliers. Mostly, folks are
 looking for the probability of classes of things, in this case the class of
 curves you are fitting; the probability of any given sequence, which
 approaches zero, is of much less interest. So the errors in the parameters
 of the model matter, the probability of an essentially unique sequence much
 less so.

Except that when you are doing practical computations, you will often
be computing likelihoods as part of the larger computation. Yes, the
final probabilities that you interpret won't be exp(-1000) (and if
they are, then 0 is close enough), but the calculations *in between*
do require that exp(-1000) and exp(-1001) be distinguishable from each
other.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread T J
On 5/8/08, Anne Archibald [EMAIL PROTECTED] wrote:
 Is logarray really the way to handle it, though? it seems like you
 could probably get away with providing a logsum ufunc that did the
 right thing. I mean, what operations does one want to do on logarrays?

 add -> logsum
 subtract -> ?
 multiply -> add
 mean -> logsum/N
 median -> median
 exponentiate to recover normal-space values -> exp
 str -> ?


That's about it, as far as my usage goes.  Additionally, I would also
benefit from:

   logdot
   logouter

In addition to the elementwise operations, it would be nice to have

   logsum along an axis
   logprod along an axis
   cumlogsum
   cumlogprod

Whether these are through additional ufuncs or through a subclass is
not so much of an issue for me---either would be a huge improvement to
the current situation.  One benefit of a subclass, IMO, is that it
maintains the feel of non-log arrays.  That is, when I want to
multiply two logarrays, I just do x*y, rather than x+y, but I can
understand arguments that this might not be desirable.
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 11:46 AM, Anne Archibald [EMAIL PROTECTED]
wrote:

 2008/5/8 Charles R Harris [EMAIL PROTECTED]:
 
  On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED]
 wrote:
  
   When you're running an optimizer over a PDF, you will be stuck in the
   region of exp(-1000) for a substantial amount of time before you get
   to the peak. If you don't use the log representation, you will never
   get to the peak because all of the gradient information is lost to
   floating point error. You can consult any book on computational
   statistics for many more examples. This is a long-established best
   practice in statistics.
 
  But IEEE is already a log representation. You aren't gaining precision,
 you
  are gaining more bits in the exponent at the expense of fewer bits in the
  mantissa, i.e., less precision. As I say, it may be convenient, but if
 cpu
  cycles matter, it isn't efficient.

 Efficiency is not the point here. IEEE floats simply cannot represent
 the difference between exp(-1000) and exp(-1001). This difference can
 matter in many contexts.

 For example, suppose I have observed a source in X-rays and I want to
 fit a blackbody spectrum. I have, say, hundreds of spectral bins, with
 a number of photons in each. For any temperature T and amplitude A, I
 can compute the distribution of photons in each bin (Poisson with mean
 depending on T and A), and obtain the probability of obtaining the
 observed number of photons in each bin. Even if a blackbody with
 temperature T and amplitude A is a perfect fit I should expect this


This is an interesting problem, partly because I was one of the first guys
to use synthetic spectra (NO, OH) to fit temperatures to spectral data. But
also because it might look like the Poisson matters. But probably not, the
curve fitting effectively aggregates data and the central limit theorem
kicks in, so that the normal least squares approach will probably work.
Also, if the number of photons in a bin is much more than about 10, the
computation of the Poisson distribution probably uses a Gaussian, with
perhaps a small adjustment for the tail. So I'm curious, did you find any
significant difference trying both methods?

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Anne Archibald
2008/5/8 T J [EMAIL PROTECTED]:
 On 5/8/08, Anne Archibald [EMAIL PROTECTED] wrote:
   Is logarray really the way to handle it, though? it seems like you
   could probably get away with providing a logsum ufunc that did the
   right thing. I mean, what operations does one want to do on logarrays?
  
   add -> logsum
   subtract -> ?
   multiply -> add
   mean -> logsum/N
   median -> median
   exponentiate to recover normal-space values -> exp
   str -> ?
  

  That's about it, as far as my usage goes.  Additionally, I would also
  benefit from:

logdot
logouter

  In addition to the elementwise operations, it would be nice to have

logsum along an axis
logprod along an axis
cumlogsum
cumlogprod

  Whether these are through additional ufuncs or through a subclass is
  not so much of an issue for me---either would be a huge improvement to
  the current situation.  One benefit of a subclass, IMO, is that it
  maintains the feel of non-log arrays.  That is, when I want to
  multiply two logarrays, I just do x*y, rather than x+y, but I can
  understand arguments that this might not be desirable.

Well, how about a two-step process? Let's come up with a nice
implementation of each of the above, then we can discuss whether a
subclass is warranted (and at that point each operation on the
subclass will be very simple to implement).

Most of them could be implemented almost for free by providing a ufunc
that did logsum(); then logsum along an axis becomes logsum.reduce(),
and cumlogsum becomes logsum.accumulate(). logprod is of course just
add. logouter is conveniently and efficiently written as
logprod.outer. logdot can be written in terms of logsum() and
logprod() at the cost of a sizable temporary, but should really have
its own implementation.

It might be possible to test this by using vectorize() on a
single-element version of logsum(). This would be slow, but if
vectorize() makes a real ufunc object, it should provide all the usual
handy methods (reduce, accumulate, outer, etcetera). Alternatively,
since logsum can be written in terms of existing ufuncs, it should be
possible to implement a class providing all the ufunc methods by hand
without too too much pain.
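A quick sketch of that experiment (np.frompyfunc rather than vectorize(), since
it hands back a real ufunc at the price of working on object arrays; fine for
prototyping, hopeless for speed, and the scalar helper below ignores the case
of two -inf arguments):

import math
import numpy as np

def logadd(lx, ly):
    # scalar log(exp(lx) + exp(ly)), stabilized by factoring out the larger term
    if lx < ly:
        lx, ly = ly, lx
    return lx + math.log1p(math.exp(ly - lx))

logsum = np.frompyfunc(logadd, 2, 1)   # a genuine ufunc, so the methods exist

la = np.array([-1000.0, -1001.0, -1002.0])
logsum.reduce(la)        # log-domain sum, ~ -999.59
logsum.accumulate(la)    # cumulative log-domain sum
logsum.outer(la, la)     # log-domain "outer sum"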

Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Warren Focke


On Thu, 8 May 2008, Charles R Harris wrote:

 On Thu, May 8, 2008 at 11:46 AM, Anne Archibald [EMAIL PROTECTED]
 wrote:

 2008/5/8 Charles R Harris [EMAIL PROTECTED]:

 On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED]
 wrote:

 When you're running an optimizer over a PDF, you will be stuck in the
 region of exp(-1000) for a substantial amount of time before you get
 to the peak. If you don't use the log representation, you will never
 get to the peak because all of the gradient information is lost to
 floating point error. You can consult any book on computational
 statistics for many more examples. This is a long-established best
 practice in statistics.

 But IEEE is already a log representation. You aren't gaining precision,
 you
 are gaining more bits in the exponent at the expense of fewer bits in the
 mantissa, i.e., less precision. As I say, it may be convenient, but if
 cpu
 cycles matter, it isn't efficient.

 Efficiency is not the point here. IEEE floats simply cannot represent
 the difference between exp(-1000) and exp(-1001). This difference can
 matter in many contexts.

 For example, suppose I have observed a source in X-rays and I want to
 fit a blackbody spectrum. I have, say, hundreds of spectral bins, with
 a number of photons in each. For any temperature T and amplitude A, I
 can compute the distribution of photons in each bin (Poisson with mean
 depending on T and A), and obtain the probability of obtaining the
 observed number of photons in each bin. Even if a blackbody with
 temperature T and amplitude A is a perfect fit I should expect this


 This is an interesting problem, partly because I was one of the first guys
 to use synthetic spectra (NO, OH), to fit temperatures to spectral data. But
 also because it might look like the Poisson matters. But probably not, the
 curve fitting effectively aggregates data and the central limit theorem
 kicks in, so that the normal least squares approach will probably work.
 Also, if the number of photons in a bin is much more that about 10, the
 computation of the Poisson distribution probably uses a Gaussian, with
 perhaps a small adjustment for the tail. So I'm curious, did you find any
 significant difference trying both methods?

I can't address Anne's results, but I've certainly found it to make a 
difference 
in my work, and it's pretty standard in high-energy astronomy.  The central 
limit theorem does not get you out of having to use the right PDF to compare 
data to model.  It does mean that, if you have enough events, the PDF of the 
fitness function is fairly chi-squarish, so much of the logic developed for 
least-squares fitting still applies (like for finding confidence intervals on 
the fitted parameters).  Using Poisson statistics instead of a Gaussian 
approximation is actually pretty easy, see Cash (Astrophysical Journal, Part 1, 
vol. 228, Mar. 15, 1979, p. 939-947 
http://adsabs.harvard.edu/abs/1979ApJ...228..939C)
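To tie it back to the rest of the thread: the Poisson fit statistic is
naturally accumulated directly in log space. A bare-bones sketch (model counts
m, observed counts n; gammaln supplies log(n!), and the names are mine):

import numpy as np
from scipy.special import gammaln

def poisson_loglike(n, m):
    # log of prod_i [ m_i**n_i * exp(-m_i) / n_i! ], summed term by term in log
    # space so the tiny per-bin probabilities never have to be formed directly
    n = np.asarray(n, dtype=float)
    m = np.asarray(m, dtype=float)
    return np.sum(n * np.log(m) - m - gammaln(n + 1.0))

The Cash statistic is -2 times this, up to a term that depends only on the data.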

w

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Why are ufunc docstrings useless?

2008-05-08 Thread Anne Archibald
Hi,

I frequently use functions like np.add.reduce and np.add.outer, but
their docstrings are totally uninformative. Would it be possible to
provide proper docstrings for these ufunc methods? They need not be
specific to np.add; just an explanation of what arguments to give (for
example) reduce() (presumably it takes an axis= argument? what is the
default behaviour?) and what it does would help.

Thanks,
Anne
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Why are ufunc docstrings useless?

2008-05-08 Thread Robert Kern
On Thu, May 8, 2008 at 5:28 PM, Anne Archibald
[EMAIL PROTECTED] wrote:
 Hi,

 I frequently use functions like np.add.reduce and np.add.outer, but
 their docstrings are totally uninformative. Would it be possible to
 provide proper docstrings for these ufunc methods? They need not be
 specific to np.add; just an explanation of what arguments to give (for
 example) reduce() (presumably it takes an axis= argument? what is the
 default behaviour?) and what it does would help.

Sure. The place to add them would be in the PyMethodDef ufunc_methods
array on line 3953 of numpy/core/src/ufuncobject.c. Just extend each
of the rows with a char* containing the docstring contents. A literal
string will suffice.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] online (1-shot) calculation of variance (complex)

2008-05-08 Thread Robert Kern
On Thu, May 8, 2008 at 5:45 PM, Neal Becker [EMAIL PROTECTED] wrote:
 I saw some links on 1-pass recursive calculation of mean/variance.  When I
 tried the algorithms, they did not seem to give correct results for complex
 values.  Anyone know how to correctly implement this?

Well, exactly what did you try?

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] online (1-shot) calculation of variance (complex)

2008-05-08 Thread Neal Becker
Robert Kern wrote:

 On Thu, May 8, 2008 at 5:45 PM, Neal Becker [EMAIL PROTECTED] wrote:
 I saw some links on 1-pass recursive calculation of mean/variance.  When
 I tried the algorithms, it did not seem to give correct results for
 complex
 values.  Anyone know how to correctly implement this?
 
 Well, exactly what did you try?
 
See Algorithm III here:
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] online (1-shot) calculation of variance (complex)

2008-05-08 Thread Neal Becker
Robert Kern wrote:

 On Thu, May 8, 2008 at 5:45 PM, Neal Becker [EMAIL PROTECTED] wrote:
 I saw some links on 1-pass recursive calculation of mean/variance.  When
 I tried the algorithms, it did not seem to give correct results for
 complex
 values.  Anyone know how to correctly implement this?
 
 Well, exactly what did you try?
 
Here's my Python translation:
It seems to give a variance that just converges to 0 given a vector of
Gaussian r.v.s:

class stat2 (object):
    def __init__(self):
        self.n = 0
        self._mean = 0
        self.M2 = 0

    def __iadd__(self, x):
        self.n += 1
        delta = x - self._mean
        self._mean += delta/self.n
        # this expression uses the new value of mean
        self.M2 += delta*(x - self._mean)
        return self   # needed so that 's += x' keeps the accumulator object

    def mean(self):
        return self._mean

    def var(self):
        return self.M2/(self.n - 1)


___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] online (1-shot) calculation of variance (complex)

2008-05-08 Thread Robert Kern
On Thu, May 8, 2008 at 6:05 PM, Neal Becker [EMAIL PROTECTED] wrote:
 Robert Kern wrote:

 On Thu, May 8, 2008 at 5:45 PM, Neal Becker [EMAIL PROTECTED] wrote:
 I saw some links on 1-pass recursive calculation of mean/variance.  When
 I tried the algorithms, it did not seem to give correct results for
 complex
 values.  Anyone know how to correctly implement this?

 Well, exactly what did you try?

 Here's my python translation:
 It seems to give a variance that just converges to 0 given a vector of
 gaussian r.v.:

 class stat2 (object):
def __init__(self):
self.n = 0
self._mean = 0
self.M2 = 0

def __iadd__(self, x):
self.n += 1
delta = x - self._mean
self._mean += delta/self.n
self.M2 += delta*(x - self._mean)  # This expression uses the
 new value of mean

You may not be able to convert this one to apply to complex numbers.
The recurrence relation may not hold. In the two-pass algorithm for
complex numbers, remember that you are summing

  (x[i] - mean).conj() * (x[i] - mean)

each of which is real.
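
For what it's worth, a sketch of the one-pass update with that conjugate in
place, so M2 stays real (my adaptation, so treat it as a starting point rather
than a reference implementation):

import numpy as np

class RunningVar(object):
    """One-pass (Welford-style) mean/variance that also handles complex samples."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0            # accumulates sum |x_i - mean|**2, which is real

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # conj(delta) * (x - new mean) is real and non-negative, even for complex x
        self.M2 += (np.conjugate(delta) * (x - self.mean)).real

    def var(self):
        return self.M2 / (self.n - 1)

data = np.random.randn(1000) + 1j * np.random.randn(1000)
s = RunningVar()
for x in data:
    s.add(x)
s.var(), data.var(ddof=1)        # the two should agree to rounding error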

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Status Report for NumPy 1.1.0

2008-05-08 Thread Charles Doutriaux
I don't think it is reasonable to say the trunk is in good shape when 
the power function does not work...

Just my thoughts...

C.

Charles R Harris wrote:
 Hi Jarrod,

 On Tue, May 6, 2008 at 2:40 AM, Jarrod Millman [EMAIL PROTECTED] wrote:

 Hey,

 The trunk is in pretty good shape and it is about time that I put out
 an official release.  So tomorrow (in a little over twelve hours) I am
 going to create a 1.1.x branch and the trunk will be officially open
 for 1.2 development.  If there are no major issues that show up at the
 last minute, I will tag 1.1.0 twenty-four hours after I branch.  As
 soon as I tag the release I will ask the David and Chris to create the
 official Windows and Mac binaries.  If nothing goes awry, you can
 expect the official release announcement by Monday, May 12th.

 In order to help me with the final touches, would everyone look over
 the release notes one last time:
 http://projects.scipy.org/scipy/numpy/milestone/1.1.0
 Please let me know if there are any important omissions or errors
 ASAP.


 Scalar indexing of matrices has changed, 1D arrays are now returned 
 instead of matrices. This has to be documented in the release notes.

 Chuck


 

 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://projects.scipy.org/mailman/listinfo/numpy-discussion
   

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Status Report for NumPy 1.1.0

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 6:38 PM, Charles Doutriaux [EMAIL PROTECTED]
wrote:

 I don't think it is reasonable to say the trunk is in good shape when
 the power function does not work...

 Just my thoughts...


Is that ticket #301? What are you suggesting it do?

Chuck


Re: [Numpy-discussion] Log Arrays

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 2:37 PM, Warren Focke [EMAIL PROTECTED]
wrote:



 On Thu, 8 May 2008, Charles R Harris wrote:

  On Thu, May 8, 2008 at 11:46 AM, Anne Archibald 
 [EMAIL PROTECTED]
  wrote:
 
  2008/5/8 Charles R Harris [EMAIL PROTECTED]:
 
  On Thu, May 8, 2008 at 10:56 AM, Robert Kern [EMAIL PROTECTED]
  wrote:
 
  When you're running an optimizer over a PDF, you will be stuck in the
  region of exp(-1000) for a substantial amount of time before you get
  to the peak. If you don't use the log representation, you will never
  get to the peak because all of the gradient information is lost to
  floating point error. You can consult any book on computational
  statistics for many more examples. This is a long-established best
  practice in statistics.
 
  But IEEE is already a log representation. You aren't gaining precision,
  you are gaining more bits in the exponent at the expense of fewer bits
  in the mantissa, i.e., less precision. As I say, it may be convenient,
  but if cpu cycles matter, it isn't efficient.
 
  Efficiency is not the point here. IEEE floats simply cannot represent
  the difference between exp(-1000) and exp(-1001). This difference can
  matter in many contexts.
 
  For example, suppose I have observed a source in X-rays and I want to
  fit a blackbody spectrum. I have, say, hundreds of spectral bins, with
  a number of photons in each. For any temperature T and amplitude A, I
  can compute the distribution of photons in each bin (Poisson with mean
  depending on T and A), and obtain the probability of obtaining the
  observed number of photons in each bin. Even if a blackbody with
  temperature T and amplitude A is a perfect fit I should expect this
 
 
  This is an interesting problem, partly because I was one of the first guys
  to use synthetic spectra (NO, OH) to fit temperatures to spectral data. But
  also because it might look like the Poisson statistics matter. Probably not:
  the curve fitting effectively aggregates data and the central limit theorem
  kicks in, so the normal least-squares approach will probably work. Also, if
  the number of photons in a bin is much more than about 10, the computation
  of the Poisson distribution probably uses a Gaussian, with perhaps a small
  adjustment for the tail. So I'm curious, did you find any significant
  difference trying both methods?

 I can't address Anne's results, but I've certainly found it to make a
 difference in my work, and it's pretty standard in high-energy astronomy.
 The central limit theorem does not get you out of having to use the right
 PDF to compare data to model. It does mean that, if you have enough events,
 the PDF of the fitness function is fairly chi-squarish, so much of the
 logic developed for least-squares fitting still applies (like for finding
 confidence intervals on the fitted parameters). Using Poisson statistics
 instead of a Gaussian approximation is actually pretty easy, see Cash
 (Astrophysical Journal, Part 1, vol. 228, Mar. 15, 1979, p. 939-947,
 http://adsabs.harvard.edu/abs/1979ApJ...228..939C).
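
A minimal sketch of the statistic being described (hypothetical function, not
from any library; it assumes every model expectation e_i is positive):

import numpy as np

def cash_statistic(counts, model_counts):
    """Cash (1979) C = 2 * sum(e_i - n_i*ln(e_i)): -2 ln L for Poisson data
    with the data-only ln(n_i!) term dropped.  Minimize over the model
    parameters (T and A in the example above)."""
    n = np.asarray(counts, dtype=float)
    e = np.asarray(model_counts, dtype=float)
    return 2.0 * np.sum(e - n * np.log(e))

Differences in C between fits behave roughly like chi-square differences,
which is why the least-squares machinery for confidence intervals carries over.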


Interesting paper.  I've also been playing with a 64 bit floating format
with a 31 bit offset binary exponent and 33 bit mantissa with a hidden bit,
which can hold the probability of any given sequence of 2**30 - 1 coin
tosses, or about 1e-323228496. It looks to be pretty efficient for
multiplication and compare, and functions like log and exp aren't hard to
do.

Chuck


Re: [Numpy-discussion] Why are ufunc docstrings useless?

2008-05-08 Thread Anne Archibald
2008/5/8 Robert Kern [EMAIL PROTECTED]:
 On Thu, May 8, 2008 at 5:28 PM, Anne Archibald
 [EMAIL PROTECTED] wrote:
 Hi,

 I frequently use functions like np.add.reduce and np.add.outer, but
 their docstrings are totally uninformative. Would it be possible to
 provide proper docstrings for these ufunc methods? They need not be
 specific to np.add; just an explanation of what arguments to give (for
 example) reduce() (presumably it takes an axis= argument? what is the
 default behaviour?) and what it does would help.

 Sure. The place to add them would be in the PyMethodDef ufunc_methods
 array on line 3953 of numpy/core/src/ufuncobject.c. Just extend each
 of the rows with a char* containing the docstring contents. A literal
 string will suffice.

Thanks! Done add, reduce, outer, and reduceat. What about __call__?
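
For reference, a short illustration of the behaviour those docstrings need to
spell out (results assume numpy's defaults, e.g. reduce operating over axis=0
when no axis is given):

import numpy as np

a = np.arange(6).reshape(2, 3)         # [[0, 1, 2], [3, 4, 5]]
np.add.reduce(a)                       # default axis=0 -> array([3, 5, 7]), same as a.sum(axis=0)
np.add.reduce(a, axis=1)               # array([ 3, 12])
np.add.outer([1, 2], [10, 20, 30])     # pairwise sums, shape (2, 3)
np.add.reduceat(np.arange(8), [0, 4])  # sums over [0:4] and [4:8] -> array([ 6, 22])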

Anne


Re: [Numpy-discussion] Why are ufunc docstrings useless?

2008-05-08 Thread Robert Kern
On Thu, May 8, 2008 at 9:52 PM, Anne Archibald
[EMAIL PROTECTED] wrote:

 Thanks! Done add, reduce, outer, and reduceat. What about __call__?

If anyone knows enough to explicitly request a docstring from
__call__, they already know what it does.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco


Re: [Numpy-discussion] Why are ufunc docstrings useless?

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 9:07 PM, Robert Kern [EMAIL PROTECTED] wrote:

 On Thu, May 8, 2008 at 9:52 PM, Anne Archibald
 [EMAIL PROTECTED] wrote:

  Thanks! Done add, reduce, outer, and reduceat. What about __call__?

 If anyone knows enough to explicitly request a docstring from
 __call__, they already know what it does.

 --


It's easier/better to do this in numpy/add_newdocs.py. For example:

In [14]: from  numpy.lib import add_newdoc as add

In [15]: add('numpy.core','ufunc',('reduce','hello world'))

In [16]: ufunc.reduce.__doc__
Out[16]: 'hello world'

You don't have to clutter up the C sources, and you can use multi-line
(triple-quoted) strings instead of putting all those newlines in.

Chuck


Re: [Numpy-discussion] Why are ufunc docstrings useless?

2008-05-08 Thread Robert Kern
On Thu, May 8, 2008 at 10:12 PM, Charles R Harris
[EMAIL PROTECTED] wrote:
 It's easier/better to do this in numpy/add_newdocs.py. For example:

 In [14]: from  numpy.lib import add_newdoc as add

 In [15]: add('numpy.core','ufunc',('reduce','hello world'))

 In [16]: ufunc.reduce.__doc__
 Out[16]: 'hello world'

 You don't have to clutter up the C sources, and you can use multi-line
 (triple-quoted) strings instead of putting all those newlines in.

Ah good. I didn't realize it would handle builtin type methods.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco


Re: [Numpy-discussion] Why are ufunc docstrings useless?

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 9:12 PM, Charles R Harris [EMAIL PROTECTED]
wrote:



 On Thu, May 8, 2008 at 9:07 PM, Robert Kern [EMAIL PROTECTED] wrote:

 On Thu, May 8, 2008 at 9:52 PM, Anne Archibald
 [EMAIL PROTECTED] wrote:

  Thanks! Done add, reduce, outer, and reduceat. What about __call__?

 If anyone knows enough to explicitly request a docstring from
 __call__, they already know what it does.

 --


 It's easier/better to do this in numpy/add_newdocs.py. For example:

 In [14]: from  numpy.lib import add_newdoc as add

 In [15]: add('numpy.core','ufunc',('reduce','hello world'))

 In [16]: ufunc.reduce.__doc__
 Out[16]: 'hello world'

 You don't have to clutter up the C sources, and you can use multi-line
 (triple-quoted) strings instead of putting all those newlines in.


Also, not all c compilers will join strings on separate lines. You need to
add explicit continuation backslashes.

Chuck


Re: [Numpy-discussion] Why are ufunc docstrings useless?

2008-05-08 Thread Anne Archibald
2008/5/8 Robert Kern [EMAIL PROTECTED]:
 On Thu, May 8, 2008 at 9:52 PM, Anne Archibald
 [EMAIL PROTECTED] wrote:

 Thanks! Done add, reduce, outer, and reduceat. What about __call__?

 If anyone knows enough to explicitly request a docstring from
 __call__, they already know what it does.

How exactly are they to find out? It does take additional arguments,
for example dtype and out - I think.

Also, help(np.add) displays the object, its docstring, its
methods, and all their docstrings. So it provides a way to get a
docstring out of __call__ without having to know what it is.

Anne


[Numpy-discussion] searchsorted() and memory cache

2008-05-08 Thread Andrew Straw
I've got a big element array (25 million int64s) that searchsorted()
takes a long time to grind through. After a bit of digging in the
literature and the numpy source code, I believe that searchsorted() is
implementing a classic binary search, which is pretty bad in terms of
cache misses. There are several modern implementations of binary search
which arrange items in memory such that cache misses are much more rare.
Clearly making such an indexing arrangement would take time, but in my
particular case, I can spare the time to create an index if searching
was faster, since I'd make the index once but do the searching many times.

Is there an implementation of such an algorithm that works easily with
numpy? Also, can you offer any advice, suggestions, and comments to me
if I attempted to implement such an algorithm?

Thanks,
Andrew


Re: [Numpy-discussion] Why are ufunc docstrings useless?

2008-05-08 Thread Robert Kern
On Thu, May 8, 2008 at 10:39 PM, Anne Archibald
[EMAIL PROTECTED] wrote:
 2008/5/8 Robert Kern [EMAIL PROTECTED]:
 On Thu, May 8, 2008 at 9:52 PM, Anne Archibald
 [EMAIL PROTECTED] wrote:

 Thanks! Done add, reduce, outer, and reduceat. What about __call__?

 If anyone knows enough to explicitly request a docstring from
 __call__, they already know what it does.

 How exactly are they to find out? It does take additional arguments,
 for example dtype and out - I think.

That should be recorded in the ufunc's main docstring, e.g.
numpy.add.__doc__, since that is what people will actually be calling.
No one will explicitly call numpy.add.__call__(x,y).

 Also, help(np.add) displays the object, its docstring, its
 methods, and all their docstrings. So it provides a way to get a
 docstring out of __call__ without having to know what it is.

Meh. All it can usefully say is "Refer to the main docstring", which
is more or less what it currently says.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco


Re: [Numpy-discussion] searchsorted() and memory cache

2008-05-08 Thread Robert Kern
On Thu, May 8, 2008 at 10:51 PM, Andrew Straw [EMAIL PROTECTED] wrote:
 I've got a big element array (25 million int64s) that searchsorted()
 takes a long time to grind through. After a bit of digging in the
 literature and the numpy source code, I believe that searchsorted() is
 implementing a classic binary search,

Yes.

 which is pretty bad in terms of
 cache misses. There are several modern implementations of binary search
 which arrange items in memory such that cache misses are much more rare.
 Clearly making such an indexing arrangement would take time, but in my
 particular case, I can spare the time to create an index if searching
 was faster, since I'd make the index once but do the searching many times.

 Is there an implementation of such an algorithm that works easily with
 numpy? Also, can you offer any advice, suggestions, and comments to me
 if I attempted to implement such an algorithm?

I'm no help. You seem to know more than I do. Sadly, the first few
Google hits I get for "binary search minimize cache misses" are
patents. I don't know what the substantive content of those patents
are; I have a strict policy of not reading patents.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco


Re: [Numpy-discussion] searchsorted() and memory cache

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 9:55 PM, Robert Kern [EMAIL PROTECTED] wrote:

 On Thu, May 8, 2008 at 10:51 PM, Andrew Straw [EMAIL PROTECTED] wrote:
  I've got a big element array (25 million int64s) that searchsorted()
  takes a long time to grind through. After a bit of digging in the
  literature and the numpy source code, I believe that searchsorted() is
  implementing a classic binary search,

 Yes.

  which is pretty bad in terms of
  cache misses. There are several modern implementations of binary search
  which arrange items in memory such that cache misses are much more rare.
  Clearly making such an indexing arrangement would take time, but in my
  particular case, I can spare the time to create an index if searching
  was faster, since I'd make the index once but do the searching many
 times.
 
  Is there an implementation of such an algorithm that works easily with
  numpy? Also, can you offer any advice, suggestions, and comments to me
  if I attempted to implement such an algorithm?

 I'm no help. You seem to know more than I do. Sadly, the first few
 Google hits I get for "binary search minimize cache misses" are
 patents. I don't know what the substantive content of those patents
 are; I have a strict policy of not reading patents.


I would be interested in adding such a thing if it wasn't patent encumbered.
A good start would be a prototype in python to show how it all went together
and whether it needed a separate indexing/lookup function or could be fit
into the current setup.

Chuck


Re: [Numpy-discussion] Why are ufunc docstrings useless?

2008-05-08 Thread Anne Archibald
2008/5/8 Robert Kern [EMAIL PROTECTED]:
 On Thu, May 8, 2008 at 10:39 PM, Anne Archibald
 [EMAIL PROTECTED] wrote:
 2008/5/8 Robert Kern [EMAIL PROTECTED]:

 If anyone knows enough to explicitly request a docstring from
 __call__, they already know what it does.

 How exactly are they to find out? It does take additional arguments,
 for example dtype and out - I think.

 That should be recorded in the ufunc's main docstring, e.g.
 numpy.add.__doc__, since that is what people will actually be calling.
 No one will explicitly call numpy.add.__call__(x,y).

 Also, help(np.add) displays the object, its docstring, its
 methods, and all their docstrings. So it provides a way to get a
 docstring out of __call__ without having to know what it is.

 Meh. All it can usefully say is "Refer to the main docstring", which
 is more or less what it currently says.

So is the custom that double-underscore methods get documented in the
class docstring and normal methods get documented in their own
docstrings?

Anne


Re: [Numpy-discussion] Why are ufunc docstrings useless?

2008-05-08 Thread Robert Kern
On Thu, May 8, 2008 at 11:31 PM, Anne Archibald
[EMAIL PROTECTED] wrote:
 2008/5/8 Robert Kern [EMAIL PROTECTED]:
 On Thu, May 8, 2008 at 10:39 PM, Anne Archibald
 [EMAIL PROTECTED] wrote:
 2008/5/8 Robert Kern [EMAIL PROTECTED]:

 If anyone knows enough to explicitly request a docstring from
 __call__, they already know what it does.

 How exactly are they to find out? It does take additional arguments,
 for example dtype and out - I think.

 That should be recorded in the ufunc's main docstring, e.g.
 numpy.add.__doc__, since that is what people will actually be calling.
 No one will explicitly call numpy.add.__call__(x,y).

 Also, help(np.add) displays the object, its docstring, its
 methods, and all their docstrings. So it provides a way to get a
 docstring out of __call__ without having to know what it is.

 Meh. All it can usefully say is "Refer to the main docstring", which
 is more or less what it currently says.

 So is the custom that double-underscore methods get documented in the
 class docstring and normal methods get documented in their own
 docstrings?

No. Objects whose main purpose is to be callable should document their
call signature in the main docstring since that's where the objects
they are mimicking (functions) have their docstring.

In this case, there is also a technical reason: you can't override
add.__call__.__doc__, just ufunc.__call__.__doc__, but the call
signatures vary between ufuncs.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco


Re: [Numpy-discussion] searchsorted() and memory cache

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 10:30 PM, Charles R Harris [EMAIL PROTECTED]
wrote:



 On Thu, May 8, 2008 at 9:55 PM, Robert Kern [EMAIL PROTECTED] wrote:

 On Thu, May 8, 2008 at 10:51 PM, Andrew Straw [EMAIL PROTECTED]
 wrote:
  I've got a big element array (25 million int64s) that searchsorted()
  takes a long time to grind through. After a bit of digging in the
  literature and the numpy source code, I believe that searchsorted() is
  implementing a classic binary search,

 Yes.

  which is pretty bad in terms of
  cache misses. There are several modern implementations of binary search
  which arrange items in memory such that cache misses are much more rare.
  Clearly making such an indexing arrangement would take time, but in my
  particular case, I can spare the time to create an index if searching
  was faster, since I'd make the index once but do the searching many
 times.
 
  Is there an implementation of such an algorithm that works easily with
  numpy? Also, can you offer any advice, suggestions, and comments to me
  if I attempted to implement such an algorithm?

 I'm no help. You seem to know more than I do. Sadly, the first few
 Google hits I get for "binary search minimize cache misses" are
 patents. I don't know what the substantive content of those patents
 are; I have a strict policy of not reading patents.


 I would be interested in adding such a thing if it wasn't patent
 encumbered. A good start would be a prototype in python to show how it all
 went together and whether it needed a separate indexing/lookup function or
 could be fit into the current setup.


One way I can think of doing this is to have two indices. One is the usual
sorted list, the second consists of, say, every 1024'th entry in the first
list. Then search the second list first to find the part of the first list
to search. That won't get you into the very best cache, but it could buy you
a factor of 2x-4x in speed. It's sort of splitting the binary tree into two
levels.

Chuck


Re: [Numpy-discussion] searchsorted() and memory cache

2008-05-08 Thread Charles R Harris
On Thu, May 8, 2008 at 11:06 PM, Charles R Harris [EMAIL PROTECTED]
wrote:



 On Thu, May 8, 2008 at 10:30 PM, Charles R Harris 
 [EMAIL PROTECTED] wrote:



 On Thu, May 8, 2008 at 9:55 PM, Robert Kern [EMAIL PROTECTED]
 wrote:

 On Thu, May 8, 2008 at 10:51 PM, Andrew Straw [EMAIL PROTECTED]
 wrote:
  I've got a big element array (25 million int64s) that searchsorted()
  takes a long time to grind through. After a bit of digging in the
  literature and the numpy source code, I believe that searchsorted() is
  implementing a classic binary search,

 Yes.

  which is pretty bad in terms of
  cache misses. There are several modern implementations of binary search
  which arrange items in memory such that cache misses are much more
 rare.
  Clearly making such an indexing arrangement would take time, but in my
  particular case, I can spare the time to create an index if searching
  was faster, since I'd make the index once but do the searching many
 times.
 
  Is there an implementation of such an algorithm that works easily with
  numpy? Also, can you offer any advice, suggestions, and comments to me
  if I attempted to implement such an algorithm?

 I'm no help. You seem to know more than I do. Sadly, the first few
 Google hits I get for "binary search minimize cache misses" are
 patents. I don't know what the substantive content of those patents
 are; I have a strict policy of not reading patents.


 I would be interested in adding such a thing if it wasn't patent
 encumbered. A good start would be a prototype in python to show how it all
 went together and whether it needed a separate indexing/lookup function or
 could be fit into the current setup.


 One way I can think of doing this is to have two indices. One is the usual
 sorted list, the second consists of, say, every 1024'th entry in the first
 list. Then search the second list first to find the part of the first list
 to search. That won't get you into the very best cache, but it could buy you
 a factor of 2x-4x in speed. It's sort of splitting the binary tree into two
 levels.


You could even do it with your data from python, generating the second list
and calling searchsorted multiple times. If you are searching for a bunch of
values, it's probably good to also sort them first so they bunch together in
the same part of the big list.
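
A rough prototype of that two-level scheme in Python (hypothetical helper,
untested; the block size is only a starting point for tuning):

import numpy as np

def blocked_searchsorted(big, queries, block=1024):
    """Two-level lookup: a coarse index of every block-th element of the
    sorted array `big` narrows each query to one small slice, which is then
    binary-searched on its own.  Intended to return the same indices as
    np.searchsorted(big, queries)."""
    big = np.asarray(big)
    q = np.asarray(queries)
    coarse = big[::block]                  # second, much smaller index
    out = np.empty(q.shape, dtype=np.intp)
    for idx in np.argsort(q):              # sorted queries hit nearby slices
        x = q[idx]
        c = np.searchsorted(coarse, x)     # which block can contain x
        lo = (c - 1) * block if c > 0 else 0
        hi = min(c * block + 1, len(big))
        out[idx] = lo + np.searchsorted(big[lo:hi], x)
    return out

For 25 million int64s and block=1024 the coarse index is only about 25
thousand entries (a couple hundred kilobytes), so it should stay cached
across queries while each fine search touches a single small slice of the
big array.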

Chuck