Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

2014-07-25 Thread Eelco Hoogendoorn
To elaborate on that point: knowing that numpy accumulates in a simple
first-to-last sweep, and does not implicitly upcast, the original problem
can be solved in several ways: by specifying a higher precision to sum
with, or by nesting the summation, as in X.mean(0).mean(0) == 1.0. I
personally like this explicitness, and am wary of numpy doing overly
clever things behind the scenes, as I can think of other code that might
break if things change too radically. For instance, I often sort large
arrays with a large spread in magnitudes before summation, relying on the
fact that summing the smallest values first gives the best precision. Any
changes made to reduction behavior should try to stay backwards compatible
with such properties of straightforward reductions, or else a lot of code
is going to break without warning.
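
For concreteness, a minimal sketch of the failure mode and both remedies
(the shape is chosen only for illustration; on numpy versions with
pairwise summation the naive mean may already come out right):

import numpy as np

# 2**25 float32 ones: a naive left-to-right float32 sum stalls at 2**24,
# where adding 1.0 no longer changes the accumulator, so the mean can
# come out as 0.5 instead of 1.0.
X = np.ones((2**12, 2**13), np.float32)

print(X.mean())                  # may print 0.5 with a naive float32 sweep
print(X.mean(dtype=np.float64))  # remedy 1: higher-precision accumulator
print(X.mean(0).mean(0))         # remedy 2: nested summation, small partial sums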

I suppose using maximum precision internally, and nesting all reductions
over multiple axes of an ndarray, are both easy-to-implement improvements
that do not come with any drawbacks that I can think of. Actually, I am
not so sure about the maximum precision: I personally prefer to make an
informed decision about the precision used, and to get an error on a
platform that does not support the specified precision, rather than obtain
subtly or horribly broken results without warning when moving my code to a
different platform/compiler/whatever.


On Fri, Jul 25, 2014 at 5:37 AM, Eelco Hoogendoorn 
hoogendoorn.ee...@gmail.com wrote:

  Perhaps it is a slightly semantic discussion; but all fp calculations
 have errors, and there are always strategies for making them smaller. We
 just don't happen to like the error in this case; but rest assured it
 won't be hard to find new cases of 'blatantly wrong' results, no matter
 what accumulator is implemented. That's no reason not to try to be clever
 about it, but there isn't going to be an algorithm that is best for all
 possible inputs, and in the end the most important thing is that the
 algorithm used is specified in the docs.
  --
 From: Alan G Isaac alan.is...@gmail.com
 Sent: 25-7-2014 00:10

 To: Discussion of Numerical Python numpy-discussion@scipy.org
 Subject: Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

 On 7/24/2014 4:42 PM, Eelco Hoogendoorn wrote:
  This isn't a bug report, but rather a feature request.

 I'm not sure this statement is correct.  The mean of a float32 array
 can certainly be computed as a float32.  Currently this is
 not necessarily what happens, not even approximately.
 That feels a lot like a bug, even if we can readily understand
 how the algorithm currently used produces it.  To say whether
 it is a bug or not, don't we have to ask about the intent of `mean`?
 If the intent is to sum and divide, then it is not a bug.
 If the intent is to produce the mean, then it is a bug.

 Alan Isaac



Re: [Numpy-discussion] change default integer from int32 to int64 on win64?

2014-07-25 Thread Olivier Grisel
The dtype returned by np.where looks right (int64):

>>> import platform
>>> platform.architecture()
('64bit', 'WindowsPE')
>>> import numpy as np
>>> np.__version__
'1.9.0b1'
>>> a = np.zeros(10)
>>> np.where(a == 0)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64),)

-- 
Olivier


Re: [Numpy-discussion] ANN: Pandas 0.14.0 Release Candidate 1

2014-07-25 Thread Jeff
How does the build trigger? If it's just a matter of clicking on something
when released, I think we can handle that :)

On Saturday, May 17, 2014 7:22:00 AM UTC-4, Jeff wrote:

 Hi,

 I'm pleased to announce the availability of the first release candidate of 
 Pandas 0.14.0.
 Please try this RC and report any issues here: Pandas Issues 
 https://github.com/pydata/pandas/issues
 We will be releasing officially in about 2 weeks or so.

 This is a major release from 0.13.1 and includes a small number of API 
 changes, several new features, enhancements, and 
 performance improvements along with a large number of bug fixes. 

 Highlights include:

- Officially support Python 3.4
- SQL interfaces updated to use sqlalchemy
- Display interface changes
- MultiIndexing Using Slicers
- Ability to join a singly-indexed DataFrame with a multi-indexed 
DataFrame
- More consistency in groupby results and more flexible groupby 
specifications
- Holiday calendars are now supported in CustomBusinessDay
- Several improvements in plotting functions, including: hexbin, area 
and pie plots.
- Performance doc section on I/O operations

 Since there are some significant changes in the default way DataFrames are
 displayed, I have put up a comment issue looking for feedback here:
 https://github.com/pydata/pandas/issues/7146

 Here are the full whatsnew and documentation links:

 v0.14.0 Whatsnew 
 http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html

 v0.14.0 Documentation Page 
 http://pandas-docs.github.io/pandas-docs-travis/

 Source tarballs and Windows builds are available here:

 Pandas v0.14rc1 Release https://github.com/pydata/pandas/releases

 A big thank you to everyone who contributed to this release!

 Jeff



Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

2014-07-25 Thread RayS
At 01:22 AM 7/25/2014, you wrote:
  Actually, I am not so sure about the maximum
 precision: I personally prefer to make an
 informed decision about the precision used, and to
 get an error on a platform that does not support
 the specified precision, rather than obtain
 subtly or horribly broken results without
 warning when moving my code to a different platform/compiler/whatever.

We were talking about this in the office, as we
realized it does affect a couple of lines dealing
with large arrays, including complex64.
Since I expect Python modules to work uniformly
cross-platform unless documented otherwise, to me
that includes 32- vs 64-bit platforms, implying
that the modules should automatically use
accumulators large enough for the input data type.

http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
does mention the inaccuracy;
http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.mstats.gmean.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
and the like do not, exactly.

- Ray



Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

2014-07-25 Thread Robert Kern
On Fri, Jul 25, 2014 at 3:11 PM, RayS r...@blue-cove.com wrote:
 At 01:22 AM 7/25/2014, you wrote:
  Actually, I am not so sure about the maximum
 precision: I personally prefer to make an
 informed decision about the precision used, and to
 get an error on a platform that does not support
 the specified precision, rather than obtain
 subtly or horribly broken results without
 warning when moving my code to a different platform/compiler/whatever.

 We were talking about this in the office, as we
 realized it does affect a couple of lines dealing
 with large arrays, including complex64.
 Since I expect Python modules to work uniformly
 cross-platform unless documented otherwise, to me
 that includes 32- vs 64-bit platforms, implying
 that the modules should automatically use
 accumulators large enough for the input data type.

The 32/64-bitness of your platform has nothing to do with floating
point. Nothing discussed in this thread is platform-specific (modulo
some minor details about the hardware FPU, but that should be taken as
read).

-- 
Robert Kern


Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

2014-07-25 Thread RayS

At 07:22 AM 7/25/2014, you wrote:

 We were talking about this in the office, as we
 realized it does affect a couple of lines dealing
 with large arrays, including complex64.
 Since I expect Python modules to work uniformly
 cross-platform unless documented otherwise, to me
 that includes 32- vs 64-bit platforms, implying
 that the modules should automatically use
 accumulators large enough for the input data type.

The 32/64-bitness of your platform has nothing to do with floating
point.


As a naive end user, I can, and do, download different binaries for
different CPUs/Windows versions and will get different results:

http://mail.scipy.org/pipermail/numpy-discussion/2014-July/070747.html


 Nothing discussed in this thread is platform-specific (modulo
some minor details about the hardware FPU, but that should be taken as
read).


And compilers, apparently.

The important point was that it would be best if all of the methods
affected by summing 32-bit floats with 32-bit accumulators had the
same Notes as numpy.mean(). We went through a lot of code yesterday,
assuming that any numpy or scipy.stats functions that use
accumulators suffer the same issue, whether noted or not, and found it to be true.


"Depending on the input data, this can cause the results to be
inaccurate, especially for float32 (see example below). Specifying a
higher-precision accumulator using the dtype
(http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html#numpy.dtype)
keyword can alleviate this issue." seems rather un-Pythonic.


- Ray



Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

2014-07-25 Thread Eelco Hoogendoorn
Arguably, the whole of floating point numbers and their related shenanigans
is not very pythonic in the first place. The accuracy of the output WILL
depend on the input, to some degree or another. At the risk of repeating
myself: explicit is better than implicit.

-Original Message-
From: RayS r...@blue-cove.com
Sent: 25-7-2014 19:56
To: Discussion of Numerical Python numpy-discussion@scipy.org
Subject: Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

At 07:22 AM 7/25/2014, you wrote:

 We were talking about this in the office, as we
 realized it does affect a couple of lines dealing
 with large arrays, including complex64.
 Since I expect Python modules to work uniformly
 cross-platform unless documented otherwise, to me
 that includes 32- vs 64-bit platforms, implying
 that the modules should automatically use
 accumulators large enough for the input data type.

The 32/64-bitness of your platform has nothing to do with floating
point.

As a naive end user, I can, and do, download different binaries for different
CPUs/Windows versions and will get different results:
http://mail.scipy.org/pipermail/numpy-discussion/2014-July/070747.html


 Nothing discussed in this thread is platform-specific (modulo
some minor details about the hardware FPU, but that should be taken as
read).

And compilers, apparently.

The important point was that it would be best if all of the methods affected by
summing 32-bit floats with 32-bit accumulators had the same Notes as
numpy.mean(). We went through a lot of code yesterday, assuming that any numpy
or scipy.stats functions that use accumulators suffer the same issue, whether
noted or not, and found it to be true.

"Depending on the input data, this can cause the results to be inaccurate,
especially for float32 (see example below). Specifying a higher-precision
accumulator using the dtype keyword can alleviate this issue." seems rather
un-Pythonic.

- Ray


Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

2014-07-25 Thread Alan G Isaac
On 7/25/2014 1:40 PM, Eelco Hoogendoorn wrote:
 At the risk of repeating myself: explicit is better than implicit


This sounds like an argument for renaming the `mean` function `naivemean`
rather than `mean`.  Whatever numpy names `mean`, shouldn't it
implement an algorithm that produces the mean?  And obviously, for any
float data type, the mean value of the values in the array is representable
as a value of the same type.

Alan Isaac



Re: [Numpy-discussion] [pydata] Re: ANN: Pandas 0.14.0 Release Candidate 1

2014-07-25 Thread Matthew Brett
Hi,

On Fri, Jul 25, 2014 at 9:52 AM, Jeff jeffreb...@gmail.com wrote:
 How does the build trigger? If it's just a matter of clicking on something
 when released, I think we can handle that :)


The two options are:

* I add you and whoever else does releases to my repo, and you can
trigger builds by pressing a button on the travis page for my repo, or
pushing commits to the repo
* You take over the repo, I submit a pull request to make sure you
have auth to upload to rackspace, and proceed as above.

But yes - single click - build

Cheers,

Matthew


Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

2014-07-25 Thread Nathaniel Smith
On Fri, Jul 25, 2014 at 5:56 PM, RayS r...@blue-cove.com wrote:
 The important point was that it would be best if all of the methods affected
 by summing 32-bit floats with 32-bit accumulators had the same Notes as
 numpy.mean(). We went through a lot of code yesterday, assuming that any
 numpy or scipy.stats functions that use accumulators suffer the same issue,
 whether noted or not, and found it to be true.

Do you have a list of the functions that are affected?

 "Depending on the input data, this can cause the results to be inaccurate,
 especially for float32 (see example below). Specifying a higher-precision
 accumulator using the dtype keyword can alleviate this issue." seems rather
 un-Pythonic.

It's true that in its full generality, this problem just isn't
something numpy can solve. Using float32 is extremely dangerous and
should not be attempted unless you're prepared to seriously analyze
all your code for numeric stability; IME it often runs into problems
in practice, in any number of ways. Remember that it only has as much
precision as a 24 bit integer. There are good reasons why float64 is
the default!
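
As a two-line sanity check of that 24-bit limit (not part of the original
argument, just something anyone can run):

import numpy as np

# float32 carries a 24-bit significand, so at 2**24 adding 1.0 is a no-op;
# this is exactly where a float32 running sum of ones stops growing.
a = np.float32(2**24)
print(a + np.float32(1) == a)          # True
print(np.finfo(np.float32).nmant + 1)  # 24: 23 stored bits + 1 implicit bit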

That said, it does seem that np.mean could be implemented better than
it is, even given float32's inherent limitations. If anyone wants to
implement better algorithms for computing the mean, variance, sums,
etc., then we would love to add them to numpy. I'd suggest
implementing them as gufuncs -- there are examples of defining gufuncs
in numpy/linalg/umath_linalg.c.src.

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

2014-07-25 Thread Eelco Hoogendoorn
It need not be exactly representable as such; take the mean of [1, 1+eps]
for instance. Granted, there are at most two numbers in the range of the
original dtype which are closest to the true mean; but I'm not sure that
computing them exactly is a tractable problem for arbitrary input.
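
Concretely (a quick snippet to illustrate the point, nothing more):

import numpy as np

eps = np.finfo(np.float64).eps
x = np.array([1.0, 1.0 + eps])
# The true mean, 1 + eps/2, lies strictly between the adjacent
# representable doubles 1.0 and 1.0 + eps, so no float64 equals it exactly.
print(x.mean())   # rounds to one of the two neighbours (1.0 under round-to-even)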

I'm not sure what is considered best practice for these problems, or if
there is one, considering the heterogeneity of the problem. As noted earlier,
summing a list of floating point values is a remarkably multifaceted
problem, once you get down into the details.

I think it should be understood that all floating point algorithms are
subject to floating point errors. As long as the algorithm used is
specified, one can make an informed decision as to whether it will do
what you expect of it. That's the best we can hope for.

If we are going to advocate doing 'clever' things behind the scenes, we
have to keep backwards compatibility (not creating the possibility of
producing worse results on the same input) and platform independence in
mind. Funny summation orders could violate the former depending on the
implementation details, and 'using the highest machine precision available'
violates the latter (and is horrible practice in general, imo: either you
don't need the extra accuracy, or you do, in which case its absence on a
given platform should be an error).

Perhaps pairwise summation in the original order of the data is the best
option:

import numpy as np

q = np.ones((2,)*26, np.float32)
print q.mean()                 # naive mean over all 2**26 elements
while q.ndim > 0:              # reduce one axis at a time
    q = q.mean(axis=-1, dtype=np.float32)
print q                        # nested reduction keeps partial sums small

This only requires log(N) space on the stack if properly implemented, and
is not platform dependent, nor should it have any backward compatibility
issues that I can think of. But I'm not sure how easy it would be to
implement, given the current framework. The ability to select different
algorithms via a kwarg wouldn't be a bad idea either, imo; nor would the
ability to explicitly specify a separate output and accumulator dtype.
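
A rough pure-Python sketch of such a pairwise reduction, just to make the
log(N) structure concrete (the cutoff of 8 is arbitrary, and this is only
an illustration of the idea, not how it would be implemented inside numpy):

import numpy as np

def pairwise_sum(x):
    # Small blocks are summed naively, in the original element order.
    if len(x) <= 8:
        s = np.float32(0)
        for v in x:
            s += v
        return s
    # Larger inputs are split in half and combined; the recursion depth,
    # and hence the stack use, is O(log N), and the order is preserved.
    mid = len(x) // 2
    return pairwise_sum(x[:mid]) + pairwise_sum(x[mid:])

x = np.ones(2**16, np.float32)
print(pairwise_sum(x) / np.float32(len(x)))   # 1.0, even in pure float32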


On Fri, Jul 25, 2014 at 8:00 PM, Alan G Isaac alan.is...@gmail.com wrote:

 On 7/25/2014 1:40 PM, Eelco Hoogendoorn wrote:
  At the risk of repeating myself: explicit is better than implicit


 This sounds like an argument for renaming the `mean` function `naivemean`
 rather than `mean`.  Whatever numpy names `mean`, shouldn't it
 implement an algorithm that produces the mean?  And obviously, for any
 float data type, the mean value of the values in the array is representable
 as a value of the same type.

 Alan Isaac




Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

2014-07-25 Thread RayS
At 11:29 AM 7/25/2014, you wrote:
On Fri, Jul 25, 2014 at 5:56 PM, RayS r...@blue-cove.com wrote:
  The important point was that it would be best if all of the
  methods affected by summing 32-bit floats with 32-bit accumulators had
  the same Notes as numpy.mean(). We went through a lot of code yesterday,
  assuming that any numpy or scipy.stats functions that use accumulators
  suffer the same issue, whether noted or not, and found it to be true.

Do you have a list of the functions that are affected?

We only tested a few that we use, but:
scipy.stats.nanmean (or any .*mean()),
numpy.sum, mean, average, std, var, ...

via something like:

import numpy
import scipy.stats
print numpy.__version__
print scipy.__version__
onez = numpy.ones((2**25, 1), numpy.float32)
step = 2**10
func = scipy.stats.nanmean
for s in range(2**24-step, 2**25, step):
    if func(onez[:s+step]) != 1.:
        print '\nbroke', s, func(onez[:s+step])
        break
    else:
        print '\r', s,

  That said, it does seem that np.mean could be implemented better than
it is, even given float32's inherent limitations. If anyone wants to
implement better algorithms for computing the mean, variance, sums,
etc., then we would love to add them to numpy.

Others have pointed out the possible tradeoffs in summation algos;
perhaps a method arg would be appropriate, with the better choice
depending on your desire for speed vs. accuracy.

It just occurred to me that if the STScI folks (who count photons)
took the mean() or other such func of an image array from Hubble
sensors to find a background value, they'd better always be using float64.

  - Ray





Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

2014-07-25 Thread josef.pktd
On Fri, Jul 25, 2014 at 4:25 PM, RayS r...@blue-cove.com wrote:

 At 11:29 AM 7/25/2014, you wrote:
 On Fri, Jul 25, 2014 at 5:56 PM, RayS r...@blue-cove.com wrote:
   The important point was that it would be best if all of the
  methods affected by summing 32-bit floats with 32-bit accumulators had
  the same Notes as numpy.mean(). We went through a lot of code yesterday,
  assuming that any numpy or scipy.stats functions that use accumulators
  suffer the same issue, whether noted or not, and found it to be true.
 
 Do you have a list of the functions that are affected?

 We only tested a few that we use, but:
 scipy.stats.nanmean (or any .*mean()),
 numpy.sum, mean, average, std, var, ...

 via something like:

 import numpy
 import scipy.stats
 print numpy.__version__
 print scipy.__version__
 onez = numpy.ones((2**25, 1), numpy.float32)
 step = 2**10
 func = scipy.stats.nanmean
 for s in range(2**24-step, 2**25, step):
     if func(onez[:s+step]) != 1.:
         print '\nbroke', s, func(onez[:s+step])
         break
     else:
         print '\r', s,

   That said, it does seem that np.mean could be implemented better than
 it is, even given float32's inherent limitations. If anyone wants to
 implement better algorithms for computing the mean, variance, sums,
 etc., then we would love to add them to numpy.

  Others have pointed out the possible tradeoffs in summation algos;
  perhaps a method arg would be appropriate, with the better choice
  depending on your desire for speed vs. accuracy.


I think this would be a good improvement.
But it doesn't remove the need for users to be aware of the problems. I think
the docstring and the description of the dtype argument are pretty clear.

I'm largely with Eelco: whatever precision or algorithm we use, with
floating point calculations we run into problems in some cases. And I don't
think this is a broken function, but a design decision that takes the
different tradeoffs into account.
Whether it's the right decision is an open question, if there are better
algorithms with essentially no disadvantages.

Two examples:
I had problems verifying some results against Stata at more than a few
significant digits, until I realized that Stata had used float32 for the
calculations by default in this case, while I was working with float64.
Using single-precision linear algebra causes the same numerical problems
that numpy.mean runs into.

A few years ago I tried to match some tougher NIST examples that were
intentionally very badly scaled. numpy.mean at float64 had quite large
errors, but a simple trick with two passes through the data managed to get
very close to the certified NIST examples.
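
Roughly, the trick looks like this (a sketch of the standard shifted
two-pass algorithm, not necessarily the exact code I used back then):

import numpy as np

def two_pass_var(x):
    # Pass 1: rough location estimate.
    shift = x.mean()
    # Pass 2: accumulate deviations from the estimate; for badly scaled
    # data these are small, so little precision is lost.
    d = x - shift
    return (d * d).mean() - d.mean() ** 2

# Badly scaled in the NIST spirit: huge common offset, tiny spread.
x = 1e9 + np.array([0.1, 0.2, 0.3, 0.2, 0.2])
print((x * x).mean() - x.mean() ** 2)  # naive one-pass formula: garbage
print(two_pass_var(x))                 # close to the true 0.004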


My conclusion:
don't use float32 unless you know you don't need any higher precision.
Even with float64 it is possible to run into extreme cases where you get
numerical garbage or large precision losses.
However, in the large majority of cases a boring, fast, naive
implementation is enough.

Also, whether we use mean, sum or dot in a calculation is an implementation
detail; dot, for instance, doesn't take a dtype argument and always
depends on the dtype of the arrays, AFAIK.
Separating the accumulation dtype from the array dtype would require a lot
of work, except in the simplest cases, like those that only use sum and mean
with a specified dtype argument.


trying out the original example:

>>> X = np.ones((5, 1024), np.float32)
>>> X.mean()
1.0
>>> X.mean(dtype=np.float32)
1.0

>>> np.dot(X.ravel(), np.ones(X.ravel().shape) * 1. / X.ravel().shape)
1.02299174

>>> np.__version__
'1.5.1'

Win32

Josef





 It just occurred to me that if the STScI folks (who count photons)
 took the mean() or other such func of an image array from Hubble
 sensors to find a background value, they'd better always be using float64.

   - Ray






Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

2014-07-25 Thread Eelco Hoogendoorn
Ray: I'm not working with Hubble data, but yeah, these are all issues I've run
into with my terabytes of microscopy data as well. Given that such raw data
comes as uint16, it's best to do your calculations as much as possible in good
old ints. What you compute is what you get, no obscure shenanigans.
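
A minimal illustration of that approach (made-up data; the dtype choices
are the point):

import numpy as np

# Raw uint16 frames, as they might come off a sensor.
frames = np.random.randint(0, 2**16, size=10**6).astype(np.uint16)

# Summing into a wide integer accumulator is exact: every partial sum is
# an integer, so there is no rounding anywhere along the way.
total = frames.sum(dtype=np.int64)   # exact while the total fits in 64 bits
print(total / float(frames.size))    # a single rounding step at the very end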

It just occurred to me that pairwise summation will lead to highly branchy 
code, and you can forget about any vector extensions. Tradeoffs indeed. Any 
such hierarchical summation is probably best done by aggregating naively summed 
blocks. 

-Original Message-
From: RayS r...@blue-cove.com
Sent: 25-7-2014 23:26
To: Discussion of Numerical Python numpy-discussion@scipy.org
Subject: Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

At 11:29 AM 7/25/2014, you wrote:
On Fri, Jul 25, 2014 at 5:56 PM, RayS r...@blue-cove.com wrote:
  The important point was that it would be best if all of the
 methods affected by summing 32-bit floats with 32-bit accumulators had
 the same Notes as numpy.mean(). We went through a lot of code yesterday,
 assuming that any numpy or scipy.stats functions that use accumulators
 suffer the same issue, whether noted or not, and found it to be true.

Do you have a list of the functions that are affected?

We only tested a few that we use, but:
scipy.stats.nanmean (or any .*mean()),
numpy.sum, mean, average, std, var, ...

via something like:

import numpy
import scipy.stats
print numpy.__version__
print scipy.__version__
onez = numpy.ones((2**25, 1), numpy.float32)
step = 2**10
func = scipy.stats.nanmean
for s in range(2**24-step, 2**25, step):
    if func(onez[:s+step]) != 1.:
        print '\nbroke', s, func(onez[:s+step])
        break
    else:
        print '\r', s,

  That said, it does seem that np.mean could be implemented better than
it is, even given float32's inherent limitations. If anyone wants to
implement better algorithms for computing the mean, variance, sums,
etc., then we would love to add them to numpy.

Others have pointed out the possible tradeoffs in summation algos;
perhaps a method arg would be appropriate, with the better choice
depending on your desire for speed vs. accuracy.

It just occurred to me that if the STScI folks (who count photons)
took the mean() or other such func of an image array from Hubble
sensors to find a background value, they'd better always be using float64.

  - Ray





Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

2014-07-25 Thread Julian Taylor
On 25.07.2014 23:51, Eelco Hoogendoorn wrote:
 Ray: I'm not working with Hubble data, but yeah, these are all issues
 I've run into with my terabytes of microscopy data as well. Given that
 such raw data comes as uint16, it's best to do your calculations as much
 as possible in good old ints. What you compute is what you get, no
 obscure shenanigans.

Integers are dangerous too: they overflow quickly, and signed overflow is
even undefined in the C standard.

 
 It just occurred to me that pairwise summation will lead to highly
 branchy code, and you can forget about any vector extensions. Tradeoffs
 indeed. Any such hierarchical summation is probably best done by
 aggregating naively summed blocks.

Pairwise summation is usually implemented with a naive-sum cutoff large
enough that the recursion does not matter much.
In numpy 1.9 this cutoff is 128 elements, but the inner loop is unrolled
8 times, which makes it effectively 16 elements.
The unrolling factor of 8 was intentionally chosen to allow using AVX in
the inner loop without changing the summation ordering, but last I
tested, actually using AVX here only gave mediocre speedups (10%-20% on
an i5).
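
As a toy model of that unrolled inner loop (pure-Python illustration only;
the real implementation is C, and the function name here is made up):

import numpy as np

def unrolled_block_sum(block):
    # Eight independent partial sums, as an 8-times unrolled loop keeps
    # them; each one maps onto one lane of an 8-wide AVX register.
    assert len(block) % 8 == 0
    lanes = [np.float32(0)] * 8
    for i in range(0, len(block), 8):
        for k in range(8):
            lanes[k] += block[i + k]
    # With a 128-element block, each lane does only 16 additions, which
    # is why the effective naive-sum length is 16 rather than 128.
    return sum(lanes, np.float32(0))

print(unrolled_block_sum(np.ones(128, np.float32)))   # 128.0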

 
 From: RayS r...@blue-cove.com
 Sent: 25-7-2014 23:26
 To: Discussion of Numerical Python numpy-discussion@scipy.org
 Subject: Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
 
 At 11:29 AM 7/25/2014, you wrote:
On Fri, Jul 25, 2014 at 5:56 PM, RayS r...@blue-cove.com wrote:
  The important point was that it would be best if all of the
  methods affected by summing 32-bit floats with 32-bit accumulators had
  the same Notes as numpy.mean(). We went through a lot of code yesterday,
  assuming that any numpy or scipy.stats functions that use accumulators
  suffer the same issue, whether noted or not, and found it to be true.

Do you have a list of the functions that are affected?
 
 We only tested a few that we use, but:
 scipy.stats.nanmean (or any .*mean()),
 numpy.sum, mean, average, std, var, ...
 
 via something like:
 
 import numpy
 import scipy.stats
 print numpy.__version__
 print scipy.__version__
 onez = numpy.ones((2**25, 1), numpy.float32)
 step = 2**10
 func = scipy.stats.nanmean
 for s in range(2**24-step, 2**25, step):
     if func(onez[:s+step]) != 1.:
         print '\nbroke', s, func(onez[:s+step])
         break
     else:
         print '\r', s,
 
  That said, it does seem that np.mean could be implemented better than
it is, even given float32's inherent limitations. If anyone wants to
implement better algorithms for computing the mean, variance, sums,
etc., then we would love to add them to numpy.
 
 Others have pointed out the possible tradeoffs in summation algos;
 perhaps a method arg would be appropriate, with the better choice
 depending on your desire for speed vs. accuracy.
 
 It just occurred to me that if the STScI folks (who count photons)
 took the mean() or other such func of an image array from Hubble
 sensors to find a background value, they'd better always be using float64.
 
   - Ray
 
 
 
 
 
 



Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays

2014-07-25 Thread RayS
At 02:36 PM 7/25/2014, you wrote:
But it doesn't remove the need for users to be aware of the problems. I
think the docstring and the description of the dtype argument are pretty clear.

Most of the docs for the affected functions do not have a Note with
the same warning as mean().

- Ray

