Re: [Numpy-discussion] 1.0rc1 doesn't seem to work on AMD64

2006-09-21 Thread Peter Bienstman
Cleaning out and rebuilding did the trick!

Thanks,

Peter

On Thursday 21 September 2006 18:33, [EMAIL PROTECTED] wrote:
> Subject: Re: [Numpy-discussion] 1.0rc1 doesn't seem to work on AMD64

> 
>
> I don't see this running the latest from svn on AMD64 here. Not sayin'
> there might not be a problem with rc1, I just don't see it with my sources.
>
> Python 2.4.3 (#1, Jun 13 2006, 11:46:22)
> [GCC 4.1.1 20060525 (Red Hat 4.1.1-1)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>
> >>> import numpy
> >>> numpy.version.version
>
> '1.0.dev3202'
>
> >>> numpy.version.os.uname()
>
> ('Linux', 'tethys', '2.6.17-1.2187_FC5', '#1 SMP Mon Sep 11 01:16:59 EDT
> 2006', 'x86_64')
>
> If you are building on Gentoo maybe you could delete the build directory
> (and maybe the numpy site package) and rebuild.
>
> Chuck.




[Numpy-discussion] Putmask/take ?

2006-09-21 Thread PGM

Folks,
I'm running into the following problem with putmask on take. 

>>> import numpy as N
>>> x = N.arange(12.)
>>> m = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
>>> i = N.nonzero(m)[0]
>>> w = N.array([-1, -2, -3, -4.])
>>> x.putmask(w,m)
>>> x.take(i)
>>> N.allclose(x.take(i),w)
False


I'm wondering if it is intentional, or if it's a problem with my build (1.0b5), or if somebody experienced it as well.
Thanks a lot for your input.

P.
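
For what it's worth, the False looks consistent with putmask cycling its
values argument by the flat index into x rather than by the running count
of True mask entries, so masked slot n receives w[n % len(w)]. A minimal
check of that reading (the modular-indexing interpretation is an
assumption here, not something stated in this thread):

>>> i = N.nonzero(m)[0]
>>> i
array([ 0,  6,  9, 11])
>>> N.array([w[j % len(w)] for j in i])
array([-1., -3., -2., -4.])

If that is what putmask does, x.take(i) comes back as [-1., -3., -2., -4.]
rather than w, which would explain the False above.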


Re: [Numpy-discussion] take behaviour has changed

2006-09-21 Thread Christian Kristukat
Bill Baxter <[EMAIL PROTECTED]> writes:

> 
> Yep, check the release notes:
> http://www.scipy.org/ReleaseNotes/NumPy_1.0
> search for 'take' on that page to find out what others have changed as well.
> --bb

Ok. Does axis=None then mean that take(a, ind) operates on the flattened array?
That is at least what it seems to be. I noticed that the method behaves
differently: a.take(ind) and a.take(ind, axis=0) behave the same, so the default
argument to axis is 0 rather than None.
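
For reference, the function-level behavior in question, repeated from the
examples posted earlier in this thread:

>>> from numpy import arange, reshape, take
>>> a = reshape(arange(12), (3, 4))
>>> take(a, [2, 3])        # axis=None: indexes the flattened array
array([2, 3])
>>> take(a, [2, 3], 1)     # axis=1: takes columns 2 and 3
array([[ 2,  3],
       [ 6,  7],
       [10, 11]])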

Christian








Re: [Numpy-discussion] take behaviour has changed

2006-09-21 Thread Bill Baxter
Yep, check the release notes:
http://www.scipy.org/ReleaseNotes/NumPy_1.0
search for 'take' on that page to find out what others have changed as well.
--bb

On 9/22/06, Christian Kristukat <[EMAIL PROTECTED]> wrote:
> Hi,
> from 1.0b1 to 1.0rc1 the default behaviour of take seems to have changed when
> omitting the axis argument:
>
> In [13]: a = reshape(arange(12),(3,4))
>
> In [14]: take(a,[2,3])
> Out[14]: array([2, 3])
>
> In [15]: take(a,[2,3],1)
> Out[15]:
> array([[ 2,  3],
>        [ 6,  7],
>        [10, 11]])
>
> Is this intended?
>
> Christian



[Numpy-discussion] general version of repmat?

2006-09-21 Thread Bill Baxter
Is there some way to get the equivalent of repmat() for ndim == 1 and ndim > 2?
For ndim == 1, repmat always returns a 2-d array instead of remaining 1-d.
For ndim > 2, repmat just doesn't work.

Maybe we could add a 'reparray', with the signature:
   reparray(A, repeats, axis=None)
where repeats is a scalar or a sequence.
If 'repeats' is a scalar then the matrix is duplicated along 'axis'
that many times.
If 'repeats' is a sequence of length N, then A is duplicated
repeats[i] times along axis[i].  If axis is None then it is assumed to
be (0,1,2...N).

Er, that's not quite complete, because it doesn't specify what happens
when you reparray an array to a higher dimension, like a 1-d to a 3-d,
e.g. reparray([1,2], (2,2,2)).  I guess the axis parameter could have
some 'newaxis' entries to accommodate that; a rough sketch follows below.
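
For what it's worth, here is that signature expressed in terms of
numpy.tile, which handles the higher-dimension case by left-padding A's
shape; reparray is a hypothetical helper, not an existing function, and
the scalar/axis handling is one guess at the semantics described above:

import numpy as np

def reparray(A, repeats, axis=None):
    # Sketch only: a scalar repeat duplicates along a single axis
    # (default 0); a sequence tiles along successive axes, with tile
    # left-padding A's shape when len(repeats) > A.ndim.
    A = np.asarray(A)
    if np.isscalar(repeats):
        reps = [1] * max(A.ndim, 1)
        reps[0 if axis is None else axis] = repeats
        return np.tile(A, reps)
    return np.tile(A, repeats)

reparray([1, 2], 3)           # 1-d stays 1-d: array([1, 2, 1, 2, 1, 2])
reparray([1, 2], (2, 2, 2))   # shape (2, 2, 4) under tile's padding rule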

--bb



[Numpy-discussion] take behaviour has changed

2006-09-21 Thread Christian Kristukat
Hi,
from 1.0b1 to 1.0rc1 the default behaviour of take seems to have changed when
omitting the axis argument:

In [13]: a = reshape(arange(12),(3,4))

In [14]: take(a,[2,3])
Out[14]: array([2, 3])

In [15]: take(a,[2,3],1)
Out[15]:
array([[ 2,  3],
       [ 6,  7],
       [10, 11]])

Is this intended?

Christian





[Numpy-discussion] matrixmultiply moved (release notes?)

2006-09-21 Thread Bill Baxter
Apparently numpy.matrixmultiply got moved into
numpy.oldnumeric.matrixmultiply at some point (or rather ceased to be
imported into the numpy namespace).  Is there any list of all such
methods that got banished?  This would be nice to have in the release
notes.
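
For anyone hitting this: matrixmultiply was an alias for dot, so the two
current spellings are roughly as below (the oldnumeric path being the one
named above):

import numpy
from numpy.oldnumeric import matrixmultiply  # compatibility location

a = numpy.array([[1, 2], [3, 4]])
b = numpy.array([[5], [6]])
numpy.dot(a, b)        # numpy-native replacement
matrixmultiply(a, b)   # same result via the compatibility module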

--bb



[Numpy-discussion] putmask/take ?

2006-09-21 Thread P GM
Folks,
I'm running into the following problem with putmask on take.

>>> import numpy as N
>>> x = N.arange(12.)
>>> m = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
>>> i = N.nonzero(m)[0]
>>> w = N.array([-1, -2, -3, -4.])
>>> x.putmask(w,m)
>>> x.take(i)
>>> N.allclose(x.take(i),w)
False

I'm wondering if it is intentional, or if it's a problem with my build
(1.0b5), or if somebody experiences that as well.
Thanks a lot for your input.

P.


Re: [Numpy-discussion] Tests and code documentation

2006-09-21 Thread Alan G Isaac
On Thu, 21 Sep 2006, Charles R Harris apparently wrote: 
> As to the oddness of \param or @param, here is an example from 
> Epydoc using Epytext 
> @type  m: number 
> @param m: The slope of the line. 
> @type  b: number 
> @param b: The y intercept of the line. 

Compare to definition list style for consolidated field 
lists in section 5.1 of
http://epydoc.sourceforge.net/fields.html#rst
which is much more elegant, IMO.
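
Roughly, the consolidated definition-list style being referred to looks
like the following, adapted to the slope/intercept example above (see the
linked page for the exact rules):

:Parameters:
  m : number
    The slope of the line.
  b : number
    The y intercept of the line.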

Cheers,
Alan Isaac







Re: [Numpy-discussion] Tests and code documentation

2006-09-21 Thread Alan G Isaac
On Thu, 21 Sep 2006, "David M. Cooke" apparently wrote: 
> Foremost for Python doc strings, I think, is that it look 
> ok when using pydoc or similar (ipython's ?, for 
> instance). That means a minimal amount of markup. 

IMO reStructuredText is very natural for documentation,
and it is nicely handled by epydoc.

fwiw,
Alan Isaac





Re: [Numpy-discussion] change of default dtype

2006-09-21 Thread David Grant
On 9/20/06, Bill Baxter <[EMAIL PROTECTED]> wrote:
Hey Andrew, point taken, but I think it would be better if someone whoactually knows the full extent of the change made the edit.  I knowzeros and ones changed.  Did anything else?Anyway, I'm surprised the release notes page is publicly editable.
I'm glad that it is editable. I hate wikis that are only editable by a select few. Defeats the purpose (or at least does not maximize the capability of a wiki).-- 
David Granthttp://www.davidgrant.ca
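
For reference, the zeros/ones change under discussion (defaults as of
numpy 1.0; Numeric's zeros and ones defaulted to integer, as I recall):

import numpy as np

np.zeros(3).dtype   # dtype('float64') -- the new default
np.ones(3).dtype    # dtype('float64')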


Re: [Numpy-discussion] please change mean to use dtype=float

2006-09-21 Thread Sebastian Haase
On Thursday 21 September 2006 15:28, Tim Hochberg wrote:
> David M. Cooke wrote:
> > On Thu, 21 Sep 2006 11:34:42 -0700
> >
> > Tim Hochberg <[EMAIL PROTECTED]> wrote:
> >> Tim Hochberg wrote:
> >>> Robert Kern wrote:
>  David M. Cooke wrote:
> > On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
> >> Let me offer a third path: the algorithms used for .mean() and
> >> .var() are substandard. There are much better incremental algorithms
> >> that entirely avoid the need to accumulate such large (and therefore
> >> precision-losing) intermediate values. The algorithms look like the
> >> following for 1D arrays in Python:
> >>
> >> def mean(a):
> >>  m = a[0]
> >>  for i in range(1, len(a)):
> >>  m += (a[i] - m) / (i + 1)
> >>  return m
> >
> > This isn't really going to be any better than using a simple sum.
> > It'll also be slower (a division per iteration).
> 
>  With one exception, every test that I've thrown at it shows that it's
>  better for float32. That exception is uniformly spaced arrays, like
>  linspace().
> 
>   > You do avoid
>   > accumulating large sums, but then doing the division a[i]/len(a)
>   > and adding that will do the same.
> 
>  Okay, this is true.
> 
> > Now, if you want to avoid losing precision, you want to use a better
> > summation technique, like compensated (or Kahan) summation:
> >
> > def mean(a):
> > s = e = a.dtype.type(0)
> > for i in range(0, len(a)):
> > temp = s
> > y = a[i] + e
> > s = temp + y
> > e = (temp - s) + y
> > return s / len(a)
> >
> >> def var(a):
> >>  m = a[0]
> >>  t = a.dtype.type(0)
> >>  for i in range(1, len(a)):
> >>  q = a[i] - m
> >>  r = q / (i+1)
> >>  m += r
> >>  t += i * q * r
> >>  t /= len(a)
> >>  return t
> >>
> >> Alternatively, from Knuth:
> >>
> >> def var_knuth(a):
> >>  m = a.dtype.type(0)
> >>  variance = a.dtype.type(0)
> >>  for i in range(len(a)):
> >>  delta = a[i] - m
> >>  m += delta / (i+1)
> >>  variance += delta * (a[i] - m)
> >>  variance /= len(a)
> >>  return variance
> >>
> >> I'm going to go ahead and attach a module containing the versions of
> >> mean, var, etc that I've been playing with in case someone wants to mess
> >> with them. Some were stolen from traffic on this list, for others I
> >> grabbed the algorithms from wikipedia or equivalent.
> >
> > I looked into this a bit more. I checked float32 (single precision) and
> > float64 (double precision), using long doubles (float96) for the "exact"
> > results. This is based on your code. Results are compared using
> > abs(exact_stat - computed_stat) / max(abs(values)), with 1 values in
> > the range of [-100, 900]
> >
> > First, the mean. In float32, the Kahan summation in single precision is
> > better by about 2 orders of magnitude than simple summation. However,
> > accumulating the sum in double precision is better by about 9 orders of
> > magnitude than simple summation (7 orders more than Kahan).
> >
> > In float64, Kahan summation is the way to go, by 2 orders of magnitude.
> >
> > For the variance, in float32, Knuth's method is *no better* than the
> > two-pass method. Tim's code does an implicit conversion of intermediate
> > results to float64, which is why he saw a much better result.
>
> Doh! And I fixed that same problem in the mean implementation earlier
> too. I was astounded by how good knuth was doing, but not astounded
> enough apparently.
>
> Does it seem weird to anyone else that in:
> numpy_scalar <op> python_scalar
> the precision ends up being controlled by the python scalar? I would
> expect the numpy_scalar to control the resulting precision just like
> numpy arrays do in similar circumstances. Perhaps the egg on my face is
> just clouding my vision though.
>
> > The two-pass method using
> > Kahan summation (again, in single precision), is better by about 2 orders
> > of magnitude. There is practically no difference when using a
> > double-precision accumulator amongst the techniques: they're all about 9
> > orders of magnitude better than single-precision two-pass.
> >
> > In float64, Kahan summation is again better than the rest, by about 2
> > orders of magnitude.
> >
> > I've put my adaptation of Tim's code, and box-and-whisker plots of the
> > results, at http://arbutus.mcmaster.ca/dmc/numpy/variance/
> >
> > Conclusions:
> >
> > - If you're going to calculate everything in single precision, use Kahan
> > summation. Using it in double-precision also helps.
> > - If you can use a double-precision accumulator, it's much better than
> > any of the techniques in single-precision only.
> >
> > - for speed+precision in the va

[Numpy-discussion] numpy 1.0rc1 bdist_rpm fails

2006-09-21 Thread Christian Kristukat

Hi, 
on linux I get an error when trying to build a rpm package from numpy 1.0rc1:

building extension "numpy.core.umath" sources
  adding 'build/src.linux-i686-2.4/numpy/core/config.h' to sources.
executing numpy/core/code_generators/generate_ufunc_api.py
  adding 'build/src.linux-i686-2.4/numpy/core/__ufunc_api.h' to sources.
creating build/src.linux-i686-2.4/src
conv_template:> build/src.linux-i686-2.4/src/umathmodule.c
error: src/umathmodule.c.src: No such file or directory
error: Bad exit status from /home/ck/testarea/rpm/tmp/rpm-tmp.68597 (%build)


RPM build errors:
Bad exit status from /home/ck/testarea/rpm/tmp/rpm-tmp.68597 (%build)
error: command 'rpmbuild' failed with exit status 1

Christian




Re: [Numpy-discussion] please change mean to use dtype=float

2006-09-21 Thread Tim Hochberg
David M. Cooke wrote:
> On Thu, 21 Sep 2006 11:34:42 -0700
> Tim Hochberg <[EMAIL PROTECTED]> wrote:
>
>   
>> Tim Hochberg wrote:
>> 
>>> Robert Kern wrote:
>>>   
>>>   
 David M. Cooke wrote:
   
 
 
> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
> 
>   
>   
>> Let me offer a third path: the algorithms used for .mean() and .var()
>> are substandard. There are much better incremental algorithms that
>> entirely avoid the need to accumulate such large (and therefore
>> precision-losing) intermediate values. The algorithms look like the
>> following for 1D arrays in Python:
>>
>> def mean(a):
>>  m = a[0]
>>  for i in range(1, len(a)):
>>  m += (a[i] - m) / (i + 1)
>>  return m
>>   
>> 
>> 
> This isn't really going to be any better than using a simple sum.
> It'll also be slower (a division per iteration).
> 
>   
>   
 With one exception, every test that I've thrown at it shows that it's
 better for float32. That exception is uniformly spaced arrays, like
 linspace().

  > You do avoid
  > accumulating large sums, but then doing the division a[i]/len(a) and
  > adding that will do the same.

 Okay, this is true.

   
 
 
> Now, if you want to avoid losing precision, you want to use a better
> summation technique, like compensated (or Kahan) summation:
>
> def mean(a):
> s = e = a.dtype.type(0)
> for i in range(0, len(a)):
> temp = s
> y = a[i] + e
> s = temp + y
> e = (temp - s) + y
> return s / len(a)
>   
 
 
>> def var(a):
>>  m = a[0]
>>  t = a.dtype.type(0)
>>  for i in range(1, len(a)):
>>  q = a[i] - m
>>  r = q / (i+1)
>>  m += r
>>  t += i * q * r
>>  t /= len(a)
>>  return t
>>
>> Alternatively, from Knuth:
>>
>> def var_knuth(a):
>>  m = a.dtype.type(0)
>>  variance = a.dtype.type(0)
>>  for i in range(len(a)):
>>  delta = a[i] - m
>>  m += delta / (i+1)
>>  variance += delta * (a[i] - m)
>>  variance /= len(a)
>>  return variance
>> 
>> I'm going to go ahead and attach a module containing the versions of 
>> mean, var, etc that I've been playing with in case someone wants to mess 
>> with them. Some were stolen from traffic on this list, for others I 
>> grabbed the algorithms from wikipedia or equivalent.
>> 
>
> I looked into this a bit more. I checked float32 (single precision) and
> float64 (double precision), using long doubles (float96) for the "exact"
> results. This is based on your code. Results are compared using
> abs(exact_stat - computed_stat) / max(abs(values)), with 1 values in the
> range of [-100, 900]
>
> First, the mean. In float32, the Kahan summation in single precision is
> better by about 2 orders of magnitude than simple summation. However,
> accumulating the sum in double precision is better by about 9 orders of
> magnitude than simple summation (7 orders more than Kahan).
>
> In float64, Kahan summation is the way to go, by 2 orders of magnitude.
>
> For the variance, in float32, Knuth's method is *no better* than the two-pass
> method. Tim's code does an implicit conversion of intermediate results to
> float64, which is why he saw a much better result. The two-pass method using
> Kahan summation (again, in single precision), is better by about 2 orders of
> magnitude. There is practically no difference when using a double-precision
> accumulator amongst the techniques: they're all about 9 orders of magnitude
> better than single-precision two-pass.
>
> In float64, Kahan summation is again better than the rest, by about 2 orders
> of magnitude.
>
> I've put my adaptation of Tim's code, and box-and-whisker plots of the
> results, at http://arbutus.mcmaster.ca/dmc/numpy/variance/
>
> Conclusions:
>
> - If you're going to calculate everything in single precision, use Kahan
> summation. Using it in double-precision also helps.
> - If you can use a double-precision accumulator, it's much better than any of
> the techniques in single-precision only.
>
> - for speed+precision in the variance, either use Kahan summation in single
> precision with the two-pass method, or use double precision with simple
> summation with the two-pass method. Knuth buys you nothing, except slower
> code :-)
>
> After 1.0 is out, we should look at doing one of the above.
>   
One more little tidbit; it appears possible to "fix up" Knuth's 
algorithm so that it's comparable in accuracy to the two pass Kahan 
version by doing Kahan summation while accumulating the variance. 
Testing on this was 

Re: [Numpy-discussion] please change mean to use dtype=float

2006-09-21 Thread Tim Hochberg
David M. Cooke wrote:
> On Thu, 21 Sep 2006 11:34:42 -0700
> Tim Hochberg <[EMAIL PROTECTED]> wrote:
>
>   
>> Tim Hochberg wrote:
>> 
>>> Robert Kern wrote:
>>>   
>>>   
 David M. Cooke wrote:
   
 
 
> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
> 
>   
>   
>> Let me offer a third path: the algorithms used for .mean() and .var()
>> are substandard. There are much better incremental algorithms that
>> entirely avoid the need to accumulate such large (and therefore
>> precision-losing) intermediate values. The algorithms look like the
>> following for 1D arrays in Python:
>>
>> def mean(a):
>>  m = a[0]
>>  for i in range(1, len(a)):
>>  m += (a[i] - m) / (i + 1)
>>  return m
>>   
>> 
>> 
> This isn't really going to be any better than using a simple sum.
> It'll also be slower (a division per iteration).
> 
>   
>   
 With one exception, every test that I've thrown at it shows that it's
 better for float32. That exception is uniformly spaced arrays, like
 linspace().

  > You do avoid
  > accumulating large sums, but then doing the division a[i]/len(a) and
  > adding that will do the same.

 Okay, this is true.

   
 
 
> Now, if you want to avoid losing precision, you want to use a better
> summation technique, like compensated (or Kahan) summation:
>
> def mean(a):
> s = e = a.dtype.type(0)
> for i in range(0, len(a)):
> temp = s
> y = a[i] + e
> s = temp + y
> e = (temp - s) + y
> return s / len(a)
>   
 
 
>> def var(a):
>>  m = a[0]
>>  t = a.dtype.type(0)
>>  for i in range(1, len(a)):
>>  q = a[i] - m
>>  r = q / (i+1)
>>  m += r
>>  t += i * q * r
>>  t /= len(a)
>>  return t
>>
>> Alternatively, from Knuth:
>>
>> def var_knuth(a):
>>  m = a.dtype.type(0)
>>  variance = a.dtype.type(0)
>>  for i in range(len(a)):
>>  delta = a[i] - m
>>  m += delta / (i+1)
>>  variance += delta * (a[i] - m)
>>  variance /= len(a)
>>  return variance
>> 
>> I'm going to go ahead and attach a module containing the versions of 
>> mean, var, etc that I've been playing with in case someone wants to mess 
>> with them. Some were stolen from traffic on this list, for others I 
>> grabbed the algorithms from wikipedia or equivalent.
>> 
>
> I looked into this a bit more. I checked float32 (single precision) and
> float64 (double precision), using long doubles (float96) for the "exact"
> results. This is based on your code. Results are compared using
> abs(exact_stat - computed_stat) / max(abs(values)), with 1 values in the
> range of [-100, 900]
>
> First, the mean. In float32, the Kahan summation in single precision is
> better by about 2 orders of magnitude than simple summation. However,
> accumulating the sum in double precision is better by about 9 orders of
> magnitude than simple summation (7 orders more than Kahan).
>
> In float64, Kahan summation is the way to go, by 2 orders of magnitude.
>
> For the variance, in float32, Knuth's method is *no better* than the two-pass
> method. Tim's code does an implicit conversion of intermediate results to
> float64, which is why he saw a much better result. 
Doh! And I fixed that same problem in the mean implementation earlier 
too. I was astounded by how good knuth was doing, but not astounded 
enough apparently.

Does it seem weird to anyone else that in:
numpy_scalar <op> python_scalar
the precision ends up being controlled by the python scalar? I would 
expect the numpy_scalar to control the resulting precision just like 
numpy arrays do in similar circumstances. Perhaps the egg on my face is 
just clouding my vision though.

> The two-pass method using
> Kahan summation (again, in single precision), is better by about 2 orders of
> magnitude. There is practically no difference when using a double-precision
> accumulator amongst the techniques: they're all about 9 orders of magnitude
> better than single-precision two-pass.
>
> In float64, Kahan summation is again better than the rest, by about 2 orders
> of magnitude.
>
> I've put my adaptation of Tim's code, and box-and-whisker plots of the
> results, at http://arbutus.mcmaster.ca/dmc/numpy/variance/
>
> Conclusions:
>
> - If you're going to calculate everything in single precision, use Kahan
> summation. Using it in double-precision also helps.
> - If you can use a double-precision accumulator, it's much better than any of
> the techniques in single-precision only.
>
> - for speed+precision in the variance, eith

Re: [Numpy-discussion] immutable arrays

2006-09-21 Thread Martin Wiechert
On Thursday 21 September 2006 18:24, Travis Oliphant wrote:
> Martin Wiechert wrote:
> > Thanks Travis.
> >
> > Do I understand correctly that the only way to be really safe is to make
> > a copy and not to export a reference to it?
> > Because anybody having a reference to the owner of the data can override
> > the flag?
>
> No, that's not quite correct.   Of course in C, anybody can do anything
> they want to the flags.
>
> In Python, only the owner of the object itself can change the writeable
> flag once it is set to False.   So, if you only return a "view" of the
> array (a.view())  then the Python user will not be able to change the
> flags.
>
> Example:
>
> a = array([1,2,3])
> a.flags.writeable = False
>
> b = a.view()
>
> b.flags.writeable = True   # raises an error.
>
> c = a
> c.flags.writeable = True  # can be done because c is a direct alias to a.
>
> Hopefully, that explains the situation a bit better.
>

It does. Thanks Travis.

> -Travis


Re: [Numpy-discussion] please change mean to use dtype=float

2006-09-21 Thread Travis Oliphant
David M. Cooke wrote:

>
>Conclusions:
>
>- If you're going to calculate everything in single precision, use Kahan
>summation. Using it in double-precision also helps.
>- If you can use a double-precision accumulator, it's much better than any of
>the techniques in single-precision only.
>
>- for speed+precision in the variance, either use Kahan summation in single
>precision with the two-pass method, or use double precision with simple
>summation with the two-pass method. Knuth buys you nothing, except slower
>code :-)
>
>After 1.0 is out, we should look at doing one of the above.
>  
>

+1





Re: [Numpy-discussion] please change mean to use dtype=float

2006-09-21 Thread David M. Cooke
On Thu, 21 Sep 2006 11:34:42 -0700
Tim Hochberg <[EMAIL PROTECTED]> wrote:

> Tim Hochberg wrote:
> > Robert Kern wrote:
> >   
> >> David M. Cooke wrote:
> >>   
> >> 
> >>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
> >>> 
> >>>   
>  Let me offer a third path: the algorithms used for .mean() and .var()
>  are substandard. There are much better incremental algorithms that
>  entirely avoid the need to accumulate such large (and therefore
>  precision-losing) intermediate values. The algorithms look like the
>  following for 1D arrays in Python:
> 
>  def mean(a):
>   m = a[0]
>   for i in range(1, len(a)):
>   m += (a[i] - m) / (i + 1)
>   return m
>    
>  
> >>> This isn't really going to be any better than using a simple sum.
> >>> It'll also be slower (a division per iteration).
> >>> 
> >>>   
> >> With one exception, every test that I've thrown at it shows that it's
> >> better for float32. That exception is uniformly spaced arrays, like
> >> linspace().
> >>
> >>  > You do avoid
> >>  > accumulating large sums, but then doing the division a[i]/len(a) and
> >>  > adding that will do the same.
> >>
> >> Okay, this is true.
> >>
> >>   
> >> 
> >>> Now, if you want to avoid losing precision, you want to use a better
> >>> summation technique, like compensated (or Kahan) summation:
> >>>
> >>> def mean(a):
> >>> s = e = a.dtype.type(0)
> >>> for i in range(0, len(a)):
> >>> temp = s
> >>> y = a[i] + e
> >>> s = temp + y
> >>> e = (temp - s) + y
> >>> return s / len(a)
> >> 
>  def var(a):
>   m = a[0]
>   t = a.dtype.type(0)
>   for i in range(1, len(a)):
>   q = a[i] - m
>   r = q / (i+1)
>   m += r
>   t += i * q * r
>   t /= len(a)
>   return t
> 
>  Alternatively, from Knuth:
> 
>  def var_knuth(a):
>   m = a.dtype.type(0)
>   variance = a.dtype.type(0)
>   for i in range(len(a)):
>   delta = a[i] - m
>   m += delta / (i+1)
>   variance += delta * (a[i] - m)
>   variance /= len(a)
>   return variance
> 
> I'm going to go ahead and attach a module containing the versions of 
> mean, var, etc that I've been playing with in case someone wants to mess 
> with them. Some were stolen from traffic on this list, for others I 
> grabbed the algorithms from wikipedia or equivalent.

I looked into this a bit more. I checked float32 (single precision) and
float64 (double precision), using long doubles (float96) for the "exact"
results. This is based on your code. Results are compared using
abs(exact_stat - computed_stat) / max(abs(values)), with 1 values in the
range of [-100, 900]

First, the mean. In float32, the Kahan summation in single precision is
better by about 2 orders of magnitude than simple summation. However,
accumulating the sum in double precision is better by about 9 orders of
magnitude than simple summation (7 orders more than Kahan).

In float64, Kahan summation is the way to go, by 2 orders of magnitude.

For the variance, in float32, Knuth's method is *no better* than the two-pass
method. Tim's code does an implicit conversion of intermediate results to
float64, which is why he saw a much better result. The two-pass method using
Kahan summation (again, in single precision), is better by about 2 orders of
magnitude. There is practically no difference when using a double-precision
accumulator amongst the techniques: they're all about 9 orders of magnitude
better than single-precision two-pass.

In float64, Kahan summation is again better than the rest, by about 2 orders
of magnitude.

I've put my adaptation of Tim's code, and box-and-whisker plots of the
results, at http://arbutus.mcmaster.ca/dmc/numpy/variance/

Conclusions:

- If you're going to calculate everything in single precision, use Kahan
summation. Using it in double-precision also helps.
- If you can use a double-precision accumulator, it's much better than any of
the techniques in single-precision only.

- for speed+precision in the variance, either use Kahan summation in single
precision with the two-pass method, or use double precision with simple
summation with the two-pass method. Knuth buys you nothing, except slower
code :-)

After 1.0 is out, we should look at doing one of the above.
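
A self-contained sketch of the float32 mean comparison (the array size and
the use of a float64 mean as the reference are illustrative choices, not
the exact benchmark behind the plots):

import numpy as np

def simple_mean(a):
    # Naive accumulation in the array's own precision.
    s = a.dtype.type(0)
    for x in a:
        s += x
    return s / len(a)

def kahan_mean(a):
    # Compensated (Kahan) summation, as in the mean() quoted above.
    s = e = a.dtype.type(0)
    for x in a:
        temp = s
        y = x + e
        s = temp + y
        e = (temp - s) + y
    return s / len(a)

a = (1000 * np.random.random(10000) - 100).astype(np.float32)
exact = a.astype(np.float64).mean()
print("simple:", abs(simple_mean(a) - exact))
print("kahan: ", abs(kahan_mean(a) - exact))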

-- 
|>|\/|<
/--\
|David M. Cooke  http://arbutus.physics.mcmaster.ca/dmc/
|[EMAIL PROTECTED]


Re: [Numpy-discussion] Tests and code documentation

2006-09-21 Thread Charles R Harris
Hi,
On 9/21/06, Robert Kern <[EMAIL PROTECTED]> wrote:
> Steve Lianoglou wrote:
> > So .. I guess I'm wondering why we want to break from the standard?
>
> We don't as far as Python code goes. The code that Chuck added Doxygen-style
> comments to was C code. I presume he was simply answering Sebastian's
> question rather than suggesting we use Doxygen for Python code, too.

Exactly. I also don't think the Python hack description applies to
doxygen any longer. As to the oddness of \param or @param, here is an
example from Epydoc using Epytext

@type  m: number
@param m: The slope of the line.
@type  b: number
@param b: The y intercept of the line.  The X{y intercept} of a
Looks like they borrowed something there ;) The main advantage of
epydoc vs doxygen seems to be that you can use the markup inside the
normal python docstring without having to make a separate comment
block. Or would that be a disadvantage? Then again, I've been thinking
of moving the python function docstrings into the add_newdocs.py file
so everything is together in one spot and that would separate the
Python docstrings from the functions anyway.

I'll fool around with doxygen a bit and see what it does. The C code is the code that most needs documentation in any case.

Chuck



Re: [Numpy-discussion] Tests and code documentation

2006-09-21 Thread Robert Kern
Steve Lianoglou wrote:
> So .. I guess I'm wondering why we want to break from the standard?

We don't as far as Python code goes. The code that Chuck added Doxygen-style 
comments to was C code. I presume he was simply answering Sebastian's question 
rather than suggesting we use Doxygen for Python code, too.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco




Re: [Numpy-discussion] Tests and code documentation

2006-09-21 Thread David M. Cooke
On Thu, 21 Sep 2006 10:05:58 -0600
"Charles R Harris" <[EMAIL PROTECTED]> wrote:

> Travis,
> 
> A few questions.
> 
> 1) I can't find any systematic code testing units, although there seem to be
> tests for regressions and such. Is there a place we should be putting such
> tests?
> 
> 2) Any plans for code documentation? I documented some of my stuff with
> doxygen markups and wonder if we should include a Doxyfile as part of the
> package.

We don't have much of a defined standard for docs. Personally, I wouldn't use
doxygen: what I've seen for Python versions are hacks, whose output looks
like C++, and which requires markup that's not like commonly-used conventions
in Python (\brief, for instance).

Foremost for Python doc strings, I think, is that it look ok when using pydoc
or similar (ipython's ?, for instance). That means a minimal amount of
markup.

Someone previously mentioned including cross-references; I think that's a
good idea. A 'See also' line, for instance. Examples are good too, especially
if there's been disputes on the interpretation of the command :-)

For the C code, documentation is autogenerated from the /** ... API */
comments that determine which functions are part of the C API. This are put
into files multiarray_api.txt and ufunc_api.txt (in the include/ directory).
The files are in reST format, so the comments should/could be. At some point
I've got to through and add more :-)

-- 
|>|\/|<
/--\
|David M. Cooke  http://arbutus.physics.mcmaster.ca/dmc/
|[EMAIL PROTECTED]



Re: [Numpy-discussion] Tests and code documentation

2006-09-21 Thread Steve Lianoglou
> Are you able to use doxygen for Python code? I thought it only worked
> for C (and alike)?
>
> IIRC correctly, it now does Python too. Let's see... here is an  
> example
> ## Documentation for this module.
> #
> # More details.
>
> ## Documentation for a function.
> #
> # More details.
> def func():
> pass
> Looks like ## replaces the /**

I never found it (although I haven't looked too hard), but I always  
thought there was an official way to document python code --  
minimally to put the documentation in the docstring following the  
function definition:

def func(..):
 """One liner.

 Continue docs -- some type of reStructuredText style
 """
 pass

Isn't that the same docstring that ipython uses to bring up help,  
when you do:

In [1]: myobject.some_func?


So .. I guess I'm wondering why we want to break from the standard?

-steve





Re: [Numpy-discussion] please change mean to use dtype=float

2006-09-21 Thread Tim Hochberg

Tim Hochberg wrote:

Robert Kern wrote:
  

David M. Cooke wrote:
  


On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:

  
Let me offer a third path: the algorithms used for .mean() and .var() are 
substandard. There are much better incremental algorithms that entirely avoid 
the need to accumulate such large (and therefore precision-losing) intermediate 
values. The algorithms look like the following for 1D arrays in Python:


def mean(a):
 m = a[0]
 for i in range(1, len(a)):
 m += (a[i] - m) / (i + 1)
 return m
  


This isn't really going to be any better than using a simple sum.
It'll also be slower (a division per iteration).

  
With one exception, every test that I've thrown at it shows that it's better for 
float32. That exception is uniformly spaced arrays, like linspace().


 > You do avoid
 > accumulating large sums, but then doing the division a[i]/len(a) and
 > adding that will do the same.

Okay, this is true.

  


Now, if you want to avoid losing precision, you want to use a better
summation technique, like compensated (or Kahan) summation:

def mean(a):
s = e = a.dtype.type(0)
for i in range(0, len(a)):
temp = s
y = a[i] + e
s = temp + y
e = (temp - s) + y
return s / len(a)

Some numerical experiments in Maple using 5-digit precision show that
your mean is maybe a bit better in some cases, but can also be much
worse, than sum(a)/len(a), but both are quite poor in comparision to the
Kahan summation.

(We could probably use a fast implementation of Kahan summation in
addition to a.sum())

  

+1

  


def var(a):
 m = a[0]
 t = a.dtype.type(0)
 for i in range(1, len(a)):
 q = a[i] - m
 r = q / (i+1)
 m += r
 t += i * q * r
 t /= len(a)
 return t

Alternatively, from Knuth:

def var_knuth(a):
 m = a.dtype.type(0)
 variance = a.dtype.type(0)
 for i in range(len(a)):
 delta = a[i] - m
 m += delta / (i+1)
 variance += delta * (a[i] - m)
 variance /= len(a)
 return variance
  


These formulas are good when you can only do one pass over the data
(like in a calculator where you don't store all the data points), but
are slightly worse than doing two passes. Kahan summation would probably
also be good here too.

  
Again, my tests show otherwise for float32. I'll condense my ipython log into a 
module for everyone's perusal. It's possible that the Kahan summation of the 
squared residuals will work better than the current two-pass algorithm and the 
implementations I give above.
  

This is what my tests show as well: var_knuth outperformed any simple two
pass algorithm I could come up with, even ones using Kahan sums.
Interestingly, for 1D arrays the built-in float32 variance performs
better than it should. After a bit of twiddling around I discovered that
it actually does most of its calculations in float64. It uses a two
pass calculation, the result of mean is a scalar, and in the process of 
converting that back to an array we end up with float64 values. Or 
something like that; I was mostly reverse engineering the sequence of 
events from the results.
  
Here's a simple example of how var is a little wacky. A shape-[N]
array will give you a different result than a shape-[1,N] array. The 
reason is clear -- in the second case the mean is not a scalar so there 
isn't the inadvertent promotion to float64, but it's still odd.


>>> data = (1000*(random.random([1]) - 0.1)).astype(float32)
>>> print data.var() - data.reshape([1, -1]).var(-1)
[ 0.1171875]

I'm going to go ahead and attach a module containing the versions of 
mean, var, etc that I've been playing with in case someone wants to mess 
with them. Some were stolen from traffic on this list, for others I 
grabbed the algorithms from wikipedia or equivalent.


-tim






  


def raw_kahan_sum(values):
"""raw_kahan_sum(values) -> sum(values), residual

where sum(values) is computed using Kahan's summation algorithm and the 
residual is the value of the lower order bits when finished.

"""
total = c = values.dtype.type(0)
for x in values:
y = x + c  
t = total + y  
c = y - (t - total)  
total = t 
return total, c

def sum(values):
"""sum(values) -> sum of

Re: [Numpy-discussion] please change mean to use dtype=float

2006-09-21 Thread Tim Hochberg
Robert Kern wrote:
> David M. Cooke wrote:
>   
>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
>> 
>>> Let me offer a third path: the algorithms used for .mean() and .var() are 
>>> substandard. There are much better incremental algorithms that entirely 
>>> avoid 
>>> the need to accumulate such large (and therefore precision-losing) 
>>> intermediate 
>>> values. The algorithms look like the following for 1D arrays in Python:
>>>
>>> def mean(a):
>>>  m = a[0]
>>>  for i in range(1, len(a)):
>>>  m += (a[i] - m) / (i + 1)
>>>  return m
>>>   
>> This isn't really going to be any better than using a simple sum.
>> It'll also be slower (a division per iteration).
>> 
>
> With one exception, every test that I've thrown at it shows that it's better 
> for 
> float32. That exception is uniformly spaced arrays, like linspace().
>
>  > You do avoid
>  > accumulating large sums, but then doing the division a[i]/len(a) and
>  > adding that will do the same.
>
> Okay, this is true.
>
>   
>> Now, if you want to avoid losing precision, you want to use a better
>> summation technique, like compensated (or Kahan) summation:
>>
>> def mean(a):
>> s = e = a.dtype.type(0)
>> for i in range(0, len(a)):
>> temp = s
>> y = a[i] + e
>> s = temp + y
>> e = (temp - s) + y
>> return s / len(a)
>>
>> Some numerical experiments in Maple using 5-digit precision show that
>> your mean is maybe a bit better in some cases, but can also be much
>> worse, than sum(a)/len(a), but both are quite poor in comparision to the
>> Kahan summation.
>>
>> (We could probably use a fast implementation of Kahan summation in
>> addition to a.sum())
>> 
>
> +1
>
>   
>>> def var(a):
>>>  m = a[0]
>>>  t = a.dtype.type(0)
>>>  for i in range(1, len(a)):
>>>  q = a[i] - m
>>>  r = q / (i+1)
>>>  m += r
>>>  t += i * q * r
>>>  t /= len(a)
>>>  return t
>>>
>>> Alternatively, from Knuth:
>>>
>>> def var_knuth(a):
>>>  m = a.dtype.type(0)
>>>  variance = a.dtype.type(0)
>>>  for i in range(len(a)):
>>>  delta = a[i] - m
>>>  m += delta / (i+1)
>>>  variance += delta * (a[i] - m)
>>>  variance /= len(a)
>>>  return variance
>>>   
>> These formulas are good when you can only do one pass over the data
>> (like in a calculator where you don't store all the data points), but
>> are slightly worse than doing two passes. Kahan summation would probably
>> also be good here too.
>> 
>
> Again, my tests show otherwise for float32. I'll condense my ipython log into 
> a 
> module for everyone's perusal. It's possible that the Kahan summation of the 
> squared residuals will work better than the current two-pass algorithm and 
> the 
> implementations I give above.
>   
This is what my tests show as well: var_knuth outperformed any simple two
pass algorithm I could come up with, even ones using Kahan sums.
Interestingly, for 1D arrays the built-in float32 variance performs
better than it should. After a bit of twiddling around I discovered that
it actually does most of its calculations in float64. It uses a two
pass calculation, the result of mean is a scalar, and in the process of 
converting that back to an array we end up with float64 values. Or 
something like that; I was mostly reverse engineering the sequence of 
events from the results.

-tim






Re: [Numpy-discussion] Question about recarray

2006-09-21 Thread Travis Oliphant
Lionel Roubeyrie wrote:
> find any solution for that. I have tried with arrays of dtype=object, but I 
> have problem when I want to compute min, max, ... with an error like:
> TypeError: function not supported for these types, and can't coerce safely to 
> supported types.
>   
I just added support for min and max methods of object arrays, by adding 
support for Object arrays to the minimum and maximum functions.
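
A quick check of the sort of thing this enables (dates are just one
example of orderable Python objects; the example is mine, not from the
thread):

import numpy as np
from datetime import date

a = np.array([date(2006, 9, 21), date(2006, 1, 1)], dtype=object)
a.min()   # -> datetime.date(2006, 1, 1)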

-Travis




Re: [Numpy-discussion] arr.dtype.kind is 'i' for dtype=uint !?

2006-09-21 Thread Travis Oliphant
Matthew Brett wrote:
> Hi,
>
>   
>> It's in the array interface specification:
>>
>> http://numpy.scipy.org/array_interface.shtml
>> 
>
> I was interested in the 't' (bitfield) type - is there an example of
> usage somewhere?
>   
No, it's not implemented in NumPy.  It's just part of the array
interface specification for completeness.

-Travis





Re: [Numpy-discussion] arr.dtype.kind is 'i' for dtype=uint !?

2006-09-21 Thread Matthew Brett
Hi,

> It's in the array interface specification:
>
> http://numpy.scipy.org/array_interface.shtml

I was interested in the 't' (bitfield) type - is there an example of
usage somewhere?

In [13]: dtype('t8')
---------------------------------------------------------------------------
exceptions.TypeError                       Traceback (most recent call last)

/home/mb312/python/

TypeError: data type not understood

Best,

Matthew



Re: [Numpy-discussion] Question about recarray

2006-09-21 Thread Travis Oliphant
Lionel Roubeyrie wrote:
> Hi all,
> Is it possible to put masked values into recarrays? I need an array with
> heterogeneous types of data (datetime objects in the first col, all others
> are float) but with missing values in some records. For the moment, I don't
> find any solution for that.
Either use "nans" or "inf" for missing values or use the masked array 
object with a complex data-type.   You don't need to use a recarray 
object to get "records".  Any array can have "records".  Therefore, you 
can have a masked array of "records" by creating an array with the 
appropriate data-type.  

It may also be possible to use a recarray as the "array" for the masked
array object because the recarray is a sub-class of the array.
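
A sketch of the masked-array-of-records route with numpy.ma; the field
names, the object column for dates, and the per-field mask layout are
illustrative assumptions, not code from this thread:

import numpy as np
import numpy.ma as ma
from datetime import date

# Structured dtype: an object field for datetimes, floats for the rest.
dt = np.dtype([('when', object), ('v1', float), ('v2', float)])
rows = [(date(2006, 9, 20), 1.5, 2.5),
        (date(2006, 9, 21), 3.5, 0.0)]
# Mask the missing v2 entry in the second record (masks are per field).
a = ma.array(rows, dtype=dt, mask=[(False, False, False),
                                   (False, False, True)])
a['v1'].min()   # -> 1.5, over the unmasked float field
a['v2'].max()   # -> 2.5; the masked placeholder is ignored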

> I have tried with arrays of dtype=object, but I 
> have problem when I want to compute min, max, ... with an error like:
> TypeError: function not supported for these types, and can't coerce safely to 
> supported types.
>   
It looks like the max and min functions are not supported for Object 
arrays.

import numpy as N
N.maximum.types

does not include Object arrays. 

It probably should.

-Travis




Re: [Numpy-discussion] Tests and code documentation

2006-09-21 Thread Charles R Harris
On 9/21/06, Sebastian Haase <[EMAIL PROTECTED]> wrote:
> On Thursday 21 September 2006 09:05, Charles R Harris wrote:
> > Travis,
> >
> > A few questions.
> >
> > 1) I can't find any systematic code testing units, although there seem to
> > be tests for regressions and such. Is there a place we should be putting
> > such tests?
> >
> > 2) Any plans for code documentation? I documented some of my stuff with
> > doxygen markups and wonder if we should include a Doxyfile as part of the
> > package.
>
> Are you able to use doxygen for Python code? I thought it only worked for C
> (and alike)?

IIRC correctly, it now does Python too. Let's see... here is an example

## Documentation for this module.
#
#  More details.

## Documentation for a function.
#
#  More details.
def func():
    pass

Looks like ## replaces the /**

Chuck


Re: [Numpy-discussion] Tests and code documentation

2006-09-21 Thread Louis Cordier

> Are you able to use doxygen for Python code? I thought it only worked for C
> (and alike)?

There is an ugly-hack :)
http://i31www.ira.uka.de/~baas/pydoxy/

But I wouldn't recommend using it, rather stick with Epydoc.


-- 
Louis Cordier <[EMAIL PROTECTED]> cell: +27721472305
Point45 Entertainment (Pty) Ltd. http://www.point45.org




Re: [Numpy-discussion] Tests and code documentation

2006-09-21 Thread Travis Oliphant
Charles R Harris wrote:
> Travis,
>
> A few questions.
>
> 1) I can't find any systematic code testing units, although there seem 
> to be tests for regressions and such. Is there a place we should be 
> putting such tests?
All tests are placed under the tests directory of the corresponding
sub-package.  They will only be picked up by .test(level < 10) if the
file is named test_...; .test(level > 10) should pick up all test files.
If you want to name something different but still have it run at a test
level < 10, then you need to run the test from one of the other test
files that will be picked up (test_regression.py and test_unicode.py are
doing that for example).
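
For reference, a minimal run under that convention (the level keyword
follows the old NumpyTest-style runner described here; newer NumPy test
runners take different arguments):

import numpy
numpy.test(level=1)    # picks up tests/test_*.py in each subpackage
numpy.test(level=11)   # level > 10: should pick up all test files
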
>
> 2) Any plans for code documentation? I documented some of my stuff 
> with doxygen markups and wonder if we should include a Doxyfile as 
> part of the package.
I'm not familiar with Doxygen, but would welcome any improvements to the 
code documentation.
>
> 3) Would you consider breaking out the Converters into a separate .c 
> file for inclusion? The code generator seems to take care of the ordering.
You are right that it doesn't matter which order the API subroutines are 
placed.  I'm not opposed to more breaking up of the .c files, as long as 
it is clear where things will be located.  The #include strategy is
necessary to get it all in one Python module, but having smaller .c 
files usually makes for faster editing.   It's the arrayobject.c file 
that is "too-large" IMHO, however.   That's where I would look for ways 
to break it up.

The iterobject and the data-type object could be taken out, for example.


-Travis





Re: [Numpy-discussion] immutable arrays

2006-09-21 Thread Travis Oliphant
Martin Wiechert wrote:
> Thanks Travis.
>
> Do I understand correctly that the only way to be really safe is to make a 
> copy and not to export a reference to it?
> Because anybody having a reference to the owner of the data can override the 
> flag?
>   
No, that's not quite correct.   Of course in C, anybody can do anything 
they want to the flags.

In Python, only the owner of the object itself can change the writeable 
flag once it is set to False.   So, if you only return a "view" of the 
array (a.view())  then the Python user will not be able to change the 
flags.

Example:

from numpy import array

a = array([1, 2, 3])
a.flags.writeable = False

b = a.view()

b.flags.writeable = True   # raises an error.

c = a
c.flags.writeable = True  # can be done because c is a direct alias to a.

Hopefully, that explains the situation a bit better.

-Travis



Re: [Numpy-discussion] Tests and code documentation

2006-09-21 Thread Sebastian Haase
On Thursday 21 September 2006 09:05, Charles R Harris wrote:
> Travis,
>
> A few questions.
>
> 1) I can't find any systematic code testing units, although there seem to
> be tests for regressions and such. Is there a place we should be putting
> such tests?
>
> 2) Any plans for code documentation? I documented some of my stuff with
> doxygen markups and wonder if we should include a Doxyfile as part of the
> package.

Are you able to use doxygen for Python code? I thought it only worked for 
C (and alike)?

>
> 3) Would you consider breaking out the Converters into a separate .c file
> for inclusion? The code generator seems to take care of the ordering.
>
> Chuck



[Numpy-discussion] Tests and code documentation

2006-09-21 Thread Charles R Harris
Travis,

A few questions.

1) I can't find any systematic code testing units, although there seem to 
be tests for regressions and such. Is there a place we should be putting 
such tests?

2) Any plans for code documentation? I documented some of my stuff with 
doxygen markups and wonder if we should include a Doxyfile as part of the 
package.

3) Would you consider breaking out the Converters into a separate .c file 
for inclusion? The code generator seems to take care of the ordering.

Chuck


Re: [Numpy-discussion] 1.0rc1 doesn't seem to work on AMD64

2006-09-21 Thread Charles R Harris
On 9/21/06, Peter Bienstman <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I just installed rc1 on an AMD64 machine, but I get this error message when
> trying to import it:
>
> Python 2.4.3 (#1, Sep 21 2006, 13:06:42)
> [GCC 4.1.1 (Gentoo 4.1.1)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy
> Traceback (most recent call last):

I don't see this running the latest from svn on AMD64 here. Not sayin'
there might not be a problem with rc1, I just don't see it with my sources.

Python 2.4.3 (#1, Jun 13 2006, 11:46:22)
[GCC 4.1.1 20060525 (Red Hat 4.1.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.version.version
'1.0.dev3202'
>>> numpy.version.os.uname()
('Linux', 'tethys', '2.6.17-1.2187_FC5', '#1 SMP Mon Sep 11 01:16:59 EDT
2006', 'x86_64')

If you are building on Gentoo maybe you could delete the build directory
(and maybe the numpy site package) and rebuild.

Chuck.


[Numpy-discussion] Question about recarray

2006-09-21 Thread Lionel Roubeyrie
Hi all,
Is it possible to put masked values into recarrays? I need an array with 
heterogeneous data types (datetime objects in the first column, all the 
others are floats) but with missing values in some records. For the 
moment, I can't find any solution for that. I have tried with arrays of 
dtype=object, but I have a problem when I want to compute min, max, ..., 
with an error like:
TypeError: function not supported for these types, and can't coerce safely to 
supported types.
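
One possible angle for the float columns, sketched with a masked array
(module layout and exact API differed in 1.0-era releases, so treat this
as illustrative only):

import numpy.ma as MA

# Masked slots stand in for the missing values; reductions skip them.
vals = MA.masked_array([1.5, 2.0, 3.5], mask=[False, True, False])
print(vals.min())   # 1.5 -- the masked entry is ignored
print(vals.max())   # 3.5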
Thanks

-- 
Lionel Roubeyrie - [EMAIL PROTECTED]
LIMAIR
http://www.limair.asso.fr



[Numpy-discussion] 1.0rc1 doesn't seem to work on AMD64

2006-09-21 Thread Peter Bienstman
Hi,

I just installed rc1 on an AMD64 machine, but I get this error message when 
trying to import it:

Python 2.4.3 (#1, Sep 21 2006, 13:06:42)
[GCC 4.1.1 (Gentoo 4.1.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
Traceback (most recent call last):
  File "", line 1, in ?
  File "/usr/lib64/python2.4/site-packages/numpy/__init__.py", line 36, in ?
import core
  File "/usr/lib64/python2.4/site-packages/numpy/core/__init__.py", line 7, 
in ?
import numerictypes as nt
  File "/usr/lib64/python2.4/site-packages/numpy/core/numerictypes.py", line 
191, in ?
_add_aliases()
  File "/usr/lib64/python2.4/site-packages/numpy/core/numerictypes.py", line 
169, in _add_aliases
base, bit, char = bitname(typeobj)
  File "/usr/lib64/python2.4/site-packages/numpy/core/numerictypes.py", line 
119, in bitname
char = base[0]
IndexError: string index out of range

Thanks!

Peter



Re: [Numpy-discussion] immutable arrays

2006-09-21 Thread Martin Wiechert
Thanks Travis.

Do I understand correctly that the only way to be really safe is to make a 
copy and not to export a reference to it?
Because anybody having a reference to the owner of the data can override the 
flag?

Cheers,
Martin

On Wednesday 20 September 2006 20:18, Travis Oliphant wrote:
> Martin Wiechert wrote:
> > Hi list,
> >
> > I just stumbled accross NPY_WRITEABLE flag.
> > Now I'd like to know if there are ways either from Python or C to make an
> > array temporarily immutable.
>
> Just setting the flag
>
> Python:
>
>   make immutable:
>   a.flags.writeable = False
>
>   make mutable again:
>   a.flags.writeable = True
>
>
> C:
>
>   make immutable:
>   a->flags &= ~NPY_WRITEABLE
>
>   make mutable again:
>   a->flags |= NPY_WRITEABLE
>
>
> In C you can play with immutability all you want.  In Python you can
> only make something writeable if you either 1) own the data or 2) the
> object that owns the data is itself "writeable"
>
>
> -Travis
>
>
