Bug#322454: python-scipy: several scipy.stats failures

2005-08-11 Thread Alexandre Fayolle
On Wed, Aug 10, 2005 at 02:39:54PM -0400, Perry, Alexander (GE Infrastructure) 
wrote:
 Package: python-scipy
 Version: 0.3.2-6
 Severity: normal
 Tags: patch
 
 
 There are typographic bugs in stats.py that make two functions fail.  In
 addition, the calculation for nanstd() provides obviously wrong answers.
 I have not done any formal checking for the correction attached below.
 
 In the distributions.py file, the calculation of est_loc_scale() uses
 a sibling method wrongly and references an import by the wrong name.
 In the same file, nnlf() method passes its own object twice to _nnlf()
 and its parameter parsing is incompatible with the fit() method's needs.
 
 The patch below is against the directory /lib/python*/site-packages/scipy/stats

Thanks a lot for your bug report and patch. 

I shall prepare a new revision of the package and upload it before the
end of the week.  

-- 
Alexandre Fayolle  LOGILAB, Paris (France).
http://www.logilab.com   http://www.logilab.fr  http://www.logilab.org




Bug#322454: python-scipy: several scipy.stats failures

2005-08-11 Thread Alexandre Fayolle
On Wed, Aug 10, 2005 at 02:39:54PM -0400, Perry, Alexander (GE Infrastructure) 
wrote:
 Package: python-scipy
 Version: 0.3.2-6
 Severity: normal
 Tags: patch
 
 
 There are typographic bugs in stats.py that make two functions fail.  In
 addition, the calculation for nanstd() provides obviously wrong answers.
 I have not done any formal checking for the correction attached below.

I'm having a few problems understanding your patch for nanstd, which
does not seem to give the right results either, but I'm not sure of the
expected semantics of the nanstd function. Is it expected that

nanstd([nan, 2., 4.]) == std([2., 4.]) 

or that 

nanstd([nan, 2., 4.]) == std([0., 2., 4.]) 

Your code produces neither. I'd go for the first option (given the
definition of nanmean).
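
For concreteness, the two candidates give different numbers on that example.
A quick illustration in plain numpy (my sketch, with ddof=1 standing in for
the n-1 convention used by scipy's std, not the Numeric-based scipy 0.3.2
under discussion):

# Illustration only: comparing the two candidate semantics with plain numpy.
import numpy as np

data = np.array([np.nan, 2., 4.])
clean = data[~np.isnan(data)]                  # drop the nan: [2., 4.]
zeroed = np.where(np.isnan(data), 0., data)    # replace it with 0: [0., 2., 4.]

print(np.std(clean, ddof=1))    # option 1: 1.4142...
print(np.std(zeroed, ddof=1))   # option 2: 2.0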

I propose the following implementation of nanstd:

def nanstd(x,axis=-1,bias=0):
    """Compute the standard deviation over the given axis ignoring nans"""
    x, axis = _chk_asarray(x,axis)
    x = x.copy()
    mn = expand_dims(nanmean(x, axis), axis)
    deviation = x - mn
    N.putmask(deviation, isnan(x), 0.)
    y = ss(deviation, axis)

    Norig = x.shape[axis]
    n = Norig - sum(isnan(x),axis)*1.0
    if bias:
        y /= n
    else:
        y /= n-1.
    return sqrt(y)

Does it seem correct to you?
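
For reference, here is the same proposal as a standalone sketch in plain
numpy (np.nanmean, boolean indexing and np.sum stand in for the Numeric-era
helpers; this is only for checking the numbers, not the code to upload):

# Sketch only: plain-numpy translation of the proposal above.
import numpy as np

def nanstd_proposed(x, axis=-1, bias=0):
    x = np.asarray(x, dtype=float).copy()
    mn = np.expand_dims(np.nanmean(x, axis), axis)
    deviation = x - mn
    deviation[np.isnan(x)] = 0.0                # like N.putmask(deviation, isnan(x), 0.)
    y = np.sum(deviation ** 2, axis)            # ss(): sum of squares along axis
    Norig = x.shape[axis]
    n = Norig - np.sum(np.isnan(x), axis) * 1.0
    y = y / n if bias else y / (n - 1.0)
    return np.sqrt(y)

print(nanstd_proposed([np.nan, 2., 4.]))        # 1.4142... == std([2., 4.])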


 
 In the distributions.py file, the calculation of est_loc_scale() uses
 a sibling method wrongly and references an import by the wrong name.

ok.

 In the same file, nnlf() method passes its own object twice to _nnlf()

I don't see this in my code. The diffs you sent me are strange; some of
the chunks seem to be reversed. 

 ***
 *** 743,749 
   return inf
   else:
   N = len(x)
 ! return self._nnlf(self, x, *args) + N*log(scale)
   
   def fit(self, data, *args, **kwds):
   loc0, scale0 = map(kwds.get, ['loc', 'scale'],[0.0, 1.0])
 --- 749,755 
   return inf
   else:
   N = len(x)
 ! return self._nnlf(x, *args) + N*log(scale)
   
   def fit(self, data, *args, **kwds):
   loc0, scale0 = map(kwds.get, ['loc', 'scale'],[0.0, 1.0])

Do we agree that the bottom version is correct? I.e.:
return self._nnlf(x, *args) + N*log(scale) 

This is the code I had in my source tree. 


This is the final patch that I plan to upload. Comments are welcome:

--- stats.py.orig   2005-08-11 17:15:40.0 +0200
+++ stats.py    2005-08-11 18:17:38.0 +0200
@@ -241,28 +241,27 @@
 x = x.copy()
 Norig = x.shape[axis]
 factor = 1.0-sum(isnan(x),axis)*1.0/Norig
-n = N-sum(isnan(x),axis)
+n = N.sum(isnan(x),axis)
 N.putmask(x,isnan(x),0)
-return stats.mean(x,axis)/factor
+return mean(x,axis)/factor
 
 def nanstd(x,axis=-1,bias=0):
 """Compute the standard deviation over the given axis ignoring nans"""
 
 x, axis = _chk_asarray(x,axis)
 x = x.copy()
+mn = expand_dims(nanmean(x, axis), axis)
+deviation = x - mn
+N.putmask(deviation, isnan(x), 0.)
+y = ss(deviation, axis)
+
 Norig = x.shape[axis]
 n = Norig - sum(isnan(x),axis)*1.0
-factor = n/Norig
-n = N-sum(isnan(x),axis)
-N.putmask(x,isnan(x),0)
-m1 = stats.mean(x,axis)
-m1c = m1/factor
-m2 = stats.mean((x-m1c)**2.0,axis)
 if bias:
-m2c = m2/factor
+y /= n
 else:
-m2c = m2*Norig/(n-1.0)
-return m2c
+y /= n-1.
+return sqrt(y)
 
 def _nanmedian(arr1d):  # This only works on 1d arrays
cond = 1-isnan(arr1d)

--- distributions.py.orig   2005-08-11 17:18:39.0 +0200
+++ distributions.py    2005-08-11 18:32:39.0 +0200
@@ -730,15 +730,9 @@
 #
 try:
 x = args[-1]
-  args = args[:-1]
-except IndexError:
-raise ValueError, "Need shape and data arguments."
-  try:
-if len(args) == 1:
-  args = args[0]
-  loc = args[-1]
-scale = args[-2]
-args = args[:-2]
+loc = args[-2]
+scale = args[-3]
+args = args[:-3]
 except IndexError:
 raise ValueError, "Not enough shape arguments."
 if not self._argcheck(*args) or scale <= 0:
@@ -764,9 +758,9 @@
 return optimize.fmin(self.nnlf,x0,args=(ravel(data),),disp=0)
 
 def est_loc_scale(self, data, *args):
-mu, mu2 = self.stats(*args,**{'moments':'mv'})
-muhat = st.nanmean(data)
-mu2hat = st.nanstd(data)
+mu, mu2, g1, g2 = self.stats(*args,**{'moments':'mv'})
+muhat = stats.nanmean(data)
+mu2hat = stats.nanstd(data)
 Shat = sqrt(mu2hat / mu2)
 Lhat = muhat - Shat*mu
 return Lhat, Shat
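
For context, est_loc_scale is a moment-matching estimate: if the data are
modelled as loc + scale*Z, where Z has mean mu and variance mu2, then
Shat = sqrt(mu2hat/mu2) and Lhat = muhat - Shat*mu recover scale and loc.
A minimal standalone sketch (mine, plain numpy, normal Z assumed, not part
of the patch):

# Sketch only: the moment matching behind est_loc_scale, with a normal
# standardized family so mu = 0 and mu2 = 1.
import numpy as np

rng = np.random.default_rng(0)
loc, scale = 3.0, 2.0
data = loc + scale * rng.standard_normal(100000)

mu, mu2 = 0.0, 1.0                    # moments of the standardized family
muhat = np.nanmean(data)              # sample mean (nan-aware, like nanmean)
mu2hat = np.nanvar(data, ddof=1)      # sample second central moment
Shat = np.sqrt(mu2hat / mu2)          # ~ 2.0, recovers the scale
Lhat = muhat - Shat * mu              # ~ 3.0, recovers the location
print(Lhat, Shat)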


-- 
Alexandre Fayolle  LOGILAB, Paris (France).
http://www.logilab.com   http://www.logilab.fr  http://www.logilab.org




Bug#322454: python-scipy: several scipy.stats failures

2005-08-11 Thread Perry, Alexander (GE Infrastructure)
Oh, I didn't know there was a Debian SciPy list!

From: Alexandre Fayolle [mailto:[EMAIL PROTECTED]
On Wed, Aug 10, 2005 at 02:39:54PM -0400, Perry, Alexander (GE
Infrastructure) wrote:
  There are typographic bugs in stats.py that make two functions fail.  In
  addition, the calculation for nanstd() provides obviously wrong answers.
  I have not done any formal checking for the correction attached below.

 I'm having a few problems understanding your patch for nanstd, which
 does not seem to give the right results either, but I'm not sure of the
 expected semantics of the nanstd function. Is it expected that
 nanstd([nan, 2., 4.]) == std([2., 4.])
 or that
 nanstd([nan, 2., 4.]) == std([0., 2., 4.])
 Your code produces neither. I'd go for the first option (given the
 definition of nanmean).

I don't know; as far as I can tell, it isn't documented what upstream
intends the function to return.  Feel free to pick which you think is
best and we can see what upstream does with my bug report.  Bear in mind
there is a third result option, which is what I attempted to achieve.  I
don't know whether it is the correct one, of course.  The third option
is that the nan-removed dataset is a sample of the larger dataset, so
that the standard deviation of the larger dataset has to be increased
beyond the standard deviation of the nan-removed dataset to account
for the increase in uncertainty.
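
For reference, this is what my attempted nanstd (the hunk quoted at the end
of this thread) computes, translated to a standalone 1-d sketch in plain
numpy for illustration only:

# Illustration only: 1-d plain-numpy translation of the nanstd hunk from
# the original patch quoted at the end of this thread.
import numpy as np

def nanstd_attempted(x, bias=0):
    x = np.asarray(x, dtype=float).copy()
    Norig = x.size
    n = Norig - np.isnan(x).sum() * 1.0   # number of non-nan entries
    x[np.isnan(x)] = 0.0                  # zero out the nans (putmask)
    mn = x.mean()                         # mean over all Norig entries
    y = np.sum((x - mn) ** 2)             # zeroed entries still contribute here
    y = y / n if bias else y / (n - 1.0)
    return np.sqrt(y)

print(nanstd_attempted([np.nan, 2., 4.]))   # sqrt(8) ~ 2.83: neither
                                            # std([2., 4.]) nor std([0., 2., 4.])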

 I propose the following implementation of nanstd:

Fine by me.

  In the same file, nnlf() method passes its own object twice to
_nnlf()

 I don't see this in my code.

Line 746 of
/usr/lib/python2.3/site-packages/scipy/stats/distributions.py
  ! return self._nnlf(self, x, *args) + N*log(scale)

Because _nnlf is called as a method of self, Python automatically supplies
the instance as the first argument to the call. Writing self explicitly as
the first parameter means it gets passed a second time.
This is sometimes useful, but not in this case; look at line 724.
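
A toy illustration of that effect (made-up class, not scipy code):

# Toy example only, not scipy code: what happens when a bound method is
# also given self explicitly.
class Dist:
    def _nnlf(self, x, *args):
        return x                      # x is meant to be the data

d = Dist()
print(d._nnlf([1, 2, 3]))             # [1, 2, 3]: self supplied automatically
print(d._nnlf(d, [1, 2, 3]))          # the Dist instance: self passed twice,
                                      # so the real data slid into *args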

 The diffs you sent me are strange; some of
 the chunks seem to be reversed.

Yeah, I notice now that part of my diffs ended up reversed.  Sorry.

 Do we agree that the bottom version is correct? I.e.:
 return self._nnlf(x, *args) + N*log(scale) 

Yes.

 This is the code I had in my source tree. 

Really?  How odd.  My lines were quoted from a computer running
Testing; the file is in the binary package python2.3-scipy for i386,
and the unmodified installed version is 0.3.2-6.

 This is the final patch that I plan to upload.

Some of that looks reversed; for example:
 -mu, mu2 = self.stats(*args,**{'moments':'mv'})
 -muhat = st.nanmean(data)
 -mu2hat = st.nanstd(data)
 +mu, mu2, g1, g2 = self.stats(*args,**{'moments':'mv'})
 +muhat = stats.nanmean(data)
 +mu2hat = stats.nanstd(data)
Maybe you already have some of my fixes in your local tree?




Bug#322454: python-scipy: several scipy.stats failures

2005-08-11 Thread Alexandre Fayolle
On Thu, Aug 11, 2005 at 01:48:58PM -0400, Perry, Alexander (GE Infrastructure) 
wrote:
 Oh, I didn't know there was a Debian SciPy list!

The package is maintained by several people and we use this list to
synchronize. Discussing patches to apply to the package fits nicely with
the topic of the list :-) 

The BTS is fine too. 

I've also sent some information to the scipy bugtracker
(http://www.scipy.net/roundup/scipy/issue243)

 
 I don't know; as far as I can tell, it isn't documented what upstream
 intends the function to return.  Feel free to pick which you think is
 best and we can see what upstream does with my bug report.  Bear in mind
 there is a third result option, which is what I attempted to achieve.  I
 don't know whether it is the correct one, of course.  The third option
 is that the nan-removed dataset is a sample of the larger dataset, so
 that the standard deviation of the larger dataset has to be increased
 beyond the standard deviation of the nan-removed dataset to account
 for the increase in uncertainty.

I see. I'll ask upstream about this. 
 

 Maybe you already have some of my fixes in your local tree?

Maybe; I'll have to double-check tomorrow to be sure. It's too late for
me right now, and I'm bound to make some mistakes. It is possible that the
patch you sent me was partially applied, which brought some confusion into
my local tree. 

-- 
Alexandre Fayolle  LOGILAB, Paris (France).
http://www.logilab.com   http://www.logilab.fr  http://www.logilab.org




Bug#322454: python-scipy: several scipy.stats failures

2005-08-10 Thread Perry, Alexander (GE Infrastructure)
Package: python-scipy
Version: 0.3.2-6
Severity: normal
Tags: patch


There are typographic bugs in stats.py that make two functions fail.  In
addition, the calculation for nanstd() provides obviously wrong answers.
I have not done any formal checking for the correction attached below.

In the distributions.py file, the calculation of est_loc_scale() uses
a sibling method wrongly and references an import by the wrong name.
In the same file, nnlf() method passes its own object twice to _nnlf()
and its parameter parsing is incompatible with the fit() method's needs.

The patch below is against the directory /lib/python*/site-packages/scipy/stats


*** stats.py    2005-08-10 10:05:38.0 -0700
--- stats.py.orig   2005-08-10 09:31:59.0 -0700
***
*** 241,249 
  x = x.copy()
  Norig = x.shape[axis]
  factor = 1.0-sum(isnan(x),axis)*1.0/Norig
! n = N.sum(isnan(x),axis)
  N.putmask(x,isnan(x),0)
! return mean(x,axis)/factor
  
  def nanstd(x,axis=-1,bias=0):
  """Compute the standard deviation over the given axis ignoring nans"""
--- 241,249 
  x = x.copy()
  Norig = x.shape[axis]
  factor = 1.0-sum(isnan(x),axis)*1.0/Norig
! n = N-sum(isnan(x),axis)
  N.putmask(x,isnan(x),0)
! return stats.mean(x,axis)/factor
  
  def nanstd(x,axis=-1,bias=0):
  """Compute the standard deviation over the given axis ignoring nans"""
***
*** 253,266 
  Norig = x.shape[axis]
  n = Norig - sum(isnan(x),axis)*1.0
  factor = n/Norig
  N.putmask(x,isnan(x),0)
! mn = expand_dims(mean(x,axis),axis)
! y = ss(x-mn,axis)
  if bias:
!y /= n
  else:
!y /= n-1
! return sqrt(y)
  
  def _nanmedian(arr1d):  # This only works on 1d arrays
 cond = 1-isnan(arr1d)
--- 253,268 
  Norig = x.shape[axis]
  n = Norig - sum(isnan(x),axis)*1.0
  factor = n/Norig
+ n = N-sum(isnan(x),axis)
  N.putmask(x,isnan(x),0)
! m1 = stats.mean(x,axis)
! m1c = m1/factor
! m2 = stats.mean((x-m1c)**2.0,axis)
  if bias:
! m2c = m2/factor
  else:
! m2c = m2*Norig/(n-1.0)
! return m2c
  
  def _nanmedian(arr1d):  # This only works on 1d arrays
 cond = 1-isnan(arr1d)
*** distributions.py.orig   2005-08-10 09:25:14.0 -0700
--- distributions.py    2005-08-10 10:32:11.0 -0700
***
*** 730,740 
  #
  try:
  x = args[-1]
! loc = args[-2]
! scale = args[-3]
! args = args[:-3]
  except IndexError:
! raise ValueError, "Not enough input arguments."
  if not self._argcheck(*args) or scale <= 0:
  return inf
  x = arr((x-loc) / scale)
--- 730,746 
  #
  try:
  x = args[-1]
!   args = args[:-1]
  except IndexError:
! raise ValueError, "Need shape and data arguments."
!   try:
! if len(args) == 1:
!   args = args[0]
!   loc = args[-1]
! scale = args[-2]
! args = args[:-2]
! except IndexError:
! raise ValueError, "Not enough shape arguments."
  if not self._argcheck(*args) or scale <= 0:
  return inf
  x = arr((x-loc) / scale)
***
*** 743,749 
  return inf
  else:
  N = len(x)
! return self._nnlf(self, x, *args) + N*log(scale)
  
  def fit(self, data, *args, **kwds):
  loc0, scale0 = map(kwds.get, ['loc', 'scale'],[0.0, 1.0])
--- 749,755 
  return inf
  else:
  N = len(x)
! return self._nnlf(x, *args) + N*log(scale)
  
  def fit(self, data, *args, **kwds):
  loc0, scale0 = map(kwds.get, ['loc', 'scale'],[0.0, 1.0])
***
*** 758,766 
  return optimize.fmin(self.nnlf,x0,args=(ravel(data),),disp=0)
  
  def est_loc_scale(self, data, *args):
! mu, mu2, g1, g2 = self.stats(*args,**{'moments':'mv'})
! muhat = stats.nanmean(data)
! mu2hat = stats.nanstd(data)
  Shat = sqrt(mu2hat / mu2)
  Lhat = muhat - Shat*mu
  return Lhat, Shat
--- 764,772 
  return optimize.fmin(self.nnlf,x0,args=(ravel(data),),disp=0)
  
  def est_loc_scale(self, data, *args):
! mu, mu2 = self.stats(*args,**{'moments':'mv'})
! muhat = st.nanmean(data)
! mu2hat = st.nanstd(data)
  Shat = sqrt(mu2hat / mu2)
  Lhat = muhat - Shat*mu
  return Lhat, Shat




-- System Information:
Debian Release: testing/unstable
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.4.23
Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1)

Versions of packages python-scipy depends on:
ii  python2.3-scipy   0.3.2-6scientific tools for Python 2.3
