Re: [Numpy-discussion] Converting ranks to a Gaussian

2008-06-10 Thread Anne Archibald
2008/6/9 Keith Goodman [EMAIL PROTECTED]:
 Does anyone have a function that converts ranks into a Gaussian?

 I have an array x:

 import numpy as np
 x = np.random.rand(5)

 I rank it:

 x_ranked = x.argsort().argsort()
 x_ranked
   array([3, 1, 4, 2, 0])

 I would like to convert the ranks to a Gaussian without using scipy.
 So instead of the equal distance between ranks in array x, I would
 like the distance between them to follow a Gaussian distribution.

 How far out in the tails of the Gaussian should 0 and N-1 (N=5 in the
 example above) be? Ideally, or arbitrarily, the areas under the
 Gaussian to the left of 0 (and the right of N-1) should be 1/N or
 1/2N. Something like that. Or a fixed value is good too.

I'm actually not clear on what you need.

If what you need is for rank i of N to be the (100*i/N)th percentile in
a Gaussian distribution, then you should indeed use scipy's functions
to accomplish that; I'd use scipy.stats.norm.ppf().

Of course, if your points were drawn from a Gaussian distribution,
they wouldn't be exactly 1/N apart, there would be some distribution.
Quite what the distribution of (say) the maximum or the median of N
points drawn from a Gaussian is, I can't say, though people have
looked at it. But if you want typical values, just generate N points
from a Gaussian and sort them:

V = np.random.randn(N)
V = np.sort(V)

return V[ranks]

Of course they will be different every time, but the distribution will be right.
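[Editor's note: the sorted-random-draw idea above can be sketched as a small function. The function name and the modern `np.random.default_rng` seeding are illustrative additions, not from the thread.]

```python
import numpy as np

def ranks_to_gaussian_sample(ranks, seed=None):
    """Map ranks 0..N-1 onto a sorted draw of N standard-normal values.
    Sketch of the suggestion above; name and seeding are illustrative."""
    rng = np.random.default_rng(seed)
    ranks = np.asarray(ranks)
    v = np.sort(rng.standard_normal(ranks.size))  # sorted Gaussian draw
    return v[ranks]  # rank 0 -> smallest draw, rank N-1 -> largest

x_ranked = np.array([3, 1, 4, 2, 0])
g = ranks_to_gaussian_sample(x_ranked, seed=0)
```

Since sorting is monotone, ranking g recovers the original ranks; only the spacing between values changes from run to run.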

Anne
P.S. why the no scipy restriction? it's a bit unreasonable. -A
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Converting ranks to a Gaussian

2008-06-10 Thread Keith Goodman
On Tue, Jun 10, 2008 at 12:56 AM, Anne Archibald
[EMAIL PROTECTED] wrote:
 2008/6/9 Keith Goodman [EMAIL PROTECTED]:
 Does anyone have a function that converts ranks into a Gaussian?

 I have an array x:

 import numpy as np
 x = np.random.rand(5)

 I rank it:

 x_ranked = x.argsort().argsort()
 x_ranked
   array([3, 1, 4, 2, 0])

 I would like to convert the ranks to a Gaussian without using scipy.
 So instead of the equal distance between ranks in array x, I would
 like the distance between them to follow a Gaussian distribution.

 How far out in the tails of the Gaussian should 0 and N-1 (N=5 in the
 example above) be? Ideally, or arbitrarily, the areas under the
 Gaussian to the left of 0 (and the right of N-1) should be 1/N or
 1/2N. Something like that. Or a fixed value is good too.

 I'm actually not clear on what you need.

 If what you need is for rank i of N to be the (100*i/N)th percentile in
 a Gaussian distribution, then you should indeed use scipy's functions
 to accomplish that; I'd use scipy.stats.norm.ppf().

 Of course, if your points were drawn from a Gaussian distribution,
 they wouldn't be exactly 1/N apart, there would be some distribution.
 Quite what the distribution of (say) the maximum or the median of N
 points drawn from a Gaussian is, I can't say, though people have
 looked at it. But if you want typical values, just generate N points
 from a Gaussian and sort them:

 V = np.random.randn(N)
 V = np.sort(V)

 return V[ranks]

 Of course they will be different every time, but the distribution will be right.

I guess I botched the description of my problem.

I have data that contains outliers and other noise. I am trying
various transformations of the data to preprocess it before plugging
it into my prediction algorithm. One such transformation is to rank
the data and then convert that rank to a Gaussian. The particular
details of the transformation don't matter. I just want something
smooth and normal-like.

 Anne
 P.S. why the no scipy restriction? it's a bit unreasonable. -A

I'd rather not pull in a scipy dependency for one function if there is
a numpy alternative. I think it is funny that you picked up on my
brief mention of scipy and called it unreasonable.


Re: [Numpy-discussion] Converting ranks to a Gaussian

2008-06-10 Thread Dag Sverre Seljebotn
Keith Goodman wrote:
 I'd rather not pull in a scipy dependency for one function if there is
 a numpy alternative. I think it is funny that you picked up on my
 brief mention of scipy and called it unreasonable.
   
(I didn't follow this exact discussion, arguing from general principles 
here about SciPy dependencies.) Try to look at it this way:

NumPy may solve almost all your needs, and you only need, say, 0.1% of 
SciPy.

Assume, then, that the same statement is true about n other people. The 
problem then is that the 0.1% that each person needs from SciPy does not 
typically overlap. So as n grows larger, and assuming that everyone uses 
the same logic that you use, the amount of SciPy stuff that must be 
reimplemented on the NumPy discussion mailing list could become quite large.

(Besides, SciPy is open source, so you can always copy & paste the 
function from it if you only need one trivial bit of it. Not that I 
recommend doing that, but it's still better than reimplementing it 
yourself. Unless you're doing it for the learning experience of course.)

Dag Sverre


Re: [Numpy-discussion] Converting ranks to a Gaussian

2008-06-09 Thread Robert Kern
On Mon, Jun 9, 2008 at 18:34, Keith Goodman [EMAIL PROTECTED] wrote:
 Does anyone have a function that converts ranks into a Gaussian?

 I have an array x:

 import numpy as np
 x = np.random.rand(5)

 I rank it:

 x_ranked = x.argsort().argsort()
 x_ranked
   array([3, 1, 4, 2, 0])

There are subtleties in computing ranks when ties are involved. Take a
look at the implementation of scipy.stats.rankdata().
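[Editor's note: one such subtlety is averaging the ranks of tied values. A compact numpy sketch of that convention (not scipy's actual implementation, and with an illustrative function name) looks like:]

```python
import numpy as np

def rankdata_average(x):
    """1-based ranks with ties averaged -- the convention
    scipy.stats.rankdata uses. Compact numpy sketch, not scipy's code."""
    x = np.asarray(x)
    sorter = np.argsort(x, kind='mergesort')   # stable sort
    inv = np.empty(x.size, dtype=int)
    inv[sorter] = np.arange(x.size)            # each item's position in sorted order
    xs = x[sorter]
    obs = np.r_[True, xs[1:] != xs[:-1]]       # True at the start of each tie group
    dense = obs.cumsum()[inv]                  # dense (1-based) rank per item
    count = np.r_[np.nonzero(obs)[0], x.size]  # tie-group boundaries
    return 0.5 * (count[dense] + count[dense - 1] + 1)  # average rank in each group
```

For example, [1, 2, 2, 3] gets ranks [1, 2.5, 2.5, 4]: the two tied values share the average of ranks 2 and 3.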

 I would like to convert the ranks to a Gaussian without using scipy.

No dice. You are going to have to use scipy.special.ndtri somewhere. A
basic transformation (off the top of my head, I have no idea if this
is statistically meaningful):

  scipy.special.ndtri((ranks + 1.0) / (len(ranks) + 1.0))

Barring tied first or last items, this should give equal weight to
each of the tails outside of the range of your data.
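[Editor's note: the inverse normal CDF has since landed in the Python standard library as statistics.NormalDist (Python 3.8+, so it did not exist when this thread was written). With it, the ndtri-based transform above can be done without scipy; the function name here is illustrative.]

```python
import numpy as np
from statistics import NormalDist  # stdlib inverse normal CDF, Python >= 3.8

def ranks_to_gaussian(ranks):
    """The transform ndtri((ranks + 1) / (N + 1)), computed without scipy."""
    ranks = np.asarray(ranks, dtype=float)
    p = (ranks + 1.0) / (ranks.size + 1.0)  # percentiles strictly inside (0, 1)
    nd = NormalDist()                       # standard normal
    return np.array([nd.inv_cdf(q) for q in p])
```

The middle rank of an odd-length array lands exactly at 0.0, and ranks equidistant from the middle map to values symmetric about 0.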

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
 -- Umberto Eco


Re: [Numpy-discussion] Converting ranks to a Gaussian

2008-06-09 Thread Pierre GM
On Monday 09 June 2008 22:06:24 Keith Goodman wrote:
 On Mon, Jun 9, 2008 at 4:45 PM, Robert Kern [EMAIL PROTECTED] wrote:

  There are subtleties in computing ranks when ties are involved. Take a
  look at the implementation of scipy.stats.rankdata().

 Good point. I had to deal with ties and missing data. I bet
 scipy.stats.rankdata() is faster than my implementation.

There's a scipy.stats.mstats.rankdata() that takes care of both ties and 
missing data. Missing data are allocated a rank of either 0 or the average 
rank, depending on some parameter.



Re: [Numpy-discussion] Converting ranks to a Gaussian

2008-06-09 Thread Keith Goodman
On Mon, Jun 9, 2008 at 7:02 PM, Pierre GM [EMAIL PROTECTED] wrote:
 On Monday 09 June 2008 22:06:24 Keith Goodman wrote:
 On Mon, Jun 9, 2008 at 4:45 PM, Robert Kern [EMAIL PROTECTED] wrote:

  There are subtleties in computing ranks when ties are involved. Take a
  look at the implementation of scipy.stats.rankdata().

 Good point. I had to deal with ties and missing data. I bet
 scipy.stats.rankdata() is faster than my implementation.

 There's a scipy.stats.mstats.rankdata() that takes care of both ties and
 missing data. Missing data are allocated a rank of either 0 or the average
 rank, depending on some parameter.

That sounds interesting. But I can't find it:

 import scipy
 from scipy import stats
 scipy.stats.m
scipy.stats.mannwhitneyu  scipy.stats.maxwell    scipy.stats.mean
scipy.stats.median        scipy.stats.mielke     scipy.stats.mode
scipy.stats.moment        scipy.stats.mood       scipy.stats.morestats
scipy.stats.mvn
 scipy.stats.morestats.r
scipy.stats.morestats.r_  scipy.stats.morestats.ravel

In my implementation I leave the missing values as missing. I think
that would be a nice option for rankdata.
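[Editor's note: that NaN-preserving option might look like this in plain numpy. A sketch only: no tie handling, and the function name is illustrative.]

```python
import numpy as np

def rankdata_keep_nan(x):
    """0-based ranks of the finite entries, with NaNs left as NaN --
    the option described above. Sketch: ties are not handled."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    ok = ~np.isnan(x)
    out[ok] = x[ok].argsort().argsort()  # rank only the non-missing values
    return out
```

For example, [0.3, nan, 0.1, 0.2] becomes [2.0, nan, 0.0, 1.0].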


Re: [Numpy-discussion] Converting ranks to a Gaussian

2008-06-09 Thread Robert Kern
On Mon, Jun 9, 2008 at 21:06, Keith Goodman [EMAIL PROTECTED] wrote:
 On Mon, Jun 9, 2008 at 4:45 PM, Robert Kern [EMAIL PROTECTED] wrote:
 On Mon, Jun 9, 2008 at 18:34, Keith Goodman [EMAIL PROTECTED] wrote:
 Does anyone have a function that converts ranks into a Gaussian?

 I have an array x:

 import numpy as np
 x = np.random.rand(5)

 I rank it:

 x_ranked = x.argsort().argsort()
 x_ranked
   array([3, 1, 4, 2, 0])

 There are subtleties in computing ranks when ties are involved. Take a
 look at the implementation of scipy.stats.rankdata().

 Good point. I had to deal with ties and missing data. I bet
 scipy.stats.rankdata() is faster than my implementation.

Actually, it's pretty slow. I think there are opportunities to make it
faster, but I haven't explored them.

 I would like to convert the ranks to a Gaussian without using scipy.

 No dice. You are going to have to use scipy.special.ndtri somewhere. A
 basic transformation (off the top of my head, I have no idea if this
 is statistically meaningful):

  scipy.special.ndtri((ranks + 1.0) / (len(ranks) + 1.0))

 Barring tied first or last items, this should give equal weight to
 each of the tails outside of the range of your data.

 Nice. Thank you. It passes the never wrong chi-by-eye test:

 r = np.arange(1000)
 g = special.ndtri((r + 1.0) / (len(r) + 1.0))
 pylab.hist(g, 50)
 pylab.show()

BTW, what is your use case? If you are trying to compare your data to
the normal distribution, you might want to go the other way: find the
empirical quantiles of your data and compare them against the
hypothetical quantiles of the data on the distribution. This *is* an
established statistical technique; when you graph one versus the
other, you get what is known as a Q-Q plot.
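[Editor's note: the Q-Q coordinates can be computed with numpy plus the standard library's statistics.NormalDist (Python 3.8+, a modern scipy-free stand-in for the inverse CDF). The plotting positions (i + 0.5)/n used here are one common choice among several.]

```python
import numpy as np
from statistics import NormalDist  # stdlib inverse normal CDF, Python >= 3.8

def qq_points(data):
    """Coordinates for a normal Q-Q plot: theoretical quantiles on x,
    sorted data on y. Plotting itself is left to the caller."""
    data = np.sort(np.asarray(data, dtype=float))
    p = (np.arange(data.size) + 0.5) / data.size
    nd = NormalDist()
    theory = np.array([nd.inv_cdf(q) for q in p])
    return theory, data
```

Feeding both arrays to a scatter plot gives the Q-Q plot; points near the diagonal indicate approximately normal data.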

 I wasn't able to use scipy.special.ndtri (after import scipy) like you
 did.

Actually, I didn't specify any import. The behavior you see is
correct. Importing scipy alone doesn't import any of the subpackages.
If you like, you can pretend there was an implicit import
scipy.special before the code I gave.

-- 
Robert Kern



Re: [Numpy-discussion] Converting ranks to a Gaussian

2008-06-09 Thread Pierre GM
On Monday 09 June 2008 22:30:09 Keith Goodman wrote:
 On Mon, Jun 9, 2008 at 7:02 PM, Pierre GM [EMAIL PROTECTED] wrote:
  There's a scipy.stats.mstats.rankdata() that takes care of both ties and
  missing data. Missing data are allocated a rank of either 0 or the
  average rank, depending on some parameter.

 That sounds interesting. But I can't find it:
  import scipy
  from scipy import stats

Yes, you should do
 import scipy.stats.mstats as mstats
 mstats.rankdata

 In my implementation I leave the missing values as missing. I think
 that would be a nice option for rankdata.

Handling missing data is why I needed a tailored rankdata. 
In mstats.rankdata, if you set the use_missing optional parameter to False 
(the default), the missing values will have a rank of 0. As no other value will have a rank 
of zero, you can then remask with masked_values or masked_equal.
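[Editor's note: that behavior can be mimicked in plain numpy, since np.ma ships with numpy itself. A sketch with an illustrative name and no tie handling:]

```python
import numpy as np

def rankdata_masked(x):
    """Mimic mstats.rankdata with use_missing=False: masked entries get
    rank 0, valid entries get 1-based ranks (ties not handled here)."""
    x = np.ma.asarray(x)
    out = np.zeros(x.shape)
    ok = ~np.ma.getmaskarray(x)
    out[ok] = x.compressed().argsort().argsort() + 1  # 1-based ranks
    return out
```

Since no valid value gets rank 0, np.ma.masked_equal(out, 0) then re-masks the missing entries, exactly as described above.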


Re: [Numpy-discussion] Converting ranks to a Gaussian

2008-06-09 Thread Keith Goodman
On Mon, Jun 9, 2008 at 7:35 PM, Robert Kern [EMAIL PROTECTED] wrote:
 On Mon, Jun 9, 2008 at 21:06, Keith Goodman [EMAIL PROTECTED] wrote:
 On Mon, Jun 9, 2008 at 4:45 PM, Robert Kern [EMAIL PROTECTED] wrote:
 On Mon, Jun 9, 2008 at 18:34, Keith Goodman [EMAIL PROTECTED] wrote:
 Does anyone have a function that converts ranks into a Gaussian?

 I have an array x:

 import numpy as np
 x = np.random.rand(5)

 I rank it:

 x_ranked = x.argsort().argsort()
 x_ranked
   array([3, 1, 4, 2, 0])

 There are subtleties in computing ranks when ties are involved. Take a
 look at the implementation of scipy.stats.rankdata().

 Good point. I had to deal with ties and missing data. I bet
 scipy.stats.rankdata() is faster than my implementation.

 Actually, it's pretty slow. I think there are opportunities to make it
 faster, but I haven't explored them.

 I would like to convert the ranks to a Gaussian without using scipy.

 No dice. You are going to have to use scipy.special.ndtri somewhere. A
 basic transformation (off the top of my head, I have no idea if this
 is statistically meaningful):

  scipy.special.ndtri((ranks + 1.0) / (len(ranks) + 1.0))

 Barring tied first or last items, this should give equal weight to
 each of the tails outside of the range of your data.

 Nice. Thank you. It passes the never wrong chi-by-eye test:

 r = np.arange(1000)
 g = special.ndtri((r + 1.0) / (len(r) + 1.0))
 pylab.hist(g, 50)
 pylab.show()

 BTW, what is your use case? If you are trying to compare your data to
 the normal distribution, you might want to go the other way: find the
 empirical quantiles of your data and compare them against the
 hypothetical quantiles of the data on the distribution. This *is* an
 established statistical technique; when you graph one versus the
 other, you get what is known as a Q-Q plot.

Some of the techniques I use are sensitive to noise. So to make things
more robust I sometimes transform my noisy, sort-of-normal data.
There's no theory to guide me, only empirical results.