Re: [Numpy-discussion] Coverting ranks to a Gaussian
2008/6/9 Keith Goodman [EMAIL PROTECTED]: Does anyone have a function that converts ranks into a Gaussian? I have an array x: import numpy as np x = np.random.rand(5) I rank it: x = x.argsort().argsort() x_ranked = x.argsort().argsort() x_ranked array([3, 1, 4, 2, 0]) I would like to convert the ranks to a Gaussian without using scipy. So instead of the equal distance between ranks in array x, I would like the distance been them to follow a Gaussian distribution. How far out in the tails of the Gaussian should 0 and N-1 (N=5 in the example above) be? Ideally, or arbitrarily, the areas under the Gaussian to the left of 0 (and the right of N-1) should be 1/N or 1/2N. Something like that. Or a fixed value is good too. I'm actually not clear on what you need. If what you need is for rank i of N to be the 100*i/N th percentile in a Gaussian distribution, then you should indeed use scipy's functions to accomplish that; I'd use scipy.stats.norm.ppf(). Of course, if your points were drawn from a Gaussian distribution, they wouldn't be exactly 1/N apart, there would be some distribution. Quite what the distribution of (say) the maximum or the median of N points drawn from a Gaussian is, I can't say, though people have looked at it. But if you want typical values, just generate N points from a Gaussian and sort them: V = np.random.randn(N) V = np.sort(V) return V[ranks] Of course they will be different every time, but the distribution will be right. Anne P.S. why the no scipy restriction? it's a bit unreasonable. -A ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Coverting ranks to a Gaussian
On Tue, Jun 10, 2008 at 12:56 AM, Anne Archibald [EMAIL PROTECTED] wrote: 2008/6/9 Keith Goodman [EMAIL PROTECTED]: Does anyone have a function that converts ranks into a Gaussian? I have an array x: import numpy as np x = np.random.rand(5) I rank it: x = x.argsort().argsort() x_ranked = x.argsort().argsort() x_ranked array([3, 1, 4, 2, 0]) I would like to convert the ranks to a Gaussian without using scipy. So instead of the equal distance between ranks in array x, I would like the distance been them to follow a Gaussian distribution. How far out in the tails of the Gaussian should 0 and N-1 (N=5 in the example above) be? Ideally, or arbitrarily, the areas under the Gaussian to the left of 0 (and the right of N-1) should be 1/N or 1/2N. Something like that. Or a fixed value is good too. I'm actually not clear on what you need. If what you need is for rank i of N to be the 100*i/N th percentile in a Gaussian distribution, then you should indeed use scipy's functions to accomplish that; I'd use scipy.stats.norm.ppf(). Of course, if your points were drawn from a Gaussian distribution, they wouldn't be exactly 1/N apart, there would be some distribution. Quite what the distribution of (say) the maximum or the median of N points drawn from a Gaussian is, I can't say, though people have looked at it. But if you want typical values, just generate N points from a Gaussian and sort them: V = np.random.randn(N) V = np.sort(V) return V[ranks] Of course they will be different every time, but the distribution will be right. I guess I botched the description of my problem. I have data that contains outliers and other noise. I am trying various transformations of the data to preprocess it before plugging it into my prediction algorithm. One such transformation is to rank the data and then convert that rank to a Gaussian. The particular details of the transformation don't matter. I just want something smooth and normal like. Anne P.S. why the no scipy restriction? it's a bit unreasonable. -A I'd rather not pull in a scipy dependency for one function if there is a numpy alternative. I think it is funny that you picked up on my brief mention of scipy and called it unreasonable. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Coverting ranks to a Gaussian
Keith Goodman wrote: I'd rather not pull in a scipy dependency for one function if there is a numpy alternative. I think it is funny that you picked up on my brief mention of scipy and called it unreasonable. (I didn't follow this exact discussion, arguing from general principles here about SciPy dependencies.) Try to look at it this way: NumPy may solve almost all your needs, and you only need, say, 0.1% of SciPy. Assume, then, that the same statement is true about n other people. The problem then is that the 0.1% that each person needs from SciPy does not typically overlap. So as n grows larger, and assuming that everyone use the same logic that you use, the amount of SciPy stuff that must be reimplemented on the NumPy discussion mailing list could become quite large. (Besides, SciPy is open source, so you can always copy paste the function from it if you only need one trivial bit of it. Not that I recommend doing that, but it's still better than reimplementing it yourself. Unless you're doing it for the learning experience of course.) Dag Sverre ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Coverting ranks to a Gaussian
On Mon, Jun 9, 2008 at 18:34, Keith Goodman [EMAIL PROTECTED] wrote: Does anyone have a function that converts ranks into a Gaussian? I have an array x: import numpy as np x = np.random.rand(5) I rank it: x = x.argsort().argsort() x_ranked = x.argsort().argsort() x_ranked array([3, 1, 4, 2, 0]) There are subtleties in computing ranks when ties are involved. Take a look at the implementation of scipy.stats.rankdata(). I would like to convert the ranks to a Gaussian without using scipy. No dice. You are going to have to use scipy.special.ndtri somewhere. A basic transformation (off the top of my head, I have no idea if this is statistically meaningful): scipy.special.ndtri((ranks + 1.0) / (len(ranks) + 1.0)) Barring tied first or last items, this should give equal weight to each of the tails outside of the range of your data. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Coverting ranks to a Gaussian
On Monday 09 June 2008 22:06:24 Keith Goodman wrote: On Mon, Jun 9, 2008 at 4:45 PM, Robert Kern [EMAIL PROTECTED] wrote: There are subtleties in computing ranks when ties are involved. Take a look at the implementation of scipy.stats.rankdata(). Good point. I had to deal with ties and missing data. I bet scipy.stats.rankdata() is faster than my implementation. There's a scipy.stats.mstats.rankdata() that take care of both ties and missing data. Missing data are allocated a rank of either 0 or the average rank, depending on some parameter. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Coverting ranks to a Gaussian
On Mon, Jun 9, 2008 at 7:02 PM, Pierre GM [EMAIL PROTECTED] wrote: On Monday 09 June 2008 22:06:24 Keith Goodman wrote: On Mon, Jun 9, 2008 at 4:45 PM, Robert Kern [EMAIL PROTECTED] wrote: There are subtleties in computing ranks when ties are involved. Take a look at the implementation of scipy.stats.rankdata(). Good point. I had to deal with ties and missing data. I bet scipy.stats.rankdata() is faster than my implementation. There's a scipy.stats.mstats.rankdata() that take care of both ties and missing data. Missing data are allocated a rank of either 0 or the average rank, depending on some parameter. That sounds interesting. But I can't find it: import scipy from scipy import stats scipy.stats.m scipy.stats.mannwhitneyu scipy.stats.mean scipy.stats.mielke scipy.stats.momentscipy.stats.morestats scipy.stats.maxwell scipy.stats.medianscipy.stats.mode scipy.stats.mood scipy.stats.mvn scipy.stats.morestats.r scipy.stats.morestats.r_ scipy.stats.morestats.ravel In my implementation I leave the missing values as missing. I think that would be a nice option for rankdata. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Coverting ranks to a Gaussian
On Mon, Jun 9, 2008 at 21:06, Keith Goodman [EMAIL PROTECTED] wrote: On Mon, Jun 9, 2008 at 4:45 PM, Robert Kern [EMAIL PROTECTED] wrote: On Mon, Jun 9, 2008 at 18:34, Keith Goodman [EMAIL PROTECTED] wrote: Does anyone have a function that converts ranks into a Gaussian? I have an array x: import numpy as np x = np.random.rand(5) I rank it: x = x.argsort().argsort() x_ranked = x.argsort().argsort() x_ranked array([3, 1, 4, 2, 0]) There are subtleties in computing ranks when ties are involved. Take a look at the implementation of scipy.stats.rankdata(). Good point. I had to deal with ties and missing data. I bet scipy.stats.rankdata() is faster than my implementation. Actually, it's pretty slow. I think there are opportunities to make it faster, but I haven't explored them. I would like to convert the ranks to a Gaussian without using scipy. No dice. You are going to have to use scipy.special.ndtri somewhere. A basic transformation (off the top of my head, I have no idea if this is statistically meaningful): scipy.special.ndtri((ranks + 1.0) / (len(ranks) + 1.0)) Barring tied first or last items, this should give equal weight to each of the tails outside of the range of your data. Nice. Thank you. It passes the never wrong chi-by-eye test: r = np.arange(1000) g = special.ndtri((r + 1.0) / (len(r) + 1.0)) pylab.hist(g, 50) pylab.show() BTW, what is your use case? If you are trying to compare your data to the normal distribution, you might want to go the other way: find the empirical quantiles of your data and compare them against the hypothetical quantiles of the data on the distribution. This *is* an established statistical technique; when you graph one versus the other, you get what is known as a Q-Q plot. I wasn't able to use scipy.special.ndtri (after import scipy) like you did. Actually, I didn't specify any import. The behavior you see is correct. Importing scipy alone doesn't import any of the subpackages. If you like, you can pretend there was an implicit import scipy.special before the code I gave. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Coverting ranks to a Gaussian
On Monday 09 June 2008 22:30:09 Keith Goodman wrote: On Mon, Jun 9, 2008 at 7:02 PM, Pierre GM [EMAIL PROTECTED] wrote: There's a scipy.stats.mstats.rankdata() that take care of both ties and missing data. Missing data are allocated a rank of either 0 or the average rank, depending on some parameter. That sounds interesting. But I can't find it: import scipy from scipy import stats Yes, you should do import scipy.stats.mstats as mstats mstats.rankdata In my implementation I leave the missing values as missing. I think that would be a nice option for rankdata. Handling missing data is why I needed a tailored rankdata. In mstats.rankdata, if you set the use_missing optional parameter to False (the default), they will have a rank of 0. As no other value will have a rank of zero, you can then remask with masked_values or masked_equal. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Coverting ranks to a Gaussian
On Mon, Jun 9, 2008 at 7:35 PM, Robert Kern [EMAIL PROTECTED] wrote: On Mon, Jun 9, 2008 at 21:06, Keith Goodman [EMAIL PROTECTED] wrote: On Mon, Jun 9, 2008 at 4:45 PM, Robert Kern [EMAIL PROTECTED] wrote: On Mon, Jun 9, 2008 at 18:34, Keith Goodman [EMAIL PROTECTED] wrote: Does anyone have a function that converts ranks into a Gaussian? I have an array x: import numpy as np x = np.random.rand(5) I rank it: x = x.argsort().argsort() x_ranked = x.argsort().argsort() x_ranked array([3, 1, 4, 2, 0]) There are subtleties in computing ranks when ties are involved. Take a look at the implementation of scipy.stats.rankdata(). Good point. I had to deal with ties and missing data. I bet scipy.stats.rankdata() is faster than my implementation. Actually, it's pretty slow. I think there are opportunities to make it faster, but I haven't explored them. I would like to convert the ranks to a Gaussian without using scipy. No dice. You are going to have to use scipy.special.ndtri somewhere. A basic transformation (off the top of my head, I have no idea if this is statistically meaningful): scipy.special.ndtri((ranks + 1.0) / (len(ranks) + 1.0)) Barring tied first or last items, this should give equal weight to each of the tails outside of the range of your data. Nice. Thank you. It passes the never wrong chi-by-eye test: r = np.arange(1000) g = special.ndtri((r + 1.0) / (len(r) + 1.0)) pylab.hist(g, 50) pylab.show() BTW, what is your use case? If you are trying to compare your data to the normal distribution, you might want to go the other way: find the empirical quantiles of your data and compare them against the hypothetical quantiles of the data on the distribution. This *is* an established statistical technique; when you graph one versus the other, you get what is known as a Q-Q plot. Some of the techniques I use are sensitive to noise. So to make things more robust I sometimes transform my noisy, sort-of-normal data. There's no theory to guide me, only empirical results. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion