Re: Hypergeometric distribution
Bengt Richter wrote: On 4 Jan 2006 12:46:47 -0800, Raven [EMAIL PROTECTED] wrote: The problem with Stirling's approximation is that I need to calculate the hypergeometric hence the factorial for numbers within a large range e.g. choose(14000,170) or choose(5,2) It seems you are hinting at some accuracy requirements that you haven't yet explained. I'm curious how you use the values, and how that affects your judgement of Stirling's approximation. In fact, perhaps the semantics of your value usage could even suggest an alternate algorithmic approach to your actual end result. Does it matter? Implementing Stirling's approximation is pointless when scipy.special.gammaln() or scipy.special.gamma() does it for him. -- Robert Kern [EMAIL PROTECTED] In the fields of hell where the grass grows high Are the graves of dreams allowed to die. -- Richard Harter -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
On Thu, 05 Jan 2006 09:47:02 -0600, Robert Kern [EMAIL PROTECTED] wrote: Bengt Richter wrote: On 4 Jan 2006 12:46:47 -0800, Raven [EMAIL PROTECTED] wrote: The problem with Stirling's approximation is that I need to calculate the hypergeometric hence the factorial for numbers within a large range e.g. choose(14000,170) or choose(5,2) It seems you are hinting at some accuracy requirements that you haven't yet explained. I'm curious how you use the values, and how that affects your judgement of Stirling's approximation. In fact, perhaps the semantics of your value usage could even suggest an alternate algorithmic approach to your actual end result. Does it matter? Implementing Stirling's approximation is pointless when scipy.special.gammaln() or scipy.special.gamma() does it for him. Who's talking about implementing Stirling's approximation? ;-) I'm trying to determine first why the OP is thinking there's a problem with using it at all. With alternate algorithmic approach I didn't mean an alternate way of calculating Stirling's approximation. I meant to allude to the possibility that pulling a little further on the requirements thread might even unravel some of the rationale for calculating the hypergeometric per se, depending on how he's actually using it and why. Same old, same old: requirements, requirements ;-) Regards, Bengt Richter -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Cameron Laird wrote: This thread confuses me. I've lost track of the real goal. If it's an exact calculation of binomial coefficients--or even one of several other potential targets mentioned--I echo Steven D'Aprano, and ask, are you *sure* the suggestions already offered aren't adequate? Hi Cameron, my real goal was to calculate the hypergeometric distribution. The problem was that the function for hypergeometric calculation from scipy uses the scipy.comb function which by default uses floats so for large numbers comb(n,r) returns inf. and hence the hypergeometric returns nan. The first suggestion, the one by Robert Kern, resolved my problem: Raven wrote: Thanks to all of you guys, I could resolve my problem using the logarithms as proposed by Robert. Then the other guys gave alternative solutions so I tried them out. So form me the suggestions offered are more than adequate :-) Cameron Laird wrote: Also, I think you might not realize how accurate Stirling's approximation (perhaps to second order) is in the range of interest. The problem with Stirling's approximation is that I need to calculate the hypergeometric hence the factorial for numbers within a large range e.g. choose(14000,170) or choose(5,2) Ale -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Raven [EMAIL PROTECTED] writes: The problem with Stirling's approximation is that I need to calculate the hypergeometric hence the factorial for numbers within a large range e.g. choose(14000,170) or choose(5,2) Stirling's approximation to second order is fairly accurate even at low values: from math import log,exp,pi def stirling(n): # approx log(n!) return n*(log(n)-1) + .5*(log(2.*pi*n)) + 1/(12.*n) for n in range(1,6): print n, exp(stirling(n)) ... 1 1.00227444918 2 2.00065204769 3 6.00059914247 4 24.0010238913 5 120.002637086 To third order it's even better: from math import log,exp,pi def stirling(n): # approx log(n!) return n*(log(n)-1) + .5*(log(2.*pi*n)) + 1/(12.*n) - 1/(360.*n*n*n) for n in range(1,6): print n, exp(stirling(n)) ... 1 0.999494216712 2 1.5749743 3 5.8182863 4 23.822028 5 119.70391 Reference: http://en.wikipedia.org/wiki/Stirling's_approximation -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
On 4 Jan 2006 12:46:47 -0800, Raven [EMAIL PROTECTED] wrote: Cameron Laird wrote: This thread confuses me. I've lost track of the real goal. If it's an exact calculation of binomial coefficients--or even one of several other potential targets mentioned--I echo Steven D'Aprano, and ask, are you *sure* the suggestions already offered aren't adequate? Hi Cameron, my real goal was to calculate the hypergeometric distribution. The problem was that the function for hypergeometric ISTM that can't have been your real goal -- unless you are e.g. preparing numeric tables for publication. IOW, IWT you probably intend to USE the hypergeometric distribution values in some useful way to go towards your real goal. ;-) The requirements of this USE are still not apparent to me in your posts, though that may be because I've missed something. calculation from scipy uses the scipy.comb function which by default uses floats so for large numbers comb(n,r) returns inf. and hence the hypergeometric returns nan. The first suggestion, the one by Robert Kern, resolved my problem: Raven wrote: Thanks to all of you guys, I could resolve my problem using the logarithms as proposed by Robert. Then the other guys gave alternative solutions so I tried them out. So form me the suggestions offered are more than adequate :-) Cameron Laird wrote: Also, I think you might not realize how accurate Stirling's approximation (perhaps to second order) is in the range of interest. The problem with Stirling's approximation is that I need to calculate the hypergeometric hence the factorial for numbers within a large range e.g. choose(14000,170) or choose(5,2) It seems you are hinting at some accuracy requirements that you haven't yet explained. I'm curious how you use the values, and how that affects your judgement of Stirling's approximation. In fact, perhaps the semantics of your value usage could even suggest an alternate algorithmic approach to your actual end result. Regards, Bengt Richter -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Travis E. Oliphant wrote: Notice the keyword for the comb function (in scipy) lets you use it to compute exact values. SciPy does not just automatically use the long integer because this will always slow you down. comb(N, k, exact=0) Combinations of N things taken k at a time. If exact==0, then floating point precision is used, otherwise exact long integer is computed. Notes: - Array arguments accepted only for exact=0 case. - If k N, N 0, or k 0, then a 0 is returned. -Travis Oliphant Great, thanks Travis. Ale -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
In article [EMAIL PROTECTED], Raven [EMAIL PROTECTED] wrote: Well, what to say? I am very happy for all the solutions you guys have posted :-) For Paul: I would prefer not to use Stirling's approximation The problem with long integers is that to calculate the hypergeometric I need to do float division and multiplication because integer division returns 0. A solution could be to calculate log(Long_Factorial_Integer) . . . This thread confuses me. I've lost track of the real goal. If it's an exact calculation of binomial coefficients--or even one of several other potential targets mentioned--I echo Steven D'Aprano, and ask, are you *sure* the suggestions already offered aren't adequate? Also, I think you might not realize how accurate Stirling's approximation (perhaps to second order) is in the range of interest. -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Well, what to say? I am very happy for all the solutions you guys have posted :-) For Paul: I would prefer not to use Stirling's approximation The problem with long integers is that to calculate the hypergeometric I need to do float division and multiplication because integer division returns 0. A solution could be to calculate log(Long_Factorial_Integer) and finally calculate the hypergeometric with the logarithmic values. I've done a test: iterated 1000 times two different functions for the hypergeometric, the first one based on scipy.special.gammaln: from scipy.special import gammaln def lnchoose(n, m): nf = gammaln(n + 1) mf = gammaln(m + 1) nmmnf = gammaln(n - m + 1) return nf - (mf + nmmnf) def hypergeometric_gamma(k, n1, n2, t): if t n1 + n2: t = n1 + n2 if k n1 or k t: return 0 elif t n2 and ((k + n2) t): return 0 else: c1 = lnchoose(n1,k) c2 = lnchoose(n2, t - k) c3 = lnchoose(n1 + n2 ,t) return exp(c1 + c2 - c3) and the second one based on the code by Steven and Scott: import time from math import log, exp def bincoeff1(n, r): if r n - r: r = n - r x = 1 for i in range(n, r, -1): x *= i for i in range(n - r, 1, -1): x /= i return x def hypergeometric(k, n1, n2, t): if t n1 + n2: t = n1 + n2 if k n1 or k t: return 0 elif t n2 and ((k + n2) t): return 0 else: c1 = log(raw_bincoeff1(n1,k)) c2 = log(raw_bincoeff1(n2, t - k)) c3 = log(raw_bincoeff1(n1 + n2 ,t)) return exp(c1 + c2 - c3) def main(): t = time.time() for i in range(1000): r = hypergeometric(6,6,30,6) print time.time() - t t = time.time() for i in range(1000): r = hypergeometric_gamma(6,6,30,6) print time.time() - t if __name__ == __main__: main() and the result is: 0.0386447906494 0.192448139191 The first approach is faster so I think I will adopt it. -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Raven wrote: ... def main(): t = time.time() for i in range(1000): r = hypergeometric(6,6,30,6) print time.time() - t t = time.time() for i in range(1000): r = hypergeometric_gamma(6,6,30,6) print time.time() - t and the result is: 0.0386447906494 0.192448139191 The first approach is faster so I think I will adopt it. You should really look into the timeit module -- you'll get nice solid timings slightly easier to tweak. Imagine something like: import timeit ... t0 = timeit.Timer(stmt='f(6, 6, 30, 6)', setup='from __main__ import hypergeometric as f') t1 = timeit.Timer(stmt='f(6, 6, 30, 6)', setup='from __main__ import hypergeometric_gamma as f') repetitions = 1 # Gross under-estimate of needed repetitions while t0.timeit(repetitions) .25: # .25 = minimum Secs per round repetitions *= 10 print 'Going for %s repetitions' % repetitions print 'hypergeometric:', t0.repeat(3, repetitions) print 'hypergeometric_gamma:', t1.repeat(3, repetitions) --Scott David Daniels [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
On 2 Jan 2006 03:35:33 -0800, Raven [EMAIL PROTECTED] wrote: [...] The problem with long integers is that to calculate the hypergeometric I need to do float division and multiplication because integer division returns 0. A solution could be to calculate log(Long_Factorial_Integer) ISTM you wouldn't get zero if you scaled by 10**significant_digits (however many you require) before dividing. E.g., expected hits per trillion (or septillion or whatever) expresses probability too. Perhaps that could work in your calculation? Regards, Bengt Richter -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Scott David Daniels ha scritto: You should really look into the timeit module -- you'll get nice solid timings slightly easier to tweak. This seems a very interesting module, I will give it a try as soon as possible. Thanks Scott. Ale -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Bengt Richter wrote: ISTM you wouldn't get zero if you scaled by 10**significant_digits (however many you require) before dividing. E.g., expected hits per trillion (or septillion or whatever) expresses probability too. Perhaps that could work in your calculation? Regards, Bengt Richter Sorry Bengt but I can't figure out how to do it, can you give me an example please? Thanks in advance. Ale -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Bengt Richter wrote: ISTM you wouldn't get zero if you scaled by 10**significant_digits (however many you require) before dividing. E.g., expected hits per trillion (or septillion or whatever) expresses probability too. Perhaps that could work in your calculation? Regards, Bengt Richter Sorry but I can't figure out how to do it, can you give me an example please? Thnx Ale -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Raven wrote: Thanks Steven for your very interesting post. This was a critical instance from my problem: from scipy import comb comb(14354,174) inf The scipy.stats.distributions.hypergeom function uses the scipy.comb function, so it returned nan since it tries to divide an infinite. I did not tried to write a self-made function using standard python as I supposed that the scipy functions reached python's limits but I was wrong, what a fool :-) Notice the keyword for the comb function (in scipy) lets you use it to compute exact values. SciPy does not just automatically use the long integer because this will always slow you down. comb(N, k, exact=0) Combinations of N things taken k at a time. If exact==0, then floating point precision is used, otherwise exact long integer is computed. Notes: - Array arguments accepted only for exact=0 case. - If k N, N 0, or k 0, then a 0 is returned. -Travis Oliphant -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
On Sat, 31 Dec 2005 16:24:02 -0800, Raven wrote: Thanks to all of you guys, I could resolve my problem using the logarithms as proposed by Robert. I needed to calculate the factorial for genomic data, more specifically for the number of genes in the human genome i.e. about 30.000 and that is a big number :-) I didn't know gmpy Thanks a lot, really Are you *sure* the existing functions didn't work? Did you try them? def log2(x): ... return math.log(x)/math.log(2) ... n = 0.0 for i in range(1, 30): # ten times bigger than you need ... n += log2(i) ... n 5025564.6087276665 t = time.time(); x = 2L**(int(n) + 1); time.time() - t 0.26649093627929688 That's about one quarter of a second to calculate 300,000 factorial (approximately), and it shows that the calculations are well within Python's capabilities. Of course, converting this 1.5 million-plus digit number to a string takes a bit longer: t = time.time(); len(str(x)); time.time() - t 1512846 6939.3762848377228 A quarter of a second to calculate, and almost two hours to convert to a string. Lesson one: calculations on longints are fast. Converting them to strings is not. As far as your original question goes, try something like this: (formula from memory, may be wrong) def bincoeff(n,r): ... x = 1 ... for i in range(r+1, n+1): ... x *= i ... for i in range(1, n-r+1): ... x /= i ... return x ... bincoeff(10, 0) 1 bincoeff(10, 1) 10 bincoeff(10, 2) 45 bincoeff(10, 3) 120 import time t = time.time(); L = bincoeff(3, 7000); time.time() - t 28.317800045013428 Less than thirty seconds to calculate a rather large binomial coefficient exactly. How many digits? len(str(L)) 7076 If you are calculating hundreds of hypergeometric probabilities, 30 seconds each could be quite painful, but it certainly shows that Python is capable of doing it without resorting to logarithms which may lose some significant digits. Although, in fairness, the log function doesn't seem to lose much accuracy for arguments in the range you are dealing with. How long does it take to calculate factorials? def timefact(n): ... # calculate n! and return the time taken in seconds ... t = time.time() ... L = 1 ... for i in range(1, n+1): ... L *= i ... return time.time() - t ... timefact(3000) 0.054913997650146484 timefact(3) # equivalent to human genome 5.069951057434082 timefact(30) # ten times bigger 4255.2370519638062 Keep in mind, if you are calculating the hypergeometric probabilities using raw factorials, you are doing way too much work. -- Steven. -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Thanks Steven for your very interesting post. This was a critical instance from my problem: from scipy import comb comb(14354,174) inf The scipy.stats.distributions.hypergeom function uses the scipy.comb function, so it returned nan since it tries to divide an infinite. I did not tried to write a self-made function using standard python as I supposed that the scipy functions reached python's limits but I was wrong, what a fool :-) If you are calculating hundreds of hypergeometric probabilities, 30 seconds each could be quite painful, but it certainly shows that Python is capable of doing it without resorting to logarithms which may lose some significant digits. Although, in fairness, the log function doesn't seem to lose much accuracy for arguments in the range you are dealing with. Yes I am calculating hundreds of hypergeometric probabilities so I need fast calculations Ale -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Raven [EMAIL PROTECTED] writes: Yes I am calculating hundreds of hypergeometric probabilities so I need fast calculations Can you use Stirling's approximation to get the logs of the factorials? -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
On Sun, 01 Jan 2006 14:24:39 -0800, Raven wrote: Thanks Steven for your very interesting post. This was a critical instance from my problem: from scipy import comb comb(14354,174) inf Curious. It wouldn't surprise me if scipy was using floats, because 'inf' is usually a floating point value, not an integer. Using my test code from yesterday, I got: bincoeff(14354,174) 11172777193562324917353367958024437473336018053487854593870 07090637489405604489192488346144684402362344409632515556732 33563523161308145825208276395238764441857829454464446478336 90173777095041891067637551783324071233625370619908633625448 31076677382448616246125346667737896891548166898009878730510 57476139515840542769956414204130692733629723305869285300247 645972456505830620188961902165086857407612722931651840L Took about three seconds on my system. Yes I am calculating hundreds of hypergeometric probabilities so I need fast calculations Another possibility, if you want exact integer maths rather than floating point with logarithms, is to memoise the binomial coefficients. Something like this: # untested def bincoeff(n,r, \ cache={}): try: return cache((n,r)) except KeyError: x = 1 for i in range(r+1, n+1): x *= i for i in range(1, n-r+1): x /= i cache((n,r)) = x return x -- Steven. -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Thanks to all of you guys, I could resolve my problem using the logarithms as proposed by Robert. I needed to calculate the factorial for genomic data, more specifically for the number of genes in the human genome i.e. about 30.000 and that is a big number :-) I didn't know gmpy Thanks a lot, really Ale -- http://mail.python.org/mailman/listinfo/python-list
Hypergeometric distribution
Hi to all, I need to calculate the hpergeometric distribution: choose(r, x) * choose(b, n-x) p(x; r,b,n) = - choose(r+b, n) choose(r,x) is the binomial coefficient I use the factorial to calculate the above formula but since I am using large numbers, the result of choose(a,b) (ie: the binomial coefficient) is too big even for large int. I've tried the scipy library, but this library calculates the hypergeometric using the factorials too, so the problem subsist. Is there any other libray or an algorithm to calculate the hypergeometric distribution? The statistical package R can handle such calculations but I don't want to use python R binding since I want a standalone app. Thanks a lot Ale -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Raven wrote: Hi to all, I need to calculate the hpergeometric distribution: choose(r, x) * choose(b, n-x) p(x; r,b,n) = - choose(r+b, n) choose(r,x) is the binomial coefficient I use the factorial to calculate the above formula but since I am using large numbers, the result of choose(a,b) (ie: the binomial coefficient) is too big even for large int. I've tried the scipy library, but this library calculates the hypergeometric using the factorials too, so the problem subsist. Is there any other libray or an algorithm to calculate the hypergeometric distribution? Use logarithms. Specifically, from scipy import special def logchoose(n, k): lgn1 = special.gammaln(n+1) lgk1 = special.gammaln(k+1) lgnk1 = special.gammaln(n-k+1) return lgn1 - (lgnk1 + lgk1) def gauss_hypergeom(x, r, b, n): return exp(logchoose(r, x) + logchoose(b, n-x) - logchoose(r+b, n)) Or you could use gmpy if you need exact rational arithmetic rather than floating point. -- Robert Kern [EMAIL PROTECTED] In the fields of hell where the grass grows high Are the graves of dreams allowed to die. -- Richard Harter -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Raven wrote: Hi to all, I need to calculate the hpergeometric distribution: choose(r, x) * choose(b, n-x) p(x; r,b,n) = - choose(r+b, n) choose(r,x) is the binomial coefficient I use the factorial to calculate the above formula but since I am using large numbers, the result of choose(a,b) (ie: the binomial coefficient) is too big even for large int. I've tried the scipy library, but this library calculates the hypergeometric using the factorials too, so the problem subsist. Is there any other libray or an algorithm to calculate the hypergeometric distribution? The statistical package R can handle such calculations but I don't want to use python R binding since I want a standalone app. Thanks a lot Ale Ale I had this code lying about if it helps. I don't know if it's even correct but it's non-recursive! def Binomial( n, k ): ret = 0 if k == 0: ret = 1 elif k 0: a = range( n+1 ) a[0] = 1 for i in a[1:]: a[i] = 1 for j in range(i-1,0,-1): a[j] = a[j] + a[j-1] ret = a[k] return ret Gerard -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
On Mon, 26 Dec 2005 12:18:55 -0800, Raven wrote: Hi to all, I need to calculate the hpergeometric distribution: choose(r, x) * choose(b, n-x) p(x; r,b,n) = - choose(r+b, n) choose(r,x) is the binomial coefficient I use the factorial to calculate the above formula but since I am using large numbers, the result of choose(a,b) (ie: the binomial coefficient) is too big even for large int. Are you sure about that? Python long ints can be as big as you have enough memory for. My Python can print 10L**1 to the console with a barely detectable pause, and 10L**10 with about a ten second delay. (Most of that delay is printing it, not calculating it.) 25206 is the first integer whose factorial exceeds 10L**10, so even if you are calculating the binomial coefficient using the most naive algorithm, calculating the factorials and dividing, you should easily be able to generate it for a,b up to 20,000 unless you have a severe shortage of memory. I've tried the scipy library, but this library calculates the hypergeometric using the factorials too, so the problem subsist. What exactly is your problem? What values of hypergeometric(x; r,b,n) fail for you? -- Steven. -- http://mail.python.org/mailman/listinfo/python-list
Re: Hypergeometric distribution
Raven [EMAIL PROTECTED] wrote: Hi to all, I need to calculate the hpergeometric distribution: choose(r, x) * choose(b, n-x) p(x; r,b,n) = - choose(r+b, n) choose(r,x) is the binomial coefficient I use the factorial to calculate the above formula but since I am using large numbers, the result of choose(a,b) (ie: the binomial coefficient) is too big even for large int. I've tried the scipy library, but this library calculates the hypergeometric using the factorials too, so the problem subsist. Is there any other libray or an algorithm to calculate the hypergeometric distribution? The statistical package R can handle such calculations but I don't want to use python R binding since I want a standalone app. Have you tried with gmpy? Alex -- http://mail.python.org/mailman/listinfo/python-list