On Sun, 21 Feb 2010 03:22:19 am Andrew Fithian wrote: > Hi tutor, > > I'm have a statistical bootstrapping script that is bottlenecking on > a python function sample_with_replacement(). I wrote this function > myself because I couldn't find a similar function in python's random > library.
random.choice(list) does sample with replacement. If you need more than one sample, call it in a loop or a list comprehension: >>> mylist = [1, 2, 4, 8, 16, 32, 64, 128] >>> choice = random.choice >>> values = [choice(mylist) for _ in xrange(4)] >>> values [64, 32, 4, 4] > This is the fastest version of the function I could come up > with (I used cProfile.run() to time every version I wrote) but it's > not fast enough, can you help me speed it up even more? Profiling doesn't tell you how *fast* something is, but how much time it is using. If something takes a long time to run, perhaps that's because you are calling it lots of times. If that's the case, you should try to find a way to call it less often. E.g. instead of taking a million samples, can you get by with only a thousand? Do you need to prepare all the samples ahead of time, or do them only when needed? > import random > def sample_with_replacement(list): > l = len(list) # the sample needs to be as long as list > r = xrange(l) > _random = random.random > return [list[int(_random()*l)] for i in r] # using > list[int(_random()*l)] is faster than random.choice(list) Well, maybe... it's a near thing. I won't bother showing my timing code (hint: use the timeit module) unless you ask, but in my tests, I don't get a clear consistent winner between random.choice and your version. Your version is a little bit faster about 2 out of 3 trials, but slower 1 out of 3. This isn't surprising if you look at the random.choice function: def choice(self, seq): return seq[int(self.random() * len(seq))] It is essentially identical to your version, the major difference being you use the same algorithm directly in a list comprehension instead of calling it as a function. So I would expect any difference to be small. I would say that you aren't going to speed up the sampling algorithm in pure Python. This leaves you with some options: (1) Live with the speed as it is. Are you sure it's too slow? (2) Re-write the random number generator in C. (3) Find a way to use fewer samples. (4) Use a lower-quality random number generator which is faster. > FWIW, my bootstrapping script is spending roughly half of the run > time in sample_with_replacement() much more than any other function > or method. Thanks in advance for any advice you can give me. You don't tell us whether that actually matters. Is it taking 3 hours in a 6 hour run time? If so, then it is worth spending time and effort to cut that down by a few hours. Or is it taking 0.2 seconds out of 0.4 seconds? If that's the case, then who cares? :) -- Steven D'Aprano _______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor