Dear List,

I'm trying to speed up a piece of code that selects a subsample based on some criteria.

Setup: I have two samples, raw and cut. cut is a pure subset of raw: all elements in cut are also in raw, and cut is derived from raw by applying some cuts. Now I would like to select a random subsample of raw and find out how many of its elements are also in cut. In other words, some of those random events pass the cuts, others don't. So in principle I have:

    randomSample = np.random.random_integers(0, len(raw)-1, size=sampleSize)
    random_that_pass1 = [r for r in raw[randomSample] if r in cut]

This is fine (I hope), but slow. I have seen searchsorted mentioned as a possible way to speed this up.

Now it gets complicated. I'm creating a boolean array that contains True wherever a raw event is in cut:

    raw_sorted = np.sort(raw)
    cut_sorted = np.sort(cut)
    passed = np.searchsorted(raw_sorted, cut_sorted)
    raw_bool = np.zeros(len(raw), dtype='bool')
    raw_bool[passed] = True

Now I create a second boolean array that is set to True at the random values. The events I care about are the ones that pass the cuts and are selected by the random selection:

    sample_bool = np.zeros(len(raw), dtype='bool')
    sample_bool[randomSample] = True
    random_that_pass2 = raw[np.logical_and(raw_bool, sample_bool)]

The problem comes in now: random_that_pass2 and random_that_pass1 have different lengths! Sometimes one is longer, sometimes the other. I am completely at a loss to explain this. I tend to believe the slow selection leading to random_that_pass1, because it's only two lines, but I don't understand where the other selection could fail.

Unfortunately, the samples that give me trouble are 2.2 MB, so maybe a bit large to mail around, but I can put them somewhere if needed.

Thank you for your help,

Cheers,
Jan

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
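
P.S. Since the real samples are too large to mail, here is a minimal self-contained sketch of the comparison on synthetic data. Everything in it is an assumption for illustration: the arrays are hypothetical stand-ins with unique integer event IDs (the real data may not be unique), the names are made up, and np.isin is swapped in as a vectorized stand-in for the slow list comprehension rather than for the searchsorted approach.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real samples: unique event IDs in random order.
raw = rng.permutation(1000)
cut = raw[raw % 3 == 0]  # a subset of raw, derived by a made-up "cut"

# Random indices into raw, drawn WITH replacement (as in the original code).
sample_size = 200
random_sample = rng.integers(0, len(raw), size=sample_size)

# Slow reference selection, equivalent to the two-line version above.
drawn = raw[random_sample]
slow = np.array([r for r in drawn if r in cut])

# Vectorized membership test on the drawn events; keeps repeated draws.
fast = drawn[np.isin(drawn, cut)]

# Boolean-mask selection in raw's original order; each event appears at most once.
sample_mask = np.zeros(len(raw), dtype=bool)
sample_mask[random_sample] = True
masked = raw[np.isin(raw, cut) & sample_mask]

print(len(slow), len(fast), len(masked))
```

In this sketch the loop and the np.isin selection agree element for element, while the mask-based selection returns each event only once, so it can come out shorter whenever an index is drawn more than once. Whether that accounts for the mismatch in my real data I cannot say.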