Dear List,

I'm trying to speed up a piece of code that selects a subsample based on some 
criteria.
Setup:
I have two samples, raw and cut. cut is a subset of raw: all elements in cut 
are also in raw, and cut is derived from raw by applying some cuts.
Now I would like to select a random subsample of raw and find out how many are 
also in cut. In other words, some of those random events pass the cuts, others 
don't.
So in principle I have:

randomSample = np.random.random_integers(0, len(raw)-1, size=sampleSize)
random_that_pass1 = [r for r in raw[randomSample] if r in cut]

This is fine (I hope), but slow.
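
A minimal sketch of a vectorized version of the same selection, assuming the 
events can be compared as plain values (for example unique integer event IDs). 
The synthetic raw and cut arrays, the divisible-by-3 cut, and the name 
random_that_pass_fast are illustrative assumptions, not the real data. 
np.isin is called np.in1d in older NumPy releases.

```python
import numpy as np

# Synthetic stand-ins for the real samples (assumption: unique integer event IDs).
rng = np.random.default_rng(0)
raw = rng.permutation(1000)          # 1000 unique event IDs in arbitrary order
cut = raw[raw % 3 == 0]              # hypothetical cut: keep IDs divisible by 3
sampleSize = 200

# Indices drawn with replacement, as in the snippet above.
randomSample = rng.integers(0, len(raw), size=sampleSize)
drawn = raw[randomSample]

# Slow baseline: one Python-level membership test per drawn element.
random_that_pass1 = [r for r in drawn if r in cut]

# Vectorized equivalent: one boolean per drawn element, duplicates preserved.
random_that_pass_fast = drawn[np.isin(drawn, cut)]
```

Both selections keep the drawn order and any repeated draws, so they can be 
compared element by element.
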
I have seen searchsorted mentioned as a possible way to speed this up.
Now it gets complicated. I'm creating a boolean array that contains True 
wherever a raw event is in cut.

raw_sorted = np.sort(raw)
cut_sorted = np.sort(cut)
passed = np.searchsorted(raw_sorted, cut_sorted)
raw_bool = np.zeros(len(raw), dtype='bool')
raw_bool[passed] = True
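
One thing worth checking here, offered as a sketch rather than a diagnosis of 
the real data: the indices returned by searchsorted are positions in 
raw_sorted, not in raw. The tiny arrays and the names order, 
mask_sorted_order, selected_wrong and selected_right are made up for 
illustration, and the values in raw are assumed to be unique.

```python
import numpy as np

raw = np.array([30, 10, 20])         # unsorted; values assumed unique
cut = np.array([30])

order = np.argsort(raw)              # order[i] = position in raw of the i-th smallest value
raw_sorted = raw[order]              # same values as np.sort(raw)
passed = np.searchsorted(raw_sorted, np.sort(cut))   # positions in raw_sorted

# A mask filled directly from `passed` lines up with raw_sorted, not with raw.
mask_sorted_order = np.zeros(len(raw), dtype=bool)
mask_sorted_order[passed] = True
selected_wrong = raw[mask_sorted_order]     # picks raw[2], which is 20

# Translating the positions through `order` gives a mask aligned with raw.
raw_bool = np.zeros(len(raw), dtype=bool)
raw_bool[order[passed]] = True
selected_right = raw[raw_bool]              # picks raw[0], which is 30
```

If raw is already sorted the two masks coincide, so the mismatch only shows 
up for unsorted input.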

Now I create a second boolean array that is set to True at the random values. 
The events I care about are the ones that pass the cuts and are selected by the 
random selection:

sample_bool = np.zeros(len(raw), dtype='bool')
sample_bool[randomSample] = True
random_that_pass2 = raw[np.logical_and(raw_bool, sample_bool)]

Here is the problem:
random_that_pass2 and random_that_pass1 have different lengths!
Sometimes one is longer, sometimes the other. I am completely at a loss to 
explain this.
I tend to trust the slow selection that produces random_that_pass1, because 
it's only two lines, but I don't understand where the other selection could fail.
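
A small sketch of one way the two lengths can drift apart, assuming the random 
indices are drawn with replacement (as random_integers does): indexing by the 
index array repeats an event that was drawn twice, while a boolean mask can 
mark it only once. The arrays below and the names by_index, by_mask, keep and 
by_lookup are hypothetical.

```python
import numpy as np

raw = np.array([30, 10, 20])
raw_bool = np.array([True, False, True])    # hypothetical "is in cut" mask aligned with raw
randomSample = np.array([0, 0, 2])          # index 0 drawn twice

# Indexing with the index array repeats the duplicated draw.
by_index = raw[randomSample]                # 3 elements

# A boolean mask can only mark index 0 once.
sample_bool = np.zeros(len(raw), dtype=bool)
sample_bool[randomSample] = True
by_mask = raw[np.logical_and(raw_bool, sample_bool)]   # 2 elements

# Looking the mask up at the drawn indices keeps the duplicates.
keep = raw_bool[randomSample]
by_lookup = raw[randomSample[keep]]         # 3 elements
```

Drawing without replacement, for example with 
np.random.default_rng().choice(len(raw), size=sampleSize, replace=False) in 
newer NumPy versions, would avoid repeated indices altogether.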

Unfortunately, the samples that give me trouble are 2.2 MB, so maybe a bit 
large to mail around, but I can put them somewhere if needed.
Thank you for your help,
Cheers,
    Jan

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
