Re: [R] Selecting subsamples

Richard A. O'Keefe Thu, 04 Dec 2003 20:13:17 -0800

[EMAIL PROTECTED] wrote
    [that he has a data set with 9 variables (columns) measured on 2000
     individuals (rows) and wants a sample] in which the sum of the
    volume of the individuals in that sample >= 100 cubic m.


Let's suppose that this information is held in d, a data frame, and that
the volume column is d$vol.

If sum(d$vol) < 100, there is no sample which satisfies your condition.
If sum(d$vol) >= 100, then d is such a sample as it stands.

If you want the smallest number of rows, then

    indices <- order(d$vol, decreasing=TRUE)

gives you the row indices sorted by decreasing volume;

    d$vol[indices]      => the volumes in decreasing order
    cumsum(")           => the cumulative sum
    sum(" < 100.0)      => 1 less than the number of rows you want

so

    indices <- order(d$vol, decreasing=TRUE)
    d[indices[1:(sum(cumsum(d$vol[indices]) < 100.0) + 1)], ]

should be the answer you want.

This is O(n lg n), dominated by the sort, where n is the number of rows;
in your case n is 2000.
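
For anyone who wants to try the greedy selection end to end, here is a
minimal sketch. The helper name smallest_sample, its target argument, and
the made-up data frame are assumptions for illustration, not part of the
original question; it also assumes the volumes are non-negative with no
missing values.

```r
# Hypothetical helper: returns the fewest rows whose volumes sum to >= target,
# or NULL when even the whole data frame falls short.
smallest_sample <- function(d, target = 100) {
    if (sum(d$vol) < target) return(NULL)          # no sample can satisfy the condition
    indices <- order(d$vol, decreasing = TRUE)     # largest volumes first
    k <- sum(cumsum(d$vol[indices]) < target) + 1  # first prefix reaching the target
    d[indices[seq_len(k)], , drop = FALSE]         # note the comma: select rows, not columns
}

# Made-up data standing in for the real 2000-row data set.
set.seed(1)
d <- data.frame(id = 1:2000, vol = runif(2000, 0, 1))

s <- smallest_sample(d)
nrow(s)       # number of rows selected
sum(s$vol)    # total volume of the selection
```

The early return covers the case where sum(d$vol) < 100, which would
otherwise index one position past the end of indices.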

If you don't need the smallest sample, but just any old haphazard answer,

    indices <- sample(nrow(d))
    d[indices[1:(sum(cumsum(d$vol[indices]) < 100.0) + 1)], ]

should be useful.
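
A small runnable sketch of the haphazard version, again on made-up data
(the data frame and the variable names k and haphazard are assumptions for
illustration). Rows are taken in random order until the running total of
volume reaches 100; this assumes sum(d$vol) >= 100 and non-negative volumes.

```r
# Made-up data standing in for the real 2000-row data set.
set.seed(42)
d <- data.frame(id = 1:2000, vol = runif(2000, 0, 1))

indices <- sample(nrow(d))                  # a random permutation of the row numbers
k <- sum(cumsum(d$vol[indices]) < 100) + 1  # rows needed to reach 100 in that order
haphazard <- d[indices[seq_len(k)], ]       # trailing comma selects rows

nrow(haphazard)       # varies with the seed
sum(haphazard$vol)    # at least 100
```

Because the rows come in random order rather than largest first, this
will usually pick more rows than the sorted version.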

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
