Marc:
Thanks! I wrote a roughly similar function (not as elegant as the one
you have below) to do the stratified sampling via sample() rather than
strata() from the sampling package, and, yes, it is WAY faster:
seconds instead of hours. I'll send a note to the sampling maintainer
to see if they can take a look at what's going on with that function.
The data frame does have a lot of columns, but I'm confused about why that
should matter, since strata() is supposed to return an index (so, at least
in theory, it should only be looking at the stratification columns, no?)
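For what it's worth, here is a minimal sketch of what I mean (assuming
strata() and getdata() from the sampling package):

library(sampling)

# strata() returns a small data frame of selected units -- the
# stratification column(s), ID_unit (the row index into the source data),
# Prob and Stratum -- rather than the full rows
s <- strata(patch_summary, stratanames = "UniqueID",
            size = rep(1, length(uniqueids)), method = "srswor")

# getdata() then merges the selected units back with the original rows
sampled_rows <- getdata(patch_summary, s)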
Again, thanks for looking into this -- verifying that I'm not
crazy and that the function is behaving oddly was very helpful.
--j
P.S. "Also, the construct:
uniqueids_stratasize <- c(1:length(uniqueids))*0+1
is rather strange. You are essentially creating a vector of 661 0's,
then adding 1 to each element to get 661 1's. Why not use:
rep(1, length(uniqueids)) "
--> You learn something new every day! I wasn't familiar with the rep()
function, so I built my vector of 1s the only way I knew how!
Marc Schwartz wrote:
On Jul 11, 2009, at 8:42 PM, Jonathan Greenberg wrote:
I'm a bit confused about why the following command is taking an
extraordinarily long time (> 2-3 hours) to run on a 3.06 GHz iMac
(brand new). I'm trying to do a stratified random sample, drawing
only a single value per UniqueID from the patch_summary data frame:
uniqueids <- unique(patch_summary$UniqueID)
uniqueids_stratasize <- c(1:length(uniqueids))*0+1
temp_species_patch_random_strata <-
    strata(patch_summary, c("UniqueID"), size = uniqueids_stratasize)
The patch_summary data frame is too big to include in this email, but
here are some summary statistics for the inputs: patch_summary has
48253 rows, UniqueID has 661 levels, and uniqueids_stratasize is a
vector of 661 1's (corresponding to the 661 levels of UniqueID --
as I said, I only want one row per UniqueID).
Am I doing something silly here? Am I using the wrong stratified
sampling algorithm? Any help would be appreciated. I don't see why a
stratified sample should take hours to draw; I'm about to recode this
as a series of subset statements, which I'm sure will take only
minutes...
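The subset-based version I have in mind would look something like this
sketch (untested, names hypothetical):

# For each unique ID, subset its rows and keep one at random
one_per_id <- lapply(uniqueids, function(id) {
  rows <- patch_summary[patch_summary$UniqueID == id, ]
  rows[sample(nrow(rows), 1), ]
})
temp_species_patch_random <- do.call(rbind, one_per_id)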
--j
I presume that this is the strata() function from the sampling
package? That was unstated, but it seems to be the only strata()
function that makes sense here.
Also, the construct:
uniqueids_stratasize <- c(1:length(uniqueids))*0+1
is rather strange. You are essentially creating a vector of 661 0's,
then adding 1 to each element to get 661 1's. Why not use:
rep(1, length(uniqueids))
?
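A quick toy check that the two constructs are equivalent:

x <- 1:5
ones_old <- c(1:length(x)) * 0 + 1   # the roundabout construct
ones_new <- rep(1, length(x))        # the idiomatic version
identical(ones_old, ones_new)        # TRUE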
I installed the sampling package and ran the strata() function first
on the examples and then on some test data. The examples ran fine and
reasonably fast.
The test data that I created (see below) has a basic structure of a
certain number of unique IDs plus a record counter within each ID.
There are 75 records per ID (an arbitrary choice, matching the sample
data below), so the record counter runs from 1 to 75 for each ID.
When I ran the strata() function on 10 unique IDs (750 records), it
took about 1 second to execute. However, when I ran the function with
50 unique IDs (3750 records), it took about 25 seconds. For 100 unique
IDs (7500 records), it took around 90 seconds.
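For reference, a minimal sketch of the kind of timing loop I used (the
exact harness is not reproduced here; this assumes strata() from the
sampling package):

library(sampling)

# n_ids unique IDs, 75 records each; draw one record per stratum
for (n_ids in c(10, 50, 100)) {
  test <- data.frame(ID = rep(seq_len(n_ids), each = 75),
                     RecNum = rep(1:75, n_ids))
  tm <- system.time(strata(test, stratanames = "ID",
                           size = rep(1, n_ids), method = "srswor"))
  cat(n_ids, "IDs:", tm["elapsed"], "seconds elapsed\n")
}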
This suggests to me that the problem you are having is that the
strata() function does not scale well at all to large datasets, and
that it may be particularly sensitive to the number of unique strata.
You might want to contact the package maintainer to ask whether the
function and/or package was intended for use with large datasets. My
guess is that it was not, or at least that it was not tested on large
datasets with an eye to optimizing the algorithms.
I am not clear how many columns your data frame has, but if we create
one with the same number of unique IDs and approximately the same
number of records, then write some specific code to get one randomly
selected record from each unique ID, it runs very quickly.
# Create a data frame with 661 unique IDs, 75 records per ID.
# Include a second column with a counter of 1:75 for each ID
DF <- data.frame(ID = rep(1:661, each = 75), RecNum = rep(1:75, 661))
# Randomize the rows, so that the unique IDs are not sorted together
DF <- DF[sample(661 * 75), ]
> str(DF)
'data.frame': 49575 obs. of 2 variables:
$ ID : int 512 587 640 602 148 250 31 487 336 145 ...
$ RecNum: int 14 23 72 49 39 56 58 17 37 8 ...
# There are 661 unique IDs
> length(unique(DF$ID))
[1] 661
# There are 75 of each
> all(table(DF$ID) == 75)
[1] TRUE
# Now, get one randomly selected record per unique ID
> system.time(DF.new <- do.call(rbind,
                                lapply(split(DF, DF$ID),
                                       function(x) x[sample(nrow(x), 1), ])))
user system elapsed
0.361 0.110 0.478
# We have 661 records now
> str(DF.new)
'data.frame': 661 obs. of 2 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ RecNum: int 54 3 73 5 6 19 51 7 18 8 ...
# We have 661 unique IDs
> length(unique(DF.new$ID))
[1] 661
# 1 record per ID
> all(table(DF.new$ID) == 1)
[1] TRUE
In this simple example, it took less than half a second to generate
the result. That is on a 2.93 GHz MacBook Pro.
So, for your data, the code would look something like this:
system.time(DF.new <- do.call(rbind,
                              lapply(split(patch_summary,
                                           patch_summary$UniqueID),
                                     function(x) x[sample(nrow(x), 1), ])))
DF.new will contain your subset of data and you will get output
similar to mine above, with the time it took for the function to
execute. See ?system.time for more information.
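You can also run the same sanity checks as above on the result (column
name taken from your post):

# One row per UniqueID, 661 rows total
length(unique(DF.new$UniqueID))    # should be 661
all(table(DF.new$UniqueID) == 1)   # should be TRUE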
Try that and see whether it works, and how fast it executes, on your
actual data.
HTH,
Marc Schwartz
--
Jonathan A. Greenberg, PhD
Postdoctoral Scholar
Center for Spatial Technologies and Remote Sensing (CSTARS)
University of California, Davis
One Shields Avenue
The Barn, Room 250N
Davis, CA 95616
Cell: 415-794-5043
AIM: jgrn307, MSN: jgrn...@hotmail.com, Gchat: jgrn307