Re: [R] Fwd: rarefy a matrix of counts

Brian Frappier Thu, 12 Oct 2006 12:06:37 -0700

Here's the script I wrote to randomly sample without replacement from a CSV
file of counts for various object classes (columns) in 76 samples (rows).
The data file "common_macro_raw.csv" was:
SiteID,Scaling_factor,Collembola,Hydrachnida,Nematomorpha,Oligochaeta,Turbellaria,Glossiphonidae,Hirudinidae,Gammaridae,Asellidae,Baetidae,Ephemerellidae,Ephemeridae,Heptageniidae,Leptophlebiidae,Siphlonuridae,Chloroperlidae,Leuctridae,Nemouridae,Peltoperlidae,Perlidae,Perlolidae,Pteronarcyidae,Brachycentridae,Glossosomatidae,Hydropsychidae,Hydroptilidae,Lepidostomatidae,Leptoceridae,Limnephilidae,Molannidae,Odontoceridae,Philopotamidae,Phryganeidae,Polycentropidae,Rhyacophilidae,Uenoidae,Corixidae,Corydalidae,Sialidae,Chrysolmelidae,Dytiscidae,Elmidae,Psephenidae,Athericidae,Blepharicidae,Ceratopogonidae,Chironomidae,Dixidae,Empididae,Psychodidae,Simuliidae,Strationyidae,Tabanidae,Tipulidae,Aeshnidae,Calopterygidae,Cordulegastridae,Gomphidae,Libellulidae,Pyralidae,Planorbidae,Sphaeridae
1100291,1,1,3,2,2,0,0,0,0,0,4,66,1,2,11,1,10,21,0,0,0,1,0,0,1,0,0,3,0,0,0,3,0,0,0,8,0,0,0,1,0,0,71,0,1,0,5,121,0,1,0,2,0,0,15,0,0,0,0,0,0,1,12
2400143,1.88,0,0,0,25,0,0,0,0,0,6,8,0,17,3,0,11,9,1,6,0,1,3,0,4,0,0,1,0,0,0,4,38,0,0,8,2,0,0,0,0,11,25,0,1,0,2,29,0,0,0,22,0,0,8,0,0,2,5,0,0,0,0
2500364,1,0,4,0,6,0,0,0,0,0,66,0,0,63,0,0,55,14,3,0,0,0,0,0,4,0,0,1,0,2,0,0,11,0,0,18,0,0,0,0,0,0,2,0,2,0,0,86,0,0,0,9,0,0,10,0,0,0,0,0,1,0,0
2600075,1,0,1,0,15,0,0,0,0,0...etc


The program requires two loops, but took less than a second to run on my 1.8 GHz machine:

# Reads the matrix of raw macroinvertebrate counts from the subsampling prior
# to the large-rare search and scaling for sub-sampling effort
rm(list=ls())
master_data = read.csv("common_macro_raw.csv", row.names=1)  # SiteID becomes the row names
counts = master_data[,2:ncol(master_data)]  # drop Scaling_factor, keep the taxa counts

# These loops extract a stream's assemblage, create a list of the buggies
# identified, take a random sample of 100 buggies without replacement, and
# then re-combine the resulting list into a vector of counts by taxon
taxa_codes = 1:ncol(counts)  # sequential integer for each taxon; the index for the lists below
rarified_samples = numeric()
for (x in 1:nrow(counts)) {
    temp_counts = as.integer(unlist(counts[x,]))  # plain integer vector for rep()
    full_list = rep(taxa_codes, times=temp_counts)
    stream_rand = sum(temp_counts)/100*master_data[x,1]  # new scaling factor goes in the first column
    rare_list = sample(full_list, 100)  # errors if the stream has fewer than 100 buggies
    for (i in 1:ncol(counts)) {
        temp_sum = sum(rare_list==i)
        stream_rand = c(stream_rand, temp_sum)
    }
    rarified_samples = rbind(rarified_samples, stream_rand)
}
rownames(rarified_samples) = rownames(master_data)  # SiteID is no longer a column after row.names=1
colnames(rarified_samples) = colnames(master_data)
write.csv(rarified_samples, file = "rarified_samples.csv")

You could add another for loop that appends as many iterations as needed to
the output file.  Thanks for all of your input; it helped tremendously.
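
As a rough sketch of that extra loop (not part of the script above): the
per-stream step can be wrapped in a function and repeated with replicate().
The names rarefy_row and n_iter and the small made-up counts matrix are
assumptions for illustration only, and tabulate() stands in for the inner
for loop.

```r
# Hypothetical helper: rarefy one vector of counts down to `size` individuals,
# sampling without replacement, and return the counts per taxon.
rarefy_row <- function(row_counts, size = 100) {
    taxa_names <- names(row_counts)
    row_counts <- as.integer(row_counts)
    full_list <- rep(seq_along(row_counts), times = row_counts)  # one entry per individual
    picked <- full_list[sample.int(length(full_list), size)]     # draw positions, then look up taxa
    out <- tabulate(picked, nbins = length(row_counts))          # counts by taxon, zeros kept
    names(out) <- taxa_names
    out
}

# Made-up example data: 2 streams x 4 taxa (NOT from common_macro_raw.csv)
counts <- matrix(c(120,  30,  0, 50,
                    10, 200, 40,  5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("siteA", "siteB"), paste0("taxon", 1:4)))

set.seed(42)
n_iter <- 5  # number of rarefactions per stream (assumed value)
# Result is a taxa x streams x iterations array
rarefied <- replicate(n_iter, apply(counts, 1, rarefy_row, size = 100))
```

Here rarefied[, "siteA", ] holds the five rarefied assemblages for siteA, one
per column, so each iteration could be written out or summarised separately.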

On 10/12/06, Petr Pikal <[EMAIL PROTECTED]> wrote:
>
> Hi
>
> On 11 Oct 2006 at 12:54, Tony Plate wrote:
>
> Date sent:              Wed, 11 Oct 2006 12:54:44 -0600
> From:                   Tony Plate <[EMAIL PROTECTED]>
> To:                     Brian Frappier <[EMAIL PROTECTED]>
> Copies to:              Petr Pikal <[EMAIL PROTECTED]>,
> r-help@stat.math.ethz.ch
> Subject:                Re: [R] Fwd: rarefy a matrix of counts
>
> > Two things to note:
> >
> > (1) rep() can be vectorized:
> >  > rep(1:3, 2:4)
> > [1] 1 1 2 2 2 3 3 3 3
> >  >
> >
> > (2) you will likely get much better performance if you work with
> > integers and convert to strings after sampling (or use factors), e.g.:
>
> that is what I actually used in my suggestion (I hope).
>
> > DF
>   color sample1 sample2 sample3
> 1   red     400     300    2500
> 2 green     100       0     200
> 3 black     300    1000     500
>
> Notice that red, green, black are not **row names** but a column in the
> data frame.
> That is why the following code gives red, green, etc.
>
> x <- data.frame(matrix(NA,100,3))
> for (i in 2:ncol(DF)) x[,i-1] <- sample(rep(DF[,1], DF[,i]),100)
> if you want the result in a data frame,
> or
> x <- vector("list", 3)
> for (i in 2:ncol(DF)) x[[i-1]] <- sample(rep(DF[,1], DF[,i]),100)
>
> >
> >  > c("red","green","blue")[sample(rep(1:3,c(400,100,300)), 5)]
> > [1] "red"  "blue" "red"  "red"  "red"
> >  >
> >
> > -- Tony Plate
> >
>
> <snip>
>
> > > is that this code still samples the rows, not the elements, i.e.
>
> No, see above.
>
> > > returns 100 or 300 in the matrix cells instead of "red" or a matrix
> > > of counts by color (object type) like:
> > >        x1    x2   x3
> > > red  32     5    60
> > > gr    68    95   40
> > > sum 100  100  100
>
> something like
>
> sapply(x,table)
>        X1 X2 X3
> black 36 79 15
> green 14  0  9
> red   50 21 76
>
> HTH
> Petr
>
> > >
> > >  It looks like Tony is right: sampling without replacement requires
> > > listing of all elements to be sampled.  But, the code Petr provided
> > >
> > > x1 <- sample(c(rep("red",400),rep("green",
> > > 100),rep("black",300)),100)
> > >
> > > did give me a clue of how to quickly make such a list using the
> > > 'rep' command.  I will for-loop a rep statement using my original
> > > matrix to create a list of elements for each sample:
> > >
> > > Thanks Petr and Tony for your help!
> > >
> > > On 10/11/06, Tony Plate <[EMAIL PROTECTED]> wrote:
> > >
> > >     Here's a way using apply(), and the prob= argument of sample():
> > >
> > >      > df <- data.frame(sample1=c(red=400,green=100,black=300),
> > >     sample2=c(300,0,1000), sample3=c(2500,200,500))
> > >      > df
> > >            sample1 sample2 sample3
> > >     red       400     300    2500
> > >     green     100       0     200
> > >     black     300    1000     500
> > >      > set.seed(1)
> > >      > apply(df, 2, function(counts) sample(seq(along=counts), rep=T,
> > >     size=7, prob=counts))
> > >           sample1 sample2 sample3
> > >     [1,]       1       3       1
> > >     [2,]       1       3       1
> > >     [3,]       3       3       1
> > >     [4,]       2       3       2
> > >     [5,]       1       3       1
> > >     [6,]       2       3       1
> > >     [7,]       2       3       3
> > >      >
> > >
> > >     Note that this does sampling WITH replacement.
> > >     AFAIK, sampling without replacement requires enumerating the
> > >     entire population to be sampled from.  I.e., you cannot do
> > >      > sample(1:3, prob=1:3, rep=F, size=4)
> > >     instead of
> > >      > sample(c(1,2,2,3,3,3), rep=F, size=4)
> > >
> > >     -- Tony Plate
> > >
> > >      From reading ?sample, I was a little unclear on whether sampling
> > >     without replacement could work
> > >
> > >     Petr Pikal wrote:
> > >      > Hi
> > >      >
> > >      > a little bit different story. But
> > >      >
> > >      > x1 <- sample(c(rep("red",400),rep("green", 100),
> > >      > rep("black",300)),100)
> > >      >
> > >      > is maybe close. With data frame (if it is not big)
> > >      >
> > >      >
> > >      >>DF
> > >      >
> > >      >   color sample1 sample2 sample3
> > >      > 1   red     400     300    2500
> > >      > 2 green     100       0     200
> > >      > 3 black     300    1000     500
> > >      >
> > >      > x <- data.frame(matrix(NA,100,3))
> > >      > for (i in 2:ncol(DF)) x[,i-1] <- sample(rep(DF[,1], DF[,i]),100)
> > >      > if you want result in data frame or
> > >      > x <- vector("list", 3)
> > >      > for (i in 2:ncol(DF)) x[[i-1]] <- sample(rep(DF[,1], DF[,i]),100)
> > >      >
> > >      > if you want it in list. Maybe somebody is clever enough to
> > >      > discard for loop but you said you have 80 columns which shall
> > >      > be no problem.
> > >      >
> > >      > HTH
> > >      > Petr
> > >      >
> > >      > On 11 Oct 2006 at 10:11, Brian Frappier wrote:
> > >      >
> > >      > Date sent:            Wed, 11 Oct 2006 10:11:33 -0400
> > >      > From:                 "Brian Frappier" <[EMAIL PROTECTED]>
> > >      > To:                   "Petr Pikal" <[EMAIL PROTECTED]>
> > >      > Subject:              Fwd: [R] rarefy a matrix of counts
> > >      >
> > >      >
> > >      >>---------- Forwarded message ----------
> > >      >>From: Brian Frappier <[EMAIL PROTECTED]>
> > >      >>Date: Oct 11, 2006 10:10 AM
> > >      >>Subject: Re: [R] rarefy a matrix of counts
> > >      >>To: r-help@stat.math.ethz.ch
> > >      >>
> > >      >>Hi Petr,
> > >      >>
> > >      >>Thanks for your response.  I have data that looks like the following:
> > >      >>
> > >      >>               sample 1    sample 2    sample 3    ....
> > >      >>red candy         400         300        2500
> > >      >>green candy       100           0         200
> > >      >>black candy       300        1000         500
> > >      >>
> > >      >>I don't want to randomly select either the samples (columns)
> > >      >>or the "candy" types (rows), which sample(), as you state, would
> > >      >>allow me to do. Instead, I want to randomly sample 100 candies from
> > >      >>each sample and retain info on their associated type.  I
> > >      >>could make a list of all the candies in each sample:
> > >      >>
> > >      >>sample 1
> > >      >>red
> > >      >>red
> > >      >>red
> > >      >>red
> > >      >>green
> > >      >>green
> > >      >>black
> > >      >>red
> > >      >>black
> > >      >>...
> > >      >>
> > >      >>and then randomly sample those rows.  Repeat for each sample.  But, I
> > >      >>am not sure how to do that without a lot of loops, and am
> > >      >>wondering if there is an easier way in R.  Thanks!  I should
> > >      >>have laid this out in the first email...sorry.
> > >      >>
> > >      >>
> > >      >>On 10/11/06, Petr Pikal <[EMAIL PROTECTED]> wrote:
> > >      >>
> > >      >>>Hi
> > >      >>>
> > >      >>>I am not experienced in Matlab and from your explanation I
> > >      >>>do not understand what exactly you want. It seems that
> > >      >>>you want to randomly choose a sample of 100 rows from your
> > >      >>>matrix, which can be achieved by sample.
> > >      >>>
> > >      >>>DF<- data.frame(rnorm(100), 1:100, 101:200, 201:300)
> > >      >>>DF[sample(1:100, 10),]
> > >      >>>
> > >      >>>If you want to do this several times, you need to save your
> > >      >>>result, and then it depends on what you want to do next. One
> > >      >>>suitable form is a list of matrices, the other is an array, and
> > >      >>>you can use a for loop to fill it.
> > >      >>>
> > >      >>>HTH
> > >      >>>Petr
> > >      >>>
> > >      >>>
> > >      >>>On 10 Oct 2006 at 17:40, Brian Frappier wrote:
> > >      >>>
> > >      >>>Date sent:              Tue, 10 Oct 2006 17:40:47 -0400
> > >      >>>From:                   "Brian Frappier" <[EMAIL PROTECTED]>
> > >      >>>To:                     r-help@stat.math.ethz.ch
> > >      >>>Subject:                [R] rarefy a matrix of counts
> > >      >>>
> > >      >>>
> > >      >>>>Hi all,
> > >      >>>>
> > >      >>>>I have a matrix of counts for objects (rows) by samples
> > >      >>>>(columns).  I aimed for about 500 counts in each sample (I
> > >      >>>>have about 80 samples) and would now like to rarefy these
> > >      >>>>down to 100
> > >      >>>>counts in each sample using simple random sampling without
> > >      >>>>replacement.  I plan on rarefying several times for each
> > >      >>>>sample.  I could do the tedious looping task of making a
> > >      >>>>list of all objects (with its associated identifier) in
> > >      >>>>each sample and then use the wonderful "sampling" package
> > >      >>>>to select a sub-sample of 100 for each sample and thereby
> > >      >>>>get a logical vector of inclusions.  I would then regroup
> > >      >>>>the resulting logical vector into a vector of counts by
> > >      >>>>object, rinse and repeat several times for each sample.
> > >      >>>>
> > >      >>>>Alternately, using the same list, I could create a random
> > >      >>>>index of integers between 1 and the number of objects for a
> > >      >>>>sample (without repeats) and then select those objects from
> > >      >>>>the list.  Again, rinse and repeat several times for each
> > >      >>>>sample.
> > >      >>>>
> > >      >>>>Is there a way to directly rarefy a matrix of counts
> > >      >>>>without having to create a list of objects first?  I am
> > >      >>>>trying to switch to R from Matlab and am trying to pick up
> > >      >>>>good programming habits from the start.
> > >      >>>>
> > >      >>>>Much appreciation!
> > >      >>>>
> > >      >>>> [[alternative HTML version deleted]]
> > >      >>>>
> > >      >>>>______________________________________________
> > >      >>>>R-help@stat.math.ethz.ch mailing list
> > >      >>>>https://stat.ethz.ch/mailman/listinfo/r-help
> > >      >>>>PLEASE do read the posting guide
> > >      >>>>http://www.R-project.org/posting-guide.html and provide
> > >      >>>>commented, minimal, self-contained, reproducible code.
> > >      >>>
> > >      >>>Petr Pikal
> > >      >>>[EMAIL PROTECTED]
> > >      >>>
> > >      >>>
> > >      >>
> > >      >
> > >      > Petr Pikal
> > >      > [EMAIL PROTECTED]
> > >      >
> > >      >
> > >
> > >
> >
>
> Petr Pikal
> [EMAIL PROTECTED]
>
>
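
A note on the remark quoted above that sampling without replacement requires
enumerating the entire population: as far as I can tell, drawing the counts
taxon by taxon from the hypergeometric distribution gives the same kind of
sample (a multivariate hypergeometric draw) without building the list of
individuals. This is only a minimal sketch, assuming base R's stats::rhyper();
the name rarefy_hyper is hypothetical and was not suggested by anyone in this
thread.

```r
# Sketch: rarefy a count vector without building the list of individuals.
# Each taxon's count is drawn with rhyper(), conditioning on the draws
# already allocated to earlier taxa.
rarefy_hyper <- function(row_counts, size = 100) {
    row_counts <- as.integer(row_counts)
    stopifnot(sum(row_counts) >= size)
    out <- integer(length(row_counts))
    remaining_pop <- sum(row_counts)  # individuals in taxa not yet processed
    remaining_draws <- size           # draws still to allocate
    for (i in seq_along(row_counts)) {
        remaining_pop <- remaining_pop - row_counts[i]
        # rhyper(nn, m, n, k): m = this taxon, n = all later taxa, k = draws left
        out[i] <- rhyper(1, row_counts[i], remaining_pop, remaining_draws)
        remaining_draws <- remaining_draws - out[i]
    }
    out
}

set.seed(1)
res <- rarefy_hyper(c(400, 100, 300), size = 100)  # the red/green/black example
```

The returned vector always sums to size, and no taxon can receive more
individuals than it had in the original counts.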

