[R] Extract cell of many values from dataframe cells and sample from them.

Benjamin Ward (ENV) Thu, 08 Nov 2012 09:26:56 -0800

Hi,

First, my apologies for the non-working piece of code in a previous submission; I 
have corrected this error.


What I'm doing is individual-based modelling of a pathogen and its host. The way 
I've thought of doing this is with two dataframes, one of the pathogen and its 
genes and effector genes, and one of the host and its resistance genes. During 
the simulation, these things can be pulled out of the dataframes and operated 
on, before being stored again in the dataframes.

Below is how I've created my dataframe and stored my effector genes. In this 
model, effector genes are numerical values between 1 and 10000.

Path_Number <- 0500
inds <- data.frame(ID=formatC(0001:Path_Number,width=4,flag=0),No_of_Effectors="",No_Expressed_Effectors="")
inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow(inds),function(x) runif(1, min=1, max=550))))
Effectors <- lapply(1:nrow(inds),function(x) sample(1:10000,inds$No_of_Effectors,replace=TRUE))
inds <- data.frame(inds,Effectors=as.character(Effectors))
Ind_Genes <- strsplit(as.character(inds[1,4]),",")

What I'm trying to do is:
1). For each individual (row) in my dataframe, extract the values in the 
"Effectors" cell to an object.
2). Sample a number of those values and assign them to a new object called 
"Expressed_Effectors".
3). Store that object in the Expressed_Effectors cell, in much the same manner as I 
stored the Effectors object in the "Effectors" cell.
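
The three steps above can be sketched as follows. This is only a minimal sketch under stated assumptions, not the code from the post: it assumes the effectors are kept as a list column (wrapped in I()) rather than as character strings, it uses a population of 5 instead of 500 for brevity, it draws at least 50 effectors per individual so that sampling 10 to 50 without replacement is always possible, and the helper name sample_expressed is hypothetical.

```r
set.seed(1)  # only so the sketch is repeatable

n_ind <- 5   # assumed small population for illustration (the post uses 500)
inds <- data.frame(ID = formatC(1:n_ind, width = 4, flag = "0"),
                   stringsAsFactors = FALSE)

# Number of effector genes per individual (assumed >= 50, see lead-in).
inds$No_of_Effectors <- sample(50:550, n_ind, replace = TRUE)

# One integer vector of effector genes per individual, stored as a list column.
inds$Effectors <- I(lapply(inds$No_of_Effectors,
                           function(k) sample(1:10000, k, replace = TRUE)))

# Hypothetical helper: draw between 10 and 50 genes from one individual's effectors.
sample_expressed <- function(genes) {
  n <- sample(10:50, 1)
  genes[sample.int(length(genes), n)]  # sample positions, then index the vector
}

# Step 1: extract each cell; step 2: sample from it; step 3: store the result.
inds$Expressed_Effectors <- I(lapply(inds$Effectors, sample_expressed))
inds$No_Expressed_Effectors <- sapply(inds$Expressed_Effectors, length)
```

With a list column, inds$Effectors[[1]] returns the first individual's genes directly as a numeric vector, so no string splitting is needed when pulling values out or putting them back.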

My example attempt (for the first row/individual in my dataset) is below:

(Step by step; I won't put this in a loop until I know it works for one row.)

Extract the values (effector genes) for the first individual, from the 
Effectors Cell in the dataframe, to "Ind_Effectors" object.
Ind_Effectors <- strsplit(as.character(inds[1,4]),",")

Randomly dictate how many values (effectors) will be sampled
n<-round(runif(1, min=10, max=50))

Sample n values (effector genes) from "Ind_Effectors", not replacing
Expressed_Genes <- sample(Ind_Effectors,n,replace=F)

If I run this I receive the error:
Error in sample(Ind_Effectors, n, replace = F) :
  cannot take a sample larger than the population when 'replace = FALSE'

What I think this means is that, rather than picking out n values from the whole 
set of values in "Ind_Effectors", it's trying to sample the whole lot n times, 
which it cannot do because replace=F. This is not what I need. What I need is n 
values sampled from "Ind_Effectors", not all values from Ind_Effectors sampled 
n times.
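
One point worth checking against the error above: strsplit() always returns a list, here a list of length one whose single element holds all the pieces, so sample() is given a population of size 1 and cannot draw more than one item without replacement. The small illustration below uses a made-up comma-separated string, not the actual cell contents from the post.

```r
cell <- "3540,27,5344,7278,9758,8077"  # made-up stand-in for one "Effectors" cell

Ind_Effectors <- strsplit(cell, ",")
length(Ind_Effectors)        # 1: a list holding one character vector
length(Ind_Effectors[[1]])   # 6: the values themselves

# sample(Ind_Effectors, 3, replace = FALSE) would fail here, because the
# population is the one-element list, not the six values inside it.

genes <- as.numeric(Ind_Effectors[[1]])  # unwrap the list, convert to numbers
Expressed_Genes <- sample(genes, 3, replace = FALSE)
```

As far as I can tell, as.character() applied to a list also produces strings of the form "c(3540, 27, ...)", so with the dataframe built in the post the first and last pieces would still carry "c(" and ")" and would need cleaning before as.numeric().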

I hope this clears up the confusion about what I'm trying to do. It may very 
well be that I'm not instructing R to sample as I require. Sadly, my 
previous experience with R amounts to loading in dataframes from experiments and 
doing stat analysis & model fitting, not simulations or individual-based models.

Best wishes,

Ben W.
UEA (ENV) & The Sainsbury Laboratory.

P.S. As an aside, I've been thinking about building this model in a different way 
from the dataframe-based approach I described at the start of this email.
Instead I would use multi-dimensional ragged array(s):
The format would be a 2D layout, where every row is an effector gene and every 
column an aspect of the effector gene (value, expression state, fitness 
contribution etc.). This 2D layout of rows and columns is then repeated in the 
3rd dimension (the z of x,y,z) of the array for each individual. It is ragged 
in the sense that each individual, each slice through the array in the z direction, 
would have a different number of rows - a different number of effectors. This may 
be easier to work on, but I've not worked with multidimensional arrays; I'm 
used to data in dataframes (usually from spreadsheets from experiments).
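
A base R array is rectangular, so it cannot itself be ragged; one common way to get the layout described above is a list with one data frame per individual. The sketch below is only an illustration under assumptions: the column names (value, expressed, fitness), the constructor name make_individual, and the population size of 3 are all hypothetical.

```r
set.seed(2)  # only so the sketch is repeatable

n_ind <- 3   # assumed small population for illustration

# Hypothetical constructor: one data frame per individual, one row per effector gene.
make_individual <- function(n_genes) {
  data.frame(value     = sample(1:10000, n_genes, replace = TRUE),
             expressed = FALSE,
             fitness   = runif(n_genes))
}

# The "ragged" third dimension is a list; each element may have a different nrow().
pathogens <- lapply(sample(20:60, n_ind, replace = TRUE), make_individual)
names(pathogens) <- formatC(1:n_ind, width = 4, flag = "0")

# Switch on 10 randomly chosen genes in the first individual.
on <- sample.int(nrow(pathogens[[1]]), 10)
pathogens[[1]]$expressed[on] <- TRUE

# Duplicate one gene (add a row) and delete another (drop a row).
pathogens[[1]] <- rbind(pathogens[[1]], pathogens[[1]][1, ])
pathogens[[1]] <- pathogens[[1]][-2, ]

sapply(pathogens, nrow)  # number of effector rows per individual
```

Mutation, duplication, and deletion then become ordinary data frame operations on a single list element, and lapply() can run the same operation over every individual.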

________________________________
From: Jean V Adams [jvad...@usgs.gov]
Sent: 08 November 2012 13:35
To: Benjamin Ward (ENV)
Cc: r-help@r-project.org
Subject: RE: [R] sample from list

Ben,

You have still not supplied reproducible code for me (and any other r-help 
reader) to run, which makes it very difficult to help you.  I can run your 
first 5 lines of code with no problem.

Path_Number <- 0500
inds <- data.frame(ID=formatC(0001:Path_Number,width=4,flag=0),No_of_Effectors="",No_Expressed_Effectors="")
inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow(inds),function(x) runif(1, min=1, max=550))))
Effectors <- lapply(1:nrow(inds),function(x) sample(1:10000,inds$No_of_Effectors,replace=TRUE))
inds <- data.frame(inds,Effectors=as.character(Effectors))

But your 6th line of code doesn't work ... there is no object inds2.

Ind_Genes<-strsplit(as.character(inds2[1,4]),",")

If I use code that you provided in your earlier e-mail to create inds2, I get 
errors because inds doesn't have a variable No_of_Genes.

Genes <- lapply(1:nrow(inds),function(x) sample(1:10000,inds$No_of_Genes,replace=TRUE))
inds2 <- data.frame(inds, Genes=I(Genes))
inds2$No_Expressed_Genes <- round(as.numeric(lapply(1:nrow(inds2),function(x) runif(1, min=10, max=50))))

So, before you hit the send button on your next e-mail, start a clean R 
session with none of your objects in the workspace or the search path, 
and test your code to see if it runs.

You will find many more willing helpers if you supply reproducible code.

You might want to start with a new posting, too, to give more people a fresh 
look.

Jean



"Benjamin Ward (ENV)" <b.w...@uea.ac.uk> wrote on 11/08/2012 05:04:20 AM:
>
> Hi,
>
> Thanks, for the reply.
>
> I should explain more, I'll be as brief as I can, the code for
> generating the dataframe is below.
>
> What I'm doing is individual-based modelling of a pathogen and its
> host. The way I've thought of doing this is with two dataframes, one
> of the pathogen and its genes and effectors, and one of the host
> and its resistance genes. During the processes of the model these
> things can be pulled out of the dataframes and operated on, before
> being stored again in the dataframes.
>
> I have generated my dataset as below, it was suggested by "arun" in
> a reply to a previous email I wrote with the subject "Trouble with
> data structures".
>
> Path_Number <- 0500 # The number of pathogen individuals in the population.
> # Create the initial dataframe, with initial number of effectors and
> # initial number of expressed effectors.
> inds <- data.frame(ID=formatC(0001:Path_Number,width=4,flag=0),No_of_Effectors="",No_Expressed_Effectors="")
> # Generate the number of effector genes each individual has.
> inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow(inds),function(x) runif(1, min=1, max=550))))
> # Generate the actual effector genes.
> Effectors <- lapply(1:nrow(inds),function(x) sample(1:10000,inds$No_of_Effectors,replace=TRUE))
> # Add them to the dataframe.
> inds <- data.frame(inds,Effectors=as.character(Effectors))
>
> What I'm trying to do is for each individual, extract the values in
> the Effector genes cell to an object. As far as I can tell,
>
> Ind_Genes<-strsplit(as.character(inds2[1,4]),",")
>
> Will do this for the first individual or I can get all of them with
>
> All_Genes<-strsplit(as.character(inds2[,4]),",")
>
> What I then want to do is according to a generated number for each
> individual...
>
> round(as.numeric(lapply(1:nrow(inds2),function(x) runif(1, min=10, max=50))))
>
> ... sample that many genes from Ind_Genes and make a new object
> called Expressed_Genes, which can be stored in the dataframe. My
> attempt at doing this is:
>
> Expressed_Genes<-lapply(First_Ind_Genes,function(x) sample
> (First_Ind_Genes,round(runif(1, min=10, max=50)),replace=F))
>
> to get Expressed genes for each individual, this might be part of a
> for loop, or to the whole list of every individuals genes like so:
>
> Expressed_Genes<-lapply(All_Genes,function(x) sample(All_Genes,3,replace=F))
>
> What usually happens however is I get errors:
> Error in sample(First_Ind_Genes, round(runif(1, min = 10, max = 50)),  :
>   cannot take a sample larger than the population when 'replace = FALSE'
>
> or it will rather than sample 3 values, sample all the values, 3
> times if I allow replacement (which I don't want).
>
> So it's not sampling 3 values for me, but the whole lot of values 3 times.
>
> I do not know of another way to extract these gene values and then
> do things with them.
> For my model it is essential I can pull the genes or expressed genes
> out of the dataframe, work functions or operations on them and then
> store them back again. For example if an individual turns a gene on
> that was not before, then the genes would need to be pulled from the
> database, as would the expressed genes, and a random value from the
> genes object added to the expressed genes object, and then they
> could both be put back. A similar thing would happen when I wanted
> to mutate the genes.
>
> In short my aim is pull genes or expressed genes out, work functions
> or operations on them and then store them back again.
>
> Hopefully I've explained better. I have been thinking of changing my
> approach from datasets of pathogen and host, from which values are
> pulled to objects and operated on, to multi-dimensional ragged
> arrays. I've been told this may be simpler for me.
>
> Where every line is an effector gene and there can be columns for
> the gene value, expression state (1 or 0/T or F), fitness
> contribution etc. This 2D layout of rows and columns is then
> repeated in the z dimension of the array for each individual. It is
> ragged in the sense each individual, each slice through the array in
> the z direction, would have different numbers of rows - different
> numbers of effectors. I can then simulate mutations by changing the
> gene values, cause duplications by adding rows of duplicated genes,
> or even cause deletions by removing rows.
> Once I have this set up for the pathogen I may make a similar array
> for the host plants, then perhaps with indexing or some such thing I
> can write functions to do the interactions and immunology and such.
>
> Best,
>
> Ben W.
>
> UEA (ENV) & The Sainsbury Laboratory.
>
> From: Jean V Adams [jvad...@usgs.gov]
> Sent: 07 November 2012 21:12
> To: Benjamin Ward (ENV)
> Cc: r-help@r-project.org
> Subject: Re: [R] sample from list

> Ben,
>
> Can you provide a small example data set for
>         inds
> so that we can run the code you have supplied?
> It's difficult for me to follow what you've got and where you're trying to go.
>
> Jean
>
>
>
> "Benjamin Ward (ENV)" <b.w...@uea.ac.uk> wrote on 11/06/2012 03:29:52 PM:
> >
> > Hi all,
> >
> > I have a list of genes present in 500 individuals, the individuals
> > are the elements:
> > Genes <- lapply(1:nrow(inds),function(x) sample(1:10000,inds
> > $No_of_Genes,replace=TRUE))
> >
> > (This was later written to a dataframe as well as kept as the list
> > object: inds2 <- data.frame(inds,Genes=I(Genes)))
> >
> > I also have a vector of how many of those genes are expressed in
> > the individuals; this can also be kept as a vector object or written to
> > a data frame:
> >
> > inds2$No_Expressed_Genes <- round(as.numeric(lapply(1:nrow
> > (inds2),function(x) runif(1, min=10, max=50))))
> >
> > I want to create another list which consists of each individuals
> > expressed genes - essentially a subset of the total genes the
> > individuals have in the "Genes" list, by sampling from the Genes
> > list for each individual, the number of genes (values)in the
> > Num_Expressed_Genes vector. i.e. if Num_Expressed_Genes = 3 then
> > sample 3 values from the element in the Genes list. I can't quite
> > figure it out though. So far I have the following:
> >
> > #Defines The number of expressed genes for each individual in my data frame.
> > Num_Expressed_Genes <- round(as.numeric(lapply(1:nrow
> > (inds2),function(x) runif(1, min=10, max=50))))
> >
> >
> > #My attempts to apply the sample function to every element
> > (individual organism) of the "Genes" list , to subset the genes expressed.
> > Expressed_Genes <- lapply(1:nrow(inds),function(x) sample
> > (Genes,Num_Expressed_Genes, replace=FALSE))
> > Expressed_Genes <- lapply(Genes,function(x) sample
> > (Genes,Num_Expressed_Genes, replace=FALSE))
> >
> > So far though I'm getting results like this:
> >
> > [[49]]
> > [[49]][[1]]
> >   [1] 3540   27 5344 7278 9758 8077 ............................... [217]
> >
> >
> > [[49]][[2]]
> >   [1]  740 3362 8588 8574 4371 1447 .............................. [340]
> >
> >
> > When what I need is more:
> >
> > [[49]]
> > [1] 6070 1106 6275
> > In a case where Num_Expressed_Genes = 3 and the values are taken
> > from the much larger set of values for element (individual) 49 in my
> > Genes list.
> >
> > I'm not sure what I'm doing wrong but it seems what is happening is
> > instead of picking out a few values according to the
> > Num_Expressed_Genes vector - as an example say 3 again, It's drawing
> > a large number of values, if not all of them, from elements in the
> > list, 3 times.
> >
> > Any help is greatly appreciated,
> > I've thought of using loops to achieve the same task, but I'm trying
> > to get my individual/genes/expressed genes data.frame set up for my
> > individual-based model and get it running using vectors and as
> > few loops as possible.
> >
> > Thanks,
> > Ben.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
