Re: [R] Extract cell of many values from dataframe cells and sample from them.

Benjamin Ward (ENV) Sun, 11 Nov 2012 07:35:28 -0800

Hi,

Thank you for your suggestion, this works a treat.


For my understanding and future reference, would this also work for something 
like 2D matrices with unequal numbers of rows? As far as I understand, it would 
not be possible to make a 3D array jagged like this, because every slice of an 
array must have the same number of rows, whereas a list has no such 
requirement, and operations can target elements in a specific matrix with 
[[i]][row, col]?
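
A minimal sketch of the list-of-matrices idea asked about above. The column 
names (value, expressed) and the gene counts are made up for illustration and 
are not from the thread; the point is only that each list element can be a 
matrix with its own number of rows, addressed as [[i]][row, col].

```r
# Hypothetical example: three individuals with 2, 4 and 3 effector genes.
set.seed(1)
n_genes <- c(2, 4, 3)
individuals <- lapply(n_genes, function(n) {
  cbind(value     = sample(1:10000, n, replace = TRUE),
        expressed = sample(0:1, n, replace = TRUE))
})
names(individuals) <- formatC(seq_along(n_genes), width = 4, flag = "0")

# Row counts differ between individuals, which a 3D array could not hold.
sapply(individuals, nrow)

# [[i]] picks one individual's matrix, [row, col] picks a cell inside it.
individuals[["0002"]][3, "value"]

# Switch gene 1 of individual 3 on, then delete gene 2 of individual 2.
individuals[[3]][1, "expressed"] <- 1
individuals[[2]] <- individuals[[2]][-2, , drop = FALSE]
```

drop = FALSE keeps the result a matrix even if only one row is left, so later 
[row, col] indexing keeps working.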

Best Wishes,

Ben W.

UEA (ENV) & The Sainsbury Laboratory.

________________________________
From: Jean V Adams [jvad...@usgs.gov]
Sent: 08 November 2012 19:59
To: r-help@r-project.org
Cc: Benjamin Ward (ENV)
Subject: Re: [R] Extract cell of many values from dataframe cells and sample 
from them.

Ben,

I think you would find lists a helpful way to arrange your data.  They do not 
require equal lengths of data in each element.  Check out the code below for a 
smaller version of the example you provided (with only 5 individuals rather 
than 500).

# An alternative way to arrange your data, as a list
# Each element of the list is an individual, with all its effector genes
ID.unique <- formatC(1:5, width=4, flag="0")
No_of_Effectors <- sample(1:550, length(ID.unique), replace=TRUE)
Effectors <- split(sample(1:10000, sum(No_of_Effectors), replace=TRUE), 
rep(ID.unique, No_of_Effectors))
Effectors

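# Now take a random sample of effectors from each individual.
# Sampling positions avoids sample()'s special case, where a single number x
# is treated as 1:x (possible here if an individual has only one effector).
Expressed_Genes <- lapply(Effectors, function(x)
  x[sample.int(length(x), sample.int(length(x), 1))])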
Expressed_Genes
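
If the lists still need to live in a data frame (the "store it back in the 
cell" step of the original question), one possible approach is a list-column 
made with I(). This is only a sketch under that assumption; the column names 
follow the thread, and the seed is arbitrary.

```r
set.seed(42)
ID.unique <- formatC(1:5, width = 4, flag = "0")
No_of_Effectors <- sample(1:550, length(ID.unique), replace = TRUE)
Effectors <- split(sample(1:10000, sum(No_of_Effectors), replace = TRUE),
                   rep(ID.unique, No_of_Effectors))

# Draw a random number of genes per individual, by position.
Expressed_Genes <- lapply(Effectors, function(x) {
  k <- sample.int(length(x), 1)
  x[sample.int(length(x), k)]
})

# I() stores each list as a list-column instead of expanding it into columns.
inds <- data.frame(ID = names(Effectors),
                   No_of_Effectors = lengths(Effectors),
                   No_Expressed_Effectors = lengths(Expressed_Genes))
inds$Effectors <- I(Effectors)
inds$Expressed_Effectors <- I(Expressed_Genes)

# Pull one individual's genes back out as an ordinary numeric vector.
inds$Effectors[["0003"]]
```

The cells keep their numeric type this way, so no as.character() / strsplit() 
round trip is needed to get the values back.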

Jean



"Benjamin Ward (ENV)" <b.w...@uea.ac.uk> wrote on 11/08/2012 10:00:57 AM:
>
> Hi,
>
> First my apologies for a non-working piece of code in a previous
> submission, I have corrected this error.
>
> What I'm doing is individual-based modelling of a pathogen and its host.
> The way I've thought of doing this is with two dataframes, one of
> the pathogen and its genes and effector genes, and one of the host
> and its resistance genes. During the simulation, these things can
> be pulled out of the dataframes and operated on, before being stored
> again in the dataframes.
>
> Below is how I've created my dataframe and stored my effector genes.
> In this model, effector genes are numerical values between 1 and 10000.
>
> Path_Number <- 0500
> inds <- data.frame(ID=formatC
> (0001:Path_Number,width=4,flag=0),No_of_Effectors="",No_Expressed_Effectors="")
> inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow
> (inds),function(x) runif(1, min=1, max=550))))
> Effectors <- lapply(1:nrow(inds),function(x) sample(1:10000,inds
> $No_of_Effectors,replace=TRUE))
> inds <- data.frame(inds,Effectors=as.character(Effectors))
> Ind_Genes <- strsplit(as.character(inds[1,4]),",")
>
> What I'm trying to do is:
> 1). For each individual (row) in my dataframe, extract the values in
> the "Effectors" cell to an object.
> 2). Sample a number of those values and assign them to a new object
> called "Expressed_Effectors".
> 3). Store it in the Expressed_Effectors cell, in much the same
> manner as I stored the Effectors object in the "Effectors" cell.
>
> My example attempt (for the first row/individual in my dataset) is below:
>
> (step by step, I didn't put this in a loop until I know it works for 1 row)
>
> Extract the values (effector genes) for the first individual, from
> the Effectors Cell in the dataframe, to "Ind_Effectors" object.
> Ind_Effectors <- strsplit(as.character(inds[1,4]),",")
>
> Randomly dictate how many values (effectors) will be sampled
> n<-round(runif(1, min=10, max=50))
>
> Sample n values (effector genes) from "Ind_Effectors", not replacing
> Expressed_Genes <- sample(Ind_Effectors,n,replace=F)
>
> If I run this I receive the error:
> Error in sample(Ind_Effectors, n, replace = F) :
>   cannot take a sample larger than the population when 'replace = FALSE'
>
> What I think this means is rather than picking out n values from the
> whole set of values in "Ind_Effectors" it's trying to sample the
> whole lot n times, which it cannot do because replace=F. This is not
> what I need, what I need is n values sampled from "Ind_Effectors",
> not all values from Ind_Effectors sampled n times.
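
A hedged sketch of one likely cause of the error quoted above: strsplit() 
returns a list with a single element, so sample() sees a population of size 1. 
The three gene values below are invented stand-ins for one "Effectors" cell.

```r
# Hypothetical stand-in for one "Effectors" cell: as.character() on a list
# deparses each element into a single string.
cell <- as.character(list(c(12, 7, 305)))
cell

# strsplit() returns a list with ONE element (a character vector).
Ind_Effectors <- strsplit(cell, ",")
length(Ind_Effectors)

# sample() therefore sees a population of size 1, and asking for more than
# one value without replacement fails.
res <- try(sample(Ind_Effectors, 3, replace = FALSE), silent = TRUE)
inherits(res, "try-error")

# Unlisting and stripping the deparse residue recovers the numbers.
genes <- as.numeric(gsub("[^0-9.]", "", unlist(Ind_Effectors)))
sample(genes, 2, replace = FALSE)
```

With replace = TRUE the same call would return the whole vector n times, which 
matches the "whole lot n times" behaviour described.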
>
> I hope this clears up the confusion with what I'm trying to do. It
> may very well be that I'm not instructing R to sample as I require
> properly. Sadly my previous experience with R amounts to loading in
> dataframes from experiments and doing stat analysis & model fitting,
> not simulations or individual-based models.
>
> Best wishes,
>
> Ben W.
> UEA (ENV) & The Sainsbury Laboratory.
>
> P.S. As an aside, I've been thinking about doing this model in an
> alternative way to the one I described in the first bit of my email
> (based on dataframes).
> Instead I would use multi-dimensional ragged array(s):
> The format would be a 2D layout, Where every line is an effector
> gene and every column an aspect of the effector gene(value,
> expression state, fitness contribution etc.) This 2D layout of rows
> and columns is then repeated in the 3rd dimension (the z of x,y,z)
> of the array for each individual. It is ragged in the sense each
> individual, each slice through the array in the z direction, would
> have different numbers of rows - different numbers of effectors.
> This may be easier to work on, but I've not worked with
> multidimensional arrays, I'm used to data in dataframes (usually
> from spreadsheets from experiments).
>
> From: Jean V Adams [jvad...@usgs.gov]
> Sent: 08 November 2012 13:35
> To: Benjamin Ward (ENV)
> Cc: r-help@r-project.org
> Subject: RE: [R] sample from list

> Ben,
>
> You have still not supplied reproducible code for me (and any other
> r-help reader) to run, which makes it very difficult to help you.  I
> can run your first 5 lines of code with no problem.
>
> Path_Number <- 0500
> inds <-data.frame(ID=formatC
> (0001:Path_Number,width=4,flag=0),No_of_Effectors="",No_Expressed_Effectors="")
> inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow
> (inds),function(x) runif(1, min=1, max=550))))
> Effectors <- lapply(1:nrow(inds),function(x) sample(1:10000,inds
> $No_of_Effectors,replace=TRUE))
> inds <- data.frame(inds,Effectors=as.character(Effectors))
>
> But your 6th line of code doesn't work ... there is no object inds2.
>
> Ind_Genes<-strsplit(as.character(inds2[1,4]),",")
>
> If I use code that you provided in your earlier e-mail to create
> inds2, I get errors because inds doesn't have a variable No_of_Genes.
>
> Genes <- lapply(1:nrow(inds),function(x) sample(1:10000,inds
> $No_of_Genes,replace=TRUE))
> inds2 <- data.frame(inds, Genes=I(Genes))
> inds2$No_Expressed_Genes <- round(as.numeric(lapply(1:nrow
> (inds2),function(x) runif(1, min=10, max=50))))
>
> So, before you hit the send button on your next e-mail, start a
> clean R session with none of your objects in the workspace
> or the search path, and test your code to see if it runs.
>
> You will find many more willing helpers if you supply reproducible code.
>
> You might want to start with a new posting, too, to give more people
> a fresh look.
>
> Jean
>
>
>
> "Benjamin Ward (ENV)" <b.w...@uea.ac.uk> wrote on 11/08/2012 05:04:20 AM:
> >
> > Hi,
> >
> > Thanks, for the reply.
> >
> > I should explain more, I'll be as brief as I can, the code for
> > generating the dataframe is below.
> >
> > What I'm doing is individual-based modelling of a pathogen and its
> > host. The way I've thought of doing this is with two dataframes, one
> > of the pathogen and its genes and effectors, and one of the host
> > and its resistance genes. During the processes of the model these
> > things can be pulled out of the dataframes and operated on, before
> > being stored again in the dataframes.
> >
> > I have generated my dataset as below, it was suggested by "arun" in
> > a reply to a previous email I wrote with the subject "Trouble with
> > data structures".
> >
> > Path_Number <- 0500 # The number of pathogen individuals in the population.
> > # Create the initial dataframe, with initial number of effectors and
> > # initial number of expressed effectors.
> > inds <-data.frame(ID=formatC
> > (0001:Path_Number,width=4,flag=0),No_of_Effectors="",No_Expressed_Effectors="")
> > # Generate the number of effectors genes each individual has.
> > inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow
> > (inds),function(x) runif(1, min=1, max=550))))
> > # Generate the actual effector genes.
> > Effectors <- lapply(1:nrow(inds),function(x) sample(1:10000,inds
> > $No_of_Effectors,replace=TRUE))
> > #Add them to the dataframe
> > inds <- data.frame(inds,Effectors=as.character(Effectors))
> >
> > What I'm trying to do is for each individual, extract the values in
> > the Effector genes cell to an object. As far as I can tell,
> >
> > Ind_Genes<-strsplit(as.character(inds2[1,4]),",")
> >
> > Will do this for the first individual or I can get all of them with
> >
> > All_Genes<-strsplit(as.character(inds2[,4]),",")
> >
> > What I then want to do is according to a generated number for each
> > individual...
> >
> > round(as.numeric(lapply(1:nrow(inds2),function(x) runif(1, min=10,max=50))))
> >
> > ... sample that many genes from Ind_Genes and make a new object
> > called Expressed_Genes, which can be stored in the dataframe. My
> > attempt at doing this is:
> >
> > Expressed_Genes<-lapply(First_Ind_Genes,function(x) sample
> > (First_Ind_Genes,round(runif(1, min=10, max=50)),replace=F))
> >
> > to get Expressed genes for each individual, this might be part of a
> > for loop, or to the whole list of every individuals genes like so:
> >
> > Expressed_Genes<-lapply(All_Genes,function(x) sample(All_Genes,3,replace=F))
> >
> > What usually happens however is I get errors:
> > Error in sample(First_Ind_Genes, round(runif(1, min = 10, max = 50)),  :
> >   cannot take a sample larger than the population when 'replace = FALSE'
> >
> > or it will rather than sample 3 values, sample all the values, 3
> > times if I allow replacement (which I don't want).
> >
> > So it's not sampling 3 values for me, but the whole lot of values 3 times.
> >
> > I do not know of another way to extract these gene values and then
> > do things with them.
> > For my model it is essential I can pull the genes or expressed genes
> > out of the dataframe, work functions or operations on them and then
> > store them back again. For example if an individual turns a gene on
> > that was not before, then the genes would need to be pulled from the
> > database, as would the expressed genes, and a random value from the
> > genes object added to the expressed genes object, and then they
> > could both be put back. A similar thing would happen when I wanted
> > to mutate the genes.
> >
> > In short my aim is pull genes or expressed genes out, work functions
> > or operations on them and then store them back again.
> >
> > Hopefully I've explained better. I have been thinking of changing my
> > approach from datasets of pathogen and host, from which values are
> > pulled into objects and operated on, to multi-dimensional ragged
> > arrays. I've been told this may be simpler for me.
> >
> > Where every line is an effector gene and there can be columns for
> > the gene value, expression state (1 or 0/T or F), fitness
> > contribution etc. This 2D layout of rows and columns is then
> > repeated in the z dimension of the array for each individual. It is
> > ragged in the sense each individual, each slice through the array in
> > the z direction, would have different numbers of rows - different
> > numbers of effectors. I can then simulate mutations by changing the
> > gene values, cause duplications by adding rows of duplicated genes,
> > or even cause deletions by removing rows.
> > Once I have this set up for the pathogen I may make a similar array
> > for the host plants, then perhaps with indexing or some such thing I
> > can write functions to do the interactions and immunology and such.
> >
> > Best,
> >
> > Ben W.
> >
> > UEA (ENV) & The Sainsbury Laboratory.
> >
> > From: Jean V Adams [jvad...@usgs.gov]
> > Sent: 07 November 2012 21:12
> > To: Benjamin Ward (ENV)
> > Cc: r-help@r-project.org
> > Subject: Re: [R] sample from list
>
> > Ben,
> >
> > Can you provide a small example data set for
> >         inds
> > so that we can run the code you have supplied?
> > It's difficult for me to follow what you've got and where you're
> > trying to go.
> >
> > Jean
> >
> >
> >
> > "Benjamin Ward (ENV)" <b.w...@uea.ac.uk> wrote on 11/06/2012 03:29:52 PM:
> > >
> > > Hi all,
> > >
> > > I have a list of genes present in 500 individuals, the individuals
> > > are the elements:
> > > Genes <- lapply(1:nrow(inds),function(x) sample(1:10000,inds
> > > $No_of_Genes,replace=TRUE))
> > >
> > > (This was later written to a dataframe as well as kept as the list
> > > object: inds2 <- data.frame(inds,Genes=I(Genes)))
> > >
> > > I also have a vector of how many of those genes are expressed in
> > > the individuals; this can also be kept as a vector object or written to
> > > a data frame:
> > >
> > > inds2$No_Expressed_Genes <- round(as.numeric(lapply(1:nrow
> > > (inds2),function(x) runif(1, min=10, max=50))))
> > >
> > > I want to create another list which consists of each individual's
> > > expressed genes - essentially a subset of the total genes the
> > > individuals have in the "Genes" list, by sampling from the Genes
> > > list for each individual, the number of genes (values)in the
> > > Num_Expressed_Genes vector. i.e. if Num_Expressed_Genes = 3 then
> > > sample 3 values from the element in the Genes list. I can't quite
> > > figure it out though. So far I have the following:
> > >
> > > # Defines the number of expressed genes for each individual in my
> > > # data frame.
> > > Num_Expressed_Genes <- round(as.numeric(lapply(1:nrow
> > > (inds2),function(x) runif(1, min=10, max=50))))
> > >
> > >
> > > # My attempts to apply the sample function to every element
> > > # (individual organism) of the "Genes" list, to subset the genes expressed.
> > > Expressed_Genes <- lapply(1:nrow(inds),function(x) sample
> > > (Genes,Num_Expressed_Genes, replace=FALSE))
> > > Expressed_Genes <- lapply(Genes,function(x) sample
> > > (Genes,Num_Expressed_Genes, replace=FALSE))
> > >
> > > So far though I'm getting results like this:
> > >
> > > [[49]]
> > > [[49]][[1]]
> > >   [1] 3540   27 5344 7278 9758 8077 ............................... [217]
> > >
> > >
> > > [[49]][[2]]
> > >   [1]  740 3362 8588 8574 4371 1447 .............................. [340]
> > >
> > >
> > > When what I need is more:
> > >
> > > [[49]]
> > > [1] 6070 1106 6275
> > > In a case where Num_Expressed_Genes = 3 and the values are taken
> > > from the much larger set of values for element (individual) 49 in my
> > > Genes list.
> > >
> > > I'm not sure what I'm doing wrong but it seems what is happening is
> > > instead of picking out a few values according to the
> > > Num_Expressed_Genes vector - as an example say 3 again, It's drawing
> > > a large number of values, if not all of them, from elements in the
> > > list, 3 times.
> > >
> > > Any help is greatly appreciated,
> > > I've thought of using loops to achieve the same task, but I'm trying
> > > to get my individual/genes/expressed genes data.frame set up for my
> > > individual-based model and get it running using vectors and as
> > > few loops as possible.
> > >
> > > Thanks,
> > > Ben.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
