Instead of writing some long, ugly, "script", the way to use R is to break problems down into distinct tasks. Reading data is one task, and performing regressions on the data, plotting & summarising are different tasks. Write functions to do each task in general, and then use those functions.
So one task is reading the data from a Coverage dir. You want to do a linear regression on the data, so you want to have the data stored as a data frame. Following on from Don McQueen's good advice, here's a function that does the job: read.data.from.coverage.dir <- function(dir, pattern="Length_[0-9]+", min.length=0, max.length=Inf) { ## return a data frame with lengths in first column and means of ## file contents in second column files <- list.files(dir, pattern) lengths <- as.numeric(gsub("Length_", "", files, perl=TRUE)) files <- files[lengths >= min.length && lengths <= max.length] get.mean.from.file <- function(file) mean(scan(file.path(dir,file), quiet=TRUE)) data.frame(x=lengths, y=sapply(files, get.mean.from.file)) } And here's a function, that uses the first one, to get all the data from your various coverage dirs get.all.data <- function(topdir) { coverage.dirs <- list.files(path=topdir, pattern="Coverage", full.names=TRUE) lapply(coverage.dirs, read.data.from.coverage.dir) } So now you can do ## read all the data all.data <- get.all.data(topdir="~") ## perform all the regressions regression.fits <- lapply(all.data, function(df) lm(y ~ x, data=df)) ## summarise them summaries <- lapply(regression.fits, summary) ## etc All those commands are generating lists of objects; lapply is a shorthand for doing a for loop over a list. You can use sink() to redirect output, but it would probably be better to create tables and/or figures in R first, then write them to files. Dan On Tue, Sep 16, 2008 at 07:01:42AM -0700, bioinformatics_guy wrote: > > Don, > Excellent advice. I've gone back and done a bit of coding and wanted to see > what you think and possibly "shore up" some of the technical stuff I am > still having a bit of difficulty with. > > I'll past the code I have to date with any important annotations: > > topdir="~" > library(gmodels) > > setwd(topdir) > > ### Will probably want to do two for loops as opposed to recursive > files=list.files(path=topdir,pattern="Coverage") > > for (i in files) > { > dir=paste("~/hangers/",i,sep="") > > files2=list.files(path=dir,pattern="Length") > > ### Make an empty matrix that will have the independent variable as > the filenum and the dependent variable > ### as the mean of the length or should I have two vectors for the > regression. Basically the Length_(\d+) is the independent variable (which > is taken from the filename) which all the regressions will have and then > inside the Length_(\d+) is a 1d set of numbers which I take the mean of > which in turn becomes the dependent variable. So in essence the points are: > f(length)=mean(length$V1) > f(45)=50 > f(50)=60 > etc ... > > > for (j in files2) > { > ## I just rearranged the following line but I'm not sure what the > command is doing > ## I am assuming 'as.numeric' means take the input as a number > instead of a string and the gsub has #me stumped > > filenum=as.numeric(gsub('Length_','',j)) > > ## Can I assign variables at the top instead of hardcoding? like > upper=50 , lower=30? > ## And I don't need to put brackets for this if statement do I? > Does it basically just > ## say that if the filenum is outside those parameters, just go to > the next j in files2? > if (filenum > 200 | filenum < -10) next > > dir2=paste("~/hangers",i,j,sep="/") > > tmp=read.table(dir2) > > mean(tmp($V1)) > > Now should I put these in a matrix or a vector (all j values (length > vs mean(tmp$V1) for each i iteration) > } > } > > I think lastly, Id like to get a print out of each of the regressions (each > iteration of i). Is that when I use the summary command? And, like in > unix, can I redirect the output to a file? > > Best > > > Don MacQueen wrote: > > > > I can't go through all the details, but hopefully this will help get > > you started. > > > > If you look at the help page for the list.files() function, you will see > > this: > > > > list.files(path = ".", pattern = NULL, all.files = FALSE, > > full.names = FALSE, recursive = FALSE, > > ignore.case = FALSE) > > > > The "." in path means to start at your current working directory. > > Assuming your 5 Coverage directories are subdirectories of your > > current working directory, that's what you want. > > > > Then, setting recursive to TRUE will cause it to also list the > > contents of all subdirectories. Since your Length files are in the > > Coverage subdirectories, that's what you want. > > > > Finally, the pattern argument returns only files that match the > > pattern, so something like > > patter="Length" > > should get you just the files you want. > > > > The result is a character vector containing the names of all your > > Length files. Try it and see. > > > > Then, a simple loop over the over the vector of filenames, with an > > appropriate scan() or read.table() command for each, will read the > > data in. > > > > If you need to restrict the files, say Length_20, Length_25, > > Length_30, etc. then you'll have to do some more work. > > Look at > > as.numeric(gsub( 'Length_', '', filename)) > > to get just the number part of the filename, as a number, and then > > you can use numeric inequalities to identify whether or not any > > particular file is to be processed. > > > > Since you haven't shown what the contents of your files look like > > (two columns of numbers or what), I have no idea what to suggest for > > the part having to do with reading them in, plotting or doing linear > > regression. > > > > The basic function for linear regression is lm(). > > > > > > Here is a summary: > > > > files <- list.files( '~' , pattern='Length', recursive=TRUE) > > > > for (fl in files) { > > > > ## optional, to restrict to only certain files > > filenum <- as.numeric(gsub( 'Length_', '', filename)) > > > > ## skip to next file if it isn't in the correct number range > > if (filenum > 50 | filenum < 20) next > > > > ## a command to read the current file. perhaps: > > ## tmp <- read.table(fl) > > > > ## commands to do statistics on the data in the current file. perhaps: > > ## fit <- lm( y ~ y, data=tmp) > > > > ## some output > > cat('------ file =',fl,'-----\n') > > print(fit) > > > > } > > > > This example doesn't restrict only to certain Coverage subdirectories. > > > > -Don > > > > > > > > At 9:29 AM -0700 9/15/08, bioinformatics_guy wrote: > >>Im very new to R so this might be a very simple question. First I'll lay > out > >>the hierarchy of my directories, goals. > >> > >>I have say 5 directories of form "Coverage_(some number)" and each one of > >>these I have text files of form "Length_(some number)" which are comprised > >>of say 30 numbers. Each one of these Length files (which are basically > >>incremented by 5 from 0 to 100, Length_(0,5,10,15,20) are to be averaged > >>where the average is the y-value and the length is the x-value in a linear > >>regression. > >> > >>What I want to do is, write a script that looks in each of the coverage > >>directories and then reads in each of the files, takes the means, and > plots > >>them in form I specified above. The catch is, what if I only want to plot > >>say Length_(20-50) and what command/method is best for a linear > regression? > >>I've looked at m1(), but have not gotten it to work properly. > >> > >>Below is some of the code I've put together: > >> > >>topdir="~" > >> > >>setwd(topdir) > >> > >>### Took this function from a friend so I'm not sure what its doing > besides > >>grep-ing a directory? > >>ll<-function(string) > >>{ > >> grep(string,dir(),value=T) > >>} > >> > >>### I believe this is looking for all files of form below > >>subdir = ll("Coverage_[1-9][0-9]$") > >> > >>### A for loop iterating through each of the sub directories. > >>for (i in subdir) > >>{ > >> #not sure what this line is doing as I found it on the internet > >> on a > >>similar function > >> setwd(paste(getwd(),i,sep="/")) > >> #This makes a vector of all the file names > >> filelist=ll("Length_") > >> > >>Can I use a regex or logic to only take the filelist variables I want? > >>And can I now get the mean of each Length_* and set in a matrix (length x > >>mean)? > >> > >>Then finally, how to do a linear regression of this. > >> > >>-- > >>View this message in context: http:// www. > >>nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19496451.html > >>Sent from the R help mailing list archive at Nabble.com. > >> > >>______________________________________________ > >>R-help@r-project.org mailing list > >>https:// stat.ethz.ch/mailman/listinfo/r-help > >>PLEASE do read the posting guide http:// www. > R-project.org/posting-guide.html > >>and provide commented, minimal, self-contained, reproducible code. > > > > > > -- > > -------------------------------------- > > Don MacQueen > > Environmental Protection Department > > Lawrence Livermore National Laboratory > > Livermore, CA, USA > > 925-423-1062 > > > > ______________________________________________ > > R-help@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > > > > -- > View this message in context: > http://www.nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19512508.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. -- http://www.stats.ox.ac.uk/~davison ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.