Thanks for your insights, Matthew. Actually, some of the merged documents are over 1,000 pages. I have never had a programming class, and I had one college statistics course in 1975. I might be in over my head, but R, along with Word's mail merge, has allowed me to put together some pretty useful reports.
Thanks for your suggestions; I am sure they will help improve my R skill set.

Jerry Floren
Minnesota Department of Agriculture


Matthew Dowle-3 wrote:
>
> Did you really say you're using Word's mail merge to construct "hundreds" of
> pages of R code which you then paste into R? It sounds like you just missed
> somehow how to create a function in R. Did you fully read "An Introduction
> to R"? Did you know R can read .xls files directly, and connect to
> spreadsheets as if they were databases? See ?odbcConnectExcel.
>
> Your graphs may exist and be beautiful, but are they correct? This link
> contains a formal discussion of the topic:
> http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html
>
>
> "Jerry Floren" <jerry.flo...@state.mn.us> wrote:
>>
>> As a novice R user, I face a similar challenge. I am almost afraid to share
>> with this group how I solved it. About 65 labs in our proficiency program
>> submit data on individual Excel spreadsheets with three replicates. There
>> are always a few labs that do not complete the full set of three
>> replicates, and I do not want their data included in my analysis.
>>
>> First, I combine all the individual spreadsheets into one large Excel
>> spreadsheet. The replicates are in three columns: rep1, rep2, and rep3. I
>> sort on each individual rep column in Excel. Then I go to both the top and
>> the bottom of the list.
>>
>> For example, I sort on rep1 and go to the top of the list to delete any
>> rows where a value for rep1 was not recorded. Then I go to the bottom of
>> the list and delete any rows where rep1 is text instead of a number, for
>> example "<0.001". I should say that the labs are instructed that they must
>> complete all three replicates and must not enter results as text. Next I
>> repeat the process for rep2 and rep3.
>>
>> I'll do a little more work in Excel on the large, combined table with all
>> the lab data. I calculate in Excel the mean, standard deviation, and
>> coefficient of variation for each of the three reps. Finally, I filter all
>> the data and delete duplicate rows. This is necessary because I sometimes
>> accidentally copy the same spreadsheet from a lab into my large table
>> twice. Finally, I save the cleaned-up table in *.csv format, which is
>> easily read into R.
>>
>> I know that R can do all of these things, but if you are just learning how
>> to use R, it might be easier to do some initial work in Excel, or a similar
>> spreadsheet, before running your data through R.
>>
>> I also use MS Word's mail merge feature to generate my code. I'll get three
>> or four pages of code doing what I want for a single analytical test, for
>> example calcium. Then I'll use the mail merge feature to generate hundreds
>> of pages of code for the other analytical tests (nitrogen, phosphorus,
>> potassium, etc.). I just copy and paste the large, merged Word document
>> into R. R cranks away for 30 minutes and I end up with several large tables
>> (and these get additional editing in Excel) and hundreds of beautiful
>> graphs that would take weeks to create in Excel.
>>
>> I was amazed that Word would work. I expected all of Word's special print
>> control codes would mess things up. I just recently received a new laptop
>> computer, and now I have an occasional problem with Word's "pretty print
>> quotes" (smart quotes), but if you know about that problem, it is easy to
>> fix.
>>
>> Jerry Floren
>> Minnesota Department of Agriculture
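For reference, here is a minimal R sketch of the cleaning steps described in the quoted message above, assuming the combined data have been exported once from Excel to a CSV file and that the mean, standard deviation, and CV are wanted across the three replicates for each row. The file name and the columns lab, test, rep1, rep2 and rep3 are hypothetical placeholders, not taken from the thread:

## Sketch only: file and column names (lab, test, rep1, rep2, rep3) are hypothetical
dat <- read.csv("combined_labs.csv", stringsAsFactors = FALSE)

## Coerce the replicate columns to numeric; text entries such as "<0.001" become NA
rep_cols <- c("rep1", "rep2", "rep3")
dat[rep_cols] <- lapply(dat[rep_cols], function(x) as.numeric(as.character(x)))

## Drop accidental duplicates (the same lab spreadsheet pasted in twice)
dat <- unique(dat)

## Keep only rows where all three replicates are present
dat <- dat[complete.cases(dat[rep_cols]), ]

## Mean, standard deviation and coefficient of variation across the three reps
reps     <- as.matrix(dat[rep_cols])
dat$mean <- rowMeans(reps)
dat$sd   <- apply(reps, 1, sd)
dat$cv   <- 100 * dat$sd / dat$mean

The coercion to numeric raises a warning and turns the text entries into NA, so complete.cases() then drops exactly the rows that would otherwise be deleted by hand after sorting in Excel. A companion sketch that replaces the mail-merge step with a function and a loop follows the quoted data.table example below.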
>>
>>
>> Matthew Dowle-3 wrote:
>>>
>>> As can data.table (i.e. do the 'having' in one statement):
>>>
>>>> DT = data.table(DF)
>>>> DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]
>>>       NAME n       V2
>>> [1,] James 3 64.00000
>>> [2,]   Tom 3 78.66667
>>>
>>> but data.table isn't restricted to SQL functions (such as avg); any R
>>> functions can be used, sometimes for their side effects (such as plotting)
>>> rather than just returning data.
>>>
>>> Further, data.table has a thing called 'join inherited scoping'. Say we
>>> knew the specific groups: we can go directly to them (without even looking
>>> at the rest of the data in the table) in very short and convenient syntax,
>>> which also happens to run quickly on large data sets (but can be useful
>>> just for the syntax alone):
>>>
>>>> setkey(DT,NAME)
>>>> DT[c("James","Tom"),mean(SCORE),mult="all"]
>>>       NAME       V1
>>> [1,] James 64.00000
>>> [2,]   Tom 78.66667
>>>
>>> Notice there is no "group by" or even a "by" in the above. It inherits the
>>> scope from the join because mult="all" means that "James" matches multiple
>>> rows, as does "Tom", creating two groups. It does this by binary search to
>>> the beginning of each group, binary search to the end of the group, and
>>> then runs the R expression inside the scope of that group.
>>>
>>> An example of join inherited scoping for the side effects only:
>>>
>>>> pdf("out.pdf")
>>>> DT[c("James","Tom"),plot(SCORE),mult="all"]
>>> NULL data table
>>>> dev.off()
>>> # out.pdf now contains 2 plots
>>>
>>> which you couldn't do in SQL, because SQL has no plotting (or any of R's
>>> other packages).
>>>
>>> It aims to do this quickly, where 'quickly' means 1) shorter code is
>>> quicker to write, read, debug and maintain, and also 2) quicker to
>>> compute, and it is 1 that often dominates 2.
>>>
>>> Finally, consider the following two statements, which are equivalent:
>>>
>>>> sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) = 3")
>>>    NAME avg(SCORE)
>>> 1 James   64.00000
>>> 2   Tom   78.66667
>>>> DT[ J(DT[,length(NAME),by="NAME"][V1==3,NAME]), mean(SCORE), mult="all"]
>>>    NAME avg(SCORE)
>>> 1 James   64.00000
>>> 2   Tom   78.66667
>>>
>>> Now, OK, I hear you groaning (!) that the 2nd looks (at first glance)
>>> ugly, but bear with me ... in the SQL solution, do you know for sure that
>>> avg(SCORE) isn't computed wastefully for all the groups that don't have
>>> count(*)=3? It might well do the 'group by' first for all the groups, then
>>> do the 'having' afterwards as a 'where' on the result. It might depend on
>>> the particular SQL database being used (MySQL, SQLite, etc.), the
>>> installation parameters, any indexes, etc. Some investigation would be
>>> required (taking time) if someone doesn't already know. In the data.table,
>>> however, the syntax explicitly makes it clear that mean(SCORE) is only
>>> computed for the particular groups. For certain, always. Maybe this
>>> particular example is not a good one, but I'm trying to demonstrate an
>>> overall syntax which is scalable (i.e. this syntax can do more complicated
>>> things that SQL can't do, or can't do well). Notice that the method
>>> earlier on, i.e. DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3], is
>>> simpler but wasteful, as it does compute mean(SCORE) for all the groups.
>>> But the syntax explicitly conveys what is being done, and the user has the
>>> choice.
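Picking up the suggestion near the top of this thread to write a function rather than generate hundreds of pages of code with mail merge: here is a rough sketch, building on the hypothetical cleaned dat from the sketch above. run_test(), the test column, and the output file names are all made-up placeholders standing in for whatever the three or four pages of per-test code actually do:

## Sketch only: run_test() and the column/file names are hypothetical
run_test <- function(test_name, dat) {
  d <- dat[dat$test == test_name, ]         # rows for this analytical test
  pdf(paste(test_name, ".pdf", sep = ""))   # one file of graphs per test
  plot(d$mean, main = test_name)
  dev.off()
  data.frame(test = test_name,              # small summary table for this test
             n    = nrow(d),
             mean = mean(d$mean),
             sd   = sd(d$mean))
}

tests <- c("calcium", "nitrogen", "phosphorus", "potassium")  # etc.
summaries <- do.call(rbind, lapply(tests, run_test, dat = dat))
write.csv(summaries, "all_tests_summary.csv", row.names = FALSE)

One function applied with lapply() over the vector of test names replaces the merged Word document entirely, and adding a new analytical test is then a one-word change to the tests vector.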
>>>
>>>
>>> "Gabor Grothendieck" <ggrothendi...@gmail.com> wrote:
>>> Here is the solution using sqldf which can do it in one statement:
>>>
>>>> # read in data
>>>> Lines <- "OBS NAME SCORE
>>> + 1 Tom 92
>>> + 2 Tom 88
>>> + 3 Tom 56
>>> + 4 James 85
>>> + 5 James 75
>>> + 6 James 32
>>> + 7 Dawn 56
>>> + 8 Dawn 91
>>> + 9 Clara 95
>>> + 10 Clara 84"
>>>>
>>>> DF <- read.table(textConnection(Lines), header = TRUE)
>>>>
>>>> # run
>>>> library(sqldf)
>>>> sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) = 3")
>>>    NAME avg(SCORE)
>>> 1 James   64.00000
>>> 2   Tom   78.66667
>>>
>>>
>>> On Tue, Jan 5, 2010 at 2:03 PM, Gabor Grothendieck
>>> <ggrothendi...@gmail.com> wrote:
>>>> Have a look at this post and the rest of that thread:
>>>>
>>>> https://stat.ethz.ch/pipermail/r-help/2010-January/223420.html
>>>>
>>>> On Tue, Jan 5, 2010 at 1:29 PM, Geoffrey Smith <g...@asu.edu> wrote:
>>>>> Hello, does anyone know how to take the mean for a subset of
>>>>> observations? For example, suppose my data looks like this:
>>>>>
>>>>> OBS NAME  SCORE
>>>>> 1   Tom   92
>>>>> 2   Tom   88
>>>>> 3   Tom   56
>>>>> 4   James 85
>>>>> 5   James 75
>>>>> 6   James 32
>>>>> 7   Dawn  56
>>>>> 8   Dawn  91
>>>>> 9   Clara 95
>>>>> 10  Clara 84
>>>>>
>>>>> Is there a way to get the mean of the SCORE variable by NAME, but only
>>>>> when the number of observations is equal to 3? In other words, is there
>>>>> a way to get the mean of the SCORE variable for Tom and James, but not
>>>>> for Dawn and Clara? Thank you.
>>>>>
>>>>> --
>>>>> Geoffrey Smith
>>>>> Visiting Assistant Professor
>>>>> Department of Finance
>>>>> W. P. Carey School of Business
>>>>> Arizona State University
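For completeness, the same "mean of SCORE by NAME, but only for the names with exactly three observations" can also be done in base R alone. This is just a sketch alongside the sqldf and data.table versions above, reusing the DF built from Lines in Gabor's message:

## Base R sketch, using the DF defined from Lines above
counts <- table(DF$NAME)              # observations per NAME
keep   <- names(counts)[counts == 3]  # names with exactly three rows

## mean SCORE computed only for those names
aggregate(SCORE ~ NAME, data = DF[DF$NAME %in% keep, ], FUN = mean)
## should give James 64.00000 and Tom 78.66667, matching the results above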