Re: [R] metafor - code for analysing geometric means
While you may get a helpful reply, I think this is really not the forum for such relatively basic math/stat questions. As you seem to be more or less at sea here, I really, really suggest that you seek help from a local statistical resource.

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom." Clifford Stoll

On Sun, Dec 7, 2014 at 10:25 PM, Purssell, Ed wrote:

> Dear All
>
> I have tried very hard to work out what to do with putting logged data into
> metafor; the paper says:
>
> 'geometric mean antibody concentrations (GMCs) or opsonophagocytic activity
> titres (geometric mean titres [GMT]) were calculated with 95% CIs by taking
> the antilog of the mean of the log concentration or titre transformations.'
>
> Does this look right if I take the reported mean, upper and lower bound of
> the CI, and the number?
>
> m <- log(mean)
> ub <- log(upper bound)
> lb <- log(lower bound)
> diff <- ub - lb
> SE <- diff/3.92
> SD <- SE*sqrt(n)
>
> Then put m, SD and n for each group into metafor as normal. Or is there a
> better way? I am afraid I didn't understand how to do it on a log scale.
>
> Thank you
>
> Edward
>
> Edward Purssell PhD
> Senior Lecturer
>
> Florence Nightingale Faculty of Nursing and Midwifery
> King's College London
> James Clerk Maxwell Building
> 57 Waterloo Road
> London SE1 8WA
> Telephone 020 7848 3021
> Mobile 07782 374217
> email edward.purss...@kcl.ac.uk
> https://www.researchgate.net/profile/Edward_Purssell
>
> From: Viechtbauer Wolfgang (STAT)
> Sent: 14 November 2014 10:40
> To: Michael Dewey; Purssell, Ed; r-help@r-project.org
> Subject: RE: [R] metafor - code for analysing geometric means
>
> With "geometric mean 1 CI /3.92", I assume you mean "(upper bound - lower
> bound) / 3.92". Two things:
>
> 1) That will give you the SE of the mean, not the SD of the observations
> (which is what you need as input).
> 2) Probably the CI for the geometric mean was calculated on the log scale (as
> Michael hinted at). Check whether log(upper bound) and log(lower bound) are
> (within rounding error) symmetric around log(geometric mean). Then
> (log(upper bound) - log(lower bound)) / 3.92 * sqrt(n) will give you the SD
> of the log of the values used to compute the geometric mean. Then you could
> use log(geometric mean) and that SD as input. But this would give you the
> difference of the log-transformed geometric means. Not sure if this is what
> you want to analyze.
>
> Two more articles that may be helpful here:
>
> Friedrich, J. O., Adhikari, N. K., & Beyene, J. (2012). Ratio of geometric
> means to analyze continuous outcomes in meta-analysis: Comparison to mean
> differences and ratio of arithmetic means using empiric data and simulation.
> Statistics in Medicine, 31(17), 1857-1886.
>
> Souverein, O. W., Dullemeijer, C., van 't Veer, P., & van der Voet, H.
> (2012). Transformations of summary statistics as input in meta-analysis for
> linear dose-response models on a logarithmic scale: A methodology developed
> within EURRECA. BMC Medical Research Methodology, 12(57).
>
> Best,
> Wolfgang
>
>> -----Original Message-----
>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
>> On Behalf Of Michael Dewey
>> Sent: Thursday, November 13, 2014 12:36
>> To: Purssell, Ed; r-help@r-project.org
>> Subject: Re: [R] metafor - code for analysing geometric means
>>
>> On 13/11/2014 11:00, Purssell, Ed wrote:
>> > Dear All
>> >
>> > I have some data expressed in geometric means and 95% confidence
>> intervals. Can I code them in metafor as:
>> >
>> > rma(m1i=geometric mean 1, m2i=geometric mean 2, sd1i=geometric mean 1
>> CI /3.92, sd2i=geometric mean 2 CI/3.92...etc, measure="MD")
>>
>> Would it not be better to work on the log scale?
>>
>> > All of the studies use geometric means.
>> >
>> > Thanks!
>> > Edward
>>
>> --
>> Michael
>> http://www.dewey.myzen.co.uk

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
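[Editor's note] To make Wolfgang's back-calculation concrete, here is a small base-R sketch with made-up numbers (geometric mean 100, 95% CI from 80 to 125, n = 50 are illustrative only, not from any study). The metafor call itself is only sketched in a comment, since the per-arm inputs are the point:

```r
# Hypothetical reported summary for one arm (illustrative numbers)
gm <- 100; lb <- 80; ub <- 125; n <- 50

m_log  <- log(gm)                     # mean of the log-scale values
se_log <- (log(ub) - log(lb)) / 3.92  # SE of the log-scale mean (CI width / (2 * 1.96))
sd_log <- se_log * sqrt(n)            # back to the SD of the log-scale observations

# The CI should be (near) symmetric on the log scale if it was built there,
# as Wolfgang suggests checking:
stopifnot(abs((log(ub) - m_log) - (m_log - log(lb))) < 1e-6)

round(c(m = m_log, sd = sd_log, n = n), 4)

# Per-arm m/sd/n computed this way could then go into, e.g.,
# metafor::escalc(measure = "MD", m1i = ..., sd1i = ..., n1i = ..., ...)
# with everything kept on the log scale.
```

Note that the symmetry check is exact here only because the example numbers were chosen so that 100/80 = 125/100; real reported CIs will be symmetric only up to rounding.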
[R] How to plot bootstrap distribution in rms package?
Hi guys, I am trying to familiarize myself with bootstrapping using the rms package and am stuck on how to plot the bootstrap distribution. I would appreciate it if somebody could help. Thanks in advance.

From http://stats.stackexchange.com/questions/64788/interpreting-a-logistic-regression-model-with-multiple-predictors

library(rms)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mylogit <- lrm(admit ~ gre, x=TRUE, y=TRUE, data = mydata)
mylogit
my.valid <- validate(mylogit, method="boot", B=1000)
my.valid
my.calib <- calibrate(mylogit, method="boot", B=1000)
my.calib
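[Editor's note] validate() and calibrate() return bootstrap summaries rather than the resamples themselves. One way to see the bootstrap distribution is to resample manually; the sketch below uses base R's glm() (which fits the same logistic model as lrm()) on simulated stand-in data, since the UCLA URL may no longer be reachable. Within rms itself, something like bootcov() (see ?bootcov) is reported to retain the resampled coefficients, but treat that as a pointer to check, not gospel:

```r
set.seed(1)
# Simulated stand-in for the admissions data (admit ~ gre); the column
# names mirror the example above, the numbers are made up.
n <- 400
mydata <- data.frame(gre = rnorm(n, 580, 115))
mydata$admit <- rbinom(n, 1, plogis(-4 + 0.005 * mydata$gre))

# Manual case-resampling bootstrap of the gre coefficient
B <- 200
boot_coef <- replicate(B, {
  idx <- sample(nrow(mydata), replace = TRUE)            # resample rows
  coef(glm(admit ~ gre, family = binomial,
           data = mydata[idx, ]))["gre"]                 # refit, keep slope
})

hist(boot_coef, main = "Bootstrap distribution of the gre coefficient",
     xlab = "coefficient")
quantile(boot_coef, c(0.025, 0.975))  # percentile bootstrap interval
```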
Re: [R] date time problem
Thank you very much Jeff. Below is the data I used:

> Corrected_data
              SA_LST  SA_GHI_mean
61759 3/11/2007 1:00          0.0
67517 3/11/2007 2:00          0.0
70017 3/11/2007 3:00          0.0
70524 3/11/2007 4:00          0.0
71061 3/11/2007 5:00          0.0
71638 3/11/2007 6:00          0.0
72113 3/11/2007 7:00     20.20873
72629 3/11/2007 8:00    123.14231
73274 3/11/2007 9:00    306.97343

> test1 <- as.POSIXct(Corrected_data$SA_LST, format="%m/%d/%Y %H:%M", tz="EST5EDT")
> test1
[1] "2007-03-11 01:00:00 EST" NA                        "2007-03-11 03:00:00 EDT" "2007-03-11 04:00:00 EDT"
[5] "2007-03-11 05:00:00 EDT" "2007-03-11 06:00:00 EDT" "2007-03-11 07:00:00 EDT" "2007-03-11 08:00:00 EDT"
[9] "2007-03-11 09:00:00 EDT"
> format(test1, tz="Etc/GMT", usetz=TRUE)
[1] "2007-03-11 06:00:00 GMT" NA                        "2007-03-11 07:00:00 GMT" "2007-03-11 08:00:00 GMT"
[5] "2007-03-11 09:00:00 GMT" "2007-03-11 10:00:00 GMT" "2007-03-11 11:00:00 GMT" "2007-03-11 12:00:00 GMT"
[9] "2007-03-11 13:00:00 GMT"

As you can see, the "3/11/2007 2:00" entry is not recognized and is returned as NA. This may be due to DST, which I do not know how to handle in R.

Best,

Alemu

On Sun, Dec 7, 2014 at 1:59 PM, Jeff Newmiller wrote:

> You have not provided a reproducible example, so anything I say could be
> wrong just due to misinterpretation. Please read [1] for suggestions on
> making your examples reproducible, particularly regarding the use of dput
> to provide example data. You have also posted in HTML format, which can
> cause additional scrambling of communications on this list.
>
> From the notation you are using, I would guess that "Corrected_SA_data" is
> a data frame containing columns "date_time" and "TZ" at a minimum. The
> "date_time" column could be a vector of class POSIXlt or POSIXct... except
> that I cannot reproduce such an object that doesn't print some kind of
> timezone indicator in its output. What version of R are you using?
Note > that such an object contains a tz attribute already, so unless you have > already made sure that the date_time column knows about that timezone, they > are probably unrelated to each other. > > Note that any POSIXct object is internally represented as a specific > instant of time. The tz attribute is only used to control how that time > will be displayed. For example: > > testtime3 <- as.POSIXct( rawtime1, tz="America/New_York" ) > format( testtime3, tz="Etc/GMT", usetz=TRUE ) > > If you want the output to omit the timezone information you can omit the > usetz argument: > > format( testtime3, tz="Etc/GMT" ) > > If you have not yet, you should read [2]. Note that three-letter timezone > indicators (even though they are given in OUTPUT!) are at best unreliable > for use in specifying timezones in R... read ?timezones. > > [1] http://stackoverflow.com/questions/5963269/how-to-make- > a-great-r-reproducible-example > [2] http://www.r-project.org/doc/Rnews/Rnews_2004-1.pdf > > > On Sun, 7 Dec 2014, Alemu Tadesse wrote: > > Dear R users >> >> I am puzzled by the following result from R script. I am trying to convert >> local time to UTC time. Time zone is -5, therefore I used the following >> approach. >> >> Below is the script. >> >>> Corrected_SA_data$date_time[k-1] >>> >> [1] "2007-03-11 01:00:00" >> >>> Corrected_SA_data$TZ[k-1] >>> >> [1] -5 >> >>> Corrected_SA_data$date_time[k-1]-Corrected_SA_data$TZ[k-1]*3600 >>> >> [1] "2007-03-11 07:00:00 MDT" >> >> I was expecting this last value to be something like "2007-03-11 06:00:00 >> UTC" >> >> Please correct me if I ma wrong. >> >> On the other hand I have >> >>> Corrected_SA_data$date_time[k] >>> >> [1] "2007-03-11 02:00:00" >> >>> Corrected_SA_data$TZ[k] >>> >> [1] -5 >> >>> Corrected_SA_data$date_time[k]-Corrected_SA_data$TZ[k]*3600 >>> >> [1] NA >> >> I am not sure why I am getting NA. >> >> Thank you for your help. 
>> Alemu

> ---------------------------------------------------------------------
> Jeff Newmiller
> Research Engineer (Solar/Batteries/Software/Embedded Controllers)
> ---------------------------------------------------------------------
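[Editor's note] The NA in this thread is the US DST "spring forward" gap: 2007-03-11 02:00 does not exist in US Eastern time, since clocks jumped from 02:00 straight to 03:00 that night. A minimal base-R illustration follows; EST5EDT is the zone Alemu used above, while the fixed-offset "Etc/GMT+5" zone is my suggested fix if the timestamps are really constant UTC-5 standard time (see ?timezones):

```r
# Parsing a nonexistent local time; what you get is platform-dependent
# (often NA, as Alemu saw).
gap <- as.POSIXct("3/11/2007 2:00", format = "%m/%d/%Y %H:%M", tz = "EST5EDT")
gap

# If the data are fixed-offset local standard time (UTC-5, no DST), use a
# fixed-offset zone. Note the inverted POSIX sign convention:
# "Etc/GMT+5" means UTC *minus* 5 hours.
t_fixed <- as.POSIXct("3/11/2007 2:00", format = "%m/%d/%Y %H:%M",
                      tz = "Etc/GMT+5")
format(t_fixed, "%Y-%m-%d %H:%M:%S", tz = "UTC", usetz = TRUE)
# "2007-03-11 07:00:00 UTC"
```

With a fixed-offset zone every local time exists, so no NA can arise from the DST gap, and the conversion to UTC is just a display question.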
Re: [R] neighborhood competition index in r package
... and even more directly, googling on "R package Hegyi" brought up the siplab package as the first hit. As it deals with forestry and vegetation, it would appear to be what the OP wanted -- and the OP could have found it him/herself with a minimum of effort.

Sigh... I don't understand why people don't make even a minimal search effort on their own and feel it necessary to ask for help here. It ain't rocket science!

-- Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom." Clifford Stoll

On Sun, Dec 7, 2014 at 6:05 PM, Ben Bolker wrote:

> catalin roibu gmail.com> writes:
>
>> Dear all!
>>
>> Is there a R package to compute neighborhood competition index (Shutz,
>> Hegyi, and many index).
>>
>> Thank you very much!
>
> library("sos")
> findFn("Hegyi")
>
> leads to
>
> http://finzi.psych.upenn.edu/R/library/siplab/html/pairwise.html
>
> Does that help?
Re: [R] neighborhood competition index in r package
catalin roibu gmail.com> writes:

> Dear all!
>
> Is there a R package to compute neighborhood competition index (Shutz,
> Hegyi, and many index).
>
> Thank you very much!

library("sos")
findFn("Hegyi")

leads to

http://finzi.psych.upenn.edu/R/library/siplab/html/pairwise.html

Does that help?
Re: [R] Condensing data.frame
Looking over Jeff's "dplyr" solution, I see that I forgot this part of the original spec: where the first 6 have the highest "countPercentage" So here's a corrected sum_cid() function for my "plyr" solution: summ_cid = function(frm) { # sort by countPercentage sorted_frm = arrange(frm, desc(countPercentage)) # grab at most 6 rows toprows = head(sorted_frm, 6) # add a 7th row toprows[nrow(toprows)+1,] = c(toprows[1,1], NA, NA, NA, sum(toprows$countPercentage)) # all done return(toprows) } -John > -Original Message- > From: Jeff Newmiller [mailto:jdnew...@dcn.davis.ca.us] > Sent: Sunday, December 07, 2014 3:14 PM > To: John Posner > Cc: 'Chel Hee Lee'; Morway, Eric; R mailing list > Subject: Re: [R] Condensing data.frame > > dplyr version (good for large datasets): > > library(dplyr) > > # if original example dat data.frame is used # using read.csv with as.is=TRUE > or stringsAsFactors=FALSE is better dat$tax_name <- as.character( > dat$tax_name ) > > # dplyr pipe chain > ( dat > %>% arrange( site, desc( countPercentage )) %>% group_by( site ) %>% do( > rbind_list( .[ 1:6, c( "site", "tax_name", "countPercentage" ) ] >, data.frame( site=.[ 1, "site" ] >, tax_name="Other" >, countPercentage=sum( .[ -(1:6) >, "countPercentage" ] ) >) >) >) > %>% as.data.frame > ) > > #note that this will break if there are fewer than 7 rows for any site > > On Sun, 7 Dec 2014, John Posner wrote: > > > Here's a solution using the plyr library: > > > > > > library(plyr) > > > > dat <- read.table(header=TRUE, sep=",", as.is=TRUE, ## < > as.is=TRUE > > text="site,tax_name,count,countTotal,countPercentage > > CID_1,Cyanobacteria,46295,123509,37.483098398 > > CID_1,Proteobacteria,36120,123509,29.244832360 > > CID_1,Bacteroidetes,19546,123509,15.825567368 > > CID_1,Verrucomicrobia,7886,123509,6.384959801 > > CID_1,Firmicutes,2843,123509,2.301856545 > > ... 
> > CID_9,Armatimonadetes,27,77120,0.035010373 > > CID_9,Fusobacteria,25,77120,0.032417012 > > CID_9,Aquificae,13,77120,0.016856846 > > CID_9,Synergistetes,12,77120,0.015560166 > > CID_9,Deferribacteres,8,77120,0.010373444 > > CID_9,Thermotogae,7,77120,0.009076763 > > CID_9,Chrysiogenetes,3,77120,0.003890041. > > ") > > > > summ_cid = function(frm) { > > # grab at most 6 rows from data frame toprows = head(frm, 6) # add > > a 7th row toprows[nrow(toprows)+1,] = c(toprows[1,1], "", NA, NA, > > sum(toprows$countPercentage)) # all done > > return(toprows) > > } > > > > result = ddply(dat, "site", summ_cid) > > > > > > Notes: > > > > 1. I needed to add the as.is=TRUE option to read.table() 2. The > > invocation of ddply() does *not* use summarize() 3. Because there is > > no use of summarize(), I have not figured out how to use the dplyr package > in this context. > > > > -John > > > >> -Original Message- > >> From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Chel > >> Hee Lee > >> Sent: Sunday, December 07, 2014 11:43 AM > >> To: Morway, Eric; R mailing list > >> Subject: Re: [R] Condensing data.frame > >> > >> > datBySite <- split(dat, dat$site) > >> > output <- lapply(datBySite, function(x){ > >> + x$idx <- seq_len(nrow(x)) > >> + x$grp <- ifelse(x$idx < 7, x$idx, 7) rval <- > >> + tapply(x$countPercentage, x$grp, sum) x$grp <- x$count <- > >> + x$countTotal <- NULL x <- x[seq_len(7), ] x$tax_name <- > >> + as.character(x$tax_name) x$tax_name[7] <- "Others" > >> + x$new <- rval > >> + return(x) > >> + }) > >> > > >> > head(do.call(rbind, output), 14) > >> sitetax_name countPercentage idx new > >> CID_1.1CID_1 Cyanobacteria 37.483098 1 37.483098 > >> CID_1.2CID_1 Proteobacteria 29.244832 2 29.244832 > >> CID_1.3CID_1 Bacteroidetes 15.825567 3 15.825567 > >> CID_1.4CID_1 Verrucomicrobia6.384960 4 6.384960 > >> CID_1.5CID_1 Firmicutes2.301857 5 2.301857 > >> CID_1.6CID_1 Acidobacteria2.075152 6 2.075152 > >> CID_1.7CID_1 Others1.675182 7 6.684533 > >> 
CID_10.27 CID_10 Proteobacteria 35.366606 1 35.366606 > >> CID_10.28 CID_10 Bacteroidetes 25.188484 2 25.188484 > >> CID_10.29 CID_10 Cyanobacteria 23.294828 3 23.294828 > >> CID_10.30 CID_10 Verrucomicrobia6.970592 4 6.970592 > >> CID_10.31 CID_10 Acidobacteria1.988448 5 1.988448 > >> CID_10.32 CID_10 Actinobacteria1.644548 6 1.644548 > >> CID_10.33 CID_10 Others1.582823 7 5.546493 > >> > > >> > >> I hope this helps. > >> > >> Chel Hee Lee > >> > >> On 12/07/2014 08:21 AM, Morway, Eric wrote: > >>> Using the dataset "dat" (found below), I'm seeking a way to condense > >>> down the data.fram
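[Editor's note] For readers without plyr or dplyr, the same "top 6 plus Other" condensation can be sketched in base R with split() and lapply(). The toy data below are made up; the column names follow the thread:

```r
# Toy stand-in for the thread's dat (two sites, eight taxa each)
dat <- data.frame(
  site = rep(c("CID_1", "CID_2"), each = 8),
  tax_name = paste0("taxon", 1:8),
  countPercentage = c(40, 20, 15, 10, 6, 4, 3, 2,
                      35, 25, 15, 10, 7, 4, 2, 2),
  stringsAsFactors = FALSE
)

summ_cid <- function(frm) {
  frm  <- frm[order(-frm$countPercentage), ]        # highest percentages first
  top  <- head(frm, 6)                              # keep at most 6 rows
  rest <- frm[-seq_len(min(6, nrow(frm))), ]
  if (nrow(rest) > 0)                               # lump the remainder
    top <- rbind(top, data.frame(site = frm$site[1], tax_name = "Other",
                                 countPercentage = sum(rest$countPercentage)))
  top
}

result <- do.call(rbind, lapply(split(dat, dat$site), summ_cid))
result
```

Unlike the dplyr version above, this does not break when a site has fewer than 7 rows: the "Other" row is simply omitted.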
Re: [R] bad STATA dataset import, how to change value labels
>> Excuse the inaccuracy, the warning is "value label missing". The same
>> variable is considered a factor (w/ values ranging from a to b) in one
>> dataset, and an int in the other. I want it to be a factor in both.
>
> So, you are importing two different Stata formatted files and in only one
> of them is the warning being emitted?

Yes. But the variable is the very same variable; it's an administrative/postal code given to each province of the country I'm studying. Factors have mode integer, but does that "int" imply "nominal" as well? I thought I needed a "factor" for that.

One dataset emits a warning and shows the variable as "int". The other seems fine, a factor w/ values ranging from a to b, but when I use the str(data$var) function, the values shown are weird, like 1011000, 1012 ... instead of being four-digit numbers.

>> Post the results of dput(head( dfrm[ , "varname"]))
>
> My intent was for you to substitute the name of your dataframe for the
> token `dfrm`.

I apologize. The output is:

[1] 1 1 1 1 1 1

Thanks for your time

2014-12-07 19:15 GMT+01:00 David Winsemius :

> On Dec 6, 2014, at 6:37 PM, Edoardo Prestianni wrote:
>
> > Excuse the inaccuracy, the warning is "value label missing". the same
> > variable is considered as factor (w/ values ranging from a to b) in one
> > dataset, as int in another. I want it to be a factor in both.
>
> So, you are importing two different Stata formatted files and in only one
> of them is the warning being emitted?
>
> > I think I am missing a package, the output is.
> >
> > Error in head(dfrm[, "variable"]) : object 'dfrm' not found
>
> My intent was for you to substitute the name of your dataframe for the
> token `dfrm`.
>
> head(yourDataObject[, "yourVariableNameInQuotes"])
>
> --
> David.
> > > > > > > 2014-12-07 3:14 GMT+01:00 David Winsemius : > > > > On Dec 6, 2014, at 3:54 PM, Edoardo Prestianni wrote: > > > > > hello, > > > > > > I have imported a couple of .dta datasets, but a variable, instead of > being > > > labeled as factor (w/ values ranging from a to b) is labeled as > integer. > > > > > > How can I fix this? I am sorry if it is a rookie question but I don't > find > > > the command googling. > > > > What "command"? > > > > The word "labeled" is not an R term unless on is talking about the > labels of factor variables in which case there is no problem. Factors have > mode integer. > > > > Post the results of dput(head( dfrm[ , "varname"])) > > > > -- > > David. > > > > > > > Thanks everyone for their help, > > > > > > -- > > > Edoardo Prestianni > > > > > > [[alternative HTML version deleted]] > > > > > > __ > > > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > > and provide commented, minimal, self-contained, reproducible code. > > > > David Winsemius > > Alameda, CA, USA > > > > > > > > > > -- > > Edoardo Prestianni > > David Winsemius > Alameda, CA, USA > > -- Edoardo Prestianni [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
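[Editor's note] As a follow-up to David's point that factors have mode integer: if the imported column arrives as int, coercing it to a factor is one line. A base-R sketch with made-up province codes follows (the column name `var` and the codes are hypothetical; the real fix may instead involve the import itself, e.g. foreign::read.dta's convert.factors argument or the value labels stored in the .dta file, which cannot be checked here):

```r
# Toy stand-in for an imported Stata column of 4-digit province codes
# that came through as integer rather than factor (codes are made up)
data <- data.frame(var = c(1011L, 1011L, 1012L, 1035L, 1012L))

str(data$var)                 # int ...
data$var <- factor(data$var)  # coerce to a nominal variable
str(data$var)                 # Factor w/ 3 levels "1011","1012","1035"

# If both datasets must share one coding, fix the level set explicitly so a
# province absent from one file still gets the same factor level:
all_codes <- sort(unique(c(1011L, 1012L, 1035L)))  # union of codes, illustrative
data$var <- factor(data$var, levels = all_codes)
```

Fixing the levels explicitly matters later: rbind()-ing or comparing the two datasets only behaves if both factors use the same level set.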
[R] neighborhood competition index in r package
Dear all!

Is there an R package to compute neighborhood competition indices (Shutz, Hegyi, and many other indices)?

Thank you very much!

--
---
Catalin-Constantin ROIBU
Lecturer PhD, Forestry engineer
Forestry Faculty of Suceava
Str. Universitatii no. 13, Suceava, 720229, Romania
office phone: +4 0230 52 29 78, ext. 531
mobile phone: +4 0745 53 18 01, +4 0766 71 76 58
FAX: +4 0230 52 16 64
silvic.usv.ro
Re: [R] date time problem
You have not provided a reproducible example, so anything I say could be wrong just due to misinterpretation. Please read [1] for suggestions on making your examples reproducible, particularly regarding the use of dput to provide example data. You have also posted in HTML format, which can cause additional scrambling of communications on this list. From the notation you are using, I would guess that "Corrected_SA_data" is a data frame containing columns "date_time" and "TZ" at a minimum. The "date_time" column could be a vector of class POSIXlt or POSIXct... except that I cannot reproduce such an object that doesn't print some kind of timezone indicator in its output. What version of R are you using? Note that such an object contains a tz attribute already, so unless you have already made sure that the date_time column knows about that timezone, they are probably unrelated to each other. Note that any POSIXct object is internally represented as a specific instant of time. The tz attribute is only used to control how that time will be displayed. For example: testtime3 <- as.POSIXct( rawtime1, tz="America/New_York" ) format( testtime3, tz="Etc/GMT", usetz=TRUE ) If you want the output to omit the timezone information you can omit the usetz argument: format( testtime3, tz="Etc/GMT" ) If you have not yet, you should read [2]. Note that three-letter timezone indicators (even though they are given in OUTPUT!) are at best unreliable for use in specifying timezones in R... read ?timezones. [1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example [2] http://www.r-project.org/doc/Rnews/Rnews_2004-1.pdf On Sun, 7 Dec 2014, Alemu Tadesse wrote: Dear R users I am puzzled by the following result from R script. I am trying to convert local time to UTC time. Time zone is -5, therefore I used the following approach. Below is the script. 
Corrected_SA_data$date_time[k-1]
[1] "2007-03-11 01:00:00"

Corrected_SA_data$TZ[k-1]
[1] -5

Corrected_SA_data$date_time[k-1]-Corrected_SA_data$TZ[k-1]*3600
[1] "2007-03-11 07:00:00 MDT"

I was expecting this last value to be something like "2007-03-11 06:00:00 UTC"

Please correct me if I am wrong.

On the other hand I have

Corrected_SA_data$date_time[k]
[1] "2007-03-11 02:00:00"

Corrected_SA_data$TZ[k]
[1] -5

Corrected_SA_data$date_time[k]-Corrected_SA_data$TZ[k]*3600
[1] NA

I am not sure why I am getting NA.

Thank you for your help.

Alemu

---------------------------------------------------------------------
Jeff Newmiller
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------
Re: [R] date time problem
I would use the 'lubridate' package for this:

> z <- Sys.time()
> z
[1] "2014-12-07 15:43:50 EST"
> require(lubridate)
> with_tz(z, "UTC")
[1] "2014-12-07 20:43:50 UTC"

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Sun, Dec 7, 2014 at 2:25 PM, Alemu Tadesse wrote:

> Dear R users
>
> I am puzzled by the following result from R script. I am trying to convert
> local time to UTC time. Time zone is -5, therefore I used the following
> approach.
>
> Below is the script.
>
> Corrected_SA_data$date_time[k-1]
> [1] "2007-03-11 01:00:00"
>
> Corrected_SA_data$TZ[k-1]
> [1] -5
>
> Corrected_SA_data$date_time[k-1]-Corrected_SA_data$TZ[k-1]*3600
> [1] "2007-03-11 07:00:00 MDT"
>
> I was expecting this last value to be something like "2007-03-11 06:00:00
> UTC"
>
> Please correct me if I am wrong.
>
> On the other hand I have
>
> Corrected_SA_data$date_time[k]
> [1] "2007-03-11 02:00:00"
>
> Corrected_SA_data$TZ[k]
> [1] -5
>
> Corrected_SA_data$date_time[k]-Corrected_SA_data$TZ[k]*3600
> [1] NA
>
> I am not sure why I am getting NA.
>
> Thank you for your help.
>
> Alemu
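[Editor's note] For completeness, here is a base-R sketch of the same conversion without lubridate; the timestamp is hard-coded to mirror Jim's example. A POSIXct value is a fixed instant in time, so "converting" the zone is purely a display matter:

```r
# A fixed instant, entered as US Eastern time (EST in December)
z <- as.POSIXct("2014-12-07 15:43:50", tz = "EST5EDT")

# Display it in UTC without changing the underlying instant
format(z, tz = "UTC", usetz = TRUE)   # "2014-12-07 20:43:50 UTC"

# Or change the display zone on the object itself (base-R analogue of
# lubridate::with_tz)
attr(z, "tzone") <- "UTC"
z
```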
Re: [R] Condensing data.frame
dplyr version (good for large datasets): library(dplyr) # if original example dat data.frame is used # using read.csv with as.is=TRUE or stringsAsFactors=FALSE is better dat$tax_name <- as.character( dat$tax_name ) # dplyr pipe chain ( dat %>% arrange( site, desc( countPercentage )) %>% group_by( site ) %>% do( rbind_list( .[ 1:6, c( "site", "tax_name", "countPercentage" ) ] , data.frame( site=.[ 1, "site" ] , tax_name="Other" , countPercentage=sum( .[ -(1:6) , "countPercentage" ] ) ) ) ) %>% as.data.frame ) #note that this will break if there are fewer than 7 rows for any site On Sun, 7 Dec 2014, John Posner wrote: Here's a solution using the plyr library: library(plyr) dat <- read.table(header=TRUE, sep=",", as.is=TRUE, ## < as.is=TRUE text="site,tax_name,count,countTotal,countPercentage CID_1,Cyanobacteria,46295,123509,37.483098398 CID_1,Proteobacteria,36120,123509,29.244832360 CID_1,Bacteroidetes,19546,123509,15.825567368 CID_1,Verrucomicrobia,7886,123509,6.384959801 CID_1,Firmicutes,2843,123509,2.301856545 ... CID_9,Armatimonadetes,27,77120,0.035010373 CID_9,Fusobacteria,25,77120,0.032417012 CID_9,Aquificae,13,77120,0.016856846 CID_9,Synergistetes,12,77120,0.015560166 CID_9,Deferribacteres,8,77120,0.010373444 CID_9,Thermotogae,7,77120,0.009076763 CID_9,Chrysiogenetes,3,77120,0.003890041. ") summ_cid = function(frm) { # grab at most 6 rows from data frame toprows = head(frm, 6) # add a 7th row toprows[nrow(toprows)+1,] = c(toprows[1,1], "", NA, NA, sum(toprows$countPercentage)) # all done return(toprows) } result = ddply(dat, "site", summ_cid) Notes: 1. I needed to add the as.is=TRUE option to read.table() 2. The invocation of ddply() does *not* use summarize() 3. Because there is no use of summarize(), I have not figured out how to use the dplyr package in this context. 
-John -Original Message- From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Chel Hee Lee Sent: Sunday, December 07, 2014 11:43 AM To: Morway, Eric; R mailing list Subject: Re: [R] Condensing data.frame > datBySite <- split(dat, dat$site) > output <- lapply(datBySite, function(x){ + x$idx <- seq_len(nrow(x)) + x$grp <- ifelse(x$idx < 7, x$idx, 7) + rval <- tapply(x$countPercentage, x$grp, sum) x$grp <- x$count <- + x$countTotal <- NULL x <- x[seq_len(7), ] x$tax_name <- + as.character(x$tax_name) x$tax_name[7] <- "Others" + x$new <- rval + return(x) + }) > > head(do.call(rbind, output), 14) sitetax_name countPercentage idx new CID_1.1CID_1 Cyanobacteria 37.483098 1 37.483098 CID_1.2CID_1 Proteobacteria 29.244832 2 29.244832 CID_1.3CID_1 Bacteroidetes 15.825567 3 15.825567 CID_1.4CID_1 Verrucomicrobia6.384960 4 6.384960 CID_1.5CID_1 Firmicutes2.301857 5 2.301857 CID_1.6CID_1 Acidobacteria2.075152 6 2.075152 CID_1.7CID_1 Others1.675182 7 6.684533 CID_10.27 CID_10 Proteobacteria 35.366606 1 35.366606 CID_10.28 CID_10 Bacteroidetes 25.188484 2 25.188484 CID_10.29 CID_10 Cyanobacteria 23.294828 3 23.294828 CID_10.30 CID_10 Verrucomicrobia6.970592 4 6.970592 CID_10.31 CID_10 Acidobacteria1.988448 5 1.988448 CID_10.32 CID_10 Actinobacteria1.644548 6 1.644548 CID_10.33 CID_10 Others1.582823 7 5.546493 > I hope this helps. Chel Hee Lee On 12/07/2014 08:21 AM, Morway, Eric wrote: Using the dataset "dat" (found below), I'm seeking a way to condense down the data.frame such that each "site" (i.e., "CID_1"..."CID_13") has a maximum of 7 rows of post-processed data, where the first 6 have the highest "countPercentage" and the 7th row is the sum of "countPercentage" from all other rows within that "site", and it is assigned the name "Other". 
So, for the first two sites in the provided data.frame, CID_1 & CID_10, they would reduce to:

CID_1  Cyanobacteria    37.48
CID_1  Proteobacteria   29.24
CID_1  Bacteroidetes    15.83
CID_1  Verrucomicrobia   6.38
CID_1  Firmicutes        2.30
CID_1  Acidobacteria     2.08
CID_1  Other             6.68
CID_10 Proteobacteria   35.37
CID_10 Bacteroidetes    25.19
CID_10 Cyanobacteria    23.29
CID_10 Verrucomicrobia   6.97
CID_10 Acidobacteria     1.99
CID_10 Actinobacteria    1.64
CID_10 Other             5.55

dat <- read.table(header=TRUE, sep=",",
text="site,tax_name,count,countTotal,countPercentage
CID_1,Cyanobacteria,46295,123509,37.483098398
CID_1,Proteobacteria,36120,123509,29.244832360
CID_1,Bacteroidetes,19546,123509,15.825567368
CID_1,Verrucomicrobia,7886,123509,6.384959801
CID_1,Firmicutes,2843,123509,2.301856545
CID_1,Acidobacteria,2563,123509,2.075152418
CID_1,Actinobacter
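Not part of the original thread, but with current dplyr (>= 1.0, where `summarise()` may return multiple rows per group) the same condensation can be sketched more compactly; like the `rbind_list()` version above, this assumes every site has at least seven rows:

```r
library(dplyr)

# Top 6 taxa per site plus an "Other" row holding the remainder
condense <- function(dat) {
  dat %>%
    arrange(site, desc(countPercentage)) %>%
    group_by(site) %>%
    summarise(
      tax_name        = c(head(as.character(tax_name), 6), "Other"),
      countPercentage = c(head(countPercentage, 6),
                          sum(tail(countPercentage, -6))),
      .groups = "drop"
    )
}
```

`as.character()` guards against `tax_name` having been read as a factor, which would otherwise be coerced to its integer codes when combined with "Other".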
[R] date time problem
Dear R users,

I am puzzled by the following result from an R script. I am trying to convert local time to UTC time. The time zone is -5, therefore I used the following approach. Below is the script.

> Corrected_SA_data$date_time[k-1]
[1] "2007-03-11 01:00:00"
> Corrected_SA_data$TZ[k-1]
[1] -5
> Corrected_SA_data$date_time[k-1] - Corrected_SA_data$TZ[k-1]*3600
[1] "2007-03-11 07:00:00 MDT"

I was expecting this last value to be something like "2007-03-11 06:00:00 UTC". Please correct me if I am wrong. On the other hand I have:

> Corrected_SA_data$date_time[k]
[1] "2007-03-11 02:00:00"
> Corrected_SA_data$TZ[k]
[1] -5
> Corrected_SA_data$date_time[k] - Corrected_SA_data$TZ[k]*3600
[1] NA

I am not sure why I am getting NA. Thank you for your help.

Alemu

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
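A hedged sketch, not from the thread: subtracting seconds from a POSIXct changes the underlying instant but the result still prints in the session's local time zone (hence the "MDT" label), and "2007-03-11 02:00:00" falls in the hour skipped by the US daylight-saving spring-forward on that date, which is a plausible source of the NA. Parsing the naive timestamps as UTC before applying the fixed offset avoids both issues:

```r
# Parse the clock times as UTC so no DST gap exists, then apply the
# fixed -5 h offset arithmetically: local time minus offset = UTC.
dt <- as.POSIXct(c("2007-03-11 01:00:00", "2007-03-11 02:00:00"), tz = "UTC")
tz_offset <- -5                 # hours relative to UTC
utc <- dt - tz_offset * 3600    # shift by +5 hours
format(utc, "%Y-%m-%d %H:%M:%S")
# "2007-03-11 06:00:00" "2007-03-11 07:00:00"
```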
Re: [R] Difference in cummulative variance depending on print command
Dear Rena,

As Peter points out, it is better to ask the maintainer of the package for detailed questions. As Peter correctly surmised, print.psych (which is used to print the output from the fa function) knows that you have an oblique solution and is reporting the amount of variance associated with the oblique factors (taking into account that they are correlated). The default print method assumes orthogonal factors. If you compare the total amount of variance accounted for (cumulative Var) across all of the factors (.59), this will match that found using orthogonal rotations, while the default print method of the loadings does not.

Bill

> On Dec 6, 2014, at 10:48 AM, peter dalgaard wrote:
>
> Firstly, there is no fa() function in base R. There is one in package
> psych, which has a maintainer, etc.
>
> I guess that it is because fa() does a non-orthogonal factor rotation and its
> print method knows about it, whereas the default print method for loadings
> assumes that rotations are orthogonal.
>
> - Peter D.
>
>> On 05 Dec 2014, at 13:28, Rena Büsch wrote:
>>
>> Hello,
>> I am trying a factor analysis via R.
>> When running the principal axis analysis I get different tables depending
>> on the print command.
>> This is my factor analysis:
>> fa.pa_cor_3_2 <- fa(ItemsCor_4, nfactors=3, fm="pa", rotate="oblimin")
>>
>> To get the h2 I used the following print command:
>> print(fa.pa_cor_3_2, digits=2, cut=.3, sort=T)
>> To just get the loadings I used the following print command:
>> print(fa.pa_cor_3_2$loadings, digits=2, cutoff=.3, sort=T)
>>
>> The result of the first print is the following eigenvalue/cumulative
>> variance table:
>>                  PA1   PA2  PA3
>> SS loadings    20.59 18.16 5.03
>> Proportion Var  0.28  0.25 0.07
>> Cumulative Var  0.28  0.52 0.59
>>
>> With the second print command I get a different table:
>>                  PA1   PA2  PA3
>> SS loadings    17.63 15.12 3.14
>> Proportion Var  0.24  0.20 0.04
>> Cumulative Var  0.24  0.44 0.49
>>
>> The loadings are the same for both commands.
>> There is just this slight difference in the cumulative Var.
>>
>> Does anyone have an idea of a cause for the difference? What can I report?
>> Did I post enough information to fully understand my problem?
>> Thanks in advance
>> Rena

> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd@cbs.dk  Priv: pda...@gmail.com

William Revelle              http://personality-project.org/revelle.html
Professor                    http://personality-project.org
Department of Psychology     http://www.wcas.northwestern.edu/psych/
Northwestern University      http://www.northwestern.edu/
Use R for psychology         http://personality-project.org/r
It is 5 minutes to midnight  http://www.thebulletin.org
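Bill's point can be illustrated with a toy example (not from the thread; `L` and `Phi` are made-up matrices): when factors are correlated, the per-item communalities are the diagonal of L %*% Phi %*% t(L), so the total variance accounted for differs from the plain column sums of squared pattern loadings that the default print method for loadings reports.

```r
# toy pattern matrix (4 items, 2 factors) and factor correlation matrix
L   <- matrix(c(0.8, 0.7, 0.1, 0.2,
                0.1, 0.2, 0.8, 0.7), ncol = 2)
Phi <- matrix(c(1,   0.5,
                0.5, 1), nrow = 2)

sum(diag(L %*% Phi %*% t(L)))  # 2.8  -- accounts for the factor correlation
sum(colSums(L^2))              # 2.36 -- what summing plain SS loadings gives
```

With Phi equal to the identity (orthogonal factors) the two quantities coincide, which is why the discrepancy only appears for oblique rotations.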
[R] Which is the final model for a Boosted Regression Trees (GBM)?
Hi Kristi,

One year later I've been with the same question and found a solution with the help page (see ?plot.gbm: marginal plots of fitted gbm objects).

If your GBM model is gbm1 <- gbm(y ~ x1 + x2, ...), one can get the coefficients for each x with:

print(plot(gbm1, i.var=1, n.trees=1000, return.grid=TRUE))
print(plot(gbm1, i.var=2, n.trees=1000, return.grid=TRUE))

This includes all trees from 1 to 1000. Without "return.grid=TRUE" the graphic is shown instead.

Additionally one can see the coefficients for up to 3-way interactions:

print(plot(gbm1, i.var=c(1,2), n.trees=1000, return.grid=TRUE))

Hope that helps for others searching for the same thing!

Samuel
Re: [R] boot strapping poisson getting warnings and negative values
Team,

I am giving the exact code that produces the error. Please see below. Can anyone please help?

Thanks,
Aravindhan

PROGRAM ---
rm(list = ls())
x <- c(1,14,49,26,4,10,25,36,79,15)
y <- c(1,5,32,8,4,23,399,79,341,2)
db = data.frame(x,y)
library(car)
db = data.frame(x,y)
print("-==")
summary(m1 <- glm(y~0+x, family = poisson()))
mod.db.hub <- glm(y~0+x, family="poisson", data=db)
fit <- fitted(mod.db.hub)
e <- residuals(mod.db.hub)
X <- model.matrix(mod.db.hub)
boot.huber.fixed <- function(data, indices, maxit=20) {
  Y <- fit + e[indices]
  mod <- glm(Y~X+0, family="poisson", maxit=maxit)
  coefficients(mod)
}
library(boot)
db.fix.boot <- boot(db, boot.huber.fixed, 2000, maxit=20)
db.fix.boot
boot.ci(db.fix.boot, index=1, type=c("bca","perc","poisson"))

The error I get is --

> db.fix.boot <- boot(db, boot.huber.fixed, 2000, maxit=20)
Error in eval(expr, envir, enclos) : negative values not allowed for the 'Poisson' family
In addition: Warning messages:
1: In dpois(y, mu, log = TRUE) : non-integer x = 1.002114
2: In dpois(y, mu, log = TRUE) : non-integer x = 4.050746
3: In dpois(y, mu, log = TRUE) : non-integer x = 44.219309
4: In dpois(y, mu, log = TRUE) : non-integer x = 7.786340
5: In dpois(y, mu, log = TRUE) : non-integer x = 3.189963
6: In dpois(y, mu, log = TRUE) : non-integer x = 10.348190
7: In dpois(y, mu, log = TRUE) : non-integer x = 56.408862
8: In dpois(y, mu, log = TRUE) : non-integer x = 27.751783
9: In dpois(y, mu, log = TRUE) : non-integer x = 480.383098
10: In dpois(y, mu, log = TRUE) : non-integer x = 2.497504
> db.fix.boot
Error: object 'db.fix.boot' not found
> boot.ci(db.fix.boot,index=1,type=c("bca","perc","poisson"))
Error in boot.ci(db.fix.boot, index = 1, type = c("bca", "perc", "poisson")) : object 'db.fix.boot' not found

--
-Original Message-
From: Aravindhan, K
Sent: Saturday, November 15, 2014 7:59 AM
To: r-help@r-project.org
Subject: RE: [R] boot strapping poisson getting warnings and negative values

Team,

Has anyone looked at this question from me?
It will help me immensely if someone can provide an answer to this.

Thanks,
Aravindhan

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of K Aravindhan
Sent: Friday, August 08, 2014 7:42 PM
To: r-help@r-project.org
Subject: [R] boot strapping poisson getting warnings and negative values

Dear Team,

I am getting this error while running the boot-strapping functions.

==
mod.db.hub <- glm(TOTAL~1+IPD, family="poisson", data=db)
fit <- fitted(mod.db.hub)
e <- residuals(mod.db.hub)
X <- model.matrix(mod.db.hub)
boot.huber.fixed <- function(data, indices, maxit=20) {
  Y <- fit + e[indices]
  mod <- glm(Y~X-1, family="poisson", maxit=maxit)
  coefficients(mod)
}
library(boot)
db.fix.boot <- boot(db, boot.huber.fixed, 2000, maxit=20)
db.fix.boot
boot.ci(db.fix.boot, index=1, type=c("bca","perc","poisson"))
boot.ci(db.fix.boot, index=2, type=c("bca","perc","poisson"))
==

Error in eval(expr, envir, enclos) : negative values not allowed for the 'Poisson' family
In addition: Warning messages:
1: In dpois(y, mu, log = TRUE) : non-integer x = 25.006412
2: In dpois(y, mu, log = TRUE) : non-integer x = 26.969411
3: In dpois(y, mu, log = TRUE) : non-integer x = 66.352323
4: In dpois(y, mu, log = TRUE) : non-integer x = 61.083519
5: In dpois(y, mu, log = TRUE) : non-integer x = 20.596770
6: In dpois(y, mu, log = TRUE) : non-integer x = 43.428258
7: In dpois(y, mu, log = TRUE) : non-integer x = 1108.263554
8: In dpois(y, mu, log = TRUE) : non-integer x = 61.937982
9: In dpois(y, mu, log = TRUE) : non-integer x = 419.991213
10: In dpois(y, mu, log = TRUE) : non-integer x = 47.369133

Can you explain to me how to get rid of these?

Thanks,
Aravindhan
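A hedged note, not from the thread, on the likely cause: residual resampling builds responses `fit + e[indices]` that are non-integer and sometimes negative, which a Poisson GLM cannot accept, hence the warnings and the fatal error (note too that `residuals()` on a glm returns deviance residuals by default, and that "poisson" is not a valid `boot.ci()` interval type; the choices are "norm", "basic", "stud", "perc", and "bca"). One standard alternative, sketched under the assumption that case resampling is acceptable here, is to bootstrap whole observations instead:

```r
library(boot)

x <- c(1,14,49,26,4,10,25,36,79,15)
y <- c(1,5,32,8,4,23,399,79,341,2)
db <- data.frame(x, y)

# resample rows (cases), so the response stays valid count data
boot.cases <- function(data, indices) {
  d <- data[indices, ]
  coef(glm(y ~ 0 + x, family = poisson(), data = d))
}

set.seed(1)
db.boot <- boot(db, boot.cases, R = 999)
boot.ci(db.boot, index = 1, type = c("perc", "bca"))
```

Each replicate refits the model to a resampled copy of the data, so no artificial negative or fractional responses are ever created.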
Re: [R] bad STATA dataset import, how to change value labels
On Dec 6, 2014, at 6:37 PM, Edoardo Prestianni wrote:

> Excuse the inaccuracy, the warning is "value label missing". The same
> variable is considered as factor (w/ values ranging from a to b) in one
> dataset, as int in another. I want it to be a factor in both.

So, you are importing two different Stata-formatted files and in only one of them is the warning being emitted?

> I think I am missing a package, the output is:
>
> Error in head(dfrm[, "variable"]) : object 'dfrm' not found

My intent was for you to substitute the name of your dataframe for the token `dfrm`.

head(yourDataObject[, "yourVariableNameInQuotes"])

-- David.

> 2014-12-07 3:14 GMT+01:00 David Winsemius:
>
> On Dec 6, 2014, at 3:54 PM, Edoardo Prestianni wrote:
>
> > hello,
> >
> > I have imported a couple of .dta datasets, but a variable, instead of being
> > labeled as factor (w/ values ranging from a to b) is labeled as integer.
> >
> > How can I fix this? I am sorry if it is a rookie question but I don't find
> > the command googling.
>
> What "command"?
>
> The word "labeled" is not an R term unless one is talking about the labels of factor variables, in which case there is no problem. Factors have mode integer.
>
> Post the results of dput(head(dfrm[, "varname"]))
>
> -- David.
>
> > Thanks everyone for their help,
> >
> > --
> > Edoardo Prestianni
> David Winsemius
> Alameda, CA, USA

> --
> Edoardo Prestianni

David Winsemius
Alameda, CA, USA
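Not from the thread, but if the underlying issue is that one file lacks the value labels (hence the "value label missing" warning and the integer column), one workaround is to coerce the raw codes to a factor using the levels from the dataset where the labels did import. This is only a hedged sketch; `df_labeled`, `df_unlabeled`, `myvar`, and the file names are placeholders:

```r
library(foreign)  # read.dta() for Stata files

# placeholders: substitute your own file names and variable name
df_labeled   <- read.dta("file_with_labels.dta")
df_unlabeled <- read.dta("file_without_labels.dta")

# map the raw integer codes onto the labeled factor's levels,
# assuming codes 1..k correspond to those levels in order
df_unlabeled$myvar <- factor(df_unlabeled$myvar,
                             levels = seq_along(levels(df_labeled$myvar)),
                             labels = levels(df_labeled$myvar))
```

The code-to-level correspondence should be checked against the Stata codebook before relying on it.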
Re: [R] Condensing data.frame
Here's a solution using the plyr library:

library(plyr)

dat <- read.table(header=TRUE, sep=",", as.is=TRUE,   ## < as.is=TRUE
text="site,tax_name,count,countTotal,countPercentage
CID_1,Cyanobacteria,46295,123509,37.483098398
CID_1,Proteobacteria,36120,123509,29.244832360
CID_1,Bacteroidetes,19546,123509,15.825567368
CID_1,Verrucomicrobia,7886,123509,6.384959801
CID_1,Firmicutes,2843,123509,2.301856545
...
CID_9,Armatimonadetes,27,77120,0.035010373
CID_9,Fusobacteria,25,77120,0.032417012
CID_9,Aquificae,13,77120,0.016856846
CID_9,Synergistetes,12,77120,0.015560166
CID_9,Deferribacteres,8,77120,0.010373444
CID_9,Thermotogae,7,77120,0.009076763
CID_9,Chrysiogenetes,3,77120,0.003890041
")

summ_cid = function(frm) {
  # grab at most 6 rows from the data frame
  toprows = head(frm, 6)
  # add a 7th row: "Other" holds the sum of the remaining rows' percentages
  toprows[nrow(toprows)+1, ] = c(toprows[1,1], "Other", NA, NA,
                                 sum(frm$countPercentage) - sum(toprows$countPercentage))
  # all done
  return(toprows)
}

result = ddply(dat, "site", summ_cid)

Notes:
1. I needed to add the as.is=TRUE option to read.table()
2. The invocation of ddply() does *not* use summarize()
3. Because there is no use of summarize(), I have not figured out how to use the dplyr package in this context.
-John

> -Original Message-
> From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Chel Hee
> Lee
> Sent: Sunday, December 07, 2014 11:43 AM
> To: Morway, Eric; R mailing list
> Subject: Re: [R] Condensing data.frame
>
> > datBySite <- split(dat, dat$site)
> > output <- lapply(datBySite, function(x){
> +   x$idx <- seq_len(nrow(x))
> +   x$grp <- ifelse(x$idx < 7, x$idx, 7)
> +   rval <- tapply(x$countPercentage, x$grp, sum)
> +   x$grp <- x$count <- x$countTotal <- NULL
> +   x <- x[seq_len(7), ]
> +   x$tax_name <- as.character(x$tax_name)
> +   x$tax_name[7] <- "Others"
> +   x$new <- rval
> +   return(x)
> + })
> > head(do.call(rbind, output), 14)
>             site        tax_name countPercentage idx       new
> CID_1.1    CID_1   Cyanobacteria       37.483098   1 37.483098
> CID_1.2    CID_1  Proteobacteria       29.244832   2 29.244832
> CID_1.3    CID_1   Bacteroidetes       15.825567   3 15.825567
> CID_1.4    CID_1 Verrucomicrobia        6.384960   4  6.384960
> CID_1.5    CID_1      Firmicutes        2.301857   5  2.301857
> CID_1.6    CID_1   Acidobacteria        2.075152   6  2.075152
> CID_1.7    CID_1          Others        1.675182   7  6.684533
> CID_10.27 CID_10  Proteobacteria       35.366606   1 35.366606
> CID_10.28 CID_10   Bacteroidetes       25.188484   2 25.188484
> CID_10.29 CID_10   Cyanobacteria       23.294828   3 23.294828
> CID_10.30 CID_10 Verrucomicrobia        6.970592   4  6.970592
> CID_10.31 CID_10   Acidobacteria        1.988448   5  1.988448
> CID_10.32 CID_10  Actinobacteria        1.644548   6  1.644548
> CID_10.33 CID_10          Others        1.582823   7  5.546493
>
> I hope this helps.
>
> Chel Hee Lee
>
> On 12/07/2014 08:21 AM, Morway, Eric wrote:
> > Using the dataset "dat" (found below), I'm seeking a way to condense
> > down the data.frame such that each "site" (i.e., "CID_1"..."CID_13")
> > has a maximum of 7 rows of post-processed data, where the first 6 have
> > the highest "countPercentage" and the 7th row is the sum of
> > "countPercentage" from all other rows within that "site", and it is
> > assigned the name "Other".
> > So, for the first two sites in the provided data.frame,
> > CID_1 & CID_10, they would reduce to:
> >
> > CID_1  Cyanobacteria    37.48
> > CID_1  Proteobacteria   29.24
> > CID_1  Bacteroidetes    15.83
> > CID_1  Verrucomicrobia   6.38
> > CID_1  Firmicutes        2.30
> > CID_1  Acidobacteria     2.08
> > CID_1  Other             6.68
> > CID_10 Proteobacteria   35.37
> > CID_10 Bacteroidetes    25.19
> > CID_10 Cyanobacteria    23.29
> > CID_10 Verrucomicrobia   6.97
> > CID_10 Acidobacteria     1.99
> > CID_10 Actinobacteria    1.64
> > CID_10 Other             5.55
> >
> > dat <- read.table(header=TRUE, sep=",",
> > text="site,tax_name,count,countTotal,countPercentage
> > CID_1,Cyanobacteria,46295,123509,37.483098398
> > CID_1,Proteobacteria,36120,123509,29.244832360
> > CID_1,Bacteroidetes,19546,123509,15.825567368
> > CID_1,Verrucomicrobia,7886,123509,6.384959801
> > CID_1,Firmicutes,2843,123509,2.301856545
> > CID_1,Acidobacteria,2563,123509,2.075152418
> > CID_1,Actinobacteria,2069,123509,1.675181566
> > CID_1,Planctomycetes,1481,123509,1.199102899
> > CID_1,Chloroflexi,1181,123509,0.956205621
> > CID_1,Gemmatimonadetes,956,123509,0.774032662
> > CID_1,Spirochaetes,688,123509,0.557044426
> > CID_1,Lentisphaerae,526,123509,0.425879895
> > CID_1,Ignavibacteriae,324,123509,0.262329061
> > CID_1,Chlorobi,238,123509,0.192698508
> > CID_1,Nitrospirae,230,123509,0.186221247
> > CID_1,Nitrospinae,169,123509,0.136832134
> > CID_1,Elusimicrobia,131,123509,0.106065145
> > CID_1,Tenericutes,114,123509,0.092300966
> > CID_1,Fibrobacteres,72,123509,0.058295347
> > CID_1
Re: [R] Condensing data.frame
> datBySite <- split(dat, dat$site)
> output <- lapply(datBySite, function(x){
+   x$idx <- seq_len(nrow(x))
+   x$grp <- ifelse(x$idx < 7, x$idx, 7)
+   rval <- tapply(x$countPercentage, x$grp, sum)
+   x$grp <- x$count <- x$countTotal <- NULL
+   x <- x[seq_len(7), ]
+   x$tax_name <- as.character(x$tax_name)
+   x$tax_name[7] <- "Others"
+   x$new <- rval
+   return(x)
+ })
> head(do.call(rbind, output), 14)
            site        tax_name countPercentage idx       new
CID_1.1    CID_1   Cyanobacteria       37.483098   1 37.483098
CID_1.2    CID_1  Proteobacteria       29.244832   2 29.244832
CID_1.3    CID_1   Bacteroidetes       15.825567   3 15.825567
CID_1.4    CID_1 Verrucomicrobia        6.384960   4  6.384960
CID_1.5    CID_1      Firmicutes        2.301857   5  2.301857
CID_1.6    CID_1   Acidobacteria        2.075152   6  2.075152
CID_1.7    CID_1          Others        1.675182   7  6.684533
CID_10.27 CID_10  Proteobacteria       35.366606   1 35.366606
CID_10.28 CID_10   Bacteroidetes       25.188484   2 25.188484
CID_10.29 CID_10   Cyanobacteria       23.294828   3 23.294828
CID_10.30 CID_10 Verrucomicrobia        6.970592   4  6.970592
CID_10.31 CID_10   Acidobacteria        1.988448   5  1.988448
CID_10.32 CID_10  Actinobacteria        1.644548   6  1.644548
CID_10.33 CID_10          Others        1.582823   7  5.546493

I hope this helps.

Chel Hee Lee

On 12/07/2014 08:21 AM, Morway, Eric wrote:
Using the dataset "dat" (found below), I'm seeking a way to condense down the data.frame such that each "site" (i.e., "CID_1"..."CID_13") has a maximum of 7 rows of post-processed data, where the first 6 have the highest "countPercentage" and the 7th row is the sum of "countPercentage" from all other rows within that "site", and it is assigned the name "Other".
So, for the first two sites in the provided data.frame, CID_1 & CID_10, they would reduce to:

CID_1  Cyanobacteria    37.48
CID_1  Proteobacteria   29.24
CID_1  Bacteroidetes    15.83
CID_1  Verrucomicrobia   6.38
CID_1  Firmicutes        2.30
CID_1  Acidobacteria     2.08
CID_1  Other             6.68
CID_10 Proteobacteria   35.37
CID_10 Bacteroidetes    25.19
CID_10 Cyanobacteria    23.29
CID_10 Verrucomicrobia   6.97
CID_10 Acidobacteria     1.99
CID_10 Actinobacteria    1.64
CID_10 Other             5.55

dat <- read.table(header=TRUE, sep=",",
text="site,tax_name,count,countTotal,countPercentage
CID_1,Cyanobacteria,46295,123509,37.483098398
CID_1,Proteobacteria,36120,123509,29.244832360
CID_1,Bacteroidetes,19546,123509,15.825567368
CID_1,Verrucomicrobia,7886,123509,6.384959801
CID_1,Firmicutes,2843,123509,2.301856545
CID_1,Acidobacteria,2563,123509,2.075152418
CID_1,Actinobacteria,2069,123509,1.675181566
CID_1,Planctomycetes,1481,123509,1.199102899
CID_1,Chloroflexi,1181,123509,0.956205621
CID_1,Gemmatimonadetes,956,123509,0.774032662
CID_1,Spirochaetes,688,123509,0.557044426
CID_1,Lentisphaerae,526,123509,0.425879895
CID_1,Ignavibacteriae,324,123509,0.262329061
CID_1,Chlorobi,238,123509,0.192698508
CID_1,Nitrospirae,230,123509,0.186221247
CID_1,Nitrospinae,169,123509,0.136832134
CID_1,Elusimicrobia,131,123509,0.106065145
CID_1,Tenericutes,114,123509,0.092300966
CID_1,Fibrobacteres,72,123509,0.058295347
CID_1,Thermotogae,21,123509,0.017002810
CID_1,Fusobacteria,21,123509,0.017002810
CID_1,Armatimonadetes,15,123509,0.012144864
CID_1,Synergistetes,10,123509,0.008096576
CID_1,Deinococcus-Thermus,6,123509,0.004857946
CID_1,Deferribacteres,2,123509,0.001619315
CID_1,Caldiserica,2,123509,0.001619315
CID_10,Proteobacteria,16043,45362,35.366606411
CID_10,Bacteroidetes,11426,45362,25.188483753
CID_10,Cyanobacteria,10567,45362,23.294828270
CID_10,Verrucomicrobia,3162,45362,6.970592126
CID_10,Acidobacteria,902,45362,1.988448481
CID_10,Actinobacteria,746,45362,1.644548300
CID_10,Firmicutes,718,45362,1.582822627
CID_10,Gemmatimonadetes,358,45362,0.789206825
CID_10,Planctomycetes,357,45362,0.787002337
CID_10,Chloroflexi,265,45362,0.584189410
CID_10,Spirochaetes,235,45362,0.518054759
CID_10,Ignavibacteriae,177,45362,0.390194436
CID_10,Lentisphaerae,108,45362,0.238084741
CID_10,Nitrospinae,75,45362,0.165336625
CID_10,Nitrospirae,58,45362,0.127860324
CID_10,Chlorobi,44,45362,0.096997487
CID_10,Elusimicrobia,28,45362,0.061725673
CID_10,Fibrobacteres,26,45362,0.057316697
CID_10,Armatimonadetes,15,45362,0.033067325
CID_10,Deinococcus-Thermus,13,45362,0.028658348
CID_10,Tenericutes,10,45362,0.022044883
CID_10,Synergistetes,9,45362,0.019840395
CID_10,Fusobacteria,9,45362,0.019840395
CID_10,Deferribacteres,6,45362,0.013226930
CID_10,Thermotogae,3,45362,0.006613465
CID_10,Caldiserica,2,45362,0.004408977
CID_11,Proteobacteria,10019,31387,31.920858954
CID_11,Cyanobacteria,8811,31387,28.072131774
CID_11,Bacteroidetes,7930,31387,25.265237200
CID_11,Verrucomicrobia,1750,31387,5.575556759
CID_11,Firmicutes,806,31387,2.567942142
CID_11,Acidobacteria,548,31387,1.745945774
CID_11,Actinobacteria,434,31387,1.382738076
CID_11,Chloroflexi,203,31387,0.646764584
CID_1
[R] Condensing data.frame
Using the dataset "dat" (found below), I'm seeking a way to condense down the data.frame such that each "site" (i.e., "CID_1"..."CID_13") has a maximum of 7 rows of post-processed data, where the first 6 have the highest "countPercentage" and the 7th row is the sum of "countPercentage" from all other rows within that "site", and it is assigned the name "Other". So, for the first two sites in the provided data.frame, CID_1 & CID_10, they would reduce to:

CID_1  Cyanobacteria    37.48
CID_1  Proteobacteria   29.24
CID_1  Bacteroidetes    15.83
CID_1  Verrucomicrobia   6.38
CID_1  Firmicutes        2.30
CID_1  Acidobacteria     2.08
CID_1  Other             6.68
CID_10 Proteobacteria   35.37
CID_10 Bacteroidetes    25.19
CID_10 Cyanobacteria    23.29
CID_10 Verrucomicrobia   6.97
CID_10 Acidobacteria     1.99
CID_10 Actinobacteria    1.64
CID_10 Other             5.55

dat <- read.table(header=TRUE, sep=",",
text="site,tax_name,count,countTotal,countPercentage
CID_1,Cyanobacteria,46295,123509,37.483098398
CID_1,Proteobacteria,36120,123509,29.244832360
CID_1,Bacteroidetes,19546,123509,15.825567368
CID_1,Verrucomicrobia,7886,123509,6.384959801
CID_1,Firmicutes,2843,123509,2.301856545
CID_1,Acidobacteria,2563,123509,2.075152418
CID_1,Actinobacteria,2069,123509,1.675181566
CID_1,Planctomycetes,1481,123509,1.199102899
CID_1,Chloroflexi,1181,123509,0.956205621
CID_1,Gemmatimonadetes,956,123509,0.774032662
CID_1,Spirochaetes,688,123509,0.557044426
CID_1,Lentisphaerae,526,123509,0.425879895
CID_1,Ignavibacteriae,324,123509,0.262329061
CID_1,Chlorobi,238,123509,0.192698508
CID_1,Nitrospirae,230,123509,0.186221247
CID_1,Nitrospinae,169,123509,0.136832134
CID_1,Elusimicrobia,131,123509,0.106065145
CID_1,Tenericutes,114,123509,0.092300966
CID_1,Fibrobacteres,72,123509,0.058295347
CID_1,Thermotogae,21,123509,0.017002810
CID_1,Fusobacteria,21,123509,0.017002810
CID_1,Armatimonadetes,15,123509,0.012144864
CID_1,Synergistetes,10,123509,0.008096576
CID_1,Deinococcus-Thermus,6,123509,0.004857946
CID_1,Deferribacteres,2,123509,0.001619315
CID_1,Caldiserica,2,123509,0.001619315
CID_10,Proteobacteria,16043,45362,35.366606411
CID_10,Bacteroidetes,11426,45362,25.188483753
CID_10,Cyanobacteria,10567,45362,23.294828270
CID_10,Verrucomicrobia,3162,45362,6.970592126
CID_10,Acidobacteria,902,45362,1.988448481
CID_10,Actinobacteria,746,45362,1.644548300
CID_10,Firmicutes,718,45362,1.582822627
CID_10,Gemmatimonadetes,358,45362,0.789206825
CID_10,Planctomycetes,357,45362,0.787002337
CID_10,Chloroflexi,265,45362,0.584189410
CID_10,Spirochaetes,235,45362,0.518054759
CID_10,Ignavibacteriae,177,45362,0.390194436
CID_10,Lentisphaerae,108,45362,0.238084741
CID_10,Nitrospinae,75,45362,0.165336625
CID_10,Nitrospirae,58,45362,0.127860324
CID_10,Chlorobi,44,45362,0.096997487
CID_10,Elusimicrobia,28,45362,0.061725673
CID_10,Fibrobacteres,26,45362,0.057316697
CID_10,Armatimonadetes,15,45362,0.033067325
CID_10,Deinococcus-Thermus,13,45362,0.028658348
CID_10,Tenericutes,10,45362,0.022044883
CID_10,Synergistetes,9,45362,0.019840395
CID_10,Fusobacteria,9,45362,0.019840395
CID_10,Deferribacteres,6,45362,0.013226930
CID_10,Thermotogae,3,45362,0.006613465
CID_10,Caldiserica,2,45362,0.004408977
CID_11,Proteobacteria,10019,31387,31.920858954
CID_11,Cyanobacteria,8811,31387,28.072131774
CID_11,Bacteroidetes,7930,31387,25.265237200
CID_11,Verrucomicrobia,1750,31387,5.575556759
CID_11,Firmicutes,806,31387,2.567942142
CID_11,Acidobacteria,548,31387,1.745945774
CID_11,Actinobacteria,434,31387,1.382738076
CID_11,Chloroflexi,203,31387,0.646764584
CID_11,Planctomycetes,197,31387,0.627648389
CID_11,Gemmatimonadetes,192,31387,0.611718227
CID_11,Ignavibacteriae,87,31387,0.277184822
CID_11,Spirochaetes,80,31387,0.254882595
CID_11,Tenericutes,71,31387,0.226208303
CID_11,Fusobacteria,67,31387,0.213464173
CID_11,Lentisphaerae,54,31387,0.172045751
CID_11,Chlorobi,40,31387,0.127441297
CID_11,Nitrospinae,33,31387,0.105139070
CID_11,Armatimonadetes,22,31387,0.070092714
CID_11,Fibrobacteres,15,31387,0.047790487
CID_11,Nitrospirae,13,31387,0.041418422
CID_11,Elusimicrobia,13,31387,0.041418422
CID_11,Deinococcus-Thermus,2,31387,0.006372065
CID_12,Cyanobacteria,241,644,37.422360248
CID_12,Bacteroidetes,210,644,32.608695652
CID_12,Proteobacteria,118,644,18.322981366
CID_12,Verrucomicrobia,38,644,5.900621118
CID_12,Acidobacteria,11,644,1.708074534
CID_12,Ignavibacteriae,6,644,0.931677019
CID_12,Lentisphaerae,5,644,0.776397516
CID_12,Firmicutes,5,644,0.776397516
CID_12,Planctomycetes,3,644,0.465838509
CID_12,Fusobacteria,3,644,0.465838509
CID_12,Tenericutes,2,644,0.310559006
CID_12,Actinobacteria,2,644,0.310559006
CID_13,Cyanobacteria,8581,25530,33.611437524
CID_13,Bacteroidetes,6878,25530,26.940853897
CID_13,Proteobacteria,5341,25530,20.920485703
CID_13,Verrucomicrobia,1244,25530,4.872698786
CID_13,Firmicutes,1148,25530,4.496670584
CID_13,Acidobacteria,548,25530,2.146494320
CID_13,Spirochaetes,477,25530,1.868390129
CID_13,Ignavibacteriae,298,25530,1.167254211
CID_13,Ac