Re: [R] How to represent tree-structured values
Really this depends on the analysis you want to perform. In the past, I have used a two-column super/sub format as a compact, non-redundant representation for data entry, and then applied a recursive algorithm to convert it to a super/sub/level/id table in which _every_ sub component has a (duplicative) entry for each of its super components. But there is always the recursive list structure that readers for formats such as YAML and JSON typically return.

--
Sent from my phone. Please excuse my brevity.
__ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
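The super/sub representation described in the reply above could be sketched roughly as follows. The column names (`super`, `sub`, `level`) and the helper `expand_edges()` are made up for this illustration and are not taken from the poster's own code; the id column is left out, and the sketch assumes a strict tree in which every component has at most one direct parent.

```r
# Compact, non-redundant data-entry format: one row per direct parent/child link.
edges <- data.frame(
  super = c("injuries", "injuries to limbs", "injuries to extremities",
            "injuries to hands", "injuries to hands"),
  sub   = c("injuries to limbs", "injuries to extremities", "injuries to hands",
            "injuries to dominant hand", "injuries to non-dominant hand"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pair each 'sub' with every one of its ancestors.
# 'level' records how many links separate the two (1 = direct parent).
expand_edges <- function(edges) {
  ancestors <- function(node, level = 1L) {
    parent <- edges$super[edges$sub == node]
    if (length(parent) == 0L) return(NULL)   # reached a root
    rbind(data.frame(super = parent, level = level, stringsAsFactors = FALSE),
          ancestors(parent, level + 1L))
  }
  out <- lapply(unique(edges$sub), function(s) {
    a <- ancestors(s)
    data.frame(super = a$super, sub = s, level = a$level,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

closure <- expand_edges(edges)
closure[closure$sub == "injuries to dominant hand", ]
```

Joining observations to `closure` on `sub` then gives one row per observation and ancestor, which is the duplicative layout the reply mentions.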
[R] How to represent tree-structured values
There is a kind of data I run into fairly often which I have never known how to represent in R, and nothing I've tried really satisfies me.

Consider for example

    ...
    - injuries
      ...
      - injuries to limbs
        ...
        - injuries to extremities
          ...
          - injuries to hands
            - injuries to dominant hand
            - injuries to non-dominant hand
          ...
        ...
      ...

This isn't ordinal data, because there is no "left to right" order on the values. But there IS a "part/whole" order, which an analysis should respect, so it's not pure nominal data either.

As one particular example, if I want to tabulate data like this, an occurrence of one value should be counted as an occurrence of *every* superordinate value.

Examples of such data include "why is this patient being treated", "what drug is this patient being treated with", "what geographic region is this school from", "what biological group does this insect belong to".

So what is the recommended way to represent and the recommended way to analyse such data in R?
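One possible way to meet the tabulation requirement stated in the question is sketched below. It is only an illustration under stated assumptions: the hierarchy is stored as a named character vector `parent` (a made-up name), each value has a single parent, and the observations in `obs` are invented.

```r
# Hypothetical hierarchy: names are the values, entries are their immediate
# parents (NA marks the root).
parent <- c(
  "injuries"                      = NA,
  "injuries to limbs"             = "injuries",
  "injuries to extremities"       = "injuries to limbs",
  "injuries to hands"             = "injuries to extremities",
  "injuries to dominant hand"     = "injuries to hands",
  "injuries to non-dominant hand" = "injuries to hands"
)

# Return a value together with all of its superordinate values.
with_ancestors <- function(value) {
  path <- character(0)
  while (!is.na(value)) {
    path  <- c(path, value)
    value <- parent[[value]]
  }
  path
}

# Made-up observations, one recorded value per case.
obs <- c("injuries to dominant hand", "injuries to dominant hand",
         "injuries to non-dominant hand", "injuries to limbs")

# Each observation counts once for itself and once for every ancestor.
counts <- table(factor(unlist(lapply(obs, with_ancestors)),
                       levels = names(parent)))
counts
```

Keeping the hierarchy separate from the observations means the raw data stay an ordinary character or factor column, and the part/whole order is only applied at tabulation time.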
Re: [R] categorizing data
Here is one way to get the table you are describing. First some made up data:

dta <- structure(list(
  tree = c(27, 47, 33, 31, 45, 54, 47, 27, 33, 26, 14, 43, 36, 0, 29, 24, 43, 38, 32, 21, 21, 23, 12, 42, 34),
  shrub = c(19, 29, 27, 31, 5, 24, 6, 37, 4, 6, 59, 7, 23, 15, 32, 1, 31, 37, 30, 44, 40, 10, 28, 23, 32),
  grass = c(44, 14, 30, 28, 40, 12, 37, 26, 53, 58, 17, 40, 31, 75, 29, 65, 16, 15, 28, 25, 29, 57, 50, 25, 24)),
  class = "data.frame", row.names = c(NA, -25L))

rnks <- data.frame(t(apply(dta, 1, rank, ties.method="first")))
rnks <- sapply(rnks, factor, labels=c("Low", "Med", "High"))
head(rnks)
     tree   shrub  grass
[1,] "Med"  "Low"  "High"
[2,] "High" "Med"  "Low"
[3,] "High" "Low"  "Med"
[4,] "Med"  "High" "Low"
[5,] "High" "Low"  "Med"
[6,] "High" "Med"  "Low"

table(apply(rnks, 1, paste, collapse="/"))

High/Low/Med High/Med/Low Low/High/Med Low/Med/High Med/High/Low Med/Low/High
           6            6            4            2            2            5

David L Carlson
Texas A&M University
Re: [R] Circular Graph Recommendation Request
If the units of analysis are real spatial regions (e.g. states), how about a cartogram?

https://gisgeography.com/cartogram-maps/

An R package (I have no experience with it):
https://cran.r-project.org/web/packages/cartogram/index.html

The advantage of a cartogram is that it is a single graphic, rather than 2 like the original post referenced. No need to move eye back and forth to decode the colors. And it maintains---as much as possible given the distortion, which is the whole point of a cartogram---the relative spatial positions of the areal units (in this case, states.) The round figure in the original post has the northern midwestern region in the 7:00 to 8:00-ish position, what might be considered notionally the "southwest." A little counterintuitive.

--Chris Ryan

Bert Gunter wrote:
> Very nice plot. Thanks for sharing.
> Can't help directly, but as the plot is sort of a map with polygonal
> areas encoding the value of a variable, you might try posting on
> r-sig-geo instead where there might be more relevant expertise in such
> things -- or perhaps suggestions for alternative visualizations that
> work similarly.
>
> Bert Gunter
>
> On Sat, May 28, 2022 at 8:39 AM Stephen H. Dawson, DSL via R-help wrote:
>>
>> https://www.visualcapitalist.com/us-goods-exports-by-state/
>> Visualizing U.S. Exports by State
>>
>> Good Morning,
>>
>> https://www.visualcapitalist.com/wp-content/uploads/2022/05/us-exports-by-state-infographic.jpg
>>
>> Saw an impressive graph today. Sharing with the list.
>>
>> The size proportionality of the state segments in a circle graph is catchy.
>>
>> QUESTION
>> Is there a package one could use with R to accomplish this particular
>> circular-style graph?
>>
>> Kindest Regards,
>> --
>> Stephen Dawson, DSL
Re: [R] How to color boxplots with respect to the variable names
Thank you so much Jim for your help.

Best regards
Re: [R] How to color boxplots with respect to the variable names
Hi Neha,
As you have a distinguishing feature in the variable names, here is one way to do it:

RF <- c(4.7, 1.52, 1.46, 4.5, 0.62, 1.12)
RF_LOO <- c(5.2, 1.52, 1.44, 4.3, 0.64, 1.11)
RF_boot <- c(5.8, 1.5, 1.23, 4.3, 0.64, 1.12)
Ranger <- c(4.5, 1.57, 1.25, 3.75, 0.56, 1.09)
Ranger_LOO <- c(5, 1.56, 1.35, 3.7, 0.6, 1.0)
Ranger_boot <- c(4.2, 1.53, 1.12, 3.7, 0.63, 1.1)
SVM <- c(3.51, 1.34, 0.62, 1.45, 0.5, 1.06)
SVM_LOO <- c(3.6, 1.33, 0.33, 1.4, 0.41, 1.1)
SVM_boot <- c(3.75, 1.35, 0.58, 1.4, 0.4, 1.0)
KNN <- c(2.85, 1.35, 0.25, 1.76, 0.43, 1.25)
KNN_LOO <- c(2.85, 1.34, 0.375, 1.75, 0.44, 1.27)
KNN_boot <- c(2.75, 1.35, 0.375, 1.75, 0.45, 1.27)
varnames <- c("RF","RF_LOO","RF_boot",
  "RANGER","RANGER_LOO","RANGER_boot",
  "SVM","SVM_LOO","SVM_boot",
  "KNN","KNN_LOO","KNN_boot")
colors <- rep("blue",length(varnames))
colors[grep("LOO",varnames)] <- "green"
colors[grep("boot",varnames)] <- "red"
at.x <- seq(1,by=.4, length.out = 10)
boxplot(RF, RF_LOO, RF_boot, Ranger, Ranger_LOO, Ranger_boot,
  SVM, SVM_LOO, SVM_boot, KNN, KNN_LOO, KNN_boot,
  range = 0, col=colors,
  names= c("RF", "RF_LOO", "RF_boot",
    "Ranger", "Ranger_LOO", "Ranger_boot",
    "SVM", "SVM_LOO", "SVM_boot",
    "KNN", "KNN_LOO", "KNN_boot"),
  las=2, boxwex=0.5, outline=FALSE, cex.axis=0.8,
  main="Consistency of the 100% features ")
legend(8,5.5,c("Raw","LOO","boot"),fill=c("blue","green","red"))

Jim

On Mon, May 30, 2022 at 4:46 AM Neha gupta wrote:
>
> I have the following data and I need to use a boxplot which displays the
> variables (RF, Ranger, SVM, KNN) with one color, variables (RF_boot,
> Ranger_boot, SVM_boot, KNN_boot) with another color and the variables
> (RF_LOO, SVM_LOO, Ranger_LOO, KNN_LOO) with another color.
>
> How can I do that? Currently, I am using the base boxplot which displays
> them in one color. I know it will be more easily achieved with ggplot but I
> have no experience/knowledge with it.
>
> RF= c(4.7, 1.52, 1.46, 4.5, 0.62, 1.12)
> RF_LOO= c(5.2, 1.52, 1.44, 4.3, 0.64, 1.11)
> RF_boot= c(5.8, 1.5, 1.23, 4.3, 0.64, 1.12)
> Ranger= c(4.5, 1.57, 1.25, 3.75, 0.56, 1.09)
> Ranger_LOO= c(5, 1.56, 1.35, 3.7, 0.6, 1.0)
> Ranger_boot= c(4.2, 1.53, 1.12, 3.7, 0.63, 1.1)
> SVM= c(3.51, 1.34, 0.62, 1.45, 0.5, 1.06)
> SVM_LOO= c(3.6, 1.33, 0.33, 1.4, 0.41, 1.1)
> SVM_boot= c(3.75, 1.35, 0.58, 1.4, 0.4, 1.0)
> KNN= c(2.85, 1.35, 0.25, 1.76, 0.43, 1.25)
> KNN_LOO= c(2.85, 1.34, 0.375, 1.75, 0.44, 1.27)
> KNN_boot= c(2.75, 1.35, 0.375, 1.75, 0.45, 1.27)
>
> My base boxplot is here
>
> colors = rep("blue",12)
> at.x <- seq(1,by=.4, length.out = 10)
> boxplot(RF, RF_LOO, RF_boot, Ranger, Ranger_LOO, Ranger_boot, SVM, SVM_LOO,
> SVM_boot,
> KNN, KNN_LOO, KNN_boot, range = 0, col=colors, names= c("RF",
> "RF_LOO", "RF_boot",
> "Ranger", "Ranger_LOO", "Ranger_boot", "SVM", "SVM_LOO", "SVM_boot",
> "KNN", "KNN_LOO",
> "KNN_boot"),las=2,boxwex=0.5,outline=FALSE,cex.axis=0.8, main="Consistency
> of the 100% features ")
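The name-based colouring in this thread can also be written so that the labels and the colours come from the same place. The following is a minimal sketch, not the posters' code: it assumes the vectors are gathered in a named list (called `results` here, a made-up name), uses only a subset of the posted data to stay short, and classifies each name by its suffix.

```r
# A subset of the posted data, gathered in a named list so that the names
# are available for both colouring and labelling.
results <- list(
  RF       = c(4.7, 1.52, 1.46, 4.5, 0.62, 1.12),
  RF_LOO   = c(5.2, 1.52, 1.44, 4.3, 0.64, 1.11),
  RF_boot  = c(5.8, 1.5, 1.23, 4.3, 0.64, 1.12),
  SVM      = c(3.51, 1.34, 0.62, 1.45, 0.5, 1.06),
  SVM_LOO  = c(3.6, 1.33, 0.33, 1.4, 0.41, 1.1),
  SVM_boot = c(3.75, 1.35, 0.58, 1.4, 0.4, 1.0)
)

# Classify each name by its suffix; names without a suffix count as "Raw".
suffix <- ifelse(grepl("_LOO$", names(results)), "LOO",
          ifelse(grepl("_boot$", names(results)), "boot", "Raw"))
palette_by_suffix <- c(Raw = "blue", LOO = "green", boot = "red")
box_cols <- unname(palette_by_suffix[suffix])

# boxplot() accepts a list and takes the box labels from its names.
boxplot(results, col = box_cols, las = 2, range = 0, boxwex = 0.5,
        cex.axis = 0.8, main = "Consistency of the 100% features")
legend("topright", legend = names(palette_by_suffix), fill = palette_by_suffix)
```

With this layout, adding another method only means adding entries to the list; the colour vector and the axis labels follow automatically.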
Re: [R] categorizing data
Hi Janet:

Here is a start to give you the idea; now you need to loop over the rows, using either a "for" loop or one of the apply functions.

1. Preallocate the new data (I am lazy, so it is an array, for example of size three).

2. Order the data and set the values.

junk <- array(0, dim = c(2,3))
values <- c(10, 30, 50)
junk[1, order(c(32, 11, 17))] <- values
junk[1, ]
[1] 50 10 30

This works because order() returns the index of the ordering, not the values.

HTH,

-Roy
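The single-row order() idea from this message can be extended to every row with a plain loop. This is only a sketch of that extension, using the three example rows from the original question; the object names `orig` and `new` are illustrative, and it assumes no ties within a row.

```r
orig <- data.frame(tree  = c(32, 23, 49),
                   shrub = c(11, 41, 23),
                   grass = c(47, 26, 18))
values <- c(10, 30, 50)                      # low, medium, high

# Preallocate the result, then fill one row at a time.
new <- matrix(0, nrow = nrow(orig), ncol = ncol(orig),
              dimnames = list(NULL, names(orig)))
for (i in seq_len(nrow(orig))) {
  # order() gives the positions of the smallest, middle and largest entries,
  # so those positions receive 10, 30 and 50 respectively.
  new[i, order(unlist(orig[i, ]))] <- values
}
new
```

For the example rows this reproduces the "Data: New" table from the question, and every row still sums to 90.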
Re: [R] categorizing data
I'm sorry if this has come across as a homework assignment! I was trying to provide a simple example.
There are actually 38323 rows of data; each row is an observation of the percent that each of those veg types occupies in a spatial unit - where each line adds to 90 - and the values are different on every line.
I need a way to categorize the data, so I can reduce the number of unique observations.

So instead of 38323 unique observations, I can reduce this to
X number of High/Med/Low
X number of Med/Low/High
X number of Low/High/Med
etc... for all combinations

I hope this makes it more clear.
Thank you all for your responses,
JC
Re: [R] categorizing data
Tom, You may have a very different impression of what was asked! LOL! Unless Janet clarifies what seems a bit like a homework assignment, it seems to be a fairly simple and straightforward assignment with exactly three rows/columns and asking how to replace the variables, in a sense, by finding the high and low and perhaps thus identifying the medium, but to do this for each row without changing the order of the resulting data.frame. I note most techniques people have used focus on columns, not rows, but an all-numeric data.frame can be transposed, or converted to a matrix and later converted back. If this is HW, the question becomes what has been taught so far and is supposed to be used in solving it. Can they make their own functions perhaps to be called three times, once per row or column, to replace that row/column, or can they use some form of loop to iterate over the columns? Does it need to sort of be done in place or can they create gradually a second data.frame and then move the pointer to it and lots of other similar ideas. I am not sure, other than as a HW assignment, why this transformation would need to be done but of course, there may well be a reason. I note that the particular example shown just happens to create almost a magic square as the sum of rows and columns and the major diagonal happen to be 0, albeit the reverse diagonal is all 50's. Again, there are many solutions imaginable but the goal may be more specific and I shudder to supply one given that too often questions here are not detailed enough and are misunderstood. In this case, I thought I understood until I saw what Tom wrote! LOL! I will add this. Is it guaranteed that no two items in the same row are never equal or is there some requirement for how to handle a tie? And note there are base R functions called min() and max() and you can ask for things like: if ( current == min(mydata[1,])) ... 
-Original Message- From: Tom Woolman To: Janet Choate Cc: r-help@r-project.org Sent: Sun, May 29, 2022 3:42 pm Subject: Re: [R] categorizing data Some ideas: You could create a cluster model with k=3 for each of the 3 variables, to determine what constitutes high/medium/low centroid values for each of the 3 types of plant types. Centroid values could then be used as the upper/lower boundary ranges for high/med/low. Or utilize a histogram for each variable, and use quantiles or densities, etc. to determine the natural breaks for the high/med/low ranges for each of the IVs. On 2022-05-29 15:28, Janet Choate wrote: > Hi R community, > I have a data frame with three variables, where each row adds up to 90. > I want to assign a category of low, medium, or high to the values in > each > row - where the lowest value per row will be set to 10, the medium > value > set to 30, and the high value set to 50 - so each row still adds up to > 90. > > For example: > Data: Orig > tree shrub grass > 32 11 47 > 23 41 26 > 49 23 18 > > Data: New > tree shrub grass > 30 10 50 > 10 50 30 > 50 30 10 > > I am not attaching any code here as I have not been able to write > anything > effective! appreciate help with this! > thank you, > JC > > -- > > [[alternative HTML version deleted]] > > __ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
Re: [R] categorizing data
Hello,

Here is a way. Define a function to change the values and call it in an apply loop. But Tom's suggestions are more reasonable; you should have a good reason to change the data.

x <- '
tree shrub grass
32 11 47
23 41 26
49 23 18'
orig <- read.table(textConnection(x), header = TRUE)

f <- function(x) {
  stopifnot(length(x) == 3L)
  i_min <- which.min(x)
  i_max <- which.max(x)
  # total amount moved when the min becomes 10 and the max becomes 50
  s <- (x[i_min] - 10) + (x[i_max] - 50)
  x[i_min] <- 10
  x[i_max] <- 50
  # the remaining (middle) value absorbs the difference
  x[-c(i_min, i_max)] <- x[-c(i_min, i_max)] + s
  x
}

t(apply(orig, 1, f))
#      tree shrub grass
# [1,]   30    10    50
# [2,]   10    50    30
# [3,]   50    30    10

Hope this helps,

Rui Barradas

At 20:28 on 29/05/2022, Janet Choate wrote:
> Hi R community,
> I have a data frame with three variables, where each row adds up to 90.
> I want to assign a category of low, medium, or high to the values in each
> row - where the lowest value per row will be set to 10, the medium value
> set to 30, and the high value set to 50 - so each row still adds up to 90.
>
> For example:
> Data: Orig
> tree shrub grass
>   32    11    47
>   23    41    26
>   49    23    18
>
> Data: New
> tree shrub grass
>   30    10    50
>   10    50    30
>   50    30    10
>
> I am not attaching any code here as I have not been able to write anything
> effective! appreciate help with this!
> thank you,
> JC
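A note on why the middle value in the function above lands on 30: the amount added to it is (min - 10) + (max - 50), so the middle value becomes sum(x) - 60, which equals 30 only when the row sums to 90. A small check, using a hypothetical helper `middle_after` that is not part of the original post:

```r
# Hypothetical helper: the value the middle element ends up with
# after the min is set to 10 and the max is set to 50
middle_after <- function(x) {
  i_min <- which.min(x)
  i_max <- which.max(x)
  s <- (x[i_min] - 10) + (x[i_max] - 50)
  x[-c(i_min, i_max)] + s
}

middle_after(c(32, 11, 47))   # row sums to 90  -> 30
middle_after(c(32, 11, 57))   # row sums to 100 -> 40
```

So the approach relies on the stated property that every row adds up to 90.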
Re: [R] [External] categorizing data
Orig <- read.table(text="
tree shrub grass
32 11 47
23 41 26
49 23 18
", header=TRUE)

New <- Orig
# rank() gives each value's position within its row (1 = lowest);
# order() would give the sorting permutation, which is not the same thing
for (i in seq(nrow(Orig)))
  New[i,] <- c(10, 30, 50)[rank(unlist(Orig[i,]))]
New

> On May 29, 2022, at 15:28, Janet Choate wrote:
>
> Hi R community,
> I have a data frame with three variables, where each row adds up to 90.
> I want to assign a category of low, medium, or high to the values in each
> row - where the lowest value per row will be set to 10, the medium value
> set to 30, and the high value set to 50 - so each row still adds up to 90.
>
> For example:
> Data: Orig
> tree shrub grass
>   32    11    47
>   23    41    26
>   49    23    18
>
> Data: New
> tree shrub grass
>   30    10    50
>   10    50    30
>   50    30    10
>
> I am not attaching any code here as I have not been able to write anything
> effective! appreciate help with this!
> thank you,
> JC
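One detail worth checking in any solution that indexes c(10, 30, 50) by position: rank() gives each element's position in sorted order, whereas order() gives the permutation that sorts the vector. The two agree on all three example rows in this thread, but not in general. A small illustration with a made-up row:

```r
x <- c(30, 40, 20)            # made-up row, still sums to 90

rank(x)                       # 2 3 1: position of each element
order(x)                      # 3 1 2: indices that would sort x

by_rank  <- c(10, 30, 50)[rank(x)]    # 30 50 10: medium, high, low
by_order <- c(10, 30, 50)[order(x)]   # 50 10 30: not the intended mapping
```

The example rows happen to be permutations that are their own inverse, which is why both functions give the same answer there.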
Re: [R] categorizing data
You could write a function that deals with one row of your data, based on the rank() function. E.g.,

> to_10_30_50 <- function(x) {
+   stopifnot(is.numeric(x), length(x)==3, sum(x)==90, all(x>0))
+   c(10,30,50)[rank(x)]   # rank 1 (lowest) -> 10, 2 -> 30, 3 -> 50
+ }
> to_10_30_50(c(23,41,26))
[1] 10 50 30

Then loop over the rows. Since this is a data.frame and not a matrix, you need to coerce each row from a single-row data.frame to a numeric vector:

> data <- data.frame(tree=c(32,23,49), shrub=c(11,41,23), grass=c(47,26,18))
> for(i in 1:nrow(data)) data[i,] <- to_10_30_50(as.numeric(data[i,]))
> data
  tree shrub grass
1   30    10    50
2   10    50    30
3   50    30    10

-Bill

On Sun, May 29, 2022 at 12:29 PM Janet Choate wrote:
> Hi R community,
> I have a data frame with three variables, where each row adds up to 90.
> I want to assign a category of low, medium, or high to the values in each
> row - where the lowest value per row will be set to 10, the medium value
> set to 30, and the high value set to 50 - so each row still adds up to 90.
>
> For example:
> Data: Orig
> tree shrub grass
>   32    11    47
>   23    41    26
>   49    23    18
>
> Data: New
> tree shrub grass
>   30    10    50
>   10    50    30
>   50    30    10
>
> I am not attaching any code here as I have not been able to write anything
> effective! appreciate help with this!
> thank you,
> JC
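The row loop above can also be written without an explicit for(), and with a stated rule for ties. This is only a sketch: the tie rule (the earlier column gets the lower category, via ties.method = "first") is an assumption, since the original question does not say how ties should be handled, and the names `to_levels` and `res` are made up.

```r
data <- data.frame(tree  = c(32, 23, 49),
                   shrub = c(11, 41, 23),
                   grass = c(47, 26, 18))

# Hypothetical helper: map one row to 10/30/50 by rank;
# ties go to the earlier column (an assumption, not from the thread)
to_levels <- function(x) c(10, 30, 50)[rank(x, ties.method = "first")]

res <- data
# apply() returns one column per input row, hence t();
# `res[] <-` keeps the data.frame class and column names
res[] <- t(apply(as.matrix(data), 1, to_levels))
res
```

With ties.method = "first" every row still receives exactly one 10, one 30 and one 50, so the row sums stay at 90 even when two values are equal.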
Re: [R] categorizing data
Some ideas:

You could create a cluster model with k=3 for each of the 3 variables, to determine what constitutes high/medium/low centroid values for each of the 3 plant types. Centroid values could then be used as the upper/lower boundary ranges for high/med/low.

Or utilize a histogram for each variable, and use quantiles or densities, etc. to determine the natural breaks for the high/med/low ranges for each of the IVs.

On 2022-05-29 15:28, Janet Choate wrote:
> Hi R community,
> I have a data frame with three variables, where each row adds up to 90.
> I want to assign a category of low, medium, or high to the values in each
> row - where the lowest value per row will be set to 10, the medium value
> set to 30, and the high value set to 50 - so each row still adds up to 90.
>
> For example:
> Data: Orig
> tree shrub grass
>   32    11    47
>   23    41    26
>   49    23    18
>
> Data: New
> tree shrub grass
>   30    10    50
>   10    50    30
>   50    30    10
>
> I am not attaching any code here as I have not been able to write anything
> effective! appreciate help with this!
> thank you,
> JC
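As a rough sketch of the quantile idea, base R's quantile() and cut() can split one variable into three bins. The values below are made up, the tercile cut points are an assumption (the post does not say which quantiles to use), and the labels are illustrative only.

```r
# Made-up values for a single variable
tree <- c(32, 23, 49, 15, 40, 28, 36, 20, 45)

# Tercile boundaries: 0%, 33.3%, 66.7%, 100%
breaks <- quantile(tree, probs = c(0, 1/3, 2/3, 1))

# Bin each value; include.lowest keeps the minimum in the first bin
level <- cut(tree, breaks = breaks,
             labels = c("low", "medium", "high"),
             include.lowest = TRUE)

table(level)
```

Note that this classifies each value relative to its own column, which is a different question from ranking the three values within each row.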
[R] How to color boxplots with respect to the variable names
I have the following data and I need to use a boxplot which displays the variables (RF, Ranger, SVM, KNN) with one color, the variables (RF_boot, Ranger_boot, SVM_boot, KNN_boot) with another color, and the variables (RF_LOO, SVM_LOO, Ranger_LOO, KNN_LOO) with another color. How can I do that? Currently, I am using the base boxplot, which displays them in one color. I know it will be more easily achieved with ggplot but I have no experience/knowledge with it.

RF= c(4.7, 1.52, 1.46, 4.5, 0.62, 1.12)
RF_LOO= c(5.2, 1.52, 1.44, 4.3, 0.64, 1.11)
RF_boot= c(5.8, 1.5, 1.23, 4.3, 0.64, 1.12)
Ranger= c(4.5, 1.57, 1.25, 3.75, 0.56, 1.09)
Ranger_LOO= c(5, 1.56, 1.35, 3.7, 0.6, 1.0)
Ranger_boot= c(4.2, 1.53, 1.12, 3.7, 0.63, 1.1)
SVM= c(3.51, 1.34, 0.62, 1.45, 0.5, 1.06)
SVM_LOO= c(3.6, 1.33, 0.33, 1.4, 0.41, 1.1)
SVM_boot= c(3.75, 1.35, 0.58, 1.4, 0.4, 1.0)
KNN= c(2.85, 1.35, 0.25, 1.76, 0.43, 1.25)
KNN_LOO= c(2.85, 1.34, 0.375, 1.75, 0.44, 1.27)
KNN_boot= c(2.75, 1.35, 0.375, 1.75, 0.45, 1.27)

My base boxplot is here:

colors = rep("blue",12)
at.x <- seq(1,by=.4, length.out = 10)
boxplot(RF, RF_LOO, RF_boot, Ranger, Ranger_LOO, Ranger_boot,
        SVM, SVM_LOO, SVM_boot, KNN, KNN_LOO, KNN_boot,
        range = 0, col=colors,
        names= c("RF", "RF_LOO", "RF_boot", "Ranger", "Ranger_LOO", "Ranger_boot",
                 "SVM", "SVM_LOO", "SVM_boot", "KNN", "KNN_LOO", "KNN_boot"),
        las=2, boxwex=0.5, outline=FALSE, cex.axis=0.8,
        main="Consistency of the 100% features ")
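One possible way to do this in base graphics, sketched under a few assumptions: the vectors are collected in a named list (called `vals` here, a made-up name), the group is read off the name suffix with grepl(), and the three colours and the legend label "original" are arbitrary choices.

```r
# Values copied from the post, collected in a named list
vals <- list(
  RF          = c(4.7, 1.52, 1.46, 4.5, 0.62, 1.12),
  RF_LOO      = c(5.2, 1.52, 1.44, 4.3, 0.64, 1.11),
  RF_boot     = c(5.8, 1.5, 1.23, 4.3, 0.64, 1.12),
  Ranger      = c(4.5, 1.57, 1.25, 3.75, 0.56, 1.09),
  Ranger_LOO  = c(5, 1.56, 1.35, 3.7, 0.6, 1.0),
  Ranger_boot = c(4.2, 1.53, 1.12, 3.7, 0.63, 1.1),
  SVM         = c(3.51, 1.34, 0.62, 1.45, 0.5, 1.06),
  SVM_LOO     = c(3.6, 1.33, 0.33, 1.4, 0.41, 1.1),
  SVM_boot    = c(3.75, 1.35, 0.58, 1.4, 0.4, 1.0),
  KNN         = c(2.85, 1.35, 0.25, 1.76, 0.43, 1.25),
  KNN_LOO     = c(2.85, 1.34, 0.375, 1.75, 0.44, 1.27),
  KNN_boot    = c(2.75, 1.35, 0.375, 1.75, 0.45, 1.27)
)

# Colour by name suffix: "_LOO", "_boot", or neither (colours are arbitrary)
cols <- ifelse(grepl("_LOO$", names(vals)), "tomato",
        ifelse(grepl("_boot$", names(vals)), "gold", "steelblue"))

# boxplot() accepts a list and takes the box labels from its names
boxplot(vals, range = 0, col = cols, las = 2, boxwex = 0.5,
        cex.axis = 0.8, main = "Consistency of the 100% features")
legend("topright", legend = c("original", "LOO", "boot"),
       fill = c("steelblue", "tomato", "gold"))
```

Because `col` is recycled box by box, any character vector of 12 colours in the same order as the boxes would work; deriving it from the names just avoids typing it by hand.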
Re: [R] Use of ellipsis
Thank you very much Ivan and Bert!

I used the eval(substitute()) workaround suggested by Ivan and it worked perfectly.

Andreas Matre