Re: [R] How to do non-parametric calculations in R
Imagine that it's the year 2022 and you don't know how to look up information about performing a Kruskal-Wallis H test. It would take you longer to join the listserv and then write such a cockamamie email than to open the stats textbook you are supposed to have for the course, much less run a simple search query.

On 2022-06-11 19:19, Ebert, Timothy Aaron wrote:
LOL. Thank goodness I successfully rolled my saving throw and resisted, at least for this round. Tim

-----Original Message-----
From: R-help On Behalf Of Jeff Newmiller
Sent: Saturday, June 11, 2022 5:27 PM
To: r-help@r-project.org; J C Nash
Subject: Re: [R] How to do non-parametric calculations in R

[External Email] Really? But it is such a random list that I thought it was a test of our ability to resist providing impromptu lectures on off-list topics, since we all like to expound on "stuff" even when R isn't needed to understand it. Or perhaps "A R Lover" just didn't read the Posting Guide warnings about HTML email, homework, and statistics, and will soon have done so and will be sharing some R code that is giving them an error.

On June 11, 2022 1:53:37 PM PDT, J C Nash wrote:
Homework!

On 2022-06-11 10:24, Shantanu Shimpi wrote:
Dear R community, please help me in knowing how to do the following non-parametric tests:
1. Kruskal-Wallis test
2. Wilcoxon rank sum test
3. Lee Cronbach's alpha test
4. Spearman's rank correlation test
5. Henry Garrett method formula calculations
6. Factor analysis
7. Chi-square test
Kindly guide me on the above queries in the easiest way. A R lover, Col Shantanu. India.
attitudeshantanu1...@gmail.com 7722030088

[[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

-- Sent from my phone. Please excuse my brevity.
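For what it's worth, most of the items on that list are single function calls in base R. A minimal, hedged sketch on invented data (`score` and `group` below are made up purely for illustration; Cronbach's alpha and the Henry Garrett method need contributed packages):

```r
# Invented example data: a numeric response and a three-level group
set.seed(1)
d <- data.frame(score = rnorm(90),
                group = rep(c("a", "b", "c"), each = 30))

kruskal.test(score ~ group, data = d)          # 1. Kruskal-Wallis H test
wilcox.test(score ~ group,                     # 2. Wilcoxon rank sum test
            data = subset(d, group != "c"))    #    (needs exactly two groups)
cor.test(d$score, seq_len(nrow(d)),            # 4. Spearman's rank correlation
         method = "spearman")                  #    (any two numeric vectors)
chisq.test(table(d$group, d$score > 0))        # 7. Chi-square test of independence
# 3. Cronbach's alpha: psych::alpha() or ltm::cronbach.alpha()
# 6. Factor analysis: factanal() in base R
# 5. Henry Garrett ranking: no base-R function; scores come from the
#    published Garrett conversion table
```

Each test returns an htest object whose p-value and statistic print directly at the console; see ?kruskal.test and friends for the details.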
Re: [R] categorizing data
Some ideas: you could fit a cluster model with k = 3 for each of the three variables to determine what constitutes high/medium/low centroid values for each of the three plant types; the centroid values could then be used as the upper/lower boundaries of the high/med/low ranges. Or use a histogram for each variable, with quantiles or densities to determine the natural breaks for the high/med/low ranges of each of the IVs.

On 2022-05-29 15:28, Janet Choate wrote:
Hi R community, I have a data frame with three variables, where each row adds up to 90. I want to assign a category of low, medium, or high to the values in each row - where the lowest value per row will be set to 10, the medium value set to 30, and the high value set to 50 - so each row still adds up to 90. For example:

Data: Orig
tree shrub grass
  32    11    47
  23    41    26
  49    23    18

Data: New
tree shrub grass
  30    10    50
  10    50    30
  50    30    10

I am not attaching any code here as I have not been able to write anything effective! Appreciate help with this! Thank you, JC

[[alternative HTML version deleted]]
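If the rule really is just lowest -> 10, middle -> 30, highest -> 50 within each row, no clustering is needed at all: rank each row and index into c(10, 30, 50). A minimal sketch on JC's example data:

```r
d <- data.frame(tree  = c(32, 23, 49),
                shrub = c(11, 41, 23),
                grass = c(47, 26, 18))

# rank() gives 1 for the smallest value in a row and 3 for the largest;
# use that rank to pick the replacement value for each cell
d[] <- t(apply(d, 1, function(x) c(10, 30, 50)[rank(x)]))
d
#   tree shrub grass
# 1   30    10    50
# 2   10    50    30
# 3   50    30    10
```

Ties would need a ties.method argument to rank() (and a decision about which tied value gets which category), but for distinct values per row this reproduces the desired output exactly.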
Re: [R] Is there a canonical way to pronounce CRAN?
Everyone needs to speak English exactly like I do or else they're doing it wrong :) But I pronounce CRAN the same way that I pronounce the first half of cranberry.

On 2022-05-04 20:24, Avi Gross via R-help wrote:
Extended discussion may be a waste, but speaking for myself, I found it highly enlightening to hear that many had a mental image of an alternate way to pronounce CRAN, since it so OBVIOUSLY had a natural way to pronounce it. I often cringe when I listen to an audio book in the car and the person chosen to narrate gets words not just wrong but very wrong, as in nobody in any country would likely pronounce it that way! TV shows and elsewhere do the same. If someone asked me to check if something was on see-ran or even see-are-a-en, it might take me a moment to shift gears and realize they meant CRAN. There is no real right or wrong way, and we see organizations with names hard to pronounce, like FBI or CIA, often referred to with words like Quantico or Langley based on an area they are associated with. People like words they can hope to pronounce, like the non-existent UNCLE or even SMERSH and KAOS. I will say it is quite logical that if you see C-SPAN as see-span, then CPAN as C-PAN, you might then see CRAN the way you do, as C-RAN. But consider the many functions and packages in a language like R and ask if everyone thinks or pronounces them the way you do. How many initially read runif() as run-if until you realized it was a distribution of uniform random numbers in R, as compared to rnorm() for an R version of a normal distribution, and thus maybe can be pronounced more like r-unif and r-norm? Also, runif is part of a related set of functions all having something to do with a uniform distribution - dunif(), punif() and qunif() - so, again, it hints at a consistent way to speak them aloud. Of course, in written form, they speak for themselves. But not everything can or should be pronounced. We do not all speak the same languages or the same way.
Sometimes spelling things out as C-R-A-N is a better way to go, albeit if the others pronounce those letters differently you end up like people who think there is a Hungarian word sounding something like vey-tsey to mean toilet, when it is actually spelled WC, borrowed from the English Water Closet: the way you say W as a letter of the alphabet followed by the way you say C as a letter of the alphabet sounds ...

-----Original Message-----
From: Jim Lemon
To: Stephen P. Molnar ; r-help mailing list
Sent: Wed, May 4, 2022 6:46 pm
Subject: Re: [R] Is there a canonical way to pronounce CRAN?

Perhaps not entirely a waste. Shots have been fired over less. Allow the neologism 'packronym' to signify the packing of an acronym into a pronounceable word. (A necessary skill in the public service. If you cannot correctly pronounce DFAT, resign yourself to menial labor.) If we endorse the anglophone packronym we get: kræn. An Italian might lean toward: tʃrɑn, while Spanish (the happiest language, they say) would produce: krɛn. So Babel continues to amuse or enrage us, depending upon our emotional disposition. I dare not comment upon those who would laboriously spell it out as: Charlie Romeo Alfa November. Jim

On Thu, May 5, 2022 at 4:05 AM Stephen P. Molnar wrote:
Yes, I know that I'm contributing, but what a waste of band width.

[[alternative HTML version deleted]]
Re: [R] Combining data.frames
Have you looked at the merge function in base R? https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/merge

On 2022-03-19 21:15, Jeff Reichman wrote:
R-Help Community, I'm trying to combine two data.frames, each containing 10 columns, of which they share two common fields. Here are two small test datasets.

df1 <- data.frame(
  date = c("2021-1-1","2021-1-1","2021-1-1","2021-1-1","2021-1-1",
           "2021-1-2","2021-1-2","2021-1-3","2021-1-3","2021-1-3"),
  geo_hash = c("abc123","abc123","abc456","abc789","abc246","abc123",
               "asd123","abc789","abc890","abc123"),
  ad_id = c("a12345","b12345","a12345","a12345","c12345",
            "b12345","b12345","a12345","b12345","a12345"))

df2 <- data.frame(
  date = c("2021-1-1","2021-1-1","2021-1-2","2021-1-3","2021-1-3"),
  geo_hash = c("abc123","abc456","abc123","abc789","abc890"),
  event = c("shoting","ied","protest","riot","protest"))

I'm trying to combine them such that I get a combined data.frame such as:

date      geo_hash  ad_id   event
1/1/2021  abc123    a12345  shoting
1/1/2021  abc123    b12345
1/1/2021  abc456    a12345  ied
1/1/2021  abc789    a12345
1/1/2021  abc246    c12345

Jeff
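Concretely, with Jeff's two test data frames, a left join that keeps every df1 row would be (a sketch; merge() fills NA rather than blank where df2 has no match):

```r
df1 <- data.frame(
  date = c("2021-1-1","2021-1-1","2021-1-1","2021-1-1","2021-1-1",
           "2021-1-2","2021-1-2","2021-1-3","2021-1-3","2021-1-3"),
  geo_hash = c("abc123","abc123","abc456","abc789","abc246","abc123",
               "asd123","abc789","abc890","abc123"),
  ad_id = c("a12345","b12345","a12345","a12345","c12345",
            "b12345","b12345","a12345","b12345","a12345"))
df2 <- data.frame(
  date = c("2021-1-1","2021-1-1","2021-1-2","2021-1-3","2021-1-3"),
  geo_hash = c("abc123","abc456","abc123","abc789","abc890"),
  event = c("shoting","ied","protest","riot","protest"))

# all.x = TRUE keeps df1 rows with no match in df2 (event becomes NA)
combined <- merge(df1, df2, by = c("date", "geo_hash"), all.x = TRUE)
head(combined)
```

Note one difference from the desired output shown in the thread: merge() repeats `event` on every matching (date, geo_hash) row, so blanking it on all but the first duplicate would need a small post-processing step.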
Re: [R] Time for a companion mailing list for R packages?
I concur on both of Eric's suggestions below. I love R but I couldn't imagine using it on a daily basis without "key" packages for various regression and classification modeling problems, etc. Likewise on being able to embed images (within reason... maybe establish a max KB or MB file size for attachments). Thanks, Tom On 2022-01-13 12:25, Eric Berger wrote: Re: constructive criticism to make this list more useful to more people: Suggestion 1: accommodate questions related to non-base-R packages This has been addressed by many already. The current de facto situation is that such questions are asked and often answered. Perhaps the posting guide should be altered so that such questions fall within the guidelines. Suggestion 2: expand beyond plain-text mode I assume there is a reason for this restriction but it seems to create a lot of delay and often havoc. Also, many questions on this list relate to graphics which is an important part of R (even base R) and such questions may often be more easily communicated with images. Eric On Thu, Jan 13, 2022 at 6:08 PM John Fox wrote: Dear Avi et al., Rather than proliferating R mailing lists, why not just allow questions on non-standard packages on the r-help list? (1) If people don't want to answer these questions, they don't have to. (2) Users won't necessarily find the new email list and so may post to r-help anyway, only to be told that they should have posted to another list. (3) Many of the questions currently posted to the list concern non-standard packages and most of them are answered. (4) If people prefer other sources of help (as listed on the R website "getting help" page) then they are free to use them. (5) As I read the posting guide, questions about non-standard packages aren't actually disallowed; the posting guide suggests, however, that the package maintainer be contacted first. 
But answers can be helpful to other users, and so it may be preferable for at least some of these questions to be asked on the list. (6) Finally, the instruction concerning non-standard packages is buried near the end of the posting guide, and users, especially new users, may not understand what the term "standard packages" means even if they find their way to the posting guide. Best, John -- John Fox, Professor Emeritus McMaster University Hamilton, Ontario, Canada web: https://socialsciences.mcmaster.ca/jfox/ On 2022-01-12 10:27 p.m., Avi Gross via R-help wrote: > Respectfully, this forum gets lots of questions that include non-base R components and especially packages in the tidyverse. Like it or not, the extended R language is far more useful and interesting for many people and especially those who do not wish to constantly reinvent the wheel. > And repeatedly, we get people reminding (and sometimes chiding) others for daring to post questions or supply answers on what they see as a pure R list. They have a point. > Yes, there are other places (many not being mailing lists like this one) where we can direct the questions but why can't there be an official mailing list alongside this one specifically focused on helping or just discussing R issues related partially to the use of packages. I don't mean for people making a package to share, just users who may be searching for an appropriate package or using a common package, especially the ones in the tidyverse that are NOT GOING AWAY just because some purists ... > I prefer a diverse set of ways to do things and base R is NOT enough for me, nor frankly is R with all packages included as I find other languages suit my needs at times for doing various things. If this group is for purists, fine. Can we have another for the rest of us? Live and let live. 
> -----Original Message-----
> From: Duncan Murdoch
> To: Kai Yang ; R-help Mailing List <r-help@r-project.org>
> Sent: Wed, Jan 12, 2022 3:22 pm
> Subject: Re: [R] how to find the table in R studio
>
> On 12/01/2022 3:07 p.m., Kai Yang via R-help wrote:
>> Hi all,
>> I created a function in R. It generates a table "temp". I can view it in RStudio, but I cannot find it in the top right window in RStudio. Can someone tell me how to find it there? Same thing for f_table.
>> Thank you,
>> Kai
>>
>> library(tidyverse)
>>
>> f1 <- function(indata, subgrp1){
>>   subgrp1 <- enquo(subgrp1)
>>   indata0 <- indata
>>   temp <- indata0 %>% select(!!subgrp1) %>% arrange(!!subgrp1) %>%
>>     group_by(!!subgrp1) %>%
>>     mutate(numbering = row_number(), max = max(numbering))
>>   view(temp)
>>   f_table <- table(temp$Species)
>>   view(f_table)
>> }
>>
>> f1(iris, Species)
>
> Someone is sure to point out that this isn't an RStudio support list, but your issue is with R, not with RStudio. You created the table in f1, but you never returned it. The variable f_table is local to the function. You'd need to return it from the function for it to be visible in the calling environment.
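To make f1's table usable (and visible in the environment pane) outside the function, return it as the function's value. A minimal sketch of that change (pull() is used here so the table isn't hard-coded to Species):

```r
library(tidyverse)

f1 <- function(indata, subgrp1) {
  subgrp1 <- enquo(subgrp1)
  temp <- indata %>%
    select(!!subgrp1) %>%
    arrange(!!subgrp1) %>%
    group_by(!!subgrp1) %>%
    mutate(numbering = row_number(), max = max(numbering))
  table(pull(temp, !!subgrp1))  # returned value, instead of a local f_table
}

f_table <- f1(iris, Species)  # f_table now exists in the calling environment
f_table
```

The same applies to temp: anything created inside f1 and not returned is discarded when the function exits; view() only displays it.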
Re: [R] Defining Parameters in arules
Graham Williams has a book titled "Data Mining with Rattle and R", which has a chapter on association rules and the arules package. Williams' Rattle GUI package for R also lets you define an association rules model through a graphical interface (which writes the corresponding R code to Rattle's log file for you). I use this textbook in one of the MS-level R courses that I teach and have found it a good way to convey these concepts, especially for those new to R and AI/ML generally.

On 2021-11-23 05:17, Ivan Krylov wrote:
Hello, if you don't get an answer here, consider asking the package maintainer: Michael Hahsler
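As a pointer while waiting: the parameters the subject line asks about are passed to apriori() as a named list. A generic sketch on the Groceries dataset that ships with arules (the thresholds here are illustrative, not recommendations):

```r
library(arules)

data(Groceries)  # example transaction data bundled with arules
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01,    # minimum support
                                  conf = 0.5,     # minimum confidence
                                  minlen = 2))    # rules with at least 2 items
inspect(head(sort(rules, by = "lift"), 5))
```

Lowering supp/conf yields more (and noisier) rules; sorting by lift is a common way to surface the interesting ones first.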
Re: [R] Creating a log-transformed histogram of multiclass data
Apologies, I left out 3 critical lines of code after the randomized sample dataframe is created:

group_a <- d[ which(d$label == 'A'), ]
group_b <- d[ which(d$label == 'B'), ]
group_c <- d[ which(d$label == 'C'), ]

On 2021-08-03 18:56, Tom Woolman wrote:
# Resending this message since the original email was held in queue by the listserv software because of a "suspicious" subject line, and/or because of the attached .png histogram chart attachments. I'm guessing that the listserv software doesn't like multiple image file attachments.

Hi everyone. I'm working on a research model that is calculating anomaly scores (RMSE values) for three distinct groups within a large dataset. The anomaly scores are a continuous data type and are quite small, ranging from approximately 1e-04 to 1e-07 across a population of approximately 1 million observations. I have all of the summary and descriptive statistics for each of the anomaly score distributions across each group label in the dataset, and I am able to create some useful histograms showing how each of the three groups is uniquely distributed across the range of scores. However, because of the large variance within the frequency of score values and the high-density peaks within much of the anomaly scores, I need to use a log transformation within the histogram to show both the log frequency count of each binned observation range (y-axis) and a log transformation of the binned score values (x-axis), to appropriately illustrate the distributions within the data and make them more readily understandable. Fortunately, ggplot2 is really useful for creating some attractive dual-axis log-transformed histograms. However, I cannot figure out a way to create the log-transformed histograms to show each of my three groups by color within the same histogram. I would want it to look like this, BUT use a log transformation for each axis. The plot below shows the 3 groups in one histogram but uses the default untransformed values.
For log-transformed axis values, the best I can do so far is produce three separate histograms, one for each group. Below is sample R code to illustrate my problem with a randomly-generated example dataset and the ggplot2 approaches that I have taken so far:

# Sample R code below:
library(ggplot2)
library(dplyr)
library(hrbrthemes)

# I created some simple random sample data to produce an example dataset.
# This produces an example dataframe called d, which contains a class label IV of either A, B or C for each observation. The target variable is the anomaly_score continuous value for each observation.
# There are 300 rows of dummy data in this dataframe.
DV_score_generator = round(runif(300, 0.001, 0.999), 3)
d <- data.frame(
  label = sample(LETTERS[1:3], 300, replace=TRUE, prob=c(0.65, 0.30, 0.05)),
  anomaly_score = DV_score_generator)

# First, I use ggplot to create the normal distribution histogram that shows all 3 groups on the same plot, by color.
# Please note that with this small set of randomized sample data it doesn't appear to be necessary to use an x and y-axis log transformation to show the distribution patterns, but it does become an issue with the vastly larger and more complex score values in the DV of the actual data.
p <- d %>%
  ggplot(aes(x=anomaly_score, fill=label)) +
  geom_histogram(color="#e9ecef", alpha=0.6, position = 'identity') +
  scale_fill_manual(values=c("#69b3a2", "blue", "#404080")) +
  theme_ipsum() +
  labs(fill="")
p
# Produces a normal multiclass histogram.
# Now produce a series of x and y-axis log-transformed histograms, producing one histogram for each distinct label class in the dataset:

# Group A, log transformed
ggplot(group_a, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05, colour = "darkgoldenrod1", fill = "darkgoldenrod2") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
  scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group A Only")
# Group A transformed histogram is produced here.

# Group B, log transformed
ggplot(group_b, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05, colour = "green", fill = "darkgreen") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
  scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group B Only")
# Group B transformed histogram is produced here.

# Group C, log transformed
ggplot(group_c, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05, colour = "red", fill = "darkred") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
  scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group C Only")
# Group C transformed histogram is produced here.
[R] Creating a log-transformed histogram of multiclass data
# Resending this message since the original email was held in queue by the listserv software because of a "suspicious" subject line, and/or because of the attached .png histogram chart attachments. I'm guessing that the listserv software doesn't like multiple image file attachments.

Hi everyone. I'm working on a research model that is calculating anomaly scores (RMSE values) for three distinct groups within a large dataset. The anomaly scores are a continuous data type and are quite small, ranging from approximately 1e-04 to 1e-07 across a population of approximately 1 million observations. I have all of the summary and descriptive statistics for each of the anomaly score distributions across each group label in the dataset, and I am able to create some useful histograms showing how each of the three groups is uniquely distributed across the range of scores. However, because of the large variance within the frequency of score values and the high-density peaks within much of the anomaly scores, I need to use a log transformation within the histogram to show both the log frequency count of each binned observation range (y-axis) and a log transformation of the binned score values (x-axis), to appropriately illustrate the distributions within the data and make them more readily understandable. Fortunately, ggplot2 is really useful for creating some attractive dual-axis log-transformed histograms. However, I cannot figure out a way to create the log-transformed histograms to show each of my three groups by color within the same histogram. I would want it to look like this, BUT use a log transformation for each axis. The plot below shows the 3 groups in one histogram but uses the default untransformed values. For log-transformed axis values, the best I can do so far is produce three separate histograms, one for each group.
Below is sample R code to illustrate my problem with a randomly-generated example dataset and the ggplot2 approaches that I have taken so far:

# Sample R code below:
library(ggplot2)
library(dplyr)
library(hrbrthemes)

# I created some simple random sample data to produce an example dataset.
# This produces an example dataframe called d, which contains a class label IV of either A, B or C for each observation. The target variable is the anomaly_score continuous value for each observation.
# There are 300 rows of dummy data in this dataframe.
DV_score_generator = round(runif(300, 0.001, 0.999), 3)
d <- data.frame(
  label = sample(LETTERS[1:3], 300, replace=TRUE, prob=c(0.65, 0.30, 0.05)),
  anomaly_score = DV_score_generator)

# First, I use ggplot to create the normal distribution histogram that shows all 3 groups on the same plot, by color.
# Please note that with this small set of randomized sample data it doesn't appear to be necessary to use an x and y-axis log transformation to show the distribution patterns, but it does become an issue with the vastly larger and more complex score values in the DV of the actual data.
p <- d %>%
  ggplot(aes(x=anomaly_score, fill=label)) +
  geom_histogram(color="#e9ecef", alpha=0.6, position = 'identity') +
  scale_fill_manual(values=c("#69b3a2", "blue", "#404080")) +
  theme_ipsum() +
  labs(fill="")
p
# Produces a normal multiclass histogram.

# Now produce a series of x and y-axis log-transformed histograms, producing one histogram for each distinct label class in the dataset:

# Group A, log transformed
ggplot(group_a, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05, colour = "darkgoldenrod1", fill = "darkgoldenrod2") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
  scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group A Only")
# Group A transformed histogram is produced here.
# Group B, log transformed
ggplot(group_b, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05, colour = "green", fill = "darkgreen") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
  scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group B Only")
# Group B transformed histogram is produced here.

# Group C, log transformed
ggplot(group_c, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05, colour = "red", fill = "darkred") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
  scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group C Only")
# Group C transformed histogram is produced here.

# End. Thanks in advance, everyone!

- Tom
Thomas A. Woolman, PhD Candidate (Indiana State University), MBA, MS, MS
On Target Technologies, Inc.
Virginia, USA
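For what it's worth, the combined multi-class histogram with log axes is just the first (all-groups) plot with the same trans arguments that the per-group plots already use. A minimal sketch reusing the example data frame `d` from the code above (note that binwidth is applied on the transformed scale, so it may need retuning; zero-count bins drop out under the log transform):

```r
library(ggplot2)

# One histogram, three groups distinguished by fill colour,
# log2 transformation on both axes
p_log <- ggplot(d, aes(x = anomaly_score, fill = label)) +
  geom_histogram(colour = "#e9ecef", alpha = 0.6, position = "identity") +
  scale_fill_manual(values = c("#69b3a2", "blue", "#404080")) +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans = "log2") +
  scale_y_continuous(name = "Log-transformed Frequency Counts", trans = "log2") +
  labs(fill = "")
p_log
```

Because the scale transformation lives in the scale_*_continuous layers rather than in the data, the fill-by-label aesthetic from the first plot and the log axes from the per-group plots combine without conflict.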
Re: [R] [EXT] Re: Assigning categorical values to dates
Sure thing. Typically with date-related measures I'm going to build a time series model, e.g. ARIMA, or maybe something more funky like a recurrent neural net via TensorFlow. But in theory there's no reason dates can't be factors if it's in keeping with your particular design of experiment and you want to perform an analysis that treats time as qualitative data.

Quoting "N. F. Parsons" :
@Tom Okay, yeah. That might actually be an elegant solution. I will mess around with it. Thank you - I'm not in the habit of using factors and am not super familiar with how they automatically sort themselves.
@Andrew Yes. Each month is a different 30,000-row file upon which this task must be performed.
@Bert If you're not interested in being helpful, why comment? Am I interrupting your clubhouse time? I'm legitimately stumped by this one and reaching out in earnest. "You've been told how to do it" - seriously? We all have different backgrounds and knowledge levels with the entire atlas of the wonderful world of R, and I neither need nor want your opinion on my corner of it. Don't be a Hooke. I'm not here to impress or inspire confidence in you - I'm here with a question that has had me spinning my wheels for the better part of a day and I need fresh perspectives. Your response certainly inspires no confidence in me as to the nature of your character or your knowledge on the topic.

Best regards all,
—
Nathan Parsons, B.Sc, M.Sc, G.C.
Ph.D. Candidate, Dept. of Sociology, Portland State University
Adjunct Professor, Dept. of Sociology, Washington State University
Graduate Advocate, American Association of University Professors (OR)
Recent work (https://www.researchgate.net/profile/Nathan_Parsons3/publications)
Schedule an appointment (https://calendly.com/nate-parsons)

On Wednesday, Jul 21, 2021 at 9:12 PM, Andrew Robinson <a...@unimelb.edu.au> wrote:
I wonder if you mean that you want the levels of the factor to reset within each month?
That is not obvious from your example, but implied by your question. Andrew -- Andrew Robinson Director, CEBRA and Professor of Biosecurity, School/s of BioSciences and Mathematics & Statistics University of Melbourne, VIC 3010 Australia Tel: (+61) 0403 138 955 Email: a...@unimelb.edu.au Website: https://researchers.ms.unimelb.edu.au/~apro@unimelb/ I acknowledge the Traditional Owners of the land I inhabit, and pay my respects to their Elders. On 22 Jul 2021, 1:47 PM +1000, N. F. Parsons , wrote: > External email: Please exercise caution > > I am not averse to a factor-based solution, but I would still have to manually enter that factor each month, correct? If possible, I’d just like to point R at that column and have it do the work. > > — > Nathan Parsons, B.SC, M.Sc, G.C. > > Ph.D. Candidate, Dept. of Sociology, Portland State University > Adjunct Professor, Dept. of Sociology, Washington State University > Graduate Advocate, American Association of University Professors (OR) > > Recent work (https://www.researchgate.net/profile/Nathan_Parsons3/publications) > Schedule an appointment (https://calendly.com/nate-parsons) > > > On Wednesday, Jul 21, 2021 at 8:30 PM, Tom Woolman mailto:twool...@ontargettek.com)> wrote: > > > > Couldn't you convert the date columns to character type data in a data > > frame, and then convert those strings to factors in a 2nd step? > > > > The only downside I think to treating dates as factor levels is that > > you might have an awful lot of factors if you have a large enough > > dataset. > > > > > > > > Quoting "N. F. 
Parsons" : > > > > > Hi all, > > > > > > If I have a tibble as follows: > > > > > > tibble(dates = c(rep("2021-07-04", 2), rep("2021-07-25", 3), > > > rep("2021-07-18", 4))) > > > > > > how in the world do I add a column that evaluates each of those dates and > > > assigns it a categorical value such that > > > > > > dates cycle > > > > > > 2021-07-04 1 > > > 2021-07-04 1 > > > 2021-07-25 3 > > > 2021-07-25 3 > > > 2021-07-25 3 > > > 2021-07-18 2 > > > 2021-07-18 2 > > > 2021-07-18 2 > > > 2021-07-18 2 > > > > > > Not to further complicate matters, but some months I may only have one > > > date, and some months I will have 4 dates - so thats not a fixed quantity. > > > We've literally been doing this by hand at my job and I'd like to automate > > > it. > > > > > > Thanks in advance! > > > > > > Nate Parsons > > > > > > [[alternative HTML version deleted]] > > > > > >
Re: [R] Assigning categorical values to dates
Not if you use as.factor to convert a character type column to factor levels. It should recode the distinct string values to factors automatically for you. i.e., df$datefactors <- as.factor(df$datestrings) Quoting "N. F. Parsons" : I am not averse to a factor-based solution, but I would still have to manually enter that factor each month, correct? If possible, I’d just like to point R at that column and have it do the work. — Nathan Parsons, B.SC, M.Sc, G.C. Ph.D. Candidate, Dept. of Sociology, Portland State University Adjunct Professor, Dept. of Sociology, Washington State University Graduate Advocate, American Association of University Professors (OR) Recent work (https://www.researchgate.net/profile/Nathan_Parsons3/publications) Schedule an appointment (https://calendly.com/nate-parsons) On Wednesday, Jul 21, 2021 at 8:30 PM, Tom Woolman mailto:twool...@ontargettek.com)> wrote: Couldn't you convert the date columns to character type data in a data frame, and then convert those strings to factors in a 2nd step? The only downside I think to treating dates as factor levels is that you might have an awful lot of factors if you have a large enough dataset. Quoting "N. F. Parsons" : > Hi all, > > If I have a tibble as follows: > > tibble(dates = c(rep("2021-07-04", 2), rep("2021-07-25", 3), > rep("2021-07-18", 4))) > > how in the world do I add a column that evaluates each of those dates and > assigns it a categorical value such that > > dates cycle > > 2021-07-04 1 > 2021-07-04 1 > 2021-07-25 3 > 2021-07-25 3 > 2021-07-25 3 > 2021-07-18 2 > 2021-07-18 2 > 2021-07-18 2 > 2021-07-18 2 > > Not to further complicate matters, but some months I may only have one > date, and some months I will have 4 dates - so thats not a fixed quantity. > We've literally been doing this by hand at my job and I'd like to automate > it. > > Thanks in advance! 
> Nate Parsons

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Assigning categorical values to dates
Couldn't you convert the date columns to character-type data in a data frame, and then convert those strings to factors in a second step? The only downside I can see to treating dates as factor levels is that you might end up with an awful lot of levels on a large enough dataset.

Quoting "N. F. Parsons":

> Hi all,
>
> If I have a tibble as follows:
>
> tibble(dates = c(rep("2021-07-04", 2), rep("2021-07-25", 3), rep("2021-07-18", 4)))
>
> how in the world do I add a column that evaluates each of those dates and assigns it a categorical value such that
>
> dates      cycle
> 2021-07-04 1
> 2021-07-04 1
> 2021-07-25 3
> 2021-07-25 3
> 2021-07-25 3
> 2021-07-18 2
> 2021-07-18 2
> 2021-07-18 2
> 2021-07-18 2
>
> Not to further complicate matters, but some months I may only have one date, and some months I will have four, so that's not a fixed quantity. We've literally been doing this by hand at my job and I'd like to automate it.
>
> Thanks in advance!
>
> Nate Parsons
Re: [R] Using R to analyse Court documents
Hi Brian. I assume you're interested in some kind of classification of the theme or contents of each document? If so, I would direct you to natural language processing (NLP), specifically multinomial classification of unstructured text. The first challenge will be obtaining a sufficient number of human-labeled training documents.

Thanks, Tom

Quoting Brian Smith:

> Hi,
>
> I am wondering if there are references on how R can be used to analyse legal/court documents. I searched the internet a bit but was unable to find anything meaningful. Any reference will be much appreciated.
>
> Thanks for your time.
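For a flavor of the preprocessing such a classifier sits on, here is a minimal base-R sketch (toy sentences, no packages) of building a document-term matrix, the usual input to a text classifier; real work would use a package such as tm or quanteda instead:

```r
docs <- c("the court grants the motion",
          "judgment entered for the plaintiff")

# Tokenize on whitespace and build a shared vocabulary.
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term,
# entries are term counts within each document.
dtm <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
rownames(dtm) <- paste0("doc", seq_along(docs))
dtm
```

Rows of `dtm` (optionally tf-idf weighted) are then the feature vectors you would feed, together with human labels, into a multinomial model.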
Re: [R] Windows path backward slash
In Windows versions of R/RStudio, when referring to file paths you need to either use two backslash characters ("\\") instead of one, OR use the forward slash "/" as on Linux/Unix. It's an unfortunate conflict between R and Windows: inside an R string, a single \ by itself is treated as an escape character. It's all Microsoft's fault for picking the opposite-direction slash in MS-DOS instead of conforming to Unix style c. 1980.

Quoting Anbu A:

> Hi Bill,
>
> r"{C:\Users\Anbu\Desktop\sas\}" - this was the key, and the code below worked.
>
> fsasdat <- function(dsn) {
>   pat  <- r"{C:\Users\Anbu\Desktop\sas\}"
>   str1 <- str_c(pat, dsn, ".sas7bdat")
>   read_sas(str1)
>   # return(str1)
> }
> allmetrx <- fsasdat("all")
> str(allmetrx)
>
> Thank you. Anbu.
>
> On Thu, Dec 24, 2020 at 12:12 PM Bill Dunlap wrote:
>> The "\n" is probably not in the file name. Does omitting it from the call to str_c help? -Bill
>>
>> On Thu, Dec 24, 2020 at 6:20 AM Anbu A wrote:
>>> Hi All, I am a newbie. This is my first program. I am trying to read a SAS dataset from the path below. I added an escape "\" alongside each "\" found in the path C:\Users\axyz\Desktop\sas\ but it is still not working.
>>>
>>> fsasdat <- function(dsn) {
>>>   pat  <- "C:\\Users\\axyz\\Desktop\\sas\\"
>>>   str1 <- str_c(pat, dsn, ".sas7bdat", "\n")
>>>   allmetrx <- read_sas(str1)
>>> }
>>> fsasdat("all")
>>>
>>> Please help me. Thanks, AA.
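The three equivalent spellings can be checked directly (illustrative path; raw strings require R >= 4.0.0):

```r
p1 <- "C:\\Users\\me\\Desktop\\file.txt"   # escaped backslashes
p2 <- r"{C:\Users\me\Desktop\file.txt}"    # raw string, R >= 4.0.0
p3 <- "C:/Users/me/Desktop/file.txt"       # forward slashes; Windows accepts these too

identical(p1, p2)
# TRUE: both denote the same string with single backslashes
```

Note that the original poster's extra `"\n"` appended by `str_c()` was a separate bug: it put a literal newline inside the file name, which is why Bill suggested dropping it.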
Re: [R] cooks distance for repeated measures anova
Hi Dr. Pedersen. I haven't used Cook's distance on an aov object, but I use it all the time from an lm (linear model) fit, e.g.:

mod <- lm(y ~ x, data = dataframe)
cooksdistance <- cooks.distance(mod)

I *think* you might be able to mimic an aov fit via lm, since aov is itself a wrapper around lm, though I'm not sure that gets you per-group distances for the Error() stratum. FYI, in my case I'm using Cook's distance as part of an anomaly-detection engine built on a linear model with interactions.

Quoting Walker Scott Pedersen:

> Hi all,
>
> Is there a way to get Cook's distance for a repeated-measures ANOVA? Neither cooks.distance nor CookD from the predictmeans package seems to allow for this. For example, if I have the model
>
> data(iris)
> mod <- aov(Sepal.Length ~ Petal.Length + Petal.Width + Error(Species), data = iris)
>
> both
>
> cooks.distance(mod)
>
> and
>
> library(predictmeans)
> CookD(mod, group = Species)
>
> give an error saying they don't support an aovlist object. I would prefer a method that gives a Cook's distance for each category of my repeated factor (i.e., Species), rather than for each observation.
>
> Thanks!
>
> Walker Pedersen, Ph.D.
> Center for Healthy Minds
> University of Wisconsin -- Madison
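A runnable sketch of that lm route, on the same iris variables but without the Error() stratum (which is what the aovlist methods choke on):

```r
# Cook's distance per observation from a plain lm fit:
mod <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris)
cd  <- cooks.distance(mod)

# One distance per row of iris; large values flag influential points.
length(cd)
# 150
head(sort(cd, decreasing = TRUE), 3)
```

This gives per-observation distances only; aggregating them per Species level (as the poster wants) would need a further step such as deleting one whole group at a time and comparing fits.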
Re: [R] counting duplicate items that occur in multiple groups
Thanks, everyone!

Quoting Jim Lemon:

> Oops, I sent this to Tom earlier today and forgot to copy the list:
>
> VendorID <- rep(paste0("V", 1:10), each = 5)
> AcctID   <- paste0("A", sample(1:5, 50, TRUE))
> Data <- data.frame(VendorID, AcctID)
> table(Data)
> # get the number of vendors holding each account
> dupAcctID <- colSums(table(Data) > 0)
> Data$dupAcct <- NA
> # fill in the new column
> for (i in 1:length(dupAcctID))
>   Data$dupAcct[Data$AcctID == names(dupAcctID[i])] <- dupAcctID[i]
>
> Jim
Re: [R] counting duplicate items that occur in multiple groups
Yes, good catch. Thanks.

Quoting Bert Gunter:

> Why 0's in the data frame? Shouldn't that be 1 (one vendor with that account)?
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
Re: [R] counting duplicate items that occur in multiple groups
Hi Bill. Sorry to be so obtuse with the example data. I was trying (too hard) not to share any actual values, so I just created randomized values for my example; of course I should have specified that random values would not reproduce the problem pattern. I should have used simple dummy codes as Bill Dunlap did.

So per Bill's example data for Data1, the expected (hoped-for) output would be:

  Vendor Account Num_Vendors_Sharing_Bank_Acct
1 V1     A1      0
2 V2     A2      3
3 V3     A2      3
4 V4     A2      3

where the new calculated variable is Num_Vendors_Sharing_Bank_Acct. The value is 3 for V2, V3 and V4 because they all share bank account A2. Likewise, in the Data2 frame the same logic applies:

  Vendor Account Num_Vendors_Sharing_Bank_Acct
1 V1     A1      0
2 V2     A2      3
3 V3     A2      3
4 V1     A2      3
5 V4     A3      0
6 V2     A4      0

Thanks!

Quoting Bill Dunlap:

> What should the result be for
>
> Data1 <- data.frame(Vendor = c("V1","V2","V3","V4"),
>                     Account = c("A1","A2","A2","A2"))
>
> ? Must each vendor have only one account? If not, what should the result be for
>
> Data2 <- data.frame(Vendor = c("V1","V2","V3","V1","V4","V2"),
>                     Account = c("A1","A2","A2","A2","A3","A4"))
>
> ?
>
> -Bill
>
> On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman wrote:
>> Hi everyone. I have a dataframe that is a collection of Vendor IDs plus a bank account number for each vendor. I'm trying to find a way to count the number of duplicate bank accounts that occur in more than one unique Vendor_ID, and then assign that count to each row of the dataframe in a new variable.
>>
>> I can count bank accounts that occur within the same vendor using dplyr with group_by and count, but I can't figure out a way to count duplicates across multiple Vendor_IDs.
>>
>> Dataframe example code:
>>
>> #Create a sample data frame:
>> set.seed(1)
>> Data <- data.frame(Vendor_ID = sample(1:1), Bank_Account_ID = sample(1:1))
>>
>> Thanks in advance for any help.
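Given that specification, one compact base-R sketch (tapply to count distinct vendors per account, zeroed out for unshared accounts as the expected output requests) would be:

```r
Data2 <- data.frame(Vendor  = c("V1","V2","V3","V1","V4","V2"),
                    Account = c("A1","A2","A2","A2","A3","A4"))

# Distinct vendors per account...
n_vend <- tapply(Data2$Vendor, Data2$Account, function(v) length(unique(v)))

# ...mapped back onto the rows, with unshared accounts coded 0.
Data2$Num_Vendors_Sharing_Bank_Acct <-
  ifelse(n_vend[Data2$Account] > 1, n_vend[Data2$Account], 0)

Data2$Num_Vendors_Sharing_Bank_Acct
# 0 3 3 3 0 0
```

This reproduces the Data2 table above exactly and needs no dplyr, though a `group_by(Account) %>% mutate(n = n_distinct(Vendor))` pipeline would be the tidyverse equivalent.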
[R] counting duplicate items that occur in multiple groups
Hi everyone. I have a dataframe that is a collection of Vendor IDs plus a bank account number for each vendor. I'm trying to find a way to count the number of duplicate bank accounts that occur in more than one unique Vendor_ID, and then assign that count to each row of the dataframe in a new variable.

I can count bank accounts that occur within the same vendor using dplyr with group_by and count, but I can't figure out a way to count duplicates across multiple Vendor_IDs.

Dataframe example code:

#Create a sample data frame:
set.seed(1)
Data <- data.frame(Vendor_ID = sample(1:1), Bank_Account_ID = sample(1:1))

Thanks in advance for any help.
[R] RIDIT scoring in R
Hi everyone. I'd like to perform RIDIT scoring of a column of ordinal values, but I don't have a comparison dataset to use against it, as required by the Ridit::ridit function. As a question of best practice, could I use a normally distributed frequency table generated by rnorm as the comparison data for RIDIT scoring? Or would I be better off using a second ordinal variable from the same dataframe for the comparison? Thanks in advance!
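For reference, ridit scores against a chosen reference distribution reduce to a one-liner in base R: the score for category i is the proportion of the reference below i plus half the proportion at i. A self-referential sketch (using the sample as its own reference group, one common fallback when no external comparison data exist; the data here are made up):

```r
x <- factor(c("low", "low", "mid", "mid", "mid", "high"),
            levels = c("low", "mid", "high"), ordered = TRUE)

p     <- prop.table(table(x))  # category proportions in the reference
ridit <- cumsum(p) - p / 2     # P(below category) + P(at category)/2
round(ridit, 3)
# low: 0.167, mid: 0.583, high: 0.917
```

Whether the self-referential choice (versus an rnorm-derived table or a second variable) is defensible depends on the inferential question, so that part of the question stands.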
Re: [R] Assigning cores
Hi Leslie and all. You may want to investigate sparklyr on a cloud environment such as AWS, where more packages are designed for cluster computing and you have finer control over these kinds of parallel operations.

V/r, Tom W.

Quoting Leslie Rutkowski:

> Hi all,
>
> I'm working on a large simulation and I'm using the doParallel package to parallelize my work. I have 20 cores on my machine and would like to preserve some for day-to-day activities: word processing, sending emails, etc. I started by holding back 1 core, and it was clear that *everything* was so slow as to be nearly unusable.
>
> Any suggestions on how many cores to hold back (i.e., not put to work on the parallel process)?
>
> Thanks, Leslie
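On the original question, a common rule of thumb is to hold back two or three cores rather than one, since OS overhead (and hyperthreaded logical cores) make the last few workers less useful. A base-parallel sketch; the number held back is a tunable assumption, not a rule, and the same worker count can be handed to doParallel::registerDoParallel for foreach-style code:

```r
library(parallel)

n_hold_back <- 2                                 # cores reserved for desktop work (tune to taste)
n_workers   <- max(1, detectCores() - n_hold_back)

cl  <- makeCluster(n_workers)
res <- parLapply(cl, 1:4, function(i) i^2)       # stand-in for the real simulation task
stopCluster(cl)

unlist(res)
# 1 4 9 16
```

If holding back even one core makes the machine unusable, the jobs may be memory-bound rather than CPU-bound, in which case fewer workers (not more free cores) is the fix.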
Re: [R] kernlab ksvm rbfdot kernel - prediction returning fewer rows than provided for input
Forgot to mention: the training and testing dataframes are composed of four IVs (one double-numeric IV and three factor IVs) and one DV (a dichotomous factor, i.e. true or false). The training dataframe consists of 48,819 rows and the test dataframe of 24,408 rows. Thanks again.
[R] kernlab ksvm rbfdot kernel - prediction returning fewer rows than provided for input
Hi everyone. I'm using the kernlab ksvm function with the rbfdot kernel for a binary classification problem and getting a strange result back. The predictions seem to be very accurate judging by the training results, but I'm unable to generate a confusion matrix because the model test returns fewer rows than were in the input test dataframe. I've used ksvm before but never had this problem. Here's my sample code:

install.packages("kernlab")
library(kernlab)

set.seed(3233)
trainIndex <- caret::createDataPartition(dataset_labeled_fraud$isFraud,
                                         p = 0.70, list = FALSE)
train <- dataset_labeled_fraud[trainIndex, ]
test  <- dataset_labeled_fraud[-trainIndex, ]

# clear out the training model
filter <- NULL
filter <- kernlab::ksvm(isFraud ~ ., data = train, kernel = "rbfdot",
                        kpar = list(sigma = 0.5), C = 3, prob.model = TRUE)

# clear out the test results
test_pred_rbfdot <- NULL
test_pred_rbfdot <- kernlab::predict(filter, test, type = "probabilities")
dataframe_test_pred_rbfdot <- as.data.frame(test_pred_rbfdot)

nrow(dataframe_test_pred_rbfdot)
# 23300
nrow(test)
# 24408

# ok, how did I go from 24408 input rows to only 23300 output prediction rows? :(

Thanks in advance anyone!

Thomas A. Woolman
PhD Candidate, Technology Management, Indiana State University
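One likely culprit (an educated guess, since the data aren't shown): predict methods in R commonly drop rows containing NA in any predictor under the default na.action, so the output silently shrinks by the number of incomplete test rows. A quick check with complete.cases(), shown on a toy frame:

```r
df <- data.frame(a = c(1, NA, 3),
                 b = c("x", "y", NA))

# Rows with any NA are candidates for being dropped at predict time;
# for the post above, compare this count against 24408 - 23300 = 1108.
sum(!complete.cases(df))
# 2
```

If the count matches the shortfall, either impute the missing values or subset the test frame to complete cases before building the confusion matrix, so predictions and labels align row for row.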
[R] random forest significance testing tools
Hi everyone. I'm using a random forest in R to perform a classification on a dichotomous DV in a dataset with 29 IVs of type double and approximately 285,000 records. I ran my model on a 70/30 train/test split of the original dataset.

I'm trying to use the rfUtilities package for rf model selection and performance evaluation, in order to generate a p-value and other quantitative performance statistics for hypothesis testing, similar to what I would do with a logistic-regression glm model. The initial random forest model results and OOB error estimates were as follows:

randomForest(formula = Class ~ ., data = train)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 5

        OOB estimate of error rate: 0.04%
Confusion matrix:
       0   1  class.error
0 199004  16 8.039393e-05
1     73 271 2.122093e-01

I'm running this model on my laptop (Win10, 8 GB RAM), as I don't have access to my server during the pandemic. The rfUtilities function call works (or at least it doesn't give an error message or crash), but it has been running for over a day in RStudio on the original rf model and the training dataset without producing any results.

For anyone who has used the rfUtilities package before: is this simply too large a dataframe for a Win10 laptop to process effectively, or should I be doing something different? This is my first time using rfUtilities, and I understand it is relatively new. The call to rf.significance is as follows (rf is my random forest model from the randomForest function):

rf.perm <- rf.significance(rf, train[, 1:29], nperm = 99, ntree = 500)

Thanks in advance.
Tom Woolman
PhD student, Indiana State University
[R] Problem witth nnet:multinom
I am using R with the nnet package to perform a multinomial logistic regression on a training dataset with ~5800 records and 45 predictor variables. The predictors were chosen as a subset of all ~120 available variables based on PCA. My target variable is a factor with 10 levels. All predictor variables are numeric (type "dbl").

My command in R is as follows:

model <- nnet:multinom(frmla, data = training_set, maxit = 1000, na.action = na.omit)
# note that frmla is a formula of the form "Target_Variable ~ v1 + v2 + v3" etc.

The output of this command is as follows (truncated after the first few rows to save space):

# weights: 360 (308 variable)
initial value 10912.909211
iter 10 value 9194.608309
iter 20 value 9142.608309
iter 30 value 9128.737991
iter 40 value 9093.899887
. . .
iter 420 value 8077.803755
final value 8077.800112
converged
Error in nnet:multinom(frmla, data = training_set, maxit = 1000, :
  NA/NaN argument
In addition: Warning message:
In nnet:multinom(frmla, data = training_set, maxit = 1000, :
  numerical expression has 26 elements: only the first used

So that's my issue. I can't figure out the meaning of either the error message or the warning. There are no NA values in my dataset. I've also tried reducing the number of predictor variables, but I get the same issue (just at a different number of iterations).

Thanks in advance.
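An aside that may explain the warning text (a hypothesis about the post above, not a confirmed diagnosis): with a single colon, R parses `nnet:multinom` as the sequence operator `:` applied to two objects, and coercing a multi-element operand to a sequence endpoint produces exactly the "numerical expression has N elements: only the first used" warning. The namespace operator is the double colon, `nnet::multinom`. The sequence behavior of `:` is easy to demonstrate:

```r
a <- 2
b <- 10

# `:` is the sequence operator, not a namespace lookup:
x <- a:b
length(x)
# 9

# With a multi-element operand, only the first element is used,
# and R emits the familiar "only the first used" warning:
y <- suppressWarnings(c(3, 99):5)
y
# 3 4 5
```

So the first thing to check in the call above would be whether `nnet:multinom` was meant to be `nnet::multinom`.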
[R] Trying to fix code that will find highest 5 column names and their associated values for each row in a data frame in R
I have a data frame with 10 variables of integer data describing attributes of each row, and for each row I need to know the 5 highest variables, output to a new data frame. In addition to the 5 highest variable names, I also need the corresponding 5 highest values for each row.

Simple example code to generate a sample data frame:

set.seed(1)
DF <- matrix(sample(1:9, 9), ncol = 10, nrow = 9)
DF <- as.data.frame.matrix(DF)

This would produce an example data frame like this:

#   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1  3  2  5  6  5  2  6  8  1   3
# 2  1  4  7  8  7  7  3  4  2   9
# 3  2  3  4  7  5  8  9  1  3   5
# 4  3  8  3  4  5  6  7  4  6   5
# 5  6  2  3  7  2  1  8  3  2   4
# 6  8  2  4  8  3  2  9  7  6   5
# 7  1  5  3  6  8  3  8  9  1   3
# 8  9  3  5  8  4  9  7  8  1   2
# 9  1  2  4  8  3  2  1  2  5   6

My ideal output would be something like this:

#      V1   V2   V3   V4   V5
# 1  V2:9 V7:8 V8:7 V4:6 V3:5
# 2  V9:9 V3:8 V5:7 V7:6 V4:5
# 3  V5:9 V3:8 V2:7 V9:6 V7:5
# 4  V8:9 V4:8 V2:7 V5:6 V9:5
# 5  V9:9 V1:8 V6:7 V3:6 V5:5
# 6  V8:9 V1:8 V5:7 V9:6 V4:5
# 7  V2:9 V8:8 V7:7 V5:6 V9:5
# 8  V4:9 V7:8 V9:7 V2:6 V8:5
# 9  V3:9 V7:8 V8:7 V4:6 V5:5
# 10 V6:9 V8:8 V1:7 V9:6 V4:5

I was trying this code, but it doesn't seem to work:

out <- t(apply(DF, 1, function(x) {
  o <- head(order(-x), 5)
  paste0(names(x[o]), ':', x[o])
}))
as.data.frame(out)

Thanks everyone!
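For what it's worth, that apply() approach does run once the sample frame is generated with enough values: the matrix() call above supplies only 9 values for a 9 x 10 matrix, so R recycles them with a warning. A reproducible sketch that draws 90 values with replacement instead:

```r
set.seed(1)
DF <- as.data.frame(matrix(sample(1:9, 90, replace = TRUE),
                           nrow = 9, ncol = 10))

top5 <- as.data.frame(t(apply(DF, 1, function(x) {
  o <- head(order(-x), 5)          # column indices of the 5 largest values
  paste0(names(x)[o], ":", x[o])   # "name:value" pairs, largest first
})))

top5
```

Ties are broken by column position (order()'s default), so columns further left win among equal values; pass `ties.method` to rank() instead if a different tie rule is needed.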