Re: [R] How to do non-parametric calculations in R

2022-06-11 Thread Tom Woolman
Imagine that it's the year 2022 and you don't know how to look up 
information about performing a Kruskal-Wallis H test.


 It would take you longer to join the listserv and then write such a 
cockamamie email than to open the stats textbook you are supposed to have 
for the course, let alone run a simple search query.



On 2022-06-11 19:19, Ebert,Timothy Aaron wrote:

LOL. Thank goodness I successfully rolled my saving throw and
resisted, at least for this round.
Tim

-Original Message-
From: R-help  On Behalf Of Jeff Newmiller
Sent: Saturday, June 11, 2022 5:27 PM
To: r-help@r-project.org; J C Nash 
Subject: Re: [R] How to do non-parametric calculations in R

[External Email]

Really? But it is such a random list that I thought it was a test of
our ability to resist providing impromptu lectures on off-list topics,
since we all like to expound on "stuff" even when R isn't needed to
understand them. Or perhaps "A R Lover" just didn't read the Posting
Guide warnings about HTML email, homework, and statistics, and will
soon have done so and be sharing some R code that is giving them an
error.

On June 11, 2022 1:53:37 PM PDT, J C Nash  wrote:

Homework!

On 2022-06-11 10:24, Shantanu Shimpi wrote:

Dear R community,

Please help me in learning how to perform the following non-parametric tests:

1.  Kruskal-Wallis test
2.  Wilcoxon rank-sum test
3.  Cronbach's alpha test
4.  Spearman's rank correlation test
5.  Henry Garrett ranking method calculations
6.  Factor analysis
7.  Chi-squared test

Kindly guide me on the above queries in the easiest way.

A R lover,
Col Shantanu.
India.
attitudeshantanu1...@gmail.com

7722030088
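[For reference: most of the items on this list are one-liners in base R; Cronbach's alpha needs a contributed package such as psych. A minimal sketch with made-up data — the groups and scores below are invented purely for illustration, and `my_items` is a placeholder for a data frame of questionnaire items:]

```r
set.seed(1)
g <- factor(rep(c("a", "b", "c"), each = 20))   # three made-up groups
x <- rnorm(60); y <- rnorm(60)                  # made-up scores

kruskal.test(x ~ g)                      # 1. Kruskal-Wallis test
wilcox.test(x[g == "a"], x[g == "b"])    # 2. Wilcoxon rank-sum test
cor.test(x, y, method = "spearman")      # 4. Spearman's rank correlation
chisq.test(table(g, x > 0))              # 7. Chi-squared test of independence

# 3. Cronbach's alpha and 6. factor analysis, given a data frame of items:
# psych::alpha(my_items); factanal(my_items, factors = 2)
```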

 [[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Sent from my phone. Please excuse my brevity.






Re: [R] categorizing data

2022-05-29 Thread Tom Woolman



Some ideas:

You could fit a cluster model with k=3 to each of the three variables, 
to determine what constitutes high/medium/low values for each of the 
three plant types. The cluster centroids could then be used to derive 
the upper/lower boundary ranges for high/med/low.


Or utilize a histogram for each variable, and use quantiles or 
densities, etc. to determine the natural breaks for the high/med/low 
ranges for each of the IVs.
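
A sketch of the second idea, using tertile (quantile) breaks to label each value low/med/high — the simulated vector here just stands in for one of the three variables:

```r
set.seed(7)
tree <- runif(100, 0, 90)  # simulated values for one variable

# cut at the 1/3 and 2/3 quantiles to get three roughly equal-sized bins
brks <- quantile(tree, probs = c(0, 1/3, 2/3, 1))
lvl  <- cut(tree, breaks = brks, labels = c("low", "med", "high"),
            include.lowest = TRUE)
table(lvl)
```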





On 2022-05-29 15:28, Janet Choate wrote:

Hi R community,
I have a data frame with three variables, where each row adds up to 90.
I want to assign a category of low, medium, or high to the values in each
row - where the lowest value per row will be set to 10, the medium value
to 30, and the high value to 50 - so each row still adds up to 90.


For example:
Data: Orig
tree  shrub  grass
  32     11     47
  23     41     26
  49     23     18

Data: New
tree  shrub  grass
  30     10     50
  10     50     30
  50     30     10

I am not attaching any code here as I have not been able to write
anything effective! I'd appreciate help with this!
thank you,
JC
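
Since the category is determined purely by within-row rank, a rank-based sketch reproduces the example exactly (assuming no ties within a row):

```r
orig <- data.frame(tree  = c(32, 23, 49),
                   shrub = c(11, 41, 23),
                   grass = c(47, 26, 18))

# rank() gives 1/2/3 within each row; use it to index into c(10, 30, 50)
new <- as.data.frame(t(apply(orig, 1, function(x) c(10, 30, 50)[rank(x)])))
names(new) <- names(orig)
new   # rows: 30 10 50 / 10 50 30 / 50 30 10
```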

--

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




Re: [R] Is there a canonical way to pronounce CRAN?

2022-05-04 Thread Tom Woolman
Everyone needs to speak English exactly like I do or else they're doing 
it wrong

 :)

But I pronounce CRAN the same way that I pronounce the first half of 
cranberry.





On 2022-05-04 20:24, Avi Gross via R-help wrote:

Extended discussion may be a waste, but speaking for myself, I found it
highly enlightening to hear that many had a mental image of an
alternate way to pronounce CRAN, as it so OBVIOUSLY has a natural
pronunciation.

I often cringe when I listen to an audio book in the car and the
person chosen to narrate gets words not just wrong but very wrong as
in nobody in any country would likely pronounce it that way!

TV shows and elsewhere do the same. If someone asked me to check
if something was on see-ran or even see-are-a-en, it might take me a
moment to shift gears and realize they meant CRAN. There is no real
right or wrong way and we see organizations with names hard to
pronounce like FBI or CIA are often referred to with words like
Quantico or Langley based on an area they are associated with. People
like words they can hope to pronounce like the non-existent UNCLE or
even SMERSH and KAOS.

I will say it is quite logical that if you see C-SPAN as see-span and
CPAN as C-PAN, you might then see CRAN the way you do, as C-RAN.

But consider the many functions and packages in a language like R and
ask if everyone thinks or pronounces them the way you do? How many
initially read runif() as run-if until you realized it was a
distribution of uniform random numbers in R as compared to rnorm() for
an R version of a normal distribution and thus maybe can be
pronounced more like r-unif and r-norm. Also, runif is part of a
related set of functions all having something to do with a uniform
distribution, dunif(), punif() and qunif() so, again, it hints to some
of a consistent way to speak them aloud. Of course, in written form,
they speak for themselves. 
But not everything can or should be pronounced. We do not all speak
the same languages or the same way. Sometimes spelling things out as
C-R-A-N is a better way to go albeit if the others pronounce those
letters differently, you end up like people who think there is a
Hungarian word for something like vey-tsey to mean toilet when
it actually is spelled WC as in borrowed from the English Water Closet
and the way you say W as a letter of the alphabet followed by the way
you say C as a letter of the alphabet sounds ...


-Original Message-
From: Jim Lemon 
To: Stephen P. Molnar ; r-help mailing list

Sent: Wed, May 4, 2022 6:46 pm
Subject: Re: [R] Is there a canonical way to pronounce CRAN?

Perhaps not entirely a waste. Shots have been fired over less.
Allow the neologism 'packronym' to signify the packing of an acronym
into a pronounceable word.
(A necessary skill in the public service. If you cannot correctly
pronounce DFAT, resign yourself to menial labor)
If we endorse the anglophone packronym we get:
kræn
An Italian might lean toward:
tʃrɑn
while Spanish (the happiest language, they say) would produce:
krɛn
So Babel continues to amuse or enrage us, depending upon our emotional
disposition.
I dare not comment upon those who would laboriously spell it out as:
Charlie Romeo Alfa November

Jim

On Thu, May 5, 2022 at 4:05 AM Stephen P. Molnar 
 wrote:

Yes, I know that I'm contributing, but what a waste of bandwidth.


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[[alternative HTML version deleted]]





Re: [R] Combining data.frames

2022-03-19 Thread Tom Woolman

Have you looked at the merge function in base R?

https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/merge


On 2022-03-19 21:15, Jeff Reichman wrote:

R-Help Community

I'm trying to combine two data.frames, each containing 10 columns, of
which two are shared key fields. Here are two small test datasets.


df1 <- data.frame(date = c("2021-1-1","2021-1-1","2021-1-1","2021-1-1","2021-1-1",
                           "2021-1-2","2021-1-2","2021-1-3","2021-1-3","2021-1-3"),
                  geo_hash = c("abc123","abc123","abc456","abc789","abc246","abc123",
                               "asd123","abc789","abc890","abc123"),
                  ad_id = c("a12345","b12345","a12345","a12345","c12345",
                            "b12345","b12345","a12345","b12345","a12345"))

df2 <- data.frame(date = c("2021-1-1","2021-1-1","2021-1-2","2021-1-3","2021-1-3"),
                  geo_hash = c("abc123","abc456","abc123","abc789","abc890"),
                  event = c("shooting","ied","protest","riot","protest"))


I'm trying to combine them into a single data.frame such as:

date      geo_hash  ad_id   event
2021-1-1  abc123    a12345  shooting
2021-1-1  abc123    b12345
2021-1-1  abc456    a12345  ied
2021-1-1  abc789    a12345
2021-1-1  abc246    c12345

Jeff
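
With those two frames, base R's merge does the left join shown above (unmatched rows get NA rather than blank in the event column). A sketch, repeating the test data so it runs on its own:

```r
df1 <- data.frame(date = c("2021-1-1","2021-1-1","2021-1-1","2021-1-1","2021-1-1",
                           "2021-1-2","2021-1-2","2021-1-3","2021-1-3","2021-1-3"),
                  geo_hash = c("abc123","abc123","abc456","abc789","abc246","abc123",
                               "asd123","abc789","abc890","abc123"),
                  ad_id = c("a12345","b12345","a12345","a12345","c12345",
                            "b12345","b12345","a12345","b12345","a12345"))
df2 <- data.frame(date = c("2021-1-1","2021-1-1","2021-1-2","2021-1-3","2021-1-3"),
                  geo_hash = c("abc123","abc456","abc123","abc789","abc890"),
                  event = c("shooting","ied","protest","riot","protest"))

# all.x = TRUE keeps every df1 row: a left join on the two shared key columns
combined <- merge(df1, df2, by = c("date", "geo_hash"), all.x = TRUE)
head(combined)
```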

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




Re: [R] Time for a companion mailing list for R packages?

2022-01-13 Thread Tom Woolman
I concur with both of Eric's suggestions below. I love R, but I couldn't 
imagine using it on a daily basis without "key" packages for various 
regression and classification modeling problems, etc. Likewise on 
being able to embed images (within reason... maybe establish a max KB or 
MB file size for attachments).



Thanks,
Tom


On 2022-01-13 12:25, Eric Berger wrote:
Re: constructive criticism to make this list more useful to more 
people:


Suggestion 1: accommodate questions related to non-base-R packages
This has been addressed by many already. The current de facto situation
is that such questions are asked and often answered. Perhaps the posting
guide should be altered so that such questions fall within the guidelines.

Suggestion 2: expand beyond plain-text mode
I assume there is a reason for this restriction, but it seems to create
a lot of delay and often havoc. Also, many questions on this list relate
to graphics, which is an important part of R (even base R), and such
questions may often be more easily communicated with images.

Eric




On Thu, Jan 13, 2022 at 6:08 PM John Fox  wrote:


Dear Avi et al.,

Rather than proliferating R mailing lists, why not just allow 
questions

on non-standard packages on the r-help list?

(1) If people don't want to answer these questions, they don't have 
to.


(2) Users won't necessarily find the new email list and so may post to
r-help anyway, only to be told that they should have posted to another
list.

(3) Many of the questions currently posted to the list concern
non-standard packages and most of them are answered.

(4) If people prefer other sources of help (as listed on the R website
"getting help" page) then they are free to use them.

(5) As I read the posting guide, questions about non-standard packages
aren't actually disallowed; the posting guide suggests, however, that
the package maintainer be contacted first. But answers can be helpful 
to

other users, and so it may be preferable for at least some of these
questions to be asked on the list.

(6) Finally, the instruction concerning non-standard packages is 
buried
near the end of the posting guide, and users, especially new users, 
may
not understand what the term "standard packages" means even if they 
find

their way to the posting guide.

Best,
  John

--
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
web: https://socialsciences.mcmaster.ca/jfox/

On 2022-01-12 10:27 p.m., Avi Gross via R-help wrote:
> Respectfully, this forum gets lots of questions that include non-base R
components and especially packages in the tidyverse. Like it or not, the
extended R language is far more useful and interesting for many people,
and especially those who do not wish to constantly reinvent the wheel.
> And repeatedly, we get people reminding (and sometimes chiding) others
for daring to post questions or supply answers on what they see as a pure
R list. They have a point.
> Yes, there are other places (many not being mailing lists like this one)
where we can direct the questions, but why can't there be an official
mailing list alongside this one specifically focused on helping or just
discussing R issues related partially to the use of packages? I don't mean
for people making a package to share, just users who may be searching for
an appropriate package or using a common package, especially the ones in
the tidyverse that are NOT GOING AWAY just because some purists ...
> I prefer a diverse set of ways to do things and base R is NOT enough for
me, nor frankly is R with all packages included, as I find other languages
suit my needs at times for doing various things. If this group is for
purists, fine. Can we have another for the rest of us? Live and let live.

>
>
> -Original Message-
> From: Duncan Murdoch 
> To: Kai Yang ; R-help Mailing List <
r-help@r-project.org>
> Sent: Wed, Jan 12, 2022 3:22 pm
> Subject: Re: [R] how to find the table in R studio
>
> On 12/01/2022 3:07 p.m., Kai Yang via R-help wrote:
>> Hi all,
>> I created a function in R. It generates a table "temp". I can view it
>> in RStudio, but I cannot find it in the top-right window in RStudio.
>> Can someone tell me how to find it there? Same thing for f_table.
f_table.

>> Thank you,
>> Kai
>> library(tidyverse)
>>
>> f1 <- function(indata, subgrp1){
>>   subgrp1 <- enquo(subgrp1)
>>   indata0 <- indata
>>   temp <- indata0 %>% select(!!subgrp1) %>% arrange(!!subgrp1) %>%
>>     group_by(!!subgrp1) %>%
>>     mutate(numbering = row_number(), max = max(numbering))
>>   view(temp)
>>   f_table <- table(temp$Species)
>>   view(f_table)
>> }
>>
>> f1(iris, Species)
>>
>
> Someone is sure to point out that this isn't an RStudio support list,
> but your issue is with R, not with RStudio.  You created the table in
> f1, but you never returned it.  The variable f_table is local to the
> function.  You'd need the following code to 
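
Duncan's message is truncated in the archive, but given his point that f_table is local to f1 and never returned, the fix is presumably to return the object and assign the function's result — a sketch along those lines:

```r
library(tidyverse)

f1 <- function(indata, subgrp1) {
  subgrp1 <- enquo(subgrp1)
  temp <- indata %>% select(!!subgrp1) %>% arrange(!!subgrp1) %>%
    group_by(!!subgrp1) %>%
    mutate(numbering = row_number(), max = max(numbering))
  f_table <- table(temp$Species)  # as in the original (specific to iris)
  f_table                         # return the table instead of only view()-ing it
}

f_table <- f1(iris, Species)  # now f_table exists in the global environment
```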

Re: [R] Defining Parameters in arules

2021-11-23 Thread Tom Woolman
Greg Williams has a book titled "Data Mining with Rattle and R", which 
has a chapter on association rules and the arules package. Williams' 
Rattle GUI package for R also lets you define an association rules model 
using a graphical interface (which creates the R code for you in the log 
file for Rattle). I use this textbook in one of the MS-level R courses 
that I teach and found it to be a good way to convey these concepts 
especially for those new to R and AI/ML generally.
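
For a self-contained starting point, arules ships with the Groceries transaction data, so a first model can be fit in a few lines. The support and confidence thresholds below are arbitrary illustrative choices, not recommendations:

```r
library(arules)
data(Groceries)  # example transaction data bundled with arules

# mine association rules above minimum support and confidence thresholds
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))

# show the strongest few rules by lift
inspect(head(sort(rules, by = "lift"), 3))
```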



On 2021-11-23 05:17, Ivan Krylov wrote:

Hello,

If you don't get an answer here, consider asking the package
maintainer: Michael Hahsler


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Creating a log-transformed histogram of multiclass data

2021-08-03 Thread Tom Woolman
Apologies, I left out 3 critical lines of code after the randomized 
sample dataframe is created:


group_a <- d[ which(d$label =='A'), ]
group_b <- d[ which(d$label =='B'), ]
group_c <- d[ which(d$label =='C'), ]





On 2021-08-03 18:56, Tom Woolman wrote:

# Resending this message since the original email was held in queue by
the listserv software because of a "suspicious" subject line, and/or
because of attached .png histogram chart attachments. I'm guessing
that the listserv software doesn't like multiple image file
attachments.



[R] Creating a log-transformed histogram of multiclass data

2021-08-03 Thread Tom Woolman



# Resending this message since the original email was held in queue by 
the listserv software because of a "suspicious" subject line, and/or 
because of attached .png histogram chart attachments. I'm guessing that 
the listserv software doesn't like multiple image file attachments.



Hi everyone. I'm working on a research model now that is calculating 
anomaly scores (RMSE values) for three distinct groups within a large 
dataset. The anomaly scores are a continuous data type and are quite 
small, ranging from approximately 1e-04 to 1e-07 across a population of 
approximately 1 million observations.


I have all of the summary and descriptive statistics for each of the 
anomaly score distributions across each group label in the dataset, and 
I am able to create some useful histograms showing how each of the three 
groups is uniquely distributed across the range of scores. However, 
because of the large variance within the frequency of score values and 
the high density peaks within much of the anomaly scores, I need to use 
a log transformation within the histogram to show both the log frequency 
count of each binned observation range (y-axis) and a log transformation 
of the binned score values (x-axis) to be able to appropriately 
illustrate the distributions within the data and make it more readily 
understandable.


Fortunately, ggplot2 is really useful for creating some really 
attractive dual-axis log transformed histograms.


However, I cannot figure out a way to create the log transformed 
histograms to show each of my three groups by color within the same 
histogram. I would want it to look like this, BUT use a log 
transformation for each axis. This plot below shows the 3 groups in one 
histogram but uses the default normal values.


For log transformed axis values, the best I can do so far is produce 
three separate histograms, one for each group.




Below is sample R code to illustrate my problem with a 
randomly-generated example dataset and the ggplot2 approaches that I 
have taken so far:


# Sample R code below:

library(ggplot2)
library(dplyr)
library(hrbrthemes)

# I created some simple random sample data to produce an example 
dataset.
# This produces an example dataframe called d, which contains a class 
label IV of either A, B or C for each observation. The target variable 
is the anomaly_score continuous value for each observation.

# There are 300 rows of dummy data in this dataframe.

DV_score_generator <- round(runif(300, 0.001, 0.999), 3)
d <- data.frame(label = sample(LETTERS[1:3], 300, replace = TRUE,
                               prob = c(0.65, 0.30, 0.05)),
                anomaly_score = DV_score_generator)


# First, I use ggplot to create the normal distribution histogram that 
shows all 3 groups on the same plot, by color.
# Please note that with this small set of randomized sample data it 
doesn't appear to be necessary to use an x and y-axis log transformation 
to show the distribution patterns, but it does becomes an issue with my 
vastly larger and more complex score values in the DV of the actual 
data.


p <- d %>%
ggplot( aes(x=anomaly_score, fill=label)) +
geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
scale_fill_manual(values=c("#69b3a2", "blue", "#404080")) +
theme_ipsum() +
labs(fill="")

p

# Produces a normal multiclass histogram.



# Now produce a series of x and y-axis log-transformed histograms, 
producing one histogram for each distinct label class in the dataset:



# Group A, log transformed

ggplot(group_a, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05,
                 colour = "darkgoldenrod1", fill = "darkgoldenrod2") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans = "log2") +
  scale_y_continuous(trans = "log2", name = "Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group A Only")


# Group A transformed histogram is produced here.



# Group B, log transformed

ggplot(group_b, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05,
                 colour = "green", fill = "darkgreen") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans = "log2") +
  scale_y_continuous(trans = "log2", name = "Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group B Only")

# Group B transformed histogram is produced here.



# Group C, log transformed

ggplot(group_c, aes(x = anomaly_score)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.05,
                 colour = "red", fill = "darkred") +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans = "log2") +
  scale_y_continuous(trans = "log2", name = "Log-transformed Frequency Counts") +
  ggtitle("Transformed Anomaly Scores - Group C Only")

# Group C transformed histogram is produced here.


# End.



Thanks in advance, everyone!


- Tom


Thomas A. Woolman, PhD Candidate (Indiana State University), MBA, MS, MS
On Target Technologies, Inc.
Virginia, USA
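
One way to get all three groups onto a single histogram with both axes log-transformed is to attach the transformed scales to the combined plot rather than to the per-group plots. A sketch using the same randomized sample data as above (untested against the real anomaly scores, and bins with zero counts may trigger log-scale warnings):

```r
library(ggplot2)

set.seed(1)
d <- data.frame(
  label = sample(LETTERS[1:3], 300, replace = TRUE, prob = c(0.65, 0.30, 0.05)),
  anomaly_score = round(runif(300, 0.001, 0.999), 3)
)

# one histogram, three colors, log2 on both axes
ggplot(d, aes(x = anomaly_score, fill = label)) +
  geom_histogram(colour = "#e9ecef", alpha = 0.6, position = "identity") +
  scale_fill_manual(values = c("#69b3a2", "blue", "#404080")) +
  scale_x_continuous(name = "Log-scale Anomaly Score", trans = "log2") +
  scale_y_continuous(name = "Log-transformed Frequency Counts", trans = "log2")
```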


Re: [R] [EXT] Re: Assigning categorical values to dates

2021-07-21 Thread Tom Woolman
Sure thing. Typically with date-related measures I'm going to build a 
time series model, e.g. ARIMA, or maybe something more funky like a 
recurrent neural net via TensorFlow. But in theory there's no reason 
dates can't be factors if it's in keeping with your particular design 
of experiment and you want to perform an analysis that treats time as 
qualitative data.




Quoting "N. F. Parsons" :

@Tom Okay, yeah. That might actually be an elegant solution. I will  
mess around with it. Thank you - I’m not in the habit of using  
factors and am not super familiar with how they automatically sort  
themselves.


@Andrew Yes. Each month is a different 30,000 row file upon which  
this task must be performed.


@Bert If you’re not interested in being helpful, why comment? Am I  
interrupting your clubhouse time? I’m legitimately stumped by this 
one and reaching out in earnest. “You’ve been told how to do it”  
Seriously? We all have different backgrounds and knowledge levels  
with the entire atlas of the wonderful world of R and I neither need  
or want your opinion on my corner of it. Don’t be a Hooke. I’m not  
here to impress or inspire confidence in you - I’m here with a  
question that has had me spinning my wheels for the better part of a  
day and need fresh perspectives. Your response certainly inspires no  
confidence in me as to the nature of your character or your  
knowledge on the topic.


Best regards all,
—
Nathan Parsons, B.SC, M.Sc, G.C.

Ph.D. Candidate, Dept. of Sociology, Portland State University
Adjunct Professor, Dept. of Sociology, Washington State University
Graduate Advocate, American Association of University Professors (OR)

Recent work  
(https://www.researchgate.net/profile/Nathan_Parsons3/publications)

Schedule an appointment (https://calendly.com/nate-parsons)

On Wednesday, Jul 21, 2021 at 9:12 PM, Andrew Robinson  
mailto:a...@unimelb.edu.au)> wrote:
I wonder if you mean that you want the levels of the factor to  
reset within each month? That is not obvious from your example, but  
implied by your question.


Andrew


--
Andrew Robinson
Director, CEBRA and Professor of Biosecurity,
School/s of BioSciences and Mathematics & Statistics
University of Melbourne, VIC 3010 Australia
Tel: (+61) 0403 138 955
Email: a...@unimelb.edu.au
Website: https://researchers.ms.unimelb.edu.au/~apro@unimelb/

I acknowledge the Traditional Owners of the land I inhabit, and pay  
my respects to their Elders.






On 22 Jul 2021, 1:47 PM +1000, N. F. Parsons  
, wrote:

> External email: Please exercise caution
>
> I am not averse to a factor-based solution, but I would still  
have to manually enter that factor each month, correct? If  
possible, I’d just like to point R at that column and have it do  
the work.

>
> —
> Nathan Parsons, B.SC, M.Sc, G.C.
>
> Ph.D. Candidate, Dept. of Sociology, Portland State University
> Adjunct Professor, Dept. of Sociology, Washington State University
> Graduate Advocate, American Association of University Professors (OR)
>
> Recent work  
(https://www.researchgate.net/profile/Nathan_Parsons3/publications)

> Schedule an appointment (https://calendly.com/nate-parsons)
>
> > On Wednesday, Jul 21, 2021 at 8:30 PM, Tom Woolman  
mailto:twool...@ontargettek.com)> wrote:

> >
> > Couldn't you convert the date columns to character type data in a data
> > frame, and then convert those strings to factors in a 2nd step?
> >
> > The only downside I think to treating dates as factor levels is that
> > you might have an awful lot of factors if you have a large enough
> > dataset.
> >
> >
> >
> > Quoting "N. F. Parsons" :
> >
> > > Hi all,
> > >
> > > If I have a tibble as follows:
> > >
> > > tibble(dates = c(rep("2021-07-04", 2), rep("2021-07-25", 3),
> > > rep("2021-07-18", 4)))
> > >
> > > how in the world do I add a column that evaluates each of  
those dates and

> > > assigns it a categorical value such that
> > >
> > > dates        cycle
> > > 2021-07-04       1
> > > 2021-07-04       1
> > > 2021-07-25       3
> > > 2021-07-25       3
> > > 2021-07-25       3
> > > 2021-07-18       2
> > > 2021-07-18       2
> > > 2021-07-18       2
> > > 2021-07-18       2
> > >
> > > Not to further complicate matters, but some months I may only have one
> > > date, and some months I will have 4 dates - so that's not a fixed quantity.
> > > We've literally been doing this by hand at my job and I'd like to automate
> > > it.
> > >
> > > Thanks in advance!
> > >
> > > Nate Parsons
> > >
> > > [[alternative HTML version deleted]]
> > >

Re: [R] Assigning categorical values to dates

2021-07-21 Thread Tom Woolman
Not if you use as.factor to convert a character type column to factor  
levels. It should recode the distinct string values to factors  
automatically for you.


i.e.,

df$datefactors <- as.factor(df$datestrings)
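Following on from as.factor(): if the goal is for R to assign the cycle number itself, so nothing is entered by hand each month, here is a minimal base-R sketch (the column name `dates` is assumed, matching the original example):

```r
# Toy data frame matching the original example
df <- data.frame(dates = c(rep("2021-07-04", 2), rep("2021-07-25", 3),
                           rep("2021-07-18", 4)))
# factor() with sorted unique levels numbers the distinct dates 1, 2, 3, ...
# in calendar order; as.integer() extracts that ordinal code
df$cycle <- as.integer(factor(df$dates, levels = sort(unique(df$dates))))
df$cycle
#> [1] 1 1 3 3 3 2 2 2 2
```

Sorting works here because ISO "YYYY-MM-DD" strings sort chronologically; for other date formats, convert with as.Date() first.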



Quoting "N. F. Parsons" :

I am not averse to a factor-based solution, but I would still have  
to manually enter that factor each month, correct? If possible, I’d  
just like to point R at that column and have it do the work.


—
Nathan Parsons, B.Sc., M.Sc., G.C.

Ph.D. Candidate, Dept. of Sociology, Portland State University
Adjunct Professor, Dept. of Sociology, Washington State University
Graduate Advocate, American Association of University Professors (OR)

Recent work  
(https://www.researchgate.net/profile/Nathan_Parsons3/publications)

Schedule an appointment (https://calendly.com/nate-parsons)

On Wednesday, Jul 21, 2021 at 8:30 PM, Tom Woolman (twool...@ontargettek.com) wrote:


Couldn't you convert the date columns to character type data in a data
frame, and then convert those strings to factors in a 2nd step?

The only downside I think to treating dates as factor levels is that
you might have an awful lot of factors if you have a large enough
dataset.



Quoting "N. F. Parsons" :

> Hi all,
>
> If I have a tibble as follows:
>
> tibble(dates = c(rep("2021-07-04", 2), rep("2021-07-25", 3),
> rep("2021-07-18", 4)))
>
> how in the world do I add a column that evaluates each of those dates and
> assigns it a categorical value such that
>
> dates cycle
>  
> 2021-07-04 1
> 2021-07-04 1
> 2021-07-25 3
> 2021-07-25 3
> 2021-07-25 3
> 2021-07-18 2
> 2021-07-18 2
> 2021-07-18 2
> 2021-07-18 2
>
> Not to further complicate matters, but some months I may only have one
> date, and some months I will have 4 dates - so that's not a fixed quantity.
> We've literally been doing this by hand at my job and I'd like to automate
> it.
>
> Thanks in advance!
>
> Nate Parsons
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide  
http://www.R-project.org/posting-guide.html

> and provide commented, minimal, self-contained, reproducible code.





Re: [R] Assigning categorical values to dates

2021-07-21 Thread Tom Woolman



Couldn't you convert the date columns to character type data in a data  
frame, and then convert those strings to factors in a 2nd step?


The only downside I think to treating dates as factor levels is that  
you might have an awful lot of factors if you have a large enough  
dataset.




Quoting "N. F. Parsons" :


Hi all,

If I have a tibble as follows:

tibble(dates = c(rep("2021-07-04", 2),  rep("2021-07-25", 3),
rep("2021-07-18", 4)))

how in the world do I add a column that evaluates each of those dates and
assigns it a categorical value such that

dates        cycle
2021-07-04  1
2021-07-04  1
2021-07-25  3
2021-07-25  3
2021-07-25  3
2021-07-18  2
2021-07-18  2
2021-07-18  2
2021-07-18  2

Not to further complicate matters, but some months I may only have one
date, and some months I will have 4 dates - so that's not a fixed quantity.
We've literally been doing this by hand at my job and I'd like to automate
it.

Thanks in advance!

Nate Parsons

[[alternative HTML version deleted]]





Re: [R] Using R to analyse Court documents

2021-07-20 Thread Tom Woolman
Hi Brian. I assume you're interested in some kind of classification of
the theme or the contents of each document?
In that case I would direct you to natural language processing (NLP)
for multinomial classification of unstructured text. The first
challenge will be obtaining a sufficient number of human-labeled
example documents for training.



Thanks,
Tom
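To make the bag-of-words starting point concrete, here is a hedged base-R sketch (toy documents and labels invented purely for illustration) of turning labeled text into a document-term matrix that any classifier could then consume:

```r
# Toy labeled corpus (invented for illustration)
docs   <- c("contract breach damages", "criminal assault charge",
            "contract dispute damages", "assault battery charge")
labels <- factor(c("civil", "criminal", "civil", "criminal"))

# Build a document-term matrix: one row per document, one column per term
tokens <- strsplit(docs, " ")
terms  <- sort(unique(unlist(tokens)))
dtm    <- t(sapply(tokens, function(w) table(factor(w, levels = terms))))
dim(dtm)   # 4 documents x 8 distinct terms
```

This matrix plus the labels is the usual input to a multinomial classifier (e.g. nnet::multinom or randomForest); packages such as tm or tidytext automate the tokenizing and weighting steps on real corpora.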


Quoting Brian Smith :


Hi,

I am wondering if there is some references on how R can be used to
analyse legal/court documents. I searched a bit in internet but unable
to get anything meaningful.

Any reference will be very appreciated.

Thanks for your time.

Thanks and regards,





Re: [R] Windows path backward slash

2020-12-24 Thread Tom Woolman




In Windows versions of R/RStudio, when referring to file paths you
need to either use two backslash characters ("\\") instead of one, OR
use the forward slash ("/") as in Linux/Unix. It's an unfortunate
conflict between R and Windows in that a single \ character by itself
is treated as an escape character in R string literals.


It's all Microsoft's fault for using the wrong direction slash in  
MS-DOS and not conforming to Unix style c. 1980.
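A quick sketch of the three equivalent spellings (hypothetical path; the raw-string form needs R >= 4.0.0):

```r
p1 <- "C:\\Users\\me\\Desktop\\sas\\"  # doubled backslashes
p2 <- "C:/Users/me/Desktop/sas/"       # forward slashes
p3 <- r"{C:\Users\me\Desktop\sas\}"    # raw string, no escaping needed
identical(p1, p3)                      # TRUE: same characters either way
```

All three name the same file on Windows; only their spelling inside R source differs.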





Quoting Anbu A :


Hi Bill,
  r"{C:\Users\Anbu\Desktop\sas\}"  - This is the key and code below worked.
fsasdat<-function(dsn) {
  pat=r"{C:\Users\Anbu\Desktop\sas\}"
  str1=str_c(pat,dsn,".sas7bdat")
  read_sas(str1)
#return(str1)
}
allmetrx=fsasdat("all")
str(allmetrx)

Thank you.

Anbu.


On Thu, Dec 24, 2020 at 12:12 PM Bill Dunlap 
wrote:


The "\n" is probably not in the file name.  Does omitting it from the call
to str_c help?

-Bill

On Thu, Dec 24, 2020 at 6:20 AM Anbu A  wrote:


Hi All,
I am a newbie. This is my first program.
I am trying to read a SAS dataset from the path below. I added an escape "\"
alongside each "\" found in the path C:\Users\axyz\Desktop\sas\  but it is still not working.

fsasdat<-function(dsn) {
  pat="C:\\Users\\axyz\\Desktop\\sas\\"
  str1=str_c(pat,dsn,".sas7bdat","\n")
  allmetrx=read_sas(str1)
}
fsasdat("all")

Please help me.

Thanks,
AA.

[[alternative HTML version deleted]]






[[alternative HTML version deleted]]





Re: [R] cooks distance for repeated measures anova

2020-12-23 Thread Tom Woolman

Hi Dr. Pedersen.

I haven't used cook's on an aov object but I do it all the time from  
an lm (general linear model) object, ie.:


mod <- lm(y ~ x1 + x2, data = dataframe)  # a formula is needed, e.g. DV ~ IVs
cooksdistance <- cooks.distance(mod)      # one distance per observation



I *think* you might be able to simulate an aov using the lm function by
selecting the parameter in lm to calculate the Type I sum of squares
error that would be provided by the aov function.



FYI I'm using Cook's in my case as part of an anomaly detection engine  
based on a linear model interaction.
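For a concrete starting point, a hedged sketch of the lm()-based route on the iris data from the original post (the per-group summary here is illustrative, not the only sensible aggregation):

```r
# Fit the fixed-effects part as a plain linear model
mod <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris)
cd  <- cooks.distance(mod)     # one Cook's distance per observation
# Illustrative per-group summary: the largest distance within each Species
tapply(cd, iris$Species, max)
```

Note this ignores the Error(Species) stratum of the original aov, so it is only an approximation, not the repeated-measures diagnostic itself.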





Quoting Walker Scott Pedersen :


Hi all,

Is there a way to get cook's distance for a repeated measures anova?  
Neither cooks.distance or CookD from the predictmeans package seem  
to allow for this.  For example, if I have the model


data(iris)

mod<-aov(Sepal.Length  ~ Petal.Length + Petal.Width +  
Error(Species), data=iris)


both

cooks.distance(mod)

and

library(predictmeans)
CookD(mod, group=Species)

give an error saying they don't support an aovlist object.

I would prefer a method to get a cook's distance for each category  
in my repeated factor (i.e. Species), rather than each observation.


Thanks!


--
Walker Pedersen, Ph.D.
Center for Healthy Minds
University of Wisconsin -- Madison


[[alternative HTML version deleted]]





Re: [R] counting duplicate items that occur in multiple groups

2020-11-18 Thread Tom Woolman

Thanks, everyone!



Quoting Jim Lemon :


Oops, I sent this to Tom earlier today and forgot to copy to the list:

VendorID=rep(paste0("V",1:10),each=5)
AcctID=paste0("A",sample(1:5,50,TRUE))
Data<-data.frame(VendorID,AcctID)
table(Data)
# get multiple vendors for each account
dupAcctID<-colSums(table(Data)>0)
Data$dupAcct<-NA
# fill in the new column
for(i in 1:length(dupAcctID))
 Data$dupAcct[Data$AcctID == names(dupAcctID[i])]<-dupAcctID[i]

Jim
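The same count can also be computed without an explicit loop; a hedged base-R sketch (toy columns shaped like the thread's examples) using tapply and name-based lookup:

```r
# Toy data in the same shape as the thread's examples
Data <- data.frame(Vendor  = c("V1", "V2", "V3", "V4"),
                   Account = c("A1", "A2", "A2", "A2"))
# Distinct vendors per account, then mapped back onto each row by name
n_per_acct <- tapply(Data$Vendor, Data$Account, function(v) length(unique(v)))
Data$n_vendors <- as.integer(n_per_acct[Data$Account])
Data$n_vendors
#> [1] 1 3 3 3
```

(The original poster's expected output used 0 rather than 1 for unshared accounts; that convention would be `ifelse(Data$n_vendors > 1, Data$n_vendors, 0)`.)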

On Wed, Nov 18, 2020 at 8:20 AM Tom Woolman 
wrote:


Hi everyone.  I have a dataframe that is a collection of Vendor IDs
plus a bank account number for each vendor. I'm trying to find a way
to count the number of duplicate bank accounts that occur in more than
one unique Vendor_ID, and then assign the count value for each row in
the dataframe in a new variable.

I can do a count of bank accounts that occur within the same vendor
using dplyr and group_by and count, but I can't figure out a way to
count duplicates among multiple Vendor_IDs.


Dataframe example code:


#Create a sample data frame:

set.seed(1)

Data <- data.frame(Vendor_ID = sample(1:1), Bank_Account_ID =
sample(1:1))




Thanks in advance for any help.




[[alternative HTML version deleted]]





Re: [R] counting duplicate items that occur in multiple groups

2020-11-17 Thread Tom Woolman

Yes, good catch. Thanks


Quoting Bert Gunter :


Why 0's in the data frame? Shouldn't that be 1 (vendor with that account)?

Bert
Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Tue, Nov 17, 2020 at 3:29 PM Tom Woolman 
wrote:


Hi Bill. Sorry to be so obtuse with the example data, I was trying
(too hard) not to share any actual values so I just created randomized
values for my example; of course I should have specified that the
random values would not provide the expected problem pattern. I should
have just used simple dummy codes as Bill Dunlap did.

So per Bill's example data for Data1, the expected (hoped for) output
should be:

  Vendor Account Num_Vendors_Sharing_Bank_Acct
1 V1  A1  0
2 V2  A2  3
3 V3  A2  3
4 V4  A2  3


Where the new calculated variable is Num_Vendors_Sharing_Bank_Acct.
The value is 3 for V2, V3 and V4 because they each share bank account
A2.


Likewise, in the Data2 frame, the same logic applies:

  Vendor Account Num_Vendors_Sharing_Bank_Acct
1 V1  A1 0
2 V2  A2 3
3 V3  A2 3
4 V1  A2 3
5 V4  A3 0
6 V2  A4 0






Thanks!


Quoting Bill Dunlap :

> What should the result be for
>   Data1 <- data.frame(Vendor=c("V1","V2","V3","V4"),
> Account=c("A1","A2","A2","A2"))
> ?
>
> Must each vendor have only one account?  If not, what should the result
be
> for
>Data2 <- data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"),
> Account=c("A1","A2","A2","A2","A3","A4"))
> ?
>
> -Bill
>
> On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman 
> wrote:
>
>> Hi everyone.  I have a dataframe that is a collection of Vendor IDs
>> plus a bank account number for each vendor. I'm trying to find a way
>> to count the number of duplicate bank accounts that occur in more than
>> one unique Vendor_ID, and then assign the count value for each row in
>> the dataframe in a new variable.
>>
>> I can do a count of bank accounts that occur within the same vendor
>> using dplyr and group_by and count, but I can't figure out a way to
>> count duplicates among multiple Vendor_IDs.
>>
>>
>> Dataframe example code:
>>
>>
>> #Create a sample data frame:
>>
>> set.seed(1)
>>
>> Data <- data.frame(Vendor_ID = sample(1:1), Bank_Account_ID =
>> sample(1:1))
>>
>>
>>
>>
>> Thanks in advance for any help.
>>






Re: [R] counting duplicate items that occur in multiple groups

2020-11-17 Thread Tom Woolman
Hi Bill. Sorry to be so obtuse with the example data, I was trying  
(too hard) not to share any actual values so I just created randomized  
values for my example; of course I should have specified that the  
random values would not provide the expected problem pattern. I should  
have just used simple dummy codes as Bill Dunlap did.


So per Bill's example data for Data1, the expected (hoped for) output  
should be:


 Vendor Account Num_Vendors_Sharing_Bank_Acct
1 V1  A1  0
2 V2  A2  3
3 V3  A2  3
4 V4  A2  3


Where the new calculated variable is Num_Vendors_Sharing_Bank_Acct.  
The value is 3 for V2, V3 and V4 because they each share bank account  
A2.



Likewise, in the Data2 frame, the same logic applies:

 Vendor Account Num_Vendors_Sharing_Bank_Acct
1 V1  A1 0
2 V2  A2 3
3 V3  A2 3
4 V1  A2 3
5 V4  A3 0
6 V2  A4 0






Thanks!


Quoting Bill Dunlap :


What should the result be for
  Data1 <- data.frame(Vendor=c("V1","V2","V3","V4"),
Account=c("A1","A2","A2","A2"))
?

Must each vendor have only one account?  If not, what should the result be
for
   Data2 <- data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"),
Account=c("A1","A2","A2","A2","A3","A4"))
?

-Bill

On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman 
wrote:


Hi everyone.  I have a dataframe that is a collection of Vendor IDs
plus a bank account number for each vendor. I'm trying to find a way
to count the number of duplicate bank accounts that occur in more than
one unique Vendor_ID, and then assign the count value for each row in
the dataframe in a new variable.

I can do a count of bank accounts that occur within the same vendor
using dplyr and group_by and count, but I can't figure out a way to
count duplicates among multiple Vendor_IDs.


Dataframe example code:


#Create a sample data frame:

set.seed(1)

Data <- data.frame(Vendor_ID = sample(1:1), Bank_Account_ID =
sample(1:1))




Thanks in advance for any help.






[R] counting duplicate items that occur in multiple groups

2020-11-17 Thread Tom Woolman
Hi everyone.  I have a dataframe that is a collection of Vendor IDs  
plus a bank account number for each vendor. I'm trying to find a way  
to count the number of duplicate bank accounts that occur in more than  
one unique Vendor_ID, and then assign the count value for each row in  
the dataframe in a new variable.


I can do a count of bank accounts that occur within the same vendor  
using dplyr and group_by and count, but I can't figure out a way to  
count duplicates among multiple Vendor_IDs.



Dataframe example code:


#Create a sample data frame:

set.seed(1)

Data <- data.frame(Vendor_ID = sample(1:1), Bank_Account_ID =  
sample(1:1))





Thanks in advance for any help.



[R] RIDIT scoring in R

2020-09-14 Thread Tom Woolman

Hi everyone.

I'd like to perform RIDIT scoring of a column that consists of ordinal  
values, but I don't have a comparison dataset to use against it as  
required by the Ridit::ridit function.


As a question of best practice, could I use a normally distributed  
frequency distribution table generated by the rnorm function for use  
as comparison data for RIDIT scoring?


Or would I be better off using a 2nd ordinal variable from the same  
dataframe for comparison?




Thanks in advance!



Re: [R] Assigning cores

2020-09-03 Thread Tom Woolman

Hi Leslie and all.

You may want to investigate using sparklyr on a cloud environment like
AWS, where more packages are designed for cluster computing and you
have finer control over those types of parallel operations.



V/r,

Tom W.
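As a hedged base-R sketch of the core idea (reserving cores for interactive work), using only the bundled parallel package — doParallel's registerDoParallel() accepts the same kind of cluster object:

```r
total <- parallel::detectCores()
if (is.na(total)) total <- 1          # detectCores() can return NA
n_workers <- max(1, total - 2)        # hold back 2 cores for the desktop
# Small cluster for the demo; in real use pass n_workers directly
cl  <- parallel::makeCluster(min(n_workers, 2))
res <- parallel::parLapply(cl, 1:4, function(i) i^2)
parallel::stopCluster(cl)
unlist(res)
```

How many to hold back is workload-dependent; two is a common rule of thumb for keeping email and word processing responsive.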


Quoting Leslie Rutkowski :


Hi all,

I'm working on a large simulation and I'm using the doParallel package to
parallelize my work. I have 20 cores on my machine and would like to
preserve some for day-to-day activities - word processing, sending emails,
etc.

I started by saving 1 core and it was clear that *everything* was so slow
as to be nearly unusable.

Any suggestions on how many cores to hold back (e.g., not to put to work on
the parallel process)?

Thanks,
Leslie

[[alternative HTML version deleted]]





Re: [R] kernlab ksvm rbfdot kernel - prediction returning fewer rows than provided for input

2020-06-10 Thread Tom Woolman
forgot to mention, the training and testing dataframes are composed of  
4 IVs (one double numeric IV and three factor IVs) and one DV  
(dichotomous factor, i.e. true or false).


The training dataframe consists of 48819 rows and test dataframe  
consists of 24408 rows.




Thanks again.



Quoting Tom Woolman :

Hi everyone. I'm using the kernlab ksvm function with the rbfdot  
kernel for a binary classification problem and getting a strange  
result back. The predictions seem to be very accurate judging by the  
training results provided by the algorithm, but I'm unable to  
generate a confusion matrix because there is a difference in the  
number of output records from my model test compared to what was  
input into the test dataframe.


I've used ksvm before but never had this problem.

Here's my sample code:



install.packages("kernlab")
library(kernlab)


set.seed(3233)


trainIndex <- caret::createDataPartition(dataset_labeled_fraud$isFraud,
                                         p = 0.70, list = FALSE)


train <- dataset_labeled_fraud[trainIndex,]
test <- dataset_labeled_fraud[-trainIndex,]


#clear out the training model
filter <- NULL

filter <-  
kernlab::ksvm(isFraud~.,data=train,kernel="rbfdot",kpar=list(sigma=0.5),C=3,prob.model=TRUE)



#clear out the test results
test_pred_rbfdot <- NULL

test_pred_rbfdot <- kernlab::predict(filter,test,type="probabilities")

dataframe_test_pred_rbfdot <- as.data.frame(test_pred_rbfdot)


nrow(dataframe_test_pred_rbfdot)


23300


nrow(test)


24408



# ok, how did I go from 24408 input rows to only 23300 output  
prediction rows? :(



Thanks in advance anyone!




Thomas A. Woolman
PhD Candidate, Technology Management
Indiana State University




[R] kernlab ksvm rbfdot kernel - prediction returning fewer rows than provided for input

2020-06-10 Thread Tom Woolman



Hi everyone. I'm using the kernlab ksvm function with the rbfdot  
kernel for a binary classification problem and getting a strange  
result back. The predictions seem to be very accurate judging by the  
training results provided by the algorithm, but I'm unable to generate  
a confusion matrix because there is a difference in the number of  
output records from my model test compared to what was input into the  
test dataframe.


I've used ksvm before but never had this problem.

Here's my sample code:



install.packages("kernlab")
library(kernlab)


set.seed(3233)


trainIndex <- caret::createDataPartition(dataset_labeled_fraud$isFraud,
                                         p = 0.70, list = FALSE)


train <- dataset_labeled_fraud[trainIndex,]
test <- dataset_labeled_fraud[-trainIndex,]


#clear out the training model
filter <- NULL

filter <-  
kernlab::ksvm(isFraud~.,data=train,kernel="rbfdot",kpar=list(sigma=0.5),C=3,prob.model=TRUE)



#clear out the test results
test_pred_rbfdot <- NULL

test_pred_rbfdot <- kernlab::predict(filter,test,type="probabilities")

dataframe_test_pred_rbfdot <- as.data.frame(test_pred_rbfdot)


nrow(dataframe_test_pred_rbfdot)


23300


nrow(test)


24408



# ok, how did I go from 24408 input rows to only 23300 output  
prediction rows? :(



Thanks in advance anyone!




Thomas A. Woolman
PhD Candidate, Technology Management
Indiana State University
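One hedged guess at where rows can disappear: many predict() paths silently drop rows containing NA in any predictor (the default na.action). A quick base-R check on a toy frame (the real column names are unknown here):

```r
# Toy frame standing in for the real 'test' data
toy <- data.frame(x = c(1, 2, NA, 4), y = c("a", NA, "b", "b"))
sum(!complete.cases(toy))   # rows a model's predict() may silently drop
#> [1] 2
```

Applied to the real data, `sum(!complete.cases(test))` should equal 24408 - 23300 = 1108 if NA-dropping is the cause; unseen factor levels in the test set are another possibility worth checking.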



[R] random forest significance testing tools

2020-05-10 Thread Tom Woolman
Hi everyone. I'm using a random forest in R to successfully perform a  
classification on a dichotomous DV in a dataset that has 29 IVs of  
type double and approximately 285,000 records. I ran my model on a  
70/30 train/test split of the original dataset.


I'm trying to use the rfUtilities package for rf model selection and  
performance evaluation, in order to generate a p-value and other  
quantitative performance statistics for use in hypothesis testing,  
similar to what I would do with a logistic regression glm model.


The initial random forest model results and OOB error estimates were  
as follows:


randomForest(formula = Class ~ ., data = train)
   Type of random forest: classification
 Number of trees: 500
No. of variables tried at each split: 5

OOB estimate of  error rate: 0.04%
Confusion matrix:
        0    1  class.error
0  199004   16 8.039393e-05
1      73  271 2.122093e-01


I'm running this model on my laptop (Win10, 8 GB RAM) as I don't have  
access to my server during the pandemic. The rfUtilities function call  
works (or at least it doesn't give me an error message or crash), but  
it's been running for over a day in RStudio on the original rf model  
and the training dataset without providing any results.


For anyone who has used the rfUtilities package before, is this just  
too large of a dataframe for a Win10 laptop to process effectively or  
should I be doing something different? This is my first time using the  
rfUtilities package and I understand that it is relatively new.


The function call for the rfUtilities function rf.significance is as  
follows (rf is my original random forest data model from the  
randomForest function):


rf.perm <- rf.significance(rf, train[,1:29], nperm=99, ntree=500)


Thanks in advance.

Tom Woolman
PhD student, Indiana State University



[R] Problem witth nnet:multinom

2019-06-21 Thread Tom Woolman
I am using R with the nnet package to perform a multinomial logistic  
regression on a training dataset with ~5800 training dataset records  
and 45 predictor variables in that training data. Predictor variables  
were chosen as a subset of all ~120 available variables based on PCA  
analysis. My target variable is a factor with 10 levels.

All predictor variable are numeric (type "dbl").

My command in R is as follows:

model <- nnet:multinom(frmla, data = training_set, maxit = 1000,  
na.action = na.omit)


#note that the frmla string is a value of "Target_Variable ~ v1 + v2 +  
v3, etc."
Output of this command is as follows (I will truncate to save a little  
space after the first few rows):


# weights: 360 (308 variable)
initial value 10912.909211

iter 10 value 9194.608309

iter 20 value 9142.608309

iter 30 value 9128.737991

iter 40 value 9093.899887
.
.
.
iter 420 value 8077.803755

final value 8077.800112
converged
Error in nnet:multinom(frmla, data = training_set, maxit = 1000, :
NA/NaN argument

In addition: Warning message:

In nnet:multinom(frmla, data= training_set, maxit = 1000, :
numerical expression has 26 elements: only the first used

So that's my issue. I can't figure out the meaning behind both the  
error message and the warning message above. There are no NA values in  
my data set. I've also tried reducing the number of predictor  
variables, but I get the same issue (just a different number of  
iterations).


Thanks in advance.
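One hedged reading of both messages: `nnet:multinom` uses `:` (the sequence operator) where `::` (the namespace operator) was intended, which would explain a "numerical expression ... only the first used" warning and an NA/NaN error appearing only after the model itself has converged. A minimal sketch of the corrected call (iris used as stand-in data):

```r
library(nnet)  # nnet is a recommended package shipped with R
# '::' (not ':') qualifies a function with its package namespace
model <- nnet::multinom(Species ~ Sepal.Length + Sepal.Width,
                        data = iris, maxit = 1000, trace = FALSE)
head(predict(model))
```

This diagnosis is an assumption from the quoted error text, not something confirmed by the thread.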



[R] Trying to fix code that will find highest 5 column names and their associated values for each row in a data frame in R

2018-12-17 Thread Tom Woolman



I have a data frame with 10 variables of integer data describing various
attributes of each row, and I need to find the 5 highest-valued variables
for each row and output that to a new data frame. In addition to the names
of those 5 variables, I also need the corresponding 5 highest values for
each row.

 A simple code example to generate a sample data frame for this is:

 set.seed(1)
 DF <- matrix(sample(1:9,9),ncol=10,nrow=9)
 DF <- as.data.frame.matrix(DF)


This would result in an example data frame like this:

 #   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
 # 1  3  2  5  6  5  2  6  8  1   3
 # 2  1  4  7  8  7  7  3  4  2   9
 # 3  2  3  4  7  5  8  9  1  3   5
 # 4  3  8  3  4  5  6  7  4  6   5
 # 5  6  2  3  7  2  1  8  3  2   4
 # 6  8  2  4  8  3  2  9  7  6   5
 # 7  1  5  3  6  8  3  8  9  1   3
 # 8  9  3  5  8  4  9  7  8  1   2
 # 9  1  2  4  8  3  2  1  2  5   6


 My ideal output would be something like this:


 #  V1   V2   V3   V4   V5
 # 1  V2:9 V7:8 V8:7 V4:6 V3:5
 # 2  V9:9 V3:8 V5:7 V7:6 V4:5
 # 3  V5:9 V3:8 V2:7 V9:6 V7:5
 # 4  V8:9 V4:8 V2:7 V5:6 V9:5
 # 5  V9:9 V1:8 V6:7 V3:6 V5:5
 # 6  V8:9 V1:8 V5:7 V9:6 V4:5
 # 7  V2:9 V8:8 V7:7 V5:6 V9:5
 # 8  V4:9 V7:8 V9:7 V2:6 V8:5
 # 9  V3:9 V7:8 V8:7 V4:6 V5:5
 # 10 V6:9 V8:8 V1:7 V9:6 V4:5


 I was trying to use code, but this doesn't seem to work:

 out <- t(apply(DF, 1, function(x){
   o <- head(order(-x), 5)
   paste0(names(x[o]), ':', x[o])
 }))
 as.data.frame(out)
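The apply() idea in the snippet above does work once the sample data matches its stated 9x10 shape; a hedged working sketch:

```r
set.seed(1)
# 9 rows x 10 columns of small integers (replace = TRUE since 90 draws > 9 values)
DF <- as.data.frame(matrix(sample(1:9, 90, replace = TRUE),
                           nrow = 9, ncol = 10))
top5 <- t(apply(DF, 1, function(x) {
  o <- head(order(-x), 5)          # column indices of the 5 largest values
  paste0(names(x)[o], ":", x[o])   # "name:value" pairs
}))
top5 <- as.data.frame(top5)
dim(top5)   # 9 rows, 5 columns
```

Ties among equal values are broken by column order here; rank() with a tie-breaking method could be substituted if that matters.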



 Thanks everyone!
