Re: [R] What are the pros and cons of the log.p parameter in (p|q)norm and similar?

matthias-gondan Wed, 04 Aug 2021 05:08:39 -0700

Response to 1: You need the log version, e.g. in maximum likelihood; otherwise the 
product of the densities and probabilities can become very small and underflow to zero.
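
As a rough illustration of that point (not from the thread: the simulated data, the censoring point 1.5 and the helper names lik and loglik are all made up for this sketch), a right-censored normal likelihood multiplies densities for the fully observed values by tail probabilities for the censored ones, so it needs both dnorm(log = TRUE) and pnorm(log.p = TRUE):

```r
# Assumed example data: 2000 standard-normal draws, right-censored at 1.5
set.seed(1)
x    <- rnorm(2000)
cens <- x > 1.5        # TRUE where we only know the value exceeds 1.5
y    <- pmin(x, 1.5)   # what is actually observed

# Likelihood on the natural scale: product of densities (uncensored)
# and upper-tail probabilities (censored)
lik <- function(mu, sigma) {
  prod(dnorm(y[!cens], mu, sigma)) *
    prod(pnorm(y[cens], mu, sigma, lower.tail = FALSE))
}

# The same likelihood on the log scale: sums instead of products
loglik <- function(mu, sigma) {
  sum(dnorm(y[!cens], mu, sigma, log = TRUE)) +
    sum(pnorm(y[cens], mu, sigma, lower.tail = FALSE, log.p = TRUE))
}

lik(0, 1)     # underflows to 0
loglik(0, 1)  # finite, so an optimizer has something to work with

# Maximise the log-likelihood; sigma is parametrised as exp(par[2])
# so that it stays positive during the search
fit <- optim(c(0, 0), function(par) -loglik(par[1], exp(par[2])))
c(mu = fit$par[1], sigma = exp(fit$par[2]))
```

On the natural scale the optimizer would only ever see zeros here, whereas the log-scale version has a well-defined maximum near the true values (mean 0, standard deviation 1).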

-------- Original message --------
From: r-help-requ...@r-project.org
Date: 04.08.21  12:01  (GMT+01:00)
To: r-help@r-project.org
Subject: R-help Digest, Vol 222, Issue 4

Send R-help mailing list submissions to
        r-help@r-project.org

To subscribe or unsubscribe via the World Wide Web, visit
        https://stat.ethz.ch/mailman/listinfo/r-help
or, via email, send a message with subject or body 'help' to
        r-help-request@r-project.org

You can reach the person managing the list at
        r-help-owner@r-project.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of R-help digest..."

Today's Topics:

   1. What are the pros and cons of the log.p parameter in
      (p|q)norm and similar? (Michael Dewey)
   2. Help with package EasyPubmed (bharat rawlley)
   3. Re: Help with package EasyPubmed (bharat rawlley)
   4. Re:  What are the pros and cons of the log.p parameter in
      (p|q)norm and similar? (Duncan Murdoch)
   5. Re:  What are the pros and cons of the log.p parameter in
      (p|q)norm and similar? (Bill Dunlap)
   6. Creating a log-transformed histogram of multiclass data
      (Tom Woolman)
   7. Re: Creating a log-transformed histogram of multiclass data
      (Tom Woolman)

----------------------------------------------------------------------

Message: 1
Date: Tue, 3 Aug 2021 17:20:12 +0100
From: Michael Dewey
  <li...@dewey.myzen.co.uk>
To: "r-help@r-project.org" <r-help@r-project.org>
Subject: [R] What are the pros and cons of the log.p parameter in
  (p|q)norm and similar?
Message-ID: <e17bdaaa-7945-4f37-ee69-941eb8270...@dewey.myzen.co.uk>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

Short version

Apart from the ability to work with values of p too small to be of much
practical use what are the advantages and disadvantages of setting this
to TRUE?

Longer version

I am contemplating upgrading various functions in one of my packages to
use this and as far as I can see it would only have the advantage of
allowing people to use very small p-values but before I go ahead have I
missed anything? I am most concerned with negatives but if there is any
other advantage I would mention that in the vignette. I am not concerned
about speed or the extra effort in coding and expanding the documentation.

-- 
Michael
http://www.dewey.myzen.co.uk/home.html

------------------------------

Message: 2
Date: Tue, 3 Aug 2021 18:20:52 +0000 (UTC)
From: bharat rawlley
  <bharat_m_...@yahoo.co.in>
To: R-help Mailing List <r-help@r-project.org>
Subject: [R] Help with package EasyPubmed
Message-ID: <1046636584.2205366.1628014852...@mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

Hello,

When I try to run the following code using the package Easypubmed, I get
a null result -

> batch_pubmed_download(query_7)
NULL

#query_7 <- "Cardiology AND randomizedcontrolledtrial[Filter] AND 2011[PDAT]"

However, the exact same search string yields 668 results on Pubmed. I am
unable to figure out why this is happening. If I use the search string
"Cardiology AND 2011[PDAT]" then it works just fine.

Any help would be greatly appreciated.

Thank you!

        [[alternative HTML version deleted]]

------------------------------

Message: 3
Date: Tue, 3 Aug 2021 18:26:40
  +0000 (UTC)
From: bharat rawlley <bharat_m_...@yahoo.co.in>
To: R-help Mailing List <r-help@r-project.org>
Subject: Re: [R] Help with package EasyPubmed
Message-ID: <712126143.2207911.1628015200...@mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

Okay, the following search string resolved my issue -

"Cardiology AND randomized controlled trial[Publication type] AND
2011[PDAT]"

Thank you!

------------------------------

Message: 4
Date: Tue, 3 Aug 2021 14:53:28
  -0400
From: Duncan Murdoch <murdoch.dun...@gmail.com>
To: Michael Dewey <li...@dewey.myzen.co.uk>, "r-help@r-project.org"
  <r-help@r-project.org>
Subject: Re: [R]  What are the pros and cons of the log.p parameter in
  (p|q)norm and similar?
Message-ID: <c15f610b-7a16-9d84-884c-54cc170bb...@gmail.com>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

On 03/08/2021 12:20 p.m., Michael Dewey wrote:
> Short version
> 
> Apart from the ability to work with values of p too small to be of much
> practical use what are the advantages and disadvantages of setting this
> to TRUE?
> 
> Longer version
> 
> I am contemplating upgrading various functions in one of my packages to
> use this and as far as I can see it would only have the advantage of
> allowing people to use very small p-values but before I go ahead have I
> missed anything? I am most concerned with negatives but if there is any
> other advantage I would mention that in the vignette. I am not concerned
> about speed or the extra effort in coding and expanding the documentation.
> 

These are often needed in likelihood problems.  In just about any
problem where the normal density shows up in the likelihood, you're
better off working with the log likelihood and setting log = TRUE in
dnorm, because sometimes you want to evaluate the likelihood very far
from its mode.

The same sort of thing happens with pnorm for similar reasons.  Some
likelihoods involve normal integrals and will need it.

I can't think of an example for qnorm off the top of my head, but I
imagine there are some:  maybe involving simulation way out in the tails.

The main negative about using logs is that they aren't always needed.

Duncan Murdoch

------------------------------

Message: 5
Date: Tue, 3 Aug
  2021 13:24:08 -0700
From: Bill Dunlap <williamwdun...@gmail.com>
To: Duncan Murdoch <murdoch.dun...@gmail.com>
Cc: Michael Dewey <li...@dewey.myzen.co.uk>, "r-help@r-project.org"
  <r-help@r-project.org>
Subject: Re: [R]  What are the pros and cons of the log.p parameter in
  (p|q)norm and similar?
Message-ID:
  <cahqsrusbqyuyj5a9yrhk3bhxpn5umbxq54bkhau3g6yrocn...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

In maximum likelihood problems, even when the individual density values are
fairly far from zero, their product may underflow to zero.  Optimizers have
problems when there is a large flat area.

   > q <- runif(n=1000, -0.1, +0.1)
   > prod(dnorm(q))
   [1] 0
   > sum(dnorm(q, log=TRUE))
   [1] -920.6556

A more minor advantage for some probability-related functions is speed.
E.g., dnorm(log=TRUE,...) does not need to evaluate exp().

   > q <- runif(1e6, -10, 10)
   > system.time(for(i in 1:100)dnorm(q, log=FALSE))
      user  system elapsed
      9.13    0.11    9.23
   > system.time(for(i in 1:100)dnorm(q, log=TRUE))
      user  system elapsed
      4.60    0.19    4.78

-Bill

------------------------------

Message: 6
Date: Tue, 03 Aug 2021
  18:56:08 -0400
From: Tom Woolman <twool...@ontargettek.com>
To: r-help@r-project.org
Subject: [R] Creating a log-transformed histogram of multiclass data
Message-ID: <2bc87c25f161bac1d8e5101e20bf2...@ontargettek.com>
Content-Type: text/plain; charset="us-ascii"; Format="flowed"

# Resending this message since the original email was held in queue by
the listserv software because of a "suspicious" subject line, and/or
because of attached .png histogram chart attachments. I'm guessing
that the listserv software doesn't like multiple image file
attachments.

Hi everyone. I'm working on a research model now that is calculating
anomaly scores (RMSE values) for three distinct groups within a large
dataset. The anomaly scores are a continuous data type and are quite
small, ranging from approximately 1e-04 to 1e-07 across a population
of approximately 1 million observations.

I have all of the summary and descriptive statistics for each of the
anomaly score distributions across each group label in the dataset,
and I am able to create some useful histograms showing how each of the
three groups is uniquely distributed across the range of scores.
However, because of the large variance within the frequency of score
values and the high density peaks within much of the anomaly scores, I
need to use a log transformation within the histogram to show both the
log frequency count of each binned observation range (y-axis) and a
log transformation of the binned score values (x-axis) to be able to
appropriately illustrate the distributions within the data and make it
more readily understandable.

Fortunately, ggplot2 is really useful for creating some really
attractive dual-axis log transformed histograms.

However, I cannot figure out a way to create the log transformed
histograms to show each of my three groups by color within the same
histogram. I would want it to look like this, BUT use a log
transformation for each axis. This plot below shows the 3 groups in
one histogram but uses the default normal values.

For log transformed axis values, the best I can do so far is produce
three separate histograms, one for each group.

Below is sample R code to illustrate my problem with a
randomly-generated example dataset and the ggplot2 approaches that I
have taken so far:

# Sample R code below:

library(ggplot2)
library(dplyr)
library(hrbrthemes)

# I created some simple
# random sample data to produce an example dataset.
# This produces an example dataframe called d, which contains a class
# label IV of either A, B or C for each observation. The target variable
# is the anomaly_score continuous value for each observation.
# There are 300 rows of dummy data in this dataframe.

DV_score_generator = round(runif(300,0.001,0.999), 3)
d <- data.frame( label = sample( LETTERS[1:3], 300, replace=TRUE,
                 prob=c(0.65, 0.30, 0.05) ),
                 anomaly_score = DV_score_generator)

# First, I use ggplot to create the normal distribution histogram that
# shows all 3 groups on the same plot, by color.
# Please note that with this small set of randomized sample data it
# doesn't appear to be necessary to use an x and y-axis log
# transformation to show the distribution patterns, but it does become
# an issue with my vastly larger and more complex score values in the DV
# of the actual data.

p <- d %>%
  ggplot( aes(x=anomaly_score, fill=label)) +
  geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
  scale_fill_manual(values=c("#69b3a2", "blue", "#404080")) +
  theme_ipsum() +
  labs(fill="")

p

# Produces a normal multiclass histogram.

# Now produce a series of x and y-axis log-transformed histograms,
# producing one histogram for each distinct label class in the dataset:

# Group A,
# log transformed

ggplot(group_a, aes(x = anomaly_score)) +
     geom_histogram(aes(y = ..count..), binwidth = 0.05,
     colour = "darkgoldenrod1", fill = "darkgoldenrod2") +
     scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
     scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
     ggtitle("Transformed Anomaly Scores - Group A Only")

# Group A transformed histogram is produced here.

# Group B, log transformed

ggplot(group_b, aes(x = anomaly_score)) +
     geom_histogram(aes(y = ..count..), binwidth = 0.05,
     colour = "green", fill = "darkgreen") +
     scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
     scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
     ggtitle("Transformed Anomaly Scores - Group B Only")

# Group B transformed histogram is produced here.

# Group C, log transformed

ggplot(group_c, aes(x = anomaly_score)) +
     geom_histogram(aes(y = ..count..), binwidth = 0.05,
     colour = "red", fill = "darkred") +
     scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
     scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
     ggtitle("Transformed Anomaly Scores - Group C Only")

# Group C transformed histogram is produced here.

# End.

Thanks in advance, everyone!

- Tom

Thomas A. Woolman, PhD Candidate (Indiana State University), MBA, MS, MS
On Target Technologies, Inc.
Virginia, USA

------------------------------

Message: 7
Date: Tue, 03 Aug 2021 19:04:29
  -0400
From: Tom Woolman <twool...@ontargettek.com>
To: r-help@r-project.org
Subject: Re: [R] Creating a log-transformed histogram of multiclass data
Message-ID: <ba170db0581b2b7f5c79448355685...@ontargettek.com>
Content-Type: text/plain; charset="us-ascii"; Format="flowed"

Apologies, I left out 3 critical lines of code after the randomized
sample dataframe is created:

group_a <- d[ which(d$label =='A'), ]
group_b <- d[ which(d$label =='B'), ]
group_c <- d[
  which(d$label =='C'), ]

------------------------------

End of R-help Digest, Vol 222, Issue 4
**************************************
        [[alternative HTML version deleted]]


______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
