[R] p-values and temporal uncertainty in variables

Yahoo Mon, 18 May 2015 10:06:13 -0700

Dear List,

This is one of those questions that are easier to be explained with an example. 
I also apologise as strictly speaking this is not an R question, but more 
general methodological question that I would like to solve in R.


Suppose we have the following data

#Generate artificial data#
DateMeans<-c(900,1000,1250,1300,1380,1400,1400,1500,1550)
DateSDs<-c(300,400,100,100,60,60,120,100,50)
index=sample(1:9,size=400,prob=c(0.238,0.119,0.238,0.048,0.095,0.048,0.048,0.024,0.143),replace=TRUE)
A1=rnorm(100,mean=5,sd=1)
A2=rnorm(100,mean=5,sd=1)
B1=rnorm(100,mean=5,sd=2)
B2=rnorm(100,mean=5,sd=1.5)

data=data.frame(type=c(rep("a",200),rep("b",200)),size=c(A1,A2,B1,B2),dateAvg=DateMeans[index]+round(runif(400,min=0,max=100)),dateSd=DateSDs[index]+round(runif(400,min=0,max=20)))
head(data)

Let’s say we are measuring the size of two objects labelled “a” and “b” (the 
column “type”), and we are interested whether one is bigger than the other. 
They are normally distributed so a standard t-test will suffice. However, 
suppose also that these object have a time-stamp (column dateAvg), each with 
some degree of uncertainty (column dateSd) described as a Gaussian error. If I 
want to now the average size of type “a” for a given interval (let’s say 
between 1300 and 1400), I could simply resample as follow:

nsim<-1000 #number of simulations
MeanOfA<-numeric(length=nsim)

for (s in 1:nsim)
{
simdates<-rnorm(nrow(data),mean=data$dateAvg,sd=data$dateSd)
i=which(simdates>=interestBlock[1]&simdates<=interestBlock[2])
MeanOfA[s]=mean(subset(data[i,],type=="a")$size)
}

This will give me a distribution of means that will include the uncertainty of 
our summary statistics. I think there is nothing wrong with this.

But suppose I want to test whether objects of type “a” are different in size in 
comparison with objects of type “b”. I don’t think comparing the means obtained 
from the simulation is fair, as the significance will change as a function of 
the number of simulations. All I can think of is to compute a t-test for each 
iteration, something like the following:


nsim=1000
interestBlock=c(1400,1500)
statistic<-numeric(length=nsim)
pvals<-numeric(length=nsim)
diffs<-numeric(length=nsim)
for (s in 1:nsim)
{
simdates<-rnorm(nrow(data),mean=data$dateAvg,sd=data$dateSd)
i=which(simdates>=interestBlock[1]&simdates<=interestBlock[2])
tmpdata=data[i,]
diffs[s]=mean(subset(tmpdata,type=="a")$size)-mean(subset(tmpdata,type=="b")$size)
res=t.test(x=subset(tmpdata,type=="a")$size,y=subset(tmpdata,type=="b")$size)
statistic[s]=res$statistic
pvals[s]=res$p.value  
}

And then get a distribution of p-values (that I frankly find useless). Is there 
a way to get a single p-value? The only alternative I can think of is to get a 
95% confidence interval of the difference in mean, and see if this includes no 
difference in mean or not…. 

Many thanks in advance,
Enrico

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] p-values and temporal uncertainty in variables

Reply via email to