Re: [ai-geostats] F and T-test for samples drawn from the same p

2004-12-05 Thread Chaosheng Zhang



Dear all,

I'm wondering if sample size (number of samples, n) 
is playing a role here.

Since Colin is using Excel to analyse several thousand samples, I have
checked the functions of t-tests in Excel. In the Data Analysis Tools help, a
function is provided for the "t-Test: Two-Sample Assuming Unequal Variances"
analysis. This function is the same as those in many textbooks (there are
other forms of the function). Unfortunately, I cannot find the function for
"assuming equal variances" in Excel, but I assume they are similar, and that
it should be the same as those in some textbooks.

From the function, you can see that when the sample size is large you tend to
get a large t value whenever the two means differ at all. When the sample size
is large enough, even slight differences between the mean values of two data
sets (x bar and y bar) can be detected, and this will result in rejection of
the null hypothesis. This is in fact quite reasonable. When the sample size is
large, you are confident in the mean values (Central Limit Theorem), with a
very small standard error (s/sqrt(n)). Therefore, you can confidently detect
the differences between the two data sets. Even though there is only a slight
difference, you can still say, yes, they are "significantly" different.
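
As a rough illustration of this point, the sketch below evaluates the textbook unequal-variance (Welch) t statistic for a fixed, tiny difference in means while only n grows. The summary values (means 10.05 and 10.00, unit variances) are made up for the example and are not Colin's data, and the function name welch_t is just a label chosen here.

```python
import math


def welch_t(mean_x, mean_y, var_x, var_y, n_x, n_y):
    """Two-sample t statistic assuming unequal variances (Welch form)."""
    standard_error = math.sqrt(var_x / n_x + var_y / n_y)
    return (mean_x - mean_y) / standard_error


# Hypothetical summary statistics: the means differ by only 0.05
# while both sample variances equal 1.0.
for n in (10, 100, 1000, 10000, 100000):
    t = welch_t(10.05, 10.00, 1.0, 1.0, n, n)
    print(f"n = {n:>6}   t = {t:6.2f}")
```

With the difference and the variances held fixed, t grows in proportion to sqrt(n), so any nonzero difference eventually crosses a fixed critical value such as 1.96.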

If you remember, some time ago we had a discussion on the large sample size
problem for tests for normality. When the sample size is large enough, the
result can always be expected (for real data sets), that is, rejection of the
null hypothesis.

Cheers,

Chaosheng
--
Dr. Chaosheng Zhang
Lecturer in GIS
Department of Geography
National University of Ireland, Galway
IRELAND
Tel: +353-91-524411 x 2375
Direct Tel: +353-91-49 2375
Fax: +353-91-525700
E-mail: [EMAIL PROTECTED]
Web 1: www.nuigalway.ie/geography/zhang.html
Web 2: www.nuigalway.ie/geography/gis/index.htm


- Original Message - 

From: "Isobel Clark" [EMAIL PROTECTED]
To: "Donald E. Myers" [EMAIL PROTECTED]
Cc: "Colin Badenhorst" [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Saturday, December 04, 2004 11:49 AM
Subject: [ai-geostats] F and T-test for samples drawn from the same p

 Don

 Thank you for the extended clarification of F and t hypothesis tests. For
 those unfamiliar with the concept, it is worth noting that the F test for
 multiple means may be more familiar under the title "Analysis of variance".

 My own brief answer was in the context of Colin's question, where it was
 quite clear that he was talking about the simplest F variance-ratio and t
 comparison of means tests.

 Isobel



* By using the ai-geostats mailing list you agree to follow its rules 
( see http://www.ai-geostats.org/help_ai-geostats.htm )

* To unsubscribe to ai-geostats, send the following in the subject or in the 
body (plain text format) of an email message to [EMAIL PROTECTED]

Signoff ai-geostats

[ai-geostats] RE: F and T-test for samples drawn from the same p

2004-12-05 Thread Isobel Clark
Hence my recommendation to use cross validation
Isobel
http://geoecosse.bizland.com/books.htm



 --- Colin Daly [EMAIL PROTECTED] wrote: 
 
 
 Hi
 
 Sorry to repeat myself - but the samples are not
 independent.  Independence is a fundamental
 assumption of these types of tests - and you cannot
 interpret the tests if this assumption is violated. 
 In the situation where spatial correlation exists,
 the true standard error is nothing like as small as
 the (s/sqrt(n)) that Chaosheng discusses - because
 the sqrt(n) depends on independence.
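
 To make the role of independence explicit, a standard identity for the variance of the mean of n samples with a common variance is sketched below. The symbols \sigma^2 (common variance), \rho_{ij} (correlation between samples i and j, with \rho_{ii} = 1) and \bar{\rho} (average of the off-diagonal correlations) are notation introduced here, not taken from the thread.

```latex
\operatorname{Var}(\bar{x})
  = \frac{\sigma^2}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \rho_{ij}
  = \frac{\sigma^2}{n} \bigl[ 1 + (n-1)\,\bar{\rho} \bigr]
```

 With \bar{\rho} = 0 this reduces to the familiar \sigma^2 / n, whose square root is the s/sqrt(n) standard error; with positive average correlation the variance stays above \sigma^2 \bar{\rho} however large n becomes.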
 
 Again, as I said before, if the data has any type of
 trend in it, then it is completely meaningless to
 try and use these tests - and with no trend but some
 'ordinary' correlation, you must find a means of
 taking the data redundancy into account or risk getting
 hopelessly pessimistic results (in the sense of
 rejecting the null hypothesis of equal means far too
 often)
 
 Consider a trivial example. A one dimensional random
 function which takes constant values over intervals
 of length one - so, it takes the value a_0 in the
 interval [0,1[  then the value a_1 in the interval
 [1,2[ and so on (let us suppose that each a_n term
 is drawn at random from a Gaussian distribution with
 the same mean and variance for example).  Next
 suppose you are given samples on the interval [0,2].
 You spot that there seems to be a jump between [0,1[
 and [1,2[  - so you test for the difference in the
 means. If you apply an F test you will easily find
 that the mean differs (and more convincingly the
 more samples you have drawn!). However, by
 construction of the random function,  the mean is
 not different.  We have been lulled into the false
 conclusion of differing means by assuming that all
 our data are independent.
 
 Regards
 
 Colin Daly
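
 The trivial example above can be simulated roughly as follows. This is only a sketch under added assumptions: a small independent measurement noise (noise_sd) is added to each sample so that the within-interval variance is not exactly zero, a two-sample t statistic is used (with equal group sizes, as here, the two-group ANOVA F is the square of this t), and 1.96 is used as an approximate 5% critical value. The function names and all parameter values are hypothetical.

```python
import math
import random


def mean_and_variance(values):
    """Sample mean and unbiased sample variance of a list of floats."""
    n = len(values)
    m = sum(values) / n
    return m, sum((v - m) ** 2 for v in values) / (n - 1)


def rejection_rate(n_per_interval, noise_sd=0.5, trials=1000, seed=0):
    """Share of realizations in which 'equal means' is rejected at about 5%.

    Each realization draws a_0 and a_1 from the SAME Gaussian, so the
    means of the random function on [0,1[ and [1,2[ are equal by design.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a0 = rng.gauss(0.0, 1.0)  # constant value on [0,1[
        a1 = rng.gauss(0.0, 1.0)  # constant value on [1,2[
        x = [a0 + rng.gauss(0.0, noise_sd) for _ in range(n_per_interval)]
        y = [a1 + rng.gauss(0.0, noise_sd) for _ in range(n_per_interval)]
        mean_x, var_x = mean_and_variance(x)
        mean_y, var_y = mean_and_variance(y)
        # Standard error computed as if all samples were independent.
        se = math.sqrt(var_x / n_per_interval + var_y / n_per_interval)
        if abs(mean_x - mean_y) / se > 1.96:
            rejections += 1
    return rejections / trials


for n in (30, 100, 300):
    print(f"n per interval = {n:>3}   rejection rate = {rejection_rate(n):.3f}")
```

 Under these assumptions the rejection rate sits far above the nominal 5% and rises with the number of samples, even though the two means are identical by construction.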
 
 


RE: [ai-geostats] F and T-test for samples drawn from the same p

2004-12-05 Thread Pierre Goovaerts
Hello,

I am currently principal investigator on a major NIH grant
that aims to develop software for tests of hypotheses
using alternate hypotheses that are specified by the user and
differ from the omnibus spatial independence;
we called them spatial neutral models.
For example, you can test for clusters of cancer rates
above and beyond a regional background in exposure.
The p-values are computed using randomization and I applied
geostatistical simulation to generate multiple realizations
that are then used to derive the empirical distribution of
the test statistic.
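
The general shape of a randomization p-value can be sketched as below. This is not the software described above: the null-model simulator (toy_null_stat) is a deliberately trivial placeholder standing in for geostatistical simulation under a spatial neutral model, and every name and number in the block is hypothetical.

```python
import random


def monte_carlo_p_value(observed_stat, simulate_stat, n_sim=999, seed=0):
    """Upper-tail empirical p-value from simulated null realizations.

    simulate_stat(rng) must return the test statistic computed on one
    realization generated under the chosen null (neutral) model.
    """
    rng = random.Random(seed)
    simulated = [simulate_stat(rng) for _ in range(n_sim)]
    n_as_extreme = sum(1 for s in simulated if s >= observed_stat)
    # The observed value counts as one more realization of the null model.
    return (1 + n_as_extreme) / (1 + n_sim)


def toy_null_stat(rng, n_sites=25):
    """Placeholder null model: largest of n_sites independent N(0, 1) values."""
    return max(rng.gauss(0.0, 1.0) for _ in range(n_sites))


p = monte_carlo_p_value(observed_stat=3.5, simulate_stat=toy_null_stat)
print(f"empirical p-value: {p:.3f}")
```

Swapping toy_null_stat for a simulator that honours a specified spatial structure changes the reference distribution, and hence the p-value, without changing the rest of the procedure.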

I presented an example during the last GeoEnv conference
and I put a PDF copy of the paper, which is in press for
the moment, on my website.

Cheers,

Pierre



Dr. Pierre Goovaerts
President of PGeostat, LLC
Chief Scientist with Biomedware Inc.
710 Ridgemont Lane
Ann Arbor, Michigan, 48103-1535, U.S.A.

E-mail:  [EMAIL PROTECTED]
Phone:   (734) 668-9900
Fax: (734) 668-7788
http://alumni.engin.umich.edu/~goovaert/





RE: [ai-geostats] F and T-test for samples drawn from the same p

2004-12-05 Thread Mat (University Account)
 
Sorry if this is somewhat off subject - but I'd like to discuss (and invite
further comments on) Colin's remarks regarding the effects of a lack of
independence on standard statistical tests.

He mentioned that a lack of independence typically removes a large part of
the usability of basic tests unless corrected for spatial variables.
The standard argument goes something like: 
'Spatial autocorrelation means that the sampled values are not independent, 
so you have less information than you think (i.e. your estimated degrees of
freedom are too large). 
Consequently, the variance is underestimated and confidence intervals are
too small (or the type I error is under-reported)'.

My understanding is that this argument is quite valid when you are inferring
beyond the area from which you have sampled (or inferring about the
stochastic process generating the sample data). 
However, it's probably worth mentioning that if you are simply looking to
compare the parameters of specified areas (or volumes) and you have used a
sensible design-based sampling method (e.g. SRS), then autocorrelation poses
no problem.

i.e. if you have randomly sampled some regionalized variable in volume X and
volume Y, and simply wish to determine if, say, the population means of
these volumes are different -- then the sample points will be independent
(relative to the area of inference). In this scenario, classical statistical
tests can be used to compare the realization parameters of the different
areas.
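
A small simulation can illustrate the design-based view. The sketch below builds one fixed, strongly autocorrelated 1-D "realization" (a moving average of white noise, chosen only for convenience), then repeatedly draws simple random samples from it and checks how often a classical t-type interval covers the mean of that realization. It covers a single area rather than a comparison of two, and all names, sizes and the critical value 2.01 (approximately t at 0.975 with 49 degrees of freedom) are assumptions made for the example.

```python
import math
import random


def make_realization(n_cells=5000, window=50, seed=1):
    """One fixed autocorrelated field: moving average of white noise."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_cells + window)]
    return [sum(noise[i:i + window]) / window for i in range(n_cells)]


def srs_coverage(field, n=50, trials=2000, t_crit=2.01, seed=2):
    """Share of simple random samples whose interval covers the field mean."""
    rng = random.Random(seed)
    realization_mean = sum(field) / len(field)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(field, n)  # SRS without replacement
        m = sum(sample) / n
        var = sum((v - m) ** 2 for v in sample) / (n - 1)
        half_width = t_crit * math.sqrt(var / n)
        if abs(m - realization_mean) <= half_width:
            hits += 1
    return hits / trials


field = make_realization()
coverage = srs_coverage(field)
print(f"coverage of the realization mean: {coverage:.3f}")
```

Here the randomness comes from the sampling design, not from the process that generated the field, which is why the classical interval behaves close to nominally for the realization mean despite the autocorrelation.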

The question that often fails to be asked is - What inference space are
we interested in? Do we wish to discuss the process that generated the data,
or simply make inference about the actual physical realization?
Geostatistics avoids many complications with autocorrelation by typically
restricting inference to the actual data, rather than the stochastic
process.

In your particular case I would expect that statistically showing that: 
(a) two horizons exhibit the same mineral content/spatial structure and 
(b) two horizons derive from the same process
are very different problems.

Certainly within biology, the difference between these situations does not
seem to be well understood - I am curious whether geostatisticians distinguish
between them as a matter of course?

regards,
Matthew Pawley






[ai-geostats] Large samples, t tests, etc

2004-12-05 Thread myers

Most of the tests of hypotheses that have been mentioned recently on this
listserv are non-spatial, i.e., there is nothing in the underlying statistical
assumptions that specifically pertains to spatial data. The one common
assumption is random sampling or  iid (independent, identically
distributed). In many typical (non-spatial) applications, this assumption is
ensured by the design of the experiment, i.e., the way the data is generated
and collected. Spatial data problems more often involve observational data
which does not easily lend itself to being able to design the experiment in such
a way as to ensure this basic assumption. 

In the case of spatial data, random site selection does not necessarily
correspond to random sampling. In the case of the random function model
implicit in most of geostatistics, the data is a non-random sample from one
realization of the random function (in that context using random site selection
does not then make it a random sample). Note that not all spatial statistical
analysis methods are based on this random function model.

Normality is another common underlying assumption in many hypothesis tests. In
the case of random sampling from a distribution with a finite moment of order
2+delta, delta > 0, the distribution of the (suitably standardized) sample mean
will converge IN DISTRIBUTION to a normal distribution. This means that a
sequence of functions is converging to another function. It is important to
note that this convergence may be pointwise or uniform or uniform on intervals.
Pointwise is what you usually get from the Central Limit Theorem; this means
that the rate of convergence depends on where you are on the curve. The
difference between using a normal statistic vs using a t-statistic usually is
the difference between a known variance and an unknown variance (and hence
estimated). But in either case the variance is assumed to exist and be finite.
The sample variance can always be computed from a data set but that does not
ensure that the variance of the distribution exists. The quotient of two
independent standard normal random variables has a Cauchy distribution, for
which neither the mean nor the variance is finite. Hence the Central Limit
Theorem does not apply.
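
The Cauchy case can be checked numerically. The sketch below averages n ratios of two independent standard normal draws and measures the spread of those averages by their interquartile range; the function names, the number of replicates and the seed are arbitrary choices made for the illustration.

```python
import random


def mean_of_ratios(n, rng):
    """Sample mean of n ratios of two independent standard normal draws."""
    total = 0.0
    for _ in range(n):
        total += rng.gauss(0.0, 1.0) / rng.gauss(0.0, 1.0)
    return total / n


def interquartile_range(values):
    """Crude interquartile range taken from the sorted values."""
    ordered = sorted(values)
    return ordered[(3 * len(ordered)) // 4] - ordered[len(ordered) // 4]


def iqr_of_ratio_means(n, reps=400, seed=0):
    """Spread of the sample mean over many repeated samples of size n."""
    rng = random.Random(seed)
    return interquartile_range([mean_of_ratios(n, rng) for _ in range(reps)])


for n in (10, 100, 1000):
    print(f"n = {n:>4}   IQR of sample means = {iqr_of_ratio_means(n):.2f}")
```

For a distribution with finite variance the spread of the sample mean would shrink like 1/sqrt(n); here it stays at roughly the same size for every n, because the mean of n standard Cauchy variables is again standard Cauchy.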

In the case of a non-normal distribution one really needs to know how robust the
test is to deviation from normality, increasing the sample size does not really
solve this problem.

Finally note that most tests of hypotheses are not exactly neutral, there is a
tendency to accept the null hypothesis UNLESS there is evidence against the null
hypothesis, this is one of the reasons for the emphasis on the POWER of the
test.  Often the null hypothesis is the status quo and this logical stance for
the null and alternative hypotheses is okay but not in all circumstances. 

However in some tests for normality (which still depend on the assumption of
random sampling) the test is set up in such a way that the null hypothesis
corresponds to the conclusion of normality.  E.g., Chi-square tests. If you are
trying to argue that it is safe to assume normality then you want to accept the
null hypothesis and you should want a very high power for the test, you don't
want a small p-value, instead you want a very large p-value. Note that the
normal distribution is symmetric but not all symmetric distributions are normal.


Donald Myers




Re: [ai-geostats] F and T-test for samples drawn from the same p

2004-12-05 Thread Digby Millikan
In every resource model I have done, I always subdivide the populations into
those of equal mean and variance, so that stationarity is obeyed. Is this the
correct procedure? I haven't read Mining Geostatistics in detail yet, but
understood that this was a basic requirement for geostatistical modelling
procedures.

http://www.users.on.net/~digbym/about_consulting.htm
Digby


Re: [ai-geostats] F and T-test for samples drawn from the same population

2004-12-05 Thread Digby Millikan
I believe a related topic is the proportional effect, which appears when
populations display related but different properties, as discussed in
Geostatistical Ore Reserve Estimation, M. David, p. 170, and which also shows
up in a study of the normal and relative variograms.

Regards Digby 

