Re: Applied analysis question

2002-03-03 Thread Rich Ulrich


On 28 Feb 2002 07:37:16 -0800, [EMAIL PROTECTED] (Brad Anderson)
wrote:
> Rich Ulrich <[EMAIL PROTECTED]> wrote in message 
>news:<[EMAIL PROTECTED]>...
> > On 27 Feb 2002 11:59:53 -0800, [EMAIL PROTECTED] (Brad Anderson)
> > wrote:
BA > > > 
> > > I have a continuous response variable that ranges from 0 to 750.  I
> > > only have 90 observations and 26 are at the lower limit of 0, which is
> > > the modal category.  The mean is about 60 and the median is 3; the
> > > distribution is highly skewed, extremely kurtotic, etc.  Obviously,
> > > none of the power transformations are especially useful.  The product
> > 
[ snip, my own earlier comments ]
BA >
> I should have been more precise.  It's technically a count variable
> representing the number of times respondents report using dirty
> needles/syringes after someone else had used them during the past 90
> days.  Subjects were first asked to report the number of days they had
> injected drugs, then the average number of times they injected on
> injection days, and finally, on how many of those total times they had
> used dirty needles/syringes.  All of the subjects are injection drug
> users, but not all use dirty needles.  The reliability of reports near
> 0 is likely much better than the reliability of estimates near 750. 
> Indeed, substantively, the difference between a 0 and 1 is much more
> significant than the difference between a 749 and a 750--0 represents
> no risk, 1 represents at least some risk, and high values--regardless
> of the precision, represent high risk.

Okay, here is a break for some comment by me.

There are two immediate aims of analyses:  to show that
results are extreme enough that they don't happen by 
chance - statistical testing;  and to characterize the results 
so that people can understand them - estimation.

When the mean is 60 and the median is 3, reporting averages as if they
described central tendency is not going to help much with either aim.  If
you want to look at outcomes, you make groups (as you did) that seem
somewhat homogeneous: 0 (if it really is homogeneous), 1, 2-3, and so on,
up to your top group of 90+, which works out to 'daily' and seems
reasonable as a top end.  Using groups ought to give you a robust test,
whatever you are testing, unless the distinctions between 10 and 500
needle-sticks become important.  Using groups also lets you inspect,
in particular, the means for 0, 1, 2 and 3.
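
For instance, that grouping could be set up along these lines in S-Plus/R
(a sketch only; 'times.dirty' and the toy values are invented, not the
actual data):

  # Hypothetical count outcome: number of dirty-needle uses in 90 days.
  times.dirty <- c(0, 0, 1, 3, 7, 90, 250, 750)    # toy values only

  # Recode into roughly homogeneous groups: 0, 1, 2-3, 4-89, 90+ ('daily').
  risk.group <- cut(times.dirty,
                    breaks = c(-Inf, 0, 1, 3, 89, Inf),
                    labels = c("0", "1", "2-3", "4-89", "90+ (daily)"))

  table(risk.group)                       # group sizes
  tapply(times.dirty, risk.group, mean)   # inspect the means within groups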

I started thinking that the dimension is something like 
'promiscuous use of dirty needles';  and I realized that
an analogy to risky sex was not far wrong.  Or, at any rate,
doesn't seem far wrong to me.  But your  measure 
(the one that you mention, anyway) does not distinguish
between 1 act each with 100 risky partners, and 100 acts 
with one. 

Anyway, one way to describe the groups is to have some
experts place the reports of behaviors into 'risk-groups'.
Or assign the risk scores.   Assuming that those scores do
describe your sample, without great non-normality, you 
should be able to use averages of risk-scores for a technical
level of testing and reporting, and convert them back to the
verbal anchor-descriptions in order to explain what they mean.


[ ...Q about zero; kurtosis.]
RU > >
> > Categorizing the values into a few categories labeled, 
> > "none, almost none, "  is one way to convert your scores.  
> > If those labels do make sense.
> 
> Makes sense at the low end 0 risk.  And at the high end I used 90+
> representing using a dirty needle/syringe once a day or more often. 
> The 2 middle categories were pretty arbitrary.

[ snip, other procedures ]

> One of the other posters asked about the appropriate error term--I
> guess that lies at the heart of my inquiry.  I have no idea what the
> appropriate error term would be, and to best model such data.  I often
> deal with similar response variables that have distributions in which
> observations are clustered at 1 or both ends of the continuum.  In
> most cases, these distributions are not even approximately unimodal
> and a bit skewed--variables for which normalizing power
> transformations make sense.  Additionally, these typically aren't
> outcomes that could be thought of as being generated by a gaussian
> process.

Can you describe them usefully?  What is the shape of
the behaviors that you observe or expect, corresponding to
the drop-off of density near either extreme?

> In some cases I think it makes sense to consider poisson and
> generalizations of poisson processes although there is clearly much
> greater between subject heterogeneity than assumed by a poisson
> process.  I estimated poission and negative binomial regression
> models--there was compelling evidence that the poission was
> overdispersed.  I also used a Vuong statistic to compare NB regression

[ snip, more detail ]

> I think a lot of folks just run standard analyses or arbitrarily apply
> some "normalizing" transformation because that's whats done in their
> field. 

Re: Applied analysis question

2002-03-03 Thread Rich Ulrich

On 27 Feb 2002 17:16:26 -0800, [EMAIL PROTECTED] (Dennis Roberts) wrote:

> i thought of a related data situation ...but at the opposite end
> what if you were interested in the relationship between the time it takes 
> students to take a test AND their test score
> 
> so, you have maybe 35 students in your 1 hour class that starts at 9AM ...
> 
> you decide to note (by your watch) the time they turn in the test ... and 
> about 9:20 the first person turns it in ... then 9:35 the second  9:45 
> the 3rd  9:47 the 4th ... and then, as you get to 10, when the time 
> limit is up ... the rest sort of come up to the desk at the same time
> 
> for about 1/2 of the students, you can pretty accurately write down the 
> time ... but, as it gets closer to the time limit, you have more of a 
> (literal) rush  and, at the end ... you probably put down the same time 
> on the last 8 students
> 
> you could decide just to put the order of the answer sheet as it sits in 
> the pile ... or, you might collapse the set to 3 groupings ... quick turner 
> iners, middle time turner iners ... and slow turner iners BUT, this clouds 
> the data
[ snip, rest]

Looks to me like it might be reasonable to re-sort and re-score
the speed as reciprocal, "questions per hour" -- instead of 
the original, hours per question.  That emphasizes something 
you (perhaps) omitted:  some tests at the end were incomplete.

Also, Q/H  accommodates that early test that was nearly blank.

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





Re: Splus or R

2002-03-03 Thread Eric Bohlman

Anonymous God-fearer <[EMAIL PROTECTED]> wrote:
> Does anyone know how to generate a correlation matrix given a covariance
> matrix in Splus?

> Or could you give the details of how to do it in another language?

corr[i,j] = cov[i,j]/sqrt(cov[i,i]*cov[j,j])
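
A small sketch of the same thing for a whole matrix, in S/R (base R also
has cov2cor(), which should agree with this):

  # Element-wise conversion of a covariance matrix V to a correlation matrix.
  cov.to.cor <- function(V) {
    s <- sqrt(diag(V))          # standard deviations
    V / outer(s, s)             # corr[i,j] = cov[i,j] / (s[i] * s[j])
  }

  V <- matrix(c(4, 2, 2, 9), nrow = 2)    # made-up covariance matrix
  cov.to.cor(V)                           # compare with cov2cor(V)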








Re: help on factor analysis/non-normality

2002-03-01 Thread Robert Ehrlich

To amplify a bit: the interpretability of regression tends to go down as
the assumptions of normality and homogeneous variance become markedly
different from reality.  You can still go through the calculations, but the
interpretation of results gets tricky.  Factor analysis is a sort of
regression analysis and so suffers in the same way from breakdowns of
those assumptions.

Rich Ulrich wrote:

> On 1 Mar 2002 04:51:42 -0800, [EMAIL PROTECTED] (Mobile Survey)
> wrote:
>
> > What do i do if I need to run a factor analysis and have non-normal
> > distribution for some of the items (indicators)? Does Principal
> > component analysis require the normality assumption.
>
> There is no problem of non-normality, except that it *implies*
> that decomposition  *might*  not give simple structures.
> Complications are more likely when covariances are high.
>
> What did you read, that you are trying to respond to?
>
> >  Can I use GLS to
> > extract the factors and get over the problem of non-normality. Please
> > do give references if you are replying.
> > Thanks.
>
> --
> Rich Ulrich, [EMAIL PROTECTED]
> http://www.pitt.edu/~wpilib/index.html






Re: AIC

2002-03-01 Thread Alan Miller

SR Millis wrote in message <[EMAIL PROTECTED]>...
>What is the correct pronunciation for "Akaike" as in AIC?
>
>Thanks,
>SR Millis (rhymes with "bacillus")
>
>


In Japanese, all letters are pronounced.
Try: Aka-ee-ke.
Now try pronouncing Toyota!  `y' is always a consonant in Japanese, so it
should be something like To-yow-ta, where the first `o' is short,
instead of what we usually hear: Toy-ow-ta.
--
Alan Miller (Honorary Research Fellow, CSIRO Mathematical
& Information Sciences)
http://www.ozemail.com.au/~milleraj
http://users.bigpond.net.au/amiller/








Re: Robust regression

2002-03-01 Thread Vadim and Oxana Marmer

If, for example, the normality assumption holds, then by doing robust
regression instead of OLS you lose efficiency.  So it's not the same
result after all.  But you can do both, compare, and decide.  If robust
regression produces results which are not really different from the OLS,
then stay with OLS.

On Fri, 1 Mar 2002, Rich Ulrich wrote:

> On 1 Mar 2002 00:36:01 -0800, [EMAIL PROTECTED] (Alex Yu)
> wrote:
>
> >
> > I know that robust regression can downweight outliers. Should someone
> > apply robust regression when the data have skewed distributions but do not
> > have outliers? Regression assumptions require normality of residuals, but
> > not the normality of raw scores. So does it help at all to use robust
> > regression in this situation. Any help will be appreciated.
>
> Go ahead and do it if you want.
>
> If someone asks (or even if they don't), you can tell
> them that robust regression gives exactly the same result.
>
>
> --
> Rich Ulrich, [EMAIL PROTECTED]
> http://www.pitt.edu/~wpilib/index.html
>






Re: Robust regression

2002-03-01 Thread Vadim and Oxana Marmer


You don't need normality for regression. You may need it for certain
optimality properties to hold, but you can apply OLS without normality.

On 1 Mar 2002, Alex Yu wrote:

>
> I know that robust regression can downweight outliers. Should someone
> apply robust regression when the data have skewed distributions but do not
> have outliers? Regression assumptions require normality of residuals, but
> not the normality of raw scores. So does it help at all to use robust
> regression in this situation. Any help will be appreciated.
>
>
>






Re: Robust regression

2002-03-01 Thread Rich Ulrich

On 1 Mar 2002 00:36:01 -0800, [EMAIL PROTECTED] (Alex Yu)
wrote:

> 
> I know that robust regression can downweight outliers. Should someone
> apply robust regression when the data have skewed distributions but do not
> have outliers? Regression assumptions require normality of residuals, but
> not the normality of raw scores. So does it help at all to use robust
> regression in this situation. Any help will be appreciated. 

Go ahead and do it if you want.  

If someone asks (or even if they don't), you can tell 
them that robust regression gives exactly the same result.


-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





Re: help on factor analysis/non-normality

2002-03-01 Thread Rich Ulrich

On 1 Mar 2002 04:51:42 -0800, [EMAIL PROTECTED] (Mobile Survey)
wrote:

> What do i do if I need to run a factor analysis and have non-normal
> distribution for some of the items (indicators)? Does Principal
> component analysis require the normality assumption. 

There is no problem of non-normality, except that it *implies*
that decomposition  *might*  not give simple structures.
Complications are more likely when covariances are high.

What did you read, that you are trying to respond to?

>  Can I use GLS to
> extract the factors and get over the problem of non-normality. Please
> do give references if you are replying.
> Thanks.

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





Re: Applied analysis question

2002-03-01 Thread Brad Anderson

[EMAIL PROTECTED] (Eric Bohlman) wrote in message 
news:...
> Rolf Dalin <[EMAIL PROTECTED]> wrote:
> 
> IIRC, your example is exactly the sort of situation for which Tobit 
> modelling was invented.

Considered that (I actually estimated a couple of Tobit models, and if I
use a log-transformed or Box-Cox-transformed response the results are
consistent with the ordinal logit I originally described), but Tobit
assumes a normally distributed censored response -- the observed
distribution for the non-zero responses is not approximately normal
(even with transformations) and I don't think it's reasonable to
assume the errors are generated by an underlying Gaussian process.  My
understanding of the Tobit model is that it's not especially robust to
violations of this assumption.
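
(For what it's worth, a Tobit fit of that kind might look like this in R;
the AER package's tobit() wraps survreg(), and the variable names below
are invented, not the poster's actual code.)

  library(AER)    # tobit(), a wrapper around survival::survreg()

  # Hypothetical data: outcome left-censored at zero, one covariate.
  dat <- data.frame(times.dirty = c(0, 0, 1, 3, 7, 90, 250, 750),
                    x           = c(1, 2, 1, 3, 4, 6, 7, 9))

  fit <- tobit(times.dirty ~ x, left = 0, data = dat)
  summary(fit)    # note: normality of the latent error is assumed, as above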





Re: REML for Dummies?

2002-03-01 Thread John Uebersax

The Encyclopedia of Biostatistics (Armitage P, Colton T; Wiley,
1999?) has an article on REML.

I have not seen the article, but their articles usually explain
statistical concepts well to non-statisticians.

The Encyclopedia is a resource you might find helpful in general.  For
more info, see:

http://www.wiley.co.uk/wileychi/eob/


John Uebersax, PhD (858) 597-5571 
La Jolla, California   (858) 625-0155 (fax)
email: [EMAIL PROTECTED]

Statistics:  http://ourworld.compuserve.com/homepages/jsuebersax/agree.htm
Psychology:  http://members.aol.com/spiritualpsych


Dr Jonathan Newman <[EMAIL PROTECTED]> wrote:
> I'm trying to find a good introduction to REML (restricted maximum
> likelihood).





Re: Applied analysis question

2002-03-01 Thread Eric Bohlman

Rolf Dalin <[EMAIL PROTECTED]> wrote:
> Brad Anderson wrote:

>> I have a continuous response variable that ranges from 0 to 750.  I only
>> have 90 observations and 26 are at the lower limit of 0, 

> What if you treated the information collected by that variable as really
> two variables: one categorical variable indicating a zero or non-zero value.
> Then the remaining numerical variable could only be analyzed conditionally
> on the category being non-zero.

> In many cases when you collect data on consumers' consumption of
> some commodity, you would end up with a big number of them not
> using the product at all, while those who used the product would
> consume different amounts.

IIRC, your example is exactly the sort of situation for which Tobit 
modelling was invented.
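
A rough sketch of Rolf's two-part idea in R (illustrative only; the data
and names are made up): one model for zero vs. non-zero, and a second
model for the amount among the non-zero cases.

  # Hypothetical consumption-type outcome with many zeros.
  dat <- data.frame(y = c(0, 0, 0, 2, 5, 40, 300),
                    x = c(1, 2, 3, 2, 4, 6, 8))

  # Part 1: any use at all? (logistic regression on the zero/non-zero indicator)
  part1 <- glm(I(y > 0) ~ x, family = binomial, data = dat)

  # Part 2: how much, among users only? (here on the log scale, as one option)
  part2 <- lm(log(y) ~ x, data = subset(dat, y > 0))

  summary(part1)
  summary(part2)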






Re: detecting outliers in NON normal data ?

2002-03-01 Thread Erik-André Sauleau

But the Mahalanobis distance is sensitive to swamping and masking, so is it
really a good measure for outliers?

DELOMBA wrote in message ...
>What about Hat Matrix ? Mahalanobis distance ?
>
>Yves
>
>
>"Voltolini" <[EMAIL PROTECTED]> wrote in message
>news:00f301c1be68$13413000$fde9e3c8@oemcomputer...
>> Hi,
>>
>> I would like to know if methods for detecting outliers
>> using interquartil ranges are indicated for data with
>> NON normal distribution.









Re: REML for Dummies?

2002-03-01 Thread kjetil halvorsen

A good book is
Pinheiro, J.C. and Bates, D.M., "Mixed-Effects Models in S and S-PLUS",
Springer.

Kjetil Halvorsen

Dr Jonathan Newman wrote:
> 
> I'm trying to find a good introduction to REML (restricted maximum
> likelihood).  I'm a biologist rather than a statistician.  If you have any
> suggestions I'd great appreciate hearing them.  Thanks.
> --
> Dr Jonathan Newman
> St. Peter's College, New Inn Hall Street, Oxford  OX1 2DL  Tel. 01865
> 271278891  Fax. 01865 278855 or
> Department of Zoology, University of Oxford, South Parks Road, Oxford OX1
> 3PS  Tel. 01865 271279  Fax. 01865 271168
> [EMAIL PROTECTED]http://users.ox.ac.uk/~zool0264
> 





Re: REML for Dummies?

2002-03-01 Thread Anon.

Dr Jonathan Newman wrote:
> 
> I'm trying to find a good introduction to REML (restricted maximum
> likelihood).  I'm a biologist rather than a statistician.  If you have any
> suggestions I'd great appreciate hearing them.  Thanks.

Lynch & Walsh (1998)?  (Genetic Analysis of Quantitative Traits, Chapter
27).  I'm not sure how useful it is - I came via a different route. 
Alternatively, you could try the Genstat manuals.

Bob

-- 
Bob O'Hara
Metapopulation Research Group
Division of Population Biology
Department of Ecology and Systematics
PO Box 65 (Viikinkaari 1)
FIN-00014 University of Helsinki
Finland

!!!  Note: my address has changed.  So has my phone number, but I've no
idea what the new one is.
tel: +358 9 191 28779  mobile: +358 50 599 0540
fax: +358 9 191 57694   email: [EMAIL PROTECTED]
 is where it's not at

It is being said of a certain poet, that though he tortures the English
language, he has still never yet succeeded in forcing it to reveal his
meaning
- Beachcomber





Re: Applied analysis question

2002-02-28 Thread Rich Ulrich

On 27 Feb 2002 14:14:44 -0800, [EMAIL PROTECTED] (Dennis Roberts) wrote:

> At 04:11 PM 2/27/02 -0500, Rich Ulrich wrote:
> 
> >Categorizing the values into a few categories labeled,
> >"none, almost none, "  is one way to convert your scores.
> >If those labels do make sense.

> well, if 750 has the same numerical sort of meaning as 0 (unit wise) ... in 
> terms of what is being measured then i would personally not think so SINCE, 
> the categories above 0 will encompass very wide ranges of possible values
[ ... ]

Frankly, the question is about the meaning of the numbers,
and I would have to ask it.

I don't expect a bunch of zeros, with 3 as median, and 
values up to 750.  Numbers like that *might*  reflect,
say, the amount of gold detected in some assays.  
Then, you want to know the handful of locations with 
numbers near 750.  If any of the numbers at all are big
enough to be interesting.

Data like those are  *not*  apt to be congenial for taking means.  
And if 750 is meaningful, using ranks is apt to be nonsensical, too.


In this example, the median was 3.
Does *that*  represent a useful interval from 0?  - if so, *that* 
tells me scaling or scoring is probably not  well-chosen.

Is there a large range of 'meaning'  between 0 and non-zero?  
Is there a range of meaning concealed within zero?
"Zero children" as outcome of a marriage can reflect 
(a) a question being asked too early; 
(b) unfortunate happenstance; or 
(c) personal choice
 - categories, within 0, and none of them are necessarily
a good 'interval'  from the 1, 2, 3...  answers.  But that 
(further) depends on what questions are being asked.


-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





Re: Find PDF of RV with a given mean value

2002-02-28 Thread Herman Rubin

In article <[EMAIL PROTECTED]>,
Glen <[EMAIL PROTECTED]> wrote:
>"Chia C Chong" <[EMAIL PROTECTED]> wrote in message 
>news:...
>> Hi!

>> I have a set of random numbers and if I know their expectation/mean, would
>> it be possible to deduce a PDF to describe the distribution of them? 

>Knowing the mean tells you (almost) nothing about the form of the PDF.

Even knowing much more does not tell you that much.

While the normal distribution is determined by its
moments, and the CDF (much more stable than the PDF)
is .5 at the mean, the first 20 moments do not fix the
CDF at the mean to lie between 1/3 and 2/3.

-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054   FAX: (765)494-0558





Re: Means of semantic differential scales

2002-02-28 Thread Dennis Roberts

At 09:51 AM 2/28/02 -0800, Jay Tanzman wrote:


>I partially did this, insofar as I ran Pearson and Spearman correlations
>between several of the scales and, not surprisingly, the two correlation
>coefficients and their p-values were similar.

< that issue is entirely a separate one, since the rank-order FORMULA was
derived from the Pearson ...

>  Dr. Kim was not impressed.
>
>-Jay

i hate to ask this question but, what the heck, spring break is near so i will

if your boss, dr. kim??? ... seems so knowledgeable about what the data are
and what is and is not appropriate to do with the data, why is dr. kim not
doing the analysis?

this reminds me of assigning a task to someone and then doing so much
micro-managing that ... one would have been better off doing it oneself ...




Dennis Roberts, 208 Cedar Bldg., University Park PA 16802

WWW: http://roberts.ed.psu.edu/users/droberts/drober~1.htm
AC 8148632401






Re: Means of semantic differential scales

2002-02-28 Thread Jay Tanzman



> "Simon, Steve, PhD" wrote:
> 
> Jay Tanzman got chewed out by his boss for averaging a 7 point ordinal scale.
> Generally it is not a good idea to argue with your boss, but perhaps you might
> ask what was the grade point average that he or she received in college. When
> you hear the response, then ask if the grading scale A, B, C, D, F is ordinal
> or interval.

I'm going to ask him.

> A possible compromise is to model the data as if it were interval and then
> model it as if it were ordinal. If the two models are reasonably similar,
> good. If they differ, that is still good, as it allows you to then explore why
> the two models differ.

I partially did this, insofar as I ran Pearson and Spearman correlations between
several of the scales and, not surprisingly, the two correlation coefficients
and their p-values were similar.  Dr. Kim was not impressed.

-Jay





Re: Applied analysis question

2002-02-28 Thread Dennis Roberts

At 07:37 AM 2/28/02 -0800, Brad Anderson wrote:

>I think a lot of folks just run standard analyses or arbitrarily apply
>some "normalizing" transformation because that's what's done in their
>field.  Then report the results without really examining the
>underlying distributions.  I'm curious how folks proceed when they
>encounter very goofy distributions.  Thanks for your comments.

i think the lesson to be gained from this is that, we seem to be focusing 
on (or the message that students and others get) getting the analysis DONE 
and summarizied ... and with most standard packages ... that is relatively 
easy to do

for example, you talk about a simple regression analysis and then show them 
in minitab that you can do that like: mtb> regr 'height' 1 'weight' and, 
when they do it, lots of output comes out BUT, the first thing is the best 
fitting straight line equation like:

The regression equation is
Weight = - 205 + 5.09 Height

and THAT's where they start AND stop (more or less)

while software makes it rather easy to do lots of prelim inspection of 
data, it also makes it very easy to SKIP all that too

before we do any serious analysis ... we need to LOOK at the data ... 
carefully ... make some scatterplots (to check for outliers, etc.), to look 
at some frequency distributions ON the variables, to even just look at the 
means and sds ... to see if some serious restriction of range issue pops up 
...

THEN and ONLY then, after we get a feel for what we have ... THEN and ONLY 
then should we be doing the main part of our analysis ... ie, testing some 
hypothesis or notion WITH the data (actually, i might call the prelims the 
MAIN part but, others might disagree)

we put the cart before the horse ... in fact, we don't even pay any 
attention to the horse

unfortunately, far too much of this is "caused" by the dominant
preoccupation with doing "significance tests" ... so we run routines that
give us these "p values" and are done with it ... without paying ANY
attention to just looking at the data

my 2 cents worth




Dennis Roberts, 208 Cedar Bldg., University Park PA 16802

WWW: http://roberts.ed.psu.edu/users/droberts/drober~1.htm
AC 8148632401






Re: Applied analysis question

2002-02-28 Thread Brad Anderson

Rich Ulrich <[EMAIL PROTECTED]> wrote in message 
news:<[EMAIL PROTECTED]>...
> On 27 Feb 2002 11:59:53 -0800, [EMAIL PROTECTED] (Brad Anderson)
> wrote:
> 
> > I have a continuous response variable that ranges from 0 to 750.  I
> > only have 90 observations and 26 are at the lower limit of 0, which is
> > the modal category.  The mean is about 60 and the median is 3; the
> > distribution is highly skewed, extremely kurtotic, etc.  Obviously,
> > none of the power transformations are especially useful.  The product
> 
> I guess it is 'continuous'  except for having 26 ties at 0.  
> I have to wonder how that set of scores arose, and also, 
> what should a person guess about the *error*  associated
> with those:   Are the numbers near 750  measured with
> as much accuracy as the numbers near 3?

I should have been more precise.  It's technically a count variable
representing the number of times respondents report using dirty
needles/syringes after someone else had used them during the past 90
days.  Subjects were first asked to report the number of days they had
injected drugs, then the average number of times they injected on
injection days, and finally, on how many of those total times they had
used dirty needles/syringes.  All of the subjects are injection drug
users, but not all use dirty needles.  The reliability of reports near
0 is likely much better than the reliability of estimates near 750. 
Indeed, substantively, the difference between a 0 and a 1 is much more
significant than the difference between a 749 and a 750--0 represents
no risk, 1 represents at least some risk, and high values--regardless
of precision--represent high risk.
> 
> How do zero scores arise?  Is this truncation;  the limit of
> practical measurement;  or just what?

Zero scores are logical and represent no risk, negative values are not
logical.
> 
> "Extremely kurtotic," you say.  That huge lump at 0 and skew
> is not consistent with what I think of as kurtosis, but I guess
> I have not paid attention to kurtosis at all, once I know that
> skewness is extraordinary.

True, the kurtosis statistic exceeded 11, and a plot against the
normal indicates a huge lump in the low end of the tail, and also a
larger proportion of very high values than expected.
> 
> Categorizing the values into a few categories labeled, 
> "none, almost none, "  is one way to convert your scores.  
> If those labels do make sense.

Makes sense at the low end: 0 = no risk.  And at the high end I used 90+,
representing using a dirty needle/syringe once a day or more often.
The 2 middle categories were pretty arbitrary.

If I analyze a contingency table using the 4-category response and a
3-category measure of the primary covariate (categories defined using
"clinically meaningful" cutpoints), the association is quite strong;
I used the exact p-value associated with the CMH difference-in-row-means
test (using SAS) and the association is significant.  I also used
the 3-category predictor and the procedures outlined by Stokes et al.
(2000) to estimate a rank analysis of covariance--again with
consistent results.

I've also run a few other analyses I didn't describe.  I used the
Box-Cox procedure to find a power transformation.  Although the
skewness statistic then looks great, the distribution is still not
approximately normal.  However, a regression using the transformed
variable is consistent with the ordered logit and the contingency
table analysis.
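
(To illustrate the Box-Cox step -- not the actual SAS code used -- a
hypothetical R version with MASS::boxcox() might look like this.)

  library(MASS)    # boxcox()

  # Hypothetical strictly positive response; zeros would need a shift first.
  dat <- data.frame(y = c(1, 1, 2, 4, 8, 60, 300, 750),
                    x = c(1, 2, 1, 3, 4, 6, 7, 9))

  bc     <- boxcox(lm(y ~ x, data = dat), plotit = FALSE)
  lambda <- bc$x[which.max(bc$y)]          # lambda maximizing the likelihood

  dat$y.bc <- (dat$y^lambda - 1) / lambda  # Box-Cox transform (lambda != 0)
  summary(lm(y.bc ~ x, data = dat))        # regression on transformed response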

One of the other posters asked about the appropriate error term--I
guess that lies at the heart of my inquiry.  I have no idea what the
appropriate error term would be, nor how best to model such data.  I often
deal with similar response variables that have distributions in which
observations are clustered at 1 or both ends of the continuum.  In
most cases, these distributions are not even approximately 'unimodal
and a bit skewed'--the kind of variables for which normalizing power
transformations make sense.  Additionally, these typically aren't
outcomes that could be thought of as being generated by a Gaussian
process.

In some cases I think it makes sense to consider Poisson and
generalizations of Poisson processes, although there is clearly much
greater between-subject heterogeneity than assumed by a Poisson
process.  I estimated Poisson and negative binomial regression
models--there was compelling evidence that the Poisson was
overdispersed.  I also used a Vuong statistic to compare NB regression
with zero-inflated NB regression--the results support the
zero-inflated model.  The model standard errors for a zero-inflated
model are wildly different from the Huber-White sandwich robust
standard errors.  The latter give results that are fairly consistent
with the ordered logit; the model-based standard errors are
huge--given that these are asymptotic statistics and I have a
relatively small sample, I don't really trust either.
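
(A hedged R sketch of that sequence of fits, for anyone who wants to
reproduce it -- MASS for glm.nb(), pscl for zeroinfl() and vuong(); the
data below are simulated stand-ins, not the study data.)

  library(MASS)    # glm.nb(), rnegbin()
  library(pscl)    # zeroinfl(), vuong()

  # Simulate a zero-heavy count outcome with one covariate.
  set.seed(1)
  dat <- data.frame(x = rep(1:5, each = 8))
  dat$y <- ifelse(runif(40) < 0.4, 0,
                  rnegbin(40, mu = exp(0.5 * dat$x), theta = 1))

  pois <- glm(y ~ x, family = poisson, data = dat)
  nb   <- glm.nb(y ~ x, data = dat)
  zinb <- zeroinfl(y ~ x | 1, dist = "negbin", data = dat)

  # Crude overdispersion check for the Poisson fit, then the Vuong comparison.
  sum(residuals(pois, type = "pearson")^2) / pois$df.residual
  vuong(nb, zinb)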

I think a lot of folks just run standard analyses or arbitrarily apply
some "normalizing" transformation because that's what's done in their
field.  Then report the results without really examining the
underlying distributions.  I'm curious how folks proceed when they
encounter very goofy distributions.  Thanks for your comments.

RE: Means of semantic differential scales

2002-02-28 Thread Simon, Steve, PhD

Jay Tanzman got chewed out by his boss for averaging a 7 point ordinal scale. Generally it is not a good idea to argue with your boss, but perhaps you might ask what was the grade point average that he or she received in college. When you hear the response, then ask if the grading scale A, B, C, D, F is ordinal or interval.

A possible compromise is to model the data as if it were interval and then model it as if it were ordinal. If the two models are reasonably similar, good. If they differ, that is still good, as it allows you to then explore why the two models differ.
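
(A sketch of that compromise in R terms, purely illustrative: lm() treats the 7-point scale as interval, MASS::polr() fits a proportional-odds ordinal logit; the ratings below are simulated.)

  library(MASS)    # polr()

  # Simulated 7-point ratings for two groups.
  set.seed(2)
  dat <- data.frame(rating = sample(1:7, 60, replace = TRUE),
                    group  = gl(2, 30, labels = c("A", "B")))

  interval.fit <- lm(rating ~ group, data = dat)
  ordinal.fit  <- polr(factor(rating, ordered = TRUE) ~ group, data = dat)

  summary(interval.fit)
  summary(ordinal.fit)    # compare direction and strength of the group effect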

Steve Simon, [EMAIL PROTECTED], Standard Disclaimer.
The STATS web page has moved to
http://www.childrens-mercy.org/stats






Re: Means of semantic differential scales

2002-02-28 Thread J. Williams

On 27 Feb 2002 15:01:24 -0800, [EMAIL PROTECTED] (Dennis Roberts) wrote:

>At 01:39 PM 2/27/02 -0600, Jay Warner wrote:
>
>> > >
>> > >Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
>
>just out of curiosity ... how many consider the above to be an example of a 
>bipolar scale?
>
>i don't
>
>now, if we had an item like:
>
>sad  happy
>1  . 7
>
>THEN the mid point becomes much more problematic ...
>
>since being a 4 ... is neither a downer nor upper

The bipolar adjectives in Mr. Warner's example might be a tad "fuzzy"
IMHO.  What is a clear antonym for "stressful"?  "Pacified"?
"Carefree"?  I noted same in my original response to his query.  Your
item "sad...happy" appears more like what Osgood et al had in mind.
"Good...Bad," "Hot...Cold," for example, are clearcut bipolars.

If one wants to force an opinion one way or another, then display an
even numbered scale.  If the investigator wants the "neutral" opinion
then make the scale odd numbered.  To me the semantic differential  is
only a Likert Scale without the glitter :-))  I think his supervisor
more than likely, however, was concerned about computing means with
ordinal data. Perhaps,  arguments can be made  for both ordinal and
interval usage depending on the intent of the research.  Some semantic
differential instruments I have seen in the past have no printed
numerical scale at all.  The respondent places a check mark along a
horizontally gradated continuum.  The researcher then assigns an
appropriate score. vis a vis the check mark.  Usually bipolar
adjective items are randomly assigned, i.e., "good" responses are not
all on one side of the document.  Supposedly, the respondent can't
simply "halo" the concept being evaluated.  





Re: Means of semantic differential scales

2002-02-28 Thread Art Kendall

DMR, I should have read your previous posting more carefully.  I have now had
coffee.

>Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful

is a question that has an extent response format.  The cognitive schema the
response format tries to invoke might be reinforced by anchoring with zero for
not at all.
To me the low end is zero rather than anti-stressful.

In some fields the above might be used as an item in a scale.   As in your
example, the 16pf uses a series of items to produce bipolar scales.

Some concepts make no sense as bipolar scales.  Ability, achievement, etc.
have no cognitive opposites.  Even preferences and attitudes are not
necessarily measured with opposites.  Bem & associates made much of the fact
that adaptation to gender expectations should be represented with two
dimensions (analogous to longitude and latitude), so that it took 2 variables
to adequately represent that concept.  The degree of having attributes
popularly considered characteristic of masculinity was construed as orthogonal
to the degree of having attributes popularly considered characteristic of
femininity.

With regard to the original question, in my opinion, there is nothing
automatically incorrect about getting means on such variables.  If the purpose
is to compare groups, it is more important to be sure to use the same ruler,
than it is to worry whether it is a rubber ruler.

Dennis Roberts wrote:

> At 01:39 PM 2/27/02 -0600, Jay Warner wrote:
>
> > > >
> > > >Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
>
> just out of curiosity ... how many consider the above to be an example of a
> bipolar scale?
>
> i don't






Re: Means of semantic differential scales

2002-02-28 Thread Art Kendall

I would consider it a unipolar extent scale.  Maybe the visual anchor should be
0 to 6 to aid association with the number line concept.

Dennis Roberts wrote:

> At 01:39 PM 2/27/02 -0600, Jay Warner wrote:
>
> > > >
> > > >Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
>
> just out of curiosity ... how many consider the above to be an example of a
> bipolar scale?
>
> i don't
>
>






Re: Applied analysis question

2002-02-27 Thread Rolf Dalin

Brad Anderson wrote:

> I have a continuous response variable that ranges from 0 to 750.  I only
> have 90 observations and 26 are at the lower limit of 0, 

What if you treated the information collected by that variable as really
two variables: one categorical variable indicating a zero or non-zero value.
Then the remaining numerical variable could only be analyzed conditionally
on the category being non-zero.

In many cases when you collect data on consumers' consumption of
some commodity, you would end up with a big number of them not
using the product at all, while those who used the product would
consume different amounts.

Rolf Dalin
**
Rolf Dalin
Department of Information Technology and Media
Mid Sweden University
S-870 51 SUNDSVALL
Sweden
Phone: 060 148690, international: +46 60 148690
Fax: 060 148970, international: +46 60 148970
Mobile: 0705 947896, international: +46 70 5947896

mailto:[EMAIL PROTECTED]
http://www.itk.mh.se/~roldal/
**





Re: Statistics Tool For Classification/Clustering

2002-02-27 Thread Mark Harrison

Good places to start:

Optimal feature extractors, that's better than PCA because you whiten your
inter class scatter and so put all inter class comparisons on the same
level. The good thing is this will also reduce your feature vector
dimensionality to c-1 (where c is # classes). PCA will not do this.

Check the stats of each class: is it Gaussian or another known pdf?  If so,
apply a parametric classifier.

However, you are lucky if you get good classification after this, so you will
probably need non-linear, non-parametric classifiers.  Try k-nearest
neighbour, but that might take the age of the Universe, so use a condensing
algorithm first to get a smaller representative set.

Matlab is what I use for coding, there are a lot of free toolboxes around.
Mostly I write my own though.
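
For anyone not working in Matlab, a minimal R sketch of the
reduce-then-classify pipeline (illustrative only; prcomp() is plain PCA
rather than the class-aware extractor described above, and the "features"
here are simulated):

  library(class)    # knn()

  # Simulated stand-in: 60 "pieces" x 10 features, 3 classes.
  set.seed(42)
  X   <- matrix(rnorm(600), nrow = 60)
  cls <- gl(3, 20, labels = c("rock", "pop", "classical"))

  pc     <- prcomp(X, scale. = TRUE)     # unsupervised dimension reduction
  scores <- pc$x[, 1:2]                  # keep the first two components

  train <- seq(1, 60, by = 2)            # simple half/half split
  pred  <- knn(scores[train, ], scores[-train, ], cls[train], k = 3)
  table(pred, cls[-train])               # confusion matrix on the held-out half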

Best wishes

Andrew


"Rishabh Gupta" <[EMAIL PROTECTED]> wrote in message
news:a4eje9$ip8$[EMAIL PROTECTED].;
> Hi All,
> I'm a research student at the Department Of Electronics, University Of
> York, UK. I'm working a project related to music analysis and
> classification. I am at the stage where I perform some analysis on music
> files (currently only in MIDI format) and extract about 500 variables that
> are related to music properties like pitch, rhythm, polyphony and volume.
I
> am performing basic analysis like mean and standard deviation but then I
> also perform more elaborate analysis like measuring complexity of melody
and
> rhythm.
>
> The aim is that the variables obtained can be used to perform a number of
> different operations.
> - The variables can be used to classify / categorise each piece of
> music, on its own, in terms of some meta classifier (e.g. rock, pop,
> classical).
> - The variables can be used to perform comparison between two files. A
> variable from one music file can be compared to the equivalent variable in
> the other music file. By comparing all the variables in one file with the
> equivalent variable in the other file, an overall similarity measurement
can
> be obtained.
>
> The next stage is to test the ability of the of the variables obtained to
> perform the classification / comparison. I need to identify variables that
> are redundant (redundant in the sense of 'they do not provide any
> information' and 'they provide the same information as the other
variable')
> so that they can be removed and I need to identify variables that are
> distinguishing (provide the most amount of information).
>
> My Basic Questions Are:
> - What are the best statistical techniques / methods that should be
> applied here. E.g. I have looked at Principal Component Analysis; this
would
> be a good method to remove the redundant variables and hence reduce some
the
> amount of data that needs to be processed. Can anyone suggest any other
> sensible statistical anaysis methods?
> - What are the ideal tools / software to perform the clustering /
> classification. I have access to SPSS software but I have never used it
> before and am not really sure how to apply it or whether it is any good
when
> dealing with 100s of variables.
>
> So far I have been analysing each variable on its own 'by eye' by plotting
> the mean and sd for all music files. However this approach is not feasible
> in the long term since I am dealing with such a large number of variables.
> In addition, by looking at each variable on its own, I do not find
clusters
> / patterns that are only visible through multivariate analysis. If anyone
> can recommend a better approach I would be greatly appreciated.
>
> Any help or suggestion that can be offered will be greatly appreciated.
>
> Many Thanks!
>
> Rishabh Gupta
>
>







Re: Statistics Tool For Classification/Clustering

2002-02-27 Thread Mark Harrison

Correction (typo): should read 'whiten intra-class scatter'.

"Mark Harrison" <[EMAIL PROTECTED]> wrote in message
news:FIif8.16518$[EMAIL PROTECTED].;
> Good places to start:
>
> Optimal feature extractors, that's better than PCA because you whiten your
> inter class scatter and so put all inter class comparisons on the same
> level. The good thing is this will also reduce your feature vector
> dimensionality to c-1 (where c is # classes). PCA will not do this.
>
> Check the stats of each class, is it Gaussian or known pdf? Apply
> parameteric classifier if so.
>
> However you are lucky if you get good classification after this, so you
will
> probably need non linear, non parametric classifiers. Try K nearest
> neighobour, but that might take the age of the Universe so use a
condensing
> algorithm first to get a smaller representative set.
>
> Matlab is what I use for coding, there are a lot of free toolboxes around.
> Mostly I write my own though.
>
> Best wishes
>
> Andrew
>
>
> "Rishabh Gupta" <[EMAIL PROTECTED]> wrote in message
> news:a4eje9$ip8$[EMAIL PROTECTED].;
> > Hi All,
> > I'm a research student at the Department Of Electronics, University
Of
> > York, UK. I'm working a project related to music analysis and
> > classification. I am at the stage where I perform some analysis on music
> > files (currently only in MIDI format) and extract about 500 variables
that
> > are related to music properties like pitch, rhythm, polyphony and
volume.
> I
> > am performing basic analysis like mean and standard deviation but then I
> > also perform more elaborate analysis like measuring complexity of melody
> and
> > rhythm.
> >
> > The aim is that the variables obtained can be used to perform a number
of
> > different operations.
> > - The variables can be used to classify / categorise each piece of
> > music, on its own, in terms of some meta classifier (e.g. rock, pop,
> > classical).
> > - The variables can be used to perform comparison between two files.
A
> > variable from one music file can be compared to the equivalent variable
in
> > the other music file. By comparing all the variables in one file with
the
> > equivalent variable in the other file, an overall similarity measurement
> can
> > be obtained.
> >
> > The next stage is to test the ability of the of the variables obtained
to
> > perform the classification / comparison. I need to identify variables
that
> > are redundant (redundant in the sense of 'they do not provide any
> > information' and 'they provide the same information as the other
> variable')
> > so that they can be removed and I need to identify variables that are
> > distinguishing (provide the most amount of information).
> >
> > My Basic Questions Are:
> > - What are the best statistical techniques / methods that should be
> > applied here. E.g. I have looked at Principal Component Analysis; this
> would
> > be a good method to remove the redundant variables and hence reduce some
> the
> > amount of data that needs to be processed. Can anyone suggest any other
> > sensible statistical anaysis methods?
> > - What are the ideal tools / software to perform the clustering /
> > classification. I have access to SPSS software but I have never used it
> > before and am not really sure how to apply it or whether it is any good
> when
> > dealing with 100s of variables.
> >
> > So far I have been analysing each variable on its own 'by eye' by
plotting
> > the mean and sd for all music files. However this approach is not
feasible
> > in the long term since I am dealing with such a large number of
variables.
> > In addition, by looking at each variable on its own, I do not find
> clusters
> > / patterns that are only visible through multivariate analysis. If
anyone
> > can recommend a better approach I would be greatly appreciated.
> >
> > Any help or suggestion that can be offered will be greatly appreciated.
> >
> > Many Thanks!
> >
> > Rishabh Gupta
> >
> >
>
>







Re: Applied analysis question

2002-02-27 Thread Dennis Roberts

i thought of a related data situation ...but at the opposite end
what if you were interested in the relationship between the time it takes 
students to take a test AND their test score

so, you have maybe 35 students in your 1 hour class that starts at 9AM ...

you decide to note (by your watch) the time they turn in the test ... and 
about 9:20 the first person turns it in ... then 9:35 the second  9:45 
the 3rd  9:47 the 4th ... and then, as you get to 10, when the time 
limit is up ... the rest sort of come up to the desk at the same time

for about 1/2 of the students, you can pretty accurately write down the 
time ... but, as it gets closer to the time limit, you have more of a 
(literal) rush  and, at the end ... you probably put down the same time 
on the last 8 students

you could decide just to put the order of the answer sheet as it sits in 
the pile ... or, you might collapse the set to 3 groupings ... quick turner 
iners, middle time turner iners ... and slow turner iners BUT, this clouds 
the data

here we have a situation where the BIG times have lots of the n ... where 
there are widely scattered (but infrequent) short times ... if you have 
time on the baseline, it is radically NEG skewed

better ways to record the times do not really solve this even if you have a 
time stamper like i used to have to used when punching my time card on 
coming into and leaving work

it's a conundrum for sure

At 10:17 AM 2/28/02 +1100, Glen Barnett wrote:
>Brad Anderson wrote:
> >
> > I have a continuous response variable that ranges from 0 to 750.  I
> > only have 90 observations and 26 are at the lower limit of 0, which is
> > the modal category.
>
>If it's continuous, it can't really have categories (apart from those
>induced by recording the variable to some limited precision, but people
>don't generally call those categories).
>
>The fact that you have a whole pile of zeros makes it mixed rather than
>continuous, and the fact that you say "category" makes it sound purely
>discrete.
>
>Glen
>
>

Dennis Roberts, 208 Cedar Bldg., University Park PA 16802

WWW: http://roberts.ed.psu.edu/users/droberts/drober~1.htm
AC 8148632401






Re: Applied analysis question

2002-02-27 Thread Glen Barnett

Brad Anderson wrote:
> 
> I have a continuous response variable that ranges from 0 to 750.  I
> only have 90 observations and 26 are at the lower limit of 0, which is
> the modal category.  

If it's continuous, it can't really have categories (apart from those
induced by recording the variable to some limited precision, but people
don't generally call those categories).

The fact that you have a whole pile of zeros makes it mixed rather than
continuous, and the fact that you say "category" makes it sound purely 
discrete.

Glen





Re: Means of semantic differential scales

2002-02-27 Thread Dennis Roberts

At 01:39 PM 2/27/02 -0600, Jay Warner wrote:

> > >
> > >Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful

just out of curiosity ... how many consider the above to be an example of a 
bipolar scale?

i don't

now, if we had an item like:

sad  1 ...... 7  happy

THEN the mid point becomes much more problematic ...

since being a 4 ... is neither a downer nor upper

now, a quick search found info from ncs about the 16pf personality scale 
... it shows 16 BIpolar dimensions as:

Bipolar Dimensions of Personality
Factor A Warmth (Cool vs Warm)
Factor B Intelligence (Concrete Thinking vs Abstract Thinking)
Factor C Emotional Stability (Easily Upset vs Calm)
Factor E Dominance (Not Assertive vs Dominant)
Factor F Impulsiveness (Sober vs Enthusiastic)
Factor G Conformity (Expedient vs Conscientious)
Factor H Boldness (Shy vs Venturesome)
Factor I Sensitivity (Tough-Minded vs Sensitive)
Factor L Suspiciousness (Trusting vs Suspicious)
Factor M Imagination (Practical vs Imaginative)
Factor N Shrewdness (Forthright vs Shrewd)
Factor O Insecurity (Self-Assured vs Self-Doubting)
Factor Q1 Radicalism (Conservative vs Experimenting)
Factor Q2 Self-Sufficiency (Group-Oriented vs Self-Sufficient)
Factor Q3 Self-Discipline (Undisciplined vs Self-Disciplined)
Factor Q4 Tension (Relaxed vs Tense)

let's take the one ... shy versus venturesome ...

now, we could make a venturesome scale by itself ...

0 = no venturesomeness ... (up to) ... 7 = very venturesome

does 0 = shy?  seems like if the answer is no ... then we might have a
bipolar scale ... if the answer is yes ... then we don't




Dennis Roberts, 208 Cedar Bldg., University Park PA 16802

WWW: http://roberts.ed.psu.edu/users/droberts/drober~1.htm
AC 8148632401






Re: Applied analysis question

2002-02-27 Thread Dennis Roberts

At 04:11 PM 2/27/02 -0500, Rich Ulrich wrote:

>Categorizing the values into a few categories labeled,
>"none, almost none, "  is one way to convert your scores.
>If those labels do make sense.

well, if 750 has the same numerical sort of meaning as 0 (unit wise) ... in 
terms of what is being measured then i would personally not think so SINCE, 
the categories above 0 will encompass very wide ranges of possible values

if the scale was # of emails you look at in a day ... and 1/3 said none or 
0 ... we could rename the scale 0 = not any, 1 to 50 as = some, and 51 to 
750 as = many (and recode as 1, 2, and 3) .. i don't think anyone who just 
saw the labels ... and were then asked to give some extemporaneous 'values' 
for each of the categories ... would have any clue what to put in for the 
some and many categories ... but i would predict they would seriously 
UNderestimate the values compared to the ACTUAL responses

this just highlights that for some scales, we have almost no 
differentiation at one end where they pile up ... perhaps (not saying one 
could have in this case) we could have anticipated this ahead of time and 
put scale categories that might have anticipated that

after the fact, we are more or less dead ducks

i would say this though ... treating the data only in terms of ranks ... 
does not really solve anything ... and clearly represents being able to say 
LESS about your data or interrelationships (even if the rank order r is .3 
compared to the regular pearson of about 0) ... than if you did not resort 
to only thinking about the data in rank terms




>--
>Rich Ulrich, [EMAIL PROTECTED]
>http://www.pitt.edu/~wpilib/index.html
>
>

Dennis Roberts, 208 Cedar Bldg., University Park PA 16802

WWW: http://roberts.ed.psu.edu/users/droberts/drober~1.htm
AC 8148632401






Re: Means of semantic differential scales

2002-02-27 Thread Jay Warner

I am humbled by the insight & background knowledge expressed by Messrs.
Williams and McLean, not to mention the string of others.  My lack of
academic experience in the subject matter is painfully clear.  Now to see if
I can find Osgood et al.  When I consider how many research projects and
social/political actions depend on survey responses for their information,
the need for this level of 'prethinking' becomes all the more necessary.

Jay

"J. Williams" wrote:

> On Mon, 25 Feb 2002 15:17:55 -0800, Jay Tanzman <[EMAIL PROTECTED]>
> wrote:
>
> >I just got chewed out by my boss for modelling the means of some 7-point
> >semantic differential scales.  The scales were part of a written,
> >self-administered questionnaire, and were laid out like this:
> >
> >Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
> >
> >So, why or why not is it kosher to model the means of scales like this?
> >
> >-Jay
>
> You can check it out by reading the pioneers of the semantic
> differential scale.  Osgood, Suci, and Tannenbaum are the authors of
> "Measurement of Meaning"  which now is published in paperback by the
> University of Illinois Press, Oct. 1990.  It may be your boss is a
> stickler on what constitutes a true interval scale.  It could be
> he/she wants no middle value score - that way respondents must tilt
> toward a yea or nay.  It could be the use of the particular bipolars
> "not stressful" and "very stressful."  Why not use stressful and not
> stressful?   What is "very" stressful?  By reading the Osgood et al
> text, you can find many nifty ideas and variations for using the
> semantic differential scale.  Like the Likert Scale, I suppose it is
> arguably an ordinal scale.  But, there are lots of statistical tools
> you could employ using rankings, medians, etc.  Like the Likert Scale
> devotees,  there are those who nevertheless use means as the measure
> of central tendency with semantic differential instruments.  Good
> luck.
>
> =
> Instructions for joining and leaving this list, remarks about the
> problem of INAPPROPRIATE MESSAGES, and archives are available at
>   http://jse.stat.ncsu.edu/
> =

--
Jay Warner
Principal Scientist
Warner Consulting, Inc.
 North Green Bay Road
Racine, WI 53404-1216
USA

Ph: (262) 634-9100
FAX: (262) 681-1133
email: [EMAIL PROTECTED]
web: http://www.a2q.com

The A2Q Method (tm) -- What do you want to improve today?






=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Applied analysis question

2002-02-27 Thread Rich Ulrich

On 27 Feb 2002 11:59:53 -0800, [EMAIL PROTECTED] (Brad Anderson)
wrote:

> I have a continuous response variable that ranges from 0 to 750.  I
> only have 90 observations and 26 are at the lower limit of 0, which is
> the modal category.  The mean is about 60 and the median is 3; the
> distribution is highly skewed, extremely kurtotic, etc.  Obviously,
> none of the power transformations are especially useful.  The product

I guess it is 'continuous'  except for having 26 ties at 0.  
I have to wonder how that set of scores arose, and also, 
what should a person guess about the *error*  associated
with those:   Are the numbers near 750  measured with
as much accuracy as the numbers near 3?

How do zero scores arise?  Is this truncation;  the limit of
practical measurement;  or just what?

"Extremely kurtotic," you say.  That huge lump at 0 and skew
is not consistent with what I think of as kurtosis, but I guess
I have not paid attention to kurtosis at all, once I know that
skewness is extraordinary.

Categorizing the values into a few categories labeled, 
"none, almost none, "  is one way to convert your scores.  
If those labels do make sense.
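
A minimal sketch of such a recoding in Python (numpy assumed; the cut points and labels here are illustrative assumptions, not values taken from the study):

import numpy as np

counts = np.array([0, 0, 1, 3, 7, 45, 120, 750])    # hypothetical responses
edges = [1, 2, 4, 10, 90]                           # assumed category boundaries
labels = ["none", "once", "2-3", "4-9", "10-89", "90 or more"]

cats = np.digitize(counts, edges)                   # 0 = "none", ..., 5 = "90 or more"
for c, k in zip(counts, cats):
    print(c, "->", labels[k])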

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: CRIMCOORD transformation in QUEST

2002-02-27 Thread Paul Thompson

That is either sloppiness in writing or reliance on the relationship 
between eigen decomposition and SVD.

SSM - square symmetric matrix
AM - arbitrary matrix

In ED, SSM = Q E Q'
In SVD, AM = P D Q'

SSM = AM' AM = (P D Q')' (P D Q')
= Q D P' P D Q' = Q D D Q'
= Q E Q', if E = D D
I haven't checked the above carefully, but it is pretty close to accurate.  You 
may need to throw in a division by n.
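
A quick numerical check of that relationship (Python/numpy assumed; the matrix is random, not the GFU matrix of the paper): the eigenvalues of AM'AM are the squared singular values of AM, and the leading eigenvector of AM'AM matches the leading right singular vector of AM up to sign.

import numpy as np

rng = np.random.default_rng(0)
AM = rng.normal(size=(20, 5))                        # arbitrary matrix
SSM = AM.T @ AM                                      # square symmetric matrix

P, d, Qt = np.linalg.svd(AM, full_matrices=False)    # AM  = P D Q'
evals, Q = np.linalg.eigh(SSM)                       # SSM = Q E Q'

print(np.allclose(np.sort(evals), np.sort(d**2)))    # E = D D
print(np.allclose(np.abs(Q[:, -1]), np.abs(Qt[0])))  # same leading vector, up to sign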

David Chang wrote:

>Hi, thank you for reading this message. I have the following problems in
>getting the "correct" CRIMCOORD transformation of categorical variables
>in QUEST decision tree algorithm. Your help will be greatly appreciated.
>
>Q1: In Loh & Shih's paper (Split Selection Models for Classification
>Trees, Statistica Sinica, 1997, vol 7, p815-840), they mentioned about
>the mapping from categorical variable to ordered variable via CRIMCOORD.
>But, their explanation, in particular, step 5 of algorithm 2 is not
>clear. For example, they wrote "Perform a singular value decomposition
>of the matrix GFU and let a (vector) be the eigenvector (of what?)
>associated with the largest eigenvalue" in step 5. Does this mean
>a(vector) is the eigenvector of transpose(GFU)*GFU?
>
>Q2.
>I tried to verify the data sets in Table 1. Data set I-III are OK. But,
>the result for data set IV seems to be incorrect. Could any one of you
>help me verify that?
>
>Thank you very much for your help !!
>
>David
>



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: detecting outliers in NON normal data ?

2002-02-27 Thread DELOMBA

What about Hat Matrix ? Mahalanobis distance ?

Yves


"Voltolini" <[EMAIL PROTECTED]> wrote in message
news:00f301c1be68$13413000$fde9e3c8@oemcomputer...
> Hi,
>
> I would like to know if methods for detecting outliers
> using interquartil ranges are indicated for data with
> NON normal distribution.
>
> The software "Statistica" presents this method:
> data point value > UBV + o.c.*(UBV - LBV)
> data point value < LBV - o.c.*(UBV - LBV)
>
> where UBV is the 75th percentile and LBV is the 25th percentile; o.c. is
> the outlier coefficient.
>
> In the biological world many data are not normally distributed and tests
> like Rosner, Dixon and Grubbs (if I am right!) are good just for normally
> distributed data.
>
>
> Can anyone help me?
>
>
> Thanks..
>
>
>
> _
> Prof. J. C. Voltolini
> Grupo de Estudos em Ecologia de Mamiferos - ECOMAM
> Universidade de Taubate - Depto. Biologia
> Praca Marcellino Monteiro 63, Bom Conselho,
> Taubate, SP - BRASIL. 12030-010
>
> TEL: 0XX12-2254165 (lab.), 2254277 (depto.)
> FAX: 0XX12-2322947
> E-Mail: [EMAIL PROTECTED]
> http://www.mundobio.rg3.net/
> 
>
>
>
> =
> Instructions for joining and leaving this list, remarks about the
> problem of INAPPROPRIATE MESSAGES, and archives are available at
>   http://jse.stat.ncsu.edu/
> =




=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=




Re: Find PDF of RV with a given mean value

2002-02-26 Thread Glen

"Chia C Chong" <[EMAIL PROTECTED]> wrote in message 
news:...
> Hi!
> 
> I have a set of random numbers and if I know their expectation/mean, would
> it be possible to deduce a PDF to describe the distribution of them? 

Knowing the mean tells you (almost) nothing about the form of the PDF.

However, if you are considering a particular family of PDFs (for
whatever reason), it should usually be possible to specify the mean
(in some cases fixing a parameter, in other cases introducing an
equation relating the parameters, so that you can reduce the dimension
of the parameter vector by 1).
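
A minimal sketch of that idea (Python/numpy assumed; the exponential family and the mean value of 60 are illustrative assumptions): committing to a one-parameter family lets the known population mean fix the parameter, while the sample mean of any finite simulation still fluctuates around it.

import numpy as np

rng = np.random.default_rng(42)
known_mean = 60.0                       # hypothetical known population mean
for n in (10, 100, 10_000):
    sample = rng.exponential(scale=known_mean, size=n)   # exponential: mean = scale
    print(n, round(sample.mean(), 2))   # hovers near 60, exactly 60 almost never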

> How do
> I make sure that when I generating these random numbers using the PDF I
> obtained, it will give me the correct mean/expectation value?

It depends on what you mean here - you must be careful to distinguish
between the population mean (which you say is known) and the sample
mean.

If you mean make it so you are generating from a distribution which
has the correct population mean, that's taken care of above.

If you mean generate so the sample mean is equal to the population
mean, why would you want to do that?

Consider the mean from n rolls of a (hypothetical) fair six-sided die
numbered 1 to 6. If it really is fair, I *know* the population mean is
3.5. Yet the sample mean is almost never 3.5, even though I know the
population mean exactly. If I wanted to simulate rolls from this die,
I would not try to make the sample mean 3.5.

Think on this: Let's assume I want a sample of size 1. To make it have
the known mean I have to set it equal to the known mean. Does it come
from the right distribution? Not at all! It comes from a distribution
with all the probability at the known mean. Now I want to enlarge the
sample by adding a second observation. What value will that have? As I
keep adding to my sample, I have to keep generating the same value
over and over.

(There may be some reason you want to generate in such a way that the
sample mean is constant, but I doubt it - and you won't be able to
have independent observations if you do.)

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Means of semantic differential scales

2002-02-26 Thread Alan McLean



Jay Tanzman wrote:
> 
> Jay Warner wrote:
> >
> > Jay Tanzman wrote:
> >
> > > I just got chewed out by my boss for modelling the means of some 7-point
> > > semantic differential scales.  The scales were part of a written,
> > > self-administered questionnaire, and were laid out like this:
> > >
> > > Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
> > >
> > > So, why or why not is it kosher to model the means of scales like this?
> > >
> > > -Jay
> 
> My boss's objection was that he believes "categorically" (sorry) that semantic
> differential scales are ordinal.
> 
> > 1)Why do you think the scale is interval data, and not ordinal or
> > categorical?
> 
> Why would anyone think it is ordinal and not interval?  Most of the scales were
> measuring abstract, subjective constructs, such as empathy and satisfaction, for
> which there is no underlying physical or biological measurement.  Why not, then,
> _define_ degree of empathy as the subjects' rating on a 1-to-7 scale?
> 

Why not indeed?! Of course you can do this - and in fact you are doing
this. The question is really - what properties should this variable
possess in order that it is meaningful - that is, that it reflects
'reality' meaningfully. If it does not do this, then whatever
conclusions you come to about your variable are of no use whatsoever.

It is certainly true that your variable is ordinal. Is it more than
this? It is extremely unlikely that it is fully numeric (that is,
'interval') because the difference between 1 and 2 is unlikely to have
the same meaning as the difference between 4 and 5. You cannot simply
define these differences to be equal - you need your variable to reflect
reality! However, it is probable that the scale is 'reasonably numeric',
so the assumption that the variable is interval may be reasonable. But
this will be a model, using a number of assumptions - as all these
things are. 

It is important that you recognise this modelling aspect of your data
definition.

Regards,
Alan





-- 
Alan McLean ([EMAIL PROTECTED])
Department of Econometrics and Business Statistics
Monash University, Caulfield Campus, Melbourne
Tel:  +61 03 9903 2102    Fax: +61 03 9903 2007



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: What is a qualitative ordinal variable?

2002-02-26 Thread Art Kendall

part 2.
Ordinal variables come in two flavors. In purely ordinal data the number of
distinct values is pretty much the same as the number of cases.  In ordered
category variables, there are only a few values that a variable may take, but
the intervals are perceived as very different.

Interval variables  have ordered values and the intervals between values are not
severely discrepant from each other.

Ratio variables have some kind of meaningful zero point.

I would usually consider months interval data if considering more than a few.

Level of aggression, or seed size, I would consider interval level if there were
four or more categories and care had been taken to make the intervals equal
appearing (e.g., anchoring with numeric stimulus labels.)

I would tend to consider Likert-scales variables as interval especially if they
were to be used  in a summative scale.

Poorly written questions can often necessitate treating variables as lower
levels of measurement.

It is possible, using methods of psychophysics, to evaluate how people use a
particular response scale.

Dual Scaling can be used to evaluate how much difference it makes to assume
different levels of measurement.

Voltolini wrote:

> Hi,
>
> I have a doubt about ordinal variables !
>
> I understand that months (jan., feb., mar.) and level of aggression (low,
> medium, high) can be accepted as qualitative ordinal variables but
> my doubt is.
>
> What about variables like seed size when using categories like small, medium
> and large or... level of mutation as rare and frequent ? Is these variables
> qualitative ? May I use these cases as examples of qualitative and ordinal
> variables ?
>
> I am in doubt because the size of seeds or the frequency of mutations are
> measurements and counts !
>



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: What is a qualitative ordinal variable?

2002-02-26 Thread Art Kendall

part 1.
A lot depends on your discipline.  Since the issue of level of measurement arose
in the 60's and early 70's there have been different viewpoints.

For some qualitative data means textual or pictorial information.
For some it means nominal level data.
For some it means variables that the measurement system allows to have only a
few specific values.
Categorical data is sometimes used synonymously with qualitative.

I tend to think of levels of measurement in an expanded version of Stevens's
schema.

Nominal level variables have names but no ordering: city, town, genus, school,
etc.
They can be changed into a vector of dichotomies (e.g., dummy variables).

Dichotomies may be the only purely interval data in the social sciences. Since
there is only one interval, all intervals are perfectly equal to each other.



Voltolini wrote:

> Hi,
>
> I have a doubt about ordinal variables !
>
> I understand that months (jan., feb., mar.) and level of aggression (low,
> medium, high) can be accepted as qualitative ordinal variables but
> my doubt is.
>
> What about variables like seed size when using categories like small, medium
> and large or... level of mutation as rare and frequent ? Are these variables
> qualitative ? May I use these cases as examples of qualitative and ordinal
> variables ?
>
> I am in doubt because the size of seeds or the frequency of mutations are
> measurements and counts !
>



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Means of semantic differential scales

2002-02-26 Thread Rich Ulrich

> 
> > 2. Perhaps more likely, your boss may have learned
> > (wrongly?) that parametric stats should not be done unless scales
> > of measurement are at least interval in quality.
> 
> I don't know if his objection was to parametric statistics per se, but he did
> object to calculating means on these data, which he believes are only ordinal.
> 
> > Search on google
> > for people like John? Gaito and S.S. Stevens and for phrases like
> > "scales of measurement" and "parametric statistics."
> 
> Thanks.  Will do.
> 

Or,  do an Advanced search with  groups.google  
among the  sci.stat.*   groups for < Stevens, measurement >.
I think that would find earlier discussions and some references.
As I recall it, no one who pretended to know much would have
sided with your boss.

The firmness of Stevens's  categories was strongly challenged 
by the early 1950s.  In particular, there was Frederick Lord's 
ridiculing parable of the football jerseys.   (Naturally, psychology
departments taught the subject otherwise, for quite a while longer.)

Conover, et al., took a lot of the glory out of 'nonparametric tests'
by showing that you can't gain much from rank-order 
transformations, compared to any decent scaling.  That was 
in an article of 1980 or thereabouts.

I may have seen a 'research manual' dated as recently as 1985
that still  favored using rank-statistics with Likert-scaled items.  
I am curious as to what more recent endorsements might exist,  
in any textbooks at all, or in papers by statisticians.

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: What is an outlier ?

2002-02-26 Thread Jon Cryer

and with bivariate data, neither component need be high or low!

Jon Cryer

At 12:14 PM 2/25/2002 -0700, you wrote:
>Of course it can be. An outlier is any value that is not usual for your data
>set.
>"Voltolini" <[EMAIL PROTECTED]> wrote in message
>news:002f01c1be21$65913d60$0fe9e3c8@oemcomputer...
> > Hi,
> >
> >
> > My doubt is ... an outlier can be a LOW data value in the sample (and not
> > just the highest) ?
> >
> > Several text books don't make this clear !!!
> >
> >
> > Thanks
> >
> >
> > V.
> >
> >
> >
> > =
> > Instructions for joining and leaving this list, remarks about the
> > problem of INAPPROPRIATE MESSAGES, and archives are available at
> >   http://jse.stat.ncsu.edu/
> > =
>
>
>
>
>=
>Instructions for joining and leaving this list, remarks about the
>problem of INAPPROPRIATE MESSAGES, and archives are available at
>   http://jse.stat.ncsu.edu/
>=




=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Means of semantic differential scales

2002-02-26 Thread Dennis Roberts

At 08:18 AM 2/26/02 -0800, Jay Tanzman wrote:
> > >
> > > Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful

these contain more information than simply ordinality ... they give you 
some indication of amount of stress too

differentiate this sort of item and response from:

rank order your preferences for the following foods:

steak ___ ... 1
veal ___ ... 2
chicken ___ ... 4
fish ___ ... 5
pork ___ ... 3

and, assume it says to put 1 for the top 1 ... and 5 for the low one

so, i do as above

both CAN be thought of as ordering scales ... but, there is definitely MORE 
information in the not stressful to very stressful item and responses

the end points of the 1 to 7 scale DO have meaning ... in terms of ABSOLUTE 
quantities
that is not so for the food orderings ... can we infer that i don't like 
fish since i ranked it 5 and DO like steak since i ranked it one??? NOT 
necessarily

there is a fundamental difference in the information you can extract from 
each of the examples above

i see nothing inherently wrong with finding means on items like the stress 
item ... since means close to 1 or 7 ... do have some underlying referent 
to quantity of stress ... one cannot say that about the food preferences in 
terms of some underlying absolute liking or disliking of the foods


Dennis Roberts, 208 Cedar Bldg., University Park PA 16802

WWW: http://roberts.ed.psu.edu/users/droberts/drober~1.htm
AC 8148632401



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Means of semantic differential scales

2002-02-26 Thread Dennis Roberts

i think we are all missing the main point

if you have a number of these items where, your goal (perhaps) is to SUM 
them together in some way ... where one end represents low amounts of the 
"thing" presented and the other end represents large amounts of the thing 
presented ... then ACROSS items ... the issue is do Ss tend to respond at 
the low end or the high end?

i really don't care if the exact scale IS interval or interpreted by Ss as 
such ... the main thing is how do they respond across a set of items?

whether or not these data or scales are interval, the MEAN has 
meaning ... excuse the pun ... i am willing to bet that those Ss who 
produce mean values close to 1 below are not experiencing any serious 
stress ... whereas those Ss whose means are close to 6 or 7 ... are

now, does that mean i know precisely what they are thinking/feeling? of 
course not but, it is plenty good enough to get a good idea of variation 
across Ss on these items or dimensions

i really don't see what the big fuss is

At 08:10 AM 2/26/02 -0800, Jay Tanzman wrote:


>Jay Warner wrote:
> >
> > Jay Tanzman wrote:
> >
> > > I just got chewed out by my boss for modelling the means of some 7-point
> > > semantic differential scales.  The scales were part of a written,
> > > self-administered questionnaire, and were laid out like this:
> > >
> > > Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful

Dennis Roberts, 208 Cedar Bldg., University Park PA 16802

WWW: http://roberts.ed.psu.edu/users/droberts/drober~1.htm
AC 8148632401



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Means of semantic differential scales

2002-02-26 Thread Jay Tanzman



Jay Warner wrote:
> 
> Jay Tanzman wrote:
> 
> > I just got chewed out by my boss for modelling the means of some 7-point
> > semantic differential scales.  The scales were part of a written,
> > self-administered questionnaire, and were laid out like this:
> >
> > Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
> >
> > So, why or why not is it kosher to model the means of scales like this?
> >
> > -Jay

My boss's objection was that he believes "categorically" (sorry) that semantic
differential scales are ordinal.

> 1)Why do you think the scale is interval data, and not ordinal or
> categorical?

Why would anyone think it is ordinal and not interval?  Most of the scales were
measuring abstract, subjective constructs, such as empathy and satisfaction, for
which there is no underlying physical or biological measurement.  Why not, then,
_define_ degree of empathy as the subjects' rating on a 1-to-7 scale?

> If interval, the increments between the levels are more or
> less equal.  If ordinal we know they are sequential, but have no idea how
> far apart each pair is.  Categorical means there is no relationship between
> them - 4 is not greater than 3 - it's only different.
> 
> Some people use a response of 4 to mean 'no response' as well as 'no
> opinion' and 'neutral opinion.'  sorry, these are not intervals.
> 
> 2)Is it possible for a respondent to come back with 2.5?  If so, they
> think it is interval data, regardless of your opinion.  Would you throw out
> a response of 2.5, or would you enter it in your dataset as 2.5?  If the
> latter, you think it is interval, also.

An obscure corollary to the Law of Large Numbers is that, in a self-administered
questionnaire, the probability that some individual will either write in
some-number-point-five (or, equivalently, check two adjacent numbers) approaches
1 as N increases without bound.  I would have no theoretical objection to them
doing that on this survey.

> 3)What makes you think the scale is linear (equal intervals)?

My boss's argument that it is not interval is that subjects don't necessarily
treat it that way.  That is, they don't treat the difference between 1 and 2 as
the same as, say, between 3 and 4.  My feeling is that there is no natural unit
of, say, satisfaction, so why not define a unit of satisfaction as the rating on
the scale.

> It ain't
> - since respondents can't go below 1 or above 7 .  Well, maybe 0 and 8, but
> the point is the same.  If you must, make a transformation (arc-sine for
> starters) to make it more 'linear' and more likely to contain Normal dist.
> data.

The scale can have limits and still be interval.  The amount of water in an 8
oz. glass is constrained to be between 0 and 8, but ounces of water in the glass
would still be interval data.

> 4)Why might the respondents use the same increments that you think
> exist, or the same as other respondents?  If there is some way you can
> 'anchor' at end points or mid point, you will get much more informative
> data.  I mean, what is 'very stressful' to you?  To me?  to anyone?

I don't think it matters.  What is 'very stressful' to the individual respondent
is what is important.  For one thing, we were testing hypotheses about the
effects of alternative programs on these subjective outcomes.  As long as there
was no association between how respondents interpret the scales and which
program they attended, I don't see how differences in scale interpretation could
affect the results; there would be no confounding.

> 5)In cases where I have been able to anchor firmly, and in some where I
> haven't, I find that treating the scale as incremental data work just fine,
> thank you. 

I agree.  Assuming that the data, which consist of the numbers 1 to 7, are
interval in the absence of evidence to the contrary seems like a pretty mild
assumption to me.  Furthermore, even if they are not interval, treating them as
such would seem unlikely to cause any great bias in the results.

> As soon as you compute an average of responses on this scale,
> you have done just that.  If you restrict yourself to categorical analysis
> for frequencies between categories, you have avoided that assumption.  And
> you have far less to say about the data, as well.

Treating this data as categorical would have led to very sparse data.  Ordinal
logistic regression would have been messy because I would have had to collapse
categories, and this defeats the purpose of having the categories in the first
place.  Treating the data as interval allowed me to evaluate the treatments and
their interactions using multiple linear regression; possibly I could
have done this on the ranks of the data as well, though I didn't see any
advantage in doing so.

-Jay


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=

Re: Means of semantic differential scales

2002-02-26 Thread Jay Tanzman



jim clark wrote:
> 
> Hi
> 
> On Mon, 25 Feb 2002, Jay Tanzman wrote:
> 
> > I just got chewed out by my boss for modelling the means of some 7-point
> > semantic differential scales.  The scales were part of a written,
> > self-administered questionnaire, and were laid out like this:
> >
> > Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
> >
> > So, why or why not is it kosher to model the means of scales like this?

[snip]

> 2. Perhaps more likely, your boss may have learned
> (wrongly?) that parametric stats should not be done unless scales
> of measurement are at least interval in quality.

I don't know if his objection was to parametric statistics per se, but he did
object to calculating means on these data, which he believes are only ordinal.

> Search on google
> for people like John? Gaito and S.S. Stevens and for phrases like
> "scales of measurement" and "parametric statistics."

Thanks.  Will do.

-Jay


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Means of semantic differential scales

2002-02-26 Thread Jay Tanzman



"J. Williams" wrote:
> 
> On Mon, 25 Feb 2002 15:17:55 -0800, Jay Tanzman <[EMAIL PROTECTED]>
> wrote:
> 
> >I just got chewed out by my boss for modelling the means of some 7-point
> >semantic differential scales.  The scales were part of a written,
> >self-administered questionnaire, and were laid out like this:
> >
> >Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
> >
> >So, why or why not is it kosher to model the means of scales like this?
> >
> >-Jay
> 
> You can check it out by reading the pioneers of the semantic
> differential scale.  Osgood, Suci, and Tannenbaum are the authors of
> "Measurement of Meaning"  which now is published in paperback by the
> University of Illinois Press, Oct. 1990.

Thanks.  I'll do that.  I think one of the above authors also has a website,
though, yesterday it crashed my Browser.  Then again, my browser was Netscape...

> It may be your boss is a
> stickler on what constitutes a true interval scale. 

Yes, that is it.  See my response to Jay Warner for the details.

-Jay


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: detecting outliers in NON normal data ?

2002-02-26 Thread Herman Rubin

In article <00f301c1be68$13413000$fde9e3c8@oemcomputer>,
Voltolini <[EMAIL PROTECTED]> wrote:
>Hi,

>I would like to know if methods for detecting outliers
>using interquartil ranges are indicated for data with
>NON normal distribution.

>The software "Statistica" presents this method:
>data point value > UBV + o.c.*(UBV - LBV)
>data point value < LBV - o.c.*(UBV - LBV)

>where: UBV is the 75th percentile) and LBV is the 25th percentile).  o.c. is
>the outlier coefficient.

>In the biological world many data are not normally distributed and tests
>like Rosner, Dixon and Grubbs (if I am wright ! ) are good just for normally
>distributed data.

Nothing is normally distributed; some may come close.

But are they even good for normally distributed data?  
Why should anyone be concerned about outliers?  If there
are observations produced under the assumed model, they
should be included, no matter how far out they are.  The
only legitimate justification for excluding some data
points is that errors of some kind have occurred in 
producing them, whether they are outliers or inliers.
-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054   FAX: (765)494-0558


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Cauchy PDF + Parameter Estimate

2002-02-26 Thread Herman Rubin

In article <[EMAIL PROTECTED]>,
Glen Barnett  <[EMAIL PROTECTED]> wrote:
>Herman Rubin wrote:

>> In article ,
>> Chia C Chong <[EMAIL PROTECTED]> wrote:
>> >Hi!

>> >Does anyone come across some Matlab code to estimate the parameters for the
>> >Cauchy PDF?? Or some other sources about the method to estimate their
>> >parameters??

>> What is so difficult about maximum likelihood?  Start with a
>> reasonable estimator, and use Newton's method.

>There are difficulties with Newton's method (and many other hill-climbing
>techniques) because the Cauchy likelihood function is generally multimodal.

>You can end up somewhere other than the MLE unless you use a somewhat more
>sophisticated starting point than "a reasonable estimator". There are good
>estimators that can start you off very close to the true maximum, but it's
>a long time since I've seen that literature, so I can't name names right now.

The Cauchy likelihood function is frequently multimodal; for
large samples, for the center with known spread, the
probability of a unimodal likelihood is about .13.  However, for
reasonable sample sizes, the other modes will be "way out",
and will be small. 

For squared error loss, the best translation invariant
estimator (the Pitman estimator) can be computed by a
closed formula, but I would be concerned about the 
numerical error if it is not done using considerably
higher precision.  It can also be done by numerical
integration, which is not that difficult.

However, I believe that the MLE will be rather good
for moderate samples.  The local MLE starting with
quantile estimates should work quite well.  Also, if
one knows it is Cauchy, there are estimators using a
few quantiles which are close to efficient.
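
A sketch of such a local MLE (Python/SciPy assumed; the starting values are the sample median and half the interquartile range, a standard quantile-based starter for the Cauchy, not necessarily the particular quantile estimators meant above):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
x = rng.standard_cauchy(500) * 2.0 + 5.0       # simulated data: location 5, scale 2

def negloglik(theta, data):
    loc, log_scale = theta
    s = np.exp(log_scale)                      # keep the scale positive
    return np.sum(np.log(np.pi * s * (1.0 + ((data - loc) / s) ** 2)))

start = np.array([np.median(x),
                  np.log((np.percentile(x, 75) - np.percentile(x, 25)) / 2.0)])
fit = minimize(negloglik, start, args=(x,), method="Nelder-Mead")
print(round(fit.x[0], 3), round(float(np.exp(fit.x[1])), 3))   # location, scale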


-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054   FAX: (765)494-0558


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: STATA Constrained Regression

2002-02-26 Thread Richard Goldstein

Have you looked in the manual under "Constraint"?  If you still have a
problem you should submit your question either to Stata tech support or
to the Stata list server (you can join at the Stata web site:
http://www.stata.com), rather than to a general newsgroup such as this
one.

Rich Goldstein

Emmanuel Salta wrote:
> 
> Does anybody know how to run this constrained regression in STATA? The
> model is Y=b1X1 + b2X2 + b3X3, where b1+b2+b3=1 and 0  Thanks.
> 
> Emmanuel Salta
> [EMAIL PROTECTED]
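
Not Stata, but a sketch of the usual reparameterization for the sum-to-one constraint (Python/numpy assumed; data simulated for illustration): substitute b3 = 1 - b1 - b2, so that Y - X3 = b1*(X1 - X3) + b2*(X2 - X3), and fit ordinary least squares on the transformed variables.

import numpy as np

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 3))
b_true = np.array([0.2, 0.5, 0.3])                     # sums to 1
y = X @ b_true + rng.normal(scale=0.1, size=n)

y_star = y - X[:, 2]
Z = np.column_stack([X[:, 0] - X[:, 2], X[:, 1] - X[:, 2]])
b12, *_ = np.linalg.lstsq(Z, y_star, rcond=None)
b = np.append(b12, 1.0 - b12.sum())                    # recover b3
print(np.round(b, 3))                                  # near [0.2, 0.5, 0.3]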


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Means of semantic differential scales

2002-02-26 Thread jim clark

Hi

On Mon, 25 Feb 2002, Jay Tanzman wrote:

> I just got chewed out by my boss for modelling the means of some 7-point
> semantic differential scales.  The scales were part of a written,
> self-administered questionnaire, and were laid out like this:
> 
> Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
> 
> So, why or why not is it kosher to model the means of scales like this?

Two possibilities suggest themselves (there are probably more),
although it is somewhat unclear to me what you mean by "modelling the
means."

1. You are aggregating across items that the boss thinks should
be analyzed separately, either because they measure different
constructs or because some are reverse-worded?

2. Perhaps more likely, your boss may have learned
(wrongly?) that parametric stats should not be done unless scales
of measurement are at least interval in quality. Search on google
for people like John? Gaito and S.S. Stevens and for phrases like
"scales of measurement" and "parametric statistics."  This debate
surfaces now and then, so there are probably things in various
archives as well.

Best wishes
Jim


James M. Clark  (204) 786-9757
Department of Psychology    (204) 774-4134 Fax
University of Winnipeg  4L05D
Winnipeg, Manitoba  R3B 2E9 [EMAIL PROTECTED]
CANADA  http://www.uwinnipeg.ca/~clark




=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Numerical recipes in statistics ???

2002-02-26 Thread David Duffy

In sci.stat.edu The Truth <[EMAIL PROTECTED]> wrote:
> Glen Barnett <[EMAIL PROTECTED]> wrote in message 
>news:<[EMAIL PROTECTED]>...
>> The Truth wrote:
>> > 
>> > Are there any "Numerical Recipes" like textbook on statistics and probability ?
>> > Just wondering..
>> 
>> What do you mean, a book with algorithms for statistics and probability
>> or a handbook/cookbook list of techniques with some basic explanation?
>> 
>> Glen


> I suppose I should have been more clear with my question. What I
> essentially require is a textbook which presents algorithms like Monte
> Carlo, Principal Component Analysis, Clustering methods,
> MANOVA/MANACOVA methods etc. and provides source code (in C , C++ or
> Fortran) or pseudocode together with short explanations of the
> algorithms.

> Thanks.

> --

-- 
| David Duffy. ,-_|\
| email: [EMAIL PROTECTED]  ph: INT+61+7+3362-0217 fax: -0101/ *
| Epidemiology Unit, The Queensland Institute of Medical Research \_,-._/
| 300 Herston Rd, Brisbane, Queensland 4029, Australia v 


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Means of semantic differential scales

2002-02-25 Thread Dennis Roberts

of course, to be fair to the first jay .. could be simply that his boss did 
not like semantic diff. scales ... AND, for none of the reasons the second 
jay below said ...

it would be helpful if the first jay could give us some further info on why 
his boss was so ticked off ...

At 09:39 PM 2/25/02 -0600, Jay Warner wrote:
>Jay Tanzman wrote:
>
> > I just got chewed out by my boss for modelling the means of some 7-point
> > semantic differential scales.  The scales were part of a written,
> > self-administered questionnaire, and were laid out like this:
> >
> > Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
> >
> > So, why or why not is it kosher to model the means of scales like this?
>
>=
>Instructions for joining and leaving this list, remarks about the
>problem of INAPPROPRIATE MESSAGES, and archives are available at
>   http://jse.stat.ncsu.edu/
>=



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Means of semantic differential scales

2002-02-25 Thread Jay Warner

Jay Tanzman wrote:

> I just got chewed out by my boss for modelling the means of some 7-point
> semantic differential scales.  The scales were part of a written,
> self-administered questionnaire, and were laid out like this:
>
> Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
>
> So, why or why not is it kosher to model the means of scales like this?
>
> -Jay

1)Why do you think the scale is interval data, and not ordinal or
categorical?  If interval, the increments between the levels are more or
less equal.  If ordinal we know they are sequential, but have no idea how
far apart each pair is.  Categorical means there is no relationship between
them - 4 is not greater than 3 - it's only different.

Some people use a response of 4 to mean 'no response' as well as 'no
opinion' and 'neutral opinion.'  sorry, these are not intervals.

2)Is it possible for a respondent to come back with 2.5?  If so, they
think it is interval data, regardless of your opinion.  Would you throw out
a response of 2.5, or would you enter it in your dataset as 2.5?  If the
latter, you think it is interval, also.

3)What makes you think the scale is linear (equal intervals)?  It ain't
- since respondents can't go below 1 or above 7 .  Well, maybe 0 and 8, but
the point is the same.  If you must, make a transformation (arc-sine for
starters) to make it more 'linear' and more likely to contain Normal dist.
data.

4)Why might the respondents use the same increments that you think
exist, or the same as other respondents?  If there is some way you can
'anchor' at end points or mid point, you will get much more informative
data.  I mean, what is 'very stressful' to you?  To me?  to anyone?

Perhaps you are evaluating how people respond to specific scenarios with
their impression of anticipated stress.  In which case, the strength of
'very' is at issue, and perhaps you can argue that it is what you are
measuring.  (remember the old maps:  there be dragons).

When I sit down with a client to work out an experimental design for a
project, one might call this highly stressful.  I am  in full control of
the alternatives and options, so to me it is great fun, and very
invigorating.  The situation is far from 'Not stressful' - it is not the
opposite of 'stressful.'  I know my muscles have been stressed, because it
is also very tiring.  So what might be 'stressful'?  Is that worked out
with your respondents beforehand?

5)In cases where I have been able to anchor firmly, and in some where I
haven't, I find that treating the scale as incremental data work just fine,
thank you.  As soon as you compute an average of responses on this scale,
you have done just that.  If you restrict yourself to categorical analysis
for frequencies between categories, you have avoided that assumption.  And
you have far less to say about the data, as well.

Cheers,
Jay
--
Jay Warner
Principal Scientist
Warner Consulting, Inc.
 North Green Bay Road
Racine, WI 53404-1216
USA

Ph: (262) 634-9100
FAX: (262) 681-1133
email: [EMAIL PROTECTED]
web: http://www.a2q.com

The A2Q Method (tm) -- What do you want to improve today?





=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: What is an outlier ?

2002-02-25 Thread Glen Barnett

Voltolini wrote:
> 
> Hi,
> 
> My doubt is ... an outlier can be a LOW data value in the sample (and not
> just the highest) ?
> 
> Several text books don't make this clear !!!

What makes an outlier "an outlier" is your model. If your model accounts
for all the observations, you can't really call any of them an outlier.
If your model adequately accounts for all but one or two unusual
observations, you might regard them as coming from some process other
than that which generated the data your model accounts for, and call them
outliers.

Such "not adequately accounted for" observations may be low
observations, or high
observations, or they may actually turn out be somewhere in the middle
of the range of your data - as I have seen with time series for example,
where in some applications an autoregressive models was a very good
desctiption of a long series, apart from a few outliers in the first
quarter or so of the time period (which did in the end turn out to have
come from a different process, because the protocol wasn't always being
properly followed early on). Two of those "outliers" - in the sense that
the model didn't adequately account for them - turn out to be neither
particularly high or low observations - but they were substantially
higher or lower than expected from the model. 

Another case where you might have "outliers" in the middle of your data
is in a regression context, where a generally increasing relationship
shows a tight, gaussian-looking random scatter about the relationship,
but with a couple of relatively low y-values at some of the higher
x-values. The observations themselves may actually be very close to the
mean of the y's, but the model of the relationship makes them "unusual".
A different model - for example, one where the observations come from a
distribution which has the same expectation as a function of x, but
which has a heavier tail to the left around that - might account for all
the data and not find any outliers.

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Cauchy PDF + Parameter Estimate

2002-02-25 Thread Glen Barnett

Herman Rubin wrote:
> 
> In article ,
> Chia C Chong <[EMAIL PROTECTED]> wrote:
> >Hi!
> 
> >Does anyone come across some Matlab code to estimate the parameters for the
> >Cauchy PDF?? Or some other sources about the method to estimate their
> >parameters??
> 
> What is so difficult about maximum likelihood?  Start with a
> reasonable estimator, and use Newton's method.

There are difficulties with Newton's method (and many other hill-climbing
techniques) because the Cauchy likelihood function is generally multimodal.

You can end up somewhere other than the MLE unless you use a somewhat more
sophisticated starting point than "a reasonable estimator". There are good
estimators that can start you off very close to the true maximum, but it's
a long time since I've seen that literature, so I can't name names right now.

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: detecting outliers in NON normal data ?

2002-02-25 Thread Glen Barnett

Voltolini wrote:
> 
> Hi,
> 
> I would like to know if methods for detecting outliers
> using interquartil ranges are indicated for data with
> NON normal distribution.
> 
> The software "Statistica" presents this method:
> data point value > UBV + o.c.*(UBV - LBV)
> data point value < LBV - o.c.*(UBV - LBV)
> 
> where UBV is the 75th percentile and LBV is the 25th percentile.  o.c. is
> the outlier coefficient.

The values of the outlier coefficient are traditionally chosen by reference
to some percentile of the normal distribution. (If anyone didn't recognise it,
this is just the outliers on a boxplot.)

If you choose that coefficient in some appropriate way, then it may be
reasonable for non-normal data.
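
A minimal sketch of the quartile rule quoted above (Python/numpy assumed; the coefficient value 1.5 is the conventional boxplot default, an assumption here rather than anything from the post):

import numpy as np

def iqr_outliers(x, oc=1.5):
    x = np.asarray(x)
    lbv, ubv = np.percentile(x, [25, 75])     # LBV = lower quartile, UBV = upper
    lo = lbv - oc * (ubv - lbv)
    hi = ubv + oc * (ubv - lbv)
    return x[(x < lo) | (x > hi)]

data = [3, 5, 4, 6, 5, 4, 7, 5, 6, 48]        # made-up skewed sample
print(iqr_outliers(data))                     # flags the 48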

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Means of semantic differential scales

2002-02-25 Thread J. Williams

On Mon, 25 Feb 2002 15:17:55 -0800, Jay Tanzman <[EMAIL PROTECTED]>
wrote:

>I just got chewed out by my boss for modelling the means of some 7-point
>semantic differential scales.  The scales were part of a written,
>self-administered questionnaire, and were laid out like this:
>
>Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
>
>So, why or why not is it kosher to model the means of scales like this?
>
>-Jay

You can check it out by reading the pioneers of the semantic
differential scale.  Osgood, Suci, and Tannenbaum are the authors of
"Measurement of Meaning"  which now is published in paperback by the
University of Illinois Press, Oct. 1990.  It may be your boss is a
stickler on what constitutes a true interval scale.  It could be
he/she wants no middle value score - that way respondents must tilt
toward a yea or nay.  It could be the use of the particular bipolars
"not stressful" and "very stressful."  Why not use stressful and not
stressful?   What is "very" stressful?  By reading the Osgood et al
text, you can find many nifty ideas and variations for using the
semantic differential scale.  Like the Likert Scale, I suppose it is
arguably an ordinal scale.  But, there are lots of statistical tools
you could employ using rankings, medians, etc.  Like the Likert Scale
devotees,  there are those who nevertheless use means as the measure
of central tendency with semantic differential instruments.  Good
luck.


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: What is an outlier ?

2002-02-25 Thread Dennis Roberts

of course, if one has control over the data, checking the coding and making 
sure it is correct is a good thing to do

if you do not have control over that, then there may be very little you can 
do with it and in fact, you may be totally UNaware of an outlier problem

i see a potentially MUCH larger problem when ONLY certain summary 
statistics are shown without any basic tallies/graphs displayed ... so, IF 
there are some really strange outlier values, it usually will go undetected ...

correlations are ONE good case in point ... have a look at the following 
scatterplot ... height in inches and weight in pounds ... from the pulse 
data set in minitab


              -                       *
              -
           300+
              -
      Weight  -
              -                                     2
              -                            2  224 32
           150+                         ** 3458*454322*
              -                          *53*3*535  2
              -                            **
            --+-----+-----+-----+-----+-----+----Height
             32.0  40.0  48.0  56.0  64.0  72.0

now, the actual r between the X and Y is -.075 ... and of course, this 
seems strange but, IF you had only seen this in a matrix of r values ... 
you might say that perhaps there was serious range restriction that more or 
less wiped out the r in this case ...  but even the desc. stats might not 
adequately tell you of this problem

IF you had the scatterplot, you probably would figure out REAL quick that 
there is a PROBLEM with one of the data points ...

in fact, without that one weird data point, the r is about .8 ... which 
makes a lot better sense when correlating heights and weights of college 
students


At 09:06 PM 2/25/02 +, Art Kendall wrote:

>
>An "outlier" is any value for a variable that is suspect given the
>measurement system, "common sense",  other values for the variable in
>the data set, or  the values a case has on other variables.
>=

Dennis Roberts, 208 Cedar Bldg., University Park PA 16802

WWW: http://roberts.ed.psu.edu/users/droberts/drober~1.htm
AC 8148632401



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Cauchy PDF + Parameter Estimate

2002-02-25 Thread Herman Rubin

In article ,
Chia C Chong <[EMAIL PROTECTED]> wrote:
>Hi!

>Does anyone come across some Matlab code to estimate the parameters for the
>Cauchy PDF?? Or some other sources about the method to estimate their
>parameters??

What is so difficult about maximum likelihood?  Start with a
reasonable estimator, and use Newton's method.
-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054   FAX: (765)494-0558


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: What is an outlier ? cont'd

2002-02-25 Thread Art Kendall



That being said, occasions can arise where there are outliers other than
from measurement or data entry error. Different disciplines have different
approaches.
What discipline are you studying? What is the variable you are concerned
about?  How is it measured?

some examples of low values:
10 pounds would be a suspicious value for an adult's weight.
Few college students are under 16.
37 degrees F would be unreasonable for a body temperature of a li





=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: What is an outlier ?

2002-02-25 Thread IPEK

Of course it can be. An outlier is any value that is not usual for your data
set.
"Voltolini" <[EMAIL PROTECTED]> wrote in message
news:002f01c1be21$65913d60$0fe9e3c8@oemcomputer...
> Hi,
>
>
> My doubt is ... an outlier can be a LOW data value in the sample (and not
> just the highest) ?
>
> Several text books don't make this clear !!!
>
>
> Thanks
>
>
> V.
>
>
>
> =
> Instructions for joining and leaving this list, remarks about the
> problem of INAPPROPRIATE MESSAGES, and archives are available at
>   http://jse.stat.ncsu.edu/
> =




=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Cauchy PDF + Parameter Estimate

2002-02-25 Thread Duncan Murdoch

On 25 Feb 2002 07:56:56 -0800, [EMAIL PROTECTED] (kjetil
halvorsen) wrote:

>It is straightforward to write down the log-likelihood, and then whatever
>optimization routine (there must be one in Matlab) will help you!

Just be careful when searching, because Cauchy likelihoods are
frequently multi-modal.  

Duncan Murdoch


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Question on Conditional PDF

2002-02-25 Thread Chia C Chong


"Glen Barnett" <[EMAIL PROTECTED]> wrote in message
news:a5dev7$8jn$[EMAIL PROTECTED]...
>
> Chia C Chong <[EMAIL PROTECTED]> wrote in message
> news:a5d38d$63e$[EMAIL PROTECTED]...
> >
> >
> > "Glen" <[EMAIL PROTECTED]> wrote in message
> > news:[EMAIL PROTECTED]...
> > > Do you want to make any assumptions about the form of the conditional,
> > > or the joint, or any of the marginals?
> >
> > Well, the X & Y are dependent and hence they are being described by a joint
> > PDF.
>
> This much is clear.
>
> > I am not sure what other assumption I can make though..
>
> I merely thought you may have domain specific knowledge of the variables and
> their likely relationships which might inform the choice a bit (cut down the
> space of possibilities).
>
> Can you at least indicate whether any of them are restricted to be positive?


All values of X and Z are positive, while Y can have both positive and
negative values. In fact, X spans 0 to 250 (time), Y spans -60 to +60
(angle), and Z takes positive values. Note that the joint PDF of X & Y was
defined as f(X,Y)=f(Y|X)f(X), in which f(Y|X) is a conditional Gaussian PDF
and f(X) is an exponential PDF. Plots of the 3rd variable, Z (power), i.e.
Z vs X and Z vs Y respectively, show that Z has some kind of dependency on
X and Y; hence my original post was asking for a possible method of finding
the conditional PDF of Z on both X and Y. I hope this makes things a little
bit clearer or more complicated???


Thanks..

CCC
>
> Glen
>
>







Re: REQ: Appendix A. of Radford Neal thesis: "Bayesian Learning for Neural Networks"

2002-02-25 Thread Jonathan G Campbell

Mark wrote:
> 
> Hi,
> 
> I'm a CS student interested in Radford Neal's thesis called "Bayesian
> Learning for Neural Networks". I know that some years ago this thesis
> was available for download from the author's site, but nowadays it
> isn't. I have searched for it on the Internet but have not been able to
> find it.
> 
> I should be grateful if anyone could tell me where I can find it, or
> could send it to me via e-mail.
> 
> I am especially interested in Appendix A. of this thesis.

As the other poster suggested, it has been published:

@Book{neal-bayesian-nn,
  author    = "R. M. Neal",
  title     = "Bayesian Learning for Neural Networks",
  publisher = "Springer Verlag",
  year      = "1996"
}

From the preface: "This book, a revision of my PhD thesis [Bayesian
Learning for Neural Networks] ..."

Appendix A: Details of the Implementation. 

Best regards,

Jon C.

-- 
Jonathan G Campbell BT48 7PG [EMAIL PROTECTED] 028 7126 6125
http://homepage.ntlworld.com/jg.campbell/





Re: Question on Conditional PDF

2002-02-25 Thread Vadim and Oxana Marmer

> > > Do you want to make any assumptions about the form of the conditional,
> > > or the joint, or any of the marginals?
> >
> > Well, the X & Y are dependent and hence they are being described by a joint
> > PDF.
>
> Can you at least indicate whether any of them are restricted to be positive?


Also, can you treat them as fixed (not random)? You indicated before that
your data were from experiments, so if X, Y were independent they could
also be non-random or controlled by a researcher.






Re: Cauchy PDF + Parameter Estimate

2002-02-25 Thread kjetil halvorsen

It is straightforward to write down the log-likelihood, and then whatever
optimization routine (there must be one in Matlab) will help you!

Kjetil Halvorsen

Chia C Chong wrote:
> 
> Hi!
> 
> Does anyone come across some Matlab code to estimate the parameters for the
> Cauchy PDF?? Or some other sources about the method to estimate their
> parameters??
> 
> Thanks..
> 
> CCC
> 



Re: regression of non-normal data ?

2002-02-25 Thread Paige Miller

John Ziker wrote:

> This research deals with the classical anthropological question of
> food sharing among hunters and gatherers. There are a number of
> hypotheses being discussed within the field. This study is relevant
> for two models, namely kinship cooperation and reciprocity. The
> kinship model predicts greater asymmetry in sharing with increasing
> proximity of relatedness between the partners. The reciprocity model
> predicts that sharing is contingent on returned acts of sharing. I have
> a small sample of meals I observed and documented among Dolgan and
> Nganasan hunter-gatherers in a remote community in the Siberian
> Arctic. I documented approximately 800 meals in 1995 and 1996. Of
> these, 145 meals included members of more than one household. I am
> including the raw data in this message. These raw data are: the number
> of times household x hosted household y, the number of times household
> y hosted household x, and the average household relatedness of
> household x and y. The relatedness figure was calculated as the
> average relatedness (r) of each pair of individuals in each household.
> [The variable 'r'is used in biology to represent the likelihood that
> two individuals share a gene at a given locus.]
> 
> The main question I have is: with these data is it possible to
> determine statistically whether or not average household r predicts x
> to y sharing better than y to x reciprocity, or vice versa. The sample
> is highly skewed because of the fact that, even though the households
> represented are the ones in my sample that had the highest number of
> sharing partners, not every household hosted each other.
...
> 
> I have run Spearman's rho and the correlation is highly significant for
> all comparisons. The data are not normal though, and I am questioning
> multiple regression results (X to Y dependent variable). A colleague of
> mine suggests that the standardized beta result may be a valid
> indicator of some significant difference however. I'd greatly
> appreciate any suggestions.

Regression does NOT require normally distributed data. Neither the
independent nor the dependent variable needs to be normally distributed.
It is a common misconception that normality is required.

However, it is required that the errors from the prediction are normally
distributed. Generally, this is tested after you fit the regression by
seeing if your residuals are normally distributed.

Now, having said this, how does it apply to your situation? You need to
examine your data and see if the assumption holds. It may not, but I
won't presume to do the work for you. 

Your question about using standardized betas is confusing to me; this is
just a different scaling of the betas, and it doesn't affect significance. 
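
A small sketch of that check in Python (not from the thread; the variable
names and numbers are placeholders standing in for the sharing and
relatedness data):

# Sketch: the normality assumption is checked on the residuals, not on Y itself.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
relatedness = rng.uniform(0.0, 0.5, size=60)                 # placeholder predictor
sharing = 2.0 + 10.0 * relatedness + rng.normal(0, 1, 60)    # placeholder response

fit = stats.linregress(relatedness, sharing)
residuals = sharing - (fit.intercept + fit.slope * relatedness)

w, p = stats.shapiro(residuals)        # a small p-value flags non-normal errors
print("slope:", fit.slope, "  Shapiro-Wilk p on residuals:", p)
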

-- 
Paige Miller
Eastman Kodak Company
[EMAIL PROTECTED]

"It's nothing until I call it!" -- Bill Klem, NL Umpire
"When you get the choice to sit it out or dance,
   I hope you dance" -- Lee Ann Womack



Re: Question on Conditional PDF

2002-02-25 Thread Glen Barnett


Chia C Chong <[EMAIL PROTECTED]> wrote in message
news:a5d38d$63e$[EMAIL PROTECTED]...
>
>
> "Glen" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]...
> > Do you want to make any assumptions about the form of the conditional,
> > or the joint, or any of the marginals?
>
> Well, the X & Y are dependent and hence they are being described by a joint
> PDF.

This much is clear.

> I am not sure what other assumption I can make though..

I merely thought you may have domain-specific knowledge of the variables and
their likely relationships which might inform the choice a bit (cut down the
space of possibilities).

Can you at least indicate whether any of them are restricted to be positive?

Glen







Re: odds vs probabilities

2002-02-25 Thread David Smith

Odds are multiplicative in the following sense, useful in some types of
betting arrangements.

If the odds of one bet are 4 to 3 and of the next bet 3 to 2 then the odds
of both  bets are the product 4*3 to 3*2 or 2 to 1.  This is useful in some
horse racing bets where the second (and even more) bets are made
sequentially provided the earlier bets are winners for the gambler (placed
with the bookmaker).
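
A two-line check of that arithmetic (a sketch, not part of the original
message; the odds values are the ones just described):

# Parlay arithmetic above: 4 to 3 followed by 3 to 2 gives 12 to 6, i.e. 2 to 1.
from fractions import Fraction

def combine(o1, o2):
    # odds given as (for, against) pairs; sequential bets multiply termwise
    return (o1[0] * o2[0], o1[1] * o2[1])

f, a = combine((4, 3), (3, 2))
r = Fraction(f, a)
print(r.numerator, "to", r.denominator)        # 2 to 1
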

Odds have a cultural history that seems to be lost.  They were the common
form of gambling until quite recently, when even odds bets with point
spreads became common for such sports as basketball and football.  Odds
require a strong facility with arithmetic when there are multiple results,
such as in a horse race.  There, as always, the odds are constrained by the
usual requirement that the corresponding probabilities must still add to
one, at least for fair bets.  Imposing this requirement "on the fly" when a
bookmaker changes the odds on a horse seems difficult unless there were some
quick simple rules of thumb that were part of bookmakers' lore.  Those rules
of thumb would have included the profit margin for the bookie automatically,
making all  the bets slightly, but fairly evenly, unfair for the bettor.  If
anyone knows of such rules I would appreciate hearing them.

Regards,
David

David W. Smith, Ph.D., M.P.H.

(518) 439-6421

45 The Crosway
Delmar, NY 12054

[EMAIL PROTECTED]
- Original Message -
From: "Brad Branford" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, February 23, 2002 9:49 AM
Subject: Re: odds vs probabilities

> probabilities. I know that probs have a problem in that they don't
> make multiplicative sense:






Re: Question on Conditional PDF

2002-02-25 Thread Chia C Chong



"Glen" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]...
> [EMAIL PROTECTED] (Chia C Chong) wrote in message
news:<[EMAIL PROTECTED]>...
> > Helloo..
> >
> > I have 1000 samples of 3 RVs (say X, Y and Z) drawn from a series of
> > experiments. My intention is to find the PDF of Z conditioned on X and Y,
> > i.e. f(Z|X,Y). I am not sure what is the proper way of doing it
> > practically!! Any suggestions??
> >
>
> Are any of the variables discrete?

All the variables are continuous...

>
> Do you want to make any assumptions about the form of the conditional,
> or the joint, or any of the marginals?

Well, the X & Y are dependent and hence they are being described by a joint
PDF. I am not sure what other assumption I can make though..


>
> Glen







Re: Question on Conditional PDF

2002-02-24 Thread Glen

[EMAIL PROTECTED] (Chia C Chong) wrote in message 
news:<[EMAIL PROTECTED]>...
> Helloo..
> 
> I have 1000 samples of 3 RVs (say X, Y and Z) drawn from a series of
> experiments. My intention is to find the PDF of Z conditioned on X and Y,
> i.e. f(Z|X,Y). I am not sure what is the proper way of doing it
> practically!! Any suggestions??
> 

Are any of the variables discrete?

Do you want to make any assumptions about the form of the conditional,
or the joint, or any of the marginals?

Glen





Re: Question on Conditional PDF

2002-02-24 Thread Vadim and Oxana Marmer


If you don't want to make too many assumptions then you can try
nonparametric estimation (estimate f(X,Y,Z) and f(X,Y) by kernel methods).
Check out books on nonparametric methods ("Nonparametric Econometrics" by
Pagan, for example, or a book by Silverman(?)).
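
A rough sketch of that kernel route in Python (not from the thread): estimate
f(X,Y,Z) and f(X,Y) with product Gaussian kernels and take their ratio to get
f(Z|X,Y). The data, the rule-of-thumb bandwidths, and the query point below
are all stand-ins:

# Sketch: conditional density f(Z | X, Y) as a ratio of kernel estimates.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 1000
X = rng.exponential(scale=50.0, size=n)                 # stand-in for the 1000 samples
Y = rng.normal(0.0, 20.0, size=n)
Z = np.exp(-X / 100.0) * np.cos(np.radians(Y)) + rng.normal(0, 0.05, n)

def bw(v):                                              # Silverman-type rule of thumb
    return 1.06 * v.std(ddof=1) * len(v) ** (-0.2)

hx, hy, hz = bw(X), bw(Y), bw(Z)

def f_z_given_xy(z0, x0, y0):
    w = norm.pdf((x0 - X) / hx) * norm.pdf((y0 - Y) / hy)   # weights from (X, Y)
    return np.sum(w * norm.pdf((z0 - Z) / hz)) / (hz * np.sum(w))

for z0 in (0.2, 0.5, 0.8):
    print(z0, f_z_given_xy(z0, 40.0, 10.0))
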

On 24 Feb 2002, Chia C Chong wrote:

> Helloo..
>
> I have 1000 samples of 3 RVs (say X, Y and Z) drawn from a series of
> experiments. My intention is to find the PDF of Z conditioned on X and Y,
> i.e. f(Z|X,Y). I am not sure what is the proper way of doing it
> practically!! Any suggestions??
>
> Thanks in advance
>
> Regards,
> CCC
>






Re: Appendix A. of Radford Neal thesis: "Bayesian Learning for Neural Networks"

2002-02-24 Thread Glen Barnett


Mark <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]...
> Hi,
>
> I'm a CS student interested in Radford Neal's thesis called "Bayesian
> Learning for Neural Networks". I know that some years ago this thesis
> was available for download from the author's site, but nowadays it
> isn't. I have searched for it on the Internet but have not been able to
> find it.

He published a book by that title a few years ago. Is it in there?

Glen






Re: REQ: Appendix A. of Radford Neal thesis: "Bayesian Learning for Neural Networks"

2002-02-24 Thread Rich Ulrich

On Fri, 22 Feb 2002 18:00:16 +0100, Mark
<[EMAIL PROTECTED]> wrote:

> Hi,
> 
> I'm a CS student interested in Radford Neal's thesis called "Bayesian
> Learning for Neural Networks". I know that some years ago this thesis
> was available for download from the author's site, but nowadays it
> isn't. I have searched for it on the Internet but have not been able to
> find it.

Why not send personal  e-mail and ask him?
He has posted to the stat-groups within the last month, from

 Radford Neal ([EMAIL PROTECTED])

> My e-mail address is [EMAIL PROTECTED] Please remove the
> "REMOVETHIS" string from the email address to get my real one. It's an
> anty-spam measure. I apologize for any inconvenience that it causes to
> you.

 - no inconvenience;  I won't bother.

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





Re: odds vs probabilities

2002-02-24 Thread Rich Ulrich

On 23 Feb 2002 06:49:58 -0800, [EMAIL PROTECTED] (Brad Branford)
wrote:

> hi,
> 
> thanks.  sorry if I posed the question poorly. actually, what I'm
> looking for is an intuitive understanding of when to use odds and when
> probabilities. I know that probs have a problem in that they don't
> make multiplicative sense: for instance, assume I have a probability
> of winning of 55%; if the likelihood of winning doubles, we have
> absurd outcomes if expressed in terms of probabilities.
> 
> thanks.

Uh-oh.  You have introduced a third technical term here.
Likelihood doesn't match either probability or odds.

There is a classical book by AWF Edwards by that title
(Likelihood), which argues we should be using likelihood 
for all our inference:  This makes some difference, though
I am not sure of what and when and how much.
Ps and likelihood compete in inference.  Ps and ORs arise
in descriptions of samples.

Where Probability is constructed as an area under the pdf
(often a 'tail area'), and OR is the ratio of two areas, 
the Likelihood is simply the height of the curve as evaluated 
at one point.  Thus  Probability and Odds are interchangeable, 
in just the way Ken describes.  

I think you only gain the 'intuitive understanding'  by 
exposure to a whole gamut of examples, including the 
counter-examples on when (especially) probability does 
not work because it is misleading.  For instance, 
in many contexts, 93% of a sample is  not "nearly the 
same as"  99%, since the OR is 7.00, and that will matter.

There is less reason to complain about  P  in place of ORs 
when the Ps are small -- where the arithmetic doesn't
expose the fallacy, as with your "twice 55%".   And P can
be approximately correct  in describing sample sizes when 
comparison-values are all between 25% and 75%, 
or 20-80.  But generally ORs are more appropriate for 
statistical models.  The drawback of ORs is that the public
is apt to understand them  less.

A year or so ago, someone posted journal references with 
arguments *against*  using ORs.  I looked up a couple
and did not find them impressive, but you can probably
find them by searching sci.stat.* with groups.google.com.

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





Re: Probabilities

2002-02-24 Thread Tobias Arens



Zachary Agatstein wrote:
> 
> Can you help me solve this problem:
> 
> There are 8 baskets and 4 apples.  Thrown at random, 3 of the 4 apples
> can go to any basket.  The 4th apple, however, can only be thrown into
> baskets 1 through 4.  What is the probability that there is no more than
> one apple in every basket?
> 
> Now, I can easily solve this problem if the 4th apple could also go to
> any of the 8 baskets.  The probability referred to above can be computed
> as follows:
> P = factorial(8)/((8 to the power of 4)*factorial(8-4)) = 0.41015625.

I agree.

> But the restriction for the 4th apple to only be limited to baskets 1
> thru 4 would obviously change that probability. How?

You can ask: if you have a valid solution to the simplified problem (i.e.
"there is no more than one apple in every basket"), what is the probability
that apple no. 4 was thrown into one of the baskets 1-4?
It's 4/8 = 0.5, so

0.5*0.41015625 = 0.205078125

is the probability you are looking for.


Tobias Arens





Re: What is an experiment ?

2002-02-24 Thread Art Kendall

Speculatively, temperature could confound or be a rival hypothesis in a few ways.
It would influence what could be in solution, pollutants as well as things that
offset them.  It could be what varies across the parts of rivers or between rivers.
It might differentially  influence survival or breeding of different species.  etc.



Jay Tanzman wrote:

> Art Kendall wrote:
>
> [snip good points]
>
> > in your quasi-experiment you can possibly contrast different levels of specific
> > pollutants, as well as kinds of pollutants, in different rivers at different
> > times.
> > I'm not a biologist, but I would be amazed if temperature did not affect
> > population sizes.
>
> Yeah, but would temperature also be related to pollution levels, and if not, so
> what if it is related to the outcome under study?
>
> -Jay






Re: What is an experiment ?

2002-02-23 Thread Jay Tanzman



Art Kendall wrote:

[snip good points]

> in your quasi-experiment you can possibly contrast different levels of specific
> pollutants, as well as kinds of pollutants, in different rivers at different
> times.
> I'm not a biologist, but I would be amazed if temperature did not affect
> population sizes.

Yeah, but would temperature also be related to pollution levels, and if not, so
what if it is related to the outcome under study?

-Jay





Re: odds vs probabilities

2002-02-23 Thread Brad Branford

hi,

thanks.  sorry if I posed the question poorly. actually, what I'm
looking for is an intuitive understanding of when to use odds and when
probabilities. I know that probs have a problem in that they don't
make multiplicative sense: for instance, assume I have a probability
of winning of 55%; if the likelihood of winning doubles, we have
absurd outcomes if expressed in terms of probabilities.

thanks.

brad


[EMAIL PROTECTED] (Kenmlin) wrote in message 
news:<[EMAIL PROTECTED]>...
> Odds are defined to be
> 
> P(event) / (1 - P(event))
> 
> So if P(event) is 0.50, then the odds are 1 to 1.  If P(event) is 0.75, then the
> odds are 3 to 1 since 0.75 is three times as large as 1 - 0.75 = 0.25.
> 
> Given one of odds or probabilities, you can always derive the other. 
> 
> Ken
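
A small sketch of that conversion, and of why "doubling" behaves on the odds
scale but not on the probability scale (the 55% example earlier in the
thread); plain Python, nothing here is from the original posts:

def odds(p):
    return p / (1.0 - p)

def prob(o):
    return o / (1.0 + o)

p = 0.55
print("odds for p = 0.55:", odds(p))        # about 1.22
print("doubled probability:", 2 * p)        # 1.10 -- impossible as a probability
d = 2 * odds(p)
print("doubled odds:", d, "-> back to p =", prob(d))   # about 0.71
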





Re: What is an experiment ?

2002-02-23 Thread Art Kendall

It depends on which science.
In social, behavioral, industrial, and many health related fields, the
distinction is sharply drawn between true experiments where there is active
manipulation of one or more treatment independent variables and random
assignment of cases to treatment.  (also, it simplifies calculation and drawing
conclusions if there are equal n's in cells of the design.)

Other designs are considered quasi-experimental where plausible rival
hypotheses need to be addressed by considerations other than manipulation and
random assignment. The fewer aspects of a true experiment a study has the more
discussion there needs to be of ruling out the rival hypotheses. The study you
briefly describe would be called quasi-experimental.

Some fields in statistics talk about what other fields would consider "thought
experiments" such as ball-and-urns as experiments.

The term "observation" has a wide variety of meanings, but some would include
several kinds of quasi-experimental designs as observational.

- - -
in your quasi-experiment you can possibly contrast different levels of specific
pollutants, as well as kinds of pollutants, in different rivers at different
times.
I'm not a biologist, but I would be amazed if temperature did not affect
population sizes.
.

Voltolini wrote:

> Hi,
>
> I was reading a definition of "experiment" in science to be used in a
> lecture, and the use of treatments and controls is an important feature of
> an experiment, but my doubt is... is it possible to plan an experiment
> without a control and call it an "experiment"?
>






Re: What is an experiment ?

2002-02-22 Thread EugeneGall

Jay Tanzman wrote:
>I agree that you can test a hypothesis by using an observational study, but
>that
>does not make it an experiment.  The original poster was looking for a
>definition to use in a lecture, and an experiment, by definition, involves
>assignment of treatments to experimental units.
>
>A study is hypothesis testing if the investigator is using it as such.
>Whether
>the particular study design would be expected to yield a valid answer is
>another
>matter.

In ecology, Hurlbert (1984) distinguished between manipulative and mensurative
experiments, "A manipulative experiment always involves two or more treatments.
 The defining feature of a manipulative experiment is that the different
experimental units receive different treatments and that the assignment of
treatments to experimental units is or can be randomized."
Underwood (1997, p. 16) argued, "The distinction between types of experiments
is a distraction.  It does not matter whether the system is measured or
manipulated or manipulated and measured. Each is appropriate for different
circumstances and different models.  What matters is that the experiment is
clearly related to the need to test a logically defined null hypothesis. The
experiment must then be done so that it preserves the logical structure and
allows a logical conclusion."

The restriction of the term "experiment" to studies in which treatments are
assigned to experimental units, would rule out as non-experiments, many of the
more famous "experiments" in science.  For example, Mayo (1996) describes
Eddington's 1919 observations of the deflection of starlight during eclipses as
experimental tests of Newton's and Einstein's theories of gravity.  In 1918,
Eddington set out predictions which served as a crucial "experiment" to test
the predictions of Einstein and Newton.  The deflection of starlight near the
sun during an eclipse was in near agreement with Einstein's gravitational
theory (within experimental error).  Eddington's test would fall under the
category of a mensurative experiment or observational study, not a true
manipulative experiment. It would fit Underwood & Mayo's broad definition of
experiment.
   Ernst Mayr (1982), the noted evolutionary biologist, adopted the strict
definition of experiment, "As Pantin (1968: 17) has stated, 'In astronomy, in
geology, and in biology observation of natural events at chosen times and
places can sometimes provide information as wholly sufficient for a conclusion
to be drawn as that which can be obtained by experiment
  ...contrary to the claims of some physicists, the branches of science which
depend on the comparative method are not inferior. " 

So, there is some justification for the restrictive definition of experiment,
but there is also justification for a broad definition of experiment which
includes well-designed tests such as Eddington's.  
Kendall and Stuart (1961) praise Fisher's advocacy of random assignment of
treatments to experimental units in order to solve problems inherent in
experimental designs, but they do not make randomization part of their
definition of an experiment, "The distinction between the design of experiments
and the design of sample surveys is fairly clear-cut, and may be expressed by
saying that in surveys we make observations on a sample taken from a finite
population of individuals, whereas in experiments we make observations which
are in principle generated by a hypothetical infinite population, in exactly
the same way that the tosses of a coin are.  Of course, we may sometimes
experiment on the members of a sample resulting from a survey, or even make a
sample survey of the results of an (extensive) experiment, but the essential
distinction between the two fields should be clear."


Gene Gallagher
References
Hurlbert, S. J. 1984.  Pseudoreplication and the design of ecological field
experiments. Ecol. Monogr 54: 187-211.
Kendall, M. G. and A. Stuart. 1961. The Advanced Theory of Statistics. Hafner.
Mayo, D. G. 1996.  Error and the growth of experimental knowledge. U. of
Chicago Press.
Mayr, E. 1982.  The growth of biological thought, Belknap Press, Cambridge
Underwood, A. J. 1997. Experiments in ecology. Cambridge University Press. 





Re: If T-Test can not be applied

2002-02-22 Thread EugeneGall

 Art Kendall wrote:

>In SPSS output  ignore the lines for equal variances, and use the lines for
>unequal variances.

Last year on this group, there was an interesting dataset posted, in which the
equal and unequal variance t tests give very different results:

Temperatures from Portion 1 of a stream:
16.9
17
15.8
17.1
18.7
18

mean = 17.25
variance = 0.995

Portion 2
18.3
18.5

mean = 18.4
variance = 0.02

The SPSS unequal variance t-test gives a 2-tailed P of 0.037, but the equal
variance t produces a two-tailed P of 0.174.
An exact test is possible with these data, as there are only 28 ways of
forming groups of 6 and 2, and only 3 of these groupings produce a difference
in means equal to or greater than the observed (one being the observed, p=0.11,
reasonably close to the equal variance t test).  I gave this example to my
class, just so they would not automatically use the unequal variance t test
output (it is certainly not appropriate for unequal group sizes if the smaller
group has the smaller variance).
Gene Gallagher
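
A sketch reproducing those comparisons outside SPSS (Python/scipy assumed, not
part of the original post; the exact test simply enumerates the 28 splits
mentioned above):

# Sketch: pooled vs. Welch t on the stream temperatures, plus the exact test.
from itertools import combinations
import numpy as np
from scipy import stats

portion1 = np.array([16.9, 17.0, 15.8, 17.1, 18.7, 18.0])
portion2 = np.array([18.3, 18.5])

print("pooled-variance p:",
      stats.ttest_ind(portion1, portion2, equal_var=True).pvalue)
print("Welch (unequal variance) p:",
      stats.ttest_ind(portion1, portion2, equal_var=False).pvalue)

values = np.concatenate([portion1, portion2])
observed = portion2.mean() - portion1.mean()
count = 0
for pair in combinations(range(8), 2):                # all 28 ways to pick "portion 2"
    grp2 = values[list(pair)]
    grp1 = np.delete(values, list(pair))
    if grp2.mean() - grp1.mean() >= observed - 1e-9:  # tolerance for float ties
        count += 1
print("exact one-sided p:", count, "/ 28")
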





Re: SPC control limits

2002-02-22 Thread Jay Warner


Inasmuch as the objective is to 'drain the swamp' - reduce product &
process variation - I find that in practice Wheeler's suggestion of focusing
on causes of that variation is on the mark.  The procedure from the
old Ford Manual, and I believe the 6th & 7th Ed. of Grant & Leavenworth,
puts the focus on getting the limits seriously 'accurate.'

When a chart is first set up, the control limits are not that critical.
(Now, _that_ is a heretical statement!)  Previously, no one recognized
the variations for what they were, so every wild point on the brand new
chart leads quickly to understanding of causes and corrections (prevention,
in ISO9000-speak).  A Run Chart is all it takes at this point.
After things settle down some, a control limit will give guidance on the
value of any system changes/improvements that may reduce the overall variation
(i.e., average, if we're discussing flaws).

Sorry if my response misled anyone - I recall the question regarded
the calculation, not the logic of improvement.

Jay
"Simon, Steve, PhD" wrote:
 
Jay Warner writes:
>The 'party line' is to take the first 30 or so points,
calculate
>limits, throw out any outside ones & add more at
the end, until you
>have 30 points, all of which are inside the control
limits.
Actually, I have heard the opposite. Out-of-control points
do not have THAT much of an impact on the control limits, so that recomputing
control limits after removing out of control points is not a good use of
your time. Wheeler makes the suggestion that the time spent fine tuning
the chart might be better spent investigating why the original points are
out of control.
I am not an SPC guru, though, so take my advice with a
grain of salt.
Steve Simon, [EMAIL PROTECTED], Standard Disclaimer.
The STATS web page has moved to
http://www.childrens-mercy.org/stats
 

--
Jay Warner
Principal Scientist
Warner Consulting, Inc.
 North Green Bay Road
Racine, WI 53404-1216
USA
Ph: (262) 634-9100
FAX: (262) 681-1133
email: [EMAIL PROTECTED]
web: http://www.a2q.com
The A2Q Method (tm) -- What do you want to improve today?
 


Re: Question on CDF

2002-02-22 Thread Glen Barnett


Henry <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]...
> I was trying to suggest that he meant the slope of the CDF was the
> height of the PDF.

Oh, okay. Yes, that would be correct, but it shouldn't be called probability!

Glen






Re: Question on CDF

2002-02-22 Thread Henry

On Sat, 23 Feb 2002 00:27:00 +1100, "Glen Barnett"
<[EMAIL PROTECTED]> wrote:

>
>Henry <[EMAIL PROTECTED]> wrote in message
>news:[EMAIL PROTECTED]...
>> On Fri, 22 Feb 2002 08:55:42 +1100, "Glen Barnett"
>> <[EMAIL PROTECTED]> wrote:
>>
>> >Bob <[EMAIL PROTECTED]> wrote in message
>> >news:[EMAIL PROTECTED]...
>> >> A straight line CDF would imply the data is uniformly distributed,
>> >> that is, the probability of one event is the same as the probability
>> >> of any other event.  The slope of the line would be the probability of
>> >> an event.
>> >
>> >I doubt that - if the data were distributed uniformly on [0,1/2), say, then
>> >the slope of the line would be 2!
>>
>> I suspect he meant probability density.
>
>I guess that's actually correct - the slope of the pdf is zero. However, I'm
>fairly certain that's not what he meant.

I was trying to suggest that he meant the slope of the CDF was the
height of the PDF.





Re: What is an experiment ?

2002-02-22 Thread Jay Tanzman



SSCHEINE wrote:
> 
> Let me take a (somewhat) contrarian position to those previously
> expressed. An experiment is any test of a hypothesis. An experiment can
> involve the use of observational (unmanipulated) data, as long as the
> hypothesis is clearly stated prior to the collection of the data. While
> it is true that an experiment involving manipulation can provide some of
> the best evidence for causal relationships, causal relationships can be
> deduced from observation data combined with other information about how
> the world works.

I agree that you can test a hypothesis by using an observational study, but that
does not make it an experiment.  The original poster was looking for a
definition to use in a lecture, and an experiment, by definition, involves
assignment of treatments to experimental units.

> All of that said, the situation described below is what I would call a
> hypothesis-generating activity. That is, you want to look for a
> potential correlation that you will use to then test specific mechanisms
> (i.e., doex chemical X kill fish?). It would be a hypothesis-testing
> activity, if you had a prespecified hypothesis concerning a particular
> pollutant that previous experiments have shown to kill or otherwise harm
> fish.

A study is hypothesis testing if the investigator is using it as such.  Whether
the particular study design would be expected to yield a valid answer is another
matter.

-Jay





RE: SPC control limits

2002-02-22 Thread Simon, Steve, PhD





Jay Warner writes:


>The 'party line' is to take the first 30 or so points, calculate
>limits, throw out any outside ones & add more at the end, until you
>have 30 points, all of which are inside the control limits.


Actually, I have heard the opposite. Out-of-control points do not have THAT much of an impact on the control limits, so that recomputing control limits after removing out of control points is not a good use of your time. Wheeler makes the suggestion that the time spent fine tuning the chart might be better spent investigating why the original points are out of control.

I am not an SPC guru, though, so take my advice with a grain of salt.


Steve Simon, [EMAIL PROTECTED], Standard Disclaimer.
The STATS web page has moved to
http://www.childrens-mercy.org/stats
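
For what it's worth, a sketch of the limit calculation being debated (an
individuals chart from the first 30 or so points, with limits at the mean
plus or minus 2.66 times the average moving range); the data below are
invented, and whether to recompute after removing flagged points is exactly
the judgment call discussed above:

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(10.0, 1.0, 30)            # first 30 or so points (made up)
x[12] += 5.0                             # plant one wild point

center = x.mean()
mr_bar = np.mean(np.abs(np.diff(x)))     # average moving range
ucl = center + 2.66 * mr_bar             # 2.66 = 3/d2, with d2 = 1.128 for n = 2
lcl = center - 2.66 * mr_bar

outside = np.where((x > ucl) | (x < lcl))[0]
print("limits:", lcl, ucl, "  points outside:", outside)
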






Re: Evaluation of skating

2002-02-22 Thread Rich Ulrich

On 19 Feb 2002 15:14:01 -0800, [EMAIL PROTECTED] (Trevor Bond)
wrote:
[ snip, much ]
> affected who won the gold medal.  In fact, Looney (1994, p. 156) 
> concluded:
>   "all of the judges with an Eastern block or communistic background 
> not only ranked Baiul better than expected, but ranked Kerrigan 
> worse.  The same trend was seen for Western Block judges.  They 
> ranked Baiul worse and Kerrigan better than expected.  ... "

Finding a difference is one thing.  
Drawing invidious conclusions is a gratuitous step for
a statistician, isn't it?

Hypothesize.
Group A  holds a bunch of their own, inter-community skating
competitions.  So does group B.  This happens for many years.

I find it wholly reasonable -- if not expected -- that 
'community standards'  might exist with some 
divergence.  That's especially so when there were 
never any joint standards in the first place, and when
one country has the outstanding professional dance 
(ballet) of the world, which is accorded much local 
respect.  

>   When the 
> median of the expected ranks is determined, Kerrigan would be 
> declared the winner.  Before the free skate began, all the judges 
> knew the rank order of the skaters from the technical program and the 
> importance of the free skate performance in determining the gold 
> medal winner.  This may be why some judging bias was more prevalent 
> in the free skate than in the technical program."

Or, it could be (as the name suggests) that  'free skate'  offers 
more individual choices, more choices that will please or offend
personal tastes.

>   Looney's investigation of the effect of judge's ratings on 
> the final placement of skaters objectively validates what a chorus of 
> disbelieving armchair judges had suspected.  The median rank system 

Hooey.  You can't 'objectively'  validate one set of value-judgments.
You can't show that one set of scores arises 'by merit' while another,
with exactly the same salient features, does not.


The NY Times published the rankings in the pairs free program,
the one that ended up with ratings of the French judge being dropped.
There were 9 judges, labeled by nationality, and 20 teams.  
I don't know how the teams were 'qualified'  to appear here:  
there were  3 each, from Canada, Russia, and China.  In some
sense, anyway, these are the best in the world.  I have 
reproduced the data, below.

What astounds me is the uniformity of the rankings.  The *worst*
Pearson correlation between two judges (also, Spearman, 
since the scores are ranks) is 0.973, between judges from 
Japan and Russia.  Correlations with the total were above 0.98.

The NY Times highlighted the 'discrepancies' between each
judge and the Final ranking.  Of those 180 rankings, there were
two that were off by 3 (Japan rating the U.S. #13  as 10, for 
instance), 5 that were off by 2, and only 58 others  off by 1.

The most consistent rankings were by the French judge
(the scores that were thrown out).

Anyway, one consequence of that 'reliability'  is that there is 
relatively great 'statistical power'  for looking at blocs of votes,
if such exist.  I know some other rankings have been less
consistent than this; I don't know how (a)typical this level
of agreement might be for this skating event, or others.

Personally, I now suspect that there is 'collusion'  to the 
extent that judges agree, before the skate-off, about who
will be competing for 1-3 (say), 4-7, ...,  16-20.  
That might be decided on gross technical competence
(again, not invidious).
Concerns of great or small errors, difficulty, originality:  
these play a role within these strata.  And, biases about
tastes in presentations.

*= data: entered (for convenience) by judge.
* set up for SPSS to read; transpose; list; correlate.

Title   Skating Pairs, rankings by judge.
data list list / rank1 to rank20 judge(20F3.0,1x,A8).
begin data
  1  2  3  4  6  5  7  9  8 10 12 11 13 14 15 16 18 17 19 20 Russia
  1  2  3  5  4  7  6  8  9 10 11 13 12 15 14 16 17 18 19 20 China
  2  1  3  5  4  7  6  8  9 12 10 13 11 14 15 16 17 18 19 20 U.S.
  1  2  3  4  5  6  7  9  8 10 11 12 13 14 15 16 17 18 19 20 France
  1  2  3  4  5  6  7  8 10  9 11 12 14 13 15 16 17 18 19 20 Poland
  2  1  3  7  4  5  6  8 10  9 11 12 13 14 15 16 18 17 19 20 Canada
  1  2  3  4  5  6  7  8  9 10 11 12 15 14 13 16 17 19 18 20 Ukraine
  2  1  3  5  4  6  7  8  9 11 10 12 13 14 15 16 18 17 19 20 Germany
  2  1  3  4  5  7  6  8  9 12 11 13 10 15 14 16 17 19 18 20 Japan
end data.
execute.
flipnewnames= judge.
formats russia to japan(F2.0).
listall.
subtitle'Spearman' is the Pearson corr.
compute ranked= $casenum.
nonpar corr vars= russia to japan ranked /print=both.

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html



Re: What is an experiment ?

2002-02-22 Thread SSCHEINE

Let me take a (somewhat) contrarian position to those previously
expressed. An experiment is any test of a hypothesis. An experiment can
involve the use of observational (unmanipulated) data, as long as the
hypothesis is clearly stated prior to the collection of the data. While
it is true that an experiment involving manipulation can provide some of
the best evidence for causal relationships, causal relationships can be
deduced from observation data combined with other information about how
the world works.

All of that said, the situation described below is what I would call a
hypothesis-generating activity. That is, you want to look for a
potential correlation that you will use to then test specific mechanisms
(i.e., does chemical X kill fish?). It would be a hypothesis-testing
activity, if you had a prespecified hypothesis concerning a particular
pollutant that previous experiments have shown to kill or otherwise harm
fish.

Sam Scheiner


Voltolini wrote:
> 
> Hi,
> 
> I was reading a definition of "experiment" in science to be used in a
> lecture, and the use of treatments and controls is an important feature of
> an experiment, but my doubt is... is it possible to plan an experiment
> without a control and call it an "experiment"?
> 
> For example, in a polluted river basin there is a gradient of contamination
> and someone is interested in comparing the fish diversity in ten rivers of
> this basin. Then the "pollution level" is the treatment (with ten levels),
> but if there is not a clean river in the basin, I cannot use a control!
> 
> Is this an experiment anyway ?
> 
> Thanks for any comments.
>Voltolini
> 
> _
> 
> Prof. J. C. VOLTOLINI
> Grupo de Estudos em Ecologia de Mamiferos - ECOMAM
> Universidade de Taubate (UNITAU)
> Departamento de Biologia
> Taubate, SP, Brasil. CEP 12030-010
> Tel: 0XX12-2254165 (lab.), 2254277 (secret. depto.)
> FAX: 12 - 2322947
> E-Mail: [EMAIL PROTECTED]
> _
> 



Re: Question on CDF

2002-02-22 Thread Glen Barnett


Henry <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]...
> On Fri, 22 Feb 2002 08:55:42 +1100, "Glen Barnett"
> <[EMAIL PROTECTED]> wrote:
>
> >Bob <[EMAIL PROTECTED]> wrote in message
> >news:[EMAIL PROTECTED]...
> >> A straight line CDF would imply the data is uniformly distributed,
> >> that is, the probability of one event is the same as the probability
> >> of any other event.  The slope of the line would be the probability of
> >> an event.
> >
> >I doubt that - if the data were distributed uniformly on [0,1/2), say, then
> >the slope of the line would be 2!
>
> I suspect he meant probability density.

I guess that's actually correct - the slope of the pdf is zero. However, I'm
fairly certain that's not what he meant.

Glen






Re: covariates !!

2002-02-22 Thread Thom Baguley

Rich Ulrich wrote:
> I've always done this in SPSS  (6.1 and earlier) with
> ANOVA   vara  by grps(1,4) with covar/

Likewise.

However, as I'm in a funny mood, it occurred to me that you could
use the residuals from correlating the covariates with the separate
group scores as input to a Mann-Whitney U test or sign test. (It is
a stretch thinking of realistic examples where this might be better
than the standard ANCOVA).

Thom
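
A sketch of that funny-mood idea with toy numbers (one reading of it, pooling
the groups for the covariate regression; regressing within each group
separately would be the other reading; nothing here is from the thread):

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 40
group = np.repeat([0, 1], n // 2)
covar = rng.normal(50, 10, n)
score = 0.4 * covar + 3.0 * group + rng.normal(0, 4, n)   # invented data

fit = stats.linregress(covar, score)                       # partial out the covariate
resid = score - (fit.intercept + fit.slope * covar)

u, p = stats.mannwhitneyu(resid[group == 0], resid[group == 1])
print("Mann-Whitney U:", u, "  p:", p)
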





Re: Efficient convergence method

2002-02-21 Thread Jay Warner

OK.  You have a case where you sample from a 'population' of times from situation A a
number of times, and from Situation B a number of times.  Maybe C, D, etc. too.

To compare 2 of these babies, use a t test.  Keep sample size (n's of each) about
equal, and go for it.  Student 't' test is pretty darn robust.  Means test relies on
Central Limit Theorem, etc.  Depending on how much work it takes to run a simulation,
you might do this 20 times each, to get a pretty fair discriminatory ability.

to compare more than 2, say A through E, a one way AoV would work.  IF the respective
distributions were Normal, and the variances were reasonably close together.  Have to
check this out first.  If not, you have some choices.  One is to back off the
confidence in your conclusion, and select the 'best' condition for further study.  If
the 'best' one is 'obvious' this may get you on to the next step.  Otherwise, some
(possibly effective) modifications of AoV may correct the deviations from assumptions.

And I haven't said anything about testing for Power, but I suspect you are not up to
that yet.  Patience :)

This is a pretty quick way to do it.  Depending on how rigorous you want to be, it
could be more than enough.

Jay

Gooseman wrote:

> Hi,
>
> Thanks for your help! I went over these ideas and now understand my
> problem better.
>
> If I explain my simulation, it may help. Basically, I have a
> simulation where various "agents" have to find a target. The
> simulation is terminated once the target has been found. The current
> measurement of performance is "iterations taken" - time. The
> simulation settings are kept constant, aside from the starting
> positions which are random. The simulation is repeated until the
> confidence interval reaches a certain percentage [this statement may
> be wrong once you read then next step!]
>
> The simulation then changes a parameter (such as number of "agents")
> and is then repeated to sample the new population. This is done quite
> a few times.
>
> What I really need to do, is to prove with a certain confidence, that
> the MEAN time taken from Simulation A comes from a different
> population than from Simulation B, C, D, E etc.
>
> From my understanding, this may imply that I need to concentrate on the
> accuracy of the mean of simulation run wrt the real population mean
> (unknown) and then compare this to other simulation runs with
> different populations. Some suggestions have included doing a ANOVA
> analysis. Comparing multi variances was also suggested, but this
> apparenly can only be done with 2 populations.
>
> On top of this, there is a big requirement on computational efficiency
> - each simulation needs to stop when the results are accurate enough
> for the next step. So is confidence in the mean the solution (and how
> do I do that), or is it comparing various simulation runs together
> (and using what method) or is it something else, or a combination.
>
> Does this explain enough? If anyone requires any more info, just ask.
> Sorry if this explanation or question sounds vague - I am just
> starting to find my way around stats!
>
> Many thanks!
>
> [EMAIL PROTECTED] (Jay Warner) wrote in message news:<[EMAIL PROTECTED]>...
> > the real question is, 'how much accuracy (precision, variance) is
> > suitable?'
> >
> > If you were to repeat the simulation run (i.e., a test) a total of n
> > times, then you could say that the true mean elapsed time was x-bar +/-
> > (certain amount), with say 95% confidence.
> >
> > That is, if you were to then repeat the whole process, n times again, 95%
> > of the time the x-bar would fall within the +/- (certain amount) you had
> > calculated.  The average of your mean elapsed time is probably Normal, so
> > this equation can be used.  If you want to predict the one next elapsed
> > time from the next simulation run, then you have to believe that your
> > individual times are Normally distributed, or do some deeper analysis.
> > If that's confusing, I'm sorry, but it comes from what you asked.
> >
> > You can do the simulation run n times, and _estimate_ a value for mean
> > elapsed time that could be confirmed only by say 100*n runs.  Does this
> > sound like what you want?
> >
> > The eq. for the 'certain amount' is given by
> >
> > certain amount = s*z/sqrt(n)
> >
> > where s = stdev of your n run times, z = 1.96 for 95% confidence, and n =
> > number of simulation runs.
> >
> > Pick a confidence interval ('certain amount') that you like, then solve
> > for n to decide how many runs you will need to make.  Statistics cannot
> > tell you what confidence interval is suitable to your problem - that is a
> > technical issue.  It can tell you now many n's you need to reach that
> > confidence interval.
> >
> > Is this what you were looking for?
> >
> > Cheers,
> > Jay
> >
> > PS:Yes, I know 'accuracy' and 'precision' refer to different things.
> > But you used the first of these words in a way which I infer meant the
> > latt
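
A sketch of the "certain amount = s*z/sqrt(n)" formula quoted above, turned
around to give the number of simulation runs needed for a chosen half-width
(the pilot standard deviation below is an invented value):

import math

def runs_needed(s, half_width, z=1.96):        # 95% confidence by default
    # smallest n with z * s / sqrt(n) <= half_width
    return math.ceil((z * s / half_width) ** 2)

s = 12.0                                       # stdev of elapsed times from a pilot batch
for hw in (5.0, 2.0, 1.0):
    print("half-width", hw, "->", runs_needed(s, hw), "runs")
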

RE: Chi-square chart in Excel

2002-02-21 Thread Dennis Roberts

sure is easy in minitab ... one can draw a very nice curve (it's easy but, 
hard to post here) but, to make a distribution easy for viewing we can

MTB > rand 100000 c1;   <<< generate 100000 values from
SUBC> chis 4.    <<< a chi-square distribution with 4 degrees of freedom
MTB > dotp c1

Dotplot: C1


Each dot represents up to 778 points
[dotplot rows garbled in plain text: the points pile up between 0 and about 6 and tail off to the right]
  +-+-+-+-+-+---C1
0.0   6.0  12.0  18.0  24.0  30.0

MTB > desc c1

Descriptive Statistics: C1


Variable       N      Mean    Median    TrMean     StDev   SE Mean
C1        100000    4.0123    3.3729    3.7727    2.8350    0.0090

Variable   Minimum   Maximum        Q1        Q3
C1          0.0080   29.4143    1.9236    5.4110

not quite as fancy as the professional graph but, will do in a pinch



At 06:27 PM 2/21/02 -0800, David Heiser wrote:


>-Original Message-
>From: [EMAIL PROTECTED]
>[mailto:[EMAIL PROTECTED]]On Behalf Of Ronny Richardson
>Sent: Wednesday, February 20, 2002 7:29 PM
>To: [EMAIL PROTECTED]
>Subject: Chi-square chart in Excel
>
>
>Can anyone tell me how to produce a chart of the chi-square distribution in
>Excel? (I know how to find chi-square values but not how to turn those into
>a chart of the chi-square curve.)
>
>
>Ronny Richardson
>---
>Excel does not have a function that gives the Chi-Square density
>
>The following might be helpful regarding future graphs. It is a fraction of
>a larger "package" I am preparing. It is awkward to present it in .txt
>format.
>
>
>DISTRIBUTION              DENSITY        CUMULATIVE     INVERSE
>Beta                                     BETADIST       BETAINV
>Binomial                                 BINOMDIST      CRITBINOM
>Chi-Square                               CHIDIST        CHINV
>Exponential                EXPONDIST     EXPONDIST
>F                                        FDIST          FINV
>Gamma                      GAMMADIST     GAMMADIST      GAMMAINV
>Hypergeometric             HYPGEOMDIST
>Log Normal                               LOGNORMDIST    LOGINV
>Negative Binomial          NEGBINOMDIST
>Normal (with parameters)   NORMDIST      NORMDIST       NORMINV
>Normal (z values)                        NORMSDIST      NORMSINV
>Poisson                    POISSON
>t                                        TDIST          TINV
>Weibull                    WEIBULL
>
>You have to build a column (say B) of X values.
>
>Build an expression for column C calculating the Chi-Square density, given
>the x value in col B and the df value in A1.
>
>It would be "=EXP(-($A$1/2)*LN(2) - GAMMALN($A$1/2) + (($A$1/2)-1)*LN(B1) -
>B1/2)" without the quotes.
>You can equation-drag this cell down column C for each X value.
>
>Now build a smoothed scatter plot graph as series 1 with the X value column
>B and the Y value as column C.
>
>DAHeiser
>
>
>



Re: double exponential smoothing

2002-02-21 Thread Neville X. Elliven

Yvette wrote:

>Prediction Model using double exponential smoothing  to estimate the
>linear trend in prices (as originally reported) and extending the
>trend to future years.  The base period is about 8-10 years and a
>smoothing constant needs to be used to make the trend fairly
>responsive to change.  I'm not sure what this constant should be or
>how it is obtained.

Your textbook and your instructor should each have more meaningful hints 
to offer than we, the readers of this small amount of information.
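
That said, a bare-bones sketch of double exponential (Holt) smoothing, with
invented prices and arbitrary smoothing constants; in practice the constants
are often chosen by minimizing one-step-ahead squared error over the base
period (none of this is from the original question):

import numpy as np

def holt(y, alpha, beta):
    level, trend = y[0], y[1] - y[0]           # crude initial level and trend
    for t in range(1, len(y)):
        prev = level
        level = alpha * y[t] + (1 - alpha) * (level + trend)
        trend = beta * (level - prev) + (1 - beta) * trend
    return level, trend

prices = np.array([100, 104, 109, 112, 118, 121, 127, 131, 138, 142], float)
level, trend = holt(prices, alpha=0.3, beta=0.2)
print("one year ahead:", level + trend)
print("two years ahead:", level + 2 * trend)
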





RE: Chi-square chart in Excel

2002-02-21 Thread David Heiser



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Ronny Richardson
Sent: Wednesday, February 20, 2002 7:29 PM
To: [EMAIL PROTECTED]
Subject: Chi-square chart in Excel


Can anyone tell me how to produce a chart of the chi-square distribution in
Excel? (I know how to find chi-square values but not how to turn those into
a chart of the chi-square curve.)


Ronny Richardson
---
Excel does not have a function that gives the Chi-Square density

The following might be helpful regarding future graphs. It is a fraction of
a larger "package" I am preparing. It is awkward to present it in .txt
format.


DISTRIBUTION              DENSITY        CUMULATIVE     INVERSE
Beta                                     BETADIST       BETAINV
Binomial                                 BINOMDIST      CRITBINOM
Chi-Square                               CHIDIST        CHINV
Exponential                EXPONDIST     EXPONDIST
F                                        FDIST          FINV
Gamma                      GAMMADIST     GAMMADIST      GAMMAINV
Hypergeometric             HYPGEOMDIST
Log Normal                               LOGNORMDIST    LOGINV
Negative Binomial          NEGBINOMDIST
Normal (with parameters)   NORMDIST      NORMDIST       NORMINV
Normal (z values)                        NORMSDIST      NORMSINV
Poisson                    POISSON
t                                        TDIST          TINV
Weibull                    WEIBULL

You have to build a column (say B) of X values.

Build an expression for column C calculating the Chi-Square density, given
the x value in col B and the df value in A1.

It would be "=EXP(-($A$1/2)*LN(2) - GAMMALN($A$1/2) + (($A$1/2)-1)*LN(B1) -
B1/2)" without the quotes.
You can equation-drag this cell down column C for each X value.

Now build a smoothed scatter plot graph as series 1 with the X value column
B and the Y value as column C.

DAHeiser
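
An alternative to the Excel recipe, for anyone with Python at hand (a sketch,
not part of the post; scipy and matplotlib are assumed): scipy ships the
chi-square density directly, so the chart is a few lines.

import numpy as np
from scipy.stats import chi2
import matplotlib.pyplot as plt

df = 4
x = np.linspace(0, 30, 300)
plt.plot(x, chi2.pdf(x, df))                  # the density function Excel lacks
plt.xlabel("x")
plt.ylabel("chi-square(%d) density" % df)
plt.show()
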






Re: Probabilities

2002-02-21 Thread Herman Rubin

In article <[EMAIL PROTECTED]>,
Zachary Agatstein  <[EMAIL PROTECTED]> wrote:
>Can you help me solve this problem:

>There are 8 baskets and 4 apples.  Thrown at random, 3 of the 4 apples
>can go to any basket.  The 4th apple, however, can only be thrown into
>baskets 1 through 4.  What is the probability that there is no more than
>one apple in every basket?

>Now, I can easily solve this problem if the 4th apple could also go to
>any of the 8 baskets.  The probability referred to above can be computed
>as follows:
>P = factorial(8)/((8 to the power of 4)*factorial(8-4)) = 0.41015625.

>But the restriction for the 4th apple to only be limited to baskets 1
>thru 4 would obviously change that probability. How?

A simple argument, without calculation, shows that the probability
is exactly the same.  For whatever basket the fourth apple is in,
the probability is exactly the same that none of the other three
apples go to that basket, or to the same basket as any other.

Now if two apples are not equally likely to go to each basket,
and are not independent, this argument fails.
-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054   FAX: (765)494-0558
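
A quick Monte Carlo check of that symmetry argument (a sketch, not a proof;
basket labels 0-7, with apple 4 restricted to baskets 0-3):

import numpy as np

rng = np.random.default_rng(5)
trials = 200_000
three = rng.integers(0, 8, size=(trials, 3))    # apples 1-3: any basket
fourth = rng.integers(0, 4, size=(trials, 1))   # apple 4: baskets 1-4 only
throws = np.concatenate([three, fourth], axis=1)

s = np.sort(throws, axis=1)
all_distinct = np.all(np.diff(s, axis=1) != 0, axis=1)   # no basket gets two apples
print(all_distinct.mean())   # the argument above says this should be near 0.41015625
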





Re: How to test whether f(X,Y)=f(X)f(Y) is true??

2002-02-21 Thread Rich Ulrich

On Wed, 20 Feb 2002 22:21:38 -, "Chia C Chong"
<[EMAIL PROTECTED]> wrote:

[snip, various discussion before]
> 
> I have an example of data on 2 RVs. When I tested the correlation between
> them, by simply finding the correlation coefficient, it shows that the
> correlation coefficient is so small and therefore, I could say that these
> two RVs are uncorrelated, or better still, not linearly correlated. 

Right!
>   However,
> when I plotted the scatter plot of them, it is clearly shown that one of the
> variables does depend on the other variable in some kind of pattern; it is
> just that they are not linearly dependent, hence the almost zero
> correlation coefficient. So, I just wonder whether there is any kind of test
> that I could use to test dependency between 2 variables...

Construct a test that checks for features.  What features?
Well, what features characterize your *observed*  dependency,
in a generalized way?  -- you do want a description that would
presumably have a chance for describing some future set of
data.

The null hypothesis is that the joint density is merely 
the product of the separate densities.  
For a picture:  a greytone backdrop changes just gradually, 
as you move in any direction.  Distinct lines or blotches are
'dependencies' -- whenever they are more distinct
than would  'arise by chance.'

The best test to detect vague blotches would not be the
best to detect sharp spots, and that would be different
from detecting lines.

As I wrote before , 
> >
> > So there is an infinite variety of tests conceivable.
> > So the  *useful*  test is the one that avoids 'Bonferroni correction,"
> > because it is the one you perform because
> > you have some reason for it.
> >
--
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html
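
One concrete test in that spirit (a sketch, with an invented quadratic
dependency standing in for the poster's data): bin both variables and test
independence of the resulting table; it picks up the curved pattern even
though the correlation is near zero.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(0, 1, 500)
y = x**2 + rng.normal(0, 0.3, 500)         # dependent, but Pearson r is about 0
print("Pearson r:", stats.pearsonr(x, y)[0])

xbin = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))   # quartile bins
ybin = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))
table = np.zeros((4, 4))
for i, j in zip(xbin, ybin):
    table[i, j] += 1

chi2_stat, p, dof, _ = stats.chi2_contingency(table)
print("chi-square:", chi2_stat, "  df:", dof, "  p:", p)
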





Re: Question on CDF

2002-02-21 Thread Henry

On Fri, 22 Feb 2002 08:55:42 +1100, "Glen Barnett"
<[EMAIL PROTECTED]> wrote:

>Bob <[EMAIL PROTECTED]> wrote in message
>news:[EMAIL PROTECTED]...
>> A straight line CDF would imply the data is uniformly distributed,
>> that is, the probability of one event is the same as the probability
>> of any other event.  The slope of the line would be the probability of
>> an event.
>
>I doubt that - if the data were distributed uniformly on [0,1/2), say, then
>the slope of the line would be 2!

I suspect he meant probability density.




