This is not a simple question to answer.  What matters for many statistical tests is the
distribution of the test statistic, such as the sample mean, not the distribution of the underlying
variable, and the Central Limit Theorem guarantees that the distribution of the sample mean will
approach normality as the sample size grows, regardless of the underlying population distribution.
The place for non-parametric tests is when the sample size is small and it cannot be assumed that
the underlying population is normal.
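
A minimal simulation sketch of that point (the exponential population and all numbers below are
illustrative, not from any real data):

    import numpy as np

    rng = np.random.default_rng(0)

    # Heavily skewed "population": exponential task times.
    population = rng.exponential(scale=10.0, size=100_000)

    # Draw many samples of size n and look at the distribution of sample means.
    for n in (5, 30, 200):
        means = np.array([rng.choice(population, size=n).mean()
                          for _ in range(2_000)])
        # The skewness of the sample-mean distribution shrinks toward 0 (normal)
        # as n grows, even though the population itself is strongly skewed.
        centered = means - means.mean()
        skew = (centered**3).mean() / centered.std()**3
        print(f"n={n:4d}  skewness of sample means = {skew:+.2f}")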

In many cases, what appears to be a non-normal distribution can be transformed to a normal one.
For example, it is well known that human task performance times are positively skewed, with a long
tail of slow completions; a common treatment is to apply a log transformation, which pulls in the
tail of the curve.
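
As a rough illustration (the lognormal "task times" below are invented), one can check normality
before and after the transform with a Shapiro-Wilk test:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Invented right-skewed "task times"; a lognormal shape is a common model.
    times = rng.lognormal(mean=3.0, sigma=0.6, size=40)

    # Shapiro-Wilk normality test before and after the log transform.
    for label, x in [("raw", times), ("log", np.log(times))]:
        stat, p = stats.shapiro(x)
        print(f"{label:>3}: W={stat:.3f}, p={p:.3f}")
    # The raw times will usually look much less normal than the logged times.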

Non-parametric tests use the rank (ordinal position) of a sample point rather than its value;
effectively, they throw away the information about the size of the differences between sample
points.  As such, they usually require much larger, clearer differences before the results are
considered significant.
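
One consequence is that a rank-based test such as Mann-Whitney is unchanged by any monotone
transformation of the data.  A sketch with invented completion times:

    import numpy as np
    from scipy import stats

    # Two small groups of completion times (minutes); all values invented.
    with_tool    = np.array([12.1, 15.3, 14.8, 11.9, 13.5])
    without_tool = np.array([16.2, 18.7, 15.9, 21.4])

    # Mann-Whitney uses ranks only, so it gives the same p-value if every
    # value is, say, logged or squared; the t test does not.
    u, p_u = stats.mannwhitneyu(with_tool, without_tool, alternative="two-sided")
    t, p_t = stats.ttest_ind(with_tool, without_tool)
    print(f"Mann-Whitney: U={u:.1f}, p={p_u:.3f}")
    print(f"t test:       t={t:.2f}, p={p_t:.3f}")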

In the study you cite, the sample sizes are very small, at the limits of whether statistical
analysis is appropriate at all.  Even using an F test, the differences would have to be pretty
large before they were significant.

In planning your own study, the biggest thing to worry about is probably the sample size.  Most
likely, if you're doing a study on experienced programmers, you won't be able to get many subjects,
but you can push things in the direction of more measurements per subject.  If you have some
preliminary data that will give you a variance estimate, you may even want to calculate the power
of your experiment, e.g., given N measurements, how large a difference would have to be observed
before it reached significance?
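
As a sketch of that calculation (all numbers hypothetical), statsmodels can solve for the smallest
effect detectable at a given sample size, alpha, and target power:

    from statsmodels.stats.power import TTestIndPower

    # Suppose a pilot gives a pooled standard deviation of about 5 minutes,
    # and we expect roughly 6 subjects per group (all numbers hypothetical).
    sd, n_per_group, alpha, power = 5.0, 6, 0.05, 0.8

    # solve_power returns the standardized effect size (Cohen's d) detectable
    # at this alpha and power; multiply by sd to get it back into minutes.
    d = TTestIndPower().solve_power(nobs1=n_per_group, alpha=alpha, power=power,
                                    ratio=1.0, alternative="two-sided")
    print(f"Detectable difference ~ {d * sd:.1f} minutes (d = {d:.2f})")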

Ruven Brooks




Brian de Alwis <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
10/28/2005 01:05 PM

To: discuss@ppig.org
Subject: PPIG discuss: Statistical evaluation of technology effects



Not being a statistics expert, I wondered if someone here could
comment on the suitability of using parametric tests: has there
been prior work to demonstrate that developer performance shows
normality?

I recently came across a study of the impact of a tool on software
developers that used two parametric significance tests to show
statistically significant effects on task-completion
times and number of tasks completed.  The study compared two groups
of developers (one with 4, the other with 5 developers) as they
completed 6 tasks.  The authors used two tests and found significant
effects: a repeated-measures ANOVA across the completion times,
and a t test to compare number of tasks successfully completed.
The surrounding description and actual numbers convinced me of a
practical effect, but the statistical results seemed a little
sketchy to me.

I'm currently designing an experiment to assess the impact of a
tool on developer performance.  Ideally I'd like to have statistical
significance, but don't think I can rely on parametric methods.

Brian.

--
 Brian de Alwis | Software Practices Lab | UBC | http://www.cs.ubc.ca/~bsd/
     "Amusement to an observing mind is study." - Benjamin Disraeli

----------------------------------------------------------------------
PPIG Discuss List (discuss@ppig.org)
Discuss admin: http://limitlessmail.net/mailman/listinfo/discuss
Announce admin: http://limitlessmail.net/mailman/listinfo/announce
PPIG Discuss archive: http://www.mail-archive.com/discuss%40ppig.org/

