On 13 Jun 2001 20:32:51 -0700, [EMAIL PROTECTED] (Tracey
Continelli) wrote:

> Sidney Thomas <[EMAIL PROTECTED]> wrote in message 
>news:<[EMAIL PROTECTED]>...
> > srinivas wrote:
> > > 
> > > Hi,
> > > 
> > >   I have a problem identifying the right multivariate tools to
> > > handle a data set of dimension 100,000 x 500. The problem is further
> > > complicated by a lot of missing data. Can anyone suggest a way to
> > > reduce the data set and also to estimate the missing values? I need to
> > > know which clustering tool is appropriate for grouping the
> > > observations (based on the 500 variables).
> 
> One of the best ways to handle missing data is to impute the mean
> from other cases that share the same value on a related variable.  If
> I'm doing psychological research and I am missing some values on my
> depression scale for certain individuals, I can look at, say, their
> reported locus of control and impute the mean value.  Let's say [a
> common finding] that I find a pattern: individuals with a high locus
> of control report low levels of depression, and I have a scale
> ranging from 1-100 measuring locus of control.  If I have a missing
> depression value for a case at level 75 of locus of control, I can
> take the mean depression level of all individuals at level 75 and
> impute that for every missing case in which 75 is the listed locus of
> control value.  I'm not sure why you'd want to reduce the size of the
> data set, since for the most part the larger the "N" the better.
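
A minimal sketch of that group-mean imputation, in Python with pandas
(the column names "locus_of_control" and "depression" and the toy data
are invented for illustration, not taken from the posts above):

    import pandas as pd
    import numpy as np

    # Toy data: depression is missing for some respondents.
    df = pd.DataFrame({
        "locus_of_control": [75, 75, 75, 40, 40],
        "depression":       [20.0, 30.0, np.nan, 80.0, np.nan],
    })

    # Mean depression among cases sharing the same locus-of-control value...
    group_means = df.groupby("locus_of_control")["depression"].transform("mean")

    # ...imputed wherever depression is missing.
    df["depression"] = df["depression"].fillna(group_means)

    print(df)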

Do you draw numeric limits for a variable, and for a person?
Do you make sure, first, that there is not a pattern?

That is -- Do you do something different depending on
how many are missing?  Say, estimate the value if it is an
oversight in filling in blanks on a form, BUT drop a variable if
more than 5% of responses are unexpectedly missing, since
(obviously) there was something wrong in the conception of it,
or the collection of it....  Psychological research (possibly)
expects fewer missing values than market research.
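
As a sketch of that kind of screening rule (the 5% cutoff is only the
figure mentioned above, and the toy table is invented):

    import pandas as pd
    import numpy as np

    # Toy respondent-by-variable table; NaN marks a missing response.
    df = pd.DataFrame({
        "q1": [1, 2, 3, 4, 5],
        "q2": [5, np.nan, np.nan, np.nan, np.nan],
        "q3": [1, 1, 2, 2, 3],
    })

    # Fraction of responses missing for each variable.
    missing_rate = df.isna().mean()

    # Drop variables above the cutoff; the rest stay candidates for
    # case-by-case imputation.
    df_trimmed = df.drop(columns=missing_rate[missing_rate > 0.05].index)

    print(missing_rate)
    print(df_trimmed.columns.tolist())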

As to the N -  As I suggested before, my computer takes
more time to read 50 megabytes than one megabyte.  But
a psychologist should understand that it is easier to look at,
grasp, and balance raw numbers that are only two or
three digits, compared to five or six.

A COMMENT ABOUT HUGE DATA-BASES.

And as a statistician, I keep noticing that HUGE databases
tend to consist of aggregations.  And these are "random"
samples only in the sense that they are uncontrolled, and 
their structure is apt to be ignored.

If you start to sample, you are more likely to ask yourself about
the structure - by time, geography, what-have-you.

An N of millions gives you tests that are wrong; estimates
ignoring "relevant" structure come with a spurious report of precision.
To put it another way: the Error (or real variation) that *exists*
between a fixed number of units (years, or cities, for what I
mentioned above) is something that you want to generalize across.
With a small N, that error term is (we assume?) small enough to
ignore.  However, that error term will not decrease with N,
so with a large N, it will eventually dominate.  The test
based on N becomes increasingly irrelevant.
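
A small simulation sketch of that point (the number of units and the
variance figures are invented; the split into between-unit and
within-unit variance is the usual components-of-variance argument):

    import numpy as np

    rng = np.random.default_rng(0)

    K = 20                # fixed number of units (cities, years, ...)
    sigma_between = 1.0   # real variation between units; does not shrink with N
    sigma_within = 5.0    # noise within a unit

    unit_means = rng.normal(0.0, sigma_between, size=K)

    for n_per_unit in (10, 1000, 100000):
        # N observations in all, n_per_unit drawn from each of the K units.
        y = np.concatenate([
            mu + rng.normal(0.0, sigma_within, size=n_per_unit)
            for mu in unit_means
        ])
        N = y.size
        naive_se = y.std(ddof=1) / np.sqrt(N)   # treats all N as independent
        floor_se = sigma_between / np.sqrt(K)   # between-unit term; fixed
        print(N, round(naive_se, 4), round(floor_se, 4))

The naive standard error keeps shrinking like 1/sqrt(N), but the
between-unit term sigma_between/sqrt(K) stays put, so with a huge N the
nominal test claims far more precision than the data support.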

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html

