On 11 Jun 2001 22:18:11 -0700, [EMAIL PROTECTED] (srinivas) wrote:
> Hi,
>
> I have a problem in identifying the right multivariate tools to
> handle a dataset of dimension 100,000 x 500. The problem is further
> complicated by a lot of missing data. Can anyone suggest a way to
> reduce the data set and also to estimate the missing values? I need
> to know which clustering tool is appropriate for grouping the
> observations (based on 500 variables).
The best 'multivariate tool' here is an intelligent user with a little experience.
Look at all the data, and figure out what constitutes a
'random' subset. Few purposes require more than 10,000
cases, so long as your sampling gives you a few hundred
in every interesting category. [This can also cut down your
subsequent computer processing time, since 100,000 cases
times 500 variables could be a couple of hundred megabytes,
and might take some time just for the disk reading.]
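The sampling step above can be sketched with reservoir sampling, which draws a uniform random subset in one pass without holding the full file in memory. This is a minimal illustration, not the poster's method; the row stream here is synthetic, and in practice you would iterate over lines of your own data file.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Keep a uniform random sample of k items from a (possibly huge) iterable."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # each new row replaces one with prob. k/(i+1)
            if j < k:
                sample[j] = row
    return sample

# Synthetic stand-in for the lines of a large file:
rows = (f"case-{i}" for i in range(100_000))
subset = reservoir_sample(rows, 10_000)
print(len(subset))  # 10000
```

If you need "a few hundred in every interesting category," run one reservoir per category (stratified sampling) rather than a single pooled one.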
Look at the means, SDs, and number missing for all 500
variables; look at frequency tabulations for the categorical
ones; look at cross-tabulations between a few variables of
your 'primary' interest and the rest. Throw out what is
relatively useless.
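The screening pass above might look like this in pandas, run on a small synthetic frame (the column names and data are made up for illustration; substitute your own variables):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "size_oz": rng.choice([8, 24, np.nan], size=1000, p=[0.5, 0.4, 0.1]),
    "flavor":  rng.choice(["chocolate", "vanilla"], size=1000),
    "score":   rng.normal(50, 10, size=1000),
})

# Means, SDs, and missing counts for every column at once:
summary = pd.DataFrame({
    "mean":      df.mean(numeric_only=True),
    "sd":        df.std(numeric_only=True),
    "n_missing": df.isna().sum(),
})
print(summary)

# Frequency tabulation for a categorical variable:
print(df["flavor"].value_counts())

# Cross-tabulation between two variables of interest:
print(pd.crosstab(df["flavor"], df["size_oz"]))
```

Columns that are mostly missing, or constant, show up immediately in `summary` and are the first candidates to throw out.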
For *your* purposes, how do you combine logical categories? -
8 ounce size with 24 ounce; chocolate with vanilla; etc.
A computer program won't tell you what makes sense,
not for another few years.
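One way to make those human judgments explicit is a mapping the analyst writes and the program merely applies. The group names and records below are illustrative only:

```python
# Analyst-chosen category combinations: the program applies them,
# but a human decided what makes sense to merge.
size_groups = {8: "single-serve", 12: "single-serve",
               24: "family",      32: "family"}
flavor_groups = {"chocolate": "chocolate", "dark chocolate": "chocolate",
                 "vanilla": "vanilla",     "french vanilla": "vanilla"}

records = [(8, "dark chocolate"), (24, "vanilla"), (32, "french vanilla")]
grouped = [(size_groups[s], flavor_groups[f]) for s, f in records]
print(grouped)
# [('single-serve', 'chocolate'), ('family', 'vanilla'), ('family', 'vanilla')]
```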
--
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html
=================================================================
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
http://jse.stat.ncsu.edu/
=================================================================