On 11 Jun 2001 22:18:11 -0700, [EMAIL PROTECTED] (srinivas) wrote:

> Hi,
> 
>   I have a problem in identifying the right multivariate tools to
> handle a dataset of dimension 100,000 x 500. The problem is further
> complicated by a lot of missing data. Can anyone suggest a way to
> reduce the data set and also to estimate the missing values? I need
> to know which clustering tool is appropriate for grouping the
> observations (based on 500 variables).

The right 'tool' here is an intelligent user with a little experience.

Look at all the data, and figure out what makes a
'random' subset.  Few purposes require more than
10,000 cases, so long as your sampling gives you a
few hundred in every interesting category.  [This can
cut down your subsequent computer processing time,
since 100,000 cases times 500 variables, at a few
bytes each, comes to a couple of hundred megabytes,
and might take some time just for the disk reading.]
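
For instance, in Python with pandas (the file name and the
'category' column are only placeholders for whatever marks
your interesting groups), a stratified subsample might look
like this:

  import pandas as pd

  # Placeholder file and column names; adjust to your data.
  df = pd.read_csv("bigfile.csv")

  # Keep up to 300 cases per category, so every interesting
  # group retains a few hundred observations while the working
  # set stays far below the full 100,000 rows.
  subset = (
      df.groupby("category", group_keys=False)
        .apply(lambda g: g.sample(n=min(len(g), 300), random_state=0))
  )
  subset.to_csv("subset.csv", index=False)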

Look at the means / SDs / number missing for all 500;
look at frequency tabulations for the categorical
variables; look at cross-tabulations between a few
variables of your 'primary' interest and the rest.
Throw out what is relatively useless.
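
A rough screening pass, again in pandas (the 'outcome' and
'category' columns are made-up names, just to show the calls):

  import pandas as pd

  df = pd.read_csv("subset.csv")   # the subsample from above

  # Means, SDs, and missing counts for the numeric variables.
  numeric = df.select_dtypes("number")
  print(pd.DataFrame({
      "mean": numeric.mean(),
      "sd": numeric.std(),
      "n_missing": numeric.isna().sum(),
  }))

  # Frequency tabulations for the categorical variables.
  for col in df.select_dtypes(exclude="number").columns:
      print(df[col].value_counts(dropna=False))

  # Cross-tabulate a 'primary' variable against another.
  print(pd.crosstab(df["outcome"], df["category"]))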

For *your* purposes, how do you combine logical categories?  -
the 8-ounce size with the 24-ounce; chocolate with vanilla; etc.
A computer program won't tell you what makes sense,
not for another few years.
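
Once you have decided which levels belong together, the
recoding itself is trivial; e.g. (the 'size' and 'flavor'
variables and their levels are invented for illustration):

  import pandas as pd

  df = pd.read_csv("subset.csv")

  # Pool levels that are equivalent *for your purposes*;
  # the program only carries out the judgment you supply.
  df["size"] = df["size"].replace({"8oz": "either", "24oz": "either"})
  df["flavor"] = df["flavor"].replace({"chocolate": "choc_or_van",
                                       "vanilla": "choc_or_van"})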

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html


