multivariate techniques for large datasets

2001-06-11 Thread srinivas

Hi,

  I have a problem identifying the right multivariate tools to
handle a dataset of dimension 100,000 x 500. The problem is further
complicated by a lot of missing data. Can anyone suggest a way to
reduce the data set and also to estimate the missing values? I need
to know which clustering tool is appropriate for grouping the
observations (based on the 500 variables).


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: multivariate techniques for large datasets

2001-06-12 Thread Donald Burrill

On 11 Jun 2001, srinivas wrote:

>   I have a problem identifying the right multivariate tools to 
> handle a dataset of dimension 100,000 x 500.  The problem is further
> complicated by a lot of missing data.

So far, you have not described the problem you want to address, nor the 
models you think may be appropriate to the situation.  Consequently, 
no-one will be able to offer you much assistance. 

> Can anyone suggest a way to reduce the data set and also to 
> estimate the missing values? 

There are a variety of ways of estimating missing values, all of which 
depend on the model you have in mind for the data, and the reason(s) you 
think you have for substituting estimates for the missing data.

> I need to know which clustering tool is appropriate for grouping the
> observations (based on the 500 variables).

No answer is possible without context.  No context has been supplied.

 
 Donald F. Burrill [EMAIL PROTECTED]
 184 Nashua Road, Bedford, NH 03110  603-471-7128






Re: multivariate techniques for large datasets

2001-06-12 Thread Rich Ulrich

On 11 Jun 2001 22:18:11 -0700, [EMAIL PROTECTED] (srinivas) wrote:

> Hi,
> 
>   I have a problem identifying the right multivariate tools to
> handle a dataset of dimension 100,000 x 500. The problem is further
> complicated by a lot of missing data. Can anyone suggest a way to
> reduce the data set and also to estimate the missing values? I need
> to know which clustering tool is appropriate for grouping the
> observations (based on the 500 variables).

What this calls for is 'an intelligent user' with a little experience.

Look at all the data, and figure out what would make a
'random' subset.  There are not many purposes that
require more than 10,000 cases, so long as your
sampling gives you a few hundred in every interesting
category.  [This can cut down your subsequent
computer processing time, since 100,000 cases times 500
variables could be a couple of hundred megabytes, and might
take some time just for the disk reading.]

Look at the means / SDs / number missing for all 500 variables;
look at frequency tabulations for the categorical ones;
look at cross-tabulations between a few variables of
your 'primary' interest and the rest.  Throw out what
is relatively useless.
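A minimal sketch of that sampling-and-screening pass in R; the file
name and the two categorical column names are hypothetical, and it
assumes the whole file can be read into memory:

  dat <- read.csv("big.csv")              # hypothetical 100,000 x 500 file

  set.seed(1)                             # draw a random 10,000-case subsample
  sub <- dat[sample(nrow(dat), 10000), ]

  num <- sub[sapply(sub, is.numeric)]     # screen the numeric variables
  screen <- data.frame(
    mean   = sapply(num, mean, na.rm = TRUE),
    sd     = sapply(num, sd,   na.rm = TRUE),
    n.miss = sapply(num, function(x) sum(is.na(x))))
  head(screen[order(-screen$n.miss), ])   # worst missingness first

  table(sub$flavour)                      # hypothetical categorical column
  table(sub$flavour, sub$size)            # cross-tabulation of two of them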

For *your* purposes, how do you combine logical categories?
The 8-ounce size with the 24-ounce; chocolate with vanilla; etc.
A computer program won't tell you what makes sense,
not for another few years.

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





Re: multivariate techniques for large datasets

2001-06-12 Thread Sidney Thomas

srinivas wrote:
> 
> Hi,
> 
>   I have a problem identifying the right multivariate tools to
> handle a dataset of dimension 100,000 x 500. The problem is further
> complicated by a lot of missing data. Can anyone suggest a way to
> reduce the data set and also to estimate the missing values? I need
> to know which clustering tool is appropriate for grouping the
> observations (based on the 500 variables).

This may not be the answer to your question, but clearly you need a
good statistical package that will allow you to manipulate the data
in ways that make sense and to devise simplification strategies
appropriate to the context. I recently went through a similar
exercise, smaller than yours but still complex: approximately 5,000
cases by 65 variables. I used the statistical package R, and I can
tell you it was a godsend.

In previous incarnations (more than 10 years ago) I had used at
various times (depending on employer) BMDP, SAS, SPSS, and S. I had
liked S best of the lot because of the advantages I found in the
Unix environment. Nowadays I have Linux on the desktop, and I looked
for the package closest to S in spirit, which turned out to be R.
That it is free software was a bonus. That it is a fully extensible
programming language in its own right gave me everything I needed,
as I tend to "roll my own" when I do statistical analysis, combining
elements of possibilistic analysis of the likelihood function
derived from fuzzy set theory.

At any rate, if that was indeed your question, and if you're on a
tight budget, I would say get a Linux box (a fast one, with lots of
RAM and hard disk space), download a copy of R, and start with the
graphing tools that allow you, as a first step, to "look at" the
data. Sensible ways of grouping and simplifying will suggest
themselves to you, and inevitably thereafter you'll want to fit some
regression models and/or do some analysis of variance. If you're
*not* on a tight budget, and/or you have access to a fancy
workstation, then you might also have access to your choice of
expensive stats packages. If I were you, I would still opt for R,
essentially because of its programmability, which in my recent work
I found to be indispensable. Hope this is of help. Good luck.
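A rough illustration of that first "look at the data" step in R,
using a hypothetical file and hypothetical variable names:

  dat <- read.csv("survey.csv")           # hypothetical data file

  summary(dat)                            # quick overview of every column
  hist(dat$income)                        # distribution of one variable
  boxplot(income ~ region, data = dat)    # one variable split by a grouping
  pairs(dat[, 1:6])                       # scatterplots, if these columns are numeric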

S. F. Thomas





Re: multivariate techniques for large datasets

2001-06-13 Thread Tracey Continelli

Sidney Thomas <[EMAIL PROTECTED]> wrote in message 
news:<[EMAIL PROTECTED]>...
> srinivas wrote:
> > 
> > Hi,
> > 
> >   I have a problem identifying the right multivariate tools to
> > handle a dataset of dimension 100,000 x 500. The problem is further
> > complicated by a lot of missing data. Can anyone suggest a way to
> > reduce the data set and also to estimate the missing values? I need
> > to know which clustering tool is appropriate for grouping the
> > observations (based on the 500 variables).

One of the best ways to handle missing data is to impute the mean
taken from other cases that share the same value on a related
variable.  If I'm doing
psychological research and I am missing some values on my depression
scale for certain individuals, I can look at their, say, locus of
control reported and impute the mean value.  Let's say [common
finding] that I find a pattern - individuals with a high locus of
control report low levels of depression, and I have a scale ranging
from 1-100 listing locus of control.  If I have a missing value for
depression at level 75 for one case, I can take the mean depression
level for all individuals at level 75 of locus of control and impute
that for all missing cases in which 75 is the listed locus of control
value.  I'm not sure why you'd want to reduce the size of the data
set, since for the most part the larger the "N" the better.
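A minimal sketch of that kind of conditional mean imputation in base
R; the data frame and the column names (depression, locus) are
hypothetical:

  ## group mean of depression at each level of locus of control
  grp.mean <- ave(dat$depression, dat$locus,
                  FUN = function(x) mean(x, na.rm = TRUE))
  miss <- is.na(dat$depression)
  dat$depression[miss] <- grp.mean[miss]  # fill gaps with the conditional mean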


Tracey Continelli





Re: multivariate techniques for large datasets

2001-06-13 Thread Eric Bohlman

In sci.stat.consult Tracey Continelli <[EMAIL PROTECTED]> wrote:
> value.  I'm not sure why you'd want to reduce the size of the data
> set, since for the most part the larger the "N" the better.

Actually, for datasets of the OP's size, the increase in power from the 
large size is a mixed blessing, for the same reason that many 
hard-of-hearing people don't terribly like wearing hearing aids: they 
bring up the background noise just as much as the signal.  With an N of 
one million, practically *any* effect you can test for is going to be 
significant, regardless of how small it is.
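A quick simulated illustration of the point in R: with a million
cases split across two groups, a difference of one-hundredth of a
standard deviation still comes out as highly "significant".

  set.seed(1)
  x <- rnorm(500000, mean = 0)            # group 1
  y <- rnorm(500000, mean = 0.01)         # group 2, trivially different
  t.test(x, y)$p.value                    # typically far below 0.05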






Re: multivariate techniques for large datasets

2001-06-14 Thread Herman Rubin

In article <9g9k9f$h4c$[EMAIL PROTECTED]>,
Eric Bohlman <[EMAIL PROTECTED]> wrote:
>In sci.stat.consult Tracey Continelli <[EMAIL PROTECTED]> wrote:
>> value.  I'm not sure why you'd want to reduce the size of the data
>> set, since for the most part the larger the "N" the better.

>Actually, for datasets of the OP's size, the increase in power from the 
>large size is a mixed blessing, for the same reason that many 
>hard-of-hearing people don't terribly like wearing hearing aids: they 
>bring up the background noise just as much as the signal.  With an N of 
>one million, practically *any* effect you can test for is going to be 
>significant, regardless of how small it is.


This just points out another stupidity of the use of 
"significance testing".  Since the null hypothesis is
false anyhow, why should we care what happens to be
the probability of rejecting when it is true?

State the REAL problem, and attack this.  

-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054   FAX: (765)494-0558





Re: multivariate techniques for large datasets

2001-06-14 Thread Rich Ulrich

On 13 Jun 2001 20:32:51 -0700, [EMAIL PROTECTED] (Tracey
Continelli) wrote:

> Sidney Thomas <[EMAIL PROTECTED]> wrote in message 
>news:<[EMAIL PROTECTED]>...
> > srinivas wrote:
> > > 
> > > Hi,
> > > 
> > >   I have a problem identifying the right multivariate tools to
> > > handle a dataset of dimension 100,000 x 500. The problem is further
> > > complicated by a lot of missing data. Can anyone suggest a way to
> > > reduce the data set and also to estimate the missing values? I need
> > > to know which clustering tool is appropriate for grouping the
> > > observations (based on the 500 variables).
> 
> One of the best ways to handle missing data is to impute the mean
> taken from other cases that share the same value on a related
> variable.  If I'm doing
> psychological research and I am missing some values on my depression
> scale for certain individuals, I can look at their, say, locus of
> control reported and impute the mean value.  Let's say [common
> finding] that I find a pattern - individuals with a high locus of
> control report low levels of depression, and I have a scale ranging
> from 1-100 listing locus of control.  If I have a missing value for
> depression at level 75 for one case, I can take the mean depression
> level for all individuals at level 75 of locus of control and impute
> that for all missing cases in which 75 is the listed locus of control
> value.  I'm not sure why you'd want to reduce the size of the data
> set, since for the most part the larger the "N" the better.

Do you set numeric limits for a variable, and for a person?
Do you make sure, first, that there is not a pattern?

That is: do you do something different depending on
how many values are missing?  Say, estimate the value if it is an
oversight in filling in blanks on a form, BUT drop a variable if
more than 5% of responses are unexpectedly missing, since
(obviously) there was something wrong in the conception of it
or in the collection of it.  Psychological research (possibly)
expects fewer missing values than market research.
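A minimal sketch of that kind of screening rule in R, assuming the
data are already in a data frame called dat (a hypothetical name):

  p.miss <- colMeans(is.na(dat))          # proportion missing, per variable
  keep   <- p.miss <= 0.05                # the 5% rule of thumb above
  dat2   <- dat[, keep]
  sort(p.miss[!keep], decreasing = TRUE)  # the variables that would be dropped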

As to the N: as I suggested before, my computer takes
more time to read 50 megabytes than one megabyte.  But
a psychologist should understand that it is easier to look at,
grasp, and balance raw numbers that are only two or
three digits long, compared to five and six.

A COMMENT ABOUT HUGE DATABASES.

And as a statistician, I keep noticing that HUGE databases
tend to consist of aggregations.  And these are "random"
samples only in the sense that they are uncontrolled, and 
their structure is apt to be ignored.

If you start to sample, you are more likely to ask yourself about
the structure: by time, geography, what-have-you.

An N of millions gives you tests that are wrong; estimates
ignoring "relevant" structure carry a spurious report of precision.
To put it another way: the error (or real variation) that *exists*
between a fixed number of units (years, or cities, for what I
mentioned above) is something that you want to generalize across.
With a small N, that error term is (we assume?) small enough to
ignore.  However, that error term will not decrease with N,
so with a large N it will eventually dominate.  The test
based on N becomes increasingly irrelevant.

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





Re: multivariate techniques for large datasets

2001-06-14 Thread S. F. Thomas

Herman Rubin wrote:
> 
> In article <9g9k9f$h4c$[EMAIL PROTECTED]>,
> Eric Bohlman <[EMAIL PROTECTED]> wrote:
> >In sci.stat.consult Tracey Continelli <[EMAIL PROTECTED]> wrote:
> >> value.  I'm not sure why you'd want to reduce the size of the data
> >> set, since for the most part the larger the "N" the better.
> 
> >Actually, for datasets of the OP's size, the increase in power from the
> >large size is a mixed blessing, for the same reason that many
> >hard-of-hearing people don't terribly like wearing hearing aids: they
> >bring up the background noise just as much as the signal.  With an N of
> >one million, practically *any* effect you can test for is going to be
> >significant, regardless of how small it is.
> 
> This just points out another stupidity of the use of
> "significance testing".  Since the null hypothesis is
> false anyhow, why should we care what happens to be
> the probability of rejecting when it is true?
> 
> State the REAL problem, and attack this.

How true! The only drawback there can be to more rather than less
data for inferential purposes would have to center around the extra
cost of computation, rather than the inconvenience posed to
significance testing methodology. 

There is a significant philosophical question lurking here. It is a
reminder of how we get so attached to the tools we use that we
sometimes turn their bugs into features. Significance testing is a
make-do construction of classical statistical inference, in some
sense an indirect way of characterizing the uncertainty surrounding a
parameter estimate. The Bayesian approach of attempting to
characterize such uncertainty directly, rather than indirectly, and
further of characterizing directly, through some function
transformation of the parameter in question, the uncertainty
surrounding some consequential loss or profit function critical to
some real-world decision, is clearly laudable... if it can be
justified. 

Clearly, from a classicist's perspective, the Bayesians have failed
at this attempt at justification, otherwise one would have to be a
masochist to stick with the sheer torture of classical inferential
methods. Besides, the Bayesians indulge not a little in turning bugs
into features themselves. 

At any rate, I say all that to say this: once it is recognized that
there is a valid (extended) likelihood calculus, as easy to
manipulate as the probability calculus, for attempting a direct
characterization of the uncertainty surrounding statistical model
parameters, the gap between these two ought to be closed.

I'm not holding my breath, as this may take several generations. We
all reach for the tool we know how to use, not necessarily for the
best tool for the job. 

> Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
> [EMAIL PROTECTED] Phone: (765)494-6054   FAX: (765)494-0558

Regards,
S. F. Thomas





Re: multivariate techniques for large datasets

2001-06-18 Thread Art Kendall

You might want to go to http://www.pitt.edu/~csna/
and then cross-post your question to CLASS-L.

The Classification Society meeting this weekend had a lot of discussion of
these topics.

My first question is whether you intend to interpret the clusters.

If so, what is the nature of the 500 variables?
What is the nature of your cases?
What does the set of cases represent?
How much data is missing?  What kinds of missing data do you have?
What do you want to do with the cluster results?
Are you interested in a tree or a simple clustering?


Many users of clustering apply data-reduction techniques such as factor
analysis to summarize the variability of the 500 variables with a smaller
number of dimensions.
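A rough sketch of that reduce-then-cluster idea in R, using principal
components (a close relative of the factor-analytic summary mentioned
above) followed by k-means; the numbers of components and clusters
are arbitrary choices for illustration, and it assumes a numeric data
matrix with the missing values already dealt with:

  pc     <- prcomp(dat, scale. = TRUE)    # principal components of the 500 variables
  scores <- pc$x[, 1:10]                  # keep, say, the first 10 components
  km     <- kmeans(scores, centers = 8, nstart = 20)
  table(km$cluster)                       # cluster sizes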



srinivas wrote:

> Hi,
>
>   I have a problem identifying the right multivariate tools to
> handle a dataset of dimension 100,000 x 500. The problem is further
> complicated by a lot of missing data. Can anyone suggest a way to
> reduce the data set and also to estimate the missing values? I need
> to know which clustering tool is appropriate for grouping the
> observations (based on the 500 variables).


