Re: Statistical sampling and representative stats collection

Jack Silvey Wed, 22 May 2002 08:20:27 -0700

That is a very good question. Anyone else know the
answer?

My approach to skewness right now is to compare the
mean and the median of the entire dataset, since if
you have two measures of centrality that are not the
same, you have skewness.


/jack

--- [EMAIL PROTECTED] wrote:
> 
> Jack,
> 
> Isn't the problem with this concept that it doesn't
> take into consideration
> how skewed the data is?   Statistically significance
> would be relevant to
> perfectly distributed data but wouldn't you need a
> higher percentage of
> data for significance in more highly skewed data?
> 
> Just something to consider.
> 
> Cherie Machler
> Oracle DBA
> Gelco Information Network
> 
> 
>                                                     
>                                                     
>        
>                     Jack Silvey                     
>                                                     
>        
>                     <jack_silvey@y       To:    
> Multiple recipients of list ORACLE-L
> <[EMAIL PROTECTED]>     
>                     ahoo.com>            cc:        
>                                                     
>        
>                     Sent by:             Subject:   
>  Re: Statistical sampling and representative stats  
>        
>                     [EMAIL PROTECTED]        collection
>                                                     
>        
>                     om                              
>                                                     
>        
>                                                     
>                                                     
>        
>                                                     
>                                                     
>        
>                     05/21/02 06:19                  
>                                                     
>        
>                     PM                              
>                                                     
>        
>                     Please respond                  
>                                                     
>        
>                     to ORACLE-L                     
>                                                     
>        
>                                                     
>                                                     
>        
>                                                     
>                                                     
>        
> 
> 
> 
> 
> Hi Rafiq,
> 
> We have been using 35 percent on our warehouse, even
> on our fact partitions. Now that I have thought
> about
> it for a while, that seems like a lot given the
> volume
> of data. If a representative sample can be gathered
> with 10,000 or 50,000 or 100,000 rows, and our fact
> partitions have millions of rows each, seems like we
> could go 1% on our analyze and it would be within
> acceptable tolerances.
> 
> /jack
> 
> 
> --- Mohammad Rafiq <[EMAIL PROTECTED]> wrote:
> > The most of the list memeber agrees on estimate
> with
> > 30%....
> >
> > Regards
> > Rafiq
> >
> >
> >
> >
> > Reply-To: [EMAIL PROTECTED]
> > To: Multiple recipients of list ORACLE-L
> > <[EMAIL PROTECTED]>
> > Date: Tue, 21 May 2002 13:43:33 -0800
> >
> > Hi all,
> >
> > Did some investigation about statistical sampling
> > this
> > weekend since we are going to optimize our analyze
> > process soon, and would like some input from all
> you
> > orabrains on this one.
> >
> > I opened a TAR with Oracle asking about the
> sampling
> > algorithm of stats collection, and they assured me
> > it
> > was random.
> >
> > The goal of analyze...estimate is to collect stats
> > that are representative of the data population as
> a
> > whole using a given sample set. Since analyzing
> > tables
> > takes up resources (does sorts to order the data
> for
> > investigation) the fewer rows you use in estimate,
> > the
> > less system resources you use and the faster the
> > analyze will go.
> >
> > Since our goal is to get as small a sample as
> > possible
> > and still have stats that are representative, my
> > contention is that we could start by finding what
> > the
> > margin of error will be for each sample size and
> > gauge
> > our tolerance for it.
> >
> > One standard way to calculate margin of error for
> a
> > given sample is by using this formula:
> >
> > M = 1/SQRT(N)
> >
> > where:
> > M = margin of error
> > N=sample size
> >
> > So, if we can tolerate stats that have a 1% a
> margin
> > of error (will deviate from representative of the
> > whole population by 1%), our sample size should be
> > 10,000 rows.
> >
> > Also, a corollary (not a toyota corollary, though)
> > to
> > this would be that the more rows you add to your
> > sample, the closer to representative your sample
> > will
> > be. So, in order to test whether your sample is
> > representative enough, you could analyze using
> > either
> > estimate 49% or compute, take a snapshot of the
> > stats,
> > and then compare the stats from a 10,000 row
> > estimate
> > to those. Then, add rows to your estimate until
> you
> > are satisfied with the stats.
> >
> > This of course is a pie in the sky mathematical
> > model,
> > but seems like a reasonable place to start with
> > testing.
> >
> > Input? Input? Buhler? Buhler?
> >
> >
> > /jack silvey
> >
> >
> > __________________________________________________
> > Do You Yahoo!?
> > LAUNCH - Your Yahoo! Music Experience
> > http://launch.yahoo.com
> > --
> > Please see the official ORACLE-L FAQ:
> > http://www.orafaq.com
> > --
> > Author: Jack Silvey
> >    INET: [EMAIL PROTECTED]
> >
> > Fat City Network Services    -- (858) 538-5051 
> FAX:
> > (858) 538-5051
> > San Diego, California        -- Public Internet
> > access / Mailing Lists
> >
>
--------------------------------------------------------------------
> > To REMOVE yourself from this mailing list, send an
> > E-Mail message
> > to: [EMAIL PROTECTED] (note EXACT spelling of
> > 'ListGuru') and in
> > the message BODY, include a line containing: UNSUB
> 
=== message truncated ===


__________________________________________________
Do You Yahoo!?
LAUNCH - Your Yahoo! Music Experience
http://launch.yahoo.com
-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: Jack Silvey
  INET: [EMAIL PROTECTED]

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: [EMAIL PROTECTED] (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).

Re: Statistical sampling and representative stats collection

Reply via email to