Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

Mischa Sandberg Tue, 03 May 2005 20:00:30 -0700

Quoting Josh Berkus <[email protected]>: 
 
> Mischa, 
>  
> > Okay, although given the track record of page-based sampling for 
> > n-distinct, it's a bit like looking for your keys under the 
> streetlight, 
> > rather than in the alley where you dropped them :-) 
>  
> Bad analogy, but funny. 
 
Bad analogy? Page-sampling effort versus row-sampling effort, c'est 
moot. It's not good enough for stats to produce good behaviour on the 
average. Straight random sampling, page or row, is going to cause 
enough untrustworthy engine behaviour,for any %ages small enough to 
allow sampling from scratch at any time. 
 
I'm curious what the problem is with relying on a start-up plus 
incremental method, when the method in the distinct-sampling paper 
doesn't degenerate: you can start when the table is still empty. 
Constructing an index requires an initial full scan plus incremental 
update; what's the diff? 
 
> Unless, of course, we use indexes for sampling, which seems like a 
> *really  
> good* idea to me .... 
 
"distinct-sampling" applies for indexes, too. I started tracking the 
discussion of this a bit late.  Smart method for this is in VLDB'92: 
Gennady Antoshenkov, "Random Sampling from Pseudo-ranked B+-trees". I 
don't think this is online anywhere, except if you have a DBLP 
membership. Does nybod else know better? 
Antoshenkov was the brains behind some of the really cool stuff in DEC 
Rdb (what eventually became Oracle). Compressed bitmap indices, 
parallel competing query plans, and smart handling of keys with 
hyperbolic distributions.  
--  
Engineers think equations approximate reality. 
Physicists think reality approximates the equations. 
Mathematicians never make the connection.



---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to [EMAIL PROTECTED] so that your
      message can get through to the mailing list cleanly

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

Reply via email to