On Thu, Oct 16, 2008 at 09:17:03PM -0600, Joshua Tolley wrote:
> Because I'm trying to picture geometrically how this might work for
> the two-column case, and hoping to extend that to more dimensions, and
> am finding that picturing a quantile-based system like the one we have
> now in multiple dimensions is difficult. 

Just a note: using multidimensional histograms will work well for
cases like (startdate,enddate), where the histogram will show a
clustering of values along the diagonal. But it will fail for a case
like (zipcode,state), where one implies the other. Histogram-wise you're
not going to see any correlation at all, but what you want to know is:

count(distinct zipcode,state) = count(distinct zipcode)

So you might need to think about storing/searching for different kinds
of correlation.
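
To make the distinct-count test concrete, here is a minimal sketch in
Python. It is only an illustration of the check above, not a proposal
for how the planner should do it; the function name `implies` and the
sample rows are made up for the example.

```python
def implies(rows, a, b):
    """Return True if column a determines column b in the given rows.

    This is the check count(distinct a, b) = count(distinct a): if no
    value of a ever appears with two different values of b, the two
    distinct counts are equal.
    """
    distinct_a = {row[a] for row in rows}
    distinct_ab = {(row[a], row[b]) for row in rows}
    return len(distinct_ab) == len(distinct_a)


# Hypothetical sample data, invented for illustration only.
rows = [
    {"zipcode": "10001", "state": "NY"},
    {"zipcode": "10002", "state": "NY"},
    {"zipcode": "94103", "state": "CA"},
    {"zipcode": "94103", "state": "CA"},
    {"zipcode": "60601", "state": "IL"},
]

zip_implies_state = implies(rows, "zipcode", "state")
state_implies_zip = implies(rows, "state", "zipcode")
```

In this sample zipcode determines state but not the other way around,
which is a kind of dependency a histogram over the two columns would not
describe.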

Secondly, my feeling about multidimensional histograms is that you're
not going to need the matrix to have 100 bins along each axis, but that
it'll be enough to have 1000 bins total. The cases where we get it
wrong enough for people to notice will probably be the same cases where
the histogram will have noticeable variation even for a small number of
bins.
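
As a rough sketch of what a histogram with about 1000 bins in total
could look like, the Python below builds a 32x32 equi-width grid (1024
cells) over made-up (startdate,enddate) pairs and compares its estimate
for a two-column predicate against the usual multiply-the-selectivities
estimate. Everything here is an assumption for illustration: the
equi-width binning, the synthetic data, and the helper names
`build_grid` and `fraction_in_box`. The existing per-column histograms
are quantile-based, so this is not how they are built.

```python
import random


def build_grid(points, bins_per_axis, lo, hi):
    """Equi-width 2-D histogram: grid[i][j] counts points in cell (i, j)."""
    width = (hi - lo) / bins_per_axis
    grid = [[0] * bins_per_axis for _ in range(bins_per_axis)]
    for x, y in points:
        # Clamp so that a value equal to hi lands in the last cell.
        i = min(int((x - lo) / width), bins_per_axis - 1)
        j = min(int((y - lo) / width), bins_per_axis - 1)
        grid[i][j] += 1
    return grid


def fraction_in_box(grid, i_range, j_range):
    """Fraction of all points that fall in the given block of cells."""
    total = sum(map(sum, grid))
    hits = sum(grid[i][j] for i in i_range for j in j_range)
    return hits / total


random.seed(0)

# Made-up data: enddate is startdate plus a short duration, so the
# points cluster along the diagonal.
points = []
for _ in range(10000):
    start = random.uniform(0, 900)
    points.append((start, start + random.uniform(0, 100)))

grid = build_grid(points, 32, 0, 1000)  # 32 * 32 = 1024 cells in total

# Predicate: startdate in the lowest quarter AND enddate in the highest.
joint = fraction_in_box(grid, range(0, 8), range(24, 32))

# Independence assumption: multiply the two single-column fractions.
independent = (fraction_in_box(grid, range(0, 8), range(32))
               * fraction_in_box(grid, range(32), range(24, 32)))
```

With this data no row can satisfy the predicate, and the grid estimate
`joint` reflects that, while `independent` comes out clearly above zero.
That is the sort of gross misestimate a coarse grid would already catch.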

Have a nice day,
-- 
Martijn van Oosterhout   <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while 
> boarding. Thank you for flying nlogn airlines.
