On Apr 17, 2008, at 12:22 PM, Gregory Stark wrote:
"Decibel!" <[EMAIL PROTECTED]> writes:

For each field that isn't already in a set of field groupings
 * Sort sample rows on that field
 * Calculate correlation for all other fields
* If there are other fields that have a correlation to this sort order over some threshold, save them along with the field we originally sorted on as a
new 'field grouping'
* Else, there are no other fields that group with this field; it's a "loner"

I think this is going somewhere. But "correlation" isn't quite right. It has the same problem our use of correlation for clusteredness has. Consider the case of Zip code and City. They're nearly very non-independent variables but
there's basically no correlation.

If we have a limited number of values in one of the fields, it would be possible to build a histogram of values for other fields based on the values in the first field. But I can't see how we could possibly represent something like city, zip in a compact form. You would have to keep a range of zips that cover a city.

Hmm... but we only care about cities that have a substantial number of zip codes. This is what the equivalent of the most-common-values list would be for cross-platform stats: for the most_common_vals in column a, you store a range or histogram of the corresponding values in b, assuming that there is a good correspondence.

For each field grouping, at a minimum we'd need to store a histogram for that
grouping.

This is a problem. What does a histogram on a grouping mean? It's not clear
how to come up with a histogram which can help answer questions like
  A between ? and ? and B between ? and ?

You can do a histogram on <a,b> or <b,a> but neither are going to be
especially useful. Heikki and I came up with a weird hybrid thing which might
be useful for avoiding overestimating selectivity like
  WHERE city='BOS' AND areacode = '617'

But it didn't help at all with the converse, ie:
 WHERE city='BOS' AND areacode = '212'

It's hard to see how we could possibly catch cases like that though.

If the two fields share the same correlation, then the histogram is just what we use right now. We could actually do this today, but only for fields with a high physical correlation. What I was describing allowed extending this to fields that have a high correlation to each other, even if they didn't have a high physical correlation. I know that this doesn't help us for things like city/area code or city/zip, but other than my idea above I'm rather at a loss on how to represent that in a compact fashion.
--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828


Attachment: smime.p7s
Description: S/MIME cryptographic signature

Reply via email to