> For things like PostGIS, which will want to index in 4 dimensions > (x, y, z, t), we might want to have multi-dimensional selectivity > histograms and some way to use same. >
Another use case is cross column statistics. > Anybody here qualified to check out this paper on the subject, please > speak up :) > Well, I don't know if I'm qualified per se, but here goes... It seems seems like a very reasonable approach to multidimensional histograms. For those who haven't had time to read the paper, it describes SGRID, a greedy algorithm for partitioning an n-dimensional space into non-overlapping n-dimensional boxes and 'outliers' ( points that dont fit into any boxes, but dont warrant their own boxes ). Then, they compare their method to a regression technique, where the space is fit into a fixed grid histogram and then the grid densities are estimated via regression. They also compare a plane splitting algorithm, called MHIST, where the planes are constrained to be orthogonal to the naive dimension vectors. They dismiss singular value decomposition and the discrete wavelet transform as being too parametric ( which is silly, IMHO ) and briefly mention a couple standard clustering techniques ( which are probably not appropriate for us, given their complexity ). Unsurprisingly, it does well on the two test sets that they consider. It think the general idea is fine, but it would certainly need some modification to work for postgres. In particular, Using the naive dimension vectors ( ie, north and east for geo data, or column a and column b for cross-column stats ) in combination with the box constraint will probably lead to problems. Consider, for example, a river that goes in a straight line north-east, and a table that stores camp sites which are mostly along the river. Because SGRID can only create boxes with north east edges, you would end up with a bunch of tiny box on the river where one, long north-east pointing box would do. We would probably want to replace the error term that SGRID uses to determine when to create a new box by a statistical test ( maybe homogeneity of means? ) ( also, the same holds for the outliers ). Finally, this creates the partition but ( AFAICT ) it doesn't describe a method for locating the histogram estimate given a point ( although that doesn't seem too difficult ). -Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers