At 04:24 AM 2/17/2006, Ragnar wrote:
On fös, 2006-02-17 at 01:20 -0500, Ron wrote:
>
> OK, so here's _a_ way (there are others) to obtain a mapping such that
>   if a < b then f(a) < f (b) and
>   if a == b then f(a) == f(b)

> By scanning the table once, we can map say 0000001h (Hex used to ease
> typing) to the row with the minimum value and 1111111h to the row
> with the maximum value as well as mapping everything in between to
> their appropriate keys.  That same scan can be used to assign a
> pointer to each record's location.

This step is just as expensive as the original sort you want to replace/improve.

Why do you think that? An external sort involves the equivalent of multiple scans of the table being sorted, sometimes more than lgN of them (where N is the number of items in the table). Since this is physical IO we are talking about, each scan is very expensive, so 1 scan is going to take considerably less time than >= lgN scans will.
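To make the pass-count comparison concrete, here is a small sketch using the textbook pass formula for a multiway external merge sort (the page and memory sizes are illustrative, not from the discussion):

```python
import math

def external_sort_passes(n_pages, mem_pages):
    """Full read+write passes over the data for an external merge
    sort with mem_pages of work memory: one pass to build sorted
    runs, then ceil(log_fanin(runs)) merge passes."""
    runs = math.ceil(n_pages / mem_pages)      # initial sorted runs
    fan_in = mem_pages - 1                     # merge fan-in
    merge_passes = math.ceil(math.log(runs, fan_in)) if runs > 1 else 0
    return 1 + merge_passes

# e.g. a 1,000,000-page table with 1,024 pages of work memory needs
# 2 full read+write passes, versus the single read-only scan that
# builds the order-preserving key mapping.
print(external_sort_passes(1_000_000, 1_024))
```

With more runs or a smaller fan-in the merge-pass count grows logarithmically, which is where the ">= lgN scans" worst case comes from.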


If you want to keep this mapping saved as a sort of an index, or as part of each row's data, this will make the cost of inserts and updates enormous.

Not sure you've got this right either. Looks to me like we are adding a <= 32-bit quantity to each row. Once we know the mapping, incrementally updating it upon insert or update would seem to be a simple matter of a fast search for the correct ranking [interpolation search, for which we have all the needed data, is O(lglgN); hash-based search is O(1)], plus an increment/decrement of the key values greater/less than the key value of the row being inserted or updated. Given that we are updating all the keys in a specific range within a tree structure, that update can be done in O(lgm) (where m is the number of records affected).
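A minimal sketch of the incremental update idea, using a flat sorted list in place of the tree the post assumes (names and data are hypothetical). Duplicates reuse an existing key so the equality property holds; a new value takes the key of its rank and bumps every larger key:

```python
import bisect

def insert_value(values, key_of, v):
    """values: sorted list of distinct values; key_of: value -> dense
    order-preserving key.  The bump loop is O(N) on a flat list; a
    balanced tree with lazy range increments would do the same range
    update in O(lg m), as the post argues."""
    pos = bisect.bisect_left(values, v)       # fast rank search
    if pos < len(values) and values[pos] == v:
        return key_of[v]                      # equal value, equal key
    for bigger in values[pos:]:
        key_of[bigger] += 1                   # shift larger keys up
    values.insert(pos, v)
    key_of[v] = pos                           # new key = rank
    return pos

vals, keys = [], {}
for x in [50, 20, 80, 20, 60]:
    insert_value(vals, keys, x)
# keys now maps 20->0, 50->1, 60->2, 80->3
```

The interpolation or hash search mentioned in the text would replace `bisect_left` here; the structural point is that only the keys above the insertion rank ever change.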

>  We can now sort the key+pointer pairs instead of the actual data and
> use an optional final pass to rearrange the actual rows if we wish.

How are you suggesting this mapping be accessed? If the mapping is kept separate from the tuple data, as in an index, then how will you look up the key?
??? We've effectively created a data set where each record is a pointer to a DB row plus its key. We can now sort the data set by key and then do an optional final pass to rearrange the actual DB rows if we so wish. Since that final pass is very expensive, it is good that not all use scenarios will need that final pass.
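The key+pointer scheme described above can be sketched in a few lines (toy in-memory "table"; a real implementation would be sorting disk pointers):

```python
# Hypothetical sketch of the key+pointer ("tag") sort: sort small
# (key, row_pointer) pairs internally, then optionally permute the
# big rows in one final pass.
rows = [{"id": p, "val": v} for p, v in enumerate([30, 10, 20])]
keys = [(row["val"], ptr) for ptr, row in enumerate(rows)]  # key + pointer
keys.sort()                          # cheap internal sort of small pairs
order = [ptr for _, ptr in keys]     # sorted row order; no row moved yet
# optional final pass: materialize rows in sorted order (one scan,
# but with random seeks in a real on-disk table)
sorted_rows = [rows[ptr] for ptr in order]
```

If the caller only needs ranks, or keys in order, the expensive final pass that rearranges `rows` is skipped entirely, which is the point made above.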

The amount of storage required to sort this representation of the table, rather than the actual table, is so much smaller that it turns an external sorting problem into an internal sorting problem with an optional final pass that is =1= scan (albeit one scan with a lot of seeks and data movement). This is a big win. It is a variation of a well known technique. See Sedgewick, Knuth, etc.
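A back-of-envelope illustration of the storage claim (the row counts and sizes below are made-up illustrative numbers, not figures from this thread):

```python
# Why the key+pointer representation can turn an external sort
# into an internal one: it shrinks the data being sorted.
n_rows   = 10_000_000
row_size = 200            # bytes per row in the real table (assumed)
kp_size  = 4 + 8          # 32-bit key + 64-bit row pointer

table_bytes = n_rows * row_size   # ~2 GB: needs an external sort
kp_bytes    = n_rows * kp_size    # ~120 MB: sorts comfortably in RAM
print(table_bytes, kp_bytes)
```

Under these assumptions the sortable representation is ~17x smaller than the table, so a table that would otherwise spill to disk sorts entirely in memory.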


> That initial scan to set up the keys is expensive, but if we wish
> that cost can be amortized over the life of the table so we don't
> have to pay it all at once.  In addition, once we have created those
> keys, they can be saved for later searches and sorts.

What is the use case where this would work better than a
regular btree index ?
Again, ??? btree indexes address different issues. They do not in any way help create a compact representation of the original data that saves enough space to turn an external ranking or sorting problem into an internal one.


Ron


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

              http://www.postgresql.org/docs/faq
