On Tue, Nov 01, 2005 at 09:22:17PM +0000, Simon Riggs wrote:
> varlen is int32 to match the standard varlena header. However, the max
> number of digits of the datatype is less than the threshold at which
> values get toasted. So no NUMERIC values ever get toasted - in which
> case, why worry about matching the size of varlena - let's reduce it to
> 2 bytes, which still gives us up to 1000 digits as we have now.
The other ideas may have merit, I don't know. But this one is a no-goer.
The backend currently recognises three forms of Datum:

- Fixed length, by value: integers, chars, anything short enough to fit
  in a word
- Fixed length, by reference: datetime, etc., anything that's fixed
  length but too long for a word
- Variable length: anything variable: text, varchar(), etc.

All of the last group, without exception, have a varlena header. This
makes the code easy, because all variable-length values look the same
for copying, loading, storing, etc. You are proposing a fourth form, say
VARLENA2, which looks a lot like a varlena but isn't (there's a rough
sketch of the two layouts below). I think the sheer volume of code that
would need to be checked is huge. Also, things like pg_attribute would
need changing because you have to represent this new state somehow. I
seriously doubt this is going to happen.

Your other possible optimisations have other issues.

> n_weight seems to exist because we do not store trailing zeroes. So
> 1000000 is stored as 1 with a weight of 6. My experience is that large
> numbers of trailing zeroes do not occur with any frequency in real
> measurement or financial data and that this is an over-optimization.
> This is probably a hangover from the original algorithm, rather than a
> conscious design goal for PostgreSQL?

But if you are storing large numbers then it's helpful. Whether it's
worth the cost...

> n_sign_dscale shows us where the decimal point is. We could actually
> store a marker representing the decimal point, which would cost us 0.5
> bytes rather than 2 bytes. Since we have 4 bits to represent a decimal
> digit, that leaves a few bits spare to represent either a decimal-
> point-and-positive-sign or a decimal-point-and-negative-sign. (We would
> still need to store trailing zeroes even after the decimal point.)

Consider the algorithm: a number is stored as base + exponent. To
multiply two numbers you can multiply the bases and add the exponents (a
toy sketch of this is below, too). OTOH, if you store the decimal point
inside the data, you have to extract it again before you can do any
calculating. So you've traded CPU time for disk space. Is disk space
cheaper or more expensive than CPU? Debatable, I guess.

> So, assuming I have this all correct, we could reduce the on-disk
> storage for NUMERIC datatypes to the following struct. This gives an
> overhead of just 2.5 bytes, plus the loss of the optimization of
> trailing zeroes, which I assess as having almost no value anyway in
> 99.9999% of data values (literally...).

Actually, I have a table with a column declared as numeric(12,4) because
there have to be 4 decimal places. As it turns out, the decimal places
are mostly zero, so the optimisation works for me.

Interesting ideas, but there are a lot of hurdles to jump, I think...
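To make the header argument concrete, here is a minimal sketch of the
two layouts. This is not actual PostgreSQL source - the real definition
lives in postgres.h, and the VARLENA2 struct and its field names are
made up for illustration:

#include <stdint.h>

/* The standard variable-length header: every varlena datum starts
 * with a 4-byte length word, so generic code can copy, load and
 * store any variable-length value without knowing its type. */
typedef struct varlena_sketch
{
    int32_t vl_len;     /* total length in bytes, including header */
    char    vl_dat[1];  /* actual data follows */
} varlena_sketch;

/* The hypothetical VARLENA2 with a 2-byte length word.  Still plenty
 * for NUMERIC's 1000-digit maximum, but any generic code path that
 * expects a 4-byte length word would misread it, so every one of
 * those paths would have to learn about the new form. */
typedef struct varlena2_sketch
{
    int16_t vl_len;     /* total length in bytes, including header */
    char    vl_dat[1];  /* actual data follows */
} varlena2_sketch;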
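And a toy model of the base + exponent point. Real NUMERIC keeps an
array of digits rather than an int64 mantissa, so this is only a sketch
of the arithmetic under that simplification, not the actual
implementation:

#include <stdio.h>
#include <stdint.h>

/* value = mant * 10^weight, so 1000000 is stored as mant=1, weight=6
 * and the trailing zeroes are never stored. */
typedef struct { int64_t mant; int weight; } toy_numeric;

/* Multiplication never has to find a decimal-point marker inside the
 * data: multiply the mantissas, add the weights. */
static toy_numeric toy_mul(toy_numeric a, toy_numeric b)
{
    toy_numeric r = { a.mant * b.mant, a.weight + b.weight };
    return r;
}

int main(void)
{
    toy_numeric a = { 25, -1 };  /* 2.5 */
    toy_numeric b = { 4,   2 };  /* 400 */
    toy_numeric r = toy_mul(a, b);

    /* Prints "100 * 10^1", i.e. 1000.  A normalisation pass would
     * fold the trailing zeroes back into the weight (1 * 10^3) -
     * which is exactly the trailing-zero optimisation above. */
    printf("%lld * 10^%d\n", (long long) r.mant, r.weight);
    return 0;
}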
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.