> Thanks, Ken. Could you devote a few words to how PuppetDB chooses which of > those alternative columns to use for any particular value, and how it > afterward tracks which one has been used?
So PuppetDB, in particular fact-contents, and the way it stores leaf values makes a decision using a very basic forensic function in clojure: https://github.com/puppetlabs/puppetdb/blob/master/src/com/puppetlabs/puppetdb/facts.clj#L114-L124 We store that ID, which really maps to another lookup table (more for referential integrity purposes than anything). We also use that ID to make the decision as to which column we use: value_string/integer/float/boolean/null etc. > I'm also curious about whether index efficiency in PostgreSQL (as an > example) takes a significant hit just from an index being defined on a > Numeric/Decimal column or whether the impact depends strongly on the number > of non-NULL values in that column. It takes a hit because it requires more storage, and I believe in theory the index optimisation around decimals is different also (but I don't have the real numbers on that) ... integers are more commonly optimised in their code base (pg's that is) because of their wide-spread use in id columns. Again the same is true for smallints versus bigints. This is of course conjecture without perf numbers to back it to a degree, but I believe I'm probably correct. Of course when it comes time to analyse this closer we'll prove it with real perf tests as we usually do :-). > Additionally, I'm curious about how (or whether) the alternative column > approach interacts with queries. Do selection predicates against Any type > values typically need to consider multiple (or all) of the value columns? Its kind of interesting, and largely pivots on operator. For structured facts and the <, >, <=, >= operators ... we are forced to interrogate both the integer and float columns (an OR clause in SQL basically), because a user would presume thats how it worked. In a way this is a coercement. If we introduced a decimal, we would have to do the same again, especially if it was an overflow where its other related numbers are still integers. In theory (and needs to be backed with perf numbers) even while we do this, the decimal column should in theory be sparser than the integer column (and therefore quicker overall to traverse), we could have our cake and eat it too with just a mild perf hit. If they were all decimals, then we are locking ourselves into the performance of decimal for all numbers. Another example ... the ~ operator only works on strings of course, and in fact benefits from the new trgm indexes we've started to introduce in 9.3. This wasn't as simple before, the normal index for a text column actually has a maximum size limit ... this is a fact not many devs realise. It also isn't used for regexp queries :-). But I digress ... So in short, we hide this internal float/integer comparison problem from the user. In my mind, they are both "numbers" and we try to expose it that way, but internally they are treated with the most optimal storage we can provide. ken. -- You received this message because you are subscribed to the Google Groups "Puppet Developers" group. To unsubscribe from this group and stop receiving emails from it, send an email to puppet-dev+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-dev/CAE4bNTk0zdJ0uO6hpOAwH3DOcoX0eCZHJr9qpBb9zKd6GUJjxA%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.