> Thanks, Ken.  Could you devote a few words to how PuppetDB chooses which of
> those alternative columns to use for any particular value, and how it
> afterward tracks which one has been used?

So PuppetDB, in particular fact-contents, and the way it stores leaf
values makes a decision using a very basic forensic function in
clojure:

https://github.com/puppetlabs/puppetdb/blob/master/src/com/puppetlabs/puppetdb/facts.clj#L114-L124

We store that ID, which really maps to another lookup table (more for
referential integrity purposes than anything). We also use that ID to
make the decision as to which column we use:
value_string/integer/float/boolean/null etc.

> I'm also curious about whether index efficiency in PostgreSQL (as an
> example) takes a significant hit just from an index being defined on a
> Numeric/Decimal column or whether the impact depends strongly on the number
> of non-NULL values in that column.

It takes a hit because it requires more storage, and I believe in
theory the index optimisation around decimals is different also (but I
don't have the real numbers on that) ... integers are more commonly
optimised in their code base (pg's that is) because of their
wide-spread use in id columns. Again the same is true for smallints
versus bigints. This is of course conjecture without perf numbers to
back it to a degree, but I believe I'm probably correct. Of course
when it comes time to analyse this closer we'll prove it with real
perf tests as we usually do :-).

> Additionally, I'm curious about how (or whether) the alternative column
> approach interacts with queries.  Do selection predicates against Any type
> values typically need to consider multiple (or all) of the value columns?

Its kind of interesting, and largely pivots on operator. For
structured facts and the <, >, <=, >= operators ... we are forced to
interrogate both the integer and float columns (an OR clause in SQL
basically), because a user would presume thats how it worked. In a way
this is a coercement. If we introduced a decimal, we would have to do
the same again, especially if it was an overflow where its other
related numbers are still integers. In theory (and needs to be backed
with perf numbers) even while we do this, the decimal column should in
theory be sparser than the integer column (and therefore quicker
overall to traverse), we could have our cake and eat it too with just
a mild perf hit. If they were all decimals, then we are locking
ourselves into the performance of decimal for all numbers.

Another example ... the ~ operator only works on strings of course,
and in fact benefits from the new trgm indexes we've started to
introduce in 9.3. This wasn't as simple before, the normal index for a
text column actually has a maximum size limit ... this is a fact not
many devs realise. It also isn't used for regexp queries :-). But I
digress ...

So in short, we hide this internal float/integer comparison problem
from the user. In my mind, they are both "numbers" and we try to
expose it that way, but internally they are treated with the most
optimal storage we can provide.

ken.

-- 
You received this message because you are subscribed to the Google Groups 
"Puppet Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to puppet-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/puppet-dev/CAE4bNTk0zdJ0uO6hpOAwH3DOcoX0eCZHJr9qpBb9zKd6GUJjxA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to