Re: [HACKERS] record identical operator

Robert Haas Tue, 24 Sep 2013 11:06:24 -0700

On Tue, Sep 24, 2013 at 1:04 PM, Stephen Frost <[email protected]> wrote:
> * Robert Haas ([email protected]) wrote:
>> On Tue, Sep 24, 2013 at 12:00 PM, Stephen Frost <[email protected]> wrote:
>> > It wouldn't address my concerns anyway, which are around these binary
>> > operators and the update-in-place approach being risky and setting us up
>> > for problems down the road.
>>
>> I think that if you want to hold up a bug fix to a feature that's
>> already committed, you need to be more specific than to say that there
>> might be problems down the road.
>
> Committed isn't released and I simply can't agree that introducing new
> operators is a 'bug fix'.  Had it gone out as-is, I can't see us
> back-patching a bunch of new operators to fix it.


You're perfectly welcome to argue that we should rip REFRESH
MATERIALIZED VIEW CONCURRENTLY back out, or change the implementation
of it.  However, that's not the topic of this thread.  Had it gone out
as is, we would have either had to come up with some more complex fix
that could make things right without catalog changes, or we would have
had to document that it doesn't work correctly.  Fortunately we are
not faced with that conundrum.

>> Note that, after we change the data in the underlying table, the
>> output of the query changes.  But REFRESH MATERIALIZED VIEW
>> CONCURRENTLY doesn't fix it.  However, REFRESH MATERIALIZED VIEW
>> without CONCURRENTLY does fix it.  That's a bug, because if there are
>> two ways of refreshing a materialized view they should both produce
>> the same answer.  Shouldn't they?
>
> The same queries run over time without changes to the underlying data
> really should return the same data, shoudln't they?

Not necessarily.  There are and always have been plenty of cases where
that isn't true.

> Is it a bug that they don't?

No.

> In general, I agree that they should produce the same results, as should
> incrementally maintained views when they happen.  I'm not convinced that
> choosing whatever the 'new' value is to represent the value in the
> matview for the equal-but-not-binary-identical will always be the
> correct answer though.

Well, you have yet to provide an example of where it isn't the right
behavior.  I initially thought that perhaps we needed a type-specific
concept of exact equality, so that two things would be exactly equal
if they would both be dumped the same way by pg_dump, even if the
internal representations were different.

But on further thought I'm no longer convinced that's a good idea.
For example, consider the compact-numeric format that I introduced a
few releases back.  The changes are backward compatible, so you can
run pg_upgrade on a cluster containing the old format and everything
works just fine.  But your table will be larger than you really want
it to be until you rewrite it.  Any materialized view that was built
by selecting those numeric values will *also* be larger than you want
it to be until you rewrite it.  Fortunately, you can shrink the table
by using a rewriting variant of ALTER TABLE.  After you do that, you
will perhaps also want to shrink the materialized view.  And in fact,
REFRESH will do that for you, but currently, REFRESH CONCURRENTLY
won't.  It seems to me that it would be nicer if it did.

Now I admit that's an arguable point.  We could certainly define an
intermediate notion of equality that is more equal than whatever =
does, but not as equal as exact binary equality.  However, such a new
notion would have no precedence in our existing code, whereas the use
of binary equality does.  Going to a lot of work to build up this
intermediate notion of equality does not seem like a good idea to me;
I think a general rule of good programming is not to invent more ways
to do things than are clearly necessary, and requiring every core and
extension type to define an exact-equality operator just to support
this feature seems like excessive in the extreme.

Moreover, I think what to include in that intermediate notion of
equality would be kind of hard to decide, because there are a lot of
subtly different questions that one can ask, and we'd inevitably have
to answer the question "how equal does it have to be to be equal
enough?".   Numerics and arrays have multiple ways of referencing what
is intended to be the exact same value, but those can be disambiguated
by passing them to a function that cares, like pg_column_size().
Then, on a user-visible level, numerics allow variation in the number
of trailing zeroes after the decimal point.  Floats have extra digits
that aren't shown except with extra_float_digits, so that the value
can be less than totally equal even if the text representation is the
same.  citext intentionally ignores case.  In every one of those
cases, it's possible that the user of materialized views might say
"you know, if it's equal enough that the = operator says it's the
same, then I really don't mind if the materialized view maintenance
gets skipped".  But it's also possible that they might say, "you know,
those values are not really the same, and the materialized view really
ought to reflect the latest data".

I think the conservative (and therefore correct) approach is to decide
that we're always going to update if we detect any difference at all.
The only real argument for ignoring any differences is that it will
mean more changes to the materialized view.  And that is true enough,
but in practical scenarios such updates will be rare.  I would argue
that if someone goes through and add spends a lot of time adding or
removing trailing zeroes to the base tables on which a materialized
view is based, the likelihood is that those trailing zeros do matter
and the change ought to reflect through to the materialized views as
well.  It is not impossible that someone could upgrade to a version of
PG that supports a more compact or otherwise-enhanced version of some
data type, rewrites the table for some reason, and is then surprised
to see a change storm on the associated materialized view as well.
But I don't think it's very likely - we have no such changes planned.
And even if does happen at some point down the road, I can't see
justifying the effort of building a third notion of equality as effort
well-spent to avoid it, especially because it's equally possible that
the user *would* want the changes to propagate through to the
materialized view.

>> The fact that users can write
>> queries that don't always give the same answer is not a reason why
>> it's OK for REFRESH CONCURRENTLY to misbehave on queries that do.
>
> This really is, imv, agreeing to hold a higher standard for matviews
> than we do for what matviews are *based* on- which is queries.

I don't think it is, and your explanation of why you think it is does
not make any sense to me.  If a view produced an answer that the
underlying query *could not have produced* then we would clearly
consider that a bug.  But, you're arguing that a materialized view on
which REFRESH CONCURRENTLY should be allowed to produce just such
answers.

It is obviously true, and unavoidable, that if the query can produce
more than one result set depending on the query plan or other factors,
then the materialized view contents can match only one of those
possible result sets.  But you are arguing that it should be allowed
to produce some OTHER result set, that a direct execution of the query
could *never* have produced, and I can't see how that can be right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] record identical operator

Reply via email to