Hitoshi Harada <umi.tan...@gmail.com> wrote:
> Kevin Grittner <kgri...@ymail.com> wrote:
>> Hitoshi Harada <umi.tan...@gmail.com> wrote:
>>> As far as I can tell, the overall approach is as follows.
>>>
>>> - create a new temp heap as non-concurrent does, but with
>>> ExclusiveLock on the matview, so that reader wouldn't be
>>> blocked
>>
>> Non-concurrent creates the heap in the matview's tablespace and
>> namespace, so the "temp" part is different in concurrent
>> generation.  This difference is why concurrent can be faster
>> when few rows change.
>
> It's still not clear to me why you need temp in concurrent and
> not in non-concurrent.

Well, temp tables can be in an entirely different tablespace, so
you can't just move the heap of a temp table into place as the new
heap for the matview (as we do for a non-concurrent REFRESH) -- at
least not without potentially copying everything an extra time.
For concurrent refresh we are modifying the existing matview heap,
and the freshly generated set of values, as well as the "diff"
table, are needed only temporarily.  That makes the behavior of
temp tables ideal.  Not only are they not logged, they are
potentially placed on faster tablespaces, and the data written to
them might never be written to disk:

http://www.postgresql.org/docs/9.2/interactive/runtime-config-resource.html#GUC-TEMP-BUFFERS

> If this type of operation always creates a "temp" table and just
> swaps it with the existing one, why can't we just make it temp
> always?

Because of the potentially different tablespaces.

> And if the performance is the only concern, is temp better than
> just turning off WAL for the table or using an UNLOGGED table?

Yes, it is.

>>> - this operation requires unique index for performance reason
>>> (or correctness reason too)
>>
>> It is primarily for correctness in the face of duplicate rows
>> which have no nulls.  Do you think the reasons need to be better
>> documented with comments?
>
> Ah, yes, even after looking at the patch I was confused about
> whether it was for performance or correctness.

This is the part of the function's comment which attempts to
explain the problem:
 * This join cannot work if there are any
 * duplicated rows in either the old or new versions, in the sense that every
 * column would compare as equal between the two rows.  It does work correctly
 * in the face of rows which have at least one NULL value, with all non-NULL
 * columns equal.  The behavior of NULLs on equality tests and on UNIQUE
 * indexes turns out to be quite convenient here; the tests we need to make
 * are consistent with default behavior.  If there is at least one UNIQUE
 * index on the materialized view, we have exactly the guarantee we need.  By
 * joining based on equality on all columns which are part of any unique
 * index, we identify the rows on which we can use UPDATE without any problem.
 * If any column is NULL in either the old or new version of a row (or both),
 * we must use DELETE and INSERT, since there could be multiple rows which are
 * NOT DISTINCT FROM each other, and we could otherwise end up with the wrong
 * number of occurrences in the updated relation.

I'm open to suggestions on better wording.

As an example of the way the full join gets into trouble with
duplicate rows when used for a diff, see this example:

test=# create table old (c1 int, c2 int);
CREATE TABLE
test=# create table new (c1 int, c2 int);
CREATE TABLE
test=# insert into old values
test-# (1,1),(1,2),(1,2),(1,2),(1,3),(1,null),(1,null);
INSERT 0 7
test=# insert into new values
test-# (1,1),(1,2),(1,2),(1,2),(1,2),(1,2),(1,4),(1,null),(1,null),(1,null);
INSERT 0 10

At this point it is clear that we need to add two rows with values
(1,2), and we need to wind up with one more row with values
(1,null) than we already have.  We also need to delete (1,3) and
add (1,4).
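One way to make that correct diff explicit is to compare per-group
row counts.  (This is an illustrative query only, not something the
patch does; it uses IS NOT DISTINCT FROM in the join condition so
the (1,null) groups pair up.)

```sql
-- Count-based diff over the example tables: a positive delta means
-- rows that must be inserted, a negative delta rows to be deleted.
-- Illustration only; the patch's diff query works differently.
SELECT coalesce(o.c1, n.c1) AS c1,
       coalesce(o.c2, n.c2) AS c2,
       coalesce(n.cnt, 0) - coalesce(o.cnt, 0) AS delta
  FROM (SELECT c1, c2, count(*) AS cnt FROM old GROUP BY c1, c2) o
  FULL JOIN
       (SELECT c1, c2, count(*) AS cnt FROM new GROUP BY c1, c2) n
    ON o.c1 IS NOT DISTINCT FROM n.c1
   AND o.c2 IS NOT DISTINCT FROM n.c2
 WHERE coalesce(o.cnt, 0) <> coalesce(n.cnt, 0)
 ORDER BY 1, 2;
--  c1 | c2 | delta
-- ----+----+-------
--   1 |  2 |     2      insert two (1,2)
--   1 |  3 |    -1      delete (1,3)
--   1 |  4 |     1      insert (1,4)
--   1 |    |     1      insert one more (1,null)
```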
But the full join logic fails to get things right for the case of
duplicate rows with no nulls:

test=# select old, new
test-# from old
test-# full join new on old.c1 = new.c1 and old.c2 = new.c2
test-# where (old.*) is distinct from (new.*);
  old  |  new
-------+-------
 (1,3) |
 (1,)  |
 (1,)  |
       | (1,4)
       | (1,)
       | (1,)
       | (1,)
(7 rows)

> It's a shame we cannot refresh it concurrently if we have
> duplicate rows in the matview.  I thought it would make sense to
> allow it without a unique key if it was only a performance
> tradeoff.

Well, we *could* allow it without a unique index, but the code
would be more complex and significantly slower.  I think we would
still want to handle it the way the current patch does when a
unique index is present, even if we have a way to handle cases
where such an index is not present.  Even if I were convinced it
was worthwhile to support the more general case, I would want to
commit this patch first, and add the more complicated code as a
follow-on patch.

At this point I'm not convinced of the value of supporting
concurrent refresh of materialized views which contain duplicate
rows, nor of the benefit of allowing it to work ten times slower on
matviews without duplicate rows but on which no unique index has
been added, rather than requiring the user to add a unique index if
they want the concurrent refresh.  I'm open to arguments as to why
either of those is a good idea and worth doing ahead of incremental
maintenance of matviews.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
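P.S.  For contrast, the same full join diff technique is
unambiguous once there is a unique key to join on -- which is why
the patch requires a unique index.  A hypothetical example (tables
and names invented for illustration, not the patch's internal
query):

```sql
-- Hypothetical tables with a unique key on c1.
CREATE TABLE old_u (c1 int PRIMARY KEY, c2 int);
CREATE TABLE new_u (c1 int PRIMARY KEY, c2 int);
INSERT INTO old_u VALUES (1, 1), (2, 2), (3, 3);
INSERT INTO new_u VALUES (1, 1), (2, 5), (4, 4);

-- Joining on the unique column pairs each old row with at most one
-- new row, so each diff row maps to exactly one UPDATE, DELETE, or
-- INSERT:
SELECT old_u, new_u
  FROM old_u
  FULL JOIN new_u ON old_u.c1 = new_u.c1
 WHERE (old_u.*) IS DISTINCT FROM (new_u.*)
 ORDER BY coalesce(old_u.c1, new_u.c1);
--  old_u | new_u
-- -------+-------
--  (2,2) | (2,5)      UPDATE
--  (3,3) |            DELETE
--        | (4,4)      INSERT
```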