Hi Alexey,

Thanks for the thorough and extremely valuable review!

On 12/17/18 5:23 PM, Alexey Kondratov wrote:
> Hi Tomas,
> 
>> This new version is mostly just a rebase to current master (or almost,
>> because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
>> but it also addresses the new stuff committed since last version (most
>> importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
>> subxact assignments, where the assignment was included in records with
>> XID=0, essentially failing to track the subxact properly.
> 
> I started reviewing your patch about a month ago and tried to do an
> in-depth review, since I am very interested in this patch too. The new
> version does not apply to master at 29180e5d78, but everything is OK
> after applying the 2pc patch first. Still, I guess this may complicate
> further testing and review, since any potential reviewer has to take
> both patches into account at once. The previous version applied to
> master and worked fine for me on its own (except for a few
> patch-specific issues, which I try to explain below).
> 

I agree it's somewhat annoying, but I don't think there's a better way,
unfortunately. Decoding in-progress transactions does require safe
handling of concurrent aborts, so it has to be committed after the 2pc
decoding patch (which makes that possible). But the 2pc patch also
touches the same places as this patch series (it reworks the reorder
buffer for example).

> 
> Patch review
> ========
> 
> First of all, I want to say thank you for such a huge amount of work. Here
> are some problems which I have found and hopefully fixed with my
> additional patch (attached; it should apply on top of the last commit
> of your newest patch version):
> 
> 1) The most important issue is that your tap tests were broken: the
> option "WITH (streaming=true)" was missing from the statement creating
> the subscription. Therefore, the spilling mechanism was being tested
> rather than streaming.
> 

D'oh!
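
Right, the subscriptions in those tests obviously need to be created with
the option enabled, i.e. something like this (the connection string and
the subscription/publication names here are just placeholders):

    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=localhost dbname=postgres'
        PUBLICATION tap_pub
        WITH (streaming = true);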

> 2) After fixing the tests, the first one (simple streaming) immediately
> fails because of a segmentation fault in the logical replication worker.
> It happens because the worker tries to call stream_cleanup_files inside
> stream_open_file at stream start while nxids is zero, so it goes
> negative and everything crashes. Something similar may happen with the
> xids array, so I added two checks there.
> 
> 3) The next problem is much more critical and concerns historic
> MVCC visibility rules. Previously, walsender started decoding a
> transaction at commit, so we were able to resolve all xmin, xmax and
> combocids to cmin/cmax, build the tuplecids hash and so on, but now we
> do all of these things on the fly.
> 
> Thus, a rather difficult situation arises: HeapTupleSatisfiesHistoricMVCC
> tries to validate catalog tuples which are currently in the future
> relative to the current decoder position inside the transaction. E.g. we
> may want to resolve cmin/cmax of a tuple which was created with cid 3
> and deleted with cid 5, while we are currently at cid 4, so our
> tuplecids hash is not complete enough to handle such a case.
> 

Damn it! I ran into those two issues some time ago and fixed them, but
I forgot to merge the fix into the patch. I'll merge those fixes,
compare them to your proposed fix, and send a new version tomorrow.
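
If it helps, something along these lines should exercise the cmin/cmax
case you describe, i.e. a catalog tuple whose cmax is set by a later
command in the same transaction (just a sketch of what I have in mind;
table name and row counts are arbitrary):

    BEGIN;
    CREATE TABLE ddl_test (a int);
    -- enough rows to exceed the memory limit and trigger streaming
    INSERT INTO ddl_test SELECT i FROM generate_series(1, 100000) i;
    -- adds pg_attribute entries in the middle of the transaction
    ALTER TABLE ddl_test ADD COLUMN b int;
    INSERT INTO ddl_test SELECT i, i FROM generate_series(1, 100000) i;
    -- updates the same catalog tuple later, so its cmax lies "in the
    -- future" relative to the decoder position while the second INSERT
    -- is being streamed
    ALTER TABLE ddl_test DROP COLUMN b;
    COMMIT;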

> 
> 4) There was a problem with marking the top-level transaction as having
> catalog changes if one of its subtransactions has them. It caused a
> problem with DDL statements right after a subtransaction start
> (savepoint), so data from new columns was not replicated.
> 
> 5) A similar issue with sending the schema. You send the schema only once
> per sub/transaction (IIRC), while we have to update the schema on each
> catalog change: invalidation execution, snapshot rebuild, adding new
> tuple cids. So I ended up adding an is_schema_send flag to
> ReorderBufferTXN, since it is easy to set inside the reorder buffer and
> read in the output plugin. We probably have to find a better place for
> this flag.
> 

Hmm. Can you share an example of how to trigger these issues?
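
I'm guessing it's something along these lines, i.e. DDL right after a
savepoint in a transaction large enough to get streamed (just my guess,
with made-up table/column names, so a concrete example would still help):

    BEGIN;
    SAVEPOINT s1;
    -- DDL immediately after the subtransaction starts
    ALTER TABLE test_tab ADD COLUMN extra text;
    -- data in the new column, large enough to trigger streaming
    INSERT INTO test_tab (extra)
    SELECT 'x' FROM generate_series(1, 100000);
    RELEASE SAVEPOINT s1;
    COMMIT;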

> 6) To better handle all these tricky cases I added a new tap test,
> 014_stream_tough_ddl.pl, which consists of a really tough combination
> of DDL, DML, savepoints and ROLLBACK/RELEASE in a single transaction.
> 

Thanks!

> I marked all my fixes and every questionable place with a comment and a
> "TOCHECK:" label for easy searching. Removing pretty much any of these
> fixes makes the tests fail with a segmentation fault or a replication
> mismatch. I mostly read and tested the old version of the patch, but
> after a quick look it seems that all these fixes apply to the new
> version of the patch as well.
> 

Thanks. I'll go through your patch tomorrow.

> 
> Performance
> ========
> 
> I have also performed a series of performance tests, and found that the
> patch adds a huge overhead in the case of a large transaction consisting
> of many small rows, e.g.:
> 
> CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double
> precision);
> 
> EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
> SELECT round(random()*10), random(), random()*142
> FROM generate_series(1, 1000000) s(i);
> 
> Execution Time: 2407.709 ms
> Total Time: 11494.238 ms (00:11.494)
> 
> With synchronous_standby_names set and 64 MB of logical_work_mem it takes
> up to 5x longer, while without the patch it is about 2x. Thus, logical
> replication streaming is approximately 4x slower for similar transactions.
> 
> However, dealing with large transactions consisting of a small number of
> large rows is much better:
> 
> CREATE TABLE large_text (t TEXT);
> 
> EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 1000000)) FROM generate_series(1, 125);
> 
> Execution Time: 3545.642 ms
> Total Time: 7678.617 ms (00:07.679)
> 
> It is around the same 2x as without the patch. In case anyone is
> interested, I have also added flamegraphs of the walsender and the
> logical replication worker taken while processing the first (numeric)
> transaction.
> 

Interesting. Any idea where the extra overhead in this particular
case comes from? It's hard to deduce that from a single flame graph
when I don't have anything to compare it with (i.e. the flame graph for
the "normal" case).

I'll investigate this (probably not this week), but in general it's good
to keep in mind a couple of things:

1) Some overhead is expected, due to doing things incrementally.

2) The memory limit should be set to a sufficiently high value so that it
is hit only infrequently.

3) And when the limit is actually hit, it's an alternative to spilling
large amounts of data locally (to disk) or incurring significant
replication lag later.

So I'm not particularly worried, but I'll look into it. I'd be much
more worried if there were measurable overhead in cases where there's no
streaming happening (either because it's disabled or the memory limit
was not hit).
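
FWIW, for experiments it should be easy to move the threshold around and
to compare against plain spilling. Roughly like this (a sketch only: it
assumes logical_work_mem behaves like a regular GUC and that the streaming
option can be toggled with ALTER SUBSCRIPTION like other subscription
options; tap_sub is a placeholder name):

    -- raise the limit so that only really large transactions get streamed
    ALTER SYSTEM SET logical_work_mem = '256MB';
    SELECT pg_reload_conf();

    -- or disable streaming for the subscription to compare with spilling
    ALTER SUBSCRIPTION tap_sub SET (streaming = false);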

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
