On Mon, Jun 17, 2019 at 8:53 PM Paul Guo <p...@pivotal.io> wrote:

> Hi all,
>
> I've been working on other things until recently, when I restarted this
> work, profiling and refactoring the code. It's been a long time since the
> last patch was proposed. The new patch has first been refactored due to
> 4da597edf1bae0cf0453b5ed6fc4347b6334dfe1 (Make TupleTableSlots extensible,
> finish split of existing slot type).
>
> Now that intorel_receive() takes a TupleTableSlot instead of a HeapTuple,
> we cannot get the tuple length directly. The patch now gets the tuple
> length directly if all columns have fixed widths; otherwise it calculates
> an average tuple length over the first MAX_MULTI_INSERT_SAMPLES (defined
> as 1000) tuples and uses that to estimate the total length of the tuples
> in a batch.
>
> I noticed that batch insert sometimes needs an additional memory copy
> compared with single insert (which should be the reason we previously saw
> slight regressions), so a good solution seems to be to fall back to
> single insert if the tuple length is larger than a threshold. I set this
> to 2000 after some quick testing.
>
> To make the test stable and strict, I run a checkpoint before each CTAS.
> The test script looks like this:
>
> checkpoint;
> \timing
> create table tt as select a,b,c from t11;
> \timing
> drop table tt;
>
> Previously I tested only BufferHeapTupleTableSlot (i.e. create table tt
> as select * from t11); this time I additionally test VirtualTupleTableSlot
> (i.e. create table tt as select a,b,c from t11). VirtualTupleTableSlot
> seems to be very common in real cases.
>
> I tested four kinds of tables, see the SQL below.
>
> -- tuples with small size.
> create table t11 (a int, b int, c int, d int);
> insert into t11 select s,s,s,s from generate_series(1, 10000000) s;
> analyze t11;
>
> -- tuples that are untoasted; tuple size is 1984 bytes.
> create table t12 (a name, b name, c name, d name, e name, f name, g name,
> h name, i name, j name, k name, l name, m name, n name, o name, p name,
> q name, r name, s name, t name, u name, v name, w name, x name, y name,
> z name, a1 name, a2 name, a3 name, a4 name, a5 name);
> insert into t12 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
> 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
> 'z', 'a', 'b', 'c', 'd', 'e' from generate_series(1, 500000);
> analyze t12;
>
> -- tuples that are untoasted; tuple size is 2112 bytes.
> create table t13 (a name, b name, c name, d name, e name, f name, g name,
> h name, i name, j name, k name, l name, m name, n name, o name, p name,
> q name, r name, s name, t name, u name, v name, w name, x name, y name,
> z name, a1 name, a2 name, a3 name, a4 name, a5 name, a6 name, a7 name);
> insert into t13 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
> 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
> 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g' from generate_series(1, 500000);
> analyze t13;
>
> -- tuples that are toastable; tuple compressed size is 1084 bytes.
> create table t14 (a text, b text, c text, d text, e text, f text, g text,
> h text, i text, j text, k text, l text, m text, n text, o text, p text,
> q text, r text, s text, t text, u text, v text, w text, x text, y text,
> z text);
> insert into t14 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i,
> i, i, i, i, i, i, i, i, i, i from (select repeat('123456789', 10000) from
> generate_series(1,5000)) i;
> analyze t14;
>
> I also tested two scenarios for each test.
> One is to clean up all kernel caches (page & inode & dentry on Linux)
> using the command below and then run the test:
>
> sync; echo 3 > /proc/sys/vm/drop_caches
>
> After running all tests, all relation files will be in the kernel cache
> (my test system's memory is large enough to accommodate all relation
> files); then I run the tests again. I run it like this because in a real
> scenario the result should be between these two results. I also rerun
> each test and finally calculate the average as the experiment result.
> Below are some results:
>
> scenario 1: all related kernel caches are cleaned up (the first two
> columns are times in seconds).
>
> baseline  patch  diff%    SQL
> 10.1      5.57   44.85%   create table tt as select * from t11;
> 10.7      5.52   48.41%   create table tt as select a,b,c from t11;
> 9.57      10.2   -6.58%   create table tt as select * from t12;
> 9.64      8.63   10.48%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
> 14.2      14.46  -1.83%   create table tt as select * from t13;
> 11.88     12.05  -1.43%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
> 3.17      3.25   -2.52%   create table tt as select * from t14;
> 2.93      3.12   -6.48%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;
>
> scenario 2: all related kernel caches are populated after the previous
> testing.
> baseline  patch  diff%    SQL
> 9.6       4.97   48.23%   create table tt as select * from t11;
> 10.41     5.32   48.90%   create table tt as select a,b,c from t11;
> 9.12      9.52   -4.38%   create table tt as select * from t12;
> 9.66      8.6    10.97%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
> 13.56     13.6   -0.30%   create table tt as select * from t13;
> 11.36     11.7   -2.99%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
> 3.08      3.13   -1.62%   create table tt as select * from t14;
> 2.95      3.03   -2.71%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;
>
> From the above we can draw some tentative conclusions:
>
> 1. t11: for small tuples, batch insert improves things a lot (40%+).
>
> 2. t12: for BufferHeapTupleTableSlot the patch slows things down by
> 4.x%-6.x%, but for VirtualTupleTableSlot it improves them by 10.x%.
> Looking at execTuples.c, this seems quite relevant to the additional
> memory copy. VirtualTupleTableSlot seems to be more common than
> BufferHeapTupleTableSlot, so the current code should probably be fine for
> most real cases. Alternatively, we could decide on multi-insert according
> to the input slot type, but that seems ugly in code; or we could lower
> the threshold a bit more so that "create table tt as select * from t12;"
> also improves, although that hurts the VirtualTupleTableSlot case.

To alleviate this, I tuned MAX_TUP_LEN_FOR_MULTI_INSERT a bit, lowering it
from 2000 to 1600. With a table with 24 name-typed columns (total size
1536), I tried both:

case1: create table tt as select * from t12;
case2: create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w from t12;
This patch increases the performance for both. Note that this change
(MAX_TUP_LEN_FOR_MULTI_INSERT) should not affect the previous test results
for t11, t13 and t14, since their code paths are not affected.

kernel caches cleaned up:
        baseline(s)  patch(s)  diff%
case1:  7.65         7.30      4.6%
case2:  7.75         6.80      12.2%

relation files are in cache:
case1:  7.09         6.66      6.1%
case2:  7.49         6.83      8.8%

We do not need to find a larger threshold that brings the case1
improvement just down to zero, since on other test environments the
threshold might be a bit different; it should be set as a rough value, and
1600 seems to benefit most cases. I attached the v3 patch, which just adds
the MAX_TUP_LEN_FOR_MULTI_INSERT change. Thanks.

> 3. For t13, the new code still uses single insert, so the difference
> should be small. I just wanted to see the regression even when we use
> single insert.
>
> 4. For the toast case t14, the degradation is small; not a big deal.
>
> By the way, did we try or think about allowing better prefetch (on Linux)
> for seqscan, i.e. POSIX_FADV_SEQUENTIAL in posix_fadvise() to enlarge the
> kernel readahead window? This should help if sequential tuple handling is
> faster than the default kernel readahead setting.
>
> v2 patch is attached.
>
> On Thu, Mar 7, 2019 at 4:54 PM Heikki Linnakangas <hlinn...@iki.fi> wrote:
>
>> On 06/03/2019 22:06, Paul Guo wrote:
>> > The patch also modifies heap_multi_insert() a bit to do a bit further
>> > code-level optimization by using static memory, instead of using
>> > memory context and dynamic allocation.
>>
>> If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
>> That is still leaked to the current memory context.
>>
>> Leaking into the current memory context is not a bad thing, because
>> resetting a memory context is faster than doing a lot of pfree() calls.
>> The callers just need to be prepared for that, and use a short-lived
>> memory context.
>>
>> > By the way, while looking at the code, I noticed that there are 9
>> > local arrays with large length in toast_insert_or_update() which
>> > seems to be a risk of stack overflow. Maybe we should make them
>> > static or global.
>>
>> Hmm. We currently reserve 512 kB between the kernel's limit and the
>> limit we check in check_stack_depth(). See STACK_DEPTH_SLOP. Those
>> arrays add up to 52800 bytes on a 64-bit machine, if I did my math
>> right. So there's still a lot of headroom. I agree that it nevertheless
>> seems a bit excessive, though.
>>
>> > With the patch,
>> >
>> > Time: 4728.142 ms (00:04.728)
>> > Time: 14203.983 ms (00:14.204)
>> > Time: 1008.669 ms (00:01.009)
>> >
>> > Baseline,
>> > Time: 11096.146 ms (00:11.096)
>> > Time: 13106.741 ms (00:13.107)
>> > Time: 1100.174 ms (00:01.100)
>>
>> Nice speedup!
>>
>> > While for toast and large column sizes there is a < 10% decrease, for
>> > small column sizes the improvement is very good. Actually, if I
>> > hardcode the batch count as 4, all test cases are better, but the
>> > improvement for small column sizes is smaller than with the current
>> > patch. The number 4 is quite case-specific, so I cannot hardcode it
>> > in the patch. Of course we could tune it further, but the current
>> > value seems to be a good trade-off?
>>
>> Have you done any profiling on why the multi-insert is slower with
>> large tuples? In principle, I don't see why it should be slower.
>>
>> - Heikki
v3-0001-Heap-batch-insert-for-CTAS-MatView.patch
Description: Binary data