On Mon, Jun 17, 2019 at 8:53 PM Paul Guo <p...@pivotal.io> wrote:

> Hi all,
>
> I've been working on other things until recently, when I restarted this
> work, profiling and refactoring the code. It's been a long time since the
> last patch was proposed. The new patch has first been refactored due to
> 4da597edf1bae0cf0453b5ed6fc4347b6334dfe1 (Make TupleTableSlots extensible,
> finish split of existing slot type).
>
> Now that intorel_receive() takes a TupleTableSlot instead of a HeapTuple,
> we cannot get the tuple length directly. The patch now gets the tuple
> length directly if all columns have fixed widths; otherwise it calculates
> an average tuple length over the first MAX_MULTI_INSERT_SAMPLES (defined
> as 1000) tuples and uses that to estimate the total length of the tuples
> in a batch.
>
> I noticed that batch insert sometimes needs an additional memory copy
> compared with single insert (which should be the reason we previously saw
> slight regressions), so a good solution seems to be to fall back to
> single insert if the tuple length is larger than a threshold. I set this
> to 2000 after some quick testing.
>
> To make the test stable and strict, I run a checkpoint before each CTAS.
> The test script looks like this:
>
> checkpoint;
> \timing
> create table tt as select a,b,c from t11;
> \timing
> drop table tt;
>
> Previously I tested only BufferHeapTupleTableSlot (i.e. create table tt
> as select * from t11); this time I additionally test VirtualTupleTableSlot
> (i.e. create table tt as select a,b,c from t11). VirtualTupleTableSlot
> seems to be very common in real cases.
>
> I tested four kinds of tables, see the SQL below.
>
> -- tuples with small size.
> create table t11 (a int, b int, c int, d int);
> insert into t11 select s,s,s,s from generate_series(1, 10000000) s;
> analyze t11;
>
> -- tuples that are untoasted; tuple size is 1984 bytes.
> create table t12 (a name, b name, c name, d name, e name, f name, g name,
> h name, i name, j name, k name, l name, m name, n name, o name, p name,
> q name, r name, s name, t name, u name, v name, w name, x name, y name,
> z name, a1 name, a2 name, a3 name, a4 name, a5 name);
> insert into t12 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
> 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
> 'z', 'a', 'b', 'c', 'd', 'e' from generate_series(1, 500000);
> analyze t12;
>
> -- tuples that are untoasted; tuple size is 2112 bytes.
> create table t13 (a name, b name, c name, d name, e name, f name, g name,
> h name, i name, j name, k name, l name, m name, n name, o name, p name,
> q name, r name, s name, t name, u name, v name, w name, x name, y name,
> z name, a1 name, a2 name, a3 name, a4 name, a5 name, a6 name, a7 name);
> insert into t13 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
> 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
> 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g' from generate_series(1, 500000);
> analyze t13;
>
> -- tuples that are toastable; tuple compressed size is 1084 bytes.
> create table t14 (a text, b text, c text, d text, e text, f text, g text,
> h text, i text, j text, k text, l text, m text, n text, o text, p text,
> q text, r text, s text, t text, u text, v text, w text, x text, y text,
> z text);
> insert into t14 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i,
> i, i, i, i, i, i, i, i, i, i from (select repeat('123456789', 10000) from
> generate_series(1,5000)) i;
> analyze t14;
>
> I also tested two scenarios for each test.
> One is to clean up all kernel caches (page & inode & dentry on Linux)
> using the command below and then run the test:
>
> sync; echo 3 > /proc/sys/vm/drop_caches
>
> After running all tests, all relation files will be in the kernel cache
> (my test system's memory is large enough to accommodate all relation
> files); then I run the tests again. I run it like this because in a real
> scenario the result should be between these two results. I also rerun
> each test and finally calculate the average as the experiment result.
> Below are some results:
>
> scenario 1: all related kernel caches are cleaned up (the first two
> columns are times in seconds).
>
> baseline  patch  diff%    SQL
> 10.1      5.57   44.85%   create table tt as select * from t11;
> 10.7      5.52   48.41%   create table tt as select a,b,c from t11;
> 9.57      10.2   -6.58%   create table tt as select * from t12;
> 9.64      8.63   10.48%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
> 14.2      14.46  -1.83%   create table tt as select * from t13;
> 11.88     12.05  -1.43%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
> 3.17      3.25   -2.52%   create table tt as select * from t14;
> 2.93      3.12   -6.48%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;
>
> scenario 2: all related kernel caches are populated after the previous
> testing.
> baseline  patch  diff%    SQL
> 9.6       4.97   48.23%   create table tt as select * from t11;
> 10.41     5.32   48.90%   create table tt as select a,b,c from t11;
> 9.12      9.52   -4.38%   create table tt as select * from t12;
> 9.66      8.6    10.97%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
> 13.56     13.6   -0.30%   create table tt as select * from t13;
> 11.36     11.7   -2.99%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
> 3.08      3.13   -1.62%   create table tt as select * from t14;
> 2.95      3.03   -2.71%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;
>
> From the above we can draw some tentative conclusions:
>
> 1. t11: for small tuples, batch insert improves things a lot (40%+).
>
> 2. t12: for BufferHeapTupleTableSlot the patch slows things down by
> 4.x%-6.x%, but for VirtualTupleTableSlot it improves them by 10.x%.
> Looking at execTuples.c, this seems quite relevant to the additional
> memory copy. VirtualTupleTableSlot seems to be more common than
> BufferHeapTupleTableSlot, so the current code should probably be fine for
> most real cases. Alternatively, we could decide on multi-insert according
> to the input slot type, but that seems ugly in code; or we could lower
> the threshold a bit more so that "create table tt as select * from t12;"
> also improves, although that hurts the VirtualTupleTableSlot case.

To alleviate this, I tuned MAX_TUP_LEN_FOR_MULTI_INSERT a bit, lowering it
from 2000 to 1600. With a table with 24 name-typed columns (total size
1536), I tried both:

case1: create table tt as select * from t12;
case2: create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w from t12;
This patch increases the performance for both. Note that this change
(MAX_TUP_LEN_FOR_MULTI_INSERT) should not affect the previous test results
for t11, t13 and t14, since their code paths are not affected.

kernel caches cleaned up:
        baseline(s)  patch(s)  diff%
case1:  7.65         7.30      4.6%
case2:  7.75         6.80      12.2%

relation files are in cache:
case1:  7.09         6.66      6.1%
case2:  7.49         6.83      8.8%

We do not need to find a larger threshold that brings the case1
improvement just down to zero, since on other test environments the
threshold might be a bit different; it should be set as a rough value, and
1600 seems to benefit most cases. I attached the v3 patch, which just adds
the MAX_TUP_LEN_FOR_MULTI_INSERT change. Thanks.

> 3. For t13, the new code still uses single insert, so the difference
> should be small. I just wanted to see the regression even when we use
> single insert.
>
> 4. For the toast case t14, the degradation is small; not a big deal.
>
> By the way, did we try or think about allowing better prefetch (on Linux)
> for seqscan, i.e. POSIX_FADV_SEQUENTIAL in posix_fadvise() to enlarge the
> kernel readahead window? This should help if sequential tuple handling is
> faster than the default kernel readahead setting.
>
> v2 patch is attached.
>
> On Thu, Mar 7, 2019 at 4:54 PM Heikki Linnakangas <hlinn...@iki.fi> wrote:
>
>> On 06/03/2019 22:06, Paul Guo wrote:
>> > The patch also modifies heap_multi_insert() a bit to do a bit further
>> > code-level optimization by using static memory, instead of using
>> > memory context and dynamic allocation.
>>
>> If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
>> That is still leaked to the current memory context.
>>
>> Leaking into the current memory context is not a bad thing, because
>> resetting a memory context is faster than doing a lot of pfree() calls.
>> The callers just need to be prepared for that, and use a short-lived
>> memory context.
>>
>> > By the way, while looking at the code, I noticed that there are 9
>> > local arrays with large length in toast_insert_or_update() which
>> > seems to be a risk of stack overflow. Maybe we should make them
>> > static or global.
>>
>> Hmm. We currently reserve 512 kB between the kernel's limit and the
>> limit we check in check_stack_depth(). See STACK_DEPTH_SLOP. Those
>> arrays add up to 52800 bytes on a 64-bit machine, if I did my math
>> right. So there's still a lot of headroom. I agree that it nevertheless
>> seems a bit excessive, though.
>>
>> > With the patch,
>> >
>> > Time: 4728.142 ms (00:04.728)
>> > Time: 14203.983 ms (00:14.204)
>> > Time: 1008.669 ms (00:01.009)
>> >
>> > Baseline,
>> > Time: 11096.146 ms (00:11.096)
>> > Time: 13106.741 ms (00:13.107)
>> > Time: 1100.174 ms (00:01.100)
>>
>> Nice speedup!
>>
>> > While for toast and large column sizes there is a < 10% decrease, for
>> > small column sizes the improvement is very good. Actually, if I
>> > hardcode the batch count as 4, all test cases are better, but the
>> > improvement for small column sizes is smaller than with the current
>> > patch. The number 4 is quite case-specific, so I cannot hardcode it
>> > in the patch. Of course we could tune it further, but the current
>> > value seems to be a good trade-off?
>>
>> Have you done any profiling on why the multi-insert is slower with
>> large tuples? In principle, I don't see why it should be slower.
>>
>> - Heikki
v3-0001-Heap-batch-insert-for-CTAS-MatView.patch
Description: Binary data