Re: [HACKERS] Bulk Inserts
2009/9/16 Pierre Frédéric Caillaud <li...@peufeu.com>:

>> OK, that makes sense. I thought you had hacked either XLogInsert or the
>> heap WAL replay code so that you could just accumulate tuples in the
>> rdata chain and then submit them all under the cover of a single
>> WALInsertLock.
>
> Ah, no, I did not do that. This would be difficult to do: the rdata
> chain contains buffer pointers, and when we are in XLogInsert, we have
> an exclusive lock on the buffer. If XLogInsert accumulated xlog records
> as you say, then later, when it's time to write them to the xlog, it
> would no longer hold the exclusive lock on the buffer, so its contents
> could have changed, and if XLogInsert decides to do a full page write,
> the contents will be wrong.

Yes, I didn't mean to make XLogInsert unilaterally accumulate rdata.
(Actually I have done that, purely as a proof of concept. The resulting
database is completely unrecoverable, but as long as you don't bring it
down, it runs fine and lets me test the speed of different concepts
without going to the trouble of implementing them correctly.)
heap_bulk_insert would do the accumulation. The hack to XLogInsert would
involve making it insert an extra dummy xlog record header for every
tuple in the rdata chain. Alternatively, the hack to heap replay would
involve making it accept multiple tuples reported under a single WAL
record. I don't know which one would be easier.

> Besides, the LSN needs to be set in the page at every call to
> heap_insert (or else WAL will not be Ahead), so if XLogInsert
> accumulated stuff before writing it, it would need a mechanism to
> assign an LSN without having written the xlog data yet.

Right, XLogInsert would only be able to accumulate as long as it knew it
was going to get called again before the buffer exclusive lock was
released. That is why the accumulation is better done in
heap_bulk_insert; otherwise it would require an unfortunate amount of
communication between the two.

>> If you haven't done that, then of course doing the bulk insert doesn't
>> help much if you still do tuple-by-tuple XLogInsert.
>
> Exactly.
>
>> So in the case that it is under the limit, you first run through the
>> tuples putting them into the block, then run through the tuples again
>> doing the XLogInserts?
>
> Yes, exactly. This isn't really optimal... I wonder if I could build
> one rdata chain containing all updates to the tuples and submit it in
> one go. Would it work?...

I'm sure it could be made to work. I haven't checked the replay code,
but I doubt it would work on this massed record right out of the box. Or
we could insert dummy headers between each tuple's WAL data.

> ... So in order to benchmark the right thing, I have:
> - all the tables in a big RAMDISK
> - xlog on RAID5
> - fsync=fdatasync
>
> I've also found that setting wal_buffers to a large value like 128MB
> gives a significant speed boost when doing COPY or INSERT INTO SELECT,
> probably because it allows the backends to always find space in the
> buffers even if the walwriter is a bit busy.

That seems very large, even for the high-throughput setup you describe.
Is the WAL background writer set at the default interval? (On the other
hand, if 128MB is just a rounding error in your machine's total RAM
size, why not be generous? On the other other hand, I've seen
performance start to drop as wal_buffers gets too large, though I never
bothered to chase down the cause.)
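For reference, the benchmark configuration described above would look
roughly like this in postgresql.conf (8.4-era syntax; the values are the
ones mentioned in the thread, not recommendations, and "fsync=fdatasync"
presumably refers to wal_sync_method):

    # illustrative only, per the experiment described above
    wal_buffers     = 128MB       # very large, to keep backends from
                                  # waiting on a busy walwriter
    wal_sync_method = fdatasync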
At one point, I think, XLogInsert would synchronously write out the
buffer when it noticed it was over half full, but that got ripped out
(if I'm going to block, I might as well wait until it actually is full
to do so). Now that there is a background writer, maybe it would be a
good idea to have XLogInserters wake it up if the buffer is half full.

> ... Another angle of attack would be to make wal-writing more
> efficient...
>
>> If you mean to do this without changing the xlog interfaces, I'm not
>> optimistic.
>
> Agree: that's why I didn't even try ;)
>
>> If you have to call XLogInsert once per row that is copied (or
>> insert...select), then my experiments show that simply taking the
>> WALInsertLock and immediately releasing it, doing absolutely no real
>> work while it is held, is already a substantial multi-core
>> scalability bottleneck.
>
> Confirmed by the context switch issue above... Having all cores block
> on the same lock for every row can be OK if it's a spinlock protecting
> just a few lines of code... which is not the present case...

Maybe there could be some hybrid approach. You take the spinlock, check
that you could get the lwlock if you wanted to. If you could get the
lwlock and know you have almost no work to do, then just do it while
holding the spinlock. If you discover you have more work to do (xlog
file switch, nontrivial AdvanceXLInsert, etc.) then actually take the
lwlock and drop the spinlock. This *really* breaks the encapsulation of
lwlock, so it is probably not much more than idle speculation.

Jeff
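To make the hybrid idea concrete, here is an idle-speculation sketch.
SpinLockAcquire, LWLockConditionalAcquire and pg_usleep are real
primitives; XLogHybridSpin, lwlock_is_free(), insert_is_trivial(),
copy_record_fast() and do_full_xlog_insert() are all hypothetical, and
peeking at an LWLock's state from outside lwlock.c is exactly the
encapsulation break warned about above. The scheme is only coherent if
every taker of WALInsertLock goes through the spinlock first:

    for (;;)
    {
        SpinLockAcquire(&XLogHybridSpin);

        if (insert_is_trivial(rdata) && lwlock_is_free(WALInsertLock))
        {
            /* fast path: a few instructions, done under the spinlock only */
            copy_record_fast(rdata);
            SpinLockRelease(&XLogHybridSpin);
            break;
        }

        /*
         * Slow path: we need the real lock (xlog file switch, nontrivial
         * AdvanceXLInsert, ...).  Grab it without sleeping, since we
         * still hold the spinlock.
         */
        if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
        {
            SpinLockRelease(&XLogHybridSpin);
            do_full_xlog_insert(rdata);
            LWLockRelease(WALInsertLock);
            break;
        }

        /* lock busy: back off and retry (real code would queue on it) */
        SpinLockRelease(&XLogHybridSpin);
        pg_usleep(10);
    }

Because the fast path runs entirely under the spinlock and the slow path
acquires the lwlock while still holding it, the two paths cannot overlap;
the cost is that an ordinary LWLockAcquire(WALInsertLock) anywhere else
would silently break the scheme.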
Re: [HACKERS] Bulk Inserts
> OK, that makes sense. I thought you had hacked either XLogInsert or the
> heap WAL replay code so that you could just accumulate tuples in the
> rdata chain and then submit them all under the cover of a single
> WALInsertLock.

Ah, no, I did not do that. This would be difficult to do: the rdata
chain contains buffer pointers, and when we are in XLogInsert, we have
an exclusive lock on the buffer. If XLogInsert accumulated xlog records
as you say, then later, when it's time to write them to the xlog, it
would no longer hold the exclusive lock on the buffer, so its contents
could have changed, and if XLogInsert decides to do a full page write,
the contents will be wrong.

Besides, the LSN needs to be set in the page at every call to
heap_insert (or else WAL will not be Ahead), so if XLogInsert
accumulated stuff before writing it, it would need a mechanism to assign
an LSN without having written the xlog data yet.

> If you haven't done that, then of course doing the bulk insert doesn't
> help much if you still do tuple-by-tuple XLogInsert.

Exactly.

> So in the case that it is under the limit, you first run through the
> tuples putting them into the block, then run through the tuples again
> doing the XLogInserts?

Yes, exactly. This isn't really optimal... I wonder if I could build one
rdata chain containing all updates to the tuples and submit it in one
go. Would it work?...

> Do you have an IO bottleneck even in the absence of fsyncs?

Sometimes, yes:

- When xlog is on a (rather slow) software RAID1, I've found that the
  fsyncs which happen when switching to a new xlog segment are
  significant, and the amount of log data written is dangerously close
  to the max throughput, too.
- When xlog is on a much faster RAID5, there is no such problem. fsync
  on RAID5 is horrendously slow if you have many pending small writes,
  but if all you need to sync is a 16MB file which nicely aligns on
  raid stripes, it's not a problem.

So in order to benchmark the right thing, I have:
- all the tables in a big RAMDISK
- xlog on RAID5
- fsync=fdatasync

I've also found that setting wal_buffers to a large value like 128MB
gives a significant speed boost when doing COPY or INSERT INTO SELECT,
probably because it allows the backends to always find space in the
buffers even if the walwriter is a bit busy.

> My experience on multi-core machines with decent IO systems has been
> that the amount of WAL traffic (by volume) matters rather little, as
> opposed to the number of WALInsertLocks taken, which matters quite a
> bit. Of course this depends quite a bit on your OS and hardware.

Exactly: with the configuration above, iowait is extremely small (1%),
yet all processes still spend 50% of their time waiting on
WALInsertLock. The lock is held for about 15% of the total time; with 4
cores and 4 threads, if a core spends X time holding the lock, it's
logical that the others spend 3*X time waiting on it. But consider this:

- 4 processes spend 50% of their time waiting on the lock
- therefore, at any time, there are on average 2 processes waiting
- therefore, every time the lock is released, a process is woken up
  => more than 200,000 context switches per second seen in vmstat

Actually it does 1 context switch every two inserted rows, which is
enormous... I've done a bit of profiling and found that 70% of the time
spent holding this lock is... computing the header's CRC. Just for a
quick check, I removed all CRC calculations: there was absolutely no
performance gain.

Another angle of attack would be to make wal-writing more efficient...
> If you mean to do this without changing the xlog interfaces, I'm not
> optimistic.

Agree: that's why I didn't even try ;)

> If you have to call XLogInsert once per row that is copied (or
> insert...select), then my experiments show that simply taking the
> WALInsertLock and immediately releasing it, doing absolutely no real
> work while it is held, is already a substantial multi-core scalability
> bottleneck.

Confirmed by the context switch issue above... Having all cores block on
the same lock for every row can be OK if it's a spinlock protecting just
a few lines of code... which is not the present case...

> Once we accept that this must be done, the next existing bottleneck is
> the memcpy of the first byte from the rdata chain into the shared
> wal_buffers, presumably because this copy involves fighting the cache
> line away from other cores. Once you've copied the first byte, the
> rest of them seem to be almost free. (Again, this is probably hardware
> and situation dependent.)

Since I have one 4-core CPU I probably don't see this effect (a
multi-CPU case would).

> I've seen some suggestions that the wal_buffer block initialization
> work be moved from being done by AdvanceXLInsert to instead be done by
> XLogWrite. However, I've not seen any indication that AdvanceXLInsert
> is a meaningful bottleneck in the first place. Except when wal_buffers
> is too small: then AdvanceXLInsert is a bottleneck, but only because
> XLogWrite is getting called from within it, in which case moving work
> from one to the other is probably not going to make things better.
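As context for the CRC observation in this message (70% of the time
under the lock spent computing the header's CRC): the checksum work
inside 8.4-era XLogInsert has roughly this shape. INIT_CRC32, COMP_CRC32
and FIN_CRC32 are the real macros; the loop is simplified, and rdata and
record stand for XLogInsert's locals:

    /* Simplified sketch of the record-CRC work in 8.4-era XLogInsert().
     * The CRC first covers every chunk of the rdata chain; then, while
     * WALInsertLock is held and the record header has been filled in,
     * the header bytes (excluding the CRC field itself) are folded in
     * and the checksum is finalized. */
    pg_crc32     rdata_crc;
    XLogRecData *rdt;

    INIT_CRC32(rdata_crc);
    for (rdt = rdata; rdt != NULL; rdt = rdt->next)
        COMP_CRC32(rdata_crc, rdt->data, rdt->len);

    /* ... later, under WALInsertLock, the header is folded in ... */
    FIN_CRC32(rdata_crc);
    record->xl_crc = rdata_crc;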
Re: [HACKERS] Bulk Inserts
> Yes, I did not consider that to be a problem because I did not think
> it would be used on indexed tables. I figured that the gain from doing
> bulk inserts into the table would be so diluted by the
> still-bottlenecked index maintenance that it was OK not to use this
> optimization for indexed tables.

I've tested with indexes, and the index update time is much larger than
the insert time. Bulk inserts still provide a little bonus though, and
having a solution that works in all cases is better IMHO.

> My original thought was based on the idea of still using heap_insert,
> but with a modified form of bistate which would hold the exclusive
> lock and not just a pin. If heap_insert is being driven by the
> unmodified COPY code, then it can't guarantee that COPY won't stall on
> a pipe read or something, and so probably shouldn't hold an exclusive
> lock while filling the block.

Exactly, that's what I was thinking too, and I reached the same
conclusion.

> That is why I decided a local buffer would be better, as the exclusive
> lock is really a no-op and wouldn't block anyone. But if you are
> creating a new heap_bulk_insert and modifying the COPY code to go with
> it, then you can guarantee it won't stall from the driving end,
> instead.

I think it's better, but you have to buffer tuples: at least a full
page's worth, or better, several pages' worth of tuples, in case inline
compression kicks in and shrinks them, since the purpose is to be able
to fill a complete page in one go.

> Whether any of these approaches will be maintainable enough to be
> integrated into the code base is another matter. It seems like there
> is already a lot of discussion going on around various permutations of
> copy options.

It's not really a COPY mod, since it would also be good for big INSERT
INTO SELECT FROM, which is wal-bound too (even more so than COPY, since
there is no parsing to do).
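The "buffer several pages' worth" rule above, as a tiny hypothetical
sketch (BLCKSZ is real; everything else is invented for illustration):

    /* Hypothetical: decide when the queued tuples are worth flushing.
     * Buffer several pages' worth because inline compression may shrink
     * tuples at insert time, and the goal is to fill at least one
     * complete page in a single operation. */
    #define BULK_QUEUE_PAGES    4   /* arbitrary safety factor */

    static bool
    bulk_queue_ready(Size queued_bytes)
    {
        return queued_bytes >= BULK_QUEUE_PAGES * BLCKSZ;
    }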
Re: [HACKERS] Bulk Inserts
> Does that heuristic change the timings much? If not, it seems like it
> would be better to keep it simple and always do the same thing, like
> log the tuples (if it is done under one WALInsertLock, which I am
> assuming it is..)

It is the logging of whole pages that makes it faster. If you fill a
page with tuples in one operation (while holding exclusive lock) and
then insert WAL records for each tuple, there is no speed gain.

Inserting a full page WAL record (since you just filled the page
completely):
- only takes WALInsertLock once instead of once per tuple
- reduces wal traffic
- is about 2x faster in my benchmark

And inserting a clear new page record (if the page was previously
new/empty and the relation is fsync'd at the end):
- only takes WALInsertLock once instead of once per tuple
- reduces wal traffic a lot
- is about 4x faster in my benchmark

> Do you even need the new empty page record? I think a zero page will
> be handled correctly next time it is read into shared buffers, won't
> it?

I have no idea ;)

> But I guess it is needed to avoid problems with partial page writes
> that would leave the page in a state that is neither all zeros nor
> consistent.

Plus, empty page records make for very small WAL traffic, and I didn't
see any performance difference with or without them.

> If the entire page is logged, would it have to be marked as not
> removable by the log compression tool? Or can the tool recreate the
> needed delta?

No, the tool cannot recreate the data, since the idea is precisely to
replace a lot of tuple insert messages with one entire-page message,
which takes both less space and less time. The warm-standby replicators
that get this WAL need to know the page contents to replicate it...
(also, it will probably be faster for them to redo a page write than to
redo all the tuple inserts).

Here is what I'm thinking about now (a rough sketch in code follows this
message):

* have some kind of BulkInsertState which contains:
  - info about the relation, indexes, triggers, etc.
  - a tuple queue. The tuple queue may be a tuple store, or simply
    tuple copies in a local memory context.

You'd have functions to:
- set up the BulkInsertState
- add a tuple to the BulkInsertState
- finish the operation and clear the BulkInsertState

When adding a tuple, it is stored in the queue. When the queue is full,
a bulk insert operation takes place, hopefully filling an entire page,
and returns. Post-insert triggers and index updates are also handled at
this point. When finished, the function that clears the state also
inserts all remaining tuples in the queue.

With this you could also do something *really* interesting: bulk index
updates... Bulk index updates are probably mutually exclusive with
after-row triggers though.

Another angle of attack would be to make wal-writing more efficient...
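A rough shape of that proposal in code, purely illustrative. Note that
8.4 already has a much smaller, unrelated struct called BulkInsertState
used by COPY, so a different name is used here; every identifier below
is hypothetical (Relation, ResultRelInfo, HeapTuple and MemoryContext
are the real types):

    /* Hypothetical API sketch of the queue-based bulk insert state
     * described above; nothing here exists in PostgreSQL. */
    typedef struct BulkInsertQueue
    {
        Relation        rel;           /* target relation */
        ResultRelInfo  *rri;           /* indexes, triggers, etc. */
        MemoryContext   cxt;           /* holds the queued tuple copies */
        HeapTuple      *tuples;        /* queued tuples */
        int             ntuples;
        Size            queued_bytes;  /* to know when a page's worth
                                        * is ready */
    } BulkInsertQueue;

    extern BulkInsertQueue *bulk_insert_begin(Relation rel,
                                              ResultRelInfo *rri);
    extern void bulk_insert_add(BulkInsertQueue *q, HeapTuple tup);
    extern void bulk_insert_end(BulkInsertQueue *q);  /* flush remainder */

bulk_insert_add() would copy the tuple into cxt and, once the queue
holds a page's worth or more, fill a page in one operation and then run
index updates and after-row triggers for the flushed batch, as the
message describes.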
Re: [HACKERS] Bulk Inserts
2009/9/15 Pierre Frédéric Caillaud <li...@peufeu.com>:

>> Does that heuristic change the timings much? If not, it seems like it
>> would be better to keep it simple and always do the same thing, like
>> log the tuples (if it is done under one WALInsertLock, which I am
>> assuming it is..)
>
> It is the logging of whole pages that makes it faster. If you fill a
> page with tuples in one operation (while holding exclusive lock) and
> then insert WAL records for each tuple, there is no speed gain.
>
> Inserting a full page WAL record (since you just filled the page
> completely):
> - only takes WALInsertLock once instead of once per tuple

OK, that makes sense. I thought you had hacked either XLogInsert or the
heap WAL replay code so that you could just accumulate tuples in the
rdata chain and then submit them all under the cover of a single
WALInsertLock. If you haven't done that, then of course doing the bulk
insert doesn't help much if you still do tuple-by-tuple XLogInsert.

So in the case that it is under the limit, you first run through the
tuples putting them into the block, then run through the tuples again
doing the XLogInserts?

> - reduces wal traffic
> - is about 2x faster in my benchmark
>
> And inserting a clear new page record (if the page was previously
> new/empty and the relation is fsync'd at the end):
> - only takes WALInsertLock once instead of once per tuple
> - reduces wal traffic a lot
> - is about 4x faster in my benchmark

Do you have an IO bottleneck even in the absence of fsyncs? My
experience on multi-core machines with decent IO systems has been that
the amount of WAL traffic (by volume) matters rather little, as opposed
to the number of WALInsertLocks taken, which matters quite a bit. Of
course this depends quite a bit on your OS and hardware.

> ... Another angle of attack would be to make wal-writing more
> efficient...

If you mean to do this without changing the xlog interfaces, I'm not
optimistic. If you have to call XLogInsert once per row that is copied
(or insert...select), then my experiments show that simply taking the
WALInsertLock and immediately releasing it, doing absolutely no real
work while it is held, is already a substantial multi-core scalability
bottleneck.

Once we accept that this must be done, the next existing bottleneck is
the memcpy of the first byte from the rdata chain into the shared
wal_buffers, presumably because this copy involves fighting the cache
line away from other cores. Once you've copied the first byte, the rest
of them seem to be almost free. (Again, this is probably hardware and
situation dependent.)

I've seen some suggestions that the wal_buffer block initialization work
be moved from being done by AdvanceXLInsert to instead be done by
XLogWrite. However, I've not seen any indication that AdvanceXLInsert is
a meaningful bottleneck in the first place. Except when wal_buffers is
too small: then AdvanceXLInsert is a bottleneck, but only because
XLogWrite is getting called from within it, in which case moving work
from one to the other is probably not going to make things better.

Jeff
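To make the "accumulate tuples in the rdata chain and submit them under
one WALInsertLock" idea concrete: the chain is built from XLogRecData
links (typedef as in the 8.4 sources), and a hypothetical
heap_bulk_insert could assemble one chain covering all tuples placed on
a page, with the "dummy header" variant interposing a marker entry per
tuple. Nothing below beyond the typedef exists in PostgreSQL; it is only
a sketch of the idea, and replay support for such a record does not
exist:

    /* XLogRecData as in 8.4's xlog.h (real). */
    typedef struct XLogRecData
    {
        char       *data;          /* start of rdata to include */
        uint32      len;           /* length of rdata to include */
        Buffer      buffer;        /* buffer containing data, if any */
        bool        buffer_std;    /* buffer has standard pd_lower/pd_upper */
        struct XLogRecData *next;  /* next struct in chain, or NULL */
    } XLogRecData;

    /* Hypothetical: build one rdata chain for ntups tuples already
     * placed on 'buffer', separated by dummy per-tuple headers so that
     * replay (or a hacked XLogInsert) could split the record back into
     * tuples.  make_dummy_tuple_header() and DUMMY_HEADER_LEN are
     * invented names. */
    static XLogRecData *
    build_bulk_rdata(Buffer buffer, HeapTuple *tups, int ntups)
    {
        XLogRecData *head = NULL;
        XLogRecData *tail = NULL;
        int          i;

        for (i = 0; i < ntups; i++)
        {
            XLogRecData *hdr = palloc0(sizeof(XLogRecData));
            XLogRecData *body = palloc0(sizeof(XLogRecData));

            hdr->data = make_dummy_tuple_header(tups[i]); /* hypothetical */
            hdr->len = DUMMY_HEADER_LEN;                  /* hypothetical */
            hdr->buffer = InvalidBuffer;
            hdr->next = body;

            body->data = (char *) tups[i]->t_data;
            body->len = tups[i]->t_len;
            body->buffer = buffer;   /* ties record to the page for
                                      * full-page-write decisions */
            body->buffer_std = true;
            body->next = NULL;

            if (tail)
                tail->next = hdr;
            else
                head = hdr;
            tail = body;
        }
        return head;   /* would be passed to a single XLogInsert() call */
    }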
Re: [HACKERS] Bulk Inserts
Replying to myself...

Jeff suggested building pages in local memory and inserting them later
into the table. This is what's used in CLUSTER, for instance, I believe.
It has some drawbacks though:

- To insert the tuples in indexes, the tuples need tids, but if you
  build the page in local memory, you don't know which page they will be
  on until after allocating the page, which will probably be done after
  the page is built, so it's a bit of a chicken-and-egg problem.
- It only works on new pages. Pages which are not empty, but have free
  space, cannot be written in this way.

The little experiment I made yesterday does not have these drawbacks,
since it allocates pages in the standard way; it simply inserts many
tuples in one operation instead of just one. If the page happens to be
empty, it's even better, but it's not necessary. If your table has lots
of free space, it will be used.
Re: [HACKERS] Bulk Inserts
2009/9/14 Pierre Frédéric Caillaud <li...@peufeu.com>:

> I've done a little experiment with bulk inserts.
>
> = heap_bulk_insert()
>
> Behaves like heap_insert except it takes an array of tuples
> (HeapTuple *tups, int ntups).
>
> - Grabs a page (same as heap_insert)
> - While holding exclusive lock, inserts as many tuples as it can on
>   the page:
>   - either the page gets full
>   - or we run out of tuples.
> - Generates xlog: choice between
>   - Full xlog mode:
>     - if we inserted more than 10 tuples (totally bogus heuristic),
>       log the entire page
>     - else, log individual tuples as heap_insert does

Does that heuristic change the timings much? If not, it seems like it
would be better to keep it simple and always do the same thing, like log
the tuples (if it is done under one WALInsertLock, which I am assuming
it is..)

>   - Light log mode:
>     - if the page was empty, only xlog a new empty page record, not
>       the page contents
>     - else, log fully
>     - heap_sync() at the end
> - Releases the page
> - If we still have tuples to insert, repeat.
>
> Am I right in assuming that:
>
> 1) - If the page was empty,
>    - and log archiving isn't used,
>    - and the table is heap_sync()'d at the end,
>    => only a new empty page record needs to be created, then the page
>       can be completely filled?

Do you even need the new empty page record? I think a zero page will be
handled correctly next time it is read into shared buffers, won't it?
But I guess it is needed to avoid problems with partial page writes that
would leave it in a state that is neither all zeros nor consistent.

> 2) - If the page isn't empty,
>    - or log archiving is used,
>    => logging either the inserted tuples or the entire page is OK to
>       guarantee persistence?

If the entire page is logged, would it have to be marked as not
removable by the log compression tool? Or can the tool recreate the
needed delta?

Jeff
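The experiment quoted at the top of this message, restated as a
hypothetical code outline. The patch itself was not posted, so every
helper name here is invented (PageIsNew, PageIsEmpty, BufferGetPage and
UnlockReleaseBuffer are real; log_whole_page() would presumably resemble
8.4's real log_newpage(), which CLUSTER and index builds already use to
WAL-log a full page in one record):

    /* Hypothetical outline of the heap_bulk_insert() experiment. */
    static void
    heap_bulk_insert_sketch(Relation rel, HeapTuple *tups, int ntups,
                            bool light_log_mode)
    {
        int     done = 0;

        while (done < ntups)
        {
            /* grab a page with free space, as heap_insert does */
            Buffer  buffer = get_page_for_bulk_insert(rel); /* hypothetical */
            Page    page = BufferGetPage(buffer);
            bool    was_empty = PageIsNew(page) || PageIsEmpty(page);
            int     first = done;

            /* while holding the exclusive content lock, fill the page
             * until it is full or we run out of tuples */
            while (done < ntups &&
                   page_fits(page, tups[done]))             /* hypothetical */
                place_tuple(page, tups[done++]);            /* hypothetical */

            /* WAL: empty-page record, whole page, or per-tuple records,
             * per the heuristics quoted above */
            if (light_log_mode && was_empty)
                log_clear_new_page(rel, buffer);            /* hypothetical */
            else if (done - first > 10)  /* the "totally bogus heuristic" */
                log_whole_page(rel, buffer);                /* hypothetical */
            else
                log_tuples(rel, buffer, &tups[first],
                           done - first);                   /* hypothetical */

            UnlockReleaseBuffer(buffer);
        }
        /* light log mode relies on heap_sync() at the end */
    }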
Re: [HACKERS] Bulk Inserts
2009/9/14 Pierre Frédéric Caillaud <li...@peufeu.com>:

> Replying to myself...
>
> Jeff suggested to build pages in local memory and insert them later in
> the table. This is what's used in CLUSTER for instance, I believe. It
> has some drawbacks though:
>
> - To insert the tuples in indexes, the tuples need tids, but if you
>   build the page in local memory, you don't know on which page they
>   will be until after allocating the page, which will probably be done
>   after the page is built, so it's a bit of a chicken-and-egg problem.

Yes, I did not consider that to be a problem because I did not think it
would be used on indexed tables. I figured that the gain from doing bulk
inserts into the table would be so diluted by the still-bottlenecked
index maintenance that it was OK not to use this optimization for
indexed tables.

> - It only works on new pages. Pages which are not empty, but have free
>   space, cannot be written in this way.

My original thought was based on the idea of still using heap_insert,
but with a modified form of bistate which would hold the exclusive lock
and not just a pin. If heap_insert is being driven by the unmodified
COPY code, then it can't guarantee that COPY won't stall on a pipe read
or something, and so probably shouldn't hold an exclusive lock while
filling the block. That is why I decided a local buffer would be better,
as the exclusive lock is really a no-op and wouldn't block anyone. But
if you are creating a new heap_bulk_insert and modifying the COPY code
to go with it, then you can guarantee it won't stall from the driving
end, instead.

Whether any of these approaches will be maintainable enough to be
integrated into the code base is another matter. It seems like there is
already a lot of discussion going on around various permutations of copy
options.

Cheers,

Jeff
Re: [HACKERS] Bulk inserts and usage_count
On Tue, May 15, 2007 at 04:37:28PM +0100, Heikki Linnakangas wrote:
> While testing the buffer ring patch, I noticed that bulk inserts with
> both INSERT and COPY pin and unpin the buffer they insert to for every
> tuple. That means that the usage_count of all those buffers is bumped
<snip>
> A fix for COPY will fall naturally out of the buffer ring patch, but
> not for INSERT. A more general fix would be to somehow keep the last
> insertion page pinned across calls to heap_insert.

ISTR discussion in the past about having things like COPY and INSERT
INTO ... SELECT build entire pages in one shot once they exhaust the
FSM. Not only would it address this issue, but it would probably improve
performance in many ways (less locking and unlocking, the ability to
pre-sort before inserting into indexes, fewer calls to the FSM, and
probably a bunch of other things).
--
Jim Nasby                   [EMAIL PROTECTED]
EnterpriseDB                http://enterprisedb.com
512.569.9461 (cell)