On 12/29/2014 11:29 PM, Pawel Veselov wrote:

Andy,

thanks for looking into this.

On Mon, Dec 29, 2014 at 9:00 AM, Andy Colson <a...@squeakycode.net> wrote:

    On 12/28/2014 3:49 PM, Pawel Veselov wrote:

        Hi.

        I was wondering if anybody would have any ideas on how to improve
        certain operations that we are having.

    <SNIP>


        Besides "can somebody please look at this and let me know if I'm doing
        something utterly stupid", here are my questions.

        1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
        function? All I see is that calls to the merge_all() function take a long
        time, and the CPU is high while this is going on.



    First, I'll admit I didn't read your entire post.

    I can think of a couple methods:

    1) try each of the statements in merge_all by hand with an "explain
    analyze" in front to see which is slow.  Look for things that hit big
    tables without an index.  Check that FK lookups use indexes.
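
    For example, something like this (table and column names are just
    placeholders for whatever merge_all actually touches):

        BEGIN;
        EXPLAIN ANALYZE
        UPDATE main_table
           SET cnt = cnt + 42
         WHERE main_key = 'some-key';   -- look for Seq Scans on big tables in the plan
        ROLLBACK;   -- EXPLAIN ANALYZE really executes the UPDATE, so roll it back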


If I didn't miss anything, that seems to be OK, even on function-based queries.

    2) try pg_stat_statements, setting "pg_stat_statements.track = all".  see:
    http://www.postgresql.org/docs/9.4/static/pgstatstatements.html

    I have used this to profile some functions, and it worked pretty well.
    Mostly I use it on a test box, but I once ran it on the live server,
    which was scary, but it worked great.
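
    Roughly, the setup looks like this (the .conf changes need a server restart):

        # postgresql.conf
        shared_preload_libraries = 'pg_stat_statements'
        pg_stat_statements.track = all    # also counts statements run inside functions

        -- then, in the database:
        CREATE EXTENSION pg_stat_statements;
        SELECT query, calls, total_time
          FROM pg_stat_statements
         ORDER BY total_time DESC
         LIMIT 10;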


That looks promising. I turned it on, and am waiting for the next "quiet time" when 
I can restart the server.

    3) try auto-explain:
    http://www.postgresql.org/docs/9.4/static/auto-explain.html

    I've never used it, so I don't know if it'll show each statement inside a
    function.  It dumps everything to the log AFAIK, so you'll have to dig the
    info out by hand.
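
    It looks like auto_explain.log_nested_statements is what would catch the
    statements run inside the function; a minimal sketch (thresholds are just
    examples):

        LOAD 'auto_explain';                           -- or set it in session_preload_libraries
        SET auto_explain.log_min_duration = '250ms';   -- log plans for anything slower than this
        SET auto_explain.log_nested_statements = on;   -- include statements executed inside functions
        SET auto_explain.log_analyze = on;             -- log actual run times, not just estimates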

    > 2) Is there a better way to merge individual rows, except doing
    > UPDATE/INSERT in a loop, and would that be CPU expensive?
    >

    Not that I know of.  I use pretty much the same thing.  Soon we will have
    merge/upsert support.  Hopefully it'll be fast.
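
    Until then, the usual workaround inside a plpgsql function is the retry
    loop from the docs; a rough sketch with made-up column names:

        LOOP
            UPDATE main SET cnt = cnt + in_cnt WHERE main_key = in_key;
            IF found THEN
                RETURN;
            END IF;
            BEGIN
                INSERT INTO main (main_key, cnt) VALUES (in_key, in_cnt);
                RETURN;
            EXCEPTION WHEN unique_violation THEN
                -- another process inserted the same key first; loop around and retry the UPDATE
            END;
        END LOOP;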


Well, any time I cancelled the PID that was executing this whole mess, it would 
always stop at UPDATE ... SET ... WHERE on the main table, which does make me 
believe that a bulk update would really help.

    > 3) Is there a better way to merge whole tables? However, note that I
    > need to translate primary keys from node main table into the common main
    > table, as they are used as foreign keys, hence the loops. I suspect the
    > looping is CPU intensive.

    Avoiding loops and doing things as sets is the best way, if possible.  The
    only loop I saw was looping over the merge_xxx tables, which is probably
    the only way.


There is an endless loop that is just a device for merging, but then there are 
loops going over each record in all the tables that are being merged, feeding 
them into the function that actually does the merge. That table iteration is 
what I want to eliminate (especially if I knew it would help :) )

    If possible (if you haven't already) you could add an extra column to your
    secondary table that you can set to the main table's key.

    bulk insert into second;
    update second set magic = (select key from main where ... );

    Then, maybe, you can do two ops in batch:

    update main (where key exists in main)
    insert into main (where key not exists in main)
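
    In SQL that might look roughly like this ("second" as the staging table,
    names still placeholders):

        -- rows whose key was found in main: fold them in
        UPDATE main m
           SET cnt = m.cnt + s.cnt
          FROM second s
         WHERE s.magic = m.main_key;

        -- rows with no match in main: insert them as new keys
        INSERT INTO main (main_key, cnt)
        SELECT s.node_key, s.cnt
          FROM second s
         WHERE s.magic IS NULL;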


I was thinking along the same lines. I can't really do a bulk insert at any 
point, because any key can be inserted by another process at any time, and with 
good probability. However, there will be a lot fewer inserts than updates. So, 
in general, I'm making it do this:


What about transactions?  I assume you do something like:

begin;
select merge_all();
commit;


Depending on your transaction isolation level, you could ensure that nobody 
else can insert while you are inserting.  I've never used the different 
isolation levels, so I'm not 100% sure.  Even then, maybe a lock on the table 
itself, like:

begin;
lock table main in share row exclusive mode;  -- "main" is a placeholder; blocks other writers, reads still allowed
select merge_all();
commit;  -- there is no explicit "unlock"; the lock is released at commit

This way only one process at a time can do the insert/update, but you can do 
them in batch and not in a loop.  It might be faster that way.  Other processes 
might wait for the lock a little bit, but if merge_all were faster in general, 
the time spent waiting for the lock would be less than the entire merge process 
itself.

I'm totally guessing here.

-Andy

