Re: [GENERAL] Improving performance of merging data between tables

2015-02-10 Thread Pawel Veselov
Sorry, it took me a while to respond, but I re-factored all of this process
according to the suggestions.

On Wed, Jan 7, 2015 at 7:49 PM, Maxim Boguk  wrote:
>
> On Wed, Jan 7, 2015 at 8:49 PM, Pawel Veselov 
> wrote:
>>
>>> PS: your setup looks pretty complicated and hard to analyze without seeing
>>> all involved table structures, transaction/query flow, and (especially)
>>> involved procedures source code.
>>>
>>
>> Sure :) At this point, I've put together the "bulk merge" code as well. I
>> can't quite see much of a difference, actually, but it's hard to trust the
>> execution times, as on the same amount of data they vary from, say, 0.5s to
>> 2s, and the sample data is not stepping on any other locks. In general, I'm
>> afraid all those left joins and multiple scans, even over a small amount of
>> data, are nullifying any positive effect.
>>
> Now some ideas to check.
> The high CPU usage usually isn't related to locking, but to seq scans, wrong
> plans, or simply inefficient pl/pgsql code; locked processes usually don't
> use much CPU.
>
> 1) on the test database perform select pg_stat_reset(); then perform a full
> round of merges, then check
> select * from pg_stat_user_tables where seq_scan>0 order by seq_tup_read;
> and if you find a lot of seq_scan and seq_tup_read on a particular table,
> try to find where they are coming from (it could be the reason for the high
> CPU usage).
>
> 2) enable track_functions in postgresql.conf and perform the same sequence
> (select pg_stat_reset() + full round of merges), then check
> select * FROM pg_stat_user_functions order by self_time desc;
> and see which functions use the most time.
>

These are good pointers. If the new process has the same sort of
problems, this will come in handy for figuring out where they are coming
from, thank you.


> 3) old/lost prepared transactions can have a deadly effect on database
> performance as a whole. So check select * from pg_prepared_xacts; and verify
> that you don't have hours- (or weeks-) old prepared xacts lying around.
>

If there are "lost" prepared transactions, they will lock up a particular
instance from being able to write into its table data, so it will just
stall the node. But does happen, and we have an application mechanism to
find and delete those.


> PS: btw I still don't fully understand the relation between:
> "
> - merges data into its own node tables (using merge_xxx PL/pgSQL functions)
> "
> and the provided code for public."merge_all02-9A-46-8B-C1-DD" and
> PUBLIC.merge_agrio.
> As I see it,
> public."merge_all02-9A-46-8B-C1-DD" calls PUBLIC.merge_agrio, and
> PUBLIC.merge_agrio updates the global table R_AGRIO (but not the "own node
> table").
>

It's a bit irrelevant at this point, but: the merge_all02-9A-46-8B-C1-DD()
function takes all data for the 02-9A-46-8B-C1-DD node and moves it into the
master table. There is an analogous merge_02-9A-46-8B-C1-DD() function that
takes data from the application and writes it into the tables for the
02-9A-46-8B-C1-DD node. The process of moving data node tables->main tables
and application->node tables is nearly identical, hence I only provided the
body once. The big difference is that when merging into the master, there is
a lot more data to look through, as the node tables only contain data that
has not yet been merged into the master.


> I think the best implementation of such a task is asynchronous processing of
> these changes via a background process. The application only inserts events
> into a queue table (a lockless process), and a background process reads the
> data from the queue table, merges it into the main table (again lockless,
> because it is single-threaded so there are no concurrent writes), and then
> deletes the merged data from the queue table.
>

Well, that was a really good suggestion, thank you. Some weeks later I've
put it together. This hasn't hit production yet, so I have yet to see the
overall improvement effect. Along with turning it into a queue, I've added
provisions to try to combine as much data as possible before writing it out
into the database tables, and merged all of the satellite tables with the
main data.

Before, I had:
r_agrio
  r_brk_xxx (multiple entries reference rows in r_agrio)

Now, I have:
r_agrio_daily
r_agrio_total
r_agrio_hourly

All the data that was in the r_brk_xxx tables is now in columns of the
r_agrio* tables. To get around the fact that there are potentially multiple
BRK records for each AGR record, the data is now stored as a JSON object. The
primary key used for the BRK tables is turned into a string that serves as
a key in a top-level JSON object. This should help me tremendously on the
side that needs to read that data, as I previously had to join or left join
the BRK tables.
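
Conceptually, the old BRK rows fold into the JSON column roughly like this
(just a sketch, with made-up column names; the real keys and values differ):

-- collapse the per-BRK rows into one JSON object per AGR row,
-- keyed by the stringified BRK primary key
update r_agrio_daily a
   set brk_data = b.obj
  from (select agr_id,
               json_object_agg(brk_k1 || ':' || brk_k2,
                               json_build_object('cnt', cnt, 'val', val)) as obj
          from r_brk_xxx
         group by agr_id) b
 where a.id = b.agr_id;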

Splitting this into 3 tables may come back and bite me in the back, since it's
two more inserts and corresponding look-ups, but it seriously helps me on
the reading side of things.

The code that aggregates the JSON data is still done in PL/PGSQL, which is
probably a bad idea, considering that PL

Re: [GENERAL] Improving performance of merging data between tables

2015-01-07 Thread Maxim Boguk
On Wed, Jan 7, 2015 at 8:49 PM, Pawel Veselov 
wrote:

>
>> PPPS: and the last suggestion: after you have finished with the "write all the
>> data into its own tables" step, the application should perform an analyze of these
>> own tables (or you could get weird/inefficient plans during the last stage).
>
>
> Any references to back this up? I don't particularly mind doing it, but I
> wonder if the analysis can be more expensive than the processing. These tables
> get a few hundred records inserted/updated, then are entirely processed (with
> expected full scans), and then deleted...
>

If these "own tables" are used only in full table selects and never in
joins, then there should be no issues.
However, once you start joining these tables with anything else, you could
get very inefficient/weird plans because the database doesn't know
(without analyze) how many rows you have in these tables.



>
>
>> PS: your setup looks pretty complicated and hard to analyze without seeing
>> all involved table structures, transaction/query flow, and (especially)
>> involved procedures source code.
>>
>
> Sure :) At this point, I've put together the "bulk merge" code as well. I
> can't quite see much of a difference, actually, but it's hard to trust the
> execution times, as on the same amount of data they vary from, say, 0.5s to
> 2s, and the sample data is not stepping on any other locks. In general, I'm
> afraid all those left joins and multiple scans, even over a small amount of
> data, are nullifying any positive effect.
>

Now some ideas to check.
The high CPU usage usually isn't related to locking, but to seq scans, wrong
plans, or simply inefficient pl/pgsql code; locked processes usually don't
use much CPU.

1) on the test database perform select pg_stat_reset(); then perform a full
round of merges, then check
select * from pg_stat_user_tables where seq_scan>0 order by seq_tup_read;
and if you find a lot of seq_scan and seq_tup_read on a particular table,
try to find where they are coming from (it could be the reason for the high
CPU usage).

2) enable track_functions in postgresql.conf and perform the same sequence
(select pg_stat_reset() + full round of merges), then check
select * FROM pg_stat_user_functions order by self_time desc;
and see which functions use the most time.

3) old/lost prepared transactions can have a deadly effect on database
performance as a whole. So check select * from pg_prepared_xacts; and verify
that you don't have hours- (or weeks-) old prepared xacts lying around.
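
For example, something along these lines will show anything that has been
sitting prepared for more than an hour:

select gid, prepared, owner, database
  from pg_prepared_xacts
 where prepared < now() - interval '1 hour'
 order by prepared;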

PS: btw I still don't fully understand the relation between:
"
- merges data into its own node tables (using merge_xxx PL/pgSQL functions)
"
and the provided code for public."merge_all02-9A-46-8B-C1-DD" and
PUBLIC.merge_agrio.
As I see it,
public."merge_all02-9A-46-8B-C1-DD" calls PUBLIC.merge_agrio, and
PUBLIC.merge_agrio updates the global table R_AGRIO (but not the "own node
table").


I think the best implementation of such a task is asynchronous processing of
these changes via a background process. The application only inserts events
into a queue table (a lockless process), and a background process reads the
data from the queue table, merges it into the main table (again lockless,
because it is single-threaded so there are no concurrent writes), and then
deletes the merged data from the queue table.
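
A very rough sketch of what I mean (table and column names are only for
illustration; the real structure will differ):

-- the application only appends rows here
create table event_queue (
    id      bigserial primary key,
    tagid   int    not null,
    blockid int    not null,
    val     bigint not null
);

-- the single background worker repeats this, each run in its own transaction
with batch as (
    delete from event_queue
     where id in (select id from event_queue order by id limit 1000)
     returning tagid, blockid, val
), agg as (
    select tagid, blockid, sum(val) as val
      from batch
     group by tagid, blockid
), upd as (
    update r_agrio r
       set val = r.val + agg.val
      from agg
     where r.tagid = agg.tagid and r.blockid = agg.blockid
     returning r.tagid, r.blockid
)
insert into r_agrio (tagid, blockid, val)
select a.tagid, a.blockid, a.val
  from agg a
 where not exists (select 1 from upd u
                    where u.tagid = a.tagid and u.blockid = a.blockid);

Because only one process runs the merge statement, the update-then-insert
split is safe without any upsert tricks; and if the transaction fails, the
queued rows simply stay in place for the next run.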


-- 
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/ 

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.bo...@gmail.com
МойКруг: http://mboguk.moikrug.ru/

"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."


Re: [GENERAL] Improving performance of merging data between tables

2015-01-07 Thread Pawel Veselov
On Tue, Dec 30, 2014 at 7:25 PM, Maxim Boguk  wrote:
>
> On Wed, Dec 31, 2014 at 11:10 AM, Pawel Veselov 
> wrote
>>
>>
>> [skipped]
>>
>> 2) try pg_stat_statements, setting "pg_stat_statements.track = all".  see:
 http://www.postgresql.org/docs/9.4/static/pgstatstatements.html

 I have used this to profile some functions, and it worked pretty well.
 Mostly I use it on a test box, but once ran it on the live, which was
 scary, but worked great.

>>>
>>> That looks promising. Turned it on, waiting for when I can turn the
>>> server at the next "quiet time".
>>>
>>
>> I have to say this turned out to be a bit of a disappointment for this use
>> case. It only measures the total time spent in a call, so it brings up
>> operations that waited a lot on some lock. It's good, but it would be great
>> if total_time were provided along with wait_time (and maybe io_time as
>> well, since I also see operations that just naturally have to fetch a lot
>> of data)
>>
>
> 1) pg_stat_statements provides information about the io_time of each
> statement, but you should have track_io_timing enabled for that.
>

Enabled that now. Still, the top winners are the functions that probably
lock for a (relatively) long time. This did help me find some crap that
either was missing an index, or used an unreasonable join, and just needed
re-writing. One entry that doesn't make sense to me is:

total_time - io_time = 1,366,773
calls = 666,542
query = SELECT * FROM q_SCHEDULER_STATE WHERE SCHED_NAME = ?

The table only has 18 rows; there is an index, but the planner chooses to
ignore it, which is right, since the sched_name column has the same value for
all rows. So all rows are returned by the SELECT. The time to run that query
under database load varies from 0.09 ms to 70 ms.
This is a distraction from the main topic, though, but it does stand out as odd.


> 2) About locking, I suggest enabling log_lock_waits and setting deadlock_timeout
> to, say, 100ms (just for testing purposes); then any lock waiting more
> than 100ms will be logged with some useful additional info.
>


> PPS: btw, please check the database logs for deadlock messages; your
> setup around the "and then call a pgsql function to merge the data from its
> tables into the common tables" part could easily be deadlock prone.
>

I don't think I have an abnormal problem with locking. I wanted to eliminate
locking time from the pg_stat_statements numbers, to first address queries
that aren't waiting on disk and/or locks, as my problem is high CPU, not
specific query performance. I don't have deadlocks for sure -- I had them
before, and I would normally get an error if there was a deadlock. We process
all the records in exactly the same order of keys to avoid deadlocks.

> PPPS: and the last suggestion: after you have finished with the "write all the
> data into its own tables" step, the application should perform an analyze of these
> own tables (or you could get weird/inefficient plans during the last stage).


Any references to back this up? I don't particularly mind doing it, but I
wonder if the analysis can be more expensive than the processing. These tables
get a few hundred records inserted/updated, then are entirely processed (with
expected full scans), and then deleted...


> PS: your setup looks pretty complicated and hard to analyze without seeing
> all involved table structures, transaction/query flow, and (especially)
> involved procedures source code.
>

Sure :) At this point, I've put together the "bulk merge" code as well. I
can't quite see much of a difference, actually, but it's hard to trust the
execution times, as on the same amount of data they vary from, say, 0.5s to
2s, and the sample data is not stepping on any other locks. In general, I'm
afraid all those left joins and multiple scans, even over a small amount of
data, are nullifying any positive effect.

primary table: http://pastebin.com/gE2TjZd3
secondary table(s): http://pastebin.com/aDVakUkp
There are actually 10 secondary tables, but they are more or less of the
same structure.
The node tables have identical structure to the main tables.

First stage, which I don't particularly question (though perhaps wrongfully
so); the application does:
- create data suitable for the rows in the primary/secondary tables
- starts transaction
- merges data into its own node tables (using merge_xxx PL/pgSQL functions)
(<100 rows in primary table)
- prepares transactions
- deals with other data sources
- commits/rolls back the prepared transaction depending on the success of the
previous step.

An example of a merge_xxx function: http://pastebin.com/6YYm8BVM

Second stage is really:
- start transaction
- call PL/pgSQL merge_all()
- commit

2 reasons for the 2 stages:
- if stage#2 fails, the data will be merged during the next iteration
- the lock time on the shared tables is minimized
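
To make the two stages concrete, the flow is roughly this (with a made-up
transaction id; max_prepared_transactions has to be > 0 for this to work):

-- stage 1, per node, per batch of incoming data
begin;
-- ... call the merge_xxx functions for this node's tables here ...
prepare transaction 'node-02-9A-46-8B-C1-DD-batch-1';
-- ... deal with the other data sources ...
commit prepared 'node-02-9A-46-8B-C1-DD-batch-1';  -- or rollback prepared

-- stage 2, periodically
begin;
select merge_all();   -- node tables -> main tables
commit;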

It's possible that an external process may take over writing data for a
certain key subset (a combination of (tagid, blockid)); to make sure there is
no race condition with such a process, 

Re: [GENERAL] Improving performance of merging data between tables

2014-12-30 Thread Maxim Boguk
On Wed, Dec 31, 2014 at 11:10 AM, Pawel Veselov 
wrote
>
>
> [skipped]
>
> 2) try pg_stat_statements, setting "pg_stat_statements.track = all".  see:
>>> http://www.postgresql.org/docs/9.4/static/pgstatstatements.html
>>>
>>> I have used this to profile some functions, and it worked pretty well.
>>> Mostly I use it on a test box, but once ran it on the live, which was
>>> scary, but worked great.
>>>
>>
>> That looks promising. Turned it on, waiting for when I can turn the
>> server at the next "quiet time".
>>
>
> I have to say this turned out to be a bit of a disappointment for this use
> case. It only measures the total time spent in a call, so it brings up
> operations that waited a lot on some lock. It's good, but it would be great
> if total_time were provided along with wait_time (and maybe io_time as
> well, since I also see operations that just naturally have to fetch a lot
> of data)
>

1) pg_stat_statements provides information about the io_time of each
statement, but you should have track_io_timing enabled for that.

2) About locking, I suggest enabling log_lock_waits and setting deadlock_timeout
to, say, 100ms (just for testing purposes); then any lock waiting more
than 100ms will be logged with some useful additional info.
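
For (1), with track_io_timing enabled, a query along these lines (using the
standard pg_stat_statements columns) separates the I/O time from the rest:

select query,
       calls,
       total_time,
       blk_read_time + blk_write_time as io_time,
       total_time - (blk_read_time + blk_write_time) as non_io_time
  from pg_stat_statements
 order by non_io_time desc
 limit 20;

And for (2), in postgresql.conf that is roughly:

log_lock_waits = on
deadlock_timeout = 100ms   # for testing; it also controls deadlock detection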

PS: your setup looks pretty complicated and hard to analyze without seeing
all the involved table structures, the transaction/query flow, and (especially)
the involved procedures' source code.

PPS: btw, please check the database logs for deadlock messages; your setup
around the "and then call a pgsql function to merge the data from its tables
into the common tables" part could easily be deadlock prone.

PPPS: and the last suggestion: after you have finished with the "write all the
data into its own tables" step, the application should perform an analyze of these
own tables (or you could get weird/inefficient plans during the last stage).



-- 
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/ 

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.bo...@gmail.com
МойКруг: http://mboguk.moikrug.ru/

"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."


Re: [GENERAL] Improving performance of merging data between tables

2014-12-30 Thread Pawel Veselov
On Mon, Dec 29, 2014 at 9:29 PM, Pawel Veselov 
wrote:

[skipped]


>>> 1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
>>> function? All I see is that the calls to merge_all() function take long
>>> time, and the CPU is high while this is going on.
>>>
>>>
[skipped]

2) try pg_stat_statements, setting "pg_stat_statements.track = all".  see:
>> http://www.postgresql.org/docs/9.4/static/pgstatstatements.html
>>
>> I have used this to profile some functions, and it worked pretty well.
>> Mostly I use it on a test box, but once ran it on the live, which was
>> scary, but worked great.
>>
>
> That looks promising. Turned it on, waiting for when I can turn the server
> at the next "quiet time".
>

I have to say this turned out to be a bit of a disappointment for this use
case. It only measures the total time spent in a call, so it brings up
operations that waited a lot on some lock. It's good, but it would be great
if total_time were provided along with wait_time (and maybe io_time as
well, since I also see operations that just naturally have to fetch a lot
of data)

[skipped]


Re: [GENERAL] Improving performance of merging data between tables

2014-12-30 Thread Andy Colson

On 12/29/2014 11:29 PM, Pawel Veselov wrote:


Andy,

thanks for looking into this.

On Mon, Dec 29, 2014 at 9:00 AM, Andy Colson  wrote:

On 12/28/2014 3:49 PM, Pawel Veselov wrote:

Hi.

I was wondering if anybody would have any ideas on how to improve
certain operations that we are having.




Besides "can somebody please look at this and let me know if I'm doing
something utterly stupid", here are my questions.

1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
function? All I see is that the calls to merge_all() function take long
time, and the CPU is high while this is going on.



First, I'll admit I didn't read your entire post.

I can think of a couple methods:

1) try each of the statements in merge_all by hand with an "explain 
analyze" in front to see which is slow.  Look for things that hit big tables without 
an index.  Check that fk lookups use indexes.


If I didn't miss anything, that seems to be OK, even on function-based queries.

2) try pg_stat_statements, setting "pg_stat_statements.track = all".  see:
http://www.postgresql.org/docs/9.4/static/pgstatstatements.html


I have used this to profile some functions, and it worked pretty well. 
Mostly I use it on a test box, but once ran it on the live, which was scary, 
but worked great.


That looks promising. Turned it on, waiting for when I can turn the server at the next 
"quiet time".

3) try auto-explain:
http://www.postgresql.org/docs/9.4/static/auto-explain.html


I've never used it, so don't know if it'll show each statement inside a 
function.  Dumps stuff to the log AFAIK, so you'll have to dig out the info by 
hand.

> 2) Is there a better way to merge individual rows, except doing
> UPDATE/INSERT in a loop, and would that be CPU expensive?
>

Not that I know of.  I use pretty much the same thing.  Soon!  we will have 
merge/upsert support.  Hopefully it'll be fast.


Well, anytime I cancelled the PID that was executing this whole mess, it would 
always stop at UPDATE ... SET ... WHERE on the main table. Which does make me 
believe that bulk update would really help.

> 3) Is there a better way to merge whole tables? However, note that I
> need to translate primary keys from node main table into the common main
> table, as they are used as foreign keys, hence the loops. I suspect the
> looping is CPU intensive.

Avoiding loops and doing things as sets is the best way.  If possible. The 
only loop I saw was looping over the merge_xxx tables, which is probably the 
only way.


There is an endless loop that is just a device for merging, but then there are 
loops going over each record in all the tables that are being merged, feeding
them into the function that actually does the merge. That table iteration is 
what I want to eliminate (especially if I knew it would help :) )

If possible (if you haven't already) you could add an extra column to your
secondary table that you can set as the main table's key.

bulk insert into second;
update second set magic = (select key from main where ... );

Then, maybe, you can do two ops in batch:

update main (where key exists in main)
insert into main (where key not exists in main)


I was thinking along the same lines. I can't really do bulk insert, at any 
point, because any key can be inserted by another process at any time, and with 
a good probability. However, there will be a lot less inserts than updates. So, 
in general, I'm making it do this:



What about transactions?  I assume you do something like:

begin;
merge_all;
commit;


Depending on your transaction isolation level, you could ensure that 
nobody can insert while you are inserting.  I've never used the different 
isolation levels, so not 100% sure.  Even then, maybe a lock on the table 
itself, like:

begin;
lock;
merge_all;
unlock;
commit;

This way only one at a time can do the insert/update, but you can do them in 
batch and not in a loop.  It might be faster that way.  Other processes might 
wait for the lock a little bit, but if merge_all was faster in general, the 
time to wait for the lock would be less than the entire merge process itself.
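
In actual PostgreSQL syntax that would be something like this (there's no
explicit unlock; the lock is simply released at commit):

begin;
-- keep other writers out of the shared table for the duration of the merge
lock table r_agrio in share row exclusive mode;  -- or whatever the shared table is
select merge_all();
commit;   -- releases the lock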

I'm totally guessing here.

-Andy




Re: [GENERAL] Improving performance of merging data between tables

2014-12-29 Thread Pawel Veselov
Andy,

thanks for looking into this.

On Mon, Dec 29, 2014 at 9:00 AM, Andy Colson  wrote:

> On 12/28/2014 3:49 PM, Pawel Veselov wrote:
>
>> Hi.
>>
>> I was wondering if anybody would have any ideas on how to improve
>> certain operations that we are having.
>>
>>  
>
>>
>> Besides "can somebody please look at this and let me know if I'm doing
>> something utterly stupid", here are my questions.
>>
>> 1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
>> function? All I see is that the calls to merge_all() function take long
>> time, and the CPU is high while this is going on.
>>
>>
>>
> First, I'll admit I didn't read your entire post.
>
> I can think of a couple methods:
>
> 1) try each of the statements in merge_all by hand with an "explain
> analyze" in front to see which is slow.  Look for things that hit big
> tables without an index.  Check that fk lookups use indexes.
>

If I didn't miss anything, that seems to be OK, even on function-based
queries.


> 2) try pg_stat_statements, setting "pg_stat_statements.track = all".  see:
> http://www.postgresql.org/docs/9.4/static/pgstatstatements.html
>
> I have used this to profile some functions, and it worked pretty well.
> Mostly I use it on a test box, but once ran it on the live, which was
> scary, but worked great.
>

That looks promising. Turned it on, waiting for when I can turn the server
at the next "quiet time".


> 3) try auto-explain:
> http://www.postgresql.org/docs/9.4/static/auto-explain.html
>
> I've never used it, so don't know if it'll show each statement inside a
> function.  Dumps stuff to the log AFAIK, so you'll have to dig out the info
> by hand.
>
> > 2) Is there a better way to merge individual rows, except doing
> > UPDATE/INSERT in a loop, and would that be CPU expensive?
> >
>
> Not that I know of.  I use pretty much the same thing.  Soon!  we will
> have merge/upsert support.  Hopefully it'll be fast.


Well, anytime I cancelled the PID that was executing this whole mess, it
would always stop at UPDATE ... SET ... WHERE on the main table. Which does
make me believe that bulk update would really help.


> > 3) Is there a better way to merge whole tables? However, note that I
> > need to translate primary keys from node main table into the common main
> > table, as they are used as foreign keys, hence the loops. I suspect the
> > looping is CPU intensive.
>
> Avoiding loops and doing things as sets is the best way.  If possible. The
> only loop I saw was looping over the merge_xxx tables, which is probably
> the only way.
>

There is an endless loop that is just a device for merging, but then there
are loops going over each record in all the tables that are being merged,
feeding them into the function that actually does the merge. That table
iteration is what I want to eliminate (especially if I knew it would help
:) )


> If possible (if you haven't already) you could add an extra column to
> your secondary table that you can set as the main table's key.
>
> bulk insert into second;
> update second set magic = (select key from main where ... );
>
> Then, maybe, you can do two ops in batch:
>
> update main (where key exists in main)
> insert into main (where key not exists in main)
>
>
I was thinking along the same lines. I can't really do bulk insert, at any
point, because any key can be inserted by another process at any time, and
with a good probability. However, there will be a lot less inserts than
updates. So, in general, I'm making it do this:

with pivot as ( select main_table.id as main_id, node_table.id as node_id
from node_table left join main_table using (key fields) )
update node_table set translate_id = pivot.main_id
from pivot where node_table.id = pivot.node_id;

(missing is a cursor over the node_table rows where translate_id is still null)

for row in missing loop
  -- merge_function will return the PK of either the updated or the inserted
  -- record. Use (0) data values, so that it's an identity update if the
  -- merge results in an update, or "empty" data if not.
  select merge_function(missing.key_fields, 0) into use_id;
  update node_table set translate_id = use_id where current of missing;
end loop;

At this point, I have a guarantee that I can update all records, and there
is nothing to insert.
So,

with new as ( select * from node_table )
update main_table old
set val = f(old.val, new.val)
from new
where new.translate_id = old.id;

So, I don't need full key matching anymore; I can use PKs instead.


Re: [GENERAL] Improving performance of merging data between tables

2014-12-29 Thread Andy Colson

On 12/28/2014 3:49 PM, Pawel Veselov wrote:

Hi.

I was wondering if anybody would have any ideas on how to improve
certain operations that we are having.





Besides "can somebody please look at this and let me know if I'm doing
something utterly stupid", here are my questions.

1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
function? All I see is that the calls to merge_all() function take long
time, and the CPU is high while this is going on.




First, I'll admit I didn't read your entire post.

I can think of a couple methods:

1) try each of the statements in merge_all by hand with an "explain 
analyze" in front to see which is slow.  Look for things that hit big 
tables without an index.  Check that fk lookups use indexes.


2) try pg_stat_statements, setting "pg_stat_statements.track = all".  see:
http://www.postgresql.org/docs/9.4/static/pgstatstatements.html
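
Setting it up is roughly this (the preload needs a server restart):

# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all

-- then, in the database
CREATE EXTENSION pg_stat_statements;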

I have used this to profile some functions, and it worked pretty well. 
Mostly I use it on a test box, but once ran it on the live, which was 
scary, but worked great.


3) try auto-explain:
http://www.postgresql.org/docs/9.4/static/auto-explain.html

I've never used it, so don't know if it'll show each statement inside a 
function.  Dumps stuff to the log AFAIK, so you'll have to dig out the 
info by hand.
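
From the docs, something along these lines should do it; I believe
log_nested_statements is what gets the statements inside functions logged:

# postgresql.conf (or LOAD 'auto_explain'; in a single session)
session_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '100ms'
auto_explain.log_nested_statements = on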




> 2) Is there a better way to merge individual rows, except doing
> UPDATE/INSERT in a loop, and would that be CPU expensive?
>

Not that I know of.  I use pretty much the same thing.  Soon!  we will 
have merge/upsert support.  Hopefully it'll be fast.
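
The usual shape of it, roughly along the lines of the example in the docs
(table and column names made up):

create function merge_row(k bigint, delta bigint) returns void as $$
begin
    loop
        -- first try to update an existing row
        update main_table set val = val + delta where key = k;
        if found then
            return;
        end if;
        -- not there, so try to insert it
        begin
            insert into main_table (key, val) values (k, delta);
            return;
        exception when unique_violation then
            -- another session inserted the same key first; loop and retry the update
        end;
    end loop;
end;
$$ language plpgsql;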


> 3) Is there a better way to merge whole tables? However, note that I
> need to translate primary keys from node main table into the common main
> table, as they are used as foreign keys, hence the loops. I suspect the
> looping is CPU intensive.

Avoiding loops and doing things as sets is the best way.  If possible. 
The only loop I saw was looping over the merge_xxx tables, which is 
probably the only way.


If possible (if you haven't already) you could add an extra column to 
your secondary table that you can set as the main table's key.


bulk insert into second;
update second set magic = (select key from main where ... );

Then, maybe, you can do two ops in batch:

update main (where key exists in main)
insert into main (where key not exists in main)
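
Spelled out in SQL, that would be roughly (column names made up):

-- resolve main-table keys for the freshly loaded batch
update second s
   set magic = m.id
  from main m
 where m.key1 = s.key1 and m.key2 = s.key2;

-- op 1: merge everything that already has a row in main
update main m
   set val = m.val + s.val
  from second s
 where s.magic = m.id;

-- op 2: insert the rest (assumes nobody else inserts these keys concurrently)
insert into main (key1, key2, val)
select key1, key2, val
  from second
 where magic is null;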

-Andy






