Re: [GENERAL] Help with slow table update
[skipped] But remember that if you update or delete a row, removing it from an index, the data will stay in that index until vacuum comes along. Also, there's no point in doing a REINDEX after a VACUUM FULL; vacuum full rebuilds all the indexes for you.

I was being desperate :) I still think there is something very wrong with this particular table. First, I have production systems that employ this function on a much larger data set, and there is no problem (so far, anyway). This machine is part of a test deployment; there is no constant load, and the only data being written now is from these tests. Vacuuming should prune all that dead stuff, and if it's absent, it's unclear where the time is spent navigating/updating a table with 24 rows :)

I think you definitely have a problem with dead rows, as evidenced by the huge improvement VACUUM FULL made. But it's not clear why (and not reasonable, IMHO, that) it wouldn't improve past the current point.

What I should have done is run 'VACUUM FULL VERBOSE'. Once I did, it told me there were 800k dead rows that could not be removed. After digging around I found some dangling prepared transactions, going back months. Once I threw those away and re-vacuumed, things got back to normal. Thanks for all your help and advice.
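For anyone else who lands here: a minimal sketch of hunting down and discarding dangling prepared transactions (the GID value below is hypothetical; use whatever the first query reports):

    -- List prepared transactions; anything with a very old "prepared" timestamp is suspect.
    SELECT gid, prepared, owner, database
      FROM pg_prepared_xacts
     ORDER BY prepared;

    -- Discard a dangling one by its GID (or COMMIT PREPARED if it should actually be kept).
    ROLLBACK PREPARED 'some-old-gid';

    -- Then reclaim the dead rows it was pinning.
    VACUUM FULL VERBOSE r_agrio_total;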
Re: [GENERAL] Help with slow table update
[skipped] This is where using sets becomes really tedious, as Postgres severely lacks an upsert-like statement. I don't think joins are allowed in an UPDATE statement, so I will need to use a WITH query, right? Also, I'm not sure how a LEFT JOIN will help me isolate and insert the missed entries...

[skipped] Yes, you can do UPDATE with joins (http://www.postgresql.org/docs/9.4/static/sql-update.html) like this:

    UPDATE table1 A SET col1 = B.col2 FROM table2 B WHERE A.col3 = B.col4;

I meant using the JOIN operator in the update. But it's still possible, though through a WITH query.
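For the record, explicit JOIN operators are in fact allowed inside the FROM clause of an UPDATE, so a WITH query is not strictly required for that part. A sketch with made-up table and column names:

    UPDATE table1 a
       SET col1 = b.col2
      FROM table2 b
      JOIN table3 c ON c.id = b.c_id  -- JOIN syntax is fine in UPDATE's FROM clause
     WHERE a.col3 = b.col4;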
Re: [GENERAL] Help with slow table update
On Mon, Apr 13, 2015 at 6:03 PM, David G. Johnston david.g.johns...@gmail.com wrote:

On Mon, Apr 13, 2015 at 5:01 PM, Pawel Veselov pawel.vese...@gmail.com wrote:

r_agrio_hourly - good, r_agrio_total - bad.

Update on r_agrio_hourly (cost=0.42..970.32 rows=250 width=329) (actual time=2.248..2.248 rows=0 loops=1)
  ->  Index Scan using u_r_agrio_hourly on r_agrio_hourly (cost=0.42..970.32 rows=250 width=329) (actual time=0.968..1.207 rows=1 loops=1)
        Index Cond: ((tagid = 1002::numeric) AND (unitid = 1002::numeric) AND ((rowdate)::text = '2015-04-09T23'::text) AND (device_type = 3::numeric) AND (placement = 2::numeric))
Total runtime: 2.281 ms

Update on r_agrio_total (cost=0.42..45052.56 rows=12068 width=321) (actual time=106.766..106.766 rows=0 loops=1)
  ->  Index Scan using u_r_agrio_total on r_agrio_total (cost=0.42..45052.56 rows=12068 width=321) (actual time=0.936..32.626 rows=1 loops=1)
        Index Cond: ((tagid = 1002::numeric) AND (unitid = 1002::numeric) AND (device_type = 3::numeric) AND (placement = 2::numeric))
Total runtime: 106.793 ms

What is it you expect to see here? What are the results (count and times) for:

SELECT count(*) FROM r_agrio_total WHERE tagid = 1002 and unitid = 1002;

Result: 8 (the whole table is 24 rows). It returns with somewhat of a stumble, but relatively quickly.

db=> explain analyze SELECT count(*) FROM r_agrio_total WHERE tagid = 1002 and unitid = 1002;
                              QUERY PLAN
---------------------------------------------------------------------------
 Aggregate (cost=4.45..4.46 rows=1 width=0) (actual time=327.194..327.195 rows=1 loops=1)
   ->  Index Scan using tag_r_agrio_total on r_agrio_total (cost=0.42..4.45 rows=1 width=0) (actual time=0.039..327.189 rows=8 loops=1)
         Index Cond: (tagid = 1002::numeric)
         Filter: (unitid = 1002::numeric)
 Total runtime: 327.228 ms

SELECT count(*) FROM r_agrio_hourly WHERE tagid = 1002 and unitid = 1002;

Result is 2869. Returns somewhat quickly. Explain analyze is crazy though:

db=> explain analyze SELECT count(*) FROM r_agrio_hourly WHERE tagid = 1002 and unitid = 1002;
                              QUERY PLAN
---------------------------------------------------------------------------
 Aggregate (cost=68134.68..68134.69 rows=1 width=0) (actual time=15177.211..15177.211 rows=1 loops=1)
   ->  Index Scan using adunit_r_agrio_hourly on r_agrio_hourly (cost=0.42..67027.10 rows=443035 width=0) (actual time=0.096..15175.730 rows=2869 loops=1)
         Index Cond: (unitid = 1002::numeric)
         Filter: (tagid = 1002::numeric)
 Total runtime: 15177.240 ms

More queries along this line might be needed. The underlying question is how many index rows need to be skipped over in total to get the final result - or rather, are the columns in the index in descending order of cardinality?

The idea is: both tables have unique multi-field indices, each update hits exactly one row from that index, no more, and all fields from the index are locked down with an equality condition in the update. All of the updates (within a transaction) always work on a small subset of rows (a few hundred max, ever; in this case it may be around 10). I expect it to be possible for the server to keep the active working set in the cache at all times. Since the index is unique, there shouldn't be a reason to re-scan the table if a cached row is found, no?

Any chance you can perform a REINDEX - maybe there is some bloat present? There are queries to help discern whether that may be the case; I do not know them off the top of my head, but just doing it might be acceptable, and is definitely quicker if so.

That's the thing - I've done both vacuum full and re-index. The very first time I did vacuum full things improved (60 seconds to 7 seconds).
Re-index didn't improve anything (but it was done after vacuum full).

I'm still not really following your presentation, but maybe my thoughts will spark something.

Thank you! I hope I clarified this some :)
Re: [GENERAL] Help with slow table update
On Mon, Apr 13, 2015 at 7:37 PM, Jim Nasby jim.na...@bluetreble.com wrote:

On 4/13/15 7:01 PM, Pawel Veselov wrote:

Cursors tend to make things slow. Avoid them if you can.

Is there an alternative to iterating over a number of rows, where a direct update query is not an option?

I really doubt that the actual processing logic, including the use of types, has anything to do with my problem. This is based on the fact that, out of the tables being changed, only one is exhibiting the problem. All of the involved tables have nearly the same structure, and have the same logical operations performed on them. I thought maybe the bad table was slow because it was first in the list, and Postgres was caching the function results, but I moved things around, and the pattern is the same.

I'm guessing that you're essentially processing a queue. Take a look at http://www.postgresql.org/message-id/552c750f.2010...@bluetreble.com for some ideas. Basically, not only do cursors have non-trivial overhead, doing a ton of single-row queries is going to have a non-trivial overhead itself.

Thank you for the pointers. PgQ sounds interesting; it has to be remote for RDS (I use RDS), but I'll try implementing a solution based on it.

However, of all the time being spent during this update, the breakdown is:

    update total table:      10.773033
    update hourly table:     00.179711
    update daily table:      01.082467
    update some other table
      (cardinality similar
      to the total table):   00.168287
    clean the queue table:   00.021733
    overhead:                00.014922

The overhead is the time taken to run the whole procedure, minus all the other times that have been counted. (Some notes about the daily table below.)

As for your specific question, I suggest you modify the plpgsql function so that it's doing an EXPLAIN ANALYZE on the slow table. EXPLAIN ANALYZE actually returns a recordset the same way a SELECT would, with a single column of type text. So you just need to do something with that output. The easiest thing would be to replace this in your function:

    UPDATE slow_table SET ...

with this (untested):

    RETURN QUERY EXPLAIN ANALYZE UPDATE slow_table SET ...

and change the function so it returns SETOF text instead of whatever it returns now.

Thank you, that made it a lot easier to see what's really going on. But the outcome is somewhat the same. The bad table analysis shows a very high cost, and thousands of rows, where the table contains only 24 rows. This time, however, the actual run time is shown, and one can see where the time is spent (before, I was just using a sum of clock_time()s around the update statements to see where the problem was).

r_agrio_hourly - good, r_agrio_total - bad.
Update on r_agrio_hourly (cost=0.42..970.32 rows=250 width=329) (actual time=2.248..2.248 rows=0 loops=1)
  ->  Index Scan using u_r_agrio_hourly on r_agrio_hourly (cost=0.42..970.32 rows=250 width=329) (actual time=0.968..1.207 rows=1 loops=1)
        Index Cond: ((tagid = 1002::numeric) AND (unitid = 1002::numeric) AND ((rowdate)::text = '2015-04-09T23'::text) AND (device_type = 3::numeric) AND (placement = 2::numeric))
Total runtime: 2.281 ms

Update on r_agrio_total (cost=0.42..45052.56 rows=12068 width=321) (actual time=106.766..106.766 rows=0 loops=1)
  ->  Index Scan using u_r_agrio_total on r_agrio_total (cost=0.42..45052.56 rows=12068 width=321) (actual time=0.936..32.626 rows=1 loops=1)
        Index Cond: ((tagid = 1002::numeric) AND (unitid = 1002::numeric) AND (device_type = 3::numeric) AND (placement = 2::numeric))
Total runtime: 106.793 ms

Keep in mind that the estimated cost is not terribly useful; it's the actual times that matter. I suspect what's happening here is a combination of things. First, the hourly table is basically living in cache, but the total table is not. That means that when you go to find a row in the total table you're actually hitting the disk instead of pulling the data from memory.

Second, you may have a lot of dead rows in the total table. I suspect this because of the very large amount of time the index scan is taking. Unless you're running on an old 10MB MFM drive you'd be pretty hard pressed for even 2 IO operations (one for the index leaf page and one for the heap page) to take 32ms. I suspect the index scan is having to read many dead rows before it finds a live one, and is incurring multiple IOs. Switching to EXPLAIN (analyze, buffers) would help confirm that.

That looks most likely to me as well. Most of the updates in a single batch would land on the same record for the total table, while for the hourly table there are a lot fewer repeats. Logically, the tables contain identical data, except that the hourly table breaks it down per hour, and the total table contains the data for all times. The daily table contains the same data per day. So, if I compared the tables, the total table has
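A sketch of the RETURN QUERY EXPLAIN trick combined with the buffers option Jim suggests; the wrapper function and its arguments are invented here for illustration:

    CREATE OR REPLACE FUNCTION debug_update_total(_tagid numeric, _unitid numeric)
    RETURNS SETOF text AS $$
    BEGIN
        -- EXPLAIN produces rows of text, so the function can simply forward them.
        RETURN QUERY EXECUTE
            'EXPLAIN (ANALYZE, BUFFERS)
             UPDATE r_agrio_total
                SET unconfirmed = unconfirmed + 0
              WHERE tagid = $1 AND unitid = $2'
        USING _tagid, _unitid;
    END;
    $$ LANGUAGE plpgsql;

    -- SELECT * FROM debug_update_total(1002, 1002);

The "shared hit/read" counters in the output should show whether the index scan is wading through dead rows (many buffers touched to return a single live row).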
Re: [GENERAL] Help with slow table update
On Tue, Apr 14, 2015 at 3:29 PM, Jim Nasby jim.na...@bluetreble.com wrote:

On 4/14/15 4:44 PM, Pawel Veselov wrote:

On Tue, Apr 14, 2015 at 1:15 PM, Jim Nasby jim.na...@bluetreble.com mailto:jim.na...@bluetreble.com wrote:

On 4/14/15 1:28 PM, Pawel Veselov wrote:

I wonder if what I need to do, considering that I update a lot of the same rows as I process this queue, is to create a temp table, update the rows there, and then update the actual tables once at the end...

That's what I'd do.

Well, in short, I changed (repeat the body of the loop for however many tables there are):

    LOOP (item)
      UPDATE table with item;
      IF not found THEN
        INSERT item INTO table;
      END IF;
    END LOOP;

to:

    CREATE TEMP TABLE xq_table (LIKE table) ON COMMIT DROP;
    LOOP (item)
      LOOP
        UPDATE xq_table with item;
        EXIT WHEN found;
        INSERT INTO xq_table SELECT * FROM table FOR UPDATE;
        CONTINUE WHEN found;
        INSERT item INTO xq_table;
        EXIT;
      END LOOP;
    END LOOP;
    UPDATE table a SET (rows) = (xq.rows) FROM xq_table xq WHERE (a.keys) = (xq.keys);

That works significantly faster. The final update statement is very fast. The process is somewhat slow in the beginning as it sucks in records from total into xq_total, but once all of that is moved into the temp table, it rushes through the rest.

Databases like to think in sets. It will generally be more efficient to do set operations instead of a bunch of row-by-row stuff. Since you're pulling all of this from some other table, your best bet is probably something like:

    CREATE TEMP TABLE raw AS DELETE FROM queue WHERE ... RETURNING *;
    CREATE TEMP VIEW hourly_v AS SELECT ... FROM raw GROUP BY ...;
    UPDATE ar_hourly SET ... FROM hourly_v JOIN ...;
    INSERT INTO ar_hourly SELECT ... FROM hourly_v LEFT JOIN ar_hourly ON ...;
    -- Same thing for daily
    -- Same thing for total

In my previous post there was a problem with that pseudo-code: it's missing the inserts into the final table at the end of the loop, for those records that need to be inserted rather than updated. This is where using sets becomes really tedious, as Postgres severely lacks an upsert-like statement.

I don't think joins are allowed in an UPDATE statement, so I will need to use a WITH query, right? Also, I'm not sure how a LEFT JOIN will help me isolate and insert the missed entries...

Would it be OK to replace the upsert part with merging into a temp table, then deleting and inserting from the temp table? Is there any penalty for insert/delete compared to update?

[skipped] But remember that if you update or delete a row, removing it from an index, the data will stay in that index until vacuum comes along. Also, there's no point in doing a REINDEX after a VACUUM FULL; vacuum full rebuilds all the indexes for you.

I was being desperate :) I still think there is something very wrong with this particular table. First, I have production systems that employ this function on a much larger data set, and there is no problem (so far, anyway). This machine is part of a test deployment; there is no constant load, and the only data being written now is from these tests. Vacuuming should prune all that dead stuff, and if it's absent, it's unclear where the time is spent navigating/updating a table with 24 rows :)

I think you definitely have a problem with dead rows, as evidenced by the huge improvement VACUUM FULL made. But it's not clear why (and not reasonable, IMHO, that) it wouldn't improve past the current point.
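On the LEFT JOIN question: one hedged reading of Jim's suggestion, as a single writable-CTE statement (column names are abbreviated here to tagid/unitid/unserved; the real key has more fields):

    WITH upd AS (
        -- update the rows that already exist in the target
        UPDATE r_agrio_total t
           SET unserved = t.unserved + x.unserved
          FROM xq_table x
         WHERE t.tagid = x.tagid AND t.unitid = x.unitid
        RETURNING t.tagid, t.unitid
    )
    -- rows with no match in the target (the LEFT JOIN yields NULLs) get inserted
    INSERT INTO r_agrio_total (tagid, unitid, unserved)
    SELECT x.tagid, x.unitid, x.unserved
      FROM xq_table x
      LEFT JOIN r_agrio_total t
        ON t.tagid = x.tagid AND t.unitid = x.unitid
     WHERE t.tagid IS NULL;

Both halves see the same snapshot, so this is not a concurrency-safe upsert by itself; it assumes the application-level key locking described elsewhere in the thread keeps concurrent writers off the same keys.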
Re: [GENERAL] Help with slow table update
On Tue, Apr 14, 2015 at 1:15 PM, Jim Nasby jim.na...@bluetreble.com wrote:

On 4/14/15 1:28 PM, Pawel Veselov wrote:

I wonder if what I need to do, considering that I update a lot of the same rows as I process this queue, is to create a temp table, update the rows there, and then update the actual tables once at the end...

That's what I'd do.

Well, in short, I changed (repeat the body of the loop for however many tables there are):

    LOOP (item)
      UPDATE table with item;
      IF not found THEN
        INSERT item INTO table;
      END IF;
    END LOOP;

to:

    CREATE TEMP TABLE xq_table (LIKE table) ON COMMIT DROP;
    LOOP (item)
      LOOP
        UPDATE xq_table with item;
        EXIT WHEN found;
        INSERT INTO xq_table SELECT * FROM table FOR UPDATE;
        CONTINUE WHEN found;
        INSERT item INTO xq_table;
        EXIT;
      END LOOP;
    END LOOP;
    UPDATE table a SET (rows) = (xq.rows) FROM xq_table xq WHERE (a.keys) = (xq.keys);

That works significantly faster. The final update statement is very fast. The process is somewhat slow in the beginning as it sucks in records from total into xq_total, but once all of that is moved into the temp table, it rushes through the rest.

The other option would be to use a constraint trigger paired with a per-row trigger on the hourly table to drive the daily table, and on the daily table to drive the total table. The way that would work is the per-row trigger would simply keep track of all the unique records that were changed in a statement (presumably by putting them in a temp table). Once the statement is done, the constraint trigger would fire; it would summarize all the changed data and do a much smaller number of updates to the table being summarized into.

I'm not sure how I would be able to avoid the same number of changes on the total table; the trigger would fire on each update, wouldn't it? So, same problem with a lot of changes on a table...

BTW, you also made a comment about not having to hit the table if you look at something in an index. You can only do that if all the data you need is in the index, AND the page with the record is marked as being all-visible (google for Postgres Visibility Map). If that's not the case then you still have to pull in the row from the table in order to determine visibility. The only case where you can still avoid hitting the table is something like a NOT EXISTS; if you can't find any entries in the index for something then they definitely won't be in the table.

What I was saying is that if a table has a unique index, and there is a cached fact that a particular index value points to a particular row, there shouldn't be a need to re-scan the index again to search for any more matching values (which would be necessary if the index were not unique). Again, all considering the size of the index, the number of different index values being queried, etc.

But remember that if you update or delete a row, removing it from an index, the data will stay in that index until vacuum comes along. Also, there's no point in doing a REINDEX after a VACUUM FULL; vacuum full rebuilds all the indexes for you.

I was being desperate :) I still think there is something very wrong with this particular table. First, I have production systems that employ this function on a much larger data set, and there is no problem (so far, anyway). This machine is part of a test deployment; there is no constant load, and the only data being written now is when I do these tests. Vacuuming should prune all that dead stuff, and if it's absent, it's unclear where the time is spent navigating/updating a table with 24 rows :)
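A concrete, hedged rendering of the loop above in PL/pgSQL, with invented names (xq_total as the temp table, a simplified tagid/unitid key, one counter column) purely for illustration:

    CREATE TEMP TABLE xq_total (LIKE r_agrio_total) ON COMMIT DROP;

    -- inside the processing function, for one queue item:
    LOOP
        UPDATE xq_total
           SET unserved = unserved + _unserved
         WHERE tagid = _tagid AND unitid = _unitid;
        EXIT WHEN FOUND;

        -- not in the temp table yet: pull (and row-lock) the live row into it
        INSERT INTO xq_total
        SELECT * FROM r_agrio_total
         WHERE tagid = _tagid AND unitid = _unitid
           FOR UPDATE;
        CONTINUE WHEN FOUND;   -- retry the UPDATE against the copied row

        -- the key exists nowhere yet: seed the temp table with the item itself
        INSERT INTO xq_total (tagid, unitid, unserved)
        VALUES (_tagid, _unitid, _unserved);
        EXIT;
    END LOOP;

    -- after all items are processed, one set-based write-back:
    UPDATE r_agrio_total a
       SET unserved = x.unserved
      FROM xq_total x
     WHERE a.tagid = x.tagid AND a.unitid = x.unitid;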
[GENERAL] bigserial continuity safety
Hi.

If I have a table created as:

    CREATE TABLE xq_agr (
      id   BIGSERIAL PRIMARY KEY,
      node text not null
    );

and that multiple applications insert into. The applications never explicitly specify the value for 'id'.

Is it safe to, on a single connection, do:

- open transaction (default transaction isolation)
- open a cursor for SELECT * FROM xq_agr ORDER BY id ASC
- do something with the current record
- advance the cursor (and repeat the something), but stop at some point (id = LAST_ID), and
- DELETE FROM xq_agr WHERE id <= LAST_ID;
- commit

"Safe to" means: the cursor will not miss any records that are deleted at the end.

I'm suspecting that, depending on the commit order, I may have situations when:

- TX1 inserts ID 1
- TX2 inserts ID 2
- TX2 commits
- TX3 scans, sees only ID 2
- TX1 commits
- TX3 deletes <= 2
- record ID 1 is deleted, but never processed.

Is that sequence of events as listed above possible? If yes, is there a transaction isolation level I can use to avoid that?

Table and sequence definition, as present in the DB:

db=> \d+ xq_agr_id_seq
            Sequence "public.xq_agr_id_seq"
    Column     |  Type   |        Value        | Storage
---------------+---------+---------------------+---------
 sequence_name | name    | xq_agr_id_seq       | plain
 last_value    | bigint  | 139898829           | plain
 start_value   | bigint  | 1                   | plain
 increment_by  | bigint  | 1                   | plain
 max_value     | bigint  | 9223372036854775807 | plain
 min_value     | bigint  | 1                   | plain
 cache_value   | bigint  | 1                   | plain
 log_cnt       | bigint  | 27                  | plain
 is_cycled     | boolean | f                   | plain
 is_called     | boolean | t                   | plain

db=> \d xq_agr
                    Table "public.xq_agr"
 Column |  Type  |                      Modifiers
--------+--------+------------------------------------------------------
 id     | bigint | not null default nextval('xq_agr_id_seq'::regclass)
 node   | text   | not null
Indexes:
    "xq_agr_pkey" PRIMARY KEY, btree (id)
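One hedged way to sidestep the commit-order race entirely, at any isolation level: remember the ids the cursor actually returned, and delete exactly those, instead of everything up to LAST_ID:

    -- instead of: DELETE FROM xq_agr WHERE id <= LAST_ID;
    -- delete only what was actually seen and processed:
    DELETE FROM xq_agr WHERE id = ANY ($1);   -- $1: bigint[] of the processed ids

A row inserted by a still-uncommitted transaction with a smaller id then simply stays in the queue for the next pass.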
Re: [GENERAL] Help with slow table update
On Sun, Apr 12, 2015 at 5:40 PM, Jim Nasby jim.na...@bluetreble.com wrote:

On 4/9/15 6:18 PM, Pawel Veselov wrote:

Hi.

I have a plpgsql procedure that updates a few similar tables. For some reason, updates on one of the tables take a lot longer than the updates on the other ones. The difference is, say, 7 seconds vs. 80 milliseconds.

The procedure uses cursors and record variables to do the updates. For example:

    update r_agrio_total set
      unserved = unserved + (agrow->>'unserved')::numeric(38),
      r_brkconn = mush_brk_conn(r_brkconn, q_item.r_brkconn),
    where tagid = _tagid
      and unitid = (akey->>'unitid')::numeric
      and placement = (akey->>'placement')::numeric
      and device_type = (akey->>'device_type')::numeric;

There is another table (xq_agr) that is read record by record, and for each of those records, such an update is executed.

I was trying to EXPLAIN ANALYZE the updates to see where the time could be spent. There are only 24 rows in the bad table, and 3,400 rows in the good table. So, for the bad table, most of the updates will be on the same rows. The times were measured on processing 100 original records.

When I'm analyzing pure update statements, I don't see anything strange.

bad table:

explain analyze update r_agrio_total set unconfirmed = unconfirmed + 0
 where tagid = 1000 and unitid = 1000 and placement = 0 and device_type = 100;

RESULT:
Update on r_agrio_total (cost=0.42..4.46 rows=1 width=321) (actual time=0.253..0.253 rows=0 loops=1)
  ->  Index Scan using tag_r_agrio_total on r_agrio_total (cost=0.42..4.46 rows=1 width=321) (actual time=0.037..0.041 rows=1 loops=1)
        Index Cond: (tagid = 1000::numeric)
        Filter: ((unitid = 1000::numeric) AND (placement = 0::numeric) AND (device_type = 100::numeric))
        Rows Removed by Filter: 7
Total runtime: 0.282 ms

good table:

explain analyze update r_agrio_hourly set unconfirmed = unconfirmed + 0
 where tagid = 1000 and unitid = 1000 and placement = 0 and device_type = 100
   and rowdate = '2015-02-23T13';

RESULT:
Update on r_agrio_hourly (cost=0.42..17.36 rows=6 width=329) (actual time=0.102..0.102 rows=0 loops=1)
  ->  Index Scan using u_r_agrio_hourly on r_agrio_hourly (cost=0.42..17.36 rows=6 width=329) (actual time=0.047..0.048 rows=1 loops=1)
        Index Cond: ((tagid = 1000::numeric) AND (unitid = 1000::numeric) AND ((rowdate)::text = '2015-02-23T13'::text) AND (device_type = 100::numeric) AND (placement = 0::numeric))
Total runtime: 0.135 ms

When I try doing it with a WITH statement (really, to apply the actual data that the plpgsql function uses), there is something strange in the bad table.
explain analyze with SRC as (select * from xq_agr where id = 914830)
update r_agrio_total set
  unconfirmed = unconfirmed + (SRC.r_agrio->>'unconfirmed')::numeric(38)
  from SRC
 where tagid = (SRC.r_agrio->'key'->>'tagid')::numeric
   and unitid = (SRC.r_agrio->'key'->>'unit')::numeric
   and placement = (SRC.r_agrio->'key'->>'placement')::numeric
   and device_type = (SRC.r_agrio->'key'->>'device_type')::numeric;

RESULT:
Update on r_agrio_total (cost=8.91..32777.51 rows=19331 width=409) (actual time=0.107..0.107 rows=0 loops=1)
  CTE src
    ->  Index Scan using xq_agr_pkey on xq_agr (cost=0.42..8.44 rows=1 width=379) (actual time=0.026..0.027 rows=1 loops=1)
          Index Cond: (id = 914830)
  ->  Nested Loop (cost=0.46..32769.07 rows=19331 width=409) (actual time=0.107..0.107 rows=0 loops=1)
        ->  CTE Scan on src (cost=0.00..0.02 rows=1 width=88) (actual time=0.032..0.033 rows=1 loops=1)
        ->  Index Scan using u_r_agrio_total on r_agrio_total (*cost=0.46..32285.78 rows=19331* width=321) (actual time=0.001..0.001 rows=0 loops=1)
              Index Cond: ((tagid = (((src.r_agrio -> 'key'::text) ->> 'tagid'::text))::numeric) AND (unitid = (((src.r_agrio -> 'key'::text) ->> 'unit'::text))::numeric) AND (device_type = (((src.r_agrio -> 'key'::text) ->> 'device_type'::text))::numeric) AND (placement = (((src.r_agrio -> 'key'::text) ->> 'placement'::text))::numeric))
Total runtime: 0.155 ms

explain analyze with SRC as (select * from xq_agr where id = 914830)
update r_agrio_hourly set
  unconfirmed = unconfirmed + (SRC.r_agrio->>'unconfirmed')::numeric(38)
  from SRC
 where tagid = (SRC.r_agrio->'key'->>'tagid')::numeric
   and unitid = (SRC.r_agrio->'key'->>'unit')::numeric
   and placement = (SRC.r_agrio->'key'->>'placement')::numeric
   and device_type = (SRC.r_agrio->'key'->>'device_type')::numeric
   and rowdate = (SRC.r_agrio->'key'->>'rowdate');

RESULT:
Update on r_agrio_hourly (cost=8.91..52.91 rows=20 width=417) (actual time=0.123..0.123 rows=0 loops=1)
  CTE src
    ->  Index Scan using xq_agr_pkey on xq_agr
Re: [GENERAL] Improving performance of merging data between tables
Sorry, it took me a while to respond, but I re-factored this whole process per the suggestions.

On Wed, Jan 7, 2015 at 7:49 PM, Maxim Boguk maxim.bo...@gmail.com wrote:

On Wed, Jan 7, 2015 at 8:49 PM, Pawel Veselov pawel.vese...@gmail.com wrote:

PS: your setup looks pretty complicated and hard to analyze without seeing all involved table structures, transaction/query flow, and (especially) involved procedures source code.

Sure :) At this point, I've put together the bulk merge code as well. I can't quite see much of a difference, actually, but it's hard to trust the execution times, as on the same amount of data they vary from, say, 0.5s to 2s, and the sample data is not stepping on any other locks. In general, I'm afraid all those left joins and multiple scans, even over a small amount of data, are nullifying any positive effect.

Now some ideas to check. High CPU usage usually isn't related to locking, but to seq scans, wrong plans, or simply inefficient pl/pgsql code; locked processes usually don't use much CPU.

1) On the test database, perform

    select pg_stat_reset();

then perform a full round of merges, then check

    select * from pg_stat_user_tables where seq_scan > 0 order by seq_tup_read;

and if you find a lot of seq_scan and seq_tup_read on a particular table, try to find where they are coming from (it could be the reason for high CPU usage).

2) Enable track_functions in postgresql.conf and perform the same sequence (select pg_stat_reset() + a full round of merges), then check

    select * from pg_stat_user_functions order by self_time desc;

and see which function is using the most time.

These are good pointers; if the new process has the same sort of problems, this will come in handy for figuring out where they are coming from, thank you.

3) Old/lost prepared transactions can have a deadly effect on the performance of the database as a whole. So check

    select * from pg_prepared_xacts;

and verify that you don't have hours- (or weeks-) old prepared xacts lying around.

If there are lost prepared transactions, they will lock a particular instance out of being able to write into its table data, so it will just stall the node. But it does happen, and we have an application mechanism to find and delete those.

PS: btw, I still don't fully understand the relation between "merges data into its own node tables (using merge_xxx PL/pgSQL functions)" and the provided code for public.merge_all02-9A-46-8B-C1-DD and PUBLIC.merge_agrio. As I see it, public.merge_all02-9A-46-8B-C1-DD calls PUBLIC.merge_agrio, and PUBLIC.merge_agrio updates a global table R_AGRIO (but not the node's own table).

It's a bit irrelevant at this point, but: the merge_all02-9A-46-8B-C1-DD() function will take all data for the 02-9A-46-8B-C1-DD node and move it into the master table. There is an analogous merge_02-9A-46-8B-C1-DD() function that takes data from the application and writes it into the tables for the 02-9A-46-8B-C1-DD node. The process of moving data node tables -> main tables and application -> node tables is nearly identical, hence I only provided the body once. The big difference is that when merging into master, there is a lot more data to look through, as node tables only contain data that has not yet been merged into the master.

I think the best implementation of such a task is asynchronous processing of these changes via a background process.
An application only inserts events into the queue table (a lockless process), and some background process reads the data from the queue table and merges it into the main table (again lockless, because it's a single thread, so there are no concurrent writes), and then deletes the merged data from the queue table.

Well, that was a really good suggestion, thank you. Some weeks later, I've put it together. This hasn't hit production yet, so I'm yet to see the overall improvement effect.

Along with turning it into a queue, I've added provisions to combine as much data as possible before writing it out into the database tables, and merged all of the satellite tables with the main data.

Before, I had:

    r_agrio
    r_brk_xxx (multiple entries reference rows in r_agrio)

Now, I have:

    r_agrio_daily
    r_agrio_total
    r_agrio_hourly

All the data that was in the r_brk_xxx tables is now in columns of the r_agrio* tables. To get around the fact that there are potentially multiple BRK records for each AGR record, the data is now stored as a JSON object. The primary key used for the BRK tables is turned into a string that serves as a key in a top-level JSON object. This should help me tremendously on the side that needs to read that data, as I previously had to join or left join the BRK tables.

Splitting this into 3 tables may come back and bite me, since it's two more inserts and corresponding look-ups, but it seriously helps me on the reading side of things.

The code that aggregates the JSON data is still done in PL/PGSQL, which is probably a bad idea, considering that PL doesn't have good ways of manipulating
Re: [GENERAL] Casting hstore to json
Adrian,

On Thu, Jan 15, 2015 at 9:09 PM, Adrian Klaver adrian.kla...@aklaver.com wrote:

On 01/15/2015 07:59 PM, Pawel Veselov wrote:

Hi.

I'm trying to cast hstore to json, but I don't seem to be getting a json object from hstore key/value pairs.

[skipped]

db=> SELECT to_json('a=>1, b=>2'::hstore);
           to_json
------------------------------
 "\"a\"=>\"1\", \"b\"=>\"2\""

Works here:

business_app=> select version();
                                  version
-----------------------------------------------------------------------------
 PostgreSQL 9.3.5 on i686-pc-linux-gnu, compiled by gcc (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388], 32-bit
(1 row)

business_app=> SELECT to_json('a=>1, b=>2'::hstore);
       to_json
----------------------
 {"a": "1", "b": "2"}
(1 row)

Did you upgrade from a previous version of Postgres?

Yes! From 9.1.14. 'alter extension hstore update' did the trick, thank you :)

[skipped]
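For reference, a sketch of checking whether an extension is lagging behind the version shipped with the server, and pulling in the newer definition (which is what adds the hstore-to-json cast here):

    -- what's installed vs. what's available
    SELECT extname, extversion FROM pg_extension WHERE extname = 'hstore';
    SELECT name, default_version, installed_version
      FROM pg_available_extensions WHERE name = 'hstore';

    -- upgrade the extension objects (casts, functions) to the current version
    ALTER EXTENSION hstore UPDATE;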
[GENERAL] Casting hstore to json
Hi.

I'm trying to cast hstore to json, but I don't seem to be getting a json object from hstore key/value pairs.

The 9.3 documentation says:

*Note:* The hstore (http://www.postgresql.org/docs/9.3/static/hstore.html) extension has a cast from hstore to json, so that converted hstore values are represented as JSON objects, not as string values.

to_json(anyelement) - If the data type is not built in, and there is a cast from the type to json, the cast function will be used to perform the conversion.

On a 9.3.5 server:

db=> SELECT 'a=>1, b=>2'::hstore;
       hstore
--------------------
 "a"=>"1", "b"=>"2"
(1 row)

db=> SELECT to_json('a=>1, b=>2'::hstore);
           to_json
------------------------------
 "\"a\"=>\"1\", \"b\"=>\"2\""
(1 row)

db=> SELECT cast(('a=>1, b=>2'::hstore) as json);
ERROR:  cannot cast type hstore to json
LINE 1: SELECT cast(('a=>1, b=>2'::hstore) as json);

Thanks!
Pawel.
Re: [GENERAL] Improving performance of merging data between tables
On Tue, Dec 30, 2014 at 7:25 PM, Maxim Boguk maxim.bo...@gmail.com wrote:

On Wed, Dec 31, 2014 at 11:10 AM, Pawel Veselov pawel.vese...@gmail.com wrote:

[skipped]

2) try pg_stat_statements, setting pg_stat_statements.track = all. see: http://www.postgresql.org/docs/9.4/static/pgstatstatements.html

I have used this to profile some functions, and it worked pretty well. Mostly I use it on a test box, but once ran it on the live, which was scary, but worked great.

That looks promising. Turned it on, waiting for when I can bounce the server at the next quiet time.

I have to say this turned out to be a bit of a disappointment for this use case. It only measures the total time spent in a call. So it surfaces operations that waited a lot on some lock. It's good, but it would be great if total_time were provided along with wait_time (and io_time as well, perhaps, since I also see operations that just naturally have to fetch a lot of data).

1) pg_stat_statements provides information about the io_time of each statement, but you need track_io_timing enabled for that.

Enabled that now. Still, the top winners are the functions that probably lock for a (relatively) long time. This did help me find some crap that either was missing an index, or used an unreasonable join, and just needed re-writing. One entry that doesn't make sense to me is:

    total_time - io_time = 1,366,773
    calls = 666,542
    query = SELECT * FROM q_SCHEDULER_STATE WHERE SCHED_NAME = ?

The table only has 18 rows; there is an index, but the planner chooses to ignore it, which is right, since the sched_name column has the same value for all rows. So all rows are returned by the SELECT. The time to run that query under database load varies from 0.09 to 70ms. This is a distraction from the main topic, though, but it does stand out as odd.

2) About locking: I suggest enabling log_lock_waits and setting deadlock_timeout to, say, 100ms (just for testing purposes); then any lock waiting for more than 100ms will be logged with some useful additional info.

PPS: btw, please check the database logs for deadlock messages; the "call a pgsql function to merge the data from its tables into the common tables" part of your setup could easily be deadlock prone.

I don't think I have an abnormal problem with locking. I wanted to eliminate locking time from the pg_stat_statements numbers, to first address queries that aren't waiting on disk and/or locks, as my problem is high CPU, not specific query performance. I don't have deadlocks for sure -- I had them before, and I would normally get an error if there was a deadlock. We process all the records in exactly the same key order to avoid deadlocks.

PPPS: and the last suggestion: after you have finished writing all the data into its own tables, the application should perform an ANALYZE of those tables (or you could get weird/inefficient plans during the last stage).

Any references to back this up? I don't particularly mind doing it, but I wonder if the analysis can be more expensive than the processing. These tables get a few hundred records inserted/updated, then are entirely processed (with expected full scans), and then deleted...

PS: your setup looks pretty complicated and hard to analyze without seeing all involved table structures, transaction/query flow, and (especially) involved procedures source code.

Sure :) At this point, I've put together the bulk merge code as well.
I can't quite see much of a difference, actually, but it's hard to trust the execution times, as on the same amount of data they vary from, say, 0.5s to 2s, and the sample data is not stepping on any other locks. In general, I'm afraid all those left joins and multiple scans, even over a small amount of data, are nullifying any positive effect.

primary table: http://pastebin.com/gE2TjZd3
secondary table(s): http://pastebin.com/aDVakUkp

There are actually 10 secondary tables, but they are more or less of the same structure. The node tables have identical structure to the main tables.

First stage, which I don't particularly question (but may be wrongfully so), the application does:

- create data suitable for the rows in the primary/secondary tables
- start a transaction
- merge the data into its own node tables (using merge_xxx PL/pgSQL functions) (~100 rows in the primary table)
- prepare the transaction
- deal with other data sources
- commit/roll back the prepared transaction depending on the success of the previous step

An example of a merge_xxx function: http://pastebin.com/6YYm8BVM

Second stage is really:

- start a transaction
- call PL/pgSQL merge_all()
- commit

2 reasons for the 2 stages:

- if stage #2 fails, the data will be merged during the next iteration
- the lock time on the shared tables is minimized

It's possible that an external process may take over writing data for a certain key subset (a combination of (tagid, blockid)); to make sure there is no race condition with such a process, such key pairs are locked, that's what the whole
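For reference, a sketch of the postgresql.conf knobs that came up in this thread (values are the ones suggested above; pg_stat_statements needs to be preloaded, and changing shared_preload_libraries requires a restart):

    shared_preload_libraries = 'pg_stat_statements'
    pg_stat_statements.track = all
    track_io_timing = on         # feeds blk_read_time/blk_write_time
    track_functions = all        # per-function stats in pg_stat_user_functions
    log_lock_waits = on
    deadlock_timeout = 100ms     # testing value suggested in-thread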
[GENERAL] min/max performance inequality.
Hi.

I was wondering how come there is such a drastic difference between finding max and min. It seems like index scan backwards is really bad... The table is freshly re-indexed just in case. I added a count(*) in there, forcing the seq scan, and it's even better than the backwards index scan...

db=> EXPLAIN ANALYZE select min(rowdate) from r_agrio where blockid = 4814;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Result (cost=495.89..495.90 rows=1 width=0) (actual time=24.149..24.150 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit (cost=0.00..495.89 rows=1 width=13) (actual time=24.139..24.140 rows=1 loops=1)
           ->  Index Scan using rowdate_r_agrio on r_agrio (cost=0.00..222160.24 rows=448 width=13) (actual time=24.137..24.137 rows=1 loops=1)
                 Index Cond: ((rowdate)::text IS NOT NULL)
                 Filter: (blockid = 4814::numeric)
 Total runtime: 24.186 ms
(7 rows)

db=> EXPLAIN ANALYZE select max(rowdate) from r_agrio where blockid = 4814;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Result (cost=495.89..495.90 rows=1 width=0) (actual time=926.032..926.033 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit (cost=0.00..495.89 rows=1 width=13) (actual time=926.019..926.021 rows=1 loops=1)
           ->  Index Scan Backward using rowdate_r_agrio on r_agrio (cost=0.00..222160.24 rows=448 width=13) (actual time=926.017..926.017 rows=1 loops=1)
                 Index Cond: ((rowdate)::text IS NOT NULL)
                 Filter: (blockid = 4814::numeric)
 Total runtime: 926.070 ms
(7 rows)

db=> EXPLAIN ANALYZE select count(*), max(rowdate) from r_agrio where blockid = 4814;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate (cost=31585.18..31585.19 rows=1 width=13) (actual time=461.079..461.080 rows=1 loops=1)
   ->  Seq Scan on r_agrio (cost=0.00..31582.94 rows=448 width=13) (actual time=8.912..460.999 rows=15 loops=1)
         Filter: (blockid = 4814::numeric)
 Total runtime: 461.134 ms
(4 rows)

db=> \d r_agrio
                    Table "public.r_agrio"
   Column    |         Type          |      Modifiers
-------------+-----------------------+---------------------
 id          | numeric(38,0)         | not null
 tagid       | numeric(38,0)         | not null
 blockid     | numeric(38,0)         | not null
 rowdate     | character varying(15) | not null
 count       | numeric(38,0)         | not null default 0
 events      | numeric(38,0)         | not null default 0
 devents     | numeric(38,0)         | not null default 0
 duration    | numeric(38,0)         | not null default 0
 device_type | numeric(38,0)         | not null
 placement   | numeric(38,0)         | not null default 0
 unserved    | numeric(38,0)         | not null default 0
 unconfirmed | numeric(38,0)         | not null default 0
 version     | numeric(38,0)         | not null default 1
Indexes:
    "pk_r_agrio" PRIMARY KEY, btree (id)
    "u_r_agrio" UNIQUE, btree (tagid, blockid, rowdate, device_type, placement)
    "rowdate_r_agrio" btree (rowdate)
Re: [GENERAL] min/max performance inequality.
Thanks Jeff (and Tom)

On Wed, Jan 7, 2015 at 3:34 PM, Jeff Janes jeff.ja...@gmail.com wrote:

On Wed, Jan 7, 2015 at 3:00 PM, Pawel Veselov pawel.vese...@gmail.com wrote:

Hi.

I was wondering how come there is such a drastic difference between finding max and min. It seems like index scan backwards is really bad... The table is freshly re-indexed just in case. I added a count(*) in there, forcing the seq scan, and it's even better than the backwards index scan...

db=> EXPLAIN ANALYZE select min(rowdate) from r_agrio where blockid = 4814;

It crawls the data in rowdate order (either forward or reverse) until it finds the first 4814. Crawling forward, it finds 4814 very early. Crawling backwards, it has to pass through a bunch of non-4814 rows before it finds the first 4814. This fact doesn't show up in your EXPLAIN ANALYZE, but if you used a more modern version of postgresql (9.2 or above) there would be another line, "Rows Removed by Filter", which would tell the story of what is going on.

Yeah, there are roughly 10x more rows to pass over when going backwards.

If you have a composite index on (blockid, rowdate), it would help make this much faster, as it can go directly to the desired row.

That does help a lot. So, when does postgres use a multi-column index, even if not all of its columns are engaged (as there is an index that involves those 2 fields, and more)? I definitely see it do that in some cases... Even with that index, however, there is still a good difference in time (the interest is theoretical at this point, as I found a better way to extract that data anyway). On a newer db:

db=> EXPLAIN analyze select min(rowdate) from r_agrio where blockid = 4814;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Result (cost=521.54..521.55 rows=1 width=0) (actual time=39.770..39.770 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit (cost=0.55..521.54 rows=1 width=13) (actual time=39.765..39.766 rows=1 loops=1)
           ->  Index Scan using rowdate_r_agrio on r_agrio (cost=0.55..303738.47 rows=583 width=13) (actual time=39.763..39.763 rows=1 loops=1)
                 Index Cond: ((rowdate)::text IS NOT NULL)
                 Filter: (blockid = 4814::numeric)
                 Rows Removed by Filter: 37246
 Total runtime: 39.798 ms
(8 rows)

db=> EXPLAIN analyze select max(rowdate) from r_agrio where blockid = 4814;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Result (cost=521.54..521.55 rows=1 width=0) (actual time=1497.377..1497.378 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit (cost=0.55..521.54 rows=1 width=13) (actual time=1497.371..1497.372 rows=1 loops=1)
           ->  Index Scan Backward using rowdate_r_agrio on r_agrio (cost=0.55..303738.47 rows=583 width=13) (actual time=1497.370..1497.370 rows=1 loops=1)
                 Index Cond: ((rowdate)::text IS NOT NULL)
                 Filter: (blockid = 4814::numeric)
                 Rows Removed by Filter: 317739
 Total runtime: 1497.407 ms
(8 rows)

db=> CREATE INDEX concurrently xxx on r_agrio(rowdate, blockid);
CREATE INDEX

db=> EXPLAIN analyze select min(rowdate) from r_agrio where blockid = 4814;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Result (cost=85.05..85.06 rows=1 width=0) (actual time=17.585..17.585 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit (cost=0.43..85.05 rows=1 width=13) (actual time=17.580..17.581 rows=1 loops=1)
           ->  Index Only Scan using xxx on r_agrio (cost=0.43..37827.09 rows=447 width=13) (actual time=17.578..17.578 rows=1 loops=1)
                 Index Cond: ((rowdate IS NOT NULL) AND (blockid = 4814::numeric))
                 Heap Fetches: 0
 Total runtime: 17.616 ms
(7 rows)

db=> EXPLAIN analyze select max(rowdate) from r_agrio where blockid = 4814;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Result (cost=85.04..85.05 rows=1 width=0) (actual time=89.141..89.142 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit (cost=0.43..85.04 rows=1 width=13) (actual time=89.135..89.136 rows=1 loops=1)
           ->  Index Only Scan Backward using xxx on r_agrio (cost=0.43..37823.09 rows=447 width=13) (actual time=89.134..89.134 rows=1 loops=1)
                 Index Cond: ((rowdate IS NOT NULL) AND (blockid = 4814::numeric))
                 Heap Fetches: 1
 Total runtime: 89.173 ms
(7 rows)

Cheers,
Re: [GENERAL] Improving performance of merging data between tables
On Mon, Dec 29, 2014 at 9:29 PM, Pawel Veselov pawel.vese...@gmail.com wrote:

[skipped]

1) How do I find out what exactly is consuming the CPU in a PL/pgSQL function? All I see is that the calls to the merge_all() function take a long time, and the CPU is high while this is going on.

[skipped]

2) try pg_stat_statements, setting pg_stat_statements.track = all. see: http://www.postgresql.org/docs/9.4/static/pgstatstatements.html

I have used this to profile some functions, and it worked pretty well. Mostly I use it on a test box, but once ran it on the live, which was scary, but worked great.

That looks promising. Turned it on, waiting for when I can bounce the server at the next quiet time.

I have to say this turned out to be a bit of a disappointment for this use case. It only measures the total time spent in a call. So it surfaces operations that waited a lot on some lock. It's good, but it would be great if total_time were provided along with wait_time (and io_time as well, perhaps, since I also see operations that just naturally have to fetch a lot of data).

[skipped]
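A hedged example of pulling that breakdown out of pg_stat_statements (column names as of the 9.2-9.4 era; the io_time split needs track_io_timing = on, and lock-wait time is indeed not exposed as its own column):

    SELECT query,
           calls,
           total_time,
           blk_read_time + blk_write_time                AS io_time,
           total_time - (blk_read_time + blk_write_time) AS cpu_lock_and_other_time
      FROM pg_stat_statements
     ORDER BY total_time DESC
     LIMIT 10;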
Re: [GENERAL] Improving performance of merging data between tables
Andy, thanks for looking into this.

On Mon, Dec 29, 2014 at 9:00 AM, Andy Colson a...@squeakycode.net wrote:

On 12/28/2014 3:49 PM, Pawel Veselov wrote:

Hi.

I was wondering if anybody would have any ideas on how to improve certain operations that we are having.

SNIP

Besides, can somebody please look at this and let me know if I'm doing something utterly stupid? Here are my questions.

1) How do I find out what exactly is consuming the CPU in a PL/pgSQL function? All I see is that the calls to the merge_all() function take a long time, and the CPU is high while this is going on.

First, I'll admit I didn't read your entire post.

I can think of a couple methods:

1) try each of the statements in merge_all by hand with an explain analyze in front, to see which is slow. Look for things that hit big tables without an index. Check that fk lookups use indexes.

If I didn't miss anything, that seems to be OK, even on function-based queries.

2) try pg_stat_statements, setting pg_stat_statements.track = all. see: http://www.postgresql.org/docs/9.4/static/pgstatstatements.html

I have used this to profile some functions, and it worked pretty well. Mostly I use it on a test box, but once ran it on the live, which was scary, but worked great.

That looks promising. Turned it on, waiting for when I can bounce the server at the next quiet time.

3) try auto-explain: http://www.postgresql.org/docs/9.4/static/auto-explain.html

I've never used it, so I don't know if it'll show each statement inside a function. It dumps stuff to the log, AFAIK, so you'll have to dig the info out by hand.

2) Is there a better way to merge individual rows, except doing UPDATE/INSERT in a loop, and would that be CPU expensive?

Not that I know of. I use pretty much the same thing. Soon! we will have merge/upsert support. Hopefully it'll be fast.

Well, any time I cancelled the PID that was executing this whole mess, it would always stop at an UPDATE ... SET ... WHERE on the main table. Which does make me believe that a bulk update would really help.

3) Is there a better way to merge whole tables? However, note that I need to translate primary keys from the node main table into the common main table, as they are used as foreign keys, hence the loops. I suspect the looping is CPU intensive.

Avoiding loops and doing things in sets is the best way. If possible. The only loop I saw was looping over the merge_xxx tables, which is probably the only way.

There is an endless loop that is just a device for merging, but then there are loops going over each record in all the tables that are being merged, feeding them into the function that actually does the merge. That table iteration is what I want to eliminate (especially if I knew it would help :) ).

If possible (if you haven't already) you could add an extra column to your secondary table that you can set to the main table's key:

    bulk insert into second;
    update second set magic = (select key from main where ... );

Then, maybe, you can do two ops in batch:

    update main (where key exists in main)
    insert into main (where key not exists in main)

I was thinking along the same lines. I can't really do a bulk insert at any point, because any key can be inserted by another process at any time, with a good probability. However, there will be a lot fewer inserts than updates.
So, in general, I'm making it do this:

    with pivot as (
        select main_table.id as main_id, node_table.id as node_id
          from node_table left join main_table using (key fields)
    )
    update node_table set translate_id = pivot.main_id
      from pivot
     where node_table.id = pivot.node_id;

    (missing is a cursor: select from node_table where translate_id is null)

    for row in missing loop
        -- merge_function will return the PK of either the updated or the inserted record.
        -- use (0) data values, so that it's an identity update if the merge results
        -- in an update, or empty data if not.
        select merge_function(missing.key_fields, 0) into use_id;
        update node_table set translate_id = use_id where current of missing;
    end loop;

At this point, I have a guarantee that I can update all records, and there is nothing to insert. So:

    with new as (
        select * from node_table
    )
    update main_table old
       set val = f(old.val, new.val)
      from new
     where new.translate_id = old.id;

So, I don't need full key matching anymore; I can use PKs instead.
[GENERAL] Improving performance of merging data between tables
Hi.

I was wondering if anybody would have any ideas on how to improve certain operations that we are having. PostgreSQL 9.3.3.

Table layout:

main_table: PK, N key columns, M data columns. The key columns are bound together in a unique key. PK is a pure sequence number. There are a few separate non-unique indices on some of the key columns, to aid in querying the table data.

second_table: PK, main_table_PK_REF (declared as a foreign key), Na key columns, Ma data columns. There are 0-inf (typically 0-10) entries in second_table that reference a single entry in main_table. PK is a pure sequence number, and the unique key for that table is (main_table_PK_REF + key columns).

The only non-obvious thing here is that some of the unique indexes involve the coalesce() function:

    "uq_brkioevent" UNIQUE, btree (mainid, blockid, name, subid, (COALESCE(cname, ''::character varying)))

Those are there because we need to write NULLs into the tables, but unique indexes don't like NULL values (AFAIR).

There is one main_table, and 10 second_tables. The number of rows in the main table right now is ~1.1M; second tables have about 1M-2M rows. The growth of the main table is a fixed number of rows (~10) per hour.

Multiple (web) application nodes need to write data into this table. Most of the operations modify the data columns, rather than inserting new data. We had serious contention problems if we let all the application nodes write directly into the table. The writes involve prepared transactions, a prepared transaction can only be reaped after other data stores are written to, and there is a very high chance that nodes will need to modify the same rows, so the locking was taking too long.

To relieve the contention, we have allocated each application node its own set of tables that structurally are exactly like the main/second tables. An application node will open a transaction, write all of its data into its own tables, free of locks, and then call a pgsql function to merge the data from its tables into the common tables. There is typically relatively little data in the node tables (say, within 100 rows in any table) before it is merged into the common tables. Nodes dump their data whenever there is something to dump, which can be a few times a second.

Recently, the operation that moves the data from the node tables into the common tables started being a real drain on the PostgreSQL server CPU. I assume this is probably due to the data set size reaching some critical mass. Things really got out of hand when we had to double the number of application nodes to accommodate a surge in application use.

The merging is done in the following manner. Each main/second table has an associated PL/pgSQL function (merge_xxx) that takes key and data values as arguments. Then, in an endless loop, it tries to execute an UPDATE statement (using math operations to update the data based on the existing and input data values, using the key data in the query part). If the UPDATE statement sets FOUND, the function exits. Otherwise, the function tries to INSERT with the key/data values. If that succeeds, the function exits; if unique_violation is thrown, the loop continues.

On top of these individual functions, there is another PL/pgSQL function (merge_all).
It uses "for ROW in select * from MAIN_NODE" as the outer loop, and within that loop it calls merge_xxx for the main table; then, for each secondary table, it does the same with "for ROWx in select * from SECOND_NODE", adding a WHERE clause to only pick up entries that correspond to the main_node entry currently being processed, calling merge_xxx for the corresponding secondary table. At the end of the outer loop, all data from the node tables is removed (using DELETE). I will gladly provide pseudo-code, or even the function body, if my explanation is unclear.

Besides that - can somebody please look at this and let me know if I'm doing something utterly stupid? - here are my questions.

1) How do I find out what exactly is consuming the CPU in a PL/pgSQL function? All I see is that the calls to the merge_all() function take a long time, and the CPU is high while this is going on.

2) Is there a better way to merge individual rows, except doing UPDATE/INSERT in a loop, and would that be CPU expensive?

3) Is there a better way to merge whole tables? However, note that I need to translate primary keys from the node main table into the common main table, as they are used as foreign keys, hence the loops. I suspect the looping is CPU intensive.

Thank you,
Pawel.
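For concreteness, the UPDATE-then-INSERT loop described above is the canonical retry pattern from the PostgreSQL documentation; a sketch with invented table/column names:

    CREATE OR REPLACE FUNCTION merge_counter(_key integer, _delta numeric)
    RETURNS void AS $$
    BEGIN
        LOOP
            -- try the update first; most calls end here
            UPDATE counters SET value = value + _delta WHERE key = _key;
            IF FOUND THEN
                RETURN;
            END IF;
            -- no such row yet: try to insert, and if another session
            -- beat us to it, loop around and update instead
            BEGIN
                INSERT INTO counters (key, value) VALUES (_key, _delta);
                RETURN;
            EXCEPTION WHEN unique_violation THEN
                -- do nothing, and loop to try the UPDATE again
            END;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;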
[GENERAL] incremental digest (and other) functions?
Hi.

I was looking into the need of computing a SHA digest of the contents of a large object, and it seems that the only digest (and other) functions provided by pgcrypto produce output based on a single string as input, without being able to accumulate over multiple data blocks. It would be nice to see those, IMHO. Or maybe I missed them?

Thank you,
Pawel.
[GENERAL] documentation for lo_* functions
Hi.

Where is the proper documentation for the lo_* functions (e.g. lo_open) that are available as SQL functions? I see the libpq functions documented in /static/lo-interfaces.html, but not the SQL ones (from pg_catalog).

Thank you,
Pawel.
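For what it's worth, the server-side functions mirror the client-side ones; a small sketch, assuming the INV_READ/INV_WRITE mode bits (0x40000 and 0x20000) carried over from the libpq docs, and with the returned OID and descriptor values made up:

    BEGIN;
    SELECT lo_create(0);                   -- returns a new large object OID, say 16403
    SELECT lo_open(16403, x'60000'::int);  -- INV_READ | INV_WRITE; returns descriptor 0
    SELECT lowrite(0, 'hello'::bytea);
    SELECT lo_lseek(0, 0, 0);              -- rewind (whence 0 = SEEK_SET)
    SELECT loread(0, 5);
    SELECT lo_close(0);
    COMMIT;                                -- descriptors only live for the transaction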
[GENERAL] ORDER BY expression required in SELECT if DISTINCT
Hi.

Anything I can do about this? It seems that the problem is that the query is a prepared statement, even though the parameters ($1 and $6) are the same.

ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list at character 493
STATEMENT:
    SELECT DISTINCT t0.package_name, ts_rank(t5.search_vector, to_tsquery($1))
      FROM application t0
     CROSS JOIN application t1
     CROSS JOIN application_search t5
     INNER JOIN application_state t2 ON t1.PUBLICAPPSTATE_ID = t2.id
     INNER JOIN application_state_price t3 ON t2.id = t3.E_APPSTATE_ID
     INNER JOIN price t4 ON t3.PRICES_ID = t4.id
     WHERE (t4.currency = $2 AND t2.frontState = $3
            AND t0.package_name = t5.APP_PACKAGE_NAME
            AND ts_match_vq(t5.search_vector, to_tsquery($4)) = $5
            AND 1 = 1)
     ORDER BY ts_rank(t5.search_vector, to_tsquery($6)) ASC
     LIMIT $7

This works:

    SELECT DISTINCT t0.package_name, ts_rank(t5.search_vector, to_tsquery('aaa'))
      FROM application t0
     CROSS JOIN application t1
     CROSS JOIN application_search t5
     INNER JOIN application_state t2 ON t1.PUBLICAPPSTATE_ID = t2.id
     INNER JOIN application_state_price t3 ON t2.id = t3.E_APPSTATE_ID
     INNER JOIN price t4 ON t3.PRICES_ID = t4.id
     WHERE (t4.currency = 'USD' AND t2.frontState = 'PUBLISHED'
            AND t0.package_name = t5.APP_PACKAGE_NAME
            AND ts_match_vq(t5.search_vector, to_tsquery('aaa'))
            AND 1 = 1)
     ORDER BY ts_rank(t5.search_vector, to_tsquery('aaa')) ASC

Thank you,
Pawel.
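One hedged workaround, if the query generator can be bent that way: give the rank an output alias and ORDER BY the alias, so the ORDER BY refers to the select-list item itself and never has to textually match a second parameter. Trimmed to the essential shape:

    SELECT DISTINCT t0.package_name,
           ts_rank(t5.search_vector, to_tsquery($1)) AS rank
      FROM application t0
     CROSS JOIN application_search t5
     WHERE t0.package_name = t5.APP_PACKAGE_NAME
       AND ts_match_vq(t5.search_vector, to_tsquery($1))
     ORDER BY rank ASC
     LIMIT $2;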
Re: [GENERAL] Understanding streaming replication
On Mon, Nov 12, 2012 at 1:36 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote:

I'll try to answer the questions I can.

Thank you!

Pawel Veselov wrote:

I've been struggling with understanding all the necessary pieces for streaming replication. So I put down the pieces as I did understand them, and would appreciate if you guys could point out any of the stuff I understood or have done wrong.

The set up is pgpool + streaming replication + hot stand-by. No load balancing; stand-by nodes will not receive any application queries (I don't have that big of a query load, and I don't want to risk inconsistent reads). There are no shared file systems, but there is a way to rsync/scp files between nodes. Fail-over is automatic, and should kick in within a reasonably small period after master failure.

1. Archiving. Should be turned on on all the nodes. The archive command should copy the archive file to the local archive directory, and rsync the archive directory between all the nodes. My understanding is that archiving is necessary if a stand-by node ever misses enough WAL updates to need a WAL old enough to have been removed from pg_xlog.

You don't give details about how the rsync is triggered, but I'd advise against having rsync as part of archive_command. First, it is slow, and if there is a lot of activity, the archiver will not be able to keep up. Second, if rsync fails, the WAL file will not be considered archived. Both these things will keep the WAL files from being deleted from pg_xlog. I'd schedule rsync as a cron job or similar. From your later comments, it's also apparent that these archived WALs will be useless after failover (for the purpose of recovery), so there is no reason to send them to all the nodes after all.

QUESTION: After the failover, the new master will start archiving its WAL files. These archived WALs will not collide in any way with the archived WALs generated by previous master(s)?

They will not, because the standby starts a new timeline when it is promoted to primary, which will result in new WAL file names.

QUESTION: What is a good policy for archive clean-up, from the perspective of only removing archive files that are guaranteed to never be required by any node?

You cannot tell from the primary's side. Since you also need the archives to restore an online backup, I'd keep them as long as your backup policy dictates. I hope you don't rely on standby databases for backup (just imagine an accidental DROP TABLE that gets propagated to all standbys within seconds).

I don't rely on stand-bys for backup. But that timeline establishment business is a key piece that I didn't realize.

2. Failover. On master failure, pgpool will automatically select a new master, and degenerate all other nodes. The cluster is now in the emergency state and requires manual intervention for reconfiguration and recovery. pgpool executes a script to promote a node; that script will create a trigger file on the newly selected master node, and postgres will exit stand-by mode.

QUESTION: If multiple pgpools are running, and if there are no network problems, and the configuration files are identical, is there any guarantee that the same stand-by node will be selected for promotion? The concern here is that, with a configuration of (M-SB0-SB1), one pgpool decides to promote SB0 and another SB1, causing both of them to enter master mode and splitting the cluster. It does look like pgpool will always select the next alive node for promotion, but I couldn't find a definitive statement on that.
I don't know about pgpool and its abilities to handle cluster failover, but I wouldn't go this way at all. Even if the answer were that things would work in the circumstances you describe, you can count on things going wrong in ways different from what you expect, e.g. a broken network card. The consequences would be worse than I'd like to imagine. I would imagine this situation can happen in any case; I don't see how it's logically avoidable. If you only have one agent that has the power to promote a node to be a new master, you have a single point of failure. If you have multiple agents that can do the promotion, there is always a risk that they fall out of sync. If you want reliable automatic failover, consider cluster software. Is there anything you could recommend? [skipped] Yours, Laurenz Albe
Re: [GENERAL] Understanding streaming replication
On Mon, Nov 12, 2012 at 10:11 AM, Pawel Veselov pawel.vese...@gmail.com wrote: On Mon, Nov 12, 2012 at 1:36 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote: [skipped] From your later comments, it's also apparent that these archived WALs will be useless after failover (for the purpose of recovery), so there is no reason to send them to all the nodes after all. I obviously lost it here. The archives do need to be synchronized, for the purpose of recovering slaves. If a slave dies, and I want to recover it, it may need the archived WALs, and for this, the archives should be available on the node. So, rsync (or something like that) is necessary, but it's a bad idea to run the rsync from the archive command itself.
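[Editor's note: a minimal sketch of the separation being agreed on here - archive locally from archive_command, ship out of band from cron. The archive directory, host name 'standby1', and cron interval are made up for illustration; the test-and-cp pattern is the standard one from the PostgreSQL docs.]

# postgresql.conf: archive locally only; the 'test' guards against overwriting an existing file
archive_command = 'test ! -f /var/lib/pgsql/archive/%f && cp %p /var/lib/pgsql/archive/%f'

# crontab: ship the archive directory to the standby out of band,
# so a slow or failed transfer never blocks the archiver
*/5 * * * * rsync -a /var/lib/pgsql/archive/ standby1:/var/lib/pgsql/archive/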
[GENERAL] Understanding streaming replication
Hi. I've been struggling with understanding all the necessary pieces for streaming replication. So I put down the pieces as I did understand them, and would appreciate if you guys could point out any of the stuff I understood or have done wrong. The set up is pgpool + streaming replication + hot standby. No load balancing; stand-by nodes will not receive any application queries (I don't have that big of a query load, and I don't want to risk inconsistent reads). There are no shared file systems, but there is a way to rsync/scp files between nodes. Fail-over is automatic, and should kick in within a reasonably small period after master failure. 1. Archiving. Should be turned on on all the nodes. The archive command should copy the archive file to the local archive directory, and rsync the archive directory between all the nodes. My understanding is that archiving is necessary if a stand-by node ever missed enough WAL updates to need a WAL old enough that it might have been removed from pg_xlog. QUESTION: After the failover, the new master will start archiving its WAL files. These archived WALs will not collide in any way with the archived WALs generated by previous master(s)? QUESTION: What is a good policy for archive clean up? The goal being to only remove archive files that are guaranteed to never be required by any node. 2. Failover. On master failure, pgpool will automatically select a new master, and degenerate all other nodes. The cluster is then in an emergency state and requires manual intervention for reconfiguration and recovery. pgpool executes a script to promote a node; that script will create a trigger file on the newly selected master node, and postgres will exit stand-by mode. QUESTION: If multiple pgpools are running, and if there are no network problems, and configuration files are identical, is there any guarantee that the same stand-by node will be selected for promotion? The concern here is that with a configuration of (M-SB0-SB1), one pgpool decides to promote SB0 and another - SB1, causing both of them to enter master mode, and splitting the cluster. It does look like pgpool will always select the next alive node for promotion, but I couldn't find a definitive statement on that. 3. Recovery. That part is a bit confusing. The majority of the documentation says that in this case, the node should be re-loaded from a base backup obtained from the master. I'm not sure why this is necessary if there are enough archived WALs. QUESTION: Is there any metric to understand whether hauling the base backup will be slower/faster than replaying the missed WALs? Anyway, pgpool only has one recovery mechanism, and it does invoke a base restore from whatever the current master is. PROBLEM: This I see as a problem. The only way that I see to re-attach a node to pgpool, short of restarting it, is to call pcp_recovery_node. This will make the master take a base backup, push it to the stand-by that needs recovery, and re-start the stand-by node. I am not sure if there is a good way to check whether that node has already been recovered. That's because if there are more than 2 pgpools, they both will attempt to recover the same stand-by, and this will probably get ugly. Thank you, Pawel.
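[Editor's note: for reference, a sketch of the standby-side recovery.conf this setup implies on 9.x; the paths, host name, and role name are made up. The pgpool promotion script described above would simply create the file named in trigger_file, which makes the standby exit recovery.]

standby_mode = 'on'
primary_conninfo = 'host=master port=5432 user=replicator'
restore_command = 'cp /var/lib/pgsql/archive/%f %p'
trigger_file = '/tmp/pg_promote_trigger'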
[GENERAL] index update
Hi. If I have a lot (10k) of tables, and each table has a btree index, and all the tables are being constantly inserted into, would all the indexes have to be in memory, and would they effectively start fighting for space? Thank you, Pawel.
Re: [GENERAL] index update
On Tue, Jul 17, 2012 at 9:18 PM, Craig Ringer ring...@ringerc.id.au wrote: On 07/18/2012 12:02 PM, Pawel Veselov wrote: Hi. If I have a lot (10k) of tables, and each table has a btree index, and all the tables are being constantly inserted into, would all the indexes have to be in memory, and would they effectively start fighting for space? Quite likely, yes. You could make it a bit more efficient by grouping inserts up and doing batches for each table - if that's possible in your application. Not with the current design :) 10k tables is a heck of a lot. That sort of thing often implies app design issues. That may be. However, attempting to put all the data into the same table created problems with looking it up, and with the fact that both write and read traffic hit the same table. Why 10k tables? What do they do? Realistically, that's a way to partition data. The tables have the same structure, but apply to different partitions. If I am to use Postgres partitioning, would there be any difference? Considering that I would have to do almost exactly the same, with the inheritance and all? Thank you! -- Craig Ringer
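[Editor's note: for what it's worth, a minimal sketch of what inheritance-based partitioning looks like on the 9.x releases discussed here, with made-up table names. It is indeed nearly the same hand-work as separate tables, the practical difference being that the planner can exclude children via CHECK constraints.]

CREATE TABLE events (bucket int NOT NULL, payload text);
CREATE TABLE events_b1 (CHECK (bucket = 1)) INHERITS (events);
CREATE INDEX events_b1_idx ON events_b1 (bucket);
-- with constraint_exclusion = partition (typically the default),
-- a query such as
--   SELECT * FROM events WHERE bucket = 1;
-- scans only events_b1 plus the (empty) parent table.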
[GENERAL] errors with high connections rate
Hi. -- problem 1 -- I have an application, using libpq, connecting to postgres 9.1.3 (Amazon AMI distro). The application writes data at a high rate (at this point it's 500 transactions per second), using multiple threads (at this point it's 800). These are worker threads that receive messages that are then written out to the DB. There is no connection pool; instead, each worker thread maintains its own connection that it uses to write data to the database. The connections are kept in pthread-specific data blocks. Each thread connects to the DB when its first work message is received, or when the error flag was set on its connection. The error flag is set any time there is any error running a database statement. When the load is low, I don't see any problem (low was ~250 messages per second). As I increased the load, when I restart the process, threads start grabbing work at a high enough rate, and each will first open a connection to the database, and these errors start popping up: Can't connect to DB: could not send data to server: Transport endpoint is not connected could not send startup packet: Transport endpoint is not connected This is a result of executing the following code:
wi->pg_conn = PQconnectdb(conn_str);
ConnStatusType cst = PQstatus(wi->pg_conn);
if (cst != CONNECTION_OK) {
    ERR("Can't connect to DB: %s\n", PQerrorMessage(wi->pg_conn));
}
Eventually the errors go away (when the worker threads fail to connect, they just pass the message to another thread, wait their turn, and try reconnecting again), so it does seem that the remedy is just spreading the connections out in time. The connection string is '' (empty); the connection is made through /tmp/.s.PGSQL.5432. I don't see these errors when: 1) the number of worker threads is reduced (could never reproduce it with 200 or fewer, but have seen them with 300 and more) 2) the amount of load is reduced -- problem 2 -- As I'm trying to debug this (with strace), I could never reproduce it, at least to see what's going on, but sometimes I get another error: too many users connected. Even restarting postmaster doesn't help. The postmaster is running with -N810, and the role has a connection limit of 1000. Yet, the 'too many users' error starts creeping up after only 275 connections are opened (counted by successful connect() calls in strace). Any idea where I should dig? P.S. I looked at fe-connect.c, and I'm wondering if there is a potential race condition between poll() and the socket actually finishing the connection? If running under strace, I never see EINPROGRESS returned from connect(), and the only reason sendto() would result in ENOTCONN is if the connect didn't finish, and the socket was deemed connected using poll/getsockopt... Thanks, Pawel.
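[Editor's note: since spreading the connections out in time appears to be the remedy, here is a hedged sketch of retrying PQconnectdb with exponential backoff, as an alternative to handing the message to another thread. The function name, max_tries, and the delay values are made up; only the libpq calls are real.]

#include <libpq-fe.h>
#include <stdio.h>
#include <unistd.h>

/* Retry a connection attempt with exponential backoff so that
 * hundreds of threads don't all hit the postmaster at the same
 * instant. Returns NULL if all attempts fail. Keep max_tries
 * small (e.g. 5) so the usleep argument stays well-defined. */
static PGconn *connect_with_backoff(const char *conn_str, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        PGconn *conn = PQconnectdb(conn_str);
        if (PQstatus(conn) == CONNECTION_OK)
            return conn;
        fprintf(stderr, "connect attempt %d: %s", i + 1, PQerrorMessage(conn));
        PQfinish(conn);               /* failed PGconn objects must be freed too */
        usleep(100000u << i);         /* 0.1s, 0.2s, 0.4s, ... */
    }
    return NULL;
}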
[GENERAL] 31.11. Notice Processing - associating notices with calls
Hi. What's the best way to associate an incoming notice with the statement that resulted in generating it? Notices operate on PGresult objects, but these objects only become available after the statement call is made. Should I have a global (or thread-global, since my app is MT) flag that would tell the receiver what statement was executed last? I assume that the execution path is that the notice callback is invoked while the statement function executes. Thank you, Pawel.
Re: [GENERAL] 31.11. Notice Processing - associating notices with calls
On Mon, Jun 25, 2012 at 8:07 AM, Tom Lane t...@sss.pgh.pa.us wrote: Pawel Veselov pawel.vese...@gmail.com writes: What's the best way to associate an incoming notice with the statement that resulted in generating it? Um ... the first issue with this problem statement is the assumption that there *is* a statement that caused the notice. The server is capable of generating notices autonomously, for example during a forced database shutdown. But having said that, you could certainly keep track of which command you last issued. Is there then any way to know whether a notice came from a statement? My issue is that there are some statements that generate notices that I can safely dismiss (CREATE something IF NOT EXISTS), but I don't want to dismiss any others. I believe notices are not asynchronous (I don't use any asynchronous API), so if there is a pending notice on the connection, and I set the current statement and execute it, I will first get the pending notice, and only then the statement-related notice. Notices operate on PGresult objects, but these objects only become available after the statement call is made. I think you are misunderstanding the notice receiver API. The PGresult that's passed to the receiver is just a transient one, created to hold the notice message's fields. It has nothing to do with the PGresult generated to hold the eventual result of the current query (if any). Yes, I did misunderstand it. Is there a standard for what would be in this result object?
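[Editor's note: the transient PGresult carries the same error fields a server error would, readable with PQresultErrorField and the PG_DIAG_* codes. A sketch combining that with the track-your-last-command advice above; the notice_ctx struct and the suppression flag are made up for illustration.]

#include <libpq-fe.h>
#include <stdio.h>

typedef struct {                 /* hypothetical per-connection context */
    const char *current_stmt;    /* set just before each PQexec */
    int suppress;                /* e.g. around CREATE ... IF NOT EXISTS */
} notice_ctx;

static void my_notice_receiver(void *arg, const PGresult *res)
{
    notice_ctx *ctx = arg;
    if (ctx->suppress)
        return;                  /* dismiss the expected notice */
    const char *state = PQresultErrorField(res, PG_DIAG_SQLSTATE);
    const char *msg = PQresultErrorField(res, PG_DIAG_MESSAGE_PRIMARY);
    fprintf(stderr, "notice during \"%s\": [%s] %s\n",
            ctx->current_stmt ? ctx->current_stmt : "(no statement)",
            state ? state : "?????", msg ? msg : "");
}

/* usage sketch:
 *   notice_ctx ctx = { NULL, 0 };
 *   PQsetNoticeReceiver(conn, my_notice_receiver, &ctx);
 *   ctx.current_stmt = "CREATE TABLE IF NOT EXISTS t (i int)";
 *   ctx.suppress = 1;
 *   PQclear(PQexec(conn, ctx.current_stmt));
 *   ctx.suppress = 0;
 */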