Re: [GENERAL] Help with slow table update

2015-04-17 Thread Pawel Veselov

 [skipped]


  But remember that if you update or delete a row, removing it from an
 index, the data will stay in that index until vacuum comes along.

 Also, there's no point in doing a REINDEX after a VACUUM FULL;
 vacuum full rebuilds all the indexes for you.


 I was being desperate :)

 I still think there is something very wrong with this particular table.
 First, I have production systems that employ this function on a way larger
 data set, and there is no problem (so far, but still). This machine is
 part of a test deployment, there is no constant load, the only data that
 is being written now is when I do these tests. Vacuuming should prune
 all that dead stuff, and if it's absent, it's unclear where the time is
 spent navigating/updating the table with 24 rows :)


 I think you definitely have a problem with dead rows, as evidenced by the
 huge improvement VACUUM FULL made.


 But it's not clear why (and not reasonable, IMHO, that) it wouldn't
 improve past current point.


What I should've done is 'VACUUM FULL VERBOSE'. Once I did, it told me
there were 800k dead rows that can't be removed. After digging around I
found some dangling prepared transactions, going back months. Once I threw
those away, and re-vacuumed, things got back to normal.

Thanks for all your help and advice.


Re: [GENERAL] Help with slow table update

2015-04-15 Thread Pawel Veselov

 [skipped]



 This is where using sets becomes really tedious, as Postgres severely
 lacks an upsert-like statement.

 I don't think there are joins allowed in an UPDATE statement, so I will need
 to use a WITH query, right?

 Also, I'm not sure how LEFT JOIN will help me isolate and insert missed
 entries...



   [skipped]



 Yes, you can do UPDATE with joins (
 http://www.postgresql.org/docs/9.4/static/sql-update.html) like this:



 UPDATE table1 A SET col1 = B.col2

   FROM table2 B

   WHERE A.col3 = B.col4;



I meant using the JOIN operator in the UPDATE itself. But it's still possible,
though through a WITH query.
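
For what it's worth, a minimal sketch of the pattern being discussed here
(update through a writable CTE, then insert whatever the update did not touch
via a LEFT JOIN). The source table xq_agr_flat and the aggregation are purely
illustrative; only the key columns mirror the ones used elsewhere in this thread:

WITH src AS (
    SELECT tagid, unitid, device_type, placement, sum(unserved) AS unserved
      FROM xq_agr_flat              -- hypothetical flattened queue data
     GROUP BY 1, 2, 3, 4
), upd AS (
    UPDATE r_agrio_total t
       SET unserved = t.unserved + s.unserved
      FROM src s
     WHERE (t.tagid, t.unitid, t.device_type, t.placement)
         = (s.tagid, s.unitid, s.device_type, s.placement)
 RETURNING t.tagid, t.unitid, t.device_type, t.placement
)
INSERT INTO r_agrio_total (tagid, unitid, device_type, placement, unserved)
SELECT s.tagid, s.unitid, s.device_type, s.placement, s.unserved
  FROM src s
  LEFT JOIN upd u USING (tagid, unitid, device_type, placement)
 WHERE u.tagid IS NULL;
-- Note: without a retry on unique_violation this still races with
-- concurrent inserts of the same key.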


Re: [GENERAL] Help with slow table update

2015-04-14 Thread Pawel Veselov
On Mon, Apr 13, 2015 at 6:03 PM, David G. Johnston 
david.g.johns...@gmail.com wrote:

 On Mon, Apr 13, 2015 at 5:01 PM, Pawel Veselov pawel.vese...@gmail.com
 wrote:


 r_agrio_hourly - good, r_agrio_total - bad.

  Update on r_agrio_hourly  (cost=0.42..970.32 rows=250 width=329) (actual
 time=2.248..2.248 rows=0 loops=1)
   ->  Index Scan using u_r_agrio_hourly on r_agrio_hourly
  (cost=0.42..970.32 rows=250 width=329) (actual time=0.968..1.207 rows=1
 loops=1)
  Index Cond: ((tagid = 1002::numeric) AND (unitid =
 1002::numeric) AND ((rowdate)::text = '2015-04-09T23'::text) AND
 (device_type = 3::numeric) AND (placement = 2::numeric))
  Total runtime: 2.281 ms
  Update on r_agrio_total  (cost=0.42..45052.56 rows=12068 width=321)
 (actual time=106.766..106.766 rows=0 loops=1)
   ->  Index Scan using u_r_agrio_total on r_agrio_total
  (cost=0.42..45052.56 rows=12068 width=321) (actual time=0.936..32.626
 rows=1 loops=1)
  Index Cond: ((tagid = 1002::numeric) AND (unitid =
 1002::numeric) AND (device_type = 3::numeric) AND (placement = 2::numeric))
  Total runtime: 106.793 ms


 What it is you expect to see here?

 What are the results (count and times) for:

 SELECT count(*) FROM r_agrio_total WHERE tagid = 1002 and unitid = 1002;


Result: 8 (the whole table is 24 rows). It returns somewhat with a stumble,
but relatively quickly.
db= explain analyze SELECT count(*) FROM r_agrio_total WHERE tagid = 1002
and unitid = 1002;
   QUERY PLAN

-
 Aggregate  (cost=4.45..4.46 rows=1 width=0) (actual time=327.194..327.195
rows=1 loops=1)
   ->  Index Scan using tag_r_agrio_total on r_agrio_total
 (cost=0.42..4.45 rows=1 width=0) (actual time=0.039..327.189 rows=8
loops=1)
 Index Cond: (tagid = 1002::numeric)
 Filter: (unitid = 1002::numeric)
 Total runtime: 327.228 ms


 SELECT count(*) FROM r_agrio_hourly WHERE tagid = 1002 and unitid = 1002;


Result is 2869. Returns somewhat quickly. Explain analyze is crazy though:
db= explain analyze SELECT count(*) FROM r_agrio_hourly WHERE tagid = 1002
and unitid = 1002;

 QUERY PLAN


 Aggregate  (cost=68134.68..68134.69 rows=1 width=0) (actual
time=15177.211..15177.211 rows=1 loops=1)
   ->  Index Scan using adunit_r_agrio_hourly on r_agrio_hourly
 (cost=0.42..67027.10 rows=443035 width=0) (actual time=0.096..15175.730
rows=2869 loops=1)
 Index Cond: (unitid = 1002::numeric)
 Filter: (tagid = 1002::numeric)
 Total runtime: 15177.240 ms

More queries along this line might be needed.  The underlying question is
 how many index rows need to be skipped over on total to get the final
 result - or rather are the columns in the index in descending order of
 cardinality?


Idea is - both tables have unique multi-field indices, and each update hits
exactly one row from that index, no more, and all fields from the index are
locked with equality condition on the update. All of the updates (within a
transaction) would always work on a small subset of rows (max a few
hundred, ever; in this case, it's maybe around 10). I expect it to be
possible for the server to keep the active working set in the cache at all
times. Since the index is unique, there shouldn't be a reason to re-scan
the table, if a cached row is found, no?


 Any chance you can perform a REINDEX - maybe there is some bloat
 present?  There are queries to help discern if that may be the case, I do
 not know them off the top of my head, but just doing it might be acceptable
 and is definitely quicker if so.


That's the thing - I've done both vacuum full, and re-index. The very first
time I did vacuum full things improved (60 seconds to 7 seconds). Re-index
didn't improve anything (but it was done after vacuum full).
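
If it helps, bloat can be inspected directly with the pgstattuple contrib
module (assuming it can be installed on the instance); a minimal sketch:

    CREATE EXTENSION IF NOT EXISTS pgstattuple;

    -- dead-tuple percentage of the table itself
    SELECT * FROM pgstattuple('r_agrio_total');

    -- leaf density / fragmentation of the unique index
    SELECT * FROM pgstatindex('u_r_agrio_total');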


 I'm still not really following your presentation but maybe my thoughts
 will spark something.


Thank you! I hope I clarified this some :)


Re: [GENERAL] Help with slow table update

2015-04-14 Thread Pawel Veselov
On Mon, Apr 13, 2015 at 7:37 PM, Jim Nasby jim.na...@bluetreble.com wrote:

 On 4/13/15 7:01 PM, Pawel Veselov wrote:

 Cursors tend to make things slow. Avoid them if you can.


 Is there an alternative to iterating over a number of rows, where a
 direct update query is not an option?

 I really doubt that either the actual processing logic, including use of
 types has anything to do with my problem. This is based on the fact that
 out of the tables that are being changed, only one is exhibiting the
 problem. All of the involved tables have nearly the same structure, and
 have the same logical operations performed on them. I thought maybe the
 bad table is slow because it was first in the list, and Postgres was
 caching the function's results, but I moved things around, and the pattern is
 the same.


 I'm guessing that you're essentially processing a queue. Take a look at
 http://www.postgresql.org/message-id/552c750f.2010...@bluetreble.com for
 some ideas. Basically, not only do cursors have non-trivial overhead, doing
 a ton of single-row queries is going to have a non-trivial overhead itself.


Thank you for the pointers. PgQ sounds interesting, it has to be remote for
RDS (I use RDS), but I'll try implementing a solution based on it.
However, for all the time that is being spent during this update, the
breakdown is:

update total table: 10.773033
update hourly table: 00.179711
update daily table: 01.082467
update some other table (actually, it has cardinality similar to total
table): 00.168287
clean the queue table: 00.021733
overhead: 00.014922

The overhead is time taken to run the whole procedure, minus all these
other times that have been counted.

(some notes about the daily table below)


  As for your specific question, I suggest you modify the plpgsql
 function so that it's doing an EXPLAIN ANALYZE on the slow table.
 EXPLAIN ANALYZE actually returns a recordset the same way a SELECT
 would, with a single column of type text. So you just need to do
 something with that output. The easiest thing would be to replace
 this in your function:

 UPDATE slow_table SET ...

 to this (untested)

 RETURN QUERY EXPLAIN ANALYZE UPDATE slow_table SET ...

 and change the function so it returns SETOF text instead of whatever
 it returns now.
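
As a minimal, untested sketch of that suggestion (the function and parameter
names are made up; format() is used so the parameter values end up inside the
dynamically executed EXPLAIN):

CREATE OR REPLACE FUNCTION debug_update_total(_tagid numeric, _unitid numeric)
RETURNS SETOF text
LANGUAGE plpgsql AS $$
BEGIN
    -- run the slow statement under EXPLAIN ANALYZE and hand the plan text back
    RETURN QUERY EXECUTE format(
        'EXPLAIN ANALYZE UPDATE r_agrio_total
            SET unconfirmed = unconfirmed + 0
          WHERE tagid = %L AND unitid = %L', _tagid, _unitid);
END;
$$;

-- SELECT * FROM debug_update_total(1002, 1002);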


 Thank you, that made it a lot easier to see into what's really going on.
 But the outcome is somewhat the same. The bad table analysis shows a
 very high cost, and thousands of rows, where the table contains only 24
 rows. This time, however, the actual run time is shown, and one can see
 where the time is spent (I was using just a sum of clock_time()s around
 the update statements to see where the problem is).

 r_agrio_hourly - good, r_agrio_total - bad.

   Update on r_agrio_hourly  (cost=0.42..970.32 rows=250 width=329)
 (actual time=2.248..2.248 rows=0 loops=1)
   ->  Index Scan using u_r_agrio_hourly on r_agrio_hourly
   (cost=0.42..970.32 rows=250 width=329) (actual time=0.968..1.207
 rows=1 loops=1)
   Index Cond: ((tagid = 1002::numeric) AND (unitid =
 1002::numeric) AND ((rowdate)::text = '2015-04-09T23'::text) AND
 (device_type = 3::numeric) AND (placement = 2::numeric))
   Total runtime: 2.281 ms
   Update on r_agrio_total  (cost=0.42..45052.56 rows=12068 width=321)
 (actual time=106.766..106.766 rows=0 loops=1)
   ->  Index Scan using u_r_agrio_total on r_agrio_total
   (cost=0.42..45052.56 rows=12068 width=321) (actual time=0.936..32.626
 rows=1 loops=1)
   Index Cond: ((tagid = 1002::numeric) AND (unitid =
 1002::numeric) AND (device_type = 3::numeric) AND (placement =
 2::numeric))
   Total runtime: 106.793 ms


 Keep in mind that the estimated cost is not terribly useful; it's the
 actual times that matter.

 I suspect what's happening here is a combination of things. First, the
 hourly table is basically living in cache, but the total table is not. That
 means that when you go to find a row in the total table you're actually
 hitting the disk instead of pulling the data from memory.



 Second, you may have a lot of dead rows in the total table. I suspect this
 because of the very large amount of time the index scan is taking. Unless
 you're running on an old 10MB MFM drive you'd be pretty hard pressed for
 even 2 IO operations (one for the index leaf page and one for the heap
 page) to take 32ms. I suspect the index scan is having to read many dead
 rows in before it finds a live one, and incurring multiple IOs. Switching to
 EXPLAIN (analyze, buffers) would help confirm that.
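
For reference, that form would look something like this (predicate values
copied from the earlier examples); the output adds per-node
"Buffers: shared hit=... read=..." lines:

    EXPLAIN (ANALYZE, BUFFERS)
    UPDATE r_agrio_total
       SET unconfirmed = unconfirmed + 0
     WHERE tagid = 1002 AND unitid = 1002
       AND device_type = 3 AND placement = 2;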


That looks most likely to me as well. Most of the updates in a single
batch, for the total table would be on the same record, while for hourly
table it's a lot less. Logically, the tables contain identical data, except
that hourly table breaks it down per hour, and total table contains the
data for all times. The daily table contains the same data per day.

So, if I compared the tables, the total table has

Re: [GENERAL] Help with slow table update

2015-04-14 Thread Pawel Veselov
On Tue, Apr 14, 2015 at 3:29 PM, Jim Nasby jim.na...@bluetreble.com wrote:

 On 4/14/15 4:44 PM, Pawel Veselov wrote:

 On Tue, Apr 14, 2015 at 1:15 PM, Jim Nasby jim.na...@bluetreble.com wrote:

 On 4/14/15 1:28 PM, Pawel Veselov wrote:


 I wonder if what I need to do, considering that I update a lot
 of the
 same rows as I process this queue, is to create a temp table,
 update
 the rows there, and then update the actual tables once at the
 end...


 That's what I'd do.


 Well, in short, I changed (repeat the body of loop for how many tables
 are there)

 LOOP (item)
UPDATE table with item
IF not found INSERT item INTO table; END IF;
 END LOOP;

 to:

 CREATE TEMP TABLE xq_table (like table) on commit drop;
 LOOP (item)
LOOP
  UPDATE xq_table with item;
  exit when found;
  INSERT INTO xq_table select * from table for update;
  continue when found;
  INSERT item INTO xq_table;
  exit;
END LOOP;
 END LOOP;
 UPDATE table a set (rows) = (xq.rows)
FROM xq_table xq
WHERE (a.keys) = (xq.keys)

 That works significantly faster. The final update statement is very
 fast. The process is somewhat slow in the beginning as it sucks in
 records from total into xq_total, but once all of that is moved into
 the temp table, it rushes through the rest.


 Databases like to think in sets. It will generally be more efficient to do
 set operations instead of a bunch of row-by-row stuff.

 Since you're pulling all of this from some other table your best bet is
 probably something like:

 CREATE TEMP TABLE raw AS DELETE FROM queue WHERE ... RETURNING *;

 CREATE TEMP VIEW hourly_v AS SELECT ... FROM raw GROUP BY;
 UPDATE ar_hourly SET ... FROM hourly_v JOIN ...;
 INSERT INTO ar_hourly SELECT FROM hourly_v LEFT JOIN ar_hourly ON ...;

 -- Same thing for daily
 -- Same thing for total


In my previous post, there was a problem with that pseudo-code, as it's
missing inserts into the final table at the end of loop, for those records
that need to be inserted and not updated.

This is where using sets becomes really tedious, as Postgres severely lacks
an upsert-like statement.
I don't think there are joins allowed in an UPDATE statement, so I will need
to use a WITH query, right?
Also, I'm not sure how LEFT JOIN will help me isolate and insert missed
entries...

Would it be OK to replace upsert part with merging into a temp table, then
deleting and inserting from temp table? Is there any penalty for
insert/delete compared to update?

[skipped]


  But remember that if you update or delete a row, removing it from an
 index, the data will stay in that index until vacuum comes along.

 Also, there's no point in doing a REINDEX after a VACUUM FULL;
 vacuum full rebuilds all the indexes for you.


 I was being desperate :)

 I still think there is something very wrong with this particular table.
 First, I have production systems that employ this function on a way larger
 data set, and there is no problem (so far, but still). This machine is
 part of a test deployment, there is no constant load, the only data that
 is being written now is when I do these tests. Vacuuming should prune
 all that dead stuff, and if it's absent, it's unclear where the time is
 spent navigating/updating the table with 24 rows :)


 I think you definitely have a problem with dead rows, as evidenced by the
 huge improvement VACUUM FULL made.


But it's not clear why (and not reasonable, IMHO, that) it wouldn't improve
past current point.


Re: [GENERAL] Help with slow table update

2015-04-14 Thread Pawel Veselov
On Tue, Apr 14, 2015 at 1:15 PM, Jim Nasby jim.na...@bluetreble.com wrote:

 On 4/14/15 1:28 PM, Pawel Veselov wrote:


 I wonder if what I need to do, considering that I update a lot of the
 same rows as I process this queue, is to create a temp table, update
 the rows there, and then update the actual tables once at the end...


 That's what I'd do.


Well, in short, I changed (repeat the body of loop for how many tables are
there)

LOOP (item)
  UPDATE table with item
  IF not found INSERT item INTO table; END IF;
END LOOP;

to:

CREATE TEMP TABLE xq_table (like table) on commit drop;
LOOP (item)
  LOOP
UPDATE xq_table with item;
exit when found;
INSERT INTO xq_table select * from table for update;
continue when found;
INSERT item INTO xq_table;
exit;
  END LOOP;
END LOOP;
UPDATE table a set (rows) = (xq.rows)
  FROM xq_table xq
  WHERE (a.keys) = (xq.keys)

That works significantly faster. The final update statement is very fast.
The process is somewhat slow in the beginning as it sucks in records from
total into xq_total, but once all of that is moved into the temp table,
it rushes through the rest.
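
Spelled out a little more concretely (purely illustrative names: one key
column k, one value column v, real table agr, item variables _item_k/_item_v),
the body of that loop and the final write-back could look like:

    CREATE TEMP TABLE xq_tmp (LIKE agr) ON COMMIT DROP;

    -- inside the per-item loop of the plpgsql function:
    LOOP
        UPDATE xq_tmp SET v = v + _item_v WHERE k = _item_k;
        EXIT WHEN FOUND;

        -- first time this key is seen: pull (and lock) the current row
        INSERT INTO xq_tmp
            SELECT * FROM agr WHERE k = _item_k FOR UPDATE;
        CONTINUE WHEN FOUND;

        -- the key does not exist in the real table yet: seed it from the item
        INSERT INTO xq_tmp (k, v) VALUES (_item_k, _item_v);
        EXIT;
    END LOOP;

    -- once, after all items have been folded into the temp table:
    UPDATE agr a
       SET v = t.v
      FROM xq_tmp t
     WHERE a.k = t.k;
    -- (as noted later in the thread, keys that are brand new still need a
    --  final INSERT INTO agr ... SELECT FROM xq_tmp for rows not matched above)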


 The other option would be to use a constraint trigger paired with a
 per-row trigger on the hourly table to drive the daily table, and on the
 daily table to drive the total table. The way that would work is the
 per-row table would simply keep track of all the unique records that were
 changed in a statement (presumably by putting them in a temp table). Once
 the statement is done, the constraint trigger would fire; it would
 summarize all the changed data and do a much smaller number of updates to
 the table being summarized into.


I'm not sure how I would be able to avoid the same number of changes on the
total table; the trigger would fire on each update, wouldn't it? So, same problem
with a lot of changes on a table...


 BTW, you also made a comment about not having to hit the table if you look
 at something in an index. You can only do that if all the data you need is
 in the index, AND the page with the record is marked as being all-visible
 (google for Postgres Visibility Map). If that's not the case then you still
 have to pull the row in the table in, in order to determine visibility. The
 only case where you can still avoid hitting the table is something like a
 NOT EXISTS; if you can't find any entries in the index for something then
 they definitely won't be in the table.


What I was saying is that if a table has a unique index, and there is
a cached fact that a particular index value points to a particular row, there
shouldn't be a need to re-scan the index again to search for any more
matching values (which would be necessary if the index was not unique).
Again, all considering the size of the index, the amount of different index
values that are being queried, etc.


 But remember that if you update or delete a row, removing it from an
 index, the data will stay in that index until vacuum comes along.

 Also, there's no point in doing a REINDEX after a VACUUM FULL; vacuum full
 rebuilds all the indexes for you.


I was being desperate :)

I still think there is something very wrong with this particular table.
First, I have production systems that employ this function on a way larger
data set, and there is no problem (so far, but still). This machine is part
of a test deployment, there is no constant load, the only data that is
being written now is when I do these tests. Vacuuming should prune all that
dead stuff, and if it's absent, it's unclear where the time is spent
navigating/updating the table with 24 rows :)


[GENERAL] bigserial continuity safety

2015-04-13 Thread Pawel Veselov
Hi.

If I have a table created as:

CREATE TABLE xq_agr (
  idBIGSERIAL PRIMARY KEY,
  node  text not null
);

which multiple applications insert into. The applications never
explicitly specify the value for 'id'.
Is it safe to, on a single connection, do:

- open transaction (default transaction isolation)
- Open cursor for select * from xq_agr order by id asc
- do something with current record
- advance the cursor (and repeat something), but stop at some point (id =
LAST_ID), and
- delete from xq_agr where id <= LAST_ID;
- commit

"safe to" means - whether the cursor will not miss any records that were
deleted at the end.

I'm suspecting that depending on the commit order, I may have situations
when:
- TX1 insert ID 1
- TX2 insert ID 2
- TX2 commits
- TX3 scans 2
- TX1 commits
- TX3 deletes <= 2
- record ID1 is deleted, but never processed.

Is that sequence of events as listed above possible? If yes, is there a
transaction isolation level I can use to avoid that?

Table and sequence definition, as present in the DB:

db= \d+ xq_agr_id_seq
 Sequence public.xq_agr_id_seq
Column |  Type   |Value| Storage
---+-+-+-
 sequence_name | name| xq_agr_id_seq   | plain
 last_value| bigint  | 139898829   | plain
 start_value   | bigint  | 1   | plain
 increment_by  | bigint  | 1   | plain
 max_value | bigint  | 9223372036854775807 | plain
 min_value | bigint  | 1   | plain
 cache_value   | bigint  | 1   | plain
 log_cnt   | bigint  | 27  | plain
 is_cycled | boolean | f   | plain
 is_called | boolean | t   | plain

db= \d xq_agr
   Table public.xq_agr
  Column   |  Type   |  Modifiers

---+-+-
 id| bigint  | not null default
nextval('xq_agr_id_seq'::regclass)
 node  | text| not null
Indexes:
xq_agr_pkey PRIMARY KEY, btree (id)


Re: [GENERAL] Help with slow table update

2015-04-13 Thread Pawel Veselov
On Sun, Apr 12, 2015 at 5:40 PM, Jim Nasby jim.na...@bluetreble.com wrote:

 On 4/9/15 6:18 PM, Pawel Veselov wrote:

 Hi.

 I have a plpgsql procedure that updates a few similar tables.
 for some reason, updates on one of the tables take a lot longer the
 updates on the other ones. The difference is, say, 7 seconds vs. 80
 milliseconds.

 the procedure uses cursors and record variables to do the updates. For
 example:

  update r_agrio_total set
    unserved = unserved + (agrow->>'unserved')::numeric(38),
    r_brkconn = mush_brk_conn(r_brkconn, q_item.r_brkconn),
  where
    tagid = _tagid and
    unitid = (akey->>'unitid')::numeric and
    placement = (akey->>'placement')::numeric and
    device_type = (akey->>'device_type')::numeric;

 There is another table (xq_agr) that is read record by record, and for
 each of those records, such update is executed.

 I was trying to select analyze the updates to see where the time could
 be spent.
 There are only 24 rows in the bad table, and 3,400 rows in the good
 table. So, for the bad table, most of the updates will be on the same
 rows. The times were measured on processing 100 original records.

 When I'm analyzing pure update statements, I don't see anything strange.

 bad table: explain analyze update r_agrio_total set unconfirmed =
 unconfirmed +0 where tagid = 1000 and unitid = 1000 and placement = 0
 and device_type = 100;

 RESULT:
   Update on r_agrio_total  (cost=0.42..4.46 rows=1 width=321) (actual
 time=0.253..0.253 rows=0 loops=1)
  ->  Index Scan using tag_r_agrio_total on r_agrio_total
   (cost=0.42..4.46 rows=1 width=321) (actual time=0.037..0.041 rows=1
 loops=1)
   Index Cond: (tagid = 1000::numeric)
   Filter: ((unitid = 1000::numeric) AND (placement = 0::numeric)
 AND (device_type = 100::numeric))
   Rows Removed by Filter: 7
   Total runtime: 0.282 ms

 good table: explain analyze update r_agrio_hourly set unconfirmed =
 unconfirmed +0 where tagid = 1000 and unitid = 1000 and placement = 0
 and device_type = 100 and rowdate = '2015-02-23T13';

 RESULT:
   Update on r_agrio_hourly  (cost=0.42..17.36 rows=6 width=329) (actual
 time=0.102..0.102 rows=0 loops=1)
  ->  Index Scan using u_r_agrio_hourly on r_agrio_hourly
   (cost=0.42..17.36 rows=6 width=329) (actual time=0.047..0.048 rows=1
 loops=1)
   Index Cond: ((tagid = 1000::numeric) AND (unitid =
 1000::numeric) AND ((rowdate)::text = '2015-02-23T13'::text) AND
 (device_type = 100::numeric) AND (placement = 0::numeric))
   Total runtime: 0.135 ms

 When I try doing it with WITH statement (really, to apply the actual
 data that the plpgsql function uses), there is something strange in the
 bad table.

 explain analyze
 with SRC as (select * from xq_agr where id = 914830)
  update r_agrio_total set
    unconfirmed = unconfirmed + (SRC.r_agrio->>'unconfirmed')::numeric(38)
  from SRC
  where
    tagid = (SRC.r_agrio->'key'->>'tagid')::numeric and
    unitid = (SRC.r_agrio->'key'->>'unit')::numeric and
    placement = (SRC.r_agrio->'key'->>'placement')::numeric and
    device_type = (SRC.r_agrio->'key'->>'device_type')::numeric;

 RESULT:
   Update on r_agrio_total  (cost=8.91..32777.51 rows=19331 width=409)
 (actual time=0.107..0.107 rows=0 loops=1)
 CTE src
        ->  Index Scan using xq_agr_pkey on xq_agr  (cost=0.42..8.44
 rows=1 width=379) (actual time=0.026..0.027 rows=1 loops=1)
              Index Cond: (id = 914830)
  ->  Nested Loop  (cost=0.46..32769.07 rows=19331 width=409) (actual
 time=0.107..0.107 rows=0 loops=1)
        ->  CTE Scan on src  (cost=0.00..0.02 rows=1 width=88) (actual
 time=0.032..0.033 rows=1 loops=1)
        ->  Index Scan using u_r_agrio_total on r_agrio_total
   (*cost=0.46..32285.78 rows=19331* width=321) (actual time=0.001..0.001

 rows=0 loops=1)
              Index Cond: ((tagid = (((src.r_agrio -> 'key'::text) ->>
 'tagid'::text))::numeric) AND (unitid = (((src.r_agrio -> 'key'::text)
 ->> 'unit'::text))::numeric) AND (device_type = (((src.r_agrio ->
 'key'::text) ->> 'device_type'::text))::numeric) AND (placement =
 (((src.r_agrio -> 'key'::text) ->> 'placement'::text))::numeric))
   Total runtime: 0.155 ms

 explain analyze
 with SRC as (select * from xq_agr where id = 914830)
  update r_agrio_hourly set
    unconfirmed = unconfirmed + (SRC.r_agrio->>'unconfirmed')::numeric(38)
  from SRC
  where
    tagid = (SRC.r_agrio->'key'->>'tagid')::numeric and
    unitid = (SRC.r_agrio->'key'->>'unit')::numeric and
    placement = (SRC.r_agrio->'key'->>'placement')::numeric and
    device_type = (SRC.r_agrio->'key'->>'device_type')::numeric and
    rowdate = (SRC.r_agrio->'key'->>'rowdate');

 RESULT:
   Update on r_agrio_hourly  (cost=8.91..52.91 rows=20 width=417) (actual
 time=0.123..0.123 rows=0 loops=1)
 CTE src
    ->  Index Scan using xq_agr_pkey on xq_agr

Re: [GENERAL] Improving performance of merging data between tables

2015-02-10 Thread Pawel Veselov
Sorry, it took me a while to respond, but I re-factored all of this process
according to the suggestions.

On Wed, Jan 7, 2015 at 7:49 PM, Maxim Boguk maxim.bo...@gmail.com wrote:

 On Wed, Jan 7, 2015 at 8:49 PM, Pawel Veselov pawel.vese...@gmail.com
 wrote:

 PS: your setup look pretty complicated and hard to analyze without seeing
 all involved table structures, transaction/query flow, and (especially)
 involved procedures source code.


 Sure :) At this point, I've put together the bulk merge code as well. I
 can't quite see much of a difference, actually, but it's hard to trust the
 execution times, as on the same amount of data they vary from, say, 0.5s to
 2s, and the sample data is not stepping on any other locks. In general, I'm
 afraid all those left joins and multiple scans, even over small amount of
 data, is nullifying any positive effect.

 Now some ideas to check.
 The high CPU usage usually isn't related to locking, but related to seq
 scan or wrong plans or simple inefficient pl/pgsql code, locked processes
 usually doesn't use too much cpu.

 1)on the test database perform  select pg_stat_reset(); then perform full
 round of merges, then check
 select * from pg_stat_user_tables where seq_scan > 0 order by seq_tup_read;
 and if you find a lot of seq_scan and seq_tuple_reads on the particular
 table try find where they coming from (it could be reason for high CPU
 usage).

 2)enable track_functions in postgresql.conf and perform the same sequence
 (select pg_stat_reset() + full round of merges
 ) then check
 select * FROM pg_stat_user_functions order by self_time desc;
 and check which function using the most time.


These are good pointers, if the new process is having the same sort of
problems, this will come in handy on figuring out where they are coming
from, thank you.


 3)old/lost prepared transactions can have deadly effect on the database
 performance as a whole. So check select * from pg_prepared_xacts; and verify
 that you don't have hours (or weeks) old prepared xacts lying around.


If there are lost prepared transactions, they will lock up a particular
instance from being able to write into its table data, so it will just
stall the node. But it does happen, and we have an application mechanism to
find and delete those.


 PS: btw I still don't fully understood relation between the:
 
 - merges data into its own node tables (using merge_xxx PL/pgSQL functions)
 
 and provided code for the public.merge_all02-9A-46-8B-C1-DD and
 PUBLIC.merge_agrio.
 As I see
 public.merge_all02-9A-46-8B-C1-DD calling PUBLIC.merge_agrio, and the
 PUBLIC.merge_agrio updates a global table R_AGRIO (but not the own node
 table).


It's a bit irrelevant at this point, but: merge_all02-9A-46-8B-C1-DD()
function will take all data for 02-9A-46-8B-C1-DD node and move it into the
master table. There is an analogous merge_02-9A-46-8B-C1-DD() function that
takes data from application, and writes it into the tables for
02-9A-46-8B-C1-DD node. The process of moving data node tables->main tables
and application->node tables is nearly identical, hence I only provided the
body once. The big difference is, when merging into master, there is a lot
more data to look through, as node tables only contain data that has not
yet been merged into the master.


 I think the best implementation of such task is asynchronous processing of
 this changes via background process. An application only inserts events
 into queue table (it lockless process), and some background process read
 these data from queue table and merge it into main table (again lockless
 because it single thread so no concurrent writes), and then delete the
 merged data from queue table.


Well, that was a really good suggestion, thank you. Some weeks later I've
put it together. This hasn't hit production yet, so I'm yet to see the
overall improvement effect. Along with turning it into a queue, I've added
provisions to try to combine as much data as possible before writing it out
into the database tables, and merged all of the satellite tables with the
main data.

Before, I had:
r_agrio
  r_brk_xxx (multiple entries reference rows in r_agrio)

Now, I have:
r_agrio_daily
r_agrio_total
r_agrio_hourly

All the data that was in the r_brk_xxx tables is now in columns of the
r_agrio* tables. To get around the fact that there are potentially multiple
BRK records for each AGR record, the data is now stored as JSON object. The
primary key used for the BRK tables is turned into a string that serves as
a key in a top level JSON object. This should help me tremendously on the
side that needs to read that data, as I had to join or left join the BRK
tables.

Splitting this into 3 tables may come back and bite me in the back, since it's
two more inserts and corresponding look ups, but it seriously helps me on
the reading side of things.

The code that aggregates the JSON data is still done in PL/PGSQL, which is
probably a bad idea, considering that PL doesn't have good ways of
manipulating

Re: [GENERAL] Casting hstore to json

2015-01-16 Thread Pawel Veselov
Adrian,

On Thu, Jan 15, 2015 at 9:09 PM, Adrian Klaver adrian.kla...@aklaver.com
wrote:

 On 01/15/2015 07:59 PM, Pawel Veselov wrote:

 Hi.

 I'm trying to cast hstore to json, but I don't seem to be getting a json
 object from hstore key/value pairs.


[skipped]


 db=> SELECT to_json('a=>1, b=>2'::hstore);
            to_json
 ------------------------------
  "\"a\"=>\"1\", \"b\"=>\"2\""


 Works here:

 business_app= select version();
version
 
 -
  PostgreSQL 9.3.5 on i686-pc-linux-gnu, compiled by gcc (SUSE Linux) 4.8.1
 20130909 [gcc-4_8-branch revision 202388], 32-bit
 (1 row)


 business_app=> SELECT to_json('a=>1, b=>2'::hstore);
        to_json
 ----------------------
  {"a": "1", "b": "2"}
 (1 row)


 Did you upgrade from a previous version of Postgres?


Yes! From 9.1.14. 'alter extension hstore update' did the trick, thank you
:)

[skipped]


[GENERAL] Casting hstore to json

2015-01-15 Thread Pawel Veselov
Hi.

I'm trying to cast hstore to json, but I don't seem to be getting a json
object from hstore key/value pairs.

9.3 documentation says:
*Note:* The hstore extension (http://www.postgresql.org/docs/9.3/static/hstore.html)
has a cast from hstore to json, so that converted hstore values are
represented as JSON objects, not as string values.
to_json(anyelement) - If the data type is not built in, and there is a cast
from the type to json, the cast function will be used to perform the
conversion.

On 9.3.5 server:
db=> SELECT 'a=>1, b=>2'::hstore;
       hstore
--------------------
 "a"=>"1", "b"=>"2"
(1 row)

db=> SELECT to_json('a=>1, b=>2'::hstore);
           to_json
------------------------------
 "\"a\"=>\"1\", \"b\"=>\"2\""
(1 row)
db=> SELECT cast(('a=>1, b=>2'::hstore) as json);
ERROR:  cannot cast type hstore to json
LINE 1: SELECT cast(('a=>1, b=>2'::hstore) as json);

Thanks!
  Pawel.


Re: [GENERAL] Improving performance of merging data between tables

2015-01-07 Thread Pawel Veselov
On Tue, Dec 30, 2014 at 7:25 PM, Maxim Boguk maxim.bo...@gmail.com wrote:

 On Wed, Dec 31, 2014 at 11:10 AM, Pawel Veselov pawel.vese...@gmail.com
 wrote


 [skipped]

 2) try pg_stat_statements, setting pg_stat_statements.track = all.  see:
 http://www.postgresql.org/docs/9.4/static/pgstatstatements.html

 I have used this to profile some functions, and it worked pretty well.
 Mostly I use it on a test box, but once ran it on the live, which was
 scary, but worked great.


 That looks promising. Turned it on, waiting for when I can turn the
 server at the next quiet time.


 I have to say this turned out into a bit of a disappointment for this use
 case. It only measures total time spent in a call. So, it sends up
 operations that waited a lot on some lock. It's good, but it would be great
 if total_time was provided along with wait_time (and io_time may be as
 well, since I also see operations that just naturally have to fetch a lot
 of data)


 1) pg_stat_statements provides information about the io_time of each
 statement, but you should have track_io_timing enabled for that.


Enabled that now. Still the top winners are the functions that probably
lock for a long (relatively) time. This did help me find some crap that
either was missing an index, or used an unreasonable join, and just needed
re-writing. One entry that doesn't make sense to me is:

total_time - io_time = 1,366,773
calls = 666,542
query = SELECT * FROM q_SCHEDULER_STATE WHERE SCHED_NAME = ?
The table only has 18 rows, there is an index, but the analyzer chooses to
ignore it, which is right since sched_name column has the same value for
all rows. So all rows are returned in SELECT. The time to run that query
under database load varies from 0.09 to 70ms.
This is a distraction from the main topic, though, but does stand out odd.


 2) About locking I suggest enable log_lock_waits and set deadlock_timeout
 to say 100ms (just for testing purposes), and then any lock waiting more
 than 100ms will be logged with some useful additional info.
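
For reference, the corresponding postgresql.conf settings would simply be:

    log_lock_waits   = on
    deadlock_timeout = 100ms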



 PPS: btw, please check the database logs for deadlock messages; your
 setup around the "and then call a pgsql function to merge the data from its
 tables into the common tables" part could be easily deadlock prone.


I don't think I have an abnormal problem with locking. I wanted to eliminate
locking time out of the pg_stat_statements data, to address queries that aren't
waiting on disk and/or locks first, as my problem is high CPU, not specific
query performance. I don't have deadlocks for sure -- I had them before,
and I would normally get an error if there was a deadlock. We process all
the records in exactly the same order of keys to avoid deadlocks.

PPPS: and the last suggestion: after you finish with the "write all the
 data into its own tables" part, the application should perform an ANALYZE of
 these own tables (or you could have weird/inefficient plans during the last stage).


Any references to back this up? I don't particularly mind doing it, but I
wonder if the analysis can be more expensive than the processing. These tables get a
few hundred records inserted/updated, then are entirely processed (with
expected full scans), and then deleted...


 PS: your setup look pretty complicated and hard to analyze without seeing
 all involved table structures, transaction/query flow, and (especially)
 involved procedures source code.


Sure :) At this point, I've put together the bulk merge code as well. I
can't quite see much of a difference, actually, but it's hard to trust the
execution times, as on the same amount of data they vary from, say, 0.5s to
2s, and the sample data is not stepping on any other locks. In general, I'm
afraid all those left joins and multiple scans, even over small amount of
data, is nullifying any positive effect.

primary table: http://pastebin.com/gE2TjZd3
secondary table(s): http://pastebin.com/aDVakUkp
There are actually 10 secondary tables, but they are more or less of the
same structure.
The node tables have identical structure to the main tables.

First stage, which I don't particularly question, but may be wrongfully so,
the application does:
- create data suitable for the rows in the primary/secondary tables
- starts transaction
- merges data into its own node tables (using merge_xxx PL/pgSQL functions)
(100 rows in primary table)
- prepares transactions
- deals with other data sources
- commits/rolls back prepared transaction depending on success of the
previous step.

An example of a merge_xxx function: http://pastebin.com/6YYm8BVM

Second stage is really:
- start transaction
- call PL/pgSQL merge_all()
- commit

2 reasons for the 2 stages:
- if stage#2 fails, the data will be merged during the next iteration
- the lock time on the shared tables is minimized

It's possible that an external process may take over writing data for
certain key subset (combination of (tagid,blockid)), to make sure there is
no race condition with such process, such key pairs are locked, that's
what the whole

[GENERAL] min/max performance inequality.

2015-01-07 Thread Pawel Veselov
Hi.

I was wondering how come there is such a drastic difference between finding
max and min. Seems like index scan backwards is really bad... The table
is freshly re-indexed just in case. I added a count(*) in there, forcing
the seq scan, and it's even better than the backwards index scan...

db= EXPLAIN ANALYZE select min(rowdate) from r_agrio where blockid = 4814;
   QUERY
PLAN
-
 Result  (cost=495.89..495.90 rows=1 width=0) (actual time=24.149..24.150
rows=1 loops=1)
   InitPlan 1 (returns $0)
 ->  Limit  (cost=0.00..495.89 rows=1 width=13) (actual
time=24.139..24.140 rows=1 loops=1)
       ->  Index Scan using rowdate_r_agrio on r_agrio
 (cost=0.00..222160.24 rows=448 width=13) (actual time=24.137..24.137
rows=1 loops=1)
 Index Cond: ((rowdate)::text IS NOT NULL)
 Filter: (blockid = 4814::numeric)
 Total runtime: 24.186 ms
(7 rows)

db= EXPLAIN ANALYZE select max(rowdate) from r_agrio where blockid = 4814;

 QUERY PLAN


 Result  (cost=495.89..495.90 rows=1 width=0) (actual time=926.032..926.033
rows=1 loops=1)
   InitPlan 1 (returns $0)
 ->  Limit  (cost=0.00..495.89 rows=1 width=13) (actual
time=926.019..926.021 rows=1 loops=1)
       ->  Index Scan Backward using rowdate_r_agrio on r_agrio
 (cost=0.00..222160.24 rows=448 width=13) (actual time=926.017..926.017
rows=1 loops=1)
 Index Cond: ((rowdate)::text IS NOT NULL)
 Filter: (blockid = 4814::numeric)
 Total runtime: 926.070 ms
(7 rows)

db= EXPLAIN ANALYZE select count(*), max(rowdate) from r_agrio where
blockid = 4814;
   QUERY PLAN

-
 Aggregate  (cost=31585.18..31585.19 rows=1 width=13) (actual
time=461.079..461.080 rows=1 loops=1)
   ->  Seq Scan on r_agrio  (cost=0.00..31582.94 rows=448 width=13) (actual
time=8.912..460.999 rows=15 loops=1)
 Filter: (blockid = 4814::numeric)
 Total runtime: 461.134 ms
(4 rows)

db= \d r_agrio
  Table public.r_agrio
   Column| Type  | Modifiers
-+---+
 id  | numeric(38,0) | not null
 tagid   | numeric(38,0) | not null
 blockid | numeric(38,0) | not null
 rowdate | character varying(15) | not null
 count   | numeric(38,0) | not null default 0
 events  | numeric(38,0) | not null default 0
 devents | numeric(38,0) | not null default 0
 duration| numeric(38,0) | not null default 0
 device_type | numeric(38,0) | not null
 placement   | numeric(38,0) | not null default 0
 unserved| numeric(38,0) | not null default 0
 unconfirmed | numeric(38,0) | not null default 0
 version | numeric(38,0) | not null default 1
Indexes:
pk_r_agrio PRIMARY KEY, btree (id)
u_r_agrio UNIQUE, btree (tagid, blockid, rowdate, device_type,
placement)
rowdate_r_agrio btree (rowdate)


Re: [GENERAL] min/max performance inequality.

2015-01-07 Thread Pawel Veselov
Thanks Jeff (and Tom)

On Wed, Jan 7, 2015 at 3:34 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Wed, Jan 7, 2015 at 3:00 PM, Pawel Veselov pawel.vese...@gmail.com
 wrote:

 Hi.

 I was wondering how come there is such a drastic difference between
 finding max and min. Seems like index scan backwards is really bad... The
 table is freshly re-indexed just in case. I added a count(*) in there,
 forcing the seq scan, and it's even better than the backwards index scan...

 db= EXPLAIN ANALYZE select min(rowdate) from r_agrio where blockid =
 4814;


 It crawls the data in rowdate order (either forward or reverse) until it
 finds the first 4814.  Crawling forward it finds 4814 very early. Crawling
 backwards it has to pass through a bunch of non-4814 before it finds the
 first 4814.

 This fact doesn't show up in your EXPLAIN ANALYZE, but if you used a more
 modern version of postgresql (9.2 or above) there would be another line for
 "Rows Removed by Filter:", which would tell the story of what is going on.


Yeah, there are 10x more rows removed when going backwards



 If you have a composite index on (blockid, rowdate), it would help make
 this much faster, as it can go directly to the desired row.


That does help a lot. So, when does postgres use a multi-column index,
even if not all of its columns are engaged (as there is an index that involves
those 2 fields, and more)? I definitely see it do that in some cases...

Even with that index, however, there is still a good difference in time
(the interest is theoretical at this point, as I found a better way to
extract that data anyway).

On a newer db.

db= EXPLAIN analyze select min(rowdate) from r_agrio where blockid = 4814;
   QUERY
PLAN
-
 Result  (cost=521.54..521.55 rows=1 width=0) (actual time=39.770..39.770
rows=1 loops=1)
   InitPlan 1 (returns $0)
 ->  Limit  (cost=0.55..521.54 rows=1 width=13) (actual
time=39.765..39.766 rows=1 loops=1)
       ->  Index Scan using rowdate_r_agrio on r_agrio
 (cost=0.55..303738.47 rows=583 width=13) (actual time=39.763..39.763
rows=1 loops=1)
 Index Cond: ((rowdate)::text IS NOT NULL)
 Filter: (blockid = 4814::numeric)
 Rows Removed by Filter: 37246
 Total runtime: 39.798 ms
(8 rows)

db= EXPLAIN analyze select max(rowdate) from r_agrio where blockid = 4814;

QUERY PLAN

--
 Result  (cost=521.54..521.55 rows=1 width=0) (actual
time=1497.377..1497.378 rows=1 loops=1)
   InitPlan 1 (returns $0)
 ->  Limit  (cost=0.55..521.54 rows=1 width=13) (actual
time=1497.371..1497.372 rows=1 loops=1)
       ->  Index Scan Backward using rowdate_r_agrio on r_agrio
 (cost=0.55..303738.47 rows=583 width=13) (actual time=1497.370..1497.370
rows=1 loops=1)
 Index Cond: ((rowdate)::text IS NOT NULL)
 Filter: (blockid = 4814::numeric)
 Rows Removed by Filter: 317739
 Total runtime: 1497.407 ms
(8 rows)
db= CREATE INDEX concurrently xxx on r_agrio(rowdate,blockid);
CREATE INDEX

db= EXPLAIN analyze select min(rowdate) from r_agrio where blockid = 4814;
   QUERY PLAN

-
 Result  (cost=85.05..85.06 rows=1 width=0) (actual time=17.585..17.585
rows=1 loops=1)
   InitPlan 1 (returns $0)
 ->  Limit  (cost=0.43..85.05 rows=1 width=13) (actual
time=17.580..17.581 rows=1 loops=1)
       ->  Index Only Scan using xxx on r_agrio  (cost=0.43..37827.09
rows=447 width=13) (actual time=17.578..17.578 rows=1 loops=1)
 Index Cond: ((rowdate IS NOT NULL) AND (blockid =
4814::numeric))
 Heap Fetches: 0
 Total runtime: 17.616 ms
(7 rows)

db= EXPLAIN analyze select max(rowdate) from r_agrio where blockid = 4814;
QUERY
PLAN
--
 Result  (cost=85.04..85.05 rows=1 width=0) (actual time=89.141..89.142
rows=1 loops=1)
   InitPlan 1 (returns $0)
 ->  Limit  (cost=0.43..85.04 rows=1 width=13) (actual
time=89.135..89.136 rows=1 loops=1)
       ->  Index Only Scan Backward using xxx on r_agrio
 (cost=0.43..37823.09 rows=447 width=13) (actual time=89.134..89.134 rows=1
loops=1)
 Index Cond: ((rowdate IS NOT NULL) AND (blockid =
4814::numeric))
 Heap Fetches: 1
 Total runtime: 89.173 ms
(7 rows)





 Cheers,




Re: [GENERAL] Improving performance of merging data between tables

2014-12-30 Thread Pawel Veselov
On Mon, Dec 29, 2014 at 9:29 PM, Pawel Veselov pawel.vese...@gmail.com
wrote:

[skipped]


 1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
 function? All I see is that the calls to merge_all() function take long
 time, and the CPU is high while this is going on.


[skipped]

2) try pg_stat_statements, setting pg_stat_statements.track = all.  see:
 http://www.postgresql.org/docs/9.4/static/pgstatstatements.html

 I have used this to profile some functions, and it worked pretty well.
 Mostly I use it on a test box, but once ran it on the live, which was
 scary, but worked great.


 That looks promising. Turned it on, waiting for when I can turn the server
 at the next quiet time.


I have to say this turned out into a bit of a disappointment for this use
case. It only measures total time spent in a call. So, it sends up
operations that waited a lot on some lock. It's good, but it would be great
if total_time was provided along with wait_time (and io_time may be as
well, since I also see operations that just naturally have to fetch a lot
of data)

[skipped]


Re: [GENERAL] Improving performance of merging data between tables

2014-12-29 Thread Pawel Veselov
Andy,

thanks for looking into this.

On Mon, Dec 29, 2014 at 9:00 AM, Andy Colson a...@squeakycode.net wrote:

 On 12/28/2014 3:49 PM, Pawel Veselov wrote:

 Hi.

 I was wondering if anybody would have any ideas on how to improve
 certain operations that we are having.

  SNIP


 Besides can somebody please look at this and let me know if I'm doing
 something utterly stupid, here are my questions.

 1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
 function? All I see is that the calls to merge_all() function take long
 time, and the CPU is high while this is going on.



 First, I'll admit I didn't read your entire post.

 I can think of a couple methods:

 1) try each of the statements in merge_all by hand with an explain
 analyze in front to see which is slow.  Look for things that hit big
 tables without an index.  Check that fk lookups are indexed.


If I didn't miss anything, that seems to be OK, even on function-based
queries.


 2) try pg_stat_statements, setting pg_stat_statements.track = all.  see:
 http://www.postgresql.org/docs/9.4/static/pgstatstatements.html

 I have used this to profile some functions, and it worked pretty well.
 Mostly I use it on a test box, but once ran it on the live, which was
 scary, but worked great.


That looks promising. Turned it on, waiting for when I can turn the server
at the next quiet time.


 3) try auto-explain:
 http://www.postgresql.org/docs/9.4/static/auto-explain.html

 I've never used it, so don't know if it'll show each statement inside a
 function.  Dumps stuff to the log AFAIK, so you'll have to dig out the info
 by hand.

  2) Is there a better way to merge individual rows, except doing
  UPDATE/INSERT in a loop, and would that be CPU expensive?
 

 Not that I know of.  I use pretty much the same thing.  Soon!  we will
 have merge/upsert support.  Hopefully it'll be fast.


Well, anytime I cancelled the PID that was executing this whole mess, it
would always stop at UPDATE ... SET ... WHERE on the main table. Which does
make me believe that bulk update would really help.


  3) Is there a better way to merge whole tables? However, note that I
  need to translate primary keys from node main table into the common main
  table, as they are used as foreign keys, hence the loops. I suspect the
  looping is CPU intensive.

 Avoiding loops and doing things as sets is the best way.  If possible. The
 only loop I saw was looping over the merge_xxx tables, which is probably
 the only way.


There is an endless loop that is just a device for merging, but then there
are loops going over each record in all the tables that are being merged,
feeding them into the function that actually does the merge. That table
iteration is what I want to eliminate (especially if I knew it would help
:) )


 If possible (if you haven't already) you could add and extra column to
 your secondary table that you can set as the main table's key.

 bulk insert into second;
 update second set magic = (select key from main where ... );

 Then, maybe, you can do two ops in batch:

 update main (where key exists in main)
 insert into main (where key not exists in main)


I was thinking along the same lines. I can't really do bulk insert, at any
point, because any key can be inserted by another process at any time, and
with a good probability. However, there will be a lot less inserts than
updates. So, in general, I'm making it do this:

with pivot as ( select main_table.id as main_id, node_table.id as node_id
from node_table left join main_table using (key fields) )
update node_table set translate_id = pivot.main_id from pivot
where node_table.id = pivot.node_id;

(missing is cursor as select from node_table where main_id is null)

for row in missing loop
  -- merge_function will return PK of either the updated, or inserted
record.
  -- use (0) data values, so that it's an identity update if the merge results
  -- in an update, or empty data if not.
  select merge_function(missing.key_fields, 0) into use_id;
  update node_table set translate_id = use_id where current of missing;
end loop

At this point, I have a guarantee that I can update all records, and there
is nothing to insert.
So,

with new as ( select * from node_table )
update main_table old
set val = f(old.val, new.val)
from new
where new.translate_id = old.id

So, I don't need full key matching anymore, I can use PKs instead.


[GENERAL] Improving performance of merging data between tables

2014-12-28 Thread Pawel Veselov
Hi.

I was wondering if anybody would have any ideas on how to improve certain
operations that we are having.

PostgreSQL 9.3.3. Table layout:

main_table: PK, N key columns, M data columns. The key columns are bound in
a unique key together. PK is pure sequence number. There are few separate
non-unique indices on some of the key columns, to aid in querying the table
data.

second_table: PK, main_table_PK_REF (declared as foreign key), Na key
columns, Ma data columns. There are 0-inf (typically 0-10) entries in
second_table that reference a single entry in main_table. PK is pure
sequence number, and unique key for that table is (main_table_PK_REF + key
columns). The only non-obvious thing here is that some of the unique
indexes involve coalesce() function:

uq_brkioevent UNIQUE, btree (mainid, blockid, name, subid,
(COALESCE(cname, ''::character varying)))

Those are there because we need to write the NULL into the tables, but
unique indexes don't like NULL values (AFAIR).

There is one main_table, and 10 second_tables.

The amount of rows in main table right now is ~1.1M, second tables have
about 1M-2M of rows. The growth of the main table is fixed amount of rows
(~10) per hour.

Multiple (web) application nodes need to write data into this table. Most
of the operations are modifying the data columns, rather than inserting new
data. We had serious contention problems if we let all the application
nodes write directly into the table. The writes involved using prepared
transactions, the prepared transaction can only be reaped after other data
stores are written to, and there is very high chance nodes will need to
modify the same rows, so the locking was taking too long.

To relieve the contention, we have allocated each application node its own
set of tables that structurally are exactly like main/second tables. The
application node will open a transaction, write all the data into its own
tables, free of locks, and then call a pgsql function to merge the data
from its tables into the common tables. There is typically relatively
little data in the node tables (say within 100 rows in any table) before
its merged into the common tables. Nodes would dump their data when there
is something to dump, can be few times a second.

Recently, the operation that moves the data from the node tables into the
common tables started being a real drain on the PostgreSQL server CPU. I
assume this is probably due to the data set size reaching some critical
mass. Things really got outta hand when we had to double the amount of
application nodes to accommodate surge in application use.

The merging is done in the following manner.
Each main/second table has an associated PL/pgSQL function (merge_xxx) that
takes in key and data values as arguments. It then, in an endless loop,
tries to execute UPDATE statement (using math operations to update the data
based on existing and input data values, using key data in the query part).
If the UPDATE statement sets FOUND, then the function exits. Otherwise, the
function tries to INSERT with key/data values. If that succeeds, the function
exits; else, if unique_violation is thrown, the loop continues.

On top of these individual functions, there is another PL/pgSQL function
(merge_all). It uses a "for ROW in select * from MAIN_NODE" outer loop, and
within that loop it calls the merge_xxx for the main table, and then for
each secondary table does the same for "ROWx in select * from
SECOND_NODE", adding a WHERE clause to only pick up entries that correspond
to the current main_node table entry that's being processed, calling
merge_xxx for the corresponding secondary table. At the end of the outer
loop, all data from node tables is removed (using DELETE). I will gladly
provide pseudo-code, or even the function body, if my explanation is unclear.
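
A minimal sketch of that merge_xxx shape (placeholder table, key and data
column names, not the actual function) is the usual pre-9.5 upsert loop:

    CREATE OR REPLACE FUNCTION merge_example(_k numeric, _delta numeric)
    RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
        LOOP
            -- try to fold the delta into an existing row
            UPDATE main_table SET val = val + _delta WHERE k = _k;
            IF FOUND THEN
                RETURN;
            END IF;
            -- no row yet: try to create it
            BEGIN
                INSERT INTO main_table (k, val) VALUES (_k, _delta);
                RETURN;
            EXCEPTION WHEN unique_violation THEN
                -- somebody inserted the key concurrently; retry the UPDATE
            END;
        END LOOP;
    END;
    $$;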

Besides can somebody please look at this and let me know if I'm doing
something utterly stupid, here are my questions.

1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
function? All I see is that the calls to merge_all() function take long
time, and the CPU is high while this is going on.

2) Is there a better way to merge individual rows, except doing
UPDATE/INSERT in a loop, and would that be CPU expensive?

3) Is there a better way to merge whole tables? However, note that I need
to translate primary keys from node main table into the common main table,
as they are used as foreign keys, hence the loops. I suspect the looping is
CPU intensive.

Thank you,
  Pawel.


[GENERAL] incremental digest (and other) functions?

2014-11-17 Thread Pawel Veselov
Hi.

I was looking into a need of doing SHA on contents of a large object, and
it seems that the only digest (and other) functions provided by pgcrypto
produce output based on a single string on an input, without being able to
accumulate for multiple data blocks. It would be nice to see those, IMHO.
Or may be I missed them?

Thank you,
  Pawel.


[GENERAL] documentation for lo_* functions

2014-11-17 Thread Pawel Veselov
Hi.

Where is the proper documentation for lo_* functions (e.g. lo_open) that
are available as SQL functions? I see libpq functions documented in
/static/lo-interfaces.html, but not the SQL ones (from pg_catalog).

Thank you,
  Pawel.


[GENERAL] ORDER BY expression required in SELECT if DISTINCT

2013-03-17 Thread Pawel Veselov
Hi.

Anything I can do about this? Seems that the problem is that the query is a
prepared statement, even if the parameters ($1 and $6) are the same.

ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select
list at character 493
STATEMENT:  SELECT DISTINCT t0.package_name, ts_rank(t5.search_vector,
to_tsquery($1)) FROM application t0 CROSS JOIN application t1 CROSS JOIN
application_search t5 INNER JOIN application_state t2 ON
t1.PUBLICAPPSTATE_ID = t2.id INNER JOIN application_state_price t3 ON
t2.id= t3.E_APPSTATE_ID INNER JOIN price t4 ON t3.PRICES_ID =
t4.id WHERE (t4.currency = $2 AND t2.frontState = $3 AND t0.package_name =
t5.APP_PACKAGE_NAME AND ts_match_vq(t5.search_vector, to_tsquery($4)) = $5
AND 1 = 1) ORDER BY ts_rank(t5.search_vector, to_tsquery($6)) ASC LIMIT $7

This works:

SELECT DISTINCT t0.package_name, ts_rank(t5.search_vector,
to_tsquery('aaa'))
FROM application t0
CROSS JOIN application t1
CROSS JOIN application_search t5
INNER JOIN application_state t2
ON t1.PUBLICAPPSTATE_ID = t2.id
INNER JOIN application_state_price t3
ON t2.id = t3.E_APPSTATE_ID
INNER JOIN price t4
ON t3.PRICES_ID = t4.id
WHERE
(t4.currency = 'USD' AND
t2.frontState = 'PUBLISHED' AND
 t0.package_name = t5.APP_PACKAGE_NAME AND
ts_match_vq(t5.search_vector, to_tsquery('aaa')) AND 1 = 1)
ORDER BY ts_rank(t5.search_vector, to_tsquery('aaa')) ASC

Thank you,
  Pawel.
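
One workaround that is sometimes possible (a sketch only; whether the ORM
generating the statement can be told to do this is a separate question) is to
give the rank expression an output name and order by that name or by its
position, so the ORDER BY provably refers to a select-list item even when the
placeholders differ:

    SELECT DISTINCT t0.package_name,
           ts_rank(t5.search_vector, to_tsquery($1)) AS rank
      FROM application t0
      -- (same CROSS/INNER JOINs and WHERE clause as in the statement above)
     ORDER BY rank ASC      -- or: ORDER BY 2 ASC
     LIMIT $7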


Re: [GENERAL] Understanding streaming replication

2012-11-12 Thread Pawel Veselov
On Mon, Nov 12, 2012 at 1:36 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote:

 I'll try to answer the questions I can.


Thank you!


 Pawel Veselov wrote:
  I've been struggling with understanding all the necessary pieces for
 streaming replication. So I put
  down the pieces as I did understand them, and would appreciate if you
 guys could point out any of the
  stuff I understood or have done wrong.
 
  The set up is pgpool + streaming replication + hot stand by. No load
 balancing, stand-by nodes will
  not receive any application queries (I don't have that big of a query
 load, and I don't want to risk
  inconsistent reads). There are no shared file systems, but there is a
 way to rsync/scp files between
  nodes. Fail-over is automatic, and should kick in within reasonably
 small period after master failure.
 
  1. Archiving. Should be turned on on all the nodes. The archive command
 should copy the archive file
  to the local archive directory, and rsync archive directory between all
 the nodes. My understanding is
  that archiving is necessary if a stand-by node ever missed enough WAL
 updates to need an old enough
  WAL that might have been removed from pg_xlog.
 You don't give details about how the rsync is triggered,
 but I'd advise against having rsync as part of archive_command.
 First, it is slow and if there is a lot of activity, the
 archiver will not be able to keep up.
 Second, if rsync fails, the WAL file will not be considered
 archived.

 Both these things will keep the WAL files from being deleted
 from pg_xlog.

 I'd schedule rsync as a cron job or similar.


From your later comments, it's also apparent that these archived WALs will
be useless after failover (for the purpose of recovery), so there is no
reason to send them to all the nodes after all.



  QUESTION: After the failover, the new master will start archiving its
 WAL files. These archived WALs
  will not collide in any way with the archived WALs generated by previous
 master(s)?

 They will not, because the standby starts a new time line
 when it is promoted to primary, which will result in new
 WAL file names.

  QUESTION: What is a good policy for archive clean up? From the
 perspective to only remove archive
  files that are guaranteed to never be required by any nodes.

 You cannot tell from the primary's side.
 Since you also need the archives to restore an online backup,
 I'd keep them a long as your backup policy dictates.
 I hope you don't rely on standby databases for backup (just
 imagine an accidental DROP TABLE that gets propagated to all
 standbys within seconds).


I don't rely on stand-bys for backup. But that timeline establishment
business is a key piece that I didn't realize.



  2. Failover. On master failure, pgpool will automatically select a new
 master, and degenerate all
  other nodes. The cluster is now in the emergency state and requires
 manual intervention for
  reconfiguration and recovery. pgpool executes a script to promote a
 node, that script will create a
 trigger file on a newly selected master node, and postgres will exit
 stand-by mode.
 
  QUESTION: If multiple pgpools are running, and if there are no network
 problems, and configuration
  files are identical, is there any guarantee that the same stand-by node
 will be selected for
  promotion? Concern here is that with configuration of (M-SB0-SB1) one
 pgpool decides to promote SB0
  and another - SB1, causing both of them to enter master mode, and
 splitting the cluster. It does look
  that pgpool will always select next alive node for promotion, but I
 couldn't find a definitive
  statement on that.

 I don't know about pgpool and its abilities to handle
 cluster failover, but I wouldn't go this way at all.
 Even if the answer were that in the circumstances you
 describe things would work, you can depend on it that
 things will go wrong in ways different from what you
 expect, e.g. a broken network card.
 The consequences would be worse than I'd like to imagine.


I would imagine this situation can happen in any case; I don't see how it's
logically avoidable. If only one agent has the power to promote a node to
be the new master, that agent is a single point of failure. If multiple
agents can do the promotion, there is always a risk that they fall out of
sync.


 If you want reliable automatic failover, consider cluster
 software.


Is there anything you could recommend?

[skipped]


  Yours,
 Laurenz Albe



Re: [GENERAL] Understanding streaming replication

2012-11-12 Thread Pawel Veselov
On Mon, Nov 12, 2012 at 10:11 AM, Pawel Veselov pawel.vese...@gmail.com wrote:


 On Mon, Nov 12, 2012 at 1:36 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote:

 [skipped]

 From your later comments, it's also apparent that these archived WALs will
 be useless after failover (for the purpose of recovery), so there is no
 reason to send them to all the nodes after all.


I obviously lost it here. The archives do need to be synchronized, for the
purpose of recovering slaves: if a slave dies and I want to recover it, it
may need the archived WALs, so the archives should be available on that
node. Rsync (or something like that) is therefore still necessary, but it's
a bad idea to run it from the archive command itself.
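
So roughly this shape (the paths, schedule and host name are placeholders,
not something I have tested):

# postgresql.conf: archive_command only does the cheap local copy
archive_mode = on
archive_command = 'test ! -f /var/lib/pgsql/wal_archive/%f && cp %p /var/lib/pgsql/wal_archive/%f'

# crontab on each node: fan the local archive out separately
*/5 * * * *  rsync -a /var/lib/pgsql/wal_archive/ standby1:/var/lib/pgsql/wal_archive/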


[GENERAL] Understanding streaming replication

2012-11-09 Thread Pawel Veselov
Hi.

I've been struggling with understanding all the necessary pieces for
streaming replication. So I put down the pieces as I understand them, and
would appreciate it if you guys could point out anything I have
misunderstood or done wrong.

The set up is pgpool + streaming replication + hot stand by. No load
balancing, stand-by nodes will not receive any application queries (I don't
have that big of a query load, and I don't want to risk inconsistent
reads). There are no shared file systems, but there is a way to rsync/scp
files between nodes. Fail-over is automatic, and should kick in within a
reasonably small period after master failure.

1. Archiving. Should be turned on on all the nodes. The archive command
should copy the archive file to the local archive directory, and rsync
archive directory between all the nodes. My understanding is that archiving
is necessary if a stand-by node ever missed enough WAL updates to need an
old enough WAL that might have been removed from pg_xlog.

QUESTION: After the failover, the new master will start archiving its WAL
files. These archived WALs will not collide in any way with the archived
WALs generated by previous master(s)?

QUESTION: What is a good policy for archive clean up? From the perspective
to only remove archive files that are guaranteed to never be required by
any nodes.

2. Failover. On master failure, pgpool will automatically select a new
master, and degenerate all other nodes. The cluster is now in the emergency
state and requires manual intervention for reconfiguration and recovery.
pgpool executes a script to promote a node, that script will create a
trigger file on a newly selected master node, and postgres will exit
stand-by mode.
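
The stand-by side of that would look roughly like this (recovery.conf on
each stand-by; paths and conninfo are placeholders, and the trigger path
just has to match whatever the pgpool promotion script touches):

# recovery.conf on a stand-by
standby_mode = 'on'
primary_conninfo = 'host=current-master port=5432 user=replicator'
trigger_file = '/var/lib/pgsql/9.1/data/failover_trigger'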

QUESTION: If multiple pgpools are running, and if there are no network
problems, and configuration files are identical, is there any guarantee
that the same stand-by node will be selected for promotion? Concern here is
that with configuration of (M-SB0-SB1) one pgpool decides to promote SB0
and another - SB1, causing both of them to enter master mode, and splitting
the cluster. It does look that pgpool will always select next alive node
for promotion, but I couldn't find a definitive statement on that.

3. Recovery. That part is a bit confusing. Most of the documentation says
that in this case, the node should be reloaded from a base backup obtained
from the master. I'm not sure why this is necessary if there are enough
archived WALs.
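
What I have in mind for the replay-only path is roughly this (the archive
path is a placeholder):

# recovery.conf on the stand-by being caught up
standby_mode = 'on'
restore_command = 'cp /var/lib/pgsql/wal_archive/%f %p'
# primary_conninfo would go here as well, for streaming once caught up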

QUESTION: Is there any metric to understand whether hauling a full base
backup will be slower or faster than replaying the missed WALs? Anyway,
pgpool only has one recovery mechanism, and it does invoke a base restore
from whatever the current master is.

PROBLEM: This I see as a problem. The only way that I see to re-attach a
node to the pgpool, short of restarting it, is to call pcp_recovery_node.
This will make the master take a base backup, push it to the stand-by that
needs recovery, and re-start the stand-by node. I am not sure if there is a
good way to check whether that node has already been recovered. That
matters because if there is more than one pgpool, they will all attempt to
recover the same stand-by, and this will probably get ugly.

Thank you,
  Pawel.


[GENERAL] index update

2012-07-17 Thread Pawel Veselov
Hi.

If I have a lot of tables (10k), and each table has a btree index, and all
the tables are constantly being inserted into, would all the indexes have
to be in memory, and would they effectively start fighting for space?
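
For scale, this is roughly how I'd compare the total index footprint
against shared_buffers (a sketch):

SELECT pg_size_pretty(sum(pg_relation_size(indexrelid))::bigint) AS total_index_size
  FROM pg_stat_user_indexes;

SHOW shared_buffers;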

Thank you,
  Pawel.


Re: [GENERAL] index update

2012-07-17 Thread Pawel Veselov
On Tue, Jul 17, 2012 at 9:18 PM, Craig Ringer ring...@ringerc.id.au wrote:

  On 07/18/2012 12:02 PM, Pawel Veselov wrote:

 Hi.

  If I have a lot (10k) tables, and each table has a btree index, and all
 the tables are being constantly inserted into, would all the indexes have
 to be in memory, and would effectively start fighting for space?


 Quite likely, yes.

 You could make it a bit more efficient by grouping inserts up and doing
 batches for each table - if that's possible in your application.


Not with the current design :)



 10k tables is a heck of a lot. That sort of thing often implies app design
 issues.


That may be. However, attempting to put all the data into the same table
created problems with looking it up, and with the fact that both write and
read traffic hit the same table.


 Why 10k tables? What do they do?


Realistically, that's a way to partition data. The tables have the same
structure, but apply to different partitions. If I were to use Postgres
partitioning, would there be any difference, considering that I would have
to do almost exactly the same thing, with the inheritance and all?
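
By almost exactly the same I mean something like this (names made up):

CREATE TABLE events (partition_key int NOT NULL, payload text);
CREATE TABLE events_p1 (CHECK (partition_key = 1)) INHERITS (events);
CREATE TABLE events_p2 (CHECK (partition_key = 2)) INHERITS (events);
-- plus an index per child, a trigger (or application logic) routing inserts
-- to the right child, and constraint_exclusion so reads skip other children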

Thank you!



 --
 Craig Ringer



[GENERAL] errors with high connections rate

2012-07-03 Thread Pawel Veselov
Hi.

-- problem 1 --

I have an application, using libpq, connecting to postgres 9.1.3 (Amazon
AMI distro).
The application writes data at a high rate (at this point it's 500
transactions per second), using multiple threads (at this point there are 800).

These are worker threads that receive messages that are then written out
to the DB. There is no connection pool; instead, each worker thread
maintains its own connection that it uses to write data to the database.
The connections are kept in pthread-specific data blocks.

Each thread connects to the DB when its first work message is received, or
when the error flag is set on its connection. The error flag is set any
time there is an error running a database statement.

When the load is low, I don't see any problem (low was ~250 messages per
second). As I increased the load, when I restart the process, threads start
grabbing work at a high enough rate that each will first open a connection
to the database, and these errors start popping up:

Can't connect to DB: could not send data to server: Transport endpoint is
not connected
could not send startup packet: Transport endpoint is not connected

This is a result of executing the following code:

wi->pg_conn = PQconnectdb(conn_str);
ConnStatusType cst = PQstatus(wi->pg_conn);

if (cst != CONNECTION_OK) {
    ERR("Can't connect to DB: %s\n", PQerrorMessage(wi->pg_conn));
}

Eventually, the errors go away (when a worker thread fails to connect, it
just passes the message to another thread, waits for its turn, and will try
reconnecting again), so it does seem that the remedy is just spreading the
connection attempts out in time.

The connection string is '' (empty), the connection is made through
/tmp/.s.PGSQL.5432

I don't see these errors when:
1) the number of worker threads is reduced (I could never reproduce it with
200 or fewer, but have seen them with 300 or more)
2) the amount of load is reduced

-- problem 2 --

While trying to debug this with strace, I could never reproduce it well
enough to see what's going on, but sometimes I get another error: too many
users connected. Even restarting the postmaster doesn't help. The
postmaster is running with -N810, and the role has a connection limit of
1000. Yet the too many error starts creeping in after only 275 connections
are opened (counted by successful connect() calls in strace).

Any idea where I should dig?

P.S. I looked at fe-connect.c; I'm wondering if there is a potential race
condition between poll() and the socket actually finishing the connection.
If running under strace, I never see EINPROGRESS returned from connect(),
and the only reason sendto() would result in ENOTCONN is if the connect
didn't finish, and the socket was deemed connected using
poll/getsockopt...

Thanks,
  Pawel.


[GENERAL] 31.11. Notice Processing - associating notices with calls

2012-06-25 Thread Pawel Veselov
Hi.

What's the best way to associate an incoming notice with the statement that
generated it? Notice receivers operate on PGresult objects, but these
objects only become available after the statement call is made. Should I
have a global (or thread-global, since my app is MT) flag that would tell
the receiver what statement was executed last? I assume the execution path
is that the notice callback is invoked while the statement function
executes.

Thank you,
  Pawel.


Re: [GENERAL] 31.11. Notice Processing - associating notices with calls

2012-06-25 Thread Pawel Veselov
On Mon, Jun 25, 2012 at 8:07 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Pawel Veselov pawel.vese...@gmail.com writes:
  What's the best way to associate an incoming notice with the statement
  that resulted in generating it?

 Um ... the first issue with this problem statement is the assumption
 that there *is* a statement that caused the notice.  The server is
 capable of generating notices autonomously, for example during a forced
 database shutdown.  But having said that, you could certainly keep track
 of which command you last issued.


Is there then any way to know whether a notice came from a statement? My
issue is that some statements generate notices that I can safely dismiss
(CREATE something IF NOT EXISTS), but I don't want to dismiss any others. I
believe notices are not asynchronous (I don't use any asynchronous API), so
if there is a pending notice on the connection, and I set the current
statement and execute it, I will first get the pending notice, and only
then the statement-related notice.
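
For the CREATE ... IF NOT EXISTS case specifically, one thing I'm
considering is simply muting notices around those statements, e.g.:

SET client_min_messages = warning;
CREATE TABLE IF NOT EXISTS some_table (id int);  -- the already-exists NOTICE is suppressed
RESET client_min_messages;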



  Notice operate on PGResult objects, but
  these objects only become available after the statement call is made.

 I think you are misunderstanding the notice receiver API.  The PGresult
 that's passed to the receiver is just a transient one created to hold
 the notice message's fields.  It has nothing to do with the PGresult
 generated to hold the eventual result of the current query (if any).


Yes, I did misunderstand it. Is there a standard for what would be in this
result object?