Re: [PERFORM] Blocking every 20 sec while mass copying.

2014-07-21 Thread Albe Laurenz
Please keep the list on CC: in your responses.

Benjamin Dugast wrote:
 2014-07-18 13:11 GMT+02:00 Albe Laurenz laurenz.a...@wien.gv.at:
 This sounds a lot like checkpoint I/O spikes.

 Check with the database server log if the freezes coincide with checkpoints.

 You can increase checkpoint_segments when you load data to have them occur 
 less often.

 If you are on Linux and you have a lot of memory, you might hit spikes 
 because too
 much dirty data are cached; check /proc/sys/vm/dirty_ratio and 
 /proc/sys/vm/dirty_background_ratio.

 The checkpoint_segments is already set to 64.
 
 the dirty_ratio was set by default to 10; I put it down to 5
 the dirty_background_ratio was set to 5 and I changed it to 2
 
 There are fewer freezes, but the insert is much slower than before.

That seems to indicate that my suspicion was right.

I would say that your I/O system is saturated.
Have you checked with iostat -mNx 1?

If you really cannot drop the indexes during loading, there's probably not much
more you can do to speed up the load.
You can try to increase checkpoint_segments beyond 64 and see if that buys you
anything.

Tuning the file system write cache will not reduce the amount of I/O necessary,
but it should reduce the spikes (which is what I thought was your problem).
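For reference, a minimal sketch of the settings discussed above for a bulk-load window (the values are illustrative assumptions, not recommendations):

```
# postgresql.conf -- sketch for the load window
checkpoint_segments = 128           # try values beyond 64, as suggested above
checkpoint_completion_target = 0.9  # spread checkpoint writes over more time
log_checkpoints = on                # check whether freezes coincide with checkpoints

# /etc/sysctl.conf -- the Linux write-cache knobs mentioned above
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
```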

Yours,
Laurenz Albe

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] 60 core performance with 9.3

2014-07-21 Thread Kevin Grittner
Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:
 On 12/07/14 01:19, Kevin Grittner wrote:

 It might be worth a test using a cpuset to interleave OS cache and
 the NUMA patch I submitted to the current CF to see whether this is
 getting into territory where the patch makes a bigger difference.
 I would expect it to do much better than using numactl --interleave
 because work_mem and other process-local memory would be allocated
 in near memory for each process.

 http://www.postgresql.org/message-id/1402267501.4.yahoomail...@web122304.mail.ne1.yahoo.com

 Thanks Kevin - I did try this out - seemed slightly better than using
 --interleave, but almost identical to the results posted previously.

 However looking at my postgres binary with ldd, I'm not seeing any link
 to libnuma (despite it demanding the library whilst building), so I
 wonder if my package build has somehow vanilla-ified the result :-(

That is odd; not sure what to make of that!

 Also I am guessing that with 60 cores I do:

 $ sudo /bin/bash -c "echo 0-59 > /dev/cpuset/postgres/cpus"

 i.e. cpus are cores, not packages...?

Right; basically, as a guide, you can use the output from:

$ numactl --hardware

Use the union of all the cpu numbers from the node n cpus lines.  The
above command is also a good way to see how unbalanced memory usage has
become while running a test.
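As a sketch, the union Kevin describes can be computed mechanically from the numactl output; the sample text below is a made-up two-node machine, purely illustrative:

```python
# Collect the union of cpu ids from all "node N cpus:" lines of
# `numactl --hardware` output (sample output is fabricated for illustration).
sample = """available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 32768 MB
node 1 cpus: 4 5 6 7
node 1 size: 32768 MB"""

cpus = sorted(
    int(n)
    for line in sample.splitlines()
    if line.startswith("node") and "cpus:" in line
    for n in line.split("cpus:", 1)[1].split()
)
print(cpus)  # the ids to write into /dev/cpuset/postgres/cpus
```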

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[PERFORM] Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

2014-07-21 Thread johno
Hi there,

I am trying to optimize a simple query that returns the first 100 rows that
have been updated since a given timestamp (ordered by timestamp and id
desc).  If there are several rows with the same timestamp I need a
second condition stating that I want to return rows having the given
timestamp and id > the given id.

The obvious query is

SELECT * FROM register_uz_accounting_entities
 WHERE effective_on > '2014-07-11' OR (effective_on = '2014-07-11' AND
 id > 1459)
 ORDER BY effective_on, id
 LIMIT 100


With a composite index on (effective_on, id)

Query plan

Limit  (cost=4613.70..4613.95 rows=100 width=1250) (actual time=0.122..0.130 rows=22 loops=1)
  Buffers: shared hit=28
  ->  Sort  (cost=4613.70..4617.33 rows=1453 width=1250) (actual time=0.120..0.122 rows=22 loops=1)
        Sort Key: effective_on, id
        Sort Method: quicksort  Memory: 30kB
        Buffers: shared hit=28
        ->  Bitmap Heap Scan on register_uz_accounting_entities  (cost=35.42..4558.17 rows=1453 width=1250) (actual time=0.036..0.083 rows=22 loops=1)
              Recheck Cond: ((effective_on > '2014-07-11'::date) OR ((effective_on = '2014-07-11'::date) AND (id > 1459)))
              Buffers: shared hit=28
              ->  BitmapOr  (cost=35.42..35.42 rows=1453 width=0) (actual time=0.026..0.026 rows=0 loops=1)
                    Buffers: shared hit=6
                    ->  Bitmap Index Scan on idx2  (cost=0.00..6.49 rows=275 width=0) (actual time=0.017..0.017 rows=15 loops=1)
                          Index Cond: (effective_on > '2014-07-11'::date)
                          Buffers: shared hit=3
                    ->  Bitmap Index Scan on idx2  (cost=0.00..28.21 rows=1178 width=0) (actual time=0.007..0.007 rows=7 loops=1)
                          Index Cond: ((effective_on = '2014-07-11'::date) AND (id > 1459))
                          Buffers: shared hit=3
Total runtime: 0.204 ms



Everything works as expected. However, if I change the constraint to a
timestamp/date further in the past (resulting in far more matching rows), the
query slows down drastically.


 SELECT * FROM register_uz_accounting_entities
 WHERE effective_on > '2014-06-11' OR (effective_on = '2014-06-11' AND id > 1459)
 ORDER BY effective_on, id
 LIMIT 100

 Limit  (cost=0.42..649.46 rows=100 width=1250) (actual time=516.125..516.242 rows=100 loops=1)
   Buffers: shared hit=576201
   ->  Index Scan using idx2 on register_uz_accounting_entities  (cost=0.42..106006.95 rows=16333 width=1250) (actual time=516.122..516.229 rows=100 loops=1)
         Filter: ((effective_on > '2014-06-11'::date) OR ((effective_on = '2014-06-11'::date) AND (id > 1459)))
         Rows Removed by Filter: 797708
         Buffers: shared hit=576201
 Total runtime: 516.304 ms



I've tried to optimize this query by pushing down the limit and order by's
into explicit subselects.

SELECT * FROM (
    SELECT * FROM register_uz_accounting_entities
    WHERE effective_on > '2014-06-11'
    ORDER BY effective_on, id LIMIT 100
) t1
UNION
SELECT * FROM (
    SELECT * FROM register_uz_accounting_entities
    WHERE effective_on = '2014-06-11' AND id > 1459
    ORDER BY effective_on, id LIMIT 100
) t2
ORDER BY effective_on, id
LIMIT 100

 -- query plan
 Limit  (cost=684.29..684.54 rows=100 width=1250) (actual time=2.648..2.708 rows=100 loops=1)
   Buffers: shared hit=203
   ->  Sort  (cost=684.29..684.79 rows=200 width=1250) (actual time=2.646..2.672 rows=100 loops=1)
         Sort Key: register_uz_accounting_entities.effective_on, register_uz_accounting_entities.id
         Sort Method: quicksort  Memory: 79kB
         Buffers: shared hit=203
         ->  HashAggregate  (cost=674.65..676.65 rows=200 width=1250) (actual time=1.738..1.971 rows=200 loops=1)
               Buffers: shared hit=203
               ->  Append  (cost=0.42..661.15 rows=200 width=1250) (actual time=0.054..0.601 rows=200 loops=1)
                     Buffers: shared hit=203
                     ->  Limit  (cost=0.42..338.62 rows=100 width=1250) (actual time=0.053..0.293 rows=100 loops=1)
                           Buffers: shared hit=101
                           ->  Index Scan using idx2 on register_uz_accounting_entities  (cost=0.42..22669.36 rows=6703 width=1250) (actual time=0.052..0.260 rows=100 loops=1)
                                 Index Cond: (effective_on > '2014-06-11'::date)
                                 Buffers: shared hit=101
                     ->  Limit  (cost=0.42..318.53 rows=100 width=1250) (actual time=0.037..0.228 rows=100 loops=1)
                           Buffers: shared hit=102
                           ->  Index Scan using idx2 on register_uz_accounting_entities register_uz_accounting_entities_1  (cost=0.42..30888.88 rows=9710 width=1250) (actual time=0.036..0.187 rows=100 loops=1)
                                 Index Cond: ((effective_on = '2014-06-11'::date) AND (id > 1459))
                                 Buffers: shared hit=102

[PERFORM] Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

2014-07-21 Thread David G Johnston
johno wrote
 The question is... why is the query planner unable to make this
 optimization for the slow query? What am I missing?

Short answer - your first and last queries are not relationally equivalent
and the optimizer cannot change the behavior of the query which it is
optimizing.  I.e. you did not make an optimization, but rather chose to
reformulate the question so that it could be answered more easily while
still providing an acceptable answer.

The main question is better phrased as:

Give me 100 rows updated at t(0), but only those that are subsequent to a given
ID.  If there are fewer than 100 such records, give me enough additional rows
having t > t(0) so that the total number of rows returned equals 100.

Both queries give the same answer but only due to the final LIMIT 100. They
arrive there in different ways which necessitates generating different
plans.  At a basic level it is unable to push down LIMIT into a WHERE clause
and it cannot add additional sub-queries that do not exist in the original
plan - which includes adding a UNION node.

David J.




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Slow-query-with-indexed-ORDER-BY-and-LIMIT-when-using-OR-d-conditions-tp5812282p5812285.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.




Re: [PERFORM] Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

2014-07-21 Thread johno
Thanks for the quick reply David!

However I am still unsure how these two queries are not relationally
equivalent. I am struggling to find a counterexample where the first and
third query (in email, not in gist) would yield different results. Any
ideas?

Jano


On Mon, Jul 21, 2014 at 11:31 PM, David G Johnston 
david.g.johns...@gmail.com wrote:

 johno wrote
  The question is... why is the query planner unable to make this
  optimization for the slow query? What am I missing?

 Short answer - your first and last queries are not relationally equivalent
 and the optimizer cannot change the behavior of the query which it is
 optimizing.  I.e. you did not make an optimization, but rather chose to
 reformulate the question so that it could be answered more easily while
 still providing an acceptable answer.

 The main question is better phrased as:

 Give me 100 rows updated at t(0), but only those that are subsequent to a given
 ID.  If there are fewer than 100 such records, give me enough additional rows
 having t > t(0) so that the total number of rows returned equals 100.

 Both queries give the same answer but only due to the final LIMIT 100. They
 arrive there in different ways which necessitates generating different
 plans.  At a basic level it is unable to push down LIMIT into a WHERE
 clause
 and it cannot add additional sub-queries that do not exist in the original
 plan - which includes adding a UNION node.

 David J.







[PERFORM] Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

2014-07-21 Thread David G Johnston
johno wrote
 Thanks for the quick reply David!
 
 However I am still unsure how these two queries are not relationally
 equivalent. I am struggling to find a counterexample where the first and
 third query (in email, not in gist) would yield different results. Any
 ideas?

Remove the outer LIMIT 100 from both queries...

The first query would return a maximal number of rows that meet the OR
criteria while the second query would return at most 200 rows since both
sub-queries would still have their own independent LIMIT 100 clauses.

David J.
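The distinction is easy to reproduce in miniature. Here is a sketch using sqlite3 (the table layout and data are assumptions for the demo, not the poster's schema): with the outer LIMIT 100 both formulations return the same rows, but without it the UNION version caps out at 200 rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (effective_on TEXT, id INTEGER)")
con.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("2014-06-11", i) for i in range(1460, 1760)]   # 300 rows with id > 1459
    + [("2014-07-01", i) for i in range(300)],       # 300 rows with a later date
)

plain = """
    SELECT * FROM t
    WHERE effective_on > '2014-06-11'
       OR (effective_on = '2014-06-11' AND id > 1459)
    ORDER BY effective_on, id
"""
unioned = """
    SELECT * FROM (SELECT * FROM t WHERE effective_on > '2014-06-11'
                   ORDER BY effective_on, id LIMIT 100)
    UNION
    SELECT * FROM (SELECT * FROM t WHERE effective_on = '2014-06-11' AND id > 1459
                   ORDER BY effective_on, id LIMIT 100)
    ORDER BY effective_on, id
"""

# With the outer LIMIT both queries agree on the first 100 rows...
assert con.execute(plain + " LIMIT 100").fetchall() == \
       con.execute(unioned + " LIMIT 100").fetchall()

# ...but without it they are clearly not equivalent: all rows vs at most 200.
print(len(con.execute(plain).fetchall()),
      len(con.execute(unioned).fetchall()))  # prints: 600 200
```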




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Slow-query-with-indexed-ORDER-BY-and-LIMIT-when-using-OR-d-conditions-tp5812282p5812289.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.




Re: [PERFORM] Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

2014-07-21 Thread johno
Oh, yes I do understand that if I remove the outer limit, the semantics of
the query would change. However, I am looking for the counterexample *with*
the limit clauses. Maybe I just don't understand what relationally
equivalent means, sorry about that.

BTW, to my understanding this is a very similar scenario to how partitioned
tables push down limit and where conditions. Why is this not
possible in this case?

Jano


On Mon, Jul 21, 2014 at 11:54 PM, David G Johnston 
david.g.johns...@gmail.com wrote:

 johno wrote
  Thanks for the quick reply David!
 
  However I am still unsure how these two queries are not relationally
  equivalent. I am struggling to find a counterexample where the first and
  third query (in email, not in gist) would yield different results. Any
  ideas?

 Remove the outer LIMIT 100 from both queries...

 The first query would return a maximal number of rows that meet the OR
 criteria while the second query would return at most 200 rows since both
 sub-queries would still have their own independent LIMIT 100 clauses.

 David J.







[PERFORM] Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

2014-07-21 Thread David G Johnston
johno wrote
 Oh, yes I do understand that if I remove the outer limit, the semantics of
 the query would change. However I am looking for the counterexample *with*
 the limit clauses. Maybe I just don't understand what relationally
 equivalent means, sorry about that.
 
 BTW this is to my understanding a very similar scenario to how partitioned
 tables work and push down limit and where conditions. Why is this not
 possible in this case?
 
 Jano
 
 
 On Mon, Jul 21, 2014 at 11:54 PM, David G Johnston 

 david.g.johnston@

 wrote:
 
 johno wrote
  Thanks for the quick reply David!
 
  However I am still unsure how these two queries are not relationally
  equivalent. I am struggling to find a counterexample where the first
 and
  third query (in email, not in gist) would yield different results. Any
  ideas?

 Remove the outer LIMIT 100 from both queries...

 The first query would return a maximal number of rows that meet the OR
 criteria while the second query would return at most 200 rows since both
 sub-queries would still have their own independent LIMIT 100 clauses.

 David J.

Try following my lead and bottom-post, please.

Anyway, the planner has no clue that, because of the final LIMIT 100, the
two different feeding queries are just going to happen to end up providing
the same result.  Maybe, in this particular instance, it is theoretically
possible to make such a proof, but generally that is not the case, and so such
an optimization has not made it into the codebase even if it theoretically
could be done (I'm not convinced it could, but do not know enough to explain
to someone else why I have that impression).

I do not know enough to answer why this situation is any different from a
similar partitioning scenario.  An example showing exactly what a similar
partitioning query looks like would help in this regard.

If you are looking for considerably more insight into the planner workings
and why it does or doesn't do something you will need to wait for others.  I
can, to a reasonable degree, deconstruct a pair of queries and either
explain or guess as to why things are happening but that is mostly applied
deductive reasoning and not because I have any particular insight into the
codebase.

David J.




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Slow-query-with-indexed-ORDER-BY-and-LIMIT-when-using-OR-d-conditions-tp5812282p5812291.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.




Re: [PERFORM] Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

2014-07-21 Thread johno


 Try following my lead and bottom-post, please.


Sorry for that.



 Anyway, the query has no clue that because of the final LIMIT 100 that the
 two different feeding queries are just going to happen to end up providing
 the same result.  Maybe, in this particular instance, it is theoretically
 possible to make such a proof but generally that is not the case and so
 such
 an optimization has not made it into the codebase even if it theoretically
 could be done (I'm not convinced it could but do not know enough to explain
 to someone else why I have that impression).

 I do not know enough to answer why this situation is any different from a
 similar partitioning scenario.  An example showing exactly what a similar
 partitioning query looks like would help in this regard.

 If you are looking for considerably more insight into the planner workings
 and why it does or doesn't do something you will need to wait for others.  I
 can, to a reasonable degree, deconstruct a pair of queries and either
 explain or guess as to why things are happening but that is mostly applied
 deductive reasoning and not because I have any particular insight into the
 codebase.



Thanks again for your time. Let's just wait for someone else and see where
this will end up going.

Jano


Re: [PERFORM] Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

2014-07-21 Thread Tom Lane
johno jan.suc...@gmail.com writes:
 I am trying to optimize a simple query that returns the first 100 rows that
 have been updated since a given timestamp (ordered by timestamp and id
 desc).  If there are several rows with the same timestamp I need a
 second condition stating that I want to return rows having the given
 timestamp and id > the given id.

 The obvious query is

 SELECT * FROM register_uz_accounting_entities
 WHERE effective_on > '2014-07-11' OR (effective_on = '2014-07-11' AND
 id > 1459)
 ORDER BY effective_on, id
 LIMIT 100

A more readily optimizable query is

SELECT * FROM register_uz_accounting_entities
WHERE (effective_on, id) > ('2014-07-11'::date, 1459)
ORDER BY effective_on, id
LIMIT 100

This formulation allows the planner to match both the WHERE and ORDER BY
clauses directly to the two-column index.
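For non-null values the two predicates are in fact the same thing: row-value comparison is lexicographic, so (a, b) > (x, y) holds exactly when a > x OR (a = x AND b > y). A quick Python sketch (names are purely illustrative) brute-forces the identity, since Python tuples compare lexicographically too:

```python
# Verify: (a, b) > (x, y)  <=>  a > x or (a == x and b > y), for non-null values.
def or_form(a, b, x, y):
    return a > x or (a == x and b > y)

def row_form(a, b, x, y):
    return (a, b) > (x, y)

cases = [(a, b, x, y)
         for a in range(4) for b in range(4)
         for x in range(4) for y in range(4)]
assert all(or_form(*c) == row_form(*c) for c in cases)
print("equivalent on all", len(cases), "cases")  # prints: equivalent on all 256 cases
```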

 I've tried to optimize this query by pushing down the limit and order by's
 into explicit subselects.

As noted earlier, that's unlikely to be an improvement, because on its
face it specifies more computation.  Postgres is not terribly bright
about UNIONs, either.

regards, tom lane




Re: [PERFORM] Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

2014-07-21 Thread johno
On Tue, Jul 22, 2014 at 4:53 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 johno jan.suc...@gmail.com writes:
  I am trying to optimize a simple query that returns the first 100 rows that
  have been updated since a given timestamp (ordered by timestamp and id
  desc).  If there are several rows with the same timestamp I need a
  second condition stating that I want to return rows having the given
  timestamp and id > the given id.

  The obvious query is

  SELECT * FROM register_uz_accounting_entities
  WHERE effective_on > '2014-07-11' OR (effective_on = '2014-07-11' AND
  id > 1459)
  ORDER BY effective_on, id
  LIMIT 100

 A more readily optimizable query is

 SELECT * FROM register_uz_accounting_entities
 WHERE (effective_on, id) > ('2014-07-11'::date, 1459)
 ORDER BY effective_on, id
 LIMIT 100


Yes, but that query has completely different semantics - I can't change
that.



 This formulation allows the planner to match both the WHERE and ORDER BY
 clauses directly to the two-column index.


Are both fields really used? I was under the impression that only the first
column of the index can be used when there is a range query.



  I've tried to optimize this query by pushing down the limit and order
 by's
  into explicit subselects.

 As noted earlier, that's unlikely to be an improvement, because on its
 face it specifies more computation.  Postgres is not terribly bright
 about UNIONs, either.



Despite the cost calculation in explain the actual query times are very
different. I get consistent sub 50ms responses from the optimized one
(union with pushing down the limits) and 500+ms for the plain one (when not
using bitmap index scan).

Is this possible optimization considered by the query planner, or do I have to
force it?

Thanks again for your time and effort, I appreciate it.




 regards, tom lane