[PERFORM] Re: OLAP/reporting queries fall into nested loops over seq scans or other horrible planner choices

2017-11-03 Thread Thomas Kellerer
Laurenz Albe schrieb am 02.11.2017 um 09:30:
> Finally, even though the official line of PostgreSQL is to *not* have
> query hints, and for a number of good reasons, this is far from being
> a unanimous decision.  The scales may tip at some point, though I
> personally hope that this point is not too close.

I also think that hints are not the right way to solve problems like that.

I do like Oracle's approach with SQL profiles, where you can force the
optimizer to try harder to find a good execution plan. I _think_ it even
runs the statement with multiple plans and compares the expected outcome
with the actual values. Once a better plan is found, it can be attached
to that query and the planner will use it for subsequent executions.

This however requires a much bigger infrastructure than simple hints.

(Unrelated, but: maybe a compromise in the never-ending "hints vs. no hints"
discussion would be to think about integrating the existing "pg_hint_plan"
extension as a standard contrib module.)
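
For reference, pg_hint_plan reads its hints from a special comment at the
start of the statement. A hypothetical sketch - table names and the choice
of hints are made up:

   /*+ SeqScan(o) HashJoin(o c) */
   select *
   from orders o
     join customers c on c.id = o.customer_id;

This would ask the planner to use a sequential scan on "orders" and a hash
join between the two tables.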

Thomas






Re: [PERFORM] Architectural question

2016-03-23 Thread Thomas Kellerer
Jim Nasby schrieb am 11.03.2016 um 17:37:
> If the blob is in the database then you have nothing extra to do. It's 
> handled just like all your other data.
> 
> If it's a file in a file system then you need to:
> 
> - Have application code that knows how and where to get at the file
> - Have a way to make those files available on all your webservers
> - Have completely separate backup and recovery plans for those files
> 
> That's a lot of extra work. Sometimes it's necessary, but many times it's not.

Don't forget the code you need to write to properly handle transactional access
(writing, deleting) to the files.

You usually also need to distribute the files over many directories;
having millions of files in a single directory is usually not such a good idea.

In my experience you also need some cleanup job that removes orphaned files
from the file system, because no matter how hard you try to get updates/writes
to the file system right, at some point this fails.

Also, from a security point of view, having this in the database is more robust
than in the file system.

The downside of bytea is that you can't stream it to the client. The
application always needs to read the whole blob into memory before it can be
used. This might put some memory pressure on the application server.
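
If needed, chunked access can at least be approximated on the SQL level.
A hedged sketch, assuming a table "documents" with a bytea column "content"
(both names made up):

   select octet_length(content) from documents where id = 42;                        -- total size
   select substring(content from 1 for 1048576) from documents where id = 42;        -- first 1MB
   select substring(content from 1048577 for 1048576) from documents where id = 42;  -- second 1MB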

Thomas






[PERFORM] Why is now()::date so much faster than current_date

2015-11-17 Thread Thomas Kellerer
Hello,

I stumbled over this answer: http://stackoverflow.com/a/9717125/330315 and this 
sounded quite strange to me. 

So I ran this on my Windows laptop with Postgres 9.4.5, 64bit and indeed 
now()::date is much faster than current_date:

  explain analyze
  select current_date
  from   generate_series (1, 1000000);

  Function Scan on generate_series  (cost=0.00..6.00 rows=1000 width=0) (actual time=243.878..1451.839 rows=1000000 loops=1)
  Planning time: 0.047 ms
  Execution time: 1517.881 ms

And:

  explain analyze
  select now()::date
  from   generate_series (1, 1000000);

  Function Scan on generate_series  (cost=0.00..6.00 rows=1000 width=0) (actual time=244.491..785.819 rows=1000000 loops=1)
  Planning time: 0.037 ms
  Execution time: 826.612 ms

Running this on a CentOS 6.6 test server (Postgres 9.4.1, 64bit), there is
still a difference, but not as big as on Windows:

  explain analyze
  select current_date
  from generate_series (1, 1000000);

  Function Scan on generate_series  (cost=0.00..15.00 rows=1000 width=0) (actual time=233.599..793.032 rows=1000000 loops=1)
  Planning time: 0.087 ms
  Execution time: 850.198 ms

And

  explain analyze
  select now()::date
  from   generate_series (1, 1000000);

  Function Scan on generate_series  (cost=0.00..15.00 rows=1000 width=0) (actual time=198.385..570.171 rows=1000000 loops=1)
  Planning time: 0.074 ms
  Execution time: 623.211 ms

Any ideas? 






Re: [PERFORM] SELECT slows down on sixth execution

2015-10-19 Thread Thomas Kellerer
Jonathan Rogers schrieb am 17.10.2015 um 04:14:
>>> Yes, I have been looking at both plans and can see where they diverge.
>>> How could I go about figuring out why Postgres fails to see the large
>>> difference in plan execution time? I use exactly the same parameters
>>> every time I execute the prepared statement, so how would Postgres come
>>> to think that those are not the norm?
>>
>> PostgreSQL does not consider the actual query execution time, it only
>> compares its estimates for the generic and the custom plan.
>> Also, it does not keep track of the parameter values you supply,
>> only of the average custom plan query cost estimate.
> 
> OK, that makes more sense then. It's somewhat tedious for the purpose of
> testing to execute a prepared statement six times to see the plan which
> needs to be optimized. Unfortunately, there doesn't seem to be any way
> to force use of a generic plan in SQL based on Pavel Stehule's reply.


If you are using JDBC the threshold can be changed:

   https://jdbc.postgresql.org/documentation/94/server-prepare.html
   
https://jdbc.postgresql.org/documentation/publicapi/org/postgresql/PGStatement.html#setPrepareThreshold%28int%29

As I don't think JDBC is using anything "exotic", I would be surprised if this
couldn't be changed in other programming environments as well.

Thomas





[PERFORM] Re: Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?

2015-07-09 Thread Thomas Kellerer
Graeme B. Bell schrieb am 09.07.2015 um 11:44:
> I don't recall seeing a clear statement telling me I should mark pl/pgsql
> functions nonvolatile wherever possible or throw all performance and
> scalability out the window. 

From: http://www.postgresql.org/docs/current/static/xfunc-volatility.html

   "For best optimization results, you should label your functions 
with the strictest volatility category that is valid for them."
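
For example - a hypothetical function, shown only to illustrate the labeling:

   create or replace function clean_name(p_name text)
     returns text
   as
   $$
     select lower(trim($1));
   $$
   language sql
   immutable;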







Re: [PERFORM] Performance issues

2015-03-17 Thread Thomas Kellerer
Tomas Vondra schrieb am 17.03.2015 um 15:43:
> On 17.3.2015 15:19, Thomas Kellerer wrote:
>> Tomas Vondra schrieb am 17.03.2015 um 14:55:
>>>  (2) using window functions, e.g. like this:
>>>
>>>  SELECT * FROM (
>>>SELECT *,
>>> ROW_NUMBER() OVER (PARTITION BY touchpoint_execution_id
>>>ORDER BY max_creation_dt) AS rn
>>>FROM s_f_touchpoint_execution_status_history
>>>  ) foo WHERE rn = 1
>>>
>>>  But estimating this is also rather difficult ...
>>
>>
>> From my experience rewriting something like the above using DISTINCT 
>> ON is usually faster.
> 
> How do you get the last record (with respect to a timestamp column)
> using a DISTINCT ON?

You need to use "order by ... desc". See here: http://sqlfiddle.com/#!15/d4846/2

Btw: your row_number() usage wouldn't return the "latest" row either. 
It would return the "oldest" row.
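
I.e., using the table from this thread, something like:

   select distinct on (touchpoint_execution_id) *
   from s_f_touchpoint_execution_status_history
   order by touchpoint_execution_id, max_creation_dt desc;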












Re: [PERFORM] Performance issues

2015-03-17 Thread Thomas Kellerer
Tomas Vondra schrieb am 17.03.2015 um 14:55:
>  (2) using window functions, e.g. like this:
> 
>  SELECT * FROM (
>SELECT *,
> ROW_NUMBER() OVER (PARTITION BY touchpoint_execution_id
>ORDER BY max_creation_dt) AS rn
>FROM s_f_touchpoint_execution_status_history
>  ) foo WHERE rn = 1
> 
>  But estimating this is also rather difficult ...


From my experience rewriting something like the above using DISTINCT ON is
usually faster.

e.g.:

select distinct on (touchpoint_execution_id) *
from s_f_touchpoint_execution_status_history
order by touchpoint_execution_id, max_creation_dt;








Re: [PERFORM] MusicBrainz postgres performance issues

2015-03-16 Thread Thomas Kellerer
Robert Kaye schrieb am 16.03.2015 um 13:59:
> However, I am glad to report that our problems are fixed and that our
> server is back to humming along nicely.
> 
> And as I said to Josh earlier: "Postgres rocks our world. I’m
> immensely pleased that once again the problems were our own stupidity
> and not PG’s fault. In over 10 years of us using PG, it has never
> been PG’s fault. Not once.”
> 
> And thus we’re one tiny bit smarter today. Thank you everyone!
> 

I think it would be nice if you could amend your blog posting to include the
solution that you found.

Otherwise this will simply stick around as yet another unsolved performance
problem.

Thomas





Re: [PERFORM] Reverse Key Index

2015-02-26 Thread Thomas Kellerer
Sven R. Kunze schrieb am 26.02.2015 um 13:23:
> If you think Reverse Key Indexes have no usage here in PostgreSQL, you should 
> not support convenience features 
> for easily improving performance without breaking the querying API 

It's also unclear to me which "performance" you are referring to.
Insert performance? Retrieval performance? Concurrency? 

The use-case for reverse indexes in Oracle is pretty small: it's _only_ about 
the contention when doing a lot of inserts with increasing numbers (because the 
different transactions will be blocked when accessing the blocks in question). 

As Postgres manages inserts differently than Oracle, I'm not so sure that this
problem exists in Postgres the same way it does in Oracle.
That's why I asked if you have a _specific_ problem.

Richard Foote's blog post is mostly about the myth that _if_ you have a reverse
index it is only used for equality operations.
It does not claim that a reverse index is faster than a regular index _if_ it
is used for a range scan.

The question is: do you think you need a reverse index because you have a
performance problem when doing many, many inserts at the same time using
"close-by" values into a table that uses a btree index on the column?

Or do you think you need a reverse index to improve the performance of a range
scan? If that is the case, then you can easily use a GIN or GiST trigram index
to speed up a "LIKE '%abc%'" condition (something Oracle can't do at all)
without having to worry about obfuscation layers (aka ORMs).
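
A rough sketch of the trigram approach, assuming the pg_trgm module is
available (table and column names made up):

   create extension pg_trgm;
   create index idx_customer_name_trgm on customer using gin (name gin_trgm_ops);

   -- this query can now use the index:
   select * from customer where name like '%abc%';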

Thomas








Re: [PERFORM] Reverse Key Index

2015-02-26 Thread Thomas Kellerer
Sven R. Kunze schrieb am 26.02.2015 um 12:04:
> I just thought about btree indexes here mainly because they well-known and 
> well-used in ORM frameworks. 

If your ORM framework needs to know about the internals of an index definition
or even requires a certain index type, then you should ditch that ORM framework.

Apart from indexes supporting business constraints (e.g. a unique index),
neither the application nor the ORM framework should care about indexes at
all.

> does PostgreSQL support the concept of reverse key indexing as described 
> here? 

The real question is: why do you think you need such an index? 
Do you have any performance problems with the existing BTree index? If yes, 
which problem exactly? 

Thomas







Re: [PERFORM] Query Performance question

2014-07-14 Thread Thomas Kellerer
Magers, James, 14.07.2014 15:18:
> Thank you.  I executed the query this morning after disabling the scan types. 
>  
> I am including links to explain.depesz output for each of the three 
> variations that I executed.  
> 
> indexscan and bitmapscan off: http://explain.depesz.com/s/sIx
> seqscan and bitmapscan off: http://explain.depesz.com/s/GfM
> bitmapscan off: http://explain.depesz.com/s/3wna
> 

So the original query (using an "Index Scan" + "Bitmap Index Scan") is indeed 
the most efficient one: 4ms vs. 39ms vs. 64ms 










Re: [PERFORM] Query Performance question

2014-07-14 Thread Thomas Kellerer
Magers, James, 14.07.2014 04:20:

> Thank you for your feedback.  I am attaching the requested information.  
> While I do not think the query is necessarily inefficient, I believe  a 
> sequence scan would be more efficient.  

You can try

set enable_indexscan = off;
set enable_bitmapscan = off;

and then run your query. 

But I would be very surprised if a seq scan (which reads through the whole
table) was faster than those 4ms you have now.






Re: [PERFORM] How clustering for scale out works in PostgreSQL

2013-08-29 Thread Thomas Kellerer

bsreejithin wrote on 29.08.2013 18:13:

PostgreSQL version was *8.2*.


8.2 has long been deprecated.

For a new system you should use 9.2 (or at least 9.1)

Thomas







Re: [PERFORM] Thinking About Correlated Columns (again)

2013-05-15 Thread Thomas Kellerer

Shaun Thomas wrote on 15.05.2013 17:31:

Hi!

This has been a pain point for quite a while. While we've had several
discussions in the area, it always seems to just kinda trail off and
eventually vanish every time it comes up.

A really basic example of how bad the planner is here:

CREATE TABLE foo AS
SELECT a.id, a.id % 1000 AS col_a, a.id % 1000 AS col_b
   FROM generate_series(1, 1000000) a(id);

CREATE INDEX idx_foo_ab ON foo (col_a, col_b);

Index Scan using idx_foo_ab on foo  (cost=0.00..6.35 rows=1 width=12)
(actual time=0.030..3.643 rows=1000 loops=1)
   Index Cond: ((col_a = 50) AND (col_b = 50))

Hey, look! The row estimate is off by a factor of 1000. This
particular case doesn't suffer terribly from the mis-estimation, but
others do. Boy, do they ever.


What happens if you create one index for each column? (instead of one combined 
index)

For your example it does not seem to improve the situation, but maybe things get better 
with the "bad" queries?








Re: [PERFORM] Deterioration in performance when query executed in multi threads

2013-05-06 Thread Thomas Kellerer

Anne Rosset, 06.05.2013 19:00:

Postgres version: 9.0.13


Work_mem is set to 64MB
Shared_buffer to 240MB
Segment_size is 1GB
Wal_buffer is 10MB


Artifact table: 251831 rows
Field_value table: 77378 rows
Mntr_subscription: 929071 rows
Relationship: 270478 row
Folder: 280356 rows
Item: 716465 rows
Sfuser: 5733 rows
Project: 1817 rows

8CPUs
RAM: 8GB



With 8GB RAM you should be able to increase shared_buffers to 1GB or maybe even
higher, especially if this is a dedicated server.
240MB is pretty conservative for a server with that amount of RAM (unless you
have many other applications running on that box).

Also, what are the values for:

cpu_tuple_cost
seq_page_cost
random_page_cost
effective_cache_size

What kind of hard disks are in the server? SSDs? Regular (spinning) disks?
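
The settings mentioned above can be checked with e.g.:

   select name, setting
   from pg_settings
   where name in ('cpu_tuple_cost', 'seq_page_cost',
                  'random_page_cost', 'effective_cache_size');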







Re: [PERFORM] In progress INSERT wrecks plans on table

2013-05-02 Thread Thomas Kellerer

mark.kirkw...@catalyst.net.nz wrote on 03.05.2013 00:19:

I think the idea of telling postgres that we are doing a load is probably
the wrong way to go about this. We have a framework that tries to
automatically figure out the best plans...I think some more thought about
how to make that understand some of the more subtle triggers for a
time-to-do-new-plans moment is the way to go. I understand this is
probably hard - and may imply some radical surgery to how the stats
collector and planner interact.


I wonder if "freezing" (analyze, then disable autovacuum) the statistics for 
the large number of rows would work.
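
Something along these lines - only a sketch, table name made up:

   analyze big_table;                                       -- collect statistics once
   alter table big_table set (autovacuum_enabled = false);  -- keep autovacuum/auto-analyze from changing them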





Re: [PERFORM] Recursive query gets slower when adding an index

2012-10-19 Thread Thomas Kellerer

Tom Lane wrote on 19.10.2012 16:20:

Thomas Kellerer  writes:

This is the execution plan without index: http://explain.depesz.com/s/ecCT
When I create an index on parent_id execution time increases to something 
between 110ms and 130ms
This is the execution plan with index: http://explain.depesz.com/s/xiL


The reason you get a bad plan choice here is the severe underestimate of
the average number of rows coming out of the worktable scan (ie, the
size of the "recursive" result carried forward in each iteration).

Unfortunately, it's really hard to see how we might make that number
better.  The current rule of thumb is "10 times the size of the
nonrecursive term", which is why you get 10 here.  We could choose
another multiplier but it'd be just as bogus as the current one
(unless somebody has some evidence about typical expansion factors?)

I suppose though that there's some argument for discouraging the planner
from assuming that the carried-forward result is small; so maybe we
should use something larger than 10.



Thanks for the feedback.

I just noticed this behaviour because we ran the same query on SQL Server 2008
and that took well over 30 seconds without the index.
SQL Server *really* improved and returned the result in 0.5 seconds with the
index in place.

So I was curious how much faster Postgres would be *with* the index ;)

Regards
Thomas







[PERFORM] Recursive query gets slower when adding an index

2012-10-19 Thread Thomas Kellerer

Hi,

I have a self-referencing table that defines a hierarchy of projects and 
sub-projects.

This is the table definition:

CREATE TABLE project
(
   project_id    integer primary key,
   project_name  text,
   pl_name       text,
   parent_id     integer
);

ALTER TABLE project
  ADD CONSTRAINT project_parent_id_fkey FOREIGN KEY (parent_id)
  REFERENCES project (project_id)
  ON UPDATE NO ACTION
  ON DELETE NO ACTION;


The table contains ~11000 rows

The following statement:

with recursive project_tree as (
   select project_id,
  parent_id,
  pl_name  as root_pl,
  pl_name  as sub_pl,
  1 as lvl
 from project
where parent_id is null
   union all
   select c.project_id,
  c.parent_id,
  coalesce(p.root_pl, c.pl_name) as root_pl,
  coalesce(c.pl_name, p.sub_pl)  as sub_pl,
  p.lvl + 1
 from project c
 join project_tree p on p.project_id = c.parent_id
)
select count(*), max(lvl)
  from project_tree
 where root_pl <> sub_pl;

usually runs in something like 60-80ms when the parent_id column is *not* 
indexed.

This is the execution plan without index: http://explain.depesz.com/s/ecCT

When I create an index on parent_id execution time increases to something 
between 110ms and 130ms

This is the execution plan with index: http://explain.depesz.com/s/xiL

As far as I can tell, the choice of the nested loop is the reason for the
(slightly) slower execution.
I increased the statistics target for the parent_id column to 10000 (and ran an
analyze, of course) but that didn't change anything.

I have no problem with that performance, so this is more of a "I'm curious why
this happens" type of question.
(And I thought you might be interested in this behaviour as well.)

My environment:

  * Windows 7 Professional 64bit
  * PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit


Regards
Thomas







Re: [PERFORM] Postgres becoming slow, only full vacuum fixes it

2012-09-28 Thread Thomas Kellerer

Kiriakos Tsourapas, 25.09.2012 13:01:

Thank you,

I will take this into consideration, since upgrading to 9 will be much harder I 
assume...



I think an upgrade from 8.3 to 8.4 was "harder" due to the removal of a lot of 
implicit type casts.
8.4 to 9.x shouldn't be that problematic after all (but will take longer due to 
the required dump/reload)








Re: [PERFORM] Using ctid column changes plan drastically

2012-07-25 Thread Thomas Kellerer

Tom Lane, 24.07.2012 19:12:

Well, it would only help if you're running a PG version that's new
enough to recognize the NOT EXISTS as an anti-join; and even then,
it's possible that joining on a tid column forecloses enough plan
types that you don't get any real benefit.  But I'm just guessing.
Can you show exactly what you tried and what EXPLAIN ANALYZE results
you got?



I am using 9.1.4 (as I said in my initial post).

I finally found a solution that runs fine:

DELETE FROM dupes a
WHERE EXISTS (SELECT 1
  FROM  dupes b
  WHERE b.first_name = a.first_name
AND b.last_name = a.last_name
AND b.ctid > a.ctid);

The execution plan for this is:

Delete on public.dupes a  (cost=14575.95..16978.87 rows=25000 width=12) (actual 
time=2419.334..2419.334 rows=0 loops=1)
  Buffers: shared hit=18029
  ->  Merge Semi Join  (cost=14575.95..16978.87 rows=25000 width=12) (actual 
time=2043.674..2392.707 rows=17097 loops=1)
Output: a.ctid, b.ctid
Merge Cond: ((a.first_name = b.first_name) AND (a.last_name = 
b.last_name))
Join Filter: (b.ctid > a.ctid)
Buffers: shared hit=930
->  Sort  (cost=7287.98..7475.48 rows=75000 width=20) (actual 
time=1024.195..1030.051 rows=75000 loops=1)
  Output: a.ctid, a.first_name, a.last_name
  Sort Key: a.first_name, a.last_name
  Sort Method: quicksort  Memory: 8870kB
  Buffers: shared hit=465
  ->  Seq Scan on public.dupes a  (cost=0.00..1215.00 rows=75000 
width=20) (actual time=0.025..23.234 rows=75000 loops=1)
Output: a.ctid, a.first_name, a.last_name
Buffers: shared hit=465
->  Sort  (cost=7287.98..7475.48 rows=75000 width=20) (actual 
time=1019.148..1028.483 rows=105841 loops=1)
  Output: b.ctid, b.first_name, b.last_name
  Sort Key: b.first_name, b.last_name
  Sort Method: quicksort  Memory: 8870kB
  Buffers: shared hit=465
  ->  Seq Scan on public.dupes b  (cost=0.00..1215.00 rows=75000 
width=20) (actual time=0.017..19.133 rows=75000 loops=1)
Output: b.ctid, b.first_name, b.last_name
Buffers: shared hit=465
Total runtime: 2420.953 ms

Which is a lot better than the plan using "WHERE ctid NOT IN (...)":

Delete on public.dupes  (cost=1777.50..4925055.00 rows=37500 width=6) (actual 
time=582515.094..582515.094 rows=0 loops=1)
  Buffers: shared hit=18027
  ->  Seq Scan on public.dupes  (cost=1777.50..4925055.00 rows=37500 width=6) 
(actual time=1038.164..582332.927 rows=17097 loops=1)
Output: dupes.ctid
Filter: (NOT (SubPlan 1))
Buffers: shared hit=930
SubPlan 1
  ->  Materialize  (cost=1777.50..1890.00 rows=7500 width=20) (actual 
time=0.001..2.283 rows=35552 loops=75000)
Output: (min(b.ctid)), b.first_name, b.last_name
Buffers: shared hit=465
->  HashAggregate  (cost=1777.50..1852.50 rows=7500 width=20) 
(actual time=90.964..120.228 rows=57903 loops=1)
  Output: min(b.ctid), b.first_name, b.last_name
  Buffers: shared hit=465
  ->  Seq Scan on public.dupes b  (cost=0.00..1215.00 
rows=75000 width=20) (actual time=0.008..25.515 rows=75000 loops=1)
Output: b.ctid, b.first_name, b.last_name
Buffers: shared hit=465
Total runtime: 582517.711 ms

Using "WHERE id NOT IN (...)" is the fastest way:

Delete on public.dupes  (cost=1871.25..3273.75 rows=37500 width=6) (actual 
time=187.949..187.949 rows=0 loops=1)
  Buffers: shared hit=18490
  ->  Seq Scan on public.dupes  (cost=1871.25..3273.75 rows=37500 width=6) 
(actual time=125.351..171.108 rows=17097 loops=1)
Output: dupes.ctid
Filter: (NOT (hashed SubPlan 1))
Buffers: shared hit=930
SubPlan 1
  ->  HashAggregate  (cost=1777.50..1852.50 rows=7500 width=18) (actual 
time=73.131..93.421 rows=57903 loops=1)
Output: min(b.id), b.first_name, b.last_name
Buffers: shared hit=465
->  Seq Scan on public.dupes b  (cost=0.00..1215.00 rows=75000 
width=18) (actual time=0.004..8.515 rows=75000 loops=1)
  Output: b.id, b.first_name, b.last_name
  Buffers: shared hit=465
Total runtime: 189.222 ms

Regards
Thomas





Re: [PERFORM] Using ctid column changes plan drastically

2012-07-24 Thread Thomas Kellerer

Tom Lane wrote on 24.07.2012 17:55:

Joins on tid columns just aren't supported very well at the moment.
Partly that's from lack of round tuits, and partly it's because it
doesn't seem all that wise to encourage people to use them.  There
are gotchas if any of the rows receive concurrent updates.


Thanks for the clarification. I will keep that in mind.


FWIW, it might be helpful to cast this as a NOT EXISTS rather than
NOT IN subquery.


Hmm. How would you change that into a NOT EXISTS clause (so that one of the
duplicates remains)?
Everything I come up with is in fact slower than the NOT IN solution.

Regards
Thomas






Re: [PERFORM] Using ctid column changes plan drastically

2012-07-24 Thread Thomas Kellerer

Tom Lane, 24.07.2012 16:23:

Thomas Kellerer  writes:

DELETE FROM dupes
WHERE id NOT IN (SELECT min(b.id)
   FROM   dupes b
   GROUP BY first_name, last_Name
   HAVING count(*) > 1);


Doesn't that kill the non-duplicates too?


Ah right - another good point on how important the correct test data is ;)


Why does the usage of the CTID column change the plan so drastically?


IIRC, type tid doesn't have any hash support.



So the "bad" plan is expected?

Regards
Thomas






[PERFORM] Using ctid column changes plan drastically

2012-07-24 Thread Thomas Kellerer

Hi,

I was testing a query to delete duplicates to see how well using ctid works if 
the table doesn't have a unique identifier available.

The table definition is:

create table dupes
(
   id integer primary key,
   first_name text,
   last_name text
);

My test table has 100,000 rows with ~13,000 being actually unique.

The following statement:

DELETE FROM dupes
WHERE id NOT IN (SELECT min(b.id)
 FROM   dupes b
 GROUP BY first_name, last_Name
 HAVING count(*) > 1);

produces a quite nice execution plan:

Delete on public.dupes  (cost=2770.00..4640.00 rows=50000 width=6) (actual time=299.809..299.809 rows=0 loops=1)
  Buffers: shared hit=88100
  ->  Seq Scan on public.dupes  (cost=2770.00..4640.00 rows=50000 width=6) (actual time=150.113..211.340 rows=86860 loops=1)
        Output: dupes.ctid
        Filter: (NOT (hashed SubPlan 1))
        Buffers: shared hit=1240
        SubPlan 1
          ->  HashAggregate  (cost=2620.00..2745.00 rows=10000 width=18) (actual time=115.739..143.004 rows=13140 loops=1)
                Output: min(b.id), b.first_name, b.last_name
                Filter: (count(*) > 1)
                Buffers: shared hit=620
                ->  Seq Scan on public.dupes b  (cost=0.00..1620.00 rows=100000 width=18) (actual time=0.006..15.563 rows=100000 loops=1)
                      Output: b.id, b.first_name, b.last_name
                      Buffers: shared hit=620
Total runtime: 301.241 ms

Now assuming I do not have a unique value in the table. In that case I would 
revert to using the ctid to identify individual rows:

DELETE FROM dupes
WHERE ctid NOT IN (SELECT min(b.ctid)
   FROM   dupes b
   GROUP BY first_name, last_Name
   HAVING count(*) > 1);

Which has a completely different execution plan:

Delete on public.dupes  (cost=2620.00..10004490.00 rows=50000 width=6) (actual time=269966.623..269966.623 rows=0 loops=1)
  Buffers: shared hit=88720
  ->  Seq Scan on public.dupes  (cost=2620.00..10004490.00 rows=50000 width=6) (actual time=176.107..269582.651 rows=86860 loops=1)
        Output: dupes.ctid
        Filter: (NOT (SubPlan 1))
        Buffers: shared hit=1240
        SubPlan 1
          ->  Materialize  (cost=2620.00..2795.00 rows=10000 width=20) (actual time=0.002..0.799 rows=12277 loops=100000)
                Output: (min(b.ctid)), b.first_name, b.last_name
                Buffers: shared hit=620
                ->  HashAggregate  (cost=2620.00..2745.00 rows=10000 width=20) (actual time=131.162..164.941 rows=13140 loops=1)
                      Output: min(b.ctid), b.first_name, b.last_name
                      Filter: (count(*) > 1)
                      Buffers: shared hit=620
                      ->  Seq Scan on public.dupes b  (cost=0.00..1620.00 rows=100000 width=20) (actual time=0.005..29.531 rows=100000 loops=1)
                            Output: b.ctid, b.first_name, b.last_name
                            Buffers: shared hit=620
Total runtime: 269968.515 ms

This is Postgres 9.1.4 64bit on Windows 7

Why does the usage of the CTID column change the plan so drastically?

Regards
Thomas




Re: [PERFORM] The need for clustered indexes to boost TPC-V performance

2012-07-04 Thread Thomas Kellerer

Craig Ringer, 04.07.2012 07:43:


I'm not sure what the best option for getting a 9.2 beta build for Windows is.


Download the ZIP from here:

http://www.enterprisedb.com/products-services-training/pgbindownload

Unzip, initdb, pg_ctl start

Regards
Thomas






[PERFORM] Re: Could synchronous streaming replication really degrade the performance of the primary?

2012-05-10 Thread Thomas Kellerer

MauMau, 10.05.2012 13:34:

Today, they told me that they ran the test on two virtual machines on
a single physical machine.


Which means that both databases shared the same I/O system (hard disks).
Therefore it's not really surprising that the overall performance goes down if
you increase the I/O load.

A more realistic test (at least in my opinion) would have been to have two
separate computers with two separate I/O systems.






Re: [PERFORM] Result Set over Network Question

2012-05-07 Thread Thomas Kellerer

Robert Klemme, 07.05.2012 15:44:

http://www.oracle.com/technetwork/developer-tools/sql-developer/overview/index.html


SQL Developer does not support PostgreSQL


Last time I checked (quite a while ago) you could use arbitrary JDBC
drivers.  There's also
http://docs.oracle.com/cd/E25259_01/appdev.31/e24285/intro.htm#sthref306

And this seems to indicate that it's still the case: "[...] or another
third-party driver. [...]
JDBC URL (Other Third Party Driver): URL for connecting directly from
Java to the database; overrides any other connection type
specification."
http://docs.oracle.com/cd/E25259_01/appdev.31/e24285/dialogs.htm#BACDGCIA

I assume Oracle is not interested in aggressively advertising this
feature though.


That seems to be a documentation bug.
I tried it, and it definitely does not work (or I am missing something).

Their release notes at: 
http://www.oracle.com/technetwork/developer-tools/sql-developer/sqldev31-ea-relnotes-487612.html


state:

Third Party Databases

SQL Developer supports IBM DB2 UDB LUW, Microsoft SQL Server and Microsoft
Access, MySQL, Sybase Adaptive Server and Teradata.
See Supported Platforms for details on all third party database releases 
supported.


Regards
Thomas




Re: [PERFORM] Result Set over Network Question

2012-05-07 Thread Thomas Kellerer

Robert Klemme, 07.05.2012 14:03:

Alternative tools for JDBC tests:

http://www.oracle.com/technetwork/developer-tools/sql-developer/overview/index.html


SQL Developer does not support PostgreSQL

This page:

http://wiki.postgresql.org/wiki/Community_Guide_to_PostgreSQL_GUI_Tools

also lists several JDBC based tools.

Thomas







Re: [PERFORM] Tuning Postgres 9.1 on Windows

2012-05-01 Thread Thomas Kellerer

Walker, James Les wrote on 01.05.2012 16:44:

I installed the enterprisedb distribution and immediately saw a 400% 
performance increase.


What exactly is "the enterprisedb distribution"?
Are you talking about the Advanced Server?

I would be very surprised if the code bases differed so much as to allow such a
performance gain.
Could it be that the default settings of the Advanced Server are different from
those of the "community edition"?

And what did you have installed before that? (as the Windows binaries are
always distributed by EnterpriseDB)

Thomas




Re: [PERFORM] Tuning Postgres 9.1 on Windows

2012-04-30 Thread Thomas Kellerer

Merlin Moncure wrote on 30.04.2012 23:43:

Try turning off fsync in postgresql.conf to be sure.


This is dangerous advice.
Turning off fsync can potentially corrupt the database in case of a system
failure (e.g. a power outage).








Re: [PERFORM] query optimization

2012-04-26 Thread Thomas Kellerer

Tom Lane wrote on 26.04.2012 21:17:

Richard Kojedzinszky  writes:

Dear list,
We have a database schema, which looks the same as the attached script.



When filling the tables with data, and skipping analyze on the table (so
pg_stats contains no records for table 'a'), the first select in the
script runs fast, but after an analyze the planner decides to sequence
scan tables b and c, thus making the query much slower. Can somebody help
me solving this issue, or tuning our installation to not to use sequence
scans in this case?


Um ... did you analyze all the tables, or just some of them?  I get
sub-millisecond runtimes if all four tables have been analyzed, but it
does seem to pick lousy plans if, say, only a and b have been analyzed.



Here it's similar to Richard's experience:

Before analyzing the four tables, the first statement yields this plan:

Merge Left Join  (cost=504.89..2634509.91 rows=125000000 width=16) (actual time=0.103..0.108 rows=1 loops=1)
  Merge Cond: (a.b = b.id)
  ->  Sort  (cost=504.89..506.14 rows=500 width=8) (actual time=0.043..0.043 rows=1 loops=1)
        Sort Key: a.b
        Sort Method: quicksort  Memory: 17kB
        ->  Bitmap Heap Scan on a  (cost=12.14..482.47 rows=500 width=8) (actual time=0.028..0.029 rows=1 loops=1)
              Recheck Cond: (id = 4)
              ->  Bitmap Index Scan on a_idx1  (cost=0.00..12.01 rows=500 width=0) (actual time=0.021..0.021 rows=1 loops=1)
                    Index Cond: (id = 4)
  ->  Materialize  (cost=0.00..884002.52 rows=50000000 width=8) (actual time=0.041..0.057 rows=5 loops=1)
        ->  Merge Join  (cost=0.00..759002.52 rows=50000000 width=8) (actual time=0.037..0.051 rows=5 loops=1)
              Merge Cond: (b.id = c.id)
              ->  Index Scan using b_idx1 on b  (cost=0.00..4376.26 rows=100000 width=4) (actual time=0.016..0.018 rows=5 loops=1)
              ->  Materialize  (cost=0.00..4626.26 rows=100000 width=4) (actual time=0.017..0.022 rows=5 loops=1)
                    ->  Index Scan using c_idx1 on c  (cost=0.00..4376.26 rows=100000 width=4) (actual time=0.014..0.017 rows=5 loops=1)
Total runtime: 0.209 ms

This remains the plan for about 10-15 repetitions, then it changes to this one:

Hash Right Join  (cost=2701.29..6519.30 rows=1 width=16) (actual time=79.604..299.227 rows=1 loops=1)
  Hash Cond: (b.id = a.b)
  ->  Hash Join  (cost=2693.00..6136.00 rows=100000 width=8) (actual time=79.550..265.251 rows=100000 loops=1)
        Hash Cond: (b.id = c.id)
        ->  Seq Scan on b  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.011..36.158 rows=100000 loops=1)
        ->  Hash  (cost=1443.00..1443.00 rows=100000 width=4) (actual time=79.461..79.461 rows=100000 loops=1)
              Buckets: 16384  Batches: 1  Memory Usage: 2735kB
              ->  Seq Scan on c  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.010..32.930 rows=100000 loops=1)
  ->  Hash  (cost=8.28..8.28 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 1kB
        ->  Index Scan using a_idx1 on a  (cost=0.00..8.28 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1)
              Index Cond: (id = 4)
Total runtime: 299.564 ms

(I guess autovacuum kicked in, because that is the same plan I get when running
analyze on all four tables right after populating them.)

And the second one yields this plan (regardless of analyze or not):

QUERY PLAN
Nested Loop Left Join  (cost=0.00..16.89 rows=1 width=16) (actual 
time=0.027..0.031 rows=1 loops=1)
  ->  Nested Loop Left Join  (cost=0.00..16.57 rows=1 width=12) (actual 
time=0.020..0.022 rows=1 loops=1)
->  Index Scan using a_idx1 on a  (cost=0.00..8.28 rows=1 width=8) 
(actual time=0.011..0.012 rows=1 loops=1)
  Index Cond: (id = 4)
->  Index Scan using b_idx1 on b  (cost=0.00..8.28 rows=1 width=4) 
(actual time=0.004..0.005 rows=1 loops=1)
  Index Cond: (a.b = id)
  ->  Index Scan using c_idx1 on c  (cost=0.00..0.31 rows=1 width=4) (actual 
time=0.004..0.005 rows=1 loops=1)
Index Cond: (b.id = id)
Total runtime: 0.104 ms


My version:
PostgreSQL 9.1.3, compiled by Visual C++ build 1500, 32-bit
Running on Windows XP SP3

shared_buffers = 768MB
work_mem = 24MB 
effective_cache_size = 1024MB

All other (relevant) settings are on defaults

Regards
Thomas







Re: [PERFORM] Update join performance issues

2012-04-03 Thread Thomas Kellerer

Kevin Kempter wrote on 03.04.2012 19:29:

Hi All;

I have a query that wants to update a table based on a join like this:

update test_one
set f_key = t.f_key
from
upd_temp1 t,
test_one t2
where
t.id_number = t2.id_number

upd_temp1 has 248,762 rows
test_one has 248,762 rows



To extend on what Kevin has already answered:

Quote from the manual:
  "Note that the target table must not appear in the from_list, unless you intend a 
self-join"





[PERFORM] Re: PG as in-memory db? How to warm up and re-populate buffers? How to read in all tuples into memory?

2012-02-26 Thread Thomas Kellerer

Stefan Keller wrote on 26.02.2012 01:16:

2. Are there any hints on how to tell Postgres to read in all table
contents into memory?


What about creating a tablespace on a RAM filesystem (tmpfs), then creating a
second schema in your database where all tables are located in that "temp"
tablespace?

Then upon startup (or using triggers) you could copy all data from the
persistent tables to the memory tables.

It would probably make sense to change the value of random_page_cost for that
tablespace to 1.

I'm not sure though how PostgreSQL handles a system restart with tables on a
tablespace that might not be there.
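
The setup I have in mind - only a sketch, all names made up:

   create tablespace ram_space location '/mnt/ramdisk/pg_data';
   create schema mem;

   -- a copy of the persistent table, placed in the RAM tablespace:
   create table mem.lookup_data (like public.lookup_data including all)
     tablespace ram_space;

   alter tablespace ram_space set (random_page_cost = 1);

   insert into mem.lookup_data
   select * from public.lookup_data;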

Thomas









Re: [PERFORM] PostgreSQL Parallel Processing !

2012-01-27 Thread Thomas Kellerer

sridhar bamandlapally, 27.01.2012 05:31:

SQL> explain plan for select * from hr.emp ;
Explained.
PLAN
--
| Id  | Operation | Name | Rows  | Bytes | Cost (%CPU)| Time |
--
|   0 | SELECT STATEMENT  |  |  7444K|   944M| 16077   (4)| 00:03:13 |
|   1 |  TABLE ACCESS FULL| EMP  |  7444K|   944M| 16077   (4)| 00:03:13 |
--
*WITH PARALLEL Option*
SQL> explain plan for select /*+parallel(emp,4)*/ * from hr.emp ;
Explained.
PLAN
-
| Id  | Operation| Name | Rows  | Bytes | Cost (%CPU)| Time 
|
-
|   0 | SELECT STATEMENT |  |  7444K|   944M|  4442   (3)| 00:00:54 
|
|   1 |  PX COORDINATOR  |  |   |   ||  
|
|   2 |   PX SEND QC (RANDOM)| :TQ1 |  7444K|   944M|  4442   (3)| 00:00:54 
|
|   3 |PX BLOCK ITERATOR |  |  7444K|   944M|  4442   (3)| 00:00:54 
|
|   4 | TABLE ACCESS FULL| EMP  |  7444K|   944M|  4442   (3)| 00:00:54 
|
-

In the above plan ( WITH PARALLEL Option )
1. "Cost" has been nearly reduced to 1/4th
2. "CPU" has been reduced
3. "Time" has been nearly reduced to 1/3rd


I have *never* seen the "time" column in the explain plan output come anywhere 
near the actual execution time in Oracle.






[PERFORM] Re: Hash index use presently(?) discouraged since 2005: revive or bury it?

2011-09-19 Thread Thomas Kellerer

Robert Klemme, 19.09.2011 13:13:

On Sun, Sep 18, 2011 at 9:31 PM, Stefan Keller  wrote:

I'm simply referring to literature (like the intro by Ramakrishnan & Gehrke).
I just know that Oracle and MySQL actually do have them too and use them
without those implementation-specific restrictions currently found in
Postgres.


Where exactly did you get the idea that Oracle has hash indexes?  I
can't seem to find them:
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/indexiot.htm#sthref293

Are you mixing this up with hash partitioning?
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/schemaob.htm#sthref443

Or am I missing something?


Maybe he was referring to a hash cluster:
http://download.oracle.com/docs/cd/E11882_01/server.112/e17118/statements_5001.htm

This is a storage option where you can store related rows (e.g. in a
parent/child relationship) in the same physical database block based on a hash
value. That enables the database to read parent and child rows with just a
single IO.

In the background Oracle probably has something like a hash index to support 
that.

Thomas




[PERFORM] Re: PostgreSQL-related topics of theses and seminary works sought (Was: Hash index use presently(?) discouraged...)

2011-09-19 Thread Thomas Kellerer

Stefan Keller, 17.09.2011 22:01:

I'm also interested in such proposals or ideas!

Here's some list of topics:
* Time in PostgreSQL
* Fast Bulk Data Inserting in PostgreSQL with Unlogged tables


I don't understand these two items. Postgres does have a time data type, and it
has had unlogged tables since 9.1.
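
E.g.:

   -- 9.1+: changes to this table bypass the WAL, which speeds up bulk loads
   -- (at the price of losing the content after a crash)
   create unlogged table staging_data
   (
     id integer,
     payload text
   );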

Regards
Thomas




Re: [PERFORM] Why Index is not used

2011-03-25 Thread Thomas Kellerer

Adarsh Sharma, 25.03.2011 07:51:


Thanks Andreas, I was about to print the output but it takes too much time.

Below is the output of explain analyze command :
pdc_uima=# explain analyze select c.clause, s.* from clause2 c, svo2 s where 
c.clause_id=s.clause_id and s.doc_id=c.source_id and c.
pdc_uima-# sentence_id=s.sentence_id ;
QUERY PLAN

Merge Join  (cost=5673831.05..34033959.87 rows=167324179 width=2053) (actual time=216281.162..630721.636 rows=30473117 loops=1)
  Merge Cond: ((s.clause_id = c.clause_id) AND (s.doc_id = c.source_id) AND (s.sentence_id = c.sentence_id))
  ->  Index Scan using idx_svo2 on svo2 s  (cost=0.00..24489343.65 rows=27471560 width=1993) (actual time=0.130..177599.310 rows=27471560 loops=1)
  ->  Materialize  (cost=5673828.74..6071992.29 rows=31853084 width=72) (actual time=216280.596..370507.452 rows=52037763 loops=1)
        ->  Sort  (cost=5673828.74..5753461.45 rows=31853084 width=72) (actual time=216280.591..324707.956 rows=31853083 loops=1)
              Sort Key: c.clause_id, c.source_id, c.sentence_id
              Sort Method: external merge  Disk: 2616520kB
              ->  Seq Scan on clause2 c  (cost=0.00..770951.84 rows=31853084 width=72) (actual time=0.025..25018.665 rows=31853083 loops=1)
Total runtime: 647804.037 ms
(9 rows)



How many rows are there in clause2 in total?

31853084 rows are returned from that table, which sounds like the whole table
qualifies for the join condition.

Regards
Thomas




Re: [PERFORM] Disabling nested loops - worst case performance

2011-03-18 Thread Thomas Kellerer

Anssi Kääriäinen, 18.03.2011 08:15:

Hello list,

I am working on a Entity-Attribute-Value (EAV) database using
PostgreSQL 8.4.7. The basic problem is that when joining multiple
times different entities the planner thinks that there is vastly less
rows to join than there is in reality and decides to use multiple
nested loops for the join chain.


Did you consider using hstore instead?

I think in the PostgreSQL world this is a better alternative to EAV, and most
probably faster as well.
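
A hedged sketch, assuming the hstore contrib module is installed (names made
up):

   create table entity
   (
     id integer primary key,
     attrs hstore
   );

   insert into entity values (1, 'color => red, size => XL');

   select *
   from entity
   where attrs -> 'color' = 'red';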

Regards
Thomas




Re: [PERFORM] Which gives good performance? separate database vs separate schema

2010-11-25 Thread Thomas Kellerer

Divakar Singh, 25.11.2010 12:37:

Hello Friends,
I have many instances of my software running on a server (Solaris SPARC). Each 
software instance requires some DB tables (same DDL for all instances' tables) 
to store data.
It essentially means that some processes from each instance of the software 
connect to these tables.
Now, should I put these tables in 1 Database's different schemas or in separate 
databases itself for good performance?
I am using libpq for connection.



I don't think it will make a big difference in performance.

The real question is: do you need queries that "cross boundaries"? If that is
the case you have to use schemas, because Postgres does not support
cross-database queries.
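
A small sketch of the schema approach (names made up):

   create schema instance_1;
   create schema instance_2;

   -- each software instance only sees "its" tables:
   set search_path = instance_1;

   -- but a cross-instance query is still possible:
   select count(*) from instance_1.data
   union all
   select count(*) from instance_2.data;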

Regards
Thomas




Re: [PERFORM] Select * is very slow

2010-11-08 Thread Thomas Kellerer

Kevin Grittner, 08.11.2010 18:01:

"shaiju.ck"  wrote:


The table has 200 records now.
Select * from employee takes 15 seconds to fetch the data!!!
Which seems to be very slow.
But when I say select id,name from employee it executes in 30ms.
Same performance if I say select count(*) from employee.


You haven't given nearly enough information for anyone to diagnose
the issues with any certainty.  Earlier responses have asked for
some particularly important information, and I would add a request
to see the output from `VACUUM VERBOSE employee;`.  Beyond that, you
might want to review this page for checks you can make yourself, and
information which you could provide to allow people to give more
informed advice:



Do you really think that VACUUM is the problem? If the OP only selects two
columns it is apparently fast.
If he selects all columns it's slow, so I wouldn't suspect dead tuples here.

My bet is that there are some really large text columns in there...
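
That could be checked with something like:

   select avg(pg_column_size(e.*)) as avg_row_size
   from employee e;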

He has asked the same question here:
http://forums.devshed.com/postgresql-help-21/select-is-very-slow-761130.html

but has also failed to answer the question about the table details...

Regards
Thomas




Re: [PERFORM] Select count(*), the sequel

2010-10-27 Thread Thomas Kellerer

Kenneth Marshall, 27.10.2010 22:41:

Different algorithms have been discussed before. A quick search turned
up:

quicklz - GPL or commercial
fastlz - MIT works with BSD okay
zippy - Google - no idea about the licensing
lzf - BSD-type
lzo - GPL or commercial
zlib - current algorithm

Of these, lzf can compress at almost 3.7X of zlib and decompress at 1.7X,
and fastlz can compress at 3.1X of zlib and decompress at 1.9X. The same
comparison put lzo at 3.0X for compression and 1.8X decompress. The block
design of lzf/fastlz may be useful to support substring access to toasted
data, among other ideas that have been floated here in the past.

Just keeping the hope alive for faster compression.


What about a dictionary-based compression (like DB2 does)?

In a nutshell: it creates a list of "words" on a page. For each word, the
occurrences in the db block are stored and the actual word is removed from the
page/block itself. This covers all rows on a page and can give a very
impressive overall compression.
This compression is not only done on disk but in-memory as well (the page is
loaded together with the dictionary into memory).

I believe Oracle 11 does something similar.

Regards
Thomas




Re: [PERFORM] Sorted group by

2010-08-10 Thread Thomas Kellerer

Matthew Wakeling wrote on 10.08.2010 18:03:

On Tue, 10 Aug 2010, Thomas Kellerer wrote:

No. It's built in (8.4) and it's called Windowing functions:
http://www.postgresql.org/docs/8.4/static/tutorial-window.html
http://www.postgresql.org/docs/8.4/static/functions-window.html

SELECT group, last_value(value) over(ORDER BY number)
FROM table


I may be mistaken, but as I understand it, a windowing function doesn't
reduce the number of rows in the results?


Yes you are right, a bit too quick on my side ;)

But this might be what you are after then:

select group_, value_
from (
  select group_, value_, number_,
         row_number() over (partition by group_ order by number_ desc) as row_num
  from numbers
) t
where row_num = 1
order by group_ desc




Re: [PERFORM] Sorted group by

2010-08-10 Thread Thomas Kellerer

Matthew Wakeling wrote on 10.08.2010 17:40:

Currently, I do this in my application by ordering by the number and
only using the last value. I imagine that this is something that can be
done in the new Postgres 9, with a sorted group by - something like this:

SELECT group, LAST(value, ORDER BY number) FROM table GROUP BY group

Is this something that is already built in, or would I have to write my
own LAST aggregate function?


No. It's built in (8.4) and it's called Windowing functions:
http://www.postgresql.org/docs/8.4/static/tutorial-window.html
http://www.postgresql.org/docs/8.4/static/functions-window.html

SELECT group, last_value(value) over(ORDER BY number)
FROM table

You don't need the GROUP BY then (but you can still apply e.g. an ORDER BY group).

Thomas




Re: [PERFORM] Does FILTER in SEQSCAN short-circuit AND?

2010-05-27 Thread Thomas Kellerer

Craig James wrote on 27.05.2010 23:13:

It would be nice if Postgres had a way to assign a cost to every
function.


Isn't that what the COST parameter is intended for:

http://www.postgresql.org/docs/current/static/sql-createfunction.html
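
E.g. - a hypothetical function:

   create function expensive_check(text)
     returns boolean
   as
   $$
     select $1 ~ 'some.*complicated.*pattern'
   $$
   language sql
   cost 10000;  -- default for a SQL function is 100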

Thomas




Re: [PERFORM] autovacuum strategy / parameters

2010-04-28 Thread Thomas Kellerer

akp geek, 28.04.2010 16:37:

We have 8.4; which of the autovacuum parameters can be set for an individual
table?


All documented here:
http://www.postgresql.org/docs/current/static/sql-createtable.html
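
E.g. something like (values made up):

   alter table big_table
     set (autovacuum_vacuum_scale_factor = 0.01,
          autovacuum_vacuum_threshold = 10000);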





Re: [PERFORM] autovacuum strategy / parameters

2010-04-28 Thread Thomas Kellerer

Rick, 22.04.2010 22:42:


So, in a large table, the scale_factor is the dominant term. In a
small table, the threshold is the dominant term. But both are taken into
account.

The default values are set for small tables; it is not being run for
large tables.


With 8.4 you can adjust the autovacuum settings per table...






Re: [PERFORM] PG optimization question

2010-01-09 Thread Thomas Kellerer

Nickolay wrote on 09.01.2010 11:24:

it would be pretty hard with one table because of blocking


What do you mean by "because of blocking"?

Thomas




Re: [PERFORM] 8.4.1 ubuntu karmic slow createdb

2009-12-28 Thread Thomas Kellerer

Michael Clemmons wrote on 11.12.2009 23:52:

Thanks all, this has been a good help.
I don't have control (or easy control) over unit tests creating/deleting
databases since I'm using the django framework for this job.  Createdb
takes 12 secs on my system (9.10, pg 8.4 and ext4) which is impossibly slow
for running 200 unit tests.


I wonder if you could simply create one database, and then a new schema for 
each of the tests.

After creating the schema you could alter the search_path for the "unit test 
user" and it would look like a completely new database.

Thomas






Re: [PERFORM] Modeling a table with arbitrary columns

2009-10-29 Thread Thomas Kellerer

Andreas Hartmann wrote on 29.10.2009 21:52:

Hi everyone,

I want to model the following scenario for an online marketing application:

Users can create mailings. The list of recipients can be uploaded as 
spreadsheets with arbitrary columns (each row is a recipient). I expect 
the following maximum quantities the DB will contain:


* up to 5000 mailings
* 0-10'000 recipients per mailing, average maybe 2000
* approx. 20 columns per spreadsheet


[...]


I have the feeling that the second approach is cleaner. But since the 
recipient_value table will contain approx. 20 times more rows than the 
recipient table in approach A, I'd expect a performance degradation.


Is there a limit to the number of rows that should be stored in a table? 
With approach B the maximum number of rows could be about 200'000'000, 
which sounds quite a lot …




I don't think the number of rows is that critical (it surely doesn't hit any
"limits"). The question is how you want to access that information and how
quick that has to be. If you need sub-second response times for aggregates over
that, you'll probably have to throw quite some hardware at it.


You could also check out the hstore contrib module which lets you store
key/value pairs in a single column; this might actually be what you are looking
for (note that I have never used it, so I cannot tell how fast it actually is):

http://www.postgresql.org/docs/current/static/hstore.html

So something like

CREATE TABLE recipient (
  mailing integer NOT NULL,
  row integer NOT NULL,
  recipient_values hstore,
  PRIMARY KEY (mailing, row),
  FOREIGN KEY (mailing) REFERENCES mailing (id)
);

Btw: I would get rid of a column named "row"; this is more of a "recipient_id",
but that is just personal taste.
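
Usage would then look something like this (values made up; "row" would need
quoting anyway, as ROW is a reserved word):

   insert into recipient (mailing, "row", recipient_values)
   values (1, 1, 'email => "john@example.com", city => Berlin');

   select recipient_values -> 'email' as email
   from recipient
   where mailing = 1;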

Regards
Thomas




Re: [PERFORM] Getting a random row

2009-10-13 Thread Thomas Kellerer

Shaul Dar, 13.10.2009 17:17:

Also PG does not have a concept of an auto-increment pseudo-column
like Oracle's "rownum". Any suggestions?


Yes it does (at least in 8.4):

SELECT row_number() over(), the_other_columns...
FROM your_table

So you could do something like:

SELECT *
FROM (
  SELECT row_number() over() as rownum,
         the_other_columns...
  FROM your_table
) t
WHERE t.rownum = a_random_integer_value_lower_than_rowcount;


Thomas






Re: [PERFORM] Postgres Clustering

2009-05-27 Thread Thomas Kellerer

Alan McKay wrote on 27.05.2009 19:57:
> What options are available?

I guess a good starting point is the Postgres Wiki:

http://wiki.postgresql.org/wiki/Replication%2C_Clustering%2C_and_Connection_Pooling




[PERFORM] L

2009-05-27 Thread Thomas Kellerer

ramasubramanian, 27.05.2009 08:42:

How to insert or update a file in a table using a query in Postgres?

CREATE TABLE excel_file_upload
(
  user_id integer,
  excel_file bytea
);

example

insert into excel_file_upload values(1, file1)

file1 can be any file: *.doc, *.xls
How can I do this (without using Java or any other language), using only a
query?


If you are restricted to psql then I guess the only way is the solution shown
by Albe - but I guess that only works if the file is on the server.

Some of the GUI SQL tools can handle blob "uploads" from the client. So if you
are willing to switch to a different SQL client, this could be done without
programming.


My own tool available at http://www.sql-workbench.net can either do that 
through a GUI dialog or as part of an extended SQL syntax that is of course 
specific to that application.

Thomas




Re: [PERFORM] performance for high-volume log insertion

2009-04-23 Thread Thomas Kellerer

Stephen Frost wrote on 22.04.2009 23:51:

What about 4 individual prepared inserts?  Just curious about it.



4 inserts, one prepared statement each (constructing the prepared
statement only once), in a single transaction: 1.68s

I'm surprised that there's any win here at all.


For a single column table, I wouldn't expect much either.  With more
columns I think it would be a larger improvement.


Out of curiosity I did some tests through JDBC.

Using a single-column (integer) table, re-using a prepared statement took about
7 seconds to insert 100,000 rows with JDBC's batch interface and a batch size
of 1000.


Using a prepared statement that had 1000 (?) groups after the INSERT (insert
into foo values (?), (?), ...), the insert took about 0.8 seconds. Quite an
improvement, I'd say.


Then I switched to a three column table (int, varchar(500), varchar(500)).

Inserting using a prepared statement with batching (the first scenario) now
took ~8.5 seconds, whereas the multi-value insert took ~3 seconds. So the
difference got smaller, but was still quite substantial. This was inserting
relatively small strings (~20 characters) into the table.


When increasing the size of the inserted strings, things began to change. When I 
bumped the length of the strings to 70 and 200 characters, the multi-value 
insert slowed down considerably. Both solutions now took around 9 seconds.


The multi-value solution ranged between 7 and 9 seconds, whereas the "regular" 
insert syntax was pretty constant at roughly 9 seconds (I ran it about 15 times).


So it seems that as the size of the row increases, the multi-value insert loses
its head start compared to the "regular" insert.


I also played around with batch size. Going beyond 200 didn't make a big 
difference.


For the JDBC batch, the batch size was the number of rows after which I called
executeBatch(); for the multi-value insert, it was the number of tuples I sent
in a single INSERT statement.


The multi-value statement seems to perform better with lower "batch" sizes 
(~10-50) whereas the JDBC batching seems to be fastest with about 200 statements 
per batch.
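
For reference, the "multi-value" flavor is plain SQL (shape only, table name
made up):

   insert into log_entry (id, message, detail)
   values
     (1, 'message 1', 'detail 1'),
     (2, 'message 2', 'detail 2'),
     (3, 'message 3', 'detail 3');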



Thomas

