Re: [PERFORM] Thousands databases or schemas

2012-11-09 Thread Merlin Moncure
On Thu, Nov 8, 2012 at 3:36 AM, Denis soc...@gmail.com wrote:
 We have a web application where we create a schema or a database with a
 number of tables in it for each customer. Now we have about 2600 clients.

 The problem we encountered using a separate DB for each client is that
 creating a new DB can take up to 2 minutes, which is absolutely
 unacceptable. Using schemas instead (one DB with a number of schemas
 containing similar tables in it) solved this problem (schemas are created
 in a couple of seconds), but created two other blocking points:

Sure: db creation can be a bear, particularly on servers already under
load; it's i/o intensive.  I think you made the right choice: it's not
a good idea to create databases via user input (but if I *had* to do
that, I would probably be pre-creating them).
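
For instance, a rough sketch of the pre-creation idea (the pool and
template names here are hypothetical, and ALTER DATABASE ... RENAME TO
only works while nobody is connected to the database being renamed):

  -- done ahead of time, outside the signup path:
  CREATE DATABASE pool_0001 TEMPLATE app_template;
  CREATE DATABASE pool_0002 TEMPLATE app_template;

  -- at signup time, hand one out instead of running CREATE DATABASE:
  ALTER DATABASE pool_0001 RENAME TO customer_acme;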

 1. sometimes creation of a new table in a schema takes up to 5 seconds. In
 cases where we have to create up to 40 tables in a schema, this takes way
 too much time.

How are you creating the tables?  What's the iowait on the server in
situations like this?  If the file system is binding you here, there's
not a lot you can do other than to pre-create the tables or improve i/o
performance.
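
For example, one quick way to watch iowait while you reproduce the slow
CREATE TABLE (assuming the sysstat package is installed on the server):

  iostat -x 1
  # or, coarser but almost always available (watch the 'wa' column):
  vmstat 1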


 2. pg_dump -n schema_name db_name takes from 30 to 60 seconds, no matter
 how big the amount of data in the schema is. Also, dumping only the table
 structure takes at least 30 seconds. Based on this topic
 http://postgresql.1045698.n5.nabble.com/pg-dump-and-thousands-of-schemas-td5709766i60.html,
 pg_dump always analyses ALL the tables in the DB, i.e. in my case more than
 100 000 tables.

That may be correct.  To prove it, try:

pg_dump -s -n schema_name db_name

where '-s' is the switch to dump only the schema definition, no data.  If
things are still slow, try logging queries from pg_dump (maybe enable
log_min_duration_statement if you can) and maybe something turns up
that can be optimized.  Once you have the query, explain analyze it and
post it to the list.  PostgreSQL has been continuously improving in
terms of bulk table handling over the years and there may be some low
hanging fruit there.  There may even be some simple database tweaks
you can make to improve things without code changes.
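
A minimal way to do that, assuming you can change settings for the
database (the parameter needs superuser rights) and can tolerate the
extra log volume for a short test:

  -- log the duration of every statement issued against this database:
  ALTER DATABASE db_name SET log_min_duration_statement = 0;

  pg_dump -s -n schema_name db_name > /dev/null

  -- reset afterwards, then explain analyze whichever catalog query the
  -- server log shows as slow:
  ALTER DATABASE db_name RESET log_min_duration_statement;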

Magnus already answered this pretty well, but I'd like to add: the
database engine scales pretty well to large numbers of tables, but in
particular cases the tools do not.  The reason for this is that the
engine deals with internal optimized structures while the tools have
to do everything over SQL.  That said, there may be some low hanging
optimization fruit; it's a lot easier to hack on client-side tools than
on the backend.

 I know you guys will ask me about selecting this particular application
 architecture.

Sharding by schema is pretty common actually. 6000 schemas holding
tables is a lot -- so the question you should be getting is 'given the
current state of affairs, have you considered distributing your
clients across more than one server?'  What if you suddenly sign
10,000 more clients?

 Different clients have different activity rates and we can select
 different backup strategies accordingly. This would be impossible if we
 kept all the clients' data in one table.
 Besides all of the above, the probability of massive data corruption
 (if an error in our web application occurs) is much higher.

Sure -- all large databases struggle with backups once the brute force
dump starts to become impractical.  The way forward is to explore
various high availability options -- PITR, HS/SR, etc.  Distributing
the backup load across shards is also good, as long as you're rigorous
about not involving any shared structures.
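
As a starting point, the usual PITR-style setup is just a couple of
settings plus a base backup (a sketch only; the archive and backup paths
are placeholders):

  # postgresql.conf
  wal_level = archive
  archive_mode = on
  archive_command = 'cp %p /path/to/wal_archive/%f'

  # then take a base backup to seed the recovery chain:
  pg_basebackup -D /path/to/base_backup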

 Not to start a holy war, but FYI: in a similar project where we used
 MySQL, we now have about 6000 DBs and everything works like a charm.

No worries.  Postgres schemas are fancier than MySQL databases, and
this is one of those things where extra features really do impact
performance :-).

merlin




Re: [PERFORM] Increasing work_mem and shared_buffers on Postgres 9.2 significantly slows down queries

2012-11-09 Thread Andres Freund
On 2012-10-30 14:08:56 -0500, Petr Praus wrote:
 select count(*) from contest c
 left outer join contestparticipant cp on c.id=cp.contestId
 left outer join teammember tm on tm.contestparticipantid=cp.id
 left outer join staffmember sm on cp.id=sm.contestparticipantid
 left outer join person p on p.id=cp.personid
 left outer join personinfo pi on pi.id=cp.personinfoid
 where pi.lastname like '%b%' or pi.firstname like '%a%';

Btw, not really related to the question, but the way you use left joins
here doesn't really make sense and does lead to inferior plans.
Since you restrict on 'pi', the rightmost table in a chain of left joins,
there is no point in all those left joins. I would guess the overall
plan is better if you use straight joins.
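
For illustration, the joins the WHERE clause already forces to match
could be written as plain joins (a sketch only; tm, sm and p can be made
plain joins too only if a matching row always exists for them):

select count(*) from contest c
join contestparticipant cp on c.id=cp.contestId
join personinfo pi on pi.id=cp.personinfoid
left outer join teammember tm on tm.contestparticipantid=cp.id
left outer join staffmember sm on cp.id=sm.contestparticipantid
left outer join person p on p.id=cp.personid
where pi.lastname like '%b%' or pi.firstname like '%a%';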

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PERFORM] pg_dump and thousands of schemas

2012-11-09 Thread Jeff Janes
On Thu, Nov 8, 2012 at 1:04 AM, Denis soc...@gmail.com wrote:

 Still I can't understand why pg_dump has to know about all the tables?

Strictly speaking it probably doesn't need to.  But it is primarily
designed for dumping entire databases, and the efficient way to do
that is to read it all into memory in a few queries and then sort out
the dependencies, rather than tracking down every dependency
individually with one or more trips back to the database.  (Although
it still does make plenty of trips back to the database per
table/sequence, for ACLs, defaults, and attributes.)
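
For a rough illustration (this is not pg_dump's actual query text), the
kind of whole-catalog read it leans on pulls every table, sequence and
view in the database in one pass:

select c.oid, n.nspname, c.relname, c.relkind
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where c.relkind in ('r', 'S', 'v');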

If you were to rewrite pg_dump from the ground up to meet your
specific needs (dumping one schema, with no dependencies on other
schemata), you could probably make it much more efficient.  But
then it wouldn't be pg_dump, it would be something else.

Cheers,

Jeff

