Re: [GENERAL] Many thousands of partitions

2013-10-09 Thread Grzegorz Tańczyk

Hello,

Thanks for sharing your experiences with the problem.

On 2013-10-09 00:47, Gabriel E. Sánchez Martínez wrote:
Partitioning seems to be a good idea if a single table would be too big 
to fit in your server's file cache, and also for management, since you 
can drop partitions of old data and create new ones without having to 
reindex and lock.  Does your data partition nicely by date, for 
example?  If most of the inserts are new data and old data is 
read-mostly, then partitioning may make sense because you would not 
have to reindex old partitions.  In fact, you could very well not have 
an index on the hot, write-mostly partition of, say, the current 
month, until the write activity on that table diminishes, which would 
make inserts faster.  If, on the other hand, your writes are scattered 
across many partitions, a single large table with an index may be a 
better solution. 


Changes are scattered, so a single large table is already not a good 
solution. I like the idea of a hot, write-mostly partition, because I 
might as well use only two partitions and merge changes from the small 
table into the large one once the processing is done.
Rows are grouped by some key, and when I start processing a group I 
could move all of its rows from the large table in a batch (INSERT INTO .. 
SELECT .. WHERE group=x; DELETE FROM .. WHERE group=x). This way the 
read-only part of the system will continue to work without problems and 
processing should be much faster.
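
For concreteness, a minimal sketch of that batch move (table and column 
names such as big_table, hot_table and group_id are made up here, not 
from the actual schema); on 8.3, which has no writable CTEs, it has to be 
two statements inside one transaction:

  BEGIN;

  -- copy the group's rows into the small, write-mostly working table
  INSERT INTO hot_table
  SELECT * FROM big_table WHERE group_id = 42;

  -- remove them from the large table; readers see a consistent picture
  -- because both steps become visible together at COMMIT
  DELETE FROM big_table WHERE group_id = 42;

  COMMIT;

Once processing of the group is finished, the same pattern in the 
opposite direction merges the results back into the large table.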


This will not solve the problem of never-ending vacuums on the large 
table, though, and Postgres could easily become the bottleneck. I am 
using 8.3 for this, but I will make an upgrade at some point; however, I 
don't think it will change the design.


Thanks

--
Regards,
  Grzegorz





[GENERAL] Many thousands of partitions

2013-10-08 Thread Grzegorz Tańczyk

Hello,

I have a question regarding one of the caveats in the docs:
http://www.postgresql.org/docs/8.3/static/ddl-partitioning.html

Partitioning using these techniques will work well with up to perhaps a 
hundred partitions; don't try to use many thousands of partitions.


What's the alternative? Could nested partitioning do the trick? I have 
millions of rows (numbers, timestamps and text (4 KB)) which are frequently 
updated, and there are also frequent inserts. Partitioning was my first 
thought for a solution to this problem. I want to avoid long-lasting 
locks, index rebuild problems and never-ending vacuums.
Write performance may be low, as long as at the same time I have no problem 
selecting single rows using the primary key (bigint). Partitioning seems to 
be the solution, but I'm sure I will end up with several thousands of 
automatically generated partitions.


Thanks

--
Regards,
  Grzegorz


Re: [GENERAL] Many thousands of partitions

2013-10-08 Thread bricklen
On Tue, Oct 8, 2013 at 8:23 AM, Grzegorz Tańczyk golia...@polzone.pl wrote:

  Hello,

 I have a question regarding one of the caveats in the docs:
 http://www.postgresql.org/docs/8.3/static/ddl-partitioning.html


If you are using PostgreSQL 8.3, then you should consider upgrading to 9.3
instead.


 Partitioning using these techniques will work well with up to perhaps a
 hundred partitions; don't try to use many thousands of partitions.

 What's the alternative? Could nested partitioning do the trick?


Nested partitioning will have the same problems, if not more. The query
planner might come up with suboptimal plans depending on how many nested
partitions there are.


 I have millions of rows (numbers, timestamps and text (4 KB)) which are
 frequently updated, and there are also frequent inserts. Partitioning was my
 first thought for a solution to this problem. I want to avoid long-lasting
 locks, index rebuild problems and never-ending vacuums.
 Write performance may be low, as long as at the same time I have no problem
 selecting single rows using the primary key (bigint). Partitioning seems to
 be the solution, but I'm sure I will end up with several thousands of
 automatically generated partitions.



I can speak from painful experience: just recently we had a project where a
development team went ahead and partitioned about 900 tables, resulting in
almost 80 thousand tables. It was almost comical that every single query
went from sub-second to tens of seconds, and a pg_dump of an *empty*
database would take longer than an hour. This was on sandbox servers so the
hardware was not production grade, but it was an excellent way to get the
point across that too many partitions can crush performance.


Re: [GENERAL] Many thousands of partitions

2013-10-08 Thread Gabriel E. Sánchez Martínez


On 10/08/2013 11:23 AM, Grzegorz Tańczyk wrote:

Hello,

I have a question regarding one of the caveats in the docs:
http://www.postgresql.org/docs/8.3/static/ddl-partitioning.html

Partitioning using these techniques will work well with up to perhaps 
a hundred partitions; don't try to use many thousands of partitions.


What's the alternative? Could nested partitioning do the trick? I have 
millions of rows (numbers, timestamps and text (4 KB)) which are 
frequently updated, and there are also frequent inserts. Partitioning 
was my first thought for a solution to this problem. I want to avoid 
long-lasting locks, index rebuild problems and never-ending vacuums.
Write performance may be low, as long as at the same time I have no 
problem selecting single rows using the primary key (bigint). 
Partitioning seems to be the solution, but I'm sure I will end up with 
several thousands of automatically generated partitions.

I had the same question a short while back and actually tested some of 
it.  My test data set was an unpartitioned table with about 432 
million rows, taking about 46 GB.  In my case the data partitions 
nicely by date.  The table had about a year of data, spread over two 
years, and I was curious to see how much partitioning could improve query 
speeds vs the unpartitioned table with an index on a timestamp field.  I 
tested partitioning by month (24 partitions) and also by day (over 700 
partitions), in both cases using non-overlapping CHECK constraints to 
enable constraint exclusion.  Both partitioning schemes worked 
considerably faster than the single table (on the order of 200 ms vs 
2500 ms), and were not significantly far from each other in 
performance.  Since more partitions translate to more management hassle, 
I concluded that partitioning by month was the best option.
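
For reference, a minimal sketch of that month-per-partition layout using 
the 8.3 inheritance technique (table and column names here are 
illustrative, not taken from the test):

  CREATE TABLE measurements (
      id        bigint    NOT NULL,
      taken_at  timestamp NOT NULL,
      payload   text
  );

  CREATE TABLE measurements_2013_09 (
      CHECK (taken_at >= '2013-09-01' AND taken_at < '2013-10-01')
  ) INHERITS (measurements);

  CREATE TABLE measurements_2013_10 (
      CHECK (taken_at >= '2013-10-01' AND taken_at < '2013-11-01')
  ) INHERITS (measurements);

  -- with constraint exclusion enabled, the planner can skip partitions
  -- whose CHECK constraint rules them out for a given WHERE clause
  SET constraint_exclusion = on;
  EXPLAIN SELECT * FROM measurements
  WHERE taken_at >= '2013-10-05' AND taken_at < '2013-10-06';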


I also tested the partitioning idea with a script that created a partition 
per day for 20 years.  Most of the tables were empty because I used the 
same data for this test.  I found that CREATE and DROP took very long, 
while SELECT queries (over more than 7000 partitions) were still 
faster than on the single table, but a bit slower than the schemes with 
only a few partitions.  With more complex SELECT queries the performance 
might be much worse.


I also tested the nested partitioning idea on partitions by day over two 
years, the idea being that since CHECK constraints are inherited, the 
query planner should be able to exclude sets of tables at once, making 
thousands of partitions feasible.  I was not able to detect a 
significant change in the time it took to SELECT on the tables for two 
years, so I tested it with tables for 20 and 30 years.  If nested 
partitioning worked, having thousands of partitions would not 
appreciably slow down the query, but it did a bit, so it is clear that 
the query planner still scans all partitions down the inheritance tree.  
Perhaps this is because one can override the default behavior of 
inheriting CHECK constraints, so a child table does not necessarily 
observe a parent's CHECK.  I suppose that with a bit of bookkeeping the 
query planner could exploit nested partitioning, but that is not how it 
currently behaves.  Maybe this could be considered as a future 
performance feature of PostgreSQL, but it would only be useful in big 
data applications with several TB of archival data.
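
To illustrate what is meant by nested partitioning here, a hypothetical 
two-level inheritance tree (names are illustrative): day tables inherit 
from a year table, which in turn inherits from the root, with CHECK 
constraints at both levels:

  CREATE TABLE events (
      id        bigint    NOT NULL,
      taken_at  timestamp NOT NULL,
      payload   text
  );

  -- year-level parent
  CREATE TABLE events_2013 (
      CHECK (taken_at >= '2013-01-01' AND taken_at < '2014-01-01')
  ) INHERITS (events);

  -- day-level leaf
  CREATE TABLE events_2013_10_08 (
      CHECK (taken_at >= '2013-10-08' AND taken_at < '2013-10-09')
  ) INHERITS (events_2013);

In principle the year-level CHECK could let the planner discard a whole 
year of day tables at once, but as observed above the planner still 
expands and tests every leaf individually, so the number of leaves, 
rather than the depth of the tree, drives planning time.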


I was not using INSERT triggers to distribute data across partitions.  
Instead I left that logic to a Java application that relied on the 
naming convention for tables.  I imagine that things can slow down 
considerably as the number of partitions increases if you are inserting 
data with triggers.
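
For comparison, a sketch of the trigger-based routing the documentation 
describes, assuming the monthly measurements_* partitions sketched 
earlier in the thread; each insert pays for a function call plus one 
branch per partition, which is the overhead avoided by routing in the 
application:

  CREATE OR REPLACE FUNCTION measurements_insert_router() RETURNS trigger AS $$
  BEGIN
      IF NEW.taken_at >= '2013-10-01' AND NEW.taken_at < '2013-11-01' THEN
          INSERT INTO measurements_2013_10 VALUES (NEW.*);
      ELSIF NEW.taken_at >= '2013-09-01' AND NEW.taken_at < '2013-10-01' THEN
          INSERT INTO measurements_2013_09 VALUES (NEW.*);
      ELSE
          RAISE EXCEPTION 'no partition for timestamp %', NEW.taken_at;
      END IF;
      RETURN NULL;  -- suppress the insert into the parent table
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER measurements_insert_trigger
      BEFORE INSERT ON measurements
      FOR EACH ROW EXECUTE PROCEDURE measurements_insert_router();

The IF chain grows with the number of partitions, so trigger-based 
routing tends to get slower as partitions accumulate, which matches the 
concern above.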


Partitioning seems to be a good idea if a single table would be too big to 
fit in your server's file cache, and also for management, since you can 
drop partitions of old data and create new ones without having to 
reindex and lock.  Does your data partition nicely by date, for 
example?  If most of the inserts are new data and old data is 
read-mostly, then partitioning may make sense because you would not have 
to reindex old partitions.  In fact, you could very well not have an 
index on the hot, write-mostly partition of, say, the current month, 
until the write activity on that table diminishes, which would make 
inserts faster.  If, on the other hand, your writes are scattered across 
many partitions, a single large table with an index may be a better 
solution.
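
A rough sketch of that approach, reusing the monthly measurements_* 
partitions from the earlier example (names illustrative): build the index 
on a partition only once its write activity has died down, and retire 
old data by dropping whole partitions instead of deleting rows:

  -- index last month's partition after the bulk of its writes are done
  CREATE INDEX measurements_2013_09_taken_at_idx
      ON measurements_2013_09 (taken_at);

  -- drop a partition that has aged out; this is fast and does not
  -- bloat or lock the remaining partitions
  DROP TABLE measurements_2012_09;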




Thanks

--
Regards,
  Grzegorz