Re: [GENERAL] Questions about Partitioning

2011-04-21 Thread Craig Ringer
On 19/04/11 23:56, Phoenix Kiula wrote:
 While I fix some bigger DB woes, I have learned a lesson. Huge indexes
 and tables are a pain.
 
 Which makes me doubly keen on looking at partitioning.
 
 Most examples I see online are partitioned by date. As in months, or
 quarter, and so on. This doesn't work for me as I don't have too much
 logic required based on time.
 
 The biggest, highest volume SELECT in my database happens through an
 alias column. This is an alphanumeric column. The second-biggest
 SELECT happens through the userid column -- because many users check
 their account every day.

If  user id - alias and/or alias - user id lookups are really hot,
consider moving them to a subtable, so you don't have to worry about
whether to partition by user id or alias, and so that the table is
really small, easily cached, and fast to scan. For example:

CREATE TABLE user_alias (
   alias VARCHAR(42) PRIMARY KEY,
   user_id integer REFERENCES maintable(id)
);

If you like you can retain the alias column in maintable, making
that a REFERENCE to user_alias(alias) so you force a 1:1 relationship
and don't have to JOIN on user_alias to get alias data for a user. The
downside of that is that the circular/bidirectional reference requires
you to use 'DEFERRABLE INITIALLY DEFERRED' on one or both references to
be able to insert, and that can cause memory use issues if you do really
big batch inserts and deletes on those tables.

 1. Which column should I partition by -- the alias because it's the
 largest contributor of queries? This should be OK, but my concern is
 that when user_id queries are happening, then the data for the same
 user will come through many subtables that are partitioned by alias

See above: consider splitting the user-id-to-alias mapping out into
another table.

 3. If I partition using a%, b% etc up to z% as the partition
 condition, is this an issue

It might be worth examining the distribution of your data and
partitioning on constraints that distribute the data better. There'll be
a lot more cs than zs.

That said, it might not be worth the complexity and you'd have to check
if the constraint exclusion code was smart enough to figure out the
conditions. I don't have much experience with partitioning and have
never tried or tested partitioning on a LIKE pattern.

 6. Triggers - how do they affect speed?

A constraint is not a trigger, they're different. SELECTs on partitioned
tables are not affected by triggers.

For INSERT, UPDATE and DELETE, where you're redirecting INSERTs into the
parent table into the appropriate partition, then speed might be a
concern. It probably doesn't matter. If you find it to be an issue, then
rather then re-writing the trigger in C, you're probably better off just
INSERTing directly into the appropriate subtable and thus bypassing the
trigger.

--
Craig Ringer

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Questions about Partitioning

2011-04-19 Thread Phoenix Kiula
While I fix some bigger DB woes, I have learned a lesson. Huge indexes
and tables are a pain.

Which makes me doubly keen on looking at partitioning.

Most examples I see online are partitioned by date. As in months, or
quarter, and so on. This doesn't work for me as I don't have too much
logic required based on time.

The biggest, highest volume SELECT in my database happens through an
alias column. This is an alphanumeric column. The second-biggest
SELECT happens through the userid column -- because many users check
their account every day.

A rough table definition can be considered as follows:

   CREATE TABLE maintable
  idSERIAL  primary key
  alias  VARCHAR(42)
  ...
  user_id  VARCHAR(30)
  user_registered   BOOLEAN
  statusVARCHAR(1)


 My questions:

1. Which column should I partition by -- the alias because it's the
largest contributor of queries? This should be OK, but my concern is
that when user_id queries are happening, then the data for the same
user will come through many subtables that are partitioned by alias
-- will this happen automatically (presuming constraint exclusion is
on)? How does partitioning by one column affect queries on others.
Will there be subtable-by-subtable indexes on both alias and

2. How does SERIAL type work with partitions? Will INSERT data go into
the respective partitions and yet maintain an overall sequence -- I
mean, the *same* overall sequence for the parent table distributed
automagically across subtables?

3. If I partition using a%, b% etc up to z% as the partition
condition, is this an issue -- are about 26 subtables too many
partitions? Mine are static partitions as in they will be the same
forever, unlike data-based partitions. And each partition will
continue to grow. If I include that aliases can begin with numbers
and allowed symbols too, then this may be 45 partitions? What's the
limit of partitions -- not only official limit, but practical limit in
terms of performance?

4. Given that it's a wildcard LIKE condition (with a %) will this
affect the index and subsequent SELECT speed? Are partition conditions
recommended to be = or  type operators only or is LIKE ok??

5. Does partitioning need to happen only through one column? Can I
have a condition containing two columns instead?

   CREATE TABLE subtable_a (
PRIMARY KEY (id)
CHECK ( user_id LIKE 'a%' and user_registered IS TRUE)
) INHERITS (maintable);

   CREATE TABLE subtable_b (
PRIMARY KEY (id),
CHECK ( user_id LIKE 'b%' and user_registered IS TRUE)
) INHERITS (maintable);

   ..etc


6. Triggers - how do they affect speed? Everything, insert, update,
select will happen through this conditional trigger. I will likely be
writing this in PLSQL, but I read in several websites that C triggers
are much faster than PLSQL triggers. Is this a concern?

7. Constraint exclusion - is it recommended to have this in the
pg.conf, or will I need to do this before every SQL? I prefer the
pg.conf way, but want to confirm that there are no downsides for other
regular SQL operations with this setting?

8. How will JOIN work? I have different tables JOINing with the parent
table now. With partitioned subtables, will constraint exclusion
automatically do what's needed and my SQL does not need to change? Or
will there be triggers required for each and every query I currently
have?


Eight questions is enough for my first post in this partitioning thread   :)

Thanks much!

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Questions about Partitioning

2011-04-19 Thread Steve Crawford

On 04/19/2011 08:56 AM, Phoenix Kiula wrote:

While I fix some bigger DB woes, I have learned a lesson. Huge indexes
and tables are a pain.

Which makes me doubly keen on looking at partitioning.
Before jumping into partitioning it would be useful to know specifically 
what pain you are having with your current tables and indexes. 
Maintenance? Performance? Other? Question zero is What issues are 
currently causing you pain with large tables? and after that 
determining if the partitioning is an appropriate solution. There is 
pain associated with partitioning, as well, so you need to be sure that 
you will achieve a net pain reduction.


Carefully read 
http://www.postgresql.org/docs/current/static/ddl-partitioning.html, it 
has examples that answer several of your questions. Pay extra attention 
to 5.9.6 Caveats.


Some places where partitioning work well:

1. The partition can substitute for an index and the resulting child 
tables will have somewhat comparable sizes. If you had contact 
information where state was typically required in queries you might 
partition the data into tables for each state so a typical query would 
only touch a smaller data set and the partitioning/child-table 
constraints substitute for an index on state.


2. You frequently drop data in bulk and can group that data in such a 
way that you can drop or truncate a child-table. Among the places I've 
used partitioning is for validation codes. I partition them by like 
expiration and when the date arrives, I just drop the partition with the 
expired codes - way faster than delete-from and the necessary follow-up 
maintenance when deleting millions of codes.


3. The nature of your data is such that it can be partitioned into a 
small part that is accessed frequently and parts that are relatively 
rarely accessed.



Most examples I see online are partitioned by date. As in months, or
quarter, and so on. This doesn't work for me as I don't have too much
logic required based on time.


Time-based data often satisfies all of the above (log data you can 
partition by month, typically only look at the current month and drop 
data that is a year old, for example) so that's what ends up being in 
most examples.



The biggest, highest volume SELECT in my database happens through an
alias column. This is an alphanumeric column. The second-biggest
SELECT happens through the userid column -- because many users check
their account every day.

A rough table definition can be considered as follows:

CREATE TABLE maintable
   idSERIAL  primary key
   alias  VARCHAR(42)
   ...
   user_id  VARCHAR(30)
   user_registered   BOOLEAN
   statusVARCHAR(1)


  My questions:

1. Which column should I partition by -- the alias because it's the
largest contributor of queries? This should be OK, but my concern is
that when user_id queries are happening, then the data for the same
user will come through many subtables that are partitioned by alias
-- will this happen automatically (presuming constraint exclusion is
on)? How does partitioning by one column affect queries on others.
Will there be subtable-by-subtable indexes on both alias and


Answer question zero, above, first. But beware - the primary key is not 
inherited. You run the risk of duplicating the primary key (or other 
unique identifier) across child tables unless you implement the 
appropriate constraints on the child tables to prevent this. It's also 
pointless to have a primary key on the parent table in most situations.



2. How does SERIAL type work with partitions? Will INSERT data go into
the respective partitions and yet maintain an overall sequence -- I
mean, the *same* overall sequence for the parent table distributed
automagically across subtables?
This depends on how you set up your triggers, constraints, child tables 
etc. but by default a basic create table thechild () inherits 
(theparent); will result in a child table that shares the same sequence 
as the parent.

3. If I partition using a%, b% etc up to z% as the partition
condition, is this an issue -- are about 26 subtables too many
partitions? Mine are static partitions as in they will be the same
forever, unlike data-based partitions. And each partition will
continue to grow. If I include that aliases can begin with numbers
and allowed symbols too, then this may be 45 partitions? What's the
limit of partitions -- not only official limit, but practical limit in
terms of performance?


As always, the answer is depends but I wouldn't typically see 45 as 
too many. See primary-key warning above. It's less an absolute number of 
tables and more whether the design of your tables and queries results in 
execution efficiency gains that outweigh the additional planner costs.



4. Given that it's a wildcard LIKE condition (with a %) will this
affect the index and subsequent SELECT speed? Are partition conditions
recommended to be = or  type operators only or is LIKE