Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-11 at 02:28 +0100, Gavin Sherry wrote:
> On Thu, Jan 10, 2008 at 09:30:10PM +, Simon Riggs wrote:
> > > > We cannot perform partition exclusion using this type of WHERE
> > > > clause at planning time because the CURRENT_DATE function is STABLE.
> > >
> > > We can do the exact same thing -- if it's a direction people want to
> > > take. In fact, we can do it better/faster because once we've evaluated
> > > one partition we know that there are no others to evaluate.
> >
> > Lost you completely here. I'm explaining to you that *nobody* can solve
> > those problems solely at planning time, by definition, so it has to be
> > done at execution time. I'm not saying anything about your way, my way.
>
> Sorry, I wasn't clear enough. I was trying to say: if we're going to do
> something in the executor (for right or wrong), the declarative approach
> can do it too. Since there will be partition bounding information
> available, we can do partition selection in the executor (maybe the
> planner should tell us to do it).

Of course. It's an identical situation for both. Regrettably, none of
your comments about dynamic partitioning and planning were accurate as a
result.

> Okay. As I said above, nothing in declarative partitioning rules out
> partition selection with stable functions. So, let's do it, assuming
> everyone else thinks it is a good idea.

If you check the archives, this was long ago identified as a
requirement. And I said exactly the things you said, BTW, when trying to
say it didn't matter.

I've kept a list of requests for improvement that I can share with you;
I've always been loath to publish a list of bad points.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com

---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, Jan 11, 2008 at 08:07:18AM +, Simon Riggs wrote:
> On Fri, 2008-01-11 at 02:28 +0100, Gavin Sherry wrote:
> > Sorry, I wasn't clear enough. I was trying to say: if we're going to do
> > something in the executor (for right or wrong), the declarative approach
> > can do it too. Since there will be partition bounding information
> > available, we can do partition selection in the executor (maybe the
> > planner should tell us to do it).
>
> Of course. It's an identical situation for both. Regrettably, none of
> your comments about dynamic partitioning and planning were accurate as a
> result.

That's not true. We will still have planning drive the partition
selection when the predicate is immutable, thus having more accurate
plans. Some but not all plans use current_date and similar.

> > Okay. As I said above, nothing in declarative partitioning rules out
> > partition selection with stable functions. So, let's do it, assuming
> > everyone else thinks it is a good idea.
>
> If you check the archives, this was long ago identified as a
> requirement. And I said exactly the things you said, BTW, when trying to
> say it didn't matter.
>
> I've kept a list of requests for improvement that I can share with you;
> I've always been loath to publish a list of bad points.

Why, please send them.

Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Gavin, I've responded to the technical questions you raise here:

On Thu, 2008-01-10 at 02:22 +0100, Gavin Sherry wrote:
> > After we note that a segment is read-only we can scan the segment and
> > record min/max values for all columns. These are then implicit
> > constraints, which can then be used for segment exclusion in a similar
> > manner as we do with the current constraint exclusion mechanism.
>
> What about columns which are varlen? The maximum value could be very
> long -- 1 GB in fact. Say we decide to not support those. Well, what
> about numeric? What about user defined types?

No problem with any datatypes that I can see. Why would there be? Very
long values as boundary values would be a problem with any type of
partitioning, so some sensible heuristics would need to apply in any
case.

> > This would also allow a Merge Join to utilise exclusion for the inner
> > plan at execution time. Instead of scanning the whole inner table plan
> > from the start, we would be able to start the scan from the appropriate
> > segment. This would require us to pass the current value of the outer
> > plan down into the inner plan. The typical executor nodes on the inner
> > plan would be a SortNode and below that a SeqScan. In that case the
> > executor would need to pass the outer value from the MergeJoinNode down
> > thru the SortNode to the SeqScan node. The SeqScan node could then
> > perform partition exclusion, reducing the time for that scan and also
> > reducing the time for the resulting sort. This sounds like it might be
> > worth doing in normal cases also. It might turn out that the potentially
> > applicable cases are already excluded during planning; I haven't thought
> > about that aspect in enough detail yet.
>
> I don't think that would work at all. Say you have skew of data across
> the partitions. You may end up doing a seq scan for every outer tuple.
> It would look like a really expensive nested loop.

No, you've misunderstood. I said start the plan, i.e. do this once, not
for every row. That would be ridiculous.
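The per-segment min/max exclusion described above can be sketched in a few lines. This is an illustrative model only, not PostgreSQL code: the `Segment` record, its field names, and the integer column are all hypothetical; the point is that read-only segments carry trustworthy bounds while volatile segments must always be scanned.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """Per-segment metadata (hypothetical layout): observed min/max for one
    column. The bounds are only trustworthy once the segment is read-only."""
    seg_no: int
    col_min: int
    col_max: int
    read_only: bool

def segments_to_scan(segments, lo, hi):
    """Return the segments that might contain rows with col BETWEEN lo AND hi.

    Volatile (non-read-only) segments are always scanned, since their
    min/max bounds may be stale; read-only segments are excluded whenever
    their recorded range cannot overlap the predicate's range.
    """
    result = []
    for seg in segments:
        if not seg.read_only:
            result.append(seg)           # can't exclude: bounds not guaranteed
        elif seg.col_max >= lo and seg.col_min <= hi:
            result.append(seg)           # ranges overlap: must scan
    return result
```

For example, with three segments covering 0-99, 100-199 and a volatile tail, a query for values 150-160 scans only the second segment plus the volatile one.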
> > If we collect data for all columns then many of our implicit
> > constraints would be useless. e.g. if a column only has a few values
> > and these are present in all segments. Matching our predicate against
> > all constraints would be expensive, so we must weed out poor
> > constraints. We would do this by removing any constraint that
> > overlapped more than 10% of other segments. Various heuristics would
> > likely need to be discovered during development to make this work
> > without resorting to manual commands. Note that all of this exclusion
> > is happening in the executor, not the planner. That allows this form
> > of exclusion to work with stable functions and parameters without
> > problem.
>
> So, in that case, you've totally undercut the planner. All the steps up
> the tree would be based on the stats for scanning the entire table but
> you've only returned part of it.

Answered already. Not an issue solely for this particular approach.

> To solve that, you could put the min/max values in a catalog somewhere.
> For a 16 TB table, that's 16000 min/max values. That's pretty
> expensive.

We agreed the partitioning could and would be adjustable, so the number
need not be 16000. The number of partitions we can support needs to be
fairly high to support a wide range of applications. Requests for
1-2000 partitions seem fairly typical to me and we need to be able to
handle that, however we do partitioning.

> And how do we handle making things non-read-only? You'd have to wait
> for all current queries to finish... that could take a long time.

No, I explained that it would not require locking or waiting. The
existing queries would just not take advantage of the newly read-only
state until they complete.

> How do we handle crashes during setting of min/max? I presume it needs
> WAL support.

It's in a catalog table, as we've said. Handled.

> What happens if the free space map gives us a free block in a read-only
> segment? Then we need to look at concurrency and transactionality of
> min/max values, don't we?
I discussed that and explained how it would be handled.

> > Visibility Map
> > [snip]
> > No dynamic shared memory cache is required because any concurrent
> > changes to the table would be ignored by a Scan anyway, so it doesn't
> > matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any
> > new scans that start will attempt to lock the table and then perform a
> > rel cache check before continuing. So the visibility will be set
> > correctly for *that* scan at least.
>
> Is that true in the presence of READ COMMITTED?

Yes, because READ COMMITTED has nothing to do with it. The SVM would be
refreshed via the relcache invalidation mechanism, which gets checked
every time we obtain a lock we did not already hold. In the case of a
table with bits set in the SVM we would need to recheck that each time
we try to obtain a lock, even if we already hold it.

In most cases the visibility map can be summarised
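Returning to the constraint-weeding idea raised earlier in this message (drop implicit constraints that "overlapped more than 10% of other segments"): one plausible reading of that heuristic can be sketched as below. The function name, the exact overlap measure, and the 10% threshold applied as an average are assumptions for illustration; the proposal deliberately leaves the precise rule to be discovered during development.

```python
def useful_exclusion_column(ranges, max_overlap_frac=0.10):
    """Decide whether per-segment (min, max) ranges for one column are
    selective enough to keep as implicit exclusion constraints.

    For each segment, count how many *other* segments its range overlaps;
    if the average overlap fraction exceeds max_overlap_frac, the column's
    constraints would rarely exclude anything, so discard them.
    (Heuristic sketch only -- not the mailing-list proposal's exact rule.)
    """
    n = len(ranges)
    if n < 2:
        return True
    total_overlaps = 0
    for i, (lo_i, hi_i) in enumerate(ranges):
        for j, (lo_j, hi_j) in enumerate(ranges):
            if i != j and hi_i >= lo_j and lo_i <= hi_j:
                total_overlaps += 1
    avg_frac = total_overlaps / (n * (n - 1))
    return avg_frac <= max_overlap_frac
```

A date-like column whose ranges march forward segment by segment passes the test; a low-cardinality column whose few values appear in every segment fails it, exactly the useless case described above.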
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-11 at 10:25 +0100, Gavin Sherry wrote:
> > Of course. It's an identical situation for both. Regrettably, none of
> > your comments about dynamic partitioning and planning were accurate as
> > a result.
>
> That's not true. We will still have planning drive the partition
> selection when the predicate is immutable, thus having more accurate
> plans.

Not really. The planner already evaluates stable functions at plan time
to estimate selectivity against statistics. It can do the same here. The
boundary values can't be completely trusted at plan time because they
are dynamic, but they're at least as accurate as ANALYZE statistics (and
probably derived at identical times), so can be used as estimates. So I
don't see any reason to imagine the plans will be badly adrift.

We're back to saying that if the visibility map is volatile, then SE
won't help you much. I agree with that and haven't argued otherwise.
Does saying it make us throw away SE? No; at least, not yet and not for
that reason.

SE does what I was looking for it to do, but doesn't do all of what
you'd like to achieve with partitioning, because we're looking at
different use cases. I'm sure you'd agree that all large databases are
not the same and that they can have very different requirements. I'd
characterise our recent positions on this as: I've been focused on
archival requirements, whereas you've been focused on data warehousing.
The difference really lies in how much activity, and of what kind,
occurs on a table. You don't unload and reload archives regularly, nor
do you perform random updates against the whole table.

I'm sure we'd quickly agree that many of the challenges you've seen
recently at Greenplum would not be possible with core Postgres, so I'm
not really trying too hard to compete.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, 2008-01-07 at 12:53 -0800, Ron Mayer wrote:
> Is my understanding right that these Segment Visibility Maps could help
> this case as well?

Yes, I think so.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
> I've kept a list of requests for improvement that I can share with you;
> I've always been loath to publish a list of bad points.

I think it would help understand the proposal if you also present the
shortcomings. When you presented the positive and negative points, the
negative list did look intentionally short :-) This imho provokes
negative replies; the average hackers that reply are no dummies either.

Some of the issues I see with the proposal made so far:
- segment boundaries will imho sometimes be too fuzzy for a reasonably short segment-describing where clause
- fault isolation
- huge global indexes (index create/reorg needs to repeatedly sort data that is long static)
- unpredictability
- when you delete large parts of the table you want that to be instantaneous and cause little work on the db side
- partitioning that also partitions the indexes; this is imho a must-have for huge non-static tables
- when the user tweaks segment size, this is already declarative
- some indexes are only needed for recent data (a partial index would need to slide == recreate)

The whole subject is imho very interesting, and I expect more feedback
after 8.3 is out. I am also in the declarative camp. In my projects the
partitioning is the developer/designer's responsibility, and thus all
add/drop partition tasks are automated, no dba. Needless to say, all
existing declarative implementations lack some of the essential features
on the implementation side, e.g.:
- use the btree order of partitions in plans that need more than one partition
- a useful syntax that allows automatic creation/exclusion of partitions (e.g. each month of a timestamp in one partition), e.g. partition 'hugetable_'||to_char(ts, 'MM') with extend(ts, year to month)
- unique indexes, equally partitioned like the table, that don't contain the partitioning column[s]
- some lack expression-based partitioning
- some lack instantaneous attach using a prefilled, preindexed table
- some lack detach
- some need a separate tablespace per partition

Andreas
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, Jan 11, 2008 at 11:49:50AM +, Simon Riggs wrote:
> Not really. The planner already evaluates stable functions at plan time
> to estimate selectivity against statistics. It can do the same here. The
> boundary values can't be completely trusted at plan time because they
> are dynamic, but they're at least as accurate as ANALYZE statistics (and
> probably derived at identical times), so can be used as estimates. So I
> don't see any reason to imagine the plans will be badly adrift.

Okay, it's good that you want the planner to look at those. Did you
consider the point I made about the sheer amount of data the planner
would have to consider for large cases?

> We're back to saying that if the visibility map is volatile, then SE
> won't help you much. I agree with that and haven't argued otherwise.
> Does saying it make us throw away SE? No, at least, not yet and not for
> that reason.

Yes, I'm not against SE; I just think that only having it would be a
serious regression for larger users. Personally, I think SE would be a
great idea for append-only tables since it removes the thing I'm most
worried about with it: the need to vacuum to 'turn it on'.

> SE does what I was looking for it to do, but doesn't do all of what
> you'd like to achieve with partitioning, because we're looking at
> different use cases. I'm sure you'd agree that all large databases are
> not the same and that they can have very different requirements. I'd
> characterise our recent positions on this that I've been focused on
> archival requirements, whereas you've been focused on data warehousing.

I think that sums it up, although I'd also say that declarative
partitioning is suitable for all those with largish amounts of data
which they know how they want stored. This points to another case that
SE suits: those who don't know how or (maybe more importantly) don't
care to manage their data.

I'll go back to what I said above. SE looks like a good performance
boost for archival read-only data. If we tighten up the definitions of
how some tables can be used -- append only -- then we can remove the
vacuum requirement and also change other characteristics of the
storage. For example, reduced visibility information, compression, etc.
These are hot topics for people with that kind of data.

> The difference really lies in how much activity and of what kind occurs
> on a table. You don't unload and reload archives regularly, nor do you

I couldn't agree more.

> perform random updates against the whole table. I'm sure we'd quickly
> agree that many of the challenges you've seen recently at Greenplum
> would not be possible with core Postgres, so I'm not really trying too
> hard to compete.

There we diverge. Yes, Greenplum produces systems for very large
amounts of data, petabyte range in fact. However, the architecture --
individual postmasters on a single CPU with their own storage -- means
that we see the problems of non-distributed database users when we look
at data at the node level. This is why I say that VACUUMing such
systems, under the SE model, after you do a load of data is just
impossible (I know that after time the cost of vacuum will stabilise,
but there's always the initial data load).

Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-11 at 17:31 +0100, Zeugswetter Andreas ADI SD wrote:
> > I've kept a list of requests for improvement that I can share with
> > you; I've always been loath to publish a list of bad points.

The list of bad points doesn't refer to shortcomings in my proposal,
which I would never hide. It refers to a list of general requirements
for improvements to partitioning that I have accumulated over a number
of years and will publish as soon as I've retyped it. It's not a
secret; it all comes from things discussed on list at various points.

> I think it would help understand the proposal if you also present the
> shortcomings. When you presented the positive and negative points, the
> negative list did look intentionally short :-)

If I am aware of a downside in any proposal, I always identify it.
That's why the damn things are so long. I design many things, but only
post some of them. The proposals I do post always have more positive
than negative points, otherwise I wouldn't waste anybody's time.

I regularly get comments on my proposal style. First, my proposals are
too long and they need a summary. Next, that I don't explain myself
enough and need to stop leaving gaps that have people assume I've not
thought it through thoroughly. So I try to put the summary in for some
people and the detail for others.

For this proposal I identified the target use case at the very top of
the proposal, and IMHO it does meet the needs of that very well, as many
people have agreed. It doesn't do everything that everybody wants
because the list is very long indeed, probably man-years of effort, all
things considered. We won't cross that gap in a single step.

I'm working on a revised version which will include all of the comments
and points that people have made, about 20 different topics in all from
my notes. In progress here:
http://developer.postgresql.org/index.php/Segment_Exclusion

> This imho provokes negative replies, the average hackers that reply
> are no dummies either.

I post to -hackers because I know there's no dummies here. If I've
provoked negative replies, I'm happy to apologise to all.

Thanks for your candid thoughts,

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-11 at 20:03 +0100, Gavin Sherry wrote:
> Okay, it's good that you want the planner to look at those. Did you
> consider the point I made about the sheer amount of data the planner
> would have to consider for large cases?

Sorry, thought I had somewhere in all those emails... If you really do
have 16TB of data and it's in the use case of mostly read-only,
volatile in the last portion of the table, then it would be
straightforward to increase segment size.

Whatever form of partitioning we go for, we do need to allow 1-2,000
partitions fairly easily. We can't sustain a sequential method of
applying the rules, which gives O(n) behaviour. We need some form of
indexing/tree search mechanism that gives roughly O(log n) behaviour.

> > We're back to saying that if the visibility map is volatile, then SE
> > won't help you much. I agree with that and haven't argued otherwise.
> > Does saying it make us throw away SE? No, at least, not yet and not
> > for that reason.
>
> Yes, I'm not against SE; I just think that only having it would be a
> serious regression for larger users.

If we do find regressions, then I'd suggest we look at ways of turning
it on/off automatically. We can keep track of the volatility in the
map. We can also have an off switch if you're really against it. But
I'm still skeptical. I think we can sustain the last few segments in a
table being volatile, as long as the greater proportion of segments are
not. The volatile part of the table is the subject of most of the
queries anyway, so excluding it from seq scans isn't that important.
Index access to historical data doesn't slow down whether or not you
have a volatile map.

> Personally, I think SE would be a great idea for append-only tables
> since it removes the thing I'm most worried about with it: the need to
> vacuum to 'turn it on'.

You do need to VACUUM anyway, so that's no problem. Sure, you have to
scan it to derive the boundary values, but that's one scan that saves
many in the future.

> I'll go back to what I said above.
> SE looks like a good performance boost for archival read-only data. If
> we tighten up the definitions of how some tables can be used -- append
> only -- then we can remove the vacuum requirement and also change
> other characteristics of the storage. For example, reduced visibility
> information, compression, etc. These are hot topics for people with
> that kind of data.

SE isn't aimed solely at INSERT-only data. It's much wider than that.
An ERP/SCM type application would easily benefit, say where orders are
received and processed within a relatively short period, but order data
needs to be maintained online for 2+ years. Anything where the table
grows either by date, or by increasing key, that has a read-only tail.
That's a lot of applications and a ton of data. It might not apply to
the Customer table in that app, but it will apply to Orders, OrderItem
etc.

We can compress older read-only data as a way of shrinking file size.
That's way easier, plus it applies to more real-world cases than trying
to remove all the visibility stuff. The insert-only people will still
be able to take advantage of compression, but the more general-purpose
people will never be able to make use of the visibility removal code.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
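The O(log n) boundary search called for earlier in this message (instead of testing 1-2,000 partition rules sequentially) is essentially a binary search over sorted bounds. A minimal sketch, assuming each partition is described only by a sorted exclusive upper bound (the function name and layout are hypothetical):

```python
import bisect

def find_partition(upper_bounds, key):
    """Locate the partition containing `key` in O(log n) via binary search,
    rather than O(n) by testing each partition's rule in turn.

    upper_bounds[i] is the exclusive upper bound of partition i, sorted
    ascending (an assumed layout for illustration).
    """
    idx = bisect.bisect_right(upper_bounds, key)
    if idx == len(upper_bounds):
        raise ValueError("key above highest partition bound")
    return idx
```

With bounds [100, 200, 300], key 150 lands in partition 1 and key 99 in partition 0, after at most two comparisons instead of a scan over all rules.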
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Simon Riggs wrote:
> Happy New Year, everybody. This proposal follows on from previous
> thinking about partitioning, where I've taken up Andrew Sullivan's
> suggestion to re-examine the current partitioning concept of using
> tables as partitions. So I've come up with an alternative concept to
> allow us to discuss the particular merits of each. ISTM that this new
> proposal has considerable potential.

I've been lurking and reading a huge number of threads on partitioning.
I see that postgresql is likely to give the user lots of knobs to define
partitions very flexibly, which is a good thing for things like sales
region reports etc.

All that to say, I hope some form of this very automatic tuning makes
it in. This automatic option would provide a benefit (not perfect, but
improved) for a significant set of use cases. Even better, it is
trivial to set up, though I would want a knob for the underlying
partition sizes; 1GB feels a bit too big for some situations.

Even expensive databases have found, I think, that there is a cost to
administrative complexity. In many cases someone may not be ready to go
down the declarative path, but be open to allowing the system to take
some optimizing approaches. What I especially like is the ability to
later mark things read-only. Perhaps a consultant who checks in on
things monthly might mark partitions in that form.

Good luck though with it all, great to see this.

- August
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
> > If people with large tables like partitioning why is Oracle moving
> > towards automated partitioning in 11g? Automated partitioning was one
> > of
>
> Have you used Oracle's partitioning?

Since you ask, yep, certified on it, plus DB2, Teradata and Greenwich.

> What about consistency over time? High levels of control for users if
> they want it? Even simple things like tablespace support? High
> performance unload and load performance much better than now? Reliable
> and consistent query plans? Support for datatypes other than those
> relating to time.

Most of these comments seem like emotional appeals, so refuting them one
by one is just going to look and smell like an argument, and a fruitless
one as well since we agree on so many things.

So, I get the message that you really want the DDL approach, and I agree
that you've demonstrated there are use cases that need it that you are
interested in. That's fine by me as long as we can each work on parts of
it to get it done. Will it be possible for you to do that?

I feel I can say that because AFAICS we can in principle have both
dynamic and declarative techniques, even on the same table. Around half
of the mechanics would be identical anyway. So from here I say let's
work together to get the basic mechanics right. My experience with
PostgreSQL is that the team/list always comes up with a better idea in
the end, and I'm sure we will this time too.

I have one main technical concern, which is the storage model, the
source of the difference between the two types of partitioning we've
been discussing. I see difficulties originating there, so I will start
another thread to examine just those items in detail.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Chris Browne wrote:
> _On The Other Hand_, there will be attributes that are *NOT* set in a
> more-or-less chronological order, and Segment Exclusion will be pretty
> useless for these attributes.

Short summary: With the appropriate clustering, ISTM Segment Exclusion
can be useful on all columns in a table.

Cluster the table by one-bit-of-column1 || one-bit-of-column2... That
way any col2=value query could exclude at least about half the table,
regardless of whether it's monotonically increasing or even totally
random.

It seems one could make Segment Exclusion even useful for multiple
unrelated columns in a table. Consider a large table of people where one
might want segment exclusion to help with both first name and last name.
One could cluster the table by first-letter-of-last-name ||
first-letter-of-first-name. Then a query for last name Smith could
exclude all but the consecutive segments of S's; while a query for first
name John would only have to look in the 26 runs of segments with AJ,
BJ, CJ, ...

Perhaps better: hash each column and interleave the bits col1bit1,
col2bit1, col3bit1, col1bit2, col2bit2, col3bit2. If I understand
segment exclusion right, that way on any table bigger than 8 segments
any query of col1=val, col2=val or col3=val would scan at most half the
table; on a table bigger than 64 segments any such query would scan at
most 1/4 of the table.

Obviously this only benefits the rarely changing parts of tables; new
and updated values would be in a very hot segment at the end of the
table that wouldn't be segment-excluded much.
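The bit-interleaving clustering key proposed above can be sketched directly. This is an illustrative toy, not a real clustering implementation: the function name, the per-column bit width, and the use of Python's built-in `hash` are all assumptions (the test below uses small integers, whose hash is the value itself, for determinism).

```python
def interleave_key(values, bits_per_col=8):
    """Build a clustering key by interleaving hash bits of several columns:
    col1bit1, col2bit1, col3bit1, col1bit2, col2bit2, col3bit2, ...

    With data clustered on this key, a predicate on any single column fixes
    every k-th bit of the key (k = number of columns), so each additional
    segment-addressing bit halves the portion of the table to scan.
    """
    mask = (1 << bits_per_col) - 1
    hashes = [hash(v) & mask for v in values]
    key = 0
    for bit in range(bits_per_col):                 # most-significant bit first
        for h in hashes:
            key = (key << 1) | ((h >> (bits_per_col - 1 - bit)) & 1)
    return key
```

For two columns with 2 bits each, values (0b11, 0b00) interleave to 0b1010: the first column's bits land in positions 3 and 1, the second column's in positions 2 and 0.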
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, Jan 10, 2008 at 07:25:00AM +, Simon Riggs wrote:
> On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
> > If the exclusion is executor driven, the planner cannot help but
> > create a seq scan plan. The planner will think you're returning 100X
> > rows when really you end up returning X rows. After that, all
> > decisions made by the planner are totally bogus.
>
> One of the most important queries on large tables is handling WHERE
> Eventdate = CURRENT_DATE;

Really? Well, this isn't handled by the current partitioning approach
(i.e., constraint exclusion), so users have worked around it.

> We cannot perform partition exclusion using this type of WHERE clause
> at planning time because the CURRENT_DATE function is STABLE.

We can do the exact same thing -- if it's a direction people want to
take. In fact, we can do it better/faster because once we've evaluated
one partition we know that there are no others to evaluate.

> So it seems clear that we need to make partition exclusion work at
> executor time, whatever else we do.

One example doesn't make the rule. Most people doing range partitioning
are going to be doing either specific dates or date ranges -- i.e.,
constants or things that can be folded to constants by the planner. At
least, that's what I've seen.

Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, Jan 10, 2008 at 04:51:04PM +, Simon Riggs wrote:
> On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
> > > If people with large tables like partitioning why is Oracle moving
> > > towards automated partitioning in 11g? Automated partitioning was
> > > one of
> >
> > Have you used Oracle's partitioning?
>
> Since you ask, yep, certified on it, plus DB2, Teradata and Greenwich.

It was just a joke.

> > What about consistency over time? High levels of control for users if
> > they want it? Even simple things like tablespace support? High
> > performance unload and load performance much better than now?
> > Reliable and consistent query plans? Support for datatypes other than
> > those relating to time.
>
> Most of these comments seem like emotional appeals, so refuting them
> one by one is just going to look and smell like an argument, and a
> fruitless one as well since we agree on so many things.

Sigh. You can call those questions what you like, but your approach
doesn't deal with them and the current partitioning approach, for its
problems, does deal with them.

> So, I get the message that you really want the DDL approach and agree
> that you've demonstrated there are use cases that need it that you are
> interested in. That's fine by me as long as we can each work on parts
> of it to get it done. Will it be possible for you to do that?

I assured you offlist that it was. This is something Greenplum needs
right now and I'm being paid to deliver it.

> I feel I can say that because AFAICS we can in principle have both
> dynamic and declarative techniques, even on the same table. Around half
> of the mechanics would be identical anyway. So from here I say let's
> work together to get the basic mechanics right. My experience with
> PostgreSQL is that the team/list always comes up with a better idea in
> the end and I'm sure we will this time too.

I agree. Scrutiny of ideas almost always leads to better ideas.

> I have one main technical concern which is the storage model, which is
> the source of the difference between the two types of partitioning
> we've been discussing. I see difficulties originating there, so I will
> start another thread to examine just those items in detail.

I look forward to it. As promised, I'll soon post some syntax of what we
have been working on.

Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, 2008-01-10 at 21:43 +0100, Gavin Sherry wrote: On Thu, Jan 10, 2008 at 07:25:00AM +, Simon Riggs wrote: On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote: If the exclusion is executor driven, the planner cannot help but create a seq scan plan. The planner will think you're returning 100X rows when really you end up returning X rows. After that, all decisions made by the planner are totally bogus. One of the most important queries on large tables is handling WHERE EventDate = CURRENT_DATE; Really? Well, this isn't handled by the current partitioning approach Yes, exactly why we need to fix it and why I'm mentioning it here. We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT_DATE function is STABLE. We can do the exact same thing -- if it's a direction people want to take. In fact, we can do it better/faster because once we've evaluated one partition we know that there are no others to evaluate. Lost you completely here. I'm explaining to you that *nobody* can solve those problems solely at planning time, by definition, so it has to be done at execution time. I'm not saying anything about your way, my way. So it seems clear that we need to make partition exclusion work at executor time, whatever else we do. One example doesn't make the rule. Most people doing range partitioning are going to be doing either specific dates or date ranges -- i.e., constants or things that can be folded to constants by the planner. At least, that's what I've seen. It's not always true that planning time = execution time. Using CURRENT_DATE to access the current day, week or month is common in many applications, as is parameterised SQL for higher transaction rate apps. We can't re-write all the SQL, so we need to make it work. I think if you demand a full-function implementation of partitioning, you'd better take into account *all* of the requirements. 
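The plan-time versus execution-time distinction being argued here can be sketched with a toy model (invented data, not PostgreSQL internals): an immutable constant lets the planner exclude partitions when the plan is built, while a STABLE expression such as CURRENT_DATE is unknown until execution, so exclusion has to be deferred to the executor.

```python
from datetime import date

# Toy model: one partition per day of events.
partitions = {
    date(2008, 1, 8): ["a", "b"],
    date(2008, 1, 9): ["c"],
    date(2008, 1, 10): ["d", "e"],
}

def plan_time_prune(const_value):
    """The planner can exclude partitions only for values known at plan
    time; a STABLE expression is unknown then, so every partition must
    stay in the plan."""
    if const_value is None:              # stands in for CURRENT_DATE
        return sorted(partitions)
    return [d for d in partitions if d == const_value]

def executor_time_prune(today):
    """The executor evaluates the stable expression once per execution
    and can then skip partitions that cannot match."""
    return [d for d in partitions if d == today]

# WHERE eventdate = DATE '2008-01-09': pruned at plan time.
assert plan_time_prune(date(2008, 1, 9)) == [date(2008, 1, 9)]
# WHERE eventdate = CURRENT_DATE: plan keeps all 3, executor prunes to 1.
assert len(plan_time_prune(None)) == 3
assert executor_time_prune(date(2008, 1, 10)) == [date(2008, 1, 10)]
```

Either storage model (declarative partitions or segments) faces the same split; the sketch only shows why the exclusion decision moves, not how either proposal implements it.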
-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, 2008-01-10 at 21:49 +0100, Gavin Sherry wrote: So, I get the message that you really want the DDL approach and agree that you've demonstrated there are use cases that need it that you are interested in. That's fine by me as long as we can each work on parts of it to get it done. Will it be possible for you to do that? I assured you offlist that it was. This is something Greenplum needs right now and I'm being paid to deliver it. Cool, I'll assist you where possible. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, Jan 10, 2008 at 09:30:10PM +, Simon Riggs wrote: We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT_DATE function is STABLE. We can do the exact same thing -- if it's a direction people want to take. In fact, we can do it better/faster because once we've evaluated one partition we know that there are no others to evaluate. Lost you completely here. I'm explaining to you that *nobody* can solve those problems solely at planning time, by definition, so it has to be done at execution time. I'm not saying anything about your way, my way. Sorry, I wasn't clear enough. I was trying to say, if we're going to do something in the executor (for right or wrong) the declarative approach can do it too. Since there will be partition bounding information available, we can do partition selection in the executor (maybe the planner should tell us to do it). So it seems clear that we need to make partition exclusion work at executor time, whatever else we do. One example doesn't make the rule. Most people doing range partitioning are going to be doing either specific dates or date ranges -- i.e., constants or things that can be folded to constants by the planner. At least, that's what I've seen. It's not always true that planning time = execution time. Using CURRENT_DATE to access the current day, week or month is common in many applications, as is parameterised SQL for higher transaction rate apps. We can't re-write all the SQL, so we need to make it work. I think if you demand a full-function implementation of partitioning, you'd better take into account *all* of the requirements. Okay. As I said above, nothing in declarative partitioning rules out partition selection with stable functions. So, let's do it, assuming everyone else thinks it is a good idea. 
Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sat, 2008-01-05 at 16:30 -0500, Robert Treat wrote: I'm not following this. If we can work out a scheme, I see no reason not to allow a single table to span multiple tablespaces. That seems to be something we might want anyway, so yes. The difference is that, if I currently have a table split by month, I can re-partition it into weekly segments, and only shuffle one month's data at a time, minimizing impact on the system while I shuffle it. This can even be used to do dynamic management, where data from the current month is archived by day, data from the past year by week, and data beyond that done monthly. Understood On many other databases, if you change the partition scheme, it requires exclusive locks and a shuffling of all of the data, even data whose partitions aren't being redefined. Even worse are systems like MySQL, where you need to rewrite the indexes as well. To me, these requirements always seem like show stoppers; I generally can't afford to lock a table while the database rewrites a billion rows of data. Agreed In any case, my thinking is if we had the segment exclusion technique, I could convert that partitioned table into a regular table again, use segment exclusion to handle what is currently handled by partitions, and create a global index across all the other data for that other, currently killer, query. Yes, that's what I have in mind. Can I ask that you produce a gap analysis between what you have now and what you would have in the future, so we can see what omissions or errors there are in the segex proposal? If we had indexes that spanned partitions, would we find that some of the queries that were producing seq scans will now produce better join and index plans, do you think? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sun, 2008-01-06 at 11:39 +0100, Markus Schiltknecht wrote: I think this has to do with SE not being of much use for index scans. Hmmm. I think it fits rather neatly with BitmapIndexScans. It would be easy to apply the index condition and/or filters to see which segments are excluded and then turn off bits in the bitmap appropriately. Not fully sure about IndexScans yet. I don't think it would be worth trying to apply SE until we estimated we would return say 100 rows. It needs to be able to work without slowing down the common path. Or put it another way: SE is an optimization for sequential scans. For tables where it works well, it could possibly replace the index entirely. True Without the index, you would rely on SE to always be able to exclude enough segments, so that the seq scan is less expensive than an index scan with the following table lookups. It would have to be a very fat index scan for so large a table... -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
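The fit with bitmap scans that Simon describes can be sketched as follows (a toy model with invented segment metadata, not actual executor code): each segment carries a min/max for the scanned column, and after the bitmap index scan produces its block bitmap, the bits for blocks in excluded segments are simply cleared so the heap never touches them.

```python
# Toy model: SEGMENT_BLOCKS heap blocks per segment (real segments are
# 1 GB), with a per-segment [min, max] of the scanned column serving as
# the exclusion metadata.
SEGMENT_BLOCKS = 4
segment_ranges = [(0, 99), (100, 199), (200, 299)]

def exclude_segments(bitmap, lo, hi):
    """Clear bits for blocks in segments whose [min, max] cannot
    intersect the condition range [lo, hi]."""
    out = set(bitmap)
    for seg, (smin, smax) in enumerate(segment_ranges):
        if smax < lo or smin > hi:                    # segment excluded
            out -= {b for b in out
                    if seg * SEGMENT_BLOCKS <= b < (seg + 1) * SEGMENT_BLOCKS}
    return out

bitmap = {0, 1, 5, 6, 9, 11}       # blocks the index says may match
# Condition: column BETWEEN 120 AND 180 -> only segment 1 can qualify,
# so the bits in segments 0 and 2 are turned off.
assert exclude_segments(bitmap, 120, 180) == {5, 6}
```

This also illustrates why the check is cheap enough to leave the common path alone: it is one comparison per segment, done once per scan, not per row.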
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, 2008-01-07 at 14:20 +0100, Markus Schiltknecht wrote: AFAIUI, Segment Exclusion combines perfectly well with clustering. Yes, seems like it would be possible to have a segment-aware CLUSTER, so it was actually usable on large tables. Not planning that initially though. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Simon Riggs wrote: Hmmm. I think it fits rather neatly with BitmapIndexScans. It would be easy to apply the index condition and/or filters to see which segments are excluded and then turn off bits in the bitmap appropriately. Yeah, good point. Not fully sure about IndexScans yet. I don't think it would be worth trying to apply SE until we estimated we would return say 100 rows. It needs to be able to work without slowing down the common path. Yup. Or put it another way: SE is an optimization for sequential scans. For tables where it works well, it could possibly replace the index entirely. True Without the index, you would rely on SE to always be able to exclude enough segments, so that the seq scan is less expensive than an index scan with the following table lookups. It would have to be a very fat index scan for so large a table... ..for SE to be faster than an index scan, you mean? Yes. Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, 2008-01-07 at 12:14 +0100, Csaba Nagy wrote: On Wed, 2008-01-02 at 17:56 +, Simon Riggs wrote: Like it? Very cool :-) Thanks. As ever, a distillation of various thoughts, not all mine. One additional thought: what about a kind of segment fill factor? Meaning: each segment has some free space reserved for future updates/inserts of records in the same range of its partitioning constraint. And when inserting/updating you put the new record into the corresponding segment... just like a very coarse clustering. Then you could vacuum the segments separately so the free space doesn't run out. For active segments you would then fix the partitioning constraint range once the fill factor is reached, to allow for keeping its constraint even when heavily updating (heavily vacuuming it too in response to that), and create a new segment for the unbounded range for new inserts... this would work fine for tables where the constraint is based on ever-increasing keys and accidental inserts in old ranges (which do happen occasionally in real life). Lots of thoughts there, so I'll try to separate them out and analyse. The way I originally described it is a very simple mechanism and we could tweak that some more. All ideas welcome. If we had dynamic segment constraints when the segment was not yet read only, that would lead to changes in the segment constraint for each INSERT when we have increasing keys. It seems better to set the constraints once only, when the segment was full, then prevent further INSERTs. The accidental inserts in old ranges seem like something that can be avoided if it is a real-world problem. For UPDATEs on a segment with constraints we might choose to apply the constraints to see what to do. You might want to allow the UPDATE and have it stretch the constraints outwards or you might want to prevent it and throw an error, or you might want to allow the UPDATE, yet migrate the new tuple to the appropriate segment. 
Dynamic partitioning works in multiple dimensions, so there isn't just one single valid location for any row. i.e. if we have a row with OrderId, OrderDate, RequiredByDate, ShipDate and LastModifiedDate on it, we'll probably expand the constraints on at least one of those. If we were lucky enough to have only changed one of those it might turn out there was *in that case* a single more sensible location for the new tuple, but that probably isn't the common case. So the likely best behaviour for UPDATEs is to try to keep the new row version in the same segment, then change the constraints. The segment constraints concept and the read only concept were linked. You're right we could separate them, though that turns out not to be that desirable. When we do a DELETE or an UPDATE we don't know whether the deleted row version was the last tuple with that particular boundary value. So we don't know whether the DELETE or UPDATE changes the constraints or not, and however we try to avoid it, we'll probably need to recalc the constraints in some circumstance. So updating the constraints dynamically isn't anywhere near as easy as it sounds and ultimately probably isn't worth the effort. So that's why constraints and read-only go together. HOT allows us to keep the new row version within the segment, in many cases. What might be worth doing is marking the FSM space for that segment as update-only to exclude inserts from using it, then forcing UPDATEs to stay within the segment if possible by providing the current block number in each call to the FSM. That would also mean that every UPDATE that wants to do a multi-block update on a currently read-only segment would need to call the FSM. Sounds good, but that would only work for the first UPDATE on a segment after it is marked read only, which isn't much use, or we would do it for *every* block-spanning UPDATE, which would cause contention in other use cases. 
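The "keep the new row version in the same segment, then stretch the constraints" behaviour can be sketched with a toy model (hypothetical column names, not the proposed on-disk format): each segment tracks a min/max per column, and an in-place UPDATE only widens the constraint of the column whose value changed.

```python
# Toy model of one segment's per-column constraints under the proposed
# dynamic scheme; absorbing a row widens min/max where needed.
class Segment:
    def __init__(self):
        self.constraints = {}        # column name -> [min, max]

    def absorb(self, row):
        """Record a row version in this segment, stretching each
        column's constraint outwards if the value falls outside it."""
        for col, val in row.items():
            lo_hi = self.constraints.setdefault(col, [val, val])
            lo_hi[0] = min(lo_hi[0], val)
            lo_hi[1] = max(lo_hi[1], val)

seg = Segment()
seg.absorb({"order_id": 100, "order_date": 20080110})
seg.absorb({"order_id": 101, "order_date": 20080110})
# An UPDATE changes only order_date; the new version stays in the
# segment and only that column's constraint stretches.
seg.absorb({"order_id": 101, "order_date": 20080111})
assert seg.constraints["order_id"] == [100, 101]
assert seg.constraints["order_date"] == [20080110, 20080111]
```

Note the asymmetry Simon describes: widening on INSERT/UPDATE is trivial, but a DELETE gives no cheap way to narrow the constraints again, which is why constraints and read-only status go together.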
So although I'm willing to listen and tweak, that hasn't yet resulted in any additional design points, unless I missed something above? When the change rate of old segments get down, the segments could be reorganized to have a smaller fill factor, so that you still allow for accidental updates but keep space usage efficient. This would be some similar action as a clustering, but hopefully not blocking (which might be a hard thing to do)... and later again you could mark some of the really old things as read only and put them in special segments with no wasted space. Yes, hopefully marking read only and compressed segments is a possible future. One problem would be when the segment's free space runs out, so you must put records from the same constraint range in multiple segments - but that could still work, you just would have multiple segments covering the same range, but if the segment fill factor is chosen properly it should not be the case... you could normally maintain a set of non-overlapping segments in terms of the
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sat, 2008-01-05 at 16:42 +0100, Markus Schiltknecht wrote: Simon Riggs wrote: On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day? No Way! Ah, I'm glad ;-) But now you want names on partitions? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Gavin and all, This is quite a long reply, so apologies for that. On Wed, 2008-01-09 at 07:28 +0100, Gavin Sherry wrote: On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: This technique would be useful for any table with historical data keyed by date or timestamp. It would also be useful for data where a time-of-insert component is implicit, such as many major entity tables where the object ids are assigned by a sequence. e.g. an Orders table with an OrderId as PK. Once all orders placed in a period have been shipped/resolved/closed then the segments will be marked read-only. A novel approach to the problem. For me, this proposal addresses some of the other problems in postgres (visibility in the heap vs. index) rather than the problems with partitioning itself. I think you're right that it isn't attempting to address the problems with partitioning directly, it is attempting to address the core use case itself. I think we have an opportunity to bypass the legacy-of-thought that Oracle has left us and implement something more usable. There are some low-level technical issues associated with declarative partitioning that I'll address last, which are in many ways the big reason to want to go to non-declarative partitioning. It might seem otherwise, but for me partitioning is tool for managing large volumes of data. It allows the user to encode what they know about the nature of their data, and that's often about management. In this way, your proposal actually makes partitioning too smart: the user wants to tell the system how to organise the data, as opposed to the other way around. At Greenplum, we've been discussing this in depth. Interestingly, we also felt that the storage layer could be much smarter with large tables with predictable layouts and predictable data patterns. 
But the thing is, people with large tables like partitioning, putting different partitions on different storage; they like the ability to merge and split partitions; they like truncating old partitions. In a word, they seem to like the manageability partitions give them -- as well as the performance that comes with this. Do people really like running all that DDL? There is significant manpower cost in implementing and maintaining a partitioning scheme, plus significant costs in getting it wrong. If people with large tables like partitioning, why is Oracle moving towards automated partitioning in 11g? Automated partitioning was one of the major goals for this next set of Postgres partitioning functionality also, whether or not we have declarative partitioning. My thinking is let's go for the best ideas and skip over the stuff that will be (is) obsolete before it's left the design stage. I see many more systems in the Terabyte range now than I did 10 years ago, but I see about the same number of DBAs. We'll always need experts, but I feel we should be using our expertise to simplify the standard cases, not just maintain the same level of difficulty in managing them. One of the big benefits of the dynamic partitioning approach is that it needs no DDL. So it will work out-of-the-box, for anybody. Deleting older data would be optimised under the proposed scheme, so that's not really a problem. Loading data is actually slightly harder and slower with declarative partitioning (see below). Merging and splitting partitions are tools for fine-tuning a very complex partitioning scheme. They do also allow a non-linear segmentation scheme, which might aid performance in some cases. (On a different note, I'm strongly in favour of a declarative approach to other optimizer information to allow the DBA to say what they know about how a table will be used. But that's off-topic for now.) 
One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. When we look at the data in the partition we can look at any number of columns. When we declaratively partition, you get only one connected set of columns, which is one of the reasons you want multi-dimensional partitioning in the first place. To this end, we (well, Jeff Cohen) looked at the syntax and semantics of partitioning in leading databases (Oracle, Informix, DB2) and came up with a highly expressive grammar which takes the best of each I think (I'll post details on the grammar in a separate thread). The idea is that range (for
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
[EMAIL PROTECTED] (Simon Riggs) writes: I think we have an opportunity to bypass the legacy-of-thought that Oracle has left us and implement something more usable. This seems like a *very* good thing to me, from a couple of perspectives. 1. I think you're right on in terms of the issue of the cost of running all that DDL in managing partitioning schemes. When I was working as DBA, I was decidedly *NOT* interested in doing a lot of low-level partition management work, and those that are in that role now would, I'm quite sure, agree that they are not keen on spending a lot of their time trying to figure out what tablespace to shift a particular table into, or what tablespace filesystem to get sysadmins to set up. 2. Blindly following what Oracle does has always been a dangerous sort of thing to do. There are two typical risks: a) There's always the worry that they may have patented some part of how they implement things, and if you follow too closely, There Be Dragons... b) They have enough billion$ of development dollar$ and development re$ource$ that they can follow strategies that are too expensive for us to even try to follow. 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. 
Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. We can therefore expect useful constraints for attributes that are assigned in some kind of more or less chronological order. Such attributes will include: - Object ID, if set by a sequence - Processing dates There may be a bit of sloppiness, but the constraints may still be useful enough to exclude enough segments to improve efficiency. _On The Other Hand_, there will be attributes that are *NOT* set in a more-or-less chronological order, and Segment Exclusion will be pretty useless for these attributes. In order to do any sort of Exclusion for non-chronological attributes, it will be necessary to use some mechanism other than the patterns that fall out of natural chronological insertions. If you want exclusion on such attributes, then there needs to be some sort of rule system to spread such items across additional partitions. Mind you, if you do such, that will weaken the usefulness of Segment Exclusion. For instance, suppose you have 4 regions, and scatter insertions by region. In that case, there will be more segments that overlap any given chronological range. When we look at the data in the partition we can look at any number of columns. When we declaratively partition, you get only one connected set of columns, which is one of the reasons you want multi-dimensional partitioning in the first place. Upside: Yes, you get to exclude based on examining any number of columns. 
Downside: You only get the exclusions that are emergent properties of the data... The more I'm looking at the dynamic approach, the more I'm liking it... -- cbbrowne,@,cbbrowne.com http://linuxfinances.info/info/linuxxian.html Feel free to contribute build files. Or work on your motivational skills, and maybe someone somewhere will write them for you... -- Fredrik Lundh [EMAIL PROTECTED]
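The 7-versus-365 sloppiness example discussed above can be checked with a quick simulation (invented numbers: one segment fills per day of orders, and RequiredByDate lags OrderDate by 0-6 days):

```python
# Toy simulation: 365 daily segments; each records the min/max of the
# RequiredByDate values (day + 0..6) that landed in it, as its emergent
# exclusion constraint.
DAYS = 365
segments = []
for day in range(DAYS):
    req = [day + (k % 7) for k in range(100)]   # RequiredByDate values
    segments.append((min(req), max(req)))       # per-segment min/max

# A query on one RequiredByDate only touches segments whose range
# covers it -- the emergent 7, not all 365.
target = 200
matching = [s for s, (lo, hi) in enumerate(segments) if lo <= target <= hi]
assert len(matching) == 7                       # days 194..200
```

The same simulation also shows the downside: had the rows been inserted in an order unrelated to RequiredByDate, each segment's min/max would be wide and few segments would be excluded.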
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Chris Browne wrote: _On The Other Hand_, there will be attributes that are *NOT* set in a more-or-less chronological order, and Segment Exclusion will be pretty useless for these attributes. Really? I was hoping that it'd be useful for any data with long runs of the same value repeated - regardless of ordering. My biggest tables are clustered by zip/postal-code -- which means that while the City, State, Country attributes aren't monotonically increasing or decreasing; they are grouped tightly together. I'd expect all queries for San Francisco to be able to come from at most 2 segments; and all queries for Texas to be able to come from only a fraction of the whole. If the segment sizes are configurable - I imagine this would even be useful for other data - like a people table organized by last_name, first_name. Johns may be scattered throughout the table -- but at least the John Smiths would all be on one segment, while the Aaron-through-Jim Smith segments might get excluded. Or am I missing something?
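This point -- that long runs of equal values help even without monotonic ordering -- can be sketched with a toy model (invented data: rows clustered by state, fixed-size segments, a per-segment distinct-value set as the exclusion metadata):

```python
# Toy model: rows grouped into long runs by state, but not globally
# sorted in any meaningful order.
rows = ["TX"] * 500 + ["CA"] * 250 + ["NY"] * 250
SEG = 100
segments = [set(rows[i:i + SEG]) for i in range(0, len(rows), SEG)]

def segments_for(state):
    """Segments a scan for one state must visit; the rest are excluded."""
    return [i for i, vals in enumerate(segments) if state in vals]

assert segments_for("TX") == [0, 1, 2, 3, 4]
assert segments_for("CA") == [5, 6, 7]
assert segments_for("NY") == [7, 8, 9]   # segment 7 straddles CA/NY
assert segments_for("FL") == []          # fully excluded
```

Exclusion here depends only on locality (runs of equal values fitting inside few segments), not on the values ever increasing over time, which supports Chris's reading.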
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 09, 2008 at 11:47:31AM -0500, Chris Browne wrote: [EMAIL PROTECTED] (Simon Riggs) writes: I think we have an opportunity to bypass the legacy-of-thought that Oracle has left us and implement something more usable. This seems like a *very* good thing to me, from a couple of perspectives. [snip] 2. Blindly following what Oracle does has always been a dangerous sort of thing to do. There are two typical risks: a) There's always the worry that they may have patented some part of how they implement things, and if you follow too closely, There Be Dragons... I think that could be equally said of the dynamic partitioning approach. In fact, it might be more likely since declarative partitioning has been around for eons. b) They have enough billion$ of development dollar$ and development re$ource$ that they can follow strategies that are too expensive for us to even try to follow. I don't see that as an argument against the declarative approach. Reading the details on this thread so far, I think Simon's approach is probably more complex from an implementation POV. 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. Well, I don't think I'm proposing we /blindly follow/ others. I propose we choose a grammar which takes the best of what others have tried to do. Oracle's grammar is hideous, IBM's is too restrictive, for example. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. 
This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. Many people, in my experience, doing the kind of data processing which benefits from partitioning are regularly loading data, rather than collecting it in an OLTP fashion. Let's take the easily understandable concept of processing web site traffic. If the amount of data is large enough to benefit from partitioning, they probably have multiple web servers and therefore almost certainly multiple log files. If these files are not sorted into a single file, the records will not have a naturally progressing chronology: with every file we go back to the beginning of the period of time the load covers. If you add parallelism to your load, things look even more different. This means you could end up with a bunch of partitions, under the dynamic model, which all cover the same time range. Then there's the way that really big databases are used (say, up around Simon's upper bound of 16 TB). It is expensive to keep data online so people aren't. They're loading and unloading data all the time, to perform different forms of analysis. A common scenario in the example above might be to unload all but the current month's data and then load the same month from the previous year. The unload step must not be costly (i.e., TRUNCATE). 
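Gavin's multiple-log-file point can be made concrete with a toy simulation (invented numbers): two unsorted log files covering the same day, loaded back to back, produce segments whose timestamp ranges overlap, so time-based exclusion discards far less than in the neatly chronological case.

```python
# Toy model: two web servers each log the same day's timestamps;
# the files are loaded one after the other rather than merge-sorted.
file_a = list(range(1000))             # server A: one day of timestamps
file_b = list(range(1000))             # server B: the same day again
loaded = file_a + file_b               # loaded back to back, not merged

SEG = 500
ranges = [(min(loaded[i:i + SEG]), max(loaded[i:i + SEG]))
          for i in range(0, len(loaded), SEG)]

# A query on one timestamp now hits a segment from each load: half of
# all segments survive exclusion instead of one.
target = 750
overlapping = [(lo, hi) for lo, hi in ranges if lo <= target <= hi]
assert len(ranges) == 4 and len(overlapping) == 2
```

Whether this matters in practice depends on how many concurrent load streams there are relative to the total segment count, which is the crux of the disagreement in the following messages.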
Then, there's no guarantee that what they're interested in is the date range at all. They may want to compare user agent (look at bot activity). In this case, the partitioning is across a small list of strings (well, numbers most likely). Here, the partitions would all have the same range. Partitioning would be useless. We can therefore expect useful constraints for attributes that are assigned in some kind of more or less chronological order. Such attributes will include: - Object ID, if set by a sequence - Processing dates There may be a bit of sloppiness, but the constraints may still be useful enough to exclude enough segments to improve efficiency. I really like the idea of the system being able to identify trends and patterns in the data (actually useful at the user level) but it's the uncertainty of how the partitioning will work for different kinds of data that I don't like. _On The Other Hand_, there will be
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, 2008-01-09 at 20:03 +0100, Gavin Sherry wrote: I think Simon's approach is probably more complex from an implementation POV. Much of the implementation is exactly the same, and I'm sure we agree on more than 50% of how this should work already. We just need to close in on the remainder. My current opinion on the SE approach is the opposite of the one above, though it gets us nowhere just to state it. I'm trying to avoid opinion and look at the details, which is the reason my viewpoint recently changed in favour of the dynamic approach as the main thrust for implementation. I've written a detailed email this morning to explain where and how the problems lie, which are nowhere near the syntax level. I haven't ruled out a declarative approach yet, but I need some detailed technical review of the issues. I hope you'll be replying to that? 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. Well, I don't think I'm proposing we /blindly follow/ others. I propose we choose a grammar which takes the best of what others have tried to do. Oracle's grammar is hideous, IBM's is too restrictive, for example. I assume the new grammar is good and if we do go that way, it sounds like the right starting place. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. 
Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. Many people, in my experience, doing the kind of data processing which benefits from partitioning are regularly loading data, rather than collecting it in an OLTP fashion. Let's take the easily understandable concept of processing web site traffic. If the amount of data is large enough to benefit from partitioning, they probably have multiple web servers and therefore almost certainly multiple log files. If these files are not sorted into a single file, the records will not have a naturally progressing chronology: with every file we go back to the beginning of the period of time the load covers. If you add parallelism to your load, things look even more different. This means you could end up with a bunch of partitions, under the dynamic model, which all cover the same time range. Depends how big we make the partitions and how sloppy this is as to whether that is a problem or not. We might still expect a x100 gain from using the SE approach depending upon the data volume. Then there's the way that really big databases are used (say, up around Simon's upper bound of 16 TB). It is expensive to keep data online so people aren't. They're loading and unloading data all the time, to perform different forms of analysis. That isn't my experience. That sounds very time consuming. The storage cost issue was the reason Andrew wanted offline segments, and why I have been talking about hierarchical storage.
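The sloppiness Simon describes can be made concrete with a small simulation (hypothetical Python, not anything from the proposal's code): one segment is filled per day of Orders, and RequiredByDate is assumed to fall within a week of the order date, so its per-segment min/max ranges overlap but still exclude almost everything.

```python
# Hypothetical sketch: simulate one segment filled per day of Orders, where
# RequiredByDate is typically up to 7 days after the OrderDate.  Count how
# many segments a predicate on each column touches.  All numbers are
# illustrative assumptions, not measurements.

def build_segments(days=365, lead_days=7):
    """Each segment holds one day of orders; track min/max per column."""
    segments = []
    for day in range(days):
        segments.append({
            "order_date": (day, day),               # perfectly tight range
            "required_by": (day, day + lead_days),  # sloppy, overlapping range
        })
    return segments

def segments_touched(segments, column, value):
    """Segment exclusion: keep only segments whose min/max may contain value."""
    return [s for s in segments
            if s[column][0] <= value <= s[column][1]]

segments = build_segments()
# A query on OrderDate hits exactly 1 of 365 segments ...
assert len(segments_touched(segments, "order_date", 100)) == 1
# ... while one on RequiredByDate hits 8 (days 93-100), still far better
# than scanning all 365.
print(len(segments_touched(segments, "required_by", 100)))  # prints 8
```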
A common scenario in the example above might be to unload all but the current month's data and then load the same month from the previous year. The unload step needs to not be costly (i.e., TRUNCATE). Then, there's no guarantee that what they're interested in is the date range at all. They may want to compare user agent (look at bot activity). In this case, the partitioning is across a small list of strings (well, numbers most likely). Here, the partitions would all have the same range. Partitioning would be useless. I take it you mean SE-based partitioning would be useless, but declarative partitioning would be useful? I would agree, assuming they run queries with a few of the small list of strings. Seems like a contrived case. Some of the biggest I've seen are GIS information or network information. Those are good examples of where a declarative approach would be the only way of getting
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Ron Mayer [EMAIL PROTECTED] writes: Chris Browne wrote: _On The Other Hand_, there will be attributes that are *NOT* set in a more-or-less chronological order, and Segment Exclusion will be pretty useless for these attributes. Really? I was hoping that it'd be useful for any data with long runs of the same value repeated - regardless of ordering. My biggest tables are clustered by zip/postal-code -- which means that while the City, State, Country attributes aren't monotonically increasing or decreasing, they are grouped tightly together. I'd expect all queries for San Francisco to be able to come from at most 2 segments; and all queries for Texas to be able to come from only a fraction of the whole. If the segment sizes are configurable - I imagine this would even be useful for other data - like a people table organized by last_name,first_name. Johns may be scattered throughout the table -- but at least the John Smiths would all be on one segment, while the Aaron-through-Jim Smith segments might get excluded. Or am I missing something? Well, this can head in two directions... 1. Suppose we're not using an organize in CLUSTER order approach. If the data is getting added roughly in insertion order, then there's no reason to expect San Francisco data to be clustered together. Ditto for John Smiths; we can expect them to be as scattered as their dates of creation. 2. Suppose we *are* using an organize in CLUSTER order approach. In that case, yes, indeed, you can expect segments to contain specific ranges of the data. However, in that case, the ONLY dimension under which Segment Exclusion may be expected to be useful is that of the first column of the index being used. Smith should be useful to SE, but not John, because, as a secondary criterion, the first name values will be scattered across all segments.
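Chris's second point can be illustrated with a toy example (invented Python, invented names): cluster rows on (last_name, first_name), split them into fixed-size segments, and check which segments per-column min/max exclusion can rule out.

```python
# Illustrative sketch: why, for a table CLUSTERed on (last_name, first_name),
# per-segment min/max ranges exclude segments for the leading column but not
# for the secondary one.  Data and segment size are invented for the example.

from itertools import product

last_names = ["Adams", "Brown", "Jones", "Smith"]
first_names = ["Aaron", "John", "Zoe"]

# Rows in clustered order, split into fixed-size segments of 3 rows.
rows = sorted(product(last_names, first_names))
seg_size = 3
segments = [rows[i:i + seg_size] for i in range(0, len(rows), seg_size)]

def may_contain(segment, col, value):
    """Min/max check: can this segment possibly hold the value?"""
    vals = [r[col] for r in segment]
    return min(vals) <= value <= max(vals)

# Leading column: only one segment can contain 'Smith'.
smith_segs = [s for s in segments if may_contain(s, 0, "Smith")]
# Secondary column: every segment's first-name range spans 'John'.
john_segs = [s for s in segments if may_contain(s, 1, "John")]

print(len(smith_segs), len(john_segs), len(segments))  # prints: 1 4 4
```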
-- (reverse (concatenate 'string ofni.sesabatadxunil @ enworbbc)) http://www3.sympatico.ca/cbbrowne/linuxdistributions.html PALINDROME spelled backwards is EMORDNILAP.
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Tue, 2008-01-08 at 02:12 +, Gregory Stark wrote: I also don't understand how this proposal deals with the more common use case of unloading and loading data. Normally in partitioned tables we build the data in a side table until the data is all correct then load it as a partition. If you treat it as a lower-level object then I don't see that working. The layout of the new table won't often match the layout of the target partitioned table. We optimised for that in 8.2, but I would say that not many people noticed and that it isn't normal. The problem with that approach, and the reason many people don't use it, is that it requires all data for a partition to be available at the time you add the partition. That necessarily implies a time delay in the process of loading data, which is no longer acceptable in the world of straight-through processing, or whatever you call the need for zero processing delay in your industry. So people choose to load data directly into the main table, allowing it to be immediately available, though at the cost of some performance. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 09, 2008 at 02:38:21PM -0500, Chris Browne wrote: Ron Mayer [EMAIL PROTECTED] writes: Or am I missing something? Well, this can head in two directions... 1. Suppose we're not using an organize in CLUSTER order approach. If the data is getting added roughly in insertion order, then there's no reason to expect San Francisco data to be clustered together. Ditto for John Smiths; we can expect them to be as scattered as their dates of creation. 2. Suppose we *are* using an organize in CLUSTER order approach. In that case, yes, indeed, you can expect segments to contain specific ranges of the data. However, in that case, the ONLY dimension under which Segment Exclusion may be expected to be useful is that of the first column of the index being used. Smith should be useful to SE, but not John, because, as a secondary criterion, the first name values will be scattered across all segments. One of the reasons for using partitions is that the usefulness of indexes breaks down because of the low cardinality of the indexed column. Also, the random IO can kill your performance. In my opinion, CLUSTER (and vacuum, mentioned elsewhere) are not data management tools. Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 09, 2008 at 08:17:41PM +, Simon Riggs wrote: On Wed, 2008-01-09 at 20:03 +0100, Gavin Sherry wrote: I think Simon's approach is probably more complex from an implementation POV. Much of the implementation is exactly the same, and I'm sure we agree on more than 50% of how this should work already. We just need to close in on the remainder. My current opinion on the SE approach is the opposite of the one above, though it gets us nowhere just to state it. I'm trying to avoid opinion and look at the details, which is the reason my viewpoint recently changed in favour of the dynamic approach as the main thrust for implementation. I've written a detailed email this morning to explain where and how the problems lie, which are nowhere near the syntax level. I haven't ruled out a declarative approach yet, but I need some detailed technical review of the issues. I hope you'll be replying to that? I'm responding to that email separately. Just a matter of addressing all the points in the detail it deserves. 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. Well, I don't think I'm proposing we /blindly follow/ others. I propose we choose a grammar which takes the best of what others have tried to do. Oracle's grammar is hideous, IBM's is too restrictive, for example. I assume the new grammar is good and, if we do go that way, it sounds like the right starting place. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example, if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate.
This will result in some sloppiness in the partitioning, e.g. if we fill one partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. Many people, in my experience, doing the kind of data processing which benefits from partitioning are regularly loading data, rather than collecting it in an OLTP fashion. Let's take the easily understandable concept of processing web site traffic. If the amount of data is large enough to benefit from partitioning, they probably have multiple web servers and therefore almost certainly multiple log files. If these files are not sorted into a single file, the records will not have a naturally progressing chronology: with every file we go back to the beginning of the period of time the load covers. If you add parallelism to your load, things look even more different. This means you could end up with a bunch of partitions, under the dynamic model, which all cover the same time range. Depends how big we make the partitions and how sloppy this is as to whether that is a problem or not. We might still expect a x100 gain from using the SE approach depending upon the data volume. This comes back to my problem with the dynamic approach: the user might get a certain experience but with the declarative approach the expectation matches the result. Then there's the way that really big databases are used (say, up around Simon's upper bound of 16 TB).
It is expensive to keep data online so people aren't. They're loading and unloading data all the time, to perform different forms of analysis. That isn't my experience. That sounds very time consuming. I cannot offer any evidence but it's what we see companies doing. The storage cost issue was the reason Andrew wanted offline segments, and why I have been talking about hierarchical storage. The thing is, people with lots of data usually have good hierarchical storage systems. Can we really do better? A common scenario in the example above might be to unload all but the current month's data and then load the same month from the previous year. The unload step needs to not be costly (i.e., TRUNCATE). Then, there's no guarantee that what they're interested in is the date range at all. They may want to
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi Simon, On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: Segment Exclusion - After we note that a segment is read-only we can scan the segment and record min/max values for all columns. These are then implicit constraints, which can then be used for segment exclusion in a similar manner as we do with the current constraint exclusion mechanism. What about columns which are varlen? The maximum value could be very long -- 1 GB in fact. Say we decide to not support those. Well, what about numeric? What about user defined types? This would also allow a Merge Join to utilise exclusion for the inner plan at execution time. Instead of scanning the whole inner table plan from the start, we would be able to start the scan from the appropriate segment. This would require us to pass the current value of the outer plan down into the inner plan. The typical executor nodes on the inner plan would be a SortNode and below that a SeqScan. In that case the executor would need to pass the outer value from the MergeJoinNode down through the SortNode to the SeqScan node. The SeqScan node could then perform partition exclusion, reducing the time for that scan and also reducing the time for the resulting sort. This sounds like it might be worth doing in normal cases also. It might turn out that the potentially applicable cases are already excluded during planning; I haven't thought about that aspect in enough detail yet. I don't think that would work at all. Say you have skew of data across the partitions. You may end up doing a seq scan for every outer tuple. It would look like a really expensive nested loop. If we collect data for all columns then many of our implicit constraints would be useless, e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive, so we must weed out poor constraints. We would do this by removing any constraint that overlapped more than 10% of other segments.
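The min/max constraint collection and the weeding heuristic being discussed could look roughly like this sketch (hypothetical Python, with an invented data shape; the 10% threshold comes from the proposal, everything else is assumption):

```python
# Rough sketch of the proposal's implicit constraints: record min/max per
# column per segment, then weed out columns whose ranges overlap most other
# segments and are therefore useless for exclusion.

def minmax(segments, col):
    """Per-segment (min, max) for one column; segments are lists of row tuples."""
    return [(min(r[col] for r in seg), max(r[col] for r in seg))
            for seg in segments]

def overlap_fraction(ranges, i):
    """Fraction of *other* segments whose range overlaps segment i's range."""
    lo, hi = ranges[i]
    others = [r for j, r in enumerate(ranges) if j != i]
    if not others:
        return 0.0
    return sum(1 for (l2, h2) in others if lo <= h2 and l2 <= hi) / len(others)

def useful_columns(segments, ncols, threshold=0.10):
    """Keep only columns where no segment overlaps more than the threshold."""
    keep = []
    for col in range(ncols):
        ranges = minmax(segments, col)
        if all(overlap_fraction(ranges, i) <= threshold
               for i in range(len(ranges))):
            keep.append(col)
    return keep

# Column 0 is chronological (disjoint ranges per segment); column 1 is a
# low-cardinality status code present in every segment.
segments = [
    [(1, "A"), (2, "B")],
    [(3, "A"), (4, "B")],
    [(5, "A"), (6, "B")],
]
print(useful_columns(segments, 2))  # prints [0]: the status column is weeded out
```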
Various heuristics would likely need to be discovered during development to make this work without resorting to manual commands. Note that all of this exclusion is happening in the executor, not the planner. That allows this form of exclusion to work with stable functions and parameters without problem. So, in that case, you've totally undercut the planner. All the steps up the tree would be based on the stats for scanning the entire table but you've only returned part of it. To solve that, you could put the min/max values in a catalog somewhere. For a 16 TB table, that's 16000 min/max values. That's pretty expensive. And how do we handle making things non-read-only? You'd have to wait for all current queries to finish... that could take a long time. How do we handle crashes during setting of min/max? I presume it needs WAL support. What happens if the free space map gives us a free block in a read-only segment? Then we need to look at concurrency and transactionality of min/max values, don't we? If changing it makes it not read-only (which seems to be the case) it would be trivial to totally degrade your partitioning scheme to full seq scan. On a 16 TB table. And getting the performance back might involve a day or more of VACUUMing. Impossible. Visibility Map -- [snip] No dynamic shared memory cache is required because any concurrent changes to the table would be ignored by a Scan anyway, so it doesn't matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any new scans that start will attempt to lock the table and then perform a rel cache check before continuing. So the visibility will be set correctly for *that* scan at least. Is that true in the presence of READ COMMITTED? In most cases the visibility map can be summarised as a single boolean to show whether any 100% visible segments exist. That makes accessing the map very cheap in the common, unset case. What do we do in failure scenarios? Does this go into WAL?
Is it replicated via log shipping techniques? You mention Slony support below. I don't know that Slony's target is Very Big Tables (up to 16 TB) but even if it was being used for a 'small' system of a few hundred GB, when you fail over the system has degraded if you aren't vacuuming it. Moreover, the layout of data may be different and therefore the performance of the segment exclusion. That's a bit of a surprise. Setting the Visibility Map -- VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to all current backends. Marking the map will cause a rel cache invalidation. Argh. Remember, we're talking about 16 TB tables here. With the best of tuning, that takes all day. So, day 1, load data; day 2 VACUUM it? With the existing partitioning approach, there's no such thing. We would never mark the highest numbered
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi Simon, On Wed, Jan 09, 2008 at 03:08:08PM +, Simon Riggs wrote: Do people really like running all that DDL? There is significant manpower cost in implementing and maintaining a partitioning scheme, plus significant costs in getting it wrong. Well... that's impossible for me to say. Most of the people who use lots of data like using MDX (a large and quite complex reporting and modelling language from MS, which is more complex than SQL IMHO) so... If people with large tables like partitioning, why is Oracle moving towards automated partitioning in 11g? Automated partitioning was one of Have you used Oracle's partitioning? I'm not surprised they're moving away ;-). More seriously, I didn't know that. Do you have a URL? the major goals for this next set of Postgres partitioning functionality also, whether or not we have declarative partitioning. My thinking is let's go for the best ideas and skip over the stuff that will be (is) obsolete before it's left the design stage. Well, you're assuming that because declarative partitioning has been around a while, it's obsolete. I don't think that's the case. Postgres has been around about as long... I see many more systems in the Terabyte range now than I did 10 years ago, but I see about the same number of DBAs. We'll always need experts, but I feel we should be using our expertise to simplify the standard cases, not just maintain the same level of difficulty in managing them. One of the big benefits of the dynamic partitioning approach is that it needs no DDL. So it will work out-of-the-box, for anybody. It's fine to say that, but people have already started talking about grammar extensions for dynamic partitions. People want the tools to manage it. There's difficulty in learning how and when to use the different VACUUM options that have been discussed. Deleting older data would be optimised under the proposed scheme, so that's not really a problem.
Loading data is actually slightly harder and slower with declarative partitioning (see below). Merging and splitting partitions are tools for fine tuning a very complex partitioning scheme. They do also allow a non-linear segmentation scheme, which might aid performance in some cases. What about adding partitions into a set of partitions? Greg's post on that was dismissed and it might not be representative but we have users doing that all the time. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill one partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. When we look at the data in the partition we can look at any number of columns. When we declaratively partition, you get only one connected set of columns, which is one of the reasons you want multi-dimensional partitioning in the first place. To this end, we (well, Jeff Cohen) looked at the syntax and semantics of partitioning in leading databases (Oracle, Informix, DB2) and came up with a highly expressive grammar which takes the best of each I think (I'll post details on the grammar in a separate thread). The idea is that range (for example, a date range), list (a list of distinct values) and hash partitioning be supported on multiple columns. Partitions can be added, merged, truncated. Partitions can reside on different tablespaces. The existing issues with the rewriter, COPY support and the like go away by smartening up the backend.
To explore the grammar and semantics Jeff and I (to a small extent) have implemented the parsing of such a grammar in the backend. At the moment, this just results in the generation of a bunch of rules (i.e., it produces the current behaviour, as if someone had written rules themselves) and debugging output. The fully fledged approach will see partitioning rules stored in a new system catalog which can be interrogated by the planner or COPY to determine which partition a given tuple belongs in or which partition(s) a WHERE clause matches. When loading data we would need to examine each incoming tuple and distribute it to the correct place in the partitioned heap. That's a non-trivial overhead on data loading that I would like to avoid, since if we rely on the natural locality of loaded data then that overhead is zero. Yes, but it's a lot faster than what we have at the moment and the dynamic approach is not without its costs. I've pointed it out before, but I will
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote: If the exclusion is executor driven, the planner cannot help but create a seq scan plan. The planner will think you're returning 100X rows when really you end up returning X rows. After that, all decisions made by the planner are totally bogus. One of the most important queries on large tables is handling WHERE EventDate = CURRENT_DATE; We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT_DATE function is STABLE. We can decide that some form of partition exclusion is possible, but the actual partition we access can *only* be decided within the executor. That necessarily affects the selectivity estimates. The planner can see we want some data, but it can't tell which data, so it doesn't know whether we will hit the day with the most data or the date with the least data. You've mentioned this a few times, as if it's a problem with dynamic partitioning, yet it's clearly an issue for any form of partitioning. So it seems clear that we need to make partition exclusion work at executor time, whatever else we do. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
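The plan-time versus execution-time split Simon describes can be sketched as follows (toy Python model, all names invented): a constant comparison value lets the planner prune immediately, while a STABLE expression such as CURRENT_DATE must be evaluated once at executor startup before pruning can happen.

```python
# Sketch of the plan-time vs. execution-time split: an immutable predicate
# value can prune partitions when the plan is built, but a STABLE expression
# like CURRENT_DATE can only be evaluated (once per query) by the executor,
# so pruning must happen there.

import datetime

def plan(partitions, predicate_value=None):
    """Planner: prune now if the comparison value is a known constant."""
    if predicate_value is not None:                      # immutable case
        return [p for p in partitions
                if p["lo"] <= predicate_value <= p["hi"]]
    return list(partitions)                              # defer to executor

def execute(planned, stable_expr):
    """Executor: evaluate the stable expression once, then prune."""
    value = stable_expr()                                # e.g. CURRENT_DATE
    return [p for p in planned if p["lo"] <= value <= p["hi"]]

days = [datetime.date(2008, 1, d) for d in range(1, 11)]
partitions = [{"lo": d, "hi": d} for d in days]

# WHERE eventdate = CURRENT_DATE: nothing pruned at plan time ...
planned = plan(partitions)
assert len(planned) == len(partitions)
# ... but the executor scans exactly one partition.
hit = execute(planned, lambda: datetime.date(2008, 1, 5))
print(len(hit))  # prints 1
```

This also shows Gavin's complaint: the planner's row estimate has to cover all ten partitions even though only one is scanned.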
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Tue, Jan 08, 2008 at 01:08:52AM +0100, Markus Schiltknecht wrote: Uh, which key are you talking about? AFAIU Simon's proposal, he suggests maintaining min/max values for all columns of the table. Right, but I think that's just because that approach is automatable. Only some use cases are going to be appropriate to this. Yeah, and if only *one* tuple in the 1G segment has: some_date = '1998-12-31' OR some_date = '2001-01-01' Segment Exclusion can't exclude that segment. That's all I'm saying. Correct. Huh? I'm certainly not the one asking for it. Quite the opposite, I'm warning from over-estimating the use of SE. Right; I think one should be clear that there are many -- maybe most -- uses of PostgreSQL where the proposal will be of no use. I just think we need to be clear that for the areas where it _can_ be useful, it could be very useful indeed. A
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Tue, Jan 08, 2008 at 02:12:28AM +, Gregory Stark wrote: Yes: it doesn't solve the problem I have, which is that I don't want to have to manage a whole bunch of tables. I want one table, and I want to be able to say, That section is closed. That's not your problem, that's the solution you're looking for. You're assuming a particular solution in your problem statement. Probably in that one, yes. I'm still waiting for permission to post my original problem statement; I suspect it's not going to be forthcoming by next Monday, so it's not going to happen. But I did outline something like what I'm talking about elsewhere in this thread. For my case, I'm thinking of the sort of data that builds up over time, and most of which happens probably not to be useful at any moment, but all of which _might_ be useful over the long haul. So what I wanted, originally, was to be able to set arbitrary ranges of tuples to be read-only, and to be able to set them offline if I wanted. Pseudo-DDL: ALTER TABLE foo SET read_only='t' WHERE created_on < '2007-01-01'; ALTER TABLE foo SET tuple_offline='t' WHERE created_on < '2006-01-01'; Now, the second segment is marked offline. If I query the table for things in that range, I get an ERROR telling me there might be data in the range, but it's not mounted at the moment. If I try to update records in the read-only (first) range, I get an error telling me the tuple is marked read only. The idea then is that these older tuples can be put off into long-term storage (wave hands here about the management of that stuff), and this keeps my online system compact but yet allows me, for just the cost of mounting a backup tape and reading the segments back, to go back and query those old ranges. The case I was particularly aiming at originally was for a case of data that cannot cost more than fractions of pennies to store, but that might represent a hugely expensive liability if the answer is not always right.
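The behaviour sketched in the pseudo-DDL above can be modelled in a few lines (toy Python, invented API, not real DDL semantics): updates into a read-only range fail, and a query overlapping an offline range raises an error rather than silently returning partial results.

```python
# Toy model of the requested behaviour: ranges of a table can be marked
# read-only or offline; updates to those ranges fail, and queries that
# overlap an offline range raise an error instead of returning partial data.

class RangeStateError(Exception):
    pass

class Table:
    def __init__(self):
        self.read_only = []   # list of (lo, hi) created_on ranges
        self.offline = []

    def _in(self, ranges, key):
        return any(lo <= key <= hi for lo, hi in ranges)

    def update(self, created_on):
        if self._in(self.read_only, created_on) or self._in(self.offline, created_on):
            raise RangeStateError("tuple range is read-only or offline")
        return "updated"

    def query(self, lo, hi):
        for olo, ohi in self.offline:
            if lo <= ohi and olo <= hi:   # ranges overlap
                raise RangeStateError("range not mounted; restore segment first")
        return "rows"

foo = Table()
foo.read_only.append((2006, 2006))   # like: SET read_only WHERE created_on < ...
foo.offline.append((2005, 2005))     # like: SET tuple_offline WHERE created_on < ...

print(foo.update(2007))              # current data still writable
try:
    foo.query(2005, 2007)            # touches an offline range -> ERROR
except RangeStateError as e:
    print("ERROR:", e)
```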
Driving down that storage cost was mostly what I was aiming at, but people gradually convinced me that slightly more generic implementations might be useful. Simon's proposal came along, and it seems to me to be something like the generic implementation that others already convinced me was needed. I think Simon's proposal loses the very feature that makes partitioning useful. The DBA doesn't have a thing to describe, he has to define what parts of the table he's describing for every operation. And if you define a whole new object to name these things I think you'll end up with something that looks a lot like tables. I don't see how that's the case at all. In fact, I have the feeling it's the opposite, so perhaps I've misunderstood something. A
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: This technique would be useful for any table with historical data keyed by date or timestamp. It would also be useful for data where a time-of-insert component is implicit, such as many major entity tables where the object ids are assigned by a sequence, e.g. an Orders table with an OrderId as PK. Once all orders placed in a period have been shipped/resolved/closed then the segments will be marked read-only. It's not really going to change the code path much for small tables, yet seems likely to work reasonably well for large tables of all shapes and sizes. If a segment is being updated, we leave it alone, and maybe never actually set the visibility map at all. So overall, this idea seems to cover the main use case well, yet with only minimal impact on the existing use cases we support. As before, I will maintain this proposal on the PG developer Wiki, once we get down to detailed design. Like it? Simon, A novel approach to the problem. For me, this proposal addresses some of the other problems in postgres (visibility in the heap vs. index) rather than the problems with partitioning itself. It might seem otherwise, but for me partitioning is a tool for managing large volumes of data. It allows the user to encode what they know about the nature of their data, and that's often about management. In this way, your proposal actually makes partitioning too smart: the user wants to tell the system how to organise the data, as opposed to the other way around. At Greenplum, we've been discussing this in depth. Interestingly, we also felt that the storage layer could be much smarter with large tables with predictable layouts and predictable data patterns. But the thing is, people with large tables like partitioning, putting different partitions on different storage; they like the ability to merge and split partitions; they like truncating old partitions.
In a word, they seem to like the manageability partitions give them -- as well as the performance that comes with this. To this end, we (well, Jeff Cohen) looked at the syntax and semantics of partitioning in leading databases (Oracle, Informix, DB2) and came up with a highly expressive grammar which takes the best of each I think (I'll post details on the grammar in a separate thread). The idea is that range (for example, a date range), list (a list of distinct values) and hash partitioning be supported on multiple columns. Partitions can be added, merged, truncated. Partitions can reside on different tablespaces. The existing issues with the rewriter, COPY support and the like go away by smartening up the backend. To explore the grammar and semantics Jeff and I (to a small extent) have implemented the parsing of such a grammar in the backend. At the moment, this just results in the generation of a bunch of rules (i.e., it produces the current behaviour, as if someone had written rules themselves) and debugging output. The fully fledged approach will see partitioning rules stored in a new system catalog which can be interrogated by the planner or COPY to determine which partition a given tuple belongs in or which partition(s) a WHERE clause matches. Yes, this is the traditional approach but I think it still has merit. We shall continue working on this approach because it is what our customers have asked for. We would also like to see it get into postgres too. Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, 2008-01-02 at 17:56 +, Simon Riggs wrote: Like it? Very cool :-) One additional thought: what about a kind of segment fill factor? Meaning: each segment has some free space reserved for future updates/inserts of records in the same range of its partitioning constraint. And when inserting/updating you put the new record into the corresponding segment... just like a very coarse clustering. Then you could vacuum the segments separately to keep the free space from running out. For active segments you would then fix the partitioning constraint range once the fill factor is reached, to allow for keeping its constraint even when heavily updating (heavily vacuuming it too as response to that), and create a new segment for the unbounded range for new inserts... this would work fine for tables where the constraint is based on ever increasing keys and accidental inserts in old ranges (which do happen occasionally in real life). When the change rate of old segments gets down, the segments could be reorganized to have a smaller fill factor, so that you still allow for accidental updates but keep space usage efficient. This would be some similar action as a clustering, but hopefully not blocking (which might be a hard thing to do)... and later again you could mark some of the really old things as read only and put them in special segments with no wasted space. One problem would be when the segment's free space runs out, so you must put records from the same constraint range in multiple segments - but that could still work, you just would have multiple segments covering the same range, but if the segment fill factor is chosen properly it should not be the case... you could normally maintain a set of non-overlapping segments in terms of the partitioning constraint. Cheers, Csaba.
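Csaba's fill-factor idea can be sketched as an insert-routing rule (hypothetical Python with invented numbers): a new key goes into a segment whose range covers it and which still has space; otherwise it goes to an open segment whose range is still growing, and a segment's range is frozen once its soft fill limit is reached.

```python
# Sketch of the "segment fill factor" idea: route each insert to a segment
# whose key range covers the new key and which still has reserved free
# space; otherwise use (or open) a segment for the unbounded range.
# Capacity, fill factor and the routing policy are all invented here.

class Segment:
    def __init__(self, capacity=4, fill_factor=0.75):
        self.rows = []
        self.capacity = capacity
        # Reserve (1 - fill_factor) of the space for later same-range inserts.
        self.soft_limit = int(capacity * fill_factor)
        self.bounded = False          # range is fixed once soft limit reached

    def key_range(self):
        return (min(self.rows), max(self.rows)) if self.rows else None

    def insert(self, key):
        self.rows.append(key)
        if len(self.rows) >= self.soft_limit:
            self.bounded = True       # stop growing this segment's range

def route_insert(segments, key):
    for seg in segments:              # coarse clustering: reuse matching range
        r = seg.key_range()
        if r and r[0] <= key <= r[1] and len(seg.rows) < seg.capacity:
            seg.insert(key)
            return seg
    for seg in segments:              # a segment still accepting new keys
        if not seg.bounded and len(seg.rows) < seg.capacity:
            seg.insert(key)
            return seg
    seg = Segment()                   # all bounded or full: fresh segment
    segments.append(seg)
    seg.insert(key)
    return seg

segments = []
for key in [1, 2, 3, 10, 11, 12, 2]:  # note the late insert into the old range
    route_insert(segments, key)
print([seg.rows for seg in segments])
```

The late insert of key 2 lands in the first segment's reserved space rather than opening an overlapping segment, which is exactly the effect the fill factor is meant to buy.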
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi Csaba, Csaba Nagy wrote: One additional thought: what about a kind of segment fill factor ? Meaning: each segment has some free space reserved for future updates/inserts of records in the same range of it's partitioning constraint. And when inserting/updating you put the new record into the corresponding segment... just like a very coarse clustering. Hm.. yeah. That way, a few writes to a read optimized segment could be accepted, without having to drop the optimization immediately. And the other way around: generally prevent having to drop the optimization by forcing tuples to be written to a segment with matching min/max tuples. Although, that's not exactly trivial, I think. However, for tables which don't fit the use case of SE, people certainly don't want such a fill factor to bloat their tables. Regards Markus ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, 2008-01-07 at 13:59 +0100, Markus Schiltknecht wrote: However, for tables which don't fit the use case of SE, people certainly don't want such a fill factor to bloat their tables. Sure, but it could be configurable and should only be enabled if the table is marked as partitioned on some condition... I think it would be a bad idea anyway if the DB would start partitioning on some arbitrary criteria based on analyzing the table, so the DBA should be the one to decide on what criteria to partition. In particular it could be a bad idea on occasions to partition on the clustering criteria for tables which were clustered once. Cheers, Csaba. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, 2008-01-07 at 14:20 +0100, Markus Schiltknecht wrote: Why is that? AFAIUI, Segment Exclusion combines perfectly well with clustering. Or even better, with an upcoming feature to maintain clustered ordering. Where do you see disadvantages of such an optimization for sequential scans? Well, as I understood it, this would be some kind of special case of clustering, where the cluster key is expected to be ever increasing in time and new rows would not be randomly distributed over the complete possible range. In theory you could also have each segment in turn be clustered on some other criteria than the partitioning criteria, so indexed access could also be better on the main selection criteria, which could be different than the partitioning criteria. All this is of course just speculation - but I guess that's what you expected too in this discussion :-) Cheers, Csaba. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Csaba Nagy wrote: Sure, but it could be configurable and should only be enabled if the table is marked as partitioned on some condition... As I'm regarding SE as an optimization, I disagree here.. Like all optimizations, SE should conceptually be reasonably close to cost-less when unneeded. I think it would be a bad idea anyway if the DB would start partitioning on some arbitrary criteria based on analyzing the table, so the DBA should be the one to decide on what criteria to partition. I absolutely agree for real partitioning, which targets manageability of table partitions. In particular it could be a bad idea on occasions to partition on the clustering criteria for tables which were clustered once. Why is that? AFAIUI, Segment Exclusion combines perfectly well with clustering. Or even better, with an upcoming feature to maintain clustered ordering. Where do you see disadvantages of such an optimization for sequential scans? Regards Markus ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sat, Jan 05, 2008 at 08:02:41PM +0100, Markus Schiltknecht wrote: Well, management of relations is easy enough, known to the DBA and most importantly: it already exists. Having to set up something which is *not* tied to a relation complicates things just because it's an additional concept. But we're already dealing with some complicated concepts. There isn't anything that will prevent current-style partitioning strategies from continuing to work in the face of Simon's proposal. But let me see if I can outline the sort of cases where I see real value in what he's outlined. There is a tendency in data systems to gather all manner of data that, in retrospect, _might_ turn out to be valuable; but which, at the time, is not really valuable at all. Moreover, the value later on might be relatively low: if you can learn something much later from that data, and do so easily, then it will be worth doing. But if the work involved passes some threshold (say 1/2 a day), it's suddenly not worth it any more. It's simple economics: below a certain cost, the data is valuable. Above a certain cost, you simply shouldn't keep the data in the first place, because the cost of using it is higher than any value you'll likely be able to extract. Simon's proposal changes the calculations you have to do. If keeping some data online longer does not impose administrative or operational overhead (you have it marked read only, so there's no I/O for vacuum; you don't need to do anything to get the data marked read only; &c.), then all it costs is a little more disk, which is relatively cheap these days. More importantly, if the longer-term effect of this strategy is to make it possible to move such data offline _without imposing a big cost_ when moving it back online, then the value is potentially very high. Without even trying, I can think of a dozen examples in the past 5 years where I could have used that sort of functionality. 
Because the cost of data retrieval was high enough, we had to decide that the question wasn't worth answering. Some of those answers might have been quite valuable indeed to the Internet community, to be frank; but because I had to pay the cost without getting much direct benefit, it just wasn't worth the effort. The thing about Simon's proposal that is beguiling is that it is aimed at a very common use pattern. The potential for automatic management under such a use pattern makes it seem to me to be worth exploring in some detail. Agreed. I'd say that's why the DBA needs to be able to define the split point between partitions: only he knows the meaning of the data. I think this is only partly true. A casual glance at the -general list will reveal all manner of false assumptions on the parts of administrators about how their data is structured. My experience is that, given that the computer has way more information about the data than I do, it is more likely to make the right choice. To the extent it doesn't do so, that's a problem in the planning (or whatever) algorithms, and it ought to be fixed there. A ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Andrew Sullivan wrote: On Sat, Jan 05, 2008 at 08:02:41PM +0100, Markus Schiltknecht wrote: Well, management of relations is easy enough, known to the DBA and most importantly: it already exists. Having to set up something which is *not* tied to a relation complicates things just because it's an additional concept. But we're already dealing with some complicated concepts. Possibly, yes, but that's hardly a reason to introduce even more complicated concepts... Does anything speak against letting the DBA handle partitions as relations? There isn't anything that will prevent current-style partitioning strategies from continuing to work in the face of Simon's proposal. Agreed. Nor will Simon's proposal completely replace them. Without even trying, I can think of a dozen examples in the past 5 years where I could have used that sort of functionality. Because the cost of data retrieval was high enough, we had to decide that the question wasn't worth answering. Some of those answers might have been quite valuable indeed to the Internet community, to be frank; but because I had to pay the cost without getting much direct benefit, it just wasn't worth the effort. Sure, there's value in Simon's proposal. But it has pretty strict requirements. IMO, it's pretty hard to say if it would have helped at all for your cases. Any of them still available to check? Remember the requirements: no single tuple in the segment may be significantly out of the average bounds. Otherwise, the min/max gets pretty useless and the segment can never be excluded. As said before, combination with CLUSTERing might help, yes. But if you need to maintain CLUSTERed ordering, aren't there better ways? For example, you could use binary searching on the relation directly, much like with indices, instead of sequentially scanning the CLUSTERed relation. That would even give us some sort of indices with visibility. Agreed. 
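Markus's binary-search aside can be illustrated with a small sketch (hypothetical names; Python used only for illustration, not PostgreSQL code): if a relation is kept in clustered order, the first key stored in each block partitions the key space, and a lookup can bisect over block boundaries instead of scanning, behaving much like an index but with visibility information already in the heap.

```python
from bisect import bisect_right

def find_block(block_first_keys, key):
    """Given the first key stored in each block of a relation kept in
    clustered order, return the index of the only block that could
    contain `key` (keys below the first block map to block 0)."""
    i = bisect_right(block_first_keys, key) - 1
    return max(i, 0)
```

A lookup then reads a single block plus O(log n) boundary probes, rather than the whole clustered relation.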
I'd say that's why the DBA needs to be able to define the split point between partitions: only he knows the meaning of the data. I think this is only partly true. A casual glance at the -general list will reveal all manner of false assumptions on the parts of administrators about how their data is structured. My experience is that, given that the computer has way more information about the data than I do, it is more likely to make the right choice. To the extent it doesn't do so, that's a problem in the planning (or whatever) algorithms, and it ought to be fixed there. Well, Postgres doesn't automatically create indices, for a counter example. With regard to partitioning over multiple table spaces, I think the DBA definitely has more information available than the computer. A DBA (hopefully) knows future plans and emergency strategies for the storage system, for example. Lacking such information, the database system will have a pretty hard time making a good decision on how to partition between table spaces, IMO. Regards Markus ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: Does anything speak against letting the DBA handle partitions as relations? Yes: it doesn't solve the problem I have, which is that I don't want to have to manage a whole bunch of tables. I want one table, and I want to be able to say, That section is closed. Sure, there's value in Simon's proposal. But it has pretty strict requirements. IMO, it's pretty hard to say if it would have helped at all for your cases. Any of them still available to check? No, but one of your worries doesn't bother me: Remember the requirements: no single tuple in the segment may be significantly out of the average bounds. Otherwise, the min/max gets pretty useless and the segment can never be excluded. The segment can never be excluded in a search on that key, yes. But consider the likely cases we're looking at: WHERE some_date >= '1999-01-01' AND some_date < '2001-01-01'; WHERE sequence_field BETWEEN 3000 AND 30; &c. These are the two obvious cases: you're searching for data in a given date range or for primary (sometimes artificial) identifiers in a range, and the source data increases (almost) monotonically. You have to do this now anyway, because there's _some_ basis on which you're partitioning your data; but today, you do this with a lot of fooling around with views and nasty triggers that push data into the right table, assuming someone doesn't screw it up. need to maintain CLUSTERed ordering, aren't there better ways? For example, you could use binary searching on the relation directly, much like with indices, instead of sequentially scanning on the CLUSTERed relation. That would even give us some sort of indices with visibility. I think this is a nice idea too :) Well, Postgres doesn't automatically create indices, for a counter example. Yes, and it has no data-use analyser tools that automatically suggest indexes, either. 
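The exclusion test behind these example queries is just a range-overlap check against per-segment min/max summaries. A sketch (hypothetical function name; Python used only for illustration, not PostgreSQL code), treating the query as the half-open range [q_lo, q_hi):

```python
def can_exclude(seg_min, seg_max, q_lo, q_hi):
    """True if no tuple in the segment can satisfy q_lo <= x < q_hi:
    the segment's stored [min, max] and the query range don't overlap."""
    return seg_max < q_lo or seg_min >= q_hi
```

Because ISO date strings compare lexicographically, the same check works unchanged for the date-range case above.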
That's the sort of thing people coming from other (err, Other ;-) products complain about, in fact. definitely has more information available, than the computer. A DBA (hopefully) knows future plans and emergency strategies for the storage system, for example. Perhaps my jaundice comes from too much time spent in operational trenches, but while good DBAs have some ideas about that, large numbers of them are harried and overwhelmed just by the piles of work they already have. Nevertheless, while what you say is true, I'm not sure what it has to do with the present case. I don't think the current proposal is to address partitioning across table spaces. It's to do with the way certain segments of a table are interpreted by the system. It's undoubtedly true that this strategy is of questionable utility for many kinds of use of PostgreSQL. But it seems to offer very significant advantages for one use-pattern that is very common. That said, I am not trying to argue it should be adopted without poking at its weaknesses. I just think it unfair to ask the proposal to address problems it's not really aimed at. A ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Andrew Sullivan wrote: On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: ...the requirements: no single tuple in the segment may be significantly out of the average bounds. Otherwise, the min/max gets pretty useless and the segment can never be excluded. The segment can never be excluded in a search on that key, yes. But consider the likely cases we're looking at: ...you're searching for data in a given date range or for primary (sometimes artificial) identifiers in a range, and the source data increases (almost) monotonically. It seems to me the idea discussed elsewhere in the thread[1] about configurable segment sizes would make this limitation much less problematic for some types of data. Some of my biggest tables are clustered by zip code, and are insert-mostly. Common queries are where state_provence='TX' or where city='Dallas'. While I doubt I have enough data to fill a 1GB segment for any but the largest cities, I certainly have runs of many consecutive blocks - since clustering by zip tends to cluster cities as well. Even though the table's not monotonically increasing or decreasing, like values for cities and states are clustered together. Is my understanding right that these Segment Visibility Maps could help this case as well? [1] http://archives.postgresql.org/pgsql-hackers/2008-01/msg00065.php ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Andrew Sullivan wrote: On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: Does anything speak against letting the DBA handle partitions as relations? Yes: it doesn't solve the problem I have, which is that I don't want to have to manage a whole bunch of tables. I want one table, and I want to be able to say, That section is closed. That's a fair requirement. Wanting to manage partitions manually is another fair requirement, IMO. They can well coexist. Remember the requirements: no single tuple in the segment may be significantly out of the average bounds. Otherwise, the min/max gets pretty useless and the segment can never be excluded. The segment can never be excluded in a search on that key, yes. But consider the likely cases we're looking at: Uh, which key are you talking about? AFAIU Simon's proposal, he suggests maintaining min/max values for all columns of the table. WHERE some_date >= '1999-01-01' AND some_date < '2001-01-01'; Yeah, and if only *one* tuple in the 1G segment has: some_date <= '1998-12-31' OR some_date >= '2001-01-01' Segment Exclusion can't exclude that segment. That's all I'm saying. That said, I am not trying to argue it should be adopted without poking at its weaknesses. I just think it unfair to ask the proposal to address problems it's not really aimed at. Huh? I'm certainly not the one asking for it. Quite the opposite, I'm warning against over-estimating the use of SE. In his proposal, Simon was explicitly comparing to declarative partitioning, pointing out lots of advantages and implicitly stating that SE could cover most, if not all, needs of what's commonly understood by partitioning. That's where I disagree. But certainly, SE and SVM have some merit for their use case. And I'm looking forward to test patches ;-) Regards Markus ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
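Markus's point can be made concrete with a small sketch (hypothetical names; Python used only for illustration, not PostgreSQL code): a single stray tuple widens a segment's stored min/max, and the segment stops being excludable for queries on neighbouring ranges, even though almost all of its data is out of range.

```python
def minmax(values):
    # the per-segment summary Simon's proposal would maintain
    return min(values), max(values)

def can_exclude(seg_min, seg_max, q_lo, q_hi):
    # segment is skippable iff its [min, max] cannot overlap [q_lo, q_hi)
    return seg_max < q_lo or seg_min >= q_hi
```

For a segment holding only year-2000 dates, a query for 1998 data excludes it; add one '1998-12-31' tuple and the same segment must always be scanned for such queries.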
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Andrew Sullivan [EMAIL PROTECTED] writes: On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: Does anything speak against letting the DBA handle partitions as relations? Yes: it doesn't solve the problem I have, which is that I don't want to have to manage a whole bunch of tables. I want one table, and I want to be able to say, That section is closed. That's not your problem, that's the solution you're looking for. You're assuming a particular solution in your problem statement. I posit that the whole point of partitioning is to create an object which serves to represent the semantically meaningful chunks of data. The reason this is useful is precisely because it serves as a short-hand for the DBA to describe the data and how it will be used. I think Simon's proposal loses the very feature that makes partitioning useful. The DBA doesn't have a thing to describe, he has to define what parts of the table he's describing for every operation. And if you define a whole new object to name these things I think you'll end up with something that looks a lot like tables. I also don't understand how this proposal deals with the more common use case of unloading and loading data. Normally in partitioned tables we build the data in a side table until the data is all correct, then load it as a partition. If you treat it as a lower-level object then I don't see that working. The layout of the new table won't often match the layout of the target partitioned table. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support! ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Gregory Stark wrote: I also don't understand how this proposal deals with the more common use case of unloading and loading data. Normally in partitioned tables we build the data in a side table until the data is all correct, then load it as a partition. If you treat it as a lower-level object then I don't see that working. The layout of the new table won't often match the layout of the target partitioned table. +1 In addition, a similar use case is archival of old partitions as they are no longer (commonly) accessed - perhaps to a different tablespace (or even to backup media). I don't see how dynamic in-table partitioning handles this, and I think it would be highly desirable to be able to do these things! best wishes Mark ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Jan 6, 2008 3:00 AM, Robert Treat [EMAIL PROTECTED] wrote: On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote: To satisfy all the different requirements of partitioning with segments based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. Why? Uh.. if a table (RELKIND_RELATION) can only span one table space, as it is now, all of its segments are in the same table space. I don't quite call that partitioning. Well, sure, you could call it so, but then, each and every Postgres table is already partitioned in 1G segments. It all depends on the definitions, but in my world, horizontal partitioning for databases involves multiple table spaces (and is quite useless without that). Calling anything else partitioning is confusing, IMO. I'm not following this. If we can work out a scheme, I see no reason not to allow a single table to span multiple tablespaces. Do you see a problem with that? If we can span two different partitions under two tablespaces, it would help concepts like parallel queries, parallel updates etc. in data warehousing. If we allow a single table to span across multiple tablespaces, the question arises how we would be able to move around different parts of the table, as we like. Segments, I believe, don't let the user/DBA draw the split points. In a more general sense, a global index is an index that spans multiple partitions, as opposed to a local index, which is an index on specific partitions; postgresql currently supports the latter, not the former. In any case, my thinking is if we had the segment exclusion technique, I could convert that partitioned table into a regular table again, use segment exclusion to handle what is currently handled by partitions, and create a global index across all the other data for that other, currently killer, query. That's a good idea. A local index, then, would be equivalent to partial indexes. 
But single-table partition maintenance might not be as flexible as the multi-table partition approach. In real partitioning, dropping a single partition should not affect the local indexes of other partitions. But I think that would be very difficult to implement under this scheme. Thanks, Gokul.
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Jan 6, 2008 11:27 AM, [EMAIL PROTECTED] wrote: On Sun, Jan 06, 2008 at 01:12:32AM +0530, Gokulakannan Somasundaram wrote: On Jan 5, 2008 6:15 PM, [EMAIL PROTECTED] wrote: One thought I had back then, with partitioned tables was gee -- B-tree index is already doing a partition; why do a manual partition on top of that?. Can you please explain more on what you are trying to say here? Sure. A B-tree is just a device to partition something along some order. If you have, say, a table of orders (to use the example upthread) and a B-tree index on order date, this index partitions your set (at recursively finer levels). But the current index scans - Index Scan and Bitmap Index Scan - don't provide the exact benefit of partitioning, even if all the columns are covered by the index. They do a lot more disk reads than the partitioning scheme. I think you are looking for something like block indexes in Multi-Dimensional Clustering in DB2. Heikki did something like that in a more subtle way. PostgreSQL CLUSTER, as you may know, doesn't maintain the order on inserts. We might go for Index Organized Tables/Clustered indexes. But then, B-tree would give a lot of performance problems if the index tuple size increases beyond a certain point. Thanks, Gokul.
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Gokulakannan Somasundaram wrote: On Jan 5, 2008 6:15 PM, [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: One thought I had back then, with partitioned tables was gee -- B-tree index is already doing a partition; why do a manual partition on top of that?. Can you please explain more on what you are trying to say here? I think this has to do with SE not being of much use for index scans. Or put it another way: SE is an optimization for sequential scans. For tables where it works well, it could possibly replace the index entirely. Without the index, you would rely on SE to always be able to exclude enough segments, so that the seq scan is less expensive than an index scan with the following table lookups. With an index, the planner gets a hard time deciding between the index scan and the (possibly SE optimized) seq scan. Regards Markus ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Robert Treat wrote: On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote: To satisfy all the different requirements of partitioning with segment-based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. Why? Uh.. if a table (RELKIND_RELATION) can only span one table space, as it is now, all of its segments are in the same table space. I don't quite call that partitioning. Well, sure, you could call it so, but then, each and every Postgres table is already partitioned in 1G segments. It all depends on the definitions, but in my world, horizontal partitioning for databases involves multiple table spaces (and is quite useless without that). Calling anything else partitioning is confusing, IMO. I'm not following this. If we can work out a scheme, I see no reason not to allow a single table to span multiple tablespaces. Do you see a problem with that? Uhm... well, no. I was just pointing out that it's a requirement. It depends on how you define things, but I'm seeing it that way: table -- 1:n -- partition -- 1:1 -- table space -- 1:n -- segments What I'm advocating is making partitions available to the DBA as some kind of a relation she can query separately and move around between table spaces. Why should that not be possible with other schemes? Moving the split point between two partitions involves moving tuples around, no matter if you are going to move them between segments or between relations building the partitions. The difference is that, if I currently have a table split by month, I can re-partition it into weekly segments, and only shuffle one month's data at a time, minimizing impact on the system while I shuffle it. This can even be used to do dynamic management, where data from the current month is archived by day, data from the past year by week, and data beyond that done monthly. This should be possible for both schemes, I see no connection to what we've discussed. 
SE doesn't magically give you this level of control you are requesting here. Quite the opposite: referring to CLUSTERing makes me wonder if that's not going to shuffle way too many tuples around. What I'm saying is that SE doesn't partition the segments into different table spaces. Thus I don't consider it database partitioning in the first place. As I currently understand it, it's: table -- 1:1 -- table space -- 1:n -- partitions -- 1:n -- segments On many other databases, if you change the partition scheme, it requires exclusive locks and a shuffling of all of the data, even data whose partitions aren't being redefined. Even worse are systems like mysql, where you need to rewrite the indexes as well. To me, these requirements always seem like show stoppers; I generally can't afford to lock a table while the database rewrites a billion rows of data. I fully agree here. How do you plan to solve that problem on top of SE? In a more general sense, a global index is an index that spans multiple partitions, as opposed to a local index, which is an index on specific partitions; postgresql currently supports the latter, not the former. In any case, my thinking is if we had the segment exclusion technique, I could convert that partitioned table into a regular table again, ... on a single table space ... use segment exclusion to handle what is currently handled by partitions, ... except, that there is no partitioning (!?!) (between table spaces) and create a global index across all the other data for that other, currently killer, query. I thought the table you are referring to is bigger than your fastest table space? That would even make it impossible. See where I'm coming from? And why I'm stating that SE is an optimization (for seq scans), but not partitioning? Regards Markus ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Jan 6, 2008 4:09 PM, Markus Schiltknecht [EMAIL PROTECTED] wrote: Hi, Gokulakannan Somasundaram wrote: On Jan 5, 2008 6:15 PM, [EMAIL PROTECTED] wrote: One thought I had back then, with partitioned tables was gee -- B-tree index is already doing a partition; why do a manual partition on top of that?. Can you please explain more on what you are trying to say here? I think this has to do with SE not being of much use for index scans. Or put it another way: SE is an optimization for sequential scans. For tables where it works well, it could possibly replace the index entirely. Without the index, you would rely on SE to always be able to exclude enough segments, so that the seq scan is less expensive than an index scan with the following table lookups. With an index, the planner gets a hard time deciding between the index scan and the (possibly SE optimized) seq scan. That's a good point. But I think Simon is planning to give the job not to the planner, but to the executor. So SE optimization will come into play only when the planner has decided on a sequential scan.
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sunday 06 January 2008 05:48, Markus Schiltknecht wrote: What I'm saying is that SE doesn't partition the segments into different table spaces. Thus I don't consider it database partitioning in the first place. As I currently understand it, it's: table -- 1:1 -- table space -- 1:n -- partitions -- 1:n -- segments Ah, that was a slight misunderstanding of terminology. I see what you're getting at. On many other databases, if you change the partition scheme, it requires exclusive locks and a shuffling of all of the data, even data whose partitions aren't being redefined. Even worse are systems like mysql, where you need to rewrite the indexes as well. To me, these requirements always seem like show stoppers; I generally can't afford to lock a table while the database rewrites a billion rows of data. I fully agree here. How do you plan to solve that problem on top of SE? In a more general sense, a global index is an index that spans multiple partitions, as opposed to a local index, which is an index on specific partitions; postgresql currently supports the latter, not the former. In any case, my thinking is if we had the segment exclusion technique, I could convert that partitioned table into a regular table again, on a single table space ... use segment exclusion to handle what is currently handled by partitions, except, that there is no partitioning (!?!) (between table spaces) and create a global index across all the other data for that other, currently killer, query. I thought the table you are referring to is bigger than your fastest table space? That would even make it impossible. Nope, different table :-) Although the above global/local one would probably live entirely on the slow disks, using SE and a global index to satisfy the queries. And really our slow disks aren't exactly slow, but do have poor concurrency, so we put primarily read data on them to keep writes to a minimum. See where I'm coming from? 
And why I'm stating that SE is an optimization (for seq scans), but not partitioning? Yes, but aiui, SE should allow seq scans to achieve performance similar to partitioning, especially if the planner can exclude segments based on values in the data. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-04 at 22:31 -0500, Robert Treat wrote: Not to be negative, but istm how this feature would be managed is as important as the bits under the hood. Agreed. On this part of the thread, we've been discussing an extension to the basic proposal, which is why I have not been concentrating there. Core management wise, the basic proposal showed how we would be able to have VACUUM run much faster than before and how DELETE will also be optimised naturally by this approach. Loading isn't any slower than it is now; loading does need some work, but that's another story. Or at least we have to believe there will be some practical way to manage this, about which I am as yet skeptical. Skepticism is OK, but I'd like to get your detailed thoughts on this. I've been an advocate of the multi-tables approach now for many years, so I don't expect everybody to switch their beliefs on my say-so overnight. Let me make a few more comments in this area: The main proposal deliberately has few, if any, knobs and dials. That's a point of philosophy that I've had views on previously: my normal stance is that we need some knobs to allow the database to be tuned to individual circumstances. In this case, partitioning is way too complex to administer effectively and requires application changes that make it impossible to use for packaged applications. The latest Oracle TPC-H benchmark uses 10 pages of DDL to set it up and if I can find a way to avoid that, I'd recommend it to all. I do still want some knobs and dials, just not 10 pages' worth, though I'd like your and others' guidance on what those should be. Oracle have been responding to feedback with their new interval partitioning, but it's still a multi-table approach in essence. My observation of partitioned databases is that they all work beautifully at the design stage, but problems emerge over time. A time-based range partitioned table can often have different numbers of rows per partition, giving inconsistent response times. 
A height-balanced approach where we make the partitions all the same size, yet vary the data value boundaries, will give much more consistent query times and can be completely automated much more easily. The SVM concept doesn't cover everything that you can do with partitioning, but my feeling is it covers the main use cases well. If that's not true, in broad strokes or in the detail, then we need to uncover that. Everybody's help in doing that is appreciated, whatever the viewpoint and whatever the outcome. It's probably worth examining existing applications to see how well they would migrate to the segmented tables approach. The following query will analyse one column of a table to produce a list of boundary values, given a segment size of 131072 blocks (1 GB). select substr(ctid::text,2,strpos(ctid::text,',')-2)::integer/131072 as seg, min(PrimaryKey), max(PrimaryKey) from bigtable group by seg; We should be able to see whether this works for existing use cases, or not, fairly easily. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
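To make the boundary-value query concrete, here is a rough sketch in Python (not from the original thread; the sample data and names are invented). It performs the same computation as the SQL above: the block number extracted from each tuple's ctid is mapped to a 1 GB segment (131072 blocks of 8 KB), and the minimum and maximum key values are collected per segment.

```python
SEGMENT_BLOCKS = 131072  # 1 GB of 8 KB blocks, as in the query above

def segment_boundaries(rows):
    """rows: iterable of (block_number, key) pairs, i.e. the block part of
    each tuple's ctid plus the column being analysed.
    Returns {segment: (min_key, max_key)} -- per-segment boundary values."""
    bounds = {}
    for block, key in rows:
        seg = block // SEGMENT_BLOCKS
        lo, hi = bounds.get(seg, (key, key))
        bounds[seg] = (min(lo, key), max(hi, key))
    return bounds

# Naturally ordered data: keys grow with block number, so segments get
# tight, non-overlapping boundaries and segment exclusion works well.
rows = [(0, 1), (100, 250), (131072, 300), (200000, 480), (262144, 500)]
print(segment_boundaries(rows))
# {0: (1, 250), 1: (300, 480), 2: (500, 500)}
```

Running this against the output of the SQL query for an existing table would show directly whether the per-segment min/max ranges overlap, i.e. whether segment exclusion could ever skip anything for that column.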
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Andrew Sullivan wrote: On Fri, Jan 04, 2008 at 10:26:54PM +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. I think that part might be hand-wavy still. But once this facility is there, what's to prevent the current active segment (and the rest) from also getting this mark, which would mean the table is read only? Well, sure, marking *all* segments read-only is pretty easy. But that was not quite my point. Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sat, Jan 05, 2008 at 09:33:45AM +, Simon Riggs wrote: [...] The main proposal deliberately has few, if any, knobs and dials. That's a point of philosophy that I've had views on previously: my normal stance is that we need some knobs to allow the database to be tuned to individual circumstances. One thought I had back then, with partitioned tables, was "gee -- a B-tree index is already doing a partition; why do a manual partition on top of that?". It's an exhilarating experience to watch you making a design before I could even spell this nebulous and unclear thought :-) Regards -- tomás
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, [EMAIL PROTECTED] wrote: The main proposal deliberately has few, if any, knobs and dials. That's a point of philosophy that I've had views on previously: my normal stance is that we need some knobs to allow the database to be tuned to individual circumstances. One thought I had back then, with partitioned tables, was "gee -- a B-tree index is already doing a partition; why do a manual partition on top of that?". Well, that's why I'm so focused on manageability: I think the primary purpose of partitioning is manageability (of the underlying storage for a table). Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Simon Riggs wrote: On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day? No way! Ah, I'm glad ;-) Simon Riggs wrote: Skepticism is OK, but I'd like to get your detailed thoughts on this. I've been an advocate of the multi-tables approach now for many years, so I don't expect everybody to switch their beliefs on my say-so overnight. Let me make a few more comments in this area: I've so far always thought about some sort of multi-relations approach for partitioning, yes. Let's see if I can get my mind around single-table partitioning. The main proposal deliberately has few, if any, knobs and dials. That's a point of philosophy that I've had views on previously: my normal stance is that we need some knobs to allow the database to be tuned to individual circumstances. In this case, partitioning is way too complex to administer effectively and requires application changes that make it impossible to use for packaged applications. The latest Oracle TPC-H benchmark uses 10 pages of DDL to set it up and if I can find a way to avoid that, I'd recommend it to all. I do still want some knobs and dials, just not 10 pages' worth, though I'd like your and others' guidance on what those should be. Oracle have been responding to feedback with their new interval partitioning, but it's still a multi-table approach in essence. I can absolutely support your efforts to minimize knobs and configuration DDL. However, my current feeling is that segment-based partitioning complicates things, because the DBA doesn't have tools and commands to handle segments. To satisfy all the different requirements of partitioning with segment-based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. However, what I certainly like is the automated split point definition. 
Instead of having to create tables by hand and linking them via inheritance and constraint exclusion, I have something very similar in mind, like what you proposed for marking read-only segments. Something like: SPLIT TABLE customers AT cust_name < 'n'; or: SPLIT TABLE inventory AT inv_id % 4 = 2; In my imagination, this should automatically create the underlying relations, i.e.: NOTICE: relations inventory__l and inventory__r have been created. That way, the DBA could then handle those like normal relations, querying them or moving them to different table spaces like all other normal relations. In a way, that's not so different from possible extensions on top of Segment Exclusion, except that the DBA additionally gets a relation name to be able to address the set of segments which form a partition. Or put it the other way around: go for Segment Exclusion, but add some sort of a sentinel relation for each set of segments, to make them reachable for the DBA. My observation of partitioned databases is that they all work beautifully at the design stage, but problems emerge over time. A time-based range partitioned table can often have different numbers of rows per partition, giving inconsistent response times. A height-balanced approach where we make the partitions all the same size, yet vary the data value boundaries will give much more consistent query times and can be completely automated much more easily. Uh.. well, consistent query time isn't the first thing I'm expecting from partitioning by time ranges. If I wanted consistent query times I'd rather use hash partitioning or something, no? I'd even state that one *wants* inconsistent response times when using time-based range partitioning, by moving old, seldom used data to slower storage and keeping only a small amount of often used tuples on the faster disks, for example. The SVM concept doesn't cover everything that you can do with partitioning, but my feeling is it covers the main use cases well. 
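The SPLIT TABLE semantics sketched above can be illustrated with a few lines of Python (the SPLIT TABLE syntax is Markus's hypothetical proposal, and the relation names and data here are invented for illustration): the split predicate simply routes each existing row into one of the two automatically created relations.

```python
def split_rows(rows, key, predicate):
    """Route each row to the 'left' or 'right' relation that a hypothetical
    SPLIT TABLE command would create (e.g. inventory__l / inventory__r)."""
    left, right = [], []
    for row in rows:
        (left if predicate(row[key]) else right).append(row)
    return left, right

# SPLIT TABLE customers AT cust_name < 'n';
customers = [{"cust_name": "alice"}, {"cust_name": "zoe"}, {"cust_name": "mike"}]
l, r = split_rows(customers, "cust_name", lambda v: v < "n")
print([c["cust_name"] for c in l])  # ['alice', 'mike']
print([c["cust_name"] for c in r])  # ['zoe']
```

The point of the proposal is that the database would perform this routing itself when the split point is declared, rather than the DBA creating child tables and CHECK constraints by hand.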
As I regard manageability to be the main advantage of partitioning, which you've intentionally left out for now, I disagree here. How could SVM or Segment Exclusion potentially be covering what hash partitioning does? Maybe together with the ability to store different segments of a table on different table spaces. That could be considered an approach to range partitioning. But then, that would be the partitioning, and not SVM or Segment Exclusion. To me, both of SVM and SE look much more like an optimization for certain special cases and don't have much to do with partitioning. Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Saturday 05 January 2008 10:42, Markus Schiltknecht wrote: The main proposal deliberately has few, if any, knobs and dials. That's a point of philosophy that I've had views on previously: my normal stance is that we need some knobs to allow the database to be tuned to individual circumstances. In this case, partitioning is way too complex to administer effectively and requires application changes that make it impossible to use for packaged applications. The latest Oracle TPC-H benchmark uses 10 pages of DDL to set it up and if I can find a way to avoid that, I'd recommend it to all. I do still want some knobs and dials, just not 10 pages' worth, though I'd like your and others' guidance on what those should be. Oracle have been responding to feedback with their new interval partitioning, but it's still a multi-table approach in essence. I can absolutely support your efforts to minimize knobs and configuration DDL. However, my current feeling is that segment-based partitioning complicates things, because the DBA doesn't have tools and commands to handle segments. Personally I can't say it complicates things, because it isn't clear how it will be managed. :-) To satisfy all the different requirements of partitioning with segment-based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. Why? However, what I certainly like is the automated split point definition. Instead of having to create tables by hand and linking them via inheritance and constraint exclusion, I have something very similar in mind, like what you proposed for marking read-only segments. Something like: SPLIT TABLE customers AT cust_name < 'n'; or: SPLIT TABLE inventory AT inv_id % 4 = 2; In my imagination, this should automatically create the underlying relations, i.e.: NOTICE: relations inventory__l and inventory__r have been created. 
That way, the DBA could then handle those like normal relations, querying them or moving them to different table spaces like all other normal relations. In a way, that's not so different from possible extensions on top of Segment Exclusion, except that the DBA additionally gets a relation name to be able to address the set of segments which form a partition. Or put it the other way around: go for Segment Exclusion, but add some sort of a sentinel relation for each set of segments, to make them reachable for the DBA. So the one thing that always scares me about these "define it all and let the database sort it out" methods is that they seem to lead to cases where the system ends up rewriting the data to fit into some new partition layout. One thing that is nice about the current partitioning scheme is you can control the impact of this behavior in these scenarios, by moving around small portions of the table at a time. My observation of partitioned databases is that they all work beautifully at the design stage, but problems emerge over time. A time-based range partitioned table can often have different numbers of rows per partition, giving inconsistent response times. A height-balanced approach where we make the partitions all the same size, yet vary the data value boundaries will give much more consistent query times and can be completely automated much more easily. Uh.. well, consistent query time isn't the first thing I'm expecting from partitioning by time ranges. If I wanted consistent query times I'd rather use hash partitioning or something, no? More to the point (I think) is that people define access to the data based on the meaning of the data, not how it is stored on disk. For example, in some tables we only need to be active on 1 month's worth of data... how that is laid out on disk (# partitions, which tablespaces) is a means to the end of working actively on 1 month's worth of data. 
I can't think of many cases where people would actually say they want to work actively on the most recent GB of data. I'd even state that one *wants* inconsistent response times when using time-based range partitioning, by moving old, seldom used data to slower storage and keeping only a small amount of often used tuples on the faster disks, for example. Yes, we do this quite a bit; and at least one of our partitioned tables (in total) is larger in size than our tablespace with the fastest disks. The SVM concept doesn't cover everything that you can do with partitioning, but my feeling is it covers the main use cases well. As I regard manageability to be the main advantage of partitioning, which you've intentionally left out for now, I disagree here. How could SVM or Segment Exclusion potentially be covering what hash partitioning does? Maybe together with the ability to store different segments of a table on different table spaces. That could be considered an approach to range partitioning. But then, that would be the partitioning,
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Robert Treat wrote: Personally I can't say it complicates things, because it isn't clear how it will be managed. :-) Well, management of relations is easy enough, known to the DBA and most importantly: it already exists. Having to set up something which is *not* tied to a relation complicates things just because it's an additional concept. But as I've pointed out, maybe what we have in mind isn't that different at all. Just have a sentinel relation mean a set of segments, i.e. all read-only segments of a table. Then again, a table - in a way - is not much else than a set of segments. So where's the real difference? To satisfy all the different requirements of partitioning with segment-based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. Why? Uh.. if a table (RELKIND_RELATION) can only span one table space, as it is now, all of its segments are in the same table space. I don't quite call that partitioning. Well, sure, you could call it so, but then, each and every Postgres table is already partitioned in 1G segments. It all depends on the definitions, but in my world, horizontal partitioning for databases involves multiple table spaces (and is quite useless without that). Calling anything else partitioning is confusing, IMO. So the one thing that always scares me about these "define it all and let the database sort it out" methods is that they seem to lead to cases where the system ends up rewriting the data to fit into some new partition layout. That holds true no matter if you shuffle between segments or relations. To be able to let the DBA define an exact split point, the database *will* have to shuffle tuples around. Why does that scare you? It's a regular database system's maintenance procedure. One thing that is nice about the current partitioning scheme is you can control the impact of this behavior in these scenarios, by moving around small portions of the table at a time. Uh.. 
I'm not quite following. What current partitioning scheme are you referring to? Why should that not be possible with other schemes? Moving the split point between two partitions involves moving tuples around, no matter if you are going to move them between segments or between relations building the partitions. More to the point (I think) is that people define access to the data based on the meaning of the data, not how it is stored on disk. For example, in some tables we only need to be active on 1 month's worth of data... how that is laid out on disk (# partitions, which tablespaces) is a means to the end of working actively on 1 month's worth of data. I can't think of many cases where people would actually say they want to work actively on the most recent GB of data. Agreed. I'd say that's why the DBA needs to be able to define the split point between partitions: only he knows the meaning of the data. To me, both of SVM and SE look much more like an optimization for certain special cases and don't have much to do with partitioning. Even if this were true, it might still be a useful optimization. Possibly, yes. To me, the use case seems pretty narrow, though. For example it doesn't affect index scans much. One table I am thinking of in particular in my system has one query we need to run across partitions, which ends up doing a slew of bitmap index scans for all the partitions. If using segment exclusion on it meant that I could get a global index to help that query, I'd be happy. As proposed, Segment Exclusion works only on exactly one table. Thus, if you already have your data partitioned into multiple relations, it most probably won't affect your setup much. It certainly has nothing to do with what I understand by 'global index' (that's an index spanning multiple tables, right?). Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Jan 5, 2008 6:15 PM, [EMAIL PROTECTED] wrote: One thought I had back then, with partitioned tables was gee -- B-tree index is already doing a partition; why do a manual partition on top of that?. Can you please explain more on what you are trying to say here? Thanks, Gokul.
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote: To satisfy all the different requirements of partitioning with segment-based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. Why? Uh.. if a table (RELKIND_RELATION) can only span one table space, as it is now, all of its segments are in the same table space. I don't quite call that partitioning. Well, sure, you could call it so, but then, each and every Postgres table is already partitioned in 1G segments. It all depends on the definitions, but in my world, horizontal partitioning for databases involves multiple table spaces (and is quite useless without that). Calling anything else partitioning is confusing, IMO. I'm not following this. If we can work out a scheme, I see no reason not to allow a single table to span multiple tablespaces. Do you see a problem with that? So the one thing that always scares me about these "define it all and let the database sort it out" methods is that they seem to lead to cases where the system ends up rewriting the data to fit into some new partition layout. That holds true no matter if you shuffle between segments or relations. To be able to let the DBA define an exact split point, the database *will* have to shuffle tuples around. Why does that scare you? It's a regular database system's maintenance procedure. It's a question of control; see below. One thing that is nice about the current partitioning scheme is you can control the impact of this behavior in these scenarios, by moving around small portions of the table at a time. Uh.. I'm not quite following. What current partitioning scheme are you referring to? The current, constraint exclusion / inheritance based, partitioning we have now in PostgreSQL. Why should that not be possible with other schemes? 
Moving the split point between two partitions involves moving tuples around, no matter if you are going to move them between segments or between relations building the partitions. The difference is that, if I currently have a table split by month, I can re-partition it into weekly segments, and only shuffle one month's data at a time, minimizing impact on the system while I shuffle it. This can even be used to do dynamic management, where data from the current month is archived by day, data from the past year by week, and data beyond that done monthly. On many other databases, if you change the partition scheme, it requires exclusive locks and a shuffling of all of the data, even data whose partitions aren't being redefined. Even worse are systems like MySQL, where you need to rewrite the indexes as well. To me, these requirements always seem like show stoppers; I generally can't afford to lock a table while the database rewrites a billion rows of data. More to the point (I think) is that people define access to the data based on the meaning of the data, not how it is stored on disk. For example, in some tables we only need to be active on 1 month's worth of data... how that is laid out on disk (# partitions, which tablespaces) is a means to the end of working actively on 1 month's worth of data. I can't think of many cases where people would actually say they want to work actively on the most recent GB of data. Agreed. I'd say that's why the DBA needs to be able to define the split point between partitions: only he knows the meaning of the data. To me, both of SVM and SE look much more like an optimization for certain special cases and don't have much to do with partitioning. Even if this were true, it might still be a useful optimization. Possibly, yes. To me, the use case seems pretty narrow, though. For example it doesn't affect index scans much. 
One table I am thinking of in particular in my system has one query we need to run across partitions, which ends up doing a slew of bitmap index scans for all the partitions. If using segment exclusion on it meant that I could get a global index to help that query, I'd be happy. As proposed, Segment Exclusion works only on exactly one table. Thus, if you already have your data partitioned into multiple relations, it most probably won't affect your setup much. It certainly has nothing to do with what I understand by 'global index' (that's an index spanning multiple tables, right?). In a more general sense, a global index is an index that spans multiple partitions, as opposed to a local index, which is an index on specific partitions; PostgreSQL currently supports the latter, not the former. In any case, my thinking is that if we had the segment exclusion technique, I could convert that partitioned table into a regular table again, use segment exclusion to handle what is currently handled by partitions, and create a global index across all the other data for that other, currently killer, query. -- Robert Treat Build A Brighter
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sun, Jan 06, 2008 at 01:12:32AM +0530, Gokulakannan Somasundaram wrote: On Jan 5, 2008 6:15 PM, [EMAIL PROTECTED] wrote: One thought I had back then, with partitioned tables, was "gee -- a B-tree index is already doing a partition; why do a manual partition on top of that?". Can you please explain more on what you are trying to say here? Sure. A B-tree is just a device to partition something along some order. If you have, say, a table of orders (to use the example upthread) and a B-tree index on order date, this index partitions your set (at recursively finer levels). Of course, you'd have to sort your data along this index, but PostgreSQL knows how to do this trick: CLUSTER. This was just a vague idea, many things were missing (for example, to separate out the more quiescent parts of the table into their own files) which are spelled out in Simon Riggs' proposal. This struck me when seeing people partition tables by hand -- and I was delighted to actually watch Simon forging a real design. Regards -- tomás
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Simon Riggs wrote: We would keep a dynamic visibility map at *segment* level, showing which segments have all rows as 100% visible. No freespace map data would be held at this level. Small dumb-user question. I take it you've considered some more flexible consecutive-run-of-blocks unit of flagging rather than file-segments. That obviously complicates the tracking but means you can cope with infrequent updates as well as mark most of the most recent segment for log-style tables. -- Richard Huxton Archonet Ltd
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-04 at 10:22 +, Richard Huxton wrote: Simon Riggs wrote: We would keep a dynamic visibility map at *segment* level, showing which segments have all rows as 100% visible. No freespace map data would be held at this level. Small dumb-user question. I take it you've considered some more flexible consecutive-run-of-blocks unit of flagging rather than file-segments. That obviously complicates the tracking but means you can cope with infrequent updates as well as mark most of the most recent segment for log-style tables. I'm writing the code to abstract that away, so yes. Now you mention it, it does seem straightforward to have a table storage parameter for partition size, which defaults to 1GB. The partition size is simply a number of consecutive blocks, as you say. The smaller the partition size the greater the overhead of managing it. Also I've been looking at read-only tables and compression, as you may know. My idea was that in the future we could mark segments as either - read-only - compressed - able to be shipped off to hierarchical storage Those ideas work best if the partitioning is based around the physical file sizes we use for segments. Changing the partition size would simply reset the visibility map for that table, in its easiest implementation. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
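A toy model may help make the segment-level visibility map concrete (this is not Simon's implementation; the class, names, and data below are invented for illustration, assuming the 1 GB default segment size discussed above). A segment marked 100% visible can carry boundary values and be excluded from a scan; any segment that has seen writes must always be scanned.

```python
SEG_BLOCKS = 131072  # default "partition size": 1 GB of 8 KB blocks

class SegmentVisibilityMap:
    """Toy per-segment map: segments whose rows are all visible carry
    (min_key, max_key) boundary values usable for segment exclusion."""

    def __init__(self):
        self.all_visible = {}  # seg -> (min_key, max_key)

    def mark_all_visible(self, seg, lo, hi):
        self.all_visible[seg] = (lo, hi)

    def mark_dirty(self, seg):
        # Any write to the segment clears the flag (and its boundaries).
        self.all_visible.pop(seg, None)

    def segments_to_scan(self, segs, lo, hi):
        """Keep segments that might hold rows in [lo, hi]: an all-visible
        segment is excluded when its boundaries cannot overlap the range;
        a dirty segment has no boundaries and is always scanned."""
        keep = []
        for seg in segs:
            bounds = self.all_visible.get(seg)
            if bounds is None or not (bounds[1] < lo or bounds[0] > hi):
                keep.append(seg)
        return keep

svm = SegmentVisibilityMap()
svm.mark_all_visible(0, 1, 250)
svm.mark_all_visible(1, 300, 480)
# segment 2 is the active tail of the table: no flag, always scanned
print(svm.segments_to_scan([0, 1, 2], 400, 600))  # [1, 2]
```

Note how resetting the map (as the text suggests for a partition-size change) just means clearing every entry: correctness is preserved because unmapped segments are always scanned, only the exclusion optimization is lost until the segments are re-observed as all-visible.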
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-04 at 13:06 +0530, Gokulakannan Somasundaram wrote: a) This proposal would work for the kind of table organizations which are currently partitioned and maintained based on some kind of timestamp. Consider one of the use-cases. A large Retail firm has a lot of stores. The DBA maintains and updates the inventories of those stores in hash-partitions based on store-no. As the inventory gets updated, the corresponding partition receives the update and it goes like that.. Here all the partitions are going to get constantly updated. So no partition can possibly become a read-only partition. You have clearly called it out in your para. I am just re-confirming on that. Or do you have something for this in your soln.? To my limited experience, most partition strategies are based on some form of time-stamp. If the proposed soln. can cater to those, it has lots of use-cases. I don't think it would apply to an Inventory table. That is a current state table. It is designed for any large tables that would grow naturally over time if we left them to do so. Solutions that it would work for: - any Fact table where measurements/observations/events are accumulated e.g. Web Hits (any Internet events) Call Detail Records Sales Security Events Scientific Measurements Process Control - any Major Entity where new entities are created from a sequence e.g. Orders, OrderItems Invoices Shipments, Returns most SCM/DCM events It's not aimed at any particular benchmark, just real usage scenarios. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Simon Riggs wrote: On Fri, 2008-01-04 at 10:22 +, Richard Huxton wrote: Simon Riggs wrote: We would keep a dynamic visibility map at *segment* level, showing which segments have all rows as 100% visible. No freespace map data would be held at this level. Small dumb-user question. I take it you've considered some more flexible consecutive-run-of-blocks unit of flagging rather than file-segments. That obviously complicates the tracking but means you can cope with infrequent updates as well as mark most of the most recent segment for log-style tables. I'm writing the code to abstract that away, so yes. Now you mention it, it does seem straightforward to have a table storage parameter for partition size, which defaults to 1GB. The partition size is simply a number of consecutive blocks, as you say. The smaller the partition size the greater the overhead of managing it. Oh, obviously, but with smaller partition sizes this also becomes useful for low-end systems as well as high-end ones. Skipping 80% of a seq-scan on a date-range query is a win for even small (by your standards) tables. I shouldn't be surprised if the sensible-number-of-partitions remained more-or-less constant as you scaled the hardware, but the partition size grew. Also I've been looking at read-only tables and compression, as you may know. My idea was that in the future we could mark segments as either - read-only - compressed - able to be shipped off to hierarchical storage Those ideas work best if the partitioning is based around the physical file sizes we use for segments. I can see why you've chosen file segments. It certainly makes things easier. Hmm - thinking about the date-range scenario above, it occurs to me that for seq-scan purposes the correct partition size depends upon the data value you are interested in. What I want to know is what blocks Jan 07 covers (or rather what blocks it doesn't) rather than knowing blocks 1-999 cover 2005-04-12 to 2007-10-13. 
Of course that means that you'd eventually want different partition sizes tracking visibility for different columns (e.g. id, timestamp). I suspect the same would be true for read-only/compressed/archived flags, but I can see how they are tightly linked to physical files (particularly the last two). -- Richard Huxton Archonet Ltd ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
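Richard's configurable partition size amounts to grouping consecutive blocks into fixed-width runs, with the 1GB file segment as the default stride. A minimal sketch of the block-to-partition mapping (Python for illustration only; the 8KB block size is the usual Postgres default, and the function name is hypothetical):

```python
BLOCK_SIZE = 8192                      # 8KB heap pages (Postgres default)
DEFAULT_PARTITION_BYTES = 1 << 30      # 1GB, matching the file segment size

def partition_of_block(block_no, partition_bytes=DEFAULT_PARTITION_BYTES):
    """Map a heap block number to its partition (run-of-blocks) index."""
    blocks_per_partition = partition_bytes // BLOCK_SIZE
    return block_no // blocks_per_partition

# With the default 1GB stride, blocks 0..131071 fall in partition 0:
assert partition_of_block(0) == 0
assert partition_of_block(131071) == 0
assert partition_of_block(131072) == 1
# A low-end system might choose a 64MB stride instead:
assert partition_of_block(131072, partition_bytes=64 << 20) == 16
```

The smaller the stride, the more min/max entries to track and check, which is the management overhead Simon mentions.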
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-04 at 13:29 +0100, Markus Schiltknecht wrote: Given that we are operating on segments here, to which the DBA has very limited information and access, I prefer the term Segment Exclusion. I think of that as an optimization of sequential scans on tables with the above mentioned characteristics. If we do need to differentiate between the two proposals, we can refer to this one as the Segment Visibility Map (SVM). I'm clearly in favor of distinguishing between the two proposals. SVM is a good name, IMHO. OK, I'll refer to this proposal as SVM. There would be additional complexity in selectivity estimation and plan costing. The above proposal allows dynamic segment exclusion, which cannot be assessed at planning time anyway, so suggestions welcome... Hm.. that looks like a rather bad downside of an executor-only optimization. I think that's generally true. We already have that problem with planned statements and work_mem, for example, and parametrised query planning is a difficult problem. Stable functions are already estimated at plan time, so we hopefully should be getting that right. I don't see any show stoppers here, just more of the usual problems of query optimization. Comparison with other Partitioning approaches - Once I realised this was possible in a fairly automatic way, I've tried hard to keep away from manual overrides, commands and new DDL. Declarative partitioning is a big overhead, though worth it for large databases. No overhead is *much* better though. 
This approach to partitioning solves the following challenges:
- allows automated table extension, so works automatically with Slony
- responds dynamically to changing data
- allows stable functions, nested loop joins and parametrised queries
- allows RI via SHARE locks
- avoids the need for operator push-down through Append nodes
- allows unique indexes
- allows global indexes (because we only have one table)
- allows advanced planning/execution using read-only/visible data
- works naturally with synchronous scans and buffer recycling
All of the above are going to take considerably longer to do in any of the other ways I've come up with so far... I fully agree. But as I tried to point out above, the gains in manageability from Segment Exclusion are also pretty close to zero. So I'd argue they only fulfill parts of the needs for general horizontal partitioning. Agreed. My focus for this proposal wasn't manageability, as it had been in other recent proposals. I think there are some manageability wins to be had as well, but we need to decide what sort of partitioning we want/need first. So in the case of SVM, enhanced manageability is really a phase 2 thing. Plus, you can always combine a design with constraint and segment exclusion. Maybe a combination with CLUSTERing would be worthwhile? Or even enforced CLUSTERing for the older segments? I think there's merit in Heikki's maintain cluster order patch and that should do an even better job of maintaining locality. Thanks for the detailed comments. I'll do my best to include all of the viewpoints you've expressed as the design progresses. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hello Simon, Simon Riggs wrote: I've come up with an alternative concept to allow us to discuss the particular merits of each. ISTM that this new proposal has considerable potential. Hm.. interesting idea. If we were able to keep track of which sections of a table are now read-only then we would be able to record information about the data in that section and use it to help solve queries. This is turning the current thinking on its head: we could derive the constraints from the data, rather than placing the data according to the constraints. That would be much more natural: load data into the table and have the system work out the rest. Yeah, but that's also the most limiting factor of your approach: it covers only horizontal partitioning by time (or to be more precise, by columns which are very likely to increase or decrease with time). All other columns will very likely contain values from the full range of possible values. As you have pointed out, that might be a very frequent use case. I can't argue about that; however, I think it's important to be well aware of that limitation. Other scan types might also use segment exclusion, though this would only be useful for scans retrieving many rows, otherwise the overhead of segment exclusion might not be worthwhile. Uh.. the overhead of checking against min/max values doesn't seem that big to me. I rather think the gain for index scans would be vanishingly small, because (given frequent enough vacuuming) an index scan shouldn't return many pointers to tuples in segments which could be optimized away by segment exclusion. If we collect data for all columns then many of our implicit constraints would be useless. e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive, so we must weed out poor constraints. We would do this by removing any constraint that overlapped more than 10% of other segments. 
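The 10% weeding rule quoted at the end can be modelled directly: for each column, count how many pairs of segments have overlapping min/max ranges, and keep the column's implicit constraints only when overlap is rare. This is a sketch in Python; the exact overlap measure and the pairwise counting are my assumptions for illustration, not something the proposal specifies:

```python
def ranges_overlap(a, b):
    """True when the closed intervals a and b share at least one value."""
    return a[1] >= b[0] and b[1] >= a[0]

def useful_constraint_columns(per_column_ranges, max_overlap_frac=0.10):
    """Keep only columns whose per-segment ranges rarely overlap each other.

    per_column_ranges: {column_name: [(min, max), ...]}, one range per segment.
    """
    keep = []
    for col, ranges in per_column_ranges.items():
        n = len(ranges)
        overlapping = sum(
            1 for i in range(n) for j in range(i + 1, n)
            if ranges_overlap(ranges[i], ranges[j]))
        total_pairs = n * (n - 1) // 2
        if total_pairs == 0 or overlapping / total_pairs <= max_overlap_frac:
            keep.append(col)
    return keep

stats = {
    "log_date": [(1, 10), (11, 20), (21, 30), (31, 40)],  # time-correlated
    "status":   [(1, 3), (1, 3), (1, 3), (1, 3)],         # few values, everywhere
}
assert useful_constraint_columns(stats) == ["log_date"]
```

A column like `status`, whose few values appear in every segment, is dropped, which answers the "few values present in all segments" case above.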
Various heuristics would likely need to be discovered during development to make this work without resorting to manual commands. Uh, well, that's about the limitation I've pointed out above. But is it really worth maintaining statistics about overlapping values and removing min/max checks for certain columns? It would save you the min/max check per segment and scan, but cost maintaining the statistics and checking against the statistics once per scan. AFAICS the block with the min/max tuple per segment will often remain cached anyway... dunno. Noting which segments are read-only --- Everything so far has relied upon our ability to note which segments of a table are read-only. We could do this in two ways: 1) have the system automatically keep track of non-changing data 2) have the DBA issue a command to mark a segment as read-only now Probably a combination of the two is better, so we have three states for segments:
- READ_WRITE_ALLOWED
- EFFECTIVE_READ_ONLY (set by 1 only)
- MARKED_READ_ONLY (set by 2 only)
Having said that I want to concentrate on (1), though consider the other one also in case requested by reviewers. Hm.. AFAICT, horizontal partitioning often serves manageability, which is quite limited having all data in one table (you can't move a single segment to a different tablespace). Thus I think option 2 is pretty constrained in usability. What would the DBA gain by setting a segment to read only? How does the DBA figure out which segment a tuple is stored in (so she can decide to mark it read only)? The user may also wish to clear down very old data, so allowing DELETEs can ensure the user can still remove old data from the table. By carefully choosing the values to be deleted, a whole segment can be deleted and then returned to the FSM. Oh, yeah, that sounds like a good optimization. Bulk deletes, yummie! This proposal offers many of the advantages of the earlier Visibility Map proposal, yet without major changes to heap structure. 
Since the segment-level visibility map is less granular it would only be 80% as effective as the more detailed block-level map. Having said that, the bookkeeping overheads will also be considerably reduced, so it does seem a good joint approach. It also allows freezing to be handled fully, which was a problem with the original visibility map proposal. WAL logging visibility map changes is also now much easier. I generally agree, although I'm somewhat dubious about the 80% factor. My thoughts have been targeted directly at partitioning, yet I have to admit that this idea overlaps with, and in my world view replaces, the Visibility Map proposal. I very much like the name Visibility Map though. I would even say that partitioning is somewhat of a misleading term here, because it normally allows the DBA to decide on where to split. Given that we are operating on segments here, to which the DBA has very limited information and access, I prefer the term Segment Exclusion.
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Simon Riggs wrote: - any Fact table where measurements/observations/events are accumulated e.g. Web Hits (any Internet events) Call Detail Records Sales Security Events Scientific Measurements Process Control - any Major Entity where new entities are created from a sequence e.g. Orders, OrderItems Invoices Shipments, Returns most SCM/DCM events ...and only changed very seldom after a while, I would add. Because changing an old tuple would invalidate the optimization for the affected segment. That's why this optimization can't help for inventory tables, where an id might correlate with time and storage location, but writing access doesn't correlate with storage location (segment number) and time. Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Simon Riggs wrote: The smaller the partition size the greater the overhead of managing it. Also I've been looking at read-only tables and compression, as you may know. My idea was that in the future we could mark segments as either
- read-only
- compressed
- able to be shipped off to hierarchical storage
Those ideas work best if the partitioning is based around the physical file sizes we use for segments. As much as I'd like this simplification, I'm still thinking of these segments as an implementation detail of Postgres, and not something the user should have to deal with. Allowing the DBA to move segments to a different tablespace and giving him the possibility to check which tuples are in which segment seems awkward from a user's perspective, IMO. Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote: Agreed. Just a minor note: I find marked read-only too strong, as it implies an impossibility to write. I propose speaking about mostly-read segments, or optimized for reading or similar. I do want some segments to be _marked_ read-only: I want attempted writes to them to _fail_. A
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-04 at 13:06 -0500, Andrew Sullivan wrote: On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote: Agreed. Just a minor note: I find marked read-only too strong, as it implies an impossibility to write. I propose speaking about mostly-read segments, or optimized for reading or similar. I do want some segments to be _marked_ read-only: I want attempted writes to them to _fail_. I think Markus thought that we would mark them read only automatically, which was not my intention. I believe it's possible to have this in a way that will make you both happy. Some more explanation: there would be three different states for a segment:
1. read write
2. optimized for reading, as Markus says it
3. marked read only by explicit command
Transition 1 -> 2 is by autovacuum under the SVM proposal; transition 2 -> 3 is by user command only. So throwing an ERROR is acceptable for segments in state 3. I came up with a complex scheme for going from 1 -> 3 previously, but I don't think it's needed any longer (for this, at least). It's trivial to go from 2 -> 3 using an ALTER TABLE statement, along the lines of ALTER TABLE ... WHERE ... Files that are completely in state 3 can then be archived by a hierarchical storage manager without problem. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
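The lifecycle above is a small three-state machine. The sketch below (Python, all names hypothetical) encodes the stated rules: autovacuum alone performs 1 -> 2, an explicit command alone performs 2 -> 3, and writes to state 3 raise an error. Note that demoting a state-2 segment back to read-write on a write is my assumption about how "effective" read-only would behave, not something stated in the thread:

```python
READ_WRITE_ALLOWED, EFFECTIVE_READ_ONLY, MARKED_READ_ONLY = 1, 2, 3

class Segment:
    def __init__(self):
        self.state = READ_WRITE_ALLOWED

    def autovacuum_observes_no_changes(self):
        # Transition 1 -> 2: set automatically, never by user command.
        if self.state == READ_WRITE_ALLOWED:
            self.state = EFFECTIVE_READ_ONLY

    def alter_table_set_read_only(self):
        # Transition 2 -> 3: explicit ALTER TABLE command only.
        if self.state == EFFECTIVE_READ_ONLY:
            self.state = MARKED_READ_ONLY

    def write(self):
        if self.state == MARKED_READ_ONLY:
            raise PermissionError("segment is marked read only")
        # Assumption: a write to an "effectively" read-only segment simply
        # demotes it, invalidating its implicit constraints.
        self.state = READ_WRITE_ALLOWED

seg = Segment()
seg.autovacuum_observes_no_changes()
assert seg.state == EFFECTIVE_READ_ONLY
seg.write()                      # writes still allowed in state 2
assert seg.state == READ_WRITE_ALLOWED
```

This keeps Andrew's requirement (state 3 rejects writes) without making Markus's automatically-detected segments unwritable.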
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, Jan 04, 2008 at 10:26:54PM +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. I think that part might be hand-wavy still. But once this facility is there, what's to prevent the current active segment (and the rest) from also getting this mark, which would mean the table is read only? A
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Simon Riggs wrote: On Fri, 2008-01-04 at 13:06 -0500, Andrew Sullivan wrote: On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote: Agreed. Just a minor note: I find marked read-only too strong, as it implies an impossibility to write. I propose speaking about mostly-read segments, or optimized for reading or similar. Hm.. yeah, after rereading, I realize that I've mixed up states no. 2 and 3 here, sorry. I do want some segments to be _marked_ read-only: I want attempted writes to them to _fail_. Well, I can see use cases for marking tuples or complete relations as read only. But segments? I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day? If not, a more user friendly command like MARK READ ONLY WHERE date = 2006 would have to move tuples around between segments, so as to be able to satisfy the split point exactly, right? I think Markus thought that we would mark them read only automatically, which was not my intention. I believe it's possible to have this in a way that will make you both happy. Some more explanation: There would be three different states for a segment: 1. read write 2. optimized for reading, as Markus says it 3. marked read only by explicit command Thanks for the clarification. Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day? No Way! That would stop Richard's idea to make the segment stride configurable, apart from being a generally ugly thing. If not, a more user friendly command like MARK READ ONLY WHERE date = 2006 would have to move tuples around between segments, so as to be able to satisfy the split point exactly, right? Yes, just a simple WHERE clause that we can translate into segments under the covers. It would be an ALTER TABLE, so we get an exclusive lock: ALTER TABLE foo SET READ ONLY WHERE ..., possibly with a few restrictions on the WHERE clause. Anyway this is just futures and dreams so far, so let's just say something like that is possible in the future and work out more when we pass the first few hurdles. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
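Translating the WHERE clause "into segments under the covers" presumably means marking only those segments whose entire contents satisfy the predicate, leaving boundary segments read-write. A hypothetical sketch of that selection (Python for illustration; the function name and the monotone-predicate restriction are assumptions, perhaps one of the "few restrictions on the WHERE clause" mentioned above):

```python
def segments_fully_covered(segment_ranges, pred):
    """Indexes of segments whose entire min/max range satisfies pred.

    Only whole segments can be marked read only, so a segment qualifies
    only if both its minimum and maximum column values pass the predicate.
    This is valid for predicates monotone over the column, e.g. date < X.
    """
    return [i for i, (lo, hi) in enumerate(segment_ranges)
            if pred(lo) and pred(hi)]

# ALTER TABLE foo SET READ ONLY WHERE log_year < 2007, roughly:
ranges = [(2004, 2005), (2005, 2006), (2006, 2007), (2007, 2008)]
assert segments_fully_covered(ranges, lambda y: y < 2007) == [0, 1]
```

The third segment straddles the split point, so under this scheme it would stay read-write rather than forcing tuples to be moved, which answers Markus's question about satisfying the split point exactly.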
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Friday 04 January 2008 17:01, Simon Riggs wrote: On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day? No Way! That would stop Richard's idea to make the segment stride configurable, apart from being a generally ugly thing. If not, a more user friendly command like MARK READ ONLY WHERE date = 2006 would have to move tuples around between segments, so as to be able to satisfy the split point exactly, right? Yes, just a simple WHERE clause that we can translate into segments under the covers. It would be an ALTER TABLE, so we get an exclusive lock: ALTER TABLE foo SET READ ONLY WHERE ..., possibly with a few restrictions on the WHERE clause. Anyway this is just futures and dreams so far, so let's just say something like that is possible in the future and work out more when we pass the first few hurdles. Not to be negative, but ISTM how this feature would be managed is as important as the bits under the hood. Or at least we have to believe there will be some practical way to manage this, of which as yet I am skeptical. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, 2008-01-03 at 00:41 +, Sam Mason wrote: On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: Like it? Sounds good. I've only given it a quick scan though. Would read-only segments retain the same disk-level format as is currently? Yes, no changes at all to the table. So your app would just work, without any DDL changes. Existing partitioning apps would not change. It seems possible to remove the MVCC fields and hence get more tuples per page--- whether this would actually be a net performance gain/loss seems like a difficult question to answer; it would definitely be a complexity increase though. I've been looking at general compression at table and segment level, but that's further down the track. Removing the MVCC fields is too much work, I think. Reading this reminds me of the design of the store for a persistent operating system called EROS. It has a very good paper[1] describing the design (implementation and careful benchmarking thereof) that I think could be a useful read. Thanks, will do. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi Simon, Looks like a novel idea. I just want to confirm my understanding of the proposal. a) This proposal would work for the kind of table organizations which are currently partitioned and maintained based on some kind of timestamp. Consider one use-case: a large retail firm has a lot of stores. The DBA maintains and updates the inventories of those stores in hash partitions based on store number. As the inventory gets updated, the corresponding partition receives the update, and so on. Here all the partitions are going to get constantly updated, so no partition can possibly become a read-only partition. You have clearly called this out in your mail; I am just re-confirming that. Or do you have something for this case in your solution? In my limited experience, most partition strategies are based on some form of timestamp. If the proposed solution can cater to those, it has a lot of use-cases. Thanks, Gokul.
[HACKERS] Dynamic Partitioning using Segment Visibility Maps
Happy New Year, everybody. This proposal follows on from previous thinking about partitioning, where I've taken up Andrew Sullivan's suggestion to re-examine the current partitioning concept of using tables as partitions. So I've come up with an alternative concept to allow us to discuss the particular merits of each. ISTM that this new proposal has considerable potential. Recap: Very Large Table Use Case Many large tables have a regular access pattern:
- new rows inserted frequently
- some updates and deletes on the recent data
- older data becomes effectively read-only
- at some point data may be removed in bulk from the table
Most tables vary on what "recent" means; some are insert only, many not. A typical example might be a large history table where the last 6 months' data is updated, but after that the data has almost zero change. Partitioning relies on knowledge of the physical location of data to exclude sections of data from a scan. Currently the partitioning knowledge is provided by user declarations in the form of constraints on tables linked together by views or inheritance. We also currently do this in constraint-first fashion, i.e. we put the data where the constraints say we should, rather than deriving the constraints from the data. In-table Partitioning - In the common use-case there is a clear affinity between certain data columns and the physical placement of data in a table. This is true for columns such as LOG_DATE or TRANSACTION_DATE, but is also true for any id column whose values are assigned by a Sequence. HOT helps this to remain true over time. Given the above use case, when there is an affinity between data values and data placement *and* we know that older data tends to become read only, we can then realise that certain physical sections of a table will become effectively read only. (My experience is that the larger the table, the less random the write pattern - whatever the guys that wrote TPC-C say). 
If we were able to keep track of which sections of a table are now read-only then we would be able to record information about the data in that section and use it to help solve queries. This is turning the current thinking on its head: we could derive the constraints from the data, rather than placing the data according to the constraints. That would be much more natural: load data into the table and have the system work out the rest. We can imagine that we do this within a single table. Imagine: split a table up into 100 sections, then record the min/max values of all tuples in those sections. When we perform a SeqScan we could skip over sections that could be proved would never satisfy the query predicate. Same thing as partitioning, but within the table. Currently tables are already broken up into 1GB file segments, so it seems fairly natural to pick that as our section size. That fits reasonably well with recommendations for ideal partition size. Segment Exclusion - After we note that a segment is read-only we can scan the segment and record min/max values for all columns. These are then implicit constraints, which can then be used for segment exclusion in a similar manner as we do with the current constraint exclusion mechanism. The implicit constraints can then allow SeqScans to assess segment exclusion for each segment at execution time, i.e. a scan can skip a whole segment if constraints don't match. If a segment does not (yet) have implicit constraints it will always be scanned in full. This allows partitioning, synch scans and buffer recycling to work much better together than they currently do. Other scan types might also use segment exclusion, though this would only be useful for scans retrieving many rows, otherwise the overhead of segment exclusion might not be worthwhile. This would also allow a Merge Join to utilise exclusion for the inner plan at execution time. 
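The section-skipping idea just described can be sketched compactly. This is an illustrative model (Python rather than backend C); the per-segment min/max pairs stand in for the implicit constraints the proposal would record once a segment becomes read-only, and segments without constraints must always be scanned, as the text notes:

```python
# Illustrative model of segment exclusion: each read-only segment carries
# min/max values for a column; a scan skips segments whose range cannot
# contain any matching tuple.

def segments_to_scan(segments, lo, hi):
    """Return indexes of segments that may hold tuples with lo <= col <= hi.

    Each segment is a (min, max) pair, or None when no implicit constraint
    has been recorded yet (segment still read-write).
    """
    result = []
    for i, stats in enumerate(segments):
        if stats is None:                    # no constraints yet: must scan
            result.append(i)
            continue
        seg_min, seg_max = stats
        if seg_max >= lo and seg_min <= hi:  # ranges overlap: must scan
            result.append(i)
    return result

# A table whose LOG_DATE correlates with physical placement; the last
# segment is still being written to, so it has no constraints yet:
segs = [(1, 100), (90, 200), (201, 300), None]
assert segments_to_scan(segs, 250, 260) == [2, 3]   # segments 0 and 1 skipped
```

The per-segment check is two comparisons, which is why the overhead is negligible next to scanning 1GB of heap.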
Instead of scanning the whole inner table plan from the start, we would be able to start the scan from the appropriate segment. This would require us to pass the current value of the outer plan down into the inner plan. The typical executor nodes on the inner plan would be a SortNode and below that a SeqScan. In that case the executor would need to pass the outer value from the MergeJoinNode down through the SortNode to the SeqScan node. The SeqScan node could then perform partition exclusion, reducing the time for that scan and also reducing the time for the resulting sort. This sounds like it might be worth doing in normal cases also. It might turn out that the potentially applicable cases are already excluded during planning; I haven't thought about that aspect in enough detail yet. If we collect data for all columns then many of our implicit constraints would be useless. e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive, so we must weed out poor constraints. We would do this by removing any constraint that overlapped more than 10% of other segments.
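The Merge Join optimisation amounts to locating the first inner segment whose maximum join-key value reaches the current outer value; everything before it can hold no match. A sketch under the assumption that the join column correlates with physical order (so per-segment max values are ascending), with the executor plumbing through the SortNode elided:

```python
import bisect

def first_inner_segment(segment_maxes, outer_value):
    """Index of the first segment whose max value is >= the outer join key.

    segment_maxes must be sorted ascending, which holds when the join
    column correlates with physical placement, as in the use case above.
    All earlier segments can contain no matching inner tuples.
    """
    return bisect.bisect_left(segment_maxes, outer_value)

maxes = [100, 200, 300, 400]
assert first_inner_segment(maxes, 250) == 2   # segments 0 and 1 skipped
assert first_inner_segment(maxes, 50) == 0    # must start at the beginning
```

Skipping those segments shrinks both the inner scan and the sort feeding the join, which is the double saving the text describes.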
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: Like it? Sounds good. I've only given it a quick scan though. Would read-only segments retain the same disk-level format as is currently? It seems possible to remove the MVCC fields and hence get more tuples per page--- whether this would actually be a net performance gain/loss seems like a difficult question to answer; it would definitely be a complexity increase though. Reading this reminds me of the design of the store for a persistent operating system called EROS. It has a very good paper[1] describing the design (implementation and careful benchmarking thereof) that I think could be a useful read. A lot of your design sounds like the EROS store, with the Checkpoint Area being, in current and all previous versions of Postgres, the only place data is stored. Data in EROS also has a Home Location, which is where the data ends up after a checkpoint, and sounds somewhat like the proposed read-only state. Checkpoints here serve a similar purpose to checkpoints in PG, so the following analogy may get a bit confusing. When you're reading the paper I'd try and imagine the checkpoints not occurring as one event, but spread across time as the database recognizes that data is now (or has been marked as) read-only. The home locations would then store only the read-only copies of the data and all the churn would (if the recognition of read-only data works) be done in the checkpoint area. Sam [1] http://www.eros-os.org/papers/storedesign2002.pdf