Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-11 at 02:28 +0100, Gavin Sherry wrote:
> On Thu, Jan 10, 2008 at 09:30:10PM +, Simon Riggs wrote:
> > > > We cannot perform partition exclusion using this type of WHERE
> > > > clause at planning time because the CURRENT_DATE function is STABLE.
> > >
> > > We can do the exact same thing -- if it's a direction people want to
> > > take. In fact, we can do it better/faster because once we've evaluated
> > > one partition we know that there are no others to evaluate.
> >
> > Lost you completely here. I'm explaining to you that *nobody* can solve
> > those problems solely at planning time, by definition, so it has to be
> > done at execution time. I'm not saying anything about your way, my way.
>
> Sorry, I wasn't clear enough. I was trying to say: if we're going to do
> something in the executor (for right or wrong), the declarative approach
> can do it too. Since there will be partition bounding information
> available, we can do partition selection in the executor (maybe the
> planner should tell us to do it).

Of course. It's an identical situation for both. Regrettably, none of
your comments about dynamic partitioning and planning were accurate as a
result.

> Okay. As I said above, nothing in declarative partitioning rules out
> partition selection with stable functions. So, let's do it, assuming
> everyone else thinks it is a good idea.

If you check the archives, this was long ago identified as a
requirement. And I said exactly the things you said, BTW, when trying to
say it didn't matter.

I've kept a list of requests for improvement that I can share with you;
I've always been loath to publish a list of bad points.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com

---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, Jan 11, 2008 at 08:07:18AM +, Simon Riggs wrote:
> On Fri, 2008-01-11 at 02:28 +0100, Gavin Sherry wrote:
> > Sorry, I wasn't clear enough. I was trying to say: if we're going to do
> > something in the executor (for right or wrong), the declarative approach
> > can do it too. Since there will be partition bounding information
> > available, we can do partition selection in the executor (maybe the
> > planner should tell us to do it).
>
> Of course. It's an identical situation for both. Regrettably, none of
> your comments about dynamic partitioning and planning were accurate as a
> result.

That's not true. We will still have planning drive the partition
selection when the predicate is immutable, thus having more accurate
plans. Some but not all plans use current_date and similar.

> > Okay. As I said above, nothing in declarative partitioning rules out
> > partition selection with stable functions. So, let's do it, assuming
> > everyone else thinks it is a good idea.
>
> If you check the archives, this was long ago identified as a
> requirement. And I said exactly the things you said, BTW, when trying to
> say it didn't matter.
>
> I've kept a list of requests for improvement that I can share with you;
> I've always been loath to publish a list of bad points.

Why, please send them.

Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Gavin, I've responded to the technical questions you raise here:

On Thu, 2008-01-10 at 02:22 +0100, Gavin Sherry wrote:
> > After we note that a segment is read-only we can scan the segment and
> > record min/max values for all columns. These are then implicit
> > constraints, which can then be used for segment exclusion in a similar
> > manner as we do with the current constraint exclusion mechanism.
>
> What about columns which are varlen? The maximum value could be very
> long -- 1 GB in fact. Say we decide to not support those. Well, what
> about numeric? What about user defined types?

No problem with any datatypes that I can see. Why would there be? Very
long values as boundary values would be a problem with any type of
partitioning, so some sensible heuristics would need to apply in any
case.

> > This would also allow a Merge Join to utilise exclusion for the inner
> > plan at execution time. Instead of scanning the whole inner table plan
> > from the start, we would be able to start the scan from the appropriate
> > segment. This would require us to pass the current value of the outer
> > plan down into the inner plan. The typical executor nodes on the inner
> > plan would be a SortNode and below that a SeqScan. In that case the
> > executor would need to pass the outer value from the MergeJoinNode down
> > thru the SortNode to the SeqScan node. The SeqScan node could then
> > perform partition exclusion, reducing the time for that scan and also
> > reducing the time for the resulting sort. This sounds like it might be
> > worth doing in normal cases also. It might turn out that the potentially
> > applicable cases are already excluded during planning; I haven't thought
> > about that aspect in enough detail yet.
>
> I don't think that would work at all. Say you have skew of data across
> the partitions. You may end up doing a seq scan for every outer tuple.
> It would look like a really expensive nested loop.

No, you've misunderstood. I said start the plan, i.e. do this once, not
for every row. That would be ridiculous.
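The per-segment min/max exclusion described above can be sketched in a few lines. This is an illustrative model only, not PostgreSQL code: the `Segment` record, its field names, and the integer column are all hypothetical; the point is that read-only segments carry trustworthy bounds while volatile segments must always be scanned.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """Per-segment metadata (hypothetical layout): observed min/max for one
    column. The bounds are only trustworthy once the segment is read-only."""
    seg_no: int
    col_min: int
    col_max: int
    read_only: bool

def segments_to_scan(segments, lo, hi):
    """Return the segments that might contain rows with col BETWEEN lo AND hi.

    Volatile (non-read-only) segments are always scanned, since their
    min/max bounds may be stale; read-only segments are excluded whenever
    their recorded range cannot overlap the predicate's range.
    """
    result = []
    for seg in segments:
        if not seg.read_only:
            result.append(seg)           # can't exclude: bounds not guaranteed
        elif seg.col_max >= lo and seg.col_min <= hi:
            result.append(seg)           # ranges overlap: must scan
    return result
```

For example, with three segments covering 0-99, 100-199 and a volatile tail, a query for values 150-160 scans only the second segment plus the volatile one.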
> > If we collect data for all columns then many of our implicit
> > constraints would be useless. e.g. if a column only has a few values
> > and these are present in all segments. Matching our predicate against
> > all constraints would be expensive, so we must weed out poor
> > constraints. We would do this by removing any constraint that
> > overlapped more than 10% of other segments. Various heuristics would
> > likely need to be discovered during development to make this work
> > without resorting to manual commands. Note that all of this exclusion
> > is happening in the executor, not the planner. That allows this form
> > of exclusion to work with stable functions and parameters without
> > problem.
>
> So, in that case, you've totally undercut the planner. All the steps up
> the tree would be based on the stats for scanning the entire table but
> you've only returned part of it.

Answered already. Not an issue solely for this particular approach.

> To solve that, you could put the min/max values in a catalog somewhere.
> For a 16 TB table, that's 16000 min/max values. That's pretty
> expensive.

We agreed the partitioning could and would be adjustable, so the number
need not be 16000. The number of partitions we can support needs to be
fairly high to support a wide range of applications. Requests for
1-2000 partitions seem fairly typical to me and we need to be able to
handle that, however we do partitioning.

> And how do we handle making things non-read-only? You'd have to wait
> for all current queries to finish... that could take a long time.

No, I explained that it would not require locking or waiting. The
existing queries would just not take advantage of the newly read-only
state until they complete.

> How do we handle crashes during setting of min/max? I presume it needs
> WAL support.

It's in a catalog table, as we've said. Handled.

> What happens if the free space map gives us a free block in a read-only
> segment? Then we need to look at concurrency and transactionality of
> min/max values, don't we?
I discussed that and explained how it would be handled.

> > Visibility Map
> > [snip]
> > No dynamic shared memory cache is required because any concurrent
> > changes to the table would be ignored by a Scan anyway, so it doesn't
> > matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any
> > new scans that start will attempt to lock the table and then perform a
> > rel cache check before continuing. So the visibility will be set
> > correctly for *that* scan at least.
>
> Is that true in the presence of READ COMMITTED?

Yes, because READ COMMITTED has nothing to do with it. The SVM would be
refreshed via the relcache invalidation mechanism, which gets checked
every time we obtain a lock we did not already hold. In the case of a
table with bits set in the SVM we would need to recheck that each time
we try to obtain a lock, even if we already hold it.

In most cases the visibility map can be summarised
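Returning to the constraint-weeding idea raised earlier in this message (drop implicit constraints that "overlapped more than 10% of other segments"): one plausible reading of that heuristic can be sketched as below. The function name, the exact overlap measure, and the 10% threshold applied as an average are assumptions for illustration; the proposal deliberately leaves the precise rule to be discovered during development.

```python
def useful_exclusion_column(ranges, max_overlap_frac=0.10):
    """Decide whether per-segment (min, max) ranges for one column are
    selective enough to keep as implicit exclusion constraints.

    For each segment, count how many *other* segments its range overlaps;
    if the average overlap fraction exceeds max_overlap_frac, the column's
    constraints would rarely exclude anything, so discard them.
    (Heuristic sketch only -- not the mailing-list proposal's exact rule.)
    """
    n = len(ranges)
    if n < 2:
        return True
    total_overlaps = 0
    for i, (lo_i, hi_i) in enumerate(ranges):
        for j, (lo_j, hi_j) in enumerate(ranges):
            if i != j and hi_i >= lo_j and lo_i <= hi_j:
                total_overlaps += 1
    avg_frac = total_overlaps / (n * (n - 1))
    return avg_frac <= max_overlap_frac
```

A date-like column whose ranges march forward segment by segment passes the test; a low-cardinality column whose few values appear in every segment fails it, exactly the useless case described above.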
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-11 at 10:25 +0100, Gavin Sherry wrote:
> > Of course. It's an identical situation for both. Regrettably, none of
> > your comments about dynamic partitioning and planning were accurate as
> > a result.
>
> That's not true. We will still have planning drive the partition
> selection when the predicate is immutable, thus having more accurate
> plans.

Not really. The planner already evaluates stable functions at plan time
to estimate selectivity against statistics. It can do the same here. The
boundary values can't be completely trusted at plan time because they
are dynamic, but they're at least as accurate as ANALYZE statistics (and
probably derived at identical times), so can be used as estimates. So I
don't see any reason to imagine the plans will be badly adrift.

We're back to saying that if the visibility map is volatile, then SE
won't help you much. I agree with that and haven't argued otherwise.
Does saying it make us throw away SE? No; at least, not yet and not for
that reason.

SE does what I was looking for it to do, but doesn't do all of what
you'd like to achieve with partitioning, because we're looking at
different use cases. I'm sure you'd agree that all large databases are
not the same and that they can have very different requirements. I'd
characterise our recent positions on this as: I've been focused on
archival requirements, whereas you've been focused on data warehousing.
The difference really lies in how much activity, and of what kind,
occurs on a table. You don't unload and reload archives regularly, nor
do you perform random updates against the whole table.

I'm sure we'd quickly agree that many of the challenges you've seen
recently at Greenplum would not be possible with core Postgres, so I'm
not really trying too hard to compete.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, 2008-01-07 at 12:53 -0800, Ron Mayer wrote:
> Is my understanding right that these Segment Visibility Maps could help
> this case as well?

Yes, I think so.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
> I've kept a list of requests for improvement that I can share with you;
> I've always been loath to publish a list of bad points.

I think it would help understand the proposal if you also present the
shortcomings. When you presented the positive and negative points, the
negative list did look intentionally short :-) This imho provokes
negative replies; the average hackers that reply are no dummies either.

Some of the issues I see with the proposal made so far:
- segment boundaries will imho sometimes be too fuzzy for a reasonably short segment-describing where clause
- fault isolation
- huge global indexes (index create/reorg needs to repeatedly sort data that is long static)
- unpredictability
- when you delete large parts of the table you want that to be instantaneous and cause little work on the db side
- partitioning that also partitions the indexes; this is imho a must-have for huge non-static tables
- when the user tweaks segment size, this is already declarative
- some indexes are only needed for recent data (a partial index would need to slide == recreate)

The whole subject is imho very interesting, and I expect more feedback
after 8.3 is out. I am also in the declarative camp. In my projects the
partitioning is the developer/designer's responsibility, and thus all
add/drop partition tasks are automated, no dba. Needless to say, all
existing declarative implementations lack some of the essential features
on the implementation side, e.g.:
- use the btree order of partitions in plans that need more than one partition
- a useful syntax that allows automatic creation/exclusion of partitions (e.g. each month of a timestamp in one partition), e.g. partition 'hugetable_'||to_char(ts, 'MM') with extend(ts, year to month)
- unique indexes, equally partitioned like the table, that don't contain the partitioning column[s]
- some lack expression-based partitioning
- some lack instantaneous attach using a prefilled, preindexed table
- some lack detach
- some need a separate tablespace per partition

Andreas
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, Jan 11, 2008 at 11:49:50AM +, Simon Riggs wrote:
> Not really. The planner already evaluates stable functions at plan time
> to estimate selectivity against statistics. It can do the same here. The
> boundary values can't be completely trusted at plan time because they
> are dynamic, but they're at least as accurate as ANALYZE statistics (and
> probably derived at identical times), so can be used as estimates. So I
> don't see any reason to imagine the plans will be badly adrift.

Okay, it's good that you want the planner to look at those. Did you
consider the point I made about the sheer amount of data the planner
would have to consider for large cases?

> We're back to saying that if the visibility map is volatile, then SE
> won't help you much. I agree with that and haven't argued otherwise.
> Does saying it make us throw away SE? No, at least, not yet and not for
> that reason.

Yes, I'm not against SE; I just think that only having it would be a
serious regression for larger users. Personally, I think SE would be a
great idea for append-only tables since it removes the thing I'm most
worried about with it: the need to vacuum to 'turn it on'.

> SE does what I was looking for it to do, but doesn't do all of what
> you'd like to achieve with partitioning, because we're looking at
> different use cases. I'm sure you'd agree that all large databases are
> not the same and that they can have very different requirements. I'd
> characterise our recent positions on this that I've been focused on
> archival requirements, whereas you've been focused on data warehousing.

I think that sums it up, although I'd also say that declarative
partitioning is suitable for all those with largish amounts of data
which they know how they want stored. This points to another case that
SE suits: those who don't know how or (maybe more importantly) don't
care to manage their data.

I'll go back to what I said above. SE looks like a good performance
boost for archival read-only data. If we tighten up the definitions of
how some tables can be used -- append only -- then we can remove the
vacuum requirement and also change other characteristics of the
storage. For example, reduced visibility information, compression, etc.
These are hot topics for people with that kind of data.

> The difference really lies in how much activity and of what kind occurs
> on a table. You don't unload and reload archives regularly, nor do you

I couldn't agree more.

> perform random updates against the whole table. I'm sure we'd quickly
> agree that many of the challenges you've seen recently at Greenplum
> would not be possible with core Postgres, so I'm not really trying too
> hard to compete.

There we diverge. Yes, Greenplum produces systems for very large
amounts of data, petabyte range in fact. However, the architecture --
individual postmasters on a single CPU with their own storage -- means
that we see the problems of non-distributed database users when we look
at data at the node level. This is why I say that VACUUMing such
systems, under the SE model, after you do a load of data is just
impossible (I know that after time the cost of vacuum will stabilise,
but there's always the initial data load).

Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-11 at 17:31 +0100, Zeugswetter Andreas ADI SD wrote:
> > I've kept a list of requests for improvement that I can share with
> > you; I've always been loath to publish a list of bad points.

The list of bad points doesn't refer to shortcomings in my proposal,
which I would never hide. It refers to a list of general requirements
for improvements to partitioning that I have accumulated over a number
of years and will publish as soon as I've retyped it. It's not a
secret; it all comes from things discussed on list at various points.

> I think it would help understand the proposal if you also present the
> shortcomings. When you presented the positive and negative points, the
> negative list did look intentionally short :-)

If I am aware of a downside in any proposal, I always identify it.
That's why the damn things are so long. I design many things, but only
post some of them. The proposals I do post always have more positive
than negative points, otherwise I wouldn't waste anybody's time.

I regularly get comments on my proposal style. First, my proposals are
too long and they need a summary. Next, that I don't explain myself
enough and need to stop leaving gaps that have people assume I've not
thought it through thoroughly. So I try to put the summary in for some
people and the detail for others.

For this proposal I identified the target use case at the very top of
the proposal, and IMHO it does meet the needs of that very well, as many
people have agreed. It doesn't do everything that everybody wants
because the list is very long indeed, probably man-years of effort, all
things considered. We won't cross that gap in a single step.

I'm working on a revised version which will include all of the comments
and points that people have made, about 20 different topics in all from
my notes. In progress here:
http://developer.postgresql.org/index.php/Segment_Exclusion

> This imho provokes negative replies, the average hackers that reply
> are no dummies either.

I post to -hackers because I know there's no dummies here. If I've
provoked negative replies, I'm happy to apologise to all.

Thanks for your candid thoughts,

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-11 at 20:03 +0100, Gavin Sherry wrote:
> Okay, it's good that you want the planner to look at those. Did you
> consider the point I made about the sheer amount of data the planner
> would have to consider for large cases?

Sorry, thought I had somewhere in all those emails... If you really do
have 16TB of data and it's in the use case of mostly read-only,
volatile in the last portion of the table, then it would be
straightforward to increase segment size.

Whatever form of partitioning we go for, we do need to allow 1-2,000
partitions fairly easily. We can't sustain a sequential method of
applying the rules, which gives O(n) behaviour. We need some form of
indexing/tree search mechanism that gives roughly O(log n) behaviour.

> > We're back to saying that if the visibility map is volatile, then SE
> > won't help you much. I agree with that and haven't argued otherwise.
> > Does saying it make us throw away SE? No, at least, not yet and not
> > for that reason.
>
> Yes, I'm not against SE; I just think that only having it would be a
> serious regression for larger users.

If we do find regressions, then I'd suggest we look at ways of turning
it on/off automatically. We can keep track of the volatility in the
map. We can also have an off switch if you're really against it. But
I'm still skeptical. I think we can sustain the last few segments in a
table being volatile, as long as the greater proportion of segments are
not. The volatile part of the table is the subject of most of the
queries anyway, so excluding it from seq scans isn't that important.
Index access to historical data doesn't slow down whether or not you
have a volatile map.

> Personally, I think SE would be a great idea for append-only tables
> since it removes the thing I'm most worried about with it: the need to
> vacuum to 'turn it on'.

You do need to VACUUM anyway, so that's no problem. Sure, you have to
scan it to derive the boundary values, but that's one scan that saves
many in the future.

> I'll go back to what I said above.
> SE looks like a good performance boost for archival read-only data. If
> we tighten up the definitions of how some tables can be used -- append
> only -- then we can remove the vacuum requirement and also change
> other characteristics of the storage. For example, reduced visibility
> information, compression, etc. These are hot topics for people with
> that kind of data.

SE isn't aimed solely at INSERT-only data. It's much wider than that.
An ERP/SCM type application would easily benefit, say where orders are
received and processed within a relatively short period, but order data
needs to be maintained online for 2+ years. Anything where the table
grows either by date, or by increasing key, that has a read-only tail.
That's a lot of applications and a ton of data. It might not apply to
the Customer table in that app, but it will apply to Orders, OrderItem
etc.

We can compress older read-only data as a way of shrinking file size.
That's way easier, plus it applies to more real-world cases than trying
to remove all the visibility stuff. The insert-only people will still
be able to take advantage of compression, but the more general-purpose
people will never be able to make use of the visibility removal code.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
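The O(log n) boundary search called for earlier in this message (instead of testing 1-2,000 partition rules sequentially) is essentially a binary search over sorted bounds. A minimal sketch, assuming each partition is described only by a sorted exclusive upper bound (the function name and layout are hypothetical):

```python
import bisect

def find_partition(upper_bounds, key):
    """Locate the partition containing `key` in O(log n) via binary search,
    rather than O(n) by testing each partition's rule in turn.

    upper_bounds[i] is the exclusive upper bound of partition i, sorted
    ascending (an assumed layout for illustration).
    """
    idx = bisect.bisect_right(upper_bounds, key)
    if idx == len(upper_bounds):
        raise ValueError("key above highest partition bound")
    return idx
```

With bounds [100, 200, 300], key 150 lands in partition 1 and key 99 in partition 0, after at most two comparisons instead of a scan over all rules.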
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Simon Riggs wrote:
> Happy New Year, everybody. This proposal follows on from previous
> thinking about partitioning, where I've taken up Andrew Sullivan's
> suggestion to re-examine the current partitioning concept of using
> tables as partitions. So I've come up with an alternative concept to
> allow us to discuss the particular merits of each. ISTM that this new
> proposal has considerable potential.

I've been lurking and reading a huge number of threads on partitioning.
I see that postgresql is likely to give the user lots of knobs to define
partitions very flexibly, which is a good thing for things like sales
region reports etc.

All that to say, I hope some form of this very automatic tuning makes
it in. This automatic option would provide a benefit (not perfect, but
improved) for a significant set of use cases. Even better, it is
trivial to set up, though I would want a knob for the underlying
partition sizes; 1GB feels a bit too big for some situations.

Even expensive databases have found, I think, that there is a cost to
administrative complexity. In many cases someone may not be ready to go
down the declarative path, but be open to allowing the system to take
some optimizing approaches. What I especially like is the ability to
later mark things read-only. Perhaps a consultant who checks in on
things monthly might mark partitions in that form.

Good luck though with it all, great to see this.

- August
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
> > If people with large tables like partitioning why is Oracle moving
> > towards automated partitioning in 11g? Automated partitioning was one
> > of
>
> Have you used Oracle's partitioning?

Since you ask, yep, certified on it, plus DB2, Teradata and Greenwich.

> What about consistency over time? High levels of control for users if
> they want it? Even simple things like tablespace support? High
> performance unload and load performance much better than now? Reliable
> and consistent query plans? Support for datatypes other than those
> relating to time.

Most of these comments seem like emotional appeals, so refuting them one
by one is just going to look and smell like an argument, and a fruitless
one as well since we agree on so many things.

So, I get the message that you really want the DDL approach, and I agree
that you've demonstrated there are use cases that need it that you are
interested in. That's fine by me as long as we can each work on parts of
it to get it done. Will it be possible for you to do that?

I feel I can say that because AFAICS we can in principle have both
dynamic and declarative techniques, even on the same table. Around half
of the mechanics would be identical anyway. So from here I say let's
work together to get the basic mechanics right. My experience with
PostgreSQL is that the team/list always comes up with a better idea in
the end, and I'm sure we will this time too.

I have one main technical concern, which is the storage model, the
source of the difference between the two types of partitioning we've
been discussing. I see difficulties originating there, so I will start
another thread to examine just those items in detail.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Chris Browne wrote:
> _On The Other Hand_, there will be attributes that are *NOT* set in a
> more-or-less chronological order, and Segment Exclusion will be pretty
> useless for these attributes.

Short summary: With the appropriate clustering, ISTM Segment Exclusion
can be useful on all columns in a table.

Cluster the table by one-bit-of-column1 || one-bit-of-column2... That
way any col2=value query could exclude at least about half the table,
regardless of whether it's monotonically increasing or even totally
random.

It seems one could make Segment Exclusion even useful for multiple
unrelated columns in a table. Consider a large table of people where one
might want segment exclusion to help with both first name and last name.
One could cluster the table by first-letter-of-last-name ||
first-letter-of-first-name. Then a query for last name Smith could
exclude all but the consecutive segments of S's; while a query for first
name John would only have to look in the 26 runs of segments with AJ,
BJ, CJ, ...

Perhaps better: hash each column and interleave the bits col1bit1,
col2bit1, col3bit1, col1bit2, col2bit2, col3bit2. If I understand
segment exclusion right, that way on any table bigger than 8 segments
any query of col1=val, col2=val or col3=val would scan at most half the
table; on a table bigger than 64 segments any such query would scan at
most 1/4 of the table.

Obviously this only benefits the rarely changing parts of tables; new
and updated values would be in a very hot segment at the end of the
table that wouldn't be segment-excluded much.
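The bit-interleaving clustering key proposed above can be sketched directly. This is an illustrative toy, not a real clustering implementation: the function name, the per-column bit width, and the use of Python's built-in `hash` are all assumptions (the test below uses small integers, whose hash is the value itself, for determinism).

```python
def interleave_key(values, bits_per_col=8):
    """Build a clustering key by interleaving hash bits of several columns:
    col1bit1, col2bit1, col3bit1, col1bit2, col2bit2, col3bit2, ...

    With data clustered on this key, a predicate on any single column fixes
    every k-th bit of the key (k = number of columns), so each additional
    segment-addressing bit halves the portion of the table to scan.
    """
    mask = (1 << bits_per_col) - 1
    hashes = [hash(v) & mask for v in values]
    key = 0
    for bit in range(bits_per_col):                 # most-significant bit first
        for h in hashes:
            key = (key << 1) | ((h >> (bits_per_col - 1 - bit)) & 1)
    return key
```

For two columns with 2 bits each, values (0b11, 0b00) interleave to 0b1010: the first column's bits land in positions 3 and 1, the second column's in positions 2 and 0.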
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, Jan 10, 2008 at 07:25:00AM +, Simon Riggs wrote:
> On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
> > If the exclusion is executor driven, the planner cannot help but
> > create a seq scan plan. The planner will think you're returning 100X
> > rows when really you end up returning X rows. After that, all
> > decisions made by the planner are totally bogus.
>
> One of the most important queries on large tables is handling WHERE
> Eventdate = CURRENT_DATE;

Really? Well, this isn't handled by the current partitioning approach
(i.e., constraint exclusion), so users have worked around it.

> We cannot perform partition exclusion using this type of WHERE clause
> at planning time because the CURRENT_DATE function is STABLE.

We can do the exact same thing -- if it's a direction people want to
take. In fact, we can do it better/faster because once we've evaluated
one partition we know that there are no others to evaluate.

> So it seems clear that we need to make partition exclusion work at
> executor time, whatever else we do.

One example doesn't make the rule. Most people doing range partitioning
are going to be doing either specific dates or date ranges -- i.e.,
constants or things that can be folded to constants by the planner. At
least, that's what I've seen.

Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, Jan 10, 2008 at 04:51:04PM +, Simon Riggs wrote:
> On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
> > > If people with large tables like partitioning why is Oracle moving
> > > towards automated partitioning in 11g? Automated partitioning was
> > > one of
> >
> > Have you used Oracle's partitioning?
>
> Since you ask, yep, certified on it, plus DB2, Teradata and Greenwich.

It was just a joke.

> > What about consistency over time? High levels of control for users if
> > they want it? Even simple things like tablespace support? High
> > performance unload and load performance much better than now?
> > Reliable and consistent query plans? Support for datatypes other than
> > those relating to time.
>
> Most of these comments seem like emotional appeals, so refuting them
> one by one is just going to look and smell like an argument, and a
> fruitless one as well since we agree on so many things.

Sigh. You can call those questions what you like, but your approach
doesn't deal with them and the current partitioning approach, for its
problems, does deal with them.

> So, I get the message that you really want the DDL approach and agree
> that you've demonstrated there are use cases that need it that you are
> interested in. That's fine by me as long as we can each work on parts
> of it to get it done. Will it be possible for you to do that?

I assured you offlist that it was. This is something Greenplum needs
right now and I'm being paid to deliver it.

> I feel I can say that because AFAICS we can in principle have both
> dynamic and declarative techniques, even on the same table. Around half
> of the mechanics would be identical anyway. So from here I say let's
> work together to get the basic mechanics right. My experience with
> PostgreSQL is that the team/list always comes up with a better idea in
> the end and I'm sure we will this time too.

I agree. Scrutiny of ideas almost always leads to better ideas.

> I have one main technical concern which is the storage model, which is
> the source of the difference between the two types of partitioning
> we've been discussing. I see difficulties originating there, so I will
> start another thread to examine just those items in detail.

I look forward to it. As promised, I'll soon post some syntax of what we
have been working on.

Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, 2008-01-10 at 21:43 +0100, Gavin Sherry wrote: On Thu, Jan 10, 2008 at 07:25:00AM +, Simon Riggs wrote: On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote: If the exclusion is executor driven, the planner cannot help but create a seq scan plan. The planner will think you're returning 100X rows when really you end up returning X rows. After that, all decisions made by the planner are totally bogus. One of the most important queries on large tables is handling WHERE EventDate = CURRENT_DATE; Really? Well, this isn't handled by the current partitioning approach Yes, exactly why we need to fix it and why I'm mentioning it here. We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT_DATE function is STABLE. We can do the exact same thing -- if it's a direction people want to take. In fact, we can do it better/faster because once we've evaluated one partition we know that there are no others to evaluate. Lost you completely here. I'm explaining to you that *nobody* can solve those problems solely at planning time, by definition, so it has to be done at execution time. I'm not saying anything about your way, my way. So it seems clear that we need to make partition exclusion work at executor time, whatever else we do. One example doesn't make the rule. Most people doing range partitioning are going to be doing either specific dates or date ranges -- i.e., constants or things that can be folded to constants by the planner. At least, that's what I've seen. It's not always true that planning time = execution time. Using CURRENT_DATE to access the current day, week or month is common in many applications, as is parameterised SQL for higher transaction rate apps. We can't re-write all the SQL, so we need to make it work. I think if you demand a full-function implementation of partitioning, you'd better take into account *all* of the requirements. 
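The plan-time versus execution-time distinction being argued here can be sketched with a toy model (invented data, not PostgreSQL internals): an immutable constant lets the planner exclude partitions when the plan is built, while a STABLE expression such as CURRENT_DATE is unknown until execution, so exclusion has to be deferred to the executor.

```python
from datetime import date

# Toy model: one partition per day of events.
partitions = {
    date(2008, 1, 8): ["a", "b"],
    date(2008, 1, 9): ["c"],
    date(2008, 1, 10): ["d", "e"],
}

def plan_time_prune(const_value):
    """The planner can exclude partitions only for values known at plan
    time; a STABLE expression is unknown then, so every partition must
    stay in the plan."""
    if const_value is None:              # stands in for CURRENT_DATE
        return sorted(partitions)
    return [d for d in partitions if d == const_value]

def executor_time_prune(today):
    """The executor evaluates the stable expression once per execution
    and can then skip partitions that cannot match."""
    return [d for d in partitions if d == today]

# WHERE eventdate = DATE '2008-01-09': pruned at plan time.
assert plan_time_prune(date(2008, 1, 9)) == [date(2008, 1, 9)]
# WHERE eventdate = CURRENT_DATE: plan keeps all 3, executor prunes to 1.
assert len(plan_time_prune(None)) == 3
assert executor_time_prune(date(2008, 1, 10)) == [date(2008, 1, 10)]
```

Either storage model (declarative partitions or segments) faces the same split; the sketch only shows why the exclusion decision moves, not how either proposal implements it.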
-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, 2008-01-10 at 21:49 +0100, Gavin Sherry wrote: So, I get the message that you really want the DDL approach and agree that you've demonstrated there are use cases that need it that you are interested in. That's fine by me as long as we can each work on parts of it to get it done. Will it be possible for you to do that? I assured you offlist that it was. This is something Greenplum needs right now and I'm being paid to deliver it. Cool, I'll assist you where possible. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, Jan 10, 2008 at 09:30:10PM +, Simon Riggs wrote: We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT_DATE function is STABLE. We can do the exact same thing -- if it's a direction people want to take. In fact, we can do it better/faster because once we've evaluated one partition we know that there are no others to evaluate. Lost you completely here. I'm explaining to you that *nobody* can solve those problems solely at planning time, by definition, so it has to be done at execution time. I'm not saying anything about your way, my way. Sorry, I wasn't clear enough. I was trying to say, if we're going to do something in the executor (for right or wrong) the declarative approach can do it too. Since there will be partition bounding information available, we can do partition selection in the executor (maybe the planner should tell us to do it). So it seems clear that we need to make partition exclusion work at executor time, whatever else we do. One example doesn't make the rule. Most people doing range partitioning are going to be doing either specific dates or date ranges -- i.e., constants or things that can be folded to constants by the planner. At least, that's what I've seen. It's not always true that planning time = execution time. Using CURRENT_DATE to access the current day, week or month is common in many applications, as is parameterised SQL for higher transaction rate apps. We can't re-write all the SQL, so we need to make it work. I think if you demand a full-function implementation of partitioning, you'd better take into account *all* of the requirements. Okay. As I said above, nothing in declarative partitioning rules out partition selection with stable functions. So, let's do it, assuming everyone else thinks it is a good idea. 
Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sat, 2008-01-05 at 16:30 -0500, Robert Treat wrote: I'm not following this. If we can work out a scheme, I see no reason not to allow a single table to span multiple tablespaces. That seems to be something we might want anyway, so yes. The difference is that, if I currently have a table split by month, I can re-partition it into weekly segments, and only shuffle one month's data at a time, minimizing impact on the system while I shuffle it. This can even be used to do dynamic management, where data from the current month is archived by day, data from the past year by week, and data beyond that done monthly. Understood On many other databases, if you change the partition scheme, it requires exclusive locks and a shuffling of all of the data, even data whose partitions aren't being redefined. Even worse are systems like MySQL, where you need to rewrite the indexes as well. To me, these requirements always seem like show stoppers; I generally can't afford to lock a table while the database rewrites a billion rows of data. Agreed In any case, my thinking is if we had the segment exclusion technique, I could convert that partitioned table into a regular table again, use segment exclusion to handle what is currently handled by partitions, and create a global index across all the other data for that other, currently killer, query. Yes, that's what I have in mind. Can I ask that you produce a gap analysis between what you have now and what you would have in the future, so we can see what omissions or errors there are in the segex proposal? If we had indexes that spanned partitions, would we find that some of the queries that were producing seq scans will now produce better join and index plans, do you think? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sun, 2008-01-06 at 11:39 +0100, Markus Schiltknecht wrote: I think this has to do with SE not being of much use for index scans. Hmmm. I think it fits rather neatly with BitmapIndexScans. It would be easy to apply the index condition and/or filters to see which segments are excluded and then turn off bits in the bitmap appropriately. Not fully sure about IndexScans yet. I don't think it would be worth trying to apply SE until we estimated we would return say 100 rows. It needs to be able to work without slowing down the common path. Or put it another way: SE is an optimization for sequential scans. For tables where it works well, it could possibly replace the index entirely. True Without the index, you would rely on SE to always be able to exclude enough segments, so that the seq scan is less expensive than an index scan with the following table lookups. It would have to be a very fat index scan for so large a table... -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
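The fit with bitmap scans that Simon describes can be sketched as follows (a toy model with invented segment metadata, not actual executor code): each segment carries a min/max for the scanned column, and after the bitmap index scan produces its block bitmap, the bits for blocks in excluded segments are simply cleared so the heap never touches them.

```python
# Toy model: SEGMENT_BLOCKS heap blocks per segment (real segments are
# 1 GB), with a per-segment [min, max] of the scanned column serving as
# the exclusion metadata.
SEGMENT_BLOCKS = 4
segment_ranges = [(0, 99), (100, 199), (200, 299)]

def exclude_segments(bitmap, lo, hi):
    """Clear bits for blocks in segments whose [min, max] cannot
    intersect the condition range [lo, hi]."""
    out = set(bitmap)
    for seg, (smin, smax) in enumerate(segment_ranges):
        if smax < lo or smin > hi:                    # segment excluded
            out -= {b for b in out
                    if seg * SEGMENT_BLOCKS <= b < (seg + 1) * SEGMENT_BLOCKS}
    return out

bitmap = {0, 1, 5, 6, 9, 11}       # blocks the index says may match
# Condition: column BETWEEN 120 AND 180 -> only segment 1 can qualify,
# so the bits in segments 0 and 2 are turned off.
assert exclude_segments(bitmap, 120, 180) == {5, 6}
```

This also illustrates why the check is cheap enough to leave the common path alone: it is one comparison per segment, done once per scan, not per row.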
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, 2008-01-07 at 14:20 +0100, Markus Schiltknecht wrote: AFAIUI, Segment Exclusion combines perfectly well with clustering. Yes, seems like it would be possible to have a segment-aware CLUSTER, so it was actually usable on large tables. Not planning that initially though. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Simon Riggs wrote: Hmmm. I think it fits rather neatly with BitmapIndexScans. It would be easy to apply the index condition and/or filters to see which segments are excluded and then turn off bits in the bitmap appropriately. Yeah, good point. Not fully sure about IndexScans yet. I don't think it would be worth trying to apply SE until we estimated we would return say 100 rows. It needs to be able to work without slowing down the common path. Yup. Or put it another way: SE is an optimization for sequential scans. For tables where it works well, it could possibly replace the index entirely. True Without the index, you would rely on SE to always be able to exclude enough segments, so that the seq scan is less expensive than an index scan with the following table lookups. It would have to be a very fat index scan for so large a table... ..for SE to be faster than an index scan, you mean? Yes. Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, 2008-01-07 at 12:14 +0100, Csaba Nagy wrote: On Wed, 2008-01-02 at 17:56 +, Simon Riggs wrote: Like it? Very cool :-) Thanks. As ever, a distillation of various thoughts, not all mine. One additional thought: what about a kind of segment fill factor? Meaning: each segment has some free space reserved for future updates/inserts of records in the same range of its partitioning constraint. And when inserting/updating you put the new record into the corresponding segment... just like a very coarse clustering. Then you could vacuum the segments separately so the free space doesn't run out. For active segments you would then fix the partitioning constraint range once the fill factor is reached, to allow for keeping its constraint even when heavily updating (heavily vacuuming it too in response to that), and create a new segment for the unbounded range for new inserts... this would work fine for tables where the constraint is based on ever-increasing keys and accidental inserts in old ranges (which do happen occasionally in real life). Lots of thoughts there, so I'll try to separate them out and analyse. The way I originally described it is a very simple mechanism and we could tweak that some more. All ideas welcome. If we had dynamic segment constraints when the segment was not yet read only, that would lead to changes in the segment constraint for each INSERT when we have increasing keys. It seems better to set the constraints once only, when the segment was full, then prevent further INSERTs. The accidental inserts in old ranges seem like something that can be avoided if it is a real-world problem. For UPDATEs on a segment with constraints we might choose to apply the constraints to see what to do. You might want to allow the UPDATE and have it stretch the constraints outwards or you might want to prevent it and throw an error, or you might want to allow the UPDATE, yet migrate the new tuple to the appropriate segment. 
Dynamic partitioning works in multiple dimensions, so there isn't just one single valid location for any row. i.e. if we have a row with OrderId, OrderDate, RequiredByDate, ShipDate and LastModifiedDate on it, we'll probably expand the constraints on at least one of those. If we were lucky enough to have only changed one of those it might turn out there was *in that case* a single more sensible location for the new tuple, but that probably isn't the common case. So the likely best behaviour for UPDATEs is to try to keep the new row version in the same segment, then change the constraints. The segment constraints concept and the read only concept were linked. You're right we could separate them, though that turns out not to be that desirable. When we do a DELETE or an UPDATE we don't know whether the deleted row version was the last tuple with that particular boundary value. So we don't know whether the DELETE or UPDATE changes the constraints or not, and however we try to avoid it, we'll probably need to recalc the constraints in some circumstance. So updating the constraints dynamically isn't anywhere near as easy as it sounds and ultimately probably isn't worth the effort. So that's why constraints and read-only go together. HOT allows us to keep the new row version within the segment, in many cases. What might be worth doing is marking the FSM space for that segment as update-only to exclude inserts from using it, then forcing UPDATEs to stay within the segment if possible by providing the current block number in each call to the FSM. That would also mean that every UPDATE that wants to do a multi-block update on a currently read-only segment would need to call the FSM. Sounds good, but that would only work for the first UPDATE on a segment after it is marked read only, which isn't much use, or we would do it for *every* block-spanning UPDATE, which would cause contention in other use cases. 
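The "keep the new row version in the same segment, then stretch the constraints" behaviour can be sketched with a toy model (hypothetical column names, not the proposed on-disk format): each segment tracks a min/max per column, and an in-place UPDATE only widens the constraint of the column whose value changed.

```python
# Toy model of one segment's per-column constraints under the proposed
# dynamic scheme; absorbing a row widens min/max where needed.
class Segment:
    def __init__(self):
        self.constraints = {}        # column name -> [min, max]

    def absorb(self, row):
        """Record a row version in this segment, stretching each
        column's constraint outwards if the value falls outside it."""
        for col, val in row.items():
            lo_hi = self.constraints.setdefault(col, [val, val])
            lo_hi[0] = min(lo_hi[0], val)
            lo_hi[1] = max(lo_hi[1], val)

seg = Segment()
seg.absorb({"order_id": 100, "order_date": 20080110})
seg.absorb({"order_id": 101, "order_date": 20080110})
# An UPDATE changes only order_date; the new version stays in the
# segment and only that column's constraint stretches.
seg.absorb({"order_id": 101, "order_date": 20080111})
assert seg.constraints["order_id"] == [100, 101]
assert seg.constraints["order_date"] == [20080110, 20080111]
```

Note the asymmetry Simon describes: widening on INSERT/UPDATE is trivial, but a DELETE gives no cheap way to narrow the constraints again, which is why constraints and read-only status go together.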
So although I'm willing to listen and tweak, that hasn't yet resulted in any additional design points, unless I missed something above? When the change rate of old segments get down, the segments could be reorganized to have a smaller fill factor, so that you still allow for accidental updates but keep space usage efficient. This would be some similar action as a clustering, but hopefully not blocking (which might be a hard thing to do)... and later again you could mark some of the really old things as read only and put them in special segments with no wasted space. Yes, hopefully marking read only and compressed segments is a possible future. One problem would be when the segment's free space runs out, so you must put records from the same constraint range in multiple segments - but that could still work, you just would have multiple segments covering the same range, but if the segment fill factor is chosen properly it should not be the case... you could normally maintain a set of non-overlapping segments in terms of the
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sat, 2008-01-05 at 16:42 +0100, Markus Schiltknecht wrote: Simon Riggs wrote: On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day? No Way! Ah, I'm glad ;-) But now you want names on partitions? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Gavin and all, This is quite a long reply, so apologies for that. On Wed, 2008-01-09 at 07:28 +0100, Gavin Sherry wrote: On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: This technique would be useful for any table with historical data keyed by date or timestamp. It would also be useful for data where a time-of-insert component is implicit, such as many major entity tables where the object ids are assigned by a sequence. e.g. an Orders table with an OrderId as PK. Once all orders placed in a period have been shipped/resolved/closed then the segments will be marked read-only. A novel approach to the problem. For me, this proposal addresses some of the other problems in postgres (visibility in the heap vs. index) rather than the problems with partitioning itself. I think you're right that it isn't attempting to address the problems with partitioning directly, it is attempting to address the core use case itself. I think we have an opportunity to bypass the legacy-of-thought that Oracle has left us and implement something more usable. There are some low-level technical issues associated with declarative partitioning that I'll address last, which are in many ways the big reason to want to go to non-declarative partitioning. It might seem otherwise, but for me partitioning is tool for managing large volumes of data. It allows the user to encode what they know about the nature of their data, and that's often about management. In this way, your proposal actually makes partitioning too smart: the user wants to tell the system how to organise the data, as opposed to the other way around. At Greenplum, we've been discussing this in depth. Interestingly, we also felt that the storage layer could be much smarter with large tables with predictable layouts and predictable data patterns. 
But the thing is, people with large tables like partitioning, putting different partitions on different storage; they like the ability to merge and split partitions; they like truncating old partitions. In a word, they seem to like the manageability partitions give them -- as well as the performance that comes with this. Do people really like running all that DDL? There is significant manpower cost in implementing and maintaining a partitioning scheme, plus significant costs in getting it wrong. If people with large tables like partitioning, why is Oracle moving towards automated partitioning in 11g? Automated partitioning was one of the major goals for this next set of Postgres partitioning functionality also, whether or not we have declarative partitioning. My thinking is let's go for the best ideas and skip over the stuff that will be (is) obsolete before it's left the design stage. I see many more systems in the Terabyte range now than I did 10 years ago, but I see about the same number of DBAs. We'll always need experts, but I feel we should be using our expertise to simplify the standard cases, not just maintain the same level of difficulty in managing them. One of the big benefits of the dynamic partitioning approach is that it needs no DDL. So it will work out-of-the-box, for anybody. Deleting older data would be optimised under the proposed scheme, so that's not really a problem. Loading data is actually slightly harder and slower with declarative partitioning (see below). Merging and splitting partitions are tools for fine-tuning a very complex partitioning scheme. They do also allow a non-linear segmentation scheme, which might aid performance in some cases. (On a different note, I'm strongly in favour of a declarative approach to other optimizer information to allow the DBA to say what they know about how a table will be used. But that's off-topic for now.) 
One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. When we look at the data in the partition we can look at any number of columns. When we declaratively partition, you get only one connected set of columns, which is one of the reasons you want multi-dimensional partitioning in the first place. To this end, we (well, Jeff Cohen) looked at the syntax and semantics of partitioning in leading databases (Oracle, Informix, DB2) and came up with a highly expressive grammar which takes the best of each I think (I'll post details on the grammar in a separate thread). The idea is that range (for
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
[EMAIL PROTECTED] (Simon Riggs) writes: I think we have an opportunity to bypass the legacy-of-thought that Oracle has left us and implement something more usable. This seems like a *very* good thing to me, from a couple of perspectives. 1. I think you're right on in terms of the issue of the cost of running all that DDL in managing partitioning schemes. When I was working as DBA, I was decidedly *NOT* interested in doing a lot of low-level partition management work, and those that are in that role now would, I'm quite sure, agree that they are not keen on spending a lot of their time trying to figure out what tablespace to shift a particular table into, or what tablespace filesystem to get sysadmins to set up. 2. Blindly following what Oracle does has always been a dangerous sort of thing to do. There are two typical risks: a) There's always the worry that they may have patented some part of how they implement things, and if you follow too closely, There Be Dragons... b) They have enough billion$ of development dollar$ and development re$ource$ that they can follow strategies that are too expensive for us to even try to follow. 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. 
Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. We can therefore expect useful constraints for attributes that are assigned in some kind of more or less chronological order. Such attributes will include: - Object ID, if set by a sequence - Processing dates There may be a bit of sloppiness, but the constraints may still be useful enough to exclude enough segments to improve efficiency. _On The Other Hand_, there will be attributes that are *NOT* set in a more-or-less chronological order, and Segment Exclusion will be pretty useless for these attributes. In order to do any sort of Exclusion for non-chronological attributes, it will be necessary to use some mechanism other than the patterns that fall out of natural chronological insertions. If you want exclusion on such attributes, then there needs to be some sort of rule system to spread such items across additional partitions. Mind you, if you do such, that will weaken the usefulness of Segment Exclusion. For instance, suppose you have 4 regions, and scatter insertions by region. In that case, there will be more segments that overlap any given chronological range. When we look at the data in the partition we can look at any number of columns. When we declaratively partition, you get only one connected set of columns, which is one of the reasons you want multi-dimensional partitioning in the first place. Upside: Yes, you get to exclude based on examining any number of columns. 
Downside: You only get the exclusions that are emergent properties of the data... The more I'm looking at the dynamic approach, the more I'm liking it... -- cbbrowne,@,cbbrowne.com http://linuxfinances.info/info/linuxxian.html Feel free to contribute build files. Or work on your motivational skills, and maybe someone somewhere will write them for you... -- Fredrik Lundh [EMAIL PROTECTED]
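The 7-versus-365 sloppiness example discussed above can be checked with a quick simulation (invented numbers: one segment fills per day of orders, and RequiredByDate lags OrderDate by 0-6 days):

```python
# Toy simulation: 365 daily segments; each records the min/max of the
# RequiredByDate values (day + 0..6) that landed in it, as its emergent
# exclusion constraint.
DAYS = 365
segments = []
for day in range(DAYS):
    req = [day + (k % 7) for k in range(100)]   # RequiredByDate values
    segments.append((min(req), max(req)))       # per-segment min/max

# A query on one RequiredByDate only touches segments whose range
# covers it -- the emergent 7, not all 365.
target = 200
matching = [s for s, (lo, hi) in enumerate(segments) if lo <= target <= hi]
assert len(matching) == 7                       # days 194..200
```

The same simulation also shows the downside: had the rows been inserted in an order unrelated to RequiredByDate, each segment's min/max would be wide and few segments would be excluded.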
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Chris Browne wrote: _On The Other Hand_, there will be attributes that are *NOT* set in a more-or-less chronological order, and Segment Exclusion will be pretty useless for these attributes. Really? I was hoping that it'd be useful for any data with long runs of the same value repeated - regardless of ordering. My biggest tables are clustered by zip/postal-code -- which means that while the City, State, Country attributes aren't monotonically increasing or decreasing; they are grouped tightly together. I'd expect all queries for San Francisco to be able to come from at most 2 segments; and all queries for Texas to be able to come from only a fraction of the whole. If the segment sizes are configurable - I imagine this would even be useful for other data - like a people table organized by last_name, first_name. Johns may be scattered throughout the table -- but at least the John Smiths would all be on one segment, while the Aaron-through-Jim Smith segments might get excluded. Or am I missing something?
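This point -- that long runs of equal values help even without monotonic ordering -- can be sketched with a toy model (invented data: rows clustered by state, fixed-size segments, a per-segment distinct-value set as the exclusion metadata):

```python
# Toy model: rows grouped into long runs by state, but not globally
# sorted in any meaningful order.
rows = ["TX"] * 500 + ["CA"] * 250 + ["NY"] * 250
SEG = 100
segments = [set(rows[i:i + SEG]) for i in range(0, len(rows), SEG)]

def segments_for(state):
    """Segments a scan for one state must visit; the rest are excluded."""
    return [i for i, vals in enumerate(segments) if state in vals]

assert segments_for("TX") == [0, 1, 2, 3, 4]
assert segments_for("CA") == [5, 6, 7]
assert segments_for("NY") == [7, 8, 9]   # segment 7 straddles CA/NY
assert segments_for("FL") == []          # fully excluded
```

Exclusion here depends only on locality (runs of equal values fitting inside few segments), not on the values ever increasing over time, which supports Chris's reading.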
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 09, 2008 at 11:47:31AM -0500, Chris Browne wrote: [EMAIL PROTECTED] (Simon Riggs) writes: I think we have an opportunity to bypass the legacy-of-thought that Oracle has left us and implement something more usable. This seems like a *very* good thing to me, from a couple of perspectives. [snip] 2. Blindly following what Oracle does has always been a dangerous sort of thing to do. There are two typical risks: a) There's always the worry that they may have patented some part of how they implement things, and if you follow too closely, There Be Dragons... I think that could be equally said of the dynamic partitioning approach. In fact, it might be more likely since declarative partitioning has been around for eons. b) They have enough billion$ of development dollar$ and development re$ource$ that they can follow strategies that are too expensive for us to even try to follow. I don't see that as an argument against the declarative approach. Reading the details on this thread so far, I think Simon's approach is probably more complex from an implementation POV. 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. Well, I don't think I'm proposing we /blindly follow/ others. I propose we choose a grammar which takes the best of what others have tried to do. Oracle's grammar is hideous, IBM's is too restrictive, for example. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. 
This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. Many people, in my experience, doing the kind of data processing which benefits from partitioning are regularly loading data, rather than collecting it in an OLTP fashion. Let's take the easily understandable concept of processing web site traffic. If the amount of data is large enough to benefit from partitioning, they probably have multiple web servers and therefore almost certainly multiple log files. If these files are not sorted into a single file, the records will not have a naturally progressing chronology: with every file we go back to the beginning of the period of time the load covers. If you add parallelism to your load, things look even more different. This means you could end up with a bunch of partitions, under the dynamic model, which all cover the same time range. Then there's the way that really big databases are used (say, up around Simon's upper bound of 16 TB). It is expensive to keep data online so people aren't. They're loading and unloading data all the time, to perform different forms of analysis. A common scenario in the example above might be to unload all but the current month's data and then load the same month from the previous year. The unload step must not be costly (i.e., TRUNCATE). 
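Gavin's multiple-log-file point can be made concrete with a toy simulation (invented numbers): two unsorted log files covering the same day, loaded back to back, produce segments whose timestamp ranges overlap, so time-based exclusion discards far less than in the neatly chronological case.

```python
# Toy model: two web servers each log the same day's timestamps;
# the files are loaded one after the other rather than merge-sorted.
file_a = list(range(1000))             # server A: one day of timestamps
file_b = list(range(1000))             # server B: the same day again
loaded = file_a + file_b               # loaded back to back, not merged

SEG = 500
ranges = [(min(loaded[i:i + SEG]), max(loaded[i:i + SEG]))
          for i in range(0, len(loaded), SEG)]

# A query on one timestamp now hits a segment from each load: half of
# all segments survive exclusion instead of one.
target = 750
overlapping = [(lo, hi) for lo, hi in ranges if lo <= target <= hi]
assert len(ranges) == 4 and len(overlapping) == 2
```

Whether this matters in practice depends on how many concurrent load streams there are relative to the total segment count, which is the crux of the disagreement in the following messages.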
Then, there's no guarantee that what they're interested in is the date range at all. They may want to compare user agent (look at bot activity). In this case, the partitioning is across a small list of strings (well, numbers most likely). Here, the partitions would all have the same range. Partitioning would be useless. We can therefore expect useful constraints for attributes that are assigned in some kind of more or less chronological order. Such attributes will include: - Object ID, if set by a sequence - Processing dates There may be a bit of sloppiness, but the constraints may still be useful enough to exclude enough segments to improve efficiency. I really like the idea of the system being able to identify trends and patterns in the data (actually useful at the user level) but it's the uncertainty of how the partitioning will work for different kinds of data that I don't like. _On The Other Hand_, there will be
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, 2008-01-09 at 20:03 +0100, Gavin Sherry wrote: I think Simon's approach is probably more complex from an implementation POV. Much of the implementation is exactly the same, and I'm sure we agree on more than 50% of how this should work already. We just need to close in on the remainder. My current opinion on the SE approach is the opposite of the one above, though it gets us nowhere just to state it. I'm trying to avoid opinion and look at the details, which is the reason my viewpoint recently changed in favour of the dynamic approach as the main thrust for implementation. I've written a detailed email this morning to explain where and how the problems lie, which are nowhere near the syntax level. I haven't ruled out a declarative approach yet, but I need some detailed technical review of the issues. I hope you'll be replying to that? 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. Well, I don't think I'm proposing we /blindly follow/ others. I propose we choose a grammar which takes the best of what others have tried to do. Oracle's grammar is hideous, IBM's is too restrictive, for example. I assume the new grammar is good and if we do go that way, it sounds like the right starting place. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. 
Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. Many people, in my experience, doing the kind of data processing which benefits from partitioning are regularly loading data, rather than collecting it in an OLTP fashion. Let's take the easily understandable concept of processing web site traffic. If the amount of data is large enough to benefit from partitioning, they probably have multiple web servers and therefore almost certainly multiple log files. If these files are not sorted into a single file, the records will not have a naturally progressing chronology: with every file we go back to the beginning of the period of time the load covers. If you add parallelism to your load, things look even more different. This means you could end up with a bunch of partitions, under the dynamic model, which all cover the same time range. Depends how big we make the partitions and how sloppy this is as to whether that is a problem or not. We might still expect a x100 gain from using the SE approach depending upon the data volume. Then there's the way that really big databases are used (say, up around Simon's upper bound of 16 TB). It is expensive to keep data online so people aren't. They're loading and unloading data all the time, to perform different forms of analysis. That isn't my experience. That sounds very time consuming. The storage cost issue was the reason Andrew wanted offline segments, and why I have been talking about hierarchical storage.
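The sloppiness Simon describes can be made concrete with a small simulation (hypothetical Python, not anything from the proposal's code): one segment is filled per day of Orders, and RequiredByDate is assumed to fall within a week of the order date, so its per-segment min/max ranges overlap but still exclude almost everything.

```python
# Hypothetical sketch: simulate one segment filled per day of Orders, where
# RequiredByDate is typically up to 7 days after the OrderDate.  Count how
# many segments a predicate on each column touches.  All numbers are
# illustrative assumptions, not measurements.

def build_segments(days=365, lead_days=7):
    """Each segment holds one day of orders; track min/max per column."""
    segments = []
    for day in range(days):
        segments.append({
            "order_date": (day, day),               # perfectly tight range
            "required_by": (day, day + lead_days),  # sloppy, overlapping range
        })
    return segments

def segments_touched(segments, column, value):
    """Segment exclusion: keep only segments whose min/max may contain value."""
    return [s for s in segments
            if s[column][0] <= value <= s[column][1]]

segments = build_segments()
# A query on OrderDate hits exactly 1 of 365 segments ...
assert len(segments_touched(segments, "order_date", 100)) == 1
# ... while one on RequiredByDate hits 8 (days 93-100), still far better
# than scanning all 365.
print(len(segments_touched(segments, "required_by", 100)))  # prints 8
```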
A common scenario in the example above might be to unload all but the current month's data and then load the same month from the previous year. The unload step needs to not be costly (i.e., TRUNCATE). Then, there's no guarantee that what they're interested in is the date range at all. They may want to compare user agent (look at bot activity). In this case, the partitioning is across a small list of strings (well, numbers most likely). Here, the partitions would all have the same range. Partitioning would be useless. I take it you mean SE-based partitioning would be useless, but declarative partitioning would be useful? I would agree, assuming they run queries with a few of the small list of strings. Seems like a contrived case. Some of the biggest I've seen are GIS information or network information. Those are good examples of where a declarative approach would be the only way of getting
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Ron Mayer [EMAIL PROTECTED] writes: Chris Browne wrote: _On The Other Hand_, there will be attributes that are *NOT* set in a more-or-less chronological order, and Segment Exclusion will be pretty useless for these attributes. Really? I was hoping that it'd be useful for any data with long runs of the same value repeated - regardless of ordering. My biggest tables are clustered by zip/postal-code -- which means that while the City, State, Country attributes aren't monotonically increasing or decreasing, they are grouped tightly together. I'd expect all queries for San Francisco to be able to come from at most 2 segments; and all queries for Texas to be able to come from only a fraction of the whole. If the segment sizes are configurable - I imagine this would even be useful for other data - like a people table organized by last_name,first_name. Johns may be scattered throughout the table -- but at least the John Smiths would all be on one segment, while the Aaron-through-Jim Smith segments might get excluded. Or am I missing something? Well, this can head in two directions... 1. Suppose we're not using an organize in CLUSTER order approach. If the data is getting added roughly in insertion order, then there's no reason to expect San Francisco data to be clustered together. Ditto for John Smiths; we can expect them to be as scattered as their dates of creation. 2. Suppose we *are* using an organize in CLUSTER order approach. In that case, yes, indeed, you can expect segments to contain specific ranges of the data. However, in that case, the ONLY dimension under which Segment Exclusion may be expected to be useful is that of the first column of the index being used. Smith should be useful to SE, but not John, because, as a secondary criterion, the first name values will be scattered across all segments.
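Chris's second point can be illustrated with a toy example (invented Python, invented names): cluster rows on (last_name, first_name), split them into fixed-size segments, and check which segments per-column min/max exclusion can rule out.

```python
# Illustrative sketch: why, for a table CLUSTERed on (last_name, first_name),
# per-segment min/max ranges exclude segments for the leading column but not
# for the secondary one.  Data and segment size are invented for the example.

from itertools import product

last_names = ["Adams", "Brown", "Jones", "Smith"]
first_names = ["Aaron", "John", "Zoe"]

# Rows in clustered order, split into fixed-size segments of 3 rows.
rows = sorted(product(last_names, first_names))
seg_size = 3
segments = [rows[i:i + seg_size] for i in range(0, len(rows), seg_size)]

def may_contain(segment, col, value):
    """Min/max check: can this segment possibly hold the value?"""
    vals = [r[col] for r in segment]
    return min(vals) <= value <= max(vals)

# Leading column: only one segment can contain 'Smith'.
smith_segs = [s for s in segments if may_contain(s, 0, "Smith")]
# Secondary column: every segment's first-name range spans 'John'.
john_segs = [s for s in segments if may_contain(s, 1, "John")]

print(len(smith_segs), len(john_segs), len(segments))  # prints: 1 4 4
```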
-- (reverse (concatenate 'string ofni.sesabatadxunil @ enworbbc)) http://www3.sympatico.ca/cbbrowne/linuxdistributions.html PALINDROME spelled backwards is EMORDNILAP.
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Tue, 2008-01-08 at 02:12 +, Gregory Stark wrote: I also don't understand how this proposal deals with the more common use case of unloading and loading data. Normally in partitioned tables we build the data in a side table until the data is all correct then load it as a partition. If you treat it as a lower-level object then I don't see that working. The layout of the new table won't often match the layout of the target partitioned table. We optimised for that in 8.2, but I would say that not many people noticed and that it isn't normal. The problem with that approach, and the reason many people don't use it, is that it requires all data for a partition to be available at the time you add the partition. That necessarily implies a time delay in the process of loading data, which is no longer acceptable in the world of straight-through processing, or whatever you call the need for zero processing delay in your industry. So people choose to load data directly into the main table, allowing it to be immediately available, though at the cost of some performance. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 09, 2008 at 02:38:21PM -0500, Chris Browne wrote: Ron Mayer [EMAIL PROTECTED] writes: Or am I missing something? Well, this can head in two directions... 1. Suppose we're not using an organize in CLUSTER order approach. If the data is getting added roughly in insertion order, then there's no reason to expect San Francisco data to be clustered together. Ditto for John Smiths; we can expect them to be as scattered as their dates of creation. 2. Suppose we *are* using an organize in CLUSTER order approach. In that case, yes, indeed, you can expect segments to contain specific ranges of the data. However, in that case, the ONLY dimension under which Segment Exclusion may be expected to be useful is that of the first column of the index being used. Smith should be useful to SE, but not John, because, as a secondary criterion, the first name values will be scattered across all segments. One of the reasons for using partitions is that the usefulness of indexes breaks down because of the low cardinality of the indexed column. Also, the random IO can kill your performance. In my opinion, CLUSTER (and vacuum, mentioned elsewhere) are not data management tools. Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 09, 2008 at 08:17:41PM +, Simon Riggs wrote: On Wed, 2008-01-09 at 20:03 +0100, Gavin Sherry wrote: I think Simon's approach is probably more complex from an implementation POV. Much of the implementation is exactly the same, and I'm sure we agree on more than 50% of how this should work already. We just need to close in on the remainder. My current opinion on the SE approach is the opposite of the one above, though it gets us nowhere just to state it. I'm trying to avoid opinion and look at the details, which is the reason my viewpoint recently changed in favour of the dynamic approach as the main thrust for implementation. I've written a detailed email this morning to explain where and how the problems lie, which are nowhere near the syntax level. I haven't ruled out a declarative approach yet, but I need some detailed technical review of the issues. I hope you'll be replying to that? I'm responding to that email separately. Just a matter of addressing all the points in the detail it deserves. 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. Well, I don't think I'm proposing we /blindly follow/ others. I propose we choose a grammar which takes the best of what others have tried to do. Oracle's grammar is hideous, IBM's is too restrictive, for example. I assume the new grammar is good and, if we do go that way, it sounds like the right starting place. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example, if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate.
This will result in some sloppiness in the partitioning, e.g. if we fill one partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. Many people, in my experience, doing the kind of data processing which benefits from partitioning are regularly loading data, rather than collecting it in an OLTP fashion. Let's take the easily understandable concept of processing web site traffic. If the amount of data is large enough to benefit from partitioning, they probably have multiple web servers and therefore almost certainly multiple log files. If these files are not sorted into a single file, the records will not have a naturally progressing chronology: with every file we go back to the beginning of the period of time the load covers. If you add parallelism to your load, things look even more different. This means you could end up with a bunch of partitions, under the dynamic model, which all cover the same time range. Depends how big we make the partitions and how sloppy this is as to whether that is a problem or not. We might still expect a x100 gain from using the SE approach depending upon the data volume. This comes back to my problem with the dynamic approach: the user might get a certain experience but with the declarative approach the expectation matches the result. Then there's the way that really big databases are used (say, up around Simon's upper bound of 16 TB).
It is expensive to keep data online so people aren't. They're loading and unloading data all the time, to perform different forms of analysis. That isn't my experience. That sounds very time consuming. I cannot offer any evidence but it's what we see companies doing. The storage cost issue was the reason Andrew wanted offline segments, and why I have been talking about hierarchical storage. The thing is, people with lots of data usually have good hierarchical storage systems. Can we really do better? A common scenario in the example above might be to unload all but the current month's data and then load the same month from the previous year. The unload step needs to not be costly (i.e., TRUNCATE). Then, there's no guarantee that what they're interested in is the date range at all. They may want to
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi Simon, On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: Segment Exclusion - After we note that a segment is read-only we can scan the segment and record min/max values for all columns. These are then implicit constraints, which can then be used for segment exclusion in a similar manner as we do with the current constraint exclusion mechanism. What about columns which are varlen? The maximum value could be very long -- 1 GB in fact. Say we decide to not support those. Well, what about numeric? What about user defined types? This would also allow a Merge Join to utilise exclusion for the inner plan at execution time. Instead of scanning the whole inner table plan from the start, we would be able to start the scan from the appropriate segment. This would require us to pass the current value of the outer plan down into the inner plan. The typical executor nodes on the inner plan would be a SortNode and below that a SeqScan. In that case the executor would need to pass the outer value from the MergeJoinNode down through the SortNode to the SeqScan node. The SeqScan node could then perform partition exclusion, reducing the time for that scan and also reducing the time for the resulting sort. This sounds like it might be worth doing in normal cases also. It might turn out that the potentially applicable cases are already excluded during planning; I haven't thought about that aspect in enough detail yet. I don't think that would work at all. Say you have skew of data across the partitions. You may end up doing a seq scan for every outer tuple. It would look like a really expensive nested loop. If we collect data for all columns then many of our implicit constraints would be useless, e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive, so we must weed out poor constraints. We would do this by removing any constraint that overlapped more than 10% of other segments.
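The min/max constraint collection and the weeding heuristic being discussed could look roughly like this sketch (hypothetical Python, with an invented data shape; the 10% threshold comes from the proposal, everything else is assumption):

```python
# Rough sketch of the proposal's implicit constraints: record min/max per
# column per segment, then weed out columns whose ranges overlap most other
# segments and are therefore useless for exclusion.

def minmax(segments, col):
    """Per-segment (min, max) for one column; segments are lists of row tuples."""
    return [(min(r[col] for r in seg), max(r[col] for r in seg))
            for seg in segments]

def overlap_fraction(ranges, i):
    """Fraction of *other* segments whose range overlaps segment i's range."""
    lo, hi = ranges[i]
    others = [r for j, r in enumerate(ranges) if j != i]
    if not others:
        return 0.0
    return sum(1 for (l2, h2) in others if lo <= h2 and l2 <= hi) / len(others)

def useful_columns(segments, ncols, threshold=0.10):
    """Keep only columns where no segment overlaps more than the threshold."""
    keep = []
    for col in range(ncols):
        ranges = minmax(segments, col)
        if all(overlap_fraction(ranges, i) <= threshold
               for i in range(len(ranges))):
            keep.append(col)
    return keep

# Column 0 is chronological (disjoint ranges per segment); column 1 is a
# low-cardinality status code present in every segment.
segments = [
    [(1, "A"), (2, "B")],
    [(3, "A"), (4, "B")],
    [(5, "A"), (6, "B")],
]
print(useful_columns(segments, 2))  # prints [0]: the status column is weeded out
```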
Various heuristics would likely need to be discovered during development to make this work without resorting to manual commands. Note that all of this exclusion is happening in the executor, not the planner. That allows this form of exclusion to work with stable functions and parameters without problem. So, in that case, you've totally undercut the planner. All the steps up the tree would be based on the stats for scanning the entire table but you've only returned part of it. To solve that, you could put the min/max values in a catalog somewhere. For a 16 TB table, that's 16000 min/max values. That's pretty expensive. And how do we handle making things non-read-only? You'd have to wait for all current queries to finish... that could take a long time. How do we handle crashes during setting of min/max? I presume it needs WAL support. What happens if the free space map gives us a free block in a read-only segment? Then we need to look at concurrency and transactionality of min/max values, don't we? If changing it makes it not read-only (which seems to be the case) it would be trivial to totally degrade your partitioning scheme to full seq scan. On a 16 TB table. And getting the performance back might involve a day or more of VACUUMing. Impossible. Visibility Map -- [snip] No dynamic shared memory cache is required because any concurrent changes to the table would be ignored by a Scan anyway, so it doesn't matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any new scans that start will attempt to lock the table and then perform a rel cache check before continuing. So the visibility will be set correctly for *that* scan at least. Is that true in the presence of READ COMMITTED? In most cases the visibility map can be summarised as a single boolean to show whether any 100% visible segments exist. That makes accessing the map very cheap in the common, unset case. What do we do in failure scenarios? Does this go into WAL?
Is it replicated via log shipping techniques? You mention Slony support below. I don't know that Slony's target is Very Big Tables (up to 16 TB) but even if it was being used for a 'small' system of a few hundred GB, when you fail over the system has degraded if you aren't vacuuming it. Moreover, the layout of data may be different and therefore the performance of the segment exclusion. That's a bit of a surprise. Setting the Visibility Map -- VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to all current backends. Marking the map will cause a rel cache invalidation. Argh. Remember, we're talking about 16 TB tables here. With the best of tuning, that takes all day. So, day 1, load data; day 2 VACUUM it? With the existing partitioning approach, there's no such thing. We would never mark the highest numbered
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi Simon, On Wed, Jan 09, 2008 at 03:08:08PM +, Simon Riggs wrote: Do people really like running all that DDL? There is significant manpower cost in implementing and maintaining a partitioning scheme, plus significant costs in getting it wrong. Well... that's impossible for me to say. Most of the people who use lots of data like using MDX (a large and quite complex reporting and modelling language from MS, which is more complex than SQL IMHO) so... If people with large tables like partitioning, why is Oracle moving towards automated partitioning in 11g? Automated partitioning was one of Have you used Oracle's partitioning? I'm not surprised they're moving away ;-). More seriously, I didn't know that. Do you have a URL? the major goals for this next set of Postgres partitioning functionality also, whether or not we have declarative partitioning. My thinking is let's go for the best ideas and skip over the stuff that will be (is) obsolete before it's left the design stage. Well, you're assuming that because declarative partitioning has been around a while, it's obsolete. I don't think that's the case. Postgres has been around about as long... I see many more systems in the Terabyte range now than I did 10 years ago, but I see about the same number of DBAs. We'll always need experts, but I feel we should be using our expertise to simplify the standard cases, not just maintain the same level of difficulty in managing them. One of the big benefits of the dynamic partitioning approach is that it needs no DDL. So it will work out-of-the-box, for anybody. It's fine to say that, but people have already started talking about grammar extensions for dynamic partitions. People want the tools to manage it. There's difficulty in learning how and when to use the different VACUUM options that have been discussed. Deleting older data would be optimised under the proposed scheme, so that's not really a problem.
Loading data is actually slightly harder and slower with declarative partitioning (see below). Merging and splitting partitions are tools for fine tuning a very complex partitioning scheme. They do also allow a non-linear segmentation scheme, which might aid performance in some cases. What about adding partitions into a set of partitions? Greg's post on that was dismissed and it might not be representative but we have users doing that all the time. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill one partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. When we look at the data in the partition we can look at any number of columns. When we declaratively partition, you get only one connected set of columns, which is one of the reasons you want multi-dimensional partitioning in the first place. To this end, we (well, Jeff Cohen) looked at the syntax and semantics of partitioning in leading databases (Oracle, Informix, DB2) and came up with a highly expressive grammar which takes the best of each I think (I'll post details on the grammar in a separate thread). The idea is that range (for example, a date range), list (a list of distinct values) and hash partitioning be supported on multiple columns. Partitions can be added, merged, truncated. Partitions can reside on different tablespaces. The existing issues with the rewriter, COPY support and the like go away by smartening up the backend.
To explore the grammar and semantics Jeff and I (to a small extent) have implemented the parsing of such a grammar in the backend. At the moment, this just results in the generation of a bunch of rules (i.e., it produces the current behaviour, as if someone had written rules themselves) and debugging output. The fully fledged approach will see partitioning rules stored in a new system catalog which can be interrogated by the planner or COPY to determine which partition a given tuple belongs in or which partition(s) a WHERE clause matches. When loading data we would need to examine each incoming tuple and distribute it to the correct place in the partitioned heap. That's a non-trivial overhead on data loading that I would like to avoid, since if we rely on the natural locality of loaded data then that overhead is zero. Yes, but it's a lot faster than what we have at the moment and the dynamic approach is not without its costs. I've pointed it out before, but I will
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote: If the exclusion is executor driven, the planner cannot help but create a seq scan plan. The planner will think you're returning 100X rows when really you end up returning X rows. After that, all decisions made by the planner are totally bogus. One of the most important queries on large tables is handling WHERE EventDate = CURRENT_DATE; We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT_DATE function is STABLE. We can decide that some form of partition exclusion is possible, but the actual partition we access can *only* be decided within the executor. That necessarily affects the selectivity estimates. The planner can see we want some data, but it can't tell which data, so it doesn't know whether we will hit the day with the most data or the date with the least data. You've mentioned this a few times, as if it's a problem with dynamic partitioning, yet it's clearly an issue for any form of partitioning. So it seems clear that we need to make partition exclusion work at executor time, whatever else we do. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
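The plan-time versus execution-time split Simon describes can be sketched as follows (toy Python model, all names invented): a constant comparison value lets the planner prune immediately, while a STABLE expression such as CURRENT_DATE must be evaluated once at executor startup before pruning can happen.

```python
# Sketch of the plan-time vs. execution-time split: an immutable predicate
# value can prune partitions when the plan is built, but a STABLE expression
# like CURRENT_DATE can only be evaluated (once per query) by the executor,
# so pruning must happen there.

import datetime

def plan(partitions, predicate_value=None):
    """Planner: prune now if the comparison value is a known constant."""
    if predicate_value is not None:                      # immutable case
        return [p for p in partitions
                if p["lo"] <= predicate_value <= p["hi"]]
    return list(partitions)                              # defer to executor

def execute(planned, stable_expr):
    """Executor: evaluate the stable expression once, then prune."""
    value = stable_expr()                                # e.g. CURRENT_DATE
    return [p for p in planned if p["lo"] <= value <= p["hi"]]

days = [datetime.date(2008, 1, d) for d in range(1, 11)]
partitions = [{"lo": d, "hi": d} for d in days]

# WHERE eventdate = CURRENT_DATE: nothing pruned at plan time ...
planned = plan(partitions)
assert len(planned) == len(partitions)
# ... but the executor scans exactly one partition.
hit = execute(planned, lambda: datetime.date(2008, 1, 5))
print(len(hit))  # prints 1
```

This also shows Gavin's complaint: the planner's row estimate has to cover all ten partitions even though only one is scanned.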
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Tue, Jan 08, 2008 at 01:08:52AM +0100, Markus Schiltknecht wrote: Uh, which key are you talking about? AFAIU Simon's proposal, he suggests maintaining min/max values for all columns of the table. Right, but I think that's just because that approach is automatable. Only some use cases are going to be appropriate to this. Yeah, and if only *one* tuple in the 1G segment has: some_date = '1998-12-31' OR some_date = '2001-01-01' Segment Exclusion can't exclude that segment. That's all I'm saying. Correct. Huh? I'm certainly not the one asking for it. Quite the opposite, I'm warning from over-estimating the use of SE. Right; I think one should be clear that there are many -- maybe most -- uses of PostgreSQL where the proposal will be of no use. I just think we need to be clear that for the areas where it _can_ be useful, it could be very useful indeed. A
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Tue, Jan 08, 2008 at 02:12:28AM +, Gregory Stark wrote: Yes: it doesn't solve the problem I have, which is that I don't want to have to manage a whole bunch of tables. I want one table, and I want to be able to say, That section is closed. That's not your problem, that's the solution you're looking for. You're assuming a particular solution in your problem statement. Probably in that one, yes. I'm still waiting for permission to post my original problem statement; I suspect it's not going to be forthcoming by next Monday, so it's not going to happen. But I did outline something like what I'm talking about elsewhere in this thread. For my case, I'm thinking of the sort of data that builds up over time, and most of which happens probably not to be useful at any moment, but all of which _might_ be useful over the long haul. So what I wanted, originally, was to be able to set arbitrary ranges of tuples to be read-only, and to be able to set them offline if I wanted. Pseudo-DDL: ALTER TABLE foo SET read_only='t' WHERE created_on < '2007-01-01'; ALTER TABLE foo SET tuple_offline='t' WHERE created_on < '2006-01-01'; Now, the second segment is marked offline. If I query the table for things in that range, I get an ERROR telling me there might be data in the range, but it's not mounted at the moment. If I try to update records in the read-only (first) range, I get an error telling me the tuple is marked read only. The idea then is that these older tuples can be put off into long-term storage (wave hands here about the management of that stuff), and this keeps my online system compact but yet allows me, for just the cost of mounting a backup tape and reading the segments back, to go back and query those old ranges. The case I was particularly aiming at originally was for a case of data that cannot cost more than fractions of pennies to store, but that might represent a hugely expensive liability if the answer is not always right.
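The behaviour sketched in the pseudo-DDL above can be modelled in a few lines (toy Python, invented API, not real DDL semantics): updates into a read-only range fail, and a query overlapping an offline range raises an error rather than silently returning partial results.

```python
# Toy model of the requested behaviour: ranges of a table can be marked
# read-only or offline; updates to those ranges fail, and queries that
# overlap an offline range raise an error instead of returning partial data.

class RangeStateError(Exception):
    pass

class Table:
    def __init__(self):
        self.read_only = []   # list of (lo, hi) created_on ranges
        self.offline = []

    def _in(self, ranges, key):
        return any(lo <= key <= hi for lo, hi in ranges)

    def update(self, created_on):
        if self._in(self.read_only, created_on) or self._in(self.offline, created_on):
            raise RangeStateError("tuple range is read-only or offline")
        return "updated"

    def query(self, lo, hi):
        for olo, ohi in self.offline:
            if lo <= ohi and olo <= hi:   # ranges overlap
                raise RangeStateError("range not mounted; restore segment first")
        return "rows"

foo = Table()
foo.read_only.append((2006, 2006))   # like: SET read_only WHERE created_on < ...
foo.offline.append((2005, 2005))     # like: SET tuple_offline WHERE created_on < ...

print(foo.update(2007))              # current data still writable
try:
    foo.query(2005, 2007)            # touches an offline range -> ERROR
except RangeStateError as e:
    print("ERROR:", e)
```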
Driving down that storage cost was mostly what I was aiming at, but people gradually convinced me that slightly more generic implementations might be useful. Simon's proposal came along, and it seems to me to be something like the generic implementation that others already convinced me was needed. I think Simon's proposal loses the very feature that makes partitioning useful. The DBA doesn't have a thing to describe, he has to define what parts of the table he's describing for every operation. And if you define a whole new object to name these things I think you'll end up with something that looks a lot like tables. I don't see how that's the case at all. In fact, I have the feeling it's the opposite, so perhaps I've misunderstood something. A
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: This technique would be useful for any table with historical data keyed by date or timestamp. It would also be useful for data where a time-of-insert component is implicit, such as many major entity tables where the object ids are assigned by a sequence, e.g. an Orders table with an OrderId as PK. Once all orders placed in a period have been shipped/resolved/closed then the segments will be marked read-only. It's not really going to change the code path much for small tables, yet seems likely to work reasonably well for large tables of all shapes and sizes. If a segment is being updated, we leave it alone, and maybe never actually set the visibility map at all. So overall, this idea seems to cover the main use case well, yet with only minimal impact on the existing use cases we support. As before, I will maintain this proposal on the PG developer Wiki, once we get down to detailed design. Like it? Simon, A novel approach to the problem. For me, this proposal addresses some of the other problems in postgres (visibility in the heap vs. index) rather than the problems with partitioning itself. It might seem otherwise, but for me partitioning is a tool for managing large volumes of data. It allows the user to encode what they know about the nature of their data, and that's often about management. In this way, your proposal actually makes partitioning too smart: the user wants to tell the system how to organise the data, as opposed to the other way around. At Greenplum, we've been discussing this in depth. Interestingly, we also felt that the storage layer could be much smarter with large tables with predictable layouts and predictable data patterns. But the thing is, people with large tables like partitioning, putting different partitions on different storage; they like the ability to merge and split partitions; they like truncating old partitions.
In a word, they seem to like the manageability partitions give them -- as well as the performance that comes with this. To this end, we (well, Jeff Cohen) looked at the syntax and semantics of partitioning in leading databases (Oracle, Informix, DB2) and came up with a highly expressive grammar which takes the best of each I think (I'll post details on the grammar in a separate thread). The idea is that range (for example, a date range), list (a list of distinct values) and hash partitioning be supported on multiple columns. Partitions can be added, merged, truncated. Partitions can reside on different tablespaces. The existing issues with the rewriter, COPY support and the like go away by smartening up the backend. To explore the grammar and semantics Jeff and I (to a small extent) have implemented the parsing of such a grammar in the backend. At the moment, this just results in the generation of a bunch of rules (i.e., it produces the current behaviour, as if someone had written rules themselves) and debugging output. The fully fledged approach will see partitioning rules stored in a new system catalog which can be interrogated by the planner or COPY to determine which partition a given tuple belongs in or which partition(s) a WHERE clause matches. Yes, this is the traditional approach but I think it still has merit. We shall continue working on this approach because it is what our customers have asked for. We would also like to see it get into postgres too. Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, 2008-01-02 at 17:56 +, Simon Riggs wrote: Like it? Very cool :-) One additional thought: what about a kind of segment fill factor? Meaning: each segment has some free space reserved for future updates/inserts of records in the same range of its partitioning constraint. And when inserting/updating you put the new record into the corresponding segment... just like a very coarse clustering. Then you could vacuum the segments separately to keep the free space from running out. For active segments you would then fix the partitioning constraint range once the fill factor is reached, to allow for keeping its constraint even when heavily updating (heavily vacuuming it too as response to that), and create a new segment for the unbounded range for new inserts... this would work fine for tables where the constraint is based on ever increasing keys and accidental inserts in old ranges (which do happen occasionally in real life). When the change rate of old segments gets down, the segments could be reorganized to have a smaller fill factor, so that you still allow for accidental updates but keep space usage efficient. This would be some similar action as a clustering, but hopefully not blocking (which might be a hard thing to do)... and later again you could mark some of the really old things as read only and put them in special segments with no wasted space. One problem would be when the segment's free space runs out, so you must put records from the same constraint range in multiple segments - but that could still work, you just would have multiple segments covering the same range, but if the segment fill factor is chosen properly it should not be the case... you could normally maintain a set of non-overlapping segments in terms of the partitioning constraint. Cheers, Csaba.
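Csaba's fill-factor idea can be sketched as an insert-routing rule (hypothetical Python with invented numbers): a new key goes into a segment whose range covers it and which still has space; otherwise it goes to an open segment whose range is still growing, and a segment's range is frozen once its soft fill limit is reached.

```python
# Sketch of the "segment fill factor" idea: route each insert to a segment
# whose key range covers the new key and which still has reserved free
# space; otherwise use (or open) a segment for the unbounded range.
# Capacity, fill factor and the routing policy are all invented here.

class Segment:
    def __init__(self, capacity=4, fill_factor=0.75):
        self.rows = []
        self.capacity = capacity
        # Reserve (1 - fill_factor) of the space for later same-range inserts.
        self.soft_limit = int(capacity * fill_factor)
        self.bounded = False          # range is fixed once soft limit reached

    def key_range(self):
        return (min(self.rows), max(self.rows)) if self.rows else None

    def insert(self, key):
        self.rows.append(key)
        if len(self.rows) >= self.soft_limit:
            self.bounded = True       # stop growing this segment's range

def route_insert(segments, key):
    for seg in segments:              # coarse clustering: reuse matching range
        r = seg.key_range()
        if r and r[0] <= key <= r[1] and len(seg.rows) < seg.capacity:
            seg.insert(key)
            return seg
    for seg in segments:              # a segment still accepting new keys
        if not seg.bounded and len(seg.rows) < seg.capacity:
            seg.insert(key)
            return seg
    seg = Segment()                   # all bounded or full: fresh segment
    segments.append(seg)
    seg.insert(key)
    return seg

segments = []
for key in [1, 2, 3, 10, 11, 12, 2]:  # note the late insert into the old range
    route_insert(segments, key)
print([seg.rows for seg in segments])
```

The late insert of key 2 lands in the first segment's reserved space rather than opening an overlapping segment, which is exactly the effect the fill factor is meant to buy.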
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi Csaba, Csaba Nagy wrote: One additional thought: what about a kind of segment fill factor ? Meaning: each segment has some free space reserved for future updates/inserts of records in the same range of it's partitioning constraint. And when inserting/updating you put the new record into the corresponding segment... just like a very coarse clustering. Hm.. yeah. That way, a few writes to a read optimized segment could be accepted, without having to drop the optimization immediately. And the other way around: generally prevent having to drop the optimization by forcing tuples to be written to a segment with matching min/max tuples. Although, that's not exactly trivial, I think. However, for tables which don't fit the use case of SE, people certainly don't want such a fill factor to bloat their tables. Regards Markus ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, 2008-01-07 at 13:59 +0100, Markus Schiltknecht wrote: However, for tables which don't fit the use case of SE, people certainly don't want such a fill factor to bloat their tables. Sure, but it could be configurable and should only be enabled if the table is marked as partitioned on some condition... I think it would be a bad idea anyway if the DB would start partitioning on some arbitrary criteria based on analyzing the table, so the DBA should be the one to decide on what criteria to partition. In particular it could be a bad idea on occasions to partition on the clustering criteria for tables which were clustered once. Cheers, Csaba. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, 2008-01-07 at 14:20 +0100, Markus Schiltknecht wrote: Why is that? AFAIUI, Segment Exclusion combines perfectly well with clustering. Or even better, with an upcoming feature to maintain clustered ordering. Where do you see disadvantages of such an optimization for sequential scans? Well, as I understood it, this would be some kind of special case of clustering, where the cluster key is expected to be ever increasing in time and new rows would not be randomly distributed over the complete possible range. In theory you could also have each segment in turn be clustered on some other criteria than the partitioning criteria, so indexed access could also be better on the main selection criteria, which could be different than the partitioning criteria. All this is of course just speculation - but I guess that's what you expected too in this discussion :-) Cheers, Csaba. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Csaba Nagy wrote: Sure, but it could be configurable and should only be enabled if the table is marked as partitioned on some condition... As I'm regarding SE as an optimization, I disagree here.. Like all optimizations, SE should conceptually be reasonably close to cost-less when unneeded. I think it would be a bad idea anyway if the DB would start partitioning on some arbitrary criteria based on analyzing the table, so the DBA should be the one to decide on what criteria to partition. I absolutely agree for real partitioning, which targets manageability of table partitions. In particular it could be a bad idea on occasions to partition on the clustering criteria for tables which were clustered once. Why is that? AFAIUI, Segment Exclusion combines perfectly well with clustering. Or even better, with an upcoming feature to maintain clustered ordering. Where do you see disadvantages of such an optimization for sequential scans? Regards Markus ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sat, Jan 05, 2008 at 08:02:41PM +0100, Markus Schiltknecht wrote: Well, management of relations is easy enough, known to the DBA and most importantly: it already exists. Having to set up something which is *not* tied to a relation complicates things just because it's an additional concept. But we're already dealing with some complicated concepts. There isn't anything that will prevent current-style partitioning strategies from continuing to work in the face of Simon's proposal. But let me see if I can outline the sort of cases where I see real value in what he's outlined. There is a tendency in data systems to gather all manner of data that, in retrospect, _might_ turn out to be valuable; but which, at the time, is not really valuable at all. Moreover, the value later on might be relatively low: if you can learn something much later from that data, and do so easily, then it will be worth doing. But if the work involved passes some threshold (say 1/2 a day), it's suddenly not worth it any more. It's simple economics: below a certain cost, the data is valuable. Above a certain cost, you simply shouldn't keep the data in the first place, because the cost of using it is higher than any value you'll likely be able to extract. Simon's proposal changes the calculations you have to do. If keeping some data online longer does not impose administrative or operational overhead (you have it marked read only, so there's no I/O for vacuum; you don't need to do anything to get the data marked read only; &c.), then all it costs is a little more disk, which is relatively cheap these days. More importantly, if the longer-term effect of this strategy is to make it possible to move such data offline _without imposing a big cost_ when moving it back online, then the value is potentially very high. Without even trying, I can think of a dozen examples in the past 5 years where I could have used that sort of functionality. 
Because the cost of data retrieval was high enough, we had to decide that the question wasn't worth answering. Some of those answers might have been quite valuable indeed to the Internet community, to be frank; but because I had to pay the cost without getting much direct benefit, it just wasn't worth the effort. The thing about Simon's proposal that is beguiling is that it is aimed at a very common use pattern. The potential for automatic management under such a use pattern makes it seem to me to be worth exploring in some detail. Agreed. I'd say that's why the DBA needs to be able to define the split point between partitions: only he knows the meaning of the data. I think this is only partly true. A casual glance at the -general list will reveal all manner of false assumptions on the parts of administrators about how their data is structured. My experience is that, given that the computer has way more information about the data than I do, it is more likely to make the right choice. To the extent it doesn't do so, that's a problem in the planning (or whatever) algorithms, and it ought to be fixed there. A ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Andrew Sullivan wrote: On Sat, Jan 05, 2008 at 08:02:41PM +0100, Markus Schiltknecht wrote: Well, management of relations is easy enough, known to the DBA and most importantly: it already exists. Having to set up something which is *not* tied to a relation complicates things just because it's an additional concept. But we're already dealing with some complicated concepts. Possibly, yes, but that's hardly a reason to introduce even more complicated concepts... Does anything speak against letting the DBA handle partitions as relations? There isn't anything that will prevent current-style partitioning strategies from continuing to work in the face of Simon's proposal. Agreed. Nor will Simon's proposal completely replace them. Without even trying, I can think of a dozen examples in the past 5 years where I could have used that sort of functionality. Because the cost of data retrieval was high enough, we had to decide that the question wasn't worth answering. Some of those answers might have been quite valuable indeed to the Internet community, to be frank; but because I had to pay the cost without getting much direct benefit, it just wasn't worth the effort. Sure, there's value in Simon's proposal. But it has pretty strict requirements. IMO, it's pretty hard to say if it would have helped at all for your cases. Any of them still available to check? Remember the requirements: no single tuple in the segment may be significantly out of the average bounds. Otherwise, the min/max gets pretty useless and the segment can never be excluded. As said before, combination with CLUSTERing might help, yes. But if you need to maintain CLUSTERed ordering, aren't there better ways? For example, you could use binary searching on the relation directly, much like with indices, instead of sequentially scanning the CLUSTERed relation. That would even give us some sort of indices with visibility. Agreed. 
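Markus's binary-search aside can be illustrated with a small sketch (hypothetical names; Python used only for illustration, not PostgreSQL code): if a relation is kept in clustered order, the first key stored in each block partitions the key space, and a lookup can bisect over block boundaries instead of scanning, behaving much like an index but with visibility information already in the heap.

```python
from bisect import bisect_right

def find_block(block_first_keys, key):
    """Given the first key stored in each block of a relation kept in
    clustered order, return the index of the only block that could
    contain `key` (keys below the first block map to block 0)."""
    i = bisect_right(block_first_keys, key) - 1
    return max(i, 0)
```

A lookup then reads a single block plus O(log n) boundary probes, rather than the whole clustered relation.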
I'd say that's why the DBA needs to be able to define the split point between partitions: only he knows the meaning of the data. I think this is only partly true. A casual glance at the -general list will reveal all manner of false assumptions on the parts of administrators about how their data is structured. My experience is that, given that the computer has way more information about the data than I do, it is more likely to make the right choice. To the extent it doesn't do so, that's a problem in the planning (or whatever) algorithms, and it ought to be fixed there. Well, Postgres doesn't automatically create indices, for a counter example. With regard to partitioning over multiple table spaces, I think the DBA definitely has more information available than the computer. A DBA (hopefully) knows future plans and emergency strategies for the storage system, for example. Lacking such information, the database system will have a pretty hard time making a good decision on how to partition between table spaces, IMO. Regards Markus ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: Does anything speak against letting the DBA handle partitions as relations? Yes: it doesn't solve the problem I have, which is that I don't want to have to manage a whole bunch of tables. I want one table, and I want to be able to say, That section is closed. Sure, there's value in Simon's proposal. But it has pretty strict requirements. IMO, it's pretty hard to say if it would have helped at all for your cases. Any of them still available to check? No, but one of your worries doesn't bother me: Remember the requirements: no single tuple in the segment may be significantly out of the average bounds. Otherwise, the min/max gets pretty useless and the segment can never be excluded. The segment can never be excluded in a search on that key, yes. But consider the likely cases we're looking at: WHERE some_date >= '1999-01-01' AND some_date < '2001-01-01'; WHERE sequence_field BETWEEN 3000 AND 30; &c. These are the two obvious cases: you're searching for data in a given date range or for primary (sometimes artificial) identifiers in a range, and the source data increases (almost) monotonically. You have to do this now anyway, because there's _some_ basis on which you're partitioning your data; but today, you do this with a lot of fooling around with views and nasty triggers that push data into the right table, assuming someone doesn't screw it up. need to maintain CLUSTERed ordering, aren't there better ways? For example, you could use binary searching on the relation directly, much like with indices, instead of sequentially scanning on the CLUSTERed relation. That would even give us some sort of indices with visibility. I think this is a nice idea too :) Well, Postgres doesn't automatically create indices, for a counter example. Yes, and it has no data-use analyser tools that automatically suggest indexes, either. 
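The exclusion test behind these example queries is just a range-overlap check against per-segment min/max summaries. A sketch (hypothetical function name; Python used only for illustration, not PostgreSQL code), treating the query as the half-open range [q_lo, q_hi):

```python
def can_exclude(seg_min, seg_max, q_lo, q_hi):
    """True if no tuple in the segment can satisfy q_lo <= x < q_hi:
    the segment's stored [min, max] and the query range don't overlap."""
    return seg_max < q_lo or seg_min >= q_hi
```

Because ISO date strings compare lexicographically, the same check works unchanged for the date-range case above.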
That's the sort of thing people coming from other (err, Other ;-) products complain about, in fact. definitely has more information available, than the computer. A DBA (hopefully) knows future plans and emergency strategies for the storage system, for example. Perhaps my jaundice comes from too much time spent in operational trenches, but while good DBAs have some ideas about that, large numbers of them are harried and overwhelmed just by the piles of work they already have. Nevertheless, while what you say is true, I'm not sure what it has to do with the present case. I don't think the current proposal is to address partitioning across table spaces. It's to do with the way certain segments of a table are interpreted by the system. It's undoubtedly true that this strategy is of questionable utility for many kinds of use of PostgreSQL. But it seems to offer very significant advantages for one use-pattern that is very common. That said, I am not trying to argue it should be adopted without poking at its weaknesses. I just think it unfair to ask the proposal to address problems it's not really aimed at. A ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Andrew Sullivan wrote: On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: ...the requirements: no single tuple in the segment may be significantly out of the average bounds. Otherwise, the min/max gets pretty useless and the segment can never be excluded. The segment can never be excluded in a search on that key, yes. But consider the likely cases we're looking at: ...you're searching for data in a given date range or for primary (sometimes artificial) identifiers in a range, and the source data increases (almost) monotonically. It seems to me the idea discussed elsewhere in the thread[1] about configurable segment sizes would make this limitation much less problematic for some types of data. Some of my biggest tables are clustered by zip code, and are insert-mostly. Common queries are where state_provence='TX' or where city='Dallas'. While I doubt I have enough data to fill a 1GB segment for any but the largest cities, I certainly have runs of many consecutive blocks - since clustering by zip tends to cluster cities as well. Even though the table's not monotonically increasing or decreasing, like values for cities and states are clustered together. Is my understanding right that these Segment Visibility Maps could help this case as well? [1] http://archives.postgresql.org/pgsql-hackers/2008-01/msg00065.php ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Andrew Sullivan wrote: On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: Does anything speak against letting the DBA handle partitions as relations? Yes: it doesn't solve the problem I have, which is that I don't want to have to manage a whole bunch of tables. I want one table, and I want to be able to say, That section is closed. That's a fair requirement. Wanting to manage partitions manually is another fair requirement, IMO. They can well coexist. Remember the requirements: no single tuple in the segment may be significantly out of the average bounds. Otherwise, the min/max gets pretty useless and the segment can never be excluded. The segment can never be excluded in a search on that key, yes. But consider the likely cases we're looking at: Uh, which key are you talking about? AFAIU Simon's proposal, he suggests maintaining min/max values for all columns of the table. WHERE some_date >= '1999-01-01' AND some_date < '2001-01-01'; Yeah, and if only *one* tuple in the 1G segment has: some_date <= '1998-12-31' OR some_date >= '2001-01-01' Segment Exclusion can't exclude that segment. That's all I'm saying. That said, I am not trying to argue it should be adopted without poking at its weaknesses. I just think it unfair to ask the proposal to address problems it's not really aimed at. Huh? I'm certainly not the one asking for it. Quite the opposite, I'm warning against over-estimating the use of SE. In his proposal, Simon was explicitly comparing to declarative partitioning, pointing out lots of advantages and implicitly stating that SE could cover most, if not all, needs of what's commonly understood by partitioning. That's where I disagree. But certainly, SE and SVM have some merit for their use case. And I'm looking forward to test patches ;-) Regards Markus ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
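Markus's point can be made concrete with a small sketch (hypothetical names; Python used only for illustration, not PostgreSQL code): a single stray tuple widens a segment's stored min/max, and the segment stops being excludable for queries on neighbouring ranges, even though almost all of its data is out of range.

```python
def minmax(values):
    # the per-segment summary Simon's proposal would maintain
    return min(values), max(values)

def can_exclude(seg_min, seg_max, q_lo, q_hi):
    # segment is skippable iff its [min, max] cannot overlap [q_lo, q_hi)
    return seg_max < q_lo or seg_min >= q_hi
```

For a segment holding only year-2000 dates, a query for 1998 data excludes it; add one '1998-12-31' tuple and the same segment must always be scanned for such queries.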
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Andrew Sullivan [EMAIL PROTECTED] writes: On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote: Does anything speak against letting the DBA handle partitions as relations? Yes: it doesn't solve the problem I have, which is that I don't want to have to manage a whole bunch of tables. I want one table, and I want to be able to say, That section is closed. That's not your problem, that's the solution you're looking for. You're assuming a particular solution in your problem statement. I posit that the whole point of partitioning is to create an object which serves to represent the semantically meaningful chunks of data. The reason this is useful is precisely because it serves as a short-hand for the DBA to describe the data and how it will be used. I think Simon's proposal loses the very feature that makes partitioning useful. The DBA doesn't have a thing to describe, he has to define what parts of the table he's describing for every operation. And if you define a whole new object to name these things I think you'll end up with something that looks a lot like tables. I also don't understand how this proposal deals with the more common use case of unloading and loading data. Normally in partitioned tables we build the data in a side table until the data is all correct, then load it as a partition. If you treat it as a lower-level object then I don't see that working. The layout of the new table won't often match the layout of the target partitioned table. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support! ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Gregory Stark wrote: I also don't understand how this proposal deals with the more common use case of unloading and loading data. Normally in partitioned tables we build the data in a side table until the data is all correct, then load it as a partition. If you treat it as a lower-level object then I don't see that working. The layout of the new table won't often match the layout of the target partitioned table. +1 In addition, a similar use case is archival of old partitions as they are no longer (commonly) accessed - perhaps to a different tablespace (or even to backup media). I don't see how dynamic in-table partitioning handles this, and I think it would be highly desirable to be able to do these things! best wishes Mark ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Jan 6, 2008 3:00 AM, Robert Treat [EMAIL PROTECTED] wrote: On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote: To satisfy all the different requirements of partitioning with segments based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. Why? Uh.. if a table (RELKIND_RELATION) can only span one table space, as it is now, all of its segments are in the same table space. I don't quite call that partitioning. Well, sure, you could call it so, but then, each and every Postgres table is already partitioned in 1G segments. It all depends on the definitions, but in my world, horizontal partitioning for databases involves multiple table spaces (and is quite useless without that). Calling anything else partitioning is confusing, IMO. I'm not following this. If we can work out a scheme, I see no reason not to allow a single table to span multiple tablespaces. Do you see a problem with that? If we can span two different partitions under two tablespaces, it would help concepts like parallel queries, parallel updates etc. in data warehousing. If we allow a single table to span across multiple tablespaces, the question arises how we would be able to move around different parts of the table, as we like. Segments, I believe, don't let the user/DBA draw the split points. In a more general sense, a global index is an index that spans multiple partitions, as opposed to a local index, which is an index on specific partitions; postgresql currently supports the latter, not the former. In any case, my thinking is if we had the segment exclusion technique, I could convert that partitioned table into a regular table again, use segment exclusion to handle what is currently handled by partitions, and create a global index across all the other data for that other, currently killer, query. That's a good idea. A local index, then, would be equivalent to partial indexes. 
But single-table partition maintenance might not be as flexible as the multi-table partition approach. In real partitioning, dropping a single partition should not affect the local indexes of other partitions. But I think that would be very difficult to implement under this scheme. Thanks, Gokul.
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Jan 6, 2008 11:27 AM, [EMAIL PROTECTED] wrote: On Sun, Jan 06, 2008 at 01:12:32AM +0530, Gokulakannan Somasundaram wrote: On Jan 5, 2008 6:15 PM, [EMAIL PROTECTED] wrote: One thought I had back then, with partitioned tables was gee -- B-tree index is already doing a partition; why do a manual partition on top of that?. Can you please explain more on what you are trying to say here? Sure. A B-tree is just a device to partition something along some order. If you have, say, a table of orders (to use the example upthread) and a B-tree index on order date, this index partitions your set (at recursively finer levels). But the current index scans - Index Scan and Bitmap Index Scan - don't provide the exact benefit of partitioning, even if all the columns are covered by the index. They do a lot more disk reads than the partitioning scheme. I think you are looking for something like block indexes in Multi-Dimensional Clustering in DB2. Heikki did something like that in a more subtle way. PostgreSQL CLUSTER, as you may know, doesn't maintain the order on inserts. We might go for Index Organized Tables/Clustered indexes. But then, B-tree would give a lot of performance problems if the index tuple size increases beyond a certain point. Thanks, Gokul.
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Gokulakannan Somasundaram wrote: On Jan 5, 2008 6:15 PM, [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: One thought I had back then, with partitioned tables was gee -- B-tree index is already doing a partition; why do a manual partition on top of that?. Can you please explain more on what you are trying to say here? I think this has to do with SE not being of much use for index scans. Or put it another way: SE is an optimization for sequential scans. For tables where it works well, it could possibly replace the index entirely. Without the index, you would rely on SE to always be able to exclude enough segments, so that the seq scan is less expensive than an index scan with the following table lookups. With an index, the planner gets a hard time deciding between the index scan and the (possibly SE optimized) seq scan. Regards Markus ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Robert Treat wrote: On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote: To satisfy all the different requirements of partitioning with segment-based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. Why? Uh.. if a table (RELKIND_RELATION) can only span one table space, as it is now, all of its segments are in the same table space. I don't quite call that partitioning. Well, sure, you could call it so, but then, each and every Postgres table is already partitioned in 1G segments. It all depends on the definitions, but in my world, horizontal partitioning for databases involves multiple table spaces (and is quite useless without that). Calling anything else partitioning is confusing, IMO. I'm not following this. If we can work out a scheme, I see no reason not to allow a single table to span multiple tablespaces. Do you see a problem with that? Uhm... well, no. I was just pointing out that it's a requirement. It depends on how you define things, but I'm seeing it that way: table -- 1:n -- partition -- 1:1 -- table space -- 1:n -- segments What I'm advocating is making partitions available to the DBA as some kind of a relation she can query separately and move around between table spaces. Why should that not be possible with other schemes? Moving the split point between two partitions involves moving tuples around, no matter if you are going to move them between segments or between relations building the partitions. The difference is that, if I currently have a table split by month, I can re-partition it into weekly segments, and only shuffle one month's data at a time, minimizing impact on the system while I shuffle it. This can even be used to do dynamic management, where data from the current month is archived by day, data from the past year by week, and data beyond that done monthly. This should be possible for both schemes, I see no connection to what we've discussed. 
SE doesn't magically give you this level of control you are requesting here. Quite the opposite: referring to CLUSTERing makes me wonder if that's not going to shuffle way too many tuples around. What I'm saying is that SE doesn't partition the segments into different table spaces. Thus I don't consider it database partitioning in the first place. As I currently understand it, it's: table -- 1:1 -- table space -- 1:n -- partitions -- 1:n -- segments On many other databases, if you change the partition scheme, it requires exclusive locks and a shuffling of all of the data, even data whose partitions aren't being redefined. Even worse are systems like mysql, where you need to rewrite the indexes as well. To me, these requirements always seem like show stoppers; I generally can't afford to lock a table while the database rewrites a billion rows of data. I fully agree here. How do you plan to solve that problem on top of SE? In a more general sense, a global index is an index that spans multiple partitions, as opposed to a local index, which is an index on specific partitions; postgresql currently supports the latter, not the former. In any case, my thinking is if we had the segment exclusion technique, I could convert that partitioned table into a regular table again, ... on a single table space ... use segment exclusion to handle what is currently handled by partitions, ... except, that there is no partitioning (!?!) (between table spaces) and create a global index across all the other data for that other, currently killer, query. I thought the table you are referring to is bigger than your fastest table space? That would even make it impossible. See where I'm coming from? And why I'm stating that SE is an optimization (for seq scans), but not partitioning? Regards Markus ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Jan 6, 2008 4:09 PM, Markus Schiltknecht [EMAIL PROTECTED] wrote: Hi, Gokulakannan Somasundaram wrote: On Jan 5, 2008 6:15 PM, [EMAIL PROTECTED] wrote: One thought I had back then, with partitioned tables was gee -- B-tree index is already doing a partition; why do a manual partition on top of that?. Can you please explain more on what you are trying to say here? I think this has to do with SE not being of much use for index scans. Or put it another way: SE is an optimization for sequential scans. For tables where it works well, it could possibly replace the index entirely. Without the index, you would rely on SE to always be able to exclude enough segments, so that the seq scan is less expensive than an index scan with the following table lookups. With an index, the planner gets a hard time deciding between the index scan and the (possibly SE optimized) seq scan. That's a good point. But I think Simon is planning to give the job not to the planner, but to the executor. So SE optimization will come into play only when the planner has decided on a sequential scan.
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sunday 06 January 2008 05:48, Markus Schiltknecht wrote: What I'm saying is that SE doesn't partition the segments into different table spaces. Thus I don't consider it database partitioning in the first place. As I currently understand it, it's: table -- 1:1 -- table space -- 1:n -- partitions -- 1:n -- segments Ah, that was a slight misunderstanding of terminology. I see what you're getting at. On many other databases, if you change the partition scheme, it requires exclusive locks and a shuffling of all of the data, even data whose partitions aren't being redefined. Even worse are systems like mysql, where you need to rewrite the indexes as well. To me, these requirements always seem like show stoppers; I generally can't afford to lock a table while the database rewrites a billion rows of data. I fully agree here. How do you plan to solve that problem on top of SE? In a more general sense, a global index is an index that spans multiple partitions, as opposed to a local index, which is an index on specific partitions; postgresql currently supports the latter, not the former. In any case, my thinking is if we had the segment exclusion technique, I could convert that partitioned table into a regular table again, on a single table space ... use segment exclusion to handle what is currently handled by partitions, except, that there is no partitioning (!?!) (between table spaces) and create a global index across all the other data for that other, currently killer, query. I thought the table you are referring to is bigger than your fastest table space? That would even make it impossible. Nope, different table :-) Although the above global/local one would probably live entirely on the slow disks, using SE and a global index to satisfy the queries. And really our slow disks aren't exactly slow, but do have poor concurrency, so we put primarily read data on them to keep writes to a minimum. See where I'm coming from? 
And why I'm stating that SE is an optimization (for seq scans), but not partitioning? Yes, but aiui, SE should allow seq scans to achieve performance similar to partitioning, especially if the planner can exclude segments based on values in the data. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-04 at 22:31 -0500, Robert Treat wrote: Not to be negative, but istm how this feature would be managed is as important as the bits under the hood. Agreed. On this part of the thread, we've been discussing an extension to the basic proposal, which is why I have not been concentrating there. Core management wise, the basic proposal showed how we would be able to have VACUUM run much faster than before and how DELETE will also be optimised naturally by this approach. Loading isn't any slower than it is now; loading does need some work, but that's another story. Or at least we have to believe there will be some practical way to manage this, about which I am as yet skeptical. Skepticism is OK, but I'd like to get your detailed thoughts on this. I've been an advocate of the multi-tables approach now for many years, so I don't expect everybody to switch their beliefs on my say-so overnight. Let me make a few more comments in this area: The main proposal deliberately has few, if any, knobs and dials. That's a point of philosophy that I've had views on previously: my normal stance is that we need some knobs to allow the database to be tuned to individual circumstances. In this case, partitioning is way too complex to administer effectively and requires application changes that make it impossible to use for packaged applications. The latest Oracle TPC-H benchmark uses 10 pages of DDL to set it up and if I can find a way to avoid that, I'd recommend it to all. I do still want some knobs and dials, just not 10 pages' worth, though I'd like your and others' guidance on what those should be. Oracle have been responding to feedback with their new interval partitioning, but it's still a multi-table approach in essence. My observation of partitioned databases is that they all work beautifully at the design stage, but problems emerge over time. A time-based range partitioned table can often have different numbers of rows per partition, giving inconsistent response times. 
A height-balanced approach where we make the partitions all the same size, yet vary the data value boundaries, will give much more consistent query times and can be completely automated much more easily. The SVM concept doesn't cover everything that you can do with partitioning, but my feeling is it covers the main use cases well. If that's not true, in broad strokes or in the detail, then we need to uncover that. Everybody's help in doing that is appreciated, whatever the viewpoint and whatever the outcome. It's probably worth examining existing applications to see how well they would migrate to the segmented tables approach. The following query will analyse one column of a table to produce a list of boundary values, given a segment size of 131072 blocks (1 GB). select substr(ctid::text,2,strpos(ctid::text,',')-2)::integer/131072 as seg, min(PrimaryKey), max(PrimaryKey) from bigtable group by seg; We should be able to see whether this works for existing use cases, or not, fairly easily. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
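To make the boundary-value query concrete, here is a rough sketch in Python (not from the original thread; the sample data and names are invented). It performs the same computation as the SQL above: the block number extracted from each tuple's ctid is mapped to a 1 GB segment (131072 blocks of 8 KB), and the minimum and maximum key values are collected per segment.

```python
SEGMENT_BLOCKS = 131072  # 1 GB of 8 KB blocks, as in the query above

def segment_boundaries(rows):
    """rows: iterable of (block_number, key) pairs, i.e. the block part of
    each tuple's ctid plus the column being analysed.
    Returns {segment: (min_key, max_key)} -- per-segment boundary values."""
    bounds = {}
    for block, key in rows:
        seg = block // SEGMENT_BLOCKS
        lo, hi = bounds.get(seg, (key, key))
        bounds[seg] = (min(lo, key), max(hi, key))
    return bounds

# Naturally ordered data: keys grow with block number, so segments get
# tight, non-overlapping boundaries and segment exclusion works well.
rows = [(0, 1), (100, 250), (131072, 300), (200000, 480), (262144, 500)]
print(segment_boundaries(rows))
# {0: (1, 250), 1: (300, 480), 2: (500, 500)}
```

Running this against the output of the SQL query for an existing table would show directly whether the per-segment min/max ranges overlap, i.e. whether segment exclusion could ever skip anything for that column.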
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Andrew Sullivan wrote: On Fri, Jan 04, 2008 at 10:26:54PM +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. I think that part might be hand-wavy still. But once this facility is there, what's to prevent the current active segment (and the rest) from also getting this mark, which would mean the table is read only? Well, sure, marking *all* segments read-only is pretty easy. But that was not quite my point. Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sat, Jan 05, 2008 at 09:33:45AM +, Simon Riggs wrote: [...] The main proposal deliberately has few, if any, knobs and dials. That's a point of philosophy that I've had views on previously: my normal stance is that we need some knobs to allow the database to be tuned to individual circumstances. One thought I had back then, with partitioned tables, was "gee -- a B-tree index is already doing a partition; why do a manual partition on top of that?". It's an exhilarating experience to watch you making a design before I could even spell this nebulous and unclear thought :-) Regards -- tomás
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, [EMAIL PROTECTED] wrote: The main proposal deliberately has few, if any, knobs and dials. That's a point of philosophy that I've had views on previously: my normal stance is that we need some knobs to allow the database to be tuned to individual circumstances. One thought I had back then, with partitioned tables, was "gee -- a B-tree index is already doing a partition; why do a manual partition on top of that?". Well, that's why I'm so focused on manageability: I think the primary purpose of partitioning is manageability (of the underlying storage for a table). Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Simon Riggs wrote: On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day? No way! Ah, I'm glad ;-) Simon Riggs wrote: Skepticism is OK, but I'd like to get your detailed thoughts on this. I've been an advocate of the multi-tables approach now for many years, so I don't expect everybody to switch their beliefs on my say-so overnight. Let me make a few more comments in this area: I've so far always thought about some sort of multi-relations approach for partitioning, yes. Let's see if I can get my mind around single-table partitioning. The main proposal deliberately has few, if any, knobs and dials. That's a point of philosophy that I've had views on previously: my normal stance is that we need some knobs to allow the database to be tuned to individual circumstances. In this case, partitioning is way too complex to administer effectively and requires application changes that make it impossible to use for packaged applications. The latest Oracle TPC-H benchmark uses 10 pages of DDL to set it up and if I can find a way to avoid that, I'd recommend it to all. I do still want some knobs and dials, just not 10 pages' worth, though I'd like your and others' guidance on what those should be. Oracle have been responding to feedback with their new interval partitioning, but it's still a multi-table approach in essence. I can absolutely support your efforts to minimize knobs and configuration DDL. However, my current feeling is that segment-based partitioning complicates things, because the DBA doesn't have tools and commands to handle segments. To satisfy all the different requirements of partitioning with segment-based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. However, what I certainly like is the automated split point definition. 
Instead of having to create tables by hand and linking them via inheritance and constraint exclusion, I have something very similar in mind, like what you proposed for marking read-only segments. Something like: SPLIT TABLE customers AT cust_name < 'n'; or: SPLIT TABLE inventory AT inv_id % 4 = 2; In my imagination, this should automatically create the underlying relations, i.e.: NOTICE: relations inventory__l and inventory__r have been created. That way, the DBA could then handle those like normal relations, querying them or moving them to different table spaces like all other normal relations. In a way, that's not so different from possible extensions on top of Segment Exclusion, except that the DBA additionally gets a relation name to be able to address the set of segments which form a partition. Or put it the other way around: go for Segment Exclusion, but add some sort of a sentinel relation for each set of segments, to make them reachable for the DBA. My observation of partitioned databases is that they all work beautifully at the design stage, but problems emerge over time. A time-based range partitioned table can often have different numbers of rows per partition, giving inconsistent response times. A height-balanced approach where we make the partitions all the same size, yet vary the data value boundaries will give much more consistent query times and can be completely automated much more easily. Uh.. well, consistent query time isn't the first thing I'm expecting from partitioning by time ranges. If I wanted consistent query times I'd rather use hash partitioning or something, no? I'd even state that one *wants* inconsistent response times when using time-based range partitioning, by moving old, seldom used data to slower storage and keeping only a small amount of often used tuples on the faster disks, for example. The SVM concept doesn't cover everything that you can do with partitioning, but my feeling is it covers the main use cases well. 
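The SPLIT TABLE semantics sketched above can be illustrated with a few lines of Python (the SPLIT TABLE syntax is Markus's hypothetical proposal, and the relation names and data here are invented for illustration): the split predicate simply routes each existing row into one of the two automatically created relations.

```python
def split_rows(rows, key, predicate):
    """Route each row to the 'left' or 'right' relation that a hypothetical
    SPLIT TABLE command would create (e.g. inventory__l / inventory__r)."""
    left, right = [], []
    for row in rows:
        (left if predicate(row[key]) else right).append(row)
    return left, right

# SPLIT TABLE customers AT cust_name < 'n';
customers = [{"cust_name": "alice"}, {"cust_name": "zoe"}, {"cust_name": "mike"}]
l, r = split_rows(customers, "cust_name", lambda v: v < "n")
print([c["cust_name"] for c in l])  # ['alice', 'mike']
print([c["cust_name"] for c in r])  # ['zoe']
```

The point of the proposal is that the database would perform this routing itself when the split point is declared, rather than the DBA creating child tables and CHECK constraints by hand.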
As I regard manageability to be the main advantage of partitioning, which you've intentionally left out for now, I disagree here. How could SVM or Segment Exclusion potentially be covering what hash partitioning does? Maybe together with the ability to store different segments of a table on different table spaces. That could be considered an approach to range partitioning. But then, that would be the partitioning, and not SVM or Segment Exclusion. To me, both of SVM and SE look much more like an optimization for certain special cases and don't have much to do with partitioning. Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Saturday 05 January 2008 10:42, Markus Schiltknecht wrote: The main proposal deliberately has few, if any, knobs and dials. That's a point of philosophy that I've had views on previously: my normal stance is that we need some knobs to allow the database to be tuned to individual circumstances. In this case, partitioning is way too complex to administer effectively and requires application changes that make it impossible to use for packaged applications. The latest Oracle TPC-H benchmark uses 10 pages of DDL to set it up and if I can find a way to avoid that, I'd recommend it to all. I do still want some knobs and dials, just not 10 pages' worth, though I'd like your and others' guidance on what those should be. Oracle have been responding to feedback with their new interval partitioning, but it's still a multi-table approach in essence. I can absolutely support your efforts to minimize knobs and configuration DDL. However, my current feeling is that segment-based partitioning complicates things, because the DBA doesn't have tools and commands to handle segments. Personally I can't say it complicates things, because it isn't clear how it will be managed. :-) To satisfy all the different requirements of partitioning with segment-based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. Why? However, what I certainly like is the automated split point definition. Instead of having to create tables by hand and linking them via inheritance and constraint exclusion, I have something very similar in mind, like what you proposed for marking read-only segments. Something like: SPLIT TABLE customers AT cust_name < 'n'; or: SPLIT TABLE inventory AT inv_id % 4 = 2; In my imagination, this should automatically create the underlying relations, i.e.: NOTICE: relations inventory__l and inventory__r have been created. 
That way, the DBA could then handle those like normal relations, querying them or moving them to different table spaces like all other normal relations. In a way, that's not so different from possible extensions on top of Segment Exclusion, except that the DBA additionally gets a relation name to be able to address the set of segments which form a partition. Or put it the other way around: go for Segment Exclusion, but add some sort of a sentinel relation for each set of segments, to make them reachable for the DBA. So the one thing that always scares me about these "define it all and let the database sort it out" methods is that they seem to lead to cases where the system ends up rewriting the data to fit into some new partition layout. One thing that is nice about the current partitioning scheme is you can control the impact of this behavior in these scenarios, by moving around small portions of the table at a time. My observation of partitioned databases is that they all work beautifully at the design stage, but problems emerge over time. A time-based range partitioned table can often have different numbers of rows per partition, giving inconsistent response times. A height-balanced approach where we make the partitions all the same size, yet vary the data value boundaries will give much more consistent query times and can be completely automated much more easily. Uh.. well, consistent query time isn't the first thing I'm expecting from partitioning by time ranges. If I wanted consistent query times I'd rather use hash partitioning or something, no? More to the point (I think) is that people define access to the data based on the meaning of the data, not how it is stored on disk. For example, in some tables we only need to be active on 1 month's worth of data... how that is laid out on disk (# partitions, which tablespaces) is a means to the end of working actively on 1 month's worth of data. 
I can't think of many cases where people would actually say they want to work actively on the most recent GB of data. I'd even state that one *wants* inconsistent response times when using time-based range partitioning, by moving old, seldom used data to slower storage and keeping only a small amount of often used tuples on the faster disks, for example. Yes, we do this quite a bit; and at least one of our partitioned tables (in total) is larger in size than our tablespace with the fastest disks. The SVM concept doesn't cover everything that you can do with partitioning, but my feeling is it covers the main use cases well. As I regard manageability to be the main advantage of partitioning, which you've intentionally left out for now, I disagree here. How could SVM or Segment Exclusion potentially be covering what hash partitioning does? Maybe together with the ability to store different segments of a table on different table spaces. That could be considered an approach to range partitioning. But then, that would be the partitioning,
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Robert Treat wrote: Personally I can't say it complicates things, because it isn't clear how it will be managed. :-) Well, management of relations is easy enough, known to the DBA and most importantly: it already exists. Having to set up something which is *not* tied to a relation complicates things just because it's an additional concept. But as I've pointed out, maybe what we have in mind isn't that different at all. Just have a sentinel relation mean a set of segments, i.e. all read-only segments of a table. Then again, a table - in a way - is not much else than a set of segments. So where's the real difference? To satisfy all the different requirements of partitioning with segment-based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. Why? Uh.. if a table (RELKIND_RELATION) can only span one table space, as it is now, all of its segments are in the same table space. I don't quite call that partitioning. Well, sure, you could call it so, but then, each and every Postgres table is already partitioned in 1G segments. It all depends on the definitions, but in my world, horizontal partitioning for databases involves multiple table spaces (and is quite useless without that). Calling anything else partitioning is confusing, IMO. So the one thing that always scares me about these "define it all and let the database sort it out" methods is that they seem to lead to cases where the system ends up rewriting the data to fit into some new partition layout. That holds true no matter if you shuffle between segments or relations. To be able to let the DBA define an exact split point, the database *will* have to shuffle tuples around. Why does that scare you? It's a regular database system's maintenance procedure. One thing that is nice about the current partitioning scheme is you can control the impact of this behavior in these scenarios, by moving around small portions of the table at a time. Uh.. 
I'm not quite following. What current partitioning scheme are you referring to? Why should that not be possible with other schemes? Moving the split point between two partitions involves moving tuples around, no matter if you are going to move them between segments or between relations building the partitions. More to the point (I think) is that people define access to the data based on the meaning of the data, not how it is stored on disk. For example, in some tables we only need to be active on 1 month's worth of data... how that is laid out on disk (# partitions, which tablespaces) is a means to the end of working actively on 1 month's worth of data. I can't think of many cases where people would actually say they want to work actively on the most recent GB of data. Agreed. I'd say that's why the DBA needs to be able to define the split point between partitions: only he knows the meaning of the data. To me, both of SVM and SE look much more like an optimization for certain special cases and don't have much to do with partitioning. Even if this were true, it might still be a useful optimization. Possibly, yes. To me, the use case seems pretty narrow, though. For example it doesn't affect index scans much. One table I am thinking of in particular in my system has one query we need to run across partitions, which ends up doing a slew of bitmap index scans for all the partitions. If using segment exclusion on it meant that I could get a global index to help that query, I'd be happy. As proposed, Segment Exclusion works only on exactly one table. Thus, if you already have your data partitioned into multiple relations, it most probably won't affect your setup much. It certainly has nothing to do with what I understand by 'global index' (that's an index spanning multiple tables, right?). Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Jan 5, 2008 6:15 PM, [EMAIL PROTECTED] wrote: One thought I had back then, with partitioned tables was gee -- B-tree index is already doing a partition; why do a manual partition on top of that?. Can you please explain more on what you are trying to say here? Thanks, Gokul.
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote: To satisfy all the different requirements of partitioning with segment-based partitioning, we'd have to allow a table to span multiple table spaces. I'm not very keen on going that way. Why? Uh.. if a table (RELKIND_RELATION) can only span one table space, as it is now, all of its segments are in the same table space. I don't quite call that partitioning. Well, sure, you could call it so, but then, each and every Postgres table is already partitioned in 1G segments. It all depends on the definitions, but in my world, horizontal partitioning for databases involves multiple table spaces (and is quite useless without that). Calling anything else partitioning is confusing, IMO. I'm not following this. If we can work out a scheme, I see no reason not to allow a single table to span multiple tablespaces. Do you see a problem with that? So the one thing that always scares me about these "define it all and let the database sort it out" methods is that they seem to lead to cases where the system ends up rewriting the data to fit into some new partition layout. That holds true no matter if you shuffle between segments or relations. To be able to let the DBA define an exact split point, the database *will* have to shuffle tuples around. Why does that scare you? It's a regular database system's maintenance procedure. It's a question of control; see below. One thing that is nice about the current partitioning scheme is you can control the impact of this behavior in these scenarios, by moving around small portions of the table at a time. Uh.. I'm not quite following. What current partitioning scheme are you referring to? The current, constraint exclusion / inheritance based, partitioning we have now in PostgreSQL. Why should that not be possible with other schemes? 
Moving the split point between two partitions involves moving tuples around, no matter if you are going to move them between segments or between relations building the partitions. The difference is that, if I currently have a table split by month, I can re-partition it into weekly segments, and only shuffle one month's data at a time, minimizing impact on the system while I shuffle it. This can even be used to do dynamic management, where data from the current month is archived by day, data from the past year by week, and data beyond that done monthly. On many other databases, if you change the partition scheme, it requires exclusive locks and a shuffling of all of the data, even data whose partitions aren't being redefined. Even worse are systems like MySQL, where you need to rewrite the indexes as well. To me, these requirements always seem like show stoppers; I generally can't afford to lock a table while the database rewrites a billion rows of data. More to the point (I think) is that people define access to the data based on the meaning of the data, not how it is stored on disk. For example, in some tables we only need to be active on 1 month's worth of data... how that is laid out on disk (# partitions, which tablespaces) is a means to the end of working actively on 1 month's worth of data. I can't think of many cases where people would actually say they want to work actively on the most recent GB of data. Agreed. I'd say that's why the DBA needs to be able to define the split point between partitions: only he knows the meaning of the data. To me, both of SVM and SE look much more like an optimization for certain special cases and don't have much to do with partitioning. Even if this were true, it might still be a useful optimization. Possibly, yes. To me, the use case seems pretty narrow, though. For example it doesn't affect index scans much. 
One table I am thinking of in particular in my system has one query we need to run across partitions, which ends up doing a slew of bitmap index scans for all the partitions. If using segment exclusion on it meant that I could get a global index to help that query, I'd be happy. As proposed, Segment Exclusion works only on exactly one table. Thus, if you already have your data partitioned into multiple relations, it most probably won't affect your setup much. It certainly has nothing to do with what I understand by 'global index' (that's an index spanning multiple tables, right?). In a more general sense, a global index is an index that spans multiple partitions, as opposed to a local index, which is an index on specific partitions; PostgreSQL currently supports the latter, not the former. In any case, my thinking is that if we had the segment exclusion technique, I could convert that partitioned table into a regular table again, use segment exclusion to handle what is currently handled by partitions, and create a global index across all the other data for that other, currently killer, query. -- Robert Treat Build A Brighter
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Sun, Jan 06, 2008 at 01:12:32AM +0530, Gokulakannan Somasundaram wrote: On Jan 5, 2008 6:15 PM, [EMAIL PROTECTED] wrote: One thought I had back then, with partitioned tables, was "gee -- a B-tree index is already doing a partition; why do a manual partition on top of that?". Can you please explain more on what you are trying to say here? Sure. A B-tree is just a device to partition something along some order. If you have, say, a table of orders (to use the example upthread) and a B-tree index on order date, this index partitions your set (at recursively finer levels). Of course, you'd have to sort your data along this index, but PostgreSQL knows how to do this trick: CLUSTER. This was just a vague idea, many things were missing (for example, to separate out the more quiescent parts of the table into their own files) which are spelled out in Simon Riggs' proposal. This struck me when seeing people partition tables by hand -- and I was delighted to actually watch Simon forging a real design. Regards -- tomás
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Simon Riggs wrote: We would keep a dynamic visibility map at *segment* level, showing which segments have all rows as 100% visible. No freespace map data would be held at this level. Small dumb-user question. I take it you've considered some more flexible consecutive-run-of-blocks unit of flagging rather than file-segments. That obviously complicates the tracking but means you can cope with infrequent updates as well as mark most of the most recent segment for log-style tables. -- Richard Huxton Archonet Ltd
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-04 at 10:22 +, Richard Huxton wrote: Simon Riggs wrote: We would keep a dynamic visibility map at *segment* level, showing which segments have all rows as 100% visible. No freespace map data would be held at this level. Small dumb-user question. I take it you've considered some more flexible consecutive-run-of-blocks unit of flagging rather than file-segments. That obviously complicates the tracking but means you can cope with infrequent updates as well as mark most of the most recent segment for log-style tables. I'm writing the code to abstract that away, so yes. Now you mention it, it does seem straightforward to have a table storage parameter for partition size, which defaults to 1GB. The partition size is simply a number of consecutive blocks, as you say. The smaller the partition size the greater the overhead of managing it. Also I've been looking at read-only tables and compression, as you may know. My idea was that in the future we could mark segments as either - read-only - compressed - able to be shipped off to hierarchical storage Those ideas work best if the partitioning is based around the physical file sizes we use for segments. Changing the partition size would simply reset the visibility map for that table, in its easiest implementation. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
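A toy model may help make the segment-level visibility map concrete (this is not Simon's implementation; the class, names, and data below are invented for illustration, assuming the 1 GB default segment size discussed above). A segment marked 100% visible can carry boundary values and be excluded from a scan; any segment that has seen writes must always be scanned.

```python
SEG_BLOCKS = 131072  # default "partition size": 1 GB of 8 KB blocks

class SegmentVisibilityMap:
    """Toy per-segment map: segments whose rows are all visible carry
    (min_key, max_key) boundary values usable for segment exclusion."""

    def __init__(self):
        self.all_visible = {}  # seg -> (min_key, max_key)

    def mark_all_visible(self, seg, lo, hi):
        self.all_visible[seg] = (lo, hi)

    def mark_dirty(self, seg):
        # Any write to the segment clears the flag (and its boundaries).
        self.all_visible.pop(seg, None)

    def segments_to_scan(self, segs, lo, hi):
        """Keep segments that might hold rows in [lo, hi]: an all-visible
        segment is excluded when its boundaries cannot overlap the range;
        a dirty segment has no boundaries and is always scanned."""
        keep = []
        for seg in segs:
            bounds = self.all_visible.get(seg)
            if bounds is None or not (bounds[1] < lo or bounds[0] > hi):
                keep.append(seg)
        return keep

svm = SegmentVisibilityMap()
svm.mark_all_visible(0, 1, 250)
svm.mark_all_visible(1, 300, 480)
# segment 2 is the active tail of the table: no flag, always scanned
print(svm.segments_to_scan([0, 1, 2], 400, 600))  # [1, 2]
```

Note how resetting the map (as the text suggests for a partition-size change) just means clearing every entry: correctness is preserved because unmapped segments are always scanned, only the exclusion optimization is lost until the segments are re-observed as all-visible.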
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-04 at 13:06 +0530, Gokulakannan Somasundaram wrote: a) This proposal would work for the kind of table organizations which are currently partitioned and maintained based on some kind of timestamp. Consider one of the use-cases. A large Retail firm has a lot of stores. The DBA maintains and updates the inventories of those stores in hash-partitions based on store-no. As the inventory gets updated, the corresponding partition receives the update and it goes like that.. Here all the partitions are going to get constantly updated. So no partition can possibly become a read-only partition. You have clearly called it out in your para. I am just re-confirming on that. Or do you have something for this in your soln.? To my limited experience, most partition strategies are based on some form of time-stamp. If the proposed soln. can cater to those, it has lots of use-cases. I don't think it would apply to an Inventory table. That is a current state table. It is designed for any large tables that would grow naturally over time if we left them to do so. Solutions that it would work for: - any Fact table where measurements/observations/events are accumulated e.g. Web Hits (any Internet events) Call Detail Records Sales Security Events Scientific Measurements Process Control - any Major Entity where new entities are created from a sequence e.g. Orders, OrderItems Invoices Shipments, Returns most SCM/DCM events It's not aimed at any particular benchmark, just real usage scenarios. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Simon Riggs wrote: On Fri, 2008-01-04 at 10:22 +, Richard Huxton wrote: Simon Riggs wrote: We would keep a dynamic visibility map at *segment* level, showing which segments have all rows as 100% visible. No freespace map data would be held at this level. Small dumb-user question. I take it you've considered some more flexible consecutive-run-of-blocks unit of flagging rather than file-segments. That obviously complicates the tracking but means you can cope with infrequent updates as well as mark most of the most recent segment for log-style tables. I'm writing the code to abstract that away, so yes. Now you mention it, it does seem straightforward to have a table storage parameter for partition size, which defaults to 1GB. The partition size is simply a number of consecutive blocks, as you say. The smaller the partition size the greater the overhead of managing it. Oh, obviously, but with smaller partition sizes this also becomes useful for low-end systems as well as high-end ones. Skipping 80% of a seq-scan on a date-range query is a win for even small (by your standards) tables. I shouldn't be surprised if the sensible-number-of-partitions remained more-or-less constant as you scaled the hardware, but the partition size grew. Also I've been looking at read-only tables and compression, as you may know. My idea was that in the future we could mark segments as either - read-only - compressed - able to be shipped off to hierarchical storage Those ideas work best if the partitioning is based around the physical file sizes we use for segments. I can see why you've chosen file segments. It certainly makes things easier. Hmm - thinking about the date-range scenario above, it occurs to me that for seq-scan purposes the correct partition size depends upon the data value you are interested in. What I want to know is what blocks Jan 07 covers (or rather what blocks it doesn't) rather than knowing blocks 1-999 cover 2005-04-12 to 2007-10-13. 
Of course that means that you'd eventually want different partition sizes tracking visibility for different columns (e.g. id, timestamp). I suspect the same would be true for read-only/compressed/archived flags, but I can see how they are tightly linked to physical files (particularly the last two). -- Richard Huxton Archonet Ltd ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
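Richard's configurable partition size amounts to grouping consecutive blocks into fixed-width runs, with the 1GB file segment as the default stride. A minimal sketch of the block-to-partition mapping (Python for illustration only; the 8KB block size is the usual Postgres default, and the function name is hypothetical):

```python
BLOCK_SIZE = 8192                      # 8KB heap pages (Postgres default)
DEFAULT_PARTITION_BYTES = 1 << 30      # 1GB, matching the file segment size

def partition_of_block(block_no, partition_bytes=DEFAULT_PARTITION_BYTES):
    """Map a heap block number to its partition (run-of-blocks) index."""
    blocks_per_partition = partition_bytes // BLOCK_SIZE
    return block_no // blocks_per_partition

# With the default 1GB stride, blocks 0..131071 fall in partition 0:
assert partition_of_block(0) == 0
assert partition_of_block(131071) == 0
assert partition_of_block(131072) == 1
# A low-end system might choose a 64MB stride instead:
assert partition_of_block(131072, partition_bytes=64 << 20) == 16
```

The smaller the stride, the more min/max entries to track and check, which is the management overhead Simon mentions.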
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-04 at 13:29 +0100, Markus Schiltknecht wrote: Given that we are operating on segments here, to which the DBA has very limited information and access, I prefer the term Segment Exclusion. I think of that as an optimization of sequential scans on tables with the above mentioned characteristics. If we do need to differentiate between the two proposals, we can refer to this one as the Segment Visibility Map (SVM). I'm clearly in favor of distinguishing between the two proposals. SVM is a good name, IMHO. OK, I'll refer to this proposal as SVM. There would be additional complexity in selectivity estimation and plan costing. The above proposal allows dynamic segment exclusion, which cannot be assessed at planning time anyway, so suggestions welcome... Hm.. that looks like a rather bad downside of an executor-only optimization. I think that's generally true. We already have that problem with planned statements and work_mem, for example, and parametrised query planning is a difficult problem. Stable functions are already estimated at plan time, so we hopefully should be getting that right. I don't see any show stoppers here, just more of the usual problems of query optimization. Comparison with other Partitioning approaches - Once I realised this was possible in a fairly automatic way, I've tried hard to keep away from manual overrides, commands and new DDL. Declarative partitioning is a big overhead, though worth it for large databases. No overhead is *much* better though. 
This approach to partitioning solves the following challenges:
- allows automated table extension, so works automatically with Slony
- responds dynamically to changing data
- allows stable functions, nested loop joins and parametrised queries
- allows RI via SHARE locks
- avoids the need for operator push-down through Append nodes
- allows unique indexes
- allows global indexes (because we only have one table)
- allows advanced planning/execution using read-only/visible data
- works naturally with synchronous scans and buffer recycling
All of the above are going to take considerably longer to do in any of the other ways I've come up with so far... I fully agree. But as I tried to point out above, the gains in manageability from Segment Exclusion are also pretty close to zero. So I'd argue they only fulfill parts of the needs for general horizontal partitioning. Agreed. My focus for this proposal wasn't manageability, as it had been in other recent proposals. I think there are some manageability wins to be had as well, but we need to decide what sort of partitioning we want/need first. So in the case of SVM, enhanced manageability is really a phase 2 thing. Plus, you can always combine a design with constraint and segment exclusion. Maybe a combination with CLUSTERing would be worthwhile? Or even enforced CLUSTERing for the older segments? I think there's merit in Heikki's maintain cluster order patch and that should do an even better job of maintaining locality. Thanks for the detailed comments. I'll do my best to include all of the viewpoints you've expressed as the design progresses. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hello Simon, Simon Riggs wrote: I've come up with an alternative concept to allow us to discuss the particular merits of each. ISTM that this new proposal has considerable potential. Hm.. interesting idea. If we were able to keep track of which sections of a table are now read-only then we would be able to record information about the data in that section and use it to help solve queries. This is turning the current thinking on its head: we could derive the constraints from the data, rather than placing the data according to the constraints. That would be much more natural: load data into the table and have the system work out the rest. Yeah, but that's also the most limiting factor of your approach: it covers only horizontal partitioning by time (or to be more precise, by columns which are very likely to increase or decrease with time). All other columns will very likely contain values from the full range of possible values. As you have pointed out, that might be a very frequent use case. I can't argue about that; however, I think it's important to be well aware of that limitation. Other scan types might also use segment exclusion, though this would only be useful for scans retrieving many rows, otherwise the overhead of segment exclusion might not be worthwhile. Uh.. the overhead of checking against min/max values doesn't seem that big to me. I rather think the gain for index scans would be vanishingly small, because (given frequent enough vacuuming) an index scan shouldn't return many pointers to tuples in segments which could be optimized away by segment exclusion. If we collect data for all columns then many of our implicit constraints would be useless. e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive, so we must weed out poor constraints. We would do this by removing any constraint that overlapped more than 10% of other segments. 
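The 10% weeding rule quoted at the end can be modelled directly: for each column, count how many pairs of segments have overlapping min/max ranges, and keep the column's implicit constraints only when overlap is rare. This is a sketch in Python; the exact overlap measure and the pairwise counting are my assumptions for illustration, not something the proposal specifies:

```python
def ranges_overlap(a, b):
    """True when the closed intervals a and b share at least one value."""
    return a[1] >= b[0] and b[1] >= a[0]

def useful_constraint_columns(per_column_ranges, max_overlap_frac=0.10):
    """Keep only columns whose per-segment ranges rarely overlap each other.

    per_column_ranges: {column_name: [(min, max), ...]}, one range per segment.
    """
    keep = []
    for col, ranges in per_column_ranges.items():
        n = len(ranges)
        overlapping = sum(
            1 for i in range(n) for j in range(i + 1, n)
            if ranges_overlap(ranges[i], ranges[j]))
        total_pairs = n * (n - 1) // 2
        if total_pairs == 0 or overlapping / total_pairs <= max_overlap_frac:
            keep.append(col)
    return keep

stats = {
    "log_date": [(1, 10), (11, 20), (21, 30), (31, 40)],  # time-correlated
    "status":   [(1, 3), (1, 3), (1, 3), (1, 3)],         # few values, everywhere
}
assert useful_constraint_columns(stats) == ["log_date"]
```

A column like `status`, whose few values appear in every segment, is dropped, which answers the "few values present in all segments" case above.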
Various heuristics would likely need to be discovered during development to make this work without resorting to manual commands. Uh, well, that's about the limitation I've pointed out above. But is it really worth maintaining statistics about overlapping values and removing min/max checks for certain columns? It would save you the min/max check per segment and scan, but cost maintaining the statistics and checking against the statistics once per scan. AFAICS the block with the min/max tuple per segment will often remain cached anyway... dunno. Noting which segments are read-only --- Everything so far has relied upon our ability to note which segments of a table are read-only. We could do this in two ways: 1) have the system automatically keep track of non-changing data 2) have the DBA issue a command to mark a segment as read-only now Probably a combination of the two is better, so we have three states for segments:
- READ_WRITE_ALLOWED
- EFFECTIVE_READ_ONLY (set by 1 only)
- MARKED_READ_ONLY (set by 2 only)
Having said that I want to concentrate on (1), though consider the other one also in case requested by reviewers. Hm.. AFAICT, horizontal partitioning often serves manageability, which is quite limited having all data in one table (you can't move a single segment to a different tablespace). Thus I think option 2 is pretty constrained in usability. What would the DBA gain by setting a segment to read only? How does the DBA figure out which segment a tuple is stored in (so she can decide to mark it read only)? The user may also wish to clear down very old data, so allowing DELETEs can ensure the user can still remove old data from the table. By carefully choosing the values to be deleted, a whole segment can be deleted and then returned to the FSM. Oh, yeah, that sounds like a good optimization. Bulk deletes, yummie! This proposal offers many of the advantages of the earlier Visibility Map proposal, yet without major changes to heap structure. 
Since the segment-level visibility map is less granular it would only be 80% as effective as the more detailed block-level map. Having said that, the bookkeeping overheads will also be considerably reduced, so it does seem a good joint approach. It also allows freezing to be handled fully, which was a problem with the original visibility map proposal. WAL logging visibility map changes is also now much easier. I generally agree, although I'm somewhat dubious about the 80% factor. My thoughts have been targeted directly at partitioning, yet I have to admit that this idea overlaps with, and in my world view replaces, the Visibility Map proposal. I very much like the name Visibility Map though. I would even say that partitioning is somewhat of a misleading term here, because it normally allows the DBA to decide on where to split. Given that we are operating on segments here, to which the DBA has very limited information and access, I prefer the term Segment Exclusion.
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Simon Riggs wrote: - any Fact table where measurements/observations/events are accumulated e.g. Web Hits (any Internet events) Call Detail Records Sales Security Events Scientific Measurements Process Control - any Major Entity where new entities are created from a sequence e.g. Orders, OrderItems Invoices Shipments, Returns most SCM/DCM events ...and only changed very seldom after a while, I would add. Because changing an old tuple would invalidate the optimization for the affected segment. That's why this optimization can't help for inventory tables, where an id might correlate with time and storage location, but writing access doesn't correlate with storage location (segment number) and time. Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Simon Riggs wrote: The smaller the partition size the greater the overhead of managing it. Also I've been looking at read-only tables and compression, as you may know. My idea was that in the future we could mark segments as either
- read-only
- compressed
- able to be shipped off to hierarchical storage
Those ideas work best if the partitioning is based around the physical file sizes we use for segments. As much as I'd like this simplification, I'm still thinking of these segments as an implementation detail of Postgres, and not something the user should have to deal with. Allowing the DBA to move segments to a different tablespace and giving him the possibility to check which tuples are in which segment seems awkward from a user's perspective, IMO. Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote: Agreed. Just a minor note: I find marked read-only too strong, as it implies an impossibility to write. I propose speaking about mostly-read segments, or optimized for reading or similar. I do want some segments to be _marked_ read-only: I want attempted writes to them to _fail_. A
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-04 at 13:06 -0500, Andrew Sullivan wrote: On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote: Agreed. Just a minor note: I find marked read-only too strong, as it implies an impossibility to write. I propose speaking about mostly-read segments, or optimized for reading or similar. I do want some segments to be _marked_ read-only: I want attempted writes to them to _fail_. I think Markus thought that we would mark them read only automatically, which was not my intention. I believe it's possible to have this in a way that will make you both happy. Some more explanation: there would be three different states for a segment:
1. read write
2. optimized for reading, as Markus says it
3. marked read only by explicit command
Transition 1 -> 2 is by autovacuum under the SVM proposal; transition 2 -> 3 is by user command only. So throwing an ERROR is acceptable for segments in state 3. I came up with a complex scheme for going from 1 -> 3 previously, but I don't think it's needed any longer (for this, at least). It's trivial to go from 2 -> 3 using an ALTER TABLE statement, along the lines of ALTER TABLE ... WHERE ... Files that are completely in state 3 can then be archived by a hierarchical storage manager without problem. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
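The lifecycle above is a small three-state machine. The sketch below (Python, all names hypothetical) encodes the stated rules: autovacuum alone performs 1 -> 2, an explicit command alone performs 2 -> 3, and writes to state 3 raise an error. Note that demoting a state-2 segment back to read-write on a write is my assumption about how "effective" read-only would behave, not something stated in the thread:

```python
READ_WRITE_ALLOWED, EFFECTIVE_READ_ONLY, MARKED_READ_ONLY = 1, 2, 3

class Segment:
    def __init__(self):
        self.state = READ_WRITE_ALLOWED

    def autovacuum_observes_no_changes(self):
        # Transition 1 -> 2: set automatically, never by user command.
        if self.state == READ_WRITE_ALLOWED:
            self.state = EFFECTIVE_READ_ONLY

    def alter_table_set_read_only(self):
        # Transition 2 -> 3: explicit ALTER TABLE command only.
        if self.state == EFFECTIVE_READ_ONLY:
            self.state = MARKED_READ_ONLY

    def write(self):
        if self.state == MARKED_READ_ONLY:
            raise PermissionError("segment is marked read only")
        # Assumption: a write to an "effectively" read-only segment simply
        # demotes it, invalidating its implicit constraints.
        self.state = READ_WRITE_ALLOWED

seg = Segment()
seg.autovacuum_observes_no_changes()
assert seg.state == EFFECTIVE_READ_ONLY
seg.write()                      # writes still allowed in state 2
assert seg.state == READ_WRITE_ALLOWED
```

This keeps Andrew's requirement (state 3 rejects writes) without making Markus's automatically-detected segments unwritable.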
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, Jan 04, 2008 at 10:26:54PM +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. I think that part might be hand-wavy still. But once this facility is there, what's to prevent the current active segment (and the rest) from also getting this mark, which would mean the table is read only? A
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi, Simon Riggs wrote: On Fri, 2008-01-04 at 13:06 -0500, Andrew Sullivan wrote: On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote: Agreed. Just a minor note: I find marked read-only too strong, as it implies an impossibility to write. I propose speaking about mostly-read segments, or optimized for reading or similar. Hm.. yeah, after rereading, I realize that I've mixed up states no. 2 and 3 here, sorry. I do want some segments to be _marked_ read-only: I want attempted writes to them to _fail_. Well, I can see use cases for marking tuples or complete relations as read only. But segments? I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day? If not, a more user friendly command like MARK READ ONLY WHERE date = 2006 would have to move tuples around between segments, so as to be able to satisfy the split point exactly, right? I think Markus thought that we would mark them read only automatically, which was not my intention. I believe it's possible to have this in a way that will make you both happy. Some more explanation: There would be three different states for a segment: 1. read write 2. optimized for reading, as Markus says it 3. marked read only by explicit command Thanks for the clarification. Regards Markus
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day? No Way! That would stop Richard's idea to make the segment stride configurable, apart from being a generally ugly thing. If not, a more user friendly command like MARK READ ONLY WHERE date = 2006 would have to move tuples around between segments, so as to be able to satisfy the split point exactly, right? Yes, just a simple WHERE clause that we can translate into segments under the covers. It would be an ALTER TABLE, so we get an exclusive lock: ALTER TABLE foo SET READ ONLY WHERE ..., possibly with a few restrictions on the WHERE clause. Anyway this is just futures and dreams so far, so let's just say something like that is possible in the future and work out more when we pass the first few hurdles. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
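Translating the WHERE clause "into segments under the covers" presumably means marking only those segments whose entire contents satisfy the predicate, leaving boundary segments read-write. A hypothetical sketch of that selection (Python for illustration; the function name and the monotone-predicate restriction are assumptions, perhaps one of the "few restrictions on the WHERE clause" mentioned above):

```python
def segments_fully_covered(segment_ranges, pred):
    """Indexes of segments whose entire min/max range satisfies pred.

    Only whole segments can be marked read only, so a segment qualifies
    only if both its minimum and maximum column values pass the predicate.
    This is valid for predicates monotone over the column, e.g. date < X.
    """
    return [i for i, (lo, hi) in enumerate(segment_ranges)
            if pred(lo) and pred(hi)]

# ALTER TABLE foo SET READ ONLY WHERE log_year < 2007, roughly:
ranges = [(2004, 2005), (2005, 2006), (2006, 2007), (2007, 2008)]
assert segments_fully_covered(ranges, lambda y: y < 2007) == [0, 1]
```

The third segment straddles the split point, so under this scheme it would stay read-write rather than forcing tuples to be moved, which answers Markus's question about satisfying the split point exactly.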
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Friday 04 January 2008 17:01, Simon Riggs wrote: On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote: I'm still puzzled about how a DBA is expected to figure out which segments to mark. Simon, are you assuming we are going to pass on segment numbers to the DBA one day? No Way! That would stop Richard's idea to make the segment stride configurable, apart from being a generally ugly thing. If not, a more user friendly command like MARK READ ONLY WHERE date = 2006 would have to move tuples around between segments, so as to be able to satisfy the split point exactly, right? Yes, just a simple WHERE clause that we can translate into segments under the covers. It would be an ALTER TABLE, so we get an exclusive lock: ALTER TABLE foo SET READ ONLY WHERE ..., possibly with a few restrictions on the WHERE clause. Anyway this is just futures and dreams so far, so let's just say something like that is possible in the future and work out more when we pass the first few hurdles. Not to be negative, but ISTM how this feature would be managed is as important as the bits under the hood. Or at least we have to believe there will be some practical way to manage this, of which as yet I am skeptical. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, 2008-01-03 at 00:41 +, Sam Mason wrote: On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: Like it? Sounds good. I've only given it a quick scan though. Would read-only segments retain the same disk-level format as is currently? Yes, no changes at all to the table. So your app would just work, without any DDL changes. Existing partitioning apps would not change. It seems possible to remove the MVCC fields and hence get more tuples per page--- whether this would actually be a net performance gain/loss seems like a difficult question to answer; it would definitely be a complexity increase though. I've been looking at general compression at table and segment level, but that's further down the track. Removing the MVCC fields is too much work, I think. Reading this reminds me of the design of the store for a persistent operating system called EROS. It has a very good paper[1] describing the design (implementation and careful benchmarking thereof) that I think could be a useful read. Thanks, will do. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi Simon, Looks like a novel idea. I just want to confirm my understanding of the proposal. a) This proposal would work for the kind of table organizations which are currently partitioned and maintained based on some kind of timestamp. Consider one use-case: a large retail firm has a lot of stores. The DBA maintains and updates the inventories of those stores in hash partitions based on store number. As the inventory gets updated, the corresponding partition receives the update, and so on. Here all the partitions are going to get constantly updated, so no partition can possibly become a read-only partition. You have clearly called this out in your mail; I am just re-confirming that. Or do you have something for this case in your solution? In my limited experience, most partition strategies are based on some form of timestamp. If the proposed solution can cater to those, it has a lot of use-cases. Thanks, Gokul.
[HACKERS] Dynamic Partitioning using Segment Visibility Maps
Happy New Year, everybody. This proposal follows on from previous thinking about partitioning, where I've taken up Andrew Sullivan's suggestion to re-examine the current partitioning concept of using tables as partitions. So I've come up with an alternative concept to allow us to discuss the particular merits of each. ISTM that this new proposal has considerable potential. Recap: Very Large Table Use Case Many large tables have a regular access pattern:
- new rows inserted frequently
- some updates and deletes on the recent data
- older data becomes effectively read-only
- at some point data may be removed in bulk from the table
Most tables vary on what "recent" means; some are insert only, many not. A typical example might be a large history table where the last 6 months' data is updated, but after that the data has almost zero change. Partitioning relies on knowledge of the physical location of data to exclude sections of data from a scan. Currently the partitioning knowledge is provided by user declarations in the form of constraints on tables linked together by views or inheritance. We also currently do this in constraint-first fashion, i.e. we put the data where the constraints say we should, rather than deriving the constraints from the data. In-table Partitioning - In the common use-case there is a clear affinity between certain data columns and the physical placement of data in a table. This is true for columns such as LOG_DATE or TRANSACTION_DATE, but is also true for any id column whose values are assigned by a Sequence. HOT helps this to remain true over time. Given the above use case, when there is an affinity between data values and data placement *and* we know that older data tends to become read only, we can then realise that certain physical sections of a table will become effectively read only. (My experience is that the larger the table, the less random the write pattern - whatever the guys that wrote TPC-C say). 
If we were able to keep track of which sections of a table are now read-only then we would be able to record information about the data in that section and use it to help solve queries. This is turning the current thinking on its head: we could derive the constraints from the data, rather than placing the data according to the constraints. That would be much more natural: load data into the table and have the system work out the rest. We can imagine that we do this within a single table. Imagine: split a table up into 100 sections, then record the min/max values of all tuples in those sections. When we perform a SeqScan we could skip over sections that could be proved would never satisfy the query predicate. Same thing as partitioning, but within the table. Currently tables are already broken up into 1GB file segments, so it seems fairly natural to pick that as our section size. That fits reasonably well with recommendations for ideal partition size. Segment Exclusion - After we note that a segment is read-only we can scan the segment and record min/max values for all columns. These are then implicit constraints, which can then be used for segment exclusion in a similar manner as we do with the current constraint exclusion mechanism. The implicit constraints can then allow SeqScans to assess segment exclusion for each segment at execution time, i.e. a scan can skip a whole segment if constraints don't match. If a segment does not (yet) have implicit constraints it will always be scanned in full. This allows partitioning, synch scans and buffer recycling to work much better together than they currently do. Other scan types might also use segment exclusion, though this would only be useful for scans retrieving many rows, otherwise the overhead of segment exclusion might not be worthwhile. This would also allow a Merge Join to utilise exclusion for the inner plan at execution time. 
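The section-skipping idea just described can be sketched compactly. This is an illustrative model (Python rather than backend C); the per-segment min/max pairs stand in for the implicit constraints the proposal would record once a segment becomes read-only, and segments without constraints must always be scanned, as the text notes:

```python
# Illustrative model of segment exclusion: each read-only segment carries
# min/max values for a column; a scan skips segments whose range cannot
# contain any matching tuple.

def segments_to_scan(segments, lo, hi):
    """Return indexes of segments that may hold tuples with lo <= col <= hi.

    Each segment is a (min, max) pair, or None when no implicit constraint
    has been recorded yet (segment still read-write).
    """
    result = []
    for i, stats in enumerate(segments):
        if stats is None:                    # no constraints yet: must scan
            result.append(i)
            continue
        seg_min, seg_max = stats
        if seg_max >= lo and seg_min <= hi:  # ranges overlap: must scan
            result.append(i)
    return result

# A table whose LOG_DATE correlates with physical placement; the last
# segment is still being written to, so it has no constraints yet:
segs = [(1, 100), (90, 200), (201, 300), None]
assert segments_to_scan(segs, 250, 260) == [2, 3]   # segments 0 and 1 skipped
```

The per-segment check is two comparisons, which is why the overhead is negligible next to scanning 1GB of heap.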
Instead of scanning the whole inner table plan from the start, we would be able to start the scan from the appropriate segment. This would require us to pass the current value of the outer plan down into the inner plan. The typical executor nodes on the inner plan would be a SortNode and below that a SeqScan. In that case the executor would need to pass the outer value from the MergeJoinNode down through the SortNode to the SeqScan node. The SeqScan node could then perform partition exclusion, reducing the time for that scan and also reducing the time for the resulting sort. This sounds like it might be worth doing in normal cases also. It might turn out that the potentially applicable cases are already excluded during planning; I haven't thought about that aspect in enough detail yet. If we collect data for all columns then many of our implicit constraints would be useless. e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive, so we must weed out poor constraints. We would do this by removing any constraint that overlapped more than 10% of other segments.
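The Merge Join optimisation amounts to locating the first inner segment whose maximum join-key value reaches the current outer value; everything before it can hold no match. A sketch under the assumption that the join column correlates with physical order (so per-segment max values are ascending), with the executor plumbing through the SortNode elided:

```python
import bisect

def first_inner_segment(segment_maxes, outer_value):
    """Index of the first segment whose max value is >= the outer join key.

    segment_maxes must be sorted ascending, which holds when the join
    column correlates with physical placement, as in the use case above.
    All earlier segments can contain no matching inner tuples.
    """
    return bisect.bisect_left(segment_maxes, outer_value)

maxes = [100, 200, 300, 400]
assert first_inner_segment(maxes, 250) == 2   # segments 0 and 1 skipped
assert first_inner_segment(maxes, 50) == 0    # must start at the beginning
```

Skipping those segments shrinks both the inner scan and the sort feeding the join, which is the double saving the text describes.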
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: Like it? Sounds good. I've only given it a quick scan though. Would read-only segments retain the same disk-level format as is currently? It seems possible to remove the MVCC fields and hence get more tuples per page--- whether this would actually be a net performance gain/loss seems like a difficult question to answer; it would definitely be a complexity increase though. Reading this reminds me of the design of the store for a persistent operating system called EROS. It has a very good paper[1] describing the design (implementation and careful benchmarking thereof) that I think could be a useful read. A lot of your design sounds like the EROS store, with the Checkpoint Area being, in current and all previous versions of Postgres, the only place data is stored. Data in EROS also has a Home Location, which is where the data ends up after a checkpoint, and sounds somewhat like the proposed read-only state. Checkpoints here serve a similar purpose to checkpoints in PG, so the following analogy may get a bit confusing. When you're reading the paper I'd try and imagine the checkpoints not occurring as one event, but spread across time as the database recognizes that data is now (or has been marked as) read-only. The home locations would then store only the read-only copies of the data and all the churn would (if the recognition of read-only data works) be done in the checkpoint area. Sam [1] http://www.eros-os.org/papers/storedesign2002.pdf