On Wed, Jan 09, 2008 at 11:47:31AM -0500, Chris Browne wrote:
> [EMAIL PROTECTED] (Simon Riggs) writes:
> > I think we have an opportunity to bypass the legacy-of-thought that
> > Oracle has left us and implement something more usable.
>
> This seems like a *very* good thing to me, from a couple of
> perspectives.
> [snip]
> 2. Blindly following what Oracle does has always been a dangerous
>    sort of thing to do.
>
>    There are two typical risks:
>
>    a) There's always the worry that they may have patented some
>       part of how they implement things, and if you follow too
>       closely, There Be Dragons...

I think that could be equally said of the dynamic partitioning
approach. In fact, it might be more likely, since declarative
partitioning has been around for eons.

>    b) They have enough billion$ of development dollar$ and
>       development re$ource$ that they can follow strategies that
>       are too expensive for us to even try to follow.

I don't see that as an argument against the declarative approach.
Reading the details on this thread so far, I think Simon's approach is
probably the more complex of the two from an implementation POV.

> 3. If, rather than blindly following, we create something at least
>    quasi-new, there is the chance of doing fundamentally better.
>
>    This very thing happened when it was discovered that IBM had a
>    patent on the ARC caching scheme; the "clock" system that emerged
>    was a lot better than ARC ever was.

Well, I don't think I'm proposing that we /blindly follow/ others. I
propose we choose a grammar which takes the best of what others have
tried to do: Oracle's grammar is hideous and IBM's is too restrictive,
for example.
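To make that concrete, here is a purely hypothetical sketch of the
flavour of grammar I mean. To be clear, none of this syntax exists in
PostgreSQL and the keywords are illustrative only; the point is just
that range and list partitioning can be spelled declaratively without
Oracle's clutter:

  -- hypothetical syntax, not implemented anywhere
  CREATE TABLE orders (
      orderid    bigint,
      orderdate  date
  )
  PARTITION BY RANGE (orderdate)
  (
      PARTITION orders_2007 VALUES LESS THAN (DATE '2008-01-01'),
      PARTITION orders_2008 VALUES LESS THAN (DATE '2009-01-01')
  );

  -- hypothetical syntax for discrete, unordered values
  CREATE TABLE hits (
      user_agent_id  integer
  )
  PARTITION BY LIST (user_agent_id)
  (
      PARTITION bots     VALUES (1, 2, 3),
      PARTITION browsers VALUES (4, 5, 6)
  );

The exact keywords matter much less than keeping the syntax readable
and orthogonal.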
> > One major advantage of the dynamic approach is that it can work on
> > multiple dimensions simultaneously, which isn't possible with
> > declarative partitioning. For example, if you have a table of Orders
> > then you will be able to benefit from Segment Exclusion on all of
> > these columns, rather than just one of them: OrderId, OrderDate,
> > RequiredByDate, LastModifiedDate. This will result in some
> > "sloppiness" in the partitioning, e.g. if we fill 1 partition a day
> > of Orders, then the OrderId and OrderDate columns will start out
> > perfectly arranged. Any particular RequiredByDate will probably be
> > spread out over 7 partitions, but that's way better than being
> > spread out over 365+ partitions.
>
> I think it's worth observing both the advantages and demerits of this
> together.
>
> In effect, with the dynamic approach, Segment Exclusion provides its
> benefits as an emergent property of the patterns of how INSERTs get
> drawn into segments.
>
> The tendency will correspondingly be that Segment Exclusion will be
> able to provide useful constraints for those patterns that can
> naturally emerge from the INSERTs.

Many people, in my experience, doing the kind of data processing which
benefits from partitioning are regularly loading data, rather than
collecting it in an OLTP fashion. Let's take the easily understood
example of processing web site traffic. If the amount of data is large
enough to benefit from partitioning, the site probably has multiple
web servers and therefore almost certainly multiple log files. If
these files are not merged and sorted into a single file before
loading, the records will not have a naturally progressing chronology:
with every new file, we go back to the beginning of the time period
the load covers. If you add parallelism to your load, things look even
less ordered. This means you could end up with a bunch of partitions,
under the dynamic model, which all cover the same time range.

Then there's the way that really big databases are used (say, up
around Simon's upper bound of 16 TB). It is expensive to keep data
online, so people don't. They're loading and unloading data all the
time to perform different forms of analysis. A common scenario in the
example above might be to unload all but the current month's data and
then load the same month from the previous year. The unload step must
not be costly (i.e., it should be a TRUNCATE).

Then, there's no guarantee that what they're interested in is the date
range at all. They may want to compare user agents (to look at bot
activity, say). In this case, the partitioning is across a small list
of strings (well, most likely numbers). Here, the partitions would all
cover the same range, so the partitioning would be useless.
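For what it's worth, the monthly load/unload scenario is roughly what
people already build by hand with the existing inheritance machinery,
and it is a useful baseline for any declarative grammar. A minimal
sketch, with made-up table and column names: monthly children carry
CHECK constraints so that constraint exclusion can skip irrelevant
children, and unloading a month is a single TRUNCATE:

  CREATE TABLE hits (
      hit_time    timestamp NOT NULL,
      user_agent  text
  );

  -- one child per month; the CHECK constraint is what the planner
  -- uses to exclude a child from scans
  CREATE TABLE hits_2008_01 (
      CHECK (hit_time >= '2008-01-01' AND hit_time < '2008-02-01')
  ) INHERITS (hits);

  CREATE TABLE hits_2008_02 (
      CHECK (hit_time >= '2008-02-01' AND hit_time < '2008-03-01')
  ) INHERITS (hits);

  SET constraint_exclusion = on;

  -- the planner can skip hits_2008_01 entirely here
  SELECT count(*) FROM hits WHERE hit_time >= '2008-02-10';

  -- unloading January is file truncation, not a row-by-row DELETE
  TRUNCATE hits_2008_01;

The pain is equally obvious: INSERTs have to be routed with rules or
triggers, and the constraints are maintained by hand. That is exactly
the boilerplate a declarative grammar should generate for us.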
> We can therefore expect useful constraints for attributes that are
> assigned in some kind of more or less chronological order. Such
> attributes will include:
>
> - Object ID, if set by a sequence
> - Processing dates
>
> There may be a bit of sloppiness, but the constraints may still be
> useful enough to exclude enough segments to improve efficiency.

I really like the idea of the system being able to identify trends and
patterns in the data (actually useful at the user level), but it's the
uncertainty of how the partitioning will work out for different kinds
of data that I don't like.

> _On The Other Hand_, there will be attributes that are *NOT* set in a
> more-or-less chronological order, and Segment Exclusion will be
> pretty useless for these attributes.
>
> In order to do any sort of "Exclusion" for non-"chronological"
> attributes, it will be necessary to use some mechanism other than the
> patterns that fall out of "natural chronological insertions." If you
> want exclusion on such attributes, then there needs to be some sort
> of rule system to spread such items across additional partitions.
> Mind you, if you do such, that will weaken the usefulness of Segment
> Exclusion. For instance, suppose you have 4 regions, and scatter
> insertions by region. In that case, there will be more segments that
> overlap any given chronological range.

Right, and so the system seems to 'degrade' to a declarative approach.
Consider all the kinds of data other than time series that people are
storing in Postgres: some of the biggest databases I've seen hold GIS
or network information. Also, what about people with their own data
types? A type may not support ordering at all, in which case range
partitioning makes no sense.

> > When we look at the data in the partition we can look at any number
> > of columns. When we declaratively partition, you get only one
> > connected set of columns, which is one of the reasons you want
> > multi-dimensional partitioning in the first place.
>
> Upside: Yes, you get to exclude based on examining any number of
> columns.
>
> Downside: You only get the exclusions that are "emergent properties"
> of the data...

There's also a cost involved in calculating all of this under the
dynamic approach. Consider tables with a non-trivial number of
columns.

Thanks,

Gavin