[HACKERS] Arbitrary file size limit in twophase.c
There's an apparently arbitrary limit of 10,000,000 bytes in twophase.c on the size of a two-phase commit file. I can't see why this limit exists. I hit this limit by creating a prepared transaction which included dropping a schema with about 25,000 objects in it and then trying to commit it. The result is that ReadTwoPhaseFile() returns NULL and FinishPreparedTransaction() says it detected a corrupted file. Thoughts? Thanks, Gavin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Arbitrary file size limit in twophase.c
On Tue, May 13, 2008 at 10:34:23AM -0400, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: There's an apparently arbitrary limit of 10,000,000 bytes in twophase.c on the size of a two-phase commit file. I can't see why this limit exists. The comment seems perfectly clear about why the limit exists: * Check file length. We can determine a lower bound pretty easily. We * set an upper bound mainly to avoid palloc() failure on a corrupt file. Oops. Where was my brain? Perhaps it'd be better to use malloc() than palloc(), so that we'd not lose control on out-of-memory, and then deem the file too big only if we couldn't malloc the space. That seems better. Thanks, Gavin
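Tom's suggestion can be sketched in isolation. The sketch below is illustrative only -- the struct and function names are invented, not the ones in twophase.c. The idea is that the lower bound on file length is still checked, but instead of a hard-coded 10,000,000-byte cap, a malloc() failure becomes the signal that a claimed length is unreasonable:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for the state file's fixed header. */
typedef struct { uint32_t total_len; } DemoFileHeader;

/*
 * Sketch of reading a two-phase state file with malloc() instead of
 * palloc(): we enforce a lower bound on the claimed length, but an
 * unreasonably large length simply shows up as a malloc() failure,
 * which we report as NULL (i.e., an invalid file) rather than losing
 * control on out-of-memory.
 */
static char *
read_state_buffer(uint32_t claimed_len)
{
    if (claimed_len < sizeof(DemoFileHeader))
        return NULL;            /* shorter than any valid file: corrupt */

    char *buf = malloc(claimed_len);
    if (buf == NULL)
        return NULL;            /* too big to allocate: treat as invalid */

    memset(buf, 0, claimed_len); /* stand-in for reading the file body */
    return buf;
}
```

The caller still sees NULL for a corrupt file, so the error path Gavin hit would be unchanged; only the arbitrary cap disappears.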
Re: [HACKERS] Declarative partitioning grammar
On Mon, Jan 14, 2008 at 10:45:28PM -0500, Tom Lane wrote: Jeff Cohen [EMAIL PROTECTED] writes: In the proposed solution, hash and list partitions work for all types that support an equality operator, and range partitions work for all types that support fully-ordered comparison. Surely a hashing method would require a *hashable* equality operator, ie a hash opclass; likewise range partitions would demand a matching btree opclass. You could do list partitions with an equality operator of either kind. Right. Essentially all of the system's current knowledge about the properties of specific operators is encoded as operator classes for one of these two built-in index types. If you want to make assumptions about the behavior of an operator, it really needs to be founded on these types of opclasses --- or else you're buying into inventing a comparable amount of infrastructure for some other organizational concept. Right, we obviously don't want to do that. I think Peter's point was that you might want to think about generalizing your concepts so that other kinds of operator classes could someday serve as the foundations for other kinds of partitioning rules. Let me see if I've understood: certain operator classes either describe or allow certain kinds of partitioning: hash is obvious, btree allows equality and range-based approaches, gist allows users a whole range of possibilities. So, a truly extensible system would define the partitioning type in the catalog? That's an interesting idea. It presents problems, I think, for the grammar I've proposed because some grammatical constructs are tied to range, some to hash, some to list. Any insights into how we could achieve both? Thanks, Gavin PS: Heading off into the French countryside for a little while and responses may be a little slow.
Re: [HACKERS] Declarative partitioning grammar
On Tue, Jan 15, 2008 at 10:36:17AM -0500, Tom Lane wrote: Markus Schiltknecht [EMAIL PROTECTED] writes: Jeff Cohen wrote: If you don't define a default partition to handle outliers, the insert should fail with an error. IMO, you should always have a default partition, then, so as not to violate the constraints (by rejecting tuples which are correct according to the constraints). I don't agree with that at all. I can imagine plenty of situations where a tuple falling outside the range of available partitions *should* be treated as an error. For instance, consider timestamped observations --- data in the future is certainly bogus, and data further back than you want to deal with must be an entry error as well. I agree that there needs to be a way to have a default partition, but there needs to be a way to not have one, too. Jeff and I discussed this and we came to the same conclusion. We will propose grammar for handling it. Many users we talk to would fall into the class of people who would want an error if the data fell outside the defined partitions. Thanks, Gavin
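For illustration, the two behaviours being discussed might look something like this in the proposed grammar. This is a hypothetical sketch only -- the thread says grammar for this will be proposed later, and the DEFAULT keyword here is my invention for the example, not anything posted on-list:

```sql
-- Hypothetical syntax: with a default partition, out-of-range rows are
-- caught; omit the DEFAULT line and such inserts raise an error instead.
CREATE TABLE observations (obs_time timestamp, reading float8)
PARTITION BY RANGE (obs_time)
( PARTITION y2007 start (date '2007-01-01') end (date '2008-01-01'),
  PARTITION y2008 start (date '2008-01-01') end (date '2009-01-01'),
  PARTITION outliers DEFAULT  -- drop this line to reject outliers
);
```

Either way, the choice is the user's, which is the conclusion both Tom and Gavin reach above.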
Re: [HACKERS] Declarative partitioning grammar
On Sat, Jan 12, 2008 at 04:01:19PM +0530, NikhilS wrote: Hi, We did look at allowing general functions for partitioning and this was one concern. The other is that we want to enforce that a row only gets inserted into a single partition, so we wanted a declarative syntax where it was relatively easy to check that range and list specifications don't overlap. Detection of mutually exclusive ranges might not turn out to be so easy after all. I think there is some code in the constraint_exclusion area which might help out in this. In some prototyping code it didn't seem too difficult but if we've made a mistake we might have to look at the CE code. Thanks, Gavin
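As a sanity check that overlap detection really is simple for declarative range specifications, here is a small sketch (my own, not the Greenplum prototype code): once the bounds are sorted by start, only neighbouring ranges can overlap, so a single linear pass suffices.

```python
def find_overlap(ranges):
    """Return the first overlapping pair of [start, end) ranges, or None.

    Mirrors the check a declarative RANGE grammar needs: after sorting
    by start bound, only adjacent ranges can possibly overlap (start is
    inclusive, end exclusive, as in the proposed grammar's defaults).
    """
    ordered = sorted(ranges, key=lambda r: r[0])
    for (s1, e1), (s2, e2) in zip(ordered, ordered[1:]):
        if s2 < e1:          # next range starts before the previous ends
            return (s1, e1), (s2, e2)
    return None
```

List specifications are even simpler: collect all VALUES into one set and reject any duplicate.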
Re: [HACKERS] Declarative partitioning grammar
On Sat, Jan 12, 2008 at 05:47:30PM +, Simon Riggs wrote: On Sat, 2008-01-12 at 01:59 +0100, Gavin Sherry wrote: The syntax is half the problem, performance is the other. The syntax looks great to me, but I think it is about 5% of the problem, maybe less. I don't really have any questions about the syntax, but I may have thoughts when the implementation details emerge. Yes, that's for another thread. Since the discussion was about using grammar to control partitions I wanted to get some grammar out. More details on other stuff soon. I'm not sure you'll be able to use PARTITION BY since it's part of the SQL Standard for Windowed grouping, which we do hope to implement one day. It will be confusing to have two completely separate meanings for the one phrase in our grammar. I think it's fine. It doesn't cause conflicts in the grammar (in fact, the Greenplum grammar implements both meanings right now with no confusion). Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, Jan 11, 2008 at 08:07:18AM +, Simon Riggs wrote: On Fri, 2008-01-11 at 02:28 +0100, Gavin Sherry wrote: On Thu, Jan 10, 2008 at 09:30:10PM +, Simon Riggs wrote: We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT_DATE function is STABLE. We can do the exact same thing -- if it's a direction people want to take. In fact, we can do it better/faster because once we've evaluated one partition we know that there are no others to evaluate. Lost you completely here. I'm explaining to you that *nobody* can solve those problems solely at planning time, by definition, so it has to be done at execution time. I'm not saying anything about your way, my way. Sorry, I wasn't clear enough. I was trying to say that if we're going to do something in the executor (for right or wrong), the declarative approach can do it too. Since there will be partition bounding information available, we can do partition selection in the executor (maybe the planner should tell us to do it). Of course. It's an identical situation for both. Regrettably, none of your comments about dynamic partitioning and planning were accurate as a result. That's not true. We will still have planning drive the partition selection when the predicate is immutable, thus having more accurate plans. Some but not all plans use current_date and similar. Okay. As I said above, nothing in declarative partitioning rules out partition selection with stable functions. So, let's do it, assuming everyone else thinks it is a good idea. If you check the archives, this was identified as a requirement long ago. And I said exactly the things you said, BTW, when trying to say it didn't matter. I've kept a list of requests for improvement that I can share with you; I've always been loath to publish a list of bad points. Why, please send them.
Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Fri, Jan 11, 2008 at 11:49:50AM +, Simon Riggs wrote: On Fri, 2008-01-11 at 10:25 +0100, Gavin Sherry wrote: Of course. It's an identical situation for both. Regrettably, none of your comments about dynamic partitioning and planning were accurate as a result. That's not true. We will still have planning drive the partition selection when the predicate is immutable, thus having more accurate plans. Not really. The planner already evaluates stable functions at plan time to estimate selectivity against statistics. It can do the same here. The boundary values can't be completely trusted at plan time because they are dynamic, but they're at least as accurate as ANALYZE statistics (and probably derived at identical times), so can be used as estimates. So I don't see any reason to imagine the plans will be badly adrift. Okay, it's good that you want the planner to look at those. Did you consider the point I made about the sheer amount of data the planner would have to consider for large cases? We're back to saying that if the visibility map is volatile, then SE won't help you much. I agree with that and haven't argued otherwise. Does saying it make us throw away SE? No, at least, not yet and not for that reason. Yes, I'm not against SE; I just think that only having it would be a serious regression for larger users. Personally, I think SE would be a great idea for append-only tables since it removes the thing I'm most worried about with it: the need to vacuum to 'turn it on'. SE does what I was looking for it to do, but doesn't do all of what you'd like to achieve with partitioning, because we're looking at different use cases. I'm sure you'd agree that all large databases are not the same and that they can have very different requirements. I'd characterise our recent positions on this by saying that I've been focused on archival requirements, whereas you've been focused on data warehousing.
I think that sums it up, although I'd also say that declarative partitioning is suitable for all those with largish amounts of data which they know how they want stored. This points to another case that SE suits: those who don't know how or (maybe more importantly) don't care to manage their data. I'll go back to what I said above. SE looks like a good performance boost for archival read-only data. If we tighten up the definitions of how some tables can be used -- append only -- then we can remove the vacuum requirement and also change other characteristics of the storage. For example, reduced visibility information, compression, etc. These are hot topics for people with that kind of data. The difference really lies in how much activity and of what kind occurs on a table. You don't unload and reload archives regularly, nor do you perform random updates against the whole table. I couldn't agree more. I'm sure we'd quickly agree that many of the challenges you've seen recently at Greenplum would not be possible with core Postgres, so I'm not really trying too hard to compete. There we diverge. Yes, Greenplum produces systems for very large amounts of data, petabyte range in fact. However, the architecture -- individual postmasters on a single CPU with their own storage -- means that we see the problems of non-distributed database users when we look at data at the node level. This is why I say that VACUUMing such systems, under the SE model, after you do a load of data is just impossible (I know that after time the cost of vacuum will stabilise but there's always the initial data load). Thanks, Gavin
[HACKERS] Declarative partitioning grammar
Hi all, Many of you will have read the dynamic partitioning thread here: http://archives.postgresql.org/pgsql-hackers/2008-01/msg00028.php I've proposed an alternative approach, which we've called declarative partitioning, which is grammar based. This grammar was developed by Jeff Cohen at Greenplum with some assistance from myself. It is to be completely open source. The grammar will need some refinement and we'd appreciate comments, suggestions, etc. The grammar is designed to address the range of use cases we have out there. Basics -- CREATE TABLE is modified to accept a PARTITION BY clause. This clause contains one or more partition declarations. The syntax is as follows: PARTITION BY {partition_type} (column_name[, column_name...]) [PARTITIONS number] ( partition_declaration[, partition_declaration...] ) The partition type can be one of HASH, RANGE or LIST. The column names are the partitioning key. The PARTITIONS subclause instructs the system on the number of partitions to create if we're doing HASH or, in the case of LIST or RANGE, can act as a safeguard for users who want to ensure that they do not generate more than a certain number of partitions. We have discussed adding the partition type REFERENCE which is akin to the LIKE clause in CREATE TABLE. That is, it duplicates the partition configuration of the specified table. Input would be appreciated. Partition declarations -- Hash ... PARTITION BY HASH(order_date) PARTITIONS 5; This will create 5 partitions on the column order_date. Inserts will be distributed roughly evenly across the 5 partitions. List ...PARTITION BY LIST (state) (PARTITION q1_northwest VALUES ('OR', 'WA'), PARTITION q1_southwest VALUES ('AZ', 'UT', 'NM'), PARTITION q1_northeast VALUES ('NY', 'VM', 'NJ'), PARTITION q1_southeast VALUES ('FL', 'GA'), PARTITION q1_northcentral VALUES ('SD', 'WI'), PARTITION q1_southcentral VALUES ('OK', 'TX')); Here, we produce 6 different partitions.
The first partition groups states in the North West of the USA. We introduce here the named partition concept for clarity. Range - Range has the most expressive grammar. I'll introduce it in steps: ...PARTITION BY RANGE (b) ( PARTITION aa start (date '2007-01-01') end (date '2008-01-01'), PARTITION bb start (date '2008-01-01') end (date '2009-01-01') ); Here, we create 2 partitions: aa and bb. Partition aa has the range 2007-01-01 to 2008-01-01; partition bb has the range 2008-01-01 to 2009-01-01. Intervals always have this problem: are the bounds included in the range? To deal with this we define: the start of the range is included in the range. The ending bound is not. This can be modified with the keywords INCLUSIVE and EXCLUSIVE, which modify this property on a rule by rule basis. It is common that these partitions follow a pattern, such as following every week, month or year. So, we support the following specification: ... PARTITION BY RANGE(order_date) ( START (date '2005-12-01') end (date '2007-12-01') EVERY(interval '2 months') ); If we like, we can mix the specification a little: ... PARTITION BY RANGE(order_date) ( PARTITION Q1_2005 end (date '2005-04-01'), PARTITION Q2_2005 end (date '2005-07-01'), PARTITION Q3_2005 end (date '2005-10-10'), PARTITION Q4_2005 end (date '2006-01-01'), START (date '2006-02-01') end (date '2008-04-01') EVERY (interval '2 weeks') ); An interesting result of the flexibility of the grammar we've come up with is that you can do something like this: ... PARTITION BY RANGE(order_date) ( PARTITION minny end (date '2004-12-01'), end (date '2006-12-01'), PARTITION maxny start (date '2006-12-01') ); Here, when order_date is less than 2004-12-01, we put the data in minny; when it is between 2004-12-01 and 2006-12-01 we put it in an unnamed partition; and after this we put it in maxny. Tablespaces --- We allow inline tablespace specification, such as: 
PARTITION BY RANGE(order_date) ( PARTITION minny TABLESPACE compress, start (date '2004-12-01') end (date '2006-12-01') TABLESPACE hot, PARTITION maxny TABLESPACE compress ); I've used the term compress here intentionally. A number of operating systems now ship file systems which can compress partitions. Users with issues with the amount of data they want to keep online can delay the time until they need new storage to a future date by compressing less regularly used data with this technique, for a performance cost. Data being used heavily can live on an uncompressed file system, unaffected. Multi-column support We can have multi-column partitions. ... PARTITION BY LIST (state, deptno) ( VALUES ('OR', 1,
Re: [HACKERS] Declarative partitioning grammar
On Fri, Jan 11, 2008 at 07:46:36PM -0500, Merlin Moncure wrote: On Jan 11, 2008 6:19 PM, Gavin Sherry [EMAIL PROTECTED] wrote: Hi all, Many of you will have read the dynamic partitioning thread here: http://archives.postgresql.org/pgsql-hackers/2008-01/msg00028.php I've proposed an alternative approach, which we've called declarative partitioning which is grammar based. This grammar was developed by Jeff Cohen at Greenplum with some assistance from myself. It is to be completely open source. I like the syntax, but what I really want to know is how well this is going to work with the query planner. The current 8.x table partitioning mechanism has a lot of issues...it causes more problems than it solves unless you are willing to strictly control how you query the tables...I usually end up rolling my own. The syntax is half the problem, performance is the other. I will bring the performance issues up in another thread. Yes, we are confident that we can address the performance issues that rule out the existing partitioning for many applications. We need it for our own stuff! :P Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, Jan 10, 2008 at 07:25:00AM +, Simon Riggs wrote: On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote: If the exclusion is executor driven, the planner cannot help but create a seq scan plan. The planner will think you're returning 100X rows when really you end up returning X rows. After that, all decisions made by the planner are totally bogus. One of the most important queries on large tables is handling WHERE Eventdate = CURRENT_DATE; Really? Well, this isn't handled by the current partitioning approach (i.e., constraint exclusion) so users have worked around it. We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT_DATE function is STABLE. We can do the exact same thing -- if it's a direction people want to take. In fact, we can do it better/faster because once we've evaluated one partition we know that there are no others to evaluate. So it seems clear that we need to make partition exclusion work at executor time, whatever else we do. One example doesn't make the rule. Most people doing range partitioning are going to be doing either specific dates or date ranges -- i.e., constants or things that can be folded to constants by the planner. At least, that's what I've seen. Thanks, Gavin
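To make the executor-time idea concrete, here is a toy model (my own sketch, not code from either proposal): the stable expression is evaluated once when the scan begins, and the single matching range partition is then found by binary search over the start bounds, exactly the "once we've evaluated one partition there are no others to evaluate" property Gavin mentions.

```python
import bisect
from datetime import date

def select_partition(starts, value):
    """Pick the index of the range partition holding `value`.

    `starts` is the sorted list of partition start bounds (start bounds
    are inclusive, as in the proposed grammar).  With executor-time
    selection, `value` can come from a STABLE expression such as
    CURRENT_DATE, evaluated once when the scan starts, which planning
    time constraint exclusion cannot handle.
    """
    i = bisect.bisect_right(starts, value) - 1
    if i < 0:
        return None          # value falls before the first partition
    return i

# Example bounds: three monthly partitions.
starts = [date(2007, 1, 1), date(2007, 2, 1), date(2007, 3, 1)]
```

A complete version would also check the end bound of the last partition; the sketch only shows the selection step.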
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, Jan 10, 2008 at 04:51:04PM +, Simon Riggs wrote: On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote: If people with large tables like partitioning why is Oracle moving towards automated partitioning in 11g? Automated partitioning was one of Have you used Oracle's partitioning? Since you ask, yep, certified on it, plus DB2, Teradata and Greenwich. It was just a joke. What about consistency over time? High levels of control for users if they want it? Even simple things like tablespace support? High-performance unload and load, much better than now? Reliable and consistent query plans? Support for datatypes other than those relating to time. Most of these comments seem like emotional appeals, so refuting them one by one is just going to look and smell like an argument, and a fruitless one as well since we agree on so many things. Sigh. You can call those questions what you like but your approach doesn't deal with them and the current partitioning approach, for its problems, does deal with them. So, I get the message that you really want the DDL approach and agree that you've demonstrated there are use cases that need it that you are interested in. That's fine by me as long as we can each work on parts of it to get it done. Will it be possible for you to do that? I assured you offlist that it was. This is something Greenplum needs right now and I'm being paid to deliver it. I feel I can say that because AFAICS we can in principle have both dynamic and declarative techniques, even on the same table. Around half of the mechanics would be identical anyway. So from here I say let's work together to get the basic mechanics right. My experience with PostgreSQL is that the team/list always comes up with a better idea in the end and I'm sure we will this time too. I agree. Scrutiny of ideas almost always leads to better ideas.
I have one main technical concern which is the storage model, which is the source of the difference between the two types of partitioning we've been discussing. I see difficulties originating there, so I will start another thread to examine just those items in detail. I look forward to it. As promised, I'll soon post some syntax of what we have been working on. Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Thu, Jan 10, 2008 at 09:30:10PM +, Simon Riggs wrote: We cannot perform partition exclusion using this type of WHERE clause at planning time because the CURRENT_DATE function is STABLE. We can do the exact same thing -- if it's a direction people want to take. In fact, we can do it better/faster because once we've evaluated one partition we know that there are no others to evaluate. Lost you completely here. I'm explaining to you that *nobody* can solve those problems solely at planning time, by definition, so it has to be done at execution time. I'm not saying anything about your way, my way. Sorry, I wasn't clear enough. I was trying to say that if we're going to do something in the executor (for right or wrong), the declarative approach can do it too. Since there will be partition bounding information available, we can do partition selection in the executor (maybe the planner should tell us to do it). So it seems clear that we need to make partition exclusion work at executor time, whatever else we do. One example doesn't make the rule. Most people doing range partitioning are going to be doing either specific dates or date ranges -- i.e., constants or things that can be folded to constants by the planner. At least, that's what I've seen. It's not always true that planning time = execution time. Using CURRENT_DATE to access the current day, week or month is common in many applications, as is parameterised SQL for higher transaction rate apps. We can't rewrite all the SQL, so we need to make it work. I think if you demand a full function implementation of partitioning, you'd better take into account *all* of the requirements. Okay. As I said above, nothing in declarative partitioning rules out partition selection with stable functions. So, let's do it, assuming everyone else thinks it is a good idea.
Thanks, Gavin
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 09, 2008 at 11:47:31AM -0500, Chris Browne wrote: [EMAIL PROTECTED] (Simon Riggs) writes: I think we have an opportunity to bypass the legacy-of-thought that Oracle has left us and implement something more usable. This seems like a *very* good thing to me, from a couple of perspectives. [snip] 2. Blindly following what Oracle does has always been a dangerous sort of thing to do. There are two typical risks: a) There's always the worry that they may have patented some part of how they implement things, and if you follow too closely, There Be Dragons... I think that could be equally said of the dynamic partitioning approach. In fact, it might be more likely since declarative partitioning has been around for eons. b) They have enough billion$ of development dollar$ and development re$ource$ that they can follow strategies that are too expensive for us to even try to follow. I don't see that as an argument against the declarative approach. Reading the details on this thread so far, I think Simon's approach is probably more complex from an implementation POV. 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. Well, I don't think I'm proposing we /blindly follow/ others. I propose we choose a grammar which takes the best of what others have tried to do. Oracle's grammar is hideous, IBM's is too restrictive, for example. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate.
This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. Many people, in my experience, doing the kind of data processing which benefits from partitioning are regularly loading data, rather than collecting it in an OLTP fashion. Let's take the easily understandable concept of processing web site traffic. If the amount of data is large enough to benefit from partitioning, they probably have multiple web servers and therefore almost certainly multiple log files. If these files are not sorted into a single file, the records will not have a naturally progressing chronology: with every file we go back to the beginning of the period of time the load covers. If you add parallelism to your load, things look even more different. This means you could end up with a bunch of partitions, under the dynamic model, which all cover the same time range. Then there's the way that really big databases are used (say, up around Simon's upper bound of 16 TB). It is expensive to keep data online so people aren't. They're loading and unloading data all the time, to perform different forms of analysis. A common scenario in the example above might be to unload all but the current month's data and then load the same month from the previous year. The unload step needs to be low cost (i.e., TRUNCATE).
Then, there's no guarantee that what they're interested in is the date range at all. They may want to compare user agent (look at bot activity). In this case, the partitioning is across a small list of strings (well, numbers most likely). Here, the partitions would all have the same range. Partitioning would be useless. We can therefore expect useful constraints for attributes that are assigned in some kind of more or less chronological order. Such attributes will include: - Object ID, if set by a sequence - Processing dates There may be a bit of sloppiness, but the constraints may still be useful enough to exclude enough segments to improve efficiency. I really like the idea of the system being able to identify trends and patterns in the data (actually useful at the user level) but it's the uncertainty of how the partitioning will work for different kinds of data that I don't like. _On The Other Hand_, there will be
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 09, 2008 at 02:38:21PM -0500, Chris Browne wrote: Ron Mayer [EMAIL PROTECTED] writes: Or am I missing something? Well, this can head in two directions... 1. Suppose we're not using an organize in CLUSTER order approach. If the data is getting added roughly in insertion order, then there's no reason to expect San Francisco data to be clustered together. Ditto for John Smiths; we can expect them to be as scattered as their dates of creation. 2. Suppose we *are* using an organize in CLUSTER order approach. In that case, yes, indeed, you can expect segments to contain specific ranges of the data. However, in that case, the ONLY dimension under which the Segment Exclusion may be expected to be useful is that of the first column of the index being used. Smith should be useful to SE, but not John, because, as a secondary criteria, the first name values will be scattered across all segments. One of the reasons for using partitions is that the usefulness of indexes breaks down because of the low cardinality of the indexed column. Also, the random IO can kill your performance. In my opinion, CLUSTER (and vacuum, mentioned elsewhere) are not data management tools. Thanks, Gavin
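A toy model of the Segment Exclusion mechanism under discussion may make Chris's point clearer (this is my own sketch, not code from the proposal): each segment carries the observed min/max of a column, and a scan skips any segment whose range cannot intersect the predicate. Whether that range is narrow (first index column under CLUSTER, chronological inserts) or covers nearly everything (secondary columns, scattered values) is exactly what determines whether SE helps.

```python
def exclude_segments(segments, lo, hi):
    """Return indexes of segments that may contain values in [lo, hi].

    Each segment is summarised by the (min, max) of one column, as in
    the Segment Visibility Map idea.  A segment whose summary range
    does not intersect the predicate range is skipped entirely; a
    segment whose range spans the whole domain is never excluded,
    which is the 'scattered values' failure mode discussed above.
    """
    return [i for i, (smin, smax) in enumerate(segments)
            if smax >= lo and smin <= hi]
```

With chronologically loaded data, segment ranges barely overlap and most segments are excluded; with scattered data, every segment's range covers the predicate and nothing is excluded.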
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 09, 2008 at 08:17:41PM +, Simon Riggs wrote: On Wed, 2008-01-09 at 20:03 +0100, Gavin Sherry wrote: I think Simon's approach is probably more complex from an implementation POV. Much of the implementation is exactly the same, and I'm sure we agree on more than 50% of how this should work already. We just need to close in on the remainder. My current opinion on the SE approach is the opposite of the one above, though it gets us nowhere just to state it. I'm trying to avoid opinion and look at the details, which is the reason my viewpoint recently changed in favour of the dynamic approach as the main thrust for implementation. I've written a detailed email this morning to explain where and how the problems lie, which are nowhere near the syntax level. I haven't ruled out a declarative approach yet, but I need some detailed technical review of the issues. I hope you'll be replying to that? I'm responding to that email separately. Just a matter of addressing all the points in the detail it deserves. 3. If, rather than blindly following, we create something at least quasi-new, there is the chance of doing fundamentally better. This very thing happened when it was discovered that IBM had a patent on the ARC caching scheme; the clock system that emerged was a lot better than ARC ever was. Well, I don't think I'm proposing we /blindly follow/ others. I propose we choose a grammar which takes the best of what others have tried to do. Oracle's grammar is hideous, IBM's is too restrictive, for example. I assume the new grammar is good and if we do go that way, it sounds like the right starting place. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate.
This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. I think it's worth observing both the advantages and demerits of this together. In effect, with the dynamic approach, Segment Exclusion provides its benefits as an emergent property of the patterns of how INSERTs get drawn into segments. The tendency will correspondingly be that Segment Exclusion will be able to provide useful constraints for those patterns that can naturally emerge from the INSERTs. Many people, in my experience, doing the kind of data processing which benefits from partitioning are regularly loading data, rather than collecting it in an OLTP fashion. Let's take the easily understandable concept of processing web site traffic. If the amount of data is large enough to benefit from partitioning, they probably have multiple web servers and therefore almost certainly multiple log files. If these files are not sorted into a single file, the records will not have a naturally progressing chronology: with every file we go back to the beginning of the period of time the load covers. If you add parallelism to your load, things look even more different. This means you could end up with a bunch of partitions, under the dynamic model, which all cover the same time range. Depends how big we make the partitions and how sloppy this is as to whether that is a problem or not. We might still expect a x100 gain from using the SE approach depending upon the data volume. This comes back to my problem with the dynamic approach: the user might get a certain experience but with the declarative approach the expectation matches the result. Then there's the way that really big databases are used (say, up around Simon's upper bound of 16 TB).
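The log file point above is easy to demonstrate with a toy sketch (illustrative Python, nothing to do with the actual proposal's code): once unsorted log files are loaded back to back, every segment's min/max range covers nearly the whole period, and there is little left to exclude.

```python
# Toy illustration (not PostgreSQL code): how load order affects the
# usefulness of per-segment min/max ranges for exclusion.

SEG_SIZE = 4  # rows per segment, invented for the example

def segment_ranges(values):
    """Split `values` into fixed-size segments and return each
    segment's (min, max) -- the implicit constraint SE would record."""
    segs = [values[i:i + SEG_SIZE] for i in range(0, len(values), SEG_SIZE)]
    return [(min(s), max(s)) for s in segs]

days = list(range(8))

# Single sorted load: each segment covers a narrow, disjoint range.
print(segment_ranges(sorted(days)))            # [(0, 3), (4, 7)]

# Two unsorted log files loaded back to back: each restarts the
# chronology, so every segment spans almost the whole period.
interleaved = days[::2] + days[1::2]           # [0,2,4,6, 1,3,5,7]
print(segment_ranges(interleaved))             # [(0, 6), (1, 7)]
```

With the sorted load, a query on one day can skip half the segments; with the interleaved load, both segments overlap the whole range and nothing is excluded.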
It is expensive to keep data online so people aren't. They're loading and unloading data all the time, to perform different forms of analysis. That isn't my experience. That sounds very time consuming. I cannot offer any evidence but it's what we see companies doing. The storage cost issue was the reason Andrew wanted offline segments, and why I have been talking about hierarchical storage. The thing is, people with lots of data usually have good hierarchical storage systems. Can we really do better? A common scenario in the example above might be to unload all but the current month's data and then load the same month from the previous year. The unload step needs to be cheap (i.e., TRUNCATE). Then, there's no guarantee that what they're interested in is the date range at all. They may want
Re: [HACKERS] Named vs Unnamed Partitions
On Wed, Jan 09, 2008 at 08:51:30PM +, Simon Riggs wrote: That's what I would have done if it was easier to do with constraint exclusion (did only date partitioning), as the reporting queries will always have some server (stats by services, each service being installed on 1 or more servers) and date restrictions. Hmm, well if you found declaring the partitions a problem with constraint exclusion it's not going to be any easier using other declarative approaches. I disagree (although it is unreasonable for me to do so without posting syntax -- it's coming). Proper grammar for partition support means running a single DDL command. The user does not have to line up table generation with rules (or triggers) and check constraints. As such, I believe it to be much much easier. Thanks, Gavin ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi Simon, On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: Segment Exclusion - After we note that a segment is read-only we can scan the segment and record min/max values for all columns. These are then implicit constraints, which can then be used for segment exclusion in a similar manner as we do with the current constraint exclusion mechanism. What about columns which are varlen? The maximum value could be very long -- 1 GB in fact. Say we decide to not support those. Well, what about numeric? What about user defined types? This would also allow a Merge Join to utilise exclusion for the inner plan at execution time. Instead of scanning the whole inner table plan from the start, we would be able to start the scan from the appropriate segment. This would require us to pass the current value of the outer plan down into the inner plan. The typical executor nodes on the inner plan would be a SortNode and below that a SeqScan. In that case the executor would need to pass the outer value from the MergeJoinNode down thru the SortNode to the SeqScan node. The SeqScan node could then perform partition exclusion, reducing the time for that scan and also reducing the time for the resulting sort. This sounds like it might be worth doing in normal cases also. It might turn out that the potentially applicable cases are already excluded during planning, I haven't thought about that aspect in enough detail yet. I don't think that would work at all. Say you have skew of data across the partitions. You may end up doing a seq scan for every outer tuple. It would look like a really expensive nested loop. If we collect data for all columns then many of our implicit constraints would be useless. e.g. if a column only has a few values and these are present in all segments. Matching our predicate against all constraints would be expensive, so we must weed out poor constraints. We would do this by removing any constraint that overlapped more than 10% of other segments.
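For what it's worth, the min/max idea itself is simple to sketch (hypothetical Python with invented names; the real mechanism would live in the executor, not look like this):

```python
# Sketch of segment exclusion via per-segment min/max "implicit
# constraints". All names and structures here are illustrative.

def build_segment_stats(segments):
    """For each read-only segment, record (min, max) of each column."""
    stats = []
    for seg in segments:
        cols = {}
        for row in seg:
            for col, val in row.items():
                lo, hi = cols.get(col, (val, val))
                cols[col] = (min(lo, val), max(hi, val))
        stats.append(cols)
    return stats

def segments_to_scan(stats, col, lo, hi):
    """Return indexes of segments whose [min, max] range for `col`
    overlaps the query range [lo, hi]; all others are excluded."""
    keep = []
    for i, cols in enumerate(stats):
        smin, smax = cols[col]
        if smax >= lo and smin <= hi:
            keep.append(i)
    return keep

# Two segments loaded in rough date order, so their ranges are disjoint.
seg0 = [{"order_date": d} for d in (1, 2, 3)]
seg1 = [{"order_date": d} for d in (4, 5, 6)]
stats = build_segment_stats([seg0, seg1])
print(segments_to_scan(stats, "order_date", 5, 6))  # [1]: segment 0 excluded
```

The objections in this thread apply directly to the sketch: the per-segment stats have to be stored, WAL-logged, and invalidated somewhere, and a single write into segment 0 would force its range to be widened or discarded.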
Various heuristics would likely need to be discovered during development to make this work without resorting to manual commands. Note that all of this exclusion is happening in the executor, not the planner. That allows this form of exclusion to work with stable functions and parameters without problem. So, in that case, you've totally undercut the planner. All the steps up the tree would be based on the stats for scanning the entire table but you've only returned part of it. To solve that, you could put the min/max values in a catalog somewhere. For a 16 TB table, that's 16000 min/max values. That's pretty expensive. And how do we handle making things non-read-only? You'd have to wait for all current queries to finish... that could take a long time. How do we handle crashes during setting of min/max? I presume it needs WAL support. What happens if the free space map gives us a free block in a read only segment? Then we need to look at concurrency and transactionality of min/max values, don't we? If changing it makes it not read-only (which seems to be the case) it would be trivial to totally degrade your partitioning scheme to full seq scan. On a 16 TB table. And getting the performance back might involve a day or more of VACUUMing. Impossible. Visibility Map -- [snip] No dynamic shared memory cache is required because any concurrent changes to the table would be ignored by a Scan anyway, so it doesn't matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any new scans that start will attempt to lock the table and then perform a rel cache check before continuing. So the visibility will be set correctly for *that* scan at least. Is that true in the presence of READ COMMITTED? In most cases the visibility map can be summarised as a single boolean to show whether any 100% visible segments exist. That makes accessing the map very cheap in the common, unset case. What do we do in failure scenarios? Does this go into WAL?
Is it replicated via log shipping techniques? You mention Slony support below. I don't know that Slony's target is Very Big Tables (up to 16 TB) but even if it was being used for a 'small' system of a few hundred GB, when you fail over the system has degraded if you aren't vacuuming it. Moreover, the layout of data may be different and therefore the performance of the segment exclusion. That's a bit of a surprise. Setting the Visibility Map -- VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to all current backends. Marking the map will cause a rel cache invalidation. Argh. Remember, we're talking about 16 TB tables here. With the best of tuning, that takes all day. So, day 1, load data; day 2 VACUUM it? With the existing partitioning approach, there's no such thing. We would never mark the highest numbered
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
Hi Simon, On Wed, Jan 09, 2008 at 03:08:08PM +, Simon Riggs wrote: Do people really like running all that DDL? There is significant manpower cost in implementing and maintaining a partitioning scheme, plus significant costs in getting it wrong. Well... that's impossible for me to say. Most of the people who use lots of data like using MDX (a large and quite complex reporting and modelling language from MS, which is more complex than SQL IMHO) so... If people with large tables like partitioning why is Oracle moving towards automated partitioning in 11g? Automated partitioning was one of Have you used Oracle's partitioning? I'm not surprised they're moving away ;-). More seriously, I didn't know that. Do you have a URL? the major goals for this next set of Postgres partitioning functionality also, whether or not we have declarative partitioning. My thinking is let's go for the best ideas and skip over the stuff that will be (is) obsolete before it's left the design stage. Well, you're assuming that because declarative partitioning has been around a while, it's obsolete. I don't think that's the case. Postgres has been around about as long... I see many more systems in the Terabyte range now than I did 10 years ago, but I see about the same number of DBAs. We'll always need experts, but I feel we should be using our expertise to simplify the standard cases, not just maintain the same level of difficulty in managing them. One of the big benefits of the dynamic partitioning approach is that it needs no DDL. So it will work out-of-the-box, for anybody. It's fine to say that, but people have already started talking about grammar extensions for dynamic partitions. People want the tools to manage it. There's difficulty in learning how and when to use the different VACUUM options that have been discussed. Deleting older data would be optimised under the proposed scheme, so that's not really a problem.
Loading data is actually slightly harder and slower with declarative partitioning (see below). Merging and splitting partitions are tools for fine tuning a very complex partitioning scheme. They do also allow a non-linear segmentation scheme, which might aid performance in some cases. What about adding partitions into a set of partitions? Greg's post on that was dismissed and it might not be representative but we have users doing that all the time. One major advantage of the dynamic approach is that it can work on multiple dimensions simultaneously, which isn't possible with declarative partitioning. For example if you have a table of Orders then you will be able to benefit from Segment Exclusion on all of these columns, rather than just one of them: OrderId, OrderDate, RequiredByDate, LastModifiedDate. This will result in some sloppiness in the partitioning, e.g. if we fill 1 partition a day of Orders, then the OrderId and OrderDate columns will start out perfectly arranged. Any particular RequiredByDate will probably be spread out over 7 partitions, but that's way better than being spread out over 365+ partitions. When we look at the data in the partition we can look at any number of columns. When we declaratively partition, you get only one connected set of columns, which is one of the reasons you want multi-dimensional partitioning in the first place. To this end, we (well, Jeff Cohen) looked at the syntax and semantics of partitioning in leading databases (Oracle, Informix, DB2) and came up with a highly expressive grammar which takes the best of each I think (I'll post details on the grammar in a separate thread). The idea is that range (for example, a date range), list (a list of distinct values) and hash partitioning be supported on multiple columns. Partitions can be added, merged, truncated. Partitions can reside on different tablespaces. The existing issues with the rewriter, COPY support and the like go away by smartening up the backend.
To explore the grammar and semantics Jeff and I (to a small extent) have implemented the parsing of such a grammar in the backend. At the moment, this just results in the generation of a bunch of rules (i.e., it produces the current behaviour, as if someone had written rules themselves) and debugging output. The fully fledged approach will see partitioning rules stored in a new system catalog which can be interrogated by the planner or COPY to determine which partition a given tuple belongs in or which partition(s) a WHERE clause matches. When loading data we would need to examine each incoming tuple and distribute it to the correct place in the partitioned heap. That's a non-trivial overhead on data loading that I would like to avoid, since if we rely on the natural locality of loaded data then that overhead is zero. Yes, but it's a lot faster than what we have at the moment and the dynamic approach is not without its costs. I've pointed it out before, but I will
Re: [HACKERS] Dynamic Partitioning using Segment Visibility Maps
On Wed, Jan 02, 2008 at 05:56:14PM +, Simon Riggs wrote: This technique would be useful for any table with historical data keyed by date or timestamp. It would also be useful for data where a time-of-insert component is implicit, such as many major entity tables where the object ids are assigned by a sequence. e.g. an Orders table with an OrderId as PK. Once all orders placed in a period have been shipped/resolved/closed then the segments will be marked read-only. It's not really going to change the code path much for small tables, yet seems likely to work reasonably well for large tables of all shapes and sizes. If a segment is being updated, we leave it alone, and maybe never actually set the visibility map at all. So overall, this idea seems to cover the main use case well, yet with only minimal impact on the existing use cases we support. As before, I will maintain this proposal on the PG developer Wiki, once we get down to detailed design. Like it? Simon, A novel approach to the problem. For me, this proposal addresses some of the other problems in postgres (visibility in the heap vs. index) rather than the problems with partitioning itself. It might seem otherwise, but for me partitioning is a tool for managing large volumes of data. It allows the user to encode what they know about the nature of their data, and that's often about management. In this way, your proposal actually makes partitioning too smart: the user wants to tell the system how to organise the data, as opposed to the other way around. At Greenplum, we've been discussing this in depth. Interestingly, we also felt that the storage layer could be much smarter with large tables with predictable layouts and predictable data patterns. But the thing is, people with large tables like partitioning, putting different partitions on different storage; they like the ability to merge and split partitions; they like truncating old partitions.
In a word, they seem to like the manageability partitions give them -- as well as the performance that comes with this. To this end, we (well, Jeff Cohen) looked at the syntax and semantics of partitioning in leading databases (Oracle, Informix, DB2) and came up with a highly expressive grammar which takes the best of each I think (I'll post details on the grammar in a separate thread). The idea is that range (for example, a date range), list (a list of distinct values) and hash partitioning be supported on multiple columns. Partitions can be added, merged, truncated. Partitions can reside on different tablespaces. The existing issues with the rewriter, COPY support and the like go away by smartening up the backend. To explore the grammar and semantics Jeff and I (to a small extent) have implemented the parsing of such a grammar in the backend. At the moment, this just results in the generation of a bunch of rules (i.e., it produces the current behaviour, as if someone had written rules themselves) and debugging output. The fully fledged approach will see partitioning rules stored in a new system catalog which can be interrogated by the planner or COPY to determine which partition a given tuple belongs in or which partition(s) a WHERE clause matches. Yes, this is the traditional approach but I think it still has merit. We shall continue working on this approach because it is what our customers have asked for. We would also like to see it get into postgres too. Thanks, Gavin ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
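As a rough illustration of what "interrogating the catalog to route a tuple" could look like, here is a hedged Python sketch; the rule layout and every name in it are invented for the example, not the actual proposed catalog:

```python
# Illustrative sketch of routing a tuple to a partition from
# catalog-style rules for the three schemes discussed above:
# range, list, and hash. Not the proposal's real data structures.

def route(rule, tuple_):
    kind = rule["kind"]
    val = tuple_[rule["column"]]
    if kind == "range":
        # bounds: sorted list of (lower_inclusive, upper_exclusive, name)
        for lo, hi, name in rule["bounds"]:
            if lo <= val < hi:
                return name
    elif kind == "list":
        for values, name in rule["lists"]:
            if val in values:
                return name
    elif kind == "hash":
        return rule["partitions"][hash(val) % len(rule["partitions"])]
    return None  # no match: raise an error or use a default partition

rule = {"kind": "range", "column": "order_date",
        "bounds": [(0, 100, "p_q1"), (100, 200, "p_q2")]}
print(route(rule, {"order_date": 150}))  # p_q2
```

COPY would call something like this once per incoming tuple, which is exactly the per-row routing overhead Simon objects to; the counter-argument in the thread is that it is still far cheaper than the current rule/trigger machinery.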
Re: [HACKERS] VLDB Features
On Wed, Dec 12, 2007 at 08:26:16PM +0100, Markus Schiltknecht wrote: Isn't Gavin Sherry working on this? Haven't read anything from him lately... Me neither. Swallowed by Greenplum and France. Hm.. good for him, I guess! Yes, I'm around -- just extremely busy with a big release at Greenplum as well as other Real Life stuff. Thanks, Gavin ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Adjusting index special storage for pg_filedump's convenience
On Mon, 9 Apr 2007, Tom Lane wrote: We put in a workaround a long time ago to make it possible to tell the difference between btree and hash special space, which are also the same size: there's an unused 16 bits in hash special space that we fill with a specific value. As of 8.2 this doesn't work as well as it used to, because the corresponding space in a btree page is now used for a vacuum cycle ID and so there's 1 chance in 65536 of a false match. Still, it's a lot better than nothing. Sounds... reasonable. Especially if you add the flags test below. I'd like to tweak things for 8.3 so that pg_filedump can work reasonably well again. It looks like the hash solution would work for gist, gin, and bitmap: rearranging fields would allow us to put in a 16-bit ID field in all three cases. (For bitmap, I'm assuming that bm_hrl_words_used could be reduced to 16 bits without problem --- it is a per-page count not something larger, right?) Yes, I've reduced this already but hadn't in previous patches, from memory. I'd add a filler of uint16 now. Got a number I should use? One problem with that is that with four special values, there'd be 1 chance in 16384 of misidentifying a btree page because of chance values of the vacuum cycle ID. This can be improved a bit if we put the flags fields (for those index types that have 'em) in a consistent place too: we can disbelieve that an index is of type X if it doesn't have a flags value that fits. I don't see any way to make it completely bulletproof without enlarging the special space, which seems an unreasonable price to pay. But even one chance in 16K is way better than the current situation. Sounds like the only workable approach. Thanks, Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
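A quick sanity check of the odds quoted above (just arithmetic, no PostgreSQL internals): the spare slot is 16 bits, and btree reuses it for a vacuum cycle ID that can take any value.

```python
# The 16-bit field in the index special space that the magic ID
# values live in; btree fills it with an arbitrary vacuum cycle ID.
ID_SPACE = 2 ** 16

# One magic value (the old hash-vs-btree trick): 1 chance in 65536
# that a random cycle ID happens to match it.
p_one = 1 / ID_SPACE

# Four magic values (hash, gist, gin, bitmap): 1 chance in 16384.
p_four = 4 / ID_SPACE

print(int(1 / p_one), int(1 / p_four))  # 65536 16384
```

Checking the flags field for a plausible value shrinks the false-match window further, which is the improvement Tom suggests.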
Re: [HACKERS] Bitmap index thoughts (another segfault)
I'm seeing a segfault on a TPC-H size 10 database. The patch and code are: - bitmap patch from 12 Mar - 8.3 dev from 27 Mar Thanks Mark. I tracked this down. I'll post a new patch soon. Gavin ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Bitmap index thoughts (another segfault)
On Sat, 7 Apr 2007, Mark Kirkwood wrote: Mark Kirkwood wrote: bitmap=# SELECT count(*) FROM bitmaptest WHERE val1 in (1,7) AND val0 IN (4,3) ; ERROR: XX000: unknown stream type 2 LOCATION: stream_add_node, tidbitmap.c:1033 Thanks. Turned out to be a typo in stream_add_node()! I'll post a new patch soon. Thanks for the test kit and your testing! Gavin ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Many unfinished patches
I am currently finishing off an improved VACUUM implementation for bitmaps. The rest of the patch is ready for review. I will try and post a patch within 24 hours. Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Bitmapscan changes - Requesting further feedback
On Tue, 20 Mar 2007, Joshua D. Drake wrote: Hackers et al... I was wondering if there are any outstanding issues that need to be resolved in terms of the clustered index/bitmap changes? From the testing that I have done, plus a couple of others it is a net win (at least from DBA space). Not sure if you're talking about bitmap indexes here. If so, I'm working on VACUUM support. Gavin ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Why is this allowed?
On Sat, 10 Mar 2007, Chuck McDevitt wrote: Ok... Just to be clear, the ISO SQL spec says that INTERVAL '1' DAY is the correct way to specify a one-day interval. That's why it is surprising that PostgreSQL treats it differently, with no error or warning. The PostgreSQL syntax INTERVAL '1 DAY' is non-standard. Is fixing this on the TODO list? Yes: Add ISO INTERVAL handling There's detail there about the nature of support required. Personally, I think the syntax is awful. I don't understand why intervals are handled so differently in the spec as compared to other data types. This syntax is well supported by Oracle and DB2 though. I guess it should be completed in postgres. Thanks, Gavin ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Stream bitmaps
On Thu, 8 Mar 2007, Heikki Linnakangas wrote: Hi Gavin, Any progress? Really busy at the moment, but it's on my TODO list for today. Thanks, Gavin ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!
On Wed, 7 Mar 2007, Josh Berkus wrote: Approximate queries is something that DSS users *want*. Jim Gray addressed this in his ACM editorial on the databases of the future. It's something that *I* want, and if the Greenplum people aren't speaking up here, it's because they're not paying attention. Now, I don't know if this Skyline patch is our answer for approximate queries. Maybe I should pester Meredith about getting QBE free of its IP issues; it certainly looked more flexible than Skyline. In either case, the code probably needs a complete refactor. What people want from approximate queries is different to this: the desire is usually to balance run time with level of accuracy/quality (sometimes the desire is to have accurate results as well as similar results). Neither Skyline nor QBE is about this. The only thing in the spec which addresses this is 'tablesample'. Thanks, Gavin ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!
On Tue, 6 Mar 2007, Alvaro Herrera wrote: Also, keep in mind that there were plenty of changes in the executor. This stuff is not likely to be very easy to implement efficiently using our extant executor machinery; note that Ranbeer mentioned implementation of block nested loop and other algorithms. Not sure how easy it would be to fold that stuff into the optimizer for multi-input aggregates, instead of hardwiring it to the SKYLINE OF syntax. Yes, there's been a lot of work on calculating skyline efficiently, with different sorting techniques and so on. This is the most interesting part of the idea. You could calculate the query Ranbeer gave using pure SQL and, perhaps, use of some covariance aggregates or something already. Of course, it gets harder when you want to calculate across many dimensions. Personally, I'd love to see some of these newer data analysis capabilities added to PostgreSQL -- or at least put out there as interesting patches. Thanks, Gavin ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Stream bitmaps
Heikki, On Mon, 5 Mar 2007, Heikki Linnakangas wrote: Hi all, I'd like to see the indexam API changes needed by the bitmap indexam to be committed soon. Has anyone looked at the proposed API in the latest patch? Any thoughts? Thanks for looking at it! I'm quite happy with it myself, with a few reservations: - All the getbitmap implementations except the new bitmap indexam are just boilerplate. How about making the getbitmap function optional, and having a generic implementation that fills in a hash bitmap using the traditional getnext function? - getbitmap is passed an existing bitmap as argument, and the implementation needs to OR the existing bitmap with new tuples. How about AND? An indexam could be smart about ANDing with an existing bitmap, for example skipping to the first set bit in the existing bitmap and starting the scan from there. - I'd like to have support to return candidate matches with both getbitmap and getnext. A simple flag per page of results would be enough for getbitmap, I think. - StreamBitmap and HashBitmap are separate node types, but OpStream is not. The opaque field in the StreamBitmap struct is not really that opaque, it needs to be a StreamNode. I drew a UML sketch of what I think the class-hierarchy is (http://community.enterprisedb.com/streambitmaps.png). This is object-oriented programming, we're just implementing classes and inheritance with structs and function pointers. The current patch mixes different techniques, and that needs to be cleaned up. All good ideas, I'll look at them more closely in the morning. I'd like to see a separate patch that contains just the API changes. Gavin, could you extract an API-only patch from the bitmap index patch? I can work on it as well, but I don't want to step on your toes. Okay, I can do that. Thanks, Gavin ---(end of broadcast)--- TIP 6: explain analyze is your friend
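Heikki's first point, a generic getbitmap built on top of getnext, can be sketched like this (illustrative Python with invented names, not the actual indexam API):

```python
# Sketch of a generic getbitmap fallback: fill a hash bitmap (here
# modelled as a plain set of TIDs) by draining the traditional
# getnext interface. Only bitmap-aware AMs would override this.

def generic_getbitmap(scan, getnext):
    """Call getnext until exhausted, collecting every matching TID."""
    bitmap = set()
    while True:
        tid = getnext(scan)
        if tid is None:       # stand-in for "no more tuples"
            break
        bitmap.add(tid)
    return bitmap

# A fake scan yielding three matching TIDs as (block, offset) pairs.
tids = iter([(0, 1), (0, 5), (2, 3)])
result = generic_getbitmap(None, lambda scan: next(tids, None))
print(sorted(result))  # [(0, 1), (0, 5), (2, 3)]
```

The appeal is that every existing index AM would get bitmap-scan support for free, and only AMs with a genuinely cheaper bulk path (like the bitmap index itself) would supply a native getbitmap.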
Re: [HACKERS] [PATCHES]
On Wed, 28 Feb 2007, Tom Lane wrote: AFAICT, the footer in question tries to make it illegal for us even to have the message in our mail archives. If I were running the PG lists, I would install filters that automatically reject mails containing such notices, with a message like Your corporate lawyers do not deserve to have access to the internet. Go away until you've acquired a clue. I just noticed Bruce's email suggesting that this is going ahead: http://archives.postgresql.org/pgsql-performance/2007-03/msg00098.php I do not think this is a good idea. These disclaimer messages are generally tacked on by the MTA of the company at which the person works, as others have noted. There seem to be about 10 such emails to general per month, not sure about other lists. FWIW, usually it is just one or two offenders. We do not suffer this problem in isolation. I think the Debian project has tried to address this partly with this: http://www.debian.org/MailingLists/disclaimer Couldn't we place a disclaimer on the subscription page and welcome email which says some of the interesting things in this disclaimer and says that code contributions are implicitly licensed BSD? No idea about the legal issues involved. Another way of looking at this is that we cannot be 100% thorough. What about disclaimers in German? What about false-positives? Another concern I've had is that the assumption is that those wanting to contribute (code) will just have to go register some throwaway hotmail or yahoo account. I think this will make tracing the origin of code more difficult in the long term. Just my thoughts. Gavin ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bug: Buffer cache is not scan resistant
On Mon, 5 Mar 2007, Mark Kirkwood wrote: To add a little to this - forgetting the scan resistant point for the moment... cranking down shared_buffers to be smaller than the L2 cache seems to help *any* sequential scan immensely, even on quite modest HW: e.g: PIII 1.26GHz 512KB L2 cache, 2G ram, SELECT count(*) FROM lineitem (which is about 11GB) performance: shared_buffers = 400MB: elapsed 101 s; shared_buffers = 128KB: elapsed 74 s. When I've profiled this activity, I've seen a lot of time spent searching for/allocating a new buffer for each page being fetched. Obviously having less of them to search through will help, but having less than the L2 cache-size worth of 'em seems to help a whole lot! Could you demonstrate that point by showing us timings for shared_buffers sizes from 512K up to, say, 2 MB? The two numbers you give there might just have to do with managing a large buffer. Thanks, Gavin ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [PATCHES] - WIP Patch Updatable Cursor
On Wed, 28 Feb 2007, John Bartlett wrote: Hi, A list of ctids is stored in the file. I would have thought these would be stored in memory. If the set got large, you'd use a temporary file the way other systems which overflow to disk do? The file is used to store the ctids during an updatable cursor transaction. It is set up as a permanent file as it has a potential lifetime of preserving data between crashes of the backend. Temporary files tend to be used for data that is defined within a single command. In this case the file needs to exist within a transaction and across backend processes. It does not. Cursors are implicitly closed when a session is closed. A backend crash or system restart closes all open sessions. The file gram.y has been corrected in my version. The files ctidListStore.c and ctidListStore.h were pasted into the patch file, as the diff -N command produced a file of several hundred thousand lines. Edit the file with a text editor. If you know which files should be excluded (like tags files), use diff --exclude=pattern. Thanks, Gavin ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Bitmap index stuff
On Mon, 26 Feb 2007, Heikki Linnakangas wrote: Hi, How are you doing with the bitmap indexes? I need to send off a patch fixing the last bug you pointed out. The code needs a merge of HEAD. I've been trying to get my head around the patch a couple of times to add the vacuum support, but no matter how simple I try to keep it, I just always seem to get stuck. It looks like vacuum support would need: - something similar to read_words, let's call it iterate_set_bits, that returns each set bit from a bitmap vector, keeping the buffer locked over calls. - ability to clear the bit returned by iterate_set_bits. If normal index scans also used this, the same functions could be used to support the kill_prior_tuple thingie. Okay. The above also needs to be able to recompress a page if it gets fragmented by repeated setting and clearing of bits. Yes. I still feel that the data structures are unnecessarily complex. In particular, I'd like to get rid of the special-cased last_word and last_comp_word in the lov item. Perhaps we could instead embed a normal, but smaller, BMBitmapData structure in the lov item, and just add a length field to that? I'm not sure that this really simplifies the code. I agree things could be simpler though. You have a lot of code to support efficient building of a bitmap index. I know you've worked hard on that, but do we really need all that? How did the bitmap build work in the previous versions of the patch, and how much faster is the current approach? I included details in a previous email, I thought. Basically, in cases where the data is distributed as follows: 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 ... We're very fast in both versions. If the data is distributed as: 1 2 3 4 5 6 1 2 3 4 5 6 In the original version(s), we were terribly slow (in my test, 7 times slower than btree). Considering the kind of data sets bitmap suits, this made bitmap unusable. With the rewrite, we're much faster (in my test, faster than btree).
The test case was: a table with 600M rows with 100,000 distinct keys to be indexed. BTW: It occurred to me that since we're piggybacking on b-tree's strategy numbers, comparison operators etc., conceivably we could also use any other indexam. For example, a bitmap GiST would be pretty funky. We'll probably leave that for future versions, but have you given that any thought? True. I haven't given it any thought though. Interesting... I'd have to think of some interesting data sets which would suit the capabilities (operators) we have with GiST. Thanks, Gavin ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
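The iterate_set_bits idea above can be sketched in a few lines. This is an illustration only, not the patch's actual C code: words are 8 bits for readability, and the `('lit', …)` / `('fill', …)` word representation is a made-up stand-in for an HRL (hybrid run-length) encoded bitmap vector, where a fill word encodes a run of all-zero or all-one words.

```python
# Illustrative sketch of iterating the set bits of an HRL-style
# compressed bitmap vector (8-bit words for readability).
WORD_BITS = 8

def iterate_set_bits(words):
    """Yield positions of set bits. Each word is a tuple:
    ('lit', bits)    -- literal 8-bit bitmap word
    ('fill', bit, n) -- n consecutive words of all-`bit` bits."""
    pos = 0
    for word in words:
        if word[0] == 'lit':
            bits = word[1]
            for i in range(WORD_BITS):
                if bits & (1 << i):
                    yield pos + i
            pos += WORD_BITS
        else:  # fill word: a run of n identical words
            _, bit, n = word
            if bit:
                for i in range(n * WORD_BITS):
                    yield pos + i
            pos += n * WORD_BITS

# One literal word 0b00000101, then 2 all-one fill words,
# then 1 all-zero fill word.
vec = [('lit', 0b00000101), ('fill', 1, 2), ('fill', 0, 1)]
```

Clearing a bit inside a fill word is what forces the recompression problem mentioned above: the run has to be split into literal words, which can grow the page.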
Re: [HACKERS] SCMS question
On Thu, 22 Feb 2007, Tom Lane wrote: Gregory Stark [EMAIL PROTECTED] writes: If we want to minimize the pain of changing and keep the same mode of operation Subversion is definitely the right choice. Its goal was to provide the same operational model as CVS and fix the implementation and architectural problems. Erm ... but this is not an argument in favor of changing. AFAIR the only real disadvantage of CVS that we've run up against is that it's hard to shuffle files around to different directories without losing their change history (or more accurately, making the history harder to find). Now that is a pretty considerable annoyance on some days, but it's not sufficient reason to change to something else. I have no doubt that every other SCMS has annoyances of its own. It's not a problem for the project but I personally experience pain with CVS. I often want to take a bunch of commits and merge them into separate trees (like Greenplum DB or my private bitmap index tree). This is a lot easier with the patch-set-based SCMs like darcs/monotone/git/etc. Just my thoughts. Gavin ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Status of Hierarchical Queries
On Wed, 21 Feb 2007, Jonah H. Harris wrote: As was discussed in several threads, I'd handed over the responsibility of hierarchical queries to Greg Stark several weeks ago. He posted a preliminary patch which I don't believe anyone looked at. For 8.3's sake, I wanted to make sure we get the status of this out on the table so there won't be any surprises like those related to 8.2. Where are we at? Has anyone reviewed the preliminary work? Any comments, suggestions, etc? Yes, I looked at it. The WITH support seems okay. I guess I'd thought it might be represented differently internally (not a subquery) but the approach Greg has taken is probably more straightforward (in that you get a lot of proven code for free). It should work fine for recursive queries too, if you just re-seed the param keys for every scan of the 'sub-query'. I wonder if anyone can think of a good way to cost the recursive side of the query. I'm still pre-coffee and it hurts my head :). Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Status of Hierarchical Queries
On Wed, 21 Feb 2007, Gregory Stark wrote: Gavin Sherry [EMAIL PROTECTED] writes: The WITH support seems okay. I guess I'd thought it might be represented differently internally (not a subquery) but the approach Greg has taken is probably more straightforward (in that you get a lot of proven code for free). It should work fine for recursive queries too, if you just re-seed the param keys for every scan of the 'sub-query'. I don't think it works for recursive queries, since you can't have the same executor plan in motion for two different sets of parameters simultaneously. That's why I was talking about a Memoize node. Can you elaborate on the 'two different sets of parameters' bit? I'm still without coffee. It is sufficient for the non-recursive case which might make it worthwhile putting it in 8.3. But even there users' expectations are probably that the reason they're writing it as a CTE is precisely to avoid duplicate execution. I wonder if the planner should decide that? Thanks, Gavin ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Status of Hierarchical Queries
On Thu, 22 Feb 2007, Gregory Stark wrote: Gavin Sherry [EMAIL PROTECTED] writes: Can you elaborate on the 'two different sets of parameters' bit? I'm still without coffee. The spec allows for arbitrarily complex recursive query structures, including mutually recursive queries, and even non-linearly recursive queries. I found grokking them requires far stronger brews than coffee :) Hehe. But in a simple recursive tree search you have a node which wants to do a join between the output of tree level n and some table to produce tree level n+1. It can't simply execute the plan to produce tree level n since that's the same tree it's executing itself. If it calls the Init method on itself it'll lose all its state. There's another reason it can't just execute the previous node. You really don't want to recompute all the results for level n when you go to produce level n+1. You want to keep them around from the previous iteration. Otherwise you have an n^2 algorithm. Right. When I've spent some idle cycles thinking through this in the past I figured that in a non-trivial query, we'd end up with a bunch of materialisations, one for each level of recursion. That sounds very ugly. It is sufficient for the non-recursive case which might make it worthwhile putting it in 8.3. But even there users' expectations are probably that the reason they're writing it as a CTE is precisely to avoid duplicate execution. I wonder if the planner should decide that? That's one option. For the non-recursive case we could inline the CTE subquery everywhere it's referenced and then add smarts to the planner to find identical subqueries and have a heuristic to determine when it would be advantageous to calculate the result once. The alternative is to retain them as references to a single plan. Then have a heuristic for when to inline them. In neither case is a heuristic going to be particularly good. 
The problem is that for any reasonably complex plan it'll be cheaper to execute it only once than multiple times. Unless there's an advantage to be gained by inlining it such as being able to push conditions down into it. But the only way to find out if that will be possible would be to try planning it and see. Pushing down predicates was the exact idea I had in mind. Thanks, Gavin ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
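The level-at-a-time evaluation Greg describes, keeping only the previous level's rows around rather than re-executing the whole plan per level, can be sketched abstractly. This is an illustration of the algorithm under discussion, not PostgreSQL code; the edge table and helper names are invented for the example.

```python
# Sketch of level-wise recursive-union evaluation: the "working table"
# holds only the previous level, and each iteration joins it against
# the base data, avoiding the O(n^2) recompute-everything approach.
def recursive_union(seed_rows, step):
    """seed_rows: the non-recursive term; step(level) -> next level."""
    result = list(seed_rows)
    level = list(seed_rows)
    while level:
        level = step(level)   # joins the previous level only
        result.extend(level)
    return result

# Tree edges as (child, parent); find all descendants of node 1.
edges = [(2, 1), (3, 1), (4, 2), (5, 4)]

def children_of(level):
    ids = {node for node, _ in level}
    return [(c, p) for c, p in edges if p in ids]

descendants = recursive_union([(1, None)], children_of)
```

Each iteration here touches only the frontier, which is exactly the property that a naive "re-execute the subplan per level" scheme loses.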
Re: [HACKERS] infinity and DATE type
On Thu, 22 Feb 2007, Warren Turkal wrote: On Thursday 22 February 2007 00:20, Tom Lane wrote: Nobody got round to it. I believe it's listed in the TODO file ... It's not at [1]. Should someone add it to the TODO? [1] http://www.postgresql.org/docs/faqs.TODO.html Yes it is. Allow infinite dates and intervals just like infinite timestamps Gavin ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] TopPlan, again
On Sun, 18 Feb 2007, Tom Lane wrote: We've repeatedly discussed getting rid of execution-time access to the Query structure --- here's one old message about it: http://archives.postgresql.org/pgsql-hackers/1999-02/msg00388.php and here's a recent one: http://archives.postgresql.org/pgsql-hackers/2006-08/msg00734.php I think it's time to bite the bullet and do that. Great. The other big problem is the rangetable (rtable): currently it contains Query trees for subqueries (including views) so unless we clean that up we aren't going to be all that far ahead in terms of reducing the overhead. I'm envisioning creating a compact rangetable entry struct with just the fields the executor needs:
    rtekind
    relid
    eref (might only need the table alias name not the column names)
    requiredPerms
    checkAsUser
and flattening subquery rangetables into the main list, so that there's just one list and rangetable indexes are unique throughout a plan tree. That will allow subqueries to execute with the same EState as the main query and thus simplify nodeSubplan and nodeSubqueryScan. This list will also provide a simple way for the plan cache module to know which relations to lock before determining whether the plan has been invalidated. Cool. Comments, objections? Also, any thoughts about the names to use for these new node types? As I commented last year, I'm not completely happy with TopPlan because it won't actually be a subtype of Plan, but I don't have a better idea. Also I'm unsure what to call the cut-down RangeTblEntry struct; maybe RunTimeRangeTblEntry? I think TopPlan is misleading. What about MetaPlan instead of TopPlan? I think RunTimeRangeTblEntry is okay, though long. ExecRangeTblEntry? Thanks, Gavin ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bitmap index thoughts
On Thu, 8 Feb 2007, Heikki Linnakangas wrote: Gavin Sherry wrote: I will update the code tomorrow. The focus will be cleaning up the executor modifications. Please look else where for now. I'm getting a segfault with this test script: CREATE TABLE bmtest (i int); INSERT INTO bmtest SELECT 1 FROM generate_series(1,10) a; INSERT INTO bmtest SELECT 10 FROM generate_series(1,100) a; DELETE FROM bmtest WHERE i = 1; VACUUM bmtest; CREATE INDEX i_bmtest ON bmtest USING bitmap (i); INSERT INTO bmtest SELECT 10 FROM generate_series(1,10) a; Hmm... this triggers a bug in the newly rewritten update code I think. I'll post a fix soon. Thanks for testing! Gavin ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] --enable-debug does not work with gcc
On Fri, 2 Feb 2007, NikhilS wrote: Hi, configure with --enable-debug does not seem to add -g to CFLAGS while compiling with gcc. Guess we will need to change configure.in as below: Erm... works for me and everyone else... AFAIK. Thanks, Gavin ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] --enable-debug does not work with gcc
On Fri, 2 Feb 2007, NikhilS wrote: Hi, Indeed it does, apologies for not doing the entire groundwork. But what it also does is that it adds -O2 by default for gcc even when --enable-debug is specified. gdb is not able to navigate the stack traces properly with this optimization in place. Especially tracing of static functions becomes difficult. Has this issue been faced by anybody else? If so, I can try out a patch to avoid using -O2 with enable-debug. Yes, this is known. The thing with gcc is, it only emits some warnings at -O2. I'm not sure that this is why we do not set optimisation to 0 but have long assumed it to be the case. I imagine that it's fairly standard practice for people doing debugging to pass CFLAGS=-O0 as an argument to configure. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Bitmap index thoughts
On Thu, 1 Feb 2007, Bruce Momjian wrote: Where are we on this patch? Does it have performance tests to show where it is beneficial? Is it ready to be reviewed? Here's an updated patch: http://www.alcove.com.au/~swm/bitmap-2007-02-02.patch In this patch, I rewrote the index build system. It was fast before for well-clustered data but for poorly clustered data, it was very slow. Now, it is pretty good for each distribution type. I have various test cases but the one which showed bitmap in a poor light was a table of 600M rows. The key to the table had a cardinality of 100,000. When the table was loaded with keys clustered, the build time was 1000 seconds with bitmap (2200 with btree). With poorly clustered data (e.g., the index key was (1, 2, 3, ..., 6000, 1, 2, 3, ...)), the build time for bitmap was 14000 seconds! So, I rewrote this to compress data using HRL encoding (the same scheme we use in the bitmap AM itself). Now, clustered data is just as fast and unclustered data is 2000 seconds. The select performance at a cardinality of 100,000 is similar to btree but faster with lower cardinalities. Jie also contributed a rewrite of the WAL code to this patch. Not only is the code faster now, but it handles the notion of incomplete actions -- like btree and friends do. The executor code still needs some work from me -- Jie and I have dirtied things up while experimenting -- but we would really like some review of the code so that this can get squared away well before the approach of 8.3 feature freeze. One of the major deficiencies remaining is the lack of VACUUM support. Heikki put his hand up for this and I'm holding him to it! ;-) I will update the code tomorrow. The focus will be cleaning up the executor modifications. Please look elsewhere for now. Thanks, Gavin ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bitmap index thoughts
On Thu, 1 Feb 2007, Bruce Momjian wrote: Where are we on this patch? Does it have performance tests to show where it is beneficial? Is it ready to be reviewed? I've got an updated patch which adds significant performance improvements for worst-case data distributions. It also contains a rewrite of the WAL code to handle incomplete actions. I haven't worked on the stuff discussed below with Heikki. It's a lot of work and probably more suitable for a second generation. I've just got to finish testing the merge of Tom's operator family stuff and then I'll send off the patch and performance figures. Thanks, Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Data archiving/warehousing idea
On Thu, 1 Feb 2007, Chris Dunlop wrote: G'day hackers, G'Day Chris, already - I couldn't find anything in the mail archives, but that doesn't mean it's not there...) There has been a lot of discussion about this kind of thing over the years. The main idea is that there might be space utilisation and performance advantages if postgres had hard read-only tables, i.e. tables which were guaranteed (by postgres) to never have their data changed (insert/update/delete). This could potentially save disk space by allowing bookkeeping elements in the page/tuple headers to be removed, e.g. visibility information etc. Also, some indexes could potentially be packed tighter if we know the data will never change (of course this is already available using the fillfactor control). Well, there is also CPU overhead doing MVCC but there are a few fundamental problems that must be overcome. The most significant is that no useful table is always read only, otherwise you could never load it. What do we do in the presence of a failure during the load or a user-issued ABORT? I guess we'd truncate the table... What about replay after a crash? Another way of looking at it is, we use the 'bookkeeping' information in the tuple header for concurrency and for handling the abort of the transaction. The idea would be to introduce a statement something like: ALTER TABLE foo SET ARCHIVE; I'd not thought of that approach. There are two problems: some archive tables are so large that loading them and then reprocessing them isn't appealing. Secondly, we'd be rewriting the binary structure of the table and this does not suit the non-overwriting nature of Postgres's storage system. A different approach discussed earlier involves greatly restricting the way in which the table is used. This table could only be written to if an exclusive lock is held; on error or ABORT, the table is truncated. 
The problem is that a lot of this looks like a hack and I haven't seen a very clean approach which has gone beyond basic brain dump. Thanks, Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Questions about warnings
On Thu, 25 Jan 2007, Magnus Hagander wrote: I'm looking over the VC build trying to eliminate what warnings are left. One thing that appears in a couple of places is stuff like: .\src\bin\psql\print.c(2014): warning C4090: 'function' : different 'const' qualifiers Seems like other projects have encountered this problem. Looks like a simple type difference. Casting to (void *) should fix it. http://mirror.ethereal.com/lists/ethereal-dev/200502/msg00170.html Gavin ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] New feature proposal
On Wed, 24 Jan 2007, Sorin Schwimmer wrote: Dear Developers, I would like to suggest the inclusion of an extension in PostgreSQL. There are instances, I found, when one needs to INSERT several times the same record in a table. The front-end application can do it easily in a loop of sorts, but on remote servers (and that's the norm these days) it creates unnecessary network traffic. My suggestion is to allow INSERT to do it REPEAT x. This should allow, in my view, the following: a) INSERT INTO my_table (field1, field2, field3) VALUES (value1, value2, value3) REPEAT 5;
postgres=# create table baz (i int, j text);
CREATE TABLE
postgres=# insert into baz (i, j) select 1, 'hello' from generate_series(1, 5);
INSERT 0 5
postgres=# select * from baz;
 i |   j
---+-------
 1 | hello
 1 | hello
 1 | hello
 1 | hello
 1 | hello
(5 rows)
b) INSERT INTO my_table (field1, field2, field3) VALUES (x, value2/x, value3) REPEAT (x=3); should insert the following:
1, value2, value3
2, value2/2, value3
3, value2/3, value3
Yuk! Besides, it can be done similarly to the above. This suggestion comes from a practical project that I have. Well, the good thing is, you can use generate_series() now! :-)
Re: [HACKERS] Planning aggregates which require sorted or distinct
On Sat, 20 Jan 2007, Simon Riggs wrote: On Sat, 2007-01-20 at 15:58 +1100, Gavin Sherry wrote: On Fri, 19 Jan 2007, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: On Fri, 19 Jan 2007, Tom Lane wrote: Er, what primary key would that be exactly? And even if you had a key, I wouldn't call joining on it trivial; I'd call it expensive ... I should have used slightly different language. What I meant to say was, both sets are primarily sorted by saledate so they can be merged back together. This is why I said it was trivial. Yes, your fan out plan sounds best, together with the optimisation to remove whatever you call the individual strands of the fan-out. Think of a good name now, so we can discuss it more easily... Hah. Yes. I've been using the term 'subplan' but it isn't a good one. Can we just choose something 'cool' like iPlan2006? ;) Yep. I was thinking about this all morning. I think I've over engineered the problem in my head. Window function input just looks like a slightly more complex distinct aggregate input. I'll think on it more though. I'm working on modifying Tuplestore that will allow you to store the last N tuples, rather than all previous input. This is specifically designed to allow Merge Join to do Mark/Restore without materializing the whole sort set. This can also be used to materialize Windows (i.e. window clause in SQL:2003), so you can store the current row plus n PRECEDING and/or n FOLLOWING rows as appropriate. Reading from the Window would then be similar-ish to doing a Mark/Restore pair, which we can rename to MarkWindowStart and ReturnToWindowStart. Wow. What a coincidence! Windows are slightly more complex though. 
As you probably know, there are two ways of specifying the window frame: by an absolute number of rows (ROWS N PRECEDING, for example); or, by a 'range' (RANGE N PRECEDING), where the range, in the case of 'preceding', is determined by subtracting the range parameter from the value of the current field -- i.e., the window attribute. This presents a problem in terms of managing the size of the buffer. If you have data clustered around a value then the amount of data input into the window function when we cross this value explodes. The tuplestore code can or will deal with this but it is currently not designed for this kind of usage pattern. That is, every time we advance a row we must (a) recalculate the multiset to be input to the window function and (b) generate the value from the window function by passing the input to it. The problem arises when the window contains more data than can be stored in work_mem. Then, each time we need to recalculate the window function, we'll churn data through the buffer and get no effect from the buffering itself. A lot of the research around window functions recommends offsetting this effect by buffering data 'pre-aggregated' for each distinct value in the buffer. Most of the research, however, relies on a trick we don't have available in the SQL window function spec: windows in SQL can have a partition (irrelevant here), data sort order and a range; windows in the world of window function streaming data research have this and a 'slide'. Slide is the interval at which the aggregate is regenerated and in SQL the value is regenerated for every new row. The research concedes that at this level, preaggregation either stops making sense or is very complex. I've come up with a way to be able to do it, but not for all aggregates (median() for example). I'll discuss this in the proposal to be sent through soon. The problem is, the requirements we have for buffering data around window functions could be very complex. 
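The ROWS-versus-RANGE buffering asymmetry described above can be sketched concretely. This is an illustration of the semantics, not tuplestore code: for ROWS n PRECEDING a bounded buffer of n+1 rows suffices, while for RANGE n PRECEDING the buffer must hold every row whose sort-key value falls within n of the current key, which is unbounded for data clustered around one value.

```python
# Sketch: bounded vs unbounded window buffering.
from collections import deque

def moving_sum_rows(values, n_preceding):
    """SUM(x) OVER (ROWS n PRECEDING): buffer is always n+1 rows."""
    buf = deque(maxlen=n_preceding + 1)
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf))
    return out

def moving_sum_range(rows, n_preceding):
    """SUM(x) OVER (ORDER BY key RANGE n PRECEDING): buffer size
    depends on the data distribution, not on n.  rows: (key, x),
    pre-sorted by key."""
    buf = []
    out = []
    for key, v in rows:
        buf.append((key, v))
        # keep only rows whose key lies within the frame
        buf = [(k, x) for k, x in buf if k >= key - n_preceding]
        out.append(sum(x for _, x in buf))
    return out
```

With many rows sharing one key, `buf` in the RANGE variant grows with the cluster size, which is exactly the work_mem churn problem noted in the message.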
I'll present the prototype shortly, I've got a few bugs, but the basic principle is working already. I'm happy to craft that to your exact needs, so that you'll be able to press ahead with the rest of the implementation of Windowed functions. Awesome. I will get the proposal off so that we can discuss the requirements at length. The Materialize node is almost unchanged, but I was thinking of renaming it when used in this way to make the EXPLAIN more readable. Perhaps we should call it a Materialized Window for both the MJ and Window function cases. I think that would be confusing in the case of MJ. This won't help with UNBOUNDED window definitions, but I imagine that these are best handled by running aggregates which would be an O(N) operation, rather than recalculating everything each time, which would be O(N^2). Correct. You only need to recalculate the aggregate if it has a moving window frame. To bring out a slightly different point -- and I know this is putting the cart before the horse -- but window functions will (potentially) output rows in the wrong order. I made a passing reference to this earlier. For example, say we have a table employees with the following data: empno | salary | age 1 | 2000 | 50
Re: [HACKERS] Planning aggregates which require sorted or distinct
On Sat, 20 Jan 2007, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: We want to answer the following: for each employee: what is their rank in terms of salary and what is their rank in terms of age. This query answers that: select empno, rank() over (order by salary) as srank, rank() over (order by age) as arank from employees order by empno; Eeek. This seems like the worst sort of action-at-a-distance. How does rank() know what value it's supposed to report the rank of? This is a frustratingly inconsistent bit of the spec. Rank is defined as follows: RANK() OVER WNS is equivalent to: ( COUNT (*) OVER (WNS1 RANGE UNBOUNDED PRECEDING) - COUNT (*) OVER (WNS1 RANGE CURRENT ROW) + 1 ) Say the salary column has the following values: {100, 200, 200, 300}. This would give the following output: {1, 2, 2, 4}. DENSE_RANK() would give: {1, 2, 2, 3}. These functions are pretty ugly (if you think about them in terms of our existing aggregates). However, they are by far the most heavily used window functions (along with ROW_NUMBER()). Thanks, Gavin ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Planning aggregates which require sorted or distinct
On Sat, 20 Jan 2007, Gregory Stark wrote: However for RANGE UNBOUNDED PRECEDING we can apply a different plan. Keep the state variable for each window aggregate around for the entire time. For each record apply the state transition function then apply the FINAL function to generate the result for that record but keep the state variable as it was for the next record. Yes, I made reference to this case else where in the email (or was it a different email in the thread?). Thanks, Gavin ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
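The running-aggregate plan described above, keeping the state variable across rows and applying the final function per row, can be sketched generically. The transition/final function interface here is an illustration modeled loosely on aggregate conventions, not actual executor code.

```python
# Sketch: RANGE UNBOUNDED PRECEDING as a running aggregate.
def running_aggregate(values, init, transfn, finalfn):
    state = init
    out = []
    for v in values:
        state = transfn(state, v)   # advance the state once per row
        out.append(finalfn(state))  # emit a result without resetting state
    return out

# Running average: state is (sum, count).
avgs = running_aggregate(
    [10, 20, 30],
    (0, 0),
    lambda s, v: (s[0] + v, s[1] + 1),
    lambda s: s[0] / s[1],
)
```

This is O(N) overall, versus O(N^2) for recomputing the whole frame at every row.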
Re: [HACKERS] Planning aggregates which require sorted or distinct
On Fri, 19 Jan 2007, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: What we want to do is have a kind of 'sub plan' for each aggregate. In effect, the plan might start looking like a directed graph. Here is part of the plan as a directed graph.

          GroupAggregate
         /-^---...
        |       |
        ^       |
        |     Unique
        |       ^
        |       |
      Sort     Sort
  (saledate) (saledate,prodid)
        ^       ^
        |       |
         -- Fan Out ...
             ^
             |
           Scan

This idea was presented by Brian Hagenbuch at Greenplum. He calls it a 'Fan Out' plan. It is trivial to rejoin the data because all data input to the aggregates is sorted by the same primary key. Er, what primary key would that be exactly? And even if you had a key, I wouldn't call joining on it trivial; I'd call it expensive ... I should have used slightly different language. What I meant to say was, both sets are primarily sorted by saledate so they can be merged back together. This is why I said it was trivial. Thanks, Gavin ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Planning aggregates which require sorted or distinct
On Sat, 19 Jan 2007, Karen Hill wrote: Gavin Sherry wrote: Recenly, I've been researching and putting together a proposal for window functions. Implementing NULLS FIRST and NULLS LAST appears like another challenging step to getting window functions wrapped up. Has your research lead you to any ideas on what your strategy for NULLS FIRST and NULLS LAST likely would be? Explicit handling of NULL is only in the Oracle extension to window functions. I will look at the core SQL functionality first -- maybe with the extension of 'LEAD' and 'LAG' because they are heavily utilitised by those using window functions. Besides, Tom's been working on NULLS FIRST/LAST already. gavin ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Planning aggregates which require sorted or distinct
On Fri, 19 Jan 2007, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: On Fri, 19 Jan 2007, Tom Lane wrote: Er, what primary key would that be exactly? And even if you had a key, I wouldn't call joining on it trivial; I'd call it expensive ... I should have used slightly different language. What I meant to say was, both sets are primarily sorted by saledate so they can be merged back together. This is why I said it was trivial. Ah, my misunderstanding. Then isn't this basically isomorphic to what I was thinking of, ie, somewhat-smarter Aggref nodes attached to the existing GroupAggregate plan node? Yep. I was thinking about this all morning. I think I've over-engineered the problem in my head. Window function input just looks like a slightly more complex distinct aggregate input. I'll think on it more though. To bring out a slightly different point -- and I know this is putting the cart before the horse -- but window functions will (potentially) output rows in the wrong order. I made a passing reference to this earlier. For example, say we have a table employees with the following data:
 empno | salary | age
     1 |   2000 |  50
     2 |   6000 |  30
     3 |   3000 |  20
We want to answer the following: for each employee: what is their rank in terms of salary and what is their rank in terms of age. This query answers that: select empno, rank() over (order by salary) as srank, rank() over (order by age) as arank from employees order by empno; The result will be:
 empno | srank | arank
     1 |     1 |     3
     2 |     3 |     2
     3 |     2 |     1
Both window functions provide results based on the order of their input. So, in terms of empno, srank will output in this order: empno = 1, 3, 2; arank will output in this order: empno = 3, 2, 1. We need to glue these back together and the only way I can think how is via a synthetic key. 
Ideally, the planner would have some clue about how large the result set will be, and about the sort orders produced by the window functions, so that it can decide whether to use a nested loop, merge join or hash join to do it. Can you think of a different approach? Thanks, Gavin ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
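The synthetic-key rejoin described in the message can be sketched in miniature: each window function consumes its input in its own sort order, so tag every row with a key, compute each rank separately, and glue the outputs back together on that key (a dict standing in for a hash join). This is an illustration only; it ignores ties, which the employee data happens not to have.

```python
# Sketch: rejoining two differently-ordered window outputs on a key.
def rank_by(rows, keyfn):
    """Return {synthetic_key: rank} for rows ordered by keyfn."""
    ordered = sorted(rows, key=keyfn)
    return {row[0]: i + 1 for i, row in enumerate(ordered)}  # row[0] = empno

employees = [(1, 2000, 50), (2, 6000, 30), (3, 3000, 20)]  # empno, salary, age

srank = rank_by(employees, lambda r: r[1])  # rank by salary
arank = rank_by(employees, lambda r: r[2])  # rank by age

# "Glue back together" via the synthetic key (empno here).
result = [(e[0], srank[e[0]], arank[e[0]]) for e in sorted(employees)]
```

This reproduces the result table from the example: each employee's salary rank and age rank, reassembled in empno order.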
[HACKERS] Planning aggregates which require sorted or distinct input
Recently, I've been researching and putting together a proposal for window functions. I have not finished this but when I do, I will post it. A nice list of examples can be found here[1]. Rather than spend a lot of time talking about the problems window functions present to the planner and executor, I'd like to bring up the topic of an existing piece of SQL which is well understood and presents a similar problem. Consider the following query: select saledate, count(distinct prodid), count(distinct sellerid), sum(price) from sales group by 1; The point here is that aggregates usually just receive the input that the lower level of the plan generates. When qualified with distinct, however, that changes. Notice that the counts on prodid and sellerid receive unique input while the sum on price does not. We do not create separate plan nodes to achieve this. Rather, it is a property of the aggregation code inside the executor[2]. It seems to me that by creating actual plan nodes for this distinct step we can improve the range of options for executing these types of queries. For example, consider a more complex query than the above that groups over a join using a join key of saledate, prodid (and that the planner implements with a merge join). This means that the sort order is preserved and count(distinct prodid) will receive sorted input. As far as I can tell, however, the executor doesn't know this but the planner does. That is, the sort step inside the aggregate code is redundant. Another way it could be improved is if we ever introduce a 'hash distinct' execution node. This has been discussed before. My interest here is not so much to do with distinct aggregates but, rather, window functions. Window functions have this same problem as the input to the functions is generally sorted by different keys. So, hypothetically, let's say we wanted to create a plan for the above query which had an explicit Sort - Unique 'branch' for each of the distinct aggregates. 
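The behaviour described for the example query can be spelled out in a small sketch: per group, the distinct-qualified aggregates consume deduplicated input while sum(price) sees every row. This illustrates the semantics, not the executor's implementation; the column layout follows the sales example in the text.

```python
# Sketch: count(distinct prodid), count(distinct sellerid), sum(price)
# grouped by saledate -- the distinct aggregates get unique input,
# the plain sum does not.
from collections import defaultdict

def aggregate(sales):
    groups = defaultdict(lambda: {'prodids': set(), 'sellerids': set(), 'price': 0})
    for saledate, prodid, sellerid, price in sales:
        g = groups[saledate]
        g['prodids'].add(prodid)      # deduplicated input
        g['sellerids'].add(sellerid)  # deduplicated input
        g['price'] += price           # every row contributes
    return {d: (len(g['prodids']), len(g['sellerids']), g['price'])
            for d, g in groups.items()}

sales = [('d1', 'p1', 's1', 10), ('d1', 'p1', 's2', 20), ('d1', 'p2', 's1', 5)]
```

The duplicate ('d1', 'p1', …) row is counted once by count(distinct prodid) but its price still reaches the sum, which is the asymmetry the plan-node proposal has to preserve.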
This is actually difficult to represent with the existing plan constructs, as it turns out. Currently, the plan looks something like:

   GroupAggregate
         ^
         |
      Sort Op
         ^
         |
       Scan

What we want to do is have a kind of 'sub plan' for each aggregate. In effect, the plan might start looking like a directed graph. Here is part of the plan as a directed graph.

          GroupAggregate
         /-^---...
        |       |
        ^       |
        |     Unique
        |       ^
        |       |
      Sort     Sort
  (saledate) (saledate,prodid)
        ^       ^
        |       |
         -- Fan Out ...
             ^
             |
           Scan

This idea was presented by Brian Hagenbuch at Greenplum. He calls it a 'Fan Out' plan. It is trivial to rejoin the data because all data input to the aggregates is sorted by the same primary key. Also, we could/would optimize this to perform a single sort for those columns which require the same sort order (saledate and sum(price) in the example query). An extra step would be required for (some) window functions because they won't necessarily preserve order. That's not important now. An alternative approach would be a 'pipeline' design. This would fit into the existing plan structure better. In this approach, each aggregate would be a distinct step in the plan.

   Finalize (like Result)
         ^
         |
   Agg (sum(price))
         ^
         |
   Sort (saledate)
         ^
         |
   Agg (count(sellerid))
         ^
         |
   Sort (saleid, sellerid)/Unique
         ^
         |
   Agg (count(prodid))
         ^
         |
   Sort (saleid, prodid)/Unique
         ^
         |
       Scan

Now, this would actually work as follows: the input would be received from the Scan node. We sort the input by the key (saleid, prodid) and produce a unique result. It is input to the aggregate and we produce a result for a distinct grouping expression. We then pass the output of the aggregate for this grouping expression up the tree along with a pointer to the scan data. We do not discard the
Re: [HACKERS] Design notes for EquivalenceClasses
On Wed, 17 Jan 2007, Tom Lane wrote: strict. However, we also allow equivalence clauses that appear below the nullable side of an outer join to form EquivalenceClasses; for these classes, the interpretation is that either all the values are equal, or all (except pseudo-constants) have gone to null. (This requires a limitation that non-constant members be strict, else they might not go to null when the other members do.) Consider for example SELECT * FROM a LEFT JOIN (SELECT * FROM b JOIN c ON b.y = c.z WHERE b.y = 10) ss ON a.x = ss.y WHERE a.x = 42; We can form the below-outer-join EquivalenceClass {b.y c.z 10} and thereby apply c.z = 10 while scanning c. (The reason we disallow outerjoin-delayed clauses from forming EquivalenceClasses is exactly that we want to be able to push any derived clauses as far down as possible.) But once above the outer join it's no longer necessarily the case that b.y = 10, and thus we cannot use such EquivalenceClasses to conclude that sorting is unnecessary (see discussion of PathKeys below). In this example, notice also that a.x = ss.y (really a.x = b.y) is not an equivalence clause because its applicability to b is restricted by the outer join; thus we do not make the mistake of concluding b.y = 42, even though we do have an equivalence class for {a.x 42}. This seems to be correct. A nice improvement. With a little further thought, it becomes apparent that nestloop joins can also produce sorted output. For example, if we do a nestloop join between outer relation A and inner relation B, then any pathkeys relevant to A are still valid for the join result: we have not altered the order of the tuples from A. Even more interesting, if there was an equivalence clause A.X=B.Y, and A.X was a pathkey for the outer relation A, then we can assert that B.Y is a pathkey for the join result; X was ordered before and still is, and the joined values of Y are equal to the joined values of X, so Y must now be ordered too. 
This is true even though we used neither an explicit sort nor a mergejoin on Y. (Note: hash joins cannot be counted on to preserve the order of their outer relation, because the executor might decide to batch the join, so we always set pathkeys to NIL for a hashjoin path.) Exception: a RIGHT or FULL join doesn't preserve the ordering of its outer relation, because it might insert nulls at random points in the ordering. I was thinking about this, but in relation to hash joins. A hash join cannot be guaranteed to produce output sorted according to the pathkey of the outer relation (as explained in the existing README). I wonder, however, if it might be useful for hash join to pass a hint that the output is known ordered (i.e., the join was not split into multiple batches). Thanks, Gavin
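The order-preservation argument above can be illustrated with a toy nestloop join (all names here are illustrative, not planner/executor code): output rows are emitted in outer-tuple order, and wherever a.x = b.y holds, the joined y values inherit x's ordering.

```c
#include <stddef.h>

/* Minimal sketch of why a nestloop join preserves the outer relation's
 * sort order.  Hypothetical row layouts. */
typedef struct { int x; } OuterRow;
typedef struct { int y; } InnerRow;

/* Join on outer.x = inner.y, collecting the joined y values. */
static size_t nestloop_join_y(const OuterRow *outer, size_t nouter,
                              const InnerRow *inner, size_t ninner,
                              int *out_y, size_t maxout)
{
    size_t n = 0;
    for (size_t o = 0; o < nouter; o++)        /* outer order drives output */
        for (size_t i = 0; i < ninner; i++)
            if (outer[o].x == inner[i].y && n < maxout)
                out_y[n++] = inner[i].y;       /* equal to outer[o].x */
    return n;
}

static int is_nondecreasing(const int *v, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (v[i] < v[i - 1])
            return 0;
    return 1;
}

/* Driver: outer sorted on x, inner deliberately unsorted; returns the
 * number of output rows, or 0 if the y column came out unsorted. */
static size_t nestloop_demo(void)
{
    const OuterRow outer[] = { {1}, {2}, {2}, {5} };
    const InnerRow inner[] = { {2}, {5}, {1}, {2} };
    int out[16];
    size_t n = nestloop_join_y(outer, 4, inner, 4, out, 16);
    return is_nondecreasing(out, n) ? n : 0;
}
```

Despite the inner relation being unsorted, the output y column comes out ordered, because it tracks the ordered outer x column through the equivalence.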
Re: [HACKERS] Design notes for EquivalenceClasses
On Wed, 17 Jan 2007, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: I was thinking about this, but in relation to hash joins. A hash join cannot be guaranteed to produce output sorted according to the pathkey of the outer relation (as explained in the existing README). I wonder, however, if it might be useful for hash join to pass a hint that the output is known ordered (i.e., the join was not split into multiple batches). Yeah, I've considered that, but I think it'd have to be the other way around: the planner tells the executor that it's assuming the output is sorted, hence do not split into multiple batches. This has the usual assortment of problems if the planner has badly misestimated the rowcount :-( Yep, I thought of that and discarded it for the reason you state. I still think there would be some benefit to passing a hint up the execution tree, effectively turning explicit sorts into no-ops. This, however, breaks the major rule in the executor: do whatever the plan tells you to do. Thanks, Gavin
Re: [HACKERS] ideas for auto-processing patches
On Wed, 10 Jan 2007, Jim C. Nasby wrote: On Thu, Jan 11, 2007 at 08:04:41AM +0900, Michael Glaesemann wrote: Wouldn't there be some value to knowing whether the patch failed due to bitrot vs it just didn't work on some platforms out of the gate? I'm having a hard time figuring out what that value would be. How would that knowledge affect what's needed to fix the patch? I was thinking that knowing it did work at one time would be useful, but maybe that's not the case... It might be useful to patch authors who submit code which remains unreviewed for some time. Then the submitter or reviewer will be able to know at what date the code drifted. This might be easier than looking through the commit history and trying to locate the problem, I think. Still, the more interesting thing to me would be to be able to provide in the patch a set of custom tests inside of the regression test framework which aren't suitable as RTs in the long term but will be able to tell the patch author if their code works correctly on a variety of platforms. Thanks, Gavin
Re: [HACKERS] [PATCHES] SGML index build fix
On Sun, 7 Jan 2007, Tom Lane wrote: Peter Eisentraut [EMAIL PROTECTED] writes: Tom Lane wrote: Perhaps even more to the point, what makes you think that someone will notice the warning? If the docs build is one step in an automated build process, this seems unlikely. Taking a closer look, it's pretty much guaranteed that no one will see them, because the targets they are attached to are intermediate, normally followed by latex runs. If we think this is a problem, ISTM the correct answer is to just force a repeat jade run when doing make all. The only objection to that AFAICS is that when you're doing docs work and only need a draft to look at, you'd rather it not run twice. But perhaps we could address that by providing a separate target, make draft say, that runs jade but once. That's a nice approach. Those working on the docs will know about the draft target and those just wanting to build the docs for publication will get the result. Gavin
Re: [HACKERS] ideas for auto-processing patches
On Thu, 4 Jan 2007 [EMAIL PROTECTED] wrote: 1. Pull source directly from repositories (cvs, git, etc.) PLM doesn't really track actual scm repositories. It requires directories of source code to be traversed, which are set up by creating mirrors. It seems to me that a better approach might be to mirror the CVS repo -- or at least make that an option -- and pull the sources locally. Having to pull down 100MB of data for every build might be onerous to some build farm members. Thanks, Gavin
Re: [HACKERS] ideas for auto-processing patches
On Thu, 4 Jan 2007, Alvaro Herrera wrote: Gavin Sherry wrote: On Thu, 4 Jan 2007 [EMAIL PROTECTED] wrote: 1. Pull source directly from repositories (cvs, git, etc.) PLM doesn't really track actual scm repositories. It requires directories of source code to be traversed, which are set up by creating mirrors. It seems to me that a better approach might be to mirror the CVS repo -- or at least make that an option -- and pull the sources locally. Having to pull down 100MB of data for every build might be onerous to some build farm members. Another idea is using the git-cvs interoperability system, as described here (albeit with SVN, but the idea is the same): http://tw.apinc.org/weblog/2007/01/03#subverting-git It seems like that will just add one more cog to the machinery for no extra benefit. Am I missing something? Now, if we were to use a distributed system like Monotone this sort of thing would be completely a non-issue ... Monotone is so 2006. The new new thing is mercurial! Gavin
Re: [HACKERS] ideas for auto-processing patches
On Thu, 4 Jan 2007, Andrew Dunstan wrote: Gavin Sherry wrote: On Thu, 4 Jan 2007 [EMAIL PROTECTED] wrote: 1. Pull source directly from repositories (cvs, git, etc.) PLM doesn't really track actual scm repositories. It requires directories of source code to be traversed, which are set up by creating mirrors. It seems to me that a better approach might be to mirror the CVS repo -- or at least make that an option -- and pull the sources locally. Having to pull down 100MB of data for every build might be onerous to some build farm members. I am not clear about what is being proposed. Currently buildfarm syncs against (or pulls a fresh copy from, depending on configuration) either the main anoncvs repo or a mirror (which you can get using cvsup or rsync, among other mechanisms). I can imagine a mechanism in which we pull certain patches from a patch server (maybe using an RSS feed, or a SOAP call?) which could be applied before the run. I wouldn't want to couple things much more closely than that. With PLM, you could test patches against various code branches. I'd guessed Mark would want to provide this capability. Pulling branches from anoncvs regularly might be burdensome bandwidth-wise. So, like you say, a local mirror would be beneficial for patch testing. The patches would need to be vetted first, or no sane buildfarm owner will want to use them. It would be nice if there could be a class of trusted users whose patches would not have to be vetted. Thanks, Gavin
Re: [HACKERS] ideas for auto-processing patches
On Thu, 4 Jan 2007, Andrew Dunstan wrote: Gavin Sherry wrote: With PLM, you could test patches against various code branches. I'd guessed Mark would want to provide this capability. Pulling branches from anoncvs regularly might be burdensome bandwidth-wise. So, like you say, a local mirror would be beneficial for patch testing. I think you're missing the point. Buildfarm members already typically have or can get very cheaply a copy of each branch they build (HEAD and/or REL*_*_STABLE). As long as the patch feed is kept to just patches which they can apply there should be no great bandwidth issues. Right... my comment was more for Mark. It would be nice if there could be a class of trusted users whose patches would not have to be vetted. Beyond committers? Hmmm... good question. I think so. I imagine the group would be small though. Thanks, Gavin
Re: [HACKERS] Dead Space Map for vacuum
On Thu, 28 Dec 2006, Heikki Linnakangas wrote: ITAGAKI Takahiro wrote: Hello, NTT staffs are working on TODO item: | Create a bitmap of pages that need vacuuming We call the bitmap Dead Space Map (DSM), that allows VACUUM to scan only pages that need vacuuming or freezing. We'd like to discuss the design on hackers and make agreements with community. Great! I agree! We implemented the basic parts of it and measured the performance. As expected, VACUUM took a shorter time when the fraction of updates is low. DSM is useful for large but not so heavily-updated databases. The overhead of managing DSM seemed to be negligibly small in our CPU-loaded tests. There are a lot of choices in implementation. We followed the descriptions in the TODO list and past discussions in some parts, but did not in other parts for some reasons. I would appreciate your comments and suggestions. I experimented with a different DSM design last winter. I got busy with other things and never posted it AFAIR, but the idea was to store a bitmap in the special area on every 32k heap page. That had some advantages:
* doesn't require a new dedicated shared memory area that needs to be allocated and tuned.
* doesn't introduce a (single) new global lwlock that might become a hotspot.
* WAL logging is quite simple, since the bitmaps are on normal pages handled by the buffer manager.
I too have experimented with this idea. 4 DSM pages means 2 bits per block. What were you using the other bit for? When I discussed this some time ago with Jan Wieck, we discussed tracking blocks in the FSM with this extra bit. One problem this approach may have is contention for the DSM pages and the granularity of the lock associated with it. Locking a DSM page to update one bit will affect writes to 32K pages. This might be less of a problem than I think because of the very low cost of setting the bit. | In the event of a system crash, the bitmap would probably be invalidated. We bought it for simplicity.
Currently, all DSM are lost after a crash. All tables will be in untracked status. Full-scanned vacuum is needed after the loss of the DSM. If you flush the DSM to disk at checkpoint, it should be easy to bring it up-to-date on WAL replay. Having to do a full-scanning vacuum after a crash can be a nasty surprise. I agree. One interesting insight is, we have 4 bits left over (those not used for the 4 DSM blocks). We might use those status bits for some purpose. For example, when a page is dirtied and we update the DSM page, we set a DSM dirtied bit. Checkpoint would then only flush DSM pages dirtied since the last checkpoint. This is important in the presence of lots of read-only tables. | One complexity is that index entries still have to be vacuumed, and doing | this without an index scan (by using the heap values to find the index entry) | might be slow and unreliable, especially for user-defined index functions. Indexes still need to be full-scanned at the present moment. We are also researching a retail index vacuum method, but it requires more work. Yeah, that's an old story :(. BTW: Yesterday I realized that bitmap indexes could be retail vacuumed safely. You'll still need to visit all bitmaps to find the dead bit, but you only need to check the bitmap page that corresponds to the tid of the dead tuple. Cool. The problem is solving it for the other AMs, particularly btree. | http://archives.postgresql.org/pgsql-hackers/2004-03/msg00957.php | Maintain 2 bits per block that tell if the block has been vacuumed of all | dead tuples since the last time it was dirtied, and if all its tuples are | completely frozen. We use 1 bit per block, so we cannot separate pages that need either vacuum or freeze only. The reason is that we cannot determine where to record before/after updated tuples. If the transaction is committed, the before-version should be recorded into the needs-vacuum bitmap and the after-version into the needs-freeze bitmap. But on rollback, it is inverted.
We cannot judge which we should do until the end of transaction. Yeah, that makes the DSM less useful. Are you thinking of freezing more aggressively because of that? Or doing a full-scanning vacuum every now and then to freeze? I don't see any problem with moving to two status bits per block in the DSM-in-heap model, so we can track this better. What do you think? Thanks, Gavin
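As a sketch of what two status bits per heap block might look like, here is a toy packed-bitmap accessor. The layout, names (`DsmPage`, `dsm_set`, `dsm_get`), and flag meanings are mine for illustration, not the NTT patch's actual format: one bit for "needs vacuum", one for "all tuples frozen", four blocks tracked per byte.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 2-bits-per-block status map. */
#define DSM_NEEDS_VACUUM  0x1
#define DSM_ALL_FROZEN    0x2

typedef struct { uint8_t bytes[1024]; } DsmPage;   /* tracks 4096 blocks */

/* Overwrite the two status bits for a heap block. */
static void dsm_set(DsmPage *p, uint32_t blkno, uint8_t flags)
{
    uint32_t byte = blkno / 4;
    uint32_t shift = (blkno % 4) * 2;

    p->bytes[byte] = (uint8_t)
        ((p->bytes[byte] & ~(0x3 << shift)) | ((flags & 0x3) << shift));
}

static uint8_t dsm_get(const DsmPage *p, uint32_t blkno)
{
    return (p->bytes[blkno / 4] >> ((blkno % 4) * 2)) & 0x3;
}

/* Sanity-check driver: setting one block's bits leaves neighbors alone,
 * and a second set overwrites the first. */
static int dsm_demo(void)
{
    DsmPage p;
    memset(&p, 0, sizeof p);
    dsm_set(&p, 10, DSM_NEEDS_VACUUM);
    if (dsm_get(&p, 10) != DSM_NEEDS_VACUUM) return 0;
    if (dsm_get(&p, 9) != 0 || dsm_get(&p, 11) != 0) return 0;
    dsm_set(&p, 10, DSM_ALL_FROZEN);
    return dsm_get(&p, 10) == DSM_ALL_FROZEN;
}
```

The commit/rollback ambiguity discussed above is precisely about which of the two bits to set for the before- and after-versions of an updated tuple; the storage itself is cheap either way.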
Re: [HACKERS] Bitmap index thoughts
On Wed, 27 Dec 2006, Heikki Linnakangas wrote: Gavin Sherry wrote: On Tue, 26 Dec 2006, Heikki Linnakangas wrote: for typical bitmap index use cases and most of the needed pages should stay in memory, but could we simplify this? Why do we need the auxiliary heap, couldn't we just store the blk+offset of the LOV item directly in the b-tree index item? The problem is, the b-tree code is very much tied to the heap. I don't want to modify the b-tree code to make bitmap indexes work (better). What's really tempting is to just manage our own balanced tree within the bitmap index file(s) itself. It would start from the metapage and simply spill to other 'special' index pages when necessary. The problem is, we do not have b-tree code generic enough that it would allow us to do this trivially -- consider concurrency and WAL in particular, which we currently get for free. I guess this is why I've been ignoring this issue :-). Maybe we could reuse the code in ginbtree.c. Looks like Teodor and Oleg had the same problem :). Modifying the nbtree code doesn't seem that difficult either. AFAICS, the only places where the heap is accessed from within the nbtree code are in index building and uniqueness checks. I'll take a look and think about it. And instead of having separate LOV pages that store a number of LOV items, how about storing each LOV item on a page of its own, and using the rest of the page to store the last chunk of the bitmap. That would eliminate one page access, but more importantly, maybe we could then get rid of all the bm_last_* attributes in BMLOVItemData that complicate the patch quite a bit, while preserving the performance. That's an interesting approach. We would still need a concept of last_word, at the very least, and probably last_comp_word for convenience. Why?
The two last words of the bitmap vector, stored in the LOV item, provide a few things: they buffer the addition of new matches in the vector and they ease the calculation of the current position of the end of the vector. Jie, do you want to add more? I think we could do this differently by calculating everything based on the other data stored in the lov item and data page (last_tid_location and num_hrl_words_used, from memory). But, I'm not sure. Jie? Thanks, Gavin
Re: [HACKERS] Bitmap index thoughts
On Wed, 27 Dec 2006, Heikki Linnakangas wrote: Jie Zhang wrote: The bitmap data segment sounds good in terms of space. The problem is that one bitmap is likely to occupy more pages than before, which may hurt the query performance. We could have segments of say 1/5 of a page. When a bitmap grows larger than that, the bitmap would be moved to a page of its own. That way we wouldn't get unnecessary fragmentation with large bitmaps, but small bitmaps would be stored efficiently. Yes. I have been thinking along the lines of increasing the number of last bitmap words stored in each LOV item, but not to occupy one page. This may prevent some cases Gavin indicated here, but not all. That sounds like more special cases and complexity. I like the segment idea more. But actually I'm not convinced we need to worry about efficient storage of small bitmaps at all. The typical use case for bitmap indexes is large tables with small number of distinct values, and the problem doesn't really arise in that scenario. Let's keep it simple for now, we can enhance it in later releases. The scenario I'm concerned about is where a sales database, say, has 100,000 products. However, only 500 or 1000 products are popular. They dominate, say, 99% of the sales. The other 99,000-odd products consume a little bit over 8K each for very little benefit :-(. This is pretty contrived but it seems real-world enough... Gavin
Re: [HACKERS] Bitmap index thoughts
Hey Heikki, On Tue, 26 Dec 2006, Heikki Linnakangas wrote: I've been skimming through the bitmap index patch... A scan needs to access at least five pages: 1. B-tree index (root+others, depending on depth) 2. The auxiliary heap page 3. bitmap index meta page 4. LOV page 5. bitmap page That seems like a lot of indirection. A high startup cost is probably ok You're right, this is excessive and it was something I'd hoped to have ripped out, but... for typical bitmap index use cases and most of the needed pages should stay in memory, but could we simplify this? Why do we need the auxiliary heap, couldn't we just store the blk+offset of the LOV item directly in the b-tree index item? The problem is, the b-tree code is very much tied to the heap. I don't want to modify the b-tree code to make bitmap indexes work (better). What's really tempting is to just manage our own balanced tree within the bitmap index file(s) itself. It would start from the metapage and simply spill to other 'special' index pages when necessary. The problem is, we do not have b-tree code generic enough that it would allow us to do this trivially -- consider concurrency and WAL in particular, which we currently get for free. I guess this is why I've been ignoring this issue :-). And instead of having separate LOV pages that store a number of LOV items, how about storing each LOV item on a page of it's own, and using the rest of the page to store the last chunk of the bitmap. That would eliminate one page access, but more importantly, maybe we could then get rid of all the bm_last_* attributes in BMLOVItemData that complicate the patch quite a bit, while preserving the performance. That's an interesting approach. We would still need a concept of last_word, at the very least, and probably last_comp_word for convenience. Your idea doesn't require any extra space, either, which is good. Something I've been working through is the idea of a 'bitmap data segment'. 
Currently, we store the HRL compressed bitmap data to the extent of the page. That is, sizeof(BMBitmap) is as close to BLCKSZ as we can make it. The problem is that we may have some bitmaps where a few values occur only a small number of times and are well clustered at the beginning of the heap. In that circumstance, we use a whole page to store a small number of words and the free space cannot be used by any other vector. Now, say a segment was some fraction the size of BLCKSZ, we use less space for those vectors with few tuples in the heap. Just an idea... Thanks, Gavin PS: Another version of the patch shall be forthcoming shortly. I've been working on compressing the data in memory during CREATE INDEX instead of just managing arrays of TIDs in memory as we did previously. The array of TIDs works great for well clustered data but it stinks for poorly clustered data as we approach maintenance_work_mem and have to swap a lot.
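For illustration, the space behaviour being discussed can be seen with a toy word-aligned compression counter in the spirit of HRL/WAH. This is not the patch's actual encoding (which packs header bits and fill lengths into on-disk words); it only models the core idea, namely that runs of all-zero or all-one words collapse into a single fill word while mixed words are stored as literals, so sparse, well-clustered vectors compress to a handful of words.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy model: count how many output words a run-length, word-aligned
 * scheme would need for a given uncompressed bitmap. */
static size_t hrl_compressed_words(const uint32_t *words, size_t n)
{
    size_t out = 0, i = 0;

    while (i < n)
    {
        if (words[i] == 0 || words[i] == UINT32_MAX)
        {
            uint32_t fill = words[i];
            while (i < n && words[i] == fill)
                i++;                /* a whole run becomes one fill word */
            out++;
        }
        else
        {
            out++;                  /* mixed word stored as a literal */
            i++;
        }
    }
    return out;
}
```

A vector with three zero words, two all-ones words, one literal, and two trailing zero words compresses to just four words; a vector for a value that matches only a few early, clustered tuples is almost entirely one long zero run, which is why the few stored words rattling around in a whole page is the real waste.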
Re: [HACKERS] Question about debugging bootstrapping and catalog
On Mon, 18 Dec 2006, Gregory Stark wrote: I've been fooling with catalog entries here and I've obviously done something wrong. But I'm a bit frustrated trying to debug initdb. Because of the way it starts up the database in a separate process I'm finding it really hard to connect to the database and get a backtrace. And the debugging log is being spectacularly unhelpful in not telling me where the problem is. Are there any tricks people have for debugging bootstrapping processing? I just need to know what index it's trying to build here and that should be enough to point me in the right direction: creating template1 database in /var/tmp/db7/base/1 ... FATAL: could not create unique index DETAIL: Table contains duplicated values. Not much fun. Run src/include/catalog/duplicate_oids first. Thanks, Gavin
Re: [HACKERS] Bitmap index status
On Wed, 18 Oct 2006, Heikki Linnakangas wrote: Hi, I don't want to harass you :), but what's the status with the bitmap index code? Is there something I can do to help? Hi Heikki, The streaming is implemented, as are range queries. I need to bring it up to HEAD and back-patch to bizgres since it's now diverged fairly significantly from that code base. Two outstanding items are handling vacuum and a bitmap selectivity function I was considering, though I haven't really looked into the latter. Once I bring it up to HEAD I'll post. Thanks, Gavin
Re: [HACKERS] pg_internal.init is hazardous to your health
On Tue, 17 Oct 2006, Tom Lane wrote: Dirk Lutzebaeck and I just spent a tense couple of hours trying to figure out why a large database Down Under wasn't coming up after being reloaded from a base backup plus PITR recovery. The symptoms were that the recovery went fine, but backend processes would fail at startup or soon after with "could not open relation XX/XX/XX: No such file" type of errors. The answer that ultimately emerged was that they'd been running a nightly maintenance script that did REINDEX SYSTEM (among other things I suppose). The PITR base backup included pg_internal.init files that were appropriate when it was taken, and the PITR recovery process did nothing whatsoever to update 'em :-(. So incoming backends picked up init files with obsolete relfilenode values. Ouch. We don't actually need to *update* the file, per se, we only need to remove it if no longer valid --- the next incoming backend will rebuild it. I could see fixing this by making WAL recovery run around and zap all the .init files (only problem is to find 'em), or we could add a new kind of WAL record saying remove the .init file for database XYZ to be emitted whenever someone removes the active one. Thoughts? The latter seems the Right Way except, I guess, that the decision to remove the file is buried deep inside inval.c. Thanks, Gavin
Re: [HACKERS] Bitmap index status
On Tue, 26 Sep 2006, Heikki Linnakangas wrote: Looks a bit better now, though I think you need to think more about the encapsulation of the structs. More detailed comments below. Jie Zhang wrote: Essentially, we want to have a stream bitmap object that has an iterator, which will be able to iterate through the bitmaps. The BitmapIndexscan, BitmapAnd, BitmapOr will be executed once and return a stream bitmap or a hash bitmap. The BitmapHeapscan then calls tbm_iterate() to iterate through the bitmaps. The StreamBitmap structure will look like below.

struct StreamBitmap
{
    NodeTag         type;   /* to make it a valid Node */
    PagetableEntry  entry;  /* a page of tids in this stream bitmap */
    /* the iterator function */
    void          (*next)(StreamBitmap *);
    Node           *state;  /* store how this stream bitmap was generated,
                             * and all necessary information to obtain the
                             * next stream bitmap */
};

I'd suggest making state just a (void *). It's private to the producer of the bitmap, and I don't see a reason to expose it. I assume that the next-function fills in the PagetableEntry with the next set of tids. Two new state objects will look like below. At the same time, we introduce three new node types: T_StreamBitmapAND, T_StreamBitmapOR, and T_StreamBitmapIndex, to define different states.

/*
 * Stores the necessary information for iterating through the stream bitmaps
 * generated by nodeBitmapAnd or nodeBitmapOr.
 */
struct StreamBitmapOp
{
    NodeTag  type;      /* handles T_StreamBitmapAND and T_StreamBitmapOR */
    List    *bitmaps;
};

AFAICS, this struct is private to tidbitmap.c, where the new stream-enabled tbm_intersect etc. functions are defined. Also, I don't see a reason why it needs to be a valid Node. Well, making it a valid node makes it easy to identify (IsA) and gives us access to the copy/equal frameworks. I do think that it is best to bury this in the bitmap code, however.

/*
 * Stores some necessary information for iterating through the stream
 * bitmaps generated by nodeBitmapIndexscan.
 */
struct StreamBitmapIndex
{
    NodeTag        type;        /* handles T_StreamBitmapIndex */
    IndexScanDesc  scan;
    BlockNumber    nextBlockNo; /* next block no to be read */
};

Where would this struct be defined? I think different index access methods might want to have different kinds of states. The struct above assumes that the position of an index scan is always represented by the nextBlockNo. That seems to be the right thing for the bitmap indexam, so this struct is fine for StreamBitmaps returned by bmgetbitmap, but not necessary for others. Right. Then we will have the iterator functions like the following:

void
StreamBitmapIndexNext(StreamBitmap *node)
{
    StreamBitmapIndex *sbi = (StreamBitmapIndex *) node->state;

    amgetbitmap(sbi->scan, NULL, sbi->nextBlockNo);
}

This means that the amgetbitmap function would still be called many times in each scan. What would amgetbitmap return? A new StreamBitmap on each call? I'd like to see just one call to amgetbitmap. It would a) fill in the hash bitmap passed to it, b) return a new hash bitmap, or c) return a new StreamBitmap, with an indexam-specific next-function that returns the tids one page at a time. I think we'll also need some kind of a destructor in StreamBitmap that's called by the consumer of the bitmap after it's done with it. Right, I agree. I am working on this now. Thanks, Gavin
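The callback-iterator shape being converged on here can be sketched in isolation. Everything below is illustrative scaffolding, not the patch's API: the producer hands back one stream object whose next() fills in the next page worth of results, and the consumer simply loops until next() reports exhaustion.

```c
#include <stddef.h>

/* Hypothetical, self-contained model of a stream bitmap: a current
 * page number, a producer-supplied next() callback, and opaque state
 * private to the producer (the (void *) Heikki suggests above). */
typedef struct StreamBitmapSketch StreamBitmapSketch;
struct StreamBitmapSketch
{
    int   blockno;                       /* current page of tids */
    int (*next)(StreamBitmapSketch *);   /* returns 0 when exhausted */
    void *state;                         /* private to the producer */
};

/* A toy producer that "scans" a fixed number of blocks. */
static int demo_next(StreamBitmapSketch *sb)
{
    int *remaining = (int *) sb->state;

    if (*remaining <= 0)
        return 0;
    sb->blockno++;
    (*remaining)--;
    return 1;
}

/* The consumer (think BitmapHeapscan) needs no knowledge of the
 * producer: it just pulls pages until the stream runs dry. */
static int consume_all(StreamBitmapSketch *sb)
{
    int pages = 0;

    while (sb->next(sb))
        pages++;                         /* e.g. fetch heap tuples here */
    return pages;
}

/* Driver: a 3-block stream yields 3 pages, ending on block 2. */
static int stream_demo(void)
{
    int remaining = 3;
    StreamBitmapSketch sb = { -1, demo_next, &remaining };
    int pages = consume_all(&sb);

    return pages == 3 && sb.blockno == 2;
}
```

The design payoff is that AND/OR nodes and the indexam can each supply their own next() and state, while the single consumer-side loop stays unchanged.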
Re: [HACKERS] pg_upgrade: downgradebility
On Wed, 20 Sep 2006, Andrew Sullivan wrote: On Wed, Sep 20, 2006 at 12:54:14PM +0200, Zdenek Kotala wrote: My first question is how important is downgrade for You and Your customers? And second is how to verify that downgrade is possible? Well, one way to do it is to set up a Slony replica using the older version of the software. So, if you've upgraded to 8.1.x, you replicate to an old 8.0.x back end as well. If 8.1 doesn't work for you, you just MOVE everything back to the 8.0.x back end, and you're golden. Well, I think that people who really want downgrade in such a tool are those for which slony replication is just not an option. That is, data in the range of hundreds of gigabytes. Using slony to upgrade is often not practical either. I wonder if pg_upgrade could be designed in such a way that upgrade is the same as downgrade from a development point of view. That is, the tool can change the system from one binary format to another. Thanks, Gavin
Re: [HACKERS] Bitmap index status
On Tue, 19 Sep 2006, Heikki Linnakangas wrote: Jie Zhang wrote: Hi Heikki and all, Please find the latest bitmap index patch in the attachment. This patch is generated against the postgresql cvs head. Thanks. The handling of stream and hash bitmaps looks pretty complicated to me. All the bitmap-related nodes have logic to handle both types slightly differently. It all seems to come down to that if a subnode (or amgetbitmap in a bitmap index scan node) returns a StreamBitmap, the caller needs to call the subnode many times, until it returns a NULL. With a HashBitmap, the caller only calls the subnode once. I think amgetbitmap should be called just once per index scan, and it should return either a hash bitmap or a stream bitmap. The same applies to all the executor nodes that return bitmaps, they would only return a single HashBitmap or StreamBitmap, and the upper node would call tbm_iterate repeatedly on that. StreamBitmap would contain a callback (filled by the indexam) that tbm_iterate would call to fill the next TBMIterateResult. Right, this was the approach taken by an earlier version of the patch I had worked on. It was significantly uglified by the need to keep the index state around and to be careful about what amrescan might do behind our back. I will, however, introduce the idea again because it makes the code much cleaner and more logical, as you seem to suggest. Thanks, Gavin
Re: [HACKERS] Timezone List
On Wed, 6 Sep 2006, Tom Lane wrote: Martijn van Oosterhout kleptog@svana.org writes: In the CVS version there is a table with this information: http://developer.postgresql.org/pgdocs/postgres/view-pg-timezonenames.html Actually, what that view gives you is timezone offset abbreviations, not the full zone names that you could use with SET TIME ZONE. It strikes me that we should have a view for that as well. We could use code similar to scan_available_timezones() to generate the view output. It's somewhat urgent to address this now, because pg_timezonenames is sitting on the obvious name for such a view, and once we release 8.2 we won't be able to change it. On reflection I think the existing view is wrongly named --- perhaps it should be pg_timezoneabbrevs? Or more readably, perhaps pg_timezone_abbrevs, with pg_timezone_names for the other view. I think 'abbrev' is a little unintuitive. How about 'short_names'? Gavin
Re: [HACKERS] [PATCHES] Trivial patch to double vacuum speed
On Mon, 4 Sep 2006, Joshua D. Drake wrote: I don't have a concrete proposal to make, but I do think that the current patch-queue process is not suited to the project as it stands today. Maybe if this issue-tracking stuff gets off the ground, we could let developers place ACK or NAK flags on patches they've looked at, and have some rule about ACK-vs-NAK requirements for something to go in. How about *requiring* test cases that prove the patch? Having people include regression tests is not a replacement for code review. For a non-trivial patch, an SQL test will only exercise a few code paths. Moreover, it won't say anything about code quality, maintainability or general correctness or completeness. It will still have to be reviewed. Thanks, Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Getting a move on for 8.2 beta
On Fri, 1 Sep 2006, Tom Lane wrote: My feeling is that we ought to bounce bitmap indexes and updatable views as not being ready, accept all the contrib stuff, and try to get the other items done in time for a beta at, say, the end of next week. For what it's worth, Jie and I hope to have finished the bitmap streaming this weekend which is the only outstanding issue with the code as it currently stands. I now realise (nothing like hindsight) that we should have posted a patch before we looked at streaming bitmaps because they've proved more fiddly to implement than I thought :-(. At that point, we'd more or less addressed the issues I'd raised when I posted at feature freeze time. Thanks, Gavin ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [PATCHES] WIP: bitmap indexes
On Mon, 14 Aug 2006, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: Attached is an update to the patch implementing bitmap indexes Jie sent last week. What's the current status of this patch ... has any work been done since the first of the month? Yes. I am tidying up the executor code at the moment so that the on-disk bitmap support is more or less hidden from the executor itself, and Jie has been fixing multicolumn support and working on the planner. I've also brought the code into line with PostgreSQL style, etc. I suppose the patch as given here no longer applies to HEAD, because of recent changes in rmgr for Simon's restartable-recovery patch (should be easy enough to fix) and consumption of OIDs by other patches (also easy to fix, in principle, but doubtless tedious). Right. If you have to renumber OIDs I'd suggest starting at 3000 for your added OIDs, to leave some daylight for any other patches that go in during the next couple weeks. Thanks. I will post an updated patch in a few days' time. Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [PATCHES] WIP: bitmap indexes
On Mon, 14 Aug 2006, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: I will post an updated patch in a few days time. OK. Do you want me to work on the discussed amgetmulti change, or would that just be joggling your elbow at the moment? Yes, that would be joggling ;). The approach I am taking is more or less the one you proposed a few weeks ago. Namely, to pass the bitmap object as an argument to amgetmulti and have the routine fill it in as it sees fit. There is some trickiness, however. One of the main reasons for the uglification of the executor in Jie's original patch was that she wanted to avoid the inefficiency of translating the on disk bitmap representation to the TID bitmap representation. If the plan calls for a straight on disk bitmap read or AND/ORing together a few on-disk bitmaps this is justified. If we're AND/ORing together an ondisk bitmap read and a TID bitmap scan of a btree, we're in trouble. In this case, the existing code implements the current semantics of amgetmulti. What I am doing is treating the bitmap object as opaque, storing the data in the format the AM wants, and providing a 'translate to TID bitmap' routine (trivial for btree). Do you have any thoughts on this? Thanks, Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [PATCHES] WIP: bitmap indexes
On Mon, 14 Aug 2006, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: One of the main reasons for the uglification of the executor in Jie's original patch was that she wanted to avoid the inefficiency of translating the on disk bitmap representation to the TID bitmap representation. Offhand that seems like micro-optimization, at least as stated -- you want to think about number of page fetches rather than exactly how a bitmap is represented. I've still not got round to reading the patch in detail, but after thinking about it in the shower on a few mornings, it seems to me that the key concept we'd like to implement is that a bitmap index can provide its data in a page-by-page fashion without the overhead of fabricating the bitmap in-memory as btree forces us to do. That is, the current implementation involves doing the whole index scan and building a bitmap in memory, which we then read out page-wise for the heap scan. The weak spot of this approach is that the in-memory bitmap can get unreasonably large. With a bitmap index, you can pass data back in a page-by-page fashion: here are the TID indexes to hit on this page, then here are the TID indexes to hit on the next page, etc. Correct me if I'm wrong, but isn't the patch's present hacking on the executor intended to make it happen like this? Not really. It reads ahead on the bitmap index and passes back the bitmap words. The other executor routines are hacked up to process the data in this format. If I understand your idea correctly, we could change this to read, say, a page of the index at a time, store this internally in the state object we pass around, and we can then read out of this the TIDs on a given heap page which match the query. Once we process all the bitmap data, we just pull more. The main architectural idea I have at the moment is that this should all be private between tidbitmap.c and the bitmap index AM. 
I think we could perhaps have a flavor of tidbitmap that is serial access as opposed to the present random-access, ie, instead of keeping the whole bitmap in memory, you have a function pointer that you can call to demand the next bitmap page in sequence. tidbitmap.c will need to manage both kinds of bitmaps and be prepared to do ANDs and ORs across both types, but just at the moment I see no showstopper reason why this can't work. Or is that exactly what you said already? It's late here and I'm a bit tired... This is better than what I had in mind. It seems to me that the part of this which will be a little ugly is dealing with the key in (1,2,3,...) transformation. Let me think about it... Thanks, Gavin ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
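A toy version of the serial-access flavour of tidbitmap Tom sketches - pages demanded in order from a callback and ANDed into an in-memory bitmap - might look roughly like this. All the names (HashBitmap, tbm_and_serial, NextPageFn) are made up for illustration; they are not the tidbitmap.c API:

```c
#include <assert.h>
#include <stdint.h>

#define NPAGES 4

/* In-memory ("random access") bitmap: one word per heap page, for brevity. */
typedef struct { uint32_t words[NPAGES]; } HashBitmap;

/* Serial bitmap: pages are produced in ascending order, on demand. */
typedef int (*NextPageFn)(void *state, int *pageno, uint32_t *word);

typedef struct { int pos; } SerialState;

static int serial_next(void *state, int *pageno, uint32_t *word)
{
    static const uint32_t src[NPAGES] = { 0xF0, 0x0F, 0xFF, 0x00 };
    SerialState *s = state;
    if (s->pos >= NPAGES) return 0;
    *pageno = s->pos;
    *word = src[s->pos++];
    return 1;
}

/* AND a serial bitmap into a random-access one, page by page; because the
 * stream arrives in page order, the whole serial bitmap never needs to be
 * materialized in memory. */
static void tbm_and_serial(HashBitmap *hb, NextPageFn next, void *state)
{
    int page, seen[NPAGES] = { 0 };
    uint32_t word;
    while (next(state, &page, &word)) {
        hb->words[page] &= word;
        seen[page] = 1;
    }
    for (page = 0; page < NPAGES; page++)   /* pages absent from the stream match nothing */
        if (!seen[page])
            hb->words[page] = 0;
}
```

ORing is symmetrical (no final sweep needed), and AND/OR between two serial streams is a straightforward ordered merge of the two page sequences.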
Re: [HACKERS] 8.2 features status
On Fri, 4 Aug 2006, Bruce Momjian wrote: My outlook is that it isn't a lot of _new_ things that you couldn't do before, but rather improvements of existing functionality. It seems as though the majority of things on Tom's list are new things you couldn't do (at all easily) before. Gavin ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] 8.2 features status
On Fri, 4 Aug 2006, Bruce Momjian wrote: Gavin Sherry wrote: On Fri, 4 Aug 2006, Bruce Momjian wrote: My outlook is that it isn't a lot of _new_ things that you couldn't do before, but rather improvements of existing functionality. It seems as though the majority of things on Tom's list are new things you couldn't do (at all easily) before. To me new things are like PITR, Win32, savepoints, two-phase commit, partitioned tables, tablespaces. These are from 8.0 and 8.1. What is there in 8.2 like that? Well, GIN and some of the unreviewed stuff (bitmaps, plpgsql debugger, updateable views) are in the same league as the stuff in 8.0 in terms of user demand and catching up with competitors, I think. A lot of the things on Tom's list are new bits of functionality to things added around 8.0 and 8.1 (major enhancements to the usability of constraint exclusion, for example). We knew then that these needed additional functionality to fill them out and make them useful to a wide range of people. Ideally we'd have both at each release but reality doesn't work like that. Thanks, Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] UPDATE/DELETE XXX WHERE CURRENT OF cursor_name
On Mon, 24 Jul 2006, Golden Liu wrote: Updateable cursors are used as follows: begin; declare foo cursor for select * from bar for update; fetch foo; update bar set abc='def' where current of foo; fetch foo; delete from bar where current of foo; commit; PostgreSQL doesn't support this feature now ( 8.1.4). Will PGSQL support it recently? Does anyone work on this? No one has stepped up to do this for 8.2 so unfortunately you will most likely not see this within the next year or so :-(. Thanks, Gavin PS: sorry for not responding to your private email in time. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] On-disk bitmap index patch
On Mon, 24 Jul 2006, Bruce Momjian wrote: Jie Zhang wrote: IIRC they quoted the cardinality of 1 as something that is still faster than btree for several usecases. And also for AND-s of several indexes, where indexes are BIG, your btree indexes may be almost as big as tables but the resulting set of pages is small. Yeah, Hannu points it out very well -- the bitmap index works very well when columns have low cardinalities and AND operations will produce small number of results. What operations on columns of low cardinality produce a small number of results? That seems contradictory. WHERE a = 1 and b = 2 a = 1 may be 5% of the table and b = 2 may be 5% of the table but their intersection may be .001%. Luke: the URL you sent to the bitmap slides was internal to Greenplum. Would you be able to put them on a public site? Thanks, Gavin ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
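The AND case in the example above (a = 1 may match 5% of the table, b = 2 another 5%, their intersection far less) reduces to word-wise intersection of the per-predicate bitmaps. A hedged toy sketch - not the patch's code - of why that is cheap:

```c
#include <assert.h>
#include <stdint.h>

/* Count rows matched by (a = 1 AND b = 2) given one bitmap per predicate:
 * each bit represents a row, and ANDing the words yields the intersection
 * without visiting the heap for rows that fail either predicate. */
static int and_count(const uint32_t *a, const uint32_t *b, int nwords)
{
    int rows = 0;
    for (int i = 0; i < nwords; i++)
        for (uint32_t w = a[i] & b[i]; w != 0; w &= w - 1)  /* clear lowest set bit */
            rows++;
    return rows;
}
```

One 32-bit AND resolves 32 rows at once, which is why low-cardinality bitmaps combine so much more cheaply than merging two btree scans.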
Re: [HACKERS] On-disk bitmap index patch
On Mon, 17 Jul 2006, Jie Zhang wrote: Hi, I have posted a patch to the CVS head for on-disk bitmap index to pgsql-patches. If this can get in 8.2, that would be great. Any comments and suggestions are welcome. I still need to add several items: (1) README file in src/backend/access/bitmap. (2) Bitmap index documentation. (3) Hiding the internal btree. Also, I have disabled the multi-column index support because there is a known problem. Assume that there is a bitmap index on a and b. When a query predicate has only a, the current code may generate a wrong result. That's because the current code assumes that b is null. The ultimate problem is because the search code only handles one bitmap vector now. I need a fix to support manipulating multiple bitmap vectors. If you find any other problems, please let me know. Is anyone else looking at this patch? I am reviewing it with Jie, tidying it up and trying to solve some of the problems mentioned above. If anyone wants to take a look at it, please ask for an updated patch. Thanks, Gavin ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] On-disk bitmap index patch
On Sun, 23 Jul 2006, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: Is anyone else looking at this patch? It's on my list of things to look at, but so are a lot of other patches ;-) A couple of comments after a *very* fast scan through the patch: * The xlog routines need help; they seem to not be updated for recent changes in the API for xlog recovery code. Yep. The patch was actually against 8.1 and was hastily brought up to 8.2. I think Jie's intention was to simply let everyone know that this was going on. * The hacks on vacuum.c (where it tests the AM name) are utterly unacceptable. If you need the main code to do something special for a particular index AM, define a bool flag column for it in pg_am. Yes. * The interface to the existing executor bitmap scan code is mighty messy --- seems like there's a lot of almost duplicate code, a lot of rather invasive hacking, etc. This needs to be rethought and refactored. I agree. * Why in the world is it introducing duplicate copies of a lot of btree comparison functions? Use the ones that are there. Yes, I raised this with Jie and she has fixed it. One thought is, we may want to rename those comparison functions prefixed with 'bm' to make their naming less confusing. They'll be used by btree, gin and bitmap index methods. Anyway, a separate patch. * The patch itself is a mess: it introduces .orig and .rej files, changes around $PostgreSQL$ lines, etc. Right, not to mention patches to configure and a lot of style which needs to be knocked into shape. However, the main problem I've got with this is that a new index AM is a pretty large burden, and no one's made the slightest effort to sell pghackers on taking this on. What's the use-case, what's the performance like, where is it really an improvement over what we've got? For low cardinality sets, bitmaps greatly outperform btree. Jie and others at Greenplum tested this implementation (and others) against very large, low cardinality sets. I wasn't involved in it. 
I urge Jie and/or Luke to present those results. Thanks, Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Adding a pgbench run to buildfarm
On Mon, 24 Jul 2006, Mark Kirkwood wrote: Tom Lane wrote: Mark Kirkwood [EMAIL PROTECTED] writes: Scale factor 10 produces an accounts table of about 130 Mb. Given that most HW these days has at least 1G of ram, this probably means not much retrieval IO is tested (only checkpoint and wal fsync). Do we want to try 100 or even 200? (or recommend scale factor such that size ram)? That gets into a different set of questions, which is what we want the buildfarm turnaround time to be like. The faster members today produce a result within 10-15 minutes of pulling their CVS snaps, and I'd be seriously unhappy if that changed to an hour or three. Maybe we need to divorce compile/regression tests from performance tests? Right - this leads to further questions like, what the performance testing on the buildfarms is actually for. If it is mainly to catch regressions introduced by any new code, then scale factor 10 (i.e essentially in memory testing) may in fact be the best way to show this up. It introduces a problem though. Not all machines stay the same over time. A machine may be upgraded, a machine may be getting backed up or may in some other way be utilised during a performance test. This would skew the stats for that machine. It may confuse people more than help them... At the very least, the performance figures would need to be accompanied by details of what other processes were running and what resources they were chewing during the test. This is what was nice about the OSDL approach. Each test was preceded by an automatic reinstall of the OS and the machines were specifically for testing. The tester had complete control. We could perhaps mimic some of that using virtualisation tools which control access to system resources but it won't work on all platforms. The problem is that it probably introduces a new variable, in that I'm not sure that virtualisation software can absolutely limit CPU resources a particular container has. 
That is, you might not be able to get reproducible runs with the same code. :( Just some thoughts. Thanks, Gavin ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Units in postgresql.conf
On Thu, 20 Jul 2006, Peter Eisentraut wrote: One frequent source of confusion is the different units that the parameters in postgresql.conf use. shared_buffers is in 8 kB, work_mem is in 1 kB; bgwriter_delay is in milliseconds, checkpoint_warning is in seconds. Obviously, we can't change that without inconveniencing a lot of users. I think it would be useful to allow units to be added to these settings, for example shared_buffers = 1000kB checkpoint_warning = 30s This would also allow shared_buffers = 512MB which is a bit cumbersome to calculate right now (you'd need = 65536). I haven't thought yet how to parse or implement this, but would people find this useful? Please! Yes! Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
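The suffix parsing this proposal needs is small; a rough sketch under assumptions of my own (parse_size_kb and its return-in-kB convention are invented here - the eventual GUC code would certainly look different):

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse "1000kB", "512MB", "16"-style size values; returns the size in kB,
 * or -1 on an unrecognized suffix. A bare number keeps the pre-8.2 meaning
 * (raw kB, in this sketch). */
static long parse_size_kb(const char *val)
{
    char *end;
    long n = strtol(val, &end, 10);
    while (isspace((unsigned char) *end))
        end++;
    if (strcmp(end, "kB") == 0) return n;
    if (strcmp(end, "MB") == 0) return n * 1024;
    if (strcmp(end, "GB") == 0) return n * 1024 * 1024;
    if (*end == '\0') return n;    /* no unit: legacy behaviour */
    return -1;
}
```

With shared_buffers counted in 8 kB pages, 512MB works out to 524288 kB, i.e. the 65536 buffers the message alludes to - exactly the arithmetic the unit suffix would spare users.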
Re: [HACKERS] Units in postgresql.conf
On Thu, 20 Jul 2006, Josh Berkus wrote: Peter, I don't understand how that is related. Or what a conversion utility would be for that matter. Well, the main issue with changing the units of the PostgreSQL.conf file from a user perspective is that the numbers from your 8.0/8.1 conf file would no longer work. A little conversion utility to turn your 8.0 file into an 8.2 file would help solve that. Josh, I would imagine that Peter intends to handle backward compatibility by processing values without explicit units in the units assumed pre 8.2. Thanks, Gavin ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] plPHP and plRuby
On Sun, 16 Jul 2006, Joshua D. Drake wrote: Hello, However plRuby is even a stranger beast as it uses an entirely Ruby build system. I am also fairly confident that it does not meet the PostgreSQL style guidelines. Well... JDBC used its own. Is there enough interest in plRuby to get it where it needs to be for possible inclusion into core? I'd like to see it ship with the core distribution. Thanks, Gavin ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Removing AddDepends; should I bother with a project?
On Mon, 10 Jul 2006, Bruce Momjian wrote: Josh Berkus wrote: Folks, For the code sprint, I'm starting off by removing the projects from contrib which need to be removed but still have some usefulness. I'm not exactly sure what to do with adddepends, though. It seems unlikely to lead an independent existence on pgFoundry; I'm inclined to just nuke it. I vote for the nuclear option. ;-) There are still 7.2 systems out there which need it. The problem is, adddepend is broken when run against 8.1. It breaks on serial, I think. Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] vacuum, performance, and MVCC
On Thu, 22 Jun 2006, Agent M wrote: On Jun 22, 2006, at 9:56 PM, Christopher Kings-Lynne wrote: The example is a very active web site, the flow is this: query for session information process HTTP request update session information This happens for EVERY http request. Chances are that you won't have concurrent requests for the same row, but you may have well over 100 HTTP server processes/threads answering queries in your web server farm. You're crazy :) Use memcache, not the DB :) Still, the database is the one central location that the apaches can connect to - postgres already has a lot of application platform features- locking synchronization, asynchronous notifications, arbitrary pl code. Personally, I think that a special non-MVCC table type could be created- the catalogs are similarly flat. What I envision is a table type that can only be accessed outside transactions (like AutoCommit mode)- this is already possible to implement in plperl for a single session. It would be more efficient to have something like a global temp table hanging around... Just some random ideas... Unfortunately, it's not so simple. What if a user enters a transaction block, modifies a normal table, modifies this 'special table'... then rolls back? This is the problem MySQL has with innodb and myisam tables. Well... we could just document that. If only. What happens if, as part of the update to the special table, we encounter an error? MVCC currently guarantees that this modification will be invisible. Without MVCC, we have no such capability. There seems to be a bit of confusion about what MVCC is. PostgreSQL is not the only MVCC database. InnoDB is MVCC. Oracle is MVCC. As far as I know, PostgreSQL is the only MVCC database with a 'non-overwriting storage manager'. The other MVCC databases maintain UNDO logs outside of the table's data files. 
When an update occurs, the existing row version is copied to the UNDO log, the new data replaces the old and a backward pointer from the table row to the UNDO log is created. Concurrent reads must go into the UNDO log and find the version visible to them. This implements the principle of MVCC and uses snapshot isolation, like we do, to isolate read/write concurrency. Overwriting MVCC comes with its own baggage. Ask any Oracle user about error ORA-01555[1]. There's also the added cost of managing the UNDO logs, the cost of jumping around between files to get row versions and so on. Also, it leads to inefficiency with variable-size data types. The new version of a row might be longer or shorter than the previous version and this has to be causing them headaches and performance penalties. As for single version databases -- they have a tonne of problems themselves. They need to maintain UNDO functionality so as to deal with the error handling I detailed above but they also cannot implement MVCC rules for all cases: readers do not block writers, writers do not block readers. Instead, they implement a large lock matrix and certain types of queries use certain types of granular locks. Some potentially interesting reading material on these ideas: J. Gray & A. Reuter, Transaction Processing: Concepts and Techniques US Patent Number 5,870,758 -- (one of?) Oracle's snapshot isolation patent Tom Lane's MVCC talk: http://www.postgresql.org/files/developer/transactions.pdf Thanks, Gavin --- [1] Basically, the UNDO logs are a circular buffer and, remarkably, Oracle doesn't seem to let the buffer expand if there is a long-running transaction. (We do this for WAL). Basically, in this case Oracle finds itself in a compromising position because if any more data is written to the UNDO log the query affected will not be able to roll back and/or concurrent queries will not be able to find the correct row version for their snapshot. 
UNDO log size can be adjusted but this situation bites Oracle users constantly. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
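The UNDO-chain lookup described above - newest version in the table, older versions reached through a back pointer, readers walking until they find a version their snapshot can see - can be illustrated with a toy version chain. The struct layout and visibility rule here are purely illustrative, not Oracle's or InnoDB's actual format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy overwriting-MVCC row: the table holds only the newest version; prior
 * versions live in an "UNDO" chain reached via a back pointer. */
typedef struct RowVersion {
    uint64_t xid;                  /* transaction that wrote this version */
    int value;
    struct RowVersion *undo;       /* previous version, or NULL */
} RowVersion;

/* A reader with snapshot 'snap_xid' follows the chain until it reaches the
 * newest version written by a transaction it is allowed to see (grossly
 * simplified: real systems also track commit status, not just xid order). */
static const RowVersion *visible_version(const RowVersion *row, uint64_t snap_xid)
{
    while (row && row->xid > snap_xid)
        row = row->undo;           /* too new for this snapshot: walk back */
    return row;
}
```

The ORA-01555 failure mode in footnote [1] corresponds to the undo pointer leading to a slot the circular buffer has already recycled: the chain is broken and the reader cannot reconstruct its snapshot.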
Re: [HACKERS] vacuum, performance, and MVCC
On Thu, 22 Jun 2006, Jonah H. Harris wrote: Not in all systems. A few now perform in-memory UNDO and only write it to disk if and when it is required. Interesting... Overwriting MVCC comes with its own baggage. Ask any Oracle user about error ORA-01555[1]. There's also the added cost of managing the UNDO logs, the cost of jumping around between files to get row versions and so on. This seems to be going in the direction of our common MySQL discussions; relying on old failures and mistakes to base our assumptions on the current version. Please stay apprised of current developments in other systems. Erm. Perhaps a poor example as I was not trying to put Oracle in a poor light. Rather, I was trying to point out that each method has its disadvantages. If the update-in-place method had no drawbacks we shouldn't be hanging on to our existing mechanism. J. Gray & A. Reuter, Transaction Processing: Concepts and Techniques Pretty much older than dirt, discusses locking, and barely touches on MVCC. Still has some good concepts though. The really useful section of this book is the discussion of snapshot isolation. That's the important thing here. Conceivably we could have a higher performance storage system but, IMHO, it must implement snapshot isolation. Thanks, Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] DB2-style INS/UPD/DEL RETURNING
On Sun, 12 Mar 2006, Jonah H. Harris wrote: I was talking with Jonathan Gennick about the INS/UPD/DEL RETURNING stuff, and he recommended looking into the way DB2 handles similar functionality. After looking into it a bit, it's more in line with what Tom's suggestion was regarding a query from the operation rather than returning the values in the manner currently required. Here's DB2's syntax... does anyone have any familiarity with it? Simply put, it's sort-of like: SELECT * FROM (FINAL | NEW | OLD) TABLE (INSERT | UPDATE | DELETE) I'd like to hear from anyone that's used it to see if it really is better... logically it seems nicer, but I've never used it. It works well for cases where you want to pass the result of an insert/delete/update to another query. There was a paper on IBM developerWorks on how they got the 7 or so queries in an order transaction in TPC-C down to 3 queries and increased throughput impressively. This doesn't solve the generated keys problem that the Java and probably .NET interfaces have. Mind, RETURNING doesn't solve anything either. I prefer this syntax to RETURNING. Then again, Oracle is a bigger target than DB2 so... I'm not sure. Thanks, Gavin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] DB2-style INS/UPD/DEL RETURNING
On Mon, 13 Mar 2006, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: On Sun, 12 Mar 2006, Jonah H. Harris wrote: SELECT * FROM (FINAL | NEW | OLD) TABLE (INSERT | UPDATE | DELETE) This doesn't solve the generated keys problem that the Java and probably .NET interfaces have. Mind, RETURNING doesn't solve anything either. Why not? AFAICS, either one lets you get at generated keys. There are a few different ways to get at generated keys from JDBC at least. The case we cannot trivially deal with is when the code executes a statement and then wants a result set of all generated keys. That is, it doesn't register which generated keys it wants returned before the query is executed. It's quite unclear to me what the difference is between FINAL and NEW ... any clarification there? NEW returns the representation of the data which the statement creates; FINAL is the final representation of the data, after AFTER triggers have been applied. The OLD idea is cute but I'm not sure how useful it really is. They seem to have missed a bet anyway: if I understand how this works, you can't get values from both new and old row states in the UPDATE case. The classification seems bogus for both INSERT and DELETE, too; neither of them have more than one row state to deal with. Right, it's not as useful as our OLD.*, NEW.*. Also, is the front SELECT allowed to have its own WHERE, or is it constrained to return exactly one row per inserted/updated/deleted row? If it can have a WHERE then there's a syntactic ambiguity in SELECT ... FROM NEW TABLE UPDATE ... WHERE ... That's definitely ambiguous. The manual doesn't clarify and I do not have DB2 installed locally. More generally, this syntax is problematic in that it's syntactically possible to use SELECT FROM NEW TABLE ... as a sub-query, which seems like a truly horrid idea from both semantics and implementation perspectives. I cannot see any reference to whether this is allowed in DB2. 
The DB2 manual and other IBM apps use it extensively in named expressions, i.e., WITH foo as (SELECT FROM NEW TABLE(...)), bar as (SELECT FROM OLD TABLE(...)) SELECT ... FROM foo, bar It does say that a 'data change table reference' is simply a type of table reference so I suppose it can occur in a subquery. The ability to have INSERT ... RETURNING in a FROM clause would achieve most of this, I think. Thanks, Gavin ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] to_char and i18n
Gavin Sherry wrote: On Wed, 21 Dec 2005, Tom Lane wrote: Manuel Sugawara masm@fciencias.unam.mx writes: (Some time ago I proposed an--incomplete--patch and it was rejected by Karel arguing that to_char functions should behave *exactly* the same way that they do in Oracle.) That is the accepted plan for to_char ... of course, if Oracle changes to_char every so often, it'll get more interesting to decide what to do. There's some functionality in 10g which PostgreSQL does not have: TZD - returns the short timezone string with daylight saving information, eg: PDT This is the same as TZ and it is easy to implement. TZM - timezone offset minutes part Trivial TZH - timezone offset hours part Trivial TZR - timezone region (US/Pacific, for example) We don't currently have an offset - region name lookup table but it should be easy enough to implement... RR/ - accept 'rounded' years, eg 99-1-1 = 1999-1-1 (kind of pointless) FF - specify how many digits to the right of the decimal place to display, when looking at fractions of seconds. Eg: HH:MM:SS.FF3 would produce 15:56:22.123 Trivial X - the local radix character. Eg: HH:MM:SSXFF would produce 15:56:22.123 I don't know how to get this character... is it included in the locale data somewhere (and where, specifically)? E - Era name (like, Japanese Imperial) (kind of pointless) EE - Full era name No idea where to get this data. DS - Locale formatted short date. For example, DD/MM/ for the Brits, MM/DD/ for the Yanks Is this desirable? It may lead to confusion with datestyle. DL - Locale formatted long date. Eg: fmDay, dd. Month in Germany Should be straightforward - if the underlying library will honour locale. SCC - Like 'CC', but will carry a - (minus) for BC dates (I'm not sure if this implies that Oracle wants BC dates to be marked 'BC'. I don't have an Oracle system around at the moment to check though :-() Thoughts? TS - Locale formatted short time. 
Should be straightforward - if the underlying library will honour locale. YEAR - Year in words Hmmm. This would be hard to do if we want to support local language settings. SYEAR - Year in words, prefixed by minus sign for BC dates As above. S - , prefixed by minus sign for BC dates Should be straightforward. Any comments on the above? Gavin ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
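For the 'trivial' TZH/TZM items above, the work is just splitting a UTC offset into signed hour and minute parts. A hypothetical helper (format_tzh_tzm is invented here; it is not the eventual formatting.c code):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Render a UTC offset given in seconds as the TZH:TZM pair described above,
 * e.g. -34200 seconds (US/Marquesas-style half-hour zones exist) -> "-09:30".
 * The sign attaches to the whole offset, so the minutes part never carries
 * its own sign. */
static void format_tzh_tzm(int gmtoff_secs, char *buf, size_t len)
{
    int mins = abs(gmtoff_secs) / 60;
    snprintf(buf, len, "%c%02d:%02d",
             gmtoff_secs < 0 ? '-' : '+', mins / 60, mins % 60);
}
```

The one subtlety worth the comment: naively formatting hours and minutes separately with `%+03d` on each would print "-09:-30" for negative offsets, which is why the absolute value is taken first.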
Re: [HACKERS] Status of INS/UPD/DEL RETURNING?
On Wed, 1 Mar 2006, Bruce Momjian wrote: Jonah H. Harris wrote: Hey guys, What's the status of the current INSERT/UPDATE/DELETE RETURNING patch? Is it ready to go or does it need to be cleaned up? Uh, I don't remember seeing any patch like that. Where is it? Omar Kilani sent in a patch before 8.1 FF derived from (??) some playing around I had done. It still needs work. Thanks, Gavin ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] PostgreSQL unit tests
On Fri, 24 Feb 2006, Michael Glaesemann wrote: [I neglected to cc the list in my reply earlier. Apologies to Gavin for the double-post.] On Feb 23, 2006, at 11:40 , Gavin Sherry wrote: I do think that unit testing of areas such as data types would be useful, particularly the date/time code and arrays as I consider that area of the code quite fragile. I wouldn't expect the unit tests to find any bugs. Rather, it would make it easier, I think, for people (particularly newcomers) to hack on that part of the code with more confidence. This is the area I specifically had in mind when thinking of unit tests. I am looking to do more work on the date/time code in particular, and having a unit testing framework and tests to verify expected behavior would definitely give me a greater sense of security that I wasn't mucking stuff up. Yes. You sound like the perfect person to implement this :-). Your earlier proposal was probably the one I remembered. Could you comment on your experience with any of the existing unit testing frameworks? In particular, did any stand out in applicability to Postgres? What's your take on possible licensing issues? I looked at Check and CuTest from memory. The former was more sophisticated, if memory serves me correctly, because it had the ability to fork and run the code from a child to see if it segfaulted, for example. The licensing issue is more of a pain. Obviously we cannot incorporate GPL stuff into the code base. CuTest seems to have a pretty BSD compatible license, though. That said, the actual implementation is very straightforward: 1) Queue test functions which assert something about the result of a function 2) Run the tests 3) capture pass/fails. We have some special requirements with our code because it can ereport()/elog(). As such, it is quite attractive to just write our own, if unit testing is to proceed. 
Thanks,

Gavin
Re: [HACKERS] Attempting upgrade path; is this possible?
On Wed, 22 Feb 2006, Shaun Thomas wrote:
> I'm in charge of a very large database, and we're using a highly decrepit version of PostgreSQL currently. After searching through the archives, Google, and trying out several replication engines, I have a question.
>
> I had originally considered Slony-I, as it doesn't seem to require version compatibility between nodes like pgCluster, so upgrading from 7.4.2 to 8.1.3 would be a possible, if slow, process. But after looking into the level of micro-management necessary, such as defining sets of every table on a per-database level, then having it add artificial primary keys to applicable tables, it just doesn't seem like a good choice. Not a fault of Slony-I, but several multi-gig databases hosting hundreds of tables would be a nightmare to use with Slony-I.

There are tools in the /tools directory. In particular, take a look at /tools/altperl. You can use it to set up Slony and replicate all tables with very little hassle. Slony adds the primary keys for you.

> Then I thought about the backup/recovery system and the WAL files. Would this scenario be possible:
>
> 1. Do a pg_dumpall on the existing database running 7.4.2.
> 2. Do a psql -f foo template1 on the new database running 8.1.3.
> 3. Wait a very long time while the new database loads.
> 4. Shut down the old database.
> 5. Start the new database in restore mode, and point it to the WAL files from the old database.
> 6. Wait for the restore to finish.
> 7. Restart the new database.

This is not possible. On your 7.4 system, tables have a unique object ID to identify them. When you restore the dump on the 8.1 system, they will have different object IDs. There are several other issues of this nature. Also, the binary format of the log files has changed, and the whole process would be significantly more difficult than using Slony to upgrade. The size of your databases does not sound like an issue - lots of people have done what you're doing with GB-range databases.
> I wondered about this, as the pg_dumpall/restore would take a very long time for a 50GB database cluster, but theoretically the WAL files would continue to accumulate on the old db while this loading was taking place. If the WAL formats were compatible, the total upgrade time would only be restricted to how long it took to replay the WAL files in the new database. Does the current format of the WAL files make this possible? If not, is such an option for the future?

It is possible that someone could write a translation tool which translates WAL entries into SQL, but there is some messiness to deal with (backup blocks; create table foo, insert into foo, drop table foo; and more). I think the best solution is an in-place upgrade tool which does all the binary conversions, additions and subtractions itself. This could be quite cheap, because the conversion will often only affect system catalogs, not user tables.

Thanks,

Gavin
Re: [HACKERS] PostgreSQL unit tests
On Wed, 22 Feb 2006, Alvaro Herrera wrote:
> Robert Treat wrote:
> > You could check into what SpikeSource has been doing. I believe they mostly just piggyback off of our regression tests for postgresql core, but there might still be something that could be built upon. If you look at this url http://developer.spikesource.com/spikewatch/index.jsp?show=component-resultscomp-id=22074 the actual success information isn't terribly exciting, but the code coverage url shows something of more interest. There is more stuff if you dig around a bit.
>
> This can't be right. The report for function coverage shows 100% for all utf8_and_*.c files, at the end of the listing. Notice how C/D coverage (I don't know what it means, but I assume it's somehow computed per lines of code or something) is 0, which is probably the correct result, because our regression tests do not test charset conversions at all. I think the bug may be that they use function names to see what is actually tested ...
>
> IIRC Gavin Sherry gave a URL to a test coverage result some centuries ago. The only thing that I remember about the result was that it was surprisingly low (IMHO at least).

Yes. Coverage was about 50%, from memory. This was coverage resulting from the regression tests.

I previously proposed integrating a unit test framework into PostgreSQL. Getting started wasn't much fun and I gave up. This is because unit testing is really suited to a functional programming model, IMHO. Testing the most complex parts of the postgres backend requires a *lot* of state to be initialised, and we'd really have to stomp all over the backend.

I do think that unit testing of areas such as data types would be useful, particularly the date/time code and arrays, as I consider that area of the code quite fragile. I wouldn't expect the unit tests to find any bugs. Rather, it would make it easier, I think, for people (particularly newcomers) to hack on that part of the code with more confidence.
The areas of the backend which do not suit unit testing are usually associated with lots of state or lots of concurrency - like WAL, the buffer manager and so on. The approaches we have at the moment -- regression tests, user testing, load generators (benchmarks et al) -- do an okay job, but they are a brute-force approach. They test the common code path, not the uncommon one.

Approaches used by other projects include sophisticated static analysis (both of language semantics and application semantics, such as 'function bar should never be called unless function foo has been called first'); model checking (difficult to implement, with limited results); and one-off patches that introduce disastrous conditions in the code (like low-level functions randomly failing, to check that the callers of such functions deal with such conditions correctly). Some basic static analysis has been applied to the backend (the Stanford Checker, Coverity, etc.), no model checking (as far as I am aware), and Tom had some good results with the last approach earlier in the year, but I cannot remember what area of the code he was focusing on.

The problem is, these approaches are either of limited use, or time-consuming and niche enough that they cannot be done often or across the whole tree. Any other thoughts on this?

Thanks,

Gavin