Re: [HACKERS] WIP checksums patch
On Mon, 2012-10-01 at 12:35 -0400, Bruce Momjian wrote:
> The heap/index files are copied unmodified from the old cluster, so there are no checksums on the pages.

That's fine though, the patch still reads the old format the same way, and the pages are treated as though they have no checksum. How is that a reason for defaulting the GUC to off? I'm missing something here. Are we worried about users who turn the GUC on and then expect all of their old data pages to magically be protected?

Regards,
	Jeff Davis

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP checksums patch
On Fri, 2012-09-14 at 17:58 -0700, Jeff Davis wrote:
> This is just a rebased version of the patch by Simon here: http://archives.postgresql.org/message-id/CA+u5nmkw_gbs6qq_y8-rjgl1v7mvw2hwbhartb8lojhnpfx...@mail.gmail.com

Another thing I noticed about the design of this patch: it looks like the checksum will usually be wrong in a backup block in WAL, because the backup block is written before the checksum is calculated (which isn't done until the block is written out). I think that's OK, because the backup block is still protected by the WAL CRC, there's no expectation that the checksum is correct in shared buffers, and the correct checksum should be set on the next checkpoint. Just an observation.

Regards,
	Jeff Davis
Re: [HACKERS] WIP checksums patch
On Mon, 2012-10-01 at 18:14 +0100, Simon Riggs wrote:
> You are missing large parts of the previous thread, giving various opinions on what the UI should look like for enabling checksums.

I read through all of the discussion that I could find. There was quite a lot, so perhaps I have forgotten pieces of it. But that one section in the docs does look out of date and/or confusing to me.

I remember there was discussion about a way to ensure that checksums are set cluster-wide with some kind of special command (perhaps related to VACUUM) and a magic file to let recovery know whether checksums are set everywhere or not. That doesn't necessarily conflict with the GUC though (the GUC could be a way to write checksums lazily, while this new command could be a way to write checksums eagerly).

If some consensus was reached on the exact user interface, can you please send me a link?

Regards,
	Jeff Davis
Re: [HACKERS] Statistics and selectivity estimation for ranges
On Tue, 2012-09-04 at 17:27 +0400, Alexander Korotkov wrote:
> Addon patch is attached. Actually, I don't get your intention of introducing STATISTIC_KIND_RANGE_EMPTY_FRAC stakind. Did you plan to leave it as empty frac in distinct stakind or replace this stakind with STATISTIC_KIND_LENGTH_HISTOGRAM? In the attached patch STATISTIC_KIND_RANGE_EMPTY_FRAC is replaced with STATISTIC_KIND_LENGTH_HISTOGRAM.

Review comments:

1. In compute_range_stats, you need to guard against the case where there is no subdiff function. Perhaps default to 1.0 or something?

2. I think it would be helpful to add comments regarding what happens when lengths are identical; right now it's a little confusing. For instance, the comment "Generate a length histogram slot entry if there are at least two length values" doesn't seem right, because the condition still matches even if there is only one distinct value.

3. It looks like get_distance also needs to guard against a missing subdiff.

4. There are 3 binary search functions, which seems a little excessive:
   * rbound_bsearch: greatest i such that hist[i] < v; or -1
   * rbound_bsearch_equal: greatest i such that hist[i] <= v and (i=0 or hist[i-1] != hist[i]); or -1
   * length_hist_bsearch: least i such that hist[i] >= v; or length of hist
(let me know if I misunderstood the definitions)

At a minimum, we need more consistent and informative names. Also, the definition of rbound_bsearch_equal is a little confusing, because it's looking for the highest index among distinct values but the lowest index among identical values. Do you see a way to refactor these to be a little easier to understand?

Regards,
	Jeff Davis
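Reading the three definitions in point 4 as <, <=, and >= comparisons respectively (the comparison operators appear to have been lost in transit), they might be sketched over a plain sorted int array like this. This is a hypothetical simplification for illustration, not the patch's actual RangeBound-based code:

```c
/* greatest i such that hist[i] < v; or -1 if no such index */
int
rbound_bsearch(const int *hist, int n, int v)
{
	int		lo = -1;
	int		hi = n;

	/* invariant: hist[lo] < v (or lo == -1); hist[hi] >= v (or hi == n) */
	while (hi - lo > 1)
	{
		int		mid = lo + (hi - lo) / 2;

		if (hist[mid] < v)
			lo = mid;
		else
			hi = mid;
	}
	return lo;
}

/*
 * greatest i such that hist[i] <= v and (i == 0 or hist[i-1] != hist[i]);
 * or -1: highest index among distinct values, lowest among identical ones.
 */
int
rbound_bsearch_equal(const int *hist, int n, int v)
{
	/* greatest i with hist[i] <= v is greatest i with hist[i] < v+1 */
	int		i = rbound_bsearch(hist, n, v + 1);

	/* back up to the first element of a run of duplicates */
	while (i > 0 && hist[i - 1] == hist[i])
		i--;
	return i;
}

/* least i such that hist[i] >= v; or n if no such index */
int
length_hist_bsearch(const int *hist, int n, int v)
{
	return rbound_bsearch(hist, n, v) + 1;
}
```

Writing the latter two in terms of the first also shows why the trio might collapse into a single lower-bound search plus thin wrappers, which may address the naming concern.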
Re: [HACKERS] gistchoose vs. bloat
On Tue, 2012-09-04 at 19:21 +0400, Alexander Korotkov wrote:
> New version of patch is attached. Parameter randomization was introduced. It controls whether to randomize choose. Choose algorithm was rewritten.

Review comments:

1. The comment above the while loop in gistRelocateBuildBuffersOnSplit needs to be updated.

2. Typo in two places: "if randomization id required".

3. In gistRelocateBuildBuffersOnSplit, shouldn't that be:
     splitPageInfo = relocationBuffersInfos[bufferIndex];
   not:
     splitPageInfo = relocationBuffersInfos[i];

4. It looks like the randomization is happening while trying to compare the penalties. I think it may be more readable to separate those two steps; e.g.:

     /* create a mapping whether randomization is on or not */
     for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
         offsets[i - FirstOffsetNumber] = i;
     if (randomization)
         /* randomize offsets array */
     for (i = 0; i < maxoff; i++)
     {
         offset = offsets[i];
         ...
     }

That's just an idea; if you think it's more readable as-is (or if I am misunderstanding) then let me know.

Regards,
	Jeff Davis
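The separation suggested in point 4 could be sketched as below. The function name is hypothetical, and plain rand() with a Fisher-Yates shuffle stands in for whatever randomization source the patch actually uses:

```c
#include <stdlib.h>

/*
 * Build an identity mapping of offsets, then optionally shuffle it.
 * first/max play the roles of FirstOffsetNumber/maxoff; offsets[]
 * must have room for max - first + 1 entries.
 */
void
build_offset_map(int *offsets, int first, int max, int randomize)
{
	int		n = max - first + 1;
	int		i;

	for (i = 0; i < n; i++)
		offsets[i] = first + i;

	if (randomize)
	{
		/* Fisher-Yates shuffle: uniform random permutation */
		for (i = n - 1; i > 0; i--)
		{
			int		j = rand() % (i + 1);
			int		tmp = offsets[i];

			offsets[i] = offsets[j];
			offsets[j] = tmp;
		}
	}
}
```

The penalty-comparison loop then walks offsets[0..n-1] in order, which is either the natural order or a random permutation, keeping the comparison logic itself oblivious to the randomization setting.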
Re: [HACKERS] WIP checksums patch
On Fri, 2012-09-14 at 17:58 -0700, Jeff Davis wrote:
> This is just a rebased version of the patch by Simon here:

I just noticed the following note in the docs for this patch:

  "The default is <literal>off</literal> for backwards compatibility and to allow upgrade. The recommended setting is <literal>on</literal> though this should not be enabled until upgrade is successfully complete with full set of new backups."

I don't understand what that means -- if they have the page_checksums GUC available, then surely upgrade is complete, right? And what is the backwards-compatibility issue? Also, it looks out of date, because the default in guc.c is set to true.

I think we should probably default to true, because it's safer and it can always be disabled at runtime, but I don't have a strong opinion about that.

Regards,
	Jeff Davis
Re: [HACKERS] WIP checksums patch
On Fri, 2012-09-14 at 17:58 -0700, Jeff Davis wrote:
> * we might want to make it slightly easier for external utilities, like for backup/replication, to verify the pages

Ideally, PageVerificationInfoOK should be available to external utilities, so that someone might script a background job to verify pages. Or, perhaps we should just include a new utility, pg_verify_checksums?

Either way, where is a good place to put the function so that it's shared between the server and the utility? In src/port somewhere?

Regards,
	Jeff Davis
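To make the idea concrete, a skeleton of such an external verifier might look like this. Everything here is hypothetical: the real utility would link against whatever shared checksum code (a PageVerificationInfoOK-style function) ends up being exported, and the stand-in check below always succeeds:

```c
#include <stdio.h>
#include <stdint.h>

#define BLCKSZ 8192

/*
 * Stand-in for the patch's per-page verification; a real utility would
 * call the shared server-side checksum logic here.
 */
int
page_checksum_ok(const uint8_t *page, uint32_t blkno)
{
	(void) page;
	(void) blkno;
	return 1;
}

/*
 * Scan one relation segment file block by block, returning the number
 * of pages that fail verification.
 */
long
verify_segment(FILE *fp)
{
	uint8_t		page[BLCKSZ];
	uint32_t	blkno = 0;
	long		bad = 0;

	while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
	{
		if (!page_checksum_ok(page, blkno))
			bad++;
		blkno++;
	}
	return bad;
}
```

The interesting design question is exactly the one raised above: the loop is trivial, but page_checksum_ok must be byte-for-byte identical to what the server computes, which is the argument for putting it somewhere like src/port.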
Re: [HACKERS] Question about SSI, subxacts, and aborted read-only xacts
On Sun, 2012-09-16 at 16:16 -0500, Kevin Grittner wrote:
> I'm attaching an alternative proposal, with changes for the following reasons:

Looks good to me, aside from not wrapping the text.

Regards,
	Jeff Davis
[HACKERS] WIP checksums patch
This is just a rebased version of the patch by Simon here:

http://archives.postgresql.org/message-id/CA+u5nmkw_gbs6qq_y8-rjgl1v7mvw2hwbhartb8lojhnpfx...@mail.gmail.com

There are other things I'd like to do, like:

* include page# in checksum, and perhaps relfilenode or object OID (but those two might cause difficulties)
* use TLI field instead of page version, if that is the consensus (though it seems some may still disagree with that)
* we might want to make it slightly easier for external utilities, like for backup/replication, to verify the pages
* optionally protect temporary files and local buffers
* come up with a robust way to determine that all pages in the database are protected
* protect uninitialized pages

But not all of these will make the first patch. This is just a starting point to reignite the discussion. I welcome any feedback.

Regards,
	Jeff Davis
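On the first bullet: folding the page number into the checksum means a page that gets written to the wrong block no longer verifies. A toy illustration of that idea, using a Fletcher-style sum seeded with the block number (this is deliberately simplistic and is not the patch's actual algorithm):

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Toy 16-bit Fletcher-style checksum that mixes in the block number,
 * so identical page contents at different block numbers produce
 * different checksums.
 */
uint16_t
page_checksum(const uint8_t *page, size_t len, uint32_t blkno)
{
	uint32_t	sum1 = blkno & 0xff;
	uint32_t	sum2 = (blkno >> 8) & 0xff;
	size_t		i;

	for (i = 0; i < len; i++)
	{
		sum1 = (sum1 + page[i]) % 255;
		sum2 = (sum2 + sum1) % 255;
	}
	return (uint16_t) ((sum2 << 8) | sum1);
}
```

Seeding with the block number is what catches misdirected writes; adding relfilenode or OID to the seed would extend that to cross-relation mix-ups, at the cost of the difficulties noted above (those values can change under the page, e.g. on rewrite).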
Re: [HACKERS] gistchoose vs. bloat
On Tue, 2012-09-04 at 19:21 +0400, Alexander Korotkov wrote:
> New version of patch is attached. Parameter randomization was introduced. It controls whether to randomize choose. Choose algorithm was rewritten.

Do you expect it to be bad in any reasonable situations? I'm inclined to just make it always randomize if it's better. I think it would be hard for a user to guess when it's better and when not.

Maybe it's useful to turn randomization off for testing purposes, e.g. to ensure determinism?

Regards,
	Jeff Davis
Re: [HACKERS] Question about SSI, subxacts, and aborted read-only xacts
On Mon, 2012-09-10 at 11:15 -0500, Kevin Grittner wrote:
> That's a fair point. Do you have any suggested wording, or suggestions for exactly where in the documentation you think it would be most helpful?

The subsection on serializable transactions seems like the most obvious location. Attached.

I thought about putting it as a note, but it seems like it's easy to go overboard with those.

Regards,
	Jeff Davis

ssi_doc.patch.gz
Description: GNU Zip compressed data
Re: [HACKERS] Question about SSI, subxacts, and aborted read-only xacts
On Mon, 2012-09-10 at 11:15 -0500, Kevin Grittner wrote:
> ... and I know Jeff read that quite closely because he raised a question off-list about an error he found in it which managed to survive the many editing and review passes that paper went through. :-)

Well, I need to keep up with the discussion on the interaction of temporal and SSI :-)

> I think the behavior is correct because a function's control flow might be directed by what it reads in a subtransaction, even if it rolls back -- and the transaction as a whole might leave the database in a different state based on that than if it had read different data (from a later snapshot). For example, if a plpgsql function has a BEGIN/EXCEPTION/END block, it might read something from the database and use what it reads to attempt some write. If that write fails and the EXCEPTION code writes something, then the database could be put into a state which is dependent on the data read in the subtransaction, even though that subtransaction is rolled back without the client ever directly seeing what was read.

On reflection, I agree with that. Trying to puzzle through your transactions (and application logic) to see if you are depending on any information read in an aborted subtransaction is exactly the kind of thing SSI was meant to avoid.

> This strikes me as significantly different from returning some rows to a client application and then throwing an error for the transaction as a whole, because the client will certainly have an opportunity to see the failure (or at worst, see a broken connection before being notified of a successful commit).

Oh, I see the distinction you're making: in PL/pgSQL, the exception mechanism involves *implicit* subtransaction rollbacks. That's more of a language issue, but a valid point. I'm still not sure I see a theoretical difference, but it does seem wise to keep predicate locks for aborted subtransactions.

>> If so, I think we need a documentation update. The serializable isolation level docs don't quite make it clear that serializability only applies to transactions that commit. It might not be obvious to a user that there's a difference between commit and abort for a RO transaction. I think that, in S2PL, serializability applies even to aborted transactions (though I haven't spent much time thinking about it), so users accustomed to other truly-serializable implementations might be surprised.
>
> That's a fair point. Do you have any suggested wording...

I'll write something up. Can I document that you may depend on the results read in aborted subtransactions, or should I leave that undefined for now?

Regards,
	Jeff Davis
Re: [HACKERS] Question about SSI, subxacts, and aborted read-only xacts
On Mon, 2012-09-10 at 21:59 -0400, Dan Ports wrote:
> It might be worth noting that serializable mode will not cause read-only transactions to fail to commit

For the archives, and for those not following the paper in detail, there is one situation in which SSI will abort a read-only transaction: when there are three transactions forming a dangerous pattern where T1 (read-only) has a conflict out to T2, and T2 has a conflict out to T3; and T3 is committed and T2 is prepared (for two-phase commit). In that situation, SSI can't roll back the committed or prepared transactions, so it must roll back the read-only transaction (T1).

Even in that case, SSI will ordinarily prevent T2 from preparing. It's only if T1 takes its snapshot after T2 prepares and before T2 commits that the situation can happen (I think).

Fortunately, for two-phase commit, that's not a big problem, because the window between PREPARE TRANSACTION and COMMIT PREPARED is supposed to be narrow (and if it's not, you have bigger problems anyway). As long as the window is narrow, then it's reasonable to retry the transaction T1 and expect it to succeed after a short interval.

Regards,
	Jeff Davis
[HACKERS] Question about SSI, subxacts, and aborted read-only xacts
This question comes about after reading the VLDB paper "Serializable Snapshot Isolation in PostgreSQL".

We release predicate locks after a transaction abort, but not after a subtransaction abort. The paper says that the reason is:

  "We do not drop SIREAD locks acquired during a subtransaction if the subtransaction is aborted (i.e. all SIREAD locks belong to the top-level transaction). This is because data read during the subtransaction may have been reported to the user or otherwise externalized." (section 7.3)

But that doesn't make sense to me, because that reasoning would also apply to top-level transactions that are aborted, but we release the SIREAD locks for those. In other words, this introduces an inconsistency between:

  BEGIN ISOLATION LEVEL SERIALIZABLE;
  SAVEPOINT s1;
  ...
  ROLLBACK TO s1;
  COMMIT;

and:

  BEGIN ISOLATION LEVEL SERIALIZABLE;
  ...
  ROLLBACK;

I'm not suggesting this is a correctness problem: holding SIREAD locks for longer never causes incorrect results. But it does seem a little inconsistent.

For top-level transactions, I don't think it's possible to preserve SIREAD locks after an abort, because we rely on aborts to alleviate conflicts (and when using 2PC, we may need to abort a read-only transaction to correct the situation). So it seems like users must not rely on any answers they get from a transaction (or subtransaction) unless it commits. Does that make sense?

If so, I think we need a documentation update. The serializable isolation level docs don't quite make it clear that serializability only applies to transactions that commit. It might not be obvious to a user that there's a difference between commit and abort for a RO transaction. I think that, in S2PL, serializability applies even to aborted transactions (though I haven't spent much time thinking about it), so users accustomed to other truly-serializable implementations might be surprised.

Regards,
	Jeff Davis
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On Wed, 2012-09-05 at 17:03 -0400, Tom Lane wrote:
> In general I think the selling point for such a feature would be no administrative hassles, and I believe that has to go not only for the end-user experience but also for the application-developer experience. If you have to manage checkpointing and vacuuming in the application, you're probably soon going to look for another database.

Maybe there could be some hooks (e.g., right after completing a statement) that see whether a vacuum or checkpoint is required? VACUUM can't be run in a transaction block[1], so there are some details to work out, but it might be a workable approach.

Regards,
	Jeff Davis

[1]: It seems like the only reason for that is so a multi-table vacuum doesn't hold locks for longer than it needs to, but that's not much of a concern in this case.
[HACKERS] Minor comment fixups
I noticed a couple comments that look wrong to me. Patch attached.

Regards,
	Jeff Davis

*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*************** copy_relation_data(SMgrRelation src, SMgrRelation dst,
*** 8784,8792 ****
  	pfree(buf);
  
  	/*
! 	 * If the rel isn't temp, we must fsync it down to disk before it's safe
! 	 * to commit the transaction.  (For a temp rel we don't care since the rel
! 	 * will be uninteresting after a crash anyway.)
  	 *
  	 * It's obvious that we must do this when not WAL-logging the copy. It's
  	 * less obvious that we have to do it even if we did WAL-log the copied
--- 8784,8791 ----
  	pfree(buf);
  
  	/*
! 	 * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
! 	 * to ensure that the toast table gets fsync'd too.
  	 *
  	 * It's obvious that we must do this when not WAL-logging the copy. It's
  	 * less obvious that we have to do it even if we did WAL-log the copied
*** a/src/backend/storage/file/reinit.c
--- b/src/backend/storage/file/reinit.c
*************** ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*** 337,343 ****
  			copy_file(srcpath, dstpath);
  		}
  
- 		/* Done with the first pass. */
  		FreeDir(dbspace_dir);
  	}
  }
--- 337,342 ----
Re: [HACKERS] NOT NULL constraints in foreign tables
On Tue, 2012-08-21 at 10:41 -0400, Robert Haas wrote:
> The thing to keep in mind here is that EVERY property of a foreign table is subject to change at any arbitrary point in time, without our knowledge. ... Why should CHECK constraints be any different than, say, column types?

So, let's say someone changes a column type from int to bigint on the remote side, and you still have int on the local side. It continues to work and everything is fine until all of a sudden you get 2^33 back, and that generates an error. That sounds closer to the semantics of constraint enforcement mechanism #2 than #3 to me. That is, everything is fine until you get something that you know is wrong, and you throw an error.

> Why should that be any worse with foreign tables than anything else? I mean, lots of people, as things stand today, manage to set up partitioned tables using CHECK constraints. There are undoubtedly people who don't understand the planner benefit of having an appropriate CHECK constraint on each partition, but it's not exactly a common cause of confusion.

But there are no consequences there other than performance. With unenforced constraints, they may get correct results during development and testing, and wrong results occasionally in production. That's hard to explain to a user.

> It seems to me that the term "runtime enforcement" is a misnomer; you can't really enforce anything about a foreign table.

Maybe I chose the wrong terms, but there are at least 3 semantically different concepts. Feel free to suggest a better term.

> If we were to propose changing the semantics from the former to the latter, we'd be laughed out of town, and rightly so.

I'm not proposing changing the semantics; I'm saying that there are more than just 2 semantic options available, and they offer different kinds of guarantees. Users may be interested in all 3 for different use cases.

> I mean, let's suppose that we were to allow unique constraints on foreign tables.

I'm sure there are cases where people will not want what I am suggesting, but I think there are cases where it is plausibly useful.

> Now, if the query is something like SELECT * FROM foreign_table WHERE id = 1, you could fairly cheaply validate that there is only one row with id = 1, but that's not the same thing as validating that the assumption (namely, that foreign_table (id) is unique) is still true.

And if you don't issue a query at all, the constraint might not still be true; but I don't think that implies that checking it when you do run a query is useless.

> I think if we go down this road of trying to validate remote-side CHECK constraints, we're going to end up with a mishmash of cases where constraints are checked and other cases where constraints are not checked, and then that really is going to be confusing.

If we use keywords to differentiate constraints that are different semantically, then we can just say that some types of constraints are allowed on foreign tables and some are not.

> I guess what I'd like to avoid is saying that a check constraint on a regular table means one thing, and the same check constraint on a foreign table means something else.

If we differentiate them by requiring special keywords like NOT ENFORCED, then it would be more user-visible what's going on, and it would allow room for new semantics later if we want. Normal constraints would be disallowed on foreign tables, but NOT ENFORCED ones would be allowed.

That brings up another point: what if someone really, really doesn't want to pay the overhead of enforcing a constraint on a local table, but wants the planner benefit? Would they have to make it a remote table to bypass the constraint check?

Regards,
	Jeff Davis
Re: [HACKERS] Audit Logs WAS: temporal support patch
On Tue, 2012-08-21 at 17:56 -0500, Kevin Grittner wrote:
> I don't think the concerns I raised about apparent order of execution for serializable transactions apply to audit logs. If we've moved entirely off the topic of the original subject, it is a complete non-issue.

Now I'm confused. The serializability issues you were talking about only seem to matter with respect to system time (a.k.a. transaction time), right? If the user is supplying the time, then it's a non-issue. And audit logs are based on system time, so I thought that audit logs were the case you were talking about.

Regards,
	Jeff Davis
Re: [HACKERS] temporal support patch
On Tue, 2012-08-21 at 17:07 -0500, Kevin Grittner wrote:
> The fact that it has an unknown sequence number or timestamp for purposes of ordering visibility of transactions doesn't mean you can't show that it completed in an audit log. In other words, I think the needs for a temporal database are significantly different from the needs of an auditing system. ... I would assume an audit log would have very different needs from tracking changes for a temporal database view. It even seems possible that you might want to see what people *looked* at, versus just changes. You might want to see transactions which were rolled back, which are irrelevant for a temporal view. If we're talking about an auditing system, we're talking about an almost completely different animal from a temporal view of the database.

OK, I think I see what you're saying now. Basically, an "audit log" means different things to different people, so I think it confused the issue. But "temporal" is fairly vague, too. It also seems like there might be a lot of overlap, depending on how we define those terms.

I am most interested in the topic you brought up about serializability and system time (transaction time), because it would be a fundamental piece upon which we can build a lot of these other things (including what could be called an audit log).

Regards,
	Jeff Davis
Re: [HACKERS] NOT NULL constraints in foreign tables
On Fri, 2012-08-17 at 15:44 -0400, Robert Haas wrote:
> On Fri, Aug 17, 2012 at 2:58 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
>> I mean, what are NOT NULL in foreign tables for? Are they harmed or helped by having pg_constraint rows?
>
> As I've mentioned when this has come up before, I think that constraints on foreign tables should be viewed as declarative statements about the contents of the foreign data that the DB will assume true. This could be useful for a variety of purposes: constraint exclusion, query optimization, etc.

There are at least three kinds of constraint enforcement:

1. Enforced before the query runs (e.g. the current behavior on a normal table).
2. Enforced when the query runs, by validating the constraint as you go and throwing an error when it turns out to be false.
3. No attempt to enforce; silently produce wrong results if the constraint is false.

Which are you proposing, and how will the user know which kind of constraint they've got?

Regards,
	Jeff Davis
Re: [HACKERS] temporal support patch
On Mon, 2012-08-20 at 17:04 -0400, Robert Haas wrote:
> On Sun, Aug 19, 2012 at 6:28 PM, Jeff Davis pg...@j-davis.com wrote:
>>> The other issue is how to handle multiple changes of the same record within the transaction. Should they be stored or not?
>> In a typical audit log, I don't see any reason to. The internals of a transaction should be implementation details; invisible to the outside, right?
> I'm not convinced.

As I understand it, we are talking about recording data changes in one table to another table. Auditing of reads or the logging of raw statements seem like very different kinds of projects to me; but tell me if you think differently.

So, if we are recording data changes, I don't see much point in recording uncommitted changes. Perhaps my imagination is failing, and someone else can fill me in on a use case. I'm also struggling with the semantics: if we record uncommitted changes, do we record them even if the transaction aborts? If so, what guarantees do we offer about the change actually being recorded?

>>> I'm not sure that the database user is the proper thing to be stored in the history table. Many applications usually connect to a database using some virtual user and have their own users/roles tables to handle with privileges. There should be some way to substitute the stored user in the history table with the application's one. It's also helpful to store transaction id that inserted/updated/deleted the record.
>> If the system is recording it for audit purposes, then it better be sure that it's true. You can't allow the application to pick and choose what gets stored there.
> That position would render this feature useless for every application for which I would otherwise have used it.

We could offer a GUC like audit_context or audit_app_context that takes a text string, and the audit log would record the value stored in that GUC along with the data changes in question.

The main thing I object to is an implication that the system is vouching for some particular fact that is supplied by a userset GUC. Remember, there are guaranteed to be application-level problems that allow these GUCs to get set improperly for all kinds of reasons. We don't want bug reports along the lines of "security breach! PG allows application_name to be spoofed in the audit log!".

Also, I'd prefer not to use existing GUCs, because there may be all kinds of other reasons that people set existing GUCs, and we want them to be able to handle the audit_context one more carefully and have a clear warning in the documentation.

> I think it's just nonsense to talk about what we can or can't let the user do. The user is in charge, and our job is to allow him to do what he wants to do more easily, not to dictate what he must do.

Remember that the users who depend on the veracity of the audit log are users, too. Let's try to serve both classes of user if we can.

Regards,
	Jeff Davis
Re: [HACKERS] temporal support patch
On Mon, 2012-08-20 at 19:17 -0400, David Johnston wrote:
> Ideally the decision of whether to do so could be a client decision. Not storing intra-transaction changes is easier than storing all changes. At worse you could stage up all changed then simply fail to store all intermediate results within a given relation. It that case you gain nothing in execution performance but safe both storage and interpretative resources. So the question becomes is it worth doing without the ability to store intermediate results? If you were to ponder both which setup would the default be? If the default is the harder one (all statements) to implement then to avoid upgrade issues the syntax should specify that it is logging transactions only.

I think the biggest question here is what guarantees can be offered. What if the transaction aborts after having written some data -- does the audit log still get updated?

> I see the user element as having two components:

I think this is essentially a good idea, although as I said in my other email, we should be careful how we label the application-supplied information in the audit log.

Regards,
	Jeff Davis
Re: [HACKERS] temporal support patch
On Mon, 2012-08-20 at 16:32 -0700, Josh Berkus wrote:
> This is sounding like a completely runaway spec on what should be a simple feature.

My feeling as well. However, we will eventually want to coalesce around some best practices and make it easy and robust for typical cases.

> Personally, I would prefer a tool which just made it simpler to build my own triggers, and made it automatic for the history table to track changes in the live table. I think anything we build which controls what goes into the history table, etc., will only narrow the user base.

That sounds like a good way to start. Actually, even before the tool, how about just some really good examples of triggers for specific kinds of audit logs, and some ways to run queries on them? I think that might settle a lot of these details.

Regards,
	Jeff Davis
Re: [HACKERS] temporal support patch
On Mon, 2012-08-20 at 19:32 -0500, Kevin Grittner wrote: Josh Berkus j...@agliodbs.com wrote: This is sounding like a completely runaway spec on what should be a simple feature. I hate to contribute to scope creep (or in this case scope screaming down the tracks at full steam), but I've been watching this with a queasy feeling about interaction with Serializable Snapshot Isolation (SSI). There are all kinds of challenges here, and I'm glad you're thinking about them. I alluded to some problems here: http://archives.postgresql.org/message-id/1345415312.20987.56.camel@jdavis But those might be a subset of the problems you're talking about. It sounds like, at a high level, there are two problems: 1. capturing the apparent order of execution in the audit log 2. assigning meaningful times to the changes that are consistent with the apparent order of execution Regards, Jeff Davis
Re: [HACKERS] NOT NULL constraints in foreign tables
On Mon, 2012-08-20 at 16:50 -0400, Robert Haas wrote: #3 for foreign tables. I'm skeptical of that approach for two reasons: (1) It will be hard to inform users which constraints are enforced and which aren't. (2) It will be hard for users to understand the planner benefits or the consequences when the constraint is not enforced. That being said, I can imagine good use cases (like when the foreign table is in postgres, and already has that constraint declared), so I'm not outright opposed to it. #1 is not a reasonable alternative for foreign tables because we lack enforcement power in that case, Right. and #2 is also not reasonable, because the only point of allowing declarative constraints is to get better performance, and if we go with #2 then we've pretty much thrown that out the window. Declared constraints can improve the plans, while runtime-enforced constraints slow down execution of a given plan. I'm not really sure whether runtime enforcement is a good trade-off, but it doesn't seem like an obviously bad one. Also, what did you mean by "the only point of allowing declarative constraints is to get better performance"? Maybe the user wants to get an error if some important assumption about the remote data source is not as true as when they declared the constraint. Regards, Jeff Davis
Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree
On Sat, 2012-08-18 at 18:10 +0400, Alexander Korotkov wrote: On Thu, Aug 16, 2012 at 3:46 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I committed the patch now, but left out the support for adjacent for now. Not because there was necessarily anything wrong with that, but because I have limited time for reviewing, and the rest of the patch looks ready for commit now. I reworded the comments quite a lot, you might want to proofread those to double-check that they're still correct. I'll take a look at the adjacent-support next, as a separate patch. Thanks! There is a separate patch for adjacent. I've reworked the adjacency check in order to make it clearer. I am taking a look at this patch now. A few quick comments: * It looks like bounds_adjacent modifies its by-reference arguments, which is a little worrying to me. The lower/upper labels are flipped back, but the inclusivities are not. Maybe just pass by value instead? * bounds_adjacent is sensitive to the argument order. Can't it just take bound1 and bound2? * I tried some larger tests and they seemed to work. I haven't reviewed the spgist code changes in detail though. Regards, Jeff Davis
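To make the pass-by-value suggestion concrete, here is a hedged sketch; the Bound struct and the exact adjacency rule below are simplified, hypothetical stand-ins for the real RangeBound machinery, not the patch's code:

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for the real RangeBound:
 * a bound carries a value, an inclusive flag, and a lower/upper flag. */
typedef struct
{
    int     val;
    bool    inclusive;
    bool    lower;
} Bound;

/*
 * Because the bounds arrive BY VALUE, any flag-flipping done while
 * checking adjacency cannot leak back into the caller's copies.
 * An upper bound at x followed by a lower bound at x is adjacent iff
 * the values match and exactly one side includes the shared point.
 */
static bool
bounds_adjacent(Bound upper, Bound lower)
{
    if (upper.lower || !lower.lower)
        return false;           /* expect an upper bound, then a lower bound */
    return upper.val == lower.val && upper.inclusive != lower.inclusive;
}
```

So [1,5) then [5,9) is adjacent, [1,5] then [5,9) overlaps instead, and the caller's structs are untouched no matter what the function does internally.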
Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree
On Sat, 2012-07-28 at 17:50 -0400, Tom Lane wrote: which would come back to bite us if we ever try to support index-only scans with SPGiST. I'm confused: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=92203624934095163f8b57b5b3d7bbd2645da2c8 And the patch that was just committed for Range Types SP-GiST is already using index-only scans. Regards, Jeff Davis
Re: [HACKERS] temporal support patch
On Mon, 2012-06-25 at 17:46 +0900, Vlad Arkhipov wrote: It's not sufficient to store only a period of validity for a row. If two transactions started at the same time change the same record, you have a problem with the TSTZRANGE type because it's normalized to an empty interval. That's an interesting point. Let's say you tried setting it to [T1, T2) where T1 is the time of the last transaction to update it and T2 is the time of the current transaction. If T2 = T1, then TSTZRANGE will throw an error, not store the empty interval. And we don't want to store the empty interval, because it would be a lie. There could have been some transaction T3 that happened during T2 that saw the value from T1, so saying that there were no times where that was visible to the system is false. Throwing an error allows you to retry T2, which should allow a microsecond or so to pass, and the problem should resolve itself (assuming your clock didn't move backwards, which is a different issue). We could also argue about the start versus end times of transactions, and snapshot acquisition times, because that could cause confusion if there are long-running transactions. It might be a good reason to store the modifying transaction ID as well, but then you get into transaction wraparound problems. The other issue is how to handle multiple changes of the same record within the transaction. Should they be stored or not? In a typical audit log, I don't see any reason to. The internals of a transaction should be implementation details; invisible to the outside, right? Also it's necessary to store some kind of operation type that was applied to the record (insert/update/delete). For example, there is a table with one record with validity period [0, ) and value 'A'. First way: 1. Delete this record at time 1; now there is a [0, 1), A record in the history table. 2. Insert a new record at time 1; now there is a [0, 1), A record in the history table and a [1, ), B record in the current data table. Second way: 1. 
Update this record at time 1; now there is a [0, 1), A record in the history table and a [1, ), B record in the current data table. So you have the same data in the tables, but the actions that led to this configuration were different and the history has been partly lost. Right. Those are yet more possible options that people might want for an audit log. * There is other useful information that could be recorded, such as the user who inserted/updated/deleted the record. I'm not sure that the database user is the proper thing to be stored in the history table. Many applications usually connect to a database using some virtual user and have their own users/roles tables to handle privileges. There should be some way to substitute the stored user in the history table with the application's one. It's also helpful to store the transaction id that inserted/updated/deleted the record. If the system is recording it for audit purposes, then it had better be sure that it's true. You can't allow the application to pick and choose what gets stored there. While it may be true that many applications just all use the same DB user, if you want an audit log that includes user information you have to let the DB do some authentication. It's a great proposal but seems impossible to implement with a triggers-only solution, isn't it? Are there any hooks on ALTER TABLE ... in PostgreSQL to update changed-columns bitmaps when the table structure changes? Column numbers are never reused, so I think it would be stable. But if you do need to be notified of schema changes, the new event triggers mechanism may be able to do that. In SQL:2011 there is only one table with all the data, historical and current. So it's not very convenient to specify a WHERE condition on system time everywhere and for all tables in the query. By default only the current data is selected with a query like SELECT * FROM table. If there is some syntax that offers a convenient shorthand for WHERE, that's fine with me. 
Or using two tables, one called foo and one called foo_history, is also fine. But I don't want the DML syntax to introduce new mechanisms that aren't available without the fancy syntax (though new DDL arrangements might be fine). Regards, Jeff Davis
Re: [HACKERS] gistchoose vs. bloat
On Mon, 2012-06-18 at 15:12 +0400, Alexander Korotkov wrote: Hackers, while experimenting with gistchoose I achieved interesting results about the relation between gistchoose behaviour and gist index bloat. ... The current implementation of gistchoose selects the first index tuple which has minimal penalty. It is possible for several tuples to have the same minimal penalty. I've created a simple patch which selects randomly from them. Then I got the following results for the same testcase. I took a look at this patch. The surrounding code is pretty messy (not necessarily because of your patch). A few comments would go a long way. The 'which_grow' array is initialized as it goes, first using pointer notation (*which_grow = -1.0) and then using subscript notation. As far as I can tell, the first r->rd_att->natts elements of the array (the only elements that matter) need to be written the first time through anyway. Why not just replace which_grow[j] < 0 with i == FirstOffsetNumber and add a comment that we're initializing the penalties with the first index tuple? The 'sum_grow' didn't make any sense, thank you for getting rid of that. Also, we should document that the earlier attributes always take precedence, which is why we break out of the inner loop as soon as we encounter an attribute with a higher penalty. Please add a comment indicating why you are randomly choosing among the equal penalties. I think that there might be a problem with the logic, as well. Let's say you have two attributes and there are two index tuples, it1 and it2, with penalties [10,10] and [10,100] respectively. The second time through the outer loop, with i = 2, you might (P=0.5) assign 2 to the 'which' variable in the first iteration of the inner loop, before it realizes that it2 actually has a higher penalty. I think you need to finish out the inner loop and have a flag that indicates that all attributes are equal before you do the probabilistic replacement. Also, I think you should use random() rather than rand(). 
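A sketch of the tie-breaking logic being suggested (the names and the fixed two-attribute layout are made up for illustration; this is not the actual gistchoose code): a candidate is only treated as tied after every attribute has compared equal, and ties are kept with probability 1/k so each tied tuple is equally likely to win.

```c
#include <stdlib.h>

#define NATTS 2                 /* assumed number of index attributes */

/*
 * Pick the index (0-based) of the tuple with the lexicographically
 * minimal penalty vector; earlier attributes take precedence.  Among
 * tuples whose penalties are all equal to the current minimum, keep
 * each with probability 1/k (reservoir sampling over the ties), and
 * crucially only after the inner loop has examined every attribute.
 */
static int
choose_random_min(double penalties[][NATTS], int ntuples)
{
    int which = 0;              /* current best tuple */
    int nties = 1;              /* how many tuples tie with it */

    for (int i = 1; i < ntuples; i++)
    {
        int cmp = 0;

        for (int j = 0; j < NATTS; j++)
        {
            if (penalties[i][j] < penalties[which][j]) { cmp = -1; break; }
            if (penalties[i][j] > penalties[which][j]) { cmp = 1; break; }
        }
        if (cmp < 0)
        {
            which = i;          /* strictly better: new minimum */
            nties = 1;
        }
        else if (cmp == 0)
        {
            nties++;            /* full tie: maybe replace, uniformly */
            if (random() % nties == 0)
                which = i;
        }
    }
    return which;
}
```

With the [10,10] versus [10,100] example above, the second tuple is rejected inside the inner loop before any probabilistic replacement can happen, which is exactly the property that a mid-loop replacement would violate.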
Regards, Jeff Davis
Re: [HACKERS] 16-bit page checksums for 9.2
On Sun, 2012-02-19 at 21:49 +0000, Simon Riggs wrote: On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs si...@2ndquadrant.com wrote: v8 attached v10 attached. This patch covers all the valid concerns discussed and has been extensively tested. Is there something I can do to help get this ready for the next commitfest? I am willing to rebase it, but I was worried that might cause confusion. I am also willing to review it after the rebase, of course. There are still a couple open issues, including: * Store the checksum in the page version field or the TLI field? * What mechanism to guarantee to the user that all pages are properly protected by checksums (rather than just new pages)? In other words, there are more than the two states represented by the GUC. * What specific data is included in the checksum? I suggested that we include the block number, and maybe the OID. Regards, Jeff Davis
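On the last point, a purely hypothetical sketch (this is not the patch's checksum algorithm) of why folding the block number in is attractive: two byte-identical pages stored at different block numbers then checksum differently, so a write that lands at the wrong block becomes detectable.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy 16-bit page checksum that mixes in the block number.  The
 * constant and the djb2-style mixing are illustrative choices only.
 */
static uint16_t
page_checksum(const uint8_t *page, size_t len, uint32_t blkno)
{
    uint32_t sum = blkno * 2654435761u;     /* seed with the block number */

    for (size_t i = 0; i < len; i++)
        sum = (sum << 5) + sum + page[i];   /* sum = sum * 33 + byte */

    return (uint16_t) (sum ^ (sum >> 16));  /* fold 32 bits down to 16 */
}
```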
Re: [HACKERS] Covering Indexes
On Thu, 2012-07-26 at 12:13 -0400, Bruce Momjian wrote: So, do we want a TODO item about adding columns to a unique index that will not be used for uniqueness checks? -1 from me, at least in its current form. At its heart, this is about separating the constraint from the index that enforces it -- you'd like the columns to be available for querying (for index only scans or otherwise), but not to take part in the constraint. And when you look at it from that perspective, this proposal is an extremely limited form. You can't, for example, decide that an existing index can be used for a new unique constraint. That's a lot more plausible than the use cases mentioned in this thread as far as I can see, but this proposal can't do that. I tried proposing a more general use case when developing exclusion constraints: http://archives.postgresql.org/message-id/1253466074.6983.22.camel@jdavis (allow the user to specify multiple constraints enforced by one existing index). But, at least at the time, my proposal didn't pass the usefulness test: http://archives.postgresql.org/pgsql-hackers/2009-09/msg01355.php even though my proposal was strictly more powerful than this one is. Also, this proposal extends the weird differences between CREATE UNIQUE INDEX and the declaration of a UNIQUE constraint. For example, if you want DEFERRABLE you need to declare the constraint, but if you want to use an expression (rather than a simple column reference) you need to create the index. This problem does not exist with exclusion constraints. In my opinion, new innovations in unique constraints would be better served as part of exclusion constraints, and we should keep unique constraints simple. If we make an improvement to UNIQUE, then we will want to do similar things for exclusion constraints anyway, so it just seems duplicative. 
Regards, Jeff Davis
Re: [HACKERS] Covering Indexes
On Fri, 2012-07-27 at 15:27 -0500, Merlin Moncure wrote: The covering index/uniqueness use case adds legitimacy to the INDEX clause of exclusion constraints IMNSHO. Yes, I think it would be worth revisiting the idea. One point of concern though is that (following a bit of testing) alter table foo add exclude using btree (id with =); ...is always strictly slower for inserts than alter table foo add primary key(id); Yes, in my worst-case tests there is about a 2X difference for building the constraint and about a 30-50% slowdown during INSERT (I thought I remembered the difference being lower, but I didn't dig up my old test). That's for an in-memory case, I would expect disk I/O to make the difference less apparent. This is probably because it doesn't use the low level btree based uniqueness check (the index is not declared UNIQUE) -- shouldn't it do that if it can? We could probably detect that the operator being used is the btree equality operator, set the unique property of the btree, and avoid the normal exclusion constraint check. I'm sure there are some details to work out, but if we start collecting more use cases where people want the flexibility of exclusion constraints and the speed of UNIQUE, we should look into it. Regards, Jeff Davis
Re: [HACKERS] Using pg_upgrade on log-shipping standby servers
On Thu, 2012-07-26 at 14:17 -0400, Bruce Momjian wrote: Yes, that would be a problem because the WAL records are deleted by pg_upgrade. Does a shutdown of the standby not already replay all WAL logs? There is no notion of all WAL logs because the WAL is infinite. Do you mean all WAL generated by the master before shutdown, or all WAL that the standby knows has been generated by the master so far? Regardless, I don't think the standby attempts to do much after a shutdown is requested. We could also just require them to just start the standby in master mode and shut it down. The problem with that is it might run things like autovacuum. If we had sync rep that waits for application of the WAL, that might be a more robust approach. We could, during shutdown of the master, cause the standby to cancel all HS queries, and then change to sync rep and wait for complete catchup. There are a lot of details to work out there, but it might give us a higher confidence that it's doing the right thing. Given two shut-down systems, it should be pretty easy to tell if they have played the same amount of WAL though, right? I was originally thinking that we would require users to run pg_upgrade on the standby, where you need to first switch into master mode. That sounds a little strange to me. If the original master has generated WAL that the original standby hasn't seen yet, then this doesn't help because the two systems would have diverged, and you'd need a new base backup anyway. And if they have played exactly the same WAL, what does this accomplish? Regards, Jeff Davis
Re: [HACKERS] Using pg_upgrade on log-shipping standby servers
On Tue, 2012-07-17 at 01:02 -0700, Daniel Farina wrote: Could pg_upgrade emit WAL segment(s) to provide continuity of a timeline? So something like: By "segments" did you mean "records"? * Take down the writable primary for pg_upgrade * Some WAL is emitted and possibly archived * The old version, when reaching the special pg_upgrade WAL, could exit or report its situation having paused replay (as clearly, it cannot proceed). Unsure. I don't really understand this step. * Start up a new version of postgres on the same cluster at that point, which plays the upgrade-WAL. I see this being pretty mechanically intensive, but right now my hands are completely tied as to achieving total continuity of my archives, costing a base-backup's worth of risk window upon upgrade. Does "continuity of archives" mean avoiding downtime, or maintaining a single WAL sequence? Regards, Jeff Davis
Re: [HACKERS] Using pg_upgrade on log-shipping standby servers
On Tue, 2012-07-10 at 11:50 -0400, Bruce Momjian wrote: I don't think we can assume that because pg_upgrade was run on the master and standby that they are binary identical, can we? Technically the user files are identical, but the system catalogs and WAL might be different, hence my suggestion to run rsync before allowing the standby to rejoin the primary. Do you have plans to change that in the future? If we know that the user data files are identical between primary and replica, it would be nice if we could provide a robust way to avoid copying them. Regards, Jeff Davis
Re: [HACKERS] initdb and fsync
On Fri, 2012-07-13 at 15:21 -0400, Tom Lane wrote: No, that's incorrect: the fadvise is there, inside pg_flush_data() which is done during the writing phase. Yeah, Andres pointed that out, also. It seems to me that if we think sync_file_range is a win, we ought to be using it in pg_flush_data just as much as in initdb. It was pretty clearly a win for initdb, but I wasn't convinced it was a good idea for other use cases. The mechanism is outlined in the email you linked below. To paraphrase, fadvise tries to put the data in the io scheduler queue (still in the kernel, before going to the device), and gives up if there is no room; sync_file_range waits for there to be room if none is available. For the case of initdb, the data needing to be fsync'd is effectively constant, and it's a lot of small files. If the requests don't make it to the io scheduler queue before fsync, the kernel doesn't have an opportunity to schedule them properly. But for larger amounts of data copying (like ALTER DATABASE SET TABLESPACE), it seemed like there was more risk that sync_file_range would starve out other processes by continuously filling up the io scheduler queue (I'm not sure if there are protections against that or not). Also, if the files are larger, a single fsync represents more data, and the kernel may be able to schedule it well enough anyway. I'm not an authority in this area though, so if you are comfortable extending sync_file_range to copydir() as well, that's fine with me. However, I'm a bit confused, because in http://archives.postgresql.org/pgsql-hackers/2012-03/msg01098.php you said "So, it looks like fadvise is the right thing to do, but I expect we'll ..." Was that backwards? If not, why are we bothering with taking any portability risks by adding use of sync_file_range? That part of the email was less than conclusive, and I can't remember exactly what point I was trying to make. 
sync_file_range is a big practical win for the reasons I mentioned above (and also avoids some unpleasant noises coming from the disk drive). Regards, Jeff Davis
Re: [HACKERS] initdb and fsync
On Fri, 2012-07-13 at 17:35 -0400, Tom Lane wrote: I wrote: I'm picking up this patch now. What I'm inclined to do about the -N business is to commit without that, so that we get a round of testing in the buildfarm and find out about any portability issues, but then change to use -N after a week or so. I agree that in the long run we don't want regression tests to run with fsyncs by default. Committed without the -N in pg_regress (for now). I also stuck sync_file_range into the backend's pg_flush_data --- it would be interesting to hear measurements of whether that makes a noticeable difference for CREATE DATABASE. Thank you. One point about the commit message: fadvise does not block to go into the request queue, sync_file_range does. The problem with fadvise is that, when the request queue is small, it fills up so fast that most of the requests never make it in, and fadvise is essentially a no-op. sync_file_range waits for room in the queue, which is (based on my tests) enough to improve the scheduling a lot. Regards, Jeff Davis
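The two-pass pattern discussed in these messages can be sketched roughly like this (an assumed simplification, not initdb's actual code): pass one writes each file and nudges the kernel to start writeback immediately; pass two fsyncs. sync_file_range(..., SYNC_FILE_RANGE_WRITE) blocks until the request fits in the I/O queue, so the write actually gets scheduled, whereas posix_fadvise(DONTNEED) merely tries and may be a no-op.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Pass 1: write the file and hint the kernel to begin writeback now. */
static int
write_and_hint(const char *path, const char *data)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, data, strlen(data)) < 0)
    {
        close(fd);
        return -1;
    }
#if defined(__linux__) && defined(SYNC_FILE_RANGE_WRITE)
    sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);   /* queue writeback */
#else
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);       /* best-effort hint */
#endif
    return close(fd);
}

/* Pass 2: make the file durable. */
static int
fsync_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}
```

Doing all the hints first and all the fsyncs second is what gives the kernel a chance to sort and merge the requests before any fsync forces them out.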
Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree
On Thu, 2012-06-14 at 02:56 +0400, Alexander Korotkov wrote: Hackers, the attached patch implements a quad-tree on ranges. Some performance results in comparison with the current GiST indexing: index creation is slightly slower. Probably, it needs some investigation. Search queries on SP-GiST use many more pages. However this comparison may not be entirely fair, because SP-GiST can pin the same buffer several times during one scan. In CPU terms, search queries on SP-GiST seem to be slightly faster. The dramatic difference in the column <@ const query is thanks to the 2d-mapping. More comments: * A minor rebase is required (simple int2 -> int16). * Perhaps I'm mistaken, but the following code in getQuadrant() looks wrong to me, shouldn't the 1 and 2 be reversed? if (range_cmp_bounds(typcache, upper, centroidUpper) >= 0) return 1; else return 2; * In the choose method, why does the allTheSame case unconditionally match? Why not split? Similarly, why does inner_consistent always match when the nodes are allTheSame? * It's a little confusing having empty prefixes mean that empty ranges go to node0, and non-empty ranges meaning that empty ranges go to node4 (quadrant 5). Why can't there just always be 5 nodes, and iff all the ranges are empty, then the prefix is NULL? And for that matter, let's let the quadrant equal the node number, and have the empty ranges in node0. I don't see much point in always subtracting 1 from the quadrant number. Regards, Jeff Davis
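For readers catching up on the 2d-mapping, the quadrant assignment under discussion can be sketched with plain ints standing in for range bounds (the numbering below is one plausible convention chosen for illustration, not necessarily the committed code):

```c
/*
 * A non-empty range maps to the point (lower, upper); its quadrant
 * relative to the centroid range decides which subtree it enters.
 */
static int
get_quadrant(int lower, int upper, int c_lower, int c_upper)
{
    if (lower >= c_lower)
        return (upper >= c_upper) ? 1 : 2;
    else
        return (upper >= c_upper) ? 4 : 3;
}
```

So against a centroid of [3,8), the range [5,10) lands in quadrant 1 while [5,6) lands in quadrant 2, which is the distinction the quoted getQuadrant snippet is making.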
Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree
On Mon, 2012-07-02 at 23:47 -0700, Jeff Davis wrote: On Thu, 2012-06-14 at 02:56 +0400, Alexander Korotkov wrote: Hackers, the attached patch implements a quad-tree on ranges. Some performance results in comparison with the current GiST indexing: index creation is slightly slower. Probably, it needs some investigation. Search queries on SP-GiST use many more pages. However this comparison may not be entirely fair, because SP-GiST can pin the same buffer several times during one scan. In CPU terms, search queries on SP-GiST seem to be slightly faster. The dramatic difference in the column <@ const query is thanks to the 2d-mapping. Also, it would be helpful to add a couple of tests to rangetypes.sql. Regards, Jeff Davis
Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree
On Mon, 2012-07-02 at 23:47 -0700, Jeff Davis wrote: * Perhaps I'm mistaken, but the following code in getQuadrant() looks wrong to me, shouldn't the 1 and 2 be reversed? if (range_cmp_bounds(typcache, upper, centroidUpper) >= 0) return 1; else return 2; Oops, looks like I was mistaken. The code looks fine to me now. Regards, Jeff Davis
Re: [HACKERS] GiST subsplit question
On Tue, 2012-06-26 at 11:28 -0400, Robert Haas wrote: On Wed, May 30, 2012 at 4:35 PM, Alexander Korotkov aekorot...@gmail.com wrote: So, do we demote that message to a DEBUG1? Or do we make it more clear what the authors of a specific picksplit are supposed to do to avoid that problem? Or am I misunderstanding something? +1 for demoting the message to DEBUG1. I think it shouldn't be so noisy, it just indicates something could be improved. Also I think we definitely need to document secondary split. Right now there's no chance of understanding it without reverse engineering it from the code. I'm happy to go demote the message if we have consensus on that, but somebody else is going to need to provide the doc patch. Any takers? I was planning to do that, but it won't be for a few days at least. If someone else wants to do it sooner, feel free. Regards, Jeff Davis
Re: [HACKERS] initdb and fsync
On Fri, 2012-06-22 at 10:04 -0400, Robert Haas wrote: This may be a stupid question, but why is it initdb's job to fsync the files the server creates, rather than the server's job? Normally we rely on the server to make its own writes persistent. That was my first reaction as well: http://archives.postgresql.org/message-id/1328468000.15224.32.camel@jdavis However, from the discussion it seems like it will be harder to do it that way (during normal operation, a checkpoint is what syncs the files; but as Tom points out, bootstrap mode does not). Also, I would expect the fsyncs to be in a less-controlled pattern, so it might perform more poorly. Regards, Jeff Davis
Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree
On Thu, 2012-06-14 at 02:56 +0400, Alexander Korotkov wrote: Hackers, the attached patch implements a quad-tree on ranges. Some performance results in comparison with the current GiST indexing: index creation is slightly slower. Probably, it needs some investigation. Search queries on SP-GiST use many more pages. However this comparison may not be entirely fair, because SP-GiST can pin the same buffer several times during one scan. In CPU terms, search queries on SP-GiST seem to be slightly faster. The dramatic difference in the column <@ const query is thanks to the 2d-mapping. Looking at this patch now. I see that it fails the opr_sanity test (on master), can you look into that? It looks very promising from a performance standpoint. I think the col <@ const query will be a common query; and I also think that pattern will be useful to restrict a large table down to something more manageable. In the bounds_connected function, it might make more sense to use the word "adjacent", which I already used for ordinary ranges, rather than the new word "connected". Also, I'm getting a little confused switching between thinking in terms of X and Y and lower and upper (particularly since lower and upper can be confused with < and >). I don't have a suggestion yet for how to clarify that, but it might be good to use the spatial terminology in more places and avoid lower/upper except where needed. Please excuse the slow review, I'm catching up on the SP-GiST API. Regards, Jeff Davis
Re: [HACKERS] initdb and fsync
On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote: It calls pg_flush_data inside of copy_file which does the posix_fadvise... So maybe just put the sync_file_range in pg_flush_data? The functions in fd.c aren't linked to initdb, so it's a challenge to share that code (I remember now: that's why I didn't use pg_flush_data originally). I don't see a clean way to resolve that, do you? Also, it seems that sync_file_range() doesn't help much with creating a database (on my machine), so it doesn't seem important to use it there. Right now I'm inclined to leave the patch as-is. Regards, Jeff Davis
Re: [HACKERS] temporal support patch
On Mon, 2012-06-18 at 19:34 +0900, Vlad Arkhipov wrote: What's wrong with the SPI/timetravel extension for system versioning? http://www.postgresql.org/docs/9.1/static/contrib-spi.html We are heavily using system-versioned and application-time period tables in our enterprise products (most of them are bi-temporal). However, our implementation is based on triggers and views and therefore is not very convenient to use. There are also some locking issues with foreign keys to application-time period tables. It would be great if the new temporal SQL features were included in the PostgreSQL core with SQL:2011 syntax support. It is especially important for bi-temporal tables because of the complex internal logic of UPDATE/DELETE and the huge SELECT queries for such tables. I've already pointed out some missing features in this thread, but the big ones in my mind are: 1. It doesn't use 9.2 Range Types, which would help in a lot of ways (like making the SELECT queries a lot simpler and faster). 2. It's missing a lot of options, like storing the user that modified a row or the changed columns. Regards, Jeff Davis
Re: [HACKERS] initdb and fsync
On Mon, 2012-06-18 at 18:05 +0200, Andres Freund wrote: Quick review: - defaulting to initdb -N in the regression suite is not a good idea imo, because that way the buildfarm won't catch problems in that area... I removed the -N as you suggest. How much does performance matter on the buildfarm? - could the copydir.c and initdb.c versions of walkdir/sync_fname et al be unified? There's a lot of backend-specific code in the copydir versions, like using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at unifying them before, and concluded that it wouldn't add to the readability, so I just commented where they came from. If you feel there will be a maintainability problem, I can give it another shot. - I personally would find it way nicer to put USE_PRE_SYNC into pre_sync_fname instead of cluttering the main function with it Done. Regards, Jeff Davis initdb-fsync-20120618.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] initdb and fsync
On Mon, 2012-06-18 at 21:34 +0300, Peter Eisentraut wrote: On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote: - defaulting to initdb -N in the regression suite is not a good idea imo, because that way the buildfarm won't catch problems in that area... The regression test suite also starts postgres with fsync off. This is good, and I don't want to have more overhead in the tests. It's not like we already have awesome test coverage for OS interaction. OK, changing back to using -N during regression tests due to performance concerns. Regards, Jeff Davis initdb-fsync-20120618-2.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] initdb and fsync
On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote: - defaulting to initdb -N in the regression suite is not a good idea imo, because that way the buildfarm won't catch problems in that area... I removed the -N as you suggest. How much does performance matter on the buildfarm? I don't think the difference in initdb cost is relevant when running the regression tests. Should it prove to be, we can re-add -N after a week or two in the buildfarm machines. I just remember that there were several OS-specific regressions when adding the pre-syncing for createdb. That sounds reasonable to me. Both patches are out there, so we can figure out what the consensus is. - could the copydir.c and initdb.c versions of walkdir/sync_fname et al be unified? There's a lot of backend-specific code in the copydir versions, like using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at unifying them before, and concluded that it wouldn't add to the readability, so I just commented where they came from. Ok. Sensible reasons. I dislike that we now have two files using different logic (copydir.c only using fadvise, initdb using sync_file_range if available). Maybe we should just move the fadvise and sync_file_range calls into their own common function? I don't see fadvise in copydir.c; it looks like it just uses fsync. It might speed it up to use a pre-sync call there, too -- database creation does take a while. If that's in the scope of this patch, I'll do it. Btw, I just want to have said this, although I don't think it's particularly relevant as it doesn't affect correctness: it's possible to have a system where sync_file_range is in the system headers but the kernel during runtime doesn't support it. It is relatively new (2.6.17). It would be possible to fall back to posix_fadvise, which has been around far longer, in that case... Interesting point, but I'm not too worried about it.
Regards, Jeff Davis
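The pre-sync idea being discussed (kick off writeback early and cheaply, then do the blocking fsync later) can be sketched in Python. This is only an illustration of the fallback pattern; the patch itself is C, using sync_file_range() where available and posix_fadvise() otherwise. Here `os.posix_fadvise` stands in for both primitives, and the demo file name is invented:

```python
import os
import tempfile

def pre_sync_fname(path):
    """Hint the kernel to start writeback for one file without waiting,
    so a later fsync() has less work to do. Degrades to a no-op on
    platforms that lack the primitive."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):  # absent on some platforms
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

# Demo: pre-sync a freshly written file; its contents are unaffected.
with tempfile.NamedTemporaryFile(delete=False, suffix=".demo") as f:
    f.write(b"hello, initdb")
    demo_path = f.name
pre_sync_fname(demo_path)
```

The point of the two-pass structure in the patch is that walking the directory twice — advise everything, then fsync everything — lets the kernel overlap the writes instead of serializing one fsync per file.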
Re: [HACKERS] initdb and fsync
On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote: It calls pg_flush_data inside of copy_file which does the posix_fadvise... So maybe just put the sync_file_range in pg_flush_data? Oh, I didn't notice that, thank you. In that case, it may be good to combine them if possible. I will look into it. There may be performance implications when used on a larger amount of data though. I can do some brief testing. Regards, Jeff Davis
Re: [HACKERS] [PATCH] Support for foreign keys with arrays
On Fri, 2011-11-04 at 13:48 +0100, Gabriele Bartolini wrote: This patch adds basic support for arrays in foreign keys, by allowing a referencing column to be defined as an array of elements having the same type as the referenced column in the referenced table. Every NOT NULL element in the referencing array is matched against the referenced table. I'm trying to find commonalities between this feature and my future RANGE FOREIGN KEY feature (not past the hand-waving stage yet). The first thing I observe is that my idea for range foreign keys is almost the opposite of your idea for array FKs. I was imagining a range FK to mean that the referencing side is contained by the referenced side. This is the common definition in the temporal world, because the valid period for the referencing row must be within the valid period for the row it references (same for transaction time). The referenced side must be a range, and the referencing side must be either a range of the same type or the subtype of the range. Other similar definitions exist by replacing "contained by" with some other operator, though the use cases for those aren't as clear to me. This definition works for arrays and possibly many other types (geometry?) as well. It looks like this is orthogonal to your work, but it does seem like it has potential for confusion in the future. Thoughts? Regards, Jeff Davis
Re: [HACKERS] [PATCH] Support for foreign keys with arrays
On Sun, 2012-06-17 at 21:10 +0800, Simon Riggs wrote: Do we need something like Exclusion FKs? i.e. the FK partner of Exclusion Constraints? Yes, Inclusion Constraints. I've known we need something like that since I did Exclusion Constraints, but I haven't gotten further than that. Regards, Jeff Davis
Re: [HACKERS] temporal support patch
On Wed, 2012-06-13 at 23:10 +0200, Miroslav Šimulčík wrote: I have a working patch for PostgreSQL version 9.0.4, but it needs refactoring before I can submit it, because some parts don't meet formatting requirements yet. And yes, the changes are large, so it will be better to discuss the design first and then deal with code. Do you insist on compatibility with standard SQL:2011, as Pavel wrote? Try to work on solving the problem and identify the use cases. I don't think the standard will cause a major problem; we should be able to make the relevant parts of your patch match the standard. That's one reason to work on it as an extension first: we can get a better sense of the problem space and various use cases without worrying about violating any standard. Then, as you need specific backend support (e.g. special syntax), we can take the standards more seriously. Regards, Jeff Davis
Re: [HACKERS] initdb and fsync
On Wed, 2012-06-13 at 13:53 +0300, Peter Eisentraut wrote: The --help output for the -N option was copy-and-pasted wrongly. The message issued when using -N is also a bit content-free. Maybe something like "Running in nosync mode. The data directory might become corrupt if the operating system crashes.\n" Thank you, fixed. Which leads to the question, how does one get out of this state? Is running sync(1) enough? Is starting the postgres server enough? sync(1) calls sync(2), and the man page says: According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.) So it looks like sync is enough if you are on Linux *and* you have disabled any unprotected write caches. I don't think starting the postgres server is enough. Before, I think we were safe because we could assume that the OS would flush the buffers before you had time to store any important data. But now, that window can be much larger. There are no updates to the initdb man page included in your patch, which would be a suitable place to discuss this briefly. Thank you, I added that. Regards, Jeff Davis initdb-fsync-20120613.patch.gz Description: GNU Zip compressed data
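The "get out of the nosync state" step above can be sketched directly; `os.sync()` wraps sync(2), and (per the caveat quoted from the man page) its guarantees are Linux- and hardware-dependent. An illustrative sketch, assuming a Unix system:

```python
import os

def flush_everything():
    """System-wide flush: schedule (and, on Linux, wait for) writeback
    of all dirty kernel buffers -- like running sync(1) after
    initdb -N. Note this still does not flush unprotected on-disk
    write caches."""
    if hasattr(os, "sync"):  # POSIX only
        os.sync()
```

As the email notes, this is only sufficient on Linux and only when unprotected write caches are disabled; on other systems sync() may return before the writes complete.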
Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums
On Tue, 2012-06-12 at 08:42 -0400, Robert Haas wrote: Instead of trying to maintain MVCC semantics, maybe we should just have something like COPY (FROZEN) that just writes frozen tuples into the table and throws MVCC out the window. Seems like that would be a lot easier to implement and satisfy basically the same use cases. The reason that doesn't work is that there's really no answer for ABORTs except to delete the whole table, or ask the user to try to figure out what made it in and what didn't. (That I can see, anyway; you may have a better idea.) In practice, I think the user would need to use this only for new tables and use only one loader, so they would know what to do if there is an abort. That makes this very similar to the proposal for optimizing loads by the transaction that creates the table, except with fewer safeguards. It's basically just saying "don't abort, and be careful with concurrent reads" (which are now dirty). I think this problem largely has to do with freezing though, because that's what makes it impossible to differentiate after an abort. Let's set that aside for now, and just focus on setting HEAP_XMIN_COMMITTED, PD_ALL_VISIBLE, and the VM bit. Those account for a major part of the problem; being able to freeze also is really a bonus (and if the table is really that static, then maybe creating it in the loading transaction is reasonable). In those terms, we can reframe the question as: what do we do about reads during the load? Answers could include:

(a) nothing -- reads during a load are potentially dirty
(b) readers ignore hint bits during the load, and pay the penalty
(c) allow only INSERT/COPY, and block unsafe SELECT/UPDATE/DELETE

(a) is your proposal, except without the freezing. (b) was my proposal. (c) was a backup I had in mind (if reads are expensive anyway, maybe people would rather just block). These aren't exclusive. Maybe we can do (c) in READ COMMITTED and above, and (a) in READ UNCOMMITTED.

Regards, Jeff Davis
Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums
On Tue, 2012-06-12 at 13:31 -0400, Tom Lane wrote: Yeah; the notion of adding cycles to the visibility-check code paths, which would add cost even for users who have no use at all for this feature, is close to being a deal-breaker for me. I would rather see us designing this around the idea of "what can we do without adding any new complexity in visibility checks?". [waves hands wildly] We could create a different HeapTupleSatisfiesMVCC routine that ignores hint bits, and select that at the time the scan is started. That would at least avoid affecting existing code paths. At least for MVCC cases (hence, user tables only), it seems like we could pre-set XMIN_COMMITTED hint bits if there were a way to be sure that the XID in question would be seen as still running in the followup snapshot check that every MVCC visibility test must make anyway. The hard part of that seems to be post-crash behavior, but maybe it'd be acceptable to incur crash-recovery-time cost to run around and unset such bits? Hmm... but that still leaves PD_ALL_VISIBLE, which will also cause a rewrite of the data. Under OLTP, we can assume that PD_ALL_VISIBLE is much more rare than HEAP_XMIN_COMMITTED; but for bulk loads, setting HEAP_XMIN_COMMITTED doesn't help us much without PD_ALL_VISIBLE. This doesn't do anything for pre-freezing of course, but I basically don't see any way that we could pre-freeze without breaking MVCC semantics anyhow. So there's still attraction in the idea of offering users the choice of not sustaining MVCC behavior in exchange for cheaper bulk loads. That may be reasonable, but it's much more dangerous than just breaking MVCC (which to me implies isolation rules) -- pre-freezing would break atomicity if you have any aborts. And I can't think of very many cases where losing atomicity is reasonable (although I'm sure there are some).

Regards, Jeff Davis
Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums
On Tue, 2012-06-12 at 13:10 -0400, Robert Haas wrote: I don't think it's going to solve the problem in general, but TBH I don't think Jeff's proposal is, either. I mean, ignoring xmin-committed, xmax-committed, and all-visible bits is going to come with a pretty steep performance penalty all of its own. I think we'd certainly have to discourage users from launching lots of full table scans during a data load. We could go so far as blocking reads, as I said in my other response, and still (mostly) meet my primary goals. But allowing reads without hint bits might be useful for highly selective index access, which is a substantial use case for the kind of large tables we'd be bulk-loading. Regards, Jeff Davis
Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums
On Tue, 2012-06-12 at 17:53 -0400, Bruce Momjian wrote: Well, truncating tables that were empty on the load would certainly be easy to do --- not sure if that helps us, though. This goes back to the single-loader-into-an-empty-table case, which I'd like to get away from. I think we already have some reasonable proposals on the table for optimizing that case. Ultimately, that pushes us toward partitioning with the same granularity as the data load. We can try to open that discussion up, but I think it's a bad idea to tie the data load granularity to the partitioning granularity, unless we have a way to coalesce them later without rewriting. It's too bad filesystems don't allow us to just reallocate some blocks from one file to another. [ OK, I guess I just opened this discussion myself ;) ] Regards, Jeff Davis
Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums
On Tue, 2012-06-12 at 18:02 -0400, Tom Lane wrote: Or (d) it's not a problem, since the inserting XID is still busy according to the readers' snapshots. How much of a savings did we get from PD_ALL_VISIBLE when it was added into the page-at-a-time visibility check? From 608195a3a3656145a7eec7a47d903bc684011d73: In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on each heap page, also indicating that all tuples on the page are visible to all transactions. It's important that this flag is kept up-to-date. It is also used to skip visibility tests in sequential scans, which gives a small performance gain on seqscans. If "small" means that it's something we can give up, then focusing on HEAP_XMIN_COMMITTED makes sense. But if we can't give it up, then we need to take it into account in the proposal. Regards, Jeff Davis
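The seqscan saving being weighed here is easy to model: when the page-level all-visible flag is set, the scan returns the page's tuples without paying for any per-tuple visibility test. A toy sketch (invented names, not the real heapam code — the actual check is HeapTupleSatisfiesMVCC, not an xmin comparison):

```python
def scan_page(page, snapshot_xmin, stats):
    """Return the tuples on `page` visible to a snapshot, counting how
    many per-tuple visibility tests were paid for. `all_visible`
    plays the role of PD_ALL_VISIBLE."""
    if page["all_visible"]:
        return list(page["tuples"])  # skip all per-tuple tests
    visible = []
    for tup in page["tuples"]:
        stats["visibility_tests"] += 1
        if tup["xmin"] < snapshot_xmin:  # crude stand-in for the real check
            visible.append(tup)
    return visible
```

With the flag set, the per-tuple cost drops to zero; with it clear, every tuple pays, which is the "small performance gain on seqscans" the commit message refers to, multiplied across an entire bulk-loaded table.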
Re: [HACKERS] 16-bit page checksums for 9.2
On Sun, 2012-02-19 at 21:49 +, Simon Riggs wrote: On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs si...@2ndquadrant.com wrote: v8 attached v10 attached. This patch covers all the valid concerns discussed and has been extensively tested. I expect to commit this in about 24 hours, so last call for comments please or say if you need more time to review. I have made an attempt to merge this with master. This patch is not meant to be the next version, and I have not made a real effort to be sure this was a safe merge. I am just hoping this saves Simon some of the tedious portions of the merge, and offers a reference point to see where his merge differs from mine. In the meantime, I'm looking at the original v10 patch against master as it existed when Simon posted the patch. I'd like to see checksums take part in the June commitfest. Regards, Jeff Davis checksums-v10-merge.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums
On Tue, 2012-06-12 at 22:06 -0400, Robert Haas wrote: How much of a savings did we get from PD_ALL_VISIBLE when it was added into the page-at-a-time visibility check? From 608195a3a3656145a7eec7a47d903bc684011d73: In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on each heap page, also indicating that all tuples on the page are visible to all transactions. It's important that this flag is kept up-to-date. It is also used to skip visibility tests in sequential scans, which gives a small performance gain on seqscans. If "small" means that it's something we can give up, then focusing on HEAP_XMIN_COMMITTED makes sense. But if we can't give it up, then we need to take it into account in the proposal. It's significant. In that case, the proposals that only involve HEAP_XMIN_COMMITTED don't seem viable to me. To get maximum read speed, we will need to set PD_ALL_VISIBLE also, and that means rewriting the entire table anyway (for the workload that I'm describing). However, maybe if we combine the approaches, readers could ignore PD_ALL_VISIBLE during the load, which looks like maybe a 10% penalty, without having to ignore HEAP_XMIN_COMMITTED (which seems much more costly to ignore). Regards, Jeff Davis
Re: [HACKERS] initdb and fsync
On Sun, 2012-03-25 at 19:59 -0700, Jeff Davis wrote: On Sat, 2012-03-17 at 17:48 +0100, Cédric Villemain wrote: I agree with Andres. I believe we should use sync_file_range (_before?) with linux. And we can use posix_fadvise_dontneed on other kernels. OK, updated patch attached. sync_file_range() is preferred, posix_fadvise() is a fallback. Rebased patch attached. No other changes. Regards, Jeff Davis initdb-fsync-20120612.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums
On Wed, 2012-06-06 at 22:16 -0400, Noah Misch wrote: Note that, currently, only VACUUM sets PD_ALL_VISIBLE and visibility map bits. Would you make something else like heap_multi_insert() be able to do so? That was the plan (roughly). I was thinking about doing it at the time a new page was allocated. Avoiding measurable overhead in tuple visibility checks when the feature is inactive may well prove to be a key implementation challenge. Perhaps a rudimentary CLOG cache, or some other way to mitigate CLOG access, could make it bearable. Although I would like it to be an online operation, I'm not quite as concerned about reads. I'd like to mitigate any major penalty, but if reads are expensive during a load, then so be it. Then, it would remember the current xid as max_loader_xid, and wait until the global xmin is greater than max_loader_xid. This should ensure that all snapshots regard all loading transactions as complete. ... this might not be. Each backend could decide, based on its own xmin, whether to ignore PD_ALL_VISIBLE in a given table. In other words, your ignorehints flag could be an xmin set to InvalidTransactionId during stages 1 and 2 and to the earliest safe xmin during stages 0 and 3. That's a good idea. It might make it easier to implement, and removing a step from finalization is certainly a big plus. * INITIATE and FINALIZE probably need to use PreventTransactionChain() and multiple transactions, to avoid holding the ShareUpdateExclusiveLock for too long. Also, we want to keep people from using it in the same transaction as the loading xact, because they might not realize that they would get a concurrency of 1 that way (because of the ShareUpdateExclusiveLock). Yes. You need to commit the transaction modifying pg_class so other backends can observe the change, at which point you can gather the list to wait on.
Consider splitting the INITIATE UI into two interfaces, one that transitions from state 0 to state 1 and another that expects state 1 and blocks until we reach state 2. You then have no need for PreventTransactionChain(), and the interfaces could even be normal functions. It's less clear how reasonably you could do this for the FINALIZE step, given its implicit VACUUM. It could be achieved by having the user do the VACUUM and making the new interface merely throw an error if a VACUUM is still needed. The trivial usage pattern might look like this:

SELECT pg_initiate_load('bigtbl');
SELECT pg_wait_load('bigtbl'); -- not a great name
COPY bigtbl FROM STDIN;
SELECT pg_stop_load('bigtbl');
VACUUM bigtbl;
SELECT pg_finalize_load('bigtbl');

It's definitely less elegant, alas. Perhaps offer the interface you've proposed and have it do the above under the hood. That way, users with complex needs have the flexibility of the lower-level interfaces while those who can tolerate PreventTransactionChain() have simplicity. I think that's a reasonable suggestion. I am going back and forth a little on this one. It's got the benefit that you can see the internal states more clearly, and it's easier to tell what's going on, and it's better if we want to do more sophisticated testing. The main drawback here is that it's exposing more to the user. I imagined that we might want to push other kinds of optimizations into the load path, and that might upset the interface you've described above. Then again, we'll probably need the normal, load, and transition states regardless, so maybe it's an empty concern. Regards, Jeff Davis
Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums
On Wed, 2012-06-06 at 15:08 -0400, Robert Haas wrote: On Mon, Jun 4, 2012 at 9:26 PM, Jeff Davis pg...@j-davis.com wrote: Thoughts? Simon already proposed a way of doing this that doesn't require explicit user action, which seems preferable to a method that does require explicit user action, even though it's a little harder to implement. His idea was to store the XID of the process creating the table in the pg_class row, which I think is *probably* better than your idea of having a process that waits and then flips the flag. There are some finicky details though - see previous thread for discussion of some of the issues. My goals include: * The ability to load into existing tables with existing data * The ability to load concurrently My understanding was that the proposal to which you're referring can't do those things, which seem like major limitations. Did I miss something? In many cases it would also be nice to write the tuples pre-frozen, so I think we should look for a design that will support that. You're right, that would be nice. Regards, Jeff Davis
[HACKERS] 9.3: load path to mitigate load penalty for checksums
Introduction:
=============

A lot of progress has been made on the checksums issue, with Simon's excellent work beginning to gain consensus:

http://archives.postgresql.org/message-id/CA+u5nmkw_gbs6qq_y8-rjgl1v7mvw2hwbhartb8lojhnpfx...@mail.gmail.com

For the purposes of this proposal, I'll assume that's the general direction we'll be taking for CRCs. The major drawback of that proposal is that it increases the amount of work to be done after a large data load by requiring more WAL.

Proposal:
=========

I propose a special user-initiated loading mode at the table granularity. During this time, readers must ignore PD_ALL_VISIBLE, HEAP_XMIN_COMMITTED, and the visibility map entirely. However, writers may set all of those bits before the writing transaction commits, obviating the need to rewrite (and WAL) the data again later. Ideally, there would be no work for VACUUM to do after the data load (unless a transaction aborted). This would also help the non-CRC case of course, but I expect CRCs to make this significantly more important.

Goals:
======

* Table granularity (doesn't affect other tables at all)
* Allows concurrent loaders
* Allows loading into existing tables with existing data
* Online (allow reads to proceed, even if degraded)

Rough user interaction:
=======================

INITIATE LOAD ON foo AS 'job name';
-- run concurrent loading sessions
FINALIZE LOAD 'job name';

High-level design:
==================

By "hints" I mean the VM bit, PD_ALL_VISIBLE, and HEAP_XMIN_COMMITTED. By "ignorehints" I mean a flag in pg_class indicating that readers should ignore hints. By "optimistichints" I mean a flag in pg_class indicating that writers can optimistically set hints. Obviously, readers and writers would need a mechanism to honor those flags, but I haven't dug into the details yet (additional routines in tqual.c?).

States:
0: normal
   * ignorehints = false
   * optimistichints = false
1: trying to enter data load mode, waiting on existing lockers (who might be scanning) to finish
   * ignorehints = true
   * optimistichints = false
2: data load mode
   * ignorehints = true
   * optimistichints = true
3: trying to leave data load mode, waiting on old snapshots to be released and aborted transactions to be cleaned up
   * ignorehints = true
   * optimistichints = false

INITIATE LOAD would first transition from state 0 to 1 by acquiring a ShareUpdateExclusiveLock on the table (to be sure no concurrent INITIATE or FINALIZE LOAD is going on) and setting ignorehints = true. Then it moves from state 1 to state 2 by waiting for all transactions that hold a lock on that table. Any transactions that don't already have a lock will see the new flag when they try to get it. Now we're sure that all readers will see the ignorehints flag, so we can set the optimistichints flag to indicate that writers can write hints optimistically.

FINALIZE LOAD would first move from state 2 to state 3 by acquiring a ShareUpdateExclusiveLock on the table and setting optimistichints = false. Then, it would move from state 3 to state 0 by first waiting for all transactions that currently hold a lock on the table, to ensure they see the optimistichints=false flag. Then, it would remember the current xid as max_loader_xid, and wait until the global xmin is greater than max_loader_xid. This should ensure that all snapshots regard all loading transactions as complete. Also, it would need to run a lazy VACUUM to remove any tuples from aborted transactions.

Details and optimizations
=========================

* We probably want a graceful way to handle multiple data loads happening on the same table. Rather than erroring out, we could treat it more like a reference count, and only do the work to move into data load mode if not there already, and only move out of data load mode if we're the last loading job on the table.

* In general, there are some usability issues to sort out, to make sure a table isn't put into data load mode and left that way. Right now, I'm mostly concerned with getting a working design, but those will be important, too.

* We could optimize away the VACUUM going from 3 -> 0 if we are sure no writing transactions aborted.

* INITIATE and FINALIZE probably need to use PreventTransactionChain() and multiple transactions, to avoid holding the ShareUpdateExclusiveLock for too long. Also, we want to keep people from using it in the same transaction as the loading xact, because they might not realize that they would get a concurrency of 1 that way (because of the ShareUpdateExclusiveLock).

Thoughts?

Regards, Jeff Davis
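The four states above each pin down exactly one (ignorehints, optimistichints) pair, and they obey a simple invariant: writers may set hints optimistically only while every reader is guaranteed to be ignoring them. A toy table to sanity-check that (a sketch of the proposal, not an implementation):

```python
# (ignorehints, optimistichints) for each state of the proposal
STATE_FLAGS = {
    0: (False, False),  # normal
    1: (True,  False),  # entering load mode: readers ignore hints first
    2: (True,  True),   # data load mode: writers may set hints optimistically
    3: (True,  False),  # leaving load mode: stop optimistic hints first
}

def writer_may_set_hints(state):
    ignorehints, optimistichints = STATE_FLAGS[state]
    # Invariant of the design: optimistic hints are only allowed
    # while all readers are ignoring hints.
    assert optimistichints <= ignorehints
    return optimistichints
```

The intermediate states 1 and 3 exist precisely to uphold that invariant across the waiting periods: readers start ignoring hints before any writer sets them, and writers stop setting them before any reader trusts them again.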
Re: [HACKERS] extending relations more efficiently
On Tue, 2012-05-01 at 10:08 -0400, Robert Haas wrote: We've previously discussed the possible desirability of extending relations in larger increments, rather than one block at a time, for performance reasons. I attempted to determine how much performance we could possibly buy this way, and found that, as far as I can see, the answer is, basically, none. Another point here is that with checksums, we will want to make sure that zero pages can be treated as corrupt. That will probably involve using the WAL for extension operations, and we'll want to mitigate that cost somehow. Extending in larger chunks would probably be necessary. There are some challenges there, but I think it's worth pursuing. Regards, Jeff Davis
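The zero-page point can be illustrated with a toy page checksum (CRC32 here purely for illustration; the actual checksums patch uses a different 16-bit scheme and page layout): once every valid page stores a checksum over its contents, a never-written all-zeros page fails verification instead of passing silently.

```python
import zlib

# Toy layout: the first 4 bytes of the page hold the CRC of the rest.
def set_checksum(page: bytearray) -> None:
    page[0:4] = zlib.crc32(bytes(page[4:])).to_bytes(4, "little")

def verify_checksum(page: bytes) -> bool:
    stored = int.from_bytes(page[0:4], "little")
    return stored == zlib.crc32(bytes(page[4:]))
```

An all-zeros page carries a stored checksum of zero, which doesn't match the CRC of a zeroed body, so it verifies as corrupt — which is why extension needs to be WAL-logged (or batched) so that legitimate not-yet-written pages don't look like corruption.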
[HACKERS] GiST subsplit question
http://archives.postgresql.org/message-id/CAEsn3ybKcnFno_tGQDJ=afhir2xd9ka1fqt5cqxxyu3wz_h...@mail.gmail.com I was trying to answer that question on -general, and I found it a little more challenging than I expected. There was a previous discussion here: http://archives.postgresql.org/pgsql-general/2007-08/msg01816.php And it seems to trace back to these commits: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=783a73168b972488f85e48381546db047cb8f982 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1f7ef548ec2e594fa8766781c490fb5b998ea46b I looked for the follow-up commit to support subsplit in the contrib modules, figuring that would answer some questions, but I couldn't find it. The part that's confusing me is that the commit message says: pickSplit should set spl_(l|r)datum_exists to 'false', but I don't see any picksplit method that actually does that in contrib, nor in the sample in the docs. The code in that area is a bit difficult to follow, so it's not obvious to me exactly what is supposed to happen. So, do we demote that message to a DEBUG1? Or do we make it more clear what the authors of a specific picksplit are supposed to do to avoid that problem? Or am I misunderstanding something? Regards, Jeff Davis
Re: [HACKERS] Synchronized scans versus relcache reinitialization
On Sat, 2012-05-26 at 15:14 -0400, Tom Lane wrote: 3. Having now spent a good deal of time poking at this, I think that the syncscan logic is in need of more tuning, and I am wondering whether we should even have it turned on by default. It appears to be totally useless for fully-cached-in-RAM scenarios, even if most of the relation is out in kernel buffers rather than in shared buffers. The best case I saw was less than 2X speedup compared to N-times-the-single-client case, and that wasn't very reproducible, and it didn't happen at all unless I hacked BAS_BULKREAD mode to use a ring buffer size many times larger than the current 256K setting (otherwise the timing requirements are too tight for multiple backends to stay in sync --- a seqscan can blow through that much data in a fraction of a millisecond these days, if it's reading from kernel buffers). The current tuning may be all right for cases where you're actually reading from spinning rust, but that seems to be a decreasing fraction of real-world use cases. Do you mean that the best case you saw ever was 2X, or the best case when the table is mostly in kernel buffers was 2X? I clearly saw better than 2X when the table was on disk, so if you aren't, we should investigate. One thing we could do is drive the threshold from effective_cache_size rather than shared_buffers, which was discussed during 8.3 development. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] temporal support patch
On Wed, 2012-05-16 at 23:14 +0200, Miroslav Šimulčík wrote: Hi all, as a part of my master's thesis I have created a temporal support patch for PostgreSQL. It enables the creation of special temporal tables with entry versioning. Modifying operations (UPDATE, DELETE, TRUNCATE) on these tables don't cause permanent changes to entries, but create new versions of them. Thus the user can easily get to past states of the table. I would be very interested to see this, thank you for working on it. There are quite a few aspects to a temporal database system, and you are working on a system-maintained transaction-time historical table, right? Or are there other aspects to your proposal? Some general comments:
* I'd very much like to see you make use of Range Types from 9.2; in particular, TSTZRANGE would be much better than holding two timestamps. If a standard requires you to display two timestamps in certain situations, perhaps you could use ranges internally and display the boundaries as timestamps when needed.
* There is other useful information that could be recorded, such as the user who inserted/updated/deleted the record.
* For some purposes, it's very useful to keep track of the columns that changed. For instance, a query like "show me any time a salary was changed over the last month" (or some other rare event) would be very slow to run if there were not some explicit annotation on the historical records (e.g. a "columns changed" bitmap or something).
* In general, I'm not fond of adorning queries with TRANSACTION TIME AS OF... kinds of things. Those constructs are redundant with a WHERE clause (on a range type, you'd use the contains operator). If a standard requires that, maybe it would be OK to allow such things as syntactic sugar.
* I do like having special DDL that creates the appropriate objects. That helps to guide users so they don't have to invent their own solution with triggers, etc. 
* As Jim mentioned, it might make sense to use something resembling inheritance so that selecting from the historical table includes the current data (but with no upper bound for the range).
* It might make sense to hammer out as many of the details as we can with an extension. For instance, exactly what options will be available, what data types will be used, what objects will be created, the trigger code, etc. Then, it will be more obvious exactly what we need to add extra core support for (e.g. if we are going to use some inheritance-like mechanism), and what we need to add syntactic sugar for.
I recommend that you start posting more detailed designs on http://wiki.postgresql.org If you already have code, feel free to submit it for the next commitfest ( http://commitfest.postgresql.org ), but this is a relatively large project, so it will most likely take several commitfest cycles. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: 2d-mapping based indexing for ranges
On Mon, 2012-05-28 at 23:42 +0400, Alexander Korotkov wrote: Hackers, the attached patch implements another approach to indexing of ranges: mapping lower and upper bounds as 2d coordinates and using the same indexing approaches as for 2d points. The patch provides a range_ops2 operator class which implements this approach. ... "Contained by" queries with small selectivity are dramatically faster as expected. That's good news, but that seems like a less-common query. Overlaps and contains are probably the two most important. It would be good to show a significant benefit in at least one of those two queries if the build time increases. 2) Limitations of penalty expressive power. The GiST penalty function returns just one float value. That is a limitation even for geometrical datatypes. Some papers on R-trees recommend inserting the new entry into the box with minimal area when boxes have the same extension area (very possible if the extension area is zero). We can't express that using a single float value. For 2d-mapping range indexing, the situation is even harder. Imagine a set of ranges sharing the same lower or upper bound. In 2d space, such ranges are united into lines. But we can't express an increase of line length in the penalty, because the area of a line is always zero regardless of its length. Actually, a similar problem exists for geometrical datatypes, but it seems to me this problem is more urgent for ranges, because I believe that in real-life data, a large set of ranges could really share the same lower or upper bound. Probably, we would like to rank key extension cases as follows or similar:
a) extension of a finite dimension to infinite
b) extension in a dimension perpendicular to an infinite one
c) extension with finite area extension
d) extension with zero area increment (extension of line length)
So, it's impossible to fit all of these into one float. And it's hard to choose what to neglect and how. 
I think there are at least two ways to make the GiST interface sensitive enough to these things:
a) Make the penalty function return more than one float. The question is how many floats to allow.
b) Create an alternative "choose" interface function which would return the set of most-preferred keys for insert. Using such a function could lead to a slowdown, because currently GiST stops the scan when it meets a zero penalty.
I agree that we should consider improving the interface for the penalty function. In your last patch improving range indexing, it was quite awkward to express the penalty in one dimension. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_upgrade libraries check
On Mon, 2012-05-28 at 16:06 -0400, Robert Haas wrote: On Sun, May 27, 2012 at 11:31 AM, Tom Lane t...@sss.pgh.pa.us wrote: I don't recall exactly what problems drove us to make pg_upgrade do what it does with extensions, but we need a different fix for them. Well, you need pg_upgrade to preserve the OIDs of objects that are part of extensions just as you do for any other objects. Also, I think it needs to force the extension version to match the old cluster. Otherwise, we could be dealing with on-disk format changes, or other complexities. It doesn't sound difficult, but I thought I'd bring it up. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add primary key/unique constraint using prefix columns of an index
On Tue, 2012-05-22 at 10:24 -0700, Jeff Janes wrote: Now that there are index only scans, there is a use case for having a composite index which has the primary key or a unique key as the prefix column(s) but with extra columns after that. Currently you would also need another index with exactly the primary/unique key, which seems like a waste of storage and maintenance. Should there be a way to declare a unique index with the unique property applying to a prefix of the indexed columns/expression? And having that, a way to turn that prefix into a primary key constraint? Of course this is easier said than done, but is there some reason for it not to be a to-do item? Technically, this can be done today using exclusion constraints if you define an indexable operator that always returns true. A similar idea, which I brought up during 9.0 development, is that it might be useful to have one index that can enforce two UNIQUE constraints. For instance, an index on (a,b,c) could be used to enforce UNIQUE(a,b) and UNIQUE(a,c) using the exclusion constraint mechanism. I didn't offer a lot of convincing evidence of practical value, so it was left out. But it might be worth a second look. Either way, it seems like the deviations from normal UNIQUE would be better expressed in terms of an exclusion constraint, which offers a little more room in the language. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Sun, 2012-03-04 at 16:39 +, Simon Riggs wrote: v3 attached. Added to next commitfest. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] 9.3: summary of corruption detection / checksums / CRCs discussion
/cabrt9rbrmdsoz8kxgehfb4lg-ev9u67-6dlqvoiibpkkhtl...@mail.gmail.com * Some way of caching CLOG information or making the access faster. IIRC, there were some vague ideas about mmapping() the CLOG, or caching a very small representation of the CLOG. * Something else -- There are a few other lines of thought here. For instance, can we use WAL for hint bits without a FPI, and still protect against torn pages causing CRC failures? This is related to a comment during the 2011 developer meeting, where someone brought up the idea of idempotent WAL changes, and how that could help us avoid FPIs. It seems possible after reading the discussions, but not clear enough on the challenges to summarize here. If we do use WAL for hint bit updates, that has an impact on Hot Standby, because HS can't write WAL. So, it would seem that HS could not set hint bits. Comments? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: 9.3: summary of corruption detection / checksums / CRCs discussion
On Sun, 2012-04-22 at 00:08 +0100, Greg Stark wrote: On Sat, Apr 21, 2012 at 10:40 PM, Jeff Davis pg...@j-davis.com wrote: * In addition to detecting random garbage, we also need to be able to detect zeroing of pages. Right now, a zero page is not considered corrupt, so that's a problem. We'll need to WAL-log table extension operations, and we'll need to mitigate the performance impact of doing so. I think we can do that by extending larger tables by many pages (say, 16 at a time) so we can amortize the cost of WAL and avoid contention. I haven't seen this come up in discussion. I don't have any links, and it might just be based on in-person discussions. I think it's just being left as a loose end for later, but it will eventually need to be solved. WAL logging table extensions wouldn't by itself work because currently we treat the file size on disk as the size of the table. So you would have to do the extension in the critical section or else different backends might see the wrong file size and write out conflicting WAL entries. By "critical section", I assume you mean "while holding the relation extension lock", not "while inside a CRITICAL_SECTION()", right? There would be some synchronization overhead, to be sure, but I think it can be done. Ideally, we'd be able to do large enough extensions that, if there is a parallel bulk load on a single table or something, the overhead could be made insignificant. I didn't intend to get too much into the detail in this thread, but if it's a totally ridiculous or impossible idea, I'll remove it. The earlier consensus was to move all the hint bits to a dedicated area and exclude them from the checksum. I think double-write buffers seem to have become more fashionable, but a summary that doesn't describe the former is definitely incomplete. Thank you, that's the kind of omission I was looking to catch. That link points to the MVCC-safe truncate patch. 
I don't follow how optimizations in bulk loads are relevant to WAL-logging hint bit updates. I should have linked to these messages:
http://archives.postgresql.org/message-id/CA+tgmoylozdezzjkyj8_x2bpeeerao5dj-omvs1flqoqsqp...@mail.gmail.com
http://archives.postgresql.org/message-id/CA+Tgmoa4Xs1jbZhm=pb9Xi4AGMJXRB2a4GSE9EJtLo=70Zne=g...@mail.gmail.com
Though perhaps I'm reading too much into Robert's comments. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join
On Tue, 2012-04-17 at 14:24 -0400, Robert Haas wrote: I thought Jeff was parenthetically complaining about cases like A LEFT JOIN (B INNER JOIN C ON b.y = c.y) ON a.x && b.x. That presumably would require the parameterized-path stuff to have any chance of doing partial index scans over B. However, I understand that's not the main issue here. To take the mystery out of it, I was talking about any case where an index scan is impossible or impractical. For instance, let's say the ranges are computed values. Just to make it really impossible, let's say the ranges are computed from columns in two different tables joined in a subquery. But yes, the ability of the planner to find the plan is also an issue (hopefully less of one with the recent improvements). One thing that I think needs some analysis is when the range join idea is better or worse than a nested loop with inner index-scan, because potentially those are the options the planner has to choose between, and the costing model had better know enough to make the right thing happen. It strikes me that the nested loop with inner index-scan is likely to be a win when there are large chunks of the indexed relation that the nestloop never needs to visit at all - imagine small JOIN big ON small.a && big.a, for example. I suppose the really interesting question is how much we can save when the entirety of both relations has to be visited anyway - it seems promising, but I guess we won't know for sure without testing it. Right, I will need to come up with a prototype that can at least test the executor piece. I suspect that the plan choice won't be all that different from an ordinary index nestloop versus mergejoin case, but with much worse cardinality estimates to work with. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join
On Tue, 2012-04-17 at 14:03 -0400, Robert Haas wrote: I'm actually not sure these are equivalent formulations. Suppose one side has [i,i] where i ranges from 1 to 1000 and the other side the exact same thing plus [1,1000]. That one really big range will come up second on the right side, and you will not be able to discard it until you reach the end of the join. If you just keep rewinding the right side, you're going to end up with O(n^2) behavior, whereas if you can discard tuples from the middle on the right side, then you will get O(n) behavior, which is a big difference. In other words, in your original algorithm the tuples that you discard in step 4 or 5 need not be the first remaining tuple on whichever side of the join we're talking about. To illustrate the problem (slightly modified), I'll write the sets out verbatim rather than use range syntax:

  LHS        RHS
  {1,2,3}    {1,2,3,4,5,6,7,8,9}
  {2,3,4}    {2,3,4} . . . . .
  {3,4,5}    {3,4,5} . . . .
  {4,5,6}    {4,5,6} . . .
  {5,6,7}    {5,6,7} . .
  {6,7,8}    {6,7,8} .
  {7,8,9}    {7,8,9}

The dots are supposed to represent a shadow that the large range [1,9] casts below it. This shadow prevents the discarding of [2,4] on the RHS even when processing range [5,7] on the LHS, because we can't discard out of the middle. Note that, if you just have some large ranges and a lot of overlap, that's not really a problem with the algorithm, it's just a large result to compute. The problem comes when the ranges vary in size by quite a lot, and there are many ranges that could be eliminated but can't be because of the shadow. This problem can be mitigated substantially, I believe. Let me change the algorithm (I don't like the way the pseudocode was structured anyway), and word it a little more loosely:
1. Look at the top ranges on each side. Choose the one with the greater upper bound, and call that Range1 from Side1, and the other range Range2 from Side2. If either Range1 or Range2 is empty, terminate.
2. 
Scan down Side2, discarding ranges that are strictly before, and joining with ranges that overlap, stopping when you hit a range that is strictly after.
3. Now, discard Range1, and reset Side2 to the first non-discarded range. Goto 1.
The benefit is, in step 1, we choose a side that will *always* discard the top tuple. And if we choose the one with the greater upper bound, then we are going to eliminate the largest shadow. That doesn't eliminate the problem entirely, but it seems like it would reduce it a lot. Regarding the idea of discarding tuples in the middle, that might be an interesting approach as well. It might be as simple as setting a flag in the tuple header (like was done for full hash joins). Still not perfect, but would make redundant checks very cheap. Combined with my strategy, there's a good chance that we practically eliminate the problem. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join
On Wed, 2012-04-18 at 01:21 -0400, Tom Lane wrote: It would be a pretty weird implementation of mergejoin that could discard tuples from the middle of an input stream. Or to be more specific, it wouldn't be the mergejoin itself that could do that at all --- you'd need the input plan node to be some sort of tweaked version of tuplestore or tuplesort that could respond to a request like that. As I said in my reply to Robert, I think there are some ways we can make this idea work. I can't escape the feeling that Jeff has chosen the wrong basis for his thought experiment, and that what he really ought to be thinking about is hashjoin, which keeps an in-memory table that it could easily modify on the fly if it chose to. The multi-batch aspect of hybrid hashjoin could be useful too (IOW, when you're out of better ideas, throw the tuple back in the water to process later). Obviously hashing is not going to be much use for anything but equality. So I believe this approach is very similar to the temporary-index method, except with batching, and always keeping the index in memory. I don't think we would get the partitioning benefit of hashjoin, because we'd have to put the same tuple in multiple partitions, so it's probably better to just leave the outer side intact. But in-memory indexes and multiple passes of the outer seems like a reasonable alternative, particularly because an in-memory index might be very fast (to build as well as query). This is just handwaving of course. I think some digging in the spatial-join literature would likely find ideas better than any of these. I will look in some more detail. The merge-like approach did seem to be represented in the paper referenced by Alexander (the external plane sweep), but it also refers to several methods based on partitioning. I'm beginning to think that more than one of these ideas has merit. 
Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join
On Mon, 2012-04-16 at 02:52 -0400, Tom Lane wrote: Jeff Davis pg...@j-davis.com writes:
1. Order the ranges on both sides by the lower bound, then upper bound. Empty ranges can be excluded entirely.
2. Left := first range on left, Right := first range on right
3. If Left or Right is empty, terminate.
4. If lower(Left) > upper(Right), discard Right, goto 2
5. If lower(Right) > upper(Left), discard Left, goto 2
6. return (Left, Right) as joined tuple
7. Right := next range on right
8. goto 3
This is surely not correct in detail. As written, it will be impossible for any tuple on the right side to be joined to more than one left-side tuple. You will need something analogous to the mark-and-restore rewinding logic in standard mergejoin to make this work. Every time you discard a tuple on the left, you go to step 2, which rewinds the right side back to the first non-discarded tuple. So, implemented using mark and restore:
* start off with the marks at the first tuple on each side
* discard means move the mark down a tuple
* setting it back to the first range means restoring to the mark
* going to the next range (step 7) just means getting another tuple, without changing the mark
Only one side really needs the mark and restore logic, but it was easier to write the pseudocode in a symmetrical way (except step 7). I will note that mergejoin is not only one of the more complicated executor modules, but it accounts for a huge percentage of the planner's complexity; which doesn't exactly make me sympathize with the notion of let's-copy-and-paste-all-that. It'd be a lot better if we could build it as a small tweak to mergejoin rather than an independent thing. (Having said that, I have no idea how the planner support could work at all, because its treatment of mergejoins is intimately tied to mergejoinable equality and EquivalenceClasses and PathKeys, which don't seem readily extensible to the range situation.) I think this is the more important problem area. 
It's clear that I'll need to do some more investigation into the planning. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join
On Mon, 2012-04-16 at 16:22 +0400, Alexander Korotkov wrote: There is a good overview article about spatial joins. http://www.cs.umd.edu/users/hjs/pubs/jacoxtods07.pdf Thank you, that's exactly the kind of overview I was looking for. It shows that there are a lot of methods based on building temporary indexes. It might mean that building a temporary GiST index is not a bad idea. That had occurred to me, but I was hesitant to only use temp indexes. It still doesn't really offer a good solution when both sides of the join are relatively large (because of random I/O). Also the build speed of the index would be more important than it is now. On the other hand, if I tackle the problem using temp indexes, it would be a more general solution useful for many problems (for instance, we really need a better solution when the result of a subquery doesn't fit in a work_mem-sized hashtable). We may end up with some kind of combination (as the sweep seems to be; see below). Also there are methods which look like extensions of the Jeff Davis proposal to the multidimensional case: the Plane Sweep and External Plane Sweep methods. However, they might use sophisticated data structures for the sweep, and I believe they should for good performance. That method looks closer to what I had in mind. My Range Join proposal is essentially the same as the sweep method except that it only joins on one dimension, and the rest of the dimensions have to be checked with RECHECK. So, this still works for an N-dimensional join, as long as a single dimension is fairly selective. The sweep method seems to do the first dimension with the sweep method, and maintains a changing index that it uses to do the lookup. In other words, it's essentially just a way to do the RECHECK on the other dimensions in less than O(n) time. So, if the sweep dimension is not selective at all, this degenerates to the temporary index method (or something close, anyway). 
The fact that, in the sweep method, there is still one special dimension, makes me think that it will be easier to make the API work based on ranges (which is a big win because postgres already knows about ranges). If so, that makes it easier to divide the project into stages, because we could get it working for ranges and then extend it to support other dimensions more efficiently later. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join
On Sun, 2012-04-15 at 23:18 -0700, Darren Duncan wrote: Your proposal makes me think of something similar which might be useful, INclusion constraints. As exclusion constraints might be thought of as a generalization of unique/key constraints, inclusion constraints are like a generalization of foreign key constraints. The inclusion constraints basically allow some comparison operator other than is-equal to test if values in one table match values in another table, and the constraint allows the former if the test results in true. An example of said inclusion test is whether the range in one table is contained in a range in another table. I assume/hope that, similarly, now that we have range types in 9.2, the existing exclusion constraints can be used with range comparison operators. Yes, Range Foreign Keys are one of the loose ends for Range Types that I'd like to wrap up. As to your actual proposal, it sounds like a generalization of the relational join or set intersection operator where instead of comparing sets defined in terms of an enumeration of discrete values we are comparing sets defined by a range, which conceptually contain infinitely many values depending on the data type the range is defined over. But if we're doing this, then it would seem to make sense to go further and see if we have set analogies for all of our relational or set operators, should we want to work with non-discrete sets. Now this sounds interesting in theory, but I would also assume that these could be implemented by an extension in terms of existing normal relational operators, where each range value is a discrete value, combined with operators for unioning, differencing, etc. on ranges. A relation of ranges effectively can represent a discontinuous range; in that case, the empty discontinuous range is also canonically representable by a relation with zero tuples. 
Jeff, I get the impression your proposal is partly about helping performance by supporting this internally, rather than one just defining it as a SQL function, am I right? Per my proposal, the problem statement is that it's slow, so speeding it up is really the entire proposal ;) Extending the syntax might be interesting as well, but I don't have a proposal for that. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join
On Mon, 2012-04-16 at 22:20 +0100, Simon Riggs wrote: On Mon, Apr 16, 2012 at 9:12 PM, Jeff Davis pg...@j-davis.com wrote: That had occurred to me, but I was hesitant to only use temp indexes. It still doesn't really offer a good solution when both sides of the join are relatively large (because of random I/O). Also the build speed of the index would be more important than it is now. The thing I like most about temp indexes is that they needn't be temporary. I'd like to see something along the lines of demand-created optional indexes, which we reclaim space/maintenance overhead on according to some cache management scheme. The more space you have, the more of the important ones hang around. Roughly the same idea applies to materialised views. I think to make an informed decision, the next thing I need to do is hack up a prototype version of my merge join idea, and see how well it performs against an index nestloop today. It seems to me that both approaches may have merit independently. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] 9.3 Pre-proposal: Range Merge Join
I hope this is not an inappropriate time for 9.3 discussions. The flip side of asking for submissions in the first couple commitfests means that I need to submit proposals now. What is a Range Join? See attached SQL for an example. The key difference is that the join condition is not equality, but overlaps (&&). Problem statement: slow. Nested loops are the only option, although they can benefit from an inner GiST index if available. But if the join is happening up in the plan tree somewhere, then it's impossible for any index to be available. Proposed solution: a modified merge join that can handle ranges.
1. Order the ranges on both sides by the lower bound, then upper bound. Empty ranges can be excluded entirely.
2. Left := first range on left, Right := first range on right
3. If Left or Right is empty, terminate.
4. If lower(Left) > upper(Right), discard Right, goto 2
5. If lower(Right) > upper(Left), discard Left, goto 2
6. return (Left, Right) as joined tuple
7. Right := next range on right
8. goto 3
If step 4 or step 5 keeps getting triggered, and a btree index is available (ordered by lower bound), we can re-probe to go to the correct position, and consider that the new top range on that side. This is an optimization for the case where there are large numbers of ranges with no match on the other side. Thanks to Nathan Boley for helping me devise this algorithm. However, any bugs are mine alone ;) Weaknesses: I haven't thought through the optimization, but I suspect it will be hard to be very accurate in the costing. That might be OK, because there aren't very many options anyway, but I'll need to think it through. Questions:
* Is this idea sane? -- that is, are ranges important enough that people are willing to maintain a new operator?
* The more general problem might be spatial joins which can operate in N dimensions, and I doubt this would work very well in that case. 
Does someone know of a spatial join algorithm (without IP claims) that would be as good as this one for ranges?
* Other thoughts?
Regards, Jeff Davis

-- rate is billing rate in dollars per hour
CREATE OR REPLACE FUNCTION bill(rate NUMERIC, during TSTZRANGE)
RETURNS NUMERIC LANGUAGE plpgsql AS $$
DECLARE
    usage_s NUMERIC;
BEGIN
    IF isempty(during) THEN
        usage_s := 0;
    ELSE
        usage_s := extract(epoch from (upper(during) - lower(during)));
    END IF;
    RETURN rate * (usage_s/3600.0);
END;
$$;

CREATE TABLE billing_rate(rate NUMERIC, during TSTZRANGE);
CREATE TABLE billing_usage(customer_id INT8, during TSTZRANGE);

INSERT INTO billing_rate VALUES (12.50, '[2010-01-01 14:00, 2010-01-01 15:00)');
INSERT INTO billing_rate VALUES (14.50, '[2010-01-01 15:00, 2010-01-01 15:45)');
INSERT INTO billing_usage VALUES (123, '[2010-01-01 14:30, 2010-01-01 15:30)');
INSERT INTO billing_usage VALUES (234, '[2010-01-01 14:15, 2010-01-01 15:45)');

SELECT customer_id, bill(rate, range_intersect(u.during, r.during)) AS bill
FROM billing_rate r, billing_usage u
WHERE r.during && u.during;

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory usage during sorting
On Sun, 2012-03-18 at 11:25 -0400, Tom Lane wrote: Yeah, that was me, and it came out of actual user complaints ten or more years back. (It's actually not 2X growth but more like 4X growth according to the comments in logtape.c, though I no longer remember the exact reasons why.) We knew when we put in the logtape logic that we were trading off speed for space, and we accepted that.

I skimmed through TAOCP, and I didn't find the 4X number you are referring to, and I can't think what would cause that, either. The exact wording in the comment in logtape.c is "4X the actual data volume", so maybe that's just referring to per-tuple overhead?

However, I also noticed that section 5.4.4 (Vol 3 p299) starts discussing the idea of running the tapes backwards and forwards. That doesn't directly apply, because a disk seek is cheaper than rewinding a tape, but perhaps it could be adapted to get the block-freeing behavior we want.

The comments in logtape.c say: "Few OSes allow arbitrary parts of a file to be released back to the OS, so we have to implement this space-recycling ourselves within a single logical file." But if we alternate between reading in forward and reverse order, we can make all of the deallocations at the end of the file, and then just truncate to free space.

I would think that the OS could be more intelligent about block allocations and deallocations to avoid too much fragmentation, and it would probably be a net reduction in code complexity. Again, the comments in logtape.c have something to say about it: "...but the seeking involved should be comparable to what would happen if we kept each logical tape in a separate file". But I don't understand why that is the case.

On another topic, quite a while ago Len Shapiro (PSU professor) suggested to me that we implement Algorithm F (Vol 3 p321).
Right now, as tuplesort.c says: "In the current code we determine the number of tapes M on the basis of workMem: we want workMem/M to be large enough that we read a fair amount of data each time we preread from a tape".

But with forecasting, we can be a little smarter about which tapes we preread from if the data in the runs is not random. That means we could potentially merge more runs at once with the same work_mem without sacrificing adequate buffers for prefetching.

I'm not sure whether this is a practical problem today, and I'm also not sure what to do if we start merging a lot more runs and then determine that forecasting doesn't work as well as we'd hoped (e.g. that the data in the runs really is random). But I thought it was worth mentioning.

Regards, Jeff Davis
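[Editor's note: the key observation behind Knuth's forecasting (Algorithm F, TAOCP Vol 3, 5.4.4) can be shown in a few lines. During a k-way merge of sorted runs with one buffered block per run, the buffer that will run dry first is always the one whose largest buffered key is smallest, so that is the run to preread from. A toy model with Python lists, assuming distinct keys:]

```python
def forecast(bufs):
    """Index of the input buffer that will empty first during the merge:
    the one whose largest buffered key is smallest (assumes distinct keys)."""
    live = [i for i, b in enumerate(bufs) if b]
    return min(live, key=lambda i: bufs[i][-1])

def first_to_empty(bufs):
    """Simulate the merge and report which buffer actually empties first."""
    bufs = [list(b) for b in bufs]
    while True:
        # Consume the globally smallest buffered key, as a merge would.
        i = min((i for i, b in enumerate(bufs) if b), key=lambda i: bufs[i][0])
        bufs[i].pop(0)
        if not bufs[i]:
            return i

bufs = [[1, 4, 9], [2, 3, 5], [6, 7, 8]]
assert forecast(bufs) == first_to_empty(bufs) == 1
```

Because the prediction needs only the last key of each buffer, the refill I/O can be issued before any buffer actually empties, which is what would let a merge keep more runs in flight with the same work_mem.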
Re: [HACKERS] bug in fast-path locking
On Mon, 2012-04-09 at 22:47 -0700, Jeff Davis wrote: but other similar paths do: if (!proclock) { AbortStrongLockAcquire(); I don't think it's necessary outside of LockErrorCleanup(), right? I take that back, it's necessary for the dontwait case, too. Regards, Jeff Davis
Re: [HACKERS] Last gasp
On Mon, 2012-04-09 at 23:12 -0400, Christopher Browne wrote: But there is also a flip side to that, namely that if we do so, there ought to be some aspect to the process to help guide those items that *aren't* particularly close to being committable. I have benefited immensely from review of my WIP patches, and a lot of the serious review tends to happen during commitfests. This is most important for features with a significant user interface, where it's harder to guess what people will want. My current strategy is to submit WIP-marked patches during a commitfest. I agree that we should continue to have a mechanism to review patches that aren't ready for commit, though I'm fine if we change it. Regards, Jeff Davis
Re: [HACKERS] bug in fast-path locking
On Mon, 2012-04-09 at 16:11 -0400, Robert Haas wrote: I wonder though whether you actually need a *count*. What if it were just a flag saying "do not take any fast path locks on this object", and once set it didn't get unset until there were no locks left at all on that object? I think if you read the above-referenced section of the README you'll be deconfused.

My understanding: the basic reason for the count is that we need to preserve the information that a strong lock is being acquired between the time it's put in FastPathStrongRelationLocks and the time it actually acquires the lock in the lock manager. By definition, the lock manager doesn't know about it yet (so we can't use a simple rule like "there are no locks on the object, so we can unset the flag"). Therefore, the backend must indicate whether it's in this code path or not; meaning that it needs to do something on the error path (in this case, decrement the count). That was the source of this bug. There may be a way around this problem, but nothing occurs to me right now.

Regards, Jeff Davis

PS: Oops, I missed this bug in the review, too.

PPS: In the README, FastPathStrongRelationLocks is referred to as FastPathStrongLocks. Worth a quick update for easier symbol searching.
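[Editor's sketch of why a count is needed rather than a flag: two backends can be acquiring strong locks on the same relation concurrently, and one may release while the other is still between "advertise the strong lock" and "acquire it in the lock manager". With a boolean flag, the first release would clear it and fast-path locks could slip past the in-progress acquisition. A toy sequential model, not the shared-memory implementation:]

```python
strong_count = {}   # object -> strong locks held or in progress

def begin_strong_acquire(obj):
    # Advertised before the lock manager knows anything about the lock.
    strong_count[obj] = strong_count.get(obj, 0) + 1

def release_strong(obj):
    strong_count[obj] -= 1

def fast_path_allowed(obj):
    return strong_count.get(obj, 0) == 0

rel = "some_rel"
begin_strong_acquire(rel)   # backend A
begin_strong_acquire(rel)   # backend B, still mid-acquisition
release_strong(rel)         # backend A releases its lock
# B's strong lock is still pending, so fast path must stay disabled;
# a single flag cleared by A's release would get this wrong.
assert not fast_path_allowed(rel)
release_strong(rel)         # B eventually releases
assert fast_path_allowed(rel)
```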
Re: [HACKERS] bug in fast-path locking
On Mon, 2012-04-09 at 17:42 -0500, Jim Nasby wrote: Dumb question... should operations in the various StrongLock functions take place in a critical section? Or is that already ensured outside of these functions?

Do you mean CRITICAL_SECTION() in the postgres sense (that is, avoid error paths by making all ERRORs into PANICs and preventing interrupts), or the sense described here: http://en.wikipedia.org/wiki/Critical_section ?

If you mean the postgres sense, you'd have to hold the critical section open from the time you incremented the strong lock count all the way until you decremented it (which is normally at the time the lock is released); which is a cure worse than the disease.

Regards, Jeff Davis
Re: [HACKERS] bug in fast-path locking
On Mon, 2012-04-09 at 13:32 -0400, Robert Haas wrote: I looked at this more. The above analysis is basically correct, but the problem goes a bit beyond an error in LockWaitCancel(). We could also crap out with an error before getting as far as LockWaitCancel() and have the same problem. I think that a correct statement of the problem is this: from the time we bump the strong lock count, up until the time we're done acquiring the lock (or give up on acquiring it), we need to have an error-cleanup hook in place that will unbump the strong lock count if we error out. Once we're done updating the shared and local lock tables, the special handling ceases to be needed, because any subsequent lock release will go through LockRelease() or LockReleaseAll(), which will do the appropriate cleanup. The attached patch is an attempt at implementing that; any reviews appreciated.

This path doesn't have an AbortStrongLockAcquire:

    if (!(proclock->holdMask & LOCKBIT_ON(lockmode)))
    {
        ...
        elog(ERROR, ...)

but other similar paths do:

    if (!proclock)
    {
        AbortStrongLockAcquire();

I don't think it's necessary outside of LockErrorCleanup(), right?

I think there could be some more asserts. There are three sites that decrement FastPathStrongRelationLocks, but in two of them StrongLockInProgress should always be NULL. Other than that, it looks good to me.

Regards, Jeff Davis
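[Editor's sketch of the error-cleanup window Robert describes: from the moment the strong-lock count is bumped until the lock tables are fully updated, any error must un-bump it. The C code arms a StrongLockInProgress variable that AbortStrongLockAcquire() checks on the error path; modeled here with try/except. Names mirror the patch, but the code is a toy model:]

```python
strong_count = {"rel": 0}
strong_in_progress = None          # analogue of StrongLockInProgress

def abort_strong_lock_acquire():
    """Undo the bump, but only if the acquisition is still in progress."""
    global strong_in_progress
    if strong_in_progress is not None:
        strong_count[strong_in_progress] -= 1
        strong_in_progress = None

def lock_acquire(obj, fail_midway=False):
    global strong_in_progress
    strong_count[obj] += 1         # bump the strong lock count
    strong_in_progress = obj       # arm the cleanup hook
    try:
        if fail_midway:
            raise RuntimeError("error before lock tables were updated")
        # ... shared and local lock tables updated here ...
        strong_in_progress = None  # disarm: later release does the cleanup
    except Exception:
        abort_strong_lock_acquire()
        raise

try:
    lock_acquire("rel", fail_midway=True)
except RuntimeError:
    pass
assert strong_count["rel"] == 0    # the bump was undone on the error path
```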
Re: [HACKERS] pg_upgrade incorrectly equates pg_default and database tablespace
On Fri, 2012-03-30 at 13:11 -0700, Jeff Davis wrote: I confirmed this bug upgrading 9.1 to master, and that this patch fixes it. Thank you for the report! Patch looks good to me as well, with one very minor nitpick: the added comment is missing an apostrophe. Bruce, can you take a look at this? Adding this to the next commitfest, just so it doesn't get forgotten. Regards, Jeff Davis
Re: [HACKERS] pg_upgrade incorrectly equates pg_default and database tablespace
On Thu, 2012-03-22 at 14:55 +0200, Ants Aasma wrote: Hi, while working on a support case I stumbled upon a bug in pg_upgrade. Upgrade fails with "No such file or directory" when a database is moved to a non-default tablespace and contains a table that is moved to pg_default. The cause seems to be that the following test incorrectly equates empty spclocation with database tablespace:

    tblspace = PQgetvalue(res, relnum, i_spclocation);
    /* if no table tablespace, use the database tablespace */
    if (strlen(tblspace) == 0)
        tblspace = dbinfo->db_tblspace;

Patch to fix this is attached.

I confirmed this bug upgrading 9.1 to master, and that this patch fixes it. Thank you for the report! Patch looks good to me as well, with one very minor nitpick: the added comment is missing an apostrophe. Bruce, can you take a look at this?

Regards, Jeff Davis
Re: [HACKERS] pg_terminate_backend for same-role
On Tue, 2012-03-20 at 01:38 -0700, Daniel Farina wrote: On Thu, Mar 15, 2012 at 9:39 PM, Fujii Masao masao.fu...@gmail.com wrote: On Fri, Mar 16, 2012 at 8:14 AM, Daniel Farina dan...@heroku.com wrote: Parallel to pg_cancel_backend, it'd be nice to allow the user to just outright kill a backend that they own (politely, with a SIGTERM), aborting any transactions in progress, including the idle transaction, and closing the socket. +1 Here's a patch implementing the simple version, with no more guards against signal racing than have been seen previously. The more elaborate variants to close those races is being discussed in a parallel thread, but I thought I'd get this simple version out there.

Review: after reading through the threads, it looks like there was no real objection to this approach -- pid recycling is not something we're concerned about.

I think you're missing a doc update though, in func.sgml: "For the less restrictive <function>pg_cancel_backend</>, the role of an active backend can be found from the <structfield>usename</structfield> column of the <structname>pg_stat_activity</structname> view."

Also, high-availability.sgml makes reference to pg_cancel_backend(), and it might be worthwhile to add an "...and pg_terminate_backend()" there.

Other than that, it looks good to me.

Regards, Jeff Davis
Re: [HACKERS] initdb and fsync
On Sat, 2012-03-17 at 17:48 +0100, Cédric Villemain wrote: I agree with Andres. I believe we should use sync_file_range (_before?) with linux. And we can use posix_fadvise_dontneed on other kernels. OK, updated patch attached. sync_file_range() is preferred, posix_fadvise() is a fallback. Regards, Jeff Davis initdb-fsync-20120325.patch.gz Description: GNU Zip compressed data
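[Editor's sketch of the two-pass scheme in the patch: a first pass hints the kernel to start writeback on every file (sync_file_range(SYNC_FILE_RANGE_WRITE) on Linux, posix_fadvise otherwise), then a second pass fsyncs each file. Python's os module does not expose sync_file_range, so this sketch shows only the fadvise fallback; the function name is the editor's, not the patch's:]

```python
import os

def fsync_dir_recursive(path):
    """Fsync every file under path, after a best-effort first pass that
    nudges the kernel into starting writeback early.  Returns the number
    of files synced."""
    files = []
    for dirpath, _, names in os.walk(path):
        files.extend(os.path.join(dirpath, n) for n in names)

    # Pass 1: initiate writeback without waiting (skipped where the
    # platform lacks posix_fadvise, e.g. Windows).
    if hasattr(os, "posix_fadvise"):
        for f in files:
            fd = os.open(f, os.O_RDONLY)
            try:
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            finally:
                os.close(fd)

    # Pass 2: wait for the data to reach stable storage.
    for f in files:
        fd = os.open(f, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    return len(files)
```

The win comes from overlapping writeback across all files in pass 1 instead of paying one synchronous flush per file; as the thread notes, fadvise also evicts the pages from cache, which sync_file_range avoids.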
Re: [HACKERS] Incorrect assumptions with low LIMITs
On Sat, 2012-03-17 at 12:48 +0000, Simon Riggs wrote: The problems are as I described them (1) no account made for sparsity, and other factors leading to an overestimate of rows (N) (2) inappropriate assumption of the effect of LIMIT m, which causes a costly SeqScan to appear better than an IndexScan for low m/N, when in fact that is seldom the case. Overestimating N in (1) inverts the problem, so that an overestimate isn't the safe thing at all. I think the actual problem has more to do with risk. The planner doesn't know how uniform the distribution of the table is, which introduces risk for the table scan. I would tend to agree that for low selectivity fraction and a very low limit (e.g. 1-3 in your example) and a large table, it doesn't seem like a good risk to use a table scan. I don't know how that should be modeled or implemented though. Regards, Jeff Davis
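[Editor's note: the risk argument can be made concrete with a little arithmetic. With T rows, an estimated N matches assumed uniformly spread, and LIMIT m, a seqscan is expected to stop after roughly T * m / N rows; if the matches are in fact clustered at the end of the table, it must read almost everything. Illustrative numbers only, not PostgreSQL's actual cost model:]

```python
def expected_rows_scanned_uniform(T, N, m):
    """Rows a seqscan is expected to read before finding m of the N
    matches, under the uniform-distribution assumption."""
    return T * m / N

T, N, m = 10_000_000, 1_000, 3
assumed = expected_rows_scanned_uniform(T, N, m)   # 30,000 rows
worst = T - N + m                                  # all matches at the end
print("assumed: %.0f rows; clustered worst case: %d rows" % (assumed, worst))
# the seqscan looks over 300x cheaper than it can actually turn out to be
```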
Re: [HACKERS] initdb and fsync
On Fri, 2012-03-16 at 11:25 +0100, Andres Freund wrote: I take that back. There was something wrong with my test -- fadvise helps, but it only takes it from ~10s to ~6.5s. Not quite as good as I hoped. That's surprising. I wouldn't expect such a big difference between fadvise + sync_file_range. Rather strange.

I discussed this with my colleague who knows linux internals, and he pointed me directly at the problem. fadvise and sync_file_range in this case are both trying to put the data in the io scheduler queue (still in the kernel, not on the device). The difference is that fadvise doesn't wait, and sync_file_range does (keep in mind, this is waiting to get in a queue to go to the device, not waiting for the device to write it or even receive it).

He indicated that 4096 is a normal number that one might use for the queue size. But on my workstation at home (ubuntu 11.10), the queue is only 128. I bumped it up to 256 and now fadvise is just as fast!

This won't be a problem on production systems, but that doesn't help us a lot. People setting up a production system don't care about 6.5 seconds of set-up time anyway. Casual users and developers do (the latter problem can be solved with the --nosync switch, but the former problem is the one we're discussing).

So, it looks like fadvise is the right thing to do, but I expect we'll get some widely varying results from actual users. Then again, maybe casual users don't care much about ~10s for initdb anyway? It's a fairly rare operation for everyone except developers.

Regards, Jeff Davis
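[Editor's note: the queue discussed above is the block layer's per-device request queue, visible on Linux at /sys/block/<dev>/queue/nr_requests (raising it requires writing that file as root). A small sketch to inspect the current values; it simply returns an empty dict on systems without that sysfs layout:]

```python
import glob

def io_queue_sizes():
    """Map block device name -> request queue depth, from Linux sysfs."""
    sizes = {}
    for path in glob.glob("/sys/block/*/queue/nr_requests"):
        dev = path.split("/")[3]
        try:
            with open(path) as f:
                sizes[dev] = int(f.read().strip())
        except (OSError, ValueError):
            pass
    return sizes

print(io_queue_sizes())   # e.g. {'sda': 128} -- values vary by system
```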
Re: [HACKERS] Incorrect assumptions with low LIMITs
On Fri, 2012-03-16 at 18:25 +0000, Simon Riggs wrote: Any time we apply a LIMIT clause to a plan with a SeqScan or unqualified IndexScan, we shouldn't assume the scan will do less than say 10% of the table. It might, but its an unsafe assumption because as the selectivity decreases so does the safety of the assumption that rows are uniformly distributed. Just trying to follow along. You mean as the selectivity _increases_, the safety of the assumption that the rows are uniformly distributed decreases, right? Regards, Jeff Davis
Re: [HACKERS] Storage Manager crash at mdwrite()
On Fri, 2012-03-16 at 18:02 -0400, Tareq Aljabban wrote: Thanks for your response Jeff.. It's true that libhdfs code resides under the c++ folder, but in all of the documentation, libhdfs is referred to as the C interface of HDFS. Now what you're saying makes sense.. that nothing guarantees this will work well.. but in the phase of initDB, the function calls are done nicely without any exceptions.. when starting the postmaster, something wrong happens after 3 calls to libhdfs.. that's what I'm confused about.. What it's the difference between the two processes (initDB, start postmaster), that might cause this crash to happen? There is a lot of difference between those two. In particular, it looks like the problem you are seeing is coming from the background writer, which is not running during initdb. I think the next step would be for you to compile in debug mode with -O0, and attach to the bgwriter process with gdb, and step through it. Then, you can see the exact path which causes the bgwriter to exit, and that will give you a better idea where the problem is. Regards, Jeff Davis
Re: [HACKERS] initdb and fsync
On Wed, 2012-03-14 at 10:26 +0100, Andres Freund wrote: On Wednesday, March 14, 2012 05:23:03 AM Jeff Davis wrote: On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote: for recursively everything in dir: posix_fadvise(fd, POSIX_FADV_DONTNEED); for recursively everything in dir: fsync(fd); Wow, that made a huge difference!

    no sync:      ~ 1.0s
    sync:         ~10.0s
    fadvise+sync: ~ 1.3s

I take that back. There was something wrong with my test -- fadvise helps, but it only takes it from ~10s to ~6.5s. Not quite as good as I hoped.

Well, while the positive effect of this are rather large it also has the bad effect of pushing the whole new database out of the cache. Which is not so nice if you want to run tests on it seconds later.

I was unable to see a regression when it comes to starting it up after the fadvise+fsync. My test just started the server, created a table, then stopped the server. It was actually a hair faster with the directory that had been fadvise'd and then fsync'd, but I assume that was noise. Regardless, this doesn't look like an issue.

How are the results with sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE)?

That is much faster than using fadvise. It goes down to ~2s. Unfortunately, that's non-portable. Any other ideas? 6.5s is a little on the annoying side (and causes some disconcerting sounds to come from my disk), especially when we _know_ it can be done in 2s.

Anyway, updated patch attached.

Regards, Jeff Davis

initdb-fsync-20120314.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] Storage Manager crash at mdwrite()
On Thu, 2012-03-15 at 13:49 -0400, Tareq Aljabban wrote: I'm implementing an extension to mdwrite() at backend/storage/smgr/md.c. When a block is written to the local storage using mdwrite(), I'm sending this block to an HDFS storage. So far I don't need to read back the values I'm writing to HDFS. This approach is working fine in the initDB phase. However, when I'm running postgres (bin/pg_ctl start), the first few write operations run successfully, and then suddenly (after writing exactly 3 files to HDFS), I get a 130 exit code with the following message showing the JVM thread dump of HDFS: LOG: background writer process (PID 29347) exited with exit code 130 LOG: terminating any other active server processes ... This seems like an HDFS issue, but the same code worked properly in initDB(). I replaced this HDFS write code with a code that writes always the same block (empty one) to HDFS regardless from the value received by mdwrite().. Kept getting the same issue after writing 3 files. I copied this exact code to a separate C application and ran it there and it executed without any problems (I wrote/deleted 100 files). That's why I'm doubting that it's something related to postgreSQL. Any ideas on what's going wrong?

What code, exactly, did you change in md.c, and anywhere else in postgres? Are you linking in new libraries/code from somewhere into the postgres backend?

Regards, Jeff Davis
Re: [HACKERS] Storage Manager crash at mdwrite()
On Thu, 2012-03-15 at 19:36 -0400, Tareq Aljabban wrote: When configuring postgreSQL, I'm adding the libraries needed to run HDFS C API (libhdfs). From the information below, it looks like C++. ./configure --prefix=/diskless/myUser/Workspace/EclipseWS1/pgsql --enable-depend --enable-cassert --enable-debug CFLAGS=$CFLAGS -I/diskless/myUser/Workspace/HDFS_Append/hdfs/src/c++/libhdfs -I/usr/lib/jvm/default-java/include LDFLAGS=$LDFLAGS -L/diskless/myUser/Workspace/HDFS_Append/hdfs/src/c++/libhdfs -L/diskless/myUser/Workspace/HDFS_Append/build/c++/Linux-i386-32/lib -L/usr/lib/jvm/default-java/jre/lib/i386/server -ljvm -lhdfs I have done lots of changes so far on how the storage manager works. In fact, I changed smgr.c so instead of calling regular md.c functions, that it would call my functions . For simplicity, you can say that whenever mdwrite() was supposed to be called, another function is also called beside it. so now what is being called is: mdwrite(param1, param2, buffer, ); hdfswrite(param1, param2, buffer, ); where hdfswrite() is the function where I write the buffer to HDFS. I changed hdfswrite() so it will always write only the same (dummy) buffer down to HDFS storage. Let's say dummyBufferFile. After writing this file 3 times to HDFS, I'm getting the message that I've shown in my first email. The same hdfswrite() code works without any issues when I run it in a separate application. Hope it's clear now. Well, it's clear that there's a lot going on ;) In other words, is there a reason you think that these would all work together nicely? I don't know the specifics, but I understand there are numerous perils to linking complex C++ code into complex C code, particularly around exception handling. Look in the archives for previous discussions around C++. Regards, Jeff Davis