Re: [HACKERS] WIP checksums patch

2012-10-01 Thread Jeff Davis
On Mon, 2012-10-01 at 12:35 -0400, Bruce Momjian wrote:
 The heap/index files are copied unmodified from the old cluster, so
 there are no checksums on the pages.

That's fine though, the patch still reads the old format the same way,
and the pages are treated as though they have no checksum. How is that a
reason for defaulting the GUC to off?

I'm missing something here. Are we worried about users who turn the GUC
on and then expect all of their old data pages to magically be
protected?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP checksums patch

2012-10-01 Thread Jeff Davis
On Fri, 2012-09-14 at 17:58 -0700, Jeff Davis wrote:
 This is just a rebased version of the patch by Simon here:
 
 http://archives.postgresql.org/message-id/CA
 +u5nmkw_gbs6qq_y8-rjgl1v7mvw2hwbhartb8lojhnpfx...@mail.gmail.com

Another thing I noticed about the design of this patch:

It looks like the checksum will usually be wrong in a backup block in
WAL, because it writes the backup block before calculating the checksum
(which isn't done until the time the block is written out).

I think that's OK, because it's still protected by the WAL CRC, and
there's no expectation that the checksum is correct in shared buffers,
and the correct checksum should be set on the next checkpoint. Just an
observation.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP checksums patch

2012-10-01 Thread Jeff Davis
On Mon, 2012-10-01 at 18:14 +0100, Simon Riggs wrote:
 You are missing large parts of the previous thread, giving various
 opinions on what the UI should look like for enabling checksums.

I read through all of the discussion that I could find. There was quite
a lot, so perhaps I have forgotten pieces of it.

But that one section in the docs does look out of date and/or confusing
to me.

I remember there was discussion about a way to ensure that checksums are
set cluster-wide with some kind of special command (perhaps related to
VACUUM) and a magic file to let recovery know whether checksums are set
everywhere or not. That doesn't necessarily conflict with the GUC though
(the GUC could be a way to write checksums lazily, while this new
command could be a way to write checksums eagerly).

If some consensus was reached on the exact user interface, can you
please send me a link?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Statistics and selectivity estimation for ranges

2012-09-30 Thread Jeff Davis
On Tue, 2012-09-04 at 17:27 +0400, Alexander Korotkov wrote:
 Addon patch is attached. Actually, I don't get your intention of
 introducing STATISTIC_KIND_RANGE_EMPTY_FRAC stakind. Did you plan to
 leave it as empty frac in distinct stakind or replace this stakind
 with STATISTIC_KIND_LENGTH_HISTOGRAM? In the attached
 patch STATISTIC_KIND_RANGE_EMPTY_FRAC is replaced
 with STATISTIC_KIND_LENGTH_HISTOGRAM.

Review comments:

1. In compute_range_stats, you need to guard against the case where
there is no subdiff function. Perhaps default to 1.0 or something?

2. I think it would be helpful to add comments regarding what happens
when lengths are identical; right now it's a little confusing. For
instance, the comment "Generate a length histogram slot entry if there
are at least two length values" doesn't seem right, because the
condition still matches even if there is only one distinct value.

3. It looks like get_distance also needs to guard against a missing
subdiff.

4. There are 3 binary search functions, which seems a little excessive:
  * rbound_bsearch: greatest i such that hist[i] < v; or -1
  * rbound_bsearch_equal: greatest i such that:
  hist[i] <= v and (i = 0 or hist[i-1] != hist[i]); or -1
  * length_hist_bsearch: least i such that hist[i] >= v;
  or length of hist
(let me know if I misunderstood the definitions)
At a minimum, we need more consistent and informative names. Also, the
definition of rbound_bsearch_equal is a little confusing because it's
looking for the highest index among distinct values, but the lowest
index among identical values. Do you see a way to refactor these to be a
little easier to understand?
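
To illustrate the kind of shared primitive I have in mind (just a
sketch; the names and the plain double comparison are placeholders, not
code from the patch):

/*
 * Classic lower-bound search: returns the least index i in [0, nentries]
 * such that hist[i] >= v, or nentries if every entry is < v.  The other
 * variants can then be phrased in terms of it:
 *   greatest i such that hist[i] <  v   is   hist_lower_bound(...) - 1
 *   least    i such that hist[i] >= v   is   hist_lower_bound(...)
 */
static int
hist_lower_bound(const double *hist, int nentries, double v)
{
    int     lo = 0;
    int     hi = nentries;      /* search space is [lo, hi) */

    while (lo < hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (hist[mid] >= v)
            hi = mid;           /* candidate; keep looking to the left */
        else
            lo = mid + 1;       /* everything at or below mid is < v */
    }
    return lo;
}

The rbound_bsearch_equal behavior (lowest index among a run of equal
values) could then be a short loop stepping left over duplicates after
the search, which seems easier to follow than a third bespoke search.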

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gistchoose vs. bloat

2012-09-30 Thread Jeff Davis
On Tue, 2012-09-04 at 19:21 +0400, Alexander Korotkov wrote:

 New version of patch is attached. Parameter randomization was
 introduced. It controls whether to randomize choose. Choose algorithm
 was rewritten.
 
Review comments:

1. Comment above while loop in gistRelocateBuildBuffersOnSplit needs to
be updated.

2. Typo in two places: "if randomization id required".

3. In gistRelocateBuildBuffersOnSplit, shouldn't that be:
 splitPageInfo = relocationBuffersInfos[bufferIndex];
   not:
 splitPageInfo = relocationBuffersInfos[i];

4. It looks like the randomization is happening while trying to compare
the penalties. I think it may be more readable to separate those two
steps; e.g.

  /* create a mapping whether randomization is on or not */
  for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
  offsets[i - FirstOffsetNumber] = i;

  if (randomization)
  /* randomize offsets array */

  for (i = 0; i < maxoff; i++)
  {
 offset = offsets[i];
 ...
  }

That's just an idea; if you think it's more readable as-is (or if I am
misunderstanding) then let me know.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP checksums patch

2012-09-30 Thread Jeff Davis
On Fri, 2012-09-14 at 17:58 -0700, Jeff Davis wrote:
 This is just a rebased version of the patch by Simon here:

I just noticed the following note in the docs for this patch:

  The default is <literal>off</> for backwards compatibility and
  to allow upgrade. The recommended setting is <literal>on</> though
  this should not be enabled until upgrade is successfully complete
  with full set of new backups.

I don't understand what that means -- if they have the page_checksums
GUC available, then surely upgrade is complete, right? And what is the
backwards-compatibility issue?

Also, it looks out of date, because the default in guc.c is set to true.
I think we should probably default to true, because it's safer and it
can always be disabled at runtime, but I don't have a strong opinion
about that.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP checksums patch

2012-09-30 Thread Jeff Davis
On Fri, 2012-09-14 at 17:58 -0700, Jeff Davis wrote:
 * we might want to make it slightly easier for external utilities, like
 for backup/replication, to verify the pages

Ideally, PageVerificationInfoOK should be available to external
utilities, so that someone might script a background job to verify
pages. Or, perhaps we should just include a new utility,
pg_verify_checksums?

Either way, where is a good place to put the function so that it's
shared between the server and the utility? In src/port somewhere?
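
For what it's worth, the shape of the loop such a utility needs is
simple (sketch only: page_checksum_ok is a stand-in for whatever shared
routine we end up exporting, and a real tool would also have to handle
segment files and pages being rewritten while the server is running):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192             /* must match the server's block size */

/* placeholder for the shared verification routine */
extern bool page_checksum_ok(const char *page, uint32_t blkno);

static int
verify_relation_file(const char *path)
{
    FILE       *fp = fopen(path, "rb");
    char        page[BLCKSZ];
    uint32_t    blkno = 0;
    int         bad = 0;

    if (fp == NULL)
        return -1;

    while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
    {
        if (!page_checksum_ok(page, blkno))
        {
            fprintf(stderr, "%s: block %u: checksum mismatch\n", path, blkno);
            bad++;
        }
        blkno++;
    }
    fclose(fp);
    return bad;
}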

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about SSI, subxacts, and aborted read-only xacts

2012-09-17 Thread Jeff Davis
On Sun, 2012-09-16 at 16:16 -0500, Kevin Grittner wrote:
 I'm attaching an alternative proposal, with changes for the following
 reasons:

Looks good to me, aside from not wrapping the text.

Regards,
Jeff Davis
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] WIP checksums patch

2012-09-14 Thread Jeff Davis
This is just a rebased version of the patch by Simon here:

http://archives.postgresql.org/message-id/CA
+u5nmkw_gbs6qq_y8-rjgl1v7mvw2hwbhartb8lojhnpfx...@mail.gmail.com

There are other things I'd like to do, like:

* include page# in checksum, and perhaps relfilenode or object OID (but
those two might cause difficulties); see the sketch below
* use TLI field instead of page version, if that is the consensus
(though it seems some may still disagree with that)
* we might want to make it slightly easier for external utilities, like
for backup/replication, to verify the pages
* optionally protect temporary files and local buffers
* come up with a robust way to determine that all pages in the database
are protected
* protect uninitialized pages

But not all of these will make the first patch. This is just a starting
point to reignite the discussion.
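
Regarding the first item above, a small illustration of what mixing the
block number into the checksum buys (this is just FNV-1a for the sake of
example, not the algorithm the patch uses):

#include <stddef.h>
#include <stdint.h>

static uint16_t
page_checksum_with_blkno(const unsigned char *page, size_t len, uint32_t blkno)
{
    uint32_t    h = 0x811C9DC5;     /* FNV-1a offset basis */

    for (size_t i = 0; i < len; i++)
    {
        h ^= page[i];
        h *= 0x01000193;            /* FNV-1a prime */
    }

    /* fold the block number in, so the same bytes at a different
     * location produce a different checksum */
    h ^= blkno;
    h *= 0x01000193;

    return (uint16_t) (h ^ (h >> 16));  /* reduce to 16 bits */
}

A page that is bit-for-bit intact but written to the wrong block then
fails verification, which a checksum over the data alone cannot catch.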

I welcome any feedback.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gistchoose vs. bloat

2012-09-11 Thread Jeff Davis
On Tue, 2012-09-04 at 19:21 +0400, Alexander Korotkov wrote:

 New version of patch is attached. Parameter randomization was
 introduced. It controls whether to randomize choose. Choose algorithm
 was rewritten.
 
Do you expect it to be bad in any reasonable situations? I'm inclined to
just make it always randomize if it's better. I think it would be hard
for a user to guess when it's better and when not.

Maybe it's useful to turn randomization off for testing purposes, e.g.
to ensure determinism?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about SSI, subxacts, and aborted read-only xacts

2012-09-11 Thread Jeff Davis
On Mon, 2012-09-10 at 11:15 -0500, Kevin Grittner wrote:
 That's a fair point.  Do you have any suggested wording, or
 suggestions for exactly where in the documentation you think it
 would be most helpful?  The subsection on serializable transactions
 seems like the most obvious location:

Attached. I thought about putting it as a note, but it seems like it's
easy to go overboard with those.

Regards,
Jeff Davis


ssi_doc.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about SSI, subxacts, and aborted read-only xacts

2012-09-10 Thread Jeff Davis
On Mon, 2012-09-10 at 11:15 -0500, Kevin Grittner wrote:
 ... and I know Jeff read that quite closely because he raised a
 question off-list about an error he found in it which managed to
 survive the many editing and review passes that paper went through. 
 :-)

Well, I need to keep up with the discussion on the interaction of
temporal and SSI :-)

 I think the behavior is correct because a function's control flow
 might be directed by what it reads in a subtransaction, even if it
 rolls back -- and the transaction as a whole might leave the
 database in a different state based on that than if it had read
 different data (from a later snapshot).  For example, if a plpgsql
 function has a BEGIN/EXCEPTION/END block, it might read something
 from the database and use what it reads to attempt some write.  If
 that write fails and the EXCEPTION code writes something, then the
 database could be put into a state which is dependent on the data
 read in the subtransaction, even though that subtransaction is
 rolled back without the client ever directly seeing what was read.

On reflection, I agree with that. Trying to puzzle through your
transactions (and application logic) to see if you are depending on any
information read in an aborted subtransaction is exactly the kind of
thing SSI was meant to avoid.

 This strikes me as significantly different from returning some rows
 to a client application and then throwing an error for the
 transaction as a whole, because the client will certainly have an
 opportunity to see the failure (or at worst, see a broken connection
 before being notified of a successful commit).

Oh, I see the distinction you're making: in PL/pgSQL, the exception
mechanism involves *implicit* subtransaction rollbacks. That's more of a
language issue, but a valid point.

I'm still not sure I see a theoretical difference, but it does seem wise
to keep predicate locks for aborted subtransactions.

  If so, I think we need a documentation update. The serializable
  isolation level docs don't quite make it clear that
  serializability only applies to transactions that commit. It might
  not be obvious to a user that there's a difference between commit
  and abort for a RO transaction. I think that, in S2PL,
  serializability applies even to aborted transactions (though I
  haven't spent much time thinking about it), so users accustomed to
  other truly-serializable implementations might be surprised.
  
 That's a fair point.  Do you have any suggested wording...

I'll write something up. Can I document that you may depend on the
results read in aborted subtransactions, or should I leave that
undefined for now?

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about SSI, subxacts, and aborted read-only xacts

2012-09-10 Thread Jeff Davis
On Mon, 2012-09-10 at 21:59 -0400, Dan Ports wrote:
 It might be worth noting that serializable mode will not cause
 read-only transactions to fail to commit

For the archives, and for those not following the paper in detail, there
is one situation in which SSI will abort a read-only transaction.

It happens when three transactions form a dangerous pattern: T1
(read-only) has a conflict out to T2, T2 has a conflict out to T3, T3 is
committed, and T2 is prepared (for two-phase commit). In that situation,
SSI can't roll back the committed or prepared transactions, so it must
roll back the read-only transaction (T1).

Even in that case, SSI will ordinarily prevent T2 from preparing. It's
only if T1 takes its snapshot after T2 prepares and before T2 commits
that the situation can happen (I think).

Fortunately, for two-phase commit, that's not a big problem because the
window between PREPARE TRANSACTION and COMMIT PREPARED is supposed to be
narrow (and if it's not, you have bigger problems anyway). As long as
the window is narrow, then it's reasonable to retry the transaction T1,
and expect it to succeed after a short interval.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Question about SSI, subxacts, and aborted read-only xacts

2012-09-08 Thread Jeff Davis
This question comes about after reading the VLDB paper "Serializable
Snapshot Isolation in PostgreSQL".

We release predicate locks after a transaction abort, but not after a
subtransaction abort. The paper says that the reason is:

"We do not drop SIREAD locks acquired during a subtransaction if the
subtransaction is aborted (i.e. all SIREAD locks belong to the top-level
transaction). This is because data read during the subtransaction may
have been reported to the user or otherwise externalized." (section
7.3).

But that doesn't make sense to me, because that reasoning would also
apply to top-level transactions that are aborted, but we release the
SIREAD locks for those.

In other words, this introduces an inconsistency between:

  BEGIN ISOLATION LEVEL SERIALIZABLE;
  SAVEPOINT s1;
  ...
  ROLLBACK TO s1;
  COMMIT;

and:

  BEGIN ISOLATION LEVEL SERIALIZABLE;
  ...
  ROLLBACK;

I'm not suggesting this is a correctness problem: holding SIREAD locks
for longer never causes incorrect results. But it does seem a little
inconsistent.

For top-level transactions, I don't think it's possible to preserve
SIREAD locks after an abort, because we rely on aborts to alleviate
conflicts (and when using 2PC, we may need to abort a read-only
transaction to correct the situation). So it seems like users must not
rely on any answers they get from a transaction (or subtransaction)
unless it commits.

Does that make sense?

If so, I think we need a documentation update. The serializable
isolation level docs don't quite make it clear that serializability only
applies to transactions that commit. It might not be obvious to a user
that there's a difference between commit and abort for a RO transaction.
I think that, in S2PL, serializability applies even to aborted
transactions (though I haven't spent much time thinking about it), so
users accustomed to other truly-serializable implementations might be
surprised.

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2012-09-06 Thread Jeff Davis
On Wed, 2012-09-05 at 17:03 -0400, Tom Lane wrote:
 In general I think the selling point for such a feature would be no
 administrative hassles, and I believe that has to go not only for the
 end-user experience but also for the application-developer experience.
 If you have to manage checkpointing and vacuuming in the application,
 you're probably soon going to look for another database.

Maybe there could be some hooks (e.g., right after completing a
statement) that see whether a vacuum or checkpoint is required? VACUUM
can't be run in a transaction block[1], so there are some details to
work out, but it might be a workable approach.
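
Purely as a sketch of the shape I mean (every name and threshold below
is made up; nothing like this exists today):

/*
 * Hypothetical: called by the embedded/standalone library after each
 * completed statement, outside any transaction block, so VACUUM is
 * legal at this point.
 */
typedef struct MaintenanceState
{
    long        rows_changed_since_vacuum;
    long        wal_bytes_since_checkpoint;
} MaintenanceState;

static void
maybe_run_maintenance(MaintenanceState *st,
                      void (*run_sql) (const char *),
                      void (*checkpoint) (void))
{
    if (st->wal_bytes_since_checkpoint > 64L * 1024 * 1024)
    {
        checkpoint();
        st->wal_bytes_since_checkpoint = 0;
    }

    if (st->rows_changed_since_vacuum > 10000)
    {
        run_sql("VACUUM;");     /* between transactions, so this is allowed */
        st->rows_changed_since_vacuum = 0;
    }
}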

Regards,
Jeff Davis

[1]: It seems like the only reason for that is so a multi-table vacuum
doesn't hold locks for longer than it needs to, but that's not much of a
concern in this case.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Minor comment fixups

2012-08-27 Thread Jeff Davis
I noticed a couple comments that look wrong to me. Patch attached.

Regards,
Jeff Davis
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***
*** 8784,8792  copy_relation_data(SMgrRelation src, SMgrRelation dst,
  	pfree(buf);
  
  	/*
! 	 * If the rel isn't temp, we must fsync it down to disk before it's safe
! 	 * to commit the transaction.  (For a temp rel we don't care since the rel
! 	 * will be uninteresting after a crash anyway.)
  	 *
  	 * It's obvious that we must do this when not WAL-logging the copy. It's
  	 * less obvious that we have to do it even if we did WAL-log the copied
--- 8784,8791 
  	pfree(buf);
  
  	/*
! 	 * If the rel is WAL-logged, must fsync before commit.	We use heap_sync
! 	 * to ensure that the toast table gets fsync'd too.
  	 *
  	 * It's obvious that we must do this when not WAL-logging the copy. It's
  	 * less obvious that we have to do it even if we did WAL-log the copied
*** a/src/backend/storage/file/reinit.c
--- b/src/backend/storage/file/reinit.c
***
*** 337,343  ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
  			copy_file(srcpath, dstpath);
  		}
  
- 		/* Done with the first pass. */
  		FreeDir(dbspace_dir);
  	}
  }
--- 337,342 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] NOT NULL constraints in foreign tables

2012-08-22 Thread Jeff Davis
On Tue, 2012-08-21 at 10:41 -0400, Robert Haas wrote: 
 The thing to keep in mind here is that EVERY property of a foreign
 table is subject to change at any arbitrary point in time, without our
 knowledge.  ... Why should CHECK constraints be any different than,
 say, column types?

So, let's say someone changes column types from int to bigint on the
remote side, and you still have int on the local side. It continues to
work and everything is fine until all of a sudden you get 2^33 back, and
that generates an error.

That sounds closer to the semantics of constraint enforcement mechanism
#2 than #3 to me. That is, everything is fine until you get something
that you know is wrong, and you throw an error.

 Why should that be any worse with foreign tables than anything else?
 I mean, lots of people, as things stand today, manage to set up
 partitioned tables using CHECK constraints.  There are undoubtedly
 people who don't understand the planner benefit of having an
 appropriate CHECK constraint on each partition, but it's not exactly a
 common cause of confusion.

But there are no consequences there other than performance. With
unenforced constraints, they may get correct results during development
and testing, and wrong results occasionally when in production. That's
hard to explain to a user.

 It seems to me that the term runtime enforcement is a misnomer; you
 can't really enforce anything about a foreign table.

Maybe I chose the wrong terms, but there are at least 3 semantically
different concepts. Feel free to suggest a better term.

 If we
 were to propose changing the semantics from the former to the latter,
 we'd be laughed out of town, and rightly so.

I'm not proposing changing the semantics, I'm saying that there are more
than just 2 semantic options available, and they offer different kinds
of guarantees. Users may be interested in all 3 for different use cases.

   I mean, let's
 suppose that we were to allow unique constraints on foreign tables.

I'm sure there are cases where people will not want what I am
suggesting, but I think there are cases where it is plausibly useful.

 Now, if the query is something like SELECT
 * FROM foreign_table WHERE id = 1, you could fairly cheaply validate
 that there is only one row with id = 1, but that's not the same thing
 as validating that the assumption (namely, that foreign_table (id) is
 unique) is still true.

And if you don't issue a query at all, the constraint might not still be
true; but I don't think that implies that checking it when you do run a
query is useless.

 I think if we go down this road of trying to validate
 remote-side CHECK constraints, we're going to end up with a mishmash
 of cases where constraints are checked and other cases where
 constraints are not checked, and then that really is going to be
 confusing.

If we use keywords to differentiate constraints that are different
semantically, then we can just say that some types of constraints are
allowed on foreign tables and some are not.

I guess what I'd like to avoid is saying that a check constraint on a
regular table means one thing, and the same check constraint on a
foreign table means something else. If we differentiate them by
requiring special keywords like NOT ENFORCED, then it would be more
user-visible what's going on, and it would allow room for new semantics
later if we want. Normal constraints would be disallowed on foreign
tables, but NOT ENFORCED ones would be allowed.

That brings up another point: what if someone really, really, doesn't
want to pay the overhead of enforcing their constraint on a local table,
but wants the planner benefit? Would they have to make it a remote table
to bypass the constraint check?

Regards,
Jeff Davis






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Audit Logs WAS: temporal support patch

2012-08-22 Thread Jeff Davis
On Tue, 2012-08-21 at 17:56 -0500, Kevin Grittner wrote:
 I don't think the concerns I raised about apparent order of
 execution for serializable transactions apply to audit logs.  If
 we've moved entirely off the topic of the original subject, it is a
 complete non-issue.

Now I'm confused. The serializability issues you were talking about only
seem to matter with respect to system time (a.k.a. transaction time),
right? If the user is supplying the time, then it's a non-issue.

And audit logs are based on system time, so I thought that audit logs
were the case you were talking about.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporal support patch

2012-08-22 Thread Jeff Davis
On Tue, 2012-08-21 at 17:07 -0500, Kevin Grittner wrote:
 The fact that it has an unknown sequence number or timestamp for
 purposes of ordering visibility of transactions doesn't mean you
 can't show that it completed in an audit log.  In other words, I
 think the needs for a temporal database are significantly different
 from the needs of an auditing system.

...
 
 I would assume an audit log would have very different needs from
 tracking changes for a temporal database view.  It even seems
 possible that you might want to see what people *looked* at, versus
 just changes.  You might want to see transactions which were rolled
 back, which are irrelevant for a temporal view.  If we're talking
 about an auditing system, we're talking about an almost completely
 different animal from a temporal view of the database.

OK, I think I see what you're saying now. Basically, an audit log means
different things to different people, so I think it confused the issue.
But temporal is fairly vague, too. It also seems like there might be a
lot of overlap, depending on how we define those terms.

I am most interested in the topic you brought up about serializability
and system time (transaction time), because it would be a fundamental
piece upon which we can build a lot of these other things (including
what could be called an audit log).

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] NOT NULL constraints in foreign tables

2012-08-20 Thread Jeff Davis
On Fri, 2012-08-17 at 15:44 -0400, Robert Haas wrote:
 On Fri, Aug 17, 2012 at 2:58 PM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
  I mean, what are NOT NULL in foreign tables for?  Are they harmed or
  helped by having pg_constraint rows?
 
 As I've mentioned when this has come up before, I think that
 constraints on foreign tables should be viewed as declarative
 statements about the contents of the foreign data that the DB will
 assume true.  This could be useful for a variety of purposes:
 constraint exclusion, query optimization, etc.

There are at least three kinds of constraint enforcement:

1. Enforced before the query runs (e.g. the current behavior on a normal
table).

2. Enforced when the query runs by validating the constraint as you go,
and then throwing an error when it turns out to be false.

3. Don't make any attempt to enforce, and silently produce wrong results
if it's false.

Which are you proposing, and how will the user know which kind of
constraint they've got?

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporal support patch

2012-08-20 Thread Jeff Davis
On Mon, 2012-08-20 at 17:04 -0400, Robert Haas wrote:
 On Sun, Aug 19, 2012 at 6:28 PM, Jeff Davis pg...@j-davis.com wrote:
  The other issue is how to handle multiple changes of the same record
  within the transaction. Should they be stored or not?
 
  In a typical audit log, I don't see any reason to. The internals of a
  transaction should be implementation details; invisible to the outside,
  right?
 
 I'm not convinced.

As I understand it, we are talking about recording data changes in one
table to another table. Auditing of reads or the logging of raw
statements seem like very different kinds of projects to me; but tell me
if you think differently.

So if we are recording data changes, I don't see much point in recording
uncommitted changes. Perhaps my imagination is failing, and someone else
can fill me in on a use case.

I'm also struggling with the semantics: if we record uncommitted
changes, do we record them even if the transaction aborts? If so, what
guarantees do we offer about the change actually being recorded?

  I'm not sure that the database user is the proper thing to be stored in
  the history table. Many applications usually connect to a database using
  some virtual user and have their own users/roles tables to handle with
  privileges. There should be some way to substitute the stored user in
  the history table with the application's one. It's also helpful to store
  transaction id that inserted/updated/deleted the record.
 
  If the system is recording it for audit purposes, then it better be sure
  that it's true. You can't allow the application to pick and choose what
  gets stored there.
 
 That position would render this feature useless for every application
 for which I would otherwise have used it.

We could offer a GUC like audit_context or audit_app_context that
takes a text string, and the audit log would record the value stored in
that GUC along with the data changes in question.

The main thing I object to is an implication that the system is vouching
for some particular fact that is supplied by a userset GUC. Remember,
there are guaranteed to be application-level problems that allow these
GUCs to get set improperly for all kinds of reasons. We don't want bug
reports along the lines of "security breach! PG allows application_name
to be spoofed in the audit log!".

Also, I'd prefer not use existing GUCs, because there may be all kinds
of other reasons that people set existing GUCs, and we want them to be
able to handle the audit_context one more carefully and have a clear
warning in the documentation.

  I think it's just nonsense
 to talk about what we can or can't let the user do.  The user is in
 charge, and our job is to allow him to do what he wants to do more
 easily, not to dictate what he must do.

Remember that the users who depend on the veracity of the audit log are
users, too. Let's try to serve both classes of user if we can.

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporal support patch

2012-08-20 Thread Jeff Davis
On Mon, 2012-08-20 at 19:17 -0400, David Johnston wrote:
 Ideally the decision of whether to do so could be a client decision.  Not
 storing intra-transaction changes is easier than storing all changes.  At
 worse you could stage up all changed then simply fail to store all
 intermediate results within a given relation.  It that case you gain nothing
 in execution performance but safe both storage and interpretative resources.
 So the question becomes is it worth doing without the ability to store
 intermediate results?  If you were to ponder both which setup would the
 default be?  If the default is the harder one (all statements) to implement
 then to avoid upgrade issues the syntax should specify that it is logging
 transactions only.

I think the biggest question here is what guarantees can be offered?
What if the transaction aborts after having written some data, does the
audit log still get updated?

 I see the user element as having two components:

I think this is essentially a good idea, although as I said in my other
email, we should be careful how we label the application-supplied
information in the audit log.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporal support patch

2012-08-20 Thread Jeff Davis
On Mon, 2012-08-20 at 16:32 -0700, Josh Berkus wrote:
 This is sounding like a completely runaway spec on what should be a
 simple feature.

My feeling as well. However, we will eventually want to coalesce around
some best practices and make it easy and robust for typical cases.

 Personally, I would prefer a tool which just made it simpler to build my
 own triggers, and made it automatic for the history table to track
 changes in the live table.  I think anything we build which controls
 what goes into the history table, etc., will only narrow the user base.

That sounds like a good way to start. Actually, even before the tool,
how about just some really good examples of triggers for specific kinds
of audit logs, and some ways to run queries on them? I think that might
settle a lot of these details.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporal support patch

2012-08-20 Thread Jeff Davis
On Mon, 2012-08-20 at 19:32 -0500, Kevin Grittner wrote:
 Josh Berkus j...@agliodbs.com wrote:
  
  This is sounding like a completely runaway spec on what should be
  a simple feature.
  
 I hate to contribute to scope creep (or in this case scope screaming
 down the tracks at full steam), but I've been watching this with a
 queasy feeling about interaction with Serializable Snapshot
 Isolation (SSI).

There are all kinds of challenges here, and I'm glad you're thinking
about them. I alluded to some problems here:

http://archives.postgresql.org/message-id/1345415312.20987.56.camel@jdavis

But those might be a subset of the problems you're talking about.

It sounds like, at a high level, there are two problems:

1. capturing the apparent order of execution in the audit log
2. assigning meaningful times to the changes that are consistent with
the apparent order of execution

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] NOT NULL constraints in foreign tables

2012-08-20 Thread Jeff Davis
On Mon, 2012-08-20 at 16:50 -0400, Robert Haas wrote:
 #3 for foreign tables.

I'm skeptical of that approach for two reasons:

(1) It will be hard to inform users which constraints are enforced and
which aren't.
(2) It will be hard for users to understand the planner benefits or the
consequences when the constraint is not enforced.

That being said, I can imagine good use cases (like when the foreign
table is in postgres, and already has that constraint declared), so I'm
not outright opposed to it.

 #1 is not a reasonable alternative for foreign
 tables because we lack enforcement power in that case,

Right.

  and #2 is also
 not reasonable, because the only point of allowing declarative
 constraints is to get better performance, and if we go with #2 then
 we've pretty much thrown that out the window.

Declared constraints can improve the plans, while runtime-enforced
constraints slow down execution of a given plan. I'm not really sure
whether runtime enforcement is a good trade-off, but it doesn't seem
like an obviously bad one.

Also, what did you mean by "the only point of allowing declarative
constraints is to get better performance"? Maybe the user wants to get
an error if some important assumption about the remote data source is
not as true as when they declared the constraint.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree

2012-08-19 Thread Jeff Davis
On Sat, 2012-08-18 at 18:10 +0400, Alexander Korotkov wrote:
 On Thu, Aug 16, 2012 at 3:46 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 I committed the patch now, but left out the support for
 adjacent for now. Not because there was necessarily anything
 wrong with that, but because I have limited time for
 reviewing, and the rest of the patch looks ready for commit
 now. I reworded the comments quite a lot, you might want to
 proofread those to double-check that they're still correct.
 I'll take a look at the adjacent-support next, as a separate
 patch.
 
 
 Thanks! There is a separate patch for adjacent. I've reworked adjacent
 check in order to make it more clear.

I am taking a look at this patch now. A few quick comments:

* It looks like bounds_adjacent modifies its by-reference arguments,
which is a little worrying to me. The lower/upper labels are flipped
back, but the inclusivities are not. Maybe just pass by value instead?

* Bounds_adjacent is sensitive to the argument order. Can't it just take
bound1 and bound2?

* I tried some larger tests and they seemed to work. I haven't reviewed
the spgist code changes in detail though.

Regards,
Jeff Davis







-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree

2012-08-19 Thread Jeff Davis
On Sat, 2012-07-28 at 17:50 -0400, Tom Lane wrote:
 which would come
 back to bite us if we ever try to support index-only scans with SPGiST.

I'm confused:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=92203624934095163f8b57b5b3d7bbd2645da2c8

And the patch that was just committed for Range Types SP-GiST is already
using index-only scans.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporal support patch

2012-08-19 Thread Jeff Davis
On Mon, 2012-06-25 at 17:46 +0900, Vlad Arkhipov wrote:
 It's not sufficient to store only a period of validity for a row. If two 
 transactions started in the same time change the same record, you have a 
 problem with TSTZRANGE type because it's normalized to empty interval. 

That's an interesting point.

Let's say you tried setting it to [T1, T2) where T1 is the time of the
last transaction to update it and T2 is the time of the current
transaction. If T2 = T1, then TSTZRANGE will throw an error, not store
the empty interval.

And we don't want to store the empty interval, because it would be a
lie. There could have been some transaction T3 that happened during T2
that saw the value from T1, so saying that there were no times where
that was visible to the system is false. Throwing an error allows you to
retry T2, which should allow a microsecond or so to pass, and the
problem should resolve itself (assuming your clock didn't move
backwards, which is a different issue).

We could also argue about the start versus end times of transactions,
and snapshot acquisition times, because that could cause confusion if
there are long-running transactions. It might be a good reason to store
the modifying transaction ID as well, but then you get into transaction
wraparound problems.

 The other issue is how to handle multiple changes of the same record 
 within the transaction. Should they be stored or not?

In a typical audit log, I don't see any reason to. The internals of a
transaction should be implementation details; invisible to the outside,
right?

 Also it's necessary to store some kind of operation type that was 
 applied to the record (insert/update/delete). For example, there is a 
 table with one record with validity period [0, ) and value 'A'.
 
 First way
 1. Delete this record in time 1, now there is [0, 1), A in the history 
 table.
 2. Insert a new record in time 1, now there is [0, 1), A in the history 
 table and [1, ), B record in the current data table.
 
 Second way
 1. Update this record in time 1, now there is [0, 1), A in the history 
 table and [1, ), B record in the current data table.
 
 So you have the same data in the tables but the actions that led to this 
 configuration were different and the history has been lost partly.

Right. Those are yet more possible options that people might want for an
audit log.

  * There is other useful information that could be recorded, such as the
  user who inserted/updated/deleted the record.
 I'm not sure that the database user is the proper thing to be stored in 
 the history table. Many applications usually connect to a database using 
 some virtual user and have their own users/roles tables to handle with 
 privileges. There should be some way to substitute the stored user in 
 the history table with the application's one. It's also helpful to store 
 transaction id that inserted/updated/deleted the record.

If the system is recording it for audit purposes, then it better be sure
that it's true. You can't allow the application to pick and choose what
gets stored there.

While it may be true that many applications just all use the same DB
user, if you want an audit log that includes user information you have
to let the DB do some authentication.

 It's a great proposal but seems to be impossible to implement with 
 triggers only solution, isn't it? Is there any kind of hooks on ALTER 
 TABLE ... in PostgreSQL to update changed columns bitmaps when table 
 structure changes?

Column numbers are never reused, so I think it would be stable. But if
you do need to be notified of schema changes, the new event triggers
mechanism may be able to do that.

 In SQL2011 there is only one table with the all data, historical and 
 current. So it's not very convenient to specifiy WHERE condition on 
 system time everywhere and for all tables in the query. By default only 
 the current data is selected with a query like SELECT * FROM table.

If there is some syntax that offers a convenient shorthand for WHERE,
that's fine with me. Or using two tables, one called foo and one called
foo_history, is also fine. But I don't want the DML syntax to introduce
new mechanisms that aren't available without the fancy syntax (though
new DDL arrangements might be fine).

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gistchoose vs. bloat

2012-08-19 Thread Jeff Davis
On Mon, 2012-06-18 at 15:12 +0400, Alexander Korotkov wrote:
 Hackers,
 
 
 While experimenting with gistchoose I achieve interesting results
 about relation of gistchoose behaviour and gist index bloat.

...
 
 Current implementation of gistchoose select first index tuple which
 have minimal penalty. It is possible for several tuples to have same
 minimal penalty. I've created simple patch which selects random from
 them. I then I've following results for same testcase.
 
I took a look at this patch. The surrounding code is pretty messy (not
necessarily because of your patch). A few comments would go a long way.

The 'which_grow' array is initialized as it goes, first using pointer
notation (*which_grow = -1.0) and then using subscript notation. As
far as I can tell, the first r->rd_att->natts elements of the array (the
only elements that matter) need to be written the first time through
anyway. Why not just replace "which_grow[j] < 0" with
"i == FirstOffsetNumber" and add a comment that we're initializing the
penalties with the first index tuple?

The 'sum_grow' didn't make any sense, thank you for getting rid of that.

Also, we should document that the earlier attributes always take
precedence, which is why we break out of the inner loop as soon as we
encounter an attribute with a higher penalty.

Please add a comment indicating why you are randomly choosing among the
equal penalties.

I think that there might be a problem with the logic, as well. Let's say
you have two attributes and there are two index tuples, it1 and it2;
with penalties [10,10] and [10,100] respectively. The second time
through the outer loop, with i = 2, you might (P=0.5) assign 2 to the
'which' variable in the first iteration of the inner loop, before it
realizes that it2 actually has a higher penalty. I think you need to
finish out the inner loop and have a flag that indicates that all
attributes are equal before you do the probabilistic replacement.
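
Roughly, as a self-contained toy (not the patch's code; the penalties
array stands in for calls to the penalty function, and natts is assumed
to be at most 8):

#include <float.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Pick one candidate, treating earlier attributes as higher priority and
 * breaking exact ties uniformly at random (reservoir style).
 */
static int
choose_candidate(int ncand, int natts, float penalties[][8], bool randomize)
{
    float       best[8];
    int         which = -1;
    int         nties = 0;

    for (int j = 0; j < natts; j++)
        best[j] = FLT_MAX;      /* so the first candidate always wins */

    for (int i = 0; i < ncand; i++)
    {
        bool        better = false;
        bool        worse = false;

        for (int j = 0; j < natts && !better && !worse; j++)
        {
            if (penalties[i][j] < best[j])
                better = true;
            else if (penalties[i][j] > best[j])
                worse = true;
            /* equal: defer to the next, lower-priority attribute */
        }

        if (better)
        {
            which = i;
            nties = 1;
            for (int j = 0; j < natts; j++)
                best[j] = penalties[i][j];
        }
        else if (!worse)
        {
            /* ties best on every attribute: keep with probability 1/nties */
            nties++;
            if (randomize && random() % nties == 0)
                which = i;
        }
    }
    return which;
}

With your [10,10] / [10,100] example, the second tuple is then seen as
strictly worse on the second attribute, so it never reaches the random
replacement.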

Also, I think you should use random() rather than rand().

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-08-10 Thread Jeff Davis
On Sun, 2012-02-19 at 21:49 +, Simon Riggs wrote:
 On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs si...@2ndquadrant.com wrote:
 
  v8 attached
 
 v10 attached.
 
 This patch covers all the valid concerns discussed and has been
 extensively tested.

Is there something I can do to help get this ready for the next
commitfest? I am willing to rebase it, but I was worried that might
cause confusion. I am also willing to review it after the rebase, of
course.

There are still a couple open issues, including:

* Store the checksum in the page version field or the TLI field?

* What mechanism to guarantee to the user that all pages are properly
protected by checksums (rather than just new pages)? In other words,
there are more than the two states represented by the GUC.

* What specific data is included in the checksum? I suggested that we
include the block number, and maybe the OID.

Regards,
Jeff Davis


-
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Covering Indexes

2012-07-31 Thread Jeff Davis
On Thu, 2012-07-26 at 12:13 -0400, Bruce Momjian wrote: 
 So, do we want a TODO item about adding columns to a unique index that
 will not be used for uniqueness checks?

-1 from me, at least in its current form.

At its heart, this is about separating the constraint from the index
that enforces it -- you'd like the columns to be available for querying
(for index only scans or otherwise), but not to take part in the
constraint.

And when you look at it from that perspective, this proposal is an
extremely limited form. You can't, for example, decide that an existing
index can be used for a new unique constraint. That's a lot more
plausible than the use cases mentioned in this thread as far as I can
see, but this proposal can't do that.

I tried proposing a more general use case when developing exclusion
constraints:

http://archives.postgresql.org/message-id/1253466074.6983.22.camel@jdavis

(allow user to specify multiple constraints enforced by one existing
index). But, at least at the time, my proposal didn't pass the
usefulness test:

http://archives.postgresql.org/pgsql-hackers/2009-09/msg01355.php

even though my proposal was strictly more powerful than this one is.

Also, this proposal extends the weird differences between CREATE UNIQUE
INDEX and the declaration of a UNIQUE constraint. For example, if you
want DEFERRABLE you need to declare the constraint, but if you want to
use an expression (rather than a simple column reference) you need to
create the index. This problem does not exist with exclusion
constraints.

In my opinion, new innovations in unique constraints would be better
served as part of exclusion constraints, and we should keep unique
constraints simple. If we make an improvement to UNIQUE, then we will
want to do similar things for exclusion constraints anyway, so it just
seems duplicative.

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Covering Indexes

2012-07-28 Thread Jeff Davis
On Fri, 2012-07-27 at 15:27 -0500, Merlin Moncure wrote:
 The covering index/uniqueness use case adds
 legitimacy to the INDEX clause of exclusion constraints IMNSHO.

Yes, I think it would be worth revisiting the idea.

 One point of concern though is that (following a bit of testing)
 alter table foo add exclude using btree (id with =);
 ...is always strictly slower for inserts than
 alter table foo add primary key(id);

Yes, in my worst-case tests there is about a 2X difference for building
the constraint and about a 30-50% slowdown during INSERT (I thought I
remembered the difference being lower, but I didn't dig up my old test).
That's for an in-memory case, I would expect disk I/O to make the
difference less apparent.

 This is probably because it doesn't use the low level btree based
 uniqueness check (the index is not declared UNIQUE) -- shouldn't it do
 that if it can?

We could probably detect that the operator being used is the btree
equality operator, set the unique property of the btree, and avoid the
normal exclusion constraint check. I'm sure there are some details to
work out, but if we start collecting more use cases where people want
the flexibility of exclusion constraints and the speed of UNIQUE, we
should look into it.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-26 Thread Jeff Davis
On Thu, 2012-07-26 at 14:17 -0400, Bruce Momjian wrote:
 Yes, that would be a problem because the WAL records are deleted by
 pg_upgrade.   Does a shutdown of the standby not already replay all WAL
 logs?

There is no notion of "all WAL logs" because the WAL is infinite. Do you
mean all WAL generated by the master before shutdown, or all WAL that
the standby knows has been generated by the master so far?

Regardless, I don't think the standby attempts to do much after a
shutdown is requested.

   We could also just require them to just start the standby in
 master mode and shut it down.  The problem with that is it might run
 things like autovacuum.

If we had sync rep that waits for application of the WAL, that might be
a more robust approach. We could, during shutdown of the master cause
the standby to cancel all HS queries, and then change to sync rep and
wait for complete catchup.

There are a lot of details to work out there, but it might give us a
higher confidence that it's doing the right thing.

Given two shut-down systems, it should be pretty easy to tell if they
have played the same amount of WAL though, right?

 I was originally thinking that we would require users to run pg_upgrade
 on the standby, where you need to first switch into master mode.

That sounds a little strange to me. If the original master has generated
WAL that the original standby hasn't seen yet, then this doesn't help
because the two systems would be diverged, and you'd need a new base
backup anyway. And if they have played exactly the same WAL, what does
this accomplish?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-17 Thread Jeff Davis
On Tue, 2012-07-17 at 01:02 -0700, Daniel Farina wrote:
 Could pg_upgrade emit WAL segment(s) to provide continuity of a
 timeline?  So something like:

By "segments" did you mean "records"?

 * Take down the writable primary for pg_upgrade
 * Some WAL is emitted and possibly archived
 * The old version, when reaching the special pg_upgrade WAL, could
 exit or report its situation having paused replay (as clearly, it
 cannot proceed). Unsure.

I don't really understand this step.

 * Start up a new version of postgres on the same cluster at that
 point, which plays the upgrade-WAL.
 
 I see this being pretty mechanically intensive, but right now my hands
 are completely tied as to achieving total continuity of my archives,
 costing a base-backup's worth of risk window upon upgrade.

Does "continuity of archives" mean "avoid downtime" or "maintain a
single WAL sequence"?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-16 Thread Jeff Davis
On Tue, 2012-07-10 at 11:50 -0400, Bruce Momjian wrote:
 I don't think we can assume that because pg_upgrade was run on the
 master and standby that they are binary identical, can we?  Technically
 the user file are identical, but the system catalogs and WAL might be
 different, hence my suggestion to run rsync before allowing the standby
 to rejoin the primary.

Do you have plans to change that in the future?

If we know that the user data files are identical between primary and
replica, it would be nice if we could provide a robust way to avoid
copying them.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-07-13 Thread Jeff Davis
On Fri, 2012-07-13 at 15:21 -0400, Tom Lane wrote:
 No, that's incorrect: the fadvise is there, inside pg_flush_data() which
 is done during the writing phase.

Yeah, Andres pointed that out, also.

   It seems to me that if we think
 sync_file_range is a win, we ought to be using it in pg_flush_data just
 as much as in initdb.

It was pretty clearly a win for initdb, but I wasn't convinced it was a
good idea for other use cases. 

The mechanism is outlined in the email you linked below. To paraphrase,
fadvise tries to put the data in the io scheduler queue (still in the
kernel before going to the device), and gives up if there is no room;
sync_file_range waits for there to be room if none is available.

For the case of initdb, the data needing to be fsync'd is effectively
constant, and it's a lot of small files. If the requests don't make it
to the io scheduler queue before fsync, the kernel doesn't have an
opportunity to schedule them properly.

But for larger amounts of data copying (like ALTER DATABASE SET
TABLESPACE), it seemed like there was more risk that sync_file_range
would starve out other processes by continuously filling up the io
scheduler queue (I'm not sure if there are protections against that or
not). Also, if the files are larger, a single fsync represents more
data, and the kernel may be able to schedule it well enough anyway.

I'm not an authority in this area though, so if you are comfortable
extending sync_file_range to copydir() as well, that's fine with me.
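
For illustration, the two-pass shape being discussed looks roughly like
this -- a minimal sketch, assuming Linux for the sync_file_range() path
(the real patch uses a configure test rather than __linux__, and handles
directories, symlinks and errors properly):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Pass 1: push a file's dirty pages toward the I/O scheduler. */
static void
pre_sync_fname(const char *fname)
{
    int     fd = open(fname, O_RDONLY);

    if (fd < 0)
        return;                 /* best effort only */
#if defined(__linux__)
    /* preferred: waits for room in the request queue, so requests get in */
    (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#elif defined(POSIX_FADV_DONTNEED)
    /* fallback: a hint only; may do nothing if the queue is already full */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#endif
    close(fd);
}

/* Pass 2: make the file durable. */
static void
fsync_fname(const char *fname)
{
    int     fd = open(fname, O_RDONLY);

    if (fd >= 0)
    {
        (void) fsync(fd);
        close(fd);
    }
}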

 However, I'm a bit confused because in
 http://archives.postgresql.org/pgsql-hackers/2012-03/msg01098.php
 you said
 
  So, it looks like fadvise is the right thing to do, but I expect we'll
 
 Was that backwards?  If not, why are we bothering with taking any
 portability risks by adding use of sync_file_range?

That part of the email was less than conclusive, and I can't remember
exactly what point I was trying to make. sync_file_range is a big
practical win for the reasons I mentioned above (and also avoids some
unpleasant noises coming from the disk drive).

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-07-13 Thread Jeff Davis
On Fri, 2012-07-13 at 17:35 -0400, Tom Lane wrote:
 I wrote:
  I'm picking up this patch now.  What I'm inclined to do about the -N
  business is to commit without that, so that we get a round of testing
  in the buildfarm and find out about any portability issues, but then
  change to use -N after a week or so.  I agree that in the long run
  we don't want regression tests to run with fsyncs by default.
 
 Committed without the -N in pg_regress (for now).  I also stuck
 sync_file_range into the backend's pg_flush_data --- it would be
 interesting to hear measurements of whether that makes a noticeable
 difference for CREATE DATABASE.

Thank you.

One point about the commit message: fadvise does not block to go into
the request queue, sync_file_range does. The problem with fadvise is
that, when the request queue is small, it fills up so fast that most of
the requests never make it in, and fadvise is essentially a no-op.
sync_file_range waits for room in the queue, which is (based on my
tests) enough to improve the scheduling a lot.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree

2012-07-03 Thread Jeff Davis
On Thu, 2012-06-14 at 02:56 +0400, Alexander Korotkov wrote:
 Hackers,
 
 
 attached patch implements quad-tree on ranges. Some performance
 results in comparison with current GiST indexing.
 Index creation is slightly slower. Probably, it need some
 investigation. Search queries on SP-GiST use much more pages. However
 this comparison can be not really correct, because SP-GiST can pin
 same buffer several times during one scan. In CPU search queries on
 SP-GiST seems to be slightly faster. Dramatical difference in column
 <@ const query is thanks to 2d-mapping.
 

More comments:

* Minor rebase is required (simple int2 -> int16).

* Perhaps I'm mistaken, but the following code in getQuadrant() looks
wrong to me, shouldn't the 1 and 2 be reversed?

if (range_cmp_bounds(typcache, upper, centroidUpper) >= 0)
return 1;
else
return 2;

* In the choose method, why does in->allTheSame unconditionally match?
Why not split? Similarly, why does inner_consistent always match when
the nodes are allTheSame?

* It's a little confusing having empty prefixes mean that empty ranges go
to node0, and non-empty prefixes meaning that empty ranges go to node4
(quadrant 5). Why can't there just always be 5 nodes, and iff all the
ranges are empty, then the prefix is NULL?

And for that matter, let's let the quadrant equal the node number, and
have the empty ranges in node0. I don't see much point in always
subtracting 1 from the quadrant number.
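
To make that concrete, getQuadrant() under the suggested scheme might look
something like this (a sketch only, using the existing range_deserialize()
and range_cmp_bounds() helpers; which non-empty quadrant gets which number
is arbitrary here):

static int
getQuadrant(TypeCacheEntry *typcache, RangeType *centroid, RangeType *tst)
{
    RangeBound  centroidLower, centroidUpper, lower, upper;
    bool        centroidEmpty, empty;

    range_deserialize(typcache, centroid, &centroidLower, &centroidUpper,
                      &centroidEmpty);
    range_deserialize(typcache, tst, &lower, &upper, &empty);

    if (empty)
        return 0;               /* every empty range lives in node 0 */

    if (range_cmp_bounds(typcache, &lower, &centroidLower) >= 0)
    {
        if (range_cmp_bounds(typcache, &upper, &centroidUpper) >= 0)
            return 1;
        return 2;
    }
    if (range_cmp_bounds(typcache, &upper, &centroidUpper) >= 0)
        return 4;
    return 3;
}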

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree

2012-07-03 Thread Jeff Davis
On Mon, 2012-07-02 at 23:47 -0700, Jeff Davis wrote:
 On Thu, 2012-06-14 at 02:56 +0400, Alexander Korotkov wrote:
  Hackers,
  
  
  attached patch implements quad-tree on ranges. Some performance
  results in comparison with current GiST indexing.
  Index creation is slightly slower. Probably, it need some
  investigation. Search queries on SP-GiST use much more pages. However
  this comparison can be not really correct, because SP-GiST can pin
  same buffer several times during one scan. In CPU search queries on
  SP-GiST seems to be slightly faster. Dramatical difference in column
  <@ const query is thanks to 2d-mapping.
  

Also, it would be helpful to add a couple tests to rangetypes.sql.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree

2012-07-03 Thread Jeff Davis
On Mon, 2012-07-02 at 23:47 -0700, Jeff Davis wrote:
 * Perhaps I'm mistaken, but the following code in getQuadrant() looks
 wrong to me, shouldn't the 1 and 2 be reversed?
 
 if (range_cmp_bounds(typcache, upper, centroidUpper) >= 0)
 return 1;
 else
 return 2;

Oops, looks like I was mistaken. The code looks fine to me now.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST subsplit question

2012-06-26 Thread Jeff Davis
On Tue, 2012-06-26 at 11:28 -0400, Robert Haas wrote:
 On Wed, May 30, 2012 at 4:35 PM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  So, do we demote that message to a DEBUG1? Or do we make it more clear
  what the authors of a specific picksplit are supposed to do to avoid
  that problem? Or am I misunderstanding something?
 
 
  +1 for demote message to DEBUG1. I think it shouldn't be so noisy, it just
  indicates something could be improved.
  Also I think we definitely need to document secondary split. Now it's no
  chances to understand without reverse engineering it from code.
 
 I'm happy to go demote the message if we have consensus on that, but
 somebody else is going to need to provide the doc patch.  Any takers?

I was planning to do that, but it won't be for a few days at least. If
someone else wants to do it sooner, feel free.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-23 Thread Jeff Davis
On Fri, 2012-06-22 at 10:04 -0400, Robert Haas wrote:
 This may be a stupid question, by why is it initdb's job to fsync the
 files the server creates, rather than the server's job?  Normally we
 rely on the server to make its own writes persistent.

That was my first reaction as well:

http://archives.postgresql.org/message-id/1328468000.15224.32.camel@jdavis

However, from the discussion it seems like it will be harder to do it
that way (during normal operation, a checkpoint is what syncs the files;
but as Tom points out, bootstrap mode does not). Also, I would expect
the fsyncs to be in a less-controlled pattern, so it might perform more
poorly.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree

2012-06-22 Thread Jeff Davis
On Thu, 2012-06-14 at 02:56 +0400, Alexander Korotkov wrote:
 Hackers,
 
 
 attached patch implements quad-tree on ranges. Some performance
 results in comparison with current GiST indexing.
 Index creation is slightly slower. Probably, it need some
 investigation. Search queries on SP-GiST use much more pages. However
 this comparison can be not really correct, because SP-GiST can pin
 same buffer several times during one scan. In CPU search queries on
 SP-GiST seems to be slightly faster. Dramatical difference in column
 <@ const query is thanks to 2d-mapping.
 
Looking at this patch now. I see that it fails the opr_sanity test (on
master), can you look into that?

It looks very promising from a performance standpoint. I think the col
<@ const query will be a common query; and I also think that pattern
will be useful to restrict a large table down to something more
manageable.

In the bounds_connected function, it might make more sense to use the
word adjacent which I already used for ordinary ranges, rather than
using the new word connected.

Also, I'm getting a little confused switching between thinking in terms
of X and Y and lower and upper (particularly since lower and upper
can be confused with < or >). I don't have a suggestion yet how to
clarify that, but it might be good to use the spatial terminology in
more places and avoid lower/upper except where needed.

Please excuse the slow review, I'm catching up on the SP-GiST API.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-19 Thread Jeff Davis
On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote:
 It calls pg_flush_data inside of copy_file which does the posix_fadvise... So 
 maybe just put the sync_file_range in pg_flush_data?

The functions in fd.c aren't linked to initdb, so it's a challenge to
share that code (I remember now: that's why I didn't use pg_flush_data
originally). I don't see a clean way to resolve that, do you?

Also, it seems that sync_file_range() doesn't help with creating a
database much (on my machine), so it doesn't seem important to use it
there.

Right now I'm inclined to leave the patch as-is.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporal support patch

2012-06-18 Thread Jeff Davis
On Mon, 2012-06-18 at 19:34 +0900, Vlad Arkhipov wrote:
 What's wrong with SPI/timetravel extension for system versioning?
 http://www.postgresql.org/docs/9.1/static/contrib-spi.html
 
 We are heavily using system-versioned and application-time period
 tables in our enterprise products (most of them are bi-temporal).
 However our implementation is based on triggers and views and
 therefore is not very convenient to use. There are also some locking
 issues with foreign keys to application-time period tables. It will be
 great if the new temporal SQL features will be included in the
 Postgresql core with SQL 2011 syntax support. It is especially
 important for bi-temporal tables because of complex internal logic of
 UPDATE/DELETE and huge SELECT queries for such tables.

I've already pointed out some missing features in this thread, but the
big ones in my mind are:

1. It doesn't use 9.2 Range Types, which would help in a lot of ways
(like making the SELECT queries a lot simpler and faster).

2. It's missing a lot of options, like storing the user that modified a
row or the changed columns.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-18 Thread Jeff Davis
On Mon, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
 Quick review:
 - defaulting to initdb -N in the regression suite is not a good imo, because 
 that way the buildfarm won't catch problems in that area...

I removed the -N as you suggest. How much does performance matter on the
buildfarm?

 - could the copydir.c and initdb.c versions of walkdir/sync_fname et al be 
 unified?

There's a lot of backend-specific code in the copydir versions, like
using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
unifying them before, and concluded that it wouldn't add to the
readability, so I just commented where they came from.

If you feel there will be a maintainability problem, I can give it
another shot.

 - I personally would find it way nicer to put USE_PRE_SYNC into 
 pre_sync_fname 
 instead of cluttering the main function with it

Done.

Regards,
Jeff Davis


initdb-fsync-20120618.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-18 Thread Jeff Davis
On Mon, 2012-06-18 at 21:34 +0300, Peter Eisentraut wrote:
 On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
  - defaulting to initdb -N in the regression suite is not a good imo,
  because that way the buildfarm won't catch problems in that area...
  
 The regression test suite also starts postgres with fsync off.  This is
 good, and I don't want to have more overhead in the tests.  It's not
 like we already have awesome test coverage for OS interaction.

OK, changing back to using -N during regression tests due to performance
concerns.

Regards,
Jeff Davis


initdb-fsync-20120618-2.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-18 Thread Jeff Davis
On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote:
   - defaulting to initdb -N in the regression suite is not a good imo,
   because that way the buildfarm won't catch problems in that area...
  I removed the -N as you suggest. How much does performance matter on the
  buildfarm?
 I don't think the difference in initdb cost is relevant when running the 
 regression tests. Should it prove to be we can re-add -N after a week or two 
 in the buildfarm machines. I just remember that there were several OS 
 specific 
 regression when adding the pre-syncing for createdb.

That sounds reasonable to me. Both patches are out there, so we can
figure out what the consensus is.

   - could the copydir.c and initdb.c versions of walkdir/sync_fname et al
   be unified?
  There's a lot of backend-specific code in the copydir versions, like
  using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
  unifying them before, and concluded that it wouldn't add to the
  readability, so I just commented where they came from.
 Ok. Sensible reasons. I dislike that we know have two files using different 
 logic (copydir.c only using fadvise, initdb using sync_file_range if 
 available). Maybe we should just move the fadvise and sync_file_range calls 
 into its own common function?

I don't see fadvise in copydir.c, it looks like it just uses fsync. It
might speed it up to use a pre-sync call there, too -- database creation
does take a while.

If that's in the scope of this patch, I'll do it.

 Btw, I just want to have said this, although I don't think its particularly 
 relevant as it doesn't affect correctness: Its possible to have a system 
 where 
 sync_file_range is in the system headers but the kernel during runtime 
 doesn't 
 support it. It is relatively new (2.6.17). It would be possible to fallback 
 to 
 posix_fadvise which has been around far longer in that case...

Interesting point, but I'm not too worried about it.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-18 Thread Jeff Davis
On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote:
 It calls pg_flush_data inside of copy_file which does the posix_fadvise... So 
 maybe just put the sync_file_range in pg_flush_data?

Oh, I didn't notice that, thank you.

In that case, it may be good to combine them if possible. I will look
into it. There may be performance implications when used on a larger
amount of data though. I can do some brief testing.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Support for foreign keys with arrays

2012-06-17 Thread Jeff Davis
On Fri, 2011-11-04 at 13:48 +0100, Gabriele Bartolini wrote:
 This patch adds basic support of arrays in foreign keys, by allowing to 
 define a referencing column as an array of elements having the same type 
 as the referenced column in the referenced table.
 Every NOT NULL element in the referencing array is matched against the 
 referenced table.

I'm trying to find commonalities between this feature and my future
RANGE FOREIGN KEY feature (not past the hand-waving stage yet).

The first thing I observe is that my idea for range foreign keys is
almost the opposite of your idea for array FKs.

I was imagining a range FK to mean that the referencing side is
contained by the referenced side. This is the common definition in the
temporal world, because the valid period for the referencing row must be
within the valid period for the row it references (same for transaction
time). The referenced side must be a range, and the referencing side
must be either a range of the same type or the subtype of the range.

Other similar definitions exist by replacing contained by with some
other operator, though the use cases for those aren't as clear to me.

This definition works for arrays and possibly many other types
(geometry?) as well. It looks like this is orthogonal from your work,
but it does seem like it has potential for confusion in the future.

Thoughts?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Support for foreign keys with arrays

2012-06-17 Thread Jeff Davis
On Sun, 2012-06-17 at 21:10 +0800, Simon Riggs wrote:
 Do we need something like Exclusion FKs? i.e. the FK partner of
 Exclusion Constraints?

Yes, Inclusion Constraints. I've known we need something like that
since I did Exclusion Constraints, but I haven't gotten further than
that.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporal support patch

2012-06-15 Thread Jeff Davis
On Wed, 2012-06-13 at 23:10 +0200, Miroslav Šimulčík wrote:

 I have working patch for postgresql version 9.0.4, but it needs
 refactoring before i can submit it, because some parts don't
 meet formatting requirements yet. And yes, changes are large, so it
 will be better to discuss design first and then deal with code. Do you
 insist on compatibility with standard SQL 2011 as Pavel wrote?
 
Try to work on solving the problem and identify the use cases. I don't
think the standard will cause a major problem; we should be able to make
the relevant parts of your patch match the standard.

That's one reason to work on it as an extension first: we can get a
better sense of the problem space and various use cases without worrying
about violating any standard. Then, as you need specific backend support
(e.g. special syntax), we can take the standards more seriously.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-13 Thread Jeff Davis
On Wed, 2012-06-13 at 13:53 +0300, Peter Eisentraut wrote:
 The --help output for the -N option was copy-and-pasted wrongly.
 
 The message issued when using -N is also a bit content-free.  Maybe
 something like
 
 Running in nosync mode.  The data directory might become corrupt if the
 operating system crashes.\n

Thank you, fixed.

 Which leads to the question, how does one get out of this state?  Is
 running sync(1) enough?  Is starting the postgres server enough?

sync(1) calls sync(2), and the man page says:

According to the standard specification  (e.g.,  POSIX.1-2001),  sync()
schedules the writes, but may return before the actual writing is done.
However, since version 1.3.20 Linux does actually  wait.   (This  still
does not guarantee data integrity: modern disks have large caches.)

So it looks like sync is enough if you are on linux *and* you have any
unprotected write cache disabled.

I don't think starting the postgres server is enough.

Before, I think we were safe because we could assume that the OS would
flush the buffers before you had time to store any important data. But
now, that window can be much larger.

 There are no updates to the initdb man page included in your patch,
 which would be a suitable place to discuss this briefly.

Thank you, I added that.

Regards,
Jeff Davis


initdb-fsync-20120613.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums

2012-06-12 Thread Jeff Davis
On Tue, 2012-06-12 at 08:42 -0400, Robert Haas wrote:
 Instead of trying to maintain MVCC semantics, maybe we should just
 have something like COPY (FROZEN) that just writes frozen tuples into
 the table and throws MVCC out the window.  Seems like that would be a
 lot easier to implement and satisfy basically the same use cases.

The reason that doesn't work is because there's really no answer for
ABORTs except to delete the whole table, or ask the user to try to
figure out what made it or what didn't. (That I can see, anyway; you may
have a better idea.)

In practice, I think the user would need to use this only for new tables
and use only one loader so they would know what to do if there is an
abort. That makes this very similar to the proposal for optimizing loads
by the transaction that creates the table; except with fewer safeguards.
It's basically just saying don't abort, and be careful with concurrent
reads (which are now dirty).

I think this problem largely has to do with freezing though, because
that's what makes it impossible to differentiate after an abort. Let's
set that aside for now, and just focus on setting HEAP_XMIN_COMMITTED,
PD_ALL_VISIBLE, and the VM bit. Those account for a major part of the
problem; being able to freeze also is really a bonus (and if the table
is really that static, then maybe creating it in the loading transaction
is reasonable).

In those terms, we can reframe the questions as: what do we do about
reads during the load?

Answers could include:
  (a) nothing -- reads during a load are potentially dirty
  (b) readers ignore hint bits during the load, and pay the penalty
  (c) allow only INSERT/COPY, and block unsafe SELECT/UPDATE/DELETE

(a) is your proposal except without the freezing. (b) was my proposal.
(c) was a backup I had in mind (if reads are expensive anyway, maybe
people would rather just block).

These aren't exclusive. Maybe we can do (c) in READ COMMITTED and above,
and (a) in READ UNCOMMITTED.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums

2012-06-12 Thread Jeff Davis
On Tue, 2012-06-12 at 13:31 -0400, Tom Lane wrote:
 Yeah; the notion of adding cycles to the visibility-check code paths,
 which would add cost even for users who have no use at all for this
 feature, is close to being a deal-breaker for me.  I would rather see
 us designing this around the idea of what can we do without adding
 any new complexity in visibility checks?.

[waves hands wildly]

We could create a different HeapTupleSatisfiesMVCC routine that ignores
hint bits, and select that at the time the scan is started. That would
at least avoid affecting existing code paths.

 At least for MVCC cases (hence, user tables only), it seems like we
 could pre-set XMIN_COMMITTED hint bits if there were a way to be sure
 that the XID in question would be seen as still running in the followup
 snapshot check that every MVCC visibility test must make anyway.  The
 hard part of that seems to be post-crash behavior, but maybe it'd be
 acceptable to incur crash-recovery-time cost to run around and unset
 such bits?

Hmm... but that still leaves PD_ALL_VISIBLE, which will also cause a
rewrite of the data. Under OLTP, we can assume that PD_ALL_VISIBLE is
much more rare than HEAP_XMIN_COMMITTED; but for bulk loads, setting
HEAP_XMIN_COMMITTED doesn't help us much without PD_ALL_VISIBLE.

 This doesn't do anything for pre-freezing of course, but I basically
 don't see any way that we could pre-freeze without breaking MVCC
 semantics anyhow.  So there's still attraction in the idea of offering
 users the choice of not sustaining MVCC behavior in exchange for cheaper
 bulk loads.

That may be reasonable, but it's much more dangerous than just breaking
MVCC (which to me implies isolation rules) -- pre-freezing would break
atomicity if you have any aborts. And I can't think of very many cases
where losing atomicity is reasonable (although I'm sure there are some).

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums

2012-06-12 Thread Jeff Davis
On Tue, 2012-06-12 at 13:10 -0400, Robert Haas wrote:
 I don't think it's going to solve the problem in general, but TBH I
 don't think Jeff's proposal is, either.  I mean, ignoring
 xmin-committed, xmax-committed, and all-visible bits is going to come
 with a pretty steep performance penalty all of its own.

I think we'd certainly have to discourage users from launching lots of
full table scans during a data load. We could go so far as blocking
reads, as I said in my other response, and still (mostly) meet my
primary goals.

But allowing reads without hint bits might be useful for highly
selective index access, which is a substantial use case for the kind of
large tables we'd be bulk-loading.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums

2012-06-12 Thread Jeff Davis
On Tue, 2012-06-12 at 17:53 -0400, Bruce Momjian wrote:
 Well, truncating tables that were empty on the load would certainly be
 easy to do --- not sure if that helps us, though.

This goes back to the single-loader-into-an-empty-table case, which I'd
like to get away from. I think we already have some reasonable proposals
on the table for optimizing that case.

Ultimately, that pushes us toward partitioning with the same granularity
as the data load. We can try to open that discussion up, but I think
it's a bad idea to tie the data load granularity to the partitioning
granularity, unless we have a way to coalesce them later without
rewriting. It's too bad filesystems don't allow us to just reallocate
some blocks from one file to another. [ OK, I guess I just opened this
discussion myself ;) ]

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums

2012-06-12 Thread Jeff Davis
On Tue, 2012-06-12 at 18:02 -0400, Tom Lane wrote:
 Or (d) it's not a problem, since the inserting XID is still busy
 according to the readers' snapshots.

How much of a savings did we get from PD_ALL_VISIBLE when it was added
into the page-at-a-time visibility check?

From 608195a3a3656145a7eec7a47d903bc684011d73:

In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on
each heap page, also indicating that all tuples on the page are visible
to all transactions. It's important that this flag is kept up-to-date.
It is also used to skip visibility tests in sequential scans, which
gives a small performance gain on seqscans.

If small means that it's something we can give up, then focusing on
HEAP_XMIN_COMMITTED makes sense. But if we can't give it up, then we
need to take it into account in the proposal.
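
For context, the seqscan fast path that the commit message describes is
roughly the following (a simplified sketch loosely modeled on
heapgetpage(), not the verbatim code): when PD_ALL_VISIBLE is set, the
per-tuple visibility call is skipped for the whole page.

static int
count_visible_tuples(Page page, Buffer buffer, Snapshot snapshot)
{
    OffsetNumber lineoff;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    bool         all_visible = PageIsAllVisible(page) &&
                               !snapshot->takenDuringRecovery;
    int          nvisible = 0;

    for (lineoff = FirstOffsetNumber; lineoff <= maxoff; lineoff++)
    {
        ItemId        lpp = PageGetItemId(page, lineoff);
        HeapTupleData loctup;

        if (!ItemIdIsNormal(lpp))
            continue;

        loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
        loctup.t_len = ItemIdGetLength(lpp);
        ItemPointerSet(&loctup.t_self, BufferGetBlockNumber(buffer), lineoff);

        /* the flag lets us skip the per-tuple visibility check entirely */
        if (all_visible ||
            HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer))
            nvisible++;
    }
    return nvisible;
}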

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-06-12 Thread Jeff Davis
On Sun, 2012-02-19 at 21:49 +, Simon Riggs wrote:
 On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs si...@2ndquadrant.com wrote:
 
  v8 attached
 
 v10 attached.
 
 This patch covers all the valid concerns discussed and has been
 extensively tested.
 
 I expect to commit this in about 24 hours, so last call for comments
 please or say if you need more time to review.

I have made an attempt to merge this with master. This patch is not
meant to be the next version, and I have not made a real effort to be
sure this was a safe merge. I am just hoping this saves Simon some of
the tedious portions of the merge, and offers a reference point to see
where his merge differs from mine.

In the meantime, I'm looking at the original v10 patch against master as
it existed when Simon posted the patch.

I'd like to see checksums take part in the June commitfest.

Regards,
Jeff Davis


checksums-v10-merge.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums

2012-06-12 Thread Jeff Davis
On Tue, 2012-06-12 at 22:06 -0400, Robert Haas wrote:
  How much of a savings did we get from PD_ALL_VISIBLE when it was added
  into the page-at-a-time visibility check?
 
  From 608195a3a3656145a7eec7a47d903bc684011d73:
 
  In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on
  each heap page, also indicating that all tuples on the page are visible
  to all transactions. It's important that this flag is kept up-to-date.
  It is also used to skip visibility tests in sequential scans, which
  gives a small performance gain on seqscans.
 
  If small means that it's something we can give up, then focusing on
  HEAP_XMIN_COMMITTED makes sense. But if we can't give it up, then we
  need to take it into account in the proposal.
 
 It's significant.

In that case, the proposals that only involve HEAP_XMIN_COMMITTED don't
seem viable to me. To get maximum read speed, we will need to set
PD_ALL_VISIBLE also, and that means rewriting the entire table anyway
(for the workload that I'm describing).

However, maybe if we combine the approaches, readers could ignore
PD_ALL_VISIBLE during the load, which looks like maybe a 10% penalty,
without having to ignore HEAP_XMIN_COMMITTED (which seems much more
costly to ignore).

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-12 Thread Jeff Davis
On Sun, 2012-03-25 at 19:59 -0700, Jeff Davis wrote:
 On Sat, 2012-03-17 at 17:48 +0100, Cédric Villemain wrote:
  I agree with Andres.
  
  
  I believe we should use sync_file_range (_before?) with linux.
  
  And we can use posix_fadvise_dontneed on other kernels.
  
 OK, updated patch attached. sync_file_range() is preferred,
 posix_fadvise() is a fallback.
 

Rebased patch attached. No other changes.

Regards,
Jeff Davis


initdb-fsync-20120612.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums

2012-06-11 Thread Jeff Davis
On Wed, 2012-06-06 at 22:16 -0400, Noah Misch wrote:
 Note that, currently, only VACUUM sets PD_ALL_VISIBLE and visibility map bits.
 Would you make something else like heap_multi_insert() be able to do so?

That was the plan (roughly). I was thinking about doing it at the time a
new page was allocated.

 Avoiding measurable overhead in tuple visibility checks when the feature is
 inactive may well prove to be a key implementation challenge.

Perhaps a rudimentary CLOG cache, or some other way to mitigate CLOG
access could make it bearable.

Although I would like it to be an online operation, I'm not quite as
concerned about reads. I'd like to mitigate any major penalty, but if
reads are expensive during a load, then so be it.

  Then, it would remember the current xid
  as max_loader_xid, and wait until the global xmin is greater than
  max_loader_xid. This should ensure that all snapshots regard all loading
  transactions as complete.
 
 ... this might not be.  Each backend could decide, based on its own xmin,
 whether to ignore PD_ALL_VISIBLE in a given table.  In other words, your
 ignorehints flag could be an xmin set to InvalidTransactionId during stages 1
 and 2 and to the earliest safe xmin during stages 0 and 3.

That's a good idea. It might make it easier to implement, and removing a
step from finalization is certainly a big plus.

   * INITIATE and FINALIZE probably need to use PreventTransactionChain()
  and multiple transactions, to avoid holding the ShareUpdateExclusiveLock
  for too long. Also, we want to keep people from using it in the same
  transaction as the loading xact, because they might not realize that
  they would get a concurrency of 1 that way (because of the
  ShareUpdateExclusiveLock).
 
 Yes.  You need to commit the transaction modifying pg_class so other backends
 can observe the change, at which point you can gather the list to wait on.
 
 Consider splitting the INITIATE UI into two interfaces, one that transitions
 from state 0 to state 1 and another that expects state 1 and blocks until we
 reach state 2.  You then have no need for PreventTransactionChain(), and the
 interfaces could even be normal functions.  It's less clear how reasonably you
 could do this for the FINALIZE step, given its implicit VACUUM.  It could be
 achieved by having the user do the VACUUM and making the new interface merely
 throw an error if a VACUUM is still needed.  The trivial usage pattern might
 look like this:
 
 SELECT pg_initiate_load('bigtbl');
 SELECT pg_wait_load('bigtbl'); -- not a great name
 COPY bigtbl FROM STDIN;
 SELECT pg_stop_load('bigtbl');
 VACUUM bigtbl;
 SELECT pg_finalize_load('bigtbl');
 
 It's definitely less elegant, alas.  Perhaps offer the interface you've
 proposed and have it do the above under the hood.  That way, users with
 complex needs have the flexibility of the lower-level interfaces while those
 who can tolerate PreventTransactionChain() have simplicity.

I think that's a reasonable suggestion. I am going back and forth a
little on this one. It's got the benefit that you can see the internal
states more clearly, and it's easier to tell what's going on, and it's
better if we want to do more sophisticated testing.

The main drawback here is that it's exposing more to the user. I
imagined that we might want to push other kinds of optimizations into
the load path, and that might upset the interface you've described
above. Then again, we'll probably need the normal, load, and transition
states regardless, so maybe it's an empty concern.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3: load path to mitigate load penalty for checksums

2012-06-06 Thread Jeff Davis
On Wed, 2012-06-06 at 15:08 -0400, Robert Haas wrote:
 On Mon, Jun 4, 2012 at 9:26 PM, Jeff Davis pg...@j-davis.com wrote:
  Thoughts?
 
 Simon already proposed a way of doing this that doesn't require
 explicit user action, which seems preferable to a method that does
 require explicit user action, even though it's a little harder to
 implement.  His idea was to store the XID of the process creating the
 table in the pg_class row, which I think is *probably* better than
 your idea of having a process that waits and then flips the flag.
 There are some finicky details though - see previous thread for
 discussion of some of the issues.

My goals include:

* The ability to load into existing tables with existing data
* The ability to load concurrently

My understanding was that the proposal to which you're referring can't
do those things, which seem like major limitations. Did I miss
something?

 In
 many cases it would also be nice to write the tuples pre-frozen, so I
 think we should look for a design that will support that.

You're right, that would be nice.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 9.3: load path to mitigate load penalty for checksums

2012-06-04 Thread Jeff Davis
Introduction:
=
A lot of progress has been made on the checksums issue, with Simon's
excellent work beginning to gain consensus:

http://archives.postgresql.org/message-id/CA
+u5nmkw_gbs6qq_y8-rjgl1v7mvw2hwbhartb8lojhnpfx...@mail.gmail.com

For the purposes of this proposal, I'll assume that's the general
direction we'll be taking for CRCs.

The major drawback of that proposal is that it increases the amount of
work to be done after a large data load by requiring more WAL.

Proposal:
=
I propose a special user-initiated loading mode at the table
granularity. During this time, readers must ignore PD_ALL_VISIBLE,
HEAP_XMIN_COMMITTED, and the visibility map entirely. However, writers
may set all of those bits before the writing transaction commits,
obviating the need to rewrite (and WAL) the data again later. Ideally,
there would be no work for VACUUM to do after the data load (unless a
transaction aborted).

This would also help the non-CRC case of course, but I expect CRCs to
make this significantly more important.

Goals:
=

  * Table granularity (doesn't affect other tables at all)
  * Allows concurrent loaders
  * Allows loading into existing tables with existing data
  * Online (allow reads to proceed, even if degraded)

Rough user interaction:
=

  INITIATE LOAD ON foo AS 'job name';

  -- run concurrent loading sessions

  FINALIZE LOAD 'job name';

High-level design:
=

By hints I mean the VM bit, PD_ALL_VISIBLE, and HEAP_XMIN_COMMITTED.

By ignorehints I mean a flag in pg_class indicating that readers
should ignore hints.

By optimistichints I mean a flag in pg_class indicating that writers
can optimistically set hints.

Obviously, readers and writers would need a mechanism to honor those
flags, but I haven't dug into the details yet (additional routines in
tqual.c?).

  States:
0: normal
   * ignorehints = false
   * optimistichints = false
1: trying to enter data load mode, waiting on existing
   lockers (who might be scanning) to finish
   * ignorehints = true
   * optimistichints = false
2: data load mode
   * ignorehints = true
   * optimistichints = true
3: trying to leave data load mode, waiting on old snapshots to
   be released and aborted transactions to be cleaned up
   * ignorehints = true
   * optimistichints = false
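
For reference, here are the same four states written out as the two flags
(a sketch only, not an actual catalog definition):

typedef struct LoadModeFlags
{
    bool    ignorehints;        /* readers must not trust hints or the VM */
    bool    optimistichints;    /* writers may set hints before commit    */
} LoadModeFlags;

static const LoadModeFlags load_mode_states[] =
{
    {false, false},             /* 0: normal              */
    {true,  false},             /* 1: entering load mode  */
    {true,  true},              /* 2: data load mode      */
    {true,  false},             /* 3: leaving load mode   */
};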

INITIATE LOAD would first transition from state 0 to 1 by acquiring a
ShareUpdateExclusiveLock on the table (to be sure no concurrent INITIATE
or FINALIZE LOAD is going on) and setting ignorehints = true.

Then it moves from state 1 to state 2 by waiting for all transactions
that hold a lock on that table. Any transactions that don't already have
a lock will see the new flag when they try to get it. Now we're sure
that all readers will see the ignorehints flag, so we can set the
optimistichints flag to indicate that writers can write hints
optimistically.

FINALIZE LOAD would first move from state 2 to state 3 by acquiring a
ShareUpdateExclusiveLock on the table and setting optimistichints = false.

Then, it would move from state 3 to state 0 by first waiting for all
transactions that currently hold a lock on the table, to ensure they see
the optimistichints=false flag. Then, it would remember the current xid
as max_loader_xid, and wait until the global xmin is greater than
max_loader_xid. This should ensure that all snapshots regard all loading
transactions as complete. Also, it would need to run a lazy VACUUM to
remove any tuples from aborted transactions.

Details and optimizations
=
 * We probably want a graceful way to handle multiple data loads
happening on the same table. Rather than erroring out, we could treat it
more like a reference count, and only do the work to move in to data
load mode if not there already, and only move out of data load mode if
we're the last loading job on the table.
 * In general, there are some usability issues to sort out, to make sure
a table isn't put into data load mode and left that way. Right now, I'm
mostly concerned with getting a working design, but those will be
important, too.
 * We could optimize away the VACUUM going from 3 -> 0 if we are sure no
writing transactions aborted.
 * INITIATE and FINALIZE probably need to use PreventTransactionChain()
and multiple transactions, to avoid holding the ShareUpdateExclusiveLock
for too long. Also, we want to keep people from using it in the same
transaction as the loading xact, because they might not realize that
they would get a concurrency of 1 that way (because of the
ShareUpdateExclusiveLock).

Thoughts?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extending relations more efficiently

2012-05-31 Thread Jeff Davis
On Tue, 2012-05-01 at 10:08 -0400, Robert Haas wrote:
 We've previously discussed the possible desirability of extending
 relations in larger increments, rather than one block at a time, for
 performance reasons.  I attempted to determine how much performance we
 could possibly buy this way, and found that, as far as I can see, the
 answer is, basically, none.

Another point here is that with checksums, we will want to make sure
that zero pages can be treated as corrupt. That will probably involve
using the WAL for extension operations, and we'll want to mitigate that
cost somehow. Extending in larger chunks would probably be necessary.

There are some challenges there, but I think it's worth pursuing.
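
To sketch what extending in larger chunks might look like (illustrative
only; log_extension_record() is a made-up placeholder for whatever WAL
record such a change would actually emit):

#define EXTEND_CHUNK_PAGES 16

static BlockNumber
extend_relation_in_chunks(Relation rel)
{
    char        zerobuf[BLCKSZ];
    BlockNumber firstnew;
    int         i;

    MemSet(zerobuf, 0, BLCKSZ);

    LockRelationForExtension(rel, ExclusiveLock);

    RelationOpenSmgr(rel);
    firstnew = RelationGetNumberOfBlocks(rel);

    /* amortize the lock and WAL overhead over many new pages */
    for (i = 0; i < EXTEND_CHUNK_PAGES; i++)
        smgrextend(rel->rd_smgr, MAIN_FORKNUM, firstnew + i, zerobuf, false);

    /* hypothetical: one WAL record covering the entire new chunk */
    log_extension_record(rel, firstnew, EXTEND_CHUNK_PAGES);

    UnlockRelationForExtension(rel, ExclusiveLock);

    return firstnew;            /* first block of the new chunk */
}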

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] GiST subsplit question

2012-05-30 Thread Jeff Davis
http://archives.postgresql.org/message-id/CAEsn3ybKcnFno_tGQDJ=afhir2xd9ka1fqt5cqxxyu3wz_h...@mail.gmail.com


I was trying to answer that question on -general, and I found it a
little more challenging than I expected.

There was a previous discussion here:
http://archives.postgresql.org/pgsql-general/2007-08/msg01816.php

And it seems to trace back to these commits:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=783a73168b972488f85e48381546db047cb8f982
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1f7ef548ec2e594fa8766781c490fb5b998ea46b

I looked for the follow-up commit to support subsplit in the contrib
modules, figuring that would answer some questions, but I couldn't find
it.

The part that's confusing me is that the commit message says: pickSplit
should set spl_(l|r)datum_exists to 'false', but I don't see any
picksplit method that actually does that in contrib, nor in the sample
in the docs.

The code in that area is a bit difficult to follow, so it's not obvious
to me exactly what is supposed to happen.

So, do we demote that message to a DEBUG1? Or do we make it more clear
what the authors of a specific picksplit are supposed to do to avoid
that problem? Or am I misunderstanding something?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronized scans versus relcache reinitialization

2012-05-30 Thread Jeff Davis
On Sat, 2012-05-26 at 15:14 -0400, Tom Lane wrote:
 3. Having now spent a good deal of time poking at this, I think that the
 syncscan logic is in need of more tuning, and I am wondering whether we
 should even have it turned on by default.  It appears to be totally
 useless for fully-cached-in-RAM scenarios, even if most of the relation
 is out in kernel buffers rather than in shared buffers.  The best case
 I saw was less than 2X speedup compared to N-times-the-single-client
 case, and that wasn't very reproducible, and it didn't happen at all
 unless I hacked BAS_BULKREAD mode to use a ring buffer size many times
 larger than the current 256K setting (otherwise the timing requirements
 are too tight for multiple backends to stay in sync --- a seqscan can
 blow through that much data in a fraction of a millisecond these days,
 if it's reading from kernel buffers).  The current tuning may be all
 right for cases where you're actually reading from spinning rust, but
 that seems to be a decreasing fraction of real-world use cases.

Do you mean that the best case you ever saw was 2X, or the best case
when the table is mostly in kernel buffers was 2X?

I clearly saw better than 2X when the table was on disk, so if you
aren't, we should investigate.

One thing we could do is drive the threshold from effective_cache_size
rather than shared_buffers, which was discussed during 8.3 development.
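
Concretely, that would amount to something like this (a sketch; the
divisor is as arbitrary as the existing "NBuffers / 4" test in initscan()):

static bool
use_syncscan(Relation rel, BlockNumber nblocks)
{
    /* effective_cache_size is already measured in blocks */
    BlockNumber threshold = (BlockNumber) effective_cache_size / 4;

    return synchronize_seqscans &&
        !RelationUsesLocalBuffers(rel) &&
        nblocks > threshold;
}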

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporal support patch

2012-05-30 Thread Jeff Davis
On Wed, 2012-05-16 at 23:14 +0200, Miroslav Šimulčík wrote:
 Hi all,
 
 
 as a part of my master's thesis I have created temporal support patch
 for PostgreSQL. It enables the creation of special temporal tables
 with entries versioning. Modifying operations (UPDATE, DELETE,
 TRUNCATE) on these tables don't cause permanent changes to entries,
 but create new versions of them. Thus user can easily get to the past
 states of the table.
 
I would be very interested to see this, thank you for working on it.

There are quite a few aspects to a temporal database system, and you are
working on a system-maintained transaction-time historical table, right?
Or are there other aspects to your proposal?

Some general comments:

* I'd very much like to see you make use of Range Types from 9.2; in
particular, TSTZRANGE would be much better than holding two timestamps.
If a standard requires you to display two timestamps in certain
situations, perhaps you could use ranges internally and display the
boundaries as timestamps when needed.

* There is other useful information that could be recorded, such as the
user who inserted/updated/deleted the record.

* For some purposes, it's very useful to keep track of the columns that
changed. For instance, a query like show me any time a salary was
changed over the last month (or some other rare event) would be very
slow to run if there was not some explicit annotation on the historical
records (e.g. a columns changed bitmap or something).

* In general, I'm not fond of adorning queries with TRANSACTION TIME AS
OF... kinds of things. Those constructs are redundant with a WHERE
clause (on a range type, you'd use the contains operator). If a
standard requires that, maybe it would be OK to allow such things as
syntactic sugar.

* I do like having special DDL that creates the appropriate objects.
That helps to guide users so they don't have to invent their own
solution with triggers, etc.

* As Jim mentioned, it might make sense to use something resembling
inheritance so that selecting from the historical table includes the
current data (but with no upper bound for the range).

* It might make sense to hammer out as many of the details as we can
with an extension. For instance, exactly what options will be available,
what data types will be used, what objects will be created, the trigger
code, etc. Then, it will be more obvious exactly what we need to add
extra core support for (e.g. if we are going to use some inheritance
like mechanism), and what we need to add syntax sugar for.

I recommend that you start posting more detailed designs on
http://wiki.postgresql.org

If you already have code, feel free to submit it for the next commitfest
( http://commitfest.postgresql.org ), but this is a relatively large
project, so it will most likely take several commitfest cycles.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: 2d-mapping based indexing for ranges

2012-05-30 Thread Jeff Davis
On Mon, 2012-05-28 at 23:42 +0400, Alexander Korotkov wrote:
 Hackers,
 
 
 attached patch implements another approach to indexing of ranges:
 mapping lower and upper bounds as 2d-coordinates and using same
 indexing approaches as for 2d points. Patch provides range_ops2
 operator class which implements this approach.

...

 Contained by queries with small selectivity are dramatically faster
 as expected.

That's good news, but that seems like a less-common query. Overlaps and
contains are probably the two most important. It would be good to show a
significant benefit in at least one of those two queries if the build
time increases.

 2) Limitations of penalty expressive power. GiST penalty function
 returns just one float value. It is a limitation even for geometrical
 datatypes. Some papers on R-trees recommends insert new entry into box
 which have minimal area when this boxes have same extension area (very
 possible if extension area is zero). We can't express it using single
 float value. 
 For 2d-mapping ranges indexing situation is even harder.  Imagine,
 some set of ranges could have same lower or upper bound. On 2d space
 such ranges will be united into lines. But, we can't express
 increasing of line length in penalty, because area of line is always
 zero independently on it's length.  Actually, similar problem exists
 for geometrical datatypes, but it seems to me this problem is more
 urgent for ranges, because I believe in real life ranges, large set of
 ranges could really have same lower or upper bound.
 Probably, we would like to rank key extension cases as following or
 similar:
 a) extension of finite dimension to infinite
 b) extension in dimension perpendicular to infinite
 c) extension with finite area extension
 d) extension with zero area increment (extension of line length)
 So, it's impossible to fit all of these into one float. And it's hard
 to choose what to neglect and how.
 I think there are at least two ways to make the GiST interface sensitive
 enough to these things:
 a) Make penalty function return more than one float. The question is
 how many floats we allow?
 b) Create alternative choose interface function which would return
 set of most preferred keys for insert. Usage of such function could
 lead to slowdown because currently GiST stop scan when met zero
 penalty.

I agree that we should consider improving the interface for the penalty
function. In your last patch improving range indexing, it was quite
awkward to express the penalty in one dimension.

Regards,
Jeff Davis
 




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade libraries check

2012-05-29 Thread Jeff Davis
On Mon, 2012-05-28 at 16:06 -0400, Robert Haas wrote:
 On Sun, May 27, 2012 at 11:31 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  I don't recall exactly what problems drove us to make pg_upgrade do
  what it does with extensions, but we need a different fix for them.
 
 Well, you need pg_upgrade to preserve the OIDs of objects that are
 part of extensions just as you do for any other objects.

Also, I think it needs to force the extension version to match the old
cluster. Otherwise, we could be dealing with on-disk format changes, or
other complexities.

It doesn't sound difficult, but I thought I'd bring it up.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add primary key/unique constraint using prefix columns of an index

2012-05-29 Thread Jeff Davis
On Tue, 2012-05-22 at 10:24 -0700, Jeff Janes wrote:
 Now that there are index only scans, there is a use case for having a
 composite index which has the primary key or a unique key as the
 prefix column(s) but with extra columns after that.  Currently you
 would also need another index with exactly the primary/unique key,
 which seems like a waste of storage and maintenance.
 
 Should there be a way to declare a unique index with the unique
 property applying to a prefix of the indexed columns/expression?  And
 having that, a way to turn that prefix into a primary key constraint?
 
 Of course this is easier said then done, but is there some reason for
 it not to be a to-do item?

Technically, this can be done today using exclusion constraints if you
define an indexable operator that always returns true.

A similar idea, which I brought up during 9.0 development, is that it
might be useful to have one index that can enforce two UNIQUE
constraints. For instance, an index on (a,b,c) could be used to enforce
UNIQUE(a,b) and UNIQUE(a,c) using the exclusion constraint mechanism.

I didn't offer a lot of convincing evidence of practical value, so it
was left out. But it might be worth a second look.

Either way, it seems like the deviations from normal UNIQUE would be
better expressed in terms of an exclusion constraint, which offers a
little more room in the language.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-04-21 Thread Jeff Davis
On Sun, 2012-03-04 at 16:39 +, Simon Riggs wrote:
  v3 attached.

Added to next commitfest.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 9.3: summary of corruption detection / checksums / CRCs discussion

2012-04-21 Thread Jeff Davis
/cabrt9rbrmdsoz8kxgehfb4lg-ev9u67-6dlqvoiibpkkhtl...@mail.gmail.com

* Some way of caching CLOG information or making the access faster.
IIRC, there were some vague ideas about mmapping() the CLOG, or caching
a very small representation of the CLOG.

* Something else -- There are a few other lines of thought here. For
instance, can we use WAL for hint bits without a FPI, and still protect
against torn pages causing CRC failures? This is related to a comment
during the 2011 developer meeting, where someone brought up the idea of
idempotent WAL changes, and how that could help us avoid FPIs. It seems
possible after reading the discussions, but not clear enough on the
challenges to summarize here.

If we do use WAL for hint bit updates, that has an impact on Hot
Standby, because HS can't write WAL. So, it would seem that HS could not
set hint bits.


Comments?

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: 9.3: summary of corruption detection / checksums / CRCs discussion

2012-04-21 Thread Jeff Davis
On Sun, 2012-04-22 at 00:08 +0100, Greg Stark wrote:
 On Sat, Apr 21, 2012 at 10:40 PM, Jeff Davis pg...@j-davis.com wrote:
  * In addition to detecting random garbage, we also need to be able to
  detect zeroing of pages. Right now, a zero page is not considered
  corrupt, so that's a problem. We'll need to WAL table extension
  operations, and we'll need to mitigate the performance impact of doing
  so. I think we can do that by extending larger tables by many pages
  (say, 16 at a time) so we can amortize the cost of WAL and avoid
  contention.
 
 I haven't seen this come up in discussion.

I don't have any links, and it might just be based on in-person
discussions. I think it's just being left as a loose end for later, but
it will eventually need to be solved.

 WAL logging table
 extensions wouldn't by itself work because currently we treat the file
 size on disk as the size of the table. So you would have to do the
 extension in the critical section or else different backends might see
 the wrong file size and write out conflicting wal entries.

By critical section, I assume you mean while holding the relation
extension lock, not while inside a CRITICAL_SECTION(), right?

There would be some synchronization overhead, to be sure, but I think it
can be done. Ideally, we'd be able to do large enough extensions that,
if there is a parallel bulk load on a single table or something, the
overhead could be made insignificant.

I didn't intend to get too much into the detail in this thread, but if
it's a totally ridiculous or impossible idea, I'll remove it.

 The earlier consensus was to move all the hint bits to a dedicated
 area and exclude them from the checksum. I think double-write buffers
 seem to have become more fashionable but a summary that doesn't
 describe the former is definitely incomplete.

Thank you, that's the kind of omission I was looking to catch.

 That link points to the MVCC-safe truncate patch. I don't follow how
 optimizations in bulk loads are relevant to wal logging hint bit
 updates.

I should have linked to these messages:
http://archives.postgresql.org/message-id/CA
+tgmoylozdezzjkyj8_x2bpeeerao5dj-omvs1flqoqsqp...@mail.gmail.com
http://archives.postgresql.org/message-id/CA
+Tgmoa4Xs1jbZhm=pb9Xi4AGMJXRB2a4GSE9EJtLo=70Zne=g...@mail.gmail.com

Though perhaps I'm reading too much into Robert's comments.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join

2012-04-19 Thread Jeff Davis
On Tue, 2012-04-17 at 14:24 -0400, Robert Haas wrote:
 I thought Jeff was parenthetically complaining about cases like A LEFT
 JOIN (B INNER JOIN C ON b.y = c.y) ON a.x  b.x.  That presumably
 would require the parameterized-path stuff to have any chance of doing
 partial index scans over B.  However, I understand that's not the main
 issue here.

To take the mystery out of it, I was talking about any case where an
index scan is impossible or impractical. For instance, let's say the
ranges are computed values. Just to make it really impossible, let's say
the ranges are computed from columns in two different tables joined in a
subquery.

But yes, the ability of the planner to find the plan is also an issue
(hopefully less of one with the recent improvements).

 One thing that I think needs some analysis is when the range join idea
 is better or worse than a nested loop with inner index-scan, because
 potentially those are the options the planner has to choose between,
 and the costing model had better know enough to make the right thing
 happen.  It strikes me that the nested loop with inner index-scan is
 likely to be a win when there are large chunks of the indexed relation
 that the nestloop never needs to visit at all - imagine small JOIN big
 ON small.a < big.a, for example.  I suppose the really interesting
 question is how much we can save when the entirety of both relations
 has to be visited anyway - it seems promising, but I guess we won't
 know for sure without testing it.

Right, I will need to come up with a prototype that can at least test
the executor piece. I suspect that the plan choice won't be all that
different from an ordinary index nestloop versus mergejoin case, but
with much worse cardinality estimates to work with.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join

2012-04-19 Thread Jeff Davis
On Tue, 2012-04-17 at 14:03 -0400, Robert Haas wrote:
 I'm actually not sure these are equivalent formulations.  Suppose one
 side has [i,i] where i ranges from 1 to 1000 and the other side
 the exact same thing plus [1,1000].  That one really big range
 will come up second on the right side, and you will not be able to
 discard it until you reach the end of the join.  If you just keep
 rewinding the right side, you're going to end up with O(n^2) behavior,
 whereas if you can discard tuples from the middle on the right side,
 then you will get O(n) behavior, which is a big difference.  In other
 words, in your original algorithm the tuples that you discard in step
 4 or 5 need not be the first remaining tuple on whichever side of the
 join we're talking about.

To illustrate the problem (slightly modified), I'll write the sets out
verbatim rather than use range syntax:

  {1,2,3}              {1,2,3,4,5,6,7,8,9}
  {  2,3,4}            {  2,3,4}. . . . .
  {    3,4,5}          {    3,4,5}. . . .
  {      4,5,6}        {      4,5,6}. . .
  {        5,6,7}      {        5,6,7}. .
  {          6,7,8}    {          6,7,8}.
  {            7,8,9}  {            7,8,9}

The . are supposed to represent a shadow that the large range [1,9]
casts below it. This shadow prevents the discarding of [2,4] on the RHS
even when processing range [5,7] on the LHS, because we can't discard
out of the middle.

Note that, if you just have some large ranges and a lot of overlap,
that's not really a problem with the algorithm, it's just a large result
to compute. The problem comes when the ranges vary in size by quite a
lot, and there are many ranges that could be eliminated but can't
because of the shadow.

This problem can be mitigated substantially, I believe. Let me change
the algorithm (I don't like the way the pseudocode was structured
anyway), and word it a little more loosely:

1. Look at the top ranges on each side. Choose the one with the greater
upper bound, and call that Range1 from Side1, and the other range Range2
from Side2. If either Range1 or Range2 is empty, terminate.
2. Scan down Side2, discarding ranges that are strictly before, and
joining with ranges that overlap, stopping when you hit a range that is
strictly after.
3. Now, discard Range1, and reset Side2 to the first non-discarded
range. Goto 1.

The benefit is, in step 1, we choose a side that will *always* discard
the top tuple. And if we choose the one with the greater upper bound,
then we are going to eliminate the largest shadow.

That doesn't eliminate the problem entirely, but it seems like it would
reduce it a lot.
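
To make the reworded loop concrete, here is a rough standalone sketch of it
in C (closed integer ranges, both inputs sorted by lower bound then upper
bound; the names are invented for illustration and none of this is from the
patch). For simplicity it only discards leading ranges on Side2, so shadowed
ranges found mid-scan are merely skipped rather than discarded:

#include <stdbool.h>
#include <stdio.h>

typedef struct { int lower; int upper; } Range;

static void
revised_range_join(const Range *a, int na, const Range *b, int nb)
{
    int ai = 0, bi = 0;

    while (ai < na && bi < nb)
    {
        /* step 1: Range1 is whichever top range has the greater upper bound */
        bool         from_a = (a[ai].upper >= b[bi].upper);
        Range        r1 = from_a ? a[ai] : b[bi];
        const Range *side2 = from_a ? b : a;
        int         *i2 = from_a ? &bi : &ai;
        int          n2 = from_a ? nb : na;
        int          j;

        /* step 2: discard leading Side2 ranges strictly before Range1 ... */
        while (*i2 < n2 && side2[*i2].upper < r1.lower)
            (*i2)++;
        /* ... and join the overlapping ones, stopping at the first range
         * that is strictly after Range1 */
        for (j = *i2; j < n2 && side2[j].lower <= r1.upper; j++)
            if (side2[j].upper >= r1.lower)
                printf("[%d,%d] x [%d,%d]\n",
                       r1.lower, r1.upper, side2[j].lower, side2[j].upper);

        /* step 3: discard Range1 itself and start over at step 1 */
        if (from_a)
            ai++;
        else
            bi++;
    }
}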

Regarding the idea of discarding tuples in the middle, that might be an
interesting approach as well. It might be as simple as setting a flag in
the tuple header (like was done for full hash joins). Still not perfect,
but would make redundant checks very cheap. Combined with my strategy,
there's a good chance that we practically eliminate the problem.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join

2012-04-19 Thread Jeff Davis
On Wed, 2012-04-18 at 01:21 -0400, Tom Lane wrote:
 It would be a pretty weird implementation of mergejoin that could
 discard tuples from the middle of an input stream.  Or to be more
 specific, it wouldn't be the mergejoin itself that could do that at all
 --- you'd need the input plan node to be some sort of tweaked version of
 tuplestore or tuplesort that could respond to a request like that.

As I said in my reply to Robert, I think there are some ways we can make
this idea work.

 I can't escape the feeling that Jeff has chosen the wrong basis for his
 thought experiment, and that what he really ought to be thinking about
 is hashjoin, which keeps an in-memory table that it could easily modify
 on the fly if it chose to.  The multi-batch aspect of hybrid hashjoin
 could be useful too (IOW, when you're out of better ideas, throw the
 tuple back in the water to process later).

Obviously hashing is not going to be much use for anything but equality.
So I believe this approach is very similar to the temporary-index
method, except with batching, and always keeping the index in memory.

I don't think we would get the partitioning benefit of hashjoin, because
we'd have to put the same tuple in multiple partitions, so it's probably
better to just leave the outer side intact.

But in-memory indexes and multiple passes of the outer seems like a
reasonable alternative, particularly because an in-memory index might be
very fast (to build as well as query).

 This is just handwaving of course.  I think some digging in the
 spatial-join literature would likely find ideas better than any of
 these.

I will look in some more detail. The merge-like approach did seem to be
represented in the paper referenced by Alexander (the external plane
sweep), but it also refers to several methods based on partitioning.

I'm beginning to think that more than one of these ideas has merit.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join

2012-04-16 Thread Jeff Davis
On Mon, 2012-04-16 at 02:52 -0400, Tom Lane wrote:
 Jeff Davis pg...@j-davis.com writes:
   1. Order the ranges on both sides by the lower bound, then upper bound.
  Empty ranges can be excluded entirely.
   2. Left := first range on left, Right := first range on right
   3. If Left or Right is empty, terminate.
   4. If lower(Left) > upper(Right), discard Right, goto 2
   5. If lower(Right) > upper(Left), discard Left, goto 2
   6. return (Left, Right) as joined tuple
   7. Right := next range on right
   8. goto 3
 
 This is surely not correct in detail.  As written, it will be impossible
 for any tuple on the right side to be joined to more than one left-side
 tuple.  You will need something analogous to the mark-and-restore
 rewinding logic in standard mergejoin to make this work.

Every time you discard a tuple on the left, you go to step 2, which
rewinds the right side back to the first non-discarded tuple.

So, implemented using mark and restore:

  * start off with the marks at the first tuple on each side
  * discard means move the mark down a tuple
  * setting it back to the first range means restoring to the mark
  * going to the next range (step 7) just means getting another
tuple, without changing the mark

Only one side really needs the mark and restore logic, but it was easier
to write the pseudocode in a symmetrical way (except step 7).

 I will note that mergejoin is not only one of the more complicated
 executor modules, but it accounts for a huge percentage of the planner's
 complexity; which doesn't exactly make me sympathize with the notion of
 let's-copy-and-paste-all-that.  It'd be a lot better if we could build
 it as a small tweak to mergejoin rather than an independent thing.
 
 (Having said that, I have no idea how the planner support could work
 at all, because its treatment of mergejoins is intimately tied to
 mergejoinable equality and EquivalenceClasses and PathKeys, which don't
 seem readily extensible to the range situation.)

I think this is the more important problem area. It's clear that I'll
need to do some more investigation into the planning.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join

2012-04-16 Thread Jeff Davis
On Mon, 2012-04-16 at 16:22 +0400, Alexander Korotkov wrote:

 There is a good overview article about spatial joins.
 http://www.cs.umd.edu/users/hjs/pubs/jacoxtods07.pdf 

Thank you, that's exactly the kind of overview I was looking for.

 It shows that there is a lot of methods based on building temporaty
 indexes. It might mean that building of temporary GiST index is not a
 bad idea.

That had occurred to me, but I was hesitant to only use temp indexes. It
still doesn't really offer a good solution when both sides of the join
are relatively large (because of random I/O). Also the build speed of
the index would be more important than it is now.

On the other hand, if I tackle the problem using temp indexes, it would
be a more general solution useful for many problems (for instance, we
really need a better solution when the result of a subquery doesn't fit
in a work_mem-sized hashtable).

We may end up with some kind of combination (as the sweep seems to be;
see below).

 Also there are methods which looks like extension of Jeff Davis
 proposal to the multidimensional case. It is Plane Sweep and
 External Plane Sweep methods. However, it might use sophisticated data
 structures for the sweep. And I believe it should use it for good
 performance.

That method looks closer to what I had in mind.

My Range Join proposal is essentially the same as the sweep method
except that it only joins on one dimension, and the rest of the
dimensions have to be checked with RECHECK. So, this still works for an
N-dimensional join, as long as a single dimension is fairly selective.

The sweep method seems to do the first dimension with the sweep method,
and maintains a changing index that it uses to do the lookup. In other
words, it's essentially just a way to do the RECHECK on the other
dimensions in less than O(n) time. So, if the sweep dimension is not
selective at all, this degenerates to the temporary index method (or
something close, anyway).

The fact that, in the sweep method, there is still one special
dimension, makes me think that it will be easier to make the API work
based on ranges (which is a big win because postgres already knows about
ranges). If so, that makes it easier to divide the project into stages,
because we could get it working for ranges and then extend it to support
other dimensions more efficiently later.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join

2012-04-16 Thread Jeff Davis
On Sun, 2012-04-15 at 23:18 -0700, Darren Duncan wrote:
 Your proposal makes me think of something similar which might be useful, 
 INclusion constraints.  As exclusion constraints might be thought of like a 
 generalization of unique/key constraints, inclusion constraints are like a 
 generalization of foreign key constraints.  The inclusion constraints 
 basically allow some comparison operator other than is-equal to test if 
 values 
 in one table match values in another table, and the constraint allows the 
 former 
 if the test results in true.  An example of said inclusion test is whether 
 the 
 range in one table is contained in a range in another table.  I assume/hope 
 that, similarly, now that we have range types in 9.2, that the existing 
 exclusion constraints can be used with range comparison operators.

Yes, Range Foreign Keys are one of the loose ends for Range Types that
I'd like to wrap up.

 As to your actual proposal, it sounds like a generalization of the relational 
 join or set intersection operator where instead of comparing sets defined in 
 terms of an enumeration of discrete values we are comparing sets defined by a 
 range, which conceptually have infinite values depending on the data type the 
 range is defined over.  But if we're doing this, then it would seem to make 
 sense to go further and see if we have set analogies for all of our 
 relational 
 or set operators, should we want to do work with non-discrete sets.
 
 Now this sounds interesting in theory, but I would also assume that these 
 could 
 be implemented by an extension in terms of existing normal relational 
 operators, 
 where each range value is a discrete value, combined with operators for 
 unioning 
 or differencing etc ranges.  A relation of ranges effectively can represent a 
 discontinuous range; in that case, the empty discontinuous range is also 
 canonically representable by a relation with zero tuples.
 
 Jeff, I get the impression your proposal is partly about helping performance 
 by 
 supporting this internally, rather than one just defining it as a SQL 
 function, 
 am I right?

Per my proposal, the problem statement is that it's slow, so speeding
it up is really the entire proposal ;)

Extending the syntax might be interesting as well, but I don't have a
proposal for that.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join

2012-04-16 Thread Jeff Davis
On Mon, 2012-04-16 at 22:20 +0100, Simon Riggs wrote:
 On Mon, Apr 16, 2012 at 9:12 PM, Jeff Davis pg...@j-davis.com wrote:
 
  That had occurred to me, but I was hesitant to only use temp indexes. It
  still doesn't really offer a good solution when both sides of the join
  are relatively large (because of random I/O). Also the build speed of
  the index would be more important than it is now.
 
 The thing I like most about temp indexes is that they needn't be temporary.
 
 I'd like to see something along the lines of demand-created optional
 indexes, that we reclaim space/maintenance overhead on according to
 some cache management scheme. More space you have, the more of the
 important ones hang around. The rough same idea applies to
 materialised views.

I think to make an informed decision, the next thing I need to do is
hack up a prototype version of my merge join idea, and see how well it
performs against an index nestloop today.

It seems to me that both approaches may have merit independently.

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 9.3 Pre-proposal: Range Merge Join

2012-04-15 Thread Jeff Davis
I hope this is not an inappropriate time for 9.3 discussions. The flip
side of asking for submissions in the first couple commitfests means
that I need to submit proposals now.

What is a Range Join?

See attached SQL for example. The key difference is that the join
condition is not equality, but overlaps (&&).

Problem statement: slow. Nested loops are the only option, although they
can benefit from an inner GiST index if available. But if the join is
happening up in the plan tree somewhere, then it's impossible for any
index to be available.

Proposed solution: a modified merge join that can handle ranges.

 1. Order the ranges on both sides by the lower bound, then upper bound.
Empty ranges can be excluded entirely.
 2. Left := first range on left, Right := first range on right
 3. If Left or Right is empty, terminate.
 4. If lower(Left) > upper(Right), discard Right, goto 2
 5. If lower(Right) > upper(Left), discard Left, goto 2
 6. return (Left, Right) as joined tuple
 7. Right := next range on right
 8. goto 3

If step 4 or step 5 keeps getting triggered, and a btree index is
available (ordered by lower bound), we can re-probe to go to the correct
position, and consider that the new top range on that side. This is an
optimization for the case where there are large numbers of ranges with
no match on the other side.
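
As a rough illustration of steps 1-8, here is a standalone C sketch (closed
integer ranges, both inputs sorted by lower bound then upper bound; the
names are invented and none of this is from a patch). Right-side ranges that
merely fail to overlap the current Left are skipped, not discarded:

#include <stdio.h>

typedef struct { int lower; int upper; } Range;

static void
range_merge_join(const Range *left, int nleft, const Range *right, int nright)
{
    int l = 0;
    int r = 0;              /* first non-discarded range on the right */

    while (l < nleft && r < nright)
    {
        if (left[l].lower > right[r].upper)
            r++;            /* step 4: Right ends before every remaining Left */
        else if (right[r].lower > left[l].upper)
            l++;            /* step 5: Left ends before every remaining Right */
        else
        {
            int i;

            /* steps 6-8: join Left with right-side ranges until one starts
             * after Left ends */
            for (i = r; i < nright && right[i].lower <= left[l].upper; i++)
                if (right[i].upper >= left[l].lower)
                    printf("[%d,%d] joins [%d,%d]\n",
                           left[l].lower, left[l].upper,
                           right[i].lower, right[i].upper);
            l++;            /* done with Left; restart the right side at r */
        }
    }
}

int
main(void)
{
    Range a[] = { {1, 3}, {2, 4}, {5, 7} };
    Range b[] = { {0, 2}, {3, 6} };

    range_merge_join(a, 3, b, 2);
    return 0;
}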

Thanks to Nathan Boley for helping me devise this algorithm. However,
any bugs are mine alone ;)

Weaknesses: I haven't thought through the optimization, but I suspect it
will be hard to be very accurate in the costing. That might be OK,
because there aren't very many options anyway, but I'll need to think it
through.

Questions:

 * Is this idea sane? -- that is, are ranges important enough that
people are willing to maintain a new operator?
 * The more general problem might be spatial joins which can operate
in N dimensions, and I doubt this would work very well in that case.
Does someone know of a spatial join algorithm (without IP claims) that
would be as good as this one for ranges?
 * Other thoughts?

Regards,
Jeff Davis


-- rate is billing rate in dollars per hour
CREATE OR REPLACE FUNCTION
  bill(rate NUMERIC, during TSTZRANGE)
RETURNS NUMERIC LANGUAGE plpgsql AS
$$
DECLARE
  usage_s NUMERIC;
BEGIN
  IF isempty(during) THEN
usage_s := 0;
  ELSE
usage_s := extract(epoch from (upper(during) - lower(during)));
  END IF;
  RETURN rate * (usage_s/3600.0);
END;
$$;

CREATE TABLE billing_rate(rate NUMERIC, during TSTZRANGE);
CREATE TABLE billing_usage(customer_id INT8, during TSTZRANGE);

INSERT INTO billing_rate VALUES
  (12.50, '[2010-01-01 14:00, 2010-01-01 15:00)');
INSERT INTO billing_rate VALUES
  (14.50, '[2010-01-01 15:00, 2010-01-01 15:45)');

INSERT INTO billing_usage VALUES
  (123, '[2010-01-01 14:30, 2010-01-01 15:30)');
INSERT INTO billing_usage VALUES
  (234, '[2010-01-01 14:15, 2010-01-01 15:45)');

SELECT
  customer_id,
  bill(rate, range_intersect(u.during, r.during)) AS bill
FROM billing_rate r, billing_usage u
WHERE r.during && u.during;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory usage during sorting

2012-04-12 Thread Jeff Davis
On Sun, 2012-03-18 at 11:25 -0400, Tom Lane wrote:
 Yeah, that was me, and it came out of actual user complaints ten or more
 years back.  (It's actually not 2X growth but more like 4X growth
 according to the comments in logtape.c, though I no longer remember the
 exact reasons why.)  We knew when we put in the logtape logic that we
 were trading off speed for space, and we accepted that.

I skimmed through TAOCP, and I didn't find the 4X number you are
referring to, and I can't think what would cause that, either. The exact
wording in the comment in logtape.c is "4X the actual data volume", so
maybe that's just referring to per-tuple overhead?

However, I also noticed that section 5.4.4 (Vol 3 p299) starts
discussing the idea of running the tapes backwards and forwards. That
doesn't directly apply, because a disk seek is cheaper than rewinding a
tape, but perhaps it could be adapted to get the block-freeing behavior
we want. The comments in logtape.c say:

  Few OSes allow arbitrary parts of a file to be released back to the
OS, so we have to implement this space-recycling ourselves within a
single logical file.

But if we alternate between reading in forward and reverse order, we can
make all of the deallocations at the end of the file, and then just
truncate to free space. I would think that the OS could be more
intelligent about block allocations and deallocations to avoid too much
fragmentation, and it would probably be a net reduction in code
complexity. Again, the comments in logtape.c have something to say about
it:

   ...but the seeking involved should be comparable to what would
happen if we kept each logical tape in a separate file

But I don't understand why that is the case.

On another topic, quite a while ago Len Shapiro (PSU professor)
suggested to me that we implement Algorithm F (Vol 3 p321). Right now,
as tuplesort.c says:

   In the current code we determine the number of tapes M on the basis
of workMem: we want workMem/M to be large enough that we read a fair
amount of data each time we preread from a tape

But with forecasting, we can be a little smarter about which tapes we
preread from if the data in the runs is not random. That means we could
potentially merge more runs at once with the same work_mem without
sacrificing adequate buffers for prefetching. I'm not sure whether this
is a practical problem today, and I'm also not sure what to do if we
start merging a lot more runs and then determine that forecasting doesn't
work as well as we'd hoped (e.g. that the data in the runs really is
random). But I thought it was worth mentioning.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bug in fast-path locking

2012-04-10 Thread Jeff Davis
On Mon, 2012-04-09 at 22:47 -0700, Jeff Davis wrote:
 but other similar paths do:
 
   if (!proclock)
   {
 AbortStrongLockAcquire();
 
 I don't think it's necessary outside of LockErrorCleanup(), right?

I take that back, it's necessary for the dontwait case, too.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Last gasp

2012-04-10 Thread Jeff Davis
On Mon, 2012-04-09 at 23:12 -0400, Christopher Browne wrote:
 But there is also a flip side to that, namely that if we do so, there
 ought to be some aspect to the process to help guide those items that
 *aren't* particularly close to being committable.

I have benefited immensely from review of my WIP patches, and a lot of
the serious review tends to happen during commitfests. This is most
important for features with a significant user interface, where it's
harder to guess what people will want.

My current strategy is to submit WIP-marked patches during a commitfest.

I agree that we should continue to have a mechanism to review patches
that aren't ready for commit, though I'm fine if we change it.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bug in fast-path locking

2012-04-09 Thread Jeff Davis
On Mon, 2012-04-09 at 16:11 -0400, Robert Haas wrote:
  I wonder though whether
  you actually need a *count*.  What if it were just a flag saying do not
  take any fast path locks on this object, and once set it didn't get
  unset until there were no locks left at all on that object?
 
 I think if you read the above-referenced section of the README you'll
 be deconfused.

My understanding:

The basic reason for the count is that we need to preserve the
information that a strong lock is being acquired between the time it's
put in FastPathStrongRelationLocks and the time it actually acquires the
lock in the lock manager.

By definition, the lock manager doesn't know about it yet (so we can't
use a simple rule like "there are no locks on the object, so we can unset
the flag"). Therefore, the backend must indicate whether it's in this
code path or not; meaning that it needs to do something on the error
path (in this case, decrement the count). That was the source of this
bug.

There may be a way around this problem, but nothing occurs to me right
now.
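
To illustrate the shape of the bookkeeping, here is a toy sketch with
invented names (not the code in lock.c or the patch; in reality the count
lives in shared memory and is protected by a spinlock). The point is that
the count is bumped before the lock manager has any record of the lock, so
every error path in between must unbump it:

#include <stddef.h>

static int  strong_lock_counts[16];           /* per-partition; shared in reality */
static int *strong_lock_in_progress = NULL;   /* what error cleanup must undo */

static void
begin_strong_lock(int partition)
{
    strong_lock_counts[partition]++;          /* lock manager knows nothing yet */
    strong_lock_in_progress = &strong_lock_counts[partition];
}

static void
abort_strong_lock(void)                       /* called from error cleanup */
{
    if (strong_lock_in_progress != NULL)
    {
        (*strong_lock_in_progress)--;
        strong_lock_in_progress = NULL;
    }
}

static void
finish_strong_lock(void)                      /* lock tables now updated */
{
    /* keep the count bumped; the eventual lock release decrements it */
    strong_lock_in_progress = NULL;
}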

Regards,
Jeff Davis

PS: Oops, I missed this bug in the review, too.

PPS: In the README, FastPathStrongRelationLocks is referred to as
FastPathStrongLocks. Worth a quick update for easier symbol searching.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bug in fast-path locking

2012-04-09 Thread Jeff Davis
On Mon, 2012-04-09 at 17:42 -0500, Jim Nasby wrote:
 Dumb question... should operations in the various StrongLock functions
 take place in a critical section? Or is that already ensured outside of
 these functions?

Do you mean CRITICAL_SECTION() in the postgres sense (that is, avoid
error paths by making all ERRORs into PANICs and preventing interrupts);
or the sense described here:
http://en.wikipedia.org/wiki/Critical_section ?

If you mean in the postgres sense, you'd have to hold the critical
section open from the time you incremented the strong lock count all the
way until you decremented it (which is normally at the time the lock is
released), which is a cure worse than the disease.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bug in fast-path locking

2012-04-09 Thread Jeff Davis
On Mon, 2012-04-09 at 13:32 -0400, Robert Haas wrote:
 I looked at this more.  The above analysis is basically correct, but
 the problem goes a bit beyond an error in LockWaitCancel().  We could
 also crap out with an error before getting as far as LockWaitCancel()
 and have the same problem.  I think that a correct statement of the
 problem is this: from the time we bump the strong lock count, up until
 the time we're done acquiring the lock (or give up on acquiring it),
 we need to have an error-cleanup hook in place that will unbump the
 strong lock count if we error out.   Once we're done updating the
 shared and local lock tables, the special handling ceases to be
 needed, because any subsequent lock release will go through
 LockRelease() or LockReleaseAll(), which will do the appropriate
 cleanup.
 
 The attached patch is an attempt at implementing that; any reviews 
 appreciated.
 

This path doesn't have an AbortStrongLockAcquire:

  if (!(proclock->holdMask & LOCKBIT_ON(lockmode)))
  {
...
elog(ERROR,...)

but other similar paths do:

  if (!proclock)
  {
AbortStrongLockAcquire();

I don't think it's necessary outside of LockErrorCleanup(), right?

I think there could be some more asserts. There are three sites that
decrement FastPathStrongRelationLocks, but in two of them
StrongLockInProgress should always be NULL.

Other than that, it looks good to me.

Regards,
Jeff Davis






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade incorrectly equates pg_default and database tablespace

2012-04-06 Thread Jeff Davis
On Fri, 2012-03-30 at 13:11 -0700, Jeff Davis wrote:
 I confirmed this bug upgrading 9.1 to master, and that this patch fixes
 it. Thank you for the report!
 
 Patch looks good to me as well, with one very minor nitpick: the added
 comment is missing an apostrophe.
 
 Bruce, can you take a look at this?

Adding this to the next commitfest, just so it doesn't get forgotten.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade incorrectly equates pg_default and database tablespace

2012-03-30 Thread Jeff Davis
On Thu, 2012-03-22 at 14:55 +0200, Ants Aasma wrote:
 Hi,
 
 while working on a support case I stumbled upon a bug in pg_upgrade.
 Upgrade fails with No such file or directory when a database is
 moved to a non-default tablespace and contains a table that is moved
 to pg_default. The cause seems to be that the following test
 incorrectly equates empty spclocation with database tablespace:
 
 tblspace = PQgetvalue(res, relnum, i_spclocation);
 /* if no table tablespace, use the database tablespace */
 if (strlen(tblspace) == 0)
 tblspace = dbinfo->db_tblspace;
 
 Patch to fix this is attached.

I confirmed this bug upgrading 9.1 to master, and that this patch fixes
it. Thank you for the report!

Patch looks good to me as well, with one very minor nitpick: the added
comment is missing an apostrophe.

Bruce, can you take a look at this?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_terminate_backend for same-role

2012-03-26 Thread Jeff Davis
On Tue, 2012-03-20 at 01:38 -0700, Daniel Farina wrote:
 On Thu, Mar 15, 2012 at 9:39 PM, Fujii Masao masao.fu...@gmail.com wrote:
  On Fri, Mar 16, 2012 at 8:14 AM, Daniel Farina dan...@heroku.com wrote:
  Parallel to pg_cancel_backend, it'd be nice to allow the user to just
  outright kill a backend that they own (politely, with a SIGTERM),
  aborting any transactions in progress, including the idle transaction,
  and closing the socket.
 
  +1
 
 Here's a patch implementing the simple version, with no more guards
 against signal racing than have been seen previously.  The more
 elaborate variants to close those races is being discussed in a
 parallel thread, but I thought I'd get this simple version out there.

Review:

After reading through the threads, it looks like there was no real
objection to this approach -- pid recycling is not something we're
concerned about.

I think you're missing a doc update though, in func.sgml:

For the less restrictive <function>pg_cancel_backend</>, the role of an
active backend can be found from
the <structfield>usename</structfield> column of the
<structname>pg_stat_activity</structname> view.

Also, high-availability.sgml makes reference to pg_cancel_backend(), and
it might be worthwhile to add an "...and pg_terminate_backend()" there.

Other than that, it looks good to me.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-03-25 Thread Jeff Davis
On Sat, 2012-03-17 at 17:48 +0100, Cédric Villemain wrote:
 I agree with Andres.
 
 
 I believe we should use sync_file_range (_before?) with linux.
 
 And we can use posix_fadvise_dontneed on other kernels.
 
OK, updated patch attached. sync_file_range() is preferred,
posix_fadvise() is a fallback.
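
The shape of the pre-flush pass is roughly this (an illustrative sketch only,
not the patch: error handling, the directory walk, and the real configure
test are omitted, and HAVE_SYNC_FILE_RANGE just stands in for whatever check
the build uses):

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Pass 1: hint the kernel to start writeback on an already-written file,
 * so the fsync() in pass 2 has little left to wait for.
 */
static void
pre_flush_fd(int fd)
{
#if defined(HAVE_SYNC_FILE_RANGE)
    /* Linux: queue every dirty page of this file for writeback, don't wait */
    (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#elif defined(POSIX_FADV_DONTNEED)
    /* fallback: dropping the pages from cache also gets them written out */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#endif
}

/* pass 2 then walks the same files again and calls fsync() on each one */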

Regards,
Jeff Davis


initdb-fsync-20120325.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incorrect assumptions with low LIMITs

2012-03-19 Thread Jeff Davis
On Sat, 2012-03-17 at 12:48 +, Simon Riggs wrote:
 The problems are as I described them
 
 (1) no account made for sparsity, and other factors leading to an
 overestimate of rows (N)
 
 (2) inappropriate assumption of the effect of LIMIT m, which causes a
 costly SeqScan to appear better than an IndexScan for low m/N, when in
 fact that is seldom the case.
 
 Overestimating N in (1) inverts the problem, so that an overestimate
 isn't the safe thing at all.

I think the actual problem has more to do with risk. The planner doesn't
know how uniform the distribution of the table is, which introduces risk
for the table scan.

I would tend to agree that for low selectivity fraction and a very low
limit (e.g. 1-3 in your example) and a large table, it doesn't seem like
a good risk to use a table scan. I don't know how that should be modeled
or implemented though.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-03-16 Thread Jeff Davis
On Fri, 2012-03-16 at 11:25 +0100, Andres Freund wrote:
  I take that back. There was something wrong with my test -- fadvise
  helps, but it only takes it from ~10s to ~6.5s. Not quite as good as I
  hoped.
 That's surprising. I wouldn't expect such a big difference between fadvise + 
 sync_file_range. Rather strange.

I discussed this with my colleague who knows linux internals, and he
pointed me directly at the problem. fadvise and sync_file_range in this
case are both trying to put the data in the io scheduler queue (still in
the kernel, not on the device). The difference is that fadvise doesn't
wait, and sync_file_range does (keep in mind, this is waiting to get in
a queue to go to the device, not waiting for the device to write it or
even receive it).

He indicated that 4096 is a normal number that one might use for the
queue size. But on my workstation at home (ubuntu 11.10), the queue is
only 128. I bumped it up to 256 and now fadvise is just as fast!

This won't be a problem on production systems, but that doesn't help us
a lot. People setting up a production system don't care about 6.5
seconds of set-up time anyway. Casual users and developers do (the
latter problem can be solved with the --nosync switch, but the former
problem is the one we're discussing).

So, it looks like fadvise is the right thing to do, but I expect we'll
get some widely varying results from actual users. Then again, maybe
casual users don't care much about ~10s for initdb anyway? It's a fairly
rare operation for everyone except developers.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incorrect assumptions with low LIMITs

2012-03-16 Thread Jeff Davis
On Fri, 2012-03-16 at 18:25 +, Simon Riggs wrote:
 Any time we apply a LIMIT clause to a plan with a SeqScan or
 unqualified IndexScan, we shouldn't assume the scan will do less than
 say 10% of the table. It might, but it's an unsafe assumption because
 as the selectivity decreases so does the safety of the assumption that
 rows are uniformly distributed.

Just trying to follow along. You mean as the selectivity _increases_
the safety of the assumption that the rows are uniformly distributed
decreases, right?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Storage Manager crash at mdwrite()

2012-03-16 Thread Jeff Davis
On Fri, 2012-03-16 at 18:02 -0400, Tareq Aljabban wrote:
 Thanks for your response Jeff.. 
 It's true that libhdfs code resides under the c++ folder, but in all
 of the documentation, libhdfs is referred to as the C interface of
 HDFS.
 Now what you're saying makes sense.. that nothing guarantees this will
 work well.. but in the phase of initDB, the function calls are done
 nicely without any exceptions.. when starting the postmaster,
 something wrong happens after 3 calls to libhdfs.. that's what I'm
 confused about..
 What it's the difference between the two processes (initDB, start
 postmaster), that might cause this crash to happen?

There is a lot of difference between those two. In particular, it looks
like the problem you are seeing is coming from the background writer,
which is not running during initdb.

I think the next step would be for you to compile in debug mode with
-O0, and attach to the bgwriter process with gdb, and step through it.
Then, you can see the exact path which causes the bgwriter to exit, and
that will give you a better idea where the problem is.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-03-15 Thread Jeff Davis
On Wed, 2012-03-14 at 10:26 +0100, Andres Freund wrote:
 On Wednesday, March 14, 2012 05:23:03 AM Jeff Davis wrote:
  On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote:
   for recursively everything in dir:
  posix_fadvise(fd, POSIX_FADV_DONTNEED);
   
   for recursively everything in dir:
  fsync(fd);
  
  Wow, that made a huge difference!
  
no sync:  ~ 1.0s
sync: ~10.0s
fadvise+sync: ~ 1.3s

I take that back. There was something wrong with my test -- fadvise
helps, but it only takes it from ~10s to ~6.5s. Not quite as good as I
hoped.

 Well, while the positive effect of this are rather large it also has the bad 
 effect of pushing the whole new database out of the cache. Which is not so 
 nice 
 if you want to run tests on it seconds later.

I was unable to see a regression when it comes to starting it up after
the fadvise+fsync. My test just started the server, created a table,
then stopped the server. It was actually a hair faster with the
directory that had been fadvise'd and then fsync'd, but I assume that
was noise. Regardless, this doesn't look like an issue.

 How are the results with sync_file_range(fd, 0, 0, 
 SYNC_FILE_RANGE_WRITE)? 

That is much faster than using fadvise. It goes down to ~2s.

Unfortunately, that's non-portable. Any other ideas? 6.5s is a little on
the annoying side (and causes some disconcerting sounds to come from my
disk), especially when we _know_ it can be done in 2s.

Anyway, updated patch attached.

Regards,
Jeff Davis


initdb-fsync-20120314.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Storage Manager crash at mdwrite()

2012-03-15 Thread Jeff Davis
On Thu, 2012-03-15 at 13:49 -0400, Tareq Aljabban wrote:
 I'm implementing an extention to mdwrite() at
 backend/storage/smgr/md.c
 When a block is written to the local storage using mdwrite(), I'm
 sending this block to an HDFS storage. 
 So far I don't need to read back the values I'm writing to HDFS. This
 approach is working fine in the initDB phase.
 However, when I'm running postgres (bin/pg_ctl start), the first few
 write operations run successfully, and then suddenly (after writing
 exactly 3 files to HDFS), I get a 130 exit code with the following
 message showing the JVM thread dump of HDFS:
 
 
 LOG:  background writer process (PID 29347) exited with exit code 130
 LOG:  terminating any other active server processes

...

 This seems like an HDFS issue, but the same code worked properly in
 initDB(). I replaced this HDFS write code with a code that writes
 always the same block (empty one) to HDFS regardless from the value
 received by mdwrite().. Kept getting the same issue after writing 3
 files.
 I copied this exact code to a separate C application and ran it there
 and it executed without any problems (I wrote/deleted 100 files).
 That's why I'm doubting that it's something related to postgreSQL.
 
 
 Any ideas on what's going wrong?

What code, exactly, did you change in md.c, and anywhere else in
postgres? Are you linking in new libraries/code from somewhere into the
postgres backend?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Storage Manager crash at mdwrite()

2012-03-15 Thread Jeff Davis
On Thu, 2012-03-15 at 19:36 -0400, Tareq Aljabban wrote:
 When configuring postgreSQL, I'm adding the libraries needed to run
 HDFS C API (libhdfs).
 

From the information below, it looks like C++.

 
  ./configure --prefix=/diskless/myUser/Workspace/EclipseWS1/pgsql
 --enable-depend --enable-cassert --enable-debug CFLAGS=$CFLAGS
 -I/diskless/myUser/Workspace/HDFS_Append/hdfs/src/c++/libhdfs
 -I/usr/lib/jvm/default-java/include LDFLAGS=$LDFLAGS
 -L/diskless/myUser/Workspace/HDFS_Append/hdfs/src/c++/libhdfs
 -L/diskless/myUser/Workspace/HDFS_Append/build/c++/Linux-i386-32/lib
 -L/usr/lib/jvm/default-java/jre/lib/i386/server -ljvm -lhdfs
 
 
 
 
 
 I have done lots of changes so far on how the storage manager works.
 In fact, I changed smgr.c so instead of calling regular md.c
 functions, that it would call my functions .
 For simplicity, you can say that whenever mdwrite() was supposed to be
 called, another function is also called beside it. so now what is
 being called is:
 mdwrite(param1, param2, buffer, );
 hdfswrite(param1, param2, buffer, );
 
 
 where hdfswrite() is the function where I write the buffer to HDFS.
 I changed hdfswrite() so it will always write only the same (dummy)
 buffer down to HDFS storage. Let's say dummyBufferFile. After
 writing this file 3 times to HDFS, I'm getting the message that I've
 shown in my first email.
 The same hdfswrite() code works without any issues when I run it in a
 separate application.
 
 
 Hope it's clear now.

Well, it's clear that there's a lot going on ;)

In other words, is there a reason you think that these would all work
together nicely?

I don't know the specifics, but I understand there are numerous perils
to linking complex C++ code into complex C code, particularly around
exception handling. Look in the archives for previous discussions around
C++.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

