Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Mikko Tiihonen

On 01/25/2012 06:40 PM, Tom Lane wrote:

Marko Kreen  writes:

On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote:

Huh?  How can that work?  If we decide to change the representation of
some other "well known type", say numeric, how do we decide whether a
client setting that bit is expecting that change or not?



It sets that bit *and* version code - which means that it is
up-to-date with all "well-known" type formats in that version.


Then why bother with the bit in the format code?  If you've already done
some other negotiation to establish what datatype formats you will
accept, this doesn't seem to be adding any value.


Basically, I see 2 scenarios here:



1) Client knows the result types and can set the
text/bin/version code safely, without further restrictions.



2) There is generic framework, that does not know query contents
but can be expected to track Postgres versions closely.
Such framework cannot say "binary" for results safely,
but *could* do it for some well-defined subset of types.


The hole in approach (2) is that it supposes that the client side knows
the specific datatypes in a query result in advance.  While this is
sometimes workable for application-level code that knows what query it's
issuing, it's really entirely untenable for a framework or library.
The only way that a framework can deal with arbitrary queries is to
introduce an extra round trip (Describe step) to see what datatypes
the query will produce so it can decide what format codes to issue
... and that will pretty much eat up any time savings you might get
from a more efficient representation.


This is pretty much what the jdbc driver already does, since it does not have
100% coverage of even the current binary formats. The first time you execute a
query it requests text encoding, but caches the Describe results. The next
time it sets the binary bits on all return columns that it knows how to
decode.


You really want to do the negotiation once, at connection setup, and
then be able to process queries without client-side prechecking of what
data types will be sent back.


I think my original minor_version patch tried to do that. It introduced a
per-connection setting for the version. The server GUC_REPORTed the maximum
supported minor_version but defaulted to the baseline wire format.
The jdbc client could then bump minor_version to a higher supported value
(an error is raised if the value is larger than what the server advertised).

A way was provided for the application using the jdbc driver to
override the requested minor_version in the rare event that something
broke (rare, because the jdbc driver generally does not expose the
wire encoding to applications).

Now, if pgbouncer and other pooling solutions reset minor_version
to 0, then it should work.
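
For illustration, under this proposal a session might negotiate the format
roughly like this (the parameter names are the ones proposed here, not an
existing server feature):

    -- at connection startup the server GUC_REPORTs its maximum, e.g.
    --   supported_minor_version = 1
    -- a driver that understands version 1 then opts in explicitly:
    SET minor_version = 1;   -- fails if larger than what the server advertised

    -- a pooler handing the connection to a new client resets to the baseline:
    SET minor_version = 0;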

Scenarios where the other end is too old to know about minor_version:
Vserver>>Vlibpq  => client does nothing -> use baseline version
Vlibpq>>Vserver  => no supported_minor_version in GUC_REPORT -> use baseline

Normal 9.2+ scenarios:
Vserver>Vlibpq  => libpq sets minor_version to the largest it supports
   -> libpq requested version used
Vlibpq>Vserver  => libpq notices that the server-supported value is lower than
   its own, so it sets minor_version to the server-supported value
   -> server version used

For a perl driver that exposes the wire format to the application by default,
I can envision that the driver needs to add a new API that applications
must use to explicitly bump minor_version up, instead of defaulting
to the largest value supported by the driver as in jdbc/libpq.

The reason I proposed an incrementing minor_version instead of bit flags
for new encodings is that it takes less space and is easier to document and
understand, which makes exposing it to applications possible.

But how do we handle postgres extensions that change their wire format?
Maybe we do need to have "oid:minor_version,oid:ver,oid:ver" as the
negotiated version variable?

-Mikko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pause at end of recovery

2012-01-25 Thread Fujii Masao
On Wed, Dec 28, 2011 at 7:27 PM, Simon Riggs  wrote:
> On Thu, Dec 22, 2011 at 6:16 AM, Simon Riggs  wrote:
>
>> I can see a reason to do this now. I've written patch and will commit
>> on Friday. Nudge me if I don't.
>
> It's hard to write this so it works in all cases and doesn't work in
> the right cases also.
>
> Basically, we can't get in the way of crash recovery, so the only way
> we can currently tell a crash recovery from an archive recovery is the
> presence of restore_command.
>
> If you don't have that and you haven't set a recovery target, it won't
> pause and there's nothing I can do, AFAICS.
>
> Please test this and review before commit.

What if a wrong recovery target is specified and archive recovery reaches the
end of the WAL files unexpectedly? Do we want to pause recovery at the end
even in this case? Otherwise, we lose the chance to correct the recovery
target and retry archive recovery.

One idea: would starting archive recovery with standby_mode=on meet your needs?
When archive recovery reaches the end of the WAL files, recovery pauses there
regardless of whether a recovery target is specified. If hot_standby is
enabled, you can check the contents, and if they look OK you can finish
recovery with pg_ctl promote.
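
For example, a minimal setup along those lines might look like this
(illustrative only; the archive path is a placeholder):

    # recovery.conf
    standby_mode = 'on'
    restore_command = 'cp /path/to/archive/%f %p'

    # postgresql.conf
    hot_standby = on

Once the standby has replayed everything and you have inspected the data,
recovery is finished with "pg_ctl promote".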

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.

2012-01-25 Thread Fujii Masao
On Wed, Jan 25, 2012 at 11:12 PM, Simon Riggs  wrote:
> On Wed, Jan 25, 2012 at 1:57 PM, Robert Haas  wrote:
>
>> I think that this would be a lot more clear if we described this as
>> synchronous_commit = remote_write rather than just synchronous_commit
>> = write.  Actually, the internal constant is named that way already,
>> but it's not exposed as such to the user.
>
> That's something to discuss at the end of the CF when people are less
> busy and we get more input.
>
> It's an easy change whatever we decide to do.

Added this to 9.2 Open Items.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2012-01-25 Thread Fujii Masao
On Thu, Jan 26, 2012 at 3:07 AM, Simon Riggs  wrote:
> On Wed, Jan 25, 2012 at 8:49 AM, Simon Riggs  wrote:
>> On Wed, Jan 25, 2012 at 8:16 AM, Fujii Masao  wrote:
>>
 What happens if we shutdown the WALwriter and then issue SIGHUP?
>>>
>>> SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned
>>> about the case where smart shutdown kills walwriter but some backends are
>>> still running?
>>> Currently SIGHUP affects full_page_writes and running backends use the new
>>> value of full_page_writes. But in the patch, SIGHUP doesn't affect...
>>>
>>> To address the problem, we should either postpone the shutdown of walwriter
>>> until all backends have gone away, or leave the update of full_page_writes
>>> to the checkpointer process instead of walwriter. Thoughts?
>>
>> checkpointer seems the correct place to me
>
>
> Done.

Thanks a lot!!

Earlier in this thread I proposed another small patch which fixes the issue
with an error message of pg_basebackup. If it's reasonable, could you commit it?
http://archives.postgresql.org/message-id/CAHGQGwENjSDN=f_vdpwvq53qru0cu9+wzkbvwnaexmawj-y...@mail.gmail.com

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoiding shutdown checkpoint at failover

2012-01-25 Thread Fujii Masao
On Fri, Jan 20, 2012 at 12:33 AM, Simon Riggs  wrote:
> On Wed, Jan 18, 2012 at 7:15 AM, Fujii Masao  wrote:
>> On Sun, Nov 13, 2011 at 5:13 PM, Simon Riggs  wrote:
>>> On Tue, Nov 1, 2011 at 12:11 PM, Simon Riggs  wrote:
>>>
 When I say skip the shutdown checkpoint, I mean remove it from the
 critical path of required actions at the end of recovery. We can still
 have a normal checkpoint kicked off at that time, but that no longer
 needs to be on the critical path.

 Any problems foreseen? If not, looks like a quick patch.
>>>
>>> Patch attached for discussion/review.
>>
>> This feature is what I want, and very helpful to shorten the failover time in
>> streaming replication.
>>
>> Here are the review comments. Though I've not checked enough whether
>> this feature works fine in all recovery patterns yet.
>>
>> LocalSetXLogInsertAllowed() must be called before LogEndOfRecovery().
>> LocalXLogInsertAllowed must be set to -1 after LogEndOfRecovery().
>>
>> The XLOG_END_OF_RECOVERY record is written to the WAL file with the newly
>> assigned timeline ID. But it must be written to the WAL file with the old one.
>> Otherwise, when re-entering a recovery after failover, we cannot find the
>> XLOG_END_OF_RECOVERY record at all.
>>
>> Before XLOG_END_OF_RECOVERY record is written,
>> RmgrTable[rmid].rm_cleanup() might write WAL records. They also
>> should be written to the WAL file with old timeline ID.
>>
>> When a recovery target is specified, we cannot write new WAL to the file
>> with the old timeline, because that means valid WAL records in it would be
>> overwritten with new WAL. So when a recovery target is specified,
>> ISTM that we cannot skip the end-of-recovery checkpoint. Or we might need
>> to save all information about timelines in the database cluster instead
>> of writing the XLOG_END_OF_RECOVERY record, and use it when re-entering
>> a recovery.
>>
>> LogEndOfRecovery() seems to need to call XLogFlush(). Otherwise,
>> what if the server crashes after the new timeline history file is created
>> and recovery.conf is removed, but before the XLOG_END_OF_RECOVERY record
>> has been flushed to disk?
>>
>> During recovery, when we replay XLOG_END_OF_RECOVERY record, we
>> should close the currently-opened WAL file and read the WAL file with
>> the timeline which XLOG_END_OF_RECOVERY record indicates.
>> Otherwise, when re-entering a recovery with old timeline, we cannot
>> reach new timeline.
>
>
>
> OK, some bad things there, thanks for the insightful comments.
>
>
>
> I think you're right that we can't skip the checkpoint if xlog_cleanup
> writes WAL records, since that implies at least one and maybe more
> blocks have changed and need to be flushed. That can be improved upon,
> but not now in 9.2.  Cleanup WAL is written in either the old or the new
> timeline, depending upon whether we increment it. So we don't need to
> change anything there, IMHO.
>
> The big problem is how we handle crash recovery after we startup
> without a checkpoint. No quick fixes there.
>
> So let me rethink this: The idea was that we can skip the checkpoint
> if we promote to normal running during streaming replication.
>
> WALReceiver has been writing to WAL files, so can write more data
> without all of the problems noted. Rather than write the
> XLOG_END_OF_RECOVERY record via XLogInsert we should write that **from
> the WALreceiver** as a dummy record by direct injection into the WAL
> stream. So the Startup process sees a WAL record that looks like it
> was written by the primary saying "promote yourself", although it was
> actually written locally by WALreceiver when requested to shutdown.
> That doesn't damage anything because we know we've received all the
> WAL there is. Most importantly we don't need to change any of the
> logic in a way that endangers the other code paths at end of recovery.
>
> Writing the record in that way means we would need to calculate the
> new tli slightly earlier, so we can input the correct value into the
> record. That also solves the problem of how to get additional standbys
> to follow the new master. The XLOG_END_OF_RECOVERY record is simply
> the contents of the newly written tli history file.
>
> If we skip the checkpoint and then crash before the next checkpoint we
> just change timeline when we see XLOG_END_OF_RECOVERY. When we replay
> the XLOG_END_OF_RECOVERY we copy the contents to the appropriate tli
> file and then switch to it.
>
> So this solves 2 problems: having other standbys follow us when they
> don't have archiving, and avoids the checkpoint.
>
> Let me know what you think.

Looks good to me.

One thing I would like to ask is why you think walreceiver is more
appropriate for writing the XLOG_END_OF_RECOVERY record than the startup
process. I was thinking the opposite, because if the startup process writes
it, we might be able to skip the end-of-recovery checkpoint even in the
file-based log-shipping case.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families

2012-01-25 Thread Tom Lane
Alvaro Herrera  writes:
> Excerpts from Robert Haas's message of mié ene 25 17:32:49 -0300 2012:
>> On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch  wrote:
> New version that repairs a defective test case.
>> 
>> Committed.  I don't find this to be particularly good style:
>> 
>> +   for (i = 0; i < old_natts && ret; i++)
>> +       ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
>> +              irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);
>> 
>> ...but I am not sure whether we have any formal policy against it, so
>> I just committed it as-is for now.  I would have surrounded the loop
>> with an if (ret) block and written the body of the loop as if
>> (condition) { ret = false; break; }.

> I find that code way too clever.

Not only is that code spectacularly unreadable, but has nobody noticed
that this commit broke the buildfarm?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL Restore process during recovery

2012-01-25 Thread Fujii Masao
On Tue, Jan 24, 2012 at 6:43 PM, Fujii Masao  wrote:
>> Cleaned up the points noted, new patch attached in case you wish to
>> review further.
>>
>> Still has bug, so still with me to fix.
>
> Thanks! Will review further.

The v3 patch contains lots of unrelated code changes like the following.

-   pid column of the
+   procpid column of the

You seem to have failed to extract the patch from your repository correctly.
So I used the v2 patch for the review. Sorry if I comment on things which
you've already fixed in the v3 patch.

Here are the comments. Most of them are not serious problems.

+/*
+ * GUC parameters
+ */
+int			WalRestoreDelay = 1;

Did you forget to change guc.c to define wal_restore_delay as a GUC parameter?
Or is just that source code comment incorrect?


+   elog(FATAL, "recovery_restore_command is too long");

Typo: s/recovery_restore_command/restore_command


+   InitLatch(&walrstr->WALRestoreLatch); /* initialize latch used in main loop */

That latch is a shared one. Shouldn't OwnLatch() be called instead of
InitLatch()? If yes, it's better to call DisownLatch() when walrestore exits,
though skipping DisownLatch() would be harmless because the latch is never
owned by a new process after walrestore exits.
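
For reference, the usual shared-latch convention looks roughly like the sketch
below (the WALRestoreLatch field name is taken from the patch; where each call
lives is my assumption, not something checked against the v3 patch):

    /* at shared-memory initialization time, once: */
    InitSharedLatch(&walrstr->WALRestoreLatch);

    /* in the walrestore process, at startup: */
    OwnLatch(&walrstr->WALRestoreLatch);

    /* the main loop then uses WaitLatch() / ResetLatch() on it */

    /* on process exit, e.g. from an on_shmem_exit callback: */
    DisownLatch(&walrstr->WALRestoreLatch);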


+   {
+   /* use volatile pointer to prevent code rearrangement */
+   volatile WalRestoreData *walrstr = WalRstr;
+
+   nextFileTli = walrstr->nextFileTli;

The declaration of "walrstr" is not required here because it's already done
at the head of WalRestoreNextFile().


+   if (restoredFromArchive)
+   {
+   /* use volatile pointer to prevent code rearrangement */
+   volatile WalRestoreData *walrstr = WalRstr;

Same as above.


+#define TMPRECOVERYXLOG		"RECOVERYXLOG"

ISTM that it's better to move this definition to an include file and we should
use it in all the places where the fixed value "RECOVERYXLOG" is still used.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] cursors FOR UPDATE don't return most recent row

2012-01-25 Thread Alvaro Herrera

This is my test case (all in one session):

CREATE TABLE foo (
  key int PRIMARY KEY,
  value   int
);

INSERT INTO foo VALUES (1, 1);

 BEGIN;
 DECLARE foo CURSOR FOR SELECT * FROM foo FOR UPDATE;
 UPDATE foo SET value = 2 WHERE key = 1;
 UPDATE foo SET value = 3 WHERE key = 1;
 FETCH 1 FROM foo;
 COMMIT;


I expected the FETCH to return one row, with the latest data, i.e.
(1, 3), but instead it's returning empty.

If instead I run both UPDATEs in another session, then I do get

alvherre=# FETCH 1 FROM foo;
 key | value 
-----+-------
   1 |     3
(1 fila)


Is this intended?

-- 
Álvaro Herrera 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE

2012-01-25 Thread Noah Misch
On Wed, Jan 25, 2012 at 11:17:27AM -0300, Alvaro Herrera wrote:
> 
> Excerpts from Noah Misch's message of dom ene 22 02:05:31 -0300 2012:
> 
> > Thanks.  I've attached a new version fixing this problem.
> 
> Looks good to me.  Can you please clarify this bit?
> 
>* Since we elsewhere require that all collations share 
> the same
>* notion of equality, don't compare collation here.
> 
> Since I'm not familiar with this code, it's difficult for me to figure
> out where this "elsewhere" is, and what does "same notion of equality"
> mean.

Good questions.  See the comment in front of ri_GenerateQualCollation(), the
comment in process_ordered_aggregate_single() at nodeAgg.c:605, the larger
comment in add_sort_column(), and the two XXX comments in
relation_has_unique_index_for().  We should probably document that assumption
in xindex.sgml to keep type implementors apprised.

"Same notion of equality" just means that "a COLLATE x = b COLLATE y" has the
same value for every x, y.
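
As a quick illustration of that assumption (a sketch only; availability of the
"en_US" collation depends on the platform's locale data): ordering may differ
between collations, but equality must not.

    SELECT 'a' < 'B' COLLATE "C"     AS lt_c,   -- false: byte order, 'B' (0x42) sorts before 'a' (0x61)
           'a' < 'B' COLLATE "en_US" AS lt_en,  -- typically true under linguistic ordering
           'a' = 'B' COLLATE "C"     AS eq_c,   -- false
           'a' = 'B' COLLATE "en_US" AS eq_en;  -- false: equality agrees across collations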

In any event, the patch needed a rebase, so I've attached it rebased and with
that comment edited to reference ri_GenerateQualCollation(), that being the
most-relevant source for the assumption in question.

Thanks for reviewing,
nm
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cb8ac67..ba20950 100644
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***
*** 276,281  static Oid transformFkeyCheckAttrs(Relation pkrel,
--- 276,282 
  					int numattrs, int16 *attnums,
  					Oid *opclasses);
  static void checkFkeyPermissions(Relation rel, int16 *attnums, int natts);
+ static CoercionPathType findFkeyCast(Oid targetTypeId, Oid sourceTypeId, Oid *funcid);
  static void validateCheckConstraint(Relation rel, HeapTuple constrtup);
  static void validateForeignKeyConstraint(char *conname,
  					Relation rel, Relation pkrel,
***
*** 357,362  static void ATPostAlterTypeCleanup(List **wqueue, AlteredTableInfo *tab, LOCKMOD
--- 358,364 
  static void ATPostAlterTypeParse(Oid oldId, char *cmd,
  					List **wqueue, LOCKMODE lockmode, bool rewrite);
  static void TryReuseIndex(Oid oldId, IndexStmt *stmt);
+ static void TryReuseForeignKey(Oid oldId, Constraint *con);
  static void change_owner_fix_column_acls(Oid relationOid,
  					Oid oldOwnerId, Oid newOwnerId);
  static void change_owner_recurse_to_sequences(Oid relationOid,
***
*** 5591,5596  ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5593,5600 
  				numpks;
  	Oid			indexOid;
  	Oid			constrOid;
+ 	bool		old_check_ok;
+ 	ListCell   *old_pfeqop_item = list_head(fkconstraint->old_conpfeqop);
  
  	/*
  	 * Grab an exclusive lock on the pk table, so that someone doesn't delete
***
*** 5707,5712  ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5711,5723 
  				(errcode(ERRCODE_INVALID_FOREIGN_KEY),
  				 errmsg("number of referencing and referenced columns for foreign key disagree")));
  
+ 	/*
+ 	 * On the strength of a previous constraint, we might avoid scanning
+ 	 * tables to validate this one.  See below.
+ 	 */
+ 	old_check_ok = (fkconstraint->old_conpfeqop != NIL);
+ 	Assert(!old_check_ok || numfks == list_length(fkconstraint->old_conpfeqop));
+ 
  	for (i = 0; i < numpks; i++)
  	{
  		Oid			pktype = pktypoid[i];
***
*** 5721,5726  ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
--- 5732,5738 
  		Oid			ppeqop;
  		Oid			ffeqop;
  		int16		eqstrategy;
+ 		Oid			pfeqop_right;
  
  		/* We need several fields out of the pg_opclass entry */
  		cla_ht = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclasses[i]));
***
*** 5763,5772  ATAddForeignKeyConstraint(AlteredTableInfo *tab, Relation rel,
  		pfeqop = get_opfamily_member(opfamily, opcintype, fktyped,
  									 eqstrategy);
  		if (OidIsValid(pfeqop))
  			ffeqop = get_opfamily_member(opfamily, fktyped, fktyped,
  										 eqstrategy);
  		else
! 			ffeqop = InvalidOid;	/* keep compiler quiet */
  
  		if (!(OidIsValid(pfeqop) && OidIsValid(ffeqop)))

Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families

2012-01-25 Thread Noah Misch
On Wed, Jan 25, 2012 at 03:32:49PM -0500, Robert Haas wrote:
> On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch  wrote:
> > New version that repairs a defective test case.
> 
> Committed.  I don't find this to be particularly good style:

Thanks.

> +   for (i = 0; i < old_natts && ret; i++)
> +       ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
> +              irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);
> 
> ...but I am not sure whether we have any formal policy against it, so
> I just committed it as-is for now.  I would have surrounded the loop
> with an if (ret) block and written the body of the loop as if
> (condition) { ret = false; break; }.

I value the savings in vertical space more than the lost idiomaticness.  This
decision is 90+% subjective, so I cannot blame you for concluding otherwise.
I do know the feeling of looking at PostgreSQL source code and wishing the
author had not attempted to conserve every line.
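
For readers following along, the two styles under discussion are roughly these
(condensed from the committed CheckIndexCompatible() code; not a standalone
snippet):

    /* as committed: the loop condition stops iterating once ret goes false */
    for (i = 0; i < old_natts && ret; i++)
        ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
               irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);

    /* the more conventional shape suggested in review */
    if (ret)
    {
        for (i = 0; i < old_natts; i++)
        {
            if (IsPolymorphicType(get_opclass_input_type(classObjectId[i])) ||
                irel->rd_att->attrs[i]->atttypid != typeObjectId[i])
            {
                ret = false;
                break;
            }
        }
    }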

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Measuring relation free space

2012-01-25 Thread Noah Misch
On Tue, Jan 24, 2012 at 11:24:08AM -0500, Jaime Casanova wrote:
> On Mon, Jan 23, 2012 at 7:18 PM, Noah Misch  wrote:
> > If someone feels like
> > doing it, +1 for making pgstattuple() count non-leaf free space.
> 
> actually i agreed that non-leaf pages are irrelevant... i just
> confirmed that in a production system with 300GB none of the indexes
> in an 84M-row table nor in a heavily updated one has more than 1 root
> page, all the rest are deleted, half_dead or leaf. so the possibility
> of bloat coming from non-leaf pages seems very odd

FWIW, the number to look at is internal_pages from pgstatindex():

[local] test=# create table t4 (c) as select * from generate_series(1,100);
SELECT 100
[local] test=# alter table t4 add primary key(c);
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "t4_pkey" for 
table "t4"
ALTER TABLE
[local] test=# select * from pgstatindex('t4_pkey');
-[ RECORD 1 ]------+---------
version            | 2
tree_level         | 2
index_size         | 22478848
root_block_no      | 290
internal_pages     | 10
leaf_pages         | 2733
empty_pages        | 0
deleted_pages      | 0
avg_leaf_density   | 90.06
leaf_fragmentation | 0

So, 0.4% of this index.  They appear in proportion to the logarithm of the
total index size.  I agree that bloat centered on them is unlikely.  Counting
them would be justified, but that is a question of formal accuracy rather than
practical importance.
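
The 0.4% figure is just internal_pages relative to the index's total page
count, which you can compute directly (pgstatindex() is from the pgstattuple
module; the output line is for the example index above):

    SELECT internal_pages,
           round(100.0 * internal_pages /
                 (index_size / current_setting('block_size')::int), 1) AS internal_pct
    FROM pgstatindex('t4_pkey');
    -- internal_pages | internal_pct
    --             10 |          0.4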

> but the possibility of bloat coming from the meta page doesn't exist,
> AFAIUI at least
> 
> we need the most accurate value about usable free space, because the
> idea is to add a sampler mode to the function so we don't scan the
> whole relation. that's why we still need the function.

I doubt we'd add this function solely on the basis that a future improvement
will make it useful.  For the patch to go in now, it needs to be useful now.
(This is not a universal principle, but it mostly holds for low-complexity
patches like this one.)

All my comments below would also apply to such a broader patch.

> btw... pgstattuple also has the problem that it's not using a ring buffer
> 
> 
> attached are two patches:
> - v5: is the same original patch but only track space in leaf, deleted
> and half_dead pages
> - v5.1: adds the same for all kind of indexes (problem is that this is
> inconsistent with the fact that pageinspect only manages btree indexes
> for everything else)

Let's take a step back.  Again, what you're proposing is essentially a faster
implementation of "SELECT free_percent FROM pgstattuple(rel)".  If this code
belongs in core at all, it belongs in the pgstattuple module.  Share as much
infrastructure as is reasonable between the user-visible functions of that
module.  For example, I'm suspecting that the pgstat_index() call tree should
be shared, with pgstat_index_page() checking a flag to decide whether to
gather per-tuple stats.

Next, compare the bits of code that differ between pgstattuple() and
relation_free_space(), convincing yourself that the differences are justified.
Each difference will yield one of the following conclusions:

1. Your code contains an innovation that would apply to both functions.  Where
not too difficult, merge these improvements into pgstattuple().  In order for
a demonstration of your new code's better performance to be interesting, we
must fix the same low-hanging fruit in its competitor.  One example is the use
of the bulk read strategy.  Another is the support for SP-GiST.

2. Your code is missing an essential behavior of pgstattuple().  Add it to
your code.  One example is the presence of CHECK_FOR_INTERRUPTS() calls.

3. Your code behaves differently from pgstattuple() due to a fundamental
difference in their tasks.  These are the only functional differences that
ought to remain in your finished patch; please point them out in comments.
For example, pgstat_heap() visits every tuple in the heap.  You'll have no
reason to do that; pgstattuple() only needs it to calculate statistics other
than free_percent.

In particular, I call your attention to the fact that pgstattuple() takes
shared buffer content locks before examining pages.  Your proposed patch does
not do so.  I do not know with certainty whether that falls under #1 or #2.
The broad convention is to take such locks, because we elsewhere want an exact
answer.  These functions are already inexact; they make no effort to observe a
consistent snapshot of the table.  If you convince yourself that the error
arising from not locking buffers is reasonably bounded, we can lose the locks
(in both functions -- conclusion #1).  Comments would then be in order.

With all that done, run some quick benchmarks: see how "SELECT free_percent
FROM pgstattuple(rel)" fares compared to "SELECT relation_free_space(rel)" for
a large heap and for a large B-tree index.  If the timing difference is too
small to be interesting to you, remove relation_free_space() and submit your
pgstattuple() improvements a

Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?

2012-01-25 Thread Robert Haas
On Sat, Jan 21, 2012 at 10:11 AM, Simon Riggs  wrote:
> Your caution is wise. All users of an index have already checked
> whether the index is usable at plan time, so although there is much
> code that assumes they can look at the index itself, that is not
> executed until after the correct checks.
>
> I'll look at VACUUM and other utilities, so thanks for that.
>
> I don't see much point in having the higher level lock, except perhaps
> as a test this code works.

I thought of another way this can cause a problem: suppose that while
we're dropping the relation with only ShareUpdateExclusiveLock, we get
as far as calling DropRelFileNodeBuffers.  Just after we finish, some
other process that holds AccessShareLock or RowShareLock or
RowExclusiveLock reads and perhaps even dirties a page in the
relation.  Now we've got pages in the buffer pool that might even be
dirty, but the backing file is truncated or perhaps removed
altogether.  If the page is dirty, I think the background writer will
eventually choke trying to write out the page.  If the page is not
dirty, I'm less certain whether anything is going to explode
violently, but it seems mighty sketchy to have pages in the buffer
pool with a buffer tag that don't exist any more.  As the comment
says:

 *  It is the responsibility of higher-level code to ensure that the
 *  deletion or truncation does not lose any data that could be needed
 *  later.  It is also the responsibility of higher-level code to ensure
 *  that no other process could be trying to load more pages of the
 *  relation into buffers.

...and of course, the intended mechanism is an AccessExclusiveLock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 3:11 AM, Heikki Linnakangas
 wrote:
> I've been thinking, what exactly is the important part of this group commit
> patch that gives the benefit? Keeping the queue sorted isn't all that
> important - XLogFlush() requests for commits will come in almost the correct
> order anyway.
>
> I also don't much like the division of labour between groupcommit.c and
> xlog.c. XLogFlush() calls GroupCommitWaitForLSN(), which calls back
> DoXLogFlush(), which is a bit hard to follow. (that's in my latest version;
> the original patch had similar division but it also cut across processes,
> with the wal writer actually doing the flush)
>
> It occurs to me that the WALWriteLock already provides much of the same
> machinery we're trying to build here. It provides FIFO-style queuing, with
> the capability to wake up the next process or processes in the queue.
> Perhaps all we need is some new primitive to LWLock, to make it do exactly
> what we need.
>
> Attached is a patch to do that. It adds a new mode to
> LWLockConditionalAcquire(), LW_EXCLUSIVE_BUT_WAIT. If the lock is free, it
> is acquired and the function returns immediately. However, unlike normal
> LWLockConditionalAcquire(), if the lock is not free it goes to sleep until
> it is released. But unlike normal LWLockAcquire(), when the lock is
> released, the function returns *without* acquiring the lock.
>
> I modified XLogFlush() to use that new function for WALWriteLock. It tries
> to get WALWriteLock, but if it's not immediately free, it waits until it is
> released. Then, before trying to acquire the lock again, it rechecks
> LogwrtResult to see if the record was already flushed by the process that
> released the lock.
>
> This is a much smaller patch than the group commit patch, and in my
> pgbench-tools tests on my laptop, it has the same effect on performance. The
> downside is that it touches lwlock.c, which is a change at a lower level.
> Robert's flexlocks patch probably would've been useful here.

I think you should break this off into a new function,
LWLockWaitUntilFree(), rather than treating it as a new LWLockMode.
Also, instead of adding lwWaitOnly, I would suggest that we generalize
lwWaiting and lwExclusive into a field lwWaitRequest, which can be set
to 1 for exclusive, 2 for shared, 3 for wait-for-free, etc.  That way
we don't have to add another boolean every time someone invents a new
type of wait - not that that should hopefully happen very often.  A
side benefit of this is that you can simplify the test in
LWLockRelease(): keep releasing waiters until you come to an exclusive
waiter, then stop.  This possibly saves one shared memory fetch and
subsequent test instruction, which might not be trivial - all of this
code is extremely hot.  We probably want to benchmark this carefully
to make sure that it doesn't cause a performance regression even just
from this:

+   if (!proc->lwWaitOnly)
+   lock->releaseOK = false;

I know it sounds crazy, but I'm not 100% sure that that additional
test there is cheap enough not to matter.  Assuming it is, though, I
kind of like the concept: I think there must be other places where
these semantics are useful.
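
To make the suggestion concrete, the generalized wait-request field might look
something like this (names are illustrative, not existing lwlock.c code):

    /* hypothetical replacement for the lwWaiting/lwExclusive booleans */
    typedef enum LWLockWaitRequest
    {
        LW_WAITREQ_NONE = 0,        /* not waiting */
        LW_WAITREQ_EXCLUSIVE = 1,   /* waiting to acquire exclusively */
        LW_WAITREQ_SHARED = 2,      /* waiting to acquire in shared mode */
        LW_WAITREQ_WAIT_FREE = 3    /* wait until free, but don't acquire */
    } LWLockWaitRequest;

LWLockRelease() could then wake queued processes until it reaches the first
LW_WAITREQ_EXCLUSIVE waiter, as described above.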

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] show Heap Fetches in EXPLAIN for index-only scans

2012-01-25 Thread Robert Haas
On Sat, Jan 21, 2012 at 9:50 PM, Jeff Janes  wrote:
> A review:
>
> [ review ]

Thanks.  Committed with hopefully-appropriate revisions.

> As a side-note, I noticed that I needed to run vacuum twice in a row
> to get the Heap Fetches to drop to zero.  I vaguely recall that only
> one vacuum was needed when ios first went in (and I had instrumented
> it to elog heap-fetches).  Does anyone know if this is the expected
> consequence of one of the recent changes we made to vacuum?

No, that's not expected.  The change we made to vacuum was to skip
pages that are busy - but it shouldn't be randomly skipping pages for
no reason.  I can reproduce what you're observing, though:

[rhaas 16384]$ pg_filedump 16411 | grep TLI.*Flags | grep -v 'Flags: 0x0004'
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0005 (HAS_FREE_LINES)
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0005 (HAS_FREE_LINES)

After updating a row in the table and checkpointing, the page the row
was on is marked full and the page that gets the new version becomes
not-all-visible:

[rhaas 16384]$ pg_filedump 16411 | grep TLI.*Flags | grep -v 'Flags: 0x0004'
 TLI: 0x0001  Prune XID: 0x03fb  Flags: 0x0003 (HAS_FREE_LINES|PAGE_FULL)
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0001 (HAS_FREE_LINES)

Now I vacuum the relation and checkpoint, and the page the *new* row
version is on becomes all-visible:

[rhaas 16384]$ pg_filedump 16411 | grep TLI.*Flags | grep -v 'Flags: 0x0004'
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0001 (HAS_FREE_LINES)
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0005 (HAS_FREE_LINES)

Now I vacuum it again and checkpoint, and now the old page also
becomes all-visible:

[rhaas 16384]$ pg_filedump 16411 | grep TLI.*Flags | grep -v 'Flags: 0x0004'
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0005 (HAS_FREE_LINES)
 TLI: 0x0001  Prune XID: 0x  Flags: 0x0005 (HAS_FREE_LINES)

But it seems to me that this is expected (if non-optimal) behavior.
Only the first pass of vacuum knows how to mark pages all-visible.
After the update, the first pass of the first vacuum sees a dead tuple
on the old page and truncates it to a dead line pointer.  When it
comes to the new page, it observes that the page is now all-visible
and marks it so.  It then does index vacuuming and returns to the
first page, marking the dead line pointer unused.  But during this
second visit to the old page, there's no possibility of marking the
page as all-visible, because the code doesn't know how to do that.
The next vacuum's first pass, however, can do so, because there are no
longer any dead tuples on the page.
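
A minimal way to reproduce this (illustrative; whether an index-only scan is
chosen, and hence whether "Heap Fetches" shows up at all, depends on planner
settings and table size):

    CREATE TABLE t (id int PRIMARY KEY, val int);
    INSERT INTO t SELECT g, g FROM generate_series(1, 10000) g;
    VACUUM t;                 -- pages become all-visible
    UPDATE t SET val = 0 WHERE id = 1;
    VACUUM t;                 -- removes the dead line pointer, but cannot re-mark the old page
    VACUUM t;                 -- only this pass marks the old page all-visible again
    EXPLAIN (ANALYZE) SELECT id FROM t WHERE id BETWEEN 1 AND 100;  -- watch "Heap Fetches"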

We could fix this by modifying lazy_vacuum_page(): since we have to
dirty the buffer anyway, we could recheck whether all the remaining
tuples on the page are now all-visible, and if so set the visibility
map bit.  This is probably desirable even apart from index-only scans,
because it will frequently save the next vacuum the cost of reading,
dirtying, and writing extra pages.  There will be some incremental CPU
cost, but that seems likely to be more than repaid by the I/O savings.

Thoughts?  Should we do this at all?  If so, should we squeeze it into
9.2 or leave it for 9.3?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum rate limit in KBps

2012-01-25 Thread Greg Smith

On 01/23/2012 11:16 PM, Robert Treat wrote:

I keep thinking Greg has mistaken happiness with the MB based info in
the vacuum patch as being happy without the output format, when really
it is all about increased visibility.


I haven't taken that as anything but evidence I'm at least moving in the 
right direction.  I'm relatively systematic about how I approach these 
things nowadays:  figure out what isn't being logged/accounted for yet, 
add visibility to that thing, iterate on improving its behavior as 
measured by that, then make the best sorts of behavior changes 
automatic.  This suggested feature change, moving around what I see as 
the worst part of the tuning knob UI, is an initial attempt along step 3 
in that path.  There's still plenty of ongoing work adding more 
visibility too.


To more quickly summarize the point I was trying to make, providing a 
meter that trivially shows you good vs. bad is almost the same problem 
as making it fully automatic.  If you can measure exactly when something 
needs to happen and how badly screwed up it is, that's the hard part of 
knowing when to just go fix it.  Right now, the autovacuum measure is 
based on the percent of change to something.  I don't think any improved 
vacuum triggering will go somewhere useful unless it starts with a 
better measure of how real-world bloat accumulates, which you cannot 
extract from the data collected yet.  The free space measurement thread 
and the ideas that have mainly been bouncing between Jaime and Noah 
recently on this subject are directly addressing that, the part that 
I've found to be the single most useful additional thing to monitor.  It 
may not be obvious that's providing information that can be consumed by 
autovacuum, but that was my intention for how these pieces will 
ultimately fit together.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Psql names completion.

2012-01-25 Thread Dominik Bylica

Hello.

This is probably not the right place to write, but I guess you are the ones to 
take care of it.


When I was creating a trigger with a command like:
create trigger asdf before update on beginninOfTheNameOfMyTable...
I hit tab and I got:
create trigger asdf before update on SET

That was obviously not what I expected to get.

Regards.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PATCH: pg_basebackup (missing exit on error)

2012-01-25 Thread Thomas Ogrisegg
While testing a backup script, I noticed that pg_basebackup exits with
0 even if it had errors while writing the backup to disk when the
backup file is sent to stdout. The attached patch should fix this
problem (and some special cases where the last write fails).

Regards,

Thomas
diff -uNr postgresql-9.1.2/src/bin/pg_basebackup/pg_basebackup.c postgresql-9.1.2-patch/src/bin/pg_basebackup/pg_basebackup.c
--- postgresql-9.1.2/src/bin/pg_basebackup/pg_basebackup.c	2011-12-01 22:47:20.0 +0100
+++ postgresql-9.1.2-patch/src/bin/pg_basebackup/pg_basebackup.c	2012-01-23 13:14:16.0 +0100
@@ -410,6 +410,7 @@
 {
 	fprintf(stderr, _("%s: could not write to compressed file \"%s\": %s\n"),
 			progname, filename, get_gz_error(ztarfile));
+	disconnect_and_exit(1);
 }
 			}
 			else
@@ -428,7 +429,14 @@
 #ifdef HAVE_LIBZ
 if (ztarfile)
 	gzclose(ztarfile);
+else
 #endif
+	if (fflush (tarfile) != 0)
+	{
+		fprintf(stderr, _("%s: error flushing stdout: %s\n"),
+strerror (errno));
+		disconnect_and_exit(1);
+	}
 			}
 			else
 			{
@@ -437,7 +445,14 @@
 	gzclose(ztarfile);
 #endif
 if (tarfile != NULL)
-	fclose(tarfile);
+{
+	if (fclose(tarfile) != 0)
+	{
+		fprintf(stderr, _("%s: error closing file \"%s\": %s\n"),
+progname, filename, strerror (errno));
+		disconnect_and_exit(1);
+	}
+}
 			}
 
 			break;
@@ -456,6 +471,7 @@
 			{
 fprintf(stderr, _("%s: could not write to compressed file \"%s\": %s\n"),
 		progname, filename, get_gz_error(ztarfile));
+disconnect_and_exit(1);
 			}
 		}
 		else

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Different error messages executing CREATE TABLE or ALTER TABLE to create a column "xmin"

2012-01-25 Thread Giuseppe Sucameli
Hi hackers,

the attached patch fixes the problem I explained in pgsql-bugs (forwarded).
It's a trivial problem, so no initial patch design was required.

Hope to see my patch applied.
Thanks for your work.

Regards.


-- Forwarded message --
From: Giuseppe Sucameli 
Date: Sun, Jan 22, 2012 at 2:22 PM
Subject: Different error messages executing CREATE TABLE or ALTER
TABLE to create a column "xmin"
To: pgsql-b...@postgresql.org


Hi all,

trying to create a table with a column xmin I get the
following error message:

test=> create table lx (xmin int);
ERROR:  column name "xmin" conflicts with a system
column name

Instead I get a different (and less understandable) error
message if I try to add a column named xmin to an
existing table:

test=> create table lx (i int);
CREATE TABLE
test=> alter table lx add xmin int;
ERROR:  column "xmin" of relation "lx" already exists.

The same problem occurs using "xmax" as column name.

I'm on Ubuntu 11.04.
Tried on both PostgreSQL 8.4.10 and 9.1.2

Regards.

--
Giuseppe Sucameli


-- 
Giuseppe Sucameli


postgresql_syscol_message.diff
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.2] Fix Leaky View Problem

2012-01-25 Thread Robert Haas
On Sun, Jan 22, 2012 at 5:57 AM, Kohei KaiGai  wrote:
>> This passes installcheck initially.  Then upon second invocation of
>> installcheck, it fails.
>>
>> It creates the role "alice", and doesn't clean it up.  On the next
>> invocation alice already exists and causes a failure in the test
>> select_views.
>>
> Thanks for your pointing out.
>
> The attached patch adds cleaning-up part of object being defined
> within this test;
> includes user "alice".

Urp.  I failed to notice this patch and committed a different fix for
the problem pointed out by Jeff.  I'm inclined to think it's OK to
leave the non-shared objects behind - among other things, if people
are testing pg_upgrade using the regression database, this will help
to ensure that pg_dump is handling security_barrier views correctly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families

2012-01-25 Thread Alvaro Herrera

Excerpts from Robert Haas's message of mié ene 25 19:05:44 -0300 2012:
> 
> On Wed, Jan 25, 2012 at 3:52 PM, Alvaro Herrera
>  wrote:
> >
> > Excerpts from Robert Haas's message of mié ene 25 17:32:49 -0300 2012:
> >> On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch  wrote:
> >> > New version that repairs a defective test case.
> >>
> >> Committed.  I don't find this to be particularly good style:
> >>
> >> +       for (i = 0; i < old_natts && ret; i++)
> >> +           ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
> >> +                  irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);
> >>
> >> ...but I am not sure whether we have any formal policy against it, so
> >> I just committed it as-is for now.  I would have surrounded the loop
> >> with an if (ret) block and written the body of the loop as if
> >> (condition) { ret = false; break; }.
> >
> > I find that code way too clever.
> 
> The way he wrote it, or the way I proposed to write it?

The code as committed.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 3:52 PM, Alvaro Herrera
 wrote:
>
> Excerpts from Robert Haas's message of mié ene 25 17:32:49 -0300 2012:
>> On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch  wrote:
>> > New version that repairs a defective test case.
>>
>> Committed.  I don't find this to be particularly good style:
>>
>> +       for (i = 0; i < old_natts && ret; i++)
>> +           ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
>> +                  irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);
>>
>> ...but I am not sure whether we have any formal policy against it, so
>> I just committed it as-is for now.  I would have surrounded the loop
>> with an if (ret) block and written the body of the loop as if
>> (condition) { ret = false; break; }.
>
> I find that code way too clever.

The way he wrote it, or the way I proposed to write it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch for parameterized inner paths

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 1:24 PM, Tom Lane  wrote:
> Also, you're assuming that the changes have no upside whatsoever, which
> I fondly hope is not the case.  Large join problems tend not to execute
> instantaneously --- so nobody is going to complain if the planner takes
> awhile longer but the resulting plan is enough better to buy that back.
> In my test cases, the planner *is* finding better plans, or at least
> ones with noticeably lower estimated costs.  It's hard to gauge how
> much that translates to in real-world savings, since I don't have
> real data loaded up.  I also think, though I've not tried to measure,
> that I've made planning cheaper for very simple queries by eliminating
> some overhead in those cases.

I had a 34-table join on one of the last applications I maintained
that planned and executed in less than 2 seconds.  That was pushing
it, but I had many joins in the 10-20 table range that planned and
executed in 100-200 ms.  I agree that if you are dealing with a
terabyte table - or even a gigabyte table -  then the growth of
planning time will probably not bother anyone even if you fail to find
a better plan, and will certainly make the user very happy if you do.
But on tables with only a megabyte of data, it's not nearly so
clear-cut.  In an ideal world, I'd like the amount of effort we spend
planning to be somehow tied to the savings we can expect to get, and
deploy optimizations like this only in cases where we have a
reasonable expectation of that effort being repaid.

AIUI, this is mostly going to benefit cases like small LJ (big1 IJ
big2) and, of course, those cases aren't going to arise if your query
only involves small tables, or even if you have something like big IJ
small1 IJ small2 IJ small3 IJ small4 LJ small5 LJ small6 IJ small7,
which is a reasonably common pattern for me.  Now, if you come back
and say, ah, well, those cases aren't the ones that are going to be
harmed by this, then maybe we should have a more detailed conversation
about where the mines are.  Or maybe it is helping in more cases than
I'm thinking about at the moment.

>> To be clear, I'd love to have this feature.  But if there is a choice
>> between reducing planning time significantly for everyone and NOT
>> getting this feature, and increasing planning time significantly for
>> everyone and getting this feature, I think we will make more people
>> happy by doing the first one.
>
> We're not really talking about "are we going to accept or reject a
> specific feature".  We're talking about whether we're going to decide
> that the last two years worth of planner development were headed in
> the wrong direction and we're now going to reject that and try to
> think of some entirely new concept.  This isn't an isolated patch,
> it's the necessary next step in a multi-year development plan.  The
> fact that it's a bit slower at the moment just means there's still
> work to do.

I'm not proposing that you should never commit this.  I'm proposing
that any commit by anyone that introduces a 35% performance regression
is unwise, and doubly so at the end of the release cycle.  I have
every confidence that you can improve the code further over time, but
the middle of the last CommitFest is not a great time to commit code
that, by your own admission, needs a considerable amount of additional
work.  Sure, there are some things that we're not going to find out
until the code goes into production, but it seems to me that you've
already uncovered a fairly major performance problem that is only
partially fixed.  Once this is committed, it's not coming back out, so
we're either going to have to figure out how to fix it before we
release, or release with a regression in certain cases.  If you got it
down to 10% I don't think I'd be worried, but a 35% regression that we
don't know how to fix seems like a lot.

On another note, nobody besides you has looked at the code yet, AFAIK...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql COPY vs. ON_ERROR_ROLLBACK, multi-command strings

2012-01-25 Thread Alvaro Herrera

Excerpts from Noah Misch's message of sáb ene 14 12:40:02 -0300 2012:
> It has bothered me that psql's \copy ignores the ON_ERROR_ROLLBACK setting.
> Only SendQuery() takes note of ON_ERROR_ROLLBACK, and \copy, like all
> backslash commands, does not route through SendQuery().  Looking into this
> turned up several other weaknesses in psql's handling of COPY.

Interesting.

Committed, thanks.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Wed, Jan 25, 2012 at 02:50:09PM -0600, Merlin Moncure wrote:
> On Wed, Jan 25, 2012 at 2:29 PM, Marko Kreen  wrote:
> >> well, I see the following cases:
> >> 1) Vserver > Vapplication: server downgrades wire formats to
> >> applications version
> >> 2) Vapplication > Vlibpq > Vserver: since the application is
> >> reading/writing formats the server can't understand, an error should
> >> be raised if they are used in either direction
> >> 3) Vlibpq >= VApplication > Vserver: same as above, but libpq can
> >> 'upconvert' low version wire format to application's wire format or
> >> error otherwise.
> >
> > I don't see why you special-case libpq here.  There is no reason
> > libpq cannot pass older/newer formats through.  The only thing that
> > matters is the parser/formatter version.  If that is done in libpq,
> > then the app version does not matter.  If it's done in the app, then
> > the libpq version does not matter.
> 
> Only because if the app is targeting wire format N, but the server can
> only handle N-1, libpq has the opportunity to fix it up.  That could
> just be overthinking it though.

I think it's overthinking it.  The value should be formatted/parsed just
once.  The server side must support processing different versions.
Whether the client side supports downgrading is up to client-side
programmers.

If you want to write a compatible client, you have a choice of using a
proper wrapper API, or simply writing baseline formatting and ignoring
format changes in new versions.

Both are valid approaches and I think we should keep it that way.

> >> By far, the most common cause of problems (both in terms of severity
> >> and frequency) is case #1.  #3 allows a 'compatibility mode' via
> >> libpq, but that comes at significant cost of complexity since libpq
> >> needs to be able to translate wire formats up (but not down).  #2/3 is
> >> a less common problem though as it's more likely the application can
> >> be adjusted to get up to speed: so to keep things simple we can maybe
> >> just error out in those scenarios.
> >
> > I don't like the idea of "conversion".  Instead, either the client
> > writes values through an API that picks the format based on the server
> > version, or it writes them for a specific version only.  In the latter
> > case it cannot work with an older server, unless the fixed version is
> > the baseline.
> 
> ok.  another point about that: libpq isn't really part of the solution
> anyway, since there are other popular fully native protocol consumers,
> including (and especially) jdbc, but also python, node.js, etc.
> 
> that's why I was earlier insisting on a protocol bump, so that we
> could in the new protocol force application version to be advertised.
> v3 would remain caveat emptor for wire formats but v4 would not.

We can bump major/minor anyway to inform clients about new
functionality.  I don't particularly care about that.  What
I'm interested in is what the actual type negotiation looks like.

It might be possible we could get away without bumping anything.
But I have not thought about that angle too deeply yet.

> >> In the database, we need to maintain outdated send/recv functions
> >> basically forever and as much as possible try and translate old wire
> >> format data to and from newer backend structures (maybe in very
> >> specific cases that will be impossible such that the application is
> >> SOL, but that should be rare).  All send/recv functions, including
> >> user created ones need to be stamped with a version token (database
> >> version?).  With the versions of the application, libpq, and all
> >> server functions, we can determine all wire formats as long as we
> >> assume the application's targeted database version represents all the
> >> wire formats it was using.
> >>
> > My good ideas stop there: the exact mechanics of how the usable set of
> >> functions are determined, how exactly the adjusted type look ups will
> >> work, etc. would all have to be sorted out.  Most of the nastier parts
> >> though (protocol changes notwithstanding) are not in libpq, but the
> >> server.  There's just no quick fix on the client side I can see.
> >
> > It does not need to be complex - just bring the version number to the
> > i/o function and let it decide whether it cares about it or not.
> > Most functions will not.  Only those that we want to change in a
> > compatible manner need to look at it.
> 
> well, maybe instead of passing the version number around, the server
> installs the proper compatibility send/recv functions just once at
> session start-up, so your code isn't littered with stuff like
> if (version > n) do this; else do this;?

Seems confusing.  Note that type i/o functions are user-callable.
How should they act then?

Also note that if()s are needed only for types that want to change their
on-wire formatting.  Considering the mess an incompatible on-wire format
change can cause, it's a good price to pay.
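
To sketch what "bring the version number to the i/o function" could look like
(all names, and the idea of a per-connection wire_version, are illustrative --
this is not an existing PostgreSQL API):

    static void send_array_v0(const int *elems, int n);   /* baseline encoding */
    static void send_array_v1(const int *elems, int n);   /* optimized encoding */

    static void
    send_array(int wire_version, const int *elems, int n)
    {
        /* most i/o functions would ignore wire_version entirely; only types
         * whose on-wire format changed need the branch */
        if (wire_version >= 1)
            send_array_v1(elems, n);
        else
            send_array_v0(elems, n);
    }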

> > But seriously - on-wire compatibility is good thing, do not fear it...
> 
> sure -- but for postgres I just

Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-25 Thread Nathan Boley
> I actually don't know much about the I/O subsystem, but, no, WAL is
> not separated from data.  I believe $PGDATA is on a SAN, but I don't
> know anything about its characteristics.

The computer is here:
http://www.supermicro.nl/Aplus/system/2U/2042/AS-2042G-6RF.cfm

$PGDATA is on a 5-disk SATA software RAID 5.

Is there anything else that would be useful to know?  (Sorry, this
isn't a subject that I'm very familiar with.)

-Nathan



Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families

2012-01-25 Thread Alvaro Herrera

Excerpts from Robert Haas's message of Wed Jan 25 17:32:49 -0300 2012:
> On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch  wrote:
> > New version that repairs a defective test case.
> 
> Committed.  I don't find this to be particularly good style:
> 
> +   for (i = 0; i < old_natts && ret; i++)
> +       ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
> +              irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);
> 
> ...but I am not sure whether we have any formal policy against it, so
> I just committed it as-is for now.  I would have surrounded the loop
> with an if (ret) block and written the body of the loop as if
> (condition) { ret = false; break; }.

I find that code way too clever.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Merlin Moncure
On Wed, Jan 25, 2012 at 2:29 PM, Marko Kreen  wrote:
>> well, I see the following cases:
>> 1) Vserver > Vapplication: server downgrades wire formats to
>> applications version
>> 2) Vapplication > Vlibpq > Vserver: since the application is
>> reading/writing formats the server can't understand, an error should
>> be raised if they are used in either direction
>> 3) Vlibpq >= VApplication > Vserver: same as above, but libpq can
>> 'upconvert' low version wire format to application's wire format or
>> error otherwise.
>
> I don't see why you special-case libpq here.  There is no reason
> libpq cannot pass older/newer formats through.  Only thing that
> matters it parser/formatter version.  If that is done in libpq,
> then app version does not matter.  If it's done in app, then
> libpq version does not matter.

Only because if the app is targeting wire format N, but the server can
only handle N-1, libpq has the opportunity to fix it up.  That could
be just overthinking it, though.

>> By far, the most common cause of problems (both in terms of severity
>> and frequency) is case #1.  #3 allows a 'compatibility mode' via
>> libpq, but that comes at significant cost of complexity since libpq
>> needs to be able to translate wire formats up (but not down).  #2/3 is
>> a less common problem though as it's more likely the application can
>> be adjusted to get up to speed: so to keep things simple we can maybe
>> just error out in those scenarios.
>
> I don't like the idea of "conversion".  Instead either client
> writes values through API that picks format based on server version,
> or it writes them for specific version only.  In latter case it cannot
> work with older server.  Unless the fixed version is the baseline.

ok.  another point about that: libpq isn't really part of the solution
anyways since there are other popular fully native protocol consumers,
including (and especially) jdbc, but also python, node.js etc etc.

that's why I was earlier insisting on a protocol bump, so that we
could in the new protocol force application version to be advertised.
v3 would remain caveat emptor for wire formats but v4 would not.

>> In the database, we need to maintain outdated send/recv functions
>> basically forever and as much as possible try and translate old wire
>> format data to and from newer backend structures (maybe in very
>> specific cases that will be impossible such that the application is
>> SOL, but that should be rare).  All send/recv functions, including
>> user created ones need to be stamped with a version token (database
>> version?).  With the versions of the application, libpq, and all
>> server functions, we can determine all wire formats as long as we
>> assume the application's targeted database version represents all the
>> wire formats it was using.
>>
>> My good ideas stop there: the exact mechanics of how the usable set of
>> functions are determined, how exactly the adjusted type look ups will
>> work, etc. would all have to be sorted out.  Most of the nastier parts
>> though (protocol changes notwithstanding) are not in libpq, but the
>> server.  There's just no quick fix on the client side I can see.
>
> It does not need to be complex - just bring the version number to
> i/o function and let it decide whether it cares about it or not.
> Most functions will not..  Only those that we want to change in
> compatible manner need to look at it.

well, maybe instead of passing the version number around, the server
installs the proper compatibility send/recv functions just once at
session startup, so your code isn't littered with stuff like
if(version > n) do this; else do this;?

> But seriously - on-wire compatibility is good thing, do not fear it...

sure -- but for postgres I just don't think it's realistic, especially
for the binary wire formats.  a json based data payload could give it
to you (and I'm only half kidding) :-).

merlin



Re: [HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families

2012-01-25 Thread Robert Haas
On Sun, Jan 22, 2012 at 12:23 AM, Noah Misch  wrote:
> New version that repairs a defective test case.

Committed.  I don't find this to be particularly good style:

+   for (i = 0; i < old_natts && ret; i++)
+       ret = (!IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&
+              irel->rd_att->attrs[i]->atttypid == typeObjectId[i]);

...but I am not sure whether we have any formal policy against it, so
I just committed it as-is for now.  I would have surrounded the loop
with an if (ret) block and written the body of the loop as if
(condition) { ret = false; break; }.
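
For reference, the suggested shape would look roughly like this (the same
fragment, just restructured; whitespace is illustrative):

if (ret)
{
    for (i = 0; i < old_natts; i++)
    {
        if (IsPolymorphicType(get_opclass_input_type(classObjectId[i])) ||
            irel->rd_att->attrs[i]->atttypid != typeObjectId[i])
        {
            ret = false;
            break;
        }
    }
}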

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Wed, Jan 25, 2012 at 01:43:03PM -0600, Merlin Moncure wrote:
> On Wed, Jan 25, 2012 at 1:24 PM, Marko Kreen  wrote:
> > On Wed, Jan 25, 2012 at 12:54:00PM -0600, Merlin Moncure wrote:
> >> On Wed, Jan 25, 2012 at 11:24 AM, Marko Kreen  wrote:
> >> > I specifically want to avoid any sort of per-connection
> >> > negotation, except the "max format version supported",
> >> > because it will mess up multiplexed usage of single connection.
> >> > Then they need to either disabled advanced formats completely,
> >> > or still do it per-query somehow (via GUCs?) which is mess.
> >>
> >> Being able to explicitly pick format version other than the one the
> >> application was specifically written against adds a lot of complexity
> >> and needs to be justified.  Maybe you're trying to translate data
> >> between two differently versioned servers?  I'm trying to understand
> >> the motive behind your wanting finer grained control of picking format
> >> version...
> >
> > You mean if client has written with version N formats, but connects
> > to server with version N-1 formats?  True, simply not supporting
> > such case simplifies client-side API.
> >
> > But note that it does not change anything on protocol level, it's purely
> > client-API specific.  It may well be that some higher-level APIs
> > (JDBC, Npgsql, Psycopg) may support such downgrade, but with lower-level
> > API-s (raw libpq), it may be optional whether the client wants to
> > support such usage or not.
> 
> well, I see the following cases:
> 1) Vserver > Vapplication: server downgrades wire formats to
> applications version
> 2) Vapplication > Vlibpq > Vserver: since the application is
> reading/writing formats the server can't understand, an error should
> be raised if they are used in either direction
> 3) Vlibpq >= VApplication > Vserver: same as above, but libpq can
> 'upconvert' low version wire format to application's wire format or
> error otherwise.

I don't see why you special-case libpq here.  There is no reason
libpq cannot pass older/newer formats through.  The only thing that
matters is the parser/formatter version.  If that is done in libpq,
then the app version does not matter.  If it's done in the app, then
the libpq version does not matter.

> By far, the most common cause of problems (both in terms of severity
> and frequency) is case #1.  #3 allows a 'compatibility mode' via
> libpq, but that comes at significant cost of complexity since libpq
> needs to be able to translate wire formats up (but not down).  #2/3 is
> a less common problem though as it's more likely the application can
> be adjusted to get up to speed: so to keep things simple we can maybe
> just error out in those scenarios.

I don't like the idea of "conversion".  Instead, either the client
writes values through an API that picks the format based on server version,
or it writes them for a specific version only.  In the latter case it cannot
work with an older server, unless the fixed version is the baseline.
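
As a sketch of the first option, the version such an API would settle on is
simply the lower of the two (a hypothetical helper, not an existing libpq
function):

static int
effective_wire_version(int client_max_version, int server_max_version)
{
    /* never ask the server for formats newer than what it advertises */
    return (client_max_version < server_max_version)
            ? client_max_version
            : server_max_version;
}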

> In the database, we need to maintain outdated send/recv functions
> basically forever and as much as possible try and translate old wire
> format data to and from newer backend structures (maybe in very
> specific cases that will be impossible such that the application is
> SOL, but that should be rare).  All send/recv functions, including
> user created ones need to be stamped with a version token (database
> version?).  With the versions of the application, libpq, and all
> server functions, we can determine all wire formats as long as we
> assume the application's targeted database version represents all the
> wire formats it was using.
> 
> My good ideas stop there: the exact mechanics of how the usable set of
> functions are determined, how exactly the adjusted type look ups will
> work, etc. would all have to be sorted out.  Most of the nastier parts
> though (protocol changes notwithstanding) are not in libpq, but the
> server.  There's just no quick fix on the client side I can see.

It does not need to be complex - just bring the version number to the
i/o function and let it decide whether it cares about it or not.
Most functions will not.  Only those that we want to change in a
compatible manner need to look at it.

But I don't see that there is a danger of regular changes in wire
formats.  So most of the functions will ignore the versioning,
including the ones that don't care about compatibility.

But seriously - on-wire compatibility is a good thing, do not fear it...

-- 
marko




Re: [HACKERS] WIP patch for parameterized inner paths

2012-01-25 Thread David E. Wheeler
On Jan 25, 2012, at 12:19 PM, Tom Lane wrote:

>> Why not create a branch? IIRC the build farm can be configured to run 
>> branches.
> 
> I already know what the patch does against the regression tests.
> Buildfarm testing is not of interest here.  What would be of help is,
> say, Kevin volunteering to load up his Circuit Courts software and data
> into a git-head server and see how performance looks with and without
> the patch.  Distribution of the code doesn't strike me as being the
> bottleneck.

Yeah, but it would be easier to keep a branch up-to-date via `git rebase` than 
to maintain the patch, I would think. And if it’s a remote branch, then simpler 
distribution is a bonus.

Best,

David




Re: [HACKERS] WIP patch for parameterized inner paths

2012-01-25 Thread Tom Lane
"David E. Wheeler"  writes:
> On Jan 25, 2012, at 10:24 AM, Tom Lane wrote:
>> Anyway, I'd be willing to hold off committing if someone were to
>> volunteer to test an unintegrated copy of the patch against some
>> moderately complicated application.  But it's a sufficiently large
>> patch that I don't really care to sit on it and try to maintain it
>> outside the tree for a long time.

> Why not create a branch? IIRC the build farm can be configured to run 
> branches.

I already know what the patch does against the regression tests.
Buildfarm testing is not of interest here.  What would be of help is,
say, Kevin volunteering to load up his Circuit Courts software and data
into a git-head server and see how performance looks with and without
the patch.  Distribution of the code doesn't strike me as being the
bottleneck.

regards, tom lane



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 12:00 PM, Jeff Janes  wrote:
> On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas  wrote:
>> Early yesterday morning, I was able to use Nate Boley's test machine
>> do a single 30-minute pgbench run at scale factor 300 using a variety
>> of trees built with various patches, and with the -l option added to
>> track latency on a per-transaction basis.  All tests were done using
>> 32 clients and permanent tables.  The configuration was otherwise
>> identical to that described here:
>>
>> http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com
>
> Previously we mostly used this machine for CPU benchmarking.  Have you
> previously described the IO subsystem?  That is becoming relevant for
> these types of benchmarks.   For example, is WAL separated from data?

I actually don't know much about the I/O subsystem, but, no, WAL is
not separated from data.  I believe $PGDATA is on a SAN, but I don't
know anything about its characteristics.

>> By doing this, I hoped to get a better understanding of (1) the
>> effects of a scale factor too large to fit in shared_buffers,
>
> In my hands, the active part of data at scale of 300 fits very
> comfortably into 8GB shared_buffers.
>
> Indeed, until you have run a very long time so that pgbench_history
> gets large, everything fits in 8GB.

Hmm, my mistake.  Maybe I need to crank up the scale factor some more.
This is why good benchmarking is hard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-25 Thread Jeff Janes
On Wed, Jan 25, 2012 at 9:09 AM, Robert Haas  wrote:
> On Wed, Jan 25, 2012 at 12:00 PM, Jeff Janes  wrote:
>> On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas  wrote:
>>> Early yesterday morning, I was able to use Nate Boley's test machine
>>> do a single 30-minute pgbench run at scale factor 300 using a variety
>>> of trees built with various patches, and with the -l option added to
>>> track latency on a per-transaction basis.  All tests were done using
>>> 32 clients and permanent tables.  The configuration was otherwise
>>> identical to that described here:
>>>
>>> http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com
>>
>> Previously we mostly used this machine for CPU benchmarking.  Have you
>> previously described the IO subsystem?  That is becoming relevant for
>> these types of benchmarks.   For example, is WAL separated from data?
>
> I actually don't know much about the I/O subsystem, but, no, WAL is
> not separated from data.  I believe $PGDATA is on a SAN, but I don't
> know anything about its characteristics.

I think the checkpointing issues that Greg is exploring are important,
and I'm pretty sure that that is what is limiting your TPS in this
case.  However, I don't think we can make much progress on that front
using a machine whose IO subsystem is largely unknown, and not
tweakable.

So I think the best use of this machine would be to explore the purely
CPU based scaling issues, like freelist, CLOG, and XLogInsert.

To do that, I'd set the scale factor small enough that the entire data
set fits within what the kernel is willing to cache dirty, based on the
kernel parameters:

egrep .  /proc/sys/vm/dirty_*

Then set shared_buffers to less than what that scale factor needs,
so the freelist gets exercised.

And neutralize checkpoints by setting fsync=off, so they don't
generate physical IO that we can't control for, given the current
constraints on the machine setup.

Cheers,

Jeff



[HACKERS] pg_bulkload ON_DUPLICATE_MERGE

2012-01-25 Thread Benjamin Johnson
PG Gurus,

I have a table like this:

CREATE TABLE filemods (
  guid          BIGINT NOT NULL UNIQUE,
  filepath_guid BIGINT NOT NULL,
  createtime    TIMESTAMP WITH TIME ZONE DEFAULT NULL,
  writetime     TIMESTAMP WITH TIME ZONE DEFAULT NULL,
  deletetime    TIMESTAMP WITH TIME ZONE DEFAULT NULL
);

One "event" might have (guid, createtime, writetime, deletetime) =
(1, '2012-01-25 11:00:00', NULL, NULL) and another event might have
(1, NULL, '2012-01-25 11:05:00', NULL), and the end result should be:
(1, '2012-01-25 11:00:00', '2012-01-25 11:05:00', NULL).
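
For illustration only, here is that merge rule as self-contained C (not
pg_bulkload code; the real patch has to operate on index/heap tuples rather
than plain arrays):

#include <stdio.h>

#define NCOLS 3                 /* createtime, writetime, deletetime */

/* keep dst's value unless it is NULL, in which case take src's */
static void merge_row(const char *dst[NCOLS], const char *src[NCOLS])
{
    for (int i = 0; i < NCOLS; i++)
        if (dst[i] == NULL && src[i] != NULL)
            dst[i] = src[i];
}

int main(void)
{
    const char *a[NCOLS] = {"2012-01-25 11:00:00", NULL, NULL};
    const char *b[NCOLS] = {NULL, "2012-01-25 11:05:00", NULL};

    merge_row(a, b);            /* a now holds the combined event */
    for (int i = 0; i < NCOLS; i++)
        printf("%s\n", a[i] ? a[i] : "NULL");
    return 0;
}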


I'm trying to modify pg_bulkload to "merge" two rows together.  The
changes I have made seem to be working, although I would like input on
what I am doing that is unsafe or terribly wrong.  You can be brutal.

I've seen incredible write speed using pg_bulkload.  If I can just
get it to "consolidate" our rows based on the unique key it will remove
a lot of complexity in our software.

Also, I'm not entirely sure this mailing list is the correct one, but
with all the internals you all know, I'm hoping you can help point out
nasty flaws in my algorithm.  This is the first legitimate attempt I
have made at modifying PG source, so I'm not real familiar with the
proper way of loading pages and tuples and updating heaps and all that
pg core stuff.

Here are the modifications to pg_btree.c (from pg_bulkload HEAD):

http://pastebin.com/U23CapvR

I also attached the patch.

Thank you!!

Ben


-- 
Benjamin Johnson
http://getcarbonblack.com/ | @getcarbonblack
cell: 312.933.3612
diff -r 3f065ec72ab8 pgbulkload/lib/pg_btree.c
--- a/pgbulkload/lib/pg_btree.c Fri Jan 20 09:26:20 2012 -0600
+++ b/pgbulkload/lib/pg_btree.c Wed Jan 25 13:37:43 2012 -0600
@@ -398,6 +398,8 @@
 	BTReaderTerm(&reader);
 }
 
+void merge_tuples(Relation heap, IndexTuple itup_dst, IndexTuple itup_src);
+
 /*
  * _bt_mergeload - Merge two streams of index tuples into new index files.
  */
@@ -462,7 +464,6 @@
 			}
 			else
 			{
-				// TODO -- BSJ
 				if (on_duplicate == ON_DUPLICATE_KEEP_NEW)
 				{
 					self->dup_old++;
@@ -470,7 +471,21 @@
 						RelationGetRelationName(wstate->index));
 					itup2 = BTReaderGetNextItem(btspool2);
 				}
-				else
+				else if (on_duplicate == ON_DUPLICATE_MERGE)
+				{
+					self->dup_old++;
+
+					// merge from itup into itup2 where itup's col[i] is not null
+					// but itup2's col[i] IS null
+					merge_tuples(heapRel, itup2, itup);
+
+					ItemPointerCopy(&t_tid2, &itup2->t_tid);
+					self->dup_new++;
+					remove_duplicate(self, heapRel, itup,
+						RelationGetRelationName(wstate->index));
+					itup = BTSpoolGetNextItem(btspool, itup, &should_free);
+				}
+				else
 				{
 					ItemPointerCopy(&t_tid2, &itup2->t_tid);
 					self->dup_new++;
@@ -950,6 +965,113 @@
 		self->dup_old + self->dup_new, relname);
 }
 
+// returns Buffer after locking it (BUFFER_LOCK_SHARE then BUFFER_LOCK_UNLOCK)
+Buffer load_buffer(Relation heap, IndexTuple itup, HeapTupleData *tuple /*OUT*/, ItemId *itemid /*OUT*/ )
+{
+	BlockNumber	blknum;
+	BlockNumber	offnum;
+	Buffer		buffer;
+	Page		page;
+
+	blknum = ItemPointerGetBlockNumber(&itup->t_tid);
+	offnum = ItemPointerGetOffsetNumber(&itup->t_tid);
+	buffer = ReadBuffer(heap, blknum);
+
+	LockBuffer(buffer, BUFFER_LOCK_SHARE);
+	page = BufferGetPage(buffer);
+	*itemid = PageGetItemId(page, offnum);
+	tuple->t_data = ItemIdIsNormal(*itemid)
+		? (HeapTupleHeader) PageGetItem(page, *itemid)
+		: NULL;
+	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+	return buffer;
+}
+
+void load_tuple(Relation	heap,
+		HeapTuple	tuple,
+		IndexTuple	itup,
+		ItemId

Re: [HACKERS] WIP patch for parameterized inner paths

2012-01-25 Thread David E. Wheeler
On Jan 25, 2012, at 10:24 AM, Tom Lane wrote:

> Anyway, I'd be willing to hold off committing if someone were to
> volunteer to test an unintegrated copy of the patch against some
> moderately complicated application.  But it's a sufficiently large
> patch that I don't really care to sit on it and try to maintain it
> outside the tree for a long time.

Why not create a branch? IIRC the build farm can be configured to run branches.

Best,

David




Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Merlin Moncure
On Wed, Jan 25, 2012 at 1:24 PM, Marko Kreen  wrote:
> On Wed, Jan 25, 2012 at 12:54:00PM -0600, Merlin Moncure wrote:
>> On Wed, Jan 25, 2012 at 11:24 AM, Marko Kreen  wrote:
>> > I specifically want to avoid any sort of per-connection
>> > negotation, except the "max format version supported",
>> > because it will mess up multiplexed usage of single connection.
>> > Then they need to either disabled advanced formats completely,
>> > or still do it per-query somehow (via GUCs?) which is mess.
>>
>> Being able to explicitly pick format version other than the one the
>> application was specifically written against adds a lot of complexity
>> and needs to be justified.  Maybe you're trying to translate data
>> between two differently versioned servers?  I'm trying to understand
>> the motive behind your wanting finer grained control of picking format
>> version...
>
> You mean if client has written with version N formats, but connects
> to server with version N-1 formats?  True, simply not supporting
> such case simplifies client-side API.
>
> But note that it does not change anything on protocol level, it's purely
> client-API specific.  It may well be that some higher-level APIs
> (JDBC, Npgsql, Psycopg) may support such downgrade, but with lower-level
> API-s (raw libpq), it may be optional whether the client wants to
> support such usage or not.

well, I see the following cases:
1) Vserver > Vapplication: server downgrades wire formats to the
application's version
2) Vapplication > Vlibpq > Vserver: since the application is
reading/writing formats the server can't understand, an error should
be raised if they are used in either direction
3) Vlibpq >= Vapplication > Vserver: same as above, but libpq can
'upconvert' the low-version wire format to the application's wire format, or
error otherwise.
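
Expressed as a connection-time decision, with invented names, those cases
look something like this:

typedef enum
{
    WIRE_SERVER_DOWNGRADES,     /* case 1: server emits the app's (older) formats */
    WIRE_ERROR,                 /* case 2: app newer than both libpq and server   */
    WIRE_LIBPQ_UPCONVERTS       /* case 3: libpq could translate up, or error     */
} WireCompat;

static WireCompat
classify_versions(int v_server, int v_libpq, int v_app)
{
    if (v_server >= v_app)
        return WIRE_SERVER_DOWNGRADES;
    if (v_app > v_libpq)
        return WIRE_ERROR;
    return WIRE_LIBPQ_UPCONVERTS;
}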

By far, the most common cause of problems (both in terms of severity
and frequency) is case #1.  #3 allows a 'compatibility mode' via
libpq, but that comes at significant cost of complexity since libpq
needs to be able to translate wire formats up (but not down).  #2/3 is
a less common problem though as it's more likely the application can
be adjusted to get up to speed: so to keep things simple we can maybe
just error out in those scenarios.

In the database, we need to maintain outdated send/recv functions
basically forever and as much as possible try and translate old wire
format data to and from newer backend structures (maybe in very
specific cases that will be impossible such that the application is
SOL, but that should be rare).  All send/recv functions, including
user created ones need to be stamped with a version token (database
version?).  With the versions of the application, libpq, and all
server functions, we can determine all wire formats as long as we
assume the application's targeted database version represents all the
wire formats it was using.

My good ideas stop there: the exact mechanics of how the usable set of
functions is determined, how exactly the adjusted type lookups will
work, etc. would all have to be sorted out.  Most of the nastier parts
though (protocol changes notwithstanding) are not in libpq, but the
server.  There's just no quick fix on the client side I can see.

merlin



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Wed, Jan 25, 2012 at 12:54:00PM -0600, Merlin Moncure wrote:
> On Wed, Jan 25, 2012 at 11:24 AM, Marko Kreen  wrote:
> > I specifically want to avoid any sort of per-connection
> > negotation, except the "max format version supported",
> > because it will mess up multiplexed usage of single connection.
> > Then they need to either disabled advanced formats completely,
> > or still do it per-query somehow (via GUCs?) which is mess.
> 
> Being able to explicitly pick format version other than the one the
> application was specifically written against adds a lot of complexity
> and needs to be justified.  Maybe you're trying to translate data
> between two differently versioned servers?  I'm trying to understand
> the motive behind your wanting finer grained control of picking format
> version...

You mean if a client is written with version-N formats, but connects
to a server with version N-1 formats?  True, simply not supporting
such a case simplifies the client-side API.

But note that it does not change anything at the protocol level; it's purely
client-API specific.  It may well be that some higher-level APIs
(JDBC, Npgsql, Psycopg) support such a downgrade, but with lower-level
APIs (raw libpq), it may be optional whether the client wants to
support such usage or not.

-- 
marko




Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Wed, Jan 25, 2012 at 12:58:15PM -0500, Tom Lane wrote:
> Marko Kreen  writes:
> > On Wed, Jan 25, 2012 at 11:40:28AM -0500, Tom Lane wrote:
> >> Then why bother with the bit in the format code?  If you've already done
> >> some other negotiation to establish what datatype formats you will
> >> accept, this doesn't seem to be adding any value.
> 
> > The "other negotiation" is done via Postgres release notes...
> 
> That is really not going to work if the requirement is to not break old
> apps.  They haven't read the release notes.

Yes, but they also keep requesting the old formats, so everything is fine?
Note that formats are under the full control of the client; the server has
no way to send newer formats to a client that has not requested them.

> > I specifically want to avoid any sort of per-connection
> > negotation, except the "max format version supported",
> > because it will mess up multiplexed usage of single connection.
> > Then they need to either disabled advanced formats completely,
> > or still do it per-query somehow (via GUCs?) which is mess.
> 
> Hmm, that adds yet another level of not-obvious-how-to-meet requirement.
> I tend to concur with Robert that we are not close to a solution.

Well, my simple scheme seems to work fine with such a requirement.

[My scheme - the client-supplied 16-bit type code is the only thing
that decides the format.]

> > No, the list of "well-known" types is documented and fixed.
> > The bit is specifically for frameworks, so that they can say
> > "I support all well-known types in Postgres version X.Y".
> 
> So in other words, if we have a client that contains a framework that
> knows about version N, and we connect it up to a server that speaks
> version N+1, it suddenly loses the ability to use any version-N
> optimizations?  That does not meet my idea of not breaking old apps.

That is up to the Postgres maintainers to decide, whether they want
to phase out some type from the list.  But my main point was that
it's OK to add types to the list.  I missed that aspect in my
previous mail.

-- 
marko




Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Merlin Moncure
On Wed, Jan 25, 2012 at 11:24 AM, Marko Kreen  wrote:
> I specifically want to avoid any sort of per-connection
> negotation, except the "max format version supported",
> because it will mess up multiplexed usage of single connection.
> Then they need to either disabled advanced formats completely,
> or still do it per-query somehow (via GUCs?) which is mess.

Being able to explicitly pick a format version other than the one the
application was specifically written against adds a lot of complexity
and needs to be justified.  Maybe you're trying to translate data
between two differently versioned servers?  I'm trying to understand
the motive behind your wanting finer-grained control over picking the
format version...

merlin



Re: [HACKERS] WIP patch for parameterized inner paths

2012-01-25 Thread Tom Lane
Robert Haas  writes:
> I have to admit that I have a bad feeling about this.  It strikes me
> that there is no way we're not going to get complaints about a 35%
> slowdown on planning a large join problem.

I have to admit that I'm worried about that too.  However, I hope to
continue whittling away at that number.  It's also worth pointing out
that the number is based on just a couple of test cases that are not
very real-world, in that I'm testing against empty tables with no
statistics.  I think that this is a worst case, because it results in a
lot of paths with basically the same costs, making them hard to prune;
but I can't do much better, because the examples I've got were supplied
without test data.

Also, you're assuming that the changes have no upside whatsoever, which
I fondly hope is not the case.  Large join problems tend not to execute
instantaneously --- so nobody is going to complain if the planner takes
awhile longer but the resulting plan is enough better to buy that back.
In my test cases, the planner *is* finding better plans, or at least
ones with noticeably lower estimated costs.  It's hard to gauge how
much that translates to in real-world savings, since I don't have
real data loaded up.  I also think, though I've not tried to measure,
that I've made planning cheaper for very simple queries by eliminating
some overhead in those cases.

Anyway, I'd be willing to hold off committing if someone were to
volunteer to test an unintegrated copy of the patch against some
moderately complicated application.  But it's a sufficiently large
patch that I don't really care to sit on it and try to maintain it
outside the tree for a long time.

> To be clear, I'd love to have this feature.  But if there is a choice
> between reducing planning time significantly for everyone and NOT
> getting this feature, and increasing planning time significantly for
> everyone and getting this feature, I think we will make more people
> happy by doing the first one.

We're not really talking about "are we going to accept or reject a
specific feature".  We're talking about whether we're going to decide
that the last two years worth of planner development were headed in
the wrong direction and we're now going to reject that and try to
think of some entirely new concept.  This isn't an isolated patch,
it's the necessary next step in a multi-year development plan.  The
fact that it's a bit slower at the moment just means there's still
work to do.

regards, tom lane



Re: [v9.2] LEAKPROOF attribute of FUNCTION (Re: [HACKERS] [v9.2] Fix Leaky View Problem)

2012-01-25 Thread Robert Haas
On Sun, Jan 22, 2012 at 5:12 AM, Kohei KaiGai  wrote:
> 2012/1/21 Robert Haas :
>> On Sat, Jan 21, 2012 at 3:59 AM, Kohei KaiGai  wrote:
>>> I marked the default leakproof function according to the criteria that
>>> does not leak contents of the argument.
>>> Indeed, timestamp_ne_timestamptz() has a code path that rises
>>> an error of "timestamp out of range" message. Is it a good idea to
>>> avoid mark "leakproof" on these functions also?
>>
>> I think that anything which looks at the data and uses that as a basis
>> for whether or not to throw an error is non-leakproof.  Even if it
>> doesn't directly leak an arbitrary value, I think that leaking even
>> some information about the value is no good.  Otherwise, you
>> might imagine that we would allow /(int, int), because it only leaks
>> in the second_arg = 0 case.  And you might imagine we'd allow -(int,
>> int) because it only leaks in the case where an overflow occurs.  But
>> of course the combination of the two allows writing something of the
>> form 1/(a-constant) and getting it pushed down, and now you have the
>> ability to probe for an arbitrary value.  So I think it's just no good
>> to allow any leaking at all: otherwise it'll be unclear how safe it
>> really is, especially when combinations of different functions or
>> operators are involved.
>>
> OK. I checked the list of the default leakproof functions.
>
> Functions that contain translation between date and timestamp(tz)
> can raise an error depending on the supplied arguments.
> Thus, I unmarked them as leakproof.
>
> In addition, varstr_cmp() contains translation from UTF-8 to UTF-16 on
> the win32 platform; that may raise an error if the string contains a
> character that cannot be translated.
> Although I'm not sure in which cases the translation can fail, it seems
> to me this falls under the principle that leaking any kind of information
> is no good. Thus, the related operator functions of bpchar and text got
> unmarked.
> (Note that bpchareq, bpcharne, texteq and textne don't use it.)

Can you rebase this?  It seems that the pg_proc.h and
select_views{,_1}.out hunks no longer apply cleanly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Online base backup from the hot-standby

2012-01-25 Thread Simon Riggs
On Wed, Jan 25, 2012 at 8:49 AM, Simon Riggs  wrote:
> On Wed, Jan 25, 2012 at 8:16 AM, Fujii Masao  wrote:
>
>>> What happens if we shutdown the WALwriter and then issue SIGHUP?
>>
>> SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned 
>> about
>> the case where smart shutdown kills walwriter but some backends are
>> still running?
>> Currently SIGHUP affects full_page_writes and running backends use the 
>> changed
>> new value of full_page_writes. But in the patch, SIGHUP doesn't affect...
>>
>> To address the problem, we should either postpone the shutdown of walwriter
>> until all backends have gone away, or leave the update of full_page_writes to
>> checkpointer process instead of walwriter. Thought?
>
> checkpointer seems the correct place to me


Done.


-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Tom Lane
Marko Kreen  writes:
> On Wed, Jan 25, 2012 at 11:40:28AM -0500, Tom Lane wrote:
>> Then why bother with the bit in the format code?  If you've already done
>> some other negotiation to establish what datatype formats you will
>> accept, this doesn't seem to be adding any value.

> The "other negotiation" is done via Postgres release notes...

That is really not going to work if the requirement is to not break old
apps.  They haven't read the release notes.

> I specifically want to avoid any sort of per-connection
> negotation, except the "max format version supported",
> because it will mess up multiplexed usage of single connection.
> Then they need to either disabled advanced formats completely,
> or still do it per-query somehow (via GUCs?) which is mess.

Hmm, that adds yet another level of not-obvious-how-to-meet requirement.
I tend to concur with Robert that we are not close to a solution.

> No, the list of "well-known" types is documented and fixed.
> The bit is specifically for frameworks, so that they can say
> "I support all well-known types in Postgres version X.Y".

So in other words, if we have a client that contains a framework that
knows about version N, and we connect it up to a server that speaks
version N+1, it suddenly loses the ability to use any version-N
optimizations?  That does not meet my idea of not breaking old apps.

regards, tom lane



Re: [HACKERS] WIP patch for parameterized inner paths

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 11:24 AM, Tom Lane  wrote:
> After that I started doing some performance testing, and the initial
> news was bad: the planner was about 3x slower than 9.1 on a moderately
> large join problem.  I've spent the past several days hacking away at
> that, and have got it down to about 35% slower by dint of the following
> changes:
>
> * micro-optimization of add_path(), in particular avoiding duplicate
> calculations during cost comparisons.
>
> * introducing a two-step mechanism for testing whether proposed join
> paths are interesting.  The patch first calculates a quick and dirty
> lower bound on the cost of the join path (mostly, from the costs of
> its input paths) and looks through the joinrel's pathlist to see if
> there is already a path that is clearly better.  If not, it proceeds
> with the full cost calculation and add_path insertion.  In my testing
> the precheck is able to eliminate 80% or so of the proposed paths,
> more than repaying the extra pathlist search.
>
> Now this is of course cheating with both hands, in that either of these
> optimization techniques could have been applied before ... but they
> weren't.  I think that 35% slower on large join problems is probably
> acceptable, given that this is investigating a larger number of possible
> solutions and hopefully often finding a better plan.  I think I can
> knock some more off of that by refactoring the costsize.c APIs, anyway.
> Right now the first-pass and second-pass cost calculations are
> independent and so there's some repeated work, which I'd like to
> eliminate both for speed reasons and to get rid of duplicate code
> that'd have to be kept in sync if it's left as-is.
>
> If there are not objections, I'd like to go ahead and commit this
> after one more round of polishing.  There's still a lot left to do,
> but what it mainly needs now is some testing on real-world planning
> problems, and I don't think it's going to get that until it's been
> merged in.
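
(A bare-bones sketch of that precheck idea, with hypothetical types; the real
code works on the planner's Path lists and compares more than a single cost
number, e.g. startup cost and pathkeys as well:)

typedef struct Path
{
    double       total_cost;
    struct Path *next;
} Path;

/* Return 1 if the full (expensive) costing is worth doing, i.e. no
 * already-accepted path is at least as cheap as the quick lower bound. */
static int
precheck_path(const Path *pathlist, double cost_lower_bound)
{
    const Path *p;

    for (p = pathlist; p != NULL; p = p->next)
        if (p->total_cost <= cost_lower_bound)
            return 0;           /* clearly dominated: skip full costing + add_path */
    return 1;
}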

I have to admit that I have a bad feeling about this.  It strikes me
that there is no way we're not going to get complaints about a 35%
slowdown on planning a large join problem.  It is true that some
people will have the benefit of finding a faster plan - but I think
many people won't.  We're really facing the same trade-off here that
we do with many other things people have asked for the query planner
to do over the years: is it worth slowing down everyone, on every
query, for an optimization that will apply rarely but provide huge
benefits when it does?  Of course, that's fundamentally a judgement
call.  But I can't help observing that the number of requests we've
had for the planner to deduce implied inequalities is far larger than
the number of people who have complained about the problem that this
fixes.  Now, maybe the speed penalty for deducing implied inequalities
would be even larger than 35%: I don't know.  But we've sweat blood to
avoid much smaller regressions on much less important cases.

To be clear, I'd love to have this feature.  But if there is a choice
between reducing planning time significantly for everyone and NOT
getting this feature, and increasing planning time significantly for
everyone and getting this feature, I think we will make more people
happy by doing the first one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Wed, Jan 25, 2012 at 11:40:28AM -0500, Tom Lane wrote:
> Marko Kreen  writes:
> > On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote:
> >> Huh?  How can that work?  If we decide to change the representation of
> >> some other "well known type", say numeric, how do we decide whether a
> >> client setting that bit is expecting that change or not?
> 
> > It sets that bit *and* version code - which means that it is
> > up-to-date with all "well-known" type formats in that version.
> 
> Then why bother with the bit in the format code?  If you've already done
> some other negotiation to establish what datatype formats you will
> accept, this doesn't seem to be adding any value.

The "other negotiation" is done via Postgres release notes...

I specifically want to avoid any sort of per-connection
negotiation, except the "max format version supported",
because it will mess up multiplexed usage of a single connection.
Then they need to either disable advanced formats completely,
or still do it per-query somehow (via GUCs?), which is a mess.

Also, I don't see any market for "flexible" negotiations;
instead, I see that people want 2 things:

- Updated formats to be easily available
- Old apps not to break

I might be mistaken here (if so, please correct me),
but currently I'm designing for simplicity.

> > Basically, I see 2 scenarios here:
> 
> > 1) Client knows the result types and can set the
> > text/bin/version code safely, without further restrictions.
> 
> > 2) There is generic framework, that does not know query contents
> > but can be expected to track Postgres versions closely.
> > Such framework cannot say "binary" for results safely,
> > but *could* do it for some well-defined subset of types.
> 
> The hole in approach (2) is that it supposes that the client side knows
> the specific datatypes in a query result in advance.  While this is
> sometimes workable for application-level code that knows what query it's
> issuing, it's really entirely untenable for a framework or library.

No, the list of "well-known" types is documented and fixed.
The bit is specifically for frameworks, so that they can say
"I support all well-known types in Postgres version X.Y".

Note I said that the list cannot be extended, but that is wrong.
When this bit and the version code are taken together, it clearly
defines "the list as of version X.Y".  So, considering that the
client should not send any higher version than the server supports,
the server always knows which list the client refers to.

-- 
marko




Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 11:40 AM, Tom Lane  wrote:
> Marko Kreen  writes:
>> On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote:
>>> Huh?  How can that work?  If we decide to change the representation of
>>> some other "well known type", say numeric, how do we decide whether a
>>> client setting that bit is expecting that change or not?
>
>> It sets that bit *and* version code - which means that it is
>> up-to-date with all "well-known" type formats in that version.
>
> Then why bother with the bit in the format code?  If you've already done
> some other negotiation to establish what datatype formats you will
> accept, this doesn't seem to be adding any value.
>
>> Basically, I see 2 scenarios here:
>
>> 1) Client knows the result types and can set the
>> text/bin/version code safely, without further restrictions.
>
>> 2) There is generic framework, that does not know query contents
>> but can be expected to track Postgres versions closely.
>> Such framework cannot say "binary" for results safely,
>> but *could* do it for some well-defined subset of types.
>
> The hole in approach (2) is that it supposes that the client side knows
> the specific datatypes in a query result in advance.  While this is
> sometimes workable for application-level code that knows what query it's
> issuing, it's really entirely untenable for a framework or library.
> The only way that a framework can deal with arbitrary queries is to
> introduce an extra round trip (Describe step) to see what datatypes
> the query will produce so it can decide what format codes to issue
> ... and that will pretty much eat up any time savings you might get
> from a more efficient representation.
>
> You really want to do the negotiation once, at connection setup, and
> then be able to process queries without client-side prechecking of what
> data types will be sent back.

What might work is for clients to advertise a list of capability
strings, like "compact_array_format", at connection startup time.  The
server can then adjust its behavior based on that list.  But the
problem with that is that as we make changes to the wire protocol, the
list of capabilities clients need to advertise could get pretty long
in a hurry.  A simpler alternative is to have the client send a server
version along with the initial connection attempt and have the server
do its best not to use any features that weren't present in that
server version - but that seems to leave user-defined types out in the
cold.
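
A rough sketch of that first idea, with invented names (the capability list
itself would have to arrive at connection startup time):

#include <stdbool.h>
#include <string.h>

static bool
client_has_capability(const char *const caps[], int ncaps, const char *name)
{
    int i;

    for (i = 0; i < ncaps; i++)
        if (strcmp(caps[i], name) == 0)
            return true;
    return false;
}

/* e.g. if (client_has_capability(caps, ncaps, "compact_array_format"))
 *          ... the server may send the newer array representation ... */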

I reiterate my previous view that we don't have time to engineer a
good solution to this problem right now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-25 Thread Jeff Janes
On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas  wrote:
> Early yesterday morning, I was able to use Nate Boley's test machine
> do a single 30-minute pgbench run at scale factor 300 using a variety
> of trees built with various patches, and with the -l option added to
> track latency on a per-transaction basis.  All tests were done using
> 32 clients and permanent tables.  The configuration was otherwise
> identical to that described here:
>
> http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com

Previously we mostly used this machine for CPU benchmarking.  Have you
previously described the IO subsystem?  That is becoming relevant for
these types of benchmarks.   For example, is WAL separated from data?

>
> By doing this, I hoped to get a better understanding of (1) the
> effects of a scale factor too large to fit in shared_buffers,

In my hands, the active part of data at scale of 300 fits very
comfortably into 8GB shared_buffers.

Indeed, until you have run a very long time so that pgbench_history
gets large, everything fits in 8GB.

Cheers,

Jeff



Re: [HACKERS] [v9.2] sepgsql's DROP Permission checks

2012-01-25 Thread Robert Haas
On Sun, Jan 22, 2012 at 9:54 AM, Kohei KaiGai  wrote:
> I tried to implement this based on the idea of calling the object-access-hook
> with a flag that informs extensions of the context of objects being removed.
> Because I missed that DROP_RESTRICT and DROP_CASCADE are an enum type,
> this patch added performInternalDeletion() instead of an OR-masked DROP_INTERNAL.
> Its only difference from performDeletion() is that a flag (OAT_DROP_FLAGS_INTERNAL)
> is delivered to the extension module. I replaced several performDeletion() calls
> with performInternalDeletion() where objects are cleaned up due to internal stuff.
>
> How about this approach?

I generally agree with this line of attack, but I think you've failed
to find all the cases where a drop should be considered internal, and
I'd rather add a new parameter to an existing function than define a
new one that someone might accidentally fail to use in some place
where it's needed.  Here's a cut-down patch that *just* adds a
PERFORM_DELETE_INTERNAL flag, plus some related comment additions.  If
this looks reasonable to you, I'll commit it and then we can work out
the remaining details.

Since sepgsql doesn't seem to need the DropBehavior, I'm inclined to
say we shouldn't go to any extra work to pass it just now.  We can
always add that later if some other client needs it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


perform-deletion-internal.patch
Description: Binary data



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Tom Lane
Marko Kreen  writes:
> On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote:
>> Huh?  How can that work?  If we decide to change the representation of
>> some other "well known type", say numeric, how do we decide whether a
>> client setting that bit is expecting that change or not?

> It sets that bit *and* version code - which means that it is
> up-to-date with all "well-known" type formats in that version.

Then why bother with the bit in the format code?  If you've already done
some other negotiation to establish what datatype formats you will
accept, this doesn't seem to be adding any value.

> Basically, I see 2 scenarios here:

> 1) Client knows the result types and can set the
> text/bin/version code safely, without further restrictions.

> 2) There is generic framework, that does not know query contents
> but can be expected to track Postgres versions closely.
> Such framework cannot say "binary" for results safely,
> but *could* do it for some well-defined subset of types.

The hole in approach (2) is that it supposes that the client side knows
the specific datatypes in a query result in advance.  While this is
sometimes workable for application-level code that knows what query it's
issuing, it's really entirely untenable for a framework or library.
The only way that a framework can deal with arbitrary queries is to
introduce an extra round trip (Describe step) to see what datatypes
the query will produce so it can decide what format codes to issue
... and that will pretty much eat up any time savings you might get
from a more efficient representation.

You really want to do the negotiation once, at connection setup, and
then be able to process queries without client-side prechecking of what
data types will be sent back.

regards, tom lane



Re: [HACKERS] [GENERAL] Why extract( ... from timestamp ) is not immutable?

2012-01-25 Thread Tom Lane
hubert depesz lubaczewski  writes:
> anyway - the point is that in \df date_part(, timestamp) says it's
> immutable, while it is not.

Hmm, you're right.  I thought we'd fixed that way back when, but
obviously not.  Or maybe the current behavior of the epoch case
postdates that.

regards, tom lane



Re: [HACKERS] pg_trigger_depth() v3 (was: TG_DEPTH)

2012-01-25 Thread Alvaro Herrera


Committed, with OID change.  Thanks.

I tested it with plphp just for the heck of it and it worked
wonderfully.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote:
> Marko Kreen  writes:
> > On Tue, Jan 24, 2012 at 09:33:52PM -0500, Robert Haas wrote:
> >> Furthermore, while we haven't settled the question of exactly what a
> >> good negotiation facility would look like, we seem to agree that a GUC
> >> isn't it.  I think that means this isn't going to happen for 9.2, so
> >> we should mark this patch Returned with Feedback and return to this
> >> topic for 9.3.
> 
> > Simply extending the text/bin flags should be quite
> > uncontroversial first step.  How to express the
> > capability in startup packet, I leave to others to decide.
> 
> > But my proposal would be following:
> 
> > bit 0 : text/bin
> > bit 1..15 : format version number, maps to best formats in some
> > Postgres version.  
> 
> > It does not solve the resultset problem, where I'd like to say
> > "gimme well-known types in optimal representation, others in text".  
> > I don't know the perfect solution for that, but I suspect the
> > biggest danger here is the urge to go to maximal complexity
> > immediately.  So perhaps the good idea would simply give one
> > additional bit (0x8000?) in result flag to say that only
> > well-known types should be optimized.  That should cover 95%
> > of use-cases, and we can design more flexible packet format
> > when we know more about actual needs.
> 
> Huh?  How can that work?  If we decide to change the representation of
> some other "well known type", say numeric, how do we decide whether a
> client setting that bit is expecting that change or not?

It sets that bit *and* version code - which means that it is
up-to-date with all "well-known" type formats in that version.

The key here is to sanely define the "well-known" types
and document them, so clients can be up to date with them.

Variants:
- All built-in and contrib types in some Postgres version
- All built-in types in some Postgres version
- Most common types (text, numeric, bytes, int, float, bool, ..)

Also, as we have only one bit, the set of types cannot be
extended.  (Unless we provide more bits for that, but that
may get too confusing?)


Basically, I see 2 scenarios here:

1) Client knows the result types and can set the
text/bin/version code safely, without further restrictions.

2) There is generic framework, that does not know query contents
but can be expected to track Postgres versions closely.
Such framework cannot say "binary" for results safely,
but *could* do it for some well-defined subset of types.


Of course, it may be that 2) is not worth supporting, as
frameworks can throw errors on their own if they find a
format that they cannot parse.  Then the user needs
to either register their own parser, or simply turn off
optimized formats to get the plain-text values.
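
As a rough illustration of that fallback path, a framework could keep a
parser registry keyed on (type OID, wire-format version) and refuse to
decode anything it has no entry for.  The sketch below is purely
illustrative -- the TypeParser structure and function are invented for
this example, not an existing driver API; only the Oid typedef from
postgres_ext.h is real:

#include <stdbool.h>
#include "postgres_ext.h"   /* for Oid */

typedef struct TypeParser
{
    Oid     typoid;         /* result column type */
    int     max_version;    /* highest wire-format version understood */
    bool    (*parse) (const char *data, int len, void *out);
} TypeParser;

static const TypeParser *
lookup_parser(const TypeParser *table, int ntypes, Oid typoid, int version)
{
    int     i;

    for (i = 0; i < ntypes; i++)
    {
        if (table[i].typoid == typoid && table[i].max_version >= version)
            return &table[i];
    }
    /* caller reports "unsupported format" or re-issues the query in text */
    return NULL;
}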

-- 
marko




Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Tom Lane
Marko Kreen  writes:
> On Tue, Jan 24, 2012 at 09:33:52PM -0500, Robert Haas wrote:
>> Furthermore, while we haven't settled the question of exactly what a
>> good negotiation facility would look like, we seem to agree that a GUC
>> isn't it.  I think that means this isn't going to happen for 9.2, so
>> we should mark this patch Returned with Feedback and return to this
>> topic for 9.3.

> Simply extending the text/bin flags should be quite an
> uncontroversial first step.  How to express the
> capability in the startup packet, I leave to others to decide.

> But my proposal would be the following:

> bit 0 : text/bin
> bit 1..15 : format version number, maps to best formats in some
> Postgres version.  

> It does not solve the resultset problem, where I'd like to say
> "gimme well-known types in optimal representation, others in text".  
> I don't know the perfect solution for that, but I suspect the
> biggest danger here is the urge to go to maximal complexity
> immediately.  So perhaps a good idea would be to simply give one
> additional bit (0x8000?) in the result flag to say that only
> well-known types should be optimized.  That should cover 95%
> of use-cases, and we can design a more flexible packet format
> when we know more about actual needs.

Huh?  How can that work?  If we decide to change the representation of
some other "well known type", say numeric, how do we decide whether a
client setting that bit is expecting that change or not?

regards, tom lane



Re: [HACKERS] Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE

2012-01-25 Thread Alvaro Herrera

Excerpts from Noah Misch's message of Sun Jan 22 02:05:31 -0300 2012:

> Thanks.  I've attached a new version fixing this problem.

Looks good to me.  Can you please clarify this bit?

 * Since we elsewhere require that all collations share the same
 * notion of equality, don't compare collation here.

Since I'm not familiar with this code, it's difficult for me to figure
out where this "elsewhere" is, and what "same notion of equality"
means.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.

2012-01-25 Thread Simon Riggs
On Wed, Jan 25, 2012 at 1:57 PM, Robert Haas  wrote:

> I think that this would be a lot more clear if we described this as
> synchronous_commit = remote_write rather than just synchronous_commit
> = write.  Actually, the internal constant is named that way already,
> but it's not exposed as such to the user.

That's something to discuss at the end of the CF when people are less
busy and we get more input.

It's an easy change whatever we decide to do.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 1:23 AM, Fujii Masao  wrote:
> On Wed, Jan 25, 2012 at 5:35 AM, Jaime Casanova  wrote:
>> On Tue, Jan 24, 2012 at 3:22 PM, Simon Riggs  wrote:
>>> Add new replication mode synchronous_commit = 'write'.
>>> Replication occurs only to memory on standby, not to disk,
>>> so provides additional performance if user wishes to
>>> reduce durability level slightly. Adds concept of multiple
>>> independent sync rep queues.
>>>
>>> Fujii Masao and Simon Riggs
>>>
>>
>> I guess you should add the new value in postgresql.conf too.
>
> Yes, I forgot to do that. Patch attached.

I think that this would be a lot more clear if we described this as
synchronous_commit = remote_write rather than just synchronous_commit
= write.  Actually, the internal constant is named that way already,
but it's not exposed as such to the user.

There's a logical hierarchy here:

fully async commit ("off") < local flush only ("local") < local flush
+ remote write (currently "write") < local flush + remote flush
(currently "on") < local flush + remote apply

All of the levels except for "off" involve waiting for local flush;
the higher levels also involve waiting for something on the remote
side.  But the name of the variable is just synchronous_commit, so I
think that if the word "remote" doesn't appear anywhere in the
user-visible parameter name, it's not as clear as it could be.  In
addition to renaming "write" to "remote_write", I think we might also
want to add "remote_flush" as an alias for "on".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-25 Thread Robert Haas
On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs  wrote:
> I think we should be working to commit XLogInsert and then Group
> Commit, then come back to the discussion.

I definitely agree that those two have way more promise than anything
else on the table.  However, now that I understand how badly we're
getting screwed by checkpoints, I'm curious to do some more
investigation of what's going on there.  I can't help thinking that in
these particular cases the full page writes may be a bigger issue than
the checkpoint itself.  If that turns out to be the case it's not
likely to be something we're able to address for 9.2, but I'd like to
at least characterize it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Client Messages

2012-01-25 Thread Jim Mlodgenski
On Tue, Jan 24, 2012 at 7:38 AM, Heikki Linnakangas
 wrote:
> On 23.01.2012 22:52, Jim Mlodgenski wrote:
>>
>> On Wed, Jan 18, 2012 at 9:19 AM, Jim Mlodgenski  wrote:
>>>
>>> On Wed, Jan 18, 2012 at 3:08 AM, Heikki Linnakangas

 I don't think that's a problem, it's just a free-form message to
 display.

 But it also doesn't seem very useful to have it PGC_USERSET: if it's
 only
 displayed at connect time, there's no point in changing it after
 connecting.
>>>
>>> Should we make it PGC_BACKEND?
>
>
> In hindsight, making it PGC_BACKEND makes it much less useful, because then
> you can't set it on a per-database basis with "ALTER DATABASE foo SET ..."
> So I made it PGC_USERSET again.
>
>
>> Here is the revised patch based on the feedback.
>
>
> Thanks! I renamed the GUC to "welcome_message", to avoid confusion with
> "client_min_messages". I also moved it to "Connection Settings" category.
> Although it's technically settable within a session, the purpose is to
> display a message when connecting, so "Connection Settings" feels more
> fitting.
>
> There's one little problem remaining with this, which is what to do if the
> message is in a different encoding than used by the client? That's not a new
> problem, we have the same problem with any string GUC, if you do "SHOW
> ". We restricted application_name to ASCII characters, which is an
> obvious way to avoid the problem, but I think it would be a shame to
> restrict this to ASCII-only.

Isn't that an issue of the administrator understanding his audience?
Maybe some additional documentation explaining the encoding issue?

>
>
> --
>  Heikki Linnakangas
>  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace

2012-01-25 Thread Robert Haas
On Wed, Jan 25, 2012 at 7:13 AM, Julien Tachoires  wrote:
> 2012/1/24 Robert Haas :
>> On Sun, Jan 22, 2012 at 11:04 AM, Julien Tachoires  wrote:
>>> 2011/12/15 Alvaro Herrera :

 Uhm, surely you could compare the original toast tablespace to the heap
 tablespace, and if they differ, handle appropriately when creating the
 new toast table?  Just pass down the toast tablespace into
 AlterTableCreateToastTable, instead of having it assume that
 rel->rd_rel->relnamespace is sufficient.  This should be done in all
 cases where a toast tablespace is created, which shouldn't be more than
 a handful of them.
>>>
>>> Thank you, that way seems right.
>>> Now, I distinguish before each creation of a TOAST table with
>>> AlterTableCreateToastTable() : if it will create a new one or recreate
>>> an existing one.
>>> Thus, in create_toast_table() when toastTableSpace is equal to
>>> InvalidOid, we are able :
>>> - to fallback to the main table tablespace in case of new TOAST table 
>>> creation
>>> - to keep it previous tablespace in case of recreation.
>>>
>>> Here's a new version rebased against HEAD.
>>
>> To ask more directly the question that's come up a few times upthread,
>> why do *you* think this is useful?  What motivated you to want this
>> behavior, and/or how do you think it could benefit other PostgreSQL
>> users?
>
> Sorry, I didn't realize this question was addressed to me. I've just picked up
> this item from the TODO list and then I was thinking that it could be useful. My
> motivation was to learn more about PostgreSQL dev. and to work on a
> concrete case. Now, I'm not sure anymore that this is useful.

OK.  In that case, I don't think it makes sense to add this right now.
 I share the feeling that it could possibly be useful under some set
of circumstances, but it doesn't seem prudent to add features because
there might hypothetically be a use case.  I suggest that we mark this
patch Returned with Feedback and add links to your latest version of
this patch and one of the emails expressing concerns about the utility
of this patch to the Todo list.  If we later have a clearer idea in
mind why this might be useful, we can add it then - and document the
use case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace

2012-01-25 Thread Julien Tachoires
2012/1/24 Robert Haas :
> On Sun, Jan 22, 2012 at 11:04 AM, Julien Tachoires  wrote:
>> 2011/12/15 Alvaro Herrera :
>>>
>>> Uhm, surely you could compare the original toast tablespace to the heap
>>> tablespace, and if they differ, handle appropriately when creating the
>>> new toast table?  Just pass down the toast tablespace into
>>> AlterTableCreateToastTable, instead of having it assume that
>>> rel->rd_rel->relnamespace is sufficient.  This should be done in all
>>> cases where a toast tablespace is created, which shouldn't be more than
>>> a handful of them.
>>
>> Thank you, that way seems right.
>> Now, I distinguish before each creation of a TOAST table with
>> AlterTableCreateToastTable(): whether it will create a new one or recreate
>> an existing one.
>> Thus, in create_toast_table() when toastTableSpace is equal to
>> InvalidOid, we are able:
>> - to fall back to the main table tablespace in case of new TOAST table
>> creation
>> - to keep its previous tablespace in case of recreation.
>>
>> Here's a new version rebased against HEAD.
>
> To ask more directly the question that's come up a few times upthread,
> why do *you* think this is useful?  What motivated you to want this
> behavior, and/or how do you think it could benefit other PostgreSQL
> users?
>

Sorry, I didn't realize this question was addressed to me. I've just picked up
this item from the TODO list and then I was thinking that it could be useful. My
motivation was to learn more about PostgreSQL dev. and to work on a
concrete case. Now, I'm not sure anymore that this is useful.

--
JT



[HACKERS] Configuring Postgres to Add A New Source File

2012-01-25 Thread Tareq Aljabban
Hi,
I'm doing some development on the storage manager module of Postgres.

I have added a few source files already to the smgr folder, and I was able to
have the make system include them by adding their names to the OBJS list
in the Makefile inside the smgr folder (i.e., when I add A.c, I add
A.o to the OBJS list).

That was working fine. Now I'm trying to add a new file hdfs_test.c to the
project. The problem with this file is that it requires some extra
directives in its compilation command (-I and -L directives).


Using gcc the file is compiled with the command:

gcc hdfs_test.c -I/HDFS_HOME/hdfs/src/c++/libhdfs
-I/usr/lib/jvm/default-java/include -L/HDFS_HOME/hdfs/src/c++/libhdfs
-L/HDFS_HOME/build/c++/Linux-i386-32/lib
-L/usr/lib/jvm/default-java/jre/lib/i386/server -ljvm -lhdfs -o
hdfs_test

I was told that in order to add the extra directives, I need to modify
$LDFLAGS in configure.in and $LIBS in the Makefile.

I read about autoconf and checked configure.in of Postgres but still wasn't
able to figure out exactly what I should be doing.

Thanks


Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Marko Kreen
On Tue, Jan 24, 2012 at 09:33:52PM -0500, Robert Haas wrote:
> Furthermore, while we haven't settled the question of exactly what a
> good negotiation facility would look like, we seem to agree that a GUC
> isn't it.  I think that means this isn't going to happen for 9.2, so
> we should mark this patch Returned with Feedback and return to this
> topic for 9.3.

Simply extending the text/bin flags should be quite an
uncontroversial first step.  How to express the
capability in the startup packet, I leave to others to decide.

But my proposal would be the following:

bit 0 : text/bin
bit 1..15 : format version number, maps to best formats in some
Postgres version.  

It does not solve the resultset problem, where I'd like to say
"gimme well-known types in optimal representation, others in text".  
I don't know the perfect solution for that, but I suspect the
biggest danger here is the urge to go to maximal complexity
immediately.  So perhaps a good idea would be to simply give one
additional bit (0x8000?) in the result flag to say that only
well-known types should be optimized.  That should cover 95%
of use-cases, and we can design a more flexible packet format
when we know more about actual needs.
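
To make the bit layout concrete, here is a tiny sketch of how such a
16-bit code could be packed and unpacked.  The macro and function names
are invented for illustration and are not an existing API; note that with
version 0 the values collapse to today's 0 = text / 1 = binary codes, and
that reserving 0x8000 for the "well-known types only" flag leaves bits
1..14 for the version:

#include <stdint.h>

#define PGFMT_BINARY           0x0001  /* bit 0: 0 = text, 1 = binary */
#define PGFMT_VERSION_SHIFT    1       /* version lives in bits 1..14 */
#define PGFMT_VERSION_MASK     0x3FFF
#define PGFMT_WELL_KNOWN_ONLY  0x8000  /* optimize well-known types only */

static inline uint16_t
pgfmt_make(int binary, int version, int well_known_only)
{
    return (uint16_t) ((binary ? PGFMT_BINARY : 0) |
                       ((version & PGFMT_VERSION_MASK) << PGFMT_VERSION_SHIFT) |
                       (well_known_only ? PGFMT_WELL_KNOWN_ONLY : 0));
}

static inline int
pgfmt_version(uint16_t code)
{
    return (code >> PGFMT_VERSION_SHIFT) & PGFMT_VERSION_MASK;
}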

libpq suggestions:

  PQsetformatcodes(bool)
Only if it's called with TRUE does libpq start interpreting
text/bin codes as non-bools.  IOW, we will be compatible
with old code using -1 as TRUE.
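
Purely to show how the pieces would fit together -- PQsetformatcodes()
is only being proposed here (its exact signature is not settled), and
pgfmt_make() is the sketch from above, so none of this is existing
libpq API:

#include <libpq-fe.h>

PGconn     *conn = PQconnectdb("dbname=test");
PGresult   *res;

/* Opt in; without this call, format codes keep being treated as booleans. */
PQsetformatcodes(conn, 1);

/* Ask for binary results at a hypothetical format version 2, restricted
 * to well-known types; everything else would still come back as text. */
res = PQexecParams(conn, "SELECT id, payload FROM items",
                   0, NULL, NULL, NULL, NULL,
                   pgfmt_make(1, 2, 1));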

protocol suggestions:

  On startup the server sends the highest text/bin code it supports,
  and gives an error if it finds a higher code than supported.
  Poolers/proxies with different server versions in the pool
  will simply advertise the lowest common code.


Small Q&A, to put obvious aspects into writing
--

* Does that mean we need to keep old formats around infinitely?

Yes.  On-wire formats have *much* higher visibility than
on-disk formats.  Also, except for some basic types they are
not parsed in adapters, but in client code.  libpq offers the
least help in that respect.

Basically - changing on-wire formatting is a big deal,
don't do it willy-nilly.


* Does that mean we cannot turn on new formats automatically?

Yes.  Should be obvious.


-- 
marko




Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-25 Thread Peter Eisentraut
On tis, 2012-01-24 at 20:13 -0500, Tom Lane wrote:
> Yeah.  In both cases, the (proposed) new output format is
> self-identifying *to clients that know what to look for*.
> Unfortunately it would only be the most anally-written pre-existing
> client code that would be likely to spit up on the unexpected
> variations.  What's much more likely to happen, and did happen in the
> bytea case, is silent data corruption. 

The problem in the bytea case is that the client libraries are written
to ignore encoding errors.  No amount of protocol versioning will help
you in that case.




Re: [HACKERS] PgNext: CFP

2012-01-25 Thread Oleg Bartunov

What about travel & accommodation support?

On Tue, 24 Jan 2012, Joshua D. Drake wrote:



Hello,

The call for papers for PgNext (the old PgWest/PgEast) is now open:

January 19th: Talk submission opens
April 15th: Talk submission closes
April 30th: Speaker notification

Submit: https://www.postgresqlconference.org/talk_types

Sincerely,

JD




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [HACKERS] Online base backup from the hot-standby

2012-01-25 Thread Simon Riggs
On Wed, Jan 25, 2012 at 8:16 AM, Fujii Masao  wrote:

>> What happens if we shutdown the WALwriter and then issue SIGHUP?
>
> SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned 
> about
> the case where smart shutdown kills walwriter but some backends are
> still running?
> Currently SIGHUP affects full_page_writes and running backends use the changed
> new value of full_page_writes. But in the patch, SIGHUP doesn't affect...
>
> To address the problem, we should either postpone the shutdown of walwriter
> until all backends have gone away, or leave the update of full_page_writes to
> checkpointer process instead of walwriter. Thought?

checkpointer seems the correct place to me

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Online base backup from the hot-standby

2012-01-25 Thread Fujii Masao
On Tue, Jan 24, 2012 at 7:54 PM, Simon Riggs  wrote:
> On Tue, Jan 24, 2012 at 9:51 AM, Fujii Masao  wrote:
>
>>> I'll proceed to commit for this now.
>>
>> Thanks a lot!
>
> Can I just check a few things?

Sure!

> You say
> /*
> +        * Update full_page_writes in shared memory and write an
> +        * XLOG_FPW_CHANGE record before resource manager writes cleanup
> +        * WAL records or checkpoint record is written.
> +        */
>
> why does it need to be before the cleanup and checkpoint?

Because the cleanup and checkpoint need to see FPW in shared memory.
If FPW in shared memory is not updated there, the cleanup and (end-of-recovery)
checkpoint always use the initial value (false) of FPW in shared memory.

> You say
> /*
> +        * Currently only non-exclusive backup can be taken during recovery.
> +        */
>
> why?

At first I proposed to allow exclusive backup to be taken during recovery. But
Heikki disagreed with the proposal because he thought that the exclusive backup
procedure which I proposed was too fragile. No one could come up with any good
user-friendly easy-to-implement procedure. So we decided to allow only
non-exclusive backup to be taken during recovery. In non-exclusive backup,
the complicated procedure is performed by pg_basebackup, so a user doesn't
need to care about that.

> You mention in the docs
> "The backup history file is not created in the database cluster backed up."
> but we need to explain the bad effect, if any.

Users cannot find out various information (e.g., which WAL files are required for
backups, backup start/end times, etc.) about backups which have been taken
so far. If they need such information, they need to record it manually.

Users cannot pass the backup history file to pg_archivecleanup, which might make
the usage of pg_archivecleanup more difficult.

After a little thought, pg_basebackup would be able to create the backup history
file in the backup, though it cannot be archived. Should we implement
that feature to alleviate the bad effect?

> You say
> "If the standby is promoted to the master during online backup, the
> backup fails."
> but no explanation of why?
>
> I could work those things out, but I don't want to have to, plus we
> may disagree if I did.

If the backup succeeds in that case, when we start an archive recovery from that
backup, the recovery needs to cross between two timelines, which means that
we need to set recovery_target_timeline before starting recovery. Whether
recovery_target_timeline needs to be set or not depends on whether the standby
was promoted while the backup was being taken. Leaving such a decision to a user
seems fragile.

pg_basebackup -x ensures that all required files are included in the backup and
we can start recovery without restoring any file from the archive. But if the
standby is promoted during the backup, the timeline history file would become an
essential file for recovery, yet it's not included in the backup.

> There are some good explanations in comments of other things, just not
> everywhere needed.
>
> What happens if we shutdown the WALwriter and then issue SIGHUP?

SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned about
the case where smart shutdown kills walwriter but some backends are
still running?
Currently SIGHUP affects full_page_writes and running backends use the changed
new value of full_page_writes. But in the patch, SIGHUP doesn't affect...

To address the problem, we should either postpone the shutdown of walwriter
until all backends have gone away, or leave the update of full_page_writes to
checkpointer process instead of walwriter. Thought?

> Are we sure we want to make the change of file format mandatory? That
> means earlier versions of clients such as pg_basebackup will fail
> against this version.

Really? Unless I'm missing something, pg_basebackup doesn't care about the
file format of backup_label. So I don't think that earlier versions of
pg_basebackup fail.

> There are no docs to explain the new feature is available in the main
> docs, or to explain the restrictions.
> I expect you will add that later after commit.

Okay. Will do.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Group commit, revised

2012-01-25 Thread Heikki Linnakangas
I've been thinking, what exactly is the important part of this group 
commit patch that gives the benefit? Keeping the queue sorted isn't all 
that important - XLogFlush() requests for commits will come in almost 
the correct order anyway.


I also don't much like the division of labour between groupcommit.c and 
xlog.c. XLogFlush() calls GroupCommitWaitForLSN(), which calls back 
DoXLogFlush(), which is a bit hard to follow. (that's in my latest 
version; the original patch had similar division but it also cut across 
processes, with the wal writer actually doing the flush)


It occurs to me that the WALWriteLock already provides much of the same 
machinery we're trying to build here. It provides FIFO-style queuing, 
with the capability to wake up the next process or processes in the 
queue. Perhaps all we need is some new primitive to LWLock, to make it 
do exactly what we need.


Attached is a patch to do that. It adds a new mode to 
LWLockConditionalAcquire(), LW_EXCLUSIVE_BUT_WAIT. If the lock is free, 
it is acquired and the function returns immediately. However, unlike 
normal LWLockConditionalAcquire(), if the lock is not free it goes to 
sleep until it is released. But unlike normal LWLockAcquire(), when the 
lock is released, the function returns *without* acquiring the lock.


I modified XLogFlush() to use that new function for WALWriteLock. It 
tries to get WALWriteLock, but if it's not immediately free, it waits 
until it is released. Then, before trying to acquire the lock again, it 
rechecks LogwrtResult to see if the record was already flushed by the 
process that released the lock.


This is a much smaller patch than the group commit patch, and in my 
pgbench-tools tests on my laptop, it has the same effect on performance. 
The downside is that it touches lwlock.c, which is a change at a lower 
level. Robert's flexlocks patch probably would've been useful here.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ce659ec..d726a98 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2066,23 +2066,42 @@ XLogFlush(XLogRecPtr record)
 	/* initialize to given target; may increase below */
 	WriteRqstPtr = record;
 
-	/* read LogwrtResult and update local state */
+	/*
+	 * Now wait until we get the write lock, or someone else does the
+	 * flush for us.
+	 */
+	for (;;)
 	{
 		/* use volatile pointer to prevent code rearrangement */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
+		/* read LogwrtResult and update local state */
 		SpinLockAcquire(&xlogctl->info_lck);
 		if (XLByteLT(WriteRqstPtr, xlogctl->LogwrtRqst.Write))
 			WriteRqstPtr = xlogctl->LogwrtRqst.Write;
 		LogwrtResult = xlogctl->LogwrtResult;
 		SpinLockRelease(&xlogctl->info_lck);
-	}
 
-	/* done already? */
-	if (!XLByteLE(record, LogwrtResult.Flush))
-	{
-		/* now wait for the write lock */
-		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+		/* done already? */
+		if (XLByteLE(record, LogwrtResult.Flush))
+			break;
+
+		/*
+		 * Try to get the write lock. If we can't get it immediately, wait
+		 * until it's released, but before actually acquiring it, loop back
+		 * to check if the backend holding the lock did the flush for us
+		 * already. This helps to maintain a good rate of group committing
+		 * when the system is bottlenecked by the speed of fsyncing.
+		 */
+		if (!LWLockConditionalAcquire(WALWriteLock, LW_EXCLUSIVE_BUT_WAIT))
+		{
+			/*
+			 * Didn't get the lock straight away, but we might be done
+			 * already
+			 */
+			continue;
+		}
+		/* Got the lock */
 		LogwrtResult = XLogCtl->Write.LogwrtResult;
 		if (!XLByteLE(record, LogwrtResult.Flush))
 		{
@@ -2111,6 +2130,8 @@ XLogFlush(XLogRecPtr record)
 			XLogWrite(WriteRqst, false, false);
 		}
 		LWLockRelease(WALWriteLock);
+		/* done */
+		break;
 	}
 
 	END_CRIT_SECTION();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index cc41568..488f573 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -430,6 +430,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 			elog(PANIC, "cannot wait without a PGPROC structure");
 
 		proc->lwWaiting = true;
+		proc->lwWaitOnly = false;
 		proc->lwExclusive = (mode == LW_EXCLUSIVE);
 		proc->lwWaitLink = NULL;
 		if (lock->head == NULL)
@@ -496,7 +497,9 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 /*
  * LWLockConditionalAcquire - acquire a lightweight lock in the specified mode
  *
- * If the lock is not available, return FALSE with no side-effects.
+ * If the lock is not available, return FALSE with no side-effects. In
+ * LW_EXCLUSIVE_BUT_WAIT mode, if the lock is not available, waits until it
+ * is available, but then returns false without acquiring it.
  *
  * If successful, cancel/die interrupts are held off until lock release.
  */
@@ -504,7 +507,9 @@ bool
 LWLockConditionalAcquire(LWLockId l

Re: [HACKERS] Patch: Allow SQL-language functions to reference parameters by parameter name

2012-01-25 Thread Hitoshi Harada
On Mon, Jan 23, 2012 at 7:13 PM, Matthew Draper  wrote:
> On 19/01/12 20:28, Hitoshi Harada wrote:
>>> (Now it occurred to me that forgetting the #include parse_func.h might
>>> hit this breakage..., so I'll fix it here and continue to test, but if
>>> you'll fix it yourself, let me know)
>>
>> I fixed it here and it now works with my environment.
>
> Well spotted; that's exactly what I'd done. :/
>
>
>> The regression tests pass, the feature seems working as aimed, but it
>> seems to me that it needs more test cases and documentation. For the
>> tests, I believe at least we need "ambiguous" case given upthread, so
>> that we can ensure to keep compatibility. For the document, it should
>> describe the name resolution rule, as stated in the patch comment.
>
> Attached are a new pair of patches, fixing the missing include (and the
> other warning), plus addressing the above.
>
> I'm still not sure whether to just revise (almost) all the SQL function
> examples to use parameter names, and declare them the "right" choice; as
> it's currently written, named parameters still seem rather second-class.
>

Agreed. The patch seems OK, except for an example I've just found.

db1=# create function t(a int, t t) returns int as $$ select t.a $$
language sql;
CREATE FUNCTION
db1=# select t(0, row(1, 2)::t);
 t
---
 1
(1 row)

Should we throw an error on such ambiguity? Or did you make it happen
intentionally? If the latter, we should also mention the rule in the
manual.

Other than that, the patch looks good to me.

Thanks,
-- 
Hitoshi Harada
