Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-15 Thread David Rowley
On Tue, Jan 14, 2014 at 2:29 PM, Tom Lane  wrote:

>
> One reason I'm thinking this is that whatever we do to ameliorate
> the semantic issues is going to slow down the forward transition
> function --- to no benefit unless the aggregate is being used as a
> window function in a moving window.  So I'm less than convinced
> that we *should* implement any of these designs in the default
> aggregates, even if we get to the point where we arguably *could*
> do it with little risk of functional differences.
>
> regards, tom lane
>

I was thinking about this earlier today and came up with an idea: perhaps
aggregates could support two transition functions, one for normal
aggregation and for windows starting at UNBOUNDED PRECEDING, and a second
to be used whenever there is a possibility that we would need to perform an
inverse transition. This second transition function could do all the extra
tracking it needs without any worry that it would slow down normal
aggregation. For numeric, that might mean tracking each value's scale as it
enters and exits the window frame. It might even make it possible to
perform inverse transitions for float and double: we could just use numeric
internally and have the final function convert back to the original type.
Though perhaps that would have the side effect of making floating point
calculations too accurate, or at least not matching the IEEE standard.
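As a toy illustration of the kind of extra tracking such a second transition
function could do for numeric (invented names and layout, nothing from the
actual patch): keep a per-scale counter of the values currently in the frame,
so the inverse transition can lower the state's display scale again once the
last large-scale value has left.

#include <stdio.h>

/* Hypothetical moving-aggregate state for numeric: alongside the running
 * sum (elided here), count the in-frame values at each decimal scale so
 * the display scale can shrink again when values leave the frame. */
#define MAX_SCALE 64

typedef struct MovingNumericState
{
	long	scale_count[MAX_SCALE + 1];	/* in-frame values per scale */
	int		dscale;						/* display scale for results */
} MovingNumericState;

static void
state_add(MovingNumericState *st, int value_scale)
{
	st->scale_count[value_scale]++;
	if (value_scale > st->dscale)
		st->dscale = value_scale;	/* forward transition: scale only grows */
}

static void
state_remove(MovingNumericState *st, int value_scale)
{
	st->scale_count[value_scale]--;
	/* inverse transition: if that was the last value at the maximum scale,
	 * step down to the largest scale still present in the frame */
	while (st->dscale > 0 && st->scale_count[st->dscale] == 0)
		st->dscale--;
}

int
main(void)
{
	MovingNumericState st = {{0}, 0};

	state_add(&st, 0);		/* the value 1, scale 0   */
	state_add(&st, 1);		/* the value 0.1, scale 1 */
	printf("dscale with 0.1 in frame: %d\n", st.dscale);	/* 1 */

	state_remove(&st, 1);	/* 0.1 leaves the window frame */
	printf("dscale after it leaves:   %d\n", st.dscale);	/* 0 */
	return 0;
}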

Of course, I'm not thinking this could be part of this patch, but I thought
that I'd post the idea while all this is fresh in people's heads.

David


Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-15 Thread David Rowley
On Thu, Jan 16, 2014 at 9:45 AM, Florian Pflug  wrote:

> BTW, AVG() and STDDEV() have the same issue. The problem is just partially
> masked by the division by N (or N-1) at the end, because we always emit at
> least 16 fractional digits when dividing. So you have to have an input
> value with a larger scale than that to trigger it.
>
> For the following query
>
>   select avg(x) over (order by i rows between current row and 1 following)
>   from (values
>     (1,1), (2,1), (3,0.00000000000000000001), (4,1), (5,1)
>   ) t(i,x);
>
> 9.3 returns
>           avg
> ------------------------
>      1.0000000000000000
>  0.50000000000000000001
>  0.50000000000000000001
>      1.0000000000000000
>      1.0000000000000000
>
> but HEAD+patch returns
>           avg
> ------------------------
>      1.0000000000000000
>  0.50000000000000000001
>  0.50000000000000000001
>  1.00000000000000000000
>  1.00000000000000000000
>
>
Uhhh, that's bad news indeed. That means I'll need to remove not only the
inverse transition functions for all aggregates on numeric types, but also
avg for the int types and the stddev* functions for everything, since they
use numeric internally. I guess that only leaves SUM for smallint, int,
bigint, cash and interval, along with count(exp) and count(*)...



> I have to admit that I'm *very* tempted to suggest we simply ignore this -
> but that *would* mean accepting that windowed aggregates are non-
> deterministic in the sense that their result (even if only in the number
> of trailing zeros) depends on values outside of the frame. Which, I guess,
> is a box that best stays closed...
>
>
Yeah, I can understand the temptation but I agree we can't go changing
results.


> I'm currently thinking the best way forward is to get a basic patch
> without any NUMERIC stuff committed, and to revisit this after that's done.
>
>
Agreed... I'll warm up my delete key.

Regards

David Rowley


> best regards,
> Florian Pflug
>
>


Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-15 Thread Peter Geoghegan
On Wed, Jan 15, 2014 at 8:23 PM, Peter Geoghegan  wrote:
> I have an idea of what I could do to fix this, but I don't have time
> to make sure that my hunch is correct.

It might just be a matter of:

@@ -186,6 +186,13 @@ ExecLockHeapTupleForUpdateSpec(EState *estate,
 		switch (test)
 		{
 			case HeapTupleInvisible:
+				/*
+				 * Tuple may have originated from this command, in which case it's
+				 * already locked
+				 */
+				if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple.t_data)) &&
+					HeapTupleHeaderGetCmin(tuple.t_data) == estate->es_output_cid)
+					return true;
 				/* Tuple became invisible;  try again */
 				if (IsolationUsesXactSnapshot())
 					ereport(ERROR,

-- 
Peter Geoghegan




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-15 Thread David Rowley
On Thu, Jan 16, 2014 at 3:47 PM, Florian Pflug  wrote:

> BTW, as it stands, the patch currently uses the inverse transition
> function only when *all* the aggregates belonging to one window clause
> provide one. After working with the code a bit, I think that a bit of
> reshuffling would allow that distinction to be made on a per-aggregate
> basis. The question is, is it worth it?
>
>
I didn't think it was all that worth it, because we'd still need to process
every tuple in the frame's scope for each aggregate which has no inverse
transition function; of course, there would be slightly less work to do in
such cases. I guess I just assumed that the workloads which would benefit
from inverse transitions have many rows rather than many aggregates and few
rows.

I guess to implement this, the aggregated-up-to marker variables would just
need to be per aggregate rather than per window.
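To make that concrete, here is a toy model (invented structures, nothing from
the patch) of per-aggregate markers: an aggregate with an inverse transition
only removes and adds the rows that changed, while one without an inverse
restarts and rescans its own frame, without forcing the others to be
recomputed too.

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy per-aggregate bookkeeping: sum() has an inverse transition here,
 * max() does not, so only max() rescans its frame when the head moves. */
typedef struct AggState
{
	bool	has_inverse;
	long	head, upto;		/* rows currently folded in: [head, upto) */
	long	value;			/* running sum, or current max */
	long	calls;			/* transition-function calls, to show the cost */
} AggState;

static void
advance(AggState *agg, const long *rows, long frame_head, long frame_tail)
{
	if (agg->has_inverse)
	{
		/* inverse transition: just remove rows that left the frame */
		for (; agg->head < frame_head; agg->head++, agg->calls++)
			agg->value -= rows[agg->head];
	}
	else
	{
		/* no inverse transition function: restart this aggregate only */
		agg->head = frame_head;
		agg->upto = frame_head;
		agg->value = LONG_MIN;
	}

	/* forward transitions for rows entering (or re-entering) at the tail */
	for (; agg->upto < frame_tail; agg->upto++, agg->calls++)
	{
		long	v = rows[agg->upto];

		agg->value = agg->has_inverse ? agg->value + v					/* sum */
									  : (v > agg->value ? v : agg->value);	/* max */
	}
}

int
main(void)
{
	const long	rows[] = {5, 1, 4, 1, 5, 9, 2, 6};
	const long	nrows = 8;
	AggState	sum = {true, 0, 0, 0, 0};			/* sum(): has an inverse */
	AggState	max = {false, 0, 0, LONG_MIN, 0};	/* max(): no inverse     */

	/* frame: ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING */
	for (long row = 0; row < nrows; row++)
	{
		long	tail = (row + 3 < nrows) ? row + 3 : nrows;

		advance(&sum, rows, row, tail);
		advance(&max, rows, row, tail);
		printf("row %ld: sum=%ld max=%ld\n", row, sum.value, max.value);
	}
	printf("transition calls: sum=%ld, max=%ld\n", sum.calls, max.calls);
	return 0;
}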

If you think it would be worth it I can modify the patch to work that way.

Regards

David Rowley



> best regards,
> Florian Pflug
>


Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it

2014-01-15 Thread Craig Ringer
On 01/16/2014 02:53 AM, Josh Berkus wrote:
> Hackers,
> 
> ALTER SYSTEM SET has been committed and recovery.conf GUCs are being
> reviewed.  I'm going to make a last case for conf.d in relation to these
> two patches before 9.4 goes out the door.

Personally, I'd find conf.d far more useful if it were enabled by default.
The need to hack postgresql.conf to use it eliminates half the point of
having it.

I haven't seen any compelling reason why conf.d should NOT be enabled by
default.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it

2014-01-15 Thread Amit Kapila
On Thu, Jan 16, 2014 at 1:31 AM, Josh Berkus  wrote:
> On 01/15/2014 11:10 AM, Stephen Frost wrote:
>
>> Independent of the above, I don't agree with that.  postgresql.auto.conf
>> is a special thing and should have its own special place.  For one
>> thing, when putting configuration files in a separate directory
>> structure (/etc/ vs /var), postgresql.auto.conf should stay in the
>> data directory.
>
> Ah, I'd forgotten about that line of argument.  Where is auto.conf now?

   Data Directory

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation

2014-01-15 Thread Asif Naeem
Hi MauMau,

Ah, sorry, I missed that part. As NTFS junctions and symbolic links are
different (although they behave similarly), there seems to be only a minor
inconvenience related to the misleading error message, i.e.

> + #ifdef WIN32
> +  if (rmdir(linkloc) < 0 && errno != ENOENT)
> + #else
>    if (unlink(linkloc) < 0 && errno != ENOENT)
> + #endif
>    ereport(ERROR,
>    (errcode_for_file_access(),
> errmsg("could not remove symbolic link \"%s\": %m",


What is your opinion about it? Is it not worth changing? Thanks.



On Wed, Jan 15, 2014 at 7:42 PM, MauMau  wrote:

> From: "Asif Naeem" 
>
>> As you have followed destroy_tablespace_directories() function, is there
>> any specific reason not to use the same logic to detect the type of the
>> file/link, i.e. “(lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))”? It
>> also seems to have a more appropriate error message, i.e.
>
> Thanks for reviewing and testing the patch.  Yes, at first I did what you
> mentioned, but modified the patch according to some advice in the mail
> thread.  During redo, create_tablespace_directories() needs to handle the
> case where $PGDATA/pg_tblspc/xxx is not a symlink but a directory, even
> on UNIX/Linux.  Please see TablespaceCreateDbspace().
> destroy_tablespace_directories() doesn't have to handle such a situation.
>
> Regards
> MauMau
>
>


Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-01-15 Thread Amit Kapila
On Thu, Jan 16, 2014 at 12:49 AM, Robert Haas  wrote:
> On Wed, Jan 15, 2014 at 7:28 AM, Amit Kapila  wrote:
>> Unpatched
>> ---------
>>                 testname                 | wal_generated |     duration
>> -----------------------------------------+---------------+-------------------
>>  one short and one long field, no change |    1054923224 |   33.101135969162
>>
>> After pgrb_delta_encoding_v4
>> ----------------------------
>>                 testname                 | wal_generated |     duration
>> -----------------------------------------+---------------+-------------------
>>  one short and one long field, no change |     877859144 |  30.6749138832092
>>
>> Temporary Changes
>> (Revert Max Chunksize = 4 and logic of finding longer match)
>> ------------------------------------------------------------
>>                 testname                 | wal_generated |     duration
>> -----------------------------------------+---------------+-------------------
>>  one short and one long field, no change |     677337304 |  25.4048750400543
>
> Sure, but watch me not care.
>
> If we're interested in taking advantage of the internal
> compressibility of tuples, we can do a lot better than this patch.  We
> can compress the old tuple and the new tuple.  We can compress
> full-page images.  We can compress inserted tuples.  But that's not
> the point of this patch.
>
> The point of *this* patch is to exploit the fact that the old and new
> tuples are likely to be very similar, NOT to squeeze out every ounce
> of compression from other sources.

   Okay, got your point.
   Another minor thing: in the latest patch, which I sent yesterday, I
   modified chunk formation so that if there is data at the end of the
   string which doesn't contain the special pattern and is shorter than the
   max chunk size, we still consider it a chunk. The reason for doing this
   is that if we have, say, a 104-byte string which contains no special
   bit pattern, it would otherwise produce just one 64-byte chunk and leave
   the remaining bytes out, which might miss the chance of compressing
   that data.
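A simplified sketch of that chunking rule (invented code, with a stand-in
"special byte" test rather than the real one): chunks end at a special byte
or at the maximum chunk size, and whatever trails at the end still becomes a
final, shorter chunk so it can participate in compression.

#include <stdio.h>
#include <string.h>

#define MAX_CHUNK_SIZE 64

/* Hypothetical chunker: a byte is "special" here if its low two bits are
 * zero (a stand-in for the real pattern test).  Chunks end either at a
 * special byte or at MAX_CHUNK_SIZE, and any trailing bytes that never hit
 * a special pattern still become a (short) final chunk. */
static int
form_chunks(const unsigned char *data, size_t len,
			size_t *chunk_len, int max_chunks)
{
	int		nchunks = 0;
	size_t	start = 0;

	for (size_t i = 0; i < len && nchunks < max_chunks; i++)
	{
		size_t	cur_len = i - start + 1;

		if ((data[i] & 0x03) == 0 || cur_len == MAX_CHUNK_SIZE)
		{
			chunk_len[nchunks++] = cur_len;
			start = i + 1;
		}
	}

	/* trailing data with no special pattern: still emit it as a chunk,
	 * so it gets a chance to be matched against the old tuple */
	if (start < len && nchunks < max_chunks)
		chunk_len[nchunks++] = len - start;

	return nchunks;
}

int
main(void)
{
	unsigned char	buf[104];
	size_t			chunks[8];

	memset(buf, 0xFF, sizeof(buf));		/* 104 bytes, no special bytes */
	int n = form_chunks(buf, sizeof(buf), chunks, 8);

	for (int i = 0; i < n; i++)
		printf("chunk %d: %zu bytes\n", i, chunks[i]);
	/* prints: chunk 0: 64 bytes, chunk 1: 40 bytes */
	return 0;
}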



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-15 Thread Peter Geoghegan
On Tue, Jan 14, 2014 at 3:07 AM, Heikki Linnakangas
 wrote:
>> Right, but with your approach, can you really be sure that you have
>> the right rejecting tuple ctid (not reject)? In other words, as you
>> wait for the exclusion constraint to conclusively indicate that there
>> is a conflict, minutes may have passed in which time other conflicts
>> may emerge in earlier unique indexes. Whereas with an approach where
>> values are locked, you are guaranteed that earlier unique indexes have
>> no conflicting values. Maintaining that property seems useful, since
>> we check in a well-defined order, and we're still projecting a ctid.
>> Unlike when row locking is involved, we can make no assumptions or
>> generalizations around where conflicts will occur. Although that may
>> also be a general concern with your approach when row locking, for
>> multi-master replication use-cases. There may be some value in knowing
>> it cannot have been earlier unique indexes (and so the existing values
>> for those unique indexes in the locked row should stay the same -
>> don't many conflict resolution policies work that way?).
>
> I don't understand what you're saying. Can you give an example?
>
> In the use case I was envisioning above, i.e. you insert N rows, and if any
> of them violate a constraint, you still want to insert the non-violating
> ones instead of rolling back the whole transaction, you don't care. You
> don't care what existing rows the new rows conflicted with.
>
> Even if you want to know what you conflicted with, I can't make sense of
> what you're saying. In the btreelock approach, the value locks are
> immediately released once you discover that there's conflict. So by the time
> you get to do anything with the ctid of the existing tuple you conflicted
> with, new conflicting tuples might've appeared.

That's true, but at least the timeframe in which an additional
conflict may occur on just-locked index values is bounded to more or
less an instant. In any case, how important this is is an interesting
question, and perhaps one that Andres can weigh in on as someone who
knows a lot about multi-master replication. This issue is particularly
interesting because this testcase appears to make both patches
livelock, for reasons that I believe are related:

https://github.com/petergeoghegan/upsert/blob/master/torture.sh

I have an idea of what I could do to fix this, but I don't have time
to make sure that my hunch is correct. I'm travelling tomorrow to give
a talk at PDX pug, so I'll have limited access to e-mail.

-- 
Peter Geoghegan




Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it

2014-01-15 Thread Stephen Frost
* Peter Eisentraut (pete...@gmx.net) wrote:
> In my mind, a conf.d directory is an extension of a single-file
> configuration, and so it should be handled that way.

I'm apparently out on some funny limb with this thought, but I'll throw
it out there anyway: in my head, the 'postgresql.auto.conf' file that
essentially ends up included as part of 'postgresql.conf' should be
handled the same way a single 'postgresql.conf' or a 'conf.d' directory
is.  Now, I've never particularly agreed with it, but at least on
Debian/Ubuntu, the /etc config directories are owned by the postgres user
by default.  I dislike the idea of separating the PG config into things
in /etc and things in PGDATA, as it will make life more difficult for the
poor sysadmins trying to figure out what's going on.

Thanks,

Stephen




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-15 Thread Florian Pflug
On Jan16, 2014, at 02:32 , Florian Pflug  wrote:
> On Jan14, 2014, at 17:39 , Florian Pflug  wrote:
>> On Jan14, 2014, at 11:06 , David Rowley  wrote:
>>> Here's a patch which removes sum(numeric) and changes the documents a 
>>> little to remove a reference to using sum(numeric) to workaround the fact 
>>> that there's no inverse transitions for sum(float). I also made a small 
>>> change in the aggregates.sql tests to check that the aggfnoid <= .
>> 
>> I've looked over the patch, here a few comments.
>> 
>> For STRICT pairs of transfer and inverse transfer functions we should 
>> complain if any of them ever return NULL. That can never be correct anyway, 
>> since a STRICT function cannot undo a non-NULL -> NULL transition.
>> 
>> The same goes for non-STRICT, at least if we ever want to allow an inverse 
>> transfer function to indicate "Sorry, cannot undo in this case, please 
>> rescan". If we allowed NULL as just another state value now, we couldn't 
>> easily undo that later, so we'd rob ourselves of the obvious way for the 
>> inverse transfer function to indicate this condition to its caller.
>> 
>> The notnullcount machinery seems to apply to both STRICT and non-STRICT 
>> transfer function pairs. Shouldn't that be constrained to STRICT transfer 
>> function pairs? For non-STRICT pairs, it's up to the transfer functions to 
>> deal with NULL inputs however they please, no?
> 
> I modified the patch to do this, and ran into a problem. Currently, 
> aggregates with state type "internal" cannot have a strict transfer function, 
> even if they behave as if they were strict, i.e. ignore NULL inputs. This is 
> because the only possible initial value for state type "internal" is NULL, 
> and it's up to the transfer function to create the state - usually upon 
> seeing the first non-NULL input. Now, currently that isn't a huge problem - 
> the transfer function simply has to check for NULL input values itself, and 
> if it's indeed conceptually strict, it just returns in this case.
> 
> With inverse transfer functions, however, each such pair of forward and 
> inverse transfer functions would also need to duplicate the NULL-counting 
> logic from nodeWindowAgg.c if it want's to be conceptually strict, i.e. 
> ignore NULL inputs, but return NULL if there are *only* NULL inputs (or no 
> inputs at all). That seems like a lot of duplicated code in the long run.
> 
> In essence, what we'd want for strict pairs of forward and inverse transfer 
> functions is for the forward transfer function to be strict in all arguments 
> except the state, and the inverse to be strict according to the usual 
> definition. We can't express that property of the forward transfer function 
> within pg_proc, but we don't really have to - a local hack in nodeWindowAgg.c 
> suffices. So what I'm proposing is:
> 
> We allow the forward transfer function to be non-strict even if the inverse 
> is strict, but only if the initial value is NULL. In that case we behave as 
> if the forward transfer function was strict, except that upon seeing the 
> first non-NULL input we call it with a NULL state. The return value must 
> still be non-NULL in all cases.

Ok, I tried this and it worked out quite OK. 

Updated patch is attached. It passes regression tests, but those currently 
don't seem to include any aggregates which *don't* ignore NULL values, so that 
case is probably untested. Also, it allows non-strict forward transfer 
functions together with strict inverse transfer functions even for non-NULL 
initial states now. It seemed rather pedantic to forbid this.

BTW, as it stands, the patch currently uses the inverse transition function 
only when *all* the aggregates belonging to one window clause provide one. 
After working with the code a bit, I think that a bit of reshuffling would 
allow that distinction to be made on a per-aggregate basis. The question is, is 
it worth it?

best regards,
Florian Pflug


inverse_transition_functions_v2.2_fgp.patch.gz
Description: GNU Zip compressed data



Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 8:41 PM, Jan Kara  wrote:
> On Wed 15-01-14 10:12:38, Robert Haas wrote:
>> On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara  wrote:
>> > Filesystems could in theory provide facility like atomic write (at least up
>> > to a certain size say in MB range) but it's not so easy and when there are
>> > no strong usecases fs people are reluctant to make their code more complex
>> > unnecessarily. OTOH without widespread atomic write support I understand
>> > application developers have similar stance. So it's kind of chicken and egg
>> > problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
>> > due to its data=journal mode so if someone on the PostgreSQL side wanted to
>> > research on this, knitting some experimental ext4 patches should be doable.
>>
>> Atomic 8kB writes would improve performance for us quite a lot.  Full
>> page writes to WAL are very expensive.  I don't remember what
>> percentage of write-ahead log traffic that accounts for, but it's not
>> small.
>   OK, and do you need atomic writes on a per-IO basis, or is per-file
> enough? It basically boils down to: is all or most of the IO to a file
> going to be atomic, or is it a smaller fraction?

The write-ahead log wouldn't need it, but data file writes would.  So
we'd need it a lot, but not for absolutely everything.

For any given file, we'd either care about writes being atomic, or we wouldn't.

> As Dave notes, unless there is HW support (which is coming with newest
> solid state drives), ext4/xfs will have to implement this by writing data
> to a filesystem journal and after transaction commit checkpointing them to
> a final location. Which is exactly what you do with your WAL logs so
> it's not clear it will be a performance win. But it is easy enough to code
> for ext4 that I'm willing to try...

Yeah, hardware support would be great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-15 Thread Florian Pflug
On Jan14, 2014, at 17:39 , Florian Pflug  wrote:
> On Jan14, 2014, at 11:06 , David Rowley  wrote:
>> Here's a patch which removes sum(numeric) and changes the documents a little 
>> to remove a reference to using sum(numeric) to workaround the fact that 
>> there's no inverse transitions for sum(float). I also made a small change in 
>> the aggregates.sql tests to check that the aggfnoid <= .
> 
> I've looked over the patch, here a few comments.
> 
> For STRICT pairs of transfer and inverse transfer functions we should 
> complain if any of them ever return NULL. That can never be correct anyway, 
> since a STRICT function cannot undo a non-NULL -> NULL transition.
> 
> The same goes for non-STRICT, at least if we ever want to allow an inverse 
> transfer function to indicate "Sorry, cannot undo in this case, please 
> rescan". If we allowed NULL as just another state value now, we couldn't 
> easily undo that later, so we'd rob ourselves of the obvious way for the 
> inverse transfer function to indicate this condition to its caller.
> 
> The notnullcount machinery seems to apply to both STRICT and non-STRICT 
> transfer function pairs. Shouldn't that be constrained to STRICT transfer 
> function pairs? For non-STRICT pairs, it's up to the transfer functions to 
> deal with NULL inputs however they please, no?

I modified the patch to do this, and ran into a problem. Currently, aggregates 
with state type "internal" cannot have a strict transfer function, even if they 
behave as if they were strict, i.e. ignore NULL inputs. This is because the only 
possible initial value for state type "internal" is NULL, and it's up to the 
transfer function to create the state - usually upon seeing the first non-NULL 
input. Now, currently that isn't a huge problem - the transfer function simply 
has to check for NULL input values itself, and if it's indeed conceptually 
strict, it just returns in this case.

With inverse transfer functions, however, each such pair of forward and inverse 
transfer functions would also need to duplicate the NULL-counting logic from 
nodeWindowAgg.c if it wants to be conceptually strict, i.e. ignore NULL 
inputs, but return NULL if there are *only* NULL inputs (or no inputs at all). 
That seems like a lot of duplicated code in the long run.

In essence, what we'd want for strict pairs of forward and inverse transfer 
functions is for the forward transfer function to be strict in all arguments 
except the state, and the inverse to be strict according to the usual 
definition. We can't express that property of the forward transfer function 
within pg_proc, but we don't really have to - a local hack in nodeWindowAgg.c 
suffices. So what I'm proposing is:

We allow the forward transfer function to be non-strict even if the inverse is 
strict, but only if the initial value is NULL. In that case we behave as if the 
forward transfer function was strict, except that upon seeing the first 
non-NULL input we call it with a NULL state. The return value must still be 
non-NULL in all cases.
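A small self-contained sketch of that caller-side behaviour (toy types and
names, not the actual nodeWindowAgg.c code): the caller skips NULL inputs
itself, passes a NULL state to the forward transition function on the first
non-NULL input, and complains if the function ever hands back a NULL state.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of a transition state and input value; NULL-ness is explicit. */
typedef struct Datum { bool isnull; long value; } Datum;

/* Hypothetical forward transition function: sum, written so that it can be
 * handed a NULL state on the first non-NULL input and must build the state. */
static Datum
forward_transfn(Datum state, Datum input)
{
	if (state.isnull)
		return (Datum){ false, input.value };		/* first non-NULL input */
	return (Datum){ false, state.value + input.value };
}

/* Caller-side handling for a "conceptually strict" forward function whose
 * initial state is NULL: skip NULL inputs ourselves, count the non-NULL
 * ones, and complain if the function ever returns a NULL state. */
static Datum
advance_state(Datum state, const Datum *inputs, int n, long *notnullcount)
{
	for (int i = 0; i < n; i++)
	{
		if (inputs[i].isnull)
			continue;					/* behave as if the transfn were strict */
		state = forward_transfn(state, inputs[i]);
		if (state.isnull)
		{
			fprintf(stderr, "transition function returned NULL state\n");
			break;
		}
		(*notnullcount)++;
	}
	return state;
}

int
main(void)
{
	Datum	inputs[] = {{true, 0}, {false, 4}, {false, 6}, {true, 0}};
	Datum	state = { true, 0 };		/* NULL initial state */
	long	notnull = 0;

	state = advance_state(state, inputs, 4, &notnull);
	/* emit NULL if only NULL inputs were seen, like a strict aggregate would */
	if (notnull == 0 || state.isnull)
		printf("result: NULL\n");
	else
		printf("result: %ld (from %ld non-NULL inputs)\n", state.value, notnull);
	return 0;
}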

Thoughts?

best regards,
Florian Pflug





Re: [HACKERS] 9.2.1 & index-only scans : abnormal heap fetches after VACUUM FULL

2014-01-15 Thread Simon Riggs
On 28 November 2013 22:17, Robert Haas  wrote:

> The fact that you've needed to modify SetHintBits() to make this work
> is pretty nasty.  I'm not exactly sure what to do about that, but it
> doesn't seem good.

That makes me feel bad also.

I think we should be looking for some special case routines rather
than piggybacking there.

Wonderful to see you personally and directly addressing this concern
and very supportive of this solution for 9.4.

I wonder about a user option to let these routines wait until they are
the oldest snapshot, so they can then set every page as all-visible
without even bothering to check individual items. Or maybe a user
option to set them all-visible even without actually being the oldest,
as is possible with COPY FREEZE. Full lock on the table is a big
thing, so a shame to waste the opportunity to set everything visible
just because some long running task is still executing somewhere (but
clearly not here otherwise we wouldn't have the lock).

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-15 Thread Peter Geoghegan
On Tue, Jan 14, 2014 at 3:25 PM, Heikki Linnakangas
 wrote:
> Attached is a patch doing that, to again demonstrate what I mean. I'm not
> sure if setting xmin to invalid is really the best way to mark the tuple
> dead; I don't think a tuple's xmin can currently be InvalidTransaction under
> any other circumstances, so there might be some code out there that's not
> prepared for it. So using an infomask bit might indeed be better. Or
> something else entirely.

Have you thought about the implications for other snapshot types (or
other tqual.c routines)? My concern is that a client of that
infrastructure (either current or future) could spuriously conclude
that a heap tuple satisfied it, when in fact only a promise tuple
satisfied it. It wouldn't necessarily follow that the promise would be
fulfilled, nor that there would be some other proper heap tuple
equivalent to that fulfilled promise tuple as far as those clients are
concerned.

heap_delete() will not call HeapTupleSatisfiesUpdate() when you're
deleting a promise tuple, which on the face of it is fine - it's
always going to technically be instantaneously invisible, because it's
always created by the same command id (i.e. HeapTupleSatisfiesUpdate()
would just return HeapTupleInvisible if called). So far so good, but
we are technically doing something else quite new - deleting a
would-be instantaneously invisible tuple. So like your concern about
setting xmin to invalid, my concern is that code may exist that treats
cmin < cmax as an invariant. Now, you might think that that would be a
manageable concern, and to be fair a look at the ComboCids code that
mostly arbitrates that stuff seems to indicate that it's okay, but
it's still worth noting.

I think you should consider breaking off the relcache parts of my
patch and committing them, because they're independently useful. If we
are going to have a lot of conflicts that need to be handled by a
heap_delete(), there is no point in inserting non-unique index tuples
for what is not yet conclusively a proper (non-promise) tuple. Those
should always come last. And even without upsert, strictly inserting
into unique indexes first seems like a useful thing relative to the
cost. Unique violations are the cause of many aborted transactions,
and there is no need to ever bloat non-unique indexes of the same slot
when that happens.

-- 
Peter Geoghegan




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
> Dave Chinner  writes:
> > On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
> >> What we'd really like for checkpointing is to hand the kernel a boatload
> >> (several GB) of dirty pages and say "how about you push all this to disk
> >> over the next few minutes, in whatever way seems optimal given the storage
> >> hardware and system situation.  Let us know when you're done."
> 
> > The issue there is that the kernel has other triggers for needing to
> > clean data. We have no infrastructure to handle variable writeback
> > deadlines at the moment, nor do we have any infrastructure to do
> > roughly metered writeback of such files to disk. I think we could
> > add it to the infrastructure without too much perturbation of the
> > code, but as you've pointed out that still leaves the fact there's
> > no obvious interface to configure such behaviour. Would it need to
> > be persistent?
> 
> No, we'd be happy to re-request it during each checkpoint cycle, as
> long as that wasn't an unduly expensive call to make.  I'm not quite
> sure where such requests ought to "live" though.  One idea is to tie
> them to file descriptors; but the data to be written might be spread
> across more files than we really want to keep open at one time.

It would be a property of the inode, as that is how writeback is
tracked and timed. Set and queried through a file descriptor,
though - it's basically the same context that fadvise works
through.

> But the only other idea that comes to mind is some kind of global sysctl,
> which would probably have security and permissions issues.  (One thing
> that hasn't been mentioned yet in this thread, but maybe is worth pointing
> out now, is that Postgres does not run as root, and definitely doesn't
> want to.  So we don't want a knob that would require root permissions
> to twiddle.)

I have assumed all along that requiring root to do stuff would be a
bad thing. :)

> We could probably live with serially checkpointing data
> in sets of however-many-files-we-can-have-open, if file descriptors are
> the place to keep the requests.

Inodes live longer than file descriptors, but there's no guarantee
that they live from one fd context to another. Hence my question
about persistence ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 10:12:38AM -0500, Robert Haas wrote:
> On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara  wrote:
> > Filesystems could in theory provide facility like atomic write (at least up
> > to a certain size say in MB range) but it's not so easy and when there are
> > no strong usecases fs people are reluctant to make their code more complex
> > unnecessarily. OTOH without widespread atomic write support I understand
> > application developers have similar stance. So it's kind of chicken and egg
> > problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
> > due to its data=journal mode so if someone on the PostgreSQL side wanted to
> > research on this, knitting some experimental ext4 patches should be doable.
> 
> Atomic 8kB writes would improve performance for us quite a lot.  Full
> page writes to WAL are very expensive.  I don't remember what
> percentage of write-ahead log traffic that accounts for, but it's not
> small.

Essentially, the "atomic writes" will be journalled data, so initially
there is not going to be any difference in performance between
journalling the data in userspace and journalling it in the filesystem
journal. Indeed, it could be worse, because the filesystem journal is
typically much smaller than a database WAL file, and it will flush much
more frequently and without the database having any say in when that
occurs.

AFAICT, we're stuck with sucky WAL until the block layer and hardware
support atomic writes.

FWIW, I've certainly considered adding per-file data journalling
capabilities to XFS in the past. If we decide that this is the way
to proceed (i.e. as a stepping stone towards hardware atomic write
support), then I can go back to my notes from a few years ago and
see what still needs to be done to support it

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 07:13:27PM -0500, Tom Lane wrote:
> Dave Chinner  writes:
> > On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
> >> And most importantly, "Also, please don't freeze up everything else in the
> >> process"
> 
> > If you hand writeback off to the kernel, then writeback for memory
> > reclaim needs to take precedence over "metered writeback". If we are
> > low on memory, then cleaning dirty memory quickly to avoid ongoing
> > allocation stalls, failures and potentially OOM conditions is far more
> > important than anything else.
> 
> I think you're in violent agreement, actually.  Jeff's point is exactly
> that we'd rather the checkpoint deadline slid than that the system goes
> to hell in a handbasket for lack of I/O cycles.  Here "metered" really
> means "do it as a low-priority task".

No, I meant the opposite - in low memory situations, the system is
going to go to hell in a handbasket because we are going to cause a
writeback IO storm cleaning memory regardless of these IO
priorities. i.e. there is no way we'll let "low priority writeback
to avoid IO storms" cause OOM conditions to occur. That is, in OOM
conditions, cleaning dirty pages becomes one of the highest priority
tasks of the system

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
> On Wed, Jan 15, 2014 at 7:12 AM, Tom Lane  wrote:
> 
> > Heikki Linnakangas  writes:
> > > On 01/15/2014 07:50 AM, Dave Chinner wrote:
> > >> FWIW [and I know you're probably sick of hearing this by now], but
> > >> the blk-io throttling works almost perfectly with applications that
> > >> use direct IO.
> >
> > > For checkpoint writes, direct I/O actually would be reasonable.
> > > Bypassing the OS cache is a good thing in that case - we don't want the
> > > written pages to evict other pages from the OS cache, as we already have
> > > them in the PostgreSQL buffer cache.
> >
> > But in exchange for that, we'd have to deal with selecting an order to
> > write pages that's appropriate depending on the filesystem layout,
> > other things happening in the system, etc etc.  We don't want to build
> > an I/O scheduler, IMO, but we'd have to.
> >
> > > Writing one page at a time with O_DIRECT from a single process might be
> > > quite slow, so we'd probably need to use writev() or asynchronous I/O to
> > > work around that.
> >
> > Yeah, and if the system has multiple spindles, we'd need to be issuing
> > multiple O_DIRECT writes concurrently, no?
> >
> 
> writev effectively does do that, doesn't it?  But they do have to be on the
> same file handle, so that could be a problem.  I think we need something
> like sorted checkpoints sooner or later, anyway.

No, it doesn't. writev() allows you to supply multiple user buffers
for a single IO to a fixed offset. If the file is contiguous, then it
will be issued as a single IO. If you want concurrent DIO, then you
need to use multiple threads or AIO.
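As a concrete illustration of that distinction, here is a minimal pwritev()
example (plain buffered I/O; O_DIRECT would additionally require aligned
buffers): two user buffers are gathered into a single write at one fixed
offset, i.e. one IO rather than two concurrent ones.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int
main(void)
{
	/* Two separate user buffers, written as one contiguous IO at offset 0. */
	char			page1[8192], page2[8192];
	struct iovec	iov[2] = {
		{ .iov_base = page1, .iov_len = sizeof(page1) },
		{ .iov_base = page2, .iov_len = sizeof(page2) },
	};

	memset(page1, 'a', sizeof(page1));
	memset(page2, 'b', sizeof(page2));

	int fd = open("demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* One system call, one IO to a fixed offset; this does not give you
	 * concurrent IOs to different offsets. */
	ssize_t written = pwritev(fd, iov, 2, 0);
	if (written < 0)
		perror("pwritev");
	else
		printf("wrote %zd bytes in one call\n", written);

	close(fd);
	return 0;
}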

> > What we'd really like for checkpointing is to hand the kernel a boatload
> > (several GB) of dirty pages and say "how about you push all this to disk
> > over the next few minutes, in whatever way seems optimal given the storage
> > hardware and system situation.  Let us know when you're done."
> 
> And most importantly, "Also, please don't freeze up everything else in the
> process"

If you hand writeback off to the kernel, then writeback for memory
reclaim needs to take precedence over "metered writeback". If we are
low on memory, then cleaning dirty memory quickly to avoid ongoing
allocation stalls, failures and potentially OOM conditions is far more
important than anything else.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > On 01/15/2014 07:50 AM, Dave Chinner wrote:
> >> FWIW [and I know you're probably sick of hearing this by now], but
> >> the blk-io throttling works almost perfectly with applications that
> >> use direct IO.
> 
> > For checkpoint writes, direct I/O actually would be reasonable. 
> > Bypassing the OS cache is a good thing in that case - we don't want the 
> > written pages to evict other pages from the OS cache, as we already have 
> > them in the PostgreSQL buffer cache.
> 
> But in exchange for that, we'd have to deal with selecting an order to
> write pages that's appropriate depending on the filesystem layout,
> other things happening in the system, etc etc.  We don't want to build
> an I/O scheduler, IMO, but we'd have to.

I don't see that as necessary - nobody else needs to do this with
direct IO. Indeed, if the application does ascending offset order
writeback from within a file, then it's replicating exactly what the
kernel page cache writeback does. If what the kernel does is good
enough for you, then I can't see how doing the same thing with
a background thread doing direct IO is going to need any special
help

> > Writing one page at a time with O_DIRECT from a single process might be 
> > quite slow, so we'd probably need to use writev() or asynchronous I/O to 
> > work around that.
> 
> Yeah, and if the system has multiple spindles, we'd need to be issuing
> multiple O_DIRECT writes concurrently, no?
> 
> What we'd really like for checkpointing is to hand the kernel a boatload
> (several GB) of dirty pages and say "how about you push all this to disk
> over the next few minutes, in whatever way seems optimal given the storage
> hardware and system situation.  Let us know when you're done."

The issue there is that the kernel has other triggers for needing to
clean data. We have no infrastructure to handle variable writeback
deadlines at the moment, nor do we have any infrastructure to do
roughly metered writeback of such files to disk. I think we could
add it to the infrastructure without too much perturbation of the
code, but as you've pointed out that still leaves the fact there's
no obvious interface to configure such behaviour. Would it need to
be persistent?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Tom Lane
Robert Haas  writes:
> I don't see that as a problem.  What we're struggling with today is
> that, until we fsync(), the system is too lazy about writing back
> dirty pages.  And then when we fsync(), it becomes very aggressive and
> system-wide throughput goes into the tank.  What we're aiming to do
> here is to start the writeback sooner than it would otherwise
> start so that it is spread out over a longer period of time.

Yeah.  It's sounding more and more like the right semantics are to
give the kernel a hint that we're going to fsync these files later,
so it ought to get on with writing them anytime the disk has nothing
better to do.  I'm not sure if there's value in being specific about
how much later; that would probably depend on details of the scheduler
that I don't know.
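For what it's worth, the closest per-file-descriptor knob that exists today is
roughly in this spirit; a minimal Linux-only sketch (using the existing
sync_file_range() call, not the new interface being discussed) of "start
writing this now, the fsync comes later" looks like this:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch: hint the kernel to start writing back a file we intend to fsync
 * later.  SYNC_FILE_RANGE_WRITE queues asynchronous writeback of the given
 * range without waiting for it, so the eventual fsync should find little
 * left to do. */
static void
hint_writeback(int fd)
{
	/* offset 0, nbytes 0 means "from the start to the end of the file" */
	if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
		perror("sync_file_range");
}

int
main(void)
{
	char	page[8192];
	int		fd = open("segment.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	memset(page, 0, sizeof(page));
	for (int i = 0; i < 1024; i++)		/* dirty 8MB of page cache */
		if (write(fd, page, sizeof(page)) != (ssize_t) sizeof(page))
			break;

	hint_writeback(fd);		/* checkpointer-style hint, issued early ... */
	/* ... other checkpoint work happens here ... */
	fsync(fd);				/* ... and the checkpoint-end fsync is cheaper */

	close(fd);
	return 0;
}

Note that sync_file_range() only starts the IO and makes no durability
promise, so the real fsync() at the end is still required.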

regards, tom lane




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Tom Lane
Dave Chinner  writes:
> On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
>> No, we'd be happy to re-request it during each checkpoint cycle, as
>> long as that wasn't an unduly expensive call to make.  I'm not quite
>> sure where such requests ought to "live" though.  One idea is to tie
>> them to file descriptors; but the data to be written might be spread
>> across more files than we really want to keep open at one time.

> It would be a property of the inode, as that is how writeback is
> tracked and timed. Set and queried through a file descriptor,
> though - it's basically the same context that fadvise works
> through.

Ah, got it.  That would be fine on our end, I think.

>> We could probably live with serially checkpointing data
>> in sets of however-many-files-we-can-have-open, if file descriptors are
>> the place to keep the requests.

> Inodes live longer than file descriptors, but there's no guarantee
> that they live from one fd context to another. Hence my question
> about persistence ;)

I plead ignorance about what an "fd context" is.  However, if what you're
saying is that there's a small chance of the kernel forgetting the request
during normal system operation, I think we could probably tolerate that,
if the API is designed so that we ultimately do an fsync on the file
anyway.  The point of the hint would be to try to ensure that the later
fsync had little to do.  If sometimes it didn't work, well, that's life.
We're ahead of the game as long as it usually works.

regards, tom lane




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 7:22 PM, Dave Chinner  wrote:
> No, I meant the opposite - in low memory situations, the system is
> going to go to hell in a handbasket because we are going to cause a
> writeback IO storm cleaning memory regardless of these IO
> priorities. i.e. there is no way we'll let "low priority writeback
> to avoid IO storms" cause OOM conditions to occur. That is, in OOM
> conditions, cleaning dirty pages becomes one of the highest priority
> tasks of the system

I don't see that as a problem.  What we're struggling with today is
that, until we fsync(), the system is too lazy about writing back
dirty pages.  And then when we fsync(), it becomes very aggressive and
system-wide throughput goes into the tank.  What we're aiming to do
here is to start the writeback sooner than it would otherwise
start so that it is spread out over a longer period of time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] CREATE FOREIGN TABLE ( ... LIKE ... )

2014-01-15 Thread Vik Fearing
On 01/16/2014 01:21 AM, Tom Lane wrote:
> Vik Fearing  writes:
>>> I am marking this patch as 'returned with feedback' in the commitfest app.
>> It looks like this patch got left behind in the previous commitfest. 
>> What is the policy for moving it?  Is it too late and will have to wait
>> until the next commitfest?
>> https://commitfest.postgresql.org/action/patch_view?id=1254
> I think you were in error to mark it "returned with feedback", as that
> caused everyone to stop paying attention to it in that commitfest.
> (And David dropped the ball too, as he should have done something to
> bring it back from that state, if it was committable or nearly so.)

I see.  Sorry about that.

> I see no reason why you shouldn't move it to the new fest; perhaps
> mark it as waiting on author, since really it's his responsibility
> to take the next step, ie comment on your version of the patch.

Done.

-- 
Vik





Re: [HACKERS] CREATE FOREIGN TABLE ( ... LIKE ... )

2014-01-15 Thread Tom Lane
Vik Fearing  writes:
>> I am marking this patch as 'returned with feedback' in the commitfest app.

> It looks like this patch got left behind in the previous commitfest. 
> What is the policy for moving it?  Is it too late and will have to wait
> until the next commitfest?

> https://commitfest.postgresql.org/action/patch_view?id=1254

I think you were in error to mark it "returned with feedback", as that
caused everyone to stop paying attention to it in that commitfest.
(And David dropped the ball too, as he should have done something to
bring it back from that state, if it was committable or nearly so.)

I see no reason why you shouldn't move it to the new fest; perhaps
mark it as waiting on author, since really it's his responsibility
to take the next step, ie comment on your version of the patch.

regards, tom lane




Re: [HACKERS] CREATE FOREIGN TABLE ( ... LIKE ... )

2014-01-15 Thread David Fetter
On Thu, Jan 16, 2014 at 01:07:50AM +0100, Vik Fearing wrote:
> On 11/24/2013 02:03 AM, Vik Fearing wrote:
> > On 10/15/2013 07:50 AM, David Fetter wrote:
> >> On Mon, Oct 07, 2013 at 11:16:56PM -0700, David Fetter wrote:
> >>> Folks,
> >>>
> >>> Please find attached a patch implementing and documenting, to some
> >>> extent, $subject.  I did this in aid of being able to import SQL
> >>> standard catalogs and other entities where a known example could
> >>> provide a template for a foreign table.
> >>>
> >>> Should there be errhint()s, too?  Should we pile up all such errors
> >>> and mention them at the end rather than simply bailing on the first
> >>> one?
> >>>
> >>> TBD: regression tests.
> >> Now included: regression tests for disallowed LIKE options.
> > I like this patch, but I don't like its implementation at all.
> >
> > First of all, the documentation doesn't compile:
> >
> > openjade:ref/create_foreign_table.sgml:124:17:E: end tag for "LISTITEM"
> > omitted, but OMITTAG NO was specified
> > openjade:ref/create_foreign_table.sgml:119:4: start tag was here
> >
> >
> > I fixed that, and then noticed that like_option is not explained like it
> > is in CREATE TABLE.
> >
> > Then I got down to the description of the LIKE clause in both pages, and
> > I noticed the last line of CREATE TABLE, which is "Inapplicable options
> > (e.g., INCLUDING INDEXES from a view) are ignored.".  This is
> > inconsistent with the behavior of this patch to throw errors for
> > inapplicable options.
> >
> > Attached is a patch which corrects and completes the documentation
> > issues noted above, and also silently ignores inapplicable options.  In
> > addition to reducing patch size, this also allows the use of INCLUDING
> > ALL.  Because these options no longer produce errors, and that's all the
> > regression tests were looking for, I have removed those tests
> > (unfortunately leaving none).
> >
> > Aside from this difference in behavior, I see no fault in this patch.
> >
> > I am marking this patch as 'returned with feedback' in the commitfest app.
> >
> 
> It looks like this patch got left behind in the previous commitfest. 
> What is the policy for moving it?  Is it too late and will have to wait
> until the next commitfest?
> 
> https://commitfest.postgresql.org/action/patch_view?id=1254

I think we should still be OK putting it in the current one.

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jim Nasby

On 1/15/14, 12:00 AM, Claudio Freire wrote:

> My completely unproven theory is that swapping is overwhelmed by
> near-misses. Ie: a process touches a page, and before it's actually
> swapped in, another process touches it too, blocking on the other
> process' read. But the second process doesn't account for that page
> when evaluating predictive models (ie: read-ahead), so the next I/O by
> process 2 is unexpected to the kernel. Then the same with 1. Etc... In
> essence, swap, by a fluke of its implementation, fails utterly to
> predict the I/O pattern, and results in far sub-optimal reads.
>
> Explicit I/O is free from that effect, all read calls are accountable,
> and that makes a difference.
>
> Maybe, if the kernel could be fixed in that respect, you could
> consider mmap'd files as a suitable form of temporary storage. But
> that would depend on the success and availability of such a fix/patch.


Another option is to consider some of the more "radical" ideas in this thread, 
but only for temporary data. Our write sequencing and other needs are far less stringent 
for this stuff.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Tom Lane
Dave Chinner  writes:
> On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
>> And most importantly, "Also, please don't freeze up everything else in the
>> process"

> If you hand writeback off to the kernel, then writeback for memory
> reclaim needs to take precedence over "metered writeback". If we are
> low on memory, then cleaning dirty memory quickly to avoid ongoing
> allocation stalls, failures and potentially OOM conditions is far more
> important than anything else.

I think you're in violent agreement, actually.  Jeff's point is exactly
that we'd rather the checkpoint deadline slid than that the system goes
to hell in a handbasket for lack of I/O cycles.  Here "metered" really
means "do it as a low-priority task".

regards, tom lane




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Tom Lane
Dave Chinner  writes:
> On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
>> What we'd really like for checkpointing is to hand the kernel a boatload
>> (several GB) of dirty pages and say "how about you push all this to disk
>> over the next few minutes, in whatever way seems optimal given the storage
>> hardware and system situation.  Let us know when you're done."

> The issue there is that the kernel has other triggers for needing to
> clean data. We have no infrastructure to handle variable writeback
> deadlines at the moment, nor do we have any infrastructure to do
> roughly metered writeback of such files to disk. I think we could
> add it to the infrastructure without too much perturbation of the
> code, but as you've pointed out that still leaves the fact there's
> no obvious interface to configure such behaviour. Would it need to
> be persistent?

No, we'd be happy to re-request it during each checkpoint cycle, as
long as that wasn't an unduly expensive call to make.  I'm not quite
sure where such requests ought to "live" though.  One idea is to tie
them to file descriptors; but the data to be written might be spread
across more files than we really want to keep open at one time.
But the only other idea that comes to mind is some kind of global sysctl,
which would probably have security and permissions issues.  (One thing
that hasn't been mentioned yet in this thread, but maybe is worth pointing
out now, is that Postgres does not run as root, and definitely doesn't
want to.  So we don't want a knob that would require root permissions
to twiddle.)  We could probably live with serially checkpointing data
in sets of however-many-files-we-can-have-open, if file descriptors are
the place to keep the requests.

regards, tom lane




Re: [HACKERS] CREATE FOREIGN TABLE ( ... LIKE ... )

2014-01-15 Thread Vik Fearing
On 11/24/2013 02:03 AM, Vik Fearing wrote:
> On 10/15/2013 07:50 AM, David Fetter wrote:
>> On Mon, Oct 07, 2013 at 11:16:56PM -0700, David Fetter wrote:
>>> Folks,
>>>
>>> Please find attached a patch implementing and documenting, to some
>>> extent, $subject.  I did this in aid of being able to import SQL
>>> standard catalogs and other entities where a known example could
>>> provide a template for a foreign table.
>>>
>>> Should there be errhint()s, too?  Should we pile up all such errors
>>> and mention them at the end rather than simply bailing on the first
>>> one?
>>>
>>> TBD: regression tests.
>> Now included: regression tests for disallowed LIKE options.
> I like this patch, but I don't like its implementation at all.
>
> First of all, the documentation doesn't compile:
>
> openjade:ref/create_foreign_table.sgml:124:17:E: end tag for "LISTITEM"
> omitted, but OMITTAG NO was specified
> openjade:ref/create_foreign_table.sgml:119:4: start tag was here
>
>
> I fixed that, and then noticed that like_option is not explained like it
> is in CREATE TABLE.
>
> Then I got down to the description of the LIKE clause in both pages, and
> I noticed the last line of CREATE TABLE, which is "Inapplicable options
> (e.g., INCLUDING INDEXES from a view) are ignored.".  This is
> inconsistent with the behavior of this patch to throw errors for
> inapplicable options.
>
> Attached is a patch which corrects and completes the documentation
> issues noted above, and also silently ignores inapplicable options.  In
> addition to reducing patch size, this also allows the use of INCLUDING
> ALL.  Because these options no longer produce errors, and that's all the
> regression tests were looking for, I have removed those tests
> (unfortunately leaving none).
>
> Aside from this difference in behavior, I see no fault in this patch.
>
> I am marking this patch as 'returned with feedback' in the commitfest app.
>

It looks like this patch got left behind in the previous commitfest. 
What is the policy for moving it?  Is it too late and will have to wait
until the next commitfest?

https://commitfest.postgresql.org/action/patch_view?id=1254

-- 
Vik





Re: [HACKERS] CREATE TABLESPACE WITH

2014-01-15 Thread Vik Fearing
On 01/15/2014 03:07 AM, Michael Paquier wrote:
> I just had a quick look at this patch, no testing at all. 

Thank you for looking at it.

> I am seeing that regression tests are still missing here, they should be 
> added in
> src/test/regress/input/tablespace.source. 

New patch attached, with regression tests.

> Then, for the use of either
> WITH or SET... For example CREATE FUNCTION uses SET for a
> configuration parameter, and ALTER FUNCTION is consistent with that.
> So SET makes more sense to be consistent with CREATE TABLESPACE? This
> patch should not be modified once again as long as there are no more
> opinions though...

Based on your analysis of current behavior elsewhere, and the messier
patch with SET, I have switched my vote from ambivalent to preferring WITH.

-- 
Vik

*** a/doc/src/sgml/ref/create_tablespace.sgml
--- b/doc/src/sgml/ref/create_tablespace.sgml
***
*** 21,27  PostgreSQL documentation
  
   
  
! CREATE TABLESPACE tablespace_name [ OWNER user_name ] LOCATION 'directory'
  
   
  
--- 21,30 
  
   
  
! CREATE TABLESPACE tablespace_name
! [ OWNER user_name ]
! LOCATION 'directory'
! [ WITH ( tablespace_option = value [, ... ] ) ]
  
   
  
***
*** 87,92  CREATE TABLESPACE tablespace_name [
--- 90,113 
 

   
+ 
+  
+   tablespace_option
+   
+
+ A tablespace parameter to be set or reset.  Currently, the only
+ available parameters are seq_page_cost and
+ random_page_cost.  Setting either value for a particular
+ tablespace will override the planner's usual estimate of the cost of
+ reading pages from tables in that tablespace, as established by
+ the configuration parameters of the same name (see
+ ,
+ ).  This may be useful if one
+ tablespace is located on a disk which is faster or slower than the
+ remainder of the I/O subsystem.
+
+   
+  

   
  
*** a/src/backend/commands/tablespace.c
--- b/src/backend/commands/tablespace.c
***
*** 234,239  CreateTableSpace(CreateTableSpaceStmt *stmt)
--- 234,240 
  	Oid			tablespaceoid;
  	char	   *location;
  	Oid			ownerId;
+ 	Datum		newOptions;
  
  	/* Must be super user */
  	if (!superuser())
***
*** 317,323  CreateTableSpace(CreateTableSpaceStmt *stmt)
  	values[Anum_pg_tablespace_spcowner - 1] =
  		ObjectIdGetDatum(ownerId);
  	nulls[Anum_pg_tablespace_spcacl - 1] = true;
! 	nulls[Anum_pg_tablespace_spcoptions - 1] = true;
  
  	tuple = heap_form_tuple(rel->rd_att, values, nulls);
  
--- 318,333 
  	values[Anum_pg_tablespace_spcowner - 1] =
  		ObjectIdGetDatum(ownerId);
  	nulls[Anum_pg_tablespace_spcacl - 1] = true;
! 
! 	/* Generate new proposed spcoptions (text array) */
! 	newOptions = transformRelOptions((Datum) 0,
! 	 stmt->options,
! 	 NULL, NULL, false, false);
! 	(void) tablespace_reloptions(newOptions, true);
! 	if (newOptions != (Datum) 0)
! 		values[Anum_pg_tablespace_spcoptions - 1] = newOptions;
! 	else
! 		nulls[Anum_pg_tablespace_spcoptions - 1] = true;
  
  	tuple = heap_form_tuple(rel->rd_att, values, nulls);
  
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***
*** 3370,3375  _copyCreateTableSpaceStmt(const CreateTableSpaceStmt *from)
--- 3370,3376 
  	COPY_STRING_FIELD(tablespacename);
  	COPY_STRING_FIELD(owner);
  	COPY_STRING_FIELD(location);
+ 	COPY_NODE_FIELD(options);
  
  	return newnode;
  }
*** a/src/backend/nodes/equalfuncs.c
--- b/src/backend/nodes/equalfuncs.c
***
*** 1610,1615  _equalCreateTableSpaceStmt(const CreateTableSpaceStmt *a, const CreateTableSpace
--- 1610,1616 
  	COMPARE_STRING_FIELD(tablespacename);
  	COMPARE_STRING_FIELD(owner);
  	COMPARE_STRING_FIELD(location);
+ 	COMPARE_NODE_FIELD(options);
  
  	return true;
  }
*** a/src/backend/parser/gram.y
--- b/src/backend/parser/gram.y
***
*** 3588,3599  opt_procedural:
   *
   */
  
! CreateTableSpaceStmt: CREATE TABLESPACE name OptTableSpaceOwner LOCATION Sconst
  {
  	CreateTableSpaceStmt *n = makeNode(CreateTableSpaceStmt);
  	n->tablespacename = $3;
  	n->owner = $4;
  	n->location = $6;
  	$$ = (Node *) n;
  }
  		;
--- 3588,3600 
   *
   */
  
! CreateTableSpaceStmt: CREATE TABLESPACE name OptTableSpaceOwner LOCATION Sconst opt_reloptions
  {
  	CreateTableSpaceStmt *n = makeNode(CreateTableSpaceStmt);
  	n->tablespacename = $3;
  	n->owner = $4;
  	n->location = $6;
+ 	n->options = $7;
  	$$ = (Node *) n;
  }
  		;
*** a/src/include/nodes/parsenodes.h
--- b/src/include/nodes/parsenodes.h
***
*** 1669,1674  typedef struct CreateTableSpaceStmt
--- 1669,1675 
  	char	   *tabl

Re: [HACKERS] trgm regex index peculiarity

2014-01-15 Thread Tom Lane
Alexander Korotkov  writes:
> Revised version of patch with necessary comments.

I looked at this patch a bit.  It seems like this:

+  *BLANK_COLOR_SIZE - How much blank character is more frequent than
+  *   other character in average
+ #define BLANK_COLOR_SIZE  32

is a drastic overestimate.  Using the program Erik provided to generate
some sample data, I find that blanks in trigrams are maybe about three
times more common than other characters.  Possibly Erik's data isn't
very representative of real text, but if that's the case we'd better
get a more representative sample case before we start optimizing ...

Now maybe the above number is okay for this purpose anyway, but if so
it needs some different justification; maybe call it a "penalty"?
But nonetheless it seems like a darn large penalty, particularly for " x"
type trigrams where the value would effectively be squared.  Compared to
the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
you might as well forget the "sizing" approach and just flat out reject
trigrams containing COLOR_BLANK, because they'll never get through the
size filter.  I'm inclined to think you need a larger MAX_TRGM_SIZE;
and WISH_TRGM_SIZE seems mighty small as well, compared to what the
code would have done before.

Also, surely this bit:

!   trgmNFA->totalTrgmCount = (int) totalTrgmSize;

is flat out wrong?  expandColorTrigrams() expects that
trgmNFA->totalTrgmCount is the truth, not some penalty-bloated
abstraction.  The fact that the patch hasn't failed on you completely
demonstrates that you've not tested any cases where blank-containing
trigrams got through the filter, reinforcing my feeling that it's
probably too strict now.

I wonder if it would be more appropriate to continue to measure the limit
MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
penalized "size" as the sort key while determining which color trigrams
to discard first.

Another thought is that it's not clear that you should apply the same
penalty to blanks in all three positions.  Because of the padding
settings, any one word will normally lead to padded strings "  a",
" ab", "yz ", so that spaces in the first position are about twice
as common as spaces in the other positions.  (It's a little less
than that because single-letter words produce only "  a" and " a ";
but I'd think that such words aren't very common in text that trigrams
are effective for.)  So I'm thinking the penalty ought to take that
into account.

I'm also inclined to think that it might be worth adding a size
field to ColorTrgmInfo rather than repeatedly calculating the
size estimate.  We only allow a thousand and some ColorTrgmInfos
at most, so the extra space wouldn't be that large, and I'm concerned
by the expense the current patch adds to the sort comparator.
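
Just to illustrate what I mean by precomputing the size (the struct layout
below is a hypothetical sketch, not the actual pg_trgm definition), the
comparator then becomes trivially cheap:

    typedef struct ColorTrgmInfoSketch
    {
        /* ... the existing ColorTrgmInfo fields would go here ... */
        int64       size;       /* penalized size, computed once up front */
    } ColorTrgmInfoSketch;

    static int
    colorTrgmInfoSizeCmp(const void *p1, const void *p2)
    {
        const ColorTrgmInfoSketch *a = (const ColorTrgmInfoSketch *) p1;
        const ColorTrgmInfoSketch *b = (const ColorTrgmInfoSketch *) p2;

        if (a->size != b->size)
            return (a->size < b->size) ? -1 : 1;
        return 0;
    }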

Another point is that this comment:

 * Note: per-color-trigram counts cannot overflow an int so long as
 * COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie about
 * 1290.  However, the grand total totalTrgmCount might conceivably
 * overflow an int, so we use int64 for that within this routine.

is no longer valid, or at least it fails to demonstrate that the result
of getColorTrigramSize() can't overflow an int; indeed I rather fear it can.
The patch failed to even update the variable name in this comment, let
alone address the substantive question.

There are some other cosmetic things I didn't like, for instance
colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
rename it.  That's about it for substantive comments though.

regards, tom lane




Re: [HACKERS] Backup throttling

2014-01-15 Thread Alvaro Herrera
Alvaro Herrera escribió:
> Antonin Houska escribió:
> > Thanks for checking. The new version addresses your findings.
> 
> I gave this patch a look.

BTW I also moved the patch to the commitfest currently running, and set it
waiting-on-author.

Your move.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jeff Janes
On Wed, Jan 15, 2014 at 7:12 AM, Tom Lane  wrote:

> Heikki Linnakangas  writes:
> > On 01/15/2014 07:50 AM, Dave Chinner wrote:
> >> FWIW [and I know you're probably sick of hearing this by now], but
> >> the blk-io throttling works almost perfectly with applications that
> >> use direct IO.
>
> > For checkpoint writes, direct I/O actually would be reasonable.
> > Bypassing the OS cache is a good thing in that case - we don't want the
> > written pages to evict other pages from the OS cache, as we already have
> > them in the PostgreSQL buffer cache.
>
> But in exchange for that, we'd have to deal with selecting an order to
> write pages that's appropriate depending on the filesystem layout,
> other things happening in the system, etc etc.  We don't want to build
> an I/O scheduler, IMO, but we'd have to.
>
> > Writing one page at a time with O_DIRECT from a single process might be
> > quite slow, so we'd probably need to use writev() or asynchronous I/O to
> > work around that.
>
> Yeah, and if the system has multiple spindles, we'd need to be issuing
> multiple O_DIRECT writes concurrently, no?
>

writev effectively does do that, doesn't it?  But they do have to be on the
same file handle, so that could be a problem.  I think we need something
like sorted checkpoints sooner or later, anyway.
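
As a sketch only (the buffer array, sizes, and iovec cap are placeholders),
batching a run of adjacent pages on one file handle would look roughly like
this; note the chunks land back-to-back at the current file offset, which is
exactly why sorting matters:

    #include <sys/uio.h>

    static ssize_t
    write_buffer_run(int fd, char **bufs, int nbufs, size_t buflen)
    {
        struct iovec iov[16];       /* IOV_MAX permitting */
        int         n = (nbufs < 16) ? nbufs : 16;

        for (int i = 0; i < n; i++)
        {
            iov[i].iov_base = bufs[i];
            iov[i].iov_len = buflen;
        }
        return writev(fd, iov, n);
    }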



> What we'd really like for checkpointing is to hand the kernel a boatload
> (several GB) of dirty pages and say "how about you push all this to disk
> over the next few minutes, in whatever way seems optimal given the storage
> hardware and system situation.  Let us know when you're done."


And most importantly, "Also, please don't freeze up everything else in the
process"

Cheers,

Jeff


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-15 Thread Simon Riggs
On 15 January 2014 16:47, Robert Haas  wrote:

>>> There would be a postgresql.conf parameter prune_cost_limit, as well
>>> as a table level parameter that would prevent pruning except via
>>> VACUUM.
>>>
>>> This will help in these ways
>>> * Reduce write I/O from SELECTs and pg_dump - improving backups and BI 
>>> queries
>>> * Allow finer grained control over Hot Standby conflicts
>>> * Potentially allow diagnostic inspection of older data via SeqScan

>> Patch attached, implemented to reduce writes by SELECTs only.
>
> I am still not sure whether we want this, but I think it's definitely
> an improvement over the previous version.  Assorted comments:
>
> - Naming consistency seems to me to dictate that there should be more
> similarity between the reloption name (allow_buffer_cleanup) and the
> GUC (prune_page_dirty_limit).

Now that I've written the patch, I'm seeing those as two different
things, but YMMV and I am very open to naming suggestions.


> - The documentation doesn't describe the use case where suppressing
> cleanup on a per-table basis would be desirable, and I can't think of
> one, either.

We already know that HOT is ineffective in areas of high contention
(previous thread by me). Prior experience was that smaller tables
didn't show much apparent benefit from using HOT either; its
effectiveness was limited to medium and large tables being updated.
The two already stated use cases that would apply are these ones

* Allow finer grained control over Hot Standby conflicts
* Potentially allow diagnostic inspection of older data via SeqScan

So the use cases for the two parameters seem quite different and we
may decide we want one but not the other.

> - There are a variety of ways to limit pruning; here, you've chosen to
> limit it to a particular number of pruning operations per executor
> invocation.  But the flag is global, not part of the executor state,
> so a query that calls a PL/pgsql function during execution will reset
> the counter for the parent query also, which doesn't seem very
> principled.

That is a subtle thing in this patch, and I agree that a potential problem
exists. The current limit is set according to the currently executing
statement, but the current total is not reset until the start of the top
level statement. So the behaviour is not reset for statements executed
within a PL/pgSQL function.

> In a patch I posted a few years ago to set hint bits only sometimes, I
> settled on an algorithm where I dirtied the first 50 pages per scan
> and then skipped the next 950, or something like that.  The idea was
> that you wanted the pages that did get dirtied to be clustered
> together to avoid random I/O; and also that you wanted table of
> arbitrary size to get hinted within a certain number of scans (e.g.
> 20).  The limiting here is much more aggressive, so on large tables it
> will amount to basically no pruning at all.  I dunno whether that's a
> good idea or not.  But if the idea of making this an integer rather
> than a boolean is to allow some pruning to still happen while keeping
> it checked within reasonable bounds, I'm not sure it will succeed.

It sounds like you're in favour of the overall concept of limiting
writes, which is good.

The behaviour I think we need, based on listening to everybody so far is

* OLTP is unaffected
* Large SELECTs and pg_dump don't cause lots of write I/O.

and hence why "prune_page_dirty_limit" offers a change in behaviour at
a certain point.
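
Purely as an illustration of the kind of absolute cut-off I mean (the names
and semantics below are a hypothetical sketch, not the patch itself):

    #include <stdbool.h>

    /* Callers ask just before dirtying a page for opportunistic pruning;
     * once the per-statement budget is spent, leave the rest for VACUUM. */
    static bool
    pruning_budget_available(int limit, int *pages_dirtied_this_statement)
    {
        if (*pages_dirtied_this_statement >= limit)
            return false;
        (*pages_dirtied_this_statement)++;
        return true;
    }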

Reducing cleanup to "only 5%" just reduces but doesn't remove the
problem. If the data is stored on very poor I/O infrastructure, any
significant volume of writes can adversely affect performance. As we
reduce the percentage, we also reduce the benefit from inducing writes
in the first place and so I would question why bother at all using a
percentage. For me, a parameter that gives you absolute rather than
relative control is more desirable.

The current behaviour assumes it is OK for the first/next user to
touch the data to be the one that won't mind re-writing everything. In
time critical applications, the first/next user could well have a very
urgent need to access the data quickly and doesn't want to have to pay
this price. In seldom-accessed data applications, VACUUM has lots of
time to run out of hours, so users are OK to defer this work. Some
applications exist where we literally want zero I/O.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it

2014-01-15 Thread Peter Eisentraut
On 1/15/14, 3:01 PM, Josh Berkus wrote:
> Three issues:
> 
> a) if postgresql is still going to look for a recovery.conf file in the
> usual place, but we are changing the names and meaning of some of the
> parameters, then aren't we making the upgrade problem much worse?

That assumes that we are changing the names and meanings of some of the
parameters, which I don't see a reason for.

> b) what if the admin *does* want to disable reading recovery.conf in
> order to prevent old utilities from mistakenly including one?  How will
> they do that?

That assumes that there is a reason for doing that, which goes away if
point (a) is addressed.

> c) would this be in the configdir, datadir, or both?

That might depend on the parameter and what a tool wants to do with it.

There is also the consideration of whether some of those tools couldn't
be changed to use ALTER SYSTEM.

> I'd thought that part of the idea of the merger was to remove the
> "magic" status of recovery.conf.

Well, clearly, everyone has their own ideas about that.  I have several
non-overlapping ones of my own. ;-)  But my point is that we should look
what actually comes out of that discussion before we start designing
other facilities that interact with it.





Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it

2014-01-15 Thread Peter Eisentraut
On 1/15/14, 4:35 PM, Tom Lane wrote:
> Peter Eisentraut  writes:
>> On 1/15/14, 1:53 PM, Josh Berkus wrote:
>>> Yes, I'm also arguing that postgresql.auto.conf should go into conf.d.
>>> I said I'd bring that up again after ALTER SYSTEM SET was committed, and
>>> here it is.
> 
>> Independent of the above, I don't agree with that.  postgresql.auto.conf
>> is a special thing and should have its own special place.  For one
>> thing, when putting configuration files in a separate directory
>> structure (/etc/ vs /var), then postgresql.auto.conf should stay in the
>> data directory.
> 
> It seems like we still aren't all on the same page as to whether the
> conf.d directory (and contained files) should be expected to be writable
> by the postgres server or not.  I think it's hopeless to proceed further
> unless there's a strong consensus on that.

You can turn this around and ask, why should it be writable?  The server
has no need to write anything there.

In my mind, a conf.d directory is an extension of a single-file
configuration, and so it should be handled that way.





Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it

2014-01-15 Thread Tom Lane
Peter Eisentraut  writes:
> On 1/15/14, 1:53 PM, Josh Berkus wrote:
>> Yes, I'm also arguing that postgresql.auto.conf should go into conf.d.
>> I said I'd bring that up again after ALTER SYSTEM SET was committed, and
>> here it is.

> Independent of the above, I don't agree with that.  postgresql.auto.conf
> is a special thing and should have its own special place.  For one
> thing, when putting configuration files in a separate directory
> structure (/etc/ vs /var), then postgresql.auto.conf should stay in the
> data directory.

It seems like we still aren't all on the same page as to whether the
conf.d directory (and contained files) should be expected to be writable
by the postgres server or not.  I think it's hopeless to proceed further
unless there's a strong consensus on that.

My vote would be that the server should *not* be expected to be able to
write these files.  It might physically be able to, in a quick-and-dirty
installation, but in a setup prepared by a packaging system I'd not
expect that the config files would be postgres-owned.

Given that assumption, it's clear that postgresql.auto.conf can't live
in conf.d.  However, if you reject that assumption then maybe it should.

regards, tom lane




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-15 Thread Tom Lane
Florian Pflug  writes:
> I'm currently thinking the best way forward is to get a basic patch
> without any NUMERIC stuff committed, and to revisit this after that's done.

+1.  I liked Robert's suggestion that the "fast" aggregates be implemented
as a contrib module rather than directly in core, too.  If we do that then
it's perfectly reasonable to handle that as a separate patch.

regards, tom lane




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-15 Thread Florian Pflug
On Jan15, 2014, at 19:56 , Robert Haas  wrote:
> It strikes me that for numeric what you really need is to just tell
> the sum operation, whether through a parameter or otherwise, how many
> decimal places to show in the output.  Obviously that's not a
> practical change for sum() itself, but if we're inventing new stuff it
> can be done.

You can already do that, just cast the result of SUM(numeric) to
an appropriately constrained NUMERIC type, i.e. to NUMERIC(prec, scale).

BTW, AVG() and STDDEV() have the same issue. The problem is just partially
masked by the division by N (or N-1) at the end, because we always emit as
least 16 fractional digits when dividing. So you have to have an input 
value with a larger scale than that to trigger it.

For the following query

  select avg(x) over (order by i rows between current row and 1 following)
  from (values
(1,1), (2,1), (3,0.1), (4,1), (5,1)
  ) t(i,x);

9.3 returns
 avg 
-
  1.
 0.50001
 0.50001
  1.
  1.

but HEAD+patch returns
 avg 
-
  1.
 0.50001
 0.50001
 1.0
 1.0

I have to admit that I'm *very* tempted to suggest we simply ignore this -
but that *would* mean accepting that windowed aggregates are non-
deterministic in the sense that their result (even if only in the number
of trailing zeros) depends on values outside of the frame. Which, I guess,
is a box that best stays closed...

I'm currently thinking the best way forward is to get a basic patch
without any NUMERIC stuff committed, and to revisit this after that's done.

best regards,
Florian Pflug





Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-15 Thread Andres Freund
Hi,

On 2014-01-15 13:28:25 -0500, Robert Haas wrote:
> - I think you should just regard ReplicationSlotCtlLock as protecting
> the "name" and "active" flags of every slot.  ReplicationSlotCreate()
> would then not need to mess with the spinlocks at all, and
> ReplicationSlotAcquire and ReplicationSlotDrop get a bit simpler too I
> think.  Functions like ReplicationSlotsComputeRequiredXmin() can
> acquire this lock in shared mode and only lock the slots that are
> actually active, instead of all of them.

I first thought you meant that we should get rid of the spinlock, but
after rereading I think you just mean that ->name, ->active, ->in_use
are only allowed to change while holding the lwlock exclusively so we
don't need to spinlock in those cases? If so, yes, that works for me.

> - If you address /* FIXME: apply sanity checking to slot name */, then
> I think that also addresses /* XXX: do we want to use truncate
> identifier instead? */. In other words, let's just error out if the
> name is too long.  I'm not sure what other sanity checking is needed
> here; maybe just restrict it to 7-bit characters to avoid problems
> with encodings for individual databases varying.

Yea, erroring out seems like a good idea. But I think we need to
restrict slot names a bit more than that, given they are used as
filenames... We could instead name the files using the slot's offset,
but I'd prefer to not go that route.
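
Something along these lines is what I have in mind, purely as a sketch (the
length cap and allowed character set are placeholders, not the actual patch):

    #include <stdbool.h>
    #include <string.h>

    /* Accept only [a-z0-9_] and a bounded length, so every slot name is a
     * safe, encoding-independent filename on all platforms. */
    static bool
    slot_name_is_valid(const char *name)
    {
        size_t      len = strlen(name);

        if (len == 0 || len >= 64)
            return false;
        for (size_t i = 0; i < len; i++)
        {
            char        c = name[i];

            if (!((c >= 'a' && c <= 'z') ||
                  (c >= '0' && c <= '9') ||
                  c == '_'))
                return false;
        }
        return true;
    }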

> - ReplicationSlotAcquire probably needs to ignore slots that are not active.

Not sure what you mean? If the slot isn't in_use we'll skip it in the loop.

> - If there's a coding rule that slot->database can't be changed while
> the slot is active, then the check to make sure that the user isn't
> trying to bind to a slot with a mis-matching database could be done
> before the code described in the previous point, avoiding the need to
> go back and release the resource.

I don't think slot->database should be allowed to change at all...

> - I think the critical section in ReplicationSlotDrop is bogus.  If
> DeleteSlot() fails, we scarcely need to PANIC.  The slot just isn't
> gone.

Well, if deleting the slot fails, we don't really know at which point it
failed, which means that the on-disk state might not correspond to the
in-memory state. I don't see a point in adding code that tries to handle
that case correctly...

> - cancel_before_shmem_exit is only guaranteed to remove the
> most-recently-added callback.

Yea :(. I think that's safe for the current usages but seems mighty
fragile... Not sure what to do about it. Just register KillSlot once and
keep it registered?

> - Why does ReplicationSlotsComputeRequiredXmin() need to hold
> ProcArrayLock at all?

There's reasoning, but I just noticed that its basis might be flawed
anyway :(.
When starting changeset extraction in a new slot we need to guarantee
that we only start decoding records we know the catalog tuples haven't
been removed for.
So, when creating the slot I've so far done a GetOldestXmin() and used
that to check against xl_running_xact->oldestRunningXid. But
GetOldestXmin() can go backwards...

I'll think a bit and try to simplify this.

> - ReplicationSlotsComputeRequiredXmin scarcely needs to hold the
> spinlock while it does all of those gyrations.  It can just acquire
> the spinlock, copy the three fields needed into local variables, and
> release the spinlock.  The rest can be worked out afterwards.
> Similarly in ReplicationSlotsComputeRequiredXmin.

Yea, will change.

> - A comment in KillSlot wonders whether locking is required.  I say
> yes.  It's safe to take lwlocks and spinlocks during shmem exit, and
> failing to do so seems like a recipe for subtle corner-case bugs.

I agree that it's safe to use spinlocks, but lwlocks? What if we are
erroring out while holding an lwlock? Code that runs inside a
TransactionCommand is protected against that, but if not ProcKill()
which invokes LWLockReleaseAll() runs pretty late in the teardown
process...

> - pg_get_replication_slots() wonders what permissions we require.  I
> don't know that any special permissions are needed here; the data
> we're exposing doesn't appear to be sensitive.  Unless I'm missing
> something?

I don't see a problem either, but it seems others have -
pg_stat_replication only displays minor amounts of information if one
doesn't have the replication privilege... Not sure what the reasoning
there is, and whether it applies here as well.

> - There seems to be no interface to acquire or release slots from
> either SQL or the replication protocol, nor any way for a client of
> this code to update its slot details.

I don't think either ever needs to do that - START_TRANSACTION SLOT slot
...; and decoding_slot_*changes will acquire/release for them while
active. What would the use case be to allow them to be acquired from SQL?

The slot details get updates by the respective replication code. For
streaming rep, that should happen via reply and feedback messages. For
changeset ext

Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it

2014-01-15 Thread Stephen Frost
Josh,

* Josh Berkus (j...@agliodbs.com) wrote:
> However, Debian is *never* going to add conf.d to the packages if we
> don't recommend it as an upstream project.  And, frankly, I use the
> apt.postgresql.org packages anyway, which would certainly include
> anything which was decided on this list.

Those are both categorically false claims.  Debian certainly does not ship
with 'trust' auth, nor do our PGDG packages.  They also move the conf
files out of the data directory.  Last, but not least, they absolutely
enable conf.d directories even when that is not the default upstream.

Additionally, I fully expect and hope that the apt.postgresql.org
packages will follow the Debian/Ubuntu package management; having those
diverge would absolutely be a horrible idea and cause a great deal of
trouble for our users.  Ideally, we can all agree, but this notion that
the PGDG must follow what happens on -hackers is simply wrong; we need to
include and coordinate with the OS package maintainers.

> It's more than an interesting thought.  It's the difference between
> having 20 lines of backwards compatibility code, and having 150 plus a
> bunch of additional user documentation and setup.

If I was writing the tool, I'm pretty sure that I'd be writing all that
code either way.

Thanks,

Stephen




Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it

2014-01-15 Thread Josh Berkus
On 01/15/2014 11:10 AM, Stephen Frost wrote:
> I don't buy this argument one bit- certainly, on Debian, the defaults
> are overridden for a number of variables already and could be done to
> enable a conf.d directory as well.  Directory creation would, of
> course, also be able to be handled by the Debian package.  Any tool
> which is being packaged for Debian would simply have to Depend on the
> version of the PG package that enabled the conf.d option by default.

However, Debian is *never* going to add conf.d to the packages if we
don't recommend it as an upstream project.  And, frankly, I use the
apt.postgresql.org packages anyway, which would certainly include
anything which was decided on this list.

> This doesn't deal with upgrade cases or where the user has disabled the
> feature and so the tool would need to check for a directory's
> existence, but changing our default isn't going to completely address
> that issue either.  

There's a HUGE difference between "This tool depends on conf.d, so
please don't disable it" and "please edit postgresql.conf in the
following way, and create this directory with these permissions ..."

>> I'm particularly thinking about this in relation to the merger of
>> recovery.conf and postgresql.conf.  There are several tools already
>> (RepMgr, OminPITR, HandyRep, pgPool, etc.) which manage recovery.conf
>> separately from postgresql.conf.  These tools will want to continue
>> managing recovery.conf as a separate file, even if it's /included in
>> postgresql.conf now.
> 
> Certainly an interesting thought.

It's more than an interesting thought.  It's the difference between
having 20 lines of backwards compatibility code, and having 150 plus a
bunch of additional user documentation and setup.

>> If conf.d exists by default, and is enabled in postgresql.conf by
>> default, this is easy: the tool just drops a recovery.conf file into
>> conf.d.  Changing file locations and variable names is a fairly simple
>> exercise in backwards compatibility.
> 
> This isn't quite that simple on at least Debian, and I believe RH, as
> there's going to be multiple directories involved (one for each cluster
> which exists on the system).  Also, there are rules which packages need
> to follow on Debian regarding conf file changes, which these certainly
> would be.

Again, from the perspective of the tool (assuming it already supports
Debian), this is just a directory change.  ("if version == 9.4,
directory = directory + "/conf.d/")

> My recommendation would be to argue the point with the package
> maintainers.  Even if we *had* a default, I'm not convinced that the
> package maintainers would keep it that way until and unless they update
> their scripts and whatnot to handle it.

This is disingenuous.  The package maintainers are never, ever going to
add conf.d and enable it by default unless it's the official
recommendation of the PostgreSQL project that they do so.  Exactly how
far would I get with "I couldn't get pgsql-hackers to agree to this, so
I'm asking you"?

On 01/15/2014 11:40 AM, Peter Eisentraut wrote:
> That seems like a very general argument.  We'd need to consider exactly
> what these tools want to do, and how that affects the recovery.conf
> merger.  My personal proposal was to allow postgresql.conf to contain
> recovery parameters, but read recovery.conf last anyway.  That would
> solve that problem.

Three issues:

a) if postgresql is still going to look for a recovery.conf file in the
usual place, but we are changing the names and meaning of some of the
parameters, then aren't we making the upgrade problem much worse?

b) what if the admin *does* want to disable reading recovery.conf in
order to prevent old utilities from mistakenly including one?  How will
they do that?

c) would this be in the configdir, datadir, or both?

I'd thought that part of the idea of the merger was to remove the
"magic" status of recovery.conf.

>> If conf.d exists by default, and is enabled in postgresql.conf by
>> default, this is easy: the tool just drops a recovery.conf file into
>> conf.d.
>
> That would just give false hope, I think.  My standard operating
> procedure is to delete the default postgresql.conf and write a new one
> from scratch (or have deployment tools do it).  I don't want to
> subscribe to a policy that the only endorsed way to maintain
> postgresql.conf is to start with the default one and edit around
> carefully.
>
> The only way you could achieve your goal is to read the configuration
> directory automatically, but that would come with its own set of problems.

Frankly, I'm not concerned about people who rewrite their
postgresql.conf from scratch already; it's *easy* to tell those people
to re-add the conf.d reference to that file.  I'm talking about the vast
majority of our users who never edit pg.conf by hand.

> Independent of the above, I don't agree with that.  postgresql.auto.conf
> is a special thing and should have its own special place.  For one
> thing, 

Re: [HACKERS] tests for client programs

2014-01-15 Thread Erik Rijkers
On Wed, January 15, 2014 06:30, Peter Eisentraut wrote:
> As we all know, the client programs (src/bin/) don't have any real test
>
> So I wrote something.
>
> I chose to use Perl-based tools, prove and Test::More, because those are

> [ 0001-Add-TAP-tests-for-client-programs.patch ]   32 k

I gave this a quick try.

Centos 6.5 final / perl 5.18.2


As mentioned earlier I had to install IPC::Run.

2 tests stumbled:

1. One test (pg_ctl/t/001_start_stop.pl) failed because I had PGDATA set.
I unset all PG* vars after that.  Not a big problem, but nonetheless it might
be better if the test suite removes/controls these variables before running.

2. The pg_isready test failed command_fails() ('fails with no server running')
because it defaults to the compiled-in server port (and that server was
running).  I added the test-designated port (65432, as defined in TestLib.pm).
This simple change is in the attached patch.


With these two changes the whole test suite passed.


Thanks,

Erik Rijkers




--- src/bin/scripts/t/080_pg_isready.pl.orig	2014-01-15 20:08:16.325916223 +0100
+++ src/bin/scripts/t/080_pg_isready.pl	2014-01-15 20:18:24.705927054 +0100
@@ -7,7 +7,7 @@
 program_version_ok('pg_isready');
 program_options_handling_ok('pg_isready');
 
-command_fails(['pg_isready'], 'fails with no server running');
+command_fails(['pg_isready', '-p65432'], 'fails with no server running');
 
 my $tempdir = tempdir;
 start_test_server $tempdir;


Re: [HACKERS] Add %z support to elog/ereport?

2014-01-15 Thread Andres Freund
On 2013-12-08 01:32:45 +0100, Andres Freund wrote:
> On 2013-12-06 09:54:59 -0500, Peter Eisentraut wrote:
> > On 11/11/13, 12:01 PM, Tom Lane wrote:
> > > I do recall Peter saying that the infrastructure knows how to
> > > verify conversion specs in translated strings, so it would have to be
> > > aware of 'z' flags for this to work.
> > 
> > It just checks that the same conversion placeholders appear in the
> > translation.  It doesn't interpret the actual placeholders.  I don't
> > think this will be a problem.
> 
> Ok, so here's a patch.
> 
> Patch 01 introduces infrastructure, and elog.c implementation.
> Patch 02 converts some elog/ereport() callers to using the z modifier,
>   some were wrong at least on 64 bit windows.

There's just been another instance of printing size_t values being
annoying, reminding me of this patch. I'll add it to the current
commitfest, given that I've posted it over a month ago, unless somebody
protests.
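
For anyone who hasn't seen the earlier thread, the point is simply to be able
to write the first form below portably instead of the cast-based workaround
(an illustrative fragment, not part of the patch):

    #include "postgres.h"

    static void
    report_alloc(size_t nbytes)
    {
        /* with %z support in elog/ereport, size_t can be printed directly */
        elog(DEBUG1, "allocated %zu bytes", nbytes);

        /* instead of the workaround needed today */
        elog(DEBUG1, "allocated %lu bytes", (unsigned long) nbytes);
    }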

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it

2014-01-15 Thread Peter Eisentraut
On 1/15/14, 1:53 PM, Josh Berkus wrote:
> I'm particularly thinking about this in relation to the merger of
> recovery.conf and postgresql.conf.  There are several tools already
> (RepMgr, OminPITR, HandyRep, pgPool, etc.) which manage recovery.conf
> separately from postgresql.conf.  These tools will want to continue
> managing recovery.conf as a separate file, even if it's /included in
> postgresql.conf now.

That seems like a very general argument.  We'd need to consider exactly
what these tools want to do, and how that affects the recovery.conf
merger.  My personal proposal was to allow postgresql.conf to contain
recovery parameters, but read recovery.conf last anyway.  That would
solve that problem.

> If conf.d exists by default, and is enabled in postgresql.conf by
> default, this is easy: the tool just drops a recovery.conf file into
> conf.d.

That would just give false hope, I think.  My standard operating
procedure is to delete the default postgresql.conf and write a new one
from scratch (or have deployment tools do it).  I don't want to
subscribe to a policy that the only endorsed way to maintain
postgresql.conf is to start with the default one and edit around carefully.

The only way you could achieve your goal is to read the configuration
directory automatically, but that would come with its own set of problems.

> Yes, I'm also arguing that postgresql.auto.conf should go into conf.d.
> I said I'd bring that up again after ALTER SYSTEM SET was committed, and
> here it is.

Independent of the above, I don't agree with that.  postgresql.auto.conf
is a special thing and should have its own special place.  For one
thing, when putting configuration files in a separate directory
structure (/etc/ vs /var), then postgresql.auto.conf should stay in the
data directory.





Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 7:28 AM, Amit Kapila  wrote:
> Unpatched
> ---
> testname | wal_generated |
> duration
> --+--+--
>  one short and one long field, no change |1054923224 |  33.101135969162
>
> After pgrb_delta_encoding_v4
> -
>
> testname | wal_generated |
> duration
> --+--+--
>  one short and one long field, no change | 877859144 | 30.6749138832092
>
>
> Temporary Changes
> (Revert Max Chunksize = 4 and logic of finding longer match)
> -
>
>  testname| wal_generated |
> duration
> --+--+--
>  one short and one long field, no change | 677337304 | 25.4048750400543

Sure, but watch me not care.

If we're interested in taking advantage of the internal
compressibility of tuples, we can do a lot better than this patch.  We
can compress the old tuple and the new tuple.  We can compress
full-page images.  We can compress inserted tuples.  But that's not
the point of this patch.

The point of *this* patch is to exploit the fact that the old and new
tuples are likely to be very similar, NOT to squeeze out every ounce
of compression from other sources.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jan Kara
On Wed 15-01-14 14:38:44, Hannu Krosing wrote:
> On 01/15/2014 02:01 PM, Jan Kara wrote:
> > On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
> >> On 01/14/2014 06:12 PM, Robert Haas wrote:
> >>> This would be pretty similar to copy-on-write, except
> >>> without the copying. It would just be
> >>> forget-from-the-buffer-pool-on-write. 
> >> +1
> >>
> >> A version of this could probably already be implement using MADV_DONTNEED
> >> and MADV_WILLNEED
> >>
> >> That is, just after reading the page in, use MADV_DONTNEED on it. When
> >> evicting
> >> a clean page, check that it is still in cache and if it is, then
> >> MADV_WILLNEED it.
> >>
> >> Another nice thing to do would be dynamically adjusting kernel
> >> dirty_background_ratio
> >> and other related knobs in real time based on how many buffers are dirty
> >> inside postgresql.
> >> Maybe in background writer.
> >>
> >> Question to LKM folks - will kernel react well to frequent changes to
> >> /proc/sys/vm/dirty_*  ?
> >> How frequent can they be (every few second? every second? 100Hz ?)
> >   So the question is what do you mean by 'react'. We check whether we
> > should start background writeback every dirty_writeback_centisecs (5s). We
> > will also check whether we didn't exceed the background dirty limit (and
> > wake writeback thread) when dirtying pages. However this check happens once
> > per several dirtied MB (unless we are close to dirty_bytes).
> >
> > When writeback is running we check roughly once per second (the logic is
> > more complex there but I don't think explaining details would be useful
> > here) whether we are below dirty_background_bytes and stop writeback in
> > that case.
> >
> > So changing dirty_background_bytes every few seconds should work
> > reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
> > note that you have conflicting requirements on the kernel writeback. On one
> > hand you want checkpoint data to steadily trickle to disk (well, trickle
> > isn't exactly the proper word since if you need to checkpoint 16 GB every 5
> > minutes then you need a steady throughput of ~50 MB/s just for
> > checkpointing) so you want to set dirty_background_bytes low, on the other
> > hand you don't want temporary files to get to disk so you want to set
> > dirty_background_bytes high. 
> Is it possible to have more fine-grained control over writeback, like
> configuring dirty_background_bytes per file system / device (or even
> a file or a group of files) ?
  Currently it isn't possible to tune dirty_background_bytes per device
directly. However see below.

> If not, then how hard would it be to provide this ?
  We do track amount of dirty pages per device and the thread doing the
flushing is also per device. The thing is that currently we compute the
per-device background limit as dirty_background_bytes * p, where p is a
proportion of writeback happening on this device to total writeback in the
system (computed as floating average with exponential time-based backoff).
BTW, similarly maximum per-device dirty limit is derived from global
dirty_bytes in the same way. And you can also set bounds on the proportion
'p' in /sys/block/sda/bdi/{min,max}_ratio so in theory you should be able
to set fixed background limit for a device by setting matching min and max
proportions.

Honza
-- 
Jan Kara 
SUSE Labs, CR




Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it

2014-01-15 Thread Stephen Frost
Josh,

* Josh Berkus (j...@agliodbs.com) wrote:
> In 9.3, we added support for a config directory (conf.d), but have it
> disabled by default.  For tool authors, this makes conf.d useless since
> you never know, on any given installation, whether it will be
> present/enabled or not.  While we don't want to prevent users from
> disabling it, conf.d only becomes useful if it's present by default.
> There's a simple reason why: if you want to write a tool which manages
> postgresql.conf, you don't want the user to have to manually edit
> postgresql.conf (and create a directory) in order to enable the tool.

I don't buy this argument one bit- certainly, on Debian, the defaults
are overridden for a number of variables already and could be done to
enable a conf.d directory as well.  Directory creation would, of
course, also be able to be handled by the Debian package.  Any tool
which is being packaged for Debian would simply have to Depend on the
version of the PG package that enabled the conf.d option by default.

This doesn't deal with upgrade cases or where the user has disabled the
feature and so the tool would need to check for a directory's
existence, but changing our default isn't going to completely address
that issue either.  Indeed, the Debian package would at least be able to
indicate, through the existence or not of the directory, whether a given
cluster was set up by default with the conf.d structure or not.

> I'm particularly thinking about this in relation to the merger of
> recovery.conf and postgresql.conf.  There are several tools already
> (RepMgr, OminPITR, HandyRep, pgPool, etc.) which manage recovery.conf
> separately from postgresql.conf.  These tools will want to continue
> managing recovery.conf as a separate file, even if it's /included in
> postgresql.conf now.

Certainly an interesting thought.

> If conf.d exists by default, and is enabled in postgresql.conf by
> default, this is easy: the tool just drops a recovery.conf file into
> conf.d.  Changing file locations and variable names is a fairly simple
> exercise in backwards compatibility.

This isn't quite that simple on at least Debian, and I believe RH, as
there's going to be multiple directories involved (one for each cluster
which exists on the system).  Also, there are rules which packages need
to follow on Debian regarding conf file changes, which these certainly
would be.

> If conf.d does NOT exist by default, things become complicated, and
> backwards compatibility for replication management tools becomes much
> harder.

My recommendation would be to argue the point with the package
maintainers.  Even if we *had* a default, I'm not convinced that the
package maintainers would keep it that way until and unless they update
their scripts and whatnot to handle it.

> Yes, I'm also arguing that postgresql.auto.conf should go into conf.d.
> I said I'd bring that up again after ALTER SYSTEM SET was committed, and
> here it is.

I disagree with enabling that by default.

Thanks,

Stephen




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Claudio Freire
On Wed, Jan 15, 2014 at 3:41 PM, Stephen Frost  wrote:
> * Claudio Freire (klaussfre...@gmail.com) wrote:
>> But, still, the implementation is very similar to what postgres needs:
>> sharing a physical page for two distinct logical pages, efficiently,
>> with efficient copy-on-write.
>
> Agreed, except that KSM seems like it'd be slow/lazy about it and I'm
> guessing there's a reason the pagecache isn't included normally..

KSM does an active de-duplication. That's slow. This would be
leveraging KSM structures in the kernel (page sharing) but without all
the de-duplication logic.

>
>> So it'd be just a matter of removing that limitation regarding page
>> cache and shared pages.
>
> Any idea why that limitation is there?

No, but I'm guessing it's because nobody bothered to implement the
required copy-on-write in the page cache, which would be a PITA to
write - think of all the complexities with privilege checks and
everything - even though the benefits for many kinds of applications
would be important.

>> If you asked me, I'd implement it as copy-on-write on the page cache
>> (not the user page). That ought to be low-overhead.
>
> Not entirely sure I'm following this- if it's a shared page, it doesn't
> matter who starts writing to it, as soon as that happens, it needs to get
> copied.  Perhaps you mean that the application should keep the
> "original" and that the page-cache should get the "copy" (or, really,
> perhaps just forget about the page existing at that point- we won't want
> it again...).
>
> Would that be a way to go, perhaps?  This does go back to the "make it
> act like mmap, but not *be* mmap", but the idea would be:
> open(..., O_ZEROCOPY_READ)
> read() - Goes to PG's shared buffers, pagecache and PG share the page
> page fault (PG writes to it) - pagecache forgets about the page
> write() / fsync() - operate as normal

Yep.




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-15 Thread Robert Haas
On Tue, Jan 14, 2014 at 4:16 AM, David Rowley  wrote:
> On Tue, Jan 14, 2014 at 9:09 PM, David Rowley  wrote:
>>
>> I think unless anyone has some objections I'm going to remove the inverse
>> transition for SUM(numeric) and modify the documents to tell the user how to
>> build their own FAST_SUM(numeric) using the built in functions to do it. I'm
>> starting to think that playing around with resetting numeric scale will turn
>> a possible 9.4 patch into a 9.5/10.0 patch. I see no reason why what's there
>> so far, minus sum(numeric), can't go in...
>>
>
> Of course it's only now that I discover that it is not possible to do this:
>
> CREATE AGGREGATE fast_sum (numeric)
> (
> stype = numeric,
> sfunc = numeric_avg_accum,
> invfunc = numeric_avg_accum_inv,
> finalfunc = numeric_sum
> );
>
> because SUM(numeric) uses an internal type to store the transition state.
>
> hmmm, built-in fast_sum anyone?
> Is there any simple way to limit these to only be used in the context of a
> window? If so is it worth it?
> Would we want fast_sum() for float too?

Maybe these additional "fast" functions (one might also say
"inaccurate") should live in an extension, in contrib.

It strikes me that for numeric what you really need is to just tell
the sum operation, whether through a parameter or otherwise, how many
decimal places to show in the output.  Obviously that's not a
practical change for sum() itself, but if we're inventing new stuff it
can be done.

For floats, things are not so good.  The data type is inexact by
nature, and I think cases where you get substantially wrong answers
will be common enough that people who attempt to use whatever we
devise in this area will have sum trouble.  *ducks*

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it

2014-01-15 Thread Josh Berkus
Hackers,

ALTER SYSTEM SET has been committed and recovery.conf GUCs are being
reviewed.  I'm going to make a last case for conf.d in relation to these
two patches before 9.4 goes out the door.

In 9.3, we added support for a config directory (conf.d), but have it
disabled by default.  For tool authors, this makes conf.d useless since
you never know, on any given installation, whether it will be
present/enabled or not.  While we don't want to prevent users from
disabling it, conf.d only becomes useful if it's present by default.
There's a simple reason why: if you want to write a tool which manages
postgresql.conf, you don't want the user to have to manually edit
postgresql.conf (and create a directory) in order to enable the tool.

I'm particularly thinking about this in relation to the merger of
recovery.conf and postgresql.conf.  There are several tools already
(RepMgr, OminPITR, HandyRep, pgPool, etc.) which manage recovery.conf
separately from postgresql.conf.  These tools will want to continue
managing recovery.conf as a separate file, even if it's /included in
postgresql.conf now.

If conf.d exists by default, and is enabled in postgresql.conf by
default, this is easy: the tool just drops a recovery.conf file into
conf.d.  Changing file locations and variable names is a fairly simple
exercise in backwards compatibility.

If conf.d does NOT exist by default, things become complicated, and
backwards compatibility for replication management tools becomes much
harder.

Yes, I'm also arguing that postgresql.auto.conf should go into conf.d.
I said I'd bring that up again after ALTER SYSTEM SET was committed, and
here it is.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Stephen Frost
* Claudio Freire (klaussfre...@gmail.com) wrote:
> But, still, the implementation is very similar to what postgres needs:
> sharing a physical page for two distinct logical pages, efficiently,
> with efficient copy-on-write.

Agreed, except that KSM seems like it'd be slow/lazy about it and I'm
guessing there's a reason the pagecache isn't included normally..

> So it'd be just a matter of removing that limitation regarding page
> cache and shared pages.

Any idea why that limitation is there?

> If you asked me, I'd implement it as copy-on-write on the page cache
> (not the user page). That ought to be low-overhead.

Not entirely sure I'm following this- if it's a shared page, it doesn't
matter who starts writing to it, as soon as that happens, it needs to get
copied.  Perhaps you mean that the application should keep the
"original" and that the page-cache should get the "copy" (or, really,
perhaps just forget about the page existing at that point- we won't want
it again...).

Would that be a way to go, perhaps?  This does go back to the "make it
act like mmap, but not *be* mmap", but the idea would be:

open(..., O_ZEROCOPY_READ)
read() - Goes to PG's shared buffers, pagecache and PG share the page
page fault (PG writes to it) - pagecache forgets about the page
write() / fsync() - operate as normal

The differences here from O_DIRECT are that the pagecache will keep the
page while clean (absolutely valuable from PG's perspective- we might
have to evict the page from shared buffers sooner than the kernel does),
and the write()'s happen at the kernel's pace, allowing for
write-combining, etc, until an fsync() happens, of course.

This isn't the "big win" of dealing with I/O issues during checkpoints
that we'd like to see, but it certainly feels like it'd be an
improvement over the current double-buffering situation at least.

Thanks,

Stephen




Re: [HACKERS] tests for client programs

2014-01-15 Thread Peter Eisentraut
On 1/15/14, 1:46 AM, Erik Rijkers wrote:
> The seems to be a dependency on IPC::Run
> 
> I can install that, of course... but I suppose you want to make this work 
> without that.

No, IPC::Run will be required.  It looked like it was part of the
default installation where I tested, but apparently not.




Re: [HACKERS] Turning recovery.conf into GUCs

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 2:06 AM, Jaime Casanova  wrote:
> On Wed, Jan 15, 2014 at 2:00 AM, Jaime Casanova  wrote:
>> On Mon, Nov 18, 2013 at 12:27 PM, Andres Freund  
>> wrote:
>>
>>> * Maybe we should rename names like pause_at_recovery_target to
>>>   recovery_pause_at_target? Since we already force everyone to bother
>>>   changing their setup...
>>
>> i don't have a problem with this, anyone else? if no one speaks i will
>> do what Andres suggests
>>
>
> Actually Michael had objected to this idea but i forgot about that...
> so i will wait for some consensus (my personal opinion is that
> Michael's argument is a good one)

I prefer the current name.  It's more like the way English is spoken.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] drop duplicate buffers in OS

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 1:53 AM, KONDO Mitsumasa
 wrote:
> I created a patch that can drop duplicate buffers in the OS using a
> usage_count algorithm. I have been developing this patch since last summer.
> This feature seems to be a hot discussion topic, so I am submitting it
> sooner than I had planned.
>
> When a buffer's usage_count in shared_buffers is high, it is hard to drop
> from shared_buffers, but the same page isn't needed in the OS file cache,
> because it isn't accessed there (postgres accesses it via shared_buffers).
> So I created an algorithm that drops from the file cache pages which have a
> high usage_count in shared_buffers and are clean in the OS. If a page is
> clean in the OS cache, posix_fadvise DONTNEED can free it from the file
> cache without writing to physical disk. This algorithm solves the
> double-buffering problem and uses memory more efficiently.
>
> I am testing DBT-2 benchmark now...

The thing about this is that our usage counts for shared_buffers don't
really work right now; it's common for everything, or nearly
everything, to have a usage count of 5.  So I'm reluctant to rely on
that for much of anything.
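For reference, the kernel facility the quoted patch appears to lean on is
posix_fadvise() with POSIX_FADV_DONTNEED, which asks the kernel to drop
pagecache pages for a byte range; clean pages can be freed without any disk
write. A minimal standalone sketch (file name and block number are
illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /*
     * Ask the kernel to drop its cached copy of one block that the
     * application already holds in its own buffer pool.  A clean page is
     * freed immediately; a dirty page may have writeback started instead.
     */
    static int
    drop_os_cache(int fd, off_t blockno)
    {
        return posix_fadvise(fd, blockno * BLCKSZ, BLCKSZ, POSIX_FADV_DONTNEED);
    }

    int
    main(void)
    {
        int     fd = open("relation_segment", O_RDONLY);    /* illustrative path */
        int     rc;

        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        rc = drop_os_cache(fd, 0);
        if (rc != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

        close(fd);
        return 0;
    }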

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-15 Thread Robert Haas
Review of patch 0002:

- I think you should just regard ReplicationSlotCtlLock as protecting
the "name" and "active" flags of every slot.  ReplicationSlotCreate()
would then not need to mess with the spinlocks at all, and
ReplicationSlotAcquire and ReplicationSlotDrop get a bit simpler too I
think.  Functions like ReplicationSlotsComputeRequiredXmin() can
acquire this lock in shared mode and only lock the slots that are
actually active, instead of all of them.

- If you address /* FIXME: apply sanity checking to slot name */, then
I think that also addresses /* XXX: do we want to use truncate
identifier instead? */. In other words, let's just error out if the
name is too long.  I'm not sure what other sanity checking is needed
here; maybe just restrict it to 7-bit characters to avoid problems
with encodings for individual databases varying.

- ReplicationSlotAcquire probably needs to ignore slots that are not active.

- ReplicationSlotAcquire should be tweaked so that the code that holds
the spinlock is more self-contained.  If you adopt the above-proposed
recasting of ReplicationSlotCtlLock, then the part that holds the
spinlock can probably look like this: SpinLockAcquire(&slot->mutex);
was_active = slot->active; slot->active = true;
SpinLockRelease(&slot->mutex), which looks quite a bit safer.
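For illustration only, here is a standalone analogue of that pattern using a
pthread spinlock in place of the slot's spinlock; the struct and names are
made up for the sketch, not taken from the patch. The point is that the only
work done while holding the lock is reading and setting the flag, and any
error reporting happens after it is released:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct
    {
        pthread_spinlock_t  mutex;      /* stand-in for the slot's spinlock */
        bool                active;
    } FakeSlot;

    /* Returns true if we acquired the slot, false if someone else had it. */
    static bool
    slot_try_acquire(FakeSlot *slot)
    {
        bool    was_active;

        pthread_spin_lock(&slot->mutex);
        was_active = slot->active;
        slot->active = true;
        pthread_spin_unlock(&slot->mutex);

        /* In the real code, the "already active" error would be raised here. */
        return !was_active;
    }

    int
    main(void)
    {
        FakeSlot    slot;

        pthread_spin_init(&slot.mutex, PTHREAD_PROCESS_PRIVATE);
        slot.active = false;

        printf("first acquire: %s\n", slot_try_acquire(&slot) ? "ok" : "busy");
        printf("second acquire: %s\n", slot_try_acquire(&slot) ? "ok" : "busy");
        return 0;
    }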

- If there's a coding rule that slot->database can't be changed while
the slot is active, then the check to make sure that the user isn't
trying to bind to a slot with a mis-matching database could be done
before the code described in the previous point, avoiding the need to
go back and release the resource.

- I think the critical section in ReplicationSlotDrop is bogus.  If
DeleteSlot() fails, we scarcely need to PANIC.  The slot just isn't
gone.

- cancel_before_shmem_exit is only guaranteed to remove the
most-recently-added callback.

- Why does ReplicationSlotsComputeRequiredXmin() need to hold
ProcArrayLock at all?

- ReplicationSlotsComputeRequiredXmin scarcely needs to hold the
spinlock while it does all of those gyrations.  It can just acquire
the spinlock, copy the three fields needed into local variables, and
release the spinlock.  The rest can be worked out afterwards.
Similarly in ReplicationSlotsComputeRequiredXmin.

- A comment in KillSlot wonders whether locking is required.  I say
yes.  It's safe to take lwlocks and spinlocks during shmem exit, and
failing to do so seems like a recipe for subtle corner-case bugs.

- pg_get_replication_slots() wonders what permissions we require.  I
don't know that any special permissions are needed here; the data
we're exposing doesn't appear to be sensitive.  Unless I'm missing
something?

- PG_STAT_GET_LOGICAL_DECODING_SLOTS_COLS has a leftover "logical" in its name.

- There seems to be no interface to acquire or release slots from
either SQL or the replication protocol, nor any way for a client of
this code to update its slot details.  The value of
catalog_xmin/data_xmin vs. effective_catalog_xmin/effective_data_xmin
is left to the imagination.

...Robert




Re: [HACKERS]

2014-01-15 Thread Fabrízio de Royes Mello
On Thu, Jan 9, 2014 at 2:50 PM, Robert Haas  wrote:
>
> On Tue, Jan 7, 2014 at 10:55 PM, Dilip kumar 
wrote:
> > Below attached patch is implementing following todo item..
> >
> > machine-readable pg_controldata?
> >
> > http://www.postgresql.org/message-id/4b901d73.8030...@agliodbs.com
> >
> > Possible approaches:
> >
> > 1.   Implement as backend function and provide a view to user.
>
> I think this would be useful.
>

Hi,

Joe Conway already created this feature [1] some time ago. I think this
code can be reused to create an extension.

Regards,

[1] https://github.com/jconway/pg_controldata

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello


Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Claudio Freire
On Wed, Jan 15, 2014 at 2:52 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> I think that the bottom line is that we're not likely to make massive
>> changes to the way that we do block caching now.  Even if some other
>> scheme could work much better on Linux (and so far I'm unconvinced
>> that any of the proposals made here would in fact work much better),
>> we aim to be portable to Windows as well as other UNIX-like systems
>> (BSD, Solaris, etc.).  So using completely Linux-specific technology
>> in an overhaul of our block cache seems to me to have no future.
>
> Unfortunately, I have to agree with this.  Even if there were a way to
> merge our internal buffers with the kernel's, it would surely be far
> too invasive to coexist with buffer management that'd still work on
> more traditional platforms.
>
> But we could add hint calls, or modify the I/O calls we use, and that
> ought to be a reasonably localized change.


That's what's pretty nice with the zero-copy read idea. It's almost
transparent. You read to a page-aligned address, and it works. The
only code change would be enabling zero-copy reads, which I'm not sure
will be low-overhead enough to leave enabled by default.




Re: [HACKERS] Deprecations in authentication

2014-01-15 Thread Tom Lane
Magnus Hagander  writes:
> One thing I noticed - in MSVC, the config parameter "krb5" (equivalent of
> the removed --with-krb5) enabled *both* krb5 and gssapi, and there is no
> separate config parameter for gssapi. Do we want to rename that one to
> "gss", or do we want to keep it as "krb5"? Renaming it would break
> otherwise working environments, but it's kind of weird to leave it...

+1 for renaming --- anybody who's building with "krb5" and expecting to,
you know, actually *get* krb5 would probably rather find out about this
change at build time instead of down the road a ways.

A compromise position would be to introduce a gss parameter while leaving
krb5 in place as a deprecated (perhaps undocumented?) synonym for it.
But I think that's basically confusing.

regards, tom lane




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Tom Lane
Robert Haas  writes:
> I think that the bottom line is that we're not likely to make massive
> changes to the way that we do block caching now.  Even if some other
> scheme could work much better on Linux (and so far I'm unconvinced
> that any of the proposals made here would in fact work much better),
> we aim to be portable to Windows as well as other UNIX-like systems
> (BSD, Solaris, etc.).  So using completely Linux-specific technology
> in an overhaul of our block cache seems to me to have no future.

Unfortunately, I have to agree with this.  Even if there were a way to
merge our internal buffers with the kernel's, it would surely be far
too invasive to coexist with buffer management that'd still work on
more traditional platforms.

But we could add hint calls, or modify the I/O calls we use, and that
ought to be a reasonably localized change.

> And the idea of being able to do an 8kB atomic write with OS support
> so that we don't have to save full page images in our write-ahead log
> to cover the "torn page" scenario seems very intriguing indeed.  If
> that worked well, it would be a *big* deal for us.

+1.  That would be a significant win, and trivial to implement, since
we already have a way to switch off full-page images for people who
trust their filesystems to do atomic writes.  It's just that safe
use of that switch isn't widely possible ...

regards, tom lane




Re: [HACKERS] shared memory message queues

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 12:39 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Jan 14, 2014 at 9:28 PM, Peter Eisentraut  wrote:
>>> Something is causing this new compiler warning:
>>>
>>> setup.c: In function ‘setup_dynamic_shared_memory’:
>>> setup.c:102:3: error: format ‘%llu’ expects argument of type ‘long long 
>>> unsigned int’, but argument 2 has type ‘Size’ [-Werror=format=]
>
>> Oops.  I've pushed a fix (at least, I hope it's a fix).
>
> Unless I'm misreading what you did, that will just move the problem to
> a different set of platforms.  size_t is not necessarily "long" (one
> counterexample is Win64, IIRC).
>
> In the past we've usually resorted to explicitly casting.  You could
> silence the warning correctly with either of
>ereport("... %lu ...", (unsigned long) shm_mq_minimum_size);
>ereport("... " UINT64_FORMAT " ...", (uint64) shm_mq_minimum_size);
>
> However, the problem with the second one (which also existed in your
> original code) is that it breaks translatability of the string, because
> any given translator is only going to see the version that gets compiled
> on their hardware.  Unless there's a significant likelihood of
> shm_mq_minimum_size exceeding 4GB, I'd go with the first one.

I think the current value is about 56, and while the header might
expand in the future, 4294967296 would be a heck of a lot of header
data.  So I've changed it as you suggest.  I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)

2014-01-15 Thread Robert Haas
This 0001 patch, to log running transactions more frequently, has been
pending for a long time now, and I haven't heard any objections, so
I've gone ahead and committed that part.

...Robert




Re: [HACKERS] shared memory message queues

2014-01-15 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jan 14, 2014 at 9:28 PM, Peter Eisentraut  wrote:
>> Something is causing this new compiler warning:
>> 
>> setup.c: In function ‘setup_dynamic_shared_memory’:
>> setup.c:102:3: error: format ‘%llu’ expects argument of type ‘long long 
>> unsigned int’, but argument 2 has type ‘Size’ [-Werror=format=]

> Oops.  I've pushed a fix (at least, I hope it's a fix).

Unless I'm misreading what you did, that will just move the problem to
a different set of platforms.  size_t is not necessarily "long" (one
counterexample is Win64, IIRC).

In the past we've usually resorted to explicitly casting.  You could
silence the warning correctly with either of
   ereport("... %lu ...", (unsigned long) shm_mq_minimum_size);
   ereport("... " UINT64_FORMAT " ...", (uint64) shm_mq_minimum_size);

However, the problem with the second one (which also existed in your
original code) is that it breaks translatability of the string, because
any given translator is only going to see the version that gets compiled
on their hardware.  Unless there's a significant likelihood of
shm_mq_minimum_size exceeding 4GB, I'd go with the first one.

If you really want to be able to print values >4GB in translatable
messages, what you need to do is to print into a local string variable
with UINT64_FORMAT, then use %s in the translatable string.  There's
examples of this in commands/sequence.c among other places.
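A standalone analogue of that pattern, using C99's PRIu64 in place of
PostgreSQL's UINT64_FORMAT macro (in the backend itself this would of course
be UINT64_FORMAT plus ereport(), as in commands/sequence.c); the message text
and value are illustrative:

    #include <inttypes.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint64_t    minimum_size = 56;      /* stand-in for shm_mq_minimum_size */
        char        sizebuf[32];

        /* Format the 64-bit value into a local buffer first ... */
        snprintf(sizebuf, sizeof(sizebuf), "%" PRIu64, minimum_size);

        /* ... then the translatable message only needs a portable %s. */
        printf("queue size must be at least %s bytes\n", sizebuf);
        return 0;
    }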

regards, tom lane




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Claudio Freire
On Wed, Jan 15, 2014 at 1:35 PM, Stephen Frost  wrote:
>> And there's a nice bingo. Had forgotten about KSM. KSM could help lots.
>>
>> I could try to see if madvising shared_buffers as mergeable helps. But
>> this should be an automatic case of KSM - ie, when reading into a
>> page-aligned address, the kernel should summarily apply KSM-style
>> sharing without hinting. The current madvise interface puts the burden
>> of figuring out what duplicates what on the kernel, but postgres
>> already knows.
>
> I'm certainly curious as to if KSM could help here, but on Ubuntu 12.04
> with 3.5.0-23-generic, it's not doing anything with just PG running.
> The page here: http://www.linux-kvm.org/page/KSM seems to indicate why:
>
> 
> KSM is a memory-saving de-duplication feature, that merges anonymous
> (private) pages (not pagecache ones).
> 
>
> Looks like it won't merge between pagecache and private/application
> memory?  Or is it just that we're not madvise()'ing the shared buffers
> region?  I'd be happy to test doing that, if there's a chance it'll
> actually work..


Yes, it's only *intended* for merging private memory.

But, still, the implementation is very similar to what postgres needs:
sharing a physical page for two distinct logical pages, efficiently,
with efficient copy-on-write.

So it'd be just a matter of removing that limitation regarding page
cache and shared pages.

If you asked me, I'd implement it as copy-on-write on the page cache
(not the user page). That ought to be low-overhead.




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-15 Thread Robert Haas
On Tue, Jan 14, 2014 at 4:13 PM, Simon Riggs  wrote:
> On 8 January 2014 08:33, Simon Riggs  wrote:
>> VACUUM cleans up blocks, which is nice because it happens offline in a
>> lazy manner.
>>
>> We also make SELECT clean up blocks as it goes. That is useful in OLTP
>> workloads, but it means that large SQL queries and pg_dump effectively
>> do much the same work as VACUUM, generating huge amounts of I/O and
>> WAL on the master, the cost and annoyance of which is experienced
>> directly by the user. That is avoided on standbys.
>>
>> Effects of that are that long running statements often run much longer
>> than we want, increasing bloat as a result. It also produces wildly
>> varying response times, depending upon extent of cleanup required.
>>
>> It is a simple task to make that behaviour optional on the master.
>>
>> I propose a USERSET parameter, prune_cost_limit (<---insert better name here)
>> which will make the behaviour optional, default -1, in normal user
>> processes. VACUUM will ignore this parameter and so its actions will
>> never be deferred.
>>
>> In detail, this parameter would disable pruning for any scan larger
>> than the cost limit. So large scans will disable the behaviour. The
>> default, -1, means never disable pruning, which is the current
>> behavour.
>>
>> We track the number of pages dirtied by the current statement. When
>> this reaches prune_cost_limit, we will apply these behaviours to all
>> shared_buffer block accesses...
>>
>> (1) avoid running heap_page_prune_opt()
>>
>> (2) avoid dirtying the buffer for hints. (This is safe because the
>> hinted changes will either be lost or will be part of the full page
>> image when we make a logged-change).
>>
>> (i.e. doesn't apply to temp tables)
>>
>> For example, if we set prune_cost_limit = 4 this behaviour allows
>> small index lookups via bitmapheapscan to continue to cleanup, while
>> larger index and seq scans will avoid cleanup.
>>
>>
>>
>> There would be a postgresql.conf parameter prune_cost_limit, as well
>> as a table level parameter that would prevent pruning except via
>> VACUUM.
>>
>> This will help in these ways
>> * Reduce write I/O from SELECTs and pg_dump - improving backups and BI 
>> queries
>> * Allow finer grained control over Hot Standby conflicts
>> * Potentially allow diagnostic inspection of older data via SeqScan
>>
>> Prototype patch shows this is possible and simple enough for 9.4.
>> Major objections? Or should I polish up and submit?
>
> Patch attached, implemented to reduce writes by SELECTs only.

I am still not sure whether we want this, but I think it's definitely
an improvement over the previous version.  Assorted comments:

- Naming consistency seems to me to dictate that there should be more
similarity between the reloption name (allow_buffer_cleanup) and the
GUC (prune_page_dirty_limit).

- The documentation doesn't describe the use case where suppressing
cleanup on a per-table basis would be desirable, and I can't think of
one, either.

- There are a variety of ways to limit pruning; here, you've chosen to
limit it to a particular number of pruning operations per executor
invocation.  But the flag is global, not part of the executor state,
so a query that calls a PL/pgsql function during execution will reset
the counter for the parent query also, which doesn't seem very
principled.

In a patch I posted a few years ago to set hint bits only sometimes, I
settled on an algorithm where I dirtied the first 50 pages per scan
and then skipped the next 950, or something like that.  The idea was
that you wanted the pages that did get dirtied to be clustered
together to avoid random I/O; and also that you wanted tables of
arbitrary size to get hinted within a certain number of scans (e.g.
20).  The limiting here is much more aggressive, so on large tables it
will amount to basically no pruning at all.  I dunno whether that's a
good idea or not.  But if the idea of making this an integer rather
than a boolean is to allow some pruning to still happen while keeping
it checked within reasonable bounds, I'm not sure it will succeed.
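For comparison, a toy sketch of the kind of limiter described above: dirty a
clustered burst of pages at the start of each cycle, then leave a much longer
run alone. The 50/950 constants are just the ones mentioned in the text and
are illustrative, not a recommendation:

    #include <stdbool.h>
    #include <stdio.h>

    #define DIRTY_BURST     50      /* pages allowed to be dirtied per cycle */
    #define SKIP_RUN        950     /* pages then left untouched */

    /*
     * Decide whether the n-th page visited by a scan may be dirtied for
     * hint bits / pruning.  Dirtied pages end up clustered together, which
     * keeps the extra writes mostly sequential rather than random.
     */
    static bool
    may_dirty_page(long pages_visited)
    {
        return (pages_visited % (DIRTY_BURST + SKIP_RUN)) < DIRTY_BURST;
    }

    int
    main(void)
    {
        long    dirtied = 0;

        for (long page = 0; page < 10000; page++)
            if (may_dirty_page(page))
                dirtied++;

        printf("would dirty %ld of 10000 pages per scan\n", dirtied);
        return 0;
    }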

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Stephen Frost
* Claudio Freire (klaussfre...@gmail.com) wrote:
> Yes, that's basically zero-copy reads.
> 
> It could be done. The kernel can remap the page to the physical page
> holding the shared buffer and mark it read-only, then expire the
> buffer and transfer ownership of the page if any page fault happens.
> 
> But that incurs:
>  - Page faults, lots
>  - Hugely bloated mappings, unless KSM is somehow leveraged for this

The page faults might be a problem but might be worth it.  Bloated
mappings sounds like a real issue though.

> And there's a nice bingo. Had forgotten about KSM. KSM could help lots.
> 
> I could try to see if madvising shared_buffers as mergeable helps. But
> this should be an automatic case of KSM - ie, when reading into a
> page-aligned address, the kernel should summarily apply KSM-style
> sharing without hinting. The current madvise interface puts the burden
> of figuring out what duplicates what on the kernel, but postgres
> already knows.

I'm certainly curious as to if KSM could help here, but on Ubuntu 12.04
with 3.5.0-23-generic, it's not doing anything with just PG running.
The page here: http://www.linux-kvm.org/page/KSM seems to indicate why:


KSM is a memory-saving de-duplication feature, that merges anonymous
(private) pages (not pagecache ones).


Looks like it won't merge between pagecache and private/application
memory?  Or is it just that we're not madvise()'ing the shared buffers
region?  I'd be happy to test doing that, if there's a chance it'll
actually work..
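If madvise() is all that's missing, the experiment itself is small: mark the
region MADV_MERGEABLE once it is mapped. A minimal sketch is below; it uses a
private anonymous mapping as a stand-in for the real shared-buffers segment,
and note that (per the KSM page quoted above) KSM today only scans private
anonymous pages, so marking an actual shared segment may simply do nothing.
KSM also has to be enabled via /sys/kernel/mm/ksm/run for the hint to have
any effect.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int
    main(void)
    {
        size_t  len = 128 * 1024 * 1024;    /* stand-in for shared_buffers */
        void   *region;

        /* Stand-in for the shared-buffers mapping. */
        region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }

        /* Ask KSM to consider these pages for merging with identical pages. */
        if (madvise(region, len, MADV_MERGEABLE) != 0)
            perror("madvise(MADV_MERGEABLE)");

        munmap(region, len);
        return 0;
    }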

Thanks,

Stephen




Re: [HACKERS] Soften pg_[start|stop]_backup to allow them on a standby?

2014-01-15 Thread Andres Freund
On 2014-01-15 11:19:52 -0500, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 7:54 AM, Andres Freund  wrote:
> > On 2014-01-14 12:31:09 +0900, Michael Paquier wrote:
> >> Currently, pg_start_backup and pg_stop_backup cannot run on a standby
> >> because it is not possible to write a backup_label file to disk,
> >> because of the nature of a standby server preventing to write any data
> >> in its PGDATA. Is this thought right? This is what the comments at the
> >> top of do_pg_start_backup make me conclude.
> >
> > No, the actual reason is that a plain pg_stop_backup() writes WAL -
> > which we can't do on a standby. The walsender command gets around this
> > by storing the required data in the backup label itself, but that
> > requires the label to be written after the basebackup actually finished
> > which doesn't work for plain start/stop backup.
> 
> This is true, but a better way to say it might be that when we fire up
> postgres and point it at the backup, it needs to begin recovery at a
> checkpoint; otherwise, pages torn by the backup process won't get
> fixed.  Maybe there's a way that pg_start_backup() could piggyback on
> the most recent checkpoint rather than performing one itself; if so,
> such a mode could be used on either the master or the standby (but
> would require replaying more WAL, of course).

That's what the walsender variant essentially already does for backups
taken in recovery. What that does not solve is correctly setting the min
recovery point though. I don't immediately see how we could compute that
correctly without either a wal record or a backup label written at the
end of recovery.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 10:53 AM, Mel Gorman  wrote:
> I realise that now and sorry for the noise.
>
> I later read the parts of the thread that covered the strict ordering
> requirements and in a summary mail I split the requirements in two. In one,
> there are dirty sticky pages that the kernel should not writeback unless
> it has no other option or fsync is called. This may be suitable for large
> temporary files that Postgres does not necessarily want to hit the platter
> but also does not have strict ordering requirements for. The second is to have
> pages that are strictly kept dirty until the application syncs them. An
> unbounded number of these pages would blow up but maybe bounds could be
> placed on it. There are no solid conclusions on that part yet.

I think that the bottom line is that we're not likely to make massive
changes to the way that we do block caching now.  Even if some other
scheme could work much better on Linux (and so far I'm unconvinced
that any of the proposals made here would in fact work much better),
we aim to be portable to Windows as well as other UNIX-like systems
(BSD, Solaris, etc.).  So using completely Linux-specific technology
in an overhaul of our block cache seems to me to have no future.

On the other hand, giving the kernel hints about what we're doing that
would enable it to be smarter seems to me to have a lot of potential.
Ideas so far mentioned include:

- Hint that we're going to do an fsync on file X at time Y, so that
the kernel can schedule the write-out to complete right around that
time.
- Hint that a block is a good candidate for reclaim without actually
purging it if there's no memory pressure.
- Hint that a page we modify in our cache should be dropped from the
kernel cache.
- Hint that a page we write back to the operating system should be
dropped from the kernel cache after the I/O completes.

It's hard to say which of these ideas will work well without testing
them, and the overhead of the extra system calls might be significant
in some of those cases, but it seems a promising line of inquiry.

And the idea of being able to do an 8kB atomic write with OS support
so that we don't have to save full page images in our write-ahead log
to cover the "torn page" scenario seems very intriguing indeed.  If
that worked well, it would be a *big* deal for us.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Soften pg_[start|stop]_backup to allow them on a standby?

2014-01-15 Thread Robert Haas
On Tue, Jan 14, 2014 at 7:54 AM, Andres Freund  wrote:
> On 2014-01-14 12:31:09 +0900, Michael Paquier wrote:
>> Currently, pg_start_backup and pg_stop_backup cannot run on a standby
>> because it is not possible to write a backup_label file to disk,
>> because of the nature of a standby server preventing to write any data
>> in its PGDATA. Is this thought right? This is what the comments at the
>> top of do_pg_start_backup make me conclude.
>
> No, the actual reason is that a plain pg_stop_backup() writes WAL -
> which we can't do on a standby. The walsender command gets around this
> by storing the required data in the backup label itself, but that
> requires the label to be written after the basebackup actually finished
> which doesn't work for plain start/stop backup.

This is true, but a better way to say it might be that when we fire up
postgres and point it at the backup, it needs to begin recovery at a
checkpoint; otherwise, pages torn by the backup process won't get
fixed.  Maybe there's a way that pg_start_backup() could piggyback on
the most recent checkpoint rather than performing one itself; if so,
such a mode could be used on either the master or the standby (but
would require replaying more WAL, of course).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] plpgsql.warn_shadow

2014-01-15 Thread Marko Tiikkaja

On 1/15/14 3:09 PM, Pavel Stehule wrote:

You should first say what a warning is and why it is only a warning and not
an error.


Personally, I'm a huge fan of the computer telling me that I might have 
made a mistake.  But on the other hand, I also really hate it when the 
computer gets in my way when I'm trying to test something quickly and 
making these mistakes on purpose.  Warnings are really good for that: 
hard to ignore (YMMV) accidentally, but easy to spot when developing.


As to how we would categorize these checks between warnings and errors.. 
 I can't really answer that.  I'm tempted to say "anything that is an 
error now is an error, any additional checks we might add are warnings", 
but that's effectively just freezing the definition at an arbitrary 
point in time.



And why should plpgsql warning processing be different from general
postgresql processing?


What do you mean?  We're adding extra checks on *top* of the normal 
"this is clearly an error" conditions.  PostgreSQL in general doesn't 
really do that.  Consider:


  SELECT * FROM foo WHERE fooid IN (SELECT fooid FROM bar);

where the table "bar" doesn't have a column "fooid".  That's a perfectly 
valid query, but it almost certainly doesn't do what you would want. 
Personally I'd like to see a WARNING here normally, but I've eventually 
learned to live with this caveat.  I'm hoping that in PL/PgSQL we could 
at least solve some of the most annoying pitfalls.



My objection is against too general an option. Every option should do one
clean thing.


It looks to me like the GUC *is* doing only one thing.  "list of 
warnings I want to see", or the shorthand "all" for convenience.



Regards,
Marko Tiikkaja




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Mel Gorman
On Wed, Jan 15, 2014 at 10:16:27AM -0500, Robert Haas wrote:
> On Wed, Jan 15, 2014 at 4:44 AM, Mel Gorman  wrote:
> > That applies if the dirty pages are forced to be kept dirty. You call
> > this pinned but pinned has special meaning so I would suggest calling it
> > something like dirty-sticky pages. It could be the case that such hinting
> > will have the pages excluded from dirty background writing but can still
> > be cleaned if dirty limits are hit or if fsync is called. It's a hint,
> > not a forced guarantee.
> >
> > It's still a hand grenade if this is tracked on a per-page basis, because
> > of what happens if the process crashes: those pages stay dirty
> > potentially forever. An alternative would be to track this on a per-inode
> > instead of per-page basis. The hint would only exist where there is an
> > open fd for that inode.  Treat it as a privileged call with a sysctl
> > controlling how many dirty-sticky pages can exist in the system with the
> > information presented during OOM kills and maybe it starts becoming a bit
> > more manageable. Dirty-sticky pages are not guaranteed to stay dirty
> > until userspace action, the kernel just stays away until there are no
> > other sensible options.
> 
> I think this discussion is vividly illustrating why this whole line of
> inquiry is a pile of fail.  If all the processes that have the file
> open crash, the changes have to be *thrown away* not written to disk
> whenever the kernel likes.
> 

I realise that now and sorry for the noise.

I later read the parts of the thread that covered the strict ordering
requirements and in a summary mail I split the requirements in two. In one,
there are dirty sticky pages that the kernel should not writeback unless
it has no other option or fsync is called. This may be suitable for large
temporary files that Postgres does not necessarily want to hit the platter
but also does not have strict ordering requirements for. The second is to have
pages that are strictly kept dirty until the application syncs them. An
unbounded number of these pages would blow up but maybe bounds could be
placed on it. There are no solid conclusions on that part yet.

-- 
Mel Gorman
SUSE Labs




Re: [HACKERS] COPY and Volatile default expressions

2014-01-15 Thread Simon Riggs
On 16 April 2013 14:37, Simon Riggs  wrote:
> On 16 April 2013 13:57, Heikki Linnakangas  wrote:
>
>> You still need to check the args, if the function is nextval, otherwise you
>> incorrectly perform the optimization for something like
>> "nextval(myvolatilefunc())".
>
> Guess so. At least its an easy change.
>
> Thanks for checking.

Rebased v3

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


copy_tuning_with_default_nextval.v3.patch
Description: Binary data



Re: [HACKERS] shared memory message queues

2014-01-15 Thread Andres Freund
On 2014-01-15 10:19:32 -0500, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 9:28 PM, Peter Eisentraut  wrote:
> > Something is causing this new compiler warning:
> >
> > setup.c: In function ‘setup_dynamic_shared_memory’:
> > setup.c:102:3: error: format ‘%llu’ expects argument of type ‘long long 
> > unsigned int’, but argument 2 has type ‘Size’ [-Werror=format=]
> 
> Oops.  I've pushed a fix (at least, I hope it's a fix).

I don't think that works on LLP64 platforms like windows - long is 32bit
there.
That's basically why I proposed the z modifier to elog to encapsulate
this in some other thread...

Till then you'll have to cast to a known size.
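For reference, standard C99 printf-style formatting already has a length
modifier for this ('z' makes the conversion take a size_t whatever its width),
which is roughly what the proposed elog support would provide; until then the
explicit cast is the portable option. A tiny illustration:

    #include <stddef.h>
    #include <stdio.h>

    int
    main(void)
    {
        size_t  sz = 56;

        /* C99: %zu consumes a size_t directly, on LP64 and LLP64 alike. */
        printf("minimum size is %zu bytes\n", sz);

        /* The cast-based workaround for formatters without %zu support. */
        printf("minimum size is %lu bytes\n", (unsigned long) sz);
        return 0;
    }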

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] shared memory message queues

2014-01-15 Thread Robert Haas
On Tue, Jan 14, 2014 at 9:53 PM, Kevin Grittner  wrote:
> I'm not seeing that one but I am now getting these:
>
> setup.c: In function ‘test_shm_mq_setup’:
> setup.c:65:25: warning: ‘outq’ may be used uninitialized in this function 
> [-Wmaybe-uninitialized]
> setup.c:66:24: warning: ‘inq’ may be used uninitialized in this function 
> [-Wmaybe-uninitialized]

That warning is bogus and I don't get it, but I'll add an
initialization to placate the compiler, which is being a bit too smart
for its own good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] shared memory message queues

2014-01-15 Thread Robert Haas
On Tue, Jan 14, 2014 at 9:28 PM, Peter Eisentraut  wrote:
> Something is causing this new compiler warning:
>
> setup.c: In function ‘setup_dynamic_shared_memory’:
> setup.c:102:3: error: format ‘%llu’ expects argument of type ‘long long 
> unsigned int’, but argument 2 has type ‘Size’ [-Werror=format=]

Oops.  I've pushed a fix (at least, I hope it's a fix).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 4:44 AM, Mel Gorman  wrote:
> That applies if the dirty pages are forced to be kept dirty. You call
> this pinned but pinned has special meaning so I would suggest calling it
> something like dirty-sticky pages. It could be the case that such hinting
> will have the pages excluded from dirty background writing but can still
> be cleaned if dirty limits are hit or if fsync is called. It's a hint,
> not a forced guarantee.
>
> It's still a hand grenade if this is tracked on a per-page basis, because
> of what happens if the process crashes: those pages stay dirty
> potentially forever. An alternative would be to track this on a per-inode
> instead of per-page basis. The hint would only exist where there is an
> open fd for that inode.  Treat it as a privileged call with a sysctl
> controlling how many dirty-sticky pages can exist in the system with the
> information presented during OOM kills and maybe it starts becoming a bit
> more manageable. Dirty-sticky pages are not guaranteed to stay dirty
> until userspace action, the kernel just stays away until there are no
> other sensible options.

I think this discussion is vividly illustrating why this whole line of
inquiry is a pile of fail.  If all the processes that have the file
open crash, the changes have to be *thrown away* not written to disk
whenever the kernel likes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara  wrote:
> Filesystems could in theory provide facility like atomic write (at least up
> to a certain size say in MB range) but it's not so easy and when there are
> no strong usecases fs people are reluctant to make their code more complex
> unnecessarily. OTOH without widespread atomic write support I understand
> application developers have similar stance. So it's kind of chicken and egg
> problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
> due to its data=journal mode so if someone on the PostgreSQL side wanted to
> research on this, knitting some experimental ext4 patches should be doable.

Atomic 8kB writes would improve performance for us quite a lot.  Full
page writes to WAL are very expensive.  I don't remember what
percentage of write-ahead log traffic that accounts for, but it's not
small.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Tom Lane
Heikki Linnakangas  writes:
> On 01/15/2014 07:50 AM, Dave Chinner wrote:
>> FWIW [and I know you're probably sick of hearing this by now], but
>> the blk-io throttling works almost perfectly with applications that
>> use direct IO.

> For checkpoint writes, direct I/O actually would be reasonable. 
> Bypassing the OS cache is a good thing in that case - we don't want the 
> written pages to evict other pages from the OS cache, as we already have 
> them in the PostgreSQL buffer cache.

But in exchange for that, we'd have to deal with selecting an order to
write pages that's appropriate depending on the filesystem layout,
other things happening in the system, etc etc.  We don't want to build
an I/O scheduler, IMO, but we'd have to.

> Writing one page at a time with O_DIRECT from a single process might be 
> quite slow, so we'd probably need to use writev() or asynchronous I/O to 
> work around that.

Yeah, and if the system has multiple spindles, we'd need to be issuing
multiple O_DIRECT writes concurrently, no?

What we'd really like for checkpointing is to hand the kernel a boatload
(several GB) of dirty pages and say "how about you push all this to disk
over the next few minutes, in whatever way seems optimal given the storage
hardware and system situation.  Let us know when you're done."  Right now,
because there's no way to negotiate such behavior, we're reduced to having
to dribble out the pages (in what's very likely a non-optimal order) and
hope that the kernel is neither too lazy nor too aggressive about cleaning
dirty pages in its caches.

regards, tom lane




Re: [HACKERS] Portal suddenly disappears?

2014-01-15 Thread Tatsuo Ishii
>> I'm confused too.  Surely there are lots of ways a portal could get
>> dropped, but most of them would have left traces in the postmaster log,
>> I'd think, since you evidently have log_statement == LOGSTMT_ALL.
>> What was log_min_messages set to?
> 
> I don't have it at this moment. I requested the user to give the
> log_min_messages setting.

It turned out that it is set to default: notice.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Heikki Linnakangas

On 01/15/2014 07:50 AM, Dave Chinner wrote:

However, the first problem is dealing with the IO storm problem on
fsync. Then we can measure the effect of spreading those writes out
in time and determine what triggers read starvations (if they are
apparent). The we can look at whether IO scheduling tweaks or
whether blk-io throttling solves those problems. Or whether
something else needs to be done to make it work in environments
where problems are manifesting.

FWIW [and I know you're probably sick of hearing this by now], but
the blk-io throttling works almost perfectly with applications that
use direct IO.


For checkpoint writes, direct I/O actually would be reasonable. 
Bypassing the OS cache is a good thing in that case - we don't want the 
written pages to evict other pages from the OS cache, as we already have 
them in the PostgreSQL buffer cache.


Writing one page at a time with O_DIRECT from a single process might be 
quite slow, so we'd probably need to use writev() or asynchronous I/O to 
work around that.


We'd still need to issue an fsync() to flush any already-written pages 
from the OS cache to disk, though.
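A minimal sketch of the mechanics (not a proposal for how the checkpointer
would actually batch its writes): O_DIRECT on Linux requires the buffers,
offsets and lengths to be suitably aligned, and pwritev() pushes several
blocks in one call. The file name and block counts are illustrative.

    #define _GNU_SOURCE                     /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define BLCKSZ      8192
    #define ALIGNMENT   4096
    #define NBLOCKS     4

    int
    main(void)
    {
        struct iovec    iov[NBLOCKS];
        int             fd;

        fd = open("checkpoint_target", O_WRONLY | O_CREAT | O_DIRECT, 0600);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        /* Every buffer handed to O_DIRECT must be aligned. */
        for (int i = 0; i < NBLOCKS; i++)
        {
            if (posix_memalign(&iov[i].iov_base, ALIGNMENT, BLCKSZ) != 0)
            {
                perror("posix_memalign");
                return 1;
            }
            memset(iov[i].iov_base, 0, BLCKSZ);
            iov[i].iov_len = BLCKSZ;
        }

        /* One call writes several blocks, bypassing the OS cache. */
        if (pwritev(fd, iov, NBLOCKS, 0) < 0)
            perror("pwritev");

        /* Still needed for metadata and the drive cache, as noted above. */
        if (fsync(fd) != 0)
            perror("fsync");

        close(fd);
        return 0;
    }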


- Heikki




Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe

2014-01-15 Thread Simon Riggs
On 1 August 2013 01:53, Noah Misch  wrote:
> On Mon, Jul 15, 2013 at 05:50:40PM +0100, Simon Riggs wrote:
>> On 15 July 2013 15:06, Robert Haas  wrote:
>> > Generally, the question on the table is: to what extent do MVCC
>> > catalog scans make the world safe for concurrent DDL, or put
>> > negatively, what hazards remain?
>>
>> On Tom's test I've been unable to find a single problem.
>>
>> > Noah discovered an interesting one recently: apparently, the relcache
>> > machinery has some logic in it that depends on the use of
>> > AccessExclusiveLock in subtle ways.  I'm going to attempt to explain
>> > it, and maybe he can jump in and explain it better.  Essentially, the
>> > problem is that when a relcache reload occurs, certain data structures
>> > (like the tuple descriptor, but there are others) are compared against
>> > the old version of the same data structure.  If there are no changes,
>> > we do nothing; else, we free the old one and install the new one.  The
>> > reason why we don't free the old one and install the new one
>> > unconditionally is because other parts of the backend might have
>> > pointers to the old data structure, so just replacing it all the time
>> > would result in crashes.  Since DDL requires AccessExclusiveLock +
>> > CheckTableNotInUse(), any actual change to the data structure will
>> > happen when we haven't got any pointers anyway.
>
>> If you look at this as a generalised problem you probably can find
>> some issues, but that doesn't mean they apply in the specific cases
>> we're addressing.
>>
>> The lock reductions we are discussing in all cases have nothing at all
>> to do with structure and only relate to various options. Except in the
>> case of constraints, though even there I see no issues as yet.
>
> I was able to distill the above hypothesis into an actual crash with
> reduce_lock_levels.v13.patch.  Test recipe:
>
> 1. Build with --enable-cassert and with -DCATCACHE_FORCE_RELEASE=1.  An
> AcceptInvalidationMessages() will then happen at nearly every syscache lookup,
> making it far easier to hit a problem of this sort.
>
> 2. Run these commands as setup:
>   create table root (c int);
>   create table part (check (c > 0), check (c > 0)) inherits (root);
>
> 3. Attach a debugger to your session and set a breakpoint at plancat.c:660 (as
> of commit 16f38f72ab2b8a3b2d45ba727d213bb3cea4).
>
> 4. Run this in your session; the breakpoint will trip:
>   select * from root where c = -1;
>
> 5. Start another session and run:
>   alter table part add check (c > 0);
>
> 6. Exit the debugger to release the first session.  It will crash.
>
> plancache.c:657 stashes a pointer to memory belonging to the rd_att of a
> relcache entry.  It then proceeds to call eval_const_expressions(), which
> performs a syscache lookup in its simplify_function() subroutine.  Under
> CATCACHE_FORCE_RELEASE, the syscache lookup reliably prompts an
> AcceptInvalidationMessages().  The ensuing RelationClearRelation() against
> "part" notices the new constraint, decides keep_tupdesc = false, and frees the
> existing tupdesc.  plancache.c is now left holding a pointer into freed
> memory.  The next loop iteration will crash when it dereferences a pointer
> stored within that freed memory at plancat.c:671.
>
>
> A remediation strategy that seemed attractive when I last contemplated this
> problem is to repoint rd_att immediately but arrange to free the obsolete
> TupleDesc in AtEOXact_RelationCache().

v15 to fix the above problem.

Looking at other potential problems around plancache but nothing found as yet.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


reduce_lock_levels.v15.patch
Description: Binary data



Re: [HACKERS] [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation

2014-01-15 Thread MauMau

From: "Asif Naeem" 

As you have

followed the destroy_tablespace_directories() function, is there any specific
reason not to use the same logic to detect the type of the file/link, i.e.
“(lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))”? It also seems to have a
more appropriate error message, i.e.

Thanks for reviewing and testing the patch.  Yes, at first I did what you 
mentioned, but modified the patch according to some advice in the mail 
thread.  During redo, create_tablespace_directories() needs to handle the 
case where the $PGDATA/pg_tblspc/xxx is not a symlink but a directory even 
on UNIX/Linux.  Please see TablespaceCreateDbspace(). 
destroy_tablespace_directories() doesn't have to handle such a situation.
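For readers following along, the quoted check boils down to the standalone
sketch below: lstat(), unlike stat(), does not follow symlinks, so it can
tell a plain directory apart from a symlink under pg_tblspc. The path is
illustrative.

    #include <stdio.h>
    #include <sys/stat.h>

    int
    main(void)
    {
        const char *linkloc = "pg_tblspc/16384";    /* illustrative path */
        struct stat st;

        if (lstat(linkloc, &st) != 0)
        {
            perror("lstat");
            return 1;
        }

        if (S_ISDIR(st.st_mode))
            printf("%s is a plain directory\n", linkloc);
        else if (S_ISLNK(st.st_mode))
            printf("%s is a symbolic link\n", linkloc);
        else
            printf("%s is neither a directory nor a symlink\n", linkloc);
        return 0;
    }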


Regards
MauMau





[HACKERS] Re: Linux kernel impact on PostgreSQL performance (summary v1 2014-1-15)

2014-01-15 Thread Mel Gorman
> One assumption would be that Postgres is perfectly happy with the current
> kernel behaviour in which case our discussion here is done.

It has been demonstrated that this statement was farcical.  The thread is
massive just from interaction with the LSF/MM program committee.  I'm hoping
that there will be Postgres representation at LSF/MM this year to bring
the issues to a wider audience. I expect that LSF/MM can only commit to
one person attending the whole summit due to limited seats but we could
be more flexible for the Postgres track itself so informal meetings
can be arranged for the evenings and at collab summit.

In case this gets forgotten, this mail describes what has already been
discussed and some of the proposals. Some stuff I do not describe because
it was superseded by later discussion. If I missed something important,
misinterpreted or simply screwed up then shout and I'll update this. I'd
rather none of this gets lost even if it takes months or years to address
it all.

On testing of modern kernels
----------------------------

Josh Berkus claims that most people are using Postgres with 2.6.19 and
consequently there may be poor awareness of recent kernel developments.
This is a disturbingly large window of opportunity for problems to have
been introduced. It begs the question of what sort of penetration modern
distributions shipping Postgres have. More information on why older kernels
dominate in Postgres installations would be nice.

Postgres bug reports and LKML
-----------------------------

It is claimed that LKML does not welcome bug reports but it's less clear
what the basis of this claim is.  Is it because the reports are ignored? A
possible explanation is that they are simply getting lost in the LKML noise
and there would be better luck if the bug report was cc'd to a specific
subsystem list. Another explanation is that there is not enough data
available to debug the problem. The worst explanation is that to date
the problem has not been fixable but the details of this have been lost
and are now unknown. Is it possible that some of these bug reports can be
refreshed so at least there is a chance they get addressed?

Apparently there were changes to the reclaim algorithms that crippled
performance without any sysctls. The problem may be compounded by the
introduction of adaptive replacement cache in the shape of the thrash
detection patches currently being reviewed.  Postgres investigated the
use of ARC in the past and ultimately abandoned it. Details are in the
archives (http://www.Postgres.org/search/?m=1&q=arc&l=1&d=-1&s=r). I
have not read them, just noting they exist for future reference.

Sysctls to control VM behaviour are not popular as such tuning parameters
are often used as an excuse to not properly fix the problem. Would it be
possible to describe a test case that shows 2.6.19 performing well and a
modern kernel failing? That would give the VM people a concrete basis to
work from to either fix the problem or identify exactly what sysctls are
required to make this work.

I am confident that any bug related to VM reclaim in this area has been lost.
At least, I recall no instances of it being discussed on linux-mm and it
has not featured on LSF/MM during the last years.

IO Scheduling
-------------

Kevin Grittner has stated that it is known that the DEADLINE and NOOP
schedulers perform better than any alternatives for most database loads.
It would be desirable to quantify this for some test case and see whether the
default scheduler can cope in some way.

The deadline scheduler makes sense to a large extent though. Postgres
is sensitive to large latencies due to IO write spikes. It is at least
plausible that deadline would give more deterministic behaviour for
parallel reads in the presence of large writes assuming there were not
ordering problems between the reads/writes and the underlying filesystem.

For reference, these IO spikes can be massive. If the shared buffer is
completely dirtied in a short space of time then it could be 20-25% of
RAM being dirtied and writeback required in typical configurations. There
have been cases where it was worked around by limiting the size of the
shared buffer to a small enough size so that it can be written back
quickly. There are other tuning options available such as altering when
dirty background writing starts within the kernel but that will not help if
the dirtying happens in a very short space of time. Dave Chinner described
the considerations as follows

There's no absolute rule here, but the threshold for background
writeback needs to consider the amount of dirty data being generated,
the rate at which it can be retired and the checkpoint period the
application is configured with. i.e. it needs to be slow enough to
not cause serious read IO perturbations, but still fast enough that
it avoids peaks at synchronisation points. And most importantly, it
needs to be fast enough that it

Re: [HACKERS] plpgsql.warn_shadow

2014-01-15 Thread Pavel Stehule
2014/1/15 Marko Tiikkaja 

> On 1/15/14 2:27 PM, Pavel Stehule wrote:
>
>> 2014/1/15 Marko Tiikkaja 
>>
>>> Yeah, me neither, it's just something that needs to be communicated very
>>> clearly.  So probably just a list  plpgsql.warnings  would be the most
>>> appropriate then.
>>>
>>>
>> I am thinking so the name is not good. Changing handling warnings is messy
>> - minimally in Postgres, where warnings and errors are different
>> creatures.
>>
>> what about
>>
>> plpgsql.enhanced_checks = (none | warning | error)
>>
>
> You crammed several suggestions into one here:
>
>   1) You're advocating the ability to turn warnings into errors.  This has
> been met with some resistance.  I think it's a useful feature, but I would
> be happy with just having warnings available.
>   2) This syntax doesn't allow the user to specify a list of warnings to
> enable.  Which might be fine, I guess.  I imagine the normal approach would
> be to turn all warnings on anyway, and possibly fine-tune with per-function
> directives if some functions do dangerous things on purpose.
>   3) You want to change the name to "enhanced_checks".  I still think the
> main feature is about displaying warnings to the user.  I don't
> particularly like this suggestion.
>

You should first say what a warning is and why it is only a warning and not
an error. And why should plpgsql warning processing be different from general
postgresql processing?

My objection is against too general an option. Every option should do one
clean thing.

Regards

Pavel


>
>
> Regards,
> Marko Tiikkaja
>


Re: [HACKERS] plpgsql.warn_shadow

2014-01-15 Thread Marko Tiikkaja

On 1/15/14 2:27 PM, Pavel Stehule wrote:

2014/1/15 Marko Tiikkaja 

Yeah, me neither, it's just something that needs to be communicated very
clearly.  So probably just a list  plpgsql.warnings  would be the most
appropriate then.



I am thinking so the name is not good. Changing handling warnings is messy
- minimally in Postgres, where warnings and errors are different creatures.

what about

plpgsql.enhanced_checks = (none | warning | error)


You crammed several suggestions into one here:

  1) You're advocating the ability to turn warnings into errors.  This 
has been met with some resistance.  I think it's a useful feature, but I 
would be happy with just having warnings available.
  2) This syntax doesn't allow the user to specify a list of warnings 
to enable.  Which might be fine, I guess.  I imagine the normal approach 
would be to turn all warnings on anyway, and possibly fine-tune with 
per-function directives if some functions do dangerous things on purpose.
  3) You want to change the name to "enhanced_checks".  I still think 
the main feature is about displaying warnings to the user.  I don't 
particularly like this suggestion.



Regards,
Marko Tiikkaja




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Tue, Jan 14, 2014 at 5:23 PM, Dave Chinner  wrote:
> By default, background writeback doesn't start until 10% of memory
> is dirtied, and on your machine that's 25GB of RAM. That's way to
> high for your workload.
>
> It appears to me that we are seeing large memory machines much more
> commonly in data centers - a couple of years ago 256GB RAM was only
> seen in supercomputers. Hence machines of this size are moving from
> "tweaking settings for supercomputers is OK" class to "tweaking
> settings for enterprise servers is not OK"
>
> Perhaps what we need to do is deprecate dirty_ratio and
> dirty_background_ratio as the default values as move to the byte
> based values as the defaults and cap them appropriately.  e.g.
> 10/20% of RAM for small machines down to a couple of GB for large
> machines

I think that's right.  In our case we know we're going to call fsync()
eventually and that's going to produce a torrent of I/O.  If that
torrent fits in downstream caches or can be satisfied quickly without
disrupting the rest of the system too much, then life is good.  But
the downstream caches don't typically grow proportionately to the size
of system memory.  Maybe a machine with 16GB has 1GB of battery-backed
write cache, but it doesn't follow that 256GB machine has 16GB of
battery-backed write cache.

> Essentially, changing dirty_background_bytes, dirty_bytes and
> dirty_expire_centiseconds to be much smaller should make the kernel
> start writeback much sooner and so you shouldn't have to limit the
> amount of buffers the application has to prevent major fsync
> triggered stalls...

I think this has been tried with some success, but I don't know the
details.  I think the bytes values are clearly more useful than the
percentages, because you can set them smaller and with better
granularity.

One thought that occurs to me is that it might be useful to have
PostgreSQL tell the system when we expect to perform an fsync.
Imagine fsync_is_coming(int fd, time_t).  We know long in advance
(minutes) when we're gonna do it, so in some sense what we'd like to
tell the kernel is: we're not in a hurry to get this data on disk
right now, but when the indicated time arrives, we are going to do
fsyncs of a bunch of files in rapid succession, so please arrange to
flush the data as close to that time as possible (to maximize
write-combining) while still finishing by that time (so that the
fsyncs are fast and more importantly so that they don't cause a
system-wide stall).
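
For illustration, if such a call existed, a checkpointer might use it roughly
like this (fsync_is_coming() is purely imaginary, exactly as proposed above;
nothing like it exists in any kernel today):

/* Hypothetical sketch only: fsync_is_coming() is the imagined interface,
 * not a real system call, so this cannot actually be linked anywhere. */
#include <time.h>

extern int fsync_is_coming(int fd, time_t deadline);    /* imaginary */

static void
hint_checkpoint_fsyncs(int *fds, int nfds, int seconds_until_checkpoint)
{
    time_t  deadline = time(NULL) + seconds_until_checkpoint;
    int     i;

    /* Tell the kernel when we intend to fsync each file, so it can spread
     * writeback out over the interval and still finish by the deadline. */
    for (i = 0; i < nfds; i++)
        (void) fsync_is_coming(fds[i], deadline);
}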

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Tue, Jan 14, 2014 at 4:23 PM, James Bottomley
 wrote:
> Yes, that's what I was thinking: it's a cache.  About how many files
> comprise this cache?  Are you thinking it's too difficult for every
> process to map the files?

No, I'm thinking that would throw cache coherency out the window.
Separate mappings are all well and good until somebody decides to
modify the page, but after that point the database processes need to
see the modified version of the page (which is, further, hedged about
with locks) yet the operating system MUST NOT see the modified version
of the page until the write-ahead log entry for the page modification
has been flushed to disk.  There's really no way to do that without
having our own private cache.
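
To sketch that invariant (simplified; the real logic lives in FlushBuffer()
in bufmgr.c, and write_page_to_os() below is a stand-in, not a real function):

/* Simplified sketch of the WAL-before-data rule described above. */
#include "postgres.h"
#include "access/xlog.h"        /* XLogFlush(), XLogRecPtr */
#include "storage/bufpage.h"    /* Page, PageGetLSN() */

extern void write_page_to_os(Page page);   /* stand-in for the smgr-level write */

static void
flush_one_dirty_page(Page page)
{
    /* LSN of the WAL record that last modified this page */
    XLogRecPtr  lsn = PageGetLSN(page);

    /* The WAL covering that change must reach disk first ... */
    XLogFlush(lsn);

    /* ... only then may the operating system see the modified data page. */
    write_page_to_os(page);
}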

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Hannu Krosing
On 01/15/2014 02:01 PM, Jan Kara wrote:
> On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
>> On 01/14/2014 06:12 PM, Robert Haas wrote:
>>> This would be pretty similar to copy-on-write, except
>>> without the copying. It would just be
>>> forget-from-the-buffer-pool-on-write. 
>> +1
>>
>> A version of this could probably already be implemented using MADV_DONTNEED
>> and MADV_WILLNEED
>>
>> That is, just after reading the page in, use MADV_DONTNEED on it. When
>> evicting
>> a clean page, check that it is still in cache and if it is, then
>> MADV_WILLNEED it.
>>
>> Another nice thing to do would be dynamically adjusting kernel
>> dirty_background_ratio
>> and other related knobs in real time based on how many buffers are dirty
>> inside postgresql.
>> Maybe in background writer.
>>
>> Question to LKM folks - will kernel react well to frequent changes to
>> /proc/sys/vm/dirty_*  ?
>> How frequent can they be (every few seconds? every second? 100 Hz?)
>   So the question is what do you mean by 'react'. We check whether we
> should start background writeback every dirty_writeback_centisecs (5s). We
> will also check whether we didn't exceed the background dirty limit (and
> wake writeback thread) when dirtying pages. However this check happens once
> per several dirtied MB (unless we are close to dirty_bytes).
>
> When writeback is running we check roughly once per second (the logic is
> more complex there but I don't think explaining details would be useful
> here) whether we are below dirty_background_bytes and stop writeback in
> that case.
>
> So changing dirty_background_bytes every few seconds should work
> reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
> note that you have conflicting requirements on the kernel writeback. On one
> hand you want checkpoint data to steadily trickle to disk (well, trickle
> isn't exactly the proper word since if you need to checkpoint 16 GB every 5
> minutes then you need a steady throughput of ~50 MB/s just for
> checkpointing) so you want to set dirty_background_bytes low, on the other
> hand you don't want temporary files to get to disk so you want to set
> dirty_background_bytes high. 
Is it possible to have more fine-grained control over writeback, like
configuring dirty_background_bytes per file system / device (or even
a file or a group of files)?

If not, then how hard would it be to provide this?

This is a bit backwards from the keeping-the-cache-clean perspective,
but it would help a lot with hinting to the writer that a big sync is coming.

> And also that changes of
> dirty_background_bytes probably will not take into account other events
> happening on the system (maybe a DB backup is running...). So I'm somewhat
> skeptical you will be able to tune dirty_background_bytes frequently in a
> useful way.
>


Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] trgm regex index peculiarity

2014-01-15 Thread Alexander Korotkov
On Fri, Jun 21, 2013 at 5:39 PM, Erik Rijkers  wrote:

> On Fri, June 21, 2013 15:11, Alexander Korotkov wrote:
> > On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers  wrote:
> >
> >> On Fri, June 21, 2013 05:25, Tom Lane wrote:
> >> > "Erik Rijkers"  writes:
> >> >> In a 112 MB test table (containing random generated text) with a trgm
> >> index (gin_trgm_ops), I consistently get these
> >> >> timings:
> >> >> select txt from azjunk6 where txt ~ '^abcd';
> >> >>130 ms
> >> >> select txt from azjunk6
> >> >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
> >> >>3 ms
> >> >
> >
> > Regex '^abcd' will be expanded into the trigrams '__a', '_ab', 'abc' and
> > 'bcd'.
> > However, the trigram '__a' is much more frequent than '_ab', which in turn is
> > much more frequent than 'abc' and 'bcd'. Omitting the ^ leads to omitting
> > '__a' and '_ab', and that gives such a significant speedup.
>
> > [trgm_regex_optimize.1.patch ]
>
> Yes, that fixes the problem, thanks.
>

Revised version of patch with necessary comments.

--
With best regards,
Alexander Korotkov.


trgm-regex-optimize.2.patch
Description: Binary data



Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jan Kara
On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
> On 01/14/2014 06:12 PM, Robert Haas wrote:
> > This would be pretty similar to copy-on-write, except
> > without the copying. It would just be
> > forget-from-the-buffer-pool-on-write. 
> 
> +1
> 
> A version of this could probably already be implemented using MADV_DONTNEED
> and MADV_WILLNEED
> 
> That is, just after reading the page in, use MADV_DONTNEED on it. When
> evicting
> a clean page, check that it is still in cache and if it is, then
> MADV_WILLNEED it.
> 
> Another nice thing to do would be dynamically adjusting kernel
> dirty_background_ratio
> and other related knobs in real time based on how many buffers are dirty
> inside postgresql.
> Maybe in background writer.
> 
> Question to LKM folks - will kernel react well to frequent changes to
> /proc/sys/vm/dirty_*  ?
> How frequent can they be (every few seconds? every second? 100 Hz?)
  So the question is what do you mean by 'react'. We check whether we
should start background writeback every dirty_writeback_centisecs (5s). We
will also check whether we didn't exceed the background dirty limit (and
wake writeback thread) when dirtying pages. However this check happens once
per several dirtied MB (unless we are close to dirty_bytes).

When writeback is running we check roughly once per second (the logic is
more complex there but I don't think explaining details would be useful
here) whether we are below dirty_background_bytes and stop writeback in
that case.

So changing dirty_background_bytes every few seconds should work
reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
note that you have conflicting requirements on the kernel writeback. On one
hand you want checkpoint data to steadily trickle to disk (well, trickle
isn't exactly the proper word since if you need to checkpoint 16 GB every 5
minutes then you need a steady throughput of ~50 MB/s just for
checkpointing) so you want to set dirty_background_bytes low, on the other
hand you don't want temporary files to get to disk so you want to set
dirty_background_bytes high. And also that changes of
dirty_background_bytes probably will not take into account other events
happening on the system (maybe a DB backup is running...). So I'm somewhat
skeptical you will be able to tune dirty_background_bytes frequently in a
useful way.
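
(For what it's worth, the kind of adjustment Hannu is asking about boils down
to a write to the sysctl file. PostgreSQL does not currently do anything like
this; the helper below is only an illustration and needs the privileges to
write the file.)

/* Illustration only: set vm.dirty_background_bytes from a C process. */
#include <stdio.h>

static int
set_dirty_background_bytes(long long bytes)
{
    FILE   *f = fopen("/proc/sys/vm/dirty_background_bytes", "w");

    if (f == NULL)
        return -1;              /* no permission, or not Linux */
    fprintf(f, "%lld\n", bytes);
    return fclose(f);
}

Per the guidance above, a background-writer-style process calling this at most
every few seconds would be about as often as the kernel could usefully react.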

Honza
-- 
Jan Kara 
SUSE Labs, CR




Re: [HACKERS] plpgsql.warn_shadow

2014-01-15 Thread Pavel Stehule
2014/1/15 Marko Tiikkaja 

> On 1/15/14 2:00 PM, Florian Pflug wrote:
>
>> On Jan15, 2014, at 13:32 , Marko Tiikkaja  wrote:
>>
>>> On 1/15/14 1:23 PM, Florian Pflug wrote:
>>>
 The fact that it's named plpgsql.warnings already clearly documents
 that this only affects plpgsql. But whether a particular warning is emitted
 during compilation or during execution is largely irrelevant, I think. For
 example, if we called this compiler_warning, we couldn't add a warning
 which triggers when SELECT .. INTO ignores excessive rows.

>>>
>>> There is the fact that something being a "compiler warning" gives you an
>>> idea on its effects on performance.  But maybe that would be better
>>> described in the documentation (perhaps even more accurately).
>>>
>>> I like the idea of warning about SELECT .. INTO, though, but that one
>>> could have a non-negligible performance penalty during execution.
>>>
>>
>> I'm not overly concerned about that. I imagine people would usually enable
>> warnings during development, not production.
>>
>
> Yeah, me neither, it's just something that needs to be communicated very
> clearly.  So probably just a list  plpgsql.warnings  would be the most
> appropriate then.
>

I don't think the name is good. Changing how warnings are handled is messy
- at least in Postgres, where warnings and errors are different creatures.

what about

plpgsql.enhanced_checks = (none | warning | error)





>
>
> Regards,
> Marko Tiikkaja
>


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jan Kara
On Wed 15-01-14 10:27:26, Heikki Linnakangas wrote:
> On 01/15/2014 06:01 AM, Jim Nasby wrote:
> >For the sake of completeness... it's theoretically silly that Postgres
> >is doing all this stuff with WAL when the filesystem is doing something
> >very similar with its journal. And an SSD drive (and next generation
> >spinning rust) is doing the same thing *again* in its own journal.
> >
> >If all 3 communities (or even just 2 of them!) could agree on the
> >necessary interface a tremendous amount of this duplicated technology
> >could be eliminated.
> >
> >That said, I rather doubt the Postgres community would go this route,
> >not so much because of the presumably massive changes needed, but more
> >because our community is not a fan of restricting our users to things
> >like "Thou shalt use a journaled FS or risk all thy data!"
> 
> The WAL is also used for continuous archiving and replication, not
> just crash recovery. We could skip full-page-writes, though, if we
> knew that the underlying filesystem/storage is guaranteeing that a
> write() is atomic.
> 
> It might be useful for PostgreSQL to somehow tell the filesystem that
> we're taking care of WAL-logging, so that the filesystem doesn't
> need to.
  Well, journalling fs generally cares about its metadata consistency. We
have much weaker guarantees regarding file data because those guarantees
come at a cost most people don't want to pay.

Filesystems could in theory provide a facility like atomic writes (at least up
to a certain size, say in the MB range), but it's not so easy, and when there are
no strong use cases fs people are reluctant to make their code more complex
unnecessarily. OTOH, without widespread atomic write support I understand that
application developers have a similar stance. So it's kind of a chicken-and-egg
problem. BTW, ext3/4, for example, has quite a bit of the infrastructure in place
due to its data=journal mode, so if someone on the PostgreSQL side wanted to
research this, knitting some experimental ext4 patches should be doable.

Honza
-- 
Jan Kara 
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Tue, Jan 14, 2014 at 09:54:20PM -0600, Jim Nasby wrote:
> On 1/14/14, 3:41 PM, Dave Chinner wrote:
> >On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
> >>On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman  wrote:
> >>Whether the problem is with the system call or the
> >>programmer is harder to determine.  I think the problem is in
> >>part that it's not exactly clear when we should call it.  So
> >>suppose we want to do a checkpoint.  What we used to do a long
> >>time ago is write everything, and then fsync it all, and then
> >>call it good.  But that produced horrible I/O storms.  So what
> >>we do now is do the writes over a period of time, with sleeps in
> >>between, and then fsync it all at the end, hoping that the
> >>kernel will write some of it before the fsyncs arrive so that we
> >>don't get a huge I/O spike.  And that sorta works, and it's
> >>definitely better than doing it all at full speed, but it's
> >>pretty imprecise.  If the kernel doesn't write enough of the
> >>data out in advance, then there's still a huge I/O storm when we
> >>do the fsyncs and everything grinds to a halt.  If it writes out
> >>more data than needed in advance, it increases the total number
> >>of physical writes because we get less write-combining, and that
> >>hurts performance, too.
> 
> I think there's a pretty important bit that Robert didn't mention:
> we have a specific *time* target for when we want all the fsync's
> to complete. People that have problems here tend to tune
> checkpoints to complete every 5-15 minutes, and they want the
> write traffic for the checkpoint spread out over 90% of that time
> interval. To put it another way, fsync's should be done when 90%
> of the time to the next checkpoint hits, but preferably not a lot
> before then.

I think that is pretty much understood. I don't recall anyone
mentioning a typical checkpoint period, though, so knowing the
typical timeframe of IO storms and how much data is typically
written in a checkpoint helps us understand the scale of the
problem.

> >It sounds to me like you want the kernel to start background
> >writeback earlier so that it doesn't build up as much dirty data
> >before you require a flush. There are several ways to do this by
> >tweaking writeback knobs. The simplest is probably just to set
> >/proc/sys/vm/dirty_background_bytes to an appropriate threshold
> >(say 50MB) and dirty_expire_centiseconds to a few seconds so that
> >background writeback starts and walks all dirty inodes almost
> >immediately. This will keep a steady stream of low level
> >background IO going, and fsync should then not take very long.
> 
> Except that still won't throttle writes, right? That's the big
> issue here: our users often can't tolerate big spikes in IO
> latency. They want user requests to always happen within a
> specific amount of time.

Right, but that's a different problem and one that io scheduling
tweaks can have a major effect on. e.g. the deadline scheduler
should be able to provide a maximum upper bound on read IO latency
even while writes are in progress, though how successful it is is
dependent on the nature of the write load and the architecture of
the underlying storage.

However, the first problem is dealing with the IO storm problem on
fsync. Then we can measure the effect of spreading those writes out
in time and determine what triggers read starvations (if they are
apparent). Then we can look at whether IO scheduling tweaks or
whether blk-io throttling solves those problems. Or whether
something else needs to be done to make it work in environments
where problems are manifesting.

FWIW [and I know you're probably sick of hearing this by now], but
the blk-io throttling works almost perfectly with applications that
use direct IO.

> So while delaying writes potentially reduces the total amount of
> data you're writing, users that run into problems here ultimately
> care more about ensuring that their foreground IO completes in a
> timely fashion.

Understood. Applications that crunch randomly through large data
sets are almost always read IO latency bound

> >Fundamentally, though, we need bug reports from people seeing
> >these problems when they see them so we can diagnose them on
> >their systems. Trying to discuss/diagnose these problems without
> >knowing anything about the storage, the kernel version, writeback
> >thresholds, etc really doesn't work because we can't easily
> >determine a root cause.
> 
> So is lsf...@linux-foundation.org the best way to accomplish that?

No. That is just the list for organising the LFSMM summit. ;)

For general pagecache and writeback issues, discussions, etc,
linux-fsde...@vger.kernel.org is the list to use. LKML simple has
too much noise to be useful these days, so I'd avoid it. Otherwise
the filesystem specific lists are are good place to get help for
specific problems (e.g. linux-e...@vger.kernel.org and
x...@oss.sgi.com). We tend to cross-post to other relevant lists as
tria

[HACKERS] What's the condition of bug "PANIC: WAL contains references to invalid pages"?

2014-01-15 Thread MauMau

Hello,

Please tell me a bit about the following bug, which has just been solved.  I 
hope this is exactly the bug that has been annoying us for a year.


Hot standby 9.2.6 -> 9.2.6 PANIC: WAL contains references to invalid pages
http://www.postgresql.org/message-id/CAL_0b1s4QCkFy_55kk_8XWcJPs7wsgVWf8vn4=jxe6v4r7h...@mail.gmail.com

I've read the discussion, but I'm wondering under what conditions this 
failure happens.  I guess I understand that the following conditions need to hold 
true.  Are there any other conditions?


* The database server crashes while a btree index is being extended (by page 
split).

* Hot standby is used.
* The standby is rebuilt and started.

When I last investigated the bug, the user was doing repeated failover 
testing --- stop the master by running "pg_ctl stop -mi" while some 
application was performing database updates, promote the standby, rebuild 
the standby with pg_basebackup, and start the new standby.  In one of those 
iterations, the newly rebuilt standby crashed with "WAL contains references 
to invalid pages".  This seems to match the above mail thread.


However, I don't understand why btree_xlog_vacuum() encountered an all-zero 
page.  How did the all-zero page appear on the standby?  Was it transferred 
from the master by pg_basebackup?  FYI, the server log didn't contain any 
messages related to disk full, nor any ERROR messages.


Regards
MauMau





Re: [HACKERS] plpgsql.warn_shadow

2014-01-15 Thread Marko Tiikkaja

On 1/15/14 2:00 PM, Florian Pflug wrote:

On Jan15, 2014, at 13:32 , Marko Tiikkaja  wrote:

On 1/15/14 1:23 PM, Florian Pflug wrote:

The fact that it's named plpgsql.warnings already clearly documents that this 
only affects plpgsql. But whether a particular warning is emitted during 
compilation or during execution is largely irrelevant, I think. For example, if 
we called this compiler_warning, we couldn't add a warning which triggers 
when SELECT .. INTO ignores excessive rows.


There is the fact that something being a "compiler warning" gives you an idea 
on its effects on performance.  But maybe that would be better described in the 
documentation (perhaps even more accurately).

I like the idea of warning about SELECT .. INTO, though, but that one could 
have a non-negligible performance penalty during execution.


I'm not overly concerned about that. I imagine people would usually enable 
warnings during development, not production.


Yeah, me neither, it's just something that needs to be communicated very 
clearly.  So probably just a list  plpgsql.warnings  would be the most 
appropriate then.



Regards,
Marko Tiikkaja




Re: [HACKERS] identify table oid for an AggState during plan tree initialization

2014-01-15 Thread Ashutosh Bapat
Hmm, ok, this is slightly involved.

Is it important that you do this at execution time? At the planner level,
where the Agg node is involved, one can look for the Aggref node and traverse
down its args list to find any Vars involved in the aggregation. You can
find the Var nodes involved in an Aggref by calling pull_var_clause() with
appropriate arguments. The varno in those Var nodes can be used to pull the
RangeTblEntry (RTE) from the query's range table, hanging off the Query
(rt_fetch). If the RTE belongs to a table (it could be anything: a subquery,
function, join ...), the relid in the RTE gives you the OID of the table.

Though this is simple for the simple query given as an example in your mail,
the process is tricky for more complex queries.
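
To make that recipe concrete, a rough sketch (assuming 9.3-era internals; the
pull_var_clause() signature differs in other releases; this runs on the Query
before setrefs.c has replaced the varnos, and error handling is omitted):

/* Sketch: report the plain tables whose columns feed a given Aggref. */
#include "postgres.h"
#include "optimizer/var.h"      /* pull_var_clause() */
#include "parser/parsetree.h"   /* rt_fetch() */

static void
report_agg_source_tables(Aggref *aggref, Query *query)
{
    List       *vars;
    ListCell   *lc;

    /* collect all Vars referenced by the aggregate's arguments */
    vars = pull_var_clause((Node *) aggref->args,
                           PVC_RECURSE_AGGREGATES,
                           PVC_RECURSE_PLACEHOLDERS);

    foreach(lc, vars)
    {
        Var            *var = (Var *) lfirst(lc);
        RangeTblEntry  *rte = rt_fetch(var->varno, query->rtable);

        if (rte->rtekind == RTE_RELATION)
            elog(NOTICE, "attribute %d of relation OID %u feeds this aggregate",
                 var->varattno, rte->relid);
    }
}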


On Wed, Jan 15, 2014 at 6:05 PM, Masterprojekt Naumann1 <
mpws201...@gmail.com> wrote:

> 2014/1/15 Ashutosh Bapat 
>
>> Hi Cathleen,
>> An aggregate can be working on more than one table e.g.
>>
>> select count(*) from a, b, c where a.c1 = b.c1 and b.c1 = c.c1;
>>
>> In such a case, which table would you like to be reported?
>>
>> IOW, it doesn't look to be sensible to attach and aggregate with a table.
>>
>> If you can explain what you intend to do with that information, there
>> might be other ways to solve your problem.
>>
>>
>> On Wed, Jan 15, 2014 at 3:29 PM, Masterprojekt Naumann1 <
>> mpws201...@gmail.com> wrote:
>>
>>>
>>> 2014/1/15 Masterprojekt Naumann1 
>>>
 Hi,

 during the initialization of the nodes in the plan tree (in
 ExecInitNode in the file execProcnode.c) I want to find out for a node with
 the type T_Agg which table will be aggregated. I tried the following:

 resultAsAggState = ExecInitAgg((Agg *) node, estate, eflags);

 if (resultAsAggState)
 {
  //tableOid = rel->rd_id;
 //tableOid = resultAsAggState->ss.ss_currentRelation->rd_id;
  }
 It would be great to get the Oid of the table, but I would also be
 satisfied if I could get at least the name of the table. Does anyone know
 if it is possible to gather this information?

 Best regards
 Cathleen

>>>
>>> Sorry, my mail program sent the mail too early. Please ignore the line
>>> tableOid = rel->rd_id. I was just deleting the line when Google sent the
>>> mail :(
>>>
>>> resultAsAggState is of the type AggState. As you can see I tried to get
>>> the Oid with the ScanState inside of the AggState, but its Relation
>>> ss_currentRelation is not set already. Is there another way to get the Oid?
>>>
>>
>>
>>
>> --
>> Best Wishes,
>> Ashutosh Bapat
>> EnterpriseDB Corporation
>> The Postgres Database Company
>>
>
>
> Hi,
>
> maybe I should clarify what I want to achieve: At the moment I only handle
> special aggregations with a filter that compares values with a constant. An
> example query would be "select sum(c_custkey), c_nationkey from customer
> group by c_nationkey having sum(c_custkey) = 445820473;" This query can be
> run on TPCH.
>
> For such special queries I want to find out, on which column in the query
> result will be filtered and what is the filtered value. For the columnid,
> it is sufficient, if I get the table OID and the columnid of the original
> column, i.e. the table OID of customer and the columnid of c_custkey in the
> original table. I know that the expression is represented by an OpExpr,
> that has an opno and operators. I would like to find such expression in an
> AggState. Additionally, I need the table OID, which may be elsewhere.
> Where can I find this information?
>
> Best regards,
> Cathleen
>
>


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] plpgsql.warn_shadow

2014-01-15 Thread Florian Pflug
On Jan15, 2014, at 13:32 , Marko Tiikkaja  wrote:
> On 1/15/14 1:23 PM, Florian Pflug wrote:
>> The fact that it's named plpgsql.warnings already clearly documents that 
>> this only affects plpgsql. But whether a particular warning is emitted 
>> during compilation or during execution is largely irrelevant, I think. For
>> example, if we called this compiler_warning, we couldn't add a warning
>> which triggers when SELECT .. INTO ignores excessive rows.
> 
> There is the fact that something being a "compiler warning" gives you an idea 
> on its effects on performance.  But maybe that would be better described in 
> the documentation (perhaps even more accurately).
> 
> I like the idea of warning about SELECT .. INTO, though, but that one could 
> have a non-negligible performance penalty during execution.

I'm not overly concerned about that. I imagine people would usually enable 
warnings during development, not production.

best regards,
Florian Pflug





Re: [HACKERS] WAL Rate Limiting

2014-01-15 Thread Magnus Hagander
On Wed, Jan 15, 2014 at 3:20 AM, Simon Riggs  wrote:

> We've discussed previously the negative impact of large bulk
> operations, especially wrt WAL writes. Patch here allows maintenance
> operations to have their WAL generation slowed down as a replication
> lag prevention feature.
>
> I believe there was originally intended to be some work on I/O rate
> limiting, but that hasn't happened and is in some ways orthogonal to
> this patch and we will likely eventually want both.
>
> Single new parameter works very similarly to vacuum_cost_delay
>
> wal_rate_limit_delay = Xms
>

Seems like a really bad name if we are only slowing down some commands -
that seems to indicate we're slowing down all of them. I think it should be
something that indicates that it only affects the maintenance commands.
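
(For readers who don't know the vacuum_cost_delay pattern being referred to,
the general shape is roughly the following. This is only a sketch of that
pattern, with assumed names and an assumed byte-based cost model, not the code
from the actual patch.)

#include "postgres.h"

static int  wal_rate_limit_delay = 0;               /* GUC, in ms; 0 disables */
static int  wal_rate_limit_burst = 1024 * 1024;     /* bytes allowed per sleep */
static long wal_bytes_since_sleep = 0;

static void
maybe_throttle_wal(Size record_bytes)
{
    if (wal_rate_limit_delay <= 0)
        return;                                     /* feature disabled */

    wal_bytes_since_sleep += record_bytes;
    if (wal_bytes_since_sleep >= wal_rate_limit_burst)
    {
        pg_usleep(wal_rate_limit_delay * 1000L);    /* ms to microseconds */
        wal_bytes_since_sleep = 0;
    }
}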

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] identify table oid for an AggState during plan tree initialization

2014-01-15 Thread Masterprojekt Naumann1
2014/1/15 Ashutosh Bapat 

> Hi Cathleen,
> An aggregate can be working on more than one table e.g.
>
> select count(*) from a, b, c where a.c1 = b.c1 and b.c1 = c.c1;
>
> In such a case, which table would you like to be reported?
>
> IOW, it doesn't look to be sensible to attach and aggregate with a table.
>
> If you can explain what you intend to do with that information, there
> might be other ways to solve your problem.
>
>
> On Wed, Jan 15, 2014 at 3:29 PM, Masterprojekt Naumann1 <
> mpws201...@gmail.com> wrote:
>
>>
>> 2014/1/15 Masterprojekt Naumann1 
>>
>>> Hi,
>>>
>>> during the initialization of the nodes in the plan tree (in ExecInitNode
>>> in the file execProcnode.c) I want to find out for a node with the type
>>> T_Agg which table will be aggregated. I tried the following:
>>>
>>> resultAsAggState = ExecInitAgg((Agg *) node, estate, eflags);
>>>
>>> if (resultAsAggState)
>>> {
>>>  //tableOid = rel->rd_id;
>>> //tableOid = resultAsAggState->ss.ss_currentRelation->rd_id;
>>>  }
>>> It would be great to get the Oid of the table, but I would also be
>>> satisfied if I could get at least the name of the table. Does anyone know
>>> if it is possible to gather this information?
>>>
>>> Best regards
>>> Cathleen
>>>
>>
>> Sorry, my mail program sent the mail too early. Please ignore the line
>> tableOid = rel->rd_id. I was just deleting the line when Google sent the
>> mail :(
>>
>> resultAsAggState is of the type AggState. As you can see I tried to get
>> the Oid with the ScanState inside of the AggState, but its Relation
>> ss_currentRelation is not set already. Is there another way to get the Oid?
>>
>
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>


Hi,

maybe I should clarify what I want to achieve: at the moment I only handle
special aggregations with a filter that compares values with a constant. An
example query would be "select sum(c_custkey), c_nationkey from customer
group by c_nationkey having sum(c_custkey) = 445820473;" This query can be
run on TPCH.

For such special queries I want to find out which column of the query
result will be filtered on and what the filter value is. For the column id,
it is sufficient if I get the table OID and the column id of the original
column, i.e. the table OID of customer and the column id of c_custkey in the
original table. I know that the expression is represented by an OpExpr,
which has an opno and its arguments. I would like to find such an expression
in an AggState. Additionally, I need the table OID, which may be stored
elsewhere. Where can I find this information?
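
(For the exact shape above, i.e. agg(col) <op> constant in HAVING, the filter
can be picked apart at the planner level before worrying about AggState. A
sketch under that assumption only; see Ashutosh's caveats about more complex
queries:)

/* Sketch: recognize a HAVING clause of the form  agg(col) <op> constant. */
#include "postgres.h"
#include "nodes/parsenodes.h"

static void
report_simple_having_filter(Query *query)
{
    OpExpr     *op;
    Aggref     *agg;
    Const      *cnst;

    if (query->havingQual == NULL || !IsA(query->havingQual, OpExpr))
        return;

    op = (OpExpr *) query->havingQual;
    if (list_length(op->args) != 2 ||
        !IsA(linitial(op->args), Aggref) ||
        !IsA(lsecond(op->args), Const))
        return;

    agg = (Aggref *) linitial(op->args);
    cnst = (Const *) lsecond(op->args);

    elog(NOTICE, "HAVING filter: aggregate %u, operator %u, constant of type %u",
         agg->aggfnoid, op->opno, cnst->consttype);

    /* The table OID and column behind agg's argument can then be found via
     * the Vars in agg->args, as in the pull_var_clause() sketch upthread. */
}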

Best regards,
Cathleen


  1   2   >