Re: [HACKERS] [PATCH] Doc fix for VACUUM FREEZE

2014-01-13 Thread Amit Kapila
On Fri, Jan 3, 2014 at 9:02 PM, Peter Eisentraut  wrote:
> On 12/17/13, 8:16 PM, Maciek Sakrejda wrote:
>> (now with patch--sorry about that)
>
> This patch doesn't apply.

There are some recent changes around the same place which broke this
patch. Please find the modified patch attached to this mail; I have
changed the order in which vacuum_freeze_min_age and
vacuum_freeze_table_age are specified for the VACUUM FREEZE option.
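
For context, per the VACUUM reference page, specifying FREEZE is
equivalent to running VACUUM with both of those parameters set to zero;
on a placeholder table:

VACUUM (FREEZE) some_table;
-- same effect as VACUUM some_table with vacuum_freeze_min_age = 0
-- and vacuum_freeze_table_age = 0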

I will mark this patch as Ready For Committer, unless there is
any other objection to this patch.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


correct-vacuum-freeze-docs_v2.patch
Description: Binary data



Re: [HACKERS] Proposal: variant of regclass

2014-01-13 Thread Michael Paquier
Hi,

On Tue, Jan 14, 2014 at 4:28 PM, Yugo Nagata  wrote:
> Here is the patch to implement to_regclass, to_regproc, to_regoper,
> and to_regtype. They are new functions similar to regclass, regproc,
> regoper, and regtype, except that if the requested object is not found,
> they return InvalidOid rather than raising an error.
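
For illustration, the intended difference (a hypothetical session, with
to_regclass behaving per the proposal):

SELECT 'no_such_table'::regclass;    -- today: ERROR: relation does not exist
SELECT to_regclass('no_such_table'); -- proposed: returns InvalidOid, no error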

You should add this patch to the upcoming commit fest (beginning
tomorrow, actually); I am not seeing it in the list:
https://commitfest.postgresql.org/action/commitfest_view?id=21
Thanks,
-- 
Michael




Re: [HACKERS] [BUGS] surprising to_timestamp behavior

2014-01-13 Thread Jeevan Chalke
On Thu, Oct 31, 2013 at 10:50 AM, Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
>
> On Tue, Oct 29, 2013 at 11:05 PM, Robert Haas wrote:
>
>> On Tue, Oct 29, 2013 at 12:03 PM, Tom Lane  wrote:
>> > Robert Haas  writes:
>> >> It turns out that when you use the to_timestamp function, a space in
>> >> the format mask can result in skipping any character at all, even a
>> >> digit, in the input string.  Consider this example, where 10 hours are
>> >> lost:
>> >
>> >> rhaas=# select to_timestamp('2013-10-29 10:47:18', 'YYYY-MM-DD  HH24:MI:SS');
>> >>       to_timestamp
>> >> ------------------------
>> >>  2013-10-29 00:47:18-04
>> >> (1 row)
>> >
>> > And that's a bug why?  The format says to ignore two characters before
>> > the hours field.  I think you're proposing to remove important
>> > functionality.
>> >
>> > To refine the point a bit, it's absolutely stupid to be using
>> > to_timestamp at all for sane input data like this example.  Just cast
>> > the string to timestamp(tz), and the standard datatype input function
>> > will do a better job than to_timestamp ever would.  The point of
>> > to_timestamp, IMNSHO, is to extract data successfully from weirdly
>> > formatted input; which might well include cases where there are stray
>> > digits you don't want taken as data.  So I'm not on board with
>> > proposals to "fix" cases like this by making the format string's
>> > meaning squishier.
>>
>> Well, you're the second person to react that way to this proposal, but
>> the current behavior seems mighty odd to me - even odder, now that I
>> realize that we'll happily match '"cat"' to 'dog'.  I just work here,
>> though.
>>
>
> Well, I agree with Tom: the user provided two spaces to skip before the
> hours, and that is exactly what we are doing.
>
> Still, here are a few other observations:
>
>
> (1) I don't see the following as wrong output in PostgreSQL, as I already
> said above in agreement with Tom (the input has only one space between DD
> and HH24, but the mask has two spaces):
>
> postgres=# select to_timestamp('2011-03-18 23:38:15', 'YYYY-MM-DD  HH24:MI:SS');
>        to_timestamp
> ---------------------------
>  2011-03-18 03:38:15+05:30
> (1 row)
>
> (Note that the time 23 became 03 because the extra space in the mask ate
> the 2 in 23, leaving 3 for HH24. Fair enough: this is as expected, so no
> issue.)
>
>
> (2) But I see the following as buggy (both the input and the mask have
> two spaces between DD and HH24):
>
> postgres=# select to_timestamp('2011-03-18  23:38:15', 'YYYY-MM-DD  HH24:MI:SS');
>        to_timestamp
> ---------------------------
>  2011-03-18 03:38:15+05:30
> (1 row)
>
> (Note that this time we should not end up eating the 2 from 23, since the
> mask and the input have exactly matching spaces. Not good and not
> expected; this looks like a BUG.)
>
> So I think we need to resolve the second case.
>

Attached is a patch which fixes this issue.

I have just tweaked the code around ignoring spaces in the
DCH_from_char() function.
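
Assuming the fix behaves as described, the second example should then keep
the hour intact (expected, not verified, output):

select to_timestamp('2011-03-18  23:38:15', 'YYYY-MM-DD  HH24:MI:SS');
       to_timestamp
---------------------------
 2011-03-18 23:38:15+05:30
(1 row)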

Adding to CommitFest 2014-01 (Open).

Thanks


> Thanks
>
>
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>
> --
> Jeevan B Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>


-- 
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb



Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-13 Thread David Rowley
On Tue, Jan 14, 2014 at 2:00 PM, Florian Pflug  wrote:

> On Jan10, 2014, at 22:27 , David Rowley  wrote:
> > As the patch stands at the moment, I have a regression test which
> > currently fails due to these extra zeros after the decimal point:
> >
> > -- This test currently fails due to extra trailing 0 digits.
> > SELECT SUM(n::numeric) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND
> > UNBOUNDED FOLLOWING)
> >   FROM (VALUES(1,1.01),(2,2),(3,3)) v(i,n);
> >
> > Patched produces:
> >  6.01
> >  5.00
> >  3.00
> > Unpatched produces:
> >  6.01
> >  5
> >  3
>
> Hm, that's annoying. I checked the standard to see if it offers any
> guidance here, but it doesn't look like it does - the standard just says
> that SUM() ought to return a type with the same scale as the input type
> has. That's fine if every NUMERIC type has a fixed scale, but that's not
> the case in postgres - we allow values of unconstrained NUMERIC types
> (i.e., numeric types without an explicitly specified precision or scale)
> to each define their own scale.
>
>
Thanks for digging that out of the standard, and thanks for all the
information you supplied in
http://www.postgresql.org/message-id/0fa6c08e-2166-405b-83f7-63b196b88...@phlo.org
too.
Sorry I've not had the chance to reply yet. I was kind of hoping that the
answer would be more in my favour, to help with inverse transitions for
sum(numeric).



> To fix this, we'd have to track the maximum scale within the current frame.
> That's easier than the general problem of providing an inverse transition
> function for MAX, because AFAIK we limit the scale to at most 1000. So it'd
> be sufficient to track the number of times we saw each scale, and also the
> current maximum. Once we reduce the current maximum's count back to zero
> in the inverse transition function, we'd scan from that value to the left
> to find the next non-zero entry.
>
>
I've been thinking about this, but I had thought that the maximum dscale
was bigger than 1000. The docs seem to claim 16383 here:
http://www.postgresql.org/docs/devel/static/datatype-numeric.html
I'd go ahead and implement this if that number were smaller, but I'm
thinking that zeroing out an array of 16383 elements on the first call to
do_numeric_accum might be too big an overhead to write off as "background
noise". If it were 20 or even 100, then I'd probably try for that.

I think the overhead of each call after that would likely be OK, as it
would probably just be an operation like state->scaleCount[X.dscale]++,
which I would imagine adds only a very small percentage of overhead to
normal aggregation. Of course, the inverse would have to do the harder
work of looping backwards over the array until it found an element with
a count above 0 and setting that as the current maximum. I think this
would be a winner if it weren't for the high initial hit of zeroing that
16383-element (or 1000-element, whichever) array.



> We could also choose to ignore this, although I'm not sure I really like
> that. It seems entirely OK at first sight - after all, the values all
> stay the same, we just emit a different number of trailing zeros. But it
> still causes results to be affected by values, even if only in the number
> of trailing zeros, which lie outside the value's range. That seems like
> more non-determinism than a database should show.
>
> > With inverse transitions this query still produces correct results, it
> > just does not produce the numeric in the same format as it does without
> > performing inverse transitions. Personally I'd rather focus on trying to
> > get SUM(numeric) in there for 9.4.
>
> I think it'd be worthwile to get this into 9.4, if that's still an option,
> even if we only support COUNT.
>
> best regards,
> Florian Pflug
>
>


Re: [HACKERS] [PATCH] Filter error log statements by sqlstate

2014-01-13 Thread Jeevan Chalke
On Mon, Jan 13, 2014 at 4:30 PM, Oskari Saarenmaa  wrote:

> Hi,
>
>
> On 13/01/14 10:26, Jeevan Chalke wrote:
>
>> 1. Documentation is missing, and it is thus difficult to understand what
>> exactly you are trying to do. In other words, users will be uncertain
>> about how to use it effectively.
>>
>
> I figured I'd write documentation for this if it looks like a useful
> feature which would be accepted for 9.4, but I guess it would've helped to
> have a bit better description of this for the initial submission as well.
>
>
>> 2. Some more comments are required, at each new function and especially
>> at get_sqlstate_error_level().
>>
>
> Just after I submitted the patch I noticed that I had a placeholder for
> comment about that function but never wrote the actual comment, sorry about
> that.
>
>
>> 3. Please add a test case if possible.
>>
>
> Sure.
>
>
>> 4. Some parts of the code do not comply with PostgreSQL indentation
>> style. (This can be ignored as it will pass through pgindent, but
>> better to fix it.)
>>
>
> I'll try to fix this for v2.
>
>
>> 5. You have used the "XX000:warning," string to get the maximum possible
>> length of a valid sqlstate:level identifier. That's fine, but a small
>> explanation there would be good. Also, if in the future we add any other
>> error level which exceeds this length, we will need changes here too,
>> right?
>>
>
> Good point, I'll address this in v2.
>
>
>> I will look into this further, but please give your attention to the
>> above points.
>>
>
> Thanks for the review!
>

Since you are taking care of most of the points above, I will wait for the
v2 patch. Till then, I am marking this "Waiting on Author".

Thanks


>
> / Oskari
>
>
>  On Fri, Jan 10, 2014 at 12:56 AM, Oskari Saarenmaa 
>> wrote:
>> > Allow the default log_min_error_statement to be overridden per
>> > sqlstate to make it possible to filter out some error types while
>> > maintaining a low log_min_error_statement or enable logging for some
>> > error types when the default is to not log anything.
>> >
>> > I've tried to do something like this using rsyslog filters, but
>> > that's pretty awkward and doesn't work at all when the statement is
>> > split to multiple syslog messages.
>> >
>> > https://github.com/saaros/postgres/compare/log-by-sqlstate
>>
>
>


-- 
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb



Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-01-13 Thread Amit Kapila
On Tue, Jan 14, 2014 at 2:16 AM, Robert Haas  wrote:
> On Sat, Jan 11, 2014 at 1:08 AM, Amit Kapila  wrote:
>> Yes, currently this applies to update; what I have in mind is that in
>> the future, if someone wants to use WAL compression for any other
>> operation, like 'full_page_writes', it can be easily extended.
>>
>> To be honest, I have not evaluated whether such a flag or compression
>> would make sense for full page writes, but I think it should be possible
>> while doing a full page write (BkpBlock has RelFileNode) to check such a
>> flag if it's present.
>
> Makes sense.

   So shall I change it to a string instead of a bool, and keep the name
   as compress_wal or compress_wal_for_opr?

>> The reason for adding the same chunk at the head of the list is that it
>> uses the same technique as pglz_hist_add. Now pglz will not repeat steps
>> c~f, as it has the concept of good_match, which gets this done in one go.
>>
>> That being said, I am really not sure how much real-world data falls
>> into the above category, or whether we should try to optimize based on
>> the above example, but yes, it will save some CPU cycles in the current
>> test we are using.
>
> In the Rabin algorithm, we shouldn't try to find a longer match.  The
> match should end at the chunk end, period.  Otherwise, you lose the
> shift-resistant property of the algorithm.

   Okay, it will work well for cases where most chunks in the tuple are
   due to a special pattern in it, but it will lose out on CPU cycles in
   cases where most of the chunks are due to the maximum chunk boundary
   and most of the new tuple matches the old tuple. The reason is that if
   the algorithm had the property of finding matches longer than chunk
   boundaries, it could save us from calculating the hash again and again
   when we try to find a match in the old tuple.
   However, I think it is better to go with Rabin's algorithm than to add
   optimizations based on our own assumptions, because real-world tuple
   data is difficult to predict.

>>
>> Isn't that similar to how the current pglz works? Basically, it also
>> uses the next 4 bytes to calculate an index (pglz_hist_idx) but still
>> does a byte-by-byte comparison; if we try to map this onto Rabin's
>> delta encoding, then the chunk size is always 1.
>
> I don't quite understand this.  The point of the Rabin algorithm is to
> split the old tuple up into chunks and then search for those chunks in
> the new tuple.  For example, suppose the old tuple is
> abcdefghijklmnopqrstuvwxyz.  It might get split like this: abcdef
> hijklmnopqrstuvw xyz.  If any of those three chunks appear in the new
> tuple, then we'll use them for compression.  If not, we'll just copy
> the literal bytes.  If the chunks appear in the new tuple reordered or
> shifted or with stuff inserted between one chunk and the next, we'll
> still find them.  Unless I'm confused, which is possible, what you're
> doing is essentially looking at the string and splitting it at those
> three places, but then recording the chunks as being three bytes
> shorter than they really are.  I don't see how that can be right.

  Spending some more time on the algorithm today, I found the bug you
  are pointing to, and you are right in saying that the chunk is shorter.
  I think it should not be difficult to address this issue without
  affecting most of the algorithm; let me try to handle it.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Case sensitive mode in windows build option

2014-01-13 Thread Craig Ringer
On 01/14/2014 11:49 AM, Dilip kumar wrote:
>  
> 
> As per the current behavior, if a user wants to build in debug mode on
> Windows, then they need to give "debug" in capital letters (DEBUG).
>
> I think many users will make mistakes with this option; in my opinion,
> we can make it case-insensitive.

The idea seems reasonable; the implementation does not. You've changed
the meaning rather more than just making it case-insensitive.

Use the Perl 'lc' function to compare a lower-cased input instead.

http://perldoc.perl.org/functions/lc.html

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Case sensitive mode in windows build option

2014-01-13 Thread Dilip kumar

As per the current behavior, if a user wants to build in debug mode on
Windows, then they need to give "debug" in capital letters (DEBUG).
I think many users will make mistakes with this option; in my opinion, we
can make it case-insensitive.

I have attached a small patch for the same (it just converts the
comparison to be case-insensitive).

Regards,
Dilip


windows_build.patch
Description: windows_build.patch



[HACKERS] Soften pg_[start|stop]_backup to allow them on a standby?

2014-01-13 Thread Michael Paquier
Hi all,

This is perhaps something that has already been discussed on hackers; I
just could not find anything in the archives.
Currently, pg_start_backup and pg_stop_backup cannot run on a standby
because it is not possible to write a backup_label file to disk, the
nature of a standby server being to prevent writing any data in its
PGDATA. Is this understanding right? This is what the comments at the
top of do_pg_start_backup lead me to conclude.

Could we consider softening the rules of pg_start_backup and
pg_stop_backup (i.e., allow creation of a backup_label file) so as to be
able to run them on a standby? Or not? This could speed up taking
backups locally from standbys.

Another idea would be to send the backup label file directly as the
output of pg_start_backup, so that a client application can grab it and
reuse it. Any thoughts about that as well?

Note that pg_basebackup uses the command BASE_BACKUP to bypass this: it
sends the label file directly through a stream without writing it to
disk, satisfying this condition and making it possible to create backups
from a standby.
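
For reference, the functions under discussion are used like this on a
primary today:

SELECT pg_start_backup('nightly');  -- writes backup_label into PGDATA
-- ... copy the data directory with tar/rsync ...
SELECT pg_stop_backup();            -- cleans up the label, ends the backup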

Regards,
-- 
Michael




Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Florian Pflug
On Jan14, 2014, at 02:10 , Kevin Grittner  wrote:
> The fact that some
> day some new programmer might not be aware of all business rules,
> or might choose to try to ignore them is the reason you add
> constraints to columns and domains.

Well, for columns and domains that seems easy. We could have

  array_has_shape(value anyarray, variadic bounds int4range[])

and

  array_has_shape(value anyarray, variadic bounds int4[])

The first returns true if the value has length(bounds) dimensions
and each dimension's bounds match the corresponding range's bound,
where NULL means "arbitrary". The second one requires all lower
bounds to be 1, and checks the upper bounds against the bounds array.

Checking that an array is one-dimensional with lower bound 1 is then
accomplished by

  array_has_shape(myarray, int4range(1, NULL))

or simply

  array_has_shape(myarray, NULL);
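
For concreteness, a minimal PL/pgSQL sketch of the first variant
(hypothetical, no such function exists; it assumes canonical int4range
semantics, i.e. '[lower, upper)', so an inclusive upper array bound U
corresponds to upper(range) = U + 1):

CREATE FUNCTION array_has_shape(value anyarray, VARIADIC bounds int4range[])
RETURNS boolean
LANGUAGE plpgsql IMMUTABLE AS $$
DECLARE
    i int;
BEGIN
    -- The dimension count must match the number of ranges supplied.
    IF array_ndims(value) IS DISTINCT FROM array_length(bounds, 1) THEN
        RETURN false;
    END IF;
    FOR i IN 1 .. array_length(bounds, 1) LOOP
        IF bounds[i] IS NULL THEN
            CONTINUE;           -- NULL range: this dimension is arbitrary
        END IF;
        IF lower(bounds[i]) IS NOT NULL
           AND array_lower(value, i) <> lower(bounds[i]) THEN
            RETURN false;       -- lower bound mismatch
        END IF;
        IF upper(bounds[i]) IS NOT NULL
           AND array_upper(value, i) <> upper(bounds[i]) - 1 THEN
            RETURN false;       -- upper bound mismatch
        END IF;
    END LOOP;
    RETURN true;
END;
$$;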

best regards,
Florian Pflug





Re: [HACKERS] Capturing EXPLAIN ANALYZE for a query without discarding the normal result set

2014-01-13 Thread Tom Lane
Dave Cole  writes:
> It would be really cool if you could direct the EXPLAIN ANALYZE output to a
> temporary table so that the query being analyzed could execute normally.

What happens if the current transaction rolls back?

If you want noninvasive explain data, contrib/auto_explain offers
a solution right now.  The info goes to the postmaster log, which is
perhaps less convenient than a temp table for interactive use, but
it doesn't have the rollback problem --- and you can capture data
about queries issued by a live application, without hacking the app.

regards, tom lane




Re: [HACKERS] Capturing EXPLAIN ANALYZE for a query without discarding the normal result set

2014-01-13 Thread Marti Raudsepp
On Tue, Jan 14, 2014 at 5:06 AM, Dave Cole  wrote:
> It would be really cool if you could direct the EXPLAIN ANALYZE output to a
> temporary table so that the query being analyzed could execute normally.

You can use the auto_explain contrib module to log the query plans of
slow(er) queries:
http://www.postgresql.org/docs/current/static/auto-explain.html

If you Really Need To, you can use the csvlog log format and import
that to a table, but really it's easier to use less/grep/etc.
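
For interactive experimentation it can also be loaded per session; the
settings below are the ones shown in the auto_explain documentation:

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0; -- log every statement's plan
SET auto_explain.log_analyze = true;   -- include actual timings
SELECT count(*) FROM pg_class;         -- plan appears in the server log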

Regards,
Marti




[HACKERS] Capturing EXPLAIN ANALYZE for a query without discarding the normal result set

2014-01-13 Thread Dave Cole
I apologise for dropping this out of nowhere.  I had an idea about EXPLAIN
ANALYZE that would be very useful for the system we are developing and
supporting.

It would be really cool if you could direct the EXPLAIN ANALYZE output to a
temporary table so that the query being analyzed could execute normally.
 Something like this:

EXPLAIN ANALYZE INTO a_temp
   SELECT ...

Then we could temporarily cause our application to log EXPLAIN ANALYZE
information for certain queries without disrupting normal operation of the
system. When we noticed long-running queries, we would then immediately
follow up the original query with a select on the temporary table.

We deal with a lot of transient data, so the conditions that cause bad
query performance are not always reproducible.

I have no idea how feasible this is, so please feel free to tell me I am an
idiot.

- Dave


[HACKERS] Comment typo in src/include/access/gin_private.h

2014-01-13 Thread Etsuro Fujita
I ran into a typo in src/include/access/gin_private.h.  Patch attached.

Thanks,

Best regards,
Etsuro Fujita


gin_private.h-typo.patch
Description: Binary data



Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Florian Pflug
On Jan14, 2014, at 00:33 , Craig Ringer  wrote:
> So I guess the question is: Is it worth all that hassle to remove a
> misfeature you have to go out of your way to use? Is support for non-1
> lower bounds stopping us from doing something useful and important? Or
> is it just an irritation that it exists?

I don't think it's worth it - as you say, the actual risk of bugs is low,
because you have to go out of your way to end up with a lower bound other
than one.

Also, at least from my POV, the fact that we use one type to represent
arrays with an arbitrary number of dimensions is actually worse than
the lower-bound problem. So *if* we ever remove support for arbitrary
lower bounds, we should also add distinct types for different dimensions.
That'd probably require some extension of the type system, though...

best regards,
Florian Pflug





Re: [HACKERS] Where do we stand on 9.3 bugs?

2014-01-13 Thread Tom Lane
Marti Raudsepp  writes:
> On Mon, Jan 13, 2014 at 5:16 PM, Tom Lane  wrote:
>> What remaining issues are there blocking a 9.3.3 release?

> Well hardly a blocker since this has missed 2 releases already, but
> I'm still hopeful to get many PGXS-based extensions to build again
> without the dreaded "install: will not overwrite just-created ..."
> http://www.postgresql.org/message-id/52406191.6040...@dunslane.net

AFAICT, the state of play on that is that it was reverted out of the
non-devel branches immediately before the October releases (9.3.1
et al), on the grounds that it still seemed to be a moving target, cf
http://www.postgresql.org/message-id/1381193255.25702.4.ca...@vanquo.pezone.net

I've not seen any further work on PGXS since commit eebdea08b, so maybe
that complaint is no longer valid?  On the other hand, dddc91ddd leaves
the impression that all is not yet well in PGXS-land.

> Thankfully I'm mostly using Debian so it's already patched for me.

The Debian guys do seem remarkably willing to apply patches that
upstream doesn't trust yet, don't they.

Anyway, my opinion about this one is it's not a regression and therefore
not a blocker, but I have no objection to fixing it in the back branches
once we have consensus that the fix is stable.

regards, tom lane




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Peter Geoghegan
On Mon, Jan 13, 2014 at 12:49 AM, Heikki Linnakangas
 wrote:
>> In any case, my patch is bound to win decisively for the other
>> extreme, the insert-only case, because the overhead of doing an index
>> scan first is always wasted there with your approach, and the overhead
>> of extended btree leaf page locking has been shown to be quite low.
>
> Quite possibly. Run the benchmark, and we'll see how big a difference we're
> talking about.

I'll come up with something and let you know.

> Another way to optimize it is to keep the b-tree page pinned after doing the
> pre-check. Then you don't need to descend the tree again when doing the
> insert. That would require small indexam API changes, but wouldn't be too
> invasive, I think.

You'll still need a callback to drop the pin when it transpires that
there is a conflict in a later unique index, and state to pass a bt
stack back, at which point you've already made exactly the same
changes to the AM interface as in my proposal. The only difference is
that the core code doesn't rely on the value locks being released
after an instant, but that isn't something that you take advantage of.
Furthermore, AFAIK there is no reason to think that anything other
than btree will benefit, which makes it a bit unfortunate that the AM
has to support it generally.

So, again, it's kind of a modularity violation, and it may not even
actually be possible, since _bt_search() is only callable with an
insertion scankey, which is the context in which the existing
guarantee around releasing locks and re-searching from that point
applies, for reasons that seem to me to be very subtle. At the very
least you need to pass a btstack to _bt_doinsert() to save the work of
re-scanning, as I do.

>> All other concerns of mine still remain, including the concern over
>> the extra locking of the proc array - I'm concerned about the
>> performance impact of that on other parts of the system not exercised
>> by this test.
>
> Yeah, I'm not thrilled about that part either. Fortunately there are other
> ways to implement that. In fact, I think you could just not bother taking
> the ProcArrayLock when setting the fields. The danger is that another
> backend sees a mixed state of the fields, but that's OK. The worst that can
> happen is that it will do an unnecessary lock/release on the heavy-weight
> lock. And to reduce the overhead when reading the fields, you could merge
> the SpeculativeInsertionIsInProgress() check into
> TransactionIdIsInProgress(). The call site in tqual.c always calls it
> together with TransactionIdIsInProgress(), which scans the proc array
> anyway, while holding the lock.

Currently in your patch all insertions do
SpeculativeInsertionLockAcquire(GetCurrentTransactionId()) -
presumably this is not something you intend to keep. Also, you should
not do this for regular insertion:

if (options & HEAP_INSERT_SPECULATIVE)
    SetSpeculativeInsertion(relation->rd_node, &heaptup->t_self);

Can you explain the following, please?:

+ /*
+  * Returns a speculative insertion token for waiting for the insertion to
+  * finish.
+  */
+ uint32
+ SpeculativeInsertionIsInProgress(TransactionId xid, RelFileNode rel,
+                                  ItemPointer tid)
+ {
+   uint32  result = 0;
+   ProcArrayStruct *arrayP = procArray;
+   int index;

Why is this optimization correct? Presently it allows your patch to
avoid getting a shared ProcArrayLock from HeapTupleSatisfiesDirty().

+   if (TransactionIdPrecedes(xid, TransactionXmin))
+   return false;

So from HeapTupleSatisfiesDirty(), you're checking if "xid" (the
passed tuple's xmin) precedes our transaction's xmin (well, that of
our last snapshot updated by GetSnapshotData()). This is set within
GetSnapshotData(), but we're dealing with a dirty snapshot with no
xmin, so TransactionXmin pertains to our MVCC snapshot, not our dirty
snapshot.

It isn't really true that TransactionIdIsInProgress() gets the same
shared ProcArrayLock in a similar fashion, for a full linear search; I
think that the various fast-paths make it far less likely than it is
for SpeculativeInsertionIsInProgress() (or, perhaps, should be). Here
is what that other routine does in around the same place:

    /*
     * Don't bother checking a transaction older than RecentXmin; it could not
     * possibly still be running.  (Note: in particular, this guarantees that
     * we reject InvalidTransactionId, FrozenTransactionId, etc as not
     * running.)
     */
    if (TransactionIdPrecedes(xid, RecentXmin))
    {
        xc_by_recent_xmin_inc();
        return false;
    }

This extant code checks against RecentXmin, *not* TransactionXmin.  It
also caches things quite effectively, but that caching isn't very
useful to you here. It checks latestCompletedXid before doing a linear
search through the proc array too.

-- 
Peter Geoghegan



Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Marko Tiikkaja

On 2014-01-14 02:54, Marti Raudsepp wrote:
> On Sun, Jan 12, 2014 at 7:51 AM, Marko Tiikkaja  wrote:
>> the behaviour of SELECT .. INTO when the query returns more than one row.
>> Some of you might know that no exception is raised in this case
>
> Agreed. But I also agree with the rest of the thread about changing
> current INTO behavior and introducing new GUC variables.
>
> But PL/pgSQL already has an assignment syntax with the behavior you want:

According to the docs, that doesn't set FOUND, which would make this a
pain to deal with..
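
A quick illustration of that FOUND caveat (per the PL/pgSQL docs,
assignments leave FOUND untouched):

DO $$
DECLARE
  foo int;
BEGIN
  SELECT 1 INTO foo WHERE false;
  RAISE NOTICE 'found = %', FOUND;  -- false: SELECT INTO sets FOUND
  foo := (SELECT 1 WHERE false);    -- the assignment does not change FOUND
  RAISE NOTICE 'found = %', FOUND;  -- still false, whatever the assignment did
END
$$;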



Regards,
Marko Tiikkaja





Re: [HACKERS] GIN improvements part 1: additional information

2014-01-13 Thread Etsuro Fujita
Peter Eisentraut wrote:
> On 12/10/13, 2:44 PM, Alexander Korotkov wrote:
> > However, patch didn't apply to head. Corrected version is attached.

> Update the pgstattuple regression test results.

The latest version of the patch still doesn't pass the test.

I'll look at the patch in further detail.

Thanks,

Best regards,
Etsuro Fujita





Re: [HACKERS] Where do we stand on 9.3 bugs?

2014-01-13 Thread Tom Lane
Andres Freund  writes:
> On 2014-01-13 17:30:55 -0800, Josh Berkus wrote:
>> Despite actually dating back to 9.0, something in the 9.3.2/9.2.6
>> updates is causing users to hit it now.  And for those who do hit it,
>> replication is impossible and there's no workaround.

> There have been reports about it for years if you search in the archives
> and the code around it hasn't changed for a fair bit of time, so I think
> that's just coincidence.

I looked at the commit history for 9.2 post-release, and I don't see
anything that looks related, so I also doubt the theory that some recent
change has increased the odds of this happening.  I think it's more that
we now recognize the cause of what previously seemed random breakage.

regards, tom lane




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Josh Berkus
On 01/13/2014 05:48 PM, Andres Freund wrote:
> On 2014-01-13 10:56:00 -0800, Josh Berkus wrote:
>> Well, it was the lack of sysctl options which takes the 2Q change from
>> "annoyance" to "potential disaster".  We can't ever get away from the
>> possibility that the Postgres use-case might be the minority use-case,
>> and we might have to use non-default options.  It's when those options
>> aren't present *at all* that we're stuck.
> 
> Unless I am missing something, the kernel's going further *away* from a
> simple 2Q system, not the contrary.

Well, they implemented a 2Q system and deliberately offered no sysctl
variables to modify its behavior.  Now they're talking about
implementing an ARC system -- which we know the perils of -- again,
without any configuration variables in case the default behavior doesn't
work for everyone.  And it's highly unlikely that an ARC which is
designed for desktop and/or file server users -- let alone mobile users
-- is going to be optimal for PostgreSQL out of the box.

In fact, I'd assert that it's flat-out impossible to engineer an ARC
which will work for multiple different use cases without user-level
configuration.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Marti Raudsepp
On Sun, Jan 12, 2014 at 7:51 AM, Marko Tiikkaja  wrote:
> the behaviour of SELECT .. INTO when the query returns more than one row.
> Some of you might know that no exception is raised in this case

Agreed. But I also agree with the rest of the thread about changing
current INTO behavior and introducing new GUC variables.

But PL/pgSQL already has an assignment syntax with the behavior you want:

DECLARE
  foo int;
BEGIN
  foo = generate_series(1,1); -- this is OK
  foo = generate_series(1,2); -- fails
  foo = 10 WHERE FALSE; -- sets foo to NULL
  -- And you can actually do:
  foo = some_col FROM some_table WHERE bar=10;
END;

So if we extend this syntax to support multiple columns, it should
satisfy the use cases you care about.

  foo, bar = col1, col2 FROM some_table WHERE bar=10;

It's ugly without the explicit SELECT though. Perhaps make the SELECT optional:

  foo, bar = SELECT col1, col2 FROM some_table WHERE bar=10;

I think that's more aesthetically pleasing than INTO and also looks
more familiar to other languages.

Plus, now you can copy-paste the query straight to an SQL shell
without another footgun involving creating new tables in your
database.
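
For completeness, PL/pgSQL's existing STRICT modifier already gives
fail-fast behaviour for INTO (documented behaviour, shown here with
generate_series for self-containment):

DO $$
DECLARE
  foo int;
BEGIN
  SELECT i INTO STRICT foo FROM generate_series(1, 1) i;  -- OK: one row
  SELECT i INTO STRICT foo FROM generate_series(1, 2) i;  -- raises TOO_MANY_ROWS
END
$$;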

Regards,
Marti




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Andres Freund
On 2014-01-13 10:56:00 -0800, Josh Berkus wrote:
> Well, it was the lack of sysctl options which takes the 2Q change from
> "annoyance" to "potential disaster".  We can't ever get away from the
> possibility that the Postgres use-case might be the minority use-case,
> and we might have to use non-default options.  It's when those options
> aren't present *at all* that we're stuck.

Unless I am missing something, the kernel's going further *away* from a
simple 2Q system, not the contrary.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-13 Thread Gavin Flower

On 14/01/14 14:29, Tom Lane wrote:
[...]
> (2) the float and numeric variants should be implemented under
> nondefault names (I'm thinking FAST_SUM(), but bikeshed away). People
> who need extra speed and don't mind the slightly different results can
> alter their queries to use these variants. One reason I'm thinking
> this is that whatever we do to ameliorate the semantic issues is going
> to slow down the forward transition function --- to no benefit unless
> the aggregate is being used as a window function in a moving window.
> So I'm less than convinced that we *should* implement any of these
> designs in the default aggregates, even if we get to the point where
> we arguably *could* do it with little risk of functional differences.
> regards, tom lane

How about SUM_FAST() instead? Then it would be more likely to sort close
to SUM() in an index.



Cheers,
Gavin




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Josh Berkus
On 01/13/2014 05:30 PM, Dave Chinner wrote:
> On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote:
> No matter what default NUMA allocation policy we set, there will be
> an application for which that behaviour is wrong. As such, we've had
> tools for setting application specific NUMA policies for quite a few
> years now. e.g:

Yeah, that's why I personally regard the NUMA stuff as just an
information problem; there's an easy configuration variable, and you
can't please everyone (and our project would hardly be one to point
fingers about sub-optimal default configurations).  I was responding to
a question of "what's wrong with the default setting?"

Personally, I have my doubts that the NUMA memory isolation, as
currently implemented, accomplishes what it wants to do.  But that's a
completely different discussion.

The real issue there was that our users had never heard of this change
until suddenly half their RAM became unavailable.  So the solution is
for our project to somehow have these kinds of changes flagged for our
attention so that we can update our docs.  The kernel change list is
quite voluminous, and it's very easy to miss changes of significance in
it.  The easiest way to do this is going to be getting involved in
kernel-database performance testing.

Of course, we are annoyed that we finally removed the main reason to
modify sysctl.conf (SHMMAX), and here we are needing to advise users
about sysctl again.  :-(

I'm much more bothered by the introduction of 2Q logic, since that comes
without a configuration variable to modify its behavior.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Where do we stand on 9.3 bugs?

2014-01-13 Thread Andres Freund
On 2014-01-13 17:30:55 -0800, Josh Berkus wrote:
> On 01/13/2014 07:16 AM, Tom Lane wrote:
> > What remaining issues are there blocking a 9.3.3 release?  I know that
> > there were unresolved multixact issues when we put out 9.3.2 --- are
> > those all dealt with now?  What else do people see as release-blockers?
> 
> I see this bug as a release-blocker.  It's a fairly crippling
> replication bug for a bunch of users (5 reports to date):

> http://archives.postgresql.org/message-id/20140106134815.GE28320%40alap2.anarazel.de

Absolutely. I didn't list it because it's not 9.3-specific. I think Tom
is working on a fix right now; if not, Heikki has shown interest and I
would also work on it once the CF has started. So I don't think there's
any danger of missing that one.

> Despite actually dating back to 9.0, something in the 9.3.2/9.2.6
> updates is causing users to hit it now.  And for those who do hit it,
> replication is impossible and there's no workaround.

There have been reports about it for years if you search in the archives
and the code around it hasn't changed for a fair bit of time, so I think
that's just coincidence.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Where do we stand on 9.3 bugs?

2014-01-13 Thread Josh Berkus
On 01/13/2014 07:16 AM, Tom Lane wrote:
> What remaining issues are there blocking a 9.3.3 release?  I know that
> there were unresolved multixact issues when we put out 9.3.2 --- are
> those all dealt with now?  What else do people see as release-blockers?

I see this bug as a release-blocker.  It's a fairly crippling
replication bug for a bunch of users (5 reports to date):

http://archives.postgresql.org/message-id/20140106134815.GE28320%40alap2.anarazel.de

Despite actually dating back to 9.0, something in the 9.3.2/9.2.6
updates is causing users to hit it now.  And for those who do hit it,
replication is impossible and there's no workaround.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-13 Thread Tom Lane
Florian Pflug  writes:
> I think it'd be worthwile to get this into 9.4, if that's still an option,
> even if we only support COUNT.

My thought is

(1) we can certainly implement inverse transitions for COUNT() and the
integer variants of SUM(), and that alone would be a killer feature for
some people.

(2) the float and numeric variants should be implemented under nondefault
names (I'm thinking FAST_SUM(), but bikeshed away).  People who need
extra speed and don't mind the slightly different results can alter
their queries to use these variants.

One reason I'm thinking this is that whatever we do to ameliorate
the semantic issues is going to slow down the forward transition
function --- to no benefit unless the aggregate is being used as a
window function in a moving window.  So I'm less than convinced
that we *should* implement any of these designs in the default
aggregates, even if we get to the point where we arguably *could*
do it with little risk of functional differences.
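
For context, the moving-window case in question looks like this (table and
column names hypothetical); without an inverse transition function the
whole 100-row frame is re-aggregated for every output row, while with one
each frame slide just subtracts the row that dropped out:

SELECT sum(amount) OVER (ORDER BY ts
                         ROWS BETWEEN 99 PRECEDING AND CURRENT ROW)
  FROM payments;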

regards, tom lane




Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Josh Berkus
On 01/13/2014 05:10 PM, Jim Nasby wrote:
> On 1/13/14, 7:06 PM, Josh Berkus wrote:
>> Regularly?  No.  But I've seen it, especially as part of a "does this
>> query return any rows?" test.  That's not the best way to test that, but
>> that doesn't stop a lot of people doing it.
> 
> Right, and I certainly don't want to force anyone to rewrite all their
> code. But I'd certainly like a safer default so people don't mistakenly
> go the "multiple rows is OK" route without doing so very intentionally.

The problem is that if you change the default, you're creating an
unexpected barrier to upgrading.  I just don't think that it's worth
doing so in order to meet some standard of code neatness, especially in
plpgsql, the unwanted bastard child of SQL and Ada.

For people who want to enable this in order to prevent stupid query bugs
from creeping into their plpgsql, that's great; let's have an easy
option to turn on.  But it's hard enough to get people to upgrade as it
is.  If we're going to add an upgrade landmine, it better be for
something really important.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Andres Freund
On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
> a file into a user provided buffer, thus obtaining a page cache entry
> and a copy in their userspace buffer, then insert the page of the user
> buffer back into the page cache as the page cache page ... that's right,
> isn't it postgress people?

Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
isn't needed anymore when reading. And we'd normally write if the page
is dirty.

> Effectively you end up with buffered read/write that's also mapped into
> the page cache.  It's a pretty awful way to hack around mmap.

Well, the problem is that you can't really use mmap() for the things we
do. Postgres' durability works by guaranteeing that our journal entries
(called WAL := Write-Ahead Log) are written & synced to disk before the
corresponding entries of tables and indexes reach the disk. That also
allows grouping many random writes into a few contiguous writes that are
fdatasync()ed at once. Only during a checkpoint phase is the big bulk of
the data then (slowly, in the background) synced to disk.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Jim Nasby

On 1/13/14, 7:10 PM, Kevin Grittner wrote:
> Tom Lane  wrote:
>> I think the argument really is that some people don't want to
>> make their application code work with such cases (which is fine)
>> so they'd like an inside-the-database guarantee that the app code
>> won't ever see such cases.  Which is less fine, ISTM: if you fear
>> some part of your app might be generating such arrays, then you
>> don't have such little use for the feature after all, eh?
>
> While I don't see the argument for dropping support for lower
> bounds other than 1 as long as standard syntax for creating arrays
> yields arrays starting at 1, this argument holds no water at all.  I
> have done too many conversions from databases which did not enforce
> data integrity rules not to have a lot of sympathy for people
> wanting the ability to easily constrain data to whatever their
> business rules say it should look like.  The fact that some day some
> new programmer might not be aware of all business rules, or might
> choose to try to ignore them, is the reason you add constraints to
> columns and domains.
>
> I think providing functions that make it easy to enforce this rule
> makes a lot of sense, as it would make a lot of application code
> simpler, easier to read, and less fragile.  I support overloading
> functions with simpler parameter lists for standard-conforming
> arrays; I don't support dropping support for whoever might be using
> non-conforming arrays.  A variable lower bound is an extension to
> the standard, and extensions are fine.

To be clear: I am completely fine with continuing to support custom lower
bounds (to quote my original email: "Obviously we can't just drop support,
but what about an initdb (or hell, even configure) option to disallow
arrays with a lower bound <> 1?").

It might also be possible to protect users on the access side instead:
provide alternate functions that protect you if you pass them an array
that does have a custom lower bound. But I'm not so sure that'll work
out, since one of the biggest offenders here is:

FOR i IN 1..array_upper(some_array, 1) LOOP

Yes, today you can use FOREACH (as in the sketch below), but what about
all your legacy code that's ass-u-ming that the lower bound is 1?
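
A minimal sketch of that bound-agnostic form, using an array literal with
an explicit non-default lower bound:

DO $$
DECLARE
  elem int;
  arr  int[] := '[5:7]={10,20,30}';  -- lower bound deliberately not 1
BEGIN
  -- FOREACH visits every element no matter where the bounds start,
  -- unlike the FOR 1..array_upper() pattern above.
  FOREACH elem IN ARRAY arr LOOP
    RAISE NOTICE '%', elem;
  END LOOP;
END
$$;
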
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Kevin Grittner
Tom Lane  wrote:

> I think the argument really is that some people don't want to
> make their application code work with such cases (which is fine)
> so they'd like an inside-the-database guarantee that the app code
> won't ever see such cases.  Which is less fine, ISTM: if you fear
> some part of your app might be generating such arrays, then you
> don't have such little use for the feature after all, eh?

While I don't see the argument for dropping support for lower
bounds other than 1 as long as standard syntax for creating arrays
yields arrays starting at 1, this argument holds no water at all.  I
have done too many conversions from databases which did not enforce
data integrity rules not to have a lot of sympathy for people
wanting the ability to easily constrain data to whatever their
business rules say it should look like.  The fact that some day
some new programmer might not be aware of all business rules,
or might choose to try to ignore them, is the reason you add
constraints to columns and domains.

I think providing functions that make it easy to enforce this rule
makes a lot of sense, as it would make a lot of application code
simpler, easier to read, and less fragile.  I support overloading
functions with simpler parameter lists for standard-conforming
arrays; I don't support dropping support for whoever might be using
non-conforming arrays.  A variable lower bound is an extension to
the standard, and extensions are fine.
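
Such a constraint is already expressible today; a minimal sketch (first
dimension only, domain name hypothetical):

-- Rejects arrays whose first dimension does not start at 1; the coalesce
-- lets empty arrays (where array_lower is NULL) pass.
CREATE DOMAIN conforming_ints AS int[]
  CHECK (VALUE IS NULL OR coalesce(array_lower(VALUE, 1), 1) = 1);

SELECT '{1,2}'::conforming_ints;        -- OK
SELECT '[5:6]={1,2}'::conforming_ints;  -- fails the CHECK constraint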

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Jim Nasby

On 1/13/14, 7:06 PM, Josh Berkus wrote:
> On 01/13/2014 04:20 PM, Jim Nasby wrote:
>> On 1/13/14, 5:57 PM, Josh Berkus wrote:
>>> I *really* don't want to go through all my old code to find places where
>>> I used SELECT ... INTO just to pop off the first row, and ignored the
>>> rest.  I doubt anyone else does, either.
>>
>> Do you regularly have use cases where you actually want just one RANDOM
>> row? I suspect the far more likely scenario is that people write code
>> assuming they'll get only one row and they'll end up with extremely hard
>> to trace bugs if that assumption is ever wrong.
>
> Regularly?  No.  But I've seen it, especially as part of a "does this
> query return any rows?" test.  That's not the best way to test that, but
> that doesn't stop a lot of people doing it.

Right, and I certainly don't want to force anyone to rewrite all their
code. But I'd certainly like a safer default so people don't mistakenly
go the "multiple rows is OK" route without doing so very intentionally.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread James Bottomley
On Mon, 2014-01-13 at 21:29 +, Greg Stark wrote:
> On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund  wrote:
> > For one, postgres doesn't use mmap for files (and can't without major
> > new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > horrible consequences for performance/scalability - very quickly you
> > contend on locks in the kernel.
> 
> 
> I may as well dump this in this thread. We've discussed this in person
> a few times, including at least once with Ted T'so when he visited
> Dublin last year.
> 
> The fundamental conflict is that the kernel understands better the
> hardware and other software using the same resources, Postgres
> understands better its own access patterns. We need to either add
> interfaces so Postgres can teach the kernel what it needs about its
> access patterns or add interfaces so Postgres can find out what it
> needs to know about the hardware context.
> 
> The more ambitious and interesting direction is to let Postgres tell
> the kernel what it needs to know to manage everything. To do that we
> would need the ability to control when pages are flushed out. This is
> absolutely necessary to maintain consistency. Postgres would need to
> be able to mark pages as unflushable until some point in time in the
> future when the journal is flushed. We discussed various ways that
> interface could work but it would be tricky to keep it low enough
> overhead to be workable.

So in this case, the question would be what additional information do we
need to exchange that's not covered by the existing interfaces.  Between
madvise and splice, we seem to have most of what you want; what's
missing?

> The less exciting, more conservative option would be to add kernel
> interfaces to teach Postgres about things like raid geometries. Then
> Postgres could use directio and decide to do prefetching based on the
> raid geometry, how much available i/o bandwidth and iops is available,
> etc.
> 
> Reimplementing i/o schedulers and all the rest of the work that the
> kernel provides inside Postgres just seems like something outside our
> competency and that none of us is really excited about doing.

This would also be a well trodden path ... I believe that some large
database company introduced Direct IO for roughly this purpose.

James






Re: [HACKERS] Where do we stand on 9.3 bugs?

2014-01-13 Thread Marti Raudsepp
On Mon, Jan 13, 2014 at 5:16 PM, Tom Lane  wrote:
> What remaining issues are there blocking a 9.3.3 release?

Well hardly a blocker since this has missed 2 releases already, but
I'm still hopeful to get many PGXS-based extensions to build again
without the dreaded "install: will not overwrite just-created ..."

http://www.postgresql.org/message-id/52406191.6040...@dunslane.net

Thankfully I'm mostly using Debian so it's already patched for me.

Regards,
Marti




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Trond Myklebust

On Jan 13, 2014, at 19:03, Hannu Krosing  wrote:

> On 01/13/2014 09:53 PM, Trond Myklebust wrote:
>> On Jan 13, 2014, at 15:40, Andres Freund  wrote:
>> 
>>> On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
 On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner  wrote:
> I notice, Josh, that you didn't mention the problems many people
> have run into with Transparent Huge Page defrag and with NUMA
> access.
 Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
 setting zone_reclaim_mode; is there some other problem besides that?
>>> I think that fixes some of the worst instances, but I've seen machines
>>> spending horrible amounts of CPU (& BUS) time in page reclaim
>>> nonetheless. If I analyzed it correctly it's in RAM << working set
>>> workloads where RAM is pretty large and most of it is used as page
>>> cache. The kernel ends up spending a huge percentage of time finding and
>>> potentially defragmenting pages when looking for victim buffers.
>>> 
 On a related note, there's also the problem of double-buffering.  When
 we read a page into shared_buffers, we leave a copy behind in the OS
 buffers, and similarly on write-out.  It's very unclear what to do
 about this, since the kernel and PostgreSQL don't have intimate
 knowledge of what each other are doing, but it would be nice to solve
 somehow.
>>> I've wondered before if there wouldn't be a chance for postgres to say
>>> "my dear OS, that the file range 0-8192 of file x contains y, no need to
>>> reread" and do that when we evict a page from s_b but I never dared to
>>> actually propose that to kernel people...
>> O_DIRECT was specifically designed to solve the problem of double buffering 
>> between applications and the kernel. Why are you not able to use that in 
>> these situations?
> What is asked is the opposite of O_DIRECT - a write from a buffer inside
> postgresql to the linux *buffercache*, telling linux that it is the same as
> what is currently on disk, so don't bother to ever write it back.

I don’t understand. Are we talking about mmap()ed files here? Why would the 
kernel be trying to write back pages that aren’t dirty?

> This would avoid current double-buffering between postgresql and linux
> buffer caches while still making use of linux cache when possible.
> 
> The use case is pages that postgresql has moved into its buffer cache
> but which it has not modified. They will at some point be evicted from the
> postgresql cache, but it is likely that they will still be needed sometime
> soon, so what is required is "writing them back" to the original file, only
> they should not really be written - or marked dirty to be written later - at
> any level deeper than the linux cache, as they *already* are on the disk.
> 
> It is probably ok to put them in the LRU position as they are "written"
> out from postgresql, though it may be better if we get some more control over
> where in the LRU order they would be placed. It may make sense to put them
> there based on when they were last read while residing inside the postgresql
> cache.
> 
> Cheers
> 
> 
> -- 
> Hannu Krosing
> PostgreSQL Consultant
> Performance, Scalability and High Availability
> 2ndQuadrant Nordic OÜ





Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Marko Tiikkaja

On 1/14/14, 1:57 AM, Tom Lane wrote:

Whatever your opinion of the default behavior, the
fact that it's been that way for upwards of fifteen years without any
mass protests should give you pause about changing it.


For what it's worth, my patch does not change the default behaviour.  I 
don't think I've ever publicly said that it should.



Regards,
Marko Tiikkaja




Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Jim Nasby

On 1/13/14, 6:36 PM, Florian Pflug wrote:

On Jan14, 2014, at 01:20 , Jim Nasby  wrote:

>On 1/13/14, 5:57 PM, Josh Berkus wrote:

>>On 01/13/2014 03:41 PM, Florian Pflug wrote:

>>>It therefore isn't an oversight that SELECT ... INTO allows multiple result rows
>>>but INSERT/UPDATE/DELETE forbids them, it's been done that way on purpose and
>>>for a reason. We shouldn't be second-guessing ourselves by changing that later -
>>>not, at least, unless we have a *very* good reason for it. Which, AFAICS, we don't.
>>>
>>>(And yeah, personally I'd prefer if we'd complain about multiple rows. But it's
>>>IMHO just too late for that)

>>
>>I *really* don't want to go through all my old code to find places where
>>I used SELECT ... INTO just to pop off the first row, and ignored the
>>rest.  I doubt anyone else does, either.

>
>Do you regularly have use cases where you actually want just one RANDOM row?
>I suspect the far more likely scenario is that people write code assuming they'll
>get only one row and they'll end up with extremely hard to trace bugs if that
>assumption is ever wrong.

One case that immediately comes to mind is a JOIN which sometimes returns
multiple rows, and a projection clause that only uses one of the tables
involved in the join.


Which is just bad coding IMHO.


Another are queries including an ORDER BY - I don't think the patch makes an
exception for those, and even if it did, it probably wouldn't catch all
cases, like e.g. an SRF which returns the rows in a deterministic order.


Sure, but if you've gone to the trouble of putting the ORDER BY in, how hard is 
it to add LIMIT 1?


Or maybe you're picking a row to process next, and don't really care about
the order in which you work through them.


Again, how hard is LIMIT 1?

BTW, my issue here isn't with typing out "STRICT". I'd be fine if the way
things worked was you had to specify INTO STRICT or INTO LOOSE (or whatever you want to 
call the opposite of strict).

My problem is that the default here is plain and simple a bad choice for 
default behavior. In light of Tom's research showing this was added in 8.2 I 
understand why we went that route, but I'd really like a way to change the 
default. Or even disallow it (IE: force the user to state which route they're 
going here).
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Josh Berkus
On 01/13/2014 04:20 PM, Jim Nasby wrote:
> On 1/13/14, 5:57 PM, Josh Berkus wrote:
>> I *really* don't want to go through all my old code to find places where
>> I used SELECT ... INTO just to pop off the first row, and ignored the
>> rest.  I doubt anyone else does, either.
> 
> Do you regularly have use cases where you actually want just one RANDOM
> row? I suspect the far more likely scenario is that people write code
> assuming they'll get only one row and they'll end up with extremely hard
> to trace bugs if that assumption is ever wrong.

Regularly?  No.  But I've seen it, especially as part of a "does this
query return any rows?" test.  That's not the best way to test that, but
that doesn't stop a lot of people doing it.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Jan Kara
On Mon 13-01-14 22:36:06, Mel Gorman wrote:
> On Mon, Jan 13, 2014 at 06:27:03PM -0200, Claudio Freire wrote:
> > On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby  wrote:
> > > On 1/13/14, 2:19 PM, Claudio Freire wrote:
> > >>
> > >> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas 
> > >> wrote:
> > >>>
> > >>> On a related note, there's also the problem of double-buffering.  When
> > >>> we read a page into shared_buffers, we leave a copy behind in the OS
> > >>> buffers, and similarly on write-out.  It's very unclear what to do
> > >>> about this, since the kernel and PostgreSQL don't have intimate
> > >>> knowledge of what each other are doing, but it would be nice to solve
> > >>> somehow.
> > >>
> > >>
> > >>
> > >> There you have a much harder algorithmic problem.
> > >>
> > >> You can basically control duplication with fadvise and DONTNEED. The
> > >> problem here is not the kernel and whether or not it allows postgres
> > >> to be smart about it. The problem is... what kind of smarts
> > >> (algorithm) to use.
> > >
> > >
> > > Isn't this a fairly simple matter of when we read a page into shared 
> > > buffers
> > > tell the kernel do forget that page? And a corollary to that for when we
> > > dump a page out of shared_buffers (here kernel, please put this back into
> > > your cache).
> > 
> > 
> > That's my point. In terms of kernel-postgres interaction, it's fairly 
> > simple.
> > 
> > What's not so simple, is figuring out what policy to use. Remember,
> > you cannot tell the kernel to put some page in its page cache without
> > reading it or writing it. So, once you make the kernel forget a page,
> > evicting it from shared buffers becomes quite expensive.
> 
> posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
> forcing readahead. If you evict it prematurely then you do get kinda
> screwed because you pay the IO cost to read it back in again even if you
> had enough memory to cache it. Maybe this is the type of kernel-postgres
> interaction that is annoying you.
> 
> If you don't evict, the kernel eventually steps in and evicts the wrong
> thing. If you do evict and it was unnecessarily you pay an IO cost.
> 
> That could be something we look at. There are cases buried deep in the
> VM where pages get shuffled to the end of the LRU and get tagged for
> reclaim as soon as possible. Maybe you need access to something like
> that via posix_fadvise to say "reclaim this page if you need memory but
> leave it resident if there is no memory pressure" or something similar.
> Not exactly sure what that interface would look like or offhand how it
> could be reliably implemented.
  Well, the kernel managing a user-space cache, which is what the postgres
guys are talking about, looks pretty much like what the "volatile range"
patches are trying to achieve.

Note to postgres guys: I think you should have a look at the proposed
'vrange' system call. The latest posting is here:
http://www.spinics.net/lists/linux-mm/msg67328.html. It contains a rather
detailed description of the feature. And if the feature looks good to you,
you can add your 'me too'. Plus, if anyone would be willing to try that out
with postgres, that would be most welcome (although I understand you might
not want to burn your time on an experimental kernel feature).

Honza
-- 
Jan Kara 
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Theodore Ts'o
The issue with O_DIRECT is actually a much more general issue ---
namely, database programmers that for various reasons decide they
don't want to go down the O_DIRECT route, but then care about
performance.  PostgreSQL is not the only database which is had this
issue.

There are two papers at this year's FAST conference about the "Journaling
of Journal" (JoJ) problem, which has been triggered by the use of SQLite on
Android handsets and its write patterns, some of which some folks
(including myself) have characterized as "abusive".  (As in, when the
database developer says to the kernel developer, "Doctor, doctor, it
hurts when I do that...")

The problem statement for JoJ was introduced at last year's Usenix ATC
conference, in "I/O Stack Optimization for Smartphones"[1]

[1] 
https://www.usenix.org/conference/atc13/technical-sessions/presentation/jeong

The high-order bit is: what's the right thing to do when database
programmers come to kernel engineers saying "we want to do <X> and
the performance sucks"?  Do we say, "Use O_DIRECT, dummy", not
withstanding Linus's past comments on the issue?  Or do we have some
general design principles that we tell database engineers that they
should do for better performance, and then all developers for all of
the file systems can then try to optimize for a set of new API's, or
recommended ways of using the existing API's?

Surely the wrong answer is that we do things which encourage people to
create entire new specialized file systems for different databases.
The f2fs file system was essentially created because someone thought
it was easier to create a new file system from scratch instead of trying
to change how SQLite or some other existing file system works.
Hopefully we won't have companies using MySQL and PostgreSQL deciding
they need their own mysqlfs and postgresqlfs!  :-)

Cheers,

- Ted




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Trond Myklebust

On Jan 13, 2014, at 16:03, Robert Haas  wrote:

> On Mon, Jan 13, 2014 at 3:53 PM, Trond Myklebust  wrote:
>> O_DIRECT was specifically designed to solve the problem of double buffering 
>> between applications and the kernel. Why are you not able to use that in 
>> these situations?
> 
> O_DIRECT was apparently designed by a deranged monkey on some serious
> mind-controlling substances.  But don't take it from me, I have it on
> good authority:
> 
> http://yarchive.net/comp/linux/o_direct.html
> 
> One might even say the best authority.

You do realise that is 12-year-old information, right? …and yes, we have added
both aio and vectored operations to O_DIRECT in the meantime.

Meanwhile, no progress has been made on the “non-deranged” interface that 
authority was advocating.

Cheers,
  Trond



Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Jan Kara
On Mon 13-01-14 22:26:45, Mel Gorman wrote:
> The flipside is also meant to hold true. If you know data will be needed
> in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at
> the implementation it does a forced read-ahead on the range of pages of
> interest. It doesn't look like it would block.
  That's not quite true. POSIX_FADV_WILLNEED still needs to map logical
file offsets to physical disk blocks and create IO requests. This happens
synchronously. So if your disk is congested and relevant metadata is out of
cache, or we simply run out of free IO requests, POSIX_FADV_WILLNEED can
block for a significant amount of time.
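
For illustration, a minimal standalone sketch of the two hints being discussed
here - not PostgreSQL source; the file name and range are invented. Note that
posix_fadvise() reports failure through its return value rather than errno:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

static void
advise(int fd, off_t offset, off_t len, int advice, const char *what)
{
    int rc = posix_fadvise(fd, offset, len, advice);

    if (rc != 0)
        fprintf(stderr, "posix_fadvise(%s): %s\n", what, strerror(rc));
}

int
main(void)
{
    int fd = open("relation.dat", O_RDONLY);

    if (fd < 0)
        return 1;

    /* We will need the next 128 8kb blocks soon: start readahead. As
     * described above, mapping the blocks and queueing the IO happens
     * synchronously, so this call itself can stall on a busy disk. */
    advise(fd, 0, 128 * 8192, POSIX_FADV_WILLNEED, "WILLNEED");

    /* ... read and process the range with ordinary read()/pread() ... */

    /* We are done with the range: the kernel may drop these pages. */
    advise(fd, 0, 128 * 8192, POSIX_FADV_DONTNEED, "DONTNEED");
    return 0;
}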

Honza
-- 
Jan Kara 
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread James Bottomley
On Mon, 2014-01-13 at 22:12 +0100, Andres Freund wrote:
> On 2014-01-13 12:34:35 -0800, James Bottomley wrote:
> > On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> > > Well, if we were to collaborate with the kernel community on this then
> > > presumably we can do better than that for eviction... even to the
> > > extent of "here's some data from this range in this file. It's (clean|
> > > dirty). Put it in your cache. Just trust me on this."
> > 
> > This should be the madvise() interface (with MADV_WILLNEED and
> > MADV_DONTNEED) is there something in that interface that is
> > insufficient?
> 
> For one, postgres doesn't use mmap for files (and can't without major
> new interfaces).

I understand, that's why you get double buffering: because we can't
replace a page in the range you give us on read/write.  However, you
don't have to switch entirely to mmap: you can use mmap/madvise
exclusively for cache control and still use read/write (and still pay
the double buffer penalty, of course).  It's only read/write with
directio that would cause problems here (unless you're planning to
switch to DIO?).
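
To make that concrete, a rough sketch under stated assumptions: one long-lived
PROT_READ mapping per 1GB segment (so the per-8kb mmap/munmap problem below
never arises, and assuming the file is at least that long), used purely to aim
madvise() hints at the page cache while the real IO stays on read()/write().
The file name and segment size are invented:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEGMENT_SIZE (1024L * 1024 * 1024)
#define BLCKSZ 8192

int
main(void)
{
    char buf[BLCKSZ];
    off_t blkno = 42;
    int fd = open("relation.dat", O_RDWR);

    if (fd < 0)
        return 1;

    /* Long-lived mapping, created once per segment, never touched for
     * data access - it only gives madvise() something to point at. */
    char *map = mmap(NULL, SEGMENT_SIZE, PROT_READ, MAP_SHARED, fd, 0);

    if (map == MAP_FAILED)
        return 1;

    /* Cache hint only: ask the kernel to read this block ahead. */
    madvise(map + blkno * BLCKSZ, BLCKSZ, MADV_WILLNEED);

    /* The actual read still uses the ordinary syscall path. */
    if (pread(fd, buf, BLCKSZ, blkno * BLCKSZ) != BLCKSZ)
        perror("pread");

    /* When a clean page is evicted from shared_buffers: tell the
     * kernel it may drop its copy of that range. */
    madvise(map + blkno * BLCKSZ, BLCKSZ, MADV_DONTNEED);

    munmap(map, SEGMENT_SIZE);
    close(fd);
    return 0;
}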

>  Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> horrible consequences for performance/scalability - very quickly you
> contend on locks in the kernel.

Is this because of problems in the mmap_sem?

> Also, that will mark that page dirty, which isn't what we want in this
> case.

You mean madvise (page_addr)?  It shouldn't ... the state of the dirty
bit should only be updated by actual writes.  Which MADV_ primitive is
causing the dirty marking, because we might be able to fix it (unless
there's some weird corner case I don't know about).

>  One major use case is transplanting a page coming from postgres'
> buffers into the kernel's buffercache because the latter has a much
> better chance of properly allocating system resources across independent
> applications running.

If you want to share pages between the application and the page cache,
the only known interface is mmap ... perhaps we can discuss how better
to improve mmap for you?

We also do have a way of transplanting pages: it's called splice.  How
do the semantics of splice differ from what you need?
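
For reference, a small sketch of splice() semantics: one end of every splice()
must be a pipe, so a file-to-file move goes file -> pipe -> file, but the data
never crosses into userspace. File names are invented and error handling is
trimmed:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    int pfd[2];
    loff_t off_in = 0, off_out = 0;
    ssize_t n;
    int in = open("src.dat", O_RDONLY);
    int out = open("dst.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (in < 0 || out < 0 || pipe(pfd) != 0)
        return 1;

    /* Move pages from the source file into the pipe, then on into the
     * destination file, entirely in-kernel. */
    while ((n = splice(in, &off_in, pfd[1], NULL, 8192, SPLICE_F_MOVE)) > 0)
        splice(pfd[0], NULL, out, &off_out, (size_t) n, SPLICE_F_MOVE);

    return 0;
}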

> Oh, and the kernel's page-cache management while far from perfect,
> actually scales much better than postgres'.

Well, then, it sounds like the best way forward would be to get
postgres to use the kernel page cache more efficiently.

James






Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-13 Thread Florian Pflug
On Jan10, 2014, at 22:27 , David Rowley  wrote:
> As the patch stands at the moment, I currently have a regression test
> which currently fails due to these extra zeros after the decimal point:
> 
> -- This test currently fails due to extra trailing 0 digits.
> SELECT SUM(n::numeric) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND 
> UNBOUNDED FOLLOWING)
>   FROM (VALUES(1,1.01),(2,2),(3,3)) v(i,n);
> 
> Patched produces:
>  6.01
>  5.00
>  3.00
> Unpatched produces:
>  6.01
>  5
>  3

Hm, that's annoying. I checked the standard to see if it offers any guidance
here, but it doesn't look like it does - the standard just says that SUM() ought
to return a type with the same scale as the input type has. That's fine if
every NUMERIC type has a fixed scale, but that's not the case in postgres -
we allow values of unconstrained NUMERIC types (i.e., numeric types without an
explicitly specified precision or scale) to each define their own scale.

To fix this, we'd have to track the maximum scale within the current frame.
That's easier than the general problem of providing an inverse transition
function for MAX, because AFAIK we limit the scale to at most 1000. So it'd
be sufficient to track the number of times we saw each scale, and also the
current maximum. Once we reduce the current maximum's count back to zero
in the inverse transition function, we'd scan from that value to the left to
find the next non-zero entry.
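
A minimal sketch of that bookkeeping, assuming the 1000 cap on dscale (an
illustration only, not code from the patch):

#include <assert.h>

#define MAX_SCALE 1000

typedef struct
{
    /* occurrences of each dscale in the current frame, initially zero */
    int scale_count[MAX_SCALE + 1];
    /* current maximum dscale, -1 while the frame is empty */
    int max_scale;
} ScaleTracker;

/* forward transition: a value with this dscale enters the frame */
static void
scale_add(ScaleTracker *t, int scale)
{
    t->scale_count[scale]++;
    if (scale > t->max_scale)
        t->max_scale = scale;
}

/* inverse transition: a value with this dscale leaves the frame */
static void
scale_remove(ScaleTracker *t, int scale)
{
    assert(t->scale_count[scale] > 0);
    t->scale_count[scale]--;

    /* if the maximum's count just hit zero, scan left for the next
     * non-zero entry */
    if (scale == t->max_scale && t->scale_count[scale] == 0)
        while (t->max_scale >= 0 && t->scale_count[t->max_scale] == 0)
            t->max_scale--;
}

The result would then be emitted with max_scale digits after the decimal
point, matching what a non-inverse computation over the same frame produces.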

We could also choose to ignore this, although I'm not sure I really like that.
It seems entirely OK at first sight - after all, the values all stay the same,
we just emit a different number of trailing zeros. But it still causes results
to be affected by values, even if only in the number of trailing zeros, which
lie outside the value's range. That seems like more non-determinism than as
database should show.

> With inverse transitions this query still produces correct results, it just
> does not produce the numeric in the same format as it does without performing
> inverse transitions. Personally I'd rather focus on trying to get SUM(numeric)
> in there for 9.4

I think it'd be worthwhile to get this into 9.4, if that's still an option,
even if we only support COUNT.

best regards,
Florian Pflug





Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Tom Lane
Florian Pflug  writes:
> On Jan14, 2014, at 01:20 , Jim Nasby  wrote:
>> And if we've always had it, why on earth didn't we make STRICT the default
>> behavior?

> Dunno, but AFAIK pl/pgsql mimics Oracle's PL/SQL, at least in some aspects,
> so maybe this is one of the areas where we just do what oracle does.

STRICT is a relatively recent addition (8.2, looks like).  One of the
reasons it got in was that the author didn't have any grandiose ideas
about making it the new default.  I think this patch would have a lot
better chance if it was just adding another keyword as an alternative
to STRICT, and not hoping to impose the author's opinions on the rest
of the world.  Whatever your opinion of the default behavior, the
fact that it's been that way for upwards of fifteen years without any
mass protests should give you pause about changing it.

regards, tom lane




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Jim Nasby

On 1/13/14, 4:47 PM, Jan Kara wrote:

Note to postgres guys: I think you should have a look at the proposed
'vrange' system call. The latest posting is here:
http://www.spinics.net/lists/linux-mm/msg67328.html. It contains a rather
detailed description of the feature. And if the feature looks good to you,
you can add your 'me too'. Plus, if anyone would be willing to try that out
with postgres, that would be most welcome (although I understand you might
not want to burn your time on an experimental kernel feature).


I don't think that would help us with buffers unless we switched to MMAP (which 
is a huge change), but this part is interesting:

"* Opportunistic freeing of memory that may be quickly reused. Minchan
has done a malloc implementation where free() marks the pages as
volatile, allowing the kernel to reclaim under pressure. This avoids the
unmapping and remapping of anonymous pages on free/malloc."

Postgres has its own memory management on top of malloc that gives us memory
contexts; some of those contexts get destroyed frequently. Allowing the kernel 
to reclaim that free'd memory in the background might be a performance win for 
us.
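
For what it's worth, a sketch of how that could look from the allocator's
side. The MADV_FREE flag used below did not exist when this was written - it
grew out of the volatile-range work and only landed in Linux 4.5 - and the
arena_release() helper is invented:

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8             /* Linux value; missing from old headers */
#endif

/* Called when the allocator retires a large free block: the kernel may
 * reclaim the pages under memory pressure, but the mapping stays, so the
 * block can be reused without an munmap()/mmap() round trip. */
static void
arena_release(void *start, size_t len)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    uintptr_t lo = ((uintptr_t) start + pagesz - 1) & ~(uintptr_t) (pagesz - 1);
    uintptr_t hi = ((uintptr_t) start + len) & ~(uintptr_t) (pagesz - 1);

    /* only whole pages inside the block can be handed back */
    if (hi > lo)
        madvise((void *) lo, hi - lo, MADV_FREE);
}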
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Jim Nasby

On 1/13/14, 3:04 PM, Jeff Janes wrote:


I think the above is pretty simple for both interaction (allow us to inject a 
clean page into the file page cache) and policy (forget it after you hand it to 
us, then remember it again when we hand it back to you clean).  And I think it 
would pretty likely be an improvement over what we currently do.  But I think 
it is probably the wrong way to get the improvement.  I think the real problem 
is that we don't trust ourselves to manage more of the memory ourselves.

As far as I know, we still don't have a publicly disclosable and readily 
reproducible test case for the reports of performance degradation when we have 
more than 8GB in shared_buffers. If we had one of those, we could likely reduce 
the double buffering problem by fixing our own scalability issues and therefore 
taking responsibility for more of the data ourselves.


While I agree we need to fix the 8GB limit, we're always going to have a 
problem here unless we add A LOT of new capabilities to our memory management.
Like, for example, stealing memory from shared buffers to support a sort. Or 
implementing a system-wide limit on WORK_MEM. Or both.

I would much rather teach the OS and Postgres to work together on memory 
management than for us to try and re-implement everything the OS has already 
done for us.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Jim Nasby

On 1/13/14, 4:44 PM, Andres Freund wrote:

> >  One major use case is transplanting a page coming from postgres'
> >buffers into the kernel's buffercache because the latter has a much
> >better chance of properly allocating system resources across independent
> >applications running.

>
>If you want to share pages between the application and the page cache,
>the only known interface is mmap ... perhaps we can discuss how better
>to improve mmap for you?

I think purely using mmap() is pretty unlikely to work out - there's
just too many constraints about when a page is allowed to be written out
(e.g. it's interlocked with postgres' write ahead log). I also think
that for many practical purposes using mmap() would result in an absurd
number of mappings or mapping way too huge areas; e.g. large btree
indexes are usually accessed in a quite fragmented manner.


Which brings up another interesting area^Wcan-of-worms: the database is implementing 
journaling on top of a filesystem that's probably also journaling. And it's going to get 
worse: a Seagate researcher said at RICON East last year that the next generation (or
maybe the one after that) of spinning rust will use "shingling", which means 
that the drive can't write randomly. So now the drive will ALSO have to journal. And of 
course SSDs already do this.

So now there's *three* pieces of software all doing the exact same thing, none 
of which are able to coordinate with each other.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Florian Pflug
(Responding to both of your mails here)

On Jan14, 2014, at 01:20 , Jim Nasby  wrote:
> On 1/13/14, 5:57 PM, Josh Berkus wrote:
>> On 01/13/2014 03:41 PM, Florian Pflug wrote:
>>> It therefore isn't an oversight that SELECT ... INTO allows multiple result
>>> rows
>>> but INSERT/UPDATE/DELETE forbids them, it's been done that way on purpose 
>>> and
>>> for a reason. We shouldn't be second-guessing ourselves by changing that 
>>> later -
>>> not, at least, unless we have a *very* good reason for it. Which, AFAICS, 
>>> we don't.
>>> 
>>> (And yeah, personally I'd prefer if we'd complain about multiple rows. But 
>>> it's
>>> IMHO just too late for that)
>> 
>> I *really* don't want to go through all my old code to find places where
>> I used SELECT ... INTO just to pop off the first row, and ignored the
>> rest.  I doubt anyone else does, either.
> 
> Do you regularly have use cases where you actually want just one RANDOM row?
> I suspect the far more likely scenario is that people write code assuming 
> they'll
> get only one row and they'll end up with extremely hard to trace bugs if that
> assumption is ever wrong.

One case that immediately comes to mind is a JOIN which sometimes returns
multiple rows, and a projection clause that only uses one of the tables
involved in the join. 

Another are queries including an ORDER BY - I don't think the patch makes an
exception for those, and even if it did, it probably wouldn't catch all
cases, like e.g. an SRF which returns the rows in a deterministic order.

Or maybe you're picking a row to process next, and don't really care about
the order in which you work through them.

>> The question is, how many bugs stemmed from wrong SQL queries, and what
>> percentage of those would have been caught by this? The way I see it, there
>> are thousands of ways to screw up a query, and having it return multiple
>> rows instead of one is just one of them.
> 
> A query that's simply wrong is more likely to fail consistently. Non-strict
> use of INTO is going to fail in very subtle ways (unless you actually DO want
> just the first row, in which case you should explicitly use LIMIT 1).

How so? Say you expect "SELECT * FROM view WHERE c=" to only ever return
one row. Then "SELECT sum(f) FROM table JOIN view ON table.c = view.c" is
just as subtly wrong as the first query is.

> And if we've always had it, why on earth didn't we make STRICT the default
> behavior?

Dunno, but AFAIK pl/pgsql mimics Oracle's PL/SQL, at least in some aspects,
so maybe this is one of the areas where we just do what oracle does.

best regards,
Florian Pflug





Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Jim Nasby

On 1/13/14, 5:33 PM, Craig Ringer wrote:

So I guess the question is: Is it worth all that hassle to remove a
misfeature you have to go out of your way to use? Is support for non-1
lower bounds stopping us from doing something useful and important? Or
is it just an irritation that it exists?


It's not an irritation to -hackers, but it is an irritation for anyone that 
cares about data quality, because you're forced to code all of your stuff to 
always look at array_lower().

Actually, now that I think about it, if you want to be really safe you would 
actually force your code to use a different lower bound so you're more likely 
to discover code that's broken.

So it really is a big pain for users that know what's going on. And it will 
become a big pain for users that don't know if they ever accidentally end up 
with non-1 arrays. :)
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Tom Lane
Craig Ringer  writes:
> So I guess the question is: Is it worth all that hassle to remove a
> misfeature you have to go out of your way to use? Is support for non-1
> lower bounds stopping us from doing something useful and important? Or
> is it just an irritation that it exists?

I think the argument really is that some people don't want to make their
application code work with such cases (which is fine) so they'd like an
inside-the-database guarantee that the app code won't ever see such cases.
Which is less fine, ISTM: if you fear some part of your app might be
generating such arrays, then you don't have such little use for the
feature after all, eh?

This is being camouflaged in a whole lot of utter BS about how nobody
could possibly be using the feature, nobody *should* want it, it's outside
the standard, etc etc.  If we ripped out every feature being used by less
than 10% of the user base, we'd have a much smaller and more maintainable
system, for sure ... but we'd probably piss off upwards of 90% of the
user base by doing that.  Your useless frammish is the next man's killer
feature.

regards, tom lane




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Claudio Freire
On Mon, Jan 13, 2014 at 7:36 PM, Mel Gorman  wrote:
> That could be something we look at. There are cases buried deep in the
> VM where pages get shuffled to the end of the LRU and get tagged for
> reclaim as soon as possible. Maybe you need access to something like
> that via posix_fadvise to say "reclaim this page if you need memory but
> leave it resident if there is no memory pressure" or something similar.
> Not exactly sure what that interface would look like or offhand how it
> could be reliably implemented.


I don't see a reason not to make this behavior the default for DONTNEED.




Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Jim Nasby

On 1/13/14, 6:16 PM, Florian Pflug wrote:

On Jan14, 2014, at 00:52 , Marko Tiikkaja  wrote:

When I've worked with PL/PgSQL, this has been a source of a few bugs that
would have been noticed during testing if the behaviour of INTO wasn't as
dangerous as it is right now.


The question is, how many bugs stemmed from wrong SQL queries, and what
percentage of those would have been caught by this? The way I see it, there
are thousands of ways to screw up a query, and having it return multiple
rows instead of one is just one of them.


A query that's simply wrong is more likely to fail consistently. Non-strict use 
of INTO is going to fail in very subtle ways (unless you actually DO want just 
the first row, in which case you should explicitly use LIMIT 1).


If we're not going to scrap PL/PgSQL and
start over again, we are going to have to do changes like this to make the
language better.  Also I think that out of all the things we could do to
break backwards compatibility, this is closer to "harmless" than "a pain
in the butt".


I very strongly believe that languages don't get better by adding a thousand
little knobs which subtly change semantics. Look at the mess that is PHP -
we absolutely, certainly don't want to go there. The most important rule in
language design is in my opinion "stick with your choices". C, C++ and JAVA
all seem to follow this, and it's one of the reasons these languages are
popular for big projects, I think.

The way I see it, the only OK way to change existing behaviour is to have
the concept of a "language version", and force code to indicate the language
version it expects. The important thing is that the language version is an
attribute of code, not some global setting that you can change without ever
looking at the code it'd affect.

So if we really want to change this, I think we need to have a
LANGUAGE_VERSION attribute on functions. Each time a major postgres release
changes the behaviour of one of the procedural languages, we'd increment
that language's version, and enable the old behaviour for all functions
tagged with an older one.


I like that idea. It allows us to fix past decisions that were ill considered 
without hosing all existing code.

BTW, have we always had support for STRICT, or was it added at some point? It's 
in 8.4, but I don't know how far back it goes.

And if we've always had it, why on earth didn't we make STRICT the default 
behavior?
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Jim Nasby

On 1/13/14, 5:57 PM, Josh Berkus wrote:

On 01/13/2014 03:41 PM, Florian Pflug wrote:

It therefore isn't an oversight that SELECT ... INTO allows multiple result rows
but INSERT/UPDATE/DELETE forbids them, it's been done that way on purpose and
for a reason. We shouldn't be second-guessing ourselves by changing that later -
not, at least, unless we have a *very* good reason for it. Which, AFAICS, we 
don't.

(And yeah, personally I'd prefer if we'd complain about multiple rows. But it's
IMHO just too late for that)


I *really* don't want to go through all my old code to find places where
I used SELECT ... INTO just to pop off the first row, and ignored the
rest.  I doubt anyone else does, either.


Do you regularly have use cases where you actually want just one RANDOM row? I 
suspect the far more likely scenario is that people write code assuming they'll 
get only one row and they'll end up with extremely hard to trace bugs if that 
assumption is ever wrong.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] nested hstore patch

2014-01-13 Thread Erik Rijkers
On Mon, January 13, 2014 18:30, Andrew Dunstan wrote:
>
>
> On 01/13/2014 11:16 AM, Oleg Bartunov wrote:
>> Andrew,
>>
>> did you run the perl script? Actually, I found that the operator table needs
>> to be fixed.
>>
>
> No. My build machine doesn't actually have DBD::Pg installed. Can you
> send me a patch if you don't want to push it yourself, or maybe Erik can
> send a patch to adjust the table.
>

> [ nested_hstore_and_jsonb-2.patch ]

(CentOS 6.5, gcc 4.8.2)

The patch applies & compiles with warnings (see below).

The opr_sanity test fails during make check: regression.diffs attached.

Also attached are changes to hstore.sgml: fixes to the operator and function
tables, plus some typo fixes.

Thanks,
Erik Rijkers


make

jsonfuncs.c: In function ‘each_object_field_end_jsonb’:
jsonfuncs.c:1328:7: warning: assignment from incompatible pointer type [enabled 
by default]
   val = DatumGetPointer(DirectFunctionCall1(jsonb_in, CStringGetDatum(cstr)));
   ^
jsonfuncs.c: In function ‘elements_array_element_end_jsonb’:
jsonfuncs.c:1530:8: warning: assignment from incompatible pointer type [enabled 
by default]
  jbval = DatumGetPointer(DirectFunctionCall1(jsonb_in, CStringGetDatum(cstr)));
^


make contrib:

hstore_io.c: In function ‘array_to_hstore’:
hstore_io.c:1694:30: warning: ‘result’ may be used uninitialized in this 
function [-Wmaybe-uninitialized]
  PG_RETURN_POINTER(hstoreDump(result));







regression.diffs
Description: Binary data
--- doc/src/sgml/hstore.sgml.orig	2014-01-14 00:06:30.070883763 +0100
+++ doc/src/sgml/hstore.sgml	2014-01-14 00:58:53.069334810 +0100
@@ -350,7 +350,7 @@
   text[]
   get values for keys (NULL if not present)
   'a=>x, b=>y, c=>z'::hstore -> ARRAY['c','a']
-  {"z","x"}
+  {z,x}
  
 
  
@@ -422,7 +422,7 @@
   hstore
   delete key from left operand
   'a=>1, b=>2, c=>3'::hstore - 'b'::text
-  "a"=>"1", "c"=>"3"
+  "a"=>1, "c"=>3
  
 
  
@@ -438,7 +438,7 @@
   hstore
   delete keys from left operand
   'a=>1, b=>2, c=>3'::hstore - ARRAY['a','b']
-  "c"=>"3"
+  "c"=>3
  
 
  
@@ -446,14 +446,14 @@
   hstore
   delete matching pairs from left operand
   'a=>1, b=>2, c=>3'::hstore - 'a=>4, b=>2'::hstore
-  "a"=>"1", "c"=>"3"
+  "a"=>1, "c"=>3
  
 
  
   hstore #- text[]
   hstore
   delete key path from left operand
-  '{a => {b => { c => [1,2]}}}'::hstore #- '[a,b,c,0]'
+  '{a => {b => { c => [1,2]}}}'::hstore #- '{a,b,c,0}'
   "a"=>{"b"=>{"c"=>[2]}}
  
 
@@ -525,7 +525,7 @@
   hstore
   construct an hstore from a record or row
   hstore(ROW(1,2))
-  f1=>1,f2=>2
+  "f1"=>"1","f2"=>"2"
  
 
  
@@ -534,7 +534,7 @@
   construct an hstore from an array, which may be either
a key/value array, or a two-dimensional array
   hstore(ARRAY['a','1','b','2']) || hstore(ARRAY[['c','3'],['d','4']])
-  a=>1, b=>2, c=>3, d=>4
+  "a"=>"1", "b"=>"2", "c"=>"3", "d"=>"4"
  
 
  
@@ -707,7 +707,7 @@
   hstore
   extract a subset of an hstore
   slice('a=>1,b=>2,c=>3'::hstore, ARRAY['b','c','x'])
-  "b"=>"2", "c"=>"3"
+  "b"=>2, "c"=>3
  
 
  
@@ -766,15 +766,15 @@
   replace(hstore,text[],hstore)replace
   hstore
   replace value at the specified path
-  replace('a=>1,b=>{c=>3,d=>[4,5,6]}'::hstore,'[b,d]', '1')
-  "a"=>1, "b"=>{"c"=>3, "d"=>}
+  replace('a=>1,b=>{c=>3,d=>[4,5,6]}'::hstore,'{b,d}', '1')
+  "a"=>1, "b"=>{"c"=>3, "d"=>1}
  
 
  
   concat_path(hstore,text[],hstore)concat_path
   hstore
   concatenate hstore value at the specified path
-  concat_path('b=>{c=>3,d=>[4,5,6]}'::hstore,'[b,d]', '1')
+  concat_path('b=>{c=>3,d=>[4,5,6]}'::hstore,'{b,d}', '1')
   "b"=>{"c"=>3, "d"=>[4, 5, 6, 1]}
  
 
@@ -783,7 +783,7 @@
   hstore
   delete pair with matching key
   delete('a=>1,b=>2','b')
-  "a"=>"1"
+  "a"=>1
  
 
  
@@ -791,7 +791,7 @@
   hstore
   delete pairs with matching keys
   delete('a=>1,b=>2,c=>3',ARRAY['a','b'])
-  "c"=>"3"
+  "c"=>3
  
 
  
@@ -799,7 +799,7 @@
   hstore
   delete pairs matching those in the second argument
   delete('a=>1,b=>2','a=>4,b=>2'::hstore)
-  "a"=>"1"
+  "a"=>1
  
 
  
@@ -1007,7 +1007,7 @@
 
   
But populate_record() supports more complicated records and nested
-   hstore values, as well. It makes an effort to convert
+   hstore values as well. It makes an effort to convert
from hstore data types to PostgreSQL types, including arrays,
json, and hstore values:
 
@@ -1124,7 +1124,7 @@
   The internal representation of hstore has been updated
a couple of times in its history. Data types and nested structures were
added in PostgreSQL 9.4, while capacity and improved index support were
-   introduced in Postgrsql 9.0. These changes present no obstacle for
+   introduced 

Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Florian Pflug
On Jan14, 2014, at 00:52 , Marko Tiikkaja  wrote:
> When I've worked with PL/PgSQL, this has been a source of a few bugs that
> would have been noticed during testing if the behaviour of INTO wasn't as
> dangerous as it is right now.

The question is, how many bugs stemmed from wrong SQL queries, and what
percentage of those would have been caught by this? The way I see it, there
are thousands of ways to screw up a query, and having it return multiple 
rows instead of one is just one of them.

> Yes, it breaks backwards compatibility, but that's why there's a nice GUC.

Which doesn't help, because the GUC isn't tied to the code. This *adds*
an error case, it doesn't remove one - now, instead of getting your code correct,
you *also* have to get the GUC correct. If you even *know* that such a GUC
exists.

> If we're not going to scrap PL/PgSQL and
> start over again, we are going to have to do changes like this to make the
> language better.  Also I think that out of all the things we could do to
> break backwards compatibility, this is closer to "harmless" than "a pain
> in the butt".

I very strongly believe that languages don't get better by adding a thousand
little knobs which subtly change semantics. Look at the mess that is PHP -
we absolutely, certainly don't want to go there. The most important rule in
language design is in my opinion "stick with your choices". C, C++ and JAVA
all seem to follow this, and it's one of the reasons these languages are
popular for big projects, I think.

The way I see it, the only OK way to change existing behaviour is to have
the concept of a "language version", and force code to indicate the language
version it expects. The important thing is that the language version is an
attribute of code, not some global setting that you can change without ever
looking at the code it'd affect.

So if we really want to change this, I think we need to have a
LANGUAGE_VERSION attribute on functions. Each time a major postgres release
changes the behaviour of one of the procedural languages, we'd increment
that language's version, and enable the old behaviour for all functions
tagged with an older one.

best regards,
Florian Pflug










Re: [HACKERS] Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Jim Nasby

On 1/13/14, 5:05 PM, Peter Geoghegan wrote:

On Mon, Jan 13, 2014 at 2:20 PM, Jim Nasby  wrote:

Well, a common case for INSERT RETURNING is to get your set of surrogate
keys back; so I think users would want the ability to RETURN what finally
made it into the table.


Your update can also have a RETURNING clause. I'm not necessarily that
attached to fully generalizing RETURNING REJECTS as REJECTING. It was
just an idea. When an insert is rejected and you lock a conflicting
row, it hardly matters what your surrogate key might have been had
that insert succeeded.

To get the surrogate key when the upsert inserts, do a regular
INSERT ... RETURNING, and break the work up into multiple commands.
That will almost always be sufficient, because you'll almost always
know ahead of time where the conflict might be (certainly, the MySQL
feature mandates that you do know).


As long as there's a way to get back what was ultimately inserted or updated 
that'd work... there might be some cases where you'd actually want to know what 
the result of the REJECTING command was (ie: did the update do something 
fancy?).

Actually, you'd also want to know if triggers did anything. So we definitely 
want to keep the existing RETURNING behavior (sorry, I don't know offhand if 
you've kept that or not).


Also, if we want to support the case of identifying tuples where a BEFORE
trigger disallowed the insert, we probably want to expose that that's why
those tuples were rejected (as opposed to them being rejected due to a
duplicate key violation).


The ctid *won't* indicate a specific rejecting row then, I guess,
which will do it.


Yeah, the only other thing you might want is the name of the trigger that 
returned NULL... that would allow you to do something different based on which 
trigger it was.

Part of me thinks that'd be useful... part of me thinks it's just a foot-gun...
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Hannu Krosing
On 01/14/2014 12:33 AM, Craig Ringer wrote:
> On 01/14/2014 12:40 AM, Merlin Moncure wrote:
>> On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer  wrote:
>>> Implicit casts to text, anybody?
>> This backward compatibility break orphaned the company I work for on
>> 8.1 until last year and very nearly caused postgres to be summarily
>> extirpated (only rescued at the last minute by my arrival).
> You're far from the only one, too. Until last year I was still seeing
> people saying they "can't" upgrade because of this. OTOH, that was a
> sudden and drastic change, with no BC switch like the removal of
> implicit joins had, that affected wide swaths of code. Let's not do that
> again.
>
> Removal of lower bounds for arrays is unlikely to even get noticed by
> the vast majority of users, and can be done progressively with BC features.
>
> The real issue IMO is how to get those few to stop using it so it can be
> truly removed. Past experience has shown that people just turn the
> compatibility flag on and forget they're using the deprecated feature.
> If there are warnings they'll silence them in their application and
> still forget they're using the deprecated feature. If there are log
> messages, they'll probably turn logging detail down to hide them and
> STILL forget they're using the deprecated feature.
>
> Then whine about it three years later when it gets removed.
>
> So I guess the question is: Is it worth all that hassle to remove a
> misfeature you have to go out of your way to use? Is support for non-1
> lower bounds stopping us from doing something useful and important? Or
> is it just an irritation that it exists?
>
Let's just add a user-defined operator for '[]' (weirdly positioned but
2-argument, almost infix :) ) and add that to JSON arrays to get
0-based ones into postgresql ;)

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Hannu Krosing
On 01/13/2014 09:53 PM, Trond Myklebust wrote:
> On Jan 13, 2014, at 15:40, Andres Freund  wrote:
>
>> On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
>>> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner  wrote:
 I notice, Josh, that you didn't mention the problems many people
 have run into with Transparent Huge Page defrag and with NUMA
 access.
>>> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
>>> setting zone_reclaim_mode; is there some other problem besides that?
>> I think that fixes some of the worst instances, but I've seen machines
>> spending horrible amounts of CPU (& BUS) time in page reclaim
>> nonetheless. If I analyzed it correctly it's in RAM << working set
>> workloads where RAM is pretty large and most of it is used as page
>> cache. The kernel ends up spending a huge percentage of time finding and
>> potentially defragmenting pages when looking for victim buffers.
>>
>>> On a related note, there's also the problem of double-buffering.  When
>>> we read a page into shared_buffers, we leave a copy behind in the OS
>>> buffers, and similarly on write-out.  It's very unclear what to do
>>> about this, since the kernel and PostgreSQL don't have intimate
>>> knowledge of what each other are doing, but it would be nice to solve
>>> somehow.
>> I've wondered before if there wouldn't be a chance for postgres to say
>> "my dear OS, that the file range 0-8192 of file x contains y, no need to
>> reread" and do that when we evict a page from s_b but I never dared to
>> actually propose that to kernel people...
> O_DIRECT was specifically designed to solve the problem of double buffering 
> between applications and the kernel. Why are you not able to use that in 
> these situations?
What is asked is the opposite of O_DIRECT - a write from a buffer inside
postgresql to the linux *buffercache*, telling linux that it is the same as
what is currently on disk, so don't bother to ever write it back.

This would avoid current double-buffering between postgresql and linux
buffer caches while still making use of linux cache when possible.

The use case is pages that postgresql has moved into its buffer cache
but which it has not modified. They will at some point be evicted from the
postgresql cache, but it is likely that they will still be needed sometime
soon, so what is required is "writing them back" to the original file, only
they should not really be written - or marked dirty to be written later - at
any level deeper than the linux cache, as they *already* are on the disk.

It is probably ok to put them in the LRU position as they are "written"
out from postgresql, though it may be better if we get some more control over
where in the LRU order they would be placed. It may make sense to put them
there based on when they were last read while residing inside the postgresql
cache.
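
To make that concrete, a purely hypothetical sketch - no such call exists in
any kernel; the name, flag and semantics are invented here and the body is a
stub:

#include <sys/types.h>

/*
 * Hypothetical: hand the kernel a buffer that is a byte-for-byte copy of
 * file range [offset, offset + len), to be adopted into the page cache as
 * a CLEAN page - usable to serve future reads, but never written back,
 * since it already matches what is on disk.
 */
static int
pgfadv_adopt_clean(int fd, off_t offset, size_t len, const void *buf)
{
    (void) fd; (void) offset; (void) len; (void) buf;
    return -1;                  /* stub: would need kernel support */
}

/* Eviction-path sketch: a clean, unmodified 8kb page leaves
 * shared_buffers and, instead of being forgotten, is transplanted into
 * the kernel's cache, ideally at the LRU position discussed above. */
static void
evict_clean_page(int fd, off_t blkno, const void *page)
{
    (void) pgfadv_adopt_clean(fd, blkno * 8192, 8192, page);
}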

Cheers


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Josh Berkus
On 01/13/2014 03:41 PM, Florian Pflug wrote:
> It therefore isn't an oversight that SELECT ... INTO allows multiple result
> rows
> but INSERT/UPDATE/DELETE forbids them, it's been done that way on purpose and
> for a reason. We shouldn't be second-guessing ourselves by changing that 
> later -
> not, at least, unless we have a *very* good reason for it. Which, AFAICS, we 
> don't.
> 
> (And yeah, personally I'd prefer if we'd complain about multiple rows. But 
> it's
> IMHO just too late for that)

I *really* don't want to go through all my old code to find places where
I used SELECT ... INTO just to pop off the first row, and ignored the
rest.  I doubt anyone else does, either.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Marko Tiikkaja

On 1/14/14, 12:41 AM, Florian Pflug wrote:

In fact, after reading the documentation on SELECT ... INTO, I'm convinced that
the whole consistent_into thing is a bad idea. The documentation states clearly
that

   For INSERT/UPDATE/DELETE with RETURNING, PL/pgSQL reports an error for more
   than one returned row, even when STRICT is not specified. This is because
   there is no option such as ORDER BY with which to determine which affected
   row should be returned.

It therefore isn't an oversight that SELECT ... INTO allows multiple result rows
but INSERT/UPDATE/DELETE forbids them, it's been done that way on purpose and
for a reason.


Yeah, it does state that.  But it's a BS reason.  In addition to ORDER 
BY, SELECT also has a LIMIT which you can use to get the "first row" 
behaviour.  There's no way to go to the more sane behaviour from what we 
have right now.


When I've worked with PL/PgSQL, this has been a source of a few bugs 
that would have been noticed during testing if the behaviour of INTO 
wasn't as dangerous as it is right now.  Yes, it breaks backwards 
compatibility, but that's why there's a nice GUC.  If we're not going to 
scrap PL/PgSQL and start over again, we are going to have to do changes 
like this to make the language better.  Also I think that out of all the 
things we could do to break backwards compatibility, this is closer to 
"harmless" than "a pain in the butt".



Regards,
Marko Tiikkaja




Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Florian Pflug
On Jan13, 2014, at 22:49 , Jim Nasby  wrote:
> ISTM that in this case, it should be safe to make the new default behavior 
> STRICT;
> if you forget to set the GUC to disable than you'll get an error that points 
> directly
> at the problem, at which point you'll go "Oh, yeah... I forgot to set X..."

What do you mean by STRICT? STRICT (which we already support) complains if the
query doesn't return *exactly* one row. What Marko wants is to raise an error
for a plain SELECT ... INTO if more than one row is returned, but to still
convert zero rows to NULL.

> Outside of the GUC, I believe the default should definitely be STRICT. If 
> your app is
> relying on non-strict then you need to be made aware of that. We should be 
> able to
> provide a DO block that will change this setting for every function you've 
> got if
> someone isn't happy with STRICT mode.

If you mean that we should change SELECT ... INTO to always behave as if STRICT
had been specified - why on earth would we want to do that? That would break
perfectly fine code for no good reason whatsoever.

In fact, after reading the documentation on SELECT ... INTO, I'm convinced that
the whole consistent_into thing is a bad idea. The documentation states clearly
that

  For INSERT/UPDATE/DELETE with RETURNING, PL/pgSQL reports an error for more
  than one returned row, even when STRICT is not specified. This is because
  there is no option such as ORDER BY with which to determine which affected
  row should be returned.

It therefore isn't an oversight that SELECT ... INTO allows multiple result rows
but INSERT/UPDATE/DELETE forbids them, it's been done that way on purpose and
for a reason. We shouldn't be second-guessing ourselves by changing that later -
not, at least, unless we have a *very* good reason for it. Which, AFAICS, we 
don't.

(And yeah, personally I'd prefer if we'd complain about multiple rows. But it's
IMHO just too late for that)

best regards,
Florian Pflug





Re: [HACKERS] GIN improvements part 1: additional information

2014-01-13 Thread Tomas Vondra
On 13.1.2014 18:07, Alexander Korotkov wrote:
> On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra wrote:
> 
> On 8.1.2014 22:58, Alexander Korotkov wrote:
> > Thanks for reporting. Fixed version is attached.
> 
> I've tried to rerun the 'archie' benchmark with the current patch, and
> once again I got
> 
>PANIC:  could not split GIN page, didn't fit
> 
> I reran it with '--enable-cassert' and with that I got
> 
> TRAP: FailedAssertion("!(ginCompareItemPointers(&items[i - 1],
>&items[i]) < 0)", File: "gindatapage.c", Line: 149)
> LOG:  server process (PID 5364) was terminated by signal 6: Aborted
> DETAIL:  Failed process was running: INSERT INTO messages ...
> 
> so the assert in GinDataLeafPageGetUncompressed fails for some reason.
> 
> I can easily reproduce it, but my knowledge in this area is rather
> limited so I'm not entirely sure what to look for.
> 
> 
> I've fixed this bug and many other bugs. Now the patch passes the test
> suite that I've used earlier. The results are as follows:

OK, it seems the bug is gone. However now there's a memory leak
somewhere. I'm loading pgsql mailing list archives (~600k messages)
using this script

   https://bitbucket.org/tvondra/archie/src/1bbeb920/bin/load.py

And after loading about 1/5 of the data, all the memory gets filled by
the pgsql backends (loading the data in parallel) and the DB gets killed
by the OOM killer.

Tomas





Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Craig Ringer
On 01/14/2014 12:40 AM, Merlin Moncure wrote:
> On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer  wrote:
>> Implicit casts to text, anybody?
> 
> This backward compatibility break orphaned the company I work for on
> 8.1 until last year and very nearly caused postgres to be summarily
> extirpated (only rescued at the last minute by my arrival).

You're far from the only one, too. Until last year I was still seeing
people saying they "can't" upgrade because of this. OTOH, that was a
sudden and drastic change, with no BC switch like the removal of
implicit joins had, that affected wide swaths of code. Let's not do that
again.

Removal of lower bounds for arrays is unlikely to even get noticed by
the vast majority of users, and can be done progressively with BC features.

The real issue IMO is how to get those few to stop using it so it can be
truly removed. Past experience has shown that people just turn the
compatibility flag on and forget they're using the deprecated feature.
If there are warnings they'll silence them in their application and
still forget they're using the deprecated feature. If there are log
messages, they'll probably turn logging detail down to hide them and
STILL forget they're using the deprecated feature.

Then whine about it three years later when it gets removed.

So I guess the question is: Is it worth all that hassle to remove a
misfeature you have to go out of your way to use? Is support for non-1
lower bounds stopping us from doing something useful and important? Or
is it just an irritation that it exists?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Josh Berkus
On 01/13/2014 02:26 PM, Mel Gorman wrote:
> Really?
> 
> zone_reclaim_mode is often a complete disaster unless the workload is
> partitioned to fit within NUMA nodes. On older kernels enabling it would
> sometimes cause massive stalls. I'm actually very surprised to hear it
> fixes anything and would be interested in hearing more about what sort
> of circumstances would convince you to enable that thing.

So the problem with the default setting is that it pretty much isolates
all FS cache for PostgreSQL to whichever socket the postmaster is
running on, and makes the other FS cache unavailable.  This means that,
for example, if you have two memory banks, then only one of them is
available for PostgreSQL filesystem caching ... essentially cutting your
available cache in half.

And however slow moving cached pages between memory banks is, it's an
order of magnitude faster than moving them from disk.  But this isn't
how the NUMA stuff is configured; it seems to assume that it's less
expensive to get pages from disk than to move them between banks, so
whatever you've got cached on the other bank, it flushes it to disk as
fast as possible.  I understand the goal was to make memory usage local
to the processors stuff was running on, but that includes an implicit
assumption that no individual process will ever want more than one
memory bank worth of cache.

So disabling all of the NUMA optimizations is the way to go for any
workload I personally deal with.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Standalone synchronous master

2014-01-13 Thread Florian Pflug
On Jan13, 2014, at 22:30 , "Joshua D. Drake"  wrote:
> On 01/13/2014 01:14 PM, Jim Nasby wrote:
>> 
>> On 1/13/14, 12:21 PM, Joshua D. Drake wrote:
>>> 
>>> On 01/13/2014 10:12 AM, Hannu Krosing wrote:
>> In other words, if we're going to have auto-degrade, the most
>> intelligent place for it is in
>> RepMgr/HandyRep/OmniPITR/pgPoolII/whatever.  It's also the *easiest*
>> place.  Anything we do *inside* Postgres is going to have a really,
>> really hard time determining when to degrade.
> +1
> 
> This is also how 2PC works, btw - the database provides the building
> blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager
> to deal with issues that require a whole-cluster perspective.
> 
 
 ++1
>>> 
>>> +1
>> 
>> Josh, what do you think of the upthread idea of being able to recover
>> in-progress transactions that are waiting when we turn off sync rep? I'm
>> thinking that would be a very good feature to have... and it's not
>> something you can easily do externally.
> 
> I think it is extremely valuable, else we have lost those transactions which
> is exactly what we don't want.

We *have* to "recover" waiting transactions upon switching off sync rep.

A transaction that waits for a sync standby to respond has already committed
locally (i.e., updated the clog), it just hasn't updated the proc array yet,
and thus is still seen as in-progress by the rest of the system. But rolling
back the transaction is nevertheless *impossible* at that point (except by
PITR, and hence the quotes around "recover"). So the only alternative to
"recovering" them, i.e. have them abort their waiting, is to let them linger
indefinitely, still holding their locks, preventing xmin from advancing, etc,
until either the client disconnects or the server is restarted.

best regards,
Florian Pflug





Re: [HACKERS] Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Peter Geoghegan
On Mon, Jan 13, 2014 at 2:20 PM, Jim Nasby  wrote:
> Well, a common case for INSERT RETURNING is to get your set of surrogate
> keys back; so I think users would want the ability to RETURN what finally
> made it into the table.

Your update can also have a RETURNING clause. I'm not necessarily that
attached to fully generalizing RETURNING REJECTS as REJECTING. It was
just an idea. When an insert is rejected and you lock a conflicting
row, it hardly matters what your surrogate key might have been had
that insert succeeded.

To get the surrogate key when the upsert inserts, do a regular
INSERT ... RETURNING, and break the work up into multiple commands.
That will almost always be sufficient, because you'll almost always
know ahead of time where the conflict might be (certainly, the MySQL
feature mandates that you do know).
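
For instance, with a hypothetical table (a sketch only; the names and values
are not from this thread):

-- command 1: plain-insert path, projecting the surrogate key directly
INSERT INTO accounts (email) VALUES ('a@example.com') RETURNING id;

-- command 2, run only if command 1 raised unique_violation: lock and
-- update the existing row, again projecting the key with RETURNING
UPDATE accounts SET email = email
WHERE email = 'a@example.com' RETURNING id;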

> Also, if we want to support the case of identifying tuples where a BEFORE
> trigger disallowed the insert, we probably want to expose that that's why
> those tuples were rejected (as opposed to them being rejected due to a
> duplicate key violation).

The ctid *won't* indicate a specific rejecting row then, I guess,
which will do it.

-- 
Peter Geoghegan




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Mel Gorman
On Mon, Jan 13, 2014 at 11:38:44PM +0100, Jan Kara wrote:
> On Mon 13-01-14 22:26:45, Mel Gorman wrote:
> > The flipside is also meant to hold true. If you know data will be needed
> > in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at
> > the implementation it does a forced read-ahead on the range of pages of
> > interest. It doesn't look like it would block.
>   That's not quite true. POSIX_FADV_WILLNEED still needs to map logical
> file offsets to physical disk blocks and create IO requests. This happens
> synchronously. So if your disk is congested and relevant metadata is out of
> cache, or we simply run out of free IO requests, POSIX_FADV_WILLNEED can
> block for a significant amount of time.
> 

Umm, yes, you're right. It also potentially stalls allocating the pages
up front even though it will only try and direct reclaim pages once.
That can stall in some circumstances, particularly if there are a number
of processes trying to reclaim memory.

That kinda sucks though. One point of discussion would be to check whether
this is an interface that can be used and, if so, whether it is required to
never block and, if so, whether there is something we can do about it --
queue the IO asynchronously if you can, but if the kernel would block then do
not bother. That does mean that fadvise would not guarantee that the pages
will be resident in the future, but that was not the intent of the interface
anyway.

-- 
Mel Gorman
SUSE Labs




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Andres Freund
On 2014-01-13 14:19:56 -0800, James Bottomley wrote:
> >  Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > horrible consequences for performance/scalability - very quickly you
> > contend on locks in the kernel.
> 
> Is this because of problems in the mmap_sem?

It's been a while since I looked at it, but yes, mmap_sem was part of
it. I also seem to recall the number of IPIs increasing far too much for
it to be practical, but I am not sure anymore.

> > Also, that will mark that page dirty, which isn't what we want in this
> > case.
> 
> You mean madvise (page_addr)?  It shouldn't ... the state of the dirty
> bit should only be updated by actual writes.  Which MADV_ primitive is
> causing the dirty marking, because we might be able to fix it (unless
> there's some weird corner case I don't know about).

Not the madvise() itself, but transplanting the buffer from postgres'
buffers to the mmap() area of the underlying file would, right?

> We also do have a way of transplanting pages: it's called splice.  How
> do the semantics of splice differ from what you need?

Hm. I don't really see how splice would allow us to seed the kernel's
pagecache with content *without* marking the page as dirty in the
kernel.
We don't need zero-copy IO here, the important thing is just to fill the
pagecache with content without a) rereading the page from disk b)
marking the page as dirty.

> >  One major use case is transplanting a page coming from postgres'
> > buffers into the kernel's buffer cache, because the latter has a much
> > better chance of properly allocating system resources across independently
> > running applications.
> 
> If you want to share pages between the application and the page cache,
> the only known interface is mmap ... perhaps we can discuss how better
> to improve mmap for you?

I think purely using mmap() is pretty unlikely to work out - there are
just too many constraints about when a page is allowed to be written out
(e.g. it's interlocked with postgres' write ahead log). I also think
that for many practical purposes using mmap() would result in an absurd
number of mappings or mapping way too huge areas; e.g. large btree
indexes are usually accessed in a quite fragmented manner.

> > Oh, and the kernel's page-cache management, while far from perfect,
> > actually scales much better than postgres'.
> 
> Well, then, it sounds like the best way forward would be to get
> postgress to use the kernel page cache more efficiently.

No arguments there, although working on postgres scalability is a good
idea as well ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Peter Geoghegan
On Mon, Jan 13, 2014 at 12:58 PM, Heikki Linnakangas wrote:
> Well, even if you don't agree that locking all the conflicting rows for
> update is sensible, it's still perfectly sensible to return the rejected
> rows to the user. For example, you're inserting N rows, and if some of them
> violate a constraint, you still want to insert the non-conflicting rows
> instead of rolling back the whole transaction.

Right, but with your approach, can you really be sure that you have
the right rejecting tuple ctid (not reject)? In other words, as you
wait for the exclusion constraint to conclusively indicate that there
is a conflict, minutes may have passed in which time other conflicts
may emerge in earlier unique indexes. Whereas with an approach where
values are locked, you are guaranteed that earlier unique indexes have
no conflicting values. Maintaining that property seems useful, since
we check in a well-defined order, and we're still projecting a ctid.
Unlike when row locking is involved, we can make no assumptions or
generalizations around where conflicts will occur. Although that may
also be a general concern with your approach when row locking, for
multi-master replication use-cases. There may be some value in knowing
it cannot have been earlier unique indexes (and so the existing values
for those unique indexes in the locked row should stay the same -
don't many conflict resolution policies work that way?).


-- 
Peter Geoghegan




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Mel Gorman
On Mon, Jan 13, 2014 at 06:27:03PM -0200, Claudio Freire wrote:
> On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby  wrote:
> > On 1/13/14, 2:19 PM, Claudio Freire wrote:
> >>
> >> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas 
> >> wrote:
> >>>
> >>> On a related note, there's also the problem of double-buffering.  When
> >>> we read a page into shared_buffers, we leave a copy behind in the OS
> >>> buffers, and similarly on write-out.  It's very unclear what to do
> >>> about this, since the kernel and PostgreSQL don't have intimate
> >>> knowledge of what each other are doing, but it would be nice to solve
> >>> somehow.
> >>
> >>
> >>
> >> There you have a much harder algorithmic problem.
> >>
> >> You can basically control duplication with fadvise and WONTNEED. The
> >> problem here is not the kernel and whether or not it allows postgres
> >> to be smart about it. The problem is... what kind of smarts
> >> (algorithm) to use.
> >
> >
> > Isn't this a fairly simple matter of when we read a page into shared buffers
> > tell the kernel do forget that page? And a corollary to that for when we
> > dump a page out of shared_buffers (here kernel, please put this back into
> > your cache).
> 
> 
> That's my point. In terms of kernel-postgres interaction, it's fairly simple.
> 
> What's not so simple, is figuring out what policy to use. Remember,
> you cannot tell the kernel to put some page in its page cache without
> reading it or writing it. So, once you make the kernel forget a page,
> evicting it from shared buffers becomes quite expensive.

posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
forcing readahead. If you evict it prematurely then you do get kinda
screwed because you pay the IO cost to read it back in again even if you
had enough memory to cache it. Maybe this is the type of kernel-postgres
interaction that is annoying you.

If you don't evict, the kernel eventually steps in and evicts the wrong
thing. If you do evict and it was unnecessary, you pay an IO cost.

That could be something we look at. There are cases buried deep in the
VM where pages get shuffled to the end of the LRU and get tagged for
reclaim as soon as possible. Maybe you need access to something like
that via posix_fadvise to say "reclaim this page if you need memory but
leave it resident if there is no memory pressure" or something similar.
Not exactly sure what that interface would look like or offhand how it
could be reliably implemented.

-- 
Mel Gorman
SUSE Labs




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Mel Gorman
On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote:
> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner  wrote:
> > I notice, Josh, that you didn't mention the problems many people
> > have run into with Transparent Huge Page defrag and with NUMA
> > access.
> 

Ok, there are at least three potential problems there that you may or
may not have run into.

First, THP when it was first introduced was a bit of a disaster. In 3.0,
it was *very* heavy handed and would trash the system reclaiming memory
to satisfy an allocation. When it did this, it would also writeback a
bunch of data and block on it to boot. It was not the smartest move of
all time but was improved over time and in some cases the patches were
also backported by 3.0.101. This is a problem that should have been
alleviated over time.

The general symptoms of the problem would be massive stalls and
monitoring the /proc/PID/stack of interesting processes would show it to
be somewhere in do_huge_pmd_anonymous_page -> alloc_page_nodemask ->
try_to_free_pages -> migrate_pages or something similar. You may have
worked around it by disabling THP with a command line switch or
/sys/kernel/mm/transparent_hugepage/enabled in the past.

This is "not meant to happen" any more or at least it has been a while
since a bug was filed against me in this area. There are corner cases
though. If the underlying filesystem is NFS, the problem might still be
experienced.

That is the simple case.

You might have also hit the case where THPages filled with zeros did not
use the zero page. That would have looked like a larger footprint than
anticipated and led to another range of problems. This has also since been
addressed, but maybe not recently enough. It's less likely this is your problem
though as I expect you actually use your buffers, not leave them filled
with zeros.

You mention NUMA, but that problem is trickier to figure out without more
context.  THP can cause unexpected interleaving between NUMA nodes. Memory
that would have been local on a 4K page boundary becomes remote accesses
when THP is enabled and performance would be hit (maybe 3-5% depending on
the machine). It's not the only possibility though. If memory was being
used sparsely and THP was in use then the overall memory footprint may be
higher than it should be. This potentially would cause allocations to spill
over to remote nodes while kswapd wakes up to reclaim local memory. That
would lead to weird buffer aging inversion problems. This is a hell of a
lot of guessing though and we'd need a better handle on the reproduction
case to pin it down.

> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
> setting zone_reclaim_mode; is there some other problem besides that?
> 

Really?

zone_reclaim_mode is often a complete disaster unless the workload is
partitioned to fit within NUMA nodes. On older kernels enabling it would
sometimes cause massive stalls. I'm actually very surprised to hear it
fixes anything and would be interested in hearing more about what sort
of circumstances would convince you to enable that thing.

> The other thing that comes to mind is the kernel's caching behavior.
> We've talked a lot over the years about the difficulties of getting
> the kernel to write data out when we want it to and to not write data
> out when we don't want it to. 

Is sync_file_range() broken?

> When it writes data back to disk too
> aggressively, we get lousy throughput because the same page can get
> written more than once when caching it for longer would have allowed
> write-combining. 

Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
If it's dirty_writeback_centisecs then that would be particularly tricky
because poor interactions there would come down to luck basically.

> When it doesn't write data to disk aggressively
> enough, we get huge latency spikes at checkpoint time when we call
> fsync() and the kernel says "uh, what? you wanted that data *on the
> disk*? sorry boss!" and then proceeds to destroy the world by starving
> the rest of the system for I/O for many seconds or minutes at a time.

Ok, parts of that are somewhat expected. It *may* depend on the
underlying filesystem. Some of them handle fsync better than others. If
you are syncing the whole file though when you call fsync then you are
potentially burned by having to writeback dirty_ratio amounts of memory
which could take a substantial amount of time.

> We've made some desultory attempts to use sync_file_range() to improve
> things here, but I'm not sure that's really the right tool, and if it
> is we don't know how to use it well enough to obtain consistent
> positive results.
> 

That implies that sync_file_range() is broken in some fashion we
(or at least I) are not aware of, and that needs kicking.

> On a related note, there's also the problem of double-buffering.  When
> we read a page into shared_buffers, we leave a copy behind in the OS
> buffers, and similarly on write-out.  It's

Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Merlin Moncure
On Mon, Jan 13, 2014 at 3:45 PM, David Fetter  wrote:
> On Mon, Jan 13, 2014 at 10:40:57AM -0600, Merlin Moncure wrote:
>> This project has no deprecation policy,
>
> I believe it actually does, although it's not a formal, written
> policy.  Would you like to help draft one up?

Lack of a 'formal, written policy' is equivalent to 'no policy'.
Regardless, the way things were done in the 7.x/8.x series may no longer
apply today; the project has grown up and we need to be more serious
about things, at least, IMNSHO.

>> and I'd argue we'd need one
>> before considering breaking changes.  For example, maybe we could pull
>> out an occasional release for longer-term support to help users that get
>> caught out.   But really, the better way to go IMNSHO is to take a
>> hard line on compatibility issues pretty much always -- consider the
>> case of libc and win32 api.
>
> Could you please help remind us what that was?

Let's take gets() for example.  C11 finally ditched it 12 years (!)
after it was formally deprecated in C99 and informally deprecated in
endless man pages ("don't use this!") for decades before that.  And
even then most compilers, at least the decent ones, should allow you to
request previous standards for some time beyond that.  The win32 API
is also remarkably stable; ancient code written for it beyond the dim
horizon of time will still compile and execute today.  These are
probably strong contenders for most popular APIs ever made -- see the
connection?  Now, comparing C APIs to an SQL implementation for
deprecation purposes isn't quite apples to apples, but I'll stand by
the analogy.

>> or gross violations of the standard
>
> We're definitely there on lower bounds of arrays.  The standard, for a
> wonder, is clear and unambiguous about them.  Whether we should go
> there on the rest of our array implementation is a question for
> another thread.

The SQL standard requests that standard syntax gives standard
behavior.  Alternate bounds is non-standard syntax giving non-standard
behavior and is thus excepted.  Naturally, non-standard syntax is
dangerous because the standard may later implement it in which case
you then have a real problem (that may be the case here: I don't
know).  Our array implementation is a real mess on multiple levels but
at least it's an internally consistent mess.  Maybe it really should
be 'fixed', but not before the super un-fun discussion of how to ease
the path for our hapless users happens first.

merlin




Re: [HACKERS] Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Jim Nasby

On 1/12/14, 9:35 PM, Andreas Karlsson wrote:

On 01/12/2014 11:20 PM, Peter Geoghegan wrote:

On Sun, Jan 12, 2014 at 8:12 AM, Andreas Karlsson  wrote:

On 01/11/2014 11:42 PM, Peter Geoghegan wrote:

I recently suggested that rather than RETURNING REJECTS, we could have
a REJECTING clause, which would see a DML statement project strictly
the complement of what RETURNING projects in the same context. So
perhaps you could also see what RETURNING would not have projected
because a before row trigger returned NULL (i.e. when a before trigger
indicates to not proceed with insertion). That is certainly more
general, and so is perhaps preferable. It's also less verbose, and it
seems less likely to matter that we'll need to make REJECTING a fully
reserved keyword, as compared to REJECTS. (RETURNING is already a
fully reserved keyword not described by the standard, so this makes a
certain amount of sense to me). If nothing else, REJECTING is more
terse than RETURNING REJECTS.


I do not entirely understand what you are proposing here.  Any example of how
this would look compared to your RETURNING REJECTS proposal?


It's very similar - REJECTING is a total generalization of what I
already have. The difference is only that REJECTING is accepted in all
contexts that RETURNING is, and not just with INSERT...ON DUPLICATE
KEY LOCK FOR UPDATE. So you could potentially have REJECTING project
the slot proposed for insertion on an UPDATE where RETURNING would
not. If for example a BEFORE ROW trigger fired, and returned NULL,
perhaps it'd then be possible to project the slot as it was before
being passed to the trigger. Perhaps there is no real demand for that,
but, as I said, from a usability perspective it may be easier to
reason about a feature that projects strictly the complement of what
RETURNING would project in the same context.


So simply this?

WITH rej AS (
 INSERT INTO foo (a, b, c)
 VALUES (44, 1078, 'insert'), (55, 1088, 'insert')
 REJECTING a)
UPDATE foo SET c = 'update' FROM rej WHERE foo.a = rej.a;

Another question: have you given any thought to the case where you want to use 
both the successfully inserted tuples and the rejected ones in the CTE? Is 
that even something anyone would want? Would perhaps MERGE be more suited for 
that?


Well, a common case for INSERT RETURNING is to get your set of surrogate keys 
back; so I think users would want the ability to RETURN what finally made it 
into the table.

Also, if we want to support the case of identifying tuples where a BEFORE 
trigger disallowed the insert, we probably want to expose that that's why those 
tuples were rejected (as opposed to them being rejected due to a duplicate key 
violation).
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Josh Berkus
Everyone,

I am looking for one or more hackers to go to Collab with me to discuss
this.  If you think that might be you, please let me know and I'll look
for funding for your travel.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Robert Haas
On Mon, Jan 13, 2014 at 4:45 PM, David Fetter  wrote:
> On Mon, Jan 13, 2014 at 10:40:57AM -0600, Merlin Moncure wrote:
>> On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer  wrote:
>> > Implicit casts to text, anybody?
>>
>> This backward compatibility break orphaned the company I work for on
>> 8.1 until last year and very nearly caused postgres to be summarily
>> extirpated (only rescued at the last minute by my arrival). It cost
>> hundreds of thousands of dollars to qualify a sprawling java code base
>> so that it could be moved back into a supported version.  Breaking
>> compatibility sucks -- it hurts your users and costs people money.
>> Hacking type casts may not have been a mistake, but the arbitrary
>> introduction of the breakage certainly was.
>
> With utmost respect, it was not.  Databases are no good if there are
> fixable things in them that cause them to produce incorrect results at
> random, as auto-casting to text did.

With a precisely equal level of respect, that's a load of bunk.  As
has been discussed here many times in the past, those changes broke
many applications that were just fine.  Mine included, EnterpriseDB's
included.  It was designed to maximize rather than minimize breakage
in ways that were completely unnecessary, and every time anyone dares
to argue that the change was less than perfect in every respect, the
responses evince a level of dubiousness normally reserved for the
parole hearings of convicted murderers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Jim Nasby

On 1/13/14, 1:44 AM, Pavel Stehule wrote:




2014/1/12 Florian Pflug

On Jan12, 2014, at 22:37 , Pavel Stehule wrote:
 > There is  GUC for variable_conflict already too. In this case I would to
 > enable this functionality everywhere (it is tool how to simply eliminate
 > some kind of strange bugs) so it needs a GUC.
 >
 > We have GUC for plpgsql.variable_conflict three years and I don't know
 > about any problem.

I must say I hate behaviour-changing GUCs with quite some passion. IMHO
they tend to cause bugs, not avoid them, in the long run. The pattern
usually is

   1) Code gets written, depends on some particular set of settings
  to work correctly

   2) Code gets reused, with little further testing since it's supposed
  to be battle-proven anyway. Settings get dropped.

   3) Code blows up for those corner-cases where the setting actually
  matter. Debugging is hell, because you effectively have to go
  over the code line-by-line and check if it might be affected by
  some GUC or another.

Only a few days ago I spent more than an hour tracking down a bug
which, as it turned out, was caused by a regex which subtly changed its
meaning depending on whether standard_conforming_strings is on or off.

Some GUCs are unavoidable - standard_conforming_strings, for example
probably still was a good idea, since the alternative would have been
to stick with the historical, non-standard behaviour forever.

But in this case, my feeling is that the trouble such a GUC may cause
out-weights the potential benefits. I'm all for having a directive like
#consistent_into (though I feel that the name could convey the
meaning better). If we *really* think that this ought to be the default
from 9.4 onward, then we should

   *) Change it to always complain, except if the function explicitly
  specifies "#consistent_into on" or whatever.

   *) Have pg_dump add that to all plpgsql functions if the server
  version is < 9.4 or whatever major release this ends up in

That's all just my opinion of course.


I think a GUC and a plpgsql option can live together. If you want to 
emphasize some behavior, then you can use the plpgsql option. On the other 
hand, I may want to use some functionality that is safe without dirtying the 
source code with repeated options, but then I have to check (and weigh the 
risk of) the GUC settings.

One idea: a required GUC? It could be nice to have a way to ensure certain 
GUC settings, and to restore these values or raise a warning otherwise.

Back to the main topic. The required and described feature doesn't change the 
behavior of the INTO clause. I can enable or disable this functionality, and 
well-written code should work without change (and without problems). When the 
check is disabled, execution is just less safe. So in this case, the impact of 
the GUC is significantly smaller than in the issues you described. Does 
anybody know a use case where this check should be disabled?

We probably have different experiences with GUCs. I had a problem with 
standard_conforming_strings and the bytea format some years ago; now I am 
preparing a document about the required settings. But I see (from my 
experience in the Czech area) more frequent problems related to 
effective_cache_size or from_collapse_limit and similar GUCs. These 
parameters are beyond the knowledge (and visibility) of the typical user.


ISTM that in this case, it should be safe to make the new default behavior STRICT; if you 
forget to set the GUC to disable it, then you'll get an error that points directly at the 
problem, at which point you'll go "Oh, yeah... I forgot to set X..."

Outside of the GUC, I believe the default should definitely be STRICT. If your 
app is relying on non-strict then you need to be made aware of that. We should 
be able to provide a DO block that will change this setting for every function 
you've got if someone isn't happy with STRICT mode.
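
A sketch of what such a DO block could look like, assuming the hypothetical
GUC discussed here is called plpgsql.consistent_into and accepts "off"
(neither the name nor the value is settled in this thread):

DO $$
DECLARE f regprocedure;
BEGIN
  FOR f IN
    SELECT p.oid::regprocedure
    FROM pg_proc p
    JOIN pg_language l ON l.oid = p.prolang
    WHERE l.lanname = 'plpgsql'
  LOOP
    -- pin the old, non-strict behavior on each existing function
    EXECUTE format('ALTER FUNCTION %s SET plpgsql.consistent_into = off', f);
  END LOOP;
END $$;
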
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread David Fetter
On Mon, Jan 13, 2014 at 10:40:57AM -0600, Merlin Moncure wrote:
> On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer  wrote:
> > Implicit casts to text, anybody?
> 
> This backward compatibility break orphaned the company I work for on
> 8.1 until last year and very nearly caused postgres to be summarily
> extirpated (only rescued at the last minute by my arrival). It cost
> hundreds of thousands of dollars to qualify a sprawling java code base
> so that it could be moved back into a supported version.  Breaking
> compatibility sucks -- it hurts your users and costs people money.
> Hacking type casts may not have been a mistake, but the arbitrary
> introduction of the breakage certainly was.

With utmost respect, it was not.  Databases are no good if there are
fixable things in them that cause them to produce incorrect results at
random, as auto-casting to text did.

> This project has no deprecation policy,

I believe it actually does, although it's not a formal, written
policy.  Would you like to help draft one up?

> and I'd argue we'd need one
> before considering breaking changes.  For example, maybe we could pull
> out an occasional release for longer-term support to help users that get
> caught out.   But really, the better way to go IMNSHO is to take a
> hard line on compatibility issues pretty much always -- consider the
> case of libc and win32 api.

Could you please help remind us what that was?

> There are certain limited exceptions to this rule -- for example
> security problems

Probably not.

> or gross violations of the standard

We're definitely there on lower bounds of arrays.  The standard, for a
wonder, is clear and unambiguous about them.  Whether we should go
there on the rest of our array implementation is a question for
another thread.

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Standalone synchronous master

2014-01-13 Thread Joshua D. Drake


On 01/13/2014 01:14 PM, Jim Nasby wrote:


On 1/13/14, 12:21 PM, Joshua D. Drake wrote:


On 01/13/2014 10:12 AM, Hannu Krosing wrote:

In other words, if we're going to have auto-degrade, the most
intelligent place for it is in
RepMgr/HandyRep/OmniPITR/pgPoolII/whatever.  It's also the *easiest*
place.  Anything we do *inside* Postgres is going to have a really,
really hard time determining when to degrade.

+1

This is also how 2PC works, btw - the database provides the building
blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager
to deal with issues that require a whole-cluster perspective.



++1


+1


Josh, what do you think of the upthread idea of being able to recover
in-progress transactions that are waiting when we turn off sync rep? I'm
thinking that would be a very good feature to have... and it's not
something you can easily do externally.


I think it is extremely valuable, else we have lost those transactions 
which is exactly what we don't want.


JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
"In a time of universal deceit - telling the truth is a revolutionary 
act.", George Orwell





Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Greg Stark
On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund  wrote:
> For one, postgres doesn't use mmap for files (and can't without major
> new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> horrible consequences for performance/scalability - very quickly you
> contend on locks in the kernel.


I may as well dump this in this thread. We've discussed this in person
a few times, including at least once with Ted Ts'o when he visited
Dublin last year.

The fundamental conflict is that the kernel understands better the
hardware and other software using the same resources, Postgres
understands better its own access patterns. We need to either add
interfaces so Postgres can teach the kernel what it needs about its
access patterns or add interfaces so Postgres can find out what it
needs to know about the hardware context.

The more ambitious and interesting direction is to let Postgres tell
the kernel what it needs to know to manage everything. To do that we
would need the ability to control when pages are flushed out. This is
absolutely necessary to maintain consistency. Postgres would need to
be able to mark pages as unflushable until some point in time in the
future when the journal is flushed. We discussed various ways that
interface could work but it would be tricky to keep it low enough
overhead to be workable.

The less exciting, more conservative option would be to add kernel
interfaces to teach Postgres about things like raid geometries. Then
Postgres could use directio and decide to do prefetching based on the
raid geometry, how much available i/o bandwidth and iops is available,
etc.

Reimplementing i/o schedulers and all the rest of the work that the
kernel provides inside Postgres just seems like something outside our
competency and that none of us is really excited about doing.

-- 
greg




Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Jim Nasby

On 1/13/14, 10:40 AM, Merlin Moncure wrote:

On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer  wrote:

>Implicit casts to text, anybody?

This backward compatibility break orphaned the company I work for on
8.1 until last year and very nearly caused postgres to be summarily
extirpated (only rescued at the last minute by my arrival). It cost
hundreds of thousands of dollars to qualify a sprawling java code base
so that it could be moved back into a supported version.  Breaking
compatibility sucks -- it hurts your users and costs people money.
Hacking type casts may not have been a mistake, but the arbitrary
introduction of the breakage certainly was.


IMHO, I see no reason we need to plan on removing support. Presumably it's not 
that much burden on our codebase; it's only a PITA for users writing correct 
code. (It'd be very interesting to see how much user code would blow up if 
presented with anything other than 1 as the lower bound...)
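
A quick hypothetical illustration of the hazard: any code that assumes
subscript 1 is the first element misreads such arrays.

SELECT ('[0:2]={10,20,30}'::int[])[1];             -- yields 20, not 10
SELECT array_lower('[0:2]={10,20,30}'::int[], 1);  -- yields 0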

I'd be perfectly happy with an initdb option that allows lower-bound support if 
you want it and disables it by default. People that legitimately want/need <> 1 
lower bounds can set that up, but our general user population will end up 
protected from a class of heisenbugs.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Standalone synchronous master

2014-01-13 Thread Andres Freund
On 2014-01-13 15:14:21 -0600, Jim Nasby wrote:
> On 1/13/14, 12:21 PM, Joshua D. Drake wrote:
> >
> >On 01/13/2014 10:12 AM, Hannu Krosing wrote:
> In other words, if we're going to have auto-degrade, the most
> intelligent place for it is in
> RepMgr/HandyRep/OmniPITR/pgPoolII/whatever.  It's also the *easiest*
> place.  Anything we do *inside* Postgres is going to have a really,
> really hard time determining when to degrade.
> >>>+1
> >>>
> >>>This is also how 2PC works, btw - the database provides the building
> >>>blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager
> >>>to deal with issues that require a whole-cluster perspective.
> >>>
> >>
> >>++1
> >
> >+1
> 
> Josh, what do you think of the upthread idea of being able to recover 
> in-progress transactions that are waiting when we turn off sync rep? I'm 
> thinking that would be a very good feature to have... and it's not something 
> you can easily do externally.

I think it'd be a fairly simple patch to re-check the state of syncrep
config in SyncRepWaitForLsn(). Alternatively you can just write code to
iterate over the procarray and set Proc->syncRepState to
SYNC_REP_WAIT_CANCELLED or such.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Standalone synchronous master

2014-01-13 Thread Jim Nasby

On 1/13/14, 12:21 PM, Joshua D. Drake wrote:


On 01/13/2014 10:12 AM, Hannu Krosing wrote:

In other words, if we're going to have auto-degrade, the most
intelligent place for it is in
RepMgr/HandyRep/OmniPITR/pgPoolII/whatever.  It's also the *easiest*
place.  Anything we do *inside* Postgres is going to have a really,
really hard time determining when to degrade.

+1

This is also how 2PC works, btw - the database provides the building
blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager
to deal with issues that require a whole-cluster perspective.



++1


+1


Josh, what do you think of the upthread idea of being able to recover 
in-progress transactions that are waiting when we turn off sync rep? I'm 
thinking that would be a very good feature to have... and it's not something 
you can easily do externally.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Andres Freund
On 2014-01-13 12:34:35 -0800, James Bottomley wrote:
> On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> > Well, if we were to collaborate with the kernel community on this then
> > presumably we can do better than that for eviction... even to the
> > extent of "here's some data from this range in this file. It's (clean|
> > dirty). Put it in your cache. Just trust me on this."
> 
> This should be the madvise() interface (with MADV_WILLNEED and
> MADV_DONTNEED); is there something in that interface that is
> insufficient?

For one, postgres doesn't use mmap for files (and can't without major
new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
horrible consequences for performance/scalability - very quickly you
contend on locks in the kernel.
Also, that will mark that page dirty, which isn't what we want in this
case. One major use case is transplanting a page coming from postgres'
buffers into the kernel's buffer cache, because the latter has a much
better chance of properly allocating system resources across independently
running applications.

Oh, and the kernel's page-cache management, while far from perfect,
actually scales much better than postgres'.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Kevin Grittner
Josh Berkus  wrote:

> Wanna go to Collab?

I don't think that works out for me, but thanks for suggesting it.

I'd be happy to brainstorm with anyone who does go about issues to
discuss; although the ones I keep running into have already been
mentioned.

Regarding the problems others have mentioned, there are a few
features that might be a very big plus for us.  Additional ways of
hinting pages might be very useful.  If we had a way to specify how
many dirty pages were cached in PostgreSQL, the OS would count
those for calculations for writing dirty pages, and we could avoid
the "write avalanche" which is currently so tricky to avoid without
causing repeated writes to the same page.  Or perhaps instead a way
to hint a page as dirty so that the OS could not only count those,
but discard the obsolete data from its cache if it is not already
dirty at the OS level, and lower the write priority if it is dirty
(to improve the odds of collapsing multiple writes).  If there was
a way to use DONTNEED or something similar with the ability to
rescind it if the page still happened to be in the OS cache,
that might help for when we discard a still-clean page from our
buffers.  And I seem to have a vague memory of there being cases
where the OS is first reading pages when we ask to write them,
which seems like avoidable I/O.  (I'm not sure about that one,
though.)

Also, something like THP support should really have sysctl support
rather than requiring people to put echo commands into scripts and
tie those into runlevel changes.  That's pretty ugly for something
which has turned out to be necessary so often.

I don't get too excited about changes to the default schedulers --
it's been pretty widely known for a long time that DEADLINE or NOOP
perform better than any alternatives for most database loads. 
Anyone with a job setting up Linux machines to be used for database
servers should know to cover that.  As long as those two don't get
broken, I'm good.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Trond Myklebust

On Jan 13, 2014, at 15:40, Andres Freund  wrote:

> On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
>> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner  wrote:
>>> I notice, Josh, that you didn't mention the problems many people
>>> have run into with Transparent Huge Page defrag and with NUMA
>>> access.
>> 
>> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
>> setting zone_reclaim_mode; is there some other problem besides that?
> 
> I think that fixes some of the worst instances, but I've seen machines
> spending horrible amounts of CPU (& BUS) time in page reclaim
> nonetheless. If I analyzed it correctly it's in RAM << working set
> workloads where RAM is pretty large and most of it is used as page
> cache. The kernel ends up spending a huge percentage of time finding and
> potentially defragmenting pages when looking for victim buffers.
> 
>> On a related note, there's also the problem of double-buffering.  When
>> we read a page into shared_buffers, we leave a copy behind in the OS
>> buffers, and similarly on write-out.  It's very unclear what to do
>> about this, since the kernel and PostgreSQL don't have intimate
>> knowledge of what each other are doing, but it would be nice to solve
>> somehow.
> 
> I've wondered before if there wouldn't be a chance for postgres to say
> "my dear OS, that the file range 0-8192 of file x contains y, no need to
> reread" and do that when we evict a page from s_b but I never dared to
> actually propose that to kernel people...

O_DIRECT was specifically designed to solve the problem of double buffering 
between applications and the kernel. Why are you not able to use that in these 
situations?

Cheers,
   Trond



Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread James Bottomley
On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> On 1/13/14, 2:27 PM, Claudio Freire wrote:
> > On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby  wrote:
> >> On 1/13/14, 2:19 PM, Claudio Freire wrote:
> >>>
> >>> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas 
> >>> wrote:
> 
>  On a related note, there's also the problem of double-buffering.  When
>  we read a page into shared_buffers, we leave a copy behind in the OS
>  buffers, and similarly on write-out.  It's very unclear what to do
>  about this, since the kernel and PostgreSQL don't have intimate
>  knowledge of what each other are doing, but it would be nice to solve
>  somehow.
> >>>
> >>>
> >>>
> >>> There you have a much harder algorithmic problem.
> >>>
> >>> You can basically control duplication with fadvise and WONTNEED. The
> >>> problem here is not the kernel and whether or not it allows postgres
> >>> to be smart about it. The problem is... what kind of smarts
> >>> (algorithm) to use.
> >>
> >>
> >> Isn't this a fairly simple matter of when we read a page into shared 
> >> buffers
> >> tell the kernel do forget that page? And a corollary to that for when we
> >> dump a page out of shared_buffers (here kernel, please put this back into
> >> your cache).
> >
> >
> > That's my point. In terms of kernel-postgres interaction, it's fairly 
> > simple.
> >
> > What's not so simple, is figuring out what policy to use. Remember,
> > you cannot tell the kernel to put some page in its page cache without
> > reading it or writing it. So, once you make the kernel forget a page,
> > evicting it from shared buffers becomes quite expensive.
> 
> Well, if we were to collaborate with the kernel community on this then
> presumably we can do better than that for eviction... even to the
> extent of "here's some data from this range in this file. It's (clean|
> dirty). Put it in your cache. Just trust me on this."

This should be the madvise() interface (with MADV_WILLNEED and
MADV_DONTNEED); is there something in that interface that is
insufficient?

James






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Robert Haas
On Mon, Jan 13, 2014 at 3:53 PM, Trond Myklebust  wrote:
> O_DIRECT was specifically designed to solve the problem of double buffering 
> between applications and the kernel. Why are you not able to use that in 
> these situations?

O_DIRECT was apparently designed by a deranged monkey on some serious
mind-controlling substances.  But don't take it from me, I have it on
good authority:

http://yarchive.net/comp/linux/o_direct.html

One might even say the best authority.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Jeff Janes
On Mon, Jan 13, 2014 at 12:32 PM, Jim Nasby  wrote:

> On 1/13/14, 2:27 PM, Claudio Freire wrote:
>
>> On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby  wrote:
>>
>>> On 1/13/14, 2:19 PM, Claudio Freire wrote:
>>>

 On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas 
 wrote:

>
> On a related note, there's also the problem of double-buffering.  When
> we read a page into shared_buffers, we leave a copy behind in the OS
> buffers, and similarly on write-out.  It's very unclear what to do
> about this, since the kernel and PostgreSQL don't have intimate
> knowledge of what each other are doing, but it would be nice to solve
> somehow.
>



 There you have a much harder algorithmic problem.

 You can basically control duplication with fadvise and WONTNEED. The
 problem here is not the kernel and whether or not it allows postgres
 to be smart about it. The problem is... what kind of smarts
 (algorithm) to use.

>>>
>>>
>>> Isn't this a fairly simple matter of when we read a page into shared
>>> buffers
>>> tell the kernel do forget that page? And a corollary to that for when we
>>> dump a page out of shared_buffers (here kernel, please put this back into
>>> your cache).
>>>
>>
>>
>> That's my point. In terms of kernel-postgres interaction, it's fairly
>> simple.
>>
>> What's not so simple, is figuring out what policy to use.
>
>
I think the above is pretty simple for both interaction (allow us to inject
a clean page into the file page cache) and policy (forget it after you hand
it to us, then remember it again when we hand it back to you clean).  And I
think it would pretty likely be an improvement over what we currently do.
 But I think it is probably the wrong way to get the improvement.  I think
the real problem is that we don't trust ourselves to manage more of the
memory ourselves.

As far as I know, we still don't have a publicly disclosable and readily
reproducible test case for the reports of performance degradation when we
have more than 8GB in shared_buffers.   If we had one of those, we could
likely reduce the double buffering problem by fixing our own scalability
issues and therefore taking responsibility for more of the data ourselves.



Remember,
>> you cannot tell the kernel to put some page in its page cache without
>> reading it or writing it. So, once you make the kernel forget a page,
>> evicting it from shared buffers becomes quite expensive.
>>
>
> Well, if we were to collaborate with the kernel community on this then
> presumably we can do better than that for eviction... even to the extent of
> "here's some data from this range in this file. It's (clean|dirty). Put it
> in your cache. Just trust me on this."


Which, in the case of it being clean, amounts to "Here is data we don't
want in memory any more because we think it is cold.  But we don't trust
ourselves, so please hold on to it anyway."  That might be a tough sell to
the kernel people.

 Cheers,

Jeff


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Andres Freund
On 2014-01-13 15:53:36 -0500, Trond Myklebust wrote:
> > I've wondered before if there wouldn't be a chance for postgres to say
> > "my dear OS, the file range 0-8192 of file x contains y, no need to
> > reread" and do that when we evict a page from s_b but I never dared to
> > actually propose that to kernel people...
> 
> O_DIRECT was specifically designed to solve the problem of double buffering 
> between applications and the kernel. Why are you not able to use that in 
> these situations?

Because we like to have the OS handle part of postgres' caching. For
one, it makes servers with several applications/databases much more
realistic without seriously overallocating memory, for another it's a
huge chunk of platform dependent code to get good performance
everywhere.
The above was explicitly not to avoid double buffering but to move a
buffer away from postgres' own buffers to the kernel's buffers once it's
not 100% clear we need it in buffers anymore.

Part of the reason this is being discussed is that people previously
suggested going the direct IO route, and some (most prominently
J. Corbet in
http://archives.postgresql.org/message-id/20131204083345.31c60dd1%40lwn.net
) disagreed because that goes the route of reinventing storage layers
everywhere without improving the common codepaths.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
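
To illustrate the platform-dependent surface Andres mentions: even the
simplest O_DIRECT read imposes alignment rules that buffered IO does
not — the buffer address, file offset, and transfer length typically
all have to be multiples of the device's logical block size, and the
exact requirements vary by filesystem and kernel version. A rough
sketch, assuming Linux (names are illustrative):

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ    8192
#define DIO_ALIGN 4096          /* a common logical block size; not universal */

/* Read one block with O_DIRECT.  Note everything that must be aligned. */
static int
direct_read_block(const char *path, off_t blkno, void **bufp)
{
    void       *buf = NULL;
    int         fd;
    ssize_t     nread;

    /* Buffer address, offset, and length all need alignment. */
    if (posix_memalign(&buf, DIO_ALIGN, BLCKSZ) != 0)
        return -1;

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
    {
        free(buf);
        return -1;
    }

    nread = pread(fd, buf, BLCKSZ, blkno * (off_t) BLCKSZ);
    close(fd);

    if (nread != BLCKSZ)
    {
        free(buf);
        return -1;
    }
    *bufp = buf;
    return 0;
}

And that is before read-ahead, IO scheduling, and writeback batching —
things the page cache provides for free — would have to be rebuilt in
userspace, which is the "reinventing storage layers" objection cited
above.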




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Heikki Linnakangas

On 01/13/2014 10:53 PM, Peter Geoghegan wrote:

On Mon, Jan 13, 2014 at 12:17 PM, Robert Haas  wrote:

For what it's worth, I agree with Heikki.  There's probably nothing
sensible an upsert can do if it conflicts with more than one tuple,
but if it conflicts with just exactly one, it oughta be OK.


If there is exactly one, *and* the existing value is exactly the same
as the value proposed for insertion (or, I suppose, a subset of the
existing value, but that's so narrow that it might as well not apply).
In short, when you're using an exclusion constraint as a unique
constraint. Which is very narrow indeed. Weighing the costs and the
benefits, that seems like far more cost than benefit, before we even
consider anything beyond simply explaining the applicability and
limitations of upserting with exclusion constraints. It's generally
far cleaner to define speculative insertion as something that happens
with unique indexes only.


Well, even if you don't agree that locking all the conflicting rows for 
update is sensible, it's still perfectly sensible to return the rejected 
rows to the user. For example, you're inserting N rows, and if some of 
them violate a constraint, you still want to insert the non-conflicting 
rows instead of rolling back the whole transaction.


- Heikki




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Jim Nasby

On 1/13/14, 2:37 PM, Claudio Freire wrote:

On Mon, Jan 13, 2014 at 5:32 PM, Jim Nasby  wrote:


That's my point. In terms of kernel-postgres interaction, it's fairly
simple.

What's not so simple, is figuring out what policy to use. Remember,
you cannot tell the kernel to put some page in its page cache without
reading it or writing it. So, once you make the kernel forget a page,
evicting it from shared buffers becomes quite expensive.



Well, if we were to collaborate with the kernel community on this then
presumably we can do better than that for eviction... even to the extent of
"here's some data from this range in this file. It's (clean|dirty). Put it
in your cache. Just trust me on this."



If I had a kernel developer hat, I'd put it on to say: I don't think
allowing that last bit is wise for a kernel.

It would violate oh-so-many separation rules and open an oh-so-big can-o-worms.


Yeah, if it were me I'd probably want to keep a hash of the page and its 
address and only accept putting a page back into the kernel if it matched my 
hash. Otherwise you'd just have to treat it as a write.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net
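
No such kernel interface exists today, so everything below is
hypothetical (all of the names are made up), but the check Jim sketches
is simple enough to write down: record a checksum when a clean page is
dropped, and accept the page back later only if its content still
matches; on a mismatch, fall back to treating the request as a write.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Hypothetical record kept for a dropped-but-reinjectable clean page. */
struct forgotten_page
{
    uint64_t    file_id;
    uint64_t    offset;
    uint64_t    checksum;       /* recorded at eviction time */
};

/* FNV-1a, as a placeholder for whatever the kernel would really use. */
static uint64_t
page_checksum(const unsigned char *page)
{
    uint64_t    h = 0xcbf29ce484222325ULL;
    size_t      i;

    for (i = 0; i < PAGE_SIZE; i++)
        h = (h ^ page[i]) * 0x100000001b3ULL;
    return h;
}

/* Accept a clean page back only if it is bit-for-bit what was dropped;
 * otherwise the caller must treat the request as a write. */
static bool
reinject_clean_page(const struct forgotten_page *fp,
                    const unsigned char *page)
{
    return page_checksum(page) == fp->checksum;
}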




Re: [HACKERS] PoC: Partial sort

2014-01-13 Thread Marti Raudsepp
Hi Alexander,

First, thanks a lot for working on this feature. This PostgreSQL
shortcoming crops up all the time in web applications that implement
paging by multiple sorted columns.

I've been trying it out in a few situations. I implemented a new
enable_partialsort GUC to make it easier to turn on/off, this way it's a
lot easier to test. The attached patch applies on top of
partial-sort-5.patch
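
Such a planner GUC conventionally looks like the existing enable_*
entries — the following is a sketch of the usual shape in guc.c, not
necessarily what the attached patch does:

/* In guc.c, among the other enable_* planner entries (struct config_bool): */
{
    {"enable_partialsort", PGC_USERSET, QUERY_TUNING_METHOD,
        gettext_noop("Enables the planner's use of partial sort steps."),
        NULL
    },
    &enable_partialsort,
    true,
    NULL, NULL, NULL
},

/* ... plus the variable itself, conventionally defined in costsize.c
 * and declared extern in the matching header: */
bool        enable_partialsort = true;

With that in place, "set enable_partialsort = off" flips the behavior
at run time, as in the timings below.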

I will spend more time reviewing the patch, but some of this planner code
is over my head. If there's any way I can help to make sure this lands in
the next version, let me know.



The patch performs just as well as I would expect it to:

marti=# select ac.name, r.name from artist_credit ac join release r on (
ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 9.830 ms
marti=# set enable_partialsort = off;
marti=# select ac.name, r.name from artist_credit ac join release r on (
ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 1442.815 ms

A difference of almost 150x!

There's a missed opportunity in that the code doesn't consider pushing new
Sort steps into subplans. For example, if there's no index on
language(name) then this query cannot take advantage of partial sorts:

marti=# explain select l.name, r.name from language l join release r on (
l.id=r.language) order by l.name, r.name limit 1000;
 Limit  (cost=123203.20..123205.70 rows=1000 width=32)
   ->  Sort  (cost=123203.20..126154.27 rows=1180430 width=32)
 Sort Key: l.name, r.name
 ->  Hash Join  (cost=229.47..58481.49 rows=1180430 width=32)
   Hash Cond: (r.language = l.id)
   ->  Seq Scan on release r  (cost=0.00..31040.10 rows=1232610
width=26)
   ->  Hash  (cost=131.43..131.43 rows=7843 width=14)
 ->  Seq Scan on language l  (cost=0.00..131.43
rows=7843 width=14)

But because there are so few languages, it would be a lot faster to
sort the languages in advance and then do a partial sort:
 Limit  (rows=1000 width=31)
   ->  Partial sort  (rows=1180881 width=31)
 Sort Key: l.name, r.name
 Presorted Key: l.name
 ->  Nested Loop  (rows=1180881 width=31)
   ->  Sort  (rows=7843 width=10)
 Sort Key: name
 ->  Seq Scan on language  (rows=7843 width=14)
   ->  Index Scan using release_language_idx on release r
(rows=11246 width=25)
 Index Cond: (language = l.id)

Even an explicit sorted CTE cannot take advantage of partial sorts:
marti=# explain with sorted_lang as (select id, name from language order by
name)
marti-# select l.name, r.name from sorted_lang l join release r on
(l.id=r.language)
order by l.name, r.name limit 1000;
 Limit  (cost=3324368.83..3324371.33 rows=1000 width=240)
   CTE sorted_lang
 ->  Sort  (cost=638.76..658.37 rows=7843 width=14)
   Sort Key: language.name
   ->  Seq Scan on language  (cost=0.00..131.43 rows=7843 width=14)
   ->  Sort  (cost=3323710.46..3439436.82 rows=46290543 width=240)
 Sort Key: l.name, r.name
 ->  Merge Join  (cost=664.62..785649.92 rows=46290543 width=240)
   Merge Cond: (r.language = l.id)
   ->  Index Scan using release_language_idx on release r
(cost=0.43..87546.06 rows=1232610 width=26)
   ->  Sort  (cost=664.19..683.80 rows=7843 width=222)
 Sort Key: l.id
 ->  CTE Scan on sorted_lang l  (cost=0.00..156.86
rows=7843 width=222)

But even with these limitations, this will easily be the killer feature of
the next release, for me at least.

Regards,
Marti


On Mon, Jan 13, 2014 at 8:01 PM, Alexander Korotkov wrote:

> On Tue, Dec 31, 2013 at 5:41 AM, Andreas Karlsson wrote:
>
>> On 12/29/2013 08:24 AM, David Rowley wrote:
>>
>>> If it was possible to devise some way to reuse any
>>> previous tuplesortstate perhaps just inventing a reset method which
>>> clears out tuples, then we could see performance exceed the standard
>>> seqscan -> sort. The code the way it is seems to look up the sort
>>> functions from the syscache for each group and then allocate some sort
>>> space, so quite a bit of time is also spent in palloc0() and pfree()
>>>
>>> If it was not possible to do this then maybe adding a cost to the number
>>> of sort groups would be better so that the optimization is skipped if
>>> there are too many sort groups.
>>>
>>
>> It should be possible. I have hacked a quick proof of concept for reusing
>> the tuplesort state. Can you try it and see if the performance regression
>> is fixed by this?
>>
>> One thing which has to be fixed with my patch is that we probably want
>> to close the tuplesort once we have returned the last tuple from ExecSort().
>>
>> I have attached my patch and the incremental patch on Alexander's patch.
>
>
> Thanks. It's included in the attached version of the patch, along with
> estimation improvements, more comments, and a regression test fix.
>
> --
> With best regards,
> Alexander Korotkov

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Peter Geoghegan
On Mon, Jan 13, 2014 at 12:17 PM, Robert Haas  wrote:
> For what it's worth, I agree with Heikki.  There's probably nothing
> sensible an upsert can do if it conflicts with more than one tuple,
> but if it conflicts with just exactly one, it oughta be OK.

If there is exactly one, *and* the existing value is exactly the same
as the value proposed for insertion (or, I suppose, a subset of the
existing value, but that's so narrow that it might as well not apply).
In short, when you're using an exclusion constraint as a unique
constraint. Which is very narrow indeed. Weighing the costs and the
benefits, that seems like far more cost than benefit, before we even
consider anything beyond simply explaining the applicability and
limitations of upserting with exclusion constraints. It's generally
far cleaner to define speculative insertion as something that happens
with unique indexes only.

-- 
Peter Geoghegan




Re: [HACKERS] Planning time in explain/explain analyze

2014-01-13 Thread Robert Haas
On Mon, Jan 13, 2014 at 3:40 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Mon, Jan 13, 2014 at 3:23 PM, Tom Lane  wrote:
>>> Meh.  Why?  This would only come into play for EXPLAIN EXECUTE stmtname.
>>> I don't think users would be surprised to see a report of minimal planning
>>> time for that.  In fact, it might be a good thing, as it would make it
>>> easier to tell the difference between whether you were seeing a generic
>>> plan or a custom plan for the prepared statement.
>
>> It would also make it easier to be wrong.  If you want to display that
>> information explicitly, fine.  But asking the user to use the elapsed
>> time to guess whether or not we really planned anything is just going
>> to confuse people who don't have enough experience with the system to
>> know what the boundary is between the largest time that could be a
>> cache lookup and the smallest time that could be real planning
>> activity.  And that means virtually everyone, me included.
>
> If you're saying that you'd like EXPLAIN to explicitly mention whether
> the plan was cached or custom, I don't have any great complaint about
> that.  I'm just not seeing how you arrive at the conclusion that we
> mustn't report the amount of time EXPLAIN spent to get the plan.
> If we do what you're proposing we'll just have a different set of confused
> users, who will be wondering how EXPLAIN could have managed to spend
> 100 msec planning something when the EXPLAIN only took 10 msec in toto
> according to psql.

What I'm saying is that if EXPLAIN reports something that's labelled
"Planning Time", it should *be* the planning time, and not anything
else.  When we retrieve a plan from cache, it would be sensible not to
report the planning time at all, and IMHO it would also be sensible to
report the time it actually took to plan whenever we originally did
it.  But reporting a value that is not the planning time and calling
it the planning time does not seem like a good idea to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-01-13 Thread Robert Haas
On Sat, Jan 11, 2014 at 1:08 AM, Amit Kapila  wrote:
> Yes, currently this applies to update; what I have in mind is that
> in the future, if someone wants to use WAL compression for any other
> operation, like 'full_page_writes', it can be easily extended.
>
> To be honest, I have not evaluated whether such a flag or compression
> would make sense for full page writes, but I think it should be possible
> while doing full page write (BkpBlock has RelFileNode) to check such a
> flag if it's present.

Makes sense.

> The reason for adding the same chunk at the head of the list is that it
> uses the same technique as pglz_hist_add. In pglz, steps c~f are not
> repeated, because its concept of a good_match gets this done in one go.
>
> That being said, I am really not sure how much real-world data falls
> into the above category, or whether we should try to optimize based on
> the above example, but yes, it will save some CPU cycles in the current
> test we are using.

In the Rabin algorithm, we shouldn't try to find a longer match.  The
match should end at the chunk end, period.  Otherwise, you lose the
shift-resistant property of the algorithm.

>>But I do think there might be a bug here, which is
>> that, unless I'm misinterpreting something, hp is NOT the end of the
>> chunk.  After calling pgrb_hash_init(), we've looked at the first FOUR
>> bytes of the input.  If we find that we have a zero hash value at that
>> point, shouldn't the chunk size be 4, not 1?  And similarly if we find
>> it after sucking in one more byte, shouldn't the chunk size be 5, not
>> 2?  Right now, we're deciding where the chunks should end based on the
>> data in the chunk plus the following 3 bytes, and that seems wonky.  I
>> would expect us to include all of those bytes in the chunk.
>
> It depends on how we define a chunk; basically, the chunk size is based
> on the byte for which we compute the hindex. The hindex for any byte is
> calculated from that byte and the following 3 bytes, so after calling
> pgrb_hash_init(), even though we have looked at 4 bytes, the hindex is
> still for the first byte, and that's why it considers the chunk size to
> be 1, not 4.
>
> Isn't that similar to how the current pglz works? It also uses the next
> 4 bytes to calculate an index (pglz_hist_idx) but still does byte-by-byte
> comparison; if we try to map that onto Rabin's delta encoding, the chunk
> size is always 1.

I don't quite understand this.  The point of the Rabin algorithm is to
split the old tuple up into chunks and then search for those chunks in
the new tuple.  For example, suppose the old tuple is
abcdefghijklmnopqrstuvwxyz.  It might get split like this: abcdefg
hijklmnopqrstuvw xyz.  If any of those three chunks appear in the new
tuple, then we'll use them for compression.  If not, we'll just copy
the literal bytes.  If the chunks appear in the new tuple reordered or
shifted or with stuff inserted between one chunk and the next, we'll
still find them.  Unless I'm confused, which is possible, what you're
doing is essentially looking at the string and splitting it in those
three places, but then recording the chunks as being three bytes
shorter than they really are.  I don't see how that can be right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
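
A simplified sketch of the chunking rule as Robert describes it — this
is not the patch's pgrb_* code, and the hash below is a stand-in for a
real rolling hash. A chunk boundary is declared wherever the low bits
of the hash over a 4-byte window are zero, and the window bytes belong
to the chunk they terminate, so the chunk size in the pgrb_hash_init()
example above would be 4 or 5, not 1 or 2. Because boundaries depend
only on local content, insertions disturb only nearby chunks — the
shift-resistant property at stake:

#include <stddef.h>
#include <stdint.h>

#define WINDOW  4
#define MASK    0x1F            /* average chunk length of ~32 bytes */

/* Stand-in for the real rolling hash over a 4-byte window. */
static uint32_t
window_hash(const unsigned char *p)
{
    return ((uint32_t) p[0] << 24) ^ ((uint32_t) p[1] << 16) ^
           ((uint32_t) p[2] << 8) ^ (uint32_t) p[3];
}

/* Length of the chunk starting at data[0]: scan until the hash of the
 * current window hits the boundary condition, and include the window
 * itself in the chunk. */
static size_t
next_chunk_len(const unsigned char *data, size_t len)
{
    size_t      i;

    for (i = 0; i + WINDOW <= len; i++)
    {
        if ((window_hash(data + i) & MASK) == 0)
            return i + WINDOW;
    }
    return len;                 /* leftover bytes form the final chunk */
}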




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Andres Freund
On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner  wrote:
> > I notice, Josh, that you didn't mention the problems many people
> > have run into with Transparent Huge Page defrag and with NUMA
> > access.
> 
> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
> setting zone_reclaim_mode; is there some other problem besides that?

I think that fixes some of the worst instances, but I've seen machines
spending horrible amounts of CPU (& BUS) time in page reclaim
nonetheless. If I analyzed it correctly it's in RAM << working set
workloads where RAM is pretty large and most of it is used as page
cache. The kernel ends up spending a huge percentage of time finding and
potentially defragmenting pages when looking for victim buffers.

> On a related note, there's also the problem of double-buffering.  When
> we read a page into shared_buffers, we leave a copy behind in the OS
> buffers, and similarly on write-out.  It's very unclear what to do
> about this, since the kernel and PostgreSQL don't have intimate
> knowledge of what each other are doing, but it would be nice to solve
> somehow.

I've wondered before if there wouldn't be a chance for postgres to say
"my dear OS, the file range 0-8192 of file x contains y, no need to
reread" and do that when we evict a page from s_b but I never dared to
actually propose that to kernel people...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Planning time in explain/explain analyze

2014-01-13 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jan 13, 2014 at 3:23 PM, Tom Lane  wrote:
>> Meh.  Why?  This would only come into play for EXPLAIN EXECUTE stmtname.
>> I don't think users would be surprised to see a report of minimal planning
>> time for that.  In fact, it might be a good thing, as it would make it
>> easier to tell the difference between whether you were seeing a generic
>> plan or a custom plan for the prepared statement.

> It would also make it easier to be wrong.  If you want to display that
> information explicitly, fine.  But asking the user to use the elapsed
> time to guess whether or not we really planned anything is just going
> to confuse people who don't have enough experience with the system to
> know what the boundary is between the largest time that could be a
> cache lookup and the smallest time that could be real planning
> activity.  And that means virtually everyone, me included.

If you're saying that you'd like EXPLAIN to explicitly mention whether
the plan was cached or custom, I don't have any great complaint about
that.  I'm just not seeing how you arrive at the conclusion that we
mustn't report the amount of time EXPLAIN spent to get the plan.
If we do what you're proposing we'll just have a different set of confused
users, who will be wondering how EXPLAIN could have managed to spend
100 msec planning something when the EXPLAIN only took 10 msec in toto
according to psql.

regards, tom lane




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Claudio Freire
On Mon, Jan 13, 2014 at 5:32 PM, Jim Nasby  wrote:
>>
>> That's my point. In terms of kernel-postgres interaction, it's fairly
>> simple.
>>
>> What's not so simple, is figuring out what policy to use. Remember,
>> you cannot tell the kernel to put some page in its page cache without
>> reading it or writing it. So, once you make the kernel forget a page,
>> evicting it from shared buffers becomes quite expensive.
>
>
> Well, if we were to collaborate with the kernel community on this then
> presumably we can do better than that for eviction... even to the extent of
> "here's some data from this range in this file. It's (clean|dirty). Put it
> in your cache. Just trust me on this."


If I had a kernel developer hat, I'd put it on to say: I don't think
allowing that last bit is wise for a kernel.

It would violate oh-so-many separation rules and open an oh-so-big can-o-worms.



