Re: [HACKERS] patch: option --if-exists for pg_dump

2014-02-02 Thread Jeevan Chalke
Hi Peter,

I am not sure why you are getting an unstable build due to whitespace errors.

Are you referring to these lines?

*11:25:01* src/bin/pg_dump/pg_backup_archiver.c:477: indent with spaces.
*11:25:01* +         dropStmt,
*11:25:01* src/bin/pg_dump/pg_backup_archiver.c:478: indent with spaces.
*11:25:10* +         buffer,
*11:25:14* src/bin/pg_dump/pg_backup_archiver.c:479: indent with spaces.
*11:25:15* +         mark + l);
*11:25:15* + echo unstable


If yes, then in my latest attached patch, these lines are not there at all.
I noted in my comment that I had fixed these in my version of the patch,
but you still got an unstable build; I am not sure how. It seems you are
applying the wrong patch.

Will you please let us know what's going wrong?

Thanks



On Thu, Jan 30, 2014 at 6:56 PM, Pavel Stehule wrote:

>
>
>
> 2014-01-30 Jeevan Chalke :
>
> OK.
>>
>> Assigned it to committer.
>>
>> Thanks for the hard work.
>>
>
> Thank you for review
>
> Pavel
>
>
>>
>>
>> On Thu, Jan 30, 2014 at 6:16 PM, Pavel Stehule 
>> wrote:
>>
>>> Hello
>>>
>>> All is ok
>>>
>>> Thank you
>>>
>>> Pavel
>>>
>>
>> --
>> Jeevan B Chalke
>> Principal Software Engineer, Product Development
>> EnterpriseDB Corporation
>> The Enterprise PostgreSQL Company
>>
>>
>


-- 
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb



Re: [HACKERS] GIN improvements part2: fast scan

2014-02-02 Thread Oleg Bartunov
Tomas, it'd be nice if you used real data in your testing.

One very good application of GIN fast scan is the dramatic performance
improvement of the hstore/jsonb @> operator; see slides 57 and 58 of
http://www.sai.msu.su/~megera/postgres/talks/hstore-dublin-2013.pdf.
I'd like not to lose this benefit :)

Oleg

PS. I used data delicious-rss-1250k.gz from
http://randomwalker.info/data/delicious/

On Mon, Feb 3, 2014 at 5:44 AM, Tomas Vondra  wrote:
> On 3.2.2014 00:13, Tomas Vondra wrote:
>> On 2.2.2014 11:45, Heikki Linnakangas wrote:
>>> On 01/30/2014 01:53 AM, Tomas Vondra wrote:
 (3) A file with explain plans for 4 queries suffering ~2x slowdown,
  and explain plans with 9.4 master and Heikki's patches is available
  here:

http://www.fuzzy.cz/tmp/gin/queries.txt

  All the queries have 6 common words, and the explain plans look
  just fine to me - exactly like the plans for other queries.

  Two things now caught my eye. First some of these queries actually
  have words repeated - either exactly like "database & database" or
  in negated form like "!anything & anything". Second, while
  generating the queries, I use "dumb" frequency, where only exact
  matches count. I.e. "write != written" etc. But the actual number
  of hits may be much higher - for example "write" matches exactly
  just 5% documents, but using @@ it matches more than 20%.

  I don't know if that's the actual cause though.
>>>
>>> Ok, here's another variant of these patches. Compared to git master, it
>>> does three things:
>>>
>>> 1. It adds the concept of ternary consistent function internally, but no
>>> catalog changes. It's implemented by calling the regular boolean
>>> consistent function "both ways".
>>>
>>> 2. Use a binary heap to get the "next" item from the entries in a scan.
>>> I'm pretty sure this makes sense, because arguably it makes the code
>>> more readable, and reduces the number of item pointer comparisons
>>> significantly for queries with a lot of entries.
>>>
>>> 3. Only perform the pre-consistent check to try skipping entries, if we
>>> don't already have the next item from the entry loaded in the array.
>>> This is a tradeoff, you will lose some of the performance gain you might
>>> get from pre-consistent checks, but it also limits the performance loss
>>> you might get from doing useless pre-consistent checks.
>>>
>>> So taken together, I would expect this patch to make some of the
>>> performance gains less impressive, but also limit the loss we saw with
>>> some of the other patches.
>>>
>>> Tomas, could you run your test suite with this patch, please?
>>
>> Sure, will do. Do I get it right that this should be applied instead of
>> the four patches you've posted earlier?
>
> So, I was curious and did a basic testing - I've repeated the tests on
> current HEAD and 'HEAD with the new patch'. The complete data are
> available at [http://www.fuzzy.cz/tmp/gin/gin-scan-benchmarks.ods] and
> I've updated the charts at [http://www.fuzzy.cz/tmp/gin/] too.
>
> Look for branches named 9.4-head-2 and 9.4-heikki-2.
>
> To me it seems that:
>
> (1) The main issue was that with common words, it used to be much
> slower than HEAD (or 9.3). This seems to be fixed, i.e. it's not
> slower than before. See
>
>   http://www.fuzzy.cz/tmp/gin/3-common-words.png (previous patch)
>   http://www.fuzzy.cz/tmp/gin/3-common-words-new.png (new patch)
>
> for comparison vs. 9.4 HEAD. With the new patch there's no slowdown,
> which seems nice. Compared to 9.3 it looks like this:
>
>   http://www.fuzzy.cz/tmp/gin/3-common-words-new-vs-93.png
>
> so there's a significant speedup (thanks to the modified layout).
>
> (2) The question is whether the new patch works fine on rare words. See
> this for comparison of the patches against HEAD:
>
>   http://www.fuzzy.cz/tmp/gin/3-rare-words.png
>   http://www.fuzzy.cz/tmp/gin/3-rare-words-new.png
>
> and this is the comparison of the two patches:
>
>   http://www.fuzzy.cz/tmp/gin/patches-rare-words.png
>
> That seems fine to me - some queries are slower, but we're talking
> about queries taking 1 or 2 ms, so the measurement error is probably
> the main cause of the differences.
>
> (3) With higher numbers of frequent words, the differences (vs. HEAD or
> the previous patch) are not that dramatic as in (1) - the new patch
> is consistently by ~20% faster.
>
> Tomas
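A hedged sketch of the binary-heap idea in Heikki's point 2 quoted above
(an illustration of the technique, not the actual patch; ItemPointerCompare
is the real comparison routine from storage/itemptr.h, while the struct
names are mine):

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct
{
    ItemPointerData next;       /* next TID this entry stream would return */
    int             entryno;    /* which scan entry it came from */
} HeapNode;

typedef struct
{
    HeapNode   *nodes;          /* nodes[0] holds the smallest next TID */
    int         size;
} TidHeap;

static void
tidheap_sift_down(TidHeap *heap, int i)
{
    for (;;)
    {
        int         left = 2 * i + 1;
        int         right = 2 * i + 2;
        int         smallest = i;
        HeapNode    tmp;

        if (left < heap->size &&
            ItemPointerCompare(&heap->nodes[left].next,
                               &heap->nodes[smallest].next) < 0)
            smallest = left;
        if (right < heap->size &&
            ItemPointerCompare(&heap->nodes[right].next,
                               &heap->nodes[smallest].next) < 0)
            smallest = right;
        if (smallest == i)
            break;
        tmp = heap->nodes[i];
        heap->nodes[i] = heap->nodes[smallest];
        heap->nodes[smallest] = tmp;
        i = smallest;
    }
}

/*
 * After consuming nodes[0], the caller loads that entry's following TID
 * into nodes[0].next and calls tidheap_sift_down(heap, 0) to restore the
 * heap property.  This is how a scan with N entries gets O(log N) item
 * pointer comparisons per returned item instead of O(N).
 */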




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-02-02 Thread Amit Kapila
On Wed, Jan 15, 2014 at 2:43 AM, Simon Riggs  wrote:
> On 8 January 2014 08:33, Simon Riggs  wrote:
>
> Patch attached, implemented to reduce writes by SELECTs only.

This is really a valuable improvement over the current SELECT behaviour
w.r.t. writes.

While going through the patch, I observed a few points, so thought of
sharing them with you:

1.
+ /*
+ * If we are tracking pruning in SELECTs then we can only get
+ * here by heap_page_prune_opt() call that cleans a block,
+ * so in that case, register it as a pruning operation.
+ * Make sure we don't double count during VACUUMs.
+ */
+ if (PrunePageDirtyLimit > -1)
+ PrunePageDirty++;

a. As the PrunePageDirtyLimit variable is not initialized for the DDL flow,
   any statement like Create Function() .. will have a PrunePageDirtyLimit
   value of 4 (the default), and in such cases MarkBufferDirty()
   will increment the wrong counter.

b. DDL statements like Create Materialized View will behave as
   Select statements.
   Ex.
   Create Materialized view mv1 as select * from t1;

   Now here I think it might not be a problem, because there will be no
   write to t1 anyway, so skipping pruning should not be a problem, and for
   materialized views there will also be no dead rows, so skipping should be
   okay. But it is not strictly adhering to the statement "to reduce writes
   by SELECTs only" or to the purpose of the patch, which is to skip pruning
   only when the top-level statement is a SELECT.
   Do you think it's better to consider such cases and optimize for them,
   or should we avoid that by following the rule of thumb that pruning is
   skipped only for top-level SELECTs?

2. + "Allow cleanup of shared buffers by foreground processes, allowing
later cleanup by VACUUM",
This line is not clear, what do you mean to say by "allowing later cleanup
by VACUUM", if already foreground process has done cleanup, then it
should save effort of Vacuum.


In general, though both the optimisations used in the patch
(allow_buffer_cleanup and prune_page_dirty_limit) are similar in the sense
that they are used to avoid pruning, I feel they target different
cases (read-only operations and writes on small tables), and as more
people are inclined to do this only for SELECT operations, do you think
it would be a good idea to make them separate patches?

I think there can be some applications or use cases that benefit from
avoiding pruning for writes on small tables, but the SELECT case
is more general and more applications can benefit from that optimisation, so
it would be better if we first try to accomplish that case.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] FOR [SHARE|UPDATE] NOWAIT may still block in EvalPlanQualFetch

2014-02-02 Thread Craig Ringer
On 02/01/2014 05:28 AM, Bruce Momjian wrote:
> On Fri, Aug  2, 2013 at 04:00:03PM +0800, Craig Ringer wrote:
>> FOR SHARE|UPDATE NOWAIT will still block if they have to follow a ctid
>> chain because the call to EvalPlanQualFetch doesn't take a param for
>> noWait, so it doesn't know not to block if the updated row can't be locked.
>>
>> The attached patch against master includes an isolationtester spec to
>> demonstrate this issue and a proposed fix. Builds with the fix applied
>> pass "make check" and isolationtester "make installcheck".
>>
>> To reach this point you need to apply the patch in
>> http://www.postgresql.org/message-id/51fb4305.3070...@2ndquadrant.com
>> first, otherwise you'll get stuck there and won't touch the code changed
>> in this patch.
> 
> The above looks like a legitimate patch that was not applied:
> 
>   http://www.postgresql.org/message-id/51fb6703.9090...@2ndquadrant.com
> 
> The patch mentioned in the text above was applied, I think.


The first patch, linked to in the text, was committed as
706f9dd914c64a41e06b5fbfd62d6d6dab43eeb8.

I can't see the second one in the history. It'd be good to get it
committed, though the issue is obviously not causing any great outcry.

It was detected when testing a high-rate queueing system in PostgreSQL.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] narwhal and PGDLLIMPORT

2014-02-02 Thread Peter Geoghegan
On Sun, Feb 2, 2014 at 5:22 PM, Craig Ringer  wrote:
> I'm not a fan of MinGW (gcc) on Windows, it's a right pain. It's also
> the only open source compiler currently supported for PostgreSQL on
> Windows - practically the only one available. I don't know about you,
> but I'm not too keen on assuming Microsoft will continue to offer free
> toolchains that're usable for our purposes. They're crippling their free
> build tools more and more with each release, which isn't a good trend.

I was under the impression that Microsoft had finally come around to
implementing a few C99 features in Visual Studio 2013 precisely
because they want there to be an open source ecosystem on Windows.


-- 
Peter Geoghegan




Re: [HACKERS] narwhal and PGDLLIMPORT

2014-02-02 Thread Tom Lane
Amit Kapila  writes:
> Also as per below link, it seems that PGDLLIMPORT is required for
> exported global variables.
> http://msdn.microsoft.com/en-us/library/8fskxacy(v=vs.80).aspx

That was what we'd always assumed, but the fact that postgres_fdw has been
working for the last year (on everything except narwhal) puts the lie to
that as a blanket assumption.

So what I want to know now is how come it wasn't failing; because then
we might be able to exploit that to eliminate the need for the PGDLLIMPORT
cruft everywhere.  I've never been a fan of the fact that Windows insists
on its own private set of global symbol visibility rules.  If we can make
that build work more like other platforms, it'll be a step forward IMO.

regards, tom lane




[HACKERS] Re: [review] PostgreSQL Service on Windows does not start if data directory given is relative path

2014-02-02 Thread Rajeev rastogi

On 1st February 2014, MauMau wrote:
 
> I reviewed the patch content.  I find this fix useful.
> 
> I'd like to suggest some code improvements.  I'll apply and test the
> patch when I receive your reply.

Thanks for reviewing the patch.

> (1)
> I think it is appropriate to place find_my_abs_path() in path.c rather
> than
> exec.c.  Please look at the comments at the beginning of those files.
> exec.c handles functions related to executables, while path.c handles
> general functions handling paths.

I have moved this function to path.c

> It's better to rename the function to follow the naming of other
> functions
> in path.c, something like get_absolute_path() or so.  Unfortunately, we
> cannot use make_absolute_path() as the name because it is used in
> src/backend/utils/init/miscinit.c, which conflicts in the backend
> module.

Renamed the function as get_absolute_path.

> (2)
> In pg_ctl.c, dbpath can be better named as datadir, because it holds
> data
> directory location.  dbpath is used to mean some different location in
> other
> source files.

Renamed as dataDir.

> (3)
> find_my_abs_path() had better not call make_native_path() because the
> role
> of this function should be to just return an absolute path.  pg_ctl.c
> can
> instead call make_native_path() after find_my_abs_path().

Changed as per suggestion.

> (4)
> find_my_abs_path() should not call resolve_symlinks().  For reference,
> look
> at make_absolute_path() in src/backend/utils/init/miscinit.c and
> src/test/regress/pg_regress.c.  I guess the reason is that if the user
> changed to the directory through a symbolic link, we should retain the
> symbolic link name.

Changed as per suggestion.

> (5)
> Change "file/path" in the comment of find_my_abs_path() to "path",
> because
> file is also a path.

Changed as per suggestion.

Please find the attached revised patch.
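For readers without the attachment, a rough sketch of the agreed shape
(assumptions: POSIX-only absolute-path test, illustrative error handling,
and standard-library calls in place of src/port helpers; the real patch
must also handle Windows drive letters):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static bool
get_absolute_path(const char *path, char *abs_path, size_t size)
{
    if (path[0] == '/')         /* already absolute (POSIX spelling) */
    {
        snprintf(abs_path, size, "%s", path);
        return true;
    }

    /* Prefix the current directory; no resolve_symlinks(), per point (4),
     * and no make_native_path(), per point (3) -- the caller does that. */
    if (getcwd(abs_path, size) == NULL)
        return false;

    snprintf(abs_path + strlen(abs_path), size - strlen(abs_path),
             "/%s", path);
    return true;
}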

Thanks and Regards,
Kumar Rajeev Rastogi




pgctl_win32service_rel_dbpath_v3.patch
Description: pgctl_win32service_rel_dbpath_v3.patch



Re: [HACKERS] narwhal and PGDLLIMPORT

2014-02-02 Thread Amit Kapila
On Sun, Feb 2, 2014 at 6:33 AM, Tom Lane  wrote:
> I happened to notice today that the owner of buildfarm member narwhal
> is trying to revive it after a long time offline, but it's failing in
> the 9.3 branch (and not attempting to build HEAD, yet).  The cause
> appears to be that contrib/postgres_fdw is referencing the DateStyle
> and IntervalStyle variables, which aren't marked PGDLLIMPORT.
> Hm, well, that would be an easy change ... but that code was committed
> last March.  How is it that we didn't notice this long ago?
>
> What this seems to indicate is that narwhal is the only buildfarm
> animal that has a need for PGDLLIMPORT marks on global variables that
> are referenced by extensions.

I have tried to debug the part of the code where we are using the DateStyle
and IntervalStyle variables in postgres_fdw on my machine
(Win 7 - 64 bit, MSVC 2010) and found that their values are junk; to me the
problem looks exactly like what we saw for the test_shm_mq
module a few days back. I am not sure why it is passing on other
buildfarm machines, but I think we need to put PGDLLIMPORT on the global
variables we want to use in extensions.
Also as per below link, it seems that PGDLLIMPORT is required for
exported global variables.
http://msdn.microsoft.com/en-us/library/8fskxacy(v=vs.80).aspx

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] narwhal and PGDLLIMPORT

2014-02-02 Thread Amit Kapila
On Mon, Feb 3, 2014 at 6:52 AM, Craig Ringer  wrote:
> On 02/02/2014 09:03 AM, Tom Lane wrote:
>> According to the buildfarm database, narwhal is running a gcc build on
>> Windows 2003.  That hardly seems like a mainstream use case.  I could
>> believe that it might be of interest to developers, but clearly no
>> developers are actually running such a build.
>>
>> I think we should give serious consideration to desupporting this
>> combination
>
> I'm not a fan of MinGW (gcc) on Windows, it's a right pain. It's also
> the only open source compiler currently supported for PostgreSQL on
> Windows - practically the only one available. I don't know about you,
> but I'm not too keen on assuming Microsoft will continue to offer free
> toolchains that're usable for our purposes. They're crippling their free
> build tools more and more with each release, which isn't a good trend.
>
> If you wish to eliminate PGDLLIMPORT from the codebase the correct
> approach would be building with --export-all-symbols (a MinGW extension
> flag to gcc). That would make the MinGW builds consistent with the MSVC
> build, which generates a .def file that exports all symbols.
>
> As for why PGDLLIMPORT appears to be required in some places on the MSVC
> build, so far it's looking like we auto-export functions, but not
> necessarily variables.

I could see both the variables (DateStyle and IntervalStyle) that Tom
points out, both in the .def file and by using the dumpbin utility.
I think these symbols are exported; the main reason for the failure is that
they are not imported in postgres_fdw, as we have not used PGDLLIMPORT.

Looking at the code below, as per my understanding, BUILDING_DLL is not
defined for contrib modules, so using PGDLLIMPORT leads to the required
variables being imported.

#ifdef BUILDING_DLL
#define PGDLLIMPORT __declspec (dllexport)
#else /* not BUILDING_DLL */
#define PGDLLIMPORT __declspec (dllimport)
#endif


We need this for variables, but it is optional for functions. Please refer
to the link below:
http://msdn.microsoft.com/en-us/library/8fskxacy(v=vs.80).aspx
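For illustration (an assumed sketch, not a quote from the thread or the
headers): DateStyle and IntervalStyle are the backend globals in question,
and in extension code, where BUILDING_DLL is not defined, the marked
declarations expand to __declspec(dllimport) so the linker imports the
variables correctly.

#include "postgres.h"

extern PGDLLIMPORT int DateStyle;      /* dllimport in extension builds */
extern PGDLLIMPORT int IntervalStyle;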


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] narwhal and PGDLLIMPORT

2014-02-02 Thread Craig Ringer
On 02/03/2014 11:10 AM, Ashesh Vashi wrote:
> On Mon, Feb 3, 2014 at 6:52 AM, Craig Ringer  wrote:
> 
> On 02/02/2014 09:03 AM, Tom Lane wrote:
> > According to the buildfarm database, narwhal is running a gcc build on
> > Windows 2003.  That hardly seems like a mainstream use case.  I could
> > believe that it might be of interest to developers, but clearly no
> > developers are actually running such a build.
> >
> > I think we should give serious consideration to desupporting this
> > combination
> 
> I'm not a fan of MinGW (gcc) on Windows, it's a right pain. It's also
> the only open source compiler currently supported for PostgreSQL on
> Windows - practically the only one available. I don't know about you,
> but I'm not too keen on assuming Microsoft will continue to offer free
> toolchains that're usable for our purposes. They're crippling their free
> build tools more and more with each release, which isn't a good trend.
> 
> If you wish to eliminate PGDLLIMPORT from the codebase the correct
> approach would be building with --export-all-symbols (a MinGW extension
> flag to gcc). That would make the MinGW builds consistent with the MSVC
> build, which generates a .def file that exports all symbols.
> 
> As for why PGDLLIMPORT appears to be required in some places on the MSVC
> build, so far it's looking like we auto-export functions, but not
> necessarily variables.
> 
> In my experience, that is true.
> During some contrib module development, accessing a postgresql
> variable crashed on Windows, while the same variable was accessible on
> non-Windows platforms.

I think it's a good thing personally - we shouldn't be exporting every
little internal var in the symbol table.

If we built with -fvisibility=hidden on 'nix there'd be no need to
complain about commits going in on 'nix then breaking on Windows, 'cos
the 'nix build would break in the same place. That's all or nothing,
though; there's no "vars hidden, procs exported" option in gcc.
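For illustration (plain GCC semantics, not PostgreSQL-specific code):
under -fvisibility=hidden everything is hidden by default, and the only
per-symbol escape hatch is the visibility attribute, which applies equally
to variables and functions -- hence "all or nothing".

int private_counter;            /* hidden under -fvisibility=hidden */

__attribute__((visibility("default")))
int exported_setting;           /* explicitly re-exported variable */

__attribute__((visibility("default")))
int
exported_function(void)         /* explicitly re-exported function */
{
    return private_counter;
}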

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] inherit support for foreign tables

2014-02-02 Thread Etsuro Fujita

(2014/01/31 9:56), Robert Haas wrote:

On Thu, Jan 30, 2014 at 5:05 PM, Tom Lane  wrote:

Robert Haas  writes:

On Thu, Jan 30, 2014 at 11:04 AM, Tom Lane  wrote:

I think this is totally misguided.  Who's to say that some weird FDW
might not pay attention to attstorage?  I could imagine a file-based
FDW using that to decide whether to compress columns, for instance.
Admittedly, the chances of that aren't large, but it's pretty hard
to argue that going out of our way to prevent it is a useful activity.



I think that's a pretty tenuous position.  There are already
FDW-specific options sufficient to let a particular FDW store whatever
kinds of options it likes; letting the user set options that were only
ever intended to be applied to tables just because we can seems sort
of dubious.  I'm tempted by the idea of continuing to disallow SET
STORAGE on an unvarnished foreign table, but allowing it on an
inheritance hierarchy that contains at least one real table, with the
semantics that we quietly ignore the foreign tables and apply the
operation to the plain tables.


[ shrug... ] By far the easiest implementation of that is just to apply
the catalog change to all of them.  According to your assumptions, it'll
be a no-op on the foreign tables anyway.


Well, there's some point to that, too, I suppose.  What do others think?


Allowing ALTER COLUMN SET STORAGE on foreign tables would make sense if,
for example, "SELECT * INTO local_table FROM foreign_table" created a
new local table whose columns have the storage types associated with
those of the foreign table, wouldn't it?


Thanks,

Best regards,
Etsuro Fujita




Re: [HACKERS] [PATCH] pg_basebackup: progress report max once per second

2014-02-02 Thread Sawada Masahiko
On Sat, Feb 1, 2014 at 8:29 PM, Oskari Saarenmaa  wrote:
> 31.01.2014 10:59, Sawada Masahiko wrote:
>
> I think the idea in the new progress_report() call (with force == true) is
> to make sure that there is at least one progress_report call that actually
> writes the progress report.  Otherwise the final report may go missing if it
> gets suppressed by the time-based check.  The force argument as used in the
> new call skips that check.
>

I understood.

I have two concerns:
- progress_report() may be called twice in less than one second: the call
  with force == true can come right after a call with force == false that
  already wrote a report.
- progress_report() is called even if the -P option is disabled. I'm
  concerned that this could cause performance degradation.
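A hedged sketch of the rate-limiting pattern under discussion (the names
mirror pg_basebackup but this is illustrative, not the actual source). It
also shows both concerns above: a forced call can land within the same
second as a regular one, and the function is still entered when -P is off.

#include <stdbool.h>
#include <time.h>

static bool showprogress;       /* set when -P is given */

static void
progress_report(int tablespacenum, const char *filename, bool force)
{
    static time_t last_progress_report = 0;
    time_t        now = time(NULL);

    if (!showprogress)
        return;                 /* -P not given: nothing to print */

    /* Suppress reports within the same second, unless forced. */
    if (!force && now == last_progress_report)
        return;
    last_progress_report = now;

    /* ... format and print the progress line here ... */
}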

Regards,

---
Sawada Masahiko




Re: [HACKERS] narwhal and PGDLLIMPORT

2014-02-02 Thread Ashesh Vashi
On Mon, Feb 3, 2014 at 6:52 AM, Craig Ringer  wrote:

> On 02/02/2014 09:03 AM, Tom Lane wrote:
> > According to the buildfarm database, narwhal is running a gcc build on
> > Windows 2003.  That hardly seems like a mainstream use case.  I could
> > believe that it might be of interest to developers, but clearly no
> > developers are actually running such a build.
> >
> > I think we should give serious consideration to desupporting this
> > combination
>
> I'm not a fan of MinGW (gcc) on Windows, it's a right pain. It's also
> the only open source compiler currently supported for PostgreSQL on
> Windows - practically the only one available. I don't know about you,
> but I'm not too keen on assuming Microsoft will continue to offer free
> toolchains that're usable for our purposes. They're crippling their free
> build tools more and more with each release, which isn't a good trend.
>
> If you wish to eliminate PGDLLIMPORT from the codebase the correct
> approach would be building with --export-all-symbols (a MinGW extension
> flag to gcc). That would make the MinGW builds consistent with the MSVC
> build, which generates a .def file that exports all symbols.
>
> As for why PGDLLIMPORT appears to be required in some places on the MSVC
> build, so far it's looking like we auto-export functions, but not
> necessarily variables.

In my experience, that is true.
During some contrib module development, accessing a postgresql
variable crashed on Windows, while the same variable was accessible on
non-Windows platforms.

> I'd need to read the fairly scary MSVC build
> generator scripts in detail to confirm that, to see how they produce
> their DEF files; that'll have to wait until after I've got the
> row-security work sorted out.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>
>
>



-- 
Thanks & Regards,

Ashesh Vashi
EnterpriseDB INDIA: Enterprise PostgreSQL Company
http://www.linkedin.com/in/asheshvashi


Re: [HACKERS] WITH ORDINALITY planner improvements

2014-02-02 Thread Etsuro Fujita
(2014/02/01 8:01), Tom Lane wrote:
> Bruce Momjian  writes:
>> On Thu, Aug 15, 2013 at 07:25:17PM +0900, Etsuro Fujita wrote:
>>> Attached is an updated version of the patch.  In that version the code
>>> for the newly added function build_function_pathkeys() has been made
>>> more simple by using the macro INTEGER_BTREE_FAM_OID.
> 
>> Is this patch to be applied?
> 
> It hasn't been reviewed AFAIK.

Thanks for picking this up.

I think it doesn't need to be reviewed anymore, because the same feature
has been committed in [1], if I understand correctly.

[1]
http://www.postgresql.org/message-id/e1vjel1-0007x9...@gemulon.postgresql.org

Best regards,
Etsuro Fujita




Re: [HACKERS] plpgsql.warn_shadow

2014-02-02 Thread Marko Tiikkaja

Hi everyone,

Here's a new version of the patch.  Added new tests and docs and changed 
the GUCs per discussion.


plpgsql.warnings_as_errors only affects compilation at CREATE FUNCTION time:

=# set plpgsql.warnings to 'all';
SET
=#* set plpgsql.warnings_as_errors to true;
SET
=#* select foof(1); -- compiled since it's the first call in this session
WARNING:  variable "f1" shadows a previously defined variable
LINE 2: declare f1 int;
^
 foof
--

(1 row)

=#* create or replace function foof(f1 int) returns void as
$$
declare f1 int;
begin
end $$ language plpgsql;
ERROR:  variable "f1" shadows a previously defined variable
LINE 3: declare f1 int;


Currently, plpgsql_warnings is a boolean since there's only one warning 
we implement.  The idea is to make it a bit field of some kind in the 
future when we add more warnings.  Starting that work for 9.4 seemed 
like overkill, though.  I tried to keep things simple.



Regards,
Marko Tiikkaja
*** a/doc/src/sgml/plpgsql.sgml
--- b/doc/src/sgml/plpgsql.sgml
***************
*** 4712,4717 **** a_output := a_output || $$ if v_$$ || referrer_keys.kind || $$ like '$$
--- 4712,4762 ----

  

+   
+Warnings
+ 
+
+ To aid the user in finding instances of simple but common problems before
+ they cause harm, PL/PgSQL provides a number of
+ warnings.  When enabled, a WARNING is emitted
+ during the compilation of a function.  Optionally, if
+ plpgsql.warnings_as_errors is set, an ERROR is
+ raised when attempting to create a function which would otherwise result in
+ a warning.
+
+ 
+  
+   Warnings are enabled through the configuration variable
+   plpgsql.warnings.  It can be set either to a comma-separated list
+   of warnings, "none" or "all".  The default is
+   "none".  Currently the list of available warnings includes only
+   one:
+   
+
+ shadow
+ 
+  
+   Warns when a declaration shadows a previously defined variable.  For
+   example:
+ 
+ CREATE FUNCTION foo(f1 int) RETURNS int AS $$
+ DECLARE
+ f1 int;
+ BEGIN
+ RETURN f1;
+ END
+ $$ LANGUAGE plpgsql;
+ WARNING:  variable "f1" shadows a previously defined variable
+ LINE 3: f1 int;
+ ^
+ CREATE FUNCTION
+ 
+  
+ 
+
+   
+  
+  
   
  

*** a/src/pl/plpgsql/src/pl_comp.c
--- b/src/pl/plpgsql/src/pl_comp.c
***************
*** 352,357 **** do_compile(FunctionCallInfo fcinfo,
--- 352,360 ----
function->out_param_varno = -1; /* set up for no OUT param */
function->resolve_option = plpgsql_variable_conflict;
function->print_strict_params = plpgsql_print_strict_params;
+   function->warnings = plpgsql_warnings;
+   /* only promote warnings to errors at CREATE FUNCTION time */
+   function->warnings_as_errors = plpgsql_warnings_as_errors && forValidator;
  
if (is_dml_trigger)
function->fn_is_trigger = PLPGSQL_DML_TRIGGER;
***************
*** 849,854 **** plpgsql_compile_inline(char *proc_source)
--- 852,860 ----
function->out_param_varno = -1; /* set up for no OUT param */
function->resolve_option = plpgsql_variable_conflict;
function->print_strict_params = plpgsql_print_strict_params;
+   function->warnings = plpgsql_warnings;
+   /* never promote warnings to errors */
+   function->warnings_as_errors = false;
  
plpgsql_ns_init();
plpgsql_ns_push(func_name);
*** a/src/pl/plpgsql/src/pl_gram.y
--- b/src/pl/plpgsql/src/pl_gram.y
***************
*** 727,732 **** decl_varname	: T_WORD
--- 727,746 ----
  										  $1.ident, NULL, NULL,
  										  NULL) != NULL)
  							yyerror("duplicate declaration");
+ 
+ 						if (plpgsql_curr_compile->warnings)
+ 						{
+ 							PLpgSQL_nsitem *nsi;
+ 							nsi = plpgsql_ns_lookup(plpgsql_ns_top(), false,
+ 													$1.ident, NULL, NULL, NULL);
+ 							if (nsi != NULL)
+ 								ereport(plpgsql_curr_compile->warnings_as_errors ? ERROR : WARNING,
+ 										(errcode(ERRCODE_DUPLICATE_ALIAS),
+ 										 errmsg("variable \"%s\" shadows a previously defined variable",
+ 												$1.ident),

Re: [HACKERS] bugfix patch for json_array_elements

2014-02-02 Thread Craig Ringer
On 02/03/2014 09:09 AM, Craig Ringer wrote:

> At a guess, we're looking at a case where a new child context is created
> at every call, so every MemoryContextResetChildren call has to deal with
> more child contexts.

That would be "yes". After a short run, I see 32849 lines like:

json_array_elements temporary cxt: 8192 total in 1 blocks; 8160 free (0
chunks); 32 used

under the context:

  PortalMemory: 8192 total in 1 blocks
PortalHeapMemory: 7168 total in 3 blocks
  ExecutorState: 65600 total in 4 blocks
ExprContext: 8192 total in 1 blocks
  json_array_elements temporary cxt: 8192 total in 1 blocks;
8160 free (0 chunks); 32 used


The attached patch deletes the context after use, bringing performance
back into line. It should be backpatched to 9.3.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From f4ae516d643dee43200053533dbcc8b362e3f92b Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Mon, 3 Feb 2014 09:47:34 +0800
Subject: [PATCH] Fix leaked memory context in json_array_each and json_each

json_array_each and json_each leaked a memory context with each
invocation, causing profiles involving many invocations in a single
query context to be dominated by MemoryContextReset because of
MemoryContextResetChildren calls made after each tuple is emitted.

See initial bug report:

http://stackoverflow.com/q/21507127/398670

and list discussion:

 http://www.postgresql.org/message-id/52eeec37.9040...@2ndquadrant.com
---
 src/backend/utils/adt/jsonfuncs.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index a19b222..116d7a0 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -975,6 +975,9 @@ each_worker(PG_FUNCTION_ARGS, bool as_text)
 	rsi->setResult = state->tuple_store;
 	rsi->setDesc = state->ret_tdesc;
 
+	MemoryContextDelete(state->tmp_cxt);
+	state->tmp_cxt = NULL;
+
 	PG_RETURN_NULL();
 }
 
@@ -1157,6 +1160,9 @@ elements_worker(PG_FUNCTION_ARGS, bool as_text)
 	rsi->setResult = state->tuple_store;
 	rsi->setDesc = state->ret_tdesc;
 
+	MemoryContextDelete(state->tmp_cxt);
+	state->tmp_cxt = NULL;
+
 	PG_RETURN_NULL();
 }
 
-- 
1.8.3.1




Re: [HACKERS] pg_basebackup and pg_stat_tmp directory

2014-02-02 Thread Fujii Masao
On Mon, Feb 3, 2014 at 6:57 AM, Peter Eisentraut  wrote:
> On 2/2/14, 10:23 AM, Fujii Masao wrote:
>> I'm thinking to change basebackup.c so that it compares the
>> name of the directory that it's trying to back up and the setting
>> value of log_directory parameter, then, if they are the same,
>> it just skips the directory. The patch that I sent upthread does
>> this regarding stats_temp_directory.
>
> I'm undecided on whether log files should be copied, but in case we
> decide not to, it needs to be considered whether we at least recreate
> the pg_log directory on the standby.  Otherwise weird things will happen
> when you start the standby, and it would introduce an extra fixup step
> to sort that out.

Yes, basically we should skip only the files under pg_log, but not the
pg_log directory itself.
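A hedged sketch of that skip logic, modelled on how basebackup.c already
treats pg_xlog (should_descend and tar_write_dir_header are illustrative
names I've assumed, not the posted patch; Log_directory is the real GUC
variable holding the log_directory setting):

#include <stdbool.h>
#include <string.h>
#include <sys/stat.h>

extern char *Log_directory;     /* value of the log_directory GUC */

/* hypothetical helper: emit a tar header for an (empty) directory */
extern void tar_write_dir_header(const char *relpath, struct stat *st);

/*
 * Returns true if the backup walk should descend into this entry; for
 * the log directory we keep the directory itself in the backup but none
 * of the files under it.
 */
static bool
should_descend(const char *relpath, struct stat *st)
{
    if (strcmp(relpath, Log_directory) == 0)
    {
        tar_write_dir_header(relpath, st);
        return false;
    }
    return true;
}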

> Extra credit for doing something useful when pg_log is a symlink.
>
> I fear, however, that if you end up implementing all that logic, it
> would become too much special magic.

ISTM that pg_xlog has already been handled in that way by basebackup.

Regards,

-- 
Fujii Masao




Re: [HACKERS] narwhal and PGDLLIMPORT

2014-02-02 Thread Craig Ringer
On 02/02/2014 09:03 AM, Tom Lane wrote:
> According to the buildfarm database, narwhal is running a gcc build on
> Windows 2003.  That hardly seems like a mainstream use case.  I could
> believe that it might be of interest to developers, but clearly no
> developers are actually running such a build.
> 
> I think we should give serious consideration to desupporting this
> combination

I'm not a fan of MinGW (gcc) on Windows, it's a right pain. It's also
the only open source compiler currently supported for PostgreSQL on
Windows - practically the only one available. I don't know about you,
but I'm not too keen on assuming Microsoft will continue to offer free
toolchains that're usable for our purposes. They're crippling their free
build tools more and more with each release, which isn't a good trend.

If you wish to eliminate PGDLLIMPORT from the codebase the correct
approach would be building with --export-all-symbols (a MinGW extension
flag to gcc). That would make the MinGW builds consistent with the MSVC
build, which generates a .def file that exports all symbols.

As for why PGDLLIMPORT appears to be required in some places on the MSVC
build, so far it's looking like we auto-export functions, but not
necessarily variables. I'd need to read the fairly scary MSVC build
generator scripts in detail to confirm that, to see how they produce
their DEF files; that'll have to wait until after I've got the
row-security work sorted out.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Making strxfrm() blobs in indexes work

2014-02-02 Thread Peter Geoghegan
On Thu, Jan 30, 2014 at 8:51 PM, Peter Geoghegan  wrote:
> I've done some more digging. It turns out that the 1977 paper "An
> Encoding Method for Multifield Sorting and Indexing" describes a
> technique that involves concatenating multiple column values and
> comparing them using a simple strcmp().

So I thought about it again, and realized that this probably can't be
made to work given our present assumptions. Commit 656beff59 added the
varstr_cmp() strcmp() tie-breaker.  If we cannot trust strcoll() to
indicate equality, why should we be able to trust strxfrm() + strcmp()
to do so, without an equivalent tie-breaker on the original string?
Now, I guess it might be good enough that the comparison in the inner
page indicated "equivalence" provided the comparison on leaf pages
indicated "equality proper". That seems like the kind of distinction
that I'd rather not have to make, though. You could end up with two
non-equal but equivalent tuples in an inner page. Seems pretty woolly
to me.

However, it also occurs to me that strxfrm() blobs have another useful
property: We (as, say, the author of an equality operator on text, an
operator intended for a btree operator class) *can* trust a strcmp()'s
result on blobs, provided the result isn't 0/equal, *even if the blobs
are truncated*. So maybe a better scheme, and certainly a simpler one
would be to have a pseudo-attribute in inner pages only with, say, the
first 8 bytes of a strxfrm() blob formed from the logically-leading
text attribute of the same indexTuple. Because we're only doing this
on inner pages, there is a very good chance that that will be good
enough most of the time. This also allows us to reduce bloat very
effectively.
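A hedged sketch of that truncated-blob comparison (my illustration, not
Peter's design; AbbrevKey and ABBREV_LEN are assumed names, buffer handling
is simplified, and real code would retry when strxfrm() reports
truncation):

#include <string.h>

#define ABBREV_LEN 8

typedef struct
{
    unsigned char abbrev[ABBREV_LEN];   /* truncated strxfrm() blob */
    const char *original;               /* full untransformed string */
} AbbrevKey;

static void
make_abbrev(AbbrevKey *key, const char *s)
{
    char        buf[1024];
    size_t      len;

    key->original = s;
    memset(key->abbrev, 0, ABBREV_LEN);
    len = strxfrm(buf, s, sizeof(buf));
    if (len >= sizeof(buf))
        len = sizeof(buf) - 1;  /* real code would retry with a bigger buffer */
    memcpy(key->abbrev, buf, len < ABBREV_LEN ? len : ABBREV_LEN);
}

static int
abbrev_cmp(const AbbrevKey *a, const AbbrevKey *b)
{
    int         cmp = memcmp(a->abbrev, b->abbrev, ABBREV_LEN);

    if (cmp != 0)
        return cmp;             /* trustworthy even though truncated */

    /* Inconclusive: fall back to the full values, with a 656beff59-style
     * strcmp() tie-breaker after strcoll(). */
    cmp = strcoll(a->original, b->original);
    if (cmp != 0)
        return cmp;
    return strcmp(a->original, b->original);
}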

As I touched on before, other schemes for other types do seem to
suggest themselves. We could support a pseudo leading attribute for
numeric too, for example, where we use a 64-bit integer as a proxy for
the whole number part (with the implementation having special handling
of "out of range" values by means of an internal magic number - as
always with the scheme, equality means "inconclusive, ask logically
leading attribute"). That could win just by mostly avoiding the
serialization overhead.

Now, the obvious counter-argument here is to point out that there are
worst cases where none of this helps. I suspect that those are so rare
in the real world, and the cost of doing all this is so low, that it's
still likely to be well worth it for any given use-case at the macro
level.

-- 
Peter Geoghegan




[HACKERS] Multiple calls to json_array_elements slow down nonlinearly

2014-02-02 Thread Craig Ringer
Hi all

I ran into this performance report over the weekend:

http://stackoverflow.com/q/21507127/398670

and wanted to mention it here.

json_array_elements seems to spend about 97% of its time in
MemoryContextReset(...).

Given dummy data:

test=> create table g as select (select json_agg(random()) json  from
generate_series(0, (r1*4)::int))  from (select random() r1 from
generate_series(1,2)) aux;

Compare these two methods of producing the same result set:

test=> create table q as select json->x foo from g,
generate_series(0,json_array_length(g.json)-1) x;
SELECT 60103
Time: 157.702 ms
test=> create table p as select json_array_elements(json) foo from g;
SELECT 60103
Time: 4254.494 ms


The issue is reproducible and scales non-linearly with row count, which
is a clue. At 100k rows input, the lateral query takes 592ms vs 179959ms
(3 minutes) for json_array_elements.

Whenever I grab a backtrace it looks like:

> #0  0x0072dd7d in MemoryContextReset (context=0x2a02dc90) at 
> mcxt.c:130
> #1  0x0072dd90 in MemoryContextResetChildren (context=<optimized out>) at mcxt.c:155
> #2  MemoryContextReset (context=0x1651220) at mcxt.c:131
> #3  0x005817f9 in ExecScan (node=node@entry=0x164e1a0, 
> accessMtd=accessMtd@entry=0x592040 , 
> recheckMtd=recheckMtd@entry=0x592030 )
> at execScan.c:155

(Sorry for the quote-paste; only way to make @#$ Thunderbird not wrap
mail, I need to switch clients or fix that).

"perf top" on the process shows:

 96.92%  postgres  [.] MemoryContextReset
  0.15%  [kernel]  [k] cpuacct_account_field
  0.09%  [kernel]  [k] update_cfs_rq_blocked_load
  0.09%  postgres  [.] AllocSetAlloc


At a guess, we're looking at a case where a new child context is created
at every call, so every MemoryContextResetChildren call has to deal with
more child contexts. I'm going to take a quick look now, I just wanted
to get this written up before I got sidetracked.
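A hedged reconstruction of the suspected pattern, using the context name
from the dump above (EachState and each_worker_setup are illustrative
names; the follow-up patch elsewhere in this digest confirms the fix is a
matching MemoryContextDelete):

#include "postgres.h"
#include "utils/memutils.h"

typedef struct EachState
{
    MemoryContext tmp_cxt;      /* per-call work context */
} EachState;

static void
each_worker_setup(EachState *state)
{
    /* One child context is created per call into the SRF... */
    state->tmp_cxt = AllocSetContextCreate(CurrentMemoryContext,
                                           "json_array_elements temporary cxt",
                                           ALLOCSET_DEFAULT_MINSIZE,
                                           ALLOCSET_DEFAULT_INITSIZE,
                                           ALLOCSET_DEFAULT_MAXSIZE);

    /*
     * ...but, pre-fix, no matching MemoryContextDelete(state->tmp_cxt) is
     * done before returning, so each call leaves one more child hanging
     * off the executor's context, and every later
     * MemoryContextResetChildren() walk visits all of them.
     */
}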

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] GIN improvements part2: fast scan

2014-02-02 Thread Tomas Vondra
On 2.2.2014 11:45, Heikki Linnakangas wrote:
> On 01/30/2014 01:53 AM, Tomas Vondra wrote:
>> (3) A file with explain plans for 4 queries suffering ~2x slowdown,
>>  and explain plans with 9.4 master and Heikki's patches is available
>>  here:
>>
>>http://www.fuzzy.cz/tmp/gin/queries.txt
>>
>>  All the queries have 6 common words, and the explain plans look
>>  just fine to me - exactly like the plans for other queries.
>>
>>  Two things now caught my eye. First some of these queries actually
>>  have words repeated - either exactly like "database & database" or
>>  in negated form like "!anything & anything". Second, while
>>  generating the queries, I use "dumb" frequency, where only exact
>>  matches count. I.e. "write != written" etc. But the actual number
>>  of hits may be much higher - for example "write" matches exactly
>>  just 5% documents, but using @@ it matches more than 20%.
>>
>>  I don't know if that's the actual cause though.
> 
> Ok, here's another variant of these patches. Compared to git master, it
> does three things:
> 
> 1. It adds the concept of ternary consistent function internally, but no
> catalog changes. It's implemented by calling the regular boolean
> consistent function "both ways".
> 
> 2. Use a binary heap to get the "next" item from the entries in a scan.
> I'm pretty sure this makes sense, because arguably it makes the code
> more readable, and reduces the number of item pointer comparisons
> significantly for queries with a lot of entries.
> 
> 3. Only perform the pre-consistent check to try skipping entries, if we
> don't already have the next item from the entry loaded in the array.
> This is a tradeoff, you will lose some of the performance gain you might
> get from pre-consistent checks, but it also limits the performance loss
> you might get from doing useless pre-consistent checks.
> 
> So taken together, I would expect this patch to make some of the
> performance gains less impressive, but also limit the loss we saw with
> some of the other patches.
> 
> Tomas, could you run your test suite with this patch, please?

Sure, will do. Do I get it right that this should be applied instead of
the four patches you've posted earlier?

Tomas
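A hedged sketch of the "both ways" trick in Heikki's point 1 above (my
illustration, not the patch; the callback signature and type names are
simplified relative to the real GIN consistent-function API):

#include <stdbool.h>

typedef enum
{
    TS_FALSE,                   /* cannot match */
    TS_TRUE,                    /* matches no matter what the unknowns are */
    TS_MAYBE                    /* result depends on the unknown entries */
} TernaryResult;

typedef bool (*BoolConsistentFn) (bool *check, int nkeys, void *extra);

static TernaryResult
ternary_from_boolean(BoolConsistentFn consistent, bool *check,
                     const bool *unknown, int nkeys, void *extra)
{
    bool        res_false;
    bool        res_true;
    int         i;

    /* First call: assume every unknown entry is false. */
    for (i = 0; i < nkeys; i++)
        if (unknown[i])
            check[i] = false;
    res_false = consistent(check, nkeys, extra);

    /* Second call: assume every unknown entry is true. */
    for (i = 0; i < nkeys; i++)
        if (unknown[i])
            check[i] = true;
    res_true = consistent(check, nkeys, extra);

    /* If both assumptions agree, the unknown entries cannot matter. */
    if (res_false == res_true)
        return res_false ? TS_TRUE : TS_FALSE;
    return TS_MAYBE;
}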




Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-02 Thread Tom Lane
Greg Stark  writes:
> On Sun, Feb 2, 2014 at 6:03 PM, Tom Lane  wrote:
>> Can we see the associated WAL records (ie, the ones matching the LSNs
>> in the last blocks of these files)?

> Sorry, I've lost track of what information I already shared or didn't,

Hm.  So one of these is a heap update, not an index update, which lets
out the theory that it's something specific to indexes.  But they are
all full-page-image updates, so the WAL replay code path for full-page
images still seems to be the suspect.

What version were you running before 9.1.11 exactly?  I took a look
through all the diffs from 9.1.9 up to 9.1.11, and couldn't find any
changes that seemed even vaguely related to this.  There are some
changes in known-transaction tracking, but it's hard to see a connection
there.  Most of the other diffs are in code that wouldn't execute during
WAL replay at all.

regards, tom lane




Re: [HACKERS] GSOC13 proposal - extend RETURNING syntax

2014-02-02 Thread David Fetter
On Wed, Aug 21, 2013 at 08:52:25PM +0200, Karol Trzcionka wrote:
> On 21.08.2013 19:17, Boszormenyi Zoltan wrote:
> > With this fixed, a more complete review:
> Thanks.

I've done some syntactic and white space cleanup, here attached.

Karol, would you care to help with commenting the sections that need
the same?

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
*** a/doc/src/sgml/ref/update.sgml
--- b/doc/src/sgml/ref/update.sgml
***************
*** 194,205 **** UPDATE [ ONLY ] table_name [ * ] [
  output_expression
  
   
!   An expression to be computed and returned by the UPDATE
!   command after each row is updated.  The expression can use any
!   column names of the table named by table_name
!   or table(s) listed in FROM.
!   Write * to return all columns.
   
  
 
  
--- 194,220 ----
  output_expression
  
   
!   An expression to be computed and returned by the
!   UPDATE command either before or after (prefixed with
!   BEFORE. and AFTER.,
!   respectively) each row is updated.  The expression can use any
!   column names of the table named by table_name or table(s) listed in
!   FROM.  Write AFTER.* to return all 
!   columns after the update. Write BEFORE.* for all
!   columns before the update. Write * to return all
!   columns after update and all triggers fired (these values are in table
!   after command). You may combine BEFORE, AFTER and raw columns in the
!   expression.
   
+  
+  Mixing table names or aliases named before or after with the
+  above will result in confusion and suffering.  If you happen to
+  have a table called before or
+  after, alias it to something else when using
+  RETURNING.
+  
+ 
  
 
  
***************
*** 287,301 **** UPDATE weather SET temp_lo = temp_lo+1, temp_hi = temp_lo+15, prcp = DEFAULT

  

!Perform the same operation and return the updated entries:
  
  
  UPDATE weather SET temp_lo = temp_lo+1, temp_hi = temp_lo+15, prcp = DEFAULT
WHERE city = 'San Francisco' AND date = '2003-07-03'
!   RETURNING temp_lo, temp_hi, prcp;
  

  

 Use the alternative column-list syntax to do the same update:
  
--- 302,317 ----

  

!Perform the same operation and return information on the changed entries:
  
  
  UPDATE weather SET temp_lo = temp_lo+1, temp_hi = temp_lo+15, prcp = DEFAULT
WHERE city = 'San Francisco' AND date = '2003-07-03'
!   RETURNING temp_lo AS new_low, temp_hi AS new_high, BEFORE.temp_hi/BEFORE.temp_low AS old_ratio, AFTER.temp_hi/AFTER.temp_low AS new_ratio, prcp;
  

  
+ 

 Use the alternative column-list syntax to do the same update:
  
*** a/src/backend/commands/trigger.c
--- b/src/backend/commands/trigger.c
***************
*** 2335,2341 **** ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo)
  TupleTableSlot *
  ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 ResultRelInfo *relinfo,
! 					 ItemPointer tupleid, TupleTableSlot *slot)
  {
TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
HeapTuple   slottuple = ExecMaterializeSlot(slot);
--- 2335,2341 ----
  TupleTableSlot *
  ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 ResultRelInfo *relinfo,
! 					 ItemPointer tupleid, TupleTableSlot *slot, TupleTableSlot **planSlot)
  {
TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
HeapTuple   slottuple = ExecMaterializeSlot(slot);
***************
*** 2382,2387 **** ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
--- 2382,2388 ----
if (newSlot != NULL)
{
slot = ExecFilterJunk(relinfo->ri_junkFilter, newSlot);
+   *planSlot = newSlot;
slottuple = ExecMaterializeSlot(slot);
newtuple = slottuple;
}
*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
***************
*** 609,615 **** ExecUpdate(ItemPointer tupleid,
resultRelInfo->ri_TrigDesc->trig_update_before_row)
{
slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
! 									tupleid, slot);
  
if (slot == NULL)   /* "do nothing" */
return NULL;
--- 609,615 ----
resultRelInfo->ri_TrigDesc->trig_update_before_row)
{
slot = ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
!  

Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext

2014-02-02 Thread Andres Freund
On 2014-02-02 15:16:45 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On February 2, 2014 5:52:22 PM CET, Tom Lane  wrote:
> >> More to the point, changing the Assert so it doesn't fire
> >> doesn't do one damn thing to ameliorate the fact that cache reload
> >> during transaction abort is wrong and unsafe.
>
> > And, as upthread, I still don't think that's correct. I don't have
> > sources available right now, but IIRC we already have aborted out of the
> > transaction. Released locks, the xid and everything.
>
> Nope ... the case exhibited in the example is dying in AtEOSubXact_Inval,
> which is in the very midst of subxact abort.

True. But we've done LWLockReleaseAll(), TransactionIdAbortTree(),
XidCacheRemoveRunningXids() and
ResourceOwnerRelease(RESOURCE_RELEASE_BEFORE_LOCKS), which is why we are
currently able to build correct entries, even though we are in an
aborted transaction.

> I've been thinking about this for the past little while, and I believe
> that it's probably okay to have RelationClearRelation leave the relcache
> entry un-rebuilt, but with rd_isvalid = false so it will be rebuilt when
> next opened.  The rationale is explained in the comments in the attached
> patch.  I've checked that this fixes Noah's test case and still passes
> the existing regression tests.

Hm, a bit scary, but I don't see an immediate problem.

The following comment is now violated for nailed relations:
 *  We assume that at the time we are called, we have at least AccessShareLock
 *  on the target index.  (Note: in the calls from RelationClearRelation,
 *  this is legitimate because we know the rel has positive refcount.)
but that should be easy to fix.

I wonder though, if we couldn't just stop doing the
RelationReloadIndexInfo() for nailed indexes. The corresponding comment
says:
 * If it's a nailed index, then we need to re-read the pg_class row to see
 * if its relfilenode changed.  We do that immediately if we're inside a
 * valid transaction.  Otherwise just mark the entry as possibly invalid,
 * and it'll be fixed when next opened.
 */

but any relfilenode change should have already been handled by
RelationInitPhysicalAddr()?

Do you plan to backpatch this? If so, even to 8.4?

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_basebackup and pg_stat_tmp directory

2014-02-02 Thread Peter Eisentraut
On 2/2/14, 10:23 AM, Fujii Masao wrote:
> I'm thinking to change basebackup.c so that it compares the
> name of the directory that it's trying to back up and the setting
> value of log_directory parameter, then, if they are the same,
> it just skips the directory. The patch that I sent upthread does
> this regarding stats_temp_directory.

I'm undecided on whether log files should be copied, but in case we
decide not to, it needs to be considered whether we at least recreate
the pg_log directory on the standby.  Otherwise weird things will happen
when you start the standby, and it would introduce an extra fixup step
to sort that out.

Extra credit for doing something useful when pg_log is a symlink.

I fear, however, that if you end up implementing all that logic, it
would become too much special magic.





Re: [HACKERS] [COMMITTERS] pgsql: Include planning time in EXPLAIN ANALYZE output.

2014-02-02 Thread Tom Lane
Gavin Flower  writes:
> Can I assume:
> 'Total runtime' is 'elapsed time'
> and
> 'Execution time' is 'processor time'.

No.  It's going to be elapsed time, either way.

> In a parallel implementation, one would likely want both.

When and if we have that, we can argue about what to measure.

regards, tom lane




[HACKERS] CacheInvalidateRelcache in btree is a crummy idea

2014-02-02 Thread Tom Lane
While investigating the assertion failure Noah presented at
http://www.postgresql.org/message-id/20130805170931.ga369...@tornado.leadboat.com
I was confused for a bit as to why we're getting a relcache rebuild
request for t1's index at exit of the subtransaction, when the
subtransaction hasn't made any change to the index's schema.
Investigation showed that the reason is that _bt_getroot() creates
a root page for the index during the subtransaction's INSERT, and
then it does CacheInvalidateRelcache() to notify other backends
that it's updated the index's metapage.

Now, this is a dumb idea, because relcache invalidation is assumed
to be a transactional operation, which means that if the subtransaction
rolls back, we don't send any invalidation event to other backends.
Update of a btree metapage, however, is not a transactional thing.
So if we actually *needed* that notification to be sent, this would be
a bug.

But we don't need that.  _bt_getroot() is the primary user of the
cached metapage, and it sufficiently verifies that the page it's
reached is the right one, which is why we've not seen bug reports.
_bt_getrootheight() also uses the cached data, but that's only for a
statistical estimate, so it doesn't matter if it's a bit stale.

Moreover, if we did need reliable transmission of the invalidation
message, that would've gotten broken even more by commit 96675bff1, which
suppressed doing that when InRecovery.  (In fairness, it wasn't a problem
at the time since that was pre-Hot Standby.  But the lack of notification
would be an issue now for hot standby sessions, if they needed one.)

So I'm thinking my commit d2896a9ed, which introduced this mechanism,
was poorly thought out and we should just remove the relcache invals
as per the attached patch.  Letting _bt_getroot() update the cached
metapage at next use should be a lot cheaper than a full relcache
rebuild for the index.

Note: this change will mean that Noah's test case no longer triggers
the problem discussed in that thread ... but I'm sure we can find
another test case for that.

regards, tom lane

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40f09e3..bf5bc38 100644
*** a/src/backend/access/nbtree/README
--- b/src/backend/access/nbtree/README
*** location of the root page --- both the t
*** 438,451 
  root ("fast" root).  To avoid fetching the metapage for every single index
  search, we cache a copy of the meta-data information in the index's
  relcache entry (rd_amcache).  This is a bit ticklish since using the cache
! implies following a root page pointer that could be stale.  We require
! every metapage update to send out a SI "relcache inval" message on the
! index relation.  That ensures that each backend will flush its cached copy
! not later than the start of its next transaction.  Therefore, stale
! pointers cannot be used for longer than the current transaction, which
! reduces the problem to the same one already dealt with for concurrent
! VACUUM --- we can just imagine that each open transaction is potentially
! "already in flight" to the old root.
  
  The algorithm assumes we can fit at least three items per page
  (a "high key" and two real data items).  Therefore it's unsafe
--- 438,454 
  root ("fast" root).  To avoid fetching the metapage for every single index
  search, we cache a copy of the meta-data information in the index's
  relcache entry (rd_amcache).  This is a bit ticklish since using the cache
! implies following a root page pointer that could be stale.  However, a
! backend following a cached pointer can sufficiently verify whether it
! reached the intended page; either by checking the is-root flag when it
! is going to the true root, or by checking that the page has no siblings
! when going to the fast root.  At worst, this could result in descending
! some extra tree levels if we have a cached pointer to a fast root that is
! now above the real fast root.  Such cases shouldn't arise often enough to
! be worth optimizing; and in any case we can expect a relcache flush will
! discard the cached metapage before long, since a VACUUM that's moved the
! fast root pointer can be expected to issue a statistics update for the
! index.
  
  The algorithm assumes we can fit at least three items per page
  (a "high key" and two real data items).  Therefore it's unsafe
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 49c084a..9140749 100644
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***
*** 21,27 
  #include "miscadmin.h"
  #include "storage/lmgr.h"
  #include "storage/predicate.h"
- #include "utils/inval.h"
  #include "utils/tqual.h"
  
  
--- 21,26 
*** _bt_insertonpg(Relation rel,
*** 868,880 
  
  		END_CRIT_SECTION();
  
! 		/* release buffers; send out relcache inval if metapage changed */
  		if (BufferIsVal

Re: [HACKERS] [COMMITTERS] pgsql: Include planning time in EXPLAIN ANALYZE output.

2014-02-02 Thread Gavin Flower

On 03/02/14 09:44, Peter Geoghegan wrote:

On Sun, Feb 2, 2014 at 8:13 AM, Tom Lane  wrote:

Perhaps s/Total runtime/Execution time/ ?

+1


If the planning was ever made into a parallel process, then 'elapsed 
time' would be less than the 'processor time'.  So what does 'Execution 
time' mean?


Can I assume:
'Total runtime' is 'elapsed time'
and
'Execution time' is 'processor time'.

In a parallel implementation, one would likely want both.

Possibly this is not an issue now, and I'm thinking too far ahead!


Cheers,
Gavin




Re: [HACKERS] [COMMITTERS] pgsql: Include planning time in EXPLAIN ANALYZE output.

2014-02-02 Thread Peter Geoghegan
On Sun, Feb 2, 2014 at 8:13 AM, Tom Lane  wrote:
> Perhaps s/Total runtime/Execution time/ ?

+1


-- 
Peter Geoghegan




Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext

2014-02-02 Thread Tom Lane
Andres Freund  writes:
> On February 2, 2014 5:52:22 PM CET, Tom Lane  wrote:
>> More to the point, changing the Assert so it doesn't fire
>> doesn't do one damn thing to ameliorate the fact that cache reload
>> during transaction abort is wrong and unsafe.

> And, as upthread, I still don't think that's correct. I don't have
> sources available right now, but IIRC we already have aborted out of the
> transaction. Released locks, the xid and everything.

Nope ... the case exhibited in the example is dying in AtEOSubXact_Inval,
which is in the very midst of subxact abort.

I've been thinking about this for the past little while, and I believe
that it's probably okay to have RelationClearRelation leave the relcache
entry un-rebuilt, but with rd_isvalid = false so it will be rebuilt when
next opened.  The rationale is explained in the comments in the attached
patch.  I've checked that this fixes Noah's test case and still passes
the existing regression tests.

regards, tom lane

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2a46cfc..791ddc6 100644
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
*** RelationClearRelation(Relation relation,
*** 1890,1900 
  	 * in case it is a mapped relation whose mapping changed.
  	 *
  	 * If it's a nailed index, then we need to re-read the pg_class row to see
! 	 * if its relfilenode changed.	We can't necessarily do that here, because
! 	 * we might be in a failed transaction.  We assume it's okay to do it if
! 	 * there are open references to the relcache entry (cf notes for
! 	 * AtEOXact_RelationCache).  Otherwise just mark the entry as possibly
! 	 * invalid, and it'll be fixed when next opened.
  	 */
  	if (relation->rd_isnailed)
  	{
--- 1890,1898 
  	 * in case it is a mapped relation whose mapping changed.
  	 *
  	 * If it's a nailed index, then we need to re-read the pg_class row to see
! 	 * if its relfilenode changed.	We do that immediately if we're inside a
! 	 * valid transaction.  Otherwise just mark the entry as possibly invalid,
! 	 * and it'll be fixed when next opened.
  	 */
  	if (relation->rd_isnailed)
  	{
*** RelationClearRelation(Relation relation,
*** 1903,1909 
  		if (relation->rd_rel->relkind == RELKIND_INDEX)
  		{
  			relation->rd_isvalid = false;		/* needs to be revalidated */
! 			if (relation->rd_refcnt > 1)
  RelationReloadIndexInfo(relation);
  		}
  		return;
--- 1901,1907 
  		if (relation->rd_rel->relkind == RELKIND_INDEX)
  		{
  			relation->rd_isvalid = false;		/* needs to be revalidated */
! 			if (IsTransactionState())
  RelationReloadIndexInfo(relation);
  		}
  		return;
*** RelationClearRelation(Relation relation,
*** 1921,1927 
  		relation->rd_indexcxt != NULL)
  	{
  		relation->rd_isvalid = false;	/* needs to be revalidated */
! 		RelationReloadIndexInfo(relation);
  		return;
  	}
  
--- 1919,1926 
  		relation->rd_indexcxt != NULL)
  	{
  		relation->rd_isvalid = false;	/* needs to be revalidated */
! 		if (IsTransactionState())
! 			RelationReloadIndexInfo(relation);
  		return;
  	}
  
*** RelationClearRelation(Relation relation,
*** 1942,1947 
--- 1941,1969 
  		/* And release storage */
  		RelationDestroyRelation(relation);
  	}
+ 	else if (!IsTransactionState())
+ 	{
+ 		/*
+ 		 * If we're not inside a valid transaction, we can't do any catalog
+ 		 * access so it's not possible to rebuild yet.  Just exit, leaving
+ 		 * rd_isvalid = false so that the rebuild will occur when the entry is
+ 		 * next opened.
+ 		 *
+ 		 * Note: it's possible that we come here during subtransaction abort,
+ 		 * and the reason for wanting to rebuild is that the rel is open in
+ 		 * the outer transaction.  In that case it might seem unsafe to not
+ 		 * rebuild immediately, since whatever code has the rel already open
+ 		 * will keep on using the relcache entry as-is.  However, in such a
+ 		 * case the outer transaction should be holding a lock that's
+ 		 * sufficient to prevent any significant change in the rel's schema,
+ 		 * so the existing entry contents should be good enough for its
+ 		 * purposes; at worst we might be behind on statistics updates or the
+ 		 * like.  (See also CheckTableNotInUse() and its callers.)  These
+ 		 * same remarks also apply to the cases above where we exit without
+ 		 * having done RelationReloadIndexInfo() yet.
+ 		 */
+ 		return;
+ 	}
  	else
  	{
  		/*
*** RelationClearRelation(Relation relation,
*** 2054,2059 
--- 2076,2082 
   * RelationFlushRelation
   *
   *	 Rebuild the relation if it is open (refcount > 0), else blow it away.
+  *	 This is used when we receive a cache invalidation event for the rel.
   */
  static void
  RelationFlushRelation(Relation relation)


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-02 Thread Greg Stark
On Sun, Feb 2, 2014 at 6:03 PM, Tom Lane  wrote:
> Greg Stark  writes:
>> The relfilenodes that have nul blocks before the last block are:
>
> Can we see the associated WAL records (ie, the ones matching the LSNs
> in the last blocks of these files)?

Sorry, I've lost track of what information I already shared or didn't;
I've been staring at these records all day. It would be so much easier
if xlogdump were an FDW or set-returning function so I could do joins
with pageinspect data:

This is the information from the final block headers:

 relfilenode | blockno  | segnum |  offset  |    lsn     | tli | flags | lower | upper | special | pagesize | version | prune_xid
-------------+----------+--------+----------+------------+-----+-------+-------+-------+---------+----------+---------+------------
      473158 | 18090059 |    138 | 18090059 | EA1/625E28 |   6 |     0 |   144 |   896 |    8192 |     8192 |       4 | 1401029863
     1366221 |  2208159 |     16 |  2208159 | EA1/62DDF0 |   6 |     0 |  1180 |  3552 |    8176 |     8192 |       4 |          0
     1261982 |  7141472 |     54 |  7141472 | EA1/638988 |   6 |     0 |  1240 |  3312 |    8176 |     8192 |       4 |          0
     1364767 |  3631161 |     27 |  3631161 | EA1/63A0A8 |   6 |     0 |  1180 |  3552 |    8176 |     8192 |       4 |          0
     1519286 |   311808 |      2 |   311808 | EA1/708B08 |   6 |     1 |   312 |  3040 |    8192 |     8192 |       4 |          0

These are all the records from the tx that corresponds to the lsn in
the first relfilenode:

[cur:EA1/6240C0, xid:1418089146, rmid:10(Heap), len/tot_len:28/7524,
info:72, prev:EA1/624080] hot_update: s/d/r:1663/16385/473158 block
9386399 off 29 to block 9386399 off 30
[cur:EA1/6240C0, xid:1418089146, rmid:10(Heap), len/tot_len:28/7524,
info:72, prev:EA1/624080] bkpblock[1]: s/d/r:1663/16385/473158
blk:9386399 hole_off/len:144/752
[cur:EA1/625E28, xid:1418089146, rmid:1(Transaction),
len/tot_len:32/64, info:0, prev:EA1/6240C0] d/s:16385/1663 commit at
2014-01-21 05:41:11 UTC

The middle three blocks have LSNs that are all part of the same tx,
here are all the records for this tx:

[cur:EA1/625F28, xid:1418089147, rmid:10(Heap), len/tot_len:21/6441,
info:8, prev:EA1/625E68] insert: s/d/r:1663/16385/473158
blk/off:9386398/33 header: none
[cur:EA1/625F28, xid:1418089147, rmid:10(Heap), len/tot_len:21/6441,
info:8, prev:EA1/625E68] bkpblock[1]: s/d/r:1663/16385/473158
blk:9386398 hole_off/len:156/1828
[cur:EA1/627868, xid:1418089147, rmid:11(Btree), len/tot_len:18/8214,
info:8, prev:EA1/625F28] insert_leaf: s/d/r:1663/16385/473176 tid
1319804/405
[cur:EA1/627868, xid:1418089147, rmid:11(Btree), len/tot_len:18/8214,
info:8, prev:EA1/625F28] bkpblock[1]: s/d/r:1663/16385/473176
blk:1319804 hole_off/len:1644/52
[cur:EA1/629898, xid:1418089147, rmid:11(Btree), len/tot_len:18/6494,
info:8, prev:EA1/627868] insert_leaf: s/d/r:1663/16385/473182 tid
1186167/147
[cur:EA1/629898, xid:1418089147, rmid:11(Btree), len/tot_len:18/6494,
info:8, prev:EA1/627868] bkpblock[1]: s/d/r:1663/16385/473182
blk:1186167 hole_off/len:1300/1772
[cur:EA1/62B210, xid:1418089147, rmid:11(Btree), len/tot_len:18/5314,
info:8, prev:EA1/629898] insert_leaf: s/d/r:1663/16385/1270734 tid
1294137/2
[cur:EA1/62B210, xid:1418089147, rmid:11(Btree), len/tot_len:18/5314,
info:8, prev:EA1/629898] bkpblock[1]: s/d/r:1663/16385/1270734
blk:1294137 hole_off/len:1064/2952
[cur:EA1/62C6E8, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/62B210] insert_leaf: s/d/r:1663/16385/1366221 tid
1364573/66
[cur:EA1/62C6E8, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/62B210] bkpblock[1]: s/d/r:1663/16385/1366221
blk:1364573 hole_off/len:1180/2372
[cur:EA1/62DDF0, xid:1418089147, rmid:11(Btree), len/tot_len:18/4814,
info:8, prev:EA1/62C6E8] insert_leaf: s/d/r:1663/16385/1404440 tid
1195953/51
[cur:EA1/62DDF0, xid:1418089147, rmid:11(Btree), len/tot_len:18/4814,
info:8, prev:EA1/62C6E8] bkpblock[1]: s/d/r:1663/16385/1404440
blk:1195953 hole_off/len:964/3452
[cur:EA1/62F0D8, xid:1418089147, rmid:11(Btree), len/tot_len:18/6862,
info:8, prev:EA1/62DDF0] insert_leaf: s/d/r:1663/16385/1410405 tid
1894183/39
[cur:EA1/62F0D8, xid:1418089147, rmid:11(Btree), len/tot_len:18/6862,
info:8, prev:EA1/62DDF0] bkpblock[1]: s/d/r:1663/16385/1410405
blk:1894183 hole_off/len:988/1404
[cur:EA1/630BC0, xid:1418089147, rmid:11(Btree), len/tot_len:18/7254,
info:8, prev:EA1/62F0D8] insert_leaf: s/d/r:1663/16385/1521566 tid
1691110/132
[cur:EA1/630BC0, xid:1418089147, rmid:11(Btree), len/tot_len:18/7254,
info:8, prev:EA1/62F0D8] bkpblock[1]: s/d/r:1663/16385/1521566
blk:1691110 hole_off/len:1044/1012
[cur:EA1/632830, xid:1418089147, rmid:11(Btree), len/tot_len:18/5174,
info:8, prev:EA1/630BC0] insert_leaf: s/d/r:1663/16385/5285587 tid
386419/102
[cur:EA1/632830, xid:1418089147, rmid:11(Btree), len/tot_len:18/5174,
info:8, prev:EA1/630BC0] bkpblock[1]: s/d/r:1663/16385/5285587
blk:386419 hole_off/len:1036/3092
[cur:EA

Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext

2014-02-02 Thread Andres Freund
On February 2, 2014 5:52:22 PM CET, Tom Lane  wrote:
>Andres Freund  writes:
>> On 2014-01-31 16:41:33 -0500, Bruce Momjian wrote:
>>> Is there any plan to commit this?
>
>> IMO it has to be applied. Tom objected on the grounds that cache
>> invalidation has to be fixed properly but that's a major
>restructuring
>> of code that worked this way for a long time. So changing the
>Assert()
>> to reflect that seems fair to me.
>
>The replacement Assert is wrong no?  At least that's what was said
>upthread.

Well, no. Noah's version isn't as strict as desirable, but it doesn't trigger 
in wrong cases. Thus still better than what's in 9.3 (nothing).

> More to the point, changing the Assert so it doesn't fire
>doesn't do one damn thing to ameliorate the fact that cache reload
>during transaction abort is wrong and unsafe.

And, as upthread, I still don't think that's correct. I don't have sources 
available right now, but IIRC we already have aborted out of the transaction. 
Released locks, the xid and everything. We just haven't changed the state yet - 
and affair we can't naively do so, otherwise we'd do incomplete cleanups if we 
got interrupted somehow.
So yes, it's not pretty and it's really hard to verify. But that doesn't make 
it entirely wrong.
I don't see the point of an assert that triggers for code (and coders) that 
can't do anything about it because their code is correct. All the while the 
assertion actually triggers for ugly but correct code.

Andres


-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-02 Thread Tom Lane
Greg Stark  writes:
> The relfilenodes that have nul blocks before the last block are:

Can we see the associated WAL records (ie, the ones matching the LSNs
in the last blocks of these files)?

regards, tom lane




Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-02 Thread Greg Stark
Hm, I'm not entirely convinced those are erroneous replays to wrong
blocks. They don't look right but there are no blocks of NULs
preceding them. So if they're wrong then they only extended the
relations by a single block.

The relfilenodes that have nul blocks before the last block are:

 relfilenode | blockno  | segnum |  offset  | lsn | tli | flags | lower | upper | special | pagesize | version | prune_xid
-------------+----------+--------+----------+-----+-----+-------+-------+-------+---------+----------+---------+-----------
     1261982 |  7141472 |     54 |  7141472 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
      473158 | 18090059 |    138 | 18090059 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
     1366221 |  2208159 |     16 |  2208159 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
     1364767 |  3631161 |     27 |  3631161 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
     1519286 |   311808 |      2 |   311808 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0




Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext

2014-02-02 Thread Tom Lane
Andres Freund  writes:
> On 2014-01-31 16:41:33 -0500, Bruce Momjian wrote:
>> Is there any plan to commit this?

> IMO it has to be applied. Tom objected on the grounds that cache
> invalidation has to be fixed properly but that's a major restructuring
> of code that worked this way for a long time. So changing the Assert()
> to reflect that seems fair to me.

The replacement Assert is wrong no?  At least that's what was said
upthread.  More to the point, changing the Assert so it doesn't fire
doesn't do one damn thing to ameliorate the fact that cache reload
during transaction abort is wrong and unsafe.

We need to fix the real problem not paper over it.  The fact that the
fix may be hard doesn't change that.

regards, tom lane




Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-02 Thread Greg Stark
I've poked at this a bit more. There are at least 10 relations where
the last block doesn't match the block mentioned in the xlog record
that its LSN indicates. At least that's how it looks from the info
xlogdump prints.

Including two blocks where the "correct" block has the same LSN, which
maybe means the record was correctly replayed the second time. Or
perhaps it just means I'm underestimating the complexity of btree
splits.

[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree),
len/tot_len:3750/12786, info:77, prev:EAD/51142490] split_r:
s/d/r:1663/16385/1521566 leftsib 1697585
[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree),
len/tot_len:3750/12786, info:77, prev:EAD/51142490] bkpblock[1]:
s/d/r:1663/16385/1521566 blk:1697585 hole_off/len:576/4288
[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree),
len/tot_len:3750/12786, info:77, prev:EAD/51142490] bkpblock[2]:
s/d/r:1663/16385/1521566 blk:786380 hole_off/len:740/3140

 relfilenode | blockno |     lsn      | tli | flags | lower | upper | special | pagesize | version | prune_xid
-------------+---------+--------------+-----+-------+-------+-------+---------+----------+---------+-----------
     1521566 | 1697585 | EAD/511456E8 |   6 |     0 |   576 |  4864 |    8176 |     8192 |       4 |         0
     1521566 | 1704143 | EAD/511456E8 |   6 |     0 |   644 |  4456 |    8176 |     8192 |       4 |         0

[cur:EAD/520F0EE0, xid:1423309260, rmid:11(Btree),
len/tot_len:4230/4450, info:69, prev:EAD/520F0E98] split_r:
s/d/r:1663/16385/4995658 leftsib 139569
[cur:EAD/520F0EE0, xid:1423309260, rmid:11(Btree),
len/tot_len:4230/4450, info:69, prev:EAD/520F0E98] bkpblock[2]:
s/d/r:1663/16385/4995658 blk:18152 hole_off/len:28/8028
 relfilenode | blockno |     lsn      | tli | flags | lower | upper | special | pagesize | version | prune_xid
-------------+---------+--------------+-----+-------+-------+-------+---------+----------+---------+-----------
     4995658 |  139569 | EAD/520F2058 |   6 |     0 |   152 |  4336 |    8176 |     8192 |       4 |         0
     4995658 |  139584 | EAD/520F2058 |   6 |     0 |   164 |  3976 |    8176 |     8192 |       4 |         0




[HACKERS] slow startup due to LWLockAssign() spinlock

2014-02-02 Thread Andres Freund
Hi,

On larger, multi-socket, machines, startup takes a fair bit of time. As
I was profiling anyway I looked into it and noticed that just about all
of it is spent in LWLockAssign() called by InitBufferPool(). Starting
with shared_buffers=48GB on the server Nate Boley provided, takes about
12 seconds. Nearly all of it spent taking the ShmemLock spinlock.
Simply modifying LWLockAssign() to not take the spinlock when
!IsUnderPostmaster speeds it up to 2 seconds. While certainly not making
LWLockAssign() prettier, it seems enough of a speedup to be worthwhile
nonetheless.
Since this code is also hit when doing an emergency restart, I'd say it has
practical relevance...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 0679eb5..10ee364 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -451,9 +451,11 @@ CreateLWLocks(void)
  * LWLockAssign - assign a dynamically-allocated LWLock number
  *
  * We interlock this using the same spinlock that is used to protect
- * ShmemAlloc().  Interlocking is not really necessary during postmaster
- * startup, but it is needed if any user-defined code tries to allocate
- * LWLocks after startup.
+ * ShmemAlloc().  Interlocking is not necessary during postmaster startup, but
+ * it is needed if any user-defined code tries to allocate LWLocks after
+ * startup. Since we allocate a large number of LWLocks for the buffer pool,
+ * we avoid taking the spinlock when not needed, as doing so has been shown
+ * to slow down startup considerably.
  */
 LWLock *
 LWLockAssign(void)
@@ -464,14 +466,17 @@ LWLockAssign(void)
 	volatile int *LWLockCounter;
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int));
-	SpinLockAcquire(ShmemLock);
+	if (IsUnderPostmaster)
+		SpinLockAcquire(ShmemLock);
 	if (LWLockCounter[0] >= LWLockCounter[1])
 	{
-		SpinLockRelease(ShmemLock);
+		if (IsUnderPostmaster)
+			SpinLockRelease(ShmemLock);
 		elog(ERROR, "no more LWLocks available");
 	}
 	result = &MainLWLockArray[LWLockCounter[0]++].lock;
-	SpinLockRelease(ShmemLock);
+	if (IsUnderPostmaster)
+		SpinLockRelease(ShmemLock);
 	return result;
 }
 



Re: [HACKERS] mvcc catalo gsnapshots and TopTransactionContext

2014-02-02 Thread Andres Freund
On 2014-01-31 16:41:33 -0500, Bruce Momjian wrote:
> On Mon, Oct  7, 2013 at 03:02:36PM -0400, Robert Haas wrote:
> > On Fri, Oct 4, 2013 at 3:20 PM, Andres Freund  
> > wrote:
> > > On 2013-10-04 15:15:36 -0400, Robert Haas wrote:
> > >> Andres, are you (or is anyone) going to try to fix this assertion 
> > >> failure?
> > >
> > > I think short term replacing it by IsTransactionOrTransactionBlock() is
> > > the way to go. Entirely restructuring how cache invalidation in the
> > > abort path works is not a small task.
> > 
> > Well, if we're going to go that route, how about something like the
> > attached?  I included the assert-change per se, an explanatory
> > comment, and the test case that Noah devised to cause the current
> > assertion to fail.
> 
> Is there any plan to commit this?

IMO it has to be applied. Tom objected on the grounds that cache
invalidation has to be fixed properly but that's a major restructuring
of code that worked this way for a long time. So changing the Assert()
to reflect that seems fair to me.
I'd adapt the tests with a sentence explaining what they test; on a
first look they are pretty obscure...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Include planning time in EXPLAIN ANALYZE output.

2014-02-02 Thread Tom Lane
Peter Geoghegan  writes:
> On Wed, Jan 29, 2014 at 1:10 PM, Robert Haas  wrote:
>> Include planning time in EXPLAIN ANALYZE output.

> Isn't it perhaps a little confusing that "Planning time" may well
> exceed "Total runtime"?

Perhaps s/Total runtime/Execution time/ ?

regards, tom lane




Re: [HACKERS] [GENERAL] Insert result does not match record count

2014-02-02 Thread Tom Lane
Vik Fearing  writes:
> Without re-doing the work, my IRC logs show that I was bothered by this
> in src/backend/tcop/postgres.c:

> max_rows = pq_getmsgint(&input_message, 4);

> I needed to change max_rows to int64 which meant I had to change
> pq_getmsgint to pq_getmsgint64 which made me a little worried.

As well you should be, because we are *not* doing that.  That would be
a guaranteed-incompatible protocol change.  Fortunately, I don't see
any functional need for widening the row-limit field in execute messages;
how likely is it that someone wants to fetch exactly 3 billion rows?
The practical use-cases for nonzero row limits generally involve fetching
a bufferload worth of data at a time, so that the restriction to getting
no more than INT_MAX rows at once is several orders of magnitude away
from being a problem.

The same goes for internal uses of row limits, which makes it
questionable whether it's worth changing the width of ExecutorRun's
count parameter, which is what I assume you were on about here.  But
in any case, if we did that we'd not try to reflect it as far as here,
because the message format specs can't change.

regards, tom lane




Re: [HACKERS] narwhal and PGDLLIMPORT

2014-02-02 Thread Tom Lane
Dave Page  writes:
> On Sun, Feb 2, 2014 at 1:03 AM, Tom Lane  wrote:
>> I think we should give serious consideration to desupporting this
>> combination so that we can get rid of the plague of PGDLLIMPORT
>> marks.

> No objection here - though I should point out that it's not been
> offline for a long time (aside from a couple of weeks in January) -
> it's been happily building most pre-9.2 branches for ages. 9.1 seems
> to be stuck, along with HEAD, and I forgot to add 9.3. I'm in the
> process of cleaning that up as time allows, but am happy to drop it
> instead if we no longer want to support anything that old. We
> certainly don't use anything resembling that config for the EDB
> installer builds.

Further discussion pointed out that currawong, for example, seems to
want PGDLLIMPORT markings but is able to get by without them in
some cases that narwhal evidently doesn't like.  So at this point,
desupporting narwhal's configuration is clearly premature --- we
should instead be looking into exactly what is causing the different
cases to fail or not fail.

I still have hopes that we might be able to get rid of PGDLLIMPORT
marks, but by actually understanding why they seem to be needed in
some cases and not others, not by just arbitrarily dropping support.

In the meantime, please do get HEAD running again on that machine.

regards, tom lane




Re: [HACKERS] pg_basebackup and pg_stat_tmp directory

2014-02-02 Thread Fujii Masao
On Sat, Feb 1, 2014 at 2:08 AM, Robert Haas  wrote:
> On Fri, Jan 31, 2014 at 8:40 AM, Fujii Masao  wrote:
>> Yeah, I was thinking that, too. I'm not sure whether including log files
>> in backup really increases the security risk, though. There are already
>> very important data, i.e., database, in backups. Anyway, since
>> the amount of log files can be very large and they are not essential
>> for recovery, it's worth considering whether to exclude them. OTOH,
>> I'm sure that some users prefer current behavior for some reasons.
>> So I think that it's better to expose the pg_basebackup option
>> specifying whether log files are included in backups or not.
>
> I don't really see how this can work reliably; pg_log isn't a
> hard-coded name, but rather a GUC that users can leave set to that
> value or set to any other value they choose.

I'm thinking to change basebackup.c so that it compares the
name of the directory that it's trying to back up and the setting
value of log_directory parameter, then, if they are the same,
it just skips the directory. The patch that I sent upthread does
this regarding stats_temp_directory.
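
A minimal sketch of that comparison (assuming a helper called from the
directory walk in basebackup.c's sendDir(); Log_directory is the real
GUC variable behind log_directory, but the function name and call site
here are made up):

#include "postgres.h"
#include "postmaster/syslogger.h"   /* for Log_directory */

static bool
skip_log_directory(const char *name)
{
    /*
     * An absolute log_directory lives outside the data directory and
     * will never be visited while walking $PGDATA, so only a relative
     * setting needs to be checked here.
     */
    if (is_absolute_path(Log_directory))
        return false;

    /* back up the (empty) directory itself, but skip its contents */
    return strcmp(name, Log_directory) == 0;
}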

Regards,

-- 
Fujii Masao




Re: [HACKERS] [9.3 bug] disk space in pg_xlog increases during archive recovery

2014-02-02 Thread Andres Freund
On 2014-02-02 23:50:40 +0900, Fujii Masao wrote:
> On Sun, Feb 2, 2014 at 5:49 AM, Andres Freund  wrote:
> > On 2014-01-24 22:31:17 +0900, MauMau wrote:
> >> I haven't tried reducing checkpoint_timeout.
> >
> > Did you try reducing checkpoint_segments? As I pointed out, at least if
> > standby_mode is enabled, it will also trigger checkpoints, independently
> > from checkpoint_timeout.
>
> Right. If standby_mode is enabled, checkpoint_segments can trigger
> the restartpoint. But the problem is that the timing of restartpoint
> depends on not only the checkpoint parameters (i.e.,
> checkpoint_timeout and checkpoint_segments) that are used during
> archive recovery but also the checkpoint WAL that was generated
> by the master.

Sure. But we really *need* all the WAL since the last checkpoint's redo
location locally to be safe.

> For example, could you imagine the case where the master generated
> only one checkpoint WAL since the last backup and it crashed with
> database corruption. Then DBA decided to perform normal archive
> recovery by using the last backup. In this case, even if DBA reduces
> both checkpoint_timeout and checkpoint_segments, only one
> restartpoint can occur during recovery. This low frequency of
> restartpoint might fill up the disk space with lots of WAL files.

I am not sure I understand the point of this scenario. If the primary
crashed after a checkpoint, there won't be that much WAL since it
happened...

> > If the issue is that you're not using standby_mode (if so, why?), then
> > the fix maybe is to make that apply to a wider range of situations.
> 
> I guess that he is not using standby_mode because, according to
> his first email in this thread, he said he would like to prevent WAL
> from accumulating in pg_xlog during normal archive recovery (i.e., PITR).

Well, that doesn't necessarily prevent you from using
standby_mode... But yes, that might be the case.

I wonder if we shouldn't just always look at checkpoint segments during
!crash recovery.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2014-02-02 Thread Andres Freund
Hi,

In the nearby thread at
http://archives.postgresql.org/message-id/20140202140014.GM5930%40awork2.anarazel.de
Peter and I discovered that there is a large performance difference
between different max_connections on a larger machine (4x Opteron 6272,
64 cores together) in a readonly pgbench test...

Just as a reference, we're talking about a performance degradation from
475963.613865 tps to 197744.913556 in a pgbench -S -cj64 run, just by
setting max_connections to 90, from 91...

On 2014-02-02 15:00:14 +0100, Andres Freund wrote:
> On 2014-02-01 19:47:29 -0800, Peter Geoghegan wrote:
> > Here are the results of a benchmark on Nathan Boley's 64-core, 4
> > socket server: 
> > http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/amd-4-socket-rwlocks/
>
> That's interesting. The maximum number of what you see here (~293125)
> is markedly lower than what I can get.
>
> ... poke around ...
>
> Hm, that's partially because you're using pgbench without -M prepared if
> I see that correctly. The bottleneck in that case is primarily memory
> allocation. But even after that I am getting higher
> numbers: ~342497.
>
> Trying to nail down the difference, it oddly seems to be your
> max_connections=80 vs my 100. The profile in both cases is markedly
> different, way much more spinlock contention with 80. All in
> Pin/UnpinBuffer().
>
> I think =80 has to lead to some data being badly aligned. I can
> reproduce that =91 has *much* better performance than =90. 170841.844938
> vs 368490.268577 in a 10s test. Reproducible both with and without the test.
> That's certainly worth some investigation.
> This is *not* reproducible on the intel machine, so it might be the
> associativity of the L1/L2 cache on the AMD.

So, I looked into this, and I am fairly certain it's because of the
(mis-)alignment of the buffer descriptors. With certain max_connections
settings InitBufferPool() happens to get 64byte aligned addresses, with
others not. I checked the alignment with gdb to confirm that.

A quick hack (attached) making BufferDescriptor 64-byte aligned indeed
restored performance across all max_connections settings. It's not
surprising that a misaligned buffer descriptor causes problems -
there'll be plenty of false sharing of the spinlocks otherwise. Curious
that the Intel machine isn't hurt much by this.

Now all this hinges on the fact that, by mere accident, BufferDescriptors
are 64 bytes in size:
struct sbufdesc {
        BufferTag          tag;                  /*  0  20 */
        BufFlags           flags;                /* 20   2 */
        uint16             usage_count;          /* 22   2 */
        unsigned int       refcount;             /* 24   4 */
        int                wait_backend_pid;     /* 28   4 */
        slock_t            buf_hdr_lock;         /* 32   1 */

        /* XXX 3 bytes hole, try to pack */

        int                buf_id;               /* 36   4 */
        int                freeNext;             /* 40   4 */

        /* XXX 4 bytes hole, try to pack */

        LWLock *           io_in_progress_lock;  /* 48   8 */
        LWLock *           content_lock;         /* 56   8 */
        /* --- cacheline 1 boundary (64 bytes) --- */

        /* size: 64, cachelines: 1, members: 10 */
        /* sum members: 57, holes: 2, sum holes: 7 */
};

We could polish up the attached patch and apply it to all the branches,
the costs of memory are minimal. But I wonder if we shouldn't instead
make ShmemInitStruct() always return cacheline aligned addresses. That
will require some fiddling, but it might be a good idea nonetheless?

I think we should also consider some more reliable measures to have
BufferDescriptors cacheline sized, rather than relying on the happy
accident. Debugging alignment issues isn't fun, too much of a guessing
game...
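
For instance (just a sketch, not part of the attached patch): the
descriptors could be wrapped in a union padded to the cacheline size,
with a compile-time check so any future growth of the struct is caught
immediately. CACHELINE_SIZE = 64 is an assumption that happens to hold
for these CPUs:

#define CACHELINE_SIZE  64

/* pad each descriptor so adjacent array elements never share a line */
typedef union BufferDescPadded
{
    struct sbufdesc desc;
    char        pad[CACHELINE_SIZE];
} BufferDescPadded;

static void
check_bufferdesc_fits(void)
{
    /* breaks the build if the struct ever outgrows one cacheline */
    StaticAssertStmt(sizeof(struct sbufdesc) <= CACHELINE_SIZE,
                     "struct sbufdesc exceeds one cacheline");
}

The shared array would then be allocated as NBuffers *
sizeof(BufferDescPadded), and only the base address still needs to be
cacheline aligned, which ShmemInitStruct() could guarantee for everyone.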

Thoughts?

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index e187242..96b4eea 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -78,10 +78,14 @@ InitBufferPool(void)
 	BufferDescriptors = (BufferDesc *)
 		ShmemInitStruct("Buffer Descriptors",
 		NBuffers * sizeof(BufferDesc), &foundDescs);
+	BufferDescriptors = (BufferDesc *)(
+		TYPEALIGN(64, BufferDescriptors));
 
 	BufferBlocks = (char *)
 		ShmemInitStruct("Buffer Blocks",
 		NBuffers * (Size) BLCKSZ, &foundBufs);
+	BufferBlocks = (char *) (
+		TYPEALIGN(64, BufferBlocks));
 
 	if (foundDescs || foundBufs)
 	{
@@ -167,9 +171,11 @@ BufferShmemSize(void)
 
 	/* size of buffer descriptors */
 	size = add_size(size, mul_size(NBuffers, sizeof(BufferDesc)));
+	size = add_size(size, 64);
 
 	/* size of data pages */
 	size = add_size(size,

Re: [HACKERS] [9.3 bug] disk space in pg_xlog increases during archive recovery

2014-02-02 Thread Fujii Masao
On Sun, Feb 2, 2014 at 5:49 AM, Andres Freund  wrote:
> On 2014-01-24 22:31:17 +0900, MauMau wrote:
>> From: "Fujii Masao" 
>> >On Wed, Jan 22, 2014 at 6:37 AM, Heikki Linnakangas
>> >>>Thanks! The patch looks good to me. Attached is the updated version of
>> >>>the patch. I added the comments.
>>
>> Thank you very much.  Your comment looks great.  I tested some recovery
>> situations, and confirmed that WAL segments were kept/removed as intended.
>> I'll update the CommitFest entry with this patch.
>
> You haven't updated the patches status so far, so I've now marked as
> returned feedback as several people voiced serious doubts about the
> approach. If that's not accurate please speak up.
>
>> >The problem is, we might not be able to perform restartpoints more
>> >aggressively
>> >even if we reduce checkpoint_timeout in the server under the archive
>> >recovery.
>> >Because the frequency of occurrence of restartpoints depends on not only
>> >that
>> >checkpoint_timeout but also the checkpoints which happened while the
>> >server
>> >was running.
>>
>> I haven't tried reducing checkpoint_timeout.
>
> Did you try reducing checkpoint_segments? As I pointed out, at least if
> standby_mode is enabled, it will also trigger checkpoints, independently
> from checkpoint_timeout.

Right. If standby_mode is enabled, checkpoint_segments can trigger
the restartpoint. But the problem is that the timing of restartpoint
depends on not only the checkpoint parameters (i.e.,
checkpoint_timeout and checkpoint_segments) that are used during
archive recovery but also the checkpoint WAL that was generated
by the master.

For example, could you imagine the case where the master generated
only one checkpoint WAL since the last backup and it crashed with
database corruption. Then DBA decided to perform normal archive
recovery by using the last backup. In this case, even if DBA reduces
both checkpoint_timeout and checkpoint_segments, only one
restartpoint can occur during recovery. This low frequency of
restartpoint might fill up the disk space with lots of WAL files.

This would be harmless if the server that we are performing recovery
in has enough disk space. But I can imagine that some users want to
recover the database and restart the service temporarily on a server
with barely enough disk space until they can procure a sufficient one.
In this case, accumulating lots of WAL files in pg_xlog might be harmful.

> If the issue is that you're not using standby_mode (if so, why?), then
> the fix maybe is to make that apply to a wider range of situations.

I guess that he is not using standby_mode because, according to
his first email in this thread, he said he would like to prevent WAL
from accumulating in pg_xlog during normal archive recovery (i.e., PITR).

Regards,

-- 
Fujii Masao




Re: [HACKERS] Wait free LW_SHARED acquisition - v0.2

2014-02-02 Thread Andres Freund
Hi,

On 2014-02-01 19:47:29 -0800, Peter Geoghegan wrote:
> Here are the results of a benchmark on Nathan Boley's 64-core, 4
> socket server: 
> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/amd-4-socket-rwlocks/

That's interesting. The maximum number of what you see here (~293125)
is markedly lower than what I can get.

... poke around ...

Hm, that's partially because you're using pgbench without -M prepared if
I see that correctly. The bottleneck in that case is primarily memory
allocation. But even after that I am getting higher
numbers: ~342497.

Trying to nail down the difference, it oddly seems to be your
max_connections=80 vs my 100. The profile in both cases is markedly
different, way much more spinlock contention with 80. All in
Pin/UnpinBuffer().

I think =80 has to lead to some data being badly aligned. I can
reproduce that =91 has *much* better performance than =90. 170841.844938
vs 368490.268577 in a 10s test. Reproducible both with and without the test.
That's certainly worth some investigation.
This is *not* reproducible on the intel machine, so it might be the
associativity of the L1/L2 cache on the AMD.

> Perhaps I should have gone past 64 clients, because in the document
> "Lock Scaling Analysis on Intel Xeon Processors" [1], Intel write:
>
> "This implies that with oversubscription (more threads running than
> available logical CPUs), the performance of spinlocks can depend
> heavily on the exact OS scheduler behavior, and may change drastically
> with operating system or VM updates."
>
> I haven't bothered with higher client counts though, because Andres
> noted it's the same with 90 clients on this AMD system. Andres: Do you
> see big problems when # clients < # logical cores on the affected
> Intel systems?

There's some slowdown with the patch applied, but it's not big. Without
it, the slowdown is much earlier.

> There is only a marginal improvement in performance on this big 4
> socket system. Andres informs me privately that he has reproduced the
> problem on multiple new 4-socket Intel servers, so it seems reasonable
> to suppose that more or less an Intel thing.

I've just poked around, it's not just 4 socket, but also 2 socket
systems.

Some background:
The setups that triggered me into working on the patchset didn't really
have a pgbench-like workload; the individual queries were/are more
complicated, even though it's still a high-throughput OLTP workload. And
the contention was *much* higher than what I can reproduce with pgbench
-S: there was often nearly all time spent in the lwlocks' spinlocks, and
it was primarily the buffer mapping lwlocks, being locked in shared
mode. The difference is that instead of locking very few buffers per
query like pgbench does, those queries touched many more.
If you look at a profile of a pgbench -S -cj64 workload, it's pretty much
all bottlenecked by GetSnapshotData():
unpatched:
-  40.91%  postgres_plainl  postgres_plainlw[.] s_lock
   - s_lock
  - 51.34% LWLockAcquire
   GetSnapshotData
 - GetTransactionSnapshot
+ 55.23% PortalStart
+ 44.77% PostgresMain
  - 48.66% LWLockRelease
   GetSnapshotData
 - GetTransactionSnapshot
+ 53.64% PostgresMain
+ 46.36% PortalStart
+   2.65%  pgbench  [kernel.kallsyms]   [k] try_to_wake_up
+   2.61%  postgres_plainl  postgres_plainlw[.] LWLockRelease
+   1.36%  postgres_plainl  postgres_plainlw[.] LWLockAcquire
+   1.25%  postgres_plainl  postgres_plainlw[.] GetSnapshotData

patched:
-   2.94%   postgres  postgres  [.] LWLockAcquire
   - LWLockAcquire
  + 26.61% ReadBuffer_common
  + 17.08% GetSnapshotData
  + 10.71% _bt_relandgetbuf
  + 10.24% LockAcquireExtended
  + 10.11% VirtualXactLockTableInsert
  + 9.17% VirtualXactLockTableCleanup
  + 7.55% _bt_getbuf
  + 3.40% index_fetch_heap
  + 1.51% LockReleaseAll
  + 0.89% StartTransactionCommand
  + 0.83% _bt_next
  + 0.75% LockRelationOid
  + 0.52% ReadBufferExtended
-   2.75%   postgres  postgres  [.] _bt_compare
   - _bt_compare
  + 96.72% _bt_binsrch
  + 1.71% _bt_moveright
  + 1.29% _bt_search
-   2.67%   postgres  postgres  [.] GetSnapshotData
   - GetSnapshotData
  + 97.03% GetTransactionSnapshot
  + 2.16% PostgresMain
  + 0.81% PortalStart

So now the profile looks much saner/less contended, which is immediately
visible in the transaction rates: 192584.218530 vs 552834.002573.

But if you try to attack the contention from the other end, by setting
default_transaction_isolation='repeatable read' to reduce the number of
snapshots taken, it's suddenly 536789.807391 vs 566839.328922. A *much*
smaller benefit.
The reason the patch doesn't help that much with that setting is that there
simply isn't as much actual contention there:

+   2.77%   postgres  postgres[.] _bt_compare
-   2.72%   postgres  postgres

Re: [HACKERS] SSL: better default ciphersuite

2014-02-02 Thread Marko Kreen
On Thu, Dec 12, 2013 at 04:32:07PM +0200, Marko Kreen wrote:
> Attached patch changes default ciphersuite to HIGH:MEDIUM:+3DES:!aNULL
> and also adds documentation about reasoning for it.

This is the last pending SSL cleanup related patch:

  https://commitfest.postgresql.org/action/patch_view?id=1310

Peter, you have claimed it as committer, do you see any remaining
issues with it?
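
For anyone who wants to double-check what the proposed string expands
to, "openssl ciphers -v 'HIGH:MEDIUM:+3DES:!aNULL'" prints the list; a
minimal C equivalent against the OpenSSL 1.0.x API (a sketch, with
error handling kept to the cipher-string check):

#include <stdio.h>
#include <openssl/ssl.h>

int
main(void)
{
    SSL_CTX    *ctx;
    SSL        *ssl;
    STACK_OF(SSL_CIPHER) *ciphers;
    int         i;

    SSL_library_init();
    ctx = SSL_CTX_new(SSLv23_method());
    if (!SSL_CTX_set_cipher_list(ctx, "HIGH:MEDIUM:+3DES:!aNULL"))
    {
        fprintf(stderr, "invalid cipher string\n");
        return 1;
    }
    ssl = SSL_new(ctx);
    ciphers = SSL_get_ciphers(ssl);
    for (i = 0; i < sk_SSL_CIPHER_num(ciphers); i++)
        printf("%s\n", SSL_CIPHER_get_name(sk_SSL_CIPHER_value(ciphers, i)));
    return 0;
}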

-- 
marko





Re: [HACKERS] GIN improvements part2: fast scan

2014-02-02 Thread Heikki Linnakangas

On 01/30/2014 01:53 AM, Tomas Vondra wrote:

(3) A file with explain plans for 4 queries suffering ~2x slowdown,
 and explain plans with 9.4 master and Heikki's patches is available
 here:

   http://www.fuzzy.cz/tmp/gin/queries.txt

 All the queries have 6 common words, and the explain plans look
 just fine to me - exactly like the plans for other queries.

 Two things now caught my eye. First some of these queries actually
 have words repeated - either exactly like "database & database" or
 in negated form like "!anything & anything". Second, while
 generating the queries, I use "dumb" frequency, where only exact
 matches count. I.e. "write != written" etc. But the actual number
 of hits may be much higher - for example "write" matches exactly
 just 5% documents, but using @@ it matches more than 20%.

 I don't know if that's the actual cause though.


Ok, here's another variant of these patches. Compared to git master, it 
does three things:


1. It adds the concept of a ternary consistent function internally, with no
catalog changes. It's implemented by calling the regular boolean 
consistent function "both ways" (see the sketch after this list).


2. Use a binary heap to get the "next" item from the entries in a scan 
(also sketched below). I'm pretty sure this makes sense, because arguably 
it makes the code more readable, and it reduces the number of item pointer 
comparisons significantly for queries with a lot of entries.


3. Only perform the pre-consistent check to try skipping entries if we 
don't already have the next item from the entry loaded in the array. 
This is a tradeoff: you will lose some of the performance gain you might 
get from pre-consistent checks, but it also limits the performance loss 
you might get from doing useless pre-consistent checks.
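
To sketch points 1 and 2 (illustrative code only; the entryState field
and the consume_item() callback are made up, and the patch's real shim
lives in the new ginlogic.c): a boolean consistent function yields a
ternary answer by being called twice, once with every unknown entry
assumed false and once with all of them assumed true. If both calls
agree, the shim takes the answer as definite:

typedef enum { GIN_FALSE, GIN_TRUE, GIN_MAYBE } GinTernaryValue;

static GinTernaryValue
shimTriConsistent(GinState *ginstate, GinScanKey key)
{
    int         i;
    bool        res_false,
                res_true;

    /* first call: every MAYBE entry pretends to be FALSE */
    for (i = 0; i < key->nentries; i++)
        key->entryRes[i] = (key->entryState[i] == GIN_TRUE);
    res_false = callConsistentFn(ginstate, key);

    /* second call: every MAYBE entry pretends to be TRUE */
    for (i = 0; i < key->nentries; i++)
        key->entryRes[i] = (key->entryState[i] != GIN_FALSE);
    res_true = callConsistentFn(ginstate, key);

    if (res_false == res_true)
        return res_false ? GIN_TRUE : GIN_FALSE;    /* definite answer */
    return GIN_MAYBE;           /* depends on the unknown entries */
}

And the heap from point 2 then drives the merge of the per-entry item
streams roughly like this (entryGetItem() is the existing ginget.c
routine that advances one entry stream; ordering comes from entryCmp()
in the patch below):

    while (!binaryheap_empty(key->entryHeap))
    {
        GinScanEntry entry = (GinScanEntry)
            DatumGetPointer(binaryheap_first(key->entryHeap));

        /* entry->curItem is the smallest item not yet consumed */
        consume_item(key, &entry->curItem);     /* made-up consumer */

        entryGetItem(ginstate, entry);          /* advance the stream */
        if (entry->isFinished)
            (void) binaryheap_remove_first(key->entryHeap);
        else
            binaryheap_replace_first(key->entryHeap,
                                     PointerGetDatum(entry));
    }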


So taken together, I would expect this patch to make some of the 
performance gains less impressive, but also limit the loss we saw with 
some of the other patches.


Tomas, could you run your test suite with this patch, please?

- Heikki

diff --git a/src/backend/access/gin/Makefile b/src/backend/access/gin/Makefile
index aabc62f..db4f496 100644
--- a/src/backend/access/gin/Makefile
+++ b/src/backend/access/gin/Makefile
@@ -14,6 +14,6 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = ginutil.o gininsert.o ginxlog.o ginentrypage.o gindatapage.o \
 	ginbtree.o ginscan.o ginget.o ginvacuum.o ginarrayproc.o \
-	ginbulk.o ginfast.o ginpostinglist.o
+	ginbulk.o ginfast.o ginpostinglist.o ginlogic.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index a45d722..f2ea962 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -32,41 +32,6 @@ typedef struct pendingPosition
 	bool	   *hasMatchKey;
 } pendingPosition;
 
-
-/*
- * Convenience function for invoking a key's consistentFn
- */
-static bool
-callConsistentFn(GinState *ginstate, GinScanKey key)
-{
-	/*
-	 * If we're dealing with a dummy EVERYTHING key, we don't want to call the
-	 * consistentFn; just claim it matches.
-	 */
-	if (key->searchMode == GIN_SEARCH_MODE_EVERYTHING)
-	{
-		key->recheckCurItem = false;
-		return true;
-	}
-
-	/*
-	 * Initialize recheckCurItem in case the consistentFn doesn't know it
-	 * should set it.  The safe assumption in that case is to force recheck.
-	 */
-	key->recheckCurItem = true;
-
-	return DatumGetBool(FunctionCall8Coll(&ginstate->consistentFn[key->attnum - 1],
- ginstate->supportCollation[key->attnum - 1],
-		  PointerGetDatum(key->entryRes),
-		  UInt16GetDatum(key->strategy),
-		  key->query,
-		  UInt32GetDatum(key->nuserentries),
-		  PointerGetDatum(key->extra_data),
-	   PointerGetDatum(&key->recheckCurItem),
-		  PointerGetDatum(key->queryValues),
-	 PointerGetDatum(key->queryCategories)));
-}
-
 /*
  * Goes to the next page if current offset is outside of bounds
  */
@@ -453,13 +418,31 @@ restartScanEntry:
 	freeGinBtreeStack(stackEntry);
 }
 
+static int
+entryCmp(Datum a, Datum b, void *arg)
+{
+	GinScanEntry ea = (GinScanEntry) DatumGetPointer(a);
+	GinScanEntry eb = (GinScanEntry) DatumGetPointer(b);
+	return  -ginCompareItemPointers(&ea->curItem, &eb->curItem);
+}
+
 static void
 startScanKey(GinState *ginstate, GinScanKey key)
 {
+	int			i;
 	ItemPointerSetMin(&key->curItem);
 	key->curItemMatches = false;
 	key->recheckCurItem = false;
 	key->isFinished = false;
+
+	GinInitConsistentMethod(ginstate, key);
+
+	key->entryHeap = binaryheap_allocate(key->nentries, entryCmp, NULL);
+	for (i = 0; i < key->nentries; i++)
+		binaryheap_add(key->entryHeap, PointerGetDatum(key->scanEntry[i]));
+
+	key->nunloaded = 0;
+	key->unloadedEntries = palloc(key->nentries * sizeof(GinScanEntry));
 }
 
 static void
@@ -649,6 +632,11 @@ entryLoadMoreItems(GinState *ginstate, GinScanEntry entry, ItemPointerData advan
  *
  * Item pointers are returned in ascending order.
  *
+ * 

Re: [HACKERS] [PATCH] pg_sleep(interval)

2014-02-02 Thread Julien Rouhaud
Hi

It seems like pg_sleep_until() has issues if used within a transaction,
because it uses now(), which is frozen at the start of the transaction,
rather than clock_timestamp(). Please find attached a patch that solves
this issue.

For consistency reasons, I also modified pg_sleep_for() to use
clock_timestamp().

Regards


On Fri, Jan 31, 2014 at 2:12 AM, Vik Fearing  wrote:

> On 01/30/2014 09:48 PM, Robert Haas wrote:
> > On Thu, Oct 17, 2013 at 9:11 AM, Vik Fearing 
> wrote:
> >> On 10/17/2013 02:42 PM, Robert Haas wrote:
> >>> On Thu, Oct 17, 2013 at 8:26 AM, Vik Fearing 
> wrote:
>  On 10/17/2013 10:03 AM, Fabien COELHO wrote:
> > My guess is that it won't be committed if there is a single "but it
> > might break one code or surprise one user somewhere in the universe",
> > but I wish I'll be proven wrong. IMO, "returned with feedback" on a 1
> > liner is really akin to "rejected".
>  I have attached here an entirely new patch (new documentation and
>  everything) that should please everyone.  It no longer overloads
>  pg_sleep(double precision) but instead add two new functions:
> 
>   * pg_sleep_for(interval)
>   * pg_sleep_until(timestamp with time zone)
> 
>  Because it's no longer overloading the original pg_sleep, Robert's
>  ambiguity objection is no more.
> 
>  Also, I like how it reads aloud: SELECT pg_sleep_for('5 minutes');
> 
>  If people like this, I'll reject the current patch and add this one to
>  the next commitfest.
> >>> I find that naming relatively elegant.  However, you've got to
> >>> schema-qualify every function and operator used in the definitions, or
> >>> you're creating a search-path security vulnerability.
> >>>
> >> Good catch.  Updated patch attached.
> > Committed.
>
> Thanks!
>
> --
> Vik
>
>
>
>
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*** 3034,3042 ****
  DATA(insert OID = 2625 ( pg_ls_dir		PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0
  DESCR("list all files in a directory");
  DATA(insert OID = 2626 ( pg_sleep		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "701" _null_ _null_ _null_ _null_ pg_sleep _null_ _null_ _null_ ));
  DESCR("sleep for the specified time in seconds");
! DATA(insert OID = 3935 ( pg_sleep_for		PGNSP PGUID 14 1 0 0 0 f f f f t f v 1 0 2278 "1186" _null_ _null_ _null_ _null_ "select pg_catalog.pg_sleep(extract(epoch from pg_catalog.now() operator(pg_catalog.+) $1) operator(pg_catalog.-) extract(epoch from pg_catalog.now()))" _null_ _null_ _null_ ));
  DESCR("sleep for the specified interval");
! DATA(insert OID = 3936 ( pg_sleep_until		PGNSP PGUID 14 1 0 0 0 f f f f t f v 1 0 2278 "1184" _null_ _null_ _null_ _null_ "select pg_catalog.pg_sleep(extract(epoch from $1) operator(pg_catalog.-) extract(epoch from pg_catalog.now()))" _null_ _null_ _null_ ));
  DESCR("sleep until the specified time");
  
  DATA(insert OID = 2971 (  text		PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 "16" _null_ _null_ _null_ _null_ booltext _null_ _null_ _null_ ));
--- 3034,3042 ----
  DESCR("list all files in a directory");
  DATA(insert OID = 2626 ( pg_sleep		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "701" _null_ _null_ _null_ _null_ pg_sleep _null_ _null_ _null_ ));
  DESCR("sleep for the specified time in seconds");
! DATA(insert OID = 3935 ( pg_sleep_for		PGNSP PGUID 14 1 0 0 0 f f f f t f v 1 0 2278 "1186" _null_ _null_ _null_ _null_ "select pg_catalog.pg_sleep(extract(epoch from pg_catalog.clock_timestamp() operator(pg_catalog.+) $1) operator(pg_catalog.-) extract(epoch from pg_catalog.clock_timestamp()))" _null_ _null_ _null_ ));
  DESCR("sleep for the specified interval");
! DATA(insert OID = 3936 ( pg_sleep_until		PGNSP PGUID 14 1 0 0 0 f f f f t f v 1 0 2278 "1184" _null_ _null_ _null_ _null_ "select pg_catalog.pg_sleep(extract(epoch from $1) operator(pg_catalog.-) extract(epoch from pg_catalog.clock_timestamp()))" _null_ _null_ _null_ ));
  DESCR("sleep until the specified time");
  
  DATA(insert OID = 2971 (  text		PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 "16" _null_ _null_ _null_ _null_ booltext _null_ _null_ _null_ ));



Re: [HACKERS] [GENERAL] Insert result does not match record count

2014-02-02 Thread Vik Fearing
On 02/01/2014 02:26 AM, Bruce Momjian wrote:
> On Sat, Feb  1, 2014 at 02:25:16AM +0100, Vik Fearing wrote:
>>> OK, thanks for the feedback.  I understand now.  The contents of the
>>> string will potentially have a larger integer, but the byte length of
>>> the string in the wire protocol doesn't change.
>>>
>>> Let's wait for Vik to reply and I think we can move forward.
>> Unfortunately, I just did some cleanup last week and removed that
>> branch.  Had I waited a bit more I still would have had all the work I
>> had done.  I'll see how quickly I can redo it to get to the part where I
>> got scared of what I was doing.
>>
>> It will have to wait until next week though; I am currently at FOSDEM.
> OK, thanks.  I thought it only required passing the int64 around until
> it got into the string passed to the client.  The original patch is in
> the email archives if you want it.


The original patch didn't have much in the way of actual work done,
unfortunately.

Without re-doing the work, my IRC logs show that I was bothered by this
in src/backend/tcop/postgres.c:

case 'E':/* execute */
{
const char *portal_name;
intmax_rows;

forbidden_in_wal_sender(firstchar);

/* Set statement_timestamp() */
SetCurrentStatementStartTimestamp();

portal_name = pq_getmsgstring(&input_message);
max_rows = pq_getmsgint(&input_message, 4);
pq_getmsgend(&input_message);

exec_execute_message(portal_name, max_rows);
}
break;

I needed to change max_rows to int64, which meant I had to change
pq_getmsgint to pq_getmsgint64, which made me a little worried.  I was
already overwhelmed by how much code I was changing, and this one made me
reconsider.

If it's just a n00b thing, I guess I can pick this back up for 9.5.

-- 
Vik


