Re: [HACKERS] [GENERAL] Case sensitivity

2013-12-12 Thread Dev Kumkar
On Thu, Dec 12, 2013 at 12:47 PM, Dev Kumkar wrote:

> + hackers
>
>
>
> On Thu, Dec 12, 2013 at 12:34 PM, Dev Kumkar wrote:
>
>> On Wed, Dec 11, 2013 at 9:47 PM, Dev Kumkar wrote:
>>
>>> Actually, for searches, lower() will work.
>>> But the other important aspect is inserts, which would result in 2 rows
>>> if the values are 'A' and 'a'. The intent here is to make them case
>>> insensitive.
>>>
>>> With CITEXT, it updates the same row, and that works.
>>> CITEXT is an alternative, but I was wondering if there is any other
>>> solution/setting available while creating the database.
>>>
>>> Also, does fetching CITEXT via JDBC work the same way as fetching/setting
>>> string values? Any quick comments here.
>>>
>>> http://techie-experience.blogspot.in/2013/04/hibernate-supporting-case-insensitive.html
>>>
>>> Regards...
>>>
>>
>> Also, if the key columns are CITEXT, are there any performance issues with
>> indexes?
>>
> I am OK with CITEXT, but before changing the database design so that the
primary/foreign key columns are CITEXT, I need some suggestions/comments
regarding performance for inserts/reads.

Awaiting...
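A minimal sketch of the insert behavior under discussion (assuming the contrib citext extension is installed; table and column names are invented for illustration):

CREATE EXTENSION IF NOT EXISTS citext;
CREATE TABLE t (val citext PRIMARY KEY);
INSERT INTO t VALUES ('A');
INSERT INTO t VALUES ('a');  -- fails with a duplicate-key error, since 'A' = 'a' under citext
-- with "val text" instead, both inserts succeed and you end up with two rows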


Re: [HACKERS] RFC: programmable file format for postgresql.conf

2013-12-12 Thread Álvaro Hernández Tortosa



On 13/12/13 04:11, Greg Stark wrote:


On 12 Dec 2013 04:20, "Álvaro Hernández Tortosa"  wrote:

 > Thanks, Greg. I've been going through those threads; they are
quite interesting. I didn't find an answer, though, to my question:
why is parsing the postgresql.conf (and, for instance, preserving the
comments while writing it back) no longer a problem?

Parsing it isn't hard. It's precisely because the file isn't
programmable and is such a simple format that it's easy to parse.

It's making changes and then writing it out again while preserving the
intended format that's hard.


	Sure, writing it back is the "difficult" part.


So we convinced people to stop trying to do that.

The whole idea of include rules is to separate the portion of the file
that's human edited and the portion that's machine maintained. That's
the only viable strategy.


	I agree that makes the file "programmable" (in a limited way). You say 
you're trying to stop people from trying to do that, but that's precisely 
what is needed in order to, for example, create tools that help configure postgres!


	Going back to my original email, the main issues I wanted to analyze 
were basically:


- Adding metainformation to the config file so that non-expert users 
(i.e., the vast majority of postgres users) can configure 
postgresql more easily (by having extra information in place, such as 
the min val, max val, vartype, comments, and a URL to the docs).


- Adding metainformation to the config file so that this metainformation 
is centrally located and self-contained. This in turn encourages tool 
devs to create both GUI and automatic tools for configuring postgres. 
I consider the "centrally located" part critical, as the file becomes 
a "framework" or "repository" of metainformation, so that tool devs 
don't have to write their own for every tool.


	I now realize that maybe I should have called my post "Adding 
metainformation to the postgresql.conf file" or something like that.


	The include mechanism allows some degree of programmability, but it has 
to remain compatible with the current postgresql.conf format, which 
doesn't contain this metainformation.


	To achieve the goals above, I think the only viable way would be to 
ship with the postgresql distribution a file with all the 
metainformation, which could:


- either be the postgresql.conf file itself (which would need a 
different format, of course)


- or an external file with an included program to convert that file to 
the current postgresql.conf


	Please let me know if there is a third or fourth option.

	I have started a little research on the second approach, and I'll post 
back with a file format and the code of a proof-of-concept tool to convert 
it to postgresql.conf and to help users configure postgresql both from a GUI 
and from a CLI.
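
	As a purely hypothetical illustration (this format is invented here for 
discussion and does not come from any patch), an annotated entry might look 
like:

# hypothetical metainformation-bearing format -- illustration only
[shared_buffers]
vartype = integer
unit    = 8kB
minval  = 16
maxval  = 1073741823
doc     = http://www.postgresql.org/docs/9.3/static/runtime-config-resource.html
value   = 16384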


Regards,

aht

--
Álvaro Hernández Tortosa


---
NOSYS
Networked Open SYStems




Re: [HACKERS] Logging WAL when updating hintbit

2013-12-12 Thread Sawada Masahiko
On Fri, Dec 13, 2013 at 1:51 PM, Dilip kumar  wrote:
On 04 December 2013, Sawada Masahiko wrote:
>
>
>> I attached the patch, which I have modified based on Robert's suggestion,
>> and fixed a typo.
>
> I have reviewed the modified patch and I have some comments:
>
> 1. The patch needs to be rebased (it failed to apply on head).
>
> 2. The crc field should stay at the end of the ControlFileData struct, because the 
> crc calculation directly takes the offset of the crc field in ControlFileData.
>
> 	/* CRC of all above ... MUST BE LAST! */
> 	pg_crc32	crc;
> +
> +	/* Enable logging WAL when updating hint bits */
> +	bool		wal_log_hintbits;
> } ControlFileData;
>
> 3. The wal_log_hintbits field should be printed in the PrintControlValues function.
>

Thank you for reviewing the patch!
I have modified the patch based on your comments, and attached the v7 patch.

Regards,

---
Sawada Masahiko


log_hint_bit_wal_v7.patch
Description: Binary data



Re: [HACKERS] Logging WAL when updating hintbit

2013-12-12 Thread Dilip kumar
On 04 December 2013, Sawada Masahiko wrote:


> I attached the patch, which I have modified based on Robert's suggestion,
> and fixed a typo.

I have reviewed the modified patch and I have some comments:

1. The patch needs to be rebased (it failed to apply on head).

2. The crc field should stay at the end of the ControlFileData struct, because the 
crc calculation directly takes the offset of the crc field in ControlFileData.

	/* CRC of all above ... MUST BE LAST! */
	pg_crc32	crc;
+
+	/* Enable logging WAL when updating hint bits */
+	bool		wal_log_hintbits;
} ControlFileData;

3. The wal_log_hintbits field should be printed in the PrintControlValues function.


Regards,
Dilip



Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Christophe Pettus

On Dec 12, 2013, at 7:40 PM, Peter Geoghegan  wrote:
> Couldn't that just be the app setting it locally?

Yes, that's what is happening there (I had to check with the client's 
developers).  It's possible that the one-minute repeat is due to the 
application reissuing the query, rather than being specifically related to the 
spinlock issue.  What this does reveal is that all the spinlock issues have 
been on long-running queries, for what it's worth.

--
-- Christophe Pettus
   x...@thebuild.com





Re: [HACKERS] [PATCH] configure: allow adding a custom string to PG_VERSION

2013-12-12 Thread Michael Paquier
On Fri, Dec 13, 2013 at 12:03 PM, Peter Eisentraut  wrote:
> On Wed, 2013-11-20 at 01:08 +0200, Oskari Saarenmaa wrote:
>> Agreed.  Attached an updated patch, or you can grab it from
>> https://github.com/saaros/postgres/compare/extra-version
>
> Committed.
Thanks.
-- 
Michael




Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Peter Geoghegan
On Thu, Dec 12, 2013 at 7:35 PM, Christophe Pettus  wrote:
> There are a *lot* of "canceling statement due to statement timeout" messages, 
> which is interesting, because:
>
> postgres=# show statement_timeout;
>  statement_timeout
> -------------------
>  0
> (1 row)

Couldn't that just be the app setting it locally? In fact, isn't that
the recommended usage?
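
For reference, "setting it locally" amounts to something like this on the application side:

-- per session, e.g. issued by the application or pooler at connect time:
SET statement_timeout = '30s';

-- or scoped to a single transaction:
BEGIN;
SET LOCAL statement_timeout = '30s';
-- ... run the query ...
COMMIT;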


-- 
Peter Geoghegan




Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Christophe Pettus

On Dec 12, 2013, at 6:41 PM, Andres Freund  wrote:

> Christophe: are there any "unusual" ERROR messages preceding the crash,
> possibly some minutes before?

Interestingly, each spinlock PANIC is *followed*, about one minute later (+/- 
five seconds), by a "canceling statement due to statement timeout" on that exact 
query.  The queries vary enough in text that it is unlikely to be a coincidence.

There are a *lot* of "canceling statement due to statement timeout" messages, 
which is interesting, because:

postgres=# show statement_timeout;
 statement_timeout 
-------------------
 0
(1 row)

--
-- Christophe Pettus
   x...@thebuild.com





Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Christophe Pettus

On Dec 12, 2013, at 6:24 PM, Andres Freund  wrote:
> Is it really a regular pattern like hourly? What's your
> checkpoint_segments?

No, it's not a pattern like that; that's an approximation.  Sometimes they 
come in clusters; sometimes 2-3 hours pass without one.  They don't happen 
exclusively inside or outside of a checkpoint.

checkpoint_timeout = 5min
checkpoint_segments = 64
checkpoint_completion_target = 0.9

> Could you, around the time of a crash, check "grep Dirt
> /proc/meminfo" and run iostat -xm 1 20?

Dirty: 30104 kB

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.70    0.00    0.91    0.53    0.00   94.85

Device:  rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        0.83   113.13    1.18    2.01     0.06     0.45   329.29     0.04   12.18    1.28   18.55   0.57   0.18
sdb        0.06   113.15    0.98    1.99     0.06     0.45   349.36     0.24   79.30    3.57  116.60   1.46   0.43
md0        0.00     0.00    0.00    0.00     0.00     0.00     3.39     0.00    0.00    0.00    0.00   0.00   0.00
md1        0.00     0.00    1.18  114.92     0.01     0.45     8.01     0.00    0.00    0.00    0.00   0.00   0.00
dm-0       0.00     0.00    0.06  111.82     0.00     0.44     8.02     0.57    4.88    0.24    4.89   0.04   0.43
dm-1       0.00     0.00    1.11    3.03     0.00     0.01     8.00     1.25  300.47    0.38  410.89   0.17   0.07
sdc        0.00     0.00   12.10  136.13     0.50    19.97   282.85     1.94   13.07    2.30   14.03   0.55   8.20
dm-2       0.00    39.63   24.23  272.24     1.00    39.82   281.97     1.31    4.44    1.98    4.65   0.44  13.03
sdd        0.00     0.00   12.13  136.11     0.50    19.84   281.10     1.35    9.10    1.64    9.77   0.42   6.21

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.09    0.00    0.08    0.13    0.00   98.71

Device:  rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md1        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0       0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1       0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc        0.00     0.00    0.00  558.00     0.00     8.95    32.85     7.36   13.20    0.00   13.20   0.12   6.80
dm-2       0.00    28.00    0.00  558.00     0.00     8.95    32.85     7.38   13.23    0.00   13.23   0.12   6.80
sdd        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.38    0.00    0.17    0.13    0.00   99.33

Device:  rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md1        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0       0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1       0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc        0.00     0.00   36.00   11.00     0.18     0.15    14.30     0.06    1.36    0.67    3.64   0.94   4.40
dm-2       0.00     0.00   36.00   11.00     0.18     0.15    14.30     0.06    1.36    0.67    3.64   0.94   4.40
sdd        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.83    0.00    0.29    0.04    0.00   98.83

Device:  rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb        0.00     0.00    0.00    0.00     0.00     0.

[HACKERS] PGCon 2014 call for papers

2013-12-12 Thread Dan Langille
PGCon 2014 will be on 20-24 May 2014 at University of Ottawa.

* 20-21 (Tue-Wed) tutorials
* 22-23 (Thu-Fri) talks - the main part of the conference
* 24 (Sat) The Unconference (very popular in 2013, the first year)

See http://www.pgcon.org/2014/

We are now accepting proposals for the main part of the conference (22-23 May).
Proposals can be quite simple. We do not require academic-style papers.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share the details.  These and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump-start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software


The schedule is:

1 Dec 2013 Proposal acceptance begins
19 Jan 2014 Proposal acceptance ends
19 Feb 2014 Confirmation of accepted proposals

NOTE: the call for lightning talks will go out very close to the conference.
Do not submit lightning talk proposals until then.

See also 

Instructions for submitting a proposal to PGCon 2014 are available
from: 

-- 
Dan Langille - http://langille.org





Re: [HACKERS] RFC: programmable file format for postgresql.conf

2013-12-12 Thread Greg Stark
On 12 Dec 2013 04:20, "Álvaro Hernández Tortosa"  wrote:

> Thanks, Greg. I've been going through those threads; they are
quite interesting. I didn't find an answer, though, to my question: why
is parsing the postgresql.conf (and, for instance, preserving the comments
while writing it back) no longer a problem?

Parsing it isn't hard. It's precisely because the file isn't programmable
and is such a simple format that it's easy to parse.

It's making changes and then writing it out again while preserving the
intended format that's hard.

So we convinced people to stop trying to do that.

The whole idea of include rules is to separate the portion of the file
that's human edited and the portion that's machine maintained. That's the
only viable strategy.
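
Concretely, that separation can be expressed with the stock include directives (the file name below is illustrative, not a fixed convention):

# postgresql.conf -- the human-edited portion
shared_buffers = 4GB

# machine-maintained settings live elsewhere and are read last,
# so they override anything set above:
include_if_exists = 'auto-generated.conf'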


Re: [HACKERS] [PATCH] configure: allow adding a custom string to PG_VERSION

2013-12-12 Thread Peter Eisentraut
On Wed, 2013-11-20 at 01:08 +0200, Oskari Saarenmaa wrote:
> Agreed.  Attached an updated patch, or you can grab it from 
> https://github.com/saaros/postgres/compare/extra-version

Committed.





Re: [HACKERS] Why standby.max_connections must be higher than primary.max_connections?

2013-12-12 Thread satoshi yamada
>> Because the KnownAssignedXIDs and lock tables on the standby need to
>> be large enough to contain the largest snapshot and greatest number of
>> AccessExclusiveLocks that could exist on the master at any given time.
>
> Right. Initially during the development of Hot Standby, it looked like
> the "max_connections >= master's" requirement on standbys wasn't going
> to be necessary, or could be avoided. However, Simon gave up on that
> idea on pragmatic grounds here:
>
>
http://www.postgresql.org/message-id/1252002165.2889.467.camel@ebony.2ndQuadrant
>
> I'd thought about revisiting this myself, but I think that the impetus
> to do so is lessened by recent work on logical replication.
>

Hi Peter

Your information made the question clear to me.
I now understand the discussions about this restriction.

Thanks.
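
In practice, the restriction under discussion means, for example:

# primary's postgresql.conf
max_connections = 200

# standby's postgresql.conf -- must be >= the primary's setting
max_connections = 200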



2013/12/12 Peter Geoghegan 

> On Tue, Dec 10, 2013 at 1:17 PM, Robert Haas 
> wrote:
> > Because the KnownAssignedXIDs and lock tables on the standby need to
> > be large enough to contain the largest snapshot and greatest number of
> > AccessExclusiveLocks that could exist on the master at any given time.
>
> Right. Initially during the development of Hot Standby, it looked like
> the "max_connections >= master's" requirement on standbys wasn't going
> to be necessary, or could be avoided. However, Simon gave up on that
> idea on pragmatic grounds here:
>
>
> http://www.postgresql.org/message-id/1252002165.2889.467.camel@ebony.2ndQuadrant
>
> I'd thought about revisiting this myself, but I think that the impetus
> to do so is lessened by recent work on logical replication.
>
> --
> Peter Geoghegan
>


Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Andres Freund
On 2013-12-12 21:15:29 -0500, Tom Lane wrote:
> Christophe Pettus  writes:
> > On Dec 12, 2013, at 5:45 PM, Tom Lane  wrote:
> >> Presumably, we are seeing the victim rather than the perpetrator of
> >> whatever is going wrong.
> 
> > This is probing about a bit blindly, but the only thing I can see about 
> > this system that is in some way unique (and this is happening on multiple 
> > machines, so it's unlikely to be hardware) is that there are a relatively 
> > large number of relations (like, 440,000+) distributed over many schemas.  
> > Is there anything that pins a buffer that is O(N) to the number of 
> > relations?
> 
> It's not a buffer *pin* that's at issue, it's a buffer header spinlock.
> And there are no loops, of any sort, that are executed while holding
> such a spinlock.  At least not in the core PG code.  Are you possibly
> using any nonstandard extensions?

It could maybe be explained by a buffer aborting while performing
IO. Until it has called AbortBufferIO(), other backends will happily loop
in WaitIO(), constantly taking the buffer header spinlock and
locking io_in_progress_lock in shared mode, thereby preventing
AbortBufferIO() from succeeding.

Christophe: are there any "unusual" ERROR messages preceding the crash,
possibly some minutes before?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Peter Geoghegan
On Thu, Dec 12, 2013 at 5:45 PM, Tom Lane  wrote:
> Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that
> most systems dump core files with process IDs embedded in the names.
> What would be more useful today is an option to send SIGABRT, or some
> other signal that would force core dumps.  Thoughts?

I think it would be possible, at least on Linux, to have GDB connect
to the postmaster, automatically create new inferiors as new
backends are forked, and then have every inferior pause as
breakpoints are hit. See:

http://sourceware.org/gdb/onlinedocs/gdb/Forks.html

and

http://sourceware.org/gdb/onlinedocs/gdb/All_002dStop-Mode.html

(I think the word 'thread' is just shorthand for 'inferior' on the
"all-stop mode" doc page, and you can definitely debug Postgres processes
in multiple inferiors today.)

Now, I'm not sure how feasible this is in a production debugging
situation. It seems like an interesting way of debugging these sorts
of issues that should be explored and perhaps subsequently codified.
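
A sketch of the GDB setup being described (the settings come from the linked docs; the PID is hypothetical and this is untested in a production scenario):

$ gdb -p 12345                   # attach to the postmaster
(gdb) set detach-on-fork off     # keep forked backends around as new inferiors
(gdb) set follow-fork-mode parent
(gdb) set non-stop off           # all-stop: hitting a breakpoint pauses every inferior
(gdb) info inferiors             # list the postmaster and attached backends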

-- 
Peter Geoghegan




Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Andres Freund
Hi,


On 2013-12-12 13:50:06 -0800, Christophe Pettus wrote:
> Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting 
> frequent (hourly) errors of the form:

Is it really a regular pattern like hourly? What's your
checkpoint_segments?

Could you, around the time of a crash, check "grep Dirt
/proc/meminfo" and run iostat -xm 1 20?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Christophe Pettus

On Dec 12, 2013, at 6:15 PM, Tom Lane  wrote:

> Are you possibly using any nonstandard extensions?

No, totally stock PostgreSQL.

--
-- Christophe Pettus
   x...@thebuild.com





Re: [HACKERS] SSL: better default ciphersuite

2013-12-12 Thread Peter Eisentraut
On Thu, 2013-12-12 at 12:30 +0200, Marko Kreen wrote:
> First, if there is an explicit wish to keep RC4/SEED in play, I'm fine
> with "HIGH:MEDIUM:!aNULL" as the new default.  Clarity-wise, it's still
> much better than the current value.  And this value will result in *exactly*
> the same list, in the same order, as the current value.

If we have to make a change, I'd go for that, but I'm not convinced that
this is necessarily clearer.
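
For reference, the value under discussion as it would appear in postgresql.conf (the then-current default being 'DEFAULT:!LOW:!EXP:!MD5:@STRENGTH'):

ssl_ciphers = 'HIGH:MEDIUM:!aNULL'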





Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Tom Lane
Christophe Pettus  writes:
> On Dec 12, 2013, at 5:45 PM, Tom Lane  wrote:
>> Presumably, we are seeing the victim rather than the perpetrator of
>> whatever is going wrong.

> This is probing about a bit blindly, but the only thing I can see about this 
> system that is in some way unique (and this is happening on multiple 
> machines, so it's unlikely to be hardware) is that there are a relatively 
> large number of relations (like, 440,000+) distributed over many schemas.  Is 
> there anything that pins a buffer that is O(N) to the number of relations?

It's not a buffer *pin* that's at issue, it's a buffer header spinlock.
And there are no loops, of any sort, that are executed while holding
such a spinlock.  At least not in the core PG code.  Are you possibly
using any nonstandard extensions?

regards, tom lane




Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Christophe Pettus

On Dec 12, 2013, at 5:45 PM, Tom Lane  wrote:

> Presumably, we are seeing the victim rather than the perpetrator of
> whatever is going wrong.

This is probing about a bit blindly, but the only thing I can see about this 
system that is in some way unique (and this is happening on multiple machines, 
so it's unlikely to be hardware) is that there are a relatively large number of 
relations (like, 440,000+) distributed over many schemas.  Is there anything 
that pins a buffer that is O(N) to the number of relations?

--
-- Christophe Pettus
   x...@thebuild.com





Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Tom Lane
Christophe Pettus  writes:
> On Dec 12, 2013, at 4:23 PM, Andres Freund  wrote:
>> Could you install the -dbg package and regenerate?

> Here's another, same system, different crash:

Both of these look like absolutely run-of-the-mill buffer access attempts.
Presumably, we are seeing the victim rather than the perpetrator of
whatever is going wrong.  Whoever is holding the spinlock is just
going down with the rest of the system ...

In a devel environment, I'd try using the postmaster's -T switch so that
it SIGSTOP's all the backends instead of SIGQUIT'ing them, and then I'd
run around and gdb all the other backends to try to see which one was
holding the spinlock and why.  Unfortunately, that's probably not
practical in a production environment; it'd take too long to collect
the stack traces by hand.  So I have no good ideas about how to debug
this, unless you can reproduce it on a devel box, or are willing to
run modified executables in production.

Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that
most systems dump core files with process IDs embedded in the names.
What would be more useful today is an option to send SIGABRT, or some
other signal that would force core dumps.  Thoughts?
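
For comparison, what such an option would amount to can already be done by hand (hypothetical PID):

$ kill -ABRT 12345    # SIGABRT a suspect backend to force a core dump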

regards, tom lane




Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing

2013-12-12 Thread Alvaro Herrera
Andres Freund wrote:
> On 2013-12-12 18:24:34 -0300, Alvaro Herrera wrote:
> > +			/*
> > +			 * It's an update; should we keep it?  If the transaction is known
> > +			 * aborted then it's okay to ignore it, otherwise not.  (Note this
> > +			 * is just an optimization and not needed for correctness, so it's
> > +			 * okay to get this test wrong; for example, in case an updater is
> > +			 * crashed, or a running transaction in the process of aborting.)
> > +			 */
> > +			if (!TransactionIdDidAbort(members[i].xid))
> > +			{
> > +				newmembers[nnewmembers++] = members[i];
> > +				Assert(!TransactionIdIsValid(update_xid));
> > +
> > +				/*
> > +				 * Tell caller to set HEAP_XMAX_COMMITTED hint while we have
> > +				 * the Xid in cache.  Again, this is just an optimization, so
> > +				 * it's not a problem if the transaction is still running and
> > +				 * in the process of committing.
> > +				 */
> > +				if (TransactionIdDidCommit(update_xid))
> > +					update_committed = true;
> > +
> > +				update_xid = newmembers[i].xid;
> > +			}
> 
> I still don't think this is ok. Freezing shouldn't set hint bits earlier
> than tqual.c does. What's the problem with adding a
> !TransactionIdIsInProgress()?

I was supposed to tell you, and evidently forgot, that patch 0001 was
the same as previously submitted, and was modified by the subsequent
patches per review comments.  These comments should already be
handled in the later patches in the series I just posted.  The idea was
to spare you reading the whole thing all over again, but evidently that
backfired.  I think the new code doesn't suffer from the problem you
mention, nor from the other one that I trimmed out.

> > +	else if (TransactionIdIsValid(update_xid) && !has_lockers)
> > +	{
> > +		/*
> > +		 * If there's a single member and it's an update, pass it back alone
> > +		 * without creating a new Multi.  (XXX we could do this when there's a
> > +		 * single remaining locker, too, but that would complicate the API too
> > +		 * much; moreover, the case with the single updater is more
> > +		 * interesting, because those are longer-lived.)
> > +		 */
> > +		Assert(nnewmembers == 1);
> > +		*flags |= FRM_RETURN_IS_XID;
> > +		if (update_committed)
> > +			*flags |= FRM_MARK_COMMITTED;
> > +		xid = update_xid;
> > +	}
> 
> Afaics this will cause HEAP_KEYS_UPDATED to be reset; is that
> problematic? I don't really remember what it's needed for, TBH...

Hmm, will check that out.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Christophe Pettus

On Dec 12, 2013, at 4:23 PM, Andres Freund  wrote:
> Could you install the -dbg package and regenerate?

Here's another, same system, different crash:

#0  0x7fa03faf5425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7fa03faf8b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7fa041e1491b in errfinish (dummy=)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:542
#3  0x7fa041e15477 in elog_finish (elevel=, 
fmt=0x7fa041f32a48 "stuck spinlock (%p) detected at %s:%d")
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:1297
#4  0x7fa041d30db3 in s_lock_stuck (line=1099, 
file=0x7fa041f2fa78 
"/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c",
 lock=0x7f9c2acb2ac8 "\001")
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:40
#5  s_lock (lock=0x7f9c2acb2ac8 "\001", 
file=0x7fa041f2fa78 
"/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c",
 line=1099)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:109
#6  0x7fa041d191f0 in PinBuffer (buf=0x7f9c2acb2aa8, strategy=0x0)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099
#7  0x7fa041d19af9 in BufferAlloc (foundPtr=0x7fff1948963e "\001", 
strategy=0x0, blockNum=8796, forkNum=MAIN_FORKNUM, relpersistence=112 'p', 
smgr=)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:776
#8  ReadBuffer_common (smgr=, relpersistence=112 'p', 
forkNum=MAIN_FORKNUM, blockNum=8796, mode=RBM_NORMAL, strategy=0x0, 
hit=0x7fff194896af "")
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:333
#9  0x7fa041d1a53e in ReadBufferExtended (reln=0x7f9c1edd4908, 
forkNum=MAIN_FORKNUM, blockNum=8796, mode=, 
strategy=)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:252
#10 0x7fa041b5a706 in heapgetpage (scan=0x7fa043389050, page=8796)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/heap/heapam.c:332
#11 0x7fa041b5ac12 in heapgettup_pagemode (scan=0x7fa043389050, 
dir=, nkeys=0, key=0x0)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/heap/heapam.c:939
#12 0x7fa041b5bf76 in heap_getnext (scan=0x7fa043389050, 
direction=)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/heap/heapam.c:1459
#13 0x7fa041c7a9eb in SeqNext (node=)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeSeqscan.c:66
#14 0x7fa041c6998e in ExecScanFetch (
recheckMtd=0x7fa041c7a9b0 , 
accessMtd=0x7fa041c7a9c0 , node=0x7fa0440f1c10)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:82
#15 ExecScan (node=0x7fa0440f1c10, accessMtd=0x7fa041c7a9c0 , 
recheckMtd=0x7fa041c7a9b0 )
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:167
#16 0x7fa041c629c8 in ExecProcNode (node=0x7fa0440f1c10)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:400
#17 0x7fa041c6f07f in agg_retrieve_direct (aggstate=0x7fa0440f1510)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1121
#18 ExecAgg (node=0x7fa0440f1510)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1013
#19 0x7fa041c628b8 in ExecProcNode (node=0x7fa0440f1510)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:476
#20 0x7fa041c5fdd2 in ExecutePlan (dest=0x7fa042a955e0, 
direction=, numberTuples=0, sendTuples=1 '\001', 
operation=CMD_SELECT, planstate=0x7fa0440f1510, estate=0x7fa0440f1400)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:1472
#21 standard_ExecutorRun (queryDesc=0x7fa0440f0ff0, direction=, 
count=0)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:307
#22 0x7fa03c0d428d in explain_ExecutorRun (queryDesc=0x7fa0440f0ff0, 
direction=ForwardScanDirection, count=0)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/auto_explain/auto_explain.c:233
#23 0x7fa03bece525 in pgss_ExecutorRun (queryDesc=0x7fa0440f0ff0, 
direction=ForwardScanDirection, count=0)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/pg_stat_statements/pg_stat_statements.c:717
#24 0x7fa041d40207 in PortalRunSelect (portal=0x7fa0427061f0, 
forward=, count=0, dest=)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:946
#25 0x7fa041d41651 in PortalRun (portal=0x7fa0427061f0, 
count=9223372036854775807, isTopLevel=1 '\001', dest=0x7fa042a955e0, 
altdest=0x7fa042a955e0, completionTag=0x7fff19489fa0 "")
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:790
#26 0x7fa041d3d960 in exec_simple_qu

Re: [HACKERS] ruleutils vs. empty targetlists

2013-12-12 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane escribió:
>> What I'm thinking about this today is that really the *right* solution
>> is to allow syntactically-empty SELECT lists; once we've bought into the
>> notion of zero-column tables, the notion that you can't have an empty
>> select list is just fundamentally at odds with that.  And since you can
>> already have semantically-empty SELECT lists, this should in theory not
>> create much risk of new bugs.  If we did that, the existing ruleutils
>> code is just fine, as are any existing dump files containing this sort
>> of query.

> Wow, as strange-sounding as that is, you're probably correct.

I experimented with this a bit, and found that things seem to pretty much
work, although unsurprisingly there are a couple of regression tests that
expected to get a syntax error and no longer do.

The only thing I've come across that arguably doesn't work is SELECT
DISTINCT:

regression=# select distinct from pg_database;
--
(8 rows)

The reason this says "8 rows" is that the produced plan is just a seqscan
of pg_database returning no columns.  The parser produced a Query with an
empty distinctClause, so the planner thinks no unique-ification is
required, so it doesn't produce any sort/uniq or hashagg node.  This seems
a bit bogus to me; mathematically, it seems like what this ought to mean
is that all rows are equivalent and so you get just one empty row out.
However, I don't see any practical use-case for that behavior, so I'm
disinclined to spend time on inventing a solution --- and any solution
would probably require a change in Query representation, making it not
back-patchable anyway.  Instead I suggest we add an error check in parse
analysis that throws an error for DISTINCT with no columns.  If anyone is
ever motivated to make this case do something sane, they can remove that
check and fix up whatever needs fixing up.  The attached first-draft patch
doesn't yet have such a prohibition, though.

Another point is that there is a second use of target_list in the grammar,
which is "RETURNING target_list".  I did not change that one into
"RETURNING opt_target_list", because if I had, it would have an issue
similar to DISTINCT's: RETURNING with no arguments would be represented
the same as no RETURNING at all, leading to probably not the behavior
the user expected.  We've already got that problem if you say "RETURNING
zero_column_table.*", making me wonder if a parse-analysis-time error
for that case wouldn't be a good thing.

Comments?

regards, tom lane

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 8fced44..f9d4577 100644
*** a/src/backend/parser/gram.y
--- b/src/backend/parser/gram.y
*** static Node *makeRecursiveViewSelect(cha
*** 334,340 
  name_list from_clause from_list opt_array_bounds
  qualified_name_list any_name any_name_list
  any_operator expr_list attrs
! target_list insert_column_list set_target_list
  set_clause_list set_clause multiple_set_clause
  ctext_expr_list ctext_row def_list indirection opt_indirection
  reloption_list group_clause TriggerFuncArgs select_limit
--- 334,340 
  name_list from_clause from_list opt_array_bounds
  qualified_name_list any_name any_name_list
  any_operator expr_list attrs
! target_list opt_target_list insert_column_list set_target_list
  set_clause_list set_clause multiple_set_clause
  ctext_expr_list ctext_row def_list indirection opt_indirection
  reloption_list group_clause TriggerFuncArgs select_limit
*** select_clause:
*** 9259,9265 
   * However, this is not checked by the grammar; parse analysis must check it.
   */
  simple_select:
! 			SELECT opt_distinct target_list
  			into_clause from_clause where_clause
  			group_clause having_clause window_clause
  {
--- 9259,9265 
   * However, this is not checked by the grammar; parse analysis must check it.
   */
  simple_select:
! 			SELECT opt_distinct opt_target_list
  			into_clause from_clause where_clause
  			group_clause having_clause window_clause
  {
*** ctext_row: '(' ctext_expr_list ')'	{
*** 12215,12220 
--- 12215,12224 
   *
   */
  
+ opt_target_list: target_list		{ $$ = $1; }
+ 			| /* EMPTY */			{ $$ = NIL; }
+ 		;
+ 
  target_list:
  			target_el{ $$ = list_make1($1); }
  			| target_list ',' target_el{ $$ = lappend($1, $3); }
diff --git a/src/test/regress/expected/errors.out b/src/test/regress/expected/errors.out
index 4061512..b63cb54 100644
*** a/src/test/regress/expected/errors.out
--- b/src/test/regress/expected/errors.out
*** select 1;
*** 15,35 
  --
  --
  -- SELECT
! -- missing relation name
  select;
! ERROR:  syntax error at or near ";"
! LINE 1: select;
!   ^
  -- no such relation
  select * from nonesuch;
  ERROR:  relation "nonesuch" does 

Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Christophe Pettus

On Dec 12, 2013, at 4:23 PM, Andres Freund  wrote:
> Could you install the -dbg package and regenerate?

Of course!

#0  0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f699a4fdb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f699c81991b in errfinish (dummy=)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:542
#3  0x7f699c81a477 in elog_finish (elevel=, 
fmt=0x7f699c937a48 "stuck spinlock (%p) detected at %s:%d")
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:1297
#4  0x7f699c735db3 in s_lock_stuck (line=1099, 
file=0x7f699c934a78 
"/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c",
 lock=0x7f6585e2cbb4 "\001")
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:40
#5  s_lock (lock=0x7f6585e2cbb4 "\001", 
file=0x7f699c934a78 
"/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c",
 line=1099)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:109
#6  0x7f699c71e1f0 in PinBuffer (buf=0x7f6585e2cb94, strategy=0x0)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099
#7  0x7f699c71eaf9 in BufferAlloc (foundPtr=0x7fff60ec563e "", 
strategy=0x0, blockNum=1730, forkNum=MAIN_FORKNUM, relpersistence=112 'p', 
smgr=)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:776
#8  ReadBuffer_common (smgr=, relpersistence=112 'p', 
forkNum=MAIN_FORKNUM, blockNum=1730, mode=RBM_NORMAL, strategy=0x0, 
hit=0x7fff60ec56af "")
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:333
#9  0x7f699c71f53e in ReadBufferExtended (reln=0x7f6577d80560, 
forkNum=MAIN_FORKNUM, blockNum=1730, mode=, 
strategy=)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:252
#10 0x7f699c56d03a in index_fetch_heap (scan=0x7f699f94c7a0)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/index/indexam.c:515
#11 0x7f699c67a0b7 in IndexOnlyNext (node=0x7f699f94b690)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeIndexonlyscan.c:109
#12 0x7f699c66e98e in ExecScanFetch (
recheckMtd=0x7f699c679fb0 , 
accessMtd=0x7f699c679fe0 , node=0x7f699f94b690)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:82
#13 ExecScan (node=0x7f699f94b690, accessMtd=0x7f699c679fe0 , 
recheckMtd=0x7f699c679fb0 )
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:167
#14 0x7f699c6679a8 in ExecProcNode (node=0x7f699f94b690)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:408
#15 0x7f699c67407f in agg_retrieve_direct (aggstate=0x7f699f94af90)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1121
#16 ExecAgg (node=0x7f699f94af90)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1013
#17 0x7f699c6678b8 in ExecProcNode (node=0x7f699f94af90)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:476
#18 0x7f699c664dd2 in ExecutePlan (dest=0x7f699f98c308, 
direction=, numberTuples=0, sendTuples=1 '\001', 
operation=CMD_SELECT, planstate=0x7f699f94af90, estate=0x7f699f94ae80)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:1472
#19 standard_ExecutorRun (queryDesc=0x7f699f940dc0, direction=, 
count=0)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:307
#20 0x7f6996ad928d in explain_ExecutorRun (queryDesc=0x7f699f940dc0, 
direction=ForwardScanDirection, count=0)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/auto_explain/auto_explain.c:233
#21 0x7f69968d3525 in pgss_ExecutorRun (queryDesc=0x7f699f940dc0, 
direction=ForwardScanDirection, count=0)
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/pg_stat_statements/pg_stat_statements.c:717
#22 0x7f699c745207 in PortalRunSelect (portal=0x7f699de596a0, 
forward=, count=0, dest=)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:946
#23 0x7f699c746651 in PortalRun (portal=0x7f699de596a0, 
count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f699f98c308, 
altdest=0x7f699f98c308, completionTag=0x7fff60ec5f30 "")
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:790
#24 0x7f699c742960 in exec_simple_query (
query_string=0x7f699dd564a0 "SELECT COUNT(*) FROM \"signups\"  WHERE 
(signups.is_supporter = true)")
at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/postgres.c:1048
#25 PostgresMain (argc=, argv=, 
dbname=0x7f699dd3e138 "nbuild", username=)
at 
/tmp

Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Andres Freund
On 2013-12-12 16:22:28 -0800, Christophe Pettus wrote:
> 
> On Dec 12, 2013, at 4:04 PM, Tom Lane  wrote:
> > If you aren't getting a core file for a PANIC, then core
> > files are disabled.
> 
> And just like that, we get one.  Stack trace:
> 
> #0  0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> (gdb) bt
> #0  0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f699a4fdb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f699c81991b in errfinish ()
> #3  0x7f699c81a477 in elog_finish ()
> #4  0x7f699c735db3 in s_lock ()
> #5  0x7f699c71e1f0 in ?? ()
> #6  0x7f699c71eaf9 in ?? ()

Could you install the -dbg package and regenerate?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Christophe Pettus

On Dec 12, 2013, at 4:04 PM, Tom Lane  wrote:
> If you aren't getting a core file for a PANIC, then core
> files are disabled.

And just like that, we get one.  Stack trace:

#0  0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f699a4fdb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f699c81991b in errfinish ()
#3  0x7f699c81a477 in elog_finish ()
#4  0x7f699c735db3 in s_lock ()
#5  0x7f699c71e1f0 in ?? ()
#6  0x7f699c71eaf9 in ?? ()
#7  0x7f699c71f53e in ReadBufferExtended ()
#8  0x7f699c56d03a in index_fetch_heap ()
#9  0x7f699c67a0b7 in ?? ()
#10 0x7f699c66e98e in ExecScan ()
#11 0x7f699c6679a8 in ExecProcNode ()
#12 0x7f699c67407f in ExecAgg ()
#13 0x7f699c6678b8 in ExecProcNode ()
#14 0x7f699c664dd2 in standard_ExecutorRun ()
#15 0x7f6996ad928d in ?? ()
   from /usr/lib/postgresql/9.3/lib/auto_explain.so
#16 0x7f69968d3525 in ?? ()
   from /usr/lib/postgresql/9.3/lib/pg_stat_statements.so
#17 0x7f699c745207 in ?? ()
#18 0x7f699c746651 in PortalRun ()
#19 0x7f699c742960 in PostgresMain ()
#20 0x7f699c6ff765 in PostmasterMain ()
#21 0x7f699c53bea2 in main ()


--
-- Christophe Pettus
   x...@thebuild.com





Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Tom Lane
Christophe Pettus  writes:
> On Dec 12, 2013, at 3:18 PM, Tom Lane  wrote:
>> Hm, a PANIC really ought to result in a core file.  You sure you don't
>> have that disabled (perhaps via a ulimit setting)?

> Since it's using the Ubuntu packaging, we have pg_ctl_options = '-c' in 
> /etc/postgresql/9.3/main/pg_ctl.conf.

[ shrug... ]  If you aren't getting a core file for a PANIC, then core
files are disabled.  I take no position on the value of the setting
you mention above, but I will note that pg_ctl can't override a hard
"ulimit -c 0" system-wide setting.

regards, tom lane




Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Christophe Pettus

On Dec 12, 2013, at 3:18 PM, Tom Lane  wrote:
> Hm, a PANIC really ought to result in a core file.  You sure you don't
> have that disabled (perhaps via a ulimit setting)?

Since it's using the Ubuntu packaging, we have pg_ctl_options = '-c' in 
/etc/postgresql/9.3/main/pg_ctl.conf.

> As for the root cause, it's hard to say.  The file/line number says it's
> a buffer header lock that's stuck.  I rechecked all the places that lock
> buffer headers, and all of them have very short code paths to the
> corresponding unlock, so there's no obvious explanation how this could
> happen.

The server was running with shared_buffers=100GB, but the problem has 
reoccurred now with shared_buffers=16GB.

--
-- Christophe Pettus
   x...@thebuild.com





Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Christophe Pettus

On Dec 12, 2013, at 3:33 PM, Andres Freund  wrote:

> Any other changes but the upgrade? Maybe a different compiler version?

Just the upgrade; they're using the Ubuntu packages from apt.postgresql.org.

> Also, could you share some details about the workload? Highly
> concurrent? Standby? ...

The workload is not highly concurrent; the system is actually quite lightly loaded.
There are a very large number (442,000) of user tables.  No standby is attached.

--
-- Christophe Pettus
   x...@thebuild.com





Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Christophe Pettus

On Dec 12, 2013, at 3:37 PM, Peter Geoghegan  wrote:
> Show pg_config output.

Below; it's the Ubuntu package.

BINDIR = /usr/lib/postgresql/9.3/bin
DOCDIR = /usr/share/doc/postgresql-doc-9.3
HTMLDIR = /usr/share/doc/postgresql-doc-9.3
INCLUDEDIR = /usr/include/postgresql
PKGINCLUDEDIR = /usr/include/postgresql
INCLUDEDIR-SERVER = /usr/include/postgresql/9.3/server
LIBDIR = /usr/lib
PKGLIBDIR = /usr/lib/postgresql/9.3/lib
LOCALEDIR = /usr/share/locale
MANDIR = /usr/share/postgresql/9.3/man
SHAREDIR = /usr/share/postgresql/9.3
SYSCONFDIR = /etc/postgresql-common
PGXS = /usr/lib/postgresql/9.3/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--with-tcl' '--with-perl' '--with-python' '--with-pam' 
'--with-openssl' '--with-libxml' '--with-libxslt' 
'--with-tclconfig=/usr/lib/tcl8.5' '--with-tkconfig=/usr/lib/tk8.5' 
'--with-includes=/usr/include/tcl8.5' 'PYTHON=/usr/bin/python' 
'--mandir=/usr/share/postgresql/9.3/man' 
'--docdir=/usr/share/doc/postgresql-doc-9.3' 
'--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/' 
'--datadir=/usr/share/postgresql/9.3' '--bindir=/usr/lib/postgresql/9.3/bin' 
'--libdir=/usr/lib/' '--libexecdir=/usr/lib/postgresql/' 
'--includedir=/usr/include/postgresql/' '--enable-nls' 
'--enable-integer-datetimes' '--enable-thread-safety' '--enable-debug' 
'--disable-rpath' '--with-ossp-uuid' '--with-gnu-ld' '--with-pgport=5432' 
'--with-system-tzdata=/usr/share/zoneinfo' 'CFLAGS=-g -O2 -fstack-protector 
--param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security 
-fPIC -pie -I/usr/include/mit-krb5 -DLINUX_OOM_ADJ=0' 
'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed 
-L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5' '--with-krb5' 
'--with-gssapi' '--with-ldap' 'CPPFLAGS=-D_FORTIFY_SOURCE=2'
CC = gcc
CPPFLAGS = -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2 
-I/usr/include/tcl8.5
CFLAGS = -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat 
-Wformat-security -Werror=format-security -fPIC -pie -I/usr/include/mit-krb5 
-DLINUX_OOM_ADJ=0 -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g
CFLAGS_SL = -fpic
LDFLAGS = -L../../../src/common -Wl,-Bsymbolic-functions -Wl,-z,relro 
-Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5 
-L/usr/lib/x86_64-linux-gnu/mit-krb5 -L/usr/lib/x86_64-linux-gnu -Wl,--as-needed
LDFLAGS_EX = 
LDFLAGS_SL = 
LIBS = -lpgport -lpgcommon -lxslt -lxml2 -lpam -lssl -lcrypto -lkrb5 -lcom_err 
-lgssapi_krb5 -lz -ledit -lcrypt -ldl -lm 
VERSION = PostgreSQL 9.3.2

--
-- Christophe Pettus
   x...@thebuild.com





Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Peter Geoghegan
On Thu, Dec 12, 2013 at 3:33 PM, Andres Freund  wrote:
> Any other changes but the upgrade? Maybe a different compiler version?

Show pg_config output.


-- 
Peter Geoghegan




Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Andres Freund
On 2013-12-12 13:50:06 -0800, Christophe Pettus wrote:
> Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting 
> frequent (hourly) errors of the form:
> 
> /var/lib/postgresql/9.3/main/pg_log/postgresql-2013-12-12_211710.csv:2013-12-12
>  21:40:10.328 
> UTC,"n","n",32376,"10.2.1.142:52451",52aa24eb.7e78,5,"SELECT",2013-12-12 
> 21:04:43 UTC,9/7178,0,PANIC,XX000,"stuck spinlock (0x7f7df94672f4) detected 
> at 
> /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099",,""

Any other changes but the upgrade? Maybe a different compiler version?

Also, could you share some details about the workload? Highly
concurrent? Standby? ...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] "stuck spinlock"

2013-12-12 Thread Tom Lane
Christophe Pettus  writes:
> Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting 
> frequent (hourly) errors of the form:

> /var/lib/postgresql/9.3/main/pg_log/postgresql-2013-12-12_211710.csv:2013-12-12
>  21:40:10.328 
> UTC,"n","n",32376,"10.2.1.142:52451",52aa24eb.7e78,5,"SELECT",2013-12-12 
> 21:04:43 UTC,9/7178,0,PANIC,XX000,"stuck spinlock (0x7f7df94672f4) detected 
> at 
> /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099",,""

> uname -a: Linux postgresql3-master 3.8.0-33-generic #48~precise1-Ubuntu SMP 
> Thu Oct 24 16:28:06 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux.

> Generally, there's no core file (core dumps are currently enabled), as the 
> postmaster just normally exits the backend.

Hm, a PANIC really ought to result in a core file.  You sure you don't
have that disabled (perhaps via a ulimit setting)?

As for the root cause, it's hard to say.  The file/line number says it's
a buffer header lock that's stuck.  I rechecked all the places that lock
buffer headers, and all of them have very short code paths to the
corresponding unlock, so there's no obvious explanation how this could
happen.

regards, tom lane




Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing

2013-12-12 Thread Andres Freund
On 2013-12-12 18:24:34 -0300, Alvaro Herrera wrote:
> +			/*
> +			 * It's an update; should we keep it?  If the transaction is known
> +			 * aborted then it's okay to ignore it, otherwise not.  (Note this
> +			 * is just an optimization and not needed for correctness, so it's
> +			 * okay to get this test wrong; for example, in case an updater is
> +			 * crashed, or a running transaction in the process of aborting.)
> +			 */
> +			if (!TransactionIdDidAbort(members[i].xid))
> +			{
> +				newmembers[nnewmembers++] = members[i];
> +				Assert(!TransactionIdIsValid(update_xid));
> +
> +				/*
> +				 * Tell caller to set HEAP_XMAX_COMMITTED hint while we have
> +				 * the Xid in cache.  Again, this is just an optimization, so
> +				 * it's not a problem if the transaction is still running and
> +				 * in the process of committing.
> +				 */
> +				if (TransactionIdDidCommit(update_xid))
> +					update_committed = true;
> +
> +				update_xid = newmembers[i].xid;
> +			}

I still don't think this is ok. Freezing shouldn't set hint bits earlier
than tqual.c does. What's the problem with adding a
!TransactionIdIsInProgress()?

You also wrote:
On 2013-12-11 22:08:41 -0300, Alvaro Herrera wrote:
> Hmm ... Is there an actual difference?  I mean, a transaction that
> marked itself as committed in pg_clog cannot return to any other state,
> regardless of what happens elsewhere.

Hm, that's not actually true, I missed that so far: think of async
commits and what we do in tqual.c:SetHintBits(). But I think we're safe
in this scenario, at least for the current callers. vacuumlazy.c will
WAL-log the freezing and set the LSN while holding an exclusive lock,
thereby providing an LSN interlock. VACUUM FULL/CLUSTER will be safe,
even with wal_level=minimal, because the relation won't be visible until
it commits, and it will contain a smgr pending delete forcing a
synchronous commit. But that should be documented.

> + if (TransactionIdPrecedes(update_xid, cutoff_xid))
> + {
> + update_xid = InvalidTransactionId;
> + update_committed = false;
> +
> + }

Deserves an Assert().

> +	else if (TransactionIdIsValid(update_xid) && !has_lockers)
> +	{
> +		/*
> +		 * If there's a single member and it's an update, pass it back alone
> +		 * without creating a new Multi.  (XXX we could do this when there's a
> +		 * single remaining locker, too, but that would complicate the API too
> +		 * much; moreover, the case with the single updater is more
> +		 * interesting, because those are longer-lived.)
> +		 */
> +		Assert(nnewmembers == 1);
> +		*flags |= FRM_RETURN_IS_XID;
> +		if (update_committed)
> +			*flags |= FRM_MARK_COMMITTED;
> +		xid = update_xid;
> +	}

Afaics this will cause HEAP_KEYS_UPDATED to be reset; is that
problematic? I don't really remember what it's needed for, TBH...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing

2013-12-12 Thread Tom Lane
Andres Freund  writes:
> Unfortunately I find that too ugly. Adding a flag in the procarray
> because of an Assert() in a lowlevel routine? That's overkill.

If this flag doesn't need to be visible to other backends, it absolutely
does not belong in the procarray.

regards, tom lane




Re: [HACKERS] patch: make_timestamp function

2013-12-12 Thread Fabrízio de Royes Mello
On Thu, Dec 12, 2013 at 3:11 PM, Pavel Stehule wrote:

> Hello
>
> this patch tries to complete the set of functions make_date and make_time
> with make_timestamp.
>
>
Could we have the 'make_timestamptz' function too?
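
Just to sketch what I have in mind -- the signature below is purely
hypothetical, mirroring make_timestamp with one extra zone argument, and is
not anything committed:

SELECT make_timestamptz(2013, 7, 15, 8, 15, 23.5, 'Europe/Lisbon');

i.e. the same six fields as make_timestamp plus an optional time zone
(presumably defaulting to the session TimeZone).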

Regards,

-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello


Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing

2013-12-12 Thread Andres Freund
Hi,

On 2013-12-12 18:39:43 -0300, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> > One last thing (I hope).  It's not real easy to disable this check,
> > because it actually lives in GetNewMultiXactId.  It would uglify the API
> > a lot if we were to pass a flag down two layers of routines; and moving
> > it to higher-level routines doesn't seem all that appropriate
> > either.

Unfortunately I find that too ugly. Adding a flag in the procarray
because of an Assert() in a lowlevel routine? That's overkill.
What's the problem with moving the check to MultiXactIdCreate() and
MultiXactIdExpand() instead? Since those are the ones where it's
required to have called SetOldest() before, I don't see why that would
be inappropriate?

> > I'm thinking we can have a new flag in MyPgXact->vacuumFlags, so
> > heap_prepare_freeze_tuple does this:
> > 
> > PG_TRY();
> > {
> > /* set flag to let multixact.c know what we're doing */
> > MyPgXact->vacuumFlags |= PROC_FREEZING_MULTI;
> > newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
> > cutoff_xid, cutoff_multi, &flags);
> > }
> 
> Uhm, actually we don't need a PG_TRY block at all for this to work: we
> can rely on the flag being reset at transaction abort, if anything wrong
> happens and we lose control.  So just set the flag, call
> FreezeMultiXactId, reset flag.

I don't think that'd be correct for a CLUSTER in a subtransaction?  A
subtransaction's abort afaics doesn't call ProcArrayEndTransaction() and
thus doesn't clear vacuumFlags...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Florian Pflug
On Dec12, 2013, at 19:29 , Tom Lane  wrote:
> However ... where this thread started was not about trying to reduce
> the remaining statistical imperfections in our existing sampling method.
> It was about whether we could reduce the number of pages read for an
> acceptable cost in increased statistical imperfection.

True, but Jeff's case shows that even the imperfections of the current
sampling method are larger than what the n_distinct estimator expects.

Making it even more biased will thus require us to rethink how we
obtain an n_distinct estimate or something equivalent. I don't mean that
as an argument against changing the sampling method, just as something
to watch out for.

best regards,
Florian Pflug





Re: [HACKERS] Time-Delayed Standbys

2013-12-12 Thread Fabrízio de Royes Mello
On Thu, Dec 12, 2013 at 3:42 PM, Fabrízio de Royes Mello <
fabriziome...@gmail.com> wrote:
>
> On Thu, Dec 12, 2013 at 3:39 PM, Simon Riggs 
wrote:
> >
> > On 12 December 2013 15:19, Simon Riggs  wrote:
> >
> > > Don't panic guys! I meant UTC offset only. And yes, it may not be
> > > needed, will check.
> >
> > Checked, all non-UTC TZ offsets work without further effort here.
> >
>
> Thanks!
>

Reviewing the committed patch I noted that the "CheckForStandbyTrigger()"
after the delay was removed.

If we promote the standby during the delay and don't check the trigger
immediately after the delay, then we will replay undesired WAL records.

The attached patch adds this check.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a76aef3..fbc2d2f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6835,6 +6835,14 @@ StartupXLOG(void)
 	recoveryApplyDelay();
 
 	/*
+	 * Check for standby trigger to prevent the
+	 * replay of undesired WAL records if the
+	 * slave was promoted during the delay.
+	 */
+	if (CheckForStandbyTrigger())
+		break;
+
+	/*
 	 * We test for paused recovery again here. If
 	 * user sets delayed apply, it may be because
 	 * they expect to pause recovery in case of



[HACKERS] "stuck spinlock"

2013-12-12 Thread Christophe Pettus
Greetings,

Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting 
frequent (hourly) errors of the form:

/var/lib/postgresql/9.3/main/pg_log/postgresql-2013-12-12_211710.csv:2013-12-12 
21:40:10.328 
UTC,"n","n",32376,"10.2.1.142:52451",52aa24eb.7e78,5,"SELECT",2013-12-12 
21:04:43 UTC,9/7178,0,PANIC,XX000,"stuck spinlock (0x7f7df94672f4) detected at 
/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099",,""

uname -a: Linux postgresql3-master 3.8.0-33-generic #48~precise1-Ubuntu SMP Thu 
Oct 24 16:28:06 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux.

Generally, there's no core file (core files are currently enabled), as the
postmaster just exits the backend normally.

Diagnosis suggestions?
 
--
-- Christophe Pettus
   x...@thebuild.com





Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing

2013-12-12 Thread Alvaro Herrera
Alvaro Herrera wrote:

> One last thing (I hope).  It's not real easy to disable this check,
> because it actually lives in GetNewMultiXactId.  It would uglify the API
> a lot if we were to pass a flag down two layers of routines; and moving
> it to higher-level routines doesn't seem all that appropriate either.
> I'm thinking we can have a new flag in MyPgXact->vacuumFlags, so
> heap_prepare_freeze_tuple does this:
> 
> PG_TRY();
> {
> /* set flag to let multixact.c know what we're doing */
> MyPgXact->vacuumFlags |= PROC_FREEZING_MULTI;
> newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
> cutoff_xid, cutoff_multi, &flags);
> }

Uhm, actually we don't need a PG_TRY block at all for this to work: we
can rely on the flag being reset at transaction abort, if anything wrong
happens and we lose control.  So just set the flag, call
FreezeMultiXactId, reset flag.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Extra functionality to createuser

2013-12-12 Thread Christopher Browne
On Wed, Dec 11, 2013 at 7:53 AM, Robert Haas  wrote:
> On Tue, Dec 10, 2013 at 9:55 AM, Amit Kapila  wrote:
>> On Tue, Dec 10, 2013 at 12:20 AM, Robert Haas  wrote:
>>> On Sat, Dec 7, 2013 at 11:39 PM, Amit Kapila  
>>> wrote:
>>>> On Fri, Dec 6, 2013 at 10:31 AM, Peter Eisentraut  wrote:
>>>>>
>>>>> How about only one role name per -g option, but allowing the -g option
>>>>> to be repeated?
>>>>
>>>> I think that might simplify the problem and patch, but do you think
>>>> it is okay to have inconsistency
>>>> for usage of options between Create User statement and this utility?
>>>
>>> Yes.  In general, command-line utilities use a very different syntax
>>> for options-passing that SQL commands.  Trying to make them consistent
>>> feels unnecessary or perhaps even counterproductive.  And the proposed
>>> syntax is certainly a convention common to many other command-line
>>> utilities, so I think it's fine.
>>
>> Okay, the new way for syntax suggested by Peter has simplified the problem.
>> Please find the updated patch and docs for multiple -g options.
>
> Committed.

Looks good, thanks!
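
For anyone trying it out: repeating -g, e.g. createuser -g admins -g
auditors newuser, should boil down to a single CREATE ROLE with an IN ROLE
list -- roughly the following SQL (role names made up for illustration):

CREATE ROLE newuser NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN
  IN ROLE admins, auditors;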

-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"




Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing

2013-12-12 Thread Alvaro Herrera
Andres Freund wrote:
> On 2013-12-11 22:08:41 -0300, Alvaro Herrera wrote:
> > Andres Freund wrote:

> > > I worry that this MultiXactIdSetOldestMember() will be problematic in
> > > longrunning vacuums like a anti-wraparound vacuum of a huge
> > > table. There's no real need to set MultiXactIdSetOldestMember() here,
> > > since we will not become the member of a multi. So I think you should
> > > either move the Assert() in MultiXactIdCreateFromMembers() to it's other
> > > callers, or add a parameter to skip it.
> > 
> > I would like to have the Assert() work automatically, that is, check the
> > PROC_IN_VACUUM flag in MyProc->vacuumflags ... but this probably won't
> > work with CLUSTER.  That said, I think we *should* call SetOldestMember
> > in CLUSTER.  So maybe both things should be conditional on
> > PROC_IN_VACUUM.
> 
> Why should it be dependent on cluster? SetOldestMember() defines the
> oldest multi we can be a member of. Even in cluster, the freezing will
> not make us a member of a multi. If the transaction does something else
> requiring SetOldestMember(), that will do it?

One last thing (I hope).  It's not real easy to disable this check,
because it actually lives in GetNewMultiXactId.  It would uglify the API
a lot if we were to pass a flag down two layers of routines; and moving
it to higher-level routines doesn't seem all that appropriate either.
I'm thinking we can have a new flag in MyPgXact->vacuumFlags, so
heap_prepare_freeze_tuple does this:

PG_TRY();
{
/* set flag to let multixact.c know what we're doing */
MyPgXact->vacuumFlags |= PROC_FREEZING_MULTI;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
cutoff_xid, cutoff_multi, &flags);
}
PG_CATCH();
{
MyPgXact->vacuumFlags &= ~PROC_FREEZING_MULTI;
PG_RE_THROW();
}
PG_END_TRY();
MyPgXact->vacuumFlags &= ~PROC_FREEZING_MULTI;

and GetNewMultiXactId tests it to avoid the assert:

/*
 * MultiXactIdSetOldestMember() must have been called already, but don't
 * check while freezing MultiXactIds.
 */
Assert((MyPgXact->vacuumFlags & PROC_FREEZING_MULTI) ||
   MultiXactIdIsValid(OldestMemberMXactId[MyBackendId]));

This avoids the API uglification issues, but introduces a setjmp call
for every tuple to be frozen.  I don't think this is an excessive cost
to pay; after all, this is going to happen only for tuples for which
heap_tuple_needs_freeze already returned true; and for those we're
already going to do a lot of other work.

Attached is the whole series of patches for 9.3.  (master is the same,
only with an additional patch that removes the legacy WAL record.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>From d266b3cef8598c3383d3ba17105d17fc1f384f7d Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 10 Dec 2013 17:56:02 -0300
Subject: [PATCH 1/4] Fix freezing of multixacts
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Andres Freund and Álvaro
---
 src/backend/access/heap/heapam.c   |  679 +---
 src/backend/access/rmgrdesc/heapdesc.c |9 +
 src/backend/access/transam/multixact.c |   18 +-
 src/backend/commands/vacuumlazy.c  |   31 +-
 src/include/access/heapam_xlog.h   |   43 +-
 src/include/access/multixact.h |3 +
 6 files changed, 628 insertions(+), 155 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1a0dd21..24d843a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -5238,14 +5238,261 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 		CacheInvalidateHeapTuple(relation, tuple, NULL);
 }
 
+#define		FRM_NOOP0x0001
+#define		FRM_INVALIDATE_XMAX		0x0002
+#define		FRM_RETURN_IS_XID		0x0004
+#define		FRM_RETURN_IS_MULTI		0x0008
+#define		FRM_MARK_COMMITTED		0x0010
 
 /*
- * heap_freeze_tuple
+ * FreezeMultiXactId
+ *		Determine what to do during freezing when a tuple is marked by a
+ *		MultiXactId.
+ *
+ * NB -- this might have the side-effect of creating a new MultiXactId!
+ *
+ * "flags" is an output value; it's used to tell caller what to do on return.
+ * Possible flags are:
+ * FRM_NOOP
+ *		don't do anything -- keep existing Xmax
+ * FRM_INVALIDATE_XMAX
+ *		mark Xmax as InvalidTransactionId and set XMAX_INVALID flag.
+ * FRM_RETURN_IS_XID
+ *		The Xid return value is a single update Xid to set as xmax.
+ * FRM_MARK_COMMITTED
+ *		Xmax can be marked as HEAP_XMAX_COMMITTED
+ * FRM_RETURN_IS_MULTI
+ *		The return value is a new MultiXactId to set as new Xmax.
+ *		(caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ */
+static TransactionId
+FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
+  TransactionId cutoff_xid, MultiXactId 

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Jeff Janes
On Tue, Dec 3, 2013 at 3:30 PM, Greg Stark  wrote:

> At multiple conferences I've heard about people trying all sorts of
> gymnastics to avoid ANALYZE which they expect to take too long and
> consume too much I/O. This is especially a big complain after upgrades
> when their new database performs poorly until the new statistics are
> in and they did pg_upgrade to avoid an extended downtime and complain
> about ANALYZE taking hours.
>

Out of curiosity, are they using the 3 stage script
"analyze_new_cluster.sh"?  If so, is the complaint that even the first
rounds are too slow, or that the database remains unusable until the last
round is complete?

Cheers,

Jeff


Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Claudio Freire
On Thu, Dec 12, 2013 at 4:13 PM, Jeff Janes  wrote:
>> Well, why not take a supersample containing all visible tuples from N
>> selected blocks, and do bootstrapping over it, with subsamples of M
>> independent rows each?
>
>
> Bootstrapping methods generally do not work well when ties are significant
> events, i.e. when two values being identical is meaningfully different from
> them being very close but not identical.

Yes, that's why I meant to say (but I see now that I didn't) that it
wouldn't do much for n_distinct, just the histogram.




Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Jeff Janes
On Thu, Dec 12, 2013 at 10:33 AM, Claudio Freire wrote:

> On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane  wrote:
> > Jeff Janes  writes:
> >> It would be relatively easy to fix this if we trusted the number of visible
> >> rows in each block to be fairly constant.  But without that assumption, I
> >> don't see a way to fix the sample selection process without reading the
> >> entire table.
> >
> > Yeah, varying tuple density is the weak spot in every algorithm we've
> > looked at.  The current code is better than what was there before, but as
> > you say, not perfect.  You might be entertained to look at the threads
> > referenced by the patch that created the current sampling method:
> >
> > http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gt...@email.aon.at
> >
> > particularly
> >
> > http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at
>


Thanks, I will read those.


> >
> >
> > However ... where this thread started was not about trying to reduce
> > the remaining statistical imperfections in our existing sampling method.
> > It was about whether we could reduce the number of pages read for an
> > acceptable cost in increased statistical imperfection.
>

I think it is pretty clear that n_distinct at least, and probably MCV,
would be a catastrophe under some common data distribution patterns if we
sample all rows in each block without changing our current computation
method.  If we come up with a computation that works for that type of
sampling, it would probably be an improvement under our current sampling as
well.  If we find such a thing, I wouldn't want it to get rejected just
because the larger block-sampling method change did not make it in.

> Well, why not take a supersample containing all visible tuples from N
> selected blocks, and do bootstrapping over it, with subsamples of M
> independent rows each?
>

Bootstrapping methods generally do not work well when ties are significant
events, i.e. when two values being identical is meaningfully different from
them being very close but not identical.

Cheers,

Jeff


Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Claudio Freire
On Thu, Dec 12, 2013 at 3:56 PM, Josh Berkus  wrote:
>
> Estimated grouping should, however, affect MCVs.  In cases where we
> estimate that grouping levels are high, the expected % of observed
> values should be "discounted" somehow.  That is, with total random
> distribution you have a 1:1 ratio between observed frequency of a value
> and assumed frequency.  However, with highly grouped values, you might
> have a 2:1 ratio.

Cross validation can help there. But it's costly.




Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Josh Berkus
On 12/12/2013 10:33 AM, Claudio Freire wrote:
> Well, why not take a supersample containing all visible tuples from N
> selected blocks, and do bootstrapping over it, with subsamples of M
> independent rows each?

Well, we still need to look at each individual block to determine
grouping correlation.  Let's take a worst case example: imagine a table
has *just* been created by:

CREATE TABLE newdata AS SELECT * FROM olddata ORDER BY category, item;

If "category" is fairly low cardinality, then grouping will be severe;
we can reasonably expect that if we sample 100 blocks, many of them will
have only one category value present.  The answer to this is to make our
block samples fairly widely spaced and compare them.

In this simplified example, if the table had 1000 blocks, we would take
blocks 1,101,201,301,401,etc.  Then we would compare the number and
content of values found on each block with the number and content found
on each other block.  For example, if we see that block 101 is entirely
the category "cats", and block 701 is entirely the category "shopping"
and block 901 is split 60/40 between the categories "transportation" and
"voting", then we can assume that the level of grouping is very high,
and the number of unknown values we haven't seen is also high.
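
As a concrete way to eyeball that kind of per-block grouping by hand, one
could bucket rows by block number via ctid against the newdata example
above -- just a diagnostic sketch, not a proposal for ANALYZE itself:

SELECT (ctid::text::point)[0]::int AS blkno, category, count(*)
FROM newdata
WHERE (ctid::text::point)[0]::int % 100 = 1
GROUP BY 1, 2
ORDER BY 1, 2;

(That scans the whole table, of course, which is exactly what ANALYZE is
trying to avoid; it's only useful for inspecting a distribution by hand.)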

Whereas if 101 is "cats" and 201 is "cats" and 301 through 501 are
"cats" with 2% other stuff, then we assume that the level of grouping is
moderate and it's just the case that most of the dataset is "cats".
Which means that the number of unknown values we haven't seen is low.

Whereas if 101, 201, 501, and 901 have near-identical distributions of
values, we assume that the level of grouping is very low, and that there
are very few values we haven't seen.

As someone else pointed out, full-block (the proposal) vs. random-row
(our current style) doesn't have a very significant effect on estimates
of Histograms and nullfrac, as long as the sampled blocks are widely
spaced.  Well, nullfrac is affected in the extreme example of a totally
ordered table where the nulls are all in one block, but I'll point out
that we can (and do) also miss that using our current algo.

Estimated grouping should, however, affect MCVs.  In cases where we
estimate that grouping levels are high, the expected % of observed
values should be "discounted" somehow.  That is, with total random
distribution you have a 1:1 ratio between observed frequency of a value
and assumed frequency.  However, with highly grouped values, you might
have a 2:1 ratio.

Again, more math (backed by statistical analysis) is needed.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-12 Thread Andres Freund
On 2013-12-12 12:13:24 -0500, Robert Haas wrote:
> On Thu, Dec 12, 2013 at 10:49 AM, Andres Freund  
> wrote:
> > If we were to start out streaming changes before the last running
> > transaction has finished, they would be visible in that exported
> > snapshot and you couldn't use it to to roll forward from anymore.
> 
> Actually, you could.  You'd just have to throw away any transactions
> whose XIDs are visible to the exported snapshot.  In other words, you
> begin replication at time T0, and all transactions which begin after
> that time are included in the change stream.  At some later time T1,
> all transactions in progress at time T0 have ended, and now you can
> export a snapshot at that time, or any later time, from which you can
> roll forward.  Any change-stream entries for XIDs which would be
> visible to that snapshot shouldn't be replayed when rolling forward
> from it, though.

But that would become too complex an interface, imo without a
corresponding benefit. If you skip the changes when rolling forward,
there's no point in streaming them out in the first place.

> I think it sucks (that's the technical term) to have to wait for all
> currently-running transactions to terminate before being able to begin
> streaming changes, because that could take a long time.

I don't think there's much of an alternative for replication solutions;
for other usecases, we may want to add an option to skip the wait. It's
not like that's something you do all the time. As soon as a slot is
acquired, there are no further waits.

> And you might
> well know that the long-running transaction which is rolling up
> enormous table A that you don't care about is never going to touch
> table B which you actually want to replicate.  Now, ideally, the DBA
> would have a way to ignore that long-running transaction and force
> replication to start, perhaps with the caveat that if that
> long-running transaction actually does touch B after all then we have
> to resync.

Puh. I honestly have zero confidence in DBAs making an informed decision
about something like this. Honestly, for a replication solution, how
often do you think this will be an issue?

> So imagine this.  After initiating logical replication, a replication
> solution either briefly x-locks a table it wants to replicate, so that
> there can't be anyone else touching it, or it observes who has a lock
> >= RowExclusiveLock and waits for all of those locks to drop away.  At
> that point, it knows that no currently-in-progress transaction can
> have modified the table prior to the start of replication, and begins
> copying the table.  If a transaction that began before the start of
> replication subsequently modifies the table, a WAL record will be
> written, and the core logical decoding support could let the plugin
> know by means of an optional callback (hey, btw, a change I can't
> decode just hit table XYZ).  The plugin will need to respond by
> recopying the table, which sucks, but it was the plugin's decision to
> be optimistic in the first place, and that will in many cases be a
> valid policy decision.  If no such callback arrives before the
> safe-snapshot point, then the plugin made the right bet and will reap
> the just rewards of its optimism.

Sure, all that's possible. But hell, it's complicated to use. If reality
proves people want this, let's go there, but let's get the basics right
and committed first.

All the logic around whether to decode a transaction is:
void
SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
   int nsubxacts, TransactionId *subxacts)
...
if (builder->state < SNAPBUILD_CONSISTENT)
{
/* ensure that only commits after this are getting replayed */
if (builder->transactions_after < lsn)
builder->transactions_after = lsn;
and then

/*
 * Should the contents of a transaction ending at 'ptr' be decoded?
 */
bool
SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
{
return ptr <= builder->transactions_after;
}

so it's not like it will require all too many changes.

What I can see as possibly getting into 9.4 is a FASTSTART option that
doesn't support exporting a snapshot, but doesn't have to wait for the
SNAPBUILD_CONSISTENT state in return. That's fine for some usecases,
although I don't think for any of the major ones.

> > It's not too difficult to provide an option to do that. What I've been
> > thinking of was to correlate the confirmation of consumption with the
> > transaction the SRF is running in. So, confirm the data as consumed if
> > it commits, and don't if not. I think we could do that relatively easily
> > by registering a XACT_EVENT_COMMIT.
> 
> That's a bit too accident-prone for my taste.  I'd rather the DBA had
> some equivalent of peek_at_replication(nchanges int).

One point for my suggested behaviour is that it closes a bigger race
condition. Currently as soon as s

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Claudio Freire
On Thu, Dec 12, 2013 at 3:29 PM, Tom Lane  wrote:
> Jeff Janes  writes:
>> It would be relatively easy to fix this if we trusted the number of visible
>> rows in each block to be fairly constant.  But without that assumption, I
>> don't see a way to fix the sample selection process without reading the
>> entire table.
>
> Yeah, varying tuple density is the weak spot in every algorithm we've
> looked at.  The current code is better than what was there before, but as
> you say, not perfect.  You might be entertained to look at the threads
> referenced by the patch that created the current sampling method:
> http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gt...@email.aon.at
>
> particularly
> http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at
>
>
> However ... where this thread started was not about trying to reduce
> the remaining statistical imperfections in our existing sampling method.
> It was about whether we could reduce the number of pages read for an
> acceptable cost in increased statistical imperfection.


Well, why not take a supersample containing all visible tuples from N
selected blocks, and do bootstrapping over it, with subsamples of M
independent rows each?




Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Tom Lane
Jeff Janes  writes:
> It would be relatively easy to fix this if we trusted the number of visible
> rows in each block to be fairly constant.  But without that assumption, I
> don't see a way to fix the sample selection process without reading the
> entire table.

Yeah, varying tuple density is the weak spot in every algorithm we've
looked at.  The current code is better than what was there before, but as
you say, not perfect.  You might be entertained to look at the threads
referenced by the patch that created the current sampling method:
http://www.postgresql.org/message-id/1tkva0h547jhomsasujt2qs7gcgg0gt...@email.aon.at

particularly
http://www.postgresql.org/message-id/flat/ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at#ri5u70du80gnnt326k2hhuei5nlnimo...@email.aon.at


However ... where this thread started was not about trying to reduce
the remaining statistical imperfections in our existing sampling method.
It was about whether we could reduce the number of pages read for an
acceptable cost in increased statistical imperfection.

regards, tom lane




Re: [HACKERS] In-Memory Columnar Store

2013-12-12 Thread Merlin Moncure
On Thu, Dec 12, 2013 at 12:18 PM, knizhnik  wrote:
> IMHO it is strange to see such small default values in postgresql
> configuration.

This (low default work mem) is because of three things:

1) Most queries do not really need a lot of work mem
2) Work mem stacks with each query using it -- so with your 1mb
setting vs 1000 connections, you get a gigabyte (see the sketch
below).  So, some conservatism is justified, although this setting
tended to be much more dangerous in the old days when we measured
memory in megabytes.
3) Postgres does not query available physical memory for default
settings due to portability issues.  So we tend to tune to "common
denominator".

merlin




Re: [HACKERS] In-Memory Columnar Store

2013-12-12 Thread knizhnik

On 12/12/2013 07:03 PM, Merlin Moncure wrote:
> On Thu, Dec 12, 2013 at 4:02 AM, knizhnik  wrote:
> Yeah. It's not fair to compare vs an implementation that is
> constrained to use only 1mb. For analytics work huge work mem is a
> pretty typical setting. 10x improvement is believable considering
> you've removed all MVCC overhead, locking, buffer management, etc. and
> have a simplified data structure.
>
> merlin

I agree that it is not a fair comparison. As an excuse I can say that I am
not an experienced PostgreSQL user, so I thought that setting
shared_buffers was enough to avoid disk access by PostgreSQL. Only after
getting such strange results did I start investigating how to properly
tune PostgreSQL parameters.


IMHO it is strange to see such small default values in the postgresql
configuration - PostgreSQL is not an embedded database, and nowadays even
mobile devices have several gigs of memory...
Also it would be nice to have one single switch - how much physical
memory PostgreSQL can use - and let PostgreSQL split it in an optimal way.
For example I have no idea how to optimally split memory between
"shared_buffers", "temp_buffers", "work_mem" and "maintenance_work_mem".
PostgreSQL itself should do this work much better than an inexperienced
administrator.


And one of the possible values of such a parameter could be "auto": make it
possible to automatically determine available memory (it is not a big
deal to check the amount of available RAM in the system). I know that
vendors of big databases never try to simplify configuration and
tuning of their products, just because most of the profit they get comes
from consulting. But I think that is not true for PostgreSQL.






Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?

2013-12-12 Thread Tom Lane
Andres Freund  writes:
> On 2013-12-12 11:55:51 -0500, Tom Lane wrote:
>> I'm not, however, terribly thrilled with the suggestions to add implicit
>> casts associated with this type.  Implicit casts are generally dangerous.

> It's a tradeoff. Currently we have the following functions returning LSNs
> as text:
> * pg_current_xlog_location
> * pg_current_xlog_insert_location
> * pg_last_xlog_receive_location
> * pg_last_xlog_replay_location
> one view containing LSNs
> * pg_stat_replication
> and the following functions accepting LSNs as textual parameters:
> * pg_xlog_location_diff
> * pg_xlogfile_name

> The question is how do we deal with backward compatibility when
> introducing a LSN type? There might be some broken code around
> monitoring if we simply replace the type without implicit casts.

Given the limited usage, how bad would it really be if we simply
made all those take/return the LSN type?  As long as the type's
I/O representation looks like the old text format, I suspect
most queries wouldn't notice.

regards, tom lane




Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Jeff Janes
On Thu, Dec 12, 2013 at 6:39 AM, Florian Pflug  wrote:

> Here's an analysis of Jeff Janes' simple example of a table where our
> n_distinct estimate is way off.
>
> On Dec11, 2013, at 00:03 , Jeff Janes  wrote:
> > create table baz as select floor(random()*10000000), md5(random()::text)
> > from generate_series(1,100000000);
> > create table baz2 as select * from baz order by floor;
> > create table baz3 as select * from baz order by md5(floor::text);
> >
> > baz unclustered, baz2 is clustered with perfect correlation, baz3 is
> > clustered but without correlation.
> >
> > After analyzing all of them:
> >
> > select tablename, n_distinct,correlation  from pg_stats where tablename
> >  like 'baz%' and attname='floor';
> >  tablename | n_distinct  | correlation
> > -----------+-------------+-------------
> >  baz       | 8.56006e+06 |  0.00497713
> >  baz2      |      333774 |           1
> >  baz3      |      361048 |  -0.0118147
>
> I think I understand what's going on here. I worked with a reduced test
> case of 1e7 rows containing random values between 0 and 5e5 and
> instrumented analyze to print the raw ndistinct and nmultiple values of
> the sample population (i.e. the number of distinct values in the sample,
> and the number of distinct values which appeared more than once). I've
> also considered only baz and baz2, and thus removed the then-unnecessary
> md5 column. To account for the reduced table sizes, I adjusted
> default_statistics_target to 10 instead of 100. The resulting estimates
> are then
>
>  tablename | n_distinct (est) | n_distinct (act)
> -----------+------------------+------------------
>  baz       |           391685 |           500000
>  baz2      |            36001 |           500000
>
> ANALYZE assumes that both tables contain 1e7 rows and samples 3000 of
> those.
>
> The sample of baz contains 2989 distinct values, 11 of which appear more
> than once. The sample of baz2 contains 2878 distinct values, 117 (!) of
> which appear more than once.
>
> The very different results stem from the Duj1 estimator we use. It
> estimates n_distinct by computing n*d/(n - f1 + f1*n/N) where n is the
> number of samples, N the number of rows, d the number of distinct samples,
> and f1 the number of distinct samples occurring exactly once. If all
> samples are unique (i.e. n=d=f1) this yields N. But if f1 is less than d,
> the result drops very quickly - sampling baz2 produces 117 non-unique
> values out of 2878 - roughly 0.03% - and the estimate is already less
> than 1/10 of what it would be if f1 were 0.
>
> Which leaves us with the question why sampling baz2 produces more duplicate
> values than sampling baz does. Now, if we used block sampling, that
> behaviour would be unsurprising - baz2 is sorted, so each block very likely
> contains each value more than once, since the row count exceeds the number
> of possible values by more than a magnitude. In fact, with block sampling,
> we'd probably see f1 values close to 0 and thus our estimate of n_distinct
> would be roughly equal to the number of distinct values in the *sample*
> population, i.e. about 300 or so.
>
> So why does Vitter's algorithm fail here? Given that we see inflated
> numbers of duplicate values in our sample, yet still far fewer than
> block-based sampling would yield, my guess is that it's the initial
> reservoir filling that bites us here. After that initial filling step, the
> reservoir contains a lot of consecutive rows, i.e. a block-based sample
> taken from just a few blocks. If the replacement phase that follows
> somehow uses a slightly smaller replacement probability than it should,
> more of these rows will survive replacement, resulting in exactly the kind
> of inflated numbers of duplicate values we're seeing. I've yet to validate
> this by looking at the reservoir before and after the replacement stage,
> though.
>
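
(To make the quoted Duj1 arithmetic concrete, both estimates can be
re-derived from the sample statistics above -- n = 3000, N = 1e7, and
f1 = 2989-11 = 2978 for baz vs. 2878-117 = 2761 for baz2:

SELECT round(3000 * 2989 / (3000 - 2978 + 2978 * 3000 / 1e7)) AS baz_est,
       round(3000 * 2878 / (3000 - 2761 + 2761 * 3000 / 1e7)) AS baz2_est;

which gives ~391685 and ~36001, matching the table above.)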


I think the problem is more subtle than that.  It is easier to visualize it
if you think of every block having the same number of rows, with that
number being fairly large.  If you pick 30,000 rows at random from
1,000,000 blocks, the number of rows chosen from any given block should be
close to following a Poisson distribution with an average of 0.03, which means
about 29113 blocks should have exactly 1 row chosen from them while 441
would have two or more rows chosen from them.

But if you instead select 30,000 rows from 30,000 blocks, which is what we
ask Vitter's algorithm to do, you get about a Poisson distribution with
average of 1.0.  Then about 11036 blocks have exactly one row chosen from
them, and 7927 blocks have two or more rows sampled from it.  Another
11,036 blocks get zero rows selected from them due to Vitter, in addition
to the 970,000 that didn't even get submitted to Vitter in the first place.
 That is why you see too many duplicates for clustered data, as too many
blocks are sampled multiple times.
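
Those block counts are just the Poisson terms; as a sanity check in plain
SQL:

-- lambda = 0.03: 30,000 rows spread over 1,000,000 blocks
SELECT round(1e6 * 0.03 * exp(-0.03)) AS exactly_one,        -- ~29113
       round(1e6 * (1 - 1.03 * exp(-0.03))) AS two_or_more;  -- ~441
-- lambda = 1.0: 30,000 rows over 30,000 blocks
SELECT round(3e4 * exp(-1.0)) AS exactly_one,                -- ~11036
       round(3e4 * (1 - 2 * exp(-1.0))) AS two_or_more;      -- ~7927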

The Poisson argument doesn't apply cleanly

Re: [HACKERS] Time-Delayed Standbys

2013-12-12 Thread Fabrízio de Royes Mello
On Thu, Dec 12, 2013 at 3:39 PM, Simon Riggs  wrote:
>
> On 12 December 2013 15:19, Simon Riggs  wrote:
>
> > Don't panic guys! I meant UTC offset only. And yes, it may not be
> > needed, will check.
>
> Checked, all non-UTC TZ offsets work without further effort here.
>

Thanks!

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello


Re: [HACKERS] typo: XIDs are actually compared using modulo-2^32 arithmetic

2013-12-12 Thread Tom Lane
Gianni Ciolli  writes:
> It seems there is a typo here:
>   
> http://www.postgresql.org/docs/devel/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND
> where we say that we compare XIDs using arithmetic modulo 2^31, which
> should instead be 2^32 (as it is with uint32, e.g. xid_age).

[ thinks about that for awhile... ]  Yeah, I think you're right.
Patch pushed, thanks!

regards, tom lane




Re: [HACKERS] Time-Delayed Standbys

2013-12-12 Thread Simon Riggs
On 12 December 2013 15:19, Simon Riggs  wrote:

> Don't panic guys! I meant UTC offset only. And yes, it may not be
> needed, will check.

Checked, all non-UTC TZ offsets work without further effort here.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?

2013-12-12 Thread Andres Freund
Hi,

On 2013-12-12 11:55:51 -0500, Tom Lane wrote:
> I'm not, however, terribly thrilled with the suggestions to add implicit
> casts associated with this type.  Implicit casts are generally dangerous.

It's a tradeoff. Currently we have the following functions returning LSNs
as text:
* pg_current_xlog_location
* pg_current_xlog_insert_location
* pg_last_xlog_receive_location
* pg_last_xlog_replay_location
one view containing LSNs
* pg_stat_replication
and the following functions accepting LSNs as textual parameters:
* pg_xlog_location_diff
* pg_xlogfile_name

The question is how do we deal with backward compatibility when
introducing a LSN type? There might be some broken code around
monitoring if we simply replace the type without implicit casts. But
just leaving all those as-is seems quite unattractive.
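
For reference, a typical monitoring query that would be affected -- today
both the function arguments and the view column are text:

SELECT application_name,
       pg_xlog_location_diff(pg_current_xlog_location(), sent_location)
         AS lag_bytes
FROM pg_stat_replication;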

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-12 Thread Robert Haas
On Thu, Dec 12, 2013 at 10:49 AM, Andres Freund  wrote:
>> I hadn't realized that the options were going to be different for
>> logical vs. physical.
>
> I don't see how we could avoid that, there just are some differences
> between both.

Right, I'm not complaining, just observing that it was a point I had overlooked.

>> So you could
>> also do ACQUIRE_LOGICAL_SLOT, ACQUIRE_PHYSICAL_SLOT,
>> START_REPLICATION, START_LOGICAL_REPLICATION, and RELEASE_SLOT.  I'm
>> not sure whether that's better.
>
> Not sure either, but I slightly favor keeping the the toplevel slot
> commands the same. I think we'll want one namespace for both and
> possibly similar reporting functions and that seems less surprising if
> they are treated more similar.

OK.

> If we were to start out streaming changes before the last running
> transaction has finished, they would be visible in that exported
> snapshot and you couldn't use it to to roll forward from anymore.

Actually, you could.  You'd just have to throw away any transactions
whose XIDs are visible to the exported snapshot.  In other words, you
begin replication at time T0, and all transactions which begin after
that time are included in the change stream.  At some later time T1,
all transactions in progress at time T0 have ended, and now you can
export a snapshot at that time, or any later time, from which you can
roll forward.  Any change-stream entries for XIDs which would be
visible to that snapshot shouldn't be replayed when rolling forward
from it, though.

I think it sucks (that's the technical term) to have to wait for all
currently-running transactions to terminate before being able to begin
streaming changes, because that could take a long time.  And you might
well know that the long-running transaction which is rolling up
enormous table A that you don't care about is never going to touch
table B which you actually want to replicate.  Now, ideally, the DBA
would have a way to ignore that long-running transaction and force
replication to start, perhaps with the caveat that if that
long-running transaction actually does touch B after all then we have
to resync.  Your model's fine when we want to replicate the whole
database, but a big part of why I want this feature is to allow
finer-grained replication, down to the table level, or even slices of
tables.

So imagine this.  After initiating logical replication, a replication
solution either briefly x-locks a table it wants to replicate, so that
there can't be anyone else touching it, or it observes who has a lock
>= RowExclusiveLock and waits for all of those locks to drop away.  At
that point, it knows that no currently-in-progress transaction can
have modified the table prior to the start of replication, and begins
copying the table.  If a transaction that began before the start of
replication subsequently modifies the table, a WAL record will be
written, and the core logical decoding support could let the plugin
know by means of an optional callback (hey, btw, a change I can't
decode just hit table XYZ).  The plugin will need to respond by
recopying the table, which sucks, but it was the plugin's decision to
be optimistic in the first place, and that will in many cases be a
valid policy decision.  If no such callback arrives before the
safe-snapshot point, then the plugin made the right bet and will reap
the just rewards of its optimism.

>> I don't have a problem with the behavior.  Seems useful.  One useful
>> addition might be to provide an option to stream out up to X changes
>> but without consuming them, so that the DBA can peek at the
>> replication stream.  I think it's a safe bet DBAs will want to do
>> things like that, so it'd be nice to make it easy, if we can.
>
> It's not too difficult to provide an option to do that. What I've been
> thinking of was to correlate the confirmation of consumption with the
> transaction the SRF is running in. So, confirm the data as consumed if
> it commits, and don't if not. I think we could do that relatively easily
> by registering a XACT_EVENT_COMMIT.

That's a bit too accident-prone for my taste.  I'd rather the DBA had
some equivalent of peek_at_replication(nchanges int).
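Something explicitly read-only, i.e. roughly (name and shape taken straight
from the suggestion above, nothing implemented):

SELECT * FROM peek_at_replication(10);  -- show next 10 changes, consume none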

>> Sounds about right, but I think we need to get religion about figuring
>> out what terminology to use.  At the moment it seems to vary quite a
>> bit between "logical", "logical decoding", and "decoding".  Not sure
>> how to nail that down.
>
> Agreed. Perhaps we should just avoid both logical and decoding entirely
> and go for "changestream" or similar?

So wal_level=changestream?  Not feeling it.  Of course we don't have
to be 100% rigid about this but we should try to make our terminology
corresponding with natural semantic boundaries.  Maybe we should call
the process logical decoding, and the results logical streams, or
something like that.

>> As a more abstract linguistic question, what do we think the
>> difference is between logical *replication* and logical *decoding*?
>> Are they 

[HACKERS] patch: make_timestamp function

2013-12-12 Thread Pavel Stehule
Hello

this patch tries to complete the set of functions make_date and make_time
with make_timestamp.

Regards

Pavel
commit a1344a0624f87438e2a12c0a7263a0e6dc9a1a30
Author: Pavel Stehule 
Date:   Thu Dec 12 18:08:47 2013 +0100

initial implementation make_timestamp

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a411e3a..bdf285c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -6735,6 +6735,30 @@ SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');

 
  
+  make_timestamp
+ 
+ 
+  
+   make_timestamp(year int,
+   month int,
+   day int,
+   hour int,
+   min int,
+   sec double precision)
+  
+ 
+
+timestamp
+
+ Create timestamp from year, month, day, hour, minute and seconds fields
+
+make_timestamp(2013, 7, 15, 8, 15, 23.5)
+2013-07-15 08:15:23.5
+   
+
+   
+
+ 
   now
  
  now()
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index c3c71b7..9faa5d9 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -477,6 +477,88 @@ timestamptz_in(PG_FUNCTION_ARGS)
 	PG_RETURN_TIMESTAMPTZ(result);
 }
 
+/*
+ * make_timestamp() - timestamp constructor
+ */
+Datum
+make_timestamp(PG_FUNCTION_ARGS)
+{
+	struct pg_tm tm;
+	int			tm_hour = PG_GETARG_INT32(3);
+	int			tm_min = PG_GETARG_INT32(4);
+	double		sec = PG_GETARG_FLOAT8(5);
+	TimeOffset	date;
+	TimeOffset	time;
+	int			dterr;
+	Timestamp	result;
+
+	tm.tm_year = PG_GETARG_INT32(0);
+	tm.tm_mon = PG_GETARG_INT32(1);
+	tm.tm_mday = PG_GETARG_INT32(2);
+
+	/*
+	 * Note: we'll reject zero or negative year values.  Perhaps negatives
+	 * should be allowed to represent BC years?
+	 */
+	dterr = ValidateDate(DTK_DATE_M, false, false, false, &tm);
+
+	if (dterr != 0)
+		ereport(ERROR,
+(errcode(ERRCODE_DATETIME_FIELD_OVERFLOW),
+ errmsg("date field value out of range: %d-%02d-%02d",
+		tm.tm_year, tm.tm_mon, tm.tm_mday)));
+
+	if (!IS_VALID_JULIAN(tm.tm_year, tm.tm_mon, tm.tm_mday))
+		ereport(ERROR,
+(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
+ errmsg("date out of range: %d-%02d-%02d",
+		tm.tm_year, tm.tm_mon, tm.tm_mday)));
+
+	date = date2j(tm.tm_year, tm.tm_mon, tm.tm_mday) - POSTGRES_EPOCH_JDATE;
+
+	/* This should match the checks in DecodeTimeOnly */
+	if (tm_hour < 0 || tm_min < 0 || tm_min > MINS_PER_HOUR - 1 ||
+		sec < 0 || sec > SECS_PER_MINUTE ||
+		tm_hour > HOURS_PER_DAY ||
+	/* test for > 24:00:00 */
+		(tm_hour == HOURS_PER_DAY && (tm_min > 0 || sec > 0)))
+		ereport(ERROR,
+(errcode(ERRCODE_DATETIME_FIELD_OVERFLOW),
+ errmsg("time field value out of range: %d:%02d:%02g",
+		tm_hour, tm_min, sec)));
+
+	/* This should match tm2time */
+#ifdef HAVE_INT64_TIMESTAMP
+	time = (((tm_hour * MINS_PER_HOUR + tm_min) * SECS_PER_MINUTE)
+			* USECS_PER_SEC) + rint(sec * USECS_PER_SEC);
+
+	result = date * USECS_PER_DAY + time;
+	/* check for major overflow */
+	if ((result - time) / USECS_PER_DAY != date)
+		ereport(ERROR,
+(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
+ errmsg("timestamp out of range: %d-%02d-%02d %d:%02d:%02g",
+		tm.tm_year, tm.tm_mon, tm.tm_mday,
+		tm_hour, tm_min, sec)));
+
+	/* check for just-barely overflow (okay except time-of-day wraps) */
+	/* caution: we want to allow 1999-12-31 24:00:00 */
+	if ((result < 0 && date > 0) ||
+		 (result > 0 && date < -1))
+		ereport(ERROR,
+(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
+ errmsg("timestamp out of range: %d-%02d-%02d %d:%02d:%02g",
+		tm.tm_year, tm.tm_mon, tm.tm_mday,
+		tm_hour, tm_min, sec)));
+#else
+	time = ((tm_hour * MINS_PER_HOUR + tm_min) * SECS_PER_MINUTE) + sec;
+	result = date * SECS_PER_DAY + time;
+#endif
+
+	PG_RETURN_TIMESTAMP(result);
+}
+
+
 /* timestamptz_out()
  * Convert a timestamp to external form.
  */
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 0117500..6334fef 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4680,6 +4680,8 @@ DATA(insert OID = 3846 ( make_date	PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 1082
 DESCR("construct date");
 DATA(insert OID = 3847 ( make_time	PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 1083 "23 23 701" _null_ _null_ "{hour,min,sec}" _null_ make_time _null_ _null_ _null_ ));
 DESCR("construct time");
+DATA(insert OID = 3920 ( make_timestamp	PGNSP PGUID 12 1 0 0 0 f f f f t f i 6 0 1114 "23 23 23 23 23 701" _null_ _null_ "{year,month,day,hour,min,sec}" _null_ make_timestamp _null_ _null_ _null_ ));
+DESCR("construct timestamp");
 
 /* spgist support functions */
 DATA(insert OID = 4001 (  spggettuple	   PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2281 2281" _null_ _null_ _null_ _null_	spggettuple _null_ _null_ _null_ ));
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timesta

Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Jeff Janes
On Wed, Dec 11, 2013 at 2:33 PM, Greg Stark  wrote:

>
> I think we're all wet here. I don't see any bias towards larger or smaller
> rows. Larger rows will be on a larger number of pages but there will be
> fewer of them on any one page. The average effect should be the same.
>
> Smaller values might have a higher variance with block based sampling than
> larger values. But that actually *is* the kind of thing that Simon's
> approach of just compensating with later samples can deal with.
>

I think that looking at all rows in randomly-chosen blocks will not bias
size, or histograms.  But it will bias n_distinct and MCV for some
distributions of data, unless we find some way to compensate for it.

But even for avg size and histograms, what does block sampling get us?  We
get larger sample sizes for the same IO, but those samples are less
independent (assuming data is not randomly scattered over the table), so the
"effective sample size" is less than the true sample size.  So we can't
just sample 100 times fewer blocks because there are about 100 rows per
block--doing so would not bias our avg size or histogram boundaries, but it
would certainly make them noisier.

Cheers,

Jeff


Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?

2013-12-12 Thread Tom Lane
Alvaro Herrera  writes:
> Robert Haas escribió:
>> I am happy to have my old patch resurrected - could become a trend.
>> But someone should probably go back and check who objected and for
>> what reasons.

> Here it is FWIW:
> http://www.postgresql.org/message-id/flat/ca+tgmozrmnn0evesd-kxb9e-mvdmwoti6guujuvqp_8q2c5...@mail.gmail.com

AFAICS the objections were "why bother with a datatype for just one
function".  If we've now got multiple use-cases, that loses its force.

I'm not, however, terribly thrilled with the suggestions to add implicit
casts associated with this type.  Implicit casts are generally dangerous.

regards, tom lane




Re: [HACKERS] GIN improvements part 1: additional information

2013-12-12 Thread Alexander Korotkov
On Tue, Dec 10, 2013 at 12:26 AM, Heikki Linnakangas <
hlinnakan...@vmware.com> wrote:

> On 12/09/2013 11:34 AM, Alexander Korotkov wrote:
>
>> On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas
>> wrote:
>>
>>  Even if we use varbyte encoding, I wonder if it would be better to treat
>>> block + offset number as a single 48-bit integer, rather than encode them
>>> separately. That would allow the delta of two items on the same page to
>>> be
>>> stored as a single byte, rather than two bytes. Naturally it would be a
>>> loss on other values, but would be nice to see some kind of an analysis
>>> on
>>> that. I suspect it might make the code simpler, too.
>>>
>>
>> Yeah, I had that idea, but I thought it's not a better option. Will try to
>> do some analysis.
>>
>
> The more I think about that, the more convinced I am that it's a good
> idea. I don't think it will ever compress worse than the current approach
> of treating block and offset numbers separately, and, although I haven't
> actually tested it, I doubt it's any slower. About the same amount of
> arithmetic is required in both versions.
>
> Attached is a version that does that. Plus some other minor cleanup.
>
> (we should still investigate using a completely different algorithm,
> though)
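
For reference, the packing being discussed is just blkno * 2^16 + offnum,
so two items on the same page differ only in the low bits and their varbyte
delta stays small. A sketch of the arithmetic (not the patch's actual code):

SELECT (12345::int8 << 16) | 7 AS packed,
       ((12345::int8 << 16) | 7) >> 16 AS block_back,
       ((12345::int8 << 16) | 7) & 65535 AS offset_back;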


I've thought about different algorithms a little more. The general problem
I see is online update. We need it, while it is typically not covered by
the research at all. We already have to invent a small index at the end of
the page. Different encoding methods add more challenges. In general,
methods can be classified into two groups:
1) Values aren't aligned by bytes (gamma-codes, PFOR etc.)
2) Multiple values are packed together in small groups (simple-9, simple-18)
For the first group of methods, when inserting in the middle of the page we
would have to do a non-byte-aligned shift of the right part of the values.
I don't know how expensive this shift is, but I expect it would be much
slower than memmove.
When values are packed into small groups, we have to either insert an
inefficiently encoded value or re-encode the whole right part of the values.

The option I see is to encode the bins between item indexes separately.
This still might be slower and require much more complicated maintenance.
And it also considerably complicates further work on storing additional
information in GIN.

Any other thoughts?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?

2013-12-12 Thread Alvaro Herrera
Robert Haas escribió:

> I am happy to have my old patch resurrected - could become a trend.
> But someone should probably go back and check who objected and for
> what reasons.

Here it is FWIW:

http://www.postgresql.org/message-id/flat/ca+tgmozrmnn0evesd-kxb9e-mvdmwoti6guujuvqp_8q2c5...@mail.gmail.com

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing

2013-12-12 Thread Alvaro Herrera
Andres Freund wrote:
> On 2013-12-11 22:08:41 -0300, Alvaro Herrera wrote:
> > Andres Freund wrote:

> > > I worry that all these multixact accesses will create huge performance
> > > problems due to the inefficiency of the multixactid cache. If you scan a
> > > huge table there very well might be millions of different multis we
> > > touch and afaics most of them will end up in the multixactid cache. That
> > > can't end well.
> > > I think we need to either regularly delete that cache when it goes past,
> > > say, 100 entries, or just bypass it entirely.
> > 
> > Delete the whole cache, or just prune it of the least recently used
> > entries?  Maybe the cache should be a dlist instead of the open-coded
> > stuff that's there now; that would enable pruning of the oldest entries.
> > I think a blanket deletion might be a cure worse than the disease.  I
> > see your point anyhow.
> 
> I was thinking of just deleting the whole thing. Revamping the cache
> mechanism to be more efficient, is an important goal, but it imo
> shouldn't be lumped together with this. Now you could argue that purging
> the cache shouldn't be either - but 9.3.2+ the worst case essentially is
> O(n^2) in the number of rows in a table. Don't think that can be
> acceptable.

So I think this is the only remaining issue to make this patch
committable (I will address the other points in Andres' email.)  Since
there has been no other feedback on this thread, Andres and I discussed
the cache issue a bit over IM and we seem to agree that a patch to
revamp the cache should be a fairly localized change that could be
applied to both 9.3 and master, separately from this fix.  Doing cache
deletion seems more invasive, and would not provide better performance anyway.

Since having a potentially O(n^2) cache behavior but with working freeze
seems better than no O(n^2) but broken freeze, I'm going to apply this
patch shortly and then work on reworking the cache.
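
To be concrete about the pruning idea, something along these lines (a
sketch with invented names, not the actual patch), using a doubly linked
list so the least recently used entries can be dropped once the cache
exceeds a cap:

#include <stdlib.h>

#define MXACT_CACHE_MAX 256     /* arbitrary cap, for illustration only */

typedef struct mXactCacheEnt
{
    unsigned int    multi;      /* key: the MultiXactId */
    /* ... cached members would live here ... */
    struct mXactCacheEnt *prev; /* towards more recently used */
    struct mXactCacheEnt *next; /* towards less recently used */
} mXactCacheEnt;

static mXactCacheEnt *cache_head;   /* most recently used */
static mXactCacheEnt *cache_tail;   /* least recently used */
static int      cache_size;

/*
 * Drop entries from the tail until we are back under the cap, so that a
 * scan touching millions of different multis cannot grow the cache
 * without bound.
 */
static void
mXactCachePrune(void)
{
    while (cache_size > MXACT_CACHE_MAX)
    {
        mXactCacheEnt *victim = cache_tail;

        cache_tail = victim->prev;
        if (cache_tail)
            cache_tail->next = NULL;
        else
            cache_head = NULL;
        free(victim);
        cache_size--;
    }
}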

Are there other opinions?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-12 Thread Andres Freund
On 2013-12-12 10:01:21 -0500, Robert Haas wrote:
> On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund  wrote:
> > I think there'll always be a bit of a difference between slots for
> > physical and logical data, even if 90% of the implementation is the
> > same. We can signal that difference by specifying logical/physical as an
> > option or having two different sets of commands.
> >
> > Maybe?
> >
> > ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
> > ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
> > -- already exists without slot, PHYSICAL arguments
> > START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
> > START_REPLICATION SLOT LOGICAL slot plugin_options
> > RELEASE_REPLICATION_SLOT slot_name
> 
> I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options,
> but basically this seems OK to me.

Uh, yes.

> I hadn't realized that the options were going to be different for
> logical vs. physical.

I don't see how we could avoid that; there just are some differences
between the two.

> So you could
> also do ACQUIRE_LOGICAL_SLOT, ACQUIRE_PHYSICAL_SLOT,
> START_REPLICATION, START_LOGICAL_REPLICATION, and RELEASE_SLOT.  I'm
> not sure whether that's better.

Not sure either, but I slightly favor keeping the toplevel slot
commands the same. I think we'll want one namespace for both, and
possibly similar reporting functions, and that seems less surprising if
they are treated more similarly.

> > So what you could get is something that starts streaming you changes
> > sometime after you asked it to start streaming, without a guarantee that
> > you can restart at exactly the position you stopped. If that's useful,
> > we can do it, but I am not sure what the usecase would be?
> 
> I haven't yet looked closely at the snapshot-building stuff, but my
> thought is that you ought to be able to decode any transactions that
> start after you make the connection.  You might not be able to decode
> transactions that are already in progress at that point, because you
> might have already missed XID assignment records, catalog changes,
> etc. that they've performed.  But transactions that begin after that
> point ought to be OK.

It works mostly like that, yes. At least on primaries. When we start
decoding, we jot down the current xlog insertion pointer to know where
to start decoding from, then trigger a xl_running_xacts record to be
logged so we have enough information. Then we start reading from that
point onwards. On standbys the process is the same, just that we have to
wait for the primary to issue a xl_running_xacts.
(I had considered starting with information from the procarray, but
turns out that's hard to do without race conditions.)

We only decode changes in transactions that commit after the last
transaction that was in-progress when we started observing has finished
though. That allows us to export a snapshot when the last still-running
transaction finished which shows a snapshot of the database that can be
rolled forward exactly by the changes contained in the changestream. I
think that's a useful property for the majority of cases.

If we were to start out streaming changes before the last running
transaction has finished, they would be visible in that exported
snapshot and you couldn't use it to roll forward from anymore.

It'd be pretty easy to optionally decode the transactions we currently
skip if we want that feature later. That would remove the option to
export a snapshot in many cases though (think suboverflowed snapshots).

> I have a feeling you're going to tell me it
> doesn't work like that, but maybe it should, because there's a whole
> lot of benefit in having decoding start up quickly, and a whole lot of
> benefit also to having the rules for that be easy to understand.

I am not sure whether the above qualifies as "doesn't work like that"; if
not, sometimes the correct thing isn't the immediately obvious thing. I
think "all transactions that were running when initiating decoding need
to finish" is reasonably easy to explain.

> > I am also open to different behaviour for the SRF, but I am not sure
> > what that could be. There's just no sensible way to stream data on the
> > SQL level afaics.

> I don't have a problem with the behavior.  Seems useful.  One useful
> addition might be to provide an option to stream out up to X changes
> but without consuming them, so that the DBA can peek at the
> replication stream.  I think it's a safe bet DBAs will want to do
> things like that, so it'd be nice to make it easy, if we can.

It's not too difficult to provide an option to do that. What I've been
thinking of is to correlate the confirmation of consumption with the
transaction the SRF is running in. So, confirm the data as consumed if
it commits, and don't if it doesn't. I think we could do that relatively
easily by registering a XACT_EVENT_COMMIT callback.
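
A minimal sketch of what I mean (the slot-side bookkeeping is elided and
the names are invented):

#include "postgres.h"
#include "access/xact.h"

/* Position the SRF has streamed up to, but not yet durably confirmed. */
static XLogRecPtr candidate_confirmed_lsn;

static void
decoding_xact_callback(XactEvent event, void *arg)
{
    if (event == XACT_EVENT_COMMIT)
    {
        /*
         * The transaction that consumed the changes committed, so advance
         * the slot's confirmed position to candidate_confirmed_lsn (the
         * actual slot update is elided here).
         */
    }

    /*
     * On abort we simply forget candidate_confirmed_lsn; the same changes
     * will be streamed out again the next time.
     */
}

void
decoding_register_confirm_callback(void)
{
    RegisterXactCallback(decoding_xact_callback, NULL);
}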

> > What about pg_decoding_slot_get_[binary_]changes()?
>
> Sounds about right, but I think we need to get religion about figuring
> out what terminology to use.

[HACKERS] typo: XIDs are actually compared using modulo-2^32 arithmetic

2013-12-12 Thread Gianni Ciolli
Hi,

It seems there is a typo here:

  
http://www.postgresql.org/docs/devel/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND

where we say that we compare XIDs using arithmetic modulo 2^31, which
should instead be 2^32 (as it is with uint32, e.g. xid_age).
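
For reference, the modulo-2^32 comparison boils down to this (simplified
from TransactionIdPrecedes, ignoring the special handling of permanent
XIDs):

#include <stdbool.h>
#include <stdint.h>

static bool
xid_precedes(uint32_t id1, uint32_t id2)
{
    /*
     * The unsigned subtraction wraps modulo 2^32, and the signed cast
     * makes half of the XID space count as "ahead" and half as "behind".
     */
    int32_t     diff = (int32_t) (id1 - id2);

    return diff < 0;
}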

Best wishes,
Dr. Gianni Ciolli - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
gianni.cio...@2ndquadrant.it | www.2ndquadrant.it




Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-12 Thread Tom Lane
I wrote:
> That's about what I thought: it's unique-ifying according to the original
> semijoin qual, without realizing that the pulled-up clause from the lower
> WHERE would need to be treated as part of the semijoin qual.  This isn't
> a bug in the existing code, because the case can never arise, since we
> don't treat an IN/=ANY as a semijoin if the sub-select contains any
> outer-level Vars.  But if you remove that check from
> convert_ANY_sublink_to_join then you've got to deal with the problem.

> That said, I'm not too sure where the problem is in detail.  I'd have
> thought that subquery pullup would stick the subquery's WHERE clause
> into the join quals of the parent JoinExpr node.  Is that not happening,
> or is it perhaps not sufficient to cue the UniquePath machinery?

BTW, on further thought, I'm afraid this is a bigger can of worms than
it appears.  The remarks above presume that the subquery is simple enough
to be pulled up, which is the case in this example.  It might not be too
hard to make that case work.  But what if the subquery *isn't* simple
enough to be pulled up --- for instance, it includes grouping or
aggregation?  Then there's no way to unify its WHERE clause with the upper
semijoin qual.  At the very least, this breaks the uniqueify-then-do-a-
plain-join implementation strategy for semijoins.

So I'm now thinking this patch isn't worth pursuing.  Getting all the
corner cases right would be a significant amount of work, and in the
end it would only benefit strangely-written queries.

regards, tom lane




Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2013-12-12 Thread Robert Haas
On Thu, Dec 12, 2013 at 12:27 AM, Amit Kapila  wrote:
> On Thu, Dec 12, 2013 at 3:43 AM, Peter Eisentraut  wrote:
>> This patch fails the regression tests; see attachment.
>
> Thanks for reporting the diffs. The reason for the failures is that
> decoding for the tuple is still not done, as mentioned in the Notes
> section of the mail
>
> (http://www.postgresql.org/message-id/caa4ek1jeuby16uwrda2tabkk+rlrl3giyyqy1mvh_6cthmd...@mail.gmail.com)
>
> However, to keep the patch sane, I will do that and post an updated
> patch, but I think the main idea behind the new approach at this
> point is to get feedback on whether such an optimization is acceptable
> for worst-case scenarios and, if not, whether we can get this done
> with a table-level or GUC option.

I don't understand why lack of decoding support should cause
regression tests to fail.  I thought decoding was only being done
during WAL replay, a case not exercised by the regression tests.

A few other comments:

+#define PGRB_HKEY_PRIME         11             /* prime number used for rolling hash */
+#define PGRB_HKEY_SQUARE_PRIME  11 * 11        /* prime number used for rolling hash */
+#define PGRB_HKEY_CUBE_PRIME    11 * 11 * 11   /* prime number used for rolling hash */

11 * 11 can't accurately be described as a prime number.  Nor can 11 *
11 * 11.  Please adjust the comment.  Also, why 11?
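
For context, this is how such constants are typically used in a Rabin-Karp
style rolling hash over a 4-byte window - a generic sketch of the
technique, not the patch's actual code:

#include <stdint.h>

#define PGRB_HKEY_PRIME         11
#define PGRB_HKEY_SQUARE_PRIME  (11 * 11)
#define PGRB_HKEY_CUBE_PRIME    (11 * 11 * 11)

/* Hash of a 4-byte window: b0*11^3 + b1*11^2 + b2*11 + b3. */
static uint32_t
window_hash(const unsigned char *p)
{
    return p[0] * PGRB_HKEY_CUBE_PRIME +
           p[1] * PGRB_HKEY_SQUARE_PRIME +
           p[2] * PGRB_HKEY_PRIME +
           p[3];
}

/*
 * Slide the window one byte to the right in O(1): subtract the byte that
 * falls out, scale by the prime, and add the incoming byte.
 */
static uint32_t
roll_hash(uint32_t hash, unsigned char out, unsigned char in)
{
    return (hash - out * PGRB_HKEY_CUBE_PRIME) * PGRB_HKEY_PRIME + in;
}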

It doesn't appear that pglz_hist_idx is changed except for whitespace;
please revert that hunk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Time-Delayed Standbys

2013-12-12 Thread Simon Riggs
On 12 December 2013 15:03, Robert Haas  wrote:
> On Thu, Dec 12, 2013 at 9:52 AM, Tom Lane  wrote:
>> Simon Riggs  writes:
>>> On 12 December 2013 11:05, Andres Freund  wrote:
>>>>> My suggestion would be to add the TZ to the checkpoint record. This
>>>>> way all users of WAL can see the TZ of the master and act accordingly.
>>>>> I'll do a separate patch for that.
>>
>>>> Intuitively I'd say that might be useful - but I am not really sure what
>>>> for. And we don't exactly have a great interface for looking at a
>>>> checkpoint's data. Maybe add it to the control file instead?
>>
>>> That's actually what I had in mind, I just phrased it badly in mid-thought.
>>
>> I don't think you realize what a can of worms that would be.  There's
>> no compact representation of "a timezone", unless you are only proposing
>> to store the UTC offset; and frankly I'm not particularly seeing the point
>> of that.
>
> +1.  I can see the point of storing a timestamp in each checkpoint
> record, if we don't already, but time zones should be completely
> irrelevant to this feature.  Everything should be reckoned in seconds
> since the epoch.

Don't panic guys! I meant UTC offset only. And yes, it may not be
needed, will check.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-12 Thread Tom Lane
Kevin Grittner  writes:
> Further confirmation using the EXPLAIN patch with Antonin's v2
> patch against the table before any EXPLAIN or ANALYZE:

>  Hash Join  (cost=37.12..80.40 rows=442 width=12)
>    Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = lower.f2))
>    ->  Seq Scan on subselect_tbl upper  (cost=0.00..27.70 rows=1770 width=16)
>    ->  Hash  (cost=34.12..34.12 rows=200 width=12)
>          ->  HashAggregate  (cost=32.12..34.12 rows=200 width=12)
>                Group Key: lower.f2
>                ->  Seq Scan on subselect_tbl lower  (cost=0.00..27.70 rows=1770 width=12)

That's about what I thought: it's unique-ifying according to the original
semijoin qual, without realizing that the pulled-up clause from the lower
WHERE would need to be treated as part of the semijoin qual.  This isn't
a bug in the existing code, because the case can never arise, since we
don't treat an IN/=ANY as a semijoin if the sub-select contains any
outer-level Vars.  But if you remove that check from
convert_ANY_sublink_to_join then you've got to deal with the problem.

That said, I'm not too sure where the problem is in detail.  I'd have
thought that subquery pullup would stick the subquery's WHERE clause
into the join quals of the parent JoinExpr node.  Is that not happening,
or is it perhaps not sufficient to cue the UniquePath machinery?

> The additional information is so useful, I'm all for committing
> that patch.

Will do.

regards, tom lane




Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?

2013-12-12 Thread Robert Haas
On Thu, Dec 12, 2013 at 8:20 AM, Simon Riggs  wrote:
> On 12 December 2013 12:27, Andres Freund  wrote:
>> On 2013-12-11 08:13:18 -0500, Robert Haas wrote:
>>> On Wed, Dec 11, 2013 at 7:41 AM, Andres Freund  
>>> wrote:
>>> > There's already a couple of SQL function dealing with XLogRecPtrs and
>>> > the logical replication work will add a couple of more. Currently each
>>> > of those funtions taking/returning an LSN does sprintf/scanf to
>>> > print/parse the strings. Which both is awkward and potentially
>>> > noticeable performancewise.
>>> >
>>> > It seems relatively simple to add a proper type, with implicit casts
>>> > from text, instead?
>>>
>>> I'm pretty sure that this was discussed last year, and I voted for it
>>> but more people
>>> voted against it, so it died.  I still think that was a mistake, but I
>>> just work here.
>>
>> Ah. I missed or forgot that discussion.
>
> Hmm, don't recall that. Just in case I opposed it, it's a good idea now.

I am happy to have my old patch resurrected - could become a trend.
But someone should probably go back and check who objected and for
what reasons.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] In-Memory Columnar Store

2013-12-12 Thread Merlin Moncure
On Thu, Dec 12, 2013 at 4:02 AM, knizhnik  wrote:
> On 12/12/2013 11:42 AM, Pavel Stehule wrote:
>
> it is an interesting idea. For me, the significant information is in the
> comparison: it shows we are doing something significantly wrong. A memory
> engine should naturally be faster, but I don't think it can be 1000x.
>
>
> Sorry, but I didn't fabricate these results.
> Below is just a snapshot from my computer:
>
>
> postgres=# select DbItem_load();
>  dbitem_load
> -
>  998
> (1 row)
>
> postgres=# \timing
> Timing is on.
> postgres=# select cs_used_memory();
>  cs_used_memory
> 
>  4441894912
> (1 row)
>
> postgres=# select agg_val,cs_cut(group_by,'c22c30c10') from
>  (select (cs_project_agg(ss1.*)).* from
>(select (s1).sum/(s2).sum,(s1).groups from DbItem_get() q,
> cs_hash_sum(q.score*q.volenquired,
> q.trader||q.desk||q.office) s1,
>  cs_hash_sum(q.volenquired, q.trader||q.desk||q.office) s2)
> ss1) ss2;
>  agg_val  |   cs_cut
> --+
>  1.50028393511844 | ("John Coltrane","New York Corporates","New York")
> 
> Time: 506.125 ms
>
> postgres=# select sum(score*volenquired)/sum(volenquired) from DbItem group
> by (trader,desk,office);
> ...
> Time: 449328.645 ms
> postgres=# select sum(score*volenquired)/sum(volenquired) from DbItem group
> by (trader,desk,office);
> ...
> Time: 441530.689 ms
>
> Please notice that the time of the second execution is almost the same as
> the first, although all the data can fit in cache!
>
> Certainly it was interesting for me to understand the reason for such bad
> performance.
> And I found out two things:
>
> 1.
>  select sum(score*volenquired)/sum(volenquired) from DbItem group by
> (trader,desk,office);
> and
>  select sum(score*volenquired)/sum(volenquired) from DbItem group by
> trader,desk,office;
>
> are not the same queries (which is hard for a C programmer to understand :)
> And the first one executes significantly slower.
>
> 2. It is not enough to increase the "shared_buffers" parameter in
> postgresql.conf.
> "work_mem" is also very important. When I increased it to 1GB from the
> default of 1MB, the query execution time was reduced to
> 7107.146 ms. So the real difference is ten times, not 1000 times.

Yeah.  It's not fair to compare vs an implementation that is
constrained to use only 1MB.  For analytics work a huge work_mem is a
pretty typical setting.  A 10x improvement is believable considering
you've removed all MVCC overhead, locking, buffer management, etc. and
have a simplified data structure.

merlin




Re: [HACKERS] Time-Delayed Standbys

2013-12-12 Thread Robert Haas
On Thu, Dec 12, 2013 at 9:52 AM, Tom Lane  wrote:
> Simon Riggs  writes:
>> On 12 December 2013 11:05, Andres Freund  wrote:
>>>> My suggestion would be to add the TZ to the checkpoint record. This
>>>> way all users of WAL can see the TZ of the master and act accordingly.
>>>> I'll do a separate patch for that.
>
>>> Intuitively I'd say that might be useful - but I am not really sure what
>>> for. And we don't exactly have a great interface for looking at a
>>> checkpoint's data. Maybe add it to the control file instead?
>
>> That's actually what I had in mind, I just phrased it badly in mid-thought.
>
> I don't think you realize what a can of worms that would be.  There's
> no compact representation of "a timezone", unless you are only proposing
> to store the UTC offset; and frankly I'm not particularly seeing the point
> of that.

+1.  I can see the point of storing a timestamp in each checkpoint
record, if we don't already, but time zones should be completely
irrelevant to this feature.  Everything should be reckoned in seconds
since the epoch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-12 Thread Robert Haas
On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund  wrote:
> I think there'll always be a bit of a difference between slots for
> physical and logical data, even if 90% of the implementation is the
> same. We can signal that difference by specifying logical/physical as an
> option or having two different sets of commands.
>
> Maybe?
>
> ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
> ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
> -- already exists without slot, PHYSICAL arguments
> START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
> START_REPLICATION SLOT LOGICAL slot plugin_options
> RELEASE_REPLICATION_SLOT slot_name

I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options,
but basically this seems OK to me.  I hadn't realized that the options
were going to be different for logical vs. physical.  So you could
also do ACQUIRE_LOGICAL_SLOT, ACQUIRE_PHYSICAL_SLOT,
START_REPLICATION, START_LOGICAL_REPLICATION, and RELEASE_SLOT.  I'm
not sure whether that's better.

>> It also strikes me that just as it's possible to stream WAL without
>> allocating a slot first (since we don't at present have slots),
>> perhaps it ought also to be possible to stream logical replication
>> data without acquiring a slot first.  You could argue that it was a
>> mistake not to introduce slots in the first place, but the stateless
>> nature of WAL streaming definitely has some benefits, and it's unclear
>> to me why you shouldn't be able to do the same thing with logical
>> decoding.
>
> I think it would be quite a bit harder for logical decoding. The
> difference is that, from the perspective of the walsender, for plain WAL
> streaming, all that needs to be checked is whether the WAL is still
> there. For decoding though, we need to be sure that a) the catalog xmin
> is still low enough and has been all along b) that we are able to
> instantly build a historical mvcc snapshot from the point we want to start
> streaming.
> Both a) and b) are solved by keeping the xmin and the point where to
> reread WAL from in the slot data and by serializing data about
> historical snapshots to disk. But those are removed if there isn't a
> slot around requiring them...
>
> So what you could get is something that starts streaming you changes
> sometime after you asked it to start streaming, without a guarantee that
> you can restart at exactly the position you stopped. If that's useful,
> we can do it, but I am not sure what the usecase would be?

I haven't yet looked closely at the snapshot-building stuff, but my
thought is that you ought to be able to decode any transactions that
start after you make the connection.  You might not be able to decode
transactions that are already in progress at that point, because you
might have already missed XID assignment records, catalog changes,
etc. that they've performed.  But transactions that begin after that
point ought to be OK.  I have a feeling you're going to tell me it
doesn't work like that, but maybe it should, because there's a whole
lot of benefit in having decoding start up quickly, and a whole lot of
benefit also to having the rules for that be easy to understand.

Now if you have that, then I think ad-hoc decoding is potentially
useful.  Granted, you're not going to want to build a full-fledged
replication solution that way, but you might want to just connect and
watch the world stream by... or you might imagine an application that
opens a replication connection and a regular connection, copies a
table, and then applies the stream of changes made to that table after
the fact.  When that completes, the table is sync'd between the two
machines as of the end of the copy.  Useful enough to bother with?  I
don't know.  But not obviously useless.

> I am also open to different behaviour for the SRF, but I am not sure
> what that could be. There's just no sensible way to stream data on the
> SQL level afaics.

I don't have a problem with the behavior.  Seems useful.  One useful
addition might be to provide an option to stream out up to X changes
but without consuming them, so that the DBA can peek at the
replication stream.  I think it's a safe bet DBAs will want to do
things like that, so it'd be nice to make it easy, if we can.

> What about pg_decoding_slot_get_[binary_]changes()?

Sounds about right, but I think we need to get religion about figuring
out what terminology to use.  At the moment it seems to vary quite a
bit between "logical", "logical decoding", and "decoding".  Not sure
how to nail that down.

As a more abstract linguistic question, what do we think the
difference is between logical *replication* and logical *decoding*?
Are they the same or different?  If different, how?

> I wonder if we should let the output plugin tell us whether it will
> output data in binary? I think it generally would be a good idea to let
> the output plugin's _init() function return some configuration
> data. That will make extending the interface to support mo

Re: [HACKERS] Time-Delayed Standbys

2013-12-12 Thread Tom Lane
Simon Riggs  writes:
> On 12 December 2013 11:05, Andres Freund  wrote:
>>> My suggestion would be to add the TZ to the checkpoint record. This
>>> way all users of WAL can see the TZ of the master and act accordingly.
>>> I'll do a separate patch for that.

>> Intuitively I'd say that might be useful - but I am not really sure what
>> for. And we don't exactly have a great interface for looking at a
>> checkpoint's data. Maybe add it to the control file instead?

> That's actually what I had in mind, I just phrased it badly in mid-thought.

I don't think you realize what a can of worms that would be.  There's
no compact representation of "a timezone", unless you are only proposing
to store the UTC offset; and frankly I'm not particularly seeing the point
of that.

regards, tom lane




Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-12 Thread Kevin Grittner
Tom Lane  wrote:

> Hm, that means there's only one grouping column (and it's the second
> tlist entry of the child plan node).  So that seems conclusive that
> the unique-ification is being done wrong.

Further confirmation using the EXPLAIN patch with Antonin's v2
patch against the table before any EXPLAIN or ANALYZE:

 Hash Join  (cost=37.12..80.40 rows=442 width=12)
   Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = lower.f2))
   ->  Seq Scan on subselect_tbl upper  (cost=0.00..27.70 rows=1770 width=16)
   ->  Hash  (cost=34.12..34.12 rows=200 width=12)
         ->  HashAggregate  (cost=32.12..34.12 rows=200 width=12)
               Group Key: lower.f2
               ->  Seq Scan on subselect_tbl lower  (cost=0.00..27.70 rows=1770 width=12)

The additional information is so useful, I'm all for committing
that patch.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ANALYZE sampling is too good

2013-12-12 Thread Florian Pflug
Here's an analysis of Jeff Janes' simple example of a table where our
n_distinct estimate is way off.

On Dec11, 2013, at 00:03 , Jeff Janes  wrote:
> create table baz as select floor(random()*1000), md5(random()::text) from 
> generate_series(1,1);
> create table baz2 as select * from baz order by floor;
> create table baz3 as select * from baz order by md5(floor::text);
> 
> baz unclustered, baz2 is clustered with perfect correlation, baz3 is 
> clustered but without correlation.
> 
> After analyzing all of them:
> 
> select tablename, n_distinct,correlation  from pg_stats where tablename  like 
> 'baz%' and attname='floor'  ;
>  tablename | n_distinct  | correlation
> ---+-+-
>  baz   | 8.56006e+06 |  0.00497713
>  baz2  |  333774 |   1
>  baz3  |  361048 |  -0.0118147

I think I understand what's going on here. I worked with a reduced test case
of 1e7 rows containing random values between 0 and 5e5 and instrumented
analyze to print the raw ndistinct and nmultiple values of the sample
population (i.e. the number of distinct values in the sample, and the number
of distinct values which appeared more than once). I've also considered only
baz and baz2, and thus removed the then-unnecessary md5 column. To account for
the reduced table sizes, I adjusted default_statistics_target to 10 instead of
100. The resulting estimates are then

 tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------
 baz       |           391685 |           500000
 baz2      |            36001 |           500000

ANALYZE assumes that both tables contain 1e7 rows and samples 3000 of
those.

The sample of baz contains 2989 distinct values, 11 of which appear more than
once. The sample of baz2 contains 2878 distinct values, 117 (!) of which
appear more than once.

The very different results stem from the Duj1 estimator we use. It estimates
n_distinct by computing n*d/(n - f1 + f1*n/N) where n is the number of
samples, N the number of rows, d the number of distinct samples, and f1 the
number of distinct samples occurring exactly once. If all samples are unique
(i.e. n=d=f1) this yields N. But if f1 is less than d, the result drops very
quickly - sampling baz2 produces 117 non-unique values out of 2878 - roughly
4% - and the estimate is already less than 1/10 of what it would be if f1
were 0.
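
As a cross-check, plugging the sampled numbers into the estimator
reproduces the estimates in the table above (self-contained snippet, the
helper name is mine):

#include <stdio.h>

/* Duj1: n*d / (n - f1 + f1*n/N), as described above. */
static double
duj1(double n, double N, double d, double f1)
{
    return n * d / (n - f1 + f1 * n / N);
}

int
main(void)
{
    /* baz:  d = 2989, 11 values non-unique, so f1 = 2978 */
    printf("baz:  %.0f\n", duj1(3000, 1e7, 2989, 2989 - 11));
    /* baz2: d = 2878, 117 values non-unique, so f1 = 2761 */
    printf("baz2: %.0f\n", duj1(3000, 1e7, 2878, 2878 - 117));
    return 0;
}

This prints 391685 and 36001, matching the estimates above.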

Which leaves us with the question why sampling baz2 produces more duplicate
values than sampling baz does. Now, if we used block sampling, that behaviour
would be unsurprising - baz2 is sorted, so each block very likely contains
each value more than once, since the row count exceeds the number of possible
values by more than an order of magnitude. In fact, with block sampling, we'd
probably see f1 values close to 0 and thus our estimate of n_distinct would be
roughly equal to the number of distinct values in the *sample* population,
i.e. about 300 or so.

So why does Vitter's algorithm fail here? Given that we see inflated numbers
of duplicate values in our sample, yet still far fewer than block-based
sampling would yield, my guess is that it's the initial reservoir filling that
bites us here. After that initial filling step, the reservoir contains a lot of
consecutive rows, i.e. a block-based sample taken from just a few blocks. If
the replacement phase that follows somehow uses a slightly smaller replacement
probability than it should, more of these rows will survive replacement,
resulting in exactly the kind of inflated numbers of duplicate values we're
seeing. I've yet to validate this by looking at the reservoir before and after
the replacement stage, though.
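
For reference, the fill-then-replace structure in question looks like this
(textbook Algorithm R; the actual code uses Vitter's more refined variant
to skip rows cheaply, but the two phases are the same):

#include <stdlib.h>

static void
reservoir_sample(const int *rows, int N, int *sample, int n)
{
    int         i;

    /*
     * Phase 1: fill the reservoir with the first n rows.  For a clustered
     * table these initial rows are highly correlated.
     */
    for (i = 0; i < n; i++)
        sample[i] = rows[i];

    /*
     * Phase 2: each later row i replaces a random reservoir slot with
     * probability n/(i+1).  Any bias towards keeping the initial rows
     * would inflate the number of duplicates, as speculated above.
     */
    for (i = n; i < N; i++)
    {
        int         j = rand() % (i + 1);

        if (j < n)
            sample[j] = rows[i];
    }
}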

So at least for the purpose of estimating n_distinct, our current sampling
method seems to exhibit the worst rather than the best properties of block-
and row-based sampling. What conclusions to draw from that, I'm not sure yet -
other than that if we move to block-based sampling, we'll certainly have to
change the way we estimate n_distinct.

best regards,
Florian Pflug





Re: [HACKERS] SSL: better default ciphersuite

2013-12-12 Thread Marko Kreen
On Thu, Dec 12, 2013 at 01:33:57PM +0100, Magnus Hagander wrote:
> On Thu, Dec 12, 2013 at 11:30 AM, Marko Kreen  wrote:
> > On Wed, Dec 11, 2013 at 10:08:44PM -0500, Tom Lane wrote:
> > I know that SChannel SSL library in Windows XP (and earlier) is such
> > RC4+3DES only implementation, but I have not heard about anything
> > using it to connect to Postgres.
> >
> > Also I have not heard about any Postgres clients actually allowing
> > to configure ciphers, so my impression all client libraries
> > use defaults, which usually prefer AES anyway.
> >
> 
> I don't know, but I would assume that npgsql which sit on top of dotnet,
> would sit on top of schannel in the end.

Probably yes.

> That said, this is XP and earlier, right? Newer versions of Windows have
> better defaults?

Yes, since Vista it supports AES:

  
http://msdn.microsoft.com/en-us/library/windows/desktop/ff468651%28v=vs.85%29.aspx

> > So my new proposal would be to pick one from following defaults:
> >
> > 1) HIGH:+3DES:!aNULL - disables RC4, orders 3DES last.
> >
> > 2) HIGH:MEDIUM:+3DES:!aNULL - no suite changes from current one,
> >except 3DES is ordered last.
> >
> > +3DES reorders already picked 3DES suites to the end.  As my
> > impression is that no clients ever have actually used 3DES,
> > it's fine to use !3DES there.  It's clearer too.  But if max
> > compatibility is goal, then +3DES is better.
> >
> > It's not as nice and simple as I hoped though. :(
> >
> 
> We definitely want to explain in a comment next to the default value the
> exact reasoning behind the default value, I think. That doesn't mean people
> will understand it, but it means they at least have a chance.
> 
> That said, #2 seems reasonable I think, but I wouldn't object to #1 either
> :)

Although I don't think that making .NET users on XP use 3DES is a big
problem, there is also no urgent need to drop RC4, as long as all
reasonable alternatives are used first.

Attached patch changes default ciphersuite to HIGH:MEDIUM:+3DES:!aNULL
and also adds documentation about reasoning for it.  (I tried to be
consistent there about "cipher" vs. "cipher suite", not sure whether
I succeeded.)

Summary: this default, together with the previous server-order-first patch,
makes sure that MEDIUM ciphers are never picked accidentally (e.g. with the
bad default in Java 6), but only when explicitly requested or when the only
alternative is 3DES.

-- 
marko

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8896988..2755f55 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -876,12 +876,62 @@ include 'filename'
   
   

-Specifies a list of SSL ciphers that are allowed to be
+Specifies a list of SSL cipher suites that are allowed to be
 used on secure connections.  See
 the ciphers manual page
 in the OpenSSL package for the syntax of this setting
-and a list of supported values.  The default value is usually
-reasonable, unless you have specific security requirements.
+and a list of supported values.  The default value is
+HIGH:MEDIUM:+3DES:!aNULL.  It is usually reasonable,
+unless you have specific security requirements.
+   
+   
+Explanation for default value:
+
+ 
+  HIGH
+  
+   
+Cipher suites that use ciphers from HIGH section.
+(AES, Camellia, 3DES)
+   
+  
+ 
+ 
+  MEDIUM
+  
+   
+Cipher suites that use ciphers from MEDIUM section.
+(RC4, SEED)
+   
+  
+ 
+ 
+  +3DES
+  
+   
+OpenSSL default order for HIGH is problematic as it orders 3DES
+higher than AES128.  This is wrong as 3DES offers less security than AES128
+and it is also much slower.  +3DES reorders it after all other
+HIGH and MEDIUM ciphers.
+   
+  
+ 
+ 
+  !aNULL
+  
+   
+Disables anonymous cipher suites that do no authentication.
+Such cipher suites are vulnerable to a "man in the middle"
+attack and so should not be used.
+   
+  
+ 
+
+Available cipher suite details will vary across OpenSSL versions, these values
+are taken from OpenSSL 1.0.1.  Use command
+openssl ciphers -v 'HIGH:MEDIUM:+3DES:!aNULL'
+to see actual details for currently installed OpenSSL version.
+Note that this list is filtered on runtime based on server key type.

   
  
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e69e132..0ad02e8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3156,7 +3156,7 @@ static struct config_string ConfigureNamesString[] =
 		},
 		&SSLCipherSuites,
 #ifdef USE_SSL
-		"DEFAULT:!LOW:!EXP:!MD

[HACKERS] pg_stat_statements shows too short COMMIT time

2013-12-12 Thread MauMau

Hello,

I've found a problem with pg_stat_statements while doing some performance 
test.  If the fix is desired and not difficult, I'm willing to address it. 
Could you give me any information and/or your opinions?



[Problem]
The time of COMMIT statements is unreasonably short.  pg_stat_statements' 
result shows that the average COMMIT time is 2 us (calls=200, total_time=0.4 
ms).  It doesn't include most of the commit processing, including WAL flush.



[Cause]
pg_stat_statements measures the time taken in ProcessUtility().  However, 
ProcessUtility() only updates some transaction state variables.  The actual 
COMMIT processing, CommitTransactionCommand(), is called from 
finish_xact_command() in postgres.c.  This is not measured.
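
To illustrate, the call path for a COMMIT looks roughly like this, as far
as I understand it (simplified):

exec_simple_query("COMMIT")
    PortalRun()
        ProcessUtility()             <- pg_stat_statements times only this;
            EndTransactionBlock()       it merely flips transaction state
    finish_xact_command()
        CommitTransactionCommand()   <- the real commit work, including
                                        the WAL flush, happens here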



[Question]
Is this an expected behavior to be kept as now?
If no, can this be fixed easily (and how)?  Is the actual commit processing 
out of the scope of pg_stat_statements?



Regards
MauMau






Re: [HACKERS] Optimize kernel readahead using buffer access strategy

2013-12-12 Thread Simon Riggs
On 12 December 2013 13:43, Mitsumasa KONDO  wrote:

>> Your tests seem to relate to pgbench. Do we have tests on more BI related
>> tasks?
>
> Yes, of course!  We will need another benchmark test before reaching a
> conclusion on this patch.
> What kind of benchmark do you have in mind?

I suggest isolating SeqScan and IndexScan and BitmapIndex/HeapScan
examples, as well as some of the simpler TPC-H queries.

But start with some SeqScan and VACUUM cases.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Time-Delayed Standbys

2013-12-12 Thread Mitsumasa KONDO
2013/12/12 Simon Riggs 

> On 12 December 2013 10:42, KONDO Mitsumasa
>  wrote:
>
> >> I agree with your request here, but I don't think negative values are
> >> the right way to implement that, at least it would not be very usable.
> >
> > I think that my proposal is the easiest and simplest way to solve this
> > problem. And I believe that someone who cannot calculate the difference
> > in time zones won't be setting up a replication cluster across continents.
> >
> >
> >> My suggestion would be to add the TZ to the checkpoint record. This
> >> way all users of WAL can see the TZ of the master and act accordingly.
> >> I'll do a separate patch for that.
> >
> > It is also useful for other situations. However, it might lead to long
> > and complicated discussions... I think that our hope is to commit this
> > patch in this commit-fest or the final one.
>
> Agreed on no delay for the delay patch, as shown by my commit.

Our forecast was very accurate...
Nice commit, Thanks!

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


Re: [HACKERS] Optimize kernel readahead using buffer access strategy

2013-12-12 Thread Mitsumasa KONDO
2013/12/12 Simon Riggs 

> On 14 November 2013 12:09, KONDO Mitsumasa
>  wrote:
>
> > For your information on the effect of this patch: I got pgbench results
> > for an in-memory-size database and an out-of-memory-size database, with
> > the postgresql.conf settings we always use. It seems to improve
> > performance. And I think that this feature is going to be necessary for
> > the business intelligence support which will be realized in PostgreSQL
> > version 10. I seriously believe in Simon's presentation at PostgreSQL
> > Conference Europe 2013! It was very exciting!!!
>
> Thank you.
>
> I like the sound of this patch, sorry I've not been able to help as yet.
>
> Your tests seem to relate to pgbench. Do we have tests on more BI related
> tasks?
>
Yes, of course!  We will need another benchmark test before reaching a
conclusion on this patch.
What kind of benchmark do you have in mind?

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?

2013-12-12 Thread Simon Riggs
On 12 December 2013 12:27, Andres Freund  wrote:
> On 2013-12-11 08:13:18 -0500, Robert Haas wrote:
>> On Wed, Dec 11, 2013 at 7:41 AM, Andres Freund  
>> wrote:
>> > There's already a couple of SQL function dealing with XLogRecPtrs and
>> > the logical replication work will add a couple of more. Currently each
>> > of those funtions taking/returning an LSN does sprintf/scanf to
>> > print/parse the strings. Which both is awkward and potentially
>> > noticeable performancewise.
>> >
>> > It seems relatively simple to add a proper type, with implicit casts
>> > from text, instead?
>>
>> I'm pretty sure that this was discussed last year, and I voted for it
>> but more people
>> voted against it, so it died.  I still think that was a mistake, but I
>> just work here.
>
> Ah. I missed or forgot that discussion.

Hmm, don't recall that. Just in case I opposed it, it's a good idea now.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Optimize kernel readahead using buffer access strategy

2013-12-12 Thread Simon Riggs
On 14 November 2013 12:09, KONDO Mitsumasa
 wrote:

> For your information on the effect of this patch: I got pgbench results for
> an in-memory-size database and an out-of-memory-size database, with the
> postgresql.conf settings we always use. It seems to improve performance.
> And I think that this feature is going to be necessary for the business
> intelligence support which will be realized in PostgreSQL version 10. I
> seriously believe in Simon's presentation at PostgreSQL Conference Europe
> 2013! It was very exciting!!!

Thank you.

I like the sound of this patch, sorry I've not been able to help as yet.

Your tests seem to relate to pgbench. Do we have tests on more BI related tasks?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] SSL: better default ciphersuite

2013-12-12 Thread Magnus Hagander
On Thu, Dec 12, 2013 at 11:30 AM, Marko Kreen  wrote:

> On Wed, Dec 11, 2013 at 10:08:44PM -0500, Tom Lane wrote:
> > Peter Eisentraut  writes:
> > > Any other opinions on this out there?  All instances of other
> > > SSL-enabled servers out there, except nginx, default to some variant of
> > > DEFAULT:!LOW:... or HIGH:MEDIUM:  The proposal here is essentially
> > > to disable MEDIUM ciphers by default, which is explicitly advised
> > > against in the Postfix and Dovecot documentation, for example.
> >
> > Doesn't seem like a great idea then.
>
> First, if there is explicit wish to keep RC4/SEED in play, I'm fine
> with "HIGH:MEDIUM:!aNULL" as new default.  Clarity-wise, it's still
> much better than current value.  And this value will result *exactly*
> same list in same order as current value.
>
>
> But please don't look to the SMTP/IMAP world for sane defaults;
> their situation is quite different:
>
> * Lots of old and crappy mail clients around (usually named Outlook).
> * They use aNULL ciphers for "opportunistic" SSL.
> * They use aNULL ciphers because of CA-less certs.
>
> If you need aNULL enabled anyway, there is no point in "secure" ciphers.
>

Yeah, aNULL is definitely pointless.


> I assume that if left to its own
> > devices, PG presently selects some MEDIUM-level cipher by default?  If
> so,
> > it sounds like this change amounts to imposing a performance penalty for
> > SSL connections by fiat.  On the other hand, if we select a HIGH cipher
> by
> > default, then aren't we just refusing to let clients who explicitly ask
> > for a MEDIUM cipher have one?  Either way, I'd want to see a pretty darn
> > airtight rationale for that, and there sure isn't one in this thread
> > so far.
>
> Performance penalty can happen for clients that support only RC4 and 3DES,
> and prefer RC4 currently.  3DES is a slow cipher, so falling back to it
> would be noticeable.
>
> According to ssllabs.com, Java 6 does prefer RC4, but it would fall back
> to AES, which is fast cipher.  Java 7 prefers AES.
>
> OpenSSL has always used AES, so no change there.
>
> I know that SChannel SSL library in Windows XP (and earlier) is such
> RC4+3DES only implementation, but I have not heard about anything
> using it to connect to Postgres.
>
> Also I have not heard about any Postgres clients actually allowing
> to configure ciphers, so my impression all client libraries
> use defaults, which usually prefer AES anyway.
>

I don't know, but I would assume that npgsql, which sits on top of dotnet,
would sit on top of schannel in the end.

That said, this is XP and earlier, right? Newer versions of Windows have
better defaults?


> The part of the patch that removes @STRENGTH seems plausible, though,
> > if Marko is correct that that's effectively overriding a hand-tailored
> > ordering.
> >
> > In the end I wonder why our default isn't just "DEFAULT".  Anybody who
> > thinks that's an inappropriate default should be lobbying the OpenSSL
> > folk, not us, I should think.
>
> It's easy to see why - then every Postgres admin who wants SSL *must*
> configure the suite.  The "ALL:!aNULL:!eNULL" default is clearly
> a library default, which does not know in what situation it is used,
> geared at max compatibility.
>
> It's an especially bad choice because it will look fine to people
> unfamiliar with OpenSSL internal nuances.  As seen in this thread.
>
> My goal is to have default which will be secure by default,
> and give rough impression what it is about.
>
> -
>
> Hm, looking at Java suites, I notice a problem that is due to bug
> in OpenSSL default cipher order - it orders 3DES before AES128.
>
> So if we give priority to server cipher order and client does
> not accept AES256 (like Java 6/7), 3DES will be picked.
>
> This is indeed bug that should be fixed on OpenSSL side, but seems
> we cannot wait for their fix...
>
> So my new proposal would be to pick one from following defaults:
>
> 1) HIGH:+3DES:!aNULL - disables RC4, orders 3DES last.
>
> 2) HIGH:MEDIUM:+3DES:!aNULL - no suite changes from current one,
>except 3DES is ordered last.
>
> +3DES reorders already picked 3DES suites to the end.  As my
> impression is that no clients ever have actually used 3DES,
> it's fine to use !3DES there.  It's clearer too.  But if max
> compatibility is goal, then +3DES is better.
>
> It's not as nice and simple as I hoped though. :(
>

We definitely want to explain in a comment next to the default value the
exact reasoning behind the default value, I think. That doesn't mean people
will understand it, but it means they at least have a chance.

That said, #2 seems reasonable I think, but I wouldn't object to #1 either
:)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] RFC: programmable file format for postgresql.conf

2013-12-12 Thread Álvaro Hernández Tortosa



On 09/12/13 18:00, Robert Haas wrote:

On Fri, Dec 6, 2013 at 10:28 PM, Álvaro Hernández Tortosa  wrote:

 I think both could be used a lot, editing directly a rich configuration
file or using a GUI tool. I'm trying to suggest supporting both.


I don't really understand how changing the file format fixes anything.
  You could make the file an INI file or an XML file and it would still
be hard to edit programmatically, not because the current format is
"hard to parse" in any meaningful sense, but because there's no way
for a program to know how to make changes while preserving the
comments.  For example, suppose the user tries to set work_mem to 4MB.


	Thanks for your detailed explanation, Robert. I think that since the 
comments are the problem, they should be part of the data structure that 
holds the parameter (setting). That way comments would be easily 
parseable, not for a INI file (which doesn't allow for these kind of 
data structures) but definitely for XML (note that I'm not suggesting to 
use XML).





The only kind of change that I see as possibly helpful is some format
that explicitly marks which comments go with which settings.  For
example, suppose we did this:


<comment>
work_mem
</comment>
<comment>
min 64kB
</comment>


If you want to set the value, you remove the comment tags around it.
And if you want to comment on the value, you can put whatever you like
within the comment tags.  Now, you've got a machine-editable format,
assuming that people keep their comments in the <comment> section and
not inside actual SGML comments.

But that's ugly and overly verbose, so meh.


	I agree that what you suggested is a machine-editable format, so I 
think it's great. I would not care about SGML comments, though. If this 
is for programs to use too, I see no problem with the "verbosity" of 
having all the parameters uncommented with all their associated 
metainformation.


	However, you think it's ugly and verbose. It definitely is (especially 
XML; I'd go a different route), but as I said in a previous email: if it 
helps regular postgresql users because (1) it makes it easier to create 
config tools, and (2) it also gives them much more information on how to 
configure manually, why not sacrifice that verbosity? Is it that bad?




Generally I don't regard trying to tinker with postgresql.conf as a
useful way to spend time.  Many people have strong and sometimes
conflicting feelings about it, making getting any consensus of any
change almost impossible.  And while I'm sure some die-hard will


	I completely understand that. In order to explore whether the approach 
I'm suggesting works or not, I'm going to work on a POC of a sample 
configuration file, structured in the way I have been describing, and a 
GUI and CLI tool (POC!) to use it. I'll get back to the list with it, to 
check whether it may make any sense.


Thanks!

aht

--
Álvaro Hernández Tortosa


---
NOSYS
Networked Open SYStems




Re: [HACKERS] should we add a XLogRecPtr/LSN SQL type?

2013-12-12 Thread Andres Freund
On 2013-12-11 08:13:18 -0500, Robert Haas wrote:
> On Wed, Dec 11, 2013 at 7:41 AM, Andres Freund  wrote:
> > There's already a couple of SQL function dealing with XLogRecPtrs and
> > the logical replication work will add a couple of more. Currently each
> > of those funtions taking/returning an LSN does sprintf/scanf to
> > print/parse the strings. Which both is awkward and potentially
> > noticeable performancewise.
> >
> > It seems relatively simple to add a proper type, with implicit casts
> > from text, instead?
> 
> I'm pretty sure that this was discussed last year, and I voted for it
> but more people
> voted against it, so it died.  I still think that was a mistake, but I
> just work here.

Ah. I missed or forgot that discussion. The primary argument seemed to
be that we were only talking about a single function using it. I
don't think that really was true back then, but there definitely are
some more uses coming up. E.g. the changeset extraction SRF will return
the LSN of every extracted change...
I don't really buy the argument that it can wholly be replaced by
representing LSNs as numeric - those look different enough that that
doesn't seem to make much sense to me. Also, they are a
pass-by-reference type...
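
To make the overhead concrete, every such function currently does a
round-trip along these lines (a simplified sketch, not the actual backend
code):

#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Parse "16/B374D848" style text into an LSN; returns 0 on success. */
static int
lsn_in(const char *str, XLogRecPtr *lsn)
{
    unsigned int hi,
                 lo;

    if (sscanf(str, "%X/%X", &hi, &lo) != 2)
        return -1;
    *lsn = ((XLogRecPtr) hi << 32) | lo;
    return 0;
}

/* And print it back out in the same format. */
static void
lsn_out(XLogRecPtr lsn, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%X/%X",
             (unsigned int) (lsn >> 32), (unsigned int) lsn);
}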

> -- except for the implicit casts part, perhaps --

I'd like to convert some existing functions to use the new type, and
without added casts that seems too likely to break existing usages. If
we just consistently use the new type everywhere, I can't see the usual
troubles with the added casts.

On 2013-12-11 10:13:51 -0300, Euler Taveira wrote:
> On 11-12-2013 09:41, Andres Freund wrote:
> While discussing pg_xlog_location_diff function, Robert posted a lsn
> datatype [1]. At that time we wouldn't go that far (a new datatype) to
> cover only one function. If your proposal is just validation, I think
> generic validation functions is the way to follow. However, if you are
> thinking in adding operators, the lsn datatype should be implemented.

I am mostly thinking of returning many such datums - representing them
as text is just pointless overhead. But even if it just were validation
- why sprinkle validation over dozens of places if we actually have a
typesystem to handle that kind of thing?

> > It seems relatively simple to add a proper type, with implicit casts
> > from text, instead?
> Do you want to change the function signatures too?

Yes.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] RFC: programmable file format for postgresql.conf

2013-12-12 Thread Álvaro Hernández Tortosa



On 09/12/13 18:26, Greg Stark wrote:

On Sat, Dec 7, 2013 at 3:28 AM, Álvaro Hernández Tortosa  wrote:

"Right now, writing such a tool in a generic way gets so bogged down
just in parsing/manipulating the postgresql.conf file that it's hard to
focus on actually doing the tuning part."


That was in 2008.  I don't think that stance is accurate anymore.


 Just for me to learn about this: why is it not accurate anymore?


This topic has been under active discussion for the last five years. I
strongly recommend going back and skimming over the past discussions
before trying to pick it up again. In particular go look up the
discussion of SET PERSISTENT


	Thanks, Greg. I've been going through those threads, they are quite 
interesting. I didn't find an answer, though, about my question: why 
parsing the postgresql.conf (and for instance preserving the comments 
while writing it back) is no longer a problem. I read about ways of 
mitigating this (such as the include facility and so on) but I still 
find parsing the file as hard as before. Nonetheless, I think this adds 
nothing to what we're talking about, so I'll skip this :)




Since we have include files now you can just generate an
auto-tune.conf and not try to parse or write the main config file.

The reason previous efforts got bogged down in parsing/manipulating
the postgresql.conf file was purely because they were trying to allow
you to edit the file by hand and mix that with auto generated config.



	Just IMO, it would be great if a config file allowed for both use 
cases: that both tools and users could seamlessly edit it at will. But of 
course YMMV.


Regards,

aht

--
Álvaro Hernández Tortosa


---
NOSYS
Networked Open SYStems




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-12 Thread Andres Freund
Hello Robert,

On 2013-12-11 22:29:46 -0500, Robert Haas wrote:
> On Wed, Dec 4, 2013 at 7:15 PM, Andres Freund  wrote:
> > There's basically three major 'verbs' that can be performed on a
> > stream, currently named (walsender names):
> > * INIT_LOGICAL_REPLICATION "name" "output_plugin"
> > * START_LOGICAL_REPLICATION "name" last_received ("option_name" value,...)
> > * FREE_LOGICAL_REPLICATION "name"
> >
> > The SQL variant currrently has:
> > * init_logical_replication(name, plugin)
> > * start_logical_replication(name, stream_upto, options[])
> > * stop_logical_replication(name)
> >
> > You might have noticed the slight inconsistency...
> 
> I think this naming is probably not the greatest.

Completely agreed.

> When I hear "init",
> I don't think "permanently allocate a resource that will never be
> released unless I explicitly throw it away", and when I hear "stop", I
> don't think "free that resource".  I suggest naming based around
> create/drop, register/unregister, or acquire/release.  Since, as
> previously noted, I'm gunning for these slots to apply to ordinary
> replication as well, I kind of like ACQUIRE_REPLICATION_SLOT and
> RELEASE_REPLICATION_SLOT.  If we're going to make them specific to
> logical replication, then your suggestion of CREATE_DECODING_SLOT and
> DROP_DECODING_SLOT, or maybe ACQUIRE_DECODING_SLOT and
> RELEASE_DECODING_SLOT, sounds fine.

I think there'll always be a bit of a difference between slots for
physical and logical data, even if 90% of the implementation is the
same. We can signal that difference by specifying logical/physical as an
option or having two different sets of commands.

Maybe?

ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
-- already exists without slot, PHYSICAL arguments
START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
START_REPLICATION SLOT LOGICAL slot plugin_options
RELEASE_REPLICATION_SLOT slot_name

> It also strikes me that just as it's possible to stream WAL without
> allocating a slot first (since we don't at present have slots),
> perhaps it ought also to be possible to stream logical replication
> data without acquiring a slot first.  You could argue that it was a
> mistake not to introduce slots in the first place, but the stateless
> nature of WAL streaming definitely has some benefits, and it's unclear
> to me why you shouldn't be able to do the same thing with logical
> decoding.

I think it would be quite a bit harder for logical decoding. The
difference is that, from the perspective of the walsender, for plain WAL
streaming, all that needs to be checked is whether the WAL is still
there. For decoding though, we need to be sure that a) the catalog xmin
is still low enough and has been all along, and b) that we are able to
instantly build a historical MVCC snapshot from the point where we want
to start streaming.
Both a) and b) are solved by keeping the xmin and the WAL position to
reread from in the slot data, and by serializing data about historical
snapshots to disk. But those are removed if there isn't a slot around
that requires them...

So what you could get is something that starts streaming changes to you
sometime after you asked it to start, without a guarantee that you can
restart at exactly the position where you stopped. If that's useful, we
can do it, but I am not sure what the use case would be?
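
For concreteness, the slot-bound lifecycle at the SQL level currently looks
roughly like this (a sketch only: the function names and OUT columns are
taken from this thread, the plugin name and the 'now' position argument are
assumptions, and the signatures are still in flux):

    SELECT * FROM init_logical_replication('regression_slot', 'test_decoding');
    SELECT location, xid, data
      FROM start_logical_replication('regression_slot', 'now', ARRAY[]::text[]);
    SELECT stop_logical_replication('regression_slot');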

> > b) Decide which of the SQL functions should be in a contrib module, and
> >which in core. Currently init_logical_replication() and
> >stop_logical_replication() are in core, whereas
> >start_logical_replication() is in the 'test_logical_decoding'
> >extension. The reasoning behind that is that init/stop ones are
> >important to the DBA and the start_logical_replication() SRF isn't
> >all that useful in the real world because our SRFs don't support
> >streaming changes out.
> 
> Seems pretty arbitrary to me.  If start_logical_replication() is
> usable with any output plugin, then it ought to be in core.

Ok, I certainly have no problem with that.

> I think
> the name isn't great, though; the actual functionality seems to be
> more or less decode-from-last-position-up-to-present, which doesn't
> sound much like "start".

The name originally comes from the START_REPLICATION walsender command,
but I agree, there's not much point in trying to keep that name.

I am also open to different behaviour for the SRF, but I am not sure
what that could be. There's just no sensible way to stream data on the
SQL level afaics.

What about pg_decoding_slot_get_[binary_]changes()?

> > c) Which data-types does start_logical_replication() return. Currently
> > it's OUT location text, OUT xid bigint, OUT data text. Making the 'data'
> > column text has some obvious disadvantages though - there are obvious
> > use cases for output plugins that return binary data. But making it bytea
> > sucks, because the output is harder to read by default...

> I think havi

Re: [HACKERS] COPY table FROM STDIN doesn't show count tag

2013-12-12 Thread Rajeev rastogi
On 12th December 2013, Rajeev Rastogi Wrote:
>On 9th December, Amit Khandelkar wrote:

>>1.  slashcopyissuev1.patch :- This patch fixes the \COPY issue.
>>You have removed the if condition in this statement, mentioning that it is 
>>always true now:
>>-   if (copystream == pset.cur_cmd_source)
>>-   pset.lineno++;
>>+   pset.lineno++;
>>
> >But copystream can be different than pset.cur_cmd_source , right ?

>As per the earlier code, the condition was always true, so pset.lineno was 
>always incremented.
>In the earlier code pset.cur_cmd_source was passed as a parameter to the 
>function, and inside the function that same parameter was used under the 
>name copystream; so on entry to this function the two are one and the same.
>I checked inside handleCopyIn: neither of these variables changes before 
>the above check. Also, since pset is specific to a single session, it 
>cannot change concurrently.
>Please let me know if I am missing something.

>>+   FILE   *copyStream; /* Stream to read/write for copy 
>>command */
>>
>>There is no tab between FILE and *copystream, hence it is not aligned.

>OK. I shall change accordingly.

I ran pgindent on settings.h and no issue was reported, so a tab does not seem 
to be mandatory between the type and the variable declaration.
The other members of the psqlSettings structure also use spaces rather than 
tabs, so no change seems to be required here. Please confirm.


>>2.  initialcopyissuev1_ontopofslashcopy.patch : Fix for "COPY table FROM 
>>STDIN/STDOUT doesn't show count tag".
>>The following header comments of ProcessResult() need to be modified:
>>* Changes its argument to point to the last PGresult of the command string,
>>* or NULL if that result was for a COPY FROM STDIN or COPY TO STDOUT.

>OK. I shall change accordingly.
I have changed it in the latest patch.

>>Regression results show all passed.
>>Other than this, the patch needs a new regression test.

>I checked the existing regression test cases and observed that all the 
>relevant kinds of cases are already covered, such as:
>copystdin,
>copystdout,
>\copy.stdin
>\copy.stdout.

>But since the regression framework runs in quiet (-q) mode, it does not 
>show any messages except the query output.
>So our new code change does not affect the regression framework.

>Please let me know if you were expecting any other test cases?
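
For reference, the user-visible effect of the fix is that an interactive
session now prints the command tag, along these lines (a sketch; the table
name is invented):

    postgres=# COPY mytable FROM STDIN;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 1
    >> 2
    >> \.
    COPY 2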

Summary of two patches:

1.  slashcopyissuev1.patch :- No change in this patch (same as earlier).

2.  Initialcopyissuev2_ontopofslashcopy.patch : This patch is modified to 
change the comment as per the above review comments.

Please provide your opinion or let me know if any other changes are required.

Thanks and Regards,
Kumar Rajeev Rastogi




slashcopyissuev1.patch
Description: slashcopyissuev1.patch


initialcopyissuev2_ontopofslashcopy.patch
Description: initialcopyissuev2_ontopofslashcopy.patch



Re: [HACKERS] Time-Delayed Standbys

2013-12-12 Thread Simon Riggs
On 12 December 2013 11:05, Andres Freund  wrote:

>> My suggestion would be to add the TZ to the checkpoint record. This
>> way all users of WAL can see the TZ of the master and act accordingly.
>> I'll do a separate patch for that.
>
> Intuitively I'd say that might be useful - but I am not really sure what
> for. And we don't exactly have a great interface for looking at a
> checkpoint's data. Maybe add it to the control file instead?

That's actually what I had in mind, I just phrased it badly in mid-thought.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] autovacuum_work_mem

2013-12-12 Thread Simon Riggs
On 11 December 2013 22:23, Peter Geoghegan  wrote:
> On Wed, Dec 11, 2013 at 1:28 PM, Simon Riggs  wrote:
>> But nobody has given a sensible answer to my questions, other than to
>> roll out the same general points again. In practice, its a knob that
>> does not do very much of use for us. At best it is addressing the
>> symptoms, not the cause. IMHO.
>
> It's just a usability feature. It lessens the intellectual overhead of
> managing maintenance_work_mem. I understand that it isn't very
> impressive from a technical point of view. However, in many
> environments, it actually will make a significant difference, because
> non-autovacuum maintenance operations are very rare compared to
> autovacuum workers vacuuming, and therefore I can now afford to be
> much less conservative in setting maintenance_work_mem globally on
> each of 8 Postgres instances hosted on a single box. These are
> precisely the kinds of Postgres instances where users are very
> unlikely to have heard of maintenance_work_mem at all. These users are
> not even using an admin tool in many cases, preferring to rely on ORM
> migrations.  Having said that, it's also something I'll find useful on
> a day to day basis, because it's a chore to set maintenance_work_mem
> manually, and sometimes I forget to do so, particularly when under
> pressure to relieve a production performance issue on a random
> customer's database.
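
To make that concrete, a hedged postgresql.conf sketch (the values are purely
illustrative; autovacuum_work_mem bounds each autovacuum worker separately,
falling back to maintenance_work_mem when unset):

    maintenance_work_mem = 1GB      # manual VACUUM, CREATE INDEX, etc.
    autovacuum_work_mem = 128MB     # cap for each autovacuum worker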

My view remains that, yes, we have a problem setting maintenance_work_mem
usefully; my conclusion is that having two parameters we don't know how
to set won't improve things, and doesn't constitute an improvement in
usability.

That being said, I acknowledge everybody else's viewpoints here and
will commit this feature as is.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [bug fix] pg_ctl always uses the same event source

2013-12-12 Thread MauMau

Hi, Amit san,

From: "Amit Kapila" 

[elog.c]
Writing the default value in this file was redundant, because 
event_source cannot be NULL.  So change ...


I think this change might not be safe, as write_eventlog() gets called
from write_stderr(), which might get called before postgresql.conf is
read; so the code as it exists in PG might be there to handle that case
or some other similar situation where event_source is still not set.
Have you confirmed, in every way, that event_source is always set when
control reaches this function?


Yes, you are right.  I considered this again after replying to your previous 
mail, and found that write_stderr() calls write_eventlog().  In that case, 
event_source is still NULL, because GUC initialization has not happened yet.


Please find attached the patch.
I put the default event source name in one place -- pg_config_manual.h, 
which seems the best place.  This eliminates the duplicated default value 
in four places: pgevent.c, pg_ctl.c, elog.c, and guc.c.  Thanks to your 
review and comments, the code got cleaner and more correct.  Thank you very much.


Regards
MauMau


pg_ctl_eventsrc_v2.patch
Description: Binary data



Re: [HACKERS] Time-Delayed Standbys

2013-12-12 Thread Andres Freund
On 2013-12-12 09:09:21 +, Simon Riggs wrote:
> >> * Add functionality (I propose)
> >> We can set a negative number for min_standby_apply_delay. I think that
> >> this feature is for worldwide replication situations. For example, the
> >> master server is in Japan and the slave server is in San Francisco;
> >> Japan time is ahead of San Francisco time. And if we want to delay in
> >> this situation, it can require a negative value.

> > This is because local time is recorded in XLOG, and calculating a
> > global time has a big cost.

Uhm? Isn't the timestamp in commit records actually a TimestampTz? And
thus essentially stored as UTC? I don't think this problem actually
exists?
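
For reference, a TimestampTz is stored as an absolute instant (microseconds
since 2000-01-01 00:00:00 UTC) and is only rendered in a zone on output; a
quick illustration (the timestamps are invented):

    SET timezone = 'Asia/Tokyo';
    SELECT now();   -- 2013-12-12 20:00:00+09
    SET timezone = 'America/Los_Angeles';
    SELECT now();   -- 2013-12-12 03:00:00-08, the very same instant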

> My suggestion would be to add the TZ to the checkpoint record. This
> way all users of WAL can see the TZ of the master and act accordingly.
> I'll do a separate patch for that.

Intuitively I'd say that might be useful - but I am not really sure what
for. And we don't exactly have a great interface for looking at a
checkpoint's data. Maybe add it to the control file instead?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Time-Delayed Standbys

2013-12-12 Thread Simon Riggs
On 12 December 2013 10:42, KONDO Mitsumasa
 wrote:

>> I agree with your request here, but I don't think negative values are
>> the right way to implement that, at least it would not be very usable.
>
> I think that my proposal is the easiest and simplest way to solve this
> problem. And I believe that anyone who cannot calculate the time-zone
> difference won't be setting up a replication cluster across continents.
>
>
>> My suggestion would be to add the TZ to the checkpoint record. This
>> way all users of WAL can see the TZ of the master and act accordingly.
>> I'll do a separate patch for that.
>
> It would also be useful in other situations. However, it might lead to
> long and complicated discussions... I think our hope is to commit this
> patch in this commitfest or the next, final one.

Agreed on no delay for the delay patch, as shown by my commit.

Still think we need better TZ handling.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Optimize kernel readahead using buffer access strategy

2013-12-12 Thread KONDO Mitsumasa

(2013/12/12 9:30), Claudio Freire wrote:

On Wed, Dec 11, 2013 at 3:14 AM, KONDO Mitsumasa
 wrote:



enable_readahead=os|fadvise

with os = on, fadvise = off


Hmm. fadvise is a method, not a purpose. So I am considering another idea for
this GUC.


Yeah, I was thinking of opening the door for readahead=aio, but
anything clearer than on-off would work ;)


I'm very interested in Postgres with libaio, and I'd like to see the performance 
improvements. I'm not sure about libaio, however; it may face an 
exclusive-buffer-lock problem with asynchronous IO.
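
A possible shape for the GUC under discussion (purely illustrative; neither
the name nor the value set is settled):

    enable_readahead = 'fadvise'    # issue posix_fadvise() hints ourselves
    #enable_readahead = 'os'        # leave readahead to kernel heuristics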


Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-12 Thread Dimitri Fontaine
Tom Lane  writes:
>> Yeah, I found myself wishing for an EXPLAIN option that would show
>> that.
>
> It's not hard to do ... how about the attached?

+1

> I chose to print grouping keys for both Agg and Group nodes, and to
> show them unconditionally.  There's some case maybe for only including
> them in verbose mode, but since sort keys are shown unconditionally,
> it seemed more consistent this way.

+1
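
For illustration, with grouping keys shown the plan would read along these
lines (a sketch against the regression-test table tenk1; the exact label is
up to the final patch):

    EXPLAIN (COSTS OFF) SELECT two, count(*) FROM tenk1 GROUP BY two;
            QUERY PLAN
    ----------------------------
     HashAggregate
       Group Key: two
       ->  Seq Scan on tenk1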

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



