Re: [HACKERS] Tracking wait event for latches

2016-09-21 Thread Michael Paquier
On Thu, Sep 22, 2016 at 6:49 AM, Thomas Munro
 wrote:
> On Thu, Sep 22, 2016 at 1:23 AM, Robert Haas  wrote:
>> Moreover, it's pretty confusing that we have this general concept of
>> wait events in pg_stat_activity, and then here the specific type of
>> wait event we're waiting for is the ... wait event kind.  Uh, what?
>
> Yeah, that is confusing.  It comes about because of the coincidence
> that pg_stat_activity finished up with a wait_event column, and our IO
> multiplexing abstraction finished up with the name WaitEventSet.
> We could instead call these new things "wait
> points", because, well, erm, they name points in the code at which we
> wait.  They don't name events (they just happen to use the
> WaitEventSet mechanism, which itself does involve events: but those
> events are things like "hey, this socket is now ready for
> writing").

What about first trying to come up with a class name generic enough to
cover the set of events we are tracking? One idea is to simply call the
class "Process" (or "Point", or "Poll"), and mention in the docs that
this can wait for a socket, a timeout, postmaster death or another
process to tell it to move on. In a first patch we would not detail
what a process is waiting for at a given point, unlike what has been
submitted. This would still be helpful for the user, because it would
be possible to look back at the code and guess where it is waiting.

Then come the interesting bits: I don't think that it is a good idea
to associate only one event with a wait point in the system views. Say
that at a wait point WaitLatch is called with WL_POSTMASTER_DEATH and
WL_TIMEOUT; I'd rather let the user know that both of them have been
set up than choose just one of them using some hierarchy. In short, we
should list all of them depending on the WL_* flags set up by the
caller. Here comes a second patch: let's use the last byte of the wait
events and add a new text[] column in pg_stat_activity, say
wait_details. So for a given wait point it would be possible to tell
the user that it is waiting on '{socket,timeout,postmaster_death}', for
example.
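
To illustrate the idea (this is only a mock-up; wait_details does not
exist yet and the values shown are invented):

SELECT pid, wait_event_type, wait_details
  FROM pg_stat_activity
 WHERE wait_details IS NOT NULL;

 pid  | wait_event_type |           wait_details
------+-----------------+-----------------------------------
 4242 | Process         | {socket,timeout,postmaster_death}
(1 row)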

Then comes a third patch: the addition of a system view to list all the
activity of processes that do not depend on max_connections.
pg_stat_activity has the advantage, as Tom mentioned a couple of days
ago, that one can just run count(*) on it to estimate the number of
connection slots left, which I think is quite helpful. Or we could just
add a column in pg_stat_activity to mark the process type: is it a
system process or not? This deserves its own discussion for sure.
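
For example (the column name and value below are only illustrative):

-- today: rough estimate of the connection slots in use
SELECT count(*) FROM pg_stat_activity;
-- with a hypothetical process-type marker, system processes could be
-- filtered in or out explicitly
SELECT count(*) FROM pg_stat_activity WHERE proc_type = 'client backend';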

Patches 2 and 3 are independent things. Patch 1 is a requirement for 2 and 3.

>> Another thing to think about is that there's no way to actually see
>> wait event information for a number of the processes which this patch
>> instruments, because they don't appear in pg_stat_activity.
>
> Good point.  Perhaps there could be another extended view, or system
> process view, or some other mechanism.  That could be material for a
> separate patch?

Definitely a separate patch..
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-09-21 Thread Ashutosh Bapat
Hi Amit,
The following sequence of DDLs gets an error:
--
-- multi-leveled partitions
--
CREATE TABLE prt1_l (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_l_p1 PARTITION OF prt1_l FOR VALUES START (0) END
(250) PARTITION BY RANGE (b);
CREATE TABLE prt1_l_p1_p1 PARTITION OF prt1_l_p1 FOR VALUES START (0) END (100);
CREATE TABLE prt1_l_p1_p2 PARTITION OF prt1_l_p1 FOR VALUES START
(100) END (250);
CREATE TABLE prt1_l_p2 PARTITION OF prt1_l FOR VALUES START (250) END
(500) PARTITION BY RANGE (c);
CREATE TABLE prt1_l_p2_p1 PARTITION OF prt1_l_p2 FOR VALUES START
('0250') END ('0400');
CREATE TABLE prt1_l_p2_p2 PARTITION OF prt1_l_p2 FOR VALUES START
('0400') END ('0500');
CREATE TABLE prt1_l_p3 PARTITION OF prt1_l FOR VALUES START (500) END
(600) PARTITION BY RANGE ((b + a));
ERROR:  cannot use column or expression from ancestor partition key

The last statement is trying to create subpartitions by RANGE ((b + a)),
an expression which contains a column from the ancestor's partition key
but is not exactly the same as that key; in fact it also references a
column other than the ancestor partition key columns. Why do we want to
prohibit such cases?

On Thu, Sep 15, 2016 at 2:23 PM, Amit Langote
 wrote:
>
> Hi
>
> On 2016/09/09 17:55, Amit Langote wrote:
>> On 2016/09/06 22:04, Amit Langote wrote:
>>> Will fix.
>>
>> Here is an updated set of patches.
>
> An email from Rajkumar somehow managed to break out of this thread.
> Quoting his message below so that I don't end up replying with patches on
> two different threads.
>
> On 2016/09/14 16:58, Rajkumar Raghuwanshi wrote:
>> I have Continued with testing declarative partitioning with the latest
>> patch. Got some more observation, given below
>
> Thanks a lot for testing.
>
>> -- Observation 1 : Getting overlap error with START with EXCLUSIVE in range
>> partition.
>>
>> create table test_range_bound ( a int) partition by range(a);
>> --creating a partition to contain records {1,2,3,4}, by default 1 is
>> inclusive and 5 is exclusive
>> create table test_range_bound_p1 partition of test_range_bound for values
>> start (1) end (5);
>> --now trying to create a partition by explicitly mentioning start is
>> exclusive to contain records {5,6,7}, here trying to create with START with
>> 4 as exclusive so range should be 5 to 8, but getting partition overlap
>> error.
>> create table test_range_bound_p2 partition of test_range_bound for values
>> start (4) EXCLUSIVE end (8);
>> ERROR:  partition "test_range_bound_p2" would overlap partition
>> "test_range_bound_p1"
>
> Wow, this is bad.  What is needed in this case is "canonicalization" of
> the range partition bounds specified in the command.  Range types do this
> and hence an equivalent test done with range type values would disagree
> with the result given by the patch for range partition bounds.
>
> select '[1,5)'::int4range && '(4,8]'::int4range as cmp;
>  cmp
> -----
>  f
> (1 row)
>
> In this case, the second range is converted into its equivalent canonical
> form viz. '[5, 9)'.  Then comparison of bounds 5) and [5 can tell that the
> ranges do not overlap after all.  Range type operators can do this because
> their code can rely on the availability of a canonicalization function for
> a given range type.  From the range types documentation:
>
> """
> A discrete range type should have a canonicalization function that is
> aware of the desired step size for the element type. The canonicalization
> function is charged with converting equivalent values of the range type to
> have identical representations, in particular consistently inclusive or
> exclusive bounds. If a canonicalization function is not specified, then
> ranges with different formatting will always be treated as unequal, even
> though they might represent the same set of values in reality.
> """
>
> to extend the last sentence:
>
> "... or consider two ranges overlapping when in reality they are not
> (maybe they are really just adjacent)."
>
> Within the code handling range partition bounds, no such canonicalization
> happens, so comparing 5) and (4 ends up concluding that upper1 > lower2,
> hence that the ranges overlap.
>
> To mitigate this, how about we restrict range partition key to contain
> columns of only those types for which we know we can safely canonicalize a
> range bound (ie, discrete range types)?  I don't think we can use, say,
> existing int4range_canonical but will have to write a version of it for
> partitioning usage (range bounds of partitions are different from what
> int4range_canonical is ready to handle).  This approach will be very
> limiting as then range partitions will be limited to columns of int,
> bigint and date type only.
>
> One more option is we let the user specify the canonicalize function next
> to the column name when defining the partition key.  If not specified, we
> hard-code one for the types for which we will be implementing a
> canonicalize function (ie, above mentioned types).  In other cases, we
> just don't have on

Re: [HACKERS] Showing parallel status in \df+

2016-09-21 Thread Rushabh Lathia
I agree with the argument in this thread; having "Source code" as part
of \df+ is a bit annoying, specifically when the output involves some
really big PL functions. Having it separate does make the \df+ output
more readable. So I would vote for \df++ rather than adding the source
code as part of the footer for \df+.

Personally I don't like the idea of keeping "source code" for
C/internal functions as part of \df+ and moving the others out of it.
If we really want to move "source code" out of \df+, then it should be
consistent, irrespective of language. So maybe remove "source code"
completely from \df+ and add \df++ support for the "source code".



On Wed, Sep 7, 2016 at 12:14 AM, Pavel Stehule 
wrote:

> Hi
>
> 2016-09-06 0:05 GMT+02:00 Tom Lane :
>
>> I wrote:
>> > Pavel Stehule  writes:
>> >> Using footer for this purpose is little bit strange. What about
>> following
>> >> design?
>> >> 1. move out source code of PL functions from \df+
>> >> 2. allow not unique filter in \sf and allow to display multiple
>> functions
>>
>> > Wasn't that proposed and rejected upthread?
>>
>> So ... why did you put this patch in "Waiting on Author" state?  AFAIK,
>> we had dropped the idea of relying on \sf for this, mainly because
>> Peter complained about \df+ no longer providing source code.  I follow
>> his point: if you're used to using \df+ to see source code, you probably
>> can figure it out quickly if that command shows the source in a different
>> place than before.  But if it doesn't show it at all, using \sf instead
>> might not occur to you right away.
>>
>
> I see only one situation where I want to see more than one source code:
> checking overloaded functions. There I prefer to see the complete source
> code, in \sf format. But I don't remember when I did that last, so I can
> live without it well.
>
> I am thinking there is strong agreement about reducing the \df+ result. I
> am not sure about the usability of showing source code in the footer. It
> is not very readable, and the fact that the function's body is not
> displayed as a CREATE statement makes the result less readable.
>
> Now I am thinking that using the footer for this purpose is not a great
> idea; maybe we can live better without it (without the source code of PL
> functions in the \df+ result; I would like to see only C function source
> there). If you like using the footer, then should the format be changed
> to be more consistent and readable? I am not sure how it can be enhanced.
>
> Regards
>
> Pavel
>
>
>>
>> regards, tom lane
>>
>
>


-- 
Rushabh Lathia


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-21 Thread Amit Kapila
On Thu, Sep 22, 2016 at 8:51 AM, Jeff Janes  wrote:
> On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila 
> wrote:
>>
>> On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes  wrote:
>> > On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila 
>> > wrote:
>> >>
>> >>
>> >> Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
>> >> that, I have changed _hash_alloc_buckets() to initialize the page
>> >> instead of making it completely Zero because of problems discussed in
>> >> another related thread [1].  I have also updated README.
>> >>
>> >
>> > with v7 of the concurrent has patch and v4 of the write ahead log patch
>> > and
>> > the latest relcache patch (I don't know how important that is to
>> > reproducing
>> > this, I suspect it is not), I once got this error:
>> >
>> >
>> > 38422  0 2016-09-19 16:25:50.055 PDT:LOG:  database system was
>> > interrupted; last known up at 2016-09-19 16:25:49 PDT
>> > 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
>> > properly shut down; automatic recovery in progress
>> > 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at
>> > 3F/2200DE90
>> > 38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification
>> > failed,
>> > calculated checksum 65067 but expected 21260
>> > 38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at
>> > 3F/22053B50
>> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>> > 38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9
>> > of
>> > relation base/16384/17334
>> > 38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at
>> > 3F/22053B50
>> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>> >
>> >
>> > The original page with the invalid checksum is:
>> >
>>
>> I think this is an example of the torn page problem, which seems to be
>> happening because of the below code in your test.
>>
>> ! if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
>> !RecoveryInProgress()) {
>> !   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
>> ! ereport(FATAL,
>> ! (errcode(ERRCODE_DISK_FULL),
>> !  errmsg("could not write block %u of relation %s: wrote only %d of %d
>> bytes",
>> ! blocknum,
>> ! relpath(reln->smgr_rnode, forknum),
>> ! nbytes, BLCKSZ),
>> !  errhint("JJ is screwing with the database.")));
>> ! } else {
>> !   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
>> ! }
>>
>> If you are running the above test with JJ_torn_page disabled, then it
>> is a different matter and we need to investigate it, but I assume you
>> are running with it enabled.
>>
>> I think this could happen if the actual change in page is in 2/3 part
>> of page which you are not writing in above code.  The checksum in page
>> header which is written as part of partial page write (1/3 part of
>> page) would have considered the actual change you have made whereas
>> after restart when it again read the page to apply redo, the checksum
>> calculation won't include the change being made in 2/3 part.
>
>
> Correct.  But any torn page write must be covered by the restoration of a
> full page image during replay, shouldn't it?  And that restoration should
> happen blindly, without first reading in the old page and verifying the
> checksum.
>

Probably, but I think this is not currently the way it is handled and
I don't want to change it.  AFAIU, what happens now is that we first
read the old page (we do page verification while reading the old page
and emit a *warning* if the checksum doesn't match) and then restore
the image.  The relevant code path is
XLogReadBufferForRedo()->XLogReadBufferExtended()->ReadBufferWithoutRelcache()->ReadBuffer_common()->PageIsVerified().

Now, the point is why we are seeing FATAL instead of WARNING for the
bitmap page of the hash index.  The reason, as explained in my previous
e-mail, is that for the bitmap page we are not logging a full page
image, and we should fix that as explained there.  Once the full page
image is logged, then the first time it reads the torn page it will use
the flag RBM_ZERO_AND_LOCK, which will turn the FATAL error you are
seeing into a WARNING.

I think this is the reason why Ashutosh is seeing a WARNING for a heap
page whereas you are seeing a FATAL for the bitmap page of the hash index.

I don't think we should try to avoid the WARNING for torn pages as part
of this patch, even if that is a better course of action, but the FATAL
should certainly be fixed and a WARNING displayed instead, as is done
for heap pages.

>
>> Today, while thinking on this problem, I realized that currently in the
>> patch we are using REGBUF_NO_IMAGE for the bitmap page for one of the
>> problems reported by you [1].  That change will fix the problem
>> reported by you, but it will expose bitmap pages to torn-page
>> hazards.  I think the right fix there is to make pd_lower equal to
>> pd_upper for the bitmap page, so that full page writes don't exclude the
>> data in the bitmap page.
>
>
> I'm afraid that is over my head.  I can study it until it makes sense, but
> it will take me a w

Re: [HACKERS] 9.6 TAP tests and extensions

2016-09-21 Thread Craig Ringer
On 13 September 2016 at 22:02, Tom Lane  wrote:
> Craig Ringer  writes:
>> While updating an extension for 9.6 I noticed that while the
>> $(prove_check) definition is exposed for use by PGXS in
>> Makefile.global, extensions can't actually use the TAP tests because
>> we don't install the required Perl modules like PostgresNode.pm.

Whoops, I managed to misplace this thread.

>> I don't see any reason not to make this available to extension authors
>> and doing so is harmless, so here's a small patch to install it. I
>> think it's reasonable to add this to 9.6 even at this late stage; IMO
>> it should've been installed from the beginning.
>
> Without taking a position on the merits of this patch per se, I'd like
> to say that I find the argument for back-patching into 9.6 and not
> further than that to be pretty dubious.  $(prove_check) has been there
> since 9.4, and in the past we've often regretted it when we failed
> to back-patch TAP infrastructure fixes all the way back to 9.4.

No objection to backpatching; I just thought it'd be more intrusive to
do that than to target just 9.6.

Since 9.5 and older have more limited versions of PostgresNode which
lack safe_psql, etc, I'm not sure it's very practical for extensions
to bother running TAP tests on 9.4 and 9.5 anyway.

I'd love to be able to, but unless we backport the new src/test/perl
stuff and the changes to the rest of the TAP tests to make them work
with it I don't see it being very useful. Since that's really not
going to fly, I say this should just go in 9.6.

Extension authors can just guard their prove rules with something like:

ifeq ($(MAJORVERSION),9.6)
check:
	$(prove_check)
endif

Not beautiful, but since this wasn't really designed from the start to
work with PGXS I don't see much alternative.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-21 Thread Amit Kapila
On Thu, Sep 22, 2016 at 8:03 AM, Andres Freund  wrote:
> On 2016-09-21 22:23:27 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>> > Sure. But that can be addressed, with a lot less effort than fixing and
>> > maintaining the hash indexes, by adding the ability to do that
>> > transparently using btree indexes + a recheck internally.  How that
>> > compares efficiency-wise is unclear as of now. But I do think it's
>> > something we should measure before committing the new code.
>>
>> TBH, I think we should reject that argument out of hand.  If someone
>> wants to spend time developing a hash-wrapper-around-btree AM, they're
>> welcome to do so.  But to kick the hash AM as such to the curb is to say
>> "sorry, there will never be O(1) index lookups in Postgres".
>
> Note that I'm explicitly *not* saying that. I just would like to see
> actual comparisons being made before investing significant amounts of
> code and related effort being invested in fixing the current hash table
> implementation. And I haven't seen a lot of that.
>

I think it can be deduced from the testing done till now.  Basically,
having an index (btree or hash) on an integer column allows a fair
comparison: the size of the key is the same in both the hash and the
btree index.  In such a case, if we know that the hash index performs
better in certain cases, that is an indication that it will also
perform better than the scheme you are suggesting, because that scheme
adds an extra recheck on top of the btree code, which will further
worsen the case for btree.
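
Something along these lines would be enough for such a comparison (the
table name and data volume are arbitrary; drop one index or use two
identical tables so that the intended index is the one actually used):

CREATE TABLE hash_vs_btree (k integer);
INSERT INTO hash_vs_btree SELECT generate_series(1, 10000000);
CREATE INDEX hvb_btree ON hash_vs_btree USING btree (k);
CREATE INDEX hvb_hash ON hash_vs_btree USING hash (k);
-- then time equality lookups with each index, e.g.
EXPLAIN ANALYZE SELECT * FROM hash_vs_btree WHERE k = 123456;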

>  If the result of that
> comparison is that hash-indexes actually perform very well: Great!
>


>
>> always be superior, I don't see how it follows that we should refuse to
>> commit work that's already been done.  Is committing it somehow going to
>> prevent work on the btree-wrapper approach?
>
> The necessary work seems a good bit from finished.
>

Are you saying this about the WAL patch?  If yes, then even if it is
still away from being in shape to be committed, a lot of effort has
been put into getting it to its current stage and it is not in bad
shape either.  It has survived a lot of testing; there are still some
bugs which we are fixing.

One more thing: don't assume that all the people involved in the
current development of hash indexes, or in further development on
them, will run away once the code is committed, leaving the
responsibility of maintenance to other senior members of the community.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-21 Thread Jeff Janes
On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila 
wrote:

> On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes  wrote:
> > On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila 
> > wrote:
> >>
> >>
> >> Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
> >> that, I have changed _hash_alloc_buckets() to initialize the page
> >> instead of making it completely Zero because of problems discussed in
> >> another related thread [1].  I have also updated README.
> >>
> >
> > with v7 of the concurrent has patch and v4 of the write ahead log patch
> and
> > the latest relcache patch (I don't know how important that is to
> reproducing
> > this, I suspect it is not), I once got this error:
> >
> >
> > 38422  0 2016-09-19 16:25:50.055 PDT:LOG:  database system was
> > interrupted; last known up at 2016-09-19 16:25:49 PDT
> > 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
> > properly shut down; automatic recovery in progress
> > 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at 3F/2200DE90
> > 38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification
> failed,
> > calculated checksum 65067 but expected 21260
> > 38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at
> 3F/22053B50
> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> > 38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9
> of
> > relation base/16384/17334
> > 38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at
> 3F/22053B50
> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> >
> >
> > The original page with the invalid checksum is:
> >
>
> I think this is an example of the torn page problem, which seems to be
> happening because of the below code in your test.
>
> ! if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
> !RecoveryInProgress()) {
> !   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
> ! ereport(FATAL,
> ! (errcode(ERRCODE_DISK_FULL),
> !  errmsg("could not write block %u of relation %s: wrote only %d of %d
> bytes",
> ! blocknum,
> ! relpath(reln->smgr_rnode, forknum),
> ! nbytes, BLCKSZ),
> !  errhint("JJ is screwing with the database.")));
> ! } else {
> !   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
> ! }
>
> If you are running the above test with JJ_torn_page disabled, then it
> is a different matter and we need to investigate it, but I assume you
> are running with it enabled.
>
> I think this could happen if the actual change in page is in 2/3 part
> of page which you are not writing in above code.  The checksum in page
> header which is written as part of partial page write (1/3 part of
> page) would have considered the actual change you have made whereas
> after restart when it again read the page to apply redo, the checksum
> calculation won't include the change being made in 2/3 part.
>

Correct.  But any torn page write must be covered by the restoration of a
full page image during replay, shouldn't it?  And that restoration should
happen blindly, without first reading in the old page and verifying the
checksum.  Failure to restore the page from a FPI would be a bug.  (That
was the purpose for which I wrote this testing harness in the first place,
to verify that the restoration of FPI happens correctly; although most of
the bugs it happens to uncover have been unrelated to that.)



>
> Today, Ashutosh has shared the logs of his test run where he has shown
> similar problem for HEAP page.  I think this could happen though
> rarely for any page with the above kind of tests.
>

I think Ashutosh's examples are of warnings, not errors.   I think the
warnings occur when replay needs to read in the block (for reasons I don't
understand yet) but then doesn't care if it passes the checksum or not
because it will just be blown away by the replay anyway.


> Does this explanation explains the reason of problem you are seeing?
>

If it can't survive artificial torn page writes, then it probably can't
survive real ones either.  So I am pretty sure it is a bug of some sort.
Perhaps the bug is that it is generating an ERROR when it should just be a
WARNING?


>
> >
> > If I ignore the checksum failure and re-start the system, the page gets
> > restored to be a bitmap page.
> >
>
> Okay, but have you ensured in some way that redo is applied to bitmap page?
>


I haven't done that yet.  I can't start the system without destroying the
evidence, and I haven't figured out yet how to import a specific block from
a shut-down server into a bytea of a running server, in order to inspect it
using pageinspect.
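
One possible approach, sketched here assuming block 9 has been
extracted (with dd, for instance) into a file placed under the running
cluster's data directory, since pg_read_binary_file() only accepts
paths relative to it:

SELECT * FROM page_header(
         pg_read_binary_file('suspect_block.bin', 0, 8192));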

> Today, while thinking on this problem, I realized that currently in
> patch we are using REGBUF_NO_IMAGE for bitmap page for one of the
> problem reported by you [1].  That change will fix the problem
> reported by you, but it will expose bitmap pages for torn-page
> hazards.  I think the right fix there is to make pd_lower equal to
> pd_upper for bitmap page, so that full page writes doesn't exclude the
> data in bitmappa

Re: [HACKERS] ICU integration

2016-09-21 Thread Petr Jelinek
On 31/08/16 04:46, Peter Eisentraut wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.
>

Hi, first pass review of the patch (somewhat high level at this point).

First, there was some discussion about ICU versioning, collation
versioning and support for multiple versions of ICU at the same time. My
opinion on this is that the approach you have chosen is good, at least
for a first iteration of the feature. We might want to support multiple
versions of the library in the future (I am not all that convinced of
that though), but we definitely don't need it at the moment. It is
important to have version checking against the data directory, which you
have done.

> 
> What I have done is extend collation objects with a collprovider column
> that tells whether the collation is using POSIX (appropriate name?) or
> ICU facilities.  The pg_locale_t type is changed to a struct that
> contains the provider-specific locale handles.  Users of locale
> information are changed to look into that struct for the appropriate
> handle to use.
> 

This sounds reasonable. I would prefer POSIX to be named something
like SYSTEM though (think Windows).

> In initdb, I initialize the default collation set as before from the
> `locale -a` output, but also add all available ICU locales with a "%icu"
> appended (so "fr_FR%icu").  I suppose one could create a configuration
> option perhaps in initdb to change the default so that, say, "fr_FR"
> uses ICU and "fr_FR%posix" uses the old stuff.

I wonder if we should have both %icu and %posix unconditionally and then
have an option to pick one of them for the list of collations without the
suffix (defaulting to posix for now, but maybe not in the future).
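
In catalog terms, something like this (the names follow that scheme and
the collprovider values are purely illustrative):

SELECT collname, collprovider
  FROM pg_collation
 WHERE collname LIKE 'fr_FR%';

  collname   | collprovider
-------------+--------------
 fr_FR       | p
 fr_FR%icu   | i
 fr_FR%posix | p
(3 rows)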

> 
> That all works well enough for named collations and for sorting.  The
> thread about the FreeBSD ICU patch discusses some details of how to best
> use the ICU APIs to do various aspects of the sorting, so I didn't focus
> on that too much.

That's something for follow-up patches IMHO.

> I'm not sure how well it will work to replace all the bits of LIKE and
> regular expressions with ICU API calls.  One problem is that ICU likes
> to do case folding as a whole string, not by character.  I need to do
> more research about that.  

Can't we use the regular expression API provided by ICU, or is that too
incompatible?

> Another problem, which was also previously
> discussed is that ICU does case folding in a locale-agnostic manner, so
> it does not consider things such as the Turkish special cases.  This is
> per Unicode standard modulo weasel wording, but it breaks existing tests
> at least.
> 

I am quite sure it will break existing use cases as well. I don't know if
that's an issue if we keep multiple locale providers though. It's
expected that you get different behavior from them.

> 
> Where it gets really interesting is what to do with the database
> locales.  They just set the global process locale.  So in order to port
> that to ICU we'd need to check every implicit use of the process locale
> and tweak it.  We could add a datcollprovider column or something.  But
> we also rely on the datctype setting to validate the encoding of the
> database.  Maybe we wouldn't need that anymore, but it sounds risky.

I think we'll need to separate the process locale and the locale used
for string operations.

> 
> We could have a datcollation column that by OID references a collation
> defined inside the database.  With a background worker, we can log into
> the database as it is being created and make adjustments, including
> defining or adjusting collation definitions.  This would open up
> interesting new possibilities.

I don't really follow this but it sounds icky.

> 
> What is a way to go forward here?  What's a minimal useful feature that
> is future-proof?  Just allow named collations referencing ICU for now?

I think that's a minimal useful feature indeed; it will move us forward,
especially on systems where POSIX locales are not that well implemented,
but it will not be a world changer just yet.


> Is throwing out POSIX locales even for the process locale reasonable?
> 

I don't think we can really do that, we'll need some kind of mapping
table between system locales and ICU locales (IIRC we build something
like that on windows as well). Maybe that mapping can be used for the
datctype to process locale conversion (datctype would then be provider
specific).

We also need to keep POSIX locale support anyway, otherwise we'd break
pg_upgrade I think (at least to the point that a pg_upgrade where you have
to reindex the whole database is much less feasible).

> Oh, that case folding code in formatting.c needs some refactoring.
> There are so many ifdefs there and it's repeated almost identically
> three times, it's crazy to work in that.

Yuck, yes it wasn't pretty before and it's quite unreadable now, I think
this patch will have to do something about that.

The message about collate provider version ch

Re: [HACKERS] ICU integration

2016-09-21 Thread Bruce Momjian
On Thu, Sep  8, 2016 at 12:19:39PM -0400, Peter Eisentraut wrote:
> On 9/8/16 11:16 AM, Tom Lane wrote:
> > This is a problem, if ICU won't guarantee cross-version compatibility,
> > because it destroys the argument that moving to ICU would offer us
> > collation behavior stability.
> 
> It would offer a significant upgrade over the current situation.
> 
> First, it offers stability inside the same version.  Whereas glibc might
> change a collation in a minor upgrade, ICU won't do that.  And the
> postgres binary is bound to a major version of ICU by the soname (which
> changes with every major release).  So this would avoid the situation
> that a simple OS update could break collations.

Uh, how do we know that ICU doesn't change collations in minor versions?
Couldn't we get into a case where the OS changes the ICU version or
collations more frequently than glibc does?  Seems that would be a
negative.

I don't see how having our binary bound to a ICU major version gives us
any benefit.

It seems we are still hostage to the OS version.

> Second, it offers a way to detect that something has changed.  With
> glibc, you don't know anything unless you read the source diffs.  With
> ICU, you can compare the collation version before and after and at least
> tell the user that they need to refresh indexes or whatever.

Yes, that is a clear win.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-21 Thread Andres Freund
On 2016-09-21 22:23:27 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Sure. But that can be addressed, with a lot less effort than fixing and
> > maintaining the hash indexes, by adding the ability to do that
> > transparently using btree indexes + a recheck internally.  How that
> > compares efficiency-wise is unclear as of now. But I do think it's
> > something we should measure before committing the new code.
> 
> TBH, I think we should reject that argument out of hand.  If someone
> wants to spend time developing a hash-wrapper-around-btree AM, they're
> welcome to do so.  But to kick the hash AM as such to the curb is to say
> "sorry, there will never be O(1) index lookups in Postgres".

Note that I'm explicitly *not* saying that. I just would like to see
actual comparisons being made before investing significant amounts of
code and related effort being invested in fixing the current hash table
implementation. And I haven't seen a lot of that.  If the result of that
comparison is that hash-indexes actually perform very well: Great!


> always be superior, I don't see how it follows that we should refuse to
> commit work that's already been done.  Is committing it somehow going to
> prevent work on the btree-wrapper approach?

The necessary work seems a good bit from finished.


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-21 Thread Tom Lane
Andres Freund  writes:
> Sure. But that can be addressed, with a lot less effort than fixing and
> maintaining the hash indexes, by adding the ability to do that
> transparently using btree indexes + a recheck internally.  How that
> compares efficiency-wise is unclear as of now. But I do think it's
> something we should measure before committing the new code.

TBH, I think we should reject that argument out of hand.  If someone
wants to spend time developing a hash-wrapper-around-btree AM, they're
welcome to do so.  But to kick the hash AM as such to the curb is to say
"sorry, there will never be O(1) index lookups in Postgres".

It's certainly conceivable that it's impossible to get decent performance
out of hash indexes, but I do not agree that we should simply stop trying.

Even if I granted the unproven premise that use-a-btree-on-hash-codes will
always be superior, I don't see how it follows that we should refuse to
commit work that's already been done.  Is committing it somehow going to
prevent work on the btree-wrapper approach?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] to_date_valid()

2016-09-21 Thread Peter Eisentraut
On 9/9/16 4:48 AM, Andreas 'ads' Scherbaum wrote:
> ValidateDate() will tell you if it's a valid date. But not if the 
> transformation was correct:
> 
> postgres=# SELECT to_date('2011 12  18', 'YYYY MM   DD');
>   to_date
> ------------
>  2011-12-08
> (1 row)
> 
> (with the patch from Artur)
> 
> 
> Any idea how to solve this problem?

I think we need to do overflow checking as we read the various values,
in do_to_timestamp() and functions called from it.  This would be very
similar to overflow checking in input functions.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] System load consideration before spawning parallel workers

2016-09-21 Thread Peter Eisentraut
It seems clear that this patch design is not favored by the community,
so I'm setting the patch as rejected in the CF app.

I think there is interest in managing system resources better, but I
don't know what that would look like.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Relax requirement for INTO with SELECT in pl/pgsql

2016-09-21 Thread Peter Eisentraut
On 9/9/16 2:57 PM, Peter Eisentraut wrote:
> On 6/6/16 9:59 AM, Merlin Moncure wrote:
>> Thanks for the review.  All your comments look pretty reasonable. I'll
>> touch it up and resubmit.
> 
> This is the last email in this thread that the commit fest app shows.  I
> think we are waiting on an updated patch, with docs and tests.

I'm setting this patch to returned with feedback now.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tuplesort merge pre-reading

2016-09-21 Thread Claudio Freire
On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire  wrote:
> On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire  wrote:
>> On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas  wrote:
>>>
>>> Claudio, if you could also repeat the tests you ran on Peter's patch set on
>>> the other thread, with these patches, that'd be nice. These patches are
>>> effectively a replacement for
>>> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
>>> would be much appreciated too, of course.
>>>
>>> Attached are new versions. Compared to last set, they contain a few comment
>>> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
>>> that were completely unused.
>>
>>
>> Will do so
>
> Well, here they are, the results.
>
> ODS format only (unless you've got issues opening the ODS).
>
> The results seem all over the map. Some regressions seem significant
> (both in the amount of performance lost and their significance, since
> all 4 runs show a similar regression). The worst being "CREATE INDEX
> ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
> work_mem, which should be an in-memory sort, which makes it odd.
>
> I will re-run it overnight just in case to confirm the outcome.

A new run for "patched" gives better results; it seems there was some
kind of glitch in the previous run (maybe a cron job decided to do
something while those queries were running).

Attached

In essence, it doesn't look like it's harmfully affecting CPU
efficiency. Results seem neutral on the CPU front.


logtape_preload_timings.ods
Description: application/vnd.oasis.opendocument.spreadsheet

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-21 Thread Andres Freund
On 2016-09-21 19:49:15 +0300, Oskari Saarenmaa wrote:
> 21.09.2016, 15:29, Robert Haas kirjoitti:
> > For PostgreSQL, I expect the benefits of improving hash indexes to be
> > (1) slightly better raw performance for equality comparisons and (2)
> > better concurrency.
> 
> There's a third benefit: with large columns a hash index is a lot smaller on
> disk than a btree index.  This is the biggest reason I've seen people want
> to use hash indexes instead of btrees.  hashtext() btrees are a workaround,
> but they require all queries to be adjusted which is a pain.

Sure. But that can be addressed, with a lot less effort than fixing and
maintaining the hash indexes, by adding the ability to do that
transparently using btree indexes + a recheck internally.  How that
compares efficiency-wise is unclear as of now. But I do think it's
something we should measure before committing the new code.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-21 Thread Jeff Janes
On Wed, Sep 21, 2016 at 12:44 PM, Geoff Winkless 
wrote:

> On 21 September 2016 at 13:29, Robert Haas  wrote:
> > I'd be curious what benefits people expect to get.
>
> An edge case I came across the other day was a unique index on a large
> string: postgresql popped up and told me that I couldn't insert a
> value into the field because the BTREE-index-based constraint wouldn't
> support the size of string, and that I should use a HASH index
> instead. Which, of course, I can't, because it's fairly clearly
> deprecated in the documentation...
>

Yes, this large string issue is why I argued against removing hash indexes
the last couple times people proposed removing them.  I'd rather be able to
use something that gets the job done, even if it is deprecated.

You could use btree indexes over hashes of the strings.  But then you would
have to rewrite all your queries to inject an additional qualification,
something like:

Where value = 'really long string' and md5(value)=md5('really long string').
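
Spelled out as an index plus query, that workaround looks roughly like
this (the table and column names are invented):

CREATE INDEX t_value_md5_idx ON t (md5(value));

SELECT *
  FROM t
 WHERE md5(value) = md5('really long string')  -- can use the index
   AND value = 'really long string';           -- rechecks collisions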

Alas, it still wouldn't support unique indexes.  I don't think you can even
use an exclusion constraint, because you would have to exclude on the hash
value alone, not the original value, and so it would also forbid
false-positive collisions.

There has been discussion to make btree-over-hash just work without needing
to rewrite the queries, but discussions aren't patches...

Cheers,

Jeff


Re: [HACKERS] Tracking wait event for latches

2016-09-21 Thread Thomas Munro
On Thu, Sep 22, 2016 at 1:23 AM, Robert Haas  wrote:
> On Tue, Sep 20, 2016 at 7:13 PM, Thomas Munro
>  wrote:
>> This paragraph seems a bit confused.  I suggest something more like
>> this:  "The server process is waiting for one or more sockets, a timer
>> or an interprocess latch.  The wait happens in a WaitEventSet,
>> PostgreSQL's portable IO multiplexing abstraction."
>
> I'm worried we're exposing an awful lot of internal detail here.

Ok, maybe it's not a good idea to mention WaitEventSet in the manual.
I was trying to keep the same basic structure of text as Michael
proposed, but fix the bit that implied that waiting happens "in a
latch", even when no latch is involved.  Perhaps only the first
sentence is necessary.  This will all become much clearer if there is
a follow-up patch that shows strings like "Socket", "Socket|Latch",
"Latch" instead of a general catch-all wait_event_type of "EventSet",
as discussed.

> Moreover, it's pretty confusing that we have this general concept of
> wait events in pg_stat_activity, and then here the specific type of
> wait event we're waiting for is the ... wait event kind.  Uh, what?

Yeah, that is confusing.  It comes about because of the coincidence
that pg_stat_activity finished up with a wait_event column, and our IO
multiplexing abstraction finished up with the name WaitEventSet.
We could instead call these new things "wait
points", because, well, erm, they name points in the code at which we
wait.  They don't name events (they just happen to use the
WaitEventSet mechanism, which itself does involve events: but those
events are things like "hey, this socket is now ready for
writing").

> I have to admit that I like the individual event names quite a bit,
> and I think the detail will be useful to users.  But I wonder if
> there's a better way to describe the class of events that we're
> talking about that's not so dependent on internal data structures.
> Maybe we could divide these waits into a couple of categories - e.g.
> "Socket", "Timeout", "Process" - and then divide these detailed wait
> events among those classes.

Well we already discussed a couple of different ways to get "Socket",
"Latch", "Socket|Latch", ... or something like that into the
wait_event_type column or new columns.  Wouldn't that be better, and
automatically fall out of the code rather than needing human curated
categories?  Although Michael suggested that that should be done as a
separate patch on top of the basic feature.

> The "SecureRead" and "SecureWrite" wait events are going to be
> confusing, because the connection isn't necessarily secure.  I think
> those should be called "ClientRead" and "ClientWrite".
> Comprehensibility is more important than absolute consistency with the
> C function names.

Devil's advocate mode:  Then why not improve those function names?
Keeping the wait point names systematically in sync with the actual
code makes things super simple and avoids a whole decision process and
discussion to create new user-friendly obfuscation every time anyone
introduces a new wait point.  This is fundamentally an introspection
mechanism allowing expert users to shine a light on the engine and see
what's going on inside it, so I don't see what's wrong with being
straight up about what is actually going on and using the names for
parts of our code that we already have.  If that leads us to improve
some function names I'm not sure why that's a bad thing.

Obviously this gets complicated by the existence of static functions
whose names are ambiguous and lack context, and multiple wait points
in a single function.  Previously I've suggested a hierarchical
arrangement for these names which might help with that.  Imagine names
like: executor.Hash.<function> (reported by a background worker
executing a hypothetical parallel hash join),
executor.Hash.<function>.<n> to disambiguate multiple wait
points in one function, walsender.<function> etc.  That way we could
have a tidy curated meaningful naming scheme based on modules, but a
no-brainer systematic way to name the most numerous leaf bits of that
hierarchy.  Just an idea...
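
With such a scheme, pg_stat_activity could show something like this
(the wait_event values below are invented, just to illustrate the
naming idea):

 pid  | wait_event_type |          wait_event
------+-----------------+------------------------------
 4242 | EventSet        | executor.Hash.MultiExecHash
 4243 | EventSet        | walsender.WalSndLoop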

> Another thing to think about is that there's no way to actually see
> wait event information for a number of the processes which this patch
> instruments, because they don't appear in pg_stat_activity.

Good point.  Perhaps there could be another extended view, or system
process view, or some other mechanism.  That could be material for a
separate patch?

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Better tracking of free space during SP-GiST index build

2016-09-21 Thread Tomas Vondra

On 08/25/2016 03:26 AM, Tomas Vondra wrote:



On 08/25/2016 01:45 AM, Tom Lane wrote:

Over in the thread about the SP-GiST inet opclass, I threatened to
post a patch like this, and here it is.

The basic idea is to track more than just the very latest page
we've used in each of the page categories that SP-GiST works with.
I started with an arrangement that gave an equal number of cache
slots to each category, but soon realized that that was dumb,
because there are usually way more leaf pages than anything else.
So this version has a little table of how many slots to give to
each category. The constants could maybe use a bit more fiddling,
if we have some more test data sets to try this on.

On the IRRExplorer data set we discussed in the other thread, this
reduces the index size from 132MB to 120MB. Poking into that more
closely with pg_filedump, the total free space within the index
drops from 42MB to 28MB. If you think those numbers don't add up,
you're right --- this seems to result in more non-leaf tuples than
before. I'm not sure why; maybe more aggressive sucking up of free
space results in more splits. (Maybe adjustment of the default
spgist fillfactor would be in order to counteract that?) But the
index search time doesn't seem to be hurt, so perhaps there's
nothing to worry about.

As coded, this makes no attempt to preferentially select pages with
the most or least free space. I don't know if it'd be worth any
cycles to do that.

I'll put this in the commitfest queue. It could use review from
someone with the time and motivation to do performance
testing/tuning.





I've been toying with this patch a bit today, particularly looking at 
(1) sizing of the cache, and (2) how beneficial it'd be to choose pages 
from the cache in a smarter way.



(1) sizing of the cache

I wonder why the patch only sets the cache size to 100 items, when we 
might fit many more entries into the ~8kB limit. Sure, it's going to be 
more expensive to maintain the cache, but if it results in a smaller 
index, it might be worth it. I've tried increasing the cache size to 768 
entries, with the vast majority of them (~600) allocated to leaf pages.


Sadly, this seems to only increase the CREATE INDEX duration a bit, 
without making the index significantly smaller (still ~120MB).



(2) page selection

I do believe this is due to the current simple selection of pages from 
the cache - we simply select pages more or less randomly (as long as the 
page has enough free space). My understanding is that this leads to 
pages having roughly the same amount of free space. Subsequently, when 
SpGistSetLastUsedPage() selects a page to evict from the cache, it finds 
pages with roughly the same amount of free space, and even if it picks the 
most full one, it wastes quite a bit of space.


I do think selecting the page with the least free space would save 
additional space. Instead of making SpGistGetBuffer() more complicated, 
I've simply shoved pg_qsort() calls into a few places, sorting the cache 
by free space, with unused slots at the end. Clearly, this is quite 
expensive and a proper patch would have to do it differently (e.g. by 
maintaining the order incrementally), but OTOH it'd allow some 
optimizations in SpGistGetBuffer() - the first page with enough free 
space would have the smallest amount of free space (more or less).


This actually helped a bit, and the index size dropped by ~2MB. So not 
bad, but nowhere close to the initial 132MB -> 120MB improvement.


The following table shows comparison of index sizes, and also the effect 
of fillfactor=100:


master: 132MB
master + fillfactor=100: 124MB

patch: 120MB
patch + fillfactor=100: 109MB

patch + 768 items + selection: 117MB
patch + 768 items + selection + fillfactor=100: 103MB

It's worth mentioning the spgist fillfactor is a bit crude, most likely 
thanks to splits - e.g. the 109MB index still contains ~10MB of free 
space on its pages (measured using pageinspect as upper minus lower), so 
almost 10%. Perhaps it really is time to increase the spgist default 
fillfactor?
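
For reference, that upper-lower free space measurement can be reproduced
with pageinspect along these lines (the index name is a placeholder and
an 8kB block size is assumed):

SELECT pg_size_pretty(sum(upper - lower)) AS free_space
  FROM generate_series(0, pg_relation_size('some_spgist_index') / 8192 - 1) AS blkno,
       LATERAL page_header(get_raw_page('some_spgist_index', blkno::int));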


(3) random comments

It seems the patch keeps new/empty/deleted pages in the cache, and thus 
segregated by type. Is that intentional, or should 
SpGistSetLastUsedPage() remove such pages from the cache? Or maybe move 
them into a special category? It's true we'll reuse those pages, as 
allocNewBuffer() allocates the buffer directly, but those pages are 
unlikely to get evicted from the cache due to their high freeSpace value 
(despite possibly having been reused already).


Similarly for completely full pages (with freeSpace==0) - does it make 
sense to keep them in the cache? Although it's probably harmless, as 
those pages will get evicted first if needed.



Overall, I think the patch is good - it may be possible to improve it in 
the future, but it's a solid improvement.


One thing I'd change is making the SpGistLUPCache dynamic, i.e. storing 
the size and lastUsedPagesMap on the meta page. That should allow us 
resizing the

Re: [HACKERS] Hash Indexes

2016-09-21 Thread Geoff Winkless
On 21 September 2016 at 13:29, Robert Haas  wrote:
> I'd be curious what benefits people expect to get.

An edge case I came across the other day was a unique index on a large
string: postgresql popped up and told me that I couldn't insert a
value into the field because the BTREE-index-based constraint wouldn't
support the size of string, and that I should use a HASH index
instead. Which, of course, I can't, because it's fairly clearly
deprecated in the documentation...


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-21 Thread Robert Haas
On Wed, Sep 21, 2016 at 2:11 PM, Jeff Janes  wrote:
> We are?  I thought we were trying to preserve on-disk compatibility so that
> we didn't have to rebuild the indexes.

Well, that was my initial idea, but ...

> Is the concern that lack of WAL logging has generated some subtle
> unrecognized on disk corruption?

...this is a consideration in the other direction.

> If I were using hash indexes on a production system and I experienced a
> crash, I would surely reindex immediately after the crash, not wait until
> the next pg_upgrade.

You might be more responsible, and more knowledgeable, than our typical user.

>> But is that a good thing to do?  That's a little harder to
>> say.
>
> How could we go about deciding that?  Do you think anything short of coding
> it up and seeing how it works would suffice?  I agree that if we want to do
> it, v10 is the time.  But we have about 6 months yet on that.

Yes, I think some experimentation will be needed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-21 Thread Bruce Momjian
On Wed, Sep 21, 2016 at 08:29:59AM -0400, Robert Haas wrote:
> Of course, if we want to implement clustered indexes, that's going to
> require significant changes to the heap format ... or the ability to
> support multiple heap storage formats.  I'm not opposed to that, but I
> think it makes sense to fix the existing implementation first.

For me, there are several measurements for indexes:

Build time
INSERT / UPDATE overhead
Storage size
Access speed

I am guessing people make conclusions based on their Computer Science
education.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New SQL counter statistics view (pg_stat_sql)

2016-09-21 Thread David Fetter
On Wed, Sep 21, 2016 at 02:05:24PM -0300, Alvaro Herrera wrote:
> Another consideration is that the present patch lumps together all
> ALTER cases in a single counter.  This isn't great, but at the same
> time we don't want to bloat the stat files by having hundreds of
> counters per database, do we?

I count 37 documented versions of ALTER as of git master.  Is there
some multiplier I'm missing?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST penalty functions [PoC]

2016-09-21 Thread Andrew Borodin
Hi Heikki!

> Got a link for a description of the RR*-tree? Would be good to reference it
> in the patch comments, too.
Well, as for now, the best way to reach the paper is
DOI 10.1145/1559845.1559929
http://sci-hub.cc/
The authors, in conversations, clearly stated that they endorse (not sure
that is the correct word) an implementation in PostgreSQL, so I do not
think it's a bad kind of piracy.
A more or less persistent link is http://dl.acm.org/citation.cfm?id=1559929

> If I understand correctly, cases #1 and #3 arise when one of the dimensions
> is 0. For example, in a 3D space, if the existing entry is a rectangle on a
> plane, with zero-length edge on one of the dimensions, and the new entry is
> on the same plane. Case #1 arises if the new entry falls within that
> rectangle, and case #3 if it's outside it. Currently, we treat all such
> cases as 0-penalty, because the degenerate 0-dimension causes all the
> calculated volumes to become zero. So clearly we can do better, which is
> what this patch does.
>
> At first blush, I'm surprised that you switch to using the sum of the edges
> in those cases. I would expect to ignore the degenerate 0-dimension, and
> calculate the volume using the other dimensions. So in a 3D space, you would
> calculate the increase of the area of the rectangle (A*B), not the sum of
> the edges (A+B). And it probably would be best to take into account how many
> of the dimensions are zero. So in a 3D space, if there is an existing line
> segment that the new point falls into, and also a rectangle that the new
> point falls into, you should prefer the 1-dimensional line segment over the
> 2-dimensional rectangle.
>
> I don't know how big a difference that makes in practice. But it seems odd
> that if you e.g. have a set of 3D points, but the Z dimension in all of the
> points is 0, the algorithm behaves differently than if you had the exact
> same points in a 2D space.
>
> (If this is explained in the RR*-tree paper, feel free to just point me to
> that and I'll ask again if I have further questions.)

As far as I know, your version of penalty function degradation is a
new concept for R-trees. I have not seen this idea before.
It is based on two assertions:
1. Degrading algorithms should resemble the general algorithm, if the choice
of the general algorithm is correct.
2. Degradation happens when and only when at least one edge of the MBB has zero size.

The first assertion seems correct, while the second is wrong. When you index
high-dimensional data (say 10 dimensions), you can easily reach 0 by
multiplying values around 10^-4. Such data is often the result of scaling
and normalization in machine learning, where those concepts are natural,
along with high dimensionality.

We can rejig your algorithm: sort the edges in descending order and
multiply them, stopping just before the product reaches zero. Choose-subtree
then picks the tuple by the count of edges multiplied, resolving ties by the
result of the multiplication.

We could overflow with big edges, but the current cube has no more than
100 dimensions. That allows a little more than 1000 for each edge before
overflow (if the multiplication is done in doubles).
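
To make this concrete, here is a rough standalone sketch (function names are
invented, and it is deliberately not tied to the GiST penalty API) of sorting
the edges and multiplying them until just before the product degenerates to
zero; choose-subtree would prefer the entry with the larger count of
multiplied edges, breaking ties by the partial product:

#include <stdlib.h>

/* qsort comparator: sort edge lengths in descending order */
static int
edge_cmp_desc(const void *a, const void *b)
{
    double  da = *(const double *) a;
    double  db = *(const double *) b;

    if (da < db)
        return 1;
    if (da > db)
        return -1;
    return 0;
}

/*
 * Multiply the largest edges first and stop just before the running product
 * would become zero (because of a zero-length edge or double underflow).
 * Returns the number of edges used; *product receives the partial product,
 * to be used for tie-breaking.
 */
static int
restricted_volume(double *edges, int ndims, double *product)
{
    double  prod = 1.0;
    int     used = 0;
    int     i;

    qsort(edges, ndims, sizeof(double), edge_cmp_desc);
    for (i = 0; i < ndims; i++)
    {
        if (prod * edges[i] == 0.0)     /* would degenerate to zero */
            break;
        prod *= edges[i];
        used++;
    }
    *product = prod;
    return used;
}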

Practically, this algorithm cannot be implemented in the current GiST API
(only if we provide a non-penalty-based choose-subtree function, optional
for GiST extensions), but it certainly has scientific value.

Regards, Andrey Borodin.




Re: [HACKERS] patch: function xmltable

2016-09-21 Thread Pavel Stehule
2016-09-18 11:53 GMT+02:00 Pavel Stehule :

> Hi
>
> new update:
>
> * doc is moved to better place - xml processing functions
> * few more regress tests
> * call forgotten check_srf_call_placement
>

another small update - fix the XMLPath parser - support multibyte characters

Regards

Pavel


>
> Regards
>
> Pavel
>


xmltable-9.patch.gz
Description: application/gzip



Re: [HACKERS] Hash Indexes

2016-09-21 Thread Jeff Janes
On Thu, Sep 15, 2016 at 7:13 AM, Robert Haas  wrote:

> On Thu, Sep 15, 2016 at 1:41 AM, Amit Kapila 
> wrote:
> > I think it is possible without breaking pg_upgrade, if we match all
> > items of a page at once (and save them as local copy), rather than
> > matching item-by-item as we do now.  We are already doing similar for
> > btree, refer explanation of BTScanPosItem and BTScanPosData in
> > nbtree.h.
>
> If ever we want to sort hash buckets by TID, it would be best to do
> that in v10 since we're presumably going to be recommending a REINDEX
> anyway.


We are?  I thought we were trying to preserve on-disk compatibility so that
we didn't have to rebuild the indexes.

Is the concern that the lack of WAL logging has generated some subtle,
unrecognized on-disk corruption?

If I were using hash indexes on a production system and I experienced a
crash, I would surely reindex immediately after the crash, not wait until
the next pg_upgrade.



> But is that a good thing to do?  That's a little harder to
> say.
>

How could we go about deciding that?  Do you think anything short of coding
it up and seeing how it works would suffice?  I agree that if we want to do
it, v10 is the time.  But we have about 6 months yet on that.

Cheers,

Jeff


Re: [HACKERS] PL/Python adding support for multi-dimensional arrays

2016-09-21 Thread Dave Cramer
On 18 September 2016 at 09:27, Dave Cramer  wrote:

>
> On 10 August 2016 at 01:53, Pavel Stehule  wrote:
>
>> Hi
>>
>> 2016-08-03 13:54 GMT+02:00 Alexey Grishchenko :
>>
>>> On Wed, Aug 3, 2016 at 12:49 PM, Alexey Grishchenko <
>>> agrishche...@pivotal.io> wrote:
>>>
 Hi

 Current implementation of PL/Python does not allow the use of
 multi-dimensional arrays, for both input and output parameters. This forces
 end users to introduce workarounds like casting arrays to text before
 passing them to the functions and parsing them after, which is an
 error-prone approach

 This patch adds support for multi-dimensional arrays as both input and
 output parameters for PL/Python functions. The number of dimensions
 supported is limited by the Postgres MAXDIM macro, by default equal to
 6. Both input and output multi-dimensional arrays should have fixed
 dimension sizes, i.e. 2-d arrays should represent MxN matrix, 3-d arrays
 represent MxNxK cube, etc.

 This patch does not support multi-dimensional arrays of composite
 types, as composite types in Python might be represented as iterators and
 there is no obvious way to find out when the nested array stops and
 composite type structure starts. For example, if we have a composite type
 of (int, text), we can try to return "[ [ [1,'a'], [2,'b'] ], [ [3,'c'],
 [4,'d'] ] ]", and it is hard to find out that the first two lists are
 lists, and the third one represents structure. Things are getting even more
 complex when you have arrays as members of composite type. This is why I
 think this limitation is reasonable.

 Given the function:

 CREATE FUNCTION test_type_conversion_array_int4(x int4[]) RETURNS
 int4[] AS $$
 plpy.info(x, type(x))
 return x
 $$ LANGUAGE plpythonu;

 Before patch:

 # SELECT * FROM test_type_conversion_array_int4(ARRAY[[1,2,3],[4,5,6]]);
 ERROR:  cannot convert multidimensional array to Python list
 DETAIL:  PL/Python only supports one-dimensional arrays.
 CONTEXT:  PL/Python function "test_type_conversion_array_int4"


 After patch:

 # SELECT * FROM test_type_conversion_array_int4(ARRAY[[1,2,3],[4,5,6]]);
 INFO:  ([[1, 2, 3], [4, 5, 6]], )
  test_type_conversion_array_int4
 ---------------------------------
  {{1,2,3},{4,5,6}}
 (1 row)


 --
 Best regards,
 Alexey Grishchenko

>>>
>>> Also this patch incorporates the fix for https://www.postgresql.org/message-id/CAH38_tkwA5qgLV8zPN1OpPzhtkNKQb30n3xq-2NR9jUfv3qwHA%40mail.gmail.com,
>>> as they touch the same piece of code - array manipulation in PL/Python
>>>
>>>
>> I am sending review of this patch:
>>
>> 1. The implemented functionality is clearly benefit - passing MD arrays,
>> pretty faster passing bigger arrays
>> 2. I was able to use this patch cleanly without any errors or warnings
>> 3. There are no errors or warnings
>> 4. All tests passed - I tested Python 2.7 and Python 3.5
>> 5. The code is well commented and clean
>> 6. For this new functionality the documentation is not necessary
>>
>> 7. I would like to see more regression tests for both directions (Python <-> Postgres)
>> for more than two dimensions
>>
>> My only objection is that there are not enough regression tests - after
>> fixing this, the patch will be ready for committers.
>>
>> Good work, Alexey
>>
>> Thank you
>>
>> Regards
>>
>> Pavel
>>
>>
>>> --
>>> Best regards,
>>> Alexey Grishchenko
>>>
>>
>>
>
> Pavel,
>
> I will pick this up.
>
>
> Pavel,

Please see the attached patch, which provides more test cases.

I just realized this patch contains the original patch as well. What is the
protocol for sending in subsequent patches?

>
> Dave Cramer
>
> da...@postgresintl.com
> www.postgresintl.com
>
>


0002-PL-Python-adding-support-for-multi-dimensional-arrays.patch
Description: Binary data



Re: [HACKERS] New SQL counter statistics view (pg_stat_sql)

2016-09-21 Thread Alvaro Herrera
Peter Eisentraut wrote:

> How about having the tag not be a column name but a row entry.  So you'd
> do something like
> 
> SELECT * FROM pg_stat_sql WHERE tag = 'ALTER VIEW';
> 
> That way, we don't have to keep updating (and re-debating) this when new
> command types or subtypes are added.  And queries written for future
> versions will not fail when run against old servers.

Yeah, good idea.

Let's also discuss the interface from the stats collector.  Currently we
have some 20 new SQL functions, all alike, each loading the whole data
and returning a single counter, and then the view invokes each function
separately.  That doesn't seem great to me.  How about having a single C
function that returns the whole thing as a SRF instead, and the view is
just a single function invocation -- something like pg_lock_status
filling pg_locks in one go.
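
For illustration only, the shape of such a function could be roughly as
below; SqlCounterEntry and fetch_sql_counters() are invented stand-ins for
whatever the stats collector actually provides, and the (tag, count) result
shape is just an example, so this is a sketch rather than the proposed patch:

#include "postgres.h"

#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
#include "nodes/execnodes.h"
#include "utils/builtins.h"
#include "utils/tuplestore.h"

/* Hypothetical per-tag counter, standing in for the collector's data. */
typedef struct SqlCounterEntry
{
    const char *tag;
    int64       count;
} SqlCounterEntry;

/* Hypothetical accessor; the real stats-collector API will differ. */
extern SqlCounterEntry *fetch_sql_counters(int *nentries);

PG_FUNCTION_INFO_V1(pg_stat_sql);

Datum
pg_stat_sql(PG_FUNCTION_ARGS)
{
    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
    TupleDesc   tupdesc;
    Tuplestorestate *tupstore;
    MemoryContext oldcontext;
    SqlCounterEntry *entries;
    int         nentries;
    int         i;

    if (rsinfo == NULL || !(rsinfo->allowedModes & SFRM_Materialize))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("materialize mode required, but it is not allowed in this context")));

    /* Materialize the whole result set in one call, as pg_lock_status() does. */
    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "return type must be a row type");

    oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
    tupstore = tuplestore_begin_heap(true, false, work_mem);
    rsinfo->returnMode = SFRM_Materialize;
    rsinfo->setResult = tupstore;
    rsinfo->setDesc = tupdesc;
    MemoryContextSwitchTo(oldcontext);

    entries = fetch_sql_counters(&nentries);

    /* One row per command tag: (tag text, count bigint). */
    for (i = 0; i < nentries; i++)
    {
        Datum   values[2];
        bool    nulls[2] = {false, false};

        values[0] = CStringGetTextDatum(entries[i].tag);
        values[1] = Int64GetDatum(entries[i].count);
        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
    }

    return (Datum) 0;
}

The view definition would then reduce to a single invocation of that
function, in the same style as pg_locks.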

Another consideration is that the present patch lumps together all ALTER
cases in a single counter.  This isn't great, but at the same time we
don't want to bloat the stat files by having hundreds of counters per
database, do we?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-21 Thread Heikki Linnakangas

On 08/02/2016 01:18 AM, Peter Geoghegan wrote:

Tape unification


Sort operations have a unique identifier, generated before any workers
are launched, using a scheme based on the leader's PID, and a unique
temp file number. This makes all on-disk state (temp files managed by
logtape.c) discoverable by the leader process. State in shared memory
is sized in proportion to the number of workers, so the only thing
about the data being sorted that gets passed around in shared memory
is a little logtape.c metadata for tapes, describing for example how
large each constituent BufFile is (a BufFile associated with one
particular worker's tapeset).

(See below also for notes on buffile.c's role in all of this, fd.c and
resource management, etc.)


> ...


buffile.c, and "unification"


There has been significant new infrastructure added to make logtape.c
aware of workers. buffile.c has in turn been taught about unification
as a first class part of the abstraction, with low-level management of
certain details occurring within fd.c. So, "tape unification" within
processes to open other backend's logical tapes to generate a unified
logical tapeset for the leader to merge is added. This is probably the
single biggest source of complexity for the patch, since I must
consider:

* Creating a general, reusable abstraction for other possible BufFile
users (logtape.c only has to serve tuplesort.c, though).

* Logical tape free space management.

* Resource management, file lifetime, etc. fd.c resource management
can now close a file at xact end for temp files, while not deleting it
in the leader backend (only the "owning" worker backend deletes the
temp file it owns).

* Crash safety (e.g., when to truncate existing temp files, and when not to).


I find this unification business really complicated. I think it'd be 
simpler to keep the BufFiles and LogicalTapeSets separate, and instead 
teach tuplesort.c how to merge tapes that live on different 
LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single 
LogicalTapeSet can contain tapes from different underlying BufFiles.


What I have in mind is something like the attached patch. It refactors 
LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a 
LogicalTape as argument, instead of LogicalTapeSet and tape number. 
LogicalTapeSet doesn't have the concept of a tape number anymore, it can 
contain any number of tapes, and you can create more on the fly. With 
that, it'd be fairly easy to make tuplesort.c merge LogicalTapes that 
came from different tape sets, backed by different BufFiles. I think 
that'd avoid much of the unification code.
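
To illustrate the shape of that change, the prototypes would go roughly from
the current LogicalTapeSet-plus-tape-number style to something like the
following (names paraphrased; the attached patch is authoritative):

/*
 * Paraphrased sketch of the refactored interface.  Today callers write,
 * for example:
 *
 *     LogicalTapeWrite(LogicalTapeSet *lts, int tapenum, void *ptr, size_t size);
 *
 * With the refactoring, a LogicalTape is an object of its own that can be
 * created on the fly and handed around:
 */
typedef struct LogicalTapeSet LogicalTapeSet;   /* opaque */
typedef struct LogicalTape LogicalTape;         /* opaque */

extern LogicalTape *LogicalTapeCreate(LogicalTapeSet *lts);
extern size_t LogicalTapeRead(LogicalTape *lt, void *ptr, size_t size);
extern void LogicalTapeWrite(LogicalTape *lt, void *ptr, size_t size);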


That leaves one problem, though: reusing space in the final merge phase.
If the tapes being merged belong to different LogicalTapeSets, and you
create one new tape to hold the result, the new tape cannot easily reuse
the space of the input tapes because they are on different tape sets.
But looking at your patch, ISTM you actually dodged that problem as well:



+* As a consequence of only being permitted to write to the leader
+* controlled range, parallel sorts that require a final materialized tape
+* will use approximately twice the disk space for temp files compared to
+* a more or less equivalent serial sort.  This is deemed acceptable,
+* since it is far rarer in practice for parallel sort operations to
+* require a final materialized output tape.  Note that this does not
+* apply to any merge process required by workers, which may reuse space
+* eagerly, just like conventional serial external sorts, and so
+* typically, parallel sorts consume approximately the same amount of disk
+* blocks as a more or less equivalent serial sort, even when workers must
+* perform some merging to produce input to the leader.


I'm slightly worried about that. Maybe it's OK for a first version, but 
it'd be annoying in a query where a sort is below a merge join, for 
example, so that you can't do the final merge on the fly because 
mark/restore support is needed.


One way to fix that would be to have all the parallel workers share the work
files to begin with, and keep the "nFileBlocks" value in shared memory
so that the workers won't overlap each other. Then all the blocks from
different workers would be mixed together, though, which would hurt the
sequential pattern of the tapes, so each worker would need to allocate
larger chunks to avoid that.


- Heikki

>From a1aa45c22cd13a2059880154e30f48d884a849ef Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 21 Sep 2016 19:31:33 +0300
Subject: [PATCH 1/1] Refactor LogicalTapeSet/LogicalTape interface.

A LogicalTape is now visible to callers, not just an internal object
within logtape.c. All the tape functions, like LogicalTapeRead and
LogicalTapeWrite, take a LogicalTape as argument, instead of
LogicalTapeSet+tape number.

You can cre

Re: [HACKERS] Hash Indexes

2016-09-21 Thread Oskari Saarenmaa

21.09.2016, 15:29, Robert Haas kirjoitti:

For PostgreSQL, I expect the benefits of improving hash indexes to be
(1) slightly better raw performance for equality comparisons and (2)
better concurrency.


There's a third benefit: with large columns a hash index is a lot 
smaller on disk than a btree index.  This is the biggest reason I've 
seen people want to use hash indexes instead of btrees.  hashtext() 
btrees are a workaround, but they require all queries to be adjusted 
which is a pain.


/ Oskari




Re: [HACKERS] pg_ctl promote wait

2016-09-21 Thread Peter Eisentraut
On 8/11/16 9:28 AM, Michael Paquier wrote:
> I have looked at them and the changes are looking fine for me. So I
> have switched the patch as ready for committer, aka you.
> 
> Just a nit:
> +   if (wait_seconds > 0)
> +   {
> +   sleep(1);
> +   wait_seconds--;
> +   continue;
> +   }
> This may be better with pg_usleep() instead of sleep().

Committed with that adjustment.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] New SQL counter statistics view (pg_stat_sql)

2016-09-21 Thread David Fetter
On Wed, Sep 21, 2016 at 11:25:14AM -0400, Peter Eisentraut wrote:
> On 9/14/16 4:01 PM, Robert Haas wrote:
> > I think it is not a good idea to make the command names used here the
> > plural forms of the command tags.  Instead of "inserts", "updates",
> > "imports", etc. just use "INSERT", "UPDATE", "IMPORT".  That's simpler
> > and less error prone - e.g. you won't end up with things like
> > "refreshs", which is not a word.
> 
> How about having the tag not be a column name but a row entry.  So you'd
> do something like
> 
> SELECT * FROM pg_stat_sql WHERE tag = 'ALTER VIEW';

+1 for this.  It's MUCH easier to deal with changes in row counts than
changes in row type.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] pageinspect: Hash index support

2016-09-21 Thread Jeff Janes
On Tue, Sep 20, 2016 at 11:14 PM, Michael Paquier  wrote:

>
> + 
> +  The type information will be 'm' for a metadata
> page,
> +  'v' for an overflow page,
> 'b' for a bucket page,
> +  'i' for a bitmap page, and
> 'u' for an unused page.
> + 
>


> Other functions don't go into this level of details, so I would just
> rip out this paragraph.
>

I'd argue that the other functions should go into that level of detail in some
places. Pageinspect is needlessly hard to use; not all precedent is good
precedent.  Some of them do refer you to header files or README files,
which can be useful. But the abbreviations used here are not explained in
any header file or README file, so I think the right place to explain them
is the documentation in that case. Or change from the single-letter strings
to full name strings so they are self-documenting.

Cheers,

Jeff


Re: [HACKERS] Speedup twophase transactions

2016-09-21 Thread Stas Kelvich
> On 21 Sep 2016, at 10:32, Michael Paquier  wrote:
> 
> On Tue, Sep 20, 2016 at 11:13 PM, Stas Kelvich  
> wrote:
>> 
>> Putting that before actual WAL replay is just following historical order of 
>> events.
>> Prepared files are pieces of WAL that happened before checkpoint, so ISTM
>> there is no conceptual problem in restoring their state before replay.
> 
> For wal_level = minimal there is no need to call that to begin with..
> And I'd think that it is better to continue with things the same way.
> 

Hm, why?

>> 
>> With this patch we are reusing one infrastructure for normal work, recovery 
>> and replay.
> 
> The mix between normal work and recovery is the scary part. Normal
> work and recovery are entirely two different things.
> 

Okay, agreed. A commit procedure that checks whether recovery is active or
not is quite a hacky solution.

>> So taking into account my comments what do you think? Should we keep current 
>> approach
>> or try to avoid replaying prepare records at all?
> 
> I'd really like to give a try to the idea you just mentioned, aka
> delay the fsync of the 2PC files from RecreateTwoPhaseFile to
> CreateRestartPoint, or get things into memory.. I could write one, or
> both of those patches. I don't mind.

I'm not giving up yet; I'll write them. I still have in mind several other
patches to 2PC handling in Postgres during this release cycle — logical
decoding and a partitioned hash instead of the TwoPhaseState list.

My bet is that the relative speed of those patches will depend on the
filesystem used. As with the first patch in this mail thread, it is entirely
possible to sometimes hit filesystem limits on file creation speed. Otherwise
both approaches should be more or less equal, I suppose.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company






Re: [HACKERS] Typo in pgstat.h

2016-09-21 Thread Peter Eisentraut
On 9/21/16 10:04 AM, Michael Paquier wrote:
> While going through pgstat.h I bumped into that:
> - * as 4-bytes where first byte repersents the wait event class (type of
> + * as 4-bytes where first byte represents the wait event class (type of

Fixed.  There was actually another one of the same a few lines later.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] New SQL counter statistics view (pg_stat_sql)

2016-09-21 Thread Peter Eisentraut
On 9/14/16 4:01 PM, Robert Haas wrote:
> I think it is not a good idea to make the command names used here the
> plural forms of the command tags.  Instead of "inserts", "updates",
> "imports", etc. just use "INSERT", "UPDATE", "IMPORT".  That's simpler
> and less error prone - e.g. you won't end up with things like
> "refreshs", which is not a word.

How about having the tag not be a column name but a row entry.  So you'd
do something like

SELECT * FROM pg_stat_sql WHERE tag = 'ALTER VIEW';

That way, we don't have to keep updating (and re-debating) this when new
command types or subtypes are added.  And queries written for future
versions will not fail when run against old servers.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH] get_home_path: use HOME

2016-09-21 Thread Peter Eisentraut
On 9/20/16 1:44 PM, Rudolf Gavlas wrote:
> If you think that using the value of HOME variable as the user's home
> directory is bad idea, I won't argue with that, I've already expressed
> my opinion. What is the real problem here is using home directory of a
> user A as a home directory for user B. That's clearly a bug and if you
> want to solve it without using HOME, I am fine with that.

I have no problem with using the HOME variable optionally.  That is
wide-spread practice.  But I dispute what you describe as the "real
problem".  In Unix, users are identified by uids.  The real problem, as
I see it, is that you think you have multiple users but you actually don't.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] wal_segment size vs max_wal_size

2016-09-21 Thread Peter Eisentraut
There is apparently some misbehavior if max_wal_size is less than 5 *
wal_segment_size.

For example, if you build with --with-wal-segsize=64, then the recovery
test fails unless you set max_wal_size to at least 320MB in
PostgresNode.pm.  The issue is that pg_basebackup fails with:

pg_basebackup: could not get transaction log end position from server:
ERROR:  could not find any WAL files

This should probably be made friendlier in some way.  But it also shows
that bigger WAL segment sizes are apparently not well-charted
territory lately.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] increasing the default WAL segment size

2016-09-21 Thread Peter Eisentraut
On 9/21/16 8:12 AM, Robert Haas wrote:
>> Oh, I'm on board with increasing the default size a bit. A different
>> > default size isn't a non-default compile time option anymore though, and
>> > I don't think 1GB is a reasonable default.
> But that's not the question.  What Peter said was: "maybe we should at
> least *allow* some larger sizes, for testing out".  I see very little
> merit in restricting the values that people can set via configure.
> That just makes life difficult.  If a user picks a setting that
> doesn't perform well, oops.

Right.  If we think that a larger size can have some performance benefit
and we think that 64MB might be a good new default (as was the initial
suggestion), then we should surely allow at least say 128 and 256 to be
tried out.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Tracking wait event for latches

2016-09-21 Thread Robert Haas
On Wed, Sep 21, 2016 at 10:23 AM, Michael Paquier
 wrote:
> On Wed, Sep 21, 2016 at 11:18 PM, Robert Haas  wrote:
>> No, that's not what I want to do.  I think we should categorize the
>> events administratively by their main purpose, rather than
>> technologically by what we're waiting for.
>
> So we'd just have three class IDs instead of one? Well why not.

Yeah, or, I mean, it doesn't have to be three precisely, but I'd like
to try to avoid exposing the users to the fact that we have an
internal data structure called a WaitEventSet and instead classify by
function.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Tracking wait event for latches

2016-09-21 Thread Robert Haas
On Wed, Sep 21, 2016 at 10:02 AM, Michael Paquier
 wrote:
> On Wed, Sep 21, 2016 at 10:23 PM, Robert Haas  wrote:
>> I have to admit that I like the individual event names quite a bit,
>> and I think the detail will be useful to users.  But I wonder if
>> there's a better way to describe the class of events that we're
>> talking about that's not so dependent on internal data structures.
>> Maybe we could divide these waits into a couple of categories - e.g.
>> "Socket", "Timeout", "Process" - and then divide these detailed wait
>> events among those classes.
>
> pgstat.h is mentioning that there is 1 byte still free. I did not
> notice that until a couple of minutes ago. There are 2 bytes used for
> the event ID, and 1 byte for the class ID, but there are 4 bytes
> available. Perhaps we could use this extra byte to store this extra
> status information, then use it for WaitEventSet to build up a string
> that will be stored in classId field? For example if a process is
> waiting on a socket and a timeout, we'd write "Socket,Timeout" as a
> text field.

No, that's not what I want to do.  I think we should categorize the
events administratively by their main purpose, rather than
technologically by what we're waiting for.

>> Another thing to think about is that there's no way to actually see
>> wait event information for a number of the processes which this patch
>> instruments, because they don't appear in pg_stat_activity.
>
> We could create a new system to track the activity of system-related
> processes, for example pg_stat_system_activity, or pg_system_activity,
> and list all the processes that are not counted in max_connections...

Yes.  Or we could decide to include everything in pg_stat_activity.  I
think those are the two reasonable options.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Tracking wait event for latches

2016-09-21 Thread Michael Paquier
On Wed, Sep 21, 2016 at 11:18 PM, Robert Haas  wrote:
> No, that's not what I want to do.  I think we should categorize the
> events administratively by their main purpose, rather than
> technologically by what we're waiting for.

So we'd just have three class IDs instead of one? Well why not.
-- 
Michael




Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-09-21 Thread Robert Haas
On Wed, Sep 21, 2016 at 5:54 AM, Petr Jelinek  wrote:
>>> Reading again the thread, it seems that my previous post [1] was a bit
>>> misunderstood. My position is to not introduce any new behavior
>>> changes in 9.6, so we could just make the FIRST NUM grammar equivalent
>>> to NUM.
>>>
>>> [1]: 
>>> https://www.postgresql.org/message-id/CAB7nPqRDvJn18e54ccNpOP1A2_iUN6-iU=4njgmmgiagvcs...@mail.gmail.com
>>
>> I misunderstood your intent, then.  But I still stand by what I did
>> understand, namely that 'k (...)'  should mean 'any k (...)'.  It's much
>> more natural than having it mean 'first k (...)' and I also think it
>> will be more frequent in practice.
>>
>
> I think so as well.

Well, I agree, but I think making behavior changes after rc1 is a
non-starter.  It's better to live with the incompatibility than to
change the behavior so close to release.  At least, that's my
position.  Getting the release out on time with a minimal bug count is
more important to me than a minor incompatibility in the meaning of
one GUC.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2016-09-21 Thread Robert Haas
On Wed, Sep 21, 2016 at 3:40 AM, Craig Ringer  wrote:
> The only non-horrible one of those is IMO to just let the caller see
> an error if they lose the race. It's a function that's intended for
> use when you're dealing with indeterminate transaction state after a
> server or application error anyway, so it's part of an error path
> already. So I say we just document the behaviour.

I am not keen on that idea.  The errors we're likely to be exposing
are going to look like low-level internal failures, which might scare
some DBA.  Even if they don't, I think it's playing with fire.  The
system is designed on the assumption that nobody will try to look up
an XID that's too old, and if we start to violate that assumption I
think we're undermining the design integrity of the system in a way
we'll likely come to regret.  To put that more plainly, when the code
is written with the assumption that X will never happen, it's usually
a bad idea to casually add code that does X.

> Because slru.c
> doesn't release its LWLock on error we also need to ensure
> txid_status(...) is also only called from a toplevel xact so the user
> doesn't attempt to wrap it in plpgsql BEGIN ... EXCEPTION block and it
> causes the xact to abort.

I think this is muddled, because an error aborts the transaction, and
AbortTransaction() and AbortSubTransaction() start with
LWLockReleaseAll().

It might not be too hard to add a second copy of oldestXid in shared
memory that is updated before truncation rather than afterward... but
yeah, like you, I'm not finding this nearly as straightforward as
might have been hoped.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Typo in pgstat.h

2016-09-21 Thread Michael Paquier
Hi all,

While going through pgstat.h I bumped into that:
- * as 4-bytes where first byte repersents the wait event class (type of
+ * as 4-bytes where first byte represents the wait event class (type of
Regards,
-- 
Michael
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index dc3320d..97cf5e1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1007,7 +1007,7 @@ extern void pgstat_initstats(Relation rel);
  *
  *	Called from places where server process needs to wait.  This is called
  *	to report wait event information.  The wait information is stored
- *	as 4-bytes where first byte repersents the wait event class (type of
+ *	as 4-bytes where first byte represents the wait event class (type of
  *	wait, for different types of wait, refer WaitClass) and the next
  *	3-bytes repersent the actual wait event.  Currently 2-bytes are used
  *	for wait event which is sufficient for current usage, 1-byte is



Re: [HACKERS] Tracking wait event for latches

2016-09-21 Thread Michael Paquier
On Wed, Sep 21, 2016 at 10:23 PM, Robert Haas  wrote:
> I have to admit that I like the individual event names quite a bit,
> and I think the detail will be useful to users.  But I wonder if
> there's a better way to describe the class of events that we're
> talking about that's not so dependent on internal data structures.
> Maybe we could divide these waits into a couple of categories - e.g.
> "Socket", "Timeout", "Process" - and then divide these detailed wait
> events among those classes.

pgstat.h is mentioning that there is 1 byte still free. I did not
notice that until a couple of minutes ago. There are 2 bytes used for
the event ID, and 1 byte for the class ID, but there are 4 bytes
available. Perhaps we could use this extra byte to store this extra
status information, then use it for WaitEventSet to build up a string
that will be stored in classId field? For example if a process is
waiting on a socket and a timeout, we'd write "Socket,Timeout" as a
text field.
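
For illustration, the packing could look something like the sketch below;
the macro and function names are invented here, and only the byte layout
follows the existing pgstat.h comment (class ID in the top byte, event ID in
the low two bytes, one spare byte in between):

#include "postgres.h"

#define WAIT_EVENT_CLASS_SHIFT  24
#define WAIT_EVENT_FLAGS_SHIFT  16

/* Pack class ID, the caller's WL_* flag mask, and event ID into 4 bytes. */
static inline uint32
make_wait_event_info(uint8 classId, uint8 wl_flags, uint16 eventId)
{
    return ((uint32) classId << WAIT_EVENT_CLASS_SHIFT) |
        ((uint32) wl_flags << WAIT_EVENT_FLAGS_SHIFT) |
        (uint32) eventId;
}

/* Recover the WL_* flag mask from the packed value. */
static inline uint8
wait_event_info_get_flags(uint32 wait_event_info)
{
    return (uint8) ((wait_event_info >> WAIT_EVENT_FLAGS_SHIFT) & 0xFF);
}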

> The "SecureRead" and "SecureWrite" wait events are going to be
> confusing, because the connection isn't necessarily secure.  I think
> those should be called "ClientRead" and "ClientWrite".
> Comprehensibility is more important than absolute consistency with the
> C function names.

Noted.

> Another thing to think about is that there's no way to actually see
> wait event information for a number of the processes which this patch
> instruments, because they don't appear in pg_stat_activity.

We could create a new system to track the activity of system-related
processes, for example pg_stat_system_activity, or pg_system_activity,
and list all the processes that are not counted in max_connections...
-- 
Michael




Re: [HACKERS] pageinspect: Hash index support

2016-09-21 Thread Jesper Pedersen

On 09/21/2016 08:43 AM, Michael Paquier wrote:

page_stats / page_items should not be used on the metadata page.

As these functions are marked as superuser only it is expected that people
provide the correct input, especially since the raw page structure is used
as the input.


btree functions use the block number to do some sanity checks. You
cannot do that here as only bytea functions are available, but you
could do it in verify_hash_page by looking at the opaque data and look
at LH_META_PAGE. Then add a boolean argument into verify_hash_page to
see if the caller expects a meta page or not and just issue an error.
Actually it would be a good idea to put in those safeguards, even if I
agree with you that calling those functions is at the risk of the
user... Could you update the patch in this sense?

I had fun doing the same tests, aka running the items and stats
functions on a meta page, and the meta function on a non-meta page,
but at my surprise I did not see a crash, so perhaps I was lucky and
perhaps that was because of OSX.



Attached is v5, which adds basic page verification.

Thanks for the feedback !

Best regards,
 Jesper

>From 05fe7b9056bcbb2f29809798229640499576c3a9 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 5 Aug 2016 10:16:32 -0400
Subject: [PATCH] pageinspect: Hash index support

---
 contrib/pageinspect/Makefile  |  10 +-
 contrib/pageinspect/hashfuncs.c   | 422 ++
 contrib/pageinspect/pageinspect--1.5--1.6.sql |  63 
 contrib/pageinspect/pageinspect--1.5.sql  | 279 -
 contrib/pageinspect/pageinspect--1.6.sql  | 338 +
 contrib/pageinspect/pageinspect.control   |   2 +-
 doc/src/sgml/pageinspect.sgml | 104 +++
 7 files changed, 933 insertions(+), 285 deletions(-)
 create mode 100644 contrib/pageinspect/hashfuncs.c
 create mode 100644 contrib/pageinspect/pageinspect--1.5--1.6.sql
 delete mode 100644 contrib/pageinspect/pageinspect--1.5.sql
 create mode 100644 contrib/pageinspect/pageinspect--1.6.sql

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index a98237e..c73d728 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -2,13 +2,13 @@
 
 MODULE_big	= pageinspect
 OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \
-		  brinfuncs.o ginfuncs.o $(WIN32RES)
+		  brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES)
 
 EXTENSION = pageinspect
-DATA = pageinspect--1.5.sql pageinspect--1.4--1.5.sql \
-	pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \
-	pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \
-	pageinspect--unpackaged--1.0.sql
+DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \
+	pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
+	pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
+	pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql
 PGFILEDESC = "pageinspect - functions to inspect contents of database pages"
 
 ifdef USE_PGXS
diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c
new file mode 100644
index 000..3ee57dc
--- /dev/null
+++ b/contrib/pageinspect/hashfuncs.c
@@ -0,0 +1,422 @@
+/*
+ * hashfuncs.c
+ *		Functions to investigate the content of HASH indexes
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pageinspect/hashfuncs.c
+ */
+
+#include "postgres.h"
+
+#include "access/hash.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+
+PG_FUNCTION_INFO_V1(hash_metap);
+PG_FUNCTION_INFO_V1(hash_page_items);
+PG_FUNCTION_INFO_V1(hash_page_stats);
+
+/* 
+ * structure for single hash page statistics
+ * 
+ */
+typedef struct HashPageStat
+{
+	uint32		live_items;
+	uint32		dead_items;
+	uint32		page_size;
+	uint32		max_avail;
+	uint32		free_size;
+	uint32		avg_item_size;
+	char		type;
+
+	/* opaque data */
+	BlockNumber hasho_prevblkno;
+	BlockNumber hasho_nextblkno;
+	Bucket		hasho_bucket;
+	uint16		hasho_flag;
+	uint16		hasho_page_id;
+} HashPageStat;
+
+
+/*
+ * Verify that the given bytea contains a HASH page, or die in the attempt.
+ * A pointer to the page is returned.
+ */
+static Page
+verify_hash_page(bytea *raw_page, bool metap)
+{
+	Page		page;
+	int			raw_page_size;
+	HashPageOpaque pageopaque;
+
+	raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+	if (raw_page_size != BLCKSZ)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small"),
+ errdetail("Expected size %d, got %d",
+		   BLCKSZ, raw_page_size)));
+
+	page = VARDATA(raw_page);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+	if (pageopaque->hasho_page_id != HASHO_PAGE_ID)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("page is not a HASH page"),
+ errdetail("Expected %08x, got %08x.",
+		   HASHO_PAGE_ID, pageopaque->hasho_p

Re: [HACKERS] Tracking wait event for latches

2016-09-21 Thread Robert Haas
On Tue, Sep 20, 2016 at 7:13 PM, Thomas Munro
 wrote:
> This paragraph seems a bit confused.  I suggest something more like
> this:  "The server process is waiting for one or more sockets, a timer
> or an interprocess latch.  The wait happens in a WaitEventSet,
> PostgreSQL's portable IO multiplexing abstraction."

I'm worried we're exposing an awful lot of internal detail here.
Moreover, it's pretty confusing that we have this general concept of
wait events in pg_stat_activity, and then here the specific type of
wait event we're waiting for is the ... wait event kind.  Uh, what?

I have to admit that I like the individual event names quite a bit,
and I think the detail will be useful to users.  But I wonder if
there's a better way to describe the class of events that we're
talking about that's not so dependent on internal data structures.
Maybe we could divide these waits into a couple of categories - e.g.
"Socket", "Timeout", "Process" - and then divide these detailed wait
events among those classes.
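
Just to sketch what I mean, something like this, purely illustrative and not
a naming proposal:

/* Illustrative only: user-facing categories, independent of WaitEventSet. */
typedef enum WaitEventCategory
{
    WAIT_EVENT_SOCKET,          /* waiting for a socket to become ready */
    WAIT_EVENT_TIMEOUT,         /* waiting for a timer to expire */
    WAIT_EVENT_PROCESS          /* waiting for another server process */
} WaitEventCategory;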

The "SecureRead" and "SecureWrite" wait events are going to be
confusing, because the connection isn't necessarily secure.  I think
those should be called "ClientRead" and "ClientWrite".
Comprehensibility is more important than absolute consistency with the
C function names.

Another thing to think about is that there's no way to actually see
wait event information for a number of the processes which this patch
instruments, because they don't appear in pg_stat_activity.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pgbench - compute & show latency consistently

2016-09-21 Thread Fabien COELHO


Hello Kuntal,



transaction type: 
scaling factor: 10
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 1000
number of transactions actually processed: 1/1
tps = 85.184871 (including connections establishing)
tps = 85.296346 (excluding connections establishing)


Shouldn't we include latency average here as well and explain what it is?


Indeed, it now seems to always be printed, but the documentation did not
follow; there should be a:


  latency average = 117.392 ms

line in front of the tps line. Well, the performance displayed could also be
improved... On my dual-core SSD laptop I just got:


 sh> ./pgbench -c 10 -t 1000
 starting vacuum...end.
 transaction type: 
 scaling factor: 100
 query mode: simple
 number of clients: 10
 number of threads: 1
 number of transactions per client: 1000
 number of transactions actually processed: 1/1
 latency average = 9.527 ms
 tps = 1049.665115 (including connections establishing)
 tps = 1049.890194 (excluding connections establishing)

Which is about 10 times better.
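
For what it's worth, both runs are consistent with the latency average being
roughly the number of clients divided by the tps: 10 / 85.18 gives about
117.4 ms for the run above, and 10 / 1049.67 gives about 9.5 ms here.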

--
Fabien.




Re: [HACKERS] Logical Replication WIP

2016-09-21 Thread Peter Eisentraut
Some partial notes on 0005-Add-logical-replication-workers.patch:

Documentation still says that TRUNCATE is supported.

In catalogs.sgml for pg_subscription column subpublications I'd add a
note that those are publications that live on the remote server.
Otherwise one might think by mistake that it references pg_publication.

The changes in reference.sgml should go into an earlier patch.

Document that table and column names are matched by name.  (This seems
obvious, but it's not explained anywhere, AFAICT.)

Document to what extent other relation types are supported (e.g.,
materialized views as source, view or foreign table or temp table as
target).  Suggest an updatable view as target if user wants to have
different table names or write into a different table structure.

subscriptioncmds.c: In CreateSubscription(), the
CommandCounterIncrement() call is apparently not needed.

subscriptioncmds.c: Duplicative code for libpqwalreceiver loading and
init, should be refactored.

subscriptioncmds.c: Perhaps combine logicalrep_worker_find() and
logicalrep_worker_stop() into one call that also encapsulates the
required locking.

001_rep_changes.pl: The TAP protocol does not allow direct printing to
stdout.  (It needs to be prefixed with # or with spaces or something; I
forget.)  In this case, the print calls can just be removed, because the
following is() calls in each case will print the failing value anyway.

In get_subscription_list(), the memory context pointers don't appear to
do anything useful, because everything ends up being CurrentMemoryContext.

pg_stat_get_subscription(NULL) for "all" seems a bit of a weird interface.

pglogical_apply_main not used, should be removed.

In logicalreprel_open(), the error message "cache lookup failed for
remote relation %u" could be clarified.  This message could probably
happen if the protocol did not send a Relation message first.  (The term
"cache" is perhaps inappropriate for LogicalRepRelMap, because it
implies that the value can be gotten from elsewhere if it's not in the
cache.  In this case it's really session state that cannot be recovered
easily.)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Fix Error Message for allocate_recordbuf() Failure

2016-09-21 Thread Robert Haas
On Tue, Sep 20, 2016 at 9:25 PM, Michael Paquier
 wrote:
> On Wed, Sep 21, 2016 at 12:32 AM, Robert Haas  wrote:
>> For what it's worth, I think it's fine.  Good error messages are a useful 
>> thing.
>>
>> More generally, I think the whole patch looks good and should be committed.
>
> Hm. I'd think that it is still more portable to just issue a WARNING
> message in palloc_extended() when MCXT_ALLOC_NO_OOM then. Other code
> paths could benefit from that as well, and the patch proposed does
> nothing for the other places calling it. I am fine to write a patch
> for this purpose if you agree on that.

No, I strongly disagree with that.  I think when you pass
MCXT_ALLOC_NO_OOM, you're saying that you are prepared to deal with a
NULL return value.  It becomes your job to decide whether to emit any
log message and which one to emit.  If palloc_extended() insists on
emitting a warning regardless, the caller can't now emit a more
specific message without creating redundant log chatter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pageinspect: Hash index support

2016-09-21 Thread Ashutosh Sharma
Hi,

> git apply w/ v4 works here, so you will have to investigate what happens on
> your side.
>

Thanks, It works with v4 patch.

> As these functions are marked as superuser only it is expected that people
> provide the correct input, especially since the raw page structure is used
> as the input.

Well, page_stats / page_items does accept a bitmap page as input, but
these functions are not defined to read bitmap pages, as a bitmap page
does not have a standard page layout. In fact, if we pass a bitmap page
as input to page_stats / page_items, the output is always the same. I
think we need a separate function to read bitmap pages. And I am not
sure why the server should crash if I pass a meta page to
hash_page_stats or hash_page_items.


> The "if" statement will need updating once the CHI patch goes in, as it
> defines new constants for page types.

I am not sure whether the CHI patch adds a new type of hash page, beyond
holding information about a split in progress. Basically my point was: can
we have hash pages of types other than meta page, bucket page, overflow
page and bitmap page? If the pageinspect tool finds a page that does not
fall under any of these categories, shouldn't it error out?

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] more parallel query documentation

2016-09-21 Thread Robert Haas
On Tue, Sep 20, 2016 at 11:18 AM, Peter Eisentraut
 wrote:
> On 9/19/16 1:22 PM, Robert Haas wrote:
>> On Fri, Sep 16, 2016 at 4:28 PM, Alvaro Herrera
>>  wrote:
>>> I agree it should be added.  I suggest that it could even be added to
>>> the 9.6 docs, if you can make it.
>>
>> Here's a patch.  I intend to commit this pretty quickly unless
>> somebody objects, and also to backpatch it into 9.6.  I'm sure it's
>> not perfect, but imperfect documentation is better than no
>> documentation.
>
> Looks reasonable to me.

Cool.  Committed after fixing a typo that Alvaro noted off-list and a
few others that I found after inspecting with an editor that features
spell-check.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pageinspect: Hash index support

2016-09-21 Thread Michael Paquier
On Wed, Sep 21, 2016 at 9:21 PM, Jesper Pedersen
 wrote:
> On 09/21/2016 07:24 AM, Ashutosh Sharma wrote:
> git apply w/ v4 works here, so you will have to investigate what happens on
> your side.

Works here as well.

>> I continued reviewing your v4 patch and here are some more comments:
>>
>> 1). Got below error if i pass meta page to hash_page_items(). Does
>> hash_page_items() accept only bucket or bitmap page as input?
>>
>> postgres=# SELECT * FROM hash_page_items(get_raw_page('hash_i4_index',
>> 0));
>> ERROR:  invalid memory alloc request size 18446744073709551593
>>
>
> page_stats / page_items should not be used on the metadata page.
>
> As these functions are marked as superuser only it is expected that people
> provides the correct input, especially since the raw page structure is used
> as the input.

btree functions use the block number to do some sanity checks. You
cannot do that here as only bytea functions are available, but you
could do it in verify_hash_page by looking at the opaque data and look
at LH_META_PAGE. Then add a boolean argument into verify_hash_page to
see if the caller expects a meta page or not and just issue an error.
Actually it would be a good idea to put in those safeguards, even if I
agree with you that calling those functions is at the risk of the
user... Could you update the patch in this sense?
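
To spell out the kind of safeguard I mean, something along these lines
(a sketch only, with an invented helper name and messages, not the actual
patch):

#include "postgres.h"

#include "access/hash.h"

/*
 * Sketch: after verify_hash_page()'s existing size and HASHO_PAGE_ID
 * checks, compare the caller's expectation against the LH_META_PAGE flag
 * in the page's opaque data.
 */
static void
check_expected_page_kind(Page page, bool metap)
{
    HashPageOpaque pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);

    if (metap && (pageopaque->hasho_flag & LH_META_PAGE) == 0)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("page is not a hash meta page")));

    if (!metap && (pageopaque->hasho_flag & LH_META_PAGE) != 0)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("page is a hash meta page"),
                 errhint("Use hash_metap() to inspect meta pages.")));
}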

I had fun doing the same tests, aka running the items and stats
functions on a meta page, and the meta function on a non-meta page,
but at my surprise I did not see a crash, so perhaps I was lucky and
perhaps that was because of OSX.
-- 
Michael




Re: [HACKERS] Hash Indexes

2016-09-21 Thread Robert Haas
On Tue, Sep 20, 2016 at 7:55 PM, Bruce Momjian  wrote:
>> If it turns out that it has little benefit, then we don't really need
>> to step up our user education.  People can just keep using btree like
>
> The big problem is people coming from other databases and assuming our
> hash indexes have the same benefits over btree that exist in some other
> database software.  The 9.5 warning at least helps with that.

I'd be curious what benefits people expect to get.  For example, I
searched for "Oracle hash indexes" using Google and found this page:

http://logicalread.solarwinds.com/oracle-11g-hash-indexes-mc02/

It implies that their hash indexes are actually clustered indexes;
that is, the table data is physically organized into contiguous chunks
by hash bucket.  Also, they can't split buckets on the fly.  I think
the DB2 implementation is similar.  So our hash indexes, even once we
add write-ahead logging and better concurrency, will be somewhat
different from those products.  However, I'm not actually sure how
widely-used those index types are.  I wonder if people who use hash
indexes in PostgreSQL are even likely to be familiar with those
technologies, and what expectations they might have.

For PostgreSQL, I expect the benefits of improving hash indexes to be
(1) slightly better raw performance for equality comparisons and (2)
better concurrency.  The details aren't very clear at this stage.  We
know that write performance is bad right now, even with Amit's
patches, but that's without the kill_prior_tuple optimization which is
probably extremely important but which has never been implemented for
hash indexes.  Read performance is good, but there are still further
optimizations that haven't been done there, too, so it may be even
better by the time Amit gets done working in this area.

Of course, if we want to implement clustered indexes, that's going to
require significant changes to the heap format ... or the ability to
support multiple heap storage formats.  I'm not opposed to that, but I
think it makes sense to fix the existing implementation first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pageinspect: Hash index support

2016-09-21 Thread Jesper Pedersen

On 09/21/2016 02:14 AM, Michael Paquier wrote:

Adjusted in v4.


Thanks for the new version.


Code/doc will need an update once the CHI patch goes in.


If you are referring to the WAL support, I guess that this patch or
the other patch could just rebase and adjust as needed.



It is the main patch [1] that defines the new constants for page type. 
But I'll submit an update for pageinspect when needed.



hash_page_items and hash_page_stats share a lot of common points with
their btree equivalents, still doing a refactoring would just impact
the visibility of the code, and it is wanted as educational in this
module, so let's go with things the way you suggest.



Ok.


+ 
+  The type information will be 'm' for a metadata page,
+  'v' for an overflow page,
'b' for a bucket page,
+  'i' for a bitmap page, and
'u' for an unused page.
+ 
Other functions don't go into this level of details, so I would just
rip out this paragraph.



I'll add an annotation for this part, and leave it for the committer to 
decide, since Jeff wanted documentation for the 'type' information.



The patch looks in pretty good shape to me, so I am switching it as
ready for committer.



Thanks for your feedback !

Best regards,
 Jesper





Re: [HACKERS] pageinspect: Hash index support

2016-09-21 Thread Jesper Pedersen

Hi,

On 09/21/2016 07:24 AM, Ashutosh Sharma wrote:

The development branch is @
https://github.com/jesperpedersen/postgres/tree/pageinspect_hash


I am working on centOS 7. I am still facing the issue when applying
your patch using "git apply" command but if i use 'patch' command it
works fine however i am seeing some warning messages with 'patch'
command as well.



git apply w/ v4 works here, so you will have to investigate what happens 
on your side.



I continued reviewing your v4 patch and here are some more comments:

1). Got below error if i pass meta page to hash_page_items(). Does
hash_page_items() accept only bucket or bitmap page as input?

postgres=# SELECT * FROM hash_page_items(get_raw_page('hash_i4_index', 0));
ERROR:  invalid memory alloc request size 18446744073709551593



page_stats / page_items should not be used on the metadata page.

As these functions are marked as superuser only it is expected that 
people provide the correct input, especially since the raw page 
structure is used as the input.



2). Server crashed when below query was executed on a code that was
build with cassert-enabled.

postgres=# SELECT * FROM hash_page_stats(get_raw_page('hash_i4_index', 0));
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q

The callstack is as follows:

(gdb) bt
#0  0x7fef794f15f7 in raise () from /lib64/libc.so.6
#1  0x7fef794f2ce8 in abort () from /lib64/libc.so.6
#2  0x00967381 in ExceptionalCondition
(conditionName=0x7fef699c942d "!(((id)->lp_len != 0))",
errorType=0x7fef699c92d4 "FailedAssertion",
fileName=0x7fef699c9397 "hashfuncs.c", lineNumber=123) at assert.c:54
#3  0x7fef699c6ed6 in GetHashPageStatistics (page=0x1060c2c "",
stat=0x7ffd76846230) at hashfuncs.c:123
#4  0x7fef699c70a4 in hash_page_stats (fcinfo=0x7ffd768463e0) at
hashfuncs.c:169
#5  0x00682e6b in ExecMakeTableFunctionResult
(funcexpr=0x10457f0, econtext=0x1044e28, argContext=0x104ee38,
expectedDesc=0x1047640,
randomAccess=0 '\000') at execQual.c:2216
#6  0x006a88e2 in FunctionNext (node=0x1044d10) at nodeFunctionscan.c:94
#7  0x0068a3d9 in ExecScanFetch (node=0x1044d10,
accessMtd=0x6a882b , recheckMtd=0x6a8c10
) at execScan.c:95
#8  0x0068a448 in ExecScan (node=0x1044d10, accessMtd=0x6a882b
, recheckMtd=0x6a8c10 ) at
execScan.c:145
#9  0x006a8c45 in ExecFunctionScan (node=0x1044d10) at
nodeFunctionscan.c:268
#10 0x0067f0cf in ExecProcNode (node=0x1044d10) at execProcnode.c:449
#11 0x0067b40b in ExecutePlan (estate=0x1044bf8,
planstate=0x1044d10, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001',
numberTuples=0, direction=ForwardScanDirection, dest=0x10444c8) at
execMain.c:1567
#12 0x00679527 in standard_ExecutorRun (queryDesc=0xfac778,
direction=ForwardScanDirection, count=0) at execMain.c:338
#13 0x006793ab in ExecutorRun (queryDesc=0xfac778,
direction=ForwardScanDirection, count=0) at execMain.c:286
#14 0x00816b3e in PortalRunSelect (portal=0xfa49e8, forward=1
'\001', count=0, dest=0x10444c8) at pquery.c:948
#15 0x008167d1 in PortalRun (portal=0xfa49e8,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x10444c8,
altdest=0x10444c8,
completionTag=0x7ffd76846c60 "") at pquery.c:789
#16 0x00810a27 in exec_simple_query (query_string=0x1007018
"SELECT * FROM hash_page_stats(get_raw_page('hash_i4_index', 0));") at
postgres.c:1094
#17 0x00814b5e in PostgresMain (argc=1, argv=0xfb1d08,
dbname=0xf873d8 "postgres", username=0xfb1b70 "edb") at
postgres.c:4070
#18 0x0078990c in BackendRun (port=0xfacb50) at postmaster.c:4260




Same deal here.



3). Could you please let me know what the hard-coded values '*5' and
'+1' represent in the lines of code below? You have used them when
allocating memory for storing the spare pages allocated before a
certain split point, and the block numbers of bitmap pages, inside
hash_metap().

spares = palloc0(HASH_MAX_SPLITPOINTS * 5 + 1);

mapp = palloc0(HASH_MAX_BITMAPS * 5 + 1);

I guess it would be better to add some comments when using hard-coded values.



It is the space needed to output the values.
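
For illustration, here is a sketch (not the patch's actual code) of how such
a list could be built with a StringInfo instead, which sidesteps the
per-entry width question entirely; this is a fragment, and the metap variable
(a HashMetaPage) and the include list are assumptions for the example:

#include "access/hash.h"
#include "lib/stringinfo.h"

/* Sketch only: build the spares list as text without sizing it by hand. */
StringInfoData buf;
int i;

initStringInfo(&buf);
appendStringInfoChar(&buf, '{');
for (i = 0; i < HASH_MAX_SPLITPOINTS; i++)
{
    if (i > 0)
        appendStringInfoChar(&buf, ',');
    appendStringInfo(&buf, "%u", metap->hashm_spares[i]);
}
appendStringInfoChar(&buf, '}');
/* buf.data now holds the text form of the array */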


4). Inside hash_page_stats(), you are first calling verify_hash_page()
to verify whether it is a hash page or not, and


No, we assume that the page is a valid hash page structure, since there
is an explicit cast without any further checks.



then calling
GetHashPageStatistics() to get the page stats, wherein you are trying
to check what the type of the hash page is. Below is the code snippet for
this:

+   /* page type (flags) */
+   if (opaque->hasho_flag & LH_META_PAGE)
+   stat->type = 'm';
+   else if (opaque->hasho_flag & LH_OVERFLOW_PAGE)
+   stat->type = 'v';
+   else if (opaque->hasho_flag & LH_BUCKET_PAGE)
+   stat->type = 'b';
+   else if (opaque-

Re: [HACKERS] increasing the default WAL segment size

2016-09-21 Thread Robert Haas
On Tue, Sep 20, 2016 at 4:42 PM, Andres Freund  wrote:
> On 2016-09-20 16:32:46 -0400, Robert Haas wrote:
>> > Requiring a non-default compile-time or even just cluster-creation-time
>> > option for tuning isn't something worth expending energy on imo.
>>
>> I don't agree.  The latency requirements on an archive_command when
>> you're churning out 16MB files multiple times per second are insanely
>> tight, and saying that we shouldn't increase the size because it's
>> better to go redesign a bunch of other things that will eventually
>> *maybe* remove the need for archive_command does not seem like a
>> reasonable response.
>
> Oh, I'm on board with increasing the default size a bit. A different
> default size isn't a non-default compile time option anymore though, and
> I don't think 1GB is a reasonable default.

But that's not the question.  What Peter said was: "maybe we should at
least *allow* some larger sizes, for testing out".  I see very little
merit in restricting the values that people can set via configure.
That just makes life difficult.  If a user picks a setting that
doesn't perform well, oops.

> Running multiple archive_commands concurrently - pretty easy to
> implement - isn't the same as removing the need for archive_command. I'm
> pretty sure that continuously, and if necessary concurrently, archiving a
> bunch of 64MB files is going to work better than irregularly
> creating / transferring 1GB files.

I'm not trying to block you from implementing parallel archiving, but
right now we don't have it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2016-09-21 Thread Ashutosh Sharma
Hi,

> Which platform are you on ?
>
> The development branch is @
> https://github.com/jesperpedersen/postgres/tree/pageinspect_hash

I am working on CentOS 7. I am still facing the issue when applying
your patch with the "git apply" command, but it works fine if I use the
'patch' command; however, I am also seeing some warning messages with
'patch'.

I continued reviewing your v4 patch and here are some more comments:

1). I got the error below when passing the meta page to hash_page_items(). Does
hash_page_items() accept only bucket or bitmap pages as input?

postgres=# SELECT * FROM hash_page_items(get_raw_page('hash_i4_index', 0));
ERROR:  invalid memory alloc request size 18446744073709551593

2). The server crashed when the query below was executed on a
cassert-enabled build.

postgres=# SELECT * FROM hash_page_stats(get_raw_page('hash_i4_index', 0));
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q

The callstack is as follows:

(gdb) bt
#0  0x7fef794f15f7 in raise () from /lib64/libc.so.6
#1  0x7fef794f2ce8 in abort () from /lib64/libc.so.6
#2  0x00967381 in ExceptionalCondition
(conditionName=0x7fef699c942d "!(((id)->lp_len != 0))",
errorType=0x7fef699c92d4 "FailedAssertion",
fileName=0x7fef699c9397 "hashfuncs.c", lineNumber=123) at assert.c:54
#3  0x7fef699c6ed6 in GetHashPageStatistics (page=0x1060c2c "",
stat=0x7ffd76846230) at hashfuncs.c:123
#4  0x7fef699c70a4 in hash_page_stats (fcinfo=0x7ffd768463e0) at
hashfuncs.c:169
#5  0x00682e6b in ExecMakeTableFunctionResult
(funcexpr=0x10457f0, econtext=0x1044e28, argContext=0x104ee38,
expectedDesc=0x1047640,
randomAccess=0 '\000') at execQual.c:2216
#6  0x006a88e2 in FunctionNext (node=0x1044d10) at nodeFunctionscan.c:94
#7  0x0068a3d9 in ExecScanFetch (node=0x1044d10,
accessMtd=0x6a882b , recheckMtd=0x6a8c10
) at execScan.c:95
#8  0x0068a448 in ExecScan (node=0x1044d10, accessMtd=0x6a882b
, recheckMtd=0x6a8c10 ) at
execScan.c:145
#9  0x006a8c45 in ExecFunctionScan (node=0x1044d10) at
nodeFunctionscan.c:268
#10 0x0067f0cf in ExecProcNode (node=0x1044d10) at execProcnode.c:449
#11 0x0067b40b in ExecutePlan (estate=0x1044bf8,
planstate=0x1044d10, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001',
numberTuples=0, direction=ForwardScanDirection, dest=0x10444c8) at
execMain.c:1567
#12 0x00679527 in standard_ExecutorRun (queryDesc=0xfac778,
direction=ForwardScanDirection, count=0) at execMain.c:338
#13 0x006793ab in ExecutorRun (queryDesc=0xfac778,
direction=ForwardScanDirection, count=0) at execMain.c:286
#14 0x00816b3e in PortalRunSelect (portal=0xfa49e8, forward=1
'\001', count=0, dest=0x10444c8) at pquery.c:948
#15 0x008167d1 in PortalRun (portal=0xfa49e8,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x10444c8,
altdest=0x10444c8,
completionTag=0x7ffd76846c60 "") at pquery.c:789
#16 0x00810a27 in exec_simple_query (query_string=0x1007018
"SELECT * FROM hash_page_stats(get_raw_page('hash_i4_index', 0));") at
postgres.c:1094
#17 0x00814b5e in PostgresMain (argc=1, argv=0xfb1d08,
dbname=0xf873d8 "postgres", username=0xfb1b70 "edb") at
postgres.c:4070
#18 0x0078990c in BackendRun (port=0xfacb50) at postmaster.c:4260



3). Could you please let me know what the hard-coded values '*5' and
'+1' represent in the lines of code below? You have used them when
allocating memory for storing the spare pages allocated before a
certain split point, and the block numbers of bitmap pages, inside
hash_metap().

spares = palloc0(HASH_MAX_SPLITPOINTS * 5 + 1);

mapp = palloc0(HASH_MAX_BITMAPS * 5 + 1);

I guess it would be better to add some comments when using hard-coded values.

4). Inside hash_page_stats(), you are first calling verify_hash_page()
to verify whether it is a hash page or not, and then calling
GetHashPageStatistics() to get the page stats, wherein you are trying
to check what the type of the hash page is. Below is the code snippet for
this:

+   /* page type (flags) */
+   if (opaque->hasho_flag & LH_META_PAGE)
+   stat->type = 'm';
+   else if (opaque->hasho_flag & LH_OVERFLOW_PAGE)
+   stat->type = 'v';
+   else if (opaque->hasho_flag & LH_BUCKET_PAGE)
+   stat->type = 'b';
+   else if (opaque->hasho_flag & LH_BITMAP_PAGE)
+   stat->type = 'i';
+   else
+   stat->type = 'u';

My question is: can we have a hash page that does not fall under the
category of meta page, overflow page, bitmap page, or bucket page? I guess
we can't have such a page, and in case we do find such an undefined page
type, I think we should error out instead of reading the page, because it
is quite possible that the page is corrupted. Please let me know your
thoughts on this.

5). I think we have added enough functions to show the

Re: [HACKERS] pgbench - compute & show latency consistently

2016-09-21 Thread Kuntal Ghosh
On Wed, Sep 21, 2016 at 4:05 PM, Heikki Linnakangas  wrote:

> pgbench.sgml actually already had the "latency average = ..." version in its
> example. Even before this patch, we printed it with a "=" if one of the options
> that caused per-transaction timings to be measured, like --rate, was used,
> and as ":" otherwise.
>
> - Heikki
>
I'm talking about the first example in pgbench.sgml.

Typical output from pgbench looks like:

transaction type: 
scaling factor: 10
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 1000
number of transactions actually processed: 1/1
tps = 85.184871 (including connections establishing)
tps = 85.296346 (excluding connections establishing)


Shouldn't we include latency average here as well and explain what it is?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench - compute & show latency consistently

2016-09-21 Thread Heikki Linnakangas

(I just pushed the patch, didn't see your post until after that)

On 09/21/2016 01:07 PM, Kuntal Ghosh wrote:

On Thu, July 7, 2016 at 08:39 PM, Fabien COELHO  wrote:

Also there is still the bug under -t which displays a 0 latency.

Your patch clearly fixed the issue.


The attached patch still fixes that and makes it consistent the other way
around, i.e. by using "=" for latency. I switched to using ":" for weight,
which is an input parameter. I kept ":" when there is a long sentence
describing the displayed figure, more on aesthetic grounds.

In the above context, I suggest a few other changes.

Present output-> progress: 1.0 s, 1221.0 tps, lat 0.816 ms stddev 0.272
Suggestion-> progress: 1.0 s, 1221.0 tps, lat avg 0.816 ms stddev 0.272 ms


Yeah, perhaps.


Present output->
SQL script 1: so.sql
   - weight = 1 (targets 50.0% of total)
   - 10010 transactions (50.1% of total, tps = 100.101872)
   - latency average = 1.878 ms
   - latency stddev = 3.614 ms
Suggestion->
SQL script 1: so.sql
   - weight = 1 (targets 50.0% of total)
   - 10010 transactions (50.1% of total)
   - tps = 100.101872
   - latency average = 1.878 ms
   - latency stddev = 3.614 ms


I think it fits well on a single line.


Apart from that, pgbench.sgml should be updated to reflect latency
average in the output.


pgbench.sgml actually already had the "latency average = ..." version in
its example. Even before this patch, we printed it with a "=" if one of the
options that caused per-transaction timings to be measured, like --rate,
was used, and as ":" otherwise.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench - compute & show latency consistently

2016-09-21 Thread Heikki Linnakangas

On 07/13/2016 11:39 AM, Fabien COELHO wrote:

 number of transactions per client: 1000
-latency average = 15.844 ms
+latency average: 15.844 ms
 tps = 618.764555 (including connections establishing)


I think what you have here is that colons separate input parameters and
equal signs separate result output.  So I think it's OK the way it is.


Hmmm... Then other measures displayed are not all consistent with this
theory.

Also there is still the bug under -t which displays a 0 latency.

The attached patch still fixes that and makes it consistent the other way
around, i.e. by using "=" for latency. I switched to using ":" for weight,
which is an input parameter. I kept ":" when there is a long sentence
describing the displayed figure, more on aesthetic grounds.


Committed, thanks!

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench - compute & show latency consistently

2016-09-21 Thread Kuntal Ghosh
On Thu, July 7, 2016 at 08:39 PM, Fabien COELHO  wrote:
> Also there is still the bug under -t which displays a 0 latency.
Your patch clearly fixed the issue.

> The attached patch still fixes that and makes it consistent the other way
> around, i.e. by using "=" for latency. I switched to using ":" for weight,
> which is an input parameter. I kept ":" when there is a long sentence
> describing the displayed figure, more on aesthetic grounds.
In the above context, I suggest a few other changes.

Present output-> progress: 1.0 s, 1221.0 tps, lat 0.816 ms stddev 0.272
Suggestion-> progress: 1.0 s, 1221.0 tps, lat avg 0.816 ms stddev 0.272 ms

Present output->
SQL script 1: so.sql
   - weight = 1 (targets 50.0% of total)
   - 10010 transactions (50.1% of total, tps = 100.101872)
   - latency average = 1.878 ms
   - latency stddev = 3.614 ms
Suggestion->
SQL script 1: so.sql
   - weight = 1 (targets 50.0% of total)
   - 10010 transactions (50.1% of total)
   - tps = 100.101872
   - latency average = 1.878 ms
   - latency stddev = 3.614 ms

Apart from that, pgbench.sgml should be updated to reflect latency
average in the output.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-09-21 Thread Petr Jelinek
On 21/09/16 09:18, Vik Fearing wrote:
> On 09/21/2016 08:30 AM, Michael Paquier wrote:
>> On Sat, Sep 17, 2016 at 2:04 AM, Masahiko Sawada  
>> wrote:
>>> Since we already released 9.6RC1, I understand that it's quite hard to
>>> change the syntax in 9.6.
>>> But considering that we support the quorum commit, this could be one
>>> of the solutions to avoid breaking backward compatibility and
>>> to provide a useful user interface.
>>> So I attached these patches.
>>
>>  standby_config:
>> -standby_list{ $$ = create_syncrep_config("1", $1); }
>> -| FIRST NUM '(' standby_list ')'{ $$ =
>> create_syncrep_config($1, $4); }
>> +standby_list{ $$ =
>> create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
>> +| ANY NUM '(' standby_list ')'{ $$ =
>> create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
>> +| FIRST NUM '(' standby_list ')'{ $$ =
>> create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
>>
>> Reading the thread again, it seems that my previous post [1] was a bit
>> misunderstood. My position is to not introduce any new behavior
>> changes in 9.6, so we could just make the FIRST NUM grammar equivalent
>> to NUM.
>>
>> [1]: 
>> https://www.postgresql.org/message-id/CAB7nPqRDvJn18e54ccNpOP1A2_iUN6-iU=4njgmmgiagvcs...@mail.gmail.com
> 
> I misunderstood your intent, then.  But I still stand by what I did
> understand, namely that 'k (...)'  should mean 'any k (...)'.  It's much
> more natural than having it mean 'first k (...)' and I also think it
> will be more frequent in practice.
> 

I think so as well.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-21 Thread Petr Jelinek
On 21/09/16 05:35, Steve Singer wrote:
> On Tue, 20 Sep 2016, Peter Eisentraut wrote:
> 
>> On 9/18/16 4:17 PM, Steve Singer wrote:
> 
>>
>> I think if we want to prevent the creation of subscriptions that point
>> to self, then we need to create a magic token when the postmaster starts
>> and check for that when we connect.  So more of a running-instance
>> identifier instead of an instance-on-disk identifier.
>>
>> The other option is that we just allow it and make it more robust.
> 
> I think we should go with the second option for now. I feel that the
> effort is better spent making sure that initial syncs that hang or don't
> subscribe (for whatever reason) can be aborted, instead of trying to
> build a concept of node identity before we really need it.
> 

Well, connecting to yourself will always hang though, because the slot
creation needs a snapshot and it will wait forever for the current query
to finish. So it will never really work. The hanging query is now
abortable though.

The question is whether building the logical snapshot is really required,
since we don't really use the snapshot for anything here.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/Postgr

2016-09-21 Thread Amit Kapila
On Tue, Sep 20, 2016 at 10:33 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, May 26, 2016 at 7:44 PM, Michael Paquier
>>  wrote:
>>> Yeah, it is definitely better to register it. And I have switched the
>>> patch as ready for committer just now.
>
>> Committed and back-patched to 9.4, where DSM was introduced.
>
> ISTM both the previous coding and this version can fail for no good
> reason, that is what if GetLastError() happens to return one of these
> error codes as a result of some unrelated failure from before this
> subroutine is entered?  That is, wouldn't it be a good idea to
> do SetLastError(0) before calling CreateFileMapping?
>

Yes, that seems like a good idea.  Do you need a patch, with some
testing done in a Windows environment?

>  Or is the latter
> guaranteed to do that on success?
>

I don't see any such guarantee on the MSDN page, and it appears from the
GetLastError()/SetLastError() specs [1][2] that functions which set the
last error code to zero on success do mention it in their
documentation.
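
For clarity, a minimal sketch of the ordering being suggested (illustrative
only, not a patch; hmap, size_high, size_low and segment_name are assumed
variables):

#include <windows.h>

/*
 * Clear any stale error code before the call, so a leftover
 * ERROR_ALREADY_EXISTS from unrelated code cannot be mistaken for a
 * name collision on this segment.
 */
SetLastError(0);
hmap = CreateFileMapping(INVALID_HANDLE_VALUE,  /* backed by the paging file */
                         NULL,                  /* default security */
                         PAGE_READWRITE,
                         size_high, size_low,   /* requested segment size */
                         segment_name);
if (hmap != NULL && GetLastError() == ERROR_ALREADY_EXISTS)
{
    /* genuine name collision: close the handle and retry with another name */
    CloseHandle(hmap);
}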


[1] - 
https://msdn.microsoft.com/en-us/library/windows/desktop/ms679360(v=vs.85).aspx
[2] - 
https://msdn.microsoft.com/en-us/library/windows/desktop/ms680627(v=vs.85).aspx



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error running custom plugin: “output plugins have to declare the _PG_output_plugin_init symbol”

2016-09-21 Thread valeriof
The link was very helpful. It's a standard 'Attach to process' approach with
Visual Studio and it works just as expected. Thank you Ashutosh!



--
View this message in context: 
http://postgresql.nabble.com/Error-running-custom-plugin-output-plugins-have-to-declare-the-PG-output-plugin-init-symbol-tp5921145p5922077.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] get_home_path: use HOME

2016-09-21 Thread Aleksander Alekseev
> I work in an environment where servers are administered by people
> with different user names and identical uid (0).

Multiple users with the same uid is unorthodox indeed. Just out of curiosity -
what environment is this, if it's not a secret?

> The usage of the HOME environment variable (if set) is IMO the right,
> standard and faster way for get_home_path().

As a side note, I personally think that considering the $HOME environment
variable is not such a bad idea. However, I think we should first make sure
that this is really a bug that is relatively easy to reproduce in
real-world environments, and not just a hack for a single misconfigured
system.
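
For illustration, a minimal sketch of the behaviour being discussed (the
function and argument names are made up, not the actual src/port code):
prefer $HOME when it is set, and fall back to the passwd entry for the
effective uid otherwise.

#include <pwd.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch only: resolve the home directory, preferring $HOME. */
static bool
get_home_path_sketch(char *ret_path, size_t ret_len)
{
    const char *home = getenv("HOME");
    struct passwd *pw;

    if (home != NULL && home[0] != '\0')
    {
        snprintf(ret_path, ret_len, "%s", home);
        return true;
    }

    /* Fall back to the passwd database keyed by the effective uid. */
    pw = getpwuid(geteuid());
    if (pw == NULL)
        return false;
    snprintf(ret_path, ret_len, "%s", pw->pw_dir);
    return true;
}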

-- 
Best regards,
Aleksander Alekseev


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2016-09-21 Thread Craig Ringer
On 20 September 2016 at 22:46, Robert Haas  wrote:

> I did some cleanup of 0001 (see attached) and was all set to commit it
> when I realized what I think is a problem: holding XidGenLock doesn't
> seem to help with the race condition between this function and CLOG
> truncation, because vac_truncate_clog() updates the shared memory
> limits AFTER performing the truncation.

Thanks ... and drat.

> If the order of those
> operations were reversed, we'd be fine, because it would get stuck
> trying to update the shared memory limits and wouldn't be able to
> truncate until it did - and conversely, if it updated the shared
> memory limits before we examined them, that would be OK, too, because
> we'd be sure not to consult the pages that are about to be truncated.

I'm hesitant to mess with something that fundamental for what I was
hoping was a low-impact feature, albeit one that seems to be trying
hard not to be at every turn.  It looks pretty reasonable to update
oldestXid before clog truncation but I don't want to be wrong and
create some obscure race or crash recovery issue related to wraparound
and clog truncation. It could well be a problem if we're very close to
wraparound.

So far nothing has had any reason to care about this, since there's no
way to attempt to access an older xid in a normally functioning
system. The commit timestamp lookup code doesn't care whether the xid
is still in clog or not and nothing else does lookups based on xids
supplied by the user. If anything else did or does in future it will
have the same problem as txid_status.

We've already ruled out releasing the slru LWLock in SlruReportIOError
then PG_TRY / PG_CATCH()ing clog access errors in txid_status() per my
original approach to this issue.

So I see a few options now:

* Do nothing. Permit this race to exist, document the error, and
expect apps to cope. I'm pretty tempted to go for exactly this since
it pushes the cost onto users of the feature and doesn't make a mess
elsewhere. People who use this will typically be invoking it as a
standalone toplevel function anyway, so it's mostly just a bit of
noise in the error logs if you lose the race - and we have plenty of
other sources of that already.

* Rather than calling TransactionIdDidCommit / TransactionIdDidAbort,
call clog.c's TransactionIdGetStatus with a new missing_ok flag. That
flag would map to a bool *missing param added to SimpleLruReadPage_ReadOnly(...)
and in turn to SimpleLruReadPage(...), which would be set instead of calling
SlruReportIOError(...). This seems rather intrusive and will add
little-used params and paths to fairly hot slru.c code, so I'm not
keen.

* Take CLogControlLock LW_SHARED in txid_status() to prevent
truncation before reading oldestXid. We'd need a way to pass an
"already locked" state through TransactionIdGetStatus(...) to
SimpleLruReadPage_ReadOnly(...), which isn't great since again it's
pretty hot code.

* Don't release the slru LWLock in SlruReportIOError; instead release
CLogControlLock from txid_status on clog access failure. As before
means PG_TRY / PG_CATCH without PG_RE_THROW, but it means the only
place it affects is callers of txid_status(...). For added safety,
restrict txid_status() to being called in a toplevel virtual xact so
we know we'll finish up promptly. It's a horrible layering violation
having txid_status(...) release the clog control lock though, and
seems risky.

The only non-horrible one of those is IMO to just let the caller see
an error if they lose the race. It's a function that's intended for
use when you're dealing with indeterminate transaction state after a
server or application error anyway, so it's part of an error path
already. So I say we just document the behaviour. Because slru.c
doesn't release its LWLock on error, we also need to ensure
txid_status(...) is only called from a toplevel xact, so the user
doesn't attempt to wrap it in a plpgsql BEGIN ... EXCEPTION block and
the error simply causes the xact to abort.
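
A minimal sketch of the kind of toplevel-only guard that implies
(illustrative fragment; the error code and wording are assumptions, not the
actual patch):

    /*
     * Refuse to run inside a subtransaction so a clog access error cannot
     * be caught by a plpgsql EXCEPTION block while slru.c still holds its
     * LWLock.
     */
    if (IsSubTransaction())
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("txid_status cannot be used in a subtransaction")));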

Will follow up with just that.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-09-21 Thread Michael Paquier
On Tue, Sep 20, 2016 at 11:13 PM, Stas Kelvich  wrote:
>> On 20 Sep 2016, at 09:40, Michael Paquier  wrote:
>> +   StartupCLOG();
>> +   StartupSUBTRANS(checkPoint.oldestActiveXid);
>> +   RecoverPreparedFromFiles();
>> [...]
>>/*
>> -* Startup commit log and subtrans only.  MultiXact and commit
>> -* timestamp have already been started up and other SLRUs are not
>> -* maintained during recovery and need not be started yet.
>> -*/
>> -   StartupCLOG();
>> -   StartupSUBTRANS(oldestActiveXID);
>> Something is definitely wrong in this patch if we begin to do that.
>
> Putting that before actual WAL replay is just following historical order of 
> events.
> Prepared files are pieces of WAL that happened before checkpoint, so ISTM
> there is no conceptual problem in restoring their state before replay.

For wal_level = minimal there is no need to call that to begin with.
And I'd think that it is better to continue doing things the same way.

>> I have been thinking about this patch quite a bit, and the approach
>> taken looks really over-complicated to me. Basically, what we are
>> trying to do here is to reuse during recovery, as much as possible, the
>> code paths that are used by non-recovery code, and then try to
>> keep a bunch of things like SUBTRANS structures, CLOG, XID handling
>> and PGXACT in sync to handle the 2PC information in memory
>> correctly. I am getting to think that this is *really* bug-prone
>> and really going to make future maintenance hard.
>
> With this patch we are reusing one infrastructure for normal work, recovery 
> and replay.

The mix between normal work and recovery is the scary part. Normal
work and recovery are two entirely different things.

>> Taking one step back, and I know that I am a noisy one, but looking at
>> this patch is making me aware of the fact that it is trying more and
>> more to patch things around the more we dig into issues, and I'd
>> suspect that trying to adapt for example sub-transaction and clog
>> handling is just the tip of the iceberg and that there are more things
>> that need to be taken care of if we continue to move on with this
>> approach.
>
> Again, it isn’t patching around to fix issues; it’s totally possible to keep
> the old interface.
> However it’s possible that the current approach is wrong because of some aspect
> that I didn’t think of, but for now I don’t see good counterarguments.

Any missing points could be costly here. If we have a way to make
things with the same performance

>> Coming to the point: why don't we simplify things? In short:
>
>> - Just use a static list and allocate a chunk of shared memory as
>> needed. DSM could come into play here to adapt to the size of a 2PC
>> status file, this way there is no need to allocate a chunk of memory
>> with an upper bound. Still we could use an hardcoded upper-bound of
>> say 2k with max_prepared_transactions, and just write the 2PC status
>> file to disk if its size is higher than what can stick into memory.
>> - save 2PC information in memory instead of writing it to a 2PC file
>> when XLOG_XACT_PREPARE shows up.
>> - flush information to disk when there is a valid restart point in
>> RecoveryRestartPoint(). We would need as well to tweak
>> PrescanPreparedTransactions accordingly than everything we are trying
>> to do here.
>> That would be way more simple than what's proposed here, and we'd
>> still keep all the performance benefits.
>
> So the file arrives at the replica through WAL, and instead of directly
> reading it you suggest copying it to DSM if it is small enough, and to the
> filesystem if not. Probably that will keep us to only reading/writing
> files, but:
>
> * That will be measurably slower than the proposed approach because of
> unnecessary copying between WAL and DSM. Not to mention prepare records of
> arbitrary length.
> * Different behaviour on replica and master — less tested code for replica.

Well, the existing code on HEAD is battle-tested as well. This
reduces the likelihood of bugs at replay with new features.

> * Almost the same behaviour can be achieved by delaying fsync on 2pc files 
> till
> checkpoint.

Oh, that's an idea here. It may be interesting to see if this meets
your performance goals... And that could result in a far smaller
patch.

> But if I reword your comment to say that it is better to avoid replaying the
> prepare record at all, as happens now in master, then I see the point. And
> that is possible even without DSM: it is possible to allocate a statically
> sized array storing some info about each tx (whether it is in the WAL or in
> a file, its xid, its gid).
> Some sort of PGXACT doppelganger used only for replay purposes instead of
> the normal one.

That's exactly what I meant. The easiest approach is just to allocate
a chunk of shared memory made of 2PC_STATUS_SIZE * max_prepared_xacts
with 2PC_STATUS_SIZE set up at an arbitrary size that we find
appropriate to

Re: [HACKERS] GiST penalty functions [PoC]

2016-09-21 Thread Heikki Linnakangas

On 09/14/2016 07:57 PM, Andrew Borodin wrote:

Here is v6 of the cube penalty patch.
There was a warning about an unused edge function on systems without
__STDC_IEC_559__ defined; this patch fixes it.


Thanks!

Reading through this thread again, I think we got a bit side-tracked 
with this float bit-twiddling aspect, and we haven't really discussed 
the algorithm itself. So:


On 08/29/2016 09:04 AM, Andrew Borodin wrote:

Some time ago I put a patch in the commitfest that slightly optimizes GiST
inserts and updates. Its effect in general is quite modest (about a 15%
performance improvement), but for sorted data inserts are 3 times quicker.
I attribute this effect to the mitigation of deficiencies in the old split
algorithm.
Alexander Korotkov advised me to look up some features of the RR*-tree.

>

The RR*-tree differs from the combination of GiST + the cube extension in two
algorithms:
1. The algorithm to choose the subtree for insertion
2. The split algorithm


Got a link for a description of the RR*-tree? Would be good to reference 
it in the patch comments, too.



This message is about the implementation of the first one in GiST. I
think that both of these algorithms will improve querying performance.

Long story short, there are two problems in the choose-subtree algorithm.
1. GiST chooses the subtree via a penalty calculation for each entry,
while the RR*-tree employs complicated conditional logic: when there are
MBBs (Minimum Bounding Boxes) which do not require extension, the smallest
of them is chosen; in some cases, entries are chosen not by volume,
but by their margin.
2. The RR*-tree algorithm jumps through the entry comparisons non-linearly;
I think this means that the GiST penalty computation function will have to
access other entries on the page.

In this message I address only the first problem. I want to construct a
penalty function that will choose:
1. The entry with a zero volume and the smallest margin that can
accommodate the insertion tuple without extension, if one exists;
2. The entry with the smallest volume that can accommodate the insertion
tuple without extension, if one exists;
3. The entry with zero volume increase after the insert and the smallest
margin increase, if one exists;
4. The entry with the minimal volume increase after the insert.

The current cube behavior inspects only the 4th option, by returning the
MBB’s volume increase as the penalty (a float). To implement all 4 options, I
use a hack: the order of positive floats corresponds exactly to the order of
integers with the same binary representation (I’m not sure this holds on
every supported platform). So I use two bits of the float as if they were an
integer, and all the others are used as shifted bits of the corresponding
float: option 4: volume increase, 3: margin increase, 2: MBB
volume, 1: MBB margin. You can check the reinterpretation cast
function pack_float() in the patch.
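
For reference, a minimal sketch of that reinterpretation trick, assuming
IEEE-754 single-precision floats and a non-negative input; this is
illustrative only, not necessarily the patch's exact pack_float():

#include <stdint.h>
#include <string.h>

/*
 * Positive IEEE-754 floats sort the same way as their bit patterns, so we
 * can shift the value's bits down by two and put a 2-bit "realm" (the
 * option number above) into the freed high-order bits below the sign bit.
 * Assumes value >= 0 and realm in 0..3.
 */
static float
pack_float_sketch(float value, int realm)
{
    uint32_t bits;

    memcpy(&bits, &value, sizeof(bits));            /* reinterpret safely */
    bits = (bits >> 2) | ((uint32_t) realm << 29);  /* realm above the value bits */
    memcpy(&value, &bits, sizeof(bits));
    return value;
}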


If I understand correctly, cases #1 and #3 arise when one of the 
dimensions is 0. For example, in a 3D space, if the existing entry is a 
rectangle on a plane, with a zero-length edge on one of the dimensions, 
and the new entry is on the same plane. Case #1 arises if the new entry 
falls within that rectangle, and case #3 if it's outside it. Currently, 
we treat all such cases as 0-penalty, because the degenerate 0-dimension 
causes all the calculated volumes to become zero. So clearly we can do 
better, which is what this patch does.


At first blush, I'm surprised that you switch to using the sum of the 
edges in those cases. I would expect to ignore the degenerate 
0-dimension, and calculate the volume using the other dimensions. So in 
a 3D space, you would calculate the increase of the area of the 
rectangle (A*B), not the sum of the edges (A+B). And it probably would 
be best to take into account how many of the dimensions are zero. So in 
a 3D space, if there is an existing line segment that the new point 
falls into, and also a rectangle that the new point falls into, you 
should prefer the 1-dimensional line segment over the 2-dimensional 
rectangle.
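
A rough sketch of that alternative, just to make the suggestion concrete
(the names are made up, not from the patch):

/*
 * Sketch only: form the "volume" over the non-degenerate dimensions and
 * count how many edges are zero-length, so that a lower-dimensional
 * container can be preferred over a higher-dimensional one.
 */
static double
nonzero_volume(const double *lower, const double *upper, int ndims,
               int *zero_edges)
{
    double  vol = 1.0;
    int     i;

    *zero_edges = 0;
    for (i = 0; i < ndims; i++)
    {
        double  edge = upper[i] - lower[i];

        if (edge == 0.0)
            (*zero_edges)++;
        else
            vol *= edge;
    }
    return vol;
}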


I don't know how big a difference that makes in practice. But it seems 
odd that if you e.g. have a set of 3D points, but the Z dimension in all 
of the points is 0, the algorithm behaves differently than if you had 
the exact same points in a 2D space.


(If this is explained in the RR*-tree paper, feel free to just point me 
to that and I'll ask again if I have further questions.)


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-09-21 Thread Vik Fearing
On 09/21/2016 08:30 AM, Michael Paquier wrote:
> On Sat, Sep 17, 2016 at 2:04 AM, Masahiko Sawada  
> wrote:
>> Since we already released 9.6RC1, I understand that it's quite hard to
>> change the syntax in 9.6.
>> But considering that we support the quorum commit, this could be one
>> of the solutions to avoid breaking backward compatibility and
>> to provide a useful user interface.
>> So I attached these patches.
> 
>  standby_config:
> -standby_list{ $$ = create_syncrep_config("1", $1); }
> -| FIRST NUM '(' standby_list ')'{ $$ =
> create_syncrep_config($1, $4); }
> +standby_list{ $$ =
> create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
> +| ANY NUM '(' standby_list ')'{ $$ =
> create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
> +| FIRST NUM '(' standby_list ')'{ $$ =
> create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
> 
> Reading the thread again, it seems that my previous post [1] was a bit
> misunderstood. My position is to not introduce any new behavior
> changes in 9.6, so we could just make the FIRST NUM grammar equivalent
> to NUM.
> 
> [1]: 
> https://www.postgresql.org/message-id/CAB7nPqRDvJn18e54ccNpOP1A2_iUN6-iU=4njgmmgiagvcs...@mail.gmail.com

I misunderstood your intent, then.  But I still stand by what I did
understand, namely that 'k (...)'  should mean 'any k (...)'.  It's much
more natural than having it mean 'first k (...)' and I also think it
will be more frequent in practice.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers