Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-10 Thread Rahila Syed
>I am sorry but I can't understand the above results due to wrapping.
>Are you saying compression was twice as slow?

CPU usage at user level (in seconds) for compression set 'on' is 562 secs,
while that for compression set 'off' is 354 secs. As per the readings, it
takes a little less than double the CPU time to compress.
However, the total time taken to run 25 transactions for each scenario
is as follows:

compression = 'on'  : 1838 secs
compression = 'off' : 1701 secs

The difference is around 140 secs.

Thank you,
Rahila Syed


On Wed, Dec 10, 2014 at 7:55 PM, Bruce Momjian  wrote:

> On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
> > The tests ran for around 30 mins.Manual checkpoint was run before each
> test.
> >
> > Compression  WAL generated  %compression  Latency-avg  CPU usage (seconds)                      TPS     Latency stddev
> >
> > on           1531.4 MB      ~35 %         7.351 ms     user diff: 562.67s  system diff: 41.40s  135.96  13.759 ms
> >
> > off          2373.1 MB                    6.781 ms     user diff: 354.20s  system diff: 39.67s  147.40  14.152 ms
> >
> > The compression obtained is quite high, close to 35 %.
> > CPU usage at user level when compression is on is quite noticeably high
> > as compared to that when compression is off. But the gain in terms of
> > reduction of WAL is also high.
>
> I am sorry but I can't understand the above results due to wrapping.
> Are you saying compression was twice as slow?
>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
>   + Everyone has their own god. +
>


Re: [HACKERS] Commitfest problems

2014-12-10 Thread Josh Berkus
On 12/10/2014 10:35 PM, Bruce Momjian wrote:
> I have heard repeated concerns about the commitfest process in the past
> few months.  The fact we have been in a continual commitfest since
> August also is concerning.
> 
> I think we are reaching the limits on how much we can do with our
> existing commitfest structure, and it might be time to consider changes.
> While the commitfest process hasn't changed much and was very successful
> in the first few years, a few things have changed externally:
> 
>   1  more new developers involved in contributing small patches
>   2  more full-time developers creating big patches
>   3  increased time demands on experienced Postgres developers

I will add:

4. commitfest managers have burned out and refuse to do it again

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Commitfest problems

2014-12-10 Thread Bruce Momjian
I have heard repeated concerns about the commitfest process in the past
few months.  The fact we have been in a continual commitfest since
August also is concerning.

I think we are reaching the limits on how much we can do with our
existing commitfest structure, and it might be time to consider changes.
While the commitfest process hasn't changed much and was very successful
in the first few years, a few things have changed externally:

1  more new developers involved in contributing small patches
2  more full-time developers creating big patches
3  increased time demands on experienced Postgres developers

These changes are all driven by increased Postgres popularity.  In an
ideal world, as 1 and 2 increase, the time experienced developers have
to work on commitfest items would also increase, but the opposite has
happened.  Many of our most experienced developers (3) hold responsible
positions in their organizations which demand their time.

You would think that more full-time developers (2) would help with the
commitfest, but often their paid work is in a specific subsystem of
Postgres, preventing them from effectively helping with other complex
patches.

We have always known that we can create novice developers faster than
experienced ones, but it seems this difference is accelerating.

A good example is the UPSERT patch, which, while receiving extensive
feedback on the user interface and syntax, has not had a full review,
though it has been posted for several months.

It seems like a good time to consider options for restructuring our
process.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5 release scheduling (was Re: logical column ordering)

2014-12-10 Thread Josh Berkus
On 12/10/2014 09:35 PM, Tom Lane wrote:
> Josh Berkus  writes:
>> On 12/10/2014 05:14 PM, Stephen Frost wrote:
>>> * Andres Freund (and...@2ndquadrant.com) wrote:
 But the scheduling of commits with regard to the 9.5 schedule actually
 opens a relevant question: When are we planning to release 9.5? Because
 If we try ~ one year from now it's a whole different ballgame than if we
 try to go back to september. And I think there's pretty good arguments
 for both.
> 
>>> This should really be on its own thread for discussion...  I'm leaning,
>>> at the moment at least, towards the September release schedule.  I agree
>>> that having a later release would allow us to get more into it, but
>>> there's a lot to be said for the consistency we've kept up over the past
>>> few years with a September (our last non-September release was 8.4).
> 
>> Can we please NOT discuss this in the thread about someone's patch?  Thanks.
> 
> Quite.  So, here's a new thread.
> 
> MHO is that, although 9.4 has slipped more than any of us would like,
> 9.5 development launched right on time in August.  So I don't see a
> good reason to postpone 9.5 release just because 9.4 has slipped.
> I think we should stick to the schedule agreed to in Ottawa last spring.
> 
> Comments?

If anything, it seems like a great reason to try to get 9.5 out BEFORE
we open the tree for 9.6/10.0/Cloud++.  While there were technical
issues, 9.4 dragged a considerable amount because most people were
ignoring it in favor of 9.5 development.

So far, I haven't seen any features for 9.5 which would delay a more
timely release the way we did for 9.4.  Anybody know of a bombshell
someone's going to drop on us for CF5?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-12-10 Thread Peter Geoghegan
On Mon, Dec 8, 2014 at 8:16 PM, Peter Geoghegan  wrote:
> Attached revision, v1.6, slightly tweaks the ordering of per-statement
> trigger execution.

Right now, there is no way for a before row insert/update trigger to
determine whether it was called as part of an INSERT ... ON CONFLICT
UPDATE or not. It's also not possible for a DO INSTEAD trigger on a
view (a before row insert trigger) to determine that it was called
specifically due to an INSERT...IGNORE (which I think ought to imply
that any corresponding, "redirected" insertion into a table should
also use IGNORE ... that's at least going to be something that a
certain number of apps will need to be made robust against).

The question is: Do we want to expose this distinction to triggers?
The natural way to do so would probably be to add a TG_SPECULATIVE
special variable to plpgsql (and equivalent variables in other PLs).
This text variable would be either "UPSERT" or "IGNORE"; it would be
NULL when it was not applicable (e.g. with traditional INSERTs).
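
For illustration, a before-insert trigger could branch on such a variable
roughly like this (a sketch only: TG_SPECULATIVE is the proposed variable,
and the function name is made up):

CREATE OR REPLACE FUNCTION log_speculative_insert() RETURNS trigger AS $$
BEGIN
    -- TG_SPECULATIVE is hypothetical; it would be NULL for plain INSERTs
    IF TG_SPECULATIVE = 'IGNORE' THEN
        RAISE NOTICE 'row for % arrived via INSERT ... IGNORE', TG_TABLE_NAME;
    ELSIF TG_SPECULATIVE = 'UPSERT' THEN
        RAISE NOTICE 'row for % arrived via INSERT ... ON CONFLICT UPDATE', TG_TABLE_NAME;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;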

How do people feel about this? Is it important to include this in our
initial cut of the feature? I thought that I'd avoid that kind of
thing until prompted to address it by others, since it probably won't
end up being a common concern, but I'd like to hear a few opinions.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inherit support for foreign tables

2014-12-10 Thread Ashutosh Bapat
I marked this as ready for committer.

On Thu, Dec 11, 2014 at 8:39 AM, Etsuro Fujita 
wrote:

> Hi Ashutosh,
>
> Thanks for the review!
>
> (2014/12/10 14:47), Ashutosh Bapat wrote:
>
>> We haven't heard anything from Horiguchi-san and Hanada-san for almost a
>> week. So, I am fine marking it as "ready for committer". What do you say?
>>
>
> ISTM that both of them are not against us, so let's ask for the
> committer's review!
>
>
> Thanks,
>
> Best regards,
> Etsuro Fujita
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


[HACKERS] 9.5 release scheduling (was Re: logical column ordering)

2014-12-10 Thread Tom Lane
Josh Berkus  writes:
> On 12/10/2014 05:14 PM, Stephen Frost wrote:
>> * Andres Freund (and...@2ndquadrant.com) wrote:
>>> But the scheduling of commits with regard to the 9.5 schedule actually
>>> opens a relevant question: When are we planning to release 9.5? Because
>>> If we try ~ one year from now it's a whole different ballgame than if we
>>> try to go back to september. And I think there's pretty good arguments
>>> for both.

>> This should really be on its own thread for discussion...  I'm leaning,
>> at the moment at least, towards the September release schedule.  I agree
>> that having a later release would allow us to get more into it, but
>> there's a lot to be said for the consistency we've kept up over the past
>> few years with a September (our last non-September release was 8.4).

> Can we please NOT discuss this in the thread about someone's patch?  Thanks.

Quite.  So, here's a new thread.

MHO is that, although 9.4 has slipped more than any of us would like,
9.5 development launched right on time in August.  So I don't see a
good reason to postpone 9.5 release just because 9.4 has slipped.
I think we should stick to the schedule agreed to in Ottawa last spring.

Comments?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] double vacuum in initdb

2014-12-10 Thread Tom Lane
Peter Eisentraut  writes:
> initdb currently does
> PG_CMD_PUTS("ANALYZE;\nVACUUM FULL;\nVACUUM FREEZE;\n");
> FREEZE is now part of FULL, so this seems redundant.  Also, ANALYZE can
> be run as part of VACUUM.  So this could be
> PG_CMD_PUTS("VACUUM FULL ANALYZE;\n");

Merging the ANALYZE step would save few cycles, and it'd probably result
in the contents of pg_statistic being at least partly unfrozen at the end
of the process, so please don't go that far.

> There has been some concerns about time spent in initdb in test suites,
> which is why I looked into this.  In testing, this change can shave off
> between 10% and 20% of the run time of initdb, so it would be kind of
> useful.

> The last change to this was

> commit 66cd8150636e48a8f143560136a25ec5eb355d8c
> Author: Tom Lane 
> Date:   Mon Nov 29 03:05:03 2004 +

> Clean up initdb's error handling so that it prints something more
> useful than just \'failed\' when there's a problem.  Per gripe from
> Chris Albertson.

> In an unrelated change, use VACUUM FULL; VACUUM FREEZE; rather than
> a single VACUUM FULL FREEZE command, to respond to my worries of a
> couple days ago about the reliability of doing this in one go.

> That was a long time ago.  Is that still applicable?

Probably not; what I was on about was

http://www.postgresql.org/message-id/12179.1101591...@sss.pgh.pa.us

which certainly isn't a case that exists anymore since we got rid
of the old VACUUM FULL implementation.  So I think we could go to

PG_CMD_PUTS("ANALYZE;\nVACUUM FULL FREEZE;\n");

without any degradation of the intended results.

Another idea would be to drop the FULL part and make this

PG_CMD_PUTS("ANALYZE;\nVACUUM FREEZE;\n");

which would speed initdb but it would also lose the testing angle
of being sure that we can VACUUM FULL all system catalogs.  OTOH,
I don't immediately see why we shouldn't test that somewhere in the
regression tests rather than in every initdb.  (I think part of the
argument for the forced FULL was to make sure we broke anything that
thought system catalogs had fixed relfilenodes, but that should be
pretty well a done deal by now.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-10 Thread Amit Kapila
On Wed, Dec 10, 2014 at 11:51 PM, Robert Haas  wrote:
>
> On Mon, Dec 8, 2014 at 10:59 PM, Amit Kapila 
wrote:
> > Yeah and also how would user specify the values, as an example
> > assume that table is partitioned on monthly_salary, so partition
> > definition would look:
> >
> > PARTITION BY LIST(monthly_salary)
> > (
> > PARTITION salary_less_than_thousand VALUES(300, 900),
> > PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> > ...
> > )
> >
> > Now if user wants to define multi-column Partition based on
> > monthly_salary and annual_salary, how do we want him to
> > specify the values.  Basically how to distinguish which values
> > belong to first column key and which one's belong to second
> > column key.
>
> I assume you just add some parentheses.
>
> PARTITION BY LIST (colA, colB) (PARTITION VALUES ((valA1, valB1),
> (valA2, valB2), (valA3, valB3))
>
> Multi-column list partitioning may or may not be worth implementing,
> but the syntax is not a real problem.
>

Yeah either this way or what Josh has suggested upthread, the main
point was that if at all we want to support multi-column list partitioning
then we need to have slightly different syntax, however I feel that we
can leave multi-column list partitioning for first version.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-10 Thread Amit Kapila
On Wed, Dec 10, 2014 at 7:52 PM, Alvaro Herrera 
wrote:
>
> Amit Langote wrote:
>
> > On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila 
wrote:
> > > On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera <
alvhe...@2ndquadrant.com>
> > > wrote:
>
> > >> FWIW in my original proposal I was rejecting some things that after
> > >> further consideration turn out to be possible to allow; for instance
> > >> directly referencing individual partitions in COPY.  We could allow
> > >> something like
> > >>
> > >> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> > >> or maybe
> > >> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
> > >>
> > > or
> > > COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01'  TO STDOUT
> > > COPY [TABLE] lineitems PARTITION   TO STDOUT
> > >
> > > I think we should try to support operations on partitions via main
> > > table whereever it is required.
>
> Um, I think the only difference is that you added the noise word TABLE
> which we currently don't allow in COPY,

Yeah, we could eliminate TABLE keyword from this syntax, the reason
I have kept was for easier understanding of syntax, currently we don't have
concept of PARTITION in COPY syntax, but now if we want to introduce
such a concept, then it might be better to have TABLE keyword for the
purpose of syntax clarity.

> and that you added the
> possibility of using named partitions, about which see below.
>
> > We can also allow to explicitly name a partition
> >
> > COPY [TABLE ] lineitems PARTITION lineitems_2001 TO STDOUT;
>
> The problem with naming partitions is that the user has to pick names
> for every partition, which is tedious and doesn't provide any
> significant benefit.  The input I had from users of other partitioning
> systems was that they very much preferred not to name the partitions at
> all,

It seems to me both Oracle and DB2 supports named partitions, so even
though there are user's which don't prefer named partitions, I suspect
equal or more number of users will be there who will prefer for the sake
of migration and because they are already used to such a syntax.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WRITE_UINT_FIELD used where WRITE_OID_FIELD likely intended

2014-12-10 Thread Michael Paquier
On Thu, Dec 11, 2014 at 7:44 AM, Mark Dilger  wrote:

> At line 1787 of outfuncs.c, the line:
>
> WRITE_UINT_FIELD(reltablespace)
>
> should probably say
>
> WRITE_OID_FIELD(reltablespace)
>
> since that variable is of type Oid, not uint32.
> Granted, these two macros are interchangeable,
> but they won't be if we ever move to 64-bit Oids.
>
For correctness you are right. Looks like you spent quite some time looking
at that..
-- 
Michael


[HACKERS] Too strict check when starting from a basebackup taken off a standby

2014-12-10 Thread Andres Freund
Hi,

A customer recently reported getting "backup_label contains data
inconsistent with control file" after taking a basebackup from a standby
and starting it with a typo in primary_conninfo.

When starting postgres from a basebackup, StartupXLOG() has the following
code to deal with backup labels:
if (haveBackupLabel)
{
    ControlFile->backupStartPoint = checkPoint.redo;
    ControlFile->backupEndRequired = backupEndRequired;

    if (backupFromStandby)
    {
        if (dbstate_at_startup != DB_IN_ARCHIVE_RECOVERY)
            ereport(FATAL,
                    (errmsg("backup_label contains data inconsistent with control file"),
                     errhint("This means that the backup is corrupted and you will "
                             "have to use another backup for recovery.")));
        ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
    }
}

while I'm not enthusiastic about the error message, that bit of code
looks sane at first glance. We certainly expect the control file to
indicate we're in recovery. Since we're unlinking the backup label
shortly afterwards we'd normally not expect to hit that case after a
shutdown in recovery.

The problem is that after reading the backup label we also have to read
the corresponding checkpoint from pg_xlog. If primary_conninfo and/or
restore_command are misconfigured and can't restore files, that can only
be fixed by shutting down the cluster and fixing up recovery.conf -
and that shutdown sets DB_SHUTDOWNED_IN_RECOVERY in the control file.

The easiest solution seems to be to simply also allow that as a state in
the above check. It might be nicer to not allow a ShutdownXLOG to modify
the control file et al at that stage, but I think that'd end up being
more invasive.

A short search shows that that also looks like a credible explanation
for #12128...

I plan to relax that check unless somebody comes up with a different &
better plan.
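
A minimal sketch of that relaxation (not a committed change; it simply also
accepts the shutdown-in-recovery state in the check quoted above):

if (backupFromStandby)
{
    /*
     * Sketch only: additionally accept DB_SHUTDOWNED_IN_RECOVERY, the state
     * the control file is left in when a standby is shut down before
     * recovery has finished.
     */
    if (dbstate_at_startup != DB_IN_ARCHIVE_RECOVERY &&
        dbstate_at_startup != DB_SHUTDOWNED_IN_RECOVERY)
        ereport(FATAL,
                (errmsg("backup_label contains data inconsistent with control file"),
                 errhint("This means that the backup is corrupted and you will "
                         "have to use another backup for recovery.")));
    ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
}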

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inherit support for foreign tables

2014-12-10 Thread Etsuro Fujita

Hi Ashutosh,

Thanks for the review!

(2014/12/10 14:47), Ashutosh Bapat wrote:

We haven't heard anything from Horiguchi-san and Hanada-san for almost a
week. So, I am fine marking it as "ready for committer". What do you say?


ISTM that both of them are not against us, so let's ask for the 
committer's review!


Thanks,

Best regards,
Etsuro Fujita


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical column ordering

2014-12-10 Thread Josh Berkus
On 12/09/2014 09:11 PM, Tom Lane wrote:
> Josh Berkus  writes:
>> Question on COPY, though: there's reasons why people would want COPY to
>> dump in either physical or logical order.  If you're doing COPY to
>> create CSV files for output, then you want the columns in logical order.
>>  If you're doing COPY for pg_dump, then you want them in physical order
>> for faster dump/reload.  So we're almost certainly going to need to have
>> an option for COPY.
> 
> This is complete nonsense, Josh, or at least it is until you can provide
> some solid evidence to believe that column ordering would make any
> noticeable performance difference in COPY.  I know of no reason to believe
> that the existing user-defined-column-ordering option makes any difference.


Chill, dude, chill.  When we have a patch, I'll do some performance
testing, and we'll see if it's something we care about or not. There's
no reason to be belligerent about it.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical column ordering

2014-12-10 Thread Josh Berkus
On 12/10/2014 05:14 PM, Stephen Frost wrote:
> * Andres Freund (and...@2ndquadrant.com) wrote:
>> > But the scheduling of commits with regard to the 9.5 schedule actually
>> > opens a relevant question: When are we planning to release 9.5? Because
>> > If we try ~ one year from now it's a whole different ballgame than if we
>> > try to go back to september. And I think there's pretty good arguments
>> > for both.
> This should really be on its own thread for discussion...  I'm leaning,
> at the moment at least, towards the September release schedule.  I agree
> that having a later release would allow us to get more into it, but
> there's a lot to be said for the consistency we've kept up over the past
> few years with a September (our last non-September release was 8.4).

Can we please NOT discuss this in the thread about someone's patch?  Thanks.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.5] Custom Plan API

2014-12-10 Thread Robert Haas
On Tue, Dec 9, 2014 at 3:24 AM, Simon Riggs  wrote:
> Feedback I am receiving is that the API is unusable. That could be
> because it is impenetrable, or because it is unusable. I'm not sure it
> matters which.

It would be nice to hear what someone is trying to use it for and what
problems that person is encountering.  Without that, it's pretty much
impossible for anyone to fix anything.

As for sample code, KaiGai had a working example, which of course got
broken when Tom changed the API, but it didn't look to me like Tom's
changes would have made anything impossible that was possible before.
I'm frankly kind of astonished by the tenor of this entire
conversation; there is certainly plenty of code in the backend that is
less self-documenting than this is; and KaiGai did already put up a
wiki page with documentation as you requested.  From his response, it
sounds like he has updated the ctidscan code, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Lockless StrategyGetBuffer() clock sweep

2014-12-10 Thread Robert Haas
On Mon, Dec 8, 2014 at 2:51 PM, Andres Freund  wrote:
> On 2014-10-30 07:55:08 -0400, Robert Haas wrote:
>> On Wed, Oct 29, 2014 at 3:09 PM, Andres Freund  
>> wrote:
>> >> But if it is, then how about
>> >> adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
>> >> and just checking the flag instead of checking bgwriterLatch itself?
>> >
>> > Yea, that'd be nicer. I didn't do it because it made the patch slightly
>> > more invasive... The complexity really is only needed because we're not
>> > guaranteed that 64bit reads are atomic. Although we actually can be
>> > sure, because there's no platform with nonatomic intptr_t reads - so
>> > 64bit platforms actually *do* have atomic 64bit reads/writes.
>> >
>> > So if we just have a integer 'setBgwriterLatch' somewhere we're good. We
>> > don't even need to take a lock to set it. Afaics the worst that can
>> > happen is that several processes set the latch...
>>
>> OK, that design is fine with me.
>
> There's a slight problem with this, namely restarts of bgwriter. If it
> crashes the reference to the relevant latch will currently be reset via
> StrategyNotifyBgWriter(). In reality that's not a problem because
> sizeof(void*) writes are always atomic, but currently we don't assume
> that. We'd sometimes write to an old latch, but that's harmless because
> they're never deallocated.
>
> So I can see several solutions right now:
> 1) Redefine our requirements so that aligned sizeof(void*) writes are
>always atomic. That doesn't affect any currently supported platform
>and won't ever affect any future platform either. E.g. linux has had
>that requirement for a long time.
> 2) Use a explicitly defined latch for the bgworker instead of using the
>PGPROC->procLatch. That way it never has to be reset and all
>SetLatch()s will eventually go to the right process.
> 3) Continue requiring the spinlock when setting the latch.

Maybe you could store the pgprocno instead of the PROC *.
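
A rough sketch of that idea, assuming the existing ProcGlobal->allProcs
array (the bgwprocno variable and where it is stored are made up for
illustration):

/*
 * Store an int pgprocno (-1 when unset) rather than a PGPROC pointer;
 * aligned int reads and writes are atomic on every supported platform.
 */
int         bgwprocno = StrategyControl->bgwprocno;    /* hypothetical field */

if (bgwprocno >= 0)
    SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);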

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking commit timestamps

2014-12-10 Thread Noah Misch
On Mon, Dec 08, 2014 at 02:23:39AM +0100, Petr Jelinek wrote:
> On 08/12/14 00:56, Noah Misch wrote:
> >The commit_ts test suite gives me the attached diff on a 32-bit MinGW build
> >running on 64-bit Windows Server 2003.  I have not checked other Windows
> >configurations; the suite does pass on GNU/Linux.
> 
> Hmm I wonder if "< now()" needs to be changed to "<= now()" in those queries
> to make them work correctly on that platform. I don't have a machine with that
> environment handy right now, so I would appreciate it if you could try that; in
> case you don't have time for that, I will try to set up something later...

I will try that, though perhaps not until next week.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

2014-12-10 Thread Robert Haas
On Sat, Dec 6, 2014 at 10:08 PM, Tomas Vondra  wrote:
> select a.i, b.i from a join b on (a.i = b.i);

I think the concern is that the inner side might be something more
elaborate than a plain table scan, like an aggregate or join.  I might
be all wet, but my impression is that you can make rescanning
arbitrarily expensive if you work at it.
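
For instance, reusing the table names from the quoted query, something like
this (illustrative only) makes the inner side an aggregate rather than a
plain table scan:

select a.i, x.cnt
  from a
  join (select i, count(*) as cnt from b group by i) x on a.i = x.i;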

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] double vacuum in initdb

2014-12-10 Thread Robert Haas
On Wed, Dec 10, 2014 at 8:50 PM, Peter Eisentraut  wrote:
> In an unrelated change, use VACUUM FULL; VACUUM FREEZE; rather than
> a single VACUUM FULL FREEZE command, to respond to my worries of a
> couple days ago about the reliability of doing this in one go.
>
> That was a long time ago.  Is that still applicable?

Gosh, I hope not.  Note that that was back when we still had old-style
VACUUM FULL, which was significantly more fragile than what we've got
now, I think...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] WRITE_UINT_FIELD used where WRITE_OID_FIELD likely intended

2014-12-10 Thread Mark Dilger
At line 1787 of outfuncs.c, the line:

WRITE_UINT_FIELD(reltablespace)

should probably say

WRITE_OID_FIELD(reltablespace)

since that variable is of type Oid, not uint32.
Granted, these two macros are interchangeable,
but they won't be if we ever move to 64-bit Oids.
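
For reference, both macros in outfuncs.c print the field with %u, roughly
as below, which is why they currently produce identical output:

#define WRITE_UINT_FIELD(fldname) \
    appendStringInfo(str, " :" CppAsString(fldname) " %u", node->fldname)

#define WRITE_OID_FIELD(fldname) \
    appendStringInfo(str, " :" CppAsString(fldname) " %u", node->fldname)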

mark


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] Final Patch for GROUPING SETS

2014-12-10 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

More comment on this later, but I want to highlight these specific
points since we need clear answers here to avoid wasting unnecessary
time and effort:

 Tom> I've not spent any real effort looking at gsp2.patch yet, but it
 Tom> seems even worse off comment-wise: if there's any explanation in
 Tom> there at all of what a "chained aggregate" is, I didn't find it.

(Maybe "stacked" would have been a better term.)

What that code does is produce plans that look like this:

  GroupAggregate
    -> Sort
       -> ChainAggregate
          -> Sort
             -> ChainAggregate

in much the same way that WindowAgg nodes are generated.
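
For instance, a query whose grouping sets cannot share a single sort order,
something like the following (illustrative; the items_sold table is made
up), is the kind of query planned with such a stack:

select brand, size, count(*)
  from items_sold
 group by grouping sets ((brand), (size), ());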

Where would you consider the best place to comment this? The WindowAgg
equivalent seems to be discussed primarily in the header comment of
nodeWindowAgg.c.

 Tom> I'd also counsel you to find some other way to do it than
 Tom> putting bool chain_head fields in Aggref nodes;

There are no chain_head fields in Aggref nodes.

Agg.chain_head is true for the Agg node at the top of the chain (the
GroupAggregate node in the above example), while AggState.chain_head
is set on the ChainAggregate nodes to point to the AggState of the
GroupAggregate node.

What we need to know before doing any further work on this is whether
this idea of stacking up aggregate and sort nodes is a viable one.

(The feedback I've had so far suggests that the performance is
acceptable, even if there are still optimization opportunities that
can be tackled later, like adding HashAggregate support.)

 Tom> I took a quick look at gsp-u.patch.  It seems like that approach
 Tom> should work, with of course the caveat that using CUBE/ROLLUP as
 Tom> function names in a GROUP BY list would be problematic.  I'm not
 Tom> convinced by the commentary in ruleutils.c suggesting that extra
 Tom> parentheses would help disambiguate: aren't extra parentheses
 Tom> still going to contain grouping specs according to the standard?

The spec is of minimal help here since it does not allow expressions in
GROUP BY at all, last I looked; only column references.

The extra parens do actually disambiguate because CUBE(x) and
(CUBE(x)) are not equivalent anywhere; while CUBE(x) can appear inside
GROUPING SETS (...), it cannot appear inside a (...) list nested inside
a GROUPING SETS list (or anywhere else).
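
Concretely, under this approach, with a table t and an ordinary function
named cube() in scope (both made up for illustration), the two spellings
resolve differently:

select sum(v) from t group by cube (a, b);   -- CUBE grouping specification
select sum(v) from t group by (cube(a));     -- parenthesized a_expr: a call
                                             -- to a function named cube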

As the comments in gram.y explain, the productions used are intended
to follow the spec with the exception of using a_expr where the spec
requires a plain column reference. So CUBE and ROLLUP are recognized as
special only as part of a group_by_item (the spec's grouping element),
and as soon as we see a paren that isn't part of the "GROUPING
SETS (" opener, we're forced into parsing an a_expr, in which CUBE()
would become a function call.

(The case of upgrading from an old pg version seems to require the use
of --quote-all-identifiers in pg_dump)

 Tom> Forcibly schema-qualifying such function names seems like a less
 Tom> fragile answer on that end.

That I guess would require keeping more state, unless you applied it
everywhere to any function with a keyword for a name? I dunno.

The question that needs deciding here is less whether the approach
_could_ work but whether we _want_ it. The objection has been made
that we are in effect introducing a new category of "unreserved almost
everywhere" keyword, which I think has a point; on the other hand,
reserving CUBE is a seriously painful prospect.

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix typo um vacuumdb tests

2014-12-10 Thread Peter Eisentraut
On 12/10/14 4:55 PM, Fabrízio de Royes Mello wrote:
> Hi all,
> 
> A little typo in vacuumdb tests.

Fixed, thanks.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] double vacuum in initdb

2014-12-10 Thread Peter Eisentraut
initdb currently does

PG_CMD_PUTS("ANALYZE;\nVACUUM FULL;\nVACUUM FREEZE;\n");

FREEZE is now part of FULL, so this seems redundant.  Also, ANALYZE can
be run as part of VACUUM.  So this could be

PG_CMD_PUTS("VACUUM FULL ANALYZE;\n");

There has been some concerns about time spent in initdb in test suites,
which is why I looked into this.  In testing, this change can shave off
between 10% and 20% of the run time of initdb, so it would be kind of
useful.

The last change to this was

commit 66cd8150636e48a8f143560136a25ec5eb355d8c
Author: Tom Lane 
Date:   Mon Nov 29 03:05:03 2004 +

Clean up initdb's error handling so that it prints something more
useful than just \'failed\' when there's a problem.  Per gripe from
Chris Albertson.

In an unrelated change, use VACUUM FULL; VACUUM FREEZE; rather than
a single VACUUM FULL FREEZE command, to respond to my worries of a
couple days ago about the reliability of doing this in one go.

That was a long time ago.  Is that still applicable?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSSAPI, SSPI - include_realm default

2014-12-10 Thread Stephen Frost
* Peter Eisentraut (pete...@gmx.net) wrote:
> On 12/9/14 5:40 PM, Stephen Frost wrote:
> > I agree with this but I don't really see why we wouldn't say "hey, this
> > is going to change in 9.5."
> 
> Well, for one thing, we don't even know if it's going to be called 9.5. ;-)

Now that is certainly a very good point.

> And there is always a chance for a technical reason popping up that we
> might not make the change after all in 9.5.

I suppose.

> I'd be fine with something more along the lines of "the default might
> change in the future ... if you want to be future-proof, set it explicitly".

Sure, that'd work for me.

> Let's see an actual patch.

Will do.  Might be a few days before I get to it but I don't think
there's any cause for rush anyway.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] On partitioning

2014-12-10 Thread Robert Haas
On Wed, Dec 10, 2014 at 7:25 PM, Amit Langote
 wrote:
> In heap_create(), do we create storage for a top level partitioned table 
> (say, RELKIND_PARTITIONED_TABLE)? How about a partition that is further 
> sub-partitioned? We might allocate storage for a partition at some point and 
> then later choose to sub-partition it. In such a case, perhaps, we would have 
> to move existing data to the storage of subpartitions and deallocate the 
> partition's storage. In other words only leaf relations in a partition 
> hierarchy would have storage. Is there such a notion within code for some 
> other purpose or we'd have to invent it for partitioning scheme?

I think it would be advantageous to have storage only for the leaf
partitions, because then you don't need to waste time doing a
zero-block sequential scan of the root as part of the append-plan, an
annoyance of the current system.

We have no concept for this right now; in fact, right now, the relkind
fully determines whether a given relation has storage.  One idea is to
make the leaves relkind = 'r' and the interior nodes some new relkind.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical column ordering

2014-12-10 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
> But the scheduling of commits with regard to the 9.5 schedule actually
> opens a relevant question: When are we planning to release 9.5? Because
> If we try ~ one year from now it's a whole different ballgame than if we
> try to go back to september. And I think there's pretty good arguments
> for both.

This should really be on its own thread for discussion...  I'm leaning,
at the moment at least, towards the September release schedule.  I agree
that having a later release would allow us to get more into it, but
there's a lot to be said for the consistency we've kept up over the past
few years with a September (our last non-September release was 8.4).

I'd certainly vote against planning for a mid-December release as, at
least in my experience, it's a bad idea to try and do December (or
January 1..) major releases.  October might work, but that's not much of
a change from September.  Late January or February would probably work
but that's quite a shift from September and don't think it'd be
particularly better.  Worse, I'm nervous we might get into a habit of
longer and longer releases.  Having yearly releases, imv at least, is
really good for the project and those who depend on it.  New features
are available pretty quickly to end-users and people can plan around our
schedule pretty easily (eg- plan to do DB upgrades in January/February).

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] On partitioning

2014-12-10 Thread Amit Langote


> From: Robert Haas [mailto:robertmh...@gmail.com]
> On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund
>  wrote:
> >> I don't think that's mutually exclusive with the idea of
> >> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
> >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> >> wherever you want.
> >
> > That'll be a lot of places you'll need to touch. More fundamentally: Why
> > should we name something a table that's not one?
> 
> Well, I'm not convinced that it isn't one.  And adding a new relkind
> will involve a bunch of code churn, too.  But I don't much care to
> pre-litigate this: when someone has got a patch, we can either agree
> that the approach is OK or argue that it is problematic because X.  I
> think we need to hammer down the design in broad strokes first, and
> I'm not sure we're totally there yet.
> 

In heap_create(), do we create storage for a top level partitioned table (say, 
RELKIND_PARTITIONED_TABLE)? How about a partition that is further 
sub-partitioned? We might allocate storage for a partition at some point and 
then later choose to sub-partition it. In such a case, perhaps, we would have 
to move existing data to the storage of subpartitions and deallocate the 
partition's storage. In other words only leaf relations in a partition 
hierarchy would have storage. Is there such a notion within code for some other 
purpose or we'd have to invent it for partitioning scheme?

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] thinko in convertToJsonb()

2014-12-10 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Dec 9, 2014 at 11:11 AM, Mark Dilger  wrote:
>> Perhaps the code really meant to say:
>> reserveFromBuffer(&buffer, VARHDRSZ)

> Good catch! The code is indeed incorrect. Attached is a one-line patch
> addressing that, I guess someone is going to pick up that sooner or
> later.

Pushed, thanks!

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] advance local xmin more aggressively

2014-12-10 Thread Robert Haas
On Wed, Dec 10, 2014 at 3:28 PM, Heikki Linnakangas
 wrote:
>> Care to code it up?
>
> Here you are.

That was quick.

You need to add a semicolon to the end of line 20 in pairingheap.c.

I wonder what the worst case for this is.  I tried it on your
destruction test case from before and it doesn't seem to cost anything
material:

your patch: 0m38.714s 0m34.424s 0m35.191s
master: 0m35.260s 0m34.643s 0m34.945s
my patch: 0m34.728s 0m34.619s 0m35.015s

The first measurement with your patch is somewhat of an outlier, but
that was also the first measurement I took, which might have thrown
off the results.  Basically, either your patch or my patch looks free
from here.  I'm kind of suspicious about that: it doesn't seem like
the additional linked-list manipulation you're doing in
RegisterSnapshotOnOwner ought to cost something, but maybe that
function just isn't called enough for it to matter, at least on this
test case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Casting issues with domains

2014-12-10 Thread Tom Lane
Kevin Grittner  writes:
> Tom Lane  wrote:
>> As far as that goes, I think the OP was unhappy about the performance
>> of the information_schema views, which in our implementation do exactly
>> that so that the exposed types of the view columns conform to the SQL
>> standard, even though the underlying catalogs use PG-centric types.
>> 
>> I don't believe that that's the only reason why the performance of the
>> information_schema views tends to be sucky, but it's certainly a reason.

> Is that schema too "edge case" to justify some functional indexes
> on the cast values on the underlying catalogs? (I'm inclined to
> think so, but it seemed like a question worth putting out
> there)

We don't support functional indexes on system catalogs, so whether it'd
be justified is sorta moot.  On the whole though I'm inclined to agree
that the information_schema views aren't used enough to justify adding
overhead to system-catalog updates, even if the pieces for that all
existed.

> Or, since these particular domains are known, is there any sane way
> to "special-case" these to allow the underlying types to work?

I don't particularly care for a kluge solution here.

I notice that recent versions of the SQL spec contain the notion of a
"distinct type", which is a user-defined type that is representationally
identical to some base type but has its own name, and comes equipped with
assignment-grade casts to and from the base type (which in PG terms would
be binary-compatible casts, though the spec doesn't require that).
It seems like this might be intended to be the sort of "zero cost type
alias" we were talking about, except that the SQL committee seems to have
got it wrong by not specifying the cast-to-base-type as being implicit.
Which ISTM you would want so that operators/functions on the base type
would apply automatically to the distinct type.  But perhaps we could
extend the spec with some option to CREATE TYPE to allow the cast to come
out that way.

Or in short, maybe we should try to replace the domains used in the
current information_schema with distinct types.
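
For reference, the spec's distinct-type definition looks roughly like this
(illustrative only; PostgreSQL does not accept this form of CREATE TYPE):

-- SQL-standard distinct type over a predefined base type; shown only to
-- illustrate the concept.
CREATE TYPE sql_identifier AS character varying (128) FINAL;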

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TABLESAMPLE patch

2014-12-10 Thread Petr Jelinek

On 11/12/14 00:24, Petr Jelinek wrote:

Hello,

Attached is a basic implementation of TABLESAMPLE clause. It's SQL
standard clause and couple of people tried to submit it before so I
think I don't need to explain in length what it does - basically returns
"random" sample of a table using a specified sampling method.

I implemented both SYSTEM and BERNOULLI sampling as specified by SQL
standard. The SYSTEM sampling does block level sampling using same
algorithm as ANALYZE, BERNOULLI scans whole table and picks tuple randomly.

There is API for sampling methods which consists of 4 functions at the
moment - init, end, nextblock and nexttuple. I added catalog which maps
the sampling method to the functions implementing this API. The grammar
creates new TableSampleRange struct that I added for sampling. Parser
then uses the catalog to load information about the sampling method into
TableSampleClause which is then attached to RTE. Planner checks for if
this parameter is present in the RTE and if it finds it it will create
plan with just one path - SampleScan. SampleScan implements standard
executor API and calls the sampling method API as needed.

It is possible to write custom sampling methods. The sampling method
parameters are not limited to just percent number as in standard but
dynamic list of expressions which is checked against the definition of
the init function in a similar fashion (although much simplified) as
function calls are.

Notable lacking parts are:
- proper costing and returned row count estimation - given the dynamic
nature of parameters I think for we'll need to let the sampling method
do this, so there will have to be fifth function in the API.
- ruleutils support (it needs a bit of code in get_from_clause_item
function)
- docs are sparse at the moment



Forgot the obligatory:

The research leading to these results has received funding from the 
European Union's Seventh Framework Programme (FP7/2007-2013) under grant 
agreement n° 318633.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] TABLESAMPLE patch

2014-12-10 Thread Petr Jelinek

Hello,

Attached is a basic implementation of TABLESAMPLE clause. It's SQL 
standard clause and couple of people tried to submit it before so I 
think I don't need to explain in length what it does - basically returns 
"random" sample of a table using a specified sampling method.


I implemented both SYSTEM and BERNOULLI sampling as specified by SQL 
standard. The SYSTEM sampling does block level sampling using same 
algorithm as ANALYZE, BERNOULLI scans whole table and picks tuple randomly.
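
As a usage sketch following the grammar added by the patch (the orders
table is made up):

-- roughly 10% of the table's blocks, returning every row in each chosen block
SELECT * FROM orders TABLESAMPLE SYSTEM (10);

-- roughly 10% of individual rows, reproducible across runs via a seed
SELECT * FROM orders TABLESAMPLE BERNOULLI (10) REPEATABLE (42);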


There is API for sampling methods which consists of 4 functions at the 
moment - init, end, nextblock and nexttuple. I added catalog which maps 
the sampling method to the functions implementing this API. The grammar 
creates new TableSampleRange struct that I added for sampling. Parser 
then uses the catalog to load information about the sampling method into 
TableSampleClause which is then attached to RTE. Planner checks for if 
this parameter is present in the RTE and if it finds it it will create 
plan with just one path - SampleScan. SampleScan implements standard 
executor API and calls the sampling method API as needed.


It is possible to write custom sampling methods. The sampling method 
parameters are not limited to just percent number as in standard but 
dynamic list of expressions which is checked against the definition of 
the init function in a similar fashion (although much simplified) as 
function calls are.


Notable lacking parts are:
- proper costing and returned row count estimation - given the dynamic 
nature of parameters I think for we'll need to let the sampling method 
do this, so there will have to be fifth function in the API.
- ruleutils support (it needs a bit of code in get_from_clause_item 
function)

- docs are sparse at the moment

--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 01d24a5..250ae29 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -49,7 +49,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionwhere from_item can be one of:
 
-[ ONLY ] table_name [ * ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
+[ ONLY ] table_name [ * ] [ TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ] ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
 [ LATERAL ] ( select ) [ AS ] alias [ ( column_alias [, ...] ) ]
 with_query_name [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
 [ LATERAL ] function_name ( [ argument [, ...] ] )
@@ -317,6 +317,38 @@ TABLE [ ONLY ] table_name [ * ]
  
 
  
+  TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]
+  
+   
+Table sample clause after
+table_name indicates that
+a sampling_method should
+be used to retrieve subset of rows in the table.
+The sampling_method can be
+one of:
+
+ 
+  SYSTEM
+ 
+ 
+  BERNOULLI
+ 
+
+Both of those sampling methods currently accept only single argument
+which is the percent (floating point from 0 to 100) of the rows to
+be returned.
+The SYSTEM sampling method does block level
+sampling with each block having same chance of being selected and
+returns all rows from each selected block.
+The BERNOULLI scans whole table and returns
+individual rows with equal probability.
+The optional numeric parameter REPEATABLE is used
+as random seed for sampling.
+   
+  
+ 
+
+ 
   alias
   

diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 21721b4..595737c 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,6 +8,7 @@ subdir = src/backend/access
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS	= brin common gin gist hash heap index nbtree rmgrdesc spgist transam
+SUBDIRS	= brin common gin gist hash heap index nbtree rmgrdesc spgist \
+			  transam tsm
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/tsm/Makefile b/src/backend/access/tsm/Makefile
new file mode 100644
index 000..73bbbd7
--- /dev/null
+++ b/src/backend/access/tsm/Makefile
@@ -0,0 +1,17 @@
+#-
+#
+# Makefile--
+#Makefile for access/tsm
+#
+# IDENTIFICATION
+#src/backend/access/tsm/Makefile
+#
+#-
+
+subdir = src/backend/access/tsm
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = tsm_system.o tsm_bernoulli.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/tsm/tsm_bernoulli.c b/src/backend/access/tsm/tsm_bernoulli.c
new file mode 100644
index 000..c273ca6
--- /de

Re: [HACKERS] Casting issues with domains

2014-12-10 Thread Kevin Grittner
Tom Lane  wrote:
> Kevin Grittner  writes:

>> It's kinda hard for me to visualize where it makes sense to define
>> the original table column as the bare type but use a domain when
>> referencing it in the view.
>
> As far as that goes, I think the OP was unhappy about the performance
> of the information_schema views, which in our implementation do exactly
> that so that the exposed types of the view columns conform to the SQL
> standard, even though the underlying catalogs use PG-centric types.
>
> I don't believe that that's the only reason why the performance of the
> information_schema views tends to be sucky, but it's certainly a reason.

Is that schema too "edge case" to justify some functional indexes
on the cast values on the underlying catalogs? (I'm inclined to
think so, but it seemed like a question worth putting out
there)

Or, since these particular domains are known, is there any sane way
to "special-case" these to allow the underlying types to work?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Casting issues with domains

2014-12-10 Thread Tom Lane
Kevin Grittner  writes:
> It's kinda hard for me to visualize where it makes sense to define
> the original table column as the bare type but use a domain when
> referencing it in the view.

As far as that goes, I think the OP was unhappy about the performance
of the information_schema views, which in our implementation do exactly
that so that the exposed types of the view columns conform to the SQL
standard, even though the underlying catalogs use PG-centric types.
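
For example, information_schema.tables builds its output columns with casts
along these lines (paraphrased, not the exact view definition):

SELECT CAST(current_database() AS sql_identifier) AS table_catalog,
       CAST(nc.nspname AS sql_identifier)         AS table_schema,
       CAST(c.relname  AS sql_identifier)         AS table_name
  FROM pg_namespace nc
  JOIN pg_class c ON c.relnamespace = nc.oid;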

I don't believe that that's the only reason why the performance of the
information_schema views tends to be sucky, but it's certainly a reason.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Fix typo um vacuumdb tests

2014-12-10 Thread Fabrízio de Royes Mello
Hi all,

A little typo in vacuumdb tests.

Regards,

-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello
diff --git a/src/bin/scripts/t/100_vacuumdb.pl b/src/bin/scripts/t/100_vacuumdb.pl
index d6555f5..ac160ba 100644
--- a/src/bin/scripts/t/100_vacuumdb.pl
+++ b/src/bin/scripts/t/100_vacuumdb.pl
@@ -29,4 +29,4 @@ issues_sql_like(
 issues_sql_like(
 	[ 'vacuumdb', '-Z', 'postgres' ],
 	qr/statement: ANALYZE;/,
-	'vacuumdb -z');
+	'vacuumdb -Z');

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST kNN search queue (Re: KNN-GiST with recheck)

2014-12-10 Thread Heikki Linnakangas

On 12/10/2014 10:59 PM, Arthur Silva wrote:

It may be better to replace the lib/binaryheap altogether as it offers
comparable/better performance.


It's not always better. A binary heap is more memory-efficient, for 
starters.


There are only two uses of lib/binaryheap: reorderbuffer.c and merge 
append. Reorderbuffer isn't that performance critical, although a binary 
heap may well be better there, because the comparison function is very 
cheap. For merge append, it might be a win, especially if the comparison 
function is expensive. (That's on the assumption that the overall number 
of comparisons needed with a pairing heap is smaller - I'm not sure how 
true that is). That would be worth testing.


I'd love to test some other heap implementation in in tuplesort.c. It 
has a custom binary heap implementation that's used in the final merge 
phase of an external sort, and also when doing a so-called bounded sort, 
i.e. "ORDER BY x LIMIT Y". But that would be difficult to replace, 
because tuplesort.c collects tuples in an array in memory, and then 
turns that into a heap. An array is efficient to turn into a binary 
heap, but to switch to another data structure, you'd suddenly need extra 
memory. And we do the switch when we run out of work_mem, so allocating 
more isn't really an option.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq pipelining

2014-12-10 Thread Matt Newell
On Friday, December 05, 2014 12:22:38 PM Heikki Linnakangas wrote:
> Oh, that's what the PQgetLastQuery/PQgetNextQuery functions work! I
> didn't understand that before. I'd suggest renaming them to something
> like PQgetSentQuery() and PQgetResultQuery(). The first/last/next names
> made me think that they're used to iterate a list of queries, but in
> fact they're supposed to be used at very different stages.
> 
> - Heikki


Okay, I have renamed them with your suggestions, and added 
PQsetPipelining/PQisPipelining, defaulting to pipelining off.  There should be 
no behavior change unless pipelining is enabled.

Documentation should be mostly complete except the possible addition of an 
example and maybe a general pipelining overview paragraph.

I have implemented async query support (that takes advantage of pipelining) in 
Qt, along with a couple test cases.  This led to me discovering a bug with my 
last patch where a PGquery object could be reused twice in a row.  I have fixed 
that.  I contemplated not reusing the PGquery objects at all, but that 
wouldn't solve the problem because it's very possible that malloc will return 
a recent free of the same size anyway.  Making the guarantee that a PGquery 
won't be reused twice in a row should be sufficient, and the only alternative 
is 
to add a unique id, but that will add further complexity that I don't think is 
warranted.

Feedback is very welcome and appreciated.

Thanks,
Matt Newell


diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index d829a4b..4e0431e 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3947,9 +3947,14 @@ int PQsendQuery(PGconn *conn, const char *command);
 
After successfully calling PQsendQuery, call
PQgetResult one or more times to obtain the
-   results.  PQsendQuery cannot be called again
-   (on the same connection) until PQgetResult
-   has returned a null pointer, indicating that the command is done.
+   results.  If pipelining is enabled, PQsendQuery
+   may be called multiple times before reading the results. See 
+   PQsetPipelining and PQisPipelining.
+   Call PQgetSentQuery to get a PGquery
+   which can be used to identify which results obtained from
+   PQgetResult belong to each pipelined query.
+   If only one query is dispatched at a time, you can call PQgetResult
+   until a NULL value is returned to indicate the end of the query.
   
  
 
@@ -4133,8 +4138,8 @@ PGresult *PQgetResult(PGconn *conn);
 
   
PQgetResult must be called repeatedly until
-   it returns a null pointer, indicating that the command is done.
-   (If called when no command is active,
+   it returns a null pointer, indicating that all dispatched commands
+   are done. (If called when no command is active,
PQgetResult will just return a null pointer
at once.) Each non-null result from
PQgetResult should be processed using the
@@ -4144,14 +4149,17 @@ PGresult *PQgetResult(PGconn *conn);
PQgetResult will block only if a command is
active and the necessary response data has not yet been read by
PQconsumeInput.
+   If query pipelining is being used, PQgetResultQuery
+   can be called after PQgetResult to match the result to the query.
   
 
   

 Even when PQresultStatus indicates a fatal
-error, PQgetResult should be called until it
-returns a null pointer, to allow libpq to
-process the error information completely.
+error, PQgetResult should be called until the
+query has no more results (null pointer return if not using query
+pipelining, otherwise see PQgetResultQuery),
+to allow libpq to process the error information completely.

   
  
@@ -4385,6 +4393,158 @@ int PQflush(PGconn *conn);
read-ready and then read the response as described above.
   
 
+ 
+  
+   
+PQsetPipelining
+
+ PQsetPipelining
+
+   
+
+   
+
+ Enables or disables query pipelining.
+
+int PQsetPipelining(PGconn *conn, int arg);
+
+
+
+
+ Enables pipelining for the connection if arg is 1, or disables it
+ if arg is 0.  When pipelining is enabled, multiple async queries can
+ be sent before processing the results of the first.  If pipelining
+ is disabled, an error will be raised if an async query is attempted
+ while another is active.
+
+   
+  
+
+  
+   
+PQisPipelining
+
+ PQisPipelining
+
+   
+
+   
+
+ Returns the pipelining status of the connection
+
+int PQisPipelining(PGconn *conn);
+
+
+
+
+ Returns 1 if pipelining is enabled, or 0 if pipelining is disabled.
+ Query pipelining is disabled unless enabled with a call to
+ PQsetPipelining.
+
+   
+  
+
+  
+   
+PQgetQueryCommand
+
+ PQgetQueryCommand
+
+   
+
+   
+
+ Returns the query string associated with th

Re: [HACKERS] GiST kNN search queue (Re: KNN-GiST with recheck)

2014-12-10 Thread Arthur Silva
On Wed, Dec 10, 2014 at 1:50 PM, Heikki Linnakangas  wrote:

> On 01/28/2014 04:12 PM, Alexander Korotkov wrote:
>
>> >3. A binary heap would be a better data structure to buffer the rechecked
>>> >values. A Red-Black tree allows random insertions and deletions, but in
>>> >this case you need to insert arbitrary values but only remove the
>>> minimum
>>> >item. That's exactly what a binary heap excels at. We have a nice binary
>>> >heap implementation in the backend that you can use, see
>>> >src/backend/lib/binaryheap.c.
>>> >
>>>
>> Hmm. For me binary heap would be a better data structure for KNN-GiST at
>> all :-)
>>
>
> I decided to give this a shot, replacing the red-black tree in GiST with
> the binary heap we have in lib/binaryheap.c. It made the GiST code somewhat
> simpler, as the binaryheap interface is simpler than the red-black tree
> one. Unfortunately, performance was somewhat worse. That was quite
> surprising, as insertions and deletions are both O(log N) in both data
> structures, but the red-black tree implementation is more complicated.
>
> I implemented another data structure called a Pairing Heap. It's also a
> fairly simple data structure, but insertions are O(1) instead of O(log N).
> It also performs fairly well in practice.
>
> With that, I got a small but measurable improvement. To test, I created a
> table like this:
>
> create table gisttest (id integer, p point);
> insert into gisttest select id, point(random(), random()) from
> generate_series(1, 100) id;
> create index i_gisttest on gisttest using gist (p);
>
> And I ran this query with pgbench:
>
> select id from gisttest order by p <-> '(0,0)' limit 1000;
>
> With unpatched master, I got about 650 TPS, and with the patch 720 TPS.
> That's a nice little improvement, but perhaps more importantly, the pairing
> heap implementation consumes less memory. To measure that, I put a
> MemoryContextStats(so->queueCtx) call into gistendscan. With the above
> query, but without the "limit" clause, on master I got:
>
> GiST scan context: 2109752 total in 10 blocks; 2088456 free (24998
> chunks); 21296 used
>
> And with the patch:
>
> GiST scan context: 1061160 total in 9 blocks; 1040088 free (12502 chunks);
> 21072 used
>
> That's 2MB vs 1MB. While that's not much in absolute terms, it'd be nice
> to reduce that memory consumption, as there is no hard upper bound on how
> much might be needed. If the GiST tree is really disorganized for some
> reason, a query might need a lot more.
>
>
> So all in all, I quite like this patch, even though it doesn't do anything
> too phenomenal. It adds some code, in the form of the new pairing heap
> implementation, but it makes the GiST code a little bit simpler. And it
> gives a small performance gain, and reduces memory usage a bit.
>
> - Heikki
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>
It may be better to replace lib/binaryheap altogether, as the pairing heap
offers comparable or better performance.


Re: [HACKERS] advance local xmin more aggressively

2014-12-10 Thread Heikki Linnakangas

On 12/10/2014 08:35 PM, Robert Haas wrote:

On Wed, Dec 10, 2014 at 12:57 PM, Heikki Linnakangas
 wrote:

Clever. Could we use that method in ResourceOwnerReleaseInternal and
ResourceOwnerDelete, too? Might be best to have a
ResourceOwnerWalk(resowner, callback) function for walking all resource
owners in a tree, instead of one for walking the snapshots in them.


Sure.  It would be a little more complicated there since you want to
stop when you get back to the starting point, but not too bad.  But is
that solving any actual problem?


I thought that a transaction commit or abort in some special 
circumstances might call ResourceOwnerReleaseInternal on the top level, 
but I can't make it happen. The machinery in xact.c is too clever, and 
always releases the resource owners from the bottom up. And I can't find 
a way to create a deep resource owner tree in any other way. So I guess 
it's fine as it is.


MemoryContextCheck and MemoryContextPrint also recurse, however. 
MemoryContextCheck is only enabled with --enable-cassert, but 
MemoryContextPrint is called when you run out of memory. That could turn 
a plain "out of memory" error into a stack overrun, triggering a server 
crash and restart.



It occurs to me that the pairing heap I just posted in another thread
(http://www.postgresql.org/message-id/54886bb8.9040...@vmware.com) would be
a good fit for this. It's extremely cheap to insert to and to find the
minimum item (O(1)), and the delete operation is O(log N), amortized. I
didn't implement a delete operation, for removing a particular node, I only
did delete-min, but it's basically the same code. Using a pairing heap for
this might be overkill, but if we have that implementation anyway, the code
in snapmgr.c to use it would be very simple, so I see little reason not to.
It might even be simpler than your patch, because you wouldn't need to have
the heuristics on whether to attempt updating the xmin; it would be cheap
enough to always try it.


Care to code it up?


Here you are.

- Heikki
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 327a1bc..b24ece6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = ilist.o binaryheap.o stringinfo.o
+OBJS = ilist.o binaryheap.o pairingheap.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
new file mode 100644
index 000..06cd117
--- /dev/null
+++ b/src/backend/lib/pairingheap.c
@@ -0,0 +1,237 @@
+/*-
+ *
+ * pairingheap.c
+ *	  A Pairing Heap implementation
+ *
+ * Portions Copyright (c) 2012-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/pairingheap.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include 
+
+#include "lib/pairingheap.h"
+
+static pairingheap_node *merge(pairingheap *heap, pairingheap_node *a, pairingheap_node *b);
+static pairingheap_node *merge_children(pairingheap *heap, pairingheap_node *children);
+
+/*
+ * pairingheap_allocate
+ *
+ * Returns a pointer to a newly-allocated heap, with the heap property
+ * defined by the given comparator function, which will be invoked with the
+ * additional argument specified by 'arg'.
+ */
+pairingheap *
+pairingheap_allocate(pairingheap_comparator compare, void *arg)
+{
+	pairingheap *heap;
+
+	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	heap->ph_compare = compare;
+	heap->ph_arg = arg;
+
+	heap->ph_root = NULL;
+
+	return heap;
+}
+
+/*
+ * pairingheap_free
+ *
+ * Releases memory used by the given pairingheap.
+ *
+ * Note: The items in the heap are not released!
+ */
+void
+pairingheap_free(pairingheap *heap)
+{
+	pfree(heap);
+}
+
+
+/* A helper function to merge two subheaps into one. */
+static pairingheap_node *
+merge(pairingheap *heap, pairingheap_node *a, pairingheap_node *b)
+{
+	if (a == NULL)
+		return b;
+	if (b == NULL)
+		return a;
+
+	/* Put the larger of the items as a child of the smaller one */
+	if (heap->ph_compare(a, b, heap->ph_arg) < 0)
+	{
+		pairingheap_node *tmp;
+
+		tmp = a;
+		a = b;
+		b = tmp;
+	}
+
+	if (a->first_child)
+		a->first_child->prev_or_parent = b;
+	b->prev_or_parent = a;
+	b->next_sibling = a->first_child;
+	a->first_child = b;
+	return a;
+}
+
+/*
+ * pairingheap_add
+ *
+ * Adds the given datum to the heap in O(1) time.
+ */
+void
+pairingheap_add(pairingheap *heap, pairingheap_node *d)
+{
+	d->first_child = NULL;
+
+	/* Link the new item as a new tree */
+	heap->ph_root = merge(heap, heap->ph_root, d);
+}
+
+/*
+ * pairingheap_first
+ *
+ * Returns a pointer to the first (root, topmost) node in the heap
+ * without modifying the heap. The caller must ensure that this
+ * routine is not used on an empty heap. Always O(1

Re: [HACKERS] GSSAPI, SSPI - include_realm default

2014-12-10 Thread Peter Eisentraut
On 12/9/14 5:40 PM, Stephen Frost wrote:
> I agree with this but I don't really see why we wouldn't say "hey, this
> is going to change in 9.5."

Well, for one thing, we don't even know if it's going to be called 9.5. ;-)

And there is always a chance for a technical reason popping up that we
might not make the change after all in 9.5.

I'd be fine with something more along the lines of "the default might
change in the future ... if you want to be future-proof, set it explicitly".

Let's see an actual patch.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Casting issues with domains

2014-12-10 Thread Kevin Grittner
Thomas Reiss  wrote:

> postgres=# create table test1 (a text);
> CREATE TABLE
> postgres=# insert into test1 select generate_series(1,10);
> INSERT 0 10
> postgres=# create index idx1 on test1(a);
> CREATE INDEX
> postgres=# analyze test1 ;
> ANALYZE;
> postgres=# explain select * from test1 where a = 'toto';
> QUERY PLAN
> ---
>  Index Only Scan using idx1 on test1 (cost=0.29..8.31 rows=1 width=5)
>Index Cond: (a = 'toto'::text)
> (2 lignes)
>
> Now we create a tstdom domain and cast the a column to tstdom in the
> view definition :
> postgres=# create domain tstdom as text;
> CREATE DOMAIN
> postgres=# create view test2 as select a::tstdom from test1 ;
> CREATE VIEW
> postgres=# explain select * from test2 where a='toto';
> QUERY PLAN
> --
>  Seq Scan on test1 (cost=0.00..1693.00 rows=500 width=5)
>Filter: (((a)::tstdom)::text = 'toto'::text)
> (2 lignes)
>
> As you can see, a is cast to tstdom and then again to text. These casts
> prevent the optimizer from choosing an index scan to retrieve the data. The
> casts are however strictly equivalent and should not prevent the
> optimizer from using indexes.

You can create an index to be used for searching using the domain.
Following the steps in your example, you can run this:

postgres=# create index idx2 on test1 ((a::tstdom));
CREATE INDEX
postgres=# vacuum analyze test1;
VACUUM
postgres=# explain select * from test2 where a='toto';
QUERY PLAN
--
 Index Scan using idx2 on test1 (cost=0.29..8.31 rows=1 width=5)
   Index Cond: (((a)::tstdom)::text = 'toto'::text)
(2 rows)

It's even easier if "a" is defined to be a member of the domain in
the original table:

postgres=# create domain tstdom as text;
CREATE DOMAIN
postgres=# create table test1 (a tstdom);
CREATE TABLE
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10
postgres=# create index idx1 on test1(a);
CREATE INDEX
postgres=# analyze test1 ;
ANALYZE
postgres=# explain select * from test1 where a = 'toto';
  QUERY PLAN
---
 Index Only Scan using idx1 on test1 (cost=0.29..8.31 rows=1 width=5)
   Index Cond: (a = 'toto'::text)
(2 rows)

postgres=# create view test2 as select a::tstdom from test1 ;
CREATE VIEW
postgres=# explain select * from test2 where a='toto';
  QUERY PLAN
---
 Index Only Scan using idx1 on test1 (cost=0.29..8.31 rows=1 width=5)
   Index Cond: (a = 'toto'::text)
(2 rows)

It's kinda hard for me to visualize where it makes sense to define
the original table column as the bare type but use a domain when
referencing it in the view.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Final Patch for GROUPING SETS

2014-12-10 Thread Tom Lane
Andrew Gierth  writes:
> And here is that recut patch set.

I started looking over this patch, but eventually decided that it needs
more work to be committable than I'm prepared to put in right now.

My single biggest complaint is about the introduction of struct
GroupedVar.  If we stick with that, we're going to have to teach an
extremely large number of places that know about Vars to also know
about GroupedVars.  This will result in code bloat and errors of
omission.  If you think the latter concern is hypothetical, note that
you can't get 40 lines into gsp1.patch without finding such an omission,
namely the patch fails to teach pg_stat_statements.c about GroupedVars.
(That also points up that some of the errors of omission will be in
third-party code that we can't fix easily.)

I think you should get rid of that concept and instead implement the
behavior by having nodeAgg.c set the relevant fields of the representative
tuple slot to NULL, so that a regular Var does the right thing.
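
A rough sketch of that approach (hypothetical helper, not part of the patch; the
real code would live in nodeAgg.c and use whatever representation of "columns in
the current grouping set" it has at hand):

#include "postgres.h"
#include "executor/tuptable.h"
#include "nodes/bitmapset.h"

/*
 * Null out the columns of the representative tuple slot that are not part of
 * the current grouping set, so that ordinary Vars evaluated against the slot
 * simply read as NULL.
 */
static void
clear_ungrouped_columns(TupleTableSlot *slot, Bitmapset *grouped_cols)
{
    int         i;

    slot_getallattrs(slot);     /* populate tts_values/tts_isnull */
    for (i = 0; i < slot->tts_nvalid; i++)
    {
        if (!bms_is_member(i + 1, grouped_cols))
        {
            slot->tts_values[i] = (Datum) 0;
            slot->tts_isnull[i] = true;
        }
    }
}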

I'm also not happy about the quality of the internal documentation.
The big problem here is the seriously lacking documentation of the new
parse node types, eg

+/*
+ * Node representing substructure in GROUPING SETS
+ *
+ * This is not actually executable, but it's used in the raw parsetree
+ * representation of GROUP BY, and in the groupingSets field of Query, to
+ * preserve the original structure of rollup/cube clauses for readability
+ * rather than reducing everything to grouping sets.
+ */
+
+typedef enum
+{
+   GROUPING_SET_EMPTY,
+   GROUPING_SET_SIMPLE,
+   GROUPING_SET_ROLLUP,
+   GROUPING_SET_CUBE,
+   GROUPING_SET_SETS
+} GroupingSetKind;
+
+typedef struct GroupingSet
+{
+   Exprxpr;
+   GroupingSetKind kind;
+   List   *content;
+   int location;
+} GroupingSet;

The only actual documentation there is a long-winded excuse for having
put the struct declaration in the wrong place.  (Since it's not an
executable expression, it should be in parsenodes.h not primnodes.h.)
Good luck figuring out what "content" is a list of, or indeed anything
at all except that this has got something to do with grouping sets.
If one digs around in the patch long enough, some useful information can
be found in the header comments for various functions --- but there should
be a spec for what this struct means, what its fields are, what the
relevant invariants are *in the .h file*.  Poking around in parsenodes.h,
eg the description of SortGroupClause, should give you an idea of the
standard here.

I'm not too happy about struct Grouping either.  If one had to guess, one
would probably guess that this was part of the representation of a GROUP
BY clause; a guess led on by the practice of the patch of dealing with
this and struct GroupingSet together, as in eg pg_stat_statements.c and
nodes.h.  Reading enough of the patch will eventually clue you that this
is the representation of a call of the GROUPING() pseudo-function, but
that's not exactly clear from either the name of the struct or its random
placement between Var and Const in primnodes.h.  And the comment is oh so
helpful:

+/*
+ * Grouping
+ */

I'd be inclined to call it GroupingFunc and put it after
Aggref/WindowFunc.  Also please note that there is an attempt throughout
the system to order code stanzas that deal with assorted node types in an
order matching the order in which they're declared in the *nodes.h files.
You should never be flipping a coin to decide where to add such code, and
"put it at the end of the existing list" is usually not the best answer
either.

Some other random examples of inadequate attention to commenting:

@@ -243,7 +243,7 @@ typedef struct AggStatePerAggData
 * rest.
 */
 
-   Tuplesortstate *sortstate;  /* sort object, if DISTINCT or ORDER BY 
*/
+   Tuplesortstate **sortstate; /* sort object, if DISTINCT or ORDER BY 
*/

This change didn't even bother to pluralize the comment, let alone explain
the length of the array or what it's indexed according to, let alone
explain why we now need multiple tuplesort objects in what is still
apparently a "per aggregate" state struct.  (BTW, as a matter of good
engineering I think it's useful to change a field's name when you change
its meaning and representation so fundamentally.  In this case, renaming
to "sortstates" would have been clearer and would have helped ensure that
you didn't miss fixing any referencing code.)

@@ -338,81 +339,101 @@ static Datum GetAggInitVal(Datum textInitVal, Oid 
transtype);
 static void
 initialize_aggregates(AggState *aggstate,
  AggStatePerAgg peragg,
- AggStatePerGroup pergroup)
+ AggStatePerGroup pergroup,
+ int numReinitialize)
 {
int aggno;

I wonder what numReinitialize is, or why it's 

Re: [HACKERS] advance local xmin more aggressively

2014-12-10 Thread Robert Haas
On Wed, Dec 10, 2014 at 12:57 PM, Heikki Linnakangas
 wrote:
> Clever. Could we use that method in ResourceOwnerReleaseInternal and
> ResourceOwnerDelete, too? Might be best to have a
> ResourceOwnerWalk(resowner, callback) function for walking all resource
> owners in a tree, instead of one for walking the snapshots in them.

Sure.  It would be a little more complicated there since you want to
stop when you get back to the starting point, but not too bad.  But is
that solving any actual problem?

>> 2. Instead of traversing the tree until we find an xmin equal to the
>> one we're currently advertising, the code now traverses the entire
>> tree each time it runs. However, it also keeps a record of how many
>> times the oldest xmin occurred in the tree, which is decremented each
>> time we unregister a snapshot with that xmin; the traversal doesn't
>> run again until that count reaches 0.  That fixed the performance
>> regression on your test case.
>>
>> With a million subtransactions:
>>
>> master 34.464s 33.742s 34.317s
>> advance-xmin 34.516s 34.069s 34.196s
>
> Well, you can still get the slowness back by running other stuff in the
> background. I admit that it's a very obscure case, probably fine in
> practice. I would still feel better if snapmgr.c did its own bookkeeping,
> from an aesthetic point of view. In a heap, or even just a linked list, if
> the performance characteristics of that is acceptable.
>
> It occurs to me that the pairing heap I just posted in another thread
> (http://www.postgresql.org/message-id/54886bb8.9040...@vmware.com) would be
> a good fit for this. It's extremely cheap to insert to and to find the
> minimum item (O(1)), and the delete operation is O(log N), amortized. I
> didn't implement a delete operation, for removing a particular node, I only
> did delete-min, but it's basically the same code. Using a pairing heap for
> this might be overkill, but if we have that implementation anyway, the code
> in snapmgr.c to use it would be very simple, so I see little reason not to.
> It might even be simpler than your patch, because you wouldn't need to have
> the heuristics on whether to attempt updating the xmin; it would be cheap
> enough to always try it.

Care to code it up?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-10 Thread Robert Haas
On Mon, Dec 8, 2014 at 10:59 PM, Amit Kapila  wrote:
> Yeah and also how would user specify the values, as an example
> assume that table is partitioned on monthly_salary, so partition
> definition would look:
>
> PARTITION BY LIST(monthly_salary)
> (
> PARTITION salary_less_than_thousand VALUES(300, 900),
> PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> ...
> )
>
> Now if user wants to define multi-column Partition based on
> monthly_salary and annual_salary, how do we want him to
> specify the values.  Basically how to distinguish which values
> belong to the first column key and which ones belong to the second
> column key.

I assume you just add some parentheses.

PARTITION BY LIST (colA, colB) (PARTITION VALUES ((valA1, valB1),
(valA2, valB2), (valA3, valB3)))

Multi-column list partitioning may or may not be worth implementing,
but the syntax is not a real problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-10 Thread Robert Haas
On Mon, Dec 8, 2014 at 5:05 PM, Jim Nasby  wrote:
> Agreed, but it's possible to keep a block/CTID interface while doing
> something different on the disk.

Objection: hand-waving.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-10 Thread Robert Haas
On Wed, Dec 10, 2014 at 9:22 AM, Alvaro Herrera
 wrote:
> The problem with naming partitions is that the user has to pick names
> for every partition, which is tedious and doesn't provide any
> significant benefit.  The input I had from users of other partitioning
> systems was that they very much preferred not to name the partitions at
> all, which is why I chose the PARTITION FOR VALUE syntax (not sure if
> this syntax is exactly what other systems use; it just seemed the
> natural choice.)

FWIW, Oracle does name partitions.  It generates the names
automatically if you don't care to specify them, and the partition
names for a given table live in their own namespace that is separate
from the toplevel object namespace.  For example:

CREATE TABLE sales
 ( invoice_no NUMBER,
   sale_year  INT NOT NULL,
   sale_month INT NOT NULL,
   sale_day   INT NOT NULL )
   STORAGE (INITIAL 100K NEXT 50K) LOGGING
   PARTITION BY RANGE ( sale_year, sale_month, sale_day)
 ( PARTITION sales_q1 VALUES LESS THAN ( 1999, 04, 01 )
TABLESPACE tsa STORAGE (INITIAL 20K, NEXT 10K),
   PARTITION sales_q2 VALUES LESS THAN ( 1999, 07, 01 )
TABLESPACE tsb,
   PARTITION sales_q3 VALUES LESS THAN ( 1999, 10, 01 )
TABLESPACE tsc,
   PARTITION sales_q4 VALUES LESS THAN ( 2000, 01, 01 )
TABLESPACE tsd)
   ENABLE ROW MOVEMENT;

I don't think this practice has much to recommend it.  We're going to
need a way to refer to individual partitions by name, and I don't see
much benefit in making that name something other than what is stored
in pg_class.relname.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] advance local xmin more aggressively

2014-12-10 Thread Heikki Linnakangas

On 12/10/2014 06:56 PM, Robert Haas wrote:

On Wed, Dec 10, 2014 at 9:49 AM, Robert Haas  wrote:

I guess this bears some further thought.  I certainly don't like the
fact that this makes the whole system crap out at a lower number of
subtransactions than presently.  The actual performance numbers don't
bother me very much; I'm comfortable with the possibility that closing
a cursor will be some modest percentage slower if you've got thousands
of active savepoints.


Here's a new version with two changes:

1. I changed the traversal of the resource owner tree to iterate
instead of recurse.  It now does a depth-first, pre-order traversal of
the tree; when we reach the last child of a node, we follow its parent
pointer to get back to where we were.  That way, we don't need to keep
anything on the stack.  That fixed the crash at 100k cursors, but it
was still 4x slower.


Clever. Could we use that method in ResourceOwnerReleaseInternal and 
ResourceOwnerDelete, too? Might be best to have a 
ResourceOwnerWalk(resowner, callback) function for walking all resource 
owners in a tree, instead of one for walking the snapshots in them.
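
Something like this, perhaps (just a sketch; it reuses the firstchild/nextchild/
parent links that are private to resowner.c, and the callback signature is made
up):

typedef bool (*ResourceOwnerWalkCallback) (ResourceOwner owner, void *arg);

/*
 * Non-recursive depth-first walk of 'root' and all of its descendants,
 * using the same parent-pointer trick as in the patch.  Returns true if
 * the callback aborted the traversal.
 */
static bool
ResourceOwnerWalk(ResourceOwner root, ResourceOwnerWalkCallback callback,
                  void *arg)
{
    ResourceOwner owner = root;
    bool        visited = false;

    while (owner != NULL)
    {
        if (!visited)
        {
            if (callback(owner, arg))
                return true;
            if (owner->firstchild != NULL)
            {
                owner = owner->firstchild;
                continue;
            }
        }
        if (owner == root)
            break;              /* back at the starting point: done */
        if (owner->nextchild != NULL)
        {
            owner = owner->nextchild;
            visited = false;
        }
        else
        {
            owner = owner->parent;
            visited = true;
        }
    }
    return false;
}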



2. Instead of traversing the tree until we find an xmin equal to the
one we're currently advertising, the code now traverses the entire
tree each time it runs. However, it also keeps a record of how many
times the oldest xmin occurred in the tree, which is decremented each
time we unregister a snapshot with that xmin; the traversal doesn't
run again until that count reaches 0.  That fixed the performance
regression on your test case.

With a million subtransactions:

master 34.464s 33.742s 34.317s
advance-xmin 34.516s 34.069s 34.196s


Well, you can still get the slowness back by running other stuff in the 
background. I admit that it's a very obscure case, probably fine in 
practice. I would still feel better if snapmgr.c did its own 
bookkeeping, from an aesthetic point of view. In a heap, or even just a 
linked list, if the performance characteristics of that are acceptable.


It occurs to me that the pairing heap I just posted in another thread 
(http://www.postgresql.org/message-id/54886bb8.9040...@vmware.com) would 
be a good fit for this. It's extremely cheap to insert to and to find 
the minimum item (O(1)), and the delete operation is O(log N), 
amortized. I didn't implement a delete operation, for removing a 
particular node, I only did delete-min, but it's basically the same 
code. Using a pairing heap for this might be overkill, but if we have 
that implementation anyway, the code in snapmgr.c to use it would be 
very simple, so I see little reason not to. It might even be simpler 
than your patch, because you wouldn't need to have the heuristics on 
whether to attempt updating the xmin; it would be cheap enough to always 
try it.
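
Roughly like this on the snapmgr.c side (a sketch only: it assumes a
pairingheap_node is added as the first field of SnapshotData so a plain cast
maps node to snapshot, pairingheap_is_empty() is an assumed trivial helper, and
the ActiveSnapshot stack is ignored for brevity):

static pairingheap *RegisteredSnapshots;    /* pairingheap_allocate(xmin_cmp, NULL) */

/* Order registered snapshots so the one with the oldest xmin is at the root. */
static int
xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
{
    const SnapshotData *sa = (const SnapshotData *) a;
    const SnapshotData *sb = (const SnapshotData *) b;

    if (TransactionIdPrecedes(sa->xmin, sb->xmin))
        return 1;
    else if (TransactionIdFollows(sa->xmin, sb->xmin))
        return -1;
    else
        return 0;
}

/* Called whenever a snapshot is unregistered. */
static void
SnapshotResetXmin(void)
{
    SnapshotData *oldest;

    if (pairingheap_is_empty(RegisteredSnapshots))
    {
        MyPgXact->xmin = InvalidTransactionId;
        return;
    }
    oldest = (SnapshotData *) pairingheap_first(RegisteredSnapshots);
    if (TransactionIdPrecedes(MyPgXact->xmin, oldest->xmin))
        MyPgXact->xmin = oldest->xmin;  /* advance our advertised xmin */
}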


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical column ordering

2014-12-10 Thread Andres Freund
On 2014-12-10 12:06:11 -0500, Robert Haas wrote:
> Ultimately, I think this is mostly going to be a question of what
> Alvaro feels comfortable with; he's presumably going to have a better
> sense of where and to what extent there might be bugs lurking than any
> of the rest of us.  But the point is worth considering, because I
> think we would probably all agree that having a release that is stable
> and usable right out of the gate is more important than having any
> single feature get into that release.

Sure, 9.4 needs to be out of the gate. I don't think anybody doubts
that.

But the scheduling of commits with regard to the 9.5 schedule actually
opens a relevant question: When are we planning to release 9.5? Because
if we try ~ one year from now it's a whole different ballgame than if we
try to go back to September. And I think there are pretty good arguments
for both.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical column ordering

2014-12-10 Thread Robert Haas
On Wed, Dec 10, 2014 at 11:22 AM, Stephen Frost  wrote:
> I'm not quite sure that I see how that's relevant.  Bugs will happen,
> unfortunately, no matter how much review is done of a given patch.  That
> isn't to say that we shouldn't do any review, but it's a trade-off.
> This change, at least, strikes me as less likely to have subtle bugs
> in it as compared to the dropped column case.

Yeah, that's possible.  They seem similar to me because they both
introduce new ways for the tuple as it is stored on disk to be
different from what must be shown to the user.  But I don't really
know how well that translates to what needs to be changed on a code
level.  If we're basically going back over all the same places that
needed special handling for attisdropped, then the likelihood of bugs
may be quite a bit lower than it was for that patch, because now we
know (mostly!) which places require attisdropped handling and we can
audit them all to make sure they handle this, too.  But if it's a
completely different set of places that need to be touched, then I
think there's a lively possibility for bugs of omission.

Ultimately, I think this is mostly going to be a question of what
Alvaro feels comfortable with; he's presumably going to have a better
sense of where and to what extent there might be bugs lurking than any
of the rest of us.  But the point is worth considering, because I
think we would probably all agree that having a release that is stable
and usable right out of the gate is more important than having any
single feature get into that release.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] advance local xmin more aggressively

2014-12-10 Thread Robert Haas
On Wed, Dec 10, 2014 at 9:49 AM, Robert Haas  wrote:
> I guess this bears some further thought.  I certainly don't like the
> fact that this makes the whole system crap out at a lower number of
> subtransactions than presently.  The actual performance numbers don't
> bother me very much; I'm comfortable with the possibility that closing
> a cursor will be some modest percentage slower if you've got thousands
> of active savepoints.

Here's a new version with two changes:

1. I changed the traversal of the resource owner tree to iterate
instead of recurse.  It now does a depth-first, pre-order traversal of
the tree; when we reach the last child of a node, we follow its parent
pointer to get back to where we were.  That way, we don't need to keep
anything on the stack.  That fixed the crash at 100k cursors, but it
was still 4x slower.

2. Instead of traversing the tree until we find an xmin equal to the
one we're currently advertising, the code now traverses the entire
tree each time it runs. However, it also keeps a record of how many
times the oldest xmin occurred in the tree, which is decremented each
time we unregister a snapshot with that xmin; the traversal doesn't
run again until that count reaches 0.  That fixed the performance
regression on your test case.
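
In code form, the bookkeeping in (2) amounts to roughly this (all names invented
for illustration; the actual traversal lives in snapmgr.c/resowner.c in the
attached patch):

/*
 * How many registered snapshots still carry the xmin we currently advertise.
 * Recomputed by the full resource-owner traversal, decremented each time a
 * snapshot with that xmin is unregistered.
 */
static int  AdvertisedXminCount = 0;

static void
UnregisterSnapshotXmin(TransactionId xmin)
{
    if (TransactionIdIsValid(MyPgXact->xmin) &&
        TransactionIdEquals(xmin, MyPgXact->xmin) &&
        --AdvertisedXminCount <= 0)
    {
        /*
         * Nothing pins the advertised xmin any more: walk all resource
         * owners once to find the new oldest xmin (and its count), then
         * update MyPgXact->xmin accordingly.
         */
        RecomputeAdvertisedXmin();  /* hypothetical */
    }
}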

With a million subtransactions:

master 34.464s 33.742s 34.317s
advance-xmin 34.516s 34.069s 34.196s

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 0955bcc..529209f 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -21,6 +21,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "miscadmin.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
 #include "utils/memutils.h"
@@ -113,6 +114,7 @@ typedef struct ResourceOwnerData
 ResourceOwner CurrentResourceOwner = NULL;
 ResourceOwner CurTransactionResourceOwner = NULL;
 ResourceOwner TopTransactionResourceOwner = NULL;
+ResourceOwner FirstRootResourceOwner = NULL;
 
 /*
  * List of add-on callbacks for resource releasing
@@ -167,6 +169,11 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 		owner->nextchild = parent->firstchild;
 		parent->firstchild = owner;
 	}
+	else
+	{
+		owner->nextchild = FirstRootResourceOwner;
+		FirstRootResourceOwner = owner;
+	}
 
 	return owner;
 }
@@ -442,6 +449,8 @@ ResourceOwnerDelete(ResourceOwner owner)
 	 * the owner tree.  Better a leak than a crash.
 	 */
 	ResourceOwnerNewParent(owner, NULL);
+	Assert(FirstRootResourceOwner == owner);
+	FirstRootResourceOwner = owner->nextchild;
 
 	/* And free the object. */
 	if (owner->buffers)
@@ -502,6 +511,14 @@ ResourceOwnerNewParent(ResourceOwner owner,
 			}
 		}
 	}
+	else
+	{
+		ResourceOwner *link = &FirstRootResourceOwner;
+
+		while (*link != owner)
+			link = &((*link)->nextchild);
+		*link = owner->nextchild;
+	}
 
 	if (newparent)
 	{
@@ -513,7 +530,8 @@ ResourceOwnerNewParent(ResourceOwner owner,
 	else
 	{
 		owner->parent = NULL;
-		owner->nextchild = NULL;
+		owner->nextchild = FirstRootResourceOwner;
+		FirstRootResourceOwner = owner;
 	}
 }
 
@@ -1161,6 +1179,59 @@ ResourceOwnerForgetSnapshot(ResourceOwner owner, Snapshot snapshot)
 }
 
 /*
+ * Invoke a caller-supplied function on every snapshot registered with any
+ * resource owner in the system.  The callback can abort the traversal by
+ * returning true.
+ */
+bool
+ResourceOwnerWalkAllSnapshots(ResourceWalkSnapshotCallback callback, void *arg)
+{
+	ResourceOwner	owner = FirstRootResourceOwner;
+	bool		visited = false;
+
+	/*
+	 * We do this traversal non-recursively in order to avoid running out of
+	 * stack space, which can otherwise happen with large numbers of nested
+	 * subtransactions.  The easiest way to do that is to search depth-first,
+	 * so that we visit all of a given node's descendents before reaching its
+	 * right sibling.  When we've visited all of the node's descendents, we'll
+	 * follow the last child's parent pointer back to that node, but visited
+	 * will be set to true at that point, so we'll advance to the right
+	 * sibling instead of traversing it again.
+	 */
+	while (owner != NULL)
+	{
+		if (!visited)
+		{
+			int	i;
+
+			for (i = 0; i < owner->nsnapshots; ++i)
+if (callback(owner->snapshots[i], arg))
+	return true;
+
+			if (owner->firstchild != NULL)
+			{
+owner = owner->firstchild;
+continue;
+			}
+		}
+
+		if (owner->nextchild != NULL)
+		{
+			owner = owner->nextchild;
+			visited = false;
+		}
+		else
+		{
+			owner = owner->parent;
+			visited = true;
+		}
+	}
+
+	return false;
+}
+
+/*
  * Debugging subroutine
  */
 static void
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index d601efe..1fd73c8 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -122,14 +122,28 @@ typedef struct ActiveSnapshotElt
 /* Top

Re: [HACKERS] logical column ordering

2014-12-10 Thread Stephen Frost
Robert, all,

* Robert Haas (robertmh...@gmail.com) wrote:
> To the extent that I have any concern about the patch at this point,
> it's around stability.  I would awfully rather see something like this
> get committed at the beginning of a development cycle than the end.

I tend to agree with this; we have a pretty bad habit of bouncing
around big patches until the end and then committing them.  That's not
as bad when the patch has been getting reviews and feedback over a few
months (or years) but this particular patch isn't even code-complete at
this point, aiui.

> It's quite possible that I'm being more nervous than is justified, but
> given that we're *still* fixing bugs related to dropped-column
> handling (cf. 9b35ddce93a2ef336498baa15581b9d10f01db9c from July of
> this year) which was added in July 2002, maybe not.

I'm not quite sure that I see how that's relevant.  Bugs will happen,
unfortunately, no matter how much review is done of a given patch.  That
isn't to say that we shouldn't do any review, but it's a trade-off.
This change, at least, strikes me as less likely to have subtle bugs
in it as compared to the dropped column case.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] logical column ordering

2014-12-10 Thread Robert Haas
On Wed, Dec 10, 2014 at 9:25 AM, Alvaro Herrera
 wrote:
> FWIW I have no intention to add options for physical/logical ordering
> anywhere.  All users will see is that tables will follow the same
> (logical) order everywhere.

Just to be clear, I wasn't in any way intending to say that the patch
had a problem in this area.  I was just expressing concern about the
apparent rush to judgement on whether converting between physical and
logical column ordering would be expensive.  I certainly think that's
something that we should test - for example, we might want to consider
whether there are cases where you could maybe convince the executor to
spend a lot of time pointlessly reorganizing tuples in ways that
wouldn't happen today.  But I have no particular reason to think that
any issues we hit there won't be solvable.

To the extent that I have any concern about the patch at this point,
it's around stability.  I would awfully rather see something like this
get committed at the beginning of a development cycle than the end.
It's quite possible that I'm being more nervous than is justified, but
given that we're *still* fixing bugs related to dropped-column
handling (cf. 9b35ddce93a2ef336498baa15581b9d10f01db9c from July of
this year) which was added in July 2002, maybe not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] GiST kNN search queue (Re: KNN-GiST with recheck)

2014-12-10 Thread Heikki Linnakangas

On 01/28/2014 04:12 PM, Alexander Korotkov wrote:

>3. A binary heap would be a better data structure to buffer the rechecked
>values. A Red-Black tree allows random insertions and deletions, but in
>this case you need to insert arbitrary values but only remove the minimum
>item. That's exactly what a binary heap excels at. We have a nice binary
>heap implementation in the backend that you can use, see
>src/backend/lib/binaryheap.c.
>

Hmm. For me binary heap would be a better data structure for KNN-GiST at
all :-)


I decided to give this a shot, replacing the red-black tree in GiST with 
the binary heap we have in lib/binaryheap.c. It made the GiST code 
somewhat simpler, as the binaryheap interface is simpler than the 
red-black tree one. Unfortunately, performance was somewhat worse. That 
was quite surprising, as insertions and deletions are both O(log N) in 
both data structures, but the red-black tree implementation is more 
complicated.


I implemented another data structure called a Pairing Heap. It's also a 
fairly simple data structure, but insertions are O(1) instead of O(log 
N). It also performs fairly well in practice.


With that, I got a small but measurable improvement. To test, I created 
a table like this:


create table gisttest (id integer, p point);
insert into gisttest select id, point(random(), random()) from 
generate_series(1, 100) id;

create index i_gisttest on gisttest using gist (p);

And I ran this query with pgbench:

select id from gisttest order by p <-> '(0,0)' limit 1000;

With unpatched master, I got about 650 TPS, and with the patch 720 TPS. 
That's a nice little improvement, but perhaps more importantly, the 
pairing heap implementation consumes less memory. To measure that, I put 
a MemoryContextStats(so->queueCtx) call into gistendscan. With the above 
query, but without the "limit" clause, on master I got:


GiST scan context: 2109752 total in 10 blocks; 2088456 free (24998 
chunks); 21296 used


And with the patch:

GiST scan context: 1061160 total in 9 blocks; 1040088 free (12502 
chunks); 21072 used


That's 2MB vs 1MB. While that's not much in absolute terms, it'd be nice 
to reduce that memory consumption, as there is no hard upper bound on 
how much might be needed. If the GiST tree is really disorganized for 
some reason, a query might need a lot more.



So all in all, I quite like this patch, even though it doesn't do 
anything too phenomenal. It adds some code, in the form of the new 
pairing heap implementation, but it makes the GiST code a little bit 
simpler. And it gives a small performance gain, and reduces memory usage 
a bit.


- Heikki

diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 7a8692b..52b2c53 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -18,6 +18,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "lib/pairingheap.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -243,8 +244,6 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 	GISTPageOpaque opaque;
 	OffsetNumber maxoff;
 	OffsetNumber i;
-	GISTSearchTreeItem *tmpItem = so->tmpTreeItem;
-	bool		isNew;
 	MemoryContext oldcxt;
 
 	Assert(!GISTSearchItemIsHeap(*pageItem));
@@ -275,18 +274,15 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 		oldcxt = MemoryContextSwitchTo(so->queueCxt);
 
 		/* Create new GISTSearchItem for the right sibling index page */
-		item = palloc(sizeof(GISTSearchItem));
-		item->next = NULL;
+		item = palloc(SizeOfGISTSearchItem(scan->numberOfOrderBys));
 		item->blkno = opaque->rightlink;
 		item->data.parentlsn = pageItem->data.parentlsn;
 
 		/* Insert it into the queue using same distances as for this page */
-		tmpItem->head = item;
-		tmpItem->lastHeap = NULL;
-		memcpy(tmpItem->distances, myDistances,
+		memcpy(item->distances, myDistances,
 			   sizeof(double) * scan->numberOfOrderBys);
 
-		(void) rb_insert(so->queue, (RBNode *) tmpItem, &isNew);
+		pairingheap_add(so->queue, &item->fbNode);
 
 		MemoryContextSwitchTo(oldcxt);
 	}
@@ -348,8 +344,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 			oldcxt = MemoryContextSwitchTo(so->queueCxt);
 
 			/* Create new GISTSearchItem for this item */
-			item = palloc(sizeof(GISTSearchItem));
-			item->next = NULL;
+			item = palloc(SizeOfGISTSearchItem(scan->numberOfOrderBys));
 
 			if (GistPageIsLeaf(page))
 			{
@@ -372,12 +367,10 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 			}
 
 			/* Insert it into the queue using new distance data */
-			tmpItem->head = item;
-			tmpItem->lastHeap = GISTSearchItemIsHeap(*item) ? item : NULL;
-			memcpy(tmpItem->distances, so->distances,
+			memcpy(item->distances, so->distances,
    sizeof(double) * scan->numberOfOrderBys);
 
-			(void) rb_insert(so->queue, 

Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps

2014-12-10 Thread Petr Jelinek

On 10/12/14 04:26, Michael Paquier wrote:

On Thu, Dec 4, 2014 at 9:26 PM, Heikki Linnakangas
 wrote:

Yeah, it was raised. I don't think it was really addressed. There was
lengthy discussion on whether to include LSN, node id, and/or some other
information, but there was no discussion on:

* What is a node ID?
* How is it used?
* Who assigns it?

etc.

None of those things are explained in the documentation nor code comments.

The node ID sounds suspiciously like the replication identifiers that have
been proposed a couple of times. I don't remember if I like replication
identifiers or not, but I'd rather discuss that issue explicitly and
separately. I don't want the concept of a replication/node ID to fly under
the radar like this.


Replication identifiers are a bigger thing, but yes, there is overlap, as the 
nodeid itself makes it possible to implement replication identifiers 
outside of core if needed.





What would the SQL API look like, and what would it be used for?



It would probably mirror the C one, from my POV it's not needed but
maybe some replication solution would prefer to use this without having
to write C components...



I can't comment on that, because without any documentation, I don't even
know what the C API is.

BTW, why is it OK that the node-ID of a commit is logged as a separate WAL
record, after the commit record? That means that it's possible that a
transaction commits, but you crash just before writing the SETTS record, so
that after replay the transaction appears committed but with the default
node ID.

(Maybe that's OK given the intended use case, but it's hard to tell without
any documentation. See a pattern here? ;-) )


This is actually a good point; it's not OK to have just a commit record but 
no nodeid record if the nodeid was set.




Could it be possible to get a refresh on that before it bitrots too
much or we'll simply forget about it? In the current state, there is
no documentation able to explain what is the point of the nodeid
stuff, how to use it and what use cases it is aimed at. The API in
committs.c should as well be part of it. If that's not done, I think
that it would be fair to remove this portion and simply WAL-log the
commit timestamp its GUC is activated.


I have this on my list so it's not being forgotten and I don't see it 
bitrotting unless we are changing the XLog API again. I have a rather busy 
schedule right now though, so can it wait till next week? - I will post 
some code documentation patches then.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] advance local xmin more aggressively

2014-12-10 Thread Robert Haas
On Wed, Dec 10, 2014 at 7:34 AM, Heikki Linnakangas
 wrote:
>>> 1. I don't really see why resource owners shouldn't be traversed like
>>> that.  They are clearly intended to form a hierarchy, and there's
>>> existing code that recurses through the hierarchy from a given level
>>> downward.  What's ugly about that?
>
> I can't exactly point my finger on it, but it just feels wrong from a
> modularity point of view. Their purpose is to make sure that we don't leak
> resources on abort, by allowing an easy "release everything" operation. It's
> not designed for finding objects based on some other criteria.
>
> There is similar double bookkeeping of many other things that are tracked by
> resource owners. Heavy-weight locks are tracked by LOCALLOCK structs, buffer
> pins in PrivateRefCount array etc. Those things existed before resource
> owners were invented, but if we were starting from scratch, that design
> would still make sense, as different objects have different access criteria.
> fd.c needs to be able to find the least-recently-used open file, for
> example, and you need to find the snapshot with the lowest xmin.

I think all of these are performance trade-offs rather than questions
of style.  LOCALLOCK structs and private buffer pin tracking existed
before resource owners because we need to do lookups of locks by lock
tag or buffers by buffer number frequently enough that, if we had to
grovel trough all the locks or buffer pins in the system without any
indexing, it would be slow.  We *could* do it that way now that we
have resource owners, but it's pretty obvious that it would suck.

That's less obvious here.  If we throw all of the registered snapshots
into a binary heap, finding the lowest xmin will be cheaper every time
we do it, but we'll have the overhead of maintaining that binary heap
even if it never gets used.

> I think you're confusing the N and the N above. It's true that deleting a
> resource owner is O(M), where the M is the number of children of that
> resource owner. It does not follow that releasing N resource owners is
> O(N^2), where N is the number of resource owners released. Calling
> ResourceOwnerDelete(x) will only visit each resource owner in that tree
> once, so it's O(N), where N is the total number of resource owners in the
> tree.

Not if the portals are dropped in separate commands with an
unpredictable ordering.  Then you have N separate calls to
ResourceOwnerDelete.

> I did some testing of the worst case scenario. The attached script first
> creates a lot of cursors, then a lot of savepoints, and finally closes the
> cursors in FIFO order. When the number of savepoints is high enough, this
> actually segfaults with your patch, because you run out of stack space when
> recursing the subxact resource owners. That's hardly this patch's fault, I'm
> actually surprised it doesn't crash without it, because we recurse into all
> resource owners in ResourceOwnerRelease too. Apparently the subxacts are
> closed in LIFO order at commit, but there might be other cases where you
> could trigger that. In any case, a stack-depth check would be nice.

Thanks for the test case; that's helpful.

I added a stack depth check, but it doesn't help much:

ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth"
(currently 2048kB), after ensuring the platform's stack depth limit is
adequate.
STATEMENT:  CLOSE cur0;
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth"
(currently 2048kB), after ensuring the platform's stack depth limit is
adequate.
WARNING:  AbortSubTransaction while in ABORT state
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth"
(currently 2048kB), after ensuring the platform's stack depth limit is
adequate.
WARNING:  AbortSubTransaction while in ABORT state
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth"
(currently 2048kB), after ensuring the platform's stack depth limit is
adequate.
WARNING:  AbortSubTransaction while in ABORT state
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth"
(currently 2048kB), after ensuring the platform's stack depth limit is
adequate.
PANIC:  ERRORDATA_STACK_SIZE exceeded

That's at 100k subtransactions.  I took some performance data for
lower numbers of subtransactions:

-- at 1000 subtransactions --
master 0.156s 0.157s 0.154s
advance-xmin 0.177s 0.175s 0.177s

-- at 1 subtransactions --
master 0.458s 0.457s 0.456s
advance-xmin 0.639s 0.637s 0.633

I guess this bears some further thought.  I certainly don't like the
fact that this makes the whole system crap out at a lower number of
subtransactions than presently.  The actual performance numbers don't
bother me very much; I'm comfortable with the possibility that closing
a cursor will be some modest percentage slower if you've got thousands
of active savepoints.

Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-10 Thread Arthur Silva
On Wed, Dec 10, 2014 at 12:10 PM, Rahila Syed 
wrote:

> >What I would suggest is instrument the backend with getrusage() at
> >startup and shutdown and have it print the difference in user time and
> >system time.  Then, run tests for a fixed number of transactions and
> >see how the total CPU usage for the run differs.
>
> Folllowing are the numbers obtained on tests with absolute CPU usage,
> fixed number of transactions and longer duration with latest fpw
> compression patch
>
> pgbench command : pgbench  -r -t 25 -M prepared
>
> To ensure that data is not highly compressible, empty filler columns were
> altered using
>
> alter table pgbench_accounts alter column filler type text using
> gen_random_uuid()::text
>
> checkpoint_segments = 1024
> checkpoint_timeout =  5min
> fsync = on
>
> The tests ran for around 30 mins. Manual checkpoint was run before each
> test.
>
> Compression  WAL generated  %compression  Latency-avg  CPU usage (seconds)                      TPS     Latency stddev
>
> on           1531.4 MB      ~35 %         7.351 ms     user diff: 562.67s system diff: 41.40s   135.96  13.759 ms
>
> off          2373.1 MB                    6.781 ms     user diff: 354.20s system diff: 39.67s   147.40  14.152 ms
>
> The compression obtained is quite high close to 35 %.
> CPU usage at user level when compression is on is quite noticeably high as
> compared to that when compression is off. But gain in terms of reduction of
> WAL is also high.
>
> Server specifications:
> Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
> RAM: 32GB
> Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
> 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
>
>
>
> Thank you,
>
> Rahila Syed
>
>
>
>
>
> On Fri, Dec 5, 2014 at 10:38 PM, Robert Haas 
> wrote:
>
>> On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed 
>> wrote:
>> >>If that's really true, we could consider having no configuration any
>> >>time, and just compressing always.  But I'm skeptical that it's
>> >>actually true.
>> >
>> > I was referring to this for CPU utilization:
>> >
>> http://www.postgresql.org/message-id/1410414381339-5818552.p...@n5.nabble.com
>> > 
>> >
>> > The above tests were performed on machine with configuration as follows
>> > Server specifications:
>> > Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2
>> nos
>> > RAM: 32GB
>> > Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
>> > 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
>>
>> I think that measurement methodology is not very good for assessing
>> the CPU overhead, because you are only measuring the percentage CPU
>> utilization, not the absolute amount of CPU utilization.  It's not
>> clear whether the duration of the tests was the same for all the
>> configurations you tried - in which case the number of transactions
>> might have been different - or whether the number of operations was
>> exactly the same - in which case the runtime might have been
>> different.  Either way, it could obscure an actual difference in
>> absolute CPU usage per transaction.  It's unlikely that both the
>> runtime and the number of transactions were identical for all of your
>> tests, because that would imply that the patch makes no difference to
>> performance; if that were true, you wouldn't have bothered writing
>> it
>>
>> What I would suggest is instrument the backend with getrusage() at
>> startup and shutdown and have it print the difference in user time and
>> system time.  Then, run tests for a fixed number of transactions and
>> see how the total CPU usage for the run differs.
>>
>> Last cycle, Amit Kapila did a bunch of work trying to compress the WAL
>> footprint for updates, and we found that compression was pretty darn
>> expensive there in terms of CPU time.  So I am suspicious of the
>> finding that it is free here.  It's not impossible that there's some
>> effect which causes us to recoup more CPU time than we spend
>> compressing in this case that did not apply in that case, but the
>> projects are awfully similar, so I tend to doubt it.
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>
>
This can be improved in the future by using other algorithms.


Re: [HACKERS] BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

2014-12-10 Thread Dennis Kögel
Hi,

Am 04.09.2014 um 17:50 schrieb Jehan-Guillaume de Rorthais :
> Since few months, we occasionally see .ready files appearing on some slave
> instances from various context. The two I have in mind are under 9.2.x. […]
> So it seems that for some reason, these old WALs were "forgotten" by the
> restartpoint mechanism when they should have been recycled/deleted.

Am 08.10.2014 um 11:54 schrieb Heikki Linnakangas :
> 1. Where do the FF files come from? In 9.2, FF-segments are not supposed to be
> created, ever. […]
> 2. Why are the .done files sometimes not being created?



We’ve encountered behaviour which seems to match what has been described here: 
On Streaming Replication slaves, there is an odd piling up of old WALs and 
.ready files in pg_xlog, going back several months.

The fine people on IRC have pointed me to this thread, and have encouraged me 
to revive it with our observations, so here we go:

Environment:

Master,  9.2.9
|- Slave S1, 9.2.9, on the same network as the master
'- Slave S2, 9.2.9, some 100 km away (occasional network hiccups; *not* a 
cascading replication)

wal_keep_segments M=100 S1=100 S2=30
checkpoint_segments M=100 S1=30 S2=30
wal_level hot_standby (all)
archive_mode on (all)
archive_command on both slaves: /bin/true
archive_timeout 600s (all)


- On both slaves, we have „ghost“ WALs and corresponding .ready files 
(currently >600 of each on S2, slowly becoming a disk space problem)

- There’s always gaps in the ghost WAL names, often roughly 0x20, but not always

- The slave with the „bad“ network link has significantly more of these files, 
which suggests that disturbances of the Streaming Replication increase chances 
of triggering this bug; OTOH, the presence of a name gap pattern suggests the 
opposite

- We observe files named *FF as well


As you can see in the directory listings below, this setup is *very* low 
traffic, which may explain the pattern in WAL name gaps (?).

I’ve listed the entries by time, expecting to easily match WALs to their .ready 
files.
There sometimes is an interesting delay between the WAL’s mtime and the .ready 
file — especially for *FF, where there’s several days between the WAL and the 
.ready file.

- Master:   http://pgsql.privatepaste.com/52ad612dfb
- Slave S1: http://pgsql.privatepaste.com/58b4f3bb10
- Slave S2: http://pgsql.privatepaste.com/a693a8d7f4


I’ve only skimmed through the thread; my understanding is that there were 
several patches floating around, but nothing was committed.
If there’s any way I can help, please let me know.

- D.



Re: [HACKERS] logical column ordering

2014-12-10 Thread Alvaro Herrera
Robert Haas wrote:
> On Wed, Dec 10, 2014 at 12:17 AM, Tom Lane  wrote:
> > Alvaro Herrera  writes:
> >> Andrew Dunstan wrote:
> >>> I seriously doubt it, although I could be wrong. Unless someone can show a
> >>> significant performance gain from using physical order, which would be a 
> >>> bit
> >>> of a surprise to me, I would just stick with logical ordering as the
> >>> default.
> >
> >> Well, we have an optimization that avoids a projection step IIRC by
> >> using the "physical tlist" instead of having to build a tailored one.  I
> >> guess the reason that's there is because somebody did measure an
> >> improvement.  Maybe it *is* worth having as an option for pg_dump ...
> >
> > The physical tlist thing is there because it's demonstrable that
> > ExecProject() takes nonzero time.  COPY does not go through ExecProject
> > though.  What's more, it already has code to deal with a user-specified
> > column order, and nobody's ever claimed that that code imposes a
> > measurable performance overhead.
> 
> Also, if we're adding options to use the physical rather than the
> logical column ordering in too many places, that's probably a sign
> that we need to rethink this whole concept.  The concept of a logical
> column ordering doesn't have much meaning if you're constantly forced
> to fall back to some other column ordering whenever you want good
> performance.

FWIW I have no intention of adding options for physical/logical ordering
anywhere.  All that users will see is that tables follow the same
(logical) order everywhere.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-10 Thread Bruce Momjian
On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
> The tests ran for around 30 mins. A manual checkpoint was run before each test.
> 
> Compression  WAL generated  %compression  Latency-avg  CPU usage (seconds)                      TPS     Latency stddev
> on           1531.4 MB      ~35 %         7.351 ms     user diff: 562.67s  system diff: 41.40s  135.96  13.759 ms
> off          2373.1 MB                    6.781 ms     user diff: 354.20s  system diff: 39.67s  147.40  14.152 ms
> 
> The compression obtained is quite high, close to 35%.
> CPU usage at the user level when compression is on is noticeably higher than
> when compression is off, but the gain in terms of WAL reduction is also high.

I am sorry but I can't understand the above results due to wrapping. 
Are you saying compression was twice as slow?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] On partitioning

2014-12-10 Thread Alvaro Herrera
Amit Langote wrote:

> On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila  wrote:
> > On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera 
> > wrote:

> >> FWIW in my original proposal I was rejecting some things that after
> >> further consideration turn out to be possible to allow; for instance
> >> directly referencing individual partitions in COPY.  We could allow
> >> something like
> >>
> >> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> >> or maybe
> >> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
> >>
> > or
> > COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01'  TO STDOUT
> > COPY [TABLE] lineitems PARTITION   TO STDOUT
> >
> > I think we should try to support operations on partitions via the main
> > table wherever it is required.

Um, I think the only difference is that you added the noise word TABLE
which we currently don't allow in COPY, and that you added the
possibility of using named partitions, about which see below.

> We can also allow to explicitly name a partition
> 
> COPY [TABLE ] lineitems PARTITION lineitems_2001 TO STDOUT;

The problem with naming partitions is that the user has to pick names
for every partition, which is tedious and doesn't provide any
significant benefit.  The input I had from users of other partitioning
systems was that they very much preferred not to name the partitions at
all, which is why I chose the PARTITION FOR VALUE syntax (not sure if
this syntax is exactly what other systems use; it just seemed the
natural choice.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-10 Thread Rahila Syed
>What I would suggest is instrument the backend with getrusage() at
>startup and shutdown and have it print the difference in user time and
>system time.  Then, run tests for a fixed number of transactions and
>see how the total CPU usage for the run differs.

Following are the numbers obtained from tests with absolute CPU usage, a fixed
number of transactions, and a longer duration, using the latest FPW compression
patch.

pgbench command : pgbench  -r -t 25 -M prepared

To ensure that data is not highly compressible, empty filler columns were
altered using

alter table pgbench_accounts alter column filler type text using
gen_random_uuid()::text
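
For reference, gen_random_uuid() comes from the pgcrypto extension, so the test
database presumably had it installed first:

    -- assumed setup step (pgcrypto provides gen_random_uuid() in this release)
    CREATE EXTENSION IF NOT EXISTS pgcrypto;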

checkpoint_segments = 1024
checkpoint_timeout =  5min
fsync = on

The tests ran for around 30 mins. A manual checkpoint was run before each test.

Compression  WAL generated  %compression  Latency-avg  CPU usage (seconds)                      TPS     Latency stddev
on           1531.4 MB      ~35 %         7.351 ms     user diff: 562.67s  system diff: 41.40s  135.96  13.759 ms
off          2373.1 MB                    6.781 ms     user diff: 354.20s  system diff: 39.67s  147.40  14.152 ms

The compression obtained is quite high, close to 35%.
CPU usage at the user level when compression is on is noticeably higher than
when compression is off, but the gain in terms of WAL reduction is also high.

Server specifications:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm



Thank you,

Rahila Syed





On Fri, Dec 5, 2014 at 10:38 PM, Robert Haas  wrote:

> On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed 
> wrote:
> >>If that's really true, we could consider having no configuration any
> >>time, and just compressing always.  But I'm skeptical that it's
> >>actually true.
> >
> > I was referring to this for CPU utilization:
> >
> http://www.postgresql.org/message-id/1410414381339-5818552.p...@n5.nabble.com
> > 
> >
> > The above tests were performed on machine with configuration as follows
> > Server specifications:
> > Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
> > RAM: 32GB
> > Disk : HDD  450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
> > 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
>
> I think that measurement methodology is not very good for assessing
> the CPU overhead, because you are only measuring the percentage CPU
> utilization, not the absolute amount of CPU utilization.  It's not
> clear whether the duration of the tests was the same for all the
> configurations you tried - in which case the number of transactions
> might have been different - or whether the number of operations was
> exactly the same - in which case the runtime might have been
> different.  Either way, it could obscure an actual difference in
> absolute CPU usage per transaction.  It's unlikely that both the
> runtime and the number of transactions were identical for all of your
> tests, because that would imply that the patch makes no difference to
> performance; if that were true, you wouldn't have bothered writing
> it.
>
> What I would suggest is instrument the backend with getrusage() at
> startup and shutdown and have it print the difference in user time and
> system time.  Then, run tests for a fixed number of transactions and
> see how the total CPU usage for the run differs.
>
> Last cycle, Amit Kapila did a bunch of work trying to compress the WAL
> footprint for updates, and we found that compression was pretty darn
> expensive there in terms of CPU time.  So I am suspicious of the
> finding that it is free here.  It's not impossible that there's some
> effect which causes us to recoup more CPU time than we spend
> compressing in this case that did not apply in that case, but the
> projects are awfully similar, so I tend to doubt it.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
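
For reference, the getrusage() instrumentation suggested above amounts to
something like the following (a standalone sketch; the hook points and names
are assumptions, not actual backend code):

    #include <sys/resource.h>
    #include <stdio.h>

    static struct rusage start_usage;

    /* call once at backend startup */
    static void
    instr_start(void)
    {
        getrusage(RUSAGE_SELF, &start_usage);
    }

    /* call once at backend shutdown: print user/system CPU time consumed */
    static void
    instr_report(void)
    {
        struct rusage end_usage;

        getrusage(RUSAGE_SELF, &end_usage);
        fprintf(stderr, "user diff: %.2fs  system diff: %.2fs\n",
                (end_usage.ru_utime.tv_sec - start_usage.ru_utime.tv_sec) +
                (end_usage.ru_utime.tv_usec - start_usage.ru_utime.tv_usec) / 1e6,
                (end_usage.ru_stime.tv_sec - start_usage.ru_stime.tv_sec) +
                (end_usage.ru_stime.tv_usec - start_usage.ru_stime.tv_usec) / 1e6);
    }

Summed over a run of a fixed number of transactions, those two diffs are the
"user diff" / "system diff" figures quoted in the table above.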


Re: [HACKERS] Small TRUNCATE glitch

2014-12-10 Thread Alex Shulgin
Bruce Momjian  writes:

> On Wed, Dec 10, 2014 at 10:32:42AM +0200, Heikki Linnakangas wrote:
>> >I don't think we need to have 2PC files survive a pg_upgrade.  It seems
>> >perfectly okay to remove them from the new cluster at some appropriate
>> >time, *if* they are copied from the old cluster at all (I don't think
>> >they should be.)
>> 
>> I think pg_upgrade should check if there are any prepared
>> transactions pending, and refuse to upgrade if there are. It could
>> be made to work, but it's really not worth the trouble. If there are
>> any pending prepared transactions in the system when you run
>> pg_upgrade, it's more likely to be a mistake or oversight in the
>> first place, than on purpose.
>
> pg_upgrade already checks for prepared transactions and errors out if
> they exist;  see check_for_prepared_transactions().

Alright, that's good to know.  So I'm just adding a new bool field to
the 2PC pgstat record instead of the bit magic.

Attached is v0.2, now with a regression test included.
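
For readers without the attachment, the kind of check such a test presumably
performs is along these lines (an illustrative sketch only; the test table name
is made up, not taken from the patch):

    -- after TRUNCATE commits, the table's live/dead tuple counters should be 0
    CREATE TABLE trunc_stats_test (id serial);
    INSERT INTO trunc_stats_test DEFAULT VALUES;
    INSERT INTO trunc_stats_test DEFAULT VALUES;
    TRUNCATE trunc_stats_test;
    -- once the stats collector has caught up:
    SELECT n_live_tup, n_dead_tup
      FROM pg_stat_user_tables
     WHERE relname = 'trunc_stats_test';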

--
Alex

From 4c8fae27ecd9c94e7c3da0997f03099045a152d9 Mon Sep 17 00:00:00 2001
From: Alex Shulgin 
Date: Tue, 9 Dec 2014 16:35:14 +0300
Subject: [PATCH] WIP: track TRUNCATEs in pgstat transaction stats.

The n_live_tup and n_dead_tup counters need to be set to 0 after a
TRUNCATE on the relation.  We can't issue a special message to the stats
collector because the xact might be later aborted, so we track the fact
that the relation was truncated during the xact (and reset this xact's
insert/update/delete counters).  When xact is committed, we use the
`truncated' flag to reset the n_live_tup and n_dead_tup counters.
---
 src/backend/commands/tablecmds.c   |   2 +
 src/backend/postmaster/pgstat.c|  52 -
 src/include/pgstat.h   |   3 +
 src/test/regress/expected/truncate.out | 136 +
 src/test/regress/sql/truncate.sql  |  98 
 5 files changed, 288 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
new file mode 100644
index 1e737a0..192d033
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** ExecuteTruncate(TruncateStmt *stmt)
*** 1224,1229 
--- 1224,1231 
  			 */
  			reindex_relation(heap_relid, REINDEX_REL_PROCESS_TOAST);
  		}
+ 
+ 		pgstat_count_heap_truncate(rel);
  	}
  
  	/*
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
new file mode 100644
index c7f41a5..88c83d2
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
*** typedef struct TwoPhasePgStatRecord
*** 201,206 
--- 201,207 
  	PgStat_Counter tuples_deleted;		/* tuples deleted in xact */
  	Oid			t_id;			/* table's OID */
  	bool		t_shared;		/* is it a shared catalog? */
+ 	bool		t_truncated;	/* was the relation truncated? */
  } TwoPhasePgStatRecord;
  
  /*
*** pgstat_count_heap_delete(Relation rel)
*** 1864,1869 
--- 1865,1894 
  }
  
  /*
+  * pgstat_count_heap_truncate - update tuple counters due to truncate
+  */
+ void
+ pgstat_count_heap_truncate(Relation rel)
+ {
+ 	PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+ 
+ 	if (pgstat_info != NULL)
+ 	{
+ 		/* We have to log the effect at the proper transactional level */
+ 		int			nest_level = GetCurrentTransactionNestLevel();
+ 
+ 		if (pgstat_info->trans == NULL ||
+ 			pgstat_info->trans->nest_level != nest_level)
+ 			add_tabstat_xact_level(pgstat_info, nest_level);
+ 
+ 		pgstat_info->trans->tuples_inserted = 0;
+ 		pgstat_info->trans->tuples_updated = 0;
+ 		pgstat_info->trans->tuples_deleted = 0;
+ 		pgstat_info->trans->truncated = true;
+ 	}
+ }
+ 
+ /*
   * pgstat_update_heap_dead_tuples - update dead-tuples count
   *
   * The semantics of this are that we are reporting the nontransactional
*** AtEOXact_PgStat(bool isCommit)
*** 1927,1932 
--- 1952,1959 
  			tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
  			if (isCommit)
  			{
+ tabstat->t_counts.t_truncated = trans->truncated;
+ 
  /* insert adds a live tuple, delete removes one */
  tabstat->t_counts.t_delta_live_tuples +=
  	trans->tuples_inserted - trans->tuples_deleted;
*** AtEOSubXact_PgStat(bool isCommit, int ne
*** 1991,1999 
  			{
  if (trans->upper && trans->upper->nest_level == nestDepth - 1)
  {
! 	trans->upper->tuples_inserted += trans->tuples_inserted;
! 	trans->upper->tuples_updated += trans->tuples_updated;
! 	trans->upper->tuples_deleted += trans->tuples_deleted;
  	tabstat->trans = trans->upper;
  	pfree(trans);
  }
--- 2018,2037 
  			{
  if (trans->upper && trans->upper->nest_level == nestDepth - 1)
  {
! 	if (trans->truncated)
! 	{
! 		trans->upper->truncated = true;
! 		/* replace upper xact stats with ours */
! 		trans->upper->tuples_inserted = trans->tuples_inserted;
! 

Re: [HACKERS] PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching

2014-12-10 Thread Kevin Grittner
Tomas Vondra  wrote:

> back when we were discussing the hashjoin patches (now committed),
> Robert proposed that maybe it'd be a good idea to sometimes increase the
> number of tuples per bucket instead of batching.
>
> That is, while initially sizing the hash table - if the hash table with
> enough buckets to satisfy NTUP_PER_BUCKET load factor does not fit into
> work_mem, try a bit higher load factor before starting to batch.
>
> Attached patch is an initial attempt to implement this - it's a bit
> rough on the edges, but hopefully enough to judge the benefit of this.
>
> The default load factor is 1. The patch tries to degrade this to 2, 4 or
> 8 in attempt to fit the hash table into work mem. If it doesn't work, it
> starts batching with the default load factor. If the batching is
> required while the hashjoin is running, the load factor is switched back
> to the default one (if we're batching, there's no point in keeping the
> slower hash table).
>
> The patch also modifies explain output, to show the load factor.
>
> The testing I've done so far is rather disappointing, though.
>
> create table a as select i from generate_series(1,100) s(i);
> create table b as select i from generate_series(1,100) s(i);
>
> analyze a;
> analyze b;
>
> select a.i, b.i from a join b on (a.i = b.i);
>
> work_mem   batches   tuples per bucket   duration
> --------------------------------------------------
> 64 MB      1         1                    585 ms
> 46 MB      1         2                    639 ms
> 43 MB      1         4                    794 ms
> 40 MB      1         8                   1075 ms
> 39 MB      2         1                    623 ms
>
> So, even increasing the load factor to 2 is slower than batching.

Right, I saw the same thing.

I tried pretty hard to create a case where increasing the load
factor from 1 to 2 was faster than going to a second batch, and was
unable to do so.   I hypothesized that the second batch was cached
by the OS and flipping the data in and out of the OS cache was
faster than chasing through the links.   I expect that if you have a
large enough hashtable to barely exceed your machine's ability to
cache, you *might* see a benefit in the hash table access by
increasing the load factor.   I think it would be incredibly hard to
accurately identify those cases, and every time a hash table was
used there would be a cost to trying to figure it out.  I just
can't see this being a win.

> I'm not sure what's the best way forward - clearly, for cases that fit
> into RAM (temp files into page cache), batching is faster. For tables
> large enough to cause a lot of I/O, it may make a difference - but I'm
> not sure how to identify these cases.
>
> So ISTM implementing this requires a reliable way to identify which case
> we're dealing with - if the outer table is large enough (prefer
> increasing load factor) or not (prefer batching). Until then keeping the
> current simple/predictible approach is probably better.
>
> Of course, this also depends on additional variables (e.g. is the system
> memory-stressed?).

All that is on-target, but my take-away is that increasing load
factor to avoid a second batch was an interesting idea that turns
out to not really be a good one.   If someone can craft a
reproducible test case that demonstrates a win for increasing the
load factor that doesn't have significant cost for cases where it
isn't a win, I might change my opinion; but count me as a skeptic.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
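
For readers following along, the sizing fallback being discussed is roughly the
following (a sketch of the idea only; hashtable_size() here is a stand-in with
made-up constants, not PostgreSQL's actual memory accounting):

    #include <stdbool.h>
    #include <stddef.h>

    /* Rough hash table footprint: bucket array plus per-tuple overhead. */
    static size_t
    hashtable_size(double ntuples, int ntup_per_bucket, size_t tuple_width)
    {
        size_t nbuckets = (size_t) (ntuples / ntup_per_bucket) + 1;
        return nbuckets * sizeof(void *) + (size_t) ntuples * (tuple_width + 16);
    }

    /* Sketch of the load-factor fallback under discussion (not the patch code). */
    static int
    choose_ntup_per_bucket(double ntuples, size_t tuple_width,
                           size_t work_mem_bytes, bool *use_batching)
    {
        int ntup_per_bucket = 1;        /* default NTUP_PER_BUCKET load factor */

        while (hashtable_size(ntuples, ntup_per_bucket, tuple_width) > work_mem_bytes &&
               ntup_per_bucket < 8)
            ntup_per_bucket *= 2;       /* try 2, 4, then 8 tuples per bucket */

        if (hashtable_size(ntuples, ntup_per_bucket, tuple_width) > work_mem_bytes)
        {
            ntup_per_bucket = 1;        /* still too big: batch at the default factor */
            *use_batching = true;
        }
        else
            *use_batching = false;

        return ntup_per_bucket;
    }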




Re: [HACKERS] logical column ordering

2014-12-10 Thread Robert Haas
On Wed, Dec 10, 2014 at 12:17 AM, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> Andrew Dunstan wrote:
>>> I seriously doubt it, although I could be wrong. Unless someone can show a
>>> significant performance gain from using physical order, which would be a bit
>>> of a surprise to me, I would just stick with logical ordering as the
>>> default.
>
>> Well, we have an optimization that avoids a projection step IIRC by
>> using the "physical tlist" instead of having to build a tailored one.  I
>> guess the reason that's there is because somebody did measure an
>> improvement.  Maybe it *is* worth having as an option for pg_dump ...
>
> The physical tlist thing is there because it's demonstrable that
> ExecProject() takes nonzero time.  COPY does not go through ExecProject
> though.  What's more, it already has code to deal with a user-specified
> column order, and nobody's ever claimed that that code imposes a
> measurable performance overhead.

Also, if we're adding options to use the physical rather than the
logical column ordering in too many places, that's probably a sign
that we need to rethink this whole concept.  The concept of a logical
column ordering doesn't have much meaning if you're constantly forced
to fall back to some other column ordering whenever you want good
performance.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] advance local xmin more aggressively

2014-12-10 Thread Heikki Linnakangas

On 12/09/2014 10:35 PM, Robert Haas wrote:

On Mon, Dec 8, 2014 at 9:31 AM, Robert Haas  wrote:

On Mon, Dec 8, 2014 at 4:56 AM, Heikki Linnakangas
 wrote:

I don't immediately see the problem either, but I have to say that
grovelling through all the resource owners seems ugly anyway. Resource
owners are not meant to be traversed like that. And there could be a lot of
them, and you have to visit every one of them. That could get expensive if
there are a lot of resource owners.


1. I don't really see why resource owners shouldn't be traversed like
that.  They are clearly intended to form a hierarchy, and there's
existing code that recurses through the hierarchy from a given level
downward.  What's ugly about that?


I can't exactly put my finger on it, but it just feels wrong from a 
modularity point of view. Their purpose is to make sure that we don't 
leak resources on abort, by allowing an easy "release everything" 
operation. It's not designed for finding objects based on some other 
criteria.


There is similar double bookkeeping of many other things that are 
tracked by resource owners. Heavy-weight locks are tracked by LOCALLOCK 
structs, buffer pins in PrivateRefCount array etc. Those things existed 
before resource owners were invented, but if we were starting from 
scratch, that design would still make sense, as different objects have 
different access criteria. fd.c needs to be able to find the 
least-recently-used open file, for example, and you need to find the 
snapshot with the lowest xmin.



Upthread, I suggested keeping a tally of the number of snapshots with
the advertised xmin and recomputing the xmin to advertise only when it
reaches 0.  This patch doesn't implement that optimization, but
it does have code that aborts the traversal of the resource owner
hierarchy as soon as we see an xmin that will preclude advancing our
advertised xmin.  Releasing N resource owners could therefore cost
O(N^2) in the worst case, but note that releasing N resource owners is
*already* an O(N^2) operation in the worst case, because the list of
children of a particular parent resource owner is singly linked, and
thus deleting a resource owner is O(N).  It's been that way for an
awfully long time without anyone complaining, probably because (a)
it's not particularly common to have large numbers of cursors open
simultaneously and (b) even if you do have that, the constant factor
is pretty low.


I think you're conflating two different N's above. It's true that deleting 
a resource owner is O(M), where M is the number of children of that 
resource owner. It does not follow that releasing N resource owners is 
O(N^2), where N is the number of resource owners released. Calling 
ResourceOwnerDelete(x) will only visit each resource owner in that tree 
once, so it's O(N), where N is the total number of resource owners in 
the tree.


I did some testing of the worst case scenario. The attached script first 
creates a lot of cursors, then a lot of savepoints, and finally closes 
the cursors in FIFO order. When the number of savepoints is high enough, 
this actually segfaults with your patch, because you run out of stack 
space when recursing into the subxact resource owners. That's hardly this 
patch's fault; I'm actually surprised it doesn't crash without it, 
because we recurse into all resource owners in ResourceOwnerRelease too. 
Apparently the subxacts are closed in LIFO order at commit, but there 
might be other cases where you could trigger that. In any case, a 
stack-depth check would be nice.
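
Presumably something along these lines in whatever function walks the resource
owner tree (a sketch only; the walker below and its use of the private
ResourceOwnerData fields are hypothetical):

    #include "miscadmin.h"        /* for check_stack_depth() */

    static void
    ResourceOwnerWalkXmins(ResourceOwner owner)
    {
        ResourceOwner child;

        /* error out cleanly instead of overflowing the stack */
        check_stack_depth();

        /* ... inspect this owner's registered snapshots here ... */

        for (child = owner->firstchild; child != NULL; child = child->nextchild)
            ResourceOwnerWalkXmins(child);
    }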


- Heikki



cursors.sh
Description: Bourne shell script
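
The attachment itself is not reproduced here; going by the description above,
the script presumably does something along these lines (a reconstruction for
illustration, not the actual cursors.sh):

    #!/bin/sh
    # Open many cursors, then many savepoints, then close the cursors in FIFO order.
    N=1000
    {
        echo "BEGIN;"
        for i in $(seq 1 $N); do
            echo "DECLARE cur$i CURSOR FOR SELECT 1;"
        done
        for i in $(seq 1 $N); do
            echo "SAVEPOINT sp$i;"
        done
        for i in $(seq 1 $N); do
            echo "CLOSE cur$i;"
        done
        echo "COMMIT;"
    } | psql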



Re: [HACKERS] GSSAPI, SSPI - include_realm default

2014-12-10 Thread Stephen Frost
* Bruce Momjian (br...@momjian.us) wrote:
> On Tue, Dec  9, 2014 at 05:40:35PM -0500, Stephen Frost wrote:
> > > I thought the idea was to backpatch documentation saying "it's a good idea
> > > to change this value to x because of y". Not actually referring to the
> > > upcoming change directly. And I still think that part is a good idea, as 
> > > it
> > > helps people avoid potential security pitfalls.
> > 
> > I agree with this but I don't really see why we wouldn't say "hey, this
> > is going to change in 9.5."  Peter's argument sounds like he'd rather we
> > not make any changes to the existing documentation, and I don't agree
> > with that, and if we're making changes then, imv, we might as well
> > comment that the default is changed in 9.5.
> 
> I agree with Peter --- it is unwise to reference a future released
> feature in a backbranch doc patch.  Updating the backbranch docs to add
> a recommendation is fine.

Alright, I don't agree but it's not worth the argument.  I'll work on
the doc-update patch for the back-branches.

Thanks,

Stephen




Re: [HACKERS] Small TRUNCATE glitch

2014-12-10 Thread Bruce Momjian
On Wed, Dec 10, 2014 at 10:32:42AM +0200, Heikki Linnakangas wrote:
> >I don't think we need to have 2PC files survive a pg_upgrade.  It seems
> >perfectly okay to remove them from the new cluster at some appropriate
> >time, *if* they are copied from the old cluster at all (I don't think
> >they should be.)
> 
> I think pg_upgrade should check if there are any prepared
> transactions pending, and refuse to upgrade if there are. It could
> be made to work, but it's really not worth the trouble. If there are
> any pending prepared transactions in the system when you run
> pg_upgrade, it's more likely to be a mistake or oversight in the
> first place, than on purpose.

pg_upgrade already checks for prepared transactions and errors out if
they exist;  see check_for_prepared_transactions().

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] GSSAPI, SSPI - include_realm default

2014-12-10 Thread Bruce Momjian
On Tue, Dec  9, 2014 at 05:40:35PM -0500, Stephen Frost wrote:
> > I thought the idea was to backpatch documentation saying "it's a good idea
> > to change this value to x because of y". Not actually referring to the
> > upcoming change directly. And I still think that part is a good idea, as it
> > helps people avoid potential security pitfalls.
> 
> I agree with this but I don't really see why we wouldn't say "hey, this
> is going to change in 9.5."  Peter's argument sounds like he'd rather we
> not make any changes to the existing documentation, and I don't agree
> with that, and if we're making changes then, imv, we might as well
> comment that the default is changed in 9.5.

I agree with Peter --- it is unwise to reference a future released
feature in a backbranch doc patch.  Updating the backbranch docs to add
a recommendation is fine.
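
For context, the recommendation being discussed boils down to setting
include_realm explicitly in pg_hba.conf and mapping principals to roles, along
these lines (the realm, address range and map name are only illustrative):

    # pg_hba.conf: keep the realm in the user name and map it explicitly
    host    all    all    0.0.0.0/0    gss    include_realm=1 map=gssmap

    # pg_ident.conf: map principal@EXAMPLE.COM to the bare database role
    gssmap    /^(.*)@EXAMPLE\.COM$    \1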

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Compression of full-page-writes

2014-12-10 Thread Michael Paquier
On Tue, Dec 9, 2014 at 2:15 PM, Amit Kapila  wrote:

> On Mon, Dec 8, 2014 at 3:17 PM, Simon Riggs  wrote:
> >
> > On 8 December 2014 at 11:46, Michael Paquier 
> wrote:
> > > I don't really like those new names, but I'd prefer
> > > wal_compression_level if we go down that road with 'none' as default
> > > value. We may still decide in the future to support compression at the
> > > record level instead of context level, particularly if we have an API
> > > able to do palloc_return_null_at_oom, so the idea of WAL compression
> > > is not related only to FPIs IMHO.
> >
> > We may yet decide, but the pglz implementation is not effective on
> > smaller record lengths. Nor has any testing been done to show that is
> > even desirable.
> >
>
> It's even much worse for non-compressible (or less-compressible)
> WAL data.  I am not clear here how a simple on/off switch
> could address such cases, because the data can sometimes
> depend on which table the user is operating on (i.e. the schema or
> data in some tables is more amenable to compression, in which case
> it can give us benefits).  I think maybe we should think along the
> lines of what Robert touched on in one of his e-mails (a context-aware
> compression strategy).
>

So, I have been doing some measurements using the patch compressing FPWs
and had a look at the transaction latency using pgbench -P 1 with those
parameters on my laptop:
shared_buffers=512MB
checkpoint_segments=1024
checkpoint_timeout = 5min
fsync=off

A checkpoint was executed just before each 20-min run, so at least 3 checkpoints
kicked in during each measurement, roughly like this:
pgbench -i -s 100
psql -c 'checkpoint;'
date > ~/report.txt
pgbench -P 1 -c 16 -j 16 -T 1200 2>> ~/report.txt &

1) Compression of FPW:
latency average: 9.007 ms
latency stddev: 25.527 ms
tps = 1775.614812 (including connections establishing)

Here is the latency when a checkpoint that wrote 28% of the buffers began
(at ~570s):
progress: 568.0 s, 2000.9 tps, lat 8.098 ms stddev 23.799
progress: 569.0 s, 1873.9 tps, lat 8.442 ms stddev 22.837
progress: 570.2 s, 1622.4 tps, lat 9.533 ms stddev 24.027
progress: 571.0 s, 1633.4 tps, lat 10.302 ms stddev 27.331
progress: 572.1 s, 1588.4 tps, lat 9.908 ms stddev 25.728
progress: 573.1 s, 1579.3 tps, lat 10.186 ms stddev 25.782
All the other checkpoints have the same profile, meaning that the
transaction latency increases by roughly 1.5~2 ms, to 10.5~11 ms.

2) No compression of FPW:
latency average: 8.507 ms
latency stddev: 25.052 ms
tps = 1870.368880 (including connections establishing)

Here is the latency for a checkpoint that wrote 28% of buffers:
progress: 297.1 s, 1997.9 tps, lat 8.112 ms stddev 24.288
progress: 298.1 s, 1990.4 tps, lat 7.806 ms stddev 21.849
progress: 299.0 s, 1986.9 tps, lat 8.366 ms stddev 22.896
progress: 300.0 s, 1648.1 tps, lat 9.728 ms stddev 25.811
progress: 301.0 s, 1806.5 tps, lat 8.646 ms stddev 24.187
progress: 302.1 s, 1810.9 tps, lat 8.960 ms stddev 24.201
progress: 303.0 s, 1831.9 tps, lat 8.623 ms stddev 23.199
progress: 304.0 s, 1951.2 tps, lat 8.149 ms stddev 22.871

Here is another one that began around 600s (20% of buffers):
progress: 594.0 s, 1738.8 tps, lat 9.135 ms stddev 25.140
progress: 595.0 s, 893.2 tps, lat 18.153 ms stddev 67.186
progress: 596.1 s, 1671.0 tps, lat 9.470 ms stddev 25.691
progress: 597.1 s, 1580.3 tps, lat 10.189 ms stddev 26.430
progress: 598.0 s, 1570.9 tps, lat 10.089 ms stddev 23.684
progress: 599.2 s, 1657.0 tps, lat 9.385 ms stddev 23.794
progress: 600.0 s, 1665.5 tps, lat 10.280 ms stddev 25.857
progress: 601.1 s, 1571.7 tps, lat 9.851 ms stddev 25.341
progress: 602.1 s, 1577.7 tps, lat 10.056 ms stddev 25.331
progress: 603.0 s, 1600.1 tps, lat 10.329 ms stddev 25.429
progress: 604.0 s, 1593.8 tps, lat 10.004 ms stddev 26.816
Not sure what happened here; the spike was a bit higher.

However, the latency was roughly never higher than 10.5 ms in the
non-compression case. With those measurements I am seeing more or less 1 ms
of latency difference between the compression and non-compression cases
when checkpoints show up. Note that fsync is disabled.

Also, I am still planning to hack a patch able to compress records directly
with a scratch buffer of up to 32kB, and see how that compares with what I got
here. For now, the results are attached.

Comments welcome.
-- 
Michael


fpw_results.tar.gz
Description: GNU Zip compressed data



Re: [HACKERS] Small TRUNCATE glitch

2014-12-10 Thread Heikki Linnakangas

On 12/10/2014 03:04 AM, Alvaro Herrera wrote:

Alex Shulgin wrote:


The 2PC part requires extending bool flag to fit the trunc flag, is this
approach sane?  Given that 2PC transaction should survive server
restart, it's reasonable to expect it to also survive the upgrade, so I
see no clean way of adding another bool field to the
TwoPhasePgStatRecord struct (unless some would like to add checks on
record length, etc.).


I don't think we need to have 2PC files survive a pg_upgrade.  It seems
perfectly okay to remove them from the new cluster at some appropriate
time, *if* they are copied from the old cluster at all (I don't think
they should be.)


I think pg_upgrade should check if there are any prepared transactions 
pending, and refuse to upgrade if there are. It could be made to work, 
but it's really not worth the trouble. If there are any pending prepared 
transactions in the system when you run pg_upgrade, it's more likely to 
be a mistake or oversight in the first place, than on purpose.


- Heikki

