date:20131209

Bruce Momjian  writes:
> OK, Christoph has provided a full set of tested patches back to 8.4. 
> Should I backpatch these?  Peter says no, but two others say yes.

It's hard to paint that as a bug fix, so I'd vote for HEAD only.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] stats for network traffic WIP

2013-12-09 Thread Atri Sharma

On Tue, Dec 10, 2013 at 10:59 AM, Fujii Masao  wrote:
> On Tue, Dec 10, 2013 at 6:56 AM, Nigel Heron  wrote:
>> On Sat, Dec 7, 2013 at 1:17 PM, Fujii Masao  wrote:
>>>
>>> Could you share the performance numbers? I'm really concerned about
>>> the performance overhead caused by this patch.
>>>
>>
>> I've tried pgbench in select mode with small data sets to avoid disk
>> io and didn't see any difference. That was on my old core2duo laptop
>> though .. I'll have to retry it on some server class multi core
>> hardware.
>
> When I ran pgbench -i -s 100 in four parallel, I saw the performance 
> difference
> between the master and the patched one. I ran the following commands.
>
> psql -c "checkpoint"
> for i in $(seq 1 4); do time pgbench -i -s100 -q db$i & done
>
> The results are:
>
> * Master
>   1000 of 1000 tuples (100%) done (elapsed 13.91 s, remaining 0.00 s).
>   1000 of 1000 tuples (100%) done (elapsed 14.03 s, remaining 0.00 s).
>   1000 of 1000 tuples (100%) done (elapsed 14.01 s, remaining 0.00 s).
>   1000 of 1000 tuples (100%) done (elapsed 14.13 s, remaining 0.00 s).
>
>   It took almost 14.0 seconds to store 1000 tuples.
>
> * Patched
>   1000 of 1000 tuples (100%) done (elapsed 14.90 s, remaining 0.00 s).
>   1000 of 1000 tuples (100%) done (elapsed 15.05 s, remaining 0.00 s).
>   1000 of 1000 tuples (100%) done (elapsed 15.42 s, remaining 0.00 s).
>   1000 of 1000 tuples (100%) done (elapsed 15.70 s, remaining 0.00 s).
>
>   It took almost 15.0 seconds to store 1000 tuples.
>--
Regards,

Atri
l'apprenant
> Thus, I'm afraid that enabling network statistics would cause serious
> performance
> degradation. Thought?

Hmm, I think I did not push it this high. The performance numbers here
are cause of worry.

Another point I may mention here is that if we can isolate a few
points of performance degradation and work on them because I still
feel that the entire patch itself does not cause a serious lapse,
rather, a few points may.

However, the above numbers bring up the original concerns for the
performance voiced. I guess I was testing on too low number of clients
for the gap to show up significantly.

Regards,

Atri


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] stats for network traffic WIP

2013-12-09 Thread Fujii Masao

On Tue, Dec 10, 2013 at 6:56 AM, Nigel Heron  wrote:
> On Sat, Dec 7, 2013 at 1:17 PM, Fujii Masao  wrote:
>>
>> Could you share the performance numbers? I'm really concerned about
>> the performance overhead caused by this patch.
>>
>
> I've tried pgbench in select mode with small data sets to avoid disk
> io and didn't see any difference. That was on my old core2duo laptop
> though .. I'll have to retry it on some server class multi core
> hardware.

When I ran pgbench -i -s 100 in four parallel, I saw the performance difference
between the master and the patched one. I ran the following commands.

psql -c "checkpoint"
for i in $(seq 1 4); do time pgbench -i -s100 -q db$i & done

The results are:

* Master
  1000 of 1000 tuples (100%) done (elapsed 13.91 s, remaining 0.00 s).
  1000 of 1000 tuples (100%) done (elapsed 14.03 s, remaining 0.00 s).
  1000 of 1000 tuples (100%) done (elapsed 14.01 s, remaining 0.00 s).
  1000 of 1000 tuples (100%) done (elapsed 14.13 s, remaining 0.00 s).

  It took almost 14.0 seconds to store 1000 tuples.

* Patched
  1000 of 1000 tuples (100%) done (elapsed 14.90 s, remaining 0.00 s).
  1000 of 1000 tuples (100%) done (elapsed 15.05 s, remaining 0.00 s).
  1000 of 1000 tuples (100%) done (elapsed 15.42 s, remaining 0.00 s).
  1000 of 1000 tuples (100%) done (elapsed 15.70 s, remaining 0.00 s).

  It took almost 15.0 seconds to store 1000 tuples.

Thus, I'm afraid that enabling network statistics would cause serious
performance
degradation. Thought?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Get more from indices.

2013-12-09 Thread Kyotaro HORIGUCHI

Thank you,

> > One is, you put the added code for getrelation_info() out of the block for
> > the condition (info->relam == BTREE_AM_OID) (though amcanorder would be
..
> By checking the following equation in build_index_paths(), the updated
> version of the patch guarantees that the result of an index scan is ordered:
> 
> index_is_ordered = (index->sortopfamily != NULL);

Oops.. I forgot about it although many times I've seen...
You're right.

> > > Another is, you changed pathkeys expantion to be all-or-nothing decision.
> > > While this change should simplify the code slightly, it also dismisses
> > > the oppotunity for partially-extended pathkeys. Could you let me know
> > > the reason why you did so.
...
> > We might be able to partially-extend the original
> > pathkey list optimally in something significant, but that seems useless
> > complexity to me.  So, I modified the patch to do the all-or-nothing
> > decision.
> 
> Here I mean the optimality for use in merge joins.

Ok, I'll follow your advice. I agree on the point about merit vs
complexity.

I'm convinced of the validity of your patch. I'm satisfied with
it. Thank you.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good


On 10/12/13 17:18, Claudio Freire wrote:

On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood
 wrote:

Just one more...

The Intel 520 with ext4:


Without patch:  ANALYZE pgbench_accounts 5s
With patch: ANALYZE pgbench_accounts  1s

And double checking -
With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts  5s

These results look more like Heikki's. Which suggests more benefit on SSD
than spinning disks. Some more data points (apart from mine) would be good
to see tho.


Assuming ANALYZE is sampling less than 5% of the table, I'd say
fadvising will always be a win.

I'd also suggest higher e_i_c for rotating media. Rotating media has
longer latencias, and e_i_c has to be computed in terms of latency,
rather than "spindles" when doing prefetch.

For backward index scans, I found the optimum number for a single
spindle to be about 20.


Yeah - was just fooling about with this on the velociraptor, looks like 
somewhere in the 20's works well.


   | pgbench_accounts
eff_io_concurrency | analyze time (s)
---+-
8  |  43
16 |  40
24 |  36
32 |  35
64 |  35

Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Time-Delayed Standbys

2013-12-09 Thread KONDO Mitsumasa


(2013/12/09 20:29), Andres Freund wrote:

On 2013-12-09 19:51:01 +0900, KONDO Mitsumasa wrote:

Add my comment. We have to consider three situations.

1. PITR
2. replication standby
3. replication standby with restore_command

I think this patch cannot delay in 1 situation.


Why?


I have three reasons.

  1. It is written in document. Can we remove it?
  2. Name of this feature is "Time-delayed *standbys*", not "Time-delayed
 *recovery*". Can we change it?
  3. I think it is unnessesary in master PITR. And if it can delay in master
 PITR, it will become master at unexpected timing, not to continue to
 recovery. It is meaningless.

I'd like to ask you what do you expect from this feature and how to use it.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood
 wrote:
> Just one more...
>
> The Intel 520 with ext4:
>
>
> Without patch:  ANALYZE pgbench_accounts 5s
> With patch: ANALYZE pgbench_accounts  1s
>
> And double checking -
> With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts  5s
>
> These results look more like Heikki's. Which suggests more benefit on SSD
> than spinning disks. Some more data points (apart from mine) would be good
> to see tho.

Assuming ANALYZE is sampling less than 5% of the table, I'd say
fadvising will always be a win.

I'd also suggest higher e_i_c for rotating media. Rotating media has
longer latencias, and e_i_c has to be computed in terms of latency,
rather than "spindles" when doing prefetch.

For backward index scans, I found the optimum number for a single
spindle to be about 20.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?

On Tue, Dec 10, 2013 at 08:47:22AM +0800, Craig Ringer wrote:
> On 12/05/2013 11:25 PM, MauMau wrote:
> > Hello,
> > 
> > My customers and colleagues sometimes (or often?) ask about the
> > following message:
> > 
> > FATAL:  the database system is starting up
> 
> I would LOVE that message to do away, forever.
> 
> It's a huge PITA for automated log monitoring, analysis, and alerting.
> 
> The other one I'd personally like to change, but realise is harder to
> actually do, is to separate "ERROR"s caused by obvious user input issues
> from internal ERRORs like not finding the backing file for a relation,
> block read errors, etc.
> 
> String pattern matching is a crude and awful non-solution, especially
> given the way PostgreSQL loves to output messages to the log in whatever
> language and encoding the current database connection is in.

Yes, this is certainly a challenge.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] plpgsql_check_function - rebase for 9.3

2013-12-09 Thread Amit Kapila

On Mon, Dec 9, 2013 at 10:54 AM, Pavel Stehule  wrote:
> 2013/12/9 Amit Kapila 
>>
>> >
>> > There are two points, that should be solved
>> >
>> > a) introduction a dependency to other object in schema - now CREATE
>> > FUNCTION
>> > is fully independent on others
>> >
>> > b) slow start - if we check all paths on start, then start can be slower
>> > -
>> > and some functions should not work due dependency on temporary tables.
>> >
>> > I am thinking about possible marking a function by #option (we have same
>> > idea)
>> >
>> > some like
>> >
>> > #option check_on_first_start
>> > #option check_on_create
>> > #option check_newer
>>
>> what exactly check_newer means, does it mean whenever a function is
>> replaced (changed)?
>>
>
> no, it means, so request for check will be ignored ever - some functions
> cannot be deeply checked due using dynamic SQL or dynamic created data types
> - temporary tables created in functions.

Thanks for clarification, the part of name 'newer' has created
confusion. I understand
that creating/identifying dependency in some of the cases will be
quite tricky, does other
similar languages for other databases does that for all cases (objects
in dynamic statements).

Is the main reason for identifying/creating dependency is to mark
function as invalid when
any dependent object is Dropped/Altered?

This thread is from quite some time, so please excuse me if I had
asked anything which has
been discussed previously.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [patch] Adding EXTRA_REGRESS_OPTS to all pg_regress invocations

2013-12-09 Thread Michael Paquier

On Tue, Dec 10, 2013 at 12:08 PM, Bruce Momjian  wrote:
> On Thu, Dec  5, 2013 at 09:52:27AM +0100, Christoph Berg wrote:
>> > The change is sane in itself. It won't affect anyone who doesn't use
>> > EXTRA_REGRESS_OPTS. Why would we want to make packagers do MORE
>> > work?
>>
>> The patch has been in the Debian/Ubuntu/apt.pg.o packages for some
>> time, for 8.3+. I'm attaching the patches used there.
>>
>> (Sidenote: To enable building of several package flavors in parallel
>> on the same machine we use
>>
>> make -C build check-world EXTRA_REGRESS_OPTS='--host=/tmp --port=$(shell 
>> perl -le 'print 1024 + int(rand(64000))')'
>>
>> so pg_regress' static per-version ports do not conflict. But 9.2's
>> contrib/pg_upgrade/{Makefile/test.sh} do not like --port in there, so
>> the 9.2 patch has an extra sed hack in there to remove --port for
>> pg_upgrade. That bit should probably not be applied for general use.
>> The rest is safe, though.)
>
> OK, Christoph has provided a full set of tested patches back to 8.4.
> Should I backpatch these?  Peter says no, but two others say yes.
My 2c. Adding a new feature in a maintenance branch is usually not
done, so I'd vote no.

Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

On Fri, Dec  6, 2013 at 04:04:36PM +0100, Andres Freund wrote:
> On 2013-12-05 23:01:28 +0200, Heikki Linnakangas wrote:
> > On 12/05/2013 10:37 PM, Robert Haas wrote:
> > >On Thu, Dec 5, 2013 at 3:05 PM, Tom Lane  wrote:
> > >>It might be unpleasant to use in some cases, though.
> > >
> > >Why would there be more than a few cases in the first place?  Who is
> > >going to use this beyond psql, pg_dump(all), and pg_upgrade, and why?
> > 
> > Well, you might want to use pgAdmin, or your other favorite admin tool. I'm
> > not sure how well it would work, and I think it's OK if we say "sorry, can't
> > do that", but it's not a crazy thing to want.
> 
> Pgadmin wouldn't work, it uses multiple connections for anything but the
> most trivial tasks. You can't even send a manual sql query using only
> one connection.
> I think that's true for most of the non-trivial tools.

FYI, pg_upgrade in parallel mode needs multiple database connections
too.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core

On Mon, Dec 9, 2013 at 7:31 PM, Peter Geoghegan  wrote:
> I go to some lengths to to avoid doing this with only a shared lock.

I should have said: I go to great lengths to do this with only a
shared lock, and not an exclusive (see gc_count stuff).


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core

On Sat, Dec 7, 2013 at 9:26 AM, Fujii Masao wrote:
> The patch doesn't apply cleanly against the master due to recent change of
> pg_stat_statements. Could you update the patch?

Revision is attached, including changes recently discussed on other
thread [1] to allow avoiding sending query text where that's not
desirable and increase in default pg_stat_statements.max to 5000.

I've found myself tweaking things a bit more here and there. I've
added additional clarification around why we do an in-place garbage
collection (i.e. why we don't just write a new file and atomically
rename -- grep "over-engineer" to find what I mean). I also fully
documented the pg_stat_statements function. If we want external tool
authors to use it to avoid sending query texts, they have to know
about it in the first place.

I now use the pg_stat_tmp directory for my temp file (so
pg_stat_statements behaves more like the statistics collector). The
stats_temp_directory GUC is not used, for reasons that a patch comment
explains.

I've fuzz-tested this throughout with the "make installcheck-parallel"
regression tests. I found it useful to run that in one terminal, while
running pgbench with multiple clients in another. The pgbench script
looks something like:

"""
select * from pg_stat_statements;
select pg_stat_statements_reset();
"""

pg_stat_statements_reset() was temporarily rigged to perform a garbage
collection (and not a reset) for the benefit of this stress-test. The
function pg_stat_statements(), the garbage collection function and
pgss_store() will complain loudly if they notice anything out of the
ordinary. I was not able to demonstrate any problems through this
technique (though I think it focused my thinking on the right areas
during development). Clearly, missed subtleties around managing the
external file while locking in order to ensure correct behavior (that
no session can observe something that it should or should not) will be
something that a committer will want to pay particular attention to. I
go to some lengths to to avoid doing this with only a shared lock.

FYI, without hacking things up, it's quite hard to make external query
text garbage collection occur - need_gc_qtexts() will almost always
say no. It will only automatically happen once with
pg_stat_statements.max set to 1000 when the standard regression tests
are run. This is a really extreme workload in terms of causing
pg_stat_statements churn, which shows just how unlikely garbage
collection is in the real world. Still, it ought to be totally robust
when it happens.

In passing, I did this:

/*
!* Rename file into place, to atomically commit to serializing.
*/

Because at this juncture, there ought to be no file to replace (not
since Magnus had pg_stat_statements not keep the main global file
around for as long as the server is running, in the style of the
statistics collector).

Thanks

[1]
http://www.postgresql.org/message-id/cam3swzsvff-vbnc8sc-0cpwzxvqh49reybchb95t95fcrgs...@mail.gmail.com
--
Peter Geoghegan

pg_stat_statements_ext_text.v5.2013_12_09.patch.gz
Description: GNU Zip compressed data

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [bug fix] pg_ctl always uses the same event source

2013-12-09 Thread Amit Kapila

On Mon, Dec 9, 2013 at 5:52 PM, MauMau  wrote:
> From: "Amit Kapila" 
>
>> 1. isn't it better to handle as it is done in write_eventlog() which
>> means if string is empty then
>>use PostgreSQL.
>> "evtHandle = RegisterEventSource(NULL, event_source ? event_source :
>> "PostgreSQL");"
>
>
> Thank you for reviewing.  Yes, I did so with the first revision of this
> patch (see the first mail of this thread.)  I wanted to avoid duplicating
> the default value in both the server and pg_ctl code.  If user does not set
> event_source, postgres -C returns the default value "PostgreSQL" in the
> normal case, so I wanted to rely on it.

   I think it is better to keep it like what I suggested above,
because in that case
   it will assign default name even if postgres -C fails due to some reason.

>
>
>> 2. What will happen if user doesn't change the name in "event_source"
>> or kept the same name,
>>won't it hit the same problem again? So shouldn't it try to
>> generate different name by appending
>>version string to it?
>
>
> Yes, but I assume that the user has to set his own name to identify his
> instance uniquely.  Even if version string is added, the same issue can
> happen --- in the likely case where the user explicitly installs, for
> example, PostgreSQL 9.3 as a standalone database, as well as some packaged
> application that embeds PostgreSQL 9.3 which the user is unaware of.

   I mentioned it just to make sure that if such things can happen, it
is good to
   either avoid it or document it in some way, so that user can
understand better
   how to configure his system.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good


On 10/12/13 15:32, Mark Kirkwood wrote:

On 10/12/13 15:17, Mark Kirkwood wrote:

On 10/12/13 15:11, Mark Kirkwood wrote:

On 10/12/13 15:04, Mark Kirkwood wrote:

On 10/12/13 13:53, Mark Kirkwood wrote:

On 10/12/13 13:20, Mark Kirkwood wrote:

On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned 
out to be very easy, patch attached. Your mileage may vary, but 
I'm seeing a nice gain from this on my laptop. Taking a 3 
page sample of a table with 717717 pages (ie. slightly larger 
than RAM), ANALYZE takes about 6 seconds without the patch, and 
less than a second with the patch, with 
effective_io_concurrency=10. If anyone with a good test data 
set loaded would like to test this and post some numbers, that 
would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4





Doing the same test as above, but on a 80GB Intel 520 (ext4 
filesystem mounted with discard):


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  5s
With patch: ANALYZE pgbench_accounts  5s







Redoing the filesystem on the 520 as btrfs didn't seem to make any 
difference either:


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  6.4s
With patch: ANALYZE pgbench_accounts  6.4s





Ah - I have just realized I was not setting effective_io_concurrency 
- so I'll redo the test. - Apologies.





Redoing the test on the velociraptor gives me exactly the same 
numbers as before (effective_io_concurrency = 10 instead of 1).





Good grief - repeating the test gave:

Without patch:  ANALYZE pgbench_accounts 90s
With patch: ANALYZE pgbench_accounts  42s

pretty consistent *now*. No idea what was going on in the 1st run 
(maybe I happened to have it running at the same time as a 
checkpoint)? Anyway will stop now before creating more confusion.





Just one more...

The Intel 520 with ext4:

Without patch:  ANALYZE pgbench_accounts 5s
With patch: ANALYZE pgbench_accounts  1s

And double checking -
With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts  5s

These results look more like Heikki's. Which suggests more benefit on 
SSD than spinning disks. Some more data points (apart from mine) would 
be good to see tho.


Cheers

Mark






--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [patch] Adding EXTRA_REGRESS_OPTS to all pg_regress invocations

On Thu, Dec  5, 2013 at 09:52:27AM +0100, Christoph Berg wrote:
> > The change is sane in itself. It won't affect anyone who doesn't use
> > EXTRA_REGRESS_OPTS. Why would we want to make packagers do MORE
> > work?
> 
> The patch has been in the Debian/Ubuntu/apt.pg.o packages for some
> time, for 8.3+. I'm attaching the patches used there.
> 
> (Sidenote: To enable building of several package flavors in parallel
> on the same machine we use
> 
> make -C build check-world EXTRA_REGRESS_OPTS='--host=/tmp --port=$(shell perl 
> -le 'print 1024 + int(rand(64000))')'
> 
> so pg_regress' static per-version ports do not conflict. But 9.2's
> contrib/pg_upgrade/{Makefile/test.sh} do not like --port in there, so
> the 9.2 patch has an extra sed hack in there to remove --port for
> pg_upgrade. That bit should probably not be applied for general use.
> The rest is safe, though.)

OK, Christoph has provided a full set of tested patches back to 8.4. 
Should I backpatch these?  Peter says no, but two others say yes.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good


On 10/12/13 15:17, Mark Kirkwood wrote:

On 10/12/13 15:11, Mark Kirkwood wrote:

On 10/12/13 15:04, Mark Kirkwood wrote:

On 10/12/13 13:53, Mark Kirkwood wrote:

On 10/12/13 13:20, Mark Kirkwood wrote:

On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out 
to be very easy, patch attached. Your mileage may vary, but I'm 
seeing a nice gain from this on my laptop. Taking a 3 page 
sample of a table with 717717 pages (ie. slightly larger than 
RAM), ANALYZE takes about 6 seconds without the patch, and less 
than a second with the patch, with effective_io_concurrency=10. 
If anyone with a good test data set loaded would like to test 
this and post some numbers, that would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4





Doing the same test as above, but on a 80GB Intel 520 (ext4 
filesystem mounted with discard):


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  5s
With patch: ANALYZE pgbench_accounts  5s







Redoing the filesystem on the 520 as btrfs didn't seem to make any 
difference either:


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  6.4s
With patch: ANALYZE pgbench_accounts  6.4s





Ah - I have just realized I was not setting effective_io_concurrency 
- so I'll redo the test. - Apologies.





Redoing the test on the velociraptor gives me exactly the same numbers 
as before (effective_io_concurrency = 10 instead of 1).





Good grief - repeating the test gave:

Without patch:  ANALYZE pgbench_accounts 90s
With patch: ANALYZE pgbench_accounts  42s

pretty consistent *now*. No idea what was going on in the 1st run (maybe 
I happened to have it running at the same time as a checkpoint)? Anyway 
will stop now before creating more confusion.


Regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Josh Berkus

On 12/09/2013 02:37 PM, Robert Haas wrote:
> I've never seen an n_distinct value of more than 5 digits, regardless
> of reality.  Typically I've seen 20-50k, even if the real number is
> much higher.  But the n_distinct value is only for non-MCVs, so if we
> estimate the selectivity of column = 'rarevalue' to be
> (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
> the estimate, and making the MCV list longer naturally makes mcvfrac
> bigger.  I'm not sure how important the
> less-frequent-than-the-least-common-MCV part is, but I'm very sure
> that raising the statistics target helps to solve the problem of
> overestimating the prevalence of uncommon values in a very big table.

I did an analysis of our ndistinct algorithm several years ago ( ~~
8.1), and to sum up:

1. we take far too small of a sample to estimate ndistinct well for
tables larger than 100,000 rows.

2. the estimation algo we have chosen is one which tends to be wrong in
the downwards direction, rather strongly so.  That is, if we could
potentially have an ndistinct of 1000 to 100,000 based on the sample,
our algo estimates 1500 to 3000.

3. Other algos exist.  The tend to be wrong in other directions.

4. Nobody has done an analysis of whether it's worse, on average, to
estimate low vs. high for ndistinct.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good


On 10/12/13 15:11, Mark Kirkwood wrote:

On 10/12/13 15:04, Mark Kirkwood wrote:

On 10/12/13 13:53, Mark Kirkwood wrote:

On 10/12/13 13:20, Mark Kirkwood wrote:

On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out 
to be very easy, patch attached. Your mileage may vary, but I'm 
seeing a nice gain from this on my laptop. Taking a 3 page 
sample of a table with 717717 pages (ie. slightly larger than 
RAM), ANALYZE takes about 6 seconds without the patch, and less 
than a second with the patch, with effective_io_concurrency=10. 
If anyone with a good test data set loaded would like to test 
this and post some numbers, that would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4





Doing the same test as above, but on a 80GB Intel 520 (ext4 
filesystem mounted with discard):


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  5s
With patch: ANALYZE pgbench_accounts  5s







Redoing the filesystem on the 520 as btrfs didn't seem to make any 
difference either:


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  6.4s
With patch: ANALYZE pgbench_accounts  6.4s





Ah - I have just realized I was not setting effective_io_concurrency - 
so I'll redo the test. - Apologies.





Redoing the test on the velociraptor gives me exactly the same numbers 
as before (effective_io_concurrency = 10 instead of 1).


Cheers

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good


On 10/12/13 15:04, Mark Kirkwood wrote:

On 10/12/13 13:53, Mark Kirkwood wrote:

On 10/12/13 13:20, Mark Kirkwood wrote:

On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out 
to be very easy, patch attached. Your mileage may vary, but I'm 
seeing a nice gain from this on my laptop. Taking a 3 page 
sample of a table with 717717 pages (ie. slightly larger than 
RAM), ANALYZE takes about 6 seconds without the patch, and less 
than a second with the patch, with effective_io_concurrency=10. If 
anyone with a good test data set loaded would like to test this 
and post some numbers, that would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4





Doing the same test as above, but on a 80GB Intel 520 (ext4 
filesystem mounted with discard):


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  5s
With patch: ANALYZE pgbench_accounts  5s







Redoing the filesystem on the 520 as btrfs didn't seem to make any 
difference either:


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  6.4s
With patch: ANALYZE pgbench_accounts  6.4s





Ah - I have just realized I was not setting effective_io_concurrency - 
so I'll redo the test. - Apologies.


Regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good


On 10/12/13 13:53, Mark Kirkwood wrote:

On 10/12/13 13:20, Mark Kirkwood wrote:

On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out to 
be very easy, patch attached. Your mileage may vary, but I'm seeing 
a nice gain from this on my laptop. Taking a 3 page sample of a 
table with 717717 pages (ie. slightly larger than RAM), ANALYZE 
takes about 6 seconds without the patch, and less than a second 
with the patch, with effective_io_concurrency=10. If anyone with a 
good test data set loaded would like to test this and post some 
numbers, that would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4





Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem 
mounted with discard):


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  5s
With patch: ANALYZE pgbench_accounts  5s







Redoing the filesystem on the 520 as btrfs didn't seem to make any 
difference either:


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  6.4s
With patch: ANALYZE pgbench_accounts  6.4s




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] dblink performance regression


On 12/7/13 7:50 PM, Joe Conway wrote:

On 12/07/2013 05:41 PM, Fabrízio de Royes Mello wrote:

>
>On Sat, Dec 7, 2013 at 11:20 PM, Michael Paquier
>mailto:michael.paqu...@gmail.com>>
>wrote:

>>>
>>>IMHO is more elegant create a procedure to encapsulate the code
>>>to avoid redundancy.

>>Yep, perhaps something like PQsetClientEncodingIfDifferent or
>>similar would make sense.

>
>Well I think at this first moment we can just create a procedure
>inside the dblink contrib and not touch in libpq.

Maybe a libpq function could be done for 9.4, but not for back branches.


Stupid question... why don't we just pass encoding in with the other connection 
parameters? That eliminates any ambiguity. The only issue would be if the user 
also passed that in via connstr... though now that I think about it, we 
currently silently ignore that parameter, which IMHO is bad. We should either 
respect if the user passes that in (ie: not do anything at all), or we should 
throw an error.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

On 12/10/2013 05:20 AM, Jim Nasby wrote:
>>
> 
> FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to
> fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's
> fancy new shared memory stuff).

Apologies for posting the same as a new idea before I saw your post. I
need to finish the thread before posting.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

On 12/06/2013 09:52 AM, Peter Geoghegan wrote:
> Has anyone ever thought about opportunistic ANALYZE piggy-backing on
> other full-table scans? That doesn't really help Greg, because his
> complaint is mostly that a fresh ANALYZE is too expensive, but it
> could be an interesting, albeit risky approach.

It'd be particularly interesting, IMO, if autovacuum was able to do it
using the synchronized scans machinery, so the analyze still ran in a
separate proc and didn't impact the user's query. Have an autovac worker
or two waiting for new scans to start on tables they need to analyze,
and if one doesn't start within a reasonable time frame they start their
own scan.

We've seen enough issues with hint-bit setting causing unpredictable
query times for user queries that I wouldn't want to add another "might
write during a read-only query" behaviour.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good


On 10/12/13 13:20, Mark Kirkwood wrote:

On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out to 
be very easy, patch attached. Your mileage may vary, but I'm seeing 
a nice gain from this on my laptop. Taking a 3 page sample of a 
table with 717717 pages (ie. slightly larger than RAM), ANALYZE 
takes about 6 seconds without the patch, and less than a second with 
the patch, with effective_io_concurrency=10. If anyone with a good 
test data set loaded would like to test this and post some numbers, 
that would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4





Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem 
mounted with discard):


(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts  5s
With patch: ANALYZE pgbench_accounts  5s





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing

2013-12-09 Thread Andres Freund

On 2013-12-09 19:14:58 -0300, Alvaro Herrera wrote:
> Here's a revamped version of this patch.  One thing I didn't do here is
> revert the exporting of CreateMultiXactId, but I don't see any way to
> avoid that.

I don't so much have a problem with exporting CreateMultiXactId(), just
with exporting it under its current name. It's already quirky to have
both MultiXactIdCreate and CreateMultiXactId() in multixact.c but
exporting it imo goes to far.

> Andres mentioned the idea of sharing some code between
> heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't
> explored that.

My idea would just be to have heap_tuple_needs_freeze() call
heap_prepare_freeze_tuple() and check whether it returns true. Yes,
that's slightly more expensive than the current
heap_tuple_needs_freeze(), but it's only called when we couldn't get a
cleanup lock on a page, so that seems ok.

> ! static TransactionId
> ! FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
> !   TransactionId cutoff_xid, MultiXactId 
> cutoff_multi,
> !   uint16 *flags)
> ! {

> ! if (!MultiXactIdIsValid(multi))
> ! {
> ! /* Ensure infomask bits are appropriately set/reset */
> ! *flags |= FRM_INVALIDATE_XMAX;
> ! return InvalidTransactionId;
> ! }

Maybe comment that we're sure to be only called when IS_MULTI is set,
which will be unset by FRM_INVALIDATE_XMAX? I wondered twice whether we
wouldn't just continually mark the buffer dirty this way.

> ! else if (MultiXactIdPrecedes(multi, cutoff_multi))
> ! {
> ! /*
> !  * This old multi cannot possibly have members still running.  
> If it
> !  * was a locker only, it can be removed without any further
> !  * consideration; but if it contained an update, we might need 
> to
> !  * preserve it.
> !  */
> ! if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
> ! {
> ! *flags |= FRM_INVALIDATE_XMAX;
> ! return InvalidTransactionId;

Cna you place an Assert(!MultiXactIdIsRunning(multi)) here?

> ! if (ISUPDATE_from_mxstatus(members[i].status) &&
> ! !TransactionIdDidAbort(members[i].xid))#

It makes me wary to see a DidAbort() without a previous InProgress()
call.
Also, after we crashed, doesn't DidAbort() possibly return false for
transactions that were in progress before we crashed? At least that's
how I always understood it, and how tqual.c is written.

> ! /* We only keep lockers if they are still running */
> ! if (TransactionIdIsCurrentTransactionId(members[i].xid) ||

I don't think there's a need to check for
TransactionIdIsCurrentTransactionId() - vacuum can explicitly *not* be
run inside a transaction.

> ***
> *** 5443,5458  heap_freeze_tuple(HeapTupleHeader tuple, TransactionId 
> cutoff_xid,
>* xvac transaction succeeded.
>*/
>   if (tuple->t_infomask & HEAP_MOVED_OFF)
> ! HeapTupleHeaderSetXvac(tuple, 
> InvalidTransactionId);
>   else
> ! HeapTupleHeaderSetXvac(tuple, 
> FrozenTransactionId);
>   
>   /*
>* Might as well fix the hint bits too; usually 
> XMIN_COMMITTED
>* will already be set here, but there's a small chance 
> not.
>*/
>   Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
> ! tuple->t_infomask |= HEAP_XMIN_COMMITTED;
>   changed = true;
>   }
>   }
> --- 5571,5586 
>* xvac transaction succeeded.
>*/
>   if (tuple->t_infomask & HEAP_MOVED_OFF)
> ! frz->frzflags |= XLH_FREEZE_XVAC;
>   else
> ! frz->frzflags |= XLH_INVALID_XVAC;
>   

Hm. Isn't this case inverted? I.e. shouldn't you set XLH_FREEZE_XVAC and
XLH_INVALID_XVAC exactly the other way round? I really don't understand
the moved in/off, since the code has been gone longer than I've followed
the code...

 *** a/src/backend/access/rmgrdesc/mxactdesc.c
> --- b/src/backend/access/rmgrdesc/mxactdesc.c
> ***
> *** 41,47  out_member(StringInfo buf, MultiXactMember *member)
>   appendStringInfoString(buf, "(upd) ");
>   break;
>   default:
> ! appendStringInfoString(buf, "(unk) ");
>   break;
>   }
>   }
> --- 41,47 
>   appendStringInfoString(buf, "(upd) ");
>   break;
>   default:
> ! appendStringInfo(buf, "(unk) ", member->status);
>   break;
>

Re: [HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?

On 12/06/2013 03:02 AM, Josh Berkus wrote:
> Heck, I'd be happy just to have a class of messages which specifically
> means "OMG, there's something wrong with the server", that is, a flag
> for messages which only occur when PostgreSQL encounters a bug, data
> corrpution, or platform error.  Right now, I have to suss those out by
> regex.

+10

That's what I really need to see in the logs, too.

Using SQLState might be reasonable, but I'd be concerned we'd find cases
where the same standard-specififed SQLState should be sent to the
*client*, despite different underlying causes for the server of
different levels of DBA concern.

I'd rather not suppress logging of user-level errors, etc; they're
important too. It's just necessary to be able to separate them from
internal errors that should never happen when the system is operating
correctly in order to do effective log-based alerting.

After all, I *don't* want to get an SMS whenever the deadlock detector
kicks in and someone's created a database in de_DE so the string pattern
match doesn't kick in.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?

On 12/05/2013 11:25 PM, MauMau wrote:
> Hello,
> 
> My customers and colleagues sometimes (or often?) ask about the
> following message:
> 
> FATAL:  the database system is starting up

I would LOVE that message to do away, forever.

It's a huge PITA for automated log monitoring, analysis, and alerting.

The other one I'd personally like to change, but realise is harder to
actually do, is to separate "ERROR"s caused by obvious user input issues
from internal ERRORs like not finding the backing file for a relation,
block read errors, etc.

String pattern matching is a crude and awful non-solution, especially
given the way PostgreSQL loves to output messages to the log in whatever
language and encoding the current database connection is in.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?

Jim Nasby  writes:
> On 12/9/13 5:56 PM, Tom Lane wrote:
>> How so?  "FATAL" means "an error that terminates your session", which
>> is exactly what these are.

> Except in these cases the user never actually got a working session; their 
> request was denied.

> To be clear, from the client standpoint it's certainly fatal, but not from 
> the server's point of view. This is fully expected behavior as far as the 
> server is concerned. (Obviously it might be an error that caused the 
> shutdown/recovery, but that's something different.)

Right, but as already pointed out in this thread, these messages are
worded from the client's point of view.  "The client never got a working
connection" seems to me to be an empty distinction.  If you got SIGTERM'd
before you could issue your first query, should that not be FATAL because
you'd not gotten any work done?

More generally, we also say FATAL for all sorts of entirely routine
connection failures, like wrong password or mistyped user name.  People
don't seem to have a problem with those.  Even if some do complain,
the costs of changing that behavior after fifteen-years-and-counting
would certainly exceed any benefit.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good


On 10/12/13 13:14, Mark Kirkwood wrote:

On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out to 
be very easy, patch attached. Your mileage may vary, but I'm seeing a 
nice gain from this on my laptop. Taking a 3 page sample of a 
table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes 
about 6 seconds without the patch, and less than a second with the 
patch, with effective_io_concurrency=10. If anyone with a good test 
data set loaded would like to test this and post some numbers, that 
would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(



Arrg - sorry forgot the important bits:

Ubuntu 13.10 (kernel 3.11.0-14)
filesystem is ext4



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?


On 12/9/13 5:56 PM, Tom Lane wrote:

Jim Nasby  writes:

Arguably 1-3 are inaccurate since they're not really about a backend dying... 
they occur during the startup phase; you never even get a functioning backend. 
That's a marked difference from other uses of FATAL.


How so?  "FATAL" means "an error that terminates your session", which
is exactly what these are.


Except in these cases the user never actually got a working session; their 
request was denied.

To be clear, from the client standpoint it's certainly fatal, but not from the 
server's point of view. This is fully expected behavior as far as the server is 
concerned. (Obviously it might be an error that caused the shutdown/recovery, 
but that's something different.)
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

Mark Kirkwood  writes:
> I did a test run:

> pgbench scale 2000 (pgbench_accounts approx 25GB).
> postgres 9.4

> i7 3.5Ghz Cpu
> 16GB Ram
> 500 GB Velociraptor 10K

> (cold os and pg cache both runs)
> Without patch:  ANALYZE pgbench_accounts90s
> With patch: ANALYZE pgbench_accounts  91s

> So I'm essentially seeing no difference :-(

What OS and filesystem?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good


On 10/12/13 12:14, Heikki Linnakangas wrote:



I took a stab at using posix_fadvise() in ANALYZE. It turned out to be 
very easy, patch attached. Your mileage may vary, but I'm seeing a 
nice gain from this on my laptop. Taking a 3 page sample of a 
table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes 
about 6 seconds without the patch, and less than a second with the 
patch, with effective_io_concurrency=10. If anyone with a good test 
data set loaded would like to test this and post some numbers, that 
would be great.







I did a test run:

pgbench scale 2000 (pgbench_accounts approx 25GB).
postgres 9.4

i7 3.5Ghz Cpu
16GB Ram
500 GB Velociraptor 10K

(cold os and pg cache both runs)
Without patch:  ANALYZE pgbench_accounts90s
With patch: ANALYZE pgbench_accounts  91s

So I'm essentially seeing no difference :-(

Regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?

Jim Nasby  writes:
> Arguably 1-3 are inaccurate since they're not really about a backend dying... 
> they occur during the startup phase; you never even get a functioning 
> backend. That's a marked difference from other uses of FATAL.

How so?  "FATAL" means "an error that terminates your session", which
is exactly what these are.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

Heikki Linnakangas  writes:
> Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when 
> you're prefetching with posix_fadvise anyway.

Yeah.  If we're not reading consecutive blocks, readahead is unlikely
to do anything anyhow.

Claudio's comments do suggest that it might be a bad idea to issue a
posix_fadvise when the next block to be examined *is* adjacent to the
current one, though.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

On Mon, Dec 9, 2013 at 8:45 PM, Heikki Linnakangas
 wrote:
> Claudio Freire  wrote:
>>On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas
>> wrote:
>>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to
>>be very
>>> easy, patch attached. Your mileage may vary, but I'm seeing a nice
>>gain from
>>> this on my laptop. Taking a 3 page sample of a table with 717717
>>pages
>>> (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without
>>the
>>> patch, and less than a second with the patch, with
>>> effective_io_concurrency=10. If anyone with a good test data set
>>loaded
>>> would like to test this and post some numbers, that would be great.
>>
>>Kernel version?
>
> 3.12, from Debian experimental. With an ssd drive and btrfs filesystem.  
> Admittedly not your average database server setup, so it would be nice to get 
> more reports from others.


Yeah, read-ahead isn't relevant for SSD.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

Claudio Freire  wrote:
>On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas
> wrote:
>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to
>be very
>> easy, patch attached. Your mileage may vary, but I'm seeing a nice
>gain from
>> this on my laptop. Taking a 3 page sample of a table with 717717
>pages
>> (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without
>the
>> patch, and less than a second with the patch, with
>> effective_io_concurrency=10. If anyone with a good test data set
>loaded
>> would like to test this and post some numbers, that would be great.
>
>Kernel version?

3.12, from Debian experimental. With an ssd drive and btrfs filesystem.  
Admittedly not your average database server setup, so it would be nice to get 
more reports from others. 

>I raised this issue on LKML, and, while I got no news on this front,
>they might have silently fixed it. I'd have to check the sources
>again.

Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when you're 
prefetching with posix_fadvise anyway.
I

- Heikki


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas
 wrote:
> On 12/09/2013 11:56 PM, Claudio Freire wrote:
>> Without patches to the kernel, it is much better.
>>
>> posix_fadvise interferes with read-ahead, so posix_fadvise on, say,
>> bitmap heap scans (or similarly sorted analyze block samples) run at 1
>> IO / block, ie horrible, whereas aio can do read coalescence and
>> read-ahead when the kernel thinks it'll be profitable, significantly
>> increasing IOPS. I've seen everything from a 2x to 10x difference.
>
>
> How did you test that, given that we don't actually have an asynchronous I/O
> implementation? I don't recall any recent patches floating around either to
> do that. When Greg Stark investigated this back in 2007-2008 and implemented
> posix_fadvise() for bitmap heap scans, posix_fadvise certainly gave a
> significant speedup on the test data he used. What kind of a data
> distribution gives a slowdown like that?

That's basically my summarized experience from working on this[0]
patch, and the feedback given there about competing AIO work.

Sequential I/O was the biggest issue. I had to actively avoid
fadvising on sequential I/O, or sequential-ish I/O, which was a huge
burden on fadvise logic.

>
> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very
> easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from
> this on my laptop. Taking a 3 page sample of a table with 717717 pages
> (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the
> patch, and less than a second with the patch, with
> effective_io_concurrency=10. If anyone with a good test data set loaded
> would like to test this and post some numbers, that would be great.

Kernel version?

I raised this issue on LKML, and, while I got no news on this front,
they might have silently fixed it. I'd have to check the sources
again.

[0] 
http://www.postgresql.org/message-id/col116-w162aeaa64173e77d4597eea3...@phx.gbl


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?


On 12/6/13 7:38 AM, Andres Freund wrote:

On 2013-12-06 22:35:21 +0900, MauMau wrote:

From: "Tom Lane" 

No.  They are FATAL so far as the individual session is concerned.
Possibly some documentation effort is needed here, but I don't think
any change in the code behavior would be an improvement.


You are suggesting that we should add a note like "Don't worry about the
following message.  This is a result of normal connectivity checking", don't
you?

FATAL:  the database system is starting up


Uh. An explanation why you cannot connect to the database hardly seems
like a superflous log message.


It is when *you* are not actually trying to connect but rather pg_ctl is (which 
is one of the use cases here).

Arguably 1-3 are inaccurate since they're not really about a backend dying... 
they occur during the startup phase; you never even get a functioning backend. 
That's a marked difference from other uses of FATAL.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Problem with displaying "wide" tables in psql

2013-12-09 Thread Jeff Janes

On Mon, Dec 2, 2013 at 10:45 PM, Sergey Muraviov <
sergey.k.murav...@gmail.com> wrote:

> Hi.
>
> Psql definitely have a problem with displaying "wide" tables.
> Even in expanded mode, they look horrible.
> So I tried to solve this problem.
>

I get compiler warnings:

print.c: In function 'print_aligned_vertical':
print.c:1238: warning: ISO C90 forbids mixed declarations and code
print.c: In function 'print_aligned_vertical':
print.c:1238: warning: ISO C90 forbids mixed declarations and code

But I really like this and am already benefiting from it.  No point in
having the string of hyphens between every record wrap to be 30 lines long
just because one field somewhere down the list does so.  And configuring
the pager isn't much of a solution because the pager doesn't know that the
hyphens are semantically different than the other stuff getting thrown at
it.

Cheers,

Jeff

Re: [HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?

On 12/8/13 3:08 PM, MauMau wrote:

#5 is output when the DBA shuts down the replication standby server.
#6 is output when the DBA shuts down the server if he is using any custom
background worker.
These messages are always output. What I'm seeing as a problem is that
FATAL messages are output in a normal situation, which worries the DBA,
and
those messages don't help the DBA with anything. They merely worry the
DBA.

While "worry" is something to be avoided not logging messages is not the
correct solution. On the project side looking for ways and places to better
communicate to the user is worthwhile in the absence of adding additional
complexity to the logging system. The user/DBA, though, is expected to
learn how the proces works, ideally in a testing environment where worry is
much less a problem, and understand that some of what they are seeing are
client-oriented messages that they should be aware of but that do not
indicate server-level issues by themselves.

I see, It might makes sense to make the DBA learn how the system works, so that
more DBAs can solve problems by themselves. However, I wonder how those
messages (just stopping server process by admin request) are useful for the
DBA. If they are useful, why don't other background processes (e.g. archiver,
writer, walsender, etc.) output such a message when stopping? For information,
#5 and #6 are not client-oriented.

If we want to keep the messages, I think LOG or DEBUG would be appropriate. How about altering
"ereport(FATAL, ...)" to "ereport(LOG, ...); proc_exit(1)"?

IMNSHO, this isn't a "worry" thing, it's a complete mis-categorization of what's happening. Do we
output "FATAL" when you disconnect from the database? Something that is happening *by design*
should not be considered as "fatal".

I'm not saying we shouldn't log it at all, but it's silly to call it fatal just
because the process is going away *when we did something that needs to make it
go away*.

What makes it worse is what happens if you do something like
pg_terminate_backend() on one of those? That *should* create a fatal error (or
at least be treated the same way we treat any other use of
pg_terminate_backend()).

So in this case I'm betting the real problem is that an orderly shutdown is
calling pg_terminate_backend() or the equivalent. While that's convenient from
a code standpoint, it's inaccurate from an API standpoint.
--
Jim C. Nasby, Data Architect j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good


On 12/09/2013 11:56 PM, Claudio Freire wrote:

On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas
 wrote:

On 12/09/2013 11:35 PM, Jim Nasby wrote:


On 12/8/13 1:49 PM, Heikki Linnakangas wrote:


On 12/08/2013 08:14 PM, Greg Stark wrote:


The whole accounts table is 1.2GB and contains 10 million rows. As
expected with rows_per_block set to 1 it reads 240MB of that
containing nearly 2 million rows (and takes nearly 20s -- doing a full
table scan for select count(*) only takes about 5s):


One simple thing we could do, without or in addition to changing the
algorithm, is to issue posix_fadvise() calls for the blocks we're
going to read. It should at least be possible to match the speed of a
plain sequential scan that way.


Hrm... maybe it wouldn't be very hard to use async IO here either? I'm
thinking it wouldn't be very hard to do the stage 2 work in the callback
routine...


Yeah, other than the fact we have no infrastructure to do asynchronous I/O
anywhere in the backend. If we had that, then we could easily use it here. I
doubt it would be much better than posix_fadvising the blocks, though.


Without patches to the kernel, it is much better.

posix_fadvise interferes with read-ahead, so posix_fadvise on, say,
bitmap heap scans (or similarly sorted analyze block samples) run at 1
IO / block, ie horrible, whereas aio can do read coalescence and
read-ahead when the kernel thinks it'll be profitable, significantly
increasing IOPS. I've seen everything from a 2x to 10x difference.


How did you test that, given that we don't actually have an asynchronous 
I/O implementation? I don't recall any recent patches floating around 
either to do that. When Greg Stark investigated this back in 2007-2008 
and implemented posix_fadvise() for bitmap heap scans, posix_fadvise 
certainly gave a significant speedup on the test data he used. What kind 
of a data distribution gives a slowdown like that?


I took a stab at using posix_fadvise() in ANALYZE. It turned out to be 
very easy, patch attached. Your mileage may vary, but I'm seeing a nice 
gain from this on my laptop. Taking a 3 page sample of a table with 
717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 
seconds without the patch, and less than a second with the patch, with 
effective_io_concurrency=10. If anyone with a good test data set loaded 
would like to test this and post some numbers, that would be great.


- Heikki
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9845b0b..410d487 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -58,10 +58,18 @@
 /* Data structure for Algorithm S from Knuth 3.4.2 */
 typedef struct
 {
+	Relation	rel;
+	ForkNumber	forknum;
+
 	BlockNumber N;/* number of blocks, known in advance */
 	int			n;/* desired sample size */
 	BlockNumber t;/* current block number */
 	int			m;/* blocks selected so far */
+
+	int			prefetch_target;
+	int			start;
+	int			end;
+	BlockNumber *prefetched;
 } BlockSamplerData;
 
 typedef BlockSamplerData *BlockSampler;
@@ -87,10 +95,11 @@ static BufferAccessStrategy vac_strategy;
 static void do_analyze_rel(Relation onerel, VacuumStmt *vacstmt,
 			   AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
 			   bool inh, int elevel);
-static void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
+static void BlockSampler_Init(BlockSampler bs, Relation rel, ForkNumber forknum, BlockNumber nblocks,
   int samplesize);
 static bool BlockSampler_HasMore(BlockSampler bs);
 static BlockNumber BlockSampler_Next(BlockSampler bs);
+static BlockNumber BlockSampler_Next_internal(BlockSampler bs);
 static void compute_index_stats(Relation onerel, double totalrows,
 	AnlIndexData *indexdata, int nindexes,
 	HeapTuple *rows, int numrows,
@@ -954,8 +963,12 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
  * algorithm.
  */
 static void
-BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize)
+BlockSampler_Init(BlockSampler bs, Relation rel, ForkNumber forknum,
+  BlockNumber nblocks, int samplesize)
 {
+	bs->rel = rel;
+	bs->forknum = forknum;
+
 	bs->N = nblocks;			/* measured table size */
 
 	/*
@@ -965,17 +978,61 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize)
 	bs->n = samplesize;
 	bs->t = 0;	/* blocks scanned so far */
 	bs->m = 0;	/* blocks selected so far */
+
+	bs->prefetch_target = target_prefetch_pages;
+	if (target_prefetch_pages > 0)
+		bs->prefetched = palloc((bs->prefetch_target + 1) * sizeof(BlockNumber));
+	else
+		bs->prefetched = NULL;
+	bs->start = bs->end = 0;
 }
 
 static bool
 BlockSampler_HasMore(BlockSampler bs)
 {
+	if (bs->end != bs->start)
+		return true;
 	return (bs->t < bs->N) && (bs->m < bs->n);
 }
 
+static void
+BlockSampler_Prefetch(BlockSampler bs)
+{
+	int		next;
+
+	next = (bs->end + 1) % (bs->prefetch_target + 1);
+	while (next != bs->start && (bs->t < bs->N) && (bs->m < bs->

Re: [HACKERS] ANALYZE sampling is too good

On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes  wrote:
> My reading of the code is that if it is not in the MCV, then it is assumed
> to have the average selectivity (about 1/n_distinct, but deflating top and
> bottom for the MCV list).  There is also a check that it is less than the
> least common of the MCV, but I don't know why that situation would ever
> prevail--that should always be higher or equal to the average selectivity.

I've never seen an n_distinct value of more than 5 digits, regardless
of reality.  Typically I've seen 20-50k, even if the real number is
much higher.  But the n_distinct value is only for non-MCVs, so if we
estimate the selectivity of column = 'rarevalue' to be
(1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
the estimate, and making the MCV list longer naturally makes mcvfrac
bigger.  I'm not sure how important the
less-frequent-than-the-least-common-MCV part is, but I'm very sure
that raising the statistics target helps to solve the problem of
overestimating the prevalence of uncommon values in a very big table.

> I think that parts of the planner are N^2 in the size of histogram (or was
> that the size of the MCV list?).  So we would probably need a way to use a
> larger sample size to get more accurate n_distinct and MCV frequencies, but
> not save the entire histogram that goes with that sample size.

I think the saving the histogram part is important.  As you say, the
MCVs are important for a variety of planning purposes, such as hash
joins.  More than that, in my experience, people with large tables are
typically very willing to spend more planning time to get a better
plan, because mistakes are expensive and the queries are likely to run
for a while anyway.  People with small tables care about planning
time, because it makes no sense to spend an extra 1ms planning a query
unless you improve the plan by enough to save at least 1ms when
executing it, and when the tables are small and access is expected to
be fast anyway that's often not the case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing

2013-12-09 Thread Alvaro Herrera

Here's a revamped version of this patch.  One thing I didn't do here is
revert the exporting of CreateMultiXactId, but I don't see any way to
avoid that.

Andres mentioned the idea of sharing some code between
heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't
explored that.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***
*** 5238,5251  heap_inplace_update(Relation relation, HeapTuple tuple)
  		CacheInvalidateHeapTuple(relation, tuple, NULL);
  }
  
  
  /*
!  * heap_freeze_tuple
   *
   * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
!  * are older than the specified cutoff XID.  If so, replace them with
!  * FrozenTransactionId or InvalidTransactionId as appropriate, and return
!  * TRUE.  Return FALSE if nothing was changed.
   *
   * It is assumed that the caller has checked the tuple with
   * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
--- 5238,5448 
  		CacheInvalidateHeapTuple(relation, tuple, NULL);
  }
  
+ #define		FRM_NOOP0x0001
+ #define		FRM_INVALIDATE_XMAX		0x0002
+ #define		FRM_RETURN_IS_XID		0x0004
+ #define		FRM_RETURN_IS_MULTI		0x0008
+ #define		FRM_MARK_COMMITTED		0x0010
  
  /*
!  * FreezeMultiXactId
!  *		Determine what to do during freezing when a tuple is marked by a
!  *		MultiXactId.
!  *
!  * "flags" is an output value; it's used to tell caller what to do on return.
!  *
!  * Possible flags are:
!  * FRM_NOOP
!  *		don't do anything -- keep existing Xmax
!  * FRM_INVALIDATE_XMAX
!  *		mark Xmax as InvalidTransactionId and set XMAX_INVALID flag.
!  * FRM_RETURN_IS_XID
!  *		The Xid return value is a single update Xid to set as xmax.
!  * FRM_MARK_COMMITTED
!  *		Xmax can be marked as HEAP_XMAX_COMMITTED
!  * FRM_RETURN_IS_MULTI
!  *		The return value is a new MultiXactId to set as new Xmax.
!  *		(caller must obtain proper infomask bits using GetMultiXactIdHintBits)
!  */
! static TransactionId
! FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
!   TransactionId cutoff_xid, MultiXactId cutoff_multi,
!   uint16 *flags)
! {
! 	TransactionId	xid = InvalidTransactionId;
! 	int			i;
! 	MultiXactMember *members;
! 	int			nmembers;
! 	bool	need_replace;
! 	int		nnewmembers;
! 	MultiXactMember *newmembers;
! 	bool	has_lockers;
! 	TransactionId update_xid;
! 	bool	update_committed;
! 
! 	*flags = 0;
! 
! 	if (!MultiXactIdIsValid(multi))
! 	{
! 		/* Ensure infomask bits are appropriately set/reset */
! 		*flags |= FRM_INVALIDATE_XMAX;
! 		return InvalidTransactionId;
! 	}
! 	else if (MultiXactIdPrecedes(multi, cutoff_multi))
! 	{
! 		/*
! 		 * This old multi cannot possibly have members still running.  If it
! 		 * was a locker only, it can be removed without any further
! 		 * consideration; but if it contained an update, we might need to
! 		 * preserve it.
! 		 */
! 		if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
! 		{
! 			*flags |= FRM_INVALIDATE_XMAX;
! 			return InvalidTransactionId;
! 		}
! 		else
! 		{
! 			/* replace multi by update xid */
! 			xid = MultiXactIdGetUpdateXid(multi, t_infomask);
! 
! 			/* wasn't only a lock, xid needs to be valid */
! 			Assert(TransactionIdIsValid(xid));
! 
! 			/*
! 			 * If the xid is older than the cutoff, it has to have aborted,
! 			 * otherwise the tuple would have gotten pruned away.
! 			 */
! 			if (TransactionIdPrecedes(xid, cutoff_xid))
! 			{
! Assert(!TransactionIdDidCommit(xid));
! *flags |= FRM_INVALIDATE_XMAX;
! /* xid = InvalidTransactionId; */
! 			}
! 			else
! 			{
! *flags |= FRM_RETURN_IS_XID;
! 			}
! 		}
! 	}
! 
! 	/*
! 	 * This multixact might have or might not have members still running,
! 	 * but we know it's valid and is newer than the cutoff point for
! 	 * multis.  However, some member(s) of it may be below the cutoff for
! 	 * Xids, so we need to walk the whole members array to figure out what
! 	 * to do, if anything.
! 	 */
! 
! 	nmembers = GetMultiXactIdMembers(multi, &members, false);
! 	if (nmembers <= 0)
! 	{
! 		/* Nothing worth keeping */
! 		*flags |= FRM_INVALIDATE_XMAX;
! 		return InvalidTransactionId;
! 	}
! 
! 	/* is there anything older than the cutoff? */
! 	need_replace = false;
! 	for (i = 0; i < nmembers; i++)
! 	{
! 		if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
! 		{
! 			need_replace = true;
! 			break;
! 		}
! 	}
! 
! 	/*
! 	 * In the simplest case, there is no member older than the cutoff; we can
! 	 * keep the existing MultiXactId as is.
! 	 */
! 	if (!need_replace)
! 	{
! 		*flags |= FRM_NOOP;
! 		pfree(members);
! 		return InvalidTransactionId;
! 	}
! 
! 	/*
! 	 * If the multi needs to be updated, figure out which members do we need
! 	 * to keep.
! 	 */
! 	nnewmembers = 0;
! 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
! 	has_lockers = false;
! 	update_xid = InvalidTransactionId;
! 	update_com

Re: [HACKERS] plpgsql_check_function - rebase for 9.3

2013-12-09 Thread Peter Eisentraut

On 12/8/13, 12:01 PM, Pavel Stehule wrote:
> But still I have no idea, how to push check without possible slowdown
> execution with code duplication

Create a GUC parameter plpgsql.slow_checks or whatever (perhaps more
specific), which people can turn on when they run their test suites.

This doesn't really have to be all that much different from what we are
currently doing in C with scan-build and address sanitizer, for example.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas
 wrote:
> On 12/09/2013 11:35 PM, Jim Nasby wrote:
>>
>> On 12/8/13 1:49 PM, Heikki Linnakangas wrote:
>>>
>>> On 12/08/2013 08:14 PM, Greg Stark wrote:

 The whole accounts table is 1.2GB and contains 10 million rows. As
 expected with rows_per_block set to 1 it reads 240MB of that
 containing nearly 2 million rows (and takes nearly 20s -- doing a full
 table scan for select count(*) only takes about 5s):
>>>
>>>
>>> One simple thing we could do, without or in addition to changing the
>>> algorithm, is to issue posix_fadvise() calls for the blocks we're
>>> going to read. It should at least be possible to match the speed of a
>>> plain sequential scan that way.
>>
>>
>> Hrm... maybe it wouldn't be very hard to use async IO here either? I'm
>> thinking it wouldn't be very hard to do the stage 2 work in the callback
>> routine...
>
>
> Yeah, other than the fact we have no infrastructure to do asynchronous I/O
> anywhere in the backend. If we had that, then we could easily use it here. I
> doubt it would be much better than posix_fadvising the blocks, though.


Without patches to the kernel, it is much better.

posix_fadvise interferes with read-ahead, so posix_fadvise on, say,
bitmap heap scans (or similarly sorted analyze block samples) run at 1
IO / block, ie horrible, whereas aio can do read coalescence and
read-ahead when the kernel thinks it'll be profitable, significantly
increasing IOPS. I've seen everything from a 2x to 10x difference.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] stats for network traffic WIP

2013-12-09 Thread Nigel Heron

On Sat, Dec 7, 2013 at 1:17 PM, Fujii Masao  wrote:
>
> Could you share the performance numbers? I'm really concerned about
> the performance overhead caused by this patch.
>

I've tried pgbench in select mode with small data sets to avoid disk
io and didn't see any difference. That was on my old core2duo laptop
though .. I'll have to retry it on some server class multi core
hardware.

I could create a new GUC to turn on/off this feature. Currently, it
uses "track_counts".

> Here are the comments from me:
>
> All the restrictions of this feature should be documented. For example,
> this feature doesn't track the bytes of the data transferred by FDW.
> It's worth documenting that kind of information.
>

OK. It also doesn't account for DNS resolution, Bonjour traffic and
any traffic generated from PL functions that create their own sockets.

> ISTM that this feature doesn't support SSL case. Why not?

It does support SSL, see my_sock_read() and my_sock_write() in
backend/libpq/be-secure.c

> The amount of data transferred by walreceiver also should be tracked,
> I think.

I'll have to take another look at it. I might be able to create SSL
BIO functions in libpqwalreceiver.c and change some other functions
(eg. libpqrcv_send) to return byte counts instead of void to get it
working.

> I just wonder how conn_received, conn_backend and conn_walsender
> are useful.

I thought of it mostly for monitoring software usage (eg. cacti,
nagios) to track connections/sec which might be used for capacity
planning, confirm connection pooler settings, monitoring abuse, etc.
Eg. If your conn_walsender is increasing and you have a fixed set of
slaves it could show a network issue.
The information is available in the logs if "log_connections" GUC is
on but it requires parsing and access to log files to extract. With
the increasing popularity of hosted postgres services without OS or
log access, I think more metrics should be available through system
views.

-nigel.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good


On 12/09/2013 11:35 PM, Jim Nasby wrote:

On 12/8/13 1:49 PM, Heikki Linnakangas wrote:

On 12/08/2013 08:14 PM, Greg Stark wrote:

The whole accounts table is 1.2GB and contains 10 million rows. As
expected with rows_per_block set to 1 it reads 240MB of that
containing nearly 2 million rows (and takes nearly 20s -- doing a full
table scan for select count(*) only takes about 5s):


One simple thing we could do, without or in addition to changing the
algorithm, is to issue posix_fadvise() calls for the blocks we're
going to read. It should at least be possible to match the speed of a
plain sequential scan that way.


Hrm... maybe it wouldn't be very hard to use async IO here either? I'm
thinking it wouldn't be very hard to do the stage 2 work in the callback
routine...


Yeah, other than the fact we have no infrastructure to do asynchronous 
I/O anywhere in the backend. If we had that, then we could easily use it 
here. I doubt it would be much better than posix_fadvising the blocks, 
though.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?

2013-12-09 Thread Kevin Grittner

MauMau  wrote:
> From: "Greg Stark" 

>> On the client end the FATAL is pretty logical but in the logs it
>> makes it sound like the entire server died.

I agree that is easily misunderstood, especially since a FATAL
problem is less severe than a PANIC; while in common English usage
panic is what might feel when faced with the prospect of speaking
in public, but fatal generally means something that kills -- like a
disease or a plane crash.  There is the notion of a "fatal error"
in English, though; which means an error which puts an end to what
the person who makes such an error is attempting.

>> FATAL is a term of art peculiar to Postgres.

No, it is not; at least not in terms of being "characteristic of
only one person, group, or thing".  The Java logger from Apache
called log4j has a FATAL level which is more serious than ERROR.
The distinction is intended to indicate whether the application is
likely to be able to continue running.  Similarly, Sybase ASE has
severity levels, where above a certain point they are described as
"fatal" -- meaning that the application must acquire a new
connection.  It's probably used elsewhere as well; those are just a
couple I happened to be familiar with.

> I find it unnatural for a normal administration operation to emit
> a FATAL message.

It seems to be a fairly common term of art for a problem which
requires a restart or reconnection.  FATAL is used when the problem
is severe enough that the process or connection must end.  It seems
to me to be what should consistently be used when a client
connection or its process must be terminated for a reason other
than a client-side request to terminate.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

On Mon, Dec 9, 2013 at 1:18 PM, Jeff Janes  wrote:
> I don't recall ever tracing a bad plan down to a bad n_distinct.

It does happen. I've seen it several times.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good


On 12/8/13 1:49 PM, Heikki Linnakangas wrote:

On 12/08/2013 08:14 PM, Greg Stark wrote:

The whole accounts table is 1.2GB and contains 10 million rows. As
expected with rows_per_block set to 1 it reads 240MB of that
containing nearly 2 million rows (and takes nearly 20s -- doing a full
table scan for select count(*) only takes about 5s):


One simple thing we could do, without or in addition to changing the algorithm, 
is to issue posix_fadvise() calls for the blocks we're going to read. It should 
at least be possible to match the speed of a plain sequential scan that way.


Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking 
it wouldn't be very hard to do the stage 2 work in the callback routine...
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bug in VACUUM reporting of "removed %d row versions" in 9.2+


Where are we on this?

---

On Fri, May 10, 2013 at 04:37:58PM +0100, Simon Riggs wrote:
> Commit d0dcb315db0043f10073a9a244cea138e9e60edd and previous
> introduced a bug into the reporting of removed row versions. ('Twas
> myself et al, before you ask).
> 
> In those commits, lazy_vacuum_heap() skipped pinned blocks, but then
> failed to report that accurately, claiming that the tuples were
> actually removed when they were not. That bug has masked the effect of
> the page skipping behaviour.
> 
> Bug is in 9.2 and HEAD.
> 
> Attached patch corrects that, with logic to move to the next block
> rather than re-try the lock in a tight loop once per tuple, which was
> mostly ineffective.
> 
> Attached patch also changes the algorithm slightly to retry a skipped
> block by sleeping and then retrying the block, following observation
> of the effects of the current skipping algorithm once skipped rows are
> correctly reported.
> 
> It also adds a comment which explains the skipping behaviour.
> 
> Viewpoints?
> 
> --
>  Simon Riggs   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

> diff --git a/src/backend/commands/vacuumlazy.c 
> b/src/backend/commands/vacuumlazy.c
> index 02f3cf3..f0d054a 100644
> --- a/src/backend/commands/vacuumlazy.c
> +++ b/src/backend/commands/vacuumlazy.c
> @@ -1052,15 +1052,15 @@ lazy_scan_heap(Relation onerel, LVRelStats 
> *vacrelstats,
>  static void
>  lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
>  {
> - int tupindex;
> - int npages;
> + int tupindex = 0;
> + int npages = 0;
> + int ntupskipped = 0;
> + int npagesskipped = 0;
>   PGRUsageru0;
>   Buffer  vmbuffer = InvalidBuffer;
>  
>   pg_rusage_init(&ru0);
> - npages = 0;
>  
> - tupindex = 0;
>   while (tupindex < vacrelstats->num_dead_tuples)
>   {
>   BlockNumber tblk;
> @@ -1075,9 +1075,32 @@ lazy_vacuum_heap(Relation onerel, LVRelStats 
> *vacrelstats)
>vac_strategy);
>   if (!ConditionalLockBufferForCleanup(buf))
>   {
> - ReleaseBuffer(buf);
> - ++tupindex;
> - continue;
> + /*
> +  * If we can't get the lock, sleep, then try again just 
> once.
> +  *
> +  * If we can't get the lock the second time, skip this 
> block and
> +  * move onto the next one. This is possible because by 
> now we
> +  * know the tuples are dead and all index pointers to 
> them have been
> +  * removed, so it is safe to ignore them, even if not 
> ideal.
> +  */
> + VacuumCostBalance += VacuumCostLimit;
> + vacuum_delay_point();
> + if (!ConditionalLockBufferForCleanup(buf))
> + {
> + BlockNumber blkno = tblk;
> +
> + ReleaseBuffer(buf);
> + tupindex++;
> + for (; tupindex < vacrelstats->num_dead_tuples; 
> tupindex++)
> + {
> + ntupskipped++;
> + tblk = 
> ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
> + if (tblk != blkno)
> + break;
> + }
> + npagesskipped++;
> + continue;
> + }
>   }
>   tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, 
> vacrelstats,
>   
> &vmbuffer);
> @@ -1098,9 +1121,9 @@ lazy_vacuum_heap(Relation onerel, LVRelStats 
> *vacrelstats)
>   }
>  
>   ereport(elevel,
> - (errmsg("\"%s\": removed %d row versions in %d pages",
> + (errmsg("\"%s\": removed %d row versions in %d pages 
> (skipped %d row versions in %d pages)",
>   RelationGetRelationName(onerel),
> - tupindex, npages),
> + tupindex - ntupskipped, npages, 
> ntupskipped, npagesskipped),
>errdetail("%s.",
>  pg_rusage_show(&ru0;
>  }

> 
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
  Bruce Momjia

Re: [HACKERS] plpgsql_check_function - rebase for 9.3


On 12/9/13 1:08 PM, Pavel Stehule wrote:

So presumably it would be check_never, not check_newer... :) BTW, it's not 
terribly hard to work around the temp table issue; you just need to create the 
expected table in the session when you create the function. But even in this 
case, I think it would still be good to check what we can, like at least basic 
plpgsql syntax.


I sorry.

You cannot to create temporary table - this check should not have any side 
effect - and creating temporary table can run some event trigger.

But there should be some hints for check like annotations or some similar. Or 
you can minimize a area where check will be disabled.


Sorry, I meant that the user can work around it by creating the table. I didn't 
mean to imply that we would magically create a temp table to do the checking.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good


On 12/6/13 3:21 AM, Andres Freund wrote:

On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote:

Has anyone ever thought about opportunistic ANALYZE piggy-backing on
other full-table scans? That doesn't really help Greg, because his
complaint is mostly that a fresh ANALYZE is too expensive, but it
could be an interesting, albeit risky approach.


What I've been thinking of is

a) making it piggy back on scans vacuum is doing instead of doing
separate ones all the time (if possible, analyze needs to be more
frequent). Currently with quite some likelihood the cache will be gone
again when revisiting.


FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to fire up a 
2nd backend to do the ANALYZE portion (or perhaps use Robert's fancy new shared 
memory stuff).
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Jeff Janes

On Sat, Dec 7, 2013 at 11:46 AM, Robert Haas  wrote:

> On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark  wrote:
> > I always gave the party line that ANALYZE only takes a small
> > constant-sized sample so even very large tables should be very quick.
> > But after hearing the same story again in Heroku I looked into it a
> > bit further. I was kind of shocked but the numbers.
> >
> > ANALYZE takes a sample of 300 * statistics_target rows. That sounds
> > pretty reasonable but with default_statistics_target set to 100 that's
> > 30,000 rows. If I'm reading the code right It takes this sample by
> > sampling 30,000 blocks and then (if the table is large enough) taking
> > an average of one row per block. Each block is 8192 bytes so that
> > means it's reading 240MB of each table.That's a lot more than I
> > realized.
>
> That is a lot.  On the other hand, I question the subject line:
> sometimes, our ANALYZE sampling is not good enough.  Before we raised
> the default statistics target from 10 to 100, complaints about bad
> plan choices due to insufficiently-precise statistics were legion --
> and we still have people periodically proposing to sample a fixed
> percentage of the table instead of a fixed amount of the table, even
> on large tables, which is going the opposite direction.  I think this
> is because they're getting really bad n_distinct estimates, and no
> fixed-size sample can reliably give a good one.
>


I don't recall ever tracing a bad plan down to a bad n_distinct.  I have
seen several that were due to bad frequency estimates in MCV list, because
hash join planning is extremely sensitive to that.  Do we have some kind of
catalog of generators of problematic data, so that changes can be tested on
known problem sets?  Perhaps a wiki page to accumulate them would be
useful.  For automated testing I guess the generator and query is the easy
part, the hard part is the cost settings/caching/RAM needed to trigger the
problem, and parsing and interpreting the results.


>
> More generally, I think the basic problem that people are trying to
> solve by raising the statistics target is avoid index scans on
> gigantic tables.  Obviously, there are a lot of other situations where
> inadequate statistics can cause problems, but that's a pretty
> easy-to-understand one that we do not always get right.  We know that
> an equality search looking for some_column = 'some_constant', where
> some_constant is an MCV, must be more selective than a search for the
> least-frequent MCV.  If you store more and more MCVs for a table,
> eventually you'll have enough that the least-frequent one is pretty
> infrequent, and then things work a lot better.
>

My reading of the code is that if it is not in the MCV, then it is assumed
to have the average selectivity (about 1/n_distinct, but deflating top and
bottom for the MCV list).  There is also a check that it is less than the
least common of the MCV, but I don't know why that situation would ever
prevail--that should always be higher or equal to the average selectivity.



>
> This is more of a problem for big tables than for small tables.  MCV
> #100 can't have a frequency of greater than 1/100 = 0.01, but that's a
> lot more rows on a big table than small one.  On a table with 10
> million rows we might estimate something close to 100,000 rows when
> the real number is just a handful; when the table has only 10,000
> rows, we just can't be off by as many orders of magnitude.  Things
> don't always work out that badly, but in the worst case they do.
>
> Maybe there's some highly-principled statistical approach which could
> be taken here, and if so that's fine, but I suspect not.  So what I
> think we should do is auto-tune the statistics target based on the
> table size.  If, say, we think that the generally useful range for the
> statistics target is something like 10 to 400, then let's come up with
> a formula based on table size that outputs 10 for small tables, 400
> for really big tables, and intermediate values for tables in the
> middle.
>

I think that parts of the planner are N^2 in the size of histogram (or was
that the size of the MCV list?).  So we would probably need a way to use a
larger sample size to get more accurate n_distinct and MCV frequencies, but
not save the entire histogram that goes with that sample size.


Cheers,

Jeff

Re: [HACKERS] GIN improvements part 1: additional information


On 12/09/2013 11:34 AM, Alexander Korotkov wrote:

On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas
wrote:


Even if we use varbyte encoding, I wonder if it would be better to treat
block + offset number as a single 48-bit integer, rather than encode them
separately. That would allow the delta of two items on the same page to be
stored as a single byte, rather than two bytes. Naturally it would be a
loss on other values, but would be nice to see some kind of an analysis on
that. I suspect it might make the code simpler, too.


Yeah, I had that idea, but I thought it's not a better option. Will try to
do some analysis.


The more I think about that, the more convinced I am that it's a good 
idea. I don't think it will ever compress worse than the current 
approach of treating block and offset numbers separately, and, although 
I haven't actually tested it, I doubt it's any slower. About the same 
amount of arithmetic is required in both versions.


Attached is a version that does that. Plus some other minor cleanup.

(we should still investigate using a completely different algorithm, though)

- Heikki


gin-packed-postinglists-19.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] UNNEST with multiple args, and TABLE with multiple funcs

2013-12-09 Thread Noah Misch

On Thu, Dec 05, 2013 at 10:34:08PM -0500, Stephen Frost wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > Noah Misch  writes:
> > > Two naming proposals, "ROWS FROM" and "TABLE FROM", got an ACK from more 
> > > than
> > > one person apiece.  I move that we settle on "ROWS FROM".
> > 
> > I'm not sufficiently annoyed by "ROWS FROM" to object.  Other opinions?
> 
> Works well enough for me.

Great.  Here's the patch I'll be using.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
diff --git a/doc/src/sgml/queries.sgml b/doc/src/sgml/queries.sgml
index b33de68..daba74b 100644
--- a/doc/src/sgml/queries.sgml
+++ b/doc/src/sgml/queries.sgml
@@ -647,7 +647,7 @@ FROM (VALUES ('anne', 'smith'), ('bob', 'jones'), ('joe', 
'blow'))
 
 
 
- Table functions may also be combined using the TABLE
+ Table functions may also be combined using the ROWS FROM
  syntax, with the results returned in parallel columns; the number of
  result rows in this case is that of the largest function result, with
  smaller results padded with NULLs to match.
@@ -655,7 +655,7 @@ FROM (VALUES ('anne', 'smith'), ('bob', 'jones'), ('joe', 
'blow'))
 
 
 function_call WITH ORDINALITY 
AS table_alias 
(column_alias , ... 
)
-TABLE( function_call , ...  ) 
WITH ORDINALITY AS 
table_alias 
(column_alias , ... 
)
+ROWS FROM( function_call , ... 
 ) WITH ORDINALITY 
AS table_alias 
(column_alias , ... 
)
 
 
 
@@ -674,7 +674,7 @@ TABLE( function_call , 
...  ) UNNEST
  () had been called on each parameter
- separately and combined using the TABLE construct.
+ separately and combined using the ROWS FROM construct.
 
 
 
@@ -683,7 +683,7 @@ UNNEST( array_expression 
, ... 
 
 
  If no table_alias is specified, the function
- name is used as the table name; in the case of a TABLE()
+ name is used as the table name; in the case of a ROWS FROM()
  construct, the first function's name is used.
 
 
@@ -731,20 +731,20 @@ SELECT * FROM vw_getfoo;
 
 function_call AS 
alias (column_definition 
, ... )
 function_call AS 
alias 
(column_definition , ... )
-TABLE( ... function_call AS 
(column_definition , ... ) 
, ...  )
+ROWS FROM( ... function_call AS 
(column_definition , ... ) 
, ...  )
 
 
 
- When not using the TABLE() syntax,
+ When not using the ROWS FROM() syntax,
  the column_definition list replaces the column
  alias list that could otherwise be attached to the FROM
  item; the names in the column definitions serve as column aliases.
- When using the TABLE() syntax,
+ When using the ROWS FROM() syntax,
  a column_definition list can be attached to
  each member function separately; or if there is only one member function
  and no WITH ORDINALITY clause,
  a column_definition list can be written in
- place of a column alias list following TABLE().
+ place of a column alias list following ROWS FROM().
 
 
 
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 88ebd73..d6a17cc 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -56,7 +56,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionalias [ ( column_alias [, ...] ) ] ]
 [ LATERAL ] function_name ( [ 
argument [, ...] ] ) [ AS ] 
alias ( column_definition [, ...] )
 [ LATERAL ] function_name ( [ 
argument [, ...] ] ) AS ( 
column_definition [, ...] )
-[ LATERAL ] TABLE( function_name ( [ argument [, ...] ] ) [ AS ( column_definition [, ...] ) ] [, ...] )
+[ LATERAL ] ROWS FROM( function_name ( [ argument [, ...] ] ) [ AS ( column_definition [, ...] ) ] [, ...] )
 [ WITH ORDINALITY ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
 from_item [ NATURAL ] 
join_type from_item [ ON join_condition | USING ( join_column [, ...] ) ]
 
@@ -390,7 +390,7 @@ TABLE [ ONLY ] table_name [ * ]

 Multiple function calls can be combined into a
 single FROM-clause item by surrounding them
-with TABLE( ... ).  The output of such an item is the
+with ROWS FROM( ... ).  The output of such an item is the
 concatenation of the first row from each function, then the second
 row from each function, etc.  If some of the functions produce fewer
 rows than others, NULLs are substituted for the missing data, so
@@ -410,18 +410,18 @@ TABLE [ ONLY ] table_name [ * ]

 

-When using the TABLE( ... ) syntax, if one of the
+When using the ROWS FROM( ... ) syntax, if one of the
 functions requires a column definition list, it's preferred to put
 the column definition list after the function call inside
-TABLE( ... ).  A column definition list can be placed
-after the TABLE( ... ) construct only if there's just a
-single function and no WITH ORDINALITY clause.
+ROWS FROM( ... ).  A column definition list can be placed
+after the

[HACKERS] In-Memory Columnar Store

2013-12-09 Thread knizhnik


Hello!

I want to annouce my implementation of In-Memory Columnar Store 
extension for PostgreSQL:


 Documentation: http://www.garret.ru/imcs/user_guide.html
 Sources: http://www.garret.ru/imcs-1.01.tar.gz

Any feedbacks, bug reports and suggestions are welcome.

Vertical representation of data is stored in PostgreSQL shared memory.
This is why it is important to be able to utilize all available physical 
memory.
Now servers with Tb or more RAM are not something exotic, especially in 
financial world.
But there is limitation in Linux with standard 4kb pages  for maximal 
size of mapped memory segment: 256Gb.
It is possible to overcome this limitation either by creating multiple 
segments - but it requires too much changes in PostgreSQL memory manager.
Or just set MAP_HUGETLB flag (assuming that huge pages were allocated in 
the system).


I found several messages related with MAP_HUGETLB flag, the most recent 
one was from 21 of November:

http://www.postgresql.org/message-id/20131125032920.ga23...@toroid.org

I wonder what is the current status of this patch?






--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] shared memory message queues

On Sun, Dec 8, 2013 at 5:52 AM, Kohei KaiGai  wrote:
> 2013/12/6 Kohei KaiGai :
>> What will happen if sender tries to send a large chunk that needs to
>> be split into multiple sub-chunks and receiver concurrently detaches
>> itself from the queue during the writes by sender?
>> It seems to me the sender gets SHM_MQ_DETACHED and only
>> earlier half of the chunk still remains on the queue even though
>> its total length was already in the message queue.
>> It may eventually lead infinite loop on the receiver side when another
>> receiver appeared again later, then read incomplete chunk.
>> Does it a feasible scenario? If so, it might be a solution to prohibit
>> enqueuing something without receiver, and reset queue when a new
>> receiver is attached.
>>
> Doesn't it an intended usage to attach a peer process on a message
> queue that had once detached, does it?
> If so, it may be a solution to put ereport() on shm_mq_set_receiver()
> and shm_mq_set_sender() to prohibit to assign a process on the
> message queue with mq_detached = true. It will make the situation
> simplified.

It's not intended that you should be able to attach a new reader or
writer in place of an old one that detached.  That would in fact be
pretty tricky to support, because if the detached process was in the
middle of reading or writing a message at the time it died, then
there's no way to recover protocol sync.  We could design some
mechanism for that, but in the case of background workers connected to
dynamic shared memory segments it isn't needed, because I assume that
when the background worker croaks, you're going to tear down the
dynamic shared memory segment and thus the whole queue will disappear;
if the user retries the query, we'll create a whole new segment
containing a whole new queue (or queues).

Now, if we wanted to use these queues in permanent shared memory, we'd
probably need to think a little bit harder about this.  It is not
impossible to make it work even as things stand, because you could
reuse the same chunk of shared memory and just overwrite it with a
newly-initialized queue.  You'd need some mechanism to figure out when
to do that, and it might be kind of ugly, but I think i'd be doable.
That wasn't the design center for this attempt, though, and if we want
to use it that way then we probably should spend some time figuring
out how to support both a "clean" detach, where the reader or writer
goes away at a message boundary, and possibly also a "dirty" detach,
where the reader or writer goes away in the middle of a message.  I
view those as problems for future patches, though.

> Regarding to the test-shm-mq-v1.patch, setup_background_workers()
> tries to launch nworkers of background worker processes, however,
> may fail during the launching if max_worker_processes is not enough.
> Is it a situation to attach the BGWORKER_EPHEMERAL flag when
> your patch gets committed, isn't it?

I dropped the proposal for BGWORKER_EPHEMERAL; I no longer think we
need that.  If not all of the workers can be registered,
setup_background_workers() will throw an error when
RegisterDynamicBackgroundWorker returns false.  If the workers are
successfully registered but don't get as far as connecting to the
shm_mq, wait_for_workers_to_become_ready() will detect that condition
and throw an error.  If all of the workers start up and attached to
the shared memory message queues but then later one of them dies, the
fact that it got as far as connecting to the shm_mq means that the
message queue's on_dsm_detach callback will run, which will mark the
queues to which it is connected as detached.  That will cause the
workers on either side of it to exit also until eventually the failure
propagates back around to the user backend.  This is all a bit complex
but I don't see a simpler solution.

> Also, test_shm_mq_setup() waits for completion of starting up of
> background worker processes. I'm uncertain whether it is really
> needed, because this shared memory message queue allows to
> send byte stream without receiver, and also blocks until byte
> stream will come from the peer to be set later.

That's actually a very important check.  Suppose we've got just 3
worker processes, so that the message queues are connected like this:

user backend -> worker 1 -> worker 2 -> worker 3 -> user backend

When test_shm_mq_setup connects to the queues linking it to worker 1
and worker 3, it passes a BackgroundWorkerHandle to shm_mq_attach. As
a result, if either worker 1 or worker 3 fails during startup, before
attaching to the queue, the user backend would notice that and error
out right away, even if it didn't do
wait_for_workers_to_become_ready().  However, if worker 2 fails during
startup, neither the user backend nor either of the other workers
would notice that without wait_for_workers_to_become_ready(): the user
backend isn't connected to worker 2 by a shm_mq at all, and workers 1
and 3 have no BackgroundWorkerHandle to pass to shm_mq_attach(),

Re: [HACKERS] What are multixactids?


On 12/9/13 1:05 PM, hubert depesz lubaczewski wrote:

On Mon, Dec 09, 2013 at 07:59:10PM +0100, Andreas Karlsson wrote:

I recommend you read the section in README.tuplock.
1. 
https://github.com/postgres/postgres/blob/d9250da032e723d80bb0150b9276cc544df6a087/src/backend/access/heap/README.tuplock#L68


+1, even if it's just a link to the README. I also started wondering what these 
were while investigating the vacuum bugs.

I'd submit a patch, but I don't have doc builds working :(
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] What are multixactids?

2013-12-09 Thread Andreas Karlsson


On 12/09/2013 08:05 PM, hubert depesz lubaczewski wrote:

Thanks. Read that. Still, it would be good to have some information in
normal docs, but I guess this has to do for now.


It is mentioned several times in the documentation but I do not think it 
is explained anywhere.


--
Andreas Karlsson


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] plpgsql_check_function - rebase for 9.3

2013-12-09 Thread Pavel Stehule

2013/12/9 Jim Nasby 

> On 12/8/13 11:24 PM, Pavel Stehule wrote:
>
>>  > #option check_on_first_start
>>  > #option check_on_create
>>  > #option check_newer
>>
>> what exactly check_newer means, does it mean whenever a function is
>> replaced (changed)?
>>
>>
>> no, it means, so request for check will be ignored ever - some functions
>> cannot be deeply checked due using dynamic SQL or dynamic created data
>> types - temporary tables created in functions.
>>
>
> So presumably it would be check_never, not check_newer... :) BTW, it's not
> terribly hard to work around the temp table issue; you just need to create
> the expected table in the session when you create the function. But even in
> this case, I think it would still be good to check what we can, like at
> least basic plpgsql syntax.
>

I sorry.

You cannot to create temporary table - this check should not have any side
effect - and creating temporary table can run some event trigger.

But there should be some hints for check like annotations or some similar.
Or you can minimize a area where check will be disabled.

>
> Do we really need first_start? ISTM that if you're dependent on run state
> then you're basically out of luck.
>

I afraid so checking on creation time is not enough for plpgsql.

and I have a very good experience with check on start from plpgsql_lint
usage when I wrote a regression tests.

A "first start" doesn't create dependency on state - but just more
preciously define a time, when checking will be done. Probably a option
check_create_and_start can be useful.

Regards

Pavel

> --
> Jim C. Nasby, Data Architect   j...@nasby.net
> 512.569.9461 (cell) http://jim.nasby.net
>

Re: [HACKERS] What are multixactids?

2013-12-09 Thread hubert depesz lubaczewski

On Mon, Dec 09, 2013 at 07:59:10PM +0100, Andreas Karlsson wrote:
> As you can see from Peter's message it is explained in
> README.tuplock[1]. Basically it is used whenever more than one lock
> is acquired on the same tuples as a reference to where the locks are
> stored. It can store updated/deleted Xid for the tuple so it needs
> to be persisted.
> I recommend you read the section in README.tuplock.
> 1. 
> https://github.com/postgres/postgres/blob/d9250da032e723d80bb0150b9276cc544df6a087/src/backend/access/heap/README.tuplock#L68

Thanks. Read that. Still, it would be good to have some information in
normal docs, but I guess this has to do for now.

Best regards,

depesz

-- 
The best thing about modern society is how easy it is to avoid contact with it.
 http://depesz.com/


signature.asc
Description: Digital signature

Re: [HACKERS] How to do fast performance timing


On 12/9/13 7:33 AM, Benedikt Grundmann wrote:

At Jane Street we have recently spend a lot of time trying to get a fast 
gettimeofday.  I saw lots of references in various postgres hacker threads 
related to a lack of such a facility so 

The culmination of those efforts can be read here:

https://github.com/janestreet/core/blob/master/lib/time_stamp_counter.mli
and

https://github.com/janestreet/core/blob/master/lib/time_stamp_counter.ml

it's all OCaml but the code is mostly imperative and very well documented.  In 
particular we made an effort to document our assumption.  There are a few which 
are ocaml specific.  But a lot of the lessons we have learned here should be 
applicable to postgres.


Looks interesting. I think this isn't nearly as big an issue in Postgres as it 
used to be, but I think there's also things we've been avoiding because of the 
overhead. IE: using IO response time to determine if something came from cache 
or not.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] What are multixactids?

2013-12-09 Thread Andreas Karlsson


On 12/09/2013 06:04 PM, hubert depesz lubaczewski wrote:

Hi,
when working on fixing the bug related to vacuum freeze, I found out
that there is something called "MultiXactId".

Searching docs showed that it is mentioned only once, in release notes
to 9.3.2:
http://www.postgresql.org/search/?u=%2Fdocs%2F9.3%2F&q=multixactid

What's more - I found that Peter Eisentraut already once asked about
them, and lack of documentation:
http://postgresql.1045698.n5.nabble.com/MultiXactId-concept-underdocumented-td5766754.html

So, my question is - what are multixactids, what are they used for,
where can I find any documentation/explanation/whatever?

It seems to be related in some way to the relfrozenxid/vacuum bug, but
I can't comprehend the relation without knowing what multixactid
actually is.


As you can see from Peter's message it is explained in 
README.tuplock[1]. Basically it is used whenever more than one lock is 
acquired on the same tuples as a reference to where the locks are 
stored. It can store updated/deleted Xid for the tuple so it needs to be 
persisted.


I recommend you read the section in README.tuplock.

1. 
https://github.com/postgres/postgres/blob/d9250da032e723d80bb0150b9276cc544df6a087/src/backend/access/heap/README.tuplock#L68


--
Andreas Karlsson


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

On Mon, Dec 9, 2013 at 6:54 PM, Greg Stark  wrote:
>
> This "some math" is straightforward basic statistics.  The 95th
> percentile confidence interval for a sample consisting of 300 samples
> from a population of a 1 million would be 5.66%. A sample consisting
> of 1000 samples would have a 95th percentile confidence interval of
> +/- 3.1%.

Incidentally I got this using an online sample size calculator. Google
turns up several but this one seems the easiest to use:
http://www.raosoft.com/samplesize.html


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] About shared cache invalidation mechanism

On Sun, Dec 8, 2013 at 1:55 AM, huaicheng Li  wrote:
> I know that all invalid cache messages are stored in the
> shmInvalidationBuffer ring buffer and that they should be consumed by all
> other backends to keep their own cache fresh. Since there may be some
> "stragglers" which process the SI message quite slow, we use *catchup*
> interrupt(signal) to accelerate their cosuming shared invalid messages. Here
> comes my questions :
> (1). When the current number of messages in the shmInvalidationBuffer
> exceeds a threshold, it needs to be cleaned up by using SICleanupQueue.
> After that, if the number still exceeds MAXNUMMESSAGES/2, threshold will be
> calculated by the following formula：
> Threshold = (numMsgs/CLEANUP_QUANTUM + 1) * CLEANUP_QUANTUM
> (2). For those slow backends, if their *nextMsgNum* value is less than
> *lowbound*, they will be reset, and the *lowbound* is calculated by
> lowbound = maxMsgNum - MAXNUMMESSAGES + minFree,
> and if their *nextMsgNum* value is less than *minsig*, they will get catchup
> signals to speed up, *minsig* is calculated by
> minsig = maxMsgNum - MAXNUMMESSAGES/2
>
> Here, I want to ask why threshold, lowbound and minsig are calculated like
> that ? Do the three formulas have any performance considerations when
> designed ? I have searched through the old mail list archives, but found
> nothing about these(these changes emerged in pg8.4 firstly), any help would
> be appreciated.

Since the invalidation queue is a ring buffer, it will eventually wrap
around, with new invalidation messages overwriting previously-written
ones.  Once that happens, any backends that hadn't yet processed some
message that's since been overwritten have to be reset.  Since there's
no longer any way of knowing exactly which parts of their
backend-local cache need to be flushed, they'll just have to flush
everything.  That's the point of the lowbound.

However, that's expensive, so we want to prevent it from happening.
To do that, when we notice that a backend is way behind, we send it a
catchup signal.  Hopefully, it will respond to the catchup signal by
reading the messages it hasn't yet processed from the queue.  But if
it doesn't do that, or doesn't do it quickly enough, and more messages
keep arriving, then eventually the queue will wrap around and we'll
have to just reset it.

Does that help?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

On Mon, Dec 9, 2013 at 6:03 PM, Josh Berkus  wrote:
>
> It's also applicable for the other stats; histogram buckets constructed
> from a 5% sample are more likely to be accurate than those constructed
> from a 0.1% sample.   Same with nullfrac.  The degree of improved
> accuracy, would, of course, require some math to determine.

This "some math" is straightforward basic statistics.  The 95th
percentile confidence interval for a sample consisting of 300 samples
from a population of a 1 million would be 5.66%. A sample consisting
of 1000 samples would have a 95th percentile confidence interval of
+/- 3.1%.

The histogram and nullfact answers the same kind of question as a
political poll, "what fraction of the population falls within this
subset". This is why pollsters don't need to sample 15 million
Americans to have a decent poll result. That's just not how the math
works for these kinds of questions.

n_distinct is an entirely different kettle of fish. It's a different
kind of problem and the error rate there *is* going to be dependent on
the percentage of the total population that you sampled. Moreover from
the papers I read I'm convinced any sample less than 50-80% is nearly
useless so I'm convinced you can't get good results without reading
the whole table.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] plpgsql_check_function - rebase for 9.3

On 12/8/13 11:24 PM, Pavel Stehule wrote:

 > #option check_on_first_start
 > #option check_on_create
 > #option check_newer

what exactly check_newer means, does it mean whenever a function is
replaced (changed)?

no, it means, so request for check will be ignored ever - some functions cannot 
be deeply checked due using dynamic SQL or dynamic created data types - 
temporary tables created in functions.

So presumably it would be check_never, not check_newer... :) BTW, it's not 
terribly hard to work around the temp table issue; you just need to create the 
expected table in the session when you create the function. But even in this 
case, I think it would still be good to check what we can, like at least basic 
plpgsql syntax.

Do we really need first_start? ISTM that if you're dependent on run state then 
you're basically out of luck.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

On Mon, Dec 9, 2013 at 1:03 PM, Josh Berkus  wrote:
>> I really don't believe the 5% thing. It's not enough for n_distinct
>> and it's *far* too high a value for linear properties like histograms
>> or nullfrac etc.
>
> Actually, it is enough for n_distinct, or more properly, 5% is as good
> as you can get for n_distinct unless you're going to jump to scanning
> 50% or more.

I'd like to see a proof of that result.

Not because I'm hostile to changing the algorithm, but because you've
made numerous mathematical claims on this thread that fly in the face
of what Greg, myself, and others understand to be mathematically true
- including this one.  If our understanding is wrong, then by all
means let's get that fixed.  But you're not going to convince anyone
here that we should rip out the existing algorithm and its
peer-reviewed journal citations by making categorical assertions about
the right way to do things.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra functionality to createuser

On Sat, Dec 7, 2013 at 11:39 PM, Amit Kapila  wrote:
> On Fri, Dec 6, 2013 at 10:31 AM, Peter Eisentraut  wrote:
>> On Wed, 2013-11-20 at 11:23 -0500, Christopher Browne wrote:
>>> I note that similar (with not quite identical behaviour) issues apply
>>> to the user name.  Perhaps the
>>> resolution to this is to leave quoting issues to the administrator.
>>> That simplifies the problem away.
>>
>> How about only one role name per -g option, but allowing the -g option
>> to be repeated?
>
>I think that might simplify the problem and patch, but do you think
> it is okay to have inconsistency
>for usage of options between Create User statement and this utility?

Yes.  In general, command-line utilities use a very different syntax
for options-passing that SQL commands.  Trying to make them consistent
feels unnecessary or perhaps even counterproductive.  And the proposed
syntax is certainly a convention common to many other command-line
utilities, so I think it's fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

Josh Berkus  writes:
> Reading 5% of a 200GB table is going to be considerably faster than
> reading the whole thing, if that 5% is being scanned in a way that the
> FS understands.

Really?  See the upthread point that reading one sector from each track
has just as much seek overhead as reading the whole thing.  I will grant
that if you think that reading a *contiguous* 5% of the table is good
enough, you can make it faster --- but I don't believe offhand that
you can make this better without seriously compromising the randomness
of your sample.  Too many tables are loaded roughly in time order, or
in other ways that make contiguous subsets nonrandom.

> You do seem kind of hostile to the idea of full-page-sampling, going
> pretty far beyond the "I'd need to see the math".  Why?

I'm detecting a lot of hostility to assertions unsupported by any math.
For good reason.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Performance optimization of btree binary search

On Fri, Dec 6, 2013 at 4:53 PM, Peter Geoghegan wrote:
> I had considered that something like Intel Speedstep technology had a
> role here, but I'm pretty sure it steps up very aggressively when
> things are CPU bound - I tested that against a Core 2 Duo desktop a
> couple of years back, where it was easy to immediately provoke it by
> moving around desktop windows or something.

I decided to increase the default CPU governor from "ondemand" to
"performance" for each of the 8 logical cores on this system. I then
re-ran the benchmark. I saw markedly better, much more *consistent*
performance for master [1].

I Googled for clues, and found this:

https://communities.intel.com/community/datastack/blog/2013/08/05/how-to-maximise-cpu-performance-for-the-oracle-database-on-linux

(It happens to mention Oracle, but I think it would equally well apply
to any database). I strongly suspect this is down to Kernel version. I
should highlight this:

"""
Another further CPU setting is the Energy/Performance Bias and Red Hat
and Oracle users should note that the default setting has changed in
the Linux kernel used between the releases of Red Hat/Oracle Linux 5
and Red Hat/Oracle Linux 6. (Some system BIOS options may include a
setting to prevent the OS changing this value). In release 5 Linux did
not set a value for this setting and therefore the value remained at 0
for a bias towards performance. In Red Hat 6 this behaviour has
changed and the default sets a median range to move this bias more
towards conserving energy (remember the same Linux kernel is present
in both ultrabooks as well as servers and on my ultrabook I use
powertop and the other Linux tools and configurations discussed here
to maximise battery life) and reports the following in the dmesg
output on boot.

...

You can also use the tool to set a lower value to change the bias
entirely towards performance (the default release 5 behaviour).
"""

If there is regression in Postgres performance on more recent Linux
kernels [2], perhaps this is it. I certainly don't recall hearing
advice on this from the usual places. I'm surprised that turbo boost
mode wasn't enabled very quickly on a workload like this. It makes a
*huge* difference - at 4 clients (one per physical core), setting the
CPU governor to "performance" increases TPS by a massive 40% compared
to some earlier, comparable runs of master. These days Redhat are even
pointing out that CPU governor policy can be set via cron jobs [3].

I cannot account for why the original benchmark performed was
consistent with the patch having helped to such a large degree, given
the large number of runs involved, and their relatively long duration
for a CPU/memory bound workload. As I said, this machine is on
dedicated hardware, and virtualization was not used. However, at this
point I have no choice but to withdraw it from consideration, and not
pursue this any further. Sorry for the noise.

[1] http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/turbo/

[2] http://www.postgresql.org/message-id/529f7d58.1060...@agliodbs.com

[3]
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/cpufreq_governors.html
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE sampling is too good

2013-12-09 Thread Josh Berkus

Greg,

> I really don't believe the 5% thing. It's not enough for n_distinct
> and it's *far* too high a value for linear properties like histograms
> or nullfrac etc. 

Actually, it is enough for n_distinct, or more properly, 5% is as good
as you can get for n_distinct unless you're going to jump to scanning
50% or more.

It's also applicable for the other stats; histogram buckets constructed
from a 5% sample are more likely to be accurate than those constructed
from a 0.1% sample.   Same with nullfrac.  The degree of improved
accuracy, would, of course, require some math to determine.

> From a computer point of view it's too high to be
> worth bothering. If we have to read 5% of the table we might as well
> do a full scan anyways, it'll be marginally slower but much better
> quality results.

Reading 5% of a 200GB table is going to be considerably faster than
reading the whole thing, if that 5% is being scanned in a way that the
FS understands.

Also, we can optimize this significantly by using the VM, as Robert (I
think) suggested.

In the advanced approaches section, there's also the idea of collecting
analyze data from table pages while they're in memory anyway for other
reasons.

You do seem kind of hostile to the idea of full-page-sampling, going
pretty far beyond the "I'd need to see the math".  Why?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extension Templates S03E11

2013-12-09 Thread Jeff Davis

On Mon, 2013-12-09 at 12:17 -0500, Robert Haas wrote:
> On Sat, Dec 7, 2013 at 3:12 AM, Jeff Davis  wrote:
> > So if we do it this way, then we should pick a new name, like "package".
> 
> That was my first reaction as well, when I looked at this a few years
> ago, but I've since backed away from that position.  You're certainly
> correct that it's awkward to have a single kind of object that behaves
> in two radically different ways, but it's also pretty awkward to have
> the same "stuff" installed as one of two completely different types of
> objects depending on who installed it and how.

I think awkwardness is most visible in the resulting documentation and
error messages.

At the moment, I'm having a difficult time imagining how we explain how
this works to users (or, when they make a mistake or don't get the
results they expect, explain to them what they did wrong and how to fix
it).

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] RFC: programmable file format for postgresql.conf

On Sat, Dec 7, 2013 at 3:28 AM, Álvaro Hernández Tortosa  wrote:
>>> "Right now, writing such a tool in a generic way gets so bogged down
>>> just in parsing/manipulating the postgresql.conf file that it's hard to
>>> focus on actually doing the tuning part."
>>
>> That was in 2008.  I don't think that stance is accurate anymore.
>
> Just for me to learn about this: why is it not accurate anymore?

This topic has been under active discussion for the last five years. I
strongly recommend going back and skimming over the past discussions
before trying to pick it up again. In particular go look up the
discussion of SET PERSISTENT

Since we have include files now you can just generate an
auto-tune.conf and not try to parse or write the main config file.

The reason previous efforts got bogged down in parsing/manipulating
the postgresql.conf file was purely because they were trying to allow
you to edit the file by hand and mix that with auto generated config.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extension Templates S03E11

On Sat, Dec 7, 2013 at 3:12 AM, Jeff Davis  wrote:
> So if we do it this way, then we should pick a new name, like "package".

That was my first reaction as well, when I looked at this a few years
ago, but I've since backed away from that position.  You're certainly
correct that it's awkward to have a single kind of object that behaves
in two radically different ways, but it's also pretty awkward to have
the same "stuff" installed as one of two completely different types of
objects depending on who installed it and how.

If we're targeting deployment of user-written application code, then I
can see that it might make sense to have a different concept than
"extension" for that, because arguably it's a different problem,
though it's no longer clear to me that it's all that much different.
But if we're talking about deployment of the same PGXN code (or
wherever upstream lives) either by a DBA who is also the sysadmin (and
can thus run make install or yum install) or one who is not (and thus
wishes to proceed entirely via libpq) then making those two different
concepts seems like it might be slicing awfully thin.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] What are multixactids?

2013-12-09 Thread hubert depesz lubaczewski

Hi,
when working on fixing the bug related to vacuum freeze, I found out
that there is something called "MultiXactId".

Searching docs showed that it is mentioned only once, in release notes
to 9.3.2:
http://www.postgresql.org/search/?u=%2Fdocs%2F9.3%2F&q=multixactid

What's more - I found that Peter Eisentraut already once asked about
them, and lack of documentation:
http://postgresql.1045698.n5.nabble.com/MultiXactId-concept-underdocumented-td5766754.html

So, my question is - what are multixactids, what are they used for,
where can I find any documentation/explanation/whatever?

It seems to be related in some way to the relfrozenxid/vacuum bug, but
I can't comprehend the relation without knowing what multixactid
actually is.

Best regards,

depesz

-- 
The best thing about modern society is how easy it is to avoid contact with it.
 http://depesz.com/


signature.asc
Description: Digital signature

Re: [HACKERS] RFC: programmable file format for postgresql.conf

On Fri, Dec 6, 2013 at 10:28 PM, Álvaro Hernández Tortosa  wrote:
> I think both could be used a lot, editing directly a rich configuration
> file or using a GUI tool. I'm trying to suggest supporting both.

I don't really understand how changing the file format fixes anything.
 You could make the file an INI file or an XML file and it would still
be hard to edit programmatically, not because the current format is
"hard to parse" in any meaningful sense, but because there's no way
for a program to know how to make changes while preserving the
comments.  For example, suppose the user tries to set work_mem to 4MB.
 If there's an existing line in the config file for work_mem, it's
fairly plausible to think that you might just replace everything on
that line, up to the beginning of any comment, with a new work_mem
setting.

But what if, as in the default configuration file, there isn't any
such setting?  A human will go and find the line that says:

#work_mem = 1MB

...and delete the hash mark, and replace 1MB with 4MB.  No problem!
But for a computer, editing comments is hard, and kind of iffy.  After
all, there might be multiple lines that look like the above, and how
would you know which one to replace?  There could even be something
like this in the file:

#In our installation, because we have very little memory, it's
important not to do anything silly like set
#work_mem = 64MB

A configuration file editor that replaces that line will corrupt the
comment, because no program can be smart enough to recognize the
context the way a human will.

Now, we could design something that gets it right, or close enough to
right, 99% of the time.  But previous discussions of this issue on
this mailing list have concluded that people are not willing to accept
that kind of solution, which IMHO is understandable.

The only kind of change that I see as possibly helpful is some format
that explicitly marks which comments go with which settings.  For
example, suppose we did this:

   work_mem

   min 64kB

If you want to set the value, you remove the comment tags around it.
And if you want to comment on the value, you can put whatever you like
within the comment tags.  Now, you've got a machine-editable format,
assuming that people keep their comments in the  section and
not inside actual SGML comments.

But that's ugly and overly verbose, so meh.

Generally I don't regard trying to tinker with postgresql.conf as a
useful way to spend time.  Many people have strong and sometimes
conflicting feelings about it, making getting any consensus of any
change almost impossible.  And while I'm sure some die-hard will
disagree with me on this, the current format, imperfect as it is, is
not really all *that* bad.  We all have our bones to pick with it and
I certainly wouldn't have picked this exact approach myself, but we
could have done far worse.  If it were clear what the next logical
step to make it better was, or even if it were clear that the current
blew chunks, then I'd be all over putting energy into getting this
fixed.  But it isn't, and it doesn't, and the amount of collective
energy that would need to be put into making any change here doesn't
seem likely to be worth what we'd get out of it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] JSON decoding plugin

2013-12-09 Thread Euler Taveira

On 09-12-2013 13:12, Merlin Moncure wrote:
> This is pretty neat.   Couple minor questions:
> *) Aren't you *en*coding data into json, not the other way around (decoding?)
>
Yes. The 'decoding' came from the functionality (logical decoding) and
because the POC plugin is named 'test_decoding'. I also think that
'json_decoding' doesn't say much about the module purpose. I confess
that I don't like the name but can't come up with a good name. Maybe
'wal2json' or 'logrep2json'? Could you suggest something?

> *) Consider generating a long bytea instead of explicitly writing a
> 32kb sql into the patch.
>
I'll consider for next version.

> *) You've built your own json serializer here.  Maybe some code can be
> shared with the json type?
>
Same here. I already took a look at the json datatype but decided that I
wouldn't mess up with the backend code before have a feedback in the
general idea.

> *) Consider removing 'plugin ' from the name of the plugin.
> --plugin=json_decoding etc.
> 
'plugin' was a tentative to produce an unique name (it sucks but...).


-- 
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_archivecleanup bug

On Thu, Dec 5, 2013 at 6:15 PM, Tom Lane  wrote:
> But the other usages seem to be in assorted utilities, which
> will need to do it right for themselves.  initdb.c's walkdir() seems to
> have it right and might be a reasonable model to follow.  Or maybe we
> should invent a frontend-friendly version of ReadDir() rather than
> duplicating all the error checking code in ten-and-counting places?

If there's enough uniformity in all of those places to make that
feasible, it certainly seems wise to do it that way.  I don't know if
that's the case, though - e.g. maybe some callers want to exit and
others do not.  pg_resetxlog wants to exit; pg_archivecleanup and
pg_standby most likely want to print an error and carry on.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_archivecleanup bug

On Fri, Dec 6, 2013 at 11:10 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Dec 5, 2013 at 6:15 PM, Tom Lane  wrote:
>>> In general, I think there is no excuse for code in the backend to use
>>> readdir() directly; it should be using ReadDir(), which takes care of this
>>> as well as error reporting.
>
>> My understanding is that the fd.c infrastructure can't be used in the
>> postmaster.
>
> Say what?  See ParseConfigDirectory for code that certainly runs in the
> postmaster, and uses ReadDir().

Gosh, I could have sworn that I had calls into fd.c that were crashing
and burning during development because they happened too early in
postmaster startup.  But it seems to work fine now, so I've pushed a
fix for this and a few related issues.  Please let me know if you
think there are remaining issues.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] JSON decoding plugin

2013-12-09 Thread Merlin Moncure

On Mon, Dec 9, 2013 at 7:03 AM, Euler Taveira  wrote:
> Hi,
>
> A few months ago, it was proposed [1] that would be interested to have a
> json output plugin for logical decoding. Here it is.
>
> Each transaction is a JSON object that can contain xid (optional),
> timestamp (optional), and change array. Each change's element is a
> command that was decoded and it can contains: kind (I/U/D), schema
> (optional), table, columnnames, columntypes (optional), columnvalues,
> and oldkeys (only for U/D). columnnames, columntypes and columnvalues
> are arrays. oldkeys is an object that contains the following arrays:
> keynames, keytypes (optional), and keyvalues.
>
> The JSON objects are serialized if you are decoding a serie of
> transactions. Here is an output example:
>
> {
> "xid": 702,
> "change": [
> {
> "kind": "insert",
> "schema": "public",
> "table": "foo",
> "columnnames": ["a", "b", "c"],
> "columntypes": ["int4", "int4", "text"],
> "columnvalues": [1, 2, "test"]
> }
> ,{
> "kind": "update",
> "schema": "public",
> "table": "foo",
> "columnnames": ["a", "b", "c"],
> "columntypes": ["int4", "int4", "text"],
> "columnvalues": [1, 2, "test2"],
> "oldkeys": {
> "keynames": ["a", "b"],
> "keytypes": ["int4", "int4"],
> "keyvalues": [1, 2]
> }
> }
> ]
> }
> {
> "xid": 703,
> "change": [
> {
> "kind": "update",
> "schema": "public",
> "table": "foo",
> "columnnames": ["a", "b", "c"],
> "columntypes": ["int4", "int4", "text"],
> "columnvalues": [1, 3, "test2"],
> "oldkeys": {
> "keynames": ["a", "b"],
> "keytypes": ["int4", "int4"],
> "keyvalues": [1, 2]
> }
> }
> ]
> }
> {
> "xid": 704,
> "change": [
> {
> "kind": "delete",
> "schema": "public",
> "table": "foo",
> "oldkeys": {
> "keynames": ["a", "b"],
> "keytypes": ["int4", "int4"],
> "keyvalues": [1, 3]
> }
> }
> ]
> }
>
>
> Some data types was adapted to conform with JSON spec. NAN and Infinity
> are not valid JSON symbols so their representation is NULL (as some JSON
> implementations). Due to JSON datatype simplicity, I represent the vast
> majority of Postgres datatypes as string (However, I admit that we could
> mimic the json datatype conversion rules).
>
> The oldkeys treatment follows what was defined by the commit [2]. It uses:
>
> (i) primary key (default behavior);
> (ii) unique index (if REPLICA IDENTITY USING INDEX is defined for table);
> (iii) full tuple (if REPLICA IDENTITY FULL is defined for table);
> (iv) nothing means an error (if REPLICA IDENTITY NOTHING is defined for
> table).
>
> The TOAST columns have a special treatment for UPDATEs. If a tuple that
> contains a TOAST field is updated, the TOAST field is included iif it is
> changed too. It means that unchanged TOAST field are omitted from
> columns* arrays. This means less overhead while transmitting,
> processing and applying changes.
>
> By design, (i) output plugin doesn't know about aborted transactions and
> (ii) subtransactions are reordered into a toplevel transaction and only
> the committed pieces are passed to the plugin.
>
> You can test it firing the regression tests (e.g. 'make test') or using
> the following steps?
>
> postgresql.conf:
> wal_level = logical
> max_wal_senders = 2
> max_logical_slots = 2
>
> start collecting WAL records:
>
> $ pg_recvlogical --slot=foo -d euler -f /dev/stdout
> --plugin=json_decoding_plugin --init
>
> [execute some transactions]
>
> start printing decoded transactions:
>
> $ pg_recvlogical --slot=foo -d euler -f /dev/stdout --start
>
> stop collecting WAL records:
>
> $ pg_recvlogical --slot=foo -d euler -f /dev/stdout --stop
>
>
> Comments?

This is pretty neat.   Couple minor questions:
*) Aren't you *en*coding data into json, not the other way around (decoding?)
*) Consider generating a long bytea instead of explicitly writing a
32kb sql into the patch.
*) You've built your own json serializer here.  Maybe some code ca

Re: [HACKERS] Backup throttling

2013-12-09 Thread Fujii Masao

On Fri, Dec 6, 2013 at 6:43 PM, Boszormenyi Zoltan  wrote:
> Hi,
>
> 2013-12-05 15:36 keltezéssel, Antonin Houska írta:
>
>> On 12/02/2013 02:23 PM, Boszormenyi Zoltan wrote:
>>>
>>> Hi,
>>>
>>> I am reviewing your patch.
>>
>> Thanks. New version attached.
>
>
> I have reviewed and tested it and marked it as ready for committer.

Here are the review comments:

+  -r
+  --max-rate

You need to add something like rate.

+The purpose is to limit impact of
pg_basebackup
+on a running master server.

s/"master server"/"server" because we can take a backup from also the standby.

I think that it's better to document the default value and the accepted range of
the rate that we can specify.

You need to change the protocol.sgml because you changed BASE_BACKUP
replication command.

+printf(_("  -r, --max-rate maximum transfer rate to
transfer data directory\n"));

You need to add something like =RATE just after --max-rate.

+result = strtol(src, &after_num, 0);

errno should be set to 0 just before calling strtol().

+if (errno_copy == ERANGE || result != (uint64) ((uint32) result))
+{
+fprintf(stderr, _("%s: transfer rate \"%s\" exceeds integer
range\n"), progname, src);
+exit(1);
+}

We can move this check after the check of "src == after_num" like
parse_int() in guc.c does.
If we do this, the local variable 'errno_copy' is no longer necessary.

I think that it's better to output the hint message like "Valid units for
the transfer rate are \"k\" and \"M\"." when a user specified wrong unit.

+/*
+ * THROTTLING_SAMPLE_MIN / MAX_RATE_LOWER (in seconds) should be the
+ * longest possible time to sleep. Thus the cast to long is safe.
+ */
+pg_usleep((long) sleep);

It's better to use the latch here so that we can interrupt immediately.

Regards,

-- 
Fujii Masao

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WITHIN GROUP patch

2013-12-09 Thread Peter Eisentraut

On 11/21/13, 5:04 AM, Atri Sharma wrote:
> Please find attached the latest patch for WITHIN GROUP. This patch is
> after fixing the merge conflicts.

I would like to see more explanations and examples in the documentation.
 You introduce this feature with "Ordered set functions compute a single
result from an ordered set of input values."  But string_agg, for
example, does that as well, so it's not clear how this is different.
Between ordered aggregates, window functions, and this new feature, it
can get pretty confusing.  Also, the "hypothetical" part should perhaps
be explained in more detail.  The tutorial part of the documentation
contains a nice introduction to window function.  I suggest you add
something like that as well.

In func.sgml, please list the functions in alphabetical order.

Also, don't write "should" when you mean "must".

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Time-Delayed Standbys

On 9 Dec 2013 12:16, "Craig Ringer"  wrote:

> The only way to "deal with" clock drift that isn't fragile in the face
> of variable latency, etc, is to basically re-implement (S)NTP in order
> to find out what the clock difference with the remote is.

There's actually an entirely different way to "deal" with clock drift: test
"master time" and "slave time" as two different incomparable spaces.
Similar to how you would treat measurements in different units.

If you do that then you can measure and manage the delay in the slave
between receiving and applying a record and also measure the amount of
master server time which can be pending. These measurements don't depend at
all on time sync between servers.

The specified feature depends explicitly on the conversion between master
and slave time spaces so it's inevitable that sync would be an issue. It
might be nice to print a warning on connection if the time is far out of
sync or periodically check. But I don't think reimplementing NTP is a good
idea.

[HACKERS] How to do fast performance timing

2013-12-09 Thread Benedikt Grundmann

At Jane Street we have recently spend a lot of time trying to get a fast
gettimeofday.  I saw lots of references in various postgres hacker threads
related to a lack of such a facility so 

The culmination of those efforts can be read here:

https://github.com/janestreet/core/blob/master/lib/time_stamp_counter.mli
and

https://github.com/janestreet/core/blob/master/lib/time_stamp_counter.ml

it's all OCaml but the code is mostly imperative and very well documented.
In particular we made an effort to document our assumption.  There are a
few which are ocaml specific.  But a lot of the lessons we have learned
here should be applicable to postgres.

Hope this will be useful,

Cheers,

Bene

PS: We are releasing our code under the Apache license so you should feel
free to reuse the ideas.

Re: [HACKERS] Recovery to backup point

2013-12-09 Thread MauMau


From: "Heikki Linnakangas" 
Thanks. Looks sane, although I don't much like the proposed interface to 
trigger this, setting recovery_target_time='backup_point'. What the code 
actually does is to stop recovery as soon as you reach consistency, which 
might not have anything to do with a backup. If you set it on a warm 
standby server, for example, it will end recovery as soon as it reaches 
consistency, but there was probably no backup taken at that point.


Thank you for reviewing so rapidly.  I thought I would check the end of 
backup in recoveryStopsHere(), by matching XLOG_BACKUP_END and 
ControlFile->backupStartPoint for backups taken on the primary, and 
comparing the current redo location with ControlFile->backupEndPoint for 
backups taken on the standby.  However, that would duplicate much code in 
XLOG_BACKUP_END redo processing and checkRecoveryConsistency().  Besides, 
the code works only when the user explicitly requests recovery to backup 
point, not when he starts the warm standby server.  (I wonder I'm answering 
correctly.)


Hmm. I guess it's a nice work-around to use this option, but it doesn't 
really solve the underlying issue. The system might well reach consistency 
between deleting database files and the transaction commit, in which case 
you still have the same problem.


Yes, you're right.  But I believe the trouble can be avoided most of the 
time.



It would be nice to have a more robust fix for that. Perhaps we could use 
the safe_restartpoint machinery we have to not allow recovery to end until 
we see the commit record. I was really hoping to get rid of that machinery 
in 9.4, though, as it won't be needed for GIN and B-tree after the patches 
I have in the current commitfest are committed.


In any case, that's a separate discussion and separate patch.


I think so, too.  That still seems a bit difficult for what I am now.  If 
someone starts a discussion in a separate thread, I'd like to join it.


Regards
MauMau



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [bug fix] pg_ctl always uses the same event source

2013-12-09 Thread MauMau


From: "Amit Kapila" 

1. isn't it better to handle as it is done in write_eventlog() which
means if string is empty then
   use PostgreSQL.
"evtHandle = RegisterEventSource(NULL, event_source ? event_source :
"PostgreSQL");"


Thank you for reviewing.  Yes, I did so with the first revision of this 
patch (see the first mail of this thread.)  I wanted to avoid duplicating 
the default value in both the server and pg_ctl code.  If user does not set 
event_source, postgres -C returns the default value "PostgreSQL" in the 
normal case, so I wanted to rely on it.  I thought the second revision would 
be appreciated by PostgreSQL-based products like EnterpriseDB, because there 
are fewer source files to modify.  But I don't mind which revision will be 
adopted.



2. What will happen if user doesn't change the name in "event_source"
or kept the same name,
   won't it hit the same problem again? So shouldn't it try to
generate different name by appending
   version string to it?


Yes, but I assume that the user has to set his own name to identify his 
instance uniquely.  Even if version string is added, the same issue can 
happen --- in the likely case where the user explicitly installs, for 
example, PostgreSQL 9.3 as a standalone database, as well as some packaged 
application that embeds PostgreSQL 9.3 which the user is unaware of.
If the user installs multiple versions of PostgreSQL explicitly with the 
community installer, the installer can set event_source = 'PostgreSQL 9.3' 
and 'PostgreSQL 9.4' for each instance.


Regards
MauMau



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Recovery to backup point


On 12/09/2013 02:03 PM, MauMau wrote:

From: "Michael Paquier" 

As far as I recall, I don't think so. The problem and the way to solve
that are clear. The only trick is to be sure that recovery is done
just until a consistent point is reached, and to implement that
cleanly.


May I implement this feature and submit a patch for the next
commitfest if I have time?

Please feel free. I might as well participate in the review.


I've done with the attached patch.


Thanks. Looks sane, although I don't much like the proposed interface to 
trigger this, setting recovery_target_time='backup_point'. What the code 
actually does is to stop recovery as soon as you reach consistency, 
which might not have anything to do with a backup. If you set it on a 
warm standby server, for example, it will end recovery as soon as it 
reaches consistency, but there was probably no backup taken at that point.



 I also confirmed that the problem I
raised in the first mail of the below thread was solved with this patch.

[bug fix] PITR corrupts the database cluster
http://www.postgresql.org/message-id/F93E42280A9A4A5EB74FC7350C801A20%40maumau


Hmm. I guess it's a nice work-around to use this option, but it doesn't 
really solve the underlying issue. The system might well reach 
consistency between deleting database files and the transaction commit, 
in which case you still have the same problem.


It would be nice to have a more robust fix for that. Perhaps we could 
use the safe_restartpoint machinery we have to not allow recovery to end 
until we see the commit record. I was really hoping to get rid of that 
machinery in 9.4, though, as it won't be needed for GIN and B-tree after 
the patches I have in the current commitfest are committed.


In any case, that's a separate discussion and separate patch.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Time-Delayed Standbys