Re: [HACKERS] Move unused buffers to freelist

2013-05-24 Thread Amit Kapila
On Thursday, May 23, 2013 8:45 PM Robert Haas wrote:
 On Tue, May 21, 2013 at 3:06 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
  Here are the results.  The first field in each line is the number of
  clients. The second number is the scale factor.  The numbers after
  master and patched are the median of three runs.
 
 but overall, on both the read-only and read-write tests, I'm not
 seeing anything that resembles the big gains you reported.
 
  I have not generated numbers for read-write tests; I will check that
  once.  For read-only tests, the performance increase is minor and
  different from what I saw.
  A few points I could think of for the difference in data:
 
  1. In my tests I always observed the best data when the number of
  clients/threads is equal to the number of cores, which in your case
  should be at 16.
 
 Sure, but you also showed substantial performance increases across a
 variety of connection counts, whereas I'm seeing basically no change
 at any connection count.
  2. I think for scale factors 100 and 300 there should not be much
  performance increase, as they should mostly get buffers from the
  freelist regardless of whether the bgwriter adds to it or not.
 
 I agree.
 
  3. In my tests the variable is shared_buffers; the database size is
  always less than RAM (scale factor 1200, approx. db size 16~17GB,
  RAM 24GB), but due to the variation in shared_buffers, it can lead
  to I/O.
 
 Not sure I understand this.

What I wanted to say is that all your tests were on the same shared
buffers configuration (8GB), whereas in my tests I was also trying to
vary shared_buffers.
However this is not an important point, as the patch should show a
performance gain on the configuration you ran, if there is any real
benefit to it.
 
  4. Each run is of 20 minutes; not sure if this makes any difference.
 
 I've found that 5-minute tests are normally adequate to identify
 performance changes on the pgbench SELECT-only workload.
 
  Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the
  PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel 3.2.6.
 
  To think about the difference between your runs and mine, could you
  please tell me about the points below:
  1. How much RAM does the machine have?
 
 64GB
 
  2. Is the number of threads equal to the number of clients?
 
 Yes.
 
  3. Before starting tests I have always done pre-warming of buffers
  (using pg_prewarm, written by you last year); is it the same for the
  above read-only tests?
 
 No, I did not use pg_prewarm.  But I don't think that should matter
 very much.  First, the data was all in the OS cache.  Second, on the
 small scale factors, everything should end up in cache pretty quickly
 anyway.  And on the large scale factors, well, you're going to be
 churning shared_buffers anyway, so pg_prewarm is only going to affect
 the very beginning of the test.
 
  4. Can you please once again run only the test where you saw
  variation (8 clients @ scale factor 1000 on master)?  I have also
  seen that the performance difference is very good for certain
  configurations (scale factor, RAM, shared buffers).
 
 I can do this if I get a chance, but I don't really see where that's
 going to get us.  It seems pretty clear to me that there's no benefit
 on these tests from this patch.  So either one of us is doing the
 benchmarking incorrectly, or there's some difference in our test
 environments that is significant, but none of the proposals you've
 made so far seem to me to explain the difference.

Sorry for requesting you to run it again without any concrete point.
After reading the data you posted more carefully, I realized that the
earlier reading was just some machine problem or something else, and
that there is actually no gain.
After your post, I tried various configurations on different machines,
but so far I have not been able to see the performance gain shown in my
initial mail.
In fact I tried on the same machine as well; it sometimes gives good
data.  I will update you if I find any concrete reason and results.

  Apart from the above, I had one more observation during my
  investigation into why, in some cases, there is a small dip:
  1. Many times, it finds that a buffer in the freelist is not usable,
  meaning its refcount or usage count is not zero, due to which it has
  to spend more time under BufFreelistLock.
  I have not done any further experiments related to this finding,
  e.g. whether it really adds any overhead.
 
  Currently I am trying to find the reasons for the small performance
  dip and see if I can do something to avoid it.  I will also run
  tests with various configurations.
 
  Any other suggestions?
 
 Well, I think that the code in SyncOneBuffer is not really optimal.
 In some cases you actually lock and unlock the buffer header an extra
 time, which seems like a whole lotta extra overhead.  In fact, I don't
 think you should be modifying SyncOneBuffer() at all, because that
 affects not only the background writer but also checkpoints.
 Presumably it is not right to put every unused 

Re: [HACKERS] Move unused buffers to freelist

2013-05-24 Thread Amit Kapila
On Friday, May 24, 2013 2:47 AM Jim Nasby wrote:
 On 5/14/13 2:13 PM, Greg Smith wrote:
  It is possible that we are told to put something in the freelist that
  is already in it; don't screw up the list if so.
 
  I don't see where the code does anything to handle that though.  What
 was your intention here?
 
 IIRC, the code that pulls from the freelist already deals with the
 possibility that a block was on the freelist but has since been put to
 use. 

You are right; the check exists in StrategyGetBuffer().

If that's the case then there shouldn't be much penalty to adding
 a block multiple times (at least within reason...)

There is a check in StrategyFreeBuffer() which will not allow a buffer
to be put on the list multiple times; I had just used the same check in
the new function.

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to .gitignore

2013-05-24 Thread amul sul
 ctags and etags are part of the postgres source tree and they generate
 some output inside it, so I think we must ignore them.

+1


Regards,
Amul Sul




Re: [HACKERS] WARNING : pgstat wait timeout - Postgres 9.1

2013-05-24 Thread Mathieu Guerin
Hello,

Thanks a lot for your answers.

 You should get it...
 stats_temp_directory | pg_stat_tmp | Writes temporary statistics files
 to the specified directory.

I don't know why I don't get it.  I am on version 9.1...

Moreover, when I mount pg_stat_tmp on a tmpfs, the warning messages
decrease from one per minute to one every five seconds.  I don't have
any other warnings in the logs, but the file pg_stat.stat is not
created in the mount point...  I tried this before on a test
environment and it worked...

If you have any ideas...
Thanks a lot.
Regards,
Math



2013/5/24 Michael Paquier michael.paqu...@gmail.com




 On Thu, May 23, 2013 at 9:31 PM, Mathieu Guerin 
 mathieu.gueri...@gmail.com wrote:

 What are the consequences?  Because this file will be removed if the
 server reboots.

 Those temporary statistics are stored in the global directory when the
 server shuts down, so the risk here would be to lose a portion of this
 data in the case of a crash, either at the PG or at the OS level.


 If we change the parameter stats_temp_directory is it necessary to
 reboot the server ?

 No, sending SIGHUP to the server is enough.


 When I launch a SHOW ALL; command, the parameter stats_temp_directory
 is not there.

 You should get it...
 stats_temp_directory | pg_stat_tmp | Writes temporary statistics files
 to the specified directory.
  --
 Michael



Re: [HACKERS] WARNING : pgstat wait timeout - Postgres 9.1

2013-05-24 Thread Mathieu Guerin
About the stats_temp_directory: I didn't run it as root...
Now I'm sure the configurations are correct.

I think I have too much I/O to use stats; I will keep getting this
message...  Maybe I can disable this option.
Do you know what it really impacts?

Thanks.
Math









Re: [HACKERS] Removal of pageinspect--1.0.sql

2013-05-24 Thread Heikki Linnakangas

On 20.05.2013 19:50, Michael Paquier wrote:

The contrib module pageinspect has been upgraded to 1.1, but
pageinspect--1.0.sql is still present in the source code.  Shouldn't it
be removed?  Please find a patch attached.


Yep. Removed, thanks.

- Heikki




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread German Becker
Hi Sergey,

Maybe I didn't explain correctly.  I am using COPY/pg_dump/pg_restore
for the migration (and it is working fine).  The streaming replication
is for hot-standby replication *once migrated*.  The thing is, I
disable archiving and set wal_level to minimal when migrating the large
portion of data, to make it faster.  Then I switch to
wal_level=hot_standby, i.e. the production configuration, and the WAL
segment sequence seems to overlap with the segments generated under the
other setting.


On Thu, May 23, 2013 at 7:44 PM, Sergey Konoplev gray...@gmail.com wrote:

 On Thu, May 23, 2013 at 6:18 AM, German Becker german.bec...@gmail.com
 wrote:
  Let me describe the process I follow to get to this.  What I am
  doing is testing a migration from 8.3 to 9.1.  The way I plan to do
  it is the following:
  1) Create the schema
  2) Import the biggest tables, which are not updated, only growing,
  with COPY (this is about 35GB of data)
  3) Import the small, changing part of the data
 
  The target system is 9.1 with streaming replication.
  For steps 1 and 2, I set a restore configuration that, among other
  things like more work_mem, sets archive_mode=off and
  wal_level=minimal (attached is the difference between restore and
  normal).
  The archive_command is just a cp wrapped in a shell script in case I
  need to change it.

 You can not migrate between any major versions with WAL based or
 streaming replication.

 Use either full dump/restore or schema only dump/restore plus trigger
 based replication (londiste, slony) to migrate data.

 --
 Kind regards,
 Sergey Konoplev
 PostgreSQL Consultant and DBA

 Profile: http://www.linkedin.com/in/grayhemp
 Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
 Skype: gray-hemp
 Jabber: gray...@gmail.com



Re: [HACKERS] Patch to .gitignore

2013-05-24 Thread Tom Lane
Fabrízio de Royes Mello fabriziome...@gmail.com writes:
 On Fri, May 24, 2013 at 12:04 AM, Christopher Browne 
 cbbro...@gmail.comwrote:
 There hasn't been general agreement on the merits of particular .gitignore
 rules of this sort.

 I agree with you about vim-oriented patterns, because it's a
 particular tool, but ctags and etags are part of the postgres source
 tree and they generate some output inside it, so I think we must
 ignore them.

[ shrug... ]  Editor backup files get generated inside the source tree
as well.  Chris stated the policy accurately: if you use tools that
leave unexpected files in the source tree, it's up to you to have a
personal .gitignore for those, assuming you want them ignored.  The
project's .gitignore files are only supposed to list files that get
generated by the standard build processes.

FWIW, my personal .gitexclude file looks like

*~
*.orig

where the latter one is for junk generated by patch.  I believe
patch's extension for saved files varies across versions, so it
wouldn't be too sensible to have an exclusion like that in the
project-wide file.  Note also that I intentionally *don't* have an
exclusion for *.rej --- if any patch hunk failed, I want git to
mention it.  But that's a matter of personal preference.  I rather
imagine that other people configure it differently, and that's fine.
As long as we don't try to put such things in the project-wide
exclusion list, we don't have to have a consensus about it.

regards, tom lane




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread Amit Langote
 Maybe I didn't explain correctly.  I am using COPY/pg_dump/pg_restore
 for the migration (and it is working fine).  The streaming replication
 is for hot-standby replication *once migrated*.  The thing is, I
 disable archiving and set wal_level to minimal when migrating the
 large portion of data, to make it faster.  Then I switch to
 wal_level=hot_standby, i.e. the production configuration, and the WAL
 segment sequence seems to overlap with the segments generated under
 the other setting.


Though, now you understand it's not what it looks like, right? :-)


--
Amit Langote




Re: [HACKERS] Cost limited statements RFC

2013-05-24 Thread Robert Haas
On Thu, May 23, 2013 at 7:27 PM, Greg Smith g...@2ndquadrant.com wrote:
 I'm working on a new project here that I wanted to announce, just to keep
 from duplicating effort in this area.  I've started to add a cost limit
 delay for regular statements.  The idea is that you set a new
 statement_cost_delay setting before running something, and it will restrict
 total resources the same way autovacuum does.  I'll be happy with it when
 it's good enough to throttle I/O on SELECT and CREATE INDEX CONCURRENTLY.

Cool.  We have an outstanding customer request for this type of
functionality; although in that case, I think the desire is more along
the lines of being able to throttle writes rather than reads.

But I wonder if we wouldn't be better off coming up with a little more
user-friendly API.  Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit = "amount of data you can dirty per
second, in kB" and read_rate_limit = "amount of data you can read into
shared buffers per second, in kB".  This is less powerful than what we
currently offer for autovacuum, which allows you to come up with a
blended measure of when vacuum has done too much work, but I don't
have a lot of confidence that it's better in practice.

 Modifying the buffer manager to account for statement-based cost
 accumulation isn't difficult.  The tricky part here is finding the right
 spot to put the delay at.  In the vacuum case, it's easy to insert a call to
 check for a delay after every block of I/O.  It should be possible to find a
 single or small number of spots to put a delay check in the executor.  But I
 expect that every utility command may need to be modified individually to
 find a useful delay point.  This is starting to remind me of the SEPostgres
 refactoring, because all of the per-command uniqueness ends up requiring a
 lot of work to modify in a unified way.

I haven't looked at this in detail, but I would hope it's not that
bad.  For one thing, many DDL commands don't do any significant I/O in
the first place and so can probably be disregarded.  Those that do are
mostly things that rewrite the table and things that build indexes.  I
doubt there are more than 3 or 4 code paths to patch.

 The main unintended consequences issue I've found so far is when a cost
 delayed statement holds a heavy lock.  Autovacuum has some protection
 against letting processes with an exclusive lock on a table go to sleep.  It
 won't be easy to do that with arbitrary statements.  There's a certain
 amount of allowing the user to shoot themselves in the foot here that will
 be time consuming (if not impossible) to eliminate.  The person who runs an
 exclusive CLUSTER that's limited by statement_cost_delay may suffer from
 holding the lock too long.  But that might be their intention with setting
 the value.  Hard to idiot proof this without eliminating useful options too.

Well, we *could* have a system where, if someone blocks waiting for a
lock held by a rate-limited process, the rate limits are raised or
abolished.  But I'm pretty sure that's a bad idea.  I think that the
people who want rate limits want them because allowing too much write
(or maybe read?) activity hoses the performance of the entire system,
and that's not going to be any less true if there are multiple jobs
piling up.  Let's say someone has a giant COPY into a huge table, and
CLUSTER blocks behind it, waiting for AccessExclusiveLock.  Well...
making the COPY run faster so that we can hurry up and start
CLUSTER-ing seems pretty clearly wrong.  We want the COPY to run
slower, and we want the CLUSTER to run slower, too.  If we don't want
that, then, as you say, we shouldn't set the GUC in the first place.

Long story short, I'm inclined to define this as expected behavior.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] background processes vs. hot standby

2013-05-24 Thread Robert Haas
CheckRequiredParameterValues() has some code that, when hot standby is
in use, checks the values of max_connections,
max_prepared_transactions, and max_locks_per_transaction against the
master.  The comment says "we must have at least as many backend
slots as the primary" ... but the code no longer enforces that,
because we now compute MaxBackends like this:

MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
GetNumShmemAttachedBgworkers();

If GetNumShmemAttachedBgworkers() returns a lower value on the standby
than it did on the master, then we might well have fewer backend slots
on the standby.  I'm having trouble remembering why it's a problem to
have fewer backend slots on the standby than the master, but if we
need to prevent that then this code is no longer adequate to the task.

The comment doesn't explain why we check max_locks_per_transaction.  I
thought the reason for that check was that we needed to ensure that
there were at least as many lock table slots on the standby as there
were on the master, to prevent bad things from happening later.  That
was already not true, since the existing code didn't enforce any
limitation on autovacuum_max_workers on the standby side.  Maybe that
doesn't matter, since autovacuum workers can't run in hot standby
mode; not sure.  But the addition of background workers to MaxBackends
provides another way for that to be not true.  Here's how we compute
the size of the lock table:

#define NLOCKENTS() \
mul_size(max_locks_per_xact, add_size(MaxBackends, max_prepared_xacts))

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] background processes vs. hot standby

2013-05-24 Thread Andres Freund
Hi,

On 2013-05-24 09:48:03 -0400, Robert Haas wrote:
 CheckRequiredParameterValues() has some code that, when hot standby is
 in use, checks the values of max_connections,
 max_prepared_transactions, and max_locks_per_transaction against the
 master.   The comment says we must have at least as many backend
 slots as the primary ... but the code no longer enforces that,
 because we now compute MaxBackends like this:
 
 MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
 GetNumShmemAttachedBgworkers();
 
 If GetNumShmemAttachedBgworkers() returns a lower value on the standby
 than it did on the master, then we might well have fewer backend slots
 on the standby.  I'm having trouble remembering why it's a problem to
 have fewer backend slots on the standby than the master, but if we
 need to prevent that then this code is no longer adequate to the task.

It's AFAIR important because we need to allocate shared memory which
can keep track of the maximum number of xids (toplevel *
max_non_suboverflowed_subxids) in progress.  That's the
KnownAssignedXids* stuff in procarray.c.

 The comment doesn't explain why we check max_locks_per_transaction.  I
 thought the reason for that check was that we needed to ensure that
 there were at least as many lock table slots on the standby as there
 were on the master, to prevent bad things from happening later.  That
 was already not true, since the existing code didn't enforce any
 limitation on autovacuum_max_workers on the standby side.  Maybe that
 doesn't matter, since autovacuum workers can't run in hot standby
 mode; not sure.  But the addition of background workers to MaxBackends
 provides another way for that to be not true.  Here's how we compute
 the size of the lock table:

Yea, we need it exactly for that reason.  I think it's unlikely to
cause actual problems since we only ship access exclusive locks to the
standby, and it's hard to see scenarios where we have that many AELs on
the primary.  But we probably should fix it anyway.

I think fixing the autovacuum_max_workers case is entirely reasonable
and relatively unlikely to cause problems. I don't think we can easily
do it in a minor release though since I don't see a way to transport
knowledge about it via the WAL without breaking either the WAL format
entirely or change the meaning of MaxConnections in ControlFile which
would cause problems with upgrading the primary first.

I am less excited about doing something similar for the background
worker case.  Requiring just as many background workers on the standby
sounds like a bad idea to me; there seem to be too many cases where
that doesn't make sense.
I wonder if we shouldn't make background workers use connections slots
from max_connections similar to how superuser_reserved_connections
work. That would mean we don't need to care about it for HS.

Greetings,

Andres Freund




Re: [HACKERS] Cost limited statements RFC

2013-05-24 Thread Jim Nasby

On 5/24/13 8:21 AM, Robert Haas wrote:

On Thu, May 23, 2013 at 7:27 PM, Greg Smith g...@2ndquadrant.com wrote:

I'm working on a new project here that I wanted to announce, just to keep
from duplicating effort in this area.  I've started to add a cost limit
delay for regular statements.  The idea is that you set a new
statement_cost_delay setting before running something, and it will restrict
total resources the same way autovacuum does.  I'll be happy with it when
it's good enough to throttle I/O on SELECT and CREATE INDEX CONCURRENTLY.

Cool.  We have an outstanding customer request for this type of
functionality; although in that case, I think the desire is more along
the lines of being able to throttle writes rather than reads.

But I wonder if we wouldn't be better off coming up with a little more
user-friendly API.  Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit = "amount of data you can dirty per
second, in kB" and read_rate_limit = "amount of data you can read into
shared buffers per second, in kB".  This is less powerful than what we
currently offer for autovacuum, which allows you to come up with a
blended measure of when vacuum has done too much work, but I don't
have a lot of confidence that it's better in practice.


Doesn't that hit the old issue of not knowing if a read came from FS cache or 
disk? I realize that the current cost_delay mechanism suffers from that too, 
but since the API is lower level that restriction is much more apparent.

Instead of KB/s, could we look at how much time one process is spending waiting 
on IO vs the rest of the cluster? Is it reasonable for us to measure IO wait 
time for every request, at least on the most popular OSes?
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Move unused buffers to freelist

2013-05-24 Thread Jim Nasby

On 5/14/13 8:42 AM, Amit Kapila wrote:

In the attached patch, bgwriter/checkpointer moves unused (usage_count
= 0 and refcount = 0) buffers to the end of the freelist.  I have
implemented a new API, StrategyMoveBufferToFreeListEnd(), to move
buffers to the end of the freelist.



Instead of a separate function, would it be better to add an argument to 
StrategyFreeBuffer? ISTM this is similar to the other strategy stuff in the 
buffer manager, so perhaps it should mirror that...
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Andres Freund
On 2013-05-23 22:09:02 -0400, Robert Haas wrote:
 On Thu, May 23, 2013 at 1:51 PM, Andres Freund and...@2ndquadrant.com wrote:
  So, what I propose instead is basically:
  1) only vacuum non-all-visible pages, even when doing it for
 anti-wraparound

 Check.  We might want an option to force a scan of the whole relation.

Yea, thought of that as well. VACUUM (DEEP) ;).

  3) When we cannot mark a page all-visible or we cannot get the cleanup
 lock, remember the oldest xmin on that page. We could set all visible
 in the former case, but we want the page to be cleaned up sometime
 soonish.

 I think you mean in the latter case not in the former case.  If
 not, then I'm confused.

Uh. Yes.

  We don't even necessarily need to log the hint bits for all items since
  the redo for all_visible could make sure all items are hinted. The only
  problem is knowing up to where we can truncate pg_clog...

 [all-visible cannot restore hint bits without FPI because of torn pages]

I haven't yet thought about this sufficiently yet. I think we might have
a chance of working around this, let me ponder a bit.

But even if that means needing a full-page write via the usual
mechanism for all-visible when any hint bits needed to be set, we are
still far ahead of the current state, IMO:
* cleanup would quite possibly do an FPI shortly afterwards in vacuum
  anyway.  If we do it for all-visible, it possibly does not need to
  be done for that.
* freezing would almost certainly FPI since we do it so much later.
* Not having to rescan the whole heap will be a bigger cost saving...

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-24 Thread Thom Brown
On 23 May 2013 10:03, Thom Brown t...@linux.com wrote:
 On 23 May 2013 07:10, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 Hi,

 I've been hacking on a tool to allow resynchronizing an old master server
 after failover. The need to do a full backup/restore has been a common
 complaint ever since we've had streaming replication. I saw on the wiki that
 this was discussed in the dev meeting; too bad I couldn't make it.

 In a nutshell, the idea is to copy everything that has changed between
 the clusters, like rsync does, but instead of reading through all
 files, use the WAL to determine what has changed.  Here's a somewhat
 more detailed explanation, from the README:

 Theory of operation
 ---

 The basic idea is to copy everything from the new cluster to old, except for
 the blocks that we know to be the same.

 1. Scan the WAL log of the old cluster, starting from the point where
 the new cluster's timeline history forked off from the old cluster. For each
 WAL record, make a note of the data blocks that are touched. This yields a
 list of all the data blocks that were changed in the old cluster, after the
 new cluster forked off.

 2. Copy all those changed blocks from the new master to the old master.

 3. Copy all other files like clog, conf files etc. from the new cluster
 to old. Everything except the relation files.

 4. Apply the WAL from the new master, starting from the checkpoint
 created at failover. (pg_rewind doesn't actually apply the WAL, it just
 creates a backup label file indicating that when PostgreSQL is started, it
 will start replay from that checkpoint and apply all the required WAL)


 Please take a look: https://github.com/vmware/pg_rewind

 6 instances set up:

 [Primary (5530)]
 |
 ---[Standby 1 (5531)]
 |
 ---[Standby 2 (5532)]
 |
 ---[Standby 3 (5533)]
 |
 ---[Standby 4 (5534)]
 |
 ---[Standby 5 (5535)]

 1) Created a table on the primary with some data.
 2) Promoted Standby 1
 3) Cleanly shut down Primary
 4) pg_rewind --target-pgdata=/tmp/primary
 --source-server='host=localhost port=5531 dbname=postgres'

 Last common WAL position: 0/30227F8 on timeline 1
 Last common checkpoint at 0/30227F8 on timeline 1
 error reading xlog record: record with zero length at 0/3022860
 Done!

 Contents of pg_xlog directory in Primary and Standby 1:
 thom@swift /tmp $ ls -l primary/pg_xlog/
 total 49156
 -rw--- 1 thom users 16777216 May 23 09:52 00010002
 -rw--- 1 thom users 16777216 May 23 09:52 00010003
 -rw--- 1 thom users 16777216 May 23 09:52 00020003
 -rw--- 1 thom users   41 May 23 09:52 0002.history
 drwx-- 2 thom users   80 May 23 09:52 archive_status
 thom@swift /tmp $ ls -l standby1/pg_xlog/
 total 49156
 -rw--- 1 thom users 16777216 May 23 09:49 00010002
 -rw--- 1 thom users 16777216 May 23 09:50 00010003
 -rw--- 1 thom users 16777216 May 23 09:52 00020003
 -rw--- 1 thom users   41 May 23 09:50 0002.history
 drwx-- 2 thom users   80 May 23 09:50 archive_status

 5) Changed recovery.done in primary to point its primary_conninfo port
 to 5531 (that of Standby 1).
 6) Renamed it to .conf.
 7) Changed postgresql.conf to set the port back to its original one
 (as pg_rewind has caused it to match that of Standby 1)
 8) Start Primary

 Latest log in primary reads:

 LOG:  database system was interrupted; last known up at 2013-05-23 09:50:34 
 EDT
 LOG:  entering standby mode
 LOG:  invalid xl_info in checkpoint record
 FATAL:  could not locate required checkpoint record
 HINT:  If you are not restoring from a backup, try removing the file
 /tmp/primary/backup_label.
 LOG:  startup process (PID 31503) exited with exit code 1
 LOG:  aborting startup due to startup process failure

 9) Okay, so I'll delete that label and try again.  Now all is well:

 LOG:  database system was interrupted; last known up at 2013-05-23 09:50:34 
 EDT
 LOG:  entering standby mode
 LOG:  database system was not properly shut down; automatic recovery in 
 progress
 LOG:  redo starts at 0/3022828
 LOG:  record with zero length at 0/3041A60
 LOG:  consistent recovery state reached at 0/3041A60
 LOG:  database system is ready to accept read only connections
 LOG:  started streaming WAL from primary at 0/300 on timeline 2

 10) Connect to Standby 1 and insert more rows into our original table.
 11) Connect to Primary and those rows are appearing.

By the way, without any data inserted I get:

thom@swift /tmp $ pg_rewind --target-pgdata=/tmp/primary
--source-server='host=localhost port=5531 dbname=postgres' -v
connected to remote server
fetched file global/pg_control, length 8192
fetched file pg_xlog/0002.history, length 41
Last common WAL position: 0/300 on timeline 1
pg_rewind: xlogreader.c:214: XLogReadRecord: 

Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-24 Thread Andres Freund
On 2013-05-24 10:57:22 -0400, Thom Brown wrote:
 By the way, without any data inserted I get:
 
 thom@swift /tmp $ pg_rewind --target-pgdata=/tmp/primary
 --source-server='host=localhost port=5531 dbname=postgres' -v
 connected to remote server
 fetched file global/pg_control, length 8192
 fetched file pg_xlog/0002.history, length 41
 Last common WAL position: 0/300 on timeline 1
 pg_rewind: xlogreader.c:214: XLogReadRecord: Assertion `((RecPtr) %
 8192 >= (((intptr_t) ((sizeof(XLogPageHeaderData))) + ((8) - 1)) &
 ~((intptr_t) ((8) - 1))))' failed.
 Aborted (core dumped)
 
 And this magically goes away with -k on initdb.

That looks like a bug independent of -k. pg_rewind tries to read
the beginning of a page but xlogreader doesn't allow that atm. The
likely reason this doesn't happen with -k is that autovacuum probably hints
some pages, which are logged when checksums are enabled.

Should we make that assert
Assert((RecPtr % XLOG_BLCKSZ == 0) || XRecOffIsValid(RecPtr));
instead?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements

2013-05-24 Thread Fabrízio de Royes Mello
Hi all,

I'm working on a patch to add support for IF NOT EXISTS to the CREATE
statements that don't have it yet.

I started with DefineStmt section from src/backend/parser/gram.y:
- CREATE AGGREGATE [ IF NOT EXISTS ] ...
- CREATE OPERATOR [ IF NOT EXISTS ] ...
- CREATE TYPE [ IF NOT EXISTS ] ... [AS [{ENUM | RANGE}] (...)]
- CREATE TEXT SEARCH {PARSER | DICTIONARY | TEMPLATE | CONFIGURATION} [ IF
NOT EXISTS ] ...
- CREATE COLLATION [ IF NOT EXISTS ] ...

My intention is to cover other CREATE statements too, not just the ones above.

If there are no objections to this approach, I'll finish the patch and
send it soon.

Regards,

-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Blog sobre TI: http://fabriziomello.blogspot.com
 Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello


Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-05-24 Thread Jim Nasby

On 5/23/13 12:51 PM, Pavan Deolasee wrote:




On Thu, May 23, 2013 at 11:10 PM, Heikki Linnakangas hlinnakan...@vmware.com 
wrote:

On 23.05.2013 07:55, Robert Haas wrote:

On Thu, May 23, 2013 at 7:10 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

1. Scan the WAL log of the old cluster, starting from the point 
where
the new cluster's timeline history forked off from the old cluster. 
For each
WAL record, make a note of the data blocks that are touched. This 
yields a
list of all the data blocks that were changed in the old cluster, 
after the
new cluster forked off.


Suppose that a transaction is open and has written tuples at the point
where WAL forks.  After WAL forks, the transaction commits.  Then, it
hints some of the tuples that it wrote.  There is no record in WAL
that those blocks are changed, but failing to revert them leads to
data corruption.


Bummer, you're right. Hmm, if you have checksums enabled, however, we'll 
WAL log a full-page every time a page is dirtied for setting a hint bit, which 
fixes the problem. So, there's a caveat with pg_rewind; you must have checksums 
enabled.


I was quite impressed with the idea, but hint bits are indeed a problem. I 
realised the same issue also applies to the other idea that Fujii-san and 
others have suggested, of waiting for dirty buffers to be written until the 
WAL is received at the standby. But since that idea would anyway need to be 
implemented in core, we could teach SetHintBits() to return false unless 
the corresponding commit WAL records have been written to the standby first.


Would it be useful to turn this problem around? Heikki's proposal is based on being able 
to track (without fail) all blocks that have been modified; could we instead track blocks 
that we know for certain have NOT been modified? The difference there is that we can be 
more conservative in stating we know "this block is the same"; worst case we 
just do some extra copying.

thinking out loud...
One possibility would be to use file timestamps. For files that are past a 
certain age on both master and slave, if we force the timestamp on the slave to 
match the timestamp from the master, rsync will be able to safely ignore that 
file. I realize that's not as good as block-level detection, but it's probably 
a tremendous improvement over what we have today. The critical thing in this 
case would be to *guarantee* that the timestamps did not match on modified 
files.

Of course, screwing around with FS timestamps in this manner is pretty grotty, 
at least on a live system. Perhaps there's some way to track that info 
separately and then use it to change file timestamps before running rsync. Or 
if we are able to define a list of files that we think may have changed, we 
just feed that list to rsync and let it do the heavy lifting.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] background processes vs. hot standby

2013-05-24 Thread Alvaro Herrera
Andres Freund wrote:

 I wonder if we shouldn't make background workers use connections slots
 from max_connections similar to how superuser_reserved_connections
 work. That would mean we don't need to care about it for HS.

I remember considering this and concluding that it's messy.  Suppose we
decide that the registered bgworker number would be subtracted from
max_connections: if the configuration registers as many bgworkers as
max_connections, then no client connections can take place; if there are
more bgworkers than max_connections, there's going to be errors at
startup because the last few bgworkers cannot start at all (and no
client connections will be allowed).  So users would be forced to change
max_connections according to bgworkers configuration.  That doesn't
sound friendly.

If, instead of subtracting bgworkers from max_connections, we were to
add the number of bgworkers to max_connections, then we're no better
than currently, because the number of bgworkers from the standby would
be different from those in the master, and we'd be back into the problem
of how to ensure that the allowed number of locks meets the restriction.
If you wanted to have more bgworkers in the master than the standby,
you'd have to advise users to increase max_connections in the standby to
fulfill the restriction.

(I currently have no proposal on how to go about solving this problem.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread German Becker
On Fri, May 24, 2013 at 10:01 AM, Amit Langote amitlangot...@gmail.comwrote:

  Maybe I didn't explain correctly. I am using COPY/pg_dump/pg_restore for
  migration (and it is working fine). The streaming replication is for
  hot-standby replication *once migrated*. The thing is, I disable archiving and
  set wal_level to minimal when migrating the large portion of data, to make
  it faster. Then I switch to wal_level=hot_standby, i.e. the production
  configuration, and the WAL segment sequence seems to overlap with the
  segments generated with the other setting.
 

 Though, now you understand it's not what it looks like, right? :-)


 --
 Amit Langote


I didn't quite understand what you meant by that... But anyway, do you
think this sequence-number overlap is normal?


Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Robert Haas
On Fri, May 24, 2013 at 10:53 AM, Andres Freund and...@2ndquadrant.com wrote:
 [all-visible cannot restore hint bits without FPI because of torn pages]

 I haven't yet thought about this sufficiently yet. I think we might have
 a chance of working around this, let me ponder a bit.

Yeah.  I too feel like there might be a solution.  But I don't have
anything specific in mind, yet anyway.

 But even if that means needing a full page write via the usual mechanism
 for all visible if any hint bits needed to be set we are still out far
 ahead of the current state imo.
 * cleanup would quite possibly do an FPI shortly after in vacuum
   anyway. If we do it for all visible, it possibly does not need to be
   done for it.
 * freezing would almost certainly FPI since we do it so much
   later.
 * Not having to rescan the whole heap will be a bigger cost saving...

The basic problem is that if the data is going to be removed before it
would have gotten frozen, then the extra FPIs are just overhead.  In
effect, we're just deciding to freeze a lot sooner.  And while that
might well be beneficial in some use cases (e.g. the data's already in
cache) it might also not be so beneficial (the table is larger than
cache and would have been dropped before freezing kicked in).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread Amit Langote
 I didn't quite understand what you mean by that... But anyways so do you
 people think this sequence number overlap is normal ?

There is no overlap at all. The newer segments that you see are
pre-allocated ones. They have not been written to yet.

From the ls -l pg_xlog output that you sent, it can be seen that
segments starting from 000000010000000E000000A8 through
00000001000000100000007E have been pre-allocated (at that point of
time) and 000000010000000E000000A7 is currently being written to. Just
look at the modified times in your ls -l listing.
000000010000000E000000A7 has May 22 15:32 (the latest writes seem to
have happened to this segment) whereas the pre-allocated ones have times
around May 22 12:05 to 12:15 (they are yet to be written to).

Does that help?

--
Amit Langote




Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Andres Freund
On 2013-05-24 11:29:10 -0400, Robert Haas wrote:
  But even if that means needing a full page write via the usual mechanism
  for all visible if any hint bits needed to be set we are still out far
  ahead of the current state imo.
  * cleanup would quite possibly do an FPI shortly after in vacuum
anyway. If we do it for all visible, it possibly does not need to be
done for it.
  * freezing would almost certainly FPI since we do it so much
later.
  * Not having to rescan the whole heap will be a bigger cost saving...
 
 The basic problem is that if the data is going to be removed before it
 would have gotten frozen, then the extra FPIs are just overhead.  In
 effect, we're just deciding to freeze a lot sooner.

Well, freezing without removing information for debugging.

 And while that
 might well be beneficial in some use cases (e.g. the data's already in
 cache) it might also not be so beneficial (the table is larger than
 cache and would have been dropped before freezing kicked in).

Not sure how caching comes into play here? At this point we know the
page to be in cache already since vacuum is looking at it anyway?

I think it's not really comparable since in those situations we a)
already do an XLogInsert() and b) already dirty the page, so the only
change is that we possibly write an additional full-page image. If
there is actually near-future DML write activity that would make the
all-visible bit superfluous, that would likely have to FPI anyway.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Robert Haas
On Fri, May 24, 2013 at 11:29 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, May 24, 2013 at 10:53 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 [all-visible cannot restore hint bits without FPI because of torn pages]

 I haven't yet thought about this sufficiently yet. I think we might have
 a chance of working around this, let me ponder a bit.

 Yeah.  I too feel like there might be a solution.  But I don't have
 anything specific in mind, yet anyway.

One thought I had is that it might be beneficial to freeze when a page
ceases to be all-visible, rather than when it becomes all-visible.
Any operation that makes the page not-all-visible is going to emit an
FPI anyway, so we don't have to worry about torn pages in that case.
Under such a scheme, we'd have to enforce the rule that xmin and xmax
are ignored for any page that is all-visible; and when a page ceases
to be all-visible, we have to go back and really freeze the
pre-existing tuples.  I think we might be able to use the existing
all_visible_cleared/new_all_visible_cleared flags to trigger this
behavior, without adding anything new to WAL at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Robert Haas
On Fri, May 24, 2013 at 11:52 AM, Andres Freund and...@2ndquadrant.com wrote:
 The basic problem is that if the data is going to be removed before it
 would have gotten frozen, then the extra FPIs are just overhead.  In
 effect, we're just deciding to freeze a lot sooner.

 Well, freezing without removing information for debugging.

Sure, but what I'm trying to avoid is incurring the WAL cost of
freezing.  If we didn't mind paying that sooner, we could just drop
vacuum_freeze_min/table_age.  But we do mind that.

 And while that
 might well be beneficial in some use cases (e.g. the data's already in
 cache) it might also not be so beneficial (the table is larger than
 cache and would have been dropped before freezing kicked in).

 Not sure how caching comes into play here? At this point we know the
 page to be in cache already since vacuum is looking at it anyway?

OK, true.

 I think it's not really comparable since in those situations we a)
 already do an XLogInsert() and b) already dirty the page, so the only
 change is that we possibly write an additional full-page image. If
 there is actually near-future DML write activity that would make the
 all-visible bit superfluous, that would likely have to FPI anyway.

Well, if there's near-future write activity, then freezing is pretty
worthless anyway.  What I'm trying to avoid is adding WAL overhead in
the case where there *isn't* any near-future write activity, like
inserting 100MB of data into an existing table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] background processes vs. hot standby

2013-05-24 Thread Robert Haas
On Fri, May 24, 2013 at 11:25 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Andres Freund wrote:
 I wonder if we shouldn't make background workers use connections slots
 from max_connections similar to how superuser_reserved_connections
 work. That would mean we don't need to care about it for HS.

 I remember considering this and concluding that it's messy.  Suppose we
 decide that the registered bgworker number would be subtracted from
 max_connections: if the configuration registers as many bgworkers as
 max_connections, then no client connections can take place; if there are
 more bgworkers than max_connections, there's going to be errors at
 startup because the last few bgworkers cannot start at all (and no
 client connections will be allowed).  So users would be forced to change
 max_connections according to bgworkers configuration.  That doesn't
 sound friendly.

I agree.  To put that more succinctly, if we take that approach, then
max_connections is no longer the maximum number of connections, which
is a POLA violation.

 (I currently have no proposal on how to go about solving this problem.)

If the problem were only with the size of the lock table, I'd be
somewhat inclined to propose ripping out max_locks_per_transaction and
putting in a GUC called max_locks instead.  The current system seems
more confusing than helpful; when the default proves insufficient, the
recourse is usually to figure out how many objects we need to lock,
and then divide by max_connections to figure out how to set
max_locks_per_transaction, which is really backing into the problem
from the wrong end.

But I don't know what to do about the problem of needing to know how
many backends there are.  I agree with Andres that it's not very
friendly to enforce a restriction that all the same worker processes
must be present on the standby.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Jim Nasby

On 5/24/13 9:53 AM, Andres Freund wrote:

We don't even necessarily need to log the hint bits for all items since
 the redo for all_visible could make sure all items are hinted. The only
 problem is knowing up to where we can truncate pg_clog...

[all-visible cannot restore hint bits without FPI because of torn pages]

I haven't yet thought about this sufficiently yet. I think we might have
a chance of working around this, let me ponder a bit.

But even if that means needing a full page write via the usual mechanism
for all visible if any hint bits needed to be set we are still out far
ahead of the current state imo.
* cleanup would quite possibly do an FPI shortly after in vacuum
   anyway. If we do it for all visible, it possibly does not need to be
   done for it.
* freezing would almost certainly FPI since we do it so much
   later.
* Not having to rescan the whole heap will be a bigger cost saving...


Would we only set all the hint bits within vacuum? If so I don't think the WAL 
hit matters at all, because vacuum is almost always a background, throttled 
process.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] PGCON meetup FreeNAS/FreeBSD: In Ottawa Tue Wed.

2013-05-24 Thread Jim Nasby

I'm moving this to -advocacy, as it seems more appropriate there...

On 5/20/13 10:31 AM, Alfred Perlstein wrote:

Hello PostgreSQL Hackers,

I am now in Ottawa, last week we wrapped up the BSDCon and I was hoping to chat 
with a few Postgresql developers in person about using Postgresql in FreeNAS 
and offering it as an extension to the platform as a plug-in technology.  
Unfortunately due to time constraints I can not attend the entire conference 
and I am only in town until Wednesday at noon.

I'm hoping there's a good time to talk to a few developers about Postgresql + 
FreeNAS before I have to depart back to the bay area.


Were you able to meet up with anyone? If not you should attend the SF Postgres 
Users Group meeting.


Some info on me:  My name is Alfred Perlstein, I am a FreeBSD developer and 
FreeNAS project lead.  I am the VP of Software Engineering at iXsystems.  I 
have been a fan of Postgresql for many years.  In the early 2000s we built a 
high-speed web tracking application on top of Postgresql and worked closely 
with the community to shake out performance issues and bugs, so closely that Tom Lane 
and Vadim Mikheev had logins on our box.  Since that time I have tried to 
get Postgresql into as many places as possible.

Some info on the topics I wanted to briefly discuss:

1) Using Postgresql as the config store for FreeNAS.
We currently use SQLITE, SQLITE fits our needs until we get to the point of 
replication between HA (high availability) units.  Then we are forced to 
manually sync data between configurations.  A discussion on how we might do 
this better using Postgresql, while still maintaining our ease of config export 
(single file) and small footprint would be interesting.


Probably best to discuss in person at SFPUG... :)


2) Postgresql plugin for FreeNAS.
Flip a switch and suddenly your file server is also serving enterprise data.  
We currently have a plug-in architecture, but would like to discuss the 
possibility of a tighter integration so that Postgresql looks like a more 
cohesive addition to FreeNAS.


Doesn't sound like that should be that terribly hard; you'd probably want to 
tweak the stock postgresql.conf for improved performance.


3) Statistic monitoring / EagleEye
In FreeBSD/FreeNAS I have developed a system called EagleEye. EagleEye is a system 
where all mibs are easily exportable with timestamps in a common format (for now 
YAML & modified CSV) which is then consumed by a utility which can then provide 
graphs. The entire point of EagleEye is to eventually upstream the modifications to 
future-proof statistics tracking in the FreeBSD and FreeNAS systems.  I have 
spoken with some Illumos/ZFS developers and they are interested as well.


Are you thinking of storing that data in Postgres? You might find 
http://labs.omniti.com/labs/reconnoiter interesting.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




[HACKERS] visibilitymap_set and checksums

2013-05-24 Thread Andres Freund
Hi,

while thinking about vacuum freeze I noticed that since the checksums
patch visibilitymap_set() does:
/*
 * If data checksums are enabled, we need to protect the heap
 * page from being torn.
 */
if (DataChecksumsEnabled())
{
Page heapPage = BufferGetPage(heapBuf);

/* caller is expected to set PD_ALL_VISIBLE first */
Assert(PageIsAllVisible(heapPage));
PageSetLSN(heapPage, recptr);
}

That pattern looks dangerous. Setting the LSN of the heap page will
prevent the next action from doing an FPI even if it would be required.

It's called, e.g., like this from lazy_scan_heap:

if (all_visible && !all_visible_according_to_vm)
{
    /*
     * It should never be the case that the visibility map page is set
     * while the page-level bit is clear, but the reverse is allowed
     * (if checksums are not enabled).  Regardless, set both bits
     * so that we get back in sync.
     *
     * NB: If the heap page is all-visible but the VM bit is not set,
     * we don't need to dirty the heap page.  However, if checksums are
     * enabled, we do need to make sure that the heap page is dirtied
     * before passing it to visibilitymap_set(), because it may be
     * logged.  Given that this situation should only happen in rare
     * cases after a crash, it is not worth optimizing.
     */
    PageSetAllVisible(page);
    MarkBufferDirty(buf);
    visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
                      vmbuffer, visibility_cutoff_xid);
}

other callers look similarly dangerous.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread German Becker
Thanks Amit, I understand now. Is there a way to know/predict how many
preallocated segments there will be at a given moment? What does it
depend on?


On Fri, May 24, 2013 at 12:46 PM, Amit Langote amitlangot...@gmail.comwrote:

  I didn't quite understand what you mean by that... But anyways so do you
  people think this sequence number overlap is normal ?

 There is no overlap at all. The newer segments that you see are
 pre-allocated ones. They have not been written to yet.

 From the ls -l pg_xlog output that you sent, it can be seen that
 segments starting from 000000010000000E000000A8 through
 00000001000000100000007E have been pre-allocated (at that point of
 time) and 000000010000000E000000A7 is currently being written to. Just
 look at the modified times in your ls -l listing.
 000000010000000E000000A7 has May 22 15:32 (the latest writes seem to
 have happened to this segment) whereas the pre-allocated ones have times
 around May 22 12:05 to 12:15 (they are yet to be written to).

 Does that help?

 --
 Amit Langote



Re: [HACKERS] visibilitymap_set and checksums

2013-05-24 Thread Simon Riggs
On 24 May 2013 18:40, Andres Freund and...@2ndquadrant.com wrote:

 That pattern looks dangerous. Setting the lsn of the heap page will
 prevent the next action from doing a FPI even if it would be required.

Can you be more specific about the danger you see?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Parallel Sort

2013-05-24 Thread Jim Nasby

On 5/13/13 9:28 AM, Noah Misch wrote:

It would be great if one client session could take advantage of multiple CPU
cores.  EnterpriseDB wishes to start the trek into this problem space for 9.4
by implementing parallel internal (i.e. not spilling to disk) sort.  This
touches on a notable subset of the infrastructure components we'll need for
parallel general query.  My intent is to map out the key design topics, hear
about critical topics I hadn't considered, and solicit feedback on the quality
of the high-level plan.  Full designs for key pieces will come later.


Have you considered GPU-based sorting? I know there's been discussion in the 
past.

To me, the biggest advantage of GPU sorting is that most of the concerns you've 
laid out go away; a backend that needs to sort just throws data at the GPU to 
do the actual sorting; all the MVCC issues and what not remain within the scope 
of a single backend.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread Amit Langote
On Sat, May 25, 2013 at 3:08 AM, German Becker german.bec...@gmail.com wrote:
 Thanks Amit, I understand now. Is there a way to know/predict how many
 preallocated segments there will be at a given moment? What does it depend
 on?

Upthread, Fujii Masao-san suggested what might have happened that
caused these pre-allocated segments to be created. To quote him:

WAL recycling is performed by checkpoint. Checkpoint always checks
whether there are WAL files no longer  required for crash recovery,
IOW, WAL files which were generated before the prior checkpoint
happened, and then if they are found, checkpoint tries to recycle
them.

Reading here would also help:
http://www.postgresql.org/docs/9.1/static/wal-configuration.html

If you are still using the same values as during this observation,
could you provide values for these postgresql.conf parameters:
checkpoint_segments, checkpoint_timeout, wal_keep_segments?


--
Amit Langote




Re: [HACKERS] Incomplete description of pg_start_backup?

2013-05-24 Thread Dmitry Koterov
I still don't get it.

Suppose we have a data file with blocks with important (non-empty) data:

A B C D

1. I call pg_start_backup().
2. Tar starts to copy A block to the destination archive...
3. During this copying, somebody removes data from a table which is
situated in block B. So this data is subject to vacuuming, and the block
is marked as free space.
4. Somebody writes data to a table, and this data is placed in that free
space - in block B. This is also added to the WAL (so the data is stored
in two places: in block B and in the WAL).
5. Tar (at last!) finishes copying block A and begins to copy block B.
6. It finishes, then it copies C and D to the archive too.
7. Then we call pg_stop_backup() and also archive collected WAL (which
contains the new data of B block as we saw above).

The question is: *where is the OLD data of block B in this scheme?* It seems
it is NOT in the backup! So it cannot be restored. (In the case where we
never overwrite blocks between pg_start_backup...pg_stop_backup, but always
append new data, it is not a problem.) It seems to me this is not
documented at all! That is what my initial e-mail was about.

(I have one hypothesis on that, but I am not sure. Here it is: does vacuum
save ALL the deleted data of block B to WAL in step 3, prior to deletion? If
yes, it is, of course, a part of the backup. But that wastes a lot of space...)




On Tue, May 14, 2013 at 6:05 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Mon, May 13, 2013 at 4:31 PM, Dmitry Koterov dmi...@koterov.ru wrote:

 Could you please provide a bit more detailed explanation on how it works?

 And how could postgres write in the middle of archiving files during an
 active pg_start_backup? If it could, there might be a case where part of
 an archived data file contains overwritten information from the future,


 The data files cannot contain information from the future.  If the backup
 is restored, it must be restored to the time of pg_stop_backup (at least),
 which means the data would at that point be from the past/present, not the
 future.

 Cheers,

 Jeff



Re: [HACKERS] Cost limited statements RFC

2013-05-24 Thread Greg Smith

On 5/24/13 10:36 AM, Jim Nasby wrote:

Instead of KB/s, could we look at how much time one process is spending
waiting on IO vs the rest of the cluster? Is it reasonable for us to
measure IO wait time for every request, at least on the most popular OSes?


It's not just an OS specific issue.  The overhead of collecting timing 
data varies massively based on your hardware, which is why there's the 
pg_test_timing tool now to help quantify that.


I have a design I'm working on that exposes the system load to the 
database usefully.  That's what I think people really want if the goal 
is to be adaptive based on what else is going on.  My idea is to use 
what uptime collects as a starting useful set of numbers to quantify 
what's going on.  If you have both a short term load measurement and a 
longer term one like uptime provides, you can quantify both the overall 
load and whether it's rising or falling.  I want to swipe some ideas on 
how moving averages are used to determine trend in stock trading 
systems: 
http://www.onlinetradingconcepts.com/TechnicalAnalysis/MASimple2.html
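The short-vs-long moving-average idea can be sketched in a few lines of Python. This is purely an illustration of the crossover technique from the link above, not any proposed PostgreSQL API; the class name and window sizes are invented for the example:

```python
from collections import deque

class LoadTrend:
    """Compare a short and a long moving average of load samples to
    classify the trend, the way simple-moving-average crossovers are
    used in trading systems.  Window sizes are arbitrary here; think
    of them as the 1-minute vs. 15-minute windows uptime reports."""

    def __init__(self, short_n=3, long_n=10):
        self.short = deque(maxlen=short_n)  # short window, reacts fast
        self.long = deque(maxlen=long_n)    # long window, overall load

    def add(self, sample):
        self.short.append(sample)
        self.long.append(sample)

    def trend(self):
        if not self.long:
            return "unknown"
        s = sum(self.short) / len(self.short)
        l = sum(self.long) / len(self.long)
        if s > l:
            return "rising"
        if s < l:
            return "falling"
        return "flat"

tracker = LoadTrend()
for load in [0.5, 0.5, 0.6, 0.9, 1.4, 2.0]:  # load climbing
    tracker.add(load)
print(tracker.trend())  # prints "rising"
```

A dynamic cost limit could then back off while the short average sits above the long one, and relax once it drops below.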


Dynamic load-sensitive statement limits and autovacuum are completely 
feasible on UNIX-like systems.  The work to insert a cost delay point 
needs to get done before building more complicated logic on top of it 
though, so I'm starting with this part.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] WAL segments (names) not in a sequence

2013-05-24 Thread German Becker
On Fri, May 24, 2013 at 3:24 PM, Amit Langote amitlangot...@gmail.comwrote:

 On Sat, May 25, 2013 at 3:08 AM, German Becker german.bec...@gmail.com
 wrote:
  Thanks Amit, I understand now. Is there a way to know/predict how many
  preallocated segments there will be at a given moment? What does it
  depend on?

 Upthread, Fujii Masao-san suggested what might have happened that
 caused these pre-allocated segments to be created. To quote him:

 WAL recycling is performed by checkpoint. Checkpoint always checks
 whether there are WAL files no longer required for crash recovery,
 IOW, WAL files which were generated before the prior checkpoint
 happened, and then if they are found, checkpoint tries to recycle
 them.

 Reading here would also help:
 http://www.postgresql.org/docs/9.1/static/wal-configuration.html

 If you are still using the same values as during this observation,
 could you provide values for these postgresql.conf parameters:
 checkpoint_segments, checkpoint_timeout, wal_keep_segments?


 --
 Amit Langote


Amit,

First, thanks for your help and your interest. I (think) I understand how
checkpoints/WAL segments work. What I didn't understand from the
documentation is the possibility of segments being preallocated. I thought
that the WAL segment with the highest sequence number is the one being
written at present, as opposed to a segment allocated to be written in
the future. If you could clarify this point for me, that would be great.

Here are the parameters related to checkpoint

For restore config
checkpoint_segments = 256   # in logfile segments, min 1, 16MB
each
#checkpoint_timeout = 5min  # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 -
1.0
#checkpoint_warning = 30s

#wal_keep_segments = 0  # in logfile segments, 16MB each; 0 disables




For production config:
checkpoint_segments = 256   # in logfile segments, min 1, 16MB
each
#checkpoint_timeout = 5min  # range 30s-1h
checkpoint_completion_target = 0.7  # checkpoint target duration, 0.0 -
1.0
wal_keep_segments = 256 # in logfile segments, 16MB each; 0 disables

Thanks again,

Germán


Re: [HACKERS] Cost limited statements RFC

2013-05-24 Thread Greg Smith

On 5/24/13 9:21 AM, Robert Haas wrote:


But I wonder if we wouldn't be better off coming up with a little more
user-friendly API.  Instead of exposing a cost delay, a cost limit,
and various charges, perhaps we should just provide limits measured in
KB/s, like dirty_rate_limit = amount of data you can dirty per
second, in kB and read_rate_limit = amount of data you can read into
shared buffers per second, in kB.


I already made and lost the argument for doing vacuum in KB/s units, so 
I wasn't planning on putting that in the way of this one.  I still think 
it's possible to switch to real world units and simplify all of those 
parameters.  Maybe I'll get the energy to fight this battle again for 
9.4.  I do have a lot more tuning data from production deployments to 
use as evidence now.


I don't think the UI end changes the bulk of the implementation work 
though.  The time consuming part of this development is inserting all of 
the cost delay hooks and validating they work.  Exactly what parameters 
and logic fires when they are called can easily be refactored later.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Parallel Sort

2013-05-24 Thread james
 Have you considered GPU-based sorting? I know there's been discussion 
in the past.


If you use OpenCL, then you can use a CPU driver if there is no GPU, and 
that can allow you to leverage all the CPU cores without having to do 
the multi-thread stuff in the backend.


While the compilation of a specific kernel can be quite expensive, it 
also has the effect of a JIT compiler in terms of system independence.






[HACKERS] adding import in pl/python function

2013-05-24 Thread Szymon Guz
Hi,
I'm wondering if it would be OK to change the procedure code before
execution. I'm thinking about magically adding an import at the beginning
of a function.

Currently numeric arguments passed to the procedure are converted into
floats. This is not good, as it causes loss of information.

The proposed solution in the code comment is "maybe use a string?".

I'm thinking about something else. We could convert it into the Decimal (
http://docs.python.org/2/library/decimal.html) class in Python.
Unfortunately this class requires an import like `from decimal import Decimal`
from the standard Python library.

I'm wondering if it would be a good idea to do it like this. It shouldn't
fail even with the trusted version of pl/python, as I'd rather see the
trusted version allow importing packages from the standard library.

regards,
Szymon


Re: [HACKERS] Cost limited statements RFC

2013-05-24 Thread Robert Haas
On Fri, May 24, 2013 at 10:36 AM, Jim Nasby j...@nasby.net wrote:
 Doesn't that hit the old issue of not knowing if a read came from FS cache
 or disk? I realize that the current cost_delay mechanism suffers from that
 too, but since the API is lower level that restriction is much more
 apparent.

Sure, but I think it's still useful despite that limitation.

 Instead of KB/s, could we look at how much time one process is spending
 waiting on IO vs the rest of the cluster? Is it reasonable for us to measure
 IO wait time for every request, at least on the most popular OSes?

I doubt that's going to be very meaningful.  The backend that dirties
the buffer is fairly likely to be different from the backend that
writes it out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] adding import in pl/python function

2013-05-24 Thread Claudio Freire
On Fri, May 24, 2013 at 4:10 PM, Szymon Guz mabew...@gmail.com wrote:

 I'm thinking about something else. We could convert it into Decimal
 (http://docs.python.org/2/library/decimal.html) class in Python.
 Unfortunately this class requires import like `from decimal import Decimal`
 from a standard Python library.

 I'm wondering if it would be a good idea to do it like this. It shouldn't
 fail even with the trusted version of pl/python, as I'd rather see the
 trusted version to allow importing packages from standard library.


Why would passing a decimal require an import?

The extension itself needs a reference to Decimal, to build them, but
the procedure's context doesn't need to have it.
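Claudio's point can be demonstrated in plain Python. This is a hypothetical sketch of the situation, not PL/Python itself: `exec()` with a minimal namespace stands in for running a procedure body whose own context never imported `decimal`:

```python
from decimal import Decimal  # only the "extension" side imports this

# Simulate a procedure body whose namespace never imported decimal:
# exec() with a hand-built globals dict plays the role of PL/Python
# running the user's function source.
proc_source = "result = x + x"      # the "procedure" body
namespace = {"x": Decimal("1.05")}  # argument built by the caller
exec(proc_source, namespace)

print(namespace["result"])  # prints 2.10 -- exact Decimal math, no import
```

The body just receives a ready-made Decimal object and operates on it; only the conversion layer needs the reference to the class.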




[HACKERS] shmem startup and shutdown hooks

2013-05-24 Thread Alvaro Herrera
Hi,

While going through Andres' BDR code I noticed that it has a shmem
startup hook, which installs an on_shmem_exit() callback to write stuff
at shutdown time.  This works fine but seems somewhat of a hazard: it
has the postmaster do the actual write, which requires accessing shared
memory while doing it.  Moreover, since the callback is installed to be
called at shmem reset, a process crash causes this code to be called,
which seems similarly broken.

It seems that the right place to do this is checkpointer shutdown, i.e.
when checkpointer is told to close shop it should also invoke various
modules' shutdown callbacks.  There's no hook point there though, so
we'd need to introduce something new for this specific purpose.

We already have pg_stat_statements doing this, viz. pgss_shmem_startup():

/*
 * If we're in the postmaster (or a standalone backend...), set up a
 * shmem exit hook to dump the statistics to disk.
 */
if (!IsUnderPostmaster)
    on_shmem_exit(pgss_shmem_shutdown, (Datum) 0);

Also, it'd be good to discuss what should happen when another process
crashes.  Should we attempt to write the data anyway, or should we reset
it just like we do for pgstats?

Thoughts?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] adding import in pl/python function

2013-05-24 Thread Szymon Guz
On 24 May 2013 21:14, Claudio Freire klaussfre...@gmail.com wrote:

 On Fri, May 24, 2013 at 4:10 PM, Szymon Guz mabew...@gmail.com wrote:
 
  I'm thinking about something else. We could convert it into Decimal
  (http://docs.python.org/2/library/decimal.html) class in Python.
  Unfortunately this class requires import like `from decimal import
 Decimal`
  from a standard Python library.
 
  I'm wondering if it would be a good idea to do it like this. It shouldn't
  fail even with the trusted version of pl/python, as I'd rather see the
  trusted version to allow importing packages from standard library.


 Why would passing a decimal require an import?

 The extension itself needs a reference to Decimal, to build them, but
 the procedure's context doesn't need to have it.


Hm... maybe you're right. I think I don't understand fully how the
procedures are executed, and I need to read more to get it.


thanks,
Szymon


Re: [HACKERS] visibilitymap_set and checksums

2013-05-24 Thread Andres Freund
On 2013-05-24 19:09:57 +0100, Simon Riggs wrote:
 On 24 May 2013 18:40, Andres Freund and...@2ndquadrant.com wrote:
 
  That pattern looks dangerous. Setting the lsn of the heap page will
  prevent the next action from doing a FPI even if it would be required.
 
 Can you be more specific about the danger you see?

CHECKPOINT at lsn 0/10;
vacuum starts
vacuum finds page which is all visible
vacuum sets all_visible
PageSetAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
  vmbuffer, visibility_cutoff_xid);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
  cutoff_xid);
if (DataChecksumsEnabled())
PageSetLSN(heapPage, recptr);

So at this point the *heap* page will have the lsn of the
xl_heap_visible record, which I thought to be rather dangerous because I
somehow missed the fact that log_heap_visible does:
if (DataChecksumsEnabled())
{
rdata[1].next = &(rdata[2]);

rdata[2].data = NULL;
rdata[2].len = 0;
rdata[2].buffer = heap_buffer;
rdata[2].buffer_std = true;
rdata[2].next = NULL;
}

So. Forget what I said, I just was confused.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] adding import in pl/python function

2013-05-24 Thread Claudio Freire
On Fri, May 24, 2013 at 4:22 PM, Szymon Guz mabew...@gmail.com wrote:
 Hm... maybe you're right. I think I don't understand fully how the
 procedures are executed, and I need to read more to get it.


Well, it's easy.

Instead of PLyFloat_FromNumeric[0], you can make a
PLyDecimal_FromNumeric. There, you'd do with the Python/C[1]:

PyObject *decimal = PyImport_ImportModule("decimal");
PyObject *decimal_dict = PyModule_GetDict(decimal);
PyObject *decimal_ctor = PyDict_GetItemString(decimal_dict, "Decimal");

And invoke it with a string rep of your Numeric:

PyObject *value = PyObject_CallFunction(decimal_ctor, "s", string_value);

Add of course all kinds of error checking and reference-counting
boilerplate, and you'd have a very dumb version of it.

To make it more pro, you'd want to do all that stuff to get
decimal_ctor only at initialization time. Especially since you don't
want to fumble with the import lock right there in _FromNumeric.

And to make it totally pro, you can even freeze Decimal (using
pyfreeze) if you'd like. I would only do this in contexts where you
don't have a stdlib of course. Not sure whether windows falls into
that category. Linux doesn't.


[0] 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/pl/plpython/plpy_typeio.c#l518
[1] http://docs.python.org/2/c-api/import.html
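The caching advice above translates directly: resolve the constructor once at initialization time and reuse it per conversion. A Python-level sketch of the same pattern (illustrative only; the real work would live in plpy_typeio.c, and the function name here is invented):

```python
from decimal import Decimal

# Resolved once, at module load -- the analogue of stashing
# decimal_ctor at extension init instead of importing per call.
_DECIMAL_CTOR = Decimal

def numeric_to_decimal(string_value):
    """Convert a Numeric's string representation losslessly,
    mirroring PyObject_CallFunction(decimal_ctor, "s", string_value)."""
    return _DECIMAL_CTOR(string_value)

# Unlike float conversion, no precision is lost:
print(numeric_to_decimal("0.1") + numeric_to_decimal("0.2"))  # prints 0.3
```

The equivalent float path yields 0.30000000000000004, which is exactly the information-loss problem the Decimal conversion avoids.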




Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Josh Berkus
Andres,

If I understand your solution correctly, though, this doesn't really
help the pathological case for freezing, which is the time-oriented
append-only table.  For data which isn't being used, allvisible won't be
set either because it won't have been read, no?  Is it still cheaper to
set allvisible than vacuum freeze even in that case?

Don't get me wrong, I'm in favor of this if it fixes the other (more
common) cases.  I just want to be clear on the limitations.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] adding import in pl/python function

2013-05-24 Thread Szymon Guz
On 24 May 2013 21:46, Claudio Freire klaussfre...@gmail.com wrote:

 On Fri, May 24, 2013 at 4:22 PM, Szymon Guz mabew...@gmail.com wrote:
  Hm... maybe you're right. I think I don't understand fully how the
  procedures are executed, and I need to read more to get it.


 Well, it's easy.

 Instead of PLyFloat_FromNumeric[0], you can make a
 PLyDecimal_FromNumeric. There, you'd do with the Python/C[1]:

 PyObject *decimal = PyImport_ImportModule("decimal");
 PyObject *decimal_dict = PyModule_GetDict(decimal);
 PyObject *decimal_ctor = PyDict_GetItemString(decimal_dict, "Decimal");

 And invoke it with a string rep of your Numeric:

 PyObject *value = PyObject_CallFunction(decimal_ctor, "s", string_value);

 Add of course all kinds of error checking and reference-counting
 boilerplate, and you'd have a very dumb version of it.

 To make it more pro, you'd want to do all that stuff to get
 decimal_ctor only at initialization time. Especially since you don't
 want to fumble with the import lock right there in _FromNumeric.

 And to make it totally pro, you can even freeze Decimal (using
 pyfreeze) if you'd like. I would only do this in contexts where you
 don't have a stdlib of course. Not sure whether windows falls into
 that category. Linux doesn't.


 [0]
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/pl/plpython/plpy_typeio.c#l518
 [1] http://docs.python.org/2/c-api/import.html



Thanks, I will take a look at this, looks pretty easy. However testing on
Windows will be pretty funny :)

thanks,
Szymon


Re: [HACKERS] getting rid of freezing

2013-05-24 Thread Andres Freund
On 2013-05-24 15:49:31 -0400, Josh Berkus wrote:
 If I understand your solution correctly, though, this doesn't really
 help the pathological case for freezing, which is the time-oriented
 append-only table.  For data which isn't being used, allvisible won't be
 set either because it won't have been read, no?  Is it still cheaper to
 set allvisible than vacuum freeze even in that case?

all-visible is only set in vacuum, and it determines which parts of a
table will be scanned in a non-full-table vacuum. So, since we won't
regularly start vacuum in the insert-only case, there will still be a
batch of work at once. But nearly all of that work is *already*
performed; we would just move the details of that work around a
bit. *But* since we would then only need to vacuum the non-all-visible
part, that would get noticeably cheaper as well.

I think for that case we should run vacuum more regularly on insert-only
tables, since we currently don't do that regularly enough, which a) increases
the amount of work needed at once and b) prevents index-only scans from
working there.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Parallel Sort

2013-05-24 Thread Kohei KaiGai
Let me introduce one thing we discussed at the developer meeting in
Ottawa. We reached a consensus that a pluggable exec-node may be useful
for replacing a part of the exec-node tree with an alternative one
implemented by an extension; this would allow running something like
GpuSort instead of the existing Sort.

http://wiki.postgresql.org/wiki/PgCon_2013_Developer_Meeting#Pluggable_plan.2Fexec_nodes

2013/5/24 james ja...@mansionfamily.plus.com:
 Have you considered GPU-based sorting? I know there's been discussion in
 the past.

 If you use OpenCL, then you can use a CPU driver if there is no GPU, and
 that can allow you to leverage all the CPU cores without having to do the
 multi-thread stuff in the backend.

 While the compilation of a specific kernel can be quite expensive, it also
 has the effect of a JIT compiler in terms of system independence.







-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] visibilitymap_set and checksums

2013-05-24 Thread Simon Riggs
On 24 May 2013 20:26, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-05-24 19:09:57 +0100, Simon Riggs wrote:
 On 24 May 2013 18:40, Andres Freund and...@2ndquadrant.com wrote:

  That pattern looks dangerous. Setting the lsn of the heap page will
  prevent the next action from doing a FPI even if it would be required.

 Can you be more specific about the danger you see?

 CHECKPOINT at lsn 0/10;
 vacuum starts
 vacuum finds page which is all visible
 vacuum sets all_visible
 PageSetAllVisible(page);
 MarkBufferDirty(buf);
 visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
   vmbuffer, visibility_cutoff_xid);
 recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
   cutoff_xid);
 if (DataChecksumsEnabled())
 PageSetLSN(heapPage, recptr);

 So at this point the *heap* page will have the lsn of the
 xl_heap_visible record, which I thought to be rather dangerous because I
 somehow missed the fact that log_heap_visible does:
 if (DataChecksumsEnabled())
 {
 rdata[1].next = &(rdata[2]);

 rdata[2].data = NULL;
 rdata[2].len = 0;
 rdata[2].buffer = heap_buffer;
 rdata[2].buffer_std = true;
 rdata[2].next = NULL;
 }

 So. Forget what I said, I just was confused.

I think it's perfectly understandable. Robert, Jeff and I discussed
that for a while before we passed it. I'm still not happy with it, and
think it's a pretty confusing section of code with multiple paths
through it, but I just can't see a better way.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Incomplete description of pg_start_backup?

2013-05-24 Thread Heikki Linnakangas

On 24.05.2013 14:33, Dmitry Koterov wrote:

I don't get still.

Suppose we have a data file with blocks with important (non-empty) data:

A B C D

1. I call pg_start_backup().
2. Tar starts to copy A block to the destination archive...
3. During this copying, somebody removes data from a table which is
situated in B block. So this data is a subject for vacuuming, and the block
is marked as a free space.
4. Somebody writes data to a table, and this data is placed to a free space
- to B block. This is also added to the WAL log (so the data is stored at 2
places: at B block and at WAL).
5. Tar (at last!) finishes copying of A block and begins to copy B block.
6. It finishes, then it copies C and D to the archive too.
7. Then we call pg_stop_backup() and also archive collected WAL (which
contains the new data of B block as we saw above).

The question is - *where is the OLD data of B block in this scheme?* Seems
it is NOT in the backup!


Correct.


So it cannot be restored.


Right, the old data on block B is gone. The backup is fine; the old data
on block B is not needed to recover the backup.



(And, in case when we never overwrite blocks between
pg_start_backup...pg_stop_backup, but always append the new data, it
is not a problem.) Seems to me this is not documented at all! That is
what my initial e-mail about.


When you restore the backup, the database is restored to the state it 
was when pg_stop_backup() was called. What did you expect?



(I have one hypothesis on that, but I am not sure. Here is it: does vacuum
saves ALL deleted data of B block to WAL on step 3 prior deletion? If yes,
it is, of course, a part of the backup. But it wastes space a lot...)


Nope, it doesn't do that.

- Heikki

