Re: [GENERAL] out-of-order XID insertion in KnownAssignedXids

2016-10-24 Thread fredrik
Hi All,

thank you all, I sincerely appreciate your feedback.

I have done a fair amount of testing on the solution proposed by you all (not 
removing backup_label), and it seems to have completely addressed the issue.

This was actually introduced some time back, and I am not completely certain 
how it crept into our codebase. I think that at least part of the explanation 
lies in the fact that we are experiencing a fair amount of growth in the 
database size and use on some of our installations. This could be the reason 
why extensive testing did not show the issue back then and why we are seeing it 
now.

Would it make sense to log a warning in the case of a missing backup_label 
file, or would it be difficult to identify that situation in the code? I would 
be happy to dig in and develop a patch.


With regard to the package version: we *are* working with a few "stock" 
scenarios, one of which is a fairly old RHEL installation. We also have 
CentOS versions that are much more up to date.
Best regards, and thank you all again,

Fredrik
On 20 October 2016 at 22:38:26 +02:00, Andres Freund  wrote:

> On 2016-10-20 22:37:15 +0900, Michael Paquier wrote:
> 
> > On Thu, Oct 20, 2016 at 10:21 PM, <> wrote:
> > 
> > > - remove a file called backup_label, but I am not certain that this file 
> > > is
> > > in fact there (any more).
> > > 
> > It is never a good idea to remove it when you are trying to restore from a
> > backup; backup_label contains critical information when restoring from a
> > backup, so you may finish with a corrupted data folder.
> > 
> And this actually seems like a likely source of these errors. Removing
> a backup label unfortunately causes hard to diagnose errors, because
> everything appears to be ok as long as there's no checkpoints while
> taking the base backups (or when the control file was copied early
> enough). But as soon as a second checkpoint happens before the control
> file is copied...
> 
> Fredrik, how did you end up removing the label?
> 
> Greetings,
>
> Andres Freund
>



Re: [GENERAL] out-of-order XID insertion in KnownAssignedXids

2016-10-20 Thread fredrik
It should be noted that we have not spent a significant amount of 
time determining the best way to recover, since we have been focused on how to 
avoid the situation altogether.

I am not certain which xlog file it is having difficulties with.

Best regards,

Fredrik

On October 18, 2016 at 8:49:29 pm +02:00, Andres Freund  
wrote:

> Hi,
> 
> On 2016-10-18 14:57:52 +0200,  wrote:
> 
> > we are running many postgresql master/slave setups. The slaves are
> > initialised from a pg_basebackup from the master and are sync
> > streaming from the master. When we determine the master has failed,
> > the slave is promoted. Some time after that, the old master is again
> > initialised with a pg_basebackup and starts streaming from the new
> > master.
> > 
> Could you describe in a bit more detail how exactly you're setting up
> the standbys? E.g. the exact recovery.conf used, whether you remove any
> files during starting a standby. Also how exactly you're promoting
> standbys?
> 
> > Recently, we have gotten this error a fair amount of times: "out-of-order 
> > XID insertion in KnownAssignedXids" when postgresql attempts to start after 
> > being initialised with a pg_basebackup from the current master.
> >
> 
> 
> Which version are you encountering this on precisely?
> 
> 
> > Once the issue has occurred, a subsequent re-initialisation (with a 
> > completely new pg_basebackup) does not resolve the issue.
> > 
> 
> How have you recovered from this so far?
> 
> 
> > I have a setup in the failing state, so I can produce any kind of log 
> > mesages / details that would be helpful.
> > 
> 
> Could you use pg_xlogdump to dump the WAL file on which replay failed?
> And then attach the output in a compressed manner?
> 
> Greetings,
> 
> Andres Freund
>



[GENERAL] out-of-order XID insertion in KnownAssignedXids (on 9.5.3)

2016-10-19 Thread Fredrik . HuitfeldtMadsen
Hi All,

we are running many postgresql master/slave setups. The slaves are 
initialised from a pg_basebackup from the master  and are sync streaming 
from the master. When we determine the master has failed, the slave is 
promoted. Some time after that, the old master is again initialised with a 
pg_basebackup and starts streaming from the new master.
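
For reference, the re-initialisation itself is a plain pg_basebackup along these 
lines (a sketch; host, directory and options are illustrative rather than our 
exact invocation):

pg_basebackup -h CURRENT_MASTER_HOST -U replication -D /var/lib/postgresql/9.5/main -X stream -P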

Our setup seems pretty "stock" and has been running for us for some years 
(with different versions of postgresql but also different OSes). 

Recently, we have gotten this error a fair amount of times: "out-of-order 
XID insertion in KnownAssignedXids " when postgresql attempts to start 
after being initialised with a pg_basebackup from the current master. 

The only reference I can find on that particular error is from 2012, and the 
resulting issue has long been fixed in our version of postgresql (9.5.3) ... 
(
https://www.postgresql.org/message-id/201205230849.59825.and...@anarazel.de
)


Once the issue has occurred, a subsequent re-initialisation (with a 
completely new pg_basebackup) does not resolve the issue.

I have a setup in the failing state, so I can produce any kind of log 
messages / details that would be helpful.


Thank you for your support,
Fredrik



[GENERAL] out-of-order XID insertion in KnownAssignedXids

2016-10-18 Thread fredrik
Hi All,

we are running many postgresql master/slave setups. The slaves are initialised 
from a pg_basebackup from the master and are sync streaming from the master. 
When we determine the master has failed, the slave is promoted. Some time after 
that, the old master is again initialised with a pg_basebackup and starts 
streaming from the new master.

Our setup seems pretty "stock" and has been running for us for some years (with 
different versions of postgresql but also different OSes).

Recently, we have gotten this error a fair amount of times: "out-of-order XID 
insertion in KnownAssignedXids" when postgresql attempts to start after being 
initialised with a pg_basebackup from the current master.

The only reference I can find on that particular error is from 2012, and the 
resulting issue has long been fixed in our version of postgresql (9.5.3) ... 
(<https://www.postgresql.org/message-id/201205230849.59825.and...@anarazel.de>)


Once the issue has occurred, a subsequent re-initialisation (with a completely 
new pg_basebackup) does not resolve the issue.

I have a setup in the failing state, so I can produce any kind of log messages / 
details that would be helpful.


Thank you for your support,
Fredrik





Re: [GENERAL] Streaming replication, master recycling

2016-05-11 Thread fredrik
I apologise for the missing data.

we are running 9.1.15 on debian servers.

when we promote the old slave, it seems to go fine. Are you saying that it will 
cause issues down the line if the previous master is not shut down before 
promoting?

I was actually more concerned with the fact that we (sometimes) recycle the 
old master without doing a full basebackup. Again, this seems to work, but this 
presentation seems to indicate that this can cause problems (while seeming to 
work): <http://hlinnaka.iki.fi/presentations/NordicPGDay2015-pg_rewind.pdf>

The note is on page 14, under the headline: "Naive approach".


thank you for your support,

Fredrik
On 11 May 2016 at 12:47:13 +02:00, Venkata Balaji N  wrote:

> 
> 
> On Wed, May 11, 2016 at 2:31 PM, <> wrote:
> 
> > Hi All,
> > 
> > we are currently using streaming replication on multiple node pairs. We are 
> > seeing some issues, but I am mainly interrested in clarification.
> > 
> > When a failover occurs, we touch the trigger file, promoting the previous 
> > slave to master. That works perfectly.
> > 
> > For recycling the previous master, we create a recovery.conf (with 
> > recovery_target_timeline = 'latest') and *try* to start up. If postgresql 
> > starts up, we accept it as a new slave. If it does not, we proceed with a 
> > full basebackup.
> > 
> Which version of postgresql are you using?
> 
> You need to shut down the master first, then promote the slave, and then the 
> other way round; but this can be clarified only if you let us know the postgresql 
> version. This is quite tricky in 9.2.x and from 9.3.x.
> 
> Regards,
> Venkata B N
> 
> Fujitsu Australia
>



[GENERAL] Streaming replication, master recycling

2016-05-10 Thread fredrik
Hi All,

we are currently using streaming replication on multiple node pairs. We are 
seeing some issues, but I am mainly interested in clarification.

When a failover occurs, we touch the trigger file, promoting the previous slave 
to master. That works perfectly.

For recycling the previous master, we create a recovery.conf (with 
recovery_target_timeline = 'latest') and *try* to start up. If postgresql 
starts up, we accept it as a new slave. If it does not, we proceed with a full 
basebackup.
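
For concreteness, a minimal sketch of such a recovery.conf (host, user and 
trigger path are placeholders, not our exact values):

standby_mode = 'on'
primary_conninfo = 'host=NEW_MASTER_HOST port=5432 user=replication'
recovery_target_timeline = 'latest'
trigger_file = '/var/lib/postgresql/9.1/main/failover_trigger'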

This approach seems to work, but I have found indications that it can lead to 
database corruption: 
<http://hlinnaka.iki.fi/presentations/NordicPGDay2015-pg_rewind.pdf>

I am mainly seeking to understand whether, and why, this approach is a bad idea.

Thanks,
Fredrik Huitfeldt



Re: [GENERAL] Re: "missing chunk number XX for toast value YY in pg_toast ..." after pg_basebackup.

2016-03-10 Thread fredrik
> > 2016-02-22 02:12:18 GMT 30908 LOG: using stale statistics instead of
> > current ones because stats collector is not responding
> > 2016-02-22 02:13:01 GMT 30908 LOG: using stale statistics instead of
> > current ones because stats collector is not responding
> > 2016-02-22 02:13:52 GMT 30908 LOG: using stale statistics instead of
> > current ones because stats collector is not responding
> > 
> > There are lots of these mesages within the timeframe. There seems to be
> > a couple of them every 2-4 hours.
> > 
> This is usually a sign of resource starvation. I see this on an old machine, 
> at night, when I run some intensive file system backups. I figured it out by 
> looking at my crontab. The problem, such as it is, is that the messages fill 
> up the logs and your statistics become, as the message says, stale for however 
> long the collector does not respond.

Thank you very much, this is great feedback. I will try to determine what put 
the master under such load.

Thank you for your help so far,

Fredrik



Re: [GENERAL] Re: "missing chunk number XX for toast value YY in pg_toast ..." after pg_basebackup.

2016-03-09 Thread fredrik
Hi Adrian,

thank you very much for your patience. I apologise for the missing information.

On 9 March 2016 16:13:00 +01:00, Adrian Klaver  
wrote:

> On 03/09/2016 04:56 AM,  wrote:
> 
> > Hi Adrian,
> > 
> > thank you very much for your response.
> > 
> > I ran the "VACUUM ANALYZE" command on the master node.
> > 
> > Regarding log messages.
> > 
> > Here is the contents of the log (excluding connections/disconnections):
> > 
> Assuming the below is from the replica database.
> the "LOG: recovery was paused" message was indeed from the replica.

> 
> 
> > 
> > 2016-02-22 02:30:08 GMT 24616 LOG: recovery has paused
> > 
> So what happened to cause the above?
> 
we automatically pause recovery on the replica before running pg_dump. This is 
in order to make certain that we get a consistent dump of the database.
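
Roughly, the calls around the dump look like this (a sketch of the idea rather 
than our exact script; these are the 9.1-style function names):

-- pause WAL replay on the replica so the dump sees a stable database
SELECT pg_xlog_replay_pause();
SELECT pg_is_xlog_replay_paused();   -- returns true while replay is paused

-- ... pg_dump runs against the replica here ...

-- resume replay once the dump has finished
SELECT pg_xlog_replay_resume();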

> 
> I am not seeing anything below that indicates the recovery started again.

The reason why we do not see a matching "resume" is that the pg_dump failed 
and our error handling was insufficient.

> 
> 
> > 2016-02-22 02:30:08 GMT 24616 HINT: Execute pg_xlog_replay_resume() to
> > continue.
> > 2016-02-22 02:37:19 GMT 23859 DBNAME ERROR: missing chunk number 0 for
> > toast value 2747579 in pg_toast_22066
> > 2016-02-22 02:37:19 GMT 23859 DBNAME STATEMENT: COPY public.room_shape
> > (room_uuid, data) TO stdout;
> > 2016-02-22 02:37:41 GMT 2648 DBNAME LOG: could not receive data from
> > client: Connection reset by peer
> > 2016-02-22 02:37:41 GMT 2648 DBNAME LOG: unexpected EOF on client
> > connection
> > 
> What does the log from the master show?
> 
It doesn't seem to show much. It does have these repeated messages, however:
2016-02-22 02:12:18 GMT 30908 LOG: using stale statistics instead of current 
ones because stats collector is not responding
2016-02-22 02:13:01 GMT 30908 LOG: using stale statistics instead of current 
ones because stats collector is not responding
2016-02-22 02:13:52 GMT 30908 LOG: using stale statistics instead of current 
ones because stats collector is not responding

There are lots of these messages within the timeframe. There seems to be a 
couple of them every 2-4 hours.


> > 
> > 
> > Best regards,
> > Fredrik Huitfeldt
> > 
> > 
> > On 7 March 2016 16:35:29 +01:00, Adrian Klaver
> > <> wrote:
> > 
> > > On 03/06/2016 10:18 PM, 
> > > > wrote:
> > > 
> > > HI All,
> > > 
> > > i would really appreciate any help I can get on this issue.
> > > 
> > > basically, a pg_basebackup + streaming attach, led to a database
> > > that we
> > > could not read from afterwards.
> > > 
> > > 
> > > From original post:
> > > 
> > > <http://www.postgresql.org/message-id/1456919678340.31300.116900@webmail2>
> > > 
> > > "The issue remained until we ran a full vacuum analyze on the cluster."
> > > 
> > > Which cluster was that, the master or the slave?
> > > 
> > > "I have logfiles from the incident, but I cannot see anything out of
> > > the ordinary (despite having a fair amount of experience investigating
> > > postgresql logs)."
> > > 
> > > 
> > > Can we see the section before and after ERROR?
> > > 
> > > 
> > > Beset regards,
> > > Fredrik
> > > 
> > > PS please advise if this is better posted on another list.
> > > 
> > > 
> > > 
> > > --
> > > Adrian Klaver
> > >  >
> > > 
> > 
> 
> -- 
> Adrian Klaver
> 
Best regards,

Fredrik



Re: [GENERAL] Re: "missing chunk number XX for toast value YY in pg_toast ..." after pg_basebackup.

2016-03-09 Thread fredrik
Hi Adrian,

thank you very much for your response.

I ran the "VACUUM ANALYZE" command on the master node.

Regarding log messages.

Here is the contents of the log (excluding connections/disconnections):

2016-02-22 02:30:08 GMT 24616 LOG: recovery has paused
2016-02-22 02:30:08 GMT 24616 HINT: Execute pg_xlog_replay_resume() to continue.
2016-02-22 02:37:19 GMT 23859 DBNAME ERROR: missing chunk number 0 for toast 
value 2747579 in pg_toast_22066
2016-02-22 02:37:19 GMT 23859 DBNAME STATEMENT: COPY public.room_shape 
(room_uuid, data) TO stdout;
2016-02-22 02:37:41 GMT 2648 DBNAME LOG: could not receive data from client: 
Connection reset by peer
2016-02-22 02:37:41 GMT 2648 DBNAME LOG: unexpected EOF on client connection


Best regards,

Fredrik Huitfeldt

On 7 March 2016 16:35:29 +01:00, Adrian Klaver  
wrote:

> On 03/06/2016 10:18 PM,  wrote:
> 
> > HI All,
> > 
> > i would really appreciate any help I can get on this issue.
> > 
> > basically, a pg_basebackup + streaming attach, led to a database that we
> > could not read from afterwards.
> > 
> From original post:
> 
> <http://www.postgresql.org/message-id/1456919678340.31300.116900@webmail2>
> 
> "The issue remained until we ran a full vacuum analyze on the cluster."
> 
> Which cluster was that, the master or the slave?
> 
> "I have logfiles from the incident, but I cannot see anything out of the 
> ordinary (despite having a fair amount of experience investigating postgresql 
> logs)."
> 
>
> Can we see the section before and after ERROR?
> 
> 
> > 
> > Beset regards,
> > Fredrik
> > 
> > PS please advise if this is better posted on another list.
> > 
> 
> -- 
> Adrian Klaver
> 
>



[GENERAL] Re: "missing chunk number XX for toast value YY in pg_toast ..." after pg_basebackup.

2016-03-06 Thread fredrik
Hi All,

I would really appreciate any help I can get on this issue.

Basically, a pg_basebackup + streaming attach led to a database that we could 
not read from afterwards.


Best regards,
Fredrik


PS please advise if this is better posted on another list.


[GENERAL] "missing chunk number XX for toast value YY in pg_toast ..." after pg_basebackup.

2016-03-02 Thread fredrik
Hi All,

we are running postgresql 9.1.15 on Debian.

we are, basically, running a postgresql cluster with two nodes. We are using 
synchronous streaming replication to make sure that the slave is always fully 
in sync (using a recovery.conf that points out the master). The slave is mainly 
used to maintain high availability. We also run backups on the slave to reduce 
load on the master.

On both nodes, postgresql's data is stored on a SAN.

At some point, we restarted the slave. The first thing it does is to do a 
pg_basebackup and then start streaming changes based on that. This seemed to 
work fine, until 16 hrs later when we made a backup on the slave, using 
pg_dump. pg_dump was unable to complete, due to a pg_toast error (ERROR: 
missing chunk number 0 for toast value 2753291 in pg_toast_22066).

The issue remained until we ran a full vacuum analyze on the cluster.
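
As an aside, the number in the toast table's name is the OID of the owning 
table, so the error can be mapped back to an ordinary table; a sketch (22066 is 
taken from our error message):

-- the number in pg_toast_ is the owning table's OID
SELECT 22066::regclass AS owning_table;

-- the same mapping via pg_class, without relying on the naming convention
SELECT c.oid::regclass AS owning_table
FROM pg_class c
JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE t.relname = 'pg_toast_22066';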

We have been running in this way on several clusters for some years now, 
basically since 9.1 was released, without seeing this issue. Since we upgraded 
to postgresql 9.1.15, we have seen it twice. This does not necessarily mean 
anything, but I thought I would mention it.

The issue is resolved now, but I would still like to understand what happened.
I have logfiles from the incident, but I cannot see anything out of the 
ordinary (despite having a fair amount of experience investigating postgresql 
logs).

I have read that this kind of issue is most frequently due to hardware problems 
or bugs in postgresql.

I have reviewed the release notes from the subsequent 9.1 releases 
(9.1.15-9.1.19), but I have been unable to identify anything that hints at the 
issue that we have experienced.

the closest one would be:
(9.1.16,9.4.2,9.3.7,9.2.11,9.0.20) Fix possible failure during hash index 
bucket split, if other processes are modifying the index concurrently (Tom Lane)

but that seems like a long shot.

My questions are:
- does anyone know of bug reports (fixed or otherwise) that could shed light on 
our issue?
- does anyone have experience with this kind of behavior?

- our setup is fairly automated; would it be a good idea to always run vacuum 
analyze after a pg_restore (or before pg_dump)?




Re: [pgpool-general: 1315] Re: [GENERAL] Database connections seemingly hanging

2013-01-28 Thread Fredrik . HuitfeldtMadsen
Hello Mr. Ishii,

we have attempted to create a self-contained test case, but have been 
unsuccessful so far. We understand how pgpool acquires locks in theory, 
but it seems that what we are seeing is different. We have summarized our 
findings here: http://pastebin.com/9f6gjxLA

It seems that pgpool child 7606 communicates with process 26453 on the 
.202 node and process 17789 on the .204 node. We can see from the output 
from postgres that process 26453 is waiting for the lock on the .202 
server, while 17789 has the lock on the .204 server. This means that 
pgpool child 7606 will wait until process 26453 can obtain the lock on the 
.202 server. 

At the same time, we can see that pgpool child 7681 communicates with 
process 23451 on the .202 server and process 12464 on the .204 server. We 
can see from the output from postgres that the process 23451 has its lock 
on the .202 server but process 12464 is waiting for it on the .204 server. 
This means that pgpool child 7681 will wait until process 12464 can obtain 
the lock on the .204 server. 

Since pgpool child 7606 via process 17789 has the lock on the .204 server, 
it blocks pgpool child 7681 from completing. Since pgpool child 7681 via 
process 23451 has the lock on the .202 server, it blocks pgpool child 7606 
from completing. This seems to be a classic deadlock.
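
For reference, a sketch of the kind of pg_locks query that pairs each waiting 
backend with the backend holding the conflicting lock (it has to be run on each 
node separately, and the pids then matched to the pgpool children by hand):

-- one row per (waiting backend, conflicting lock holder) pair on this node
SELECT waiting.pid AS waiting_pid,
       holder.pid  AS holding_pid,
       holder.mode AS held_mode,
       waiting.locktype,
       waiting.relation::regclass AS relation
FROM pg_locks waiting
JOIN pg_locks holder
  ON holder.granted
 AND NOT waiting.granted
 AND holder.pid <> waiting.pid
 AND holder.locktype = waiting.locktype
 AND holder.database      IS NOT DISTINCT FROM waiting.database
 AND holder.relation      IS NOT DISTINCT FROM waiting.relation
 AND holder.page          IS NOT DISTINCT FROM waiting.page
 AND holder.tuple         IS NOT DISTINCT FROM waiting.tuple
 AND holder.transactionid IS NOT DISTINCT FROM waiting.transactionid
 AND holder.classid       IS NOT DISTINCT FROM waiting.classid
 AND holder.objid         IS NOT DISTINCT FROM waiting.objid
 AND holder.objsubid      IS NOT DISTINCT FROM waiting.objsubid;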

We have spent a fair amount of time debugging this situation, and we would 
really appreciate feedback on our situation.

Is there any information that would aid you in providing us with this kind 
of feedback?

Kind regards,
Fredrik & Friends





Tatsuo Ishii  
2013/01/16 07:54

To
fredrik.huitfeldtmad...@schneider-electric.com
cc
t...@sss.pgh.pa.us, nag...@sraoss.co.jp, pgsql-general@postgresql.org, 
pgpool-gene...@pgpool.net
Subject
Re: [pgpool-general: 1315] Re: [GENERAL] Database connections seemingly 
hanging






> It seems that the root cause was that pgpool acquired the locks in the 
> wrong order. If the resource is called A it seems that pgpool allows 
child 
> X to acquire A on node1 and at the same time, child Y acquires A on 
node2. 
> This leaves X wanting A on node2 and Y wanting A on node1. This leaves 
> both children hanging indefinitely. It also leaves both postgres'es 
> blissfully unaware of the deadlock, whereby it escapes postgres'es 
> deadlock detection.

That's hard to believe for me. For any query, pgpool sends it to the
master node (node1 in your case) first and waits until the node
> returns a response, using select(2) on the socket to the PostgreSQL
> backend. After something comes back on the socket, pgpool then issues the
> query to node2.  So pgpool never sends a query to the master node (node1) and
> node2 concurrently.  This is a classical technique to avoid a cross-node
> deadlock situation.

> If your explanation is correct, pgpool would easily go into a deadlock
> situation even with a simple pgbench query.

Could you please show me self-contained test case?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp




Re: [GENERAL] Database connections seemingly hanging

2013-01-11 Thread Fredrik . HuitfeldtMadsen
Hi All, 

@ Tom
Thank you for your response. While working on your suggestions, we seem to 
have found the cause of our problems.

@ Yugo
Thank you for your response. We are running pgpool in replication mode 
with load balancing enabled. If you have further questions to aid in 
debugging the situation, please let me know. 


It seems that the root cause was that pgpool acquired the locks in the 
wrong order. If the resource is called A it seems that pgpool allows child 
X to acquire A on node1 and at the same time, child Y acquires A on node2. 
This leaves X wanting A on node2 and Y wanting A on node1. This leaves 
both children hanging indefinitely. It also leaves both Postgres instances 
blissfully unaware of the deadlock, whereby it escapes Postgres's 
deadlock detection.

We have included a summary of the system state here:
http://pastebin.com/9f6gjxLA

We have used netstat to trace the connections between the pgpool children 
and the postgres backends. pgpool child 7606 has acquired a lock on the .204 
server but waits for the same lock on the .202 server. At the same time 
pgpool child 7681 has the lock on the .202 server and waits for it on the 
.204 server. Pgpool is running on the .204 server. 

If anyone is interested, we have included the full outputs in the 
following pastebins:

pg_locks on 10.216.73.202: http://pastebin.com/uRQh5Env
pg_locks on 10.216.73.204: http://pastebin.com/BXpirVQ2
netstat -p on 10.216.73.202: http://pastebin.com/b9kV7Wz4
netstat -p on 10.216.73.204: http://pastebin.com/tPz8gwRG

Kind regards,
Fredrik & friends





Tom Lane  
2013/01/10 05:30

To
fredrik.huitfeldtmad...@schneider-electric.com
cc
pgsql-general@postgresql.org, pgpool-gene...@pgpool.net
Subject
Re: [GENERAL] Database connections seemingly hanging






fredrik.huitfeldtmad...@schneider-electric.com writes:
> We have a setup where 2 JBoss (5.1) servers communicate with 1 instance 
of 
> PgPool (3.04), which again communicates with 2 Postgresql (8.4) servers. 

> The JBoss servers host some Java code for us and as part of that they 
run 
> some quartz jobs. 

> These jobs are triggered right after startup and as part of that we get 
> what seems to get stuck. At least when we can see in the database that 
> when inspecting pg_locks, there exists a virtual transaction that has 
all 
> desired locks granted but seems to be stuck. When we inspect 
> pg_stat_activity, it seems that the process is still waiting for the 
query 
> (SELECT ... FOR UPDATE) to finish.

> The locking transaction is described here: http://pastebin.com/3pEn6vPe

What that shows is several sessions running SELECT FOR UPDATE, but none
of them seem to be waiting.  What else is going on?  In particular, are
there any idle-in-transaction sessions?  Also, would any of these
SELECTs return enough rows that the sessions might be blocked trying to
send data back to their clients?  That wouldn't show as waiting = true,
though I think you could detect it by strace'ing the backends to see if
they are stopped in a send() kernel call.

> We suspect that a connection to the database acquires its locks but 
> somehow does not return to the application. If this is true, it would 
> either be a postgresql or a pgpool problem. We would appreciate any help 

> in further debugging or resolving the situation. 

It seems like a good guess would be that you have a deadlock situation
that cannot be detected by the database because part of the blockage is
on the client side --- that is, client thread A is waiting on its
database query, that query is waiting on some lock held by client thread
B's database session, and thread B is somehow waiting for A on the
client side.  It's not too hard to get into this type of situation when
B is sitting on an open idle-in-transaction session: B isn't waiting for
the database to do anything, but is doing something itself, and so it's
not obvious that there's any risk.  Thus my question about what idle
sessions there might be.  This does usually lead to a visibly waiting
database session for client A, though, so it's probably too simple as an
explanation for your issue.  We have seen some harder-to-debug cases
where the database sessions weren't visibly "waiting" because they were
blocked on client I/O, so maybe you've got something like that.

Another line of thought to pursue is possible misuse of pgpool.  If
pgpool doesn't realize you're inside a transaction and swaps the
connection to some other client thread, all kinds of confusion ensues.

Also, I hope you're running a reasonably recent 8.4.x minor release.
A quick look through the commit logs didn't show anything about deadlock
fixes in the 8.4 branch, but I might have missed something that was
fixed a long time ago.

 regards, tom lane




[GENERAL] Database connections seemingly hanging

2013-01-09 Thread Fredrik . HuitfeldtMadsen
Hi All,

We have a setup where 2 JBoss (5.1) servers communicate with 1 instance of 
PgPool (3.04), which again communicates with 2 Postgresql (8.4) servers. 
The JBoss servers host some Java code for us and as part of that they run 
some quartz jobs. 

These jobs are triggered right after startup, and as part of that something 
seems to get stuck. At least, when inspecting pg_locks, we can see that there 
exists a virtual transaction that has all of its desired locks granted but 
seems to be stuck. When we inspect pg_stat_activity, it seems that the process 
is still waiting for the query (SELECT ... FOR UPDATE) to finish.
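
For reference, a sketch of the pg_stat_activity check that surfaces waiting 
backends and idle-in-transaction sessions on 8.4 (the column names below are 
the 8.4 ones; they were renamed in later releases):

-- waiting backends and sessions idling inside an open transaction
SELECT procpid, usename, waiting, xact_start, current_query
FROM pg_stat_activity
WHERE waiting
   OR current_query = '<IDLE> in transaction';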

The locking transaction is described here: http://pastebin.com/3pEn6vPe

We know that the quartz thread is attempting to obtain a row share lock. 
We know that we have enough connections available in postgres and in 
pgpool. We also know that the issue occurs much more frequently when we 
enable postgres statement logging. We assume that this is due to postgres 
becoming slower as a result of the additional logging. When we look at 
the server thread dump, we can see that all quartz threads are either 
sleeping or waiting for postgres.

A thread dump of the relevant quartz threads is described here: 
http://pastebin.com/iPhuFLrM

It is important to note that the issue does not only occur with quartz 
jobs, but it is where we see it most frequently. This is likely to be due 
to the fact that it is the place where we have the highest level of 
concurrency.

We suspect that a connection to the database acquires its locks but 
somehow does not return to the application. If this is true, it would 
either be a postgresql or a pgpool problem. We would appreciate any help 
in further debugging or resolving the situation. 

Kind regards,
Fredrik

[GENERAL] What will happen to PostgreSQL server (on Linux) if CMOS battery dies?

2008-12-02 Thread Fredrik Israelsson

Hi!
I wonder what the consequences would be for a PostgreSQL server running on 
Linux on a machine where the CMOS battery that gives life to the local system 
(time) clock dies.
Note that the machine is not connected to a network, so synchronizing with an 
external time server is not an option. The machine is usually turned off every 
night, and turned on every morning, so I guess that with a dead battery, the 
local time will be set to 1970 every morning.

Of course the battery should be changed as soon as possible if it dies, but I 
just want to know if PostgreSQL will be able to run at all in these conditions.

Best regards,
Fredrik Israelsson


Re: [GENERAL] PostgreSQL on system with root as only user

2006-08-30 Thread Fredrik Israelsson
Well, no actual "design decision" is made yet. Just testing...

...actually, in embedded, minimalistic environments, I don't think it's
unusual to have one and only one user.

/Fredrik 

-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED] 
Sent: den 30 augusti 2006 15:17
To: Fredrik Israelsson
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] PostgreSQL on system with root as only user 

"Fredrik Israelsson" <[EMAIL PROTECTED]> writes:
> I am about to install PostgreSQL on a minimal Linux system, where root

> is the only user that is allowed to exist.

You've *got* to be kidding.  That's possibly the stupidest system design
decision I've ever heard ... what is the point of disallowing non-root
userids?  It certainly can't improve system security to run everything
as root.

regards, tom lane



[GENERAL] PostgreSQL on system with root as only user

2006-08-30 Thread Fredrik Israelsson
Hi!
I am about to install PostgreSQL on a minimal Linux system, where root
is the only user that is allowed to exist. I would prefer to use a
binary installer. Is it possible to install PostgreSQL without the
postgres user?

/Fredrik



[GENERAL] Storing a chain

2003-11-19 Thread Johan Fredrik Øhman
Hi,
I'm seeking some opinion on the following problem

For those familiar with "iptables/netfilter", I am actually storing firewall
rules in the database. The order of these rules is critical, so there has to
be some kind of system to how they are ordered. As I see it there are at
least 2 approaches. 

1) Use a sequence number in the table. This way it is possible to use ORDER
BY sequencenumber to retrieve the correct list.  This works, but you have to
do some massive updating when you want to insert a rule between, e.g., "12" and
"13".


CREATE TABLE FW_CHAIN (
 FW_CHAIN_ID  INT4 not null,
 PARENT_IDINT4 null,
    fields ..
 constraint PK_FW_CHAIN primary key (FW_CHAIN_ID)
);

2) Use a Parent_Id, as intended in the table above.  I feel that
this is a little prettier, even if it might be somewhat less efficient.
Unfortunately, I'm really not sure how to perform the queries.  As I see it,
some kind of recursive select statement is needed. Cursors? Stored
procedures?
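
On newer releases there is a direct way to do this: PostgreSQL 8.4 and later 
support recursive common table expressions (not available in the releases 
current at the time of writing). A sketch, assuming PARENT_ID points at the 
rule that immediately precedes it and is NULL for the first rule:

WITH RECURSIVE chain AS (
 SELECT FW_CHAIN_ID, PARENT_ID, 1 AS position
 FROM FW_CHAIN
 WHERE PARENT_ID IS NULL            -- the first rule in the chain
 UNION ALL
 SELECT c.FW_CHAIN_ID, c.PARENT_ID, chain.position + 1
 FROM FW_CHAIN c
 JOIN chain ON c.PARENT_ID = chain.FW_CHAIN_ID
)
SELECT FW_CHAIN_ID FROM chain ORDER BY position;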

If somebody has the time, I'd really appreciate some links to relevant
examples, documentation or advice on how to solve this.

--
Johan Fredrik Øhman




