Re: [HACKERS] Win32 hard crash problem

2006-10-01 Thread Magnus Hagander

> IIRC there is no real SIGINT on Windows, so it can only come 
> from a postgres program. The windows shutdown could be 
> calling pg_ctl to stop the service, of course.

Well, not quite that, but it will send a service command to the running
pg_ctl (which is our "service supervisor"), which *will* respond with a
SIGINT to the postmaster. 


//Magnus
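
For reference, the postmaster chooses its shutdown mode from the signal it receives: SIGTERM means smart, SIGINT means fast, and SIGQUIT means immediate. That is why a "received fast shutdown request" line in the server log points at a SIGINT, whoever sent it. A minimal Python sketch of that lookup follows; the dictionary and function names are illustrative only and are not part of PostgreSQL.

```python
# Postmaster shutdown modes keyed by the signal that selects each one.
SHUTDOWN_MODE_BY_SIGNAL = {
    "SIGTERM": "smart",
    "SIGINT": "fast",
    "SIGQUIT": "immediate",
}


def signal_for_log_line(line):
    """Return the signal implied by a 'received ... shutdown request' line, else None."""
    for sig, mode in SHUTDOWN_MODE_BY_SIGNAL.items():
        if f"received {mode} shutdown request" in line:
            return sig
    return None


print(signal_for_log_line(
    "2006-09-28 16:40:36.921  LOG:  received fast shutdown request"))
```

This only tells you which signal arrived, not who sent it; on Windows that still leaves pg_ctl (as service supervisor) and any other postgres process as candidates.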

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] Win32 hard crash problem

2006-10-01 Thread Andrew Dunstan


IIRC there is no real SIGINT on Windows, so it can only come from a
postgres program. The windows shutdown could be calling pg_ctl to stop the
service, of course.

cheers

andrew

Joshua D. Drake wrote:
> Magnus Hagander wrote:
>>>> That log entry is the last (of consequence) entry before the machine says:
>>>> 2006-09-28 16:40:36.921  LOG:  received fast shutdown request
>>> Oh?  That's pretty interesting on a Windows machine, because
>>> AFAIK there wouldn't be any standard mechanism that might tie
>>> into our homegrown signal facility.  Anyone have a theory on
>>> what might trigger a SIGINT to the postmaster, other than
>>> intentional pg_ctl invocation?
>>
>> pg_ctl will send SIGINT to the postmaster when the service is stopped,
>> or when windows is shutting down.
>
> O.k. that pretty much confirms my suspicion then. The SIGINT likely came
> from the user rebooting windows.
>
>>
>> Do you get anything about the postgresql service in the eventlog within
>> say a minute of this happening? (before or after)
>
> Too late to say now :( I will have to follow up with them.
>




Re: [HACKERS] Win32 hard crash problem

2006-10-01 Thread Joshua D. Drake

Magnus Hagander wrote:
>>> That log entry is the last (of consequence) entry before the machine says:
>>> 2006-09-28 16:40:36.921  LOG:  received fast shutdown request
>> Oh?  That's pretty interesting on a Windows machine, because 
>> AFAIK there wouldn't be any standard mechanism that might tie 
>> into our homegrown signal facility.  Anyone have a theory on 
>> what might trigger a SIGINT to the postmaster, other than 
>> intentional pg_ctl invocation?
> 
> pg_ctl will send SIGINT to the postmaster when the service is stopped,
> or when windows is shutting down. 

O.k. that pretty much confirms my suspicion then. The SIGINT likely came
from the user rebooting windows.

> 
> Do you get anything about the postgresql service in the eventlog within
> say a minute of this happening? (before or after)

Too late to say now :( I will have to follow up with them.

Sincerely,

Joshua D. Drake

-- 

   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/


Re: [HACKERS] Win32 hard crash problem

2006-10-01 Thread Magnus Hagander

> > That log entry is the last (of consequence) entry before the machine says:
> > 2006-09-28 16:40:36.921  LOG:  received fast shutdown request
> 
> Oh?  That's pretty interesting on a Windows machine, because 
> AFAIK there wouldn't be any standard mechanism that might tie 
> into our homegrown signal facility.  Anyone have a theory on 
> what might trigger a SIGINT to the postmaster, other than 
> intentional pg_ctl invocation?

pg_ctl will send SIGINT to the postmaster when the service is stopped,
or when windows is shutting down. 

Do you get anything about the postgresql service in the eventlog within
say a minute of this happening? (before or after)


Could it be a backend or the postmaster trying to send a signal to a
different backend, that for some reason sends it to the wrong process?

//Magnus


Re: [HACKERS] Win32 hard crash problem

2006-09-29 Thread Joshua D. Drake

Tom Lane wrote:
> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>> O.k. further on this.. the crashing is happening quickly now but not
>> predictably. (as in sometimes a week sometimes 2 days).
> 
> OK, that seems to eliminate the GetTickCount-overflow theory anyway.
> 
>> That log entry is the last (of consequence) entry before the machine says:
>> 2006-09-28 16:40:36.921  LOG:  received fast shutdown request
> 
> Oh?  That's pretty interesting on a Windows machine, because AFAIK there
> wouldn't be any standard mechanism that might tie into our homegrown
> signal facility.  Anyone have a theory on what might trigger a SIGINT
> to the postmaster, other than intentional pg_ctl invocation?

Well the other option would be a windows restart. On windows would that
send a SIGINT to the backend?

Joshua D. Drake



Re: [HACKERS] Win32 hard crash problem

2006-09-29 Thread Tom Lane

"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
> O.k. further on this.. the crashing is happening quickly now but not
> predictably. (as in sometimes a week sometimes 2 days).

OK, that seems to eliminate the GetTickCount-overflow theory anyway.

> That log entry is the last (of consequence) entry before the machine says:
> 2006-09-28 16:40:36.921  LOG:  received fast shutdown request

Oh?  That's pretty interesting on a Windows machine, because AFAIK there
wouldn't be any standard mechanism that might tie into our homegrown
signal facility.  Anyone have a theory on what might trigger a SIGINT
to the postmaster, other than intentional pg_ctl invocation?

regards, tom lane
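
As a back-of-the-envelope check on the point above: GetTickCount() reports milliseconds since boot in a 32-bit value, so it wraps after 2^32 ms, roughly 49.7 days. The sketch below assumes the theory depends only on that wraparound; the helper name and the one-day tolerance are hypothetical.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

# A 32-bit millisecond counter wraps after 2**32 ms.
WRAP_DAYS = 2**32 / MS_PER_DAY  # about 49.7 days


def could_be_tick_wrap(uptime_days, tolerance_days=1.0):
    """True if a failure at this uptime lands near a multiple of the wrap interval."""
    cycles = round(uptime_days / WRAP_DAYS)
    return cycles >= 1 and abs(uptime_days - cycles * WRAP_DAYS) <= tolerance_days


# Uptimes reported in this thread: two days, a week, two to three weeks.
for days in (2, 7, 14, 21):
    print(days, could_be_tick_wrap(days))
```

None of the reported uptimes comes close to the wrap interval, which is consistent with dropping the theory.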


Re: [HACKERS] Win32 hard crash problem

2006-09-29 Thread Joshua D. Drake

Joshua D. Drake wrote:
> Tom Lane wrote:
>> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>>> Yes, unfortunately there isn't much more to be had for another 2
>>> weeks ;)
>>
>> I trust they've got the reboot time and they will know exactly how long
>> from reboot to problem?  I'm not all that sold on the "GetTickCount
>> overflow" theory, but certainly we ought not be missing a chance to test
>> or disprove it.
> 
> Yes I documented all conversations and disclaimers :)

O.k. further on this.. the crashing is happening quickly now but not
predictably. (as in sometimes a week sometimes 2 days). I just now got
them to send some further logs... Interestingly:

2006-09-28 16:38:37.406  LOG:  could not send data to client: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.

That log entry is the last (of consequence) entry before the machine says:

2006-09-28 16:40:36.921  LOG:  received fast shutdown request
2006-09-28 16:40:36.921  LOG:  aborting any active transactions
2006-09-28 16:40:36.921  FATAL:  terminating connection due to administrator command

On the ERROR side of things I have a bunch of standard, unique key
violations etc... AND:

postgresql-2006-09-27_00.log:2006-09-27 23:49:57.671  FATAL:  could not read from statistics collector pipe: No error

I have requested a clean run with entire log at DEBUG2. Hopefully that
will give us more info.

Sincerely,

Joshua D. Drake
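
To put a number on the sequence above, here is a small Python sketch that measures the time between the socket error and the shutdown request. It assumes every log line begins with a 23-character "YYYY-MM-DD HH:MM:SS.mmm" timestamp, as in the excerpts quoted here; the message text of the first line is abbreviated and the function names are illustrative. The socket error text looks like the one Windows uses for WSAENOBUFS, though the log does not show the error code.

```python
from datetime import datetime

# The two entries of interest; the first message is abbreviated.
LOG_LINES = [
    "2006-09-28 16:38:37.406  LOG:  could not send data to client: ...",
    "2006-09-28 16:40:36.921  LOG:  received fast shutdown request",
]


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")


def gap_seconds(earlier, later):
    """Seconds elapsed between two log lines."""
    return (parse_timestamp(later) - parse_timestamp(earlier)).total_seconds()


print(gap_seconds(LOG_LINES[0], LOG_LINES[1]))
```

The gap is just under two minutes. That fits someone noticing the failure and rebooting, but it does not by itself show that the buffer error caused the shutdown.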


Re: [HACKERS] Win32 hard crash problem

2006-09-07 Thread Gregory Stark


"Joshua D. Drake" <[EMAIL PROTECTED]> writes:

> Yes I am fully aware of that. I am only relaying what the customer said.

Yeah sorry, I guess what I sent was pretty obvious to you. I should stop
confusing -general with -hackers :)

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Win32 hard crash problem

2006-09-06 Thread Alvaro Herrera

Joshua D. Drake wrote:
> Gregory Stark wrote:
> >"Joshua D. Drake" <[EMAIL PROTECTED]> writes:

> >>The only known resolution is to reboot Windows. Using the service
> >>control panel to shutdown postgresql will fail once the message is
> >>received. It is unknown if using the task master to individually
> >>kill processes will work.
> >
> >This contradicts your previous email about restarting the postmaster 
> >working.
> 
> No, it doesn't. I never said restarting the postmaster would work. I
> said rebooting windows, allows postgresql to come back up. Those are 
> entirely different things.

Yup.  It was me who said that restarting the postmaster solved the
problem.  That's what Dave Cramer told me.  But maybe Dave was not
certain about that -- he did use the word "reboot" and I asked for
confirmation about whether this was an actual reboot of the machine 
or just a postmaster "reboot", and he said it was the latter.  But this
may have been a supposition.

Sorry for the confusion.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: [HACKERS] Win32 hard crash problem

2006-09-06 Thread Joshua D. Drake


Gregory Stark wrote:
> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>
>> O.k. to recap:
>>
>> This message will present itself, if connection attempts are made from the Web
>> Application (Java/JDBC), or locally via PgAdmin. Once the error message is
>> received, all subsequent connection attempts will also result in that same
>> message. We do not know if the error occurs before or after authentication.
>
> I think other people have claimed that this message is in libpq and not in
> JDBC source code which is inconsistent with this description.

Yes I am fully aware of that. I am only relaying what the customer said.

>> The only known resolution is to reboot Windows. Using the service control panel
>> to shutdown postgresql will fail once the message is received. It is unknown if
>> using the task master to individually kill processes will work.
>
> This contradicts your previous email about restarting the postmaster working.

No, it doesn't. I never said restarting the postmaster would work. I
said rebooting windows, allows postgresql to come back up. Those are
entirely different things.


Sincerely,

Joshua D. Drake



Re: [HACKERS] Win32 hard crash problem

2006-09-06 Thread Gregory Stark

"Joshua D. Drake" <[EMAIL PROTECTED]> writes:

> O.k. to recap:
>
> This message will present itself, if connection attempts are made from the Web
> Application (Java/JDBC), or locally via PgAdmin. Once the error message is
> received, all subsequent connection attempts will also result in that same
> message. We do not know if the error occurs before or after authentication.

I think other people have claimed that this message is in libpq and not in
JDBC source code which is inconsistent with this description.

> The only known resolution is to reboot Windows. Using the service control
> panel to shutdown postgresql will fail once the message is received. It is
> unknown if using the task master to individually kill processes will work.

This contradicts your previous email about restarting the postmaster working.

I think you have to sit down and write down *exactly* what sequence of actions
cause what results. Describing them in shorthand like "if connection attempts
are made" is leading to a lot of speculation instead of systematic deductions.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com


Re: [HACKERS] Win32 hard crash problem

2006-09-06 Thread Dave Cramer



On 6-Sep-06, at 3:27 AM, Magnus Hagander wrote:

>>>>>> Yes they are using a connection pool. A java based one.
>>>>> Since java has it's own protocol implementation, this is totally
>>>>> unrelated to any libpq error messages.
>>>> Another important point that we've not been given information on:
>>>> when pgAdmin/libpq starts failing like this, exactly what is
>>>> happening with the connection pool?  Is it still able to issue
>>>> queries, and if not what happens exactly?
>>>
>>> No, when this happens everything stops. The only thing they get back
>>> is that message until they reboot the server. The web app (via
>>> java/connection pool), pgAdmin both give the same error.
>>>
>>> Which now that I think about it, seems odd if the message is coming
>>> from libpq yes?
>> Yes, this is very odd, AFICS, this message does not exist in the
>> java driver. So it would be interesting to get the actual logs
>> from the client.
>
> Definitly - that error msg showing up in the web app really doesn't make
> sense. However, are we sure that the error message is *exactly* the
> same, word for word, or is it possible that it's just "the same in what
> it says" but with different words? I assume there are screendumps to
> verify this ;-)

I looked at the code in the jdbc driver and it doesn't even do this check.

> Another point that at least I don't know - what kind of connection pool
> is it? Is it an external one (like pgpool) to which the java app
> connects (using FE/BE protocol, emulating a "proper postmaster" but
> pooling access to the database), or is it running inside the app server
> (like for example .net connection pooling does, which simply means that
> when you run the Open() method on the connection object it will pick
> something off an *internal* pool)?

It's an internal pool, and the client has told me off list they have
removed it and are using the jdbc driver pool.

At this point I'm confused as to what they really are using, but as
they have contracted Command Prompt to fix this for them, I am no
longer in the private loop.

Dave

> //Magnus



Re: [HACKERS] Win32 hard crash problem

2006-09-06 Thread Magnus Hagander

> > server sent data ("D" message) without prior row description ("T"
> > message)
> 
> During the connection attempt?  I don't think libpq can report that
> message until it tries to do a regular query (might be wrong
> though).
> Is the client using some application that's going to issue a query
> immediately on connecting?

In the case of pgAdmin, it does. It will set datestyle, load a list of
dbs etc.


//Magnus



Re: [HACKERS] Win32 hard crash problem

2006-09-06 Thread Michael Paesold


Magnus Hagander wrote:
> Another point that at least I don't know - what kind of connection pool
> is it? Is it an external one (like pgpool) to which the java app
> connects (using FE/BE protocol, emulating a "proper postmaster" but
> pooling access to the database), or is it running inside the app server
> (like for example .net connection pooling does, which simply means that
> when you run the Open() method on the connection object it will pick
> something off an *internal* pool)?

Googling for c3p0 [1] shows that it is a Java-based connection pool that
implements connection pooling using the JDBC API, i.e. it is an *internal*
pool running inside the app server's JVM. pgAdmin cannot in any case
connect through this pool.


Best Regards
Michael Paesold

[1] http://sourceforge.net/projects/c3p0


Re: [HACKERS] Win32 hard crash problem

2006-09-06 Thread Magnus Hagander

> >>>> Yes they are using a connection pool. A java based one.
> >>> Since java has it's own protocol implementation, this is totally
> >>> unrelated to any libpq error messages.
> >> Another important point that we've not been given information on:
> >> when pgAdmin/libpq starts failing like this, exactly what is
> >> happening with the connection pool?  Is it still able to issue
> >> queries, and if not what happens exactly?
> >
> > No, when this happens everything stops. The only thing they get back
> > is that message until they reboot the server. The web app (via
> > java/connection pool), pgAdmin both give the same error.
> >
> > Which now that I think about it, seems odd if the message is coming
> > from libpq yes?
> Yes, this is very odd, AFICS, this message does not exist in the
> java driver. So it would be interesting to get the actual logs
> from the client.

Definitely - that error msg showing up in the web app really doesn't make
sense. However, are we sure that the error message is *exactly* the
same, word for word, or is it possible that it's just "the same in what
it says" but with different words? I assume there are screendumps to
verify this ;-)


Another point that at least I don't know - what kind of connection pool
is it? Is it an external one (like pgpool) to which the java app
connects (using FE/BE protocol, emulating a "proper postmaster" but
pooling access to the database), or is it running inside the app server
(like for example .net connection pooling does, which simply means that
when you run the Open() method on the connection object it will pick
something off an *internal* pool)?

//Magnus



Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Oleg Bartunov


I'm a bit afraid to engage in this thread, but I've also seen a
reproducible case where a libpq client stops working and 'vacuum analyze'
helped. It happened on Windows Server 2003 and XP with PostgreSQL 8.1.4.
I don't have the client source code, so I can't say more, but the customer's
developer said the same behaviour was observed on Linux with 8.1.0 and went
away in 8.1.4. They said that this happens only with row statistics enabled.
The client inserts some data in a transaction, the backend writes 'COMMIT'
to the log, but the client keeps waiting for something, and a 'vacuum
analyze' of the whole database in some magic way gets the process moving again.

I've got their installation CD and will try to investigate this problem.
Any suggestions? I'm not familiar with Win32 at all.

Oleg

On Tue, 5 Sep 2006, Tom Lane wrote:

> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>> Yes, unfortunately there isn't much more to be had for another 2 weeks ;)
>
> I trust they've got the reboot time and they will know exactly how long
> from reboot to problem?  I'm not all that sold on the "GetTickCount
> overflow" theory, but certainly we ought not be missing a chance to test
> or disprove it.
>
> regards, tom lane



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Joshua D. Drake


Tom Lane wrote:
> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>> Yes, unfortunately there isn't much more to be had for another 2 weeks ;)
>
> I trust they've got the reboot time and they will know exactly how long
> from reboot to problem?  I'm not all that sold on the "GetTickCount
> overflow" theory, but certainly we ought not be missing a chance to test
> or disprove it.

Yes I documented all conversations and disclaimers :)

Joshua D. Drake

> regards, tom lane





Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Tom Lane

"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
> Yes, unfortunately there isn't much more to be had for another 2 weeks ;)

I trust they've got the reboot time and they will know exactly how long
from reboot to problem?  I'm not all that sold on the "GetTickCount
overflow" theory, but certainly we ought not be missing a chance to test
or disprove it.

regards, tom lane


Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Dave Cramer



On 5-Sep-06, at 7:00 PM, Joshua D. Drake wrote:

> Tom Lane wrote:
>> Dave Cramer <[EMAIL PROTECTED]> writes:
>>> On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
>>>> Yes they are using a connection pool. A java based one.
>>> Since java has it's own protocol implementation, this is totally
>>> unrelated to any libpq error messages.
>>
>> Another important point that we've not been given information on:
>> when pgAdmin/libpq starts failing like this, exactly what is happening
>> with the connection pool?  Is it still able to issue queries, and
>> if not what happens exactly?
>
> No, when this happens everything stops. The only thing they get back
> is that message until they reboot the server. The web app (via
> java/connection pool), pgAdmin both give the same error.
>
> Which now that I think about it, seems odd if the message is coming
> from libpq yes?

Yes, this is very odd, AFICS, this message does not exist in the java
driver. So it would be interesting to get the actual logs from
the client.

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Joshua D. Drake




> It sounds to me like we don't actually know that, because the client
> doesn't know how to restart the postmaster without rebooting the OS.
> (Josh says "pg_ctl stop" doesn't work in this state, which is a tad
> interesting in itself, because that doesn't go through a connection
> request.)  It would be useful to try killing off the postgres processes
> via task manager and then see if a new postmaster can be started and if
> things then behave normally, or if a reboot is truly needed.

Right, and I have asked that the next time this happens that they try
and use the task manager to kill the process.

> The bottom line here is that all we have so far are client-side
> observations ("I get this message") and we have no clue what state
> the postmaster thinks it's in.  We really need more information.

Yes, unfortunately there isn't much more to be had for another 2 weeks ;)

Sincerely,

Joshua D. Drake

> regards, tom lane





Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Tom Lane

Alvaro Herrera <[EMAIL PROTECTED]> writes:
> Joshua D. Drake wrote:
>> I already said that ;). The problem IS NOT that we can't restart the 
>> system and get postgresql back. It is that it happens at all.

> It is quite different a bug that can only be fixed by "rebooting the
> server" (which to me means taking the operating system down and starting
> it afresh) than one that can be fixed by restarting the PostgreSQL
> server (_without_ taking the operating system down).  I've been reading
> "reboot" all along -- sorry if I missed an email saying otherwise.

It sounds to me like we don't actually know that, because the client
doesn't know how to restart the postmaster without rebooting the OS.
(Josh says "pg_ctl stop" doesn't work in this state, which is a tad
interesting in itself, because that doesn't go through a connection
request.)  It would be useful to try killing off the postgres processes
via task manager and then see if a new postmaster can be started and if
things then behave normally, or if a reboot is truly needed.

The bottom line here is that all we have so far are client-side
observations ("I get this message") and we have no clue what state
the postmaster thinks it's in.  We really need more information.

regards, tom lane


Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Alvaro Herrera

Joshua D. Drake wrote:

> The only known resolution is to reboot Windows. Using the service 
                                  ^^^^^^^^^^^^^^
> control panel to shutdown postgresql will fail once the message is 
> received. It is unknown if using the task master to individually kill 
> processes will work.

This is what I'm saying that doesn't match what Dave told me.

The part about failing to shut the postmaster down is the first I've
heard of it.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Alvaro Herrera

Joshua D. Drake wrote:
> Alvaro Herrera wrote:
> >Joshua D. Drake wrote:
> >>Tom Lane wrote:
> >>>Dave Cramer <[EMAIL PROTECTED]> writes:
> >>>>On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
> >>>>>Yes they are using a connection pool. A java based one.
> >>>>Since java has it's own protocol implementation, this is totally
> >>>>unrelated to any libpq error messages.
> >>>Another important point that we've not been given information on:
> >>>when pgAdmin/libpq starts failing like this, exactly what is happening
> >>>with the connection pool?  Is it still able to issue queries, and
> >>>if not what happens exactly?
> >>No, when this happens everything stops. The only thing they get back is 
> >>that message until they reboot the server. The web app (via 
> >>java/connection pool), pgAdmin both give the same error.
> >
> >Actually Dave Cramer told me that if the postmaster was stopped and then
> >restarted, it would start answering fine again.  Which would make a lot
> >of sense.
> 
> I already said that ;). The problem IS NOT that we can't restart the 
> system and get postgresql back. It is that it happens at all.

It is quite different a bug that can only be fixed by "rebooting the
server" (which to me means taking the operating system down and starting
it afresh) than one that can be fixed by restarting the PostgreSQL
server (_without_ taking the operating system down).  I've been reading
"reboot" all along -- sorry if I missed an email saying otherwise.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Joshua D. Drake


Hello,

O.k. to recap:

OS: Win2k3 SP1
PostgreSQL: 8.1.2
Application Server: Jboss
Connection Pooler: c3p0
JDBC Version: 8.1.404, Also verified with 8.0.311

Problem:

After two to three weeks, PostgreSQL will begin issuing the following message:

server sent data ("D" message) without prior row description ("T" message)

This message will present itself, if connection attempts are made from 
the Web Application (Java/JDBC), or locally via PgAdmin. Once the error 
message is received, all subsequent connection attempts will also result 
in that same message. We do not know if the error occurs before or after 
authentication.


The only known resolution is to reboot Windows. Using the service
control panel to shut down postgresql will fail once the message is
received. It is unknown if using the Task Manager to individually kill
processes will work.


Sincerely,

Joshua D. Drake
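
For readers unfamiliar with the error text: in the version 3 frontend/backend protocol each backend message is a one-byte type code followed by a 4-byte big-endian length that counts itself, and a result set starts with a RowDescription ("T") before any DataRow ("D"). The Python sketch below illustrates the ordering rule the message complains about. It is a simplified illustration, not libpq's actual logic, and all function names are hypothetical.

```python
import struct


def frame(msg_type, payload=b""):
    """Build one backend message: type byte, int32 length (includes itself), payload."""
    return msg_type + struct.pack("!I", len(payload) + 4) + payload


def split_messages(buf):
    """Yield (type, payload) for each complete message in buf."""
    pos = 0
    while pos + 5 <= len(buf):
        msg_type = buf[pos:pos + 1]
        (length,) = struct.unpack("!I", buf[pos + 1:pos + 5])
        end = pos + 1 + length
        if length < 4 or end > len(buf):
            break  # malformed or incomplete message; stop parsing
        yield msg_type, buf[pos + 5:end]
        pos = end


def first_orphan_data_row(buf):
    """Return the index of the first 'D' seen without a preceding 'T', else None."""
    have_description = False
    for i, (msg_type, _payload) in enumerate(split_messages(buf)):
        if msg_type == b"T":
            have_description = True
        elif msg_type in (b"C", b"Z"):
            # CommandComplete / ReadyForQuery end the current result.
            have_description = False
        elif msg_type == b"D" and not have_description:
            return i
    return None


good = frame(b"T") + frame(b"D") + frame(b"C") + frame(b"Z")
bad = frame(b"Z") + frame(b"D")
print(first_orphan_data_row(good), first_orphan_data_row(bad))
```

A client that sees the "bad" ordering has either lost part of the byte stream or is reading data that was not meant for it, so a capture of the raw traffic from the failing state would help narrow this down.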





Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Joshua D. Drake


Alvaro Herrera wrote:
> Joshua D. Drake wrote:
>> Tom Lane wrote:
>>> Dave Cramer <[EMAIL PROTECTED]> writes:
>>>> On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
>>>>> Yes they are using a connection pool. A java based one.
>>>> Since java has it's own protocol implementation, this is totally
>>>> unrelated to any libpq error messages.
>>>
>>> Another important point that we've not been given information on:
>>> when pgAdmin/libpq starts failing like this, exactly what is happening
>>> with the connection pool?  Is it still able to issue queries, and
>>> if not what happens exactly?
>> No, when this happens everything stops. The only thing they get back is
>> that message until they reboot the server. The web app (via
>> java/connection pool), pgAdmin both give the same error.
>
> Actually Dave Cramer told me that if the postmaster was stopped and then
> restarted, it would start answering fine again.  Which would make a lot
> of sense.

I already said that ;). The problem IS NOT that we can't restart the
system and get postgresql back. It is that it happens at all.

Joshua D. Drake








Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Alvaro Herrera

Joshua D. Drake wrote:
> Tom Lane wrote:
> >Dave Cramer <[EMAIL PROTECTED]> writes:
> >>On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
> >>>Yes they are using a connection pool. A java based one.
> >
> >>Since java has its own protocol implementation, this is totally
> >>unrelated to any libpq error messages.
> >
> >Another important point that we've not been given information on:
> >when pgAdmin/libpq starts failing like this, exactly what is happening
> >with the connection pool?  Is it still able to issue queries, and
> >if not what happens exactly?
> 
> No, when this happens everything stops. The only thing they get back is 
> that message until they reboot the server. The web app (via 
> java/connection pool), pgAdmin both give the same error.

Actually Dave Cramer told me that if the postmaster was stopped and then
restarted, it would start answering fine again.  Which would make a lot
of sense.

-- 
Alvaro Herrera                        http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Dave Page



-----Original Message-----
From: "Joshua D. Drake" <[EMAIL PROTECTED]>
To: "Joshua D. Drake" <[EMAIL PROTECTED]>; "Tom Lane" <[EMAIL PROTECTED]>; 
"Merlin Moncure" <[EMAIL PROTECTED]>; "Magnus Hagander" <[EMAIL PROTECTED]>; 
"PostgreSQL-development" 
Sent: 05/09/06 23:27
Subject: Re: [HACKERS] Win32 hard crash problem


> Well except when they are connecting with Pgadmin (which wouldn't go 
> through the connection pool) they get the error as well.

It wouldn't? It's just a 'regular' libpq app. Doesn't say much for the 
connection pool if it cannot handle a simple libpq connection.

/D

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Joshua D. Drake


Tom Lane wrote:
> Dave Cramer <[EMAIL PROTECTED]> writes:
>> On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
>>> Yes they are using a connection pool. A java based one.
>
>> Since java has its own protocol implementation, this is totally
>> unrelated to any libpq error messages.
>
> Another important point that we've not been given information on:
> when pgAdmin/libpq starts failing like this, exactly what is happening
> with the connection pool?  Is it still able to issue queries, and
> if not what happens exactly?

No, when this happens everything stops. The only thing they get back is
that message until they reboot the server. The web app (via
java/connection pool), pgAdmin both give the same error.

Which, now that I think about it, seems odd if the message is coming from
libpq, yes?

Sincerely,

Joshua D. Drake

> regards, tom lane







---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Joshua D. Drake


Alvaro Herrera wrote:
> Joshua D. Drake wrote:
>> Alvaro Herrera wrote:
>>> Joshua D. Drake wrote:
>>>> Alvaro Herrera wrote:
>>>>> What I've been wondering all along is whether they are using a
>>>>> connection pool.
>>>>
>>>> Yes they are using a connection pool. A java based one.
>>>
>>> It's quite possible that it's the connection pool that gets confused,
>>> and not PostgreSQL itself.  It would be interesting if they change the
>>> connection setting when the "hang" next occurs, to point directly to
>>> PostgreSQL bypassing the connection pool.
>>
>> Well except when they are connecting with Pgadmin (which wouldn't go
>> through the connection pool) they get the error as well.
>
> Are you assuming, or did they/you verify that this is indeed the case?
> I see no reason to assume that pgAdmin can't connect via a pool.

Verified. They do not connect to the connection pool for pgadmin.

Although I would think pgadmin might have problems connecting to a java
based pool. If I recall, (I could be cranked) JDBC apps can't use pgpool
for example.


Sincerely,

Joshua D. Drake






---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Alvaro Herrera

Joshua D. Drake wrote:
> Alvaro Herrera wrote:
> >Tom Lane wrote:
> >>"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
> >Fail in what way. Hang, not connect, or get an error msg?
> >>>Just verified with customer. Once the problem occurs the first time, the 
> >>>customer will continually get the same error message for each subsequent 
> >>>connection attempt:
> >>>server sent data ("D" message) without prior row description ("T" 
> >>>message)
> >>During the connection attempt?  I don't think libpq can report that
> >>message until it tries to do a regular query (might be wrong though).
> >>Is the client using some application that's going to issue a query
> >>immediately on connecting?
> >
> >What I've been wondering all along is whether they are using a
> >connection pool.
> 
> Yes they are using a connection pool. A java based one.

It's quite possible that it's the connection pool that gets confused,
and not PostgreSQL itself.  It would be interesting if they change the
connection setting when the "hang" next occurs, to point directly to
PostgreSQL bypassing the connection pool.

OTOH the connection pool may be the thing with the TickCounter problem.
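
One way to take the pool out of the picture when the hang next occurs is to probe the PostgreSQL port directly. The following is a minimal Python sketch under stated assumptions: the function name `listener_accepts` is made up for illustration, and the host and port of the affected server are not given in this thread (5432 is only PostgreSQL's default port).

```python
import socket


def listener_accepts(host, port, timeout=5.0):
    """Return True if a plain TCP connection to host:port can be opened.

    This only shows that the operating system accepted the connection;
    it does not prove that the postmaster is actually servicing it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling `listener_accepts("localhost", 5432)` on the affected machine during a hang would at least separate "nothing is listening" from "something accepts the connection but the session then misbehaves". It complements, rather than replaces, pointing a real client directly at PostgreSQL.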

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Tom Lane

Dave Cramer <[EMAIL PROTECTED]> writes:
> On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
>> Yes they are using a connection pool. A java based one.

> Since java has its own protocol implementation, this is totally
> unrelated to any libpq error messages.

Another important point that we've not been given information on:
when pgAdmin/libpq starts failing like this, exactly what is happening
with the connection pool?  Is it still able to issue queries, and
if not what happens exactly?

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Alvaro Herrera

Joshua D. Drake wrote:
> Alvaro Herrera wrote:
> >Joshua D. Drake wrote:
> >>Alvaro Herrera wrote:

> >>>What I've been wondering all along is whether they are using a
> >>>connection pool.
> >>Yes they are using a connection pool. A java based one.
> >
> >It's quite possible that it's the connection pool that gets confused,
> >and not PostgreSQL itself.  It would be interesting if they change the
> >connection setting when the "hang" next occurs, to point directly to
> >PostgreSQL bypassing the connection pool.
> 
> Well except when they are connecting with Pgadmin (which wouldn't go 
> through the connection pool) they get the error as well.

Are you assuming, or did they/you verify that this is indeed the case?
I see no reason to assume that pgAdmin can't connect via a pool.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Dave Cramer



On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
> Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>>>>>> Fail in what way. Hang, not connect, or get an error msg?
>>>> Just verified with customer. Once the problem occurs the first
>>>> time, the customer will continually get the same error message
>>>> for each subsequent connection attempt:
>>>> server sent data ("D" message) without prior row description
>>>> ("T" message)
>>>
>>> During the connection attempt?  I don't think libpq can report that
>>> message until it tries to do a regular query (might be wrong
>>> though).
>>> Is the client using some application that's going to issue a query
>>> immediately on connecting?
>>
>> What I've been wondering all along is whether they are using a
>> connection pool.
>
> Yes they are using a connection pool. A java based one.

Since java has its own protocol implementation, this is totally
unrelated to any libpq error messages.


While I've not personally used the pool in question (c3p0) my  
understanding is that it is pretty robust.


Personally, I'm betting on some windows TCP/IP weirdness here.

Dave


Sincerely,

Joshua D. Drake






---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Joshua D. Drake


Alvaro Herrera wrote:
> Joshua D. Drake wrote:
>> Alvaro Herrera wrote:
>>> Tom Lane wrote:
>>>> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>>>>>>> Fail in what way. Hang, not connect, or get an error msg?
>>>>> Just verified with customer. Once the problem occurs the first time, the
>>>>> customer will continually get the same error message for each subsequent
>>>>> connection attempt:
>>>>> server sent data ("D" message) without prior row description ("T"
>>>>> message)
>>>>
>>>> During the connection attempt?  I don't think libpq can report that
>>>> message until it tries to do a regular query (might be wrong though).
>>>> Is the client using some application that's going to issue a query
>>>> immediately on connecting?
>>>
>>> What I've been wondering all along is whether they are using a
>>> connection pool.
>>
>> Yes they are using a connection pool. A java based one.
>
> It's quite possible that it's the connection pool that gets confused,
> and not PostgreSQL itself.  It would be interesting if they change the
> connection setting when the "hang" next occurs, to point directly to
> PostgreSQL bypassing the connection pool.

Well except when they are connecting with Pgadmin (which wouldn't go
through the connection pool) they get the error as well.

Joshua D. Drake

> OTOH the connection pool may be the thing with the TickCounter problem.








---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Jeremy Drake

On Tue, 5 Sep 2006, Joshua D. Drake wrote:

> Right, but "just took a reboot to fix it" isn't very confidence inspiring ;)

Are you kidding?  This is standard procedure for troubleshooting Windows
problems :)

--
The world is coming to an end.  Please log off.

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Joshua D. Drake


Alvaro Herrera wrote:
> Tom Lane wrote:
>> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>>>>> Fail in what way. Hang, not connect, or get an error msg?
>>
>>> Just verified with customer. Once the problem occurs the first time, the
>>> customer will continually get the same error message for each subsequent
>>> connection attempt:
>>
>>> server sent data ("D" message) without prior row description ("T" message)
>>
>> During the connection attempt?  I don't think libpq can report that
>> message until it tries to do a regular query (might be wrong though).
>> Is the client using some application that's going to issue a query
>> immediately on connecting?
>
> What I've been wondering all along is whether they are using a
> connection pool.

Yes they are using a connection pool. A java based one.

Sincerely,

Joshua D. Drake





---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Joshua D. Drake


Tom Lane wrote:
> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>>>> Fail in what way. Hang, not connect, or get an error msg?
>
>> Just verified with customer. Once the problem occurs the first time, the
>> customer will continually get the same error message for each subsequent
>> connection attempt:
>
>> server sent data ("D" message) without prior row description ("T" message)
>
> During the connection attempt?  I don't think libpq can report that
> message until it tries to do a regular query (might be wrong though).
> Is the client using some application that's going to issue a query
> immediately on connecting?

Well, windows ;) Customer says that they double click pgadmin and they
get that message. I have informed them on how to increase to debug5 and
hopefully we get something from that, of course it will likely be 24.85
days from now ;)

> It would be useful to turn on log_connections and log_statement (and
> perhaps crank log_min_messages all the way up to DEBUG5) to see if we
> can get anything in the postmaster log giving a hint what actually
> happens here.  A TCP sniff of the connection attempt traffic would be
> pretty useful too.

Sincerely,

Joshua D. Drake





---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Alvaro Herrera

Tom Lane wrote:
> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
> >>> Fail in what way. Hang, not connect, or get an error msg?
> 
> > Just verified with customer. Once the problem occurs the first time, the 
> > customer will continually get the same error message for each subsequent 
> > connection attempt:
> 
> > server sent data ("D" message) without prior row description ("T" message)
> 
> During the connection attempt?  I don't think libpq can report that
> message until it tries to do a regular query (might be wrong though).
> Is the client using some application that's going to issue a query
> immediately on connecting?

What I've been wondering all along is whether they are using a
connection pool.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Tom Lane

"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>>> Fail in what way. Hang, not connect, or get an error msg?

> Just verified with customer. Once the problem occurs the first time, the 
> customer will continually get the same error message for each subsequent 
> connection attempt:

> server sent data ("D" message) without prior row description ("T" message)

During the connection attempt?  I don't think libpq can report that
message until it tries to do a regular query (might be wrong though).
Is the client using some application that's going to issue a query
immediately on connecting?
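
For readers unfamiliar with the message: in the version 3 frontend/backend protocol, a RowDescription ("T") message precedes the DataRow ("D") messages of a result set, and the error text reports a "D" arriving without one. The following Python sketch illustrates that kind of ordering check. It is a simplified illustration, not libpq's actual code: the function name is hypothetical and only four message types are tracked.

```python
def check_message_order(message_types):
    """Scan backend message type codes and report an out-of-order DataRow.

    Codes used: "T" RowDescription, "D" DataRow,
    "C" CommandComplete, "Z" ReadyForQuery.
    Returns an error string, or None if the ordering is acceptable.
    """
    have_row_description = False
    for code in message_types:
        if code == "T":
            have_row_description = True
        elif code == "D":
            if not have_row_description:
                return ('server sent data ("D" message) without prior '
                        'row description ("T" message)')
        elif code in ("C", "Z"):
            # The result set is finished; the next one needs its own "T".
            have_row_description = False
    return None


print(check_message_order("TDDCZ"))  # None
print(check_message_order("DCZ"))
```

The second call returns the error text, which fits the point made above that such a message can only arise once a query result is being read, not during the connection handshake itself.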

It would be useful to turn on log_connections and log_statement (and
perhaps crank log_min_messages all the way up to DEBUG5) to see if we
can get anything in the postmaster log giving a hint what actually
happens here.  A TCP sniff of the connection attempt traffic would be
pretty useful too.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Joshua D. Drake




> Josh failed to answer the most important question though:

Sorry.

>>> Subsequent connections to the database will fail (such as pgAdmin)
>>> and Windows must be completely rebooted.
>>
>> Fail in what way. Hang, not connect, or get an error msg?

Just verified with customer. Once the problem occurs the first time, the
customer will continually get the same error message for each subsequent
connection attempt:

server sent data ("D" message) without prior row description ("T" message)

> regards, tom lane







---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Tom Lane

"Merlin Moncure" <[EMAIL PROTECTED]> writes:
> On 9/5/06, Joshua D. Drake <[EMAIL PROTECTED]> wrote:
>> Magnus Hagander wrote:
>>> What do you mean by this? It doesn't start upon reboot? What is needed
>>> to make it start?
>> 
>> It means that postgresql doesn't recover on its own. On linux if a
>> backend crashes all of PostgreSQL will restart and come back up if it can.
>> 
>> On Win32 it doesn't.

> it does for me, at least for me when I used to work with windows :).
> I think it just doesn't restart for this particular type of crash.

As best I can tell, Josh isn't describing a crash at all.  Something
(possibly in the TCP stack) has locked up, but there's no way for the
postmaster to know there's anything wrong, and probably no way for the
postmaster to fix it if it did know.  Restarting backends certainly
isn't going to fix a communication problem.

Josh failed to answer the most important question though:

>> Subsequent connections to the database will fail (such as pgAdmin)
>> and Windows must be completely rebooted.
> 
> Fail in what way. Hang, not connect, or get an error msg?

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Magnus Hagander

> >> PostgreSQL will also not recover on its own (e.g; auto restart and 
> >> roll through the logs).
> > 
> > What do you mean by this? It doesn't start upon reboot? 
> What is needed 
> > to make it start?
> 
> It means that postgresql doesn't recover on its own. On linux 
> if a backend crashes all of PostgreSQL will restart and come 
> back up if it can.
> 
> On Win32 it doesn't.

Ah, I thought you meant that the database recovery process (that runs
after a crash) failed and lost data. But it's not data-loss then, it
just took a reboot to fix it?

I think we're somehow seeing a complete postmaster hang, where it's
either not able to kill off the backends as required, or just not
capable of accepting new connections after that. Which makes a
stacktrace from the postmaster the most interesting one to look at.

//Magnus

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Joshua D. Drake


Magnus Hagander wrote:
>>>> PostgreSQL will also not recover on its own (e.g; auto restart and
>>>> roll through the logs).
>>>
>>> What do you mean by this? It doesn't start upon reboot? What is needed
>>> to make it start?
>>
>> It means that postgresql doesn't recover on its own. On linux
>> if a backend crashes all of PostgreSQL will restart and come
>> back up if it can.
>>
>> On Win32 it doesn't.
>
> Ah, I thought you meant that the database recovery process (that runs
> after a crash) failed and lost data. But it's not data-loss then, it
> just took a reboot to fix it?

Right, but "just took a reboot to fix it" isn't very confidence inspiring ;)

> I think we're somehow seeing a complete postmaster hang, where it's
> either not able to kill off the backends as required, or just not
> capable of accepting new connections after that. Which makes a
> stacktrace from the postmaster the most interesting one to look at.

I have asked the customer to also look and see if there was one
particular process that was eating CPU via Task Manager and see if
that process can be killed. If that process can be killed and postgresql
comes back clean, then that is a step.

However, debugging this beast is a pain. I take it mingw doesn't have a
gdb we can use?

> //Magnus







---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Merlin Moncure

On 9/5/06, Joshua D. Drake <[EMAIL PROTECTED]> wrote:
> Magnus Hagander wrote:
>> What do you mean by this? It doesn't start upon reboot? What is needed
>> to make it start?
>
> It means that postgresql doesn't recover on its own. On linux if a
> backend crashes all of PostgreSQL will restart and come back up if it can.
>
> On Win32 it doesn't.

it does for me, at least for me when I used to work with windows :).
I think it just doesn't restart for this particular type of crash.  I
had a couple of similarly weird undetectable windows problems that I
could never quite figure out until I got hired by another company and
left that monster behind for good.

merlin

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match

Re: [HACKERS] Win32 hard crash problem

2006-09-05 Thread Joshua D. Drake


Magnus Hagander wrote:
> Oops, going backwards through the mails it seems :)
>
>> Subsequent connections to the database will fail (such as pgAdmin)
>> and Windows must be completely rebooted.
>
> Fail in what way. Hang, not connect, or get an error msg?
>
>> PostgreSQL will also not recover on its own (e.g; auto restart and
>> roll through the logs).
>
> What do you mean by this? It doesn't start upon reboot? What is needed
> to make it start?

It means that postgresql doesn't recover on its own. On linux if a
backend crashes all of PostgreSQL will restart and come back up if it can.

On Win32 it doesn't.

Joshua D. Drake

> //Magnus









---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] Win32 hard crash problem

2006-09-01 Thread Zeugswetter Andreas DCP SD


> >> My bet is something depending on GetTickCount to measure elapsed time
> >> (and no, it's not used in the core Postgres code, but you've got
> >> plenty of other possible culprits in that stack).
> 
> > This doesn't quite make sense. The only reason we have to reboot is 
> > because PostgreSQL no longer responds. The system itself is fine.
> 
> The Windows kernel may still work, but that doesn't mean that 
> everything Postgres depends on still works.

It may be a listen socket that is not responding, possibly because of a
handle leak. Next time it blocks, look at the handle counts (e.g. with
handle.exe from Sysinternals).

You could also watch the handle count now with Task Manager and see if it
increases constantly (handle.exe shows you the details).
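
As an illustration of what "increases constantly" could mean in practice, here is a small Python sketch that checks a series of handle counts noted down by hand at regular intervals. The function name and the sample numbers are hypothetical; nothing here reads handle counts from Windows itself.

```python
def grows_steadily(samples, min_growth=1):
    """Return True if every sample exceeds the previous one by min_growth or more."""
    if len(samples) < 2:
        return False
    return all(later - earlier >= min_growth
               for earlier, later in zip(samples, samples[1:]))


# Hypothetical handle counts for one process, one reading per day.
print(grows_steadily([410, 455, 498, 560]))  # True
print(grows_steadily([410, 395, 420, 405]))  # False
```

A count that only ever rises across many readings would be consistent with a leak, while one that fluctuates around a stable level would not; the threshold for "many" is a judgment call.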

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] Win32 hard crash problem

2006-09-01 Thread Magnus Hagander

Oops, going backwards through the mails it seems :)

> Subsequent connections to the database will fail (such as pgAdmin)
> and Windows must be completely rebooted.

Fail in what way. Hang, not connect, or get an error msg?

> PostgreSQL will also not recover on its own (e.g; auto restart and
> roll through the logs).

What do you mean by this? It doesn't start upon reboot? What is needed
to make it start?


//Magnus



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] Win32 hard crash problem

2006-09-01 Thread Magnus Hagander

> >> My bet is something depending on GetTickCount to measure elapsed time
> >> (and no, it's not used in the core Postgres code, but you've got
> >> plenty of other possible culprits in that stack).
> 
> > This doesn't quite make sense. The only reason we have to reboot is
> > because PostgreSQL no longer responds. The system itself is fine.
> 
> The Windows kernel may still work, but that doesn't mean that
> everything Postgres depends on still works.  I'm wondering about
> (a) the TCP stack (and that includes 3rd party firewalls and such,
> not only the core Windows code); (b) timing or threading stuff
> inside the application that's using libpq, which the only thing we
> know about so far is that it's *not* JDBC/Hibernate.

How about getting a simple backtrace from a couple of the stuck postgres
processes? And from the postmaster which should be accepting new
connections... Or does that also hang completely?

How to get one? Well, since we don't have the MSVC build yet (yeah,
yeah, eventually), you can only get a semi-backtrace that only looks at
exported symbols. You can get this using process explorer (thread tab,
click stack), using WinDBG or using Visual Studio (you'll need VS 2005,
and you need to check the option for "Load DLL exports" in
options->debugging->native).


Oh, btw, if there is a 3rd party firewall on the box the standard
recommendation of uninstalling it definitely sounds like a good plan :-)

//Magnus


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] Win32 hard crash problem

2006-08-31 Thread Dave Page

On 31/8/06 23:34, "Joshua D. Drake" <[EMAIL PROTECTED]> wrote:

> Sure it is a registry entry... so we could (in theory) shrink that quite
> a bit.. However I am confused, if we don't use it, what that is
> connecting to libpq would trigger it?
> 
> I know they are using pgAdmin...

Are they using pgAgent? That's the only part of pgAdmin that does any
sort of timing I can think of offhand (other than the query tool timer which
only runs whilst a query is running). Even then it's done indirectly through
wxWidgets so I'm not familiar with how it's implemented at the win32 API
level.

If it were pgAdmin (or any other client) though, how would that lock up the
entire PostgreSQL instance, but not the rest of the server?

Regards, Dave.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] Win32 hard crash problem

2006-08-31 Thread Tom Lane

"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
> Which means we need to start stripping it down. Gah, I actually argued 
> *for* this port too. Next time slap me.

Well, before you invest a lot of time barking up what might be the wrong
tree, there is a very easy test you can use to check the GetTickCount
theory: keep closer track of time-since-boot on the affected systems.
If that idea is right, it won't be "two or three weeks" between boot and
problems appearing, it'll be 24.85 days on the nose.  It shouldn't take
much except waiting to either falsify the theory or make it look pretty
convincing.
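
The uptime check described above could be sketched as follows in Python. This is only an illustration: the function name, the one-hour tolerance and the timestamps are assumptions, not data from the affected systems.

```python
from datetime import datetime, timedelta

# Time for a millisecond counter to exceed the signed 32-bit range (2**31 ms).
SIGNED_OVERFLOW = timedelta(milliseconds=2**31)


def matches_tick_overflow(boot, failure, tolerance=timedelta(hours=1)):
    """Return True if the uptime at failure is within `tolerance` of 2**31 ms."""
    uptime = failure - boot
    return abs(uptime - SIGNED_OVERFLOW) <= tolerance


# Hypothetical timestamps, purely for illustration.
boot = datetime(2006, 9, 1, 0, 0, 0)
print(SIGNED_OVERFLOW)  # 24 days, 20:31:23.648000
print(matches_tick_overflow(boot, boot + timedelta(days=24, hours=20, minutes=31)))  # True
print(matches_tick_overflow(boot, boot + timedelta(days=17)))  # False
```

If recorded boot and failure times from several incidents all pass such a check, the theory looks convincing; a single failure at, say, 17 days of uptime would count against it.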

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] Win32 hard crash problem

2006-08-31 Thread Joshua D. Drake


Alvaro Herrera wrote:
> Dave Cramer wrote:
>> On 31-Aug-06, at 6:01 PM, Tom Lane wrote:
>>> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>>>> Tom Lane wrote:
>>>>> BTW, are you sure this is coming from JDBC?  I see the exact same
>>>>> message text in libpq:
>>>>> libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n"));
>>>>> Maybe the JDBC driver uses the identical message wording but my
>>>>> thought is to look for something going through libpq.
>>>>
>>>> The error is server side. I was just describing the environment.
>>>
>>> I can entirely assure you that that error message is not present in
>>> the server code.
>>
>> Well that's even more interesting because it doesn't exist in the
>> jdbc driver either.
>
> Conclusion: they are using libpq in some form, so you should investigate
> that.
>
> Is there a way to alter the tick counter, so that a test run does not
> need to take the full 3 weeks?

Sure it is a registry entry... so we could (in theory) shrink that quite
a bit.. However I am confused, if we don't use it, what that is
connecting to libpq would trigger it?

I know they are using pgAdmin...

Joshua D. Drake





---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] Win32 hard crash problem

2006-08-31 Thread Joshua D. Drake


Tom Lane wrote:
> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>>> My bet is something depending on GetTickCount to measure elapsed time
>>> (and no, it's not used in the core Postgres code, but you've got plenty
>>> of other possible culprits in that stack).
>
>> This doesn't quite make sense. The only reason we have to reboot is
>> because PostgreSQL no longer responds. The system itself is fine.
>
> The Windows kernel may still work, but that doesn't mean that everything
> Postgres depends on still works.  I'm wondering about (a) the TCP stack
> (and that includes 3rd party firewalls and such, not only the core
> Windows code); (b) timing or threading stuff inside the application
> that's using libpq, which the only thing we know about so far is that
> it's *not* JDBC/Hibernate.

/me grumbles in a not so polite way about Windows.

Which means we need to start stripping it down. Gah, I actually argued
*for* this port too. Next time slap me.

Joshua D. Drake

> regards, tom lane







---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] Win32 hard crash problem

2006-08-31 Thread Tom Lane

"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>> My bet is something depending on GetTickCount to measure elapsed time
>> (and no, it's not used in the core Postgres code, but you've got plenty
>> of other possible culprits in that stack).

> This doesn't quite make sense. The only reason we have to reboot is 
> because PostgreSQL no longer responds. The system itself is fine.

The Windows kernel may still work, but that doesn't mean that everything
Postgres depends on still works.  I'm wondering about (a) the TCP stack
(and that includes 3rd party firewalls and such, not only the core
Windows code); (b) timing or threading stuff inside the application
that's using libpq, which the only thing we know about so far is that
it's *not* JDBC/Hibernate.

regards, tom lane


Re: [HACKERS] Win32 hard crash problem

2006-08-31 Thread Joshua D. Drake



> That sounds suspiciously close to the time from boot to wraparound of
> GetTickCount:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp
> M$ list this as 49 days but that's the time to wrap clear around to
> zero; the value overflows and goes negative in 24.85 days if I've
> done the math correctly.
>
> My bet is something depending on GetTickCount to measure elapsed time
> (and no, it's not used in the core Postgres code, but you've got plenty
> of other possible culprits in that stack).

This doesn't quite make sense. The only reason we have to reboot is
because PostgreSQL no longer responds. The system itself is fine.

Sincerely,

Joshua D. Drake



Re: [HACKERS] Win32 hard crash problem

2006-08-31 Thread Alvaro Herrera

Dave Cramer wrote:
> 
> On 31-Aug-06, at 6:01 PM, Tom Lane wrote:
> 
> >"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
> >>Tom Lane wrote:
> >>>BTW, are you sure this is coming from JDBC?  I see the exact same
> >>>message text in libpq:
> >>>libpq_gettext("server sent data (\"D\" message) without prior row  
> >>>description (\"T\" message)\n"));
> >>>Maybe the JDBC driver uses the identical message wording but my
> >>>thought is to look for something going through libpq.
> >
> >>The error is server side. I was just describing the environment.
> >
> >I can entirely assure you that that error message is not present in
> >the server code.
> Well that's even more interesting because it doesn't exist in the  
> jdbc driver either.

Conclusion: they are using libpq in some form, so you should investigate
that.

Is there a way to alter the tick counter, so that a test run does not
need to take the full 3 weeks?
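
One way to avoid waiting weeks, sketched here under the assumption that the suspect code can be rebuilt for testing, is to have it read time through an injectable tick source rather than calling the system counter directly. A test can then start the counter just below the 32-bit limit and step it across the wrap. This does not change the real Windows tick counter, and every name below (tick_source_fn, sim_timer, the fake_tick_* helpers) is hypothetical.

```c
#include <stdint.h>

/* Hypothetical: any function that returns a millisecond tick count. */
typedef uint32_t (*tick_source_fn)(void);

/* A fake clock that a test can position anywhere in the 32-bit range. */
static uint32_t fake_ticks;

uint32_t fake_tick_source(void)        { return fake_ticks; }
void     fake_tick_set(uint32_t t)     { fake_ticks = t; }
void     fake_tick_advance(uint32_t d) { fake_ticks += d; /* wraps modulo 2^32 */ }

/* Code under test: a timeout that takes its clock as a parameter. */
typedef struct {
    tick_source_fn now;        /* where to read the tick count from */
    uint32_t       start;      /* tick value when the timer was armed */
    uint32_t       timeout_ms; /* how long until it expires */
} sim_timer;

void sim_timer_start(sim_timer *t, tick_source_fn now, uint32_t timeout_ms)
{
    t->now = now;
    t->start = now();
    t->timeout_ms = timeout_ms;
}

/* Nonzero once timeout_ms milliseconds have elapsed since the start. */
int sim_timer_expired(const sim_timer *t)
{
    /* Unsigned subtraction stays correct across one counter wrap. */
    return (uint32_t)(t->now() - t->start) >= t->timeout_ms;
}
```

In production the same code would be handed a thin wrapper around the real counter; in a test, fake_tick_set() can place the clock 100 ms before the wrap so the interesting case is reached immediately.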

-- 
Alvaro Herrera                        http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] Win32 hard crash problem

2006-08-31 Thread Dave Cramer



On 31-Aug-06, at 6:01 PM, Tom Lane wrote:

> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> BTW, are you sure this is coming from JDBC?  I see the exact same
>>> message text in libpq:
>>> libpq_gettext("server sent data (\"D\" message) without prior row
>>> description (\"T\" message)\n"));
>>> Maybe the JDBC driver uses the identical message wording but my
>>> thought is to look for something going through libpq.
>>
>> The error is server side. I was just describing the environment.
>
> I can entirely assure you that that error message is not present in
> the server code.

Well that's even more interesting because it doesn't exist in the
jdbc driver either.

Dave

> regards, tom lane


Re: [HACKERS] Win32 hard crash problem

2006-08-31 Thread Joshua D. Drake


Tom Lane wrote:
> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> BTW, are you sure this is coming from JDBC?  I see the exact same
>>> message text in libpq:
>>> libpq_gettext("server sent data (\"D\" message) without prior row
>>> description (\"T\" message)\n"));
>>> Maybe the JDBC driver uses the identical message wording but my thought
>>> is to look for something going through libpq.
>>
>> The error is server side. I was just describing the environment.
>
> I can entirely assure you that that error message is not present in the
> server code.

Ok let me be more clear. The message is being thrown via PostgreSQL. I am
getting per the message I posted..

http://projects.commandprompt.com/public/pgsql/browser/trunk/pgsql/src/interfaces/libpq/fe-protocol2.c?rev=22194
http://projects.commandprompt.com/public/pgsql/browser/trunk/pgsql/src/interfaces/libpq/fe-protocol3.c?rev=25989

It is in libpq and the protocol not the backend that is giving me the
message. When I said server, I was referring to postgresql inclusively,
not the driver that was actually connecting.

Sincerely,

Joshua D. Drake

> regards, tom lane





Re: [HACKERS] Win32 hard crash problem

2006-08-31 Thread Tom Lane

"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> BTW, are you sure this is coming from JDBC?  I see the exact same
>> message text in libpq:
>> libpq_gettext("server sent data (\"D\" message) without prior row 
>> description (\"T\" message)\n"));
>> Maybe the JDBC driver uses the identical message wording but my thought
>> is to look for something going through libpq.

> The error is server side. I was just describing the environment.

I can entirely assure you that that error message is not present in the
server code.
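
The wording of the message describes a client-side ordering check: a data row ("D") arrived before any row description ("T") for the current result. The sketch below illustrates that kind of check over a sequence of one-byte message type codes. It is a simplified illustration and not libpq's actual code; the function and enum names are hypothetical, and the only protocol facts assumed are the type codes 'T' (row description), 'D' (data row), 'C' (command complete) and 'Z' (ready for query).

```c
#include <stddef.h>

typedef enum {
    CHECK_OK,
    CHECK_DATA_WITHOUT_DESCRIPTION  /* 'D' seen with no prior 'T' */
} check_result;

/*
 * Walk a sequence of message type codes and report whether any data row
 * arrives before a row description for the current result set.
 */
check_result check_row_ordering(const char *types, size_t n)
{
    int have_description = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        switch (types[i]) {
        case 'T':               /* row description opens a result set */
            have_description = 1;
            break;
        case 'D':               /* data row is only valid after a 'T' */
            if (!have_description)
                return CHECK_DATA_WITHOUT_DESCRIPTION;
            break;
        case 'C':               /* command complete closes the result set */
        case 'Z':               /* ready for query: back to idle */
            have_description = 0;
            break;
        default:                /* other message types are ignored here */
            break;
        }
    }
    return CHECK_OK;
}
```

A check of this shape fails if the client loses track of where it is in the byte stream, which is why the discussion turns to the client library and whatever sits between it and the server.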

regards, tom lane


Re: [HACKERS] Win32 hard crash problem

2006-08-31 Thread Joshua D. Drake


Tom Lane wrote:
> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>> Dave Cramer and I have dealt with a company today running 8.1.4 on
>> Windows 2003. The application is a web app that runs via JDBC/Hibernate.
>> The application will function perfectly for about 2/3 weeks and then we
>> will receive a:
>> "server sent data (\"D\" message) without prior row description (\"T\"
>> message)");
>
> That sounds suspiciously close to the time from boot to wraparound of
> GetTickCount:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp
> M$ list this as 49 days but that's the time to wrap clear around to
> zero; the value overflows and goes negative in 24.85 days if I've
> done the math correctly.
>
> My bet is something depending on GetTickCount to measure elapsed time
> (and no, it's not used in the core Postgres code, but you've got plenty
> of other possible culprits in that stack).
>
> BTW, are you sure this is coming from JDBC?  I see the exact same
> message text in libpq:
>  libpq_gettext("server sent data (\"D\" message) without prior row description
> (\"T\" message)\n"));
> Maybe the JDBC driver uses the identical message wording but my thought
> is to look for something going through libpq.

The error is server side. I was just describing the environment.

>> Any thoughts?
>
> I suppose "get a real operating system" won't go over well?

Tried that, I got nervous laughter on the other end ;)

Joshua D. Drake

> regards, tom lane





Re: [HACKERS] Win32 hard crash problem

2006-08-31 Thread Tom Lane

"Joshua D. Drake" <[EMAIL PROTECTED]> writes:
> Dave Cramer and I have dealt with a company today running 8.1.4 on 
> Windows 2003. The application is a web app that runs via JDBC/Hibernate.
> The application will function perfectly for about 2/3 weeks and then we 
> will receive a:
> "server sent data (\"D\" message) without prior row description (\"T\" 
> message)");

That sounds suspiciously close to the time from boot to wraparound of
GetTickCount:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp
M$ list this as 49 days but that's the time to wrap clear around to
zero; the value overflows and goes negative in 24.85 days if I've
done the math correctly.
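
Both figures follow from treating the tick count as a 32-bit counter of milliseconds: read as a signed value it leaves the positive range after 2^31 ms, and as an unsigned value it wraps to zero after 2^32 ms. A small sketch of the arithmetic (the function name is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Milliseconds in one day: 24 * 60 * 60 * 1000. */
#define MS_PER_DAY 86400000.0

/*
 * Days until a millisecond counter with the given number of usable bits
 * runs out of range. Use 31 for a value read as signed 32-bit, 32 for
 * the full unsigned range.
 */
double days_until_overflow(unsigned bits)
{
    uint64_t range_ms;

    assert(bits >= 1 && bits <= 32);
    range_ms = (uint64_t)1 << bits;     /* 2^bits milliseconds */
    return (double)range_ms / MS_PER_DAY;
}
```

days_until_overflow(31) comes to roughly 24.86 days and days_until_overflow(32) to roughly 49.71 days, in line with the 24.85 and 49 quoted above.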

My bet is something depending on GetTickCount to measure elapsed time
(and no, it's not used in the core Postgres code, but you've got plenty
of other possible culprits in that stack).
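
As an illustration of the kind of bug being suspected here (not code from any component in this stack), the sketch below contrasts a timeout check that compares absolute tick values with one that compares elapsed time. The function names are hypothetical; the tick values are assumed to be 32-bit millisecond counts.

```c
#include <stdint.h>

/*
 * Elapsed milliseconds between two readings of a 32-bit tick counter.
 * Unsigned subtraction is modulo 2^32, so the result is still right
 * when the counter has wrapped once between the two readings.
 */
uint32_t elapsed_ms(uint32_t start, uint32_t now)
{
    return now - start;
}

/*
 * Fragile: computes an absolute deadline and compares against it.
 * If start + timeout_ms wraps to a small value, the timeout appears
 * to have expired immediately.
 */
int timed_out_naive(uint32_t start, uint32_t now, uint32_t timeout_ms)
{
    uint32_t deadline = start + timeout_ms;  /* may wrap past zero */
    return now >= deadline;
}

/* Wrap-safe: compares the elapsed interval instead of absolute values. */
int timed_out_safe(uint32_t start, uint32_t now, uint32_t timeout_ms)
{
    return elapsed_ms(start, now) >= timeout_ms;
}
```

With start = 0xFFFFFF00, a 1000 ms timeout, and a reading taken 16 ms later, the naive check reports a timeout while the elapsed-time check does not. Code written the naive way can run correctly for weeks and then misbehave only near the wrap point.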

BTW, are you sure this is coming from JDBC?  I see the exact same
message text in libpq:
 libpq_gettext("server sent data (\"D\" message) without prior row description 
(\"T\" message)\n"));
Maybe the JDBC driver uses the identical message wording but my thought
is to look for something going through libpq.

> Any thoughts?

I suppose "get a real operating system" won't go over well?

regards, tom lane
