Hi!

We were able to reproduce the start-up problem on Windows 2003. After a
reboot, the postmaster cleans up old temporary files as part of its start-up
sequence, and this step took a little over 4 minutes, delaying the writing
of line 6 onwards into the PID file. This delay caused pg_ctl to time out,
leaving behind an orphaned postgres.exe process (which eventually spawns
many other postgres.exe processes). But since pg_ctl itself isn't running
after the timeout, Windows thinks the service isn't running. A subsequent
attempt to start the service using pg_ctl then complains that the existing
lock file is still in use by one of the postgres.exe processes that were
spawned earlier.

We have confirmed that the file system cache plays a role. In one test, we
followed the reboot by traversing the file system under the data directory
with the Cygwin "find" command; pg_ctl then did not time out and the server
started up fine, suggesting that the cleanup is much faster when the file
system is cached.
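
For anyone who wants to repeat the test without Cygwin, here is a rough
Python equivalent of the "find" traversal described above. It is only a
sketch: the function name is made up for illustration, and it assumes that
walking the tree and stat-ing every entry is enough to warm the cache the
way "find" did in our test.

```python
import os


def warm_directory_cache(root):
    """Walk `root` and stat every entry; return the number of entries visited.

    Hypothetical helper: it only reads directory listings and file metadata,
    it never opens, modifies, or deletes anything.
    """
    visited = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.stat(os.path.join(dirpath, name))
            except OSError:
                # The entry vanished or is unreadable; skip it.
                continue
            visited += 1
    return visited
```

It would be called with the data directory path before starting the
service, e.g. warm_directory_cache(data_dir), where data_dir is whatever
directory the server is configured to use.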

Any ideas on fixing this start-up delay in postmaster?

Could the cleanup be moved elsewhere, specifically to somewhere after the
PID file has been completely written, so that pg_ctl doesn't time out?

Any other suggestions for working around this problem?


Thanks,

Deepak


On Tue, May 8, 2012 at 12:13 PM, deepak <deepak...@gmail.com> wrote:

>
>
> On Tue, May 8, 2012 at 3:09 AM, Alban Hertroys <haram...@gmail.com> wrote:
>
>> On 8 May 2012, at 24:34, deepak wrote:
>>
>> > Hi,
>> >
>> > On Windows 2008, sometimes the server fails to start due to an existing
>> "postmaster.pid" file.
>> >
>> > I tried rebooting a few times and even force shutting down the server,
>> and it started up fine.
>> > It seems to be a race-condition of sorts in the code that detects
>> whether the process with PID
>> > in the file is running or not.
>>
>> No, it means that postgres wasn't shut down properly when Windows shut
>> down. Removing the pid-file is one of the last things the shut-down
>> procedure does. The file is used to prevent 2 instances of the same server
>> running on the same data-directory.
>>
>> If it's a race-condition, it's probably one in Microsoft's shutdown code.
>> I've seen similar problems with Outlook mailboxes on a network directory;
>> Windows unmounts the remote file-systems before Outlook finished updating
>> its files under that mount point, so Outlook throws an error message and
>> Windows doesn't shut down because of that.
>>
>> I don't suppose that pid-file is on a remote file-system?
>>
> No, it's local.
>
>
>> > Does anyone have this same problem?  Any way to fix it besides
>> removing the PID file
>> > manually each time the server complains about this?
>>
>>
>> You could probably script removal of the pid file if its creation date is
>> before the time the system started booting up.
>>
>>
> Thanks, it looks like the code already seems to overwrite an old pid file
> if no other process is using it (if I understand the code correctly, it
> just echoes a byte onto a pipe to detect this).
>
> Still, I can't see under what conditions this occurs, but I have seen it
> happen a couple of times, just that I don't know how to predictably
> reproduce the problem.
>
>
> --
> Deepak
>
>
