Move to Linux. :-) In our case, everything but the database servers
were already Linux so it was an easy choice. Things have been rock
solid since then.
Once things get stuck, I don't think there is an alternative besides
"stop -m immediate". However, since the problem is caused by an idle
back
That might be one cause (or it might otherwise exacerbate the problem),
but it isn't the only cause. We weren't running anti-virus software and
neither is Thomas. Unfortunately with the last go around, we
collectively ran out of ideas before an underlying cause could be
identified.
Pete
>>> Tom
The same problem exists in 8.1 too. See this thread
http://archives.postgresql.org/pgsql-bugs/2006-04/msg00177.php
Tom and Magnus tracked down a cause, but I don't think a fix was ever
implemented.
FWIW, we were bitten by the fsync problem which you noticed too.
Unfortunately we were never ab
>>> On 16.06.2006 at 23:21:21, in message
<[EMAIL PROTECTED]>, Bruce Momjian
wrote:
> Yea. Were you using WAL archiving? We will have a fix in 8.1.5 to
> prevent multiple archivers from starting. Perhaps that was a cause.
>
Not at the time, no. The rename in question was just a regular WAL
s
Really? If there was a patch, I missed it.
My recollection is that there was general agreement about this
particular problem (see, for example,
http://archives.postgresql.org/pgsql-bugs/2006-04/msg00189.php ), but
things kind of trailed off after that without a resolution.
As far as the complete
Test server has SP1. This bug has only bitten us twice (and never in a
testing environment) so it's hard to say (from our experience).
The successful pgbench runs are definitely good to see though.
Pete
>>> "Magnus Hagander" <[EMAIL PROTECTED]> 05/02/06 10:14 am >>>
Great news. One question though
With the patch applied, I let an inhouse stress test run for several
hours and it completed without incident. I also ran two runs of pgbench
with 50 connections x 1000 transactions and one run of 50 connections x
5000 transactions. All completed successfully. (Test server is a dual
Xeon with Hyp
Sure.
I should note that we're moving to Linux for our production servers so
our interest in the Windows port is waning (at least for the time
being). In particular, the stuck WAL segment rename problem has
occasionally been rather a pain in the neck.
As long as we still have Windows test server
This is probably somewhat superfluous, but we had another one of these
incidents last night whose details confirm your explanation here.
[2006-04-21 00:22:19.500 ] 2452 LOG: could not rename file
"pg_xlog/0001011A004C" to
"pg_xlog/0001011A0071", continuing to try
the autovac
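For anyone collecting these incidents from several servers, a minimal sketch of pulling the fields out of a rename-failure entry follows. The pattern is modeled only on the log excerpt quoted above; the function name and the assumption that other entries share this layout are illustrative, not something the log format guarantees.

```python
import re
from datetime import datetime

# Pattern modeled on the quoted log excerpt. Whether every rename-failure
# entry follows this exact layout is an assumption.
RENAME_RE = re.compile(
    r'\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s*\]\s+'
    r'(?P<pid>\d+)\s+LOG:\s+could not rename file\s+'
    r'"(?P<src>[^"]+)"\s+to\s+"(?P<dst>[^"]+)"'
)


def parse_rename_failure(text):
    """Return (timestamp, pid, source, target), or None if nothing matches."""
    match = RENAME_RE.search(text)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    return timestamp, int(match.group("pid")), match.group("src"), match.group("dst")


# The entry wraps across lines in the message, so the sample does too.
sample = (
    '[2006-04-21 00:22:19.500 ] 2452 LOG: could not rename file\n'
    '"pg_xlog/0001011A004C" to\n'
    '"pg_xlog/0001011A0071", continuing to try'
)
result = parse_rename_failure(sample)
```

Matching on whitespace rather than on single spaces lets the same pattern handle entries that were wrapped by a mail client.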
I'm not sure that's the whole story. "Server #3" had backends with
handles to the old relfilenode. It didn't have any fsync errors and the
old relfilenode was apparently successfully deleted (or at least it
wasn't visible in the file system anymore). That's the part of the
morning's investigatio
Here's the evidence from this morning. I have to admit I'm not really
sure what to make of it though.
Pete
The fsync / Permission denied errors occurred on 2 of 3 active servers
for the 7 am CLUSTER cycle.
Server #1 (with fsync errors):
- Both old and new relfilenodes are still visible with a
It happens often enough and the episodes last long enough that grabbing
a handle dump while this is going on should be easily done.
Regarding the Win32 error code, backend/storage/file/fd.c calls
_commit().
http://msdn2.microsoft.com/en-us/library/17618685(VS.80).aspx It looks
like it is alread
Does that also explain why an attempt to make a new connection just
hangs?
One other thing regarding that is that the connection attempt seems to
kinda, sorta succeed. It never makes it as far as a command prompt, but
on the "stop -m immediate", psql does print the "HINT: In a moment you
should be a
It's definitely possible. Both failures occurred around the end of the
business day as update traffic would have been coasting to a stop. The
middle tier never closes a connection unless it's forced to (e.g. as a
result of a query error, connection going away, etc.)
Pete
>>> Tom Lane <[EMAIL
They are local.
Pete
>>> "Harald Armin Massa" <[EMAIL PROTECTED]> 04/18/06 4:35 pm
>>>
"G" - is that really a LOKAL drive at that server, or rather some NAS
or similar?
---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will igno
Unfortunately, it's not that simple. It would be straightforward to
track down if it were.
In response to other questions:
It's Postgres 8.1.3 running on Windows 2003 Server. No anti-virus
software is installed. The servers are essentially bare except for the
OS and Postgres.
We have "handle
Hi all,
In the last couple of days, we've been bitten (a couple of times, on
different servers) by an apparent glitch or bad interaction in the
Windows implementation of rename().
The relevant log message is:
[2006-04-17 16:49:22.583 ] 2252 LOG: could not rename file
"pg_xlog/0001010A00
The error messages refer to the old relfilenode (in 3 out of 3
occurrences today).
Pete
>>> Tom Lane <[EMAIL PROTECTED]> 04/14/06 2:41 am >>>
OK ... but what's still unclear is whether the failures are occurring
against the old relfilenode (the one just removed by the CLUSTER) or
the new one just
Apparently we got lucky on all four servers with the latest cycle, so
nothing to report. Load (both reading and writing) is quite light today
so perhaps the bug is only triggered under a higher load. It seems the
problem typically doesn't show up on weekends either (when load is also
much lighter
The culprit is CLUSTER. There is a batch file which runs CLUSTER
against six, relatively small (60k rows between them) tables at 7am,
1pm, and 9pm. Following is the list of dates and hours when the
"Permission denied" errors showed up. They match up to a tee (although
the error apparently sometime
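The comparison described above (error times against the 7am, 1pm, and 9pm CLUSTER runs) can be sketched roughly as follows. This is only an illustration: the function name is hypothetical, the sample timestamps are made up rather than taken from the actual logs, and matching on the hour alone is an assumption about how close to the batch start the errors appear.

```python
from datetime import datetime

# Hours of the CLUSTER batch runs named in the message: 7am, 1pm, 9pm.
CLUSTER_HOURS = {7, 13, 21}


def split_by_schedule(timestamps, hours=CLUSTER_HOURS):
    """Split timestamps into those inside and outside the scheduled hours."""
    on_schedule, off_schedule = [], []
    for ts in timestamps:
        if ts.hour in hours:
            on_schedule.append(ts)
        else:
            off_schedule.append(ts)
    return on_schedule, off_schedule


# Made-up sample times, not entries from the real server logs.
sample_errors = [
    datetime(2006, 4, 12, 7, 0, 41),
    datetime(2006, 4, 12, 13, 1, 5),
    datetime(2006, 4, 12, 16, 30, 0),
]
on_schedule, off_schedule = split_by_schedule(sample_errors)
```

Anything that lands in the second list would be an error that the CLUSTER schedule alone does not account for.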
Sounds good.
There is nothing sensitive in DbTranImageStatus_pkey so if you decide
you want it after all, it's there for the asking.
Pete
>>> Tom Lane <[EMAIL PROTECTED]> 04/13/06 3:30 am >>>
Oh, never mind ... I've sussed it.
It turns out we've been getting rather huge numbers of "Permission
denied" errors relating to fsync so perhaps it wasn't really a precursor
to the crash as I'd previously thought.
I've pasted in a complete list following this email covering the time
span from 3/20 to 4/6. The number in the firs
The middle tier transaction log indicates this record was inserted into
the county database at 2006-03-31 21:00:32.94. It would have hit the
central databases sometime thereafter (more or less immediately if all
was well).
The Panel table contains some running statistics which are updated
frequen
Per the DBAs, there hadn't been any recent crashes before last Thursday.
A "vacuum analyze verbose" discovered the problem early Thursday
morning. After the PANIC, the database never came back up (the
heap_clean_redo: no block / full_page_writes = off problem).
One thing that seems strange to me
I can't find any duplicates?!?
The query
select starelid, staattnum, ctid, xmin, xmax, cmin, cmax
from pg_statistic p1
where (select count(*) from pg_statistic p2 where
p1.starelid = p2.starelid and p1.staattnum = p2.staattnum) > 1
doesn't turn up anything. Nor does dumping
select starelid,
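To see what the duplicate check in the first query above reports when duplicates do exist, here is a minimal sketch run against a throwaway table. It uses Python's sqlite3 module with a stand-in table that holds only the two key columns; this is not PostgreSQL's catalog, SQLite's rowid stands in for ctid, and the sample rows are made up.

```python
import sqlite3


def find_duplicate_keys(conn):
    """Return rows whose (starelid, staattnum) pair occurs more than once."""
    return conn.execute(
        """
        SELECT starelid, staattnum, rowid
        FROM pg_statistic AS p1
        WHERE (SELECT count(*) FROM pg_statistic AS p2
               WHERE p1.starelid = p2.starelid
                 AND p1.staattnum = p2.staattnum) > 1
        ORDER BY starelid, staattnum, rowid
        """
    ).fetchall()


# Stand-in table: same name and key columns, nothing else from the catalog.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pg_statistic (starelid INTEGER, staattnum INTEGER)")
conn.executemany(
    "INSERT INTO pg_statistic VALUES (?, ?)",
    [(100, 1), (100, 2), (200, 1), (100, 2)],  # made-up rows, one repeated key
)
duplicates = find_duplicate_keys(conn)
```

An empty result from this form of query means every key pair is unique among the rows the scan can see, which is consistent with the report above.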
| S02
WA| R | Warrant issued| t |
10 | t | F | S04
(19 rows)
bigbird=#
>>> Tom Lane <[EMAIL PROTECTED]> 04/12/06 5:00 am >>>
"Peter Brant" <[EMAIL PROTECTED]> writes:
>
Also, when I tried to run a database-wide VACUUM ANALYZE VERBOSE it
actually doesn't even get to Panel and errors out with:
INFO: analyzing "public.MaintCode"
INFO: "MaintCode": scanned 1 of 1 pages, containing 19 live rows and 0
dead rows; 19 rows in sample, 19 estimated total rows
ERROR: dupl
The index data isn't sensitive, but I should ask for permission
nonetheless. I'll send over the '-f' output tomorrow morning.
Pete
***
* PostgreSQL File/Block Formatted Dump Utility - Version 8.1.1
*
* File: 180571
* Options used:
2 = 635
(gdb) print rightsib
No symbol "rightsib" in current context.
(gdb) print nextoffset
$3 = 87
(gdb) print leftsib
$4 = 636
(gdb) print rightsib
No symbol "rightsib" in current context.
(gdb) continue
Continuing.
Program exited with code 03.
(gdb)
Pete
>>> Tom
Sorry about the delay in responding. We had a bit of difficulty with
the test machine. Kevin is also on vacation this week.
The problem is repeatable with a VACUUM. I've found the offending
block. A (partial) pg_filedump of that block is pasted in below. I'm a
little lost as to what the next
Hi all,
We were bitten by this same bug over the weekend (PG 8.1.3 / Windows
Server 2003). The exact error was:
FATAL: semctl(170688872, 6, SETVAL, 0) failed: A non-blocking socket
operation could not be completed immediately.
The start of the errors corresponded to a nightly "vacuum analyze"