On Wed, Jan 28, 2015 at 12:47 PM, Tom Lane wrote:
> Merlin Moncure writes:
>> ...hm, I spoke too soon. So I deleted everything, and booted up a new
>> instance 9.4 vanilla with asserts on and took no other action.
>> Applying the script with no data activity fails an assertion every
>> single tim
Merlin Moncure writes:
> ...hm, I spoke too soon. So I deleted everything, and booted up a new
> instance 9.4 vanilla with asserts on and took no other action.
> Applying the script with no data activity fails an assertion every
> single time:
> TRAP: FailedAssertion("!(flags & 0x0010)", File: "d
On Wed, Jan 28, 2015 at 8:05 AM, Merlin Moncure wrote:
> On Thu, Jan 22, 2015 at 3:50 PM, Merlin Moncure wrote:
>> I still haven't categorically ruled out pl/sh yet; that's something to
>> keep in mind.
>
> Well, after bisection proved not to be fruitful, I replaced the pl/sh
> calls with dummy c
On Thu, Jan 22, 2015 at 3:50 PM, Merlin Moncure wrote:
> I still haven't categorically ruled out pl/sh yet; that's something to
> keep in mind.
Well, after bisection proved not to be fruitful, I replaced the pl/sh
calls with dummy calls that approximated the same behavior and the
problem went awa
On Thu, Jan 22, 2015 at 03:50:03PM -0600, Merlin Moncure wrote:
> Quick update: not done yet, but I'm making consistent progress, with
> several false starts. (for example, I had a .conf problem with the
> new dynamic shared memory setting and git merrily bisected down to the
> introduction of th
On Thu, Jan 22, 2015 at 1:50 PM, Merlin Moncure wrote:
> Quick update: not done yet, but I'm making consistent progress, with
> several false starts. (for example, I had a .conf problem with the
> new dynamic shared memory setting and git merrily bisected down to the
> introduction of the featur
On Thu, Jan 22, 2015 at 1:50 PM, Merlin Moncure wrote:
>
> So far, the 'nasty' damage seems to generally if not always follow a
> checksum failure and the checksum failures are always numerically
> adjacent. For example:
>
> [cds2 12707 2015-01-22 12:51:11.032 CST 2754]WARNING: page
> verificat
On Fri, Jan 16, 2015 at 5:20 PM, Peter Geoghegan wrote:
> On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure wrote:
>> ISTM the next step is to bisect the problem down over the weekend in
>> order to narrow the search. If that doesn't turn up anything
>> productive I'll look into taking other s
On Fri, Jan 16, 2015 at 6:21 AM, Heikki Linnakangas
wrote:
> It looks very much like a page has for some reason been moved to a
> different block number. And that's exactly what Peter found out in his
> investigation too; an index page was mysteriously copied to a different
> block with ident
On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure wrote:
> ISTM the next step is to bisect the problem down over the weekend in
> order to narrow the search. If that doesn't turn up anything
> productive I'll look into taking other steps.
That might be the quickest way to do it, provided you c
On Fri, Jan 16, 2015 at 8:22 AM, Andres Freund wrote:
> Hi,
>
> On 2015-01-16 08:05:07 -0600, Merlin Moncure wrote:
>> On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan wrote:
>> > On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure wrote:
>> >> Running this test on another set of hardware to verify
On Fri, Jan 16, 2015 at 8:22 AM, Andres Freund wrote:
> Is there any chance you can package this somehow so that others can run
> it locally? It looks hard to find the actual bug here without adding
> instrumentation to postgres.
That's possible but involves a lot of complexity in the setup be
Hi,
On 2015-01-16 08:05:07 -0600, Merlin Moncure wrote:
> On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan wrote:
> > On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure wrote:
> >> Running this test on another set of hardware to verify -- if this
> >> turns out to be a false alarm which it may very
On 01/16/2015 04:05 PM, Merlin Moncure wrote:
On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan wrote:
On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure wrote:
Running this test on another set of hardware to verify -- if this
turns out to be a false alarm which it may very well be, I can only
of
On Fri, Jan 16, 2015 at 8:05 AM, Merlin Moncure wrote:
> On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan wrote:
>> On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure wrote:
>>> Running this test on another set of hardware to verify -- if this
>>> turns out to be a false alarm which it may very wel
On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan wrote:
> On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure wrote:
>> Running this test on another set of hardware to verify -- if this
>> turns out to be a false alarm which it may very well be, I can only
>> offer my apologies! I've never had a new
On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure wrote:
> Running this test on another set of hardware to verify -- if this
> turns out to be a false alarm which it may very well be, I can only
> offer my apologies! I've never had a new drive fail like that, in
> that manner. I'll burn the other
On Thu, Jan 15, 2015 at 4:03 PM, Merlin Moncure wrote:
> On Thu, Jan 15, 2015 at 1:32 PM, Merlin Moncure wrote:
>> Since it's possible the database is a loss, do you see any value in
>> bootstrapping it again with checksums turned on? One point of note
>> is that this is a brand spanking new SS
On Thu, Jan 15, 2015 at 1:32 PM, Merlin Moncure wrote:
> Since it's possible the database is a loss, do you see any value in
> bootstrapping it again with checksums turned on? One point of note
> is that this is a brand spanking new SSD, maybe we need to rule out
> hardware based corruption?
hm!
On Thu, Jan 15, 2015 at 1:15 PM, Andres Freund wrote:
> Hi,
>
>> The plot thickens! I looped the test, still stock 9.4 as of this time
>> and went to lunch. When I came back, the database was in recovery
>> mode. Here is the rough sequence of events.
>>
>
> Whoa. That looks scary. Did you see (s
On 2015-01-15 20:15:42 +0100, Andres Freund wrote:
> > WARNING: did not find subXID 14955 in MyProc
> > CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35
> > during exception cleanup
> > WARNING: you don't own a lock of type RowExclusiveLock
> > CONTEXT: PL/pgSQL function cdsrecon
Hi,
> The plot thickens! I looped the test, still stock 9.4 as of this time
> and went to lunch. When I came back, the database was in recovery
> mode. Here is the rough sequence of events.
>
Whoa. That looks scary. Did you see (some of) those errors before? Most
of them should have been emitte
On Thu, Jan 15, 2015 at 8:02 AM, Merlin Moncure wrote:
> On Thu, Jan 15, 2015 at 6:04 AM, Heikki Linnakangas
> wrote:
>> On 01/15/2015 03:23 AM, Peter Geoghegan wrote:
>>>
>>> So now the question is: how did that inconsistency arise? It didn't
>>> necessarily arise at the time of the (presumed) s
On Thu, Jan 15, 2015 at 6:02 AM, Merlin Moncure wrote:
> Question: Coming in this morning I did an immediate restart and logged
> into the database and queried pg_class via index. Everything was
> fine, and the leftright verify returns nothing. How did it repair
> itself without a reindex?
May
On Thu, Jan 15, 2015 at 6:04 AM, Heikki Linnakangas
wrote:
> On 01/15/2015 03:23 AM, Peter Geoghegan wrote:
>>
>> So now the question is: how did that inconsistency arise? It didn't
>> necessarily arise at the time of the (presumed) split of block 2 to
>> create 9. It could be that the opaque area
On 01/15/2015 03:23 AM, Peter Geoghegan wrote:
So now the question is: how did that inconsistency arise? It didn't
necessarily arise at the time of the (presumed) split of block 2 to
create 9. It could be that the opaque area was changed by something
else, some time later. I'll investigate more.
On Wed, Jan 14, 2015 at 8:50 PM, Peter Geoghegan wrote:
> I am mistaken on one detail here - blocks 2 and 9 are actually fully
> identical. I still have no idea why, though.
So, I've looked at it in more detail and it appears that the page of
block 2 split at some point, thereby creating a new pa
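The observation that blocks 2 and 9 are fully identical can be checked byte for byte once both page images are available. The following is only a minimal sketch, not code from this thread: `first_page_difference` and `TOY_BLCKSZ` are hypothetical names, the size is assumed to be the default 8192-byte block size, and how the two page images are obtained is left outside the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed page size: PostgreSQL's default block size is 8192 bytes. */
#define TOY_BLCKSZ 8192

/*
 * Compare two raw page images of TOY_BLCKSZ bytes each.
 * Returns -1 when the pages are byte-for-byte identical, otherwise the
 * offset of the first byte that differs.
 */
static long
first_page_difference(const unsigned char *a, const unsigned char *b)
{
    size_t i;

    assert(a != NULL && b != NULL);

    /* Fast path: memcmp returns 0 only when every byte matches. */
    if (memcmp(a, b, TOY_BLCKSZ) == 0)
        return -1;

    /* Slow path: locate the first mismatching offset. */
    for (i = 0; i < TOY_BLCKSZ; i++)
    {
        if (a[i] != b[i])
            return (long) i;
    }
    return -1;                  /* not reached when memcmp reported a difference */
}
```

A result of -1 for the images of blocks 2 and 9 would correspond to the "fully identical" case; a non-negative offset shows where the two pages first diverge.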
On Wed, Jan 14, 2015 at 5:23 PM, Peter Geoghegan wrote:
> My immediate observation here is that blocks 2 and 9 have identical
> metadata (from their page opaque area), but partially non-matching
> data items (however, the number of items on each block is consistent
> and correct according to that
On Wed, Jan 14, 2015 at 4:53 PM, Merlin Moncure wrote:
> yeah. via:
> cds2=# \copy (select s as page, (bt_page_items('pg_class_oid_index',
> s)).* from generate_series(1,12) s) to '/tmp/page_items.csv' csv
> header;
My immediate observation here is that blocks 2 and 9 have identical
metadata (fr
On Wed, Jan 14, 2015 at 6:50 PM, Peter Geoghegan wrote:
> This is great, but it's not exactly clear which bt_page_items() page
> is which - some are skipped, but I can't be sure which. Would you mind
> rewriting that query to indicate which block is under consideration by
> bt_page_items()?
yeah.
This is great, but it's not exactly clear which bt_page_items() page
is which - some are skipped, but I can't be sure which. Would you mind
rewriting that query to indicate which block is under consideration by
bt_page_items()?
Thanks
--
Peter Geoghegan
On Wed, Jan 14, 2015 at 6:26 PM, Merlin Moncure wrote:
> On Wed, Jan 14, 2015 at 5:39 PM, Peter Geoghegan wrote:
>> On Wed, Jan 14, 2015 at 3:38 PM, Merlin Moncure wrote:
>>> (gdb) print BufferGetBlockNumber(buf)
>>> $15 = 9
>>>
>>> ..and it stays 9, continuing several times having set breakpoi
On Wed, Jan 14, 2015 at 4:26 PM, Merlin Moncure wrote:
> The index is the oid index on pg_class. Some more info:
>
> *) temp table churn is fairly high. Several dozen get spawned and
> destroyed at the start of a replication run, all at once, due to some
> dodgy coding via dblink. During the re
On Wed, Jan 14, 2015 at 5:39 PM, Peter Geoghegan wrote:
> On Wed, Jan 14, 2015 at 3:38 PM, Merlin Moncure wrote:
>> (gdb) print BufferGetBlockNumber(buf)
>> $15 = 9
>>
>> ..and it stays 9, continuing several times having set breakpoint.
>
>
> And the index involved? I'm pretty sure that this is
On Wed, Jan 14, 2015 at 3:38 PM, Merlin Moncure wrote:
> (gdb) print BufferGetBlockNumber(buf)
> $15 = 9
>
> ..and it stays 9, continuing several times having set breakpoint.
And the index involved? I'm pretty sure that this is an internal page, no?
--
Peter Geoghegan
On Wed, Jan 14, 2015 at 2:32 PM, Peter Geoghegan wrote:
> On Wed, Jan 14, 2015 at 12:24 PM, Peter Geoghegan wrote:
>> Could you write some code to print out the block number (i.e.
>> "BlockNumber blkno") if there are more than, say, 5 retries within
>> _bt_moveright()?
>
> Obviously I mean that t
On Wed, Jan 14, 2015 at 12:24 PM, Peter Geoghegan wrote:
> Could you write some code to print out the block number (i.e.
> "BlockNumber blkno") if there are more than, say, 5 retries within
> _bt_moveright()?
Obviously I mean that the block number should be printed, no matter
whether or not the P
On Wed, Jan 14, 2015 at 11:49 AM, Merlin Moncure wrote:
> so it looks like nobody ever exits from _bt_moveright. any last
> requests before I start bisecting down?
Could you write some code to print out the block number (i.e.
"BlockNumber blkno") if there are more than, say, 5 retries within
_b
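The kind of instrumentation requested here can be illustrated outside the server. The following is a minimal sketch under stated assumptions, not a patch to `_bt_moveright()`: `ToyPage`, `walk_right_logged`, `INVALID_BLOCK` and `RETRY_LOG_THRESHOLD` are hypothetical names, and `BlockNumber` is redefined locally rather than taken from the PostgreSQL headers. It walks right-sibling links in a toy array of pages and, once more than 5 moves have been made, prints the block number on every further move.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;           /* local stand-in, not the server's header */

#define INVALID_BLOCK        ((BlockNumber) 0xFFFFFFFF)
#define RETRY_LOG_THRESHOLD  5          /* "more than, say, 5 retries" */

/* Hypothetical toy page: only the right-sibling link is modelled. */
typedef struct ToyPage
{
    BlockNumber right_link;
} ToyPage;

/*
 * Follow right links starting at `start` until a page has no valid right
 * sibling or `max_steps` moves have been made.  After more than
 * RETRY_LOG_THRESHOLD moves, each block moved to is written to `log`
 * (pass NULL to disable logging).  Returns the number of moves made.
 */
static int
walk_right_logged(const ToyPage *pages, BlockNumber npages,
                  BlockNumber start, int max_steps, FILE *log)
{
    BlockNumber blkno = start;
    int         retries = 0;

    assert(start < npages);

    while (retries < max_steps)
    {
        BlockNumber next = pages[blkno].right_link;

        if (next == INVALID_BLOCK || next >= npages)
            break;              /* reached the rightmost page */

        blkno = next;
        retries++;

        if (retries > RETRY_LOG_THRESHOLD && log != NULL)
            fprintf(log, "moveright retry %d: blkno = %u\n",
                    retries, (unsigned) blkno);
    }
    return retries;
}
```

In this toy model, a page whose right link points back at itself never terminates the walk on its own, so only the `max_steps` cap stops it, and the log then shows the same block number repeating.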
On Wed, Jan 14, 2015 at 9:49 AM, Andres Freund wrote:
> On 2015-01-14 09:47:19 -0600, Merlin Moncure wrote:
>> On Wed, Jan 14, 2015 at 9:30 AM, Andres Freund
>> wrote:
>> > If you gdb in, and type 'fin' a couple times, to wait till the function
>> > finishes, is there actually any progress? I'm
On Wed, Jan 14, 2015 at 7:22 AM, Merlin Moncure wrote:
> I'll try to pull commits that Peter suggested and see if that helps
> (I'm getting ready to bring the database down). I can send the code
> off-list if you guys think it'd help.
Thanks for the code!
I think it would be interesting to see
On 2015-01-14 09:47:19 -0600, Merlin Moncure wrote:
> On Wed, Jan 14, 2015 at 9:30 AM, Andres Freund wrote:
> > If you gdb in, and type 'fin' a couple times, to wait till the function
> > finishes, is there actually any progress? I'm wondering whether it's
> > just many catalog accesses + contenti
On Wed, Jan 14, 2015 at 9:30 AM, Andres Freund wrote:
> If you gdb in, and type 'fin' a couple times, to wait till the function
> finishes, is there actually any progress? I'm wondering whether it's
> just many catalog accesses + contention, or some other
> problem. Alternatively set a breakpoint
On 2015-01-14 09:22:45 -0600, Merlin Moncure wrote:
> On Wed, Jan 14, 2015 at 9:11 AM, Andres Freund wrote:
> > On 2015-01-14 10:05:01 -0500, Tom Lane wrote:
> >> Merlin Moncure writes:
> >> > On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane wrote:
> >> >> What are the autovac processes doing (accordin
On Wed, Jan 14, 2015 at 9:11 AM, Andres Freund wrote:
> On 2015-01-14 10:05:01 -0500, Tom Lane wrote:
>> Merlin Moncure writes:
>> > On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane wrote:
>> >> What are the autovac processes doing (according to pg_stat_activity)?
>>
>> > pid,running,waiting,query
>> >
On 2015-01-14 10:13:32 -0500, Tom Lane wrote:
> Merlin Moncure writes:
> > Yes, it is pg_class is coming from LockBufferForCleanup (). As you
> > can see above, it has a shorter runtime. So it was killed off once
> > about a half hour ago which did not free up the logjam. However, AV
> > spaw
Andres Freund writes:
> On 2015-01-14 10:05:01 -0500, Tom Lane wrote:
>> Hah, I suspected as much. Is that the one that's stuck in
>> LockBufferForCleanup, or the other one that's got a similar backtrace
>> to all the user processes?
> Do you have a theory? Right now it primarily looks like cont
Merlin Moncure writes:
> Yes, it is pg_class is coming from LockBufferForCleanup (). As you
> can see above, it has a shorter runtime. So it was killed off once
> about a half hour ago which did not free up the logjam. However, AV
> spawned it again and now it does not respond to cancel.
Int
On 2015-01-14 10:05:01 -0500, Tom Lane wrote:
> Merlin Moncure writes:
> > On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane wrote:
> >> What are the autovac processes doing (according to pg_stat_activity)?
>
> > pid,running,waiting,query
> > 7105,00:28:40.789221,f,autovacuum: VACUUM ANALYZE pg_catalog.
On Wed, Jan 14, 2015 at 9:05 AM, Tom Lane wrote:
> Merlin Moncure writes:
>> On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane wrote:
>>> What are the autovac processes doing (according to pg_stat_activity)?
>
>> pid,running,waiting,query
>> 7105,00:28:40.789221,f,autovacuum: VACUUM ANALYZE pg_catalog.p
Merlin Moncure writes:
> On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane wrote:
>> What are the autovac processes doing (according to pg_stat_activity)?
> pid,running,waiting,query
> 7105,00:28:40.789221,f,autovacuum: VACUUM ANALYZE pg_catalog.pg_class
Hah, I suspected as much. Is that the one that'
On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane wrote:
> Merlin Moncure writes:
>> There were seven processes with that exact backtrace (except
>> that randomly they are sleeping in the spinloop). Something else
>> interesting: autovacuum has been running all night as well. Unlike
>> the oth
Merlin Moncure writes:
> There were seven processes with that exact backtrace (except
> that randomly they are sleeping in the spinloop). Something else
> interesting: autovacuum has been running all night as well. Unlike
> the other processes however, cpu utilization does not register on
On Wed, Jan 14, 2015 at 8:03 AM, Merlin Moncure wrote:
> Here's a backtrace:
>
> #0 0x00750a97 in spin_delay ()
> #1 0x00750b19 in s_lock ()
> #2 0x00750844 in LWLockRelease ()
> #3 0x0073 in LockBuffer ()
> #4 0x004b2db4 in _bt_relandgetbuf ()
> #5
On Tue, Jan 13, 2015 at 7:24 PM, Peter Geoghegan wrote:
> On Tue, Jan 13, 2015 at 3:54 PM, Merlin Moncure wrote:
>> Some more information on what's happening:
>> This is a ghetto logical replication engine that migrates data from
>> SQL Server to postgres, consolidating a sharded database into a sing
On Tue, Jan 13, 2015 at 3:54 PM, Merlin Moncure wrote:
> Some more information on what's happening:
> This is a ghetto logical replication engine that migrates data from
> SQL Server to postgres, consolidating a sharded database into a single
> set of tables (of which there are only two). There is onl
On Tue, Jan 13, 2015 at 3:54 PM, Andres Freund wrote:
>> I don't remember seeing _bt_moveright() or _bt_compare() figuring so
>> prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
>> reference to _bt_binsrch() in either profile.
>
> Well, we do a _bt_moveright pretty early on,
On Tue, Jan 13, 2015 at 4:05 PM, Tom Lane wrote:
> I'm not convinced that Peter is barking up the right tree. I'm noticing
> that the profiles seem rather skewed towards parser/planner work; so I
> suspect the contention is probably on access to system catalogs. No
> idea exactly why though.
I
On 2015-01-13 19:05:10 -0500, Tom Lane wrote:
> Merlin Moncure writes:
> > On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan wrote:
> >> In case it isn't clear, I think that the proximate cause here may well
> >> be either one (or both) of commits
> >> efada2b8e920adfdf7418862e939925d2acd1b89 and/
Merlin Moncure writes:
> On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan wrote:
>> In case it isn't clear, I think that the proximate cause here may well
>> be either one (or both) of commits
>> efada2b8e920adfdf7418862e939925d2acd1b89 and/or
>> 40dae7ec537c5619fc93ad602c62f37be786d161. Probably
On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan wrote:
> On Tue, Jan 13, 2015 at 3:50 PM, Merlin Moncure wrote:
>>> I don't remember seeing _bt_moveright() or _bt_compare() figuring so
>>> prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
>>> reference to _bt_binsrch() in ei
On Tue, Jan 13, 2015 at 5:42 PM, Andres Freund wrote:
> On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote:
>> On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund
>> wrote:
>> > On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
>> >> I'm inclined to think that this is a livelock, and so the proble
On Tue, Jan 13, 2015 at 3:50 PM, Merlin Moncure wrote:
>> I don't remember seeing _bt_moveright() or _bt_compare() figuring so
>> prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
>> reference to _bt_binsrch() in either profile.
>
> hm, this is hand compiled now, I bet the sym
On 2015-01-13 15:49:33 -0800, Peter Geoghegan wrote:
> On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund wrote:
> > My guess is rather that it's contention on the freelist lock via
> > StrategyGetBuffer's. I've seen profiles like this due to exactly that
> > before - and it fits to parallel loading q
On Tue, Jan 13, 2015 at 5:49 PM, Peter Geoghegan wrote:
> On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund wrote:
>> My guess is rather that it's contention on the freelist lock via
>> StrategyGetBuffer's. I've seen profiles like this due to exactly that
>> before - and it fits to parallel loading
On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund wrote:
> My guess is rather that it's contention on the freelist lock via
> StrategyGetBuffer's. I've seen profiles like this due to exactly that
> before - and it fits to parallel loading quite well.
I'm not saying you're wrong, but the breakdown of
On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote:
> On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund wrote:
> > On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
> >> I'm inclined to think that this is a livelock, and so the problem
> >> isn't evident from the structure of the B-Tree, but it ca
On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund wrote:
> On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
>> I'm inclined to think that this is a livelock, and so the problem
>> isn't evident from the structure of the B-Tree, but it can't hurt to
>> check.
>
> My guess is rather that it's conte
On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
> I'm inclined to think that this is a livelock, and so the problem
> isn't evident from the structure of the B-Tree, but it can't hurt to
> check.
My guess is rather that it's contention on the freelist lock via
StrategyGetBuffer's. I've seen p
On Tue, Jan 13, 2015 at 2:29 PM, Merlin Moncure wrote:
> On my workstation today (running vanilla 9.4.0) I was testing some new
> code that does aggressive parallel loading to a couple of tables.
Could you give more details, please? For example, I'd like to see
representative data, or at least th
On Tue, Jan 13, 2015 at 4:33 PM, Andres Freund wrote:
> Hi,
>
> On 2015-01-13 16:29:51 -0600, Merlin Moncure wrote:
>> On my workstation today (running vanilla 9.4.0) I was testing some new
>> code that does aggressive parallel loading to a couple of tables. It
>> ran ok several dozen times and fr
Hi,
On 2015-01-13 16:29:51 -0600, Merlin Moncure wrote:
> On my workstation today (running vanilla 9.4.0) I was testing some new
> code that does aggressive parallel loading to a couple of tables. It
> ran ok several dozen times and froze up with no external trigger.
> There were at most 8 active
On my workstation today (running vanilla 9.4.0) I was testing some new
code that does aggressive parallel loading to a couple of tables. It
ran ok several dozen times and froze up with no external trigger.
There were at most 8 active backends that were stuck (the loader is
threaded to a cap) -- eac
Title: Hung backends
Hi,
I'm new to PostgreSQL and have been asked to determine the cause of what appear to be hung processes on FreeBSD after one or more frontend apps crash. I did a lot of searching through the message lists and found a few discussions that seem related, but I was unable to find