On Thu, Jan 22, 2015 at 3:50 PM, Merlin Moncure mmonc...@gmail.com wrote:
I still haven't categorically ruled out pl/sh yet; that's something to
keep in mind.
Well, after bisection proved not to be fruitful, I replaced the pl/sh
calls with dummy calls that approximated the same behavior and the
Merlin Moncure mmonc...@gmail.com writes:
...hm, I spoke too soon. So I deleted everything, and booted up a new
instance 9.4 vanilla with asserts on and took no other action.
Applying the script with no data activity fails an assertion every
single time:
TRAP: FailedAssertion(!(flags
On Wed, Jan 28, 2015 at 8:05 AM, Merlin Moncure mmonc...@gmail.com wrote:
On Thu, Jan 22, 2015 at 3:50 PM, Merlin Moncure mmonc...@gmail.com wrote:
I still haven't categorically ruled out pl/sh yet; that's something to
keep in mind.
Well, after bisection proved not to be fruitful, I replaced
On Wed, Jan 28, 2015 at 12:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Merlin Moncure mmonc...@gmail.com writes:
...hm, I spoke too soon. So I deleted everything, and booted up a new
instance 9.4 vanilla with asserts on and took no other action.
Applying the script with no data activity fails an
On Thu, Jan 22, 2015 at 03:50:03PM -0600, Merlin Moncure wrote:
Quick update: not done yet, but I'm making consistent progress, with
several false starts. (for example, I had a .conf problem with the
new dynamic shared memory setting and git merrily bisected down to the
introduction of the
On Thu, Jan 22, 2015 at 1:50 PM, Merlin Moncure mmonc...@gmail.com wrote:
So far, the 'nasty' damage seems to generally if not always follow a
checksum failure and the checksum failures are always numerically
adjacent. For example:
[cds2 12707 2015-01-22 12:51:11.032 CST 2754]WARNING:
On Thu, Jan 22, 2015 at 1:50 PM, Merlin Moncure mmonc...@gmail.com wrote:
Quick update: not done yet, but I'm making consistent progress, with
several false starts. (for example, I had a .conf problem with the
new dynamic shared memory setting and git merrily bisected down to the
On Fri, Jan 16, 2015 at 5:20 PM, Peter Geoghegan p...@heroku.com wrote:
On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure mmonc...@gmail.com wrote:
ISTM the next step is to bisect the problem down over the weekend in
order to narrow the search. If that doesn't turn up anything
productive
On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote:
On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote:
Running this test on another set of hardware to verify -- if this
turns out to be a false alarm which it may very well be, I can only
offer my
On Fri, Jan 16, 2015 at 8:05 AM, Merlin Moncure mmonc...@gmail.com wrote:
On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote:
On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote:
Running this test on another set of hardware to verify -- if this
turns
On 01/16/2015 04:05 PM, Merlin Moncure wrote:
On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote:
On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote:
Running this test on another set of hardware to verify -- if this
turns out to be a false alarm which
Hi,
On 2015-01-16 08:05:07 -0600, Merlin Moncure wrote:
On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote:
On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote:
Running this test on another set of hardware to verify -- if this
turns out to be a
On Fri, Jan 16, 2015 at 8:22 AM, Andres Freund and...@2ndquadrant.com wrote:
Is there any chance you can package this somehow so that others can run
it locally? It looks hard to find the actual bug here without adding
instrumentation to postgres.
That's possible but involves a lot of
On Fri, Jan 16, 2015 at 8:22 AM, Andres Freund and...@2ndquadrant.com wrote:
Hi,
On 2015-01-16 08:05:07 -0600, Merlin Moncure wrote:
On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote:
On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote:
Running
On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure mmonc...@gmail.com wrote:
ISTM the next step is to bisect the problem down over the weekend in
order to narrow the search. If that doesn't turn up anything
productive I'll look into taking other steps.
That might be the quickest way to do
On Fri, Jan 16, 2015 at 6:21 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
It looks very much like a page has for some reason been moved to a
different block number. And that's exactly what Peter found out in his
investigation too; an index page was mysteriously copied to a
On Thu, Jan 15, 2015 at 6:02 AM, Merlin Moncure mmonc...@gmail.com wrote:
Question: Coming in this morning I did an immediate restart and logged
into the database and queried pg_class via index. Everything was
fine, and the leftright verify returns nothing. How did it repair
itself without
On Thu, Jan 15, 2015 at 8:02 AM, Merlin Moncure mmonc...@gmail.com wrote:
On Thu, Jan 15, 2015 at 6:04 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
On 01/15/2015 03:23 AM, Peter Geoghegan wrote:
So now the question is: how did that inconsistency arise? It didn't
necessarily arise at
On Thu, Jan 15, 2015 at 1:15 PM, Andres Freund and...@2ndquadrant.com wrote:
Hi,
The plot thickens! I looped the test, still stock 9.4 as of this time
and went to lunch. When I came back, the database was in recovery
mode. Here is the rough sequence of events.
Whoa. That looks scary. Did
On 2015-01-15 20:15:42 +0100, Andres Freund wrote:
WARNING: did not find subXID 14955 in MyProc
CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35
during exception cleanup
WARNING: you don't own a lock of type RowExclusiveLock
CONTEXT: PL/pgSQL function
On Thu, Jan 15, 2015 at 1:32 PM, Merlin Moncure mmonc...@gmail.com wrote:
Since it's possible the database is a loss, do you see any value in
bootstrapping it again with checksums turned on? One point of note
is that this is a brand spanking new SSD, maybe we need to rule out
hardware based
On Thu, Jan 15, 2015 at 4:03 PM, Merlin Moncure mmonc...@gmail.com wrote:
On Thu, Jan 15, 2015 at 1:32 PM, Merlin Moncure mmonc...@gmail.com wrote:
Since it's possible the database is a loss, do you see any value in
bootstrapping it again with checksums turned on? One point of note
is that
On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote:
Running this test on another set of hardware to verify -- if this
turns out to be a false alarm which it may very well be, I can only
offer my apologies! I've never had a new drive fail like that, in
that manner. I'll
On Wed, Jan 14, 2015 at 8:50 PM, Peter Geoghegan p...@heroku.com wrote:
I am mistaken on one detail here - blocks 2 and 9 are actually fully
identical. I still have no idea why, though.
So, I've looked at it in more detail and it appears that the page of
block 2 split at some point, thereby
On 01/15/2015 03:23 AM, Peter Geoghegan wrote:
So now the question is: how did that inconsistency arise? It didn't
necessarily arise at the time of the (presumed) split of block 2 to
create 9. It could be that the opaque area was changed by something
else, some time later. I'll investigate more.
On Thu, Jan 15, 2015 at 6:04 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
On 01/15/2015 03:23 AM, Peter Geoghegan wrote:
So now the question is: how did that inconsistency arise? It didn't
necessarily arise at the time of the (presumed) split of block 2 to
create 9. It could be that
Hi,
The plot thickens! I looped the test, still stock 9.4 as of this time
and went to lunch. When I came back, the database was in recovery
mode. Here is the rough sequence of events.
Whoa. That looks scary. Did you see (some of) those errors before? Most
of them should have been emitted
On Wed, Jan 14, 2015 at 9:49 AM, Andres Freund and...@2ndquadrant.com wrote:
On 2015-01-14 09:47:19 -0600, Merlin Moncure wrote:
On Wed, Jan 14, 2015 at 9:30 AM, Andres Freund and...@2ndquadrant.com
wrote:
If you gdb in, and type 'fin' a couple times, to wait till the function
finishes, is
On Wed, Jan 14, 2015 at 4:53 PM, Merlin Moncure mmonc...@gmail.com wrote:
yeah. via:
cds2=# \copy (select s as page, (bt_page_items('pg_class_oid_index',
s)).* from generate_series(1,12) s) to '/tmp/page_items.csv' csv
header;
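A CSV like the one written by the \copy command above can be scanned for pages whose item lists match exactly. The following is only a sketch: it assumes the CSV header carries `page` plus the pageinspect `bt_page_items()` columns `itemoffset`, `ctid`, `itemlen` and `data`, and the sample rows are made up for illustration, not taken from the thread.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple


def identical_pages(csv_text: str) -> List[Tuple[int, ...]]:
    """Return groups of page numbers whose item lists are exactly the same."""
    # Collect the items of each page in file order.
    items: Dict[int, List[Tuple[str, ...]]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        items[int(row["page"])].append(
            (row["itemoffset"], row["ctid"], row["itemlen"], row["data"]))

    # Pages that share an identical item list end up in the same group.
    groups: Dict[Tuple, List[int]] = defaultdict(list)
    for page, page_items in items.items():
        groups[tuple(page_items)].append(page)
    return [tuple(sorted(pages)) for pages in groups.values() if len(pages) > 1]


# Made-up sample in the assumed layout (hypothetical values).
sample = (
    "page,itemoffset,ctid,itemlen,nulls,vars,data\n"
    '2,1,"(0,1)",16,f,f,aa\n'
    '9,1,"(0,1)",16,f,f,aa\n'
    '3,1,"(0,2)",16,f,f,bb\n'
)
duplicates = identical_pages(sample)
```

Against the real file, the call would be `identical_pages(open('/tmp/page_items.csv').read())`. The comparison covers item data only; the page opaque area would have to be compared separately.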
My immediate observation here is that blocks 2 and 9 have
On Wed, Jan 14, 2015 at 6:50 PM, Peter Geoghegan p...@heroku.com wrote:
This is great, but it's not exactly clear which bt_page_items() page
is which - some are skipped, but I can't be sure which. Would you mind
rewriting that query to indicate which block is under consideration by
This is great, but it's not exactly clear which bt_page_items() page
is which - some are skipped, but I can't be sure which. Would you mind
rewriting that query to indicate which block is under consideration by
bt_page_items()?
Thanks
--
Peter Geoghegan
On Wed, Jan 14, 2015 at 2:32 PM, Peter Geoghegan p...@heroku.com wrote:
On Wed, Jan 14, 2015 at 12:24 PM, Peter Geoghegan p...@heroku.com wrote:
Could you write some code to print out the block number (i.e.
BlockNumber blkno) if there are more than, say, 5 retries within
_bt_moveright()?
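The kind of instrumentation requested here can be illustrated with a toy model. This is not PostgreSQL code: `move_right`, the dictionaries standing in for right-links and high keys, and the safety cap are all hypothetical, and the self-referencing right-link on block 9 is assumed purely to make the loop spin, not a finding from the thread.

```python
from typing import Dict, List, Optional

RETRY_WARN_THRESHOLD = 5  # "more than, say, 5 retries"


def move_right(right_link: Dict[int, Optional[int]],
               high_key: Dict[int, int],
               start: int,
               search_key: int,
               max_steps: int = 50) -> List[str]:
    """Toy move-right loop; returns the diagnostic lines it would log."""
    log: List[str] = []
    blkno = start
    retries = 0
    # Keep stepping right while the search key is beyond this page's high key.
    while search_key > high_key[blkno]:
        nxt = right_link[blkno]
        if nxt is None:
            break  # rightmost page, nowhere else to go
        blkno = nxt
        retries += 1
        if retries > RETRY_WARN_THRESHOLD:
            # This is the line the request asks for: report the block number.
            log.append(f"move_right: retry {retries}, blkno {blkno}")
        if retries >= max_steps:
            log.append("move_right: giving up (toy safety cap)")
            break
    return log


# Healthy chain 1 -> 2 -> 3: two steps, nothing logged.
healthy_log = move_right({1: 2, 2: 3, 3: None}, {1: 10, 2: 20, 3: 30},
                         start=1, search_key=25)

# Assumed broken chain where block 9 points back at itself.
corrupt_log = move_right({2: 9, 9: 9}, {2: 100, 9: 100},
                         start=2, search_key=500, max_steps=10)
```

In the broken chain the logged block number never changes once the threshold is passed, which is the pattern such a counter is meant to expose.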
On Wed, Jan 14, 2015 at 4:26 PM, Merlin Moncure mmonc...@gmail.com wrote:
The index is the oid index on pg_class. Some more info:
*) temp table churn is fairly high. Several dozen get spawned and
destroyed at the start of a replication run, all at once, due to some
dodgy coding via dblink.
On Wed, Jan 14, 2015 at 5:39 PM, Peter Geoghegan p...@heroku.com wrote:
On Wed, Jan 14, 2015 at 3:38 PM, Merlin Moncure mmonc...@gmail.com wrote:
(gdb) print BufferGetBlockNumber(buf)
$15 = 9
..and it stays 9, continuing several times having set breakpoint.
And the index involved? I'm
On Wed, Jan 14, 2015 at 3:38 PM, Merlin Moncure mmonc...@gmail.com wrote:
(gdb) print BufferGetBlockNumber(buf)
$15 = 9
..and it stays 9, continuing several times having set breakpoint.
And the index involved? I'm pretty sure that this is an internal page, no?
--
Peter Geoghegan
On Wed, Jan 14, 2015 at 6:26 PM, Merlin Moncure mmonc...@gmail.com wrote:
On Wed, Jan 14, 2015 at 5:39 PM, Peter Geoghegan p...@heroku.com wrote:
On Wed, Jan 14, 2015 at 3:38 PM, Merlin Moncure mmonc...@gmail.com wrote:
(gdb) print BufferGetBlockNumber(buf)
$15 = 9
..and it stays 9,
On Wed, Jan 14, 2015 at 5:23 PM, Peter Geoghegan p...@heroku.com wrote:
My immediate observation here is that blocks 2 and 9 have identical
metadata (from their page opaque area), but partially non-matching
data items (however, the number of items on each block is consistent
and correct
On Wed, Jan 14, 2015 at 5:23 PM, Peter Geoghegan p...@heroku.com wrote:
My immediate observation here is that blocks 2 and 9 have identical
metadata (from their page opaque area), but partially non-matching
data items (however, the number of items on each block is consistent
and correct
Merlin Moncure mmonc...@gmail.com writes:
On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane t...@sss.pgh.pa.us wrote:
What are the autovac processes doing (according to pg_stat_activity)?
pid,running,waiting,query
7105,00:28:40.789221,f,autovacuum: VACUUM ANALYZE pg_catalog.pg_class
Hah, I suspected
Merlin Moncure mmonc...@gmail.com writes:
There were seven processes with that exact backtrace (except
that randomly they are sleeping in the spinloop). Something else
interesting: autovacuum has been running all night as well. Unlike
the other process however, cpu utilization does
On Wed, Jan 14, 2015 at 9:05 AM, Tom Lane t...@sss.pgh.pa.us wrote:
Merlin Moncure mmonc...@gmail.com writes:
On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane t...@sss.pgh.pa.us wrote:
What are the autovac processes doing (according to pg_stat_activity)?
pid,running,waiting,query
On 2015-01-14 09:22:45 -0600, Merlin Moncure wrote:
On Wed, Jan 14, 2015 at 9:11 AM, Andres Freund and...@2ndquadrant.com wrote:
On 2015-01-14 10:05:01 -0500, Tom Lane wrote:
Merlin Moncure mmonc...@gmail.com writes:
On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane t...@sss.pgh.pa.us wrote:
On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane t...@sss.pgh.pa.us wrote:
Merlin Moncure mmonc...@gmail.com writes:
There were seven processes with that exact backtrace (except
that randomly they are sleeping in the spinloop). Something else
interesting: autovacuum has been running all
On 2015-01-14 10:05:01 -0500, Tom Lane wrote:
Merlin Moncure mmonc...@gmail.com writes:
On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane t...@sss.pgh.pa.us wrote:
What are the autovac processes doing (according to pg_stat_activity)?
pid,running,waiting,query
7105,00:28:40.789221,f,autovacuum:
On 2015-01-14 10:13:32 -0500, Tom Lane wrote:
Merlin Moncure mmonc...@gmail.com writes:
Yes, it is pg_class, and it is coming from LockBufferForCleanup(). As you
can see above, it has a shorter runtime. So it was killed off once
about a half hour ago which did not free up the logjam. However,
Andres Freund and...@2ndquadrant.com writes:
On 2015-01-14 10:05:01 -0500, Tom Lane wrote:
Hah, I suspected as much. Is that the one that's stuck in
LockBufferForCleanup, or the other one that's got a similar backtrace
to all the user processes?
Do you have a theory? Right now it primarily
On Wed, Jan 14, 2015 at 9:11 AM, Andres Freund and...@2ndquadrant.com wrote:
On 2015-01-14 10:05:01 -0500, Tom Lane wrote:
Merlin Moncure mmonc...@gmail.com writes:
On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane t...@sss.pgh.pa.us wrote:
What are the autovac processes doing (according to
Merlin Moncure mmonc...@gmail.com writes:
Yes, it is pg_class, and it is coming from LockBufferForCleanup(). As you
can see above, it has a shorter runtime. So it was killed off once
about a half hour ago which did not free up the logjam. However, AV
spawned it again and now it does not respond
On Wed, Jan 14, 2015 at 9:30 AM, Andres Freund and...@2ndquadrant.com wrote:
If you gdb in, and type 'fin' a couple times, to wait till the function
finishes, is there actually any progress? I'm wondering whether it's
just many catalog accesses + contention, or some other
problem.
On 2015-01-14 09:47:19 -0600, Merlin Moncure wrote:
On Wed, Jan 14, 2015 at 9:30 AM, Andres Freund and...@2ndquadrant.com wrote:
If you gdb in, and type 'fin' a couple times, to wait till the function
finishes, is there actually any progress? I'm wondering whether it's
just many catalog
On Wed, Jan 14, 2015 at 7:22 AM, Merlin Moncure mmonc...@gmail.com wrote:
I'll try to pull commits that Peter suggested and see if that helps
(I'm getting ready to bring the database down). I can send the code
off-list if you guys think it'd help.
Thanks for the code!
I think it would be
On Tue, Jan 13, 2015 at 7:24 PM, Peter Geoghegan p...@heroku.com wrote:
On Tue, Jan 13, 2015 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote:
Some more information on what's happening:
This is a ghetto logical replication engine that migrates data from
SQL Server to postgres, consolidating a
On Wed, Jan 14, 2015 at 8:03 AM, Merlin Moncure mmonc...@gmail.com wrote:
Here's a backtrace:
#0 0x00750a97 in spin_delay ()
#1 0x00750b19 in s_lock ()
#2 0x00750844 in LWLockRelease ()
#3 0x0073 in LockBuffer ()
#4 0x004b2db4 in
On Tue, Jan 13, 2015 at 3:54 PM, Andres Freund and...@2ndquadrant.com wrote:
I don't remember seeing _bt_moveright() or _bt_compare() figuring so
prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
reference to _bt_binsrch() in either profile.
Well, we do a _bt_moveright
On Tue, Jan 13, 2015 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote:
Some more information on what's happening:
This is a ghetto logical replication engine that migrates data from
SQL Server to postgres, consolidating a sharded database into a single
set of tables (of which there are only
On Tue, Jan 13, 2015 at 4:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
I'm not convinced that Peter is barking up the right tree. I'm noticing
that the profiles seem rather skewed towards parser/planner work; so I
suspect the contention is probably on access to system catalogs. No
idea exactly
Hi,
On 2015-01-13 16:29:51 -0600, Merlin Moncure wrote:
On my workstation today (running vanilla 9.4.0) I was testing some new
code that does aggressive parallel loading to a couple of tables. It
ran ok several dozen times and froze up with no external trigger.
There were at most 8 active
On Tue, Jan 13, 2015 at 4:33 PM, Andres Freund and...@2ndquadrant.com wrote:
Hi,
On 2015-01-13 16:29:51 -0600, Merlin Moncure wrote:
On my workstation today (running vanilla 9.4.0) I was testing some new
code that does aggressive parallel loading to a couple of tables. It
ran ok several
On Tue, Jan 13, 2015 at 2:29 PM, Merlin Moncure mmonc...@gmail.com wrote:
On my workstation today (running vanilla 9.4.0) I was testing some new
code that does aggressive parallel loading to a couple of tables.
Could you give more details, please? For example, I'd like to see
representative
On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
I'm inclined to think that this is a livelock, and so the problem
isn't evident from the structure of the B-Tree, but it can't hurt to
check.
My guess is rather that it's contention on the freelist lock via
StrategyGetBuffer's. I've seen
On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund and...@2ndquadrant.com wrote:
On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
I'm inclined to think that this is a livelock, and so the problem
isn't evident from the structure of the B-Tree, but it can't hurt to
check.
My guess is rather
On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote:
On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund and...@2ndquadrant.com wrote:
On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
I'm inclined to think that this is a livelock, and so the problem
isn't evident from the structure of the
On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund and...@2ndquadrant.com wrote:
My guess is rather that it's contention on the freelist lock via
StrategyGetBuffer's. I've seen profiles like this due to exactly that
before - and it fits to parallel loading quite well.
I'm not saying you're wrong,
On Tue, Jan 13, 2015 at 5:49 PM, Peter Geoghegan p...@heroku.com wrote:
On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund and...@2ndquadrant.com wrote:
My guess is rather that it's contention on the freelist lock via
StrategyGetBuffer's. I've seen profiles like this due to exactly that
before -
On 2015-01-13 15:49:33 -0800, Peter Geoghegan wrote:
On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund and...@2ndquadrant.com wrote:
My guess is rather that it's contention on the freelist lock via
StrategyGetBuffer's. I've seen profiles like this due to exactly that
before - and it fits to
On Tue, Jan 13, 2015 at 3:50 PM, Merlin Moncure mmonc...@gmail.com wrote:
I don't remember seeing _bt_moveright() or _bt_compare() figuring so
prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
reference to _bt_binsrch() in either profile.
hm, this is hand compiled now, I
On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan p...@heroku.com wrote:
On Tue, Jan 13, 2015 at 3:50 PM, Merlin Moncure mmonc...@gmail.com wrote:
I don't remember seeing _bt_moveright() or _bt_compare() figuring so
prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
On Tue, Jan 13, 2015 at 5:42 PM, Andres Freund and...@2ndquadrant.com wrote:
On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote:
On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund and...@2ndquadrant.com
wrote:
On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
I'm inclined to think that this
Merlin Moncure mmonc...@gmail.com writes:
On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan p...@heroku.com wrote:
In case it isn't clear, I think that the proximate cause here may well
be either one (or both) of commits
efada2b8e920adfdf7418862e939925d2acd1b89 and/or
On 2015-01-13 19:05:10 -0500, Tom Lane wrote:
Merlin Moncure mmonc...@gmail.com writes:
On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan p...@heroku.com wrote:
In case it isn't clear, I think that the proximate cause here may well
be either one (or both) of commits