Re: dsa_allocate() faliure

2019-02-12 Thread Thomas Munro
On Mon, Feb 11, 2019 at 10:33 AM Tom Lane wrote: > Thomas Munro writes: > > So I'll wait until after the release, and we'll have > > to live with the bug for another 3 months. > > Check. Please hold off committing until you see the release tags > appear, probably late Tuesday my time / Wednesday

Re: dsa_allocate() faliure

2019-02-10 Thread Tom Lane
Thomas Munro writes: > On Mon, Feb 11, 2019 at 10:33 AM Tom Lane wrote: >> I observe from >> https://coverage.postgresql.org/src/backend/utils/mmgr/freepage.c.gcov.html >> that the edge cases in this function aren't too well exercised by >> our regression tests, meaning that the buildfarm might n

Re: dsa_allocate() faliure

2019-02-10 Thread Thomas Munro
On Mon, Feb 11, 2019 at 10:33 AM Tom Lane wrote: > I observe from > > https://coverage.postgresql.org/src/backend/utils/mmgr/freepage.c.gcov.html > > that the edge cases in this function aren't too well exercised by > our regression tests, meaning that the buildfarm might not prove > much either w

Re: dsa_allocate() faliure

2019-02-10 Thread Thomas Munro
On Mon, Feb 11, 2019 at 11:02 AM Justin Pryzby wrote: > On Mon, Feb 11, 2019 at 09:45:07AM +1100, Thomas Munro wrote: > > Ouch. Yeah, that'd do it and matches the evidence. With this change, > > I couldn't reproduce the problem after 90 minutes with a test case > > that otherwise hits it within

Re: dsa_allocate() faliure

2019-02-10 Thread Justin Pryzby
On Mon, Feb 11, 2019 at 09:45:07AM +1100, Thomas Munro wrote: > Ouch. Yeah, that'd do it and matches the evidence. With this change, > I couldn't reproduce the problem after 90 minutes with a test case > that otherwise hits it within a couple of minutes. ... > Note that this patch addresses the e

Re: dsa_allocate() faliure

2019-02-10 Thread Tom Lane
Thomas Munro writes: > This brings us to a difficult choice: we're about to cut a new > release, and this could in theory be included. Even though the fix is > quite convincing, it doesn't seem wise to change such complicated code > at the last minute, and I know from an off-list chat that that i

Re: dsa_allocate() faliure

2019-02-10 Thread Thomas Munro
On Sun, Feb 10, 2019 at 5:41 PM Robert Haas wrote: > On Sun, Feb 10, 2019 at 2:37 AM Thomas Munro > wrote: > > But at first glance it shouldn't be allocating pages, because it just > > does consolidation to try to convert to singleton format, and then it > > does recycle list cleanup using soft=t

Re: dsa_allocate() faliure

2019-02-10 Thread Justin Pryzby
On Sun, Feb 10, 2019 at 07:11:22PM +0300, Sergei Kornilov wrote: > > I ran overnight with this patch, but all parallel processes ended up stuck > > in the style of bug#15585. So that's either not the root cause, or there's a > > 2nd issue. > > Maybe I missed something in this discussion,

Re: dsa_allocate() faliure

2019-02-10 Thread Sergei Kornilov
Hi > I ran overnight with this patch, but all parallel processes ended up stuck in > the style of bug#15585. So that's either not the root cause, or there's a 2nd > issue. Maybe I missed something in this discussion, but can you reproduce bug#15585? How? With this testcase: https://www.postgres

Re: dsa_allocate() faliure

2019-02-10 Thread Justin Pryzby
On Sun, Feb 10, 2019 at 12:10:52PM +0530, Robert Haas wrote: > I think I see what's happening. At the moment the problem occurs, > there is no btree - there is only a singleton range. So > FreePageManagerInternal() takes the fpm->btree_depth == 0 branch and > then ends up in the section with the

Re: dsa_allocate() faliure

2019-02-09 Thread Robert Haas
On Sun, Feb 10, 2019 at 2:37 AM Thomas Munro wrote: > ... but why would it do that? I can reproduce cases where (for > example) FreePageManagerPutInternal() returns 179, and then > FreePageManagerLargestContiguous() returns 179, but then after > FreePageBtreeCleanup() it returns 178. At that poi
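
The 179-then-178 sequence quoted above can be pictured with a toy model. The sketch below is not PostgreSQL's freepage.c: `ToyManager`, `toy_put`, `toy_cleanup` and `toy_largest` are hypothetical names, and the idea that a cleanup step claims one page out of the largest free run is an assumption made purely for illustration. It only shows why a largest-run value captured before a cleanup step may be stale afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 256

/* Toy free-page map (hypothetical): free_map[i] is true when page i is free. */
typedef struct ToyManager
{
    bool free_map[TOY_NPAGES];
} ToyManager;

/* Scan the whole map and return the length of the longest run of free pages. */
static size_t
toy_largest(const ToyManager *m)
{
    size_t best = 0;
    size_t run = 0;

    for (size_t i = 0; i < TOY_NPAGES; i++)
    {
        run = m->free_map[i] ? run + 1 : 0;
        if (run > best)
            best = run;
    }
    return best;
}

/*
 * Mark npages pages starting at 'first' as free and return the length of
 * the contiguous free run that now contains them.
 */
static size_t
toy_put(ToyManager *m, size_t first, size_t npages)
{
    size_t lo;
    size_t hi;

    assert(first + npages <= TOY_NPAGES);
    for (size_t i = first; i < first + npages; i++)
        m->free_map[i] = true;

    /* Extend the run to the left and right over already-free neighbours. */
    lo = first;
    while (lo > 0 && m->free_map[lo - 1])
        lo--;
    hi = first + npages;
    while (hi < TOY_NPAGES && m->free_map[hi])
        hi++;
    return hi - lo;
}

/* Assumed cleanup step: claims one free page for internal bookkeeping. */
static void
toy_cleanup(ToyManager *m, size_t page)
{
    assert(page < TOY_NPAGES);
    m->free_map[page] = false;
}
```

In this model, freeing pages 0 through 178 makes `toy_put` report 179 and `toy_largest` agrees. Once `toy_cleanup` claims page 0, `toy_largest` reports 178, so a caller that kept the earlier 179 would overestimate the largest run unless it recomputes after the cleanup.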

Re: dsa_allocate() faliure

2019-02-09 Thread Robert Haas
On Sun, Feb 10, 2019 at 1:55 AM Thomas Munro wrote: > Bleugh. Yeah. What I said before wasn't quite right. The value > returned by FreePageManagerPutInternal() is actually correct at the > moment it is returned, but it ceases to be correct immediately > afterwards if the following call to FreeP

Re: dsa_allocate() faliure

2019-02-09 Thread Thomas Munro
On Sun, Feb 10, 2019 at 7:24 AM Thomas Munro wrote: > On Sat, Feb 9, 2019 at 9:21 PM Robert Haas wrote: > > On Fri, Feb 8, 2019 at 8:00 AM Thomas Munro > > wrote: > > > Sometimes FreePageManagerPutInternal() returns a > > > number-of-contiguous-pages-created-by-this-insertion that is too large > > >

Re: dsa_allocate() faliure

2019-02-09 Thread Thomas Munro
On Sat, Feb 9, 2019 at 9:21 PM Robert Haas wrote: > On Fri, Feb 8, 2019 at 8:00 AM Thomas Munro > wrote: > > Sometimes FreePageManagerPutInternal() returns a > > number-of-contiguous-pages-created-by-this-insertion that is too large > > by one. [...] > > I spent a long time thinking about this and st

Re: dsa_allocate() faliure

2019-02-09 Thread Robert Haas
On Fri, Feb 8, 2019 at 8:00 AM Thomas Munro wrote: > Sometimes FreePageManagerPutInternal() returns a > number-of-contiguous-pages-created-by-this-insertion that is too large > by one. If this happens to be a new max-number-of-contiguous-pages, > it causes trouble some arbitrary time later because th

Re: dsa_allocate() faliure

2019-02-07 Thread Thomas Munro
On Fri, Feb 8, 2019 at 4:49 AM Thomas Munro wrote: > I don't have the answer yet but I have some progress: I finally > reproduced the "could not find %d free pages" error by running lots of > concurrent parallel queries. Will investigate. Sometimes FreePageManagerPutInternal() returns a number-of-co

Re: dsa_allocate() faliure

2019-02-07 Thread Thomas Munro
On Thu, Feb 7, 2019 at 9:10 PM Jakub Glapa wrote: > > Do you have query logging enabled? If not, could you consider it on at > > least one of those servers? I'm interested to know what ELSE is running at the > > time that query failed. > > Ok, I have configured that and will enable in the

Re: dsa_allocate() faliure

2019-02-07 Thread Jakub Glapa
> Do you have query logging enabled? If not, could you consider it on at least one of those servers? I'm interested to know what ELSE is running at the time that query failed. Ok, I have configured that and will enable in the time window when the errors usually occur. I'll report as soon as I

Re: dsa_allocate() faliure

2019-02-06 Thread Justin Pryzby
Moving to -hackers; hopefully it doesn't confuse the list scripts too much. On Mon, Feb 04, 2019 at 08:52:17AM +0100, Jakub Glapa wrote: > I see the error showing up every night on 2 different servers. But it's a > bit of a heisenbug because if I go there now it won't be reproducible. Do you have

Re: dsa_allocate() faliure

2018-12-02 Thread Thomas Munro
On Sat, Dec 1, 2018 at 9:46 AM Justin Pryzby wrote: > elog(FATAL, > "dsa_allocate could not find %zu free > pages", npages); > + abort() If anyone can reproduce this problem with a debugger, it'd be interesting to see

Re: dsa_allocate() faliure

2018-11-30 Thread Justin Pryzby
On Fri, Nov 30, 2018 at 08:20:49PM +0100, Jakub Glapa wrote: > In the last days I've been monitoring no segfault occurred but the > dsa_allocation did. > I'm starting to doubt if the segfault I've found in dmesg was actually > related. The dmesg looks like a real crash, not just OOM. You can hope

Re: dsa_allocate() faliure

2018-11-30 Thread Jakub Glapa
Hi, just a small update. I've configured the OS for taking crash dumps on Ubuntu 16.04 with the following (maybe somebody will find it helpful): I've added LimitCORE=infinity to /lib/systemd/system/postgresql@.service under [Service] section I've reloaded the service config with sudo systemctl daem

Re: dsa_allocate() faliure

2018-11-27 Thread Thomas Munro
On Tue, Nov 27, 2018 at 4:00 PM Thomas Munro wrote: > Hmm. I will see if I can come up with a many-partition torture test > reproducer for this. No luck. I suppose one theory that could link both failure modes would be a buffer overrun, where in the non-shared case it trashes a pointer that is lat

Re: dsa_allocate() faliure

2018-11-26 Thread Thomas Munro
On Tue, Nov 27, 2018 at 7:45 AM Alvaro Herrera wrote: > On 2018-Nov-26, Jakub Glapa wrote: > > Justin thanks for the information! > > I'm running Ubuntu 16.04. > > I'll try to prepare for the next crash. > > Couldn't find anything this time. > > As I recall, the apport stuff in Ubuntu is terrible