Re: Why is infinite_recurse test suddenly failing?

2019-08-15 Thread Thomas Munro
On Thu, Aug 15, 2019 at 5:49 PM Tom Lane  wrote:
> So that leads to the thought that "the infinite_recurse test is fine
> if it runs by itself, but it tends to fall over if there are
> concurrently-running backends".  I have absolutely no idea how that
> would happen on anything that passes for a platform built in this
> century.  Still, it's a place to start, which we hadn't before.

Hmm.  mereswine's recent failure on REL_11_STABLE was running the
serial schedule.

I read about 3 ways to get SEGV from stack-related faults: you can
exceed RLIMIT_STACK (the total mapping size) and then you'll get SEGV
(per man pages), you can access a page that is inside the mapping but
is beyond the stack pointer (with some tolerance, exact details vary
by arch), and you can fail to allocate a page due to low memory.

The first kind of failure doesn't seem right -- we carefully set
max_stack_depth based on RLIMIT_STACK minus some slop, so that theory
would require child processes to have different stack limits than the
postmaster, as you said (perhaps OpenStack, Docker, related tooling or
concurrent activity on the host system is capable of changing it?), or
a bug in our slop logic.  The second kind of failure would imply that
we have a bug -- we're accessing something below the stack pointer --
but that doesn't seem right either -- I think various address
sanitising tools would have told us about that, and it's hard to
believe there is a bug in the powerpc and arm implementation of the
stack pointer check (see Linux arch/{powerpc,arm}/mm/fault.c).  That
leaves the third explanation, except then I'd expect to see other
kinds of problems like OOM etc before you get to that stage, and why
just here?  Confused.
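
For what it's worth, the "children see a different limit" theory is at
least cheap to test directly.  Here's a stand-alone sketch of the
arithmetic I mean -- not the server's actual code, and the 512kB slop
figure is just an assumed number for illustration:

    /* stacklimit.c -- print the stack rlimit this process sees.
     * Hypothetical stand-alone sketch; the real calculation lives in
     * the server, and the slop value here is only illustrative. */
    #include <stdio.h>
    #include <sys/resource.h>

    int
    main(void)
    {
        struct rlimit rl;
        const long slop = 512 * 1024;   /* assumed safety margin */

        if (getrlimit(RLIMIT_STACK, &rl) != 0)
        {
            perror("getrlimit");
            return 1;
        }

        if (rl.rlim_cur == RLIM_INFINITY)
            printf("RLIMIT_STACK soft limit: unlimited\n");
        else
            printf("RLIMIT_STACK soft limit: %ld kB (usable after slop: ~%ld kB)\n",
                   (long) (rl.rlim_cur / 1024),
                   (long) ((rl.rlim_cur - slop) / 1024));
        return 0;
    }

Running that (or even just "ulimit -s") both from the buildfarm user's
shell and from something spawned under the postmaster on one of the
affected animals would confirm or kill that theory.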

> Also notable is that we now have a couple of hits on ARM, not
> only ppc64.  Don't know what to make of that.

Yeah, that is indeed interesting.

-- 
Thomas Munro
https://enterprisedb.com




Re: Why is infinite_recurse test suddenly failing?

2019-08-14 Thread Tom Lane
Thomas Munro  writes:
> Here's another crash like that.

> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=cavefish&dt=2019-07-13%2003%3A49%3A38
> 2019-07-13 04:01:23.437 UTC [9365:70] LOG:  server process (PID 12951)
> was terminated by signal 11: Segmentation fault
> 2019-07-13 04:01:23.437 UTC [9365:71] DETAIL:  Failed process was
> running: select infinite_recurse();

It occurred to me to scrape the buildfarm database for these failures,
and what I got was

   sysname    |    branch     |      snapshot       |      stage      |                             data                              |   architecture
--------------+---------------+---------------------+-----------------+---------------------------------------------------------------+------------------
 demoiselle   | HEAD          | 2019-04-27 14:55:52 | pg_upgradeCheck | 2019-04-27 15:00:42.736 UTC [1457:66] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 buri         | HEAD          | 2019-04-27 23:54:46 | Check           | 2019-04-28 00:01:49.794 UTC [3041:66] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 takin        | HEAD          | 2019-05-01 08:16:48 | pg_upgradeCheck | 2019-05-01 08:23:27.159 UTC [32303:59] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le
 bonito       | HEAD          | 2019-05-01 23:05:36 | Check           | 2019-05-01 23:11:00.145 UTC [13933:66] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 shoveler     | HEAD          | 2019-05-10 14:04:34 | Check           | 2019-05-10 14:11:26.833 UTC [13456:73] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER8)
 demoiselle   | HEAD          | 2019-05-19 14:22:23 | pg_upgradeCheck | 2019-05-19 14:26:17.002 UTC [23275:80] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 vulpes       | HEAD          | 2019-06-15 09:16:45 | pg_upgradeCheck | 2019-06-15 09:22:22.268 UTC [4885:77] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le
 ayu          | HEAD          | 2019-06-19 22:13:23 | pg_upgradeCheck | 2019-06-19 22:18:16.805 UTC [2708:71] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER8)
 quokka       | HEAD          | 2019-07-10 14:20:13 | pg_upgradeCheck | 2019-07-10 15:24:06.102 BST [5d25f4fb.2644:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64
 cavefish     | HEAD          | 2019-07-13 03:49:38 | pg_upgradeCheck | 2019-07-13 04:01:23.437 UTC [9365:71] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 pintail      | REL_12_STABLE | 2019-07-13 19:36:51 | Check           | 2019-07-13 19:39:29.013 UTC [31086:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 bonito       | HEAD          | 2019-07-19 23:13:01 | Check           | 2019-07-19 23:16:33.330 UTC [24191:70] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 takin        | HEAD          | 2019-07-24 08:24:56 | Check           | 2019-07-24 08:28:01.735 UTC [16366:75] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le
 quokka       | HEAD          | 2019-07-31 02:00:07 | pg_upgradeCheck | 2019-07-31 03:04:04.043 BST [5d40f709.776a:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64
 elasmobranch | HEAD          | 2019-08-01 03:13:38 | Check           | 2019-08-01 03:19:05.394 UTC [22888:62] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 buri         | HEAD          | 2019-08-02 00:10:23 | Check           | 2019-08-02 00:17:11.075 UTC [28222:73] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 urocryon     | HEAD          | 2019-08-02 05:43:46 | Check           | 2019-08-02 05:51:51.944 UTC [2724:64] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le
 batfish      | HEAD          | 2019-08-04 19:02:36 | pg_upgradeCheck | 2019-08-04 19:08:11.728 UTC [23899:79] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le
 buri         | REL_12_STABLE | 2019-08-07 00:03:29 | pg_upgradeCheck | 2019-08-07 00:11:24.500 UTC [1405:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 quokka       | REL_12_STABLE | 2019-08-08 02:43:45 | pg_upgradeCheck | 2019-08-08 03:47:38.115 BST [5d4b8d3f.cdd7:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64
 quokka       | HEAD          | 2019-08-08 

Re: Why is infinite_recurse test suddenly failing?

2019-07-12 Thread Thomas Munro
On Tue, May 21, 2019 at 9:22 AM Mark Wong  wrote:
> Andrew let me know I need to get on v10.  I've upgraded demoiselle, and
> am trying to work through the rest now...

Here's another crash like that.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=cavefish&dt=2019-07-13%2003%3A49%3A38

2019-07-13 04:01:23.437 UTC [9365:70] LOG:  server process (PID 12951)
was terminated by signal 11: Segmentation fault
2019-07-13 04:01:23.437 UTC [9365:71] DETAIL:  Failed process was
running: select infinite_recurse();

-- 
Thomas Munro
https://enterprisedb.com




Re: Why is infinite_recurse test suddenly failing?

2019-05-20 Thread Mark Wong
On Mon, May 20, 2019 at 12:15:49PM -0700, Mark Wong wrote:
> On Sun, May 19, 2019 at 02:38:26PM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2019-05-14 08:31:37 -0700, Mark Wong wrote:
> > > Ok, I have this added to everyone now.
> > > 
> > > I think I also have caught up on this thread, but let me know if I
> > > missed anything.
> > 
> > I notice 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-05-19%2014%3A22%3A23
> > failed recently, but unfortunately does not appear to have gdb
> > installed? Or the buildfarm version is too old? Or ulimits are set
> > strictly on a system wide basis?
> 
> I'm on v9 of the build-farm here (have it on my todo list to get
> everything up to 10.)


Andrew let me know I need to get on v10.  I've upgraded demoiselle, and
am trying to work through the rest now...

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: Why is infinite_recurse test suddenly failing?

2019-05-20 Thread Mark Wong
On Sun, May 19, 2019 at 02:38:26PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-05-14 08:31:37 -0700, Mark Wong wrote:
> > Ok, I have this added to everyone now.
> > 
> > I think I also have caught up on this thread, but let me know if I
> > missed anything.
> 
> I notice 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-05-19%2014%3A22%3A23
> failed recently, but unfortunately does not appear to have gdb
> installed? Or the buildfarm version is too old? Or ulimits are set
> strictly on a system wide basis?

It looks like I did have gdb on there:

opensuse@postgresql-opensuse-p9:~> gdb --version
GNU gdb (GDB; openSUSE Leap 15.0) 8.1


I'm on v9 of the build-farm here (have it on my todo list to get
everything up to 10.)


I hope nothing is overriding my core size ulimit:

opensuse@postgresql-opensuse-p9:~> ulimit -c
unlimited


This animal is using clang.  I wonder if gdb is disagreeing with the
clang binaries?

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: Why is infinite_recurse test suddenly failing?

2019-05-19 Thread Andres Freund
Hi,

On 2019-05-14 08:31:37 -0700, Mark Wong wrote:
> Ok, I have this added to everyone now.
> 
> I think I also have caught up on this thread, but let me know if I
> missed anything.

I notice 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-05-19%2014%3A22%3A23
failed recently, but unfortunately does not appear to have gdb
installed? Or the buildfarm version is too old? Or ulimits are set
strictly on a system wide basis?

Greetings,

Andres Freund




Re: Why is infinite_recurse test suddenly failing?

2019-05-14 Thread Mark Wong
On Fri, May 10, 2019 at 05:26:43PM -0400, Andrew Dunstan wrote:
> 
> On 5/10/19 3:35 PM, Tom Lane wrote:
> > Andres Freund  writes:
> >> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
> >>> I am wondering if, somehow, the stack depth limit seen by the postmaster
> >>> sometimes doesn't apply to its children.  That would be pretty wacko
> >>> kernel behavior, especially if it's only intermittently true.
> >>> But we're running out of other explanations.
> >> I wonder if this is a SIGSEGV that actually signals an OOM
> >> situation. Linux, if it can't actually extend the stack on-demand due to
> >> OOM, sends a SIGSEGV.  The signal has that information, but
> >> unfortunately the buildfarm code doesn't print it.  p $_siginfo would
> >> show us some of that...
> >> Mark, how tight is the memory on that machine? Does dmesg have any other
> >> information (often segfaults are logged by the kernel with the code
> >> IIRC).
> > It does sort of smell like a resource exhaustion problem, especially
> > if all these buildfarm animals are VMs running on the same underlying
> > platform.  But why would that manifest as "you can't have a measly two
> > megabytes of stack" and not as any other sort of OOM symptom?
> >
> > Mark, if you don't mind modding your local copies of the buildfarm
> > script, I think what Andres is asking for is a pretty trivial addition
> > in PGBuild/Utils.pm's sub get_stack_trace:
> >
> > my $cmdfile = "./gdbcmd";
> > my $handle;
> > open($handle, '>', $cmdfile) || die "opening $cmdfile: $!";
> > print $handle "bt\n";
> > +   print $handle "p $_siginfo\n";
> > close($handle);
> >
> > 
> 
> 
> I think we'll need to write that as:
> 
> 
>     print $handle 'p $_siginfo',"\n";

Ok, I have this added to everyone now.

I think I also have caught up on this thread, but let me know if I
missed anything.

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: Why is infinite_recurse test suddenly failing?

2019-05-14 Thread Mark Wong
On Tue, May 14, 2019 at 10:52:07AM -0400, Tom Lane wrote:
> Mark Wong  writes:
> > Slowly catching up on my collection of ppc64le animals...
> 
> Oh, btw ... you didn't answer my question from before: are these animals
> all running on a common platform (and if so, what is that), or are they
> really different hardware?  If they're all VMs on one machine then it
> might be that there's some common-mode effect from the underlying system.

Sorry, I was almost there. :)

These systems are provisioned with OpenStack.  Additionally, a couple
more (cardinalfish, devario) are using docker under that.

> (Another thing I notice, now, is that these are all Linux variants;
> I'd been thinking you had some BSDen in there too, but now I see
> that none of those are ppc64.  Hm.)

Right, the BSDen I have are on different hardware.

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: Why is infinite_recurse test suddenly failing?

2019-05-14 Thread Mark Wong
On Fri, May 10, 2019 at 11:27:07AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
> > Core was generated by `postgres: debian regression [local] SELECT                                 '.
> > Program terminated with signal SIGSEGV, Segmentation fault.
> > #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> > 2748    malloc.c: No such file or directory.
> > #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> > #1  0x3fff915bedc8 in _int_malloc (av=0x3fff916e0d28 <main_arena>, bytes=8192) at malloc.c:3865
> > #2  0x3fff915c1064 in __GI___libc_malloc (bytes=8192) at malloc.c:2928
> > #3  0x106acfd8 in AllocSetContextCreateInternal (parent=0x1000babdad0, name=0x1085508c "inline_function", minContextSize=<optimized out>, initBlockSize=<optimized out>, maxBlockSize=8388608) at aset.c:477
> > #4  0x103d5e00 in inline_function (funcid=20170, result_type=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, funcvariadic=<optimized out>, func_tuple=<optimized out>, context=0x3fffe3da15d0, args=<optimized out>) at clauses.c:4459
> > #5  simplify_function (funcid=<optimized out>, result_type=<optimized out>, result_typmod=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, args_p=<optimized out>, funcvariadic=<optimized out>, process_args=<optimized out>, allow_non_const=<optimized out>, context=<optimized out>) at clauses.c:4040
> > #6  0x103d2e74 in eval_const_expressions_mutator (node=0x1000babe968, context=0x3fffe3da15d0) at clauses.c:2474
> > #7  0x103511bc in expression_tree_mutator (node=<optimized out>, mutator=0x103d2b10 <eval_const_expressions_mutator>, context=0x3fffe3da15d0) at nodeFuncs.c:2893
> 
> 
> > So that lets out any theory that somehow we're getting into a weird
> > control path that misses calling check_stack_depth;
> > expression_tree_mutator does so for one, and it was called just nine
> > stack frames down from the crash.
> 
> Right. There's plenty places checking it...
> 
> 
> > I am wondering if, somehow, the stack depth limit seen by the postmaster
> > sometimes doesn't apply to its children.  That would be pretty wacko
> > kernel behavior, especially if it's only intermittently true.
> > But we're running out of other explanations.
> 
> I wonder if this is a SIGSEGV that actually signals an OOM
> situation. Linux, if it can't actually extend the stack on-demand due to
> OOM, sends a SIGSEGV.  The signal has that information, but
> unfortunately the buildfarm code doesn't print it.  p $_siginfo would
> show us some of that...
> 
> Mark, how tight is the memory on that machine?

There's about 2GB allocated:

debian@postgresql-debian:~$ cat /proc/meminfo
MemTotal:        2080704 kB
MemFree:         1344768 kB
MemAvailable:    1824192 kB


At the moment it looks like plenty. :)  Maybe I should set something up
to monitor these things.

> Does dmesg have any other
> information (often segfaults are logged by the kernel with the code
> IIRC).

It's been up for about 49 days:

debian@postgresql-debian:~$ uptime
 14:54:30 up 49 days, 14:59,  3 users,  load average: 0.00, 0.34, 1.04


I see one line from dmesg that is related to postgres:

[3939350.616849] postgres[17057]: bad frame in setup_rt_frame: 3fffe3d9fe00 
nip 3fff915bdba0 lr 3fff915bde9c


But only that one time in 49 days up.  Otherwise I see a half dozen
hung_task_timeout_secs messages around jbd2 and dhclient.

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: Why is infinite_recurse test suddenly failing?

2019-05-14 Thread Tom Lane
Mark Wong  writes:
> Slowly catching up on my collection of ppc64le animals...

Oh, btw ... you didn't answer my question from before: are these animals
all running on a common platform (and if so, what is that), or are they
really different hardware?  If they're all VMs on one machine then it
might be that there's some common-mode effect from the underlying system.

(Another thing I notice, now, is that these are all Linux variants;
I'd been thinking you had some BSDen in there too, but now I see
that none of those are ppc64.  Hm.)

regards, tom lane




Re: Why is infinite_recurse test suddenly failing?

2019-05-14 Thread Tom Lane
Mark Wong  writes:
> The following I've enabled force_parallel_mode for HEAD, 11, 10, and
> 9.6:

Thanks Mark!

In theory, the stack trace we now have from shoveler proves that parallel
mode has nothing to do with this, because the crash happens during
planning (specifically, inlining SQL functions).  I wonder though if
it's possible that previous force_parallel_mode queries have had some
undesirable effect on the process's stack allocation.  Just grasping
at straws, because it's sure not clear what's going wrong.

regards, tom lane




Re: Why is infinite_recurse test suddenly failing?

2019-05-14 Thread Mark Wong
On Fri, May 03, 2019 at 11:45:33AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-05-03 10:08:59 -0700, Mark Wong wrote:
> > Ok, I think I have gdb installed now...
> 
> Thanks! Any chance you could turn on force_parallel_mode for the other
> branches it applies to too? Makes it easier to figure out whether
> breakage is related to that, or independent.

Slowly catching up on my collection of ppc64le animals...

I still need to upgrade the build farm client (v8) on:
* dhole
* vulpes
* wobbegong
* cuon
* batfish
* devario
* cardinalfish

The following I've enabled force_parallel_mode for HEAD, 11, 10, and
9.6:

* buri
* urocryon
* ayu
* shoveler
* chimaera
* bonito
* takin
* bufflehead
* elasmobranch
* demoiselle
* cavefish

Regards,
Mark




Re: Why is infinite_recurse test suddenly failing?

2019-05-12 Thread Andrew Dunstan


On 5/12/19 4:57 AM, Michael Paquier wrote:
> On Fri, May 10, 2019 at 05:26:43PM -0400, Andrew Dunstan wrote:
>> I think we'll need to write that as:
>>
>>     print $handle 'p $_siginfo',"\n";
>>
>> As you have it written perl will try to interpolate a variable called
>> $_siginfo.
> Anything in double quotes with a dollar sign would be interpreted as a
> variable, and single quotes make that safe.


Yes, that's why I did it that way.


cheers


andrew





Re: Why is infinite_recurse test suddenly failing?

2019-05-12 Thread Michael Paquier
On Fri, May 10, 2019 at 05:26:43PM -0400, Andrew Dunstan wrote:
> I think we'll need to write that as:
> 
>     print $handle 'p $_siginfo',"\n";
> 
> As you have it written perl will try to interpolate a variable called
> $_siginfo.

Anything in double quotes with a dollar sign would be interpreted as a
variable, and single quotes make that safe.
--
Michael




Re: Why is infinite_recurse test suddenly failing?

2019-05-10 Thread Andrew Dunstan


On 5/10/19 3:35 PM, Tom Lane wrote:
> Andres Freund  writes:
>> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
>>> I am wondering if, somehow, the stack depth limit seen by the postmaster
>>> sometimes doesn't apply to its children.  That would be pretty wacko
>>> kernel behavior, especially if it's only intermittently true.
>>> But we're running out of other explanations.
>> I wonder if this is a SIGSEGV that actually signals an OOM
>> situation. Linux, if it can't actually extend the stack on-demand due to
>> OOM, sends a SIGSEGV.  The signal has that information, but
>> unfortunately the buildfarm code doesn't print it.  p $_siginfo would
>> show us some of that...
>> Mark, how tight is the memory on that machine? Does dmesg have any other
>> information (often segfaults are logged by the kernel with the code
>> IIRC).
> It does sort of smell like a resource exhaustion problem, especially
> if all these buildfarm animals are VMs running on the same underlying
> platform.  But why would that manifest as "you can't have a measly two
> megabytes of stack" and not as any other sort of OOM symptom?
>
> Mark, if you don't mind modding your local copies of the buildfarm
> script, I think what Andres is asking for is a pretty trivial addition
> in PGBuild/Utils.pm's sub get_stack_trace:
>
>   my $cmdfile = "./gdbcmd";
>   my $handle;
>   open($handle, '>', $cmdfile) || die "opening $cmdfile: $!";
>   print $handle "bt\n";
> + print $handle "p $_siginfo\n";
>   close($handle);
>
>   


I think we'll need to write that as:


    print $handle 'p $_siginfo',"\n";


As you have it written perl will try to interpolate a variable called
$_siginfo.


cheers


andrew








Re: Why is infinite_recurse test suddenly failing?

2019-05-10 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
>> I am wondering if, somehow, the stack depth limit seen by the postmaster
>> sometimes doesn't apply to its children.  That would be pretty wacko
>> kernel behavior, especially if it's only intermittently true.
>> But we're running out of other explanations.

> I wonder if this is a SIGSEGV that actually signals an OOM
> situation. Linux, if it can't actually extend the stack on-demand due to
> OOM, sends a SIGSEGV.  The signal has that information, but
> unfortunately the buildfarm code doesn't print it.  p $_siginfo would
> show us some of that...

> Mark, how tight is the memory on that machine? Does dmesg have any other
> information (often segfaults are logged by the kernel with the code
> IIRC).

It does sort of smell like a resource exhaustion problem, especially
if all these buildfarm animals are VMs running on the same underlying
platform.  But why would that manifest as "you can't have a measly two
megabytes of stack" and not as any other sort of OOM symptom?

Mark, if you don't mind modding your local copies of the buildfarm
script, I think what Andres is asking for is a pretty trivial addition
in PGBuild/Utils.pm's sub get_stack_trace:

my $cmdfile = "./gdbcmd";
my $handle;
open($handle, '>', $cmdfile) || die "opening $cmdfile: $!";
print $handle "bt\n";
+   print $handle "p $_siginfo\n";
close($handle);

regards, tom lane




Re: Why is infinite_recurse test suddenly failing?

2019-05-10 Thread Andres Freund
Hi,

On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
> Core was generated by `postgres: debian regression [local] SELECT                                 '.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> 2748    malloc.c: No such file or directory.
> #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> #1  0x3fff915bedc8 in _int_malloc (av=0x3fff916e0d28 <main_arena>, bytes=8192) at malloc.c:3865
> #2  0x3fff915c1064 in __GI___libc_malloc (bytes=8192) at malloc.c:2928
> #3  0x106acfd8 in AllocSetContextCreateInternal (parent=0x1000babdad0, name=0x1085508c "inline_function", minContextSize=<optimized out>, initBlockSize=<optimized out>, maxBlockSize=8388608) at aset.c:477
> #4  0x103d5e00 in inline_function (funcid=20170, result_type=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, funcvariadic=<optimized out>, func_tuple=<optimized out>, context=0x3fffe3da15d0, args=<optimized out>) at clauses.c:4459
> #5  simplify_function (funcid=<optimized out>, result_type=<optimized out>, result_typmod=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, args_p=<optimized out>, funcvariadic=<optimized out>, process_args=<optimized out>, allow_non_const=<optimized out>, context=<optimized out>) at clauses.c:4040
> #6  0x103d2e74 in eval_const_expressions_mutator (node=0x1000babe968, context=0x3fffe3da15d0) at clauses.c:2474
> #7  0x103511bc in expression_tree_mutator (node=<optimized out>, mutator=0x103d2b10 <eval_const_expressions_mutator>, context=0x3fffe3da15d0) at nodeFuncs.c:2893


> So that lets out any theory that somehow we're getting into a weird
> control path that misses calling check_stack_depth;
> expression_tree_mutator does so for one, and it was called just nine
> stack frames down from the crash.

Right. There's plenty places checking it...
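
For anyone following along who hasn't looked at that machinery: the
checks boil down to comparing the address of a local variable against a
reference address captured near the bottom of the stack.  A stand-alone
sketch of the idea -- made-up names, an assumed 2MB budget, and not the
actual check_stack_depth() code, just its shape:

    #include <stdio.h>

    static char *stack_base_ptr;                          /* captured near the bottom of the stack */
    static long max_stack_depth_bytes = 2 * 1024 * 1024;  /* assumed 2MB budget, for illustration */

    static int
    stack_is_too_deep(void)
    {
        char here;
        long depth = stack_base_ptr - &here;   /* same informal pointer trick the real check uses */

        if (depth < 0)
            depth = -depth;                    /* don't assume which way the stack grows */
        return depth > max_stack_depth_bytes;
    }

    static long
    recurse(int level)
    {
        char pad[1024];                        /* burn some stack on each call */

        if (stack_is_too_deep())
        {
            fprintf(stderr, "stack depth limit reached at level %d\n", level);
            return level;                      /* the server raises an ERROR here instead */
        }
        pad[level % (int) sizeof(pad)] = 1;
        return recurse(level + 1) + pad[0];    /* not a tail call, so frames really accumulate */
    }

    int
    main(void)
    {
        char base;

        stack_base_ptr = &base;
        recurse(0);
        return 0;
    }

With a check like that firing well inside the rlimit, the only ways to
reach SIGSEGV are the ones Tom and I are discussing below.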


> I am wondering if, somehow, the stack depth limit seen by the postmaster
> sometimes doesn't apply to its children.  That would be pretty wacko
> kernel behavior, especially if it's only intermittently true.
> But we're running out of other explanations.

I wonder if this is a SIGSEGV that actually signals an OOM
situation. Linux, if it can't actually extend the stack on-demand due to
OOM, sends a SIGSEGV.  The signal has that information, but
unfortunately the buildfarm code doesn't print it.  p $_siginfo would
show us some of that...
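
To spell out what that would buy us: siginfo carries si_code, which
distinguishes "address not mapped" from "mapped but access not
permitted", plus the faulting address.  Here's a stand-alone
illustration of those fields as seen from a signal handler -- purely a
demo, not something for the server, and note that a real stack-overflow
SIGSEGV would additionally need sigaltstack() for any handler to run:

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void
    segv_handler(int signo, siginfo_t *info, void *context)
    {
        /* Not async-signal-safe; fine for a throwaway demo only. */
        const char *why =
            info->si_code == SEGV_MAPERR ? "SEGV_MAPERR (address not mapped)" :
            info->si_code == SEGV_ACCERR ? "SEGV_ACCERR (mapped, but no permission)" :
            "other";

        (void) signo;
        (void) context;
        fprintf(stderr, "SIGSEGV at address %p, si_code=%d: %s\n",
                info->si_addr, info->si_code, why);
        _exit(1);
    }

    int
    main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        *(volatile int *) 0 = 42;   /* provoke a fault so the handler runs */
        return 0;
    }

Post-mortem, gdb exposes the same structure as $_siginfo, which is why
printing it from the buildfarm's gdb script would help.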

Mark, how tight is the memory on that machine? Does dmesg have any other
information (often segfaults are logged by the kernel with the code
IIRC).

Greetings,

Andres Freund




Re: Why is infinite_recurse test suddenly failing?

2019-05-10 Thread Tom Lane
We just got another one of these, on yet another member of Mark's
ppc64 armada:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shoveler&dt=2019-05-10%2014%3A04%3A34

Now we have a stack trace (thanks Mark!), but it is pretty unsurprising:

Core was generated by `postgres: debian regression [local] SELECT                                 '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
2748    malloc.c: No such file or directory.
#0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
#1  0x3fff915bedc8 in _int_malloc (av=0x3fff916e0d28 <main_arena>, bytes=8192) at malloc.c:3865
#2  0x3fff915c1064 in __GI___libc_malloc (bytes=8192) at malloc.c:2928
#3  0x106acfd8 in AllocSetContextCreateInternal (parent=0x1000babdad0, name=0x1085508c "inline_function", minContextSize=<optimized out>, initBlockSize=<optimized out>, maxBlockSize=8388608) at aset.c:477
#4  0x103d5e00 in inline_function (funcid=20170, result_type=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, funcvariadic=<optimized out>, func_tuple=<optimized out>, context=0x3fffe3da15d0, args=<optimized out>) at clauses.c:4459
#5  simplify_function (funcid=<optimized out>, result_type=<optimized out>, result_typmod=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, args_p=<optimized out>, funcvariadic=<optimized out>, process_args=<optimized out>, allow_non_const=<optimized out>, context=<optimized out>) at clauses.c:4040
#6  0x103d2e74 in eval_const_expressions_mutator (node=0x1000babe968, context=0x3fffe3da15d0) at clauses.c:2474
#7  0x103511bc in expression_tree_mutator (node=<optimized out>, mutator=0x103d2b10 <eval_const_expressions_mutator>, context=0x3fffe3da15d0) at nodeFuncs.c:2893
#8  0x103d2cbc in eval_const_expressions_mutator (node=0x1000babe9c0, context=0x3fffe3da15d0) at clauses.c:3606
#9  0x103510c8 in expression_tree_mutator (node=<optimized out>, mutator=<optimized out>, context=<optimized out>) at nodeFuncs.c:2942
#10 0x103d2cbc in eval_const_expressions_mutator (node=0x1000babea40, context=0x3fffe3da15d0) at clauses.c:3606
#11 0x103d2ae8 in eval_const_expressions (root=<optimized out>, node=<optimized out>) at clauses.c:2266
#12 0x103b6264 in preprocess_expression (root=0x1000babee18, expr=<optimized out>, kind=1) at planner.c:1087
#13 0x103b496c in subquery_planner (glob=<optimized out>, parse=<optimized out>, parent_root=<optimized out>, hasRecursion=<optimized out>, tuple_fraction=<optimized out>) at planner.c:769
#14 0x103b3c58 in standard_planner (parse=<optimized out>, cursorOptions=<optimized out>, boundParams=<optimized out>) at planner.c:406
#15 0x103b3a68 in planner (parse=<optimized out>, cursorOptions=<optimized out>, boundParams=<optimized out>) at planner.c:275
#16 0x104cc2cc in pg_plan_query (querytree=0x1000babe7f8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#17 0x102ef850 in init_execution_state (lazyEvalOK=<optimized out>, queryTree_list=<optimized out>, fcache=<optimized out>) at functions.c:507
#18 init_sql_fcache (finfo=<optimized out>, collation=<optimized out>, lazyEvalOK=<optimized out>) at functions.c:770
#19 fmgr_sql (fcinfo=<optimized out>) at functions.c:1053
#20 0x102cef24 in ExecInterpExpr (state=<optimized out>, econtext=<optimized out>, isnull=<optimized out>) at execExprInterp.c:625
#21 0x102cddb8 in ExecInterpExprStillValid (state=0x1000bab41e8, econtext=0x1000bab3ed8, isNull=<optimized out>) at execExprInterp.c:1769
#22 0x10314f10 in ExecEvalExprSwitchContext (state=0x1000bab41e8, econtext=0x1000bab3ed8, isNull=<optimized out>) at ../../../src/include/executor/executor.h:307
#23 ExecProject (projInfo=0x1000bab41e0) at ../../../src/include/executor/executor.h:341
#24 ExecResult (pstate=<optimized out>) at nodeResult.c:136
#25 0x102e319c in ExecProcNodeFirst (node=0x1000bab3dc0) at execProcnode.c:445
#26 0x102d9c94 in ExecProcNode (node=<optimized out>) at ../../../src/include/executor/executor.h:239
#27 ExecutePlan (estate=<optimized out>, planstate=<optimized out>, use_parallel_mode=false, operation=<optimized out>, numberTuples=<optimized out>, direction=<optimized out>, dest=<optimized out>, execute_once=<optimized out>, sendTuples=<optimized out>) at execMain.c:1648
#28 standard_ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:365
#29 0x102d9ac8 in ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:309
#30 0x102efe84 in postquel_getnext (es=<optimized out>, fcache=<optimized out>) at functions.c:867
#31 fmgr_sql (fcinfo=<optimized out>) at functions.c:1153
#32 0x102cef24 in ExecInterpExpr (state=<optimized out>, econtext=<optimized out>, isnull=<optimized out>) at execExprInterp.c:625
#33 0x102cddb8 in ExecInterpExprStillValid (state=0x1000baa8158, econtext=0x1000baa7e48, isNull=<optimized out>) at execExprInterp.c:1769
#34 0x10314f10 in ExecEvalExprSwitchContext (state=0x1000baa8158, econtext=0x1000baa7e48, isNull=<optimized out>) at ../../../src/include/executor/executor.h:307
#35 ExecProject (projInfo=0x1000baa8150) at ../../../src/include/executor/executor.h:341
#36 ExecResult (pstate=<optimized out>) at nodeResult.c:136
#37 0x102e319c in ExecProcNodeFirst (node=0x1000baa7d30) at execProcnode.c:445

... lots and lots of repetitions ...

#11809 0x102e319c in ExecProcNodeFirst (node=0x10008c01e90) at execProcnode.c:445
#11810 0x102d9c94 in ExecProcNode (node=<optimized out>) at ../../../src/include/executor/executor.h:239
#11811 ExecutePlan (estate=<optimized out>, planstate=<optimized out>, use_parallel_mode=false, operation=<optimized out>, numberTuples=<optimized out>, direction=<optimized out>, dest=<optimized out>, execute_once=<optimized out>, sendTuples=<optimized out>) at execMain.c:1648
#11812 standard_ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:365
#11813 

Re: Why is infinite_recurse test suddenly failing?

2019-05-03 Thread Tom Lane
Mark Wong  writes:
> On Thu, May 02, 2019 at 11:45:34AM -0400, Tom Lane wrote:
>> While we're looking at this --- Mark, if you could install gdb
>> on your buildfarm hosts, that would be really handy.  I think that's
>> the only extra thing the buildfarm script needs to extract stack
>> traces from core dumps.  We'd likely already know where the problem
>> is if we had a stack trace ...

> Ok, I think I have gdb installed now...

Thanks!

regards, tom lane




Re: Why is infinite_recurse test suddenly failing?

2019-05-03 Thread Mark Wong
On Thu, May 02, 2019 at 11:45:34AM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Hm, I just noticed:
> >    'HEAD' => [
> >                  'force_parallel_mode = regress'
> >                ]
> 
> Oooh, I didn't see that.
> 
> > on all those animals. So it's not necessarily the case that HEAD and
> > backbranch runs are behaving all that identical.  Note that isn't a
> > recent config change, so it's not an explanation as to why they started
> > to fail only recently.
> 
> No, but it does point at another area of the code in which a relevant
> change could've occurred.
> 
> While we're looking at this --- Mark, if you could install gdb
> on your buildfarm hosts, that would be really handy.  I think that's
> the only extra thing the buildfarm script needs to extract stack
> traces from core dumps.  We'd likely already know where the problem
> is if we had a stack trace ...

Ok, I think I have gdb installed now...

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: Why is infinite_recurse test suddenly failing?

2019-05-02 Thread Tom Lane
Andres Freund  writes:
> Hm, I just noticed:
>    'HEAD' => [
>                  'force_parallel_mode = regress'
>                ]

Oooh, I didn't see that.

> on all those animals. So it's not necessarily the case that HEAD and
> backbranch runs are behaving all that identical.  Note that isn't a
> recent config change, so it's not an explanation as to why they started
> to fail only recently.

No, but it does point at another area of the code in which a relevant
change could've occurred.

While we're looking at this --- Mark, if you could install gdb
on your buildfarm hosts, that would be really handy.  I think that's
the only extra thing the buildfarm script needs to extract stack
traces from core dumps.  We'd likely already know where the problem
is if we had a stack trace ...

regards, tom lane




Re: Why is infinite_recurse test suddenly failing?

2019-05-02 Thread Andres Freund
Hi,

On 2019-05-02 11:02:03 -0400, Tom Lane wrote:
> In the past week, four different buildfarm members have shown
> non-reproducing segfaults in the "select infinite_recurse()"
> test case, rather than the expected detection of stack overrun
> before we get to the point of a segfault.

I was just staring at bonito's failure in confusion.


> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bonito&dt=2019-05-01%2023%3A05%3A36
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=takin&dt=2019-05-01%2008%3A16%3A48
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=buri&dt=2019-04-27%2023%3A54%3A46
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-04-27%2014%3A55%3A52
> 
> They're all on HEAD, and they all look like
> 
> 2019-05-01 23:11:00.145 UTC [13933:65] LOG:  server process (PID 17161) was 
> terminated by signal 11: Segmentation fault
> 2019-05-01 23:11:00.145 UTC [13933:66] DETAIL:  Failed process was running: 
> select infinite_recurse();
> 
> I scraped the buildfarm database and verified that there are no similar
> failures for at least three months back; nor, offhand, can I remember ever
> having seen this test fail in many years.  So it seems we broke something
> recently.  No idea what though.

I can't recall any recent changes to relevant area of code.


> (Another possibility, seeing that these are all members of Mark's PPC64
> flotilla, is that there's some common misconfiguration --- but it's hard
> to credit that such a problem would only affect HEAD not the back
> branches.)

Hm, I just noticed:
   'HEAD' => [
                 'force_parallel_mode = regress'
               ]

on all those animals. So it's not necessarily the case that HEAD and
backbranch runs are behaving all that identical.  Note that isn't a
recent config change, so it's not an explanation as to why they started
to fail only recently.

Greetings,

Andres Freund