On 12/01/2015 09:58 AM, Peter J. Holzer wrote:
On 2015-12-01 07:16:04 -0800, Adrian Klaver wrote:
On 12/01/2015 06:51 AM, Peter J. Holzer wrote:
A rather weird observation from the log files of our server (9.5 beta1):

2015-12-01 09:23:37 CET [26265]: [42-1] user=fragner,db=wds,pid=26265 WARNING:  
Use of uninitialized value $success in concatenation (.) or string at 
/usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
2015-12-01 09:23:37 CET [26265]: [43-1] user=fragner,db=wds,pid=26265 CONTEXT:  PL/Perl 
function "mb_timeseriesdata_zmq"
[lots of other stuff from different connections]
2015-12-01 09:24:45 CET [26265]: [44-1] user=fragner,db=wds,pid=26265 ERROR:  
impossible result '' (payload=) at 
/usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
                 WDS::Macrobond::Utils::decode_result("") called at line 30
                 main::__ANON__("gen_wqehur") called at -e line 0
                 eval {...} called at -e line 0

Two messages from the same line of the same plperlu stored procedure, 68
seconds apart. So what is this line 36?

     confess "impossible result '$success' (payload=$payload)";

What? The first message clearly comes from interpolating $success
(which is undef at that point) into the argument. The second from
confess itself. What could cause a plperlu procedure to freeze for 68
seconds between the call to confess and its output?

Is it possible that only the writing of the log entry is delayed?

Another weird thing: $success is undef because a ØMQ rpc call[1] timed

And the call is?

The sequence is:

my $req_sck = zmq_socket($context, ZMQ_REQ);
zmq_connect($req_sck, $url);
my $qry_msg = join(...);
zmq_send($req_sck, $qry_msg);
my $res_msg = zmq_msg_init();
my $rv = zmq_msg_recv($res_msg, $req_sck);
# check rv here.
my $data = zmq_msg_data($res_msg); # $data is "" here
my $result = WDS::Macrobond::Utils::decode_result($data); # the error messages 
are from this function

(Yeah, ØMQ is quite low-level. There is a higher level Perl Module, but
I'm not using it).

I omitted that because I don't think it's terribly relevant here.
Details of the usage of ØMQ are better discussed on the ØMQ mailing
list.

But there is something else which may be relevant: ØMQ uses threads
internally, and I don't actually know whether zmq_msg_recv returning

Except I see this here:

http://api.zeromq.org/4-0:zmq-socket

Thread safety

ØMQ sockets are not thread safe. Applications MUST NOT use a socket from multiple threads except after migrating a socket from one thread to another with a "full fence" memory barrier.

means that the read(2) call (or whatever) on the socket terminates.
It may actually continue in another thread. But I still don't see how
that could block the main thread (or wake it up again in a place which
has nothing to do with ØMQ (confess is a standard Perl function to print
a stack trace and die)). Or - just thinking aloud here - I fear I'm
abusing you guys as support teddy bears[1] - maybe it's the other way
round: confess dies, so maybe it frees some lock during cleanup which
allows the message which should have been sent by zmq_send to finally go
out on the wire. But that still doesn't explain the 68 seconds spent in
confess ...

Postgres worker processes are single-threaded, are they? Is there
something else which could interact badly with a moderately complex
multithreaded I/O library used from a stored procedure?

I suspect such an interaction because I cannot reproduce the problem
outside of a stored procedure. A standalone Perl script doing the same
requests doesn't get a timeout.

I guess Alvaro is right: I should strace the postgres worker process
while it executes the stored procedure. The problem of course is that
it happens often enough be annoying, but rarely enough that it's not
easily reproducible.

From here:

http://api.zeromq.org/4-0:zmq-connect

It seems something like(I am not a Perl programmer, so approach carefully):

my $rc = zmq_connect($req_sck, $url);

Then you will have an error code to examine.

Have you looked at the Notes at the bottom of this page:

http://www.postgresql.org/docs/9.4/interactive/plperl-trusted.html


out (after 60 seconds, which is also a mystery, because ØMQ doesn't seem
to have a default timeout of 60 seconds, and I don't set one). But at

Network timeout?

That was my first guess, but I don't see where it would come from. Or
why it only is there if I call the code from a stored procedure, not
from a standalone script.

09:24:45 (i.e. the time of the error message) the answer for that RPC
call arrived. So it kind of looks like confess waited for the message to
arrive (which makes no sense at all) or maybe that confess waited for
something which also blocked the sending of the request (because
according to the server logs, the RPC request only arrived there at
09:24:45 and was answered within 1 second), but that doesn't make any

So if the request timed out how did you get a reply, a second request?

Nope. I don't really "get" the reply. I just see in the logs of the
other server that it sent a reply at that time.

The time line is like this

time    postgres process        mb_dal process
T       zmq_send()
         zmq_msg_recv()

T+60    zmq_msg_recv returns
         an empty message;
         decode() is called
         which notices that the
         message is empty and
         calls confess();

T+128   confess() prints a      receives the message sent at T;
         stacktrace;             sends a reply;

         hp

[1] For those who don't know the story: Someone at a help desk noticed
that people often started to explain their problem and somewhere in the
middle said "never mind, I think I know what the problem is" and went
away. So he put a big teddy bear on the desk and asked people to explain
their problems to the bear. So a support teddy bear is someone who you
talk to mostly to clear your own mind, not because you expect a
solution.



--
Adrian Klaver
adrian.kla...@aklaver.com


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Reply via email to