Re: [asterisk-dev] Bridges, T.38, and other good times

2015-12-10 Thread Richard Mudgett
On Sun, Dec 6, 2015 at 7:57 PM, Matthew Jordan  wrote:

> Hello all -
>
> One of the efforts that a number of developers in the community here at
> Digium have been working on is cleaning up test failures exposed by
> Jenkins [1]. One of these, in particular, has been rather difficult to
> resolve - namely, fax/pjsip/directmedia_reinvite_t38 [2]. This e-mail goes
> over what has been accomplished, and asks some questions on how we might
> try and fix Asterisk under this scenario.
>
> The directmedia_reinvite_t38 test attempts to do the following:
>  (1) UAC1 calls UAC2 through Asterisk, with audio as the media. The dial
> is performed using the 'g' flag, such that UAC2 will continue on if UAC1
> hangs up.
>  (2) UAC1 and UAC2 are configured for direct media. Asterisk sends a
> re-INVITE to UAC1 and UAC2 to initiate direct media.
>  (3) After responding with a 200 OK to the direct media requests, UAC1
> sends a re-INVITE offering T.38.
>  (4) Asterisk sends an INVITE with T.38 to UAC2.
>  (5) UAC2 sends back a 200 OK for T.38; Asterisk sends that to UAC1.
> Asterisk switches out of a direct media bridge to a core bridge.
>  (6) UAC1 hangs up. Asterisk sends a re-INVITE to UAC2 for audio back to
> Asterisk. UAC2 responds with a 200 OK for the audio.
>  (7) Asterisk ejects UAC2 back to the dialplan.
>
> It's important to note that this test never should have passed - an update
> to the test suite "fixed" the test erroneously passing, which led to us
> investigating why the scenario was failing. This test was copied over from
> an identical chan_sip test, which passes.
>
> The PJSIP stack has two issues which make life difficult for it in this
> scenario:
> (1) The T.38 logic is implemented in res_pjsip_t38. While that is _mostly_
> a very good thing - as it keeps all the fax state logic outside of the
> channel driver - we are also a layer removed from interactions that occur
> in the channel driver. That makes it challenging to influence direct media
> checks and other Asterisk/channel interactions.
> (2) Being very asynchronous, requests may be serviced that influence T.38
> state while other interactions are occurring in the core. Informing the
> core of what has occurred can have more race conditions than what occurs in
> chan_sip, which is single threaded.
>
> The first bug discovered when the test was investigated was an issue in
> step (2). We never actually initiated a direct media re-INVITE. This was
> due to res_pjsip_t38 using a frame hook, and not implementing the
> .consume_cb callback. That callback allows a framehook to inform the core
> (and also the bridging framework) of the types of frames that a framehook
> wants to consume. If a framehook needs audio, a direct media bridge will be
> explicitly denied, and - by default - the bridging framework assumes that
> framehooks will want all frames. Another bug that was discovered occurred
> in step (6). When UAC1 sends a BYE request, nothing informed UAC2 that the
> fax had ended - instead, it was merely ejected from the bridge. This meant
> that it kept its T.38 session going, and Asterisk never sent a re-INVITE to
> UAC2. Both of these bugs were fixed by 726ee873a6.
>
> Except, unfortunately, the second bug wasn't really fixed.
>
> 726ee873a6 did the "right" thing by intercepting the BYE request sent by
> UAC1, and queueing up a control frame of type AST_CONTROL_T38_PARAMETERS
> with a new state of AST_T38_TERMINATED. This is supposed to be passed on to
> UAC2, informing it that the T.38 fax has ended, and that it should have its
> media re-negotiated back to the last known state (audio) but also back to
> Asterisk (since we aren't going to be in a bridge any longer).
> Unfortunately, this code was insufficient.
>
> A race condition exists in this case. On the one hand, we've just queued
> up a frame on UAC1's channel to be passed into the bridge, which should get
> tossed onto UAC2's channel. On the other hand, we've just told the bridging
> framework to kill UAC1's channel with extreme prejudice, thereby also
> terminating the bridge and ejecting UAC2 off into the dialplan. In the
> first case, this is an asynchronous, message passing mechanism; in the
> second case, the bridging framework inspects the channel to see if it
> should be hung up on *every frame* and *immediately* starts the
> hangup/shutdown procedure if it knows the channel should die. This is not
> asynchronous in any way. As a result, UAC1 may be hung up and the bridge
> dissolved before UAC2 ever gets its control frame from UAC1.
>
> There were a couple of solutions to this problem that were tried:
> (1) First, I tried to make sure that enqueued control frames were flushed
> out of a channel and passed over the bridge when a hangup was detected. In
> practice, this was incredibly cumbersome - some control frames should get
> tossed, others need to be preserved. What was worse was the sheer number of
> places the bridge dissolution can be triggered. While 

Re: [asterisk-dev] Bridges, T.38, and other good times

2015-12-07 Thread Mark Michelson

On 12/06/2015 07:57 PM, Matthew Jordan wrote:

Hello all -

One of the efforts that a number of developers in the community here 
at Digium have been working on is cleaning up test failures exposed 
by Jenkins [1]. One of these, in particular, has been rather difficult 
to resolve - namely, fax/pjsip/directmedia_reinvite_t38 [2]. This 
e-mail goes over what has been accomplished, and asks some questions 
on how we might try and fix Asterisk under this scenario.


The directmedia_reinvite_t38 test attempts to do the following:
 (1) UAC1 calls UAC2 through Asterisk, with audio as the media. The 
dial is performed using the 'g' flag, such that UAC2 will continue on 
if UAC1 hangs up.
 (2) UAC1 and UAC2 are configured for direct media. Asterisk sends a 
re-INVITE to UAC1 and UAC2 to initiate direct media.
 (3) After responding with a 200 OK to the direct media requests, UAC1 
sends a re-INVITE offering T.38.
 (4) Asterisk sends an INVITE with T.38 to UAC2.
 (5) UAC2 sends back a 200 OK for T.38; Asterisk sends that to UAC1. 
Asterisk switches out of a direct media bridge to a core bridge.
 (6) UAC1 hangs up. Asterisk sends a re-INVITE to UAC2 for audio back 
to Asterisk. UAC2 responds with a 200 OK for the audio.
 (7) Asterisk ejects UAC2 back to the dialplan.

It's important to note that this test never should have passed - an 
update to the test suite "fixed" the test erroneously passing, which 
led to us investigating why the scenario was failing. This test was 
copied over from an identical chan_sip test, which passes.


The PJSIP stack has two issues which make life difficult for it in 
this scenario:
(1) The T.38 logic is implemented in res_pjsip_t38. While that is 
_mostly_ a very good thing - as it keeps all the fax state logic 
outside of the channel driver - we are also a layer removed from 
interactions that occur in the channel driver. That makes it 
challenging to influence direct media checks and other 
Asterisk/channel interactions.
(2) Being very asynchronous, requests may be serviced that influence 
T.38 state while other interactions are occurring in the core. 
Informing the core of what has occurred can have more race conditions 
than what occurs in chan_sip, which is single threaded.


The first bug discovered when the test was investigated was an issue 
in step (2). We never actually initiated a direct media re-INVITE. 
This was due to res_pjsip_t38 using a frame hook, and not implementing 
the .consume_cb callback. That callback allows a framehook to inform 
the core (and also the bridging framework) of the types of frames that 
a framehook wants to consume. If a framehook needs audio, a direct 
media bridge will be explicitly denied, and - by default - the 
bridging framework assumes that framehooks will want all frames. 
Another bug that was discovered occurred in step (6). When UAC1 sends 
a BYE request, nothing informed UAC2 that the fax had ended - instead, 
it was merely ejected from the bridge. This meant that it kept its 
T.38 session going, and Asterisk never sent a re-INVITE to UAC2. Both 
of these bugs were fixed by 726ee873a6.


Except, unfortunately, the second bug wasn't really fixed.

726ee873a6 did the "right" thing by intercepting the BYE request sent 
by UAC1, and queueing up a control frame of type 
AST_CONTROL_T38_PARAMETERS with a new state of AST_T38_TERMINATED. 
This is supposed to be passed on to UAC2, informing it that the T.38 
fax has ended, and that it should have its media re-negotiated back to 
the last known state (audio) but also back to Asterisk (since we 
aren't going to be in a bridge any longer). Unfortunately, this code 
was insufficient.


A race condition exists in this case. On the one hand, we've just 
queued up a frame on UAC1's channel to be passed into the bridge, 
which should get tossed onto UAC2's channel. On the other hand, we've 
just told the bridging framework to kill UAC1's channel with extreme 
prejudice, thereby also terminating the bridge and ejecting UAC2 off 
into the dialplan. In the first case, this is an asynchronous, message 
passing mechanism; in the second case, the bridging framework inspects 
the channel to see if it should be hung up on *every frame* and 
*immediately* starts the hangup/shutdown procedure if it knows the 
channel should die. This is not asynchronous in any way. As a result, 
UAC1 may be hung up and the bridge dissolved before UAC2 ever gets its 
control frame from UAC1.


There were a couple of solutions to this problem that were tried:
(1) First, I tried to make sure that enqueued control frames were 
flushed out of a channel and passed over the bridge when a hangup was 
detected. In practice, this was incredibly cumbersome - some control 
frames should get tossed, others need to be preserved. What was worse 
was the sheer number of places the bridge dissolution can be 
triggered. While it wasn't hard to make sure we flushed frames off an 
ejected channel into a bridge, it was nigh impossible to ensure 

[asterisk-dev] Bridges, T.38, and other good times

2015-12-06 Thread Matthew Jordan
Hello all -

One of the efforts that a number of developers in the community here at
Digium have been working on is cleaning up test failures exposed by
Jenkins [1]. One of these failures, in particular, has been rather
difficult to resolve - namely, fax/pjsip/directmedia_reinvite_t38 [2]. This
e-mail goes over what has been accomplished so far, and asks some questions
about how we might fix Asterisk in this scenario.

The directmedia_reinvite_t38 test attempts to do the following:
 (1) UAC1 calls UAC2 through Asterisk, with audio as the media. The dial is
performed using the 'g' flag, such that UAC2 will continue on if UAC1 hangs
up.
 (2) UAC1 and UAC2 are configured for direct media. Asterisk sends a
re-INVITE to UAC1 and UAC2 to initiate direct media.
 (3) After responding with a 200 OK to the direct media requests, UAC1
sends a re-INVITE offering T.38.
 (4) Asterisk sends an INVITE with T.38 to UAC2.
 (5) UAC2 sends back a 200 OK for T.38; Asterisk sends that to UAC1.
Asterisk switches out of a direct media bridge to a core bridge.
 (6) UAC1 hangs up. Asterisk sends a re-INVITE to UAC2 for audio back to
Asterisk. UAC2 responds with a 200 OK for the audio.
 (7) Asterisk ejects UAC2 back to the dialplan.
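
As a rough aid for following the steps above, here is a minimal sketch of
where UAC2's media is expected to flow at each point in the test. The enum
and function names are made up for illustration and are not Asterisk
identifiers; the assumption that T.38 flows through Asterisk in step (5) is
inferred from the switch to a core bridge.

```c
/* Hypothetical model of UAC2's media path during the test.
 * None of these names exist in Asterisk; they only mirror the steps. */
enum media_state {
    MEDIA_AUDIO_VIA_ASTERISK,  /* step (1), and again after step (6) */
    MEDIA_AUDIO_DIRECT,        /* after the direct media re-INVITE, step (2) */
    MEDIA_T38_VIA_ASTERISK     /* after T.38 is accepted, steps (4)-(5) */
};

enum test_event {
    EV_DIRECT_MEDIA_REINVITE_OK,  /* 200 OK to the direct media re-INVITE */
    EV_T38_REINVITE_OK,           /* 200 OK to the T.38 (re-)INVITE */
    EV_PEER_HANGUP                /* UAC1 sends BYE */
};

/* Return the media state expected after an event. */
enum media_state next_media_state(enum media_state cur, enum test_event ev)
{
    switch (ev) {
    case EV_DIRECT_MEDIA_REINVITE_OK:
        /* Direct media only makes sense when starting from bridged audio. */
        return cur == MEDIA_AUDIO_VIA_ASTERISK ? MEDIA_AUDIO_DIRECT : cur;
    case EV_T38_REINVITE_OK:
        /* Assumed: T.38 moves media back through a core bridge. */
        return MEDIA_T38_VIA_ASTERISK;
    case EV_PEER_HANGUP:
        /* UAC2 should be re-INVITEd back to audio with Asterisk. */
        return MEDIA_AUDIO_VIA_ASTERISK;
    }
    return cur;
}
```

The failure discussed below amounts to UAC2 staying in the T.38 state
after the hangup event instead of returning to audio with Asterisk.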

It's important to note that this test never should have passed - an update
to the test suite "fixed" the test erroneously passing, which led to us
investigating why the scenario was failing. This test was copied over from
an identical chan_sip test, which passes.

The PJSIP stack has two issues which make life difficult for it in this
scenario:
(1) The T.38 logic is implemented in res_pjsip_t38. While that is _mostly_
a very good thing - as it keeps all the fax state logic outside of the
channel driver - we are also a layer removed from interactions that occur
in the channel driver. That makes it challenging to influence direct media
checks and other Asterisk/channel interactions.
(2) Because the PJSIP stack is very asynchronous, requests that influence
T.38 state may be serviced while other interactions are occurring in the
core. Informing the core of what has occurred is therefore more prone to
race conditions than in chan_sip, which is single threaded.

The first bug discovered when the test was investigated was an issue in
step (2): we never actually initiated a direct media re-INVITE. This was
due to res_pjsip_t38 using a framehook without implementing the
.consume_cb callback. That callback allows a framehook to inform the core
(and also the bridging framework) of the types of frames that it wants to
consume. If a framehook needs audio, a direct media bridge will be
explicitly denied, and - by default - the bridging framework assumes that
framehooks will want all frames. Without the callback, the T.38 framehook
was therefore treated as wanting audio, and direct media was never set up.

Another bug that was discovered occurred in step (6). When UAC1 sends a
BYE request, nothing informed UAC2 that the fax had ended - instead, it was
merely ejected from the bridge. This meant that it kept its T.38 session
going, and Asterisk never sent a re-INVITE to UAC2. Both of these bugs
were fixed by 726ee873a6.
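
To illustrate the first bug above, here is a minimal, self-contained model
of the consume check. It does not use Asterisk's headers: struct hook,
hook_wants() and direct_media_allowed() are hypothetical stand-ins, and the
choice to have the T.38 hook consume only control frames is an assumption
made for the sketch.

```c
#include <stddef.h>

/* Hypothetical, reduced set of frame types for this sketch. */
enum frame_type { FRAME_VOICE, FRAME_CONTROL, FRAME_MODEM };

/* Hypothetical stand-in for a framehook with an optional consume callback. */
struct hook {
    /* Returns non-zero if the hook wants frames of this type; may be NULL. */
    int (*consume_cb)(void *data, enum frame_type type);
    void *data;
};

/* Assumed behaviour: a T.38 hook only needs control frames. */
int t38_consume(void *data, enum frame_type type)
{
    (void) data;
    return type == FRAME_CONTROL;
}

/* With no callback, fall back to the default: the hook wants everything. */
int hook_wants(const struct hook *h, enum frame_type type)
{
    return h->consume_cb ? h->consume_cb(h->data, type) : 1;
}

/* Direct media is only permitted when no hook wants voice frames. */
int direct_media_allowed(const struct hook *hooks, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (hook_wants(&hooks[i], FRAME_VOICE)) {
            return 0;
        }
    }
    return 1;
}
```

In this model, a hook with consume_cb left as NULL blocks direct media,
while the same hook with t38_consume installed does not.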

Except, unfortunately, the second bug wasn't really fixed.

726ee873a6 did the "right" thing by intercepting the BYE request sent by
UAC1, and queueing up a control frame of type AST_CONTROL_T38_PARAMETERS
with a new state of AST_T38_TERMINATED. This is supposed to be passed on to
UAC2, informing it that the T.38 fax has ended, and that it should have its
media re-negotiated back to the last known state (audio) but also back to
Asterisk (since we aren't going to be in a bridge any longer).
Unfortunately, this code was insufficient.
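
As a sketch of what that change is meant to do, the following models the
BYE handler queueing a "T.38 terminated" control frame on UAC1's channel.
The types and functions here are hypothetical simplifications that only
echo the names AST_CONTROL_T38_PARAMETERS and AST_T38_TERMINATED; they are
not the real Asterisk definitions or the code from 726ee873a6.

```c
/* Hypothetical mirrors of the identifiers named above. */
enum control_type { CONTROL_T38_PARAMETERS = 1 };
enum t38_state { T38_NEGOTIATED, T38_TERMINATED };

struct control_frame {
    enum control_type subclass;
    enum t38_state state;
};

#define QUEUE_MAX 8

/* Hypothetical channel with a small fixed-size queue of control frames. */
struct channel {
    struct control_frame queue[QUEUE_MAX];
    int queued;
};

/* Queue a control frame; returns 0 on success, -1 if the queue is full. */
int queue_control(struct channel *chan, struct control_frame frame)
{
    if (chan->queued >= QUEUE_MAX) {
        return -1;
    }
    chan->queue[chan->queued++] = frame;
    return 0;
}

/* On BYE from UAC1: queue a frame telling the far side that T.38 is over. */
int on_bye_received(struct channel *uac1)
{
    struct control_frame frame = { CONTROL_T38_PARAMETERS, T38_TERMINATED };

    return queue_control(uac1, frame);
}
```

Note that queueing only records the frame on UAC1's channel; something
still has to read it and write it to UAC2, which is where the trouble
starts.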

A race condition exists in this case. On the one hand, we've just queued up
a frame on UAC1's channel to be passed into the bridge, which should get
tossed onto UAC2's channel. On the other hand, we've just told the bridging
framework to kill UAC1's channel with extreme prejudice, thereby also
terminating the bridge and ejecting UAC2 off into the dialplan. The first
is an asynchronous, message-passing mechanism. The second is not
asynchronous in any way: the bridging framework inspects the channel on
*every frame* to see if it should be hung up, and *immediately* starts the
hangup/shutdown procedure if it knows the channel should die. As a result,
UAC1 may be hung up and the bridge dissolved before UAC2 ever gets its
control frame from UAC1.
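
The ordering problem can be shown with a small deterministic model. This is
only a sketch under the assumption that the bridge loop checks the hangup
flag before it forwards any queued frame; struct chan, bridge_service_once()
and run_bridge() are hypothetical names, not Asterisk functions.

```c
#include <stdbool.h>

/* Hypothetical channel: frames queued but not yet read, plus a hangup flag. */
struct chan {
    int pending_frames;
    bool hangup_requested;
};

/* One pass of a hypothetical bridge loop. The hangup check comes first,
 * so a flagged channel leaves the bridge with its queue still unread.
 * Returns false once the channel has left the bridge. */
bool bridge_service_once(struct chan *c, int *delivered)
{
    if (c->hangup_requested) {
        return false;
    }
    if (c->pending_frames > 0) {
        c->pending_frames--;
        (*delivered)++;
    }
    return true;
}

/* Service the channel until it leaves the bridge or its queue is empty.
 * Returns how many frames reached the peer. */
int run_bridge(struct chan *c)
{
    int delivered = 0;

    while (bridge_service_once(c, &delivered) && c->pending_frames > 0) {
        /* keep servicing */
    }
    return delivered;
}
```

If the frame is serviced before the hangup flag is set, run_bridge()
delivers it; if the flag is already set when the loop runs, the frame stays
stranded in pending_frames and the peer never sees it.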

I tried a couple of solutions to this problem:
(1) First, I tried to make sure that enqueued control frames were flushed
out of a channel and passed over the bridge when a hangup was detected. In
practice, this was incredibly cumbersome - some control frames should get
tossed, others need to be preserved. What was worse was the sheer number of
places the bridge dissolution can be triggered. While it wasn't hard to
make sure we flushed frames off an ejected channel into a bridge, it was
nigh impossible to ensure that this occurred every single time before the
other channels were ejected. Again, the bridging framework is