Re: [asterisk-dev] [Code Review] 3069: Fix deadlock experienced during multi-party PJSIP transfer through masquerade rework

2013-12-19 Thread Mark Michelson

---
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3069/
---

(Updated Dec. 19, 2013, 6:20 p.m.)


Status
--

This change has been marked as submitted.


Review request for Asterisk Developers, Joshua Colp, Jonathan Rose, Matt 
Jordan, and rmudgett.


Bugs: ASTERISK-22936
https://issues.asterisk.org/jira/browse/ASTERISK-22936


Repository: Asterisk


Description
---

This is more-or-less a take 2 of https://reviewboard.asterisk.org/r/3042

Initially, it was discovered that performing an attended transfer of a 
multiparty bridge with a PJSIP channel would cause a deadlock. A PBX thread 
started a masquerade and reached the point where it was calling the fixup() 
callback on the original channel. For chan_pjsip, this involves pushing a 
synchronous task to the session's serializer. The problem was that a task ahead 
of the fixup task was also attempting to perform a channel masquerade. However, 
since masquerades are designed in a way to only allow for one to occur at a 
time, the task ahead of the fixup could not continue until the masquerade 
already in progress had completed. And of course, the masquerade in progress 
could not complete until the task ahead of the fixup task had completed. 
Deadlock.

The initial fix was to change the fixup task to be asynchronous. While this 
prevented the deadlock from occurring, it had the frightful side effect of 
potentially allowing for tasks in the session's serializer to operate on a 
zombie channel.

Taking a step back from this particular deadlock, it became clear that the 
problem was not really this one particular issue but that masquerades 
themselves needed to be addressed. A PJSIP attended transfer operation calls 
ast_channel_move(), which attempts to both set up and execute a masquerade. The 
problem was that after it had set up the masquerade, the PBX thread had swooped 
in and tried to actually perform the masquerade. Looking at changes that had 
been made to Asterisk 12, it became clear that there never is any time now that 
anyone ever wants to set up a masquerade and allow for the channel thread to 
actually perform the masquerade. Everyone always is calling ast_channel_move(), 
performs the masquerade itself before returning.

In this patch, I have removed all blocks of code from channel.c that will 
attempt to perform a masquerade if ast_channel_masq() returns true. Now, there 
is no distinction between setting up a masquerade and performing the 
masquerade. It is one operation. The only remaining checks for 
ast_channel_masq() and ast_channel_masqr() are in ast_hangup() since we do not 
want to interrupt a masquerade by hanging up the channel. Instead, now 
ast_hangup() will wait for a masquerade to complete before moving forward with 
its operation.

The ast_channel_move() function has been modified to basically in-line the 
logic that used to be in ast_channel_masquerade(). ast_channel_masquerade() has 
been killed off for real. ast_channel_move() now has a lock associated with it 
that is used to prevent any simultaneous moves from occurring at once. This 
means there is no need to make sure that ast_channel_masq() or 
ast_channel_masqr() are already set on a channel when ast_channel_move() is 
called. It also means the channel container lock is not pulling double duty by 
both keeping the container locked and preventing multiple masquerades from 
occurring simultaneously.

The ast_do_masquerade() function has been renamed to do_channel_masquerade() 
and is now internal to channel.c. The function now takes explicit arguments of 
which channels are involved in the masquerade instead of a single channel. 
While it probably is possible to do some further refactoring of this method, I 
feel that I would be treading dangerously. Instead, all I did was change some 
comments that no longer are true after this changeset.

The other more minor change introduced in this patch is to res_pjsip.c to make 
ast_sip_push_task_synchronous() run the task in-place if we are already a SIP 
servant thread. This is related to this patch because even when we isolate the 
channel masquerade to only running in the SIP servant thread, we would still 
deadlock when the fixup() callback is reached since we would essentially be 
waiting forever for ourselves to finish before actually running the fixup. This 
makes it so the fixup is run without having to push a task into a serializer at 
all.


Diffs
-

  /branches/12/res/res_pjsip.c 403778 
  /branches/12/main/channel.c 403778 
  /branches/12/include/asterisk/channel.h 403778 
  /branches/12/include/asterisk/autochan.h 403778 

Diff: https://reviewboard.asterisk.org/r/3069/diff/


Testing
---

Jonathan ran the same test he had originally run in order to make the deadlock 
occur, and with this patch it no longer occurs. I also ran a bevy of 

Re: [asterisk-dev] [Code Review] 3069: Fix deadlock experienced during multi-party PJSIP transfer through masquerade rework

2013-12-18 Thread Joshua Colp


 On Dec. 16, 2013, 8:26 p.m., rmudgett wrote:
  /branches/12/main/channel.c, lines 10399-10400
  https://reviewboard.asterisk.org/r/3069/diff/1/?file=49555#file49555line10399
 
  One thing that ast_channel_masquerade() did that is not being done 
  anymore is queuing null frames to wake up threads before the masquerade 
  takes place.
  
  Maybe do_channel_masquerade() needs to poke both channels to wake up 
  their respective caretaker threads instead.  Those threads could be in the 
  ast_read/ast_write functions.
 
 Mark Michelson wrote:
 The old ast_channel_masquerade() would queue null frames, but I 
 interpreted that as being done so that when the thread servicing the channel 
 woke up, the subsequent ast_read() call would end up performing the 
 masquerade. Since we set up and perform the masquerade all at once now, 
 queuing a null frame at the time of marking the channels as being involved in 
 a masquerade seems like the wrong move.
 
 I *think* that signaling blocking threads is actually taken care of 
 already.
 
 The clonechan has a null frame queued onto it when it gets marked as a 
 zombie.
 The original channel has a SIGURG sent to its blocking thread shortly 
 after.
 
 Am I correct that this should result in any blocking threads being awoken 
 correctly?
 
 rmudgett wrote:
 I was hoping you would know with more certainty than I since the channel 
 fds are swapped around. :)

Yes, it should.


- Joshua


---
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3069/#review10423
---


On Dec. 12, 2013, 9:59 p.m., Mark Michelson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviewboard.asterisk.org/r/3069/
 ---
 
 (Updated Dec. 12, 2013, 9:59 p.m.)
 
 
 Review request for Asterisk Developers, Joshua Colp, Jonathan Rose, Matt 
 Jordan, and rmudgett.
 
 
 Bugs: ASTERISK-22936
 https://issues.asterisk.org/jira/browse/ASTERISK-22936
 
 
 Repository: Asterisk
 
 
 Description
 ---
 
 This is more-or-less a take 2 of https://reviewboard.asterisk.org/r/3042
 
 Initially, it was discovered that performing an attended transfer of a 
 multiparty bridge with a PJSIP channel would cause a deadlock. A PBX thread 
 started a masquerade and reached the point where it was calling the fixup() 
 callback on the original channel. For chan_pjsip, this involves pushing a 
 synchronous task to the session's serializer. The problem was that a task 
 ahead of the fixup task was also attempting to perform a channel masquerade. 
 However, since masquerades are designed in a way to only allow for one to 
 occur at a time, the task ahead of the fixup could not continue until the 
 masquerade already in progress had completed. And of course, the masquerade 
 in progress could not complete until the task ahead of the fixup task had 
 completed. Deadlock.
 
 The initial fix was to change the fixup task to be asynchronous. While this 
 prevented the deadlock from occurring, it had the frightful side effect of 
 potentially allowing for tasks in the session's serializer to operate on a 
 zombie channel.
 
 Taking a step back from this particular deadlock, it became clear that the 
 problem was not really this one particular issue but that masquerades 
 themselves needed to be addressed. A PJSIP attended transfer operation calls 
 ast_channel_move(), which attempts to both set up and execute a masquerade. 
 The problem was that after it had set up the masquerade, the PBX thread had 
 swooped in and tried to actually perform the masquerade. Looking at changes 
 that had been made to Asterisk 12, it became clear that there never is any 
 time now that anyone ever wants to set up a masquerade and allow for the 
 channel thread to actually perform the masquerade. Everyone always is calling 
 ast_channel_move(), performs the masquerade itself before returning.
 
 In this patch, I have removed all blocks of code from channel.c that will 
 attempt to perform a masquerade if ast_channel_masq() returns true. Now, 
 there is no distinction between setting up a masquerade and performing the 
 masquerade. It is one operation. The only remaining checks for 
 ast_channel_masq() and ast_channel_masqr() are in ast_hangup() since we do 
 not want to interrupt a masquerade by hanging up the channel. Instead, now 
 ast_hangup() will wait for a masquerade to complete before moving forward 
 with its operation.
 
 The ast_channel_move() function has been modified to basically in-line the 
 logic that used to be in ast_channel_masquerade(). ast_channel_masquerade() 
 has been killed off for real. ast_channel_move() now has a lock associated 
 with it that is used to prevent any simultaneous 

Re: [asterisk-dev] [Code Review] 3069: Fix deadlock experienced during multi-party PJSIP transfer through masquerade rework

2013-12-18 Thread Mark Michelson

---
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3069/
---

(Updated Dec. 18, 2013, 8:20 p.m.)


Review request for Asterisk Developers, Joshua Colp, Jonathan Rose, Matt 
Jordan, and rmudgett.


Changes
---

Addressed review feedback.


Bugs: ASTERISK-22936
https://issues.asterisk.org/jira/browse/ASTERISK-22936


Repository: Asterisk


Description
---

This is more-or-less a take 2 of https://reviewboard.asterisk.org/r/3042

Initially, it was discovered that performing an attended transfer of a 
multiparty bridge with a PJSIP channel would cause a deadlock. A PBX thread 
started a masquerade and reached the point where it was calling the fixup() 
callback on the original channel. For chan_pjsip, this involves pushing a 
synchronous task to the session's serializer. The problem was that a task ahead 
of the fixup task was also attempting to perform a channel masquerade. However, 
since masquerades are designed in a way to only allow for one to occur at a 
time, the task ahead of the fixup could not continue until the masquerade 
already in progress had completed. And of course, the masquerade in progress 
could not complete until the task ahead of the fixup task had completed. 
Deadlock.

The initial fix was to change the fixup task to be asynchronous. While this 
prevented the deadlock from occurring, it had the frightful side effect of 
potentially allowing for tasks in the session's serializer to operate on a 
zombie channel.

Taking a step back from this particular deadlock, it became clear that the 
problem was not really this one particular issue but that masquerades 
themselves needed to be addressed. A PJSIP attended transfer operation calls 
ast_channel_move(), which attempts to both set up and execute a masquerade. The 
problem was that after it had set up the masquerade, the PBX thread had swooped 
in and tried to actually perform the masquerade. Looking at changes that had 
been made to Asterisk 12, it became clear that there never is any time now that 
anyone ever wants to set up a masquerade and allow for the channel thread to 
actually perform the masquerade. Everyone always is calling ast_channel_move(), 
performs the masquerade itself before returning.

In this patch, I have removed all blocks of code from channel.c that will 
attempt to perform a masquerade if ast_channel_masq() returns true. Now, there 
is no distinction between setting up a masquerade and performing the 
masquerade. It is one operation. The only remaining checks for 
ast_channel_masq() and ast_channel_masqr() are in ast_hangup() since we do not 
want to interrupt a masquerade by hanging up the channel. Instead, now 
ast_hangup() will wait for a masquerade to complete before moving forward with 
its operation.

The ast_channel_move() function has been modified to basically in-line the 
logic that used to be in ast_channel_masquerade(). ast_channel_masquerade() has 
been killed off for real. ast_channel_move() now has a lock associated with it 
that is used to prevent any simultaneous moves from occurring at once. This 
means there is no need to make sure that ast_channel_masq() or 
ast_channel_masqr() are already set on a channel when ast_channel_move() is 
called. It also means the channel container lock is not pulling double duty by 
both keeping the container locked and preventing multiple masquerades from 
occurring simultaneously.

The ast_do_masquerade() function has been renamed to do_channel_masquerade() 
and is now internal to channel.c. The function now takes explicit arguments of 
which channels are involved in the masquerade instead of a single channel. 
While it probably is possible to do some further refactoring of this method, I 
feel that I would be treading dangerously. Instead, all I did was change some 
comments that no longer are true after this changeset.

The other more minor change introduced in this patch is to res_pjsip.c to make 
ast_sip_push_task_synchronous() run the task in-place if we are already a SIP 
servant thread. This is related to this patch because even when we isolate the 
channel masquerade to only running in the SIP servant thread, we would still 
deadlock when the fixup() callback is reached since we would essentially be 
waiting forever for ourselves to finish before actually running the fixup. This 
makes it so the fixup is run without having to push a task into a serializer at 
all.


Diffs (updated)
-

  /branches/12/res/res_pjsip.c 403778 
  /branches/12/main/channel.c 403778 
  /branches/12/include/asterisk/channel.h 403778 
  /branches/12/include/asterisk/autochan.h 403778 

Diff: https://reviewboard.asterisk.org/r/3069/diff/


Testing
---

Jonathan ran the same test he had originally run in order to make the deadlock 
occur, and with this patch it no longer occurs. I also ran a bevy of 

Re: [asterisk-dev] [Code Review] 3069: Fix deadlock experienced during multi-party PJSIP transfer through masquerade rework

2013-12-18 Thread rmudgett

---
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3069/#review10453
---



/branches/12/main/channel.c
https://reviewboard.asterisk.org/r/3069/#comment19917

This is mostly still true.  Change to:

The masq and masqr pointers need to be left alone until the masquerade has 
restabilized the channels to hold off ast_hangup() and until the 
AST_FLAG_ZOMBIE can be set on the clonechan.




/branches/12/main/channel.c
https://reviewboard.asterisk.org/r/3069/#comment19921

I think the masq and masqr pointers need to be cleared later at the same 
time where you are clearing the masq pointer now.  They also need to be cleared 
with both channels locked.  This way agent.c and features.c could get the 
masq/masqr channel name safely if they had the channel locked.



/branches/12/main/channel.c
https://reviewboard.asterisk.org/r/3069/#comment19920

This comment can be removed.  Delaying the masq/masqr clearing holds off 
the hangup and another masquerade cannot be started until this one finished.



/branches/12/main/channel.c
https://reviewboard.asterisk.org/r/3069/#comment19923

Lock both channels and clear masq/masqr here.


- rmudgett


On Dec. 18, 2013, 8:20 p.m., Mark Michelson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviewboard.asterisk.org/r/3069/
 ---
 
 (Updated Dec. 18, 2013, 8:20 p.m.)
 
 
 Review request for Asterisk Developers, Joshua Colp, Jonathan Rose, Matt 
 Jordan, and rmudgett.
 
 
 Bugs: ASTERISK-22936
 https://issues.asterisk.org/jira/browse/ASTERISK-22936
 
 
 Repository: Asterisk
 
 
 Description
 ---
 
 This is more-or-less a take 2 of https://reviewboard.asterisk.org/r/3042
 
 Initially, it was discovered that performing an attended transfer of a 
 multiparty bridge with a PJSIP channel would cause a deadlock. A PBX thread 
 started a masquerade and reached the point where it was calling the fixup() 
 callback on the original channel. For chan_pjsip, this involves pushing a 
 synchronous task to the session's serializer. The problem was that a task 
 ahead of the fixup task was also attempting to perform a channel masquerade. 
 However, since masquerades are designed in a way to only allow for one to 
 occur at a time, the task ahead of the fixup could not continue until the 
 masquerade already in progress had completed. And of course, the masquerade 
 in progress could not complete until the task ahead of the fixup task had 
 completed. Deadlock.
 
 The initial fix was to change the fixup task to be asynchronous. While this 
 prevented the deadlock from occurring, it had the frightful side effect of 
 potentially allowing for tasks in the session's serializer to operate on a 
 zombie channel.
 
 Taking a step back from this particular deadlock, it became clear that the 
 problem was not really this one particular issue but that masquerades 
 themselves needed to be addressed. A PJSIP attended transfer operation calls 
 ast_channel_move(), which attempts to both set up and execute a masquerade. 
 The problem was that after it had set up the masquerade, the PBX thread had 
 swooped in and tried to actually perform the masquerade. Looking at changes 
 that had been made to Asterisk 12, it became clear that there never is any 
 time now that anyone ever wants to set up a masquerade and allow for the 
 channel thread to actually perform the masquerade. Everyone always is calling 
 ast_channel_move(), performs the masquerade itself before returning.
 
 In this patch, I have removed all blocks of code from channel.c that will 
 attempt to perform a masquerade if ast_channel_masq() returns true. Now, 
 there is no distinction between setting up a masquerade and performing the 
 masquerade. It is one operation. The only remaining checks for 
 ast_channel_masq() and ast_channel_masqr() are in ast_hangup() since we do 
 not want to interrupt a masquerade by hanging up the channel. Instead, now 
 ast_hangup() will wait for a masquerade to complete before moving forward 
 with its operation.
 
 The ast_channel_move() function has been modified to basically in-line the 
 logic that used to be in ast_channel_masquerade(). ast_channel_masquerade() 
 has been killed off for real. ast_channel_move() now has a lock associated 
 with it that is used to prevent any simultaneous moves from occurring at 
 once. This means there is no need to make sure that ast_channel_masq() or 
 ast_channel_masqr() are already set on a channel when ast_channel_move() is 
 called. It also means the channel container lock is not pulling double duty 
 by both keeping the container locked and preventing multiple masquerades from 
 occurring simultaneously.
 
 The ast_do_masquerade() 

Re: [asterisk-dev] [Code Review] 3069: Fix deadlock experienced during multi-party PJSIP transfer through masquerade rework

2013-12-18 Thread Mark Michelson

---
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3069/
---

(Updated Dec. 18, 2013, 11:10 p.m.)


Review request for Asterisk Developers, Joshua Colp, Jonathan Rose, Matt 
Jordan, and rmudgett.


Changes
---

Addressed latest feedback.


Bugs: ASTERISK-22936
https://issues.asterisk.org/jira/browse/ASTERISK-22936


Repository: Asterisk


Description
---

This is more-or-less a take 2 of https://reviewboard.asterisk.org/r/3042

Initially, it was discovered that performing an attended transfer of a 
multiparty bridge with a PJSIP channel would cause a deadlock. A PBX thread 
started a masquerade and reached the point where it was calling the fixup() 
callback on the original channel. For chan_pjsip, this involves pushing a 
synchronous task to the session's serializer. The problem was that a task ahead 
of the fixup task was also attempting to perform a channel masquerade. However, 
since masquerades are designed in a way to only allow for one to occur at a 
time, the task ahead of the fixup could not continue until the masquerade 
already in progress had completed. And of course, the masquerade in progress 
could not complete until the task ahead of the fixup task had completed. 
Deadlock.

The initial fix was to change the fixup task to be asynchronous. While this 
prevented the deadlock from occurring, it had the frightful side effect of 
potentially allowing for tasks in the session's serializer to operate on a 
zombie channel.

Taking a step back from this particular deadlock, it became clear that the 
problem was not really this one particular issue but that masquerades 
themselves needed to be addressed. A PJSIP attended transfer operation calls 
ast_channel_move(), which attempts to both set up and execute a masquerade. The 
problem was that after it had set up the masquerade, the PBX thread had swooped 
in and tried to actually perform the masquerade. Looking at changes that had 
been made to Asterisk 12, it became clear that there never is any time now that 
anyone ever wants to set up a masquerade and allow for the channel thread to 
actually perform the masquerade. Everyone always is calling ast_channel_move(), 
performs the masquerade itself before returning.

In this patch, I have removed all blocks of code from channel.c that will 
attempt to perform a masquerade if ast_channel_masq() returns true. Now, there 
is no distinction between setting up a masquerade and performing the 
masquerade. It is one operation. The only remaining checks for 
ast_channel_masq() and ast_channel_masqr() are in ast_hangup() since we do not 
want to interrupt a masquerade by hanging up the channel. Instead, now 
ast_hangup() will wait for a masquerade to complete before moving forward with 
its operation.

The ast_channel_move() function has been modified to basically in-line the 
logic that used to be in ast_channel_masquerade(). ast_channel_masquerade() has 
been killed off for real. ast_channel_move() now has a lock associated with it 
that is used to prevent any simultaneous moves from occurring at once. This 
means there is no need to make sure that ast_channel_masq() or 
ast_channel_masqr() are already set on a channel when ast_channel_move() is 
called. It also means the channel container lock is not pulling double duty by 
both keeping the container locked and preventing multiple masquerades from 
occurring simultaneously.

The ast_do_masquerade() function has been renamed to do_channel_masquerade() 
and is now internal to channel.c. The function now takes explicit arguments of 
which channels are involved in the masquerade instead of a single channel. 
While it probably is possible to do some further refactoring of this method, I 
feel that I would be treading dangerously. Instead, all I did was change some 
comments that no longer are true after this changeset.

The other more minor change introduced in this patch is to res_pjsip.c to make 
ast_sip_push_task_synchronous() run the task in-place if we are already a SIP 
servant thread. This is related to this patch because even when we isolate the 
channel masquerade to only running in the SIP servant thread, we would still 
deadlock when the fixup() callback is reached since we would essentially be 
waiting forever for ourselves to finish before actually running the fixup. This 
makes it so the fixup is run without having to push a task into a serializer at 
all.


Diffs (updated)
-

  /branches/12/res/res_pjsip.c 403778 
  /branches/12/main/channel.c 403778 
  /branches/12/include/asterisk/channel.h 403778 
  /branches/12/include/asterisk/autochan.h 403778 

Diff: https://reviewboard.asterisk.org/r/3069/diff/


Testing
---

Jonathan ran the same test he had originally run in order to make the deadlock 
occur, and with this patch it no longer occurs. I also ran a bevy of 

Re: [asterisk-dev] [Code Review] 3069: Fix deadlock experienced during multi-party PJSIP transfer through masquerade rework

2013-12-18 Thread rmudgett

---
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3069/#review10460
---

Ship it!


Ship It!

- rmudgett


On Dec. 18, 2013, 11:10 p.m., Mark Michelson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviewboard.asterisk.org/r/3069/
 ---
 
 (Updated Dec. 18, 2013, 11:10 p.m.)
 
 
 Review request for Asterisk Developers, Joshua Colp, Jonathan Rose, Matt 
 Jordan, and rmudgett.
 
 
 Bugs: ASTERISK-22936
 https://issues.asterisk.org/jira/browse/ASTERISK-22936
 
 
 Repository: Asterisk
 
 
 Description
 ---
 
 This is more-or-less a take 2 of https://reviewboard.asterisk.org/r/3042
 
 Initially, it was discovered that performing an attended transfer of a 
 multiparty bridge with a PJSIP channel would cause a deadlock. A PBX thread 
 started a masquerade and reached the point where it was calling the fixup() 
 callback on the original channel. For chan_pjsip, this involves pushing a 
 synchronous task to the session's serializer. The problem was that a task 
 ahead of the fixup task was also attempting to perform a channel masquerade. 
 However, since masquerades are designed in a way to only allow for one to 
 occur at a time, the task ahead of the fixup could not continue until the 
 masquerade already in progress had completed. And of course, the masquerade 
 in progress could not complete until the task ahead of the fixup task had 
 completed. Deadlock.
 
 The initial fix was to change the fixup task to be asynchronous. While this 
 prevented the deadlock from occurring, it had the frightful side effect of 
 potentially allowing for tasks in the session's serializer to operate on a 
 zombie channel.
 
 Taking a step back from this particular deadlock, it became clear that the 
 problem was not really this one particular issue but that masquerades 
 themselves needed to be addressed. A PJSIP attended transfer operation calls 
 ast_channel_move(), which attempts to both set up and execute a masquerade. 
 The problem was that after it had set up the masquerade, the PBX thread had 
 swooped in and tried to actually perform the masquerade. Looking at changes 
 that had been made to Asterisk 12, it became clear that there never is any 
 time now that anyone ever wants to set up a masquerade and allow for the 
 channel thread to actually perform the masquerade. Everyone always is calling 
 ast_channel_move(), performs the masquerade itself before returning.
 
 In this patch, I have removed all blocks of code from channel.c that will 
 attempt to perform a masquerade if ast_channel_masq() returns true. Now, 
 there is no distinction between setting up a masquerade and performing the 
 masquerade. It is one operation. The only remaining checks for 
 ast_channel_masq() and ast_channel_masqr() are in ast_hangup() since we do 
 not want to interrupt a masquerade by hanging up the channel. Instead, now 
 ast_hangup() will wait for a masquerade to complete before moving forward 
 with its operation.
 
 The ast_channel_move() function has been modified to basically in-line the 
 logic that used to be in ast_channel_masquerade(). ast_channel_masquerade() 
 has been killed off for real. ast_channel_move() now has a lock associated 
 with it that is used to prevent any simultaneous moves from occurring at 
 once. This means there is no need to make sure that ast_channel_masq() or 
 ast_channel_masqr() are already set on a channel when ast_channel_move() is 
 called. It also means the channel container lock is not pulling double duty 
 by both keeping the container locked and preventing multiple masquerades from 
 occurring simultaneously.
 
 The ast_do_masquerade() function has been renamed to do_channel_masquerade() 
 and is now internal to channel.c. The function now takes explicit arguments 
 of which channels are involved in the masquerade instead of a single channel. 
 While it probably is possible to do some further refactoring of this method, 
 I feel that I would be treading dangerously. Instead, all I did was change 
 some comments that no longer are true after this changeset.
 
 The other more minor change introduced in this patch is to res_pjsip.c to 
 make ast_sip_push_task_synchronous() run the task in-place if we are already 
 a SIP servant thread. This is related to this patch because even when we 
 isolate the channel masquerade to only running in the SIP servant thread, we 
 would still deadlock when the fixup() callback is reached since we would 
 essentially be waiting forever for ourselves to finish before actually 
 running the fixup. This makes it so the fixup is run without having to push a 
 task into a serializer at all.
 
 
 Diffs
 -
 
   /branches/12/res/res_pjsip.c 403778 
   

Re: [asterisk-dev] [Code Review] 3069: Fix deadlock experienced during multi-party PJSIP transfer through masquerade rework

2013-12-17 Thread Mark Michelson


 On Dec. 16, 2013, 8:26 p.m., rmudgett wrote:
  /branches/12/main/channel.c, line 6479
  https://reviewboard.asterisk.org/r/3069/diff/1/?file=49555#file49555line6479
 
  How about
  channel_do_masquerade()
  or simply
  do_masquerade()
 

I went with channel_do_masquerade() because I really just don't care.


 On Dec. 16, 2013, 8:26 p.m., rmudgett wrote:
  /branches/12/main/channel.c, lines 6567-6572
  https://reviewboard.asterisk.org/r/3069/diff/1/?file=49555#file49555line6567
 
  You could move the lock to after the comment since the only thing it is 
  protecting now is getting the visible_indication value.


 On Dec. 16, 2013, 8:26 p.m., rmudgett wrote:
  /branches/12/main/channel.c, lines 2687-2704
  https://reviewboard.asterisk.org/r/3069/diff/1/?file=49555#file49555line2687
 
  It is probably simpler now to just make ast_hangup() wait for any 
  masquerade in progress to complete.
  
  while (ast_channel_masq() || ast_channel_masqr()) {
unlock(chan);
usleep(1);
lock(chan);
  }

I'm actually beginning to rethink my original change a bit.

In channel_do_masquerade() (yes, I renamed it to that), the masq field on the 
original is cleared quite earlier than I expected. The masq is cleared from the 
original, then the original is unlocked, and then further operations are 
performed on the original afterward, to include indicating a visible indication 
and indicating SRCCHANGE frames. When the original becomes unlocked at this 
point in the masquerade, the ast_hangup() operation will be able to grab the 
channel lock, will see that ast_channel_masq() evaluates NULL, and will 
continue with the hangup, thus hanging up the channel at the same time that 
frames are being indicated on it. I think the masq field on the channel needs 
to be cleared at the very end of the masquerade in order to prevent this from 
happening. Do you agree?

As far as the masqr field is concerned, if I make the suggested change here, 
then I should also remove the clone_was_zombie logic from 
channel_do_masquerade() since that situation would no longer be possible. If 
the channel is marked as a masqr and then ast_hangup() is called, then the 
hangup will wait for the masquerade to complete before marking the channel as a 
zombie. If ast_hangup() is called first, then the channel cannot be marked as a 
masqr since it will already be marked as a zombie. Thus, when 
channel_do_masquerade() is called, it will be impossible for the clonechan to 
be marked as a zombie. Does this logic seem sound?


 On Dec. 16, 2013, 8:26 p.m., rmudgett wrote:
  /branches/12/main/channel.c, lines 10399-10400
  https://reviewboard.asterisk.org/r/3069/diff/1/?file=49555#file49555line10399
 
  One thing that ast_channel_masquerade() did that is not being done 
  anymore is queuing null frames to wake up threads before the masquerade 
  takes place.
  
  Maybe do_channel_masquerade() needs to poke both channels to wake up 
  their respective caretaker threads instead.  Those threads could be in the 
  ast_read/ast_write functions.

The old ast_channel_masquerade() would queue null frames, but I interpreted 
that as being done so that when the thread servicing the channel woke up, the 
subsequent ast_read() call would end up performing the masquerade. Since we set 
up and perform the masquerade all at once now, queuing a null frame at the time 
of marking the channels as being involved in a masquerade seems like the wrong 
move.

I *think* that signaling blocking threads is actually taken care of already.

The clonechan has a null frame queued onto it when it gets marked as a zombie.
The original channel has a SIGURG sent to its blocking thread shortly after.

Am I correct that this should result in any blocking threads being awoken 
correctly?


- Mark


---
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3069/#review10423
---


On Dec. 12, 2013, 9:59 p.m., Mark Michelson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviewboard.asterisk.org/r/3069/
 ---
 
 (Updated Dec. 12, 2013, 9:59 p.m.)
 
 
 Review request for Asterisk Developers, Joshua Colp, Jonathan Rose, Matt 
 Jordan, and rmudgett.
 
 
 Bugs: ASTERISK-22936
 https://issues.asterisk.org/jira/browse/ASTERISK-22936
 
 
 Repository: Asterisk
 
 
 Description
 ---
 
 This is more-or-less a take 2 of https://reviewboard.asterisk.org/r/3042
 
 Initially, it was discovered that performing an attended transfer of a 
 multiparty bridge with a PJSIP channel would cause a deadlock. A PBX thread 
 started a masquerade and reached the point where it was calling the fixup() 
 callback on the original channel. For chan_pjsip, 

Re: [asterisk-dev] [Code Review] 3069: Fix deadlock experienced during multi-party PJSIP transfer through masquerade rework

2013-12-17 Thread rmudgett


 On Dec. 16, 2013, 8:26 p.m., rmudgett wrote:
  /branches/12/main/channel.c, lines 2687-2704
  https://reviewboard.asterisk.org/r/3069/diff/1/?file=49555#file49555line2687
 
  It is probably simpler now to just make ast_hangup() wait for any 
  masquerade in progress to complete.
  
  while (ast_channel_masq() || ast_channel_masqr()) {
unlock(chan);
usleep(1);
lock(chan);
  }
 
 Mark Michelson wrote:
 I'm actually beginning to rethink my original change a bit.
 
 In channel_do_masquerade() (yes, I renamed it to that), the masq field on 
 the original is cleared quite earlier than I expected. The masq is cleared 
 from the original, then the original is unlocked, and then further operations 
 are performed on the original afterward, to include indicating a visible 
 indication and indicating SRCCHANGE frames. When the original becomes 
 unlocked at this point in the masquerade, the ast_hangup() operation will be 
 able to grab the channel lock, will see that ast_channel_masq() evaluates 
 NULL, and will continue with the hangup, thus hanging up the channel at the 
 same time that frames are being indicated on it. I think the masq field on 
 the channel needs to be cleared at the very end of the masquerade in order to 
 prevent this from happening. Do you agree?
 
 As far as the masqr field is concerned, if I make the suggested change 
 here, then I should also remove the clone_was_zombie logic from 
 channel_do_masquerade() since that situation would no longer be possible. If 
 the channel is marked as a masqr and then ast_hangup() is called, then the 
 hangup will wait for the masquerade to complete before marking the channel as 
 a zombie. If ast_hangup() is called first, then the channel cannot be marked 
 as a masqr since it will already be marked as a zombie. Thus, when 
 channel_do_masquerade() is called, it will be impossible for the clonechan to 
 be marked as a zombie. Does this logic seem sound?

Yes.  Clearing the masq and masqr pointers at the end should work fine.  There 
cannot be another masquerade started anyway because of the masquerade lock 
around ast_channel_move().

Yes.  Remove the clone_was_zombie is true code.  It is dead code with the 
proposed ast_hangup() change.  If ast_hangup() is called first then 
ast_channel_move() will fail before channel_do_masquerade() is called because 
ast_hangup() has marked it as a zombie.


 On Dec. 16, 2013, 8:26 p.m., rmudgett wrote:
  /branches/12/main/channel.c, lines 10399-10400
  https://reviewboard.asterisk.org/r/3069/diff/1/?file=49555#file49555line10399
 
  One thing that ast_channel_masquerade() did that is not being done 
  anymore is queuing null frames to wake up threads before the masquerade 
  takes place.
  
  Maybe do_channel_masquerade() needs to poke both channels to wake up 
  their respective caretaker threads instead.  Those threads could be in the 
  ast_read/ast_write functions.
 
 Mark Michelson wrote:
 The old ast_channel_masquerade() would queue null frames, but I 
 interpreted that as being done so that when the thread servicing the channel 
 woke up, the subsequent ast_read() call would end up performing the 
 masquerade. Since we set up and perform the masquerade all at once now, 
 queuing a null frame at the time of marking the channels as being involved in 
 a masquerade seems like the wrong move.
 
 I *think* that signaling blocking threads is actually taken care of 
 already.
 
 The clonechan has a null frame queued onto it when it gets marked as a 
 zombie.
 The original channel has a SIGURG sent to its blocking thread shortly 
 after.
 
 Am I correct that this should result in any blocking threads being awoken 
 correctly?

I was hoping you would know with more certainty than I since the channel fds 
are swapped around. :)


- rmudgett


---
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3069/#review10423
---


On Dec. 12, 2013, 9:59 p.m., Mark Michelson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviewboard.asterisk.org/r/3069/
 ---
 
 (Updated Dec. 12, 2013, 9:59 p.m.)
 
 
 Review request for Asterisk Developers, Joshua Colp, Jonathan Rose, Matt 
 Jordan, and rmudgett.
 
 
 Bugs: ASTERISK-22936
 https://issues.asterisk.org/jira/browse/ASTERISK-22936
 
 
 Repository: Asterisk
 
 
 Description
 ---
 
 This is more-or-less a take 2 of https://reviewboard.asterisk.org/r/3042
 
 Initially, it was discovered that performing an attended transfer of a 
 multiparty bridge with a PJSIP channel would cause a deadlock. A PBX thread 
 started a masquerade and reached the point where it was calling the 

Re: [asterisk-dev] [Code Review] 3069: Fix deadlock experienced during multi-party PJSIP transfer through masquerade rework

2013-12-16 Thread rmudgett

---
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3069/#review10423
---



/branches/12/main/channel.c
https://reviewboard.asterisk.org/r/3069/#comment19873

It is probably simpler now to just make ast_hangup() wait for any 
masquerade in progress to complete.

while (ast_channel_masq() || ast_channel_masqr()) {
  unlock(chan);
  usleep(1);
  lock(chan);
}



/branches/12/main/channel.c
https://reviewboard.asterisk.org/r/3069/#comment19868

You can use CHANNEL_DEADLOCK_AVOIDANCE() instead of inlining it.



/branches/12/main/channel.c
https://reviewboard.asterisk.org/r/3069/#comment19870

This comment needs updating.  Also there is another doxygen reference to 
ast_do_masquerade() in autochan.h for the ast_autochan_new_channel() prototype.



/branches/12/main/channel.c
https://reviewboard.asterisk.org/r/3069/#comment19874

Comment no longer necessary



/branches/12/main/channel.c
https://reviewboard.asterisk.org/r/3069/#comment19875

Update comment.



/branches/12/main/channel.c
https://reviewboard.asterisk.org/r/3069/#comment19871

How about
channel_do_masquerade()
or simply
do_masquerade()




/branches/12/main/channel.c
https://reviewboard.asterisk.org/r/3069/#comment19876

You could move the lock to after the comment since the only thing it is 
protecting now is getting the visible_indication value.



/branches/12/main/channel.c
https://reviewboard.asterisk.org/r/3069/#comment19877

One thing that ast_channel_masquerade() did that is not being done anymore 
is queuing null frames to wake up threads before the masquerade takes place.

Maybe do_channel_masquerade() needs to poke both channels to wake up their 
respective caretaker threads instead.  Those threads could be in the 
ast_read/ast_write functions.


- rmudgett


On Dec. 12, 2013, 9:59 p.m., Mark Michelson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviewboard.asterisk.org/r/3069/
 ---
 
 (Updated Dec. 12, 2013, 9:59 p.m.)
 
 
 Review request for Asterisk Developers, Joshua Colp, Jonathan Rose, Matt 
 Jordan, and rmudgett.
 
 
 Bugs: ASTERISK-22936
 https://issues.asterisk.org/jira/browse/ASTERISK-22936
 
 
 Repository: Asterisk
 
 
 Description
 ---
 
 This is more-or-less a take 2 of https://reviewboard.asterisk.org/r/3042
 
 Initially, it was discovered that performing an attended transfer of a 
 multiparty bridge with a PJSIP channel would cause a deadlock. A PBX thread 
 started a masquerade and reached the point where it was calling the fixup() 
 callback on the original channel. For chan_pjsip, this involves pushing a 
 synchronous task to the session's serializer. The problem was that a task 
 ahead of the fixup task was also attempting to perform a channel masquerade. 
 However, since masquerades are designed in a way to only allow for one to 
 occur at a time, the task ahead of the fixup could not continue until the 
 masquerade already in progress had completed. And of course, the masquerade 
 in progress could not complete until the task ahead of the fixup task had 
 completed. Deadlock.
 
 The initial fix was to change the fixup task to be asynchronous. While this 
 prevented the deadlock from occurring, it had the frightful side effect of 
 potentially allowing for tasks in the session's serializer to operate on a 
 zombie channel.
 
 Taking a step back from this particular deadlock, it became clear that the 
 problem was not really this one particular issue but that masquerades 
 themselves needed to be addressed. A PJSIP attended transfer operation calls 
 ast_channel_move(), which attempts to both set up and execute a masquerade. 
 The problem was that after it had set up the masquerade, the PBX thread had 
 swooped in and tried to actually perform the masquerade. Looking at changes 
 that had been made to Asterisk 12, it became clear that there never is any 
 time now that anyone ever wants to set up a masquerade and allow for the 
 channel thread to actually perform the masquerade. Everyone always is calling 
 ast_channel_move(), performs the masquerade itself before returning.
 
 In this patch, I have removed all blocks of code from channel.c that will 
 attempt to perform a masquerade if ast_channel_masq() returns true. Now, 
 there is no distinction between setting up a masquerade and performing the 
 masquerade. It is one operation. The only remaining checks for 
 ast_channel_masq() and ast_channel_masqr() are in ast_hangup() since we do 
 not want to interrupt a masquerade by hanging up the channel. Instead, now 
 ast_hangup() will wait for a masquerade to complete before moving forward 
 with its operation.
 
 The 

Re: [asterisk-dev] [Code Review] 3069: Fix deadlock experienced during multi-party PJSIP transfer through masquerade rework

2013-12-13 Thread Matt Jordan


 On Dec. 12, 2013, 4:03 p.m., Mark Michelson wrote:
  Josh and I talked on IRC, and it would really be great if the channels 
  involved in the masquerade could be locked for the duration of the 
  operation instead of having to unlock them in order to send indications to 
  them. That way, it could be guaranteed that any time something acquires a 
  channel lock, it is not interrupting the middle of a masquerade and can 
  freely do what it needs to. I think such refactoring could actually be 
  done, but I also feel like since that sort of refactoring does not directly 
  affect the problem being fixed here, it should be saved for a separate 
  review.

Agreed. If (when?) this goes in, I'd make a new issue in JIRA for that part of 
it.


- Matt


---
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3069/#review10408
---


On Dec. 12, 2013, 3:59 p.m., Mark Michelson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviewboard.asterisk.org/r/3069/
 ---
 
 (Updated Dec. 12, 2013, 3:59 p.m.)
 
 
 Review request for Asterisk Developers, Joshua Colp, Jonathan Rose, Matt 
 Jordan, and rmudgett.
 
 
 Bugs: ASTERISK-22936
 https://issues.asterisk.org/jira/browse/ASTERISK-22936
 
 
 Repository: Asterisk
 
 
 Description
 ---
 
 This is more-or-less a take 2 of https://reviewboard.asterisk.org/r/3042
 
 Initially, it was discovered that performing an attended transfer of a 
 multiparty bridge with a PJSIP channel would cause a deadlock. A PBX thread 
 started a masquerade and reached the point where it was calling the fixup() 
 callback on the original channel. For chan_pjsip, this involves pushing a 
 synchronous task to the session's serializer. The problem was that a task 
 ahead of the fixup task was also attempting to perform a channel masquerade. 
 However, since masquerades are designed in a way to only allow for one to 
 occur at a time, the task ahead of the fixup could not continue until the 
 masquerade already in progress had completed. And of course, the masquerade 
 in progress could not complete until the task ahead of the fixup task had 
 completed. Deadlock.
 
 The initial fix was to change the fixup task to be asynchronous. While this 
 prevented the deadlock from occurring, it had the frightful side effect of 
 potentially allowing for tasks in the session's serializer to operate on a 
 zombie channel.
 
 Taking a step back from this particular deadlock, it became clear that the 
 problem was not really this one particular issue but that masquerades 
 themselves needed to be addressed. A PJSIP attended transfer operation calls 
 ast_channel_move(), which attempts to both set up and execute a masquerade. 
 The problem was that after it had set up the masquerade, the PBX thread had 
 swooped in and tried to actually perform the masquerade. Looking at changes 
 that had been made to Asterisk 12, it became clear that there never is any 
 time now that anyone ever wants to set up a masquerade and allow for the 
 channel thread to actually perform the masquerade. Everyone always is calling 
 ast_channel_move(), performs the masquerade itself before returning.
 
 In this patch, I have removed all blocks of code from channel.c that will 
 attempt to perform a masquerade if ast_channel_masq() returns true. Now, 
 there is no distinction between setting up a masquerade and performing the 
 masquerade. It is one operation. The only remaining checks for 
 ast_channel_masq() and ast_channel_masqr() are in ast_hangup() since we do 
 not want to interrupt a masquerade by hanging up the channel. Instead, now 
 ast_hangup() will wait for a masquerade to complete before moving forward 
 with its operation.
 
 The ast_channel_move() function has been modified to basically in-line the 
 logic that used to be in ast_channel_masquerade(). ast_channel_masquerade() 
 has been killed off for real. ast_channel_move() now has a lock associated 
 with it that is used to prevent any simultaneous moves from occurring at 
 once. This means there is no need to make sure that ast_channel_masq() or 
 ast_channel_masqr() are already set on a channel when ast_channel_move() is 
 called. It also means the channel container lock is not pulling double duty 
 by both keeping the container locked and preventing multiple masquerades from 
 occurring simultaneously.
 
 The ast_do_masquerade() function has been renamed to do_channel_masquerade() 
 and is now internal to channel.c. The function now takes explicit arguments 
 of which channels are involved in the masquerade instead of a single channel. 
 While it probably is possible to do some further refactoring of this method, 
 I feel that I would be treading dangerously. 

[asterisk-dev] [Code Review] 3069: Fix deadlock experienced during multi-party PJSIP transfer through masquerade rework

2013-12-12 Thread Mark Michelson

---
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3069/
---

Review request for Asterisk Developers, Joshua Colp, Jonathan Rose, Matt 
Jordan, and rmudgett.


Bugs: ASTERISK-22936
https://issues.asterisk.org/jira/browse/ASTERISK-22936


Repository: Asterisk


Description
---

This is more-or-less a take 2 of https://reviewboard.asterisk.org/r/3042

Initially, it was discovered that performing an attended transfer of a 
multiparty bridge with a PJSIP channel would cause a deadlock. A PBX thread 
started a masquerade and reached the point where it was calling the fixup() 
callback on the original channel. For chan_pjsip, this involves pushing a 
synchronous task to the session's serializer. The problem was that a task ahead 
of the fixup task was also attempting to perform a channel masquerade. However, 
since masquerades are designed in a way to only allow for one to occur at a 
time, the task ahead of the fixup could not continue until the masquerade 
already in progress had completed. And of course, the masquerade in progress 
could not complete until the task ahead of the fixup task had completed. 
Deadlock.

The initial fix was to change the fixup task to be asynchronous. While this 
prevented the deadlock from occurring, it had the frightful side effect of 
potentially allowing for tasks in the session's serializer to operate on a 
zombie channel.

Taking a step back from this particular deadlock, it became clear that the 
problem was not really this one particular issue but that masquerades 
themselves needed to be addressed. A PJSIP attended transfer operation calls 
ast_channel_move(), which attempts to both set up and execute a masquerade. The 
problem was that after it had set up the masquerade, the PBX thread had swooped 
in and tried to actually perform the masquerade. Looking at changes that had 
been made to Asterisk 12, it became clear that there never is any time now that 
anyone ever wants to set up a masquerade and allow for the channel thread to 
actually perform the masquerade. Everyone always is calling ast_channel_move(), 
performs the masquerade itself before returning.

In this patch, I have removed all blocks of code from channel.c that will 
attempt to perform a masquerade if ast_channel_masq() returns true. Now, there 
is no distinction between setting up a masquerade and performing the 
masquerade. It is one operation. The only remaining checks for 
ast_channel_masq() and ast_channel_masqr() are in ast_hangup() since we do not 
want to interrupt a masquerade by hanging up the channel. Instead, now 
ast_hangup() will wait for a masquerade to complete before moving forward with 
its operation.

The ast_channel_move() function has been modified to basically in-line the 
logic that used to be in ast_channel_masquerade(). ast_channel_masquerade() has 
been killed off for real. ast_channel_move() now has a lock associated with it 
that is used to prevent any simultaneous moves from occurring at once. This 
means there is no need to make sure that ast_channel_masq() or 
ast_channel_masqr() are already set on a channel when ast_channel_move() is 
called. It also means the channel container lock is not pulling double duty by 
both keeping the container locked and preventing multiple masquerades from 
occurring simultaneously.

The ast_do_masquerade() function has been renamed to do_channel_masquerade() 
and is now internal to channel.c. The function now takes explicit arguments of 
which channels are involved in the masquerade instead of a single channel. 
While it probably is possible to do some further refactoring of this method, I 
feel that I would be treading dangerously. Instead, all I did was change some 
comments that no longer are true after this changeset.

The other more minor change introduced in this patch is to res_pjsip.c to make 
ast_sip_push_task_synchronous() run the task in-place if we are already a SIP 
servant thread. This is related to this patch because even when we isolate the 
channel masquerade to only running in the SIP servant thread, we would still 
deadlock when the fixup() callback is reached since we would essentially be 
waiting forever for ourselves to finish before actually running the fixup. This 
makes it so the fixup is run without having to push a task into a serializer at 
all.


Diffs
-

  /branches/12/res/res_pjsip.c 403349 
  /branches/12/main/channel.c 403343 
  /branches/12/include/asterisk/channel.h 403343 

Diff: https://reviewboard.asterisk.org/r/3069/diff/


Testing
---

Jonathan ran the same test he had originally run in order to make the deadlock 
occur, and with this patch it no longer occurs. I also ran a bevy of testsuite 
tests to see if there were any that did not pass with this patch but did pass 
without the patch. I found none on my machine, though I'll 

Re: [asterisk-dev] [Code Review] 3069: Fix deadlock experienced during multi-party PJSIP transfer through masquerade rework

2013-12-12 Thread Mark Michelson

---
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3069/#review10408
---


Josh and I talked on IRC, and it would really be great if the channels involved 
in the masquerade could be locked for the duration of the operation instead of 
having to unlock them in order to send indications to them. That way, it could 
be guaranteed that any time something acquires a channel lock, it is not 
interrupting the middle of a masquerade and can freely do what it needs to. I 
think such refactoring could actually be done, but I also feel like since that 
sort of refactoring does not directly affect the problem being fixed here, it 
should be saved for a separate review.

- Mark Michelson


On Dec. 12, 2013, 9:59 p.m., Mark Michelson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviewboard.asterisk.org/r/3069/
 ---
 
 (Updated Dec. 12, 2013, 9:59 p.m.)
 
 
 Review request for Asterisk Developers, Joshua Colp, Jonathan Rose, Matt 
 Jordan, and rmudgett.
 
 
 Bugs: ASTERISK-22936
 https://issues.asterisk.org/jira/browse/ASTERISK-22936
 
 
 Repository: Asterisk
 
 
 Description
 ---
 
 This is more-or-less a take 2 of https://reviewboard.asterisk.org/r/3042
 
 Initially, it was discovered that performing an attended transfer of a 
 multiparty bridge with a PJSIP channel would cause a deadlock. A PBX thread 
 started a masquerade and reached the point where it was calling the fixup() 
 callback on the original channel. For chan_pjsip, this involves pushing a 
 synchronous task to the session's serializer. The problem was that a task 
 ahead of the fixup task was also attempting to perform a channel masquerade. 
 However, since masquerades are designed in a way to only allow for one to 
 occur at a time, the task ahead of the fixup could not continue until the 
 masquerade already in progress had completed. And of course, the masquerade 
 in progress could not complete until the task ahead of the fixup task had 
 completed. Deadlock.
 
 The initial fix was to change the fixup task to be asynchronous. While this 
 prevented the deadlock from occurring, it had the frightful side effect of 
 potentially allowing for tasks in the session's serializer to operate on a 
 zombie channel.
 
 Taking a step back from this particular deadlock, it became clear that the 
 problem was not really this one particular issue but that masquerades 
 themselves needed to be addressed. A PJSIP attended transfer operation calls 
 ast_channel_move(), which attempts to both set up and execute a masquerade. 
 The problem was that after it had set up the masquerade, the PBX thread had 
 swooped in and tried to actually perform the masquerade. Looking at changes 
 that had been made to Asterisk 12, it became clear that there never is any 
 time now that anyone ever wants to set up a masquerade and allow for the 
 channel thread to actually perform the masquerade. Everyone always is calling 
 ast_channel_move(), performs the masquerade itself before returning.
 
 In this patch, I have removed all blocks of code from channel.c that will 
 attempt to perform a masquerade if ast_channel_masq() returns true. Now, 
 there is no distinction between setting up a masquerade and performing the 
 masquerade. It is one operation. The only remaining checks for 
 ast_channel_masq() and ast_channel_masqr() are in ast_hangup() since we do 
 not want to interrupt a masquerade by hanging up the channel. Instead, now 
 ast_hangup() will wait for a masquerade to complete before moving forward 
 with its operation.
 
 The ast_channel_move() function has been modified to basically in-line the 
 logic that used to be in ast_channel_masquerade(). ast_channel_masquerade() 
 has been killed off for real. ast_channel_move() now has a lock associated 
 with it that is used to prevent any simultaneous moves from occurring at 
 once. This means there is no need to make sure that ast_channel_masq() or 
 ast_channel_masqr() are already set on a channel when ast_channel_move() is 
 called. It also means the channel container lock is not pulling double duty 
 by both keeping the container locked and preventing multiple masquerades from 
 occurring simultaneously.
 
 The ast_do_masquerade() function has been renamed to do_channel_masquerade() 
 and is now internal to channel.c. The function now takes explicit arguments 
 of which channels are involved in the masquerade instead of a single channel. 
 While it probably is possible to do some further refactoring of this method, 
 I feel that I would be treading dangerously. Instead, all I did was change 
 some comments that no longer are true after this changeset.
 
 The other more minor change introduced in this patch is to