[ 
https://issues.apache.org/jira/browse/FLUME-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402546#comment-13402546
 ] 

Hari Shreedharan edited comment on FLUME-1327 at 6/27/12 8:54 PM:
------------------------------------------------------------------

There was a request on Review Board for a more detailed description of the 
issue. Here it is:

* The Log class methods put() and commit() both grab the read lock using 
tryLock() and, if a roll is required, call roll(), which is synchronized on 
the Log instance.
* The deadlock occurs when the background worker initiates a checkpoint after 
put()/commit() has grabbed the read lock. The worker enters the synchronized 
writeCheckpoint method, but ends up waiting for the write lock, because 
put()/commit() already holds the read lock and the worker must wait for it to 
be released.
* put()/commit() now calls roll(), but roll() cannot execute because it is 
synchronized and the instance monitor is held by the background worker, which 
is waiting for the lock to be released by the thread that called put()/commit().

* In summary:
{noformat}
                        readlock           monitor
                put --------------> roll ----------> writeCheckpoint
                ^                                        |
                |----------------------------------------|      
                                write lock
{noformat}
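
The cycle in the diagram can be reproduced without actually hanging a JVM. The 
following is a minimal sketch, not the Flume source: DeadlockCycleSketch and 
all of its members are hypothetical names, and the Log instance monitor is 
modelled by a ReentrantLock (an assumption made only so that each thread can 
probe with tryLock() instead of blocking forever).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical model of the lock cycle; not the actual Flume Log class. */
final class DeadlockCycleSketch {
    // Stand-in for the Log instance monitor (assumption: a ReentrantLock,
    // so the sketch can probe with tryLock() rather than block).
    private final ReentrantLock monitor = new ReentrantLock();
    private final ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Returns {putWouldBlock, checkpointWouldBlock}: whether each thread
     * failed to get the second lock it needs while holding its first one.
     */
    boolean[] probe() {
        CountDownLatch readHeld = new CountDownLatch(1);
        CountDownLatch monitorHeld = new CountDownLatch(1);
        CountDownLatch probesDone = new CountDownLatch(2);
        boolean[] wouldBlock = new boolean[2];

        Thread put = new Thread(() -> {
            checkpointLock.readLock().lock();          // put() grabs the read lock
            try {
                readHeld.countDown();
                awaitQuietly(monitorHeld);             // checkpoint now owns the monitor
                boolean got = monitor.tryLock();       // roll() would need the monitor
                wouldBlock[0] = !got;
                if (got) {
                    monitor.unlock();
                }
                probesDone.countDown();
                awaitQuietly(probesDone);              // keep the read lock until both probed
            } finally {
                checkpointLock.readLock().unlock();
            }
        });

        Thread checkpoint = new Thread(() -> {
            awaitQuietly(readHeld);
            monitor.lock();                            // enters "synchronized" writeCheckpoint
            try {
                monitorHeld.countDown();
                boolean got = checkpointLock.writeLock().tryLock(); // needs the write lock
                wouldBlock[1] = !got;
                if (got) {
                    checkpointLock.writeLock().unlock();
                }
                probesDone.countDown();
                awaitQuietly(probesDone);              // keep the monitor until both probed
            } finally {
                monitor.unlock();
            }
        });

        put.start();
        checkpoint.start();
        try {
            put.join();
            checkpoint.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return wouldBlock;
    }
}
```

Calling new DeadlockCycleSketch().probe() reports that both probes fail: the 
put thread cannot take the monitor, and the checkpoint thread cannot take the 
write lock. With blocking calls in place of the probes, neither thread would 
ever proceed.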

I put a 10-minute wait on the lock in the checkpoint simply to have a timeout 
rather than wait indefinitely. This makes sure all readers can complete and 
the writer can run. The 10 minutes was rather arbitrary: basically a way of 
saying that the checkpoint will eventually give up, go away, and come back to 
try again. It was also a simple change in the code, rather than a more complex 
one that might have introduced more issues. Anyway, as per the review, I made 
this timeout configurable.
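
The idea of a bounded wait can be sketched as follows. This is an 
illustration under stated assumptions, not the attached patch: 
TimedCheckpointLog, checkpointTimeoutMillis, rollCount(), the beforeRoll hook 
and demo() are all hypothetical names introduced for the example. The point 
is that a timed tryLock() on the write lock lets writeCheckpoint return, which 
releases the instance monitor so that a pending roll() can run.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for the Log class; not the actual Flume source. */
final class TimedCheckpointLog {
    private final ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();
    private final long checkpointTimeoutMillis; // hypothetical configurable timeout
    private int rolls;

    TimedCheckpointLog(long checkpointTimeoutMillis) {
        this.checkpointTimeoutMillis = checkpointTimeoutMillis;
    }

    /** Holds the instance monitor, but waits for the write lock only for a bounded time. */
    synchronized boolean writeCheckpoint() {
        boolean locked;
        try {
            locked = checkpointLock.writeLock()
                    .tryLock(checkpointTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (!locked) {
            return false; // give up for now; returning releases the monitor
        }
        try {
            // ... the checkpoint would be written here ...
            return true;
        } finally {
            checkpointLock.writeLock().unlock();
        }
    }

    synchronized void roll() {
        rolls++;
    }

    synchronized int rollCount() {
        return rolls;
    }

    /** beforeRoll is a demo-only hook that runs while the read lock is held. */
    boolean put(Runnable beforeRoll) {
        if (!checkpointLock.readLock().tryLock()) {
            return false;
        }
        try {
            beforeRoll.run();
            roll(); // needs the instance monitor
            return true;
        } finally {
            checkpointLock.readLock().unlock();
        }
    }

    /**
     * Runs a checkpoint on a worker thread while put() holds the read lock.
     * Returns {checkpointDuringPut, putSucceeded, checkpointAfterPut}.
     */
    static boolean[] demo(long timeoutMillis) {
        TimedCheckpointLog log = new TimedCheckpointLog(timeoutMillis);
        boolean[] out = new boolean[3];
        out[1] = log.put(() -> {
            Thread worker = new Thread(() -> out[0] = log.writeCheckpoint());
            worker.start();
            try {
                worker.join(); // returns only because the checkpoint times out
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        out[2] = log.writeCheckpoint(); // no readers now, so this one succeeds
        return out;
    }
}
```

In TimedCheckpointLog.demo(50), the checkpoint attempted while put() holds 
the read lock gives up after the timeout, put() then completes its roll, and 
a later checkpoint succeeds. With an unbounded writeLock().lock() in place of 
the timed tryLock(), the join() inside the hook would never return.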
                
> File Channel can deadlock when a checkpoint happens in between a 
> put/take/commit
> ---------------------------------------------------------------------------------
>
>                 Key: FLUME-1327
>                 URL: https://issues.apache.org/jira/browse/FLUME-1327
>             Project: Flume
>          Issue Type: Bug
>          Components: Channel
>    Affects Versions: v1.2.0
>            Reporter: Hari Shreedharan
>             Fix For: v1.2.0
>
>         Attachments: FLUME-1327-1.patch
>
>
> In the following case, the FileChannel deadlocks:
> * put() grabs the read lock.
> * In another thread, the writeCheckpoint method, which is synchronized, is 
> called, but it blocks on the writelock.lock() call.
> * put() calls roll(), which is also synchronized, and ends up blocked on the 
> monitor for the Log object.
> * The write lock is never acquired, since the thread holding the read lock is 
> blocked on the monitor.
