[
https://issues.apache.org/jira/browse/FLUME-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402546#comment-13402546
]
Hari Shreedharan edited comment on FLUME-1327 at 6/27/12 8:54 PM:
------------------------------------------------------------------
There was a request on Review Board for a more detailed description of the
issue. Here it is:
* The Log class methods put() and commit() both grab the read lock using
tryLock() and, if a roll is required, call roll(), which is synchronized on
the Log class instance.
* The deadlock occurs when the background worker initiates a checkpoint after
the put/commit method has grabbed the read lock. The worker enters the
synchronized writeCheckpoint method but ends up waiting for the write lock,
which it cannot acquire until the put/commit thread releases the read lock.
* The put/commit method now calls roll(), but roll() cannot execute because it
is synchronized and the instance monitor is held by the background worker,
which is itself waiting for the thread that called put/commit() to release the
read lock.
* In summary:
{noformat}
      read lock            monitor
put -------------> roll -----------> writeCheckpoint
 ^                                         |
 |-----------------------------------------|
                 write lock
{noformat}
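To make the cycle concrete, here is a minimal Java sketch of the same lock
ordering. It is not the actual Flume code: the class name DeadlockSketch, the
two latches (used only to force the unlucky interleaving), and the use of
lockInterruptibly() instead of lock() (so the demo can be cancelled) are all
assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the Log class; not the actual Flume code.
class DeadlockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Latches exist only to force the unlucky interleaving deterministically.
    private final CountDownLatch readLockHeld = new CountDownLatch(1);
    private final CountDownLatch monitorHeld = new CountDownLatch(1);

    // Stands in for put()/commit(): takes the read lock, then needs a roll.
    void put() throws InterruptedException {
        lock.readLock().lock();
        try {
            readLockHeld.countDown();
            monitorHeld.await();  // wait until the checkpoint owns the monitor
            roll();               // blocks: the monitor is held by the checkpoint
        } finally {
            lock.readLock().unlock();
        }
    }

    synchronized void roll() {
        // The file roll would happen here.
    }

    // Stands in for the background worker's checkpoint.
    synchronized void writeCheckpoint() throws InterruptedException {
        readLockHeld.await();     // checkpoint starts after put() has the read lock
        monitorHeld.countDown();
        // Assumption: lockInterruptibly() rather than lock(), so the demo can end.
        lock.writeLock().lockInterruptibly();  // parks: a reader is still active
        lock.writeLock().unlock();
    }

    // Returns true if both threads reach the stuck state described above.
    static boolean reproduces() {
        DeadlockSketch log = new DeadlockSketch();
        Thread putter = new Thread(() -> {
            try { log.put(); } catch (InterruptedException ignored) { }
        }, "put");
        Thread checkpointer = new Thread(() -> {
            try { log.writeCheckpoint(); } catch (InterruptedException ignored) { }
        }, "checkpoint");
        putter.setDaemon(true);
        checkpointer.setDaemon(true);
        checkpointer.start();
        putter.start();
        boolean stuck = false;
        try {
            for (int i = 0; i < 500 && !stuck; i++) {
                Thread.sleep(10);
                stuck = putter.getState() == Thread.State.BLOCKED
                        && checkpointer.getState() == Thread.State.WAITING;
            }
            checkpointer.interrupt();  // break the cycle so both threads can exit
            putter.join(5000);
            checkpointer.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return stuck;
    }

    public static void main(String[] args) {
        System.out.println("deadlock reproduced: " + reproduces());
    }
}
```

The put thread ends up BLOCKED on the monitor while the checkpoint thread is
parked waiting for the write lock. Without the interrupt at the end, neither
thread would ever make progress.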
The reason I put a 10-minute wait on the lock in the checkpoint was simply to
have a timeout rather than wait indefinitely. This makes sure all readers can
complete and the writer can run. The 10 minutes was rather arbitrary:
basically a way of saying that eventually I will just go away, come back, and
try again. It was also a simpler code change than a more complex one that
might have introduced more issues. Anyway, as per the review, I made this
timeout configurable.
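As a rough illustration of the timeout approach (a sketch only, not the patch
itself): the checkpoint tries for the write lock for a bounded time and gives
up if it cannot get it, which releases the monitor so a pending roll() can
proceed. The class name TimedCheckpointSketch, the field lockTimeoutMillis,
and the demo() helper are hypothetical names chosen for this example.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the Log class; not the actual patch.
class TimedCheckpointSketch {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final long lockTimeoutMillis;  // assumed name for the configurable timeout

    TimedCheckpointSketch(long lockTimeoutMillis) {
        this.lockTimeoutMillis = lockTimeoutMillis;
    }

    // Returns false if the write lock could not be taken in time. Returning
    // also releases the instance monitor, so a blocked roll() can run and the
    // reader can finish; the caller is expected to retry the checkpoint later.
    synchronized boolean writeCheckpoint() {
        boolean locked = false;
        try {
            locked = lock.writeLock().tryLock(lockTimeoutMillis, TimeUnit.MILLISECONDS);
            if (!locked) {
                return false;
            }
            // The checkpoint would be written here.
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            if (locked) {
                lock.writeLock().unlock();
            }
        }
    }

    // Single-threaded demo: a held read lock stands in for an in-flight put().
    static boolean[] demo() {
        TimedCheckpointSketch log = new TimedCheckpointSketch(50);
        boolean whileReaderActive;
        log.lock.readLock().lock();
        try {
            whileReaderActive = log.writeCheckpoint();  // times out instead of hanging
        } finally {
            log.lock.readLock().unlock();
        }
        boolean afterReaderDone = log.writeCheckpoint();  // lock is free now
        return new boolean[] { whileReaderActive, afterReaderDone };
    }
}
```

The demo uses a 50 ms timeout only to keep it quick. The trade-off is the one
described above: a checkpoint may be skipped and retried, but the worker never
holds the monitor forever.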
> File Channel can deadlock when a checkpoint happens in between a
> put/take/commit
> ---------------------------------------------------------------------------------
>
> Key: FLUME-1327
> URL: https://issues.apache.org/jira/browse/FLUME-1327
> Project: Flume
> Issue Type: Bug
> Components: Channel
> Affects Versions: v1.2.0
> Reporter: Hari Shreedharan
> Fix For: v1.2.0
>
> Attachments: FLUME-1327-1.patch
>
>
> In the following case, the FileChannel deadlocks:
> * put() grabs the read lock.
> * In another thread, the synchronized writeCheckpoint method is called, but
> it blocks on the writelock.lock() call.
> * put() then calls roll(), which is also synchronized, and ends up blocked
> on the monitor for the Log object.
> * The write lock is never acquired, since the thread holding the read lock
> is blocked on the monitor.