[ https://issues.apache.org/jira/browse/BOOKKEEPER-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sijie Guo reassigned BOOKKEEPER-1044:
-------------------------------------
Assignee: Sijie Guo (was: Charan Reddy Guttapalem)
I attached a simple fix at https://github.com/apache/bookkeeper/pull/286
> Entrylogger is not readding rolled logs back to the logChannelsToFlush list
> when exception happens while trying to flush rolled logs
> ------------------------------------------------------------------------------------------------------------------------------------
>
> Key: BOOKKEEPER-1044
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-1044
> Project: Bookkeeper
> Issue Type: Bug
> Reporter: Charan Reddy Guttapalem
> Assignee: Sijie Guo
> Priority: Blocker
> Fix For: 4.5.0
>
>
> SyncThread.checkpoint(Checkpoint checkpoint), which SyncThread's executor
> calls once every flushInterval, ultimately calls
> EntryLogger.flushRotatedLogs.
> EntryLogger.flushRotatedLogs first sets 'logChannelsToFlush' to null and only
> then flushes and closes each rolled log channel. If an IOException occurs
> while flushing or closing a channel, it is rethrown as-is and propagates up
> to SyncThread.checkpoint, which catches it, logs it, and returns without
> calling checkpointComplete. By then the reference to 'logChannelsToFlush'
> (the rolled logs that are yet to be closed) is lost, because it was set to
> null before any rolled log was flushed or closed. The next execution of
> 'checkpoint' (after flushInterval) does not know about the rolled logs it
> failed to flush/close the previous time, so it flushes only the newly rolled
> logs. The failure to flush/close the earlier rolled logs therefore goes
> completely unnoticed.
> In EntryLogger.java:
>     void flushRotatedLogs() throws IOException {
>         List<BufferedLogChannel> channels = null;
>         long flushedLogId = INVALID_LID;
>         synchronized (this) {
>             channels = logChannelsToFlush;
>             // <--- here 'logChannelsToFlush' is set to null before we try
>             // to flush/close the rolled logs
>             logChannelsToFlush = null;
>         }
>         if (null == channels) {
>             return;
>         }
>         for (BufferedLogChannel channel : channels) {
>             // <--- an IOException can happen here or in the following
>             // closeFileChannel call
>             channel.flush(true);
>             // since this channel is only used for writing, after flushing the channel,
>             // we had to close the underlying file channel. Otherwise, we might end up
>             // leaking fds which cause the disk spaces could not be reclaimed.
>             closeFileChannel(channel);
>             if (channel.getLogId() > flushedLogId) {
>                 flushedLogId = channel.getLogId();
>             }
>             LOG.info("Synced entry logger {} to disk.", channel.getLogId());
>         }
>         // move the leastUnflushedLogId ptr
>         leastUnflushedLogId = flushedLogId + 1;
>     }
> In SyncThread.java:
>     public void checkpoint(Checkpoint checkpoint) {
>         try {
>             checkpoint = ledgerStorage.checkpoint(checkpoint);
>         } catch (NoWritableLedgerDirException e) {
>             LOG.error("No writeable ledger directories", e);
>             dirsListener.allDisksFull();
>             return;
>         } catch (IOException e) {
>             // <--- the IOException propagates to this method, where it is
>             // caught but not handled appropriately
>             LOG.error("Exception flushing ledgers", e);
>             return;
>         }
>         try {
>             checkpointSource.checkpointComplete(checkpoint, true);
>         } catch (IOException e) {
>             LOG.error("Exception marking checkpoint as complete", e);
>             dirsListener.allDisksFull();
>         }
>     }
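
For illustration only, here is a minimal, self-contained sketch of one way to
keep rolled logs from being lost when a flush fails. It is not taken from the
linked pull request and is not BookKeeper code: RotatedLogFlusher, LogChannel,
flushAndClose, addRotatedLog, pendingCount and FlakyChannel are hypothetical
names made up for this example. The idea is to count how many channels were
flushed successfully and, if the loop exits early, put the remaining channels
(including the one that failed) back at the head of the pending list so the
next checkpoint retries them.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the rolled-log handling in EntryLogger.
class RotatedLogFlusher {

    /** Hypothetical stand-in for BufferedLogChannel (flush + close in one call). */
    interface LogChannel {
        long getLogId();
        void flushAndClose() throws IOException;
    }

    /** Test helper: a channel whose flush fails a fixed number of times. */
    static final class FlakyChannel implements LogChannel {
        private final long logId;
        private int failuresLeft;

        FlakyChannel(long logId, int failuresLeft) {
            this.logId = logId;
            this.failuresLeft = failuresLeft;
        }

        @Override
        public long getLogId() {
            return logId;
        }

        @Override
        public void flushAndClose() throws IOException {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new IOException("simulated disk error on log " + logId);
            }
        }
    }

    private List<LogChannel> logChannelsToFlush = null;
    private volatile long leastUnflushedLogId = 0;

    synchronized void addRotatedLog(LogChannel channel) {
        if (logChannelsToFlush == null) {
            logChannelsToFlush = new ArrayList<>();
        }
        logChannelsToFlush.add(channel);
    }

    synchronized int pendingCount() {
        return logChannelsToFlush == null ? 0 : logChannelsToFlush.size();
    }

    long getLeastUnflushedLogId() {
        return leastUnflushedLogId;
    }

    void flushRotatedLogs() throws IOException {
        List<LogChannel> channels;
        synchronized (this) {
            channels = logChannelsToFlush;
            logChannelsToFlush = null;
        }
        if (channels == null) {
            return;
        }
        int flushed = 0; // number of channels flushed and closed so far
        try {
            for (LogChannel channel : channels) {
                channel.flushAndClose();
                flushed++;
                // advance the pointer only past logs that really reached disk
                leastUnflushedLogId = Math.max(leastUnflushedLogId, channel.getLogId() + 1);
            }
        } finally {
            if (flushed < channels.size()) {
                // The loop exited early: keep the failed channel and everything
                // after it, ahead of any logs that were rolled in the meantime.
                List<LogChannel> remaining =
                        new ArrayList<>(channels.subList(flushed, channels.size()));
                synchronized (this) {
                    if (logChannelsToFlush != null) {
                        remaining.addAll(logChannelsToFlush);
                    }
                    logChannelsToFlush = remaining;
                }
            }
        }
    }
}
```

With this shape the IOException still propagates to the caller, but the
channels that were not flushed stay queued, so a later flushRotatedLogs call
retries them instead of silently skipping them. The sketch assumes that
retrying a channel whose flush or close failed part-way is safe; whether that
holds for BufferedLogChannel is not established here.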
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)