[ 
https://issues.apache.org/jira/browse/KAFKA-615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498364#comment-13498364
 ] 

Jay Kreps commented on KAFKA-615:
---------------------------------

Here is a quick thought on how this might work to kick things off. A few things 
to consider:
1. Currently recovery works on a per-segment basis, since the index is recreated 
from scratch. Partial segment recovery would be tricky, and I recommend we avoid 
it. If we want to speed up recovery we can always just make segment sizes 
smaller.
2. For this to really work we need to guarantee that fsync can never happen on 
the active segment under any circumstances--the syncing must be fully 
asynchronous, since the fsyncs may be very slow. To make this work we need to 
somehow record what has and has not been fsync'd.
3. One complicating factor is that an inactive segment may become active once 
again because of a truncate operation. In this case more messages will be 
appended and we will need to fsync it again after it is rolled again. If we 
crash in the intervening time we must not assume that the old fsync covers the 
post-truncation appends.

My thought was to keep a text file containing:
  topic partition offset
This file would be populated by a background thread that would flush segments 
and update the file. At recovery time we would use this file to determine the 
latest flushed segment for each topic/partition and only recover segments newer 
than this. This would generally mean recovering only the last segment.
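
To make the recovery-time use concrete, here is a minimal sketch in Scala of 
reading such a file and resolving the latest flushed offset per topic/partition 
(names like FlushJournal are hypothetical, not an existing API). Only segments 
at or beyond the recorded offset for a partition would then need recovery.

  import java.io.File
  import scala.io.Source

  // Hypothetical sketch: parse the flush journal into a map of
  // (topic, partition) -> last flushed offset. Later lines win,
  // since the file is append-only.
  object FlushJournal {
    def read(file: File): Map[(String, Int), Long] = {
      val source = Source.fromFile(file)
      try {
        source.getLines()
          .map(_.trim)
          .filter(_.nonEmpty)
          .foldLeft(Map.empty[(String, Int), Long]) { (acc, line) =>
            val Array(topic, partition, offset) = line.split("\\s+")
            acc + ((topic, partition.toInt) -> offset.toLong)
          }
      } finally source.close()
    }
  }
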
Some notes on maintaining this file (a code sketch follows the list):
- When a segment is flushed we would immediately append a new entry to this 
file without fsyncing. Doing this as appends means we need only write out a 
small bit of data incrementally. Losing an entry is usually okay: it would just 
mean we do an unnecessary recovery, so skipping the fsync on the journal itself 
should usually be fine (with the truncation exception noted above).
- We would keep an in memory map with the latest offset for each 
topic/partition.
- To keep this file from growing forever we would periodically write out the 
full contents of the map to a new file and swap it for the old one.
- We should probably keep one of these files for each data directory.
- The case of truncate (i.e. moving the offset backwards) needs to be thought 
through. In this case I think we may need to immediately fsync the file to 
avoid a case where we lose an entry in the file and therefore think we have 
flushed more than we actually have.
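
A rough sketch of the maintenance side, under the same assumptions 
(whitespace-separated entries, one journal file per data directory, made-up 
class and method names):

  import java.io.{File, FileOutputStream, PrintWriter}
  import java.nio.file.{Files, StandardCopyOption}

  // Hypothetical sketch of maintaining the journal; not a proposal for
  // the actual class layout.
  class FlushJournalWriter(dir: File) {
    private val journal = new File(dir, "flush-journal")
    private val flushed = collection.mutable.Map[(String, Int), Long]()

    // Called after a rolled segment has been fsync'd. Appends one line
    // without fsync'ing the journal itself; losing the entry only causes
    // an unnecessary recovery.
    def recordFlush(topic: String, partition: Int, offset: Long): Unit = synchronized {
      flushed((topic, partition)) = offset
      val out = new PrintWriter(new FileOutputStream(journal, true))
      try out.println(s"$topic $partition $offset") finally out.close()
    }

    // Called on truncation (offset moves backwards): rewrite and fsync the
    // journal immediately so we can never believe we flushed more than we did.
    def recordTruncate(topic: String, partition: Int, offset: Long): Unit = synchronized {
      flushed((topic, partition)) = offset
      compact(sync = true)
    }

    // Periodically rewrite the full map to a new file and swap it in, so
    // the append-only journal does not grow forever.
    def compact(sync: Boolean = false): Unit = synchronized {
      val tmp = new File(dir, "flush-journal.tmp")
      val fos = new FileOutputStream(tmp)
      val out = new PrintWriter(fos)
      try {
        for (((topic, partition), offset) <- flushed)
          out.println(s"$topic $partition $offset")
        out.flush()
        if (sync) fos.getFD.sync()
      } finally out.close()
      Files.move(tmp.toPath, journal.toPath, StandardCopyOption.REPLACE_EXISTING)
    }
  }
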

We should probably refactor the existing background flush thread so it handles 
both the time-based flush and the flushing of old segments, both to avoid 
having two threads and because they play very similar roles. Note that it is 
fine to have both rules active, since if the data has already been flushed due 
to a time or interval rule, then the flush of the old segment can just proceed 
and will essentially be a no-op (since the file will have no dirty pages).
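
As a sketch of what that combined background thread might look like (the trait 
and names are placeholders; the real policy checks would live in the log code):

  import java.util.concurrent.{Executors, TimeUnit}

  // Minimal stand-in for whatever the real per-log interface exposes.
  trait FlushableLog {
    def flushIfTimeRuleDue(): Unit      // existing time/interval policy (now defaulted off)
    def flushInactiveSegments(): Unit   // fsync rolled-but-unflushed segments
  }

  // One scheduled task applies both rules. Flushing a segment whose pages
  // are already clean is essentially free, so overlap between the two is fine.
  class LogFlusherTask(logs: () => Iterable[FlushableLog]) {
    private val scheduler = Executors.newSingleThreadScheduledExecutor()

    def start(periodSeconds: Long = 30): Unit = {
      scheduler.scheduleWithFixedDelay(new Runnable {
        def run(): Unit = logs().foreach { log =>
          log.flushIfTimeRuleDue()
          log.flushInactiveSegments()
        }
      }, periodSeconds, periodSeconds, TimeUnit.SECONDS)
    }

    def shutdown(): Unit = scheduler.shutdownNow()
  }
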

I recommend we leave the existing time and interval flush policies in place but 
now default both to infinity. These will become rarely used options, mostly 
useful for people who are paranoid, running only a single broker, or using a 
crappy filesystem where fsync locks up everything.

In terms of code structure it is quite tricky and I don't quite know how to do 
it. A lot of the challenge is that the flush thread and flush journal file will 
be global across all logs, but the roll() call is at the log level. The first 
question is how we know which segments to flush. One way would be to have the 
background thread just periodically (say every 30 seconds) scan all logs and 
see if they have any new segments that need flushing. The downside is that this 
is O(n) in the number of logs, which is perhaps not ideal. Another way would be 
to have roll() somehow enqueue a flush operation for the background thread to 
carry out. The latter approach may be more efficient but tangles the log with 
the layers above it. It would be good to work out the details of how this would 
work up front.
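
The earlier LogFlusherTask sketch corresponds to the periodic-scan option. For 
the enqueue-on-roll alternative, a minimal sketch (again with a placeholder 
segment trait, not the actual interface) might look like:

  import java.util.concurrent.{Executors, TimeUnit}

  // Minimal stand-in for whatever the rolled segment exposes.
  trait FlushableSegment {
    def flush(): Unit      // fsync the log and index files
    def nextOffset: Long   // first offset beyond this segment
  }

  // roll() hands the segment that just became inactive to a shared
  // background thread, so fsync never happens on the append path.
  class SegmentFlusher(recordFlush: (String, Int, Long) => Unit) {
    private val executor = Executors.newSingleThreadExecutor()

    def scheduleFlush(topic: String, partition: Int, segment: FlushableSegment): Unit = {
      executor.submit(new Runnable {
        def run(): Unit = {
          segment.flush() // a no-op if a time/interval flush already cleaned the pages
          recordFlush(topic, partition, segment.nextOffset)
        }
      })
    }

    def shutdown(): Unit = {
      executor.shutdown()
      executor.awaitTermination(30, TimeUnit.SECONDS)
    }
  }
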
                
> Avoid fsync on log segment roll
> -------------------------------
>
>                 Key: KAFKA-615
>                 URL: https://issues.apache.org/jira/browse/KAFKA-615
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jay Kreps
>            Assignee: Neha Narkhede
>
> It still isn't feasible to run without an application-level fsync policy. 
> This is a problem because fsync locks the file, and tuning such a policy is 
> very challenging: the flushes must not be so frequent that seeks reduce 
> throughput, yet not so infrequent that each fsync writes so much data that 
> there is a noticeable jump in latency.
> The remaining problem is the way that log recovery works. Our current policy 
> is that if a clean shutdown occurs we do no recovery. If an unclean shutdown 
> occurs we recover the last segment of all logs. To make this correct we need 
> to ensure that each segment is fsync'd before we create a new segment. Hence 
> the fsync during roll.
> Obviously if the fsync during roll is the only time fsync occurs then it will 
> potentially write out the entire segment, which for a 1GB segment at 50MB/sec 
> might take many seconds (around 20 seconds at that rate). The goal of this 
> JIRA is to eliminate this and make it possible to run with no 
> application-level fsyncs at all, depending entirely on replication and 
> background writeback for durability.

