[ https://issues.apache.org/jira/browse/KAFKA-615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498364#comment-13498364 ]
Jay Kreps commented on KAFKA-615:
---------------------------------

Here is a quick thought on how this might work, to kick things off. A few things to consider:

1. Currently recovery works on a per-segment basis, since the index is recreated from scratch. It would be tricky to do partial segment recovery; I recommend we avoid this. If we want to speed up recovery we can always just make segment sizes smaller.
2. For this to really work, we need to guarantee that fsync can never happen on the active segment under any circumstances--the syncing must be fully async, since the fsyncs may be very slow. To make this work we need to somehow record what has and has not been fsync'd.
3. One complicating factor is that an inactive segment may become active once again because of a truncate operation. In this case more messages will be appended, and we will need to fsync the segment again after it is rolled again. If we crash in the intervening time, we must not treat the old fsync as still counting after the truncation and the further appends.

My thought was to keep a text file containing entries of the form:

  topic partition offset

This file would be populated by a background thread that flushes segments and then updates the file. At recovery time we would use this file to determine the latest flushed segment for each topic/partition and only recover segments newer than that. This would generally mean recovering only the last segment.

Some notes on maintaining this file (a minimal sketch follows this list):
- When a segment is flushed we would immediately append a new entry to this file, without fsyncing it. Doing this as appends means we need only write out a small bit of data incrementally. Losing an entry is usually okay--it would just mean an unnecessary recovery--so flushing this file in the background should usually be fine (with the truncation exception noted in point 3).
- We would keep an in-memory map with the latest flushed offset for each topic/partition.
- To keep the file from growing forever, we would periodically write out the full contents of the map to a new file and swap it for the old one.
- We should probably keep one of these files for each data directory.
- The case of truncate (i.e. moving the offset backwards) needs to be thought through. In this case I think we need to fsync the file immediately, to avoid a case where we lose an entry and therefore think we have flushed more than we actually have.
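To make the journal mechanics concrete, here is a minimal sketch in Java (the codebase is Scala, but the shape is the same). FlushJournal and every name in it are hypothetical, invented purely for illustration. It assumes one journal per data directory, plain "topic partition offset" lines, unsynced appends for ordinary flushes, an immediate sync for truncation, and a periodic rewrite-and-swap to bound the file's size:

{code:java}
import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical flush journal: one per data directory, one "topic partition offset"
 *  line per recorded flush, with the last entry for a partition winning. */
public class FlushJournal {
    private static final int COMPACT_EVERY = 10000; // rewrite the file after this many appends

    private final File file;
    private final Map<String, Long> flushed = new HashMap<>(); // "topic partition" -> offset
    private FileOutputStream out;
    private int appends = 0;

    public FlushJournal(File file) throws IOException {
        this.file = file;
        if (file.exists())
            load();
        this.out = new FileOutputStream(file, true); // append mode
    }

    // Replay the journal; later entries for the same partition overwrite earlier ones.
    private void load() throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(file))) {
            for (String line = r.readLine(); line != null; line = r.readLine()) {
                String[] f = line.trim().split("\\s+");
                if (f.length == 3)
                    flushed.put(f[0] + " " + f[1], Long.parseLong(f[2]));
            }
        }
    }

    /** Ordinary case: append one line with no fsync. Losing it in a crash just
     *  means one unnecessary segment recovery. */
    public synchronized void recordFlush(String topic, int partition, long offset) throws IOException {
        flushed.put(topic + " " + partition, offset);
        out.write((topic + " " + partition + " " + offset + "\n").getBytes(StandardCharsets.UTF_8));
        if (++appends >= COMPACT_EVERY)
            compact();
    }

    /** Truncation moves the offset backwards, so the entry must be forced to disk
     *  immediately or we may later believe more was flushed than actually was. */
    public synchronized void recordTruncation(String topic, int partition, long offset) throws IOException {
        flushed.put(topic + " " + partition, offset);
        out.write((topic + " " + partition + " " + offset + "\n").getBytes(StandardCharsets.UTF_8));
        out.getFD().sync();
    }

    /** Used at startup: segments at or below this offset need no recovery. */
    public synchronized long lastFlushedOffset(String topic, int partition) {
        return flushed.getOrDefault(topic + " " + partition, -1L);
    }

    // Bound the journal's size: write the in-memory map to a new file and swap it in
    // (rename is atomic on POSIX filesystems).
    private void compact() throws IOException {
        out.close();
        File tmp = new File(file.getParentFile(), file.getName() + ".tmp");
        try (PrintWriter w = new PrintWriter(new FileWriter(tmp))) {
            for (Map.Entry<String, Long> e : flushed.entrySet())
                w.println(e.getKey() + " " + e.getValue());
        }
        if (!tmp.renameTo(file))
            throw new IOException("could not swap compacted journal into place");
        out = new FileOutputStream(file, true);
        appends = 0;
    }
}
{code}

Because ordinary entries are appended without fsync, the journal costs almost nothing on the hot path; the one forced sync is confined to the rare truncation case, exactly as the notes above require.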
We should probably refactor the existing background flush thread so that it handles both the time-based flush and the flushing of old segments, to avoid having two threads with very similar roles. Note that it is fine to have both policies running: if the data has already been flushed due to a time or interval rule, then the flush of the old segment can just proceed and will essentially be a no-op (since the file will have no dirty pages). I recommend we leave the existing time and interval flush policies in place but default both to infinity. These will become rarely used options, mostly useful for people who are paranoid, running only a single broker, or using a crappy filesystem where fsync locks up everything.

In terms of code structure this is quite tricky, and I don't quite know how to do it. A lot of the challenge is that the flush thread and the flush journal file will be global across all logs, but the roll() call is at the level of an individual log. The first question is how we know which segments to flush. One way would be to have the background thread periodically (say, every 30 seconds) scan all logs and see if they have any new segments that need flushing. The downside is that this is O(n) in the number of logs, which is perhaps not ideal. Another way would be to have roll() somehow enqueue a flush operation for the background thread to carry out. The latter may be more efficient, but it tangles the log with the layers above it. It would be good to work out the details of how this would work up front. (A rough sketch of the scanning variant appears at the end of this message.)

> Avoid fsync on log segment roll
> -------------------------------
>
>                 Key: KAFKA-615
>                 URL: https://issues.apache.org/jira/browse/KAFKA-615
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jay Kreps
>            Assignee: Neha Narkhede
>
> It still isn't feasible to run without an application-level fsync policy. This is a problem because fsync locks the file, and tuning such a policy is very challenging: the flushes must not be so frequent that seeks reduce throughput, yet not so infrequent that each fsync writes so much data that there is a noticeable jump in latency.
> The remaining problem is the way that log recovery works. Our current policy is that if a clean shutdown occurs we do no recovery; if an unclean shutdown occurs we recover the last segment of all logs. To make this correct we need to ensure that each segment is fsync'd before we create a new segment--hence the fsync during roll.
> Obviously, if the fsync during roll is the only fsync that occurs, it will potentially write out the entire segment, which for a 1GB segment at 50MB/sec might take on the order of 20 seconds. The goal of this JIRA is to eliminate this and make it possible to run with no application-level fsyncs at all, depending entirely on replication and background writeback for durability.
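Returning to the code-structure question in the comment above, here is a rough sketch of the scanning variant, under the same caveats: Log, LogSegment, and SegmentFlusher are invented stand-ins, not the real classes, and FlushJournal is the journal sketched earlier.

{code:java}
import java.io.IOException;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins for the real log classes, for illustration only.
interface LogSegment {
    long baseOffset();
    boolean isActive();              // the segment currently being appended to
    void fsync() throws IOException; // force the segment's dirty pages to disk
}

interface Log {
    String topic();
    int partition();
    List<LogSegment> segments();     // all segments, in offset order
}

/** The O(n) scanning variant: every 30 seconds, walk all logs and flush any
 *  rolled segments the journal does not yet cover. */
public class SegmentFlusher implements Runnable {
    private final List<Log> logs;       // all logs in this data directory
    private final FlushJournal journal; // journal from the previous sketch
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    public SegmentFlusher(List<Log> logs, FlushJournal journal) {
        this.logs = logs;
        this.journal = journal;
    }

    public void start() {
        scheduler.scheduleWithFixedDelay(this, 30, 30, TimeUnit.SECONDS);
    }

    @Override
    public void run() {
        for (Log log : logs) {
            try {
                long flushedUpTo = journal.lastFlushedOffset(log.topic(), log.partition());
                for (LogSegment seg : log.segments()) {
                    // Never fsync the active segment (point 2 above), and skip
                    // segments the journal already records as flushed.
                    if (seg.isActive() || seg.baseOffset() <= flushedUpTo)
                        continue;
                    // Essentially a no-op at the OS level if a time/interval
                    // policy already wrote these pages back.
                    seg.fsync();
                    journal.recordFlush(log.topic(), log.partition(), seg.baseOffset());
                }
            } catch (IOException e) {
                // Non-fatal: the worst case is an unnecessary recovery after a crash.
            }
        }
    }
}
{code}

The enqueue alternative would replace the inner scan with a queue that roll() feeds and this same thread drains, trading the O(n) walk for the layering concern noted above.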