Ok, I didn't realize the write to disk was immediate (is that new in 0.8, with requested acks enabled?).
I do think the OS will indeed reserve space in advance for data not yet flushed to disk. This seems to be true, at least, for xfs, with which I have more experience lately.

Jason

On Thu, Aug 15, 2013 at 11:30 AM, Jay Kreps <jay.kr...@gmail.com> wrote:

> I am saying we always immediately write to the fs. So the question is: is it possible, with delayed allocation in ext4, to do a successful write that later cannot be flushed to disk due to running out of space? I don't know the answer to this, though I would hope it is not possible.
>
> Basically, if our write to the fs succeeds and replicas acknowledge, then we send back the ack.
>
> -Jay
>
> On Thu, Aug 15, 2013 at 11:12 AM, Jason Rosenberg <j...@squareup.com> wrote:
>
> > Hmmm.... I guess I was thinking that a broker could receive a message and keep it in memory before having disk space reserved for its eventual storage. Are you saying that memory is not allocated for a message without there already being disk space allocated for it? In which case, there should be no problem!
> >
> > Jason
> >
> > On Thu, Aug 15, 2013 at 10:44 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
> >
> > > I don't think the filesystem will overcommit its disk space, but I'm actually not sure. I think this would only come into play on a fs like ext4, which does lazy block allocation in addition to lazy writing. But I think even ext4 is probably not allowed to hand out more disk space than it has.
> > >
> > > On Thu, Aug 15, 2013 at 10:18 AM, Jason Rosenberg <j...@squareup.com> wrote:
> > >
> > > > A related question: Will producers sending messages with acknowledgment get a failed ack if a broker is out of disk space, or will messages get buffered in memory successfully (resulting in a good ack, before failing to be written)?
> > > >
> > > > It seems like it might be a good feature to have the broker auto-detect if its log dir is nearing full, so that there is some runway to gracefully shut down while still writing any in-memory buffered messages. It could be an optional threshold, like 98% full, or X MB free, etc.
> > > >
> > > > Jason
> > > >
> > > > On Wed, Aug 14, 2013 at 7:58 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> > > >
> > > > > The crash is actually just a call to shutdown. We think this is the right thing to do, though I agree it is unintuitive. Here is why. When you get an out-of-space error, it is likely that the operating system did a partial write, leaving you with a corrupt log. Furthermore, it is possible that space will free up, at which point more writes on the log could succeed, so you wouldn't even know there was a problem, but all your consumers would hit this data and choke.
> > > > >
> > > > > By "crashing" the node we ensure that recovery is run on the log to bring it into a consistent state.
> > > > >
> > > > > Theoretically we could leave the node up accepting reads but rejecting writes while attempting to recover the log. But there are a bunch of problems with this, and it is very complex. Likely, if you are out of space, you are just going to keep getting writes, running out of space again, then running recovery, and so on. This kind of crazy loop is much worse than just needing to bring the node back up.
> > > > >
> > > > > Alternately we could leave the node up but go into some kind of write-rejecting mode forever. But this would still require that you restart the node, and we would have to implement that write-rejecting mode.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > -Jay
> > > > >
> > > > > On Wed, Aug 14, 2013 at 9:52 AM, Bryan Baugher <bjb...@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > This is more of a thought question than a problem that I need support for. I have been trying out Kafka 0.8.0-beta1 with replication. For our use case we want to try and guarantee that our consumers will see all messages even if they have fallen greatly behind the broker/producer. For this reason I wanted to know how the broker would react when the filesystem it writes its messages to is full. What I found was that the broker crashes and cannot be started until the filesystem has space again.
> > > > > >
> > > > > > Is there, or would it make sense to provide, configuration allowing the broker to reject writes in this case rather than crashing, electing a new leader and attempting the write again? I can clearly understand the use case that we don't want to 'lose' messages from the producer, and I could also see how lack of filesystem space could be considered a machine failure, but with replication I would think if you are running out of space on 1 broker you are likely running out of space on others.
> > > > > >
> > > > > > Bryan
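
For illustration only, here is a minimal sketch of the kind of threshold check Jason suggests in this thread (98% full, or X MB free). It is not Kafka code: `evaluate_usage`, `check_log_dir`, `max_used_fraction`, and `min_free_bytes` are hypothetical names chosen for this example. It also only looks at free space at the moment of the call, so it says nothing about the delayed-allocation question discussed in the thread.

```python
import shutil
from typing import NamedTuple


class DiskCheck(NamedTuple):
    used_fraction: float      # share of the filesystem not available to us
    free_bytes: int           # bytes currently available for new writes
    should_stop_writes: bool  # True if either threshold is crossed


def evaluate_usage(total_bytes, free_bytes,
                   max_used_fraction=0.98, min_free_bytes=0):
    """Apply the two hypothetical thresholds to raw byte counts."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_fraction = 1.0 - free_bytes / total_bytes
    too_full = used_fraction >= max_used_fraction
    too_little_free = free_bytes < min_free_bytes
    return DiskCheck(used_fraction, free_bytes, too_full or too_little_free)


def check_log_dir(path, max_used_fraction=0.98, min_free_bytes=0):
    """Check the filesystem that holds `path` against the thresholds."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return evaluate_usage(usage.total, usage.free,
                          max_used_fraction, min_free_bytes)


if __name__ == "__main__":
    # Example: warn when the current directory's filesystem is 98% full
    # or has less than 512 MB free.
    result = check_log_dir(".", min_free_bytes=512 * 1024 * 1024)
    print(result)
```

A process could run such a check periodically and begin a graceful shutdown once `should_stop_writes` is true, which is the "runway" idea from the thread. Whether that is preferable to the shutdown-on-error behaviour Jay describes is a design question this sketch does not settle.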