[ 
https://issues.apache.org/jira/browse/KAFKA-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489859#comment-13489859
 ] 

Jay Kreps commented on KAFKA-600:
---------------------------------

This seems sensible but the problem is this. If you get ENOSPC, you likely 
accepted a partial write, which means the log itself is corrupt. To fix this we 
need to run recovery on the log (which checks each message in the last segment 
and truncates off any invalid partitial writes). This happens automatically 
when the server is restarted. Theoretically this could be done automatically 
but since the recovery process can be slow (minutes) and we can't accept writes 
during that time we felt the best course of action is to have the node shoot 
itself in the head and let other healthy nodes take over.
                
> kafka should respond gracefully rather than crash when unable to write due to 
> ENOSPC
> ------------------------------------------------------------------------------------
>
>                 Key: KAFKA-600
>                 URL: https://issues.apache.org/jira/browse/KAFKA-600
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: Tim Stilwell
>
> problem:
> user starts kafka with log.dir value set to a small partition and begins 
> writing data to the mq.  when the disk partition is full, kafka crashes.  
> given that this product is used for both reading and writing operations, 
> crashing seems rather drastic even if the error message is helpful.   
> something more robust would be appreciated.  perhaps, logging an error and 
> rejecting additional write requests while accepting additional read requests? 
>  perhaps, sending an email alert to Operations?  at least shutdown gracefully 
> so the user is aware that received messages were saved with a helpful message 
> providing some details of the last message received.  when tens or hundreds 
> of thousands of messages can be processed in a second, it isn't helpful to 
> merely log a timestamp and crash.
> steps to reproduce:
> 1) download and install kafka
> 2) modify server.properties
>     # vi /opt/kafka-0.7.2-incubating-src/config/server.properties
>     set log.dir="/var/log/kafka"
> 3) modify log4j
>     # vi /opt/kafka-0.7.2-incubating-src/config/log4j.properties
>     set fileAppender.File=/var/log/kafka/kafka-request.log
> 4) start kafka service
>     $ sudo bash
>     # ulimit -c unlimited
>     # /opt/kafka-0.7.2-incubating-src/bin/kafka-server-start.sh 
> /opt/kafka-0.7.2-incubating-src/config/server.properties &
> 6) begin writing data to hostname:9092
> 7) review /var/log/kafka-request.log
> results:
> $ grep log.dir /opt/kafka-0.7.2-incubating-src/config/server.properties
> log.dir=/var/log/kafka
> $ df -h /var/log/kafka
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda1       4.0G  4.0G     0 100% /
> $ tail /var/log/kafka/kafka-request.log
> 17627442 [ZkClient-EventThread-14-10.0.20.242:2181] INFO  
> kafka.server.KafkaZooKeeper  - Begin registering broker topic 
> /brokers/topics/raw/0 with 1 partitions
> 17627444 [ZkClient-EventThread-14-10.0.20.242:2181] INFO  
> kafka.server.KafkaZooKeeper  - End registering broker topic 
> /brokers/topics/raw/0
> 17627445 [ZkClient-EventThread-14-10.0.20.242:2181] INFO  
> kafka.server.KafkaZooKeeper  - done re-registering broker
> 18337676 [kafka-processor-3] ERROR kafka.network.Processor  - Closing socket 
> for /10.0.20.138 because of error
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:218)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:191)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
>         at kafka.utils.Utils$.read(Utils.scala:538)
>         at 
> kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>         at kafka.network.Processor.read(SocketServer.scala:311)
>         at kafka.network.Processor.run(SocketServer.scala:214)
>         at java.lang.Thread.run(Thread.java:722)
> 18391974 [kafka-processor-4] INFO  kafka.network.Processor  - Closing socket 
> connection to /10.0.20.138.
> 18422004 [kafka-processor-5] INFO  kafka.network.Processor  - Closing socket 
> connection to /10.0.20.138.
> 18434563 [kafka-processor-6] INFO  kafka.network.Processor  - Closing socket 
> connection to /10.0.20.138.
> 18485005 [kafka-processor-7] INFO  kafka.network.Processor  - Closing socket 
> connection to /10.0.20.138.
> 18497083 [kafka-processor-0] INFO  kafka.network.Processor  - Closing socket 
> connection to /10.0.20.138.
> 18525720 [kafka-processor-1] INFO  kafka.network.Processor  - Closing socket 
> connection to /10.0.20.138.
> 18543843 [kafka-processor-2] INFO  kafka.network.Processor  - Closing socket 
> connection to /10.0.20.138.
> 18563230 [kafka-processor-4] INFO  kafka.network.Processor  - Closing socket 
> connection to /10.0.20.138.
> 18575613 [kafka-processor-5] INFO  kafka.network.Processor  - Closing socket 
> connection to /10.0.20.138.
> 18677568 [kafka-processor-6] ERROR kafka.network.Processor  - Closing socket 
> for /10.0.20.138 because of error
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:218)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:191)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
>         at kafka.utils.Utils$.read(Utils.scala:538)
>         at 
> kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>         at kafka.network.Processor.read(SocketServer.scala:311)
>         at kafka.network.Processor.run(SocketServer.scala:214)
>         at java.lang.Thread.run(Thread.java:722)
> 18828016 [kafka-processor-7] INFO  kafka.network.Processor  - Closing socket 
> connection to /10.0.20.248.
> 18844274 [kafka-processor-0] INFO  kafka.network.Processor  - Closing socket 
> connection to /10.0.20.248.
> 18849691 [kafka-processor-1] INFO  kafka.network.Processor  - Closing socket 
> connection to /10.0.20.248.
> 18896883 [kafka-processor-2] INFO  kafka.network.Processor  - Closing socket 
> connection to /10.0.20.248.
> 22383195 [kafka-processor-2] FATAL kafka.log.Log  - Halting due to 
> unrecoverable I/O error while handling producer request
> java.io.IOException: No space left on device
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>         at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:59)
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
>         at sun.nio.ch.IOUtil.write(IOUtil.java:60)
>         at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:195)
>         at 
> kafka.message.ByteBufferMessageSet.writeTo(ByteBufferMessageSet.scala:76)
>         at kafka.message.FileMessageSet.append(FileMessageSet.scala:159)
>         at kafka.log.LogSegment.append(Log.scala:105)
>         at kafka.log.Log.liftedTree1$1(Log.scala:246)
>         at kafka.log.Log.append(Log.scala:242)
>         at 
> kafka.server.KafkaRequestHandlers.kafka$server$KafkaRequestHandlers$$handleProducerRequest(KafkaRequestHandlers.scala:69)
>         at 
> kafka.server.KafkaRequestHandlers.handleProducerRequest(KafkaRequestHandlers.scala:53)
>         at 
> kafka.server.KafkaRequestHandlers$$anonfun$handlerFor$1.apply(KafkaRequestHandlers.scala:38)
>         at 
> kafka.server.KafkaRequestHandlers$$anonfun$handlerFor$1.apply(KafkaRequestHandlers.scala:38)
>         at kafka.network.Processor.handle(SocketServer.scala:296)
>         at kafka.network.Processor.read(SocketServer.scala:319)
>         at kafka.network.Processor.run(SocketServer.scala:214)
>         at java.lang.Thread.run(Thread.java:722)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to