Ok. What I just saw was that when the controller machine reaches around 4100+ open files, it crashes. Then I think the controller bounced between 2 other machines, taking them down too, and then circled back to the original machine.
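A minimal sketch of how to watch just the broker process's own descriptor count rather than the system-wide lsof total (assuming a Linux host and that the broker was started with the standard kafka.Kafka main class):

    # find the broker's PID and count its open file descriptors once per second;
    # this is the number the per-process open-files ulimit actually applies to
    BROKER_PID=$(pgrep -f kafka.Kafka)
    watch -n 1 "ls /proc/${BROKER_PID}/fd | wc -l"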
Paul Lung

On 6/24/14, 10:51 PM, "Lung, Paul" <pl...@ebay.com> wrote:

>The controller machine has 3500 or so, while the other machines have
>around 1600.
>
>Paul Lung
>
>On 6/24/14, 10:31 PM, "Prakash Gowri Shankor" <prakash.shan...@gmail.com>
>wrote:
>
>>How many files does each broker itself have open? You can find this from
>>'ls -l /proc/<processid>/fd'
>>
>>On Tue, Jun 24, 2014 at 10:18 PM, Lung, Paul <pl...@ebay.com> wrote:
>>
>>> Hi All,
>>>
>>> I just upgraded my cluster from 0.8.1 to 0.8.1.1. I'm seeing the
>>> following error messages on the same 3 brokers once in a while:
>>>
>>> [2014-06-24 21:43:44,711] ERROR Error in acceptor (kafka.network.Acceptor)
>>> java.io.IOException: Too many open files
>>>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>>>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:163)
>>>         at kafka.network.Acceptor.accept(SocketServer.scala:200)
>>>         at kafka.network.Acceptor.run(SocketServer.scala:154)
>>>         at java.lang.Thread.run(Thread.java:679)
>>>
>>> [2014-06-24 21:43:44,711] ERROR Error in acceptor (kafka.network.Acceptor)
>>> java.io.IOException: Too many open files
>>>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>>>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:163)
>>>         at kafka.network.Acceptor.accept(SocketServer.scala:200)
>>>         at kafka.network.Acceptor.run(SocketServer.scala:154)
>>>         at java.lang.Thread.run(Thread.java:679)
>>>
>>> When this happens, these 3 brokers essentially go out of sync when you do
>>> a "kafka-topics.sh --describe".
>>>
>>> I tracked the number of open files by doing "watch -n 1 'sudo lsof | wc -l'",
>>> which basically counts all open files on the system. The numbers for the
>>> systems are basically in the 6000 range, with one system going to 9000.
>>> I presume the 9000 machine is the controller. Looking at the ulimit of the
>>> user, both the hard limit and the soft limit for open files is 100,000.
>>> Using sysctl, the max file is fs.file-max = 9774928. So we seem to be way
>>> under the limit.
>>>
>>> What am I missing here? Is there some JVM limit around 10K open files or
>>> something?
>>>
>>> Paul Lung
>>>
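P.S. The ulimit reported in a fresh shell is not necessarily the limit the running broker inherited when it was launched. One way to double-check the limits the broker process itself sees (again assuming a Linux host, and using the kafka.Kafka launcher class to find the PID):

    # show the soft/hard "Max open files" limits as seen by the broker process
    grep 'Max open files' /proc/$(pgrep -f kafka.Kafka)/limits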