[ https://issues.apache.org/jira/browse/KAFKA-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15383936#comment-15383936 ]
Ismael Juma edited comment on KAFKA-3689 at 7/20/16 2:36 PM:
-------------------------------------------------------------
So, looking at that NPE gives a clue. It turns out that `Selector` can `close`
and remove the channel for a connection even if it was able to complete a
receive in the same poll. It's unclear how this can happen, but the NPE shows
that it can.

That doesn't seem to be the same issue as the one originally reported in this
ticket.

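A toy model may make this failure mode concrete. Everything below (`ToyChannel`, `ToySelector`, `pollWithReceiveThenClose`, `NpeDemo`) is hypothetical and far simpler than Kafka's real `Selector` and `Processor`. The model only assumes what the NPE suggests: a connection id can still be listed in `completedReceives` after its channel has been closed and removed in the same poll, so a lookup by id returns `null`.

```scala
import scala.collection.mutable

// Hypothetical, heavily simplified stand-in for a network channel (not Kafka's KafkaChannel).
final case class ToyChannel(id: String, principalName: String)

// Hypothetical stand-in for the state left behind by one Selector poll.
final class ToySelector {
  val channels = mutable.Map[String, ToyChannel]()
  val completedReceives = mutable.ArrayBuffer[String]() // ids with a complete request
  val disconnected = mutable.ArrayBuffer[String]()      // ids closed during the poll

  // Java-style lookup: returns null once the channel has been removed.
  def channel(id: String): ToyChannel = channels.getOrElse(id, null)

  // Simulates the suspected sequence: a receive completes, then the same
  // connection is closed and its channel removed, all within one poll.
  def pollWithReceiveThenClose(id: String): Unit = {
    completedReceives += id
    channels.remove(id)
    disconnected += id
  }
}

object NpeDemo {
  // Naive processing: dereferences the channel without checking that it still
  // exists, so a receive whose channel was removed in the same poll throws an NPE.
  def processReceivesUnsafe(sel: ToySelector): Unit =
    sel.completedReceives.foreach { id =>
      val ch = sel.channel(id)
      println(s"request from ${ch.principalName}")
    }

  // Defensive processing: skips receives whose channel is already gone and
  // returns the principals of the receives that could be handled.
  def processReceivesSafe(sel: ToySelector): Seq[String] =
    sel.completedReceives.toSeq.flatMap(id => Option(sel.channel(id)).map(_.principalName))
}
```

In this model, a connection that is still open is handled normally, while one closed in the same poll is dropped by the defensive variant instead of surfacing as an NPE. Whether dropping the receive is the right behaviour for the real broker is a separate question.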
was (Author: ijuma):
So, looking at that NPE gives a clue. It turns out that `Selector` can
`disconnect` a connection even if it was able to complete a receive. This seems
pretty unlikely unless `connections.max.idle.ms` is very low or processing the
keys from the Selector takes longer than usual (for example, if the server is
very overloaded).

In this case, the connection would appear in both `completedReceives` and
`disconnected`, and if processing of `completedReceives` fails at the right
point, we would end up decrementing the connection count twice. The NPE is an
alternative failure mode.
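The double decrement described above can be sketched as follows. This is a minimal model under stated assumptions: a per-address counter whose `dec` throws when no count exists (as the `getOrElse` frames in the reported stack trace suggest), and a processing loop in which both the failed-receive path and the `disconnected` loop release the same connection's slot. `ToyConnectionQuotas`, `DoubleDecrementDemo`, `processPoll`, `processPollGuarded` and `receiveFails` are hypothetical names. The guard shown (release each connection id at most once per poll) is one possible mitigation, not necessarily the fix that went into Kafka.

```scala
import java.net.InetAddress
import scala.collection.mutable

// Simplified per-address connection counter. dec() is modelled on the behaviour
// in the stack trace (an IllegalArgumentException raised from getOrElse when no
// count exists); it is a sketch, not Kafka's ConnectionQuotas.
final class ToyConnectionQuotas {
  private val counts = mutable.Map[InetAddress, Int]()

  def inc(address: InetAddress): Unit = counts.synchronized {
    counts(address) = counts.getOrElse(address, 0) + 1
  }

  def dec(address: InetAddress): Unit = counts.synchronized {
    val count = counts.getOrElse(address,
      throw new IllegalArgumentException(
        s"Attempted to decrease connection count for address with no connections, address: $address"))
    if (count == 1) counts -= address
    else counts(address) = count - 1
  }

  def get(address: InetAddress): Int = counts.synchronized(counts.getOrElse(address, 0))
}

// Hypothetical model of one poll's worth of processing. Each entry is
// (connection id, client address); the same id may appear in both lists.
object DoubleDecrementDemo {
  type Conn = (String, InetAddress)

  // Unguarded: a failed receive releases the slot, and the disconnected list
  // releases it again for the same connection.
  def processPoll(quotas: ToyConnectionQuotas, completedReceives: Seq[Conn],
                  disconnected: Seq[Conn], receiveFails: String => Boolean): Unit = {
    completedReceives.foreach { case (id, addr) =>
      if (receiveFails(id)) quotas.dec(addr) // error path closes the connection
    }
    disconnected.foreach { case (_, addr) => quotas.dec(addr) }
  }

  // Guarded: release each connection id at most once per poll.
  def processPollGuarded(quotas: ToyConnectionQuotas, completedReceives: Seq[Conn],
                         disconnected: Seq[Conn], receiveFails: String => Boolean): Unit = {
    val released = mutable.Set[String]()
    def release(id: String, addr: InetAddress): Unit =
      if (released.add(id)) quotas.dec(addr)
    completedReceives.foreach { case (id, addr) => if (receiveFails(id)) release(id, addr) }
    disconnected.foreach { case (id, addr) => release(id, addr) }
  }
}
```

With one open connection from an address, the unguarded loop decrements once on the failed receive (removing the entry) and then again for the disconnect, which raises the same `IllegalArgumentException` as in the report. The guarded loop leaves the count at zero.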
> Exception when attempting to decrease connection count for address with no connections
> --------------------------------------------------------------------------------------
>
> Key: KAFKA-3689
> URL: https://issues.apache.org/jira/browse/KAFKA-3689
> Project: Kafka
> Issue Type: Bug
> Components: network
> Affects Versions: 0.9.0.1
> Environment: Ubuntu 14.04,
> java version "1.7.0_95"
> OpenJDK Runtime Environment (IcedTea 2.6.4) (7u95-2.6.4-0ubuntu0.14.04.2)
> OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
> 3-broker cluster (all 3 servers identical: Intel Xeon E5-2670 @2.6GHz,
> 8 cores, 16 threads, 64 GB RAM & 1 TB disk)
> The Kafka cluster is managed by a 3-server ZK cluster (these servers are
> different from the Kafka broker servers). All 6 servers are connected via a
> 10G switch.
> Producers run from external servers.
> Reporter: Buvaneswari Ramanan
> Assignee: Jun Rao
> Fix For: 0.10.1.0, 0.10.0.1
>
> Attachments: KAFKA-3689.log.redacted, kafka-3689-instrumentation.patch
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> As per Ismael Juma's suggestion in the email thread to [email protected]
> with the same subject, I am creating this bug report.
>
> The following error occurs on one of the brokers in our 3-broker cluster,
> which serves about 8000 topics. Each topic has a single partition and a
> replication factor of 3, and receives data at a low rate (about 200 bytes/sec).
> Leaders are balanced across the topics.
>
> Producers run from external servers (4 Ubuntu servers with the same config as
> the brokers), each producing to 2000 topics using the kafka-python library.
>
> This error message occurs repeatedly on one of the servers. Between 10:30am
> and 1:30pm on 5/9/16, there were about 10 million such occurrences. This was
> right after a cluster restart.
>
> This is not the first time we have seen this error on this broker. In earlier
> instances, the error occurred hours or days after a cluster restart.
> =====================================================
> [2016-05-09 10:38:43,932] ERROR Processor got uncaught exception. (kafka.network.Processor)
> java.lang.IllegalArgumentException: Attempted to decrease connection count for address with no connections, address: /X.Y.Z.144 (actual network address masked)
> at kafka.network.ConnectionQuotas$$anonfun$9.apply(SocketServer.scala:565)
> at kafka.network.ConnectionQuotas$$anonfun$9.apply(SocketServer.scala:565)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> at kafka.network.ConnectionQuotas.dec(SocketServer.scala:564)
> at kafka.network.Processor$$anonfun$run$13.apply(SocketServer.scala:450)
> at kafka.network.Processor$$anonfun$run$13.apply(SocketServer.scala:445)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at kafka.network.Processor.run(SocketServer.scala:445)
> at java.lang.Thread.run(Thread.java:745)
> [2016-05-09 10:38:43,932] ERROR Processor got uncaught exception. (kafka.network.Processor)
> java.lang.IllegalArgumentException: Attempted to decrease connection count for address with no connections, address: /X.Y.Z.144
> at kafka.network.ConnectionQuotas$$anonfun$9.apply(SocketServer.scala:565)
> at kafka.network.ConnectionQuotas$$anonfun$9.apply(SocketServer.scala:565)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> at kafka.network.ConnectionQuotas.dec(SocketServer.scala:564)
> at kafka.network.Processor$$anonfun$run$13.apply(SocketServer.scala:450)
> at kafka.network.Processor$$anonfun$run$13.apply(SocketServer.scala:445)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at kafka.network.Processor.run(SocketServer.scala:445)
> at java.lang.Thread.run(Thread.java:745)