[
https://issues.apache.org/jira/browse/KAFKA-5038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ewen Cheslack-Postava updated KAFKA-5038:
-----------------------------------------
Fix Version/s: (was: 0.10.2.0)
Removing invalid fix version since the tagged version has already been released.
> running multiple kafka streams instances causes one or more instance to get
> into file contention
> ------------------------------------------------------------------------------------------------
>
> Key: KAFKA-5038
> URL: https://issues.apache.org/jira/browse/KAFKA-5038
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 0.10.2.0
> Environment: 3 Kafka broker machines and 3 kafka streams machines.
> Each machine is Linux 64 bit, CentOS 6.5 with 64GB memory, 8 vCPUs running in
> AWS
> 31GB java heap space allocated to each KafkaStreams instance and 4GB
> allocated to each Kafka broker.
> Reporter: Bharad Tirumala
>
> Having multiple kafka streams application instances causes one or more
> instances to get get into file lock contention and the instance(s) become
> unresponsive with uncaught exception.
> The exception is below:
> 22:14:37.621 [StreamThread-7] WARN o.a.k.s.p.internals.StreamThread -
> Unexpected state transition from RUNNING to NOT_RUNNING
> 22:14:37.621 [StreamThread-13] WARN o.a.k.s.p.internals.StreamThread -
> Unexpected state transition from RUNNING to NOT_RUNNING
> 22:14:37.623 [StreamThread-18] WARN o.a.k.s.p.internals.StreamThread -
> Unexpected state transition from RUNNING to NOT_RUNNING
> 22:14:37.625 [StreamThread-7] ERROR n.a.a.k.t.KStreamTopologyBase - Uncaught
> Exception:org.apache.kafka.streams.errors.ProcessorStateException: task
> directory [/data/kafka-streams/rtp-kstreams-metrics/0_119] doesn't exist and
> couldn't be created
> at
> org.apache.kafka.streams.processor.internals.StateDirectory.directoryForTask(StateDirectory.java:75)
> at
> org.apache.kafka.streams.processor.internals.StateDirectory.lock(StateDirectory.java:102)
> at
> org.apache.kafka.streams.processor.internals.StateDirectory.cleanRemovedTasks(StateDirectory.java:205)
> at
> org.apache.kafka.streams.processor.internals.StreamThread.maybeClean(StreamThread.java:753)
> at
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:664)
> at
> org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368)
> This happens within couple of minutes after the instances are up and there is
> NO data being sent to the broker yet and the streams app is started with
> auto.offset.reset set to "latest".
> Please note that there are no permissions or capacity issues. This may have
> nothing to do with number of instances, but I could easily reproduce it when
> I've 3 stream instances running. This is similar to the (and may be the same)
> bug as [KAFKA-3758]
> Here are some relevant configuration info:
> 3 kafka brokers have one topic with 128 partitions and 1 replication
> 3 kafka streams applications (running on 3 machines) have a single processor
> topology and this processor is not doing anything (the process() method just
> returns and the punctuate method just commits)
> There is no data flowing yet, so the process() and puctuate() methods are not
> even called yet.
> The 3 kafka stream instances have 43, 43 and 42 threads each respectively
> (totally making up to 128 threads, so one task per thread distributed across
> three streams instances on 3 machines).
> Here are the configurations that I'd played around with:
> session.timeout.ms=300000
> heartbeat.interval.ms=60000
> max.poll.records=100
> num.standby.replicas=1
> commit.interval.ms=10000
> poll.ms=100
> When punctuate is scheduled to be called every 1000ms or 3000ms, the problem
> happens every time. If punctuate is scheduled for 5000ms, I didn't see the
> problem in my test scenario (described above), but it happened in my real
> application. But this may have nothing to do with the issue, since punctuate
> is not even called as there are no messages streaming through yet.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)