[ 
https://issues.apache.org/jira/browse/KAFKA-5038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on KAFKA-5038 started by Eno Thereska.
-------------------------------------------
> running multiple kafka streams instances causes one or more instance to get 
> into file contention
> ------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5038
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5038
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.0
>         Environment: 3 Kafka broker machines and 3 kafka streams machines.
> Each machine is Linux 64 bit, CentOS 6.5 with 64GB memory, 8 vCPUs running in 
> AWS
> 31GB java heap space allocated to each KafkaStreams instance and 4GB 
> allocated to each Kafka broker.
>            Reporter: Bharad Tirumala
>            Assignee: Eno Thereska
>
> Having multiple kafka streams application instances causes one or more 
> instances to get get into file lock contention and the instance(s) become 
> unresponsive with uncaught exception.
> The exception is below:
> 22:14:37.621 [StreamThread-7] WARN  o.a.k.s.p.internals.StreamThread - 
> Unexpected state transition from RUNNING to NOT_RUNNING
> 22:14:37.621 [StreamThread-13] WARN  o.a.k.s.p.internals.StreamThread - 
> Unexpected state transition from RUNNING to NOT_RUNNING
> 22:14:37.623 [StreamThread-18] WARN  o.a.k.s.p.internals.StreamThread - 
> Unexpected state transition from RUNNING to NOT_RUNNING
> 22:14:37.625 [StreamThread-7] ERROR n.a.a.k.t.KStreamTopologyBase - Uncaught 
> Exception:org.apache.kafka.streams.errors.ProcessorStateException: task 
> directory [/data/kafka-streams/rtp-kstreams-metrics/0_119] doesn't exist and 
> couldn't be created
>       at 
> org.apache.kafka.streams.processor.internals.StateDirectory.directoryForTask(StateDirectory.java:75)
>       at 
> org.apache.kafka.streams.processor.internals.StateDirectory.lock(StateDirectory.java:102)
>       at 
> org.apache.kafka.streams.processor.internals.StateDirectory.cleanRemovedTasks(StateDirectory.java:205)
>       at 
> org.apache.kafka.streams.processor.internals.StreamThread.maybeClean(StreamThread.java:753)
>       at 
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:664)
>       at 
> org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368)
> This happens within couple of minutes after the instances are up and there is 
> NO data being sent to the broker yet and the streams app is started with 
> auto.offset.reset set to "latest".
> Please note that there are no permissions or capacity issues. This may have 
> nothing to do with number of instances, but I could easily reproduce it when 
> I've 3 stream instances running. This is similar to the (and may be the same) 
> bug as [KAFKA-3758]
> Here are some relevant configuration info:
> 3 kafka brokers have one topic with 128 partitions and 1 replication
> 3 kafka streams applications (running on 3 machines) have a single processor 
> topology and this processor is not doing anything (the process() method just 
> returns and the punctuate method just commits)
> There is no data flowing yet, so the process() and puctuate() methods are not 
> even called yet.
> The 3 kafka stream instances have 43, 43 and 42 threads each respectively 
> (totally making up to 128 threads, so one task per thread distributed across 
> three streams instances on 3 machines).
> Here are the configurations that I'd played around with:
> session.timeout.ms=300000
> heartbeat.interval.ms=60000
> max.poll.records=100
> num.standby.replicas=1
> commit.interval.ms=10000
> poll.ms=100
> When punctuate is scheduled to be called every 1000ms or 3000ms, the problem 
> happens every time. If punctuate is scheduled for 5000ms, I didn't see the 
> problem in my test scenario (described above), but it happened in my real 
> application. But this may have nothing to do with the issue, since punctuate 
> is not even called as there are no messages streaming through yet.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to