[ 
https://issues.apache.org/jira/browse/KAFKA-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037037#comment-16037037
 ] 

Markus B commented on KAFKA-5377:
---------------------------------

It's not a disk space issue.
- The E/F drives that are dedicated to the Kafka data files have 200 GB free 
space each.
- The D drive that stores the log files has over 200 GB free space as well.
- The memory dumps were being written to the C drive, which had less free 
space. Since we increased the heap size to 24 GB for the broker process, each 
memory dump was 24 GB, so it could not write more than a couple of them. 
We've changed our logic to write the memory dumps to the D drive and to 
delete old memory dumps.
We still see the issue consistently.
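
For anyone who wants to do similar cleanup, the "delete old memory dumps" step can be sketched roughly as below. This is only an illustrative Python sketch, not the logic actually running on our brokers: the function name `prune_old_dumps`, the `*.dmp` pattern, and the `keep` count are assumptions made for the example.

```python
from pathlib import Path


def prune_old_dumps(dump_dir, pattern="*.dmp", keep=2):
    """Delete all but the `keep` newest files matching `pattern` in `dump_dir`.

    Returns the list of deleted paths, oldest first.
    The pattern and keep count are illustrative defaults (assumptions).
    """
    # Sort matching files by modification time, oldest first.
    dumps = sorted(
        (p for p in Path(dump_dir).glob(pattern) if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    # Everything except the last `keep` entries is considered stale.
    stale = dumps[:-keep] if keep > 0 else dumps
    for path in stale:
        path.unlink()
    return stale
```

It could be called from a scheduled task as `prune_old_dumps(r"D:\dumps", keep=2)`, where the directory is a hypothetical location on the D drive.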

> Kafka server process crashing due to access violation (caused by log cleaner)
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-5377
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5377
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.2.0, 0.10.2.1
>         Environment: Windows 2008 R2, Intel Xeon CPU, 64 GB RAM
> 4 Disk Drives (C for software, D for log files, E/F for Kafka/Zookeeper data)
> 2 broker cluster
> JAVA 8 (131)
>            Reporter: Markus B
>              Labels: windows
>         Attachments: hs_err_pid15944.log, hs_err_pid6304.log, 
> hs_err_pid7356.log, hs_err_pid9056.log, hs_err_pid9276.log, 
> java_error7192.log, server.1.properties
>
>
> We are running Kafka in a two-broker cluster configuration on Windows, and 
> overall it has been working well for us. We have been seeing occasional 
> issues where the broker crashes first on one node, and then almost 
> immediately on the second. When we try to restart, the broker continues to 
> crash during startup until we fix the issue that caused the crash.
> I finally figured out that the root cause of the startup crashes was a bad 
> set of files in __consumer_offsets-2 (in this latest case; which offsets 
> partition is affected varies). Once I deleted the bad files, the broker 
> started up correctly again.
> From what I can tell, looking at the code, crash dump files, and log files, 
> it is all happening because of the log cleaner, and I can trace it in most 
> (if not all) cases to TimeIndex. The java dump file indicates some kind of 
> access violation, but I am not sure when or how that is happening. It seems 
> like the initial crashes happen during the compacting/swapping action, and 
> then the startups fail when they try to access the bad files 
> (TimeIndex.parse()).
> I am attaching dump files from two separate incidents, covering both the 
> initial crash and the later restart attempts. I am also including the broker 
> config settings that we are using.
> I'm not sure what additional information to provide, but I can add more if 
> needed.
> Any help, suggestions or input would be much appreciated.
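
As a rough way to spot a bad time index file before restarting the broker, something like the Python sketch below could be used. It assumes the on-disk layout of a `.timeindex` file is a sequence of 12-byte entries (an 8-byte timestamp followed by a 4-byte relative offset, big-endian), and it treats an all-zero entry after the first one as the start of a preallocated, unused tail. The function name `check_time_index` and the checks themselves are illustrative assumptions; this is not the same validation Kafka performs internally.

```python
import struct

ENTRY_SIZE = 12  # assumed layout: 8-byte timestamp + 4-byte relative offset


def check_time_index(path):
    """Return a list of human-readable problems found in a .timeindex file."""
    with open(path, "rb") as f:
        data = f.read()

    problems = []
    if len(data) % ENTRY_SIZE != 0:
        problems.append(
            f"file size {len(data)} is not a multiple of {ENTRY_SIZE}"
        )

    prev_ts, prev_off = -1, -1
    for i in range(len(data) // ENTRY_SIZE):
        # ">qi" = big-endian signed 64-bit timestamp, signed 32-bit offset.
        ts, rel_off = struct.unpack_from(">qi", data, i * ENTRY_SIZE)
        if ts == 0 and rel_off == 0 and i > 0:
            # Assumed to be the preallocated tail; it should be all zeros.
            if any(data[i * ENTRY_SIZE:]):
                problems.append(
                    f"entry {i}: zero entry followed by non-zero data"
                )
            break
        if ts < prev_ts:
            problems.append(f"entry {i}: timestamp {ts} < previous {prev_ts}")
        if rel_off < prev_off:
            problems.append(
                f"entry {i}: relative offset {rel_off} < previous {prev_off}"
            )
        prev_ts, prev_off = ts, rel_off
    return problems
```

An empty result only means these particular checks passed; it does not prove the file is one the broker will accept.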



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
