Mark Mindenhall created SAMZA-1044:
--------------------------------------

    Summary: Checkpointing requires log.cleaner.enable=true
    Key: SAMZA-1044
    URL: https://issues.apache.org/jira/browse/SAMZA-1044
    Project: Samza
    Issue Type: Bug
    Components: docs
    Environment: linux
    Reporter: Mark Mindenhall
    Priority: Minor

We're running Samza 0.9.1 with Kafka 0.8.2.1, which has a default setting of {{log.cleaner.enable=false}}. We didn't think we needed to enable this, as we never created any topics with {{cleanup.policy=compact}}. However, this morning we had a disk alert, and when I looked at the broker that triggered the alert, one of the Samza checkpoint topics was consuming 29GB within the {{/logs}} folder.

Long story short, I eventually figured out that all of the checkpoint topics were created with {{cleanup.policy=compact}}, and were growing unbounded because the log cleaner never ran. I set {{log.cleaner.enable=true}} on each broker and restarted them. Within minutes, the 29GB was reduced to 200-300KB.

I thought I must have missed this when I created our jobs with checkpointing enabled, so I went and scoured the docs. There's no mention of the {{log.cleaner.enable}} setting within the documentation (unless I missed it _again_).

I should add that we've been running most of these jobs for about a year, and I noticed that each time we deployed, it took longer and longer to transition from {{ACCEPTED}} to {{RUNNING}} in the YARN cluster. Eventually, it was taking 10-15 minutes per job, and we didn't understand why. After bouncing our staging cluster with {{log.cleaner.enable=true}} (and letting the log cleaner finish its work), I redeployed one of our jobs, and it once again took 15-20 seconds to go from {{ACCEPTED}} to {{RUNNING}}.

Please mention in the documentation that {{log.cleaner.enable}} must be set to {{true}} for checkpointing to work correctly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
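
For reference, the broker-side change described above amounts to one line in each broker's configuration. This is only a minimal sketch: the file name {{server.properties}} is an assumption (it is the usual name for the Kafka broker config, but the report does not say where the setting was applied), and the comments restate what was observed on this cluster rather than documented Samza behavior.

```properties
# Kafka broker configuration (file name assumed: server.properties).
# In Kafka 0.8.2.1 this defaults to false, so topics created with
# cleanup.policy=compact (as the Samza checkpoint topics were here)
# are never compacted and keep growing.
log.cleaner.enable=true
```

As in the report, each broker has to be restarted after the change, and the log cleaner needs some time to work through the existing checkpoint topics before disk usage and job startup times drop.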