Hello all,
We've recently been experiencing some Kafka/Samza issues that we're not quite
sure how to tackle. We've exhausted our internal expertise and are hoping
someone on the mailing list has seen this before and knows what might cause it:
KafkaSystemConsumer [WARN] While refreshing brokers for
[Store_LogParser_RedactedMetadata_RedactedEnvironment,35]:
org.apache.kafka.common.errors.OffsetOutOfRangeException: The requested offset
is not within the range of offsets maintained by the server.. Retrying.
^ (Above repeats indefinitely until we intervene)
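For reference, here's a minimal sketch (plain Java Kafka client; the broker
address, partition, and checkpointed offset below are placeholders) of how one
could confirm that a checkpointed offset really falls outside the broker's
retained range:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetRangeCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka01:9092"); // placeholder broker
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            // Partition 35 of the store topic, as in the warning above
            TopicPartition tp = new TopicPartition(
                    "Store_LogParser_RedactedMetadata_RedactedEnvironment", 35);
            long checkpointed = 123456L; // placeholder: offset read from the checkpoint

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                long earliest = consumer.beginningOffsets(
                        Collections.singletonList(tp)).get(tp);
                long latest = consumer.endOffsets(
                        Collections.singletonList(tp)).get(tp);
                // OffsetOutOfRangeException implies checkpointed < earliest
                // (segments expired/deleted) or checkpointed > latest
                System.out.printf("valid range [%d, %d), checkpointed: %d%n",
                        earliest, latest, checkpointed);
            }
        }
    }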
A bit about our use case:
* Versions:
* Kafka 1.0.1 (CDH Distribution 3.1.0-1.3.1.0.p0.35)
* Samza 0.14.1
* Hadoop: 2.6.0-cdh5.12.1
* We've seen some manifestation of this error in 4 different environments
with minor differences in configuration, but all running the same versions of
the software
* Distributed Samza on YARN (~10-node YARN environment, 3-7 node Kafka
environment)
* Non-distributed virtual test environment (Samza on YARN, but with no
network in between)
* We have not found a reliable way to reproduce this error
* The issue typically presents at process startup. It usually makes no
difference whether the application was down for 5 minutes or 5 days before
that startup
* The LogParser application experiencing this issue reads and parses a set
of log files, supplementing them with metadata that is stored in the Store
topic in question and cached locally in RocksDB
* The LogParser application has 40-60 running tasks and partitions
depending on configuration
* There is no discernible pattern for where the error presents itself:
* It is not consistent WRT which YARN node hosts tasks with the issue
* It is not consistent WRT which Kafka node hosts the partitions
relevant to the issue
* The same nodes are not involved across consecutive appearances of
the error
* This leads us to believe the bug is probably endemic to the whole
cluster and not the result of a random hardware issue
* Offsets for the LogParser application are maintained in a Samza topic
called something like:
* __samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1
* Upon startup, checkpoints are refreshed from that topic, and we'll see
something in the log similar to:
* kafka.KafkaCheckpointManager [INFO] Read 6000 from topic:
__samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1. Current offset:
5999
* On more than one occasion, we have attempted to repair the job by
killing individual YARN containers and letting Samza retry them.
* This occasionally works. More frequently, it gets the partition
stuck in a loop trying to read from the __samza_checkpoint topic forever; we
suspect the retry loop above is writing checkpoint entries one or many times
per retry, causing the topic to fill up considerably (see the sketch after
this list for a way to gauge that growth).
* We are aware of only two workarounds:
* 1- Fully clearing out the data disks on the Kafka servers and
rebuilding the topics always seems to work, at least for a time.
* 2- We can use a setting like:
streams.Store_LogParser_RedactedMetadata_RedactedEnvironment.samza.reset.offset=true,
which causes Samza to ignore the checkpoint topic entirely and skip validating
any offset on the Store.
* This works, but requires us to do a lengthy metadata refresh
immediately after startup, which is less than ideal.
* We have also seen this on rare occasion on other, smaller Samza tiers
* In those cases, the common thread appears to be that the tier was left
down for a period of time longer than the Kafka retention timeout, and got
stuck in the loop upon restart. Attempts at reproducing it this way have been
unsuccessful
* It's worth adding that in those cases, the samza.reset.offset
parameter did not seem to have the intended effect when added to the
configuration
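To follow up on the checkpoint-topic suspicion above, here's a similar sketch
for watching whether the checkpoint topic's retained message count keeps
climbing while a job is stuck in the retry loop (again, the broker address is
a placeholder; Samza checkpoint topics have a single partition):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class CheckpointTopicGrowth {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka01:9092"); // placeholder broker
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            // Samza writes all checkpoints to partition 0 of this topic
            TopicPartition tp = new TopicPartition(
                    "__samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1", 0);

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                long begin = consumer.beginningOffsets(
                        Collections.singletonList(tp)).get(tp);
                long end = consumer.endOffsets(
                        Collections.singletonList(tp)).get(tp);
                // If this count climbs rapidly while a container is stuck in
                // the retry loop, checkpoints are being re-written per retry
                System.out.printf("checkpoint messages retained: %d%n", end - begin);
            }
        }
    }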
On another possibly-related note, one of our clusters periodically throws an
error like this, but usually recovers without intervention:
KafkaSystemAdmin [WARN] Exception while trying to get offset for
SystemStreamPartition [kafka,
Store_LogParser_RedactedMetadata_RedactedEnvironment, 32]:
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition.. Retrying.
* We've seen this error message crop up when we've had network issues in
our datacenter, but we're not aware of any such issues at the times when the
bigger problem occurs, so we're not sure whether the two are related.
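For completeness, here's a sketch of how leadership for the affected
partitions could be double-checked with the Java AdminClient while that
warning is firing (broker address is again a placeholder):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;

    public class LeaderCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka01:9092"); // placeholder broker

            try (AdminClient admin = AdminClient.create(props)) {
                String topic = "Store_LogParser_RedactedMetadata_RedactedEnvironment";
                TopicDescription desc = admin.describeTopics(
                        Collections.singletonList(topic)).all().get().get(topic);
                // NotLeaderForPartitionException generally means the client's
                // cached metadata disagrees with the leaders listed here
                desc.partitions().forEach(p ->
                        System.out.printf("partition %d: leader=%s isr=%s%n",
                                p.partition(), p.leader(), p.isr()));
            }
        }
    }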
Has anyone seen these errors before? Is there a known workaround or fix for them?
Thanks for your help!
Attached is a copy of the Samza configuration for the job in question, in case
it contains valuable information I may have missed.
-Will Schneider