Hello all,
We've recently been experiencing some Kafka/Samza issues that we're not quite
sure how to tackle. We've exhausted our internal expertise and are hoping
someone on the mailing list has seen this before and knows what might cause it:
KafkaSystemConsumer [WARN] While refreshing brokers for
[Store_LogParser_RedactedMetadata_RedactedEnvironment,35]:
org.apache.kafka.common.errors.OffsetOutOfRangeException: The requested offset
is not within the range of offsets maintained by the server.. Retrying.
^ (Above repeats indefinitely until we intervene)
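For reference, here's a minimal sketch (plain Java Kafka client; the broker
address, partition, and checkpointed offset below are placeholders) of how one
could confirm that a checkpointed offset really falls outside the broker's
retained range:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetRangeCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka01:9092"); // placeholder broker
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            // Partition 35 of the store topic, as in the warning above
            TopicPartition tp = new TopicPartition(
                    "Store_LogParser_RedactedMetadata_RedactedEnvironment", 35);
            long checkpointed = 123456L; // placeholder: offset read from the checkpoint

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                long earliest = consumer.beginningOffsets(
                        Collections.singletonList(tp)).get(tp);
                long latest = consumer.endOffsets(
                        Collections.singletonList(tp)).get(tp);
                // OffsetOutOfRangeException implies checkpointed < earliest
                // (segments expired/deleted) or checkpointed > latest
                System.out.printf("valid range [%d, %d), checkpointed: %d%n",
                        earliest, latest, checkpointed);
            }
        }
    }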
A bit about our use case:
* Versions:
* Kafka 1.0.1 (CDH Distribution 3.1.0-1.3.1.0.p0.35)
* Samza 0.14.1
* Hadoop: 2.6.0-cdh5.12.1
* We've seen some manifestation of this error in 4 different environments
with minor differences in configuration, but all running the same versions of
the software
* Distributed Samza on YARN (~10-node YARN environment, 3-7 node Kafka
environment)
* Non-distributed virtual test environment (Samza on YARN, but with no
network in between)
* We have not found a reliable way to reproduce this error
* The issue typically presents at process startup. It usually makes no
difference whether the application was down for 5 minutes or 5 days before
that startup
* The LogParser application experiencing this issue reads and parses a set
of log files, supplementing them with metadata that is stored in the Store
topic in question and cached locally in RocksDB
* The LogParser application has 40-60 running tasks and partitions
depending on configuration
* There is no discernible pattern for where the error presents itself:
* It is not consistent WRT which YARN node hosts tasks with the issue
* It is not consistent WRT which Kafka node hosts the partitions
relevant to the issue
* The same nodes are not involved across consecutive appearances of
the error
* This leads us to believe the bug is probably endemic to the whole
cluster and not the result of a random hardware issue
* Offsets for the LogParser application are maintained in a Samza topic
called something like:
* __samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1
* Upon startup, checkpoints are refreshed from that topic, and we'll see
something in the log similar to:
* kafka.KafkaCheckpointManager [INFO] Read 6000 from topic:
__samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1. Current offset:
5999
* On more than one occasion, we have attempted to repair the job by
killing individual YARN containers and letting Samza retry them.
* This occasionally works. More frequently, it gets the partition
stuck in a loop trying to read from the __samza_checkpoint topic forever; we
suspect the retry loop above is writing checkpoint entries one or many times
per retry, causing the topic to fill up considerably (see the sketch after
this list for a way to gauge that growth).
* We are aware of only two workarounds:
* 1- Fully clearing out the data disks on the Kafka servers and
rebuilding the topics always seems to work, at least for a time.
* 2- We can use a setting like:
streams.Store_LogParser_RedactedMetadata_RedactedEnvironment.samza.reset.offset=true,
which causes Samza to ignore the checkpoint topic entirely and skip validating
any offset on the Store.
* This works, but requires us to do a lengthy metadata refresh
immediately after startup, which is less than ideal.
* We have also seen this on rare occasion on other, smaller Samza tiers
* In those cases, the common thread appears to be that the tier was left
down for a period of time longer than the Kafka retention timeout, and got
stuck in the loop upon restart. Attempts at reproducing it this way have been
unsuccessful
* It's worth adding that in those cases, the samza.reset.offset
parameter did not seem to have the intended effect when added to the
configuration
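To follow up on the checkpoint-topic suspicion above, here's a similar sketch
for watching whether the checkpoint topic's retained message count keeps
climbing while a job is stuck in the retry loop (again, the broker address is
a placeholder; Samza checkpoint topics have a single partition):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class CheckpointTopicGrowth {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka01:9092"); // placeholder broker
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            // Samza writes all checkpoints to partition 0 of this topic
            TopicPartition tp = new TopicPartition(
                    "__samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1", 0);

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                long begin = consumer.beginningOffsets(
                        Collections.singletonList(tp)).get(tp);
                long end = consumer.endOffsets(
                        Collections.singletonList(tp)).get(tp);
                // If this count climbs rapidly while a container is stuck in
                // the retry loop, checkpoints are being re-written per retry
                System.out.printf("checkpoint messages retained: %d%n", end - begin);
            }
        }
    }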
On another possibly-related note, one of our clusters periodically throws an
error like this, but usually recovers without intervention:
KafkaSystemAdmin [WARN] Exception while trying to get offset for
SystemStreamPartition [kafka,
Store_LogParser_RedactedMetadata_RedactedEnvironment, 32]:
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition.. Retrying.
* We've seen this error message crop up when we've had network issues in
our datacenter, but we're not aware of any such issues at the times when the
bigger problem occurs, so we're not sure whether the two are related.
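For completeness, here's a sketch of how leadership for the affected
partitions could be double-checked with the Java AdminClient while that
warning is firing (broker address is again a placeholder):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;

    public class LeaderCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka01:9092"); // placeholder broker

            try (AdminClient admin = AdminClient.create(props)) {
                String topic = "Store_LogParser_RedactedMetadata_RedactedEnvironment";
                TopicDescription desc = admin.describeTopics(
                        Collections.singletonList(topic)).all().get().get(topic);
                // NotLeaderForPartitionException generally means the client's
                // cached metadata disagrees with the leaders listed here
                desc.partitions().forEach(p ->
                        System.out.printf("partition %d: leader=%s isr=%s%n",
                                p.partition(), p.leader(), p.isr()));
            }
        }
    }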
Has anyone seen these errors before? Is there a known workaround or fix for them?
Thanks for your help!
Attached is a copy of the Samza configuration for the job in question, in case
it contains valuable information I may have missed.
-Will Schneider