Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22042#discussion_r209479417

    --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala ---
    @@ -251,32 +274,53 @@ private[kafka010] case class InternalKafkaConsumer(
           untilOffset: Long,
           pollTimeoutMs: Long,
           failOnDataLoss: Boolean): ConsumerRecord[Array[Byte], Array[Byte]] = {
    -    if (offset != nextOffsetInFetchedData || !fetchedData.hasNext()) {
    -      // This is the first fetch, or the last pre-fetched data has been drained.
    +    if (offset != nextOffsetInFetchedData) {
    +      // This is the first fetch, or the fetched data has been reset.
           // Seek to the offset because we may call seekToBeginning or seekToEnd before this.
           seek(offset)
           poll(pollTimeoutMs)
    +    } else if (!fetchedData.hasNext) {
    +      // The last pre-fetched data has been drained.
    +      if (offset < offsetAfterPoll) {
    +        // Offsets in [offset, offsetAfterPoll) are missing. We should skip them.
    +        resetFetchedData()
    +        throw new MissingOffsetException(offset, offsetAfterPoll)
    --- End diff --

    So `MissingOffsetException` is only used to signal that some offsets may be missing due to control messages and nothing else, and the higher-level function (i.e. `get`) just handles it by resetting the fetched offsets. Why not let this `fetchData` method handle the situation instead of creating a new exception just for this?
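    As a reading aid only (not part of the thread): a minimal, self-contained sketch of the alternative the comment is suggesting, where `fetchData` skips the missing range itself instead of throwing for `get` to catch and retry. All names here (`FetchDataSketch`, the fake `poll`, `controlOffsets`, the batch size) are illustrative stand-ins for the `InternalKafkaConsumer` fields and helpers visible in the diff, not the PR's actual code.

    ```scala
    object FetchDataSketch {
      private val controlOffsets = Set(5L, 6L) // offsets "used up" by control messages
      private val logEndOffset = 10L

      private var fetchedData: Iterator[(Long, String)] = Iterator.empty
      private var nextOffsetInFetchedData = -1L
      private var offsetAfterPoll = -1L

      // Stand-in for KafkaConsumer.poll: advances the position by up to 7 offsets
      // and returns records only for the non-control offsets in that range.
      private def poll(fromOffset: Long): Unit = {
        offsetAfterPoll = math.min(fromOffset + 7, logEndOffset)
        fetchedData = (fromOffset until offsetAfterPoll)
          .filterNot(controlOffsets)
          .map(o => o -> s"record-$o")
          .iterator
      }

      private def resetFetchedData(): Unit = {
        fetchedData = Iterator.empty
        nextOffsetInFetchedData = -1L
      }

      // The shape the comment asks about: the missing-offset gap is handled here,
      // in place, rather than signalled to the caller via an exception.
      def fetchData(offset: Long): Option[(Long, String)] = {
        if (offset != nextOffsetInFetchedData) {
          // First fetch, or the fetched data has been reset.
          poll(offset)
        } else if (!fetchedData.hasNext) {
          if (offset < offsetAfterPoll) {
            // Offsets in [offset, offsetAfterPoll) are missing: skip them locally.
            resetFetchedData()
            return fetchData(offsetAfterPoll)
          } else {
            poll(offset)
          }
        }
        if (fetchedData.hasNext) {
          val record = fetchedData.next()
          nextOffsetInFetchedData = record._1 + 1
          Some(record)
        } else {
          None
        }
      }

      def main(args: Array[String]): Unit = {
        // Read sequentially; the gap at offsets 5-6 is absorbed inside fetchData.
        var offset = 0L
        while (offset < logEndOffset) {
          fetchData(offset) match {
            case Some((o, v)) => println(s"$o -> $v"); offset = o + 1
            case None => offset = logEndOffset
          }
        }
      }
    }
    ```

    Whether this in-place handling is preferable to the exception-based flow depends on how much of the retry and `failOnDataLoss` bookkeeping `get` already owns in the PR.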