Background: I have written a simple Spark Structured Streaming app to move
data from Kafka to S3. I found that, to support its exactly-once guarantee,
Spark creates a _spark_metadata folder, which ends up growing too large. When
the streaming app runs for a long time, the metadata folder grows so big that
we start getting OOM errors. I want to get rid of the metadata and checkpoint
folders of Spark Structured Streaming and manage offsets myself.
How we managed offsets in Spark Streaming: I used val offsetRanges =
rdd.asInstanceOf[HasOffsetRanges].offsetRanges to get the offsets for each
batch. But I want to know how to get the offsets and other metadata so we can
manage checkpointing ourselves in Spark Structured Streaming. Do you have a
sample program that implements checkpointing?
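For reference, this is roughly the pattern I mean from the DStream API (a minimal sketch using the spark-streaming-kafka-0-10 integration; the broker address, group id, and topic name are placeholders):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.HasOffsetRanges

object DStreamOffsetsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-s3")
    val ssc  = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",          // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "my-group",            // placeholder
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("my-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      // The cast exposes the Kafka offset ranges backing this micro-batch.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ... write the batch to S3 here ...

      // Record offsets in an external store only after the write succeeds,
      // so a restart resumes from the last successfully written batch.
      offsetRanges.foreach(o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```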
How should we manage offsets in Spark Structured Streaming?
Looking at this JIRA, https://issues-test.apache.org/jira/browse/SPARK-18258,
it looks like offsets are not provided. How should we go about this?
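The closest equivalent I have found so far is observing per-batch offsets through a StreamingQueryListener, where each source reports its start/end offsets as JSON in the query progress. A minimal sketch (the app name is a placeholder, and persisting the offsets to an external store is left as a comment):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder.appName("offset-listener").getOrCreate()

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // For the Kafka source, startOffset/endOffset are JSON strings like
    // {"my-topic":{"0":12345,"1":67890}} mapping partition -> offset.
    event.progress.sources.foreach { s =>
      // Persist s.endOffset to your own store here to track progress.
      println(s"batch=${event.progress.batchId} end=${s.endOffset}")
    }
  }
})
```

This only observes progress after the fact; to resume from externally stored offsets, the Kafka source's startingOffsets option accepts the same JSON format, but whether that alone gives exactly-once semantics without Spark's checkpoint folder is exactly what I'm unsure about.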