Background: I have written a simple Spark Structured Streaming app to move
data from Kafka to S3. I found that, to support its exactly-once guarantee,
Spark creates a _spark_metadata folder, which keeps growing; when the
streaming app runs for a long time, the metadata folder gets so big that we
start getting OOM errors. I want to get rid of the metadata and checkpoint
folders of Spark Structured Streaming and manage offsets myself.
How we managed offsets in Spark Streaming: I used val offsetRanges =
rdd.asInstanceOf[HasOffsetRanges].offsetRanges to get the offsets in the old
Spark Streaming (DStream) API. But I want to know how to get the offsets and
other metadata so we can manage checkpointing ourselves in Spark Structured
Streaming. Do you have any sample program that implements checkpointing?
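
For reference, a minimal sketch of that DStream-era pattern (the broker
address, topic name, and group id below are placeholders; error handling is
omitted):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object DStreamOffsetDemo {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("offset-demo"), Seconds(10))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",              // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "offset-demo",              // placeholder
      "enable.auto.commit" -> (false: java.lang.Boolean))
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Grab the offset ranges before any transformation of the RDD.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... write the batch to S3 here ...
      // Commit offsets back to Kafka only after the write succeeds.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
    ssc.start()
    ssc.awaitTermination()
  }
}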
How do we manage offsets in Spark Structured Streaming?
Looking at this JIRA, https://issues-test.apache.org/jira/browse/SPARK-18258,
it looks like the offsets are not provided. How should we go about this?
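
One way to get at the offsets in Structured Streaming (a sketch, not a
complete exactly-once solution): register a StreamingQueryListener and
persist the per-batch offsets it reports, then on restart feed the saved end
offsets back through the Kafka source's startingOffsets option. The
OffsetLogger name and the println sink below are illustrative only.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Records the Kafka offsets covered by each completed micro-batch.
// startOffset/endOffset arrive as JSON strings, e.g. {"mytopic":{"0":1234}}.
class OffsetLogger extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    event.progress.sources.foreach { src =>
      // Replace println with a write to your own offset store (DB, S3, ...).
      println(s"batch=${event.progress.batchId} " +
        s"start=${src.startOffset} end=${src.endOffset}")
    }
  }
}

// Register on the session that runs the query:
// spark.streams.addListener(new OffsetLogger)

On restart you would read the last committed endOffset JSON from your store
and pass it to the Kafka source via .option("startingOffsets", json).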
