I have written a simple Spark Structured Streaming app to move data from Kafka 
to S3. I found that, in order to support the exactly-once guarantee, Spark 
creates a _spark_metadata folder, which keeps growing because the streaming app 
is supposed to run forever. After the app has run for a long time, the metadata 
folder becomes so large that we start getting OOM errors. The only way we have 
found to resolve the OOM is to delete the checkpoint and metadata folders, which 
means losing valuable customer data.
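
For context, here is a minimal sketch of the kind of query I mean (broker, topic, and bucket names are placeholders): the Parquet file sink writes a _spark_metadata directory inside the output path to track committed files for exactly-once semantics, and that directory grows with every micro-batch.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-s3")
      .getOrCreate()

    // Read a Kafka topic as a stream (placeholder broker/topic names).
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // Write to S3 as Parquet. The file sink maintains a _spark_metadata
    // directory under the output path for exactly-once guarantees; it is
    // this directory that grows without bound over the life of the query.
    val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/events/")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
      .start()

    query.awaitTermination()
  }
}
```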

There are open Spark JIRAs tracking this: SPARK-24295, SPARK-29995, and 
SPARK-30462. Since Spark Streaming was not broken like this, is Spark Streaming 
a better choice?
