I have written a simple Spark Structured Streaming app to move data from Kafka to S3. I found that, to support its exactly-once guarantee, Spark creates a `_spark_metadata` folder next to the output, and this folder grows without bound because the streaming app is supposed to run forever. After the app has been running for a long time, the metadata folder becomes so large that we start getting OOM errors. The only way we have found to recover is to delete the checkpoint and metadata folders, which loses valuable customer data.
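For reference, the app is essentially the standard Kafka-to-file-sink pattern. This is a minimal sketch, not my exact code; the broker address, topic name, and S3 paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object KafkaToS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

    // Read from Kafka (placeholder broker and topic)
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // The file sink maintains a _spark_metadata folder under the output path
    // to record committed files; that log grows for the lifetime of the query.
    df.writeStream
      .format("parquet")
      .option("path", "s3a://bucket/output")
      .option("checkpointLocation", "s3a://bucket/checkpoint")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()
      .awaitTermination()
  }
}
```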
There are open Spark JIRAs tracking this problem: SPARK-24295, SPARK-29995, and SPARK-30462. Since the older Spark Streaming (DStreams) API did not have this issue, is Spark Streaming a better choice?