Issue: I am trying to periodically process 5000+ gzipped JSON files from S3
using Structured Streaming code.
Here are the key steps:
- Read the JSON schema and broadcast it to the executors (sketch below).
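(For context, a minimal sketch of how this step could look; schemaSamplePath
and the variable names are placeholders, not the exact code:)

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;
    import org.apache.spark.sql.types.StructType;

    // Infer the schema once on the driver from a small sample file...
    StructType jsonSchema = sparkSession.read()
        .option("multiLine", true)
        .json(schemaSamplePath)
        .schema();

    // ...then broadcast it so every executor reuses the same StructType.
    Broadcast<StructType> schemaBroadcast =
        new JavaSparkContext(sparkSession.sparkContext()).broadcast(jsonSchema);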
- Read the stream:

    Dataset<Row> inputDS = sparkSession.readStream()
        .schema(jsonSchema)
        .option("multiLine", true)
        .option("mode", "PERMISSIVE")
        .json(inputPath + "/*");
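(The maxFilesPerTrigger cap mentioned at the end is a file-source option, so
it is set on this same reader; the same snippet with the option included:)

    // maxFilesPerTrigger caps how many new files each micro-batch picks up.
    Dataset<Row> inputDS = sparkSession.readStream()
        .schema(jsonSchema)
        .option("multiLine", true)
        .option("mode", "PERMISSIVE")
        .option("maxFilesPerTrigger", 500)
        .json(inputPath + "/*");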
- Process each file in a map:

    Dataset<String> ds = inputDS.map(
        (MapFunction<Row, String>) x -> { ... },
        Encoders.STRING());
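(The lambda body is elided above; purely as a hypothetical illustration of
the shape, with placeholder logic:)

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;

    // Hypothetical body: the real transformation is more involved.
    Dataset<String> ds = inputDS.map(
        (MapFunction<Row, String>) row -> row.toString(),
        Encoders.STRING());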
- Write the output to S3:

    StreamingQuery query = ds.coalesce(1)
        .writeStream()
        .outputMode("append")
        .format("csv")
        ...
        .start();
maxFilesPerTrigger is set to 500, so I was expecting each micro-batch to pick
up only that many files. Why are we getting OOM? Whenever there are more than
3500 files in the input path, the job crashes with an OutOfMemoryError.
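(In case it helps diagnose this, one way to confirm how many rows each
micro-batch actually ingested, using the public progress API:)

    import org.apache.spark.sql.streaming.StreamingQueryProgress;

    // lastProgress() returns null until the first batch completes.
    StreamingQueryProgress progress = query.lastProgress();
    if (progress != null) {
        System.out.println(progress.json()); // includes numInputRows per trigger
    }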