Issue: I am trying to periodically process 5000+ gzipped JSON files from S3 using Structured Streaming. Here are the key steps:

- Read the JSON schema and broadcast it to the executors.
- Read the stream:

  ```java
  Dataset<Row> inputDS = sparkSession.readStream()
      .format("text")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("multiLine", true)
      .schema(jsonSchema)
      .option("mode", "PERMISSIVE")
      .json(inputPath + "/*");
  ```

- Process each file in a map:

  ```java
  Dataset<String> ds = inputDS.map(x -> { ... }, Encoders.STRING());
  ```

- Write the output to S3:

  ```java
  StreamingQuery query = ds
      .coalesce(1)
      .writeStream()
      .outputMode("append")
      .format("csv")
      ...
      .start();
  ```

`maxFilesPerTrigger` is set to 500, so I was expecting the stream to pick up only that many files per micro-batch. Why are we getting OOM? If we have more than 3500 files in the input path, the system crashes with OOM.
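For reference, `maxFilesPerTrigger` is a reader option on the streaming file source. A minimal sketch of where it would go in the pipeline above (assuming the same `sparkSession`, `jsonSchema`, and `inputPath` as in my code; this is illustrative only, not my exact job):

```java
// Sketch: maxFilesPerTrigger caps how many NEW files each micro-batch
// reads, but the file source still lists the entire input directory on
// every trigger and tracks all seen files in memory on the driver --
// with 5000+ files that listing/state can itself be a memory cost.
Dataset<Row> inputDS = sparkSession.readStream()
    .schema(jsonSchema)                    // a schema is required for streaming file sources
    .option("mode", "PERMISSIVE")
    .option("maxFilesPerTrigger", 500)     // cap files picked up per micro-batch
    .json(inputPath + "/*");
```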