Issue: I am trying to periodically process 5000+ gzipped JSON files from S3
with a Structured Streaming job.
Here are the key steps:

  - Read the JSON schema and broadcast it to the executors.
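    Roughly like this (a sketch only; the app name and sample path are
    illustrative, not the real ones):

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.StructType;

    SparkSession sparkSession = SparkSession.builder()
        .appName("gz-json-stream")                  // illustrative app name
        .getOrCreate();

    // Infer the schema once from a small sample file, then broadcast it
    // so every executor reuses the same StructType.
    StructType jsonSchema = sparkSession.read()
        .json("s3a://my-bucket/sample/one.json.gz") // illustrative path
        .schema();
    Broadcast<StructType> schemaBroadcast =
        JavaSparkContext.fromSparkContext(sparkSession.sparkContext())
            .broadcast(jsonSchema);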

  - Read the stream:

    Dataset<Row> inputDS = sparkSession.readStream()
        .schema(jsonSchema)
        .option("multiLine", true)
        .option("mode", "PERMISSIVE")
        .json(inputPath + "/*");

  - Process each file in a map (an illustrative sketch of the lambda
    follows below):

    Dataset<String> ds = inputDS.map(x -> { ... }, Encoders.STRING());
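    The { ... } body above is the real per-file logic; purely to show the
    shape, assuming inputDS is a Dataset<Row> and each row maps to one
    output line:

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;

    // The explicit MapFunction cast selects the Java overload of
    // Dataset.map; the transformation shown is a stand-in, not the
    // actual processing.
    Dataset<String> ds = inputDS.map(
        (MapFunction<Row, String>) row -> row.mkString(",").toUpperCase(),
        Encoders.STRING());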

  - Write the output to S3:

    StreamingQuery query = ds
        .coalesce(1)
        .writeStream()
        .outputMode("append")
        .format("csv")
        ...
        .start();


maxFilesPerTrigger is set to 500, so I was hoping the stream would pick up
only that many files per trigger. Why are we getting an OOM? Whenever there
are more than 3500 files, the job crashes with an OOM.
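For reference, maxFilesPerTrigger is not shown in the read snippet above; it
is set as an option on the same file source, like this:

    Dataset<Row> inputDS = sparkSession.readStream()
        .schema(jsonSchema)
        .option("multiLine", true)
        .option("mode", "PERMISSIVE")
        .option("maxFilesPerTrigger", 500) // cap on files per micro-batch
        .json(inputPath + "/*");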
