Hi, I'm looking into the Parquet format support for the File source in Structured Streaming. The docs mention the use of the option 'mergeSchema' to merge the schemas of the part files found.[1]
What would be the practical use of that in a streaming context? In its batch counterpart, `mergeSchema` infers the schema as the superset of the schemas of the part files found. When using the File source + Parquet format in streaming mode, we must provide a schema to the readStream.schema(...) builder, and that schema is fixed for the duration of the stream. My current understanding is that:
- Files containing a subset of the fields declared in the schema will render null values for the missing fields.
- For files containing a superset of the fields, the additional data fields will be lost.
- Files not matching the schema set on the streaming source will render all fields null for each record in the file.
Is the 'mergeSchema' option playing another role? From the user's perspective, they may think that this option would help their job cope with schema evolution at runtime, but that does not seem to be the case. What is the use of this option?
-kr, Gerard.
[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala#L376
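To make the understanding above concrete, here is a minimal sketch (plain Python, no Spark dependency; the function names `merge_schemas` and `project_record` are hypothetical, chosen for illustration) modeling the two behaviors in question: the batch-mode `mergeSchema` semantics of taking the superset of the part-file schemas, versus the streaming-mode behavior of projecting every record onto one fixed schema, with missing fields rendered null and extra fields dropped:

```python
def merge_schemas(schemas: list[list[str]]) -> list[str]:
    """Batch-mode analogue of mergeSchema: the union (superset) of all
    fields seen across part files, preserving first-seen order."""
    merged: list[str] = []
    for schema in schemas:
        for field in schema:
            if field not in merged:
                merged.append(field)
    return merged


def project_record(record: dict, schema: list[str]) -> dict:
    """Streaming-mode analogue of a fixed schema:
    - fields declared in the schema but missing from the record become None (null)
    - fields present in the record but absent from the schema are lost
    """
    return {field: record.get(field) for field in schema}


# Batch: two part files with different fields merge into a superset schema.
print(merge_schemas([["id", "name"], ["id", "ts"]]))
# → ['id', 'name', 'ts']

fixed_schema = ["id", "name", "ts"]
# Streaming, file with a subset of the fields: 'ts' renders null.
print(project_record({"id": 1, "name": "a"}, fixed_schema))
# → {'id': 1, 'name': 'a', 'ts': None}
# Streaming, file with a superset of the fields: 'extra' is dropped.
print(project_record({"id": 2, "name": "b", "ts": 99, "extra": "x"}, fixed_schema))
# → {'id': 2, 'name': 'b', 'ts': 99}
```

This is only a model of the semantics as described above, not of Spark's actual implementation.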