Hi, I'm looking into the Parquet format support for the File source in Structured Streaming. The docs mention the use of the option 'mergeSchema' to merge the schemas of the part files found.[1]
What would be the practical use of that in a streaming context? In its batch counterpart, `mergeSchema` infers the schema as the superset of the schemas of the part files found. When using the File source + Parquet format in streaming mode, we must provide a schema to the readStream.schema(...) builder, and that schema is fixed for the duration of the stream. My current understanding is that:
- Files containing a subset of the fields declared in the schema will render null values for the missing fields.
- For files containing a superset of the fields, the additional data fields will be lost.
- Files not matching the schema set on the streaming source will render all fields null for each record in the file.
Is the 'mergeSchema' option playing another role? From the user's perspective, they may think that this option would help their job cope with schema evolution at runtime, but that does not seem to be the case. What is the use of this option?
-kr, Gerard.
[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala#L376
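To make the understanding above concrete, here is a minimal sketch (plain Python, no Spark dependency; the function names `merge_schemas` and `project_record` are hypothetical, chosen for illustration) modeling the two behaviors in question: the batch-mode `mergeSchema` semantics of taking the superset of the part-file schemas, versus the streaming-mode behavior of projecting every record onto one fixed schema, with missing fields rendered null and extra fields dropped:

```python
def merge_schemas(schemas: list[list[str]]) -> list[str]:
    """Batch-mode analogue of mergeSchema: the union (superset) of all
    fields seen across part files, preserving first-seen order."""
    merged: list[str] = []
    for schema in schemas:
        for field in schema:
            if field not in merged:
                merged.append(field)
    return merged


def project_record(record: dict, schema: list[str]) -> dict:
    """Streaming-mode analogue of a fixed schema:
    - fields declared in the schema but missing from the record become None (null)
    - fields present in the record but absent from the schema are lost
    """
    return {field: record.get(field) for field in schema}


# Batch: two part files with different fields merge into a superset schema.
print(merge_schemas([["id", "name"], ["id", "ts"]]))
# → ['id', 'name', 'ts']

fixed_schema = ["id", "name", "ts"]
# Streaming, file with a subset of the fields: 'ts' renders null.
print(project_record({"id": 1, "name": "a"}, fixed_schema))
# → {'id': 1, 'name': 'a', 'ts': None}
# Streaming, file with a superset of the fields: 'extra' is dropped.
print(project_record({"id": 2, "name": "b", "ts": 99, "extra": "x"}, fixed_schema))
# → {'id': 2, 'name': 'b', 'ts': 99}
```

This is only a model of the semantics as described above, not of Spark's actual implementation.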