We want to create a Spark-based streaming data pipeline that consumes from a source (e.g. Kinesis), apply some basic transformations, and write the data to a file-based sink (e.g. s3). We have thousands of different event types coming in and the transformations would take place on a set of common fields. Once the events are transformed, they need to be split by writing them to different output locations according to the event type. This pipeline is described in the figure below:
https://ibb.co/8jq3MR6 Goals: - To infer schema safely in order to apply transformations based on the merged schema. The assumption is that the event types are compatible with each other (i.e. without overlapping schema structure) but the schema of any of them can change at unpredictable times. The pipeline should handle it dynamically. - To split the output after the transformations while keeping the original individual schema? What we considered: - Schema inference seems to work fine on sample data. But is it safe for production usecases and for a large number of different event types? - Simply using `partitionBy("name")` while writing out is not enough because it would use the merged schema. --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org