Multi-schema data pipeline

Joarley Wanzeler de Moraes Fri, 10 Sep 2021 08:29:27 -0700

We want to create a Spark-based streaming data pipeline that consumes from a 
source (e.g. Kinesis), apply some basic transformations, and write the data to 
a file-based sink (e.g. s3). We have thousands of different event types coming 
in and the transformations would take place on a set of common fields. Once the 
events are transformed, they need to be split by writing them to different 
output locations according to the event type. This pipeline is described in the 
figure below:


https://ibb.co/8jq3MR6

Goals:

- To infer schema safely in order to apply transformations based on the merged 
schema. The assumption is that the event types are compatible with each other 
(i.e. without overlapping schema structure) but the schema of any of them can 
change at unpredictable times. The pipeline should handle it dynamically.
- To split the output after the transformations while keeping the original 
individual schema?


What we considered:
- Schema inference seems to work fine on sample data. But is it safe for 
production usecases and for a large number of different event types?
- Simply using `partitionBy("name")` while writing out is not enough because it 
would use the merged schema.




---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Multi-schema data pipeline

Reply via email to