What is a good way to support non-homogeneous input data in Structured Streaming?
Let me explain the use case we are trying to solve. We are reading data from 3 topics in Kafka. All the topics carry data in Avro format, each with its own schema. All 3 Avro schemas have 2 columns: EventHeader and EventPayload. EventHeader has the same structure in all 3 topics; however, each topic has a different schema for EventPayload. EventHeader has a field called eventType that dictates the structure of the EventPayload.

What we want to do is run some streaming analytics across all 3 topics. At a high level, we want to:
a) read the data from all 3 topics from Kafka,
b) transform it into a single streaming DataFrame that contains the data from all 3 topics; this DataFrame should have a superset of the fields from all 3 schemas,
c) do some aggregation/windowing on this DataFrame,
d) write to a database.

The issue is how exactly to do b). Say EventPayload in Topic A has fields x, y and z; in Topic B it has p, q and r; and in Topic C it has x, y and p. We need a DataFrame that contains eventID, x, y, z, p, q, r. There are different approaches we are playing around with. Note that one of the goals we are trying to achieve is to separate out the parsing logic.

Option 1
1) Read the Avro data from each Kafka topic as a byte array into its own DataFrame. So we have 3 DataFrames, each with 1 column of binary data.
2) Parse the Avro payload in each DataFrame. Now we have 3 DataFrames, all with 2 fields, EventHeader and EventPayload. The StructType of EventPayload is different in each of the 3 DataFrames.
3) Convert the 3 DataFrames into the standard format. The output is 3 DataFrames, all with the same structure: EventID, x, y, z, p, q, r.
4) Union these 3 DataFrames.

I am not sure if a union across different streaming data sources will work. Will it?

Option 2
1) Read the Avro data from all 3 Kafka topics as byte arrays into a single DataFrame.
So we have 1 DataFrame with 1 binary column.
2) Parse the Avro payload and convert EventPayload into some format that can store heterogeneous data, since different rows have different schemas for EventPayload. For example, we could convert EventPayload to JSON and store it in a String column, or convert it to Avro and store it in a binary column.
3) Write a converter that re-parses the EventPayload column and creates a DataFrame with the standard structure.

Is there a better way of doing this?
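To make the Option 1 normalization step concrete, here is a minimal, framework-agnostic sketch in plain Python. Rows are modeled as dicts and `to_standard` is an illustrative helper, not an existing API; in Spark the same idea would be a per-topic select that adds null literals for the fields that topic's payload lacks, so all 3 DataFrames end up with an identical schema before the union.

```python
# Superset of payload fields across Topic A (x, y, z), Topic B (p, q, r)
# and Topic C (x, y, p), plus the event identifier from the header.
STANDARD_FIELDS = ["eventID", "x", "y", "z", "p", "q", "r"]

def to_standard(event_id, payload):
    """Project one parsed event onto the standard schema,
    filling the fields this topic does not carry with None."""
    row = dict.fromkeys(STANDARD_FIELDS)  # every field starts as None
    row["eventID"] = event_id
    for field, value in payload.items():
        if field in row:
            row[field] = value
    return row

# One event from each topic:
row_a = to_standard("e1", {"x": 1, "y": 2, "z": 3})
row_b = to_standard("e2", {"p": 4, "q": 5, "r": 6})
row_c = to_standard("e3", {"x": 7, "y": 8, "p": 9})

# After normalization all rows share the same columns, so the union in
# step 4 is just a concatenation of same-shaped rows.
unioned = [row_a, row_b, row_c]
```

Because the per-topic mapping lives entirely inside `to_standard` (or its Spark select equivalent), the parsing logic stays separated from the downstream aggregation, which matches the stated goal.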
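Option 2's converter (steps 2 and 3) can be sketched the same way: the single stream carries EventPayload as a JSON string, and the converter re-parses it by dispatching on the header's eventType. The parser table, the eventType values, and the `convert` helper below are illustrative assumptions, not an existing API.

```python
import json

STANDARD_FIELDS = ["eventID", "x", "y", "z", "p", "q", "r"]

# Per-eventType parsers: each one knows which fields its payload carries.
PARSERS = {
    "typeA": lambda d: {k: d.get(k) for k in ("x", "y", "z")},
    "typeB": lambda d: {k: d.get(k) for k in ("p", "q", "r")},
    "typeC": lambda d: {k: d.get(k) for k in ("x", "y", "p")},
}

def convert(header, payload_json):
    """Re-parse the serialized EventPayload and emit a row
    in the standard structure."""
    parsed = PARSERS[header["eventType"]](json.loads(payload_json))
    row = dict.fromkeys(STANDARD_FIELDS)  # missing fields stay None
    row["eventID"] = header["eventID"]
    row.update(parsed)
    return row

row = convert({"eventType": "typeB", "eventID": "e42"},
              json.dumps({"p": 1, "q": 2, "r": 3}))
```

The trade-off versus Option 1 is that the payload is parsed twice (Avro to the intermediate string, then string to the standard row), in exchange for keeping everything in one stream from the start.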
