What is a good way to support non-homogeneous input data in Structured Streaming?

Let me explain the use case we are trying to solve. We are reading data from 3 
topics in Kafka. All the topics have data in Avro format, each with its own 
schema. All 3 Avro schemas have 2 columns: EventHeader and EventPayload. 
EventHeader has the same structure in all 3 topics; however, each topic has a 
different schema for EventPayload. The EventHeader has a field called eventType 
that dictates what the structure of the EventPayload will be
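Sketched in plain Python (every concrete name below other than EventHeader, EventPayload and eventType is made up for illustration), the event shape is:

```python
# Minimal sketch of the event shape: a header that is identical across
# topics, plus a payload whose fields depend on eventType. All concrete
# field names below are hypothetical.

def make_event(event_type, payload):
    return {
        "EventHeader": {"eventType": event_type},
        "EventPayload": payload,  # structure varies with eventType
    }

order_event = make_event("OrderPlaced", {"orderId": 1, "amount": 9.99})
click_event = make_event("PageClicked", {"url": "/home", "session": "s1"})

# Both events share the header structure; the payloads differ.
assert set(order_event["EventHeader"]) == set(click_event["EventHeader"])
```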

Now, what we want to do is run some streaming analytics across all 3 topics. 
So, at a high level, we want to

a)       read the data from all 3 topics in Kafka,

b)       transform it into a single streaming DataFrame that contains data from 
all 3 topics. This DataFrame should have the superset of the fields from all 3 
schemas,

c)       do some aggregation/windowing on this DataFrame,

d)       write the results to a database

The issue is how exactly we do b). Let’s say EventPayload in Topic A has fields 
x, y and z, in Topic B has p, q and r, and in Topic C has x, y and p; we need a 
DataFrame that contains eventID, x, y, z, p, q, r. There are different 
approaches we are playing around with. Note that one of the goals we are trying 
to achieve is to keep the parsing logic for each topic separate.

Option 1

1)       Read the Avro data from each Kafka topic as a byte array into its own 
DataFrame. So, we have 3 DataFrames, each with 1 column of binary data

2)       Parse the Avro payload in each DataFrame. Now we have 3 DataFrames, 
all with 2 fields: EventHeader and EventPayload. The StructType of EventPayload 
is different in each of the 3 DataFrames

3)       Convert the 3 DataFrames into the standard format. Now the output is 3 
DataFrames, all with the same structure: eventID, x, y, z, p, q, r

4)       Union these 3 DataFrames

I am not sure whether a union across different streaming data sources will 
work. Will it?
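In plain-Python terms, step 4) is just concatenation once all 3 frames share the standard structure (rows and names below are illustrative and abbreviated):

```python
# Plain-Python sketch of step 4): once the 3 frames share the standard
# structure, the union is just a concatenation. Rows are abbreviated
# here; real rows would carry the full superset of columns.

def union_frames(*frames):
    return [row for frame in frames for row in frame]

combined = union_frames(
    [{"eventID": "a1", "x": 1}],  # standardized rows from Topic A
    [{"eventID": "b1", "p": 2}],  # from Topic B
    [{"eventID": "c1", "x": 3}],  # from Topic C
)
```

For what it's worth, Spark's unionByName (and, in newer versions, its allowMissingColumns option) might even collapse steps 3) and 4) into one, but I have not verified that on streaming DataFrames.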

Option 2

1)       Read the Avro data from all 3 Kafka topics as byte arrays into a 
single DataFrame. So, we have 1 DataFrame with 1 binary column

2)       Parse the Avro payload and convert EventPayload into some format that 
can store heterogeneous data, since different rows have different schemas for 
EventPayload. For example, we could convert EventPayload into JSON and store it 
in a String column, or convert it to Avro and store it in a binary column

3)       Write a converter that re-parses the EventPayload column and creates a 
DataFrame with the standard structure
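Steps 2) and 3) of this option can be sketched with JSON strings in plain Python (in Spark the rough analogues would be to_json/from_json, or re-encoded Avro bytes):

```python
import json

# Plain-Python sketch of Option 2, steps 2) and 3): serialize the
# heterogeneous EventPayload to a JSON string column, then re-parse it
# into the standard superset structure. Field names are from the
# example above.

SUPERSET = ["eventID", "x", "y", "z", "p", "q", "r"]

def encode_payload(payload):
    """Step 2: store the per-row payload as a JSON string."""
    return json.dumps(payload)

def reparse(event_id, payload_json):
    """Step 3: rebuild the standard structure, None for absent fields."""
    payload = json.loads(payload_json)
    row = {field: payload.get(field) for field in SUPERSET}
    row["eventID"] = event_id
    return row

# A Topic B row carries p, q, r; x, y, z come out as None.
row = reparse("e1", encode_payload({"p": 10, "q": 20, "r": 30}))
```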

Is there a better way of doing this?
