Hello Spark Experts,

I have a design question regarding Spark Streaming. I have a streaming job
that consumes protocol-buffer-encoded real-time logs from an on-premise
Kafka cluster. My Spark application runs on AWS EMR and persists data to
S3. Before persisting, I need to strip the header and convert the protobuf
messages to Parquet (I use sparksql-scalapb to convert from protobuf to
spark.sql.Row). I need to persist the raw logs as-is. I could continue the
enrichment on the same DataFrame after persisting the raw data; however, in
order to modularize, I am planning a separate job that picks up the raw
data and performs the enrichment. I am also trying to avoid doing
everything in one job, since the enrichments could become project-specific
while the raw-data persistence stays customer/project agnostic. The
enriched data is allowed to have some latency (a few minutes).
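For reference, here is a stripped-down sketch of what job 1 looks like,
assuming Structured Streaming (Spark 2.2+). The broker address, topic,
bucket paths, the ingestTs column, and the LogRecord/decodeLogRecord names
are just placeholders; the real decoding strips the header and uses
sparksql-scalapb.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

case class LogRecord(ingestTs: java.sql.Timestamp, payload: Array[Byte])

val spark = SparkSession.builder.appName("raw-ingest").getOrCreate()
import spark.implicits._

// Stand-in for my header-strip + sparksql-scalapb conversion.
def decodeLogRecord(bytes: Array[Byte]): LogRecord =
  LogRecord(new java.sql.Timestamp(System.currentTimeMillis()), bytes)

// Kafka delivers the protobuf-encoded payload in the binary "value" column.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder
  .option("subscribe", "raw-logs")                    // placeholder topic
  .load()

val raw = kafkaDf
  .select(col("value").as[Array[Byte]])
  .map(decodeLogRecord _)
  .withColumn("dt", date_format(col("ingestTs"), "yyyyMMdd"))  // partition column

// Persist the raw logs as Parquet on S3, partitioned by date.
raw.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket/raw/")                           // placeholder
  .option("checkpointLocation", "s3://my-bucket/checkpoints/raw/")
  .partitionBy("dt")
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .start()
  .awaitTermination()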

My challenge is: after persisting the raw data, how do I chain the next
streaming job? The only way I can think of is this: job 1 (raw data)
partitions on the current date (YYYYMMDD), and within the current date,
job 2 (the enrichment job) filters for records within 60 seconds of the
current time and performs the enrichment in 60-second batches, roughly as
sketched below.

Is this a good option? It seems error-prone: if either job gets delayed due
to bursts or an error/exception, this could lead to large data losses and
non-deterministic behavior. What are other alternatives to this?
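To make the chaining idea concrete, this is roughly what I have in mind for
job 2, scheduled to run every 60 seconds. Again just a sketch: the paths,
the ingestTs column, and the enrichment step are placeholders.

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("enrichment").getOrCreate()

// Read the raw Parquet written by job 1, pruned to today's partition.
val today = LocalDate.now().format(DateTimeFormatter.ofPattern("yyyyMMdd"))
val raw = spark.read.parquet("s3://my-bucket/raw/").filter(col("dt") === today)

// Keep only records ingested within the last 60 seconds (ingestTs written by job 1).
val cutoff = new java.sql.Timestamp(System.currentTimeMillis() - 60 * 1000L)
val recent = raw.filter(col("ingestTs") >= lit(cutoff))

// Placeholder for the project-specific enrichment.
val enriched = recent.withColumn("enriched", lit(true))

enriched.write
  .mode("append")
  .partitionBy("dt")
  .parquet("s3://my-bucket/enriched/")  // placeholder path

The fragility I am worried about is that cutoff filter: if either job falls
behind because of a burst or a failure, records land outside the 60-second
window and the enrichment job never sees them.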

Appreciate any guidance in this regard.

regards
Sunita Koppar
