Thanks for your response, Michael. Will try it out.

Regards,
Sunita
On Wed, Aug 23, 2017 at 2:30 PM Michael Armbrust <mich...@databricks.com> wrote:

> If you use structured streaming and the file sink, you can have a
> subsequent stream read using the file source. This will maintain
> exactly-once processing even if there are hiccups or failures.
>
> On Mon, Aug 21, 2017 at 2:02 PM, Sunita Arvind <sunitarv...@gmail.com> wrote:
>
>> Hello Spark Experts,
>>
>> I have a design question w.r.t. Spark Streaming. I have a streaming job
>> that consumes protocol-buffer-encoded real-time logs from an on-premise
>> Kafka cluster. My Spark application runs on EMR (AWS) and persists data
>> to S3. Before persisting, I need to strip the header and convert the
>> protocol buffers to Parquet (I use sparksql-scalapb to convert from
>> protobuf to spark.sql.Row). I need to persist the raw logs as-is. I could
>> continue the enrichment on the same dataframe after persisting the raw
>> data; however, to modularize, I am planning to have a separate job that
>> picks up the raw data and performs enrichment on it. I am also trying to
>> avoid an all-in-one job, as the enrichments could become project specific
>> while the raw-data persistence stays customer/project agnostic. The
>> enriched data is allowed to have some latency (a few minutes).
>>
>> My challenge is: after persisting the raw data, how do I chain the next
>> streaming job? The only way I can think of is for job 1 (raw data) to
>> partition on the current date (YYYYMMDD), and within the current date,
>> job 2 (the enrichment job) filters for records within 60s of the current
>> time and performs enrichment on them in 60s batches.
>> Is this a good option? It seems error prone. If either of the jobs gets
>> delayed due to bursts or any error/exception, this could lead to huge
>> data losses and non-deterministic behavior. What are other alternatives
>> to this?
>>
>> Appreciate any guidance in this regard.
>>
>> regards
>> Sunita Koppar
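For reference, below is a minimal Scala sketch of the two-job pattern Michael describes: job 1 writes raw Kafka data to S3 with the file sink (partitioned by date), and job 2 reads those files back with the file source. The S3 paths, broker address, and topic name are hypothetical placeholders, and the header stripping / sparksql-scalapb conversion is elided as a pass-through; this is an illustration of the chaining, not a drop-in implementation.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object RawAndEnrichSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("raw-and-enrich-sketch").getOrCreate()
    import spark.implicits._

    // ---- Job 1: Kafka -> raw Parquet on S3 via the file sink ----
    // The checkpoint plus the file sink's transactional log is what gives
    // the exactly-once behavior Michael refers to.
    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("subscribe", "raw-logs")                   // placeholder
      .load()

    // Header stripping and protobuf -> Row conversion (sparksql-scalapb)
    // would happen here; kept as a pass-through of the binary payload.
    val raw = kafkaStream
      .select($"value", date_format($"timestamp", "yyyyMMdd").as("date"))

    raw.writeStream
      .format("parquet")
      .option("path", "s3://bucket/raw")                   // placeholder
      .option("checkpointLocation", "s3://bucket/chk/raw") // placeholder
      .partitionBy("date")
      .start()

    // ---- Job 2: file source reads the raw Parquet and enriches it ----
    // In practice this would be a separate application; it is shown inline
    // only to illustrate the chaining. The file source needs an explicit schema.
    val rawSchema = new StructType()
      .add("value", BinaryType)
      .add("date", StringType)

    val rawFiles = spark.readStream
      .schema(rawSchema)
      .parquet("s3://bucket/raw")                          // placeholder

    val enriched = rawFiles // project-specific enrichment goes here

    enriched.writeStream
      .format("parquet")
      .option("path", "s3://bucket/enriched")                   // placeholder
      .option("checkpointLocation", "s3://bucket/chk/enriched") // placeholder
      .start()

    spark.streams.awaitAnyTermination()
  }
}

With this setup the file source in job 2 picks up new files as they are committed by job 1, so there is no need for the manual "filter records within 60s of current time" step, and delays in either job only add latency rather than causing data loss.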