Thanks for your response, Michael.
Will try it out.

Regards
Sunita

On Wed, Aug 23, 2017 at 2:30 PM Michael Armbrust <mich...@databricks.com>
wrote:

> If you use structured streaming and the file sink, you can have a
> subsequent stream read using the file source. This will maintain
> exactly-once processing even if there are hiccups or failures.
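>
> Roughly something like this (untested; rawDf, the paths, the schema,
> and the enrichment step are all placeholders):
>
> import org.apache.spark.sql.{DataFrame, SparkSession}
> import org.apache.spark.sql.types._
>
> val spark = SparkSession.builder.appName("chained-streams").getOrCreate()
>
> // The file source needs the schema of the raw files up front.
> val rawSchema = new StructType()
>   .add("ts", TimestampType)
>   .add("payload", BinaryType)
>
> // Job 1: write the raw stream with the parquet file sink
> // (rawDf is whatever the upstream decoding produces).
> rawDf.writeStream
>   .format("parquet")
>   .option("path", "s3://bucket/raw/")
>   .option("checkpointLocation", "s3://bucket/checkpoints/raw/")
>   .start()
>
> // Job 2: a separate query reads the same directory with the file
> // source and applies the enrichment; each query gets its own
> // checkpoint location.
> def enrich(df: DataFrame): DataFrame = df   // placeholder enrichment
>
> spark.readStream
>   .schema(rawSchema)
>   .parquet("s3://bucket/raw/")
>   .transform(enrich)
>   .writeStream
>   .format("parquet")
>   .option("path", "s3://bucket/enriched/")
>   .option("checkpointLocation", "s3://bucket/checkpoints/enriched/")
>   .start()
>
> The file sink's commit log is what lets the downstream file source see
> only fully committed files, which is where the end-to-end exactly-once
> behavior comes from.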
>
> On Mon, Aug 21, 2017 at 2:02 PM, Sunita Arvind <sunitarv...@gmail.com>
> wrote:
>
>> Hello Spark Experts,
>>
>> I have a design question w.r.t. Spark Streaming. I have a streaming job
>> that consumes protocol-buffer-encoded real-time logs from an on-premises
>> Kafka cluster. My Spark application runs on EMR (AWS) and persists data
>> to S3. Before I persist, I need to strip the header and convert the
>> protobuf to Parquet (I use sparksql-scalapb to convert from protobuf to
>> spark.sql.Row). I need to persist the raw logs as-is. I could continue
>> the enrichment on the same dataframe after persisting the raw data;
>> however, in order to modularize, I am planning to have a separate job
>> that picks up the raw data and performs the enrichment on it. I am also
>> trying to avoid an all-in-one job, as the enrichments could get
>> project-specific while the raw data persistence stays customer/project
>> agnostic. The enriched data is allowed to have some latency (a few
>> minutes).
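>>
>> For concreteness, a simplified sketch of the raw-persistence step
>> (illustrative only; the broker, topic, bucket, and the header/protobuf
>> handling below are placeholders, and the real conversion goes through
>> sparksql-scalapb):
>>
>> import org.apache.spark.sql.SparkSession
>> import org.apache.spark.sql.functions._
>>
>> val spark = SparkSession.builder.appName("raw-persist").getOrCreate()
>> import spark.implicits._
>>
>> // Placeholder: drop a fixed-size header; the real job also decodes
>> // the protobuf payload into columns.
>> val stripHeader = udf((bytes: Array[Byte]) => bytes.drop(16))
>>
>> val raw = spark.readStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", "broker1:9092")
>>   .option("subscribe", "raw-logs")
>>   .load()
>>
>> raw.select(stripHeader($"value").as("payload"))
>>   .withColumn("date", date_format(current_timestamp(), "yyyyMMdd"))
>>   .writeStream
>>   .format("parquet")
>>   .option("path", "s3://my-bucket/raw/")
>>   .option("checkpointLocation", "s3://my-bucket/checkpoints/raw/")
>>   .partitionBy("date")
>>   .start()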
>>
>> My challenge is: after persisting the raw data, how do I chain the next
>> streaming job? The only approach I can think of is this: job 1 (raw
>> data) partitions on the current date (YYYYMMDD), and within the current
>> date, job 2 (the enrichment job) filters for records within 60s of the
>> current time and performs the enrichment on them in 60s batches.
>> Is this a good option? It seems error-prone: if either job gets delayed
>> due to bursts or any error/exception, this could lead to huge data
>> losses and non-deterministic behavior. What are the alternatives?
>>
>> Appreciate any guidance in this regard.
>>
>> regards
>> Sunita Koppar
>>
>
>
