Thanks for your response, Praneeth. We did consider Kafka; cost was the only holdback, since we might need a larger cluster, and the existing cluster is on premise while my app is in the cloud, so the same cluster cannot be used. But I agree it does sound like a good alternative.
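
In case it helps make the comparison concrete, the chaining I understood from your suggestion would look roughly like the sketch below, using structured streaming's Kafka source and sink. The broker address, topic names, checkpoint path and the decoding step are only placeholders, not our actual setup.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("raw-ingest").getOrCreate()

    // Job 1: consume the raw protobuf logs and republish the decoded
    // records to an intermediate topic that the enrichment job subscribes to.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")      // placeholder
      .option("subscribe", "raw-logs")                       // placeholder
      .load()

    val decoded = raw  // ... strip header / decode protobuf to columns here ...

    // The Kafka sink expects a string or binary "value" column.
    decoded.selectExpr("to_json(struct(*)) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")      // placeholder
      .option("topic", "decoded-logs")                       // placeholder
      .option("checkpointLocation", "s3://bucket/chk/raw")   // placeholder
      .start()

    // Job 2 (a separate application) subscribes to "decoded-logs" with its
    // own checkpoint location and runs the project-specific enrichment there.

One nice property of this is that each job keeps its own checkpoint, so a delay or failure in the enrichment job does not affect raw persistence.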
Regards
Sunita

On Thu, Sep 7, 2017 at 11:24 PM Praneeth Gayam <praneeth.ga...@gmail.com>
wrote:

> With file stream you will have to deal with the following:
>
> 1. The file(s) must not be changed once created. So if the files are
> being continuously appended, the new data will not be read. Refer
> <https://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources>
> 2. The files must be created in the dataDirectory by atomically
> *moving* or *renaming* them into the data directory.
>
> Since the latency requirement for the second job in the chain is only a
> few mins, you may end up having to create a new file every few mins.
>
> You may want to consider Kafka as your intermediary store for building a
> chain/DAG of streaming jobs.
>
> On Fri, Sep 8, 2017 at 9:45 AM, Sunita Arvind <sunitarv...@gmail.com>
> wrote:
>
>> Thanks for your response Michael.
>> Will try it out.
>>
>> Regards
>> Sunita
>>
>> On Wed, Aug 23, 2017 at 2:30 PM Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> If you use structured streaming and the file sink, you can have a
>>> subsequent stream read using the file source. This will maintain
>>> exactly-once processing even if there are hiccups or failures.
>>>
>>> On Mon, Aug 21, 2017 at 2:02 PM, Sunita Arvind <sunitarv...@gmail.com>
>>> wrote:
>>>
>>>> Hello Spark Experts,
>>>>
>>>> I have a design question w.r.t. Spark Streaming. I have a streaming
>>>> job that consumes protocol-buffer-encoded real-time logs from a Kafka
>>>> cluster on premise. My Spark application runs on EMR (AWS) and persists
>>>> data onto S3. Before I persist, I need to strip the header and convert
>>>> the protobuf to parquet (I use sparksql-scalapb to convert from protobuf
>>>> to spark.sql.Row). I need to persist the raw logs as is. I can continue
>>>> the enrichment on the same dataframe after persisting the raw data;
>>>> however, in order to modularize, I am planning to have a separate job
>>>> which picks up the raw data and performs enrichment on it. Also, I am
>>>> trying to avoid an all-in-one job, as the enrichments could get project
>>>> specific while raw data persistence stays customer/project agnostic.
>>>> The enriched data is allowed to have some latency (a few minutes).
>>>>
>>>> My challenge is: after persisting the raw data, how do I chain the
>>>> next streaming job? The only way I can think of is that job 1 (raw data)
>>>> partitions on the current date (YYYYMMDD) and, within the current date,
>>>> job 2 (the enrichment job) filters for records within 60s of the current
>>>> time and performs enrichment on them in 60s batches.
>>>> Is this a good option? It seems to be error prone. When either of the
>>>> jobs gets delayed due to bursts or any error/exception, this could lead
>>>> to huge data losses and non-deterministic behavior. What are other
>>>> alternatives to this?
>>>>
>>>> Appreciate any guidance in this regard.
>>>>
>>>> regards
>>>> Sunita Koppar
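
P.S. For the file-based chaining Michael suggested earlier in the thread, this is roughly what I am planning to try. The S3 paths, checkpoint locations, and the rawDf/rawSchema names below are placeholders, not working code from our setup.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{current_timestamp, date_format}

    val spark = SparkSession.builder.appName("chained-jobs").getOrCreate()

    // Job 1: rawDf stands in for the dataframe obtained after stripping the
    // header and decoding the protobuf payload. Persist it as parquet,
    // partitioned by date, with its own checkpoint.
    rawDf
      .withColumn("dt", date_format(current_timestamp(), "yyyyMMdd"))
      .writeStream
      .format("parquet")
      .partitionBy("dt")
      .option("path", "s3://bucket/raw/")                      // placeholder
      .option("checkpointLocation", "s3://bucket/chk/raw/")    // placeholder
      .start()

    // Job 2 (a separate application): read job 1's output directory as a
    // streaming file source. Only newly completed files are picked up, so
    // there is no need to filter for "records within 60s of now" by hand.
    val rawStream = spark.readStream
      .schema(rawSchema)                 // schema of job 1's parquet output
      .parquet("s3://bucket/raw/")

    val enriched = rawStream  // ... project-specific enrichment here ...

    enriched.writeStream
      .format("parquet")
      .option("path", "s3://bucket/enriched/")                   // placeholder
      .option("checkpointLocation", "s3://bucket/chk/enriched/") // placeholder
      .start()

The checkpoint locations on both sides are what give the exactly-once behavior Michael mentions, provided the output files are not modified once written.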