@Enno As per the latest version and documentation Spark Streaming does offer exactly once semantics using improved kafka integration , Not i have not tested yet.
Any feedback will be helpful if anyone is tried the same. http://koeninger.github.io/kafka-exactly-once/#7 https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji <eshi...@gmail.com> wrote: > AFAIK KCL is *supposed* to provide fault tolerance and load balancing > (plus additionally, elastic scaling unlike Storm), Kinesis providing the > coordination. My understanding is that it's like a naked Storm worker > process that can consequently only do map. > > I haven't really used it tho, so can't really comment how it compares to > Spark/Storm. Maybe somebody else will be able to comment. > > > > On Wed, Jun 17, 2015 at 3:13 PM, ayan guha <guha.a...@gmail.com> wrote: > >> Thanks for this. It's kcl based kinesis application. But because its just >> a Java application we are thinking to use spark on EMR or storm for fault >> tolerance and load balancing. Is it a correct approach? >> On 17 Jun 2015 23:07, "Enno Shioji" <eshi...@gmail.com> wrote: >> >>> Hi Ayan, >>> >>> Admittedly I haven't done much with Kinesis, but if I'm not mistaken you >>> should be able to use their "processor" interface for that. In this >>> example, it's incrementing a counter: >>> https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/services/kinesis/samples/datavis/kcl/CountingRecordProcessor.java >>> >>> Instead of incrementing a counter, you could do your transformation and >>> send it to HBase. >>> >>> >>> >>> >>> >>> >>> On Wed, Jun 17, 2015 at 1:40 PM, ayan guha <guha.a...@gmail.com> wrote: >>> >>>> Great discussion!! >>>> >>>> One qs about some comment: Also, you can do some processing with >>>> Kinesis. If all you need to do is straight forward transformation and you >>>> are reading from Kinesis to begin with, it might be an easier option to >>>> just do the transformation in Kinesis >>>> >>>> - Do you mean KCL application? Or some kind of processing withinKineis? >>>> >>>> Can you kindly share a link? I would definitely pursue this route as >>>> our transformations are really simple. >>>> >>>> Best >>>> >>>> On Wed, Jun 17, 2015 at 10:26 PM, Ashish Soni <asoni.le...@gmail.com> >>>> wrote: >>>> >>>>> My Use case is below >>>>> >>>>> We are going to receive lot of event as stream ( basically Kafka >>>>> Stream ) and then we need to process and compute >>>>> >>>>> Consider you have a phone contract with ATT and every call / sms / >>>>> data useage you do is an event and then it needs to calculate your bill >>>>> on >>>>> real time basis so when you login to your account you can see all those >>>>> variable as how much you used and how much is left and what is your bill >>>>> till date ,Also there are different rules which need to be considered when >>>>> you calculate the total bill one simple rule will be 0-500 min it is free >>>>> but above it is $1 a min. >>>>> >>>>> How do i maintain a shared state ( total amount , total min , total >>>>> data etc ) so that i know how much i accumulated at any given point as >>>>> events for same phone can go to any node / executor. >>>>> >>>>> Can some one please tell me how can i achieve this is spark as in >>>>> storm i can have a bolt which can do this ? >>>>> >>>>> Thanks, >>>>> >>>>> >>>>> >>>>> On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji <eshi...@gmail.com> >>>>> wrote: >>>>> >>>>>> I guess both. In terms of syntax, I was comparing it with Trident. >>>>>> >>>>>> If you are joining, Spark Streaming actually does offer windowed join >>>>>> out of the box. We couldn't use this though as our event stream can grow >>>>>> "out-of-sync", so we had to implement something on top of Storm. If your >>>>>> event streams don't become out of sync, you may find the built-in join in >>>>>> Spark Streaming useful. Storm also has a join keyword but its semantics >>>>>> are >>>>>> different. >>>>>> >>>>>> >>>>>> > Also, what do you mean by "No Back Pressure" ? >>>>>> >>>>>> So when a topology is overloaded, Storm is designed so that it will >>>>>> stop reading from the source. Spark on the other hand, will keep reading >>>>>> from the source and spilling it internally. This maybe fine, in fairness, >>>>>> but it does mean you have to worry about the persistent store usage in >>>>>> the >>>>>> processing cluster, whereas with Storm you don't have to worry because >>>>>> the >>>>>> messages just remain in the data store. >>>>>> >>>>>> Spark came up with the idea of rate limiting, but I don't feel this >>>>>> is as nice as back pressure because it's very difficult to tune it such >>>>>> that you don't cap the cluster's processing power but yet so that it will >>>>>> prevent the persistent storage to get used up. >>>>>> >>>>>> >>>>>> On Wed, Jun 17, 2015 at 9:33 AM, Spark Enthusiast < >>>>>> sparkenthusi...@yahoo.in> wrote: >>>>>> >>>>>>> When you say Storm, did you mean Storm with Trident or Storm? >>>>>>> >>>>>>> My use case does not have simple transformation. There are complex >>>>>>> events that need to be generated by joining the incoming event stream. >>>>>>> >>>>>>> Also, what do you mean by "No Back PRessure" ? >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wednesday, 17 June 2015 11:57 AM, Enno Shioji < >>>>>>> eshi...@gmail.com> wrote: >>>>>>> >>>>>>> >>>>>>> We've evaluated Spark Streaming vs. Storm and ended up sticking with >>>>>>> Storm. >>>>>>> >>>>>>> Some of the important draw backs are: >>>>>>> Spark has no back pressure (receiver rate limit can alleviate this >>>>>>> to a certain point, but it's far from ideal) >>>>>>> There is also no exactly-once semantics. (updateStateByKey can >>>>>>> achieve this semantics, but is not practical if you have any significant >>>>>>> amount of state because it does so by dumping the entire state on every >>>>>>> checkpointing) >>>>>>> >>>>>>> There are also some minor drawbacks that I'm sure will be fixed >>>>>>> quickly, like no task timeout, not being able to read from Kafka using >>>>>>> multiple nodes, data loss hazard with Kafka. >>>>>>> >>>>>>> It's also not possible to attain very low latency in Spark, if >>>>>>> that's what you need. >>>>>>> >>>>>>> The pos for Spark is the concise and IMO more intuitive syntax, >>>>>>> especially if you compare it with Storm's Java API. >>>>>>> >>>>>>> I admit I might be a bit biased towards Storm tho as I'm more >>>>>>> familiar with it. >>>>>>> >>>>>>> Also, you can do some processing with Kinesis. If all you need to do >>>>>>> is straight forward transformation and you are reading from Kinesis to >>>>>>> begin with, it might be an easier option to just do the transformation >>>>>>> in >>>>>>> Kinesis. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan < >>>>>>> sabarish.sasidha...@manthan.com> wrote: >>>>>>> >>>>>>> Whatever you write in bolts would be the logic you want to apply on >>>>>>> your events. In Spark, that logic would be coded in map() or similar >>>>>>> such >>>>>>> transformations and/or actions. Spark doesn't enforce a structure for >>>>>>> capturing your processing logic like Storm does. >>>>>>> Regards >>>>>>> Sab >>>>>>> Probably overloading the question a bit. >>>>>>> >>>>>>> In Storm, Bolts have the functionality of getting triggered on >>>>>>> events. Is that kind of functionality possible with Spark streaming? >>>>>>> During >>>>>>> each phase of the data processing, the transformed data is stored to the >>>>>>> database and this transformed data should then be sent to a new pipeline >>>>>>> for further processing >>>>>>> >>>>>>> How can this be achieved using Spark? >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast < >>>>>>> sparkenthusi...@yahoo.in> wrote: >>>>>>> >>>>>>> I have a use-case where a stream of Incoming events have to be >>>>>>> aggregated and joined to create Complex events. The aggregation will >>>>>>> have >>>>>>> to happen at an interval of 1 minute (or less). >>>>>>> >>>>>>> The pipeline is : >>>>>>> send events >>>>>>> enrich event >>>>>>> Upstream services -------------------> KAFKA ---------> event Stream >>>>>>> Processor ------------> Complex Event Processor ------------> Elastic >>>>>>> Search. >>>>>>> >>>>>>> From what I understand, Storm will make a very good ESP and Spark >>>>>>> Streaming will make a good CEP. >>>>>>> >>>>>>> But, we are also evaluating Storm with Trident. >>>>>>> >>>>>>> How does Spark Streaming compare with Storm with Trident? >>>>>>> >>>>>>> Sridhar Chellappa >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wednesday, 17 June 2015 10:02 AM, ayan guha < >>>>>>> guha.a...@gmail.com> wrote: >>>>>>> >>>>>>> >>>>>>> I have a similar scenario where we need to bring data from kinesis >>>>>>> to hbase. Data volecity is 20k per 10 mins. Little manipulation of data >>>>>>> will be required but that's regardless of the tool so we will be writing >>>>>>> that piece in Java pojo. >>>>>>> All env is on aws. Hbase is on a long running EMR and kinesis on a >>>>>>> separate cluster. >>>>>>> TIA. >>>>>>> Best >>>>>>> Ayan >>>>>>> On 17 Jun 2015 12:13, "Will Briggs" <wrbri...@gmail.com> wrote: >>>>>>> >>>>>>> The programming models for the two frameworks are conceptually >>>>>>> rather different; I haven't worked with Storm for quite some time, but >>>>>>> based on my old experience with it, I would equate Spark Streaming more >>>>>>> with Storm's Trident API, rather than with the raw Bolt API. Even then, >>>>>>> there are significant differences, but it's a bit closer. >>>>>>> >>>>>>> If you can share your use case, we might be able to provide better >>>>>>> guidance. >>>>>>> >>>>>>> Regards, >>>>>>> Will >>>>>>> >>>>>>> On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote: >>>>>>> >>>>>>> Hi All, >>>>>>> >>>>>>> I am evaluating spark VS storm ( spark streaming ) and i am not >>>>>>> able to see what is equivalent of Bolt in storm inside spark. >>>>>>> >>>>>>> Any help will be appreciated on this ? >>>>>>> >>>>>>> Thanks , >>>>>>> Ashish >>>>>>> --------------------------------------------------------------------- >>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>>>>>> For additional commands, e-mail: user-h...@spark.apache.org >>>>>>> >>>>>>> >>>>>>> --------------------------------------------------------------------- >>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>>>>>> For additional commands, e-mail: user-h...@spark.apache.org >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>>> >>>> -- >>>> Best Regards, >>>> Ayan Guha >>>> >>> >>> >