Re: RE: Spark or Storm
I agree with Cody. It's pretty hard for any framework to provide built-in support for that, since the semantics depend entirely on which data store you want to use it with. Providing interfaces does help a little, but even with those interfaces the user still has to do most of the heavy lifting: the user has to understand what is actually going on AND implement all the code needed to ensure a unique ID and atomic updates, according to the capabilities and APIs provided by the data store.
Re: RE: Spark or Storm
Tbh I find the doc around this a bit confusing. If it says "end-to-end exactly-once semantics (if your updates to downstream systems are idempotent or transactional)", I think most people will interpret it as: as long as you use a storage which has atomicity (like MySQL/Postgres etc.), a successful output operation for a given batch (let's say "+5") is going to be issued exactly once against the storage.

However, as I understand it, that's not what this statement means. What it is saying is that it will always issue "+5" and never, say, "+6", because it makes sure a message is processed exactly once internally. However, it *may* issue "+5" more than once for a given batch, and it is up to the developer to deal with this by either making the output operation idempotent (e.g. "set 5") or transactional (e.g. keep track of batch IDs and skip already-applied batches).

I wonder if it makes more sense to drop "or transactional" from the statement, because if you think about it, ultimately what you are asked to do is to make the writes idempotent even with the transactional approach; "transactional" is a bit loaded and would be prone to lead to misunderstandings (even though, in fairness, the fault-tolerance chapter explicitly explains it).
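To make the idempotent option above concrete, here is a minimal sketch (not from the thread): write an absolute value under a deterministic key, so a replayed batch overwrites the same rows instead of incrementing them again. The JDBC URL, the `counts` table, and the function name are hypothetical; the upsert syntax shown is MySQL's.

```scala
import java.sql.DriverManager

// Idempotent output ("set 5" rather than "+5"): replaying the same batch
// rewrites the same rows, so duplicate deliveries are harmless.
// Assumes a hypothetical table:
//   CREATE TABLE counts(k VARCHAR(64) PRIMARY KEY, v BIGINT)
def writeIdempotent(jdbcUrl: String, results: Iterator[(String, Long)]): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO counts(k, v) VALUES (?, ?) ON DUPLICATE KEY UPDATE v = VALUES(v)")
    results.foreach { case (key, value) =>
      stmt.setString(1, key)
      stmt.setLong(2, value)
      stmt.executeUpdate()
    }
  } finally {
    conn.close()
  }
}
```

Called from foreachPartition inside foreachRDD, a write like this can be retried on batch replay without changing the stored result.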
Re: RE: Spark or Storm
My understanding of exactly-once semantics is that it is handled by the framework itself, but it is not very clear from the documentation. I believe the documentation needs to be updated with a simple example so that it is clear to the end user. This is very critical when someone is evaluating the framework and does not have enough time to validate all the use cases, but has to rely on the documentation.

Ashish
RE: RE: Spark or Storm
My question is not directly related. About the exactly-once semantics: the document (copied below) says Spark Streaming gives exactly-once semantics, but from my test results, with checkpointing enabled, the application always re-processes the files in the last batch after a graceful restart.

==
Semantics of Received Data
Different input sources provide different guarantees, ranging from at-least once to exactly once. Read for more details.
With Files
If all of the input data is already present in a fault-tolerant file system like HDFS, Spark Streaming can always recover from any failure and process all of the data. This gives exactly-once semantics, meaning all of the data will be processed exactly once no matter what fails.

From: Enno Shioji [mailto:eshi...@gmail.com]
Sent: Friday, June 19, 2015 5:29 PM
To: Tathagata Das
Subject: Re: RE: Spark or Storm

Fair enough; on second thought, just saying that it should be idempotent is indeed more confusing. I guess the crux of the confusion comes from the fact that people tend to assume the work you described (store batch id and skip, etc.) is handled by the framework, perhaps partly because Storm Trident does handle it (you just need to let Storm know whether the output operation succeeded, and it handles the batch-id storing and skipping business). Whenever I explain to people that one needs to do this additional work you described to get end-to-end exactly-once semantics, it usually takes a while to convince them. In my limited experience, they tend to interpret "transactional" in that sentence to mean that you just have to write to transactional storage like an ACID RDB. Pointing them to "Semantics of output operations" is usually sufficient, though. Maybe others like @Ashish can weigh in on this; did you interpret it in this way?

What if we change the statement to: "end-to-end exactly-once semantics (if your updates to downstream systems are idempotent or transactional). To learn how to make your updates idempotent or transactional, see the 'Semantics of output operations' section in this chapter: https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics". That way, it's clear that it's not sufficient to merely write to transactional storage like an ACID store.
Re: RE: Spark or Storm
I think your observation is correct: you have to take care of the replayed data on your end, e.g. give each message a unique id or something similar. I say "I think" in the above sentence because I am not sure, and I also have a related question: I am wondering how direct stream + Kafka behaves when the driver goes down and is restarted. Will it always first replay the checkpointed failed batch, or will it honor Kafka's offset reset policy (auto.offset.reset)? If it honors the reset policy and it is set to "smallest", then it is at-least-once semantics; if it is set to "largest", then it will be at-most-once semantics?

bit1...@163.com
Re: RE: Spark or Storm
If the current documentation is confusing, we can definitely improve it. However, I do not understand why the term "transactional" is confusing. If your output operation has to "add 5", then the user has to implement the following mechanism:

1. If the unique id of the batch of data is already present in the store, skip the update.
2. Otherwise, atomically do both: the update operation and storing the unique id of the batch.

This is pretty much the definition of a transaction. The user has to be aware of the transactional semantics of the data store while implementing this functionality. You CAN argue that this effectively makes the whole update sort of idempotent, as even if you try doing it multiple times, it will update only once. But that is not what is generally considered idempotent: writing a fixed count, not an increment, is usually what is called idempotent. So just mentioning that the output operation must be "idempotent" is, in my opinion, more confusing.

To take a page out of the Storm / Trident guide, even they call this exact conditional updating of Trident State a "transactional" operation. See "transactional spout" in the Trident State guide: https://storm.apache.org/documentation/Trident-state

In the end, I am totally open to suggestions and PRs on how to make the programming guide easier to understand. :)

TD
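As a rough illustration of the two-step mechanism TD outlines (a sketch under assumed names, not anything from the Spark codebase), the skip-or-apply logic can live in a single database transaction. The `totals` and `applied_batches` tables and the `applyOnce` helper are hypothetical.

```scala
import java.sql.Connection

// Skip if this batch id was already applied; otherwise apply the increment
// AND record the batch id inside the same database transaction.
// Hypothetical schema:
//   CREATE TABLE totals(k VARCHAR(64) PRIMARY KEY, total BIGINT)
//   CREATE TABLE applied_batches(batch_id VARCHAR(128) PRIMARY KEY)
def applyOnce(conn: Connection, batchId: String, key: String, delta: Long): Unit = {
  conn.setAutoCommit(false)
  try {
    val seen = conn.prepareStatement("SELECT 1 FROM applied_batches WHERE batch_id = ?")
    seen.setString(1, batchId)
    if (!seen.executeQuery().next()) {   // 1. batch id already present? then skip
      val upd = conn.prepareStatement("UPDATE totals SET total = total + ? WHERE k = ?")
      upd.setLong(1, delta)
      upd.setString(2, key)
      upd.executeUpdate()                // 2a. the update operation...
      val mark = conn.prepareStatement("INSERT INTO applied_batches(batch_id) VALUES (?)")
      mark.setString(1, batchId)
      mark.executeUpdate()               // 2b. ...and the batch id, committed atomically
    }
    conn.commit()
  } catch {
    case e: Exception =>
      conn.rollback()
      throw e
  }
}
```

Because both writes commit or roll back together, a replayed batch either finds its id recorded and skips, or re-runs the whole transaction, which is exactly the "transactional" behavior being discussed.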
Re: RE: Spark or Storm
http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics — "semantics of output operations" section. Is this really not clear?

As for the general tone of "why doesn't the framework do it for you": in my opinion, this is essential complexity for delivery semantics in a distributed system, not incidental complexity. You need to actually understand and be responsible for what's going on, unless you're talking about very narrow use cases (i.e. outputting to a known datastore with known semantics and schema).
Re: RE: Spark or Storm
auto.offset.reset only applies when there are no starting offsets (either from a checkpoint, or from you providing them explicitly).
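A small sketch of what that means for the Kafka 0.8 direct stream configuration (the broker list is hypothetical): auto.offset.reset only decides where to start when nothing else supplies offsets.

```scala
// Only consulted when no starting offsets exist; offsets recovered from a
// checkpoint (or passed explicitly as fromOffsets) take precedence.
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092,broker2:9092", // hypothetical brokers
  "auto.offset.reset"    -> "smallest"                   // "smallest" or "largest" in Kafka 0.8
)
```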
Re: RE: Spark or Storm
I am wondering how the direct stream API ensures end-to-end exactly-once semantics. I think there are two things involved:

1. On the Spark Streaming end, the driver will replay the offset range when it goes down and is restarted, which means that the new tasks will process some already-processed data.
2. On the user end, since tasks may process already-processed data, the user should detect that some data has already been processed, e.g. by using some unique ID.

Not sure if I have understood correctly.

bit1...@163.com
Re: RE: Spark or Storm
That general description is accurate, but it's not really a specific issue of the direct stream. It applies to anything consuming from Kafka (or, as Matei already said, any streaming system really). You can't have exactly-once semantics unless you know something more about how you're storing results. For some unique id, topic/partition and offset is usually the obvious choice, which is why it's important that the direct stream gives you access to the offsets.

See https://github.com/koeninger/kafka-exactly-once for more info.
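For reference, a minimal sketch of reading those offsets from the Spark 1.3 direct stream (the broker and topic names are hypothetical); the (topic, partition, offset) triple is what can serve as the unique id mentioned above.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object OffsetDemo {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("direct-offsets"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))                              // hypothetical topic

    stream.foreachRDD { rdd =>
      // Each batch knows exactly which offsets it covers, so
      // (topic, partition, untilOffset) can serve as a unique id for output.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(r => println(s"${r.topic}/${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```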
RE: RE: Spark or Storm
More details on the Direct API of Spark 1.3 are in the Databricks blog: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

Note the use of checkpoints to persist the Kafka offsets in Spark Streaming itself, rather than in ZooKeeper. Also this statement: ".. This allows one to build a Spark Streaming + Kafka pipeline with end-to-end exactly-once semantics (if your updates to downstream systems are idempotent or transactional)."
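A short sketch of the checkpoint-based driver recovery this relies on, using StreamingContext.getOrCreate (the checkpoint path and app name are hypothetical): on restart, the context, including the stored Kafka offsets, is rebuilt from the checkpoint directory rather than from ZooKeeper.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Offsets for the direct stream live in Spark's own checkpoints, so the
// driver must be recreated from the checkpoint directory to recover them.
val checkpointDir = "hdfs:///checkpoints/kafka-app" // hypothetical path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("kafka-checkpointed"), Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... set up the direct stream and output operations here ...
  ssc
}

// Rebuilds from the checkpoint if one exists; otherwise creates a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```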
RE: Spark or Storm
not being able to read from Kafka using multiple nodes Kafka is plenty capable of doing this.. I faced the same issue before Spark 1.3 was released. The issue was not with Kafka, but with Spark Streaming's Kafka connector. Before the Spark 1.3.0 release, one Spark worker would get all the streamed messages. We had to re-partition to distribute the processing. From the Spark 1.3.0 release, the Spark Direct API for Kafka supported parallel reads from Kafka streamed to Spark workers. See "Approach 2: Direct Approach" in this page: http://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html. Note that it also mentions zero data loss and exactly-once semantics for Kafka integration. Prajod From: Jordan Pilat [mailto:jrpi...@gmail.com] Sent: 18 June 2015 03:57 To: Enno Shioji Cc: Will Briggs; asoni.le...@gmail.com; ayan guha; user; Sateesh Kavuri; Spark Enthusiast; Sabarish Sasidharan Subject: Re: Spark or Storm not being able to read from Kafka using multiple nodes Kafka is plenty capable of doing this, by clustering multiple consumer instances together into a consumer group. If your topic is sufficiently partitioned, the consumer group can consume the topic in a parallelized fashion. If it isn't, you still have the fault tolerance associated with clustering the consumers. OK JRP On Jun 17, 2015 1:27 AM, Enno Shioji eshi...@gmail.com wrote: We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm. Some of the important drawbacks are: Spark has no back pressure (the receiver rate limit can alleviate this to a certain point, but it's far from ideal). There are also no exactly-once semantics (updateStateByKey can achieve these semantics, but is not practical if you have any significant amount of state, because it does so by dumping the entire state on every checkpoint). There are also some minor drawbacks that I'm sure will be fixed quickly, like no task timeout, not being able to read from Kafka using multiple nodes, and a data-loss hazard with Kafka. It's also not possible to attain very low latency in Spark, if that's what you need. The plus for Spark is the concise and IMO more intuitive syntax, especially if you compare it with Storm's Java API. I admit I might be a bit biased towards Storm tho, as I'm more familiar with it. Also, you can do some processing with Kinesis. If all you need to do is a straightforward transformation and you are reading from Kinesis to begin with, it might be an easier option to just do the transformation in Kinesis. On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: Whatever you write in bolts would be the logic you want to apply on your events. In Spark, that logic would be coded in map() or similar such transformations and/or actions. Spark doesn't enforce a structure for capturing your processing logic the way Storm does. Regards Sab Probably overloading the question a bit. In Storm, Bolts have the functionality of getting triggered on events. Is that kind of functionality possible with Spark Streaming? During each phase of the data processing, the transformed data is stored to the database, and this transformed data should then be sent to a new pipeline for further processing. How can this be achieved using Spark? On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast sparkenthusi...@yahoo.in wrote: I have a use case where a stream of incoming events has to be aggregated and joined to create complex events.
The aggregation will have to happen at an interval of 1 minute (or less). The pipeline is: Upstream services --(send events)--> KAFKA --(enrich event)--> Event Stream Processor --> Complex Event Processor --> Elastic Search. From what I understand, Storm will make a very good ESP and Spark Streaming will make a good CEP. But we are also evaluating Storm with Trident. How does Spark Streaming compare with Storm with Trident? Sridhar Chellappa On Wednesday, 17 June 2015 10:02 AM, ayan guha guha.a...@gmail.com wrote: I have a similar scenario where we need to bring data from Kinesis to HBase. Data velocity is 20k per 10 mins. Little manipulation of data will be required, but that's regardless of the tool, so we will be writing that piece as a Java POJO. All env is on AWS. HBase is on a long-running EMR and Kinesis on a separate cluster. TIA. Best Ayan On 17 Jun 2015 12:13, Will Briggs wrbri...@gmail.com wrote: The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API, rather than with the raw Bolt API. Even then, there are significant differences, but it's a bit closer. If you can share your use case, we might be able to provide better guidance. Regards, Will
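On Sabarish's point in the chain above that bolt logic becomes ordinary transformations: a minimal sketch, assuming a simple enrich-then-store pipeline; the socket source and the enrich and writeToStore helpers are hypothetical stand-ins:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BoltLogicAsTransformations {
      // Hypothetical stand-ins for the user's per-event logic and sink.
      def enrich(event: String): String = event.toUpperCase
      def writeToStore(events: Iterator[String]): Unit = ()

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("bolt-as-map"), Seconds(60))
        val events = ssc.socketTextStream("localhost", 9999) // any input DStream works here

        // What a bolt would do per tuple becomes a map over the stream...
        val enriched = events.map(enrich)
        // ...and "emit to the next stage" becomes a further transformation
        // or an output operation.
        enriched.foreachRDD(_.foreachPartition(writeToStore))

        ssc.start()
        ssc.awaitTermination()
      }
    }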
Re: Spark or Storm
Again, by Storm, you mean Storm Trident, correct? On Wednesday, 17 June 2015 10:09 PM, Michael Segel msegel_had...@hotmail.com wrote: Actually the reverse. Spark Streaming is really a micro-batch system where the smallest window is half a second (500 ms). So for CEP, it's not really a good idea. So in terms of options... spark streaming, storm, samza, akka and others... Storm is probably the easiest to pick up; spark streaming / akka may give you more flexibility, and akka would work for CEP. Just my $0.02 On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote: Hi All, I am evaluating Spark vs. Storm (Spark Streaming) and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help will be appreciated on this. Thanks, Ashish
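For the one-minute aggregation in the use case above, Spark Streaming would express it as a window over micro-batches (per Michael's point, the batch interval sets the floor on latency). A sketch, assuming (key, value) events on a placeholder socket source:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object OneMinuteAggregation {
      def main(args: Array[String]): Unit = {
        // Micro-batches every 5 seconds; results emitted over a sliding 1-minute window.
        val ssc = new StreamingContext(new SparkConf().setAppName("one-minute-agg"), Seconds(5))
        ssc.checkpoint("/tmp/checkpoints") // required by the inverse-function variant below

        val events = ssc.socketTextStream("localhost", 9999)
          .map { line => val Array(k, v) = line.split(","); (k, v.toLong) }

        // Sums per key over the last 60 seconds, recomputed every 10 seconds.
        // The inverse function (_ - _) lets Spark update the window incrementally.
        val sums = events.reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))
        sums.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }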
Re: Spark or Storm
This documentation is only for writes to an external system, but all the counting you do within your streaming app (e.g. if you use reduceByKeyAndWindow to keep track of a running count) is exactly-once. When you write to a storage system, no matter which streaming framework you use, you'll have to make sure the writes are idempotent, because the storage system can't know whether you meant to write the same data again or not. But the place where Spark Streaming helps over Storm, etc. is for tracking state within your computation. Without that facility, you'd not only have to make sure that writes are idempotent, but you'd have to make sure that updates to your own internal state (e.g. reduceByKeyAndWindow) are exactly-once too. Matei On Jun 17, 2015, at 8:26 AM, Enno Shioji eshi...@gmail.com wrote: The thing is, even with that improvement, you still have to make updates idempotent or transactional yourself. If you read http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics which refers to the latest version, it says: Semantics of output operations Output operations (like foreachRDD) have at-least-once semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure. While this is acceptable for saving to file systems using the saveAs***Files operations (as the file will simply get overwritten with the same data), additional effort may be necessary to achieve exactly-once semantics. There are two approaches. Idempotent updates: Multiple attempts always write the same data. For example, saveAs***Files always writes the same data to the generated files. Transactional updates: All updates are made transactionally so that updates are made exactly once, atomically. One way to do this would be the following. Use the batch time (available in foreachRDD) and the partition index of the transformed RDD to create an identifier. This identifier uniquely identifies a blob of data in the streaming application. Update the external system with this blob transactionally (that is, exactly once, atomically) using the identifier. That is, if the identifier is not already committed, commit the partition data and the identifier atomically. Else, if this was already committed, skip the update. So either you make the update idempotent, or you have to make it transactional yourself, and the suggested mechanism is very similar to what Storm does. On Wed, Jun 17, 2015 at 3:51 PM, Ashish Soni asoni.le...@gmail.com wrote: @Enno As per the latest version and documentation, Spark Streaming does offer exactly-once semantics using the improved Kafka integration. No, I have not tested it yet. Any feedback will be helpful if anyone has tried the same. http://koeninger.github.io/kafka-exactly-once/#7 https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji eshi...@gmail.com wrote: AFAIK KCL is *supposed* to provide fault tolerance and load balancing (plus, additionally, elastic scaling, unlike Storm), with Kinesis providing the coordination. My understanding is that it's like a naked Storm worker process that can consequently only do map.
I haven't really used it tho, so I can't really comment on how it compares to Spark/Storm. Maybe somebody else will be able to comment. On Wed, Jun 17, 2015 at 3:13 PM, ayan guha guha.a...@gmail.com wrote: Thanks for this. It's a KCL-based Kinesis application. But because it's just a Java application, we are thinking to use Spark on EMR or Storm for fault tolerance and load balancing. Is it a correct approach? On 17 Jun 2015 23:07, Enno Shioji eshi...@gmail.com wrote: Hi Ayan, Admittedly I haven't done much with Kinesis, but if I'm not mistaken you should be able to use their processor interface for that. In this example, it's incrementing a counter: https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/services/kinesis/samples/datavis/kcl/CountingRecordProcessor.java Instead of incrementing a counter, you could do your transformation and send it to HBase.
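A minimal sketch of the "transactional updates" recipe quoted above (batch time plus partition index as the identifier). The commitAtomically helper is hypothetical; a real version would do the existence check and the write in one transaction in the target store:

    import org.apache.spark.{SparkConf, TaskContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TransactionalUpdates {
      // Hypothetical: writes data + id atomically if `id` is new;
      // a no-op if `id` was already committed by an earlier attempt.
      def commitAtomically(id: String, records: Seq[String]): Unit = ()

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("txn-updates"), Seconds(5))
        val lines = ssc.socketTextStream("localhost", 9999)

        lines.foreachRDD { (rdd, batchTime) =>
          rdd.foreachPartition { iter =>
            // Batch time + partition index uniquely identifies this blob of data,
            // even when the batch is replayed after a failure.
            val id = s"${batchTime.milliseconds}-${TaskContext.get.partitionId}"
            commitAtomically(id, iter.toSeq)
          }
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }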
Re: Spark or Storm
Hi Matei, Ah, can't get more accurate than from the horse's mouth... If you don't mind helping me understand it correctly: from what I understand, Storm Trident does the following (when used with Kafka): 1) Sit on the Kafka Spout and create batches 2) Assign a global sequential ID to the batches 3) Make sure that all results of processed batches are written once to TridentState, *in order* (for example, by skipping batches that were already applied once, ultimately by using Zookeeper) TridentState is an interface that you have to implement, and the underlying storage has to be transactional for this to work. The necessary skipping etc. is handled by Storm. In the case of Spark Streaming, I understand that 1) There is no global ordering; e.g. an output operation for a batch consisting of offsets [4,5,6] can be invoked before the operation for offsets [1,2,3] 2) If you wanted to achieve something similar to what TridentState does, you'd have to do it yourself (for example using Zookeeper) Is this a correct understanding?
RE: Spark or Storm
The only thing which doesn't make much sense in Spark Streaming (and I am not saying it is done better in Storm) is the iterative and redundant shipping of essentially the same tasks (closures/lambdas/functions) to the cluster nodes AND re-launching them there again and again. This is a legacy from Spark batch, where such an approach DOES make sense. So to recap: in Spark Streaming, the driver keeps serializing and transmitting the same tasks (comprising a job) for every new DStream RDD, which then get re-launched, i.e. new JVM threads launched within each executor (JVM), and then the tasks report their final execution status to the driver (only the last has real functional purpose). An optimization (provided Spark Streaming were implemented from scratch) could be to launch the stages/tasks of a streaming job as constantly running threads (daemons/agents) within the executors and let the DStream RDDs stream through them, so that only the final execution status for each DStream RDD and some periodic heartbeats (of the agents) are reported to the driver. Essentially this gives you a pipeline architecture (of strung-together agents), which is a well-known parallel programming pattern, especially suitable for streaming data.
Re: Spark or Storm
To add more information beyond what Matei said and answer the original question, here are other things to consider when comparing Spark Streaming and Storm. * Unified programming model and semantics - On most occasions you have to process the same data again in batch jobs. If you have two separate systems for batch and streaming, it's much, much harder to share the code. You will have to deal with different processing models, with their own semantics. Compare Storm's join vs. doing a usual batch join, whereas Spark and Spark Streaming share the same join semantics, as they are based on the same RDD model underneath. * Integration with the Spark ecosystem - Many people really want to go beyond basic streaming ETL and into advanced streaming analytics. - Combine stream processing with static datasets - Apply dynamically updated machine learning models (i.e. offline learning and online prediction, or even continuous learning and prediction) - Apply DataFrame and SQL operations to streams These things are pretty easy to do with the Spark ecosystem. * Operational management - You have to consider the operational cost of managing two separate systems for batch and stream processing (with their own deployment models), vs. managing one single engine with one deployment model. * Performance - According to Intel's independent study, Spark Streaming in Kafka direct mode can have 2.5-3x the throughput of Trident with a 0.5 GB batch size. And at that batch size, the latency of Trident is 30 seconds, compared to 1.5 seconds for Spark Streaming. This is from a talk that Intel gave at the Spark Summit (https://spark-summit.org/2015/) two days ago. Slides will be available soon, but here is a sneak peek - throughput - http://i.imgur.com/u6pf4rB.png and latency - http://imgur.com/c46MJ4i I will post the link to the slides when it comes out, hopefully next week. On Wed, Jun 17, 2015 at 11:55 AM, Matei Zaharia matei.zaha...@gmail.com wrote: The major difference is that in Spark Streaming, there's no *need* for a TridentState for state inside your computation. All the stateful operations (reduceByWindow, updateStateByKey, etc.) automatically handle exactly-once processing, keeping updates in order, etc. Also, you don't need to run a separate transactional system (e.g. MySQL) to store the state. After your computation runs, if you want to write the final results (e.g. the counts you've been tracking) to a storage system, you use one of the output operations (saveAsFiles, foreach, etc.). Those actually will run in order, but some might run multiple times if nodes fail, thus attempting to write the same state again. You can read about how it works in this research paper: http://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf. Matei
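To illustrate Matei's "no need for a TridentState" point: a sketch of a running count kept inside the computation with updateStateByKey, on a placeholder socket source. The state updates themselves are handled exactly-once by Spark; only writes out of output operations would still need to be made idempotent:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object RunningCountState {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("running-count"), Seconds(5))
        ssc.checkpoint("/tmp/checkpoints") // stateful operations require checkpointing

        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

        // Spark tracks this state for us; on failure it is recovered from the
        // checkpoint rather than from an external transactional store.
        val counts = words.map((_, 1L)).updateStateByKey[Long] {
          (newValues: Seq[Long], state: Option[Long]) =>
            Some(state.getOrElse(0L) + newValues.sum)
        }
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }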
Re: Spark or Storm
My use case is below. We are going to receive a lot of events as a stream (basically a Kafka stream), and then we need to process and compute. Consider you have a phone contract with ATT, and every call / SMS / data usage you do is an event, and then it needs to calculate your bill on a real-time basis, so when you log in to your account you can see all those variables, such as how much you used, how much is left, and what your bill is to date. Also, there are different rules which need to be considered when you calculate the total bill; one simple rule will be that 0-500 minutes are free, but above that it is $1 a minute. How do I maintain a shared state (total amount, total minutes, total data, etc.) so that I know how much I have accumulated at any given point, as events for the same phone can go to any node / executor? Can someone please tell me how I can achieve this in Spark, as in Storm I can have a bolt which can do this? Thanks,
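For what it's worth, the shared per-phone state asked about here is the sort of thing updateStateByKey covers (with the caveats about large state raised elsewhere in this thread). A rough sketch, assuming events parsed into (phone, minutes) pairs and the 0-500-free rule from the example; the source and parsing are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BillingState {
      case class Usage(minutes: Long, bill: Double)

      // Hypothetical rule from the example: first 500 minutes free, then $1/min.
      def billFor(totalMinutes: Long): Double = math.max(0L, totalMinutes - 500) * 1.0

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("billing"), Seconds(10))
        ssc.checkpoint("/tmp/checkpoints")

        // Assume lines like "phone,minutes"; a Kafka DStream would slot in here.
        val events = ssc.socketTextStream("localhost", 9999)
          .map { line => val Array(phone, mins) = line.split(","); (phone, mins.toLong) }

        // Events for the same phone are grouped by key, so the accumulated state
        // is consistent no matter which executor processed each event.
        val totals = events.updateStateByKey[Usage] { (mins: Seq[Long], state: Option[Usage]) =>
          val total = state.map(_.minutes).getOrElse(0L) + mins.sum
          Some(Usage(total, billFor(total)))
        }
        totals.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }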
Re: Spark or Storm
When you say Storm, did you mean Storm with Trident, or Storm? My use case does not have simple transformations. There are complex events that need to be generated by joining the incoming event stream. Also, what do you mean by "no back pressure"?
Re: Spark or Storm
I guess both. In terms of syntax, I was comparing it with Trident. If you are joining, Spark Streaming actually does offer a windowed join out of the box. We couldn't use this though, as our event streams can grow out of sync, so we had to implement something on top of Storm. If your event streams don't become out of sync, you may find the built-in join in Spark Streaming useful. Storm also has a join keyword, but its semantics are different. Also, what do you mean by "no back pressure"? So when a topology is overloaded, Storm is designed so that it will stop reading from the source. Spark, on the other hand, will keep reading from the source and spilling it internally. This may be fine, in fairness, but it does mean you have to worry about the persistent store usage in the processing cluster, whereas with Storm you don't have to worry, because the messages just remain in the data store. Spark came up with the idea of rate limiting, but I don't feel this is as nice as back pressure, because it's very difficult to tune it such that you don't cap the cluster's processing power and yet it still prevents the persistent storage from getting used up.
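The windowed join mentioned above is available out of the box. A sketch joining two keyed streams over aligned one-minute windows; the sources are placeholders, and as noted this only helps if the streams stay roughly in sync:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedJoin {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("windowed-join"), Seconds(10))

        // Placeholder sources; in practice these would be Kafka/Kinesis streams
        // keyed by whatever id the two event types share.
        val clicks = ssc.socketTextStream("localhost", 9999)
          .map { line => val Array(id, v) = line.split(","); (id, v) }
        val impressions = ssc.socketTextStream("localhost", 9998)
          .map { line => val Array(id, v) = line.split(","); (id, v) }

        // Join events that fall within the same one-minute window of each stream.
        val joined = clicks.window(Seconds(60)).join(impressions.window(Seconds(60)))
        joined.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }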
Re: Spark or Storm
Great discussion!! One question about a comment: "Also, you can do some processing with Kinesis. If all you need to do is straightforward transformation and you are reading from Kinesis to begin with, it might be an easier option to just do the transformation in Kinesis" - Do you mean a KCL application? Or some kind of processing within Kinesis? Can you kindly share a link? I would definitely pursue this route, as our transformations are really simple. Best
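Re "processing within Kinesis": Enno's pointer earlier in the thread is to the KCL, where per-record logic lives in an IRecordProcessor. A rough sketch against the KCL 1.x interface of the time; treat the exact package names as approximate, and the HBase write is a hypothetical helper:

    import java.util.{List => JList}
    import com.amazonaws.services.kinesis.clientlibrary.interfaces.{
      IRecordProcessor, IRecordProcessorCheckpointer}
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason
    import com.amazonaws.services.kinesis.model.Record
    import scala.collection.JavaConverters._

    class TransformToHBaseProcessor extends IRecordProcessor {
      // Hypothetical sink; a real one would batch Puts to an HBase table.
      private def writeToHBase(row: String): Unit = ()

      override def initialize(shardId: String): Unit = ()

      override def processRecords(records: JList[Record],
                                  checkpointer: IRecordProcessorCheckpointer): Unit = {
        for (record <- records.asScala) {
          // Assumes an array-backed buffer, as in the AWS sample linked above.
          val payload = new String(record.getData.array(), "UTF-8")
          writeToHBase(payload.toUpperCase) // stand-in for the "simple transformation"
        }
        checkpointer.checkpoint() // mark progress once the batch is persisted
      }

      override def shutdown(checkpointer: IRecordProcessorCheckpointer,
                            reason: ShutdownReason): Unit = {
        if (reason == ShutdownReason.TERMINATE) checkpointer.checkpoint()
      }
    }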
(updateStateByKey can achieve this semantics, but is not practical if you have any significant amount of state because it does so by dumping the entire state on every checkpointing) There are also some minor drawbacks that I'm sure will be fixed quickly, like no task timeout, not being able to read from Kafka using multiple nodes, data loss hazard with Kafka. It's also not possible to attain very low latency in Spark, if that's what you need. The pos for Spark is the concise and IMO more intuitive syntax, especially if you compare it with Storm's Java API. I admit I might be a bit biased towards Storm tho as I'm more familiar with it. Also, you can do some processing with Kinesis. If all you need to do is straight forward transformation and you are reading from Kinesis to begin with, it might be an easier option to just do the transformation in Kinesis. On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: Whatever you write in bolts would be the logic you want to apply on your events. In Spark, that logic would be coded in map() or similar such transformations and/or actions. Spark doesn't enforce a structure for capturing your processing logic like Storm does. Regards Sab Probably overloading the question a bit. In Storm, Bolts have the functionality of getting triggered on events. Is that kind of functionality possible with Spark streaming? During each phase of the data processing, the transformed data is stored to the database and this transformed data should then be sent
Re: Spark or Storm
In that case I assume you need exactly-once semantics. There's no out-of-the-box way to do that in Spark. There is updateStateByKey, but it's not practical with your use case as the state is too large (it'll try to dump the entire intermediate state on every checkpoint, which would be prohibitively expensive). So either you have to implement something yourself, or you can use Storm Trident (or the transactional low-level API).
On Wed, Jun 17, 2015 at 1:26 PM, Ashish Soni asoni.le...@gmail.com wrote: My use case is below. We are going to receive a lot of events as a stream (basically a Kafka stream) and then we need to process and compute. Consider you have a phone contract with AT&T, and every call / SMS / data usage you make is an event, and it then needs to calculate your bill on a real-time basis, so when you log in to your account you can see all those variables: how much you used, how much is left, and what your bill is to date. Also, there are different rules which need to be considered when you calculate the total bill; one simple rule will be that 0-500 minutes is free but above that it is $1 a minute. How do I maintain shared state (total amount, total minutes, total data, etc.) so that I know how much I have accumulated at any given point, as events for the same phone can go to any node / executor? Can someone please tell me how I can achieve this in Spark, as in Storm I can have a bolt which can do this? Thanks,
On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji eshi...@gmail.com wrote: I guess both. In terms of syntax, I was comparing it with Trident. If you are joining, Spark Streaming actually does offer windowed join out of the box. We couldn't use this though, as our event streams can grow out of sync, so we had to implement something on top of Storm. If your event streams don't become out of sync, you may find the built-in join in Spark Streaming useful. Storm also has a join keyword but its semantics are different. Also, what do you mean by "No Back Pressure"? So when a topology is overloaded, Storm is designed so that it will stop reading from the source. Spark, on the other hand, will keep reading from the source and spilling it internally. This may be fine, in fairness, but it does mean you have to worry about the persistent store usage in the processing cluster, whereas with Storm you don't have to worry because the messages just remain in the data store. Spark came up with the idea of rate limiting, but I don't feel this is as nice as back pressure, because it's very difficult to tune it such that you don't cap the cluster's processing power yet still prevent the persistent storage from getting used up.
On Wed, Jun 17, 2015 at 9:33 AM, Spark Enthusiast sparkenthusi...@yahoo.in wrote: When you say Storm, did you mean Storm with Trident or Storm? My use case does not have simple transformations. There are complex events that need to be generated by joining the incoming event streams. Also, what do you mean by "No Back Pressure"?
On Wednesday, 17 June 2015 11:57 AM, Enno Shioji eshi...@gmail.com wrote: We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm. Some of the important drawbacks are: Spark has no back pressure (the receiver rate limit can alleviate this to a certain point, but it's far from ideal). There are also no exactly-once semantics (updateStateByKey can achieve these semantics, but is not practical if you have any significant amount of state, because it does so by dumping the entire state on every checkpoint). There are also some minor drawbacks that I'm sure will be fixed quickly, like no task timeout, not being able to read from Kafka using multiple nodes, and a data loss hazard with Kafka. It's also not possible to attain very low latency in Spark, if that's what you need. The plus for Spark is the concise and, IMO, more intuitive syntax, especially if you compare it with Storm's Java API. I admit I might be a bit biased towards Storm though, as I'm more familiar with it. Also, you can do some processing with Kinesis. If all you need to do is straightforward transformation and you are reading from Kinesis to begin with, it might be an easier option to just do the transformation in Kinesis.
On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: Whatever you write in bolts would be the logic you want to apply on your events. In Spark, that logic would be coded in map() or similar such transformations and/or actions. Spark doesn't enforce a structure for capturing your processing logic like Storm does. Regards, Sab
Probably overloading the question a bit. In Storm, Bolts have the functionality of getting triggered on events. Is that kind of functionality possible with Spark Streaming? During each phase of the data processing, the transformed data is stored to the database and this transformed data should then be sent to a new pipeline for further
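To make the updateStateByKey approach under discussion concrete, here is a minimal Scala sketch of the billing use case. The Totals type, the "phone,minutes" socket feed (standing in for Kafka) and the checkpoint path are made up for illustration; this is a sketch of the pattern, not a recommended implementation.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical state type for the billing example.
case class Totals(minutes: Long, amount: Double)

object BillingTotals {
  // The rule from the thread: 0-500 minutes free, $1 a minute above that.
  def bill(totalMinutes: Long): Double = math.max(0L, totalMinutes - 500L).toDouble

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("billing").setMaster("local[2]"), Seconds(60))
    ssc.checkpoint("/tmp/billing-checkpoint") // updateStateByKey requires checkpointing

    // In the real pipeline this would be a Kafka stream; a socket stream of
    // "phone,minutes" lines keeps the sketch self-contained.
    val usage = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(phone, mins) = line.split(",")
      (phone, mins.toLong)
    }

    // Spark hashes events by key, so all events for one phone reach the same
    // state entry regardless of which executor received them. The cost Enno
    // describes: the entire state is rewritten at every checkpoint.
    val totals = usage.updateStateByKey[Totals] { (newMins: Seq[Long], state: Option[Totals]) =>
      val minutes = state.map(_.minutes).getOrElse(0L) + newMins.sum
      Some(Totals(minutes, bill(minutes)))
    }

    totals.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

On the rate limiting mentioned above: at the time, the knob was the spark.streaming.receiver.maxRate property (records per second per receiver); genuine back pressure (spark.streaming.backpressure.enabled) only arrived later, in Spark 1.5.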
Re: Spark or Storm
A stream can also be processed in micro-batches, which is the main idea behind Spark Streaming, so what is the difference? Ashish
On Wed, Jun 17, 2015 at 9:04 AM, Enno Shioji eshi...@gmail.com wrote:
Re: Spark or Storm
As per my best understanding, Spark Streaming offers exactly-once processing. Is this achieved only through updateStateByKey, or is there another way to do the same? Ashish
On Wed, Jun 17, 2015 at 8:48 AM, Enno Shioji eshi...@gmail.com wrote:
Re: Spark or Storm
Hi Ayan, Admittedly I haven't done much with Kinesis, but if I'm not mistaken you should be able to use their processor interface for that. In this example, it's incrementing a counter: https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/services/kinesis/samples/datavis/kcl/CountingRecordProcessor.java Instead of incrementing a counter, you could do your transformation and send it to HBase.
On Wed, Jun 17, 2015 at 1:40 PM, ayan guha guha.a...@gmail.com wrote:
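For a rough idea of the shape of such a record processor, here is a stripped-down Scala sketch. The RecordProcessor trait below is a simplified stand-in for the real KCL interfaces (which live in the amazon-kinesis-client library linked above and additionally hand you a checkpointer and a shutdown reason), and the HBase writer is a hypothetical callback, so treat this as illustration rather than working KCL code.

// Simplified stand-in for the KCL's record-processor interface.
trait RecordProcessor {
  def initialize(shardId: String): Unit
  def processRecords(records: Seq[Array[Byte]]): Unit
  def shutdown(): Unit
}

// Instead of incrementing a counter as in the linked sample, transform each
// record and hand it to a (hypothetical) HBase writer.
class TransformToHBase(putToHBase: String => Unit) extends RecordProcessor {
  def initialize(shardId: String): Unit =
    println(s"starting on shard $shardId")

  def processRecords(records: Seq[Array[Byte]]): Unit =
    records.map(bytes => new String(bytes, "UTF-8")).foreach { event =>
      val transformed = event.trim.toLowerCase // stand-in for the simple transformation
      putToHBase(transformed)
    }

  def shutdown(): Unit =
    println("shard consumer stopping")
}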
Re: Spark or Storm
PS just to elaborate on my first sentence: the reason Spark (not Streaming) can offer exactly-once semantics is because its update operation is idempotent. This is easy to do in a batch context because the input is finite, but it's harder in a streaming context.
On Wed, Jun 17, 2015 at 2:00 PM, Enno Shioji eshi...@gmail.com wrote:
Re: Spark or Storm
Processing stuff in batches is not the same thing as being transactional. If you look at Storm, it will e.g. skip tuples that were already applied to a state, to avoid counting stuff twice, etc. Spark doesn't come with such a facility, so you could end up counting twice, etc.
On Wed, Jun 17, 2015 at 2:09 PM, Ashish Soni asoni.le...@gmail.com wrote:
Re: Spark or Storm
So Spark (not Streaming) does offer exactly once. Spark Streaming, however, can only do exactly-once semantics *if the update operation is idempotent*. updateStateByKey's update operation is idempotent, because it completely replaces the previous state. So as long as you use Spark Streaming, you must somehow make the update operation idempotent. Replacing the entire state is the easiest way to do it, but it's obviously expensive. The alternative is to do something similar to what Storm does. At that point, though, you'll have to ask whether just using Storm is easier than that.
On Wed, Jun 17, 2015 at 1:50 PM, Ashish Soni asoni.le...@gmail.com wrote:
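To illustrate the distinction being drawn here, an idempotent output in DStream terms can look like the sketch below: the row key is derived from the record itself and the value is the running total (not an increment), so a replayed batch overwrites the same cell with the same value and duplicates are harmless. KeyValueSink is a hypothetical stand-in for, say, an HBase table.

import org.apache.spark.streaming.dstream.DStream

// Hypothetical key-value sink with plain overwrite semantics.
trait KeyValueSink extends Serializable {
  def put(key: String, value: String): Unit
}

object IdempotentOutput {
  // (phone -> running total) pairs: deterministic key, absolute value,
  // so writing the same batch twice changes nothing.
  def writeIdempotent(totals: DStream[(String, Long)], sink: KeyValueSink): Unit =
    totals.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        partition.foreach { case (phone, totalMinutes) =>
          sink.put(s"usage:$phone", totalMinutes.toString)
        }
      }
    }
}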
Re: Spark or Storm
Thanks for this. It's a KCL-based Kinesis application. But because it's just a Java application, we are thinking of using Spark on EMR or Storm for fault tolerance and load balancing. Is that a correct approach?
On 17 Jun 2015 23:07, Enno Shioji eshi...@gmail.com wrote:
Re: Spark or Storm
AFAIK KCL is *supposed* to provide fault tolerance and load balancing (plus, additionally, elastic scaling, unlike Storm), with Kinesis providing the coordination. My understanding is that it's like a naked Storm worker process that can consequently only do map. I haven't really used it though, so I can't really comment on how it compares to Spark/Storm. Maybe somebody else will be able to comment.
On Wed, Jun 17, 2015 at 3:13 PM, ayan guha guha.a...@gmail.com wrote:
Re: Spark or Storm
@Enno As per the latest version and documentation, Spark Streaming does offer exactly-once semantics using the improved Kafka integration. Note: I have not tested it yet. Any feedback will be helpful if anyone has tried the same. http://koeninger.github.io/kafka-exactly-once/#7 https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji eshi...@gmail.com wrote:
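For reference, the improved integration linked above is the direct stream added in Spark 1.3, which exposes the exact Kafka offsets covered by each batch so that you can persist them alongside your output. A minimal Scala sketch (the broker address and topic name are made up):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("direct"), Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.foreachRDD { rdd =>
      // Each batch knows exactly which offsets it covers; storing these
      // atomically with your results is what enables exactly-once downstream.
      for (o <- rdd.asInstanceOf[HasOffsetRanges].offsetRanges)
        println(s"${o.topic} [${o.partition}] ${o.fromOffset} -> ${o.untilOffset}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that this gives you the offsets; as the next reply stresses, committing them together with your output is still your job.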
Re: Spark or Storm
The thing is, even with that improvement, you still have to make updates idempotent or transactional yourself. If you read http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics which refers to the latest version, it says:

Semantics of output operations: Output operations (like foreachRDD) have *at-least once* semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure. While this is acceptable for saving to file systems using the saveAs***Files operations (as the file will simply get overwritten with the same data), additional effort may be necessary to achieve exactly-once semantics. There are two approaches.
- *Idempotent updates*: Multiple attempts always write the same data. For example, saveAs***Files always writes the same data to the generated files.
- *Transactional updates*: All updates are made transactionally so that updates are made exactly once, atomically. One way to do this would be the following. Use the batch time (available in foreachRDD) and the partition index of the transformed RDD to create an identifier. This identifier uniquely identifies a blob of data in the streaming application. Update the external system with this blob transactionally (that is, exactly once, atomically) using the identifier. That is, if the identifier is not already committed, commit the partition data and the identifier atomically. Else, if this was already committed, skip the update.

So either you make the update idempotent, or you have to make it transactional yourself, and the suggested mechanism is very similar to what Storm does.
On Wed, Jun 17, 2015 at 3:51 PM, Ashish Soni asoni.le...@gmail.com wrote:
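The transactional recipe from the docs excerpt translates into roughly the following Scala sketch. TransactionalStore and commitIfAbsent are hypothetical: how to get an atomic "commit these rows unless this id is already committed" operation depends entirely on the target datastore, which is exactly the point being made in this thread.

import org.apache.spark.TaskContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical store offering an atomic commit-if-absent operation.
trait TransactionalStore extends Serializable {
  def commitIfAbsent(blobId: String, rows: Iterator[String]): Unit
}

object ExactlyOnceOutput {
  def writeTransactionally(lines: DStream[String], store: TransactionalStore): Unit =
    lines.foreachRDD { (rdd, time) =>
      rdd.foreachPartition { partition =>
        // Batch time plus partition index uniquely identify this blob of
        // data, as the docs excerpt above prescribes; a replayed batch
        // produces the same id and is skipped by the store.
        val blobId = s"${time.milliseconds}-${TaskContext.get.partitionId}"
        store.commitIfAbsent(blobId, partition)
      }
    }
}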
Re: Spark or Storm
Actually the reverse. Spark Streaming is really a micro-batch system where the smallest window is half a second (500ms). So for CEP, it's not really a good idea. So in terms of options... Spark Streaming, Storm, Samza, Akka and others... Storm is probably the easiest to pick up; Spark Streaming / Akka may give you more flexibility, and Akka would work for CEP. Just my $0.02
On Jun 16, 2015, at 9:40 PM, Spark Enthusiast sparkenthusi...@yahoo.in wrote:
Spark or Storm
Hi All, I am evaluating Spark (Spark Streaming) vs. Storm, and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help will be appreciated. Thanks, Ashish
Re: Spark or Storm
I have a similar scenario where we need to bring data from Kinesis to HBase. Data velocity is 20k per 10 mins. A little manipulation of the data will be required, but that's regardless of the tool, so we will be writing that piece as Java POJOs. The whole environment is on AWS; HBase is on a long-running EMR cluster and Kinesis on a separate cluster. TIA. Best, Ayan On 17 Jun 2015 12:13, Will Briggs wrbri...@gmail.com wrote: The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API rather than with the raw Bolt API. Even then, there are significant differences, but it's a bit closer. If you can share your use case, we might be able to provide better guidance. Regards, Will On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote: Hi All, I am evaluating Spark (Spark Streaming) vs. Storm, and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help will be appreciated. Thanks, Ashish
Re: Spark or Storm
I have a use case where a stream of incoming events has to be aggregated and joined to create complex events. The aggregation will have to happen at an interval of 1 minute (or less). The pipeline is: Upstream services --(send events)--> KAFKA --> Event Stream Processor --(enrich event)--> Complex Event Processor --> Elastic Search. From what I understand, Storm will make a very good ESP and Spark Streaming will make a good CEP. But we are also evaluating Storm with Trident. How does Spark Streaming compare with Storm with Trident? Sridhar Chellappa On Wednesday, 17 June 2015 10:02 AM, ayan guha guha.a...@gmail.com wrote: I have a similar scenario where we need to bring data from Kinesis to HBase. Data velocity is 20k per 10 mins. A little manipulation of the data will be required, but that's regardless of the tool, so we will be writing that piece as Java POJOs. The whole environment is on AWS; HBase is on a long-running EMR cluster and Kinesis on a separate cluster. TIA. Best, Ayan On 17 Jun 2015 12:13, Will Briggs wrbri...@gmail.com wrote: The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API rather than with the raw Bolt API. Even then, there are significant differences, but it's a bit closer. If you can share your use case, we might be able to provide better guidance. Regards, Will On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote: Hi All, I am evaluating Spark (Spark Streaming) vs. Storm, and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help will be appreciated. Thanks, Ashish
Re: Spark or Storm
Probably overloading the question a bit. In Storm, Bolts have the functionality of getting triggered on events. Is that kind of functionality possible with Spark Streaming? During each phase of the data processing, the transformed data is stored to the database, and this transformed data should then be sent to a new pipeline for further processing. How can this be achieved using Spark? (A sketch of one way follows below.) On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast sparkenthusi...@yahoo.in wrote: I have a use case where a stream of incoming events has to be aggregated and joined to create complex events. The aggregation will have to happen at an interval of 1 minute (or less). The pipeline is: Upstream services --(send events)--> KAFKA --> Event Stream Processor --(enrich event)--> Complex Event Processor --> Elastic Search. From what I understand, Storm will make a very good ESP and Spark Streaming will make a good CEP. But we are also evaluating Storm with Trident. How does Spark Streaming compare with Storm with Trident? Sridhar Chellappa On Wednesday, 17 June 2015 10:02 AM, ayan guha guha.a...@gmail.com wrote: I have a similar scenario where we need to bring data from Kinesis to HBase. Data velocity is 20k per 10 mins. A little manipulation of the data will be required, but that's regardless of the tool, so we will be writing that piece as Java POJOs. The whole environment is on AWS; HBase is on a long-running EMR cluster and Kinesis on a separate cluster. TIA. Best, Ayan On 17 Jun 2015 12:13, Will Briggs wrbri...@gmail.com wrote: The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API rather than with the raw Bolt API. Even then, there are significant differences, but it's a bit closer. If you can share your use case, we might be able to provide better guidance. Regards, Will On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote: Hi All, I am evaluating Spark (Spark Streaming) vs. Storm, and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help will be appreciated. Thanks, Ashish
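For what it's worth, the "store each phase, then feed the next pipeline" pattern maps naturally onto DStream operations, since they build a DAG rather than consuming the stream: the same transformed DStream can both be written out via foreachRDD and be transformed further. A minimal sketch in Scala; saveToDatabase is a hypothetical sink, not a real API:

import org.apache.spark.streaming.dstream.DStream

// Hypothetical sink; replace with your actual database writer.
def saveToDatabase(records: Iterator[String]): Unit = ???

def pipeline(events: DStream[String]): Unit = {
  val stage1 = events.map(_.trim.toUpperCase)   // first processing "phase"

  // Side-effecting output, roughly what a Bolt does when it writes to a store.
  stage1.foreachRDD { rdd =>
    rdd.foreachPartition(saveToDatabase)        // one writer per partition
  }

  // The same transformed stream also feeds the next "pipeline" for further work.
  val stage2 = stage1.filter(_.nonEmpty).map(s => (s, 1L)).reduceByKey(_ + _)
  stage2.print()
}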
Re: Spark or Storm
The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API rather than with the raw Bolt API. Even then, there are significant differences, but it's a bit closer. If you can share your use case, we might be able to provide better guidance. Regards, Will On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote: Hi All, I am evaluating Spark (Spark Streaming) vs. Storm, and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help will be appreciated. Thanks, Ashish
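As a rough analogy only: what a chain of Bolts does tuple by tuple in Storm, Spark Streaming expresses as transformations applied to each micro-batch. A minimal word-count sketch in DStream terms; the Bolt labels in the comments are my own mapping, not official terminology:

import org.apache.spark.streaming.dstream.DStream

def wordCount(lines: DStream[String]): DStream[(String, Long)] =
  lines.flatMap(_.split("\\s+"))   // ~ a splitter Bolt
       .map(w => (w, 1L))          // ~ a pair-emitting Bolt
       .reduceByKey(_ + _)         // ~ a counting Bolt (per batch)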