Re: RE: Spark or Storm
I agree with Cody. It's pretty hard for any framework to provide built-in support for that, since the semantics depend entirely on which data store you want to use it with. Providing interfaces does help a little, but even with those interfaces the user still has to do most of the heavy lifting: the user has to understand what is actually going on AND implement all the code needed to ensure a unique ID and atomic updates, according to the capabilities and APIs provided by the data store.
Re: RE: Spark or Storm
Tbh I find the doc around this a bit confusing. If it says "end-to-end exactly-once semantics (if your updates to downstream systems are idempotent or transactional)", I think most people will interpret it as: as long as you use a storage which has atomicity (like MySQL/Postgres etc.), a successful output operation for a given batch (let's say "+5") is going to be issued exactly once against the storage.

However, as I understand it, that's not what this statement means. What it is saying is that it will always issue "+5" and never, say, "+6", because it makes sure a message is processed exactly once internally. However, it *may* issue "+5" more than once for a given batch, and it is up to the developer to deal with this by either making the output operation idempotent (e.g. "set 5") or transactional (e.g. keep track of batch IDs and skip already-applied batches).

I wonder if it makes more sense to drop "or transactional" from the statement, because if you think about it, ultimately what you are asked to do is to make the writes idempotent even with the transactional approach; "transactional" is a bit loaded and would be prone to lead to misunderstandings (even though, in fairness, the fault-tolerance chapter explicitly explains it).
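To make the idempotent option above concrete, here is a minimal sketch (not from the thread): write an absolute value under a deterministic key, so a replayed batch overwrites the same rows instead of incrementing them again. The JDBC URL, the `counts` table, and the function name are hypothetical; the upsert syntax shown is MySQL's.

```scala
import java.sql.DriverManager

// Idempotent output ("set 5" rather than "+5"): replaying the same batch
// rewrites the same rows, so duplicate deliveries are harmless.
// Assumes a hypothetical table:
//   CREATE TABLE counts(k VARCHAR(64) PRIMARY KEY, v BIGINT)
def writeIdempotent(jdbcUrl: String, results: Iterator[(String, Long)]): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO counts(k, v) VALUES (?, ?) ON DUPLICATE KEY UPDATE v = VALUES(v)")
    results.foreach { case (key, value) =>
      stmt.setString(1, key)
      stmt.setLong(2, value)
      stmt.executeUpdate()
    }
  } finally {
    conn.close()
  }
}
```

Called from foreachPartition inside foreachRDD, a write like this can be retried on batch replay without changing the stored result.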
Re: RE: Spark or Storm
My understanding of exactly-once semantics is that it is handled by the framework itself, but it is not very clear from the documentation. I believe the documentation needs to be updated with a simple example so that it is clear to the end user. This is very critical when someone is evaluating the framework and does not have enough time to validate all the use cases, but has to rely on the documentation.

Ashish
RE: RE: Spark or Storm
My question is not directly related. About the exactly-once semantics: the document (copied below) says Spark Streaming gives exactly-once semantics, but from my test results, with checkpointing enabled, the application always re-processes the files in the last batch after a graceful restart.

==
Semantics of Received Data
Different input sources provide different guarantees, ranging from at-least once to exactly once. Read for more details.
With Files
If all of the input data is already present in a fault-tolerant file system like HDFS, Spark Streaming can always recover from any failure and process all of the data. This gives exactly-once semantics, meaning all of the data will be processed exactly once no matter what fails.

From: Enno Shioji [mailto:eshi...@gmail.com]
Sent: Friday, June 19, 2015 5:29 PM
To: Tathagata Das
Subject: Re: RE: Spark or Storm

Fair enough; on second thought, just saying that it should be idempotent is indeed more confusing. I guess the crux of the confusion comes from the fact that people tend to assume the work you described (store batch id and skip, etc.) is handled by the framework, perhaps partly because Storm Trident does handle it (you just need to let Storm know whether the output operation succeeded, and it handles the batch-id storing and skipping business). Whenever I explain to people that one needs to do this additional work you described to get end-to-end exactly-once semantics, it usually takes a while to convince them. In my limited experience, they tend to interpret "transactional" in that sentence to mean that you just have to write to transactional storage like an ACID RDB. Pointing them to "Semantics of output operations" is usually sufficient, though. Maybe others like @Ashish can weigh in on this; did you interpret it in this way?

What if we change the statement to: "end-to-end exactly-once semantics (if your updates to downstream systems are idempotent or transactional). To learn how to make your updates idempotent or transactional, see the 'Semantics of output operations' section in this chapter: https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics". That way, it's clear that it's not sufficient to merely write to transactional storage like an ACID store.
Re: RE: Spark or Storm
I think your observation is correct: you have to take care of the replayed data on your end, e.g. give each message a unique id or something similar. I say "I think" in the above sentence because I am not sure, and I also have a related question: I am wondering how direct stream + Kafka behaves when the driver goes down and is restarted. Will it always first replay the checkpointed failed batch, or will it honor Kafka's offset reset policy (auto.offset.reset)? If it honors the reset policy and it is set to "smallest", then it is at-least-once semantics; if it is set to "largest", then it will be at-most-once semantics?

bit1...@163.com
Re: RE: Spark or Storm
If the current documentation is confusing, we can definitely improve it. However, I do not understand why the term "transactional" is confusing. If your output operation has to "add 5", then the user has to implement the following mechanism:

1. If the unique id of the batch of data is already present in the store, skip the update.
2. Otherwise, atomically do both: the update operation and storing the unique id of the batch.

This is pretty much the definition of a transaction. The user has to be aware of the transactional semantics of the data store while implementing this functionality. You CAN argue that this effectively makes the whole update sort of idempotent, as even if you try doing it multiple times, it will update only once. But that is not what is generally considered idempotent: writing a fixed count, not an increment, is usually what is called idempotent. So just mentioning that the output operation must be "idempotent" is, in my opinion, more confusing.

To take a page out of the Storm / Trident guide, even they call this exact conditional updating of Trident State a "transactional" operation. See "transactional spout" in the Trident State guide: https://storm.apache.org/documentation/Trident-state

In the end, I am totally open to suggestions and PRs on how to make the programming guide easier to understand. :)

TD
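As a rough illustration of the two-step mechanism TD outlines (a sketch under assumed names, not anything from the Spark codebase), the skip-or-apply logic can live in a single database transaction. The `totals` and `applied_batches` tables and the `applyOnce` helper are hypothetical.

```scala
import java.sql.Connection

// Skip if this batch id was already applied; otherwise apply the increment
// AND record the batch id inside the same database transaction.
// Hypothetical schema:
//   CREATE TABLE totals(k VARCHAR(64) PRIMARY KEY, total BIGINT)
//   CREATE TABLE applied_batches(batch_id VARCHAR(128) PRIMARY KEY)
def applyOnce(conn: Connection, batchId: String, key: String, delta: Long): Unit = {
  conn.setAutoCommit(false)
  try {
    val seen = conn.prepareStatement("SELECT 1 FROM applied_batches WHERE batch_id = ?")
    seen.setString(1, batchId)
    if (!seen.executeQuery().next()) {   // 1. batch id already present? then skip
      val upd = conn.prepareStatement("UPDATE totals SET total = total + ? WHERE k = ?")
      upd.setLong(1, delta)
      upd.setString(2, key)
      upd.executeUpdate()                // 2a. the update operation...
      val mark = conn.prepareStatement("INSERT INTO applied_batches(batch_id) VALUES (?)")
      mark.setString(1, batchId)
      mark.executeUpdate()               // 2b. ...and the batch id, committed atomically
    }
    conn.commit()
  } catch {
    case e: Exception =>
      conn.rollback()
      throw e
  }
}
```

Because both writes commit or roll back together, a replayed batch either finds its id recorded and skips, or re-runs the whole transaction, which is exactly the "transactional" behavior being discussed.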
Re: RE: Spark or Storm
http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics — "semantics of output operations" section. Is this really not clear?

As for the general tone of "why doesn't the framework do it for you": in my opinion, this is essential complexity for delivery semantics in a distributed system, not incidental complexity. You need to actually understand and be responsible for what's going on, unless you're talking about very narrow use cases (i.e. outputting to a known datastore with known semantics and schema).
Re: RE: Spark or Storm
auto.offset.reset only applies when there are no starting offsets (either from a checkpoint, or from you providing them explicitly).
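A small sketch of what that means for the Kafka 0.8 direct stream configuration (the broker list is hypothetical): auto.offset.reset only decides where to start when nothing else supplies offsets.

```scala
// Only consulted when no starting offsets exist; offsets recovered from a
// checkpoint (or passed explicitly as fromOffsets) take precedence.
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092,broker2:9092", // hypothetical brokers
  "auto.offset.reset"    -> "smallest"                   // "smallest" or "largest" in Kafka 0.8
)
```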
Re: RE: Spark or Storm
I am wondering how the direct stream API ensures end-to-end exactly-once semantics. I think there are two things involved:

1. On the Spark Streaming end, the driver will replay the offset range when it goes down and is restarted, which means that the new tasks will process some already-processed data.
2. On the user end, since tasks may process already-processed data, the user should detect that some data has already been processed, e.g. by using some unique ID.

Not sure if I have understood correctly.

bit1...@163.com
Re: RE: Spark or Storm
That general description is accurate, but it's not really a specific issue of the direct stream. It applies to anything consuming from Kafka (or, as Matei already said, any streaming system really). You can't have exactly-once semantics unless you know something more about how you're storing results. For some unique id, topic/partition and offset is usually the obvious choice, which is why it's important that the direct stream gives you access to the offsets.

See https://github.com/koeninger/kafka-exactly-once for more info.
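For reference, a minimal sketch of reading those offsets from the Spark 1.3 direct stream (the broker and topic names are hypothetical); the (topic, partition, offset) triple is what can serve as the unique id mentioned above.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object OffsetDemo {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("direct-offsets"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))                              // hypothetical topic

    stream.foreachRDD { rdd =>
      // Each batch knows exactly which offsets it covers, so
      // (topic, partition, untilOffset) can serve as a unique id for output.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(r => println(s"${r.topic}/${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```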
RE: RE: Spark or Storm
More details on the Direct API of Spark 1.3 are in the Databricks blog: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

Note the use of checkpoints to persist the Kafka offsets in Spark Streaming itself, rather than in ZooKeeper. Also this statement: ".. This allows one to build a Spark Streaming + Kafka pipeline with end-to-end exactly-once semantics (if your updates to downstream systems are idempotent or transactional)."
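A short sketch of the checkpoint-based driver recovery this relies on, using StreamingContext.getOrCreate (the checkpoint path and app name are hypothetical): on restart, the context, including the stored Kafka offsets, is rebuilt from the checkpoint directory rather than from ZooKeeper.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Offsets for the direct stream live in Spark's own checkpoints, so the
// driver must be recreated from the checkpoint directory to recover them.
val checkpointDir = "hdfs:///checkpoints/kafka-app" // hypothetical path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("kafka-checkpointed"), Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... set up the direct stream and output operations here ...
  ssc
}

// Rebuilds from the checkpoint if one exists; otherwise creates a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```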
RE: Spark or Storm
not being able to read from Kafka using multiple nodes Kafka is plenty capable of doing this.. I faced the same issue before Spark 1.3 was released. The issue was not with Kafka, but with Spark Streaming's Kafka connector. Before the Spark 1.3.0 release, one Spark worker would get all the streamed messages. We had to re-partition to distribute the processing. From the Spark 1.3.0 release, the Spark Direct API for Kafka supported parallel reads from Kafka streamed to Spark workers. See "Approach 2: Direct Approach" in this page: http://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html. Note that it also mentions zero data loss and exactly-once semantics for Kafka integration. Prajod From: Jordan Pilat [mailto:jrpi...@gmail.com] Sent: 18 June 2015 03:57 To: Enno Shioji Cc: Will Briggs; asoni.le...@gmail.com; ayan guha; user; Sateesh Kavuri; Spark Enthusiast; Sabarish Sasidharan Subject: Re: Spark or Storm not being able to read from Kafka using multiple nodes Kafka is plenty capable of doing this, by clustering multiple consumer instances together into a consumer group. If your topic is sufficiently partitioned, the consumer group can consume the topic in a parallelized fashion. If it isn't, you still have the fault tolerance associated with clustering the consumers. OK JRP On Jun 17, 2015 1:27 AM, Enno Shioji eshi...@gmail.com wrote: We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm. Some of the important drawbacks are: Spark has no back pressure (the receiver rate limit can alleviate this to a certain point, but it's far from ideal). There are also no exactly-once semantics (updateStateByKey can achieve these semantics, but is not practical if you have any significant amount of state, because it does so by dumping the entire state on every checkpoint). There are also some minor drawbacks that I'm sure will be fixed quickly, like no task timeout, not being able to read from Kafka using multiple nodes, and a data-loss hazard with Kafka. It's also not possible to attain very low latency in Spark, if that's what you need. The plus for Spark is the concise and IMO more intuitive syntax, especially if you compare it with Storm's Java API. I admit I might be a bit biased towards Storm tho, as I'm more familiar with it. Also, you can do some processing with Kinesis. If all you need to do is a straightforward transformation and you are reading from Kinesis to begin with, it might be an easier option to just do the transformation in Kinesis. On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: Whatever you write in bolts would be the logic you want to apply on your events. In Spark, that logic would be coded in map() or similar such transformations and/or actions. Spark doesn't enforce a structure for capturing your processing logic the way Storm does. Regards Sab Probably overloading the question a bit. In Storm, Bolts have the functionality of getting triggered on events. Is that kind of functionality possible with Spark Streaming? During each phase of the data processing, the transformed data is stored to the database, and this transformed data should then be sent to a new pipeline for further processing. How can this be achieved using Spark? On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast sparkenthusi...@yahoo.in wrote: I have a use case where a stream of incoming events has to be aggregated and joined to create complex events.
The aggregation will have to happen at an interval of 1 minute (or less). The pipeline is: Upstream services --(send events)--> KAFKA --(enrich event)--> Event Stream Processor --> Complex Event Processor --> Elastic Search. From what I understand, Storm will make a very good ESP and Spark Streaming will make a good CEP. But we are also evaluating Storm with Trident. How does Spark Streaming compare with Storm with Trident? Sridhar Chellappa On Wednesday, 17 June 2015 10:02 AM, ayan guha guha.a...@gmail.com wrote: I have a similar scenario where we need to bring data from Kinesis to HBase. Data velocity is 20k per 10 mins. Little manipulation of data will be required, but that's regardless of the tool, so we will be writing that piece as a Java POJO. All env is on AWS. HBase is on a long-running EMR and Kinesis on a separate cluster. TIA. Best Ayan On 17 Jun 2015 12:13, Will Briggs wrbri...@gmail.com wrote: The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API, rather than with the raw Bolt API. Even then, there are significant differences, but it's a bit closer. If you can share your use case, we might be able to provide better guidance. Regards, Will
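On Sabarish's point in the chain above that bolt logic becomes ordinary transformations: a minimal sketch, assuming a simple enrich-then-store pipeline; the socket source and the enrich and writeToStore helpers are hypothetical stand-ins:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BoltLogicAsTransformations {
      // Hypothetical stand-ins for the user's per-event logic and sink.
      def enrich(event: String): String = event.toUpperCase
      def writeToStore(events: Iterator[String]): Unit = ()

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("bolt-as-map"), Seconds(60))
        val events = ssc.socketTextStream("localhost", 9999) // any input DStream works here

        // What a bolt would do per tuple becomes a map over the stream...
        val enriched = events.map(enrich)
        // ...and "emit to the next stage" becomes a further transformation
        // or an output operation.
        enriched.foreachRDD(_.foreachPartition(writeToStore))

        ssc.start()
        ssc.awaitTermination()
      }
    }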
Re: Spark or Storm
Again, by Storm, you mean Storm Trident, correct? On Wednesday, 17 June 2015 10:09 PM, Michael Segel msegel_had...@hotmail.com wrote: Actually the reverse. Spark Streaming is really a micro-batch system where the smallest window is half a second (500 ms). So for CEP, it's not really a good idea. So in terms of options... spark streaming, storm, samza, akka and others... Storm is probably the easiest to pick up; spark streaming / akka may give you more flexibility, and akka would work for CEP. Just my $0.02 On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote: Hi All, I am evaluating Spark vs. Storm (Spark Streaming) and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help will be appreciated on this. Thanks, Ashish
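For the one-minute aggregation in the use case above, Spark Streaming would express it as a window over micro-batches (per Michael's point, the batch interval sets the floor on latency). A sketch, assuming (key, value) events on a placeholder socket source:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object OneMinuteAggregation {
      def main(args: Array[String]): Unit = {
        // Micro-batches every 5 seconds; results emitted over a sliding 1-minute window.
        val ssc = new StreamingContext(new SparkConf().setAppName("one-minute-agg"), Seconds(5))
        ssc.checkpoint("/tmp/checkpoints") // required by the inverse-function variant below

        val events = ssc.socketTextStream("localhost", 9999)
          .map { line => val Array(k, v) = line.split(","); (k, v.toLong) }

        // Sums per key over the last 60 seconds, recomputed every 10 seconds.
        // The inverse function (_ - _) lets Spark update the window incrementally.
        val sums = events.reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))
        sums.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }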
Re: Spark or Storm
This documentation is only for writes to an external system, but all the counting you do within your streaming app (e.g. if you use reduceByKeyAndWindow to keep track of a running count) is exactly-once. When you write to a storage system, no matter which streaming framework you use, you'll have to make sure the writes are idempotent, because the storage system can't know whether you meant to write the same data again or not. But the place where Spark Streaming helps over Storm, etc. is for tracking state within your computation. Without that facility, you'd not only have to make sure that writes are idempotent, but you'd have to make sure that updates to your own internal state (e.g. reduceByKeyAndWindow) are exactly-once too. Matei On Jun 17, 2015, at 8:26 AM, Enno Shioji eshi...@gmail.com wrote: The thing is, even with that improvement, you still have to make updates idempotent or transactional yourself. If you read http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics which refers to the latest version, it says: Semantics of output operations Output operations (like foreachRDD) have at-least-once semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure. While this is acceptable for saving to file systems using the saveAs***Files operations (as the file will simply get overwritten with the same data), additional effort may be necessary to achieve exactly-once semantics. There are two approaches. Idempotent updates: Multiple attempts always write the same data. For example, saveAs***Files always writes the same data to the generated files. Transactional updates: All updates are made transactionally so that updates are made exactly once, atomically. One way to do this would be the following. Use the batch time (available in foreachRDD) and the partition index of the transformed RDD to create an identifier. This identifier uniquely identifies a blob of data in the streaming application. Update the external system with this blob transactionally (that is, exactly once, atomically) using the identifier. That is, if the identifier is not already committed, commit the partition data and the identifier atomically. Else, if this was already committed, skip the update. So either you make the update idempotent, or you have to make it transactional yourself, and the suggested mechanism is very similar to what Storm does. On Wed, Jun 17, 2015 at 3:51 PM, Ashish Soni asoni.le...@gmail.com wrote: @Enno As per the latest version and documentation, Spark Streaming does offer exactly-once semantics using the improved Kafka integration. No, I have not tested it yet. Any feedback will be helpful if anyone has tried the same. http://koeninger.github.io/kafka-exactly-once/#7 https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji eshi...@gmail.com wrote: AFAIK KCL is *supposed* to provide fault tolerance and load balancing (plus, additionally, elastic scaling, unlike Storm), with Kinesis providing the coordination. My understanding is that it's like a naked Storm worker process that can consequently only do map.
I haven't really used it tho, so I can't really comment on how it compares to Spark/Storm. Maybe somebody else will be able to comment. On Wed, Jun 17, 2015 at 3:13 PM, ayan guha guha.a...@gmail.com wrote: Thanks for this. It's a KCL-based Kinesis application. But because it's just a Java application, we are thinking to use Spark on EMR or Storm for fault tolerance and load balancing. Is it a correct approach? On 17 Jun 2015 23:07, Enno Shioji eshi...@gmail.com wrote: Hi Ayan, Admittedly I haven't done much with Kinesis, but if I'm not mistaken you should be able to use their processor interface for that. In this example, it's incrementing a counter: https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/services/kinesis/samples/datavis/kcl/CountingRecordProcessor.java Instead of incrementing a counter, you could do your transformation and send it to HBase.
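A minimal sketch of the "transactional updates" recipe quoted above (batch time plus partition index as the identifier). The commitAtomically helper is hypothetical; a real version would do the existence check and the write in one transaction in the target store:

    import org.apache.spark.{SparkConf, TaskContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TransactionalUpdates {
      // Hypothetical: writes data + id atomically if `id` is new;
      // a no-op if `id` was already committed by an earlier attempt.
      def commitAtomically(id: String, records: Seq[String]): Unit = ()

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("txn-updates"), Seconds(5))
        val lines = ssc.socketTextStream("localhost", 9999)

        lines.foreachRDD { (rdd, batchTime) =>
          rdd.foreachPartition { iter =>
            // Batch time + partition index uniquely identifies this blob of data,
            // even when the batch is replayed after a failure.
            val id = s"${batchTime.milliseconds}-${TaskContext.get.partitionId}"
            commitAtomically(id, iter.toSeq)
          }
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }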
Re: Spark or Storm
Hi Matei, Ah, can't get more accurate than from the horse's mouth... If you don't mind helping me understand it correctly: from what I understand, Storm Trident does the following (when used with Kafka): 1) Sit on the Kafka Spout and create batches 2) Assign a global sequential ID to the batches 3) Make sure that all results of processed batches are written once to TridentState, *in order* (for example, by skipping batches that were already applied once, ultimately by using Zookeeper) TridentState is an interface that you have to implement, and the underlying storage has to be transactional for this to work. The necessary skipping etc. is handled by Storm. In the case of Spark Streaming, I understand that 1) There is no global ordering; e.g. an output operation for a batch consisting of offsets [4,5,6] can be invoked before the operation for offsets [1,2,3] 2) If you wanted to achieve something similar to what TridentState does, you'd have to do it yourself (for example using Zookeeper) Is this a correct understanding?
RE: Spark or Storm
The only thing which doesn't make much sense in Spark Streaming (and I am not saying it is done better in Storm) is the iterative and redundant shipping of essentially the same tasks (closures/lambdas/functions) to the cluster nodes AND re-launching them there again and again. This is a legacy from Spark batch, where such an approach DOES make sense. So to recap: in Spark Streaming, the driver keeps serializing and transmitting the same tasks (comprising a job) for every new DStream RDD, which then get re-launched, i.e. new JVM threads launched within each executor (JVM), and then the tasks report their final execution status to the driver (only the last has real functional purpose). An optimization (provided Spark Streaming were implemented from scratch) could be to launch the stages/tasks of a streaming job as constantly running threads (daemons/agents) within the executors and let the DStream RDDs stream through them, so that only the final execution status for each DStream RDD and some periodic heartbeats (of the agents) are reported to the driver. Essentially this gives you a pipeline architecture (of strung-together agents), which is a well-known parallel programming pattern, especially suitable for streaming data.
Re: Spark or Storm
To add more information beyond what Matei said and answer the original question, here are other things to consider when comparing Spark Streaming and Storm. * Unified programming model and semantics - On most occasions you have to process the same data again in batch jobs. If you have two separate systems for batch and streaming, it's much, much harder to share the code. You will have to deal with different processing models, with their own semantics. Compare Storm's join vs. doing a usual batch join, whereas Spark and Spark Streaming share the same join semantics, as they are based on the same RDD model underneath. * Integration with the Spark ecosystem - Many people really want to go beyond basic streaming ETL and into advanced streaming analytics. - Combine stream processing with static datasets - Apply dynamically updated machine learning models (i.e. offline learning and online prediction, or even continuous learning and prediction) - Apply DataFrame and SQL operations to streams These things are pretty easy to do with the Spark ecosystem. * Operational management - You have to consider the operational cost of managing two separate systems for batch and stream processing (with their own deployment models), vs. managing one single engine with one deployment model. * Performance - According to Intel's independent study, Spark Streaming in Kafka direct mode can have 2.5-3x the throughput of Trident with a 0.5 GB batch size. And at that batch size, the latency of Trident is 30 seconds, compared to 1.5 seconds for Spark Streaming. This is from a talk that Intel gave at the Spark Summit (https://spark-summit.org/2015/) two days ago. Slides will be available soon, but here is a sneak peek - throughput - http://i.imgur.com/u6pf4rB.png and latency - http://imgur.com/c46MJ4i I will post the link to the slides when it comes out, hopefully next week. On Wed, Jun 17, 2015 at 11:55 AM, Matei Zaharia matei.zaha...@gmail.com wrote: The major difference is that in Spark Streaming, there's no *need* for a TridentState for state inside your computation. All the stateful operations (reduceByWindow, updateStateByKey, etc.) automatically handle exactly-once processing, keeping updates in order, etc. Also, you don't need to run a separate transactional system (e.g. MySQL) to store the state. After your computation runs, if you want to write the final results (e.g. the counts you've been tracking) to a storage system, you use one of the output operations (saveAsFiles, foreach, etc.). Those actually will run in order, but some might run multiple times if nodes fail, thus attempting to write the same state again. You can read about how it works in this research paper: http://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf. Matei
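To illustrate Matei's "no need for a TridentState" point: a sketch of a running count kept inside the computation with updateStateByKey, on a placeholder socket source. The state updates themselves are handled exactly-once by Spark; only writes out of output operations would still need to be made idempotent:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object RunningCountState {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("running-count"), Seconds(5))
        ssc.checkpoint("/tmp/checkpoints") // stateful operations require checkpointing

        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

        // Spark tracks this state for us; on failure it is recovered from the
        // checkpoint rather than from an external transactional store.
        val counts = words.map((_, 1L)).updateStateByKey[Long] {
          (newValues: Seq[Long], state: Option[Long]) =>
            Some(state.getOrElse(0L) + newValues.sum)
        }
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }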
Re: Spark or Storm
My use case is below. We are going to receive a lot of events as a stream (basically a Kafka stream), and then we need to process and compute. Consider you have a phone contract with ATT, and every call / SMS / data usage you do is an event, and then it needs to calculate your bill on a real-time basis, so when you log in to your account you can see all those variables, such as how much you used, how much is left, and what your bill is to date. Also, there are different rules which need to be considered when you calculate the total bill; one simple rule will be that 0-500 minutes are free, but above that it is $1 a minute. How do I maintain a shared state (total amount, total minutes, total data, etc.) so that I know how much I have accumulated at any given point, as events for the same phone can go to any node / executor? Can someone please tell me how I can achieve this in Spark, as in Storm I can have a bolt which can do this? Thanks,
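For what it's worth, the shared per-phone state asked about here is the sort of thing updateStateByKey covers (with the caveats about large state raised elsewhere in this thread). A rough sketch, assuming events parsed into (phone, minutes) pairs and the 0-500-free rule from the example; the source and parsing are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BillingState {
      case class Usage(minutes: Long, bill: Double)

      // Hypothetical rule from the example: first 500 minutes free, then $1/min.
      def billFor(totalMinutes: Long): Double = math.max(0L, totalMinutes - 500) * 1.0

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("billing"), Seconds(10))
        ssc.checkpoint("/tmp/checkpoints")

        // Assume lines like "phone,minutes"; a Kafka DStream would slot in here.
        val events = ssc.socketTextStream("localhost", 9999)
          .map { line => val Array(phone, mins) = line.split(","); (phone, mins.toLong) }

        // Events for the same phone are grouped by key, so the accumulated state
        // is consistent no matter which executor processed each event.
        val totals = events.updateStateByKey[Usage] { (mins: Seq[Long], state: Option[Usage]) =>
          val total = state.map(_.minutes).getOrElse(0L) + mins.sum
          Some(Usage(total, billFor(total)))
        }
        totals.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }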
Re: Spark or Storm
When you say Storm, did you mean Storm with Trident, or Storm? My use case does not have simple transformations. There are complex events that need to be generated by joining the incoming event stream. Also, what do you mean by "no back pressure"?
Re: Spark or Storm
I guess both. In terms of syntax, I was comparing it with Trident. If you are joining, Spark Streaming actually does offer a windowed join out of the box. We couldn't use this though, as our event streams can grow out of sync, so we had to implement something on top of Storm. If your event streams don't become out of sync, you may find the built-in join in Spark Streaming useful. Storm also has a join keyword, but its semantics are different. Also, what do you mean by "no back pressure"? So when a topology is overloaded, Storm is designed so that it will stop reading from the source. Spark, on the other hand, will keep reading from the source and spilling it internally. This may be fine, in fairness, but it does mean you have to worry about the persistent store usage in the processing cluster, whereas with Storm you don't have to worry, because the messages just remain in the data store. Spark came up with the idea of rate limiting, but I don't feel this is as nice as back pressure, because it's very difficult to tune it such that you don't cap the cluster's processing power and yet it still prevents the persistent storage from getting used up.
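The windowed join mentioned above is available out of the box. A sketch joining two keyed streams over aligned one-minute windows; the sources are placeholders, and as noted this only helps if the streams stay roughly in sync:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedJoin {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("windowed-join"), Seconds(10))

        // Placeholder sources; in practice these would be Kafka/Kinesis streams
        // keyed by whatever id the two event types share.
        val clicks = ssc.socketTextStream("localhost", 9999)
          .map { line => val Array(id, v) = line.split(","); (id, v) }
        val impressions = ssc.socketTextStream("localhost", 9998)
          .map { line => val Array(id, v) = line.split(","); (id, v) }

        // Join events that fall within the same one-minute window of each stream.
        val joined = clicks.window(Seconds(60)).join(impressions.window(Seconds(60)))
        joined.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }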
Re: Spark or Storm
Great discussion!! One question about a comment: "Also, you can do some processing with Kinesis. If all you need to do is straightforward transformation and you are reading from Kinesis to begin with, it might be an easier option to just do the transformation in Kinesis" - Do you mean a KCL application? Or some kind of processing within Kinesis? Can you kindly share a link? I would definitely pursue this route, as our transformations are really simple. Best
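Re "processing within Kinesis": Enno's pointer earlier in the thread is to the KCL, where per-record logic lives in an IRecordProcessor. A rough sketch against the KCL 1.x interface of the time; treat the exact package names as approximate, and the HBase write is a hypothetical helper:

    import java.util.{List => JList}
    import com.amazonaws.services.kinesis.clientlibrary.interfaces.{
      IRecordProcessor, IRecordProcessorCheckpointer}
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason
    import com.amazonaws.services.kinesis.model.Record
    import scala.collection.JavaConverters._

    class TransformToHBaseProcessor extends IRecordProcessor {
      // Hypothetical sink; a real one would batch Puts to an HBase table.
      private def writeToHBase(row: String): Unit = ()

      override def initialize(shardId: String): Unit = ()

      override def processRecords(records: JList[Record],
                                  checkpointer: IRecordProcessorCheckpointer): Unit = {
        for (record <- records.asScala) {
          // Assumes an array-backed buffer, as in the AWS sample linked above.
          val payload = new String(record.getData.array(), "UTF-8")
          writeToHBase(payload.toUpperCase) // stand-in for the "simple transformation"
        }
        checkpointer.checkpoint() // mark progress once the batch is persisted
      }

      override def shutdown(checkpointer: IRecordProcessorCheckpointer,
                            reason: ShutdownReason): Unit = {
        if (reason == ShutdownReason.TERMINATE) checkpointer.checkpoint()
      }
    }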
(updateStateByKey can achieve this semantics, but is not practical if you have any significant amount of state because it does so by dumping the entire state on every checkpointing) There are also some minor drawbacks that I'm sure will be fixed quickly, like no task timeout, not being able to read from Kafka using multiple nodes, data loss hazard with Kafka. It's also not possible to attain very low latency in Spark, if that's what you need. The pos for Spark is the concise and IMO more intuitive syntax, especially if you compare it with Storm's Java API. I admit I might be a bit biased towards Storm tho as I'm more familiar with it. Also, you can do some processing with Kinesis. If all you need to do is straight forward transformation and you are reading from Kinesis to begin with, it might be an easier option to just do the transformation in Kinesis. On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: Whatever you write in bolts would be the logic you want to apply on your events. In Spark, that logic would be coded in map() or similar such transformations and/or actions. Spark doesn't enforce a structure for capturing your processing logic like Storm does. Regards Sab Probably overloading the question a bit. In Storm, Bolts have the functionality of getting triggered on events. Is that kind of functionality possible with Spark streaming? During each phase of the data processing, the transformed data is stored to the database and this transformed data should then be sent
Re: Spark or Storm
In that case I assume you need exactly-once semantics. There's no out-of-the-box way to do that in Spark. There is updateStateByKey, but it's not practical with your use case as the state is too large (it'll try to dump the entire intermediate state on every checkpoint, which would be prohibitively expensive). So either you have to implement something yourself, or you can use Storm Trident (or the transactional low-level API).
On Wed, Jun 17, 2015 at 1:26 PM, Ashish Soni asoni.le...@gmail.com wrote: My use case is below. We are going to receive a lot of events as a stream (basically a Kafka stream) and then we need to process and compute. Consider you have a phone contract with AT&T, and every call / SMS / data usage you make is an event, and it then needs to calculate your bill on a real-time basis, so when you log in to your account you can see all those variables: how much you used, how much is left, and what your bill is to date. Also, there are different rules which need to be considered when you calculate the total bill; one simple rule will be that 0-500 minutes is free but above that it is $1 a minute. How do I maintain shared state (total amount, total minutes, total data, etc.) so that I know how much I have accumulated at any given point, as events for the same phone can go to any node / executor? Can someone please tell me how I can achieve this in Spark, as in Storm I can have a bolt which can do this? Thanks,
On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji eshi...@gmail.com wrote: I guess both. In terms of syntax, I was comparing it with Trident. If you are joining, Spark Streaming actually does offer windowed join out of the box. We couldn't use this though, as our event streams can grow out of sync, so we had to implement something on top of Storm. If your event streams don't become out of sync, you may find the built-in join in Spark Streaming useful. Storm also has a join keyword but its semantics are different. Also, what do you mean by "No Back Pressure"? So when a topology is overloaded, Storm is designed so that it will stop reading from the source. Spark, on the other hand, will keep reading from the source and spilling it internally. This may be fine, in fairness, but it does mean you have to worry about the persistent store usage in the processing cluster, whereas with Storm you don't have to worry because the messages just remain in the data store. Spark came up with the idea of rate limiting, but I don't feel this is as nice as back pressure, because it's very difficult to tune it such that you don't cap the cluster's processing power yet still prevent the persistent storage from getting used up.
On Wed, Jun 17, 2015 at 9:33 AM, Spark Enthusiast sparkenthusi...@yahoo.in wrote: When you say Storm, did you mean Storm with Trident or Storm? My use case does not have simple transformations. There are complex events that need to be generated by joining the incoming event streams. Also, what do you mean by "No Back Pressure"?
On Wednesday, 17 June 2015 11:57 AM, Enno Shioji eshi...@gmail.com wrote: We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm. Some of the important drawbacks are: Spark has no back pressure (the receiver rate limit can alleviate this to a certain point, but it's far from ideal). There are also no exactly-once semantics (updateStateByKey can achieve these semantics, but is not practical if you have any significant amount of state, because it does so by dumping the entire state on every checkpoint). There are also some minor drawbacks that I'm sure will be fixed quickly, like no task timeout, not being able to read from Kafka using multiple nodes, and a data loss hazard with Kafka. It's also not possible to attain very low latency in Spark, if that's what you need. The plus for Spark is the concise and, IMO, more intuitive syntax, especially if you compare it with Storm's Java API. I admit I might be a bit biased towards Storm though, as I'm more familiar with it. Also, you can do some processing with Kinesis. If all you need to do is straightforward transformation and you are reading from Kinesis to begin with, it might be an easier option to just do the transformation in Kinesis.
On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: Whatever you write in bolts would be the logic you want to apply on your events. In Spark, that logic would be coded in map() or similar such transformations and/or actions. Spark doesn't enforce a structure for capturing your processing logic like Storm does. Regards, Sab
Probably overloading the question a bit. In Storm, Bolts have the functionality of getting triggered on events. Is that kind of functionality possible with Spark Streaming? During each phase of the data processing, the transformed data is stored to the database and this transformed data should then be sent to a new pipeline for further
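To make the updateStateByKey approach under discussion concrete, here is a minimal Scala sketch of the billing use case. The Totals type, the "phone,minutes" socket feed (standing in for Kafka) and the checkpoint path are made up for illustration; this is a sketch of the pattern, not a recommended implementation.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical state type for the billing example.
case class Totals(minutes: Long, amount: Double)

object BillingTotals {
  // The rule from the thread: 0-500 minutes free, $1 a minute above that.
  def bill(totalMinutes: Long): Double = math.max(0L, totalMinutes - 500L).toDouble

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("billing").setMaster("local[2]"), Seconds(60))
    ssc.checkpoint("/tmp/billing-checkpoint") // updateStateByKey requires checkpointing

    // In the real pipeline this would be a Kafka stream; a socket stream of
    // "phone,minutes" lines keeps the sketch self-contained.
    val usage = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(phone, mins) = line.split(",")
      (phone, mins.toLong)
    }

    // Spark hashes events by key, so all events for one phone reach the same
    // state entry regardless of which executor received them. The cost Enno
    // describes: the entire state is rewritten at every checkpoint.
    val totals = usage.updateStateByKey[Totals] { (newMins: Seq[Long], state: Option[Totals]) =>
      val minutes = state.map(_.minutes).getOrElse(0L) + newMins.sum
      Some(Totals(minutes, bill(minutes)))
    }

    totals.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

On the rate limiting mentioned above: at the time, the knob was the spark.streaming.receiver.maxRate property (records per second per receiver); genuine back pressure (spark.streaming.backpressure.enabled) only arrived later, in Spark 1.5.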
Re: Spark or Storm
A stream can also be processed in micro-batches, which is the main idea behind Spark Streaming, so what is the difference? Ashish
On Wed, Jun 17, 2015 at 9:04 AM, Enno Shioji eshi...@gmail.com wrote:
Re: Spark or Storm
As per my best understanding, Spark Streaming offers exactly-once processing. Is this achieved only through updateStateByKey, or is there another way to do the same? Ashish
On Wed, Jun 17, 2015 at 8:48 AM, Enno Shioji eshi...@gmail.com wrote:
Re: Spark or Storm
Hi Ayan, Admittedly I haven't done much with Kinesis, but if I'm not mistaken you should be able to use their processor interface for that. In this example, it's incrementing a counter: https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/services/kinesis/samples/datavis/kcl/CountingRecordProcessor.java Instead of incrementing a counter, you could do your transformation and send it to HBase.
On Wed, Jun 17, 2015 at 1:40 PM, ayan guha guha.a...@gmail.com wrote:
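For a rough idea of the shape of such a record processor, here is a stripped-down Scala sketch. The RecordProcessor trait below is a simplified stand-in for the real KCL interfaces (which live in the amazon-kinesis-client library linked above and additionally hand you a checkpointer and a shutdown reason), and the HBase writer is a hypothetical callback, so treat this as illustration rather than working KCL code.

// Simplified stand-in for the KCL's record-processor interface.
trait RecordProcessor {
  def initialize(shardId: String): Unit
  def processRecords(records: Seq[Array[Byte]]): Unit
  def shutdown(): Unit
}

// Instead of incrementing a counter as in the linked sample, transform each
// record and hand it to a (hypothetical) HBase writer.
class TransformToHBase(putToHBase: String => Unit) extends RecordProcessor {
  def initialize(shardId: String): Unit =
    println(s"starting on shard $shardId")

  def processRecords(records: Seq[Array[Byte]]): Unit =
    records.map(bytes => new String(bytes, "UTF-8")).foreach { event =>
      val transformed = event.trim.toLowerCase // stand-in for the simple transformation
      putToHBase(transformed)
    }

  def shutdown(): Unit =
    println("shard consumer stopping")
}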
Re: Spark or Storm
PS just to elaborate on my first sentence: the reason Spark (not Streaming) can offer exactly-once semantics is because its update operation is idempotent. This is easy to do in a batch context because the input is finite, but it's harder in a streaming context.
On Wed, Jun 17, 2015 at 2:00 PM, Enno Shioji eshi...@gmail.com wrote:
Re: Spark or Storm
Processing stuff in batches is not the same thing as being transactional. If you look at Storm, it will e.g. skip tuples that were already applied to a state, to avoid counting stuff twice, etc. Spark doesn't come with such a facility, so you could end up counting twice, etc.
On Wed, Jun 17, 2015 at 2:09 PM, Ashish Soni asoni.le...@gmail.com wrote:
Re: Spark or Storm
So Spark (not Streaming) does offer exactly once. Spark Streaming, however, can only do exactly-once semantics *if the update operation is idempotent*. updateStateByKey's update operation is idempotent, because it completely replaces the previous state. So as long as you use Spark Streaming, you must somehow make the update operation idempotent. Replacing the entire state is the easiest way to do it, but it's obviously expensive. The alternative is to do something similar to what Storm does. At that point, though, you'll have to ask whether just using Storm is easier than that.
On Wed, Jun 17, 2015 at 1:50 PM, Ashish Soni asoni.le...@gmail.com wrote:
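To illustrate the distinction being drawn here, an idempotent output in DStream terms can look like the sketch below: the row key is derived from the record itself and the value is the running total (not an increment), so a replayed batch overwrites the same cell with the same value and duplicates are harmless. KeyValueSink is a hypothetical stand-in for, say, an HBase table.

import org.apache.spark.streaming.dstream.DStream

// Hypothetical key-value sink with plain overwrite semantics.
trait KeyValueSink extends Serializable {
  def put(key: String, value: String): Unit
}

object IdempotentOutput {
  // (phone -> running total) pairs: deterministic key, absolute value,
  // so writing the same batch twice changes nothing.
  def writeIdempotent(totals: DStream[(String, Long)], sink: KeyValueSink): Unit =
    totals.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        partition.foreach { case (phone, totalMinutes) =>
          sink.put(s"usage:$phone", totalMinutes.toString)
        }
      }
    }
}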
Re: Spark or Storm
Thanks for this. It's a KCL-based Kinesis application. But because it's just a Java application, we are thinking of using Spark on EMR or Storm for fault tolerance and load balancing. Is that a correct approach?
On 17 Jun 2015 23:07, Enno Shioji eshi...@gmail.com wrote:
Re: Spark or Storm
AFAIK KCL is *supposed* to provide fault tolerance and load balancing (plus, additionally, elastic scaling, unlike Storm), with Kinesis providing the coordination. My understanding is that it's like a naked Storm worker process that can consequently only do map. I haven't really used it though, so I can't really comment on how it compares to Spark/Storm. Maybe somebody else will be able to comment.
On Wed, Jun 17, 2015 at 3:13 PM, ayan guha guha.a...@gmail.com wrote:
Re: Spark or Storm
@Enno As per the latest version and documentation, Spark Streaming does offer exactly-once semantics using the improved Kafka integration. Note: I have not tested it yet. Any feedback will be helpful if anyone has tried the same. http://koeninger.github.io/kafka-exactly-once/#7 https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji eshi...@gmail.com wrote:
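For reference, the improved integration linked above is the direct stream added in Spark 1.3, which exposes the exact Kafka offsets covered by each batch so that you can persist them alongside your output. A minimal Scala sketch (the broker address and topic name are made up):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("direct"), Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.foreachRDD { rdd =>
      // Each batch knows exactly which offsets it covers; storing these
      // atomically with your results is what enables exactly-once downstream.
      for (o <- rdd.asInstanceOf[HasOffsetRanges].offsetRanges)
        println(s"${o.topic} [${o.partition}] ${o.fromOffset} -> ${o.untilOffset}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that this gives you the offsets; as the next reply stresses, committing them together with your output is still your job.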
Re: Spark or Storm
The thing is, even with that improvement, you still have to make updates idempotent or transactional yourself. If you read http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics which refers to the latest version, it says:

Semantics of output operations: Output operations (like foreachRDD) have *at-least once* semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure. While this is acceptable for saving to file systems using the saveAs***Files operations (as the file will simply get overwritten with the same data), additional effort may be necessary to achieve exactly-once semantics. There are two approaches.
- *Idempotent updates*: Multiple attempts always write the same data. For example, saveAs***Files always writes the same data to the generated files.
- *Transactional updates*: All updates are made transactionally so that updates are made exactly once, atomically. One way to do this would be the following. Use the batch time (available in foreachRDD) and the partition index of the transformed RDD to create an identifier. This identifier uniquely identifies a blob of data in the streaming application. Update the external system with this blob transactionally (that is, exactly once, atomically) using the identifier. That is, if the identifier is not already committed, commit the partition data and the identifier atomically. Else, if this was already committed, skip the update.

So either you make the update idempotent, or you have to make it transactional yourself, and the suggested mechanism is very similar to what Storm does.
On Wed, Jun 17, 2015 at 3:51 PM, Ashish Soni asoni.le...@gmail.com wrote:
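The transactional recipe from the docs excerpt translates into roughly the following Scala sketch. TransactionalStore and commitIfAbsent are hypothetical: how to get an atomic "commit these rows unless this id is already committed" operation depends entirely on the target datastore, which is exactly the point being made in this thread.

import org.apache.spark.TaskContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical store offering an atomic commit-if-absent operation.
trait TransactionalStore extends Serializable {
  def commitIfAbsent(blobId: String, rows: Iterator[String]): Unit
}

object ExactlyOnceOutput {
  def writeTransactionally(lines: DStream[String], store: TransactionalStore): Unit =
    lines.foreachRDD { (rdd, time) =>
      rdd.foreachPartition { partition =>
        // Batch time plus partition index uniquely identify this blob of
        // data, as the docs excerpt above prescribes; a replayed batch
        // produces the same id and is skipped by the store.
        val blobId = s"${time.milliseconds}-${TaskContext.get.partitionId}"
        store.commitIfAbsent(blobId, partition)
      }
    }
}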
Re: Spark or Storm
Actually the reverse. Spark Streaming is really a micro-batch system where the smallest window is half a second (500ms). So for CEP, it's not really a good idea. So in terms of options... Spark Streaming, Storm, Samza, Akka and others... Storm is probably the easiest to pick up; Spark Streaming / Akka may give you more flexibility, and Akka would work for CEP. Just my $0.02
On Jun 16, 2015, at 9:40 PM, Spark Enthusiast sparkenthusi...@yahoo.in wrote:
Spark or Storm
Hi All, I am evaluating Spark (Spark Streaming) vs. Storm, and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help will be appreciated. Thanks, Ashish
Re: Spark or Storm
I have a similar scenario where we need to bring data from Kinesis to HBase. Data velocity is 20k per 10 mins. A little manipulation of the data will be required, but that's regardless of the tool, so we will be writing that piece as Java POJOs. The whole environment is on AWS; HBase is on a long-running EMR cluster and Kinesis on a separate cluster. TIA. Best, Ayan On 17 Jun 2015 12:13, Will Briggs wrbri...@gmail.com wrote: The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API rather than with the raw Bolt API. Even then, there are significant differences, but it's a bit closer. If you can share your use case, we might be able to provide better guidance. Regards, Will On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote: Hi All, I am evaluating Spark (Spark Streaming) vs. Storm, and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help will be appreciated. Thanks, Ashish
Re: Spark or Storm
I have a use case where a stream of incoming events has to be aggregated and joined to create complex events. The aggregation will have to happen at an interval of 1 minute (or less). The pipeline is: Upstream services --(send events)--> KAFKA --> Event Stream Processor --(enrich event)--> Complex Event Processor --> Elastic Search. From what I understand, Storm will make a very good ESP and Spark Streaming will make a good CEP. But we are also evaluating Storm with Trident. How does Spark Streaming compare with Storm with Trident? Sridhar Chellappa On Wednesday, 17 June 2015 10:02 AM, ayan guha guha.a...@gmail.com wrote: I have a similar scenario where we need to bring data from Kinesis to HBase. Data velocity is 20k per 10 mins. A little manipulation of the data will be required, but that's regardless of the tool, so we will be writing that piece as Java POJOs. The whole environment is on AWS; HBase is on a long-running EMR cluster and Kinesis on a separate cluster. TIA. Best, Ayan On 17 Jun 2015 12:13, Will Briggs wrbri...@gmail.com wrote: The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API rather than with the raw Bolt API. Even then, there are significant differences, but it's a bit closer. If you can share your use case, we might be able to provide better guidance. Regards, Will On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote: Hi All, I am evaluating Spark (Spark Streaming) vs. Storm, and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help will be appreciated. Thanks, Ashish
Re: Spark or Storm
Probably overloading the question a bit. In Storm, Bolts have the functionality of getting triggered on events. Is that kind of functionality possible with Spark Streaming? During each phase of the data processing, the transformed data is stored to the database, and this transformed data should then be sent to a new pipeline for further processing. How can this be achieved using Spark? (A sketch of one way follows below.) On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast sparkenthusi...@yahoo.in wrote: I have a use case where a stream of incoming events has to be aggregated and joined to create complex events. The aggregation will have to happen at an interval of 1 minute (or less). The pipeline is: Upstream services --(send events)--> KAFKA --> Event Stream Processor --(enrich event)--> Complex Event Processor --> Elastic Search. From what I understand, Storm will make a very good ESP and Spark Streaming will make a good CEP. But we are also evaluating Storm with Trident. How does Spark Streaming compare with Storm with Trident? Sridhar Chellappa On Wednesday, 17 June 2015 10:02 AM, ayan guha guha.a...@gmail.com wrote: I have a similar scenario where we need to bring data from Kinesis to HBase. Data velocity is 20k per 10 mins. A little manipulation of the data will be required, but that's regardless of the tool, so we will be writing that piece as Java POJOs. The whole environment is on AWS; HBase is on a long-running EMR cluster and Kinesis on a separate cluster. TIA. Best, Ayan On 17 Jun 2015 12:13, Will Briggs wrbri...@gmail.com wrote: The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API rather than with the raw Bolt API. Even then, there are significant differences, but it's a bit closer. If you can share your use case, we might be able to provide better guidance. Regards, Will On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote: Hi All, I am evaluating Spark (Spark Streaming) vs. Storm, and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help will be appreciated. Thanks, Ashish
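For what it's worth, the "store each phase, then feed the next pipeline" pattern maps naturally onto DStream operations, since they build a DAG rather than consuming the stream: the same transformed DStream can both be written out via foreachRDD and be transformed further. A minimal sketch in Scala; saveToDatabase is a hypothetical sink, not a real API:

import org.apache.spark.streaming.dstream.DStream

// Hypothetical sink; replace with your actual database writer.
def saveToDatabase(records: Iterator[String]): Unit = ???

def pipeline(events: DStream[String]): Unit = {
  val stage1 = events.map(_.trim.toUpperCase)   // first processing "phase"

  // Side-effecting output, roughly what a Bolt does when it writes to a store.
  stage1.foreachRDD { rdd =>
    rdd.foreachPartition(saveToDatabase)        // one writer per partition
  }

  // The same transformed stream also feeds the next "pipeline" for further work.
  val stage2 = stage1.filter(_.nonEmpty).map(s => (s, 1L)).reduceByKey(_ + _)
  stage2.print()
}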
Re: Spark or Storm
The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API rather than with the raw Bolt API. Even then, there are significant differences, but it's a bit closer. If you can share your use case, we might be able to provide better guidance. Regards, Will On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote: Hi All, I am evaluating Spark (Spark Streaming) vs. Storm, and I am not able to see what the equivalent of Storm's Bolt is inside Spark. Any help will be appreciated. Thanks, Ashish
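As a rough analogy only: what a chain of Bolts does tuple by tuple in Storm, Spark Streaming expresses as transformations applied to each micro-batch. A minimal word-count sketch in DStream terms; the Bolt labels in the comments are my own mapping, not official terminology:

import org.apache.spark.streaming.dstream.DStream

def wordCount(lines: DStream[String]): DStream[(String, Long)] =
  lines.flatMap(_.split("\\s+"))   // ~ a splitter Bolt
       .map(w => (w, 1L))          // ~ a pair-emitting Bolt
       .reduceByKey(_ + _)         // ~ a counting Bolt (per batch)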