More details on the Direct API of Spark 1.3 are at the Databricks blog: 
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

Note the use of checkpoints to persist the Kafka offsets in Spark Streaming 
itself, rather than in ZooKeeper.

Also note this statement: "... This allows one to build a Spark Streaming + Kafka 
pipeline with end-to-end exactly-once semantics (if your updates to downstream 
systems are idempotent or transactional)."
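
For concreteness, here is a minimal sketch of that checkpoint-based recovery with the
direct stream (the checkpoint path, broker list and topic name are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "hdfs:///checkpoints/kafka-direct-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("kafka-direct-app")
  val ssc = new StreamingContext(conf, Seconds(60))
  ssc.checkpoint(checkpointDir)  // offsets are persisted here, not in ZooKeeper

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))

  stream.map(_._2).count().print()
  ssc
}

// On restart, the driver rebuilds the context (including the offset ranges of any
// unfinished batches) from the checkpoint instead of asking ZooKeeper.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()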


From: Cody Koeninger [mailto:c...@koeninger.org]
Sent: 18 June 2015 19:38
To: bit1...@163.com
Cc: Prajod S Vettiyattil (WT01 - BAS); jrpi...@gmail.com; eshi...@gmail.com; 
wrbri...@gmail.com; asoni.le...@gmail.com; ayan guha; user; 
sateesh.kav...@gmail.com; sparkenthusi...@yahoo.in; 
sabarish.sasidha...@manthan.com
Subject: Re: RE: Spark or Storm

That general description is accurate, but it is not really a specific issue of the 
direct stream.  It applies to anything consuming from Kafka (or, as Matei 
already said, to any streaming system really).  You can't have exactly-once 
semantics unless you know something more about how you're storing results.

For "some unique id", topicpartition and offset is usually the obvious choice, 
which is why it's important that the direct stream gives you access to the 
offsets.

See https://github.com/koeninger/kafka-exactly-once for more info
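
Roughly, the pattern looks like this, assuming stream is a direct stream as created
above; process and saveWithOffsets are placeholder names for your own logic and
output store:

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // The direct stream exposes the exact Kafka offsets backing this batch.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  val results = rdd.map { case (_, message) => process(message) }.collect()

  // (topic, partition, fromOffset) is a natural unique key: store it in the same
  // transaction as the results, or use it to make an upsert idempotent, so a
  // replayed batch overwrites instead of duplicating.
  saveWithOffsets(results, offsetRanges)
}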



On Thu, Jun 18, 2015 at 6:47 AM, bit1...@163.com <bit1...@163.com> wrote:
I am wondering how the direct stream API ensures end-to-end exactly-once semantics.

I think there are two things involved:
1. On the Spark Streaming side, the driver will replay the offset range when 
it goes down and is restarted, which means that the new tasks may process some 
already processed data.
2. On the user side, since tasks may process already processed data, the user code 
should detect that some data has already been processed, e.g. by using some unique ID.

Not sure if I have understood correctly.


________________________________
bit1...@163.com

From: prajod.vettiyat...@wipro.com
Date: 2015-06-18 16:56
To: jrpi...@gmail.com; eshi...@gmail.com
CC: wrbri...@gmail.com; asoni.le...@gmail.com; guha.a...@gmail.com; 
user@spark.apache.org; sateesh.kav...@gmail.com; sparkenthusi...@yahoo.in; 
sabarish.sasidha...@manthan.com
Subject: RE: Spark or Storm
>>not being able to read from Kafka using multiple nodes

> Kafka is plenty capable of doing this..

I faced the same issue before Spark 1.3 was released.

The issue was not with Kafka, but with Spark Streaming's Kafka connector. 
Before the Spark 1.3.0 release, one Spark worker would get all the streamed 
messages, and we had to re-partition to distribute the processing.
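
For reference, the usual pre-1.3 workarounds looked roughly like this (assuming ssc
is the StreamingContext; the ZooKeeper address, group and topic are placeholders):

import org.apache.spark.streaming.kafka.KafkaUtils

// Several receiver-based streams, all in the same Kafka consumer group,
// unioned so that more than one worker receives data.
val numReceivers = 3
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "spark-group", Map("events" -> 1))
}
val unified = ssc.union(kafkaStreams)

// Alternatively, with a single receiver, spread only the downstream processing:
val repartitioned = unified.repartition(12)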

From the Spark 1.3.0 release, the Spark Direct API for Kafka supports parallel 
reads from Kafka into the Spark workers. See "Approach 2: Direct 
Approach" on this page: 
http://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html. Note that 
it also mentions zero data loss and exactly-once semantics for the Kafka 
integration.
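
A minimal sketch of the direct approach (broker list and topic are placeholders);
each Kafka partition becomes one Spark partition, so the reads are parallel across
workers by construction:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

// One RDD partition per Kafka topic partition: no receiver, and no re-partitioning
// needed just to distribute the reads.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

directStream.map { case (_, value) => value }.print()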


Prajod

From: Jordan Pilat [mailto:jrpi...@gmail.com]
Sent: 18 June 2015 03:57
To: Enno Shioji
Cc: Will Briggs; asoni.le...@gmail.com; ayan guha; user; Sateesh Kavuri; 
Spark Enthusiast; Sabarish Sasidharan
Subject: Re: Spark or Storm


>not being able to read from Kafka using multiple nodes

Kafka is plenty capable of doing this, by clustering multiple consumer 
instances together into a consumer group.
If your topic is sufficiently partitioned, the consumer group can consume the 
topic in a parallelized fashion.
If it isn't, you still have the fault tolerance associated with clustering the 
consumers.
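
With the 0.8.x high-level consumer, a consumer group looks roughly like this; run the
same code on several nodes with the same group.id and Kafka balances the topic's
partitions across them (addresses and names are placeholders):

import java.util.Properties
import kafka.consumer.{Consumer, ConsumerConfig}

val props = new Properties()
props.put("zookeeper.connect", "zk1:2181")
props.put("group.id", "my-consumer-group")  // same group.id on every instance

val connector = Consumer.create(new ConsumerConfig(props))

// Ask for 2 streams (threads) for the topic on this instance.
val streams = connector.createMessageStreams(Map("events" -> 2))("events")

streams.foreach { stream =>
  new Thread(new Runnable {
    override def run(): Unit = {
      val it = stream.iterator()
      while (it.hasNext()) {
        println(new String(it.next().message()))
      }
    }
  }).start()
}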

OK
JRP
On Jun 17, 2015 1:27 AM, "Enno Shioji" <eshi...@gmail.com> wrote:
We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm.

Some of the important drawbacks are:
- Spark has no back pressure (the receiver rate limit can alleviate this to a certain 
point, but it's far from ideal).
- There are also no exactly-once semantics (updateStateByKey can achieve these 
semantics, but it is not practical if you have any significant amount of state, 
because it dumps the entire state on every checkpoint).
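
To make those two points concrete, a rough sketch (assuming events is a
DStream[String] of incoming records):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

// The receiver rate limit (records/sec per receiver) -- a blunt cap, not real back pressure.
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.streaming.receiver.maxRate", "10000")

// updateStateByKey keeps a running count per key, but the entire state RDD is
// rewritten on every checkpoint, which gets expensive with a lot of state.
val updateCount: (Seq[Int], Option[Long]) => Option[Long] =
  (newValues, state) => Some(state.getOrElse(0L) + newValues.sum)

val counts: DStream[(String, Long)] = events
  .map(e => (e, 1))
  .updateStateByKey(updateCount)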

There are also some minor drawbacks that I'm sure will be fixed quickly, like 
no task timeout, not being able to read from Kafka using multiple nodes, and a data 
loss hazard with Kafka.

It's also not possible to attain very low latency in Spark, if that's what you 
need.

The plus for Spark is the concise and, IMO, more intuitive syntax, especially if 
you compare it with Storm's Java API.

I admit I might be a bit biased towards Storm, though, as I'm more familiar with it.

Also, you can do some processing with Kinesis. If all you need is a 
straightforward transformation and you are reading from Kinesis to begin with, 
it might be easier to just do the transformation in Kinesis.





On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan 
<sabarish.sasidha...@manthan.com> wrote:

Whatever you write in bolts would be the logic you want to apply to your 
events. In Spark, that logic would be coded in map() or similar 
transformations and/or actions. Spark doesn't enforce a structure for capturing 
your processing logic the way Storm does.
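
Roughly something like this, where parse, enrich, correlate and writeToDb stand in
for your own logic; it also shows how one stage's output can both be persisted and
feed the next stage:

// Stage 1: the "bolt" logic lives in ordinary transformations.
val enriched = events.map(parse).map(enrich)

// Persist this stage's output (batched per partition)...
enriched.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    writeToDb(records)
  }
}

// ...and keep transforming the same DStream as the next stage of the pipeline.
val complexEvents = enriched.map(correlate)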

Regards
Sab
Probably overloading the question a bit.

In Storm, Bolts have the functionality of being triggered on events. Is that 
kind of functionality possible with Spark Streaming? During each phase of the 
data processing, the transformed data is stored to the database, and this 
transformed data should then be sent to a new pipeline for further processing.

How can this be achieved using Spark?

On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast 
<sparkenthusi...@yahoo.in> wrote:
I have a use case where a stream of incoming events has to be aggregated and 
joined to create complex events. The aggregation will have to happen at an 
interval of 1 minute (or less).

The pipeline is:

Upstream services --(send events)--> KAFKA --> Event Stream Processor 
--(enrich event)--> Complex Event Processor --> Elastic Search
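
On the Spark side I imagine the 1-minute aggregation piece would look roughly like
this (a sketch; enrichedEvents, key and value are made-up names, and a join with a
second keyed, windowed stream would work the same way):

import org.apache.spark.streaming.Minutes

// Sum values per key over tumbling 1-minute windows.
val aggregated = enrichedEvents
  .map(e => (e.key, e.value))
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(1), Minutes(1))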

From what I understand, Storm will make a very good ESP and Spark Streaming 
will make a good CEP.

But, we are also evaluating Storm with Trident.

How does Spark Streaming compare with Storm with Trident?

Sridhar Chellappa






On Wednesday, 17 June 2015 10:02 AM, ayan guha 
<guha.a...@gmail.com> wrote:

I have a similar scenario where we need to bring data from Kinesis to HBase. 
Data velocity is 20k records per 10 minutes. A little data manipulation will be 
required, but that's independent of the tool, so we will be writing that piece as 
a Java POJO.
The whole environment is on AWS. HBase is on a long-running EMR cluster and 
Kinesis is on a separate cluster.
TIA.
Best
Ayan
On 17 Jun 2015 12:13, "Will Briggs" <wrbri...@gmail.com> wrote:
The programming models for the two frameworks are conceptually rather 
different; I haven't worked with Storm for quite some time, but based on my old 
experience with it, I would equate Spark Streaming more with Storm's Trident 
API than with the raw Bolt API. Even then, there are significant 
differences, but it's a bit closer.

If you can share your use case, we might be able to provide better guidance.

Regards,
Will

On June 16, 2015, at 9:46 PM, asoni.le...@gmail.com wrote:

Hi All,

I am evaluating Spark vs. Storm (Spark Streaming) and I am not able to see 
what the equivalent of Storm's Bolt is inside Spark.

Any help on this will be appreciated.

Thanks,
Ashish