Repository: samza
Updated Branches:
  refs/heads/0.9.1 868ecaca6 -> aa4dbe6dc


SAMZA-716: fixed broken link in Spark Streaming comparison page


Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/aa4dbe6d
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/aa4dbe6d
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/aa4dbe6d

Branch: refs/heads/0.9.1
Commit: aa4dbe6dc05c13270ac5389ac19c7a9adbde767c
Parents: 868ecac
Author: Aleksandar Bircakovic <a.bircako...@levi9.com>
Authored: Thu Jun 18 17:04:03 2015 -0700
Committer: Yan Fang <yanfang...@gmail.com>
Committed: Thu Jun 18 17:04:03 2015 -0700

----------------------------------------------------------------------
 docs/learn/documentation/versioned/comparisons/spark-streaming.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/aa4dbe6d/docs/learn/documentation/versioned/comparisons/spark-streaming.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/comparisons/spark-streaming.md b/docs/learn/documentation/versioned/comparisons/spark-streaming.md
index e1ccc3e..d11e8b1 100644
--- a/docs/learn/documentation/versioned/comparisons/spark-streaming.md
+++ b/docs/learn/documentation/versioned/comparisons/spark-streaming.md
@@ -42,7 +42,7 @@ Samza guarantees processing the messages as the order they appear in the partiti
 
 ### Fault-tolerance semantics
 
-Spark Streaming has different fault-tolerance semantics for different data sources. Here, for a better comparison, only discuss the semantic when using Spark Streaming with Kafka. In Spark 1.2, Spark Streaming provides at-least-once semantic in the receiver side (See the [post](https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html])). In Spark 1.3, it uses the no-receiver approach ([more detail](https://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers)), which provides some benefits. However, it still does not guarantee exactly-once semantics for output actions. Because the side-effecting output operations maybe repeated when the job fails and recovers from the checkpoint. If the updates in your output operations are not idempotent or transactional (such as send messages to a Kafka topic), you will get duplicated messages. Do not be confused by the "exactly-once semantic" in Spark Streaming guide. This only means a given item is only processed once and always gets the same result (Also check the "Delivery Semantics" section [posted](http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/) by Cloudera).
+Spark Streaming has different fault-tolerance semantics for different data sources. Here, for a better comparison, only discuss the semantic when using Spark Streaming with Kafka. In Spark 1.2, Spark Streaming provides at-least-once semantic in the receiver side (See the [post](https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html)). In Spark 1.3, it uses the no-receiver approach ([more detail](https://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers)), which provides some benefits. However, it still does not guarantee exactly-once semantics for output actions. Because the side-effecting output operations maybe repeated when the job fails and recovers from the checkpoint. If the updates in your output operations are not idempotent or transactional (such as send messages to a Kafka topic), you will get duplicated messages. Do not be confused by the "exactly-once semantic" in Spark Streaming guide. This only means a given item is only processed once and always gets the same result (Also check the "Delivery Semantics" section [posted](http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/) by Cloudera).
 
 Samza provides an at-least-once message delivery guarantee. When the job failure happens, it restarts the containers and reads the latest offset from the [checkpointing](../container/checkpointing.html). When a Samza job recovers from a failure, it's possible that it will process some data more than once. This happens because the job restarts at the last checkpoint, and any messages that had been processed between that checkpoint and the failure are processed again. The amount of reprocessed data can be minimized by setting a small checkpoint interval period.
 
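
A footnote on the idempotence point in the changed paragraph above: one common way to make an at-least-once pipeline safe to replay is to give every output record a key derived deterministically from its input, so that a message reprocessed after recovery looks like an overwrite rather than a duplicate to the downstream sink. A minimal sketch against Samza's StreamTask API (the class name and output topic are hypothetical, not part of this commit):

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    public class DedupKeyedTask implements StreamTask {
      // Hypothetical output stream; downstream consumers use the key below
      // to recognize replays.
      private static final SystemStream OUTPUT = new SystemStream("kafka", "deduped-output");

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                          TaskCoordinator coordinator) {
        // Derive the output key from the input's partition and offset. A message
        // replayed after restarting from the last checkpoint carries the same
        // key, so an idempotent sink (or a log-compacted Kafka topic) treats it
        // as an overwrite instead of a second, duplicate record.
        String dedupKey = envelope.getSystemStreamPartition() + "-" + envelope.getOffset();
        collector.send(new OutgoingMessageEnvelope(OUTPUT, dedupKey, envelope.getMessage()));
      }
    }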

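And on the closing sentence of the diff, about minimizing reprocessing: the checkpoint interval is driven by job configuration. A hypothetical job-properties fragment (the values are illustrative, not from this commit) might look like:

    # Store checkpoints in a Kafka topic.
    task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
    task.checkpoint.system=kafka
    # Checkpoint every 10 seconds instead of the 60-second default, so a
    # restarted job replays at most roughly 10 seconds of input.
    task.commit.ms=10000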