Repository: samza

Updated Branches:
  refs/heads/0.9.1 868ecaca6 -> aa4dbe6dc
SAMZA-716: fixed broken link in Spark Streaming comparison page

Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/aa4dbe6d
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/aa4dbe6d
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/aa4dbe6d

Branch: refs/heads/0.9.1
Commit: aa4dbe6dc05c13270ac5389ac19c7a9adbde767c
Parents: 868ecac
Author: Aleksandar Bircakovic <a.bircako...@levi9.com>
Authored: Thu Jun 18 17:04:03 2015 -0700
Committer: Yan Fang <yanfang...@gmail.com>
Committed: Thu Jun 18 17:04:03 2015 -0700

----------------------------------------------------------------------
 docs/learn/documentation/versioned/comparisons/spark-streaming.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/aa4dbe6d/docs/learn/documentation/versioned/comparisons/spark-streaming.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/comparisons/spark-streaming.md b/docs/learn/documentation/versioned/comparisons/spark-streaming.md
index e1ccc3e..d11e8b1 100644
--- a/docs/learn/documentation/versioned/comparisons/spark-streaming.md
+++ b/docs/learn/documentation/versioned/comparisons/spark-streaming.md
@@ -42,7 +42,7 @@ Samza guarantees processing the messages as the order they appear in the partiti
 
 ### Fault-tolerance semantics
 
-Spark Streaming has different fault-tolerance semantics for different data sources. Here, for a better comparison, only discuss the semantic when using Spark Streaming with Kafka. In Spark 1.2, Spark Streaming provides at-least-once semantic in the receiver side (See the [post](https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html])). In Spark 1.3, it uses the no-receiver approach ([more detail](https://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers)), which provides some benefits. However, it still does not guarantee exactly-once semantics for output actions. Because the side-effecting output operations maybe repeated when the job fails and recovers from the checkpoint. If the updates in your output operations are not idempotent or transactional (such as send messages to a Kafka topic), you will get duplicated messages. Do not be confused by the "exactly-once semantic" in Spark Streaming guide. This only means a given item is only processed once and always gets the same result (Also check the "Delivery Semantics" section [posted](http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/) by Cloudera).
+Spark Streaming has different fault-tolerance semantics for different data sources. Here, for a better comparison, only discuss the semantic when using Spark Streaming with Kafka. In Spark 1.2, Spark Streaming provides at-least-once semantic in the receiver side (See the [post](https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html)). In Spark 1.3, it uses the no-receiver approach ([more detail](https://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers)), which provides some benefits. However, it still does not guarantee exactly-once semantics for output actions. Because the side-effecting output operations maybe repeated when the job fails and recovers from the checkpoint. If the updates in your output operations are not idempotent or transactional (such as send messages to a Kafka topic), you will get duplicated messages. Do not be confused by the "exactly-once semantic" in Spark Streaming guide. This only means a given item is only processed once and always gets the same result (Also check the "Delivery Semantics" section [posted](http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/) by Cloudera).
 
 Samza provides an at-least-once message delivery guarantee. When the job failure happens, it restarts the containers and reads the latest offset from the [checkpointing](../container/checkpointing.html). When a Samza job recovers from a failure, it's possible that it will process some data more than once. This happens because the job restarts at the last checkpoint, and any messages that had been processed between that checkpoint and the failure are processed again. The amount of reprocessed data can be minimized by setting a small checkpoint interval period.
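The at-least-once behavior the patched docs describe — restart from the last checkpoint, reprocess everything between the checkpoint and the failure, dedupe only if the output is idempotent — can be illustrated with a toy simulation. This is not Samza or Spark code; `run_with_failure` and all of its parameters are hypothetical names invented for this sketch:

```python
# Toy simulation (not Samza/Spark API) of at-least-once delivery:
# a consumer checkpoints its offset every `checkpoint_interval` messages
# and, on failure, restarts from the last checkpointed offset, so any
# message processed after that checkpoint is delivered again.

def run_with_failure(messages, checkpoint_interval, fail_at):
    """Process `messages`, crashing once at index `fail_at`, then recovering
    from the last checkpoint. Returns every delivery (duplicates included)
    and the contents of an idempotent keyed store (no duplicates)."""
    processed = []   # every delivery, duplicates included
    store = {}       # idempotent sink: writing the same key twice is harmless
    checkpoint = 0   # last committed offset
    crashed = False
    offset = 0
    while offset < len(messages):
        if not crashed and offset == fail_at:
            crashed = True
            offset = checkpoint          # restart from the last checkpoint
            continue
        msg = messages[offset]
        processed.append(msg)
        store[msg] = True                # idempotent write
        offset += 1
        if offset % checkpoint_interval == 0:
            checkpoint = offset          # commit progress

    return processed, sorted(store)

deliveries, results = run_with_failure(
    list(range(10)), checkpoint_interval=4, fail_at=6)
# Messages 4 and 5 fall between the last checkpoint (offset 4) and the
# crash (offset 6), so they are delivered twice; the idempotent store
# nevertheless ends up with each message exactly once.
print(deliveries)  # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9]
print(results)     # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Shrinking `checkpoint_interval` narrows the window between the last commit and the failure, which is exactly why the docs note that a small checkpoint interval minimizes reprocessed data.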