[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

tdas Wed, 17 Jun 2015 19:15:52 -0700

Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6863#discussion_r32694391
  
    --- Diff: docs/streaming-kafka-integration.md ---
    @@ -74,15 +74,15 @@ Next, we discuss how to use this approach in your 
streaming application.
        [Maven 
repository](http://search.maven.org/#search|ga|1|a%3A%22spark-streaming-kafka-assembly_2.10%22%20AND%20v%3A%22{{site.SPARK_VERSION_SHORT}}%22)
 and add it to `spark-submit` with `--jars`.
     
     ## Approach 2: Direct Approach (No Receivers)
    -This is a new receiver-less "direct" approach has been introduced in Spark 
1.3 to ensure stronger end-to-end guarantees. Instead of using receivers to 
receive data, this approach periodically queries Kafka for the latest offsets 
in each topic+partition, and accordingly defines the offset ranges to process 
in each batch. When the jobs to process the data are launched, Kafka's simple 
consumer API is used to read the defined ranges of offsets from Kafka (similar 
to read files from a file system). Note that this is an experimental feature in 
Spark 1.3 and is only available in the Scala and Java API.
    +This new receiver-less "direct" approach has been introduced in Spark 1.3 
to ensure stronger end-to-end guarantees. Instead of using receivers to receive 
data, this approach periodically queries Kafka for the latest offsets in each 
topic+partition, and accordingly defines the offset ranges to process in each 
batch. When the jobs to process the data are launched, Kafka's simple consumer 
API is used to read the defined ranges of offsets from Kafka (similar to read 
files from a file system). Note that this is an experimental feature introduced 
in Spark 1.3 for the Scala and Java API.  Spark 1.4 added a Python API, but it 
is not yet at full feature parity.
     
    -This approach has the following advantages over the received-based 
approach (i.e. Approach 1).
    +This approach has the following advantages over the receiver-based 
approach (i.e. Approach 1).
     
    -- *Simplified Parallelism:* No need to create multiple input Kafka streams 
and union-ing them. With `directStream`, Spark Streaming will create as many 
RDD partitions as there is Kafka partitions to consume, which will all read 
data from Kafka in parallel. So there is one-to-one mapping between Kafka and 
RDD partitions, which is easier to understand and tune.
    +- *Simplified Parallelism:* No need to create multiple input Kafka streams 
and union them. With `directStream`, Spark Streaming will create as many RDD 
partitions as there are Kafka partitions to consume, which will all read data 
from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD 
partitions, which is easier to understand and tune.
     
    -- *Efficiency:* Achieving zero-data loss in the first approach required 
the data to be stored in a Write Ahead Log, which further replicated the data. 
This is actually inefficient as the data effectively gets replicated twice - 
once by Kafka, and a second time by the Write Ahead Log. This second approach 
eliminate the problem as there is no receiver, and hence no need for Write 
Ahead Logs.
    +- *Efficiency:* Achieving zero-data loss in the first approach required 
the data to be stored in a Write Ahead Log, which further replicated the data. 
This is actually inefficient as the data effectively gets replicated twice - 
once by Kafka, and a second time by the Write Ahead Log. This second approach 
eliminates the problem as there is no receiver, and hence no need for Write 
Ahead Logs.  As long as you have sufficient Kafka retention, messages can be 
recovered from Kafka.
    --- End diff --
    
    Are there two spaces before "As long as"



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [Streaming][Kafka][SPARK-8390] fix docs relate...

Reply via email to