spark git commit: [SPARK-17853][STREAMING][KAFKA][DOC] make it clear that reusing group.id is bad

rxin Wed, 12 Oct 2016 00:42:00 -0700

Repository: spark
Updated Branches:
  refs/heads/branch-2.0 f3d82b53c -> f12b74c02



[SPARK-17853][STREAMING][KAFKA][DOC] make it clear that reusing group.id is bad

## What changes were proposed in this pull request?

Documentation fix to make it clear that reusing group id for different streams 
is super duper bad, just like it is with the underlying Kafka consumer.

## How was this patch tested?

I built jekyll doc and made sure it looked ok.

Author: cody koeninger <c...@koeninger.org>

Closes #15442 from koeninger/SPARK-17853.

(cherry picked from commit c264ef9b1918256a5018c7a42a1a2b42308ea3f7)
Signed-off-by: Reynold Xin <r...@databricks.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f12b74c0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f12b74c0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f12b74c0

Branch: refs/heads/branch-2.0
Commit: f12b74c02eec9e201fec8a16dac1f8e549c1b4f0
Parents: f3d82b5
Author: cody koeninger <c...@koeninger.org>
Authored: Wed Oct 12 00:40:47 2016 -0700
Committer: Reynold Xin <r...@databricks.com>
Committed: Wed Oct 12 00:40:52 2016 -0700

----------------------------------------------------------------------
 docs/streaming-kafka-0-10-integration.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/f12b74c0/docs/streaming-kafka-0-10-integration.md
----------------------------------------------------------------------
diff --git a/docs/streaming-kafka-0-10-integration.md b/docs/streaming-kafka-0-10-integration.md
index 44c39e3..456b845 100644
--- a/docs/streaming-kafka-0-10-integration.md
+++ b/docs/streaming-kafka-0-10-integration.md
@@ -27,7 +27,7 @@ For Scala/Java applications using SBT/Maven project definitions, link your strea
          "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
-         "group.id" -> "example",
+         "group.id" -> "use_a_separate_group_id_for_each_stream",
          "auto.offset.reset" -> "latest",
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )
@@ -48,7 +48,7 @@ Each item in the stream is a [ConsumerRecord](http://kafka.apache.org/0100/javad
 </div>
 
 For possible kafkaParams, see [Kafka consumer config docs](http://kafka.apache.org/documentation.html#newconsumerconfigs).
-Note that enable.auto.commit is disabled, for discussion see [Storing Offsets](streaming-kafka-0-10-integration.html#storing-offsets) below.
+Note that the example sets enable.auto.commit to false, for discussion see [Storing Offsets](streaming-kafka-0-10-integration.html#storing-offsets) below.
 
 ### LocationStrategies
 The new Kafka consumer API will pre-fetch messages into buffers.  Therefore it is important for performance reasons that the Spark integration keep cached consumers on executors (rather than recreating them for each batch), and prefer to schedule partitions on the host locations that have the appropriate consumers.
@@ -57,6 +57,9 @@ In most cases, you should use `LocationStrategies.PreferConsistent` as shown abo
 
 The cache for consumers has a default maximum size of 64.  If you expect to be handling more than (64 * number of executors) Kafka partitions, you can change this setting via `spark.streaming.kafka.consumer.cache.maxCapacity`
 
+The cache is keyed by topicpartition and group.id, so use a **separate** `group.id` for each call to `createDirectStream`.
+
+
 ### ConsumerStrategies
 The new Kafka consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup.  `ConsumerStrategies` provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint.
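
The rule added by this patch (the consumer cache is keyed by topicpartition and group.id) can be illustrated with a small, dependency-free Scala sketch. This is a toy model, not Spark's internal cache: `CacheKey`, `kafkaParamsFor`, and `cacheKey` are hypothetical names introduced here, the `example_` prefix is an assumed naming scheme, and the deserializers are given as class-name strings only so the sketch compiles without the Kafka client on the classpath.

```scala
// Toy model only: this is NOT Spark's internal consumer cache.
// It mirrors the documented rule that the cache is keyed by
// topic partition and group.id.
object GroupIdPerStream {
  // Hypothetical key type, assumed for illustration.
  final case class CacheKey(groupId: String, topic: String, partition: Int)

  // Hypothetical helper: builds kafkaParams for one stream and gives
  // that stream its own group.id derived from the stream name.
  def kafkaParamsFor(streamName: String): Map[String, Object] = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
    "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
    "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
    "group.id" -> s"example_$streamName",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  // Derives the modeled cache key for one topic partition of a stream.
  def cacheKey(params: Map[String, Object], topic: String, partition: Int): CacheKey =
    CacheKey(params("group.id").toString, topic, partition)

  def main(args: Array[String]): Unit = {
    val clicks = kafkaParamsFor("clicks_stream")
    val audit = kafkaParamsFor("audit_stream")
    // Two streams on the same topic partition get distinct keys
    // because their group.id values differ.
    println(cacheKey(clicks, "topicA", 0) != cacheKey(audit, "topicA", 0))
    // Reusing one params map for two streams would give both the same key.
    println(cacheKey(clicks, "topicA", 0) == cacheKey(clicks, "topicA", 0))
  }
}
```

In a real application each map would be passed to its own `createDirectStream` call; the point of the sketch is only that a shared `group.id` makes two streams on the same topic partition indistinguishable under such a key.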
 

