Repository: spark
Updated Branches:
  refs/heads/master 0c681dd6b -> baff7e936


[SPARK-2419][Streaming][Docs] More updates to the streaming programming guide

- Improvements to the kinesis integration guide from @cfregly
- More information about unified input dstreams in main guide

Author: Tathagata Das <tathagata.das1...@gmail.com>
Author: Chris Fregly <ch...@fregly.com>

Closes #2307 from tdas/streaming-doc-fix1 and squashes the following commits:

ec40b5d [Tathagata Das] Updated figure with kinesis
fdb9c5e [Tathagata Das] Fixed style issues with kinesis guide
036d219 [Chris Fregly] updated kinesis docs and added an arch diagram
24f622a [Tathagata Das] More modifications.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/baff7e93
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/baff7e93
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/baff7e93

Branch: refs/heads/master
Commit: baff7e936101635d9bd4245e45335878bafb75e0
Parents: 0c681dd
Author: Tathagata Das <tathagata.das1...@gmail.com>
Authored: Sat Sep 6 14:46:43 2014 -0700
Committer: Tathagata Das <tathagata.das1...@gmail.com>
Committed: Sat Sep 6 14:46:43 2014 -0700

----------------------------------------------------------------------
 docs/img/streaming-arch.png           | Bin 78856 -> 78954 bytes
 docs/img/streaming-figures.pptx       | Bin 887545 -> 887551 bytes
 docs/img/streaming-kinesis-arch.png   | Bin 0 -> 115277 bytes
 docs/streaming-kinesis-integration.md |  94 ++++++++++++++++++++---------
 docs/streaming-programming-guide.md   |  64 +++++++++++++++-----
 5 files changed, 117 insertions(+), 41 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/baff7e93/docs/img/streaming-arch.png
----------------------------------------------------------------------
diff --git a/docs/img/streaming-arch.png b/docs/img/streaming-arch.png
index bc57b46..ac35f1d 100644
Binary files a/docs/img/streaming-arch.png and b/docs/img/streaming-arch.png differ

http://git-wip-us.apache.org/repos/asf/spark/blob/baff7e93/docs/img/streaming-figures.pptx
----------------------------------------------------------------------
diff --git a/docs/img/streaming-figures.pptx b/docs/img/streaming-figures.pptx
index 1b18c2e..d1cc25e 100644
Binary files a/docs/img/streaming-figures.pptx and b/docs/img/streaming-figures.pptx differ

http://git-wip-us.apache.org/repos/asf/spark/blob/baff7e93/docs/img/streaming-kinesis-arch.png
----------------------------------------------------------------------
diff --git a/docs/img/streaming-kinesis-arch.png b/docs/img/streaming-kinesis-arch.png
new file mode 100644
index 0000000..bea5fa8
Binary files /dev/null and b/docs/img/streaming-kinesis-arch.png differ

http://git-wip-us.apache.org/repos/asf/spark/blob/baff7e93/docs/streaming-kinesis-integration.md
----------------------------------------------------------------------
diff --git a/docs/streaming-kinesis-integration.md b/docs/streaming-kinesis-integration.md
index 079d4c5..c6090d9 100644
--- a/docs/streaming-kinesis-integration.md
+++ b/docs/streaming-kinesis-integration.md
@@ -3,8 +3,8 @@ layout: global
 title: Spark Streaming + Kinesis Integration
 ---
 [Amazon Kinesis](http://aws.amazon.com/kinesis/) is a fully managed service 
for real-time processing of streaming data at massive scale.
-The Kinesis input DStream and receiver uses the Kinesis Client Library (KCL) 
provided by Amazon under the Amazon Software License (ASL).
-The KCL builds on top of the Apache 2.0 licensed AWS Java SDK and provides 
load-balancing, fault-tolerance, checkpointing through the concept of Workers, 
Checkpoints, and Shard Leases.
+The Kinesis receiver creates an input DStream using the Kinesis Client Library 
(KCL) provided by Amazon under the Amazon Software License (ASL).
+The KCL builds on top of the Apache 2.0 licensed AWS Java SDK and provides 
load-balancing, fault-tolerance, checkpointing through the concepts of Workers, 
Checkpoints, and Shard Leases.
 Here we explain how to configure Spark Streaming to receive data from Kinesis.
 
 #### Configuring Kinesis
@@ -15,7 +15,7 @@ A Kinesis stream can be set up at one of the valid Kinesis 
endpoints with 1 or m
 
 #### Configuring Spark Streaming Application
 
-1. **Linking:** In your SBT/Maven projrect definition, link your streaming 
application against the following artifact (see [Linking 
section](streaming-programming-guide.html#linking) in the main programming 
guide for further information).
+1. **Linking:** In your SBT/Maven project definition, link your streaming 
application against the following artifact (see [Linking 
section](streaming-programming-guide.html#linking) in the main programming 
guide for further information).
 
                groupId = org.apache.spark
               artifactId = spark-streaming-kinesis-asl_{{site.SCALA_BINARY_VERSION}}
@@ -23,10 +23,11 @@ A Kinesis stream can be set up at one of the valid Kinesis 
endpoints with 1 or m
 
        **Note that by linking to this library, you will include 
[ASL](https://aws.amazon.com/asl/)-licensed code in your application.**
 
-2. **Programming:** In the streaming application code, import `KinesisUtils` 
and create input DStream as follows.
+2. **Programming:** In the streaming application code, import `KinesisUtils` 
and create the input DStream as follows:
 
        <div class="codetabs">
        <div data-lang="scala" markdown="1">
+               import org.apache.spark.streaming.Duration
                import org.apache.spark.streaming.kinesis._
                import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
 
@@ -34,11 +35,13 @@ A Kinesis stream can be set up at one of the valid Kinesis 
endpoints with 1 or m
                streamingContext, [Kinesis stream name], [endpoint URL], 
[checkpoint interval], [initial position])
 
        See the [API 
docs](api/scala/index.html#org.apache.spark.streaming.kinesis.KinesisUtils$)
-       and the 
[example]({{site.SPARK_GITHUB_URL}}/tree/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala).
 Refer to the next subsection for instructions to run the example.
+       and the 
[example]({{site.SPARK_GITHUB_URL}}/tree/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala).
 Refer to the Running the Example section for instructions on how to run the 
example.
 
        </div>
        <div data-lang="java" markdown="1">
-               import org.apache.spark.streaming.flume.*;
+               import org.apache.spark.streaming.Duration;
+               import org.apache.spark.streaming.kinesis.*;
+               import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
 
                JavaReceiverInputDStream<byte[]> kinesisStream = 
KinesisUtils.createStream(
                streamingContext, [Kinesis stream name], [endpoint URL], 
[checkpoint interval], [initial position]);
@@ -49,36 +52,73 @@ A Kinesis stream can be set up at one of the valid Kinesis 
endpoints with 1 or m
        </div>
        </div>
 
-       `[endpoint URL]`: Valid Kinesis endpoints URL can be found 
[here](http://docs.aws.amazon.com/general/latest/gr/rande.html#ak_region).
+    - `streamingContext`: StreamingContext containing an application name used by Kinesis to tie this Kinesis application to the Kinesis stream
 
-       `[checkpoint interval]`: The interval at which the Kinesis client 
library is going to save its position in the stream. For starters, set it to 
the same as the batch interval of the streaming application.
+       - `[Kinesis stream name]`: The Kinesis stream that this streaming 
application receives from
+               - The application name used in the streaming context becomes 
the Kinesis application name
+               - The application name must be unique for a given account and 
region.
+               - The Kinesis backend automatically associates the application name with the Kinesis stream using a DynamoDB table (always in the us-east-1 region) created during Kinesis Client Library initialization.
+               - Changing the application name or stream name can lead to 
Kinesis errors in some cases.  If you see errors, you may need to manually 
delete the DynamoDB table.
 
-       `[initial position]`: Can be either 
`InitialPositionInStream.TRIM_HORIZON` or `InitialPositionInStream.LATEST` (see 
later section and Amazon Kinesis API documentation for more details).
 
-       *Points to remember:*
+       - `[endpoint URL]`: A list of valid Kinesis endpoint URLs can be found [here](http://docs.aws.amazon.com/general/latest/gr/rande.html#ak_region).
 
-       - The name used in the context of the streaming application must be 
unique for a given account and region. Changing the app name or stream name 
could lead to Kinesis errors as only a single logical application can process a 
single stream.
-       - A single Kinesis input DStream can receive many Kinesis shards by 
spinning up multiple KinesisRecordProcessor threads. Note that there is no 
correlation between number of shards in Kinesis and the number of partitions in 
the generated RDDs that is used for processing the data.
-       - You never need more KinesisReceivers than the number of shards in 
your stream as each will spin up at least one KinesisRecordProcessor thread.
-       - Horizontal scaling is achieved by autoscaling additional Kinesis 
input DStreams (separate processes) up to the number of current shards for a 
given stream, of course.
+       - `[checkpoint interval]`: The interval (e.g., Duration(2000) = 2 
seconds) at which the Kinesis Client Library saves its position in the stream.  
For starters, set it to the same as the batch interval of the streaming 
application.
 
-3. **Deploying:** Package 
`spark-streaming-flume_{{site.SCALA_BINARY_VERSION}}` and its dependencies 
(except `spark-core_{{site.SCALA_BINARY_VERSION}}` and 
`spark-streaming_{{site.SCALA_BINARY_VERSION}}` which are provided by 
`spark-submit`) into the application JAR. Then use `spark-submit` to launch 
your application (see [Deploying 
section](streaming-programming-guide.html#deploying-applications) in the main 
programming guide).
+       - `[initial position]`: Can be either 
`InitialPositionInStream.TRIM_HORIZON` or `InitialPositionInStream.LATEST` (see 
Kinesis Checkpointing section and Amazon Kinesis API documentation for more 
details).
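
    For illustration, here is a minimal Scala sketch of the snippet above with hypothetical values filled in for the placeholders (stream name, the us-east-1 endpoint, 2-second intervals); a storage level is also passed, following the bundled KinesisWordCountASL example:

        import org.apache.spark.SparkConf
        import org.apache.spark.storage.StorageLevel
        import org.apache.spark.streaming.{Duration, StreamingContext}
        import org.apache.spark.streaming.kinesis._
        import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

        // Hypothetical application and stream names; the app name also becomes
        // the Kinesis (KCL) application name, as described above.
        val batchInterval = Duration(2000)  // 2 seconds
        val ssc = new StreamingContext(
          new SparkConf().setAppName("KinesisSketch"), batchInterval)

        // Checkpoint interval set to the batch interval, per the guidance above;
        // TRIM_HORIZON starts from the oldest record available.
        val kinesisStream = KinesisUtils.createStream(
          ssc, "mySparkStream", "https://kinesis.us-east-1.amazonaws.com",
          batchInterval, InitialPositionInStream.TRIM_HORIZON,
          StorageLevel.MEMORY_AND_DISK_2)  // records arrive as Array[Byte]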
 
-    - A DynamoDB table and CloudWatch namespace are created during KCL 
initialization using this Kinesis application name.  This DynamoDB table lives 
in the us-east-1 region regardless of the Kinesis endpoint URL. It is used to 
store KCL's checkpoint information.
 
-    - If you are seeing errors after changing the app name or stream name, it 
may be necessary to manually delete the DynamoDB table and start from scratch.
+3. **Deploying:** Package 
`spark-streaming-kinesis-asl_{{site.SCALA_BINARY_VERSION}}` and its 
dependencies (except `spark-core_{{site.SCALA_BINARY_VERSION}}` and 
`spark-streaming_{{site.SCALA_BINARY_VERSION}}` which are provided by 
`spark-submit`) into the application JAR. Then use `spark-submit` to launch 
your application (see [Deploying 
section](streaming-programming-guide.html#deploying-applications) in the main 
programming guide).
+
+       *Points to remember at runtime:*
+
+       - Kinesis data processing is ordered per partition and occurs at-least 
once per message.
+
+       - Multiple applications can read from the same Kinesis stream.  Kinesis will maintain the application-specific shard and checkpoint info in DynamoDB.
+
+       - A single Kinesis stream shard is processed by one input DStream at a 
time.
+
+       <p style="text-align: center;">
+               <img src="img/streaming-kinesis-arch.png"
+                       title="Spark Streaming Kinesis Architecture"
+                       alt="Spark Streaming Kinesis Architecture"
+              width="60%" 
+        />
+               <!-- Images are downsized intentionally to improve quality on 
retina displays -->
+       </p>
+
+       - A single Kinesis input DStream can read from multiple shards of a 
Kinesis stream by creating multiple KinesisRecordProcessor threads.
+
+       - Multiple input DStreams running in separate processes/instances can 
read from a Kinesis stream.
+
+       - You never need more Kinesis input DStreams than the number of Kinesis 
stream shards as each input DStream will create at least one 
KinesisRecordProcessor thread that handles a single shard.
+
+       - Horizontal scaling is achieved by adding/removing Kinesis input DStreams (within a single process or across multiple processes/instances) - up to the total number of Kinesis stream shards per the previous point.
+
+       - The Kinesis input DStream will balance the load between all DStreams 
- even across processes/instances.
+
+       - The Kinesis input DStream will balance the load during re-shard 
events (merging and splitting) due to changes in load.
+
+       - As a best practice, it's recommended that you avoid re-shard jitter 
by over-provisioning when possible.
+
+       - Each Kinesis input DStream maintains its own checkpoint info.  See 
the Kinesis Checkpointing section for more details.
+
+       - There is no correlation between the number of Kinesis stream shards 
and the number of RDD partitions/shards created across the Spark cluster during 
input DStream processing.  These are 2 independent partitioning schemes.
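
    To make the scaling points above concrete, a hedged sketch (hypothetical shard count and names): create one Kinesis input DStream per shard and union them, so the rest of the pipeline is written against a single DStream.

        // Assumes the same imports and StreamingContext setup as the earlier sketch.
        val numShards = 4  // illustrative; never create more input DStreams than shards
        val kinesisStreams = (1 to numShards).map { _ =>
          KinesisUtils.createStream(
            ssc, "mySparkStream", "https://kinesis.us-east-1.amazonaws.com",
            batchInterval, InitialPositionInStream.TRIM_HORIZON,
            StorageLevel.MEMORY_AND_DISK_2)
        }
        // Union the per-shard DStreams; downstream transformations are applied once.
        val unionStream = ssc.union(kinesisStreams)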
 
 #### Running the Example
 To run the example,
+
 - Download Spark source and follow the 
[instructions](building-with-maven.html) to build Spark with profile 
*-Pkinesis-asl*.
 
-    mvn -Pkinesis-asl -DskipTests clean package
+        mvn -Pkinesis-asl -DskipTests clean package
+
 
-- Set up Kinesis stream (see earlier section). Note the name of the Kinesis 
stream, and the endpoint URL corresponding to the region the stream is based on.
+- Set up a Kinesis stream (see earlier section) within AWS. Note the name of the Kinesis stream and the endpoint URL corresponding to the region where the stream was created.
 
 - Set up the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_KEY with 
your AWS credentials.
 
 - In the Spark root directory, run the example as
+
        <div class="codetabs">
        <div data-lang="scala" markdown="1">
 
@@ -92,19 +132,19 @@ To run the example,
        </div>
        </div>
 
-    This will wait for data to be received from Kinesis.
+    This will wait for data to be received from the Kinesis stream.
 
-- To generate random string data, in another terminal, run the associated 
Kinesis data producer.
+- To generate random string data to put onto the Kinesis stream, in another 
terminal, run the associated Kinesis data producer.
 
                bin/run-example streaming.KinesisWordCountProducerASL [Kinesis 
stream name] [endpoint URL] 1000 10
 
-       This will push random words to the Kinesis stream, which should then be 
received and processed by the running example.
+       This will push 1000 lines per second of 10 random numbers per line to 
the Kinesis stream.  This data should then be received and processed by the 
running example.
 
 #### Kinesis Checkpointing
-The Kinesis receiver checkpoints the position of the stream that has been read 
periodically, so that the system can recover from failures and continue 
processing where it had left off. Checkpointing too frequently will cause 
excess load on the AWS checkpoint storage layer and may lead to AWS throttling. 
 The provided example handles this throttling with a random-backoff-retry 
strategy.
-
-- If no Kinesis checkpoint info exists, the KinesisReceiver will start either 
from the oldest record available (InitialPositionInStream.TRIM_HORIZON) or from 
the latest tip (InitialPostitionInStream.LATEST).  This is configurable.
+- Each Kinesis input DStream periodically stores the current position of the 
stream in the backing DynamoDB table.  This allows the system to recover from 
failures and continue processing where the DStream left off.
 
-- InitialPositionInStream.LATEST could lead to missed records if data is added 
to the stream while no KinesisReceivers are running (and no checkpoint info is 
being stored). In production, you'll want to switch to 
InitialPositionInStream.TRIM_HORIZON which will read up to 24 hours (Kinesis 
limit) of previous stream data.
+- Checkpointing too frequently will cause excess load on the AWS checkpoint 
storage layer and may lead to AWS throttling.  The provided example handles 
this throttling with a random-backoff-retry strategy.
 
-- InitialPositionInStream.TRIM_HORIZON may lead to duplicate processing of 
records where the impact is dependent on checkpoint frequency.
+- If no Kinesis checkpoint info exists when the input DStream starts, it will start either from the oldest record available (InitialPositionInStream.TRIM_HORIZON) or from the latest tip (InitialPositionInStream.LATEST).  This is configurable.
+- InitialPositionInStream.LATEST could lead to missed records if data is added 
to the stream while no input DStreams are running (and no checkpoint info is 
being stored). 
+- InitialPositionInStream.TRIM_HORIZON may lead to duplicate processing of 
records where the impact is dependent on checkpoint frequency and processing 
idempotency.

http://git-wip-us.apache.org/repos/asf/spark/blob/baff7e93/docs/streaming-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index 3d4bce4..41f1705 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -233,7 +233,7 @@ $ ./bin/run-example streaming.NetworkWordCount localhost 
9999
 </div>
 <div data-lang="java" markdown="1">
 {% highlight bash %}
-$ ./bin/run-example JavaNetworkWordCount localhost 9999
+$ ./bin/run-example streaming.JavaNetworkWordCount localhost 9999
 {% endhighlight %}
 </div>
 </div>
@@ -262,7 +262,7 @@ hello world
 {% highlight bash %}
 # TERMINAL 2: RUNNING NetworkWordCount or JavaNetworkWordCount
 
-$ ./bin/run-example org.apache.spark.examples.streaming.NetworkWordCount 
localhost 9999
+$ ./bin/run-example streaming.NetworkWordCount localhost 9999
 ...
 -------------------------------------------
 Time: 1357008430000 ms
@@ -285,12 +285,22 @@ need to know to write your streaming applications.
 
 ## Linking
 
-To write your own Spark Streaming program, you will have to add the following 
dependency to your
- SBT or Maven project:
+Similar to Spark, Spark Streaming is available through Maven Central. To write 
your own Spark Streaming program, you will have to add the following dependency 
to your SBT or Maven project.
+
+<div class="codetabs">
+<div data-lang="Maven" markdown="1">
 
-    groupId = org.apache.spark
-    artifactId = spark-streaming_{{site.SCALA_BINARY_VERSION}}
-    version = {{site.SPARK_VERSION}}
+       <dependency>
+        <groupId>org.apache.spark</groupId>
+        <artifactId>spark-streaming_{{site.SCALA_BINARY_VERSION}}</artifactId>
+        <version>{{site.SPARK_VERSION}}</version>
+    </dependency>
+</div>
+<div data-lang="SBT" markdown="1">
+
+       libraryDependencies += "org.apache.spark" % "spark-streaming_{{site.SCALA_BINARY_VERSION}}" % "{{site.SPARK_VERSION}}"
+</div>
+</div>
 
 For ingesting data from sources like Kafka, Flume, and Kinesis that are not 
present in the Spark
 Streaming core
@@ -302,7 +312,7 @@ some of the common ones are as follows.
 <tr><th>Source</th><th>Artifact</th></tr>
 <tr><td> Kafka </td><td> spark-streaming-kafka_{{site.SCALA_BINARY_VERSION}} 
</td></tr>
 <tr><td> Flume </td><td> spark-streaming-flume_{{site.SCALA_BINARY_VERSION}} 
</td></tr>
-<tr><td> 
Kinesis<br/></td><td>spark-streaming-kinesis-asl_{{site.SCALA_BINARY_VERSION}} 
</td></tr>
+<tr><td> Kinesis<br/></td><td>spark-streaming-kinesis-asl_{{site.SCALA_BINARY_VERSION}} [Amazon Software License] </td></tr>
 <tr><td> Twitter </td><td> 
spark-streaming-twitter_{{site.SCALA_BINARY_VERSION}} </td></tr>
 <tr><td> ZeroMQ </td><td> spark-streaming-zeromq_{{site.SCALA_BINARY_VERSION}} 
</td></tr>
 <tr><td> MQTT </td><td> spark-streaming-mqtt_{{site.SCALA_BINARY_VERSION}} 
</td></tr>
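
For example, linking one of these artifacts looks just like the core dependency above; a sketch in SBT for the Kafka connector (the Maven form is analogous):

    // Same pattern as spark-streaming above, using the artifact from the table.
    libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_{{site.SCALA_BINARY_VERSION}}" % "{{site.SPARK_VERSION}}"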
@@ -373,7 +383,7 @@ or a special __"local[\*]"__ string to run in local mode. 
In practice, when runn
 you will not want to hardcode `master` in the program,
 but rather [launch the application with 
`spark-submit`](submitting-applications.html) and
 receive it there. However, for local testing and unit tests, you can pass 
"local[*]" to run Spark Streaming
-in-process. Note that this internally creates a 
[JavaSparkContext](api/java/index.html?org/apache/spark/api/java/JavaSparkContext.html)
 (starting point of all Spark functionality) which can be accessed as 
`ssc.sparkContext`. 
+in-process. Note that this internally creates a 
[JavaSparkContext](api/java/index.html?org/apache/spark/api/java/JavaSparkContext.html)
 (starting point of all Spark functionality) which can be accessed as 
`ssc.sparkContext`.
 
 The batch interval must be set based on the latency requirements of your 
application
 and available cluster resources. See the [Performance 
Tuning](#setting-the-right-batch-size)
@@ -447,11 +457,12 @@ Spark Streaming has two categories of streaming sources.
 - *Basic sources*: Sources directly available in the StreamingContext API. 
Example: file systems, socket connections, and Akka actors.
 - *Advanced sources*: Sources like Kafka, Flume, Kinesis, Twitter, etc. are 
available through extra utility classes. These require linking against extra 
dependencies as discussed in the [linking](#linking) section.
 
-Every input DStream (except file stream) is associated with a single 
[Receiver](api/scala/index.html#org.apache.spark.streaming.receiver.Receiver) 
object which receives the data from a source and stores it in Spark's memory 
for processing. A receiver is run within a Spark worker/executor as a 
long-running task, hence it occupies one of the cores allocated to the Spark 
Streaming application. Hence, it is important to remember that Spark Streaming 
application needs to be allocated enough cores to process the received data, as 
well as, to run the receiver(s). Therefore, few important points to remember 
are:
+Every input DStream (except file stream) is associated with a single 
[Receiver](api/scala/index.html#org.apache.spark.streaming.receiver.Receiver) 
object which receives the data from a source and stores it in Spark's memory 
for processing. So every input DStream receives a single stream of data. Note 
that in a streaming application, you can create multiple input DStreams to 
receive multiple streams of data in parallel. This is discussed later in the 
[Performance Tuning](#level-of-parallelism-in-data-receiving) section.
+
+A receiver is run within a Spark worker/executor as a long-running task, so it occupies one of the cores allocated to the Spark Streaming application. Hence, it is important to remember that a Spark Streaming application needs to be allocated enough cores to process the received data, as well as to run the receiver(s). Therefore, a few important points to remember are:
 
 ##### Points to remember:
 {:.no_toc}
-
 - If the number of cores allocated to the application is less than or equal to 
the number of input DStreams / receivers, then the system will receive data, 
but not be able to process them.
- When running locally, if your master URL is set to "local", then there is only one core to run tasks.  That is insufficient for programs with even one input DStream (file streams are okay) as the receiver will occupy that core and there will be no core left to process the data.
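
For instance, a minimal sketch (hypothetical app name): "local[2]" leaves one core for processing alongside a single receiver, where "local" would not.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // One core for the receiver, one for processing; use local[n] with
    // n greater than the number of receivers for more input DStreams.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))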
 
@@ -1089,9 +1100,34 @@ parallelizing the data receiving. Note that each input 
DStream
 creates a single receiver (running on a worker machine) that receives a single 
stream of data.
 Receiving multiple data streams can therefore be achieved by creating multiple 
input DStreams
 and configuring them to receive different partitions of the data stream from 
the source(s).
-For example, a single Kafka input stream receiving two topics of data can be 
split into two
+For example, a single Kafka input DStream receiving two topics of data can be 
split into two
 Kafka input streams, each receiving only one topic. This would run two 
receivers on two workers,
-thus allowing data to be received in parallel, and increasing overall 
throughput.
+thus allowing data to be received in parallel, and increasing overall 
throughput. These multiple
+DStreams can be unioned together to create a single DStream. Then the transformations that were
+being applied on the single input DStream can be applied on the unified stream. This is done as follows.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+val numStreams = 5
+val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
+val unifiedStream = streamingContext.union(kafkaStreams)
+unifiedStream.print()
+{% endhighlight %}
+</div>
+<div data-lang="java" markdown="1">
+{% highlight java %}
+int numStreams = 5;
+List<JavaPairDStream<String, String>> kafkaStreams = new ArrayList<JavaPairDStream<String, String>>(numStreams);
+for (int i = 0; i < numStreams; i++) {
+  kafkaStreams.add(KafkaUtils.createStream(...));
+}
+JavaPairDStream<String, String> unifiedStream = streamingContext.union(kafkaStreams.get(0), kafkaStreams.subList(1, kafkaStreams.size()));
+unifiedStream.print();
+{% endhighlight %}
+</div>
+</div>
+
 
 Another parameter that should be considered is the receiver's blocking 
interval. For most receivers,
 the received data is coalesced together into large blocks of data before 
storing inside Spark's memory.
@@ -1107,7 +1143,7 @@ before further processing.
 
 ### Level of Parallelism in Data Processing
 {:.no_toc}
-Cluster resources maybe under-utilized if the number of parallel tasks used in 
any stage of the
+Cluster resources can be under-utilized if the number of parallel tasks used 
in any stage of the
 computation is not high enough. For example, for distributed reduce operations 
like `reduceByKey`
 and `reduceByKeyAndWindow`, the default number of parallel tasks is decided by 
the [config property]
 (configuration.html#spark-properties) `spark.default.parallelism`. You can 
pass the level of

