Consuming Data in Parallel using Spark Streaming

Vibhakar, Beejal Wed, 21 Feb 2018 19:13:52 -0800

I am trying to process data from 3 different Kafka topics using 3 InputDStreams 
with a single StreamingContext. I am currently testing this in a sandbox, where 
I see the data from one Kafka topic processed first, followed by the others.


Question#1: When I run this program in a Hadoop cluster, will it process the 
data from the 3 Kafka topics in parallel, OR will I see the same behavior as 
in my sandbox?

Question#2: I aim to process the data from all three Kafka topics in parallel.  
Can I achieve this without breaking this program into 3 separate smaller 
programs?

Here's what the code template looks like:

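    import org.apache.avro.generic.GenericRecord
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    // sc (the SparkContext) and kafkaParms (the Kafka consumer settings) are
    // defined elsewhere. The constructor takes a Duration, not an Int; a
    // 30-second batch interval is assumed here.
    val ssc = new StreamingContext(sc, Seconds(30))

    val topic1 = Array("TOPIC1")

    val dataStreamTopic1 = KafkaUtils.createDirectStream[Array[Byte], GenericRecord](
      ssc,
      PreferConsistent,
      Subscribe[Array[Byte], GenericRecord](topic1, kafkaParms))

    // Processing logic for dataStreamTopic1

    val topic2 = Array("TOPIC2")

    val dataStreamTopic2 = KafkaUtils.createDirectStream[Array[Byte], GenericRecord](
      ssc,
      PreferConsistent,
      Subscribe[Array[Byte], GenericRecord](topic2, kafkaParms))

    // Processing logic for dataStreamTopic2

    val topic3 = Array("TOPIC3")

    val dataStreamTopic3 = KafkaUtils.createDirectStream[Array[Byte], GenericRecord](
      ssc,
      PreferConsistent,
      Subscribe[Array[Byte], GenericRecord](topic3, kafkaParms))

    // Processing logic for dataStreamTopic3

    // Start the Streaming
    ssc.start()
    ssc.awaitTermination()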
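
One possible variation on the template above, offered as a sketch rather than a confirmed answer: subscribe a single direct stream to all three topics and branch on each record's topic. With the direct stream, each Kafka partition becomes one RDD partition, so partitions from all three topics run as parallel tasks within the same batch, provided enough cores are available. The object name SingleStreamSketch, the helpers buildStream and wireProcessing, and the handleTopic1/2/3 functions are hypothetical placeholders; kafkaParms is assumed to hold the same consumer settings as in the template.

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object SingleStreamSketch {

  // Hypothetical per-topic handlers; replace the bodies with the real logic.
  def handleTopic1(value: GenericRecord): Unit = ()
  def handleTopic2(value: GenericRecord): Unit = ()
  def handleTopic3(value: GenericRecord): Unit = ()

  // Builds ONE direct stream that subscribes to all three topics.
  def buildStream(
      ssc: StreamingContext,
      kafkaParms: Map[String, Object]
  ): InputDStream[ConsumerRecord[Array[Byte], GenericRecord]] = {
    val topics = Array("TOPIC1", "TOPIC2", "TOPIC3")
    KafkaUtils.createDirectStream[Array[Byte], GenericRecord](
      ssc,
      PreferConsistent,
      Subscribe[Array[Byte], GenericRecord](topics, kafkaParms))
  }

  // Registers the output operation: every record is routed by its topic name.
  def wireProcessing(
      stream: InputDStream[ConsumerRecord[Array[Byte], GenericRecord]]
  ): Unit = {
    stream.foreachRDD { rdd =>
      // Each RDD partition corresponds to one Kafka topic-partition.
      rdd.foreachPartition { records =>
        records.foreach { record =>
          record.topic() match {
            case "TOPIC1" => handleTopic1(record.value())
            case "TOPIC2" => handleTopic2(record.value())
            case "TOPIC3" => handleTopic3(record.value())
            case _        => () // ignore records from unexpected topics
          }
        }
      }
    }
  }
}
```

This trades three independent streams for one, so all topics share the same batch interval and the same offset handling; whether that is acceptable depends on the processing logic for each topic.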

Here's how I submit my Spark job on my sandbox:

./bin/spark-submit --class <CLASS NAME> --master local[*] <PATH TO JAR>

Thanks,
Beejal


