Chris, I followed the DStream.repartition suggestion from the document's section on parallelism in receiving, and I also set "spark.default.parallelism", but the job still uses only 2 nodes in my cluster. I noticed there is another email thread on the same topic:
http://apache-spark-user-list.1001560.n3.nabble.com/DStream-repartitioning-performance-tuning-processing-td13069.html

My code is in Java; here is what I have:

JavaPairReceiverInputDStream<String, String> messages =
    KafkaUtils.createStream(ssc, zkQuorum, "cse-job-play-consumer", kafkaTopicMap);

JavaPairDStream<String, String> newMessages =
    messages.repartition(partitionSize); // partitionSize = 30

JavaDStream<String> lines = newMessages.map(
    new Function<Tuple2<String, String>, String>() {
        ...
        public String call(Tuple2<String, String> tuple2) {
            return tuple2._2();
        }
    });

JavaDStream<String> words = lines.flatMap(new MetricsComputeFunction());

JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
    new PairFunction<String, String, Integer>() { ... });

wordCounts.foreachRDD(new Function<JavaPairRDD<String, Integer>, Void>() { ... });

Thanks,
Bharat

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p13131.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.