The Scala version of the Kafka integration is something we have been working on for a while, and it is likely to be more optimized than the Python one. The Python one definitely requires passing the data back and forth between the JVM and the Python VM, and decoding the raw bytes into Python strings (probably less efficient than Java's byte-to-UTF-8 decoder), so that may cause some extra overhead compared to Scala.
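
To make the decoding cost concrete, here is a rough plain-Python sketch of the per-record work involved once the raw bytes have arrived on the Python side. It is only an illustration, not the actual Spark code path: the names utf8_decoder and word_count and the sample batch are hypothetical, and the JVM-to-Python transfer itself is not modeled at all.

```python
# Illustrative sketch only: mimics the per-record decode that happens in
# Python after raw Kafka payloads have been handed over from the JVM.
# All names and the sample data below are hypothetical.
from collections import Counter


def utf8_decoder(raw):
    """Decode one raw message payload to a Python string; None stays None."""
    if raw is None:
        return None
    return raw.decode("utf-8")


def word_count(raw_messages):
    """Count words across a batch of raw byte payloads."""
    counts = Counter()
    for raw in raw_messages:
        line = utf8_decoder(raw)  # one Python-level decode call per record
        if line:
            counts.update(line.split(" "))
    return counts


if __name__ == "__main__":
    batch = [b"spark streaming", b"spark kafka", None]
    print(word_count(batch))
```

Every record goes through an interpreted function call plus a bytes-to-string conversion, which is the kind of per-message cost the Scala path does not pay in the same way.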
Also consider trying the direct API; read more in the Kafka integration guide:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
Overall, it has much higher throughput than the earlier receiver-based approach.

BTW, a disclaimer: do not take this difference as a generalization of the performance difference between Scala and Python for all of Spark. For example, DataFrames provide performance parity between the Scala and Python APIs.

On Mon, Aug 24, 2015 at 5:22 AM, utk.pat <utkarsh.pat...@gmail.com> wrote:

> I am new to Spark Streaming. I was running the "kafka_wordcount" example
> with a local Kafka and Spark instance. It was very easy to set this up and
> get going :) I tried running both the Scala and Python versions of the word
> count example. The Python version seems to be extremely slow. Sometimes it
> has delays of more than a couple of minutes. On the other hand, the Scala
> version seems to be way better. I am running on a Windows machine. I am
> trying to understand what is causing the slowness in Python streaming. Is
> there anything that I am missing? For real-time streaming analysis, should
> I prefer Scala?
> ------------------------------
> View this message in context: Performance - Python streaming v/s Scala
> streaming
> <http://apache-spark-user-list.1001560.n3.nabble.com/Performance-Python-streaming-v-s-Scala-streaming-tp24415.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.