The Scala version of the Kafka integration is something that we have been
working on for a while, and it is likely to be more optimized than the Python
one. The Python version has to pass the data back and forth between the JVM
and the Python VM and decode the raw bytes into Python strings (probably less
efficiently than Java's byte-to-UTF-8 decoder), so that may cause some extra
overhead compared to Scala.

Also consider trying the direct API. Read more in the Kafka integration
guide - http://spark.apache.org/docs/latest/streaming-kafka-integration.html
Overall, it has a much higher throughput than the earlier receiver-based
approach.
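
As a rough sketch of what the direct approach looks like from Python: this
assumes the Spark 1.x-era pyspark.streaming.kafka module and its
KafkaUtils.createDirectStream call. The function names split_words and
build_word_counts are made up for illustration, and the broker address and
topic you pass in depend on your setup.

```python
def split_words(line):
    """Split one message value into words, dropping empty tokens."""
    return [w for w in line.split(" ") if w]


def build_word_counts(ssc, brokers, topic):
    """Return a DStream of (word, count) pairs read via the direct Kafka API.

    ssc is an existing StreamingContext; brokers is a "host:port" list and
    topic is a Kafka topic name (both supplied by the caller).
    """
    # Imported here so that split_words can be used without Spark installed.
    from pyspark.streaming.kafka import KafkaUtils

    # Direct (receiver-less) stream: Spark reads from the Kafka brokers itself.
    stream = KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers})

    # Each record is a (key, value) pair; only the value is counted.
    return (stream.map(lambda kv: kv[1])
                  .flatMap(split_words)
                  .map(lambda pair_word: (pair_word, 1))
                  .reduceByKey(lambda a, b: a + b))
```

To try it, create a StreamingContext, call something like
build_word_counts(ssc, "localhost:9092", "test").pprint(), then ssc.start()
and ssc.awaitTermination(). The broker and topic values there are
placeholders.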

BTW, a disclaimer: do not take this difference as a generalization of the
performance difference between Scala and Python for all of Spark. For
example, DataFrames provide performance parity between the Scala and Python
APIs.


On Mon, Aug 24, 2015 at 5:22 AM, utk.pat <utkarsh.pat...@gmail.com> wrote:

> I am new to Spark Streaming. I was running the "kafka_wordcount" example
> with a local Kafka and Spark instance. It was very easy to set this up and
> get going :) I tried running both the Scala and Python versions of the word
> count example. The Python version seems to be extremely slow. Sometimes it
> has delays of more than a couple of minutes. On the other hand, the Scala
> version seems to be way better. I am running on a Windows machine. I am
> trying to understand what causes the slowness in Python streaming. Is there
> anything that I am missing? For real-time streaming analysis, should I
> prefer Scala?
> ------------------------------
> View this message in context: Performance - Python streaming v/s Scala
> streaming
> <http://apache-spark-user-list.1001560.n3.nabble.com/Performance-Python-streaming-v-s-Scala-streaming-tp24415.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
