Didn't have the time to investigate much further, but the one thing that
popped out is that partitioning was no longer working on 1.6.1. This would
definitely explain the 2x performance loss.

Checking 1.5.1 Spark logs for the same application showed that our
partitioner was working correctly, and after the DStream / RDD creation a
user session was only processed on a single machine. Running on top of 1.6.1
though, the session was processed on up to 4 machines, in a 5 node cluster
including the driver, with a lot of redundant operations. We use a custom
but very simple partitioner which extends HashPartitioner. It partitions on
a case class which has a single string parameter.

Speculative operations are turned off by default, and we never enabled it,
so it's not that.

Right now we're postponing any Spark upgrade, and we'll probably try to
upgrade directly to Spark 2.0, hoping the partitioning issue is no longer
present there.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Severe-Spark-Streaming-performance-degradation-after-upgrading-to-1-6-1-tp27056p27334.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Reply via email to