Hi, I have a streaming program with the block as below [ref: https://github.com/agsachin/streamingBenchmark/blob/master/spark-benchmarks/src/main/scala/TwitterStreaming.scala ]
1 val lines = messages.map(_._2) 2 val hashTags = lines.flatMap(status => status.split(" " ).filter(_.startsWith("#"))) 3 val topCounts60 = hashTags.map((_, 1)).reduceByKey( _ + _ ) 3a .map { case (topic, count) => (count, topic) } 3b .transform(_.sortByKey(false)) 4a topCounts60.foreachRDD( rdd => { 4b val topList = rdd.take( 10 ) }) This batch is triggering 2 jobs...one at line 3b (sortByKey) and the other at 4b (rdd.take) I agree that there is a Job triggered on line 4b as take() is an action on RDD while as on line 3b sortByKey is just a transformation function which as per docs is lazy evaluation...but I see that this line uses a RangePartitioner and Rangepartitioner on initialization invokes a method called sketch() that invokes collect() triggering a Job. My question: Is it expected that sortByKey will invoke a Job...if yes, why is sortByKey listed as a transformation and not action. Are there any other functions like this that invoke a Job, though they are transformations and not actions? I am on Spark 1.6 Thanking You --------------------------------------------------------------------------------- Praveen Devarao Spark Technology Centre IBM India Software Labs --------------------------------------------------------------------------------- "Courage doesn't always roar. Sometimes courage is the quiet voice at the end of the day saying I will try again"