Hi,

        I have a streaming program with the block as below [ref: 
https://github.com/agsachin/streamingBenchmark/blob/master/spark-benchmarks/src/main/scala/TwitterStreaming.scala
]

1 val lines = messages.map(_._2)
2 val hashTags = lines.flatMap(status => status.split(" "
).filter(_.startsWith("#")))

3 val topCounts60 = hashTags.map((_, 1)).reduceByKey( _ + _ )
3a  .map { case (topic, count) => (count, topic) }
3b  .transform(_.sortByKey(false))

4a topCounts60.foreachRDD( rdd => {
4b  val topList = rdd.take( 10 )
})

        This batch is triggering 2 jobs...one at line 3b (sortByKey)  and 
the other at 4b (rdd.take) I agree that there is a Job triggered on line 
4b as take() is an action on RDD while as on line 3b sortByKey is just a 
transformation function which as per docs is lazy evaluation...but I see 
that this line uses a RangePartitioner and Rangepartitioner on 
initialization invokes a method called sketch() that invokes collect() 
triggering a Job.

        My question: Is it expected that sortByKey will invoke a Job...if 
yes, why is sortByKey listed as a transformation and not action. Are there 
any other functions like this that invoke a Job, though they are 
transformations and not actions?

        I am on Spark 1.6

Thanking You
---------------------------------------------------------------------------------
Praveen Devarao
Spark Technology Centre
IBM India Software Labs
---------------------------------------------------------------------------------
"Courage doesn't always roar. Sometimes courage is the quiet voice at the 
end of the day saying I will try again"

Reply via email to