Hi all, I implemented a transformation on HDFS files with Spark. I first tested it in spark-shell (on YARN), then implemented essentially the same logic as a Spark program (Scala), built a jar file, and used spark-submit to execute it on my YARN cluster. The weird thing is that the spark-submit approach is almost 3x as slow (500s in spark-shell vs 1500s with spark-submit). I am curious why...
I am essentially writing a benchmarking program to test Spark's performance in various settings, so my Spark program has an abstract Benchmark class, a trait for some common things, and a concrete class that performs one specific benchmark. My Spark main creates an instance of my benchmark class and executes something like benchmark1.run(), which in turn starts the SparkContext, performs the data manipulation, etc. I wonder if such constructs introduce some overhead compared to running the manipulation commands directly in spark-shell. Thanks. -Simon
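Roughly, the structure looks like this (heavily simplified; the class and path names here are illustrative, not my exact code, and the map is a placeholder for the real transformation):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Trait for common setup shared by all benchmarks
trait BenchmarkUtils {
  def makeContext(appName: String): SparkContext =
    new SparkContext(new SparkConf().setAppName(appName))
}

// Abstract base class: each benchmark implements run()
abstract class Benchmark {
  def run(): Unit
}

// One concrete benchmark: the HDFS transformation
class Benchmark1(inputPath: String, outputPath: String)
    extends Benchmark with BenchmarkUtils {
  override def run(): Unit = {
    val sc = makeContext("benchmark1")
    try {
      // same logic I ran interactively in spark-shell
      sc.textFile(inputPath)
        .map(_.toUpperCase)        // placeholder transformation
        .saveAsTextFile(outputPath)
    } finally {
      sc.stop()
    }
  }
}

object Main {
  def main(args: Array[String]): Unit = {
    val benchmark1 = new Benchmark1(args(0), args(1))
    benchmark1.run()
  }
}
```

So the only extra indirection versus spark-shell is one trait mixin and one virtual run() call before the job starts, which should be negligible next to the job itself.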