-dev +user

> 1) Is that the reason why it's always slow in the first run? Or are there
> any other reasons? Apparently it loads data into memory every time, so it
> shouldn't be anything to do with disk reads, should it?
You are probably seeing the effect of the JVM's JIT. The first run executes in interpreted mode. Once the JVM sees that a piece of code is hot, it compiles it to native code. This applies both to Spark / Spark SQL itself and (as of Spark 1.5) to the code we dynamically generate for expression evaluation. Multiple runs with the same expressions will use cached generated code that may already have been JIT-compiled.

> 2) Does Spark use Hadoop's MapReduce engine under the hood? If so, can
> we configure it to use MR2 instead of MR1?

No, we do not use the MapReduce engine for execution. You can, however, compile Spark against either version of Hadoop so you can access HDFS, etc.
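To illustrate the second point: the Hadoop version is picked at build time via Maven profiles. The exact profile and version values below are just an example; check the "Building Spark" docs for your release.

```shell
# Build Spark against a specific Hadoop version. This only selects the
# Hadoop client libraries (HDFS, YARN, etc.) that Spark links against --
# execution still runs on Spark's own engine, not MapReduce.
# Profile/version values are illustrative; see the build docs for your release.
./build/mvn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
```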
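For what it's worth, the JIT warm-up effect described in the first answer can be demonstrated outside Spark with a tiny Java microbenchmark (class name and iteration counts here are illustrative, not anything from Spark):

```java
// Minimal sketch of JVM JIT warm-up: the first timed call runs largely
// interpreted; after enough invocations HotSpot compiles sum() to native
// code, so later calls are usually faster.
public class JitWarmup {
    // A cheap but non-trivial loop the JIT can optimize.
    static long sum(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += i;
        return s;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        sum(10_000_000);                        // cold: mostly interpreted
        long first = System.nanoTime() - t0;

        long best = Long.MAX_VALUE;             // warm: likely JIT-compiled
        for (int i = 0; i < 20; i++) {
            long t = System.nanoTime();
            sum(10_000_000);
            best = Math.min(best, System.nanoTime() - t);
        }
        System.out.println("first run:   " + first + " ns");
        System.out.println("best warmed: " + best + " ns");
    }
}
```

Timings vary by machine and JVM, but the warmed-up runs are typically noticeably faster than the first, which is the same pattern you see across repeated Spark SQL query runs.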