Hello, I am running Spark 1.4.0 on Mesos 0.22.1, and usually I run my jobs in coarse-grained mode.
I have written some single-threaded, standalone Scala applications for a problem I am working on, and I am unable to get a Spark solution that comes close to their performance. My hope was to sacrifice some performance for an easily scalable solution, but I'm finding that the single-threaded implementations consistently outperform Spark even with a couple dozen cores, and I'm having trouble getting Spark to scale linearly.

All files are binary files with fixed-width records, ranging from about 40 to 200 bytes per record depending on the type. The files are already partitioned by 3 keys, with one file for each combination; basically the layout is /customer/day/partition_number. The ultimate goal is to read time series events, join in some smaller tables while processing those events, and write the result to Parquet. For this discussion, I'm focusing on just a simple problem: reading and aggregating the events.

I started with a simple experiment that walks over all the events and sums the value of one integer field. I implemented two standalone solutions and a Spark solution:

1) For each file, use a BufferedInputStream to iterate over each fixed-width row, copy the row into an Array[Byte], and then parse the one field out of that array. This can process events at about 30 million/second.

2) Memory-map each file to a java.nio.MappedByteBuffer. Calculate the sum by directly selecting the integer field while iterating over the rows. This solution can process about 100-300 million events/second.

3) Use SparkContext.binaryRecords, map over the RDD[Array[Byte]] to parse or select the field, and then call sum on that.

Although performance is understandably much better with the memory-mapped ByteBuffer, I would expect my Spark solution to get the same per-core throughput as solution #1 above, since the record type is Array[Byte] and I'm using the same approach to pull the integer field out of that byte array.
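For reference, solution #1 looks roughly like this; the record width (12 bytes) and field offset (4) are made up for illustration, and the field is assumed to be a 4-byte big-endian int:

```scala
import java.io.{BufferedInputStream, DataInputStream, EOFException, FileInputStream}
import java.nio.{ByteBuffer, ByteOrder}

// Sketch of solution #1: read one fixed-width record at a time into a
// reusable Array[Byte], then parse the integer field out of that array.
// recordLen and fieldOffset are hypothetical layout parameters.
def sumField(path: String, recordLen: Int, fieldOffset: Int): Long = {
  val in = new DataInputStream(new BufferedInputStream(new FileInputStream(path)))
  val row = new Array[Byte](recordLen)
  var sum = 0L
  try {
    while (true) {
      in.readFully(row) // copy exactly one record into the buffer
      sum += ByteBuffer.wrap(row, fieldOffset, 4).order(ByteOrder.BIG_ENDIAN).getInt
    }
  } catch {
    case _: EOFException => () // clean end of file
  } finally in.close()
  sum
}
```

Using DataInputStream.readFully avoids the short-read problem of a raw read(buf) loop.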
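Solution #2 is the same loop without the per-row copy, reading the field in place from the mapping (again with hypothetical recordLen/fieldOffset, and assuming each file fits in a single mapping, i.e. under 2 GB):

```scala
import java.io.RandomAccessFile
import java.nio.channels.FileChannel

// Sketch of solution #2: memory-map the file and select the integer
// field directly while iterating over the rows; no intermediate array.
def sumFieldMapped(path: String, recordLen: Int, fieldOffset: Int): Long = {
  val raf = new RandomAccessFile(path, "r")
  try {
    // MappedByteBuffer is big-endian by default; one mapping per file.
    val buf = raf.getChannel.map(FileChannel.MapMode.READ_ONLY, 0, raf.length)
    val n = raf.length / recordLen
    var sum = 0L
    var i = 0L
    while (i < n) {
      sum += buf.getInt((i * recordLen + fieldOffset).toInt) // absolute read, no copy
      i += 1
    }
    sum
  } finally raf.close()
}
```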
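And the Spark version (solution #3) is essentially this; the path glob, record width, and field offset are illustrative, and it of course needs a Spark runtime to execute:

```scala
import java.nio.ByteBuffer
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of solution #3: binaryRecords gives an RDD[Array[Byte]] of
// fixed-width rows; the same field-parsing logic as #1 runs per record.
val sc = new SparkContext(new SparkConf().setAppName("SumField"))

val recordLen = 12  // hypothetical record width
val fieldOffset = 4 // hypothetical offset of the 4-byte big-endian int field

val total = sc.binaryRecords("/customer/*/*", recordLen)
  .map(row => ByteBuffer.wrap(row, fieldOffset, 4).getInt.toLong)
  .sum() // via the implicit numeric-RDD functions
```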
However, the Spark solution achieves only 1-2 million events/second on 1 core, 4 million events/second on 2 nodes with 4 cores each, and 8 million events/second on 6 nodes with 4 cores each. So not only is the performance a fraction of that of my standalone applications, but it doesn't even scale linearly to 6 nodes. - Philip