Hi,

So yesterday and this morning I manually tested TinkerPop 3.2.0-SNAPSHOT (for our VOTE release on Friday) on 4 blades using Friendster (2.5 billion edges). I noticed that Spark 1.6.1 is fickle and Netty-based network errors occur "easily." Dropping back down to 1.5.2, there were no errors. I think one of the problems is GC in Spark 1.6.1 when using the MEMORY_XXX storage levels. With DISK_ONLY the issues went away on the simple query g.V().count() (which only repartitions -- no message passing). In 1.5.2 you still get GC stalls with MEMORY_XXX storage levels, but no [ERROR]s (and no stack traces with failed tasks).

Next, I ran a more complex query -- g.V().out().out().count() -- and Spark 1.6.1 had failed tasks even with DISK_ONLY. Bummer. As a last check, I changed the ratio of SPARK_WORKER_INSTANCES to SPARK_WORKER_CORES from 4/6 to 6/4, and everything started to work again with Spark 1.6.1.
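For anyone who wants to replicate this, here is roughly what I was toggling. This is a sketch from memory -- the master URL, conf file paths, and instance/core counts are from my 4-blade setup, and the storage-level property names are the SparkGraphComputer ones as I recall them, so double check against your own conf files.

    # conf/spark-env.sh on each blade -- the instance/core ratio that finally
    # behaved for me on Spark 1.6.1 (6/4 instead of 4/6)
    export SPARK_WORKER_INSTANCES=6
    export SPARK_WORKER_CORES=4

    # SparkGraphComputer properties -- DISK_ONLY instead of the MEMORY_XXX levels
    spark.master=spark://master.local:7077
    gremlin.spark.graphStorageLevel=DISK_ONLY
    gremlin.spark.persistStorageLevel=DISK_ONLY

And the two traversals I was timing in the Gremlin console:

    gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
    gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
    gremlin> g.V().count()                 // repartitions only -- no message passing
    gremlin> g.V().out().out().count()     // the heavier, message-passing query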
In short, memory management and the worker/core ratio in Spark 1.6.1 are "different" from Spark 1.5.2. I was able to get the same speeds on 1.6.1 as on 1.5.2, I just had to do things a little differently. In fact, 1.6.1 seems a bit faster -- a 55-minute job on 1.5.2 took 50 minutes on 1.6.1. I think it is safe to release TinkerPop 3.2.0 with Spark 1.6.1; we will just have to be ready to tell people to reduce the number of workers and to use DISK_ONLY if they are GC stalling a lot.

Finally, with this testing I ensured that our bump to Hadoop 2.7.2 didn't cause any problems, and moreover, there were a few knick-knack bugs around FileSystemStorage that I was able to confirm no longer exist.

Thanks,
Marko.

http://markorodriguez.com
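P.S. The FileSystemStorage check was nothing fancy -- just the Hadoop plugin in the Gremlin console listing job output over HDFS. A minimal sketch, assuming the default 'output' location from my properties file:

    gremlin> :plugin use tinkerpop.hadoop
    gremlin> hdfs.ls()
    gremlin> hdfs.ls('output')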