Cheers Marko, good work.

On Tue, Apr 5, 2016 at 4:32 PM, Marko Rodriguez <okramma...@gmail.com> wrote:
> Hi,
>
> So yesterday and this morning I manually tested TinkerPop 3.2.0-SNAPSHOT
> for our VOTE release on Friday on 4 Blades using Friendster (2.5 billion
> edges). I noticed that Spark 1.6.1 is fickle and Netty-based network errors
> occur "easily." I dropped back down to 1.5.2 and saw no errors. I think one
> of the problems is GC in Spark 1.6.1 when using MEMORY_XXX storage levels.
> I switched to DISK_ONLY and the issues went away on the simple query of
> g.V().count() (which only repartitions -- no message passing). In 1.5.2 you
> get GC stalls with MEMORY_XXX storage levels, but no [ERROR]s (and no stack
> traces w/ failed tasks). Next, I did a more complex query --
> g.V().out().out().count() -- and Spark 1.6.1 had failed tasks even with
> DISK_ONLY. Bummer. As a last check, I changed the proportion of
> SPARK_WORKER_INSTANCES to SPARK_WORKER_CORES from 4/6 to 6/4 and everything
> started to work again with Spark 1.6.1.
>
> In short, the memory management and workers/core-ratio in Spark 1.6.1 are
> "different" from Spark 1.5.2. I was able to get the same speeds on 1.6.1 as
> with 1.5.2, I just had to do things a little differently. In fact, 1.6.1
> seems a bit faster -- a 55 minute job on 1.5.2 takes 50 minutes on 1.6.1.
>
> I think it is safe to release TinkerPop 3.2.0 with Spark 1.6.1, but we
> will just have to be ready to tell people to reduce the number of workers
> and to use DISK_ONLY if they are GC stalling a lot. Finally, with this
> testing, I ensured that our bump to Hadoop 2.7.2 didn't cause any problems
> and moreover, there were a few knick-knack bugs around FileSystemStorage
> that I was able to confirm no longer exist.
>
> Thanks,
> Marko.
>
> http://markorodriguez.com
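
For anyone trying to reproduce the worker change: reading the 4/6 and 6/4 figures as instances/cores-per-worker (an assumption; the message does not spell out the order), a rough sketch of the arithmetic is below. `WorkerLayout` is a hypothetical helper for illustration only, not part of Spark or TinkerPop. Under that reading, both layouts expose the same number of cores per machine and only the split across worker JVMs changes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkerLayout:
    """Hypothetical helper describing a Spark standalone worker layout on one machine."""
    instances: int          # value for SPARK_WORKER_INSTANCES
    cores_per_worker: int   # value for SPARK_WORKER_CORES

    @property
    def total_cores(self) -> int:
        # Cores offered to Spark on the machine across all worker instances.
        return self.instances * self.cores_per_worker

    def env_lines(self) -> List[str]:
        # Lines in the form one would place in conf/spark-env.sh.
        return [
            f"SPARK_WORKER_INSTANCES={self.instances}",
            f"SPARK_WORKER_CORES={self.cores_per_worker}",
        ]


# Assumed reading of "from 4/6 to 6/4": instances/cores-per-worker.
before = WorkerLayout(instances=4, cores_per_worker=6)
after = WorkerLayout(instances=6, cores_per_worker=4)

for label, layout in (("before", before), ("after", after)):
    print(label, layout.env_lines(), "total cores:", layout.total_cores)
```

If the figures were meant the other way around, swap the two arguments; the total stays the same either way.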