Hi,

So yesterday and this morning I manually tested TinkerPop 3.2.0-SNAPSHOT (for our VOTE release on Friday) on 4 blades using Friendster (2.5 billion edges). I noticed that Spark 1.6.1 is fickle and Netty-based network errors occur "easily." Dropping back down to 1.5.2, there were no errors. I think one of the problems is GC in Spark 1.6.1 when using the MEMORY_XXX storage levels. With DISK_ONLY the issues went away on the simple query g.V().count() (which only repartitions -- no message passing). In 1.5.2 you still get GC stalls with MEMORY_XXX storage levels, but no [ERROR]s (and no stack traces with failed tasks).

Next, I ran a more complex query -- g.V().out().out().count() -- and Spark 1.6.1 had failed tasks even with DISK_ONLY. Bummer. As a last check, I changed the ratio of SPARK_WORKER_INSTANCES to SPARK_WORKER_CORES from 4/6 to 6/4, and everything started to work again with Spark 1.6.1.
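For anyone who wants to replicate this, here is roughly what I was toggling. This is a sketch from memory -- the master URL, conf file paths, and instance/core counts are from my 4-blade setup, and the storage-level property names are the SparkGraphComputer ones as I recall them, so double check against your own conf files.

    # conf/spark-env.sh on each blade -- the instance/core ratio that finally
    # behaved for me on Spark 1.6.1 (6/4 instead of 4/6)
    export SPARK_WORKER_INSTANCES=6
    export SPARK_WORKER_CORES=4

    # SparkGraphComputer properties -- DISK_ONLY instead of the MEMORY_XXX levels
    spark.master=spark://master.local:7077
    gremlin.spark.graphStorageLevel=DISK_ONLY
    gremlin.spark.persistStorageLevel=DISK_ONLY

And the two traversals I was timing in the Gremlin console:

    gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
    gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
    gremlin> g.V().count()                 // repartitions only -- no message passing
    gremlin> g.V().out().out().count()     // the heavier, message-passing query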
In short, memory management and the worker/core ratio in Spark 1.6.1 are "different" from Spark 1.5.2. I was able to get the same speeds on 1.6.1 as on 1.5.2, I just had to do things a little differently. In fact, 1.6.1 seems a bit faster -- a 55-minute job on 1.5.2 took 50 minutes on 1.6.1. I think it is safe to release TinkerPop 3.2.0 with Spark 1.6.1; we will just have to be ready to tell people to reduce the number of workers and to use DISK_ONLY if they are GC stalling a lot.

Finally, with this testing I ensured that our bump to Hadoop 2.7.2 didn't cause any problems, and moreover, there were a few knick-knack bugs around FileSystemStorage that I was able to confirm no longer exist.

Thanks,
Marko.

http://markorodriguez.com
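P.S. The FileSystemStorage check was nothing fancy -- just the Hadoop plugin in the Gremlin console listing job output over HDFS. A minimal sketch, assuming the default 'output' location from my properties file:

    gremlin> :plugin use tinkerpop.hadoop
    gremlin> hdfs.ls()
    gremlin> hdfs.ls('output')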