Re: Graphx : optimal partitions for a graph and error in logs

Ankur Dave Fri, 11 Jul 2014 14:39:33 -0700

On Fri, Jul 11, 2014 at 2:23 PM, ShreyanshB <shreyanshpbh...@gmail.com> wrote:
>
> -- Is it a correct way to load file to get best performance?



Yes, edgeListFile should be efficient at loading the edges.

> -- What should be the partition size? =computing node or =cores?


In general the number of partitions should be a multiple of the number of
cores, so that all available parallelism is used. However, more partitions
also mean more shuffle overhead, so it might help to use fewer partitions,
in some cases even fewer than the number of cores. You can measure the
performance with different numbers of partitions to see what works best for
your graph.
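
One way to run such a measurement is a small timing sweep. The sketch below is only an illustration: the object name PartitionSweep, the helper names, and the choice of candidates (half, 1x, 2x and 4x the core count) are assumptions made for this example, not part of GraphX. It is plain Scala, so the Spark job is passed in as a function.

```scala
object PartitionSweep {
  // Candidate partition counts around the core count: half, 1x, 2x, 4x.
  // These multipliers are an arbitrary starting point, not a recommendation.
  def candidateCounts(totalCores: Int): Seq[Int] =
    Seq(totalCores / 2, totalCores, totalCores * 2, totalCores * 4)
      .filter(_ > 0)
      .distinct

  // Runs `job` once per candidate count and returns
  // (partition count, elapsed wall-clock milliseconds) pairs.
  def sweep(totalCores: Int)(job: Int => Unit): Seq[(Int, Long)] =
    candidateCounts(totalCores).map { n =>
      val start = System.nanoTime()
      job(n)
      (n, (System.nanoTime() - start) / 1000000L)
    }
}
```

With a SparkContext `sc` and an edge file `path` in scope, the job could be something like `n => GraphLoader.edgeListFile(sc, path, false, n).edges.count()`, where the fourth positional argument sets the number of edge partitions (its parameter name has varied between Spark releases, so check the GraphLoader API for your version). Timing the actual algorithm you plan to run, rather than only the load, should give a more representative picture, since the shuffle cost shows up there.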

> -- I see following error so many times in my logs [...]
> NotSerializableException


This is a known bug, and there are two possible resolutions:

1. Switch from Java serialization to Kryo serialization, which is faster
and will also resolve the problem, by setting the following Spark
properties in conf/spark-defaults.conf:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.spark.graphx.GraphKryoRegistrator

2. Mark the affected classes as Serializable. I'll submit a patch with this
fix as well, but for now I'd suggest trying Kryo if possible.
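
If editing conf/spark-defaults.conf is inconvenient, the same two properties from option 1 can be set on the SparkConf in the application itself. This is a minimal sketch, assuming a Spark version that ships GraphKryoRegistrator under the class name given above; the object and application name KryoGraphApp are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object KryoGraphApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("KryoGraphApp") // placeholder application name
      // Same keys and values as the spark-defaults.conf entries in option 1.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "org.apache.spark.graphx.GraphKryoRegistrator")

    // The master URL is assumed to be supplied externally (e.g. by spark-submit).
    val sc = new SparkContext(conf)
    try {
      // Build and run the GraphX job here.
    } finally sc.stop()
  }
}
```

The properties have to be set before the SparkContext is created; changing them on the conf afterwards has no effect on the running context.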

Ankur <http://www.ankurdave.com/>
