On Fri, Jul 11, 2014 at 2:23 PM, ShreyanshB <shreyanshpbh...@gmail.com> wrote:
> -- Is it a correct way to load file to get best performance?

Yes, edgeListFile should be efficient at loading the edges.

> -- What should be the partition size? =computing node or =cores?

In general, the number of partitions should be a multiple of the number of cores to exploit all available parallelism. Because of shuffle overhead, though, it can help to use fewer partitions, in some cases even fewer than the number of cores. You can measure performance with different numbers of partitions to see what works best.

> -- I see following error so many times in my logs [...]
> NotSerializableException

This is a known bug, and there are two possible resolutions:

1. Switch from Java serialization to Kryo serialization, which is faster and will also resolve the problem, by setting the following Spark properties in conf/spark-defaults.conf:

    spark.serializer org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator org.apache.spark.graphx.GraphKryoRegistrator

2. Mark the affected classes as Serializable.

I'll submit a patch with this fix as well, but for now I'd suggest trying Kryo if possible.

Ankur <http://www.ankurdave.com/>
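
P.S. In case it helps, here is a rough sketch of the measurement suggested above: load the same edge list with a few different partition counts and time each load, with the two Kryo properties set programmatically instead of in conf/spark-defaults.conf. The object name PartitionSweep, the candidate counts, and the use of sc.defaultParallelism as a stand-in for the core count are my own assumptions for illustration. The sketch also assumes a Spark release that still ships GraphKryoRegistrator and is launched through spark-submit, which supplies the master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

// Hypothetical driver: times edgeListFile for several partition counts.
object PartitionSweep {
  def main(args: Array[String]): Unit = {
    val path = args(0) // path to the edge list file

    // Same two properties as in conf/spark-defaults.conf, set in code.
    val conf = new SparkConf()
      .setAppName("PartitionSweep")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "org.apache.spark.graphx.GraphKryoRegistrator")
    val sc = new SparkContext(conf)

    // Assumption: defaultParallelism roughly tracks the total core count.
    val cores = sc.defaultParallelism
    val candidates = Seq(cores / 2, cores, 2 * cores, 4 * cores).filter(_ > 0).distinct

    for (n <- candidates) {
      val start = System.nanoTime()
      // Arguments are positional because the name of the partition-count
      // parameter differs between Spark releases.
      val graph = GraphLoader.edgeListFile(sc, path, false, n)
      val edges = graph.edges.count() // forces the load to actually run
      val ms = (System.nanoTime() - start) / 1000000
      println(s"partitions=$n edges=$edges loadTimeMs=$ms")
    }

    sc.stop()
  }
}
```

This only times the load itself. To compare partition counts for a real workload, it would make sense to run the actual algorithm inside the loop and time that instead.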