Hi, I am having a problem serializing a custom partitioner that I have written that extends Externalizable. The partitioner wraps a java TreeSet which stores table splits. There are thousands of splits.
I noticed earlier that my spark job was taking over 30 seconds just to transmit a task to each worker, which is what motivated me to optimize the serialization of the partitioner I wrote. I read in your configuration page that Kryo can only be used to serialize data in RDDs, but spark cannot use Kryo to transmit the task closure to each executor. That is why I chose to use Externalizable for my custom partitioner. I have unit tests that confirm my externalizable class can be serialized using java serialization. Unfortunately when it comes to the cluster though, it stops working. I added some logging to this and discovered that it had serialized about 2000 splits out of 4000 before it encountered an EOFException. This is happening consistently on every node of my cluster. I have no idea what could cause this or even how to get more information. Would somebody please tell me what could possibly cause this or how to troubleshoot it? Mike Knapp