We have run into a problem where a Spark job is aborted after a worker is
killed in a 2-worker standalone cluster. The failure is intermittent, but we
can reproduce it consistently over repeated runs. It only appears to happen
when we kill a worker; it doesn't happen when we kill an executor directly.
I think you're exactly right. I once had 100 iterations in a single Pregel
call and ran into the lineage problem right there. I had to modify the
Pregel function and checkpoint both the graph and the newVerts RDD there to
cut off the lineage. If you draw out the dependency graph among the g,
I've been encountering something similar too. I suspected it was related to
the lineage growth of the graph/RDDs, so I now checkpoint the graph every 60
Pregel rounds, after which my program no longer slows down (except that each
checkpoint takes some extra time).
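For anyone looking for the pattern, here is a minimal sketch of the periodic checkpointing described above: run the computation as bounded Pregel chunks and checkpoint the graph between chunks to cut the lineage. It assumes a Spark release where Graph.checkpoint() exists (in older releases you have to checkpoint the underlying vertex/edge RDDs, which is what modifying the Pregel function accomplishes). The toy graph, the checkpoint directory, and the 60-round interval are illustrative only, and note that re-entering Pregel re-delivers the initial message to every vertex, so this sketches the checkpointing mechanics rather than a drop-in replacement for one long Pregel call.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object PeriodicCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pregel-checkpoint-sketch").setMaster("local[2]"))
    sc.setCheckpointDir("/tmp/graphx-checkpoints") // illustrative path

    // Toy ring graph; the vertex attribute is a running sum (illustrative only).
    val vertices  = sc.parallelize((0L until 10L).map(id => (id, 0.0)))
    val ringEdges = sc.parallelize((0L until 10L).map(id => Edge(id, (id + 1) % 10, 1.0)))
    var graph     = Graph(vertices, ringEdges)

    val totalRounds     = 180
    val checkpointEvery = 60 // illustrative; matches the interval mentioned above

    var done = 0
    while (done < totalRounds) {
      // Run a bounded chunk of Pregel iterations instead of one long call,
      // so there is a chance to truncate the lineage between chunks.
      graph = Pregel(graph, initialMsg = 0.0, maxIterations = checkpointEvery)(
        vprog    = (_, attr, msg) => attr + msg,
        sendMsg  = t => Iterator((t.dstId, t.attr)),
        mergeMsg = _ + _)

      graph.checkpoint()     // cuts vertex/edge lineage (newer 1.x releases)
      graph.vertices.count() // force materialization so the checkpoint is written
      graph.edges.count()
      done += checkpointEvery
    }

    println(graph.vertices.take(5).mkString(", "))
    sc.stop()
  }
}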
In VertexRDD.scala:
private[graphx] def partitionsRDD: RDD[ShippableVertexPartition[VD]]
We would really appreciate it if anyone could shed some light on solving
this problem, or if anyone who has come across a similar problem could share
a solution or workaround.
Thank you,
Cheuk Lam
I wasn't the original person who posted the question, but this helped me! :)
Thank you.
I had a similar issue today when I tried to connect using the IP address
(spark://master_ip:7077). I resolved it by replacing it with the URL
displayed in the Spark web console - in my case it is
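For completeness, a minimal sketch of that workaround, with a purely hypothetical hostname: point the application at the exact spark:// URL shown at the top of the master's web UI rather than at the raw IP, since the standalone master tends to reject connections whose address does not exactly match the one it registered with.

import org.apache.spark.{SparkConf, SparkContext}

object MasterUrlSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical hostname: copy the exact "spark://host:port" URL shown at the
    // top of the master's web UI (http://<master>:8080) instead of the raw IP;
    // the standalone master usually rejects connections addressed to anything
    // other than the host/port it registered with.
    val conf = new SparkConf()
      .setAppName("master-url-sketch")
      .setMaster("spark://my-master-host:7077") // hypothetical; paste yours verbatim
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).sum()) // trivial job just to confirm the connection
    sc.stop()
  }
}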
When using activeSetOpt in GraphImpl.mapReduceTriplets(), can we expect
performance that is proportional only to the size of the active set and
independent of the size of the original data set? Or is there still a fixed
overhead that depends on the size of the original data set?
Thank you!
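I can't speak to the exact cost model, but for anyone after the usage pattern, here is a rough sketch of how an active set is handed to mapReduceTriplets, mirroring what Pregel's inner loop does. It assumes a GraphX version where the activeSetOpt parameter is still reachable from user code (mapReduceTriplets was later deprecated in favor of aggregateMessages); the helper name and its parameters are made up for illustration.

import scala.reflect.ClassTag
import org.apache.spark.graphx._

object ActiveSetSketch {
  // Hypothetical helper: one message-passing step restricted to an active set,
  // roughly the way Pregel's inner loop drives mapReduceTriplets.
  def stepWithActiveSet[VD: ClassTag, ED: ClassTag, A: ClassTag](
      g: Graph[VD, ED],
      activeVerts: VertexRDD[VD], // vertices updated in the previous round
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): VertexRDD[A] = {
    // With EdgeDirection.Out, only triplets whose source vertex is in the active
    // set are offered to sendMsg, which is meant to limit the scan to edges
    // adjacent to active vertices; the active set itself still has to be shipped
    // to the edge partitions, so there is some work beyond the active edges.
    g.mapReduceTriplets(sendMsg, mergeMsg, activeSetOpt = Some((activeVerts, EdgeDirection.Out)))
  }
}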
This is a question about using the Pregel function in GraphX. Does a message
get serialized and then deserialized in the scenario where both the source
and the destination vertices are on the same compute node/machine?
Thank you!