[ https://issues.apache.org/jira/browse/SPARK-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin resolved SPARK-1329. -------------------------------- Resolution: Fixed Fix Version/s: 1.0.0 Assignee: Ankur Dave > ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than > node partitions > -------------------------------------------------------------------------------------------- > > Key: SPARK-1329 > URL: https://issues.apache.org/jira/browse/SPARK-1329 > Project: Spark > Issue Type: Bug > Components: GraphX > Affects Versions: 0.9.0 > Reporter: Daniel Darabos > Assignee: Ankur Dave > Fix For: 1.0.0 > > > To reproduce, let's look at a graph with two nodes in one partition, and two > edges between them split across two partitions: > scala> val vs = sc.makeRDD(Seq(1L->null, 2L->null), 1) > scala> val es = sc.makeRDD(Seq(graphx.Edge(1, 2, null), graphx.Edge(2, 1, > null)), 2) > scala> val g = graphx.Graph(vs, es) > Everything seems fine, until GraphX needs to join the two RDDs: > scala> g.triplets.collect > [...] > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:76) > at > org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:75) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:75) > at > org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:73) > at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450) > at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:71) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:85) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) > at org.apache.spark.scheduler.Task.run(Task.scala:53) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > The bug is fairly obvious in RoutingTable.createPid2Vid() -- it creates an > array of length vertices.partitions.size, and then looks up partition IDs > from the edges.partitionsRDD in it. > A graph usually has more edges than nodes. So it is natural to have more edge > partitions than node partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)