[jira] [Commented] (SPARK-5854) Implement Personalized PageRank with GraphX
[ https://issues.apache.org/jira/browse/SPARK-5854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324374#comment-14324374 ] Baoxu Shi commented on SPARK-5854: -- Oh sure, there are lots of paper talking about personalized page rank, for example: The fundamental one is: Haveliwala TH (2002) Topic-sensitive PageRank. WWW 517–526. doi: 10.1145/511446.511513 This has been cited by 1412 papers (according to google) And there are some implementations on MapReduce Bahmani B, Chakrabarti K, Xin D (2011) Fast personalized PageRank on MapReduce. SIGMOD 973–984. doi: 10.1145/1989323.1989425 Also, the most recent industry usage would be Twitter's WTF system Gupta P, Goel A, Lin J, et al. (2013) WTF: the who to follow service at Twitter. WWW 505–514. The difference between PageRank and Personalized PageRank are how we initialize score for each node, and how we do teleporting. In PageRank, every node has an initial score of 1, whereas for Personalized PageRank, only source node has a score of 1 and others have a score of 0 at the beginning. In PageRank, a random surfer has a probability of \alpha to jump to another random surfer. Whereas in Personalized PageRank, all jumping must end at source node. BTW, I thought this may just be a simple method of a Graph object, since PageRank is also a method of Graph object. > Implement Personalized PageRank with GraphX > --- > > Key: SPARK-5854 > URL: https://issues.apache.org/jira/browse/SPARK-5854 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Baoxu Shi >Priority: Minor > Labels: Feature > Original Estimate: 48h > Remaining Estimate: 48h > > I'm wondering if you would like to add personalized page rank to GraphX? I > have implemented it by modifying existing PageRank algorithm in GraphX, and I > would like to share it with others. > It is pretty simple and straightforward since the only change I need to make > is only teleport to source node. > I did some google searching and seems there are a few guys want to have it in > GraphX :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5854) Implement Personalized PageRank with GraphX
[ https://issues.apache.org/jira/browse/SPARK-5854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baoxu Shi updated SPARK-5854: - Remaining Estimate: 48h (was: 24h) Original Estimate: 48h (was: 24h) > Implement Personalized PageRank with GraphX > --- > > Key: SPARK-5854 > URL: https://issues.apache.org/jira/browse/SPARK-5854 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Baoxu Shi >Priority: Minor > Labels: Feature > Original Estimate: 48h > Remaining Estimate: 48h > > I'm wondering if you would like to add personalized page rank to GraphX? I > have implemented it by modifying existing PageRank algorithm in GraphX, and I > would like to share it with others. > It is pretty simple and straightforward since the only change I need to make > is only teleport to source node. > I did some google searching and seems there are a few guys want to have it in > GraphX :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5854) Implement Personalized PageRank with GraphX
Baoxu Shi created SPARK-5854: Summary: Implement Personalized PageRank with GraphX Key: SPARK-5854 URL: https://issues.apache.org/jira/browse/SPARK-5854 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Baoxu Shi Priority: Minor I'm wondering if you would like to add personalized page rank to GraphX? I have implemented it by modifying existing PageRank algorithm in GraphX, and I would like to share it with others. It is pretty simple and straightforward since the only change I need to make is only teleport to source node. I did some google searching and seems there are a few guys want to have it in GraphX :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2347) Graph object can not be set to StorageLevel.MEMORY_ONLY_SER
Baoxu Shi created SPARK-2347: Summary: Graph object can not be set to StorageLevel.MEMORY_ONLY_SER Key: SPARK-2347 URL: https://issues.apache.org/jira/browse/SPARK-2347 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.0 Environment: Spark standalone with 5 workers and 1 driver Reporter: Baoxu Shi I'm creating Graph object by using Graph(vertices, edges, null, StorageLevel.MEMORY_ONLY, StorageLevel.MEMORY_ONLY) But that will throw out not serializable exception on both workers and driver. 14/07/02 16:30:26 ERROR BlockManagerWorker: Exception handling buffer message java.io.NotSerializableException: org.apache.spark.graphx.impl.VertexPartition at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:106) at org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:30) at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:988) at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:997) at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:392) at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:358) at org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90) at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28) at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:662) at org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:504) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Even if the driver sometime does not throw this exception, it will throw java.io.FileNotFoundException: /tmp/spark-local-20140702151845-9620/2a/shuffle_2_25_3 (No such file or directory) I know that VertexPartition not supposed to be serializable, so is there any workaround on this? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2245) VertexRDD can not be materialized for checkpointing
[ https://issues.apache.org/jira/browse/SPARK-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048073#comment-14048073 ] Baoxu Shi commented on SPARK-2245: -- I edited my original comment to add the updates, but I do not know if you can get them via email. So I resubmit it again. Hope that won't bother you. [~ankurd] Hi Ankur Dave, I changed my pull request. But there is another exception, ShippableVertexPartition is not serializable. So I serialized it, but there is another exception org.apache.spark.graphx.impl.RoutingTablePartition is not serializable. Then I serialized it again, but on iteration 2 there will be an exception: org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to scala.Tuple2 The code I'm using are: val conf = new SparkConf().setAppName("HDTM") .setMaster("local[4]") val sc = new SparkContext(conf) sc.setCheckpointDir("./checkpoint") val v = sc.parallelize(Seq[(VertexId, Long)]((0L, 0L), (1L, 1L), (2L, 2L))) val e = sc.parallelize(Seq[Edge[Long]](Edge(0L, 1L, 0L), Edge(1L, 2L, 1L), Edge(2L, 0L, 2L))) var g = Graph(v, e) val vertexIds = Seq(0L, 1L, 2L) var prevG: Graph[VertexId, Long] = null for (i <- 1 to 2000) { vertexIds.toStream.foreach(id => { prevG = g g = Graph(g.vertices, g.edges) g.vertices.cache() g.edges.cache() prevG.unpersistVertices(blocking = false) prevG.edges.unpersist(blocking = false) } ) g.vertices.checkpoint() g.edges.checkpoint() g.edges.count() g.vertices.count() println(s"$ {g.vertices.isCheckpointed} $ {g.edges.isCheckpointed} ") println(" iter " + i + " finished") } println(g.vertices.collect().mkString(" ")) println(g.edges.collect().mkString(" ")) Am I on the right track? Or Should there be another way to change it? > VertexRDD can not be materialized for checkpointing > --- > > Key: SPARK-2245 > URL: https://issues.apache.org/jira/browse/SPARK-2245 > Project: Spark > Issue Type: Bug > Components: GraphX >Reporter: Baoxu Shi > > Seems one can not materialize VertexRDD by simply calling count method, which > is overridden by VertexRDD. But if you call RDD's count, it could materialize > it. > Is this a feature that designed to get the count without materialize > VertexRDD? If so, do you guys think it is necessary to add a materialize > method to VertexRDD? > By the way, does count() is the cheapest way to materialize a RDD? Or it just > cost the same resources like other actions? > The pull request is here: > https://github.com/apache/spark/pull/1177 > Best, -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will be thrown in onStageCompleted
[ https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046889#comment-14046889 ] Baoxu Shi commented on SPARK-2228: -- [~rxin] https://github.com/apache/spark/pull/1257 > onStageSubmitted does not properly called so NoSuchElement will be thrown in > onStageCompleted > - > > Key: SPARK-2228 > URL: https://issues.apache.org/jira/browse/SPARK-2228 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Baoxu Shi > > We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during > iterative computing, but after several hundreds of iterations, there will be > `NoSuchElementsError`. We check the code and locate the problem at > `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is > called, such `stageId` can not be found in `stageIdToPool`, but it does exist > in other HashMaps. So we think `onStageSubmitted` is not properly called. > `Spark` did add a stage but failed to send the message to listeners. When > sending `finish` message to listeners, the error occurs. > This problem will cause a huge number of `active stages` showing in > `SparkUI`, which is really annoying. But it may not affect the final result, > according to the result of my testing code. > I'm willing to help solve this problem, any idea about which part should I > change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something > to do with it but it looks fine to me. > FYI, here is the test code that could reproduce the problem. I do not know > who to put code here with highlight, so I put the code on gist to make the > issue looks clean. > https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will be thrown in onStageCompleted
[ https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046872#comment-14046872 ] Baoxu Shi commented on SPARK-2228: -- No problem, I’ll do that. > onStageSubmitted does not properly called so NoSuchElement will be thrown in > onStageCompleted > - > > Key: SPARK-2228 > URL: https://issues.apache.org/jira/browse/SPARK-2228 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Baoxu Shi > > We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during > iterative computing, but after several hundreds of iterations, there will be > `NoSuchElementsError`. We check the code and locate the problem at > `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is > called, such `stageId` can not be found in `stageIdToPool`, but it does exist > in other HashMaps. So we think `onStageSubmitted` is not properly called. > `Spark` did add a stage but failed to send the message to listeners. When > sending `finish` message to listeners, the error occurs. > This problem will cause a huge number of `active stages` showing in > `SparkUI`, which is really annoying. But it may not affect the final result, > according to the result of my testing code. > I'm willing to help solve this problem, any idea about which part should I > change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something > to do with it but it looks fine to me. > FYI, here is the test code that could reproduce the problem. I do not know > who to put code here with highlight, so I put the code on gist to make the > issue looks clean. > https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will be thrown in onStageCompleted
[ https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046102#comment-14046102 ] Baoxu Shi commented on SPARK-2228: -- I think a workaround would be adding EVENT_QUEUE_CAPACITY to sparkContext settings so people can adjust that according to their workload. > onStageSubmitted does not properly called so NoSuchElement will be thrown in > onStageCompleted > - > > Key: SPARK-2228 > URL: https://issues.apache.org/jira/browse/SPARK-2228 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Baoxu Shi > > We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during > iterative computing, but after several hundreds of iterations, there will be > `NoSuchElementsError`. We check the code and locate the problem at > `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is > called, such `stageId` can not be found in `stageIdToPool`, but it does exist > in other HashMaps. So we think `onStageSubmitted` is not properly called. > `Spark` did add a stage but failed to send the message to listeners. When > sending `finish` message to listeners, the error occurs. > This problem will cause a huge number of `active stages` showing in > `SparkUI`, which is really annoying. But it may not affect the final result, > according to the result of my testing code. > I'm willing to help solve this problem, any idea about which part should I > change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something > to do with it but it looks fine to me. > FYI, here is the test code that could reproduce the problem. I do not know > who to put code here with highlight, so I put the code on gist to make the > issue looks clean. > https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2245) VertexRDD can not be materialized for checkpointing
[ https://issues.apache.org/jira/browse/SPARK-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041448#comment-14041448 ] Baoxu Shi edited comment on SPARK-2245 at 6/25/14 1:43 AM: --- Hi [~ankurd], I changed my pull request. But there is another exception, ShippableVertexPartition is not serializable. So I serialized it, but there is another exception org.apache.spark.graphx.impl.RoutingTablePartition is not serializable. Then I serialized it again, but on iteration 2 there will be an exception: org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to scala.Tuple2 The code I'm using are: val conf = new SparkConf().setAppName("HDTM") .setMaster("local[4]") val sc = new SparkContext(conf) sc.setCheckpointDir("./checkpoint") val v = sc.parallelize(Seq[(VertexId, Long)]((0L, 0L), (1L, 1L), (2L, 2L))) val e = sc.parallelize(Seq[Edge[Long]](Edge(0L, 1L, 0L), Edge(1L, 2L, 1L), Edge(2L, 0L, 2L))) var g = Graph(v, e) val vertexIds = Seq(0L, 1L, 2L) var prevG: Graph[VertexId, Long] = null for (i <- 1 to 2000) { vertexIds.toStream.foreach(id => { prevG = g g = Graph(g.vertices, g.edges) g.vertices.cache() g.edges.cache() prevG.unpersistVertices(blocking = false) prevG.edges.unpersist(blocking = false) }) g.vertices.checkpoint() g.edges.checkpoint() g.edges.count() g.vertices.count() println(s"${g.vertices.isCheckpointed} ${g.edges.isCheckpointed}") println(" iter " + i + " finished") } println(g.vertices.collect().mkString(" ")) println(g.edges.collect().mkString(" ")) Am I on the right track? Or Should there be another way to change it? was (Author: bxshi): Just submit the changes, thanks! > VertexRDD can not be materialized for checkpointing > --- > > Key: SPARK-2245 > URL: https://issues.apache.org/jira/browse/SPARK-2245 > Project: Spark > Issue Type: Bug > Components: GraphX >Reporter: Baoxu Shi > > Seems one can not materialize VertexRDD by simply calling count method, which > is overridden by VertexRDD. But if you call RDD's count, it could materialize > it. > Is this a feature that designed to get the count without materialize > VertexRDD? If so, do you guys think it is necessary to add a materialize > method to VertexRDD? > By the way, does count() is the cheapest way to materialize a RDD? Or it just > cost the same resources like other actions? > The pull request is here: > https://github.com/apache/spark/pull/1177 > Best, -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2245) VertexRDD can not be materialized for checkpointing
[ https://issues.apache.org/jira/browse/SPARK-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041448#comment-14041448 ] Baoxu Shi commented on SPARK-2245: -- Just submit the changes, thanks! > VertexRDD can not be materialized for checkpointing > --- > > Key: SPARK-2245 > URL: https://issues.apache.org/jira/browse/SPARK-2245 > Project: Spark > Issue Type: Bug > Components: GraphX >Reporter: Baoxu Shi > > Seems one can not materialize VertexRDD by simply calling count method, which > is overridden by VertexRDD. But if you call RDD's count, it could materialize > it. > Is this a feature that designed to get the count without materialize > VertexRDD? If so, do you guys think it is necessary to add a materialize > method to VertexRDD? > By the way, does count() is the cheapest way to materialize a RDD? Or it just > cost the same resources like other actions? > The pull request is here: > https://github.com/apache/spark/pull/1177 > Best, -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2245) VertexRDD can not be materialized for checkpointing
[ https://issues.apache.org/jira/browse/SPARK-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041363#comment-14041363 ] Baoxu Shi commented on SPARK-2245: -- [~ankurd] Thanks for your detailed explanation. So if we want to fix this problem in the right way, we need to override all the methods related to checkpoint? I list all the methods I've found: checkpoint checkpointData computeOrReadCheckpoint doCheckpoint getCheckpointFile markCheckpointed Am I missing some one? If I changed those, should I also provide corresponding test with my pull request? > VertexRDD can not be materialized for checkpointing > --- > > Key: SPARK-2245 > URL: https://issues.apache.org/jira/browse/SPARK-2245 > Project: Spark > Issue Type: Bug > Components: GraphX >Reporter: Baoxu Shi > > Seems one can not materialize VertexRDD by simply calling count method, which > is overridden by VertexRDD. But if you call RDD's count, it could materialize > it. > Is this a feature that designed to get the count without materialize > VertexRDD? If so, do you guys think it is necessary to add a materialize > method to VertexRDD? > By the way, does count() is the cheapest way to materialize a RDD? Or it just > cost the same resources like other actions? > The pull request is here: > https://github.com/apache/spark/pull/1177 > Best, -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2245) VertexRDD can not be materialized for checkpointing
Baoxu Shi created SPARK-2245: Summary: VertexRDD can not be materialized for checkpointing Key: SPARK-2245 URL: https://issues.apache.org/jira/browse/SPARK-2245 Project: Spark Issue Type: Bug Components: GraphX Reporter: Baoxu Shi Seems one can not materialize VertexRDD by simply calling count method, which is overridden by VertexRDD. But if you call RDD's count, it could materialize it. Is this a feature that designed to get the count without materialize VertexRDD? If so, do you guys think it is necessary to add a materialize method to VertexRDD? By the way, does count() is the cheapest way to materialize a RDD? Or it just cost the same resources like other actions? The pull request is here: https://github.com/apache/spark/pull/1177 Best, -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will be thrown in onStageCompleted
[ https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baoxu Shi updated SPARK-2228: - Summary: onStageSubmitted does not properly called so NoSuchElement will be thrown in onStageCompleted (was: onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted) > onStageSubmitted does not properly called so NoSuchElement will be thrown in > onStageCompleted > - > > Key: SPARK-2228 > URL: https://issues.apache.org/jira/browse/SPARK-2228 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.1.0 >Reporter: Baoxu Shi > > We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during > iterative computing, but after several hundreds of iterations, there will be > `NoSuchElementsError`. We check the code and locate the problem at > `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is > called, such `stageId` can not be found in `stageIdToPool`, but it does exist > in other HashMaps. So we think `onStageSubmitted` is not properly called. > `Spark` did add a stage but failed to send the message to listeners. When > sending `finish` message to listeners, the error occurs. > This problem will cause a huge number of `active stages` showing in > `SparkUI`, which is really annoying. But it may not affect the final result, > according to the result of my testing code. > I'm willing to help solve this problem, any idea about which part should I > change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something > to do with it but it looks fine to me. > FYI, here is the test code that could reproduce the problem. I do not know > who to put code here with highlight, so I put the code on gist to make the > issue looks clean. > https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted
[ https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baoxu Shi updated SPARK-2228: - Affects Version/s: 1.0.0 > onStageSubmitted does not properly called so NoSuchElement will throw in > onStageCompleted > - > > Key: SPARK-2228 > URL: https://issues.apache.org/jira/browse/SPARK-2228 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.1.0 >Reporter: Baoxu Shi > > We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during > iterative computing, but after several hundreds of iterations, there will be > `NoSuchElementsError`. We check the code and locate the problem at > `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is > called, such `stageId` can not be found in `stageIdToPool`, but it does exist > in other HashMaps. So we think `onStageSubmitted` is not properly called. > `Spark` did add a stage but failed to send the message to listeners. When > sending `finish` message to listeners, the error occurs. > This problem will cause a huge number of `active stages` showing in > `SparkUI`, which is really annoying. But it may not affect the final result, > according to the result of my testing code. > I'm willing to help solve this problem, any idea about which part should I > change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something > to do with it but it looks fine to me. > FYI, here is the test code that could reproduce the problem. I do not know > who to put code here with highlight, so I put the code on gist to make the > issue looks clean. > https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted
[ https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baoxu Shi updated SPARK-2228: - Description: We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during iterative computing, but after several hundreds of iterations, there will be `NoSuchElementsError`. We check the code and locate the problem at `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is called, such `stageId` can not be found in `stageIdToPool`, but it does exist in other HashMaps. So we think `onStageSubmitted` is not properly called. `Spark` did add a stage but failed to send the message to listeners. When sending `finish` message to listeners, the error occurs. This problem will cause a huge number of `active stages` showing in `SparkUI`, which is really annoying. But it may not affect the final result, according to the result of my testing code. I'm willing to help solve this problem, any idea about which part should I change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something to do with it but it looks fine to me. FYI, here is the test code that could reproduce the problem. I do not know who to put code here with highlight, so I put the code on gist to make the issue looks clean. https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd was: We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during iterative computing, but after several hundreds of iterations, there will be `NoSuchElementsError`. We check the code and locate the problem at `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is called, such `stageId` can not be found in `stageIdToPool`, but it does exist in other HashMaps. So we think `onStageSubmitted` is not properly called. `Spark` did add a stage but failed to send the message to listeners. When sending `finish` message to listeners, the error occurs. This problem will cause a huge number of `active stages` showing in `SparkUI`, which is really annoying. But it may not affect the final result, according to the result of my testing code. I'm willing to help solve this problem, any idea about which part should I change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something to do with it but it looks fine to me. FYI, here is the test code that could reproduce the problem. I do not see code filed in the system so I put the code on gist. https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd > onStageSubmitted does not properly called so NoSuchElement will throw in > onStageCompleted > - > > Key: SPARK-2228 > URL: https://issues.apache.org/jira/browse/SPARK-2228 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Baoxu Shi > > We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during > iterative computing, but after several hundreds of iterations, there will be > `NoSuchElementsError`. We check the code and locate the problem at > `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is > called, such `stageId` can not be found in `stageIdToPool`, but it does exist > in other HashMaps. So we think `onStageSubmitted` is not properly called. > `Spark` did add a stage but failed to send the message to listeners. When > sending `finish` message to listeners, the error occurs. > This problem will cause a huge number of `active stages` showing in > `SparkUI`, which is really annoying. But it may not affect the final result, > according to the result of my testing code. > I'm willing to help solve this problem, any idea about which part should I > change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something > to do with it but it looks fine to me. > FYI, here is the test code that could reproduce the problem. I do not know > who to put code here with highlight, so I put the code on gist to make the > issue looks clean. > https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted
Baoxu Shi created SPARK-2228: Summary: onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted Key: SPARK-2228 URL: https://issues.apache.org/jira/browse/SPARK-2228 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Baoxu Shi We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during iterative computing, but after several hundreds of iterations, there will be `NoSuchElementsError`. We check the code and locate the problem at `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is called, such `stageId` can not be found in `stageIdToPool`, but it does exist in other HashMaps. So we think `onStageSubmitted` is not properly called. `Spark` did add a stage but failed to send the message to listeners. When sending `finish` message to listeners, the error occurs. This problem will cause a huge number of `active stages` showing in `SparkUI`, which is really annoying. But it may not affect the final result, according to the result of my testing code. I'm willing to help solve this problem, any idea about which part should I change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something to do with it but it looks fine to me. FYI, here is the test code that could reproduce the problem. I do not see code filed in the system so I put the code on gist. https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd -- This message was sent by Atlassian JIRA (v6.2#6252)