[jira] [Commented] (SPARK-5854) Implement Personalized PageRank with GraphX

2015-02-17 Thread Baoxu Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324374#comment-14324374
 ] 

Baoxu Shi commented on SPARK-5854:
--

Oh sure, there are lots of paper talking about personalized page rank, for 
example:

The fundamental one is:

Haveliwala TH (2002) Topic-sensitive PageRank. WWW 517–526. doi: 
10.1145/511446.511513
This has been cited by 1412 papers (according to google)

And there are some implementations on MapReduce

Bahmani B, Chakrabarti K, Xin D (2011) Fast personalized PageRank on MapReduce. 
SIGMOD 973–984. doi: 10.1145/1989323.1989425

Also, the most recent industry usage would be Twitter's WTF system

Gupta P, Goel A, Lin J, et al. (2013) WTF: the who to follow service at 
Twitter. WWW 505–514.

The difference between PageRank and Personalized PageRank are how we initialize 
score for each node, and how we do teleporting. 

In PageRank, every node has an initial score of 1, whereas for Personalized 
PageRank, only source node has a score of 1 and others have a score of 0 at the 
beginning.

In PageRank, a random surfer has a probability of \alpha to jump to another 
random surfer. Whereas in Personalized PageRank, all jumping must end at source 
node.


BTW, I thought this may just be a simple method of a Graph object, since 
PageRank is also a method of Graph object.

> Implement Personalized PageRank with GraphX
> ---
>
> Key: SPARK-5854
> URL: https://issues.apache.org/jira/browse/SPARK-5854
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Baoxu Shi
>Priority: Minor
>  Labels: Feature
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> I'm wondering if you would like to add personalized page rank to GraphX? I 
> have implemented it by modifying existing PageRank algorithm in GraphX, and I 
> would like to share it with others.
> It is pretty simple and straightforward since the only change I need to make 
> is only teleport to source node. 
> I did some google searching and seems there are a few guys want to have it in 
> GraphX :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5854) Implement Personalized PageRank with GraphX

2015-02-16 Thread Baoxu Shi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baoxu Shi updated SPARK-5854:
-
Remaining Estimate: 48h  (was: 24h)
 Original Estimate: 48h  (was: 24h)

> Implement Personalized PageRank with GraphX
> ---
>
> Key: SPARK-5854
> URL: https://issues.apache.org/jira/browse/SPARK-5854
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Baoxu Shi
>Priority: Minor
>  Labels: Feature
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> I'm wondering if you would like to add personalized page rank to GraphX? I 
> have implemented it by modifying existing PageRank algorithm in GraphX, and I 
> would like to share it with others.
> It is pretty simple and straightforward since the only change I need to make 
> is only teleport to source node. 
> I did some google searching and seems there are a few guys want to have it in 
> GraphX :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5854) Implement Personalized PageRank with GraphX

2015-02-16 Thread Baoxu Shi (JIRA)
Baoxu Shi created SPARK-5854:


 Summary: Implement Personalized PageRank with GraphX
 Key: SPARK-5854
 URL: https://issues.apache.org/jira/browse/SPARK-5854
 Project: Spark
  Issue Type: New Feature
  Components: GraphX
Reporter: Baoxu Shi
Priority: Minor


I'm wondering if you would like to add personalized page rank to GraphX? I have 
implemented it by modifying existing PageRank algorithm in GraphX, and I would 
like to share it with others.

It is pretty simple and straightforward since the only change I need to make is 
only teleport to source node. 

I did some google searching and seems there are a few guys want to have it in 
GraphX :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2347) Graph object can not be set to StorageLevel.MEMORY_ONLY_SER

2014-07-02 Thread Baoxu Shi (JIRA)
Baoxu Shi created SPARK-2347:


 Summary: Graph object can not be set to 
StorageLevel.MEMORY_ONLY_SER
 Key: SPARK-2347
 URL: https://issues.apache.org/jira/browse/SPARK-2347
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0
 Environment: Spark standalone with 5 workers and 1 driver
Reporter: Baoxu Shi


I'm creating Graph object by using 

Graph(vertices, edges, null, StorageLevel.MEMORY_ONLY, StorageLevel.MEMORY_ONLY)

But that will throw out not serializable exception on both workers and driver. 

14/07/02 16:30:26 ERROR BlockManagerWorker: Exception handling buffer message
java.io.NotSerializableException: org.apache.spark.graphx.impl.VertexPartition
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at 
org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:106)
at 
org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:30)
at 
org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:988)
at 
org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:997)
at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:392)
at 
org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:358)
at 
org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
at 
org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
at 
org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:662)
at 
org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:504)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Even if the driver sometime does not throw this exception, it will throw 

java.io.FileNotFoundException: 
/tmp/spark-local-20140702151845-9620/2a/shuffle_2_25_3 (No such file or 
directory)

I know that VertexPartition not supposed to be serializable, so is there any 
workaround on this?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2245) VertexRDD can not be materialized for checkpointing

2014-06-30 Thread Baoxu Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048073#comment-14048073
 ] 

Baoxu Shi commented on SPARK-2245:
--

I edited my original comment to add the updates, but I do not know if you can 
get them via email. So I resubmit it again. Hope that won't bother you. 
[~ankurd]

Hi Ankur Dave, I changed my pull request. But there is another exception, 
ShippableVertexPartition is not serializable. So I serialized it, but there is 
another exception org.apache.spark.graphx.impl.RoutingTablePartition is not 
serializable. Then I serialized it again, but on iteration 2 there will be an 
exception: org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast 
to scala.Tuple2
The code I'm using are:
val conf = new SparkConf().setAppName("HDTM")
.setMaster("local[4]")
val sc = new SparkContext(conf)
sc.setCheckpointDir("./checkpoint")
val v = sc.parallelize(Seq[(VertexId, Long)]((0L, 0L), (1L, 1L), (2L, 2L)))
val e = sc.parallelize(Seq[Edge[Long]](Edge(0L, 1L, 0L), Edge(1L, 2L, 1L), 
Edge(2L, 0L, 2L)))
var g = Graph(v, e)
val vertexIds = Seq(0L, 1L, 2L)
var prevG: Graph[VertexId, Long] = null
for (i <- 1 to 2000) {
vertexIds.toStream.foreach(id =>
{ prevG = g g = Graph(g.vertices, g.edges) g.vertices.cache() g.edges.cache() 
prevG.unpersistVertices(blocking = false) prevG.edges.unpersist(blocking = 
false) }
)
g.vertices.checkpoint()
g.edges.checkpoint()
g.edges.count()
g.vertices.count()
println(s"$
{g.vertices.isCheckpointed}
$
{g.edges.isCheckpointed}
")
println(" iter " + i + " finished")
}
println(g.vertices.collect().mkString(" "))
println(g.edges.collect().mkString(" "))
Am I on the right track? Or Should there be another way to change it?

> VertexRDD can not be materialized for checkpointing
> ---
>
> Key: SPARK-2245
> URL: https://issues.apache.org/jira/browse/SPARK-2245
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Baoxu Shi
>
> Seems one can not materialize VertexRDD by simply calling count method, which 
> is overridden by VertexRDD. But if you call RDD's count, it could materialize 
> it.
> Is this a feature that designed to get the count without materialize 
> VertexRDD? If so, do you guys think it is necessary to add a materialize 
> method to VertexRDD?
> By the way, does count() is the cheapest way to materialize a RDD? Or it just 
> cost the same resources like other actions?
> The pull request is here:
> https://github.com/apache/spark/pull/1177
> Best,



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will be thrown in onStageCompleted

2014-06-28 Thread Baoxu Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046889#comment-14046889
 ] 

Baoxu Shi commented on SPARK-2228:
--

[~rxin] https://github.com/apache/spark/pull/1257

> onStageSubmitted does not properly called so NoSuchElement will be thrown in 
> onStageCompleted
> -
>
> Key: SPARK-2228
> URL: https://issues.apache.org/jira/browse/SPARK-2228
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Baoxu Shi
>
> We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during 
> iterative computing, but after several hundreds of iterations, there will be 
> `NoSuchElementsError`. We check the code and locate the problem at 
> `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is 
> called, such `stageId` can not be found in `stageIdToPool`, but it does exist 
> in other HashMaps. So we think `onStageSubmitted` is not properly called. 
> `Spark` did add a stage but failed to send the message to listeners. When 
> sending `finish` message to listeners, the error occurs. 
> This problem will cause a huge number of `active stages` showing in 
> `SparkUI`, which is really annoying. But it may not affect the final result, 
> according to the result of my testing code.
> I'm willing to help solve this problem, any idea about which part should I 
> change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something 
> to do with it but it looks fine to me.
> FYI, here is the test code that could reproduce the problem. I do not know 
> who to put code here with highlight, so I put the code on gist to make the 
> issue looks clean.
> https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will be thrown in onStageCompleted

2014-06-28 Thread Baoxu Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046872#comment-14046872
 ] 

Baoxu Shi commented on SPARK-2228:
--

No problem, I’ll do that.





> onStageSubmitted does not properly called so NoSuchElement will be thrown in 
> onStageCompleted
> -
>
> Key: SPARK-2228
> URL: https://issues.apache.org/jira/browse/SPARK-2228
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Baoxu Shi
>
> We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during 
> iterative computing, but after several hundreds of iterations, there will be 
> `NoSuchElementsError`. We check the code and locate the problem at 
> `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is 
> called, such `stageId` can not be found in `stageIdToPool`, but it does exist 
> in other HashMaps. So we think `onStageSubmitted` is not properly called. 
> `Spark` did add a stage but failed to send the message to listeners. When 
> sending `finish` message to listeners, the error occurs. 
> This problem will cause a huge number of `active stages` showing in 
> `SparkUI`, which is really annoying. But it may not affect the final result, 
> according to the result of my testing code.
> I'm willing to help solve this problem, any idea about which part should I 
> change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something 
> to do with it but it looks fine to me.
> FYI, here is the test code that could reproduce the problem. I do not know 
> who to put code here with highlight, so I put the code on gist to make the 
> issue looks clean.
> https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will be thrown in onStageCompleted

2014-06-27 Thread Baoxu Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046102#comment-14046102
 ] 

Baoxu Shi commented on SPARK-2228:
--

I think a workaround would be adding EVENT_QUEUE_CAPACITY to sparkContext 
settings so people can adjust that according to their workload.



> onStageSubmitted does not properly called so NoSuchElement will be thrown in 
> onStageCompleted
> -
>
> Key: SPARK-2228
> URL: https://issues.apache.org/jira/browse/SPARK-2228
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Baoxu Shi
>
> We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during 
> iterative computing, but after several hundreds of iterations, there will be 
> `NoSuchElementsError`. We check the code and locate the problem at 
> `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is 
> called, such `stageId` can not be found in `stageIdToPool`, but it does exist 
> in other HashMaps. So we think `onStageSubmitted` is not properly called. 
> `Spark` did add a stage but failed to send the message to listeners. When 
> sending `finish` message to listeners, the error occurs. 
> This problem will cause a huge number of `active stages` showing in 
> `SparkUI`, which is really annoying. But it may not affect the final result, 
> according to the result of my testing code.
> I'm willing to help solve this problem, any idea about which part should I 
> change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something 
> to do with it but it looks fine to me.
> FYI, here is the test code that could reproduce the problem. I do not know 
> who to put code here with highlight, so I put the code on gist to make the 
> issue looks clean.
> https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2245) VertexRDD can not be materialized for checkpointing

2014-06-24 Thread Baoxu Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041448#comment-14041448
 ] 

Baoxu Shi edited comment on SPARK-2245 at 6/25/14 1:43 AM:
---

Hi [~ankurd], I changed my pull request. But there is another exception, 
ShippableVertexPartition is not serializable. So I serialized it, but there is 
another exception org.apache.spark.graphx.impl.RoutingTablePartition is not 
serializable.  Then I serialized it again, but on iteration 2 there will be an 
exception: org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast 
to scala.Tuple2

The code I'm using are:

val conf = new SparkConf().setAppName("HDTM")
  .setMaster("local[4]")
val sc = new SparkContext(conf)
sc.setCheckpointDir("./checkpoint")
val v = sc.parallelize(Seq[(VertexId, Long)]((0L, 0L), (1L, 1L), (2L, 2L)))
val e = sc.parallelize(Seq[Edge[Long]](Edge(0L, 1L, 0L), Edge(1L, 2L, 1L), 
Edge(2L, 0L, 2L)))
var g = Graph(v, e)

val vertexIds = Seq(0L, 1L, 2L)
var prevG: Graph[VertexId, Long] = null
for (i <- 1 to 2000) {
  vertexIds.toStream.foreach(id => {
prevG = g
g = Graph(g.vertices, g.edges)

g.vertices.cache()
g.edges.cache()
prevG.unpersistVertices(blocking = false)
prevG.edges.unpersist(blocking = false)
  })

  g.vertices.checkpoint()
  g.edges.checkpoint()

  g.edges.count()
  g.vertices.count()
  println(s"${g.vertices.isCheckpointed} ${g.edges.isCheckpointed}")

  println(" iter " + i + " finished")
}

println(g.vertices.collect().mkString(" "))
println(g.edges.collect().mkString(" "))

Am I on the right track? Or Should there be another way to change it?


was (Author: bxshi):
Just submit the changes, thanks!

> VertexRDD can not be materialized for checkpointing
> ---
>
> Key: SPARK-2245
> URL: https://issues.apache.org/jira/browse/SPARK-2245
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Baoxu Shi
>
> Seems one can not materialize VertexRDD by simply calling count method, which 
> is overridden by VertexRDD. But if you call RDD's count, it could materialize 
> it.
> Is this a feature that designed to get the count without materialize 
> VertexRDD? If so, do you guys think it is necessary to add a materialize 
> method to VertexRDD?
> By the way, does count() is the cheapest way to materialize a RDD? Or it just 
> cost the same resources like other actions?
> The pull request is here:
> https://github.com/apache/spark/pull/1177
> Best,



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2245) VertexRDD can not be materialized for checkpointing

2014-06-23 Thread Baoxu Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041448#comment-14041448
 ] 

Baoxu Shi commented on SPARK-2245:
--

Just submit the changes, thanks!

> VertexRDD can not be materialized for checkpointing
> ---
>
> Key: SPARK-2245
> URL: https://issues.apache.org/jira/browse/SPARK-2245
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Baoxu Shi
>
> Seems one can not materialize VertexRDD by simply calling count method, which 
> is overridden by VertexRDD. But if you call RDD's count, it could materialize 
> it.
> Is this a feature that designed to get the count without materialize 
> VertexRDD? If so, do you guys think it is necessary to add a materialize 
> method to VertexRDD?
> By the way, does count() is the cheapest way to materialize a RDD? Or it just 
> cost the same resources like other actions?
> The pull request is here:
> https://github.com/apache/spark/pull/1177
> Best,



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2245) VertexRDD can not be materialized for checkpointing

2014-06-23 Thread Baoxu Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041363#comment-14041363
 ] 

Baoxu Shi commented on SPARK-2245:
--

[~ankurd] Thanks for your detailed explanation. So if we want to fix this 
problem in the right way, we need to override all the methods related to 
checkpoint? I list all the methods I've found:

checkpoint
checkpointData
computeOrReadCheckpoint
doCheckpoint
getCheckpointFile
markCheckpointed

Am I missing some one? If I changed those, should I also provide corresponding 
test with my pull request?

> VertexRDD can not be materialized for checkpointing
> ---
>
> Key: SPARK-2245
> URL: https://issues.apache.org/jira/browse/SPARK-2245
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Baoxu Shi
>
> Seems one can not materialize VertexRDD by simply calling count method, which 
> is overridden by VertexRDD. But if you call RDD's count, it could materialize 
> it.
> Is this a feature that designed to get the count without materialize 
> VertexRDD? If so, do you guys think it is necessary to add a materialize 
> method to VertexRDD?
> By the way, does count() is the cheapest way to materialize a RDD? Or it just 
> cost the same resources like other actions?
> The pull request is here:
> https://github.com/apache/spark/pull/1177
> Best,



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2245) VertexRDD can not be materialized for checkpointing

2014-06-23 Thread Baoxu Shi (JIRA)
Baoxu Shi created SPARK-2245:


 Summary: VertexRDD can not be materialized for checkpointing
 Key: SPARK-2245
 URL: https://issues.apache.org/jira/browse/SPARK-2245
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Baoxu Shi


Seems one can not materialize VertexRDD by simply calling count method, which 
is overridden by VertexRDD. But if you call RDD's count, it could materialize 
it.

Is this a feature that designed to get the count without materialize VertexRDD? 
If so, do you guys think it is necessary to add a materialize method to 
VertexRDD?

By the way, does count() is the cheapest way to materialize a RDD? Or it just 
cost the same resources like other actions?

The pull request is here:
https://github.com/apache/spark/pull/1177

Best,



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will be thrown in onStageCompleted

2014-06-21 Thread Baoxu Shi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baoxu Shi updated SPARK-2228:
-

Summary: onStageSubmitted does not properly called so NoSuchElement will be 
thrown in onStageCompleted  (was: onStageSubmitted does not properly called so 
NoSuchElement will throw in onStageCompleted)

> onStageSubmitted does not properly called so NoSuchElement will be thrown in 
> onStageCompleted
> -
>
> Key: SPARK-2228
> URL: https://issues.apache.org/jira/browse/SPARK-2228
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Baoxu Shi
>
> We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during 
> iterative computing, but after several hundreds of iterations, there will be 
> `NoSuchElementsError`. We check the code and locate the problem at 
> `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is 
> called, such `stageId` can not be found in `stageIdToPool`, but it does exist 
> in other HashMaps. So we think `onStageSubmitted` is not properly called. 
> `Spark` did add a stage but failed to send the message to listeners. When 
> sending `finish` message to listeners, the error occurs. 
> This problem will cause a huge number of `active stages` showing in 
> `SparkUI`, which is really annoying. But it may not affect the final result, 
> according to the result of my testing code.
> I'm willing to help solve this problem, any idea about which part should I 
> change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something 
> to do with it but it looks fine to me.
> FYI, here is the test code that could reproduce the problem. I do not know 
> who to put code here with highlight, so I put the code on gist to make the 
> issue looks clean.
> https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted

2014-06-21 Thread Baoxu Shi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baoxu Shi updated SPARK-2228:
-

Affects Version/s: 1.0.0

> onStageSubmitted does not properly called so NoSuchElement will throw in 
> onStageCompleted
> -
>
> Key: SPARK-2228
> URL: https://issues.apache.org/jira/browse/SPARK-2228
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Baoxu Shi
>
> We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during 
> iterative computing, but after several hundreds of iterations, there will be 
> `NoSuchElementsError`. We check the code and locate the problem at 
> `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is 
> called, such `stageId` can not be found in `stageIdToPool`, but it does exist 
> in other HashMaps. So we think `onStageSubmitted` is not properly called. 
> `Spark` did add a stage but failed to send the message to listeners. When 
> sending `finish` message to listeners, the error occurs. 
> This problem will cause a huge number of `active stages` showing in 
> `SparkUI`, which is really annoying. But it may not affect the final result, 
> according to the result of my testing code.
> I'm willing to help solve this problem, any idea about which part should I 
> change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something 
> to do with it but it looks fine to me.
> FYI, here is the test code that could reproduce the problem. I do not know 
> who to put code here with highlight, so I put the code on gist to make the 
> issue looks clean.
> https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted

2014-06-21 Thread Baoxu Shi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baoxu Shi updated SPARK-2228:
-

Description: 
We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during 
iterative computing, but after several hundreds of iterations, there will be 
`NoSuchElementsError`. We check the code and locate the problem at 
`org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is 
called, such `stageId` can not be found in `stageIdToPool`, but it does exist 
in other HashMaps. So we think `onStageSubmitted` is not properly called. 
`Spark` did add a stage but failed to send the message to listeners. When 
sending `finish` message to listeners, the error occurs. 

This problem will cause a huge number of `active stages` showing in `SparkUI`, 
which is really annoying. But it may not affect the final result, according to 
the result of my testing code.

I'm willing to help solve this problem, any idea about which part should I 
change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something 
to do with it but it looks fine to me.

FYI, here is the test code that could reproduce the problem. I do not know who 
to put code here with highlight, so I put the code on gist to make the issue 
looks clean.

https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd

  was:
We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during 
iterative computing, but after several hundreds of iterations, there will be 
`NoSuchElementsError`. We check the code and locate the problem at 
`org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is 
called, such `stageId` can not be found in `stageIdToPool`, but it does exist 
in other HashMaps. So we think `onStageSubmitted` is not properly called. 
`Spark` did add a stage but failed to send the message to listeners. When 
sending `finish` message to listeners, the error occurs. 

This problem will cause a huge number of `active stages` showing in `SparkUI`, 
which is really annoying. But it may not affect the final result, according to 
the result of my testing code.

I'm willing to help solve this problem, any idea about which part should I 
change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something 
to do with it but it looks fine to me.

FYI, here is the test code that could reproduce the problem. I do not see code 
filed in the system so I put the code on gist.

https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd


> onStageSubmitted does not properly called so NoSuchElement will throw in 
> onStageCompleted
> -
>
> Key: SPARK-2228
> URL: https://issues.apache.org/jira/browse/SPARK-2228
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Baoxu Shi
>
> We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during 
> iterative computing, but after several hundreds of iterations, there will be 
> `NoSuchElementsError`. We check the code and locate the problem at 
> `org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is 
> called, such `stageId` can not be found in `stageIdToPool`, but it does exist 
> in other HashMaps. So we think `onStageSubmitted` is not properly called. 
> `Spark` did add a stage but failed to send the message to listeners. When 
> sending `finish` message to listeners, the error occurs. 
> This problem will cause a huge number of `active stages` showing in 
> `SparkUI`, which is really annoying. But it may not affect the final result, 
> according to the result of my testing code.
> I'm willing to help solve this problem, any idea about which part should I 
> change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something 
> to do with it but it looks fine to me.
> FYI, here is the test code that could reproduce the problem. I do not know 
> who to put code here with highlight, so I put the code on gist to make the 
> issue looks clean.
> https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2228) onStageSubmitted does not properly called so NoSuchElement will throw in onStageCompleted

2014-06-21 Thread Baoxu Shi (JIRA)
Baoxu Shi created SPARK-2228:


 Summary: onStageSubmitted does not properly called so 
NoSuchElement will throw in onStageCompleted
 Key: SPARK-2228
 URL: https://issues.apache.org/jira/browse/SPARK-2228
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Baoxu Shi


We are using `SaveAsObjectFile` and `objectFile` to cut off lineage during 
iterative computing, but after several hundreds of iterations, there will be 
`NoSuchElementsError`. We check the code and locate the problem at 
`org.apache.spark.ui.jobs.JobProgressListener`. When `onStageCompleted` is 
called, such `stageId` can not be found in `stageIdToPool`, but it does exist 
in other HashMaps. So we think `onStageSubmitted` is not properly called. 
`Spark` did add a stage but failed to send the message to listeners. When 
sending `finish` message to listeners, the error occurs. 

This problem will cause a huge number of `active stages` showing in `SparkUI`, 
which is really annoying. But it may not affect the final result, according to 
the result of my testing code.

I'm willing to help solve this problem, any idea about which part should I 
change? I assume `org.apache.spark.scheduler.SparkListenerBus` have something 
to do with it but it looks fine to me.

FYI, here is the test code that could reproduce the problem. I do not see code 
filed in the system so I put the code on gist.

https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd



--
This message was sent by Atlassian JIRA
(v6.2#6252)