[jira] [Commented] (SPARK-15328) Word2Vec import for original binary format
[ https://issues.apache.org/jira/browse/SPARK-15328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723536#comment-15723536 ]

Robin East commented on SPARK-15328:
------------------------------------

Any news on the PR for this? There seem to be a few issues with large-scale models, as I mention in the comments.

> Word2Vec import for original binary format
> ------------------------------------------
>
>                 Key: SPARK-15328
>                 URL: https://issues.apache.org/jira/browse/SPARK-15328
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Yuming Wang
>            Priority: Minor
>

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10945) GraphX computes Pagerank with NaN (with some datasets)
[ https://issues.apache.org/jira/browse/SPARK-10945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156641#comment-15156641 ]

Robin East commented on SPARK-10945:
------------------------------------

[~ankurd] Did you get a chance to look at this?

> GraphX computes Pagerank with NaN (with some datasets)
> ------------------------------------------------------
>
>                 Key: SPARK-10945
>                 URL: https://issues.apache.org/jira/browse/SPARK-10945
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 1.3.0
>         Environment: Linux
>            Reporter: Khaled Ammar
>              Labels: test
>
> Hi,
> I run GraphX in a medium-size standalone Spark 1.3.0 installation. PageRank typically works fine, except with one dataset (Twitter: http://law.di.unimi.it/webdata/twitter-2010). This is a public dataset that is commonly used in research papers.
> I found that many vertices have NaN values. This is true even if the algorithm runs for 1 iteration only.
> Thanks,
> -Khaled
[jira] [Commented] (SPARK-10945) GraphX computes Pagerank with NaN (with some datasets)
[ https://issues.apache.org/jira/browse/SPARK-10945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156639#comment-15156639 ]

Robin East commented on SPARK-10945:
------------------------------------

It's not obvious how to reproduce this from the datasets available at the download site. You mentioned that 'dataset format was converted to edge-list, no edge weights at all'. Can you share the code that converts from the WebGraph format to edge-list? Alternatively, can you make the input file available?

> GraphX computes Pagerank with NaN (with some datasets)
> ------------------------------------------------------
>
>                 Key: SPARK-10945
>                 URL: https://issues.apache.org/jira/browse/SPARK-10945
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 1.3.0
>         Environment: Linux
>            Reporter: Khaled Ammar
>              Labels: test
>
> Hi,
> I run GraphX in a medium-size standalone Spark 1.3.0 installation. PageRank typically works fine, except with one dataset (Twitter: http://law.di.unimi.it/webdata/twitter-2010). This is a public dataset that is commonly used in research papers.
> I found that many vertices have NaN values. This is true even if the algorithm runs for 1 iteration only.
> Thanks,
> -Khaled
[jira] [Commented] (SPARK-6808) Checkpointing after zipPartitions results in NODE_LOCAL execution
[ https://issues.apache.org/jira/browse/SPARK-6808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156599#comment-15156599 ]

Robin East commented on SPARK-6808:
-----------------------------------

It doesn't look like this should be tagged with GraphX - as the Reporter mentions, he can reproduce it using just Spark core code.

> Checkpointing after zipPartitions results in NODE_LOCAL execution
> -----------------------------------------------------------------
>
>                 Key: SPARK-6808
>                 URL: https://issues.apache.org/jira/browse/SPARK-6808
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX, Spark Core
>    Affects Versions: 1.2.1, 1.3.0
>         Environment: EC2 Ubuntu r3.8xlarge machines
>            Reporter: Xinghao Pan
>
> I'm encountering a weird issue where a simple iterative zipPartitions is PROCESS_LOCAL before checkpointing, but turns NODE_LOCAL for all iterations after checkpointing. More often than not, tasks are fetching remote blocks from the network, leading to a 10x increase in runtime.
> Here's an example snippet of code:
> var R : RDD[(Long,Int)]
>   = sc.parallelize((0 until numPartitions), numPartitions)
>       .mapPartitions(_ => new Array[(Long,Int)](1000).map(i => (0L,0)).toSeq.iterator).cache()
> sc.setCheckpointDir(checkpointDir)
> var iteration = 0
> while (iteration < 50){
>   R = R.zipPartitions(R)((x,y) => x).cache()
>   if ((iteration+1) % checkpointIter == 0) R.checkpoint()
>   R.foreachPartition(_ => {})
>   iteration += 1
> }
> I've also tried unpersisting the old RDDs and increasing spark.locality.wait, but neither helps.
> Strangely, adding a simple identity map
> R = R.map(x => x).cache()
> after the zipPartitions appears to partially mitigate the issue.
> The problem was originally triggered when I attempted to checkpoint after doing joinVertices in GraphX, but the above example shows that the issue is in Spark core too.
[jira] [Commented] (SPARK-3650) Triangle Count handles reverse edges incorrectly
[ https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15152327#comment-15152327 ]

Robin East commented on SPARK-3650:
-----------------------------------

I did ask if the PR could be revived but never followed up on it. If I get a moment I'll try and submit the PR myself; however, I have been a little busy on other GraphX things. By the way, there is a workaround to the issue, which is to make sure your edges are in the canonical direction before calling triangleCount.

> Triangle Count handles reverse edges incorrectly
> ------------------------------------------------
>
>                 Key: SPARK-3650
>                 URL: https://issues.apache.org/jira/browse/SPARK-3650
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Joseph E. Gonzalez
>            Priority: Critical
>
> The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation:
> bq. Note that the input graph should have its edges in canonical direction (i.e. the `sourceId` less than `destId`)
> However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploit this functionality:
> {code:scala}
> val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++
>   Array(0L -> -1L, -1L -> -2L, -2L -> 0L)
> val rawEdges = sc.parallelize(triangles, 2)
> val graph = Graph.fromEdgeTuples(rawEdges, true).cache()
> val triangleCount = graph.triangleCount()
> val verts = triangleCount.vertices
> verts.collect().foreach { case (vid, count) =>
>   if (vid == 0) {
>     assert(count === 4) // <-- Should be 2
>   } else {
>     assert(count === 2) // <-- Should be 1
>   }
> }
> {code}
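To make the workaround concrete, here is an illustrative, Spark-free Python sketch (function names are mine, not GraphX's) of canonicalizing edges before counting triangles. On the edge set from the unit test quoted above, it produces the counts the JIRA says are correct: 2 for vertex 0 and 1 for every other vertex.

```python
def canonicalize(edges):
    """Rewrite each (src, dst) edge so that src < dst, dropping self-loops
    and duplicates - the canonical-direction precondition described in the
    triangleCount documentation."""
    return {(min(s, d), max(s, d)) for (s, d) in edges if s != d}

def triangle_counts(edges):
    """Per-vertex triangle counts over the canonicalized graph."""
    canon = canonicalize(edges)
    neighbors = {}
    for s, d in canon:
        neighbors.setdefault(s, set()).add(d)
        neighbors.setdefault(d, set()).add(s)
    counts = {v: 0 for v in neighbors}
    for s, d in canon:
        # each triangle is credited once to the vertex opposite each edge,
        # so every vertex of a triangle is counted exactly once
        for shared in neighbors[s] & neighbors[d]:
            counts[shared] += 1
    return counts

# The edge set from the JIRA's unit test: vertex 0 sits in two triangles,
# every other vertex in one.
counts = triangle_counts([(0, 1), (1, 2), (2, 0), (0, -1), (-1, -2), (-2, 0)])
```

In Spark the equivalent canonicalization would be a map over the edge RDD swapping any (src, dst) pair with src > dst before building the graph.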
[jira] [Commented] (SPARK-6357) Add unapply in EdgeContext
[ https://issues.apache.org/jira/browse/SPARK-6357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804664#comment-14804664 ]

Robin East commented on SPARK-6357:
-----------------------------------

[~maropu] Looks like the PR was merged. Does that mean this JIRA can be closed?

> Add unapply in EdgeContext
> --------------------------
>
>                 Key: SPARK-6357
>                 URL: https://issues.apache.org/jira/browse/SPARK-6357
>             Project: Spark
>          Issue Type: Improvement
>          Components: GraphX
>            Reporter: Takeshi Yamamuro
>
> This extractor is mainly used for Graph#aggregateMessages*.
[jira] [Commented] (SPARK-9429) TriangleCount: job aborted due to stage failure
[ https://issues.apache.org/jira/browse/SPARK-9429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804678#comment-14804678 ]

Robin East commented on SPARK-9429:
-----------------------------------

The scaladocs for triangleCount state: 'Note that the input graph should have its edges in canonical direction (i.e. the sourceId less than destId). Also the graph must have been partitioned using org.apache.spark.graphx.Graph#partitionBy.' The code checks for this condition and throws an assertion error when the condition is not met, e.g. you have an Edge(2L,1L,...) which is not in canonical direction.

> TriangleCount: job aborted due to stage failure
> -----------------------------------------------
>
>                 Key: SPARK-9429
>                 URL: https://issues.apache.org/jira/browse/SPARK-9429
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>            Reporter: YangBaoxing
>
> Hi, all!
> When I run the TriangleCount algorithm on my own data, an exception like "Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 8, localhost): java.lang.AssertionError: assertion failed" occurred. I then checked the source code and found that the problem is in the line "assert((dblCount & 1) == 0)". I also found that it runs successfully on Array(0L -> 1L, 1L -> 2L, 2L -> 0L) and Array(0L -> 1L, 1L -> 2L, 2L -> 0L, 0L -> 2L, 2L -> 1L, 1L -> 0L), while it fails on Array(0L -> 1L, 1L -> 2L, 2L -> 0L, 2L -> 1L). It seems to be suitable only for fully unidirectional or fully bidirectional graphs. Is TriangleCount suitable for an incompletely bidirectional graph?
> The complete exception is as follows:
> Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 8, localhost): java.lang.AssertionError: assertion failed
>     at scala.Predef$.assert(Predef.scala:165)
>     at org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:90)
>     at org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:87)
>     at org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:140)
>     at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:159)
>     at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:156)
>     at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.graphx.VertexRDD.compute(VertexRDD.scala:71)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>     at org.apache.spark.scheduler.Task.run(Task.scala:70)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
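A quick pre-flight check for datasets like the one above can be written in a few lines. This is a hypothetical plain-Python helper (the name is mine; GraphX's actual runtime check is the parity assertion in TriangleCount.scala) that flags edges violating the canonical-direction precondition the comment quotes:

```python
def non_canonical_edges(edges):
    """Return (src, dst) pairs whose src is not strictly less than dst,
    i.e. edges violating triangleCount's documented canonical-direction
    precondition (sourceId less than destId)."""
    return [(s, d) for (s, d) in edges if s >= d]

# On the reporter's failing edge set, (2, 0) and (2, 1) - including the
# Edge(2L, 1L) called out in the comment - are both flagged.
bad = non_canonical_edges([(0, 1), (1, 2), (2, 0), (2, 1)])
```

Edges it flags would need to be reversed (or the graph canonicalized) before calling triangleCount.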
[jira] [Created] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits
Robin East created SPARK-10598:
----------------------------------

             Summary: RoutingTablePartition toMessage method refers to bytes instead of bits
                 Key: SPARK-10598
                 URL: https://issues.apache.org/jira/browse/SPARK-10598
             Project: Spark
          Issue Type: Bug
          Components: Documentation
    Affects Versions: 1.5.0, 1.4.1, 1.4.0
            Reporter: Robin East
            Priority: Minor
             Fix For: 1.5.1
[jira] [Commented] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits
[ https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744359#comment-14744359 ]

Robin East commented on SPARK-10598:
------------------------------------

Apologies - I have checked it out. You're referring to the Fix and Target Version fields, right?

> RoutingTablePartition toMessage method refers to bytes instead of bits
> ----------------------------------------------------------------------
>
>                 Key: SPARK-10598
>                 URL: https://issues.apache.org/jira/browse/SPARK-10598
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 1.4.1, 1.5.0
>            Reporter: Robin East
>            Assignee: Robin East
>            Priority: Trivial
>
[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647432#comment-14647432 ]

Robin East commented on SPARK-5692:
-----------------------------------

Hi, the description includes the sentence 'We may want to discuss whether we want to be compatible with the original Word2Vec model storage format.' Was this ever discussed? I can't see anything in the comment stream for this JIRA. Is there any interest in adding functionality to import Word2Vec models from the original binary format (e.g. the 300-million-word Google News model)?

> Model import/export for Word2Vec
> --------------------------------
>
>                 Key: SPARK-5692
>                 URL: https://issues.apache.org/jira/browse/SPARK-5692
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Manoj Kumar
>             Fix For: 1.4.0
>
> Support save and load for Word2VecModel. We may want to discuss whether we want to be compatible with the original Word2Vec model storage format.
[jira] [Commented] (SPARK-3650) Triangle Count handles reverse edges incorrectly
[ https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603016#comment-14603016 ]

Robin East commented on SPARK-3650:
-----------------------------------

What is the status of this issue? A user on the mailing list just ran into this issue. It looks like PR-2495 should fix it. Is there a version that is being targeted for the fix?

> Triangle Count handles reverse edges incorrectly
> ------------------------------------------------
>
>                 Key: SPARK-3650
>                 URL: https://issues.apache.org/jira/browse/SPARK-3650
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Joseph E. Gonzalez
>            Priority: Critical
>
> The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation:
> bq. Note that the input graph should have its edges in canonical direction (i.e. the `sourceId` less than `destId`)
> However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploit this functionality:
> {code:scala}
> val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++
>   Array(0L -> -1L, -1L -> -2L, -2L -> 0L)
> val rawEdges = sc.parallelize(triangles, 2)
> val graph = Graph.fromEdgeTuples(rawEdges, true).cache()
> val triangleCount = graph.triangleCount()
> val verts = triangleCount.vertices
> verts.collect().foreach { case (vid, count) =>
>   if (vid == 0) {
>     assert(count === 4) // <-- Should be 2
>   } else {
>     assert(count === 2) // <-- Should be 1
>   }
> }
> {code}