[jira] [Commented] (SPARK-15328) Word2Vec import for original binary format

2016-12-05 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723536#comment-15723536
 ] 

Robin East commented on SPARK-15328:


Any news on the PR for this? There seem to be a few issues with large-scale 
models, as I mentioned in the comments.

> Word2Vec import for original binary format
> --
>
> Key: SPARK-15328
> URL: https://issues.apache.org/jira/browse/SPARK-15328
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yuming Wang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10945) GraphX computes Pagerank with NaN (with some datasets)

2016-02-22 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156641#comment-15156641
 ] 

Robin East commented on SPARK-10945:


[~ankurd] Did you get a chance to look at this?

> GraphX computes Pagerank with NaN (with some datasets)
> --
>
> Key: SPARK-10945
> URL: https://issues.apache.org/jira/browse/SPARK-10945
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.3.0
> Environment: Linux
>Reporter: Khaled Ammar
>  Labels: test
>
> Hi,
> I run GraphX on a medium-size standalone Spark 1.3.0 installation. 
> PageRank typically works fine, except with one dataset (Twitter: 
> http://law.di.unimi.it/webdata/twitter-2010). This is a public dataset that 
> is commonly used in research papers.
> I found that many vertices have NaN values. This is true even if the 
> algorithm runs for only 1 iteration.
> Thanks,
> -Khaled






[jira] [Commented] (SPARK-10945) GraphX computes Pagerank with NaN (with some datasets)

2016-02-22 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156639#comment-15156639
 ] 

Robin East commented on SPARK-10945:


It's not obvious how to reproduce this from the datasets available at the 
download site. You mentioned that the 'dataset format was converted to 
edge-list, no edge weights at all'. Can you share the code that converts from 
the WebGraph format to edge-list? Alternatively, can you make the input file 
available?







[jira] [Commented] (SPARK-6808) Checkpointing after zipPartitions results in NODE_LOCAL execution

2016-02-22 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156599#comment-15156599
 ] 

Robin East commented on SPARK-6808:
---

This doesn't look like it should be tagged with GraphX: as the reporter 
mentions, he can reproduce it using just Spark core code.

> Checkpointing after zipPartitions results in NODE_LOCAL execution
> -
>
> Key: SPARK-6808
> URL: https://issues.apache.org/jira/browse/SPARK-6808
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.2.1, 1.3.0
> Environment: EC2 Ubuntu r3.8xlarge machines
>Reporter: Xinghao Pan
>
> I'm encountering a weird issue where a simple iterative zipPartitions is 
> PROCESS_LOCAL before checkpointing, but turns NODE_LOCAL for all iterations 
> after checkpointing. More often than not, tasks are fetching remote blocks 
> from the network, leading to a 10x increase in runtime.
> Here's an example snippet of code:
> var R : RDD[(Long,Int)]
> = sc.parallelize((0 until numPartitions), numPartitions)
>   .mapPartitions(_ => new Array[(Long,Int)](1000).map(i => 
> (0L,0)).toSeq.iterator).cache()
> sc.setCheckpointDir(checkpointDir)
> var iteration = 0
> while (iteration < 50){
>   R = R.zipPartitions(R)((x,y) => x).cache()
>   if ((iteration+1) % checkpointIter == 0) R.checkpoint()
>   R.foreachPartition(_ => {})
>   iteration += 1
> }
> I've also tried to unpersist the old RDDs and to increase 
> spark.locality.wait, but neither helps.
> Strangely, adding a simple identity map
> R = R.map(x => x).cache()
> after the zipPartitions appears to partially mitigate the issue.
> The problem was originally triggered when I attempted to checkpoint after 
> doing joinVertices in GraphX, but the above example shows that the issue is 
> in Spark core too.






[jira] [Commented] (SPARK-3650) Triangle Count handles reverse edges incorrectly

2016-02-18 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15152327#comment-15152327
 ] 

Robin East commented on SPARK-3650:
---

I did ask if the PR could be revived but never followed up on it. If I get a 
moment I'll try to submit the PR myself; however, I have been a little busy on 
other GraphX things.

By the way, there is a workaround for the issue: make sure your edges are in 
the canonical direction before calling triangleCount.
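
As a rough illustration of that workaround, here is a minimal sketch in plain 
Scala collections (no Spark dependency). The names CanonicalEdges and 
canonicalize are hypothetical, and dropping self-loops and duplicate edges is 
my own assumption about what a clean input should look like. With Spark, the 
same map followed by distinct() could presumably be applied to the edge RDD 
before building the graph.

```scala
object CanonicalEdges {
  /** Orients every edge so that srcId < dstId, dropping self-loops and duplicates. */
  def canonicalize(edges: Seq[(Long, Long)]): Seq[(Long, Long)] =
    edges
      .filter { case (src, dst) => src != dst }                             // drop self-loops
      .map { case (src, dst) => if (src < dst) (src, dst) else (dst, src) } // smaller id first
      .distinct                                                             // a -> b and b -> a collapse to one edge
}
```

For example, canonicalize(Seq((0L, 1L), (1L, 2L), (2L, 0L), (2L, 1L))) keeps 
one edge per vertex pair, each with the smaller id first.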

> Triangle Count handles reverse edges incorrectly
> 
>
> Key: SPARK-3650
> URL: https://issues.apache.org/jira/browse/SPARK-3650
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Joseph E. Gonzalez
>Priority: Critical
>
> The triangle count implementation assumes that edges are aligned in a 
> canonical direction.  As stated in the documentation:
> bq. Note that the input graph should have its edges in canonical direction 
> (i.e. the `sourceId` less than `destId`)
> However the TriangleCount algorithm does not verify that this condition holds 
> and indeed even the unit tests exploit this functionality:
> {code:scala}
> val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++
>   Array(0L -> -1L, -1L -> -2L, -2L -> 0L)
> val rawEdges = sc.parallelize(triangles, 2)
> val graph = Graph.fromEdgeTuples(rawEdges, true).cache()
> val triangleCount = graph.triangleCount()
> val verts = triangleCount.vertices
> verts.collect().foreach { case (vid, count) =>
>   if (vid == 0) {
>     assert(count === 4)  // <-- Should be 2
>   } else {
>     assert(count === 2) // <-- Should be 1
>   }
> }
> {code}
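
The "Should be" values in the comments above can be cross-checked with a 
brute-force count that ignores edge direction. This is only a reference sketch 
in plain Scala (TriangleReference and trianglesPerVertex are hypothetical 
names, not GraphX API), and it assumes a triangle is any set of three mutually 
connected vertices regardless of how the edges are oriented.

```scala
object TriangleReference {
  /** Number of distinct undirected triangles each vertex takes part in. */
  def trianglesPerVertex(edges: Seq[(Long, Long)]): Map[Long, Int] = {
    // Undirected neighbour sets; self-loops and edge direction are ignored.
    val neighbours: Map[Long, Set[Long]] =
      edges
        .filter { case (a, b) => a != b }
        .flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

    neighbours.map { case (v, ns) =>
      // Count unordered pairs of neighbours that are themselves connected.
      val count = ns.toSeq.combinations(2).count {
        case Seq(a, b) => neighbours(a).contains(b)
      }
      v -> count
    }
  }
}
```

On the six edges from the snippet above, this gives 2 for vertex 0 and 1 for 
every other vertex.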






[jira] [Commented] (SPARK-6357) Add unapply in EdgeContext

2015-09-17 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804664#comment-14804664
 ] 

Robin East commented on SPARK-6357:
---

[~maropu] Looks like the PR was merged. Does that mean this JIRA can be closed?

> Add unapply in EdgeContext
> --
>
> Key: SPARK-6357
> URL: https://issues.apache.org/jira/browse/SPARK-6357
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Takeshi Yamamuro
>
> This extractor is mainly used for Graph#aggregateMessages*.






[jira] [Commented] (SPARK-9429) TriangleCount: job aborted due to stage failure

2015-09-17 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804678#comment-14804678
 ] 

Robin East commented on SPARK-9429:
---

The Scaladoc for triangleCount states: 'Note that the input graph should have 
its edges in canonical direction (i.e. the sourceId less than destId). Also the 
graph must have been partitioned using 
org.apache.spark.graphx.Graph#partitionBy.' The code relies on these conditions 
and fails an assertion when they are not met, e.g. when you have an 
Edge(2L,1L,...), which is not in canonical direction.
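
To see why a stray reverse edge can trip the parity assertion quoted in the 
description, here is a simplified model in plain Scala. It assumes (my reading 
of the algorithm, not the GraphX source) that every stored edge adds the number 
of neighbours shared by its two endpoints to both endpoints, so that each 
vertex ends up with a "double count" which is expected to be even. 
DoubleCountModel and doubleCounts are hypothetical names.

```scala
object DoubleCountModel {
  /** Per-vertex double count under the simplified model described above. */
  def doubleCounts(edges: Seq[(Long, Long)]): Map[Long, Int] = {
    // Undirected neighbour sets, excluding the vertex itself.
    val neighbours: Map[Long, Set[Long]] =
      edges
        .flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, pairs) => v -> (pairs.map(_._2).toSet - v) }

    edges
      .flatMap { case (src, dst) =>
        // Each stored edge contributes once, even if its reverse is also stored.
        val shared = neighbours(src).intersect(neighbours(dst)).size
        Seq(src -> shared, dst -> shared)
      }
      .groupBy(_._1)
      .map { case (v, contributions) => v -> contributions.map(_._2).sum }
  }
}
```

Under this model the three inputs from the description behave as reported: the 
one-directional triangle gives 2 for every vertex, the fully bidirectional one 
gives 4, and the input with the single extra edge 2L -> 1L gives 3 for vertices 
1 and 2, which is odd.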

> TriangleCount: job aborted due to stage failure
> ---
>
> Key: SPARK-9429
> URL: https://issues.apache.org/jira/browse/SPARK-9429
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: YangBaoxing
>
> Hi, all !
> When I run the TriangleCount algorithm on my own data, an exception like "Job 
> aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent 
> failure: Lost task 0.0 in stage 4.0 (TID 8, localhost): 
> java.lang.AssertionError: assertion failed" occurs. I checked the 
> source code and found that the failing line is "assert((dblCount & 1) == 
> 0)". I also found that it runs successfully on Array(0L -> 1L, 1L -> 2L, 
> 2L -> 0L) and Array(0L -> 1L, 1L -> 2L, 2L -> 0L, 0L -> 2L, 2L -> 1L, 1L -> 
> 0L) but fails on Array(0L -> 1L, 1L -> 2L, 2L -> 0L, 2L -> 1L). It seems 
> to work only for graphs that are entirely unidirectional or entirely 
> bidirectional. Is TriangleCount suitable for a partially bidirectional 
> graph? The complete exception is as follows:
> Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 4.0 (TID 8, localhost): 
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:90)
>   at org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:87)
>   at org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:140)
>   at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:159)
>   at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:156)
>   at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at org.apache.spark.graphx.VertexRDD.compute(VertexRDD.scala:71)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)






[jira] [Created] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits

2015-09-14 Thread Robin East (JIRA)
Robin East created SPARK-10598:
--

 Summary: RoutingTablePartition toMessage method refers to bytes 
instead of bits
 Key: SPARK-10598
 URL: https://issues.apache.org/jira/browse/SPARK-10598
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.5.0, 1.4.1, 1.4.0
Reporter: Robin East
Priority: Minor
 Fix For: 1.5.1
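
To illustrate why the bits/bytes wording matters: an Int has 32 bits (4 
bytes), so a split into a 2-wide and a 30-wide field only makes sense in bits. 
The sketch below packs a 2-bit position flag and a 30-bit partition ID into a 
single Int. The layout, the object name RoutingMessageSketch and all method 
names are assumptions for illustration only; they are not taken from this 
issue or from the GraphX source.

```scala
object RoutingMessageSketch {
  private val PidMask = 0x3FFFFFFF // lower 30 bits set

  /** Packs a 2-bit position (upper bits) and a 30-bit partition ID (lower bits). */
  def pack(pid: Int, position: Int): Int = {
    require(position >= 0 && position < 4, "position must fit in 2 bits")
    require(pid >= 0 && pid <= PidMask, "pid must fit in 30 bits")
    (position << 30) | pid
  }

  /** Extracts the lower 30 bits. */
  def pid(packed: Int): Int = packed & PidMask

  /** Extracts the upper 2 bits; the unsigned shift avoids sign extension. */
  def position(packed: Int): Int = packed >>> 30
}
```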









[jira] [Commented] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits

2015-09-14 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744359#comment-14744359
 ] 

Robin East commented on SPARK-10598:


Apologies, I have now checked it out. You're referring to the Fix Version and 
Target Version fields, right?

> RoutingTablePartition toMessage method refers to bytes instead of bits
> --
>
> Key: SPARK-10598
> URL: https://issues.apache.org/jira/browse/SPARK-10598
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Robin East
>Assignee: Robin East
>Priority: Trivial
>







[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-07-30 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647432#comment-14647432
 ] 

Robin East commented on SPARK-5692:
---

Hi, the description includes the sentence 'We may want to discuss whether we 
want to be compatible with the original Word2Vec model storage format.' Was 
this ever discussed? I can't see anything in the comment stream for this JIRA. 
Is there any interest in adding functionality to import Word2Vec models from 
the original binary format (e.g. the 3 million word Google News model)?
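
For reference, my understanding of the original word2vec tool's binary output 
is: a text header "<vocabSize> <vectorSize>" ending in a newline, then one 
record per word consisting of the word, a space, vectorSize 32-bit floats and 
usually a trailing newline. Below is a minimal reader sketch under those 
assumptions, with little-endian floats assumed. Word2VecBinary and read are 
hypothetical names, not Spark API, and a model as large as Google News would 
need streaming rather than an in-memory Map.

```scala
import java.io.{BufferedInputStream, ByteArrayOutputStream, DataInputStream, InputStream}
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

object Word2VecBinary {
  /** Reads bytes up to (not including) the delimiter and decodes them as UTF-8. */
  private def readToken(in: DataInputStream, delimiter: Char): String = {
    val buf = new ByteArrayOutputStream()
    var b = in.readByte()
    while (b != delimiter.toByte) {
      buf.write(b.toInt)
      b = in.readByte()
    }
    new String(buf.toByteArray, StandardCharsets.UTF_8)
  }

  /** Loads every (word, vector) pair from a stream in the assumed format. */
  def read(stream: InputStream): Map[String, Array[Float]] = {
    val in = new DataInputStream(new BufferedInputStream(stream))
    val vocabSize = readToken(in, ' ').trim.toInt
    val vectorSize = readToken(in, '\n').trim.toInt
    val raw = new Array[Byte](vectorSize * 4)
    (0 until vocabSize).map { _ =>
      // trim removes the newline that may trail the previous record's floats
      val word = readToken(in, ' ').trim
      in.readFully(raw)
      val vector = new Array[Float](vectorSize)
      ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vector)
      word -> vector
    }.toMap
  }
}
```

Usage would be along the lines of 
Word2VecBinary.read(new java.io.FileInputStream(path)), where path points at a 
model file in that format.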

> Model import/export for Word2Vec
> --
>
> Key: SPARK-5692
> URL: https://issues.apache.org/jira/browse/SPARK-5692
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Manoj Kumar
> Fix For: 1.4.0
>
> Support save and load for Word2VecModel. We may want to discuss whether we 
> want to be compatible with the original Word2Vec model storage format.






[jira] [Commented] (SPARK-3650) Triangle Count handles reverse edges incorrectly

2015-06-26 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603016#comment-14603016
 ] 

Robin East commented on SPARK-3650:
---

What is the status of this issue? A user on the mailing list just ran into 
it. It looks like PR-2495 should fix the issue. Is there a version that 
is being targeted for the fix?

> Triangle Count handles reverse edges incorrectly
> --
>
> Key: SPARK-3650
> URL: https://issues.apache.org/jira/browse/SPARK-3650
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Joseph E. Gonzalez
>Priority: Critical
>
> The triangle count implementation assumes that edges are aligned in a 
> canonical direction.  As stated in the documentation:
> bq. Note that the input graph should have its edges in canonical direction 
> (i.e. the `sourceId` less than `destId`)
> However the TriangleCount algorithm does not verify that this condition holds 
> and indeed even the unit tests exploit this functionality:
> {code:scala}
> val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++
>   Array(0L -> -1L, -1L -> -2L, -2L -> 0L)
> val rawEdges = sc.parallelize(triangles, 2)
> val graph = Graph.fromEdgeTuples(rawEdges, true).cache()
> val triangleCount = graph.triangleCount()
> val verts = triangleCount.vertices
> verts.collect().foreach { case (vid, count) =>
>   if (vid == 0) {
>     assert(count === 4)  // <-- Should be 2
>   } else {
>     assert(count === 2) // <-- Should be 1
>   }
> }
> {code}


