[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

2016-02-21 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/11290#issuecomment-186951853
  
This looks good to me. @insidedctm thanks for reviving the PR, and @srowen 
thanks for taking a look at this!  My only minor concern is that it will change 
the results for people who are using triangle count, so we should note this in 
the change log for the next release. 







[GitHub] spark pull request: Adding zipPartitions to PySpark

2016-01-04 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/10550#issuecomment-168628357
  
@davies and @JoshRosen I have finished a working prototype that passes the 
tests.  I would be interested in your thoughts.  





[GitHub] spark pull request: Adding zipPartitions to PySpark

2016-01-04 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/10550#issuecomment-168850430
  
@davies thanks for taking a look!  I will open a JIRA issue later today.

With respect to the disk-based design, I had considered it, but it has a few 
limitations.  First, it breaks the lazy evaluation model, which (while not 
critical) was something I wanted to avoid.  Second, and perhaps more 
importantly, I wanted to avoid writing both relations completely to disk, since 
it is possible that one may only need to be partially processed or may require 
subsequent disk processing.  I should add that the Python performance is 
actually pretty reasonable :-) for basic data-processing tasks, so I wanted a 
design that would retain the current level of performance.







[GitHub] spark pull request: Adding zipPartitions to PySpark

2016-01-04 Thread jegonzal
Github user jegonzal closed the pull request at:

https://github.com/apache/spark/pull/10550





[GitHub] spark pull request: Adding zipPartitions to PySpark

2016-01-03 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/10550#issuecomment-168568011
  
retest this please






[GitHub] spark pull request: Adding zipPartitions to PySpark

2016-01-03 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/10550#issuecomment-168566477
  
retest this please





[GitHub] spark pull request: Adding zipPartitions to PySpark

2016-01-03 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/10550#issuecomment-168497312
  
retest this please






[GitHub] spark pull request: Adding zipPartitions to PySpark

2016-01-01 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/10550#issuecomment-168368506
  
retest this please





[GitHub] spark pull request: Adding zipPartitions to PySpark

2016-01-01 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/10550#issuecomment-168362426
  
@davies and @JoshRosen let me know what you think of this design.  





[GitHub] spark pull request: Adding zipPartitions to PySpark

2016-01-01 Thread jegonzal
GitHub user jegonzal opened a pull request:

https://github.com/apache/spark/pull/10550

Adding zipPartitions to PySpark

The following working WIP adds support for `zipPartitions` to PySpark.  
This is accomplished by modifying the PySpark `worker` (in both daemon and 
non-daemon mode) to open a second socket back to the Spark process.  The second 
socket is used to send tuples from the second iterator in `zipPartitions`, 
enabling the user-defined function to pull tuples from both iterators at 
different rates without requiring a back-and-forth protocol over the primary 
socket.  A single-socket protocol design was considered, but it creates issues 
with the built-in serializers and would require much larger changes.  The 
second socket is always created at the launch of the worker process and is 
simply ignored if it is not needed.
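
For reference, the Scala RDD API already provides this operation; the sketch below (data and function invented for illustration) shows the behavior the PySpark version mirrors, namely that the user function consumes the two partition iterators at its own pace.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the existing Scala RDD.zipPartitions API that this PR
// mirrors in PySpark. The user function receives one iterator per RDD and
// may consume them at different rates, which is the behavior the second
// worker socket preserves on the Python side.
object ZipPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("zipPartitionsSketch").setMaster("local[2]"))
    val nums  = sc.parallelize(1 to 8, numSlices = 2)
    val words = sc.parallelize(Seq("a", "b", "c", "d", "e", "f", "g", "h"), numSlices = 2)
    // Both RDDs must have the same number of partitions.
    val zipped = nums.zipPartitions(words) { (xs: Iterator[Int], ys: Iterator[String]) =>
      xs.zip(ys).map { case (x, y) => s"$x-$y" }
    }
    zipped.collect().foreach(println)
    sc.stop()
  }
}
```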



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jegonzal/spark multi_iterator_pyspark

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10550.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10550


commit 70650ab94ae5dceca2dd6a970035d45dffdce2b1
Author: Joseph Gonzalez <joseph.e.gonza...@gmail.com>
Date:   2016-01-02T01:40:10Z

compiling prototype

commit 61512acb2dba276b2bbd1bca5d22ff2474f6def5
Author: Joseph Gonzalez <joseph.e.gonza...@gmail.com>
Date:   2016-01-02T03:51:40Z

addressing a bug where sockets could get created multiple times







[GitHub] spark pull request: [SPARK-11432][GraphX] Personalized PageRank sh...

2015-11-02 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/9386#issuecomment-153207973
  
This is actually a pretty serious error, since it could lead to mass 
accumulating on unreachable subgraphs.  The performance implications of the 
above branch should be negligible.
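
For context, the fix amounts to injecting reset mass only at the personalization source. A hedged sketch (the `delta` formulation follows the PageRank diff quoted later in this thread; the helper names are assumptions):

```scala
import org.apache.spark.graphx.VertexId

// Hedged sketch, not the merged code: in personalized PageRank the reset
// probability should apply only at the source vertex. delta(src, id) is
// 1.0 when id == src and 0.0 otherwise, so vertices unreachable from src
// can no longer accumulate reset mass.
def delta(u: VertexId, v: VertexId): Double = if (u == v) 1.0 else 0.0

def resetMass(resetProb: Double, src: VertexId, id: VertexId): Double =
  resetProb * delta(src, id)
```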





[GitHub] spark pull request: [SPARK-4086][GraphX]: Fold-style aggregation f...

2015-09-02 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/5142#issuecomment-137170613
  
@srowen GraphX is still active; we have just been pretty busy with some 
other changes.  Let me see what needs to be done with this PR.





[GitHub] spark pull request: [SPARK-9001] Fixing errors in javadocs that le...

2015-07-13 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/7354#issuecomment-120998075
  
I will make the suggested changes now.





[GitHub] spark pull request: [SPARK-9001] Fixing errors in javadocs that le...

2015-07-13 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/7354#issuecomment-121000162
  
I have merged upstream changes and added back the requested paragraph 
blocks (correctly).





[GitHub] spark pull request: [SPARK-9001] Fixing errors in javadocs that le...

2015-07-12 Thread jegonzal
Github user jegonzal commented on a diff in the pull request:

https://github.com/apache/spark/pull/7354#discussion_r34419761
  
--- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java ---
@@ -25,9 +25,9 @@
 
 import static org.apache.spark.launcher.CommandBuilderUtils.*;
 
-/**
+/** 
  * Launcher for Spark applications.
- * <p/>
--- End diff --

This was rejected by JDK8.  I thought that the first line was treated 
differently, so I dropped the spurious `<p/>`.  Without this fix it is not 
possible to build the docs or publish locally, so it was a serious issue for me.





[GitHub] spark pull request: [SPARK-9001] Fixing errors in javadocs that le...

2015-07-11 Thread jegonzal
GitHub user jegonzal opened a pull request:

https://github.com/apache/spark/pull/7354

[SPARK-9001] Fixing errors in javadocs that lead to failed build/sbt doc

These are minor corrections in the documentation of several classes that 
are preventing:

```bash
build/sbt publish-local
```

I believe this might be an issue associated with running JDK8, as @ankurdave 
does not appear to have this issue with JDK7.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jegonzal/spark FixingJavadocErrors

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7354.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7354


commit 958bea2ca969dccbac2323eb4e783cc1b095139f
Author: Joseph Gonzalez joseph.e.gonza...@gmail.com
Date:   2015-07-11T06:22:01Z

Fixing errors in javadocs that prevents build/sbt publish-local from 
completing.







[GitHub] spark pull request: Improved GraphX PageRank Test Coverage

2015-05-18 Thread jegonzal
Github user jegonzal closed the pull request at:

https://github.com/apache/spark/pull/1228





[GitHub] spark pull request: Improved GraphX PageRank Test Coverage

2015-05-18 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1228#issuecomment-103186580
  
I think we have covered most of this code in later tests (PR #1217), and the 
remaining tests need to be substantially updated, which I can do in a later PR.  
I am going to go ahead and close this one.  Sorry about the delay.





[GitHub] spark pull request: Spark-5854 personalized page rank

2015-05-01 Thread jegonzal
Github user jegonzal commented on a diff in the pull request:

https://github.com/apache/spark/pull/4774#discussion_r29521247
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala ---
@@ -103,8 +132,14 @@ object PageRank extends Logging {
       // that didn't receive a message. Requires a shuffle for broadcasting updated ranks to the
       // edge partitions.
       prevRankGraph = rankGraph
+      val rPrb = if (personalized) {
+        (src: VertexId, id: VertexId) => resetProb * delta(src, id)
+      } else {
+        (src: VertexId, id: VertexId) => resetProb
+      }
+
       rankGraph = rankGraph.joinVertices(rankUpdates) {
-        (id, oldRank, msgSum) => resetProb + (1.0 - resetProb) * msgSum
+        (id, oldRank, msgSum) => rPrb(src, id) + (1.0 - resetProb) * msgSum
--- End diff --

This all looks correct, but I have a minor concern that the extra function 
call and branching might increase overhead if the HotSpot optimizations don't 
inline it.  Do we have a sense of the performance cost of this change?  An 
alternative, less elegant solution would be to have two code paths for lines 
141 and 142, depending on whether personalization is enabled, roughly as 
sketched below.
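
For concreteness, a hedged sketch of that alternative, wrapped in a standalone function (the graph types and the `delta` helper are assumptions mirroring the diff above):

```scala
import org.apache.spark.graphx.{Graph, VertexId, VertexRDD}

// Sketch of the two-code-path alternative: branch once outside the vertex
// closure so the hot path contains no extra function call or branch.
def updateRanks(
    rankGraph: Graph[Double, Double],
    rankUpdates: VertexRDD[Double],
    resetProb: Double,
    personalized: Boolean,
    src: VertexId): Graph[Double, Double] = {
  def delta(u: VertexId, v: VertexId): Double = if (u == v) 1.0 else 0.0
  if (personalized) {
    rankGraph.joinVertices(rankUpdates) {
      (id, oldRank, msgSum) => resetProb * delta(src, id) + (1.0 - resetProb) * msgSum
    }
  } else {
    rankGraph.joinVertices(rankUpdates) {
      (id, oldRank, msgSum) => resetProb + (1.0 - resetProb) * msgSum
    }
  }
}
```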





[GitHub] spark pull request: Spark-5854 personalized page rank

2015-05-01 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/4774#issuecomment-98196331
  
Overall this looks great!  I apologize for the delayed response.   I am 
going to go ahead and merge this now and then we can tune the performance in a 
later pull request. 





[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

2015-04-30 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/5403#issuecomment-97975070
  
This PR could have important performance implications for algorithms in 
GraphX and MLlib (e.g., ALS) which introduce relatively lightweight shuffle 
stages at each iteration. 





[GitHub] spark pull request: [GraphX] initialmessage for pagerank should be...

2015-02-22 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1128#issuecomment-75451894
  
Great!  I agree with this proposal as well.  I apologize for letting it sit
so long.






[GitHub] spark pull request: [SPARK-3650] Fix TriangleCount handling of rev...

2015-01-21 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/2495#issuecomment-70925718
  
Great!  What else needs to be done?  There was some discussion about how
this might change the semantics of the triangle count function; is this
still true?





[GitHub] spark pull request: [SPARK-2365] Add IndexedRDD, an efficient upda...

2015-01-11 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1297#issuecomment-69504481
  
We should really address this stack overflow issue. Is there a JIRA we can 
promote?





[GitHub] spark pull request: [SPARK-2365] Add IndexedRDD, an efficient upda...

2015-01-11 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1297#issuecomment-69531109
  
Hmm, we really need to elevate this to a full issue.  I have run into the
stack overflow in MLlib (ALS) as well.





[GitHub] spark pull request: Removing confusing TripletFields

2014-11-25 Thread jegonzal
GitHub user jegonzal opened a pull request:

https://github.com/apache/spark/pull/3472

Removing confusing TripletFields

After additional discussion with @rxin, I think having all the possible 
`TripletFields` options is confusing.  This pull request reduces the triplet 
fields to:

```java
  /**
   * None of the triplet fields are exposed.
   */
  public static final TripletFields None = new TripletFields(false, false, false);

  /**
   * Expose only the edge field and not the source or destination field.
   */
  public static final TripletFields EdgeOnly = new TripletFields(false, false, true);

  /**
   * Expose the source and edge fields but not the destination field. (Same as Src)
   */
  public static final TripletFields Src = new TripletFields(true, false, true);

  /**
   * Expose the destination and edge fields but not the source field. (Same as Dst)
   */
  public static final TripletFields Dst = new TripletFields(false, true, true);

  /**
   * Expose all the fields (source, edge, and destination).
   */
  public static final TripletFields All = new TripletFields(true, true, true);
```
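
For reference, a short hedged usage sketch with GraphX's `aggregateMessages` (the graph and message logic here are invented): passing one of these constants lets the system skip shipping vertex attributes the send function never reads.

```scala
import org.apache.spark.graphx.{Graph, TripletFields, VertexRDD}

// Hypothetical example: sum edge-weighted source attributes into each
// destination vertex. Only the source and edge fields are read, so
// TripletFields.Src lets GraphX avoid shipping destination attributes.
def weightedInDegree(graph: Graph[Double, Double]): VertexRDD[Double] =
  graph.aggregateMessages[Double](
    ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), // reads src + edge only
    (a, b) => a + b,                              // merge messages
    TripletFields.Src)
```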


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jegonzal/spark SimplifyTripletFields

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3472.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3472


commit 91796b52c17f5c88f1be6c7fe13d49c8e0cf64b1
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date:   2014-11-26T06:26:43Z

removing confusing triplet fields







[GitHub] spark pull request: Removing confusing TripletFields

2014-11-25 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/3472#issuecomment-64520673
  
@ankurdave, what do you think?





[GitHub] spark pull request: Removing confusing TripletFields

2014-11-25 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/3472#issuecomment-64520711
  
This is consistent with the current discussion in the GraphX programming 
guide, so it is unlikely that users have started using the more obscure 
combinations that were removed.





[GitHub] spark pull request: Updating GraphX programming guide and document...

2014-11-19 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/3359#issuecomment-63743607
  
Sounds good.  I can fix it now if you want.

Joey





[GitHub] spark pull request: Updating GraphX programming guide and document...

2014-11-18 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/3359#issuecomment-63597028
  
@rxin and @ankurdave, take a look when you get a chance.





[GitHub] spark pull request: Improved GraphX PageRank Test Coverage

2014-11-18 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1228#issuecomment-63597173
  
@ankurdave and @rxin can we merge this now?





[GitHub] spark pull request: Introducing an Improved Pregel API

2014-11-18 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1217#issuecomment-63597212
  
@ankurdave should I try to update this with your latest changes, or do you 
want to create a new one?





[GitHub] spark pull request: Updating GraphX programming guide and document...

2014-11-18 Thread jegonzal
Github user jegonzal commented on a diff in the pull request:

https://github.com/apache/spark/pull/3359#discussion_r20559596
  
--- Diff: project/SparkBuild.scala ---
@@ -328,7 +328,7 @@ object Unidoc {
 unidocProjectFilter in(ScalaUnidoc, unidoc) :=
   inAnyProject -- inProjects(OldDeps.project, repl, examples, tools, catalyst, streamingFlumeSink, yarn, yarnAlpha),
 unidocProjectFilter in(JavaUnidoc, unidoc) :=
-  inAnyProject -- inProjects(OldDeps.project, repl, bagel, graphx, examples, tools, catalyst, streamingFlumeSink, yarn, yarnAlpha),
+  inAnyProject -- inProjects(OldDeps.project, repl, bagel, examples, tools, catalyst, streamingFlumeSink, yarn, yarnAlpha),
--- End diff --

In order for the TripletFields.java API to be rendered in the docs (i.e., 
the list of static fields), it must be compiled using javadoc.





[GitHub] spark pull request: Drop VD type parameter from EdgeRDD

2014-11-16 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/3303#issuecomment-63267156
  
Looks good to me.





[GitHub] spark pull request: Introducing an Improved Pregel API

2014-11-13 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1217#issuecomment-62917449
  
@ankurdave is this already covered in your latest PR?





[GitHub] spark pull request: [SPARK-3650] Fix TriangleCount handling of rev...

2014-11-13 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/2495#issuecomment-62917547
  
@ankurdave take a look at this when you get a chance.





[GitHub] spark pull request: [SPARK-3936] Add aggregateMessages, which supe...

2014-11-12 Thread jegonzal
Github user jegonzal commented on a diff in the pull request:

https://github.com/apache/spark/pull/3100#discussion_r20243545
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx;
+
+import java.io.Serializable;
+
+/**
+ * Represents a subset of the fields of an [[EdgeTriplet]] or [[EdgeContext]]. This allows the
+ * system to populate only those fields for efficiency.
+ */
+public class TripletFields implements Serializable {
+  public final boolean useSrc;
+  public final boolean useDst;
+  public final boolean useEdge;
+
+  public TripletFields() {
+    this(true, true, true);
+  }
+
+  public TripletFields(boolean useSrc, boolean useDst, boolean useEdge) {
+    this.useSrc = useSrc;
+    this.useDst = useDst;
+    this.useEdge = useEdge;
+  }
+
+  public static final TripletFields None = new TripletFields(false, false, false);
--- End diff --

Hmm, I agree, though I used many of them in the `GraphOps` code and decided 
it might make sense to go ahead and be exhaustive.  I think we could cut a few.





[GitHub] spark pull request: [SPARK-3936] Add aggregateMessages, which supe...

2014-11-12 Thread jegonzal
Github user jegonzal commented on a diff in the pull request:

https://github.com/apache/spark/pull/3100#discussion_r20257677
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx;
+
+import java.io.Serializable;
+
+/**
+ * Represents a subset of the fields of an [[EdgeTriplet]] or [[EdgeContext]]. This allows the
+ * system to populate only those fields for efficiency.
+ */
+public class TripletFields implements Serializable {
+  public final boolean useSrc;
+  public final boolean useDst;
+  public final boolean useEdge;
+
+  public TripletFields() {
+    this(true, true, true);
+  }
+
+  public TripletFields(boolean useSrc, boolean useDst, boolean useEdge) {
+    this.useSrc = useSrc;
+    this.useDst = useDst;
+    this.useEdge = useEdge;
+  }
+
+  public static final TripletFields None = new TripletFields(false, false, false);
--- End diff --

How about we just keep:

```
EdgeOnly, SrcAndEdge, DstAndEdge, All
```





[GitHub] spark pull request: [SPARK-3936] Remove Bytecode Inspection for Jo...

2014-11-12 Thread jegonzal
Github user jegonzal closed the pull request at:

https://github.com/apache/spark/pull/2815





[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/3099#issuecomment-61922586
  
The model serving work would really benefit from being able to evaluate 
models without requiring a Spark context, especially since we are shooting for 
latencies in the tens of milliseconds.  More generally, we should think about 
how one might want to use the artifact of the pipeline.  I suspect there are 
uses outside of Spark, and the extent to which the models themselves are 
portable functions could enable greater adoption.





[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/3099#issuecomment-61934417
  
@jkbradley Right now we are planning to serve linear combinations of models 
derived from MLlib (currently latent factor models, naive Bayes, and decision 
trees).  Though I agree that in some cases (e.g., latent factor models) serving 
is less trivial (that's the research).  Still, it would be good to think of 
models as output to be consumed by systems beyond Spark, should people want to 
do things beyond computing test error; admittedly, that's all I ever do :-).





[GitHub] spark pull request: [SPARK-3936] Remove Bytecode Inspection for Jo...

2014-10-31 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/2815#issuecomment-61349221
  
I added the `TripletFields` enum and updated all the dependent files.  I 
can't deprecate the old API since the two have the same function signature up 
to default arguments, and I don't want to require that the `TripletFields` be 
specified.





[GitHub] spark pull request: [SPARK-3936] Remove Bytecode Inspection for Jo...

2014-10-31 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/2815#issuecomment-61349425
  
At this point I could also imagine actually having a separate function 
closure for each version.

```scala
mapTriplets(f: Edge => ED2)

mapTriplets(f: SrcEdge => ED2)

mapTriplets(f: DstEdge => ED2)

mapTriplets(f: Triplet => ED2)
```

Though to do this would require users to annotate their functions:

```scala
g.mapTriplets( (t: SrcEdge) => t.src )
```

What do you all think?






[GitHub] spark pull request: [SPARK-4130][MLlib] Fixing libSVM parser bug w...

2014-10-29 Thread jegonzal
GitHub user jegonzal opened a pull request:

https://github.com/apache/spark/pull/2996

[SPARK-4130][MLlib] Fixing libSVM parser bug with extra whitespace 

This simple patch filters out extra whitespace entries.
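
A hedged sketch of the kind of filtering involved (not necessarily the exact patch; the sample line is invented):

```scala
// Sketch: a libSVM line with extra whitespace produces empty tokens when
// split on a single space; dropping them before parsing index:value pairs
// avoids parse failures on empty strings.
val line  = "1  3:4.5   7:0.2"
val items = line.split(' ').filter(_.nonEmpty)
val label = items.head.toDouble
val (indices, values) = items.tail.map { item =>
  val Array(index, value) = item.split(':')
  (index.toInt - 1, value.toDouble) // convert 1-based indices to 0-based
}.unzip
```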

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jegonzal/spark loadLibSVM

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2996.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2996


commit e028e8443ff38e3617a1bdd0a2a3f5ec9b42d980
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date:   2014-10-29T07:13:56Z

fixing whitespace bug in loadLibSVMFile when parsing libSVM files







[GitHub] spark pull request: [SPARK-4130][MLlib] Fixing libSVM parser bug w...

2014-10-29 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/2996#issuecomment-61026000
  
Not sure why it failed the test.  Is this an issue with the testing 
framework?





[GitHub] spark pull request: [SPARK-4130][MLlib] Fixing libSVM parser bug w...

2014-10-29 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/2996#issuecomment-61026298
  
The following implementation seems a bit more efficient but is needlessly 
complicated.

```scala
// Count the number of empty values
var i = 1
var emptyValues = 0
while (i < items.size) {
  if (items(i).isEmpty) emptyValues += 1
  i += 1
}
// Determine the number of non-zero entries
val nnzs = items.size - 1 - emptyValues
// Compute the indices
val indices = new Array[Int](nnzs)
val values = new Array[Double](nnzs)
i = 1
var j = 0
while (i < items.size) {
  if (!items(i).isEmpty) {
    val indexAndValue = items(i).split(':')
    indices(j) = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.
    values(j) = indexAndValue(1).toDouble
    j += 1
  }
  i += 1
}
```





[GitHub] spark pull request: [SPARK-4142][GraphX] Default numEdgePartitions

2014-10-29 Thread jegonzal
GitHub user jegonzal opened a pull request:

https://github.com/apache/spark/pull/3006

[SPARK-4142][GraphX] Default numEdgePartitions

Changing the default number of edge partitions to match Spark's default parallelism.
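
A hedged sketch of the idea (the name and signature are assumptions, not the exact patch):

```scala
import org.apache.spark.SparkContext

// Sketch: treat a non-positive numEdgePartitions as "defer to Spark's
// default parallelism" rather than a fixed constant.
def resolveNumEdgePartitions(sc: SparkContext, numEdgePartitions: Int = -1): Int =
  if (numEdgePartitions > 0) numEdgePartitions else sc.defaultParallelism
```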

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jegonzal/spark default_partitions

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3006.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3006


commit a9a5c4f28ba7d5c29a974045960f45a55640df19
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date:   2014-10-30T00:06:55Z

Changing the default number of edge partitions to match spark parallelism







[GitHub] spark pull request: [SPARK-3936] Remove Bytecode Inspection for Jo...

2014-10-29 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/2815#issuecomment-61026843
  
What is the status on this patch?  I would like to merge it soon so that 
the Python GraphX API can support these additional flags.





[GitHub] spark pull request: [SPARK-3650] Fix TriangleCount handling of rev...

2014-10-29 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/2495#issuecomment-61026881
  
What is the status on this patch? 





[GitHub] spark pull request: Introducing an Improved Pregel API

2014-10-29 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1217#issuecomment-61027554
  
This is still work in progress and we need to discuss these API changes.





[GitHub] spark pull request: Improved GraphX PageRank Test Coverage

2014-10-29 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1228#issuecomment-61029490
  
ok to test






[GitHub] spark pull request: Improved GraphX PageRank Test Coverage

2014-10-29 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1228#issuecomment-61029482
  
This should now be addressed in the latest master and does not depend on PR 
#1217





[GitHub] spark pull request: Remove Bytecode Inspection for Join Eliminatio...

2014-10-15 Thread jegonzal
GitHub user jegonzal opened a pull request:

https://github.com/apache/spark/pull/2815

Remove Bytecode Inspection for Join Elimination

Removing bytecode inspection from triplet operations and introducing 
explicit join elimination flags.  The explicit flags make the join elimination 
more robust. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jegonzal/spark SPARK-3936

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2815.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2815


commit 2e471584d70aa8029a7eab9643cfdcb3e758a9d7
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date:   2014-10-15T19:41:24Z

Removing bytecode inspection from triplet operations and introducing 
explicit join elimination flags.







[GitHub] spark pull request: Remove Bytecode Inspection for Join Eliminatio...

2014-10-15 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/2815#issuecomment-59263992
  
@ankurdave and @rxin I have not updated the applications to use the new 
explicit flags.  I will do that in this PR pending approval for the API changes.





[GitHub] spark pull request: [SPARK-3936] Remove Bytecode Inspection for Jo...

2014-10-15 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/2815#issuecomment-59275910
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-3578] Fix upper bound in GraphGenerator...

2014-09-22 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/2439#issuecomment-56438311
  
This looks good to me.  





[GitHub] spark pull request: [SPARK-3650] Fix TriangleCount handling of rev...

2014-09-22 Thread jegonzal
GitHub user jegonzal opened a pull request:

https://github.com/apache/spark/pull/2495

[SPARK-3650] Fix TriangleCount handling of reverse edges

This PR causes the TriangleCount algorithm to remove self-edges, orient 
edges from low id to high id (the canonical direction), and then remove 
duplicate edges, before running the triangle count pass.
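
A hedged sketch of that preprocessing using the `GraphOps` operators that serve this purpose (`removeSelfEdges` and `convertToCanonicalEdges`); the PR itself may implement the steps differently:

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx.Graph

// Sketch: drop self-edges, orient each edge from the lower vertex id to
// the higher one, and merge the duplicates that canonicalization exposes,
// before running the triangle counting pass.
def canonicalize[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Graph[VD, ED] =
  graph
    .removeSelfEdges()
    .convertToCanonicalEdges((a, b) => a) // keep one attribute per duplicate
```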



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jegonzal/spark FixTriangleCount

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2495.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2495


commit daf1ab5e66259d0af449a91ca7c230323ac49daa
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date:   2014-09-22T21:57:28Z

Improving Triangle Count

commit d054d33181486e3b90222e5e30b2f20648434673
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date:   2014-09-22T22:16:46Z

fixing bug in unit tests where bi-directed edges lead to duplicate 
triangles.







[GitHub] spark pull request: [SPARK-3263][GraphX] Fix changes made to Graph...

2014-08-28 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/2168#issuecomment-53760885
  
The code changes look good to me (and were badly needed).  Thanks for fixing 
it!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Improved GraphX PageRank Test Coverage

2014-08-28 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1228#issuecomment-53776044
  
Yes. This is an extension of the unit tests to catch a class of bugs 
addressed in PR #1217 (which has not been merged).  I believe @ankurdave was 
working on a merge of these two pull requests. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Introducing an Improved Pregel API

2014-06-26 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1217#issuecomment-47193665
  
I spent some time verifying the math behind the PageRank implementation (in 
particular the starting values) to ensure that the delta formulation behaves 
identically to the static formulation, which matches other reference 
implementations of PageRank.  One of the key changes is that I have added an 
extra normalization step at the end of the calculation to address a 
discrepancy in how we handle dangling vertices.
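
For context, the normalization step has roughly the following shape (a sketch 
under assumed variable names such as `rankGraph`, not the PR's exact code):

```scala
// Rescale the ranks so they sum to the number of vertices, compensating for
// rank mass absorbed by dangling vertices (vertices with no out-edges).
val rankSum = rankGraph.vertices.map(_._2).sum()
val numVertices = rankGraph.numVertices
val normalized = rankGraph.mapVertices((id, rank) => rank * numVertices / rankSum)
```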


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Introducing an Improved Pregel API

2014-06-26 Thread jegonzal
Github user jegonzal commented on a diff in the pull request:

https://github.com/apache/spark/pull/1217#discussion_r14227560
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -158,4 +169,125 @@ object Pregel extends Logging {
 g
   } // end of apply
 
+  /**
+   * Execute a Pregel-like iterative vertex-parallel abstraction.  The
+   * user-defined vertex-program `vprog` is executed in parallel on
+   * each vertex receiving any inbound messages and computing a new
+   * value for the vertex.  The `sendMsg` function is then invoked on
+   * all out-edges and is used to compute an optional message to the
+   * destination vertex. The `mergeMsg` function is a commutative
+   * associative function used to combine messages destined to the
+   * same vertex.
+   *
+   * On the first iteration all vertices receive the `initialMsg` and
+   * on subsequent iterations if a vertex does not receive a message
+   * then the vertex-program is not invoked.
+   *
+   * This function iterates until there are no remaining messages, or
+   * for `maxIterations` iterations.
+   *
+   * @tparam VD the vertex data type
+   * @tparam ED the edge data type
+   * @tparam A the Pregel message type
+   *
+   * @param graph the input graph.
+   *
+   * @param initialMsg the message each vertex will receive on the
+   * first iteration
+   *
+   * @param maxIterations the maximum number of iterations to run for
+   *
+   * @param activeDirection the direction of edges incident to a vertex 
that received a message in
+   * the previous round on which to run `sendMsg`. For example, if this is 
`EdgeDirection.Out`, only
+   * out-edges of vertices that received a message in the previous round 
will run. The default is
+   * `EdgeDirection.Either`, which will run `sendMsg` on edges where 
either side received a message
+   * in the previous round. If this is `EdgeDirection.Both`, `sendMsg` 
will only run on edges where
+   * *both* vertices received a message.
+   *
+   * @param vprog the user-defined vertex program which runs on each
+   * vertex and receives the inbound message and computes a new vertex
+   * value.  On the first iteration the vertex program is invoked on
+   * all vertices and is passed the default message.  On subsequent
+   * iterations the vertex program is only invoked on those vertices
+   * that receive messages.
+   *
+   * @param sendMsg a user supplied function that is applied to out
+   * edges of vertices that received messages in the current
+   * iteration
+   *
+   * @param mergeMsg a user supplied function that takes two incoming
+   * messages of type A and merges them into a single message of type
+   * A.  ''This function must be commutative and associative and
+   * ideally the size of A should not increase.''
+   *
+   * @return the resulting graph at the end of the computation
+   *
+   */
+  def run[VD: ClassTag, ED: ClassTag, A: ClassTag]
+  (graph: Graph[VD, ED],
+   maxIterations: Int = Int.MaxValue,
+   activeDirection: EdgeDirection = EdgeDirection.Either)
+  (vertexProgram: (VertexId, VD, Option[A], VertexContext) => VD,
+   sendMsg: (EdgeTriplet[VD, ED], EdgeContext) => Iterator[(VertexId, A)],
+   mergeMsg: (A, A) => A)
+  : Graph[VD, ED] =
+  {
+// Initialize the graph with all vertices active
+var g: Graph[(VD, Boolean), ED] = graph.mapVertices { (vid, vdata) => 
(vdata, true) }.cache()
+// Determine the set of vertices that did not vote to halt
+var activeVertices = g.vertices
+var numActive = activeVertices.count()
+var i = 0
+while (numActive > 0 && i < maxIterations) {
+  // The send message wrapper removes the active fields from the 
triplet and places them in the edge context.
+  def sendMessageWrapper(triplet: EdgeTriplet[(VD, Boolean),ED]): 
Iterator[(VertexId, A)] = {
+val simpleTriplet = new EdgeTriplet[VD, ED]()
+simpleTriplet.set(triplet)
+simpleTriplet.srcAttr = triplet.srcAttr._1
+simpleTriplet.dstAttr = triplet.dstAttr._1
+val ctx = new EdgeContext(i, triplet.srcAttr._2, 
triplet.dstAttr._2)
+sendMsg(simpleTriplet, ctx)
+  }
+
+  // Compute the messages for all the active vertices
+  val messages = g.mapReduceTriplets(sendMessageWrapper, mergeMsg, 
Some((activeVertices, activeDirection)))
+
+  // get a reference to the current graph so that we can unpersist it 
once the new graph is created.
+  val prevG = g
+
+  // Receive the messages to the subset of active vertices
+  g = g.outerJoinVertices(messages){ (vid, dataAndActive, msgOpt) =>
+val (vdata

[GitHub] spark pull request: Introducing an Improved Pregel API

2014-06-26 Thread jegonzal
Github user jegonzal commented on a diff in the pull request:

https://github.com/apache/spark/pull/1217#discussion_r14227573
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -158,4 +169,125 @@ object Pregel extends Logging {
 g
   } // end of apply
 
+  /**
+   * Execute a Pregel-like iterative vertex-parallel abstraction.  The
+   * user-defined vertex-program `vprog` is executed in parallel on
+   * each vertex receiving any inbound messages and computing a new
+   * value for the vertex.  The `sendMsg` function is then invoked on
+   * all out-edges and is used to compute an optional message to the
+   * destination vertex. The `mergeMsg` function is a commutative
+   * associative function used to combine messages destined to the
+   * same vertex.
+   *
+   * On the first iteration all vertices receive the `initialMsg` and
+   * on subsequent iterations if a vertex does not receive a message
+   * then the vertex-program is not invoked.
+   *
+   * This function iterates until there are no remaining messages, or
+   * for `maxIterations` iterations.
+   *
+   * @tparam VD the vertex data type
+   * @tparam ED the edge data type
+   * @tparam A the Pregel message type
+   *
+   * @param graph the input graph.
+   *
+   * @param initialMsg the message each vertex will receive on the
+   * first iteration
+   *
+   * @param maxIterations the maximum number of iterations to run for
+   *
+   * @param activeDirection the direction of edges incident to a vertex 
that received a message in
+   * the previous round on which to run `sendMsg`. For example, if this is 
`EdgeDirection.Out`, only
+   * out-edges of vertices that received a message in the previous round 
will run. The default is
+   * `EdgeDirection.Either`, which will run `sendMsg` on edges where 
either side received a message
+   * in the previous round. If this is `EdgeDirection.Both`, `sendMsg` 
will only run on edges where
+   * *both* vertices received a message.
+   *
+   * @param vprog the user-defined vertex program which runs on each
+   * vertex and receives the inbound message and computes a new vertex
+   * value.  On the first iteration the vertex program is invoked on
+   * all vertices and is passed the default message.  On subsequent
+   * iterations the vertex program is only invoked on those vertices
+   * that receive messages.
+   *
+   * @param sendMsg a user supplied function that is applied to out
+   * edges of vertices that received messages in the current
+   * iteration
+   *
+   * @param mergeMsg a user supplied function that takes two incoming
+   * messages of type A and merges them into a single message of type
+   * A.  ''This function must be commutative and associative and
+   * ideally the size of A should not increase.''
+   *
+   * @return the resulting graph at the end of the computation
+   *
+   */
+  def run[VD: ClassTag, ED: ClassTag, A: ClassTag]
+  (graph: Graph[VD, ED],
+   maxIterations: Int = Int.MaxValue,
+   activeDirection: EdgeDirection = EdgeDirection.Either)
+  (vertexProgram: (VertexId, VD, Option[A], VertexContext) => VD,
+   sendMsg: (EdgeTriplet[VD, ED], EdgeContext) => Iterator[(VertexId, A)],
+   mergeMsg: (A, A) => A)
+  : Graph[VD, ED] =
+  {
+// Initialize the graph with all vertices active
+var g: Graph[(VD, Boolean), ED] = graph.mapVertices { (vid, vdata) => 
(vdata, true) }.cache()
+// Determine the set of vertices that did not vote to halt
+var activeVertices = g.vertices
+var numActive = activeVertices.count()
+var i = 0
+while (numActive > 0 && i < maxIterations) {
+  // The send message wrapper removes the active fields from the 
triplet and places them in the edge context.
+  def sendMessageWrapper(triplet: EdgeTriplet[(VD, Boolean),ED]): 
Iterator[(VertexId, A)] = {
+val simpleTriplet = new EdgeTriplet[VD, ED]()
+simpleTriplet.set(triplet)
+simpleTriplet.srcAttr = triplet.srcAttr._1
+simpleTriplet.dstAttr = triplet.dstAttr._1
+val ctx = new EdgeContext(i, triplet.srcAttr._2, 
triplet.dstAttr._2)
+sendMsg(simpleTriplet, ctx)
+  }
+
+  // Compute the messages for all the active vertices
+  val messages = g.mapReduceTriplets(sendMessageWrapper, mergeMsg, 
Some((activeVertices, activeDirection)))
+
+  // get a reference to the current graph so that we can unpersist it 
once the new graph is created.
+  val prevG = g
+
+  // Receive the messages to the subset of active vertices
+  g = g.outerJoinVertices(messages){ (vid, dataAndActive, msgOpt) =>
+val (vdata

[GitHub] spark pull request: Improved GraphX PageRank Test Coverage

2014-06-26 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1228#issuecomment-47200276
  
@ankurdave thanks for pointing out this bug!  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Introducing an Improved Pregel API

2014-06-26 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1217#issuecomment-47204112
  
@ankurdave and @rxin there is an issue with the current API.  The 
`sendMessage` function pulls the active fields out of the vertex value here:

https://github.com/apache/spark/pull/1217/files#diff-e399679417ffa6eeedf26a7630baca16R243
```scala
def sendMessageWrapper(triplet: EdgeTriplet[(VD, Boolean), ED]): Iterator[(VertexId, A)] = {
  val simpleTriplet = new EdgeTriplet[VD, ED]()
  simpleTriplet.set(triplet)
  simpleTriplet.srcAttr = triplet.srcAttr._1
  simpleTriplet.dstAttr = triplet.dstAttr._1
  val ctx = new EdgeContext(i, triplet.srcAttr._2, triplet.dstAttr._2)
  sendMsg(simpleTriplet, ctx)
}

// Compute the messages for all the active vertices
val messages = g.mapReduceTriplets(sendMessageWrapper, mergeMsg,
  Some((activeVertices, activeDirection)))
```
thereby allowing the user a simple `sendMsg` interface:
```scala
sendMsg: (EdgeTriplet[VD, ED], EdgeContext) => Iterator[(VertexId, A)]
```
However, because we access the source and destination vertex attributes, the 
bytecode inspection will force a full 3-way join even if the user doesn't 
actually read those fields. 

The simplest solution would be to change the send-message interface to 
operate on the extended vertex attribute (containing the active field).
```scala
sendMsg: (EdgeTriplet[(VD, Boolean), ED], EdgeContext) => 
Iterator[(VertexId, A)]
```
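
Under that signature, a user-supplied `sendMsg` would look roughly like the 
following (the concrete `Double` types and the unpacking are illustrative; 
`EdgeContext` here is the class introduced in this PR):

```scala
import org.apache.spark.graphx._

// The Boolean active flag travels inside the vertex attribute, so the user
// unpacks it explicitly and no hidden attribute accesses trip up the
// bytecode inspection.
def sendMsg(triplet: EdgeTriplet[(Double, Boolean), Double],
            ctx: EdgeContext): Iterator[(VertexId, Double)] = {
  val (srcRank, srcActive) = triplet.srcAttr
  if (srcActive) Iterator((triplet.dstId, srcRank * triplet.attr))
  else Iterator.empty
}
```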




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Introducing an Improved Pregel API

2014-06-25 Thread jegonzal
GitHub user jegonzal opened a pull request:

https://github.com/apache/spark/pull/1217

Introducing an Improved Pregel API

The initial Pregel API coupled voting to halt with message reception.  In 
this revision, the vertex program receives a `PregelContext`, which enables the 
user to signal whether or not to halt as well as to access the current iteration 
number.
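
For a rough sense of the intended usage (the `PregelContext` method names 
below are illustrative, not the PR's exact API):

```scala
// Hypothetical vertex program under the revised API: halting is signaled
// explicitly through the context rather than inferred from message reception.
def vertexProgram(id: VertexId, rank: Double, msgOpt: Option[Double],
                  ctx: PregelContext): Double = {
  val newRank = 0.15 + 0.85 * msgOpt.getOrElse(0.0)
  // Vote to halt once the rank has converged (method name is illustrative).
  if (math.abs(newRank - rank) < 0.001) ctx.voteToHalt()
  newRank
}
```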

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jegonzal/spark PregelContext

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1217.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1217


commit ca0fcc8797769f706e14cadd773ace7c2ff53e0a
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date:   2014-06-25T22:24:54Z

Starting a revised version of Pregel




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Introducing an Improved Pregel API

2014-06-25 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/1217#issuecomment-47167314
  
@ankurdave unfortunately, to fully accept this change we will need to break 
compatibility with the current Pregel API.  I cannot seem to overload the apply 
method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Synthetic GraphX Benchmark

2014-05-16 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/720#issuecomment-42757213
  
Good point!  I moved the benchmark into the examples folder.  Is there a 
standard format for command line args in the example applications?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Synthetic GraphX Benchmark

2014-05-16 Thread jegonzal
GitHub user jegonzal opened a pull request:

https://github.com/apache/spark/pull/720

Synthetic GraphX Benchmark

This PR accomplishes two things:

1. It introduces a Synthetic Benchmark application that generates an 
arbitrarily large log-normal graph and executes either PageRank or connected 
components on the graph.  This can be used to profile the GraphX system on 
arbitrary clusters without access to large graph datasets (see the sketch 
after this list).

2. It improves the implementation of the log-normal graph generator.
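
A minimal sketch of what the benchmark does (argument handling omitted; the 
generator's default parameters are assumed):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.lib.PageRank
import org.apache.spark.graphx.util.GraphGenerators

// Generate a log-normal graph of the requested size and time PageRank on it.
def runBenchmark(sc: SparkContext, numVertices: Int, numIters: Int): Unit = {
  val graph = GraphGenerators.logNormalGraph(sc, numVertices).cache()
  val start = System.currentTimeMillis()
  val numRanks = PageRank.run(graph, numIters).vertices.count()
  val elapsed = System.currentTimeMillis() - start
  println(s"PageRank over $numVertices vertices ($numRanks ranks): $elapsed ms")
}
```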

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jegonzal/spark graphx_synth_benchmark

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/720.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #720


commit c81ee23bb66918052efefa81a9e8077951e5ebf5
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date:   2014-05-05T22:52:58Z

Creating a synthetic benchmark script.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Enable repartitioning of graph over different ...

2014-05-15 Thread jegonzal
GitHub user jegonzal opened a pull request:

https://github.com/apache/spark/pull/719

Enable repartitioning of graph over different number of partitions

It is currently very difficult to repartition a graph over a different 
number of partitions.  This PR adds an additional `partitionBy` function that 
takes the number of partitions.  
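
Usage would look something like the following (the strategy and partition 
count are just examples):

```scala
import org.apache.spark.graphx._

// Repartition the edges over 64 partitions in one call instead of
// repartitioning the underlying edge RDD by hand.
val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D, 64)
```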

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jegonzal/spark graph_partitioning_options

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/719.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #719


commit 54412fc658018c8285190fdd26b43f324dd1f580
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date:   2014-05-09T23:26:59Z

adding an additional number of partitions option to partitionBy




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Unify GraphImpl RDDs + other graph load optimi...

2014-05-14 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/497#issuecomment-42618546
  
I went through this PR with Ankur and it looks good to me.  There are a few 
minor changes, but those can be moved to a second PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1786: Reopening PR 724

2014-05-12 Thread jegonzal
GitHub user jegonzal opened a pull request:

https://github.com/apache/spark/pull/742

SPARK-1786: Reopening PR 724 

Addressing issue in MimaBuild.scala.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jegonzal/spark edge_partition_serialization

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/742.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #742


commit 67dac22884b098b72c277dbe6e344da796a5321c
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date:   2014-05-10T01:54:56Z

Making EdgePartition serializable.

commit bb7f548542d58ee6ac2dbdf868fea165fdf4f415
Author: Ankur Dave ankurd...@gmail.com
Date:   2014-05-10T03:09:48Z

Add failing test for EdgePartition Kryo serialization

commit b0a525a7f48a6b13cf8687e5e6d8ba3d3bf852f5
Author: Ankur Dave ankurd...@gmail.com
Date:   2014-05-10T03:12:38Z

Disable reference tracking to fix serialization test

commit d8b70fbca17534eb8f60e8feb4a9fdd5996fdcd8
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date:   2014-05-12T18:20:49Z

addressing missing exclusion in MimaBuild.scala




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1786: Reopening PR 724

2014-05-12 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/742#issuecomment-42868913
  
@ankurdave and @pwendell I am reopening PR 724 to address the issue 
with MimaBuild.  I believe I have made the required changes, but how can I verify?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1786: Edge Partition Serialization

2014-05-11 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/724#issuecomment-42787343
  
I would like to get it into 1.0 if possible.  Otherwise, we could run into 
issues if the user persists graphs to disk or straggler mitigation is used. 
@ankurdave do you see any issues with trying to get this into 1.0?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Fix error in 2d Graph Partitioner

2014-05-11 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/709#issuecomment-42703154
  
@rxin and @ankurdave take a look at this minor change when you get a 
chance.  I would like to get it into the next release if possible.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1786: Edge Partition Serialization

2014-05-11 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/724#issuecomment-42793347
  
My only concern is that I would prefer that things work slowly rather than 
fail.  With reference tracking disabled, it is not possible to serialize 
user-defined types from the spark-shell.  

A second concern is that it will be difficult for the user to re-enable 
reference tracking if we disable it in the GraphX Kryo registrator.  
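
For reference, reference tracking is normally controlled by a Spark 
configuration flag; since the registrator runs after the serializer applies 
this flag, a hard-coded `kryo.setReferences(false)` in the registrator would 
override the setting below:

```scala
import org.apache.spark.SparkConf

// A user opting back into reference tracking (effective only if the
// registrator does not hard-code setReferences(false)).
val conf = new SparkConf()
  .set("spark.kryo.referenceTracking", "true")
  .set("spark.kryo.registrator", "org.apache.spark.graphx.GraphKryoRegistrator")
```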


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1577: Enabling reference tracking by def...

2014-04-23 Thread jegonzal
GitHub user jegonzal opened a pull request:

https://github.com/apache/spark/pull/499

SPARK-1577: Enabling reference tracking by default in GraphX 
KryoRegistrator.

We had originally disabled reference tracking by default; however, this now 
seems to create serious issues in the spark-shell, where even the following 
benign block of code will fail:

```scala
class A(a: String) extends Serializable
val x = sc.parallelize(Array.fill(10)(new A("hello")))
x.collect
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jegonzal/spark graphx-kryo-issue

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/499.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #499


commit 8779a836ff6d134b445a051a13c5d85130a9f848
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date:   2014-04-23T05:59:24Z

Enabling reference tracking by default in GraphX KryoRegistrator.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Add Shortest-path computations to graphx.lib w...

2014-04-23 Thread jegonzal
Github user jegonzal commented on a diff in the pull request:

https://github.com/apache/spark/pull/10#discussion_r11915905
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.lib
+
+import org.apache.spark.graphx._
+
+object ShortestPaths {
+  type SPMap = Map[VertexId, Int] // map of landmarks -> minimum distance to landmark
--- End diff --

The Scala map data structures can be pretty costly and inefficient.  
Instead, you could use an array containing the distances and maintain a 
global map (shared via a broadcast variable) from vertex id to index in the 
array.  This should also reduce the memory overhead substantially, since each 
vertex would not need to maintain its own local Map data structure (see the 
sketch below).
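
Concretely, the suggestion amounts to something like this sketch (the names, 
the landmark list, and the surrounding `sc`/`graph` are illustrative):

```scala
// Remap the landmarks to consecutive array indices once, broadcast that
// mapping, and let each vertex carry a plain distance array instead of a Map.
val landmarks: Seq[VertexId] = Seq(1L, 42L, 99L)            // example landmarks
val bcIndex = sc.broadcast(landmarks.zipWithIndex.toMap)    // vertex id -> slot

val spGraph = graph.mapVertices { (vid, _) =>
  val dists = Array.fill(landmarks.size)(Int.MaxValue)      // "unreachable"
  bcIndex.value.get(vid).foreach(i => dists(i) = 0)         // landmark to itself
  dists
}
```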


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Add Shortest-path computations to graphx.lib w...

2014-04-23 Thread jegonzal
Github user jegonzal commented on a diff in the pull request:

https://github.com/apache/spark/pull/10#discussion_r11916394
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.lib
+
+import org.apache.spark.graphx._
+
+object ShortestPaths {
+  type SPMap = Map[VertexId, Int] // map of landmarks -> minimum distance to landmark
--- End diff --

Essentially, remap the landmarks to a consecutive set of landmark ids.  The 
single broadcast map would then only be needed during the initial creation of 
spGraph; from then on, no map data structures would be required. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Add Shortest-path computations to graphx.lib w...

2014-04-23 Thread jegonzal
Github user jegonzal commented on a diff in the pull request:

https://github.com/apache/spark/pull/10#discussion_r11916531
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.lib
+
+import org.apache.spark.graphx._
+
+object ShortestPaths {
+  type SPMap = Map[VertexId, Int] // map of landmarks -> minimum distance to landmark
+  def SPMap(x: (VertexId, Int)*) = Map(x: _*)
+  def increment(spmap: SPMap): SPMap = spmap.map { case (v, d) => v -> (d + 1) }
+  def plus(spmap1: SPMap, spmap2: SPMap): SPMap =
+    (spmap1.keySet ++ spmap2.keySet).map {
+      k => k -> scala.math.min(spmap1.getOrElse(k, Int.MaxValue), spmap2.getOrElse(k, Int.MaxValue))
+    }.toMap
+
+  /**
+   * Compute the shortest paths to each landmark for each vertex and
+   * return an RDD with the map of landmarks to their shortest-path
+   * lengths.
+   *
+   * @tparam VD the shortest paths map for the vertex
+   * @tparam ED the incremented shortest-paths map of the originating
+   * vertex (discarded in the computation)
+   *
+   * @param graph the graph for which to compute the shortest paths
+   * @param landmarks the list of landmark vertex ids
+   *
+   * @return a graph with vertex attributes containing a map of the
+   * shortest paths to each landmark
+   */
+  def run[VD, ED](graph: Graph[VD, ED], landmarks: Seq[VertexId])
+(implicit m1: Manifest[VD], m2: Manifest[ED]): Graph[SPMap, SPMap] = {
+
+val spGraph = graph
+  .mapVertices { (vid, attr) =>
+    if (landmarks.contains(vid)) SPMap(vid -> 0)
+    else SPMap()
+  }
+  .mapTriplets { edge => edge.srcAttr }
+
+val initialMessage = SPMap()
+
+def vertexProgram(id: VertexId, attr: SPMap, msg: SPMap): SPMap = {
+  plus(attr, msg)
+}
+
+def sendMessage(edge: EdgeTriplet[SPMap, SPMap]): Iterator[(VertexId, 
SPMap)] = {
+  val newAttr = increment(edge.srcAttr)
--- End diff --

It might be worth considering adding support for edge weights instead of 
assuming all edges are length 1.
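
A weighted variant could look roughly like this (a sketch assuming `Double` 
edge weights and distances, written as if inside the ShortestPaths object; 
not the PR's code):

```scala
import org.apache.spark.graphx._

type WSPMap = Map[VertexId, Double]

// Each hop adds the edge weight read from edge.attr instead of a constant 1.
def increment(spmap: WSPMap, weight: Double): WSPMap =
  spmap.map { case (v, d) => v -> (d + weight) }

def sendMessage(edge: EdgeTriplet[WSPMap, Double]): Iterator[(VertexId, WSPMap)] = {
  val newAttr = increment(edge.srcAttr, edge.attr)
  Iterator((edge.dstId, newAttr))
}
```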


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Add Shortest-path computations to graphx.lib w...

2014-04-23 Thread jegonzal
Github user jegonzal commented on a diff in the pull request:

https://github.com/apache/spark/pull/10#discussion_r11916619
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.lib
+
+import org.apache.spark.graphx._
+
+object ShortestPaths {
+  type SPMap = Map[VertexId, Int] // map of landmarks -> minimum distance to landmark
+  def SPMap(x: (VertexId, Int)*) = Map(x: _*)
+  def increment(spmap: SPMap): SPMap = spmap.map { case (v, d) => v -> (d + 1) }
+  def plus(spmap1: SPMap, spmap2: SPMap): SPMap =
+    (spmap1.keySet ++ spmap2.keySet).map {
+      k => k -> scala.math.min(spmap1.getOrElse(k, Int.MaxValue), spmap2.getOrElse(k, Int.MaxValue))
+    }.toMap
+
+  /**
+   * Compute the shortest paths to each landmark for each vertex and
+   * return an RDD with the map of landmarks to their shortest-path
+   * lengths.
+   *
+   * @tparam VD the shortest paths map for the vertex
+   * @tparam ED the incremented shortest-paths map of the originating
+   * vertex (discarded in the computation)
+   *
+   * @param graph the graph for which to compute the shortest paths
+   * @param landmarks the list of landmark vertex ids
+   *
+   * @return a graph with vertex attributes containing a map of the
+   * shortest paths to each landmark
+   */
+  def run[VD, ED](graph: Graph[VD, ED], landmarks: Seq[VertexId])
+(implicit m1: Manifest[VD], m2: Manifest[ED]): Graph[SPMap, SPMap] = {
+
+val spGraph = graph
+  .mapVertices { (vid, attr) =>
+    if (landmarks.contains(vid)) SPMap(vid -> 0)
+    else SPMap()
--- End diff --

If we switch to an array implementation of the map, then perhaps set the 
distance to MaxInt (or MaxDouble if we switch to weighted edges).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Add Shortest-path computations to graphx.lib w...

2014-04-23 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/10#issuecomment-41199189
  
This code looks good to me.  All my comments are with respect to potential 
performance issues.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---