[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/11290#issuecomment-186951853 This looks good to me. @insidedctm thanks for reviving the PR and @srowen thanks for taking a look at this! My only minor concern is that it will change the results for people that are using triangle count so we should note this in the change log on the next release. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Adding zipPartitions to PySpark
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/10550#issuecomment-168628357 @davies and @JoshRosen I have finished a working prototype that passes the tests. I would be interested in your thoughts.
[GitHub] spark pull request: Adding zipPartitions to PySpark
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/10550#issuecomment-168850430 @davies thanks for taking a look! I will open a JIRA issue later today. With respect to the disk-based design, I had considered it, but it has a few limitations. First, it breaks the lazy evaluation model, which (while not critical) was something I wanted to avoid. Second, and perhaps more importantly, I wanted to avoid writing both relations completely to disk, since it is possible that one may only need to be partially processed or require subsequent disk processing. I should add that the Python performance is actually pretty reasonable :-) for basic data-processing tasks, so I wanted a design that would retain the current level of performance.
[GitHub] spark pull request: Adding zipPartitions to PySpark
Github user jegonzal closed the pull request at: https://github.com/apache/spark/pull/10550
[GitHub] spark pull request: Adding zipPartitions to PySpark
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/10550#issuecomment-168568011 retest this please
[GitHub] spark pull request: Adding zipPartitions to PySpark
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/10550#issuecomment-168566477 retest this please
[GitHub] spark pull request: Adding zipPartitions to PySpark
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/10550#issuecomment-168497312 retest this please
[GitHub] spark pull request: Adding zipPartitions to PySpark
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/10550#issuecomment-168368506 retest this please
[GitHub] spark pull request: Adding zipPartitions to PySpark
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/10550#issuecomment-168362426 @davies and @JoshRosen let me know what you think of this design.
[GitHub] spark pull request: Adding zipPartitions to PySpark
GitHub user jegonzal opened a pull request: https://github.com/apache/spark/pull/10550 Adding zipPartitions to PySpark

The following working WIP adds support for `zipPartitions` to PySpark. This is accomplished by modifying the PySpark `worker` (in both daemon and non-daemon mode) to open a second socket back to the Spark process. The second socket is used to send tuples from the second iterator in `zipPartitions`, enabling the user-defined function to pull tuples from both iterators at different rates without requiring a back-and-forth protocol over the primary socket. A single-socket protocol design was considered but creates issues with the built-in serializers and would require much larger changes. The second socket is always created at the launch of the worker process and is simply ignored if it is not needed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jegonzal/spark multi_iterator_pyspark

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10550.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10550

commit 70650ab94ae5dceca2dd6a970035d45dffdce2b1
Author: Joseph Gonzalez <joseph.e.gonza...@gmail.com>
Date: 2016-01-02T01:40:10Z
compiling prototype

commit 61512acb2dba276b2bbd1bca5d22ff2474f6def5
Author: Joseph Gonzalez <joseph.e.gonza...@gmail.com>
Date: 2016-01-02T03:51:40Z
addressing a bug where sockets could get created multiple times
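The key property of the proposed design is that the user-defined function sees both partition iterators directly and may consume them at different rates. A minimal sketch of those semantics in plain Python (names here are illustrative, not the actual PySpark API; in the PR, the second iterator would be fed over the second worker socket):

```python
# Hypothetical sketch of zipPartitions semantics: the user function
# receives both partition iterators and pulls from each independently.
def zip_partitions(iter_a, iter_b, func):
    # In the proposed design, iter_b would arrive over a second socket;
    # here both are plain Python iterators.
    return func(iter_a, iter_b)

def uneven_consumer(a, b):
    # Consume two items from `a` for every one item from `b`,
    # demonstrating that the iterators need not advance in lockstep.
    out = []
    for y in b:
        x1 = next(a, None)
        x2 = next(a, None)
        out.append((x1, x2, y))
    return out

result = zip_partitions(iter(range(6)), iter(["u", "v", "w"]), uneven_consumer)
# result == [(0, 1, 'u'), (2, 3, 'v'), (4, 5, 'w')]
```

Because the function drains the iterators itself, no request/response protocol is needed between the worker and the JVM for the second relation, which is the motivation for the dedicated socket.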
[GitHub] spark pull request: [SPARK-11432][GraphX] Personalized PageRank sh...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/9386#issuecomment-153207973 This is actually a pretty serious error since it could lead to mass being accumulated on unreachable sub-graphs. The performance implications of the above branch should be negligible.
[GitHub] spark pull request: [SPARK-4086][GraphX]: Fold-style aggregation f...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/5142#issuecomment-137170613 @srowen GraphX is still active; we have just been pretty busy with some other changes. Let me see what needs to be done with this PR.
[GitHub] spark pull request: [SPARK-9001] Fixing errors in javadocs that le...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/7354#issuecomment-120998075 I will make the suggested changes now.
[GitHub] spark pull request: [SPARK-9001] Fixing errors in javadocs that le...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/7354#issuecomment-121000162 I have merged upstream changes and added back the requested paragraph blocks (correctly).
[GitHub] spark pull request: [SPARK-9001] Fixing errors in javadocs that le...
Github user jegonzal commented on a diff in the pull request: https://github.com/apache/spark/pull/7354#discussion_r34419761 --- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java --- @@ -25,9 +25,9 @@ import static org.apache.spark.launcher.CommandBuilderUtils.*; -/** +/** * Launcher for Spark applications. - * p/ --- End diff -- This was rejected by JDK8. I thought that the first line was treated differently so I dropped the spurious p. Without this fix it is not possible to build the docs or publish locally so it was a serious issue for me.
[GitHub] spark pull request: [SPARK-9001] Fixing errors in javadocs that le...
GitHub user jegonzal opened a pull request: https://github.com/apache/spark/pull/7354 [SPARK-9001] Fixing errors in javadocs that lead to failed build/sbt doc

These are minor corrections in the documentation of several classes that are preventing:

```bash
build/sbt publish-local
```

I believe this might be an issue associated with running JDK8, as @ankurdave does not appear to have this issue in JDK7.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jegonzal/spark FixingJavadocErrors

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7354.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7354

commit 958bea2ca969dccbac2323eb4e783cc1b095139f
Author: Joseph Gonzalez joseph.e.gonza...@gmail.com
Date: 2015-07-11T06:22:01Z
Fixing errors in javadocs that prevents build/sbt publish-local from completing.
[GitHub] spark pull request: Improved GraphX PageRank Test Coverage
Github user jegonzal closed the pull request at: https://github.com/apache/spark/pull/1228
[GitHub] spark pull request: Improved GraphX PageRank Test Coverage
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1228#issuecomment-103186580 I think we have covered most of this code in later tests (PR #1217), and the remaining tests need to be substantially updated, which I can do in a later PR. I am going to go ahead and close this one. Sorry about the delay.
[GitHub] spark pull request: Spark-5854 personalized page rank
Github user jegonzal commented on a diff in the pull request: https://github.com/apache/spark/pull/4774#discussion_r29521247

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala ---

```scala
@@ -103,8 +132,14 @@ object PageRank extends Logging {
       // that didn't receive a message. Requires a shuffle for broadcasting updated ranks to the
       // edge partitions.
       prevRankGraph = rankGraph
+      val rPrb = if (personalized) {
+        (src: VertexId, id: VertexId) => resetProb * delta(src, id)
+      } else {
+        (src: VertexId, id: VertexId) => resetProb
+      }
+
       rankGraph = rankGraph.joinVertices(rankUpdates) {
-        (id, oldRank, msgSum) => resetProb + (1.0 - resetProb) * msgSum
+        (id, oldRank, msgSum) => rPrb(src, id) + (1.0 - resetProb) * msgSum
```

--- End diff --

This all looks correct, but I have a minor concern that the extra function call and branching might increase overhead if the HotSpot optimizations don't inline. Do we have a sense of the performance cost of this change? An alternative, less elegant solution would be to have two code paths for lines 141 and 142 depending on whether personalization is enabled.
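The branch under discussion can be sketched outside Spark in plain Python (illustrative names only; `delta(src, id)` in the diff is 1.0 when `id == src` and 0.0 otherwise): personalized PageRank injects the reset (teleport) mass only at the source vertex, while the standard variant gives every vertex the same reset term.

```python
def make_reset_fn(personalized, reset_prob, src=None):
    # Mirrors the quoted diff: personalized PageRank concentrates the
    # reset mass at the source vertex; standard PageRank spreads it
    # uniformly across all vertices.
    if personalized:
        return lambda vid: reset_prob * (1.0 if vid == src else 0.0)
    return lambda vid: reset_prob

def updated_rank(reset_fn, vid, msg_sum, reset_prob):
    # new rank = reset term + damped sum of incoming messages
    return reset_fn(vid) + (1.0 - reset_prob) * msg_sum

standard = make_reset_fn(False, 0.15)
personal = make_reset_fn(True, 0.15, src=0)

# For a vertex other than the source, the personalized reset term is zero,
# so unreachable vertices receive no teleported mass.
assert abs(updated_rank(personal, 7, 1.0, 0.15) - 0.85) < 1e-12
assert abs(updated_rank(standard, 7, 1.0, 0.15) - (0.15 + 0.85)) < 1e-12
```

This also makes concrete why the fix matters for correctness: without the source-only reset term, mass leaks onto sub-graphs unreachable from the source.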
[GitHub] spark pull request: Spark-5854 personalized page rank
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-98196331 Overall this looks great! I apologize for the delayed response. I am going to go ahead and merge this now and then we can tune the performance in a later pull request.
[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/5403#issuecomment-97975070 This PR could have important performance implications for algorithms in GraphX and MLlib (e.g., ALS) which introduce relatively lightweight shuffle stages at each iteration.
[GitHub] spark pull request: [GraphX] initialmessage for pagerank should be...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1128#issuecomment-75451894 Great! I agree with this proposal as well. I apologize for letting it sit so long.
[GitHub] spark pull request: [SPARK-3650] Fix TriangleCount handling of rev...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/2495#issuecomment-70925718 Great! What else needs to be done? There was some discussion about how this might change the semantics of the triangle count function. Is this still true?
[GitHub] spark pull request: [SPARK-2365] Add IndexedRDD, an efficient upda...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1297#issuecomment-69504481 We should really address this stack overflow issue. Is there a JIRA we can promote?
[GitHub] spark pull request: [SPARK-2365] Add IndexedRDD, an efficient upda...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1297#issuecomment-69531109 Hmm, we really need to elevate this to a full issue. I have run into the stack overflow in MLlib (ALS) as well.
[GitHub] spark pull request: Removing confusing TripletFields
GitHub user jegonzal opened a pull request: https://github.com/apache/spark/pull/3472 Removing confusing TripletFields

After additional discussion with @rxin, I think having all the possible `TripletField` options is confusing. This pull request reduces the triplet fields to:

```java
/**
 * None of the triplet fields are exposed.
 */
public static final TripletFields None = new TripletFields(false, false, false);

/**
 * Expose only the edge field and not the source or destination field.
 */
public static final TripletFields EdgeOnly = new TripletFields(false, false, true);

/**
 * Expose the source and edge fields but not the destination field.
 */
public static final TripletFields Src = new TripletFields(true, false, true);

/**
 * Expose the destination and edge fields but not the source field.
 */
public static final TripletFields Dst = new TripletFields(false, true, true);

/**
 * Expose all the fields (source, edge, and destination).
 */
public static final TripletFields All = new TripletFields(true, true, true);
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jegonzal/spark SimplifyTripletFields

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3472.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3472

commit 91796b52c17f5c88f1be6c7fe13d49c8e0cf64b1
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date: 2014-11-26T06:26:43Z
removing confusing triplet fields
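The point of these constants is that a caller declares up front which triplet fields it will read, so the system can avoid shipping the unrequested vertex attributes. A minimal, hypothetical sketch of that idea in plain Python (the names mirror the Java constants above but are otherwise illustrative):

```python
from dataclasses import dataclass

# Hypothetical model of the TripletFields idea: the caller declares which
# fields it will read, and only those are populated.
@dataclass(frozen=True)
class TripletFields:
    use_src: bool
    use_dst: bool
    use_edge: bool

NONE = TripletFields(False, False, False)
EDGE_ONLY = TripletFields(False, False, True)
SRC = TripletFields(True, False, True)
DST = TripletFields(False, True, True)
ALL = TripletFields(True, True, True)

def build_triplet(fields, src_attr, dst_attr, edge_attr):
    # Unrequested fields stay None, so their attributes never need to be
    # shipped to the edge partition.
    return (
        src_attr if fields.use_src else None,
        dst_attr if fields.use_dst else None,
        edge_attr if fields.use_edge else None,
    )

assert build_triplet(EDGE_ONLY, "a", "b", 3.0) == (None, None, 3.0)
assert build_triplet(SRC, "a", "b", 3.0) == ("a", None, 3.0)
```

An aggregation that only reads the edge attribute (e.g., counting edges by weight) would pass `EdgeOnly` and skip vertex-attribute shipping entirely.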
[GitHub] spark pull request: Removing confusing TripletFields
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3472#issuecomment-64520673 @ankurdave, what do you think?
[GitHub] spark pull request: Removing confusing TripletFields
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3472#issuecomment-64520711 This is consistent with the current discussion in the GraphX programming guide, so it is unlikely users have started using the more obscure combinations that were removed.
[GitHub] spark pull request: Updating GraphX programming guide and document...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3359#issuecomment-63743607 Sounds good. I can fix it now if you want. Joey
[GitHub] spark pull request: Updating GraphX programming guide and document...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3359#issuecomment-63597028 @rxin and @ankurdave, take a look when you get a chance.
[GitHub] spark pull request: Improved GraphX PageRank Test Coverage
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1228#issuecomment-63597173 @ankurdave and @rxin can we merge this now?
[GitHub] spark pull request: Introducing an Improved Pregel API
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1217#issuecomment-63597212 @ankurdave should I try and update this with your latest changes or do you want to create a new one?
[GitHub] spark pull request: Updating GraphX programming guide and document...
Github user jegonzal commented on a diff in the pull request: https://github.com/apache/spark/pull/3359#discussion_r20559596

--- Diff: project/SparkBuild.scala ---

```scala
@@ -328,7 +328,7 @@ object Unidoc {
     unidocProjectFilter in(ScalaUnidoc, unidoc) :=
       inAnyProject -- inProjects(OldDeps.project, repl, examples, tools, catalyst, streamingFlumeSink, yarn, yarnAlpha),
     unidocProjectFilter in(JavaUnidoc, unidoc) :=
-      inAnyProject -- inProjects(OldDeps.project, repl, bagel, graphx, examples, tools, catalyst, streamingFlumeSink, yarn, yarnAlpha),
+      inAnyProject -- inProjects(OldDeps.project, repl, bagel, examples, tools, catalyst, streamingFlumeSink, yarn, yarnAlpha),
```

--- End diff --

In order for the TripletFields.java API to be rendered in the docs (i.e., the list of static fields) it must be compiled using javadoc.
[GitHub] spark pull request: Drop VD type parameter from EdgeRDD
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3303#issuecomment-63267156 Looks good to me.
[GitHub] spark pull request: Introducing an Improved Pregel API
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1217#issuecomment-62917449 @ankurdave is this already covered in your latest PR?
[GitHub] spark pull request: [SPARK-3650] Fix TriangleCount handling of rev...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/2495#issuecomment-62917547 @ankurdave take a look at this when you get a chance.
[GitHub] spark pull request: [SPARK-3936] Add aggregateMessages, which supe...
Github user jegonzal commented on a diff in the pull request: https://github.com/apache/spark/pull/3100#discussion_r20243545 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graphx; + +import java.io.Serializable; + +/** + * Represents a subset of the fields of an [[EdgeTriplet]] or [[EdgeContext]]. This allows the + * system to populate only those fields for efficiency. + */ +public class TripletFields implements Serializable { + public final boolean useSrc; + public final boolean useDst; + public final boolean useEdge; + + public TripletFields() { + this(true, true, true); + } + + public TripletFields(boolean useSrc, boolean useDst, boolean useEdge) { + this.useSrc = useSrc; + this.useDst = useDst; + this.useEdge = useEdge; + } + + public static final TripletFields None = new TripletFields(false, false, false); --- End diff -- Hmm, I agree, though I used many of them in the `GraphOps` code and decided maybe it would make sense to go ahead and be exhaustive. I think we could cut a few.
[GitHub] spark pull request: [SPARK-3936] Add aggregateMessages, which supe...
Github user jegonzal commented on a diff in the pull request: https://github.com/apache/spark/pull/3100#discussion_r20257677 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graphx; + +import java.io.Serializable; + +/** + * Represents a subset of the fields of an [[EdgeTriplet]] or [[EdgeContext]]. This allows the + * system to populate only those fields for efficiency. + */ +public class TripletFields implements Serializable { + public final boolean useSrc; + public final boolean useDst; + public final boolean useEdge; + + public TripletFields() { + this(true, true, true); + } + + public TripletFields(boolean useSrc, boolean useDst, boolean useEdge) { + this.useSrc = useSrc; + this.useDst = useDst; + this.useEdge = useEdge; + } + + public static final TripletFields None = new TripletFields(false, false, false); --- End diff -- How about we just keep: ``` EdgeOnly, SrcAndEdge, DstAndEdge, All ```
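The four presets proposed here can be sketched in plain Scala, mirroring the Java `TripletFields` class quoted in the diff. This is an illustration only; the sketch's class name and companion-object layout are assumptions, not the actual GraphX API.

```scala
// Sketch of the trimmed-down preset list: a flag set telling the system
// which triplet fields to populate. Names mirror the four presets proposed
// in the comment above.
final case class TripletFieldsSketch(useSrc: Boolean, useDst: Boolean, useEdge: Boolean)

object TripletFieldsSketch {
  val EdgeOnly   = TripletFieldsSketch(useSrc = false, useDst = false, useEdge = true)
  val SrcAndEdge = TripletFieldsSketch(useSrc = true,  useDst = false, useEdge = true)
  val DstAndEdge = TripletFieldsSketch(useSrc = false, useDst = true,  useEdge = true)
  val All        = TripletFieldsSketch(useSrc = true,  useDst = true,  useEdge = true)
}
```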
[GitHub] spark pull request: [SPARK-3936] Remove Bytecode Inspection for Jo...
Github user jegonzal closed the pull request at: https://github.com/apache/spark/pull/2815
[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61922586 The model serving work would really benefit from being able to evaluate models without requiring a Spark context, especially since we are shooting for latencies in the tens of milliseconds. More generally, we should think about how one might want to use the artifact of the pipeline. I suspect there are uses outside of Spark, and the extent to which the models themselves are portable functions could enable greater adoption.
[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61934417 @jkbradley Right now we are planning to serve linear combinations of models derived from MLlib (currently latent factor models, naive Bayes, and decision trees). I agree that in some cases (e.g., latent factor models) serving is less trivial (that's the research). Still, it would be good to think of models as output to be consumed by systems beyond Spark, should people want to do things beyond computing test error; admittedly, that's all I ever do :-).
[GitHub] spark pull request: [SPARK-3936] Remove Bytecode Inspection for Jo...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/2815#issuecomment-61349221 I added the `TripletFields` enum and updated all the dependent files. I can't deprecate the old API since they have the same function signature up to default arguments and I don't want to require the `TripletFields` be specified.
[GitHub] spark pull request: [SPARK-3936] Remove Bytecode Inspection for Jo...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/2815#issuecomment-61349425 At this point I could also imagine actually having a separate function closure for each version.

```scala
mapTriplets(f: Edge => ED2)
mapTriplets(f: SrcEdge => ED2)
mapTriplets(f: DstEdge => ED2)
mapTriplets(f: Triplet => ED2)
```

Though to do this would require users to annotate their functions:

```scala
g.mapTriplets( (t: SrcEdge) => t.src )
```

What do you all think?
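The annotation requirement comes from Scala overload resolution: a bare lambda has no inherent parameter type, so overloads that differ only in the function's input type are ambiguous until the caller annotates. A minimal self-contained sketch, using hypothetical `Src`/`Dst` types rather than the real GraphX ones:

```scala
// Two overloads differing only in the function's input type. A bare
// lambda like `t => t.attr` cannot be resolved; `(t: Src) => t.attr` can.
final case class Src(attr: Int)
final case class Dst(attr: Int)

object Overloads {
  def mapTriplets(f: Src => Int): String = "src-only variant: " + f(Src(1))
  def mapTriplets(f: Dst => Int): String = "dst-only variant: " + f(Dst(2))
}
```

With the parameter type annotated, `Overloads.mapTriplets((t: Src) => t.attr)` picks the first overload; without it, the call fails to compile.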
[GitHub] spark pull request: [SPARK-4130][MLlib] Fixing libSVM parser bug w...
GitHub user jegonzal opened a pull request: https://github.com/apache/spark/pull/2996 [SPARK-4130][MLlib] Fixing libSVM parser bug with extra whitespace This simple patch filters out extra whitespace entries. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jegonzal/spark loadLibSVM Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2996.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2996 commit e028e8443ff38e3617a1bdd0a2a3f5ec9b42d980 Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com Date: 2014-10-29T07:13:56Z fixing whitespace bug in loadLibSVMFile when parsing libSVM files
[GitHub] spark pull request: [SPARK-4130][MLlib] Fixing libSVM parser bug w...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/2996#issuecomment-61026000 Not sure why it failed the test. Is this an issue with the testing framework?
[GitHub] spark pull request: [SPARK-4130][MLlib] Fixing libSVM parser bug w...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/2996#issuecomment-61026298 The following implementation seems a bit more efficient but is needlessly complicated.

```scala
// Count the number of empty values
var i = 1
var emptyValues = 0
while (i < items.size) {
  if (items(i).isEmpty) emptyValues += 1
  i += 1
}
// Determine the number of non-zero entries
val nnzs = items.size - 1 - emptyValues
// Compute the indices
val indices = new Array[Int](nnzs)
val values = new Array[Double](nnzs)
i = 1
var j = 0
while (i < items.size) {
  if (!items(i).isEmpty) {
    val indexAndValue = items(i).split(':')
    indices(j) = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.
    values(j) = indexAndValue(1).toDouble
    j += 1
  }
  i += 1
}
```
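The simpler filter-based approach the PR describes ("filters out extra whitespace entries") can be sketched for a single libSVM line as follows. This is an illustration of the idea, with a hypothetical helper name, not the actual `loadLibSVMFile` code:

```scala
// Parse one libSVM line such as "1.0 2:0.5 4:1.5". Splitting on single
// spaces turns runs of whitespace into empty tokens, which are filtered
// out before parsing each "index:value" pair.
def parseLibSVMLine(line: String): (Double, Array[Int], Array[Double]) = {
  val items = line.trim.split(' ')
  val label = items.head.toDouble
  val (indices, values) = items.tail.filter(_.nonEmpty).map { item =>
    val Array(index, value) = item.split(':')
    (index.toInt - 1, value.toDouble) // convert 1-based indices to 0-based
  }.unzip
  (label, indices, values)
}
```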
[GitHub] spark pull request: [SPARK-4142][GraphX] Default numEdgePartitions
GitHub user jegonzal opened a pull request: https://github.com/apache/spark/pull/3006 [SPARK-4142][GraphX] Default numEdgePartitions Changing the default number of edge partitions to match spark parallelism. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jegonzal/spark default_partitions Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3006.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3006 commit a9a5c4f28ba7d5c29a974045960f45a55640df19 Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com Date: 2014-10-30T00:06:55Z Changing the default number of edge partitions to match spark parallelism
[GitHub] spark pull request: [SPARK-3936] Remove Bytecode Inspection for Jo...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/2815#issuecomment-61026843 What is the status on this patch? I would like to merge it soon so that the Python GraphX API can support these additional flags.
[GitHub] spark pull request: [SPARK-3650] Fix TriangleCount handling of rev...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/2495#issuecomment-61026881 What is the status on this patch?
[GitHub] spark pull request: Introducing an Improved Pregel API
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1217#issuecomment-61027554 This is still work in progress and we need to discuss these API changes.
[GitHub] spark pull request: Improved GraphX PageRank Test Coverage
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1228#issuecomment-61029490 ok to test
[GitHub] spark pull request: Improved GraphX PageRank Test Coverage
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1228#issuecomment-61029482 This should now be addressed in the latest master and does not depend on PR #1217
[GitHub] spark pull request: Remove Bytecode Inspection for Join Eliminatio...
GitHub user jegonzal opened a pull request: https://github.com/apache/spark/pull/2815 Remove Bytecode Inspection for Join Elimination Removing bytecode inspection from triplet operations and introducing explicit join elimination flags. The explicit flags make the join elimination more robust. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jegonzal/spark SPARK-3936 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2815.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2815 commit 2e471584d70aa8029a7eab9643cfdcb3e758a9d7 Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com Date: 2014-10-15T19:41:24Z Removing bytecode inspection from triplet operations and introducing explicit join elimination flags.
[GitHub] spark pull request: Remove Bytecode Inspection for Join Eliminatio...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/2815#issuecomment-59263992 @ankurdave and @rxin I have not updated the applications to use the new explicit flags. I will do that in this PR pending approval for the API changes.
[GitHub] spark pull request: [SPARK-3936] Remove Bytecode Inspection for Jo...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/2815#issuecomment-59275910 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-3578] Fix upper bound in GraphGenerator...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/2439#issuecomment-56438311 This looks good to me.
[GitHub] spark pull request: [SPARK-3650] Fix TriangleCount handling of rev...
GitHub user jegonzal opened a pull request: https://github.com/apache/spark/pull/2495 [SPARK-3650] Fix TriangleCount handling of reverse edges This PR causes the TriangleCount algorithm to remove self-edges, direct edges from low-id to high-id (canonical direction), and then remove duplicate edges, before running the triangle count algorithm. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jegonzal/spark FixTriangleCount Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2495.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2495 commit daf1ab5e66259d0af449a91ca7c230323ac49daa Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com Date: 2014-09-22T21:57:28Z Improving Triangle Count commit d054d33181486e3b90222e5e30b2f20648434673 Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com Date: 2014-09-22T22:16:46Z fixing bug in unit tests where bi-directed edges lead to duplicate triangles.
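The edge preprocessing described above can be sketched without Spark on a plain collection of edges. This is a hedged illustration of the canonicalization idea, not the PR's GraphX implementation:

```scala
// Drop self-edges, orient each edge from low id to high id (the canonical
// direction), then deduplicate, so a bidirected pair counts as one edge.
def canonicalizeEdges(edges: Seq[(Long, Long)]): Set[(Long, Long)] =
  edges.iterator
    .filter { case (src, dst) => src != dst }              // remove self-edges
    .map { case (src, dst) => (src min dst, src max dst) } // low-id -> high-id
    .toSet                                                 // remove duplicates
```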
[GitHub] spark pull request: [SPARK-3263][GraphX] Fix changes made to Graph...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/2168#issuecomment-53760885 The code changes look good to me (and were badly needed). Thanks for fixing it!
[GitHub] spark pull request: Improved GraphX PageRank Test Coverage
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1228#issuecomment-53776044 Yes. This is an extension of the unit tests to catch a class of bugs addressed in PR #1217 (which has not been merged). I believe @ankurdave was working on a merge of these two pull requests.
[GitHub] spark pull request: Introducing an Improved Pregel API
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1217#issuecomment-47193665 I spent some time verifying the math behind PageRank (in particular, starting values) to ensure that the delta formulation behaves identically to the static formulation, which matches other reference implementations of PageRank. One of the key changes is that I have added an extra normalization step at the end of the calculation to address a discrepancy in how we handle dangling vertices.
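One way the final normalization step might look, sketched without Spark and under the assumption that ranks are rescaled so their total matches the vertex count (rank mass leaks each iteration when dangling vertices have no out-edges); the exact convention in the PR may differ:

```scala
// Rescale ranks so they sum to the number of vertices, compensating for
// rank mass lost to dangling vertices. Illustrative sketch only.
def normalizeRanks(ranks: Map[Long, Double]): Map[Long, Double] = {
  val total = ranks.values.sum
  val scale = ranks.size / total
  ranks.map { case (vid, rank) => vid -> rank * scale }
}
```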
[GitHub] spark pull request: Introducing an Improved Pregel API
Github user jegonzal commented on a diff in the pull request: https://github.com/apache/spark/pull/1217#discussion_r14227560 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala --- @@ -158,4 +169,125 @@ object Pregel extends Logging { g } // end of apply + /** + * Execute a Pregel-like iterative vertex-parallel abstraction. The + * user-defined vertex-program `vprog` is executed in parallel on + * each vertex receiving any inbound messages and computing a new + * value for the vertex. The `sendMsg` function is then invoked on + * all out-edges and is used to compute an optional message to the + * destination vertex. The `mergeMsg` function is a commutative + * associative function used to combine messages destined to the + * same vertex. + * + * On the first iteration all vertices receive the `initialMsg` and + * on subsequent iterations if a vertex does not receive a message + * then the vertex-program is not invoked. + * + * This function iterates until there are no remaining messages, or + * for `maxIterations` iterations. + * + * @tparam VD the vertex data type + * @tparam ED the edge data type + * @tparam A the Pregel message type + * + * @param graph the input graph. + * + * @param initialMsg the message each vertex will receive on the first iteration + * + * @param maxIterations the maximum number of iterations to run for + * + * @param activeDirection the direction of edges incident to a vertex that received a message in + * the previous round on which to run `sendMsg`. For example, if this is `EdgeDirection.Out`, only + * out-edges of vertices that received a message in the previous round will run. The default is + * `EdgeDirection.Either`, which will run `sendMsg` on edges where either side received a message + * in the previous round. If this is `EdgeDirection.Both`, `sendMsg` will only run on edges where + * *both* vertices received a message. + * + * @param vprog the user-defined vertex program which runs on each + * vertex and receives the inbound message and computes a new vertex + * value. On the first iteration the vertex program is invoked on + * all vertices and is passed the default message. On subsequent + * iterations the vertex program is only invoked on those vertices + * that receive messages. + * + * @param sendMsg a user supplied function that is applied to out + * edges of vertices that received messages in the current + * iteration + * + * @param mergeMsg a user supplied function that takes two incoming + * messages of type A and merges them into a single message of type + * A. ''This function must be commutative and associative and + * ideally the size of A should not increase.'' + * + * @return the resulting graph at the end of the computation + * + */ + def run[VD: ClassTag, ED: ClassTag, A: ClassTag] + (graph: Graph[VD, ED], + maxIterations: Int = Int.MaxValue, + activeDirection: EdgeDirection = EdgeDirection.Either) + (vertexProgram: (VertexId, VD, Option[A], VertexContext) => VD, + sendMsg: (EdgeTriplet[VD, ED], EdgeContext) => Iterator[(VertexId, A)], + mergeMsg: (A, A) => A) + : Graph[VD, ED] = + { + // Initialize the graph with all vertices active + var g: Graph[(VD, Boolean), ED] = graph.mapVertices { (vid, vdata) => (vdata, true) }.cache() + // Determine the set of vertices that did not vote to halt + var activeVertices = g.vertices + var numActive = activeVertices.count() + var i = 0 + while (numActive > 0 && i < maxIterations) { + // The send message wrapper removes the active fields from the triplet and places them in the edge context. + def sendMessageWrapper(triplet: EdgeTriplet[(VD, Boolean), ED]): Iterator[(VertexId, A)] = { + val simpleTriplet = new EdgeTriplet[VD, ED]() + simpleTriplet.set(triplet) + simpleTriplet.srcAttr = triplet.srcAttr._1 + simpleTriplet.dstAttr = triplet.dstAttr._1 + val ctx = new EdgeContext(i, triplet.srcAttr._2, triplet.dstAttr._2) + sendMsg(simpleTriplet, ctx) + } + + // Compute the messages for all the active vertices + val messages = g.mapReduceTriplets(sendMessageWrapper, mergeMsg, Some((activeVertices, activeDirection))) + + // get a reference to the current graph so that we can unpersist it once the new graph is created. + val prevG = g + + // Receive the messages to the subset of active vertices + g = g.outerJoinVertices(messages){ (vid, dataAndActive, msgOpt) => + val (vdata
[GitHub] spark pull request: Introducing an Improved Pregel API
Github user jegonzal commented on a diff in the pull request: https://github.com/apache/spark/pull/1217#discussion_r14227573

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -158,4 +169,125 @@ object Pregel extends Logging {
     g
   } // end of apply

+  /**
+   * Execute a Pregel-like iterative vertex-parallel abstraction. The
+   * user-defined vertex-program `vprog` is executed in parallel on
+   * each vertex receiving any inbound messages and computing a new
+   * value for the vertex. The `sendMsg` function is then invoked on
+   * all out-edges and is used to compute an optional message to the
+   * destination vertex. The `mergeMsg` function is a commutative
+   * associative function used to combine messages destined to the
+   * same vertex.
+   *
+   * On the first iteration all vertices receive the `initialMsg` and
+   * on subsequent iterations if a vertex does not receive a message
+   * then the vertex-program is not invoked.
+   *
+   * This function iterates until there are no remaining messages, or
+   * for `maxIterations` iterations.
+   *
+   * @tparam VD the vertex data type
+   * @tparam ED the edge data type
+   * @tparam A the Pregel message type
+   *
+   * @param graph the input graph.
+   *
+   * @param initialMsg the message each vertex will receive on the
+   * first iteration
+   *
+   * @param maxIterations the maximum number of iterations to run for
+   *
+   * @param activeDirection the direction of edges incident to a vertex that received a message in
+   * the previous round on which to run `sendMsg`. For example, if this is `EdgeDirection.Out`, only
+   * out-edges of vertices that received a message in the previous round will run. The default is
+   * `EdgeDirection.Either`, which will run `sendMsg` on edges where either side received a message
+   * in the previous round. If this is `EdgeDirection.Both`, `sendMsg` will only run on edges where
+   * *both* vertices received a message.
+   *
+   * @param vprog the user-defined vertex program which runs on each
+   * vertex and receives the inbound message and computes a new vertex
+   * value. On the first iteration the vertex program is invoked on
+   * all vertices and is passed the default message. On subsequent
+   * iterations the vertex program is only invoked on those vertices
+   * that receive messages.
+   *
+   * @param sendMsg a user supplied function that is applied to out
+   * edges of vertices that received messages in the current
+   * iteration
+   *
+   * @param mergeMsg a user supplied function that takes two incoming
+   * messages of type A and merges them into a single message of type
+   * A. ''This function must be commutative and associative and
+   * ideally the size of A should not increase.''
+   *
+   * @return the resulting graph at the end of the computation
+   *
+   */
+  def run[VD: ClassTag, ED: ClassTag, A: ClassTag]
+      (graph: Graph[VD, ED],
+       maxIterations: Int = Int.MaxValue,
+       activeDirection: EdgeDirection = EdgeDirection.Either)
+      (vertexProgram: (VertexId, VD, Option[A], VertexContext) => VD,
+       sendMsg: (EdgeTriplet[VD, ED], EdgeContext) => Iterator[(VertexId, A)],
+       mergeMsg: (A, A) => A)
+    : Graph[VD, ED] =
+  {
+    // Initialize the graph with all vertices active
+    var g: Graph[(VD, Boolean), ED] = graph.mapVertices { (vid, vdata) => (vdata, true) }.cache()
+    // Determine the set of vertices that did not vote to halt
+    var activeVertices = g.vertices
+    var numActive = activeVertices.count()
+    var i = 0
+    while (numActive > 0 && i < maxIterations) {
+      // The send message wrapper removes the active fields from the triplet and places them in the edge context.
+      def sendMessageWrapper(triplet: EdgeTriplet[(VD, Boolean), ED]): Iterator[(VertexId, A)] = {
+        val simpleTriplet = new EdgeTriplet[VD, ED]()
+        simpleTriplet.set(triplet)
+        simpleTriplet.srcAttr = triplet.srcAttr._1
+        simpleTriplet.dstAttr = triplet.dstAttr._1
+        val ctx = new EdgeContext(i, triplet.srcAttr._2, triplet.dstAttr._2)
+        sendMsg(simpleTriplet, ctx)
+      }
+
+      // Compute the messages for all the active vertices
+      val messages = g.mapReduceTriplets(sendMessageWrapper, mergeMsg, Some((activeVertices, activeDirection)))
+
+      // get a reference to the current graph so that we can unpersist it once the new graph is created.
+      val prevG = g
+
+      // Receive the messages to the subset of active vertices
+      g = g.outerJoinVertices(messages){ (vid, dataAndActive, msgOpt) =>
+        val (vdata
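The halting semantics above (a vertex votes to halt each round and is revived only by receiving a message) can be illustrated with a toy single-machine sketch. This is not the GraphX implementation; the graph, values, and `run` helper are invented for this example, which propagates a maximum value along the chain 1 -> 2 -> 3.

```scala
object ToyPregel {
  // Tiny directed graph 1 -> 2 -> 3; we propagate the maximum value forward.
  val edges = Seq((1L, 2L), (2L, 3L))

  def run(init: Map[Long, Int], maxIterations: Int = 10): Map[Long, Int] = {
    var values = init
    var active: Set[Long] = init.keySet   // every vertex starts active
    var i = 0
    while (active.nonEmpty && i < maxIterations) {
      // sendMsg runs only on out-edges of active sources
      val sent = edges.collect { case (s, d) if active(s) => d -> values(s) }
      // mergeMsg: combine messages to the same destination with max
      val msgs = sent.groupBy(_._1).map { case (d, ms) => d -> ms.map(_._2).max }
      // vprog: a vertex updates on its message, then votes to halt;
      // only message receivers are active in the next round
      values = values ++ msgs.map { case (d, m) => d -> math.max(values(d), m) }
      active = msgs.keySet
      i += 1
    }
    values
  }
}
```

The loop terminates when no messages are sent, mirroring the `numActive > 0` condition in the real code.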
[GitHub] spark pull request: Improved GraphX PageRank Test Coverage
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1228#issuecomment-47200276 @ankurdave thanks for pointing out this bug! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Introducing an Improved Pregel API
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1217#issuecomment-47204112

@ankurdave and @rxin there is an issue with the current API. The `sendMessage` function pulls the active field out of the vertex value here: https://github.com/apache/spark/pull/1217/files#diff-e399679417ffa6eeedf26a7630baca16R243

```scala
def sendMessageWrapper(triplet: EdgeTriplet[(VD, Boolean), ED]): Iterator[(VertexId, A)] = {
  val simpleTriplet = new EdgeTriplet[VD, ED]()
  simpleTriplet.set(triplet)
  simpleTriplet.srcAttr = triplet.srcAttr._1
  simpleTriplet.dstAttr = triplet.dstAttr._1
  val ctx = new EdgeContext(i, triplet.srcAttr._2, triplet.dstAttr._2)
  sendMsg(simpleTriplet, ctx)
}

// Compute the messages for all the active vertices
val messages = g.mapReduceTriplets(sendMessageWrapper, mergeMsg, Some((activeVertices, activeDirection)))
```

thereby allowing the user a simple `sendMsg` interface:

```scala
sendMsg: (EdgeTriplet[VD, ED], EdgeContext) => Iterator[(VertexId, A)]
```

However, because we access the source and destination vertex attributes, the bytecode inspection will force a full 3-way join even if the user doesn't actually read those fields. The simplest solution would be to change the send message interface to operate on the extended vertex attribute (containing the active field):

```scala
sendMsg: (EdgeTriplet[(VD, Boolean), ED], EdgeContext) => Iterator[(VertexId, A)]
```
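For illustration, the proposed signature can be mocked up with hypothetical stand-ins for GraphX's `EdgeTriplet`/`EdgeContext` (the real types differ): under this interface `sendMsg` sees the extended `(VD, Boolean)` attributes directly, so it can consult the active flags itself rather than the framework joining in both vertex attributes unconditionally.

```scala
object SendMsgSketch {
  // Hypothetical stand-ins, not the GraphX types.
  case class STriplet[VD, ED](srcId: Long, dstId: Long,
                              srcAttr: (VD, Boolean), dstAttr: (VD, Boolean),
                              attr: ED)
  case class SCtx(iteration: Int)

  // Example user sendMsg under the proposed signature: only active sources send.
  def sendMsg(t: STriplet[Int, Int], ctx: SCtx): Iterator[(Long, Int)] =
    if (t.srcAttr._2) Iterator(t.dstId -> (t.srcAttr._1 + t.attr))
    else Iterator.empty
}
```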
[GitHub] spark pull request: Introducing an Improved Pregel API
GitHub user jegonzal opened a pull request: https://github.com/apache/spark/pull/1217

Introducing an Improved Pregel API

The initial Pregel API coupled voting to halt with message reception. In this revised version the vertex program receives a `PregelContext`, which enables the user to signal whether or not to halt as well as access the current iteration number.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jegonzal/spark PregelContext

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1217.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1217

commit ca0fcc8797769f706e14cadd773ace7c2ff53e0a
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date: 2014-06-25T22:24:54Z

    Starting a revised version of Pregel
[GitHub] spark pull request: Introducing an Improved Pregel API
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/1217#issuecomment-47167314

@ankurdave unfortunately, to fully accept this change we will need to break compatibility with the current Pregel API. I cannot seem to overload the apply method.
[GitHub] spark pull request: Synthetic GraphX Benchmark
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/720#issuecomment-42757213

Good point! I moved the benchmark into the examples folder. Is there a standard format for command line args in the example applications?
[GitHub] spark pull request: Synthetic GraphX Benchmark
GitHub user jegonzal opened a pull request: https://github.com/apache/spark/pull/720

Synthetic GraphX Benchmark

This PR accomplishes two things:

1. It introduces a Synthetic Benchmark application that generates an arbitrarily large log-normal graph and executes either PageRank or connected components on the graph. This can be used to profile the GraphX system on arbitrary clusters without access to large graph datasets.
2. This PR improves the implementation of the log-normal graph generator.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jegonzal/spark graphx_synth_benchmark

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/720.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #720

commit c81ee23bb66918052efefa81a9e8077951e5ebf5
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date: 2014-05-05T22:52:58Z

    Creating a synthetic benchmark script.
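For reference, sampling out-degrees from a log-normal distribution (the distribution this benchmark's generator targets) can be sketched as follows. The `mu`/`sigma` values and the simple truncation at `maxVal` here are illustrative, not the actual parameters or logic of the GraphX generator.

```scala
import scala.util.Random

object LogNormalSketch {
  // Draw a log-normal degree: exp of a Gaussian draw, rounded and capped.
  def sampleLogNormal(mu: Double, sigma: Double, maxVal: Int, rng: Random): Int = {
    val x = math.exp(mu + sigma * rng.nextGaussian())
    math.min(math.round(x).toInt, maxVal)
  }
}
```

Because degrees are exp of a Gaussian, a few vertices get very high degree while most stay small, which is what makes the benchmark graph skewed like real-world graphs.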
[GitHub] spark pull request: Enable repartitioning of graph over different ...
GitHub user jegonzal opened a pull request: https://github.com/apache/spark/pull/719

Enable repartitioning of graph over different number of partitions

It is currently very difficult to repartition a graph over a different number of partitions. This PR adds an additional `partitionBy` function that takes the number of partitions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jegonzal/spark graph_partitioning_options

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/719.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #719

commit 54412fc658018c8285190fdd26b43f324dd1f580
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date: 2014-05-09T23:26:59Z

    adding an additional number of partitions option to partitionBy
[GitHub] spark pull request: Unify GraphImpl RDDs + other graph load optimi...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/497#issuecomment-42618546

I went through this PR with Ankur and it looks good to me. There are a few minor changes but those can be moved to a second PR.
[GitHub] spark pull request: SPARK-1786: Reopening PR 724
GitHub user jegonzal opened a pull request: https://github.com/apache/spark/pull/742

SPARK-1786: Reopening PR 724

Addressing issue in MimaBuild.scala.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jegonzal/spark edge_partition_serialization

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/742.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #742

commit 67dac22884b098b72c277dbe6e344da796a5321c
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date: 2014-05-10T01:54:56Z

    Making EdgePartition serializable.

commit bb7f548542d58ee6ac2dbdf868fea165fdf4f415
Author: Ankur Dave ankurd...@gmail.com
Date: 2014-05-10T03:09:48Z

    Add failing test for EdgePartition Kryo serialization

commit b0a525a7f48a6b13cf8687e5e6d8ba3d3bf852f5
Author: Ankur Dave ankurd...@gmail.com
Date: 2014-05-10T03:12:38Z

    Disable reference tracking to fix serialization test

commit d8b70fbca17534eb8f60e8feb4a9fdd5996fdcd8
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date: 2014-05-12T18:20:49Z

    addressing missing exclusion in MimaBuild.scala
[GitHub] spark pull request: SPARK-1786: Reopening PR 724
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/742#issuecomment-42868913

@ankurdave and @pwendell I am reopening PR 724 to address the issue with MimaBuild. I believe I made the required changes, but how can I verify?
[GitHub] spark pull request: SPARK-1786: Edge Partition Serialization
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/724#issuecomment-42787343

I would like to get it into 1.0 if possible. Otherwise, we could run into issues if the user persists graphs to disk or straggler mitigation is used. @ankurdave do you see any issues with trying to get this into 1.0?
[GitHub] spark pull request: Fix error in 2d Graph Partitioner
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/709#issuecomment-42703154

@rxin and @ankurdave take a look at this minor change when you get a chance. I would like to get it into the next release if possible.
[GitHub] spark pull request: SPARK-1786: Edge Partition Serialization
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/724#issuecomment-42793347

My only concern is that I would prefer that things work slowly rather than fail. With reference tracking disabled it is not possible to serialize user-defined types from the spark-shell. A second concern is that it will be difficult for the user to enable reference tracking if we disable it in the GraphX Kryo registrar.
[GitHub] spark pull request: SPARK-1577: Enabling reference tracking by def...
GitHub user jegonzal opened a pull request: https://github.com/apache/spark/pull/499

SPARK-1577: Enabling reference tracking by default in GraphX KryoRegistrator.

We had originally disabled reference tracking by default; however, this now seems to create serious issues in the spark-shell, where even the following benign block of code will fail:

```scala
class A(a: String) extends Serializable
val x = sc.parallelize(Array.fill(10)(new A("hello")))
x.collect
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jegonzal/spark graphx-kryo-issue

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/499.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #499

commit 8779a836ff6d134b445a051a13c5d85130a9f848
Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com
Date: 2014-04-23T05:59:24Z

    Enabling reference tracking by default in GraphX KryoRegistrator.
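For context, users who want to control reference tracking themselves can set Spark's Kryo options directly; a sketch of the relevant `spark-defaults.conf` entries (which setting wins when a registrator also changes tracking should be verified against the Spark version in use):

```
spark.serializer              org.apache.spark.serializer.KryoSerializer
spark.kryo.referenceTracking  true
```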
[GitHub] spark pull request: Add Shortest-path computations to graphx.lib w...
Github user jegonzal commented on a diff in the pull request: https://github.com/apache/spark/pull/10#discussion_r11915905

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.lib
+
+import org.apache.spark.graphx._
+
+object ShortestPaths {
+  type SPMap = Map[VertexId, Int] // map of landmarks -> minimum distance to landmark
--- End diff --

The Scala map data structures can be pretty costly and inefficient. Instead you could use an array containing the distances and then maintain a global map (shared via a broadcast variable) with the mapping from vertex id to index in the array. This should also reduce the memory overhead substantially since each vertex will not need to maintain its own local Map data structure.
[GitHub] spark pull request: Add Shortest-path computations to graphx.lib w...
Github user jegonzal commented on a diff in the pull request: https://github.com/apache/spark/pull/10#discussion_r11916394

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.lib
+
+import org.apache.spark.graphx._
+
+object ShortestPaths {
+  type SPMap = Map[VertexId, Int] // map of landmarks -> minimum distance to landmark
--- End diff --

Essentially remap the landmarks to a consecutive landmark id set; then on the initial creation of spGraph you would use the single broadcast map, but from then on no map data structures would be required.
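The remapping described above can be sketched in plain Scala. The landmark ids here are illustrative, and in GraphX the `landmarkIndex` map would be shared via a broadcast variable; after initialization, every per-vertex operation works on a plain `Array[Int]`.

```scala
object LandmarkArraySketch {
  val landmarks: Seq[Long] = Seq(10L, 42L, 7L)           // illustrative landmark ids
  // Remap landmarks to consecutive indices once (broadcast in real GraphX).
  val landmarkIndex: Map[Long, Int] = landmarks.zipWithIndex.toMap

  // Initial vertex state: all distances unknown except a landmark's own slot.
  def init(vid: Long): Array[Int] = {
    val a = Array.fill(landmarks.size)(Int.MaxValue)
    landmarkIndex.get(vid).foreach(i => a(i) = 0)
    a
  }

  // Element-wise min replaces the keySet-union merge of the Map version.
  def plus(a: Array[Int], b: Array[Int]): Array[Int] =
    a.zip(b).map { case (x, y) => math.min(x, y) }

  // Hop increment; Int.MaxValue stays "unreachable".
  def increment(a: Array[Int]): Array[Int] =
    a.map(d => if (d == Int.MaxValue) d else d + 1)
}
```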
[GitHub] spark pull request: Add Shortest-path computations to graphx.lib w...
Github user jegonzal commented on a diff in the pull request: https://github.com/apache/spark/pull/10#discussion_r11916531

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.lib
+
+import org.apache.spark.graphx._
+
+object ShortestPaths {
+  type SPMap = Map[VertexId, Int] // map of landmarks -> minimum distance to landmark
+  def SPMap(x: (VertexId, Int)*) = Map(x: _*)
+  def increment(spmap: SPMap): SPMap = spmap.map { case (v, d) => v -> (d + 1) }
+  def plus(spmap1: SPMap, spmap2: SPMap): SPMap =
+    (spmap1.keySet ++ spmap2.keySet).map {
+      k => k -> scala.math.min(spmap1.getOrElse(k, Int.MaxValue), spmap2.getOrElse(k, Int.MaxValue))
+    }.toMap
+
+  /**
+   * Compute the shortest paths to each landmark for each vertex and
+   * return an RDD with the map of landmarks to their shortest-path
+   * lengths.
+   *
+   * @tparam VD the shortest paths map for the vertex
+   * @tparam ED the incremented shortest-paths map of the originating
+   * vertex (discarded in the computation)
+   *
+   * @param graph the graph for which to compute the shortest paths
+   * @param landmarks the list of landmark vertex ids
+   *
+   * @return a graph with vertex attributes containing a map of the
+   * shortest paths to each landmark
+   */
+  def run[VD, ED](graph: Graph[VD, ED], landmarks: Seq[VertexId])
+    (implicit m1: Manifest[VD], m2: Manifest[ED]): Graph[SPMap, SPMap] = {
+
+    val spGraph = graph
+      .mapVertices { (vid, attr) =>
+        if (landmarks.contains(vid)) SPMap(vid -> 0)
+        else SPMap()
+      }
+      .mapTriplets { edge => edge.srcAttr }
+
+    val initialMessage = SPMap()
+
+    def vertexProgram(id: VertexId, attr: SPMap, msg: SPMap): SPMap = {
+      plus(attr, msg)
+    }
+
+    def sendMessage(edge: EdgeTriplet[SPMap, SPMap]): Iterator[(VertexId, SPMap)] = {
+      val newAttr = increment(edge.srcAttr)
--- End diff --

It might be worth considering adding support for edge weights instead of assuming all edges are length 1.
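A weighted variant of the `increment`/`plus` steps might look like the following sketch, assuming distances become `Double` and each triplet exposes a `Double` edge weight; the names here are illustrative, not part of the PR.

```scala
object WeightedSPSketch {
  type SPMap = Map[Long, Double]

  // Propagate distances across an edge of the given weight instead of +1.
  def increment(spmap: SPMap, weight: Double): SPMap =
    spmap.map { case (v, d) => v -> (d + weight) }

  // Merge stays an element-wise min, now over Doubles.
  def plus(a: SPMap, b: SPMap): SPMap =
    (a.keySet ++ b.keySet).map { k =>
      k -> math.min(a.getOrElse(k, Double.MaxValue), b.getOrElse(k, Double.MaxValue))
    }.toMap
}
```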
[GitHub] spark pull request: Add Shortest-path computations to graphx.lib w...
Github user jegonzal commented on a diff in the pull request: https://github.com/apache/spark/pull/10#discussion_r11916619

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.lib
+
+import org.apache.spark.graphx._
+
+object ShortestPaths {
+  type SPMap = Map[VertexId, Int] // map of landmarks -> minimum distance to landmark
+  def SPMap(x: (VertexId, Int)*) = Map(x: _*)
+  def increment(spmap: SPMap): SPMap = spmap.map { case (v, d) => v -> (d + 1) }
+  def plus(spmap1: SPMap, spmap2: SPMap): SPMap =
+    (spmap1.keySet ++ spmap2.keySet).map {
+      k => k -> scala.math.min(spmap1.getOrElse(k, Int.MaxValue), spmap2.getOrElse(k, Int.MaxValue))
+    }.toMap
+
+  /**
+   * Compute the shortest paths to each landmark for each vertex and
+   * return an RDD with the map of landmarks to their shortest-path
+   * lengths.
+   *
+   * @tparam VD the shortest paths map for the vertex
+   * @tparam ED the incremented shortest-paths map of the originating
+   * vertex (discarded in the computation)
+   *
+   * @param graph the graph for which to compute the shortest paths
+   * @param landmarks the list of landmark vertex ids
+   *
+   * @return a graph with vertex attributes containing a map of the
+   * shortest paths to each landmark
+   */
+  def run[VD, ED](graph: Graph[VD, ED], landmarks: Seq[VertexId])
+    (implicit m1: Manifest[VD], m2: Manifest[ED]): Graph[SPMap, SPMap] = {
+
+    val spGraph = graph
+      .mapVertices { (vid, attr) =>
+        if (landmarks.contains(vid)) SPMap(vid -> 0)
+        else SPMap()
--- End diff --

If we switch to an array implementation of the map then perhaps set the distance to MaxInt (or MaxDouble if we switch to weighted edges).
[GitHub] spark pull request: Add Shortest-path computations to graphx.lib w...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/10#issuecomment-41199189

This code looks good to me. All my comments are with respect to potential performance issues.