Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-02-03 Thread Jay Hutfles
I think this is a separate issue with how the EdgeRDDImpl partitions edges. If you can merge this change in and rebuild, it should work: https://github.com/apache/spark/pull/4136/files If you can't, I just called the Graph.partitionBy() method right after constructing my graph but before

Re: GraphX pregel: getting the current iteration number

2015-02-03 Thread Daniil Osipov
I don't think it's possible to access. What I've done before is send the current or next iteration index with the message, where the message is a case class. HTH Dan On Tue, Feb 3, 2015 at 10:20 AM, Matthew Cornell corn...@cs.umass.edu wrote: Hi Folks, I'm new to GraphX and Scala and my
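A minimal sketch of this case-class approach (all names and types here are illustrative, not from the thread): the vertex attribute remembers the superstep in which it last ran, and each outgoing message carries that index plus one.

```scala
import org.apache.spark.graphx._

// Hypothetical message type carrying the iteration index alongside the payload.
case class Msg(iter: Int, value: Double)

def runWithIteration(g: Graph[Double, Double]): Graph[(Double, Int), Double] =
  g.mapVertices((_, v) => (v, 0)).pregel(Msg(0, 0.0))(
    // vprog: adopt the iteration number carried by the message
    (id, attr, msg) => (attr._1 + msg.value, msg.iter),
    // sendMsg: the message announces the next superstep's index
    t => Iterator((t.dstId, Msg(t.srcAttr._2 + 1, t.srcAttr._1))),
    // mergeMsg: messages merged within one superstep share the same index
    (a, b) => Msg(math.max(a.iter, b.iter), a.value + b.value))
```

The iteration count is then available both inside vprog (via the message) and afterwards in the vertex attribute.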

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-02-02 Thread Sonal Goyal
That may be the cause of your issue. Take a look at the tuning guide[1] and maybe also profile your application. See if you can reuse your objects. 1. http://spark.apache.org/docs/latest/tuning.html Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-02-02 Thread Yifan LI
Thanks, Sonal. But it seems to be an error that happened when “cleaning broadcast”? BTW, what is the timeout of “[30 seconds]”? Can I increase it? Best, Yifan LI On 02 Feb 2015, at 11:12, Sonal Goyal sonalgoy...@gmail.com wrote: That may be the cause of your issue. Take a look at the

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-02-02 Thread Yifan LI
I think this broadcast cleaning(memory block remove?) timeout exception was caused by: 15/02/02 11:48:49 ERROR TaskSchedulerImpl: Lost executor 13 on small18-tap1.common.lip6.fr: remote Akka client disassociated 15/02/02 11:48:49 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-02-02 Thread NicolasC
On 01/29/2015 08:31 PM, Ankur Dave wrote: Thanks for the reminder. I just created a PR: https://github.com/apache/spark/pull/4273 Ankur Hello, Thanks for the patch. I applied it on Pregel.scala (in Spark 1.2.0 sources) and rebuilt Spark. During execution, at the 25th iteration of Pregel,

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-01-30 Thread Sonal Goyal
Is your code hitting frequent garbage collection? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Jan 30, 2015 at 7:52 PM, Yifan LI iamyifa...@gmail.com wrote: Hi, I am running my graphx application on Spark 1.2.0(11

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-01-30 Thread Yifan LI
Yes, I think so, esp. for a Pregel application… Any suggestions? Best, Yifan LI On 30 Jan 2015, at 22:25, Sonal Goyal sonalgoy...@gmail.com wrote: Is your code hitting frequent garbage collection? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co/

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-29 Thread Jay Hutfles
Just curious, is this set to be merged at some point? On Thu Jan 22 2015 at 4:34:46 PM Ankur Dave ankurd...@gmail.com wrote: At 2015-01-22 02:06:37 -0800, NicolasC nicolas.ch...@inria.fr wrote: I try to execute a simple program that runs the ShortestPaths algorithm

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-29 Thread Ankur Dave
Thanks for the reminder. I just created a PR: https://github.com/apache/spark/pull/4273 Ankur On Thu, Jan 29, 2015 at 7:25 AM, Jay Hutfles jayhutf...@gmail.com wrote: Just curious, is this set to be merged at some point?

Re: [GraphX] Integration with TinkerPop3/Gremlin

2015-01-26 Thread Nicolas Colson
TinkerPop has become an Apache Incubator project and seems to have Spark in mind in their proposal https://wiki.apache.org/incubator/TinkerPopProposal. That's good news! I hope there will be nice collaborations between the communities. On Wed, Jan 7, 2015 at 11:31 AM, Nicolas Colson

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-22 Thread Ankur Dave
At 2015-01-22 02:06:37 -0800, NicolasC nicolas.ch...@inria.fr wrote: I try to execute a simple program that runs the ShortestPaths algorithm (org.apache.spark.graphx.lib.ShortestPaths) on a small grid graph. I use Spark 1.2.0 downloaded from spark.apache.org. This program runs more than 2

RE: GraphX vs GraphLab

2015-01-13 Thread Buttler, David
would be if the AMP Lab or Databricks maintained a set of benchmarks on the web that showed how much each successive version of Spark improved. Dave From: Madabhattula Rajesh Kumar [mailto:mrajaf...@gmail.com] Sent: Monday, January 12, 2015 9:24 PM To: Buttler, David Subject: Re: GraphX vs

Re: [Graphx] which way is better to access faraway neighbors?

2014-12-05 Thread Ankur Dave
At 2014-12-05 02:26:52 -0800, Yifan LI iamyifa...@gmail.com wrote: I have a graph in where each vertex keep several messages to some faraway neighbours(I mean, not to only immediate neighbours, at most k-hops far, e.g. k = 5). now, I propose to distribute these messages to their

Re: GraphX Pregel halting condition

2014-12-04 Thread Ankur Dave
There's no built-in support for doing this, so the best option is to copy and modify Pregel to check the accumulator at the end of each iteration. This is robust and shouldn't be too hard, since the Pregel code is short and only uses public GraphX APIs. Ankur At 2014-12-03 09:37:01 -0800, Jay
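Ankur's suggestion can be sketched as a stripped-down Pregel loop with an extra halting predicate (for instance, one that reads an accumulator's value on the driver). This is a simplified sketch, not the real Pregel.scala: caching/unpersisting of intermediate graphs and the activeDirection optimization are omitted.

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Pregel-style loop that also stops when `shouldHalt` (e.g. an accumulator
// check on the driver) returns true at the end of an iteration.
def pregelWithHalt[VD: ClassTag, ED: ClassTag, A: ClassTag](
    graph: Graph[VD, ED], initialMsg: A, maxIter: Int)(
    vprog: (VertexId, VD, A) => VD,
    sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    mergeMsg: (A, A) => A,
    shouldHalt: () => Boolean): Graph[VD, ED] = {
  var g = graph.mapVertices((id, attr) => vprog(id, attr, initialMsg))
  var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
  var i = 0
  while (messages.count() > 0 && i < maxIter && !shouldHalt()) {
    g = g.joinVertices(messages)(vprog)        // apply vprog to messaged vertices
    messages = g.mapReduceTriplets(sendMsg, mergeMsg)
    i += 1
  }
  g
}
```

Since `shouldHalt` runs on the driver between iterations, reading an accumulator there is safe, unlike reading it inside a task.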

Re: GraphX / PageRank with edge weights

2014-11-18 Thread Ankur Dave
At 2014-11-13 21:28:52 +, Ommen, Jurgen omme0...@stthomas.edu wrote: I'm using GraphX and playing around with its PageRank algorithm. However, I can't see from the documentation how to use edge weight when running PageRank. Is this possible to consider edge weights and how would I do it?

Re: GraphX: Get edges for a vertex

2014-11-13 Thread Takeshi Yamamuro
Hi, I think that there are two solutions; 1. Invalid edges send Iterator.empty messages in sendMsg of the Pregel API. These messages make no effect on corresponding vertices. 2. Use GraphOps.(collectNeighbors/collectNeighborIds), not the Pregel API so as to handle active edge lists by
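Solution 2 above, sketched (for solution 1, sendMsg would simply return Iterator.empty on edges that should not participate):

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Gather each vertex's out-neighbor ids directly, bypassing the Pregel API.
def outNeighborIds[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED]): VertexRDD[Array[VertexId]] =
  g.collectNeighborIds(EdgeDirection.Out)
```

`collectNeighbors` works the same way but returns the neighbor attributes as well, at the cost of shipping them.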

Re: GraphX and Spark

2014-11-04 Thread Kamal Banga
GraphX is built on *top* of Spark, so Spark can achieve whatever GraphX can. On Wed, Nov 5, 2014 at 9:41 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Can Spark achieve whatever GraphX can? Keeping aside the performance comparison between Spark and GraphX, if I want to implement any

Re: GraphX StackOverflowError

2014-10-28 Thread Ankur Dave
At 2014-10-28 16:27:20 +0300, Zuhair Khayyat zuhair.khay...@gmail.com wrote: I am using the connected components function of GraphX (on Spark 1.0.2) on some graph. However, for some reason it fails with a StackOverflowError. The graph is not too big; it contains 1 vertices and 50 edges.

Re: graphx - mutable?

2014-10-14 Thread ll
hi again. just want to check in again to see if anyone could advise on how to implement a mutable, growing graph with graphx? we're building a graph that is growing over time. it adds more vertices and edges every iteration of our algorithm. it doesn't look like there is an obvious way to add a

Re: graphx - mutable?

2014-10-14 Thread Ankur Dave
On Tue, Oct 14, 2014 at 12:36 PM, ll duy.huynh@gmail.com wrote: hi again. just want to check in again to see if anyone could advise on how to implement a mutable, growing graph with graphx? we're building a graph that is growing over time. it adds more vertices and edges every iteration of

Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
thanks ankur. indexedrdd sounds super helpful! a related question, what is the best way to update the values of existing vertices and edges? On Tue, Oct 14, 2014 at 4:30 PM, Ankur Dave ankurd...@gmail.com wrote: On Tue, Oct 14, 2014 at 12:36 PM, ll duy.huynh@gmail.com wrote: hi again.

Re: graphx - mutable?

2014-10-14 Thread Ankur Dave
On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh duy.huynh@gmail.com wrote: a related question, what is the best way to update the values of existing vertices and edges? Many of the Graph methods deal with updating the existing values in bulk, including mapVertices, mapEdges, mapTriplets,
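A short sketch of those bulk-update operators (types and values are illustrative); each call returns a new Graph, which is the idiomatic way to "mutate" values in GraphX:

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Rewrite every vertex and edge value; structure (and indices) are reused.
def bumpAll(g: Graph[Double, Double]): Graph[Double, Double] =
  g.mapVertices((id, v) => v * 2.0) // update all vertex values
   .mapEdges(e => e.attr + 1.0)     // update all edge values

// Targeted updates from an RDD of (id, newValue) pairs via joinVertices;
// vertices absent from `updates` keep their old value.
def applyUpdates(g: Graph[Double, Double],
                 updates: RDD[(VertexId, Double)]): Graph[Double, Double] =
  g.joinVertices(updates)((id, oldV, newV) => newV)
```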

Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
great, thanks! On Tue, Oct 14, 2014 at 5:08 PM, Ankur Dave ankurd...@gmail.com wrote: On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh duy.huynh@gmail.com wrote: a related question, what is the best way to update the values of existing vertices and edges? Many of the Graph methods deal

Re: GraphX: Types for the Nodes and Edges

2014-10-07 Thread Oshi
Hi again, Thank you for your suggestion :) I've tried to implement this method but I'm stuck trying to union the payload before creating the graph. Below is a really simplified snippet of what has worked so far. //Reading the articles given in json format val articles =

Re: GraphX: Types for the Nodes and Edges

2014-10-01 Thread andy petrella
I'll try my best ;-). 1/ you could create an abstract type for the types (1 on top of Vs, 1 other on top of Es types), then use the subclasses as payload in your VertexRDD or in your Edge. Regarding storage and files, it doesn't really matter (unless you want to use the OOTB loading method, thus

Re: GraphX: Types for the Nodes and Edges

2014-10-01 Thread Oshi
Excellent! Thanks Andy. I will give it a go. On Thu, Oct 2, 2014 at 12:42 AM, andy petrella [via Apache Spark User List] ml-node+s1001560n15487...@n3.nabble.com wrote: I'll try my best ;-). 1/ you could create an abstract type for the types (1 on top of Vs, 1 other on top of Es types), then

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError GC overhead limit exceeded

2014-09-09 Thread Ankur Dave
At 2014-09-05 12:13:18 +0200, Yifan LI iamyifa...@gmail.com wrote: But how to assign the storage level to a new vertices RDD that mapped from an existing vertices RDD, e.g. *val newVertexRDD = graph.collectNeighborIds(EdgeDirection.Out).map{case(id:VertexId, a:Array[VertexId]) = (id,

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError GC overhead limit exceeded

2014-09-05 Thread Yifan LI
Thank you, Ankur! :) But how to assign the storage level to a new vertices RDD that mapped from an existing vertices RDD, e.g. *val newVertexRDD = graph.collectNeighborIds(EdgeDirection.Out).map{case(id:VertexId, a:Array[VertexId]) = (id, initialHashMap(a))}* the new one will be combined with

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError GC overhead limit exceeded

2014-09-03 Thread Yifan LI
Hi Ankur, Thanks so much for your advice. But it failed when I tried to set the storage level in constructing a graph. val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK)

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError GC overhead limit exceeded

2014-09-03 Thread Ankur Dave
At 2014-09-03 17:58:09 +0200, Yifan LI iamyifa...@gmail.com wrote: val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK) Error: java.lang.UnsupportedOperationException: Cannot
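A sketch of the load-time alternative: if I recall correctly, in Spark 1.1+ GraphLoader.edgeListFile accepts edgeStorageLevel and vertexStorageLevel parameters, so the storage level can be set before the graph is first cached rather than by calling .persist() afterwards (which throws the error above).

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.storage.StorageLevel

// Pass the target storage level at load time instead of persisting later.
def load(sc: SparkContext, path: String, numPartitions: Int): Graph[Int, Int] =
  GraphLoader.edgeListFile(sc, path,
      minEdgePartitions = numPartitions,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
    .partitionBy(PartitionStrategy.EdgePartition2D)
```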

Re: Graphx: undirected graph support

2014-08-28 Thread FokkoDriesprong
A bit like going from a linked list to a doubly linked list: it might introduce overhead in terms of memory usage, but you could use two directed edges to substitute for the undirected edge.
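A sketch of that suggestion: materialize both directions of every edge so a directed GraphX graph behaves like an undirected one (at roughly double the edge storage).

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Emit each edge twice, once per direction, and rebuild the graph.
def symmetrize[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]): Graph[VD, ED] =
  Graph(g.vertices,
        g.edges.flatMap(e => Iterator(e, Edge(e.dstId, e.srcId, e.attr))))
```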

Re: GraphX usecases

2014-08-25 Thread Ankur Dave
At 2014-08-25 11:23:37 -0700, Sunita Arvind sunitarv...@gmail.com wrote: Does this: "We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in

Re: GraphX question about graph traversal

2014-08-20 Thread glxc
I don't know if Pregel would be necessary since it's not iterative. You could filter the graph by looking at edge triplets, testing if source == B, dest == A, and edge value > 5.
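The triplet filter above might look like this, assuming (as an illustration) that vertex attributes are the type labels and edge attributes are integer scores:

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Keep only triplets whose source is type "B", destination is type "A",
// and whose edge score exceeds 5.
def matchingTriplets(g: Graph[String, Int]): RDD[EdgeTriplet[String, Int]] =
  g.triplets.filter(t => t.srcAttr == "B" && t.dstAttr == "A" && t.attr > 5)
```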

Re: GraphX question about graph traversal

2014-08-20 Thread Cesar Arevalo
Hey, thanks for your response. And I had seen the triplets, but I'm not quite sure how the triplets would get me that V1 is connected to V4. Maybe I need to spend more time understanding it, I guess. -Cesar On Wed, Aug 20, 2014 at 10:56 AM, glxc r.ryan.mcc...@gmail.com wrote: I don't know

Re: GraphX question about graph traversal

2014-08-20 Thread Ankur Dave
At 2014-08-20 10:34:50 -0700, Cesar Arevalo ce...@zephyrhealthinc.com wrote: I would like to get the type B vertices that are connected through type A vertices where the edges have a score greater than 5. So, from the example above I would like to get V1 and V4. It sounds like you're trying to

Re: GraphX question about graph traversal

2014-08-20 Thread Cesar Arevalo
Hi Ankur, thank you for your response. I already looked at the sample code you sent. And I think the modification you are referring to is on the tryMatch function of the PartialMatch class. I noticed you have a case in there that checks for a pattern match, and I think that's the code I need to

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError GC overhead limit exceeded

2014-08-18 Thread Ankur Dave
On Mon, Aug 18, 2014 at 6:29 AM, Yifan LI iamyifa...@gmail.com wrote: I am testing our application(similar to personalised page rank using Pregel, and note that each vertex property will need pretty much more space to store after new iteration) [...] But when we ran it on larger graph(e.g.

Re: GraphX Pagerank application

2014-08-15 Thread Ankur Dave
On Wed, Aug 6, 2014 at 11:37 AM, AlexanderRiggers alexander.rigg...@gmail.com wrote: To perform the page rank I have to create a graph object, adding the edges by setting sourceID=id and distID=brand. In GraphLab there is function: g = SGraph().add_edges(data, src_field='id',

Re:[GraphX] Can't zip RDDs with unequal numbers of partitions

2014-08-07 Thread Bin
OK, I think I've figured it out. It seems to be a bug which has been reported at: https://issues.apache.org/jira/browse/SPARK-2823 and https://github.com/apache/spark/pull/1763. As it says: If the users set “spark.default.parallelism” and the value is different from the EdgeRDD partition

Re: [GraphX] How spark parameters relate to Pregel implementation

2014-08-04 Thread Ankur Dave
At 2014-08-04 20:52:26 +0800, Bin wubin_phi...@126.com wrote: I wonder how spark parameters, e.g., degree of parallelism, affect Pregel performance? Specifically, sendmessage, mergemessage, and vertexprogram? I have tried label propagation on a 300,000 edges graph, and I found that no

Re: GraphX runs without Spark?

2014-08-03 Thread Deep Pradhan
We need to pass the URL only when we are using the interactive shell right? Now, I am not using the interactive shell, I am just doing ./bin/run-example.. when I am in the Spark directory. If not, Spark may be ignoring your single-node cluster and defaulting to local mode. What does this

Re: GraphX

2014-08-02 Thread Yifan LI
Try this: ./bin/run-example graphx.LiveJournalPageRank edge_list_file… On Aug 2, 2014, at 5:55 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am running Spark in a single node cluster. I am able to run the codes in Spark like SparkPageRank.scala, SparkKMeans.scala by the following

Re: GraphX

2014-08-02 Thread Ankur Dave
At 2014-08-02 21:29:33 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: How should I run graphx codes? At the moment it's a little more complicated to run the GraphX algorithms than the Spark examples due to SPARK-1986 [1]. There is a driver program in org.apache.spark.graphx.lib.Analytics

Re: [GraphX] how to compute only a subset of vertices in the whole graph?

2014-08-02 Thread Ankur Dave
At 2014-08-02 19:04:22 +0200, Yifan LI iamyifa...@gmail.com wrote: But I am thinking of if I can compute only some selected vertices (hubs), not do an update on every vertex… is it possible to do this using the Pregel API? The Pregel API already only runs vprog on vertices that received messages

Re: [GraphX] The best way to construct a graph

2014-08-01 Thread Ankur Dave
At 2014-08-01 11:23:49 +0800, Bin wubin_phi...@126.com wrote: I am wondering what is the best way to construct a graph? Say I have some attributes for each user, and specific weight for each user pair. The way I am currently doing is first read user information and edge triple into two

Re: Graphx : Perfomance comparison over cluster

2014-07-30 Thread Ankur Dave
ShreyanshB shreyanshpbh...@gmail.com writes: The version with in-memory shuffle is here: https://github.com/amplab/graphx2/commits/vldb. It'd be great if you can tell me how to configure and invoke this spark version. Sorry for the delay on this. Assuming you're planning to launch an EC2

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread Ankur Dave
Yifan LI iamyifa...@gmail.com writes: Maybe you could get the vertex, for instance, which id is 80, by using: graph.vertices.filter{case(id, _) = id==80}.collect but I am not sure this is the most efficient way (it will scan the whole table? if it can not get benefit from the index of
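The filter-based lookup from the message above, as a self-contained sketch; it scans the vertex partitions rather than using an index, which is why the question about efficiency is a fair one:

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Look up a single vertex by id with a full filter over the VertexRDD.
def vertexById[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED], vid: VertexId): Array[(VertexId, VD)] =
  g.vertices.filter { case (id, _) => id == vid }.collect()
```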

Re: GraphX Pragel implementation

2014-07-24 Thread Ankur Dave
On Thu, Jul 24, 2014 at 9:52 AM, Arun Kumar toga...@gmail.com wrote: While using pregel API for Iterations how to figure out which super step the iteration currently in. The Pregel API doesn't currently expose this, but it's very straightforward to modify Pregel.scala

Re: Graphx : Perfomance comparison over cluster

2014-07-23 Thread ShreyanshB
Thanks Ankur. The version with in-memory shuffle is here: https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has changed a lot since then, and the way to configure and invoke Spark is different. I can send you the correct configuration/invocation for this if you're interested in

Re: Graphx : Perfomance comparison over cluster

2014-07-20 Thread Ankur Dave
On Fri, Jul 18, 2014 at 9:07 PM, ShreyanshB shreyanshpbh...@gmail.com wrote: Does the suggested version with in-memory shuffle affects performance too much? We've observed a 2-3x speedup from it, at least on larger graphs like twitter-2010 (http://law.di.unimi.it/webdata/twitter-2010/) and

Re: GraphX Pragel implementation

2014-07-18 Thread Arun Kumar
Thanks On Fri, Jul 18, 2014 at 12:22 AM, Ankur Dave ankurd...@gmail.com wrote: If your sendMsg function needs to know the incoming messages as well as the vertex value, you could define VD to be a tuple of the vertex value and the last received message. The vprog function would then store

Re: Graphx : Perfomance comparison over cluster

2014-07-18 Thread Ankur Dave
Thanks for your interest. I should point out that the numbers in the arXiv paper are from GraphX running on top of a custom version of Spark with an experimental in-memory shuffle prototype. As a result, if you benchmark GraphX at the current master, it's expected that it will be 2-3x slower than

Re: Graphx : Perfomance comparison over cluster

2014-07-18 Thread ShreyanshB
Thanks a lot Ankur. The version with in-memory shuffle is here: https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has changed a lot since then, and the way to configure and invoke Spark is different. I can send you the correct configuration/invocation for this if you're

Re: GraphX Pragel implementation

2014-07-17 Thread Ankur Dave
If your sendMsg function needs to know the incoming messages as well as the vertex value, you could define VD to be a tuple of the vertex value and the last received message. The vprog function would then store the incoming messages into the tuple, allowing sendMsg to access them. For example, if
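A sketch of the tuple encoding described above (all names and types are illustrative): the vertex attribute pairs the real value with the last received message, so sendMsg, which only sees triplets, can read both.

```scala
import org.apache.spark.graphx._

object LastMsgExample {
  type Msg = Double
  type VD = (Double, Msg) // (vertex value, last received message)

  // vprog stores the incoming message into the tuple.
  def vprog(id: VertexId, attr: VD, msg: Msg): VD =
    (attr._1 + msg, msg)

  // sendMsg can now use both the value and the stored last message.
  def sendMsg(t: EdgeTriplet[VD, Int]): Iterator[(VertexId, Msg)] =
    Iterator((t.dstId, t.srcAttr._1 + t.srcAttr._2))

  def mergeMsg(a: Msg, b: Msg): Msg = a + b
}
```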

Re: Graphx traversal and merge interesting edges

2014-07-14 Thread HHB
Hi Ankur, FYI - in a naive attempt to enhance your solution, I managed to create MergePatternPath. I think it works in the expected way (at least for the traversal problem in the last email). I modified your code a bit. Also, instead of EdgePattern I used a List of functions that match the whole edge

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread Ankur Dave
On Fri, Jul 11, 2014 at 2:23 PM, ShreyanshB shreyanshpbh...@gmail.com wrote: -- Is it a correct way to load file to get best performance? Yes, edgeListFile should be efficient at loading the edges. -- What should be the partition size? =computing node or =cores? In general it should be a

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Thanks a lot Ankur, I'll follow that. A last quick one: Does that error affect performance? ~Shreyansh

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread Ankur Dave
I don't think it should affect performance very much, because GraphX doesn't serialize ShippableVertexPartition in the fast path of mapReduceTriplets execution (instead it calls ShippableVertexPartition.shipVertexAttributes and serializes the result). I think it should only get serialized for

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Great! Thanks a lot. Hate to say this, but I promise this is the last quickie. I looked at the configurations but I didn't find any parameter to tune for network bandwidth, i.e. is there any way to tell graphx (spark) that I'm using a 1G network or a 10G network or InfiniBand? Does it figure it out on its

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread Ankur Dave
Spark just opens up inter-slave TCP connections for message passing during shuffles (I think the relevant code is in ConnectionManager). Since TCP automatically determines the optimal sending rate (http://en.wikipedia.org/wiki/TCP_congestion-avoidance_algorithm), Spark doesn't need any

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Perfect! Thanks Ankur.

Re: GraphX: how to specify partition strategy?

2014-07-10 Thread Ankur Dave
On Thu, Jul 10, 2014 at 8:20 AM, Yifan LI iamyifa...@gmail.com wrote: - how to build the latest version of Spark from the master branch, which contains a fix? Instead of downloading a prebuilt Spark release from http://spark.apache.org/downloads.html, follow the instructions under Development

Re: Graphx traversal and merge interesting edges

2014-07-08 Thread HHB
Hi Ankur, I was trying out the PatternMatcher. It works for shorter paths, but I see that for the longer ones it continues to run forever... Here's what I am trying: https://gist.github.com/hihellobolke/dd2dc0fcebba485975d1 (the example of 3 share traders transacting in appl shares). The first

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-07 Thread Koert Kuipers
you could only do the deep check if the hashcodes are the same and design hashcodes that do not take all elements into account. the alternative seems to be putting cache statements all over graphx, as is currently the case, which is trouble for any long lived application where caching is

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-06 Thread Koert Kuipers
probably a dumb question, but why is reference equality used for the indexes? On Sun, Jul 6, 2014 at 12:43 AM, Ankur Dave ankurd...@gmail.com wrote: When joining two VertexRDDs with identical indexes, GraphX can use a fast code path (a zip join without any hash lookups). However, the check

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-06 Thread Ankur Dave
Well, the alternative is to do a deep equality check on the index arrays, which would be somewhat expensive since these are pretty large arrays (one element per vertex in the graph). But, in case the reference equality check fails, it actually might be a good idea to do the deep check before

Re: Graphx traversal and merge interesting edges

2014-07-05 Thread Ankur Dave
Interesting problem! My understanding is that you want to (1) find paths matching a particular pattern, and (2) add edges between the start and end vertices of the matched paths. For (1), I implemented a pattern matcher for GraphX

Re: Graphx traversal and merge interesting edges

2014-07-05 Thread HHB
Thanks Ankur, cannot thank you enough for this!!! I am still reading your example, digesting and grokking it :-) I was breaking my head over this for the past few hours. In my last futile attempts I was looking at Pregel... e.g. if that could be used to see at what step

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-05 Thread Koert Kuipers
thanks for replying. why is joining two vertexrdds without caching slow? what is recomputed unnecessarily? i am not sure what is different here from joining 2 regular RDDs (where nobody seems to recommend to cache before joining i think...) On Thu, Jul 3, 2014 at 10:52 PM, Ankur Dave

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-05 Thread Ankur Dave
When joining two VertexRDDs with identical indexes, GraphX can use a fast code path (a zip join without any hash lookups). However, the check for identical indexes is performed using reference equality. Without caching, two copies of the index are created. Although the two indexes are

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-03 Thread Ankur Dave
A common reason for the "Joining ... is slow" message is that you're joining VertexRDDs without having cached them first. This will cause Spark to recompute unnecessarily, and as a side effect, the same index will get created twice and GraphX won't be able to do an efficient zip join. For example,
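A minimal sketch of the caching advice (the derivation chosen here is illustrative): cache the parent graph so both derived VertexRDDs share one index, letting the join take the fast zip-join path instead of a hash join.

```scala
import org.apache.spark.graphx._

// Without g.cache(), each derivation below can rebuild the vertex index,
// defeating the reference-equality check on indexes.
def weightByDegree(g: Graph[Double, Int]): VertexRDD[Double] = {
  g.cache()
  val degrees: VertexRDD[Int] = g.degrees   // shares g's vertex index
  g.vertices.innerJoin(degrees)((id, value, deg) => value * deg)
}
```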

Re: [GraphX] Cast error when comparing a vertex attribute after its type has changed

2014-06-27 Thread Pierre-Alexandre Fonta
Thanks for having corrected this bug! The fix version is marked as 1.1.0 (SPARK-1552, https://issues.apache.org/jira/browse/SPARK-1552). I have tested my code snippet with Spark 1.0.0 (Scala 2.10.4) and it works. I don't know if it's important to mention it. Pierre-Alexandre

Re: Graphx SubGraph

2014-06-24 Thread Ankur Dave
Yes, the subgraph operator takes a vertex predicate and keeps only the edges where both vertices satisfy the predicate, so it will work as long as you can express the sublist in terms of a vertex predicate. If that's not possible, you can still obtain the same effect, but you'll have to use
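The vertex-predicate form might be sketched as follows (a whitelist of vertex ids is one way to express "the sublist"); subgraph keeps an edge only if both endpoints satisfy the predicate:

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Restrict the graph to a set of vertex ids; dangling edges are dropped.
def restrictTo[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED], keep: Set[VertexId]): Graph[VD, ED] =
  g.subgraph(vpred = (id, _) => keep.contains(id))
```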

RE: GraphX partition problem

2014-05-28 Thread Zhicharevich, Alex
below? Can you advise on how to solve this issue? Thanks, Alex From: Ankur Dave [mailto:ankurd...@gmail.com] Sent: Thursday, May 22, 2014 6:59 PM To: user@spark.apache.orgmailto:user@spark.apache.org Subject: Re: GraphX partition problem The fix will be included in Spark 1.0, but if you just want

Re: GraphX partition problem

2014-05-28 Thread Ankur Dave
I've been trying to reproduce this but I haven't succeeded so far. For example, on the web-Google graph (https://snap.stanford.edu/data/web-Google.html), I get the expected results both on v0.9.1-handle-empty-partitions and on master: // Load web-Google and run connected components import

RE: GraphX partition problem

2014-05-26 Thread Zhicharevich, Alex
I’m not sure about 1.2TB, but I can give it a shot. Is there some way to persist intermediate results to disk? Does the whole graph have to be in memory? Alex From: Ankur Dave [mailto:ankurd...@gmail.com] Sent: Monday, May 26, 2014 12:23 AM To: user@spark.apache.org Subject: Re: GraphX partition

RE: GraphX partition problem

2014-05-26 Thread Zhicharevich, Alex
Can we do better with Bagel somehow? Control how we store the graph? From: Ankur Dave [mailto:ankurd...@gmail.com] Sent: Monday, May 26, 2014 12:13 PM To: user@spark.apache.org Subject: Re: GraphX partition problem GraphX only performs sequential scans over the edges, so we could in theory

RE: GraphX partition problem

2014-05-25 Thread Zhicharevich, Alex
The fix will be included in Spark 1.0, but if you just want to apply the fix to 0.9.1, here's a hotfixed version of 0.9.1 that only includes PR #367: https://github.com/ankurdave/spark/tree/v0.9.1-handle-empty-partitions. You can clone and build

Re: GraphX partition problem

2014-05-25 Thread Ankur Dave
Once the graph is built, edges are stored in parallel primitive arrays, so each edge should only take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). Unfortunately, the current implementation in EdgePartitionBuilder uses an array of Edge objects as an intermediate representation for

Re: GraphX vertices and connected edges

2014-05-02 Thread Ankur Dave
Do you mean you want to obtain a list of adjacent edges for every vertex? A mapReduceTriplets followed by a join is the right way to do this. The join will be cheap because the original and derived vertices will share indices. There's a built-in function to do this for neighboring vertex

Re: GraphX. How to remove vertex or edge?

2014-05-01 Thread Daniel Darabos
Graph.subgraph() allows you to apply a filter to edges and/or vertices. On Thu, May 1, 2014 at 8:52 AM, Николай Кинаш peroksi...@gmail.com wrote: Hello. How to remove vertex or edges from graph in GraphX?

Re: GraphX: .edges.distinct().count() is 10?

2014-04-23 Thread Daniel Darabos
This is caused by https://issues.apache.org/jira/browse/SPARK-1188. I think the fix will be in the next release. But until then, do: g.edges.map(_.copy()).distinct.count On Wed, Apr 23, 2014 at 2:26 AM, Ryan Compton compton.r...@gmail.comwrote: Try this:

Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Tom Vacek
Here are some out-of-the-box ideas: If the elements lie in a fairly small range and/or you're willing to work with limited precision, you could use counting sort. Moreover, you could iteratively find the median using bisection, which would be associative and commutative. It's easy to think of

Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Ryan Compton
Whoops, I should have mentioned that it's a multivariate median (cf http://www.pnas.org/content/97/4/1423.full.pdf ). It's easy to compute when all the values are accessible at once. I'm not sure it's possible with a combiner. So, I guess the question should be: Can I use GraphX's Pregel without a

Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Ankur Dave
If you need access to all message values in vprog, there's nothing wrong with building up an array in mergeMsg (option #1). This is what org.apache.spark.graphx.lib.TriangleCount does, though with sets instead of arrays. There will be a performance penalty because of the communication, but it
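Option #1 might be sketched like this in the scalar case (the multivariate median from the thread is not reproduced here; `median` below is illustrative user code, not a GraphX API):

```scala
import org.apache.spark.graphx._

object ArrayMergeExample {
  // Illustrative scalar median; the thread actually concerns a multivariate one.
  def median(xs: Array[Double]): Double = { val s = xs.sorted; s(s.length / 2) }

  // mergeMsg concatenates, so vprog sees every incoming value at once,
  // analogous to what TriangleCount does with sets.
  def mergeMsg(a: Array[Double], b: Array[Double]): Array[Double] = a ++ b

  def vprog(id: VertexId, attr: Double, msgs: Array[Double]): Double =
    if (msgs.isEmpty) attr else median(msgs)
}
```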

Re: GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ankur Dave
I wasn't able to reproduce this with a small test file, but I did change the file parsing to use x(1).toLong instead of x(2).toLong. Did you mean to take the third column rather than the second? If so, would you mind posting a larger sample of the file, or even the whole file if possible? Here's

Re: GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt This code: println(g.numEdges) println(g.numVertices) println(g.edges.distinct().count()) gave me 1 9294 2 On Tue, Apr 22, 2014 at 5:14 PM, Ankur Dave ankurd...@gmail.com wrote: I wasn't able to reproduce this

Re: [GraphX] Cast error when comparing a vertex attribute after its type has changed

2014-04-21 Thread Ankur Dave
On Fri, Apr 11, 2014 at 4:42 AM, Pierre-Alexandre Fonta pierre.alexandre.fonta+sp...@gmail.com wrote: Testing in mapTriplets if a vertex attribute, which is defined as Integer in first VertexRDD but has been changed after to Double by mapVertices, is greater than a number throws
