Re: how to create a Graph in GraphX?

2014-11-11 Thread ankurdave
You should be able to construct the edges in a single map() call without
using collect():

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val edges: RDD[Edge[String]] = sc.textFile(...).map { line =>
  val row = line.split(",")
  // Source and destination IDs must be VertexId (Long), so parse them
  Edge(row(0).toLong, row(1).toLong, row(2))
}
val graph: Graph[Int, String] = Graph.fromEdges(edges, defaultValue = 1)
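
For a quick sanity check on the resulting graph:

println(s"vertices: ${graph.numVertices}, edges: ${graph.numEdges}")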






Re: counting degrees graphx

2014-05-25 Thread ankurdave
Sorry, I missed vertex 6 in that example. It should be [{1}, {1}, {1}, {1},
{1, 6}, {6}, {7}, {7}].





Re: Benchmarking Graphx

2014-05-19 Thread ankurdave
On May 17, 2014 at 2:59pm, Hari wrote:
> a) Is there a way to get the total time taken for the execution from
> start to finish?
Assuming you're running the benchmark as a standalone program, such as by
invoking the Analytics driver, you could wrap the driver invocation using
time:
/usr/bin/time -p ./bin/spark-submit ...
If you're using spark-shell, you could use System.currentTimeMillis.
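For instance, a minimal timing sketch for spark-shell (the PageRank call is
just an illustrative workload, assuming a graph val is already in scope):

val start = System.currentTimeMillis
// Force execution with an action so the work is actually measured
val numRanks = graph.pageRank(0.0001).vertices.count()
println(s"Took ${System.currentTimeMillis - start} ms")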
> b) log4j properties need to be modified to turn off logging, but it's
> not clear how to.
Create conf/log4j.properties by copying conf/log4j.properties.template and
changing the first line to
log4j.rootCategory=WARN, console
> c) how can this be extended to a cluster?
It should work to simply invoke the driver on the cluster using
spark-submit. If you aren't using the Analytics driver, make sure to set
the same Spark properties as it does (spark.serializer,
spark.kryo.registrator, and spark.locality.wait).
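For example, a sketch of passing those properties on the command line; the
registrator is GraphX's, but the locality value here is only illustrative:

./bin/spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator \
  --conf spark.locality.wait=100000 \
  ...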
> d) also how to quantify memory overhead if I added more functionality
> to the execution?
You can see how much memory each cached RDD is taking up by looking at the
web UI.
> e) any scripts? reports generated?
We don't have well-supported benchmark scripts for GraphX yet. Dan
Crankshaw has some personal-use scripts for setting up GraphX and competing
graph systems on a cluster and running some benchmarks. You could look at
those for some ideas.
There are benchmarks from earlier this year in the GraphX arXiv paper.
These are on the soc-LiveJournal and twitter-2010 datasets.





Re: Variables outside of mapPartitions scope

2014-05-13 Thread ankurdave
In general, you can find out exactly what's not serializable by adding
-Dsun.io.serialization.extendedDebugInfo=true to SPARK_JAVA_OPTS.
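For example, a sketch assuming you launch through the standard scripts,
which read SPARK_JAVA_OPTS from the environment (e.g. in conf/spark-env.sh):

export SPARK_JAVA_OPTS="-Dsun.io.serialization.extendedDebugInfo=true"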
Since a reference to the enclosing class (this) is often what's causing the
problem, a general workaround is to move the mapPartitions call to a static
method where there is no this reference. That is, transform this:
class A {
  def f() = rdd.mapPartitions(iter => ...)
}
into this:
class A {
  def f() = A.helper(rdd)
}

object A {
  def helper(rdd: RDD[...]) =
    rdd.mapPartitions(iter => ...)
}





Re: Caching in graphX

2014-05-13 Thread ankurdave
Unfortunately it's very difficult to get uncaching right with GraphX due to
the complicated internal dependency structure that it creates. It's
necessary to know exactly what operations you're doing on the graph in order
to unpersist correctly (i.e., in a way that avoids recomputation).

I have a pull request (https://github.com/apache/spark/pull/497) that may
make this a bit easier, but your best option is to use the Pregel API for
iterative algorithms if possible.
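
For instance, a minimal Pregel sketch (a connected-components-style
computation, assuming vertex attributes were initialized to the vertex IDs,
e.g. with graph.mapVertices((id, _) => id)); Pregel takes care of caching
and uncaching the intermediate graphs and message RDDs internally:

import scala.reflect.ClassTag
import org.apache.spark.graphx._

def components[ED: ClassTag](g: Graph[VertexId, ED]): Graph[VertexId, ED] =
  Pregel(g, initialMsg = Long.MaxValue)(
    // Vertex program: keep the smallest component ID seen so far
    (id, attr, msg) => math.min(attr, msg),
    // Send the smaller ID across any edge whose endpoints disagree
    et =>
      if (et.srcAttr < et.dstAttr) Iterator((et.dstId, et.srcAttr))
      else if (et.dstAttr < et.srcAttr) Iterator((et.srcId, et.dstAttr))
      else Iterator.empty,
    // Merge incoming messages by taking the minimum
    (a, b) => math.min(a, b)
  )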

If that's not possible, leaving things cached has actually not been very
costly in my experience, at least as long as VD and ED are primitive types
to reduce the load on the garbage collector.

Ankur





Re: Is there any problem on the spark mailing list?

2014-05-11 Thread ankurdave
I haven't been getting mail either. This was the last message I received:
http://apache-spark-user-list.1001560.n3.nabble.com/master-attempted-to-re-register-the-worker-and-then-took-all-workers-as-unregistered-tp553p5491.html





Re: sample data for pagerank?

2014-03-18 Thread ankurdave
The examples in graphx/data are meant to show the input data format, but if
you want to play around with larger and more interesting datasets, we've
been using the following ones, among others:

- SNAP's web-Google dataset (5M edges):
https://snap.stanford.edu/data/web-Google.html
- SNAP's soc-LiveJournal1 dataset (69M edges):
https://snap.stanford.edu/data/soc-LiveJournal1.html

These come in edge list format and, after decompression, can directly be
loaded using GraphLoader.
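
For example, a minimal sketch (the local path is illustrative):

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "data/web-Google.txt")
val ranks = graph.pageRank(0.0001).vertices
ranks.take(5).foreach(println)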

Ankur





Re: Are there any plans to develop Graphx Streaming?

2014-03-18 Thread ankurdave
Yes, Joey Gonzalez and I are working on a streaming version of GraphX. It's
not usable yet, but we will announce when an alpha is ready, likely in a few
months.

Ankur





Re: There is an error in Graphx

2014-03-18 Thread ankurdave
> The workaround is to force a copy using graph.triplets.map(_.copy()).

Sorry, this actually won't copy the entire triplet, only the attributes
defined in Edge. The right workaround is to copy the EdgeTriplet explicitly:

graph.triplets.map { et =>
  // Replace VD and ED with the correct vertex and edge attribute types
  val et2 = new EdgeTriplet[VD, ED]
  et2.srcId = et.srcId
  et2.dstId = et.dstId
  et2.attr = et.attr
  et2.srcAttr = et.srcAttr
  et2.dstAttr = et.dstAttr
  et2
}





Re: There is an error in Graphx

2014-03-18 Thread ankurdave
This problem occurs because graph.triplets generates an iterator that reuses
the same EdgeTriplet object for every triplet in the partition. The
workaround is to force a copy using graph.triplets.map(_.copy()).

The solution in the AMPCamp tutorial is mistaken -- I'm not sure if that
ever worked.


