unsubscribe

2019-06-24 Thread Dave Moyers

RE: Spark on Kubernetes - log4j.properties not read

2019-06-11 Thread Dave Jaffe
That did the trick, Abhishek! Thanks for the explanation, that answered a lot of questions I had. Dave

Spark on Kubernetes - log4j.properties not read

2019-06-10 Thread Dave Jaffe

Re: OutOfDirectMemoryError for Spark 2.2

2018-03-12 Thread Dave Cameron
(quoted heap-histogram excerpt) 917: 8 448 io.netty.buffer.UnpooledHeapByteBuf | 1018: 20 320 io.netty.buffer.PoolThreadCache$1 | 1305: 4 128 io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry | 1404

[Structured Streaming] Commit protocol to move temp files to dest path only when complete, with code

2018-02-09 Thread Dave Cameron
(at the moment not up to Spark standards, admittedly): https://github.com/davcamer/spark/commit/361f1c69851f0f94cfd974ce720c694407f9340b Did I miss a better approach? Does anyone else think this would be useful? Thanks for reading, Dave

Re: Does Random Forest in spark ML supports multi label classification in scala

2017-11-07 Thread Dave Moyers
Yes, see https://dzone.com/articles/predictive-analytics-with-spark-ml Although the example uses two labels, the same approach supports multiple labels. Sent from my iPad > On Nov 7, 2017, at 6:30 AM, HARSH TAKKAR wrote: > > Hi > > Does Random Forest in spark Ml

Re: Running stress tests on spark cluster to avoid wild-goose chase later

2016-11-15 Thread Dave Jaffe
Mich- Sparkperf from Databricks (https://github.com/databricks/spark-perf) is a good stress test, covering a wide range of Spark functionality but especially ML. I’ve tested it with Spark 1.6.0 on CDH 5.7. It may need some work for Spark 2.0. Dave Jaffe BigData Performance VMware dja

Re: Anomalous Spark RDD persistence behavior

2016-11-08 Thread Dave Jaffe
No, I am not using serialization with either memory or disk. Dave Jaffe VMware dja...@vmware.com From: Shreya Agarwal <shrey...@microsoft.com> Date: Monday, November 7, 2016 at 3:29 PM To: Dave Jaffe <dja...@vmware.com>, "user@spark.apache.org" <user@spark.apache.org>

Anomalous Spark RDD persistence behavior

2016-11-07 Thread Dave Jaffe
to disk take more memory than caching to memory? Is this behavior expected as dataset size exceeds available memory? Thanks in advance, Dave Jaffe Big Data Performance VMware dja...@vmware.com

Spark connecting to Hive in another EMR cluster

2016-06-24 Thread Dave Maughan
ql just to force it to connect/configure correctly? Thanks, Dave

Re: Spark SQL - Encoders - case class

2016-06-06 Thread Dave Maughan
types will be added in future. Thanks - Dave

Spark SQL - Encoders - case class

2016-06-06 Thread Dave Maughan
he equivalent in Scala if Table1 is a case class. Could someone please point me in the right direction? Thanks - Dave
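A minimal sketch of the Scala side, assuming a Spark 1.6-era SQLContext and a hypothetical Table1 schema; the implicits import supplies the case-class Encoder:

    import org.apache.spark.sql.SQLContext

    // Hypothetical schema standing in for the poster's Table1
    case class Table1(id: Long, name: String)

    val sqlContext: SQLContext = ???   // e.g. new SQLContext(sc)
    import sqlContext.implicits._      // brings the case-class Encoder into scope

    // Read a DataFrame and view it as a typed Dataset[Table1]
    val ds = sqlContext.read.parquet("/path/to/table1").as[Table1]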

Re: spark 1.6.0 connect to hive metastore

2016-03-09 Thread Dave Maughan
-site.xml the exception does not occur which implies that it WAS on the classpath... Dave On Tue, 9 Feb 2016 at 22:26 Koert Kuipers <ko...@tresata.com> wrote: > i do not have phoenix, but i wonder if its something related. will check > my classpaths > > On Tue, Feb 9, 2016 at 5:0

Re: Spark Job Hanging on Join

2016-02-23 Thread Dave Moyers
ng fast afterwards :) > > On Feb 22, 2016 21:24, "Dave Moyers" <davemoy...@icloud.com> wrote: >> Good article! Thanks for sharing! >> >> >> > On Feb 22, 2016, at 11:10 AM, Davies Liu <dav...@databricks.com> wrote: >> > >&

Re: Spark Job Hanging on Join

2016-02-22 Thread Dave Moyers
Good article! Thanks for sharing! > On Feb 22, 2016, at 11:10 AM, Davies Liu wrote: > > This link may help: > https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html > > Spark 1.6 had improved the CartesianProduct, you should

Re: spark-xml can't recognize schema

2016-02-21 Thread Dave Moyers
Make sure the xml input file is well formed (check your end tags). Sent from my iPhone > On Feb 21, 2016, at 8:14 AM, Prathamesh Dharangutte > wrote: > > This is the code I am using for parsing xml file: > > > > import org.apache.spark.{SparkConf,SparkContext} >

Re: Spark Job Hanging on Join

2016-02-20 Thread Dave Moyers
Try this setting in your Spark defaults: spark.sql.autoBroadcastJoinThreshold=-1 I had a similar problem with joins hanging and that resolved it for me. You might be able to pass that value from the driver as a --conf option, but I have not tried that, and not sure if that will work. Sent
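A minimal sketch of applying that setting programmatically, assuming a Spark 1.6-era setup; the same key can also be passed with --conf at submit time:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // -1 disables automatic broadcast joins entirely
    val conf = new SparkConf()
      .setAppName("JoinWithoutAutoBroadcast")
      .set("spark.sql.autoBroadcastJoinThreshold", "-1")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)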

Re: How to use scala.math.Ordering in java

2016-01-21 Thread Dave
)); I can't determine how to create the correct object for parameter 4 which is "scala.math.Ordering evidence$1" from the documentation. From the scala.math.Ordering code I see there are many implicit objects and one handles Strings. How can I access them from Java?
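For reference, the ready-made orderings live on the scala.math.Ordering companion object; a short sketch, with the Java access path (which follows Scala's standard object encoding) noted in the comment:

    // In Scala the String ordering is simply:
    val byString: Ordering[String] = Ordering.String

    // From Java the same singleton is reachable as
    //   scala.math.Ordering.String$.MODULE$
    // and can be passed wherever an Ordering<String> parameter is expected.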

Re: RDD immutability

2016-01-19 Thread Dave
Thanks Sean. On 19/01/16 13:36, Sean Owen wrote: It's a good question. You can easily imagine an RDD of classes that are mutable. Yes, if you modify these objects, the result is pretty undefined, so don't do that. On Tue, Jan 19, 2016 at 12:27 PM, Dave <dave.davo...@gmail.com> wrote: Hi

Re: RDD immutability

2016-01-19 Thread Dave
Hi Marco, Yes, that answers my question. I just wanted to be sure as the API gave me write access to the immutable data which means its up to the developer to know not to modify the input parameters for these API's. Thanks for the response. Dave. On 19/01/16 12:25, Marco wrote: Hello, RDD

Re: Kafka Streaming and partitioning

2016-01-13 Thread Dave
. Thanks, Dave. On 13/01/16 16:21, Cody Koeninger wrote: If two rdds have an identical partitioner, joining should not involve a shuffle. You should be able to override the partitioner without calling partitionBy. Two ways I can think of to do this: - subclass or modify the direct stream

Re: Kafka Streaming and partitioning

2016-01-13 Thread Dave
rdd) wrapped.join(reference) } In which case it will run through the partitioner of the wrapped RDD when it arrives in the cluster for the first time i.e. no shuffle. Thanks, Dave. On 13/01/16 17:00, Cody Koeninger wrote: In the case here of a kafkaRDD, the data doesn't reside on the cluster, i

Best way to use Spark UDFs via Hive (Spark Thrift Server)

2015-10-22 Thread Dave Moyers
the udf's in their queries against the saved tables. Is there a way to register udf's such that they can be used within both a Spark job and in a Hive connection? Thanks! Dave Sent from my iPad
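A minimal sketch of one half of this, assuming a HiveContext-based job: functions registered this way are visible to Spark SQL within the same application, while Thrift Server sessions see whatever is registered in the server's own context (the UDF and table names below are hypothetical):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext: HiveContext = ???   // e.g. new HiveContext(sc)

    // Register a Scala function as a SQL UDF (hypothetical name)
    hiveContext.udf.register("to_upper", (s: String) => s.toUpperCase)

    // Usable from SQL against a saved table (hypothetical table name)
    hiveContext.sql("SELECT to_upper(name) FROM saved_table")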

Kafka Streaming and Filtering > 3000 partitions

2015-10-21 Thread Dave Ariens
Hey folks, I have a very large number of Kafka topics (many thousands of partitions) that I want to consume, filter based on topic-specific filters, then produce back to filtered topics in Kafka. Using the receiver-less approach with Spark 1.4.1 (described

RE: Kafka Streaming and Filtering > 3000 partitions

2015-10-21 Thread Dave Ariens
.@koeninger.org] Sent: Wednesday, October 21, 2015 3:01 PM To: Dave Ariens Cc: user@spark.apache.org Subject: Re: Kafka Streaming and Filtering > 3000 partitons The rdd partitions are 1:1 with kafka topicpartitions, so you can use offsets ranges to figure out which topic a given rdd pa
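A minimal sketch of that suggestion, assuming a Spark 1.4-era direct stream of (key, value) pairs named stream; keep is a hypothetical per-topic predicate:

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    // Hypothetical per-topic filter standing in for the poster's logic
    def keep(topic: String, value: String): Boolean = value.nonEmpty

    stream.foreachRDD { rdd =>
      // Partition i of a KafkaRDD corresponds 1:1 to offsetRanges(i)
      val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val filtered = rdd.mapPartitionsWithIndex { (i, iter) =>
        val topic = ranges(i).topic
        iter.filter { case (_, value) => keep(topic, value) }
      }
      // ... produce `filtered` back to the filtered topics ...
    }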

Re: creating a distributed index

2015-07-15 Thread Ankur Dave
The latest version of IndexedRDD supports any key type with a defined serializer https://github.com/amplab/spark-indexedrdd/blob/master/src/main/scala/edu/berkeley/cs/amplab/spark/indexedrdd/KeySerializer.scala, including Strings. It's not released yet, but you can use it from the master branch if

RE: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-29 Thread Dave Ariens
/YarnSparkHadoopUtil.scala: val dstFs = dst.getFileSystem(conf) From: Steve Loughran [mailto:ste...@hortonworks.com] Sent: Sunday, June 28, 2015 10:34 AM To: Tim Chen Cc: Marcelo Vanzin; Dave Ariens; Olivier Girardot; user@spark.apache.org Subject: Re: Accessing Kerberos Secured HDFS

RE: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-29 Thread Dave Ariens
, thanks everyone. From: Steve Loughran [mailto:ste...@hortonworks.com] Sent: Monday, June 29, 2015 10:32 AM To: Dave Ariens Cc: Tim Chen; Marcelo Vanzin; Olivier Girardot; user@spark.apache.org Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos On 29 Jun 2015, at 14:18, Dave

Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
support... Thanks, Dave

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
Chen Cc: Olivier Girardot; Dave Ariens; user@spark.apache.org Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos On Fri, Jun 26, 2015 at 1:13 PM, Tim Chen t...@mesosphere.io wrote: So correct me if I'm wrong, sounds like all you need is a principal

RE: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
help would be appreciated! From: Timothy Chen [mailto:t...@mesosphere.io] Sent: Friday, June 26, 2015 12:50 PM To: Dave Ariens Cc: user@spark.apache.org Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos Hi Dave, I don't understand Keeberos much but if you know the exact

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
PM To: Dave Ariens Cc: Tim Chen; Olivier Girardot; user@spark.apache.org Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos On Fri, Jun 26, 2015 at 3:09 PM, Dave Ariens dari...@blackberry.com wrote: Would there be any way to have the task

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
in the slaves call the UGI login with a principal/keytab provided to the driver? From: Marcelo Vanzin Sent: Friday, June 26, 2015 5:28 PM To: Tim Chen Cc: Olivier Girardot; Dave Ariens; user@spark.apache.org Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos On Fri, Jun 26

Re: Effecient way to fetch all records on a particular node/partition in GraphX

2015-05-17 Thread Ankur Dave
If you know the partition IDs, you can launch a job that runs tasks on only those partitions by calling sc.runJob https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1686. For example, we do this in IndexedRDD
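A minimal sketch of the sc.runJob pattern described above; sampleFromPartitions is a hypothetical helper that pulls a few elements from just the named partitions:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Runs tasks only on the listed partitions; returns one result per partition
    def sampleFromPartitions(sc: SparkContext, rdd: RDD[String],
                             partitionIds: Seq[Int]): Array[Seq[String]] =
      sc.runJob(rdd, (iter: Iterator[String]) => iter.take(10).toSeq, partitionIds)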

Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Ankur Dave
I'm the primary author of IndexedRDD. To answer your questions: 1. Operations on an IndexedRDD partition can only be performed from a task operating on that partition, since doing otherwise would require decentralized coordination between workers, which is difficult in Spark. If you want to

Re: [GraphX] aggregateMessages with active set

2015-04-09 Thread Ankur Dave
Actually, GraphX doesn't need to scan all the edges, because it maintains a clustered index on the source vertex id (that is, it sorts the edges by source vertex id and stores the offsets in a hash table). If the activeDirection is appropriately set, it can then jump only to the clusters with

Re: [GraphX] aggregateMessages with active set

2015-04-07 Thread Ankur Dave
We thought it would be better to simplify the interface, since the active set is a performance optimization but the result is identical to calling subgraph before aggregateMessages. The active set option is still there in the package-private method aggregateMessagesWithActiveSet. You can actually

Re: Graphx gets slower as the iteration number increases

2015-03-24 Thread Ankur Dave
This might be because partitions are getting dropped from memory and needing to be recomputed. How much memory is in the cluster, and how large are the partitions? This information should be in the Executors and Storage pages in the web UI. Ankur http://www.ankurdave.com/ On Tue, Mar 24, 2015 at

Re: Learning GraphX Questions

2015-02-13 Thread Ankur Dave
At 2015-02-13 12:19:46 -0800, Matthew Bucci mrbucci...@gmail.com wrote: 1) How do you actually run programs in GraphX? At the moment I've been doing everything live through the shell, but I'd obviously like to be able to work on it by writing and running scripts. You can create your own

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-29 Thread Ankur Dave
Thanks for the reminder. I just created a PR: https://github.com/apache/spark/pull/4273 Ankur On Thu, Jan 29, 2015 at 7:25 AM, Jay Hutfles jayhutf...@gmail.com wrote: Just curious, is this set to be merged at some point?

Re: graph.inDegrees including zero values

2015-01-25 Thread Ankur Dave
You can do this using leftJoin, as collectNeighbors [1] does: graph.vertices.leftJoin(graph.inDegrees) { (vid, attr, inDegOpt) => inDegOpt.getOrElse(0) } [1] https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala#L145 Ankur On Sun, Jan 25,
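Expanded into a compilable sketch, assuming an existing Graph[VD, ED]:

    import org.apache.spark.graphx._

    // Vertices absent from graph.inDegrees have no in-edges, so default to 0
    def inDegreesWithZeros[VD, ED](graph: Graph[VD, ED]): VertexRDD[Int] =
      graph.vertices.leftJoin(graph.inDegrees) { (vid, attr, inDegOpt) =>
        inDegOpt.getOrElse(0)
      }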

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-22 Thread Ankur Dave
At 2015-01-22 02:06:37 -0800, NicolasC nicolas.ch...@inria.fr wrote: I try to execute a simple program that runs the ShortestPaths algorithm (org.apache.spark.graphx.lib.ShortestPaths) on a small grid graph. I use Spark 1.2.0 downloaded from spark.apache.org. This program runs more than 2

Re: Error for first run from iPython Notebook

2015-01-21 Thread Dave
Is this the wrong list to be asking this question? I'm not even sure where to start troubleshooting. On Tue, Jan 20, 2015 at 9:48 AM, Dave dla...@gmail.com wrote: Not sure if anyone who can help has seen this. Any suggestions would be appreciated, thanks! On Mon Jan 19 2015 at 1:50:43 PM

Re: Error for first run from iPython Notebook

2015-01-20 Thread Dave
Not sure if anyone who can help has seen this. Any suggestions would be appreciated, thanks! On Mon Jan 19 2015 at 1:50:43 PM Dave dla...@gmail.com wrote: Hi, I've setup my first spark cluster (1 master, 2 workers) and an iPython notebook server that I'm trying to setup to access the cluster

Error for first run from iPython Notebook

2015-01-19 Thread Dave
helpful. Thanks for any help! Dave

Re: Using graphx to calculate average distance of a big graph

2015-01-06 Thread Ankur Dave
[-dev] What size of graph are you hoping to run this on? For small graphs where materializing the all-pairs shortest path is an option, you could simply find the APSP using https://github.com/apache/spark/pull/3619 and then take the average distance (apsp.map(_._2.toDouble).mean). Ankur

Re: representing RDF literals as vertex properties

2014-12-08 Thread Ankur Dave
At 2014-12-08 12:12:16 -0800, spr s...@yarcdata.com wrote: OK, have waded into implementing this and have gotten pretty far, but am now hitting something I don't understand, a NoSuchMethodError. [...] The (short) traceback looks like Exception in thread main java.lang.NoSuchMethodError:

Re: [Graphx] which way is better to access faraway neighbors?

2014-12-05 Thread Ankur Dave
At 2014-12-05 02:26:52 -0800, Yifan LI iamyifa...@gmail.com wrote: I have a graph in where each vertex keep several messages to some faraway neighbours(I mean, not to only immediate neighbours, at most k-hops far, e.g. k = 5). now, I propose to distribute these messages to their

Re: Problem creating EC2 cluster using spark-ec2

2014-12-04 Thread Dave Challis
at 5:11 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Interesting. Do you have any problems when launching in us-east-1? What is the full output of spark-ec2 when launching a cluster? (Post it to a gist if it’s too big for email.) On Mon, Dec 1, 2014 at 10:34 AM, Dave Challis

Re: Determination of number of RDDs

2014-12-04 Thread Ankur Dave
At 2014-12-04 02:08:45 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote: I have a graph and I want to create RDDs equal in number to the nodes in the graph. How can I do that? If I have 10 nodes then I want to create 10 rdds. Is that possible in GraphX? This is possible: you can collect

Re: GraphX Pregel halting condition

2014-12-04 Thread Ankur Dave
There's no built-in support for doing this, so the best option is to copy and modify Pregel to check the accumulator at the end of each iteration. This is robust and shouldn't be too hard, since the Pregel code is short and only uses public GraphX APIs. Ankur At 2014-12-03 09:37:01 -0800, Jay

Re: representing RDF literals as vertex properties

2014-12-04 Thread Ankur Dave
At 2014-12-04 16:26:50 -0800, spr s...@yarcdata.com wrote: I'm also looking at how to represent literals as vertex properties. It seems one way to do this is via positional convention in an Array/Tuple/List that is the VD; i.e., to represent height, weight, and eyeColor, the VD could be a

Re: Filter using the Vertex Ids

2014-12-03 Thread Ankur Dave
At 2014-12-03 02:13:49 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote: We cannot do sc.parallelize(List(VertexRDD)), can we? There's no need to do this, because every VertexRDD is also a pair RDD: class VertexRDD[VD] extends RDD[(VertexId, VD)] You can simply use graph.vertices in

Re: Filter using the Vertex Ids

2014-12-03 Thread Ankur Dave
At 2014-12-02 22:01:20 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote: I have a graph which returns the following on doing graph.vertices (1, 1.0) (2, 1.0) (3, 2.0) (4, 2.0) (5, 0.0) I want to group all the vertices with the same attribute together, like into one RDD or something. I

Re: Filter using the Vertex Ids

2014-12-03 Thread Ankur Dave
To get that function in scope you have to import org.apache.spark.SparkContext._ Ankur On Wednesday, December 3, 2014, Deep Pradhan pradhandeep1...@gmail.com wrote: But groupByKey() gives me the error saying that it is not a member of org.apache.spark,rdd,RDD[(Double,
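Putting the three replies together, a minimal sketch for grouping vertex ids by their attribute (attribute type assumed to be Double, as in the thread):

    import org.apache.spark.SparkContext._   // pair-RDD functions such as groupByKey
    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    def verticesByAttribute(graph: Graph[Double, _]): RDD[(Double, Iterable[VertexId])] =
      graph.vertices.map { case (id, attr) => (attr, id) }.groupByKey()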

Problem creating EC2 cluster using spark-ec2

2014-12-01 Thread Dave Challis
directory should be found), it only contains a single 'conf' directory: $ ls /root/spark conf Any idea why spark-ec2 might have failed to copy these files across? Thanks, Dave

Re: how to force graphx to execute transfomtation

2014-11-26 Thread Ankur Dave
At 2014-11-26 05:25:10 -0800, Hlib Mykhailenko hlib.mykhaile...@inria.fr wrote: I work with Graphx. When I call graph.partitionBy(..) nothing happens because, as I understand it, all transformations are lazy and partitionBy is built using transformations. Is there a way to force spark

Re: inconsistent edge counts in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-11 01:51:43 +, Buttler, David buttl...@llnl.gov wrote: I am building a graph from a large CSV file. Each record contains a couple of nodes and about 10 edges. When I try to load a large portion of the graph, using multiple partitions, I get inconsistent results in the number

Re: GraphX / PageRank with edge weights

2014-11-18 Thread Ankur Dave
At 2014-11-13 21:28:52 +, Ommen, Jurgen omme0...@stthomas.edu wrote: I'm using GraphX and playing around with its PageRank algorithm. However, I can't see from the documentation how to use edge weight when running PageRank. Is this possible to consider edge weights and how would I do it?

Re: Pagerank implementation

2014-11-18 Thread Ankur Dave
At 2014-11-15 18:01:22 -0700, tom85 tom.manha...@gmail.com wrote: This line: val newPR = oldPR + (1.0 - resetProb) * msgSum makes no sense to me. Should it not be: val newPR = resetProb/graph.vertices.count() + (1.0 - resetProb) * msgSum ? This is an unusual version of PageRank where the

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-17 14:47:50 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I was going through the graphx section in the Spark API in https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$ Here, I find the word landmark. Can anyone explain to me

Re: Running PageRank in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 12:02:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I just ran the PageRank code in GraphX with some sample data. What I am seeing is that the total rank changes drastically if I change the number of iterations from 10 to 100. Why is that so? As far as I understand, the

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-18 14:59:20 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: So landmark can contain just one vertex right? Right. Which algorithm has been used to compute the shortest path? It's distributed Bellman-Ford. Ankur

Re: New Codes in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 14:51:54 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I am using Spark-1.0.0. There are two GraphX directories that I can see here 1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx which contains LiveJournalPageRank.scala 2.

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:29:08 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: Does Bellman-Ford give the best solution? It gives the same solution as any other algorithm, since there's only one correct solution for shortest paths and it's guaranteed to find it eventually. There are probably

Re: New Codes in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:35:13 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: Now, how do I run the LiveJournalPageRank.scala that is there in 1? I think it should work to use MASTER=local[*] $SPARK_HOME/bin/run-example graphx.LiveJournalPageRank /edge-list-file.txt --numEPart=8 --numIter=10

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:44:31 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I meant to ask whether it gives the solution faster than other algorithms. No, it's just that it's much simpler and easier to implement than the others. Section 5.2 of the Pregel paper [1] justifies using it for a graph

Re: New Codes in GraphX

2014-11-18 Thread Ankur Dave
At 2014-11-18 15:51:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes the above command works, but there is this problem. Most of the time, the total rank is NaN (Not a Number). Why is it so? I've also seen this, but I'm not sure why it happens. If you could find out which vertices

Re: Fwd: Executor Lost Failure

2014-11-10 Thread Ankur Dave
At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote: Tasks are now getting submitted, but many tasks don't happen. Like, after opening the spark-shell, I load a text file from disk and try printing its contents as: sc.textFile(/path/to/file).foreach(println)
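For context, foreach(println) runs on the executors, so its output lands in the executor logs rather than the shell; a one-line sketch of bringing a sample back to the driver instead (assuming a SparkContext named sc):

    // take() ships a small sample to the driver, where println is visible
    sc.textFile("/path/to/file").take(20).foreach(println)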

Re: Spark v Redshift

2014-11-04 Thread Akshar Dave
-- Akshar Dave Principal – Big Data SoftNet Solutions

Re: To find distances to reachable source vertices using GraphX

2014-11-03 Thread Ankur Dave
The NullPointerException seems to be because edge.dstAttr is null, which might be due to SPARK-3936 https://issues.apache.org/jira/browse/SPARK-3936. Until that's fixed, I edited the Gist with a workaround. Does that fix the problem? Ankur http://www.ankurdave.com/ On Mon, Nov 3, 2014 at 12:23

Re: How to correctly extimate the number of partition of a graph in GraphX

2014-11-02 Thread Ankur Dave
How large is your graph, and how much memory does your cluster have? We don't have a good way to determine the *optimal* number of partitions aside from trial and error, but to get the job to at least run to completion, it might help to use the MEMORY_AND_DISK storage level and a large number of

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Ankur Dave
At 2014-10-25 08:56:34 +0530, Arpit Kumar arp8...@gmail.com wrote: GraphLoader1.scala:49: error: class EdgePartitionBuilder in package impl cannot be accessed in package org.apache.spark.graphx.impl [INFO] val builder = new EdgePartitionBuilder[Int, Int] Here's a workaround: 1. Copy and

Re: GraphX StackOverflowError

2014-10-28 Thread Ankur Dave
At 2014-10-28 16:27:20 +0300, Zuhair Khayyat zuhair.khay...@gmail.com wrote: I am using connected components function of GraphX (on Spark 1.0.2) on some graph. However for some reason the fails with StackOverflowError. The graph is not too big; it contains 1 vertices and 50 edges.

Re: Workaround for SPARK-1931 not compiling

2014-10-24 Thread Ankur Dave
At 2014-10-23 09:48:55 +0530, Arpit Kumar arp8...@gmail.com wrote: error: value partitionBy is not a member of org.apache.spark.rdd.RDD[(org.apache.spark.graphx.PartitionID, org.apache.spark.graphx.Edge[ED])] Since partitionBy is a member of PairRDDFunctions, it sounds like the implicit

Re: graphx - mutable?

2014-10-14 Thread Ankur Dave
On Tue, Oct 14, 2014 at 12:36 PM, ll duy.huynh@gmail.com wrote: hi again. just want to check in to see if anyone could advise on how to implement a mutable, growing graph with graphx? we're building a graph that grows over time. it adds more vertices and edges every iteration of

Re: graphx - mutable?

2014-10-14 Thread Ankur Dave
On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh duy.huynh@gmail.com wrote: a related question, what is the best way to update the values of existing vertices and edges? Many of the Graph methods deal with updating the existing values in bulk, including mapVertices, mapEdges, mapTriplets,

Re: How to construct graph in graphx

2014-10-13 Thread Ankur Dave
At 2014-10-13 18:22:44 -0400, Soumitra Siddharth Johri soumitra.siddha...@gmail.com wrote: I have a flat tab separated file like below: [...] where n1,n2,n3,n4 are the nodes of the graph and R1,P2,P3 are the properties which should form the edges between the nodes. How can I construct a

Re: How to construct graph in graphx

2014-10-13 Thread Ankur Dave
At 2014-10-13 21:08:15 -0400, Soumitra Johri soumitra.siddha...@gmail.com wrote: There is no 'long' field in my file. So when I form the edge I get a type mismatch error. Is it mandatory for GraphX that every vertex should have a distinct id? In my case n1,n2,n3,n4 are all strings. (+user

Re: Pregel messages serialized in local machine?

2014-09-25 Thread Ankur Dave
At 2014-09-25 06:52:46 -0700, Cheuk Lam chl...@hotmail.com wrote: This is a question on using the Pregel function in GraphX. Does a message get serialized and then de-serialized in the scenario where both the source and the destination vertices are in the same compute node/machine? Yes,

Re: paging through an RDD that's too large to collect() all at once

2014-09-19 Thread Dave Anderson
Excellent - that's exactly what I needed. I saw iterator() but missed the toLocalIterator() method -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/paging-through-an-RDD-that-s-too-large-to-collect-all-at-once-tp14638p14686.html Sent from the Apache Spark
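A minimal sketch of the toLocalIterator approach, with a hypothetical push callback standing in for the downstream system:

    import org.apache.spark.rdd.RDD

    // Streams one partition at a time to the driver instead of collecting everything
    def pushAll[T](rdd: RDD[T], push: T => Unit): Unit =
      rdd.toLocalIterator.foreach { elem =>
        push(elem)   // hypothetical downstream call; throttle here as needed
      }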

paging through an RDD that's too large to collect() all at once

2014-09-18 Thread dave-anderson
I have an RDD on the cluster that I'd like to iterate over and perform some operations on each element (push data from each element to another downstream system outside of Spark). I'd like to do this at the driver so I can throttle the rate that I push to the downstream system (as opposed to

Re: how to group within the messages at a vertex?

2014-09-17 Thread Ankur Dave
At 2014-09-17 11:39:19 -0700, spr s...@yarcdata.com wrote: I'm trying to implement label propagation in GraphX. The core step of that algorithm is - for each vertex, find the most frequent label among its neighbors and set its label to that. [...] It seems on the broken line above, I
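A sketch of the core step, assuming Graph[Long, _] with the current label as the vertex attribute; this uses aggregateMessages, the Spark 1.2+ successor to the mapReduceTriplets API discussed in this era:

    import org.apache.spark.graphx._

    def propagateOnce[ED](graph: Graph[Long, ED]): VertexRDD[Long] = {
      // Count the labels arriving at each vertex from its neighbors
      val counts = graph.aggregateMessages[Map[Long, Int]](
        ctx => {
          ctx.sendToDst(Map(ctx.srcAttr -> 1))
          ctx.sendToSrc(Map(ctx.dstAttr -> 1))
        },
        (a, b) => (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0) + b.getOrElse(k, 0))).toMap
      )
      // Pick the most frequent neighbor label per vertex
      counts.mapValues((m: Map[Long, Int]) => m.maxBy(_._2)._1)
    }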

Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 10:55:37 +0200, Yifan LI iamyifa...@gmail.com wrote: - from [1], and my understanding, the existing inactive feature in graphx pregel api is “if there is no in-edges, from active vertex, to this vertex, then we will say this one is inactive”, right? Well, that's true when

Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 12:23:10 +0200, Yifan LI iamyifa...@gmail.com wrote: but I am wondering if there is a message(none?) sent to the target vertex(the rank change is less than tolerance) in below dynamic page rank implementation, def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {

Re: vertex active/inactive feature in Pregel API ?

2014-09-15 Thread Ankur Dave
At 2014-09-15 16:25:04 +0200, Yifan LI iamyifa...@gmail.com wrote: I am wondering if the vertex active/inactive(corresponding the change of its value between two supersteps) feature is introduced in Pregel API of GraphX? Vertex activeness in Pregel is controlled by messages: if a vertex did

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError GC overhead limit exceeded

2014-09-09 Thread Ankur Dave
At 2014-09-05 12:13:18 +0200, Yifan LI iamyifa...@gmail.com wrote: But how to assign the storage level to a new vertices RDD that mapped from an existing vertices RDD, e.g. val newVertexRDD = graph.collectNeighborIds(EdgeDirection.Out).map{ case (id: VertexId, a: Array[VertexId]) => (id,

Re: spark 1.1.0 requested array size exceed vm limits

2014-09-05 Thread Ankur Dave
At 2014-09-05 21:40:51 +0800, marylucy qaz163wsx_...@hotmail.com wrote: But running graphx edgeFileList ,some tasks failed error:requested array size exceed vm limits Try passing a higher value for minEdgePartitions when calling GraphLoader.edgeListFile. Ankur
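A minimal sketch, assuming a SparkContext named sc and a hypothetical edge-list path; minEdgePartitions is the parameter name used by GraphLoader in this era:

    import org.apache.spark.graphx.GraphLoader

    // More partitions keep each partition's internal edge array below JVM array limits
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt",
      minEdgePartitions = 256)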

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError GC overhead limit exceeded

2014-09-03 Thread Ankur Dave
At 2014-09-03 17:58:09 +0200, Yifan LI iamyifa...@gmail.com wrote: val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK) Error: java.lang.UnsupportedOperationException: Cannot

Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-26 Thread Ankur Dave
At 2014-08-26 01:20:09 -0700, BertrandR bertrand.rondepierre...@gmail.com wrote: I actually tried without unpersisting, but given the performance I tried to add these in order to free the memory. After your answer I tried to remove them again, but without any change in the execution time...

Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-25 Thread Ankur Dave
At 2014-08-25 06:41:36 -0700, BertrandR bertrand.rondepierre...@gmail.com wrote: Unfortunately, this works well for extremely small graphs, but it becomes exponentially slow with the size of the graph and the number of iterations (doesn't finish 20 iterations with graphs having 48000 edges).

Re: GraphX usecases

2014-08-25 Thread Ankur Dave
At 2014-08-25 11:23:37 -0700, Sunita Arvind sunitarv...@gmail.com wrote: Does this We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in

Re: The running time of spark

2014-08-23 Thread Ankur Dave
At 2014-08-23 08:33:48 -0700, Denis RP qq378789...@gmail.com wrote: Bottleneck seems to be I/O, the CPU usage ranges 10%~15% most time per VM. The caching is maintained by pregel, should be reliable. Storage level is MEMORY_AND_DISK_SER. I'd suggest trying the DISK_ONLY storage level and
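A sketch of loading with on-disk storage levels (GraphLoader parameters available from Spark 1.1 on), assuming a SparkContext named sc, so Pregel's cached intermediates spill to disk as well:

    import org.apache.spark.graphx.GraphLoader
    import org.apache.spark.storage.StorageLevel

    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt",
      edgeStorageLevel = StorageLevel.DISK_ONLY,
      vertexStorageLevel = StorageLevel.DISK_ONLY)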

Re: Personalized Page rank in graphx

2014-08-20 Thread Ankur Dave
At 2014-08-20 10:57:57 -0700, Mohit Singh mohit1...@gmail.com wrote: I was wondering if Personalized Page Rank algorithm is implemented in graphx. If the talks and presentation were to be believed (https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf) it

Re: GraphX question about graph traversal

2014-08-20 Thread Ankur Dave
At 2014-08-20 10:34:50 -0700, Cesar Arevalo ce...@zephyrhealthinc.com wrote: I would like to get the type B vertices that are connected through type A vertices where the edges have a score greater than 5. So, from the example above I would like to get V1 and V4. It sounds like you're trying to

Re: noob: how to extract different members of a VertexRDD

2014-08-19 Thread Ankur Dave
(+user) On Tue, Aug 19, 2014 at 12:05 PM, spr s...@yarcdata.com wrote: I want to assign each vertex to a community with the name of the vertex. As I understand it, you want to set the vertex attributes of a graph to the corresponding vertex ids. You can do this using Graph#mapVertices [1] as
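A one-line sketch of that suggestion, assuming an existing graph:

    // Replace each vertex attribute with the vertex's own id
    val communities = graph.mapVertices((id, _) => id)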

Re: noob: how to extract different members of a VertexRDD

2014-08-19 Thread Ankur Dave
At 2014-08-19 12:47:16 -0700, spr s...@yarcdata.com wrote: One follow-up question. If I just wanted to get those values into a vanilla variable (not a VertexRDD or Graph or ...) so I could easily look at them in the REPL, what would I do? Are the aggregate data structures inside the

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError GC overhead limit exceeded

2014-08-18 Thread Ankur Dave
On Mon, Aug 18, 2014 at 6:29 AM, Yifan LI iamyifa...@gmail.com wrote: I am testing our application(similar to personalised page rank using Pregel, and note that each vertex property will need pretty much more space to store after new iteration) [...] But when we ran it on larger graph(e.g.

Re: GraphX Pagerank application

2014-08-15 Thread Ankur Dave
On Wed, Aug 6, 2014 at 11:37 AM, AlexanderRiggers alexander.rigg...@gmail.com wrote: To perform the page rank I have to create a graph object, adding the edges by setting sourceID=id and distID=brand. In GraphLab there is function: g = SGraph().add_edges(data, src_field='id',

Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-07 Thread Pranay Dave
Hello Shivram Thanks for your reply. Here is a simple data set input. This data is in file called /sparkdev/datafiles/covariance.txt 1,1 2,2 3,3 4,4 5,5 6,6 7,7 8,8 9,9 10,10 Output I would like to see is a total of columns. It can be done with reduce, but I wanted to test lapply. Output I
