Can we make EdgeRDD and VertexRDD storage level to MEMORY_AND_DISK?
Hi, I'm running out of memory when I run a GraphX program on a dataset of more than 10 GB. The same data was handled fine by normal Spark operations once I used StorageLevel.MEMORY_AND_DISK. With GraphX I found that only in-memory storage is allowed, because this property is set by default in the Graph constructor. When I changed the storage level as per my requirement, it was not allowed and threw an error saying "Cannot modify StorageLevel when it is already set". Please help me with these queries: 1. How do I override the current storage level to MEMORY_AND_DISK? 2. If it is not possible through the constructor, what if I modify the Graph.scala class and rebuild it to make it work? If I do that, is there anything else I need to know? Thanks --Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-make-EdgeRDD-and-VertexRDD-storage-level-to-MEMORY-AND-DISK-tp19307.html Sent from the Apache Spark User List mailing list archive at Nabble.com. To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
How to get the list of edges between two vertices?
Hi, I have a graph where more than one edge between the same two vertices is possible. Now I need to find the top vertex pairs with the most calls between them; the output should look like (V1, V2, no. of edges). So I need to know how to find the total number of edges between just those two vertices. --Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-list-of-edges-between-two-Vertex-tp19309.html
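One way this could be sketched in GraphX: group the edge RDD by its (src, dst) endpoint pair and count. This assumes a `graph` value is already built; all names here are illustrative, not from the thread.

```scala
import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.x)
import org.apache.spark.graphx._

// Count parallel edges per (src, dst) pair, then take the pairs with
// the most edges between them.
val pairCounts = graph.edges
  .map(e => ((e.srcId, e.dstId), 1))
  .reduceByKey(_ + _)                                  // ((V1, V2), n)
  .map { case ((src, dst), n) => (src, dst, n) }       // (V1, V2, no. of edges)

val top10 = pairCounts.sortBy(_._3, ascending = false).take(10)
```

If edges in both directions should count as the same pair, the key could be normalized first, e.g. `(math.min(e.srcId, e.dstId), math.max(e.srcId, e.dstId))`.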
Re: Can we make EdgeRDD and VertexRDD storage level to MEMORY_AND_DISK?
Just figured it out: using the Graph constructor you can pass the storage level for both edges and vertices: Graph.fromEdges(edges, defaultValue = (,), StorageLevel.MEMORY_AND_DISK, StorageLevel.MEMORY_AND_DISK). Thanks to this post: https://issues.apache.org/jira/browse/SPARK-1991 --Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-make-EdgeRDD-and-VertexRDD-storage-level-to-MEMORY-AND-DISK-tp19307p19335.html
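Spelled out, the call above could look like this. A sketch assuming `edges: RDD[Edge[Int]]` already exists; the vertex default value is illustrative (the thread's own default was elided by the archive).

```scala
import org.apache.spark.graphx._
import org.apache.spark.storage.StorageLevel

// Graph.fromEdges accepts target storage levels for the edge and vertex
// RDDs (the capability added under SPARK-1991). Vertices not present in
// `edges` get `defaultValue` as their attribute.
val graph = Graph.fromEdges(
  edges,
  defaultValue = 0,
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
```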
Re: How to join two RDDs with mutually exclusive keys
I have a similar type of issue: I want to join two different RDDs into one. file1.txt content is (ID, counts): val x: RDD[(Long, Int)] = sc.textFile("file1.txt").map(line => line.split(",")).map(row => (row(0).toLong, row(1).toInt)) [(4407, 40), (2064, 38), (7815, 10), (5736, 17), (8031, 3)] The second RDD, from file2.txt, contains (ID, name): val y: RDD[(Long, String)] (where ID is common to both RDDs) [(4407, Jhon), (2064, Maria), (7815, Casto), (5736, Ram), (8031, XYZ)] and I'm expecting the result to look like [(ID, Name, Count)]: [(4407, Jhon, 40), (2064, Maria, 38), (7815, Casto, 10), (5736, Ram, 17), (8031, XYZ, 3)] Any help will be really appreciated. Thanks On 21 November 2014 09:18, dsiegmann [via Apache Spark User List] ml-node+s1001560n19419...@n3.nabble.com wrote: You want to use RDD.union (or SparkContext.union for many RDDs). These don't join on a key. Union doesn't really do anything itself, so it is low overhead. Note that the combined RDD will have all the partitions of the original RDDs, so you may want to coalesce after the union. val x = sc.parallelize(Seq((1, 3), (2, 4))) val y = sc.parallelize(Seq((3, 5), (4, 7))) val z = x.union(y) z.collect res0: Array[(Int, Int)] = Array((1,3), (2,4), (3,5), (4,7)) On Thu, Nov 20, 2014 at 3:06 PM, Blind Faith [hidden email] wrote: Say I have two RDDs with the following values x = [(1, 3), (2, 4)] and y = [(3, 5), (4, 7)] and I want to have z = [(1, 3), (2, 4), (3, 5), (4, 7)]. How can I achieve this? I know you can use outerJoin followed by map to achieve this, but is there a more direct way?
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 W: www.velos.io -- Regards, Harihar Nahak BigData Developer Wynyard Email: hna...@wynyardgroup.com | Extn: 8019 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-join-two-RDDs-with-mutually-exclusive-keys-tp19417p19423.html
Re: How to join two RDDs with mutually exclusive keys
Thanks Daniel. Applied join from PairRDDFunctions: val countByUsername = file1.join(file2).map { case (id, (count, username)) => (id, username, count) } --Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-join-two-RDDs-with-mutually-exclusive-keys-tp19417p19431.html
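For reference, the whole pipeline from the earlier message could look like this. A sketch, not the poster's exact code: the file names are the ones from the thread, and the column order is assumed as described there (file1 = ID,count; file2 = ID,name).

```scala
// (ID, count) pairs from file1.txt
val counts = sc.textFile("file1.txt")
  .map(_.split(","))
  .map(row => (row(0).trim.toLong, row(1).trim.toInt))

// (ID, name) pairs from file2.txt
val names = sc.textFile("file2.txt")
  .map(_.split(","))
  .map(row => (row(0).trim.toLong, row(1).trim))

// join on ID yields (ID, (count, name)); reshape to (ID, name, count)
val result = counts.join(names)
  .map { case (id, (count, name)) => (id, name, count) }
```

Note that join only keeps IDs present in both RDDs; leftOuterJoin would keep every ID from `counts` even without a matching name.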
Is Spark? or GraphX runs fast? a performance comparison on Page Rank
Hi All, I started exploring Spark two months ago. I'm looking for some concrete numbers from both Spark and GraphX so that I can decide what to use, based on which gets the highest performance. According to the documentation, GraphX runs 10x faster than normal Spark. So I ran the PageRank algorithm in both: For Spark I used: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala For GraphX I used: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/graphx/LiveJournalPageRank.scala Input data: http://snap.stanford.edu/data/soc-LiveJournal1.html (1 GB in size) No. of iterations: 2 *Time taken:* Local mode (machine: 8 cores; 16 GB memory; 2.80 GHz Intel i7; executor memory: 4 GB; no. of partitions: 50; no. of iterations: 2): *Spark PageRank took 21.29 mins; GraphX PageRank took 42.01 mins.* Cluster mode (Ubuntu 12.04; Spark 1.1/Hadoop 2.4 cluster; 3 workers, 1 driver; 8 cores, 30 GB memory; executor memory 4 GB; no. of edge partitions: 50; random vertex cut; no. of iterations: 2): *Spark PageRank took 10.54 mins; GraphX PageRank took 7.54 mins.* Could you please help me determine when to use Spark and when to use GraphX? If GraphX takes the same amount of time as Spark, then it is better to use Spark, because Spark has a variety of operators to deal with any type of RDD. Any suggestions, feedback, or pointers will be highly appreciated. Thanks, --Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-or-GraphX-runs-fast-a-performance-comparison-on-Page-Rank-tp19710.html
Re: Is there a way to turn on spark eventLog on the worker node?
You can set the same parameters when launching an application: if you use spark-submit, try --conf to pass those variables, or you can set them from SparkConf; either way the logs can be enabled for both the driver and the workers. --Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-way-to-turn-on-spark-eventLog-on-the-worker-node-tp19714p19716.html
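For example, at submit time the event-log settings can be passed per application with --conf. A sketch; the application class, jar, and log directory below are placeholders, while the two spark.eventLog property names are Spark's own.

```shell
# Enable the event log for this one application only
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///spark-events \
  --class com.example.MyApp \
  myapp.jar
```

The same two properties can equivalently be set on the SparkConf before the SparkContext is created.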
Re: Configuring custom input format
Hi, I'm trying to make a custom input format for CSV files. Could you share a bit more about what you read as input and what you have implemented? I'll try to replicate the same thing, and if I find something interesting on my end I'll let you know. Thanks, Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-custom-input-format-tp18220p19800.html
Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank
Hi guys, has anyone experienced the same thing as above? --Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-or-GraphX-runs-fast-a-performance-comparison-on-Page-Rank-tp19710p19909.html
Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank
Thanks Ankur, that's really helpful. I have a few queries on optimization techniques. Currently I used the RandomVertexCut partition strategy, but what strategy should be used if: 1. The no. of edges in the edge-list file is very large, like 50,000,000, where multiple edges between the same pair of vertices are common? 2. The no. of unique vertices in the above edge-list file is very large, say 10,000,000? 3. The no. of unique vertices in the above edge-list file is small, say less than 100,000? On 27 November 2014 at 20:23, ankurdave [via Apache Spark User List] ml-node+s1001560n1995...@n3.nabble.com wrote: At 2014-11-24 19:02:08 -0800, Harihar Nahak wrote: According to documentation GraphX runs 10x faster than normal Spark. So I run Page Rank algorithm in both the applications: [...] Could you please help me to determine, when to use Spark and GraphX? If you have a problem that's naturally expressible as a graph computation, it makes sense to use GraphX in my opinion. In addition to the optimizations that GraphX incorporates which you would otherwise have to implement manually, GraphX's programming model is likely a better fit. But even if you start off by using pure Spark, you'll still have the flexibility to use GraphX for other parts of the problem since it's part of the same system. To address the benchmark results you got: 1.
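For reference, a partition strategy is applied after loading the graph. A sketch, with the file path illustrative; the strategy names are the ones shipped with GraphX, and which one fits best depends on the graph's shape.

```scala
import org.apache.spark.graphx._

val graph = GraphLoader.edgeListFile(sc, "hdfs:///edges.txt")

// EdgePartition2D bounds the number of partitions each vertex is
// replicated to (roughly 2 * sqrt(numParts)) -- often a reasonable
// default for large, skewed graphs.
val partitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)

// CanonicalRandomVertexCut co-locates all edges between the same pair of
// vertices (in either direction) in one partition -- useful when many
// parallel edges connect the same pairs.
val coLocated = graph.partitionBy(PartitionStrategy.CanonicalRandomVertexCut)
```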
GraphX takes more time than Spark to load the graph, because it has to index it, but subsequent iterations should be faster. We benchmarked with 20 iterations to show this effect, but you only used 2 iterations, which doesn't give much time to amortize the loading cost. 2. The benchmarks in the GraphX OSDI paper are against a naive implementation of PageRank in Spark, while the version you benchmarked against has some of the same optimizations as GraphX does. I believe we found that the optimized Spark PageRank was only 3x slower than GraphX. 3. When running those benchmarks, we used an experimental version of Spark with in-memory shuffle, which disproportionately benefits GraphX since its shuffle files are smaller due to specialized compression. 4. We haven't optimized GraphX for local mode, so it's not surprising that it's slower there. Ankur
-- Regards, Harihar Nahak BigData Developer Wynyard Email: hna...@wynyardgroup.com | Extn: 8019 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-or-GraphX-runs-fast-a-performance-comparison-on-Page-Rank-tp19710p19986.html
Re: Lifecycle of RDD in spark-streaming
When new data comes in on a stream, Spark uses the streaming classes to convert it into RDDs, which are then transformed and finally acted upon, as you mention. As far as I have experienced, all RDDs remain in memory until the user destroys them or the application ends. On 26 November 2014 at 20:05, Mukesh Jha [via Apache Spark User List] ml-node+s1001560n19835...@n3.nabble.com wrote: Any pointers guys? On Tue, Nov 25, 2014 at 5:32 PM, Mukesh Jha [hidden email] wrote: Hey Experts, I wanted to understand in detail the lifecycle of RDDs in a streaming app. From my current understanding: - RDDs get created out of the realtime input stream. - Transform functions are applied in a lazy fashion on the RDDs to transform them into other RDDs. - Actions are taken on the final transformed RDDs to get the data out of the system. Also, RDDs are stored in the cluster's RAM (disk if configured so) and are cleaned in LRU fashion. So I have the following questions: - How does Spark (Streaming) guarantee that all the actions are taken on each input RDD/batch? - How does Spark determine that the lifecycle of an RDD is complete? Is there any chance that an RDD will be cleaned out of RAM before all actions are taken on it? Thanks in advance for all your help. Also, I'm relatively new to Scala and Spark, so pardon me in case these are naive questions/assumptions.
-- Thanks & Regards, [hidden email] -- Regards, Harihar Nahak BigData Developer Wynyard Email: hna...@wynyardgroup.com | Extn: 8019 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Lifecycle-of-RDD-in-spark-streaming-tp19749p19987.html
Re: Multiple SparkContexts in same Driver JVM
Try setting it in SparkConf: new SparkConf().set("spark.driver.allowMultipleContexts", "true") On 30 November 2014 at 17:37, lokeshkumar [via Apache Spark User List] ml-node+s1001560n20037...@n3.nabble.com wrote: Hi Forum, Is it not possible to run multiple SparkContexts concurrently without stopping the other one in Spark 1.3.0? I have been trying this out and getting the below error: Caused by: org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at: According to this, it's not possible unless we specify the option spark.driver.allowMultipleContexts = true. So is there a way to create multiple concurrently running SparkContexts in the same JVM, or should we trigger driver processes in different JVMs to do the same? Also, please let me know where the option 'spark.driver.allowMultipleContexts' is to be set. I have set it in spark-env.sh SPARK_MASTER_OPTS but no luck.
-- Regards, Harihar Nahak BigData Developer Wynyard Email: hna...@wynyardgroup.com | Extn: 8019 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-SparkContexts-in-same-Driver-JVM-tp20037p20055.html
Re: RDDs join problem: incorrect result
What do you mean by incorrect? Could you please share some examples from both the input RDDs and the resultant RDD? If you get any exception, paste that too; it helps to debug where the issue is. On 27 November 2014 at 17:07, liuboya [via Apache Spark User List] ml-node+s1001560n19928...@n3.nabble.com wrote: Hi, I ran into a problem when doing a join operation on two RDDs. For example, RDDa: RDD[(String, String)] and RDDb: RDD[(String, Int)]. Then the result RDDc: RDD[(String, (String, Int))] = RDDa.join(RDDb). But I find the results in RDDc are incorrect compared with RDDb. What's wrong in join? -- Regards, Harihar Nahak BigData Developer Wynyard Email: hna...@wynyardgroup.com | Extn: 8019 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-join-problem-incorrect-result-tp19928p20056.html
Re: GraphX:java.lang.NoSuchMethodError:org.apache.spark.graphx.Graph$.apply
Hi, if you haven't figured it out so far, could you please share some details on how you are running GraphX? Also, before executing the above commands from the shell, import the required GraphX packages. On 27 November 2014 at 20:49, liuboya [via Apache Spark User List] ml-node+s1001560n19959...@n3.nabble.com wrote: I'm waiting online. Who can help me, please? -- Regards, Harihar Nahak BigData Developer Wynyard Email: hna...@wynyardgroup.com | Extn: 8019 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-java-lang-NoSuchMethodError-org-apache-spark-graphx-Graph-apply-tp19958p20057.html
Re: Edge List File in GraphX
Use GraphLoader.edgeListFile(sc, fileName), where the file must be in "1\t2" form. About the NaN result: there might be some issue with the data; I ran it for various combinations of data sets and it works perfectly fine. On 25 November 2014 at 19:23, pradhandeep [via Apache Spark User List] ml-node+s1001560n1972...@n3.nabble.com wrote: Hi, Is it necessary for every vertex to have an attribute when we load a graph into GraphX? In other words, I have an edge list file containing pairs of vertices, i.e., "1 2" means that there is an edge between node 1 and node 2. Now, when I run PageRank on this data it returns NaN. Can I use this type of data for any algorithm in GraphX? Thank you -- Regards, Harihar Nahak BigData Developer Wynyard Email: hna...@wynyardgroup.com | Extn: 8019 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Edge-List-File-in-GraphX-tp19724p20060.html
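A minimal sketch of that loading path (file path illustrative). GraphLoader builds the graph from whitespace-separated "srcId dstId" lines and gives every vertex the attribute 1, so missing vertex attributes are not by themselves a cause of NaN ranks.

```scala
import org.apache.spark.graphx._

// Build the graph from an edge-list file; each line is "srcId<tab>dstId".
val graph = GraphLoader.edgeListFile(sc, "hdfs:///edges.txt")

// Run PageRank to a tolerance; on well-formed input this yields finite
// ranks, so NaN usually points at malformed input lines.
val ranks = graph.pageRank(0.0001).vertices
ranks.take(5).foreach(println)
```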
Re: NumberFormatException
Hi Yu, Try this: val data = csv.map(line => line.split(",").map(elem => elem.trim)) // lines into rows data.map(rec => (rec(0).toInt, rec(1).toInt)) to convert into integers. On 16 December 2014 at 10:49, yu [via Apache Spark User List] ml-node+s1001560n20694...@n3.nabble.com wrote: Hello, everyone. I know 'NumberFormatException' occurs when a String cannot be parsed properly, but I really cannot find any mistakes in my code. I hope someone may kindly help me. My HDFS file is as follows: 8,22 3,11 40,10 49,47 48,29 24,28 50,30 33,56 4,20 30,38 ... So each line contains an integer + "," + an integer + "\n". My code is as follows: object StreamMonitor { def main(args: Array[String]): Unit = { val myFunc = (str: String) => { val strArray = str.trim().split(",") (strArray(0).toInt, strArray(1).toInt) } val conf = new SparkConf().setAppName("StreamMonitor"); val ssc = new StreamingContext(conf, Seconds(30)); val datastream = ssc.textFileStream("/user/yu/streaminput"); val newstream = datastream.map(myFunc) newstream.saveAsTextFiles("output/", ""); ssc.start() ssc.awaitTermination() } } The exception info is: 14/12/15 15:35:03 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, h3): java.lang.NumberFormatException: For input string: "8" java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) java.lang.Integer.parseInt(Integer.java:492) java.lang.Integer.parseInt(Integer.java:527) scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) scala.collection.immutable.StringOps.toInt(StringOps.scala:31) StreamMonitor$$anonfun$1.apply(StreamMonitor.scala:9) StreamMonitor$$anonfun$1.apply(StreamMonitor.scala:7) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) So based on the above info, "8" is the first number in the file and I think it should be parsed to an integer without any problems. I know it may be a very stupid question and the answer may be very easy, but I really cannot find the reason. I am thankful to anyone who helps! -- Regards, Harihar Nahak BigData Developer Wynyard Email: hna...@wynyardgroup.com | Extn: 8019 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NumberFormatException-tp20694p20696.html
Re: RDDs being cleaned too fast
RDD.persist() can be useful here. On 11 December 2014 at 14:34, ankits [via Apache Spark User List] ml-node+s1001560n20613...@n3.nabble.com wrote: I'm using Spark 1.1.0 and am seeing persisted RDDs being cleaned up too fast. How can I inspect the size of an RDD in memory and get more information about why it was cleaned up? There should be more than enough memory available on the cluster to store them, and by default the spark.cleaner.ttl is infinite, so I want more information about why this is happening and how to prevent it. Spark just logs this when removing RDDs: [2014-12-11 01:19:34,006] INFO spark.storage.BlockManager [] [] - Removing RDD 33 [2014-12-11 01:19:34,010] INFO pache.spark.ContextCleaner [] [akka://JobServer/user/context-supervisor/job-context1] - Cleaned RDD 33 [2014-12-11 01:19:34,012] INFO spark.storage.BlockManager [] [] - Removing RDD 33 [2014-12-11 01:19:34,016] INFO pache.spark.ContextCleaner [] [akka://JobServer/user/context-supervisor/job-context1] - Cleaned RDD 33
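A minimal sketch of that suggestion, assuming `rdd` stands for the RDD being cleaned up (the name is illustrative):

```scala
import org.apache.spark.storage.StorageLevel

// Pin the RDD explicitly; persist() with no argument defaults to
// MEMORY_ONLY, so pass a level if spilling to disk is acceptable.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()          // run an action so the blocks actually get stored

// When finished, release it explicitly instead of relying on the cleaner.
rdd.unpersist()
```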
-- Regards, Harihar Nahak BigData Developer Wynyard Email: hna...@wynyardgroup.com | Extn: 8019 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-being-cleaned-too-fast-tp20613p20738.html
Re: hello
You mean the Spark User List? It's pretty easy; check the first email, it has all the instructions. On 18 December 2014 at 21:56, csjtx1021 [via Apache Spark User List] ml-node+s1001560n20759...@n3.nabble.com wrote: i want to join you -- Regards, Harihar Nahak BigData Developer Wynyard Email: hna...@wynyardgroup.com | Extn: 8019 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/hello-tp20759p20770.html
Re: Spark GraphX question.
Hi Ted, I'm not familiar with Transitive Reduction, but you can get the expected result with the graph.subgraph(epred = ...) syntax, which filters edges by a predicate (here, the edge weight) and gives you a new graph matching your condition.

On 19 December 2014 at 11:11, Tae-Hyuk Ahn [via Apache Spark User List] ml-node+s1001560n20768...@n3.nabble.com wrote:
> Hi All, I am wondering what is the best way to remove transitive edges with a maximum spanning tree. For example:
> Edges: 1 - 2 (30), 2 - 3 (30), 1 - 3 (25)
> where the parenthesis is the weight of each edge. I'd like to get the reduced edge graph after Transitive Reduction, considering the weights as a maximum spanning tree:
> Edges: 1 - 2 (30), 2 - 3 (30)
> Do you have a good idea for this? Thanks, Ted

--Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-GraphX-question-tp20768p20771.html
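The edge-filtering idea above can be sketched on a plain edge list, with no Spark on the classpath; in GraphX the same predicate would go into graph.subgraph(epred = ...). The tuples mirror Ted's example, and the weight threshold of 30 is an assumption for illustration:

```scala
// Ted's example edges as (src, dst, weight) tuples
val edges = List((1L, 2L, 30), (2L, 3L, 30), (1L, 3L, 25))

// GraphX's graph.subgraph(epred = t => t.attr >= 30) keeps edges whose
// attribute passes the predicate; the same filtering on a plain list:
val kept = edges.filter { case (_, _, weight) => weight >= 30 }

println(kept) // only the two heaviest edges survive
```

Note this only drops edges below a threshold; a true transitive reduction would additionally need reachability checks on the remaining edges.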
Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank
Yes, I tried that too, with the pre-built Spark 1.1 release. If there are changes to the GraphX library in an upcoming release, or in Spark 1.2, just let me know and I can try it on that. --Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-or-GraphX-runs-fast-a-performance-comparison-on-Page-Rank-tp19710p20874.html - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Results never return to driver | Spark Custom Reader
Hi All, I wrote a custom reader to read a DB, and it is able to return key and value as expected, but after it finishes it never returns to the driver. Here is the output of the worker log:

15/01/23 15:51:38 INFO worker.ExecutorRunner: Launch command: java -cp ::/usr/local/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/usr/local/hadoop/etc/hadoop -XX:MaxPermSize=128m -Dspark.driver.port=53484 -Xms1024M -Xmx1024M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@VM90:53484/user/CoarseGrainedScheduler 6 VM99 4 app-20150123155114- akka.tcp://sparkWorker@VM99:44826/user/Worker
15/01/23 15:51:47 INFO worker.Worker: Executor app-20150123155114-/6 finished with state EXITED message Command exited with code 1 exitStatus 1
15/01/23 15:51:47 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@VM99:57695] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/01/23 15:51:47 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40143.96.25.29%3A35065-4#-915179653] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
15/01/23 15:51:49 INFO worker.Worker: Asked to kill unknown executor app-20150123155114-/6

If someone notices any clue to fixing this, I would really appreciate it.
--Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Results-never-return-to-driver-Spark-Custom-Reader-tp21328.html - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Data Locality
Hi Guys, I have a similar question and doubt. How does Spark create an executor on the same node where the data block is stored? Does it first get the block information from the HDFS NameNode and then place the executor on that node if a Spark worker daemon is installed there? --Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Data-Locality-tp21000p21410.html - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
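For intuition, the locality question above boils down to matching each block's replica hosts (as reported by the HDFS NameNode) against the hosts where Spark workers run, and preferring a host that holds a replica. This pure-Scala sketch uses hypothetical host names and is not Spark's actual scheduler code, only the preferred-location idea:

```scala
// Hypothetical block-to-replica-hosts map, as the HDFS NameNode would report it
val blockLocations = Map(
  "block-1" -> Seq("node-a", "node-b"),
  "block-2" -> Seq("node-c", "node-d")
)

// Hosts where Spark workers (and thus executors) actually run
val workerHosts = Set("node-b", "node-c")

// Prefer a host holding a replica; otherwise fall back to any replica host
// (in the latter case Spark would run the task non-locally)
val placement = blockLocations.map { case (block, replicas) =>
  block -> replicas.find(workerHosts.contains).getOrElse(replicas.head)
}

println(placement)
```

In real Spark, HadoopRDD gets these hosts from each InputSplit's getLocations() and the scheduler uses them as the task's preferred locations.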
Re: Eclipse on spark
Download the pre-built binary for Windows, attach all the required jars to your project's classpath in Eclipse, and go ahead with Eclipse. Make sure you have the same Java version.

On 25 January 2015 at 07:33, riginos [via Apache Spark User List] ml-node+s1001560n21350...@n3.nabble.com wrote:
> How to compile a Spark project in Scala IDE for Eclipse? I got many Scala scripts and I no longer want to load them from the scala-shell. What can I do?

--Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Eclipse-on-spark-tp21350p21359.html
Re: Results never return to driver | Spark Custom Reader
directory location but in that case I'd assume you know where to find the log)

On Thu, Jan 22, 2015 at 10:54 PM, Harihar Nahak hna...@wynyardgroup.com wrote:
> Hi All, I wrote a custom reader to read a DB, and it is able to return key and value as expected, but after it finishes it never returns to the driver. Here is the output of the worker log: [...]
> If someone notices any clue to fixing this, I would really appreciate it.

-- Regards, Harihar Nahak, BigData Developer, Wynyard | Email: hna...@wynyardgroup.com | Extn: 8019
Re: connector for CouchDB
No, I changed to MongoDB. But you can write custom code to connect to CouchDB directly; there is no such connector available on the market. By extending a few classes you can read CouchDB. I can help you with that; let me know if you are really interested.

On 30 January 2015 at 06:46, prateek arora [via Apache Spark User List] ml-node+s1001560n21422...@n3.nabble.com wrote:
> I am also looking for a connector for CouchDB in Spark. Did you find anything?

--Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/connector-for-CouchDB-tp18630p21426.html
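"Extending a few classes" here would mean writing a custom RDD: subclass org.apache.spark.rdd.RDD and implement getPartitions (say, one partition per page of CouchDB's _all_docs view) and compute (fetch the documents for that page). A self-contained sketch of that shape, using a stand-in Partition trait so it compiles without Spark on the classpath; all names, the paging scheme, and the fake document IDs are hypothetical:

```scala
// Stand-in for org.apache.spark.Partition, so this sketch is self-contained;
// in real code you would extend org.apache.spark.rdd.RDD instead.
trait Partition { def index: Int }

case class CouchPartition(index: Int, skip: Int, limit: Int) extends Partition

// The two methods a custom RDD must implement, sketched outside Spark:
class CouchDocSketch(totalDocs: Int, pageSize: Int) {
  // getPartitions: one partition per page of the _all_docs view
  def getPartitions: Array[Partition] =
    (0 until (totalDocs + pageSize - 1) / pageSize).map { i =>
      CouchPartition(i, i * pageSize, pageSize): Partition
    }.toArray

  // compute: produce the docs for one partition (fake IDs stand in for a real
  // HTTP call such as GET /db/_all_docs?skip=...&limit=...)
  def compute(split: Partition): Iterator[String] = {
    val p = split.asInstanceOf[CouchPartition]
    (p.skip until math.min(p.skip + p.limit, totalDocs)).iterator.map(i => s"doc-$i")
  }
}

val sketch = new CouchDocSketch(totalDocs = 10, pageSize = 4)
println(sketch.getPartitions.length)                    // 3 pages
println(sketch.compute(sketch.getPartitions(2)).toList) // last, short page
```

Partitioning by page keeps each compute call bounded, so one slow or large page cannot stall the whole read.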