Clustering of Words

2015-11-08 Thread Deep Pradhan
Hi, I am trying to cluster words of some articles. I used TFIDF and Word2Vec in Spark to get the vector for each word and I used KMeans to cluster the words. Now, is there any way to get back the words from the vectors? I want to know what words are there in each cluster. I am aware that TFIDF
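A minimal sketch of one way to recover the words, assuming MLlib's Word2VecModel (whose getVectors map pairs each word with its vector) and an already-trained KMeansModel; the names w2vModel and kmModel are placeholders, not the poster's code:

    import org.apache.spark.mllib.clustering.KMeansModel
    import org.apache.spark.mllib.feature.Word2VecModel
    import org.apache.spark.mllib.linalg.Vectors

    // Sketch: keep each word next to its vector, predict the cluster of the
    // vector, then group the words by cluster id.
    def wordsPerCluster(w2vModel: Word2VecModel, kmModel: KMeansModel): Map[Int, Seq[String]] =
      w2vModel.getVectors.toSeq                                          // (word, Array[Float]) pairs
        .map { case (word, arr) => (word, Vectors.dense(arr.map(_.toDouble))) }
        .map { case (word, vec) => (kmModel.predict(vec), word) }        // cluster id per word
        .groupBy { case (clusterId, _) => clusterId }
        .map { case (clusterId, pairs) => (clusterId, pairs.map(_._2)) }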

Reg. Difference in Performance

2015-02-28 Thread Deep Pradhan
Hi, I am running Spark applications in GCE. I set up clusters with the number of nodes varying from 1 to 7. The machines are single-core machines. I set spark.default.parallelism to the number of nodes in the cluster for each cluster. I ran the four applications available in Spark

Re: Reg. Difference in Performance

2015-02-28 Thread Deep Pradhan
to observe more meaningful trends and speedups. Joseph On Sat, Feb 28, 2015 at 7:26 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am running Spark applications in GCE. I set up cluster with different number of nodes varying from 1 to 7. The machines are single core machines. I set

spark.default.parallelism

2015-02-27 Thread Deep Pradhan
Hi, I have four single-core machines as slaves in my cluster. I set spark.default.parallelism to 4 and ran the SparkTC example. It took around 26 sec. Now, I increased spark.default.parallelism to 8, but the performance deteriorated: the same application takes 32 sec now. I have
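A minimal sketch of how the setting can be supplied programmatically (the application name is a placeholder); with four single-core slaves only four tasks run at once, so a higher value mostly adds scheduling overhead:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: set the default number of partitions used by shuffles and
    // parallelize when no explicit value is given.
    val conf = new SparkConf()
      .setAppName("ParallelismDemo")                // placeholder name
      .set("spark.default.parallelism", "4")        // match the number of available cores
    val sc = new SparkContext(conf)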

Reg. KNN on MLlib

2015-02-26 Thread Deep Pradhan
Has the KNN classification algorithm been implemented in MLlib? Thank You Regards, Deep

Spark on EC2

2015-02-24 Thread Deep Pradhan
Hi, I have just signed up for Amazon AWS because I learnt that it provides service for free for the first 12 months. I want to run Spark on EC2 cluster. Will they charge me for this? Thank You

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
. On Tue, Feb 24, 2015 at 2:55 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have just signed up for Amazon AWS because I learnt that it provides service for free for the first 12 months. I want to run Spark on EC2 cluster. Will they charge me for this? Thank You

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
the ~$0.07/hour to play with an m3.medium, which ought to be pretty OK for basic experimentation. On Tue, Feb 24, 2015 at 3:14 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Thank You Sean. I was just trying to experiment with the performance of Spark Applications with various worker

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
testing purposes. :) Thanks Best Regards On Tue, Feb 24, 2015 at 8:25 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have just signed up for Amazon AWS because I learnt that it provides service for free for the first 12 months. I want to run Spark on EC2 cluster. Will they charge me

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
that you launched, but not on the utilisation of machine. Hope it would help. Cheers Gen On Tue, Feb 24, 2015 at 3:55 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have just signed up for Amazon AWS because I learnt that it provides service for free for the first 12 months. I

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
on a cluster, you can find more details here: https://spark.apache.org/docs/latest/spark-standalone.html Cheers Gen On Tue, Feb 24, 2015 at 4:07 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Kindly bear with my questions as I am new to this. If you run spark on local mode on a ec2

Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
Hi, If I repartition my data by a factor equal to the number of worker instances, will the performance be better or worse? As far as I understand, the performance should be better, but in my case it is becoming worse. I have a single-node standalone cluster; is it because of this? Am I guaranteed

Re: Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
. If you're running on a single node, shuffle operations become almost free (because there's no network movement), so don't read into any performance metrics you've collected to extrapolate what may happen at scale. On Monday, February 23, 2015, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, If I

Re: Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
. Generally over subscribe this. So if you have 10 free CPU cores, set num_cores to 20. On Monday, February 23, 2015, Deep Pradhan pradhandeep1...@gmail.com wrote: How is task slot different from # of Workers? so don't read into any performance metrics you've collected to extrapolate what may

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
; for instance, some ML algorithms become hard to scale out past a certain point because the increase in communication overhead outweighs the increase in parallelism. On Sat, Feb 21, 2015 at 8:19 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: So, if I keep the number of instances constant and increase

Re: Perf Prediction

2015-02-21 Thread Deep Pradhan
Has anyone done any work on that? On Sun, Feb 22, 2015 at 9:57 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, exactly. On Sun, Feb 22, 2015 at 9:10 AM, Ognen Duzlevski ognen.duzlev...@gmail.com wrote: On Sat, Feb 21, 2015 at 8:54 AM, Deep Pradhan pradhandeep1...@gmail.com wrote

Re: Perf Prediction

2015-02-21 Thread Deep Pradhan
Yes, exactly. On Sun, Feb 22, 2015 at 9:10 AM, Ognen Duzlevski ognen.duzlev...@gmail.com wrote: On Sat, Feb 21, 2015 at 8:54 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: No, I am talking about some work parallel to prediction works that are done on GPUs. Like say, given the data

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
in the same way? Thank You On Sun, Feb 22, 2015 at 10:02 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: So increasing Executors without increasing physical resources If I have a 16 GB RAM system and then I allocate 1 GB for each executor, and give number of executors as 8, then I am increasing

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
? It is like, without having the 10 nodes cluster, I can know the behavior of the application in 10 nodes cluster by having a single node with 10 workers. The time taken may vary but I am talking about the behavior. Can we say that? On Sat, Feb 21, 2015 at 8:21 PM, Deep Pradhan pradhandeep1...@gmail.com

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
in performance, right? Thank You On Sat, Feb 21, 2015 at 8:52 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I have decreased the executor memory. But,if I have to do this, then I have to tweak around with the code corresponding to each configuration right? On Sat, Feb 21, 2015 at 8

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
, 2015 at 2:37 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have been running some jobs in my local single node stand alone cluster. I am varying the worker instances for the same job, and the time taken for the job to complete increases with increase in the number of workers

Re: Perf Prediction

2015-02-21 Thread Deep Pradhan
at 8:22 PM, Ted Yu yuzhih...@gmail.com wrote: Can you be a bit more specific ? Are you asking about performance across Spark releases ? Cheers On Sat, Feb 21, 2015 at 6:38 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Has some performance prediction work been done on Spark? Thank

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I am talking about a standalone single-node cluster. No, I am not increasing parallelism. I just wanted to know if it is natural. Does message passing across the workers account for what is happening? I am running SparkKMeans, just

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
. This isn't a good way to estimate performance on a distributed cluster either. On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: No, I just have a single node standalone cluster. I am not tweaking around with the code to increase parallelism. I am just running

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
So, if I keep the number of instances constant and increase the degree of parallelism in steps, can I expect the performance to increase? Thank You On Sat, Feb 21, 2015 at 9:07 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: So, with the increase in the number of worker instances, if I also

Perf Prediction

2015-02-21 Thread Deep Pradhan
Hi, Has some performance prediction work been done on Spark? Thank You

Worker and Nodes

2015-02-21 Thread Deep Pradhan
Hi, I have been running some jobs in my local single-node standalone cluster. I am varying the worker instances for the same job, and the time taken for the job to complete increases as the number of workers increases. I repeated some experiments varying the number of nodes in a cluster too

Re: Profiling in YourKit

2015-02-07 Thread Deep Pradhan
probably takes as many threads as cores in both cases, 4. On Sat, Feb 7, 2015 at 10:14 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using YourKit tool to profile Spark jobs that is run in my Single Node Spark Cluster. When I see the YourKit UI Performance Charts, the thread

PR Request

2015-02-06 Thread Deep Pradhan
Hi, When we submit a PR in Github, there are various tests that are performed like RAT test, Scala Style Test, and beyond this many other tests which run for more time. Could anyone please direct me to the details of the tests that are performed there? Thank You

Reg GraphX APSP

2015-02-06 Thread Deep Pradhan
Hi, Is the implementation of All Pairs Shortest Path in GraphX for directed graphs or undirected graphs? When I use the algorithm with a dataset, it assumes that the graph is undirected. Has anyone come across this earlier? Thank you

Re: Reg Job Server

2015-02-05 Thread Deep Pradhan
I read somewhere about Gatling. Can that be used to profile Spark jobs? On Fri, Feb 6, 2015 at 10:27 AM, Kostas Sakellis kos...@cloudera.com wrote: Which Spark Job server are you talking about? On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Can Spark Job

Re: Reg Job Server

2015-02-05 Thread Deep Pradhan
spark job is slow? Gatling seems to be a load generating framework so I'm not sure how you'd use it (i've never used it before). Spark runs on the JVM so you can use any JVM profiling tools like YourKit. Kostas On Thu, Feb 5, 2015 at 9:03 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I

Re: Reg Job Server

2015-02-05 Thread Deep Pradhan
PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I want to know, the reason about the job being slow. I will look at YourKit. Can you redirect me to that, some tutorial in how to use? Thank You On Fri, Feb 6, 2015 at 10:44 AM, Kostas Sakellis kos...@cloudera.com wrote: When you say

Reg Job Server

2015-02-05 Thread Deep Pradhan
Hi, Can Spark Job Server be used for profiling Spark jobs?

Re: Union in Spark

2015-02-01 Thread Deep Pradhan
...@sigmoidanalytics.com wrote: Hi Deep, What is your configuration and what is the size of the 2 data sets? Thanks Arush On Mon, Feb 2, 2015 at 11:56 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: I did not check the console because once the job starts I cannot run anything else

Union in Spark

2015-02-01 Thread Deep Pradhan
Hi, Is there any better operation than union? I am using union and the cluster is getting stuck with a large data set. Thank you

Re: Union in Spark

2015-02-01 Thread Deep Pradhan
The cluster hangs. On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling...@gmail.com wrote: Hi Deep, what do you mean by stuck? Jerry On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Is there any better operation than Union. I am using union

Re: Union in Spark

2015-02-01 Thread Deep Pradhan
, 2015 at 11:53 AM, Jerry Lam chiling...@gmail.com wrote: Hi Deep, How do you know the cluster is not responsive because of Union? Did you check the spark web console? Best Regards, Jerry On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: The cluster hangs

Spark on Gordon

2015-01-31 Thread Deep Pradhan
Hi All, Gordon SC has Spark installed in it. Has anyone tried to run Spark jobs on Gordon? Thank You

While Loop

2015-01-23 Thread Deep Pradhan
Hi, Is there a better programming construct than while loop in Spark? Thank You

Re: Bind Exception

2015-01-19 Thread Deep Pradhan
ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null} 15/01/17 14:33:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null} 15/01/17 14:33:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null} .. On Tue, Jan 20, 2015 at 9:52 AM, Deep Pradhan

Bind Exception

2015-01-19 Thread Deep Pradhan
Hi, I am running a Spark job. I get the output correctly, but when I look at the log file I see the following: AbstractLifeCycle: FAILED.: java.net.BindException: Address already in use... What could be the reason for this? Thank You

Re: Bind Exception

2015-01-19 Thread Deep Pradhan
I had the Spark shell running throughout. Is it because of that? On Tue, Jan 20, 2015 at 9:47 AM, Ted Yu yuzhih...@gmail.com wrote: Was there another instance of Spark running on the same machine ? Can you pastebin the full stack trace ? Cheers On Mon, Jan 19, 2015 at 8:11 PM, Deep

Re: Bind Exception

2015-01-19 Thread Deep Pradhan
19, 2015 at 8:33 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi Ted, When I am running the same job with small data, I am able to run. But when I run it with relatively bigger set of data, it is giving me OutOfMemoryError: GC overhead limit exceeded. The first time I run the job

Re: No Output

2015-01-18 Thread Deep Pradhan
The error in the log file says: *java.lang.OutOfMemoryError: GC overhead limit exceeded* with certain task ID and the error repeats for further task IDs. What could be the problem? On Sun, Jan 18, 2015 at 2:45 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Updating the Spark version means

Re: No Output

2015-01-18 Thread Deep Pradhan
, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark-1.0.0 in a single node cluster. When I run a job with small data set it runs perfectly but when I use a data set of 350 KB, no output is being produced and when I try to run it the second time

No Output

2015-01-17 Thread Deep Pradhan
Hi, I am using Spark-1.0.0 in a single-node cluster. When I run a job with a small data set it runs perfectly, but when I use a data set of 350 KB, no output is produced, and when I try to run it the second time it gives me an exception saying that the SparkContext was shut down. Can anyone

Joins in Spark

2014-12-22 Thread Deep Pradhan
Hi, I have two RDDs, vertices and edges. Vertices is an RDD and edges is a pair RDD. I want to take a three-way join of these two. Joins work only when both the RDDs are pair RDDs, right? So, how am I supposed to take a three-way join of these RDDs? Thank You

Fwd: Joins in Spark

2014-12-22 Thread Deep Pradhan
This gives me two pair RDDs, one is the edgesRDD and the other is the verticesRDD, with each vertex padded with the value null. But I have to take a three-way join of these two RDDs, and I have only one common attribute in the two RDDs. How can I go about doing the three-way join?
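A hedged sketch of one way to express a three-way join when the only common attribute is the vertex id; the (Long, String) vertex schema and the names are assumptions, not the poster's code:

    import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions
    import org.apache.spark.rdd.RDD

    // Sketch: join works on pair RDDs keyed the same way, so the edges are
    // re-keyed between the two joins: first on the source vertex, then on the
    // destination vertex.
    def threeWayJoin(vertices: RDD[(Long, String)],
                     edges: RDD[(Long, Long)]): RDD[(Long, (String, Long, String))] =
      edges
        .join(vertices)                                                   // (src, (dst, srcAttr))
        .map { case (src, (dst, srcAttr)) => (dst, (src, srcAttr)) }
        .join(vertices)                                                   // (dst, ((src, srcAttr), dstAttr))
        .map { case (dst, ((src, srcAttr), dstAttr)) => (src, (srcAttr, dst, dstAttr)) }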

Profiling GraphX codes.

2014-12-05 Thread Deep Pradhan
Is there any tool to profile GraphX codes in a cluster? Is there a way to know the messages exchanged among the nodes in a cluster? WebUI does not give all the information. Thank You

Determination of number of RDDs

2014-12-04 Thread Deep Pradhan
Hi, I have a graph and I want to create RDDs equal in number to the nodes in the graph. How can I do that? If I have 10 nodes then I want to create 10 RDDs. Is that possible in GraphX? Like in the C language we have an array of pointers. Do we have an array of RDDs in Spark? Can we create such an array and
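A small sketch showing that an ordinary Scala Array can hold RDD references on the driver (one per node); RDDs cannot be nested inside other RDDs, so the array itself stays local. The method name is a placeholder:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Sketch: build an Array of RDDs on the driver, one RDD per graph node.
    def makeRddArray(sc: SparkContext, numNodes: Int): Array[RDD[Int]] =
      Array.tabulate(numNodes)(i => sc.parallelize(Seq(i)))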

Re: Filter using the Vertex Ids

2014-12-03 Thread Deep Pradhan
And one more thing, the given tuples (1, 1.0) (2, 1.0) (3, 2.0) (4, 2.0) (5, 0.0) are part of an RDD and they are not just tuples. graph.vertices returns me the above tuples, which are part of a VertexRDD. On Wed, Dec 3, 2014 at 3:43 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: This is just

Re: Filter using the Vertex Ids

2014-12-03 Thread Deep Pradhan
in that case? We cannot do *sc.parallelize(List(VertexRDD)), *can we? On Wed, Dec 3, 2014 at 3:32 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-12-02 22:01:20 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote: I have a graph which returns the following on doing graph.vertices (1, 1.0) (2

SVD Plus Plus in GraphX

2014-11-27 Thread Deep Pradhan
Hi, I was just going through the two codes in GraphX, namely SVDPlusPlus and TriangleCount. In the first I see an RDD as an input to run, i.e., run(edges: RDD[Edge[Double]],...) and in the other I see run(VD:..., ED:...) Can anyone explain to me the difference between these two? In fact SVDPlusPlus is the

Undirected Graphs in GraphX-Pregel

2014-11-26 Thread Deep Pradhan
Hi, I was going through this paper on Pregel titled, Pregel: A System for Large-Scale Graph Processing. In the second section named Model Of Computation, it says that the input to a Pregel computation is a directed graph. Is it the same in the Pregel abstraction of GraphX too? Do we always

Edge List File in GraphX

2014-11-24 Thread Deep Pradhan
Hi, Is it necessary for every vertex to have an attribute when we load a graph into GraphX? In other words, if I have an edge list file containing pairs of vertices, i.e., 1 2 means that there is an edge between node 1 and node 2. Now, when I run PageRank on this data it returns NaN. Can I use
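A minimal sketch, assuming a whitespace-separated "srcId dstId" edge list (the path and method name are placeholders): GraphLoader fills in a default vertex attribute of 1, so no separate attribute file is required:

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.GraphLoader

    // Sketch: load the edge list and run PageRank until convergence.
    def loadAndRank(sc: SparkContext, path: String): Unit = {
      val graph = GraphLoader.edgeListFile(sc, path)   // vertices get the default attribute 1
      val ranks = graph.pageRank(0.0001).vertices      // (vertexId, rank) pairs
      ranks.collect().foreach(println)
    }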

Re: New Codes in GraphX

2014-11-24 Thread Deep Pradhan
Could it be because my edge list file is in the form (1 2), where there is an edge between node 1 and node 2? On Tue, Nov 18, 2014 at 4:13 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-18 15:51:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes the above command works

New Codes in GraphX

2014-11-18 Thread Deep Pradhan
Hi, I am using Spark-1.0.0. There are two GraphX directories that I can see here 1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx which contains LiveJournalPageRank.scala 2. spark-1.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib which contains

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Deep Pradhan
So landmark can contain just one vertex right? Which algorithm has been used to compute the shortest path? Thank You On Tue, Nov 18, 2014 at 2:53 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-17 14:47:50 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I was going through

Re: Running PageRank in GraphX

2014-11-18 Thread Deep Pradhan
There are no vertices of zero outdegree. The total rank for the graph with numIter = 10 is 4.99 and for the graph with numIter = 100 is 5.99. I do not know why there is so much variation. On Tue, Nov 18, 2014 at 3:22 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-18 12:02:52 +0530, Deep Pradhan

Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Deep Pradhan
Does Bellman-Ford give the best solution? On Tue, Nov 18, 2014 at 3:27 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-18 14:59:20 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: So landmark can contain just one vertex right? Right. Which algorithm has been used to compute

Re: New Codes in GraphX

2014-11-18 Thread Deep Pradhan
=EdgePartition2D* Now, how do I run the LiveJournalPageRank.scala that is there in 1? On Tue, Nov 18, 2014 at 2:51 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark-1.0.0. There are two GraphX directories that I can see here 1. spark-1.0.0/examples/src/main/scala/org/apache/spark

Re: New Codes in GraphX

2014-11-18 Thread Deep Pradhan
, Nov 18, 2014 at 3:35 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-18 14:51:54 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I am using Spark-1.0.0. There are two GraphX directories that I can see here 1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx

Re: New Codes in GraphX

2014-11-18 Thread Deep Pradhan
Yes, the above command works, but there is this problem. Most of the time, the total rank is NaN (Not a Number). Why is it so? Thank You On Tue, Nov 18, 2014 at 3:48 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: What command should I use to run the LiveJournalPageRank.scala? If you want

Landmarks in GraphX section of Spark API

2014-11-17 Thread Deep Pradhan
Hi, I was going through the graphx section in the Spark API at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$ Here, I find the word landmark. Can anyone explain to me what landmark means? Is it a simple English word or does it mean
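A small sketch of how the landmarks argument is used: it is simply the sequence of target vertex ids (a single vertex is fine), and each vertex ends up with a map of hop distances to the reachable landmarks. The wrapper method name is a placeholder:

    import scala.reflect.ClassTag
    import org.apache.spark.graphx.{Graph, VertexId}
    import org.apache.spark.graphx.lib.ShortestPaths

    // Sketch: shortest-path distances (in hops) from every vertex to one landmark.
    def distancesToLandmark[VD, ED: ClassTag](graph: Graph[VD, ED], landmark: VertexId): Unit = {
      val result = ShortestPaths.run(graph, Seq(landmark))
      result.vertices.collect().foreach { case (id, spMap) => println(s"$id -> $spMap") }
    }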

Running PageRank in GraphX

2014-11-17 Thread Deep Pradhan
Hi, I just ran the PageRank code in GraphX with some sample data. What I am seeing is that the total rank changes drastically if I change the number of iterations from 10 to 100. Why is that so? Thank You

Functions in Spark

2014-11-16 Thread Deep Pradhan
Hi, Is there any way to know which of my functions performs better in Spark? In other words, say I have achieved the same thing using two different implementations. How do I judge which implementation is better than the other? Is processing time the only metric that we can use to claim the
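A crude, hedged sketch of one way to compare two implementations by wall-clock time: wrap the action that forces each one in a timer. rddA and rddB are placeholders for the two implementations:

    // Sketch: time a block of code and print the elapsed seconds.
    def time[A](label: String)(body: => A): A = {
      val start = System.nanoTime()
      val result = body
      println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
      result
    }

    // Hypothetical usage: force each implementation with an action and compare.
    // time("implementation A") { rddA.count() }
    // time("implementation B") { rddB.count() }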

Re: toLocalIterator in Spark 1.0.0

2014-11-14 Thread Deep Pradhan
as a method of an existing RDD if you have one. - Patrick On Thu, Nov 13, 2014 at 10:21 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark 1.0.0 and Scala 2.10.3. I want to use toLocalIterator in a code but the spark shell tells not found: value toLocalIterator

toLocalIterator in Spark 1.0.0

2014-11-13 Thread Deep Pradhan
Hi, I am using Spark 1.0.0 and Scala 2.10.3. I want to use toLocalIterator in my code but the Spark shell says *not found: value toLocalIterator* I also did import org.apache.spark.rdd but even after this the shell says *object toLocalIterator is not a member of package org.apache.spark.rdd*
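A minimal sketch: toLocalIterator is a method on an RDD instance rather than a standalone value or an importable object, so it is invoked as rdd.toLocalIterator. The app name and local master are placeholders for a quick demo:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: stream the RDD's elements to the driver one partition at a time.
    val sc = new SparkContext(new SparkConf().setAppName("LocalIteratorDemo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 10, 2)
    rdd.toLocalIterator.foreach(println)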

Pass RDD to functions

2014-11-12 Thread Deep Pradhan
Hi, Can we pass RDDs to functions? Like, can we do the following? *def func (temp: RDD[String]): RDD[String] = {* *//body of the function* *}* Thank You
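Yes in principle: an RDD is an ordinary Scala value on the driver. A minimal sketch (the function name and input path are placeholders):

    import org.apache.spark.rdd.RDD

    // Sketch: a driver-side function that takes an RDD and returns a new RDD.
    def toUpper(lines: RDD[String]): RDD[String] =
      lines.map(_.toUpperCase)

    // Hypothetical usage, given a SparkContext sc:
    // val shouted = toUpper(sc.textFile("input.txt"))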

Queues

2014-11-09 Thread Deep Pradhan
Has anyone implemented Queues using RDDs? Thank You

GraphX and Spark

2014-11-04 Thread Deep Pradhan
Hi, Can Spark achieve whatever GraphX can? Keeping aside the performance comparison between Spark and GraphX, if I want to implement any graph algorithm and I do not want to use GraphX, can I get the work done with Spark? Thank You

RDD of Iterable[String]

2014-09-24 Thread Deep Pradhan
Can we iterate over an RDD of Iterable[String]? How do we do that? Because the entire Iterable[String] seems to be a single element in the RDD. Thank You
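A short sketch of the two usual options, assuming an RDD[Iterable[String]]: operate on each Iterable inside a transformation, or flatten it so each inner string becomes its own RDD element. The method names are placeholders:

    import org.apache.spark.rdd.RDD

    // Sketch: each RDD element is a whole Iterable[String].
    def sizes(groups: RDD[Iterable[String]]): RDD[Int] =
      groups.map(group => group.size)      // iterate inside the transformation

    def flattened(groups: RDD[Iterable[String]]): RDD[String] =
      groups.flatMap(identity)             // one RDD element per inner string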

Converting one RDD to another

2014-09-23 Thread Deep Pradhan
Hi, Is it always possible to get one RDD from another? For example, if I do a *top(K)(Ordering)*, I get an Int, right? (In my example the type is Int.) I do not get an RDD. Can anyone explain this to me? Thank You
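A short sketch of the distinction: transformations return new RDDs, while actions such as top(k) return plain local values (here an Array[Int]), which parallelize can turn back into an RDD if needed. The method name is a placeholder:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Sketch: top is an action, so its result is a local Array on the driver.
    def topAsRdd(sc: SparkContext, nums: RDD[Int], k: Int): RDD[Int] = {
      val topK: Array[Int] = nums.top(k)   // e.g. top(1) gives an Array with one element
      sc.parallelize(topK)                 // back to an RDD, if one is required
    }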

Change RDDs using map()

2014-09-17 Thread Deep Pradhan
Hi, I want to make the following changes in the RDD (create a new RDD from the existing one to reflect some transformation): In an RDD of key-value pairs, I want to get the keys for which the values are 1. How do I do this using map()? Thank You
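A minimal sketch, assuming String keys and Int values: map alone cannot drop elements, so the selection is expressed with filter followed by keys. The types and method name are assumptions:

    import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions
    import org.apache.spark.rdd.RDD

    // Sketch: keep only the pairs whose value is 1, then take their keys.
    def keysWithValueOne(pairs: RDD[(String, Int)]): RDD[String] =
      pairs.filter { case (_, value) => value == 1 }.keys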

Re: Spark and Scala

2014-09-13 Thread Deep Pradhan
/presentation/zaharia On Sat, Sep 13, 2014 at 12:06 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Take for example this: I have declared one queue *val queue = Queue.empty[Int]*, which is a pure scala line in the program. I actually want the queue to be an RDD but there are no direct

Re: Spark and Scala

2014-09-13 Thread Deep Pradhan
(1)(Ordering.by(f = f._2))* The nodeSizeTuple is an RDD,but rootNode is an array. Here I have used all RDD operations, but I am getting an array. What about this case? On Sat, Sep 13, 2014 at 11:45 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Is it always true that whenever we apply

RDDs and Immutability

2014-09-13 Thread Deep Pradhan
Hi, We all know that RDDs are immutable. There are not enough operations that can achieve anything and everything on RDDs. Take for example this: I want an Array of Bytes filled with zeros which should change during the program. Some elements of that Array should change to 1. If I make an RDD with

Spark and Scala

2014-09-12 Thread Deep Pradhan
There is one thing that I am confused about. Spark has code that has been implemented in Scala. Now, can we run any Scala code on the Spark framework? What will be the difference in the execution of the Scala code on normal systems and on Spark? The reason for my question is the following: I had

Re: Spark and Scala

2014-09-12 Thread Deep Pradhan
by Spark. An Int is just a Scala Int. You can't call unpersist on Int in Scala, and that doesn't change in Spark. On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: There is one thing that I am confused about. Spark has codes that have been implemented in Scala

Re: Spark and Scala

2014-09-12 Thread Deep Pradhan
porting. Spark is an application as far as scala is concerned - there is no compilation (except of course, the scala, JIT compilation etc). On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I know that unpersist is a method on RDD. But my confusion is that, when we

Unpersist

2014-09-11 Thread Deep Pradhan
I want to create temporary variables in Spark code. Can I do this? for (i <- num) { val temp = .. { do something } temp.unpersist() } Thank You
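A small sketch of the pattern, with base (an existing RDD[Int]) and num as placeholders: a val defined inside the loop body is already temporary, and unpersist only matters if the RDD was cached:

    // Sketch: create a cached temporary RDD each iteration and release it afterwards.
    // base: RDD[Int] and num: Int are assumed to exist.
    for (i <- 0 until num) {
      val temp = base.map(_ + i).cache()
      println(temp.count())                // do something that uses temp
      temp.unpersist()                     // free the cached blocks before the next iteration
    }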

Re: Unpersist

2014-09-11 Thread Deep Pradhan
at 3:26 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I want to create a temporary variables in a spark code. Can I do this? for (i - num) { val temp = .. { do something } temp.unpersist() } Thank You

How to change the values in Array of Bytes

2014-09-06 Thread Deep Pradhan
Hi, I have an array of bytes and I have filled the array with 0 in all the positions. *var Array = Array.fill[Byte](10)(0)* Now, if certain conditions are satisfied, I want to change some elements of the array to 1 instead of 0. If I run, *if (Array.apply(index)==0) Array.apply(index) = 1*
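A minimal sketch of the idiomatic update syntax; the lower-case name avoids shadowing scala.Array, and flags(index) = 1 is what Scala desugars to flags.update(index, 1). The index value is a placeholder:

    // Sketch: read with flags(index), write with flags(index) = 1.
    val flags = Array.fill[Byte](10)(0)
    val index = 3                          // hypothetical position to mark
    if (flags(index) == 0) flags(index) = 1
    println(flags.mkString(" "))           // 0 0 0 1 0 0 0 0 0 0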

Recursion

2014-09-05 Thread Deep Pradhan
Hi, Does Spark support recursive calls?

Array and RDDs

2014-09-05 Thread Deep Pradhan
Hi, I have an input file which consists of src_node dest_node. I have created an RDD consisting of key-value pairs where the key is the node id and the values are the children of that node. Now I want to associate a byte with each node. For that I have created a byte array. Every time I print out the

Iterate over ArrayBuffer

2014-09-04 Thread Deep Pradhan
Hi, I have the following ArrayBuffer *ArrayBuffer(5,3,1,4)* Now, I want to iterate over the ArrayBuffer. What is the way to do it? Thank You
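A minimal sketch: an ArrayBuffer supports the usual Scala collection iteration forms:

    import scala.collection.mutable.ArrayBuffer

    // Sketch: three equivalent ways to visit the elements.
    val buf = ArrayBuffer(5, 3, 1, 4)
    buf.foreach(println)               // method call
    for (x <- buf) println(x)          // for-comprehension
    val doubled = buf.map(_ * 2)       // transform into a new ArrayBuffer(10, 6, 2, 8)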

Number of elements in ArrayBuffer

2014-09-02 Thread Deep Pradhan
Hi, I have the following ArrayBuffer: *ArrayBuffer(5,3,1,4)* Now, I want to get the number of elements in this ArrayBuffer and also the first element of the ArrayBuffer. I used .length and .size but they are returning 1 instead of 4. I also used .head and .last for getting the first and the last

Re: Number of elements in ArrayBuffer

2014-09-02 Thread Deep Pradhan
) a: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(5, 3, 1, 4) scala a.head res2: Int = 5 scala a.tail res3: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(3, 1, 4) scala a.length res4: Int = 4 Regards, Rajesh On Wed, Sep 3, 2014 at 10:13 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have

Key-Value Operations

2014-08-28 Thread Deep Pradhan
Hi, I have an RDD of key-value pairs. Now I want to find the key whose values have the largest number of elements. How should I do that? Basically I want to select the key for which the number of items in the values is the largest. Thank You
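A minimal sketch, assuming the values are already grouped per key (for example after groupByKey) and assuming String keys and Int values: map each pair to its size and reduce to the largest. The types and method name are assumptions:

    import org.apache.spark.rdd.RDD

    // Sketch: pick the key whose value collection has the most elements.
    def keyWithMostValues(pairs: RDD[(String, Iterable[Int])]): String =
      pairs.map { case (key, values) => (key, values.size) }
           .reduce((a, b) => if (a._2 >= b._2) a else b)
           ._1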

Re: Printing the RDDs in SparkPageRank

2014-08-26 Thread Deep Pradhan
println(parts(0)) does not solve the problem. It does not work On Mon, Aug 25, 2014 at 1:30 PM, Sean Owen so...@cloudera.com wrote: On Mon, Aug 25, 2014 at 7:18 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: When I add parts(0).collect().foreach(println) parts(1).collect

Key-Value in PairRDD

2014-08-26 Thread Deep Pradhan
I have the following code *val nodes = lines.map(s ={val fields = s.split(\\s+) (fields(0),fields(1))}).distinct().groupByKey().cache()* and when I print out the nodes RDD I get the following *(4,ArrayBuffer(1))(2,ArrayBuffer(1))(3,ArrayBuffer(1))(1,ArrayBuffer(3, 2,

Re: Printing the RDDs in SparkPageRank

2014-08-25 Thread Deep Pradhan
parameter pf.parts.collect().foreach(println) * On Sun, Aug 24, 2014 at 8:27 PM, Jörn Franke jornfra...@gmail.com wrote: Hi, What kind of error do you receive? Best regards, Jörn Le 24 août 2014 08:29, Deep Pradhan pradhandeep1...@gmail.com a écrit : Hi, I was going through

Printing the RDDs in SparkPageRank

2014-08-24 Thread Deep Pradhan
Hi, I was going through the SparkPageRank code and want to see the intermediate steps, like the RDDs formed in the intermediate steps. Here is a part of the code along with the lines that I added in order to print the RDDs. I want to print the *parts* in the code (denoted by the comment in Bold

Program without doing assembly

2014-08-16 Thread Deep Pradhan
Hi, I am just playing around with the codes in Spark. I am printing out some statements of the codes given in Spark so as to see how it looks. Every time I change/add something to the code I have to run the command *SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly* which is tiresome at times. Is

Error in sbt/sbt package

2014-08-15 Thread Deep Pradhan
I am getting the following error while doing SPARK_HADOOP_VERSION=2.3.0 sbt/sbt package java.io.IOException: Cannot run program /home/deep/spark-1.0.0/usr/lib/jvm/java-7-oracle/bin/javac: error=2, No such file or directory ...lots of errors [error] (core/compile:compile)

Minimum Split of Hadoop RDD

2014-08-08 Thread Deep Pradhan
Hi, I am using a single node Spark cluster on HDFS. When I was going through the SparkPageRank.scala code, I came across the following line: *val lines = ctx.textFile(args(0), 1)* where, args(0) is the path of the input file from the HDFS, and the second argument is the minimum split of Hadoop
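A short sketch reusing the names ctx and args(0) from the quoted line: the second argument to textFile is a minimum number of partitions (input splits), i.e. a lower bound rather than an exact count:

    // Sketch: ask for at least 1 partition vs. at least 8 partitions.
    val lines     = ctx.textFile(args(0), 1)
    val moreParts = ctx.textFile(args(0), 8)   // more partitions allow more parallel tasks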

Re: GraphX runs without Spark?

2014-08-03 Thread Deep Pradhan
this mean? I can work even without Spark coming up? Does the same thing happen even if I have a multi-node cluster? Thank You On Sun, Aug 3, 2014 at 2:24 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-08-03 13:14:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I have a single node

GraphX

2014-08-02 Thread Deep Pradhan
Hi, I am running Spark in a single node cluster. I am able to run the codes in Spark like SparkPageRank.scala and SparkKMeans.scala by the following command, bin/run-example org.apache.spark.examples.SparkPageRank and the required things. Now, I want to run the PageRank.scala that is there in GraphX.
