Clustering of Words
Hi, I am trying to cluster the words of some articles. I used TF-IDF and Word2Vec in Spark to get a vector for each word, and I used KMeans to cluster those vectors. Now, is there any way to get back the words from the vectors? I want to know which words are in each cluster. I am aware that TF-IDF does not have an inverse. Does anyone know how to recover the words from the clusters? Thank You Regards, Deep
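One way around the missing inverse is to never discard the words in the first place: keep each word paired with its vector and predict a cluster per pair. A minimal sketch, assuming a fitted MLlib Word2VecModel named w2vModel (its getVectors map, available from Spark 1.2, goes from word to vector) and illustrative KMeans settings:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// w2vModel is assumed to be a fitted org.apache.spark.mllib.feature.Word2VecModel;
// getVectors exposes the word -> vector map, so the words are never lost.
val wordVecs = w2vModel.getVectors.toSeq.map { case (word, arr) =>
  (word, Vectors.dense(arr.map(_.toDouble)))
}
val vecRDD = sc.parallelize(wordVecs.map(_._2)).cache()
val kmeans = KMeans.train(vecRDD, 10, 20)   // k = 10, 20 iterations (illustrative)

// Predict a cluster for each (word, vector) pair and group the words by cluster id.
val clusters = sc.parallelize(wordVecs)
  .map { case (word, vec) => (kmeans.predict(vec), word) }
  .groupByKey()
clusters.collect().foreach { case (id, words) =>
  println(s"cluster $id: ${words.mkString(", ")}")
}

The same bookkeeping applies to TF-IDF: pair each original term with its vector before clustering, rather than trying to invert the hashing afterwards.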
Reg. Difference in Performance
Hi, I am running Spark applications on GCE. I set up clusters with the number of nodes varying from 1 to 7. The machines are single-core machines. I set spark.default.parallelism to the number of nodes for each cluster. I ran the four applications available in the Spark examples, SparkTC, SparkALS, SparkLR and SparkPi, on each of the configurations. What I notice is the following: in the case of SparkTC and SparkALS, the time to complete the job increases with the number of nodes in the cluster, whereas in SparkLR and SparkPi, the time to complete the job remains the same across all the configurations. Could anyone explain this to me? Thank You Regards, Deep
Re: Reg. Difference in Performance
Do you mean the size of the data that we use? Thank You Regards, Deep On Sun, Mar 1, 2015 at 6:04 AM, Joseph Bradley jos...@databricks.com wrote: Hi Deep, Compute times may not be very meaningful for small examples like those. If you increase the sizes of the examples, then you may start to observe more meaningful trends and speedups. Joseph On Sat, Feb 28, 2015 at 7:26 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am running Spark applications on GCE. I set up clusters with the number of nodes varying from 1 to 7. The machines are single-core machines. I set spark.default.parallelism to the number of nodes for each cluster. I ran the four applications available in the Spark examples, SparkTC, SparkALS, SparkLR and SparkPi, on each of the configurations. What I notice is the following: in the case of SparkTC and SparkALS, the time to complete the job increases with the number of nodes in the cluster, whereas in SparkLR and SparkPi, the time to complete the job remains the same across all the configurations. Could anyone explain this to me? Thank You Regards, Deep
spark.default.parallelism
Hi, I have four single-core machines as slaves in my cluster. I set spark.default.parallelism to 4 and ran the SparkTC example. It took around 26 sec. Then I increased spark.default.parallelism to 8, but performance deteriorated: the same application now takes 32 sec. I have read that the best performance is usually obtained when the parallelism is set to 2x the number of available cores. I do not quite understand this. Could anyone please explain? Thank You Regards, Deep
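For context, the 2x rule of thumb is about keeping every core busy and smoothing out stragglers on jobs whose tasks run for a while; on a job as short as this SparkTC run, doubling the task count mostly doubles scheduling and serialization overhead, which would explain the slowdown. A sketch of setting the value programmatically (names illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// With 4 single-core workers there are 4 task slots in total, so 2 x 4 = 8 is
// the usual starting point -- but only for jobs with enough work per task.
val conf = new SparkConf()
  .setAppName("ParallelismExperiment")
  .set("spark.default.parallelism", "8")
val sc = new SparkContext(conf)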
Reg. KNN on MLlib
Has a KNN classification algorithm been implemented in MLlib? Thank You Regards, Deep
Spark on EC2
Hi, I have just signed up for Amazon AWS because I learnt that it provides services free for the first 12 months. I want to run Spark on an EC2 cluster. Will they charge me for this? Thank You
Re: Spark on EC2
Thank You Sean. I was just trying to experiment with the performance of Spark applications with various worker instances (I hope you remember that we discussed the worker instances). I thought EC2 would be a good place to try it. So it doesn't work out, does it? Thank You On Tue, Feb 24, 2015 at 8:40 PM, Sean Owen so...@cloudera.com wrote: The free tier includes 750 hours of t2.micro instance time per month. http://aws.amazon.com/free/ That's basically a month of hours, so it's all free if you run only one instance at a time. If you run 4, you'll be able to run your cluster of 4 for about a week free. A t2.micro has 1GB of memory, which is small but something you could possibly get work done with. However it provides only burst CPU. You can only use about 10% of 1 vCPU continuously due to capping. Imagine this as about 1/10th of 1 core on your laptop. It would be incredibly slow. This is not to mention the network and I/O bottlenecks you're likely to run into, as you don't get much provisioning with these free instances. So, no, you really can't use this for anything that is at all CPU intensive. It's for, say, running a low-traffic web service. On Tue, Feb 24, 2015 at 2:55 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have just signed up for Amazon AWS because I learnt that it provides services free for the first 12 months. I want to run Spark on an EC2 cluster. Will they charge me for this? Thank You
Re: Spark on EC2
No, I think I am OK with the time it takes. It is just that, as I increase the number of partitions along with the number of workers, I want to see the performance of the application improve. I just want to observe that effect. Any comments? Thank You On Tue, Feb 24, 2015 at 8:52 PM, Sean Owen so...@cloudera.com wrote: You can definitely, easily, try a 1-node standalone cluster for free. Just don't be surprised when the CPU capping kicks in within about 5 minutes of any non-trivial computation and suddenly the instance is very s-l-o-w. I would consider just paying the ~$0.07/hour to play with an m3.medium, which ought to be pretty OK for basic experimentation. On Tue, Feb 24, 2015 at 3:14 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Thank You Sean. I was just trying to experiment with the performance of Spark applications with various worker instances (I hope you remember that we discussed the worker instances). I thought EC2 would be a good place to try it. So it doesn't work out, does it? Thank You On Tue, Feb 24, 2015 at 8:40 PM, Sean Owen so...@cloudera.com wrote: The free tier includes 750 hours of t2.micro instance time per month. http://aws.amazon.com/free/ That's basically a month of hours, so it's all free if you run only one instance at a time. If you run 4, you'll be able to run your cluster of 4 for about a week free. A t2.micro has 1GB of memory, which is small but something you could possibly get work done with. However it provides only burst CPU. You can only use about 10% of 1 vCPU continuously due to capping. Imagine this as about 1/10th of 1 core on your laptop. It would be incredibly slow. This is not to mention the network and I/O bottlenecks you're likely to run into, as you don't get much provisioning with these free instances. So, no, you really can't use this for anything that is at all CPU intensive. It's for, say, running a low-traffic web service. On Tue, Feb 24, 2015 at 2:55 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have just signed up for Amazon AWS because I learnt that it provides services free for the first 12 months. I want to run Spark on an EC2 cluster. Will they charge me for this? Thank You
Re: Spark on EC2
Thank You Akhil. Will look into it. It's free, isn't it? I am still a student :) On Tue, Feb 24, 2015 at 9:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote: If you sign up for Google Compute Cloud, you will get $300 of free credits for 3 months, and you can start a pretty good cluster for your testing purposes. :) Thanks Best Regards On Tue, Feb 24, 2015 at 8:25 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have just signed up for Amazon AWS because I learnt that it provides services free for the first 12 months. I want to run Spark on an EC2 cluster. Will they charge me for this? Thank You
Re: Spark on EC2
Kindly bear with my questions, as I am new to this. "If you run Spark in local mode on an EC2 machine" What does this mean? Is it that I launch a Spark cluster from my local machine, i.e., by running the shell script that is there in /spark/ec2? On Tue, Feb 24, 2015 at 8:32 PM, gen tang gen.tan...@gmail.com wrote: Hi, As a real Spark cluster needs at least one master and one slave, you need to launch two machines, and therefore the second machine is not free. However, if you run Spark in local mode on an EC2 machine, it is free. The charge for AWS depends on how many machines you launch and of which types, not on how much you use them. Hope this helps. Cheers Gen On Tue, Feb 24, 2015 at 3:55 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have just signed up for Amazon AWS because I learnt that it provides services free for the first 12 months. I want to run Spark on an EC2 cluster. Will they charge me for this? Thank You
Re: Spark on EC2
Thank You All. I think I will look into paying ~$0.07/hr as Sean suggested. On Tue, Feb 24, 2015 at 9:01 PM, gen tang gen.tan...@gmail.com wrote: Hi, I am sorry that I made a mistake about the AWS pricing. You can read Sean Owen's email, which better explains the strategies for running Spark on AWS. For your question: it means that you just download Spark and unzip it, then run the Spark shell with ./bin/spark-shell or ./bin/pyspark. It is useful for getting familiar with Spark. You can do this on your laptop as well as on EC2. In fact, running ./ec2/spark-ec2 launches Spark standalone mode on a cluster; you can find more details here: https://spark.apache.org/docs/latest/spark-standalone.html Cheers Gen On Tue, Feb 24, 2015 at 4:07 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Kindly bear with my questions, as I am new to this. "If you run Spark in local mode on an EC2 machine" What does this mean? Is it that I launch a Spark cluster from my local machine, i.e., by running the shell script that is there in /spark/ec2? On Tue, Feb 24, 2015 at 8:32 PM, gen tang gen.tan...@gmail.com wrote: Hi, As a real Spark cluster needs at least one master and one slave, you need to launch two machines, and therefore the second machine is not free. However, if you run Spark in local mode on an EC2 machine, it is free. The charge for AWS depends on how many machines you launch and of which types, not on how much you use them. Hope this helps. Cheers Gen On Tue, Feb 24, 2015 at 3:55 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have just signed up for Amazon AWS because I learnt that it provides services free for the first 12 months. I want to run Spark on an EC2 cluster. Will they charge me for this? Thank You
Repartition and Worker Instances
Hi, If I repartition my data by a factor equal to the number of worker instances, will the performance be better or worse? As far as I understand, the performance should be better, but in my case it is becoming worse. I have a single node standalone cluster; is it because of this? Am I guaranteed better performance if I do the same thing on a multi-node cluster? Thank You
Re: Repartition and Worker Instances
How is a task slot different from the number of Workers? And "so don't read into any performance metrics you've collected to extrapolate what may happen at scale" — I did not get you on this. Thank You On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui same...@databricks.com wrote: In general you should first figure out how many task slots are in the cluster and then repartition the RDD to maybe 2x that #. So if you have 100 slots, then maybe RDDs with a partition count of 100-300 would be normal. But also the size of each partition can matter. You want a task to operate on a partition for at least 200ms, but no longer than around 20 seconds. Even if you have 100 slots, it could be okay to have an RDD with 10,000 partitions if you've read in a large file. So don't repartition your RDD to match the # of Worker JVMs, but rather align it to the total # of task slots in the Executors. If you're running on a single node, shuffle operations become almost free (because there's no network movement), so don't read into any performance metrics you've collected to extrapolate what may happen at scale. On Monday, February 23, 2015, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, If I repartition my data by a factor equal to the number of worker instances, will the performance be better or worse? As far as I understand, the performance should be better, but in my case it is becoming worse. I have a single node standalone cluster; is it because of this? Am I guaranteed better performance if I do the same thing on a multi-node cluster? Thank You
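A sketch of the advice above, deriving a rough slot count from the scheduler default (total cores, in standalone mode) rather than from the Worker count; the input path is illustrative:

// sc.defaultParallelism reflects the total cores available to the application
// in standalone mode, which approximates the number of task slots.
val slots = sc.defaultParallelism
val data = sc.textFile("hdfs:///data/input.txt")   // path is illustrative
val tuned = data.repartition(slots * 2)            // aim for ~2x the task slots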
Re: Repartition and Worker Instances
You mean SPARK_WORKER_CORES in /conf/spark-env.sh? On Mon, Feb 23, 2015 at 11:06 PM, Sameer Farooqui same...@databricks.com wrote: In Standalone mode, a Worker JVM starts an Executor. Inside the Exec there are slots for task threads. The slot count is configured by the num_cores setting. Generally over subscribe this. So if you have 10 free CPU cores, set num_cores to 20. On Monday, February 23, 2015, Deep Pradhan pradhandeep1...@gmail.com wrote: How is task slot different from # of Workers? so don't read into any performance metrics you've collected to extrapolate what may happen at scale. I did not get you in this. Thank You On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui same...@databricks.com wrote: In general you should first figure out how many task slots are in the cluster and then repartition the RDD to maybe 2x that #. So if you have a 100 slots, then maybe RDDs with partition count of 100-300 would be normal. But also size of each partition can matter. You want a task to operate on a partition for at least 200ms, but no longer than around 20 seconds. Even if you have 100 slots, it could be okay to have a RDD with 10,000 partitions if you've read in a large file. So don't repartition your RDD to match the # of Worker JVMs, but rather align it to the total # of task slots in the Executors. If you're running on a single node, shuffle operations become almost free (because there's no network movement), so don't read into any performance metrics you've collected to extrapolate what may happen at scale. On Monday, February 23, 2015, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, If I repartition my data by a factor equal to the number of worker instances, will the performance be better or worse? As far as I understand, the performance should be better, but in my case it is becoming worse. I have a single node standalone cluster, is it because of this? Am I guaranteed to have a better performance if I do the same thing in a multi-node cluster? Thank You
Re: Worker and Nodes
So increasing Executors without increasing physical resources If I have a 16 GB RAM system and then I allocate 1 GB for each executor, and give number of executors as 8, then I am increasing the resource right? In this case, how do you explain? Thank You On Sun, Feb 22, 2015 at 6:12 AM, Aaron Davidson ilike...@gmail.com wrote: Note that the parallelism (i.e., number of partitions) is just an upper bound on how much of the work can be done in parallel. If you have 200 partitions, then you can divide the work among between 1 and 200 cores and all resources will remain utilized. If you have more than 200 cores, though, then some will not be used, so you would want to increase parallelism further. (There are other rules-of-thumb -- for instance, it's generally good to have at least 2x more partitions than cores for straggler mitigation, but these are essentially just optimizations.) Further note that when you increase the number of Executors for the same set of resources (i.e., starting 10 Executors on a single machine instead of 1), you make Spark's job harder. Spark has to communicate in an all-to-all manner across Executors for shuffle operations, and it uses TCP sockets to do so whether or not the Executors happen to be on the same machine. So increasing Executors without increasing physical resources means Spark has to do more communication to do the same work. We expect that increasing the number of Executors by a factor of 10, given an increase in the number of physical resources by the same factor, would also improve performance by 10x. This is not always the case for the precise reason above (increased communication overhead), but typically we can get close. The actual observed improvement is very algorithm-dependent, though; for instance, some ML algorithms become hard to scale out past a certain point because the increase in communication overhead outweighs the increase in parallelism. On Sat, Feb 21, 2015 at 8:19 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: So, if I keep the number of instances constant and increase the degree of parallelism in steps, can I expect the performance to increase? Thank You On Sat, Feb 21, 2015 at 9:07 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: So, with the increase in the number of worker instances, if I also increase the degree of parallelism, will it make any difference? I can use this model even the other way round right? I can always predict the performance of an app with the increase in number of worker instances, the deterioration in performance, right? Thank You On Sat, Feb 21, 2015 at 8:52 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I have decreased the executor memory. But,if I have to do this, then I have to tweak around with the code corresponding to each configuration right? On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen so...@cloudera.com wrote: Workers has a specific meaning in Spark. You are running many on one machine? that's possible but not usual. Each worker's executors have access to a fraction of your machine's resources then. If you're not increasing parallelism, maybe you're not actually using additional workers, so are using less resource for your problem. Or because the resulting executors are smaller, maybe you're hitting GC thrashing in these executors with smaller heaps. Or if you're not actually configuring the executors to use less memory, maybe you're over-committing your RAM and swapping? Bottom line, you wouldn't use multiple workers on one small standalone node. 
This isn't a good way to estimate performance on a distributed cluster either. On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: No, I just have a single node standalone cluster. I am not tweaking around with the code to increase parallelism. I am just running SparkKMeans that is there in Spark-1.0.0 I just wanted to know, if this behavior is natural. And if so, what causes this? Thank you On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen so...@cloudera.com wrote: What's your storage like? are you adding worker machines that are remote from where the data lives? I wonder if it just means you are spending more and more time sending the data over the network as you try to ship more of it to more remote workers. To answer your question, no in general more workers means more parallelism and therefore faster execution. But that depends on a lot of things. For example, if your process isn't parallelize to use all available execution slots, adding more slots doesn't do anything. On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I am talking about standalone single node cluster. No, I am not increasing parallelism. I just wanted to know if it is natural. Does message passing across the workers account for the happenning? I am running SparkKMeans, just
Re: Perf Prediction
Has anyone done any work on that? On Sun, Feb 22, 2015 at 9:57 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, exactly. On Sun, Feb 22, 2015 at 9:10 AM, Ognen Duzlevski ognen.duzlev...@gmail.com wrote: On Sat, Feb 21, 2015 at 8:54 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: No, I am talking about some work parallel to prediction works that are done on GPUs. Like say, given the data for smaller number of nodes in a Spark cluster, the prediction needs to be done about the time that the application would take when we have larger number of nodes. Are you talking about predicting how performance would increase with adding more nodes/CPUs/whatever?
Re: Perf Prediction
Yes, exactly. On Sun, Feb 22, 2015 at 9:10 AM, Ognen Duzlevski ognen.duzlev...@gmail.com wrote: On Sat, Feb 21, 2015 at 8:54 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: No, I am talking about some work parallel to prediction works that are done on GPUs. Like say, given the data for smaller number of nodes in a Spark cluster, the prediction needs to be done about the time that the application would take when we have larger number of nodes. Are you talking about predicting how performance would increase with adding more nodes/CPUs/whatever?
Re: Worker and Nodes
Also, take SparkPageRank (org.apache.spark.examples) as an example: various RDDs are created and transformed in that code. If I want to find the optimum number of partitions, the one that gives me the best performance, I have to change the number of partitions in each run, right? Now, there are various RDDs there, so which RDD do I partition? In other words, if I partition the first RDD, the one created from the data in HDFS, am I assured that the other RDDs transformed from this RDD will also be partitioned in the same way? Thank You On Sun, Feb 22, 2015 at 10:02 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: So increasing Executors without increasing physical resources If I have a 16 GB RAM system and then I allocate 1 GB for each executor, and give number of executors as 8, then I am increasing the resource right? In this case, how do you explain? Thank You On Sun, Feb 22, 2015 at 6:12 AM, Aaron Davidson ilike...@gmail.com wrote: Note that the parallelism (i.e., number of partitions) is just an upper bound on how much of the work can be done in parallel. If you have 200 partitions, then you can divide the work among between 1 and 200 cores and all resources will remain utilized. If you have more than 200 cores, though, then some will not be used, so you would want to increase parallelism further. (There are other rules-of-thumb -- for instance, it's generally good to have at least 2x more partitions than cores for straggler mitigation, but these are essentially just optimizations.) Further note that when you increase the number of Executors for the same set of resources (i.e., starting 10 Executors on a single machine instead of 1), you make Spark's job harder. Spark has to communicate in an all-to-all manner across Executors for shuffle operations, and it uses TCP sockets to do so whether or not the Executors happen to be on the same machine. So increasing Executors without increasing physical resources means Spark has to do more communication to do the same work. We expect that increasing the number of Executors by a factor of 10, given an increase in the number of physical resources by the same factor, would also improve performance by 10x. This is not always the case for the precise reason above (increased communication overhead), but typically we can get close. The actual observed improvement is very algorithm-dependent, though; for instance, some ML algorithms become hard to scale out past a certain point because the increase in communication overhead outweighs the increase in parallelism. On Sat, Feb 21, 2015 at 8:19 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: So, if I keep the number of instances constant and increase the degree of parallelism in steps, can I expect the performance to increase? Thank You On Sat, Feb 21, 2015 at 9:07 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: So, with the increase in the number of worker instances, if I also increase the degree of parallelism, will it make any difference? I can use this model even the other way round right? I can always predict the performance of an app with the increase in number of worker instances, the deterioration in performance, right? Thank You On Sat, Feb 21, 2015 at 8:52 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I have decreased the executor memory. But, if I have to do this, then I have to tweak around with the code corresponding to each configuration right?
On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen so...@cloudera.com wrote: Workers has a specific meaning in Spark. You are running many on one machine? that's possible but not usual. Each worker's executors have access to a fraction of your machine's resources then. If you're not increasing parallelism, maybe you're not actually using additional workers, so are using less resource for your problem. Or because the resulting executors are smaller, maybe you're hitting GC thrashing in these executors with smaller heaps. Or if you're not actually configuring the executors to use less memory, maybe you're over-committing your RAM and swapping? Bottom line, you wouldn't use multiple workers on one small standalone node. This isn't a good way to estimate performance on a distributed cluster either. On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: No, I just have a single node standalone cluster. I am not tweaking around with the code to increase parallelism. I am just running SparkKMeans that is there in Spark-1.0.0 I just wanted to know, if this behavior is natural. And if so, what causes this? Thank you On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen so...@cloudera.com wrote: What's your storage like? are you adding worker machines that are remote from where the data lives? I wonder if it just means you are spending more and more time
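On the propagation question above: a partitioner attached with partitionBy (or passed to a shuffle operation like groupByKey) is preserved by key-preserving transformations such as mapValues, but dropped by map, since map may change the keys. A minimal Scala sketch against an illustrative HDFS path and partition count:

import org.apache.spark.HashPartitioner

// The partition count is chosen once here, on the first RDD.
val links = sc.textFile("hdfs:///pagerank/edges.txt")            // illustrative path
  .map { line => val p = line.split("\\s+"); (p(0), p(1)) }
  .groupByKey(new HashPartitioner(8))                            // vary 8 per run
  .cache()

// mapValues keeps the keys, so the partitioner (and co-location) is preserved:
val ranks = links.mapValues(_ => 1.0)       // ranks.partitioner == links.partitioner

// A plain map() may change the keys, so Spark drops the partitioner:
val rekeyed = links.map { case (k, v) => (k.reverse, v) }        // partitioner == None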
Re: Worker and Nodes
In this case, I just wanted to know whether a single node cluster with several workers acts as a simulator of a multi-node cluster with several nodes. That is, if we have a single node cluster with, say, 10 workers, can we then say that the same behavior will take place on a cluster of 10 nodes? In other words, without having the 10-node cluster, can I learn the behavior of the application on a 10-node cluster from a single node with 10 workers? The time taken may vary, but I am talking about the behavior. Can we say that? On Sat, Feb 21, 2015 at 8:21 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I am talking about a standalone single node cluster. No, I am not increasing parallelism. I just wanted to know if this is natural. Does message passing across the workers account for what is happening? I am running SparkKMeans, just to validate one prediction model. I am using several data sets. I have a standalone mode. I am varying the workers from 1 to 16 On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen so...@cloudera.com wrote: I can imagine a few reasons. Adding workers might cause fewer tasks to execute locally (?), so you may execute more remotely. Are you increasing parallelism? For trivial jobs, chopping them up further may cause you to pay more overhead of managing so many small tasks, for no speed-up in execution time. Can you provide any more specifics though? You haven't said what you're running, what mode, how many workers, how long it takes, etc. On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have been running some jobs on my local single node standalone cluster. I am varying the number of worker instances for the same job, and the time taken for the job to complete increases as the number of workers increases. I repeated the experiments varying the number of nodes in a cluster too, and the same behavior is seen. Can the idea of worker instances be extrapolated to the nodes in a cluster? Thank You
Re: Worker and Nodes
So, with the increase in the number of worker instances, if I also increase the degree of parallelism, will it make any difference? I can use this model even the other way round right? I can always predict the performance of an app with the increase in number of worker instances, the deterioration in performance, right? Thank You On Sat, Feb 21, 2015 at 8:52 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I have decreased the executor memory. But,if I have to do this, then I have to tweak around with the code corresponding to each configuration right? On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen so...@cloudera.com wrote: Workers has a specific meaning in Spark. You are running many on one machine? that's possible but not usual. Each worker's executors have access to a fraction of your machine's resources then. If you're not increasing parallelism, maybe you're not actually using additional workers, so are using less resource for your problem. Or because the resulting executors are smaller, maybe you're hitting GC thrashing in these executors with smaller heaps. Or if you're not actually configuring the executors to use less memory, maybe you're over-committing your RAM and swapping? Bottom line, you wouldn't use multiple workers on one small standalone node. This isn't a good way to estimate performance on a distributed cluster either. On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: No, I just have a single node standalone cluster. I am not tweaking around with the code to increase parallelism. I am just running SparkKMeans that is there in Spark-1.0.0 I just wanted to know, if this behavior is natural. And if so, what causes this? Thank you On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen so...@cloudera.com wrote: What's your storage like? are you adding worker machines that are remote from where the data lives? I wonder if it just means you are spending more and more time sending the data over the network as you try to ship more of it to more remote workers. To answer your question, no in general more workers means more parallelism and therefore faster execution. But that depends on a lot of things. For example, if your process isn't parallelize to use all available execution slots, adding more slots doesn't do anything. On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I am talking about standalone single node cluster. No, I am not increasing parallelism. I just wanted to know if it is natural. Does message passing across the workers account for the happenning? I am running SparkKMeans, just to validate one prediction model. I am using several data sets. I have a standalone mode. I am varying the workers from 1 to 16 On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen so...@cloudera.com wrote: I can imagine a few reasons. Adding workers might cause fewer tasks to execute locally (?) So you may be execute more remotely. Are you increasing parallelism? for trivial jobs, chopping them up further may cause you to pay more overhead of managing so many small tasks, for no speed up in execution time. Can you provide any more specifics though? you haven't said what you're running, what mode, how many workers, how long it takes, etc. On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have been running some jobs in my local single node stand alone cluster. I am varying the worker instances for the same job, and the time taken for the job to complete increases with increase in the number of workers. 
I repeated some experiments varying the number of nodes in a cluster too and the same behavior is seen. Can the idea of worker instances be extrapolated to the nodes in a cluster? Thank You
Re: Worker and Nodes
Yes, I am talking about a standalone single node cluster. No, I am not increasing parallelism. I just wanted to know if this is natural. Does message passing across the workers account for what is happening? I am running SparkKMeans, just to validate one prediction model. I am using several data sets. I have a standalone mode. I am varying the workers from 1 to 16 On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen so...@cloudera.com wrote: I can imagine a few reasons. Adding workers might cause fewer tasks to execute locally (?), so you may execute more remotely. Are you increasing parallelism? For trivial jobs, chopping them up further may cause you to pay more overhead of managing so many small tasks, for no speed-up in execution time. Can you provide any more specifics though? You haven't said what you're running, what mode, how many workers, how long it takes, etc. On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have been running some jobs on my local single node standalone cluster. I am varying the number of worker instances for the same job, and the time taken for the job to complete increases as the number of workers increases. I repeated the experiments varying the number of nodes in a cluster too, and the same behavior is seen. Can the idea of worker instances be extrapolated to the nodes in a cluster? Thank You
Re: Perf Prediction
No, I am talking about work along the lines of the performance-prediction studies that have been done for GPUs. Say, given measurements for a small number of nodes in a Spark cluster, the goal is to predict the time the application would take when we have a larger number of nodes. On Sat, Feb 21, 2015 at 8:22 PM, Ted Yu yuzhih...@gmail.com wrote: Can you be a bit more specific? Are you asking about performance across Spark releases? Cheers On Sat, Feb 21, 2015 at 6:38 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Has some performance prediction work been done on Spark? Thank You
Re: Worker and Nodes
No, I just have a single node standalone cluster. I am not tweaking around with the code to increase parallelism. I am just running SparkKMeans that is there in Spark-1.0.0 I just wanted to know, if this behavior is natural. And if so, what causes this? Thank you On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen so...@cloudera.com wrote: What's your storage like? are you adding worker machines that are remote from where the data lives? I wonder if it just means you are spending more and more time sending the data over the network as you try to ship more of it to more remote workers. To answer your question, no in general more workers means more parallelism and therefore faster execution. But that depends on a lot of things. For example, if your process isn't parallelize to use all available execution slots, adding more slots doesn't do anything. On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I am talking about standalone single node cluster. No, I am not increasing parallelism. I just wanted to know if it is natural. Does message passing across the workers account for the happenning? I am running SparkKMeans, just to validate one prediction model. I am using several data sets. I have a standalone mode. I am varying the workers from 1 to 16 On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen so...@cloudera.com wrote: I can imagine a few reasons. Adding workers might cause fewer tasks to execute locally (?) So you may be execute more remotely. Are you increasing parallelism? for trivial jobs, chopping them up further may cause you to pay more overhead of managing so many small tasks, for no speed up in execution time. Can you provide any more specifics though? you haven't said what you're running, what mode, how many workers, how long it takes, etc. On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have been running some jobs in my local single node stand alone cluster. I am varying the worker instances for the same job, and the time taken for the job to complete increases with increase in the number of workers. I repeated some experiments varying the number of nodes in a cluster too and the same behavior is seen. Can the idea of worker instances be extrapolated to the nodes in a cluster? Thank You
Re: Worker and Nodes
Yes, I have decreased the executor memory. But,if I have to do this, then I have to tweak around with the code corresponding to each configuration right? On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen so...@cloudera.com wrote: Workers has a specific meaning in Spark. You are running many on one machine? that's possible but not usual. Each worker's executors have access to a fraction of your machine's resources then. If you're not increasing parallelism, maybe you're not actually using additional workers, so are using less resource for your problem. Or because the resulting executors are smaller, maybe you're hitting GC thrashing in these executors with smaller heaps. Or if you're not actually configuring the executors to use less memory, maybe you're over-committing your RAM and swapping? Bottom line, you wouldn't use multiple workers on one small standalone node. This isn't a good way to estimate performance on a distributed cluster either. On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: No, I just have a single node standalone cluster. I am not tweaking around with the code to increase parallelism. I am just running SparkKMeans that is there in Spark-1.0.0 I just wanted to know, if this behavior is natural. And if so, what causes this? Thank you On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen so...@cloudera.com wrote: What's your storage like? are you adding worker machines that are remote from where the data lives? I wonder if it just means you are spending more and more time sending the data over the network as you try to ship more of it to more remote workers. To answer your question, no in general more workers means more parallelism and therefore faster execution. But that depends on a lot of things. For example, if your process isn't parallelize to use all available execution slots, adding more slots doesn't do anything. On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I am talking about standalone single node cluster. No, I am not increasing parallelism. I just wanted to know if it is natural. Does message passing across the workers account for the happenning? I am running SparkKMeans, just to validate one prediction model. I am using several data sets. I have a standalone mode. I am varying the workers from 1 to 16 On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen so...@cloudera.com wrote: I can imagine a few reasons. Adding workers might cause fewer tasks to execute locally (?) So you may be execute more remotely. Are you increasing parallelism? for trivial jobs, chopping them up further may cause you to pay more overhead of managing so many small tasks, for no speed up in execution time. Can you provide any more specifics though? you haven't said what you're running, what mode, how many workers, how long it takes, etc. On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have been running some jobs in my local single node stand alone cluster. I am varying the worker instances for the same job, and the time taken for the job to complete increases with increase in the number of workers. I repeated some experiments varying the number of nodes in a cluster too and the same behavior is seen. Can the idea of worker instances be extrapolated to the nodes in a cluster? Thank You
Re: Worker and Nodes
So, if I keep the number of instances constant and increase the degree of parallelism in steps, can I expect the performance to increase? Thank You On Sat, Feb 21, 2015 at 9:07 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: So, with the increase in the number of worker instances, if I also increase the degree of parallelism, will it make any difference? I can use this model even the other way round right? I can always predict the performance of an app with the increase in number of worker instances, the deterioration in performance, right? Thank You On Sat, Feb 21, 2015 at 8:52 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I have decreased the executor memory. But,if I have to do this, then I have to tweak around with the code corresponding to each configuration right? On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen so...@cloudera.com wrote: Workers has a specific meaning in Spark. You are running many on one machine? that's possible but not usual. Each worker's executors have access to a fraction of your machine's resources then. If you're not increasing parallelism, maybe you're not actually using additional workers, so are using less resource for your problem. Or because the resulting executors are smaller, maybe you're hitting GC thrashing in these executors with smaller heaps. Or if you're not actually configuring the executors to use less memory, maybe you're over-committing your RAM and swapping? Bottom line, you wouldn't use multiple workers on one small standalone node. This isn't a good way to estimate performance on a distributed cluster either. On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: No, I just have a single node standalone cluster. I am not tweaking around with the code to increase parallelism. I am just running SparkKMeans that is there in Spark-1.0.0 I just wanted to know, if this behavior is natural. And if so, what causes this? Thank you On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen so...@cloudera.com wrote: What's your storage like? are you adding worker machines that are remote from where the data lives? I wonder if it just means you are spending more and more time sending the data over the network as you try to ship more of it to more remote workers. To answer your question, no in general more workers means more parallelism and therefore faster execution. But that depends on a lot of things. For example, if your process isn't parallelize to use all available execution slots, adding more slots doesn't do anything. On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I am talking about standalone single node cluster. No, I am not increasing parallelism. I just wanted to know if it is natural. Does message passing across the workers account for the happenning? I am running SparkKMeans, just to validate one prediction model. I am using several data sets. I have a standalone mode. I am varying the workers from 1 to 16 On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen so...@cloudera.com wrote: I can imagine a few reasons. Adding workers might cause fewer tasks to execute locally (?) So you may be execute more remotely. Are you increasing parallelism? for trivial jobs, chopping them up further may cause you to pay more overhead of managing so many small tasks, for no speed up in execution time. Can you provide any more specifics though? you haven't said what you're running, what mode, how many workers, how long it takes, etc. 
On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have been running some jobs in my local single node stand alone cluster. I am varying the worker instances for the same job, and the time taken for the job to complete increases with increase in the number of workers. I repeated some experiments varying the number of nodes in a cluster too and the same behavior is seen. Can the idea of worker instances be extrapolated to the nodes in a cluster? Thank You
Perf Prediction
Hi, Has some performance prediction work been done on Spark? Thank You
Worker and Nodes
Hi, I have been running some jobs on my local single node standalone cluster. I am varying the number of worker instances for the same job, and the time taken for the job to complete increases as the number of workers increases. I repeated the experiments varying the number of nodes in a cluster too, and the same behavior is seen. Can the idea of worker instances be extrapolated to the nodes in a cluster? Thank You
Re: Profiling in YourKit
So, can I increase the number of threads by changing the Spark code manually? On Sat, Feb 7, 2015 at 6:52 PM, Sean Owen so...@cloudera.com wrote: If you look at the threads, the other 30 are almost surely not Spark worker threads. They're the JVM finalizer, GC threads, Jetty listeners, etc. Nothing wrong with this. Your OS has hundreds of threads running now, most of which are idle, and up to 4 of which can be executing. In a one-machine cluster, I don't think you would expect any difference in the number of running threads. More data does not mean more threads, no. Your executor probably takes as many threads as cores in both cases, 4. On Sat, Feb 7, 2015 at 10:14 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using the YourKit tool to profile Spark jobs run on my single node Spark cluster. When I look at the YourKit UI performance charts, the thread count always remains at All threads: 34, Daemon threads: 32. Here are my questions: 1. My system can run only 4 threads simultaneously, and obviously my system does not have 34 threads. What could 34 threads mean? 2. I tried running the same job with four different datasets, two small and two relatively big, but in the UI the thread count increases by two, irrespective of data size. Does this mean that the framework does not adjust the number of threads allocated to each job based on data size? Thank You
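To be clear, no source changes are needed: the number of task threads is a configuration matter. In local mode, for example, local[N] runs N task threads in one JVM; a sketch with an illustrative N:

import org.apache.spark.{SparkConf, SparkContext}

// local[8] gives 8 task threads in a single JVM (8 is illustrative); the
// remaining ~30 threads YourKit shows are JVM housekeeping, as described above.
val sc = new SparkContext(
  new SparkConf().setMaster("local[8]").setAppName("ThreadCountTest"))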
PR Request
Hi, When we submit a PR on GitHub, various tests are performed, like the RAT test and the Scala style test, and beyond these there are many other tests which take longer to run. Could anyone please direct me to the details of the tests that are performed there? Thank You
Reg GraphX APSP
Hi, Is the implementation of All Pairs Shortest Path in GraphX meant for directed graphs or undirected graphs? When I use the algorithm on my dataset, it assumes that the graph is undirected. Has anyone come across this before? Thank you
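If it turns out that the library algorithm follows edge direction and undirected behavior is wanted (or the reverse), the graph can be symmetrized by adding each edge's reverse before running it. A sketch, assuming graph is an existing GraphX Graph:

import org.apache.spark.graphx.{Edge, Graph}

// GraphX edges are directed; to treat a dataset as undirected, materialize
// the reverse of every edge and union it in. `graph` is assumed to exist.
val reversed = graph.edges.map(e => Edge(e.dstId, e.srcId, e.attr))
val undirected = Graph(graph.vertices, graph.edges.union(reversed))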
Re: Reg Job Server
I read somewhere about Gatling. Can that be used to profile Spark jobs? On Fri, Feb 6, 2015 at 10:27 AM, Kostas Sakellis kos...@cloudera.com wrote: Which Spark Job server are you talking about? On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Can Spark Job Server be used for profiling Spark jobs?
Re: Reg Job Server
Yes, I want to know the reason for the job being slow. I will look at YourKit. Could you point me to some tutorial on how to use it? Thank You On Fri, Feb 6, 2015 at 10:44 AM, Kostas Sakellis kos...@cloudera.com wrote: When you say profiling, what are you trying to figure out? Why your Spark job is slow? Gatling seems to be a load-generating framework, so I'm not sure how you'd use it (I've never used it before). Spark runs on the JVM, so you can use any JVM profiling tool, like YourKit. Kostas On Thu, Feb 5, 2015 at 9:03 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I read somewhere about Gatling. Can that be used to profile Spark jobs? On Fri, Feb 6, 2015 at 10:27 AM, Kostas Sakellis kos...@cloudera.com wrote: Which Spark Job server are you talking about? On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Can Spark Job Server be used for profiling Spark jobs?
Re: Reg Job Server
I have a single node Spark standalone cluster. Will this also work for my cluster? Thank You On Fri, Feb 6, 2015 at 11:02 AM, Mark Hamstra m...@clearstorydata.com wrote: https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit On Thu, Feb 5, 2015 at 9:18 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes, I want to know, the reason about the job being slow. I will look at YourKit. Can you redirect me to that, some tutorial in how to use? Thank You On Fri, Feb 6, 2015 at 10:44 AM, Kostas Sakellis kos...@cloudera.com wrote: When you say profiling, what are you trying to figure out? Why your spark job is slow? Gatling seems to be a load generating framework so I'm not sure how you'd use it (i've never used it before). Spark runs on the JVM so you can use any JVM profiling tools like YourKit. Kostas On Thu, Feb 5, 2015 at 9:03 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I read somewhere about Gatling. Can that be used to profile Spark jobs? On Fri, Feb 6, 2015 at 10:27 AM, Kostas Sakellis kos...@cloudera.com wrote: Which Spark Job server are you talking about? On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Can Spark Job Server be used for profiling Spark jobs?
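For the single node case, the recipe on that wiki page is unchanged, since executors are still separate JVMs. A hedged sketch of attaching the YourKit agent through the executor JVM options; the agent path is an assumption about where YourKit is installed and must match your machine:

import org.apache.spark.{SparkConf, SparkContext}

// Adapted from the wiki page above; /opt/yourkit/... is an assumed install path.
val conf = new SparkConf()
  .setAppName("ProfiledJob")
  .set("spark.executor.extraJavaOptions",
       "-agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so=sampling")
val sc = new SparkContext(conf)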
Reg Job Server
Hi, Can Spark Job Server be used for profiling Spark jobs?
Re: Union in Spark
The configuration is 16GB RAM and 1TB HD. I have a single node Spark cluster. Even after setting the driver memory to 5g and the executor memory to 3g, I get this error. The size of the data set is 350 KB, and the data set with which it works well is hardly a few KB. On Mon, Feb 2, 2015 at 1:18 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Hi Deep, What is your configuration and what is the size of the 2 data sets? Thanks Arush On Mon, Feb 2, 2015 at 11:56 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: I did not check the console, because once the job starts I cannot run anything else and have to force shutdown the system. I commented out parts of the code and tested, and I suspect it is because of the union. So I want to change it to something else and see if the problem persists. Thank you On Mon, Feb 2, 2015 at 11:53 AM, Jerry Lam chiling...@gmail.com wrote: Hi Deep, How do you know the cluster is not responsive because of union? Did you check the Spark web console? Best Regards, Jerry On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: The cluster hangs. On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling...@gmail.com wrote: Hi Deep, what do you mean by stuck? Jerry On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Is there any better operation than union? I am using union, and the cluster is getting stuck with a large data set. Thank you -- Arush Kharbanda || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Union in Spark
Hi, Is there any better operation than union? I am using union, and the cluster is getting stuck with a large data set. Thank you
Re: Union in Spark
The cluster hangs. On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling...@gmail.com wrote: Hi Deep, what do you mean by stuck? Jerry On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Is there any better operation than union? I am using union, and the cluster is getting stuck with a large data set. Thank you
Re: Union in Spark
I did not check the console, because once the job starts I cannot run anything else and have to force shutdown the system. I commented out parts of the code and tested, and I suspect it is because of the union. So I want to change it to something else and see if the problem persists. Thank you On Mon, Feb 2, 2015 at 11:53 AM, Jerry Lam chiling...@gmail.com wrote: Hi Deep, How do you know the cluster is not responsive because of union? Did you check the Spark web console? Best Regards, Jerry On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: The cluster hangs. On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling...@gmail.com wrote: Hi Deep, what do you mean by stuck? Jerry On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Is there any better operation than union? I am using union, and the cluster is getting stuck with a large data set. Thank you
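One note on the union question itself: union() only concatenates the two RDDs' partition lists and does not shuffle, so the hang more likely comes from an operation after it (or from memory pressure, as discussed elsewhere in this thread). A sketch with toy RDDs:

// union() itself is a narrow, metadata-only operation -- no data movement.
val partA = sc.parallelize(Seq(1, 2, 3))
val partB = sc.parallelize(Seq(3, 4, 5))
val combined = partA.union(partB)               // partitions = partA's + partB's
val all = sc.union(Seq(partA, partB, combined)) // many RDDs without deep lineage
val deduped = combined.distinct()               // distinct() is the step that shuffles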
Spark on Gordon
Hi All, Gordon SC has Spark installed in it. Has anyone tried to run Spark jobs on Gordon? Thank You
While Loop
Hi, Is there a better programming construct than a while loop in Spark? Thank You
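There is nothing wrong with a plain Scala while loop on the driver; the usual pattern for iterative Spark jobs is a driver-side loop over cached RDDs with an explicit convergence check. A minimal sketch, where oneStep is a hypothetical stand-in for one iteration's transformations:

import org.apache.spark.rdd.RDD

// oneStep is a placeholder for a real iteration's transformations.
def oneStep(rdd: RDD[(Long, Double)]): RDD[(Long, Double)] =
  rdd.mapValues(v => (v + 1.0) / 2.0)

var state: RDD[(Long, Double)] = sc.parallelize(Seq((1L, 1.0), (2L, 0.0))).cache()
var delta = Double.MaxValue
var i = 0
while (delta > 1e-3 && i < 20) {
  val next = oneStep(state).cache()
  // The convergence check forces evaluation once per iteration.
  delta = next.join(state).values.map { case (a, b) => math.abs(a - b) }.sum()
  state.unpersist()
  state = next
  i += 1
}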
Re: Bind Exception
I closed the Spark Shell and tried again, but no change. Here is the error:
...
15/01/17 14:33:39 INFO AbstractConnector: Started SocketConnector@0.0.0.0:59791
15/01/17 14:33:39 INFO Server: jetty-8.y.z-SNAPSHOT
15/01/17 14:33:39 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use
java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:444)
    at sun.nio.ch.Net.bind(Net.java:436)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
    at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
    at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.server.Server.doStart(Server.java:293)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
    at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
    at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
    at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
    at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
    at org.apache.spark.SparkContext.init(SparkContext.scala:223)
    at org.apache.spark.examples.SparkAPSP$.main(SparkAPSP.scala:21)
    at org.apache.spark.examples.SparkAPSP.main(SparkAPSP.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/01/17 14:33:39 WARN AbstractLifeCycle: FAILED org.eclipse.jetty.server.Server@f1b69ca: java.net.BindException: Address already in use
java.net.BindException: Address already in use
    (identical stack trace as above)
15/01/17 14:33:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/metrics/json,null}
15/01/17 14:33:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
15/01/17 14:33:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
15/01/17 14:33:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
..
On Tue, Jan 20, 2015 at 9:52 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: I had the Spark Shell running throughout. Is it because of that? On Tue, Jan 20, 2015 at 9:47 AM, Ted Yu yuzhih...@gmail.com wrote: Was there another instance
Bind Exception
Hi, I am running a Spark job. I get the output correctly, but when I look at the log file I see the following: AbstractLifeCycle: FAILED.: java.net.BindException: Address already in use... What could be the reason for this? Thank You
Re: Bind Exception
I had the Spark Shell running throughout. Is it because of that? On Tue, Jan 20, 2015 at 9:47 AM, Ted Yu yuzhih...@gmail.com wrote: Was there another instance of Spark running on the same machine? Can you pastebin the full stack trace? Cheers On Mon, Jan 19, 2015 at 8:11 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am running a Spark job. I get the output correctly, but when I look at the log file I see the following: AbstractLifeCycle: FAILED.: java.net.BindException: Address already in use... What could be the reason for this? Thank You
Re: Bind Exception
Yes, I have increased the driver memory in spark-defaults.conf to 2g; still the error persists. On Tue, Jan 20, 2015 at 10:18 AM, Ted Yu yuzhih...@gmail.com wrote: Have you seen these threads? http://search-hadoop.com/m/JW1q5tMFlb http://search-hadoop.com/m/JW1q5dabji1 Cheers On Mon, Jan 19, 2015 at 8:33 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi Ted, When I run the same job with small data, it works. But when I run it with a relatively bigger data set, it gives me OutOfMemoryError: GC overhead limit exceeded. The first time I run the job, there is no output. When I run it a second time, I get this error. I am aware that the memory is getting full, but is there any way to avoid this? I have a single node Spark cluster. Thank You On Tue, Jan 20, 2015 at 9:52 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: I had the Spark Shell running throughout. Is it because of that? On Tue, Jan 20, 2015 at 9:47 AM, Ted Yu yuzhih...@gmail.com wrote: Was there another instance of Spark running on the same machine? Can you pastebin the full stack trace? Cheers On Mon, Jan 19, 2015 at 8:11 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am running a Spark job. I get the output correctly, but when I look at the log file I see the following: AbstractLifeCycle: FAILED.: java.net.BindException: Address already in use... What could be the reason for this? Thank You
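For the record, the BindException in this thread is usually benign: the running spark-shell already owns the web UI port 4040, so the second application's UI fails to bind while the job itself still runs. A sketch of pointing the second application at a different UI port (4041 is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Avoids colliding with the shell's UI on 4040; the job runs either way.
val conf = new SparkConf()
  .setAppName("SparkAPSP")
  .set("spark.ui.port", "4041")
val sc = new SparkContext(conf)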
Re: No Output
The error in the log file says: *java.lang.OutOfMemoryError: GC overhead limit exceeded* for a certain task ID, and the error repeats for further task IDs. What could be the problem? On Sun, Jan 18, 2015 at 2:45 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Does updating the Spark version mean setting up the entire cluster once more, or can we update it in some other way? On Sat, Jan 17, 2015 at 3:22 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you paste the code? Also you can try updating your Spark version. Thanks Best Regards On Sat, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark-1.0.0 in a single node cluster. When I run a job with a small data set it runs perfectly, but when I use a data set of 350 KB, no output is produced, and when I try to run it a second time it gives me an exception saying that the SparkContext was shut down. Can anyone help me with this? Thank You
Re: No Output
Does updating the Spark version mean setting up the entire cluster once more, or can we update it in some other way? On Sat, Jan 17, 2015 at 3:22 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you paste the code? Also you can try updating your Spark version. Thanks Best Regards On Sat, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark-1.0.0 in a single node cluster. When I run a job with a small data set it runs perfectly, but when I use a data set of 350 KB, no output is produced, and when I try to run it a second time it gives me an exception saying that the SparkContext was shut down. Can anyone help me with this? Thank You
No Output
Hi, I am using Spark-1.0.0 in a single node cluster. When I run a job with a small data set it runs perfectly, but when I use a data set of 350 KB, no output is produced, and when I try to run it a second time it gives me an exception saying that the SparkContext was shut down. Can anyone help me with this? Thank You
Joins in Spark
Hi, I have two RDDs, vertices and edges. Vertices is a plain RDD and edges is a pair RDD. I want to take a three-way join of these two. Joins work only when both the RDDs are pair RDDs, right? So, how am I supposed to take a three-way join of these RDDs? Thank You
Fwd: Joins in Spark
This gives me two pair RDDs, one the edgesRDD and the other the verticesRDD, with each vertex padded with the value null. But I have to take a three-way join of these two RDDs, and I have only one common attribute between them. How can I go about doing the three-way join?
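A hedged sketch of one way to do this, assuming edges is an RDD[(Long, Long)] keyed by source id and vertices is a plain RDD[Long]: key the plain RDD first (any placeholder value works), then chain two pairwise joins, since Spark only joins two pair RDDs at a time. (In Spark 1.x, import org.apache.spark.SparkContext._ brings in the pair RDD functions.)

import org.apache.spark.SparkContext._

val keyedVertices = vertices.map(v => (v, ()))        // make the plain RDD a pair RDD
val bySrc = keyedVertices.join(edges)                 // (src, ((), dst))
  .map { case (src, (_, dst)) => (dst, src) }         // re-key by destination
val threeWay = bySrc.join(edges)                      // (mid, (src, dst))
  .map { case (mid, (src, dst)) => (src, mid, dst) }  // triples along length-two paths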
Profiling GraphX codes.
Is there any tool to profile GraphX code on a cluster? Is there a way to know what messages are exchanged among the nodes in a cluster? The WebUI does not give all the information. Thank You
Determination of number of RDDs
Hi, I have a graph and I want to create RDDs equal in number to the nodes in the graph. How can I do that? If I have 10 nodes then I want to create 10 RDDs. Is that possible in GraphX? In C we have arrays of pointers; do we have arrays of RDDs in Spark? Can we create such an array and then parallelize it? Thank You
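A minimal sketch: an Array of RDDs is just an ordinary Scala array held on the driver, so the first part works; what does not work is nesting, i.e. parallelizing the array into an RDD of RDDs.

import org.apache.spark.rdd.RDD

val numNodes = 10 // illustrative; in practice the vertex count of the graph
val rdds: Array[RDD[Int]] = Array.tabulate(numNodes)(i => sc.parallelize(Seq(i)))
rdds(0).collect() // each element is a normal RDD
// sc.parallelize(rdds) is not usable: RDDs cannot be nested inside other RDDs,
// and any operation on an inner RDD from an executor fails at runtime.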
Re: Filter using the Vertex Ids
And one more thing, the given tuples (1, 1.0) (2, 1.0) (3, 2.0) (4, 2.0) (5, 0.0) are part of an RDD and are not just tuples. graph.vertices returns me the above tuples, which are part of a VertexRDD. On Wed, Dec 3, 2014 at 3:43 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: This is just an example, but if my graph is big there will be so many tuples to handle. I cannot manually do val a: RDD[(Int, Double)] = sc.parallelize(List((1, 1.0), (2, 1.0), (3, 2.0), (4, 2.0), (5, 0.0))) for all the vertices in the graph. What should I do in that case? We cannot do *sc.parallelize(List(VertexRDD))*, can we? On Wed, Dec 3, 2014 at 3:32 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-12-02 22:01:20 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote: I have a graph which returns the following on doing graph.vertices (1, 1.0) (2, 1.0) (3, 2.0) (4, 2.0) (5, 0.0) I want to group all the vertices with the same attribute together, like into one RDD or something. I want all the vertices with same attribute to be together. You can do this by flipping the tuples so the values become the keys, then using one of the by-key functions in PairRDDFunctions:

val a: RDD[(Int, Double)] = sc.parallelize(List((1, 1.0), (2, 1.0), (3, 2.0), (4, 2.0), (5, 0.0)))
val b: RDD[(Double, Int)] = a.map(kv => (kv._2, kv._1))
val c: RDD[(Double, Iterable[Int])] = b.groupByKey(numPartitions = 5)
c.collect.foreach(println)
// (0.0,CompactBuffer(5))
// (1.0,CompactBuffer(1, 2))
// (2.0,CompactBuffer(3, 4))

Ankur
Re: Filter using the Vertex Ids
This is just an example, but if my graph is big there will be so many tuples to handle. I cannot manually do val a: RDD[(Int, Double)] = sc.parallelize(List((1, 1.0), (2, 1.0), (3, 2.0), (4, 2.0), (5, 0.0))) for all the vertices in the graph. What should I do in that case? We cannot do *sc.parallelize(List(VertexRDD))*, can we? On Wed, Dec 3, 2014 at 3:32 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-12-02 22:01:20 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote: I have a graph which returns the following on doing graph.vertices (1, 1.0) (2, 1.0) (3, 2.0) (4, 2.0) (5, 0.0) I want to group all the vertices with the same attribute together, like into one RDD or something. I want all the vertices with same attribute to be together. You can do this by flipping the tuples so the values become the keys, then using one of the by-key functions in PairRDDFunctions:

val a: RDD[(Int, Double)] = sc.parallelize(List((1, 1.0), (2, 1.0), (3, 2.0), (4, 2.0), (5, 0.0)))
val b: RDD[(Double, Int)] = a.map(kv => (kv._2, kv._1))
val c: RDD[(Double, Iterable[Int])] = b.groupByKey(numPartitions = 5)
c.collect.foreach(println)
// (0.0,CompactBuffer(5))
// (1.0,CompactBuffer(1, 2))
// (2.0,CompactBuffer(3, 4))

Ankur
SVD Plus Plus in GraphX
Hi, I was just going through two codes in GraphX, namely SVDPlusPlus and TriangleCount. In the first I see an RDD as an input to run, i.e., run(edges: RDD[Edge[Double]],...), and in the other I see run(VD:..., ED:...). Can anyone explain to me the difference between these two? In fact, SVDPlusPlus is the only GraphX code in Spark-1.0.0 in which I have seen an RDD as an input. Could anyone please explain this to me? Thank you
Undirected Graphs in GraphX-Pregel
Hi, I was going through the paper on Pregel titled Pregel: A System for Large-Scale Graph Processing. In the second section, named Model Of Computation, it says that the input to a Pregel computation is a directed graph. Is it the same in the Pregel abstraction of GraphX too? Do we always have to input directed graphs to the Pregel abstraction, or can we also give it undirected graphs? Thank You
Edge List File in GraphX
Hi, Is it necessary for every vertex to have an attribute when we load a graph into GraphX? In other words, suppose I have an edge list file containing pairs of vertices, i.e., 1 2 means that there is an edge between node 1 and node 2. Now, when I run PageRank on this data it returns NaN. Can I use this type of data for any algorithm in GraphX? Thank You
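A minimal sketch, assuming a whitespace-separated "srcId dstId" file: GraphLoader.edgeListFile builds a graph from exactly this format and gives every vertex a default attribute of 1, so an attribute-less edge list is fine for algorithms like PageRank.

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "edge-list-file.txt") // hypothetical path
val ranks = graph.pageRank(0.0001).vertices                    // tolerance-based PageRank
ranks.take(10).foreach(println)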
Re: New Codes in GraphX
Could it be because my edge list file is in the form (1 2), where there is an edge between node 1 and node 2? On Tue, Nov 18, 2014 at 4:13 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-18 15:51:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes the above command works, but there is this problem. Most of the time, the total rank is NaN (Not a Number). Why is it so? I've also seen this, but I'm not sure why it happens. If you could find out which vertices are getting the NaN rank, it might be helpful in tracking down the problem. Ankur
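A hedged sketch for Ankur's suggestion of finding the NaN vertices, assuming ranks is the VertexRDD[Double] returned by PageRank:

val nanVertices = ranks.filter { case (_, rank) => rank.isNaN }
nanVertices.take(10).foreach(println) // inspect a few of the offending vertex ids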
New Codes in GraphX
Hi, I am using Spark-1.0.0. There are two GraphX directories that I can see here: 1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx, which contains LiveJournalPageRank.scala 2. spark-1.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib, which contains Analytics.scala, ConnectedComponents.scala, etc. Now, if I want to add my own code to GraphX, i.e., if I want to write a small application on GraphX, in which directory should I add my code, 1 or 2? And what is the difference? Can anyone tell me something on this? Thank You
Re: Landmarks in GraphX section of Spark API
So a landmark can contain just one vertex, right? Which algorithm has been used to compute the shortest path? Thank You On Tue, Nov 18, 2014 at 2:53 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-17 14:47:50 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I was going through the graphx section in the Spark API in https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$ Here, I find the word landmark. Can anyone explain to me what landmark means? Is it a simple English word or does it mean something else in graphx? The landmarks in the context of the shortest-paths algorithm are just the vertices of interest. For each vertex in the graph, the algorithm will return the distance to each of the landmark vertices. Ankur
Re: Running PageRank in GraphX
There are no vertices of zero outdegree. The total rank for the graph with numIter = 10 is 4.99, and for the graph with numIter = 100 it is 5.99. I do not know why there is so much variation. On Tue, Nov 18, 2014 at 3:22 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-18 12:02:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I just ran the PageRank code in GraphX with some sample data. What I am seeing is that the total rank changes drastically if I change the number of iterations from 10 to 100. Why is that so? As far as I understand, the total rank should asymptotically approach the number of vertices in the graph, assuming there are no vertices of zero outdegree. Does that seem to be the case for your graph? Ankur
Re: Landmarks in GraphX section of Spark API
Does Bellman-Ford give the best solution? On Tue, Nov 18, 2014 at 3:27 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-18 14:59:20 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: So landmark can contain just one vertex right? Right. Which algorithm has been used to compute the shortest path? It's distributed Bellman-Ford. Ankur
Re: New Codes in GraphX
The codes that are present in 2 can be run with the command *$SPARK_HOME/bin/spark-submit --master local[*] --class org.apache.spark.graphx.lib.Analytics $SPARK_HOME/assembly/target/scala-2.10/spark-assembly-*.jar pagerank /edge-list-file.txt --numEPart=8 --numIter=10 --partStrategy=EdgePartition2D* Now, how do I run the LiveJournalPageRank.scala that is there in 1? On Tue, Nov 18, 2014 at 2:51 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark-1.0.0. There are two GraphX directories that I can see here: 1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx, which contains LiveJournalPageRank.scala 2. spark-1.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib, which contains Analytics.scala, ConnectedComponents.scala, etc. Now, if I want to add my own code to GraphX, i.e., if I want to write a small application on GraphX, in which directory should I add my code, 1 or 2? And what is the difference? Can anyone tell me something on this? Thank You
Re: New Codes in GraphX
What command should I use to run LiveJournalPageRank.scala? If you want to write a separate application, the ideal way is to do it in a separate project that links in Spark as a dependency [1]. But even for this, I have to do the build every time I change the code, right? Thank You On Tue, Nov 18, 2014 at 3:35 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-18 14:51:54 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I am using Spark-1.0.0. There are two GraphX directories that I can see here: 1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx, which contains LiveJournalPageRank.scala 2. spark-1.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib, which contains Analytics.scala, ConnectedComponents.scala, etc. Now, if I want to add my own code to GraphX, i.e., if I want to write a small application on GraphX, in which directory should I add my code, 1 or 2? And what is the difference? If you want to add an algorithm which you can call from the Spark shell and submit as a pull request, you should add it to org.apache.spark.graphx.lib (#2). To run it from the command line, you'll also have to modify Analytics.scala. If you want to write a separate application, the ideal way is to do it in a separate project that links in Spark as a dependency [1]. It will also work to put it in either #1 or #2, but this will be worse in the long term because each build cycle will require you to rebuild and restart all of Spark rather than just building your application and calling spark-submit on the new JAR. Ankur [1] http://spark.apache.org/docs/1.0.2/quick-start.html#standalone-applications
Re: New Codes in GraphX
Yes, the above command works, but there is this problem. Most of the time, the total rank is NaN (Not a Number). Why is it so? Thank You On Tue, Nov 18, 2014 at 3:48 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: What command should I use to run LiveJournalPageRank.scala? If you want to write a separate application, the ideal way is to do it in a separate project that links in Spark as a dependency [1]. But even for this, I have to do the build every time I change the code, right? Thank You On Tue, Nov 18, 2014 at 3:35 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-18 14:51:54 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I am using Spark-1.0.0. There are two GraphX directories that I can see here: 1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx, which contains LiveJournalPageRank.scala 2. spark-1.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib, which contains Analytics.scala, ConnectedComponents.scala, etc. Now, if I want to add my own code to GraphX, i.e., if I want to write a small application on GraphX, in which directory should I add my code, 1 or 2? And what is the difference? If you want to add an algorithm which you can call from the Spark shell and submit as a pull request, you should add it to org.apache.spark.graphx.lib (#2). To run it from the command line, you'll also have to modify Analytics.scala. If you want to write a separate application, the ideal way is to do it in a separate project that links in Spark as a dependency [1]. It will also work to put it in either #1 or #2, but this will be worse in the long term because each build cycle will require you to rebuild and restart all of Spark rather than just building your application and calling spark-submit on the new JAR. Ankur [1] http://spark.apache.org/docs/1.0.2/quick-start.html#standalone-applications
Landmarks in GraphX section of Spark API
Hi, I was going through the graphx section in the Spark API in https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$ Here, I find the word landmark. Can anyone explain to me what landmark means? Is it a simple English word or does it mean something else in graphx? Thank You
Running PageRank in GraphX
Hi, I just ran the PageRank code in GraphX with some sample data. What I am seeing is that the total rank changes drastically if I change the number of iterations from 10 to 100. Why is that so? Thank You
Functions in Spark
Hi, Is there any way to know which of my functions performs better in Spark? In other words, say I have achieved the same thing using two different implementations. How do I judge which implementation is better than the other? Is processing time the only metric we can use to claim the goodness of one implementation over the other? Can anyone please share some thoughts on this? Thank You
Re: toLocalIterator in Spark 1.0.0
val iter = toLocalIterator(rdd) is what I am doing, and it says error: not found. On Fri, Nov 14, 2014 at 12:34 PM, Patrick Wendell pwend...@gmail.com wrote: It looks like you are trying to directly import the toLocalIterator function. You can't import functions; it should just appear as a method of an existing RDD if you have one. - Patrick On Thu, Nov 13, 2014 at 10:21 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark 1.0.0 and Scala 2.10.3. I want to use toLocalIterator in my code but the Spark shell says not found: value toLocalIterator I also did import org.apache.spark.rdd but even after this the shell says object toLocalIterator is not a member of package org.apache.spark.rdd Can anyone help me with this? Thank You
toLocalIterator in Spark 1.0.0
Hi, I am using Spark 1.0.0 and Scala 2.10.3. I want to use toLocalIterator in my code but the Spark shell says *not found: value toLocalIterator* I also did import org.apache.spark.rdd but even after this the shell says *object toLocalIterator is not a member of package org.apache.spark.rdd* Can anyone help me with this? Thank You
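To make Patrick's point concrete, a minimal sketch: toLocalIterator is a method on an RDD instance, not a free-standing function, so it is called with dot notation and needs no import.

val rdd = sc.parallelize(1 to 100, 4)
val iter: Iterator[Int] = rdd.toLocalIterator // streams one partition at a time to the driver
iter.take(5).foreach(println)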
Pass RDD to functions
Hi, Can we pass RDDs to functions? Like, can we do the following? *def func(temp: RDD[String]): RDD[String] = {* *// body of the function* *}* Thank You
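Yes; RDDs are ordinary Scala values, so they can be passed to and returned from functions on the driver. A minimal sketch:

import org.apache.spark.rdd.RDD

def func(temp: RDD[String]): RDD[String] =
  temp.filter(_.nonEmpty) // any transformation works; this one is illustrative

The one restriction to keep in mind is that such functions run driver-side logic: you cannot call them on an RDD from inside another RDD's closure.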
Queues
Has anyone implemented Queues using RDDs? Thank You
GraphX and Spark
Hi, Can Spark achieve whatever GraphX can? Setting aside the performance comparison between Spark and GraphX, if I want to implement a graph algorithm and I do not want to use GraphX, can I get the work done with Spark? Thank You
RDD of Iterable[String]
Can we iterate over an RDD of Iterable[String]? How do we do that? Because the entire Iterable[String] seems to be a single element in the RDD. Thank You
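A hedged sketch: if each RDD element is an Iterable[String] (as after groupByKey), either flatten it into an RDD[String] or work on each Iterable inside a transformation. Here rdd stands for your RDD[Iterable[String]].

val flat = rdd.flatMap(identity)    // one String per element of every Iterable
val sizes = rdd.map(it => it.size)  // or treat each Iterable as a whole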
Converting one RDD to another
Hi, Is it always possible to get one RDD from another? For example, if I do a *top(K)(Ordering)*, I get an Int, right? (In my example the type is Int.) I do not get an RDD. Can anyone explain this to me? Thank You
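The distinction here is between transformations and actions: transformations (map, filter, join, ...) return RDDs, while actions such as top(k) return local values to the driver. Note that top(k) actually returns an Array[T] rather than a single Int. A minimal sketch:

val rdd = sc.parallelize(Seq(3, 1, 4, 1, 5))
val doubled = rdd.map(_ * 2) // transformation: still an RDD
val top2 = rdd.top(2)        // action: a local Array(5, 4) on the driver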
Change RDDs using map()
Hi, I want to make the following change in an RDD (create a new RDD from the existing one to reflect some transformation): in an RDD of key-value pairs, I want to get the keys for which the values are 1. How do I do this using map()? Thank You
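A minimal sketch: filter (rather than map) selects the qualifying pairs, and keys drops the values. The data is illustrative.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 1)))
val keysWithOne = pairs.filter { case (_, v) => v == 1 }.keys
keysWithOne.collect() // Array("a", "c")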
Re: Spark and Scala
Is it always true that whenever we apply operations on an RDD, we get another RDD? Or does it depend on the return type of the operation? On Sat, Sep 13, 2014 at 9:45 AM, Soumya Simanta soumya.sima...@gmail.com wrote: An RDD is a fault-tolerant distributed structure. It is the primary abstraction in Spark. I would strongly suggest that you have a look at the following to get a basic idea. http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html http://spark.apache.org/docs/latest/quick-start.html#basics https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia On Sat, Sep 13, 2014 at 12:06 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Take for example this: I have declared one queue *val queue = Queue.empty[Int]*, which is a pure Scala line in the program. I actually want the queue to be an RDD but there are no direct methods to create an RDD which is a queue, right? What do you say about this? Does there exist something like: *Create an RDD which is a queue*? On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan hshreedha...@cloudera.com wrote: No, Scala primitives remain primitives. Unless you create an RDD using one of the many methods - you would not be able to access any of the RDD methods. There is no automatic porting. Spark is an application as far as Scala is concerned - there is no compilation (except of course, the Scala, JIT compilation etc). On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I know that unpersist is a method on RDD. But my confusion is that, when we port our Scala programs to Spark, doesn't everything change to RDDs? On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: unpersist is a method on RDDs. RDDs are abstractions introduced by Spark. An Int is just a Scala Int. You can't call unpersist on an Int in Scala, and that doesn't change in Spark. On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: There is one thing that I am confused about. Spark has code that has been implemented in Scala. Now, can we run any Scala code on the Spark framework? What will be the difference in the execution of the Scala code in normal systems and on Spark? The reason for my question is the following: I had a variable *val temp = some operations* This temp was being created inside a loop, and so as to manually throw it out of the cache every time the loop ends I was calling *temp.unpersist()*. This was returning an error saying that *value unpersist is not a method of Int*, which means that temp is an Int. Can someone explain to me why I was not able to call *unpersist* on *temp*? Thank You
Re: Spark and Scala
Take for example this:

*val lines = sc.textFile(args(0))*
*val nodes = lines.map(s => {*
*  val fields = s.split("\\s+")*
*  (fields(0), fields(1))*
*}).distinct().groupByKey().cache()*
*val nodeSizeTuple = nodes.map(node => (node._1.toInt, node._2.size))*
*val rootNode = nodeSizeTuple.top(1)(Ordering.by(f => f._2))*

The nodeSizeTuple is an RDD, but rootNode is an array. Here I have used all RDD operations, but I am getting an array. What about this case? On Sat, Sep 13, 2014 at 11:45 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Is it always true that whenever we apply operations on an RDD, we get another RDD? Or does it depend on the return type of the operation? On Sat, Sep 13, 2014 at 9:45 AM, Soumya Simanta soumya.sima...@gmail.com wrote: An RDD is a fault-tolerant distributed structure. It is the primary abstraction in Spark. I would strongly suggest that you have a look at the following to get a basic idea. http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html http://spark.apache.org/docs/latest/quick-start.html#basics https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia On Sat, Sep 13, 2014 at 12:06 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Take for example this: I have declared one queue *val queue = Queue.empty[Int]*, which is a pure Scala line in the program. I actually want the queue to be an RDD but there are no direct methods to create an RDD which is a queue, right? What do you say about this? Does there exist something like: *Create an RDD which is a queue*? On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan hshreedha...@cloudera.com wrote: No, Scala primitives remain primitives. Unless you create an RDD using one of the many methods - you would not be able to access any of the RDD methods. There is no automatic porting. Spark is an application as far as Scala is concerned - there is no compilation (except of course, the Scala, JIT compilation etc). On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I know that unpersist is a method on RDD. But my confusion is that, when we port our Scala programs to Spark, doesn't everything change to RDDs? On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: unpersist is a method on RDDs. RDDs are abstractions introduced by Spark. An Int is just a Scala Int. You can't call unpersist on an Int in Scala, and that doesn't change in Spark. On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: There is one thing that I am confused about. Spark has code that has been implemented in Scala. Now, can we run any Scala code on the Spark framework? What will be the difference in the execution of the Scala code in normal systems and on Spark? The reason for my question is the following: I had a variable *val temp = some operations* This temp was being created inside a loop, and so as to manually throw it out of the cache every time the loop ends I was calling *temp.unpersist()*. This was returning an error saying that *value unpersist is not a method of Int*, which means that temp is an Int. Can someone explain to me why I was not able to call *unpersist* on *temp*? Thank You
RDDs and Immutability
Hi, We all know that RDDs are immutable, and the available operations cannot achieve anything and everything on RDDs. Take for example this: I want an Array of Bytes filled with zeros which should change during the program; some elements of that Array should change to 1. If I make an RDD with all elements as zero, I won't be able to change the elements. On the other hand, if I declare it as an Array then too much memory will be consumed. Please clarify this for me. Thank You
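A minimal sketch of the idiomatic workaround: instead of mutating elements in place, express the change as a transformation that produces a new RDD, keeping the index with each byte so individual positions can be addressed. shouldFlip is a hypothetical predicate standing in for the program's condition.

val bytes = sc.parallelize(Seq.fill(10)(0.toByte)).zipWithIndex() // (byte, index) pairs
def shouldFlip(i: Long): Boolean = i % 2 == 0 // placeholder condition
val updated = bytes.map { case (b, i) => if (shouldFlip(i)) (1.toByte, i) else (b, i) }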
Spark and Scala
There is one thing that I am confused about. Spark has code that has been implemented in Scala. Now, can we run any Scala code on the Spark framework? What will be the difference in the execution of the Scala code in normal systems and on Spark? The reason for my question is the following: I had a variable *val temp = some operations* This temp was being created inside a loop, and so as to manually throw it out of the cache every time the loop ends I was calling *temp.unpersist()*. This was returning an error saying that *value unpersist is not a method of Int*, which means that temp is an Int. Can someone explain to me why I was not able to call *unpersist* on *temp*? Thank You
Re: Spark and Scala
I know that unpersist is a method on RDD. But my confusion is that, when we port our Scala programs to Spark, doesn't everything change to RDDs? On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: unpersist is a method on RDDs. RDDs are abstractions introduced by Spark. An Int is just a Scala Int. You can't call unpersist on an Int in Scala, and that doesn't change in Spark. On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: There is one thing that I am confused about. Spark has code that has been implemented in Scala. Now, can we run any Scala code on the Spark framework? What will be the difference in the execution of the Scala code in normal systems and on Spark? The reason for my question is the following: I had a variable *val temp = some operations* This temp was being created inside a loop, and so as to manually throw it out of the cache every time the loop ends I was calling *temp.unpersist()*. This was returning an error saying that *value unpersist is not a method of Int*, which means that temp is an Int. Can someone explain to me why I was not able to call *unpersist* on *temp*? Thank You
Re: Spark and Scala
Take for example this: I have declared one queue *val queue = Queue.empty[Int]*, which is a pure Scala line in the program. I actually want the queue to be an RDD but there are no direct methods to create an RDD which is a queue, right? What do you say about this? Does there exist something like: *Create an RDD which is a queue*? On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan hshreedha...@cloudera.com wrote: No, Scala primitives remain primitives. Unless you create an RDD using one of the many methods - you would not be able to access any of the RDD methods. There is no automatic porting. Spark is an application as far as Scala is concerned - there is no compilation (except of course, the Scala, JIT compilation etc). On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I know that unpersist is a method on RDD. But my confusion is that, when we port our Scala programs to Spark, doesn't everything change to RDDs? On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: unpersist is a method on RDDs. RDDs are abstractions introduced by Spark. An Int is just a Scala Int. You can't call unpersist on an Int in Scala, and that doesn't change in Spark. On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: There is one thing that I am confused about. Spark has code that has been implemented in Scala. Now, can we run any Scala code on the Spark framework? What will be the difference in the execution of the Scala code in normal systems and on Spark? The reason for my question is the following: I had a variable *val temp = some operations* This temp was being created inside a loop, and so as to manually throw it out of the cache every time the loop ends I was calling *temp.unpersist()*. This was returning an error saying that *value unpersist is not a method of Int*, which means that temp is an Int. Can someone explain to me why I was not able to call *unpersist* on *temp*? Thank You
Unpersist
I want to create a temporary variable in a Spark code. Can I do this? for (i <- num) { val temp = .. { do something } temp.unpersist() } Thank You
Re: Unpersist
After every loop I want the temp variable to cease to exist. On Thu, Sep 11, 2014 at 4:33 PM, Akhil Das ak...@sigmoidanalytics.com wrote: like this? var temp = ... for (i <- num) { temp = .. { do something } temp.unpersist() } Thanks Best Regards On Thu, Sep 11, 2014 at 3:26 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I want to create a temporary variable in a Spark code. Can I do this? for (i <- num) { val temp = .. { do something } temp.unpersist() } Thank You
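A minimal sketch of the pattern being discussed, assuming temp is an RDD (unpersist only exists on RDDs): a val declared inside the loop body already goes out of scope at the end of each iteration, and unpersist additionally frees any cached blocks. base is a hypothetical source RDD.

for (i <- 1 to num) {
  val temp = base.map(_ * i).cache() // placeholder work
  temp.count()                       // do something that uses temp
  temp.unpersist()                   // drop its cached blocks before the next iteration
}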
How to change the values in Array of Bytes
Hi, I have an array of bytes and I have filled the array with 0 in all the positions. *var Array = Array.fill[Byte](10)(0)* Now, if certain conditions are satisfied, I want to change some elements of the array to 1 instead of 0. If I run *if (Array.apply(index)==0) Array.apply(index) = 1* it returns an error. But if I assign *Array.apply(index)* to a variable and do the same thing then it works. I do not want to assign this to a variable because if I do, I would be creating a lot of variables. Can anyone tell me a method to do this? Thank You
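A minimal sketch of the update syntax: arr(index) = 1 is sugar for arr.update(index, 1), whereas Array.apply(index) = 1 does not compile. (Naming the variable Array also shadows the scala.Array object, which is worth avoiding.)

val arr = Array.fill[Byte](10)(0)
val index = 3 // illustrative
if (arr(index) == 0) arr(index) = 1 // i.e. arr.update(index, 1)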
Recursion
Hi, Does Spark support recursive calls?
Array and RDDs
Hi, I have an input file which consists of src_node dest_node pairs. I have created an RDD of key-value pairs where the key is the node id and the values are the children of that node. Now I want to associate a byte with each node. For that I have created a byte array. Every time I print out the key-value pairs in the RDD, they do not come in the same order, and because of this I am finding it difficult to assign the byte values to each node. Can anyone help me out in this matter? I basically have the following code: val bitarray = Array.fill[Byte](number)(0) And I want to associate each byte in the array with a node. How should I do that? Thank You
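A hedged sketch of one way around the ordering problem: rather than a driver-side array indexed by position, keep the byte with each node inside the RDD itself, so it travels with the key no matter how the partitions are ordered. The condition below is a placeholder.

import org.apache.spark.SparkContext._ // pair RDD functions in Spark 1.x

// nodes: RDD[(String, Iterable[String])], key = node id, values = children
val flagged = nodes.mapValues(children => (children, 0.toByte))
val marked = flagged.mapValues {
  case (children, flag) =>
    if (children.nonEmpty) (children, 1.toByte) else (children, flag) // placeholder condition
}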
Iterate over ArrayBuffer
Hi, I have the following ArrayBuffer *ArrayBuffer(5,3,1,4)* Now, I want to iterate over the ArrayBuffer. What is the way to do it? Thank You
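A minimal sketch; both forms below are standard Scala:

import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer(5, 3, 1, 4)
buf.foreach(println)        // higher-order style
for (x <- buf) println(x)   // for-comprehension style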
Number of elements in ArrayBuffer
Hi, I have the following ArrayBuffer: *ArrayBuffer(5,3,1,4)* Now, I want to get the number of elements in this ArrayBuffer and also the first element of the ArrayBuffer. I used .length and .size but they are returning 1 instead of 4. I also used .head and .last for getting the first and the last element but they also return the entire ArrayBuffer (ArrayBuffer(5,3,1,4)). What I understand from this is that the entire ArrayBuffer is stored as one element. How should I go about doing the required things? Thank You
Re: Number of elements in ArrayBuffer
I have a problem here. When I run the commands that Rajesh suggested in the Scala REPL, they work fine. But I want this to work in Spark code, where I need to find the number of elements in an ArrayBuffer; there, these things are not working. How should I do that? On Wed, Sep 3, 2014 at 10:25 AM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi Deep, Please find below the results of ArrayBuffer in the Scala REPL:

scala> import scala.collection.mutable.ArrayBuffer
import scala.collection.mutable.ArrayBuffer
scala> val a = ArrayBuffer(5,3,1,4)
a: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(5, 3, 1, 4)
scala> a.head
res2: Int = 5
scala> a.tail
res3: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(3, 1, 4)
scala> a.length
res4: Int = 4

Regards, Rajesh On Wed, Sep 3, 2014 at 10:13 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have the following ArrayBuffer: *ArrayBuffer(5,3,1,4)* Now, I want to get the number of elements in this ArrayBuffer and also the first element of the ArrayBuffer. I used .length and .size but they are returning 1 instead of 4. I also used .head and .last for getting the first and the last element but they also return the entire ArrayBuffer (ArrayBuffer(5,3,1,4)). What I understand from this is that the entire ArrayBuffer is stored as one element. How should I go about doing the required things? Thank You
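A hedged guess at what is happening in the Spark code: after an operation like groupByKey, the buffer is a value inside an RDD element, so .length must be called inside a transformation rather than on the RDD itself.

// rdd: RDD[(K, Iterable[Int])], illustrative; mapValues needs the pair RDD functions
val sizes = rdd.mapValues(buf => buf.size)
sizes.collect().foreach(println)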
Key-Value Operations
Hi, I have an RDD of key-value pairs. Now I want to find the key for which the values have the largest number of elements. How should I do that? Basically, I want to select the key for which the number of items in the values is the largest. Thank You
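A minimal sketch, assuming pairs is an RDD[(K, Iterable[V])] (e.g. the result of groupByKey): map each pair to its value count, then reduce to the largest.

val biggest = pairs
  .map { case (k, vs) => (k, vs.size) }
  .reduce((a, b) => if (a._2 >= b._2) a else b) // the (key, count) with the most values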
Re: Printing the RDDs in SparkPageRank
println(parts(0)) does not solve the problem. It does not work. On Mon, Aug 25, 2014 at 1:30 PM, Sean Owen so...@cloudera.com wrote: On Mon, Aug 25, 2014 at 7:18 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: When I add parts(0).collect().foreach(println) parts(1).collect().foreach(println) for printing parts, I get the following error: not enough arguments for method collect: (pf: PartialFunction[Char,B])(implicit bf: scala.collection.generic.CanBuildFrom[String,B,That])That. Unspecified value parameter pf. parts(0).collect().foreach(println)

val links = lines.map { s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1)) /* I want to print this parts */
}.distinct().groupByKey().cache()

Within this code, you are working in a simple Scala function. parts is an Array[String]. parts(0) is a String. You can just println(parts(0)). You are not calling RDD.collect() there, but collect() on a String, a sequence of Char. However, note that this will print the String on the worker that executes this, not the driver. Maybe you want to print the result right after this map function? Then break this into two statements and print the result of the first. You already are doing that in your code. A good formula is actually take(10) rather than collect() in case the RDD is huge.
Key-Value in PairRDD
I have the following code:

*val nodes = lines.map(s => {*
*  val fields = s.split("\\s+")*
*  (fields(0), fields(1))*
*}).distinct().groupByKey().cache()*

and when I print out the nodes RDD I get the following:

*(4,ArrayBuffer(1))*
*(2,ArrayBuffer(1))*
*(3,ArrayBuffer(1))*
*(1,ArrayBuffer(3, 2, 4))*

Now, I want to print only the key part of the RDD and also the maximum value among the keys. How should I do that? Thank You
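A minimal sketch for both parts: keys gives the key side of the pair RDD, and a reduce finds the maximum.

nodes.keys.collect().foreach(println)                 // print only the keys
val maxKey = nodes.keys.map(_.toInt).reduce(math.max) // largest key, here 4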
Re: Printing the RDDs in SparkPageRank
When I add parts(0).collect().foreach(println) parts(1).collect().foreach(println) for printing parts, I get the following error: *not enough arguments for method collect: (pf: PartialFunction[Char,B])(implicit bf: scala.collection.generic.CanBuildFrom[String,B,That])That. Unspecified value parameter pf. parts(0).collect().foreach(println)* And, when I add parts.collect().foreach(println), I get the following error: *not enough arguments for method collect: (pf: PartialFunction[String,B])(implicit bf: scala.collection.generic.CanBuildFrom[Array[String],B,That])That. Unspecified value parameter pf. parts.collect().foreach(println)* On Sun, Aug 24, 2014 at 8:27 PM, Jörn Franke jornfra...@gmail.com wrote: Hi, What kind of error do you receive? Best regards, Jörn Le 24 août 2014 08:29, Deep Pradhan pradhandeep1...@gmail.com a écrit : Hi, I was going through the SparkPageRank code and want to see the intermediate steps, like the RDDs formed in the intermediate steps. Here is a part of the code along with the lines that I added in order to print the RDDs. I want to print the *parts* in the code (denoted by the comment in Bold letters). But, when I try to do the same thing there, it gives an error. Can someone suggest what I should be doing? Thank You CODE:

object SparkPageRank {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("PageRank")
    var iters = args(1).toInt
    val ctx = new SparkContext(sparkConf)
    val lines = ctx.textFile(args(0), 1)
    println("The lines RDD is")
    lines.collect().foreach(println)
    val links = lines.map { s =>
      val parts = s.split("\\s+")
      (parts(0), parts(1)) /* I want to print this parts */
    }.distinct().groupByKey().cache()
    println("The links RDD is")
    links.collect().foreach(println)
    var ranks = links.mapValues(v => 1.0)
    println("The ranks RDD is")
    ranks.collect().foreach(println)
    for (i <- 1 to iters) {
      val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map(url => (url, rank / size))
      }
      println("The contribs RDD is")
      contribs.collect().foreach(println)
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    println("The second ranks RDD is")
    ranks.collect().foreach(println)
    val output = ranks.collect()
    output.foreach(tup => println(tup._1 + " has rank: " + tup._2 + "."))
    ctx.stop()
  }
}
Printing the RDDs in SparkPageRank
Hi, I was going through the SparkPageRank code and want to see the intermediate steps, like the RDDs formed in the intermediate steps. Here is a part of the code along with the lines that I added in order to print the RDDs. I want to print the *parts* in the code (denoted by the comment in Bold letters). But, when I try to do the same thing there, it gives an error. Can someone suggest what I should be doing? Thank You CODE:

object SparkPageRank {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("PageRank")
    var iters = args(1).toInt
    val ctx = new SparkContext(sparkConf)
    val lines = ctx.textFile(args(0), 1)
    println("The lines RDD is")
    lines.collect().foreach(println)
    val links = lines.map { s =>
      val parts = s.split("\\s+")
      (parts(0), parts(1)) /* I want to print this parts */
    }.distinct().groupByKey().cache()
    println("The links RDD is")
    links.collect().foreach(println)
    var ranks = links.mapValues(v => 1.0)
    println("The ranks RDD is")
    ranks.collect().foreach(println)
    for (i <- 1 to iters) {
      val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map(url => (url, rank / size))
      }
      println("The contribs RDD is")
      contribs.collect().foreach(println)
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    println("The second ranks RDD is")
    ranks.collect().foreach(println)
    val output = ranks.collect()
    output.foreach(tup => println(tup._1 + " has rank: " + tup._2 + "."))
    ctx.stop()
  }
}
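Following Sean's suggestion earlier in this thread, a minimal sketch that breaks the pipeline into two statements so the intermediate arrays can be printed on the driver:

val parts = lines.map(_.split("\\s+"))
parts.take(10).foreach(p => println(p.mkString(" "))) // inspect a sample on the driver
val links = parts.map(p => (p(0), p(1))).distinct().groupByKey().cache()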
Program without doing assembly
Hi, I am just playing around with the code in Spark. I am printing out some statements of the code given in Spark so as to see how it looks. Every time I change/add something to the code I have to run the command *SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly*, which is tiresome at times. Is there any way to try out the code, or even add new code (a new file in the examples directory) to the Spark already available, without having to do sbt/sbt assembly? Please tell me for a single node as well as a multi node cluster. Thank You
Error in sbt/sbt package
I am getting the following error while doing SPARK_HADOOP_VERSION=2.3.0 sbt/sbt package:

java.io.IOException: Cannot run program /home/deep/spark-1.0.0/usr/lib/jvm/java-7-oracle/bin/javac: error=2, No such file or directory
...lots of errors
[error] (core/compile:compile) java.io.IOException: Cannot run program /home/deep/spark-1.0.0/usr/lib/jvm/java-7-oracle/bin/javac: error=2, No such file or directory
[error] Total time: 198 s, completed 16 Aug, 2014 10:25:50 AM

My ~/.bashrc file has the following (apart from other paths too):

export JAVA_HOME=usr/lib/jvm/java-7-oracle
export PATH=$PATH:$JAVA_HOME/bin

My spark-env.sh file has the following (apart from other paths):

export JAVA_HOME=/usr/lib/jvm/jdk-7-oracle

Can anyone tell me what I should modify? Thank You
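A hedged reading of the error: the javac path /home/deep/spark-1.0.0/usr/lib/jvm/... shows that JAVA_HOME was resolved relative to the Spark directory, i.e. the leading slash is missing in ~/.bashrc, and spark-env.sh points at a differently named JVM directory. Something like the following, adjusted to whichever directory actually exists on the machine, should fix it:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PATH=$PATH:$JAVA_HOME/bin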
Minimum Split of Hadoop RDD
Hi, I am using a single node Spark cluster on HDFS. When I was going through the SparkPageRank.scala code, I came across the following line: *val lines = ctx.textFile(args(0), 1)* where args(0) is the path of the input file in HDFS, and the second argument is the minimum split of the Hadoop RDD (textFile in the Spark documentation). Could anyone please tell me how this minimum split plays a role? Can we change it? If so, how does it affect performance? Thank You
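A minimal sketch of the parameter in question: the second argument to textFile is a lower bound on the number of partitions (Hadoop input splits) of the resulting RDD; more partitions mean more tasks and hence more potential parallelism.

val lines = ctx.textFile(args(0), 8) // ask for at least 8 partitions instead of 1
println(lines.partitions.length)     // observe the actual split count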
Re: GraphX runs without Spark?
We need to pass the URL only when we are using the interactive shell right? Now, I am not using the interactive shell, I am just doing ./bin/run-example.. when I am in the Spark directory. If not, Spark may be ignoring your single-node cluster and defaulting to local mode. What does this mean? I can work even without Spark coming up? Does the same thing happen even if I have a multi-node cluster? Thank You On Sun, Aug 3, 2014 at 2:24 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-08-03 13:14:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: I have a single node cluster on which I have Spark running. I ran some graphx codes on some data set. Now when I stop all the workers in the cluster (sbin/stop-all.sh), the codes still run and gives the answers. Why is it so? I mean does graphx run even without Spark coming up? Are you passing the correct master URL (spark://master-address:7077) to run-example or spark-submit? If not, Spark may be ignoring your single-node cluster and defaulting to local mode. Ankur
GraphX
Hi, I am running Spark in a single node cluster. I am able to run the code in Spark, like SparkPageRank.scala and SparkKMeans.scala, with the following command: bin/run-example org.apache.spark.examples.SparkPageRank plus the required arguments. Now, I want to run the PageRank.scala that is there in GraphX. Do we have a similar command to the earlier one for this too? I also went through the run-example shell file and saw that the path is set to org.apache.spark.examples. How should I run GraphX code? Thank You