Clustering of Words

2015-11-08 Thread Deep Pradhan
Hi,
I am trying to cluster the words of some articles. I used TF-IDF and Word2Vec
in Spark to get a vector for each word, and I used KMeans to cluster the
vectors. Now, is there any way to get back the words from the vectors? I want
to know which words are in each cluster.
I am aware that TF-IDF does not have an inverse. Does anyone know how to get
the words back from the clusters?

Thank You
Regards,
Deep
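
One way to get the words back is to never lose them in the first place: Word2VecModel.getVectors returns a word-to-vector map, so the (word, vector) pairs can be carried through clustering and each word assigned a cluster id with KMeansModel.predict. A rough sketch, assuming Spark 1.x MLlib (the input path, k, and iteration count are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.mllib.linalg.Vectors

object WordClusters {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordClusters"))

    // Tokenized articles; the path is illustrative.
    val corpus = sc.textFile("hdfs:///articles/*.txt").map(_.split(" ").toSeq)

    // getVectors keeps each word alongside its learned vector.
    val w2vModel = new Word2Vec().fit(corpus)
    val wordVectors = sc.parallelize(w2vModel.getVectors.toSeq)
      .map { case (word, arr) => (word, Vectors.dense(arr.map(_.toDouble))) }

    // Cluster the vectors, then assign a cluster id to each (word, vector) pair.
    val kmeansModel = KMeans.train(wordVectors.map(_._2), 10, 20)
    val wordsByCluster = wordVectors
      .map { case (word, vec) => (kmeansModel.predict(vec), word) }
      .groupByKey()

    wordsByCluster.collect().foreach { case (id, words) =>
      println(s"cluster $id: ${words.take(20).mkString(", ")}")
    }
    sc.stop()
  }
}

The same trick works for TF-IDF document vectors: the hashing cannot be inverted, but keeping a (documentId, vector) pair next to each vector avoids having to invert it.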


Reg. Difference in Performance

2015-02-28 Thread Deep Pradhan
Hi,
I am running Spark applications on GCE. I set up clusters with the number of
nodes varying from 1 to 7; the machines are single-core machines.
For each cluster I set spark.default.parallelism to the number of nodes in
the cluster. I ran four of the applications available in the Spark examples
(SparkTC, SparkALS, SparkLR, SparkPi) for each configuration.
What I notice is the following:
for SparkTC and SparkALS, the time to complete the job increases with the
number of nodes in the cluster, whereas for SparkLR and SparkPi, the time to
complete the job remains the same across all configurations.
Could anyone explain this to me?

Thank You
Regards,
Deep


Re: Reg. Difference in Performance

2015-02-28 Thread Deep Pradhan
Do you mean the size of the data that we use?

Thank You
Regards,
Deep

On Sun, Mar 1, 2015 at 6:04 AM, Joseph Bradley jos...@databricks.com
wrote:

 Hi Deep,

 Compute times may not be very meaningful for small examples like those.
 If you increase the sizes of the examples, then you may start to observe
 more meaningful trends and speedups.

 Joseph

 On Sat, Feb 28, 2015 at 7:26 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 I am running Spark applications in GCE. I set up cluster with different
 number of nodes varying from 1 to 7. The machines are single core machines.
 I set the spark.default.parallelism to the number of nodes in the cluster
 for each cluster. I ran the four applications available in Spark Examples,
 SparkTC, SparkALS, SparkLR, SparkPi for each of the configurations.
 What I notice is the following:
 In case of SparkTC and SparkALS, the time to complete the job increases
 with the increase in number of nodes in cluster, where as in SparkLR and
 SparkPi, the time to complete the job remains the same across all the
 configurations.
 Could anyone explain me this?

 Thank You
 Regards,
 Deep





spark.default.parallelism

2015-02-27 Thread Deep Pradhan
Hi,
I have four single-core machines as slaves in my cluster. I set
spark.default.parallelism to 4 and ran the SparkTC example; it took around
26 seconds.
Then I increased spark.default.parallelism to 8, but performance
deteriorated: the same application now takes 32 seconds.
I have read that the best performance is usually obtained when the
parallelism is set to 2x the number of available cores. I do not quite
understand this. Could anyone please explain?


Thank You
Regards,
Deep
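
For what it's worth, spark.default.parallelism mainly sets the default number of partitions for shuffle operations (and for parallelize); on 4 cores the 2x rule of thumb would suggest a value around 8, but it is only a heuristic, and a tiny example such as SparkTC can easily be dominated by scheduling overhead rather than by parallelism. A minimal sketch of setting it per application (the app name and input path are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object ParallelismExample {
  def main(args: Array[String]): Unit = {
    // With 4 single-core slaves there are 4 cores in total, so the usual
    // rule of thumb suggests a default parallelism of about 8.
    val conf = new SparkConf()
      .setAppName("ParallelismExample")
      .set("spark.default.parallelism", "8")
    val sc = new SparkContext(conf)

    // Shuffle operations such as reduceByKey pick up the default unless a
    // partition count is passed explicitly, as in the second variant below.
    val pairs = sc.textFile("hdfs:///input/data.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
    val countsDefault  = pairs.reduceByKey(_ + _)      // uses spark.default.parallelism
    val countsExplicit = pairs.reduceByKey(_ + _, 16)  // overrides it for this one shuffle

    println(s"${countsDefault.partitions.length} vs ${countsExplicit.partitions.length}")
    sc.stop()
  }
}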


Reg. KNN on MLlib

2015-02-26 Thread Deep Pradhan
Has the KNN classification algorithm been implemented in MLlib?

Thank You
Regards,
Deep


Spark on EC2

2015-02-24 Thread Deep Pradhan
Hi,
I have just signed up for Amazon AWS because I learnt that it provides
services for free for the first 12 months.
I want to run Spark on an EC2 cluster. Will they charge me for this?

Thank You


Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
Thank You Sean.
I was just trying to experiment with the performance of Spark applications
with various numbers of worker instances (I hope you remember that we
discussed the worker instances).
I thought EC2 would be a good place to try this. So it doesn't work out,
does it?

Thank You

On Tue, Feb 24, 2015 at 8:40 PM, Sean Owen so...@cloudera.com wrote:

 The free tier includes 750 hours of t2.micro instance time per month.
 http://aws.amazon.com/free/

 That's basically a month of hours, so it's all free if you run one
 instance only at a time. If you run 4, you'll be able to run your
 cluster of 4 for about a week free.

 A t2.micro has 1GB of memory, which is small but something you could
 possible get work done with.

 However it provides only burst CPU. You can only use about 10% of 1
 vCPU continuously due to capping. Imagine this as about 1/10th of 1
 core on your laptop. It would be incredibly slow.

 This is not to mention the network and I/O bottleneck you're likely to
 run into as you don't get much provisioning with these free instances.

 So, no you really can't use this for anything that is at all CPU
 intensive. It's for, say, running a low-traffic web service.

 On Tue, Feb 24, 2015 at 2:55 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  Hi,
  I have just signed up for Amazon AWS because I learnt that it provides
  service for free for the first 12 months.
  I want to run Spark on EC2 cluster. Will they charge me for this?
 
  Thank You



Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
No, I think I am OK with the time it takes.
It is just that I want to see the performance of an application improve as I
increase the number of partitions along with the number of workers.
Any comments?

Thank You

On Tue, Feb 24, 2015 at 8:52 PM, Sean Owen so...@cloudera.com wrote:

 You can definitely, easily, try a 1-node standalone cluster for free.
 Just don't be surprised when the CPU capping kicks in within about 5
 minutes of any non-trivial computation and suddenly the instance is
 very s-l-o-w.

 I would consider just paying the ~$0.07/hour to play with an
 m3.medium, which ought to be pretty OK for basic experimentation.

 On Tue, Feb 24, 2015 at 3:14 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  Thank You Sean.
  I was just trying to experiment with the performance of Spark
 Applications
  with various worker instances (I hope you remember that we discussed
 about
  the worker instances).
  I thought it would be a good one to try in EC2. So, it doesn't work out,
  does it?
 
  Thank You
 
  On Tue, Feb 24, 2015 at 8:40 PM, Sean Owen so...@cloudera.com wrote:
 
  The free tier includes 750 hours of t2.micro instance time per month.
  http://aws.amazon.com/free/
 
  That's basically a month of hours, so it's all free if you run one
  instance only at a time. If you run 4, you'll be able to run your
  cluster of 4 for about a week free.
 
  A t2.micro has 1GB of memory, which is small but something you could
  possible get work done with.
 
  However it provides only burst CPU. You can only use about 10% of 1
  vCPU continuously due to capping. Imagine this as about 1/10th of 1
  core on your laptop. It would be incredibly slow.
 
  This is not to mention the network and I/O bottleneck you're likely to
  run into as you don't get much provisioning with these free instances.
 
  So, no you really can't use this for anything that is at all CPU
  intensive. It's for, say, running a low-traffic web service.
 
  On Tue, Feb 24, 2015 at 2:55 PM, Deep Pradhan 
 pradhandeep1...@gmail.com
  wrote:
   Hi,
   I have just signed up for Amazon AWS because I learnt that it provides
   service for free for the first 12 months.
   I want to run Spark on EC2 cluster. Will they charge me for this?
  
   Thank You
 
 



Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
Thank You, Akhil. I will look into it.
It's free, isn't it? I am still a student :)

On Tue, Feb 24, 2015 at 9:06 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 If you signup for Google Compute Cloud, you will get free $300 credits for
 3 months and you can start a pretty good cluster for your testing purposes.
 :)

 Thanks
 Best Regards

 On Tue, Feb 24, 2015 at 8:25 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 I have just signed up for Amazon AWS because I learnt that it provides
 service for free for the first 12 months.
 I want to run Spark on EC2 cluster. Will they charge me for this?

 Thank You





Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
Kindly bear with my questions as I am new to this.
 If you run spark on local mode on a ec2 machine
What does this mean? Is it that I launch the Spark cluster from my local
machine, i.e., by running the shell script that is in spark/ec2?

On Tue, Feb 24, 2015 at 8:32 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 As a real spark cluster needs a least one master and one slaves, you need
 to launch two machine. Therefore the second machine is not free.
 However, If you run spark on local mode on a ec2 machine. It is free.

 The charge of AWS depends on how much and the types of machine that you
 launched, but not on the utilisation of machine.

 Hope it would help.

 Cheers
 Gen


 On Tue, Feb 24, 2015 at 3:55 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 I have just signed up for Amazon AWS because I learnt that it provides
 service for free for the first 12 months.
 I want to run Spark on EC2 cluster. Will they charge me for this?

 Thank You





Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
Thank You All.
I think I will look into paying the ~$0.07/hr that Sean suggested.

On Tue, Feb 24, 2015 at 9:01 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 I am sorry that I made a mistake on AWS tarif. You can read the email of
 sean owen which explains better the strategies to run spark on AWS.

 For your question: it means that you just download spark and unzip it.
 Then run spark shell by ./bin/spark-shell or ./bin/pyspark. It is useful to
 get familiar with spark. You can do this on your laptop as well as on ec2.
 In fact, running ./ec2/spark-ec2 means launching spark standalone mode on a
 cluster, you can find more details here:
 https://spark.apache.org/docs/latest/spark-standalone.html

 Cheers
 Gen


 On Tue, Feb 24, 2015 at 4:07 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Kindly bear with my questions as I am new to this.
  If you run spark on local mode on a ec2 machine
 What does this mean? Is it that I launch Spark cluster from my local
 machine,i.e., by running the shell script that is there in /spark/ec2?

 On Tue, Feb 24, 2015 at 8:32 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 As a real spark cluster needs a least one master and one slaves, you
 need to launch two machine. Therefore the second machine is not free.
 However, If you run spark on local mode on a ec2 machine. It is free.

 The charge of AWS depends on how much and the types of machine that you
 launched, but not on the utilisation of machine.

 Hope it would help.

 Cheers
 Gen


 On Tue, Feb 24, 2015 at 3:55 PM, Deep Pradhan pradhandeep1...@gmail.com
  wrote:

 Hi,
 I have just signed up for Amazon AWS because I learnt that it provides
 service for free for the first 12 months.
 I want to run Spark on EC2 cluster. Will they charge me for this?

 Thank You







Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
Hi,
If I repartition my data by a factor equal to the number of worker
instances, will the performance be better or worse?
As far as I understand, the performance should be better, but in my case it
is becoming worse.
I have a single-node standalone cluster; is it because of this?
Am I guaranteed better performance if I do the same thing in a multi-node
cluster?

Thank You


Re: Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
How is a task slot different from the # of Workers?


 so don't read into any performance metrics you've collected to
extrapolate what may happen at scale.
I did not quite follow this part.

Thank You

On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui same...@databricks.com
wrote:

 In general you should first figure out how many task slots are in the
 cluster and then repartition the RDD to maybe 2x that #. So if you have a
 100 slots, then maybe RDDs with partition count of 100-300 would be normal.

 But also size of each partition can matter. You want a task to operate on
 a partition for at least 200ms, but no longer than around 20 seconds.

 Even if you have 100 slots, it could be okay to have a RDD with 10,000
 partitions if you've read in a large file.

 So don't repartition your RDD to match the # of Worker JVMs, but rather
 align it to the total # of task slots in the Executors.

 If you're running on a single node, shuffle operations become almost free
 (because there's no network movement), so don't read into any performance
 metrics you've collected to extrapolate what may happen at scale.


 On Monday, February 23, 2015, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 If I repartition my data by a factor equal to the number of worker
 instances, will the performance be better or worse?
 As far as I understand, the performance should be better, but in my case
 it is becoming worse.
 I have a single node standalone cluster, is it because of this?
 Am I guaranteed to have a better performance if I do the same thing in a
 multi-node cluster?

 Thank You
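
A rough sketch of the advice above (align the partition count with the total number of task slots rather than with the number of Worker JVMs); the slot count, the 2x factor, and the input path are illustrative, and on a real cluster the total core count can be read from the master UI instead of being hard-coded:

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionToSlots {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RepartitionToSlots"))

    // Illustrative figures: e.g. 5 executors x 4 cores each = 20 task slots.
    val totalTaskSlots = 20
    val targetPartitions = totalTaskSlots * 2  // roughly 2x the slots, per the advice above

    val raw = sc.textFile("hdfs:///input/big-file.txt")
    println(s"partitions from input splits: ${raw.partitions.length}")

    // Only repartition when the input is badly partitioned; repartition is a full shuffle.
    val aligned =
      if (raw.partitions.length < targetPartitions) raw.repartition(targetPartitions)
      else raw

    val counts = aligned.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    println(counts.count())
    sc.stop()
  }
}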




Re: Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
Do you mean SPARK_WORKER_CORES in conf/spark-env.sh?

On Mon, Feb 23, 2015 at 11:06 PM, Sameer Farooqui same...@databricks.com
wrote:

 In Standalone mode, a Worker JVM starts an Executor. Inside the Exec there
 are slots for task threads. The slot count is configured by the num_cores
 setting. Generally over subscribe this. So if you have 10 free CPU cores,
 set num_cores to 20.


 On Monday, February 23, 2015, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 How is task slot different from # of Workers?


  so don't read into any performance metrics you've collected to
 extrapolate what may happen at scale.
 I did not get you in this.

 Thank You

 On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui same...@databricks.com
  wrote:

 In general you should first figure out how many task slots are in the
 cluster and then repartition the RDD to maybe 2x that #. So if you have a
 100 slots, then maybe RDDs with partition count of 100-300 would be normal.

 But also size of each partition can matter. You want a task to operate
 on a partition for at least 200ms, but no longer than around 20 seconds.

 Even if you have 100 slots, it could be okay to have a RDD with 10,000
 partitions if you've read in a large file.

 So don't repartition your RDD to match the # of Worker JVMs, but rather
 align it to the total # of task slots in the Executors.

 If you're running on a single node, shuffle operations become almost
 free (because there's no network movement), so don't read into any
 performance metrics you've collected to extrapolate what may happen at
 scale.


 On Monday, February 23, 2015, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 If I repartition my data by a factor equal to the number of worker
 instances, will the performance be better or worse?
 As far as I understand, the performance should be better, but in my
 case it is becoming worse.
 I have a single node standalone cluster, is it because of this?
 Am I guaranteed to have a better performance if I do the same thing in
 a multi-node cluster?

 Thank You





Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
 So increasing Executors without increasing physical resources
If I have a 16 GB RAM system and I allocate 1 GB to each executor, with the
number of executors set to 8, then I am increasing the resources, right?
How do you explain this case?

Thank You

On Sun, Feb 22, 2015 at 6:12 AM, Aaron Davidson ilike...@gmail.com wrote:

 Note that the parallelism (i.e., number of partitions) is just an upper
 bound on how much of the work can be done in parallel. If you have 200
 partitions, then you can divide the work among between 1 and 200 cores and
 all resources will remain utilized. If you have more than 200 cores,
 though, then some will not be used, so you would want to increase
 parallelism further. (There are other rules-of-thumb -- for instance, it's
 generally good to have at least 2x more partitions than cores for straggler
 mitigation, but these are essentially just optimizations.)

 Further note that when you increase the number of Executors for the same
 set of resources (i.e., starting 10 Executors on a single machine instead
 of 1), you make Spark's job harder. Spark has to communicate in an
 all-to-all manner across Executors for shuffle operations, and it uses TCP
 sockets to do so whether or not the Executors happen to be on the same
 machine. So increasing Executors without increasing physical resources
 means Spark has to do more communication to do the same work.

 We expect that increasing the number of Executors by a factor of 10, given
 an increase in the number of physical resources by the same factor, would
 also improve performance by 10x. This is not always the case for the
 precise reason above (increased communication overhead), but typically we
 can get close. The actual observed improvement is very algorithm-dependent,
 though; for instance, some ML algorithms become hard to scale out past a
 certain point because the increase in communication overhead outweighs the
 increase in parallelism.

 On Sat, Feb 21, 2015 at 8:19 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 So, if I keep the number of instances constant and increase the degree of
 parallelism in steps, can I expect the performance to increase?

 Thank You

 On Sat, Feb 21, 2015 at 9:07 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 So, with the increase in the number of worker instances, if I also
 increase the degree of parallelism, will it make any difference?
 I can use this model even the other way round right? I can always
 predict the performance of an app with the increase in number of worker
 instances, the deterioration in performance, right?

 Thank You

 On Sat, Feb 21, 2015 at 8:52 PM, Deep Pradhan pradhandeep1...@gmail.com
  wrote:

 Yes, I have decreased the executor memory.
 But,if I have to do this, then I have to tweak around with the code
 corresponding to each configuration right?

 On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen so...@cloudera.com wrote:

 Workers has a specific meaning in Spark. You are running many on one
 machine? that's possible but not usual.

 Each worker's executors have access to a fraction of your machine's
 resources then. If you're not increasing parallelism, maybe you're not
 actually using additional workers, so are using less resource for your
 problem.

 Or because the resulting executors are smaller, maybe you're hitting
 GC thrashing in these executors with smaller heaps.

 Or if you're not actually configuring the executors to use less
 memory, maybe you're over-committing your RAM and swapping?

 Bottom line, you wouldn't use multiple workers on one small standalone
 node. This isn't a good way to estimate performance on a distributed
 cluster either.

 On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:
  No, I just have a single node standalone cluster.
 
  I am not tweaking around with the code to increase parallelism. I am
 just
  running SparkKMeans that is there in Spark-1.0.0
  I just wanted to know, if this behavior is natural. And if so, what
 causes
  this?
 
  Thank you
 
  On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen so...@cloudera.com
 wrote:
 
  What's your storage like? are you adding worker machines that are
  remote from where the data lives? I wonder if it just means you are
  spending more and more time sending the data over the network as you
  try to ship more of it to more remote workers.
 
  To answer your question, no in general more workers means more
  parallelism and therefore faster execution. But that depends on a
 lot
  of things. For example, if your process isn't parallelize to use all
  available execution slots, adding more slots doesn't do anything.
 
  On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan 
 pradhandeep1...@gmail.com
  wrote:
   Yes, I am talking about standalone single node cluster.
  
   No, I am not increasing parallelism. I just wanted to know if it
 is
   natural.
   Does message passing across the workers account for the
 happenning?
  
   I am running SparkKMeans, just

Re: Perf Prediction

2015-02-21 Thread Deep Pradhan
Has anyone done any work on that?

On Sun, Feb 22, 2015 at 9:57 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Yes, exactly.

 On Sun, Feb 22, 2015 at 9:10 AM, Ognen Duzlevski 
 ognen.duzlev...@gmail.com wrote:

 On Sat, Feb 21, 2015 at 8:54 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 No, I am talking about some work parallel to prediction works that are
 done on GPUs. Like say, given the data for smaller number of nodes in a
 Spark cluster, the prediction needs to be done about the time that the
 application would take when we have larger number of nodes.


 Are you talking about predicting how performance would increase with
 adding more nodes/CPUs/whatever?





Re: Perf Prediction

2015-02-21 Thread Deep Pradhan
Yes, exactly.

On Sun, Feb 22, 2015 at 9:10 AM, Ognen Duzlevski ognen.duzlev...@gmail.com
wrote:

 On Sat, Feb 21, 2015 at 8:54 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 No, I am talking about some work parallel to prediction works that are
 done on GPUs. Like say, given the data for smaller number of nodes in a
 Spark cluster, the prediction needs to be done about the time that the
 application would take when we have larger number of nodes.


 Are you talking about predicting how performance would increase with
 adding more nodes/CPUs/whatever?



Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
Also, if I take SparkPageRank as an example (org.apache.spark.examples),
there are various RDDs that are created and transformed in the code. If I
want to find the optimum number of partitions that gives me the best
performance, I have to change the number of partitions in each run, right?
Now, there are various RDDs there, so which RDD do I partition? In other
words, if I partition the first RDD that is created from the data in HDFS,
am I assured that the other RDDs transformed from this RDD will also be
partitioned in the same way?

Thank You

On Sun, Feb 22, 2015 at 10:02 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

  So increasing Executors without increasing physical resources
 If I have a 16 GB RAM system and then I allocate 1 GB for each executor,
 and give number of executors as 8, then I am increasing the resource right?
 In this case, how do you explain?

 Thank You

 On Sun, Feb 22, 2015 at 6:12 AM, Aaron Davidson ilike...@gmail.com
 wrote:

 Note that the parallelism (i.e., number of partitions) is just an upper
 bound on how much of the work can be done in parallel. If you have 200
 partitions, then you can divide the work among between 1 and 200 cores and
 all resources will remain utilized. If you have more than 200 cores,
 though, then some will not be used, so you would want to increase
 parallelism further. (There are other rules-of-thumb -- for instance, it's
 generally good to have at least 2x more partitions than cores for straggler
 mitigation, but these are essentially just optimizations.)

 Further note that when you increase the number of Executors for the same
 set of resources (i.e., starting 10 Executors on a single machine instead
 of 1), you make Spark's job harder. Spark has to communicate in an
 all-to-all manner across Executors for shuffle operations, and it uses TCP
 sockets to do so whether or not the Executors happen to be on the same
 machine. So increasing Executors without increasing physical resources
 means Spark has to do more communication to do the same work.

 We expect that increasing the number of Executors by a factor of 10,
 given an increase in the number of physical resources by the same factor,
 would also improve performance by 10x. This is not always the case for the
 precise reason above (increased communication overhead), but typically we
 can get close. The actual observed improvement is very algorithm-dependent,
 though; for instance, some ML algorithms become hard to scale out past a
 certain point because the increase in communication overhead outweighs the
 increase in parallelism.

 On Sat, Feb 21, 2015 at 8:19 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 So, if I keep the number of instances constant and increase the degree
 of parallelism in steps, can I expect the performance to increase?

 Thank You

 On Sat, Feb 21, 2015 at 9:07 PM, Deep Pradhan pradhandeep1...@gmail.com
  wrote:

 So, with the increase in the number of worker instances, if I also
 increase the degree of parallelism, will it make any difference?
 I can use this model even the other way round right? I can always
 predict the performance of an app with the increase in number of worker
 instances, the deterioration in performance, right?

 Thank You

 On Sat, Feb 21, 2015 at 8:52 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 Yes, I have decreased the executor memory.
 But,if I have to do this, then I have to tweak around with the code
 corresponding to each configuration right?

 On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen so...@cloudera.com wrote:

 Workers has a specific meaning in Spark. You are running many on one
 machine? that's possible but not usual.

 Each worker's executors have access to a fraction of your machine's
 resources then. If you're not increasing parallelism, maybe you're not
 actually using additional workers, so are using less resource for your
 problem.

 Or because the resulting executors are smaller, maybe you're hitting
 GC thrashing in these executors with smaller heaps.

 Or if you're not actually configuring the executors to use less
 memory, maybe you're over-committing your RAM and swapping?

 Bottom line, you wouldn't use multiple workers on one small standalone
 node. This isn't a good way to estimate performance on a distributed
 cluster either.

 On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:
  No, I just have a single node standalone cluster.
 
  I am not tweaking around with the code to increase parallelism. I
 am just
  running SparkKMeans that is there in Spark-1.0.0
  I just wanted to know, if this behavior is natural. And if so, what
 causes
  this?
 
  Thank you
 
  On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen so...@cloudera.com
 wrote:
 
  What's your storage like? are you adding worker machines that are
  remote from where the data lives? I wonder if it just means you are
  spending more and more time
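
On the question above about which RDD to partition: narrow transformations such as map and flatMap generally keep the parent's partition count, while shuffle operations choose their own (or fall back to the default), so setting the partition count once on the RDD read from HDFS and, where needed, passing numPartitions to the shuffle operations is usually enough. A rough sketch (paths and numbers are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object PartitionPropagation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionPropagation"))
    val numPartitions = 64  // illustrative; vary this per run when benchmarking

    // Ask for a minimum partition count when reading from HDFS.
    val lines = sc.textFile("hdfs:///input/edges.txt", numPartitions)

    // Narrow transformations inherit the parent's partition count...
    val edges = lines.map { line =>
      val parts = line.split("\\s+")
      (parts(0), parts(1))
    }
    println(s"${lines.partitions.length} == ${edges.partitions.length}")

    // ...while shuffles pick their own, so pass the count explicitly if it matters.
    val links = edges.groupByKey(numPartitions).cache()
    println(links.partitions.length)
    sc.stop()
  }
}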

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
In this case, I just wanted to know whether a single-node cluster with
multiple workers acts like a simulator of a multi-node cluster.
That is, if we have a single-node cluster with, say, 10 workers, can we
expect the same behavior as with a cluster of 10 nodes?
In other words, without having the 10-node cluster, can I learn the behavior
of the application on a 10-node cluster by using a single node with 10
workers? The time taken may vary, but I am talking about the qualitative
behavior. Can we say that?

On Sat, Feb 21, 2015 at 8:21 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Yes, I am talking about standalone single node cluster.

 No, I am not increasing parallelism. I just wanted to know if it is
 natural. Does message passing across the workers account for the happenning?

 I am running SparkKMeans, just to validate one prediction model. I am
 using several data sets. I have a standalone mode. I am varying the workers
 from 1 to 16

 On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen so...@cloudera.com wrote:

 I can imagine a few reasons. Adding workers might cause fewer tasks to
 execute locally (?) So you may be execute more remotely.

 Are you increasing parallelism? for trivial jobs, chopping them up
 further may cause you to pay more overhead of managing so many small
 tasks, for no speed up in execution time.

 Can you provide any more specifics though? you haven't said what
 you're running, what mode, how many workers, how long it takes, etc.

 On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  Hi,
  I have been running some jobs in my local single node stand alone
 cluster. I
  am varying the worker instances for the same job, and the time taken
 for the
  job to complete increases with increase in the number of workers. I
 repeated
  some experiments varying the number of nodes in a cluster too and the
 same
  behavior is seen.
  Can the idea of worker instances be extrapolated to the nodes in a
 cluster?
 
  Thank You





Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
So, if I also increase the degree of parallelism as I increase the number of
worker instances, will it make any difference?
I can also use this model the other way round, right? That is, I can always
predict how the performance of an app will deteriorate as the number of
worker instances increases, right?

Thank You

On Sat, Feb 21, 2015 at 8:52 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Yes, I have decreased the executor memory.
 But,if I have to do this, then I have to tweak around with the code
 corresponding to each configuration right?

 On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen so...@cloudera.com wrote:

 Workers has a specific meaning in Spark. You are running many on one
 machine? that's possible but not usual.

 Each worker's executors have access to a fraction of your machine's
 resources then. If you're not increasing parallelism, maybe you're not
 actually using additional workers, so are using less resource for your
 problem.

 Or because the resulting executors are smaller, maybe you're hitting
 GC thrashing in these executors with smaller heaps.

 Or if you're not actually configuring the executors to use less
 memory, maybe you're over-committing your RAM and swapping?

 Bottom line, you wouldn't use multiple workers on one small standalone
 node. This isn't a good way to estimate performance on a distributed
 cluster either.

 On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  No, I just have a single node standalone cluster.
 
  I am not tweaking around with the code to increase parallelism. I am
 just
  running SparkKMeans that is there in Spark-1.0.0
  I just wanted to know, if this behavior is natural. And if so, what
 causes
  this?
 
  Thank you
 
  On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen so...@cloudera.com wrote:
 
  What's your storage like? are you adding worker machines that are
  remote from where the data lives? I wonder if it just means you are
  spending more and more time sending the data over the network as you
  try to ship more of it to more remote workers.
 
  To answer your question, no in general more workers means more
  parallelism and therefore faster execution. But that depends on a lot
  of things. For example, if your process isn't parallelize to use all
  available execution slots, adding more slots doesn't do anything.
 
  On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan 
 pradhandeep1...@gmail.com
  wrote:
   Yes, I am talking about standalone single node cluster.
  
   No, I am not increasing parallelism. I just wanted to know if it is
   natural.
   Does message passing across the workers account for the happenning?
  
   I am running SparkKMeans, just to validate one prediction model. I am
   using
   several data sets. I have a standalone mode. I am varying the workers
   from 1
   to 16
  
   On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen so...@cloudera.com
 wrote:
  
   I can imagine a few reasons. Adding workers might cause fewer tasks
 to
   execute locally (?) So you may be execute more remotely.
  
   Are you increasing parallelism? for trivial jobs, chopping them up
   further may cause you to pay more overhead of managing so many small
   tasks, for no speed up in execution time.
  
   Can you provide any more specifics though? you haven't said what
   you're running, what mode, how many workers, how long it takes, etc.
  
   On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan
   pradhandeep1...@gmail.com
   wrote:
Hi,
I have been running some jobs in my local single node stand alone
cluster. I
am varying the worker instances for the same job, and the time
 taken
for
the
job to complete increases with increase in the number of workers.
 I
repeated
some experiments varying the number of nodes in a cluster too and
 the
same
behavior is seen.
Can the idea of worker instances be extrapolated to the nodes in a
cluster?
   
Thank You
  
  
 
 





Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
Yes, I am talking about a standalone single-node cluster.

No, I am not increasing parallelism. I just wanted to know whether this
behavior is natural. Does message passing across the workers account for what is happening?

I am running SparkKMeans, just to validate a prediction model. I am using
several data sets, in standalone mode, and I am varying the number of
workers from 1 to 16.

On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen so...@cloudera.com wrote:

 I can imagine a few reasons. Adding workers might cause fewer tasks to
 execute locally (?) So you may be execute more remotely.

 Are you increasing parallelism? for trivial jobs, chopping them up
 further may cause you to pay more overhead of managing so many small
 tasks, for no speed up in execution time.

 Can you provide any more specifics though? you haven't said what
 you're running, what mode, how many workers, how long it takes, etc.

 On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  Hi,
  I have been running some jobs in my local single node stand alone
 cluster. I
  am varying the worker instances for the same job, and the time taken for
 the
  job to complete increases with increase in the number of workers. I
 repeated
  some experiments varying the number of nodes in a cluster too and the
 same
  behavior is seen.
  Can the idea of worker instances be extrapolated to the nodes in a
 cluster?
 
  Thank You



Re: Perf Prediction

2015-02-21 Thread Deep Pradhan
No, I am talking about work analogous to the performance-prediction work that
has been done for GPUs. For example, given measurements from a Spark cluster
with a small number of nodes, the prediction to be made is the time the
application would take when we have a larger number of nodes.

On Sat, Feb 21, 2015 at 8:22 PM, Ted Yu yuzhih...@gmail.com wrote:

 Can you be a bit more specific ?

 Are you asking about performance across Spark releases ?

 Cheers

 On Sat, Feb 21, 2015 at 6:38 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 Has some performance prediction work been done on Spark?

 Thank You





Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
No, I just have a single-node standalone cluster.

I am not tweaking the code to increase parallelism; I am just running the
SparkKMeans example that ships with Spark 1.0.0.
I just wanted to know whether this behavior is natural, and if so, what
causes it.

Thank you

On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen so...@cloudera.com wrote:

 What's your storage like? are you adding worker machines that are
 remote from where the data lives? I wonder if it just means you are
 spending more and more time sending the data over the network as you
 try to ship more of it to more remote workers.

 To answer your question, no in general more workers means more
 parallelism and therefore faster execution. But that depends on a lot
 of things. For example, if your process isn't parallelize to use all
 available execution slots, adding more slots doesn't do anything.

 On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  Yes, I am talking about standalone single node cluster.
 
  No, I am not increasing parallelism. I just wanted to know if it is
 natural.
  Does message passing across the workers account for the happenning?
 
  I am running SparkKMeans, just to validate one prediction model. I am
 using
  several data sets. I have a standalone mode. I am varying the workers
 from 1
  to 16
 
  On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen so...@cloudera.com wrote:
 
  I can imagine a few reasons. Adding workers might cause fewer tasks to
  execute locally (?) So you may be execute more remotely.
 
  Are you increasing parallelism? for trivial jobs, chopping them up
  further may cause you to pay more overhead of managing so many small
  tasks, for no speed up in execution time.
 
  Can you provide any more specifics though? you haven't said what
  you're running, what mode, how many workers, how long it takes, etc.
 
  On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan 
 pradhandeep1...@gmail.com
  wrote:
   Hi,
   I have been running some jobs in my local single node stand alone
   cluster. I
   am varying the worker instances for the same job, and the time taken
 for
   the
   job to complete increases with increase in the number of workers. I
   repeated
   some experiments varying the number of nodes in a cluster too and the
   same
   behavior is seen.
   Can the idea of worker instances be extrapolated to the nodes in a
   cluster?
  
   Thank You
 
 



Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
Yes, I have decreased the executor memory.
But if I have to do this, then I have to tweak the code for each
configuration, right?

On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen so...@cloudera.com wrote:

 Workers has a specific meaning in Spark. You are running many on one
 machine? that's possible but not usual.

 Each worker's executors have access to a fraction of your machine's
 resources then. If you're not increasing parallelism, maybe you're not
 actually using additional workers, so are using less resource for your
 problem.

 Or because the resulting executors are smaller, maybe you're hitting
 GC thrashing in these executors with smaller heaps.

 Or if you're not actually configuring the executors to use less
 memory, maybe you're over-committing your RAM and swapping?

 Bottom line, you wouldn't use multiple workers on one small standalone
 node. This isn't a good way to estimate performance on a distributed
 cluster either.

 On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  No, I just have a single node standalone cluster.
 
  I am not tweaking around with the code to increase parallelism. I am just
  running SparkKMeans that is there in Spark-1.0.0
  I just wanted to know, if this behavior is natural. And if so, what
 causes
  this?
 
  Thank you
 
  On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen so...@cloudera.com wrote:
 
  What's your storage like? are you adding worker machines that are
  remote from where the data lives? I wonder if it just means you are
  spending more and more time sending the data over the network as you
  try to ship more of it to more remote workers.
 
  To answer your question, no in general more workers means more
  parallelism and therefore faster execution. But that depends on a lot
  of things. For example, if your process isn't parallelize to use all
  available execution slots, adding more slots doesn't do anything.
 
  On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan 
 pradhandeep1...@gmail.com
  wrote:
   Yes, I am talking about standalone single node cluster.
  
   No, I am not increasing parallelism. I just wanted to know if it is
   natural.
   Does message passing across the workers account for the happenning?
  
   I am running SparkKMeans, just to validate one prediction model. I am
   using
   several data sets. I have a standalone mode. I am varying the workers
   from 1
   to 16
  
   On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen so...@cloudera.com
 wrote:
  
   I can imagine a few reasons. Adding workers might cause fewer tasks
 to
   execute locally (?) So you may be execute more remotely.
  
   Are you increasing parallelism? for trivial jobs, chopping them up
   further may cause you to pay more overhead of managing so many small
   tasks, for no speed up in execution time.
  
   Can you provide any more specifics though? you haven't said what
   you're running, what mode, how many workers, how long it takes, etc.
  
   On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan
   pradhandeep1...@gmail.com
   wrote:
Hi,
I have been running some jobs in my local single node stand alone
cluster. I
am varying the worker instances for the same job, and the time
 taken
for
the
job to complete increases with increase in the number of workers. I
repeated
some experiments varying the number of nodes in a cluster too and
 the
same
behavior is seen.
Can the idea of worker instances be extrapolated to the nodes in a
cluster?
   
Thank You
  
  
 
 



Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
So, if I keep the number of instances constant and increase the degree of
parallelism in steps, can I expect the performance to increase?

Thank You

On Sat, Feb 21, 2015 at 9:07 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 So, with the increase in the number of worker instances, if I also
 increase the degree of parallelism, will it make any difference?
 I can use this model even the other way round right? I can always predict
 the performance of an app with the increase in number of worker instances,
 the deterioration in performance, right?

 Thank You

 On Sat, Feb 21, 2015 at 8:52 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Yes, I have decreased the executor memory.
 But,if I have to do this, then I have to tweak around with the code
 corresponding to each configuration right?

 On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen so...@cloudera.com wrote:

 Workers has a specific meaning in Spark. You are running many on one
 machine? that's possible but not usual.

 Each worker's executors have access to a fraction of your machine's
 resources then. If you're not increasing parallelism, maybe you're not
 actually using additional workers, so are using less resource for your
 problem.

 Or because the resulting executors are smaller, maybe you're hitting
 GC thrashing in these executors with smaller heaps.

 Or if you're not actually configuring the executors to use less
 memory, maybe you're over-committing your RAM and swapping?

 Bottom line, you wouldn't use multiple workers on one small standalone
 node. This isn't a good way to estimate performance on a distributed
 cluster either.

 On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  No, I just have a single node standalone cluster.
 
  I am not tweaking around with the code to increase parallelism. I am
 just
  running SparkKMeans that is there in Spark-1.0.0
  I just wanted to know, if this behavior is natural. And if so, what
 causes
  this?
 
  Thank you
 
  On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen so...@cloudera.com wrote:
 
  What's your storage like? are you adding worker machines that are
  remote from where the data lives? I wonder if it just means you are
  spending more and more time sending the data over the network as you
  try to ship more of it to more remote workers.
 
  To answer your question, no in general more workers means more
  parallelism and therefore faster execution. But that depends on a lot
  of things. For example, if your process isn't parallelize to use all
  available execution slots, adding more slots doesn't do anything.
 
  On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan 
 pradhandeep1...@gmail.com
  wrote:
   Yes, I am talking about standalone single node cluster.
  
   No, I am not increasing parallelism. I just wanted to know if it is
   natural.
   Does message passing across the workers account for the happenning?
  
   I am running SparkKMeans, just to validate one prediction model. I
 am
   using
   several data sets. I have a standalone mode. I am varying the
 workers
   from 1
   to 16
  
   On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen so...@cloudera.com
 wrote:
  
   I can imagine a few reasons. Adding workers might cause fewer
 tasks to
   execute locally (?) So you may be execute more remotely.
  
   Are you increasing parallelism? for trivial jobs, chopping them up
   further may cause you to pay more overhead of managing so many
 small
   tasks, for no speed up in execution time.
  
   Can you provide any more specifics though? you haven't said what
   you're running, what mode, how many workers, how long it takes,
 etc.
  
   On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan
   pradhandeep1...@gmail.com
   wrote:
Hi,
I have been running some jobs in my local single node stand alone
cluster. I
am varying the worker instances for the same job, and the time
 taken
for
the
job to complete increases with increase in the number of
 workers. I
repeated
some experiments varying the number of nodes in a cluster too
 and the
same
behavior is seen.
Can the idea of worker instances be extrapolated to the nodes in
 a
cluster?
   
Thank You
  
  
 
 






Perf Prediction

2015-02-21 Thread Deep Pradhan
Hi,
Has some performance prediction work been done on Spark?

Thank You


Worker and Nodes

2015-02-21 Thread Deep Pradhan
Hi,
I have been running some jobs on my local single-node standalone cluster.
I am varying the number of worker instances for the same job, and the time
taken for the job to complete increases as the number of workers increases.
I repeated some experiments varying the number of nodes in a cluster too,
and the same behavior is seen.
Can observations about worker instances be extrapolated to the nodes of a cluster?

Thank You


Re: Profiling in YourKit

2015-02-07 Thread Deep Pradhan
So, can I increase the number of threads by manually changing the Spark
code?

On Sat, Feb 7, 2015 at 6:52 PM, Sean Owen so...@cloudera.com wrote:

 If you look at the threads, the other 30 are almost surely not Spark
 worker threads. They're the JVM finalizer, GC threads, Jetty
 listeners, etc. Nothing wrong with this. Your OS has hundreds of
 threads running now, most of which are idle, and up to 4 of which can
 be executing.  In a one-machine cluster, I don't think you would
 expect any difference in number of running threads. More data does not
 mean more threads, no. Your executor probably takes as many threads as
 cores in both cases, 4.


 On Sat, Feb 7, 2015 at 10:14 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  Hi,
  I am using YourKit tool to profile Spark jobs that is run in my Single
 Node
  Spark Cluster.
  When I see the YourKit UI Performance Charts, the thread count always
  remains at
  All threads: 34
  Daemon threads: 32
 
  Here are my questions:
 
  1. My system can run only 4 threads simultaneously, and obviously my
 system
  does not have 34 threads. What could 34 threads mean?
 
  2. I tried running the same job with four different datasets, two small
 and
  two relatively big. But in the UI the thread count increases by two,
  irrespective of data size. Does this mean that the number of threads
  allocated to each job depending on data size is not taken care by the
  framework?
 
  Thank You



PR Request

2015-02-06 Thread Deep Pradhan
Hi,
When we submit a PR on GitHub, various tests are performed, such as the RAT
test and the Scala style test, and beyond these there are many other tests
that run for a longer time.
Could anyone please point me to the details of the tests that are performed
there?

Thank You


Reg GraphX APSP

2015-02-06 Thread Deep Pradhan
Hi,
Is the All-Pairs Shortest Path implementation on GraphX for directed graphs
or for undirected graphs? When I use the algorithm with my dataset, it seems
to treat the graph as undirected.
Has anyone come across this before?

Thank you


Re: Reg Job Server

2015-02-05 Thread Deep Pradhan
I read somewhere about Gatling. Can that be used to profile Spark jobs?

On Fri, Feb 6, 2015 at 10:27 AM, Kostas Sakellis kos...@cloudera.com
wrote:

 Which Spark Job server are you talking about?

 On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 Can Spark Job Server be used for profiling Spark jobs?





Re: Reg Job Server

2015-02-05 Thread Deep Pradhan
Yes, I want to know the reason the job is slow.
I will look at YourKit.
Could you point me to a tutorial on how to use it?

Thank You

On Fri, Feb 6, 2015 at 10:44 AM, Kostas Sakellis kos...@cloudera.com
wrote:

 When you say profiling, what are you trying to figure out? Why your spark
 job is slow? Gatling seems to be a load generating framework so I'm not
 sure how you'd use it (i've never used it before). Spark runs on the JVM so
 you can use any JVM profiling tools like YourKit.

 Kostas

 On Thu, Feb 5, 2015 at 9:03 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 I read somewhere about Gatling. Can that be used to profile Spark jobs?

 On Fri, Feb 6, 2015 at 10:27 AM, Kostas Sakellis kos...@cloudera.com
 wrote:

 Which Spark Job server are you talking about?

 On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 Can Spark Job Server be used for profiling Spark jobs?







Re: Reg Job Server

2015-02-05 Thread Deep Pradhan
I have a single node Spark standalone cluster. Will this also work for my
cluster?

Thank You

On Fri, Feb 6, 2015 at 11:02 AM, Mark Hamstra m...@clearstorydata.com
wrote:


 https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit

 On Thu, Feb 5, 2015 at 9:18 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Yes, I want to know, the reason about the job being slow.
 I will look at YourKit.
 Can you redirect me to that, some tutorial in how to use?

 Thank You

 On Fri, Feb 6, 2015 at 10:44 AM, Kostas Sakellis kos...@cloudera.com
 wrote:

 When you say profiling, what are you trying to figure out? Why your
 spark job is slow? Gatling seems to be a load generating framework so I'm
 not sure how you'd use it (i've never used it before). Spark runs on the
 JVM so you can use any JVM profiling tools like YourKit.

 Kostas

 On Thu, Feb 5, 2015 at 9:03 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 I read somewhere about Gatling. Can that be used to profile Spark jobs?

 On Fri, Feb 6, 2015 at 10:27 AM, Kostas Sakellis kos...@cloudera.com
 wrote:

 Which Spark Job server are you talking about?

 On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 Hi,
 Can Spark Job Server be used for profiling Spark jobs?









Reg Job Server

2015-02-05 Thread Deep Pradhan
Hi,
Can Spark Job Server be used for profiling Spark jobs?


Re: Union in Spark

2015-02-01 Thread Deep Pradhan
The configuration is 16 GB RAM and a 1 TB HDD; I have a single-node Spark
cluster.
Even after setting the driver memory to 5g and the executor memory to 3g, I
get this error. The size of the data set is 350 KB, and the set with which
it works well is hardly a few KB.

On Mon, Feb 2, 2015 at 1:18 PM, Arush Kharbanda ar...@sigmoidanalytics.com
wrote:

 Hi Deep,

 What is your configuration and what is the size of the 2 data sets?

 Thanks
 Arush

 On Mon, Feb 2, 2015 at 11:56 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 I did not check the console because once the job starts I cannot run
 anything else and have to force shutdown the system. I commented parts of
 codes and I tested. I doubt it is because of union. So, I want to change it
 to something else and see if the problem persists.

 Thank you

 On Mon, Feb 2, 2015 at 11:53 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi Deep,

 How do you know the cluster is not responsive because of Union?
 Did you check the spark web console?

 Best Regards,

 Jerry


 On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 The cluster hangs.

 On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling...@gmail.com
 wrote:

 Hi Deep,

 what do you mean by stuck?

 Jerry


 On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 Hi,
 Is there any better operation than Union. I am using union and the
 cluster is getting stuck with a large data set.

 Thank you








 --

 [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com



Union in Spark

2015-02-01 Thread Deep Pradhan
Hi,
Is there any better operation than union? I am using union, and the cluster
is getting stuck with a large data set.

Thank you
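
For what it's worth, union itself does not shuffle data; with a large data set the trouble is more often a deep lineage of chained pairwise unions, a missing persist before the result is reused, or a later shuffle such as distinct. A rough sketch, assuming plain RDDs (the input paths are illustrative); SparkContext.union over a sequence builds a single flat UnionRDD instead of a nested chain:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object UnionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("UnionSketch"))

    val parts: Seq[RDD[String]] =
      Seq("a", "b", "c").map(name => sc.textFile(s"hdfs:///input/$name.txt"))

    // One flat UnionRDD instead of parts.reduce(_ union _), which nests the lineage.
    val all = sc.union(parts)

    // union keeps duplicates and does not shuffle; distinct is the expensive step.
    val deduped = all.distinct()

    // Persist if the result is reused; otherwise the union is recomputed by each action.
    deduped.persist(StorageLevel.MEMORY_AND_DISK)
    println(deduped.count())
    sc.stop()
  }
}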


Re: Union in Spark

2015-02-01 Thread Deep Pradhan
The cluster hangs.

On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi Deep,

 what do you mean by stuck?

 Jerry


 On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 Is there any better operation than Union. I am using union and the
 cluster is getting stuck with a large data set.

 Thank you





Re: Union in Spark

2015-02-01 Thread Deep Pradhan
I did not check the console because once the job starts I cannot run
anything else and have to force a shutdown of the system. I commented out
parts of the code and tested; I suspect it is because of the union. So I
want to change it to something else and see whether the problem persists.

Thank you

On Mon, Feb 2, 2015 at 11:53 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi Deep,

 How do you know the cluster is not responsive because of Union?
 Did you check the spark web console?

 Best Regards,

 Jerry


 On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 The cluster hangs.

 On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi Deep,

 what do you mean by stuck?

 Jerry


 On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com
  wrote:

 Hi,
 Is there any better operation than Union. I am using union and the
 cluster is getting stuck with a large data set.

 Thank you







Spark on Gordon

2015-01-31 Thread Deep Pradhan
Hi All,
Gordon SC has Spark installed on it. Has anyone tried running Spark jobs on
Gordon?

Thank You


While Loop

2015-01-23 Thread Deep Pradhan
Hi,
Is there a better programming construct than a while loop in Spark?

Thank You
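
For context, iterative Spark programs (including the bundled SparkKMeans example) normally do use a plain driver-side while loop; what matters is caching the RDD that is reused across iterations and keeping the per-iteration work inside RDD operations, since each pass of the loop launches one or more Spark jobs. A toy sketch of that pattern (the fixed-point computation, threshold, and path are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object IterativeLoop {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeLoop"))

    // Reused in every iteration, so cache it once.
    val data = sc.textFile("hdfs:///input/points.txt").map(_.toDouble).cache()
    val n = data.count()

    var estimate = 0.0
    var delta = Double.MaxValue
    var iteration = 0
    // Toy fixed-point iteration that converges to the mean of the data.
    while (delta > 1e-6 && iteration < 100) {
      val newEstimate = data.map(x => (x + estimate) / 2).reduce(_ + _) / n
      delta = math.abs(newEstimate - estimate)
      estimate = newEstimate
      iteration += 1
    }
    println(s"converged to $estimate after $iteration iterations")
    sc.stop()
  }
}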


Re: Bind Exception

2015-01-19 Thread Deep Pradhan
I closed the Spark shell and tried again, but there was no change.

Here is the error:

.
15/01/17 14:33:39 INFO AbstractConnector: Started
SocketConnector@0.0.0.0:59791
15/01/17 14:33:39 INFO Server: jetty-8.y.z-SNAPSHOT
15/01/17 14:33:39 WARN AbstractLifeCycle: FAILED
SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address
already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at
org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
at
org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
at
org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.eclipse.jetty.server.Server.doStart(Server.java:293)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at
org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:223)
at org.apache.spark.examples.SparkAPSP$.main(SparkAPSP.scala:21)
at org.apache.spark.examples.SparkAPSP.main(SparkAPSP.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/01/17 14:33:39 WARN AbstractLifeCycle: FAILED
org.eclipse.jetty.server.Server@f1b69ca: java.net.BindException: Address
already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at
org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
at
org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
at
org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.eclipse.jetty.server.Server.doStart(Server.java:293)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at
org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
at org.apache.spark.SparkContext.init(SparkContext.scala:223)
at org.apache.spark.examples.SparkAPSP$.main(SparkAPSP.scala:21)
at org.apache.spark.examples.SparkAPSP.main(SparkAPSP.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/01/17 14:33:39 INFO ContextHandler: stopped
o.e.j.s.ServletContextHandler{/metrics/json,null}
15/01/17 14:33:39 INFO ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
15/01/17 14:33:39 INFO ContextHandler: stopped
o.e.j.s.ServletContextHandler{/,null}
15/01/17 14:33:39 INFO ContextHandler: stopped
o.e.j.s.ServletContextHandler{/static,null}
..

On Tue, Jan 20, 2015 at 9:52 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 I had the Spark Shell running throughout. Is it because of that?

 On Tue, Jan 20, 2015 at 9:47 AM, Ted Yu yuzhih...@gmail.com wrote:

 Was there another instance

Bind Exception

2015-01-19 Thread Deep Pradhan
Hi,
I am running a Spark job. I get the output correctly but when I see the
logs file I see the following:
AbstractLifeCycle: FAILED.: java.net.BindException: Address already in
use...

What could be the reason for this?

Thank You


Re: Bind Exception

2015-01-19 Thread Deep Pradhan
I had the Spark Shell running throughout. Is it because of that?

On Tue, Jan 20, 2015 at 9:47 AM, Ted Yu yuzhih...@gmail.com wrote:

 Was there another instance of Spark running on the same machine ?

 Can you pastebin the full stack trace ?

 Cheers

 On Mon, Jan 19, 2015 at 8:11 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 I am running a Spark job. I get the output correctly but when I see the
 logs file I see the following:
 AbstractLifeCycle: FAILED.: java.net.BindException: Address already
 in use...

 What could be the reason for this?

 Thank You





Re: Bind Exception

2015-01-19 Thread Deep Pradhan
Yes, I have increased the driver memory in spark-defaults.conf to 2g. Still
the error persists.

On Tue, Jan 20, 2015 at 10:18 AM, Ted Yu yuzhih...@gmail.com wrote:

 Have you seen these threads ?

 http://search-hadoop.com/m/JW1q5tMFlb
 http://search-hadoop.com/m/JW1q5dabji1

 Cheers

 On Mon, Jan 19, 2015 at 8:33 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi Ted,
 When I run the same job with small data, it runs fine. But
 when I run it with a relatively bigger set of data, it gives me
 OutOfMemoryError: GC overhead limit exceeded.
 The first time I run the job, there is no output. When I run it a second time, I
 get this error. I am aware that the memory is getting full, but is
 there any way to avoid this?
 I have a single node Spark cluster.

 Thank You

 On Tue, Jan 20, 2015 at 9:52 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 I had the Spark Shell running throughout. Is it because of that?

 On Tue, Jan 20, 2015 at 9:47 AM, Ted Yu yuzhih...@gmail.com wrote:

 Was there another instance of Spark running on the same machine ?

 Can you pastebin the full stack trace ?

 Cheers

 On Mon, Jan 19, 2015 at 8:11 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 Hi,
 I am running a Spark job. I get the output correctly but when I see
 the logs file I see the following:
 AbstractLifeCycle: FAILED.: java.net.BindException: Address
 already in use...

 What could be the reason for this?

 Thank You








Re: No Output

2015-01-18 Thread Deep Pradhan
The error in the log file says:

*java.lang.OutOfMemoryError: GC overhead limit exceeded*

with a certain task ID, and the error repeats for further task IDs.

What could be the problem?

On Sun, Jan 18, 2015 at 2:45 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Updating the Spark version means setting up the entire cluster once more?
 Or can we update it in some other way?

 On Sat, Jan 17, 2015 at 3:22 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you paste the code? Also you can try updating your spark version.

 Thanks
 Best Regards

 On Sat, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 I am using Spark-1.0.0 in a single node cluster. When I run a job with
 small data set it runs perfectly but when I use a data set of 350 KB, no
 output is being produced and when I try to run it the second time it is
 giving me an exception telling that SparkContext was shut down.
 Can anyone help me on this?

 Thank You






Re: No Output

2015-01-18 Thread Deep Pradhan
Updating the Spark version means setting up the entire cluster once more?
Or can we update it in some other way?

On Sat, Jan 17, 2015 at 3:22 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Can you paste the code? Also you can try updating your spark version.

 Thanks
 Best Regards

 On Sat, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 I am using Spark-1.0.0 in a single node cluster. When I run a job with
 small data set it runs perfectly but when I use a data set of 350 KB, no
 output is being produced and when I try to run it the second time it is
 giving me an exception telling that SparkContext was shut down.
 Can anyone help me on this?

 Thank You





No Output

2015-01-17 Thread Deep Pradhan
Hi,
I am using Spark-1.0.0 in a single node cluster. When I run a job with
small data set it runs perfectly but when I use a data set of 350 KB, no
output is being produced and when I try to run it the second time it is
giving me an exception telling that SparkContext was shut down.
Can anyone help me on this?

Thank You


Joins in Spark

2014-12-22 Thread Deep Pradhan
Hi,
I have two RDDs, vertices and edges. Vertices is an RDD and edges is a pair
RDD. I want to take a three-way join of these two. Joins work only when both
the RDDs are pair RDDs, right? So, how am I supposed to take a three-way
join of these RDDs?

Thank You


Fwd: Joins in Spark

2014-12-22 Thread Deep Pradhan
This gives me two pair RDDs: one is the edgesRDD and the other is the verticesRDD
with each vertex padded with the value null. But I have to take a three-way
join of these two RDDs, and I have only one common attribute in these two
RDDs. How can I go about doing the three-way join?


Profiling GraphX codes.

2014-12-05 Thread Deep Pradhan
Is there any tool to profile GraphX codes in a cluster? Is there a way to
know the messages exchanged among the nodes in a cluster?
WebUI does not give all the information.


Thank You


Determination of number of RDDs

2014-12-04 Thread Deep Pradhan
Hi,
I have a graph and I want to create RDDs equal in number to the nodes in
the graph. How can I do that?
If I have 10 nodes then I want to create 10 rdds. Is that possible in
GraphX?
Like in the C language we have arrays of pointers, do we have arrays of RDDs in
Spark?
Can we create such an array and then parallelize it?

Thank You
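
A minimal sketch (not from the thread) of holding several RDDs in an ordinary local array, assuming sc is an existing SparkContext; numNodes and the contents are illustrative:

import org.apache.spark.rdd.RDD

val numNodes = 10
// one small RDD per node; the array itself lives on the driver
val rdds: Array[RDD[Int]] = Array.tabulate(numNodes)(i => sc.parallelize(Seq(i)))
println(rdds.length)   // 10

// Note: the array of RDDs cannot itself be parallelized into an RDD,
// because RDDs cannot be nested inside other RDDs.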


Re: Filter using the Vertex Ids

2014-12-03 Thread Deep Pradhan
And one more thing, the given tuples
(1, 1.0)
(2, 1.0)
(3, 2.0)
(4, 2.0)
(5, 0.0)

are a part of RDD and they are not just tuples.
graph.vertices return me the above tuples which is a part of VertexRDD.


On Wed, Dec 3, 2014 at 3:43 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 This is just an example but if my graph is big, there will be so many
 tuples to handle. I cannot manually do
 val a: RDD[(Int, Double)] = sc.parallelize(List(
   (1, 1.0),
   (2, 1.0),
   (3, 2.0),
   (4, 2.0),
   (5, 0.0)))
 for all the vertices in the graph.
 What should I do in that case?
 We cannot do *sc.parallelize(List(VertexRDD)), *can we?

 On Wed, Dec 3, 2014 at 3:32 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-12-02 22:01:20 -0800, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  I have a graph which returns the following on doing graph.vertices
  (1, 1.0)
  (2, 1.0)
  (3, 2.0)
  (4, 2.0)
  (5, 0.0)
 
  I want to group all the vertices with the same attribute together, like
 into
  one RDD or something. I want all the vertices with same attribute to be
  together.

 You can do this by flipping the tuples so the values become the keys,
 then using one of the by-key functions in PairRDDFunctions:

 val a: RDD[(Int, Double)] = sc.parallelize(List(
   (1, 1.0),
   (2, 1.0),
   (3, 2.0),
   (4, 2.0),
   (5, 0.0)))

 val b: RDD[(Double, Int)] = a.map(kv => (kv._2, kv._1))

 val c: RDD[(Double, Iterable[Int])] = b.groupByKey(numPartitions = 5)

 c.collect.foreach(println)
 // (0.0,CompactBuffer(5))
 // (1.0,CompactBuffer(1, 2))
 // (2.0,CompactBuffer(3, 4))

 Ankur





Re: Filter using the Vertex Ids

2014-12-03 Thread Deep Pradhan
This is just an example but if my graph is big, there will be so many
tuples to handle. I cannot manually do
val a: RDD[(Int, Double)] = sc.parallelize(List(
  (1, 1.0),
  (2, 1.0),
  (3, 2.0),
  (4, 2.0),
  (5, 0.0)))
for all the vertices in the graph.
What should I do in that case?
We cannot do *sc.parallelize(List(VertexRDD)), *can we?

On Wed, Dec 3, 2014 at 3:32 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-12-02 22:01:20 -0800, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  I have a graph which returns the following on doing graph.vertices
  (1, 1.0)
  (2, 1.0)
  (3, 2.0)
  (4, 2.0)
  (5, 0.0)
 
  I want to group all the vertices with the same attribute together, like
 into
  one RDD or something. I want all the vertices with same attribute to be
  together.

 You can do this by flipping the tuples so the values become the keys, then
 using one of the by-key functions in PairRDDFunctions:

 val a: RDD[(Int, Double)] = sc.parallelize(List(
   (1, 1.0),
   (2, 1.0),
   (3, 2.0),
   (4, 2.0),
   (5, 0.0)))

 val b: RDD[(Double, Int)] = a.map(kv => (kv._2, kv._1))

 val c: RDD[(Double, Iterable[Int])] = b.groupByKey(numPartitions = 5)

 c.collect.foreach(println)
 // (0.0,CompactBuffer(5))
 // (1.0,CompactBuffer(1, 2))
 // (2.0,CompactBuffer(3, 4))

 Ankur



SVD Plus Plus in GraphX

2014-11-27 Thread Deep Pradhan
Hi,
I was just going through two codes in GraphX, namely SVDPlusPlus and
TriangleCount. In the first I see an RDD as an input to run, i.e., run(edges:
RDD[Edge[Double]], ...), and in the other I see run(VD: ..., ED: ...).
Can anyone explain the difference between these two? In fact, SVDPlusPlus
is the only GraphX code in Spark-1.0.0 in which I have seen an RDD as an input.
Could anyone please explain this to me?

Thank you


Undirected Graphs in GraphX-Pregel

2014-11-26 Thread Deep Pradhan
Hi,
I was going through this paper on Pregel titled, Pregel: A System for
Large-Scale Graph Processing. In the second section named Model Of
Computation, it says that the input to a Pregel computation is a directed
graph.
Is it the same in the Pregel abstraction of GraphX too? Do we always have
to input directed graphs to Pregel abstraction or can we also give
undirected graphs?

Thank You


Edge List File in GraphX

2014-11-24 Thread Deep Pradhan
Hi,
Is it necessary for every vertex to have an attribute when we load a graph
into GraphX?
In other words, suppose I have an edge list file containing pairs of vertices,
i.e., 1   2 means that there is an edge between node 1 and node 2. Now,
when I run PageRank on this data it returns NaN.
Can I use this type of data for any algorithm in GraphX?

Thank You
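
A minimal sketch, assuming sc is an existing SparkContext and the edge-list path is illustrative; GraphLoader.edgeListFile builds the graph from src/dst pairs and gives every vertex a default attribute, so no per-vertex attribute is needed in the file itself:

import org.apache.spark.graphx.GraphLoader

// each line of the file is "srcId dstId"
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edge-list.txt")
val ranks = graph.pageRank(0.0001).vertices   // tolerance-based PageRank
ranks.take(5).foreach(println)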


Re: New Codes in GraphX

2014-11-24 Thread Deep Pradhan
Could it be because my edge list file is in the form (1 2), where there
is an edge between node 1 and node 2?

On Tue, Nov 18, 2014 at 4:13 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-11-18 15:51:52 +0530, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  Yes the above command works, but there is this problem. Most of the
 times,
  the total rank is Nan (Not a Number). Why is it so?

 I've also seen this, but I'm not sure why it happens. If you could find
 out which vertices are getting the NaN rank, it might be helpful in
 tracking down the problem.

 Ankur



New Codes in GraphX

2014-11-18 Thread Deep Pradhan
Hi,
I am using Spark-1.0.0. There are two GraphX directories that I can see here

1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx
which contains LiveJournalPageRank.scala

2. spark-1.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib which
contains Analytics.scala, ConnectedComponents.scala etc.

Now, if I want to add my own code to GraphX i.e., if I want to write a
small application on GraphX, in which directory should I add my code, in 1
or 2 ? And what is the difference?

Can anyone tell me something on this?

Thank You


Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Deep Pradhan
So landmark can contain just one vertex right?
Which algorithm has been used to compute the shortest path?

Thank You

On Tue, Nov 18, 2014 at 2:53 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-11-17 14:47:50 +0530, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  I was going through the graphx section in the Spark API in
 
 https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$
 
  Here, I find the word landmark. Can anyone explain to me what is
 landmark
  means. Is it a simple English word or does it mean something else in
 graphx.

 The landmarks in the context of the shortest-paths algorithm are just
 the vertices of interest. For each vertex in the graph, the algorithm will
 return the distance to each of the landmark vertices.

 Ankur



Re: Running PageRank in GraphX

2014-11-18 Thread Deep Pradhan
There are no vertices of zero outdegree.
The total rank for the graph with numIter = 10 is 4.99 and for the graph
with numIter = 100 is 5.99
I do not know why there is so much variation.

On Tue, Nov 18, 2014 at 3:22 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-11-18 12:02:52 +0530, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  I just ran the PageRank code in GraphX with some sample data. What I am
  seeing is that the total rank changes drastically if I change the number
 of
  iterations from 10 to 100. Why is that so?

 As far as I understand, the total rank should asymptotically approach the
 number of vertices in the graph, assuming there are no vertices of zero
 outdegree. Does that seem to be the case for your graph?

 Ankur



Re: Landmarks in GraphX section of Spark API

2014-11-18 Thread Deep Pradhan
Does Bellman-Ford give the best solution?

On Tue, Nov 18, 2014 at 3:27 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-11-18 14:59:20 +0530, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  So landmark can contain just one vertex right?

 Right.

  Which algorithm has been used to compute the shortest path?

 It's distributed Bellman-Ford.

 Ankur



Re: New Codes in GraphX

2014-11-18 Thread Deep Pradhan
The codes that are present in 2 can be run with the command

*$SPARK_HOME/bin/spark-submit --master local[*] --class
org.apache.spark.graphx.lib.Analytics
$SPARK_HOME/assembly/target/scala-2.10/spark-assembly-*.jar pagerank
/edge-list-file.txt --numEPart=8 --numIter=10
--partStrategy=EdgePartition2D*

Now, how do I run the LiveJournalPageRank.scala that is there in 1?



On Tue, Nov 18, 2014 at 2:51 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 I am using Spark-1.0.0. There are two GraphX directories that I can see
 here

 1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx
 which contains LiveJournalPageRank.scala

 2. spark-1.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib which
 contains Analytics.scala, ConnectedComponents.scala etc.

 Now, if I want to add my own code to GraphX i.e., if I want to write a
 small application on GraphX, in which directory should I add my code, in 1
 or 2 ? And what is the difference?

 Can anyone tell me something on this?

 Thank You



Re: New Codes in GraphX

2014-11-18 Thread Deep Pradhan
What command should I use to run the LiveJournalPageRank.scala?

 If you want to write a separate application, the ideal way is to do it in
a separate project that links in Spark as a dependency [1].
But even for this, I have to do the build every time I change the code,
right?

Thank You

On Tue, Nov 18, 2014 at 3:35 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-11-18 14:51:54 +0530, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  I am using Spark-1.0.0. There are two GraphX directories that I can see
 here
 
  1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx
  which contains LiveJournalPageRank.scala

  2. spark-1.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib which
  contains Analytics.scala, ConnectedComponents.scala etc.
 
  Now, if I want to add my own code to GraphX i.e., if I want to write a
  small application on GraphX, in which directory should I add my code, in
 1
  or 2 ? And what is the difference?

 If you want to add an algorithm which you can call from the Spark shell
 and submit as a pull request, you should add it to
 org.apache.spark.graphx.lib (#2). To run it from the command line, you'll
 also have to modify Analytics.scala.

 If you want to write a separate application, the ideal way is to do it in
 a separate project that links in Spark as a dependency [1]. It will also
 work to put it in either #1 or #2, but this will be worse in the long term
 because each build cycle will require you to rebuild and restart all of
 Spark rather than just building your application and calling spark-submit
 on the new JAR.

 Ankur

 [1]
 http://spark.apache.org/docs/1.0.2/quick-start.html#standalone-applications



Re: New Codes in GraphX

2014-11-18 Thread Deep Pradhan
Yes, the above command works, but there is this problem. Most of the time,
the total rank is NaN (Not a Number). Why is that so?

Thank You

On Tue, Nov 18, 2014 at 3:48 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 What command should I use to run the LiveJournalPageRank.scala?

  If you want to write a separate application, the ideal way is to do it
 in a separate project that links in Spark as a dependency [1].
 But even for this, I have to do the build every time I change the code,
 right?

 Thank You

 On Tue, Nov 18, 2014 at 3:35 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-11-18 14:51:54 +0530, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  I am using Spark-1.0.0. There are two GraphX directories that I can see
 here
 
  1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx
  which contains LiveJournalPageRank.scala

  2. spark-1.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib which
  contains Analytics.scala, ConnectedComponents.scala etc.
 
  Now, if I want to add my own code to GraphX i.e., if I want to write a
  small application on GraphX, in which directory should I add my code,
 in 1
  or 2 ? And what is the difference?

 If you want to add an algorithm which you can call from the Spark shell
 and submit as a pull request, you should add it to
 org.apache.spark.graphx.lib (#2). To run it from the command line, you'll
 also have to modify Analytics.scala.

 If you want to write a separate application, the ideal way is to do it in
 a separate project that links in Spark as a dependency [1]. It will also
 work to put it in either #1 or #2, but this will be worse in the long term
 because each build cycle will require you to rebuild and restart all of
 Spark rather than just building your application and calling spark-submit
 on the new JAR.

 Ankur

 [1]
 http://spark.apache.org/docs/1.0.2/quick-start.html#standalone-applications





Landmarks in GraphX section of Spark API

2014-11-17 Thread Deep Pradhan
Hi,
I was going through the graphx section in the Spark API in
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$

Here, I find the word landmark. Can anyone explain to me what landmark
means? Is it used as a plain English word, or does it mean something else in GraphX?

Thank You


Running PageRank in GraphX

2014-11-17 Thread Deep Pradhan
Hi,
I just ran the PageRank code in GraphX with some sample data. What I am
seeing is that the total rank changes drastically if I change the number of
iterations from 10 to 100. Why is that so?

Thank You


Functions in Spark

2014-11-16 Thread Deep Pradhan
Hi,
Is there any way to know which of my functions performs better in Spark? In
other words, say I have achieved the same thing using two different
implementations. How do I judge which implementation is better than
the other? Is processing time the only metric that we can use to claim that
one implementation is better than the other?
Can anyone please share some thoughts on this?

Thank You
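
A minimal sketch of comparing two implementations by the wall-clock time of an action, assuming sc is an existing SparkContext; the two map functions are made-up placeholders:

def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(label + " took " + (System.nanoTime() - start) / 1e9 + " s")
  result
}

val data = sc.parallelize(1 to 1000000)
time("implementation A") { data.map(_ * 2).count() }      // count() forces the work to run
time("implementation B") { data.map(x => x + x).count() }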


Re: toLocalIterator in Spark 1.0.0

2014-11-14 Thread Deep Pradhan
val iter = toLocalIterator (rdd)

This is what I am doing and it says error: not found

On Fri, Nov 14, 2014 at 12:34 PM, Patrick Wendell pwend...@gmail.com
wrote:

 It looks like you are trying to directly import the toLocalIterator
 function. You can't import functions, it should just appear as a
 method of an existing RDD if you have one.

 - Patrick

 On Thu, Nov 13, 2014 at 10:21 PM, Deep Pradhan
 pradhandeep1...@gmail.com wrote:
  Hi,
 
  I am using Spark 1.0.0 and Scala 2.10.3.
 
  I want to use toLocalIterator in a code but the spark shell tells
 
  not found: value toLocalIterator
 
  I also did import org.apache.spark.rdd but even after this the shell
 tells
 
  object toLocalIterator is not a member of package org.apache.spark.rdd
 
  Can anyone help me in this?
 
  Thank You
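
A minimal sketch of the usage Patrick describes, assuming sc is an existing SparkContext; toLocalIterator is called as a method on an existing RDD rather than imported as a standalone function:

val rdd = sc.parallelize(1 to 100, 4)

val iter: Iterator[Int] = rdd.toLocalIterator   // pulls one partition at a time to the driver
iter.take(5).foreach(println)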



toLocalIterator in Spark 1.0.0

2014-11-13 Thread Deep Pradhan
Hi,

I am using Spark 1.0.0 and Scala 2.10.3.

I want to use toLocalIterator in a code but the spark shell tells

*not found: value toLocalIterator*

I also did import org.apache.spark.rdd but even after this the shell tells

*object toLocalIterator is not a member of package org.apache.spark.rdd*

Can anyone help me in this?

Thank You


Pass RDD to functions

2014-11-12 Thread Deep Pradhan
Hi,
Can we pass RDD to functions?
Like, can we do the following?

*def func (temp: RDD[String]):RDD[String] = {*
*//body of the function*
*}*


Thank You
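
A minimal sketch, assuming sc is an existing SparkContext; func and the sample data are made up for illustration. RDDs are ordinary objects, so they can be passed to and returned from functions:

import org.apache.spark.rdd.RDD

def func(temp: RDD[String]): RDD[String] = {
  // any chain of transformations can go in the body
  temp.filter(_.nonEmpty).map(_.toUpperCase)
}

val result = func(sc.parallelize(Seq("a", "", "b")))
result.collect().foreach(println)   // A, B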


Queues

2014-11-09 Thread Deep Pradhan
Has anyone implemented Queues using RDDs?


Thank You


GraphX and Spark

2014-11-04 Thread Deep Pradhan
Hi,
Can Spark achieve whatever GraphX can?
Keeping aside the performance comparison between Spark and GraphX, if I
want to implement any graph algorithm and I do not want to use GraphX, can
I get the work done with Spark?

Thank You


RDD of Iterable[String]

2014-09-24 Thread Deep Pradhan
Can we iterate over RDD of Iterable[String]? How do we do that?
Because the entire Iterable[String] seems to be a single element in the RDD.

Thank You
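
A minimal sketch, assuming sc is an existing SparkContext; the grouped RDD below stands in for the RDD[Iterable[String]] in question. Each Iterable is a single element of the RDD, so the iteration happens inside map/flatMap:

import org.apache.spark.rdd.RDD

val grouped: RDD[Iterable[String]] =
  sc.parallelize(Seq(Iterable("a", "b"), Iterable("c")))

val sizes = grouped.map(group => group.size)        // work per inner collection
val flattened = grouped.flatMap(group => group)     // one String per element
flattened.collect().foreach(println)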


Converting one RDD to another

2014-09-23 Thread Deep Pradhan
Hi,
Is it always possible to get one RDD from another?
For example, if I do a *top(K)(Ordering)*, I get an Int, right? (In my
example the type is Int.) I do not get an RDD.
Can anyone explain this to me?
Thank You
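
A minimal sketch of the distinction involved here, assuming sc is an existing SparkContext: transformations return another RDD, while actions such as top return a plain local value:

val nums = sc.parallelize(Seq(3, 1, 4, 1, 5))

val doubled = nums.map(_ * 2)          // transformation: gives another RDD
val top2: Array[Int] = nums.top(2)     // action: gives a local Array, not an RDD

// if the result of an action is needed as an RDD again, re-parallelize it
val topRdd = sc.parallelize(top2)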


Change RDDs using map()

2014-09-17 Thread Deep Pradhan
Hi,
I want to make the following changes in the RDD (create new RDD from the
existing to reflect some transformation):
In an RDD of key-value pairs, I want to get the keys for which the values
are 1.
How do I do this using map()?
Thank You
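
A minimal sketch, assuming sc is an existing SparkContext and the pairs RDD is illustrative; a filter does the selection and map then keeps only the keys:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 1)))

val keysWithOne = pairs
  .filter { case (_, v) => v == 1 }   // keep pairs whose value is 1
  .map { case (k, _) => k }           // drop the values, keep the keys

keysWithOne.collect().foreach(println)   // a, c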


Re: Spark and Scala

2014-09-13 Thread Deep Pradhan
Is it always true that whenever we apply operations on an RDD, we get
another RDD?
Or does it depend on the return type of the operation?

On Sat, Sep 13, 2014 at 9:45 AM, Soumya Simanta soumya.sima...@gmail.com
wrote:


 An RDD is a fault-tolerant distributed structure. It is the primary
 abstraction in Spark.

 I would strongly suggest that you have a look at the following to get a
 basic idea.

 http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html
 http://spark.apache.org/docs/latest/quick-start.html#basics

 https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia

 On Sat, Sep 13, 2014 at 12:06 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Take for example this:
 I have declared one queue *val queue = Queue.empty[Int]*, which is a
 pure scala line in the program. I actually want the queue to be an RDD but
 there are no direct methods to create RDD which is a queue right? What say
 do you have on this?
 Does there exist something like: *Create an RDD which is a queue*?

 On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan 
 hshreedha...@cloudera.com wrote:

 No, Scala primitives remain primitives. Unless you create an RDD using
 one of the many methods - you would not be able to access any of the RDD
 methods. There is no automatic porting. Spark is an application as far as
 scala is concerned - there is no compilation (except of course, the scala,
 JIT compilation etc).

 On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com
  wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by
 Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala,
 and that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run
 any Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw
 it out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You









Re: Spark and Scala

2014-09-13 Thread Deep Pradhan
Take for example this:


val lines = sc.textFile(args(0))
val nodes = lines.map(s => {
  val fields = s.split("\\s+")
  (fields(0), fields(1))
}).distinct().groupByKey().cache()

val nodeSizeTuple = nodes.map(node => (node._1.toInt, node._2.size))
val rootNode = nodeSizeTuple.top(1)(Ordering.by(f => f._2))

The nodeSizeTuple is an RDD, but rootNode is an array. Here I have used all
RDD operations, but I am getting an array.
What about this case?

On Sat, Sep 13, 2014 at 11:45 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Is it always true that whenever we apply operations on an RDD, we get
 another RDD?
 Or does it depend on the return type of the operation?

 On Sat, Sep 13, 2014 at 9:45 AM, Soumya Simanta soumya.sima...@gmail.com
 wrote:


 An RDD is a fault-tolerant distributed structure. It is the primary
 abstraction in Spark.

 I would strongly suggest that you have a look at the following to get a
 basic idea.

 http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html
 http://spark.apache.org/docs/latest/quick-start.html#basics

 https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia

 On Sat, Sep 13, 2014 at 12:06 AM, Deep Pradhan pradhandeep1...@gmail.com
  wrote:

 Take for example this:
 I have declared one queue *val queue = Queue.empty[Int]*, which is a
 pure scala line in the program. I actually want the queue to be an RDD but
 there are no direct methods to create RDD which is a queue right? What say
 do you have on this?
 Does there exist something like: *Create an RDD which is a queue*?

 On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan 
 hshreedha...@cloudera.com wrote:

 No, Scala primitives remain primitives. Unless you create an RDD using
 one of the many methods - you would not be able to access any of the RDD
 methods. There is no automatic porting. Spark is an application as far as
 scala is concerned - there is no compilation (except of course, the scala,
 JIT compilation etc).

 On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by
 Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala,
 and that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run
 any Scala code on the Spark framework? What will be the difference in 
 the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw
 it out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist*
 on *temp*?

 Thank You










RDDs and Immutability

2014-09-13 Thread Deep Pradhan
Hi,
We all know that RDDs are immutable.
The available operations on RDDs cannot achieve everything I want to do.
Take for example this:
I want an Array of Bytes filled with zeros which during the program should
change. Some elements of that Array should change to 1.
If I make an RDD with all elements as zero, I won't be able to change the
elements. On the other hand, if I declare it as a plain Array then a lot of memory will
be consumed.
Please clarify this to me.

Thank You
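
A minimal sketch of the usual pattern with immutable RDDs, assuming sc is an existing SparkContext; shouldFlip is a made-up condition. "Changing" elements means deriving a new RDD rather than mutating the old one:

import org.apache.spark.rdd.RDD

val bytes: RDD[(Long, Byte)] =
  sc.parallelize(0L until 10L).map(i => (i, 0: Byte))   // index -> byte, all zeros

def shouldFlip(i: Long): Boolean = i % 3 == 0            // hypothetical condition

// derive a new RDD in which the selected positions are 1
val updated = bytes.map { case (i, b) => if (shouldFlip(i)) (i, 1: Byte) else (i, b) }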


Spark and Scala

2014-09-12 Thread Deep Pradhan
There is one thing that I am confused about.
Spark has codes that have been implemented in Scala. Now, can we run any
Scala code on the Spark framework? What will be the difference in the
execution of the scala code in normal systems and on Spark?
The reason for my question is the following:
I had a variable
*val temp = some operations*
This temp was being created inside the loop, so as to manually throw it out
of the cache, every time the loop ends I was calling *temp.unpersist()*,
this was returning an error saying that *value unpersist is not a method of
Int*, which means that temp is an Int.
Can some one explain to me why I was not able to call *unpersist* on *temp*?

Thank You


Re: Spark and Scala

2014-09-12 Thread Deep Pradhan
I know that unpersist is a method on RDD.
But my confusion is that, when we port our Scala programs to Spark, doesn't
everything change to RDDs?

On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala, and
 that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run any
 Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw it
 out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You





Re: Spark and Scala

2014-09-12 Thread Deep Pradhan
Take for example this:
I have declared one queue *val queue = Queue.empty[Int]*, which is a pure
scala line in the program. I actually want the queue to be an RDD, but there
are no direct methods to create an RDD which is a queue, right? What do you
say about this?
Does there exist something like: *Create an RDD which is a queue*?

On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan hshreedha...@cloudera.com
 wrote:

 No, Scala primitives remain primitives. Unless you create an RDD using one
 of the many methods - you would not be able to access any of the RDD
 methods. There is no automatic porting. Spark is an application as far as
 scala is concerned - there is no compilation (except of course, the scala,
 JIT compilation etc).

 On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala,
 and that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run
 any Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw it
 out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You







Unpersist

2014-09-11 Thread Deep Pradhan
I want to create a temporary variables in a spark code.
Can I do this?

for (i <- num)
{
val temp = ..
   {
   do something
   }
temp.unpersist()
}

Thank You


Re: Unpersist

2014-09-11 Thread Deep Pradhan
After every loop I want the temp variable to cease to exist

On Thu, Sep 11, 2014 at 4:33 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 like this?

 var temp = ...
 for (i <- num)
 {
  temp = ..
{
do something
}
 temp.unpersist()
 }

 Thanks
 Best Regards

 On Thu, Sep 11, 2014 at 3:26 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 I want to create a temporary variables in a spark code.
 Can I do this?

 for (i <- num)
 {
 val temp = ..
{
do something
}
 temp.unpersist()
 }

 Thank You





How to change the values in Array of Bytes

2014-09-06 Thread Deep Pradhan
Hi,
I have an array of bytes and I have filled the array with 0 in all the
positions.


*var Array = Array.fill[Byte](10)(0)*
Now, if certain conditions are satisfied, I want to change some elements of
the array to 1 instead of 0. If I run,



*if (Array.apply(index)==0) Array.apply(index) = 1*
it returns me an error.

But if I assign *Array.apply(index) *to a variable and do the same thing
then it works. I do not want to assign this to variables because if I do
this, I would be creating a lot of variables.

Can anyone tell me a method to do this?

Thank You
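
A minimal sketch of updating a mutable Scala array in place; array and index are illustrative names. apply(i) only reads an element; writing goes through update, for which array(i) = value is the usual shorthand:

val array = Array.fill[Byte](10)(0)
val index = 3

if (array(index) == 0) {
  array(index) = 1          // same as array.update(index, 1)
}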


Recursion

2014-09-05 Thread Deep Pradhan
Hi,
Does Spark support recursive calls?


Array and RDDs

2014-09-05 Thread Deep Pradhan
Hi,
I have an input file which consists of src_node dest_node pairs.
I have created an RDD consisting of key-value pairs where the key is the node
id and the values are the children of that node.
Now I want to associate a byte with each node. For that I have created a
byte array.
Every time I print out the key-value pairs in the RDD, the key-value pairs do
not come in the same order. Because of this I am finding it difficult to
assign the byte values to each node.
Can anyone help me out in this matter?

I basically have the following code:
val bitarray = Array.fill[Byte](number)(0)

And I want to associate each byte in the array with a node.
How should I do that?

Thank You
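
A minimal sketch of one way to avoid depending on RDD ordering, assuming sc is an existing SparkContext and the sample adjacency RDD is illustrative: key the byte by node id instead of relying on the position in an array:

import org.apache.spark.SparkContext._   // pair-RDD operations on Spark 1.0.x

val nodes = sc.parallelize(Seq(("1", Iterable("2", "3")), ("2", Iterable("4"))))

// one (nodeId, flag) pair per node; the ordering of the RDD no longer matters
val flags = nodes.map { case (nodeId, _) => (nodeId, 0: Byte) }

// later, combine the flag with the node's children by key
val combined = nodes.join(flags)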


Iterate over ArrayBuffer

2014-09-04 Thread Deep Pradhan
Hi,
I have the following ArrayBuffer
*ArrayBuffer(5,3,1,4)*
Now, I want to iterate over the ArrayBuffer.
What is the way to do it?

Thank You
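
A minimal sketch of iterating over an ArrayBuffer:

import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer(5, 3, 1, 4)

buf.foreach(println)                   // simple traversal
for (x <- buf) println(x)              // equivalent for-comprehension
buf.zipWithIndex.foreach { case (x, i) => println("element " + i + " is " + x) }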


Number of elements in ArrayBuffer

2014-09-02 Thread Deep Pradhan
Hi,
I have the following ArrayBuffer:

*ArrayBuffer(5,3,1,4)*

Now, I want to get the number of elements in this ArrayBuffer and also the
first element of the ArrayBuffer. I used .length and .size but they are
returning 1 instead of 4.
I also used .head and .last for getting the first and the last element but
they also return the entire ArrayBuffer (ArrayBuffer(5,3,1,4))
What I understand from this is that, the entire ArrayBuffer is stored as
one element.

How should I go about doing the required things?

Thank You


Re: Number of elements in ArrayBuffer

2014-09-02 Thread Deep Pradhan
I have a problem here.
When I run the commands that Rajesh has suggested in the Scala REPL, they work
fine. But I want to do this in Spark code, where I need to find the number
of elements in an ArrayBuffer. In Spark code, these things are not working.
How should I do that?


On Wed, Sep 3, 2014 at 10:25 AM, Madabhattula Rajesh Kumar 
mrajaf...@gmail.com wrote:

 Hi Deep,

 Please find below results of ArrayBuffer in scala REPL

 scala import scala.collection.mutable.ArrayBuffer
 import scala.collection.mutable.ArrayBuffer

 scala val a = ArrayBuffer(5,3,1,4)
 a: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(5, 3, 1, 4)

 scala a.head
 res2: Int = 5

 scala a.tail
 res3: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(3, 1, 4)

 scala a.length
 res4: Int = 4

 Regards,
 Rajesh


 On Wed, Sep 3, 2014 at 10:13 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 I have the following ArrayBuffer:

 *ArrayBuffer(5,3,1,4)*

 Now, I want to get the number of elements in this ArrayBuffer and also
 the first element of the ArrayBuffer. I used .length and .size but they are
 returning 1 instead of 4.
 I also used .head and .last for getting the first and the last element
 but they also return the entire ArrayBuffer (ArrayBuffer(5,3,1,4))
 What I understand from this is that, the entire ArrayBuffer is stored as
 one element.

 How should I go about doing the required things?

 Thank You





Key-Value Operations

2014-08-28 Thread Deep Pradhan
Hi,
I have an RDD of key-value pairs. Now I want to find the key for which the
values have the largest number of elements. How should I do that?
Basically I want to select the key for which the number of items in the values
is the largest.
Thank You
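
A minimal sketch, assuming sc is an existing SparkContext and that the values are collections (as after a groupByKey); the sample data is illustrative:

val grouped = sc.parallelize(Seq(("a", Iterable(1, 2, 3)), ("b", Iterable(1))))

// compare keys by the size of their value collection and keep the largest
val (biggestKey, biggestSize) = grouped
  .map { case (k, vs) => (k, vs.size) }
  .reduce { (x, y) => if (x._2 >= y._2) x else y }

println(biggestKey + " has " + biggestSize + " elements")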


Re: Printing the RDDs in SparkPageRank

2014-08-26 Thread Deep Pradhan
println(parts(0)) does not solve the problem. It does not work


On Mon, Aug 25, 2014 at 1:30 PM, Sean Owen so...@cloudera.com wrote:

 On Mon, Aug 25, 2014 at 7:18 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  When I add
 
  parts(0).collect().foreach(println)
 
  parts(1).collect().foreach(println), for printing parts, I get the
 following
  error
 
  not enough arguments for method collect: (pf:
  PartialFunction[Char,B])(implicit
  bf:scala.collection.generic.CanBuildFrom[String,B,That])That.Unspecified
  value parameter pf.parts(0).collect().foreach(println)

  val links = lines.map{ s =>
    val parts = s.split("\\s+")
    (parts(0), parts(1))  /* I want to print this parts */
  }.distinct().groupByKey().cache()


 Within this code, you are working in a simple Scala function. parts is
 an Array[String]. parts(0) is a String. You can just
 println(parts(0)). You are not calling RDD.collect() there, but
 collect() on a String, i.e. a sequence of Char.

 However note that this will print the String on the worker that
 executes this, not the driver.

 Maybe you want to print the result right after this map function? Then
 break this into two statements and print the result of the first. You
 already are doing that in your code. A good formula is actually
 take(10) rather than collect() in case the RDD is huge.
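
A minimal sketch of the suggestion above, assuming lines is the RDD from the surrounding code: split the statement in two and print the intermediate pair RDD from the driver with take(10):

val pairs = lines.map { s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1))
}
pairs.take(10).foreach(println)   // printed on the driver; safe even for huge RDDs

val links = pairs.distinct().groupByKey().cache()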



Key-Value in PairRDD

2014-08-26 Thread Deep Pradhan
I have the following code

val nodes = lines.map(s => {
  val fields = s.split("\\s+")
  (fields(0), fields(1))
}).distinct().groupByKey().cache()

and when I print out the nodes RDD I get the following

(4,ArrayBuffer(1))
(2,ArrayBuffer(1))
(3,ArrayBuffer(1))
(1,ArrayBuffer(3, 2, 4))
Now, I want to print only the key part of the RDD and also the maximum
value among the keys. How should I do that?
Thank You
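
A minimal sketch for the question above, assuming nodes is the grouped pair RDD shown and that the keys are strings (hence the toInt before taking the maximum); on Spark 1.0.x the SparkContext._ import provides pair-RDD methods such as keys:

import org.apache.spark.SparkContext._

val keys = nodes.keys                     // RDD containing only the key part
keys.collect().foreach(println)

val maxKey = keys.map(_.toInt).reduce((a, b) => math.max(a, b))
println("maximum key: " + maxKey)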


Re: Printing the RDDs in SparkPageRank

2014-08-25 Thread Deep Pradhan
When I add

parts(0).collect().foreach(println)

parts(1).collect().foreach(println), for printing parts, I get the
following error

*not enough arguments for method collect: (pf:
PartialFunction[Char,B])(implicit
bf:scala.collection.generic.CanBuildFrom[String,B,That])That.Unspecified
value parameter pf.parts(0).collect().foreach(println)*


 And, when I add
parts.collect().foreach(println), I get the following error

*not enough arguments for method collect: (pf:
PartialFunction[String,B])(implicit bf:
scala.collection.generic.CanBuildFrom[Array[String],B,That])That.Unspecified
value parameter pf.parts.collect().foreach(println) *


On Sun, Aug 24, 2014 at 8:27 PM, Jörn Franke jornfra...@gmail.com wrote:

 Hi,

 What kind of error do you receive?

 Best regards,

 Jörn
 Le 24 août 2014 08:29, Deep Pradhan pradhandeep1...@gmail.com a écrit
 :

 Hi,
 I was going through the SparkPageRank code and want to see the
 intermediate steps, like the RDDs formed in the intermediate steps.
 Here is a part of the code along with the lines that I added in order to
 print the RDDs.
 I want to print the *parts* in the code (denoted by the comment in
 Bold letters). But, when I try to do the same thing there, it gives an
 error.
 Can someone suggest what I should be doing?
 Thank You

 CODE:

 object SparkPageRank {
   def main(args: Array[String]) {
 val sparkConf = new SparkConf().setAppName(PageRank)
 var iters = args(1).toInt
 val ctx = new SparkContext(sparkConf)
 val lines = ctx.textFile(args(0), 1)
 println(The lines RDD is)
  lines.collect().foreach(println)
 val links = lines.map{ s =
   val parts = s.split(\\s+)
   (parts(0), parts(1))  */*I want to print this parts*/*
 }.distinct().groupByKey().cache()
 println(The links RDD is)
 links.collect().foreach(println)
 var ranks = links.mapValues(v = 1.0)
 println(The ranks RDD is)
 ranks.collect().foreach(println)
 for (i - 1 to iters) {
   val contribs = links.join(ranks).values.flatMap{ case (urls, rank)
 =
 val size = urls.size
 urls.map(url = (url, rank / size))
   }
 println(The contribs RDD is)
   contribs.collect().foreach(println)
   ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
 }
 println(The second ranks RDD is)
ranks.collect().foreach(println)

 val output = ranks.collect()
 output.foreach(tup = println(tup._1 +  has rank:  + tup._2 + .))

 ctx.stop()
   }
 }






Printing the RDDs in SparkPageRank

2014-08-24 Thread Deep Pradhan
Hi,
I was going through the SparkPageRank code and want to see the intermediate
steps, like the RDDs formed in the intermediate steps.
Here is a part of the code along with the lines that I added in order to
print the RDDs.
I want to print the *parts* in the code (denoted by the comment in Bold
letters). But, when I try to do the same thing there, it gives an error.
Can someone suggest what I should be doing?
Thank You

CODE:

object SparkPageRank {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("PageRank")
    var iters = args(1).toInt
    val ctx = new SparkContext(sparkConf)
    val lines = ctx.textFile(args(0), 1)
    println("The lines RDD is")
    lines.collect().foreach(println)
    val links = lines.map { s =>
      val parts = s.split("\\s+")
      (parts(0), parts(1))  /* I want to print this parts */
    }.distinct().groupByKey().cache()
    println("The links RDD is")
    links.collect().foreach(println)
    var ranks = links.mapValues(v => 1.0)
    println("The ranks RDD is")
    ranks.collect().foreach(println)
    for (i <- 1 to iters) {
      val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map(url => (url, rank / size))
      }
      println("The contribs RDD is")
      contribs.collect().foreach(println)
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    println("The second ranks RDD is")
    ranks.collect().foreach(println)

    val output = ranks.collect()
    output.foreach(tup => println(tup._1 + " has rank: " + tup._2 + "."))

    ctx.stop()
  }
}


Program without doing assembly

2014-08-16 Thread Deep Pradhan
Hi,
I am just playing around with the codes in Spark.
I am printing out some statements of the codes given in Spark so as to see
how it looks.
Every time I change/add something to the code I have to run the command

*SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly*

which is tiresome at times.
Is there any way to try out the code, or even add new code (a new file
in the examples directory), to the Spark build that is already available, without
having to run sbt/sbt assembly every time? Please tell me for a single node as
well as a multi-node cluster.

Thank You


Error in sbt/sbt package

2014-08-15 Thread Deep Pradhan
I am getting the following error while doing SPARK_HADOOP_VERSION=2.3.0
sbt/sbt package

java.io.IOException: Cannot run program
/home/deep/spark-1.0.0/usr/lib/jvm/java-7-oracle/bin/javac: error=2, No
such file or directory

 ...lots of errors

 [error] (core/compile:compile) java.io.IOException: Cannot run program
/home/deep/spark-1.0.0/usr/lib/jvm/java-7-oracle/bin/javac: error=2, No
such file or directory

[error] Total time: 198 s, completed 16 Aug, 2014 10:25:50 AM

My ~/.bashrc file has the following (apart from other paths too)

export JAVA_HOME=usr/lib/jvm/java-7-oracle

export PATH=$PATH:$JAVA_HOME/bin


 My spark-env.sh file has the following (apart from other paths)

export JAVA_HOME=/usr/lib/jvm/jdk-7-oracle

Can anyone tell me what I should modify?

Thank You


Minimum Split of Hadoop RDD

2014-08-08 Thread Deep Pradhan
Hi,
I am using a single node Spark cluster on HDFS. When I was going through
the SparkPageRank.scala code, I came across the following line:

*val lines = ctx.textFile(args(0), 1)*


where, args(0) is the path of the input file from the HDFS, and the second
argument is the minimum split of Hadoop RDD (textFile in Spark
documentation).

Could anyone please tell me how this minimum split plays a role? Can we
change it? If so, how does it affect performance?

Thank You
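
A minimal sketch, assuming ctx is an existing SparkContext and the HDFS path is illustrative; the second argument to textFile is a minimum number of partitions (Hadoop splits) for the resulting RDD, and more partitions generally means more parallel tasks:

val fewParts  = ctx.textFile("hdfs:///data/input.txt", 1)   // at least 1 partition
val moreParts = ctx.textFile("hdfs:///data/input.txt", 8)   // at least 8 partitions

println(fewParts.partitions.length)
println(moreParts.partitions.length)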


Re: GraphX runs without Spark?

2014-08-03 Thread Deep Pradhan
We need to pass the URL only when we are using the interactive shell, right?
Now, I am not using the interactive shell; I am just doing
./bin/run-example.. when I am in the Spark directory.
"If not, Spark may be ignoring your single-node cluster and defaulting to
local mode."
What does this mean? Can I work even without Spark coming up? Does the same
thing happen even if I have a multi-node cluster?
Thank You


On Sun, Aug 3, 2014 at 2:24 PM, Ankur Dave ankurd...@gmail.com wrote:

 At 2014-08-03 13:14:52 +0530, Deep Pradhan pradhandeep1...@gmail.com
 wrote:
  I have a single node cluster on which I have Spark running. I ran some
  graphx codes on some data set. Now when I stop all the workers in the
  cluster (sbin/stop-all.sh), the codes still run and gives the answers.
 Why
  is it so? I mean does graphx run even without Spark coming up?

 Are you passing the correct master URL (spark://master-address:7077) to
 run-example or spark-submit? If not, Spark may be ignoring your single-node
 cluster and defaulting to local mode.

 Ankur



GraphX

2014-08-02 Thread Deep Pradhan
Hi,
I am running Spark on a single node cluster. I am able to run the examples in
Spark, like SparkPageRank.scala and SparkKMeans.scala, with the following command:
bin/run-example org.apache.spark.examples.SparkPageRank plus the required
arguments.
Now, I want to run the PageRank.scala that is there in GraphX. Do we have a
command similar to the earlier one for this too?
I also went through the run-example shell file and saw that the path is set
to org.apache.spark.examples.
How should I run GraphX codes?
Thank You

