Error starting EC2 cluster

2014-05-04 Thread Aliaksei Litouka
I am using Spark 0.9.1. When I try to start an EC2 cluster with the
spark-ec2 script, an error occurs and the following message is issued:
AttributeError: 'module' object has no attribute 'check_output'. By that
point, the EC2 instances are up and running, but Spark doesn't seem to be
installed on them. Any ideas on how to fix this?

$ ./spark-ec2 -k my_key -i /home/alitouka/my_key.pem -s 1
--region=us-east-1 --instance-type=m3.medium launch test_cluster
Setting up security groups...
Searching for existing cluster test_cluster...
Don't recognize m3.medium, assuming type is pvm
Spark AMI: ami-5bb18832
Launching instances...
Launched 1 slaves in us-east-1c, regid = r-
Launched master in us-east-1c, regid = r-
Waiting for instances to start up...
Waiting 120 more seconds...
Generating cluster's SSH key on master...
ssh: connect to host ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com port 22:
Connection refused
Error executing remote command, retrying after 30 seconds: Command '['ssh',
'-o', 'StrictHostKeyChecking=no', '-i', '/home/alitouka/my_key.pem', '-t',
'-t', u'r...@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com', "\n  [ -f
~/.ssh/id_rsa ] ||\n(ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
&&\n cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n"]'
returned non-zero exit status 255
ssh: connect to host ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com port 22:
Connection refused
Error executing remote command, retrying after 30 seconds: Command '['ssh',
'-o', 'StrictHostKeyChecking=no', '-i', '/home/alitouka/my_key.pem', '-t',
'-t', u'r...@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com', "\n  [ -f
~/.ssh/id_rsa ] ||\n(ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
&&\n cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n"]'
returned non-zero exit status 255
ssh: connect to host ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com port 22:
Connection refused
Error executing remote command, retrying after 30 seconds: Command '['ssh',
'-o', 'StrictHostKeyChecking=no', '-i', '/home/alitouka/my_key.pem', '-t',
'-t', u'r...@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com', "\n  [ -f
~/.ssh/id_rsa ] ||\n(ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
&&\n cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n"]'
returned non-zero exit status 255
Warning: Permanently added
'ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com,54.227.205.82'
(RSA) to the list of known hosts.
Connection to ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com closed.
Traceback (most recent call last):
  File "./spark_ec2.py", line 806, in 
main()
  File "./spark_ec2.py", line 799, in main
real_main()
  File "./spark_ec2.py", line 684, in real_main
setup_cluster(conn, master_nodes, slave_nodes, opts, True)
  File "./spark_ec2.py", line 419, in setup_cluster
dot_ssh_tar = ssh_read(master, opts, ['tar', 'c', '.ssh'])
  File "./spark_ec2.py", line 624, in ssh_read
return subprocess.check_output(
AttributeError: 'module' object has no attribute 'check_output'


Re: Error starting EC2 cluster

2014-05-16 Thread Aliaksei Litouka
Well... the reason was an out-of-date version of Python (2.6.6) on the
machine where I ran the script. The spark_ec2.py script calls
subprocess.check_output, which was only added in Python 2.7, so it fails on
older versions. If anyone else runs into this issue, just update your Python.


On Sun, May 4, 2014 at 7:51 PM, Aliaksei Litouka  wrote:

> I am using Spark 0.9.1. When I'm trying to start a EC2 cluster with the
> spark-ec2 script, an error occurs and the following message is issued:
> AttributeError: 'module' object has no attribute 'check_output'. By this
> time, EC2 instances are up and running but Spark doesn't seem to be
> installed on them. Any ideas how to fix it?
>
> $ ./spark-ec2 -k my_key -i /home/alitouka/my_key.pem -s 1
> --region=us-east-1 --instance-type=m3.medium launch test_cluster
> Setting up security groups...
> Searching for existing cluster test_cluster...
> Don't recognize m3.medium, assuming type is pvm
> Spark AMI: ami-5bb18832
> Launching instances...
> Launched 1 slaves in us-east-1c, regid = r-
> Launched master in us-east-1c, regid = r-
> Waiting for instances to start up...
> Waiting 120 more seconds...
> Generating cluster's SSH key on master...
> ssh: connect to host ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com port 22:
> Connection refused
> Error executing remote command, retrying after 30 seconds: Command
> '['ssh', '-o', 'StrictHostKeyChecking=no', '-i',
> '/home/alitouka/my_key.pem', '-t', '-t',
> u'r...@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com', "\n  [ -f
> ~/.ssh/id_rsa ] ||\n(ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
> &&\n cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n"]'
> returned non-zero exit status 255
> ssh: connect to host ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com port 22:
> Connection refused
> Error executing remote command, retrying after 30 seconds: Command
> '['ssh', '-o', 'StrictHostKeyChecking=no', '-i',
> '/home/alitouka/my_key.pem', '-t', '-t',
> u'r...@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com', "\n  [ -f
> ~/.ssh/id_rsa ] ||\n(ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
> &&\n cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n"]'
> returned non-zero exit status 255
> ssh: connect to host ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com port 22:
> Connection refused
> Error executing remote command, retrying after 30 seconds: Command
> '['ssh', '-o', 'StrictHostKeyChecking=no', '-i',
> '/home/alitouka/my_key.pem', '-t', '-t',
> u'r...@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com', "\n  [ -f
> ~/.ssh/id_rsa ] ||\n(ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
> &&\n cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n"]'
> returned non-zero exit status 255
> Warning: Permanently added 
> 'ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com,54.227.205.82'
> (RSA) to the list of known hosts.
> Connection to ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com closed.
> Traceback (most recent call last):
>   File "./spark_ec2.py", line 806, in <module>
> main()
>   File "./spark_ec2.py", line 799, in main
> real_main()
>   File "./spark_ec2.py", line 684, in real_main
> setup_cluster(conn, master_nodes, slave_nodes, opts, True)
>   File "./spark_ec2.py", line 419, in setup_cluster
> dot_ssh_tar = ssh_read(master, opts, ['tar', 'c', '.ssh'])
>   File "./spark_ec2.py", line 624, in ssh_read
> return subprocess.check_output(
> AttributeError: 'module' object has no attribute 'check_output'
>


How to specify executor memory in EC2 ?

2014-06-10 Thread Aliaksei Litouka
I am testing my application in an EC2 cluster of m3.medium machines. By
default, only 512 MB of memory on each machine is used. I want to increase
this amount, and I'm trying to do it by passing the --executor-memory 2G
option to the spark-submit script, but it doesn't seem to work: each machine
still uses only 512 MB instead of 2 gigabytes. What am I doing wrong? How do I
increase the amount of memory?


An attempt to implement dbscan algorithm on top of Spark

2014-06-12 Thread Aliaksei Litouka
Hi.
I'm not sure whether messages like this are appropriate on this list; I just
want to share an application I am working on. It is a personal project which I
started in order to learn more about Spark and Scala and, if it succeeds, to
contribute it to the Spark community.

Maybe someone will find it useful. Or maybe someone will want to join
development.

The application is available at https://github.com/alitouka/spark_dbscan

Any questions, comments, suggestions, as well as criticism are welcome :)

Best regards,
Aliaksei Litouka


Re: How to specify executor memory in EC2 ?

2014-06-12 Thread Aliaksei Litouka
spark-env.sh doesn't seem to contain any settings related to memory size :(
I will keep searching for a solution and will post it here if I find one :)
Thank you anyway.


On Wed, Jun 11, 2014 at 12:19 AM, Matei Zaharia 
wrote:

> It might be that conf/spark-env.sh on EC2 is configured to set it to 512,
> and is overriding the application’s settings. Take a look in there and
> delete that line if possible.
>
> Matei
>
> On Jun 10, 2014, at 2:38 PM, Aliaksei Litouka 
> wrote:
>
> > I am testing my application in EC2 cluster of m3.medium machines. By
> default, only 512 MB of memory on each machine is used. I want to increase
> this amount and I'm trying to do it by passing --executor-memory 2G option
> to the spark-submit script, but it doesn't seem to work - each machine uses
> only 512 MB instead of 2 gigabytes. What am I doing wrong? How do I
> increase the amount of memory?
>
>


Re: How to specify executor memory in EC2 ?

2014-06-12 Thread Aliaksei Litouka
Yes, I am launching the cluster with the spark_ec2 script. I checked
/root/spark/conf/spark-env.sh on the master node and on the slaves, and it
looks like this:

#!/usr/bin/env bash
> export SPARK_LOCAL_DIRS="/mnt/spark"
> # Standalone cluster options
> export SPARK_MASTER_OPTS=""
> export SPARK_WORKER_INSTANCES=1
> export SPARK_WORKER_CORES=1
> export HADOOP_HOME="/root/ephemeral-hdfs"
> export SPARK_MASTER_IP=ec2-54-89-95-238.compute-1.amazonaws.com
> export MASTER=`cat /root/spark-ec2/cluster-url`
> export
> SPARK_SUBMIT_LIBRARY_PATH="$SPARK_SUBMIT_LIBRARY_PATH:/root/ephemeral-hdfs/lib/native/"
> export
> SPARK_SUBMIT_CLASSPATH="$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/ephemeral-hdfs/conf"
> # Bind Spark's web UIs to this machine's public EC2 hostname:
> export SPARK_PUBLIC_DNS=`wget -q -O -
> http://169.254.169.254/latest/meta-data/public-hostname`
> # Set a high ulimit for large shuffles
> ulimit -n 100


None of these variables seem to be related to memory size. Let me know if I
am missing something.


On Thu, Jun 12, 2014 at 7:17 PM, Matei Zaharia 
wrote:

> Are you launching this using our EC2 scripts? Or have you set up a cluster
> by hand?
>
> Matei
>
> On Jun 12, 2014, at 2:32 PM, Aliaksei Litouka 
> wrote:
>
> spark-env.sh doesn't seem to contain any settings related to memory size
> :( I will continue searching for a solution and will post it if I find it :)
> Thank you, anyway
>
>
> On Wed, Jun 11, 2014 at 12:19 AM, Matei Zaharia 
> wrote:
>
>> It might be that conf/spark-env.sh on EC2 is configured to set it to 512,
>> and is overriding the application’s settings. Take a look in there and
>> delete that line if possible.
>>
>> Matei
>>
>> On Jun 10, 2014, at 2:38 PM, Aliaksei Litouka 
>> wrote:
>>
>> > I am testing my application in EC2 cluster of m3.medium machines. By
>> default, only 512 MB of memory on each machine is used. I want to increase
>> this amount and I'm trying to do it by passing --executor-memory 2G option
>> to the spark-submit script, but it doesn't seem to work - each machine uses
>> only 512 MB instead of 2 gigabytes. What am I doing wrong? How do I
>> increase the amount of memory?
>>
>>
>
>


Re: An attempt to implement dbscan algorithm on top of Spark

2014-06-12 Thread Aliaksei Litouka
Vipul,
Thanks for your feedback. As far as I understand, you mean RDD[(Double,
Double)] (note the parentheses), where each of these Double values is supposed
to contain one coordinate of a point. That limits us to a 2-dimensional space,
which is not suitable for many tasks; I want the algorithm to be able to work
in a multidimensional space. There is already a class
org.alitouka.spark.dbscan.spatial.Point in my code, which represents a point
with an arbitrary number of coordinates.

IOHelper.readDataset is just a convenience method which reads a CSV file
and returns an RDD of Points (more precisely, it returns a value of type
RawDataset, which is just an alias for RDD[Point]). If your data is stored
in a format other than CSV, you will have to write your own code to convert
your data to RawDataset.
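
For anyone who wants to try it, this is roughly what end-to-end usage looks
like. It is a simplified sketch: the epsilon / minimum-points values are only
placeholders, and the settings and output helpers shown here may differ
slightly from the current API, so please check the README in the repository
for the exact names.

import org.apache.spark.{SparkConf, SparkContext}
import org.alitouka.spark.dbscan._
// Note: IOHelper may live in a subpackage (e.g. ...dbscan.util.io); adjust
// the import to match the actual source tree.

object DbscanExample extends App {
  val sc = new SparkContext(new SparkConf().setAppName("DBSCAN example"))

  // Read a CSV file of coordinates; each line becomes a Point with as many
  // dimensions as there are columns. RawDataset is an alias for RDD[Point].
  val data: RawDataset = IOHelper.readDataset(sc, "/path/to/my/data.csv")

  // Standard DBSCAN parameters; the values below are placeholders, not
  // recommendations for any particular dataset.
  val settings = new DbscanSettings().withEpsilon(25).withNumberOfPoints(30)

  val model = Dbscan.train(data, settings)
  IOHelper.saveClusteringResult(model, "/path/to/output/folder")
}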

I can add support for other data formats in future versions.

As for other distance measures, that is a high-priority item on my list ;)



On Thu, Jun 12, 2014 at 6:02 PM, Vipul Pandey  wrote:

> Great! I was going to implement one of my own - but I may not need to do
> that any more :)
> I haven't had a chance to look deep into your code but I would recommend
> accepting an RDD[Double,Double] as well, instead of just a file.
>
> val data = IOHelper.readDataset(sc, "/path/to/my/data.csv")
>
> And other distance measures, of course.
>
> Thanks,
> Vipul
>
>
>
>
> On Jun 12, 2014, at 2:31 PM, Aliaksei Litouka 
> wrote:
>
> Hi.
> I'm not sure if messages like this are appropriate in this list; I just
> want to share with you an application I am working on. This is my personal
> project which I started to learn more about Spark and Scala, and, if it
> succeeds, to contribute it to the Spark community.
>
> Maybe someone will find it useful. Or maybe someone will want to join
> development.
>
> The application is available at https://github.com/alitouka/spark_dbscan
>
> Any questions, comments, suggestions, as well as criticism are welcome :)
>
> Best regards,
> Aliaksei Litouka
>
>
>


Re: How to specify executor memory in EC2 ?

2014-06-13 Thread Aliaksei Litouka
Aaron,
spark.executor.memory is set to 2454m in my spark-defaults.conf, which is a
reasonable value for the EC2 instances I use (m3.medium machines). However, it
doesn't help: each executor still uses only 512 MB of memory. To figure out
why, I examined the spark-submit and spark-class scripts and found that the
Java options and Java memory size are computed in the spark-class script (see
the OUR_JAVA_OPTS and OUR_JAVA_MEM variables in that script). These values are
then used to compose the following string:

JAVA_OPTS="$JAVA_OPTS -Xms$OUR_JAVA_MEM -Xmx$OUR_JAVA_MEM"

Note that OUR_JAVA_MEM is appended to the end of the string. For some reason
which I haven't found yet, OUR_JAVA_MEM is set to its default value of 512 MB.
I was able to fix it only by setting the SPARK_MEM variable in the
spark-env.sh file:

export SPARK_MEM=2g

However, this variable is deprecated, so my solution doesn't seem to be
good :)
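
For reference, the non-deprecated way to request the same thing is the
spark.executor.memory property, which can also be set programmatically on the
SparkConf when the application creates its SparkContext. Below is a minimal
sketch of that route; I have not verified whether it avoids the override
described above, and the application name and job are just placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object MemoryConfigExample {
  def main(args: Array[String]): Unit = {
    // Ask for 2 GB per executor via spark.executor.memory instead of the
    // deprecated SPARK_MEM environment variable.
    val conf = new SparkConf()
      .setAppName("MemoryConfigExample") // placeholder application name
      .set("spark.executor.memory", "2g")

    val sc = new SparkContext(conf)
    try {
      // A trivial job, just so executors start; the Executors tab of the
      // web UI shows how much memory each one actually received.
      println(sc.parallelize(1 to 1000).count())
    } finally {
      sc.stop()
    }
  }
}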


On Thu, Jun 12, 2014 at 10:16 PM, Aaron Davidson  wrote:

> The scripts for Spark 1.0 actually specify this property in
> /root/spark/conf/spark-defaults.conf
>
> I didn't know that this would override the --executor-memory flag, though,
> that's pretty odd.
>
>
> On Thu, Jun 12, 2014 at 6:02 PM, Aliaksei Litouka <
> aliaksei.lito...@gmail.com> wrote:
>
>> Yes, I am launching a cluster with the spark_ec2 script. I checked
>> /root/spark/conf/spark-env.sh on the master node and on slaves and it looks
>> like this:
>>
>> #!/usr/bin/env bash
>>> export SPARK_LOCAL_DIRS="/mnt/spark"
>>> # Standalone cluster options
>>> export SPARK_MASTER_OPTS=""
>>> export SPARK_WORKER_INSTANCES=1
>>> export SPARK_WORKER_CORES=1
>>> export HADOOP_HOME="/root/ephemeral-hdfs"
>>> export SPARK_MASTER_IP=ec2-54-89-95-238.compute-1.amazonaws.com
>>> export MASTER=`cat /root/spark-ec2/cluster-url`
>>> export
>>> SPARK_SUBMIT_LIBRARY_PATH="$SPARK_SUBMIT_LIBRARY_PATH:/root/ephemeral-hdfs/lib/native/"
>>> export
>>> SPARK_SUBMIT_CLASSPATH="$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/ephemeral-hdfs/conf"
>>> # Bind Spark's web UIs to this machine's public EC2 hostname:
>>> export SPARK_PUBLIC_DNS=`wget -q -O -
>>> http://169.254.169.254/latest/meta-data/public-hostname`
>>> # Set a high ulimit for large shuffles
>>> ulimit -n 100
>>
>>
>> None of these variables seem to be related to memory size. Let me know if
>> I am missing something.
>>
>>
>> On Thu, Jun 12, 2014 at 7:17 PM, Matei Zaharia 
>> wrote:
>>
>>> Are you launching this using our EC2 scripts? Or have you set up a
>>> cluster by hand?
>>>
>>> Matei
>>>
>>> On Jun 12, 2014, at 2:32 PM, Aliaksei Litouka <
>>> aliaksei.lito...@gmail.com> wrote:
>>>
>>> spark-env.sh doesn't seem to contain any settings related to memory size
>>> :( I will continue searching for a solution and will post it if I find it :)
>>> Thank you, anyway
>>>
>>>
>>> On Wed, Jun 11, 2014 at 12:19 AM, Matei Zaharia wrote:
>>>
>>>> It might be that conf/spark-env.sh on EC2 is configured to set it to
>>>> 512, and is overriding the application’s settings. Take a look in there and
>>>> delete that line if possible.
>>>>
>>>> Matei
>>>>
>>>> On Jun 10, 2014, at 2:38 PM, Aliaksei Litouka <
>>>> aliaksei.lito...@gmail.com> wrote:
>>>>
>>>> > I am testing my application in EC2 cluster of m3.medium machines. By
>>>> default, only 512 MB of memory on each machine is used. I want to increase
>>>> this amount and I'm trying to do it by passing --executor-memory 2G option
>>>> to the spark-submit script, but it doesn't seem to work - each machine uses
>>>> only 512 MB instead of 2 gigabytes. What am I doing wrong? How do I
>>>> increase the amount of memory?
>>>>
>>>>
>>>
>>>
>>
>