RE: Spark on EMR with S3 example (Python)

2015-07-14 Thread Pagliari, Roberto
Hi Sujit,
I just wanted to access public datasets on Amazon. Do I still need to provide 
the keys?

Thank you,


From: Sujit Pal [mailto:sujitatgt...@gmail.com]
Sent: Tuesday, July 14, 2015 3:14 PM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: Spark on EMR with S3 example (Python)

Hi Roberto,

I have written PySpark code that reads from private S3 buckets; it should be 
similar for public S3 buckets as well. You need to set the AWS access and 
secret keys on the SparkContext's Hadoop configuration, and then you can access 
the S3 folders and files with their s3n:// paths. Something like this:

sc = SparkContext()
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_key)

sc.textFile("s3n://mybucket/my_input_folder") \
    .map(lambda x: do_something(x)) \
    .saveAsTextFile("s3n://mybucket/my_output_folder")
...

You can read and write sequence files as well; these are the only two formats I 
have tried, but I'm sure other formats such as JSON would work too. Another 
approach is to embed the AWS access key and secret key directly into the s3n:// path.
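As a hedged sketch of that embedded-credentials variant (the values below are 
placeholders, and note this form breaks if the secret key contains a "/" character):

path = "s3n://{0}:{1}@mybucket/my_input_folder".format(aws_access_key, aws_secret_key)
mydata = sc.textFile(path)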

I wasn't able to use the s3 protocol, but s3n is equivalent (I believe it's an 
older version, though I'm not sure) and it works for access.

Hope this helps,
Sujit


On Tue, Jul 14, 2015 at 10:50 AM, Pagliari, Roberto 
rpagli...@appcomsci.com wrote:
Is there an example of how to load data from a public S3 bucket in Python? I 
haven't found any.

Thank you,




Spark on EMR with S3 example (Python)

2015-07-14 Thread Pagliari, Roberto
Is there an example of how to load data from a public S3 bucket in Python? I 
haven't found any.

Thank you,



unable to bring up cluster with ec2 script

2015-07-07 Thread Pagliari, Roberto


I'm following the tutorial about Apache Spark on EC2. The output is the 
following:


$ ./spark-ec2 -i ../spark.pem -k spark --copy launch spark-training
Setting up security groups...
Searching for existing cluster spark-training...
Latest Spark AMI: ami-19474270
Launching instances...
Launched 5 slaves in us-east-1d, regid = r-59a0d4b6
Launched master in us-east-1d, regid = r-9ba2d674
Waiting for instances to start up...
Waiting 120 more seconds...
Copying SSH key ../spark.pem to master...
ssh: connect to host ec2-54-152-15-165.compute-1.amazonaws.com port 22: 
Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i 
../spark.pem r...@ec2-54-152-15-165.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' 
returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-54-152-15-165.compute-1.amazonaws.com port 22: 
Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i 
../spark.pem r...@ec2-54-152-15-165.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' 
returned non-zero exit status 255, sleeping 30
ssh: Could not resolve hostname ec2-54-152-15-165.compute-1.amazonaws.com: 
Name or service not known
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i 
../spark.pem r...@ec2-54-152-15-165.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' 
returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-54-152-15-165.compute-1.amazonaws.com port 22: 
Connection refused
Traceback (most recent call last):
  File "./spark_ec2.py", line 925, in <module>
    main()
  File "./spark_ec2.py", line 766, in main
    setup_cluster(conn, master_nodes, slave_nodes, zoo_nodes, opts, True)
  File "./spark_ec2.py", line 406, in setup_cluster
    ssh(master, opts, 'mkdir -p ~/.ssh')
  File "./spark_ec2.py", line 712, in ssh
    raise e
subprocess.CalledProcessError: Command 'ssh -t -o StrictHostKeyChecking=no 
-i ../spark.pem r...@ec2-54-152-15-165.compute-1.amazonaws.com 'mkdir -p 
~/.ssh'' returned non-zero exit status 255


However, I can see the six instances created in my EC2 console, and I could 
even get the name of the master. I'm not sure how to fix the SSH issue (my 
region is US East).



bug: numClasses is not a valid argument of LogisticRegressionWithSGD

2015-04-27 Thread Pagliari, Roberto
With the Python API, the available arguments I got (using the inspect module) are 
the following:

['cls', 'data', 'iterations', 'step', 'miniBatchFraction', 'initialWeights', 
'regParam', 'regType', 'intercept']

numClasses is not available. Can someone comment on this?
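A hedged note: in MLlib's Python API, LogisticRegressionWithSGD only supports 
binary classification, which is presumably why numClasses is not among its 
train() arguments. Multiclass logistic regression is exposed through 
LogisticRegressionWithLBFGS, whose train() method accepts a numClasses parameter 
in more recent Spark releases (check the API docs for your version). A minimal 
sketch with illustrative names:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# labeled_points: an RDD of LabeledPoint with labels 0 .. numClasses-1
model = LogisticRegressionWithLBFGS.train(labeled_points, iterations=100, numClasses=3)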

Thanks,







does randomSplit return a copy or a reference to the original RDD? [Python]

2015-04-27 Thread Pagliari, Roberto
Suppose I have something like the code below


for idx in xrange(0, 10):
train_test_split = training.randomSplit(weights=[0.75, 0.25])
train_cv = train_test_split[0]
test_cv = train_test_split[1]
# scale train_cv and test_cv


By scaling train_cv and test_cv, will the original data be affected?

Thanks,
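A hedged sketch of the point in question: RDDs are immutable, so "scaling" means 
building new RDDs with transformations such as map(), and the original training 
RDD is left untouched (scale_point below is a hypothetical helper):

train_cv, test_cv = training.randomSplit(weights=[0.75, 0.25])

# map() returns new RDDs; neither training nor train_cv/test_cv is modified in place
scaled_train_cv = train_cv.map(lambda p: scale_point(p))
scaled_test_cv = test_cv.map(lambda p: scale_point(p))

The split RDDs draw on the same underlying data as training, but no PySpark 
transformation mutates an RDD in place.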



RE: indexing an RDD [Python]

2015-04-24 Thread Pagliari, Roberto
Hi,
I may need to read many values. The list [0,4,5,6,8] contains the locations of the 
rows I'd like to extract from the RDD (of LabeledPoints). Could you possibly 
provide a quick example?

Also, I'm not quite sure how this works, but the resulting RDD should be a 
clone, as I may need to modify the values and preserve the original ones.

Thank you,


From: Sven Krasser [mailto:kras...@gmail.com]
Sent: Friday, April 24, 2015 5:56 PM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: indexing an RDD [Python]

The solution depends largely on your use case. I assume the index is in the 
key. In that case, you can make a second RDD out of the list of indices and 
then use cogroup() on both.
If the list of indices is small, just using filter() will work well.
If you need to read back a few select values to the driver, take a look at 
lookup().
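For example, a minimal hedged sketch of the filter() approach, using zipWithIndex() 
to attach row positions (names are illustrative, and this assumes the RDD's 
iteration order defines the row indices):

idx_set = sc.broadcast(set([0, 4, 5, 6, 8]))

subset = (labeled_points.zipWithIndex()                # pairs of (LabeledPoint, rowIndex)
          .filter(lambda pair: pair[1] in idx_set.value)
          .map(lambda pair: pair[0]))

Since every transformation returns a new RDD, subset behaves like the clone Roberto 
asked about; the original RDD is not modified.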

On Fri, Apr 24, 2015 at 1:51 PM, Pagliari, Roberto 
rpagli...@appcomsci.com wrote:
I have an RDD of LabeledPoints.
Is it possible to select a subset of it based on a list of indices?
For example with idx=[0,4,5,6,8], I'd like to be able to create a new RDD with 
elements 0,4,5,6 and 8.





--
www.skrasser.com


indexing an RDD [Python]

2015-04-24 Thread Pagliari, Roberto
I have an RDD of LabeledPoints.
Is it possible to select a subset of it based on a list of indices?
For example with idx=[0,4,5,6,8], I'd like to be able to create a new RDD with 
elements 0,4,5,6 and 8.





gridsearch - python

2015-04-23 Thread Pagliari, Roberto
Can anybody point me to an example, if available, of grid search with Python?

Thank you,



RE: gridsearch - python

2015-04-23 Thread Pagliari, Roberto
I know grid search with cross-validation is not supported. However, I was 
wondering if there is something available for the time being.

Thanks,


From: Punyashloka Biswal [mailto:punya.bis...@gmail.com]
Sent: Thursday, April 23, 2015 9:06 PM
To: Pagliari, Roberto; user@spark.apache.org
Subject: Re: gridsearch - python

https://issues.apache.org/jira/browse/SPARK-7022.
Punya
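Until that lands, a hand-rolled loop over the parameter grid is one possible 
stopgap. A minimal hedged sketch using MLlib's LogisticRegressionWithSGD and a 
random train/validation split (all names are illustrative, and the metric is 
plain validation error):

from pyspark.mllib.classification import LogisticRegressionWithSGD

train, valid = labeled_points.randomSplit([0.8, 0.2], seed=42)
n_valid = float(valid.count())

best_reg, best_err = None, float("inf")
for reg in [0.0, 0.01, 0.1, 1.0]:
    model = LogisticRegressionWithSGD.train(train, iterations=100, regParam=reg)
    # fraction of misclassified validation points
    err = valid.filter(lambda p: model.predict(p.features) != p.label).count() / n_valid
    if err < best_err:
        best_reg, best_err = reg, err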

On Thu, Apr 23, 2015 at 5:47 PM Pagliari, Roberto 
rpagli...@appcomsci.com wrote:
Can anybody point me to an example, if available, of grid search with Python?

Thank you,



setting cost in linear SVM [Python]

2015-04-22 Thread Pagliari, Roberto
Is there a way to set the cost value C when using linear SVM?
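A hedged note: MLlib's linear SVM (SVMWithSGD) does not expose a libSVM-style C 
directly; the knob it exposes is regParam, which plays roughly the inverse role 
(larger regParam means stronger regularization, i.e. behaves like a smaller C), 
with the exact correspondence depending on how the loss is normalized. A minimal 
sketch with illustrative names:

from pyspark.mllib.classification import SVMWithSGD

# training_rdd: an RDD of LabeledPoint; tune regParam where you would otherwise tune C
model = SVMWithSGD.train(training_rdd, iterations=100, regParam=0.01, regType="l2")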


failed to create a table with python (single node)

2014-11-11 Thread Pagliari, Roberto
I'm executing this example from the documentation (in single-node mode):

# sc is an existing SparkContext.
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

# Queries can be expressed in HiveQL.
results = sqlContext.sql("FROM src SELECT key, value").collect()




1. Would it be possible to get the results from collect() in a more 
human-readable format? For example, I would like to have a result similar to 
what I would get using the Hive CLI (see the sketch below).

2. The first query does not seem to create the table. I tried "show tables;" 
from Hive after running it, and the table src did not show up.
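On (1), a minimal hedged sketch for printing the collected Row objects in a 
tab-separated, Hive-CLI-like form:

results = sqlContext.sql("FROM src SELECT key, value").collect()
for row in results:
    # each element of results is a Row whose fields can be iterated over
    print("\t".join(str(field) for field in row))

On (2), one common cause (a hedged guess, not confirmed here): if Spark cannot see 
your hive-site.xml, HiveContext falls back to a local embedded Derby metastore (a 
metastore_db directory in the current working directory), so tables it creates are 
invisible to the Hive CLI, which talks to your real metastore. Copying hive-site.xml 
into Spark's conf/ directory usually points both at the same metastore.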


error when importing HiveContext

2014-11-07 Thread Pagliari, Roberto
I'm getting this error when importing HiveContext:

>>> from pyspark.sql import HiveContext
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in <module>
    from pyspark.context import SparkContext
  File "/path/spark-1.1.0/python/pyspark/context.py", line 30, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/path/spark-1.1.0/python/pyspark/java_gateway.py", line 26, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
ImportError: No module named py4j.java_gateway

I cannot find py4j on my system. Where is it?
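A hedged note: py4j is bundled inside the Spark distribution itself, under 
python/lib as a zip file (the exact file name varies by version), rather than 
being installed system-wide. Putting it on sys.path (or PYTHONPATH) before 
importing pyspark usually resolves this ImportError. A minimal sketch, assuming 
SPARK_HOME points at the Spark 1.1.0 directory:

import glob, os, sys

spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))
for py4j_zip in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")):
    sys.path.insert(0, py4j_zip)

from pyspark.sql import HiveContext  # should now import cleanly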


spark context not defined

2014-11-07 Thread Pagliari, Roberto
I'm running the latest version of Spark with Hadoop 1.x, Scala 2.9.3, and 
Hive 0.9.0.

When using Python 2.7:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

I'm getting 'sc not defined'

On the other hand, I can see 'sc' from the pyspark CLI.

Is there a way to fix it?
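A hedged note: sc is only pre-created inside the pyspark shell; in a standalone 
script run with plain python you have to build it yourself before creating the 
HiveContext. A minimal sketch with illustrative names:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("hive-example")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)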


SparkContext._lock Error

2014-11-05 Thread Pagliari, Roberto
I'm using this system

Hadoop 1.0.4
Scala 2.9.3
Hive 0.9.0


With Spark 1.1.0, when importing pyspark I'm getting this error:

>>> from pyspark.sql import *
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in ?
    from pyspark.context import SparkContext
  File "/path/spark-1.1.0/python/pyspark/context.py", line 209
    with SparkContext._lock:
                           ^
SyntaxError: invalid syntax

How do I fix it?

Thank you,
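A hedged guess at the cause: a SyntaxError pointing at a with statement usually 
means the interpreter running the code is older than Python 2.6 (for example, the 
system python 2.4 on older CentOS releases), while PySpark 1.x expects Python 2.6 
or newer. A quick check:

import sys
print(sys.version_info)  # PySpark 1.x needs Python 2.6+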


RE: problem with start-slaves.sh

2014-10-30 Thread Pagliari, Roberto
I also didn't realize I was trying to bring up the secondary NameNode as a slave; 
that might be an issue as well.

Thanks,


From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Thursday, October 30, 2014 11:27 AM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: problem with start-slaves.sh

Roberto, I don't think Shark is the issue -- I have the Shark server running on a 
node that also acts as a worker. What you can do is turn off the Shark server and 
just run start-all to start your Spark cluster. Then you can try bin/spark-shell 
--master yourmasterip and see if you can successfully run some hello-world 
stuff. This will verify you have a working Spark cluster. Shark is just an 
application on top of it, so I can't imagine that's what's causing 
interference, but stopping it is the simplest way to check.

On Wed, Oct 29, 2014 at 10:54 PM, Pagliari, Roberto 
rpagli...@appcomsci.com wrote:
hi Yana,
in my case I did not start any spark worker. However, shark was definitely 
running. Do you think that might be a problem?

I will take a look

Thank you,


From: Yana Kadiyska [yana.kadiy...@gmail.com]
Sent: Wednesday, October 29, 2014 9:45 AM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: problem with start-slaves.sh
I see this when I start a worker and then try to start it again forgetting it's 
already running (I don't use start-slaves, I start the slaves individually with 
start-slave.sh). All this is telling you is that there is already a running 
process on that machine. You can see it if you do a ps -aef|grep worker

you can look on the spark UI and see if your master shows this machine as 
connected to it already. If it doesn't, you might want to kill the worker 
process and restart it.

On Tue, Oct 28, 2014 at 4:32 PM, Pagliari, Roberto 
rpagli...@appcomsci.com wrote:
I ran sbin/start-master.sh followed by sbin/start-slaves.sh (I built with the 
-Phive option to be able to interface with Hive).

I’m getting this

ip_address: org.apache.spark.deploy.worker.Worker running as process . Stop 
it first.

Am I doing something wrong? In my specific case, shark+hive is running on the 
nodes. Does that interfere with spark?

Thank you,




RE: problem with start-slaves.sh

2014-10-29 Thread Pagliari, Roberto
hi Yana,
in my case I did not start any spark worker. However, shark was definitely 
running. Do you think that might be a problem?

I will take a look

Thank you,


From: Yana Kadiyska [yana.kadiy...@gmail.com]
Sent: Wednesday, October 29, 2014 9:45 AM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: problem with start-slaves.sh

I see this when I start a worker and then try to start it again forgetting it's 
already running (I don't use start-slaves, I start the slaves individually with 
start-slave.sh). All this is telling you is that there is already a running 
process on that machine. You can see it if you do a ps -aef|grep worker

you can look on the spark UI and see if your master shows this machine as 
connected to it already. If it doesn't, you might want to kill the worker 
process and restart it.

On Tue, Oct 28, 2014 at 4:32 PM, Pagliari, Roberto 
rpagli...@appcomsci.com wrote:
I ran sbin/start-master.sh followed by sbin/start-slaves.sh (I built with the 
-Phive option to be able to interface with Hive).

I’m getting this

ip_address: org.apache.spark.deploy.worker.Worker running as process . Stop 
it first.

Am I doing something wrong? In my specific case, shark+hive is running on the 
nodes. Does that interfere with spark?

Thank you,



install sbt

2014-10-28 Thread Pagliari, Roberto
Is there a repo or some kind of instructions on how to install sbt on CentOS?

Thanks,



problem with start-slaves.sh

2014-10-28 Thread Pagliari, Roberto
I ran sbin/start-master.sh followed by sbin/start-slaves.sh (I built with the 
-Phive option to be able to interface with Hive).

I'm getting this

ip_address: org.apache.spark.deploy.worker.Worker running as process . Stop 
it first.

Am I doing something wrong? In my specific case, shark+hive is running on the 
nodes. Does that interfere with spark?

Thank you,


using existing hive with spark sql

2014-10-27 Thread Pagliari, Roberto
If I already have Hive running on Hadoop, do I need to build Spark with Hive support using the

sbt/sbt -Phive assembly/assembly

command?

If the answer is no, how do I tell spark where hive home is?

Thanks,



Spark SQL configuration

2014-10-26 Thread Pagliari, Roberto
I'm a newbie with Spark. After installing it on all the machines I want to use, 
do I need to tell it about the Hadoop configuration, or will it be able to find 
it on its own?

Thank you,