RE: Spark on EMR with S3 example (Python)
Hi Sujit,

I just wanted to access public datasets on Amazon. Do I still need to provide the keys?

Thank you,

From: Sujit Pal [mailto:sujitatgt...@gmail.com]
Sent: Tuesday, July 14, 2015 3:14 PM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: Spark on EMR with S3 example (Python)

Hi Roberto,

I have written PySpark code that reads from private S3 buckets; it should be similar for public S3 buckets as well. You need to set the AWS access and secret keys into the SparkContext, then you can access the S3 folders and files with their s3n:// paths. Something like this:

    sc = SparkContext()
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_key)
    sc.textFile("s3n://mybucket/my_input_folder") \
      .map(lambda x: do_something(x)) \
      .saveAsTextFile("s3n://mybucket/my_output_folder")
    ...

You can read and write sequence files as well - these are the only two formats I have tried, but I'm sure the other ones like JSON would work also. Another approach is to embed the AWS access key and secret key into the s3n:// path. I wasn't able to use the s3 protocol, but s3n is equivalent (I believe it's an older version, but I'm not sure) and it works for access.

Hope this helps,
Sujit

On Tue, Jul 14, 2015 at 10:50 AM, Pagliari, Roberto <rpagli...@appcomsci.com> wrote:
Is there an example about how to load data from a public S3 bucket in Python? I haven't found any.

Thank you,
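On the public-bucket question: as far as I know, the s3n filesystem of this era still insisted on credentials being configured even for world-readable buckets, so setting (or embedding) keys is still needed. A minimal sketch of the embedded-credentials variant Sujit mentions, reusing the `sc` from above; ACCESS_KEY, SECRET_KEY and mybucket are placeholders, not real values:

    # A secret key containing "/" will break this URL form; prefer the
    # hadoopConfiguration() approach above in that case.
    path = "s3n://ACCESS_KEY:SECRET_KEY@mybucket/my_input_folder"
    mydata = sc.textFile(path)
    print(mydata.take(5))  # pull a few records back to the driver as a smoke test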
Spark on EMR with S3 example (Python)
Is there an example about how to load data from a public S3 bucket in Python? I haven't found any. Thank you,
unable to bring up cluster with ec2 script
I'm following the tutorial about Apache Spark on EC2. The output is the following:

$ ./spark-ec2 -i ../spark.pem -k spark --copy launch spark-training
Setting up security groups...
Searching for existing cluster spark-training...
Latest Spark AMI: ami-19474270
Launching instances...
Launched 5 slaves in us-east-1d, regid = r-59a0d4b6
Launched master in us-east-1d, regid = r-9ba2d674
Waiting for instances to start up...
Waiting 120 more seconds...
Copying SSH key ../spark.pem to master...
ssh: connect to host ec2-54-152-15-165.compute-1.amazonaws.com port 22: Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i ../spark.pem r...@ec2-54-152-15-165.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-54-152-15-165.compute-1.amazonaws.com port 22: Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i ../spark.pem r...@ec2-54-152-15-165.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: Could not resolve hostname ec2-54-152-15-165.compute-1.amazonaws.com: Name or service not known
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i ../spark.pem r...@ec2-54-152-15-165.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-54-152-15-165.compute-1.amazonaws.com port 22: Connection refused
Traceback (most recent call last):
  File "./spark_ec2.py", line 925, in <module>
    main()
  File "./spark_ec2.py", line 766, in main
    setup_cluster(conn, master_nodes, slave_nodes, zoo_nodes, opts, True)
  File "./spark_ec2.py", line 406, in setup_cluster
    ssh(master, opts, 'mkdir -p ~/.ssh')
  File "./spark_ec2.py", line 712, in ssh
    raise e
subprocess.CalledProcessError: Command 'ssh -t -o StrictHostKeyChecking=no -i ../spark.pem r...@ec2-54-152-15-165.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255

However, I can see the six instances created on my EC2 console, and I could even get the name of the master. I'm not sure how to fix the ssh issue (my region is US East).
bug: numClasses is not a valid argument of LogisticRegressionWithSGD
With the Python APIs, the available arguments I got (using the inspect module) are the following:

['cls', 'data', 'iterations', 'step', 'miniBatchFraction', 'initialWeights', 'regParam', 'regType', 'intercept']

numClasses is not available. Can someone comment on this? Thanks,
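For context, the SGD-based trainer in MLlib handles only binary labels, which is why numClasses is genuinely absent from its signature; multiclass logistic regression was exposed through LogisticRegressionWithLBFGS instead, in sufficiently recent releases. A sketch reproducing the inspect check and the LBFGS alternative — `training_rdd` is a hypothetical RDD of LabeledPoints:

    import inspect
    from pyspark.mllib.classification import (LogisticRegressionWithSGD,
                                              LogisticRegressionWithLBFGS)

    # Reproduces the check from the post: the SGD trainer is binary-only,
    # so numClasses does not appear in its train() signature.
    print(inspect.getargspec(LogisticRegressionWithSGD.train).args)

    # Multiclass support lives in the LBFGS-based trainer (assuming a
    # Spark release where its train() exposes numClasses).
    model = LogisticRegressionWithLBFGS.train(training_rdd, numClasses=3)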
does randomSplit return a copy or a reference to the original RDD? [Python]
Suppose I have something like the code below:

    for idx in xrange(0, 10):
        train_test_split = training.randomSplit(weights=[0.75, 0.25])
        train_cv = train_test_split[0]
        test_cv = train_test_split[1]
        # scale train_cv and test_cv

By scaling train_cv and test_cv, will the original data be affected? Thanks,
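For reference, RDDs are immutable: randomSplit() returns new RDDs, and scaling via map() builds further new RDDs without modifying `training`, so each iteration re-splits the original data. A minimal sketch, where scale_point() is a hypothetical per-point scaling helper:

    from pyspark.mllib.regression import LabeledPoint

    def scale_point(p, factor):
        # hypothetical per-point scaling: returns a *new* LabeledPoint
        return LabeledPoint(p.label, p.features.toArray() * factor)

    for idx in xrange(0, 10):
        train_cv, test_cv = training.randomSplit(weights=[0.75, 0.25])
        # map() produces brand-new RDDs; `training` itself is never touched,
        # so the next iteration's randomSplit() still sees the original data
        train_scaled = train_cv.map(lambda p: scale_point(p, 0.5))
        test_scaled = test_cv.map(lambda p: scale_point(p, 0.5))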
RE: indexing an RDD [Python]
Hi,

I may need to read many values. The list [0,4,5,6,8] is the locations of the rows I'd like to extract from the RDD (of LabeledPoints). Could you possibly provide a quick example? Also, I'm not quite sure how this works, but the resulting RDD should be a clone, as I may need to modify the values and preserve the original ones.

Thank you,

From: Sven Krasser [mailto:kras...@gmail.com]
Sent: Friday, April 24, 2015 5:56 PM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: indexing an RDD [Python]

The solution depends largely on your use case. I assume the index is in the key. In that case, you can make a second RDD out of the list of indices and then use cogroup() on both. If the list of indices is small, just using filter() will work well. If you need to read back a few select values to the driver, take a look at lookup().

On Fri, Apr 24, 2015 at 1:51 PM, Pagliari, Roberto <rpagli...@appcomsci.com> wrote:
I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices? For example with idx=[0,4,5,6,8], I'd like to be able to create a new RDD with elements 0, 4, 5, 6 and 8.

--
www.skrasser.com
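A quick sketch of the filter() route Sven describes, first attaching positional indices with zipWithIndex() (assuming the RDD is not already keyed by index). Since filter() and map() return new RDDs, the original — called `labeled_points` here as a hypothetical name — is preserved:

    idx = set([0, 4, 5, 6, 8])

    # Attach a positional index to each element: (LabeledPoint, index) pairs
    indexed = labeled_points.zipWithIndex()

    # Keep only the requested positions, then drop the index again.
    subset = indexed.filter(lambda pair: pair[1] in idx) \
                    .map(lambda pair: pair[0])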
indexing an RDD [Python]
I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices? For example with idx=[0,4,5,6,8], I'd like to be able to create a new RDD with elements 0, 4, 5, 6 and 8.
gridsearch - python
Can anybody point me to an example, if available, about gridsearch with python? Thank you,
RE: gridsearch - python
I know grid search with cross validation is not supported. However, I was wondering if there is something available for the time being.

Thanks,

From: Punyashloka Biswal [mailto:punya.bis...@gmail.com]
Sent: Thursday, April 23, 2015 9:06 PM
To: Pagliari, Roberto; user@spark.apache.org
Subject: Re: gridsearch - python

https://issues.apache.org/jira/browse/SPARK-7022

Punya

On Thu, Apr 23, 2015 at 5:47 PM Pagliari, Roberto <rpagli...@appcomsci.com> wrote:
Can anybody point me to an example, if available, about gridsearch with python?

Thank you,
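Until SPARK-7022 lands, a grid search can be hand-rolled over the MLlib trainers. A minimal sketch, assuming `train` and `valid` RDDs of LabeledPoints already exist; the parameter grid and choice of trainer are illustrative only:

    from pyspark.mllib.classification import LogisticRegressionWithSGD

    best_err, best_params, best_model = 1.0, None, None
    for reg_param in [0.0, 0.01, 0.1]:
        for iterations in [50, 100]:
            model = LogisticRegressionWithSGD.train(
                train, iterations=iterations, regParam=reg_param)
            # Fraction of validation points the model gets wrong
            err = (valid.map(lambda p: (model.predict(p.features), p.label))
                        .filter(lambda pair: pair[0] != pair[1])
                        .count() / float(valid.count()))
            if err < best_err:
                best_err, best_params, best_model = err, (reg_param, iterations), model

Re-running the loop around a fresh randomSplit() each time would turn this into rough cross-validation.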
setting cost in linear SVM [Python]
Is there a way to set the cost value C when using linear SVM?
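For what it's worth, MLlib's SVMWithSGD has no parameter literally named C; regularization strength is set through regParam, which plays roughly the inverse role of a libsvm-style C (larger regParam means stronger regularization). A sketch — the C-to-regParam conversion below is an assumption based on the usual primal SVM formulation, not something the API itself defines:

    from pyspark.mllib.classification import SVMWithSGD

    C = 1.0                      # desired libsvm-style cost (illustrative)
    n = training.count()         # `training`: hypothetical RDD of LabeledPoints
    reg_param = 1.0 / (C * n)    # assumed correspondence: lambda ~ 1 / (C * n)

    model = SVMWithSGD.train(training, iterations=100, regParam=reg_param)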
failed to create a table with python (single node)
I'm executing this example from the documentation (in single-node mode):

    # sc is an existing SparkContext.
    from pyspark.sql import HiveContext
    sqlContext = HiveContext(sc)

    sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

    # Queries can be expressed in HiveQL.
    results = sqlContext.sql("FROM src SELECT key, value").collect()

1. Would it be possible to get the results from collect in a more human-readable format? For example, I would like a result similar to what I would get using the Hive CLI.
2. The first query does not seem to create the table. I tried "show tables;" from Hive after running it, and the table src did not show up.
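On point 1, the collected Row objects can be printed by hand in a tab-separated, Hive-CLI-like form; a small sketch, assuming the query above already ran. On point 2, a common cause with this setup is that, without a hive-site.xml on Spark's classpath, HiveContext falls back to a local Derby metastore in the current working directory, so the Hive CLI (which reads its own metastore) never sees the new table.

    rows = sqlContext.sql("FROM src SELECT key, value").collect()
    for row in rows:
        # Row fields are accessible by name (row[0] / row[1] also works)
        print("%s\t%s" % (row.key, row.value))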
error when importing HiveContext
I'm getting this error when importing HiveContext:

    from pyspark.sql import HiveContext
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in <module>
        from pyspark.context import SparkContext
      File "/path/spark-1.1.0/python/pyspark/context.py", line 30, in <module>
        from pyspark.java_gateway import launch_gateway
      File "/path/spark-1.1.0/python/pyspark/java_gateway.py", line 26, in <module>
        from py4j.java_gateway import java_import, JavaGateway, GatewayClient
    ImportError: No module named py4j.java_gateway

I cannot find py4j on my system. Where is it?
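py4j isn't installed system-wide; it ships inside the Spark distribution under python/lib as a zip, and bin/pyspark normally puts it on the path for you. For a standalone interpreter, a sketch of adding it by hand (the spark_home path mirrors the traceback; the zip's exact version varies by release, hence the glob):

    import glob, os, sys

    spark_home = "/path/spark-1.1.0"
    sys.path.insert(0, os.path.join(spark_home, "python"))
    # py4j lives under python/lib as py4j-<version>-src.zip
    for py4j_zip in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
        sys.path.insert(0, py4j_zip)

    from pyspark.sql import HiveContext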
spark context not defined
I'm running the latest version of Spark with Hadoop 1.x, Scala 2.9.3 and Hive 0.9.0. When using Python 2.7:

    from pyspark.sql import HiveContext
    sqlContext = HiveContext(sc)

I'm getting 'sc not defined'. On the other hand, I can see 'sc' from the pyspark CLI. Is there a way to fix it?
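sc is only pre-created inside the bin/pyspark shell; a plain Python interpreter or standalone script has to construct its own SparkContext first. A minimal sketch — the master URL and app name are placeholders:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    # Build the context that the pyspark shell would otherwise provide as `sc`
    sc = SparkContext("local", "hive-context-example")
    sqlContext = HiveContext(sc)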
SparkContext._lock Error
I'm using this system:

Hadoop 1.0.4
Scala 2.9.3
Hive 0.9.0

with Spark 1.1.0. When importing pyspark, I'm getting this error:

    from pyspark.sql import *
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in ?
        from pyspark.context import SparkContext
      File "/path/spark-1.1.0/python/pyspark/context.py", line 209
        with SparkContext._lock:
                       ^
    SyntaxError: invalid syntax

How do I fix it? Thank you,
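That SyntaxError is the signature of an interpreter older than Python 2.6: the `with` statement at context.py line 209 isn't valid syntax there (the stock interpreter on older CentOS/RHEL systems is 2.4), and PySpark in this line requires Python 2.6+. A quick check, run with the same interpreter that produced the traceback:

    import sys
    # PySpark 1.x needs Python 2.6+; anything older chokes on the `with`
    # statement before it can report a friendlier version error.
    print(sys.version)
    assert sys.version_info >= (2, 6), "interpreter too old for PySpark"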
RE: problem with start-slaves.sh
I also didn't realize I was trying to bring up the 2ndNameNode as a slave... that might be an issue as well.

Thanks,

From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Thursday, October 30, 2014 11:27 AM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: problem with start-slaves.sh

Roberto, I don't think Shark is an issue -- I have Shark server running on a node that also acts as a worker. What you can do is turn off Shark server and just run start-all to start your Spark cluster. Then you can try bin/spark-shell --master yourmasterip and see if you can successfully run some hello-world stuff. This will verify you have a working Spark cluster. Shark is just an application on top of it, so I can't imagine that's what's causing interference. But stopping it is the simplest way to check.

On Wed, Oct 29, 2014 at 10:54 PM, Pagliari, Roberto <rpagli...@appcomsci.com> wrote:
Hi Yana,
In my case I did not start any Spark worker. However, Shark was definitely running. Do you think that might be a problem? I will take a look.

Thank you,

From: Yana Kadiyska [yana.kadiy...@gmail.com]
Sent: Wednesday, October 29, 2014 9:45 AM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: problem with start-slaves.sh

I see this when I start a worker and then try to start it again, forgetting it's already running (I don't use start-slaves; I start the slaves individually with start-slave.sh). All this is telling you is that there is already a running process on that machine. You can see it if you do a ps -aef | grep worker. You can look on the Spark UI and see if your master shows this machine as connected to it already. If it doesn't, you might want to kill the worker process and restart it.

On Tue, Oct 28, 2014 at 4:32 PM, Pagliari, Roberto <rpagli...@appcomsci.com> wrote:
I ran sbin/start-master.sh followed by sbin/start-slaves.sh (I built with the -Phive option to be able to interface with Hive). I'm getting this:

ip_address: org.apache.spark.deploy.worker.Worker running as process . Stop it first.

Am I doing something wrong? In my specific case, Shark + Hive is running on the nodes. Does that interfere with Spark?

Thank you,
RE: problem with start-slaves.sh
Hi Yana,
In my case I did not start any Spark worker. However, Shark was definitely running. Do you think that might be a problem? I will take a look.

Thank you,

From: Yana Kadiyska [yana.kadiy...@gmail.com]
Sent: Wednesday, October 29, 2014 9:45 AM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: problem with start-slaves.sh

I see this when I start a worker and then try to start it again, forgetting it's already running (I don't use start-slaves; I start the slaves individually with start-slave.sh). All this is telling you is that there is already a running process on that machine. You can see it if you do a ps -aef | grep worker. You can look on the Spark UI and see if your master shows this machine as connected to it already. If it doesn't, you might want to kill the worker process and restart it.

On Tue, Oct 28, 2014 at 4:32 PM, Pagliari, Roberto <rpagli...@appcomsci.com> wrote:
I ran sbin/start-master.sh followed by sbin/start-slaves.sh (I built with the -Phive option to be able to interface with Hive). I'm getting this:

ip_address: org.apache.spark.deploy.worker.Worker running as process . Stop it first.

Am I doing something wrong? In my specific case, Shark + Hive is running on the nodes. Does that interfere with Spark?

Thank you,
install sbt
Is there a repo or some kind of instructions about how to install sbt on CentOS? Thanks,
problem with start-slaves.sh
I ran sbin/start-master.sh followed by sbin/start-slaves.sh (I built with the -Phive option to be able to interface with Hive). I'm getting this:

ip_address: org.apache.spark.deploy.worker.Worker running as process . Stop it first.

Am I doing something wrong? In my specific case, Shark + Hive is running on the nodes. Does that interfere with Spark? Thank you,
using existing hive with spark sql
If I already have Hive running on Hadoop, do I need to build Spark using the sbt/sbt -Phive assembly/assembly command? If the answer is no, how do I tell Spark where the Hive home is? Thanks,
Spark SQL configuration
I'm a newbie with Spark. After installing it on all the machines I want to use, do I need to tell it about the Hadoop configuration, or will it be able to find it by itself? Thank you,