RE: Spark on EMR with S3 example (Python)

2015-07-14 Thread Pagliari, Roberto
Hi Sujit, I just wanted to access public datasets on Amazon. Do I still need to provide the keys? Thank you, From: Sujit Pal [mailto:sujitatgt...@gmail.com] Sent: Tuesday, July 14, 2015 3:14 PM To: Pagliari, Roberto Cc: user@spark.apache.org Subject: Re: Spark on EMR with S3 example (Python)
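If the bucket is genuinely public, the keys may not be needed at all. If they are, one way to supply them is through the Hadoop configuration; a minimal sketch, where the property names are the standard Hadoop s3n settings of that era and the key values and bucket path are placeholders:

    # Supplying AWS keys via the Hadoop configuration (only needed for
    # private buckets; the key values below are placeholders).
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    lines = sc.textFile("s3n://some-bucket/some/path/part-*")
    print(lines.take(5))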

Spark on EMR with S3 example (Python)

2015-07-14 Thread Pagliari, Roberto
Is there an example of how to load data from a public S3 bucket in Python? I haven't found one. Thank you,
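A minimal sketch, assuming the pyspark shell (where `sc` already exists) and a hypothetical public bucket path:

    # Reading a text file from a (placeholder) public S3 bucket.
    rdd = sc.textFile("s3n://public-bucket/path/to/data.txt")
    print(rdd.count())
    print(rdd.first())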

unable to bring up cluster with ec2 script

2015-07-07 Thread Pagliari, Roberto
I'm following the tutorial about Apache Spark on EC2. The output is the following:

    $ ./spark-ec2 -i ../spark.pem -k spark --copy launch spark-training
    Setting up security groups...
    Searching for existing cluster spark-training...
    Latest Spark AMI: ami-19474270
    Launching

bug: numClasses is not a valid argument of LogisticRegressionWithSGD

2015-04-27 Thread Pagliari, Roberto
With the Python APIs, the available arguments I got (using the inspect module) are the following: ['cls', 'data', 'iterations', 'step', 'miniBatchFraction', 'initialWeights', 'regParam', 'regType', 'intercept']. numClasses is not available. Can someone comment on this? Thanks,
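The SGD-based trainer is binary-only, which would explain the missing argument. A hedged sketch of the LBFGS variant, which does accept numClasses in later releases (around 1.4, if memory serves; the toy data below is made up):

    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    # Toy three-class dataset; labels must be 0..numClasses-1.
    # Assumes `sc` from the pyspark shell.
    data = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(1.0, [1.0, 0.0]),
        LabeledPoint(2.0, [1.0, 1.0]),
    ])

    # numClasses is accepted here; the SGD trainer has no equivalent.
    model = LogisticRegressionWithLBFGS.train(data, iterations=50, numClasses=3)
    print(model.predict([1.0, 0.0]))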

does randomSplit return a copy or a reference to the original RDD? [Python]

2015-04-27 Thread Pagliari, Roberto
Suppose I have something like the code below:

    for idx in xrange(0, 10):
        train_test_split = training.randomSplit(weights=[0.75, 0.25])
        train_cv = train_test_split[0]
        test_cv = train_test_split[1]
        # scale train_cv and test_cv by scaling
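As a quick illustration (a sketch, assuming the pyspark shell): randomSplit returns new RDDs derived from the parent; RDDs are immutable, so the original is never modified, and passing a seed makes the split reproducible across loop iterations:

    training = sc.parallelize(range(100))
    train_cv, test_cv = training.randomSplit(weights=[0.75, 0.25], seed=42)

    print(training.count())                    # still 100 -- the parent is untouched
    print(train_cv.count() + test_cv.count())  # also 100 -- the splits partition it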

RE: indexing an RDD [Python]

2015-04-24 Thread Pagliari, Roberto
the values and preserve the original ones. Thank you, From: Sven Krasser [mailto:kras...@gmail.com] Sent: Friday, April 24, 2015 5:56 PM To: Pagliari, Roberto Cc: user@spark.apache.org Subject: Re: indexing an RDD [Python] The solution depends largely on your use case. I assume the index is in the key

indexing an RDD [Python]

2015-04-24 Thread Pagliari, Roberto
I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices? For example, with idx=[0,4,5,6,8], I'd like to be able to create a new RDD with elements 0, 4, 5, 6, and 8.
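One way this could be done (a sketch; `labeled_points` stands in for the RDD from the question, and zipWithIndex assigns indices in the RDD's own order):

    # Pair each element with its index, keep the wanted indices,
    # then drop the index again.
    idx = {0, 4, 5, 6, 8}  # a set, for fast membership tests

    subset = (labeled_points
              .zipWithIndex()                   # (element, index) pairs
              .filter(lambda kv: kv[1] in idx)
              .map(lambda kv: kv[0]))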

gridsearch - python

2015-04-23 Thread Pagliari, Roberto
Can anybody point me to an example, if available, of grid search with Python? Thank you,

RE: gridsearch - python

2015-04-23 Thread Pagliari, Roberto
I know grid search with cross validation is not supported. However, I was wondering if there is something available for the time being. Thanks, From: Punyashloka Biswal [mailto:punya.bis...@gmail.com] Sent: Thursday, April 23, 2015 9:06 PM To: Pagliari, Roberto; user@spark.apache.org Subject: RE: gridsearch - python
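For the time being, a grid search can be hand-rolled. A sketch under stated assumptions: `data` is an RDD of LabeledPoint, a single held-out split stands in for cross validation, and the grid values are arbitrary:

    from pyspark.mllib.classification import LogisticRegressionWithSGD

    train, valid = data.randomSplit([0.8, 0.2], seed=1)

    best_err, best_model = float("inf"), None
    for reg in [0.0, 0.01, 0.1, 1.0]:
        model = LogisticRegressionWithSGD.train(train, iterations=100,
                                                regParam=reg)
        # Fraction of validation points the model gets wrong.
        err = (valid.map(lambda p: (model.predict(p.features), p.label))
                    .filter(lambda pl: pl[0] != pl[1])
                    .count()) / float(valid.count())
        if err < best_err:
            best_err, best_model = err, model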

setting cost in linear SVM [Python]

2015-04-22 Thread Pagliari, Roberto
Is there a way to set the cost value C when using a linear SVM?
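MLlib's SVMWithSGD does not expose a libsvm-style C directly; it exposes regParam, and in the regularized hinge-loss formulation the two play roughly inverse roles. A sketch of that heuristic mapping (`train_data` is an assumed RDD of LabeledPoint, and the exact correspondence depends on how the loss is scaled):

    from pyspark.mllib.classification import SVMWithSGD

    # Heuristic: a large C (weak regularization) corresponds to a
    # small regParam. Treat this as a rule of thumb, not an identity.
    C = 10.0
    model = SVMWithSGD.train(train_data, iterations=100, regParam=1.0 / C)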

failed to create a table with python (single node)

2014-11-11 Thread Pagliari, Roberto
I'm executing this example from the documentation (in single node mode):

    # sc is an existing SparkContext.
    from pyspark.sql import HiveContext
    sqlContext = HiveContext(sc)
    sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    # Queries can be expressed in HiveQL.
    results =
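For reference, the surrounding example as it appears in the Spark 1.1 programming guide, so the snippet can be run end to end (it assumes Spark was built with Hive support):

    from pyspark.sql import HiveContext

    sqlContext = HiveContext(sc)  # sc is an existing SparkContext

    sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

    # Queries can be expressed in HiveQL.
    results = sqlContext.sql("FROM src SELECT key, value").collect()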

error when importing HiveContext

2014-11-07 Thread Pagliari, Roberto
I'm getting this error when importing hive context:

    from pyspark.sql import HiveContext
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in <module>
        from pyspark.context import SparkContext
      File

spark context not defined

2014-11-07 Thread Pagliari, Roberto
I'm running the latest version of Spark with Hadoop 1.x, Scala 2.9.3, and Hive 0.9.0. When using Python 2.7:

    from pyspark.sql import HiveContext
    sqlContext = HiveContext(sc)

I'm getting 'sc not defined'. On the other hand, I can see 'sc' from the pyspark CLI. Is there a way to fix it?
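The `sc` variable is only created automatically inside the pyspark shell; a standalone script has to build its own context first. A minimal sketch (the app name is arbitrary):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="HiveExample")
    sqlContext = HiveContext(sc)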

SparkContext._lock Error

2014-11-05 Thread Pagliari, Roberto
I'm using this system: Hadoop 1.0.4, Scala 2.9.3, Hive 0.9.0, with Spark 1.1.0. When importing pyspark, I'm getting this error:

    from pyspark.sql import *
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in ?
        from

RE: problem with start-slaves.sh

2014-10-30 Thread Pagliari, Roberto
I also didn't realize I was trying to bring up the SecondaryNameNode as a slave; that might be an issue as well. Thanks, From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Thursday, October 30, 2014 11:27 AM To: Pagliari, Roberto Cc: user@spark.apache.org Subject: Re: problem with start-slaves.sh

RE: problem with start-slaves.sh

2014-10-29 Thread Pagliari, Roberto
Pagliari, Roberto rpagli...@appcomsci.com wrote: I ran sbin/start-master.sh followed by sbin/start-slaves.sh (I built with the -Phive option to be able to interface with Hive). I'm getting this: ip_address: org.apache.spark.deploy.worker.Worker running as process . Stop it first.

install sbt

2014-10-28 Thread Pagliari, Roberto
Is there a repo or some kind of instruction about how to install sbt for CentOS? Thanks,

problem with start-slaves.sh

2014-10-28 Thread Pagliari, Roberto
I ran sbin/start-master.sh followed by sbin/start-slaves.sh (I built with the -Phive option to be able to interface with Hive). I'm getting this: ip_address: org.apache.spark.deploy.worker.Worker running as process . Stop it first. Am I doing something wrong? In my specific case, shark+hive is

using existing hive with spark sql

2014-10-27 Thread Pagliari, Roberto
If I already have Hive running on Hadoop, do I need to build Spark with Hive support using the sbt/sbt -Phive assembly/assembly command? If the answer is no, how do I tell Spark where the Hive home is? Thanks,

Spark SQL configuration

2014-10-26 Thread Pagliari, Roberto
I'm a newbie with Spark. After installing it on all the machines I want to use, do I need to tell it about the Hadoop configuration, or will it be able to find it by itself? Thank you,