Dropping parquet file partitions

2016-03-01 Thread sparkuser2345
Is there a way to drop parquet file partitions through Spark? I'm partitioning a parquet file by a date field and I would like to drop old partitions in a file system agnostic manner. I guess I could read the whole parquet file into a DataFrame, filter out the dates to be dropped, and overwrite

Preventing an RDD from shuffling

2015-12-16 Thread sparkuser2345
Is there a way to prevent an RDD from shuffling in a join operation without repartitioning it? I'm reading an RDD from sharded MongoDB, joining that with an RDD of incoming data (+ some additional calculations), and writing the resulting RDD back to MongoDB. It would make sense to shuffle only

Re: Parallelizing a task makes it freeze

2014-08-12 Thread sparkuser2345
). What are the limiting factors to the size of the elements of an RDD? sparkuser2345 wrote: I have an array 'dataAll' of key-value pairs where each value is an array of arrays. I would like to parallelize a task over the elements of 'dataAll' to the workers. In the dummy example below, the number

Parallelizing a task makes it freeze

2014-08-11 Thread sparkuser2345
I have an array 'dataAll' of key-value pairs where each value is an array of arrays. I would like to parallelize a task over the elements of 'dataAll' to the workers. In the dummy example below, the number of elements in 'dataAll' is 3, but in the real application it would be tens to hundreds.

Unable to access worker web UI or application UI (EC2)

2014-08-08 Thread sparkuser2345
I'm running Spark 1.0.0 on EMR. I'm able to access the master web UI but not the worker web UIs or the application detail UI (Server not found). I added the following inbound rule to the ElasticMapreduce-slave security group, but it didn't help: Type = All TCP, Port range = 0 - 65535, Source = My

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
sparkuser2345 wrote: I'm using Spark 1.0.0. The same works when: using Spark 0.9.1; saving to and reading from the local file system (Spark 1.0.0); saving to and reading from HDFS (Spark 1.0.0). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
Ashish Rangole wrote: "Specify a folder instead of a file name for input and output code, as in: Output: s3n://your-bucket-name/your-data-folder Input: (when consuming the above output) s3n://your-bucket-name/your-data-folder/*" Unfortunately, no luck: Exception in thread "main"

Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
Hi, I'm running Spark in an EMR cluster and I'm able to read from S3 using the REPL without problems: val input_file = "s3://bucket-name/test_data.txt"; val rawdata = sc.textFile(input_file); val test = rawdata.collect but when I try to run a simple standalone application reading the same data, I

Re: Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
I'm getting the same "Input path does not exist" error even after setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using the format s3://bucket-name/test_data.txt for the input file.

Re: Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
Evan R. Sparks wrote: "Try s3n://" Thanks, that works! In the REPL, I can successfully load the data using both s3:// and s3n://; why the difference?

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-07 Thread sparkuser2345
own using underlying sgd / bfgs primitives. On Sat, Jul 5, 2014 at 10:45 AM, Christopher Nguyen <ctn@> wrote: Hi sparkuser2345, I'm inferring the problem statement is something like "how do I make this complete faster (given my compute resources)?" Several

How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread sparkuser2345
Hi, I am trying to fit a logistic regression model with cross-validation in Spark 0.9.0 using SVMWithSGD. I have created an array data_kfolded where each element is a pair of RDDs containing the training and test data: (training_data: (RDD[org.apache.spark.mllib.regression.LabeledPoint],