Dropping parquet file partitions

2016-03-01 Thread sparkuser2345
Is there a way to drop parquet file partitions through Spark? I'm partitioning a parquet file by a date field and I would like to drop old partitions in a file system agnostic manner. I guess I could read the whole parquet file into a DataFrame, filter out the dates to be dropped, and overwrite the
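The read-filter-overwrite approach described above can be sketched as follows. This is a minimal sketch against the Spark 1.x DataFrame API; the paths, the partition column name `date`, and the cutoff value are all hypothetical. Note that overwriting the same path you are reading from is unsafe, so the pruned data is written to a new location here.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`

// Read the date-partitioned parquet data.
val df = sqlContext.read.parquet("hdfs:///data/events")

// Keep only partitions at or after the cutoff; older ones are simply not rewritten.
df.filter(df("date") >= "2016-01-01")
  .write
  .partitionBy("date")
  .mode("overwrite")
  .parquet("hdfs:///data/events_pruned")
```

A file-system-specific alternative is deleting the partition directories directly (e.g. via the Hadoop FileSystem API), which avoids rewriting the surviving data but is exactly the non-agnostic path the question is trying to avoid.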

Preventing an RDD from shuffling

2015-12-16 Thread sparkuser2345
Is there a way to prevent an RDD from shuffling in a join operation without repartitioning it? I'm reading an RDD from sharded MongoDB, joining that with an RDD of incoming data (+ some additional calculations), and writing the resulting RDD back to MongoDB. It would make sense to shuffle only th
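One common pattern for this situation is to partition the large RDD once with an explicit partitioner and persist it; because Spark records the partitioner, subsequent joins shuffle only the other side. A minimal sketch, with stand-in RDDs in place of the MongoDB-backed and incoming data (all names hypothetical):

```scala
import org.apache.spark.HashPartitioner

// Stand-ins for the Mongo-backed RDD and the incoming-data RDD.
val mongoRdd = sc.parallelize(Seq((1, "a"), (2, "b")))
val incoming = sc.parallelize(Seq((1, 10), (2, 20)))

val partitioner = new HashPartitioner(100)

// Partition the large RDD once and cache it; Spark remembers its partitioner.
val mongoPartitioned = mongoRdd.partitionBy(partitioner).persist()

// Joining against an RDD with a known partitioner shuffles only `incoming`;
// the cached, pre-partitioned side stays put.
val joined = mongoPartitioned.join(incoming)
```

This does not literally avoid repartitioning the Mongo RDD, but it pays that cost once up front rather than on every join.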

Experiences about NoSQL databases with Spark

2015-11-24 Thread sparkuser2345
I'm interested in knowing which NoSQL databases you use with Spark and what your experiences have been. On a general level, I would like to use Spark Streaming to process incoming data, fetch relevant aggregated data from the database, and update the aggregates in the DB based on the incoming records.
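The process-fetch-update loop described above can be sketched database-agnostically. The socket source, batch interval, and `saveAggregate` stub are hypothetical; a real NoSQL client call would replace the stub inside `foreachPartition` (opening one connection per partition rather than per record):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Stub standing in for a write to whichever NoSQL store is chosen.
def saveAggregate(key: String, count: Long): Unit = ()

val ssc = new StreamingContext(new SparkConf().setAppName("Aggregates"), Seconds(10))

// Count incoming records per key within each batch.
val counts = ssc.socketTextStream("localhost", 9999)
  .map(record => (record, 1L))
  .reduceByKey(_ + _)

// Push each batch's aggregates to the external store, one connection per partition.
counts.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    iter.foreach { case (k, v) => saveAggregate(k, v) }
  }
}

ssc.start()
ssc.awaitTermination()
```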

Re: Parallelizing a task makes it freeze

2014-08-12 Thread sparkuser2345
). What are the limiting factors to the size of the elements of an RDD? sparkuser2345 wrote > I have an array 'dataAll' of key-value pairs where each value is an array > of arrays. I would like to parallelize a task over the elements of > 'dataAll' to the workers. In

Parallelizing a task makes it freeze

2014-08-11 Thread sparkuser2345
I have an array 'dataAll' of key-value pairs where each value is an array of arrays. I would like to parallelize a task over the elements of 'dataAll' to the workers. In the dummy example below, the number of elements in 'dataAll' is 3, but in a real application it would be tens to hundreds. Without
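A minimal sketch of the described setup (dummy data and the summing task are hypothetical, since the original example is truncated):

```scala
// Key-value pairs whose values are arrays of arrays, as described above.
val dataAll = Array(
  (1, Array(Array(1.0, 2.0), Array(3.0, 4.0))),
  (2, Array(Array(5.0, 6.0))),
  (3, Array(Array(7.0)))
)

// One partition per element, so each task handles exactly one key.
val result = sc.parallelize(dataAll, dataAll.length)
  .map { case (key, arrays) => (key, arrays.map(_.sum).sum) }
  .collect()
```

If the individual elements are large, serializing them into tasks this way can be the bottleneck (Spark warns about large serialized tasks); broadcasting the data with `sc.broadcast` and parallelizing only the keys is the usual workaround.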

Unable to access worker web UI or application UI (EC2)

2014-08-08 Thread sparkuser2345
I'm running spark 1.0.0 on EMR. I'm able to access the master web UI but not the worker web UIs or the application detail UI ("Server not found"). I added the following inbound rule to the ElasticMapreduce-slave security group but it didn't help: Type = All TCP Port range = 0 - 65535 Source = My

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
Ashish Rangole wrote > Specify a folder instead of a file name for input and output code, as in: > > Output: > s3n://your-bucket-name/your-data-folder > > Input: (when consuming the above output) > > s3n://your-bucket-name/your-data-folder/* Unfortunately no luck: Exception in thread "main" o

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
sparkuser2345 wrote > I'm using Spark 1.0.0. The same works when - Using Spark 0.9.1. - Saving to and reading from local file system (Spark 1.0.0) - Saving to and reading from HDFS (Spark 1.0.0) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
Matei Zaharia wrote > If you use s3n:// for both, you should be able to pass the exact same file > to load as you did to save. I'm trying to write a file to s3n in a Spark app and to read it in another one using the same file name, but without luck. Writing data to s3n as val data = Array(1.0, 1
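A sketch of the round trip under discussion, with a hypothetical bucket name. The key point from the replies in this thread is that `saveAsTextFile` produces a folder of part-files rather than a single object, so the reader should target the folder (or `folder/*`), not a single file name:

```scala
val data = Array(1.0, 2.0, 3.0)

// Writing creates s3n://my-bucket/output/part-00000, part-00001, ... plus _SUCCESS.
sc.parallelize(data).saveAsTextFile("s3n://my-bucket/output")

// Reading the folder picks up all part-files in one RDD.
val loaded = sc.textFile("s3n://my-bucket/output").map(_.toDouble)
```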

Re: Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
Evan R. Sparks wrote > Try s3n:// Thanks, that works! In the REPL, I can successfully load the data using both s3:// and s3n://; why the difference? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Problem-reading-from-S3-in-standalone-application-tp11524p11537.

Re: Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
I'm getting the same "Input path does not exist" error also after setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using the format "s3:///test_data.txt" for the input file. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/P

Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
Hi, I'm running Spark in an EMR cluster and I'm able to read from S3 using REPL without problems: val input_file = "s3:///test_data.txt" val rawdata = sc.textFile(input_file) val test = rawdata.collect but when I try to run a simple standalone application reading the same data, I get an erro
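For the standalone-application case, one way to supply S3 credentials explicitly (when the environment variables are not picked up by the executors) is to set them on the SparkContext's Hadoop configuration. A sketch; the bucket name is elided in the original post, so a placeholder is used, and the property names shown are the s3n ones:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("S3Read"))

// Pass credentials through the Hadoop configuration used by the S3 filesystem.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val rawdata = sc.textFile("s3n://my-bucket/test_data.txt")
println(rawdata.count())
```

Alternatively, credentials can be embedded in the URL (`s3n://KEY:SECRET@bucket/...`), though that leaks secrets into logs and breaks when the secret contains a slash.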

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-07 Thread sparkuser2345
with the classes in Mllib now so > you'll have to roll your own using underlying sgd / bfgs primitives. > — > Sent from Mailbox > > On Sat, Jul 5, 2014 at 10:45 AM, Christopher Nguyen < > ctn@ > > > wrote: > >> Hi sparkuser2345, >> I'm infer

How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread sparkuser2345
Hi, I am trying to fit a logistic regression model with cross validation in Spark 0.9.0 using SVMWithSGD. I have created an array data_kfolded where each element is a pair of RDDs containing the training and test data: (training_data: (RDD[org.apache.spark.mllib.regression.LabeledPoint], test_d
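Since RDDs cannot be nested (an RDD of RDDs is not allowed, which is the constraint the reply in this thread points at), the folds have to be iterated on the driver, with each `train` call itself running in parallel across the cluster. A sketch against the MLlib 0.9 API, assuming `data_kfolded` has the structure described above:

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Train one model per fold and return each fold's test accuracy.
def crossValidate(data_kfolded: Array[(RDD[LabeledPoint], RDD[LabeledPoint])],
                  numIterations: Int = 100): Array[Double] =
  data_kfolded.map { case (training, test) =>
    val model = SVMWithSGD.train(training, numIterations)
    val correct = test.filter(p => model.predict(p.features) == p.label).count()
    correct.toDouble / test.count()
  }
```

Caching each fold's training RDD before calling `train` avoids recomputing the split on every SGD iteration.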