Is there a way to drop parquet file partitions through Spark? I'm
partitioning a parquet file by a date field and I would like to drop old
partitions in a file system agnostic manner. I guess I could read the whole
parquet file into a DataFrame, filter out the dates to be dropped, and
overwrite
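One alternative to rewriting the whole dataset might be to delete the old partition directories directly. Below is a minimal sketch of the selection step only. It assumes, hypothetically, that each partition is a directory named date=YYYY-MM-DD; the object and method names are illustrative and are not part of Spark.

```scala
import java.time.LocalDate
import scala.util.Try

object PartitionPruning {
  // Assumed (hypothetical) layout: one directory per day, named "date=YYYY-MM-DD".
  private val Prefix = "date="

  /** Parse the date from a partition directory name, if it matches the assumed layout. */
  def partitionDate(dirName: String): Option[LocalDate] =
    if (dirName.startsWith(Prefix)) Try(LocalDate.parse(dirName.stripPrefix(Prefix))).toOption
    else None

  /** Return the directory names whose date is strictly before the cutoff. */
  def expired(dirNames: Seq[String], cutoff: LocalDate): Seq[String] =
    dirNames.filter(name => partitionDate(name).exists(_.isBefore(cutoff)))
}
```

The directory names could come from Hadoop's FileSystem.listStatus on the dataset root, and each expired directory could then be removed with FileSystem.delete(path, true). Both calls go through the Hadoop FileSystem abstraction, so the same code should work on HDFS, the local file system, or S3.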
Is there a way to prevent an RDD from shuffling in a join operation without
repartitioning it?
I'm reading an RDD from sharded MongoDB, joining that with an RDD of
incoming data (+ some additional calculations), and writing the resulting
RDD back to MongoDB. It would make sense to shuffle only
).
What are the limiting factors to the size of the elements of an RDD?
I have an array 'dataAll' of key-value pairs where each value is an array of
arrays. I would like to parallelize a task over the elements of 'dataAll' to
the workers. In the dummy example below, the number of elements in 'dataAll'
is 3 but in real application it would be tens to hundreds.
I'm running Spark 1.0.0 on EMR. I'm able to access the master web UI but not
the worker web UIs or the application detail UI (Server not found).
I added the following inbound rule to the ElasticMapreduce-slave security
group but it didn't help:
Type = All TCP
Port range = 0 - 65535
Source = My
sparkuser2345 wrote
I'm using Spark 1.0.0.
The same works when
- Using Spark 0.9.1.
- Saving to and reading from local file system (Spark 1.0.0)
- Saving to and reading from HDFS (Spark 1.0.0)
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read
Ashish Rangole wrote
Specify a folder instead of a file name for input and output code, as in:
Output:
s3n://your-bucket-name/your-data-folder
Input: (when consuming the above output)
s3n://your-bucket-name/your-data-folder/*
Unfortunately no luck:
Exception in thread "main"
Hi,
I'm running Spark in an EMR cluster and I'm able to read from S3 using the REPL
without problems:
val input_file = "s3://bucket-name/test_data.txt"
val rawdata = sc.textFile(input_file)
val test = rawdata.collect
but when I try to run a simple standalone application reading the same data,
I
I'm getting the same "Input path does not exist" error even after setting the
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using
the format s3://bucket-name/test_data.txt for the input file.
Evan R. Sparks wrote
Try s3n://
Thanks, that works! In the REPL, I can successfully load the data using both
s3:// and s3n://. Why the difference?
own using underlying SGD / BFGS primitives.
On Sat, Jul 5, 2014 at 10:45 AM, Christopher Nguyen <ctn@> wrote:
Hi sparkuser2345,
I'm inferring the problem statement is something like "how do I make this
complete faster (given my compute resources)?"
Several
Hi,
I am trying to fit a logistic regression model with cross validation in
Spark 0.9.0 using SVMWithSGD. I have created an array data_kfolded where
each element is a pair of RDDs containing the training and test data:
(training_data: (RDD[org.apache.spark.mllib.regression.LabeledPoint],
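The shape described above, an array of (training, test) pairs, can be illustrated on a plain in-memory collection. This is only a sketch under stated assumptions: it uses Seq rather than RDDs, assigns elements to folds round-robin by index, and the names KFold and split are hypothetical. It is not the poster's code.

```scala
object KFold {
  /** Split `data` into k (training, test) pairs.
    * Fold i takes every k-th element, starting at index i, as its test set.
    */
  def split[A](data: Seq[A], k: Int): Seq[(Seq[A], Seq[A])] = {
    require(k >= 2 && k <= data.length, "k must be between 2 and data.length")
    val indexed = data.zipWithIndex
    (0 until k).map { fold =>
      // Elements whose index falls in this fold become test data; the rest are training data.
      val (test, train) = indexed.partition { case (_, i) => i % k == fold }
      (train.map(_._1), test.map(_._1))
    }
  }
}
```

Each element of the result plays the role of one entry of data_kfolded: the first component is the training data and the second is the held-out test data for that fold.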