sparkuser2345 wrote
I'm using Spark 1.0.0.
The same code works when:
- Using Spark 0.9.1.
- Saving to and reading from local file system (Spark 1.0.0)
- Saving to and reading from HDFS (Spark 1.0.0)
darkjh wrote
But in my experience, when reading directly from
s3n, Spark creates only 1 input partition per file, regardless of the file
size. This may lead to performance problems if you have big files.
This is actually not true: Spark uses the underlying Hadoop input formats to
read the data, and a splittable file can be divided into multiple input
partitions.
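The splitting behavior being debated here can be sketched in plain Python (not Spark code). This is a hedged illustration of how Hadoop's FileInputFormat typically computes input splits for a splittable file; the 64 MB block size is an illustrative default, and the function names are mine, not Hadoop's:

```python
import math

def num_splits(file_size, block_size=64 * 1024 * 1024,
               min_split=1, max_split=float("inf")):
    """Mimic FileInputFormat's splitSize = max(minSize, min(maxSize, blockSize)),
    then count how many splits the file yields."""
    split_size = max(min_split, min(max_split, block_size))
    return max(1, math.ceil(file_size / split_size))

# A 1 GB splittable file against a 64 MB block size yields 16 splits,
# hence 16 Spark partitions; a non-splittable file (e.g. gzip) would
# always be a single partition regardless of size.
print(num_splits(1024 * 1024 * 1024))  # 16
```

Under these assumptions, one partition per file would only be expected for non-splittable inputs, which is consistent with the correction above.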
Ashish Rangole wrote
Specify a folder instead of a file name for the input and output paths, as in:
Output:
s3n://your-bucket-name/your-data-folder
Input: (when consuming the above output)
s3n://your-bucket-name/your-data-folder/*
Unfortunately no luck:
Exception in thread "main"
That won't be it, since you can see from the directory listing that
there are no data files under test -- only _ files and dirs. The
output looks like it was at least partially written, but it didn't
finish: the part-* files were never moved to the target dir. I don't
know why.
On Wed, May 7, 2014 at 4:44 PM, Aaron Davidson ilike...@gmail.com wrote:
Spark can only run as many tasks as there are partitions, so if you don't
have enough partitions, your cluster will be underutilized.
This is a very important point.
kamatsuoka, how many partitions does your RDD have?
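The underutilization point above is simple arithmetic, sketched here in plain Python (the cluster sizes are hypothetical, not from the thread):

```python
def running_tasks(num_partitions, total_cores):
    # Spark runs at most one task per partition at a time,
    # and at most one task per available core slot.
    return min(num_partitions, total_cores)

# Hypothetical cluster: 10 nodes x 8 cores = 80 task slots.
print(running_tasks(4, 80))    # 4  -> only 4 of 80 cores busy
print(running_tasks(200, 80))  # 80 -> all cores busy; tasks run in waves
```

With too few partitions, most of the cluster simply sits idle, which is why the partition count matters so much here.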
Thanks Nicholas! I looked at those docs several times without noticing that
critical part you highlighted.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-a-multipart-s3-file-tp5463p5494.html
Sent from the Apache Spark User List mailing list
On Wed, May 7, 2014 at 4:00 AM, Han JU ju.han.fe...@gmail.com wrote:
But in my experience, when reading directly from s3n, Spark creates only 1
input partition per file, regardless of the file size. This may lead to
performance problems if you have big files.
You can (and perhaps should)
One way to ensure Spark writes more partitions is by using
RDD#repartition() to make each partition smaller. One Spark partition
always corresponds to one file in the underlying store, and it's usually a
good idea to have each partition's size somewhere between 64 MB and 256
MB. Too few
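The 64-256 MB sizing advice above translates into a small calculation. This is a sketch in plain Python, with 128 MB picked as an illustrative midpoint of that range (the function name is mine):

```python
import math

def target_partitions(total_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Pick a partition count so each output file lands near the target size."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))

# A 10 GB dataset -> 80 partitions of ~128 MB each.
print(target_partitions(10 * 1024 ** 3))  # 80
```

In Spark one would then call something like `rdd.repartition(80)` before saving, so the job writes 80 part files of roughly 128 MB instead of a few huge ones.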
On Tue, May 6, 2014 at 10:07 PM, kamatsuoka ken...@gmail.com wrote:
I was using s3n:// but I got frustrated by how
slow it is at writing files.
I'm curious: How slow is slow? How long does it take you, for example, to
save a 1GB file to S3 using s3n vs s3?
Amazon also strongly discourages the use of s3:// because the block file
system it maps to is deprecated.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html
Note: "The configuration of Hadoop running on Amazon EMR differs from the default
configuration ..."
Just to complement the other answers:
If you output to, say, `s3://bucket/myfile`, then you can use this path
as the input of other jobs (sc.textFile('s3://bucket/myfile')). By default
all `part-xxx` files will be used. There's also `sc.wholeTextFiles` that
you can play with.
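The "all part-xxx files will be used" convention can be sketched in plain Python (this only simulates the file selection, not Spark itself; the listing is hypothetical):

```python
from fnmatch import fnmatch

# A directory written by a Spark save: data files plus bookkeeping files.
listing = ["part-00000", "part-00001", "part-00002", "_SUCCESS"]

# Reading the directory back picks up the part files; underscore-prefixed
# bookkeeping files like _SUCCESS are skipped.
data_files = [f for f in listing if fnmatch(f, "part-*")]
print(data_files)  # ['part-00000', 'part-00001', 'part-00002']
```

This is also why the `/*` glob suggested earlier in the thread works: it matches the part files under the output folder.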
If your file is
Try using s3n instead of s3
On 06/05/2014 at 21:19, kamatsuoka ken...@gmail.com wrote:
I have a Spark app that writes out a file, s3://mybucket/mydir/myfile.txt.
Behind the scenes, the S3 driver creates a bunch of files like
s3://mybucket/mydir/myfile.txt/part-, as well as the block