Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
sparkuser2345 wrote: I'm using Spark 1.0.0. The same works when:
- using Spark 0.9.1
- saving to and reading from the local file system (Spark 1.0.0)
- saving to and reading from HDFS (Spark 1.0.0)

Re: How to read a multipart s3 file?

2014-08-07 Thread paul
darkjh wrote: "But in my experience, when reading directly from s3n, Spark creates only one input partition per file, regardless of the file size. This may lead to performance problems if you have big files." This is actually not true; Spark uses the underlying Hadoop input formats to read the

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
Ashish Rangole wrote: Specify a folder instead of a file name for input and output, as in:
Output: s3n://your-bucket-name/your-data-folder
Input (when consuming the above output): s3n://your-bucket-name/your-data-folder/*
Unfortunately no luck: Exception in thread main

Re: How to read a multipart s3 file?

2014-08-07 Thread Sean Owen
That won't be it, since you can see from the directory listing that there are no data files under test, only _ files and dirs. The output looks like it was at least partially written but didn't finish: the part-* files were never moved to the target dir. I don't know why,
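A cheap way to sanity-check such output before reading it back is to look for the _SUCCESS marker and at least one part-* data file. This is a pure-Python sketch of that check over a directory listing; the listings are hypothetical stand-ins for what you would get from an S3 list call:

```python
import fnmatch

def looks_complete(listing):
    """Heuristic: a Hadoop-style output dir is complete when the
    _SUCCESS marker exists and at least one part-* data file is present."""
    has_marker = "_SUCCESS" in listing
    has_parts = any(fnmatch.fnmatch(name, "part-*") for name in listing)
    return has_marker and has_parts

# Simulate the failure mode from this thread: only _ files, no data files
# (the part-* files were never moved out of _temporary).
incomplete = ["_temporary"]
complete = ["_SUCCESS", "part-00000", "part-00001"]

print(looks_complete(incomplete))  # False
print(looks_complete(complete))    # True
```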

Re: How to read a multipart s3 file?

2014-05-16 Thread Nicholas Chammas
On Wed, May 7, 2014 at 4:44 PM, Aaron Davidson ilike...@gmail.com wrote: "Spark can only run as many tasks as there are partitions, so if you don't have enough partitions, your cluster will be underutilized." This is a very important point. kamatsuoka, how many partitions does your RDD have
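Aaron's point can be made concrete with a little arithmetic; this is a sketch, and the cluster numbers are made up:

```python
def utilization(num_partitions, total_cores):
    """Fraction of cores busy in the first wave of tasks:
    Spark runs at most one concurrent task per partition."""
    concurrent_tasks = min(num_partitions, total_cores)
    return concurrent_tasks / total_cores

# Hypothetical cluster: 64 cores, but the RDD came from a single s3n file,
# so it has a single partition.
print(utilization(num_partitions=1, total_cores=64))    # 0.015625 -> 63 cores idle
print(utilization(num_partitions=128, total_cores=64))  # 1.0 -> fully utilized
```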

Re: How to read a multipart s3 file?

2014-05-13 Thread kamatsuoka
Thanks Nicholas! I looked at those docs several times without noticing that critical part you highlighted. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-a-multipart-s3-file-tp5463p5494.html Sent from the Apache Spark User List mailing list

Re: How to read a multipart s3 file?

2014-05-12 Thread Nicholas Chammas
On Wed, May 7, 2014 at 4:00 AM, Han JU ju.han.fe...@gmail.com wrote: "But in my experience, when reading directly from s3n, Spark creates only one input partition per file, regardless of the file size. This may lead to performance problems if you have big files." You can (and perhaps should)

Re: How to read a multipart s3 file?

2014-05-12 Thread Aaron Davidson
One way to ensure Spark writes more partitions is by using RDD#repartition() to make each partition smaller. One Spark partition always corresponds to one file in the underlying store, and it's usually a good idea to have each partition's size somewhere between 64 MB and 256 MB. Too few
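As a back-of-the-envelope sketch of the sizing rule above (pure arithmetic; the 1 GiB dataset size and 128 MiB target are assumptions, the latter chosen from inside the 64-256 MB range):

```python
import math

def partitions_for(total_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Number of partitions needed so each output file lands near the
    target size (128 MiB by default)."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))

one_gib = 1024 ** 3
n = partitions_for(one_gib)
print(n)  # 8 partitions of ~128 MiB each
# In Spark this would then be used as: rdd.repartition(n) before saving.
```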

Re: How to read a multipart s3 file?

2014-05-11 Thread Nicholas Chammas
On Tue, May 6, 2014 at 10:07 PM, kamatsuoka ken...@gmail.com wrote: I was using s3n:// but I got frustrated by how slow it is at writing files. I'm curious: How slow is slow? How long does it take you, for example, to save a 1GB file to S3 using s3n vs s3?

Re: How to read a multipart s3 file?

2014-05-07 Thread Nicholas Chammas
Amazon also strongly discourages the use of s3:// because the block file system it maps to is deprecated. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html
Note: The configuration of Hadoop running on Amazon EMR differs from the default configuration

Re: How to read a multipart s3 file?

2014-05-07 Thread Han JU
Just to complement the other answers: if you output to, say, `s3://bucket/myfile`, then you can use this path as the input of other jobs (sc.textFile('s3://bucket/myfile')). By default all `part-xxx` files will be used. There's also `sc.wholeTextFiles` that you can play with. If your file is
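The "all part-xxx files are used by default" behaviour comes from Hadoop's input path filtering, which skips entries whose names begin with an underscore or a dot. A pure-Python sketch of that filter over a hypothetical listing:

```python
def hadoop_visible(names):
    """Mimic Hadoop's default hidden-path filter: entries starting with
    '_' or '.' (e.g. _SUCCESS, _temporary, .crc files) are skipped."""
    return [n for n in names if not n.startswith(("_", "."))]

# Hypothetical listing of s3://bucket/myfile after a save:
listing = ["_SUCCESS", "_temporary", ".part-00000.crc", "part-00000", "part-00001"]
print(hadoop_visible(listing))  # ['part-00000', 'part-00001']
```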

Re: How to read a multipart s3 file?

2014-05-06 Thread Andre Kuhnen
Try using s3n instead of s3. On 06/05/2014 21:19, kamatsuoka ken...@gmail.com wrote: I have a Spark app that writes out a file, s3://mybucket/mydir/myfile.txt. Behind the scenes, the S3 driver creates a bunch of files like s3://mybucket//mydir/myfile.txt/part-, as well as the block