Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2015-05-21 Thread Peng Cheng
I stumbled upon this thread and conjecture that this may affect restoring a checkpointed RDD as well: http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-gt-10-hour-between-stage-latency-td22925.html#a22928 In my case I have 1600+ fragmented

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-30 Thread David Blewett
You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1].

[1] https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400
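That JIRA tracks the filesystem itself; wiring it up from Spark would look roughly like the sketch below. This assumes Hadoop 2.6.0+ with the hadoop-aws jar (and the AWS SDK) on the classpath; the fs.s3a.* property names are the standard ones, and the bucket/path are placeholders:

    // Sketch: reading through s3a instead of s3n. Requires Hadoop 2.6.0+
    // and hadoop-aws (plus the AWS SDK) on the classpath.
    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
    val logs = sc.textFile("s3a://mybucket/logs/") // placeholder bucket/path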

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-30 Thread Aaron Davidson
Note that s3a does not appear to solve the original problems in this thread, which are on the Spark side or stem from the fact that metadata listing in S3 is slow simply because it goes over the network.

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-29 Thread Tomer Benyamini
Thanks - this is very helpful!

S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Tomer Benyamini
Hello, I'm building a Spark app that needs to read a large number of log files from S3. I do so in the code by constructing the file list and passing it to the context as follows:

    val myRDD = sc.textFile("s3n://mybucket/file1,s3n://mybucket/file2,...,s3n://mybucket/fileN")

When running

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread lalit1303
You can try creating a Hadoop Configuration and setting the S3 properties on it (access keys, etc.). Then, for reading files from S3, use newAPIHadoopFile and pass that config object along with the key and value classes. - Lalit Yadav la...@sigmoidanalytics.com
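A minimal sketch of that approach, assuming the s3n scheme with placeholder credentials and paths:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Build a Hadoop Configuration carrying the S3 credentials...
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    conf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // ...and pass it to newAPIHadoopFile along with the input format
    // and key/value classes; keep only the line text.
    val lines = sc
      .newAPIHadoopFile("s3n://mybucket/logs/", classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)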

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Tomer Benyamini
Thanks Lalit; setting the access + secret keys in the configuration works even when calling sc.textFile. Is there a way to select which Hadoop S3 filesystem implementation is used at runtime via the Hadoop configuration? Thanks, Tomer

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Aaron Davidson
Spark has a known problem where it will do a metadata pass over a large number of small files serially, in order to find the partition information prior to starting the job. This will probably not be fixed by switching the FS implementation. However, you can change the FS being used like so (prior to
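The archived message is cut off at the code; the mechanism it points at is Hadoop's fs.<scheme>.impl property, set before the filesystem is first used. A sketch, with the class name as an illustration only:

    // Map a URI scheme to a FileSystem implementation before first use.
    // The class named here is illustrative; substitute the implementation
    // you want Spark's Hadoop layer to load for s3n:// paths.
    sc.hadoopConfiguration.set("fs.s3n.impl",
      "org.apache.hadoop.fs.s3native.NativeS3FileSystem")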

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Michael Armbrust
In the past I have worked around this problem by avoiding sc.textFile(). Instead I read the data directly inside a Spark job. Basically, you start with an RDD where each entry is a file in S3, and then flatMap that with something that reads the files and returns the lines. Here's an example:
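The example itself is truncated in the archive; below is a sketch of the pattern described, assuming the AWS Java SDK (AmazonS3Client) is on the classpath, with placeholder bucket and key names. mapPartitions is used here so each partition shares one S3 client instead of creating one per file:

    import scala.io.Source
    import com.amazonaws.services.s3.AmazonS3Client

    // Start with an RDD where each entry names a file in S3...
    val keys = sc.parallelize(Seq("logs/file1", "logs/file2")) // placeholders

    // ...then read each file inside the job and emit its lines.
    val lines = keys.mapPartitions { iter =>
      val s3 = new AmazonS3Client() // credentials from the default provider chain
      iter.flatMap { key =>
        val obj = s3.getObject("mybucket", key)
        Source.fromInputStream(obj.getObjectContent).getLines()
      }
    }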