I stumbled upon this thread and suspect that this may affect restoring a
checkpointed RDD as well:
http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-gt-10-hour-between-stage-latency-td22925.html#a22928
In my case I have 1600+ fragmented
You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1].
1. https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400
On Nov 26, 2014 12:24 PM, Aaron Davidson ilike...@gmail.com wrote:
Spark has a known problem where it will do a pass of metadata on a large
number
Note that s3a does not appear to solve the original problems in this
thread, which are either on the Spark side or stem from the fact that
metadata listing in S3 is slow simply because it goes over the network.
On Sun, Nov 30, 2014 at 10:07 AM, David Blewett da...@dawninglight.net
wrote:
You might be
Thanks - this is very helpful!
On Thu, Nov 27, 2014 at 5:20 AM, Michael Armbrust mich...@databricks.com
wrote:
In the past I have worked around this problem by avoiding sc.textFile().
Instead I read the data directly inside of a Spark job. Basically, you
start with an RDD where each entry is
Hello,
I'm building a Spark app that needs to read a large number of log files from
S3. I'm doing so in the code by constructing the file list and passing it
to the context as follows (sc.textFile takes a single comma-separated string of paths):
val myRDD = sc.textFile("s3n://mybucket/file1,s3n://mybucket/file2,...,s3n://mybucket/fileN")
When running
You can try creating a Hadoop Configuration object and setting the S3
configuration on it, i.e. the access keys etc.
Then, for reading files from S3, use newAPIHadoopFile and pass the config
object along with the key and value classes, as in the sketch below.
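A minimal sketch of that approach, assuming the s3n scheme; the bucket, paths, and credential values are placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Placeholder credentials; substitute your real access/secret keys.
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// newAPIHadoopFile returns an RDD of (key, value) pairs; with TextInputFormat
// the key is the byte offset and the value is the line of text.
val lines = sc.newAPIHadoopFile(
    "s3n://mybucket/file1,s3n://mybucket/file2",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    hadoopConf)
  .map { case (_, line) => line.toString }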
-
Lalit Yadav
la...@sigmoidanalytics.com
Thanks Lalit; setting the access + secret keys in the configuration works
even when calling sc.textFile. Is there a way to select which Hadoop S3
native filesystem implementation will be used at runtime via the Hadoop
configuration?
Thanks,
Tomer
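For reference, a minimal sketch of the setup that worked here, assuming the s3n scheme and placeholder credentials:

// Placeholder keys; set before the first S3 access.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
val myRDD = sc.textFile("s3n://mybucket/file1,s3n://mybucket/file2")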
On Wed, Nov 26, 2014 at 11:08 AM, lalit1303
Spark has a known problem where it will do a pass of metadata on a large
number of small files serially, in order to find the partition information
prior to starting the job. This will probably not be repaired by switching
the FS impl.
However, you can change the FS being used like so (prior to
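The snippet is cut off above in the archive; a minimal sketch of switching the filesystem implementation through the Hadoop configuration, assuming the s3n scheme and the stock Hadoop class, might look like this:

// Must be set before the filesystem is first used. Hadoop maps a URI scheme
// to a FileSystem class via fs.<scheme>.impl; point it at a different class
// to swap implementations.
sc.hadoopConfiguration.set("fs.s3n.impl",
  "org.apache.hadoop.fs.s3native.NativeS3FileSystem")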
In the past I have worked around this problem by avoiding sc.textFile().
Instead I read the data directly inside of a Spark job. Basically, you
start with an RDD where each entry is a file in S3 and then flatMap that
with something that reads the files and returns the lines.
Here's an example:
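The example itself is missing from the message as archived; a minimal sketch of the approach, assuming the AWS Java SDK's AmazonS3Client is used to read each object (the bucket name and keys are placeholders):

import com.amazonaws.services.s3.AmazonS3Client
import scala.io.Source

// Placeholder object keys; in practice build this list however you already do.
val keys = Seq("logs/file1", "logs/file2")

val lines = sc.parallelize(keys, keys.size).flatMap { key =>
  // One client per record; credentials come from the default provider chain.
  val s3 = new AmazonS3Client()
  val in = s3.getObject("mybucket", key).getObjectContent
  try {
    Source.fromInputStream(in).getLines().toList
  } finally {
    in.close()
  }
}

Reading each object eagerly into a list keeps the stream's lifetime inside the task; for many small files a mapPartitions variant that reuses one client per partition would cut down on connection overhead.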