I've set "mapreduce.input.fileinputformat.input.dir.recursive" to "true" in
the SparkConf I use to instantiate the SparkContext, and I've confirmed at
runtime that the property is set by printing it from my Scala job, but
sparkContext.textFile("/foo/*/bar/*.gz") still fails (as do /foo/**/bar/*.gz
and /foo/*/*/bar/*.gz).
Any thoughts or workarounds? I'm considering using bash globbing to match
the files recursively and feeding hundreds of thousands of arguments to
spark-submit. Reasons for/against?
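One middle ground between a single glob and hundreds of thousands of spark-submit
arguments: textFile accepts a comma-separated list of paths/globs, so if the nesting
depth is bounded you can generate one glob per depth and join them. A minimal sketch
in plain Scala (the prefix "/foo", suffix "bar/*.gz", and the depth bound are
assumptions for illustration):

```scala
object GlobPatterns {
  // Build glob patterns /foo/*/bar/*.gz, /foo/*/*/bar/*.gz, ... with up to
  // maxDepth wildcard directory levels between the prefix and the suffix.
  def expand(prefix: String, suffix: String, maxDepth: Int): Seq[String] =
    (1 to maxDepth).map(depth => prefix + "/*" * depth + "/" + suffix)

  // textFile takes a comma-separated list of paths/globs, so join them.
  def commaJoined(prefix: String, suffix: String, maxDepth: Int): String =
    expand(prefix, suffix, maxDepth).mkString(",")
}
```

Usage would look like sc.textFile(GlobPatterns.commaJoined("/foo", "bar/*.gz", 6)),
with the depth bound chosen to cover whatever nesting the upstream owners might add.
Globs that match no files at a given depth may still need handling, depending on the
Spark/Hadoop version.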
From: Ted Yu
Date: Wednesday, December 9, 2015 at 3:50 PM
To: James Ding
Cc: "user@spark.apache.org"
Subject: Re: Recursive nested wildcard directory walking in Spark
Have you seen this thread ?
http://search-hadoop.com/m/q3RTt2uhMX1UhnCc1&subj=Re+Does+sc+newAPIHadoopFile+support+multiple+directories+or+nested+directories+
FYI
On Wed, Dec 9, 2015 at 11:18 AM, James Ding wrote:
> Hi!
>
> My name is James, and I'm working on a question that doesn't seem to have many
> answers online. I was hoping spark/hadoop gurus could shed some light on this.
>
> I have a data feed on NFS that looks like /foobar/.gz
> Currently I have a spark scala job that calls
> sparkContext.textFile("/foo/*/*/*/bar/*.gz")
> Upstream owners for the data feed have told me they may add additional nested
> directories or remove them from files relevant to me. In other words, files
> relevant to my spark job might sit on paths that look like:
> * /foo/a/b/c/d/bar/*.gz
> * /foo/a/b/bar/*.gz
> They will do this with only some files and without warning. Does anyone have
> ideas on how I can configure Spark to create an RDD from any text files that fit
> the pattern /foo/**/bar/*.gz, where ** represents a variable number of wildcard
> directories?
> I'm working with on the order of 10^5 to 10^6 files, which has discouraged me
> from using anything besides the Hadoop FS API to walk the filesystem and feed
> that input to my Spark job, but I'm open to suggestions here as well.
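If you do end up walking the filesystem yourself (e.g. with Hadoop's
FileSystem.listFiles(path, recursive = true), which returns a RemoteIterator of
file statuses), the "any depth under /foo, ending in bar/*.gz" rule reduces to a
path-matching predicate. A sketch of just that predicate in plain Scala, assuming
the /foo prefix and bar/*.gz suffix from the thread (the helper name is made up):

```scala
object NestedGzFilter {
  // Accept /foo/<zero or more directory levels>/bar/<file>.gz.
  // String.matches anchors the regex to the whole path.
  def matches(path: String): Boolean =
    path.matches("""/foo(/[^/]+)*/bar/[^/]+\.gz""")
}
```

The matching paths can then be joined with commas (or collected into a Seq and
passed to sc.textFile one comma-separated batch at a time) rather than handed to
spark-submit as individual arguments.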
> Thanks!
> James Ding