Hi! My name is James, and I'm working on a question that doesn't seem to have many answers online. I was hoping the Spark/Hadoop gurus here could shed some light on it.
I have a data feed on NFS whose layout looks like /foo/*/*/*/bar/*.gz. Currently I have a Spark Scala job that calls sparkContext.textFile("/foo/*/*/*/bar/*.gz"). The upstream owners of the data feed have told me they may add or remove nested directory levels for some of the files relevant to me. In other words, files relevant to my Spark job might sit on paths that look like:

* /foo/a/b/c/d/bar/*.gz
* /foo/a/b/bar/*.gz

They will do this for only some files, and without warning. Does anyone have ideas on how I can get Spark to create an RDD from any text files that fit the pattern /foo/**/bar/*.gz, where ** represents a variable number of wildcard directories?

I'm working with on the order of 10^5 to 10^6 files, which has discouraged me from using anything besides the Hadoop fs API to walk the filesystem and feed that list of paths to my Spark job, but I'm open to suggestions here as well (a rough sketch of what I had in mind is in the P.S. below).

Thanks!

James Ding
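P.S. For concreteness, here is roughly the directory walk I've been hesitant to do, in case someone can confirm whether it's reasonable at this scale. It's only a sketch: barGlobs is just a name I made up, and it assumes the marker directory is always literally named "bar" and that the default Hadoop filesystem resolves my NFS paths the same way textFile already does.

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable

// Depth-first walk of the directory tree under the root, descending until we
// hit a directory literally named "bar". We never list the contents of the
// bar directories themselves, so the 10^5-10^6 .gz files are left to
// Spark/Hadoop's normal glob expansion at job-setup time.
def barGlobs(fs: FileSystem, root: Path): Seq[String] = {
  val globs = mutable.ArrayBuffer[String]()
  val stack = mutable.Stack(root)
  while (stack.nonEmpty) {
    for (st <- fs.listStatus(stack.pop()) if st.isDirectory) {
      if (st.getPath.getName == "bar") globs += s"${st.getPath}/*.gz"
      else stack.push(st.getPath)
    }
  }
  globs.toSeq
}

val root = new Path("/foo")
val fs = root.getFileSystem(sc.hadoopConfiguration)
// textFile accepts a comma-separated list of paths/globs.
val rdd = sc.textFile(barGlobs(fs, root).mkString(","))

Walking only the directory levels above bar should keep the driver-side listing proportional to the number of directories rather than the number of files, but I don't know how well that many listStatus calls behave against NFS, hence the question.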