Hi! My name is James, and I'm working on a question that doesn't seem to have many answers online. I was hoping the Spark/Hadoop gurus here could shed some light on it.
I have a data feed on NFS whose layout looks like /foo/*/*/*/bar/*.gz. Currently I have a Spark Scala job that calls sparkContext.textFile("/foo/*/*/*/bar/*.gz"). The upstream owners of the data feed have told me they may add or remove nested directory levels for some of the files relevant to me. In other words, files relevant to my Spark job might sit on paths that look like:

* /foo/a/b/c/d/bar/*.gz
* /foo/a/b/bar/*.gz

They will do this for only some files, and without warning. Does anyone have ideas on how I can get Spark to create an RDD from any text files that fit the pattern /foo/**/bar/*.gz, where ** represents a variable number of wildcard directories?

I'm working with on the order of 10^5 to 10^6 files, which has discouraged me from using anything besides the Hadoop fs API to walk the filesystem and feed that list of paths to my Spark job, but I'm open to suggestions here as well (a rough sketch of what I had in mind is in the P.S. below).

Thanks!

James Ding
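P.S. For concreteness, here is roughly the directory walk I've been hesitant to do, in case someone can confirm whether it's reasonable at this scale. It's only a sketch: barGlobs is just a name I made up, and it assumes the marker directory is always literally named "bar" and that the default Hadoop filesystem resolves my NFS paths the same way textFile already does.

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable

// Depth-first walk of the directory tree under the root, descending until we
// hit a directory literally named "bar". We never list the contents of the
// bar directories themselves, so the 10^5-10^6 .gz files are left to
// Spark/Hadoop's normal glob expansion at job-setup time.
def barGlobs(fs: FileSystem, root: Path): Seq[String] = {
  val globs = mutable.ArrayBuffer[String]()
  val stack = mutable.Stack(root)
  while (stack.nonEmpty) {
    for (st <- fs.listStatus(stack.pop()) if st.isDirectory) {
      if (st.getPath.getName == "bar") globs += s"${st.getPath}/*.gz"
      else stack.push(st.getPath)
    }
  }
  globs.toSeq
}

val root = new Path("/foo")
val fs = root.getFileSystem(sc.hadoopConfiguration)
// textFile accepts a comma-separated list of paths/globs.
val rdd = sc.textFile(barGlobs(fs, root).mkString(","))

Walking only the directory levels above bar should keep the driver-side listing proportional to the number of directories rather than the number of files, but I don't know how well that many listStatus calls behave against NFS, hence the question.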