Recursive nested wildcard directory walking in Spark

2015-12-09 Thread James Ding
Hi!

My name is James, and I’m working on a question that doesn’t seem to have
many answers online. I was hoping Spark/Hadoop gurus could shed some light
on this.

I have a data feed on NFS with paths that look like /foo/<nested dirs>/bar/<file>.gz.
Currently I have a Spark Scala job that calls
sparkContext.textFile("/foo/*/*/*/bar/*.gz")
Upstream owners of the data feed have told me they may add or remove nested
directories in the paths of files relevant to me. In other words, files
relevant to my Spark job might sit on paths that look like:
* /foo/a/b/c/d/bar/*.gz
* /foo/a/b/bar/*.gz
They will do this for only some files and without warning. Does anyone have
ideas on how I can configure Spark to create an RDD from any text files that
fit the pattern /foo/**/bar/*.gz, where ** represents a variable number of
wildcard directories?
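
One possible workaround (a sketch of mine, not something confirmed on this
list): sc.textFile accepts a comma-separated list of paths, and each path may
itself contain wildcards, so the plausible nesting depths can simply be
enumerated. The maxDepth of 5 below is an assumption about how deep upstream
will ever nest:

    import org.apache.spark.{SparkConf, SparkContext}

    object MultiDepthGlob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("multi-depth-glob"))

        // Build one glob per plausible nesting depth:
        //   /foo/*/bar/*.gz, /foo/*/*/bar/*.gz, ...
        val maxDepth = 5 // assumption: upstream never nests deeper than this
        val globs = (1 to maxDepth).map { d =>
          "/foo/" + Seq.fill(d)("*").mkString("/") + "/bar/*.gz"
        }

        // textFile takes a comma-separated list of paths, globs included.
        val rdd = sc.textFile(globs.mkString(","))
        println(s"record count: ${rdd.count()}")
        sc.stop()
      }
    }

One caveat: a glob that matches zero files usually makes FileInputFormat fail
with an "Input Pattern ... matches 0 files" error, so depths that might not
exist yet would need to be filtered out first, for example with
FileSystem.globStatus.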
I'm working with on the order of 10^5 to 10^6 files, which has discouraged me
from using anything besides the Hadoop fs API to walk the filesystem and feed
that input to my Spark job, but I'm open to suggestions here as well.
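
If enumerating depths is not acceptable, a driver-side recursive listing with
the Hadoop FileSystem API is one way to get true /foo/**/bar/*.gz semantics.
A minimal sketch, assuming the default filesystem and that a file is relevant
exactly when it ends in .gz and its parent directory is named "bar" (the
helper name collectBarGzFiles is mine):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}
    import scala.collection.mutable.ArrayBuffer

    object RecursiveBarWalk {
      // Recursively collect paths matching .../bar/*.gz under root.
      def collectBarGzFiles(fs: FileSystem, root: Path): Seq[String] = {
        val matched = ArrayBuffer[String]()
        val files = fs.listFiles(root, /* recursive = */ true)
        while (files.hasNext) {
          val path = files.next().getPath
          if (path.getName.endsWith(".gz") && path.getParent.getName == "bar") {
            matched += path.toString
          }
        }
        matched
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("recursive-bar-walk"))
        val fs = FileSystem.get(sc.hadoopConfiguration)
        val inputs = collectBarGzFiles(fs, new Path("/foo"))
        // Again relying on textFile accepting a comma-separated path list.
        val rdd = sc.textFile(inputs.mkString(","))
        println(s"matched ${inputs.size} files, ${rdd.count()} records")
        sc.stop()
      }
    }

With 10^5 to 10^6 files the listing is one sequential pass over NFS, and each
gzip file ends up as its own (non-splittable) partition, so a coalesce or
repartition afterwards may be worth considering.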
Thanks!
James Ding






Re: Recursive nested wildcard directory walking in Spark

2015-12-09 Thread James Ding
I’ve set “mapreduce.input.fileinputformat.input.dir.recursive” to “true” in
the SparkConf I use to instantiate the SparkContext, and I confirm this at
runtime in my Scala job by printing out the property, but
sparkContext.textFile(“/foo/*/bar/*.gz”) still fails (as do /foo/**/bar/*.gz
and /foo/*/*/bar/*.gz).
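
Two things I would double-check here (assumptions on my part, not confirmed
on this list): properties set on SparkConf only reach the Hadoop Configuration
when they are prefixed with spark.hadoop., and the recursive flag tells
FileInputFormat to descend into subdirectories of the input directories it is
given; it does not change how a wildcard pattern expands. A sketch of both
ways of setting it, passing a plain directory rather than a glob:

    import org.apache.spark.{SparkConf, SparkContext}

    object RecursiveInputDir {
      def main(args: Array[String]): Unit = {
        // Option A: the spark.hadoop. prefix makes Spark copy the setting
        // into the Hadoop Configuration used by textFile's input format.
        val conf = new SparkConf()
          .setAppName("recursive-input-dir")
          .set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true")
        val sc = new SparkContext(conf)

        // Option B: set it directly on the Hadoop configuration.
        sc.hadoopConfiguration
          .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

        // The flag only affects directory inputs, so pass the top-level
        // directory, not a glob. Caveat: this reads every file under /foo,
        // not only the ones under .../bar/.
        val rdd = sc.textFile("/foo")
        println(rdd.count())
        sc.stop()
      }
    }

Whether sc.textFile (which goes through the older mapred TextInputFormat)
honors this flag also depends on the Hadoop version, so the driver-side
FileSystem walk may be the more predictable route.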

Any thoughts or workarounds? I’m considering using bash globbing to match
files recursively and feeding hundreds of thousands of arguments to
spark-submit. Reasons for/against?
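
For what it's worth, the main argument against hundreds of thousands of
command-line arguments is the operating system's argument-length limit, which
a list that size will likely exceed. An alternative sketch (my suggestion, not
something from this thread): write the matched paths to a manifest file, pass
only that one path to spark-submit, and expand it in the driver:

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.io.Source

    object ManifestInput {
      def main(args: Array[String]): Unit = {
        // args(0): a text file with one input path per line, produced e.g. by
        //   find /foo -type f -path '*/bar/*.gz' > manifest.txt
        val manifestPath = args(0)
        val sc = new SparkContext(new SparkConf().setAppName("manifest-input"))

        val paths = Source.fromFile(manifestPath).getLines().filter(_.nonEmpty).toSeq
        // textFile accepts the whole list as one comma-separated string.
        val rdd = sc.textFile(paths.mkString(","))
        println(s"${paths.size} input files, ${rdd.count()} records")
        sc.stop()
      }
    }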

From:  Ted Yu <yuzhih...@gmail.com>
Date:  Wednesday, December 9, 2015 at 3:50 PM
To:  James Ding <jd...@palantir.com>
Cc:  "user@spark.apache.org" <user@spark.apache.org>
Subject:  Re: Recursive nested wildcard directory walking in Spark

Have you seen this thread?

http://search-hadoop.com/m/q3RTt2uhMX1UhnCc1&subj=Re+Does+sc+newAPIHadoopFile+support+multiple+directories+or+nested+directories+

FYI




