Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread S. Zhou
I did some experiments and it seems not, but I would like to get confirmation (or perhaps I missed something). If it does, could you let me know how to specify multiple folders? Thanks, Senqiang

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread S. Zhou
Thanks Ted. Actually, a follow-up question: I need to read multiple HDFS files into an RDD. What I am doing now is reading each file into its own RDD and then unioning all of those RDDs into one. I am not sure if this is the best way to do it. Thanks, Senqiang
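A minimal sketch of the approach described above, with hypothetical HDFS paths: each file is read into its own RDD and the results are combined with sc.union.

    // One RDD per file, as described in the message.
    val paths = Seq("hdfs:///data/a.txt", "hdfs:///data/b.txt", "hdfs:///data/c.txt")
    val perFile = paths.map(p => sc.textFile(p))

    // Combine them into a single RDD.
    val all = sc.union(perFile)

As the later replies point out, the same result can usually be obtained in a single sc.textFile call by passing a comma-separated list of paths or a glob.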

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Ted Yu
Looking at FileInputFormat#listStatus():

    // Whether we need to recursive look into the directory structure
    boolean recursive = job.getBoolean(INPUT_DIR_RECURSIVE, false);

where:

    public static final String INPUT_DIR_RECURSIVE =
        "mapreduce.input.fileinputformat.input.dir.recursive";
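A minimal sketch of flipping that switch from Spark, assuming an existing SparkContext named sc; the property key is the quoted Hadoop constant.

    // Ask FileInputFormat (and its subclasses) to descend into subdirectories.
    sc.hadoopConfiguration.setBoolean(
      "mapreduce.input.fileinputformat.input.dir.recursive", true)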

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Ted Yu
Thanks for the confirmation, Stephen.

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Stephen Boesch
Thanks, I was looking at an old version of FileInputFormat. BEFORE setting the recursive config (mapreduce.input.fileinputformat.input.dir.recursive):

    scala> sc.textFile("dev/*").count
    java.io.IOException: *Not a file*: file:/shared/sparkup/dev/audit-release/blank_maven_build

The default is …

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Stephen Boesch
sc.textFile() invokes the Hadoop FileInputFormat via the (subclass) TextInputFormat. Inside, the logic does exist to do recursive directory reading, i.e. first detecting whether an entry is a directory and, if so, descending into it: for (FileStatus …
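To illustrate the idea (this is a sketch, not Hadoop's actual listStatus source), a recursive listing over the Hadoop FileSystem API looks roughly like this; the path is hypothetical:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

    // Walk a directory tree, keeping plain files and descending into directories.
    def listRecursively(fs: FileSystem, path: Path): Seq[FileStatus] =
      fs.listStatus(path).toSeq.flatMap { status =>
        if (status.isDirectory) listRecursively(fs, status.getPath)
        else Seq(status)
      }

    val fs = FileSystem.get(new Configuration())
    val files = listRecursively(fs, new Path("hdfs:///data/nested"))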

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Ted Yu
Looking at the scaladoc:

    /** Get an RDD for a Hadoop file with an arbitrary new API InputFormat. */
    def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]]

Your conclusion is confirmed.
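A minimal example of calling it, assuming sc is the SparkContext and the input path is hypothetical; LongWritable, Text and TextInputFormat are the standard Hadoop new-API classes:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Keys are byte offsets into the file, values are the lines themselves.
    val raw = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")
    val lines = raw.map(_._2.toString)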

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Sean Owen
This API reads a directory of files, not one file. A file here really means a directory full of part-* files. You do not need to read those separately. Any syntax that works with Hadoop's FileInputFormat should work. I thought you could specify a comma-separated list of paths? Maybe I am …
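For example (a sketch with hypothetical paths), both a comma-separated list and a glob are accepted by the Hadoop path handling that sc.textFile and newAPIHadoopFile sit on:

    // Comma-separated list of directories, read in one call.
    val fromList = sc.textFile("hdfs:///logs/2015-03-01,hdfs:///logs/2015-03-02")

    // A glob over sibling directories works as well.
    val fromGlob = sc.textFile("hdfs:///logs/2015-03-*")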

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread S. Zhou
Thanks, guys. So does this recursive flag also work for newAPIHadoopFile?
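What the question amounts to, sketched under the same assumptions as above (hypothetical path, standard Hadoop classes): set the recursive property on the SparkContext's Hadoop configuration and then call newAPIHadoopFile.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // The same property that FileInputFormat#listStatus consults, per the earlier messages.
    sc.hadoopConfiguration.setBoolean(
      "mapreduce.input.fileinputformat.input.dir.recursive", true)

    val nested = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/nested")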