Here is how you can list all HDFS directories for a given path.

val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfsConn = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://<Your NN Hostname>:8020"), hadoopConf)
val c = hdfsConn.listStatus(new org.apache.hadoop.fs.Path("/user/csingh/"))
c.foreach(x => println(x.getPath))
Output:
hdfs://<NN hostname>/user/csingh/.Trash
hdfs://<NN hostname>/user/csingh/.sparkStaging
hdfs://<NN hostname>/user/csingh/.staging
hdfs://<NN hostname>/user/csingh/test1
hdfs://<NN hostname>/user/csingh/test2
hdfs://<NN hostname>/user/csingh/tmp

> On Feb 20, 2016, at 2:37 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:
>
> Hi,
> @Umesh: your understanding is partially correct as per my requirement.
> The idea I am trying to implement is as follows
> (not sure how feasible it is; I am a newbie to Spark and Scala):
>
> 1. List all the sub-directories under the parent directory
>    hdfs:///TestDirectory/
>    as a list, for example: val listsubdirs = (subdir1, subdir2, ..., subdirN)
> 2. Iterate through this list:
>    for (subdir <- listsubdirs) {
>      val df = "df" + subdir
>      df = read it using the spark-csv package with a custom schema
>    }
>    This gives as many dataframes as there are sub-directories.
>
> I am stuck at the first step itself:
> how do I list the directories and put them in a list?
>
> Hope you understand my issue now.
> Thanks,
> Divya
>
> On Feb 19, 2016 6:54 PM, "UMESH CHAUDHARY" <umesh9...@gmail.com> wrote:
> If I understood correctly, you can have many sub-dirs under
> hdfs:///TestDirectory and you need to attach a schema to all part files
> in each sub-dir.
>
> 1) If you know the sub-directory names:
> list all sub-dirs inside hdfs:///TestDirectory using Scala, iterate over
> the list, and for each sub-dir read the part files and attach the schema
> that belongs to that sub-directory.
>
> 2) If you don't know the sub-directory names:
> store the schema somewhere inside each sub-directory and read it in the
> iteration.
>
> On Fri, Feb 19, 2016 at 3:44 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:
> Hi,
> I have a use case where I have one parent directory.
>
> The file structure looks like:
> hdfs:///TestDirectory/spark1/part files (created by some Spark job)
> hdfs:///TestDirectory/spark2/part files (created by some Spark job)
>
> spark1 and spark2 have different schemas,
> e.g. the spark1 part files schema is
> carname model year
>
> and the spark2 part files schema is
> carowner city carcost
>
> As the spark1 and spark2 directories get created dynamically,
> there can also be a spark3 directory with yet another schema.
>
> My requirement is to read the parent directory, list the sub-directories,
> and create a dataframe for each sub-directory.
>
> I am not able to figure out how to list the sub-directories under the
> parent directory and dynamically create the dataframes.
>
> Thanks,
> Divya
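For the original question (one dataframe per sub-directory), here is an untested sketch that combines the listing above with the spark-csv package. It assumes it runs in spark-shell (so sc and sqlContext are already in scope) with spark-csv on the classpath (e.g. via --packages); the parent path and the schemaFor helper are placeholders you would replace with your own paths and schema lookup.

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Parent directory -- placeholder, replace with your own path / namenode.
val parent = new Path("hdfs:///TestDirectory")
val fs = parent.getFileSystem(sc.hadoopConfiguration)

// Keep only sub-directories; listStatus also returns plain files.
val subDirs = fs.listStatus(parent).filter(_.isDirectory).map(_.getPath)

// Hypothetical helper: pick the custom schema for a given sub-directory.
// In practice you might read this from a config file or from a schema
// file stored inside the sub-directory itself.
def schemaFor(dirName: String): StructType = dirName match {
  case "spark1" => StructType(Seq(
    StructField("carname", StringType),
    StructField("model", StringType),
    StructField("year", StringType)))
  case _ => StructType(Seq(
    StructField("carowner", StringType),
    StructField("city", StringType),
    StructField("carcost", StringType)))
}

// One DataFrame per sub-directory, keyed by the sub-directory name.
val dataFrames: Map[String, DataFrame] = subDirs.map { dir =>
  val df = sqlContext.read
    .format("com.databricks.spark.csv")   // spark-csv package (Spark 1.x)
    .schema(schemaFor(dir.getName))
    .load(dir.toString)
  dir.getName -> df
}.toMap

dataFrames.keys.foreach(println)

Keeping the dataframes in a Map keyed by sub-directory name avoids having to invent variable names like "df" + subdir at runtime, which is not possible in Scala anyway.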