For question #2, see the following method of FileSystem:

    public abstract boolean delete(Path f, boolean recursive) throws IOException;
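On HDFS the whole cleanup is the single call `fs.delete(new Path(dir), true)` with `fs = FileSystem.get(conf)`; passing `recursive = true` removes the directory and everything under it. As a runnable sketch, the same "delete the directory once processing is done" step is shown below against the local filesystem with `java.nio.file` (the directory name is hypothetical; substitute the path read from Kafka):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanupAfterProcessing {

    // Recursively delete a directory: children first, then parents.
    // The HDFS equivalent is one call: fs.delete(hdfsPath, true).
    public static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder())   // deepest entries first
                .forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input directory standing in for the one read from Kafka.
        Path dir = Files.createTempDirectory("processed-input");
        Files.writeString(dir.resolve("part-00000"), "some processed line\n");

        deleteRecursively(dir);
        System.out.println(Files.exists(dir)); // false once cleanup succeeds
    }
}
```

Note that unlike `FileSystem.delete`, which handles recursion itself, local `Files.delete` only removes empty directories, hence the reverse-order walk.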
FYI

On Thu, Feb 4, 2016 at 10:58 AM, Akhilesh Pathodia <pathodia.akhil...@gmail.com> wrote:
> Hi,
>
> I am using Spark to read a large set of files from HDFS, applying some
> formatting to each line, and then saving each line as a record in Hive.
> Spark reads directory paths from Kafka. Each directory can contain a large
> number of files. I read one path from Kafka and then process all files in
> that directory in parallel. I have to delete the directory after all its
> files are processed. I have the following questions:
>
> 1. What is the optimized way to read a large set of files in Spark? I am
> not using sc.textFile(); instead I am reading the file content using
> FileSystem and creating a DStream of lines.
> 2. How do I delete the directory/files from HDFS after the task is
> completed?
>
> Thanks,
> Akhilesh