Hi, I am using Spark to read a large set of files from HDFS, apply some formatting to each line, and then save each line as a record in Hive. Spark reads the directory paths from Kafka. Each directory can contain a large number of files. I read one path from Kafka at a time and then process all files in that directory in parallel. I have to delete the directory after all of its files are processed. I have the following questions:
1. What is the optimal way to read a large set of files in Spark? I am not using sc.textFile(); instead, I am reading the file contents through the HDFS FileSystem API and building a DStream of lines.
2. How do I delete the directory/files from HDFS after the task is completed?

Thanks,
Akhilesh
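P.S. For reference, here is a rough sketch of what my current read path looks like for each directory path consumed from Kafka. The names `dir` (the path string from Kafka) and `sc` (the SparkContext) are placeholders, and the Kafka wiring is omitted:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// For each directory path `dir` consumed from Kafka: list the files,
// read their contents on the driver via the FileSystem API, and
// parallelize the lines into an RDD (which I then feed into a DStream).
val fs = FileSystem.get(new Configuration())
val files = fs.listStatus(new Path(dir)).filter(_.isFile).map(_.getPath)
val lines = files.flatMap { p =>
  val in = fs.open(p)
  try Source.fromInputStream(in).getLines.toList
  finally in.close()
}
val lineRdd = sc.parallelize(lines)
```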
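And this is the cleanup step I have in mind for question 2, using FileSystem.delete with the recursive flag. Is this the right way to do it, and where should it run so it only fires after all files are processed?

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// After the directory has been fully processed, delete it recursively.
// The second argument `true` makes the delete recursive.
val fs = FileSystem.get(new Configuration())
val deleted = fs.delete(new Path(dir), true)
if (!deleted) println(s"Could not delete $dir") // false if the path no longer exists
```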