Hi,

I am using Spark to read a large set of files from HDFS, apply some
formatting to each line, and then save each line as a record in Hive.
Spark reads directory paths from Kafka; each directory can contain a large
number of files. I read one path from Kafka and then process all files in
that directory in parallel. I have to delete the directory after all
files are processed.
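
Here is a minimal sketch of what the pipeline roughly looks like, assuming the
receiver-based Kafka API. The class name, ZooKeeper quorum, topic, group id,
batch interval, Hive table name, and formatLine are all placeholders, and for
brevity the read step uses sc.textFile() even though my actual code reads via
FileSystem (see question 1 below):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectoryIngest {
      // Placeholder for the per-line formatting mentioned above.
      def formatLine(line: String): String = line.trim

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("DirectoryIngest")
        val ssc  = new StreamingContext(conf, Seconds(60)) // placeholder batch interval
        val sc   = ssc.sparkContext
        val hive = new HiveContext(sc)
        val schema = StructType(Seq(StructField("line", StringType)))

        // Each Kafka message is assumed to carry one HDFS directory path.
        val dirPaths = KafkaUtils
          .createStream(ssc, "zk-host:2181", "ingest-group", Map("dir-paths" -> 1))
          .map(_._2)

        dirPaths.foreachRDD { rdd =>
          rdd.collect().foreach { dir =>
            // Read every file under the directory, format each line,
            // and append the result to a Hive table (table name is a placeholder).
            val rows = sc.textFile(dir).map(formatLine).map(Row(_))
            hive.createDataFrame(rows, schema)
              .write.mode("append").saveAsTable("ingest.records")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }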
I have the following questions:

1. What is the optimal way to read a large set of files in Spark? I am not
using sc.textFile(); instead I am reading the file content using the Hadoop
FileSystem API and creating a DStream of lines (a rough sketch of this
follows after the list).
2. How do I delete the directory/files from HDFS after the task is completed?
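
For reference, this is roughly how I am doing the FileSystem-based reading
today, plus the kind of recursive delete I had in mind for question 2 (object
and method names are just placeholders, and whether this should run on the
driver or inside the executors is exactly what I am unsure about):

    import scala.io.Source
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsDirUtil {
      // Read all lines of every file directly under the given HDFS directory.
      def readDirLines(dir: String): Seq[String] = {
        val fs = FileSystem.get(new Configuration())
        fs.listStatus(new Path(dir))
          .filter(_.isFile)
          .flatMap { status =>
            val in = fs.open(status.getPath)
            // Materialize the lines before closing the stream.
            try Source.fromInputStream(in).getLines().toList
            finally in.close()
          }
          .toSeq
      }

      // Recursively delete the directory once all of its files are processed.
      def deleteDir(dir: String): Boolean = {
        val fs = FileSystem.get(new Configuration())
        fs.delete(new Path(dir), true) // recursive = true
      }
    }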

Thanks,
Akhilesh
