Hi, I am using Spark to read a large set of files from HDFS, apply some formatting to each line, and then save each line as a record in Hive. Spark reads the directory paths from Kafka. Each directory can contain a large number of files. I read one path from Kafka at a time and then process all files in that directory in parallel. I have to delete the directory after all of its files are processed. I have the following questions:
1. What is the optimal way to read a large set of files in Spark? I am not using sc.textFile(); instead, I am reading the file contents through the HDFS FileSystem API and building a DStream of lines.
2. How do I delete the directory/files from HDFS after the task is completed?

Thanks,
Akhilesh
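P.S. For reference, here is a rough sketch of what my current read path looks like for each directory path consumed from Kafka. The names `dir` (the path string from Kafka) and `sc` (the SparkContext) are placeholders, and the Kafka wiring is omitted:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// For each directory path `dir` consumed from Kafka: list the files,
// read their contents on the driver via the FileSystem API, and
// parallelize the lines into an RDD (which I then feed into a DStream).
val fs = FileSystem.get(new Configuration())
val files = fs.listStatus(new Path(dir)).filter(_.isFile).map(_.getPath)
val lines = files.flatMap { p =>
  val in = fs.open(p)
  try Source.fromInputStream(in).getLines.toList
  finally in.close()
}
val lineRdd = sc.parallelize(lines)
```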
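And this is the cleanup step I have in mind for question 2, using FileSystem.delete with the recursive flag. Is this the right way to do it, and where should it run so it only fires after all files are processed?

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// After the directory has been fully processed, delete it recursively.
// The second argument `true` makes the delete recursive.
val fs = FileSystem.get(new Configuration())
val deleted = fs.delete(new Path(dir), true)
if (!deleted) println(s"Could not delete $dir") // false if the path no longer exists
```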