For question #2, see the following method of FileSystem:

    public abstract boolean delete(Path f, boolean recursive) throws IOException;
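On HDFS the whole cleanup is the single call `fs.delete(new Path(dir), true)` with `fs = FileSystem.get(conf)`; passing `recursive = true` removes the directory and everything under it. As a runnable sketch, the same "delete the directory once processing is done" step is shown below against the local filesystem with `java.nio.file` (the directory name is hypothetical; substitute the path read from Kafka):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanupAfterProcessing {

    // Recursively delete a directory: children first, then parents.
    // The HDFS equivalent is one call: fs.delete(hdfsPath, true).
    public static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder())   // deepest entries first
                .forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input directory standing in for the one read from Kafka.
        Path dir = Files.createTempDirectory("processed-input");
        Files.writeString(dir.resolve("part-00000"), "some processed line\n");

        deleteRecursively(dir);
        System.out.println(Files.exists(dir)); // false once cleanup succeeds
    }
}
```

Note that unlike `FileSystem.delete`, which handles recursion itself, local `Files.delete` only removes empty directories, hence the reverse-order walk.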
FYI

On Thu, Feb 4, 2016 at 10:58 AM, Akhilesh Pathodia <pathodia.akhil...@gmail.com> wrote:
> Hi,
>
> I am using Spark to read a large set of files from HDFS, applying some
> formatting to each line, and then saving each line as a record in Hive.
> Spark reads directory paths from Kafka. Each directory can contain a large
> number of files. I read one path from Kafka and then process all files in
> that directory in parallel. I have to delete the directory after all its
> files are processed. I have the following questions:
>
> 1. What is the optimized way to read a large set of files in Spark? I am
> not using sc.textFile(); instead I am reading the file content using
> FileSystem and creating a DStream of lines.
> 2. How do I delete the directory/files from HDFS after the task is
> completed?
>
> Thanks,
> Akhilesh