So, I don't have an explicit solution to your problem, but...

On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios <
kostas.koug...@googlemail.com> wrote:

> I am profiling the driver. It currently has 564MB of strings, which might be
> the 1 million file names. But it also has 2.34 GB of long[]! That's the usage
> so far; it is still running. What are those long[] used for?
>

When Spark lists files it also fetches the extra metadata about where each
file's blocks live in the HDFS cluster. That is a lot more than just the file's
name - see the "LocatedFileStatus" class in the Hadoop docs for an idea of what
is kept per file.

What you could try is to break that input down into smaller batches, if that's
feasible for your app, e.g. organize the files by directory and make a separate
call to "binaryFiles()" for each directory - see the sketch below.

-- 
Marcelo
