So, I don't have an explicit solution to your problem, but...

On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios <
kostas.koug...@googlemail.com> wrote:

> I am profiling the driver. It currently has 564MB of strings, which might be
> the 1 million file names. But it also has 2.34 GB of long[]! That's the usage
> so far; it is still running. What are those long[] used for?
>

When Spark lists files it also fetches the extra metadata about where each
file's blocks live in the HDFS cluster. That is a lot more than just the file's
name - see the "LocatedFileStatus" class in the Hadoop docs for an idea of what
is kept per file.

What you could try is to break that input down into smaller batches, if that's
feasible for your app, e.g. organize the files by directory and make a separate
call to "binaryFiles()" for each directory - see the sketch below.

-- 
Marcelo
