Hi Marcelo,

The collected data end up in, say, class C, and c.id is the id of each item. But an id might appear in more than one of those 1mil xml files, so I am doing a reduceByKey(). Even if I had multiple binaryFiles RDDs, wouldn't I have to ++ them in order to reduceByKey() correctly?
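Roughly what I mean, with placeholder names for the bits not shown here (C's fields, the parsing and the combine logic are just stand-ins):

import org.apache.spark.{SparkConf, SparkContext}

case class C(id: String /* , ... other fields */)
def parseXml(bytes: Array[Byte]): Seq[C] = ???   // placeholder XML parser
def merge(a: C, b: C): C = a                     // placeholder combine logic

val sc = new SparkContext(new SparkConf().setAppName("xml-reduce"))

// Two hypothetical inputs read separately...
val rdd1 = sc.binaryFiles("hdfs:///data/xml/part1")
val rdd2 = sc.binaryFiles("hdfs:///data/xml/part2")

// ...then unioned so duplicate ids across the inputs still collapse.
val reduced = (rdd1 ++ rdd2)
  .flatMap { case (_, stream) => parseXml(stream.toArray()) }
  .map(c => c.id -> c)
  .reduceByKey(merge)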

Also, the executor is now configured to use 64GB. It ran overnight and failed at 2am, when it was using somewhere between 30 and 50GB of RAM. I don't think my data come anywhere near that figure, but I haven't been able to profile memory yet (will do today) to see what was consuming so much of it.

Cheers


On 10/06/15 17:14, Marcelo Vanzin wrote:
So, I don't have an explicit solution to your problem, but...

On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios <[email protected]> wrote:

    I am profiling the driver. It currently has 564MB of strings which
    might be the 1mil file names. But also it has 2.34 GB of long[] !
    That's so far, it is still running. What are those long[] used for?


When Spark lists files it also needs all the extra metadata about where the files are in the HDFS cluster. That is a lot more than just the file's name - see the "LocatedFileStatus" class in the Hadoop docs for an idea.
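For illustration, this is roughly the kind of listing that metadata comes from (the path and recursive flag here are just made up):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val it = fs.listFiles(new Path("hdfs:///data/xml"), true) // recursive listing
while (it.hasNext) {
  val status = it.next() // a LocatedFileStatus, not just a path string
  // Each file also carries its block locations: offsets, lengths, host lists.
  status.getBlockLocations.foreach { loc =>
    println(s"${status.getPath} offset=${loc.getOffset} len=${loc.getLength} hosts=${loc.getHosts.mkString(",")}")
  }
}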

What you could try is to break that input down into smaller batches, if that's feasible for your app. E.g. organize the files by directory and use different directories in separate calls to "binaryFiles()", things like that.
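Something along these lines, reusing the placeholder parseXml/merge/sc names from the earlier sketch (the directory names are made up):

// Group the input under per-directory prefixes and read each one separately.
val dirs = Seq("hdfs:///data/xml/2015-01", "hdfs:///data/xml/2015-02")

val perBatch = dirs.map { dir =>
  sc.binaryFiles(dir)
    .flatMap { case (_, stream) => parseXml(stream.toArray()) }
    .map(c => c.id -> c)
    .reduceByKey(merge) // pre-reduce within each batch to shrink shuffle data
}

// Union the already-reduced batches and reduce once more so ids that span
// directories are still combined correctly.
val result = perBatch.reduce(_ ++ _).reduceByKey(merge)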

--
Marcelo
