Hi Marcelo,

The collected data end up in, say, class C, and c.id is the id of each item. But an id might appear in more than one of those 1mil xml files, so I am doing a reduceByKey(). Even if I had multiple binaryFiles RDDs, wouldn't I have to ++ them in order to reduceByKey() correctly?
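Roughly what I mean, with placeholder names for the bits not shown here (C's fields, the parsing and the combine logic are just stand-ins):

import org.apache.spark.{SparkConf, SparkContext}

case class C(id: String /* , ... other fields */)
def parseXml(bytes: Array[Byte]): Seq[C] = ???   // placeholder XML parser
def merge(a: C, b: C): C = a                     // placeholder combine logic

val sc = new SparkContext(new SparkConf().setAppName("xml-reduce"))

// Two hypothetical inputs read separately...
val rdd1 = sc.binaryFiles("hdfs:///data/xml/part1")
val rdd2 = sc.binaryFiles("hdfs:///data/xml/part2")

// ...then unioned so duplicate ids across the inputs still collapse.
val reduced = (rdd1 ++ rdd2)
  .flatMap { case (_, stream) => parseXml(stream.toArray()) }
  .map(c => c.id -> c)
  .reduceByKey(merge)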

Also, the executor is now configured to use 64GB. It ran overnight and failed at 2am, when it was using somewhere between 30 and 50GB of RAM. I don't think my data come anywhere near that figure, but I haven't been able to profile memory yet (will do today) to see what was consuming so much of it.

Cheers


On 10/06/15 17:14, Marcelo Vanzin wrote:
So, I don't have an explicit solution to your problem, but...

On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios <[email protected]> wrote:

    I am profiling the driver. It currently has 564MB of strings which
    might be the 1mil file names. But also it has 2.34 GB of long[] !
    That's so far, it is still running. What are those long[] used for?


When Spark lists files it also needs all the extra metadata about where the files are in the HDFS cluster. That is a lot more than just the file's name - see the "LocatedFileStatus" class in the Hadoop docs for an idea.
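For illustration, this is roughly the kind of listing that metadata comes from (the path and recursive flag here are just made up):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val it = fs.listFiles(new Path("hdfs:///data/xml"), true) // recursive listing
while (it.hasNext) {
  val status = it.next() // a LocatedFileStatus, not just a path string
  // Each file also carries its block locations: offsets, lengths, host lists.
  status.getBlockLocations.foreach { loc =>
    println(s"${status.getPath} offset=${loc.getOffset} len=${loc.getLength} hosts=${loc.getHosts.mkString(",")}")
  }
}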

What you could try is to break that input down into smaller batches, if that's feasible for your app. E.g. organize the files by directory and use different directories in separate calls to "binaryFiles()", things like that.
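Something along these lines, reusing the placeholder parseXml/merge/sc names from the earlier sketch (the directory names are made up):

// Group the input under per-directory prefixes and read each one separately.
val dirs = Seq("hdfs:///data/xml/2015-01", "hdfs:///data/xml/2015-02")

val perBatch = dirs.map { dir =>
  sc.binaryFiles(dir)
    .flatMap { case (_, stream) => parseXml(stream.toArray()) }
    .map(c => c.id -> c)
    .reduceByKey(merge) // pre-reduce within each batch to shrink shuffle data
}

// Union the already-reduced batches and reduce once more so ids that span
// directories are still combined correctly.
val result = perBatch.reduce(_ ++ _).reduceByKey(merge)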

--
Marcelo
