Hi Marchelo,
The data are collected into instances of, say, class C, and c.id is the id of
each record. The same id might appear in more than one of those 1mil xml
files, which is why I do a reduceByKey(). But even if I had multiple
binaryFiles() RDDs, wouldn't I have to ++ (union) them for reduceByKey()
to see all occurrences of an id?
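Roughly what the job does, as a sketch - the paths are made up, and parse()
and merge() are placeholders for the real XML parsing and record merging:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder for the real record type; id is the key I reduce on.
case class C(id: String, payload: String)

object Dedup {
  // Placeholder parser: the real one builds C instances from the XML bytes.
  def parse(bytes: Array[Byte]): Seq[(String, C)] = Seq.empty

  // Placeholder merge of two records that share the same id.
  def merge(a: C, b: C): C = a

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dedup"))

    // One binaryFiles() RDD per top-level directory (hypothetical paths).
    val rdds = Seq("hdfs:///xml/a", "hdfs:///xml/b").map(dir => sc.binaryFiles(dir))

    // ++ (union) them so reduceByKey() sees every occurrence of an id.
    val all = rdds.reduce(_ ++ _)

    val byId = all
      .flatMap { case (_, stream) => parse(stream.toArray()) }
      .reduceByKey(merge)

    println(byId.count())
    sc.stop()
  }
}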
Also, the executor is now configured to use 64GB. The job ran overnight and
failed at 2am, when it was using somewhere between 30 and 50GB of RAM. I
don't think my data come anywhere close to that figure, but I haven't been
able to profile the memory yet (will do today) to see what was consuming so
much of it.
Cheers
On 10/06/15 17:14, Marcelo Vanzin wrote:
So, I don't have an explicit solution to your problem, but...
On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios
<[email protected]> wrote:
I am profiling the driver. It currently has 564MB of strings, which might
be the 1mil file names. But it also has 2.34GB of long[]! That's so far;
it is still running. What are those long[] used for?
When Spark lists files it also needs all the extra metadata about
where the files are in the HDFS cluster. That is a lot more than just
the file's name - see the "LocatedFileStatus" class in the Hadoop docs
for an idea.
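For instance, something along these lines (path made up, untested) prints
the block information that comes back with each LocatedFileStatus - each
entry carries offsets, lengths and hosts on top of the file name, which may
well be where a chunk of that driver memory is going:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListingSketch {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // Hypothetical directory; each LocatedFileStatus carries the file's
    // block locations (offset, length, hosts), not just its name.
    val it = fs.listLocatedStatus(new Path("hdfs:///xml/a"))
    while (it.hasNext) {
      val status = it.next()
      status.getBlockLocations.foreach { block =>
        println(s"${status.getPath} offset=${block.getOffset} " +
          s"length=${block.getLength} hosts=${block.getHosts.mkString(",")}")
      }
    }
  }
}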
What you could try is to somehow break that input down into smaller
batches, if that's feasible for your app - e.g. organize the files by
directory and use separate directories in different calls to
"binaryFiles()", things like that.
--
Marcelo