Ah, the new caches. =) These make things a lot faster (bulk data
sending), but do take up some additional memory. if you look at
GiraphConstants, you can find ways to change the cache sizes (this will
reduce that memory usage).
For example, MAX_EDGE_REQUEST_SIZE will affect the size of the edge
cache. MAX_MSG_REQUEST_SIZE will affect the size of the message cache.
The caches are per worker, so 100 workers would require 50 MB per
worker by default. Feel free to trim it if you like.
The byte arrays for the edges are the most efficient storage possible
(although not as performance as the native edge stores).
Hope that helps,
Avery
On 8/29/13 4:53 PM, Jeff Peters wrote:
Avery, it would seem that optimizations to Giraph have, unfortunately,
turned the majority of the heap into "dark matter". The two snapshots
are at unknown points in a superstep but I waited for several
supersteps so that the activity had more or less stabilized. About the
only thing comparable between the two snapshots are the vertexes,
192561 X "RecsVertex" in the new version and 191995 X "Coloring" in
the old system. But with the new Giraph 672710176 out of 824886184
bytes are stored as primitive byte arrays. That's probably indicative
of some very fine performance optimization work, but it makes it
extremely difficult to know what's really out there, and why. I did
notice that a number of caches have appeared that did not exist
before, namely SendEdgeCache, SendPartitionCache, SendMessageCache
and SendMutationsCache.
Could any of those account for a larger per-worker footprint in a
modern Giraph? Should I simply assume that I need to force AWS to
configure its EMR Hadoop so that each instance has fewer map tasks but
with a somewhat larger VM max, say 3GB instead of 2GB?
On Wed, Aug 28, 2013 at 4:57 PM, Avery Ching <ach...@apache.org
<mailto:ach...@apache.org>> wrote:
Try dumping a histogram of memory usage from a running JVM and see
where the memory is going. I can't think of anything in
particular that changed...
On 8/28/13 4:39 PM, Jeff Peters wrote:
I am tasked with updating our ancient (circa 7/10/2012) Giraph
to giraph-release-1.0.0-RC3. Most jobs run fine but our
largest job now runs out of memory using the same AWS
elastic-mapreduce configuration we have always used. I have
never tried to configure either Giraph or the AWS Hadoop. We
build for Hadoop 1.0.2 because that's closest to the 1.0.3 AWS
provides us. The 8 X m2.4xlarge cluster we use seems to
provide 8*14=112 map tasks fitted out with 2GB heap each. Our
code is completely unchanged except as required to adapt to
the new Giraph APIs. Our vertex, edge, and message data are
completely unchanged. On smaller jobs, that work, the
aggregate heap usage high-water mark seems about the same as
before, but the "committed heap" seems to run higher. I can't
even make it work on a cluster of 12. In that case I get one
map task that seems to end up with nearly twice as many
messages as most of the others so it runs out of memory
anyway. It only takes one to fail the job. Am I missing
something here? Should I be configuring my new Giraph in some
way I didn't used to need to with the old one?