Re: Out of memory with giraph-release-1.0.0-RC3, used to work on old Giraph

2013-09-04 Thread Lukas Nalezenec


Thanks,
I was not sure if it really works as I described.

 Facebook can't be using it like this if, as described, they have 
billions of vertices and a trillion edges.


Yes, it's strange. I guess configuration tuning does not help that much on
a large cluster. What might help are the properties of the input data.


 So do you, or Avery, have any idea how you might initialize this in a
more reasonable way, and how???


A fast workaround is to reduce the number of partitions from W^2 to W or
2*W. It will help if you don't have a very large number of workers.

I would not change MAX_*_REQUEST_SIZE much since it may hurt performance.
You can also do some preprocessing before loading the data into Giraph.
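
For example, a minimal sketch of that workaround (the property name
giraph.userPartitionCount is what USER_PARTITION_COUNT in GiraphConstants
maps to as far as I know -- please verify it against your Giraph build):

    // Sketch only: pin the partition count to ~1-2 partitions per worker
    // instead of the default ~W^2. The "giraph.userPartitionCount" key is
    // assumed from GiraphConstants.USER_PARTITION_COUNT -- verify it.
    import org.apache.giraph.conf.GiraphConfiguration;

    public class PartitionCountWorkaround {
      public static void apply(GiraphConfiguration conf, int workers) {
        conf.setInt("giraph.userPartitionCount", 2 * workers);
        // Command-line equivalent with GiraphRunner:
        //   -ca giraph.userPartitionCount=<2*W>
      }
    }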



How Giraph could be changed:
The caches could be flushed when the total number of vertices/edges across
all caches exceeds some threshold. Ideally this should prevent not only
OutOfMemory errors but also raising the high-water mark. I am not sure
whether preventing the HWM from rising is easy to do.
I am going to use almost-prebuilt partitions. For my use case it would be
ideal to detect that a cache has been abandoned and will not be used
anymore. That would cut memory usage in the caches from ~O(n^3) to ~O(n).
It could be done by counting cache flushes or cache insertions and
flushing any cache that has not been touched for a long time.
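
A conceptual sketch of that kind of global budget (hypothetical code, not
anything that exists in Giraph):

    // Hypothetical bookkeeping class -- not part of Giraph. The idea: count
    // vertices sitting in all partition caches of a worker and ask callers
    // to flush once a global budget is exceeded.
    import java.util.concurrent.atomic.AtomicLong;

    public class GlobalCacheBudget {
      private final long maxCachedVertices;
      private final AtomicLong totalCached = new AtomicLong();

      public GlobalCacheBudget(long maxCachedVertices) {
        this.maxCachedVertices = maxCachedVertices;
      }

      // Called whenever a vertex is added to any partition cache.
      // Returns true when the caches should be flushed now.
      public boolean onVertexCached() {
        return totalCached.incrementAndGet() > maxCachedVertices;
      }

      // Called after n cached vertices have been flushed to the network.
      public void onFlushed(long n) {
        totalCached.addAndGet(-n);
      }
    }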


There could be a separate MAX_*_REQUEST_SIZE configuration for the
per-partition caches used while loading data.


I guess there should be a simple but efficient way to trace the memory
high-water mark. It could look like:


Loading data:            memory high-water mark: start: 100 GB, end: 300 GB
Iteration 1 Computation: memory high-water mark: start: 300 GB, end: 300 GB
Iteration 1 XYZ ...
Iteration 2 Computation: memory high-water mark: start: 300 GB, end: 300 GB
...
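
Something like the following could produce those numbers; it is only a
sketch using the standard java.lang.management beans (summing per-pool
peaks slightly overestimates the true simultaneous peak), not existing
Giraph code:

    // Sketch: report and reset the JVM heap high-water mark per phase.
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryType;

    public class HeapHighWaterMark {
      // Sum of per-pool peak usages for all heap pools.
      public static long peakHeapBytes() {
        long peak = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
          if (pool.getType() == MemoryType.HEAP) {
            peak += pool.getPeakUsage().getUsed();
          }
        }
        return peak;
      }

      // Print the phase's high-water mark and start tracking the next phase.
      public static void report(String phase) {
        System.out.println(phase + ": memory high-water mark: "
            + (peakHeapBytes() >> 30) + " GB");
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
          pool.resetPeakUsage();
        }
      }
    }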

Lukas





On 09/04/13 01:12, Jeff Peters wrote:
Thank you Lukas!!! That's EXACTLY the kind of model I was building in 
my head over the weekend about why this might be happening, and why 
increasing the number of AWS instances (and workers) does not solve 
the problem without increasing each worker's VM. Surely Facebook can't 
be using it like this if, as described, they have billions of vertices 
and a trillion edges. So do you, or Avery, have any idea how you might
initialize this in a more reasonable way, and how???



On Mon, Sep 2, 2013 at 6:08 AM, Lukas Nalezenec
lukas.naleze...@firma.seznam.cz wrote:


Hi

I wasted a few days on a similar problem.

I guess the problem is that during loading, if you have W workers
and W^2 partitions, there are W^2 partition caches in each worker.
Each cache can hold 10 000 vertices by default.
I had 26 000 000 vertices and 60 workers, i.e. 3600 partitions. It means
that there can be up to 36 000 000 vertices in the caches of each
worker if the input files are in random order.
Workers were assigned 450 000 vertices but failed when they had
900 000 vertices in memory.

Btw: why is the default number of partitions W^2?

(I may be wrong)
Lukas
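
To put the numbers above side by side, a quick back-of-the-envelope sketch
(plain Java, nothing Giraph-specific; the 10 000-vertex cache capacity and
the W^2 partition count are as quoted above):

    // Worst case: every per-partition input cache in a worker fills up.
    public class CacheBlowupEstimate {
      public static void main(String[] args) {
        long workers = 60;
        long partitions = workers * workers;      // default ~W^2 = 3600
        long cacheCapacity = 10_000;              // vertices per partition cache
        long totalVertices = 26_000_000;

        long assignedPerWorker = totalVertices / workers;    // ~433 000
        long worstCaseCached = partitions * cacheCapacity;   // 36 000 000

        System.out.println("vertices assigned per worker: " + assignedPerWorker);
        System.out.println("worst-case cached per worker: " + worstCaseCached);
      }
    }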




Re: Out of memory with giraph-release-1.0.0-RC3, used to work on old Giraph

2013-09-04 Thread Jeff Peters
Ok thanks Avery. But I still have two questions, the first a really dumb
newbie question: why is there ever any number of partitions other than
exactly one per worker thread (one per worker in our case)? And a deeper
question: even if I shrink the cache, I would suppose that if Facebook has
billions of vertices they must have thousands of workers. It would seem the
cache scheme simply blows up on huge graphs no matter what you do. What am
I missing here?


On Wed, Sep 4, 2013 at 11:18 AM, Avery Ching ach...@apache.org wrote:

 The amount of memory for the send message cache per worker =
 number of compute threads * number of workers * size of the cache.

 The number of partitions doesn't affect the memory usage very much.  My
 advice would be to dial down the cache size a bit with MAX_MSG_REQUEST_SIZE.

 Avery



Re: Out of memory with giraph-release-1.0.0-RC3, used to work on old Giraph

2013-09-04 Thread Avery Ching
We have caches for every compute thread.  Then we have W worker caches
per compute thread.  So the total amount of memory consumed by message
caches per worker =
compute threads * workers * size of cache.  The best thing is to tune
down the size of the cache via MAX_MSG_REQUEST_SIZE to a size that
works for your configuration.
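
As a concrete illustration of that formula (the 512 KB request size is an
assumption, inferred from the "100 workers ~ 50 MB per worker" figure
quoted elsewhere in this thread):

    // Illustration only: per-worker memory held by send-message caches,
    // memory = compute threads * workers * request (cache) size.
    public class MessageCacheMemory {
      public static void main(String[] args) {
        int computeThreads = 1;
        int workers = 112;                    // e.g. 8 x m2.4xlarge, 14 map slots each
        long requestSizeBytes = 512L * 1024;  // assumed MAX_MSG_REQUEST_SIZE default

        long bytesPerWorker = (long) computeThreads * workers * requestSizeBytes;
        System.out.printf("~%.0f MB of message cache per worker%n",
            bytesPerWorker / (1024.0 * 1024.0));
      }
    }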


Hope that helps,

Avery


Re: Out of memory with giraph-release-1.0.0-RC3, used to work on old Giraph

2013-08-30 Thread Avery Ching
Ah, the new caches. =)  These make things a lot faster (bulk data
sending), but they do take up some additional memory.  If you look at
GiraphConstants, you can find ways to change the cache sizes (this will
reduce that memory usage).
For example, MAX_EDGE_REQUEST_SIZE will affect the size of the edge
cache and MAX_MSG_REQUEST_SIZE will affect the size of the message cache.
The caches are kept per destination worker, so 100 workers would require
about 50 MB per worker by default.  Feel free to trim them if you like.
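
For instance, a sketch of dialing both constants down (the
giraph.msgRequestSize / giraph.edgeRequestSize keys are what I believe
those GiraphConstants map to in 1.0.0 -- double-check them in your build):

    // Sketch: smaller bulk-send requests => smaller per-worker cache
    // footprint, at some cost in throughput. Property keys are assumed.
    import org.apache.giraph.conf.GiraphConfiguration;

    public class RequestSizeTuning {
      public static void apply(GiraphConfiguration conf) {
        conf.setInt("giraph.msgRequestSize", 128 * 1024);   // down from ~512 KB
        conf.setInt("giraph.edgeRequestSize", 128 * 1024);  // down from ~512 KB
        // Command-line equivalent with GiraphRunner:
        //   -ca giraph.msgRequestSize=131072 -ca giraph.edgeRequestSize=131072
      }
    }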


The byte arrays for the edges are the most efficient storage possible
(although not as performant as the native edge stores).


Hope that helps,

Avery

On 8/29/13 4:53 PM, Jeff Peters wrote:
Avery, it would seem that optimizations to Giraph have, unfortunately, 
turned the majority of the heap into dark matter. The two snapshots 
are at unknown points in a superstep but I waited for several 
supersteps so that the activity had more or less stabilized. About the 
only things comparable between the two snapshots are the vertices, 
192561 X RecsVertex in the new version and 191995 X Coloring in 
the old system. But with the new Giraph 672710176 out of 824886184 
bytes are stored as primitive byte arrays. That's probably indicative 
of some very fine performance optimization work, but it makes it 
extremely difficult to know what's really out there, and why. I did 
notice that a number of caches have appeared that did not exist 
before, namely SendEdgeCache, SendPartitionCache, SendMessageCache 
and SendMutationsCache.


Could any of those account for a larger per-worker footprint in a 
modern Giraph? Should I simply assume that I need to force AWS to 
configure its EMR Hadoop so that each instance has fewer map tasks but 
with a somewhat larger VM max, say 3GB instead of 2GB?




Re: Out of memory with giraph-release-1.0.0-RC3, used to work on old Giraph

2013-08-28 Thread Avery Ching
Try dumping a histogram of memory usage from a running JVM and see where 
the memory is going.  I can't think of anything in particular that 
changed...
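
One way is jmap -histo <pid> against the worker's child JVM on the task
node. If attaching jmap on EMR is awkward, a sketch of an alternative is to
write a heap dump from inside the worker using the HotSpot-only diagnostic
MXBean and inspect it offline with jhat or Eclipse MAT (the class name and
dump path below are illustrative):

    // Sketch: programmatically write an hprof heap dump for offline
    // analysis. Uses the Sun/Oracle-specific HotSpotDiagnosticMXBean,
    // so it only works on HotSpot JVMs.
    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.io.IOException;
    import java.lang.management.ManagementFactory;

    public class HeapDumper {
      public static void dump(String path) throws IOException {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
            ManagementFactory.getPlatformMBeanServer(),
            "com.sun.management:type=HotSpotDiagnostic",
            HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, true /* only live objects */);
      }
    }

Calling something like HeapDumper.dump("/tmp/worker.hprof") from one
worker (e.g. from your WorkerContext) would give a dump to compare against
the old Giraph.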


On 8/28/13 4:39 PM, Jeff Peters wrote:


I am tasked with updating our ancient (circa 7/10/2012) Giraph to 
giraph-release-1.0.0-RC3. Most jobs run fine but our largest job now 
runs out of memory using the same AWS elastic-mapreduce configuration 
we have always used. I have never tried to configure either Giraph or 
the AWS Hadoop. We build for Hadoop 1.0.2 because that's closest to 
the 1.0.3 AWS provides us. The 8 X m2.4xlarge cluster we use seems to 
provide 8*14=112 map tasks fitted out with 2GB heap each. Our code is 
completely unchanged except as required to adapt to the new Giraph 
APIs. Our vertex, edge, and message data are completely unchanged. On 
smaller jobs that do work, the aggregate heap usage high-water mark 
seems about the same as before, but the committed heap seems to run 
higher. I can't even make it work on a cluster of 12. In that case I 
get one map task that seems to end up with nearly twice as many 
messages as most of the others, so it runs out of memory anyway. It 
only takes one such failure to fail the job. Am I missing something 
here? Should I be configuring my new Giraph in some way I didn't need 
to with the old one?






Re: Out of memory with giraph-release-1.0.0-RC3, used to work on old Giraph

2013-08-28 Thread Jeff Peters
Ok thanks Avery, I'll try it. I'm not sure I know how I would do that on a
running AWS EMR instance, but I can do it on a local stand-alone Hadoop
running a smaller version of the job and see if anything jumps out.


On Wed, Aug 28, 2013 at 4:57 PM, Avery Ching ach...@apache.org wrote:

 Try dumping a histogram of memory usage from a running JVM and see where
 the memory is going.  I can't think of anything in particular that
 changed...

