Yep. Actually more than a supernode sending messages it looks more like a supernode receiving a lot of messages, and hence filling the inbox queue (in bytearrayformat). I assumed a max (2^31)/4 (for float messages) limit because I assumed we had such an object per vertex, but it looks like the ByteArrayOutputFormat is used for more than a single vertex. Maja, do you think it could make sense to use one array per vertex (I haven't looked deep at the code, but I assume that we have one per partition now)?
On Thu, Feb 21, 2013 at 12:49 AM, Eli Reisman <[email protected]>wrote: > Sorry didn't see the bottom of this thread before! I like Maja's idea, some > mega-supernode (thats a technical term) just sent that vertex so many > messages that the byte buffer size overflowed on the right shift. I saw > this problem in some other forms last summer. The number of other ways that > number could go negative seem limited. > > > On Wed, Feb 20, 2013 at 3:41 PM, Eli Reisman <[email protected] > >wrote: > > > I do know when you run out of memory it often looks like the logs are > > fine, or seem to end in a wierd spot, but really the last X number of log > > messages just doesn't get logged. Sometimes its worth looking at the > worker > > logs individually and one in every group of them will actually get a > > sensible exception message logged. Don't know if thats the case though. I > > would tend to agree with your intuition regarding the queues and > > concurrency. > > > > On Tue, Feb 19, 2013 at 3:04 PM, Claudio Martella < > > [email protected]> wrote: > > > >> Looking at the second line of the logs, it does not look like the case, > >> but > >> it could be. do you have any suggestion for the flags in particular? I'm > >> running with -XX:+HeapDumpOnOutOfMemoryError now, and I don't get any > >> dump. > >> > >> It could be the case, but the behavior is quite consistent currently, > I'd > >> get a Child error/OOM pretty soon in these cases. > >> > >> > >> On Tue, Feb 19, 2013 at 11:45 PM, Avery Ching <[email protected]> > wrote: > >> > >> > Can you add GC flags? Perhaps you are running into full GCs... > >> > > >> > > >> > On 2/19/13 1:45 PM, Claudio Martella wrote: > >> > > >> >> I have not investigated further yet, but this does not look connected > >> to > >> >> ZK, but mostly to some concurrency on the queues. The number of > >> requests > >> >> stays to 29... > >> >> > >> >> > >> >> On Tue, Feb 19, 2013 at 9:58 PM, Sebastian Schelter <[email protected]> > >> >> wrote: > >> >> > >> >> For me, Giraph on Hadoop 1.0.4 runs into some kind of deadlock after > >> the > >> >>> input was loaded when using it with 200 workers on a billion-scale > >> >>> graph.... > >> >>> > >> >>> I tried several versions of zookeeper, but nothing helped. > >> >>> > >> >>> Best, > >> >>> Sebastian > >> >>> > >> >>> > >> >>> On 19.02.2013 21:55, Claudio Martella wrote: > >> >>> > >> >>>> I'm running PR with trunk on yahoo webgraph across 528 workers and > I > >> get > >> >>>> the workers hanging at the flushing: > >> >>>> > >> >>>> 2013-02-19 21:40:07,187 INFO > >> org.apache.giraph.graph.**ComputeCallable: > >> >>>> > >> >>> call: > >> >>> > >> >>>> Computation took 28.386961 secs for 95 partitions on superstep 0. > >> >>>> > >> >>> Flushing > >> >>> > >> >>>> started > >> >>>> 2013-02-19 21:40:07,604 INFO org.apache.giraph.worker.** > >> >>>> BspServiceWorker: > >> >>>> finishSuperstep: Waiting on all requests, superstep 0 Memory > >> >>>> (free/total/max) = 5993.42M / 8661.44M / 14346.69M > >> >>>> 2013-02-19 21:40:07,605 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 558 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 6.4758, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0003, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 27.083 > >> >>>> 2013-02-19 21:40:26,717 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 39 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 3.7965, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0.0001, MBytesSent = 175.3834, MBytesReceived = 0.0068, ave sent > req > >> >>>> > >> >>> MBytes > >> >>> > >> >>>> = 0.2998, ave received req MBytes = 0, secs waited = 46.195 > >> >>>> 2013-02-19 21:40:32,248 INFO > >> >>>> org.apache.giraph.comm.netty.**handler.RequestDecoder: decode: > Server > >> >>>> > >> >>> window > >> >>> > >> >>>> metrics MBytes/sec sent = 0.0002, MBytes/sec received = 3.9824, > >> >>>> > >> >>> MBytesSent > >> >>> > >> >>>> = 0.0066, MBytesReceived = 155.4787, ave sent req MBytes = 0, ave > >> >>>> > >> >>> received > >> >>> > >> >>>> req MBytes = 0.0155, secs waited = 39.04 > >> >>>> 2013-02-19 21:40:42,041 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 36 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 2.8509, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0.0001, MBytesSent = 175.3834, MBytesReceived = 0.0068, ave sent > req > >> >>>> > >> >>> MBytes > >> >>> > >> >>>> = 0.2998, ave received req MBytes = 0, secs waited = 61.519 > >> >>>> 2013-02-19 21:41:07,456 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 2.0174, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0.0001, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent > req > >> >>>> > >> >>> MBytes > >> >>> > >> >>>> = 0.2998, ave received req MBytes = 0, secs waited = 86.933 > >> >>>> 2013-02-19 21:41:22,456 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 1.7205, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0.0001, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent > req > >> >>>> > >> >>> MBytes > >> >>> > >> >>>> = 0.2998, ave received req MBytes = 0, secs waited = 101.934 > >> >>>> 2013-02-19 21:41:37,457 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 1.4998, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0.0001, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent > req > >> >>>> > >> >>> MBytes > >> >>> > >> >>>> = 0.2998, ave received req MBytes = 0, secs waited = 116.935 > >> >>>> 2013-02-19 21:41:52,458 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 1.3293, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0.0001, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent > req > >> >>>> > >> >>> MBytes > >> >>> > >> >>>> = 0.2998, ave received req MBytes = 0, secs waited = 131.936 > >> >>>> 2013-02-19 21:42:22,459 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 1.083, MBytes/sec > >> received > >> >>>> = > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 161.937 > >> >>>> 2013-02-19 21:42:37,460 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.9912, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 176.938 > >> >>>> 2013-02-19 21:43:07,461 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.8475, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 206.939 > >> >>>> 2013-02-19 21:43:37,462 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.7402, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 236.94 > >> >>>> 2013-02-19 21:43:52,463 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.6961, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 251.941 > >> >>>> 2013-02-19 21:44:07,464 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.657, MBytes/sec > >> received > >> >>>> = > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 266.942 > >> >>>> 2013-02-19 21:44:37,465 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.5906, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 296.943 > >> >>>> 2013-02-19 21:45:07,466 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.5364, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 326.944 > >> >>>> 2013-02-19 21:45:22,467 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.5129, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 341.945 > >> >>>> 2013-02-19 21:45:37,468 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.4913, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 356.946 > >> >>>> 2013-02-19 21:46:07,469 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.4532, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 386.947 > >> >>>> 2013-02-19 21:46:22,470 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.4363, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 401.948 > >> >>>> 2013-02-19 21:46:52,471 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.406, MBytes/sec > >> received > >> >>>> = > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 431.949 > >> >>>> 2013-02-19 21:47:07,472 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.3924, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 446.95 > >> >>>> 2013-02-19 21:47:37,473 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.3677, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 476.951 > >> >>>> 2013-02-19 21:48:07,474 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.346, MBytes/sec > >> received > >> >>>> = > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 506.952 > >> >>>> 2013-02-19 21:48:22,475 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.336, MBytes/sec > >> received > >> >>>> = > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 521.953 > >> >>>> 2013-02-19 21:48:52,476 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.3177, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 551.954 > >> >>>> 2013-02-19 21:49:07,477 INFO org.apache.giraph.comm.netty.** > >> >>>> NettyClient: > >> >>>> waitSomeRequests: Waiting interval of 15000 msecs, 29 open > requests, > >> >>>> waiting for it to be <= 0, MBytes/sec sent = 0.3093, MBytes/sec > >> received > >> >>>> > >> >>> = > >> >>> > >> >>>> 0, MBytesSent = 175.3834, MBytesReceived = 0.0069, ave sent req > >> MBytes = > >> >>>> 0.2998, ave received req MBytes = 0, secs waited = 566.955 > >> >>>> > >> >>>> It looks similar to the problem we had earlier with concurrency. > >> >>>> > >> >>>> Are you guys experiencing something similar at large-scale? > >> >>>> > >> >>>> > >> >>> > >> >> > >> > > >> > >> > >> -- > >> Claudio Martella > >> [email protected] > >> > > > > > -- Claudio Martella [email protected]
