Out of curiosity, did you see the message below in the RS log?

    LOG.warn("Snapshot called again without clearing previous. " +
        "Doing nothing. Another ongoing flush or did we fail last attempt?");

thanks.
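[For context: that warning comes from the memstore flush path, where snapshot() refuses to take a new snapshot while an uncleared one still exists. Below is a paraphrased sketch from memory of the 0.98-era MemStore code; the field names and element type are illustrative rather than the exact source, and the real code logs via LOG.warn instead of stderr.]

    import java.util.concurrent.ConcurrentSkipListSet;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Paraphrased sketch of the memstore snapshot check behind the warning.
    public class MemStoreSketch {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private ConcurrentSkipListSet<String> kvset = new ConcurrentSkipListSet<>();
        private ConcurrentSkipListSet<String> snapshot = new ConcurrentSkipListSet<>();

        void snapshot() {
            lock.writeLock().lock();
            try {
                if (!snapshot.isEmpty()) {
                    // The previous snapshot was never flushed and cleared, so a
                    // new one is refused; this produces the warning in question.
                    System.err.println("Snapshot called again without clearing previous. "
                            + "Doing nothing. Another ongoing flush or did we fail last attempt?");
                } else {
                    snapshot = kvset;                       // current edits become the snapshot
                    kvset = new ConcurrentSkipListSet<>();  // fresh set for incoming writes
                }
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

[If a flush keeps failing or hanging, the snapshot never empties, which would also fit the periodicFlusher requesting a flush of the same region every 10 seconds in the logs below.]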
On Tue, Sep 9, 2014 at 2:15 AM, Brian Jeltema <brian.jelt...@digitalenvoy.net> wrote:

> I’ve resolved these problems by restarting the region server that owned
> the region in question. I don’t know what the underlying issue was, but
> at this point it’s not worth pursuing.
>
> Thanks for responding.
>
> Brian
>
> On Sep 8, 2014, at 11:06 AM, Brian Jeltema <brian.jelt...@digitalenvoy.net> wrote:
>
> > I realized today that the region server logs for the region being
> > updated (startKey=\x00DDD@) contain the following:
> >
> > 2014-09-08 06:25:50,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 11302
> > 2014-09-08 06:26:00,222 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 21682
> > 2014-09-08 06:26:10,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 5724
> > 2014-09-08 06:26:20,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 11962
> > 2014-09-08 06:26:30,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 7693
> > 2014-09-08 06:26:40,224 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 5578
> > 2014-09-08 06:26:50,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 12420
> >
> > A log entry has been generated every 10 seconds, starting about 4 days
> > ago. I presume these problems are related.
> >
> > On Sep 8, 2014, at 7:10 AM, Brian Jeltema <brian.jelt...@digitalenvoy.net> wrote:
> >
> >>> When the number of attempts is greater than the value of
> >>> hbase.client.start.log.errors.counter (default 9), AsyncProcess would
> >>> produce the logs cited below.
> >>> The interval following 'retrying after' is the backoff time.
> >>>
> >>> Which release of HBase are you using?
> >>
> >> HBase Version 0.98.0.2.1.1.0-385-hadoop2
> >>
> >> The MR job is reading from an HBase snapshot, if that’s relevant.
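[A minimal sketch of the client-side settings in play here, assuming stock 0.98 defaults: hbase.client.start.log.errors.counter and its default of 9 come from the explanation above, and the other two keys are the standard client retry knobs.]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ClientRetryConfig {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // "attempt=N/35" in the logs: 35 is hbase.client.retries.number
            conf.setInt("hbase.client.retries.number", 35);
            // base pause (ms) that the backoff schedule multiplies
            conf.setLong("hbase.client.pause", 100);
            // AsyncProcess stays quiet until this many attempts have failed
            conf.setInt("hbase.client.start.log.errors.counter", 9);
        }
    }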
> >>> Cheers
> >>>
> >>> On Sun, Sep 7, 2014 at 8:50 AM, Brian Jeltema <brian.jelt...@digitalenvoy.net> wrote:
> >>>
> >>>> I have a map/reduce job that is consistently failing with timeouts.
> >>>> The failing mapper log files contain a series of records similar to
> >>>> those below. When I look at the hbase and hdfs logs (on foo.net in
> >>>> this case) I don’t see anything obvious at these timestamps. The
> >>>> mapper task times out at/near attempt=25/35. Can anyone shed light
> >>>> on what these log entries mean?
> >>>>
> >>>> Thanks - Brian
> >>>>
> >>>> 2014-09-07 09:36:51,421 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=10/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 10029 ms, replay 1062 ops
> >>>> 2014-09-07 09:37:01,642 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=11/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 10023 ms, replay 1062 ops
> >>>> 2014-09-07 09:37:12,064 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=12/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20182 ms, replay 1062 ops
> >>>> 2014-09-07 09:37:32,708 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=13/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20140 ms, replay 1062 ops
> >>>> 2014-09-07 09:37:52,940 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=14/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20041 ms, replay 1062 ops
> >>>> 2014-09-07 09:38:13,324 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=15/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20041 ms, replay 1062 ops
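[The ~10 s and ~20 s 'retrying after' intervals above line up with HBase's client backoff schedule. Below is a paraphrased sketch of the calculation, from memory of the 0.98-era ConnectionUtils.getPauseTime; the multiplier table may differ slightly between releases.]

    // Paraphrased sketch of the client retry backoff; values from memory.
    public class BackoffSketch {
        // Multipliers applied to hbase.client.pause, indexed by retry count.
        static final long[] RETRY_BACKOFF =
                {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

        static long getPauseTime(long pause, int tries) {
            int idx = Math.min(tries, RETRY_BACKOFF.length - 1);
            long normal = pause * RETRY_BACKOFF[idx];
            long jitter = (long) (normal * Math.random() * 0.01f); // up to ~1% jitter
            return normal + jitter;
        }

        public static void main(String[] args) {
            // With the default hbase.client.pause of 100 ms, attempts 10-11
            // back off ~10 s and attempts 12+ back off ~20 s, matching the
            // "retrying after 10029 ms ... 20182 ms" lines above.
            for (int attempt = 10; attempt <= 15; attempt++) {
                System.out.println("attempt " + attempt + ": ~"
                        + getPauseTime(100, attempt - 1) + " ms");
            }
        }
    }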