Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Stack
On Mon, Nov 30, 2015 at 9:16 PM, Bryan Beaudreault  wrote:

> I'll try to get another one.  We are currently not seeing the issue due to
> lack of contention (it is off hours for our customers).
>
> Note that the stack trace I gave you was taken with a tool we have which
> aggregates common stacks. The one at the bottom occurred 122 times (out of
> 128 handlers -- this is from before the tuning where we added 1000 handlers
> and the read vs write split).  So to me it looks like 122 of 128 handlers
> were waiting on:
>
> if (!existingContext.latch.await(this.rowLockWaitDuration,
> TimeUnit.MILLISECONDS)) {
>   throw new IOException("Timed out waiting for lock for row: " + rowKey);
> }
>
>
>
That would explain the nice numbers. Looks like a useful tool. Can I have
the unadulterated trace? Are all increments trying to go against the same
row?  If the one thread doing an actual increment is up in the memstore
doing a family compare, is it stuck and not letting anyone else in? It is
only using 5% CPU though... so maybe not.

St.Ack




> On Tue, Dec 1, 2015 at 12:08 AM Stack  wrote:
>
> > Looking at that stack trace, nothing showing as blocked or slowed by
> > another operation. You have others I could look at Bryan?
> > St.Ack
> >
> > On Mon, Nov 30, 2015 at 8:40 PM, Bryan Beaudreault <
> > bbeaudrea...@hubspot.com
> > > wrote:
> >
> > > Yea sorry if I was misleading.  The nonce loglines we saw only happened
> > on
> > > full cluster restart; it may have been the HLogs replaying, not sure.
> > >
> > > We are still seeing slow Increments. Where Gets and Mutates will be on
> > the
> > > order of 50-150ms according to metrics, Increment will be in the
> > > 1000-5000ms range. It seems we may be blocking on FSHLog#syncer.
> > >
> > >
> >
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L359
> > >
> > >
> > >
> > > On Mon, Nov 30, 2015 at 11:26 PM Stack  wrote:
> > >
> > > > Still slow increments though?
> > > >
> > > > On Mon, Nov 30, 2015 at 5:05 PM, Bryan Beaudreault <
> > > > bbeaudrea...@hubspot.com
> > > > > wrote:
> > > >
> > > > > Those log lines have settled down, they may have been related to a
> > > > > cluster-wide forced restart at the time.
> > > > >
> > > > > On Mon, Nov 30, 2015 at 7:59 PM Bryan Beaudreault <
> > > > > bbeaudrea...@hubspot.com>
> > > > > wrote:
> > > > >
> > > > > > We've been doing more debugging of this and have set up the read
> vs
> > > > write
> > > > > > handlers to try to at least segment this away so reads can work.
> We
> > > > have
> > > > > > pretty beefy servers, and are running with the following
> settings:
> > > > > >
> > > > > > hbase.regionserver.handler.count=1000
> > > > > > hbase.ipc.server.read.threadpool.size=50
> > > > > > hbase.ipc.server.callqueue.handler.factor=0.025
> > > > > > hbase.ipc.server.callqueue.read.ratio=0.6
> > > > > > hbase.ipc.server.callqueue.scan.ratio=0.5
> > > > > >
> > > > > > We are seeing all 400 write handlers taken up by row locks for
> the
> > > most
> > > > > > part. The read handlers are mostly idle. We're thinking of
> changing
> > > the
> > > > > > ratio here, but are not sure it will help if they are all blocked
> > on
> > > a
> > > > > row
> > > > > > lock.  We just enabled DEBUG logging on all our servers and
> notice
> > > the
> > > > > > following:
> > > > > >
> > > > > > 2015-12-01 00:56:09,015 DEBUG
> > > > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > > > detected
> > > > > > by nonce: [-687451119961178644:7664336281906118656], [state 0,
> > > hasWait
> > > > > > false, activity 00:54:36.240]
> > > > > > 2015-12-01 00:56:09,015 DEBUG
> > > > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > > > detected
> > > > > > by nonce: [-687451119961178644:-7119840249342174227], [state 0,
> > > hasWait
> > > > > > false, activity 00:54:36.256]
> > > > > > 2015-12-01 00:56:09,268 DEBUG
> > > > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > > > detected
> > > > > > by nonce: [-5946137511131403479:2112661701888365489], [state 0,
> > > hasWait
> > > > > > false, activity 00:55:01.259]
> > > > > > 2015-12-01 00:56:09,279 DEBUG
> > > > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > > > detected
> > > > > > by nonce: [4165332617675853029:6256955295384472057], [state 0,
> > > hasWait
> > > > > > false, activity 00:53:58.151]
> > > > > > 2015-12-01 00:56:09,279 DEBUG
> > > > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > > > detected
> > > > > > by nonce: [4165332617675853029:4961178013070912522], [state 0,
> > > hasWait
> > > > > > false, activity 00:53:58.162]
> > > > > >
> > > > > >
> > > > > > On Mon, Nov 30, 2015 at 6:11 PM Bryan Beaudreault <
> > > > > > bbeaudrea...@hubspot.com> wrote:
> > > > > >
> > > > > >> Sorry the second link should be
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L579
> > > > > >>
> > > > > >

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Bryan Beaudreault
I didn't think to use the non-aggregated jstack output, as it has become
second nature for us to use https://github.com/HubSpot/astack/.

It rolls up repeating stack traces. You can see above each stack trace the
number of times it occurred and an estimated CPU time spent.  Sorry, I will
try to get it without astack next time. Hopefully this clears things up,
though: there are 122 increments blocking in the aggregated jstack.
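(For anyone reading along without astack: the roll-up idea is roughly the
following. This is a rough sketch, not astack itself; it only counts
identical frame lists from a plain jstack dump and does not attempt the
CPU-time estimate.)

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of rolling up repeating stack traces in a jstack dump.
// Thread blocks are separated by blank lines; the first line of a block is
// the thread header, everything after it (state line + frames) is what we
// group on, so identical stacks collapse into one entry with a count.
public class StackRollup {
  public static void main(String[] args) throws Exception {
    List<String> lines = Files.readAllLines(Paths.get(args[0]));
    Map<String, Integer> counts = new LinkedHashMap<>();
    List<String> block = new ArrayList<>();
    for (String line : lines) {
      if (line.trim().isEmpty()) {
        addBlock(block, counts);
        block.clear();
      } else {
        block.add(line);
      }
    }
    addBlock(block, counts);
    counts.entrySet().stream()
        .sorted((a, b) -> b.getValue() - a.getValue())
        .forEach(e -> System.out.println(e.getValue() + " threads:\n" + e.getKey() + "\n"));
  }

  private static void addBlock(List<String> block, Map<String, Integer> counts) {
    if (block.size() < 2) {
      return; // header-only or empty block
    }
    // Drop the per-thread header line so identical frame lists collapse together.
    String frames = String.join("\n", block.subList(1, block.size()));
    counts.merge(frames, 1, Integer::sum);
  }
}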

On Tue, Dec 1, 2015 at 12:17 AM Stack  wrote:

> Looking again, the
>
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L359
> thread
> dump and the https://gist.github.com/bbeaudreault/2994a748da83d9f75085
> thread dump are the same? There are only two increments going on in this thread
> dump:
>
> at org.apache.hadoop.hbase.KeyValue.matchingQualifier(KeyValue.java:1656)
>
> ... and the other is doing:
>
> at
>
> org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:3593)
>
> Not many increments going on.
>
>
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L286
> also shows two increments in the same places. Is it stuck?
>
> St.Ack
>
>
>
>
>
> On Mon, Nov 30, 2015 at 3:10 PM, Bryan Beaudreault <
> bbeaudrea...@hubspot.com
> > wrote:
>
> > https://gist.github.com/bbeaudreault/2994a748da83d9f75085
> >
> > An active handler:
> >
> >
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L286
> > One that is locked:
> >
> >
> https://git.hubteam.com/gist/jwilliams/80f37999bfdf55119588#file-gistfile1-txt-L579
> >
> > The difference between pre-rollback and post is that previously we were
> > seeing things blocked in mvcc.  Now we are seeing them blocked on the
> > upsert.
> >
> > It always follows the same pattern, of 1 active handler in the upsert and
> > the rest blocked waiting for it.
> >
> > On Mon, Nov 30, 2015 at 6:05 PM Stack  wrote:
> >
> > > On Mon, Nov 30, 2015 at 2:31 PM, Bryan Beaudreault <
> > > bbeaudrea...@hubspot.com
> > > > wrote:
> > >
> > > > The rollback seems to have mostly solved the issue for one of our
> > > clusters,
> > > > but another one is still seeing long increment times:
> > > >
> > > > "slowIncrementCount": 52080,
> > > > "Increment_num_ops": 325236,"Increment_min": 1,"Increment_max":
> 6162,"
> > > > Increment_mean": 465.68678129112396,"Increment_median": 216,"
> > > > Increment_75th_percentile": 450.25,"Increment_95th_percentile":
> > > > 1052.64999,"Increment_99th_percentile": 1635.23998
> > > >
> > > >
> > > > Any ideas if there are other changes that may be causing a
> performance
> > > > regression for increments between CDH4.7.1 and CDH5.3.8?
> > > >
> > > >
> > > >
> > > No.
> > >
> > > Post a thread dump Bryan and it might prompt something.
> > >
> > > St.Ack
> > >
> > >
> > >
> > >
> > > >
> > > > On Mon, Nov 30, 2015 at 4:13 PM Stack  wrote:
> > > >
> > > > > On Mon, Nov 30, 2015 at 12:54 PM, Bryan Beaudreault <
> > > > > bbeaudrea...@hubspot.com> wrote:
> > > > >
> > > > > > Should this be added as a known issue in the CDH or hbase
> > > > documentation?
> > > > > It
> > > > > > was a severe performance hit for us, all of our regionservers
> were
> > > > > sitting
> > > > > > at a few thousand queued requests.
> > > > > >
> > > > > >
> > > > > Let me take care of that.
> > > > > St.Ack
> > > > >
> > > > >
> > > > >
> > > > > > On Mon, Nov 30, 2015 at 3:53 PM Bryan Beaudreault <
> > > > > > bbeaudrea...@hubspot.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Yea, they are all over the place and called from client and
> > > > coprocessor
> > > > > > > code. We ended up having no other option but to rollback, and
> > aside
> > > > > from
> > > > > > a
> > > > > > > few NoSuchMethodErrors due to API changes (Put#add vs
> > > Put#addColumn),
> > > > > it
> > > > > > > seems to be working and fixing our problem.
> > > > > > >
> > > > > > > On Mon, Nov 30, 2015 at 3:47 PM Stack 
> wrote:
> > > > > > >
> > > > > > >> Rollback is untested. No fix in 5.5. I was going to work on
> this
> > > > now.
> > > > > > >> Where
> > > > > > >> are your counters Bryan? In their own column family or
> scattered
> > > > about
> > > > > > in
> > > > > > >> a
> > > > > > >> row with other Cell types?
> > > > > > >> St.Ack
> > > > > > >>
> > > > > > >> On Mon, Nov 30, 2015 at 10:28 AM, Bryan Beaudreault <
> > > > > > >> bbeaudrea...@hubspot.com> wrote:
> > > > > > >>
> > > > > > >> > Is there any update to this? We just upgraded all of our
> > > > production
> > > > > > >> > clusters from CDH4 to CDH5.4.7 and, not seeing this JIRA
> > listed
> > > in
> > > > > the
> > > > > > >> > known issues, did not know about this.  Now we are seeing
> > > > performance
> > > > > > >> issues
> > > > > > >> > across all clusters, as we make heavy use of increments.
> > > > > > >> >
> > > > > > >> > Can we roll forward to CDH5.5 to fix? Or is our only hope to
> > > roll
> > > > > back
> > > > > > >> to
> > > > > > >> > CDH 5.3.1 (if that is possible)?
> > > > > > >> >
> > > > > > >> >
> 

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Stack
Looking again, the
https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L359
thread
dump and the https://gist.github.com/bbeaudreault/2994a748da83d9f75085
thread dump are the same? There are only two increments going on in this thread
dump:

at org.apache.hadoop.hbase.KeyValue.matchingQualifier(KeyValue.java:1656)

... and the other is doing:

at
org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:3593)

Not many increments going on.

https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L286
also shows two increments in the same places. Is it stuck?

St.Ack





On Mon, Nov 30, 2015 at 3:10 PM, Bryan Beaudreault  wrote:

> https://gist.github.com/bbeaudreault/2994a748da83d9f75085
>
> An active handler:
>
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L286
> One that is locked:
>
> https://git.hubteam.com/gist/jwilliams/80f37999bfdf55119588#file-gistfile1-txt-L579
>
> The difference between pre-rollback and post is that previously we were
> seeing things blocked in mvcc.  Now we are seeing them blocked on the
> upsert.
>
> It always follows the same pattern, of 1 active handler in the upsert and
> the rest blocked waiting for it.
>
> On Mon, Nov 30, 2015 at 6:05 PM Stack  wrote:
>
> > On Mon, Nov 30, 2015 at 2:31 PM, Bryan Beaudreault <
> > bbeaudrea...@hubspot.com
> > > wrote:
> >
> > > The rollback seems to have mostly solved the issue for one of our
> > clusters,
> > > but another one is still seeing long increment times:
> > >
> > > "slowIncrementCount": 52080,
> > > "Increment_num_ops": 325236,"Increment_min": 1,"Increment_max": 6162,"
> > > Increment_mean": 465.68678129112396,"Increment_median": 216,"
> > > Increment_75th_percentile": 450.25,"Increment_95th_percentile":
> > > 1052.64999,"Increment_99th_percentile": 1635.23998
> > >
> > >
> > > Any ideas if there are other changes that may be causing a performance
> > > regression for increments between CDH4.7.1 and CDH5.3.8?
> > >
> > >
> > >
> > No.
> >
> > Post a thread dump Bryan and it might prompt something.
> >
> > St.Ack
> >
> >
> >
> >
> > >
> > > On Mon, Nov 30, 2015 at 4:13 PM Stack  wrote:
> > >
> > > > On Mon, Nov 30, 2015 at 12:54 PM, Bryan Beaudreault <
> > > > bbeaudrea...@hubspot.com> wrote:
> > > >
> > > > > Should this be added as a known issue in the CDH or hbase
> > > documentation?
> > > > It
> > > > > was a severe performance hit for us, all of our regionservers were
> > > > sitting
> > > > > at a few thousand queued requests.
> > > > >
> > > > >
> > > > Let me take care of that.
> > > > St.Ack
> > > >
> > > >
> > > >
> > > > > On Mon, Nov 30, 2015 at 3:53 PM Bryan Beaudreault <
> > > > > bbeaudrea...@hubspot.com>
> > > > > wrote:
> > > > >
> > > > > > Yea, they are all over the place and called from client and
> > > coprocessor
> > > > > > code. We ended up having no other option but to rollback, and
> aside
> > > > from
> > > > > a
> > > > > > few NoSuchMethodErrors due to API changes (Put#add vs
> > Put#addColumn),
> > > > it
> > > > > > seems to be working and fixing our problem.
> > > > > >
> > > > > > On Mon, Nov 30, 2015 at 3:47 PM Stack  wrote:
> > > > > >
> > > > > >> Rollback is untested. No fix in 5.5. I was going to work on this
> > > now.
> > > > > >> Where
> > > > > >> are your counters Bryan? In their own column family or scattered
> > > about
> > > > > in
> > > > > >> a
> > > > > >> row with other Cell types?
> > > > > >> St.Ack
> > > > > >>
> > > > > >> On Mon, Nov 30, 2015 at 10:28 AM, Bryan Beaudreault <
> > > > > >> bbeaudrea...@hubspot.com> wrote:
> > > > > >>
> > > > > >> > Is there any update to this? We just upgraded all of our
> > > production
> > > > > >> > clusters from CDH4 to CDH5.4.7 and, not seeing this JIRA
> listed
> > in
> > > > the
> > > > > >> > known issues, did not know about this.  Now we are seeing
> > > performance
> > > > > >> issues
> > > > > >> > across all clusters, as we make heavy use of increments.
> > > > > >> >
> > > > > >> > Can we roll forward to CDH5.5 to fix? Or is our only hope to
> > roll
> > > > back
> > > > > >> to
> > > > > >> > CDH 5.3.1 (if that is possible)?
> > > > > >> >
> > > > > >> >
> > > > > >> > On Thu, Sep 24, 2015 at 5:06 AM 鈴木俊裕 
> > wrote:
> > > > > >> >
> > > > > >> > > Thank you St.Ack!
> > > > > >> > >
> > > > > >> > > I would like to follow the ticket.
> > > > > >> > >
> > > > > >> > > Toshihiro Suzuki
> > > > > >> > >
> > > > > >> > > 2015-09-22 14:14 GMT+09:00 Stack :
> > > > > >> > >
> > > > > >> > > > Back to this problem. Simple tests confirm that as is, the
> > > > > >> > > > single-queue-backed MVCC instance can slow Region ops if
> > some
> > > > > other
> > > > > >> row
> > > > > >> > > is
> > > > > >> > > > slow to complete. In particular Increment, checkAndPut,
> and
> > > > batch
> > > > > >> > > mutations
> > > > > >> > > > are affected. I opened HBASE-14460 to start in on a fix
> up.
> > > Lets
> > > > > >> see if
> > > > > >> > > we
> > > 

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Bryan Beaudreault
I'll try to get another one.  We are currently not seeing the issue due to
lack of contention (it is off hours for our customers).

Note that the stack trace I gave you was taken with a tool we have which
aggregates common stacks. The one at the bottom occurred 122 times (out of
128 handlers -- this is from before the tuning where we added 1000 handlers
and the read vs write split).  So to me it looks like 122 of 128 handlers
were waiting on:

if (!existingContext.latch.await(this.rowLockWaitDuration,
TimeUnit.MILLISECONDS)) {
  throw new IOException("Timed out waiting for lock for row: " + rowKey);
}
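(For context, a minimal client-side sketch of the kind of workload that
lands here: many threads incrementing the same hot row, so only one handler
at a time gets past getRowLockInternal and the rest sit in that
latch.await(). This assumes the HBase 1.0 client API; the table, family and
qualifier names are made up.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Toy driver: many threads incrementing the SAME row. Server side, only one
// handler at a time holds the row lock; the others wait on the latch shown
// above, up to this.rowLockWaitDuration, before timing out.
public class HotRowIncrements {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    final byte[] row = Bytes.toBytes("hot-counter-row"); // placeholder row key
    final byte[] cf = Bytes.toBytes("c");                // placeholder family
    final byte[] q = Bytes.toBytes("hits");              // placeholder qualifier
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      Runnable worker = () -> {
        try (Table table = conn.getTable(TableName.valueOf("counters"))) {
          for (int i = 0; i < 10000; i++) {
            table.incrementColumnValue(row, cf, q, 1L);
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      };
      Thread[] threads = new Thread[128];
      for (int i = 0; i < threads.length; i++) {
        threads[i] = new Thread(worker, "incrementer-" + i);
        threads[i].start();
      }
      for (Thread t : threads) {
        t.join();
      }
    }
  }
}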


On Tue, Dec 1, 2015 at 12:08 AM Stack  wrote:

> Looking at that stack trace, nothing showing as blocked or slowed by
> another operation. You have others I could look at Bryan?
> St.Ack
>
> On Mon, Nov 30, 2015 at 8:40 PM, Bryan Beaudreault <
> bbeaudrea...@hubspot.com
> > wrote:
>
> > Yea sorry if I was misleading.  The nonce loglines we saw only happened
> on
> > full cluster restart; it may have been the HLogs replaying, not sure.
> >
> > We are still seeing slow Increments. Where Gets and Mutates will be on
> the
> > order of 50-150ms according to metrics, Increment will be in the
> > 1000-5000ms range. It seems we may be blocking on FSHLog#syncer.
> >
> >
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L359
> >
> >
> >
> > On Mon, Nov 30, 2015 at 11:26 PM Stack  wrote:
> >
> > > Still slow increments though?
> > >
> > > On Mon, Nov 30, 2015 at 5:05 PM, Bryan Beaudreault <
> > > bbeaudrea...@hubspot.com
> > > > wrote:
> > >
> > > > Those log lines have settled down, they may have been related to a
> > > > cluster-wide forced restart at the time.
> > > >
> > > > On Mon, Nov 30, 2015 at 7:59 PM Bryan Beaudreault <
> > > > bbeaudrea...@hubspot.com>
> > > > wrote:
> > > >
> > > > > We've been doing more debugging of this and have set up the read vs
> > > write
> > > > > handlers to try to at least segment this away so reads can work. We
> > > have
> > > > > pretty beefy servers, and are running with the following settings:
> > > > >
> > > > > hbase.regionserver.handler.count=1000
> > > > > hbase.ipc.server.read.threadpool.size=50
> > > > > hbase.ipc.server.callqueue.handler.factor=0.025
> > > > > hbase.ipc.server.callqueue.read.ratio=0.6
> > > > > hbase.ipc.server.callqueue.scan.ratio=0.5
> > > > >
> > > > > We are seeing all 400 write handlers taken up by row locks for the
> > most
> > > > > part. The read handlers are mostly idle. We're thinking of changing
> > the
> > > > > ratio here, but are not sure it will help if they are all blocked
> on
> > a
> > > > row
> > > > > lock.  We just enabled DEBUG logging on all our servers and notice
> > the
> > > > > following:
> > > > >
> > > > > 2015-12-01 00:56:09,015 DEBUG
> > > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > > detected
> > > > > by nonce: [-687451119961178644:7664336281906118656], [state 0,
> > hasWait
> > > > > false, activity 00:54:36.240]
> > > > > 2015-12-01 00:56:09,015 DEBUG
> > > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > > detected
> > > > > by nonce: [-687451119961178644:-7119840249342174227], [state 0,
> > hasWait
> > > > > false, activity 00:54:36.256]
> > > > > 2015-12-01 00:56:09,268 DEBUG
> > > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > > detected
> > > > > by nonce: [-5946137511131403479:2112661701888365489], [state 0,
> > hasWait
> > > > > false, activity 00:55:01.259]
> > > > > 2015-12-01 00:56:09,279 DEBUG
> > > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > > detected
> > > > > by nonce: [4165332617675853029:6256955295384472057], [state 0,
> > hasWait
> > > > > false, activity 00:53:58.151]
> > > > > 2015-12-01 00:56:09,279 DEBUG
> > > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > > detected
> > > > > by nonce: [4165332617675853029:4961178013070912522], [state 0,
> > hasWait
> > > > > false, activity 00:53:58.162]
> > > > >
> > > > >
> > > > > On Mon, Nov 30, 2015 at 6:11 PM Bryan Beaudreault <
> > > > > bbeaudrea...@hubspot.com> wrote:
> > > > >
> > > > >> Sorry the second link should be
> > > > >>
> > > >
> > >
> >
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L579
> > > > >>
> > > > >> On Mon, Nov 30, 2015 at 6:10 PM Bryan Beaudreault <
> > > > >> bbeaudrea...@hubspot.com> wrote:
> > > > >>
> > > > >>> https://gist.github.com/bbeaudreault/2994a748da83d9f75085
> > > > >>>
> > > > >>> An active handler:
> > > > >>>
> > > >
> > >
> >
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L286
> > > > >>> One that is locked:
> > > > >>>
> > > >
> > >
> >
> https://git.hubteam.com/gist/jwilliams/80f37999bfdf55119588#file-gistfile1-txt-L579
> > > > >>>
> > > > >>> The difference between pre-rollback and post is that previously
> we
> > > were
> > > > >>> seeing things blocked in mvcc.  Now we are seeing them b

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Stack
Looking at that stack trace, nothing showing as blocked or slowed by
another operation. Do you have others I could look at, Bryan?
St.Ack

On Mon, Nov 30, 2015 at 8:40 PM, Bryan Beaudreault  wrote:

> Yea sorry if I was misleading.  The nonce loglines we saw only happened on
> full cluster restart; it may have been the HLogs replaying, not sure.
>
> We are still seeing slow Increments. Where Gets and Mutates will be on the
> order of 50-150ms according to metrics, Increment will be in the
> 1000-5000ms range. It seems we may be blocking on FSHLog#syncer.
>
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L359
>
>
>
> On Mon, Nov 30, 2015 at 11:26 PM Stack  wrote:
>
> > Still slow increments though?
> >
> > On Mon, Nov 30, 2015 at 5:05 PM, Bryan Beaudreault <
> > bbeaudrea...@hubspot.com
> > > wrote:
> >
> > > Those log lines have settled down, they may have been related to a
> > > cluster-wide forced restart at the time.
> > >
> > > On Mon, Nov 30, 2015 at 7:59 PM Bryan Beaudreault <
> > > bbeaudrea...@hubspot.com>
> > > wrote:
> > >
> > > > We've been doing more debugging of this and have set up the read vs
> > write
> > > > handlers to try to at least segment this away so reads can work. We
> > have
> > > > pretty beefy servers, and are running with the following settings:
> > > >
> > > > hbase.regionserver.handler.count=1000
> > > > hbase.ipc.server.read.threadpool.size=50
> > > > hbase.ipc.server.callqueue.handler.factor=0.025
> > > > hbase.ipc.server.callqueue.read.ratio=0.6
> > > > hbase.ipc.server.callqueue.scan.ratio=0.5
> > > >
> > > > We are seeing all 400 write handlers taken up by row locks for the
> most
> > > > part. The read handlers are mostly idle. We're thinking of changing
> the
> > > > ratio here, but are not sure it will help if they are all blocked on
> a
> > > row
> > > > lock.  We just enabled DEBUG logging on all our servers and notice
> the
> > > > following:
> > > >
> > > > 2015-12-01 00:56:09,015 DEBUG
> > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > detected
> > > > by nonce: [-687451119961178644:7664336281906118656], [state 0,
> hasWait
> > > > false, activity 00:54:36.240]
> > > > 2015-12-01 00:56:09,015 DEBUG
> > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > detected
> > > > by nonce: [-687451119961178644:-7119840249342174227], [state 0,
> hasWait
> > > > false, activity 00:54:36.256]
> > > > 2015-12-01 00:56:09,268 DEBUG
> > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > detected
> > > > by nonce: [-5946137511131403479:2112661701888365489], [state 0,
> hasWait
> > > > false, activity 00:55:01.259]
> > > > 2015-12-01 00:56:09,279 DEBUG
> > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > detected
> > > > by nonce: [4165332617675853029:6256955295384472057], [state 0,
> hasWait
> > > > false, activity 00:53:58.151]
> > > > 2015-12-01 00:56:09,279 DEBUG
> > > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > > detected
> > > > by nonce: [4165332617675853029:4961178013070912522], [state 0,
> hasWait
> > > > false, activity 00:53:58.162]
> > > >
> > > >
> > > > On Mon, Nov 30, 2015 at 6:11 PM Bryan Beaudreault <
> > > > bbeaudrea...@hubspot.com> wrote:
> > > >
> > > >> Sorry the second link should be
> > > >>
> > >
> >
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L579
> > > >>
> > > >> On Mon, Nov 30, 2015 at 6:10 PM Bryan Beaudreault <
> > > >> bbeaudrea...@hubspot.com> wrote:
> > > >>
> > > >>> https://gist.github.com/bbeaudreault/2994a748da83d9f75085
> > > >>>
> > > >>> An active handler:
> > > >>>
> > >
> >
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L286
> > > >>> One that is locked:
> > > >>>
> > >
> >
> https://git.hubteam.com/gist/jwilliams/80f37999bfdf55119588#file-gistfile1-txt-L579
> > > >>>
> > > >>> The difference between pre-rollback and post is that previously we
> > were
> > > >>> seeing things blocked in mvcc.  Now we are seeing them blocked on
> the
> > > >>> upsert.
> > > >>>
> > > >>> It always follows the same pattern, of 1 active handler in the
> upsert
> > > >>> and the rest blocked waiting for it.
> > > >>>
> > > >>> On Mon, Nov 30, 2015 at 6:05 PM Stack  wrote:
> > > >>>
> > >  On Mon, Nov 30, 2015 at 2:31 PM, Bryan Beaudreault <
> > >  bbeaudrea...@hubspot.com
> > >  > wrote:
> > > 
> > >  > The rollback seems to have mostly solved the issue for one of
> our
> > >  clusters,
> > >  > but another one is still seeing long increment times:
> > >  >
> > >  > "slowIncrementCount": 52080,
> > >  > "Increment_num_ops": 325236,"Increment_min": 1,"Increment_max":
> > > 6162,"
> > >  > Increment_mean": 465.68678129112396,"Increment_median": 216,"
> > >  > Increment_75th_percentile": 450.25,"Increment_95th_percentile":
> > >  > 1052.64999,"Increment_99th_percentile":
> 1635.2

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Bryan Beaudreault
Yea, sorry if I was misleading.  The nonce log lines we saw only happened on
full cluster restart; it may have been the HLogs replaying, not sure.

We are still seeing slow Increments. While Gets and Mutates are on the
order of 50-150ms according to metrics, Increments are in the
1000-5000ms range. It seems we may be blocking on FSHLog#syncer.
https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L359
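To make the pattern concrete, here is a self-contained toy (no HBase in it)
of what we think is going on: the handler that wins the row lock also pays
the WAL sync cost, so a slow sync serializes every other increment against
that row. We are not certain exactly where the sync sits relative to the row
lock in 1.0, so treat this purely as a model of the symptom; the 50ms
latency and the thread counts are made up.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Toy model (not HBase code): one lock per hot row, and the lock holder also
// pays a simulated WAL sync. A slow sync serializes every increment on that
// row, which looks like "1 active handler, the rest parked on the row lock".
public class SlowSyncToy {
  private static final ReentrantLock rowLock = new ReentrantLock();
  private static final AtomicLong counter = new AtomicLong();

  static void increment() throws InterruptedException {
    rowLock.lock();              // ~ getRowLockInternal
    try {
      counter.incrementAndGet(); // ~ memstore upsert
      Thread.sleep(50);          // ~ FSHLog sync under load (made-up latency)
    } finally {
      rowLock.unlock();
    }
  }

  public static void main(String[] args) throws Exception {
    ExecutorService handlers = Executors.newFixedThreadPool(128);
    long start = System.nanoTime();
    for (int i = 0; i < 1000; i++) {
      handlers.submit(() -> {
        try {
          increment();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }
    handlers.shutdown();
    handlers.awaitTermination(5, TimeUnit.MINUTES);
    // 1000 increments x 50ms, fully serialized, is ~50s of wall clock no
    // matter how many of the 128 "handlers" are available.
    System.out.printf("counter=%d elapsed=%.1fs%n",
        counter.get(), (System.nanoTime() - start) / 1e9);
  }
}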



On Mon, Nov 30, 2015 at 11:26 PM Stack  wrote:

> Still slow increments though?
>
> On Mon, Nov 30, 2015 at 5:05 PM, Bryan Beaudreault <
> bbeaudrea...@hubspot.com
> > wrote:
>
> > Those log lines have settled down, they may have been related to a
> > cluster-wide forced restart at the time.
> >
> > On Mon, Nov 30, 2015 at 7:59 PM Bryan Beaudreault <
> > bbeaudrea...@hubspot.com>
> > wrote:
> >
> > > We've been doing more debugging of this and have set up the read vs
> write
> > > handlers to try to at least segment this away so reads can work. We
> have
> > > pretty beefy servers, and are running with the following settings:
> > >
> > > hbase.regionserver.handler.count=1000
> > > hbase.ipc.server.read.threadpool.size=50
> > > hbase.ipc.server.callqueue.handler.factor=0.025
> > > hbase.ipc.server.callqueue.read.ratio=0.6
> > > hbase.ipc.server.callqueue.scan.ratio=0.5
> > >
> > > We are seeing all 400 write handlers taken up by row locks for the most
> > > part. The read handlers are mostly idle. We're thinking of changing the
> > > ratio here, but are not sure it will help if they are all blocked on a
> > row
> > > lock.  We just enabled DEBUG logging on all our servers and notice the
> > > following:
> > >
> > > 2015-12-01 00:56:09,015 DEBUG
> > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > detected
> > > by nonce: [-687451119961178644:7664336281906118656], [state 0, hasWait
> > > false, activity 00:54:36.240]
> > > 2015-12-01 00:56:09,015 DEBUG
> > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > detected
> > > by nonce: [-687451119961178644:-7119840249342174227], [state 0, hasWait
> > > false, activity 00:54:36.256]
> > > 2015-12-01 00:56:09,268 DEBUG
> > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > detected
> > > by nonce: [-5946137511131403479:2112661701888365489], [state 0, hasWait
> > > false, activity 00:55:01.259]
> > > 2015-12-01 00:56:09,279 DEBUG
> > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > detected
> > > by nonce: [4165332617675853029:6256955295384472057], [state 0, hasWait
> > > false, activity 00:53:58.151]
> > > 2015-12-01 00:56:09,279 DEBUG
> > > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> > detected
> > > by nonce: [4165332617675853029:4961178013070912522], [state 0, hasWait
> > > false, activity 00:53:58.162]
> > >
> > >
> > > On Mon, Nov 30, 2015 at 6:11 PM Bryan Beaudreault <
> > > bbeaudrea...@hubspot.com> wrote:
> > >
> > >> Sorry the second link should be
> > >>
> >
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L579
> > >>
> > >> On Mon, Nov 30, 2015 at 6:10 PM Bryan Beaudreault <
> > >> bbeaudrea...@hubspot.com> wrote:
> > >>
> > >>> https://gist.github.com/bbeaudreault/2994a748da83d9f75085
> > >>>
> > >>> An active handler:
> > >>>
> >
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L286
> > >>> One that is locked:
> > >>>
> >
> https://git.hubteam.com/gist/jwilliams/80f37999bfdf55119588#file-gistfile1-txt-L579
> > >>>
> > >>> The difference between pre-rollback and post is that previously we
> were
> > >>> seeing things blocked in mvcc.  Now we are seeing them blocked on the
> > >>> upsert.
> > >>>
> > >>> It always follows the same pattern, of 1 active handler in the upsert
> > >>> and the rest blocked waiting for it.
> > >>>
> > >>> On Mon, Nov 30, 2015 at 6:05 PM Stack  wrote:
> > >>>
> >  On Mon, Nov 30, 2015 at 2:31 PM, Bryan Beaudreault <
> >  bbeaudrea...@hubspot.com
> >  > wrote:
> > 
> >  > The rollback seems to have mostly solved the issue for one of our
> >  clusters,
> >  > but another one is still seeing long increment times:
> >  >
> >  > "slowIncrementCount": 52080,
> >  > "Increment_num_ops": 325236,"Increment_min": 1,"Increment_max":
> > 6162,"
> >  > Increment_mean": 465.68678129112396,"Increment_median": 216,"
> >  > Increment_75th_percentile": 450.25,"Increment_95th_percentile":
> >  > 1052.64999,"Increment_99th_percentile": 1635.23998
> >  >
> >  >
> >  > Any ideas if there are other changes that may be causing a
> > performance
> >  > regression for increments between CDH4.7.1 and CDH5.3.8?
> >  >
> >  >
> >  >
> >  No.
> > 
> >  Post a thread dump Bryan and it might prompt something.
> > 
> >  St.Ack
> > 
> > 
> > 
> > 
> >  >
> >  > On Mon, Nov 30, 2015 at 4:13 PM Stack  wrote:
> >  >
> >  

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Stack
Still slow increments though?

On Mon, Nov 30, 2015 at 5:05 PM, Bryan Beaudreault  wrote:

> Those log lines have settled down, they may have been related to a
> cluster-wide forced restart at the time.
>
> On Mon, Nov 30, 2015 at 7:59 PM Bryan Beaudreault <
> bbeaudrea...@hubspot.com>
> wrote:
>
> > We've been doing more debugging of this and have set up the read vs write
> > handlers to try to at least segment this away so reads can work. We have
> > pretty beefy servers, and are running with the following settings:
> >
> > hbase.regionserver.handler.count=1000
> > hbase.ipc.server.read.threadpool.size=50
> > hbase.ipc.server.callqueue.handler.factor=0.025
> > hbase.ipc.server.callqueue.read.ratio=0.6
> > hbase.ipc.server.callqueue.scan.ratio=0.5
> >
> > We are seeing all 400 write handlers taken up by row locks for the most
> > part. The read handlers are mostly idle. We're thinking of changing the
> > ratio here, but are not sure it will help if they are all blocked on a
> row
> > lock.  We just enabled DEBUG logging on all our servers and notice the
> > following:
> >
> > 2015-12-01 00:56:09,015 DEBUG
> > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> detected
> > by nonce: [-687451119961178644:7664336281906118656], [state 0, hasWait
> > false, activity 00:54:36.240]
> > 2015-12-01 00:56:09,015 DEBUG
> > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> detected
> > by nonce: [-687451119961178644:-7119840249342174227], [state 0, hasWait
> > false, activity 00:54:36.256]
> > 2015-12-01 00:56:09,268 DEBUG
> > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> detected
> > by nonce: [-5946137511131403479:2112661701888365489], [state 0, hasWait
> > false, activity 00:55:01.259]
> > 2015-12-01 00:56:09,279 DEBUG
> > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> detected
> > by nonce: [4165332617675853029:6256955295384472057], [state 0, hasWait
> > false, activity 00:53:58.151]
> > 2015-12-01 00:56:09,279 DEBUG
> > org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict
> detected
> > by nonce: [4165332617675853029:4961178013070912522], [state 0, hasWait
> > false, activity 00:53:58.162]
> >
> >
> > On Mon, Nov 30, 2015 at 6:11 PM Bryan Beaudreault <
> > bbeaudrea...@hubspot.com> wrote:
> >
> >> Sorry the second link should be
> >>
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L579
> >>
> >> On Mon, Nov 30, 2015 at 6:10 PM Bryan Beaudreault <
> >> bbeaudrea...@hubspot.com> wrote:
> >>
> >>> https://gist.github.com/bbeaudreault/2994a748da83d9f75085
> >>>
> >>> An active handler:
> >>>
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L286
> >>> One that is locked:
> >>>
> https://git.hubteam.com/gist/jwilliams/80f37999bfdf55119588#file-gistfile1-txt-L579
> >>>
> >>> The difference between pre-rollback and post is that previously we were
> >>> seeing things blocked in mvcc.  Now we are seeing them blocked on the
> >>> upsert.
> >>>
> >>> It always follows the same pattern, of 1 active handler in the upsert
> >>> and the rest blocked waiting for it.
> >>>
> >>> On Mon, Nov 30, 2015 at 6:05 PM Stack  wrote:
> >>>
>  On Mon, Nov 30, 2015 at 2:31 PM, Bryan Beaudreault <
>  bbeaudrea...@hubspot.com
>  > wrote:
> 
>  > The rollback seems to have mostly solved the issue for one of our
>  clusters,
>  > but another one is still seeing long increment times:
>  >
>  > "slowIncrementCount": 52080,
>  > "Increment_num_ops": 325236,"Increment_min": 1,"Increment_max":
> 6162,"
>  > Increment_mean": 465.68678129112396,"Increment_median": 216,"
>  > Increment_75th_percentile": 450.25,"Increment_95th_percentile":
>  > 1052.64999,"Increment_99th_percentile": 1635.23998
>  >
>  >
>  > Any ideas if there are other changes that may be causing a
> performance
>  > regression for increments between CDH4.7.1 and CDH5.3.8?
>  >
>  >
>  >
>  No.
> 
>  Post a thread dump Bryan and it might prompt something.
> 
>  St.Ack
> 
> 
> 
> 
>  >
>  > On Mon, Nov 30, 2015 at 4:13 PM Stack  wrote:
>  >
>  > > On Mon, Nov 30, 2015 at 12:54 PM, Bryan Beaudreault <
>  > > bbeaudrea...@hubspot.com> wrote:
>  > >
>  > > > Should this be added as a known issue in the CDH or hbase
>  > documentation?
>  > > It
>  > > > was a severe performance hit for us, all of our regionservers
> were
>  > > sitting
>  > > > at a few thousand queued requests.
>  > > >
>  > > >
>  > > Let me take care of that.
>  > > St.Ack
>  > >
>  > >
>  > >
>  > > > On Mon, Nov 30, 2015 at 3:53 PM Bryan Beaudreault <
>  > > > bbeaudrea...@hubspot.com>
>  > > > wrote:
>  > > >
>  > > > > Yea, they are all over the place and called from client and
>  > coprocessor
>  > > > > code. We ended up having no ot

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Bryan Beaudreault
Those log lines have settled down; they may have been related to a
cluster-wide forced restart at the time.

On Mon, Nov 30, 2015 at 7:59 PM Bryan Beaudreault 
wrote:

> We've been doing more debugging of this and have set up the read vs write
> handlers to try to at least segment this away so reads can work. We have
> pretty beefy servers, and are running with the following settings:
>
> hbase.regionserver.handler.count=1000
> hbase.ipc.server.read.threadpool.size=50
> hbase.ipc.server.callqueue.handler.factor=0.025
> hbase.ipc.server.callqueue.read.ratio=0.6
> hbase.ipc.server.callqueue.scan.ratio=0.5
>
> We are seeing all 400 write handlers taken up by row locks for the most
> part. The read handlers are mostly idle. We're thinking of changing the
> ratio here, but are not sure it will help if they are all blocked on a row
> lock.  We just enabled DEBUG logging on all our servers and notice the
> following:
>
> 2015-12-01 00:56:09,015 DEBUG
> org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict detected
> by nonce: [-687451119961178644:7664336281906118656], [state 0, hasWait
> false, activity 00:54:36.240]
> 2015-12-01 00:56:09,015 DEBUG
> org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict detected
> by nonce: [-687451119961178644:-7119840249342174227], [state 0, hasWait
> false, activity 00:54:36.256]
> 2015-12-01 00:56:09,268 DEBUG
> org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict detected
> by nonce: [-5946137511131403479:2112661701888365489], [state 0, hasWait
> false, activity 00:55:01.259]
> 2015-12-01 00:56:09,279 DEBUG
> org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict detected
> by nonce: [4165332617675853029:6256955295384472057], [state 0, hasWait
> false, activity 00:53:58.151]
> 2015-12-01 00:56:09,279 DEBUG
> org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict detected
> by nonce: [4165332617675853029:4961178013070912522], [state 0, hasWait
> false, activity 00:53:58.162]
>
>
> On Mon, Nov 30, 2015 at 6:11 PM Bryan Beaudreault <
> bbeaudrea...@hubspot.com> wrote:
>
>> Sorry the second link should be
>> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L579
>>
>> On Mon, Nov 30, 2015 at 6:10 PM Bryan Beaudreault <
>> bbeaudrea...@hubspot.com> wrote:
>>
>>> https://gist.github.com/bbeaudreault/2994a748da83d9f75085
>>>
>>> An active handler:
>>> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L286
>>> One that is locked:
>>> https://git.hubteam.com/gist/jwilliams/80f37999bfdf55119588#file-gistfile1-txt-L579
>>>
>>> The difference between pre-rollback and post is that previously we were
>>> seeing things blocked in mvcc.  Now we are seeing them blocked on the
>>> upsert.
>>>
>>> It always follows the same pattern, of 1 active handler in the upsert
>>> and the rest blocked waiting for it.
>>>
>>> On Mon, Nov 30, 2015 at 6:05 PM Stack  wrote:
>>>
 On Mon, Nov 30, 2015 at 2:31 PM, Bryan Beaudreault <
 bbeaudrea...@hubspot.com
 > wrote:

 > The rollback seems to have mostly solved the issue for one of our
 clusters,
 > but another one is still seeing long increment times:
 >
 > "slowIncrementCount": 52080,
 > "Increment_num_ops": 325236,"Increment_min": 1,"Increment_max": 6162,"
 > Increment_mean": 465.68678129112396,"Increment_median": 216,"
 > Increment_75th_percentile": 450.25,"Increment_95th_percentile":
 > 1052.64999,"Increment_99th_percentile": 1635.23998
 >
 >
 > Any ideas if there are other changes that may be causing a performance
 > regression for increments between CDH4.7.1 and CDH5.3.8?
 >
 >
 >
 No.

 Post a thread dump Bryan and it might prompt something.

 St.Ack




 >
 > On Mon, Nov 30, 2015 at 4:13 PM Stack  wrote:
 >
 > > On Mon, Nov 30, 2015 at 12:54 PM, Bryan Beaudreault <
 > > bbeaudrea...@hubspot.com> wrote:
 > >
 > > > Should this be added as a known issue in the CDH or hbase
 > documentation?
 > > It
 > > > was a severe performance hit for us, all of our regionservers were
 > > sitting
 > > > at a few thousand queued requests.
 > > >
 > > >
 > > Let me take care of that.
 > > St.Ack
 > >
 > >
 > >
 > > > On Mon, Nov 30, 2015 at 3:53 PM Bryan Beaudreault <
 > > > bbeaudrea...@hubspot.com>
 > > > wrote:
 > > >
 > > > > Yea, they are all over the place and called from client and
 > coprocessor
 > > > > code. We ended up having no other option but to rollback, and
 aside
 > > from
 > > > a
 > > > > few NoSuchMethodErrors due to API changes (Put#add vs
 Put#addColumn),
 > > it
 > > > > seems to be working and fixing our problem.
 > > > >
 > > > > On Mon, Nov 30, 2015 at 3:47 PM Stack  wrote:
 > > > >
 > > > >> Rollback is untested. No fix in 5.5. I was going to work on

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Bryan Beaudreault
We've been doing more debugging of this and have set up the read vs write
handlers to try to at least segment this away so reads can work. We have
pretty beefy servers, and are running with the following settings:

hbase.regionserver.handler.count=1000
hbase.ipc.server.read.threadpool.size=50
hbase.ipc.server.callqueue.handler.factor=0.025
hbase.ipc.server.callqueue.read.ratio=0.6
hbase.ipc.server.callqueue.scan.ratio=0.5
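For what it's worth, here is how we read those numbers, assuming the
1.0-era RWQueueRpcExecutor semantics (our understanding, not authoritative):
the handler factor sets the number of call queues, the read ratio splits
queues and handlers between reads and writes, and the scan ratio carves
scan queues out of the read side. The rounding is a sketch, but it is where
the "400 write handlers" figure below comes from.

// Back-of-the-envelope split for the settings above, assuming HBase 1.0's
// RWQueueRpcExecutor behaviour (our reading of it; rounding may differ).
public class CallQueueMath {
  public static void main(String[] args) {
    int handlerCount = 1000;       // hbase.regionserver.handler.count
    double handlerFactor = 0.025;  // hbase.ipc.server.callqueue.handler.factor
    double readRatio = 0.6;        // hbase.ipc.server.callqueue.read.ratio
    double scanRatio = 0.5;        // hbase.ipc.server.callqueue.scan.ratio

    int numQueues = Math.max(1, (int) Math.round(handlerCount * handlerFactor));
    int readQueues = (int) Math.floor(numQueues * readRatio);
    int writeQueues = numQueues - readQueues;
    int scanQueues = (int) Math.floor(readQueues * scanRatio);
    int readHandlers = (int) Math.floor(handlerCount * readRatio);
    int writeHandlers = handlerCount - readHandlers;

    System.out.printf("queues=%d (read=%d, of which scan=%d; write=%d)%n",
        numQueues, readQueues, scanQueues, writeQueues);
    System.out.printf("handlers=%d (read=%d, write=%d)%n",
        handlerCount, readHandlers, writeHandlers);
    // => roughly 25 queues, 600 read handlers and 400 write handlers, which
    //    lines up with the "all 400 write handlers" observation below.
  }
}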

We are seeing all 400 write handlers taken up by row locks for the most
part. The read handlers are mostly idle. We're thinking of changing the
ratio here, but are not sure it will help if they are all blocked on a row
lock.  We just enabled DEBUG logging on all our servers and notice the
following:

2015-12-01 00:56:09,015 DEBUG
org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict detected
by nonce: [-687451119961178644:7664336281906118656], [state 0, hasWait
false, activity 00:54:36.240]
2015-12-01 00:56:09,015 DEBUG
org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict detected
by nonce: [-687451119961178644:-7119840249342174227], [state 0, hasWait
false, activity 00:54:36.256]
2015-12-01 00:56:09,268 DEBUG
org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict detected
by nonce: [-5946137511131403479:2112661701888365489], [state 0, hasWait
false, activity 00:55:01.259]
2015-12-01 00:56:09,279 DEBUG
org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict detected
by nonce: [4165332617675853029:6256955295384472057], [state 0, hasWait
false, activity 00:53:58.151]
2015-12-01 00:56:09,279 DEBUG
org.apache.hadoop.hbase.regionserver.ServerNonceManager: Conflict detected
by nonce: [4165332617675853029:4961178013070912522], [state 0, hasWait
false, activity 00:53:58.162]
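(Side note on those DEBUG lines: the nonce group is, as far as we
understand it, per client connection, so a quick tally of conflicts per
group can show whether they all come from a handful of clients retrying
increments. A rough parser for the log format above; the log file path is
whatever you point it at.)

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Tally "Conflict detected by nonce: [group:nonce]" lines per nonce group.
public class NonceConflictTally {
  private static final Pattern NONCE =
      Pattern.compile("Conflict detected by nonce: \\[(-?\\d+):(-?\\d+)\\]");

  public static void main(String[] args) throws Exception {
    Map<String, Integer> byGroup = new HashMap<>();
    try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
      String line;
      while ((line = in.readLine()) != null) {
        Matcher m = NONCE.matcher(line);
        if (m.find()) {
          byGroup.merge(m.group(1), 1, Integer::sum);
        }
      }
    }
    byGroup.entrySet().stream()
        .sorted((a, b) -> b.getValue() - a.getValue())
        .limit(20)
        .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue() + " conflicts"));
  }
}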


On Mon, Nov 30, 2015 at 6:11 PM Bryan Beaudreault 
wrote:

> Sorry the second link should be
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L579
>
> On Mon, Nov 30, 2015 at 6:10 PM Bryan Beaudreault <
> bbeaudrea...@hubspot.com> wrote:
>
>> https://gist.github.com/bbeaudreault/2994a748da83d9f75085
>>
>> An active handler:
>> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L286
>> One that is locked:
>> https://git.hubteam.com/gist/jwilliams/80f37999bfdf55119588#file-gistfile1-txt-L579
>>
>> The difference between pre-rollback and post is that previously we were
>> seeing things blocked in mvcc.  Now we are seeing them blocked on the
>> upsert.
>>
>> It always follows the same pattern, of 1 active handler in the upsert and
>> the rest blocked waiting for it.
>>
>> On Mon, Nov 30, 2015 at 6:05 PM Stack  wrote:
>>
>>> On Mon, Nov 30, 2015 at 2:31 PM, Bryan Beaudreault <
>>> bbeaudrea...@hubspot.com
>>> > wrote:
>>>
>>> > The rollback seems to have mostly solved the issue for one of our
>>> clusters,
>>> > but another one is still seeing long increment times:
>>> >
>>> > "slowIncrementCount": 52080,
>>> > "Increment_num_ops": 325236,"Increment_min": 1,"Increment_max": 6162,"
>>> > Increment_mean": 465.68678129112396,"Increment_median": 216,"
>>> > Increment_75th_percentile": 450.25,"Increment_95th_percentile":
>>> > 1052.64999,"Increment_99th_percentile": 1635.23998
>>> >
>>> >
>>> > Any ideas if there are other changes that may be causing a performance
>>> > regression for increments between CDH4.7.1 and CDH5.3.8?
>>> >
>>> >
>>> >
>>> No.
>>>
>>> Post a thread dump Bryan and it might prompt something.
>>>
>>> St.Ack
>>>
>>>
>>>
>>>
>>> >
>>> > On Mon, Nov 30, 2015 at 4:13 PM Stack  wrote:
>>> >
>>> > > On Mon, Nov 30, 2015 at 12:54 PM, Bryan Beaudreault <
>>> > > bbeaudrea...@hubspot.com> wrote:
>>> > >
>>> > > > Should this be added as a known issue in the CDH or hbase
>>> > documentation?
>>> > > It
>>> > > > was a severe performance hit for us, all of our regionservers were
>>> > > sitting
>>> > > > at a few thousand queued requests.
>>> > > >
>>> > > >
>>> > > Let me take care of that.
>>> > > St.Ack
>>> > >
>>> > >
>>> > >
>>> > > > On Mon, Nov 30, 2015 at 3:53 PM Bryan Beaudreault <
>>> > > > bbeaudrea...@hubspot.com>
>>> > > > wrote:
>>> > > >
>>> > > > > Yea, they are all over the place and called from client and
>>> > coprocessor
>>> > > > > code. We ended up having no other option but to rollback, and
>>> aside
>>> > > from
>>> > > > a
>>> > > > > few NoSuchMethodErrors due to API changes (Put#add vs
>>> Put#addColumn),
>>> > > it
>>> > > > > seems to be working and fixing our problem.
>>> > > > >
>>> > > > > On Mon, Nov 30, 2015 at 3:47 PM Stack  wrote:
>>> > > > >
>>> > > > >> Rollback is untested. No fix in 5.5. I was going to work on this
>>> > now.
>>> > > > >> Where
>>> > > > >> are your counters Bryan? In their own column family or scattered
>>> > about
>>> > > > in
>>> > > > >> a
>>> > > > >> row with other Cell types?
>>> > > > >> St.Ack
>>> > > > >>
>>> > > > >> On Mon, Nov 30, 2015 at 10:28 AM, Bryan Beaudreault <
>>> > > > >> bbeaudrea...@hubspot.com> wrote:
>>> > > > >>
>>> > > > >

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Bryan Beaudreault
Sorry the second link should be
https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L579

On Mon, Nov 30, 2015 at 6:10 PM Bryan Beaudreault 
wrote:

> https://gist.github.com/bbeaudreault/2994a748da83d9f75085
>
> An active handler:
> https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L286
> One that is locked:
> https://git.hubteam.com/gist/jwilliams/80f37999bfdf55119588#file-gistfile1-txt-L579
>
> The difference between pre-rollback and post is that previously we were
> seeing things blocked in mvcc.  Now we are seeing them blocked on the
> upsert.
>
> It always follows the same pattern, of 1 active handler in the upsert and
> the rest blocked waiting for it.
>
> On Mon, Nov 30, 2015 at 6:05 PM Stack  wrote:
>
>> On Mon, Nov 30, 2015 at 2:31 PM, Bryan Beaudreault <
>> bbeaudrea...@hubspot.com
>> > wrote:
>>
>> > The rollback seems to have mostly solved the issue for one of our
>> clusters,
>> > but another one is still seeing long increment times:
>> >
>> > "slowIncrementCount": 52080,
>> > "Increment_num_ops": 325236,"Increment_min": 1,"Increment_max": 6162,"
>> > Increment_mean": 465.68678129112396,"Increment_median": 216,"
>> > Increment_75th_percentile": 450.25,"Increment_95th_percentile":
>> > 1052.64999,"Increment_99th_percentile": 1635.23998
>> >
>> >
>> > Any ideas if there are other changes that may be causing a performance
>> > regression for increments between CDH4.7.1 and CDH5.3.8?
>> >
>> >
>> >
>> No.
>>
>> Post a thread dump Bryan and it might prompt something.
>>
>> St.Ack
>>
>>
>>
>>
>> >
>> > On Mon, Nov 30, 2015 at 4:13 PM Stack  wrote:
>> >
>> > > On Mon, Nov 30, 2015 at 12:54 PM, Bryan Beaudreault <
>> > > bbeaudrea...@hubspot.com> wrote:
>> > >
>> > > > Should this be added as a known issue in the CDH or hbase
>> > documentation?
>> > > It
>> > > > was a severe performance hit for us, all of our regionservers were
>> > > sitting
>> > > > at a few thousand queued requests.
>> > > >
>> > > >
>> > > Let me take care of that.
>> > > St.Ack
>> > >
>> > >
>> > >
>> > > > On Mon, Nov 30, 2015 at 3:53 PM Bryan Beaudreault <
>> > > > bbeaudrea...@hubspot.com>
>> > > > wrote:
>> > > >
>> > > > > Yea, they are all over the place and called from client and
>> > coprocessor
>> > > > > code. We ended up having no other option but to rollback, and
>> aside
>> > > from
>> > > > a
>> > > > > few NoSuchMethodErrors due to API changes (Put#add vs
>> Put#addColumn),
>> > > it
>> > > > > seems to be working and fixing our problem.
>> > > > >
>> > > > > On Mon, Nov 30, 2015 at 3:47 PM Stack  wrote:
>> > > > >
>> > > > >> Rollback is untested. No fix in 5.5. I was going to work on this
>> > now.
>> > > > >> Where
>> > > > >> are your counters Bryan? In their own column family or scattered
>> > about
>> > > > in
>> > > > >> a
>> > > > >> row with other Cell types?
>> > > > >> St.Ack
>> > > > >>
>> > > > >> On Mon, Nov 30, 2015 at 10:28 AM, Bryan Beaudreault <
>> > > > >> bbeaudrea...@hubspot.com> wrote:
>> > > > >>
>> > > > >> > Is there any update to this? We just upgraded all of our
>> > production
>> > > > >> > clusters from CDH4 to CDH5.4.7 and, not seeing this JIRA
>> listed in
>> > > the
>> > > > >> > known issues, did not know about this.  Now we are seeing
>> > performance
>> > > > >> issues
>> > > > >> > across all clusters, as we make heavy use of increments.
>> > > > >> >
>> > > > >> > Can we roll forward to CDH5.5 to fix? Or is our only hope to
>> roll
>> > > back
>> > > > >> to
>> > > > >> > CDH 5.3.1 (if that is possible)?
>> > > > >> >
>> > > > >> >
>> > > > >> > On Thu, Sep 24, 2015 at 5:06 AM 鈴木俊裕 
>> wrote:
>> > > > >> >
>> > > > >> > > Thank you St.Ack!
>> > > > >> > >
>> > > > >> > > I would like to follow the ticket.
>> > > > >> > >
>> > > > >> > > Toshihiro Suzuki
>> > > > >> > >
>> > > > >> > > 2015-09-22 14:14 GMT+09:00 Stack :
>> > > > >> > >
>> > > > >> > > > Back to this problem. Simple tests confirm that as is, the
>> > > > >> > > > single-queue-backed MVCC instance can slow Region ops if
>> some
>> > > > other
>> > > > >> row
>> > > > >> > > is
>> > > > >> > > > slow to complete. In particular Increment, checkAndPut, and
>> > > batch
>> > > > >> > > mutations
>> > > > >> > > > are affected. I opened HBASE-14460 to start in on a fix up.
>> > Lets
>> > > > >> see if
>> > > > >> > > we
>> > > > >> > > > can somehow scope mvcc to row or at least shard mvcc so not
>> > all
>> > > > >> Region
>> > > > >> > > ops
>> > > > >> > > > are paused.
>> > > > >> > > >
>> > > > >> > > > St.Ack
>> > > > >> > > >
>> > > > >> > > >
>> > > > >> > > > On Mon, Sep 14, 2015 at 4:15 AM, 鈴木俊裕 
>> > > wrote:
>> > > > >> > > >
>> > > > >> > > > > > Thank you for the below reasoning (with accompanying
>> > helpful
>> > > > >> > > diagram).
>> > > > >> > > > > > Makes sense. Let me hack up a test case to help with
>> the
>> > > > >> > > illustration.
>> > > > >> > > > It
>> > > > >> > > > > > is as though the mvcc should be scoped to a row onl

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Bryan Beaudreault
https://gist.github.com/bbeaudreault/2994a748da83d9f75085

An active handler:
https://gist.github.com/bbeaudreault/2994a748da83d9f75085#file-gistfile1-txt-L286
One that is locked:
https://git.hubteam.com/gist/jwilliams/80f37999bfdf55119588#file-gistfile1-txt-L579

The difference between pre-rollback and post is that previously we were
seeing things blocked in mvcc.  Now we are seeing them blocked on the
upsert.

It always follows the same pattern: one active handler in the upsert and
the rest blocked waiting for it.

On Mon, Nov 30, 2015 at 6:05 PM Stack  wrote:

> On Mon, Nov 30, 2015 at 2:31 PM, Bryan Beaudreault <
> bbeaudrea...@hubspot.com
> > wrote:
>
> > The rollback seems to have mostly solved the issue for one of our
> clusters,
> > but another one is still seeing long increment times:
> >
> > "slowIncrementCount": 52080,
> > "Increment_num_ops": 325236,"Increment_min": 1,"Increment_max": 6162,"
> > Increment_mean": 465.68678129112396,"Increment_median": 216,"
> > Increment_75th_percentile": 450.25,"Increment_95th_percentile":
> > 1052.64999,"Increment_99th_percentile": 1635.23998
> >
> >
> > Any ideas if there are other changes that may be causing a performance
> > regression for increments between CDH4.7.1 and CDH5.3.8?
> >
> >
> >
> No.
>
> Post a thread dump Bryan and it might prompt something.
>
> St.Ack
>
>
>
>
> >
> > On Mon, Nov 30, 2015 at 4:13 PM Stack  wrote:
> >
> > > On Mon, Nov 30, 2015 at 12:54 PM, Bryan Beaudreault <
> > > bbeaudrea...@hubspot.com> wrote:
> > >
> > > > Should this be added as a known issue in the CDH or hbase
> > documentation?
> > > It
> > > > was a severe performance hit for us, all of our regionservers were
> > > sitting
> > > > at a few thousand queued requests.
> > > >
> > > >
> > > Let me take care of that.
> > > St.Ack
> > >
> > >
> > >
> > > > On Mon, Nov 30, 2015 at 3:53 PM Bryan Beaudreault <
> > > > bbeaudrea...@hubspot.com>
> > > > wrote:
> > > >
> > > > > Yea, they are all over the place and called from client and
> > coprocessor
> > > > > code. We ended up having no other option but to rollback, and aside
> > > from
> > > > a
> > > > > few NoSuchMethodErrors due to API changes (Put#add vs
> Put#addColumn),
> > > it
> > > > > seems to be working and fixing our problem.
> > > > >
> > > > > On Mon, Nov 30, 2015 at 3:47 PM Stack  wrote:
> > > > >
> > > > >> Rollback is untested. No fix in 5.5. I was going to work on this
> > now.
> > > > >> Where
> > > > >> are your counters Bryan? In their own column family or scattered
> > about
> > > > in
> > > > >> a
> > > > >> row with other Cell types?
> > > > >> St.Ack
> > > > >>
> > > > >> On Mon, Nov 30, 2015 at 10:28 AM, Bryan Beaudreault <
> > > > >> bbeaudrea...@hubspot.com> wrote:
> > > > >>
> > > > >> > Is there any update to this? We just upgraded all of our
> > production
> > > > >> > clusters from CDH4 to CDH5.4.7 and, not seeing this JIRA listed
> in
> > > the
> > > > >> > known issues, did not know about this.  Now we are seeing
> > performance
> > > > >> issues
> > > > >> > across all clusters, as we make heavy use of increments.
> > > > >> >
> > > > >> > Can we roll forward to CDH5.5 to fix? Or is our only hope to
> roll
> > > back
> > > > >> to
> > > > >> > CDH 5.3.1 (if that is possible)?
> > > > >> >
> > > > >> >
> > > > >> > On Thu, Sep 24, 2015 at 5:06 AM 鈴木俊裕 
> wrote:
> > > > >> >
> > > > >> > > Thank you St.Ack!
> > > > >> > >
> > > > >> > > I would like to follow the ticket.
> > > > >> > >
> > > > >> > > Toshihiro Suzuki
> > > > >> > >
> > > > >> > > 2015-09-22 14:14 GMT+09:00 Stack :
> > > > >> > >
> > > > >> > > > Back to this problem. Simple tests confirm that as is, the
> > > > >> > > > single-queue-backed MVCC instance can slow Region ops if
> some
> > > > other
> > > > >> row
> > > > >> > > is
> > > > >> > > > slow to complete. In particular Increment, checkAndPut, and
> > > batch
> > > > >> > > mutations
> > > > >> > > > are affected. I opened HBASE-14460 to start in on a fix up.
> > Lets
> > > > >> see if
> > > > >> > > we
> > > > >> > > > can somehow scope mvcc to row or at least shard mvcc so not
> > all
> > > > >> Region
> > > > >> > > ops
> > > > >> > > > are paused.
> > > > >> > > >
> > > > >> > > > St.Ack
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > On Mon, Sep 14, 2015 at 4:15 AM, 鈴木俊裕 
> > > wrote:
> > > > >> > > >
> > > > >> > > > > > Thank you for the below reasoning (with accompanying
> > helpful
> > > > >> > > diagram).
> > > > >> > > > > > Makes sense. Let me hack up a test case to help with the
> > > > >> > > illustration.
> > > > >> > > > It
> > > > >> > > > > > is as though the mvcc should be scoped to a row only...
> > > Writes
> > > > >> > > against
> > > > >> > > > > > other rows should not hold up my read of my row. Tag an
> > mvcc
> > > > >> with a
> > > > >> > > > 'row'
> > > > >> > > > > > scope so we can see which on-going writes pertain to
> > current
> > > > >> > > operation?
> > > > >> > > > > Thank you St.Ack! I think this approach would

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Stack
On Mon, Nov 30, 2015 at 2:31 PM, Bryan Beaudreault  wrote:

> The rollback seems to have mostly solved the issue for one of our clusters,
> but another one is still seeing long increment times:
>
> "slowIncrementCount": 52080,
> "Increment_num_ops": 325236,"Increment_min": 1,"Increment_max": 6162,"
> Increment_mean": 465.68678129112396,"Increment_median": 216,"
> Increment_75th_percentile": 450.25,"Increment_95th_percentile":
> 1052.64999,"Increment_99th_percentile": 1635.23998
>
>
> Any ideas if there are other changes that may be causing a performance
> regression for increments between CDH4.7.1 and CDH5.3.8?
>
>
>
No.

Post a thread dump Bryan and it might prompt something.

St.Ack




>
> On Mon, Nov 30, 2015 at 4:13 PM Stack  wrote:
>
> > On Mon, Nov 30, 2015 at 12:54 PM, Bryan Beaudreault <
> > bbeaudrea...@hubspot.com> wrote:
> >
> > > Should this be added as a known issue in the CDH or hbase
> documentation?
> > It
> > > was a severe performance hit for us, all of our regionservers were
> > sitting
> > > at a few thousand queued requests.
> > >
> > >
> > Let me take care of that.
> > St.Ack
> >
> >
> >
> > > On Mon, Nov 30, 2015 at 3:53 PM Bryan Beaudreault <
> > > bbeaudrea...@hubspot.com>
> > > wrote:
> > >
> > > > Yea, they are all over the place and called from client and
> coprocessor
> > > > code. We ended up having no other option but to rollback, and aside
> > from
> > > a
> > > > few NoSuchMethodErrors due to API changes (Put#add vs Put#addColumn),
> > it
> > > > seems to be working and fixing our problem.
> > > >
> > > > On Mon, Nov 30, 2015 at 3:47 PM Stack  wrote:
> > > >
> > > >> Rollback is untested. No fix in 5.5. I was going to work on this
> now.
> > > >> Where
> > > >> are your counters Bryan? In their own column family or scattered
> about
> > > in
> > > >> a
> > > >> row with other Cell types?
> > > >> St.Ack
> > > >>
> > > >> On Mon, Nov 30, 2015 at 10:28 AM, Bryan Beaudreault <
> > > >> bbeaudrea...@hubspot.com> wrote:
> > > >>
> > > >> > Is there any update to this? We just upgraded all of our
> production
> > > >> > clusters from CDH4 to CDH5.4.7 and, not seeing this JIRA listed in
> > the
> > > >> > known issues, did not know about this. Now we are seeing
> performance
> > > >> issues
> > > >> > across all clusters, as we make heavy use of increments.
> > > >> >
> > > >> > Can we roll forward to CDH5.5 to fix? Or is our only hope to roll
> > back
> > > >> to
> > > >> > CDH 5.3.1 (if that is possible)?
> > > >> >
> > > >> >
> > > >> > On Thu, Sep 24, 2015 at 5:06 AM 鈴木俊裕  wrote:
> > > >> >
> > > >> > > Thank you St.Ack!
> > > >> > >
> > > >> > > I would like to follow the ticket.
> > > >> > >
> > > >> > > Toshihiro Suzuki
> > > >> > >
> > > >> > > 2015-09-22 14:14 GMT+09:00 Stack :
> > > >> > >
> > > >> > > > Back to this problem. Simple tests confirm that as is, the
> > > >> > > > single-queue-backed MVCC instance can slow Region ops if some
> > > other
> > > >> row
> > > >> > > is
> > > >> > > > slow to complete. In particular Increment, checkAndPut, and
> > batch
> > > >> > > mutations
> > > >> > > > are affected. I opened HBASE-14460 to start in on a fix up.
> Let's
> > > >> see if
> > > >> > > we
> > > >> > > > can somehow scope mvcc to row or at least shard mvcc so not
> all
> > > >> Region
> > > >> > > ops
> > > >> > > > are paused.
> > > >> > > >
> > > >> > > > St.Ack
> > > >> > > >
> > > >> > > >
> > > >> > > > On Mon, Sep 14, 2015 at 4:15 AM, 鈴木俊裕 
> > wrote:
> > > >> > > >
> > > >> > > > > > Thank you for the below reasoning (with accompanying
> helpful
> > > >> > > diagram).
> > > >> > > > > > Makes sense. Let me hack up a test case to help with the
> > > >> > > illustration.
> > > >> > > > It
> > > >> > > > > > is as though the mvcc should be scoped to a row only...
> > Writes
> > > >> > > against
> > > >> > > > > > other rows should not hold up my read of my row. Tag an
> mvcc
> > > >> with a
> > > >> > > > 'row'
> > > >> > > > > > scope so we can see which on-going writes pertain to
> current
> > > >> > > operation?
> > > >> > > > > Thank you St.Ack! I think this approach would work.
> > > >> > > > >
> > > >> > > > > > You need to read back the increment and have it be
> 'correct'
> > > at
> > > >> > > > increment
> > > >> > > > > > time?
> > > >> > > > > Yes, we need it.
> > > >> > > > >
> > > >> > > > > I would like to help if there is anything I can do.
> > > >> > > > >
> > > >> > > > > Thanks,
> > > >> > > > > Toshihiro Suzuki
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > 2015-09-13 14:11 GMT+09:00 Stack :
> > > >> > > > >
> > > >> > > > > > Thank you for the below reasoning (with accompanying
> helpful
> > > >> > > diagram).
> > > >> > > > > > Makes sense. Let me hack up a test case to help with the
> > > >> > > illustration.
> > > >> > > > It
> > > >> > > > > > is as though the mvcc should be scoped to a row only...
> > Writes
> > > >> > > against
> > > >> > > > > > other rows should not hold up my read of my row. Tag an
> mvcc
> > > >

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Bryan Beaudreault
The rollback seems to have mostly solved the issue for one of our clusters,
but another one is still seeing long increment times:

"slowIncrementCount": 52080,
"Increment_num_ops": 325236,"Increment_min": 1,"Increment_max": 6162,"
Increment_mean": 465.68678129112396,"Increment_median": 216,"
Increment_75th_percentile": 450.25,"Increment_95th_percentile":
1052.64999,"Increment_99th_percentile": 1635.23998


Any ideas if there are other changes that may be causing a performance
regression for increments between CDH4.7.1 and CDH5.3.8?
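
The numbers above come from the RegionServer metrics dump. A minimal sketch for pulling just the increment-related lines out of the /jmx servlet is below; the host name is made up and the info port is an assumption (60030 on older builds, 16030 on stock HBase 1.0), so adjust for the actual deployment:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Sketch: fetch the RegionServer /jmx servlet and print increment-latency lines.
// Host and port below are assumptions, not part of the original thread.
public class IncrementMetricsPeek {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://regionserver.example.com:60030/jmx");
    try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.contains("Increment") || line.contains("slowIncrementCount")) {
          System.out.println(line.trim());
        }
      }
    }
  }
}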



On Mon, Nov 30, 2015 at 4:13 PM Stack  wrote:

> On Mon, Nov 30, 2015 at 12:54 PM, Bryan Beaudreault <
> bbeaudrea...@hubspot.com> wrote:
>
> > Should this be added as a known issue in the CDH or hbase documentation?
> It
> > was a severe performance hit for us, all of our regionservers were
> sitting
> > at a few thousand queued requests.
> >
> >
> Let me take care of that.
> St.Ack
>
>
>
> > On Mon, Nov 30, 2015 at 3:53 PM Bryan Beaudreault <
> > bbeaudrea...@hubspot.com>
> > wrote:
> >
> > > Yea, they are all over the place and called from client and coprocessor
> > > code. We ended up having no other option but to rollback, and aside
> from
> > a
> > > few NoSuchMethodErrors due to API changes (Put#add vs Put#addColumn),
> it
> > > seems to be working and fixing our problem.
> > >
> > > On Mon, Nov 30, 2015 at 3:47 PM Stack  wrote:
> > >
> > >> Rollback is untested. No fix in 5.5. I was going to work on this now.
> > >> Where
> > >> are your counters Bryan? In their own column family or scattered about
> > in
> > >> a
> > >> row with other Cell types?
> > >> St.Ack
> > >>
> > >> On Mon, Nov 30, 2015 at 10:28 AM, Bryan Beaudreault <
> > >> bbeaudrea...@hubspot.com> wrote:
> > >>
> > >> > Is there any update to this? We just upgraded all of our production
> > >> > clusters from CDH4 to CDH5.4.7 and, not seeing this JIRA listed in
> the
> > >> > known issues, did not know about this. Now we are seeing performance
> > >> issues
> > >> > across all clusters, as we make heavy use of increments.
> > >> >
> > >> > Can we roll forward to CDH5.5 to fix? Or is our only hope to roll
> back
> > >> to
> > >> > CDH 5.3.1 (if that is possible)?
> > >> >
> > >> >
> > >> > On Thu, Sep 24, 2015 at 5:06 AM 鈴木俊裕  wrote:
> > >> >
> > >> > > Thank you St.Ack!
> > >> > >
> > >> > > I would like to follow the ticket.
> > >> > >
> > >> > > Toshihiro Suzuki
> > >> > >
> > >> > > 2015-09-22 14:14 GMT+09:00 Stack :
> > >> > >
> > >> > > > Back to this problem. Simple tests confirm that as is, the
> > >> > > > single-queue-backed MVCC instance can slow Region ops if some
> > other
> > >> row
> > >> > > is
> > >> > > > slow to complete. In particular Increment, checkAndPut, and
> batch
> > >> > > mutations
> > >> > > > are affected. I opened HBASE-14460 to start in on a fix up. Let's
> > >> see if
> > >> > > we
> > >> > > > can somehow scope mvcc to row or at least shard mvcc so not all
> > >> Region
> > >> > > ops
> > >> > > > are paused.
> > >> > > >
> > >> > > > St.Ack
> > >> > > >
> > >> > > >
> > >> > > > On Mon, Sep 14, 2015 at 4:15 AM, 鈴木俊裕 
> wrote:
> > >> > > >
> > >> > > > > > Thank you for the below reasoning (with accompanying helpful
> > >> > > diagram).
> > >> > > > > > Makes sense. Let me hack up a test case to help with the
> > >> > > illustration.
> > >> > > > It
> > >> > > > > > is as though the mvcc should be scoped to a row only...
> Writes
> > >> > > against
> > >> > > > > > other rows should not hold up my read of my row. Tag an mvcc
> > >> with a
> > >> > > > 'row'
> > >> > > > > > scope so we can see which on-going writes pertain to current
> > >> > > operation?
> > >> > > > > Thank you St.Ack! I think this approach would work.
> > >> > > > >
> > >> > > > > > You need to read back the increment and have it be 'correct'
> > at
> > >> > > > increment
> > >> > > > > > time?
> > >> > > > > Yes, we need it.
> > >> > > > >
> > >> > > > > I would like to help if there is anything I can do.
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > > Toshihiro Suzuki
> > >> > > > >
> > >> > > > >
> > >> > > > > 2015-09-13 14:11 GMT+09:00 Stack :
> > >> > > > >
> > >> > > > > > Thank you for the below reasoning (with accompanying helpful
> > >> > > diagram).
> > >> > > > > > Makes sense. Let me hack up a test case to help with the
> > >> > > illustration.
> > >> > > > It
> > >> > > > > > is as though the mvcc should be scoped to a row only...
> Writes
> > >> > > against
> > >> > > > > > other rows should not hold up my read of my row. Tag an mvcc
> > >> with a
> > >> > > > 'row'
> > >> > > > > > scope so we can see which on-going writes pertain to current
> > >> > > operation?
> > >> > > > > >
> > >> > > > > > You need to read back the increment and have it be 'correct'
> > at
> > >> > > > increment
> > >> > > > > > time?
> > >> > > > > >
> > >> > > > > > (This is a good one)
> > >> > > > > >
> > >> > > > > > Thank you Toshihiro Suzuki
> > >> > > > > > St.Ack
> > >> > > > > >
> > >

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Stack
On Mon, Nov 30, 2015 at 12:54 PM, Bryan Beaudreault <
bbeaudrea...@hubspot.com> wrote:

> Should this be added as a known issue in the CDH or hbase documentation? It
> was a severe performance hit for us, all of our regionservers were sitting
> at a few thousand queued requests.
>
>
Let me take care of that.
St.Ack



> On Mon, Nov 30, 2015 at 3:53 PM Bryan Beaudreault <
> bbeaudrea...@hubspot.com>
> wrote:
>
> > Yea, they are all over the place and called from client and coprocessor
> > code. We ended up having no other option but to rollback, and aside from
> a
> > few NoSuchMethodErrors due to API changes (Put#add vs Put#addColumn), it
> > seems to be working and fixing our problem.
> >
> > On Mon, Nov 30, 2015 at 3:47 PM Stack  wrote:
> >
> >> Rollback is untested. No fix in 5.5. I was going to work on this now.
> >> Where
> >> are your counters Bryan? In their own column family or scattered about
> in
> >> a
> >> row with other Cell types?
> >> St.Ack
> >>
> >> On Mon, Nov 30, 2015 at 10:28 AM, Bryan Beaudreault <
> >> bbeaudrea...@hubspot.com> wrote:
> >>
> >> > Is there any update to this? We just upgraded all of our production
> >> > clusters from CDH4 to CDH5.4.7 and, not seeing this JIRA listed in the
> >> > known issues, did not know about this. Now we are seeing performance
> >> issues
> >> > across all clusters, as we make heavy use of increments.
> >> >
> >> > Can we roll forward to CDH5.5 to fix? Or is our only hope to roll back
> >> to
> >> > CDH 5.3.1 (if that is possible)?
> >> >
> >> >
> >> > On Thu, Sep 24, 2015 at 5:06 AM 鈴木俊裕  wrote:
> >> >
> >> > > Thank you St.Ack!
> >> > >
> >> > > I would like to follow the ticket.
> >> > >
> >> > > Toshihiro Suzuki
> >> > >
> >> > > 2015-09-22 14:14 GMT+09:00 Stack :
> >> > >
> >> > > > Back to this problem. Simple tests confirm that as is, the
> >> > > > single-queue-backed MVCC instance can slow Region ops if some
> other
> >> row
> >> > > is
> >> > > > slow to complete. In particular Increment, checkAndPut, and batch
> >> > > mutations
> >> > > > are affected. I opened HBASE-14460 to start in on a fix up. Let's
> >> see if
> >> > > we
> >> > > > can somehow scope mvcc to row or at least shard mvcc so not all
> >> Region
> >> > > ops
> >> > > > are paused.
> >> > > >
> >> > > > St.Ack
> >> > > >
> >> > > >
> >> > > > On Mon, Sep 14, 2015 at 4:15 AM, 鈴木俊裕  wrote:
> >> > > >
> >> > > > > > Thank you for the below reasoning (with accompanying helpful
> >> > > diagram).
> >> > > > > > Makes sense. Let me hack up a test case to help with the
> >> > > illustration.
> >> > > > It
> >> > > > > > is as though the mvcc should be scoped to a row only... Writes
> >> > > against
> >> > > > > > other rows should not hold up my read of my row. Tag an mvcc
> >> with a
> >> > > > 'row'
> >> > > > > > scope so we can see which on-going writes pertain to current
> >> > > operation?
> >> > > > > Thank you St.Ack! I think this approach would work.
> >> > > > >
> >> > > > > > You need to read back the increment and have it be 'correct'
> at
> >> > > > increment
> >> > > > > > time?
> >> > > > > Yes, we need it.
> >> > > > >
> >> > > > > I would like to help if there is anything I can do.
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Toshihiro Suzuki
> >> > > > >
> >> > > > >
> >> > > > > 2015-09-13 14:11 GMT+09:00 Stack :
> >> > > > >
> >> > > > > > Thank you for the below reasoning (with accompanying helpful
> >> > > diagram).
> >> > > > > > Makes sense. Let me hack up a test case to help with the
> >> > > illustration.
> >> > > > It
> >> > > > > > is as though the mvcc should be scoped to a row only... Writes
> >> > > against
> >> > > > > > other rows should not hold up my read of my row. Tag an mvcc
> >> with a
> >> > > > 'row'
> >> > > > > > scope so we can see which on-going writes pertain to current
> >> > > operation?
> >> > > > > >
> >> > > > > > You need to read back the increment and have it be 'correct'
> at
> >> > > > increment
> >> > > > > > time?
> >> > > > > >
> >> > > > > > (This is a good one)
> >> > > > > >
> >> > > > > > Thank you Toshihiro Suzuki
> >> > > > > > St.Ack
> >> > > > > >
> >> > > > > > On Sat, Sep 12, 2015 at 8:09 AM, 鈴木俊裕 
> >> wrote:
> >> > > > > >
> >> > > > > > > St.Ack,
> >> > > > > > >
> >> > > > > > > Thank you for your response.
> >> > > > > > >
> >> > > > > > > > The reason I think that "A region lock (not a row lock) seems to
> >> > occur
> >> > > in
> >> > > > > > > waitForPreviousTransactionsComplete()" is as follows:
> >> > > > > > >
> >> > > > > > > An increment operation has 3 procedures for MVCC.
> >> > > > > > >
> >> > > > > > > 1. mvcc.waitForPreviousTransactionsComplete();
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6712
> >> > > > > > >
> >> > > > > > > 2. w = mvcc.beginMemstoreInsertWithSeqNum(mvccNum);
> >> > > > > > >
> 

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Bryan Beaudreault
Should this be added as a known issue in the CDH or hbase documentation? It
was a severe performance hit for us, all of our regionservers were sitting
at a few thousand queued requests.

On Mon, Nov 30, 2015 at 3:53 PM Bryan Beaudreault 
wrote:

> Yea, they are all over the place and called from client and coprocessor
> code. We ended up having no other option but to rollback, and aside from a
> few NoSuchMethodErrors due to API changes (Put#add vs Put#addColumn), it
> seems to be working and fixing our problem.
>
> On Mon, Nov 30, 2015 at 3:47 PM Stack  wrote:
>
>> Rollback is untested. No fix in 5.5. I was going to work on this now.
>> Where
>> are your counters Bryan? In their own column family or scattered about in
>> a
>> row with other Cell types?
>> St.Ack
>>
>> On Mon, Nov 30, 2015 at 10:28 AM, Bryan Beaudreault <
>> bbeaudrea...@hubspot.com> wrote:
>>
>> > Is there any update to this? We just upgraded all of our production
>> > clusters from CDH4 to CDH5.4.7 and, not seeing this JIRA listed in the
>> > known issues, did not know about this. Now we are seeing performance
>> issues
>> > across all clusters, as we make heavy use of increments.
>> >
>> > Can we roll forward to CDH5.5 to fix? Or is our only hope to roll back
>> to
>> > CDH 5.3.1 (if that is possible)?
>> >
>> >
>> > On Thu, Sep 24, 2015 at 5:06 AM 鈴木俊裕  wrote:
>> >
>> > > Thank you St.Ack!
>> > >
>> > > I would like to follow the ticket.
>> > >
>> > > Toshihiro Suzuki
>> > >
>> > > 2015-09-22 14:14 GMT+09:00 Stack :
>> > >
>> > > > Back to this problem. Simple tests confirm that as is, the
>> > > > single-queue-backed MVCC instance can slow Region ops if some other
>> row
>> > > is
>> > > > slow to complete. In particular Increment, checkAndPut, and batch
>> > > mutations
>> > > > are affected. I opened HBASE-14460 to start in on a fix up. Let's
>> see if
>> > > we
>> > > > can somehow scope mvcc to row or at least shard mvcc so not all
>> Region
>> > > ops
>> > > > are paused.
>> > > >
>> > > > St.Ack
>> > > >
>> > > >
>> > > > On Mon, Sep 14, 2015 at 4:15 AM, 鈴木俊裕  wrote:
>> > > >
>> > > > > > Thank you for the below reasoning (with accompanying helpful
>> > > diagram).
>> > > > > > Makes sense. Let me hack up a test case to help with the
>> > > illustration.
>> > > > It
>> > > > > > is as though the mvcc should be scoped to a row only... Writes
>> > > against
>> > > > > > other rows should not hold up my read of my row. Tag an mvcc
>> with a
>> > > > 'row'
>> > > > > > scope so we can see which on-going writes pertain to current
>> > > operation?
>> > > > > Thank you St.Ack! I think this approach would work.
>> > > > >
>> > > > > > You need to read back the increment and have it be 'correct' at
>> > > > increment
>> > > > > > time?
>> > > > > Yes, we need it.
>> > > > >
>> > > > > I would like to help if there is anything I can do.
>> > > > >
>> > > > > Thanks,
>> > > > > Toshihiro Suzuki
>> > > > >
>> > > > >
>> > > > > 2015-09-13 14:11 GMT+09:00 Stack :
>> > > > >
>> > > > > > Thank you for the below reasoning (with accompanying helpful
>> > > diagram).
>> > > > > > Makes sense. Let me hack up a test case to help with the
>> > > illustration.
>> > > > It
>> > > > > > is as though the mvcc should be scoped to a row only... Writes
>> > > against
>> > > > > > other rows should not hold up my read of my row. Tag an mvcc
>> with a
>> > > > 'row'
>> > > > > > scope so we can see which on-going writes pertain to current
>> > > operation?
>> > > > > >
>> > > > > > You need to read back the increment and have it be 'correct' at
>> > > > increment
>> > > > > > time?
>> > > > > >
>> > > > > > (This is a good one)
>> > > > > >
>> > > > > > Thank you Toshihiro Suzuki
>> > > > > > St.Ack
>> > > > > >
>> > > > > > On Sat, Sep 12, 2015 at 8:09 AM, 鈴木俊裕 
>> wrote:
>> > > > > >
>> > > > > > > St.Ack,
>> > > > > > >
>> > > > > > > Thank you for your response.
>> > > > > > >
>> > > > > > > The reason I think that "A region lock (not a row lock) seems to
>> > occur
>> > > in
>> > > > > > > waitForPreviousTransactionsComplete()" is as follows:
>> > > > > > >
>> > > > > > > An increment operation has 3 procedures for MVCC.
>> > > > > > >
>> > > > > > > 1. mvcc.waitForPreviousTransactionsComplete();
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6712
>> > > > > > >
>> > > > > > > 2. w = mvcc.beginMemstoreInsertWithSeqNum(mvccNum);
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6721
>> > > > > > >
>> > > > > > > 3. mvcc.completeMemstoreInsertWithSeqNum(w, walKey);
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/cloudera/hbase/blob/cdh5.4.5-releas

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Bryan Beaudreault
Yea, they are all over the place and called from client and coprocessor
code. We ended up having no other option but to rollback, and aside from a
few NoSuchMethodErrors due to API changes (Put#add vs Put#addColumn), it
seems to be working and fixing our problem.
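
For anyone else hitting the same NoSuchMethodErrors on rollback, the rename in question looks roughly like the sketch below. It assumes a 1.x client jar where both methods are still present (the older one deprecated); the family, qualifier, and value names are made up for illustration:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the Put API rename between the 0.9x and 1.0 client APIs.
// Family/qualifier/value names here are illustrative only.
public class PutApiRename {
  public static Put preOnePointZero(byte[] row) {
    Put p = new Put(row);
    p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));       // 0.94/0.98-style call
    return p;
  }

  public static Put onePointZero(byte[] row) {
    Put p = new Put(row);
    p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v")); // 1.0-style call
    return p;
  }
}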

On Mon, Nov 30, 2015 at 3:47 PM Stack  wrote:

> Rollback is untested. No fix in 5.5. I was going to work on this now. Where
> are your counters Bryan? In their own column family or scattered about in a
> row with other Cell types?
> St.Ack
>
> On Mon, Nov 30, 2015 at 10:28 AM, Bryan Beaudreault <
> bbeaudrea...@hubspot.com> wrote:
>
> > Is there any update to this? We just upgraded all of our production
> > clusters from CDH4 to CDH5.4.7 and, not seeing this JIRA listed in the
> > known issues, did not know about this. Now we are seeing performance
> issues
> > across all clusters, as we make heavy use of increments.
> >
> > Can we roll forward to CDH5.5 to fix? Or is our only hope to roll back to
> > CDH 5.3.1 (if that is possible)?
> >
> >
> > On Thu, Sep 24, 2015 at 5:06 AM 鈴木俊裕  wrote:
> >
> > > Thank you St.Ack!
> > >
> > > I would like to follow the ticket.
> > >
> > > Toshihiro Suzuki
> > >
> > > 2015-09-22 14:14 GMT+09:00 Stack :
> > >
> > > > Back to this problem. Simple tests confirm that as is, the
> > > > single-queue-backed MVCC instance can slow Region ops if some other
> row
> > > is
> > > > slow to complete. In particular Increment, checkAndPut, and batch
> > > mutations
> > > > are affected. I opened HBASE-14460 to start in on a fix up. Let's see
> if
> > > we
> > > > can somehow scope mvcc to row or at least shard mvcc so not all
> Region
> > > ops
> > > > are paused.
> > > >
> > > > St.Ack
> > > >
> > > >
> > > > On Mon, Sep 14, 2015 at 4:15 AM, 鈴木俊裕  wrote:
> > > >
> > > > > > Thank you for the below reasoning (with accompanying helpful
> > > diagram).
> > > > > > Makes sense. Let me hack up a test case to help with the
> > > illustration.
> > > > It
> > > > > > is as though the mvcc should be scoped to a row only... Writes
> > > against
> > > > > > other rows should not hold up my read of my row. Tag an mvcc
> with a
> > > > 'row'
> > > > > > scope so we can see which on-going writes pertain to current
> > > operation?
> > > > > Thank you St.Ack! I think this approach would work.
> > > > >
> > > > > > You need to read back the increment and have it be 'correct' at
> > > > increment
> > > > > > time?
> > > > > Yes, we need it.
> > > > >
> > > > > I would like to help if there is anything I can do.
> > > > >
> > > > > Thanks,
> > > > > Toshihiro Suzuki
> > > > >
> > > > >
> > > > > 2015-09-13 14:11 GMT+09:00 Stack :
> > > > >
> > > > > > Thank you for the below reasoning (with accompanying helpful
> > > diagram).
> > > > > > Makes sense. Let me hack up a test case to help with the
> > > illustration.
> > > > It
> > > > > > is as though the mvcc should be scoped to a row only... Writes
> > > against
> > > > > > other rows should not hold up my read of my row. Tag an mvcc
> with a
> > > > 'row'
> > > > > > scope so we can see which on-going writes pertain to current
> > > operation?
> > > > > >
> > > > > > You need to read back the increment and have it be 'correct' at
> > > > increment
> > > > > > time?
> > > > > >
> > > > > > (This is a good one)
> > > > > >
> > > > > > Thank you Toshihiro Suzuki
> > > > > > St.Ack
> > > > > >
> > > > > > On Sat, Sep 12, 2015 at 8:09 AM, 鈴木俊裕 
> wrote:
> > > > > >
> > > > > > > St.Ack,
> > > > > > >
> > > > > > > Thank you for your response.
> > > > > > >
> > > > > > > The reason I think that "A region lock (not a row lock) seems to
> > occur
> > > in
> > > > > > > waitForPreviousTransactionsComplete()" is as follows:
> > > > > > >
> > > > > > > An increment operation has 3 procedures for MVCC.
> > > > > > >
> > > > > > > 1. mvcc.waitForPreviousTransactionsComplete();
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6712
> > > > > > >
> > > > > > > 2. w = mvcc.beginMemstoreInsertWithSeqNum(mvccNum);
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6721
> > > > > > >
> > > > > > > 3. mvcc.completeMemstoreInsertWithSeqNum(w, walKey);
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6893
> > > > > > >
> > > > > > >
> > > > > > > I think that MultiVersionConsistencyControl's writeQueue can
> > cause
> > > a
> > > > > > region
> > > > > > > lock.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/ja

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Stack
Rollback is untested. No fix in 5.5. I was going to work on this now. Where
are your counters Bryan? In their own column family or scattered about in a
row with other Cell types?
St.Ack
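
A minimal sketch of what "counters in their own column family" could look like with the 1.0-era Admin API is below; the table and family names are made up, and whether this layout is actually in use is exactly what the question above is probing:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Sketch only: create a table whose counters live in their own column family,
// separate from other Cell types. Table and family names are hypothetical.
public class DedicatedCounterFamily {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("example_table"));
      desc.addFamily(new HColumnDescriptor("d"));  // regular data cells
      desc.addFamily(new HColumnDescriptor("c"));  // counters only
      admin.createTable(desc);
    }
  }
}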

On Mon, Nov 30, 2015 at 10:28 AM, Bryan Beaudreault <
bbeaudrea...@hubspot.com> wrote:

> Is there any update to this? We just upgraded all of our production
> clusters from CDH4 to CDH5.4.7 and, not seeing this JIRA listed in the
> known issues, did not know about this. Now we are seeing performance issues
> across all clusters, as we make heavy use of increments.
>
> Can we roll forward to CDH5.5 to fix? Or is our only hope to roll back to
> CDH 5.3.1 (if that is possible)?
>
>
> On Thu, Sep 24, 2015 at 5:06 AM 鈴木俊裕  wrote:
>
> > Thank you St.Ack!
> >
> > I would like to follow the ticket.
> >
> > Toshihiro Suzuki
> >
> > 2015-09-22 14:14 GMT+09:00 Stack :
> >
> > > Back to this problem. Simple tests confirm that as is, the
> > > single-queue-backed MVCC instance can slow Region ops if some other row
> > is
> > > slow to complete. In particular Increment, checkAndPut, and batch
> > mutations
> > > are affected. I opened HBASE-14460 to start in on a fix up. Let's see if
> > we
> > > can somehow scope mvcc to row or at least shard mvcc so not all Region
> > ops
> > > are paused.
> > >
> > > St.Ack
> > >
> > >
> > > On Mon, Sep 14, 2015 at 4:15 AM, 鈴木俊裕  wrote:
> > >
> > > > > Thank you for the below reasoning (with accompanying helpful
> > diagram).
> > > > > Makes sense. Let me hack up a test case to help with the
> > illustration.
> > > It
> > > > > is as though the mvcc should be scoped to a row only... Writes
> > against
> > > > > other rows should not hold up my read of my row. Tag an mvcc with a
> > > 'row'
> > > > > scope so we can see which on-going writes pertain to current
> > operation?
> > > > Thank you St.Ack! I think this approach would work.
> > > >
> > > > > You need to read back the increment and have it be 'correct' at
> > > increment
> > > > > time?
> > > > Yes, we need it.
> > > >
> > > > I would like to help if there is anything I can do.
> > > >
> > > > Thanks,
> > > > Toshihiro Suzuki
> > > >
> > > >
> > > > 2015-09-13 14:11 GMT+09:00 Stack :
> > > >
> > > > > Thank you for the below reasoning (with accompanying helpful
> > diagram).
> > > > > Makes sense. Let me hack up a test case to help with the
> > illustration.
> > > It
> > > > > is as though the mvcc should be scoped to a row only... Writes
> > against
> > > > > other rows should not hold up my read of my row. Tag an mvcc with a
> > > 'row'
> > > > > scope so we can see which on-going writes pertain to current
> > operation?
> > > > >
> > > > > You need to read back the increment and have it be 'correct' at
> > > increment
> > > > > time?
> > > > >
> > > > > (This is a good one)
> > > > >
> > > > > Thank you Toshihiro Suzuki
> > > > > St.Ack
> > > > >
> > > > > On Sat, Sep 12, 2015 at 8:09 AM, 鈴木俊裕  wrote:
> > > > >
> > > > > > St.Ack,
> > > > > >
> > > > > > Thank you for your response.
> > > > > >
> > > > > > The reason I think that "A region lock (not a row lock) seems to
> occur
> > in
> > > > > > waitForPreviousTransactionsComplete()" is as follows:
> > > > > >
> > > > > > An increment operation has 3 procedures for MVCC.
> > > > > >
> > > > > > 1. mvcc.waitForPreviousTransactionsComplete();
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6712
> > > > > >
> > > > > > 2. w = mvcc.beginMemstoreInsertWithSeqNum(mvccNum);
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6721
> > > > > >
> > > > > > 3. mvcc.completeMemstoreInsertWithSeqNum(w, walKey);
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6893
> > > > > >
> > > > > >
> > > > > > I think that MultiVersionConsistencyControl's writeQueue can
> cause
> > a
> > > > > region
> > > > > > lock.
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L42-L43
> > > > > >
> > > > > >
> > > > > > Step 2 adds a WriteEntry to writeQueue.
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L102-L108
> > > > > >
> > > > > > Step 3 removes the WriteEntry from writeQueue.
> > > > > >
> > > > > > completeMemstoreInsertWithSeqNum(w, walKey) ->
> > > > > > waitForPreviousTransactionsComplete(e) -> advanceMemstore(w)
> > > > > >
> > 
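
To make the quoted analysis concrete: with a single write queue per region, completing one increment has to wait for every earlier queued write, even writes against unrelated rows. The sketch below is a simplified, standalone illustration of that single-FIFO behavior, not HBase's actual MultiVersionConsistencyControl:

import java.util.ArrayDeque;
import java.util.Queue;

// Simplified stand-in for the single-queue MVCC behavior described above.
// The method names only loosely mirror the HBase calls quoted in this thread.
public class SingleQueueMvccSketch {

  static final class WriteEntry {
    final long seq;
    volatile boolean done;
    WriteEntry(long seq) { this.seq = seq; }
  }

  private final Queue<WriteEntry> writeQueue = new ArrayDeque<>();
  private long nextSeq = 1;

  // Roughly analogous to beginMemstoreInsertWithSeqNum: enqueue a new write entry.
  public synchronized WriteEntry begin() {
    WriteEntry e = new WriteEntry(nextSeq++);
    writeQueue.add(e);
    return e;
  }

  // Roughly analogous to completeMemstoreInsertWithSeqNum -> advanceMemstore:
  // mark this entry done, then block until it has left the head of the queue.
  public synchronized void complete(WriteEntry e) throws InterruptedException {
    e.done = true;
    // Pop completed entries off the head; stop at the first one still in flight.
    while (!writeQueue.isEmpty() && writeQueue.peek().done) {
      writeQueue.poll();
    }
    notifyAll();
    // If an earlier entry (possibly for a slow, unrelated row) is not yet done, this
    // thread parks here even though its own row-level work finished -- the region-wide
    // stall described in this thread.
    while (writeQueue.contains(e)) {
      wait();
    }
  }
}

Sharding the queue, or scoping it per row as proposed on HBASE-14460, would let an entry complete without waiting on unrelated slow rows.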

Re: Performance degradation between CDH5.3.1(HBase0.98.6) and CDH5.4.5(HBase1.0.0)

2015-11-30 Thread Bryan Beaudreault
Is there any update to this? We just upgraded all of our production
clusters from CDH4 to CDH5.4.7 and, not seeing this JIRA listed in the
known issues, did not know about this. Now we are seeing performance issues
across all clusters, as we make heavy use of increments.

Can we roll forward to CDH5.5 to fix? Or is our only hope to roll back to
CDH 5.3.1 (if that is possible)?


On Thu, Sep 24, 2015 at 5:06 AM 鈴木俊裕  wrote:

> Thank you St.Ack!
>
> I would like to follow the ticket.
>
> Toshihiro Suzuki
>
> 2015-09-22 14:14 GMT+09:00 Stack :
>
> > Back to this problem. Simple tests confirm that as is, the
> > single-queue-backed MVCC instance can slow Region ops if some other row
> is
> > slow to complete. In particular Increment, checkAndPut, and batch
> mutations
> > are affected. I opened HBASE-14460 to start in on a fix up. Let's see if
> we
> > can somehow scope mvcc to row or at least shard mvcc so not all Region
> ops
> > are paused.
> >
> > St.Ack
> >
> >
> > On Mon, Sep 14, 2015 at 4:15 AM, 鈴木俊裕  wrote:
> >
> > > > Thank you for the below reasoning (with accompanying helpful
> diagram).
> > > > Makes sense. Let me hack up a test case to help with the
> illustration.
> > It
> > > > is as though the mvcc should be scoped to a row only... Writes
> against
> > > > other rows should not hold up my read of my row. Tag an mvcc with a
> > 'row'
> > > > scope so we can see which on-going writes pertain to current
> operation?
> > > Thank you St.Ack! I think this approach would work.
> > >
> > > > You need to read back the increment and have it be 'correct' at
> > increment
> > > > time?
> > > Yes, we need it.
> > >
> > > I would like to help if there is anything I can do.
> > >
> > > Thanks,
> > > Toshihiro Suzuki
> > >
> > >
> > > 2015-09-13 14:11 GMT+09:00 Stack :
> > >
> > > > Thank you for the below reasoning (with accompanying helpful
> diagram).
> > > > Makes sense. Let me hack up a test case to help with the
> illustration.
> > It
> > > > is as though the mvcc should be scoped to a row only... Writes
> against
> > > > other rows should not hold up my read of my row. Tag an mvcc with a
> > 'row'
> > > > scope so we can see which on-going writes pertain to current
> operation?
> > > >
> > > > You need to read back the increment and have it be 'correct' at
> > increment
> > > > time?
> > > >
> > > > (This is a good one)
> > > >
> > > > Thank you Toshihiro Suzuki
> > > > St.Ack
> > > >
> > > > On Sat, Sep 12, 2015 at 8:09 AM, 鈴木俊裕  wrote:
> > > >
> > > > > St.Ack,
> > > > >
> > > > > Thank you for your response.
> > > > >
> > > > > The reason I think that "A region lock (not a row lock) seems to occur
> in
> > > > > waitForPreviousTransactionsComplete()" is as follows:
> > > > >
> > > > > An increment operation has 3 procedures for MVCC.
> > > > >
> > > > > 1. mvcc.waitForPreviousTransactionsComplete();
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6712
> > > > >
> > > > > 2. w = mvcc.beginMemstoreInsertWithSeqNum(mvccNum);
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6721
> > > > >
> > > > > 3. mvcc.completeMemstoreInsertWithSeqNum(w, walKey);
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6893
> > > > >
> > > > >
> > > > > I think that MultiVersionConsistencyControl's writeQueue can cause
> a
> > > > region
> > > > > lock.
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L42-L43
> > > > >
> > > > >
> > > > > Step 2 adds a WriteEntry to writeQueue.
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L102-L108
> > > > >
> > > > > Step 3 removes the WriteEntry from writeQueue.
> > > > >
> > > > > completeMemstoreInsertWithSeqNum(w, walKey) ->
> > > > > waitForPreviousTransactionsComplete(e) -> advanceMemstore(w)
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L127
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyControl.java#L235
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MultiVersionConsistencyCo