[jira] [Comment Edited] (HBASE-17069) RegionServer writes invalid META entries for split daughters in some circumstances

2017-02-22 Thread Abhishek Singh Chouhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879300#comment-15879300
 ] 

Abhishek Singh Chouhan edited comment on HBASE-17069 at 2/22/17 9:42 PM:
-

[~stack] I'm still a bit new to the code. Hope I'm not misunderstanding the 
bits :).
bq. mutations have to be zero to go the 'false' route

From what I understood, we enter the block below only when the walEdit is not 
empty, i.e. when the processed mutations have put some cells into the WAL edit. 
Hope I'm not terribly wrong here.
In HRegion.processRowsWithLocks:
if (!walEdit.isEmpty()) {
  // we use HLogKey here instead of WALKey directly to support legacy coprocessors.
  walKey = new HLogKey(this.getRegionInfo().getEncodedNameAsBytes(),
    this.htableDescriptor.getTableName(), WALKey.NO_SEQUENCE_ID, now,
    processor.getClusterIds(), nonceGroup, nonce, mvcc);
  txid = this.wal.append(this.htableDescriptor, this.getRegionInfo(),
    walKey, walEdit, false); // this was false before the patch
}

bq. I wonder why the hregioninfo previously written doesn't 'shine through'... 
is the update of location writing null into the hregioninfo cell or is the read 
not allowing old versions of the hregioninfo cell in hbase:meta?

During the merge we send a multi that deletes the rows which had the 
hregioninfo and never add it back.


> RegionServer writes invalid META entries for split daughters in some 
> circumstances
> --
>
> Key: HBASE-17069
> URL: https://issues.apache.org/jira/browse/HBASE-17069
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 2.0.0, 1.3.0, 1.4.0, 1.2.4
>Reporter: Andrew Purtell
>Assignee: Abhishek Singh Chouhan
>Priority: Blocker
> Fix For: 2.0.0, 1.4.0, 1.3.1, 1.2.5
>
> Attachments: daughter_1_d55ef81c2f8299abbddfce0445067830.log, 
> daughter_2_08629d59564726da2497f70451aafcdb.log, 
> HBASE-17069.branch-1.3.001.patch, HBASE-17069.branch-1.3.002.patch, 
> HBASE-17069.master.001.patch, logs.tar.gz, 
> parent-393d2bfd8b1c52ce08540306659624f2.log
>
>
> I have been seeing frequent ITBLL failures testing various versions of 1.2.x. 
> Over the lifetime of 1.2.x the following issues have been fixed:
> - HBASE-15315 (Remove always set super user call as high priority)
> - HBASE-16093 (Fix splits failed before creating daughter regions leave meta 
> inconsistent)
> And this one is pending:
> - HBASE-17044 (Fix merge failed before creating merged region leaves meta 
> inconsistent)
> I can apply all of the above to branch-1.2 and still see this failure: 
> *The life of stillborn region d55ef81c2f8299abbddfce0445067830*
> *Master sees SPLITTING_NEW*
> {noformat}
> 2016-11-08 04:23:21,186 INFO  [AM.ZK.Worker-pool2-t82] master.RegionStates: 
> Transition null to {d55ef81c2f8299abbddfce0445067830 state=SPLITTING_NEW, 
> ts=1478579001186, server=node-3.cluster,16020,1478578389506}
> {noformat}
> *The RegionServer creates it*
> {noformat}
> 2016-11-08 04:23:26,035 INFO  
> [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created 
> cacheConfig for GomnU: blockCache=LruBlockCache{blockCount=34, 
> currentSize=14996112, freeSize=12823716208, maxSize=12838712320, 
> heapSize=14996112, minSize=12196776960, minFactor=0.95, multiSize=6098388480, 
> multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> 

[jira] [Comment Edited] (HBASE-17069) RegionServer writes invalid META entries for split daughters in some circumstances

2017-02-21 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877228#comment-15877228
 ] 

Andrew Purtell edited comment on HBASE-17069 at 2/22/17 1:39 AM:
-

bq. So you think it good in branch-1.2, branch-1.3 but not branch-1 Andrew 
Purtell?

I think this isn't completely solved and that surfaces in branch-1. Not sure at 
this point if that is the only affected branch. My guess is there is common 
code between them all that needs adjusting, to do with merging. 

bq. Interesting we were focused on locking but it seems like the issue 
something else altogether but it was in the locking patch Dang.

Yeah I spent a while looking at locking... but Abhishek got us pointed in the 
right direction, thankfully. 



[jira] [Comment Edited] (HBASE-17069) RegionServer writes invalid META entries for split daughters in some circumstances

2017-02-15 Thread Abhishek Singh Chouhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867762#comment-15867762
 ] 

Abhishek Singh Chouhan edited comment on HBASE-17069 at 2/15/17 12:49 PM:
--

[~Apache9] Yep got confused a bit there :) rethrowIOException might be better.

Got to the bottom of things. Here's what was happening.
- Region A is being split into B & C. The first request with daughter info is a 
multi (with hregioninfo) while the second one is a put.

In HRegion.processRowsWithLocks we have 
{noformat}
// 6. Append no sync
if (!walEdit.isEmpty()) {
  // we use HLogKey here instead of WALKey directly to support legacy coprocessors.
  walKey = new HLogKey(this.getRegionInfo().getEncodedNameAsBytes(),
    this.htableDescriptor.getTableName(), WALKey.NO_SEQUENCE_ID, now,
    processor.getClusterIds(), nonceGroup, nonce, mvcc);
  txid = this.wal.append(this.htableDescriptor, this.getRegionInfo(),
    walKey, walEdit, false);
}
{noformat}

Since we pass false for inMemstore in append, we mess up the sequence id accounting. 
In SequenceIdAccounting.update() we pass false for the multi request (let's say 
the sequence id here was 1), so lowestUnflushedSequenceIds is not updated.
Now for the second put, which goes through doMiniBatchMutation, we correctly 
pass true during append (sequence id 2). lowestUnflushedSequenceIds is set to 2 
for the metafamily. The RS sends the report using HRegion.setCompleteSequenceId, 
where it sets the last flushed sequence id for this store to 1 (however, we 
still haven't actually flushed).
- At this point the RS dies.
- During the WAL split we receive the last flushed sequence id for this store 
as 1 and filter out the cells belonging to the multi, which had the hregioninfo. 
The regionserver that's opening the region now will replay the edits correctly, 
but we've lost the data belonging to the multi and hence the client fails with 
"HRegionInfo was null".

However, this case is not particular to split or meta. It applies whenever a 
region has just been opened and we do a number of multis followed by a put: if 
the RS dies before we flush, we lose the data belonging to the multis. The fix 
is simply a one-line change :)
[~apurtell] [~lhofhansl]
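
The accounting problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class SeqIdAccountingSketch and its methods are hypothetical and are not the HBase SequenceIdAccounting code. It assumes that only edits appended with inMemstore=true update the lowest unflushed sequence id, that the reported last flushed id is one less than that value, and that replay drops every edit at or below the reported id.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; names and rules here are assumptions, not HBase internals.
public class SeqIdAccountingSketch {
    static final long NONE = Long.MAX_VALUE;

    // Lowest sequence id believed to be unflushed for a single store.
    private long lowestUnflushed = NONE;
    // Every sequence id that was written to the toy WAL.
    private final List<Long> walSeqIds = new ArrayList<>();

    // Append an edit; only edits flagged inMemstore are tracked as unflushed.
    void append(long seqId, boolean inMemstore) {
        walSeqIds.add(seqId);
        if (inMemstore && seqId < lowestUnflushed) {
            lowestUnflushed = seqId;
        }
    }

    // Everything below the lowest tracked id is reported as already flushed.
    long reportedLastFlushed() {
        return lowestUnflushed == NONE ? -1 : lowestUnflushed - 1;
    }

    // Edits that survive replay: ids strictly above the reported last flushed id.
    List<Long> replayable() {
        long lastFlushed = reportedLastFlushed();
        List<Long> kept = new ArrayList<>();
        for (long id : walSeqIds) {
            if (id > lastFlushed) {
                kept.add(id);
            }
        }
        return kept;
    }

    // Edit 1 stands for the multi carrying the hregioninfo, edit 2 for the later put.
    static List<Long> simulate(boolean multiInMemstore) {
        SeqIdAccountingSketch store = new SeqIdAccountingSketch();
        store.append(1, multiInMemstore);
        store.append(2, true);
        return store.replayable();
    }

    public static void main(String[] args) {
        System.out.println("multi appended with false: " + simulate(false)); // [2]
        System.out.println("multi appended with true:  " + simulate(true));  // [1, 2]
    }
}
```

With false for the multi, edit 1 is never tracked, so it is treated as flushed and dropped on replay; with true, both edits survive.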




[jira] [Comment Edited] (HBASE-17069) RegionServer writes invalid META entries for split daughters in some circumstances

2017-02-09 Thread Abhishek Singh Chouhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859722#comment-15859722
 ] 

Abhishek Singh Chouhan edited comment on HBASE-17069 at 2/9/17 4:16 PM:


Crap, misunderstood a bit of code here. We're not actually swallowing the 
exception. My bad.
However, I do think that the first meta edit is getting a bit screwed up. When I 
ran ITBLL again with the hregioninfo added during postOpen, I didn't get 
"HRegionInfo was null". Will dig deeper with a bit more testing. [~Apache9] 
Thanks.



[jira] [Comment Edited] (HBASE-17069) RegionServer writes invalid META entries for split daughters in some circumstances

2017-02-09 Thread Abhishek Singh Chouhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859651#comment-15859651
 ] 

Abhishek Singh Chouhan edited comment on HBASE-17069 at 2/9/17 3:27 PM:



More specifically:

{noformat}

MultiRowMutationProtos.MultiRowMutationService.BlockingInterface service =
  MultiRowMutationProtos.MultiRowMutationService.newBlockingStub(channel);
try {
  service.mutateRows(null, mmrBuilder.build());
} catch (ServiceException ex) {
  ProtobufUtil.toIOException(ex);
}
  }
{noformat}

We catch the exception here and move on. [~Apache9]
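
The concern raised here matches a general Java pitfall: a helper that converts an exception and returns it does nothing unless the caller throws the result. The sketch below is a minimal illustration with a hypothetical toIOException helper and a hypothetical failingCall; it is not HBase's ProtobufUtil and makes no claim about how the real method behaves.

```java
import java.io.IOException;

// Illustration only; every name here is hypothetical.
public class SwallowedExceptionSketch {
    // Converts and RETURNS an IOException; it does not throw.
    static IOException toIOException(Exception e) {
        return new IOException(e);
    }

    // Stand-in for a remote call that fails.
    static void failingCall() throws Exception {
        throw new Exception("mutateRows failed");
    }

    // The converted exception is discarded, so the caller sees success.
    static boolean swallowing() {
        try {
            failingCall();
        } catch (Exception ex) {
            toIOException(ex);
        }
        return true;
    }

    // Throwing the converted exception surfaces the failure to the caller.
    static boolean rethrowing() throws IOException {
        try {
            failingCall();
        } catch (Exception ex) {
            throw toIOException(ex);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("swallowing() returned " + swallowing());
        try {
            rethrowing();
        } catch (IOException e) {
            System.out.println("rethrowing() threw " + e.getClass().getSimpleName());
        }
    }
}
```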



[jira] [Comment Edited] (HBASE-17069) RegionServer writes invalid META entries for split daughters in some circumstances

2016-12-21 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767531#comment-15767531
 ] 

Andrew Purtell edited comment on HBASE-17069 at 12/21/16 5:59 PM:
--

FWIW I tested the head of branch-1.3 in my rig and it failed the same way, "no 
serialized HRegionInfo" in some rows in meta, with resulting job failure as 
part of the keyspace went missing. 
[~mantonov] [~ghelmling]



[jira] [Comment Edited] (HBASE-17069) RegionServer writes invalid META entries for split daughters in some circumstances

2016-11-11 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658076#comment-15658076
 ] 

Andrew Purtell edited comment on HBASE-17069 at 11/11/16 8:25 PM:
--

I have a bisect in progress. It's taken me a while to get a good signal. These 
boundaries have been determined by failure and success of multiple 1B ITBLL 
jobs, respectively:

Bad: a12d0a861db850ded1a66d6be8e3a4a9d2c76a4f
Good (when patched for HBASE-15315 and HBASE-16093): 
1a305bb4848ebcda2bd7c0df8f2f9c03ddf5b471

There are a few steps within this range. Working on it.



[jira] [Comment Edited] (HBASE-17069) RegionServer writes invalid META entries for split daughters in some circumstances

2016-11-11 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15658076#comment-15658076
 ] 

Andrew Purtell edited comment on HBASE-17069 at 11/11/16 8:25 PM:
--

I have a bisect in progress. It's taken me a while to get a good signal. These 
boundaries have been determined by failure and success of multiple 1B ITBLL 
jobs, respectively:

Bad: a12d0a861db850ded1a66d6be8e3a4a9d2c76a4f
Good (when patched for HBASE-15315 and HBASE-16093): 
1a305bb4848ebcda2bd7c0df8f2f9c03ddf5b471
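
Because the failure is intermittent, a single passing job does not prove a commit good, which is why each boundary above rests on several jobs. The following is only a rough sketch of that decision rule, not the procedure actually used here: `run_job` is a hypothetical callback that runs one ITBLL job against a commit and returns True on success, and the commit list is assumed to be ordered oldest to newest with a known-good first entry and a known-bad last entry.

```python
from typing import Callable, Sequence


def classify(commit: str, run_job: Callable[[str], bool], runs: int = 3) -> str:
    """Call a commit 'bad' on the first failing job, 'good' only if all jobs pass."""
    for _ in range(runs):
        if not run_job(commit):
            return "bad"
    return "good"


def first_bad_commit(commits: Sequence[str],
                     run_job: Callable[[str], bool],
                     runs: int = 3) -> str:
    """Binary search for the first bad commit.

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if classify(commits[mid], run_job, runs) == "bad":
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

A "good" verdict from this rule is only as strong as the number of runs behind it, so a rarely triggered failure needs a larger `runs` value before a boundary can be trusted.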


was (Author: apurtell):
I have a bisect in progress. It's taken me a while to get a good signal. I am 
confident about these boundaries:

Bad: a12d0a861db850ded1a66d6be8e3a4a9d2c76a4f
Good (when patched for HBASE-15315 and HBASE-16093): 
1a305bb4848ebcda2bd7c0df8f2f9c03ddf5b471

> RegionServer writes invalid META entries for split daughters in some 
> circumstances
> --
>
>                 Key: HBASE-17069
>                 URL: https://issues.apache.org/jira/browse/HBASE-17069
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.4
>            Reporter: Andrew Purtell
>            Priority: Critical
> Attachments: daughter_1_d55ef81c2f8299abbddfce0445067830.log, 
> daughter_2_08629d59564726da2497f70451aafcdb.log, logs.tar.gz, 
> parent-393d2bfd8b1c52ce08540306659624f2.log
>
>
> I have been seeing frequent ITBLL failures testing various versions of 1.2.x. 
> Over the lifetime of 1.2.x the following issues have been fixed:
> - HBASE-15315 (Remove always set super user call as high priority)
> - HBASE-16093 (Fix splits failed before creating daughter regions leave meta 
> inconsistent)
> And this one is pending:
> - HBASE-17044 (Fix merge failed before creating merged region leaves meta 
> inconsistent)
> I can apply all of the above to branch-1.2 and still see this failure: 
> *The life of stillborn region d55ef81c2f8299abbddfce0445067830*
> *Master sees SPLITTING_NEW*
> {noformat}
> 2016-11-08 04:23:21,186 INFO  [AM.ZK.Worker-pool2-t82] master.RegionStates: 
> Transition null to {d55ef81c2f8299abbddfce0445067830 state=SPLITTING_NEW, 
> ts=1478579001186, server=node-3.cluster,16020,1478578389506}
> {noformat}
> *The RegionServer creates it*
> {noformat}
> 2016-11-08 04:23:26,035 INFO  
> [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created 
> cacheConfig for GomnU: blockCache=LruBlockCache{blockCount=34, 
> currentSize=14996112, freeSize=12823716208, maxSize=12838712320, 
> heapSize=14996112, minSize=12196776960, minFactor=0.95, multiSize=6098388480, 
> multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false
> 2016-11-08 04:23:26,038 INFO  
> [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created 
> cacheConfig for big: blockCache=LruBlockCache{blockCount=34, 
> currentSize=14996112, freeSize=12823716208, maxSize=12838712320, 
> heapSize=14996112, minSize=12196776960, minFactor=0.95, multiSize=6098388480, 
> multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false
> 2016-11-08 04:23:26,442 INFO  
> [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created 
> cacheConfig for meta: blockCache=LruBlockCache{blockCount=63, 
> currentSize=17187656, freeSize=12821524664, maxSize=12838712320, 
> heapSize=17187656, minSize=12196776960, minFactor=0.95, multiSize=6098388480, 
> multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false
> 2016-11-08 04:23:26,713 INFO  
> [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created 
> cacheConfig for nwmrW: blockCache=LruBlockCache{blockCount=96, 
> currentSize=19178440, freeSize=12819533880, maxSize=12838712320, 
> heapSize=19178440, minSize=12196776960, minFactor=0.95, multiSize=6098388480, 
> multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, 
> cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, 
> cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, 
> prefetchOnOpen=false
> 2016-11-08 04:23:26,715 INFO  
> [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created 
> cacheConfig for piwbr: blockCache=LruBlockCache{blockCount=96, 
> currentSize=19178440, freeSize=12819533880, maxSize=12838712320, 
> heapSize=19178440, minSize=12196776960, minFactor=0.95,