[jira] [Commented] (HBASE-22072) High read/write intensive regions may cause long crash recovery

2019-04-17 Thread Pavel (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819903#comment-16819903
 ] 

Pavel commented on HBASE-22072:
---

[~ram_krish] thanks for the work, the patch looks good; it was included in a 
build and deployed to production. With the debug log level turned on I see both cases:
{noformat}
11:44:32.727 [MemStoreFlusher.1] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already 
closing. There is no need to updateReaders
11:46:03.882 [MemStoreFlusher.0] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already 
closing. There is no need to updateReaders
11:46:03.882 [MemStoreFlusher.0] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already has 
the close lock. There is no need to updateReaders
11:46:32.990 [MemStoreFlusher.0] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already 
closing. There is no need to updateReaders
11:48:03.872 [MemStoreFlusher.0] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already 
closing. There is no need to updateReaders
11:48:32.878 [MemStoreFlusher.0] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already 
closing. There is no need to updateReaders
11:48:32.880 [MemStoreFlusher.0] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already 
closing. There is no need to updateReaders
11:50:33.286 [MemStoreFlusher.0] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already 
closing. There is no need to updateReaders
11:52:32.487 [MemStoreFlusher.1] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already 
closing. There is no need to updateReaders
11:52:32.487 [MemStoreFlusher.1] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already 
closing. There is no need to updateReaders
11:52:32.492 [MemStoreFlusher.1] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already 
closing. There is no need to updateReaders
11:54:18.280 [MemStoreFlusher.1] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already has 
the close lock. There is no need to updateReaders
11:55:32.467 [MemStoreFlusher.1] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already 
closing. There is no need to updateReaders
11:55:32.471 [MemStoreFlusher.1] DEBUG 
org.apache.hadoop.hbase.regionserver.StoreScanner - StoreScanner already 
closing. There is no need to updateReaders{noformat}
The more common case is the flusher trying to updateReaders for a StoreScanner that is 
already closed.
 The less common one is that closing is still in progress.

Finally the regionservers got rid of the obsolete compacted storefiles. Victory!

Could you please clarify whether *StoreScanner private boolean closing = false;* has 
to be volatile or not for the first case.
 Is it possible that another thread, performing updateReaders, sees the *closing* flag 
as still false after StoreScanner#close has completed?
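
For illustration, a minimal standalone sketch of that visibility concern (plain Java with hypothetical class and field names, not HBase code): without volatile, and without both threads holding a common lock, the flusher thread is allowed to keep seeing a stale false value of the flag even after the closing thread has set it.
{code:java}
// Minimal sketch of the visibility question above; plain Java, not HBase code.
// Without 'volatile' (or a common lock) the JMM does not guarantee the flusher
// thread ever observes closing == true after close() has finished.
public class ClosingFlagSketch {
  private boolean closing = false; // candidate for: private volatile boolean closing = false;

  void close() {
    closing = true;                // thread closing the scanner
  }

  void updateReaders() {
    if (closing) {                 // flusher thread may read a stale 'false' here
      return;                      // "already closing, no need to updateReaders"
    }
    // ... install the new store file scanners ...
  }

  public static void main(String[] args) throws InterruptedException {
    ClosingFlagSketch s = new ClosingFlagSketch();
    Thread flusher = new Thread(() -> {
      while (true) {
        s.updateReaders();         // may keep passing the check if the write is never seen
      }
    });
    flusher.setDaemon(true);
    flusher.start();
    Thread.sleep(100);
    s.close();
  }
}
{code}
If both threads already read and write the flag under the same lock, the lock provides the visibility guarantee; otherwise marking the flag volatile would.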

> High read/write intensive regions may cause long crash recovery
> ---
>
> Key: HBASE-22072
> URL: https://issues.apache.org/jira/browse/HBASE-22072
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, Recovery
>Affects Versions: 2.1.2
>Reporter: Pavel
>Assignee: ramkrishna.s.vasudevan
>Priority: Major
> Attachments: HBASE-22072.HBASE-21879-v1.patch
>
>
> Compaction of a region under high read load may leave compacted files undeleted 
> because of existing scan references:
> INFO org.apache.hadoop.hbase.regionserver.HStore - Can't archive compacted 
> file hdfs://hdfs-ha/hbase... because of either isCompactedAway=true or file 
> has reference, isReferencedInReads=true, refCount=1, skipping for now
> If the region is also under high write load this happens quite often, and the region 
> may have few storefiles and tons of undeleted compacted hdfs files.
> The region keeps all those files (in my case thousands) until the graceful region 
> closing procedure, which ignores existing references and drops obsolete files. 
> It works fine, apart from consuming some extra hdfs space, but only in the case of 
> normal region closing. If the region server crashes, then the new region server 
> responsible for that overfilled region reads the hdfs folder and tries to deal 
> with all the undeleted files, producing tons of storefiles and compaction tasks and 
> consuming an abnormal amount of memory, which may lead to an OutOfMemory exception 
> and further region server crashes. This stops writes to the region because the number 
> of storefiles reaches the *hbase.hstore.blockingStoreFiles* limit, forces high GC 
> duty, and may take hours to compact all files back into a working set of files.
> A workaround is to periodically check the file count of hdfs folders and force region 
> assignment for those with too many files.
> It would be nice if the regionserver had a setting similar to 
> hbase.hstore.blockingStoreFiles and invoked an attempt to drop undeleted 
> compacted files when the number of files reaches this setting.

[jira] [Commented] (HBASE-22072) High read/write intensive regions may cause long crash recovery

2019-04-03 Thread Pavel (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808608#comment-16808608
 ] 

Pavel commented on HBASE-22072:
---

??May be we should have a lock for closing and the updateReader() should try to 
get that lock before trying to update the scanners? If already closing is done 
then don't do it? ??

Yes, a lock will solve it in most cases, except that StoreScanner#updateReaders does 
flushedstoreFileScanners.addAll(scanners); note that flushedstoreFileScanners is 
an ArrayList, neither volatile nor thread-safe. In rare cases a thread that closes the 
StoreScanner right after the flusher thread has executed StoreScanner.updateReaders may 
not see the changes in the flushedstoreFileScanners list and will keep a scanner unclosed.

That could be tested, but for me it is a big challenge to write a correct 
multithreaded UT.
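
To make the idea concrete, a rough sketch of the close-lock approach discussed above (the names closeLock and flushedStoreFileScanners and the AutoCloseable element type are illustrative assumptions, not the actual patch): if updateReaders and close mutate and drain the list under the same lock, the closing thread is guaranteed to see every scanner the flusher added, and a late updateReaders call releases its scanners immediately.
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the "close lock" idea from the discussion; names are illustrative.
public class CloseLockSketch {
  private final ReentrantLock closeLock = new ReentrantLock();
  private boolean closing = false;
  // Guarded by closeLock, so both threads see a consistent view of the list.
  private final List<AutoCloseable> flushedStoreFileScanners = new ArrayList<>();

  /** Called by the flusher thread with the scanners of the just-flushed files. */
  public void updateReaders(List<AutoCloseable> newScanners) {
    closeLock.lock();
    try {
      if (closing) {
        closeAll(newScanners);              // scanner is going away, release refs now
        return;
      }
      flushedStoreFileScanners.addAll(newScanners);
    } finally {
      closeLock.unlock();
    }
  }

  /** Called by the thread that closes the client scanner. */
  public void close() {
    closeLock.lock();
    try {
      closing = true;
      closeAll(flushedStoreFileScanners);   // sees every scanner added under the lock
      flushedStoreFileScanners.clear();
    } finally {
      closeLock.unlock();
    }
  }

  private static void closeAll(List<AutoCloseable> scanners) {
    for (AutoCloseable c : scanners) {
      try {
        c.close();
      } catch (Exception e) {
        // log and continue; nothing else to do for a scanner that fails to close
      }
    }
  }
}
{code}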






[jira] [Commented] (HBASE-22072) High read/write intensive regions may cause long crash recovery

2019-04-02 Thread Pavel (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807708#comment-16807708
 ] 

Pavel commented on HBASE-22072:
---

[~anoop.hbase]
{noformat}
StoreScanner#close() Also taking flushLock.lock() might solve this?{noformat}
I would say no, because the StoreScanner can be closed before updateReaders is called:
{code:java}
for (ChangedReadersObserver o : this.changedReaderObservers) {
  List<KeyValueScanner> memStoreScanners;
  this.lock.readLock().lock();
  try {
    memStoreScanners = this.memstore.getScanners(o.getReadPoint());
  } finally {
    this.lock.readLock().unlock();
  }
  // o might be already closed at this point
  o.updateReaders(sfs, memStoreScanners);
}
{code}






[jira] [Commented] (HBASE-22072) High read/write intensive regions may cause long crash recovery

2019-04-02 Thread Pavel (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807693#comment-16807693
 ] 

Pavel commented on HBASE-22072:
---

[~ram_krish] I am not that good at writing UTs for hbase, I've never done this 
before. I would suggest 2 different test cases, the first easy and the second 
difficult.

1. check updateReaders on a closed StoreScanner
 * Create HStore, StoreScanner and HStoreFile instances.
 * Close the StoreScanner, then call updateReaders on the StoreScanner
 * finally check that HStoreFile.refCount equals 0

2. check the race condition on updateReaders under a multithreaded workload.

The first test case should probably never happen on a working regionserver due to 
synchronization.

 

 






[jira] [Comment Edited] (HBASE-22072) High read/write intensive regions may cause long crash recovery

2019-04-02 Thread Pavel (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807693#comment-16807693
 ] 

Pavel edited comment on HBASE-22072 at 4/2/19 12:11 PM:


[~ram_krish] I am not that good at writing UTs for hbase, I've never done this 
before. I would suggest 2 different test cases, the first easy and the second difficult.

1. check updateReaders on a closed StoreScanner
 * Create HStore, StoreScanner and HStoreFile instances.
 * Close the StoreScanner, then call updateReaders on the StoreScanner
 * finally check that HStoreFile.refCount equals 0

2. check the race condition on updateReaders under a multithreaded workload.

The first test case should probably never happen on a working regionserver due to 
synchronization.

 

 


was (Author: pkirillov):
[~ram_krish] I am that good in writing UT for hbase, I've never done this 
before. I would suggest 2 different test-cases, first easy and second 
difficult. 

1. check updateReaders on closed StoreScanner
 * Create HStore, StoreScanner and HStoreFile instances.
 * Close StoreScanner, than call updateReaders for StoreScanner
 * finally check if HStoreFile.refcount equal to 0

2. check race condition on updateReaders in multithreading workload.

Probably first test-case should newer happen on working regionserver due to 
synchronizations.

 

 






[jira] [Comment Edited] (HBASE-22072) High read/write intensive regions may cause long crash recovery

2019-04-01 Thread Pavel (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806468#comment-16806468
 ] 

Pavel edited comment on HBASE-22072 at 4/1/19 7:36 AM:
---

After a bit deeper code reading it looks like a race condition issue.

MemStore flushing and the RPCs handling client scanners use different threads.

While flushing, the memStore calls HStore:
{code:java}
private void notifyChangedReadersObservers(List<HStoreFile> sfs) throws IOException {
  for (ChangedReadersObserver o : this.changedReaderObservers) {
    List<KeyValueScanner> memStoreScanners;
    this.lock.readLock().lock();
    try {
      memStoreScanners = this.memstore.getScanners(o.getReadPoint());
    } finally {
      this.lock.readLock().unlock();
    }
    o.updateReaders(sfs, memStoreScanners);
  }
}
{code}
The store scanner may be running close(), because there is a gap between the loop 
iterator getting the next *ChangedReadersObserver* and the call to updateReaders, and the 
updateReaders procedure itself does not check whether the StoreScanner is closing or not.


was (Author: pkirillov):
After a bit deeper code reading it looks like race condition issue.

MemStore flushing and RPCs, handling client scanners, uses different threads.

While flushing memStore calls HStore
{code:java}
private void notifyChangedReadersObservers(List<HStoreFile> sfs) throws IOException {
  for (ChangedReadersObserver o : this.changedReaderObservers) {
    List<KeyValueScanner> memStoreScanners;
    this.lock.readLock().lock();
    try {
      memStoreScanners = this.memstore.getScanners(o.getReadPoint());
    } finally {
      this.lock.readLock().unlock();
    }
    o.updateReaders(sfs, memStoreScanners);
  }
}
{code}
Store scanner may running close() because where is a gap between loop iterator 
gets next *ChangedReadersObserver* and calling updateReaders and further 
updateReaders procedure does not consider StoreScanner is closing or not.






[jira] [Commented] (HBASE-22072) High read/write intensive regions may cause long crash recovery

2019-04-01 Thread Pavel (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806468#comment-16806468
 ] 

Pavel commented on HBASE-22072:
---

After a bit deeper code reading it looks like a race condition issue.

MemStore flushing and the RPCs handling client scanners use different threads.

While flushing, the memStore calls HStore:
{code:java}
private void notifyChangedReadersObservers(List<HStoreFile> sfs) throws IOException {
  for (ChangedReadersObserver o : this.changedReaderObservers) {
    List<KeyValueScanner> memStoreScanners;
    this.lock.readLock().lock();
    try {
      memStoreScanners = this.memstore.getScanners(o.getReadPoint());
    } finally {
      this.lock.readLock().unlock();
    }
    o.updateReaders(sfs, memStoreScanners);
  }
}
{code}
The store scanner may be running close(), because there is a gap between the loop 
iterator getting the next *ChangedReadersObserver* and the call to updateReaders, and the 
updateReaders procedure itself does not check whether the StoreScanner is closing or not.






[jira] [Commented] (HBASE-22072) High read/write intensive regions may cause long crash recovery

2019-03-30 Thread Pavel (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805785#comment-16805785
 ] 

Pavel commented on HBASE-22072:
---

Sorry for taking so long to answer, I've made some investigations to catch 
unclosed scanners.

Each StoreFile has an AtomicInteger refCount, maintained by StoreFileReader: it is 
incremented by each StoreFileScanner and decremented when reading completes. 
refCount==0 allows dropping the StoreFile after compaction. I added an identifier 
to the StoreFileScanner class as follows (all code examples are from rel/2.1.4):
{code:java}
  public StoreFileScanner(StoreFileReader reader, HFileScanner hfs, boolean useMVCC,
      boolean hasMVCC, long readPt, long scannerOrder, boolean canOptimizeForNonNullColumn) {
    this.readPt = readPt;
    this.reader = reader;
    this.hfs = hfs;
    this.enforceMVCC = useMVCC;
    this.hasMVCCInfo = hasMVCC;
    this.scannerOrder = scannerOrder;
    this.canOptimizeForNonNullColumn = canOptimizeForNonNullColumn;
    this.identifier = new Timestamp(System.currentTimeMillis()) + "$" + System.identityHashCode(this);
    this.reader.incrementRefCount(identifier);
  }
{code}
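(For completeness: the constructor above assigns this.identifier, which assumes a companion field along these lines; the declaration below is a reconstruction for readability and was not shown in the snippet.)
{code:java}
// Assumed companion field for the identifier assigned in the constructor above
// (reconstruction; not part of the original snippet).
private final String identifier;
{code}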
and pass it to incrementRefCount of StoreFileReader, which now has
{code:java}
private Set<String> scanners = ConcurrentHashMap.newKeySet();{code}
to get more information about active scanners
{code:java}
void incrementRefCount(String storeFileScannerName) {
  int rc = refCount.incrementAndGet();
  scanners.add(storeFileScannerName);
  if (LOG.isDebugEnabled()) {
    String message = "Increment refCount of "
        + reader.getPath()
        + "(" + rc + ") by " + storeFileScannerName + ": "
        + Thread.currentThread().getId() + "#"
        + Thread.currentThread().getName();
    LOG.debug(org.apache.commons.lang.exception.ExceptionUtils
        .getFullStackTrace(new NullPointerException(message)));
  }
}
{code}
The StoreFileScanner is removed from the Set after the scanner is closed:
{code:java}
void readCompleted(String storeFileScannerName) {
int rc = refCount.decrementAndGet();
scanners.remove(storeFileScannerName);
if (LOG.isDebugEnabled()){
  String message = "Decrement refCount of "
  + reader.getPath()
  + "(" + rc + ") by " + storeFileScannerName + ": "
  + Thread.currentThread().getId() + "#"
  + Thread.currentThread().getName();
  LOG.debug(message);
.
{code}
I also added this set of scanners to the log message of the Chore service and finally 
caught an unclosed StoreFileScanner, created as follows:
{noformat}
org.apache.hadoop.hbase.regionserver.StoreFileReader.incrementRefCount(StoreFileReader.java:172)
org.apache.hadoop.hbase.regionserver.StoreFileScanner.<init>(StoreFileScanner.java:97)
org.apache.hadoop.hbase.regionserver.StoreFileReader.getStoreFileScanner(StoreFileReader.java:155)
org.apache.hadoop.hbase.regionserver.HStoreFile.getPreadScanner(HStoreFile.java:504)
org.apache.hadoop.hbase.regionserver.StoreFileScanner.getScannersForStoreFiles(StoreFileScanner.java:147)
org.apache.hadoop.hbase.regionserver.HStore.getScanners(HStore.java:1309)
org.apache.hadoop.hbase.regionserver.HStore.getScanners(HStore.java:1276)
org.apache.hadoop.hbase.regionserver.StoreScanner.updateReaders(StoreScanner.java:891)
org.apache.hadoop.hbase.regionserver.HStore.notifyChangedReadersObservers(HStore.java:1195)
org.apache.hadoop.hbase.regionserver.HStore.updateStorefiles(HStore.java:1171)
org.apache.hadoop.hbase.regionserver.HStore.access$600(HStore.java:131)
org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.commit(HStore.java:2302)
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2741)
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2467)
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2439)
org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2329)
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:612)
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:581)
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:68)
org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:361)
java.lang.Thread.run(Thread.java:748)
{noformat}
It looks like when we have active StoreScanners on a MemStore that is being flushed 
to a StoreFile, we close all MemStoreScanners and open new StoreFileScanners so that 
the client scanner can continue reading. And for some reason these StoreFileScanners 
remain unclosed. I could not find the exact place in the code that misbehaves, and I 
am asking the dev community for assistance.

I can give more details if needed by changing the logging, though that requires a new 
build and a production regionserver restart, since I cannot reproduce this 
behavior on a test installation.


[jira] [Commented] (HBASE-22072) High read/write intensive regions may cause long crash recovery

2019-03-22 Thread Pavel (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799329#comment-16799329
 ] 

Pavel commented on HBASE-22072:
---

Great job [~anoop.hbase], it seems you are pretty experienced in the hbase source 
code. What shall we do next?

The issue I started from is not the root problem itself, but an eventual result of the 
one you've found.






[jira] [Comment Edited] (HBASE-22072) High read/write intensive regions may cause long crash recovery

2019-03-22 Thread Pavel (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799051#comment-16799051
 ] 

Pavel edited comment on HBASE-22072 at 3/22/19 2:17 PM:


Thanks [~anoop.hbase] for your attention.

Answering your last question first: the RS first crashed due to a hardware error, 
for example when the host machine goes down for a reboot. Subsequent RS crashes can 
happen during compaction with an OutOfMemoryException.

First, files are not deleted immediately after compaction, and afterwards the Chore 
service repeatedly tries to drop them, without success because of existing read 
references.

It surprises me a little, because some files in the list were compacted 3 days ago and 
marked as "compactedAway", so new scanners should not read them, and existing 
scanners should release their references once hbase.client.scanner.timeout.period 
is reached. I am going to inspect the hbase client source; maybe there is a bug 
inside and the hbase.client.scanner.timeout.period setting is ignored if 
scanner.close() was not called.

Please correct me if I am mistaken about the difference between the 2.1 and 2.2 branch 
behavior.

*2.1:* the Chore service tries to drop files until the region is closed, because closing 
ignores read references.

*2.2:* closing the region does not affect undeleted files, and after the region is 
assigned again the new RS serving this region reads the compaction marker, does not 
compact the files again, but keeps trying to drop them with the chore service.

It seems that HBASE-20724 solves this issue, but in the case of unclosed 
references on files the Chore will accumulate all those files forever, trying to 
drop them. I wonder if a region split keeps read references; if yes, a split will 
double the work for the Chore service.

I will try to get to the bottom of why there are so many read references on 
compacted files and will probably create another issue.

 

 


was (Author: pkirillov):
Thanks [~anoop.hbase] for your attention.

Answering to you last question first, RS first crashed due to hardware error, 
for example if host machine goes to reboot. Next RS crashes can happen during 
compacting with OutOfMemoryException.

First files are not deleted immediately after compaction and after Chore 
service repeatedly try to drop them, without success because of existing read 
references.

It surprise me a little, because some files in a list compacted 3 days ago, 
marked as "compactedAway", so new scanners should not read them, and existing 
scanners should release reference as far as 
{{hbase.client.scanner.timeout.period }}achived. I am going to inspect source 
of hbase client, maybe there is a bug inside and 
{{hbase.client.scanner.timeout.period}} setting ignored if scaner.close() was 
not called.{{}}

Correct me please if I am mistaken in difference between 2.1 and 2.2 branch 
behavior.

*2.1:* Chore service try to drop files untill region is closing, cause closing 
ignores read references.

*2.2:* Closing region does not affect undeleted files and after region is 
assigned again new RS, serving this region, read compaction marker, does not 
compact files again, but keep trying to drop them with chore service.

It seems that HBASE-20724 solves this issue, but in case of having unclosed 
references on files Chore will accumulate all that files forever, trying to 
drop them.I wonder if region split keeps read references. If yes, split will 
double work for Chore service.

I will try to figure out bottom of existing so much read references on 
compacted files and probably create another issue.

 

 


[jira] [Commented] (HBASE-22072) High read/write intensive regions may cause long crash recovery

2019-03-22 Thread Pavel (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799051#comment-16799051
 ] 

Pavel commented on HBASE-22072:
---

Thanks [~anoop.hbase] for your attention.

Answering your last question first: the RS first crashed due to a hardware error, 
for example when the host machine goes down for a reboot. Subsequent RS crashes can 
happen during compaction with an OutOfMemoryException.

First, files are not deleted immediately after compaction, and afterwards the Chore 
service repeatedly tries to drop them, without success because of existing read 
references.

It surprises me a little, because some files in the list were compacted 3 days ago and 
marked as "compactedAway", so new scanners should not read them, and existing 
scanners should release their references once {{hbase.client.scanner.timeout.period}} 
is reached. I am going to inspect the hbase client source; maybe there is a bug 
inside and the {{hbase.client.scanner.timeout.period}} setting is ignored if 
scanner.close() was not called.

Please correct me if I am mistaken about the difference between the 2.1 and 2.2 branch 
behavior.

*2.1:* the Chore service tries to drop files until the region is closed, because closing 
ignores read references.

*2.2:* closing the region does not affect undeleted files, and after the region is 
assigned again the new RS serving this region reads the compaction marker, does not 
compact the files again, but keeps trying to drop them with the chore service.

It seems that HBASE-20724 solves this issue, but in the case of unclosed 
references on files the Chore will accumulate all those files forever, trying to 
drop them. I wonder if a region split keeps read references; if yes, a split will 
double the work for the Chore service.

I will try to get to the bottom of why there are so many read references on 
compacted files and will probably create another issue.

 

 






[jira] [Created] (HBASE-22072) High read/write intensive regions may cause long crash recovery

2019-03-20 Thread Pavel (JIRA)
Pavel created HBASE-22072:
-

 Summary: High read/write intensive regions may cause long crash 
recovery
 Key: HBASE-22072
 URL: https://issues.apache.org/jira/browse/HBASE-22072
 Project: HBase
  Issue Type: Bug
  Components: Performance, Recovery
Affects Versions: 2.1.2
Reporter: Pavel


Compaction of a region under high read load may leave compacted files undeleted 
because of existing scan references:

INFO org.apache.hadoop.hbase.regionserver.HStore - Can't archive compacted file 
hdfs://hdfs-ha/hbase... because of either isCompactedAway=true or file has 
reference, isReferencedInReads=true, refCount=1, skipping for now

If the region is also under high write load this happens quite often, and the region may 
have few storefiles and tons of undeleted compacted hdfs files.

The region keeps all those files (in my case thousands) until the graceful region 
closing procedure, which ignores existing references and drops obsolete files. 
It works fine, apart from consuming some extra hdfs space, but only in the case of 
normal region closing. If the region server crashes, then the new region server 
responsible for that overfilled region reads the hdfs folder and tries to deal with 
all the undeleted files, producing tons of storefiles and compaction tasks and 
consuming an abnormal amount of memory, which may lead to an OutOfMemory exception and 
further region server crashes. This stops writes to the region because the number of 
storefiles reaches the *hbase.hstore.blockingStoreFiles* limit, forces high GC duty, 
and may take hours to compact all files back into a working set of files.

A workaround is to periodically check the file count of hdfs folders and force region 
assignment for those with too many files, as sketched below.

It would be nice if the regionserver had a setting similar to 
hbase.hstore.blockingStoreFiles and invoked an attempt to drop undeleted compacted 
files when the number of files reaches this setting.
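
As an illustration of that workaround, a rough sketch (the table name, the file-count threshold, and the assumption that store files live under /hbase/data/&lt;namespace&gt;/&lt;table&gt;/&lt;region&gt;/&lt;cf&gt; are placeholders, not part of this issue): it counts files per region directory and unassigns regions that exceed the threshold so the master reopens them.
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionInfo;

// Rough sketch of the workaround described above: count HFiles under each region
// directory and force a region close/reopen (unassign) when there are too many.
// The path layout, table name and threshold are illustrative assumptions.
public class ReassignOverfilledRegions {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    TableName table = TableName.valueOf("my_table");   // assumption
    int maxFilesPerRegion = 200;                        // assumption
    Path tableDir = new Path("/hbase/data/default/" + table.getQualifierAsString());

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      FileSystem fs = FileSystem.get(conf);
      for (RegionInfo region : admin.getRegions(table)) {
        Path regionDir = new Path(tableDir, region.getEncodedName());
        int files = 0;
        for (FileStatus cf : fs.listStatus(regionDir)) {   // one subdir per column family
          if (cf.isDirectory()) {
            files += fs.listStatus(cf.getPath()).length;
          }
        }
        if (files > maxFilesPerRegion) {
          // Unassigning triggers the graceful close path described above,
          // which ignores read references and drops the obsolete compacted files.
          admin.unassign(region.getRegionName(), false);
        }
      }
    }
  }
}
{code}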





[jira] [Comment Edited] (HBASE-20540) [umbrella] Hadoop 3 compatibility

2018-08-10 Thread Pavel (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576098#comment-16576098
 ] 

Pavel edited comment on HBASE-20540 at 8/10/18 10:56 AM:
-

It looks like the hadoop 3.1 branch has a production-ready release now

[Apache Hadoop 3.1.1 
release|https://lists.apache.org/thread.html/895f28e0941b37f006812afa383ff8ff9148fafc4a5be385aebd0fa1@%3Cgeneral.hadoop.apache.org%3E]


was (Author: pkirillov):
It looks like hadoop 3.1 bunch has production ready release now

[[ANNOUNCE] Apache Hadoop 3.1.1 
release|https://lists.apache.org/thread.html/895f28e0941b37f006812afa383ff8ff9148fafc4a5be385aebd0fa1@%3Cgeneral.hadoop.apache.org%3E]

> [umbrella] Hadoop 3 compatibility
> -
>
> Key: HBASE-20540
> URL: https://issues.apache.org/jira/browse/HBASE-20540
> Project: HBase
>  Issue Type: Umbrella
>Reporter: Duo Zhang
>Priority: Major
> Fix For: 2.0.2, 2.1.1
>
>
> There are known issues about the hadoop 3 compatibility for hbase 2. But 
> hadoop 3 is still not production ready. So we will link the issues here and 
> once there is a production ready hadoop 3 release, we will fix these issues 
> soon and upgrade our dependencies on hadoop, and also update the support 
> matrix.





[jira] [Commented] (HBASE-20540) [umbrella] Hadoop 3 compatibility

2018-08-10 Thread Pavel (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576098#comment-16576098
 ] 

Pavel commented on HBASE-20540:
---

It looks like the hadoop 3.1 branch has a production-ready release now

[[ANNOUNCE] Apache Hadoop 3.1.1 
release|https://lists.apache.org/thread.html/895f28e0941b37f006812afa383ff8ff9148fafc4a5be385aebd0fa1@%3Cgeneral.hadoop.apache.org%3E]




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)