[ 
https://issues.apache.org/jira/browse/HBASE-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16799051#comment-16799051
 ] 

Pavel commented on HBASE-22072:
-------------------------------

Thanks [~anoop.hbase] for your attention.

Answering your last question first: the RS first crashed due to a hardware 
error, for example when the host machine reboots. Subsequent RS crashes can 
happen during compaction with an OutOfMemoryException.

First, files are not deleted immediately after compaction, and afterwards the 
Chore service repeatedly tries to drop them, without success because of 
existing read references.

It surprises me a little, because some files in the list were compacted 3 days 
ago and are marked as "compactedAway", so new scanners should not read them, 
and existing scanners should release their references once 
{{hbase.client.scanner.timeout.period}} is reached. I am going to inspect the 
source of the hbase client; maybe there is a bug inside and the 
{{hbase.client.scanner.timeout.period}} setting is ignored if scanner.close() 
was not called.
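For reference, this is the scanner lease timeout; a minimal hbase-site.xml fragment setting it would look like the sketch below (the 60000 ms value is illustrative, not a recommendation):

```xml
<!-- Illustrative fragment: once this period elapses without the client
     renewing the lease, the server-side scanner lease should expire,
     which should release the read reference (refCount) on compacted
     files even if scanner.close() was never called. -->
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>60000</value>
</property>
```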

Please correct me if I am mistaken about the difference between the 2.1 and 
2.2 branch behavior.

*2.1:* The Chore service tries to drop the files until the region is closed, 
because closing ignores read references.

*2.2:* Closing the region does not affect undeleted files, and after the 
region is assigned again, the new RS serving this region reads the compaction 
marker, does not compact the files again, but keeps trying to drop them with 
the Chore service.

It seems that HBASE-20724 solves this issue, but in the case of unclosed 
references on files the Chore will accumulate all those files forever, trying 
to drop them. I wonder if a region split keeps read references. If so, a split 
will double the work for the Chore service.

I will try to get to the bottom of why there are so many read references on 
compacted files and will probably create another issue.


> High read/write intensive regions may cause long crash recovery
> ---------------------------------------------------------------
>
>                 Key: HBASE-22072
>                 URL: https://issues.apache.org/jira/browse/HBASE-22072
>             Project: HBase
>          Issue Type: Bug
>          Components: Performance, Recovery
>    Affects Versions: 2.1.2
>            Reporter: Pavel
>            Priority: Major
>
> Compaction of a region under high read load may leave compacted files 
> undeleted because of existing scan references:
> INFO org.apache.hadoop.hbase.regionserver.HStore - Can't archive compacted 
> file hdfs://hdfs-ha/hbase... because of either isCompactedAway=true or file 
> has reference, isReferencedInReads=true, refCount=1, skipping for now
> If the region is also under high write load, this happens quite often, and 
> the region may have only a few storefiles but tons of undeleted compacted 
> hdfs files.
> The region keeps all those files (in my case thousands) until the graceful 
> region closing procedure, which ignores existing references and drops 
> obsolete files. This works fine, apart from consuming some extra hdfs space, 
> but only in the case of normal region closing. If the region server crashes, 
> then the new region server responsible for that overfilled region reads the 
> hdfs folder and tries to deal with all the undeleted files, producing tons 
> of storefiles and compaction tasks and consuming an abnormal amount of 
> memory, which may lead to an OutOfMemory exception and further region server 
> crashes. Writes to the region stop because the number of storefiles reaches 
> the *hbase.hstore.blockingStoreFiles* limit, GC load goes high, and it may 
> take hours to compact all the files back into a working set.
> A workaround is to periodically check the file count in the hdfs folders 
> and force a region assign for the ones with too many files.
> It would be nice if the regionserver had a setting similar to 
> hbase.hstore.blockingStoreFiles and attempted to drop undeleted compacted 
> files when the number of files reaches this setting.
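The filtering step of the workaround quoted above could be sketched as follows. The paths, the column layout of `hdfs dfs -count` output (dir count, file count, bytes, path), and the 500-file threshold are all illustrative assumptions; sample lines stand in for the real command so the filter itself can be seen in isolation:

```shell
# In real use the input would come from something like:
#   hdfs dfs -count /hbase/data/default/<table>/*/<cf>
# and each flagged region would then be re-assigned via the hbase shell
# `assign` command. Here two fabricated count lines feed the same filter.
printf '%s\n' \
  '           1         4123        9876543 /hbase/data/default/t1/abc123/cf' \
  '           1           12          54321 /hbase/data/default/t1/def456/cf' \
| awk '$2 > 500 {print $4}'
# prints: /hbase/data/default/t1/abc123/cf
```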



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
