[ https://issues.apache.org/jira/browse/HBASE-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018801#comment-16018801 ]
Yu Li edited comment on HBASE-18084 at 5/21/17 12:41 PM:
---------------------------------------------------------

bq. if the initial batch contains large directory

But what if not, sir? Let me say more about my case. The current cleaning logic uses a depth-first algorithm, while the archive dir hierarchy looks like:
{noformat}
/hbase/archive/data
  - namespace
    - table
      - region
        - CF
          - files
{noformat}
By the time we reach one leaf directory, get its file list, and clean it, flushing is still ongoing, and the new files will only be included when we iterate the other directories later. In our case the output of "hadoop fs -count", ordered by space usage (descending), looks like:
{noformat}
        2043       686999     770527133663895 /hbase/archive/data/default/pora_6_feature_queue
        2049      3430815     470358930247550 /hbase/archive/data/default/pora_6_feature
       17101       704476     100740814980772 /hbase/archive/data/default/mainv3_ic
       14251       495293      79161730247206 /hbase/archive/data/default/mainv3_main_result_b
       14251       893144      71121202187220 /hbase/archive/data/default/mainv3_main_result_a
        2045        79223      51098022268522 /hbase/archive/data/default/pora_log_wireless_search_item_pv_queue
        2001       123332      49075201291122 /hbase/archive/data/default/mainv3_main_askr_queue_a
        2001        65030      45649351359151 /hbase/archive/data/default/mainv3_main_askr_queue_b
{noformat}
And we have many small directories like:
{noformat}
          13            6              173403 /hbase/archive/data/default/b2b-et2mainse_tisplus_tisplus_IdleFishPool_askr
           3            1              253497 /hbase/archive/data/default/b2b-et2mainse_tisplus_tisplus_buyoffer_searcher_askr
          17           17            15635421 /hbase/archive/data/default/b2b-et2mainse_tisplus_tisplus_cloud_wukuang_askr
          13            6            56062313 /hbase/archive/data/default/b2b-et2mainse_tisplus_tisplus_common_search_askr
           5            2             1165298 /hbase/archive/data/default/b2b-et2mainse_tisplus_tisplus_company_askr
          11            9             1196774 /hbase/archive/data/default/b2b-et2mainse_tisplus_tisplus_content_search_askr
{noformat}
So the largest 3 directories take 1.3PB while the whole archive directory takes 1.8PB, and the largest
directory names start with "p". If we used the greedy algorithm, we might choose {{mainv3_main_askr_queue_a}}, which has 123k files, to clean, while {{pora_6_feature_queue}} is still being flushed into at speed. In the worst case we cannot reach the largest dir for a long time. I agree that this depends on the real case, but in our case the simple method in the current patch works well, while I'm not sure whether the newly suggested approach will do (smile). Since the patch here is already applied online, how about letting it in and opening another JIRA to implement and verify the new approach with the greedy algo? [~tedyu]
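The ordering difference discussed above can be shown with a small self-contained sketch. The class and method names below are hypothetical (this is not HBase code); the byte counts are taken from the "hadoop fs -count" output quoted in the comment, largest four directories only. Dictionary order visits the {{mainv3_*}} directories before the two huge {{pora_*}} ones, while space-descending order, as the patch proposes, reaches the biggest consumers first:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not HBase code: contrasts the current
// dictionary-order traversal with the proposed space-descending one.
public class CleanOrderDemo {
  // Space consumed per archive subdirectory (bytes), from the
  // "hadoop fs -count" output quoted above (largest four only).
  static final Map<String, Long> DIR_SPACE = new LinkedHashMap<>();
  static {
    DIR_SPACE.put("pora_6_feature_queue", 770527133663895L);
    DIR_SPACE.put("pora_6_feature",       470358930247550L);
    DIR_SPACE.put("mainv3_ic",            100740814980772L);
    DIR_SPACE.put("mainv3_main_result_b",  79161730247206L);
  }

  // Current behaviour: subdirectories are visited in dictionary order.
  static List<String> dictionaryOrder() {
    List<String> dirs = new ArrayList<>(DIR_SPACE.keySet());
    Collections.sort(dirs);
    return dirs;
  }

  // Proposed behaviour: largest space consumers are visited first.
  static List<String> spaceDescendingOrder() {
    List<String> dirs = new ArrayList<>(DIR_SPACE.keySet());
    dirs.sort(Comparator.comparingLong(DIR_SPACE::get).reversed());
    return dirs;
  }

  public static void main(String[] args) {
    System.out.println("dictionary order:       " + dictionaryOrder());
    System.out.println("space-descending order: " + spaceDescendingOrder());
  }
}
```

Note that a real implementation would have to measure usage itself (e.g. via an aggregate subtree walk, which is what `hadoop fs -count` does), and that measurement is stale the moment flushes continue, which is part of the trade-off debated here.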
> Improve CleanerChore to clean from directory which consumes more disk space
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-18084
>                 URL: https://issues.apache.org/jira/browse/HBASE-18084
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Yu Li
>            Assignee: Yu Li
>         Attachments: HBASE-18084.patch, HBASE-18084.v2.patch
>
> Currently CleanerChore cleans directories in dictionary order, rather than starting from the directory with the largest space usage. When data abnormally accumulates to some huge volume in the archive directory, the cleaning speed might not be enough.
> This proposal is another improvement, working together with HBASE-18083, to resolve our online issue (archive dir consumed more than 1.8PB SSD space)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)