[ https://issues.apache.org/jira/browse/HBASE-10492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899501#comment-13899501 ]
Jerry He commented on HBASE-10492: ---------------------------------- I am able to reproduce the problem and get more realtime metrics on the region server. Now it does not seem to be a thread scheduling problem, or doesn't it? We can see from the live metrics: Initializing region since 5mins, 18sec ago Instantiating store for column family since 5mins, 18sec ago {code} Wed Feb 12 00:00:59 PST 2014 Initializing region tpch_hb_1000_2.lineitem,`\x01\x85\xF5\xEC\x8D\x01\x80\x00\x0B\x8E\x01\x80\x00\x00\x00\xB3\x9EN\xE3\x01\x80\x00\x00\x03,1392192032936.1d4381bb583f957a9996c1ef0fa3ce68. RUNNING (since 5mins, 18sec ago) Instantiating store for column family {NAME => 'cf', REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false', COMPRESSION => 'GZ', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER => 'ROW', TTL => '2147483647', VERSIONS => '2147483647', BLOCKSIZE => '65536'} (since 5mins, 18sec ago) Wed Feb 12 00:00:59 PST 2014 Initializing region tpch_hb_1000_2.lineitem,`\x01\x80\x01:\x94\x01\x80\x01:\x95\x01\x80\x00\x00\x00\xB5\xA8\x94\x04\x01\x80\x00\x00\x02,1392192032936.2980739184621d45397a972ea89c9411. RUNNING (since 5mins, 18sec ago) Instantiating store for column family {NAME => 'cf', REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false', COMPRESSION => 'GZ', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER => 'ROW', TTL => '2147483647', VERSIONS => '2147483647', BLOCKSIZE => '65536'} (since 5mins, 18sec ago) Wed Feb 12 00:00:59 PST 2014 Initializing region tpch_hb_1000_2.lineitem,`\x01\x85\xF5\xEC\x8D\x01\x80\x00\x0B\x8E\x01\x80\x00\x00\x00\xB3\x9EN\xE3\x01\x80\x00\x00\x03,1392192032936.1d4381bb583f957a9996c1ef0fa3ce68. RUNNING (since 8mins, 18sec ago) Instantiating store for column family {NAME => 'cf', REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false', COMPRESSION => 'GZ', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER => 'ROW', TTL => '2147483647', VERSIONS => '2147483647', BLOCKSIZE => '65536'} (since 8mins, 18sec ago) Wed Feb 12 00:00:59 PST 2014 Initializing region tpch_hb_1000_2.lineitem,`\x01\x80\x01:\x94\x01\x80\x01:\x95\x01\x80\x00\x00\x00\xB5\xA8\x94\x04\x01\x80\x00\x00\x02,1392192032936.2980739184621d45397a972ea89c9411. RUNNING (since 8mins, 18sec ago) Instantiating store for column family {NAME => 'cf', REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false', COMPRESSION => 'GZ', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER => 'ROW', TTL => '2147483647', VERSIONS => '2147483647', BLOCKSIZE => '65536'} (since 8mins, 18sec ago) {code} It is more like file system issue? BWT. HBase's new metrics rocks! > open daughter regions can unpredictably take long time > ------------------------------------------------------ > > Key: HBASE-10492 > URL: https://issues.apache.org/jira/browse/HBASE-10492 > Project: HBase > Issue Type: Bug > Components: regionserver > Affects Versions: 0.96.0 > Reporter: Jerry He > > I have seen during a stress testing client was getting > "RetriesExhaustedWithDetailsException: Failed 748 actions: > NotServingRegionException" > On the master log, 2014-02-08 20:43 is the timestamp from OFFLINE to > SPLITTING_NEW, 2014-02-08 21:41 is the timestamp from SPLITTING_NEW to OPEN. > The corresponding time period on the region sever log is: > {code} > 2014-02-08 20:44:12,662 WARN > org.apache.hadoop.hbase.regionserver.HRegionFileSystem: .regioninfo file not > found for region: 010c1981882d1a59201af5e2dc589d44 > 2014-02-08 20:44:12,666 WARN > org.apache.hadoop.hbase.regionserver.HRegionFileSystem: .regioninfo file not > found for region: c2eb9b7971ca7f3fed3da86df5b788e7 > {code} > There were no INFO related to these two regions until: (at the end see this: > Split took 57mins, 16sec) > {code} > 2014-02-08 21:41:14,029 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Onlined c2eb9b7971ca7f3fed3da86df5b788e7; next sequenceid=213355 > 2014-02-08 21:41:14,031 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Onlined 010c1981882d1a59201af5e2dc589d44; next sequenceid=213354 > 2014-02-08 21:41:14,032 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Post open deploy tasks > for > region=tpch_hb_1000_2.lineitem,]\x01\x8B\xE9\xF4\x8A\x01\x80p\xA3\xA4\x01\x80\x00\x00\x00\xB6\xB7+\x02\x01\x80\x00\x00\x02,1391921037353.c2eb9b7971ca7f3fed3da86df5b788e7. > 2014-02-08 21:41:14,054 INFO org.apache.hadoop.hbase.catalog.MetaEditor: > Updated row > tpch_hb_1000_2.lineitem,]\x01\x8B\xE9\xF4\x8A\x01\x80p\xA3\xA4\x01\x80\x00\x00\x00\xB6\xB7+\x02\x01\x80\x00\x00\x02,1391921037353.c2eb9b7971ca7f3fed3da86df5b788e7. > with server=hdtest208.svl.ibm.com,60020,1391887547473 > 2014-02-08 21:41:14,054 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Finished post open deploy > task for > tpch_hb_1000_2.lineitem,]\x01\x8B\xE9\xF4\x8A\x01\x80p\xA3\xA4\x01\x80\x00\x00\x00\xB6\xB7+\x02\x01\x80\x00\x00\x02,1391921037353.c2eb9b7971ca7f3fed3da86df5b788e7. > 2014-02-08 21:41:14,054 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Post open deploy tasks > for > region=tpch_hb_1000_2.lineitem,,1391921037353.010c1981882d1a59201af5e2dc589d44. > 2014-02-08 21:41:14,059 INFO org.apache.hadoop.hbase.regionserver.HStore: > Completed compaction of 10 file(s) in cf of > tpch_hb_1000_2.lineitem,^\x01\x8B\xE7(\x80\x01\x80\x93\xFD\x01\x01\x80\x00\x00\x00\xB5\x0E\xCC'\x01\x80\x00\x00\x03,1391918508561.1fbcfc0a792435dfd73ec5b0ef5c953c. > into 451be6df8c604993ae540b808d9cfa08(size=72.8 M), total size for store is > 2.4 G. This selection was in queue for 0sec, and took 1mins, 40sec to execute. > 2014-02-08 21:41:14,059 INFO > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Completed > compaction: Request = > regionName=tpch_hb_1000_2.lineitem,^\x01\x8B\xE7(\x80\x01\x80\x93\xFD\x01\x01\x80\x00\x00\x00\xB5\x0E\xCC'\x01\x80\x00\x00\x03,1391918508561.1fbcfc0a792435dfd73ec5b0ef5c953c., > storeName=cf, fileCount=10, fileSize=94.1 M, priority=9883, > time=1391924373278861000; duration=1mins, 40sec > 2014-02-08 21:41:14,059 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Starting compaction on cf in region > tpch_hb_1000_2.lineitem,]\x01\x8B\xE9\xF4\x8A\x01\x80p\xA3\xA4\x01\x80\x00\x00\x00\xB6\xB7+\x02\x01\x80\x00\x00\x02,1391921037353.c2eb9b7971ca7f3fed3da86df5b788e7. > 2014-02-08 21:41:14,059 INFO org.apache.hadoop.hbase.regionserver.HStore: > Starting compaction of 10 file(s) in cf of > tpch_hb_1000_2.lineitem,]\x01\x8B\xE9\xF4\x8A\x01\x80p\xA3\xA4\x01\x80\x00\x00\x00\xB6\xB7+\x02\x01\x80\x00\x00\x02,1391921037353.c2eb9b7971ca7f3fed3da86df5b788e7. > into > tmpdir=gpfs:/hbase/data/default/tpch_hb_1000_2.lineitem/c2eb9b7971ca7f3fed3da86df5b788e7/.tmp, > totalSize=709.7 M > 2014-02-08 21:41:14,066 INFO org.apache.hadoop.hbase.catalog.MetaEditor: > Updated row > tpch_hb_1000_2.lineitem,,1391921037353.010c1981882d1a59201af5e2dc589d44. with > server=hdtest208.svl.ibm.com,60020,1391887547473 > 2014-02-08 21:41:14,066 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Finished post open deploy > task for > tpch_hb_1000_2.lineitem,,1391921037353.010c1981882d1a59201af5e2dc589d44. > 2014-02-08 21:41:14,190 INFO > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, hbase:meta > updated, and report to master. > Parent=tpch_hb_1000_2.lineitem,,1391918508561.b576e8db65d56ec08db5ca900587c28d., > new regions: > tpch_hb_1000_2.lineitem,,1391921037353.010c1981882d1a59201af5e2dc589d44., > tpch_hb_1000_2.lineitem,]\x01\x8B\xE9\xF4\x8A\x01\x80p\xA3\xA4\x01\x80\x00\x00\x00\xB6\xB7+\x02\x01\x80\x00\x00\x02,1391921037353.c2eb9b7971ca7f3fed3da86df5b788e7.. > Split took 57mins, 16sec > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)