[ https://issues.apache.org/jira/browse/HBASE-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714814#comment-13714814 ]
Dave Latham commented on HBASE-8778:
------------------------------------

{quote}Could make it a config option. That way one could rolling upgrade the cluster, then flip the option on, and roll the cluster again. Would have to think through the details.{quote}

Yeah, I think it's still tricky to get the second roll correct if you're supporting some writers that write only to the old dir and some readers that are trying to read from the new location. I think you may be able to pull it off with multiple passes: first upgrade all the writers to write to both places but keep everything reading from the old location, then do a second rolling pass to move the readers to the new location. Then you could do a third pass to have writers write only to the new location.

{quote}Looking at the code. If the modtime of the tabledescriptor has changed after the cached version, we do what Dave described twice! First getTableInfoModtime is called; if that determines that the cache was changed, getTableDescriptorModtime is called, which does the same work of stat'ing the dir all over again.{quote}

Yes. I wasn't too concerned about this case, as the descriptors are normally not changed often, so this would only happen once per table.

{quote}When the table descriptor is cached, check the mod time of the table directory first; if that mod time is <= the cached descriptor's mod time we're good and do not need to stat the table directory. In a high churn table that might not help much, though, as new region dirs are constantly added.{quote}

That's an interesting approach I hadn't considered. I wasn't aware that HDFS maintained directory mod times. Sounds like a pretty simple change for 0.94 to me. In our case it would solve the issue for one of our huge tables that doesn't have much split activity, and make a huge difference for the other, which splits every few minutes or so.

{quote}Record the sequence number of a table descriptor when cached.
Instead of checking mod time, we can check whether the next highest sequence number exists. If so, we need to reload (but no need to check the mod time by stat'ing the dir). Can there be gaps in the sequence numbers?{quote}

I don't think this approach would work, as the table could be modified many times since the last check, so the reader couldn't know which sequence number to check.

> Region assignments scan table directory making them slow for huge tables
> -------------------------------------------------------------------------
>
>                  Key: HBASE-8778
>                  URL: https://issues.apache.org/jira/browse/HBASE-8778
>              Project: HBase
>           Issue Type: Improvement
>             Reporter: Dave Latham
>             Assignee: Dave Latham
>              Fix For: 0.98.0, 0.95.2, 0.94.11
>          Attachments: HBASE-8778-0.94.5.patch, HBASE-8778-0.94.5-v2.patch
>
> On a table with 130k regions it takes about 3 seconds for a region server to open a region once it has been assigned.
>
> Watching the threads for a region server running 0.94.5 that is opening many such regions shows the thread opening the region in code like this:
>
> {noformat}
> "PRI IPC Server handler 4 on 60020" daemon prio=10 tid=0x00002aaac07e9000 nid=0x6566 runnable [0x000000004c46d000]
>    java.lang.Thread.State: RUNNABLE
>         at java.lang.String.indexOf(String.java:1521)
>         at java.net.URI$Parser.scan(URI.java:2912)
>         at java.net.URI$Parser.parse(URI.java:3004)
>         at java.net.URI.<init>(URI.java:736)
>         at org.apache.hadoop.fs.Path.initialize(Path.java:145)
>         at org.apache.hadoop.fs.Path.<init>(Path.java:126)
>         at org.apache.hadoop.fs.Path.<init>(Path.java:50)
>         at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.java:215)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.makeQualified(DistributedFileSystem.java:252)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:311)
>         at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:159)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:842)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:867)
>         at org.apache.hadoop.hbase.util.FSUtils.listStatus(FSUtils.java:1168)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoPath(FSTableDescriptors.java:269)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoPath(FSTableDescriptors.java:255)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoModtime(FSTableDescriptors.java:368)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:155)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:126)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:2834)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:2807)
>         at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
> {noformat}
>
> To open the region, the region server first loads the latest HTableDescriptor. Since HBASE-4553, HTableDescriptors are stored in the file system at "/hbase/<tableDir>/.tableinfo.<sequenceNum>". The file with the largest sequenceNum is the current descriptor; this is done so that the current descriptor is updated atomically. However, since the filename is not known in advance, FSTableDescriptors has to do a FileSystem.listStatus operation, which has to list all files in the directory to find it. The directory also contains all the region directories, so in our case it has to load 130k FileStatus objects.
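The lookup described in the paragraph above can be illustrated with a small, self-contained sketch. This is not HBase's actual FSTableDescriptors code — the class and method names below are hypothetical — it only shows why finding the highest `.tableinfo.<sequenceNum>` entry forces a scan of the entire directory listing, region directories included:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Optional;

// Hypothetical sketch: every descriptor write creates
// /hbase/<tableDir>/.tableinfo.<sequenceNum>, and the reader must list the
// directory and keep the entry with the largest sequence number.
public class TableInfoLookup {
    static final String PREFIX = ".tableinfo.";

    // Parse the numeric suffix, or return -1 for entries that are not
    // tableinfo files (e.g. the region directories sharing the parent dir).
    static long parseSequenceId(String fileName) {
        if (!fileName.startsWith(PREFIX)) return -1L;
        try {
            return Long.parseLong(fileName.substring(PREFIX.length()));
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    // The full listing must be scanned even though only one entry matters --
    // this is the O(#regions) cost described in the issue.
    static Optional<String> currentDescriptor(String[] dirListing) {
        return Arrays.stream(dirListing)
                .filter(f -> parseSequenceId(f) >= 0)
                .max(Comparator.comparingLong(TableInfoLookup::parseSequenceId));
    }

    public static void main(String[] args) {
        String[] listing = {"region-aaa", ".tableinfo.0000000003",
                            "region-bbb", ".tableinfo.0000000007"};
        System.out.println(currentDescriptor(listing).orElse("none"));
    }
}
```

With 130k region directories in `dirListing`, the filter discards nearly everything it fetched, which is exactly the wasted work the issue is about.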
> Even using a globStatus matching function still transfers all the objects to the client before performing the pattern matching. Furthermore, HDFS transfers a default of 1000 directory entries in each RPC call, so it requires 130 round trips to the namenode to fetch all the directory entries.
>
> Consequently, reassigning all the regions of a table (or a constant fraction thereof) requires time proportional to the square of the number of regions. In our case, if a region server fails with 200 such regions, it takes 10+ minutes for them all to be reassigned, after the ZK expiration and log splitting.
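The directory mod-time shortcut discussed in the comment above could be sketched roughly as follows. Again, this is an illustrative class rather than HBase's actual FSTableDescriptors cache: the names are invented, and the cheap stat and the expensive listing are abstracted as parameters so the idea stands on its own:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of the mod-time check: cache the descriptor together
// with the table directory's modification time, and skip the expensive
// directory listing when the directory has not changed since caching.
public class DescriptorCache {
    static final class CachedEntry {
        final String descriptor;   // stand-in for an HTableDescriptor
        final long dirModTime;     // table dir mod time when cached
        CachedEntry(String d, long t) { descriptor = d; dirModTime = t; }
    }

    private final Map<String, CachedEntry> cache = new ConcurrentHashMap<>();

    // currentDirModTime stands in for one cheap getFileStatus call on the
    // table directory; expensiveLoad stands in for the full listStatus scan.
    public String get(String table, long currentDirModTime,
                      Supplier<String> expensiveLoad) {
        CachedEntry e = cache.get(table);
        // One stat of the directory instead of a full listing. Caveat from
        // the discussion: in a high-churn table, new region directories also
        // bump the directory mod time, forcing a reload anyway.
        if (e != null && currentDirModTime <= e.dirModTime) {
            return e.descriptor;
        }
        String fresh = expensiveLoad.get();
        cache.put(table, new CachedEntry(fresh, currentDirModTime));
        return fresh;
    }
}
```

For the low-churn table described in the comment, most calls would hit the `<=` branch and never pay the 130-roundtrip listing cost.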