[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h
[ https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154362#comment-13154362 ] stack commented on HBASE-4797:
--
Thanks Jimmy for taking this on. Looks like you don't have to rename the files; just sort them and figure which set to apply (and do what Todd suggests rewriting the znode less often -- or asynchronously).

[availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region has
--
Key: HBASE-4797
URL: https://issues.apache.org/jira/browse/HBASE-4797
Project: HBase
Issue Type: Bug
Components: performance
Reporter: stack
Assignee: Jimmy Xiang
Priority: Critical
Labels: noob

Testing 0.92, I crashed all servers out. Another bug makes it so WALs are not getting cleaned so I had 7000 regions to replay. The distributed split code did a nice job and cluster came back but interesting is that some hot regions ended up having loads of recovered.edits files -- tens if not hundreds -- to replay against the region (can we bulk load recovered.edits instead of replaying them?). Each recovered.edits file is taking about a second to process (though only about 30 odd edits per file it seems). The region is unavailable during this time.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h
[ https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154380#comment-13154380 ] Jimmy Xiang commented on HBASE-4797: Yes, that's what I was thinking. The file name has the start seq id. If there are multiple files, there should be multiple start seq ids. Once the files are sorted, the start seq ids imply the max seq ids of some of these files (each file's max is bounded by the next file's start). I can use this information to filter out some files safely. (In reply to stack's comment of Mon, Nov 21, 2011 at 10:52 AM.)
[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h
[ https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154538#comment-13154538 ] Jimmy Xiang commented on HBASE-4797: The region opening is tried periodically. The waiting interval is about 1/3 of the assignment timeout. I think that's fine.
[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h
[ https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154689#comment-13154689 ] jirapos...@reviews.apache.org commented on HBASE-4797:

This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2906/

Review request for hbase, Todd Lipcon and Michael Stack.

Summary
-------
If there are multiple recovered edits files, I use each file's name to find its initial sequence id. After these files are sorted, we can find a file's maximum possible sequence id based on the next file's initial sequence id. If that maximum sequence id is smaller than the current sequence id, the whole recovered edits file is old and is ignored.

This addresses bug HBASE-4797.
https://issues.apache.org/jira/browse/HBASE-4797

Diffs
-----
src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 8b89661
src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java 5daa02b

Diff: https://reviews.apache.org/r/2906/diff

Testing
-------
Added a test case to TestHRegion; all the tests in TestHRegion pass.

Thanks,
Jimmy
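The filtering described in the review request summary can be sketched roughly as below. This is not the code from the patch: the class and method names are hypothetical, the file names and the current sequence id (351600) are borrowed from the log excerpt stack posted on this issue, and it assumes sequence ids strictly increase from one file to the next, so that a file's largest possible id is one less than the next file's first id. The strict "smaller than" comparison follows the wording of the summary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: decides which recovered.edits files still need replaying. */
final class RecoveredEditsFilter {

    /** Hypothetical helper: a file is named for the first sequence id it holds. */
    static long startSeqId(String fileName) {
        return Long.parseLong(fileName);
    }

    /**
     * Returns the file names that must still be replayed, oldest first.
     * A file is skipped when the largest sequence id it could hold
     * (assumed to be one less than the next file's first id) is smaller
     * than currentSeqId.
     */
    static List<String> filesToReplay(List<String> fileNames, long currentSeqId) {
        List<String> sorted = new ArrayList<>(fileNames);
        // Compare numerically so the order does not depend on zero padding.
        sorted.sort((a, b) -> Long.compare(startSeqId(a), startSeqId(b)));

        List<String> toReplay = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            boolean isLast = (i == sorted.size() - 1);
            if (!isLast) {
                long maxPossible = startSeqId(sorted.get(i + 1)) - 1;
                if (maxPossible < currentSeqId) {
                    continue; // every edit in this file is older than the region
                }
            }
            // The last file has no successor, so its upper bound is unknown
            // and it is always kept.
            toReplay.add(sorted.get(i));
        }
        return toReplay;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("0296970", "0296860", "0297041", "0296914");
        System.out.println(filesToReplay(names, 351600L)); // prints [0297041]
    }
}
```

With these inputs only the last file is kept, because its upper bound cannot be inferred from the names alone. The three earlier files are dropped without being opened, which is the case the log shows as "Applied 0, skipped 33".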
[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h
[ https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154696#comment-13154696 ] jirapos...@reviews.apache.org commented on HBASE-4797:

This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2906/#review3409

Very nice patch. In future, would suggest you confine your change just to what you are adding. The white space cleanup is nice but it distracts from your patch. It also bloats it and makes it look intimidating to review (smile). Minor fixups only.

src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
https://reviews.apache.org/r/2906/#comment7635
So, are these already sorted in right order from oldest edit to newest?

src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
https://reviews.apache.org/r/2906/#comment7636
Possilbe should be Possible. I'd be more assertive in this message. Maximum possible sequenceid for this log is + + , skipping ..

src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
https://reviews.apache.org/r/2906/#comment7637
Good.

src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
https://reviews.apache.org/r/2906/#comment7638
Any more asserts we can do in here? Assert we replayed N of the M files?

- Michael

In reply to Jimmy Xiang's review request (updated 2011-11-21 22:38:39): https://reviews.apache.org/r/2906/
[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h
[ https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154699#comment-13154699 ] Kannan Muthukkaruppan commented on HBASE-4797: The title for the bug can be updated given that we are no longer renaming the files in recovered.edits. [That concerned me initially -- but reading through the details, looks like you have come up with a way to avoid a new name format. That's always smoother for upgrades and such.]
[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h
[ https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151411#comment-13151411 ] stack commented on HBASE-4797: Thinking some more on this, we don't need to rename recovered.edits files. The files are named for the first sequenceid in the file, so we could just do a file listing and sort the result. Then we'd have a range of sequenceids per file. We could then just pass over files whose edits are all smaller than the region's current seqid.
[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h
[ https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150992#comment-13150992 ] stack commented on HBASE-4797: Oh... I suppose it's a bit worse than I thought. I'm looking at a region that has nearly 6k recovered.edits files to replay. The RegionServer is doing this per file:
{code}
2011-11-16 03:06:02,403 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Applied 0, skipped 33, firstSequenceidInLog=296860, maxSequenceidInLog=351600, path=hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296860
2011-11-16 03:06:02,405 INFO org.apache.hadoop.hbase.regionserver.HRegion: Replaying edits from hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296914; minSequenceid=351600; path=hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296914
2011-11-16 03:06:05,097 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:7003-0x133a5bab186271f Attempting to transition node 69ab6eb0e2feff1fda52d36d8fa75798 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
2011-11-16 03:06:05,278 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:7003-0x133a5bab186271f Successfully transitioned node 69ab6eb0e2feff1fda52d36d8fa75798 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
2011-11-16 03:06:05,278 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Applied 0, skipped 33, firstSequenceidInLog=296914, maxSequenceidInLog=351600, path=hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296914
2011-11-16 03:06:05,279 INFO org.apache.hadoop.hbase.regionserver.HRegion: Replaying edits from hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296970; minSequenceid=351600; path=hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296970
2011-11-16 03:06:05,952 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:7003-0x133a5bab186271f Attempting to transition node 69ab6eb0e2feff1fda52d36d8fa75798 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
2011-11-16 03:06:06,093 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:7003-0x133a5bab186271f Successfully transitioned node 69ab6eb0e2feff1fda52d36d8fa75798 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
2011-11-16 03:06:06,093 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Applied 0, skipped 44, firstSequenceidInLog=296970, maxSequenceidInLog=351600, path=hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296970
2011-11-16 03:06:06,094 INFO org.apache.hadoop.hbase.regionserver.HRegion: Replaying edits from hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0297041; minSequenceid=351600; path=hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0297041
2011-11-16 03:06:06,795 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:7003-0x133a5bab186271f Attempting to transition node 69ab6eb0e2feff1fda52d36d8fa75798 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
2011-11-16 03:06:06,810 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:7003-0x133a5bab186271f Successfully transitioned node 69ab6eb0e2feff1fda52d36d8fa75798 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
{code}
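For a rough sense of what this means for availability, the two figures stack gives (nearly 6k files, about a second per file) can be multiplied out. The sketch below is back-of-the-envelope only: the class and method names are made up for illustration, and the flat one-second cost is an assumption, since the log excerpt shows individual files taking anywhere from under a second to nearly three.

```java
/** Back-of-the-envelope estimate of how long a region stays unavailable. */
final class ReplayCostEstimate {

    /** Total replay time in seconds, assuming a flat cost per file. */
    static long estimatedSeconds(long fileCount, double secondsPerFile) {
        return Math.round(fileCount * secondsPerFile);
    }

    public static void main(String[] args) {
        long seconds = estimatedSeconds(6000, 1.0);
        // prints "6000 s, roughly 100 min"
        System.out.println(seconds + " s, roughly " + (seconds / 60) + " min");
    }
}
```

Every file that can be ruled out by name alone comes straight off that total, which is why skipping old files matters for a hot region.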