[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h

2011-11-21 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154362#comment-13154362
 ] 

stack commented on HBASE-4797:
--

Thanks Jimmy for taking this on.  Looks like you don't have to rename the 
files; just sort them and figure out which set to apply (and do what Todd 
suggests: rewrite the znode less often, or asynchronously).

 [availability] Give recovered.edits files better names, ones that include 
 first and last sequence id so we can skip files with edits we know older than 
 current region has
 --

 Key: HBASE-4797
 URL: https://issues.apache.org/jira/browse/HBASE-4797
 Project: HBase
  Issue Type: Bug
  Components: performance
Reporter: stack
Assignee: Jimmy Xiang
Priority: Critical
  Labels: noob

 Testing 0.92, I crashed all servers out.  Another bug makes it so WALs are 
 not getting cleaned so I had 7000 regions to replay.  The distributed split 
 code did a nice job and cluster came back but interesting is that some hot 
 regions ended up having loads of recovered.edits files -- tens if not 
 hundreds -- to replay against the region (can we bulk load recovered.edits 
 instead of replaying them?).  Each recovered.edits file is taking about a 
 second to process (though only about 30 odd edits per file it seems).  The 
 region is unavailable during this time.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h

2011-11-21 Thread Jimmy Xiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154380#comment-13154380
 ] 

Jimmy Xiang commented on HBASE-4797:


Yes, that's what I was thinking. The file name has the start seq id.  If
there are multiple files, there are multiple start seq ids.  Once the files
are sorted, each start seq id implies the max seq id of the file before it.
I can use this information to filter out some files safely.





[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h

2011-11-21 Thread Jimmy Xiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154538#comment-13154538
 ] 

Jimmy Xiang commented on HBASE-4797:


The region opening is retried periodically.  The waiting interval is about 1/3 
of the assignment timeout. I think that's fine.





[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h

2011-11-21 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154689#comment-13154689
 ] 

jirapos...@reviews.apache.org commented on HBASE-4797:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2906/
---

Review request for hbase, Todd Lipcon and Michael Stack.


Summary
---

If there are multiple recovered edits files, I used the file name to find the 
initial sequence id.  After these files are sorted, we can find a file's 
possible maximum sequence id based on the next file's initial sequence id.  If 
the maximum sequence id is smaller than the current sequence id, the whole 
recovered edits file is old and ignored.
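The filtering described in this summary can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class `RecoveredEditsFilter` and its methods are hypothetical names, file names are assumed to be plain decimal sequence ids (as in `recovered.edits/0296860`), and the next file's first sequence id is taken as a conservative upper bound for the current file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper for illustration; not the HRegion code under review. */
class RecoveredEditsFilter {

  /** Parses a file name such as "0296860" into its first sequence id. */
  static long firstSequenceId(String fileName) {
    return Long.parseLong(fileName);
  }

  /**
   * Returns the files that still need replaying, oldest first.
   * Assumption: every edit in a file has a sequence id no larger than the
   * first sequence id of the next file in sorted order.
   */
  static List<String> filesToReplay(List<String> fileNames, long currentSeqId) {
    List<String> sorted = new ArrayList<>(fileNames);
    sorted.sort(Comparator.comparingLong(RecoveredEditsFilter::firstSequenceId));

    List<String> toReplay = new ArrayList<>();
    for (int i = 0; i < sorted.size(); i++) {
      // The last file has no successor, so its upper bound is unknown.
      long upperBound = (i + 1 < sorted.size())
          ? firstSequenceId(sorted.get(i + 1))
          : Long.MAX_VALUE;
      if (upperBound <= currentSeqId) {
        continue; // every edit in this file is already reflected in the region
      }
      toReplay.add(sorted.get(i));
    }
    return toReplay;
  }

  public static void main(String[] args) {
    // File names and sequence id taken from the log excerpt on this issue.
    List<String> names = Arrays.asList("0296970", "0296860", "0297041", "0296914");
    System.out.println(filesToReplay(names, 351600L)); // prints [0297041]
  }
}
```

The last file is always kept in this sketch because nothing bounds its sequence ids from above.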


This addresses bug HBASE-4797.
https://issues.apache.org/jira/browse/HBASE-4797


Diffs
-

  src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 8b89661 
  src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java 5daa02b 

Diff: https://reviews.apache.org/r/2906/diff


Testing
---

Added a test case to TestHRegion; all the tests in that class pass.


Thanks,

Jimmy







[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h

2011-11-21 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154696#comment-13154696
 ] 

jirapos...@reviews.apache.org commented on HBASE-4797:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2906/#review3409
---


Very nice patch.

In future, would suggest you confine your change just to what you are adding.   
The white space cleanup is nice but it distracts from your patch.  It also 
bloats it and makes it look intimidating to review (smile).

Minor fixups only.


src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
https://reviews.apache.org/r/2906/#comment7635

So, are these already sorted in right order from oldest edit to newest?



src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
https://reviews.apache.org/r/2906/#comment7636

Possilbe should be Possible.

I'd be more assertive in this message.  Maximum possible sequenceid for 
this log is  + + , skipping ..



src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
https://reviews.apache.org/r/2906/#comment7637

Good.



src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
https://reviews.apache.org/r/2906/#comment7638

Any more asserts we can do in here?   Assert we replayed N of the M files?


- Michael









[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h

2011-11-21 Thread Kannan Muthukkaruppan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154699#comment-13154699
 ] 

Kannan Muthukkaruppan commented on HBASE-4797:
--

The title for the bug can be updated given that we are no longer renaming the 
files in recovered.edits. [That concerned me initially -- but reading through 
the details, looks like you have come up with a way to avoid a new name format. 
That's always smoother for upgrades and such.]







[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h

2011-11-16 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151411#comment-13151411
 ] 

stack commented on HBASE-4797:
--

Thinking some more on this, we don't need to rename recovered.edits files.  The 
files are named for the first sequenceid in the file, so we could just do a file 
listing and sort the result.  Then we'd have a range of sequenceids per file.  We 
could then just skip files whose edits are all older than the region's current 
seqid.
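As a rough sketch of the "range of sequenceids per file" idea, the listing can be sorted by the numeric file name and each file paired with the next file's first sequenceid. The class `EditFileRanges` is a hypothetical name used only for illustration, and treating the next file's first sequenceid as an exclusive upper bound is an assumption rather than something this issue confirms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of deriving per-file sequence id ranges. */
class EditFileRanges {

  /** Returns one description per file, ordered by first sequence id. */
  static List<String> describe(List<String> fileNames) {
    // Key: first sequence id parsed from the name; value: the original name.
    TreeMap<Long, String> byStart = new TreeMap<>();
    for (String name : fileNames) {
      byStart.put(Long.parseLong(name), name);
    }

    List<String> out = new ArrayList<>();
    for (Map.Entry<Long, String> e : byStart.entrySet()) {
      // Assumed exclusive upper bound: the next file's first sequence id.
      Long next = byStart.higherKey(e.getKey());
      String upper = (next == null) ? "unbounded" : String.valueOf(next);
      out.add(e.getValue() + ": [" + e.getKey() + ", " + upper + ")");
    }
    return out;
  }

  public static void main(String[] args) {
    for (String line : describe(Arrays.asList("0296914", "0296860"))) {
      System.out.println(line);
    }
    // prints:
    // 0296860: [296860, 296914)
    // 0296914: [296914, unbounded)
  }
}
```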





[jira] [Commented] (HBASE-4797) [availability] Give recovered.edits files better names, ones that include first and last sequence id so we can skip files with edits we know older than current region h

2011-11-15 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150992#comment-13150992
 ] 

stack commented on HBASE-4797:
--

Oh... I suppose it's a bit worse than I thought.  I'm looking at a region that 
has nearly 6k recovered.edits files to replay.  The RegionServer is doing this 
per file:

{code}
2011-11-16 03:06:02,403 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Applied 0, skipped 33, firstSequenceidInLog=296860, maxSequenceidInLog=351600, 
path=hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296860
2011-11-16 03:06:02,405 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
Replaying edits from 
hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296914;
 minSequenceid=351600; 
path=hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296914
2011-11-16 03:06:05,097 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:7003-0x133a5bab186271f Attempting to transition node 
69ab6eb0e2feff1fda52d36d8fa75798 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_OPENING
2011-11-16 03:06:05,278 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:7003-0x133a5bab186271f Successfully transitioned node 
69ab6eb0e2feff1fda52d36d8fa75798 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_OPENING
2011-11-16 03:06:05,278 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Applied 0, skipped 33, firstSequenceidInLog=296914, maxSequenceidInLog=351600, 
path=hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296914
2011-11-16 03:06:05,279 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
Replaying edits from 
hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296970;
 minSequenceid=351600; 
path=hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296970
2011-11-16 03:06:05,952 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:7003-0x133a5bab186271f Attempting to transition node 
69ab6eb0e2feff1fda52d36d8fa75798 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_OPENING
2011-11-16 03:06:06,093 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:7003-0x133a5bab186271f Successfully transitioned node 
69ab6eb0e2feff1fda52d36d8fa75798 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_OPENING
2011-11-16 03:06:06,093 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Applied 0, skipped 44, firstSequenceidInLog=296970, maxSequenceidInLog=351600, 
path=hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0296970
2011-11-16 03:06:06,094 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
Replaying edits from 
hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0297041;
 minSequenceid=351600; 
path=hdfs://sv4r11s38:7000/hbase/TestTable/69ab6eb0e2feff1fda52d36d8fa75798/recovered.edits/0297041
2011-11-16 03:06:06,795 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:7003-0x133a5bab186271f Attempting to transition node 
69ab6eb0e2feff1fda52d36d8fa75798 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_OPENING
2011-11-16 03:06:06,810 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:7003-0x133a5bab186271f Successfully transitioned node 
69ab6eb0e2feff1fda52d36d8fa75798 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_OPENING
{code}
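To see how much of this replay work is wasted, the "Applied N, skipped M" summaries can be totalled. The sketch below is only an illustration: `ReplayLogTally` is a hypothetical class, and the regular expression assumes the message format shown in the excerpt above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that totals applied and skipped edits from log lines. */
class ReplayLogTally {

  // Matches the per-file summary seen in the excerpt, e.g. "Applied 0, skipped 33".
  private static final Pattern SUMMARY =
      Pattern.compile("Applied (\\d+), skipped (\\d+)");

  /** Returns a two-element array: {total applied, total skipped}. */
  static long[] tally(List<String> logLines) {
    long applied = 0;
    long skipped = 0;
    for (String line : logLines) {
      Matcher m = SUMMARY.matcher(line);
      if (m.find()) {
        applied += Long.parseLong(m.group(1));
        skipped += Long.parseLong(m.group(2));
      }
    }
    return new long[] {applied, skipped};
  }

  public static void main(String[] args) {
    // The three summary lines from the excerpt above.
    List<String> lines = Arrays.asList(
        "Applied 0, skipped 33, firstSequenceidInLog=296860, maxSequenceidInLog=351600,",
        "Applied 0, skipped 33, firstSequenceidInLog=296914, maxSequenceidInLog=351600,",
        "Applied 0, skipped 44, firstSequenceidInLog=296970, maxSequenceidInLog=351600,");
    long[] totals = tally(lines);
    System.out.println("applied=" + totals[0] + " skipped=" + totals[1]);
    // prints applied=0 skipped=110
  }
}
```

For the three files in the excerpt, no edit is applied and all 110 are skipped, which is the case the file-skipping approach is meant to avoid.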
