[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles

2020-03-23 Thread Hudson (Jira)


[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064938#comment-17064938 ]

Hudson commented on HBASE-23633:


Results for branch branch-2.3
[build #5 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/5/]: (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/5//General_Nightly_Build_Report/]

(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/5//JDK8_Nightly_Build_Report_(Hadoop2)/]

(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/5//JDK8_Nightly_Build_Report_(Hadoop3)/]

(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/5//JDK11_Nightly_Build_Report/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.

(x) {color:red}-1 client integration test{color}
-- Failed when running client tests on top of Hadoop 2. [See log for details|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/5//artifact/output-integration/hadoop-2.log]. (Note that this means we didn't run on Hadoop 3.)


> Find a way to handle the corrupt recovered hfiles
> -
>
> Key: HBASE-23633
> URL: https://issues.apache.org/jira/browse/HBASE-23633
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR, wal
>Affects Versions: 3.0.0, 2.3.0
>Reporter: Guanghao Zhang
>Assignee: Pankaj Kumar
>Priority: Critical
> Fix For: 3.0.0, 2.3.0, 2.4.0
>
>
> Copy the comment from the PR review.
>  
> If the file is a corrupt HFile, an exception will be thrown here, which will
> cause the region to fail to open.
> Maybe we can add a new parameter to control whether to skip the exception,
> similar to recovered-edits replay, which has the parameter
> "hbase.hregion.edits.replay.skip.errors".
>  
> Regions that can't be opened because of detached References or corrupt hfiles
> are a fact of life. We need to work on this issue. This will be a new variant on
> the problem -- i.e. bad recovered hfiles.
> Adding a config to ignore bad files and just open is a bit dangerous,
> as per @infraio, as it could mean silent data loss.
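
To make the proposal in the description concrete, here is a rough sketch of the kind of gate being discussed, inside the recovered-hfile validation loop. The property name and the surrounding variables (conf, store, file, LOG) are illustrative assumptions, not a committed API:

{code}
// Sketch only: the property name below is hypothetical, mirroring
// "hbase.hregion.edits.replay.skip.errors" from recovered-edits replay.
boolean skipErrors = conf.getBoolean("hbase.hregion.recovered.hfiles.skip.errors", false);
try {
  store.assertBulkLoadHFileOk(file.getPath());
} catch (CorruptHFileException e) {
  if (!skipErrors) {
    throw e; // fail the region open, as happens today
  }
  // Dangerous: skipping means any edits in this file are silently lost.
  LOG.warn("Skipping corrupt recovered hfile {}", file.getPath(), e);
}
{code}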



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles

2020-03-22 Thread Hudson (Jira)


[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064518#comment-17064518 ]

Hudson commented on HBASE-23633:


Results for branch branch-2
[build #2561 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2561/]: (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2561//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2561//JDK8_Nightly_Build_Report_(Hadoop2)/]

(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2561//JDK8_Nightly_Build_Report_(Hadoop3)/]

(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2561//JDK11_Nightly_Build_Report/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.

(x) {color:red}-1 client integration test{color}
-- Failed when running client tests on top of Hadoop 2. [See log for details|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2561//artifact/output-integration/hadoop-2.log]. (Note that this means we didn't run on Hadoop 3.)




[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles

2020-03-22 Thread Hudson (Jira)


[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064379#comment-17064379 ]

Hudson commented on HBASE-23633:


Results for branch master
[build #1675 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/1675/]: (x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/1675//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1675//JDK8_Nightly_Build_Report_(Hadoop2)/]

(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1675//JDK8_Nightly_Build_Report_(Hadoop3)/]

(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 report|https://builds.apache.org/job/HBase%20Nightly/job/master/1675//JDK11_Nightly_Build_Report/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.

(x) {color:red}-1 client integration test{color}
-- Failed when running client tests on top of Hadoop 2. [See log for details|https://builds.apache.org/job/HBase%20Nightly/job/master/1675//artifact/output-integration/hadoop-2.log]. (Note that this means we didn't run on Hadoop 3.)




[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles

2020-03-22 Thread Guanghao Zhang (Jira)


[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064206#comment-17064206 ]

Guanghao Zhang commented on HBASE-23633:


[~pankajkumar] As the PR was not updated for a long time, I merged it and opened a new issue to add a UT.



[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles

2020-02-03 Thread Pankaj Kumar (Jira)


[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029184#comment-17029184 ]

Pankaj Kumar commented on HBASE-23633:
--

Will raise a PR with a UT.



[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles

2020-02-01 Thread Guanghao Zhang (Jira)


[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028252#comment-17028252 ]

Guanghao Zhang commented on HBASE-23633:


[~pankajkumar] Are you working on this now?



[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles

2020-01-30 Thread Pankaj Kumar (Jira)


[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026669#comment-17026669 ]

Pankaj Kumar commented on HBASE-23633:
--

In my test scenario, all corrupted hfiles were of zero length. We should check for and delete zero-length files during the recovered-hfile bulk load, the same way it is handled while replaying edits:

{code}
private long loadRecoveredHFilesIfAny(Collection<HStore> stores) throws IOException {
  Path regionDir = getWALRegionDir();
  long maxSeqId = -1;
  for (HStore store : stores) {
    String familyName = store.getColumnFamilyName();
    FileStatus[] files =
        WALSplitUtil.getRecoveredHFiles(fs.getFileSystem(), regionDir, familyName);
    if (files != null && files.length != 0) {
      for (FileStatus file : files) {
        // Check for and delete the zero-length file before validation
        if (isZeroLengthThenDelete(fs.getFileSystem(), file.getPath())) {
          continue;
        }
        store.assertBulkLoadHFileOk(file.getPath());
        // ... (rest of the method elided)
{code}
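
For context, an illustrative sketch of what the isZeroLengthThenDelete helper used above could look like; the body below is an assumption for illustration, not the actual WALSplitUtil code:

{code}
// Illustrative body for the helper referenced above; the real
// implementation may differ.
public static boolean isZeroLengthThenDelete(FileSystem fs, Path path) throws IOException {
  FileStatus stat = fs.getFileStatus(path);
  if (stat.getLen() > 0) {
    return false; // non-empty, leave it for validation and bulk load
  }
  LOG.warn("File {} is zero-length, deleting.", path);
  fs.delete(path, false); // not recursive: it is a single file
  return true;
}
{code}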



[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles

2020-01-29 Thread Pankaj Kumar (Jira)


[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025829#comment-17025829 ]

Pankaj Kumar commented on HBASE-23633:
--

I also observed this problem during testing; many regions *FAILED* to open due to CorruptHFileException.
{noformat}
2020-01-29 07:07:13,911 | INFO  | RS_OPEN_REGION-RS-IP:RS-PORT-2 | Validating hfile at hdfs://cluster/hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793 for inclusion in store family region usertable01,user35466,1580220595485.a2f0e8b46399ce55e864d4ee7311c845. | org.apache.hadoop.hbase.regionserver.HStore.assertBulkLoadHFileOk(HStore.java:730)
2020-01-29 07:07:13,930 | ERROR | RS_OPEN_REGION-RS-IP:RS-PORT-2 | Failed open of region=usertable01,user35466,1580220595485.a2f0e8b46399ce55e864d4ee7311c845., starting to roll back the global memstore size. | org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:386)
org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file hdfs://cluster/hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793
    at org.apache.hadoop.hbase.io.hfile.HFile.openReader(HFile.java:503)
    at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:562)
    at org.apache.hadoop.hbase.regionserver.HStore.assertBulkLoadHFileOk(HStore.java:732)
    at org.apache.hadoop.hbase.regionserver.HRegion.loadRecoveredHFilesIfAny(HRegion.java:4905)
    at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:863)
    at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:824)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7023)
{noformat}


After digging more into the logs, I observed that this problem occurred when a "split-log-closeStream" thread was splitting a WAL into hfiles and the Region Server aborted for some reason. The "split-log-closeStream" thread was interrupted and left the recovered hfile in an intermediate state.

{noformat}
2020-01-28 23:01:04,962 | WARN  | RS_LOG_REPLAY_OPS-8-5-179-5:RS-PORT-0 | log splitting of WALs/RS-IP,RS-PORT,1580220469213-splitting/RS-IP%2CRS-PORT%2C1580220469213.1580222580793 interrupted, resigning | org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
java.io.InterruptedIOException
    at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.writeRemainingEntryBuffers(BoundedRecoveredHFilesOutputSink.java:186)
    at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.close(BoundedRecoveredHFilesOutputSink.java:155)
    at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:404)
    at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:225)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:105)
    at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
    at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.writeRemainingEntryBuffers(BoundedRecoveredHFilesOutputSink.java:179)
    ... 9 more
{noformat}
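
Given that failure mode, a truncated recovered hfile will fail trailer parsing, so one way to catch half-written files up front is to probe the trailer before bulk load. A hedged sketch, assuming the stock HFile.createReader API; the helper name isValidRecoveredHFile is made up:

{code}
// Sketch only: a file left in an intermediate state by an interrupted
// writer has no valid trailer, so probing the trailer identifies it
// before region open trips over it.
private static boolean isValidRecoveredHFile(FileSystem fs, Path path, Configuration conf) {
  try (HFile.Reader reader = HFile.createReader(fs, path, CacheConfig.DISABLED, true, conf)) {
    return true; // trailer parsed cleanly; file is structurally complete
  } catch (IOException e) {
    // CorruptHFileException ("Problem reading HFile Trailer") lands here
    LOG.warn("Recovered hfile {} looks corrupt or incomplete", path, e);
    return false;
  }
}
{code}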

Further, I checked the NN audit log and confirmed that the file was not written completely before the RS went down:
{noformat}
2020-01-28 23:01:04,946 | INFO  | IPC Server handler 125 on 25000 | BLOCK* allocate blk_1092127264_18392260, replicas=DN-IP1:DN-PORT, DN-IP2:DN-PORT, DN-IP3:DN-PORT for /hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793 | FSDirWriteFileOp.java:856

2020-01-29 00:01:04,956 | INFO  | org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@862fb5 | Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_-1098699935_1, pending creates: 21], src=/hbase/dat
{noformat}

[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles

2020-01-03 Thread Michael Stack (Jira)


[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007690#comment-17007690 ]

Michael Stack commented on HBASE-23633:
---

What I want is an indication in the Master log that a Region is not opening because hfiles are corrupt or we have dangling references. Currently it just says fail but not why (smile).
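
Something like the following is the kind of cause-specific diagnostic being asked for; the catch sites and wording are illustrative assumptions, not actual OpenRegionHandler code:

{code}
// Sketch only: surface *why* a region open failed instead of a bare failure.
try {
  region.initialize();
} catch (CorruptHFileException e) {
  LOG.error("Failed open of region={}: corrupt hfile: {}", region, e.getMessage());
  throw e;
} catch (FileNotFoundException e) {
  LOG.error("Failed open of region={}: possible dangling reference: {}", region, e.getMessage());
  throw e;
}
{code}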
