Let's take this to JIRA
On Wed, Jul 30, 2014 at 12:50 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> In BaseDecoder#rethrowEofException():
>
>   if (!isEof) throw ioEx;
>   LOG.error("Partial cell read caused by EOF: " + ioEx);
>   EOFException eofEx = new EOFException("Partial cell read");
>   eofEx.initCause(ioEx);
>   throw eofEx;
>
> Throwing EOFException does not propagate the "Partial cell read" condition
> to HLogSplitter, which doesn't treat EOFException as an error.
>
> I think a new exception type (DecoderException, e.g.) should be used above.
>
> Cheers
>
>
> On Wed, Jul 30, 2014 at 10:22 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Looking at HLogSplitter#getNextLogLine():
> >
> >   try {
> >     return in.next();
> >   } catch (EOFException eof) {
> >     // truncated files are expected if a RS crashes (see HBASE-2643)
> >     LOG.info("EOF from hlog " + path + ". continuing");
> >     return null;
> >
> > The EOFException is not treated as an error. But the posted log doesn't
> > contain "EOF from hlog " - there may be another code path leading to
> > codec.BaseDecoder.
> >
> > Cheers
> >
> >
> > On Wed, Jul 30, 2014 at 9:20 AM, Kiran Kumar.M.R <kiran.kumar...@huawei.com> wrote:
> >
> >> Hi,
> >>
> >> After step 4 (i.e. disabling WAL encryption, removing
> >> SecureProtobufReader/Writer, and restarting), reading the encrypted WALs
> >> fails, mainly due to an EOFException at BaseDecoder. This is not
> >> considered an error, and these WALs are moved to /oldWALs.
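[Editor's note: Ted's suggestion above can be sketched as follows. This is a minimal, hypothetical illustration — `DecoderException` does not exist in HBase at this point, and `rethrowEofException`/`handleRead` below are simplified stand-ins for the real BaseDecoder and HLogSplitter code. The point is that an IOException subtype which is *not* an EOFException survives a `catch (EOFException)` clause like the one in HLogSplitter#getNextLogLine, so the partial-read condition is surfaced instead of swallowed.]

```java
import java.io.EOFException;
import java.io.IOException;

// Hypothetical exception type per Ted's suggestion; not an actual HBase class.
// It extends IOException but NOT EOFException, so a catch (EOFException)
// clause will not swallow it.
class DecoderException extends IOException {
    DecoderException(String message, Throwable cause) {
        super(message, cause);
    }
}

public class DecoderExceptionSketch {

    // Simplified stand-in for BaseDecoder#rethrowEofException with the
    // proposed change: wrap the cause in DecoderException instead of
    // EOFException, preserving the "Partial cell read" condition.
    static void rethrowEofException(IOException ioEx, boolean isEof) throws IOException {
        if (!isEof) throw ioEx;
        throw new DecoderException("Partial cell read", ioEx);
    }

    // Simplified stand-in for the HLogSplitter#getNextLogLine catch logic:
    // a plain EOFException is still treated as expected truncation ("skip"),
    // while the new type propagates to the caller as a real error.
    static String handleRead(IOException fromDecoder) throws IOException {
        try {
            rethrowEofException(fromDecoder, true);
            return "ok";
        } catch (EOFException eof) {
            // truncated files are expected if a RS crashes (see HBASE-2643)
            return "skip";
        }
        // A DecoderException (plain IOException) falls through to the caller.
    }

    public static void main(String[] args) {
        try {
            System.out.println(handleRead(new IOException("Premature EOF from inputStream")));
        } catch (IOException e) {
            System.out.println("error surfaced: " + e.getMessage());
        }
    }
}
```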
> >>
> >> The following is observed in the log files:
> >>
> >> 2014-07-30 19:44:29,254 INFO [RS_LOG_REPLAY_OPS-HOST-16:15264-1] wal.HLogSplitter: Splitting hlog: hdfs://HOST-16:18020/hbase/WALs/HOST-16,15264,1406725441997-splitting/HOST-16%2C15264%2C1406725441997.1406725444017, length=172
> >> 2014-07-30 19:44:29,254 INFO [RS_LOG_REPLAY_OPS-HOST-16:15264-1] wal.HLogSplitter: DistributedLogReplay = false
> >> 2014-07-30 19:44:29,313 INFO [RS_LOG_REPLAY_OPS-HOST-16:15264-1] util.FSHDFSUtils: Recovering lease on dfs file hdfs://HOST-16:18020/hbase/WALs/HOST-16,15264,1406725441997-splitting/HOST-16%2C15264%2C1406725441997.1406725444017
> >> 2014-07-30 19:44:29,315 INFO [RS_LOG_REPLAY_OPS-HOST-16:15264-1] util.FSHDFSUtils: recoverLease=true, attempt=0 on file=hdfs://HOST-16:18020/hbase/WALs/HOST-16,15264,1406725441997-splitting/HOST-16%2C15264%2C1406725441997.1406725444017 after 1ms
> >> 2014-07-30 19:44:29,429 DEBUG [RS_LOG_REPLAY_OPS-HOST-16:15264-1-Writer-0] wal.HLogSplitter: Writer thread Thread[RS_LOG_REPLAY_OPS-HOST-16:15264-1-Writer-0,5,main]: starting
> >> 2014-07-30 19:44:29,429 DEBUG [RS_LOG_REPLAY_OPS-HOST-16:15264-1-Writer-1] wal.HLogSplitter: Writer thread Thread[RS_LOG_REPLAY_OPS-HOST-16:15264-1-Writer-1,5,main]: starting
> >> 2014-07-30 19:44:29,430 DEBUG [RS_LOG_REPLAY_OPS-HOST-16:15264-1-Writer-2] wal.HLogSplitter: Writer thread Thread[RS_LOG_REPLAY_OPS-HOST-16:15264-1-Writer-2,5,main]: starting
> >> 2014-07-30 19:44:29,591 ERROR [RS_LOG_REPLAY_OPS-HOST-16:15264-1] codec.BaseDecoder: Partial cell read caused by EOF: java.io.IOException: Premature EOF from inputStream
> >> 2014-07-30 19:44:29,592 INFO [RS_LOG_REPLAY_OPS-HOST-16:15264-1] wal.HLogSplitter: Finishing writing output logs and closing down.
> >> 2014-07-30 19:44:29,592 INFO [RS_LOG_REPLAY_OPS-HOST-16:15264-1] wal.HLogSplitter: Waiting for split writer threads to finish
> >> 2014-07-30 19:44:29,592 INFO [RS_LOG_REPLAY_OPS-HOST-16:15264-1] wal.HLogSplitter: Split writers finished
> >> 2014-07-30 19:44:29,592 INFO [RS_LOG_REPLAY_OPS-HOST-16:15264-1] wal.HLogSplitter: Processed 0 edits across 0 regions; log file=hdfs://HOST-16:18020/hbase/WALs/HOST-16,15264,1406725441997-splitting/HOST-16%2C15264%2C1406725441997.1406725444017 is corrupted = false progress failed = false
> >>
> >> To fix this, we need to propagate the EOF exception to HLogSplitter. Any
> >> suggestions on the fix?
> >>
> >> Regards,
> >> Kiran
> >>
> >> __________________________________________________________________________________________________________
> >> This e-mail and its attachments contain confidential information from
> >> HUAWEI, which is intended only for the person or entity whose address is
> >> listed above. Any use of the information contained herein in any way
> >> (including, but not limited to, total or partial disclosure, reproduction,
> >> or dissemination) by persons other than the intended recipient(s) is
> >> prohibited. If you receive this e-mail in error, please notify the sender
> >> by phone or email immediately and delete it!
> >> __________________________________________________________________________________________________________
> >>
> >> > -----Original Message-----
> >> > From: Shankar hiremath [mailto:shankar.hirem...@huawei.com]
> >> > Sent: Friday, July 25, 2014 19:38
> >> > To: user@hbase.apache.org
> >> > Subject: HBase file encryption, inconsistencies observed and data loss
> >> >
> >> > With HBase file encryption, some inconsistencies were observed, and
> >> > data loss happens after running the hbck tool. The operation steps are
> >> > as below.
(One thing I observed: on startup of HMaster, if it is not able to process
> >> > the WAL file, that file is also moved to /oldWALs.)
> >> >
> >> > Procedure:
> >> > 1. Start the HBase services (HMaster & RegionServer).
> >> > 2. Enable HFile encryption and WAL encryption as below, and perform
> >> > 'table4-0' put operations (100 records added):
> >> >
> >> > <property>
> >> >   <name>hbase.crypto.keyprovider</name>
> >> >   <value>org.apache.hadoop.hbase.io.crypto.KeyStoreKeyProvider</value>
> >> > </property>
> >> > <property>
> >> >   <name>hbase.crypto.keyprovider.parameters</name>
> >> >   <value>jceks:///opt/shankar1/kdc_keytab/hbase.jks?password=Hadoop@234</value>
> >> > </property>
> >> > <property>
> >> >   <name>hbase.crypto.master.key.name</name>
> >> >   <value>hdfs</value>
> >> > </property>
> >> > <property>
> >> >   <name>hfile.format.version</name>
> >> >   <value>3</value>
> >> > </property>
> >> > <property>
> >> >   <name>hbase.regionserver.hlog.reader.impl</name>
> >> >   <value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogReader</value>
> >> > </property>
> >> > <property>
> >> >   <name>hbase.regionserver.hlog.writer.impl</name>
> >> >   <value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogWriter</value>
> >> > </property>
> >> > <property>
> >> >   <name>hbase.regionserver.wal.encryption</name>
> >> >   <value>true</value>
> >> > </property>
> >> >
> >> > 3. The machine went down, so all processes went down.
> >> >
> >> > 4. We disabled WAL encryption for performance reasons and kept
> >> > encryption only for HFiles, as below:
> >> >
> >> > <property>
> >> >   <name>hbase.crypto.keyprovider</name>
> >> >   <value>org.apache.hadoop.hbase.io.crypto.KeyStoreKeyProvider</value>
> >> > </property>
> >> > <property>
> >> >   <name>hbase.crypto.keyprovider.parameters</name>
> >> >   <value>jceks:///opt/shankar1/kdc_keytab/hbase.jks?password=Hadoop@234</value>
> >> > </property>
> >> > <property>
> >> >   <name>hbase.crypto.master.key.name</name>
> >> >   <value>hdfs</value>
> >> > </property>
> >> > <property>
> >> >   <name>hfile.format.version</name>
> >> >   <value>3</value>
> >> > </property>
> >> >
> >> > 5. Start the RegionServer and query the 'table4-0' data:
> >> >
> >> > hbase(main):003:0> count 'table4-0'
> >> > ERROR: org.apache.hadoop.hbase.NotServingRegionException: Region table4-0,,1406207815456.fc10620a3dcc14e004ab034420f7d332. is not online on XX-XX-XX-XX,60020,1406209023146
> >> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2685)
> >> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4119)
> >> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3066)
> >> >   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
> >> >   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2084)
> >> >   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
> >> >   at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:168)
> >> >   at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:39)
> >> >   at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:111)
> >> >   at java.lang.Thread.run(Thread.java:662)
> >> >
> >> > 6. Not able to read the data, so we decided to revert the configuration
> >> > to the original.
> >> >
> >> > 7. Kill/stop the RegionServer and revert all configurations to the
> >> > original, as below:
> >> >
> >> > <property>
> >> >   <name>hbase.crypto.keyprovider</name>
> >> >   <value>org.apache.hadoop.hbase.io.crypto.KeyStoreKeyProvider</value>
> >> > </property>
> >> > <property>
> >> >   <name>hbase.crypto.keyprovider.parameters</name>
> >> >   <value>jceks:///opt/shankar1/kdc_keytab/hbase.jks?password=Hadoop@234</value>
> >> > </property>
> >> > <property>
> >> >   <name>hbase.crypto.master.key.name</name>
> >> >   <value>hdfs</value>
> >> > </property>
> >> > <property>
> >> >   <name>hfile.format.version</name>
> >> >   <value>3</value>
> >> > </property>
> >> > <property>
> >> >   <name>hbase.regionserver.hlog.reader.impl</name>
> >> >   <value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogReader</value>
> >> > </property>
> >> > <property>
> >> >   <name>hbase.regionserver.hlog.writer.impl</name>
> >> >   <value>org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogWriter</value>
> >> > </property>
> >> > <property>
> >> >   <name>hbase.regionserver.wal.encryption</name>
> >> >   <value>true</value>
> >> > </property>
> >> >
> >> > 7. Start the RegionServer and perform the 'table4-0' query:
> >> >
> >> > hbase(main):003:0> count 'table4-0'
> >> > ERROR: org.apache.hadoop.hbase.NotServingRegionException: Region table4-0,,1406207815456.fc10620a3dcc14e004ab034420f7d332. is not online on XX-XX-XX-XX,60020,1406209023146
> >> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2685)
> >> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4119)
> >> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3066)
> >> >   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
> >> >   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2084)
> >> >   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
> >> >   at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:168)
> >> >   at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:39)
> >> >   at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:111)
> >> >   at java.lang.Thread.run(Thread.java:662)
> >> >
> >> > 8. Run hbase hbck to repair, as below:
> >> >
> >> > ./hbase hbck -details
> >> > .........................
> >> > Summary:
> >> >   table1-0 is okay.
> >> >     Number of regions: 0
> >> >     Deployed on:
> >> >   table2-0 is okay.
> >> >     Number of regions: 0
> >> >     Deployed on:
> >> >   table3-0 is okay.
> >> >     Number of regions: 0
> >> >     Deployed on:
> >> >   table4-0 is okay.
> >> >     Number of regions: 0
> >> >     Deployed on:
> >> >   table5-0 is okay.
> >> >     Number of regions: 0
> >> >     Deployed on:
> >> >   table6-0 is okay.
> >> >     Number of regions: 0
> >> >     Deployed on:
> >> >   table7-0 is okay.
> >> >     Number of regions: 0
> >> >     Deployed on:
> >> >   table8-0 is okay.
> >> >     Number of regions: 0
> >> >     Deployed on:
> >> >   table9-0 is okay.
> >> >     Number of regions: 0
> >> >     Deployed on:
> >> >   hbase:meta is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   hbase:acl is okay.
> >> >     Number of regions: 0
> >> >     Deployed on:
> >> >   hbase:namespace is okay.
> >> >     Number of regions: 0
> >> >     Deployed on:
> >> > 22 inconsistencies detected.
> >> > Status: INCONSISTENT
> >> > 2014-07-24 19:13:05,532 INFO [main] client.HConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
> >> > 2014-07-24 19:13:05,533 INFO [main] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x1475d1611611bcf
> >> > 2014-07-24 19:13:05,533 DEBUG [main] zookeeper.ZooKeeper: Closing session: 0x1475d1611611bcf
> >> > 2014-07-24 19:13:05,533 DEBUG [main] zookeeper.ClientCnxn: Closing client for session: 0x1475d1611611bcf
> >> > 2014-07-24 19:13:05,546 DEBUG [main-SendThread(XX-XX-XX-XX:2181)] zookeeper.ClientCnxn: Reading reply sessionid:0x1475d1611611bcf, packet:: clientPath:null serverPath:null finished:false header:: 6,-11 replyHeader:: 6,4295102074,0 request:: null response:: null
> >> > 2014-07-24 19:13:05,546 DEBUG [main] zookeeper.ClientCnxn: Disconnecting client for session: 0x1475d1611611bcf
> >> > 2014-07-24 19:13:05,546 DEBUG [main-SendThread(XX-XX-XX-XX:2181)] zookeeper.ClientCnxn: An exception was thrown while closing send thread for session 0x1475d1611611bcf : Unable to read additional data from server sessionid 0x1475d1611611bcf, likely server has closed socket
> >> > 2014-07-24 19:13:05,546 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
> >> > 2014-07-24 19:13:05,546 INFO [main] zookeeper.ZooKeeper: Session: 0x1475d1611611bcf closed
> >> > shankar1@XX-XX-XX-XX:~/DataSight/hbase/bin>
> >> >
> >> > 9. Fix the assignments as below:
> >> >
> >> > ./hbase hbck -fixAssignments
> >> > Summary:
> >> >   table1-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table2-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table3-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table4-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table5-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table6-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table7-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table8-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table9-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   hbase:meta is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   hbase:acl is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   hbase:namespace is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> > 0 inconsistencies detected.
> >> > Status: OK
> >> > 2014-07-24 19:44:55,194 INFO [main] client.HConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
> >> > 2014-07-24 19:44:55,194 INFO [main] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x2475d15f7b31b73
> >> > 2014-07-24 19:44:55,194 DEBUG [main] zookeeper.ZooKeeper: Closing session: 0x2475d15f7b31b73
> >> > 2014-07-24 19:44:55,194 DEBUG [main] zookeeper.ClientCnxn: Closing client for session: 0x2475d15f7b31b73
> >> > 2014-07-24 19:44:55,203 DEBUG [main-SendThread(XX-XX-XX-XX:2181)] zookeeper.ClientCnxn: Reading reply sessionid:0x2475d15f7b31b73, packet:: clientPath:null serverPath:null finished:false header:: 7,-11 replyHeader:: 7,4295102377,0 request:: null response:: null
> >> > 2014-07-24 19:44:55,203 DEBUG [main] zookeeper.ClientCnxn: Disconnecting client for session: 0x2475d15f7b31b73
> >> > 2014-07-24 19:44:55,204 DEBUG [main-SendThread(XX-XX-XX-XX:2181)] zookeeper.ClientCnxn: An exception was thrown while closing send thread for session 0x2475d15f7b31b73 : Unable to read additional data from server sessionid 0x2475d15f7b31b73, likely server has closed socket
> >> > 2014-07-24 19:44:55,204 INFO [main] zookeeper.ZooKeeper: Session: 0x2475d15f7b31b73 closed
> >> > 2014-07-24 19:44:55,204 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
> >> >
> >> > 10. Fix the assignments as below:
> >> >
> >> > ./hbase hbck -fixAssignments -fixMeta
> >> > Summary:
> >> >   table1-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table2-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table3-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table4-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table5-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table6-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table7-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table8-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   table9-0 is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   hbase:meta is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   hbase:acl is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> >   hbase:namespace is okay.
> >> >     Number of regions: 1
> >> >     Deployed on: XX-XX-XX-XX,60020,1406209023146
> >> > 0 inconsistencies detected.
> >> > Status: OK
> >> > 2014-07-24 19:46:16,290 INFO [main] client.HConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
> >> > 2014-07-24 19:46:16,290 INFO [main] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x3475d1605321be9
> >> > 2014-07-24 19:46:16,290 DEBUG [main] zookeeper.ZooKeeper: Closing session: 0x3475d1605321be9
> >> > 2014-07-24 19:46:16,290 DEBUG [main] zookeeper.ClientCnxn: Closing client for session: 0x3475d1605321be9
> >> > 2014-07-24 19:46:16,300 DEBUG [main-SendThread(XX-XX-XX-XX:2181)] zookeeper.ClientCnxn: Reading reply sessionid:0x3475d1605321be9, packet:: clientPath:null serverPath:null finished:false header:: 6,-11 replyHeader:: 6,4295102397,0 request:: null response:: null
> >> > 2014-07-24 19:46:16,300 DEBUG [main] zookeeper.ClientCnxn: Disconnecting client for session: 0x3475d1605321be9
> >> > 2014-07-24 19:46:16,300 DEBUG [main-SendThread(XX-XX-XX-XX:2181)] zookeeper.ClientCnxn: An exception was thrown while closing send thread for session 0x3475d1605321be9 : Unable to read additional data from server sessionid 0x3475d1605321be9, likely server has closed socket
> >> > 2014-07-24 19:46:16,300 INFO [main] zookeeper.ZooKeeper: Session: 0x3475d1605321be9 closed
> >> > 2014-07-24 19:46:16,300 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
> >> >
> >> > hbase(main):006:0> count 'table4-0'
> >> > 0 row(s) in 0.0200 seconds
> >> > => 0
> >> > hbase(main):007:0>
> >> >
> >> > Complete data loss happened:
> >> > WALs, oldWALs & /hbase/data/default/table4-0/ do not contain any data.

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)