On Thu, Jun 28, 2018 at 1:24 PM Austin Heyne <[email protected]> wrote:
> Hey again,
>
> I'm running a bulk load on s3 and I'm seeing region servers being instructed to load an hfile multiple times. I've seen this behavior in two different ways. The first time was after deciding to brute-force my way around the problem in HBASE-20774 [1] by just letting the cluster copy the data down from s3 and back up. After letting this run overnight, the input 15TB of data had grown to over 60TB in the hbase catalog. I noticed in the logs that files were being copied multiple times (around 10 times for some files on a single region server).
>
> After writing a patch for HBASE-20774 and deploying it to the cluster, on the next attempt I'm seeing the same behavior, but because bulk loading from the same FS as the HBase root makes HBase rename files rather than copy them, this now causes FileNotFoundExceptions the next time the region server is instructed to load the file. I've cherry-picked the logs of an occurrence of this off of a region server:
>
> 2018-06-27 19:34:18,170 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HStore: Validating hfile at s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe for inclusion in store d region z3_v2,\x00\x09\xD9\x00\x00\x00\x00\x00\x00\x00,1530122295326.f6e3567551881fbd48baed14bfe2252b.
> 2018-06-27 19:34:18,236 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: Opening 's3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe' for reading
> 2018-06-27 19:34:45,915 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: rename s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_
> 2018-06-27 19:34:59,005 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HStore: Loaded HFile s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe into store 'd' as s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_ - updating store file list.
> 2018-06-27 19:34:59,101 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HStore: Successfully loaded store file s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe into store d (new location: s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_)
> 2018-06-27 19:40:03,463 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HStore: Validating hfile at s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe for inclusion in store d region z3_v2,\x00\x09\xD9\x00\x00\x00\x00\x00\x00\x00,1530122295326.f6e3567551881fbd48baed14bfe2252b.
> org.apache.hadoop.io.MultipleIOException: 18 exceptions [java.io.FileNotFoundException: No such file or directory 's3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe', ....]
>     at org.apache.hadoop.io.MultipleIOException.createIOException(MultipleIOException.java:53)
>     at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:5782)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.bulkLoadHFile(RSRpcServices.java:2170)
>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36615)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2354)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>
> I'm not sure where the problem is coming from here: whether the master is adding files to the queue multiple times or just not removing them. If anyone has knowledge of this or an idea of where things are going wrong, it would be very helpful, since I'm not super familiar with the code base.

You think that rename is happening in the 15 seconds that elapse between the log lines? Can you see the renamed file in its new location up in s3?

  s3n.S3NativeFileSystem: rename s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_

Try turning on rpc trace to see the requests/responses going back and forth. Are the s3 ops taking a while, so the client is retrying? To enable RPC TRACE-level logging, set org.apache.hadoop.hbase.ipc.RpcServer to TRACE level in the log-level servlet available from the regionserver home page (WARNING: logging is profuse). Hopefully you can grep out the request to see more about what is going on.

S

> Thanks,
> Austin
>
> [1] https://issues.apache.org/jira/browse/HBASE-20774
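The two checks suggested in the reply can be sketched from the command line. This is a minimal sketch, not from the thread itself: it assumes the placeholder paths from the log excerpt above, a host name of `rs-host.example`, and the default regionserver info port 16030; substitute your own values. The `/logLevel` servlet is the standard Hadoop log-level servlet the regionserver UI exposes.

```shell
#!/bin/sh
# Sketch only -- paths and host below are assumed placeholders, not real values.

# 1) Did the rename actually complete? After a successful rename the source
#    object should be gone and the destination present. On a live cluster:
#      hadoop fs -ls "$SRC"   # should fail with "No such file or directory"
#      hadoop fs -ls "$DST"   # should list the renamed hfile
SRC='s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe'
DST='s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_'

# 2) Enable TRACE on the RPC server via the log-level servlet (the same knob
#    as the regionserver UI's log-level page). Build the URL, then fetch it
#    with curl, e.g.: curl "$(loglevel_url rs-host.example 16030 \
#    org.apache.hadoop.hbase.ipc.RpcServer TRACE)"
loglevel_url() {
  # args: host port logger level
  printf 'http://%s:%s/logLevel?log=%s&level=%s\n' "$1" "$2" "$3" "$4"
}

loglevel_url rs-host.example 16030 org.apache.hadoop.hbase.ipc.RpcServer TRACE
```

Remember to drop the logger back to INFO the same way once you have captured the requests, since TRACE output is profuse.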
