On Thu, Jun 28, 2018 at 1:24 PM Austin Heyne <[email protected]> wrote:

> Hey again,
>
> I'm running a bulk load on s3 and I'm seeing region servers being
> instructed to load an hfile multiple times. I've seen this behavior two
> different ways. The first time I saw this was after deciding to brute
> force my way around the problem in HBASE-20774 [1] by just letting the
> cluster copy the data down from s3 and back up. After letting this run
> over night the input 15TB of data was now over 60TB in the hbase
> catalog. I noticed in the logs that files were being copied multiple
> times (I saw around 10 times on some files on a single region server).
>
> After writing a patch for HBASE-20774 and deploying it to the cluster on
> the next attempt I'm seeing the same behavior but because when you bulk
> load from the same FS as the HBase root, HBase renames files rather than
> copying them. This is causing FileNotFoundExceptions the next time the
> region server is instructed to load the file. I've cherry picked the
> logs of an occurrence of this happening off of a region server:
>
> 2018-06-27 19:34:18,170 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
> regionserver.HStore: Validating hfile at
> s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe for
> inclusion in store d region z3_v2,\x00\x09\xD9
>
> \x00\x00\x00\x00\x00\x00\x00,1530122295326.f6e3567551881fbd48baed14bfe2252b.
> 2018-06-27 19:34:18,236 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
> s3n.S3NativeFileSystem: Opening
> 's3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe'
> for reading
> 2018-06-27 19:34:45,915 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
> s3n.S3NativeFileSystem: rename
> s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe
> s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_
>
>
> 2018-06-27 19:34:59,005 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
> regionserver.HStore: Loaded HFile
> s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe
> into store 'd' as
> s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_
>
> - updating store file list.
> 2018-06-27 19:34:59,101 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
> regionserver.HStore: Successfully loaded store file
> s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe
> into store d (new location:
>
> s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_)
> 2018-06-27 19:40:03,463 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
> regionserver.HStore: Validating hfile at
> s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe for
> inclusion in store d region z3_v2,\x00\x09\xD9
>
> \x00\x00\x00\x00\x00\x00\x00,1530122295326.f6e3567551881fbd48baed14bfe2252b.
> org.apache.hadoop.io.MultipleIOException: 18 exceptions
> [java.io.FileNotFoundException: No such file or directory
> 's3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe',
> ....]
> at
> org.apache.hadoop.io
> .MultipleIOException.createIOException(MultipleIOException.java:53)
> at
>
> org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:5782)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.bulkLoadHFile(RSRpcServices.java:2170)
>
>
> at
>
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36615)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2354)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>
> I'm not sure where the problem is coming from here, if the master is
> adding files to the queue multiple times or just not removing them. If
> anyone has knowledge on this or have an idea where things are going
> wrong it would be very helpful since I'm not super familiar with the
> code base.
>

You think that rename is happening in the 15 seconds that elapses between
log lines? Can you see renamed file in its new location up in s3?

s3n.S3NativeFileSystem: rename
s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe
s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14
bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_

Try turning on rpc trace to see the requests/responses going back and
forth. Are the s3 ops taking a while so the client is retrying?

To enable RPC Trace-level logging, set  org.apache.hadoop.hbase.ipc.RpcServer
to TRACE level in the log-level Servlet available from the regionserver
home page (WARNING: logging is profuse). Hopefully you can grep out the
request to see more about what is going on.


S



>
> Thanks,
> Austin
>
> [1] https://issues.apache.org/jira/browse/HBASE-20774
>
>

Reply via email to