Sorry, was out for a week.

But yes, I am seeing the renamed files. I'm going to try re-running the bulk load tomorrow with EMRFS enabled and see if the consistency layer helps with the master thinking a file didn't get renamed. I'll let you know how it goes, or reply with the info from the RPC logs.
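In case it's useful later, the knobs I'm planning to flip are the EMRFS consistent-view ones in emrfs-site. The values below are just the defaults as I remember them from the EMR docs, so treat this as a sketch rather than the final config:

  fs.s3.consistent = true
  fs.s3.consistent.metadata.tableName = EmrFSMetadata
  fs.s3.consistent.retryCount = 5
  fs.s3.consistent.retryPeriodSeconds = 10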

Thanks,
Austin


On 06/29/2018 05:54 PM, Stack wrote:
On Thu, Jun 28, 2018 at 1:24 PM Austin Heyne <[email protected]> wrote:

Hey again,

I'm running a bulk load on S3 and I'm seeing region servers being
instructed to load an hfile multiple times. I've seen this behavior in two
different ways. The first time was after deciding to brute-force my way
around the problem in HBASE-20774 [1] by just letting the cluster copy the
data down from S3 and back up. After letting this run overnight, the input
15TB of data had grown to over 60TB in the HBase catalog. I noticed in the
logs that files were being copied multiple times (around 10 times for some
files on a single region server).

After writing a patch for HBASE-20774 and deploying it to the cluster, on
the next attempt I'm seeing the same behavior, but because a bulk load from
the same FS as the HBase root renames files rather than copying them, the
repeat attempts now cause FileNotFoundExceptions the next time the region
server is instructed to load the file. I've cherry-picked the logs of one
occurrence of this off of a region server (a rough sketch of that
rename-vs-copy decision is below, after the stack trace):

2018-06-27 19:34:18,170 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HStore: Validating hfile at s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe for inclusion in store d region z3_v2,\x00\x09\xD9\x00\x00\x00\x00\x00\x00\x00,1530122295326.f6e3567551881fbd48baed14bfe2252b.
2018-06-27 19:34:18,236 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: Opening 's3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe' for reading
2018-06-27 19:34:45,915 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: rename s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_
2018-06-27 19:34:59,005 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HStore: Loaded HFile s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe into store 'd' as s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_ - updating store file list.
2018-06-27 19:34:59,101 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HStore: Successfully loaded store file s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe into store d (new location: s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_)
2018-06-27 19:40:03,463 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HStore: Validating hfile at s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe for inclusion in store d region z3_v2,\x00\x09\xD9\x00\x00\x00\x00\x00\x00\x00,1530122295326.f6e3567551881fbd48baed14bfe2252b.
org.apache.hadoop.io.MultipleIOException: 18 exceptions [java.io.FileNotFoundException: No such file or directory 's3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe', ....]
    at org.apache.hadoop.io.MultipleIOException.createIOException(MultipleIOException.java:53)
    at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:5782)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.bulkLoadHFile(RSRpcServices.java:2170)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36615)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2354)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
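For anyone not familiar with the bulk load path, here's a rough sketch of the rename-vs-copy decision I'm describing. It's only an illustration against the public Hadoop FileSystem API; the class and method names are made up, not the real HBase code:

// Illustration only: the kind of same-filesystem check that turns a bulk
// load into a rename instead of a copy. A second attempt after the rename
// has already succeeded then fails with FileNotFoundException on the
// source path. (BulkLoadMoveSketch/stageHFile are hypothetical names.)
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class BulkLoadMoveSketch {
  static Path stageHFile(Configuration conf, Path src, Path dst) throws IOException {
    FileSystem srcFs = src.getFileSystem(conf);
    FileSystem dstFs = dst.getFileSystem(conf);
    if (srcFs.getUri().equals(dstFs.getUri())) {
      // Source is on the same FS as the HBase root: move it into place.
      if (!srcFs.rename(src, dst)) {
        throw new IOException("rename failed: " + src + " -> " + dst);
      }
    } else {
      // Different FS (e.g. HDFS -> S3): copy instead, leaving the source behind.
      FileUtil.copy(srcFs, src, dstFs, dst, false, conf);
    }
    return dst;
  }
}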

I'm not sure where the problem is coming from here: whether the master is
adding files to the queue multiple times or just not removing them. If
anyone has knowledge of this or an idea of where things are going wrong,
that would be very helpful, since I'm not super familiar with the code base.

You think that rename is happening in the ~15 seconds that elapse between
the log lines? Can you see the renamed file in its new location up in S3?

s3n.S3NativeFileSystem: rename s3://bucket_name/data/bulkload/z3/d/19bca1eee5ff45a38cca616696b92ffe s3://bucket_name/data/default/z3_v2/f6e3567551881fbd48baed14bfe2252b/d/47aef50a986340908a8242b256adbedf_SeqId_4_

Try turning on RPC trace to see the requests/responses going back and
forth. Are the S3 ops taking a while, so the client is retrying?
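If it is a client-side retry, the usual knobs that govern it are worth checking in hbase-site.xml. The values below are what I believe are the stock defaults for a 1.x client, so just a starting point:

  hbase.rpc.timeout = 60000
  hbase.client.operation.timeout = 1200000
  hbase.client.retries.number = 35
  hbase.client.pause = 100

A rename that stalls on S3 long enough to blow hbase.rpc.timeout could make the client re-send the bulkLoadHFile call even though the first attempt eventually succeeded.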

To enable RPC TRACE-level logging, set org.apache.hadoop.hbase.ipc.RpcServer
to TRACE level in the log-level servlet available from the region server
home page (WARNING: logging is profuse). Hopefully you can grep out the
request to see more about what is going on.
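Concretely, that's either a line like this in the region server's log4j.properties:

  log4j.logger.org.apache.hadoop.hbase.ipc.RpcServer=TRACE

or, on a running region server, the logLevel servlet (assuming the default info port of 16030; adjust if yours differs):

  http://<regionserver-host>:16030/logLevel?log=org.apache.hadoop.hbase.ipc.RpcServer&level=TRACE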


S



Thanks,
Austin

[1] https://issues.apache.org/jira/browse/HBASE-20774



--
Austin L. Heyne
