Hi everyone,

I'm trying to bulk load about 15TB of HFiles that I generated via bulk ingest and that are sitting in S3. When I initiate the load, I see the data get copied down to the workers and then back up to S3, even though the HBase root is in that same bucket.

The data to be bulk loaded is at s3://bucket_name/data/bulkload/z3/d/ and the HBase root is s3://bucket_name/. Validation is disabled by setting "hbase.loadincremental.validate.hfile" to "false" in code before I call LoadIncrementalHFiles.doBulkLoad (code is available at [1]). The splits were generated with the same config used during the ingest, so they should line up. I'm running HBase 1.4.2 on AWS EMR. The logs of interest, pulled from a worker on the cluster:

"""
2018-06-20 22:42:15,888 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: Opening 's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59' for reading
2018-06-20 22:42:16,026 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] compress.CodecPool: Got brand-new decompressor [.snappy]
2018-06-20 22:42:16,056 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HRegionFileSystem: Bulk-load file s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59 is on different filesystem than the destination store. Copying file over to destination filesystem.
2018-06-20 22:42:16,109 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: Opening 's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59' for reading
2018-06-20 22:42:17,910 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.MultipartUploadOutputStream: close closed:false s3://bucket_name/data/default/rtb_rtb_z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
2018-06-20 22:42:17,927 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HRegionFileSystem: Copied s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59 to temporary path on destination filesystem: s3://bucket_name/data/default/rtb_rtb_z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
"""

I see the same behavior whether I use 's3://' or 's3n://'. Has anyone experienced this, or does anyone have advice?

Thanks,
Austin Heyne

[1] https://github.com/aheyne/geomesa/blob/b36507f4e999b295ebab8b1fb47a38f53e2a0e93/geomesa-hbase/geomesa-hbase-tools/src/main/scala/org/locationtech/geomesa/hbase/tools/ingest/HBaseBulkLoadCommand.scala#L47