Hi everyone,
I'm trying to bulk load about 15 TB of HFiles that I generated with a bulk
ingest and that are sitting in S3. When I initiate the load, I see the data
get copied down to the workers and then back up to S3, even though the HBase
root is the same bucket.
The data to be bulk loaded is at s3://bucket_name/data/bulkload/z3/d/ and
the HBase root is s3://bucket_name/. Validation is disabled by setting
"hbase.loadincremental.validate.hfile" to "false" in code before I call
LoadIncrementalHFiles.doBulkLoad (the code is available at [1]; a simplified
sketch of the call is also included after the log excerpt below). The splits
have already been generated with the same config that was used during the
ingest, so they'll line up. I'm currently running HBase 1.4.2 on AWS EMR.
The logs of interest were pulled from a worker on the cluster:
"""
2018-06-20 22:42:15,888 INFO
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
s3n.S3NativeFileSystem: Opening
's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59'
for reading
2018-06-20 22:42:16,026 INFO
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
compress.CodecPool: Got brand-new decompressor [.snappy]
2018-06-20 22:42:16,056 INFO
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
regionserver.HRegionFileSystem: Bulk-load file
s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59 is
on different filesystem than the destination store. Copying file over to
destination filesystem.
2018-06-20 22:42:16,109 INFO
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
s3n.S3NativeFileSystem: Opening
's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59'
for reading
2018-06-20 22:42:17,910 INFO
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
s3n.MultipartUploadOutputStream: close closed:false
s3://bucket_name/data/default/rtb_rtb_z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
2018-06-20 22:42:17,927 INFO
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
regionserver.HRegionFileSystem: Copied
s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59 to
temporary path on destination filesystem:
s3://bucket_name/data/default/rtb_rtb_z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
"""
I'm seeing the same behavior whether I use 's3://' or 's3n://' for the paths
(a small diagnostic sketch of how each path resolves is included below). Has
anyone experienced this, or does anyone have advice?
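In case it helps with diagnosis, this is the kind of check I've been running
to see which FileSystem implementation each path resolves to on the cluster.
Again, just a diagnostic sketch using the paths from above, not part of the
original tooling:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object FsResolutionCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // the bulk-load source directory and the HBase root, as configured
    val src  = new Path("s3://bucket_name/data/bulkload/z3/d/")
    val root = new Path("s3://bucket_name/")

    val srcFs  = src.getFileSystem(conf)
    val rootFs = root.getFileSystem(conf)

    // print the FileSystem implementation and URI each path resolves to;
    // a mismatch here could explain HBase deciding the source is on a
    // "different filesystem than the destination store"
    println(s"source:      ${srcFs.getClass.getName} ${srcFs.getUri}")
    println(s"destination: ${rootFs.getClass.getName} ${rootFs.getUri}")
  }
}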
Thanks,
Austin Heyne
[1]
https://github.com/aheyne/geomesa/blob/b36507f4e999b295ebab8b1fb47a38f53e2a0e93/geomesa-hbase/geomesa-hbase-tools/src/main/scala/org/locationtech/geomesa/hbase/tools/ingest/HBaseBulkLoadCommand.scala#L47