Hi again,
I've been doing more digging into this and I've found that, with the way
the code is written, it's actually impossible. In FSHDFSUtils [1], HBase
attempts to get the canonical service name from Hadoop. Since we're
running on EMR, our filesystem is the S3NativeFileSystem (com.amazon),
which I believe extends the NativeS3FileSystem (org.apache). The
NativeS3FileSystem [2] and S3FileSystem [3] both always return null from
getCanonicalServiceName, and from testing it appears the
S3NativeFileSystem does the same, so there looks to be no way to get
past the check in FSHDFSUtils.isSameHdfs when running on S3 of any kind.
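To illustrate, here's a minimal probe sketch (assuming the EMR
Hadoop jars are on the classpath; the bucket name is a placeholder)
that prints the canonical service name for the bulk-load source and
the HBase root. On S3 both come back null, which is exactly what makes
the isSameHdfs check fail:
"""
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CanonicalServiceNameProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Resolve the bulk-load source and the HBase root (placeholder bucket)
    FileSystem srcFs = new Path("s3n://bucket_name/data/bulkload/z3/d/").getFileSystem(conf);
    FileSystem rootFs = new Path("s3n://bucket_name/").getFileSystem(conf);
    // isSameHdfs needs both names non-null and equal [1]; on S3 both print null
    System.out.println("src:  " + srcFs.getCanonicalServiceName());
    System.out.println("root: " + rootFs.getCanonicalServiceName());
  }
}
"""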
Does anyone know of a workaround for this issue?
Thanks,
Austin Heyne
[1] https://github.com/apache/hbase/blob/rel/1.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java#L120-L123
[2] https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/NativeS3FileSystem.java#L788-L791
[3] https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3/S3FileSystem.java#L403-L406
On 06/20/2018 07:07 PM, Austin Heyne wrote:
Hi everyone,
I'm trying to run a bulk load of about 15TB of data sitting in S3 that
I've bulk ingested. When I initiate the load, I'm seeing the data get
copied down to the workers and then back up to S3, even though the
HBase root is the same bucket.
Data to be bulk loaded is at s3://bucket_name/data/bulkload/z3/d/ and
the HBase root is s3://bucket_name/. I have validation disabled by
setting "hbase.loadincremental.validate.hfile" = "false" in code before
I call LoadIncrementalHFiles.doBulkLoad (code is available at [1]; a
rough Java paraphrase is sketched after the logs below). The splits
have already been generated with the same config that was used during
the ingest, so they'll line up. I'm currently running HBase 1.4.2 on
AWS EMR. The logs of interest were pulled from a worker on the cluster:
"""
2018-06-20 22:42:15,888 INFO
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
s3n.S3NativeFileSystem: Opening
's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59'
for reading
2018-06-20 22:42:16,026 INFO
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
compress.CodecPool: Got brand-new decompressor [.snappy]
2018-06-20 22:42:16,056 INFO
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
regionserver.HRegionFileSystem: Bulk-load file
s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59
is on different filesystem than the destination store. Copying file
over to destination filesystem.
2018-06-20 22:42:16,109 INFO
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
s3n.S3NativeFileSystem: Opening
's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59'
for reading
2018-06-20 22:42:17,910 INFO
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
s3n.MultipartUploadOutputStream: close closed:false
s3://bucket_name/data/default/z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
2018-06-20 22:42:17,927 INFO
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020]
regionserver.HRegionFileSystem: Copied
s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59
to temporary path on destination filesystem:
s3://bucket_name/data/default/z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
"""
I'm seeing this behavior with both 's3://' and 's3n://'. Has anyone
experienced this or have any advice?
Thanks,
Austin Heyne
[1] https://github.com/aheyne/geomesa/blob/b36507f4e999b295ebab8b1fb47a38f53e2a0e93/geomesa-hbase/geomesa-hbase-tools/src/main/scala/org/locationtech/geomesa/hbase/tools/ingest/HBaseBulkLoadCommand.scala#L47
--
Austin L. Heyne