Hi again,

I've been doing more digging into this, and with the way the code is written it's actually impossible. In FSHDFSUtils [1], HBase attempts to get the canonical service name from Hadoop. Since we're running on EMR, our filesystem is the S3NativeFileSystem (com.amazon), which I believe extends the NativeS3FileSystem (org.apache). The NativeS3FileSystem [2] and S3FileSystem [3] both always return null from getCanonicalServiceName, and from testing it appears the S3NativeFileSystem does the same, so there is no way to get past the check in FSHDFSUtils.isSameHdfs when running on S3 of any kind.
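
For context, here's roughly what that check looks like (a simplified paraphrase of the code at [1], not the full method; the real isSameHdfs goes on to compare the service names and handle HDFS HA, but the early return on null is what matters here):

"""
import org.apache.hadoop.fs.FileSystem;

// Simplified paraphrase of the early-out in FSHDFSUtils.isSameHdfs [1].
final class IsSameFsSketch {
  static boolean looksLikeSameFs(FileSystem srcFs, FileSystem destFs) {
    String srcService = srcFs.getCanonicalServiceName();
    String destService = destFs.getCanonicalServiceName();
    if (srcService == null || destService == null) {
      // NativeS3FileSystem [2] and S3FileSystem [3] always return null here,
      // and EMR's S3NativeFileSystem appears to as well, so on S3 this branch
      // is always taken and the bulk load falls back to copying.
      return false;
    }
    return srcService.equals(destService);
  }
}
"""

So with any S3-backed root, the bulk load always takes the "different filesystem" copy path.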

Does anyone know of a workaround for this issue?

Thanks,
Austin Heyne

[1] https://github.com/apache/hbase/blob/rel/1.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java#L120-L123
[2] https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/NativeS3FileSystem.java#L788-L791
[3] https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3/S3FileSystem.java#L403-L406


On 06/20/2018 07:07 PM, Austin Heyne wrote:
Hi everyone,

I'm trying to run a bulk load of about 15 TB of data sitting in S3 that I've bulk ingested. When I initiate the load, I'm seeing the data get copied down to the workers and then back up to S3, even though the HBase root is in that same bucket.

Data to be bulk loaded is at s3://bucket_name/data/bulkload/z3/d/ and the HBase root is s3://bucket_name/. I have validation disabled by setting "hbase.loadincremental.validate.hfile" = "false" in code before calling LoadIncrementalHFiles.doBulkLoad (code is available at [1]). The splits have already been generated with the same config that was used during the ingest, so they'll line up. I'm currently running HBase 1.4.2 on AWS EMR. The logs of interest were pulled from a worker on the cluster (a rough sketch of the call follows the log excerpt):

"""
2018-06-20 22:42:15,888 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: Opening 's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59' for reading
2018-06-20 22:42:16,026 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] compress.CodecPool: Got brand-new decompressor [.snappy]
2018-06-20 22:42:16,056 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HRegionFileSystem: Bulk-load file s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59 is on different filesystem than the destination store. Copying file over to destination filesystem.
2018-06-20 22:42:16,109 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: Opening 's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59' for reading
2018-06-20 22:42:17,910 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.MultipartUploadOutputStream: close closed:false s3://bucket_name/data/default/z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
2018-06-20 22:42:17,927 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HRegionFileSystem: Copied s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59 to temporary path on destination filesystem: s3://bucket_name/data/default/z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
"""

I'm seeing this behavior with both 's3://' and 's3n://'. Has anyone experienced this or have any advice?

Thanks,
Austin Heyne

[1] https://github.com/aheyne/geomesa/blob/b36507f4e999b295ebab8b1fb47a38f53e2a0e93/geomesa-hbase/geomesa-hbase-tools/src/main/scala/org/locationtech/geomesa/hbase/tools/ingest/HBaseBulkLoadCommand.scala#L47

--
Austin L. Heyne
