Hi again,

I've been doing more digging into this, and with the way the code is written it's actually impossible. In FSHDFSUtils [1], HBase attempts to get the canonical service name from Hadoop. Since we're running on EMR, our filesystem is the S3NativeFileSystem (com.amazon), which I believe extends the NativeS3FileSystem (org.apache). The NativeS3FileSystem [2] and S3FileSystem [3] both always return null from getCanonicalServiceName, and from testing it appears the S3NativeFileSystem does the same, so there is no way to get past the check in FSHDFSUtils.isSameHdfs when running on S3 of any kind.
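
For context, here's roughly what that check looks like (a simplified paraphrase of the code at [1], not the full method; the real isSameHdfs goes on to compare the service names and handle HDFS HA, but the early return on null is what matters here):

"""
import org.apache.hadoop.fs.FileSystem;

// Simplified paraphrase of the early-out in FSHDFSUtils.isSameHdfs [1].
final class IsSameFsSketch {
  static boolean looksLikeSameFs(FileSystem srcFs, FileSystem destFs) {
    String srcService = srcFs.getCanonicalServiceName();
    String destService = destFs.getCanonicalServiceName();
    if (srcService == null || destService == null) {
      // NativeS3FileSystem [2] and S3FileSystem [3] always return null here,
      // and EMR's S3NativeFileSystem appears to as well, so on S3 this branch
      // is always taken and the bulk load falls back to copying.
      return false;
    }
    return srcService.equals(destService);
  }
}
"""

So with any S3-backed root, the bulk load always takes the "different filesystem" copy path.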

Does anyone know of a workaround for this issue?

Thanks,
Austin Heyne

[1] https://github.com/apache/hbase/blob/rel/1.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java#L120-L123
[2] https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/NativeS3FileSystem.java#L788-L791
[3] https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3/S3FileSystem.java#L403-L406


On 06/20/2018 07:07 PM, Austin Heyne wrote:
Hi everyone,

I'm trying to run a bulk load of about 15 TB of data sitting in S3 that I've bulk ingested. When I initiate the load, I'm seeing the data get copied down to the workers and then back up to S3, even though the HBase root is in that same bucket.

Data to be bulk loaded is at s3://bucket_name/data/bulkload/z3/d/ and the HBase root is s3://bucket_name/. I have validation disabled by setting "hbase.loadincremental.validate.hfile" = "false" in code before calling LoadIncrementalHFiles.doBulkLoad (code is available at [1]). The splits have already been generated with the same config that was used during the ingest, so they'll line up. I'm currently running HBase 1.4.2 on AWS EMR. The logs of interest were pulled from a worker on the cluster (a rough sketch of the call follows the log excerpt):

"""
2018-06-20 22:42:15,888 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: Opening 's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59' for reading
2018-06-20 22:42:16,026 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] compress.CodecPool: Got brand-new decompressor [.snappy]
2018-06-20 22:42:16,056 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HRegionFileSystem: Bulk-load file s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59 is on different filesystem than the destination store. Copying file over to destination filesystem.
2018-06-20 22:42:16,109 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: Opening 's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59' for reading
2018-06-20 22:42:17,910 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.MultipartUploadOutputStream: close closed:false s3://bucket_name/data/default/z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
2018-06-20 22:42:17,927 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HRegionFileSystem: Copied s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59 to temporary path on destination filesystem: s3://bucket_name/data/default/z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
"""

I'm seeing this behavior with both 's3://' and 's3n://'. Has anyone experienced this or have any advice?

Thanks,
Austin Heyne

[1] https://github.com/aheyne/geomesa/blob/b36507f4e999b295ebab8b1fb47a38f53e2a0e93/geomesa-hbase/geomesa-hbase-tools/src/main/scala/org/locationtech/geomesa/hbase/tools/ingest/HBaseBulkLoadCommand.scala#L47

--
Austin L. Heyne
