Sure thing, I've created a ticket here [1]. Though since we're using EMR it's not simple to swap out this code, so I'd still be very interested in a workaround if anyone has an idea.

Thanks,
Austin Heyne

[1] https://issues.apache.org/jira/browse/HBASE-20774


On 06/21/2018 06:48 PM, Ted Yu wrote:
Since S3FileSystem is not taken into account in FSHDFSUtils#isSameHdfs, we
need to add more code to avoid the overhead.
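
For example, the extra code could fall back to comparing the filesystem URIs when the canonical service name is null. This is just a sketch of the idea, not anything that exists in the tree today:

"""
import java.net.URI;
import org.apache.hadoop.fs.FileSystem;

public final class S3FsCompare {
  // Sketch only: treat two filesystems as "the same" when their URIs
  // (scheme + bucket authority) match, even though getCanonicalServiceName()
  // returns null for the S3 filesystems.
  public static boolean isSameS3(FileSystem srcFs, FileSystem desFs) {
    URI srcUri = srcFs.getUri();
    URI desUri = desFs.getUri();
    return srcUri.getScheme().equalsIgnoreCase(desUri.getScheme())
        && srcUri.getAuthority() != null
        && srcUri.getAuthority().equalsIgnoreCase(desUri.getAuthority());
  }
}
"""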

Can you log a JIRA with what you discovered?

Thanks

On Thu, Jun 21, 2018 at 2:08 PM, Austin Heyne <[email protected]> wrote:

Hi again,

I've been doing more digging into this and I've found that, with the way the
code is written, it's actually impossible. In FSHDFSUtils [1], HBase attempts
to get the canonical service name from Hadoop. Since we're running on EMR, our
filesystem is the S3NativeFileSystem (com.amazon), which extends, I believe,
the NativeS3FileSystem (org.apache). The NativeS3FileSystem [2] and
S3FileSystem [3] both always return null from getCanonicalServiceName, and
from testing it appears the S3NativeFileSystem does the same, so it looks like
there is no way to get past the check in FSHDFSUtils.isSameHdfs when running
on S3 of any kind.
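
For reference, the check is roughly of this shape (my paraphrase of the lines linked at [1], not the verbatim source):

"""
// Paraphrase of FSHDFSUtils.isSameHdfs around [1]: if either filesystem
// can't report a canonical service name, they are treated as different.
String srcServiceName = srcFs.getCanonicalServiceName();
String desServiceName = desFs.getCanonicalServiceName();
if (srcServiceName == null || desServiceName == null) {
  return false; // S3 filesystems return null here, so the bulk load always copies
}
"""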

Does anyone know of a workaround for this issue?

Thanks,
Austin Heyne

[1] https://github.com/apache/hbase/blob/rel/1.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java#L120-L123
[2] https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/NativeS3FileSystem.java#L788-L791
[3] https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3/S3FileSystem.java#L403-L406


On 06/20/2018 07:07 PM, Austin Heyne wrote:

Hi everyone,

I'm trying to run a bulk load of about 15 TB of data sitting in S3 that
I've bulk ingested. When I initiate the load, I'm seeing the data get copied
down to the workers and then back up to S3, even though the HBase root is the
same bucket.

Data to be bulk loaded is at s3://bucket_name/data/bulkload/z3/d/ and
the HBase root is s3://bucket_name/. I have validation disabled via the
"hbase.loadincremental.validate.hfile" = "false" config, set in code before I
call LoadIncrementalHFiles.doBulkLoad (code is available at [1]; a simplified
sketch of the call is included after the log below). The splits have already
been generated with the same config that was used during the ingest, so
they'll line up. I'm currently running HBase 1.4.2 on AWS EMR. The logs of
interest were pulled from a worker on the cluster:

"""
2018-06-20 22:42:15,888 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: Opening 's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59' for reading
2018-06-20 22:42:16,026 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] compress.CodecPool: Got brand-new decompressor [.snappy]
2018-06-20 22:42:16,056 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HRegionFileSystem: Bulk-load file s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59 is on different filesystem than the destination store. Copying file over to destination filesystem.
2018-06-20 22:42:16,109 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.S3NativeFileSystem: Opening 's3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59' for reading
2018-06-20 22:42:17,910 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] s3n.MultipartUploadOutputStream: close closed:false s3://bucket_name/data/default/z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
2018-06-20 22:42:17,927 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16020] regionserver.HRegionFileSystem: Copied s3n://bucket_name/data/bulkload/z3/d/b9371885084e4060ac157799e5c89b59 to temporary path on destination filesystem: s3://bucket_name/data/default/z3_v2/c34a364639dab40857a8d18871a3f705/.tmp/3d69598daa9841f986da2341a5901444
"""

I'm seeing the behavior using 's3://' or 's3n://'. Has anyone experienced
this or have advice?

Thanks,
Austin Heyne

[1] https://github.com/aheyne/geomesa/blob/b36507f4e999b295ebab8b1fb47a38f53e2a0e93/geomesa-hbase/geomesa-hbase-tools/src/main/scala/org/locationtech/geomesa/hbase/tools/ingest/HBaseBulkLoadCommand.scala#L47

--
Austin L. Heyne



--
Austin L. Heyne
