By the way, I have copied the table across clusters, with the tables configured the same. The source cluster has an underlying ext2 filesystem, while the destination cluster has an underlying ext4 filesystem. The counts are the same for both tables. Could the filesystem difference account for the difference in directory size?
[root@clusterA_ext2 ~]# sudo -u hdfs hadoop fs -dus -h /a_d/
dus: DEPRECATED: Please use 'du -s' instead.
225.9g  /a_d

[root@clusterB_ext4 ~]# sudo -u hdfs hadoop fs -dus -h /a_d/
dus: DEPRECATED: Please use 'du -s' instead.
172.8g  /a_d

On Sun, Aug 10, 2014 at 4:17 AM, Jean-Marc Spaggiari <[email protected]> wrote:

HBASE-11715 <https://issues.apache.org/jira/browse/HBASE-11715> opened.

2014-08-10 7:12 GMT-04:00 Jean-Marc Spaggiari <[email protected]>:

+1 too for a tool to produce a hash of a table: one hash per region or, as Lars said, one hash per range. You define the number of buckets you want and run the MR job, which produces a list of hashes, and then you compare that list between the 2 clusters. Might be pretty simple to do. The more buckets you define, the lower the risk of a hash collision. We could even have a global hash plus one hash per bucket, and other options...

2014-08-10 1:59 GMT-04:00 anil gupta <[email protected]>:

+1 for a MerkleTree or range-hash based implementation. We had a table with 1 billion records. We ran verifyRep for that table across two data centers and it took close to 1 week to finish. It seems that, at present, VerifyRep compares every row byte by byte.

--
Thanks & Regards,
Anil Gupta

On Sat, Aug 9, 2014 at 6:11 PM, lars hofhansl <[email protected]> wrote:

VerifyReplication is something you could use. It's not replication specific, just named that way because it was initially conceived as a tool to verify that replication is working correctly. Unfortunately it will need to ship all data from the remote cluster, which is quite inefficient. I think we should include a better way with HBase, maybe using Merkle trees, or at least hashes of ranges, and compare those.

-- Lars
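For reference, a minimal sketch of the bucketed-hash idea discussed above. The proposal is an MR job; this is only a sequential single-client stand-in to illustrate the comparison, and the table name 'ADMd5' and bucket count are placeholders taken from this thread. It assumes the 0.92-era client API in use here:

import java.security.MessageDigest;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class TableBucketHash {
  public static void main(String[] args) throws Exception {
    int buckets = 32; // more buckets = finer comparison, lower collision risk
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "ADMd5");
    MessageDigest[] md = new MessageDigest[buckets];
    for (int i = 0; i < buckets; i++) {
      md[i] = MessageDigest.getInstance("MD5");
    }
    Scan scan = new Scan();
    scan.setCaching(1000); // batch rows per RPC to speed up the full scan
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // Rows arrive in row-key order, so each bucket sees its rows in a
        // deterministic order and the digests are comparable across clusters.
        int b = (Arrays.hashCode(r.getRow()) & 0x7fffffff) % buckets;
        for (KeyValue kv : r.raw()) {
          // Digest the full KeyValue (row, family, qualifier, timestamp,
          // value), so retained-version differences show up as mismatches.
          md[b].update(kv.getBuffer(), kv.getOffset(), kv.getLength());
        }
      }
    } finally {
      scanner.close();
      table.close();
    }
    for (int i = 0; i < buckets; i++) {
      StringBuilder hex = new StringBuilder();
      for (byte x : md[i].digest()) {
        hex.append(String.format("%02x", x));
      }
      System.out.println("bucket " + i + ": " + hex);
    }
  }
}

Run the same class against both clusters and diff the printed lists; a mismatched bucket narrows the discrepancy to the rows that hash into it, and raising the bucket count narrows it further.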
________________________________
From: Colin Kincaid Williams <[email protected]>
To: [email protected]; lars hofhansl <[email protected]>
Sent: Saturday, August 9, 2014 2:28 PM
Subject: Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Hi Everybody,

I do wish to upgrade to a more recent HBase soon; however, the choice isn't entirely mine. Does anybody know how to verify the contents between tables across clusters after a copytable operation? I see replication.VerifyReplication, but that seems replication specific. Maybe I should have begun with replication in the first place...

On Fri, Aug 8, 2014 at 9:51 PM, lars hofhansl <[email protected]> wrote:

Hi Colin,

you might want to consider upgrading. The current stable version is 0.98.4 (soon .5).

Even just going to 0.94 will give a lot of new features, stability, and performance. 0.92.x can be upgraded to 0.94.x without any downtime and without any upgrade steps necessary. For an upgrade to 0.98 and later you'd need some downtime and also need to execute an upgrade step.

-- Lars

----- Original Message -----
From: Colin Kincaid Williams <[email protected]>
To: [email protected]
Sent: Friday, August 8, 2014 1:16 PM
Subject: Re: Large discrepancy in hdfs hbase rootdir size after copytable operation.

Not in the hbase shell I have:

hbase version
14/08/08 14:16:08 INFO util.VersionInfo: HBase 0.92.1-cdh4.1.3
14/08/08 14:16:08 INFO util.VersionInfo: Subversion file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hbase-0.92.1-cdh4.1.3 -r Unknown
14/08/08 14:16:08 INFO util.VersionInfo: Compiled by jenkins on Sat Jan 26 17:11:38 PST 2013

On Fri, Aug 8, 2014 at 12:56 PM, Ted Yu <[email protected]> wrote:

Using a simplified version of your command, I saw the following in the shell output (you may have noticed as well):

An argument ignored (unknown or overridden): BLOOMFILTER
An argument ignored (unknown or overridden): VERSIONS
0 row(s) in 2.1110 seconds

Cheers

On Fri, Aug 8, 2014 at 12:23 PM, Colin Kincaid Williams <[email protected]> wrote:

I have discovered the error. I made the mistake regarding the compression and the bloom filter: the new table doesn't have them enabled, and the old one does. However, I'm wondering how I can create tables with splits, bloom filter, and compression enabled. Shouldn't the following command return an error?
hbase(main):001:0> create 'ADMd5','a',{
hbase(main):002:1* BLOOMFILTER => 'ROW',
hbase(main):003:1* VERSIONS => '1',
hbase(main):004:1* COMPRESSION => 'SNAPPY',
hbase(main):005:1* MIN_VERSIONS => '0',
hbase(main):006:1* SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
hbase(main):007:2* '/zyuFR1VmhJyF4rbWsFnEg==',
hbase(main):008:2* '0sZYnBd83ul58d1O8I2JnA==',
hbase(main):009:2* '2+03N7IicZH3ltrqZUX6kQ==',
hbase(main):010:2* '4+/slRQtkBDU7Px6C9MAbg==',
hbase(main):011:2* '6+1dGCQ/IBrCsrNQXe/9xQ==',
hbase(main):012:2* '7+2pvtpHUQHWkZJoouR9wQ==',
hbase(main):013:2* '8+4n2deXhzmrpe//2Fo6Fg==',
hbase(main):014:2* '9+4SKW/BmNzpL68cXwKV1Q==',
hbase(main):015:2* 'A+4ajStFkjEMf36cX5D9xg==',
hbase(main):016:2* 'B+6Zm6Kccb3l6iM2L0epxQ==',
hbase(main):017:2* 'C+6lKKDiOWl5qrRn72fNCw==',
hbase(main):018:2* 'D+6dZMyn7m+NhJ7G07gqaw==',
hbase(main):019:2* 'E+6BrimmrpAd92gZJ5hyMw==',
hbase(main):020:2* 'G+5tisu4xWZMOJnDHeYBJg==',
hbase(main):021:2* 'I+7fRy4dvqcM/L6dFRQk9g==',
hbase(main):022:2* 'J+8ECMw1zeOyjfOg/ypXJA==',
hbase(main):023:2* 'K+7tenLYn6a1aNLniL6tbg==',]}
0 row(s) in 1.8010 seconds

hbase(main):024:0> describe 'ADMd5'
DESCRIPTION                                                                   ENABLED
{NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}  true
1 row(s) in 0.0420 seconds
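Ted's warning output explains the behavior: the family was given as the bare string 'a', and the dict that followed contains no NAME key, so the shell treated BLOOMFILTER, VERSIONS, COMPRESSION, and MIN_VERSIONS as table-scope arguments, warned that the unknown ones were ignored, and still honored SPLITS. In the shell, the fix is to wrap the per-family options in their own {NAME => 'a', ...} dict and pass SPLITS in a separate dict. Alternatively, creating the table from the Java client API makes the family settings explicit. A minimal sketch, assuming the 0.92-era classes in use here (several of them moved or were renamed in later releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;
import org.apache.hadoop.hbase.regionserver.StoreFile;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePresplitADMd5 {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Per-family settings go on the column descriptor itself, so they
    // cannot be silently dropped the way the shell dropped them above.
    HColumnDescriptor fam = new HColumnDescriptor("a");
    fam.setBloomFilterType(StoreFile.BloomType.ROW);
    fam.setMaxVersions(1);
    fam.setMinVersions(0);
    fam.setCompressionType(Compression.Algorithm.SNAPPY);

    HTableDescriptor desc = new HTableDescriptor("ADMd5");
    desc.addFamily(fam);

    // Pre-split points, passed separately from the schema.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("/++ASUZm4u7YsTcF/VtK6Q=="),
        Bytes.toBytes("/zyuFR1VmhJyF4rbWsFnEg=="),
        // ... the remaining split keys from the shell command above ...
        Bytes.toBytes("K+7tenLYn6a1aNLniL6tbg==")
    };
    admin.createTable(desc, splits);
    admin.close();
  }
}

Running describe 'ADMd5' afterwards should show BLOOMFILTER => 'ROW', VERSIONS => '1', and COMPRESSION => 'SNAPPY' on the family, rather than the defaults seen above.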
On Thu, Aug 7, 2014 at 5:50 PM, Jean-Marc Spaggiari <[email protected]> wrote:

Hi Colin,

Just to make sure: is table A on the source cluster and not compressed, and table B on the destination cluster and SNAPPY compressed? Is that correct? Then the ratio should be the opposite. Are you able to run du -h from hadoop to see if all the regions are evenly bigger, or if anything else is wrong?

2014-08-07 20:44 GMT-04:00 Colin Kincaid Williams <[email protected]>:

I haven't yet tried to major compact table B. I will look up some documentation on WALs and snapshots to find this information in the hdfs filesystem tomorrow. Could it be caused by the bloomfilter existing on table B but not on table A? The funny thing is the source table is smaller than the destination.

On Thu, Aug 7, 2014 at 4:50 PM, Esteban Gutierrez <[email protected]> wrote:

Hi Colin,

Have you verified whether the content of /a_d includes WALs and/or the content of the snapshots or the HBase archive? Have you tried to major compact table B? Does it make any difference?

regards,
esteban.

--
Cloudera, Inc.
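One way to check Esteban's question is to break the usage down per subdirectory of the rootdir. A minimal sketch, assuming the /a_d path from this thread and the stock Hadoop FileSystem API (system directory names such as .logs, .oldlogs, and .archive vary with the HBase version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RootdirDu {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Print total bytes under each child of the rootdir, so WAL, archive,
    // and table directories can be compared at a glance.
    for (FileStatus st : fs.listStatus(new Path("/a_d"))) {
      long bytes = fs.getContentSummary(st.getPath()).getLength();
      System.out.printf("%15d  %s%n", bytes, st.getPath().getName());
    }
    fs.close();
  }
}

If the table directory itself accounts for the bulk, listing its children the same way shows whether all regions are evenly bigger, as Jean-Marc suggests.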
On Thu, Aug 7, 2014 at 2:00 PM, Colin Kincaid Williams <[email protected]> wrote:

I used the copy table command to copy a database between the original cluster A and a new cluster B. I have noticed that the rootdir is larger than 2X the size of the original, and I am trying to account for such a large difference. The following are some details about the table.

I'm trying to figure out why my copied table is more than 2X the size of the original table. Could the bloomfilter itself account for this?

The guide I used as a reference:
http://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters

Supposedly the original command used to create the table on cluster A:

create 'ADMd5', {NAME => 'a', BLOOMFILTER => 'ROW', VERSIONS => '1', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0'}

How I created the target table on cluster B:

create 'ADMd5','a',{
BLOOMFILTER => 'ROW',
VERSIONS => '1',
COMPRESSION => 'SNAPPY',
MIN_VERSIONS => '0',
SPLITS =>['/++ASUZm4u7YsTcF/VtK6Q==',
'/zyuFR1VmhJyF4rbWsFnEg==',
'0sZYnBd83ul58d1O8I2JnA==',
'2+03N7IicZH3ltrqZUX6kQ==',
'4+/slRQtkBDU7Px6C9MAbg==',
'6+1dGCQ/IBrCsrNQXe/9xQ==',
'7+2pvtpHUQHWkZJoouR9wQ==',
'8+4n2deXhzmrpe//2Fo6Fg==',
'9+4SKW/BmNzpL68cXwKV1Q==',
'A+4ajStFkjEMf36cX5D9xg==',
'B+6Zm6Kccb3l6iM2L0epxQ==',
'C+6lKKDiOWl5qrRn72fNCw==',
'D+6dZMyn7m+NhJ7G07gqaw==',
'E+6BrimmrpAd92gZJ5hyMw==',
'G+5tisu4xWZMOJnDHeYBJg==',
'I+7fRy4dvqcM/L6dFRQk9g==',
'J+8ECMw1zeOyjfOg/ypXJA==',
'K+7tenLYn6a1aNLniL6tbg==']}

How the tables now appear in the hbase shell:

table A:

describe 'ADMd5'
DESCRIPTION                                                                   ENABLED
{NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}  true
1 row(s) in 0.0370 seconds

table B:

hbase(main):003:0> describe 'ADMd5'
DESCRIPTION                                                                   ENABLED
{NAME => 'ADMd5', FAMILIES => [{NAME => 'a', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}  true
1 row(s) in 0.0280 seconds

The containing folder size in hdfs:

table A:
sudo -u hdfs hadoop fs -dus -h /a_d
dus: DEPRECATED: Please use 'du -s' instead.
227.4g  /a_d

table B:
sudo -u hdfs hadoop fs -dus -h /a_d
dus: DEPRECATED: Please use 'du -s' instead.
501.0g  /a_d

https://gist.github.com/drocsid/80bba7b6b19d64fde6c2
