Bryan: From the javadoc of Backup.java: "it favors swallowing exceptions and incrementing counters as opposed to failing."
Can you share some experience with how you handled the errors reported by Backup? Thanks.

On Fri, Aug 15, 2014 at 10:38 AM, Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:

> I agree it would be nice if this were provided by HBase, but it's already
> possible to work directly with the HFiles. All you need is a custom Hadoop
> job. A good starting point is
> https://github.com/mozilla-metrics/akela/blob/master/src/main/java/com/mozilla/hadoop/Backup.java
> which you can modify to your needs. We've used our own modification of this
> job many times for our own cluster migrations. The idea is that it is
> incremental: as HFiles get compacted, deleted, etc., you can just run it
> again and move smaller and smaller amounts of data.
>
> Working at the HDFS level should be faster, as you can use more mappers.
> You will still be taxing the I/O of the source cluster, but not adding load
> to the actual regionserver processes (IPC queue, memory, etc.).
>
> If you upgrade to CDH5 (or the equivalent HDFS version), you can use HDFS
> snapshots to minimize the need to re-run the above Backup job (since you
> are already using replication to keep data up to date).
>
> On Fri, Aug 15, 2014 at 1:11 PM, Esteban Gutierrez <este...@cloudera.com> wrote:
>
> > 1.8 TB in a day is not terribly slow if that number comes from the
> > CopyTable counters and you are moving data across data centers over
> > public networks: that is about 20 MB/sec. Also, CopyTable won't compress
> > anything on the wire, so the network overhead can be significant. If you
> > use anything like Snappy for block compression and/or FAST_DIFF for block
> > encoding of the HFiles, then taking snapshots and exporting them with the
> > ExportSnapshot tool should be the way to go.
> >
> > cheers,
> > esteban.
> >
> > --
> > Cloudera, Inc.
> >
> > On Thu, Aug 14, 2014 at 11:24 PM, tobe <tobeg3oo...@gmail.com> wrote:
> >
> > > Thanks @lars.
> > > We're using HBase 0.94.11 and followed the instructions to run
> > > `./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable
> > > --peer.adr=hbase://cluster_name table_name`. We have a namespace service
> > > that resolves "hbase://cluster_name" to the ZooKeeper quorum. The job
> > > ran on a shared YARN cluster.
> > >
> > > The performance is affected by many factors, but we haven't found the
> > > cause. It would be great to hear your suggestions.
> > >
> > > On Fri, Aug 15, 2014 at 1:34 PM, lars hofhansl <la...@apache.org> wrote:
> > >
> > > > What version of HBase? How are you running CopyTable? A day for 1.8 TB
> > > > is not what we would expect.
> > > > You can definitely take a snapshot and then export the snapshot to
> > > > another cluster, which will move the actual files; but CopyTable
> > > > should not be so slow.
> > > >
> > > > -- Lars
> > > >
> > > > ________________________________
> > > > From: tobe <tobeg3oo...@gmail.com>
> > > > To: "u...@hbase.apache.org" <u...@hbase.apache.org>
> > > > Cc: dev@hbase.apache.org
> > > > Sent: Thursday, August 14, 2014 8:18 PM
> > > > Subject: A better way to migrate the whole cluster?
> > > >
> > > > Sometimes our users want to upgrade their servers or move to a new
> > > > datacenter, and then we have to migrate the data out of HBase.
> > > > Currently we enable replication from the old cluster to the new
> > > > cluster, and run CopyTable to move the older data.
> > > >
> > > > It's a little inefficient: it takes more than a day to migrate 1.8 TB
> > > > of data, and more time to verify. Is there a better way to do this,
> > > > such as snapshots or working purely with the HDFS files?
> > > >
> > > > And what is the best practice, or your valuable experience?
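For reference, the snapshot-based path that Esteban and Lars suggest looks roughly like the following on 0.94.x (snapshots must be enabled via `hbase.snapshot.enabled`; the table name, snapshot name, mapper count, and NameNode address below are placeholders, not values from the thread):

```shell
# 1) Take a snapshot of the table (in the hbase shell on the source cluster):
#      hbase> snapshot 'table_name', 'table_name-snapshot'

# 2) Export the snapshot's HFiles directly into the destination cluster's
#    HDFS, bypassing the regionserver IPC path entirely:
./bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot table_name-snapshot \
    -copy-to hdfs://dest-namenode:8020/hbase \
    -mappers 16

# 3) On the destination cluster, materialize the snapshot as a live table:
#      hbase> clone_snapshot 'table_name-snapshot', 'table_name'
```

Because ExportSnapshot moves immutable HFiles with plain MapReduce, block compression and encoding (Snappy, FAST_DIFF) are preserved on the wire, and the mapper count is the main throughput knob.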
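If CopyTable stays in the picture, its throughput is often bound by scanner caching and speculative re-reads; a tuned invocation might look like this sketch (the caching value and the ZooKeeper quorum address are illustrative assumptions, not taken from the thread):

```shell
# CopyTable with a larger scan cache, and speculative execution disabled so
# that slow mappers don't re-read the same rows from the source cluster.
./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
    -Dhbase.client.scanner.caching=400 \
    -Dmapred.map.tasks.speculative.execution=false \
    --peer.adr=zk1,zk2,zk3:2181:/hbase \
    table_name
```

Even tuned, CopyTable still goes through the regionserver read path on the source cluster, which is why the HDFS-level approaches above tend to scale better.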
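Esteban's 20 MB/sec figure from the thread is easy to sanity-check: 1.8 TB copied in one day, using decimal units (the figures are taken from the numbers quoted above, not from any measurement):

```shell
# Rough sustained throughput for 1.8 TB copied in 24 hours (decimal units):
# 1.8 TB = 1,800,000 MB; one day = 86,400 seconds.
awk 'BEGIN { printf "%.1f MB/sec\n", 1800000 / 86400 }'
# prints "20.8 MB/sec"
```

That is roughly the 20 MB/sec Esteban estimated, which is plausible for an uncompressed cross-datacenter copy over a public network.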