Hi folks -- we're running CDH3u3 (HBase 0.90.4).  I'm trying to export
data from an existing table that has far too many regions (2600+ for
only 8 regionservers) into one with a more reasonable region count for
this cluster (256).  Overall data volume is approx. 3 TB.

I thought initially that I'd use the bulkload/importtsv approach, but
it turns out this table's schema has column qualifiers made from
timestamps, so it's impossible for me to specify a list of target
columns for importtsv.  From what I can tell, the TSV interchange
format requires your data to have the same colquals throughout.
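As far as I can tell, that means enumerating every qualifier up front
on the command line, something like this (the family "d" and the two
qualifiers here are made up -- the real table has thousands, one per
timestamp):

  hadoop jar hbase-0.90.4.jar importtsv \
    -Dimporttsv.columns=HBASE_ROW_KEY,d:20120301,d:20120302,... \
    mynewtable /path/to/tsv

which clearly can't work when the qualifiers vary from row to row.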

I took a look at CopyTable and Export/Import, which both appear to
wrap the HBase client API (emitting Puts from a mapper).  But I'm
seeing significant performance problems with this approach, to the
point that I'm not sure it's feasible.  Export appears to work OK, but
when I try importing the data back from HDFS, the rest of our cluster
drags to a halt -- client writes (even those not associated with the
Import) start timing out.  Fwiw, Import already disables autoFlush
(via TableOutputFormat).
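
For reference, Import's mapper in 0.90 seems to boil down to roughly
this (paraphrased from my reading of the source, so details may be
off):

  // Replay each exported Result as a Put through the normal
  // client write path.
  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    Put put = new Put(row.get());
    for (KeyValue kv : value.raw()) {
      put.add(kv);
    }
    // TableOutputFormat buffers these client-side (autoFlush off),
    // but every edit still goes through the regionservers and the WAL.
    context.write(row, put);
  }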

From [1], one option I could try would be to disable the WAL
(presumably Put.setWriteToWAL(false) in the 0.90 client).  Are there
other techniques I should try?  Has anyone implemented a bulkloader
which doesn't use the TSV format?
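
Or, half-answering my own question: would something along these lines
work as a non-TSV bulkloader?  It reads Export's sequence files and
writes HFiles via HFileOutputFormat instead of doing live Puts.  Just
a sketch against the 0.90 API, untested, and the class name and
argument layout are mine:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class ExportBulkLoader {

    // Same Result-to-Put conversion as Import, but the Puts feed
    // HFileOutputFormat rather than the live write path.
    static class ResultToPutMapper
        extends Mapper<ImmutableBytesWritable, Result, ImmutableBytesWritable, Put> {
      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context)
          throws IOException, InterruptedException {
        Put put = new Put(row.get());
        for (KeyValue kv : value.raw()) {
          put.add(kv);
        }
        context.write(row, put);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "export-bulkload");
      job.setJarByClass(ExportBulkLoader.class);

      job.setMapperClass(ResultToPutMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);

      job.setInputFormatClass(SequenceFileInputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));   // Export output dir
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile staging dir

      // Wires in TotalOrderPartitioner + PutSortReducer keyed off the
      // target table's region boundaries, so the new table would need
      // to be presplit into its 256 regions before running this.
      HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, args[2]));

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

followed by completebulkload (LoadIncrementalHFiles) to move the
HFiles into the target table.  If I'm reading the bulk-load docs
right, that path bypasses the memstore and WAL entirely, which is
really what I'm after here.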

Norbert

[1] http://hbase.apache.org/book/perf.writing.html
