Hi folks -- we're running CDH3u3 (HBase 0.90.4). I'm trying to export data from an existing table that has far too many regions (2600+ across only 8 regionservers) into one with a more reasonable region count for this cluster (256). Overall data volume is approx. 3 TB.
I thought initially that I'd use the bulkload/importtsv approach, but it turns out this table's schema uses timestamps as column qualifiers, so it's impossible for me to specify a fixed list of target columns for importtsv. From what I can tell, the TSV interchange format requires the same column qualifiers throughout your data.

I took a look at CopyTable and Export/Import, which both appear to wrap the HBase client API (emitting Puts from a mapper). But I'm seeing significant performance problems with this approach, to the point that I'm not sure it's feasible. Export appears to work OK, but when I try importing the data back from HDFS, the rest of our cluster grinds to a halt -- client writes (even those not associated with the Import) start timing out. FWIW, Import already disables autoFlush (via TableOutputFormat).

From [1], one option I could try would be to disable the WAL. Are there other techniques I should try? Has anyone implemented a bulkloader which doesn't use the TSV format? I've sketched out both ideas below -- sanity checks welcome.
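To make the question concrete, here's roughly what I have in mind for the no-WAL option: an Import-style mapper that rebuilds a Put from each exported Result but skips the WAL. (NoWalImportMapper is just my name for it; I'm assuming Export's SequenceFile of (ImmutableBytesWritable, Result) as input, with TableOutputFormat on the output side, same as Import.)

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Variant of Import's mapper: rebuild a Put from each exported Result,
// but skip the WAL to trade durability for write throughput.
public class NoWalImportMapper
    extends Mapper<ImmutableBytesWritable, Result, ImmutableBytesWritable, Put> {

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    Put put = new Put(row.get());
    for (KeyValue kv : value.raw()) {
      put.add(kv);               // keeps original family/qualifier/timestamp
    }
    put.setWriteToWAL(false);    // my assumption: acceptable here, see below
    context.write(row, put);
  }
}

My thinking is that losing unflushed edits on a regionserver crash is acceptable in this case, since the target table is throwaway until the copy completes and I can just rerun the job.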

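And here's the shape of the TSV-free bulkloader I'd write if that's viable: read Export's SequenceFiles, emit raw KeyValues (so arbitrary qualifiers pass through untouched), and let HFileOutputFormat.configureIncrementalLoad set up the sort/partition against the new pre-split table. (ExportToHFiles and KeyValueMapper are my names; args are the Export dir, an HFile staging dir, and the target table.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Reads Export's SequenceFiles and writes HFiles sized for the new
// pre-split table; no TSV, so the timestamp qualifiers are no problem.
public class ExportToHFiles {

  static class KeyValueMapper
      extends Mapper<ImmutableBytesWritable, Result, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws java.io.IOException, InterruptedException {
      for (KeyValue kv : value.raw()) {
        context.write(row, kv);    // emit each cell as-is
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "export-to-hfiles");
    job.setJarByClass(ExportToHFiles.class);
    job.setMapperClass(KeyValueMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // Export output dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HFile staging dir

    // As I understand it, this wires up the sort reducer and
    // TotalOrderPartitioner so the job emits one HFile set per region
    // of the target table.
    HTable target = new HTable(conf, args[2]);               // pre-split table
    HFileOutputFormat.configureIncrementalLoad(job, target);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

...followed by pointing completebulkload (LoadIncrementalHFiles) at the staging directory to move the HFiles into the table.

Norbert

[1] http://hbase.apache.org/book/perf.writing.html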