Heya Something,

I had a similar task recently, and by far the best way to go about this is bulk loading after pre-splitting your target table. As you know, ImportTsv doesn't understand Avro files, so I hacked together my own ImportAvro class to create the HFiles that I eventually moved into HBase with completebulkload. I haven't committed my class anywhere because it's a pretty ugly hack, but I'm happy to share it with you as a starting point. Doing billions of Puts will just drive you crazy.
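To give you the flavour of it, here's a minimal sketch of the approach (not my actual class - names like ImportAvro, AvroHFileMapper and "your_table" are placeholders, and it assumes the newer Avro mapreduce bindings plus your "info:frequency" layout). The mapper emits (rowkey, Put) pairs, and HFileOutputFormat.configureIncrementalLoad wires up the partitioner and reducer against the pre-split table:

import java.io.IOException;

import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ImportAvro {

  // Mapper: one Avro record in, one (rowkey, Put) pair out per element of "bag".
  public static class AvroHFileMapper
      extends Mapper<AvroKey<GenericRecord>, NullWritable, ImmutableBytesWritable, Put> {

    private static final String SEPARATOR = "*";

    @Override
    protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
        throws IOException, InterruptedException {
      GenericRecord record = key.datum();
      String u1 = record.get("u1").toString();

      @SuppressWarnings("unchecked")
      GenericData.Array<GenericData.Record> fields =
          (GenericData.Array<GenericData.Record>) record.get("bag");
      for (GenericData.Record rec : fields) {
        Integer freq = (Integer) rec.get("freq");
        if (freq == null) {
          freq = 0;
        }
        String rowKey = u1 + SEPARATOR + rec.get("n1") + SEPARATOR
            + rec.get("c1") + SEPARATOR + rec.get("s1");
        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("frequency"),
            Bytes.toBytes(freq.toString()));
        context.write(new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(HBaseConfiguration.create(), "ImportAvro");
    job.setJarByClass(ImportAvro.class);
    job.setMapperClass(AvroHFileMapper.class);
    // job.setInputFormatClass(AvroKeyInputFormat.class);  // plus AvroJob.setInputKeySchema(...)
    // FileInputFormat / FileOutputFormat input and output paths go here.

    // The target table must already exist, pre-split on your key distribution.
    // configureIncrementalLoad sets up TotalOrderPartitioner + PutSortReducer so
    // each reducer writes HFiles aligned to one region.
    HTable table = new HTable(job.getConfiguration(), "your_table");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
    // Afterwards: hadoop jar hbase-<version>.jar completebulkload <output-dir> your_table
  }
}

The pre-splitting matters because the number of reducers (and HFiles) is driven by the number of regions - with a single unsplit region you'd funnel everything through one reducer. And since the HFiles are written directly, the WAL and write-buffer tuning in your mapper stop being an issue.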
Cheers,
Oliver

On 2012-05-09, at 4:51 PM, Something Something wrote:

> I ran the following MR job that reads AVRO files & puts them on HBase. The
> files have tons of data (billions). We have a fairly decent size cluster.
> When I ran this MR job, it brought down HBase. When I commented out the
> Puts on HBase, the job completed in 45 seconds (yes that's seconds).
>
> Obviously, my HBase configuration is not ideal. I am using all the default
> HBase configurations that come out of Cloudera's distribution: 0.90.4+49.
>
> I am planning to read up on the following two:
>
> http://hbase.apache.org/book/important_configurations.html
> http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/
>
> But can someone quickly take a look and recommend a list of priorities,
> such as "try this first..."? That would be greatly appreciated. As
> always, thanks for the time.
>
> Here's the Mapper. (There's no reducer):
>
> public class AvroProfileMapper extends AvroMapper<GenericData.Record, NullWritable> {
>     private static final Logger logger =
>             LoggerFactory.getLogger(AvroProfileMapper.class);
>
>     final private String SEPARATOR = "*";
>
>     private HTable table;
>
>     private String datasetDate;
>     private String tableName;
>
>     @Override
>     public void configure(JobConf jobConf) {
>         super.configure(jobConf);
>         datasetDate = jobConf.get("datasetDate");
>         tableName = jobConf.get("tableName");
>
>         // Open table for writing
>         try {
>             table = new HTable(jobConf, tableName);
>             table.setAutoFlush(false);
>             table.setWriteBufferSize(1024 * 1024 * 12);
>         } catch (IOException e) {
>             throw new RuntimeException("Failed table construction", e);
>         }
>     }
>
>     @Override
>     public void map(GenericData.Record record, AvroCollector<NullWritable> collector,
>                     Reporter reporter) throws IOException {
>
>         String u1 = record.get("u1").toString();
>
>         GenericData.Array<GenericData.Record> fields =
>                 (GenericData.Array<GenericData.Record>) record.get("bag");
>         for (GenericData.Record rec : fields) {
>             Integer s1 = (Integer) rec.get("s1");
>             Integer n1 = (Integer) rec.get("n1");
>             Integer c1 = (Integer) rec.get("c1");
>             Integer freq = (Integer) rec.get("freq");
>             if (freq == null) {
>                 freq = 0;
>             }
>
>             String key = u1 + SEPARATOR + n1 + SEPARATOR + c1 + SEPARATOR + s1;
>             Put put = new Put(Bytes.toBytes(key));
>             put.setWriteToWAL(false);
>             put.add(Bytes.toBytes("info"), Bytes.toBytes("frequency"),
>                     Bytes.toBytes(freq.toString()));
>             try {
>                 table.put(put);
>             } catch (IOException e) {
>                 throw new RuntimeException("Error while writing to " + table + " table.", e);
>             }
>
>         }
>         logger.error("------------ Finished processing user: " + u1);
>     }
>
>     @Override
>     public void close() throws IOException {
>         table.close();
>     }
>
> }

--
Oliver Meyn
Software Developer
Global Biodiversity Information Facility (GBIF)
+45 35 32 15 12
http://www.gbif.org