I ran some more tests to clarify my questions above. During the same MR job, 5 of my 8 regionservers died before I terminated it. Here's what I saw in one of the HBase regionserver logs...
Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.18.48:50010

(with many different IPs...) Then I get errors like this:

Error Recovery for block blk_-4108085472136309132_97478 in pipeline 192.168.18.49:50010, 192.168.18.48:50010, 192.168.18.16:50010: bad datanode 192.168.18.48:50010

Things continue for a while, and then I get this:

Exception while reading from blk_1698571189906026963_93533 of /hbase-0.19/joinedcontent/2018887968/content/mapfiles/3048972636250467459/data from 192.168.18.49:50010: java.io.IOException: Premeture EOF from inputStream

Then I start seeing stuff like this:

Error Recovery for block blk_3202913437369696154_99607 bad datanode[0] nodes == null
2009-06-09 16:31:15,330 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase-0.19/joinedcontent/compaction.dir/2018887968/content/mapfiles/2166568776864749492/data" - Aborting...
Exception in createBlockOutputStream java.io.IOException: Could not read from stream
Abandoning block blk_-4592653855912358506_99607

And this:

DataStreamer Exception: java.io.IOException: Unable to create new block.

Then the regionserver eventually dies.

On Tue, Jun 9, 2009 at 11:51 AM, Bradford Stephens<bradfordsteph...@gmail.com> wrote:
> I sort of need the reduce, since I'm combining primary keys from a CSV
> file. Although I guess I could just use the combiner class... hrm.
>
> How do I decrease the batch size?
>
> Also, I tried to make a map-only task that used ImmutableBytesWritable
> and BatchUpdate as the output K and V, and TableOutputFormat as the
> OutputFormat -- the job fails, saying that "HbaseMapWritable cannot be
> cast to org.apache.hadoop.hbase.io.BatchUpdate". I've checked my
> Mapper multiple times; it's definitely outputting a BatchUpdate.
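[For readers following along: a hedged sketch of the map-only job wiring being discussed. Class names like CsvUploader and CsvToBatchUpdateMapper are hypothetical, and the calls are from memory of the 0.19-era org.apache.hadoop.hbase.mapred API, so verify them against your jars. The key point is setNumReduceTasks(0), so TableOutputFormat receives the mapper's BatchUpdate values directly; a cast error like the one above is worth tracing back through the job's output key/value and reducer wiring.]

```java
// Sketch only -- 0.19-era "mapred" API from memory; CsvUploader and
// CsvToBatchUpdateMapper are hypothetical placeholder names.
JobConf job = new JobConf(new Configuration(), CsvUploader.class);
job.setMapperClass(CsvToBatchUpdateMapper.class);   // emits <ImmutableBytesWritable, BatchUpdate>
job.setNumReduceTasks(0);                           // map-only: no sort, map output goes straight out
job.setOutputFormat(org.apache.hadoop.hbase.mapred.TableOutputFormat.class);
job.set(org.apache.hadoop.hbase.mapred.TableOutputFormat.OUTPUT_TABLE, "joinedcontent");
job.setOutputKeyClass(org.apache.hadoop.hbase.io.ImmutableBytesWritable.class);
job.setOutputValueClass(org.apache.hadoop.hbase.io.BatchUpdate.class);
JobClient.runJob(job);
```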
>
> On Tue, Jun 9, 2009 at 10:43 AM, stack<st...@duboce.net> wrote:
>> On Tue, Jun 9, 2009 at 10:13 AM, Bradford Stephens <
>> bradfordsteph...@gmail.com> wrote:
>>
>>> Hey rock stars,
>>
>> Flattery makes us perk up for sure.
>>
>>> I'm having problems loading large amounts of data into a table (about
>>> 120 GB, 250 million rows). My Map task runs fine, but when it comes to
>>> reducing, things start burning. 'top' indicates that I only have ~100M
>>> of RAM free on my datanodes, and every process starts thrashing...
>>> even ssh and ping. Then I start to get errors like:
>>>
>>> "org.apache.hadoop.hbase.client.RegionOfflineException: region
>>> offline: joinedcontent,,1244513452487"
>>
>> See if said region is actually offline? Try getting a row from it in the shell.
>>
>>> and:
>>>
>>> "Task attempt_200906082135_0001_r_000002_0 failed to report status for
>>> 603 seconds. Killing!"
>>
>> Sounds like the nodes are heavily loaded... so loaded that either the task
>> can't report in, or it's stuck on an HBase update for so long that it takes
>> ten minutes or more to return.
>>
>> One thing to look at is disabling batching or making batches smaller. When
>> the batch is big, it can take a while under high load for all row edits to
>> go in. The HBase client will not return till all row commits have succeeded.
>> With smaller batches, the task is more likely to return rather than get
>> killed for taking longer than the report period to check in.
>>
>> What's your MR job like? You're updating HBase in the reduce phase, I
>> presume (TableOutputFormat?). Do you need the reduce? Can you update HBase
>> in the map step? That saves the sort the MR framework is doing -- a sort
>> that is unnecessary given that HBase orders on insertion.
>>
>> Can you try with a lighter load? Maybe a couple of smaller MR jobs rather
>> than one big one?
>>
>> St.Ack
>>
>>> I'm running Hadoop 0.19.1 and HBase 0.19.3, with 1 master/namenode and
>>> 8 regionservers: 2 x dual-core Intel 3.2 GHz procs and 4 GB of RAM each;
>>> 16 map tasks, 8 reducers. I've set the MAX_HEAP in hadoop-env to 768, and
>>> the one in hbase-env is at its default of 1000. I've also done all
>>> the performance enhancements in the Wiki with the file handlers, the
>>> garbage collection, and the epoll limits.
>>>
>>> What am I missing? :)
>>>
>>> Cheers,
>>> Bradford
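[For readers following along: a hedged sketch of stack's "smaller batches" suggestion, committing from the reduce in small units and reporting progress so the task isn't killed at the 600-second mark. The HTable.commit(BatchUpdate) call is the 0.19-era client write API as I remember it, and the HTable field setup in configure() is assumed, not shown -- treat this as a pattern to adapt, not a drop-in implementation.]

```java
// Sketch only -- commit one BatchUpdate at a time instead of accumulating a
// huge batch, and ping the Reporter so the framework knows the task is alive.
// "table" is an HTable assumed to be opened in configure(); API names are
// from memory of HBase 0.19 and should be checked against your version.
public void reduce(ImmutableBytesWritable key, Iterator<BatchUpdate> values,
                   OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
                   Reporter reporter) throws IOException {
  while (values.hasNext()) {
    table.commit(values.next()); // small commit: returns quickly even under load
    reporter.progress();         // reset the task's report-status timeout
  }
}
```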