Are you sure each record in the input data is being uploaded with a unique key? For example, if you add two cells with the same row+column coordinates and let the regionserver supply the timestamp, they can both end up with the same row/family/qualifier/timestamp key. When you do your count, you'll only see the last instance added.
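To make it concrete, here is a little sketch against the 0.20 client API (table, family, and qualifier names are made up, not yours) showing two Puts at the same coordinates with the same timestamp leaving only one cell behind:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DuplicateKeyDemo {
  public static void main(String[] args) throws IOException {
    // Illustrative table name only.
    HTable table = new HTable(new HBaseConfiguration(), "mytable");

    long ts = System.currentTimeMillis();
    Put first = new Put(Bytes.toBytes("row-1"));
    first.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), ts, Bytes.toBytes("value-A"));
    Put second = new Put(Bytes.toBytes("row-1"));
    second.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), ts, Bytes.toBytes("value-B"));

    table.put(first);
    table.put(second);     // same row/family/qualifier/timestamp key: overwrites value-A
    table.flushCommits();  // a scan or rowcounter now sees a single cell, value-B
  }
}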
St.Ack

On Mon, Mar 22, 2010 at 8:15 AM, Nathan Harkenrider
<nathan.harkenri...@gmail.com> wrote:
> Thanks Ryan.
>
> We currently have the xceiver count set to 16k (not sure if this is too
> high) and the fh max is 32k, and are still seeing the data loss issue.
>
> I'll dig through the datanode logs for errors and report back.
>
> Regards,
>
> Nathan
>
> On Sun, Mar 21, 2010 at 7:11 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>
>> Maybe you are having HDFS capacity issues? Check your datanode logs
>> for any exceptions. While you are at it, double check the xceiver
>> count is set high (2048 is a good value) and the ulimit -n (fh max) is
>> also reasonably high - 32k should do it.
>>
>> I recently ran an import of 36 hours and perfectly imported 24 billion
>> rows into 2 tables and the row counts between the tables lined up
>> exactly.
>>
>> PS: one other thing, in your close() method of your map reduce, you
>> call HTable#flushCommits() right? right?
>>
>> On Sun, Mar 21, 2010 at 3:50 PM, Nathan Harkenrider
>> <nathan.harkenri...@gmail.com> wrote:
>> > Hi All,
>> >
>> > I'm currently running into data loss issues when bulk loading data into
>> > HBase. I'm loading data via a Map/Reduce job that is parsing XML and
>> > inserting rows into 2 HBase tables. The job is currently configured to run
>> > 30 mappers concurrently (3 per node) and is inserting at a rate of
>> > approximately 6000 rows/sec. The Map/Reduce job appears to run correctly,
>> > however, when I run the HBase rowcounter job on the tables afterwards the
>> > row count is less than expected. The data loss is small percentage wise
>> > (~200,000 rows out of 80,000,000) but concerning nevertheless.
>> >
>> > I managed to locate the following errors in the regionserver logs related to
>> > failed compactions and/or splits.
>> > http://pastebin.com/5WjDpS9F
>> >
>> > I'm running HBase 0.20.3 and Cloudera CDH2, on CentOS 5.4. The cluster is
>> > comprised of 11 machines, 1 master and 10 region servers. Each machine is 8
>> > cores, 8GB ram.
>> >
>> > Any advice is appreciated. Thanks,
>> >
>> > Nathan Harkenrider
>> > nathan.harkenri...@gmail.com
>> >
>>
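P.S. On Ryan's flushCommits() question up-thread: the pattern he means looks roughly like the sketch below, written against the new org.apache.hadoop.mapreduce API (where the hook is cleanup() rather than close()). Class, table, and column names are made up for illustration, not taken from your actual job.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class XmlLoadMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    table = new HTable(new HBaseConfiguration(), "mytable");
    table.setAutoFlush(false);  // buffer Puts client-side for throughput
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) throws IOException {
    // Row key must be unique per input record, or later writes silently replace earlier ones.
    Put put = new Put(Bytes.toBytes(value.toString()));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value.toString()));
    table.put(put);  // buffered on the client; not on the regionserver yet
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();  // without this, any still-buffered Puts are lost when the task exits
  }
}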