Are you sure each record in the input data is being uploaded with a unique key? For example, if you add two cells with the same row+column coordinates and let the regionserver supply the timestamp, they can both end up with the same row/family/qualifier/timestamp key. When you do your count, you'll only see the last instance added.
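To make it concrete, here is a little sketch against the 0.20 client API (table, family, and qualifier names are made up, not yours) showing two Puts at the same coordinates with the same timestamp leaving only one cell behind:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DuplicateKeyDemo {
  public static void main(String[] args) throws IOException {
    // Illustrative table name only.
    HTable table = new HTable(new HBaseConfiguration(), "mytable");

    long ts = System.currentTimeMillis();
    Put first = new Put(Bytes.toBytes("row-1"));
    first.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), ts, Bytes.toBytes("value-A"));
    Put second = new Put(Bytes.toBytes("row-1"));
    second.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), ts, Bytes.toBytes("value-B"));

    table.put(first);
    table.put(second);     // same row/family/qualifier/timestamp key: overwrites value-A
    table.flushCommits();  // a scan or rowcounter now sees a single cell, value-B
  }
}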
St.Ack

On Mon, Mar 22, 2010 at 8:15 AM, Nathan Harkenrider
<nathan.harkenri...@gmail.com> wrote:
> Thanks Ryan.
>
> We currently have the xceiver count set to 16k (not sure if this is too
> high) and the fh max is 32k, and are still seeing the data loss issue.
>
> I'll dig through the datanode logs for errors and report back.
>
> Regards,
>
> Nathan
>
> On Sun, Mar 21, 2010 at 7:11 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>
>> Maybe you are having HDFS capacity issues? Check your datanode logs
>> for any exceptions. While you are at it, double check the xceiver
>> count is set high (2048 is a good value) and the ulimit -n (fh max) is
>> also reasonably high - 32k should do it.
>>
>> I recently ran an import of 36 hours and perfectly imported 24 billion
>> rows into 2 tables and the row counts between the tables lined up
>> exactly.
>>
>> PS: one other thing, in your close() method of your map reduce, you
>> call HTable#flushCommits() right? right?
>>
>> On Sun, Mar 21, 2010 at 3:50 PM, Nathan Harkenrider
>> <nathan.harkenri...@gmail.com> wrote:
>> > Hi All,
>> >
>> > I'm currently running into data loss issues when bulk loading data into
>> > HBase. I'm loading data via a Map/Reduce job that is parsing XML and
>> > inserting rows into 2 HBase tables. The job is currently configured to run
>> > 30 mappers concurrently (3 per node) and is inserting at a rate of
>> > approximately 6000 rows/sec. The Map/Reduce job appears to run correctly,
>> > however, when I run the HBase rowcounter job on the tables afterwards the
>> > row count is less than expected. The data loss is small percentage wise
>> > (~200,000 rows out of 80,000,000) but concerning nevertheless.
>> >
>> > I managed to locate the following errors in the regionserver logs related to
>> > failed compactions and/or splits.
>> > http://pastebin.com/5WjDpS9F
>> >
>> > I'm running HBase 0.20.3 and Cloudera CDH2, on CentOS 5.4. The cluster is
>> > comprised of 11 machines, 1 master and 10 region servers. Each machine is 8
>> > cores, 8GB ram.
>> >
>> > Any advice is appreciated. Thanks,
>> >
>> > Nathan Harkenrider
>> > nathan.harkenri...@gmail.com
>> >
>>
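P.S. On Ryan's flushCommits() question up-thread: the pattern he means looks roughly like the sketch below, written against the new org.apache.hadoop.mapreduce API (where the hook is cleanup() rather than close()). Class, table, and column names are made up for illustration, not taken from your actual job.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class XmlLoadMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    table = new HTable(new HBaseConfiguration(), "mytable");
    table.setAutoFlush(false);  // buffer Puts client-side for throughput
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) throws IOException {
    // Row key must be unique per input record, or later writes silently replace earlier ones.
    Put put = new Put(Bytes.toBytes(value.toString()));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value.toString()));
    table.put(put);  // buffered on the client; not on the regionserver yet
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();  // without this, any still-buffered Puts are lost when the task exits
  }
}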