Hi Laxman,

Can you clarify what you mean by "duplicates"?
The TreeSet uses KeyValue.COMPARATOR, which treats KVs as the same only if
the entire key (including column and timestamp) is the same.
Do you have KVs with the same rowKey, columnKey, and timestamp, but different
values?
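
To make the question concrete, here is a minimal sketch of how a TreeSet behaves when its comparator looks at the key only. Cell and KEY_ONLY are hypothetical stand-ins, not the real KeyValue or KeyValue.COMPARATOR, and the timestamp ordering used here is arbitrary.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical stand-in for a KeyValue; NOT the HBase class.
final class Cell {
    final String row;
    final String column;
    final long timestamp;
    final String value;

    Cell(String row, String column, long timestamp, String value) {
        this.row = row;
        this.column = column;
        this.timestamp = timestamp;
        this.value = value;
    }
}

final class TreeSetDedupDemo {
    // Assumption: compares row, column, and timestamp only; the value is ignored.
    static final Comparator<Cell> KEY_ONLY = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.column.compareTo(b.column);
        if (c != 0) return c;
        return Long.compare(a.timestamp, b.timestamp);
    };

    // Returns how many cells survive insertion into a TreeSet ordered by KEY_ONLY.
    static int distinctKeys(Cell... cells) {
        TreeSet<Cell> set = new TreeSet<>(KEY_ONLY);
        for (Cell c : cells) {
            // add() keeps the existing element and returns false when compare() == 0
            set.add(c);
        }
        return set.size();
    }

    public static void main(String[] args) {
        Cell a = new Cell("r1", "cf:q", 1L, "v1");
        Cell b = new Cell("r1", "cf:q", 1L, "v2"); // same key as a, different value
        Cell c = new Cell("r1", "cf:q", 2L, "v3"); // different timestamp
        System.out.println(distinctKeys(a, b, c)); // prints 2
    }
}
```

In this sketch, only the cell whose full key matches an earlier one is dropped; a cell that differs in timestamp is kept.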


Thanks.

-- Lars


________________________________
 From: Laxman <lakshman...@huawei.com>
To: d...@hbase.apache.org; user@hbase.apache.org 
Sent: Monday, March 12, 2012 8:17 AM
Subject: Bulkload discards duplicates
 
In our test, we noticed that bulkload is discarding duplicates.
On further analysis, I noticed that duplicates are discarded only when
they exist in the same input file and in the same split.
I think this is a bug and not intentional behavior.

The use of TreeSet in the code snippet below is causing the issue.

PutSortReducer.reduce()
======================
      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
      long curSize = 0;
      // stop at the end or the RAM threshold
      while (iter.hasNext() && curSize < threshold) {
        Put p = iter.next();
        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
          for (KeyValue kv : kvs) {
            map.add(kv);
            curSize += kv.getLength();
          }
        }
      }

Changing this back to a List and then sorting explicitly will solve the issue.
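
A minimal sketch of the List-plus-sort idea, assuming a key-only comparator. KeyedCell and KEY_ONLY are hypothetical stand-ins, not the real KeyValue or KeyValue.COMPARATOR, and this is not the actual PutSortReducer change. Entries that compare equal are all retained, and because List.sort is stable they keep their insertion order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a KeyValue; NOT the HBase class.
final class KeyedCell {
    final String row;
    final String column;
    final long timestamp;
    final String value;

    KeyedCell(String row, String column, long timestamp, String value) {
        this.row = row;
        this.column = column;
        this.timestamp = timestamp;
        this.value = value;
    }
}

final class ListSortDemo {
    // Assumption: compares row, column, and timestamp only; the value is ignored.
    static final Comparator<KeyedCell> KEY_ONLY = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.column.compareTo(b.column);
        if (c != 0) return c;
        return Long.compare(a.timestamp, b.timestamp);
    };

    // Collects cells into a List, sorts by key, and returns the values in sorted order.
    static String sortedValues(KeyedCell... cells) {
        List<KeyedCell> list = new ArrayList<>();
        for (KeyedCell c : cells) {
            list.add(c); // unlike TreeSet.add(), nothing is dropped here
        }
        list.sort(KEY_ONLY); // stable: equal keys keep insertion order
        StringBuilder sb = new StringBuilder();
        for (KeyedCell c : list) {
            if (sb.length() > 0) sb.append(',');
            sb.append(c.value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        KeyedCell a = new KeyedCell("r1", "cf:q", 1L, "v1");
        KeyedCell b = new KeyedCell("r1", "cf:q", 1L, "v2"); // same key as a
        KeyedCell c = new KeyedCell("r2", "cf:q", 1L, "v3");
        System.out.println(sortedValues(c, a, b)); // prints v1,v2,v3
    }
}
```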

I have filed a new JIRA for this:
https://issues.apache.org/jira/browse/HBASE-5564
--
Regards,
Laxman
