Re: Bulkload discards duplicates

lars hofhansl Mon, 12 Mar 2012 09:42:05 -0700

Hi Laxman,

Can you clarify what you mean by "duplicates"?
The TreeSet uses KeyValue.COMPARATOR, which treats KVs as the same only if 
the entire key (including column and timestamp) is the same.
Do you have KVs with the same rowKey, columnKey, and timestamp, but different 
values?
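
To make the question concrete, here is a minimal sketch of how a TreeSet built on a key-only comparator behaves. It does not use HBase: Cell, KEY_ONLY, and collapse are hypothetical stand-ins, and the comparator is a simplified ordering over (row, column, timestamp), not the real KeyValue.COMPARATOR.

```java
import java.util.Comparator;
import java.util.TreeSet;

public class KeyOnlyTreeSetDemo {
    /** Hypothetical stand-in for a KeyValue: three key fields plus a value. */
    public static final class Cell {
        public final String row;
        public final String column;
        public final long timestamp;
        public final String value;

        public Cell(String row, String column, long timestamp, String value) {
            this.row = row;
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Simplified comparator (assumption): looks only at the key, never at the value. */
    public static final Comparator<Cell> KEY_ONLY =
            Comparator.comparing((Cell c) -> c.row)
                      .thenComparing(c -> c.column)
                      .thenComparingLong(c -> c.timestamp);

    /** Adds every cell to a TreeSet; cells whose keys compare equal collapse to one. */
    public static TreeSet<Cell> collapse(Cell... cells) {
        TreeSet<Cell> set = new TreeSet<>(KEY_ONLY);
        for (Cell c : cells) {
            set.add(c); // returns false and keeps the existing element on a key match
        }
        return set;
    }

    public static void main(String[] args) {
        Cell a = new Cell("row1", "cf:q", 100L, "v1");
        Cell b = new Cell("row1", "cf:q", 100L, "v2"); // same key, different value
        Cell c = new Cell("row1", "cf:q", 101L, "v3"); // different timestamp

        System.out.println(collapse(a, b).size());
        System.out.println(collapse(a, b).first().value);
        System.out.println(collapse(a, b, c).size());
    }
}
```

In this sketch, a and b collapse into a single element (the first one added is kept), while c survives because its timestamp differs.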

Thanks.

-- Lars

________________________________
From: Laxman <[email protected]>
To: [email protected]; [email protected] 
Sent: Monday, March 12, 2012 8:17 AM
Subject: Bulkload discards duplicates

In our test, we noticed that bulkload is discarding duplicates.
On further analysis, I noticed that duplicates are discarded only when
they exist in the same input file and in the same split.
I think this is a bug and not intentional behavior.

The use of TreeSet in the code snippet below is causing the issue.

PutSortReducer.reduce()
======================
      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
      long curSize = 0;
      // stop at the end or the RAM threshold
      while (iter.hasNext() && curSize < threshold) {
        Put p = iter.next();
        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
          for (KeyValue kv : kvs) {
            map.add(kv);
            curSize += kv.getLength();
          }
        }
      }

Changing this back to a List and then sorting explicitly will solve the issue.
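
A minimal sketch of the List-and-sort idea, for comparison with the TreeSet behavior. It is not the HBase code: Cell, KEY_ONLY, and sortKeepingDuplicates are hypothetical names, and the comparator is a simplified key-only ordering assumed for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortedListKeepsDuplicates {
    /** Hypothetical stand-in for a KeyValue: three key fields plus a value. */
    public static final class Cell {
        public final String row;
        public final String column;
        public final long timestamp;
        public final String value;

        public Cell(String row, String column, long timestamp, String value) {
            this.row = row;
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Simplified comparator (assumption): orders by key fields, ignores the value. */
    public static final Comparator<Cell> KEY_ONLY =
            Comparator.comparing((Cell c) -> c.row)
                      .thenComparing(c -> c.column)
                      .thenComparingLong(c -> c.timestamp);

    /** Copies the input into a list and sorts it; no element is ever dropped. */
    public static List<Cell> sortKeepingDuplicates(List<Cell> input) {
        List<Cell> sorted = new ArrayList<>(input);
        sorted.sort(KEY_ONLY); // stable sort: equal keys keep their input order
        return sorted;
    }

    public static void main(String[] args) {
        List<Cell> input = Arrays.asList(
                new Cell("row2", "cf:q", 100L, "x"),
                new Cell("row1", "cf:q", 100L, "v1"),
                new Cell("row1", "cf:q", 100L, "v2")); // same key as the previous cell

        for (Cell c : sortKeepingDuplicates(input)) {
            System.out.println(c.row + " " + c.column + " " + c.timestamp + " " + c.value);
        }
    }
}
```

Here all three cells remain after sorting, and the two cells with identical keys stay in their original relative order because List.sort is stable.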

I have filed a new JIRA for this:
https://issues.apache.org/jira/browse/HBASE-5564
--
Regards,
Laxman
