[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233417#comment-13233417
 ] 

Anoop Sam John commented on HBASE-5564:
---------------------------------------

Comment from Jesse Yates
{quote}
The question is, if you have a TSV file with the same row key, which value 
should be considered the most recent version? Should any of them - maybe that 
is actually a problem and we want to have a warning/error when that occurs?
{quote}

Do we need to handle this? The issue is the TreeSet used by PutSortReducer and 
KeyValueSortReducer, as mentioned by Laxman. 
In normal data insertion using Puts, all the duplicate values go into the 
memstore (and finally into HFiles), and a scan retrieves the last one entered. 
In the bulk load case only the first entry is inserted, because the data 
structure (the TreeSet) discards duplicates. Is this a behaviour mismatch? It 
depends on which entry in the TSV file should be considered the most recent 
version. If we say that the last entry in the file is the most recent version.....
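
To make the TreeSet behaviour concrete, here is a minimal sketch in plain Java. 
It does not use any HBase classes: Cell, KEY_ONLY and survivingValues are 
hypothetical names for illustration, and the comparator assumes that two 
entries with the same row key and timestamp compare as equal regardless of 
their value. Under that assumption the set keeps only the first entry added.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class TreeSetDuplicateDemo {

    // Hypothetical stand-in for a KeyValue: row key, timestamp and value.
    public static final class Cell {
        final String row;
        final long timestamp;
        final String value;

        public Cell(String row, long timestamp, String value) {
            this.row = row;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Assumption: ordering looks at row and timestamp only, never at the value.
    static final Comparator<Cell> KEY_ONLY =
            Comparator.comparing((Cell c) -> c.row)
                      .thenComparingLong(c -> c.timestamp);

    // Adds every cell to a TreeSet in input order and returns the values
    // that remain, in sorted key order.
    public static List<String> survivingValues(List<Cell> input) {
        TreeSet<Cell> sorted = new TreeSet<>(KEY_ONLY);
        for (Cell c : input) {
            // add() leaves the set unchanged and returns false when an
            // element that compares equal is already present.
            sorted.add(c);
        }
        List<String> out = new ArrayList<>();
        for (Cell c : sorted) {
            out.add(c.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Cell> input = Arrays.asList(
                new Cell("row1", 1L, "first"),
                new Cell("row1", 1L, "second"),  // same key as the entry above
                new Cell("row2", 1L, "other"));
        System.out.println(survivingValues(input));  // prints [first, other]
    }
}
```

With Puts the "second" value would be the one returned by a scan, so the two 
paths disagree whenever the input contains entries with identical keys.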

                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>         Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
> HBASE-5564_trunk.patch
>
>
> Duplicate records are discarded when they exist in the same input file, and 
> more specifically when they exist in the same split.
> Duplicate records are considered if the records are from different splits.
> Version under test: HBase 0.92
