[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227678#comment-13227678 ]

Laxman commented on HBASE-5564:
-------------------------------

I tested again with the proposed patch.
> > Changing this back to List and then sort explicitly will solve the issue.

The same problem still persists, which makes this issue a bit more complicated.
I think using the same timestamp for all records in a split is causing the issue.

Currently in the code:
a) If configured, we use a static timestamp for all mappers.
b) If not configured, we use the current system time, generated once for each split.

TsvImporterMapper.doSetup
====================
{code}
ts = conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, System.currentTimeMillis());
{code}
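
For context, a minimal sketch (my own illustration, not from the patch) of why a shared timestamp drops the duplicates: KeyValue.COMPARATOR compares only the key portion (row, family, qualifier, timestamp, type) and ignores the value, so two records that differ only in their value collapse to one entry when collected into a sorted set. The row/family/qualifier names below are made up.
{code}
import java.util.TreeSet;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class DuplicateKeyIllustration {
  public static void main(String[] args) {
    // Both records map to the same row/family/qualifier and share the
    // split-wide timestamp assigned in doSetup().
    long ts = System.currentTimeMillis();
    KeyValue kv1 = new KeyValue(Bytes.toBytes("row1"), Bytes.toBytes("d"),
        Bytes.toBytes("col1"), ts, Bytes.toBytes("value-A"));
    KeyValue kv2 = new KeyValue(Bytes.toBytes("row1"), Bytes.toBytes("d"),
        Bytes.toBytes("col1"), ts, Bytes.toBytes("value-B"));

    // The comparator ignores the value, so the second add() is rejected and
    // one of the duplicate records is silently discarded.
    TreeSet<KeyValue> sorted = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
    sorted.add(kv1);
    sorted.add(kv2);
    System.out.println(sorted.size()); // prints 1, not 2
  }
}
{code}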

Should we think of an approach to generate a unique sequence number and use it 
as a timestamp?
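
Something along these lines, for example (just a sketch of the idea, not a patch; the tsSequence counter and the variable names in map() are made up):
{code}
// in doSetup(): keep the base timestamp, but reset a per-mapper counter
ts = conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, System.currentTimeMillis());
tsSequence = 0;

// in map(), when emitting the Put for a parsed line
long recordTs = ts + tsSequence++;   // unique within this mapper/split
Put put = new Put(rowKeyBytes);
put.add(family, qualifier, recordTs, valueBytes);
{code}
Duplicates that land in different mappers would still rely on the per-split base timestamps being different, so this only fixes collisions inside a single split.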

Any other thoughts?
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are getting discarded when duplicates exist in the same 
> input file, and more specifically when they exist in the same split.
> Duplicate records are retained only if they come from different splits.
> Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
