[
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228212#comment-13228212
]
Laxman commented on HBASE-5564:
-------------------------------
bq. ts++, or ts--, could be an option?
ts++ or ts-- will not solve this problem, because each mapper spawns a new
JVM, so ts is reset to its initial value in every task. There is still a
chance of a ts collision.
bq. that the timestamps are all identical. The whole point is that, in a
bulk-load-only workflow, you can identify each bulk load exactly, and correlate
it to the MR job that inserted it.
No, Todd. At least the current implementation does not match that expected
behavior: a new timestamp is generated for each map task (i.e., for each
split) in TsvImporterMapper.doSetup, so the timestamps are not identical
across the whole bulk load.
Please check my previous comments.
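To illustrate the point above: a minimal, self-contained sketch (names and structure are illustrative, not the actual TsvImporterMapper source) of why duplicates within one split collide. Each task picks one timestamp in its setup, so two duplicate rows in the same split produce the same (row, column, ts) cell key and only one version survives:

```java
import java.util.HashSet;
import java.util.Set;

public class Main {
    // Hypothetical stand-in for TsvImporterMapper.doSetup:
    // called once per map task (per split), in a fresh JVM.
    static long doSetup() {
        return System.currentTimeMillis();
    }

    public static void main(String[] args) {
        long ts = doSetup(); // one timestamp for the entire split

        // Two duplicate records from the SAME split collide on
        // (row, family:qualifier, ts); a set models the cell key space.
        Set<String> cells = new HashSet<>();
        String[] rows = {"row1", "row1"}; // duplicates in one split
        for (String row : rows) {
            cells.add(row + "/emp:name/" + ts);
        }
        // Only one cell is kept; the duplicate is discarded.
        System.out.println("distinct cells kept: " + cells.size());
    }
}
```

A second map task would call doSetup again and get a different ts, which is why duplicates that happen to land in different splits survive.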
bq. So this is only about ImportTsv? Should change the title in that case.
I'm not aware of any other tools that come under bulkload. The bulkload
documentation talks only about importtsv.
http://hbase.apache.org/bulk-loads.html
But if you feel we should change the title, feel free to modify it.
bq. If you want to use custom timestamps, you should specify a timestamp column
in your data, or write your own MR job (ImportTsv is just an example which is
useful for some cases, but for anything advanced I would expect users to write
their own code)
I think we can provide a way to specify the timestamp column (like the
ROWKEY column) as an argument.
Example: importtsv.columns='HBASE_ROW_KEY,HBASE_TS_KEY,
emp:name,emp:sal,dept:code'
This makes importtsv more usable. Otherwise, the user has to copy the entire
importtsv code just to make this minor modification.
Please let me know your suggestions on this.
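The proposal above could be sketched roughly as follows. This is a hypothetical parser, not the actual ImportTsv implementation: a HBASE_TS_KEY marker in the columns spec names the TSV field that carries each record's timestamp, so duplicates can be given distinct user-supplied timestamps:

```java
public class Main {
    // Hypothetical helper: find the position of a marker column
    // (e.g. HBASE_TS_KEY) in the importtsv.columns spec.
    static int findIndex(String columnsSpec, String marker) {
        String[] cols = columnsSpec.split(",");
        for (int i = 0; i < cols.length; i++) {
            if (cols[i].trim().equals(marker)) {
                return i;
            }
        }
        return -1; // marker not present
    }

    public static void main(String[] args) {
        String spec = "HBASE_ROW_KEY,HBASE_TS_KEY,emp:name,emp:sal,dept:code";
        int tsIndex = findIndex(spec, "HBASE_TS_KEY");

        // The timestamp is read per record from the TSV line itself,
        // instead of being generated once per map task.
        String[] fields = "row1\t1331700000000\tjohn\t1000\tD1".split("\t");
        long ts = Long.parseLong(fields[tsIndex]);
        System.out.println("tsIndex=" + tsIndex + " ts=" + ts);
    }
}
```

With per-record timestamps, two duplicate rows in the same split no longer collide on an identical ts unless the user's data itself repeats the timestamp.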
> Bulkload is discarding duplicate records
> ----------------------------------------
>
> Key: HBASE-5564
> URL: https://issues.apache.org/jira/browse/HBASE-5564
> Project: HBase
> Issue Type: Bug
> Components: mapreduce
> Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
> Environment: HBase 0.92
> Reporter: Laxman
> Assignee: Laxman
> Labels: bulkloader
>
> Duplicate records are getting discarded when duplicates exist in the same
> input file, and more specifically when they exist in the same split.
> Duplicate records are preserved only if the records come from different
> splits.
> Version under test: HBase 0.92
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira