[
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228212#comment-13228212
]
Laxman commented on HBASE-5564:
-------------------------------
bq. ts++, or ts--, could be an option?
ts++ or ts-- will not solve this problem, because each mapper spawns a new
JVM, so ts is reset to its initial value in every task. There is still a
chance of a ts collision.
bq. that the timestamps are all identical. The whole point is that, in a
bulk-load-only workflow, you can identify each bulk load exactly, and correlate
it to the MR job that inserted it.
No, Todd. At least the current implementation does not match that expected
behavior: a new timestamp is generated for each map task (i.e., for each
split) in TsvImporterMapper.doSetup, so the timestamps are not identical
across the whole bulk load.
Please check my previous comments.
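To illustrate the point above: a minimal, self-contained sketch (names and structure are illustrative, not the actual TsvImporterMapper source) of why duplicates within one split collide. Each task picks one timestamp in its setup, so two duplicate rows in the same split produce the same (row, column, ts) cell key and only one version survives:

```java
import java.util.HashSet;
import java.util.Set;

public class Main {
    // Hypothetical stand-in for TsvImporterMapper.doSetup:
    // called once per map task (per split), in a fresh JVM.
    static long doSetup() {
        return System.currentTimeMillis();
    }

    public static void main(String[] args) {
        long ts = doSetup(); // one timestamp for the entire split

        // Two duplicate records from the SAME split collide on
        // (row, family:qualifier, ts); a set models the cell key space.
        Set<String> cells = new HashSet<>();
        String[] rows = {"row1", "row1"}; // duplicates in one split
        for (String row : rows) {
            cells.add(row + "/emp:name/" + ts);
        }
        // Only one cell is kept; the duplicate is discarded.
        System.out.println("distinct cells kept: " + cells.size());
    }
}
```

A second map task would call doSetup again and get a different ts, which is why duplicates that happen to land in different splits survive.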
bq. So this is only about ImportTsv? Should change the title in that case.
I'm not aware of any other tools that come under bulkload. The bulkload
documentation talks only about importtsv.
http://hbase.apache.org/bulk-loads.html
But if you feel we should change the title, feel free to modify it.
bq. If you want to use custom timestamps, you should specify a timestamp column
in your data, or write your own MR job (ImportTsv is just an example which is
useful for some cases, but for anything advanced I would expect users to write
their own code)
I think we can provide a way to specify the timestamp column (like the
ROWKEY column) as an argument.
Example: importtsv.columns='HBASE_ROW_KEY,HBASE_TS_KEY,
emp:name,emp:sal,dept:code'
This makes importtsv more usable. Otherwise, the user has to copy the entire
importtsv code just to make this minor modification.
Please let me know your suggestions on this.
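The proposal above could be sketched roughly as follows. This is a hypothetical parser, not the actual ImportTsv implementation: a HBASE_TS_KEY marker in the columns spec names the TSV field that carries each record's timestamp, so duplicates can be given distinct user-supplied timestamps:

```java
public class Main {
    // Hypothetical helper: find the position of a marker column
    // (e.g. HBASE_TS_KEY) in the importtsv.columns spec.
    static int findIndex(String columnsSpec, String marker) {
        String[] cols = columnsSpec.split(",");
        for (int i = 0; i < cols.length; i++) {
            if (cols[i].trim().equals(marker)) {
                return i;
            }
        }
        return -1; // marker not present
    }

    public static void main(String[] args) {
        String spec = "HBASE_ROW_KEY,HBASE_TS_KEY,emp:name,emp:sal,dept:code";
        int tsIndex = findIndex(spec, "HBASE_TS_KEY");

        // The timestamp is read per record from the TSV line itself,
        // instead of being generated once per map task.
        String[] fields = "row1\t1331700000000\tjohn\t1000\tD1".split("\t");
        long ts = Long.parseLong(fields[tsIndex]);
        System.out.println("tsIndex=" + tsIndex + " ts=" + ts);
    }
}
```

With per-record timestamps, two duplicate rows in the same split no longer collide on an identical ts unless the user's data itself repeats the timestamp.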
> Bulkload is discarding duplicate records
> ----------------------------------------
>
> Key: HBASE-5564
> URL: https://issues.apache.org/jira/browse/HBASE-5564
> Project: HBase
> Issue Type: Bug
> Components: mapreduce
> Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
> Environment: HBase 0.92
> Reporter: Laxman
> Assignee: Laxman
> Labels: bulkloader
>
> Duplicate records are getting discarded when duplicates exist in the same
> input file, and more specifically when they exist in the same split.
> Duplicate records are preserved only if the records come from different
> splits.
> Version under test: HBase 0.92
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira