[jira] [Updated] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

Apekshit Sharma (JIRA) Fri, 15 May 2015 18:18:58 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Apekshit Sharma updated HBASE-13702:
------------------------------------
    Description: 
The ImportTsv job skips bad records by default (while keeping a count of them). 
-Dimporttsv.skip.bad.lines=false can be used to fail if a bad row is 
encountered. 
Being able to easily determine which rows in an input are corrupted, rather 
than failing on one row at a time, seems like a good feature to have.
Moreover, tools like this should have 'dry-run' functionality, which does a 
quick run of the tool without making any changes while still reporting any 
errors/warnings and success/failure.

To identify corrupted rows, simply logging them should be enough. In the worst 
case, all rows will be logged and the size of the logs will be the same as the 
input size, which seems fine. However, the user might have to do some work 
figuring out where the logs are. Is there some link we can show to the user 
when the tool starts which can help them with that?
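
As a rough illustration of the logging idea, here is a minimal sketch in plain 
Java. It does not use any HBase or MapReduce classes; BadRowLogger, isBadLine 
and findBadLines are hypothetical names, and "bad" is assumed to mean only 
"wrong number of tab-separated fields", which is narrower than what the real 
parser checks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase: reports every bad line in one pass
// instead of failing on the first one.
class BadRowLogger {

    // Assumption: a line is "bad" when its tab-separated field count differs
    // from the expected column count. The limit of -1 keeps trailing empty
    // fields, so "a\tb\t" counts as three fields.
    static boolean isBadLine(String line, int expectedColumns) {
        return line.split("\t", -1).length != expectedColumns;
    }

    // Returns one entry per bad line, prefixed with its 1-based line number.
    static List<String> findBadLines(List<String> lines, int expectedColumns) {
        List<String> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (isBadLine(lines.get(i), expectedColumns)) {
                bad.add("line " + (i + 1) + ": " + lines.get(i));
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("row1\tv1\tv2", "row2\tv1", "row3\tv1\tv2\tv3");
        for (String entry : findBadLines(input, 3)) {
            // Stand-in for writing to the task log.
            System.err.println("Bad row, " + entry);
        }
    }
}
```

In the worst case this produces one log entry per input line, which matches 
the size estimate above.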

For the dry run, we can simply use if-else to skip over creating the table, 
writing out KVs, and other mutations.
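
The if-else approach could look roughly like the sketch below. This is plain 
Java with no HBase calls: DryRunImport and run are hypothetical names, the 
sink list stands in for the table, and the field-count check is an assumed 
stand-in for the real row validation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the ImportTsv implementation: validation always
// runs, and the write is guarded by a dryRun flag.
class DryRunImport {

    // Returns {goodCount, badCount}. Valid rows are added to sink only when
    // dryRun is false; sink is a stand-in for the target table.
    static int[] run(List<String> lines, int expectedColumns, boolean dryRun, List<String> sink) {
        int good = 0;
        int bad = 0;
        for (String line : lines) {
            // Assumed validity check: field count must match.
            if (line.split("\t", -1).length != expectedColumns) {
                bad++;
                System.err.println("Bad row: " + line);
                continue;
            }
            good++;
            if (!dryRun) {
                // The only mutation; skipped entirely in a dry run.
                sink.add(line);
            }
        }
        return new int[] {good, bad};
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("r1\ta", "bad", "r2\tb");
        List<String> sink = new ArrayList<>();
        int[] counts = run(input, 2, true, sink);
        System.out.println("good=" + counts[0] + " bad=" + counts[1] + " written=" + sink.size());
    }
}
```

Because the counters are updated on both paths, a dry run reports the same 
good/bad totals as a real run would.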

  was:
ImportTSV job skips bad records by default (keeps a count though). 
-Dimporttsv.skip.bad.lines=false can be used to fail if a bad row is 
encountered. 
To be easily able to determine which rows are corrupted in an input, rather 
than failing on one row at a time seems like a good feature to have.
Moreover, there should be 'dry-run' functionality in such kinds of tools, which 
can essentially does a quick run of tool without making any changes but 
reporting any errors/warnings and success/failure.

To identify corrupted rows, simply logging them should be enough. In worst 
case, all rows will be logged and size of logs will be same as input size, 
which seems fine. However, user might have to do some work figuring out where 
the logs. If there some link we can show the user in the starting which can 
help them with that?

For the dry run, we can simply use if-else to skip over creating table, writing 
out KVs, etc.


> ImportTsv: Add dry-run functionality and log bad rows
> -----------------------------------------------------
>
>                 Key: HBASE-13702
>                 URL: https://issues.apache.org/jira/browse/HBASE-13702
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Apekshit Sharma



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
