[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-21 Thread Apekshit Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595416#comment-14595416
 ] 

Apekshit Sharma commented on HBASE-13702:
-

Hey guys, I'd really like to close this one since it's almost there. Requesting 
reviews. Thanks.

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
> Issue Type: New Feature
> Reporter: Apekshit Sharma
> Assignee: Apekshit Sharma
> Attachments: HBASE-13702.patch
>
>
> By default, the ImportTsv job skips bad records (though it keeps a count of 
> them). -Dimporttsv.skip.bad.lines=false can be used to fail the job when a 
> bad row is encountered.
> Being able to easily determine which rows in an input are corrupted, rather 
> than failing on one row at a time, seems like a good feature to have.
> Moreover, tools of this kind should have 'dry-run' functionality, which 
> does a quick run of the tool without making any changes while still 
> reporting errors/warnings and success/failure.
> To identify corrupted rows, simply logging them should be enough. In the 
> worst case, all rows will be logged and the logs will be the same size as 
> the input, which seems fine. However, the user might have to do some work 
> figuring out where the logs are. Is there some link we can show to the user 
> when the tool starts that can help them with that?
> For the dry run, we can simply use if-else to skip over writing out KVs and 
> any other mutations, if present.
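
The idea in the description (log every bad row, and guard the write with an if-else when doing a dry run) can be sketched roughly as follows. This is only an illustration and not the code from the attached patch: the class DryRunImportSketch, the RowSink interface, and the rule that a row is "bad" when its column count is wrong are all assumptions made for the example, and the list of bad lines stands in for a real logger.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, dependency-free sketch; not the actual ImportTsv code. */
final class DryRunImportSketch {
  /** Stand-in for whatever receives parsed rows (for example, a table writer). */
  interface RowSink {
    void write(String[] columns);
  }

  /** Stand-in for a log: every bad line is recorded instead of only counted. */
  final List<String> badLines = new ArrayList<>();
  /** Number of rows actually handed to the sink. */
  int written = 0;

  void run(List<String> lines, int expectedColumns, boolean dryRun, RowSink sink) {
    for (String line : lines) {
      // Limit -1 keeps trailing empty columns, so the count is not understated.
      String[] columns = line.split("\t", -1);
      if (columns.length != expectedColumns) {
        // Assumption: a wrong column count is what makes a row "bad" here.
        badLines.add(line);
        continue;
      }
      // The if-else from the description: parse and validate always, write only
      // when this is not a dry run.
      if (!dryRun) {
        sink.write(columns);
        written++;
      }
    }
  }
}
```

Running it once with dryRun set to true reports all bad lines without touching the sink; running it again with dryRun set to false performs the writes.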



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596876#comment-14596876
 ] 

Hadoop QA commented on HBASE-13702:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12741132/HBASE-13702-v2.patch
  against master branch at commit d51a184051d968dc3bdc00b1c9257c0a9e5ff8a6.
  ATTACHMENT ID: 12741132

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1943 checkstyle errors (more than the master's current 1942 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

{color:red}-1 core tests{color}.  The patch failed these unit tests:
  org.apache.hadoop.hbase.wal.TestBoundedRegionGroupingProvider
  org.apache.hadoop.hbase.master.TestDistributedLogSplitting

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14508//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14508//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14508//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14508//console

This message is automatically generated.

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
> Issue Type: New Feature
> Reporter: Apekshit Sharma
> Assignee: Apekshit Sharma
> Attachments: HBASE-13702-v2.patch, HBASE-13702-v3.patch, 
> HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596971#comment-14596971
 ] 

Hadoop QA commented on HBASE-13702:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12741147/HBASE-13702-v3.patch
  against master branch at commit d51a184051d968dc3bdc00b1c9257c0a9e5ff8a6.
  ATTACHMENT ID: 12741147

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1943 checkstyle errors (more than the master's current 1942 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

{color:red}-1 core tests{color}.  The patch failed these unit tests:
  org.apache.hadoop.hbase.util.TestProcessBasedCluster
  org.apache.hadoop.hbase.mapreduce.TestImportExport
  org.apache.hadoop.hbase.TestRegionRebalancing

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14509//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14509//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14509//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14509//console

This message is automatically generated.

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
> Issue Type: New Feature
> Reporter: Apekshit Sharma
> Assignee: Apekshit Sharma
> Attachments: HBASE-13702-v2.patch, HBASE-13702-v3.patch, 
> HBASE-13702-v4.patch, HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597117#comment-14597117
 ] 

Hadoop QA commented on HBASE-13702:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12741191/HBASE-13702-v4.patch
  against master branch at commit d51a184051d968dc3bdc00b1c9257c0a9e5ff8a6.
  ATTACHMENT ID: 12741191

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

{color:red}-1 core tests{color}.  The patch failed these unit tests:
  org.apache.hadoop.hbase.util.TestProcessBasedCluster
  org.apache.hadoop.hbase.mapreduce.TestImportExport

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14517//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14517//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14517//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14517//console

This message is automatically generated.



[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-23 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598457#comment-14598457
 ] 

Ted Yu commented on HBASE-13702:


{code}
  dryRunTableCreated = true;
}
{code}
The flag is set before calling createTable(). Suggest setting the flag 
after the call to createTable() returns.
For deleteTable():
{code}
  admin.deleteTable(tableName);
} catch (Exception e) {
{code}
Catching IOException should suffice, right?
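
To make the two suggestions concrete, here is a rough, self-contained sketch. It does not reproduce the patch: the TableAdmin interface is a hypothetical stand-in for the admin object in the quoted snippets, and the class and method names (DryRunTableSketch, createForDryRun, cleanUp) are made up for illustration. The only points it tries to show are that the flag is set after createTable() has returned, and that the cleanup catches IOException rather than Exception.

```java
import java.io.IOException;

/** Hypothetical sketch of the review suggestions; not the actual ImportTsv code. */
final class DryRunTableSketch {
  /** Stand-in for the admin calls quoted in the review. */
  interface TableAdmin {
    void createTable(String tableName) throws IOException;
    void deleteTable(String tableName) throws IOException;
  }

  private final TableAdmin admin;
  boolean dryRunTableCreated = false;

  DryRunTableSketch(TableAdmin admin) {
    this.admin = admin;
  }

  void createForDryRun(String tableName) throws IOException {
    admin.createTable(tableName);
    // Set the flag only after createTable() returned without throwing, so a
    // failed create never triggers a delete of a table that does not exist.
    dryRunTableCreated = true;
  }

  /** Returns true if nothing needed deleting or the delete succeeded. */
  boolean cleanUp(String tableName) {
    if (!dryRunTableCreated) {
      return true;
    }
    try {
      admin.deleteTable(tableName);
      dryRunTableCreated = false;
      return true;
    } catch (IOException e) {
      // Narrow catch: unrelated runtime exceptions are not swallowed here.
      return false;
    }
  }
}
```

With the flag set after the call, a createTable() that throws leaves dryRunTableCreated false, so cleanUp() becomes a no-op instead of attempting to delete a table that was never created.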



[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-25 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601647#comment-14601647
 ] 

Ted Yu commented on HBASE-13702:


+1 if tests pass.

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
> Issue Type: New Feature
> Reporter: Apekshit Sharma
> Assignee: Apekshit Sharma
> Fix For: 2.0.0, 1.3.0
>
> Attachments: HBASE-13702-v2.patch, HBASE-13702-v3.patch, 
> HBASE-13702-v4.patch, HBASE-13702-v5.patch, HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-25 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601667#comment-14601667
 ] 

Ted Yu commented on HBASE-13702:


Several hunks in 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java 
do not apply to branch-1.

Mind providing a patch for branch-1?

Thanks



[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601878#comment-14601878
 ] 

Hadoop QA commented on HBASE-13702:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12741902/HBASE-13702-v5.patch
  against master branch at commit edef3d64bce41fffbc5649ffa19b2cf80ce28d7a.
  ATTACHMENT ID: 12741902

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

{color:red}-1 core tests{color}.  The patch failed these unit tests:
  org.apache.hadoop.hbase.util.TestHBaseFsck
  org.apache.hadoop.hbase.TestRegionRebalancing

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14574//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14574//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14574//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14574//console

This message is automatically generated.



[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602154#comment-14602154
 ] 

Hudson commented on HBASE-13702:


FAILURE: Integrated in HBase-TRUNK #6603 (See 
[https://builds.apache.org/job/HBase-TRUNK/6603/])
HBASE-13702 ImportTsv: Add dry-run functionality and log bad rows (Apekshit 
Sharma) (tedyu: rev e6ed79219966ce0dac3bc748261fce9478aa7550)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterTextMapper.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* hbase-it/src/test/java/org/apache/hadoop/hbase/mapreduce/IntegrationTestImportTsv.java




[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609435#comment-14609435
 ] 

Hadoop QA commented on HBASE-13702:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12742990/HBASE-13702-branch-1.patch
  against branch-1 branch at commit 85c278a6a8b25ff86e22c254ffec35e945cd7c66.
  ATTACHMENT ID: 12742990

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified tests.

{color:red}-1 javac{color}.  The patch appears to cause mvn compile goal to 
fail with Hadoop version 2.4.0.

Compilation errors resume:
[ERROR] COMPILATION ERROR : 
[ERROR] 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-it/src/test/java/org/apache/hadoop/hbase/mapreduce/IntegrationTestImportTsv.java:[148,18]
 error: cannot find symbol
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:2.5.1:testCompile 
(default-testCompile) on project hbase-it: Compilation failure
[ERROR] 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-it/src/test/java/org/apache/hadoop/hbase/mapreduce/IntegrationTestImportTsv.java:[148,18]
 error: cannot find symbol
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :hbase-it


Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14634//console

This message is automatically generated.

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
> Issue Type: New Feature
> Reporter: Apekshit Sharma
> Assignee: Apekshit Sharma
> Fix For: 2.0.0, 1.3.0
>
> Attachments: HBASE-13702-branch-1.patch, HBASE-13702-v2.patch, 
> HBASE-13702-v3.patch, HBASE-13702-v4.patch, HBASE-13702-v5.patch, 
> HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-30 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609440#comment-14609440
 ] 

Ted Yu commented on HBASE-13702:


Please fix the compilation error against Hadoop 2.4.



[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609565#comment-14609565
 ] 

Hadoop QA commented on HBASE-13702:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12743000/HBASE-13702-branch-1-v2.patch
  against branch-1 branch at commit 85c278a6a8b25ff86e22c254ffec35e945cd7c66.
  ATTACHMENT ID: 12743000

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestImportTsv

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14636//testReport/
Release Findbugs (version 2.0.3)warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14636//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14636//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14636//console

This message is automatically generated.

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
>  Issue Type: New Feature
>Reporter: Apekshit Sharma
>Assignee: Apekshit Sharma
> Fix For: 2.0.0, 1.3.0
>
> Attachments: HBASE-13702-branch-1-v2.patch, 
> HBASE-13702-branch-1.patch, HBASE-13702-v2.patch, HBASE-13702-v3.patch, 
> HBASE-13702-v4.patch, HBASE-13702-v5.patch, HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-07-02 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612781#comment-14612781
 ] 

Ted Yu commented on HBASE-13702:


Waiting for Jenkins to come back so that QA can test the patch.

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
>  Issue Type: New Feature
>Reporter: Apekshit Sharma
>Assignee: Apekshit Sharma
> Fix For: 2.0.0, 1.3.0
>
> Attachments: HBASE-13702-branch-1-v2.patch, 
> HBASE-13702-branch-1-v3.patch, HBASE-13702-branch-1.patch, 
> HBASE-13702-v2.patch, HBASE-13702-v3.patch, HBASE-13702-v4.patch, 
> HBASE-13702-v5.patch, HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-07-03 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613400#comment-14613400
 ] 

Ted Yu commented on HBASE-13702:


TestImportTsv passed with patch v3.

Nice job, Apekshit.

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
>  Issue Type: New Feature
>Reporter: Apekshit Sharma
>Assignee: Apekshit Sharma
> Fix For: 2.0.0, 1.3.0
>
> Attachments: HBASE-13702-branch-1-v2.patch, 
> HBASE-13702-branch-1-v3.patch, HBASE-13702-branch-1.patch, 
> HBASE-13702-v2.patch, HBASE-13702-v3.patch, HBASE-13702-v4.patch, 
> HBASE-13702-v5.patch, HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-07-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613697#comment-14613697
 ] 

Hadoop QA commented on HBASE-13702:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12743436/HBASE-13702-branch-1-v3.patch
  against branch-1 branch at commit e640f1e76af8f32015f475629610da127897f01e.
  ATTACHMENT ID: 12743436

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

 {color:red}-1 core zombie tests{color}.  There are 5 zombie test(s):   
at 
org.apache.hadoop.hbase.regionserver.wal.TestWALReplay.testReplayEditsWrittenViaHRegion(TestWALReplay.java:538)
at 
org.apache.hadoop.hbase.regionserver.wal.TestWALReplay.testReplayEditsWrittenIntoWAL(TestWALReplay.java:827)
at 
org.apache.hadoop.hbase.regionserver.TestRegionReplicaFailover.testSecondaryRegionKillWhilePrimaryIsAcceptingWrites(TestRegionReplicaFailover.java:333)

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14660//testReport/
Release Findbugs (version 2.0.3)warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14660//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14660//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14660//console

This message is automatically generated.

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
>  Issue Type: New Feature
>Reporter: Apekshit Sharma
>Assignee: Apekshit Sharma
> Fix For: 2.0.0, 1.3.0
>
> Attachments: HBASE-13702-branch-1-v2.patch, 
> HBASE-13702-branch-1-v3.patch, HBASE-13702-branch-1.patch, 
> HBASE-13702-v2.patch, HBASE-13702-v3.patch, HBASE-13702-v4.patch, 
> HBASE-13702-v5.patch, HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-07-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613961#comment-14613961
 ] 

Hudson commented on HBASE-13702:


SUCCESS: Integrated in HBase-1.3-IT #19 (See 
[https://builds.apache.org/job/HBase-1.3-IT/19/])
HBASE-13702 ImportTsv: Add dry-run functionality and log bad rows (Apekshit 
Sharma) (tedyu: rev 9e54e195f60689bfde26279630f80825214d0219)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterTextMapper.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java
* 
hbase-it/src/test/java/org/apache/hadoop/hbase/mapreduce/IntegrationTestImportTsv.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java


> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
>  Issue Type: New Feature
>Reporter: Apekshit Sharma
>Assignee: Apekshit Sharma
> Fix For: 2.0.0, 1.3.0
>
> Attachments: HBASE-13702-branch-1-v2.patch, 
> HBASE-13702-branch-1-v3.patch, HBASE-13702-branch-1.patch, 
> HBASE-13702-v2.patch, HBASE-13702-v3.patch, HBASE-13702-v4.patch, 
> HBASE-13702-v5.patch, HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-07-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613972#comment-14613972
 ] 

Hudson commented on HBASE-13702:


FAILURE: Integrated in HBase-1.3 #34 (See 
[https://builds.apache.org/job/HBase-1.3/34/])
HBASE-13702 ImportTsv: Add dry-run functionality and log bad rows (Apekshit 
Sharma) (tedyu: rev 9e54e195f60689bfde26279630f80825214d0219)
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterTextMapper.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* 
hbase-it/src/test/java/org/apache/hadoop/hbase/mapreduce/IntegrationTestImportTsv.java


> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
>  Issue Type: New Feature
>Reporter: Apekshit Sharma
>Assignee: Apekshit Sharma
> Fix For: 2.0.0, 1.3.0
>
> Attachments: HBASE-13702-branch-1-v2.patch, 
> HBASE-13702-branch-1-v3.patch, HBASE-13702-branch-1.patch, 
> HBASE-13702-v2.patch, HBASE-13702-v3.patch, HBASE-13702-v4.patch, 
> HBASE-13702-v5.patch, HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-05-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560071#comment-14560071
 ] 

Hadoop QA commented on HBASE-13702:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12735396/HBASE-13702.patch
  against master branch at commit c8c23cc3183735b02e9f43bf7115d9ce3cd2a880.
  ATTACHMENT ID: 12735396

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1920 checkstyle errors (more than the master's current 1919 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14187//testReport/
Release Findbugs (version 2.0.3)warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14187//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14187//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14187//console

This message is automatically generated.

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
>  Issue Type: New Feature
>Reporter: Apekshit Sharma
> Attachments: HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-05-27 Thread Ashish Singhi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560531#comment-14560531
 ] 

Ashish Singhi commented on HBASE-13702:
---

Skimmed the patch, looks fine.
bq. "  -D" + LOG_BAD_LINES_CONF_KEY + "=true - logs invalid lines to stderr\n" +
I think it would be better to capture the bad lines in an HDFS file. This is 
what we have done internally, so that the user can refer to them at a later 
point and so that lines do not get lost if too many are printed on the console.
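
The trade-off between the two destinations can be sketched in plain Java without any Hadoop dependency. This is only an illustration under assumptions: `BadLineSink`, `StderrSink`, `CollectingSink`, and `BadLineLoggingSketch` are hypothetical names that do not appear in the patch, the column-count check is a simplified stand-in for the real parser, and `CollectingSink` merely stands in for a writer that would append to a file on HDFS.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical destination for rejected input lines; not part of the HBASE-13702 patch. */
interface BadLineSink {
  void report(long offset, String line, String reason);
}

/** Mirrors the option quoted above: print each invalid line to stderr. */
final class StderrSink implements BadLineSink {
  @Override
  public void report(long offset, String line, String reason) {
    System.err.println("Bad line at offset " + offset + ": " + line + " (" + reason + ")");
  }
}

/** Stand-in for a file-backed sink; a real one would append to a file on HDFS. */
final class CollectingSink implements BadLineSink {
  final List<String> lines = new ArrayList<>();

  @Override
  public void report(long offset, String line, String reason) {
    lines.add(offset + "\t" + reason + "\t" + line);
  }
}

final class BadLineLoggingSketch {
  /** Checks the tab-separated column count of each line and returns the number of bad lines. */
  static int validate(List<String> input, int expectedColumns, BadLineSink sink) {
    int bad = 0;
    long offset = 0;
    for (String line : input) {
      // A limit of -1 keeps trailing empty columns, so "a\t\t" counts as three columns.
      int columns = line.split("\t", -1).length;
      if (columns != expectedColumns) {
        bad++;
        sink.report(offset, line, "expected " + expectedColumns + " columns, got " + columns);
      }
      // Character offset of the next line (+1 for the newline); equals the byte offset only for ASCII.
      offset += line.length() + 1;
    }
    return bad;
  }
}
```

Passing a `StderrSink` reproduces the behaviour described in the usage string, while a file-backed sink would keep the rejected lines in one place for later reference.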

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
>  Issue Type: New Feature
>Reporter: Apekshit Sharma
>Assignee: Apekshit Sharma
> Attachments: HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-05-27 Thread Apekshit Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561658#comment-14561658
 ] 

Apekshit Sharma commented on HBASE-13702:
-

You're right, it would have been bad if these logs weren't available for 
later reference or if data was lost. But as far as I understand, the logs 
generated by mapper tasks are stored locally in stderr/stdout files, so there 
should be no issue of data loss, and these logs are later available for 
reference from the job history server.
http://blog.cloudera.com/blog/2010/11/hadoop-log-location-and-retention/
http://blog.cloudera.com/blog/2009/09/apache-hadoop-log-files-where-to-find-them-in-cdh-and-what-info-they-contain/

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
>  Issue Type: New Feature
>Reporter: Apekshit Sharma
>Assignee: Apekshit Sharma
> Attachments: HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-05-27 Thread Apekshit Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561670#comment-14561670
 ] 

Apekshit Sharma commented on HBASE-13702:
-

Regardless, what you suggested is a good alternative. I chose not to do it, to 
keep things simple, as the dry-run feature already made the change big enough.
Since you already have it, I'd highly encourage you to push the change upstream.

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
>  Issue Type: New Feature
>Reporter: Apekshit Sharma
>Assignee: Apekshit Sharma
> Attachments: HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-05-28 Thread Bhupendra Kumar Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562591#comment-14562591
 ] 

Bhupendra Kumar Jain commented on HBASE-13702:
--

What is the scope of the dry-run functionality?

As per the current patch, the same map task is executed in dry-run mode, and it 
internally performs various operations (parsing text data, creating the Put 
object, creating Cell objects, tags, etc.). These operations consume extra time 
and are not actually required by the dry-run functionality. I think a dry run 
should finish very fast.

If the scope of the dry run is only to validate the parsing of data, then I 
think it is better to have a new map task for the dry run.
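
As a rough illustration of what a parse-only check could look like, the sketch below counts separators without allocating substrings or any Put/Cell objects. `LightweightTsvCheck` is a hypothetical class, not code from the patch (which reuses the existing map task), and a bare column count is an assumed, simplified notion of a "valid" line.

```java
/** Hypothetical parse-only validator; it builds no row or cell objects. */
final class LightweightTsvCheck {
  /** Returns true if the line has exactly expectedColumns separator-delimited fields. */
  static boolean hasExpectedColumns(String line, char separator, int expectedColumns) {
    int separators = 0;
    for (int i = 0; i < line.length(); i++) {
      if (line.charAt(i) == separator) {
        separators++;
      }
    }
    // N separators delimit N + 1 fields, including empty ones.
    return separators + 1 == expectedColumns;
  }
}
```

Such a check is cheap, but it would not exercise the object construction that a real import performs, which is the trade-off under discussion here.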

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
>  Issue Type: New Feature
>Reporter: Apekshit Sharma
>Assignee: Apekshit Sharma
> Attachments: HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-05-28 Thread Apekshit Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563169#comment-14563169
 ] 

Apekshit Sharma commented on HBASE-13702:
-

The way I see dry-run functionality for ad hoc tools like this one is: "check 
that the tool will run successfully on the given data without making any 
(permanent) change to the system". Ideally, users should get all 
errors/warnings in the dry run, and the actual run should then go like butter, 
instead of getting stuck in a half-committed state where some things went 
through and others didn't (unless that's acceptable).
On the practical side, I am with you if it makes sense to remove some trivial 
logic and that shaves off a huge amount of run time. I don't have practical 
experience with the run times of this tool, but I would guess that any 
processing in the mapper shouldn't take much time compared to the final stage 
of writing Put mutations to the table (non-bulk mode) or HFiles to disk (bulk 
mode), which the dry run already skips. If my assumptions are wrong, please let 
me know.
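
The control flow described above (run the full per-line processing, but skip only the final write) can be sketched as follows. This is a minimal, self-contained illustration, not the patch itself: `DryRunSketch`, `RowWriter`, and `Result` are hypothetical names, and the validity rule (expected column count plus a non-empty row key) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class DryRunSketch {
  /** Counters gathered during one pass over the input. */
  static final class Result {
    final int goodLines;
    final int badLines;
    final int rowsWritten;

    Result(int goodLines, int badLines, int rowsWritten) {
      this.goodLines = goodLines;
      this.badLines = badLines;
      this.rowsWritten = rowsWritten;
    }
  }

  /** Hypothetical stand-in for the table or HFile writer. */
  interface RowWriter {
    void write(String rowKey, List<String> values);
  }

  static Result run(List<String> lines, int expectedColumns, boolean dryRun, RowWriter writer) {
    int good = 0;
    int bad = 0;
    int written = 0;
    for (String line : lines) {
      String[] fields = line.split("\t", -1);
      if (fields.length != expectedColumns || fields[0].isEmpty()) {
        bad++; // counted (and, in a real tool, logged) in both modes
        continue;
      }
      good++;
      // Build the row exactly as a real run would, so the dry run exercises the same code path.
      List<String> values = new ArrayList<>();
      for (int i = 1; i < fields.length; i++) {
        values.add(fields[i]);
      }
      if (!dryRun) {
        // The only step a dry run skips: the permanent change.
        writer.write(fields[0], values);
        written++;
      }
    }
    return new Result(good, bad, written);
  }
}
```

Because validation and row construction are shared, any error a real run would hit before the write also shows up in the dry run; only the side effect differs.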

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
>  Issue Type: New Feature
>Reporter: Apekshit Sharma
>Assignee: Apekshit Sharma
> Attachments: HBASE-13702.patch
>
>


[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-01 Thread Apekshit Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568310#comment-14568310
 ] 

Apekshit Sharma commented on HBASE-13702:
-

Did a few runs on a single-node cluster.
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, single core, hyper threading = 4

Dataset 1: 1000 rows, key length = 20, #columns = 100, column length = 1k
size of input csv file: 100M

Dataset 2: 10000 rows, key length = 20, #columns = 100, column length = 1k
size of input csv file: 1G

*Non Bulk Mode:*

Dataset 1
Dry mode: <1 sec
Non-dry mode: ~4 sec

Dataset 2
Dry mode: ~10s
Non dry mode: ~24 s
num automatic splits: 8

Verified row count after each run.

*Bulk Mode:*

Dataset 2
10000 rows, key length = 20, #columns = 100, column length = 1k
size of input csv file: 1G

dry mode: ~40 sec (table not existent on start, verified no table and output 
dir after run)
non-dry mode: ~60 sec (table not existent on start, verified table and output 
dir exists after run)
num automatic splits: 8

Since the runs are on the order of seconds/minutes, I think we can and should 
test all functionality in dry-run.
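
For a quick sense of proportion, the approximate Dataset 2 timings reported above can be turned into ratios. This is only back-of-the-envelope arithmetic on rounded figures; `DryRunTimingSketch` is a hypothetical helper written for this note.

```java
final class DryRunTimingSketch {
  /** Dry-run time as a rounded percentage of the full-run time. */
  static long dryRunPercent(double drySeconds, double fullSeconds) {
    return Math.round(100.0 * drySeconds / fullSeconds);
  }

  public static void main(String[] args) {
    // Non-bulk mode, Dataset 2: ~10 s dry vs ~24 s full.
    System.out.println("Non-bulk: " + dryRunPercent(10, 24) + "%"); // prints "Non-bulk: 42%"
    // Bulk mode, Dataset 2: ~40 s dry vs ~60 s full.
    System.out.println("Bulk: " + dryRunPercent(40, 60) + "%");     // prints "Bulk: 67%"
  }
}
```

On these numbers the dry run takes roughly 40% to 70% of the full run, so keeping the full map-side logic costs seconds rather than minutes at this data size.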

> ImportTsv: Add dry-run functionality and log bad rows
> -
>
> Key: HBASE-13702
> URL: https://issues.apache.org/jira/browse/HBASE-13702
> Project: HBase
>  Issue Type: New Feature
>Reporter: Apekshit Sharma
>Assignee: Apekshit Sharma
> Attachments: HBASE-13702.patch
>
>
> ImportTSV job skips bad records by default (keeps a count though). 
> -Dimporttsv.skip.bad.lines=false can be used to fail if a bad row is 
> encountered. 
> Being able to easily determine which rows are corrupted in an input, rather 
> than failing on one row at a time, seems like a good feature to have.
> Moreover, there should be 'dry-run' functionality in such kinds of tools, 
> which essentially does a quick run of the tool without making any changes 
> but reports any errors/warnings and success/failure.
> To identify corrupted rows, simply logging them should be enough. In the 
> worst case, all rows will be logged and the size of the logs will be the same 
> as the input size, which seems fine. However, the user might have to do some 
> work figuring out where the logs are. Is there some link we can show to the 
> user when the tool starts which can help them with that?
> For the dry run, we can simply use if-else to skip over writing out KVs, and 
> any other mutations, if present.





[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-03 Thread Bhupendra Kumar Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570867#comment-14570867
 ] 

Bhupendra Kumar Jain commented on HBASE-13702:
--

As per the current patch, dry-run executes only the Map task, so it's useful 
only when the Map task has a lot of extra logic (parsing, validation, 
transformation, etc.). Dry run can execute that logic and output the errors. 

But there might be a lot of logic in the Combiner and Reducer phases too, which 
dry-run will not check. So I think it is better to rename the dry-run function 
to *dry-run-map*. It will be much clearer. 



[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-03 Thread Apekshit Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571581#comment-14571581
 ] 

Apekshit Sharma commented on HBASE-13702:
-

Combiners and reducers in bulk mode are executed in dry-run mode too. 
However, the TableReducer in non-bulk mode is not run in dry-run mode.
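
To illustrate the "if-else" idea from the issue description, here is a minimal, self-contained sketch. It is not the code from the patch and does not use the Hadoop or HBase APIs: the class, fields, and validation rule (fixed field count, non-empty row key) are all hypothetical stand-ins. It shows rows being parsed and validated in both modes, bad rows being recorded instead of failing the run, and the write being skipped when dry-run is on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and the validation rule are illustrative, not from the patch.
final class DryRunImportSketch {
    private final boolean dryRun;
    private final int expectedFields;

    final List<String> badLines = new ArrayList<>();  // stands in for the bad-row log
    final List<String[]> written = new ArrayList<>(); // stands in for the table / output files
    int goodCount = 0;

    DryRunImportSketch(boolean dryRun, int expectedFields) {
        this.dryRun = dryRun;
        this.expectedFields = expectedFields;
    }

    /** Parses one input line; parsing and validation run in both modes. */
    void process(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != expectedFields || fields[0].isEmpty()) {
            // Record the bad row and keep going, so all bad rows surface in a single pass.
            badLines.add(line);
            return;
        }
        goodCount++;
        if (!dryRun) {
            // Only a real run produces output; a dry run leaves nothing behind.
            written.add(fields);
        }
    }

    public static void main(String[] args) {
        DryRunImportSketch dry = new DryRunImportSketch(true, 3);
        for (String line : new String[] {"r1\ta\tb", "missing-columns", "r2\tc\td"}) {
            dry.process(line);
        }
        System.out.println("good=" + dry.goodCount + " bad=" + dry.badLines.size()
                + " written=" + dry.written.size());
    }
}
```

Keeping the validation path identical in both modes is the point of the design: a dry run is only informative if it exercises the same checks as the real import.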



[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-04 Thread Bhupendra Kumar Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572786#comment-14572786
 ] 

Bhupendra Kumar Jain commented on HBASE-13702:
--

Yes, you are right. For bulk mode it's going to run everything.



[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-05 Thread Apekshit Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575315#comment-14575315
 ] 

Apekshit Sharma commented on HBASE-13702:
-

Does the patch look good? I feel like it's ready for commit.




[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-05 Thread Srikanth Srungarapu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575335#comment-14575335
 ] 

Srikanth Srungarapu commented on HBASE-13702:
-

Let's see how much overhead adding the dry-run functionality adds to the 
original code. Could you please also provide timings without the patch for the 
experiments you mentioned in [this 
comment|https://issues.apache.org/jira/browse/HBASE-13702?focusedCommentId=14568310&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14568310]?
 



[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-05 Thread Srikanth Srungarapu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575354#comment-14575354
 ] 

Srikanth Srungarapu commented on HBASE-13702:
-

My bad. It looks like the dry-run checks are done before launching the MR job, 
not within the job itself, so I don't think there will be any tangible perf 
impact. Let me take a deeper look at the patch and get back to you.



[jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows

2015-06-12 Thread Apekshit Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583970#comment-14583970
 ] 

Apekshit Sharma commented on HBASE-13702:
-

Ping for reviews.
