[jira] [Commented] (HBASE-14267) In Mapreduce on HBase scenario, restart in TableInputFormat will result in getting wrong data.

stack (JIRA) Wed, 02 Sep 2015 17:24:07 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-14267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728278#comment-14728278
 ]


stack commented on HBASE-14267:
-------------------------------

 bq. User should know the data can not be modified.

I read through the javadoc and we do not clearly state this.

When you modify the row, you are altering the cached row instance in Result?

  // We're not using java serialization.  Transient here is just a marker to say
  // that this is where we cache row if we're ever asked for it.
  private transient byte [] row = null;

I am unclear on how this plays into the restart of the Scan. Is the modified 
Result row used to calculate where the new Scan restarts?
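
A minimal sketch of the aliasing in question, assuming the record reader keeps 
the very array that getRow() hands out (the quoted TableRecordReaderImpl 
snippet does key.set(value.getRow()) and then lastSuccessfulRow = key.get(), 
which suggests no copy is made). CachedRowHolder is a hypothetical stand-in 
for Result so the example runs without HBase on the classpath; it is not the 
real class.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RowAliasingSketch {

  /** Hypothetical stand-in for Result: clones the row once, then returns the cached array. */
  static final class CachedRowHolder {
    private final byte[] source;
    private byte[] row = null;

    CachedRowHolder(byte[] source) {
      this.source = source;
    }

    byte[] getRow() {
      if (this.row == null) {
        // Cloned on first access only, mirroring the caching in the quoted getRow().
        this.row = Arrays.copyOf(source, source.length);
      }
      return this.row; // every caller receives this same array instance
    }
  }

  /** Reverses the array in place, as the reporter's mapper is described as doing. */
  static void reverseInPlace(byte[] b) {
    for (int i = 0, j = b.length - 1; i < j; i++, j--) {
      byte t = b[i];
      b[i] = b[j];
      b[j] = t;
    }
  }

  /** Returns what the saved restart row looks like after the mapper has run. */
  static String restartRowAfterMapper(String rowKey) {
    CachedRowHolder value = new CachedRowHolder(rowKey.getBytes(StandardCharsets.UTF_8));
    byte[] lastSuccessfulRow = value.getRow(); // reader keeps the reference, no copy
    reverseInPlace(value.getRow());            // mapper mutates the same array
    return new String(lastSuccessfulRow, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    System.out.println(restartRowAfterMapper("row-0042")); // prints "2400-wor"
  }
}
```

If the real classes behave like this stand-in, the restart row is the 
mutated key rather than the key that was actually read.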

Thanks [~qianxiZhang]. Looks like it took some dirty debugging to figure out the issue.

> In Mapreduce on HBase scenario, restart in TableInputFormat will result in 
> getting wrong data.
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14267
>                 URL: https://issues.apache.org/jira/browse/HBASE-14267
>             Project: HBase
>          Issue Type: Bug
>          Components: Client, mapreduce
>            Reporter: Qianxi Zhang
>            Assignee: Qianxi Zhang
>         Attachments: HBASE_14267_trunk_v1.patch
>
>
> When I run a MapReduce job on HBase, I modify the row returned by 
> Result.getRow(), for example by reversing it. Because my program does 
> complicated processing on the data, it takes a long time, and the lease in the 
> region server expires. 
> Result#195
> {code}
>   public byte [] getRow() {
>     if (this.row == null) {
>       this.row = (this.cells == null || this.cells.length == 0) ?
>           null :
>           CellUtil.cloneRow(this.cells[0]);
>     }
>     return this.row;
>   }
> {code}
> TableInputFormat will restart the scan from the last successful row, but that 
> row has been modified, so it reads the wrong data.
> TableRecordReaderImpl#218
> {code}
>       } catch (IOException e) {
>         // do not retry if the exception tells us not to do so
>         if (e instanceof DoNotRetryIOException) {
>           throw e;
>         }
>         // try to handle all other IOExceptions by restarting
>         // the scanner, if the second call fails, it will be rethrown
>         LOG.info("recovered from " + StringUtils.stringifyException(e));
>         if (lastSuccessfulRow == null) {
>           LOG.warn("We are restarting the first next() invocation," +
>               " if your mapper has restarted a few other times like this" +
>               " then you should consider killing this job and investigate" +
>               " why it's taking so long.");
>         }
>         if (lastSuccessfulRow == null) {
>           restart(scan.getStartRow());
>         } else {
>           restart(lastSuccessfulRow);
>           scanner.next();    // skip presumed already mapped row
>         }
>         value = scanner.next();
>         if (value != null && value.isStale()) numStale++;
>         numRestarts++;
>       }
>       if (value != null && value.size() > 0) {
>         key.set(value.getRow());
>         lastSuccessfulRow = key.get();
>         return true;
>       }
> {code}
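
One possible defensive measure, sketched under assumptions: the reader could 
save a private copy of the row instead of the shared array, so later in-place 
edits by the mapper cannot move the restart point. This is not necessarily 
what HBASE_14267_trunk_v1.patch does; RestartRowCopySketch and its method 
names are hypothetical and only model the lastSuccessfulRow bookkeeping.

```java
import java.util.Arrays;

public class RestartRowCopySketch {

  // Private copy of the last row that was handed to the mapper.
  private byte[] lastSuccessfulRow;

  /** Records the row by value, so the caller's array can change freely afterwards. */
  void recordSuccess(byte[] rowFromResult) {
    this.lastSuccessfulRow = Arrays.copyOf(rowFromResult, rowFromResult.length);
  }

  /** Row a restarted scan would begin from; null if no row has succeeded yet. */
  byte[] restartRow() {
    return lastSuccessfulRow;
  }

  public static void main(String[] args) {
    RestartRowCopySketch reader = new RestartRowCopySketch();
    byte[] row = {1, 2, 3};
    reader.recordSuccess(row);
    row[0] = 9; // simulated in-place edit by the mapper
    System.out.println(Arrays.toString(reader.restartRow())); // prints [1, 2, 3]
  }
}
```

The same effect can be had on the user side by copying the array returned 
from getRow() before modifying it, at the cost of one extra allocation per 
row either way.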



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
