[ 
https://issues.apache.org/jira/browse/PHOENIX-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705322#comment-14705322
 ] 

maghamravikiran commented on PHOENIX-2154:
------------------------------------------

Out of turn, but I agree. I believe we need to break this task up into two 
broad concerns: 1) resiliency and 2) quick execution.

Right now, with the direct API approach, we get quicker execution of the job, 
as we have seen the direct API perform far better than the HFiles route. This 
can be partially attributed to the fact that there was only one reducer 
shuffling across the mapper output. Here, if one mapper fails, there is a 
possibility that a few successful mappers have already committed data to the 
index table. Is this OK?

For resiliency, I believe the bulk load approach is good, as it's an 
all-or-nothing job. We don't copy HFiles onto the index table until all the 
mappers and the reducer have completed.
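
To make the difference in failure behaviour concrete, here is a minimal 
sketch in plain Python. It is not Phoenix or MapReduce code: `run_direct`, 
`run_bulk_load`, `mapper_output` and the in-memory table lists are 
hypothetical names used only for illustration, and the mappers are simulated 
sequentially with one of them assumed to fail.

```python
def mapper_output(split_id):
    """Pretend each mapper produces two index rows for its split."""
    return [f"row-{split_id}-{i}" for i in range(2)]


def run_direct(splits, failing_split, index_table):
    """Direct API style: every successful mapper commits as it finishes."""
    for split_id in splits:
        if split_id == failing_split:
            return False  # job fails; earlier commits stay in the table
        index_table.extend(mapper_output(split_id))
    return True


def run_bulk_load(splits, failing_split, index_table):
    """Bulk load style: stage output, load only if every task succeeded."""
    staged = []
    for split_id in splits:
        if split_id == failing_split:
            return False  # job fails; staged output is never loaded
        staged.extend(mapper_output(split_id))
    index_table.extend(staged)
    return True


direct_table, bulk_table = [], []
direct_ok = run_direct([0, 1, 2], failing_split=2, index_table=direct_table)
bulk_ok = run_bulk_load([0, 1, 2], failing_split=2, index_table=bulk_table)
print(direct_ok, len(direct_table))  # failed job, partial index rows remain
print(bulk_ok, len(bulk_table))      # failed job, index table untouched
```

In this toy model both jobs fail, but only the direct variant leaves rows 
behind in the index table.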

In both approaches, the important task we need to address is finding a way to 
avoid re-running successful mappers. Is this possible, and what is the best 
way to do it?
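
One possible direction, sketched below in plain Python under loud 
assumptions: the job records each finished split somewhere durable, and a 
retry skips the recorded ones. `build_index` and the `completed` set are 
hypothetical stand-ins, not existing Phoenix or Hadoop APIs, and the sketch 
assumes the splits are identical between attempts.

```python
def build_index(splits, completed, fails_on=None):
    """Run only the splits not yet recorded as completed.

    `completed` stands in for durable state that survives a failed job.
    Returns the list of splits actually executed in this attempt.
    """
    executed = []
    for split_id in splits:
        if split_id in completed:
            continue  # finished in an earlier attempt, do not re-run
        if split_id == fails_on:
            break  # simulate a mapper failure that aborts this attempt
        executed.append(split_id)
        completed.add(split_id)
    return executed


completed = set()
first = build_index(["s0", "s1", "s2"], completed, fails_on="s1")
second = build_index(["s0", "s1", "s2"], completed)
print(first)   # splits run before the failure
print(second)  # only the splits that had not finished
```

Whether this is safe depends on the output of a finished split staying valid 
across attempts, which is the open question for both approaches.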

> Failure of one mapper should not affect other mappers in MR index build
> -----------------------------------------------------------------------
>
>                 Key: PHOENIX-2154
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2154
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>            Assignee: maghamravikiran
>         Attachments: IndexTool.java, PHOENIX-2154-WIP.patch
>
>
> Once a mapper in the MR index job succeeds, it should not need to be re-done 
> in the event of the failure of one of the other mappers. The initial 
> population of an index is based on a snapshot in time, so new rows added 
> *after* the index build has started and/or failed do not impact it.
> Also, there's a 1:1 correspondence between index rows and table rows, so 
> there's really no need to dedup. However, the index rows will have a 
> different row key than the data table, so I'm not sure how the HFiles are 
> split. Will they potentially overlap and is this an issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
