[ 
https://issues.apache.org/jira/browse/PHOENIX-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646637#comment-14646637
 ] 

James Taylor commented on PHOENIX-2154:
---------------------------------------

The main thing to keep in mind is that it's difficult to guarantee that the 
client that kicked off the job will be around long enough to complete it. Same 
goes for all the mappers. For a multi-billion row table, the job may take 
10-12 hours. If the incremental work can be kept when something fails, that 
would be ideal.

If need be, we may want to use the regular HBase APIs (rather than building 
HFiles), and keep track of which data row key we're at for each mapper. Then, 
if something goes wrong, the Mapper can start up where it left off. Even if 
this is slower overall, it's still a win, as we know the job will complete.
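
A minimal sketch of the per-mapper checkpoint idea above, in plain Python 
rather than against the HBase client API. Everything here is hypothetical 
and for illustration only: run_mapper, the in-memory index and checkpoints 
dicts, and the fail_at switch stand in for a real mapper, the index table, 
and wherever the progress marker would actually be stored.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, str]  # (data row key, indexed column value)


def run_mapper(
    mapper_id: str,
    rows: List[Row],                      # this mapper's rows, sorted by data row key
    index: Dict[Tuple[str, str], bytes],  # stand-in for the index table
    checkpoints: Dict[str, str],          # stand-in for a durable progress store
    written: List[str],                   # log of data row keys actually indexed
    fail_at: Optional[str] = None,        # simulate a crash at this data row key
) -> None:
    """Index `rows`, skipping anything at or before the saved checkpoint."""
    last_done = checkpoints.get(mapper_id)
    for key, value in rows:
        if last_done is not None and key <= last_done:
            continue  # indexed by an earlier attempt
        if key == fail_at:
            raise RuntimeError(f"mapper {mapper_id} failed at {key}")
        # Index row key = indexed value followed by the data row key.
        index[(value, key)] = b""
        written.append(key)
        # Record progress only after the write has succeeded.
        checkpoints[mapper_id] = key


rows = [("r1", "c"), ("r2", "a"), ("r3", "b"), ("r4", "a")]
index: Dict[Tuple[str, str], bytes] = {}
checkpoints: Dict[str, str] = {}
written: List[str] = []

try:
    run_mapper("m0", rows, index, checkpoints, written, fail_at="r3")
except RuntimeError:
    pass  # first attempt dies partway through

checkpoint_after_failure = checkpoints["m0"]

# The retry picks up after the checkpoint instead of starting over.
run_mapper("m0", rows, index, checkpoints, written)
```

In this sketch the checkpoint is updated after every row, which keeps the 
example short; a real job would more likely checkpoint per batch. Because 
writing the same index row twice produces the same row, replaying the rows 
between the last checkpoint and the failure point would be harmless.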

Would be interested in hearing what you think too, [~jfernando_sfdc], 
[~elilevine], [~maghamravikiran], [~rvaleti], [~tdsilva]

> Failure of one mapper should not affect other mappers in MR index build
> -----------------------------------------------------------------------
>
>                 Key: PHOENIX-2154
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2154
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>
> Once a mapper in the MR index job succeeds, it should not need to be re-done 
> in the event of the failure of one of the other mappers. The initial 
> population of an index is based on a snapshot in time, so new rows written 
> *after* the index build has started and/or failed do not impact it.
> Also, there's a 1:1 correspondence between index rows and table rows, so 
> there's really no need to dedup. However, the index rows will have a 
> different row key than the data table, so I'm not sure how the HFiles are 
> split. Will they potentially overlap and is this an issue?
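
To make the overlap question concrete, here is a small Python sketch. It 
assumes each mapper scans a contiguous range of data row keys and that an 
index row key is the indexed value followed by the data row key; the sample 
rows and the index_key_range helper are made up for illustration. It only 
shows that the index keys produced by different mappers can interleave. 
Whether that matters depends on how the job partitions its output before 
HFiles are written, which is not stated here.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (data row key, indexed column value)

# Rows in data-table order, split into two contiguous mapper ranges.
data: List[Row] = [
    ("r1", "m"), ("r2", "a"), ("r3", "z"),
    ("r4", "b"), ("r5", "y"), ("r6", "c"),
]
splits: Dict[str, List[Row]] = {"m0": data[:3], "m1": data[3:]}


def index_key_range(rows: List[Row]) -> Tuple[Tuple[str, str], Tuple[str, str]]:
    """Return the smallest and largest index row key a mapper would emit."""
    keys = sorted((value, key) for key, value in rows)
    return keys[0], keys[-1]


ranges = {mapper: index_key_range(rows) for mapper, rows in splits.items()}
(m0_lo, m0_hi), (m1_lo, m1_hi) = ranges["m0"], ranges["m1"]

# Two closed ranges overlap when each one starts before the other ends.
overlap = m0_lo <= m1_hi and m1_lo <= m0_hi
print(ranges, overlap)
```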



