[ 
https://issues.apache.org/jira/browse/PHOENIX-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698899#comment-14698899
 ] 

James Taylor commented on PHOENIX-2154:
---------------------------------------

Turns out it's pretty easy to switch an MR job to use the front-door APIs 
instead of loading HFiles. See the attached new version of IndexTool for how to 
do that (thanks to [~lhofhansl]). Depending on what we figure out, we may want 
to make this configurable between loading HFiles and using regular HBase APIs. 
A good first step would be to perf test this new mapper-only solution and 
compare it with the HFile-building one. [~tdsilva] - would you mind attaching 
the perf numbers that you've collected so far?

One nice thing about using the front-door APIs is that it'd be pretty easy to 
track where we are with the index build and start off again where we left off 
(using the previous known successful row key as the start row of the scan). One 
thing we still need to figure out is how to mark the index as active once all 
mappers have completed (hopefully without having to hold the client open the 
entire time). Is there some kind of callback mechanism we can rely on?
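
To make the restart idea concrete, here is a rough, in-memory sketch. It is not 
the attached IndexTool and does not use the HBase or Phoenix APIs: a TreeMap 
stands in for the data table, and the names ResumableIndexBuild, Checkpoint and 
buildFrom, the "|" separator, and the maxRows failure knob are all made up for 
illustration. It assumes that re-writing the checkpointed row is harmless, 
since the resumed scan starts at that row inclusively.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory stand-in for a restartable index build (illustrative only). */
public class ResumableIndexBuild {

    /** Hypothetical checkpoint; a real job would persist this somewhere durable. */
    public static final class Checkpoint {
        private String lastSuccessfulRowKey; // null until a row has been written
        public String get() { return lastSuccessfulRowKey; }
        public void set(String rowKey) { lastSuccessfulRowKey = rowKey; }
    }

    /**
     * Scans dataTable in row-key order, starting at the checkpointed row key
     * (inclusive), and writes one index row per data row. Stops after maxRows
     * rows to simulate a failure. Returns the number of rows written this run.
     */
    public static int buildFrom(NavigableMap<String, String> dataTable,
                                NavigableMap<String, String> indexTable,
                                Checkpoint checkpoint,
                                int maxRows) {
        String start = checkpoint.get();
        NavigableMap<String, String> toScan =
                (start == null) ? dataTable : dataTable.tailMap(start, true);
        int written = 0;
        for (Map.Entry<String, String> row : toScan.entrySet()) {
            if (written >= maxRows) {
                break; // simulated failure part-way through the scan
            }
            // Assumed index row key layout: indexed value, separator, data row key.
            indexTable.put(row.getValue() + "|" + row.getKey(), "");
            checkpoint.set(row.getKey()); // record only after the write succeeded
            written++;
        }
        return written;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> data = new TreeMap<>();
        data.put("r1", "e");
        data.put("r2", "d");
        data.put("r3", "c");
        data.put("r4", "b");
        data.put("r5", "a");

        NavigableMap<String, String> index = new TreeMap<>();
        Checkpoint checkpoint = new Checkpoint();

        buildFrom(data, index, checkpoint, 2);                 // "fails" after two rows
        buildFrom(data, index, checkpoint, Integer.MAX_VALUE); // resumes from r2
        System.out.println("index rows: " + index.size());
    }
}
```

The second call re-writes the row for r2 and then continues through r5. Because 
the write is a plain put keyed on the index row key, the repeat does not create 
a duplicate.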

IMHO, we need to ensure we’re meeting our design goal of making the index build 
resilient. It must be incremental and restartable (picking up more or less 
where it left off without requiring the client that started it to be up for the 
entire build), and it must be monitorable.

[~maghamravikiran], [~rvaleti].


> Failure of one mapper should not affect other mappers in MR index build
> -----------------------------------------------------------------------
>
>                 Key: PHOENIX-2154
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2154
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>
> Once a mapper in the MR index job succeeds, it should not need to be re-done 
> in the event of the failure of one of the other mappers. The initial 
> population of an index is based on a snapshot in time, so new rows getting 
> added *after* the index build has started and/or failed do not impact it.
> Also, there's a 1:1 correspondence between index rows and table rows, so 
> there's really no need to dedup. However, the index rows will have a 
> different row key than the data table, so I'm not sure how the HFiles are 
> split. Will they potentially overlap and is this an issue?
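
On the overlap question in the description, here is a small sketch of why the 
index key ranges produced by different mappers can interleave even when the 
data table splits are disjoint. It is only an illustration: the class 
IndexKeyRanges, the sample rows, and the "indexed value | data row key" layout 
are assumptions, and it says nothing about whether overlapping ranges are 
actually a problem for the HFile load.

```java
import java.util.TreeSet;

/** Shows that disjoint data-key splits can yield overlapping index-key ranges. */
public class IndexKeyRanges {

    /**
     * Each row is {dataRowKey, indexedValue}. Returns {min, max} of the index
     * row keys generated for the split, using an assumed "value|dataKey" layout.
     */
    public static String[] indexKeyRange(String[][] rows) {
        TreeSet<String> keys = new TreeSet<>();
        for (String[] row : rows) {
            keys.add(row[1] + "|" + row[0]);
        }
        return new String[] { keys.first(), keys.last() };
    }

    /** True if the closed ranges [a[0], a[1]] and [b[0], b[1]] intersect. */
    public static boolean overlaps(String[] a, String[] b) {
        return a[0].compareTo(b[1]) <= 0 && b[0].compareTo(a[1]) <= 0;
    }

    public static void main(String[] args) {
        // Two mappers over disjoint data row key ranges: row1..row2 and row3..row4.
        String[][] split1 = { { "row1", "zeta" }, { "row2", "alpha" } };
        String[][] split2 = { { "row3", "mike" }, { "row4", "bravo" } };

        String[] range1 = indexKeyRange(split1);
        String[] range2 = indexKeyRange(split2);

        System.out.println("mapper 1 index keys: " + range1[0] + " .. " + range1[1]);
        System.out.println("mapper 2 index keys: " + range2[0] + " .. " + range2[1]);
        System.out.println("overlap: " + overlaps(range1, range2));
    }
}
```

Since the index row key leads with the indexed value, its ordering is unrelated 
to the data table's row key ordering, so each mapper's output can span much of 
the index key space.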



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
