[ https://issues.apache.org/jira/browse/PHOENIX-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700080#comment-14700080 ]
James Taylor commented on PHOENIX-2154:
---------------------------------------

Good point, [~gabriel.reid]. What do you think the trade-off would be between
1) the work the reducer needs to do to make the HFiles non-overlapping (as there's no correlation between the index row keys and the data table row keys) versus
2) the extra flushes/compactions that'll be necessary if we go through the front-door APIs?

For (1), with a new index we don't know the split points, so it'll be a single reducer unfortunately. Is it possible in MR to set the split points after the map phase, but before the reduce phase? If it's not possible in MR, can this be done in Spark?

Also, [~gabriel.reid] - do you know if there's a mechanism to get some kind of callback once all the mappers are done? Or is the only choice to have the client that submitted the MR job wait until it's complete?

> Failure of one mapper should not affect other mappers in MR index build
> -----------------------------------------------------------------------
>
> Key: PHOENIX-2154
> URL: https://issues.apache.org/jira/browse/PHOENIX-2154
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
> Attachments: IndexTool.java
>
>
> Once a mapper in the MR index job succeeds, it should not need to be re-done
> in the event of the failure of one of the other mappers. The initial
> population of an index is based on a snapshot in time, so new rows written
> *after* the index build has started and/or failed do not impact it.
> Also, there's a 1:1 correspondence between index rows and table rows, so
> there's really no need to dedup. However, the index rows will have a
> different row key than the data table, so I'm not sure how the HFiles are
> split. Will they potentially overlap and is this an issue?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
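
To illustrate point (1) in the comment above, here is a minimal sketch of how known split points let index row keys be range-partitioned across reducers, so that each reducer writes HFiles covering a disjoint key range. It mirrors the general idea behind range partitioners such as Hadoop's TotalOrderPartitioner, but it is not the Phoenix or HBase implementation: the class `SplitPointSketch`, its methods, and the sample keys are all hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Phoenix, HBase, or Hadoop.
public class SplitPointSketch {

    // Unsigned lexicographic comparison of two row keys (byte-wise ordering).
    public static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Returns the partition (reducer) index for rowKey: the number of
    // split points that are <= rowKey, found by binary search.
    // sortedSplitPoints must be sorted in compareKeys order.
    public static int partitionFor(byte[] rowKey, byte[][] sortedSplitPoints) {
        int lo = 0;
        int hi = sortedSplitPoints.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareKeys(sortedSplitPoints[mid], rowKey) <= 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Convenience helper for building sample keys from strings.
    public static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[][] noSplits = new byte[0][];             // brand-new index: split points unknown
        byte[][] twoSplits = { key("g"), key("p") };   // assumed split points, for illustration

        for (String k : new String[] {"apple", "grape", "zebra"}) {
            System.out.println(k
                + " -> partition without splits: " + partitionFor(key(k), noSplits)
                + ", with splits: " + partitionFor(key(k), twoSplits));
        }
    }
}
```

With an empty split-point list every key maps to partition 0, which is the single-reducer situation described in the comment. With split points available, each key falls into exactly one range, so the files produced by different reducers cannot overlap.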