[jira] [Commented] (PHOENIX-1609) MR job to populate index tables

Gabriel Reid (JIRA) Mon, 16 Feb 2015 23:44:58 -0800

    [ 
https://issues.apache.org/jira/browse/PHOENIX-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323784#comment-14323784
 ]


Gabriel Reid commented on PHOENIX-1609:
---------------------------------------

Invoking an MR job from Java code isn't really an issue in itself, but it 
brings quite a few extra things to worry about:
* a client needs to be configured to talk to the job tracker (as well as 
including the needed dependencies to do this)
* a client might also need to be able to talk to HDFS (if HFile creation is 
being done)

HFile creation is (probably) the most performant way to do this, and it also 
allows (nearly) atomic population of the index (nothing goes into HBase until 
the full set of HFiles is created). This contrasts with writing directly to the 
index tables during population: if something goes wrong during that process, 
you are left with a partially written index.

An additional complexity that writing HFiles brings with it is that you need to 
have an extra process that runs after the MR job that loads the HFiles into 
HBase itself. This post-process isn't necessary if you are only starting up a 
single MR job. However, there is still the general issue of having to update 
the state of the index from BUILDING to the "usable" state.
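
As a rough illustration of the ordering described above (write HFiles, load 
them, then flip the index state), here is a minimal sketch in plain Java. It 
is only a model: the class, the method names, the USABLE state name, and the 
paths are all hypothetical, and none of them are Phoenix or HBase APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model only: these class, enum, and method names are
// illustrative and are not Phoenix or HBase APIs.
public class IndexBuildLifecycle {
    // USABLE is a placeholder for whatever the "usable" index state is called.
    public enum IndexState { BUILDING, USABLE }

    private IndexState state = IndexState.BUILDING;
    private final List<String> staged = new ArrayList<>(); // HFiles written by the MR job
    private final List<String> loaded = new ArrayList<>(); // HFiles handed over to HBase

    // Step 1: the MR job writes an HFile; it is staged, not yet visible in HBase.
    public void stageHFile(String path) {
        if (state != IndexState.BUILDING) {
            throw new IllegalStateException("index is no longer being built");
        }
        staged.add(path);
    }

    // Step 2: the post-process loads the full set of staged HFiles.
    public void loadStagedHFiles() {
        loaded.addAll(staged);
        staged.clear();
    }

    // Step 3: only after everything is loaded may the index become usable.
    public void markUsable() {
        if (!staged.isEmpty()) {
            throw new IllegalStateException("staged HFiles have not been loaded");
        }
        state = IndexState.USABLE;
    }

    public IndexState state() { return state; }

    public int loadedCount() { return loaded.size(); }

    public static void main(String[] args) {
        IndexBuildLifecycle build = new IndexBuildLifecycle();
        build.stageHFile("/tmp/index-hfiles/part-00000"); // made-up path
        build.stageHFile("/tmp/index-hfiles/part-00001"); // made-up path
        build.loadStagedHFiles();
        build.markUsable();
        System.out.println(build.state() + " with " + build.loadedCount() + " HFiles loaded");
    }
}
```

The point of the model is that whoever kicks off the job also has to own 
steps 2 and 3, which is the extra responsibility a client would take on.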

My general feeling is that it would probably be better to avoid kicking off an 
MR job from a client. If/when we've got a server process then that might be a 
better place to kick it off from.

Another idea: introduce the concept of an "initially deferred" index (this 
might be the wrong terminology here though). An index that is created as 
initially deferred will not be populated, but can be populated via the MR job 
(which would be started up via the hadoop command). A configurable limit could 
be set on the maximum index size that can be created in "populate immediately" 
mode, so with this limit set, some indexes would be required to be created in 
"initially deferred" mode.

This approach would allow keeping all the configuration, job tracking, index 
state, etc. logic outside of the SQL client (which I think is probably the 
safer option).
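
To make the "configurable limit" idea a bit more concrete, here is a minimal 
sketch of the check in plain Java. Everything in it is an assumption made for 
illustration: the class and enum names are hypothetical (not Phoenix APIs), 
the limit is expressed in bytes, and it presumes that some estimate of the 
index size is available when the index is created.

```java
// Hypothetical sketch: the names, the byte-based limit, and the availability
// of a size estimate are all assumptions, not existing Phoenix behavior.
public class IndexCreationPolicy {
    public enum PopulationMode { POPULATE_IMMEDIATELY, INITIALLY_DEFERRED }

    // The configurable maximum size for "populate immediately" mode.
    private final long maxImmediateBytes;

    public IndexCreationPolicy(long maxImmediateBytes) {
        if (maxImmediateBytes < 0) {
            throw new IllegalArgumentException("limit must be non-negative");
        }
        this.maxImmediateBytes = maxImmediateBytes;
    }

    // Deferred creation is always allowed; immediate population is allowed
    // only when the estimated index size is within the configured limit.
    public boolean isAllowed(PopulationMode requested, long estimatedIndexBytes) {
        return requested == PopulationMode.INITIALLY_DEFERRED
                || estimatedIndexBytes <= maxImmediateBytes;
    }

    public static void main(String[] args) {
        IndexCreationPolicy policy = new IndexCreationPolicy(1L << 30); // 1 GiB, arbitrary
        long small = 10L << 20; // 10 MiB
        long large = 50L << 30; // 50 GiB
        System.out.println(policy.isAllowed(PopulationMode.POPULATE_IMMEDIATELY, small));
        System.out.println(policy.isAllowed(PopulationMode.POPULATE_IMMEDIATELY, large));
        System.out.println(policy.isAllowed(PopulationMode.INITIALLY_DEFERRED, large));
    }
}
```

With a check like this, an index over the limit could only be created as 
"initially deferred" and would then be populated by the MR job started via the 
hadoop command.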

> MR job to populate index tables 
> --------------------------------
>
>                 Key: PHOENIX-1609
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1609
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: maghamravikiran
>            Assignee: maghamravikiran
>         Attachments: 0001-PHOENIX_1609.patch
>
>
> Often, we need to create new indexes on master tables way after the data 
> exists on the master tables.  It would be good to have a simple MR job given 
> by the phoenix code that users can call to have indexes in sync with the 
> master table. 
> Users can invoke the MR job using the following command 
> hadoop jar org.apache.phoenix.mapreduce.Index -st MASTER_TABLE -tt 
> INDEX_TABLE -columns a,b,c
> Is this ideal? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
