[ https://issues.apache.org/jira/browse/PHOENIX-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323784#comment-14323784 ]
Gabriel Reid commented on PHOENIX-1609:
---------------------------------------

Invoking an MR job from Java code isn't really an issue in itself, but it brings quite a few extra things to worry about:

* a client needs to be configured to talk to the job tracker (and needs to include the dependencies required to do this)
* a client might also need to be able to talk to HDFS (if HFile creation is being done)

HFile creation is (probably) the most performant way to populate the index, and it also allows (nearly) atomic population of the index (nothing goes into HBase until the full set of HFiles is created). This contrasts with writing directly to the index tables during population: if something goes wrong during that process, you are left with a partially written index.

An additional complexity that writing HFiles brings is that you need an extra process that runs after the MR job to load the HFiles into HBase itself. This post-process isn't necessary if you are only starting up a single MR job. However, there is still the general concern of having to update the state of the index from BUILDING to the "usable" state.

My general feeling is that it would probably be better to avoid kicking off an MR job from a client. If/when we've got a server process, that might be a better place to kick it off from.

Another idea: introduce the concept of an "initially deferred" index (this might be the wrong terminology here though). An index that is created as initially deferred will not be populated at creation time, but can be populated via the MR job (which would be started via the hadoop command). A configurable limit could be set on the maximum index size that can be created in "populate immediately" mode, so with this limit set, some indexes would be required to be created in "initially deferred" mode.
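
To make the size-limit idea a bit more concrete, here is a minimal sketch in Java. Everything in it is hypothetical: the class `IndexPopulationPolicy`, the `Mode` enum, and the byte-based limit are illustrative names rather than existing Phoenix APIs, and how the index size would actually be estimated is left open.

```java
// Hypothetical sketch only: none of these names exist in Phoenix.
final class IndexPopulationPolicy {

    /** The two creation modes discussed in the comment above. */
    enum Mode { POPULATE_IMMEDIATELY, INITIALLY_DEFERRED }

    // Assumed configurable limit on the size of an index that may be
    // populated immediately, expressed in bytes.
    private final long maxImmediateBytes;

    IndexPopulationPolicy(long maxImmediateBytes) {
        if (maxImmediateBytes < 0) {
            throw new IllegalArgumentException("limit must be non-negative");
        }
        this.maxImmediateBytes = maxImmediateBytes;
    }

    /**
     * Returns true if an index of the given estimated size may be created
     * in the requested mode. Deferred creation is always allowed because
     * population is left to the separately started MR job; immediate
     * population is allowed only up to the configured limit.
     */
    boolean isAllowed(Mode requested, long estimatedIndexBytes) {
        if (requested == Mode.INITIALLY_DEFERRED) {
            return true;
        }
        return estimatedIndexBytes <= maxImmediateBytes;
    }

    public static void main(String[] args) {
        // Example limit of 1 GiB (an arbitrary value for illustration).
        IndexPopulationPolicy policy = new IndexPopulationPolicy(1L << 30);
        long smallIndex = 10L << 20; // 10 MiB
        long largeIndex = 5L << 30;  // 5 GiB
        System.out.println(policy.isAllowed(Mode.POPULATE_IMMEDIATELY, smallIndex));
        System.out.println(policy.isAllowed(Mode.POPULATE_IMMEDIATELY, largeIndex));
        System.out.println(policy.isAllowed(Mode.INITIALLY_DEFERRED, largeIndex));
    }
}
```

With a check along these lines, a request to populate a large index immediately would be rejected, and the user would have to create the index as initially deferred and then run the MR job.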

This approach would allow keeping all the configuration, job tracking, index state, etc. logic outside of the SQL client (which I think is probably the safer option).

> MR job to populate index tables
> --------------------------------
>
>                 Key: PHOENIX-1609
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1609
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: maghamravikiran
>            Assignee: maghamravikiran
>         Attachments: 0001-PHOENIX_1609.patch
>
>
> Often, we need to create new indexes on master tables way after the data
> exists on the master tables. It would be good to have a simple MR job given
> by the phoenix code that users can call to have indexes in sync with the
> master table.
> Users can invoke the MR job using the following command
> hadoop jar org.apache.phoenix.mapreduce.Index -st MASTER_TABLE -tt
> INDEX_TABLE -columns a,b,c
> Is this ideal?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)