[ https://issues.apache.org/jira/browse/PHOENIX-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323330#comment-14323330 ]
James Taylor commented on PHOENIX-1609:
---------------------------------------

Thanks for the patch, [~maghamraviki...@gmail.com]. Here's some feedback:
- I think we should aim to build this more directly on top of the MR support you already built, in particular on the ability to run a SELECT query through PhoenixInputFormat. The main reason is that with functional indexes (see http://phoenix.apache.org/secondary_indexing.html#Functional_Indexes), arbitrary expressions may be used to define the index which would fit in nicely with the mechanism you've already built. Probably the approach that'll give you the most bang-for-the-buck would be to expand your MR integration first to support *writing* the results from the SELECT to create an HFile (much like the CSV loader).
- Once you can write to a table through our MR support, take a look at the UPSERT SELECT statement created by PostIndexDDLCompiler to populate an index. The SELECT part of this is what you'd want to build as your select statement, while the UPSERT part defines the columns to which you're writing. It's possible that the building of this statement could be exposed through a shared utility (or that you could just use PostIndexDDLCompiler for this work too). If you get the QueryPlan for this SELECT statement, you should, in theory, be able to run it through your existing MR support (which gets you most of the way there).
- I think we should strive to hide the MR job behind our existing CREATE INDEX statement. I think you can decide in PostIndexDDLCompiler.compile() on whether or not you run the index creation through MR or using our existing mechanism, based on the table stats you can retrieve from the data table. In fact, then you'll already have the SELECT statement and UPSERT statement built, so it's just a matter of how they'll be run.
Something like this:
{code}
PTableStats stats = dataTableRef.getTable().getTableStats();
Collection<GuidePostsInfo> guidePostsCollection = stats.getGuidePosts().values();
// Estimate the data table size by summing the byte counts of all guideposts
long totalByteSize = 0;
for (GuidePostsInfo info : guidePostsCollection) {
    totalByteSize += info.getByteCount();
}
long byteThreshold = connection.unwrap(PhoenixConnection.class).getQueryServices()
        .getProps().getLong(QueryServices.MAP_REDUCE_INDEX_BUILD_THRESHOLD_ATTRIB,
                QueryServicesOptions.DEFAULT_MAP_REDUCE_INDEX_BUILD_THRESHOLD);
if (totalByteSize >= byteThreshold) {
    // Return new MutationPlan that has an execute() method that kicks off the map/reduce job
} else {
    // Return MutationPlan as it is created today
}
{code}
- As far as setting the index state appropriately, you shouldn't need to do anything to initialize the state, as the CREATE INDEX call already sets the index state to PIndexState.BUILDING at the beginning, from createTableInternal. Then on the successful completion of your MR job, you'd set the index state to PIndexState.ACTIVE. It's likely we'll want to move the code that does this now in MetaDataClient.buildIndex() into the end of each MutationPlan generated there (instead of assuming that the index build always happens synchronously).
- Minor, but when validating that a data/index table exists, go through our meta data operations using connection.getMetaData() and the corresponding JDBC APIs for DatabaseMetaData, instead of dipping down to our internal PTable APIs as you've done here:
{code}
+    private boolean isValidIndexTable(final Connection connection, final String masterTable, final String indexTable) throws SQLException {
+        final PTable table = PhoenixRuntime.getTable(connection, masterTable);
+        for(PTable indxTable : table.getIndexes()){
+            if(indxTable.getTableName().getString().equalsIgnoreCase(indexTable)) {
+                return true;
+            }
+        }
+        return false;
+
+    }
+
{code}

> MR job to populate index tables
> --------------------------------
>
>                 Key: PHOENIX-1609
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1609
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: maghamravikiran
>            Assignee: maghamravikiran
>         Attachments: 0001-PHOENIX_1609.patch
>
>
> Often, we need to create new indexes on master tables way after the data
> exists on the master tables. It would be good to have a simple MR job given
> by the phoenix code that users can call to have indexes in sync with the
> master table.
> Users can invoke the MR job using the following command
> hadoop jar org.apache.phoenix.mapreduce.Index -st MASTER_TABLE -tt INDEX_TABLE -columns a,b,c
> Is this ideal?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
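
For reference, the DatabaseMetaData-based validation suggested in the last feedback item could look roughly like the sketch below. It uses only standard java.sql calls. The class name IndexValidation and the extra schema parameter are hypothetical, and it assumes that the driver's getIndexInfo() reports each secondary index of the data table under the standard INDEX_NAME column, which should be checked against the Phoenix version in use. Because getIndexInfo() takes exact names rather than patterns, and Phoenix upper-cases unquoted identifiers, callers would likely need to pass the schema and table names in their normalized case.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper, not part of the patch or of Phoenix itself.
public final class IndexValidation {
    private IndexValidation() {}

    /**
     * Returns true if the JDBC metadata lists indexTable as an index on
     * dataTable. Assumes the driver exposes index names via INDEX_NAME.
     */
    public static boolean isValidIndexTable(Connection connection, String schema,
            String dataTable, String indexTable) throws SQLException {
        DatabaseMetaData metaData = connection.getMetaData();
        // catalog = null: do not filter by catalog; unique = false: include
        // non-unique indexes; approximate = false: request accurate results.
        try (ResultSet rs = metaData.getIndexInfo(null, schema, dataTable, false, false)) {
            while (rs.next()) {
                String name = rs.getString("INDEX_NAME");
                // INDEX_NAME may be null for statistics rows, so guard against it.
                if (name != null && name.equalsIgnoreCase(indexTable)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Going through DatabaseMetaData keeps the MR tool on the public JDBC surface, so it would not need to change if the internal PTable representation does.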