[ 
https://issues.apache.org/jira/browse/PHOENIX-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325478#comment-14325478
 ] 

James Taylor commented on PHOENIX-1609:
---------------------------------------

[~lhofhansl] - good idea about the ASYNC keyword. I was thinking we could use 
MR if the size of the data table is over a certain threshold, but making it 
explicit might be better.

As far as the building blocks go, we have most of them already. Our existing 
MR integration can run a SELECT statement as an MR job 
(http://phoenix.apache.org/phoenix_mr.html). That's more than half the battle, 
as index population is done through an UPSERT SELECT query. We just need a way 
of piping the SELECT results into the index table.
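
To illustrate the UPSERT SELECT shape mentioned above, here is a minimal 
sketch that only assembles the statement text from a data table, an index 
table, and a column list. The class name IndexPopulationSketch and the method 
buildUpsertSelect are hypothetical and not part of Phoenix, and the table and 
column names are borrowed from the example command in the issue description. 
The statement Phoenix actually generates for index population will differ 
(index tables have their own column naming and row key layout), so treat this 
as a picture of the idea only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not Phoenix code: shows the general form of an
// UPSERT SELECT that copies columns from a data table into another table.
public class IndexPopulationSketch {

    // Builds "UPSERT INTO <indexTable> SELECT <cols> FROM <dataTable>".
    // Assumption: the target table accepts the selected columns in this order.
    static String buildUpsertSelect(String dataTable, String indexTable,
                                    List<String> columns) {
        String cols = String.join(", ", columns);
        return "UPSERT INTO " + indexTable
                + " SELECT " + cols
                + " FROM " + dataTable;
    }

    public static void main(String[] args) {
        // Names taken from the example command in the issue description.
        String sql = buildUpsertSelect(
                "MASTER_TABLE", "INDEX_TABLE", Arrays.asList("a", "b", "c"));
        System.out.println(sql);
    }
}
```

In an MR job, the SELECT half would be what the mappers read, and the UPSERT 
half would be the part that still needs to be wired to the index table.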

We also have a mechanism for directly creating HFiles through our CSV Bulk 
Loader (http://phoenix.apache.org/bulk_dataload.html) by generating UPSERT 
statements under the covers and getting the underlying KeyValues to build the 
HFile. Perhaps some of this code can be leveraged/refactored.

The one point we're not sure about is whether we should invoke the MR job 
from our Phoenix client when a CREATE INDEX ASYNC is done, or whether we 
require the user to initiate the MR job through the more standard hadoop jar 
mechanism (outside of Phoenix).

> MR job to populate index tables 
> --------------------------------
>
>                 Key: PHOENIX-1609
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1609
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: maghamravikiran
>            Assignee: maghamravikiran
>         Attachments: 0001-PHOENIX_1609.patch
>
>
> Often, we need to create new indexes on master tables long after the data 
> exists on the master tables.  It would be good to have a simple MR job 
> provided by the Phoenix code that users can call to bring indexes in sync 
> with the master table. 
> Users can invoke the MR job using the following command:
>   hadoop jar org.apache.phoenix.mapreduce.Index -st MASTER_TABLE -tt INDEX_TABLE -columns a,b,c
> Is this ideal? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
