I am considering using Phoenix, but I know that I will want to transform my data via MapReduce, e.g. UPSERT some core data, then go back over the data set and "fill in" additional columns (appropriately stored in additional column families).
I think all I need to do is implement an InputFormat that takes a table name (or, more generally, a query like "SELECT * FROM table WHERE ..."). But in order to define splits, I need some way to discover key ranges so that I can issue a series of contiguous range scans. Can you suggest how I might go about this in a general way? If I get this right, I'll contribute the code; otherwise I'll need to use external knowledge of my specific table data to partition the task.

If Phoenix had a LIMIT with a SKIP option, plus a table ROWCOUNT, that would also achieve the goal. Or is there perhaps some way to implement the InputFormat via a native HBase API call?

Andrew.

(MongoDB's InputFormat implementation calls an internal function on the server to do this: https://github.com/mongodb/mongo-hadoop/blob/master/core/src/main/java/com/mongodb/hadoop/splitter/StandaloneMongoSplitter.java)
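P.S. For concreteness, here is roughly what I have in mind for the "native HBase API" option: a minimal sketch that derives one split per region from HTable.getStartEndKeys(), so each mapper could run one contiguous range scan. The class and method names are placeholders of mine, not anything that exists in Phoenix today.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Pair;

    // Sketch only: derive one {startKey, endKey} range per region using the
    // stock HBase client API. An InputFormat could turn each range into a
    // split backing a contiguous Phoenix range scan.
    public class RegionSplitSketch {

        public static List<byte[][]> regionRanges(String tableName) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, tableName);
            try {
                // getStartEndKeys() returns the start and end row key of
                // every region of the table, in order.
                Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
                List<byte[][]> ranges = new ArrayList<byte[][]>();
                for (int i = 0; i < keys.getFirst().length; i++) {
                    // Each pair bounds one contiguous range scan.
                    ranges.add(new byte[][] { keys.getFirst()[i], keys.getSecond()[i] });
                }
                return ranges;
            } finally {
                table.close();
            }
        }
    }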
