I am considering using Phoenix, but I know that I will want to transform my data via MapReduce, e.g. UPSERT some core data, then go back over the data set and "fill in" additional columns (appropriately stored in additional column families).
I think all I need to do is implement an InputFormat that takes a table name (or, more generally, a query like "SELECT * FROM table WHERE ..."). But in order to define splits, I need some way to discover key ranges so that I can issue a series of contiguous range scans. Can you suggest how I might go about this in a general way? If I get this right, I'll contribute the code; otherwise I'll need to use external knowledge of my specific table data to partition the task.

If Phoenix had a LIMIT with a SKIP option, plus a table ROWCOUNT, that would also achieve the goal. Or is there perhaps some way to implement the InputFormat via a native HBase API call?

Andrew.

(MongoDB's InputFormat implementation calls an internal function on the server to do this: https://github.com/mongodb/mongo-hadoop/blob/master/core/src/main/java/com/mongodb/hadoop/splitter/StandaloneMongoSplitter.java)
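P.S. For concreteness, here is roughly what I have in mind for the "native HBase API" option: a minimal sketch that derives one split per region from HTable.getStartEndKeys(), so each mapper could run one contiguous range scan. The class and method names are placeholders of mine, not anything that exists in Phoenix today.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Pair;

    // Sketch only: derive one {startKey, endKey} range per region using the
    // stock HBase client API. An InputFormat could turn each range into a
    // split backing a contiguous Phoenix range scan.
    public class RegionSplitSketch {

        public static List<byte[][]> regionRanges(String tableName) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, tableName);
            try {
                // getStartEndKeys() returns the start and end row key of
                // every region of the table, in order.
                Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
                List<byte[][]> ranges = new ArrayList<byte[][]>();
                for (int i = 0; i < keys.getFirst().length; i++) {
                    // Each pair bounds one contiguous range scan.
                    ranges.add(new byte[][] { keys.getFirst()[i], keys.getSecond()[i] });
                }
                return ranges;
            } finally {
                table.close();
            }
        }
    }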
