Hi guys, I am looking for a way to read HBase tables through an MPP database (Postgres-XC), and I am hoping to get some suggestions that either validate or invalidate the approach.
It is kind of like Apache Drill, but through PostgreSQL. It is a long story why Postgres, and how C/C++ will give me headaches for months to come. :-) I will leave that aside for now.

The design is to have distributed Postgres-XC installed on the same cluster as HBase, so that the Postgres datanodes are on the same physical nodes as the HBase RegionServers. PostgreSQL connects to HBase through the existing HBase client code.

Step 1: At the Postgres coordinator node (like the Master of HBase), use HTable.getRegionLocations to get all regions of a particular table: NavigableMap<HRegionInfo, ServerName>

Step 2: Iterate through the above NavigableMap to map each HBase ServerName to a PG-XC datanode. The goal is to let each Postgres datanode handle the regions on its own physical machine.

Step 3: The Postgres coordinator node sends the execution plan to the Postgres datanodes through an existing framework called foreign data wrapper.

Step 4: Each Postgres datanode iterates through its assigned regions and opens an HBase client Scan with .setStartRow and .setStopRow, so it only reads the assigned region. I was hoping to use HRegionInfo.regionId directly, but I cannot find such an API in the client Scan.

Step 5: The Postgres datanode further analyses the retrieved data.

So in short, the architecture is to leverage the Postgres optimizer to parse the SQL query, and to use the Postgres datanodes as HBase clients that read HBase regions directly and in parallel, with the hope to 1) read each HRegion locally; 2) leverage existing HBase filters.

On step 4 above, is there a way to talk to a RegionServer directly without communicating with the HMaster?

Similar ideas (Drill for one; how about HP Vertica?) have been brought up and discussed before. So before I head down the same road, can I pick your brains? Please shed some light, or prevent me from doing something stupid.

Many thanks,
Demai
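
P.S. To make steps 1, 2 and 4 a bit more concrete, here is a minimal sketch of the region-to-datanode assignment in plain Java. It does not call HBase at all: RegionRange is a hypothetical stand-in for one entry of the NavigableMap<HRegionInfo, ServerName> from step 1, row keys are Strings instead of byte[], and the host and datanode names (host1, dn1, ...) as well as the fallback rule for hosts without a datanode are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for one region: row range [startKey, stopKey) served by `host`.
// An empty key means "open-ended", as for the first and last region of a table.
final class RegionRange {
    final String startKey;
    final String stopKey;
    final String host;

    RegionRange(String startKey, String stopKey, String host) {
        this.startKey = startKey;
        this.stopKey = stopKey;
        this.host = host;
    }

    @Override
    public String toString() {
        return "[" + startKey + ", " + stopKey + ")@" + host;
    }
}

final class RegionAssignmentSketch {

    // Steps 1-2: group the regions by the PG-XC datanode running on the same host.
    // Regions on a host without a datanode go to fallbackNode (assumed rule; they
    // would then be read remotely instead of locally).
    static Map<String, List<RegionRange>> assign(List<RegionRange> regions,
                                                 Map<String, String> hostToDataNode,
                                                 String fallbackNode) {
        Map<String, List<RegionRange>> plan = new TreeMap<>();
        for (RegionRange r : regions) {
            String node = hostToDataNode.getOrDefault(r.host, fallbackNode);
            plan.computeIfAbsent(node, k -> new ArrayList<>()).add(r);
        }
        return plan;
    }

    public static void main(String[] args) {
        // In the real design this list would come from HTable.getRegionLocations.
        List<RegionRange> regions = Arrays.asList(
                new RegionRange("", "g", "host1"),
                new RegionRange("g", "p", "host2"),
                new RegionRange("p", "", "host1"));

        Map<String, String> hostToDataNode = new HashMap<>();
        hostToDataNode.put("host1", "dn1");
        hostToDataNode.put("host2", "dn2");

        // Step 4: each datanode would open one Scan per assigned range,
        // using setStartRow(startKey) and setStopRow(stopKey).
        Map<String, List<RegionRange>> plan = assign(regions, hostToDataNode, "dn1");
        for (Map.Entry<String, List<RegionRange>> e : plan.entrySet()) {
            System.out.println(e.getKey() + " scans " + e.getValue());
        }
    }
}
```

The grouping is done once on the coordinator, so each datanode only receives the row ranges it should scan. Regions can move between RegionServers after the plan is built, so a range that was local at planning time may not be local at scan time; the sketch ignores that.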