Sorry Kasper, my schedule is a bit busy and hectic, so I have to punt my response until later. Apologies for the delay.

- Henry

On Mon, Jan 27, 2014 at 12:18 PM, Kasper Sørensen wrote:

> OK, to kick things off, let me provide my own input for this discussion.
> Please find below my thoughts on the issues and what we need to do. Your
> feedback is very, very welcome.
>
>
> 2014-01-24 Kasper Sørensen
>
>> Hi everyone,
>>
>> I was looking at our "hbase-module" branch and as much as I like this
>> idea, I think we've been a bit too idle with the branch. Maybe we should
>> try to make something final, e.g. for a version 4.1.
>>
>> So I thought to give an overview/status of the module's current
>> capabilities and its shortcomings. We should figure out if we think this
>> is good enough for a first version, or if we want to do some improvements
>> to the module before adding it to our portfolio of MetaModel modules.
>>
>> 1) The module only offers read-only/query access to HBase. That is in my
>> opinion OK for now, we have several such modules, and this is something we
>> can better add later if we straighten out the remaining topics in this mail.
>>
>
> No problem
>
>
>> 2) With regards to metadata mapping: HBase is different because it has
>> both column families and, within column families, columns. For the sake
>> of our view on HBase I would describe column families simply as "a logical
>> group of columns". Column families are fixed within a table, but rows in a
>> table may contain arbitrary numbers of columns within each column family.
>> So... You can instantiate the HBaseDataContext in two ways:
>>
>> 2a) You can let MetaModel discover the metadata. This unfortunately has a
>> severe limitation. We discover the table names and column families using
>> the HBase API. But the actual columns and their contents cannot be provided
>> by the API. So instead we simply expose the column families with a MAP data
>> type. The trouble with this is that the keys and values of the maps will
>> simply be byte-arrays ... Usually not very useful! But it's sort of the
>> only thing (as far as I can see) that's "safe" in HBase, since HBase allows
>> anything (byte arrays) in its columns.
>>
>
> I think we could maybe add a flag here to allow MetaModel to assume that
> column keys are of String type. That would at least make the discovered
> metadata more meaningful since we can expose columns and not just column
> families. It's still going to be tough to figure out the value types, but
> we could e.g. make the Column implementations mutable and allow setting
> ColumnType on a "live" HBaseColumn.
>
>
>> 2b) Like in e.g. the MongoDb or CouchDb modules you can provide an array of
>> tables (SimpleTableDef). That way the user defines the metadata himself and
>> the implementation assumes that it is correct (or else it will break). The
>> good thing about this is that the user can define the proper data types
>> etc. for columns. The user defines the column family and column name by
>> defining the MetaModel column name like this: "family:name"
>> (consistent with most HBase tools and API calls).
>>
>
> This is good, but requires more of the user.
>
>
>> 3) With regards to querying: We've implemented basic query capabilities
>> using the MetaModel query postprocessor. But not all queries are very
>> efficient... In addition to, of course, full table scans, we have optimized
>> support for COUNT queries and for table scans with maxRows.
>>
>> We could rather easily add optimized support for a couple of other typical
>> queries:
>> * lookup record by ID
>> * paged table scans (both firstRow and maxRows)
>> * queries with simple filters/where items
>>
>
> I think "lookup record by ID" is a MUST, since this is a whole other class
> of queries in HBase (Get instead of Scan).
>
> Other optimizations would be nice too, but for the usage I have I could
> live without them in the first release.
>
>
>> 4) With regards to dependencies: The module right now depends on the
>> artifact called "hbase-client". This dependency has a lot of transitive
>> dependencies, so the size of the module is quite extreme. As an example, it
>> includes stuff like jetty, jersey, jackson and of course hadoop... But I am
>> wondering if we can have a thinner client side than that! If anyone knows
>> if e.g. we can use the REST interface easily or so, that would maybe be
>> better. I'm not an expert on HBase though, so please enlighten me!
>>
>
> This is a big problem IMO. Anyone with HBase client experience? Would be a
> lot better with a thin client somehow.
>
>
>> Kind regards,
>> Kasper
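
For reference, the "family:name" column naming convention from point 2b can be sketched in plain Java roughly as below. This is only a minimal illustration, not code from the hbase-module branch: the class `HBaseColumnName` and its methods are hypothetical, and splitting at the first colon is an assumption based on HBase not permitting ':' in column family names.

```java
import java.util.Objects;

// Hypothetical helper for illustration only: NOT part of MetaModel or the
// HBase client API.
final class HBaseColumnName {

    private final String family;
    private final String qualifier;

    private HBaseColumnName(String family, String qualifier) {
        this.family = family;
        this.qualifier = qualifier;
    }

    // Splits a MetaModel column name of the form "family:name" at the first
    // colon (assumption: the family part itself never contains ':').
    static HBaseColumnName parse(String columnName) {
        Objects.requireNonNull(columnName, "columnName");
        int separator = columnName.indexOf(':');
        // Reject names with no colon, an empty family, or an empty column name.
        if (separator <= 0 || separator == columnName.length() - 1) {
            throw new IllegalArgumentException(
                    "Expected \"family:name\" but got: " + columnName);
        }
        return new HBaseColumnName(
                columnName.substring(0, separator),
                columnName.substring(separator + 1));
    }

    String getFamily() {
        return family;
    }

    String getQualifier() {
        return qualifier;
    }

    public static void main(String[] args) {
        HBaseColumnName name = HBaseColumnName.parse("cf1:first_name");
        // prints "cf1 / first_name"
        System.out.println(name.getFamily() + " / " + name.getQualifier());
    }
}
```

A user-supplied table definition could run each of its column names through something like `parse` up front, so that a malformed name fails early instead of breaking at query time.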
