Kasper, sorry typo =)
On Mon, Jan 27, 2014 at 1:07 PM, Henry Saputra <[email protected]> wrote:
> Sorry Kapser, a bit busy and hectic with my schedule so I have to punt my
> response to later. Apologies for the delay.
>
> - Henry
>
> On Mon, Jan 27, 2014 at 12:18 PM, Kasper Sørensen
> <[email protected]> wrote:
>> OK to kick things off, let me provide my own input for this discussion.
>> Please find below my thoughts on the issues and what we need to do. Your
>> feedback is very very welcome.
>>
>>
>> 2014-01-24 Kasper Sørensen <[email protected]>
>>
>>> Hi everyone,
>>>
>>> I was looking at our "hbase-module" branch and as much as I like this
>>> idea, I think we've been a bit too idle with the branch. Maybe we should
>>> try to make something final e.g. for a version 4.1.
>>>
>>> So I thought to give an overview/status of the module's current
>>> capabilities and its shortcomings. We should figure out if we think this
>>> is good enough for a first version, or if we want to do some improvements
>>> to the module before adding it to our portfolio of MetaModel modules.
>>>
>>> 1) The module only offers read-only/query access to HBase. That is in my
>>> opinion OK for now, we have several such modules, and this is something we
>>> can better add later if we straighten out the remaining topics in this mail.
>>>
>>
>> No problem
>>
>>
>>> 2) With regards to metadata mapping: HBase is different because it has
>>> both column families and in column families there are columns. For the sake
>>> of our view on HBase I would describe column families simply as "a logical
>>> grouping of columns". Column families are fixed within a table, but rows in
>>> a table may contain arbitrary numbers of columns within each column family.
>>> So... You can instantiate the HBaseDataContext in two ways:
>>>
>>> 2a) You can let MetaModel discover the metadata. This unfortunately has a
>>> severe limitation. We discover the table names and column families using
>>> the HBase API.
>>> But the actual columns and their contents cannot be provided
>>> by the API. So instead we simply expose the column families as a MAP data
>>> type. The trouble with this is that the keys and values of the maps will
>>> simply be byte-arrays ... Usually not very useful! But it's sort of the
>>> only thing (as far as I can see) that's "safe" in HBase, since HBase allows
>>> anything (byte arrays) in its columns.
>>>
>>
>> I think we could maybe add a flag here to allow MetaModel to assume that
>> column keys are of String type. That would at least make the discovered
>> metadata more meaningful since we can expose columns and not just column
>> families. It's still going to be tough to figure out the value types, but
>> we could e.g. make the Column implementations mutable and allow setting
>> ColumnType on a "live" HBaseColumn.
>>
>>
>>> 2b) Like in e.g. MongoDb or CouchDb modules you can provide an array of
>>> tables (SimpleTableDef). That way the user defines the metadata himself and
>>> the implementation assumes that it is correct (or else it will break). The
>>> good thing about this is that the user can define the proper data types
>>> etc. for columns. The user defines the column family and column name by
>>> defining the MetaModel column name as this: "family:name"
>>> (consistent with most HBase tools and API calls).
>>>
>>
>> This is good, but requires more of the user.
>>
>>
>>> 3) With regards to querying: We've implemented basic query capabilities
>>> using the MetaModel query postprocessor. But not all queries are very
>>> efficient... In addition to of course full table scans, we have optimized
>>> support of COUNT queries and of table scans with maxRows.
>>>
>>> We could rather easily add optimized support for a couple of other typical
>>> queries:
>>> * lookup record by ID
>>> * paged table scans (both firstRow and maxRows)
>>> * queries with simple filters/where items
>>>
>>
>> I think "lookup record by ID" is a MUST, since this is a whole other class
>> of queries in HBase (Get instead of Scan).
>>
>> Other optimizations would be nice too, but for the usage I have I could
>> live without them in the first release.
>>
>>
>>> 4) With regards to dependencies: The module right now depends on the
>>> artifact called "hbase-client". This dependency has a lot of transitive
>>> dependencies so the size of the module is quite extreme. As an example, it
>>> includes stuff like jetty, jersey, jackson and of course hadoop... But I am
>>> wondering if we can have a thinner client side than that! If anyone knows
>>> if e.g. we can use the REST interface easily or so, that would maybe be
>>> better. I'm not an expert on HBase though, so please enlighten me!
>>>
>>
>> This is a big problem IMO. Anyone with HBase client experience? Would be a
>> lot better with a thin client somehow.
>>
>>
>>> Kind regards,
>>> Kasper
>>>
>>>
>>>
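
For reference, the "family:name" convention from 2b) could be handled with a
small helper along these lines. This is only a sketch: the class
FamilyQualifiedName and its methods are made up for illustration and are not
part of MetaModel or the HBase client. It assumes the family part contains no
colon and that names are encoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical helper (not a MetaModel or HBase class): splits a MetaModel
 * column name of the form "family:name" into column family and qualifier.
 */
final class FamilyQualifiedName {
    final String family;
    final String qualifier;

    private FamilyQualifiedName(String family, String qualifier) {
        this.family = family;
        this.qualifier = qualifier;
    }

    /** Splits at the first colon; everything after it is the qualifier. */
    static FamilyQualifiedName parse(String columnName) {
        int idx = columnName.indexOf(':');
        if (idx <= 0) {
            // No colon, or an empty family part: not a usable "family:name".
            throw new IllegalArgumentException(
                    "Expected \"family:name\" but got: " + columnName);
        }
        return new FamilyQualifiedName(
                columnName.substring(0, idx),
                columnName.substring(idx + 1));
    }

    /** Family as bytes (UTF-8 is an assumption of this sketch). */
    byte[] familyBytes() {
        return family.getBytes(StandardCharsets.UTF_8);
    }

    /** Qualifier as bytes (UTF-8 is an assumption of this sketch). */
    byte[] qualifierBytes() {
        return qualifier.getBytes(StandardCharsets.UTF_8);
    }
}
```

A column declared as "info:first_name" would then yield the family "info" and
the qualifier "first_name", while a name without a colon is rejected up front
instead of breaking later at query time.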
