Re: [DISCUSS] State of the work-in-progress HBase branch

Henry Saputra Mon, 27 Jan 2014 13:08:37 -0800

Kasper, sorry typo =)


On Mon, Jan 27, 2014 at 1:07 PM, Henry Saputra <[email protected]> wrote:
> Sorry Kapser, a bit busy and hectic with my schedule so I have to punt my
> response until later. Apologies for the delay.
>
> - Henry
>
> On Mon, Jan 27, 2014 at 12:18 PM, Kasper Sørensen
> <[email protected]> wrote:
>> OK to kick things off, let me provide my own input for this discussion.
>> Please find below my thoughts on the issues and what we need to do. Your
>> feedback is very very welcome.
>>
>>
>> 2014-01-24 Kasper Sørensen <[email protected]>
>>
>>> Hi everyone,
>>>
>>> I was looking at our "hbase-module" branch and as much as I like this
>>> idea, I think we've been a bit too idle with the branch. Maybe we should
>>> try to make something final e.g. for a version 4.1.
>>>
>>> So I thought to give an overview/status of the module's current
>>> capabilities and its shortcomings. We should figure out if we think this
>>> is good enough for a first version, or if we want to do some improvements
>>> to the module before adding it to our portfolio of MetaModel modules.
>>>
>>> 1) The module only offers read-only/query access to HBase. That is in my
>>> opinion OK for now, we have several such modules, and this is something we
>>> are better off adding later, once we straighten out the remaining topics
>>> in this mail.
>>>
>>
>> No problem
>>
>>
>>> 2) With regards to metadata mapping: HBase is different because it has
>>> both column families and in column families there are columns. For the sake
>>> of our view on HBase I would describe column families simply as "a logical
>>> grouping of columns". Column families are fixed within a table, but rows in
>>> a table may contain arbitrary numbers of columns within each column family.
>>> So... You can instantiate the HBaseDataContext in two ways:
>>>
>>> 2a) You can let MetaModel discover the metadata. This unfortunately has a
>>> severe limitation. We discover the table names and column families using
>>> the HBase API. But the actual columns and their contents cannot be provided
>>> by the API. So instead we simply expose the column families with a MAP data
>>> type. The trouble with this is that the keys and values of the maps will
>>> simply be byte-arrays ... Usually not very useful! But it's sort of the
>>> only thing (as far as I can see) that's "safe" in HBase, since HBase allows
>>> anything (byte arrays) in its columns.
>>>
>>
>> I think we could maybe add a flag here to allow MetaModel to assume that
>> column keys are of String type. That would at least make the discovered
>> metadata more meaningful since we can expose columns and not just column
>> families. It's still going to be tough to figure out the value types, but
>> we could e.g. make the Column implementations mutable and allow setting
>> ColumnType on a "live" HBaseColumn.
>>
>>
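To make the "assume String column keys" flag from the reply above concrete, here is a minimal sketch in plain Java. It is not code from the hbase-module branch: the class name ColumnKeyDecoder, the method withStringKeys and the choice of UTF-8 are all assumptions made for illustration. It only decodes the keys of one column-family map and leaves the values as raw bytes, since the value types remain unknown.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of MetaModel or the HBase client API.
final class ColumnKeyDecoder {

    private ColumnKeyDecoder() {
    }

    /**
     * Converts a column-family map keyed by raw byte arrays into a map keyed
     * by Strings. Assumes every key is UTF-8 encoded text; values are passed
     * through untouched because their types cannot be discovered.
     */
    static Map<String, byte[]> withStringKeys(Map<byte[], byte[]> rawFamily) {
        // LinkedHashMap keeps the iteration order of the input map.
        Map<String, byte[]> decoded = new LinkedHashMap<>();
        for (Map.Entry<byte[], byte[]> entry : rawFamily.entrySet()) {
            String key = new String(entry.getKey(), StandardCharsets.UTF_8);
            decoded.put(key, entry.getValue());
        }
        return decoded;
    }
}
```

If the assumption does not hold for a table (keys that are not text), the decoded names will be garbage, which is why this would have to be opt-in.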
>>> 2b) Like in e.g. the MongoDB or CouchDB modules you can provide an array of
>>> tables (SimpleTableDef). That way the user defines the metadata himself and
>>> the implementation assumes that it is correct (or else it will break). The
>>> good thing about this is that the user can define the proper data types
>>> etc. for columns. The user defines the column family and column name by
>>> defining the MetaModel column name like this: "family:name"
>>> (consistent with most HBase tools and API calls).
>>>
>>
>> This is good, but requires more of the user.
>>
>>
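The "family:name" convention from 2b can be sketched as a small parser. This is an illustration only, not the module's actual code: the class QualifiedColumnName and its parse method are hypothetical names. It splits on the first colon, on the assumption that a family name never contains a colon while the column name after it may.

```java
// Hypothetical value type, not part of MetaModel or the HBase client API.
final class QualifiedColumnName {

    final String family;
    final String qualifier;

    private QualifiedColumnName(String family, String qualifier) {
        this.family = family;
        this.qualifier = qualifier;
    }

    /**
     * Splits a MetaModel column name of the form "family:name" at the first
     * colon. Anything after that colon, including further colons, is treated
     * as the column name within the family.
     */
    static QualifiedColumnName parse(String columnName) {
        int separator = columnName.indexOf(':');
        if (separator <= 0) {
            // No colon at all, or an empty family part before the colon.
            throw new IllegalArgumentException(
                    "Expected \"family:name\" but got: " + columnName);
        }
        String family = columnName.substring(0, separator);
        String qualifier = columnName.substring(separator + 1);
        return new QualifiedColumnName(family, qualifier);
    }
}
```

With this, QualifiedColumnName.parse("info:email") yields the family "info" and the column name "email", while a name without a family part is rejected early instead of breaking at query time.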
>>> 3) With regards to querying: We've implemented basic query capabilities
>>> using the MetaModel query postprocessor. But not all queries are very
>>> efficient... In addition to of course full table scans, we have optimized
>>> support for COUNT queries and for table scans with maxRows.
>>>
>>> We could rather easily add optimized support for a couple of other typical
>>> queries:
>>>  * lookup record by ID
>>>  * paged table scans (both firstRow and maxRows)
>>>  * queries with simple filters/where items
>>>
>>
>> I think "lookup record by ID" is a MUST, since this is a whole other class
>> of queries in HBase (Get instead of Scan).
>>
>> Other optimizations would be nice too, but for the usage I have I could
>> live without it in the first release.
>>
>>
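The Get-versus-Scan distinction above could be decided before any HBase call is made. The sketch below is a simplified illustration in plain Java, not MetaModel's query model: AccessPathChooser, WhereItem, AccessPath and the "_id" row-key column name used in the usage note are all hypothetical. It picks GET only when the query has exactly one where item and that item is an equality test on the row-key column; everything else falls back to SCAN plus postprocessing.

```java
import java.util.List;

// Hypothetical planner sketch; none of these types exist in MetaModel or HBase.
final class AccessPathChooser {

    enum AccessPath { GET, SCAN }

    /** Simplified stand-in for a where item: column, operator, operand. */
    static final class WhereItem {
        final String column;
        final String operator;
        final Object operand;

        WhereItem(String column, String operator, Object operand) {
            this.column = column;
            this.operator = operator;
            this.operand = operand;
        }
    }

    private AccessPathChooser() {
    }

    /**
     * Returns GET when the only where item is "rowKeyColumn = value" with a
     * non-null value, and SCAN in every other case.
     */
    static AccessPath choose(List<WhereItem> whereItems, String rowKeyColumn) {
        if (whereItems.size() == 1) {
            WhereItem item = whereItems.get(0);
            boolean onRowKey = rowKeyColumn.equals(item.column);
            boolean isEquality = "=".equals(item.operator);
            if (onRowKey && isEquality && item.operand != null) {
                return AccessPath.GET;
            }
        }
        return AccessPath.SCAN;
    }
}
```

For example, a single where item ("_id", "=", "row-1") with "_id" as the row-key column selects GET, while an empty where clause or a filter on "info:email" selects SCAN. The rule is deliberately conservative: a row-key equality combined with other filters could also be served by a Get followed by filtering, but that is left out here.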
>>> 4) With regards to dependencies: The module right now depends on the
>>> artifact called "hbase-client". This dependency has a lot of transitive
>>> dependencies so the size of the module is quite extreme. As an example, it
>>> includes stuff like jetty, jersey, jackson and of course hadoop... But I am
>>> wondering if we can have a thinner client side than that! If anyone knows
>>> if e.g. we can use the REST interface easily or so, that would maybe be
>>> better. I'm not an expert on HBase though, so please enlighten me!
>>>
>>
>> This is a big problem IMO. Anyone with HBase client experience? Would be a
>> lot better with a thin client somehow.
>>
>>
>>> Kind regards,
>>> Kasper
>>>
>>>
>>>
