Re: [DISCUSS] State of the work-in-progress HBase branch

Henry Saputra Mon, 27 Jan 2014 13:08:37 -0800

Sorry Kasper, I'm a bit busy and my schedule is hectic, so I have to punt my
response until later. Apologies for the delay.


- Henry

On Mon, Jan 27, 2014 at 12:18 PM, Kasper Sørensen
<[email protected]> wrote:
> OK to kick things off, let me provide my own input for this discussion.
> Please find below my thoughts on the issues and what we need to do. Your
> feedback is very very welcome.
>
>
> 2014-01-24 Kasper Sørensen <[email protected]>
>
>> Hi everyone,
>>
>> I was looking at our "hbase-module" branch and as much as I like this
>> idea, I think we've been a bit too idle with the branch. Maybe we should
>> try to make something final e.g. for a version 4.1.
>>
>> So I thought to give an overview/status of the module's current
>> capabilities and its shortcomings. We should figure out if we think this
>> is good enough for a first version, or if we want to do some improvements
>> to the module before adding it to our portfolio of MetaModel modules.
>>
>> 1) The module only offers read-only/query access to HBase. That is in my
>> opinion OK for now; we have several such modules, and write access is
>> something we are better off adding later, once we have straightened out
>> the remaining topics in this mail.
>>
>
> No problem
>
>
>> 2) With regard to metadata mapping: HBase is different because it has
>> both column families and, within column families, columns. For the sake
>> of our view on HBase I would describe column families simply as "a logical
>> grouping of columns". Column families are fixed within a table, but rows
>> in a table may contain arbitrary numbers of columns within each column
>> family. So...
>> You can instantiate the HBaseDataContext in two ways:
>>
>> 2a) You can let MetaModel discover the metadata. This unfortunately has a
>> severe limitation. We discover the table names and column families using
>> the HBase API. But the actual columns and their contents cannot be provided
>> by the API. So instead we simply expose the column families with a MAP data
>> type. The trouble with this is that the keys and values of the maps will
>> simply be byte arrays ... Usually not very useful! But it's sort of the
>> only thing (as far as I can see) that's "safe" in HBase, since HBase allows
>> anything (byte arrays) in its columns.
>>
>
> I think we could maybe add a flag here to allow MetaModel to assume that
> column keys are of String type. That would at least make the discovered
> metadata more meaningful since we can expose columns and not just column
> families. It's still going to be tough to figure out the value types, but
> we could e.g. make the Column implementations mutable and allow setting
> ColumnType on a "live" HBaseColumn.
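
A minimal sketch of the "assume String column keys" idea above, in plain Java
with no HBase dependency. The class and method names (StringKeyColumns,
withStringKeys) are hypothetical and not part of the branch, and UTF-8 is an
assumed encoding. Values are left as raw bytes because their types still
cannot be discovered.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class StringKeyColumns {

    // Hypothetical helper: decodes each raw column key as UTF-8 (an assumption)
    // and keeps the value bytes untouched.
    static Map<String, byte[]> withStringKeys(Map<byte[], byte[]> rawFamily) {
        Map<String, byte[]> decoded = new LinkedHashMap<>();
        for (Map.Entry<byte[], byte[]> entry : rawFamily.entrySet()) {
            String key = new String(entry.getKey(), StandardCharsets.UTF_8);
            decoded.put(key, entry.getValue());
        }
        return decoded;
    }

    public static void main(String[] args) {
        Map<byte[], byte[]> raw = new LinkedHashMap<>();
        raw.put("name".getBytes(StandardCharsets.UTF_8),
                "Kasper".getBytes(StandardCharsets.UTF_8));
        // The column keys are now readable; the values are still byte arrays.
        System.out.println(withStringKeys(raw).keySet());
    }
}
```

If the flag is off, the raw byte-array map would be exposed unchanged, which
keeps the "safe" default described in 2a.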
>
>
>> 2b) Like in e.g. the MongoDB or CouchDB modules, you can provide an array
>> of tables (SimpleTableDef). That way the user defines the metadata himself
>> and the implementation assumes that it is correct (or else it will break).
>> The good thing about this is that the user can define the proper data types
>> etc. for columns. The user defines the column family and column name by
>> defining the MetaModel column name like this: "family:name"
>> (consistent with most HBase tools and API calls).
>>
>
> This is good, but requires more of the user.
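
To illustrate the "family:name" convention from 2b, here is a small sketch of
how such a column name could be split. The class QualifiedColumnName is
hypothetical and not taken from the branch. It assumes the family part
contains no ':' and therefore splits at the first colon, so a column name
after the family may itself contain colons.

```java
public class QualifiedColumnName {

    final String family;
    final String qualifier;

    QualifiedColumnName(String family, String qualifier) {
        this.family = family;
        this.qualifier = qualifier;
    }

    // Splits "family:name" at the first ':'; rejects names with no colon
    // or with an empty family part.
    static QualifiedColumnName parse(String columnName) {
        int idx = columnName.indexOf(':');
        if (idx <= 0) {
            throw new IllegalArgumentException(
                    "Expected \"family:name\" but got: " + columnName);
        }
        return new QualifiedColumnName(
                columnName.substring(0, idx), columnName.substring(idx + 1));
    }

    public static void main(String[] args) {
        QualifiedColumnName c = parse("details:first_name");
        System.out.println(c.family + " / " + c.qualifier);
    }
}
```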
>
>
>> 3) With regard to querying: We've implemented basic query capabilities
>> using the MetaModel query postprocessor. But not all queries are very
>> efficient... In addition to full table scans, of course, we have optimized
>> support for COUNT queries and for table scans with maxRows.
>>
>> We could rather easily add optimized support for a couple of other typical
>> queries:
>>  * lookup record by ID
>>  * paged table scans (both firstRow and maxRows)
>>  * queries with simple filters/where items
>>
>
> I think "lookup record by ID" is a MUST, since this is a whole other class
> of queries in HBase (Get instead of Scan).
>
> Other optimizations would be nice too, but for the usage I have I could
> live without them in the first release.
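
A rough sketch of how the Get-versus-Scan decision could be made from a
query's where items. Everything here is hypothetical: the AccessPathChooser
class, the Filter stand-in for a where item, and the "_id" row key column
name are assumptions for illustration, not names from the branch or from the
HBase client. It only picks GET for a single equality filter on the row key
and falls back to SCAN for everything else.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class AccessPathChooser {

    enum AccessPath { GET, SCAN }

    // Hypothetical stand-in for a where item: column, operator, operand.
    static final class Filter {
        final String column;
        final String operator;
        final Object operand;

        Filter(String column, String operator, Object operand) {
            this.column = column;
            this.operator = operator;
            this.operand = operand;
        }
    }

    // GET only when the sole filter is an equality on the row key column;
    // any other shape of query is served by a scan plus postprocessing.
    static AccessPath choose(String rowKeyColumn, List<Filter> filters) {
        if (filters.size() == 1) {
            Filter f = filters.get(0);
            if (rowKeyColumn.equals(f.column) && "=".equals(f.operator)) {
                return AccessPath.GET;
            }
        }
        return AccessPath.SCAN;
    }

    public static void main(String[] args) {
        List<Filter> byId = Arrays.asList(new Filter("_id", "=", "row-42"));
        System.out.println(choose("_id", byId));
        System.out.println(choose("_id", Collections.<Filter>emptyList()));
    }
}
```

A fuller version could also use GET when the row key equality is combined
with other filters and apply those afterwards, but that is left out here to
keep the sketch short.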
>
>
>> 4) With regard to dependencies: The module right now depends on the
>> artifact called "hbase-client". This dependency has a lot of transitive
>> dependencies so the size of the module is quite extreme. As an example, it
>> includes stuff like jetty, jersey, jackson and of course hadoop... But I am
>> wondering if we can have a thinner client side than that! If anyone knows
>> if e.g. we can use the REST interface easily or so, that would maybe be
>> better. I'm not an expert on HBase though, so please enlighten me!
>>
>
> This is a big problem IMO. Anyone with HBase client experience? Would be a
> lot better with a thin client somehow.
>
>
>> Kind regards,
>> Kasper
>>
>>
>>
