Re: [DISCUSS] State of the work-in-progress HBase branch

Henry Saputra Mon, 24 Mar 2014 14:38:21 -0700

Hmm, what do you mean by read-only? That you can use it only to read data from HBase?


- Henry

On Mon, Mar 24, 2014 at 2:34 PM, Kasper Sørensen
<[email protected]> wrote:
> A quick update on this since the module has now been merged into the master
> branch:
>
> 1) Module is still read-only. This is accepted for now (unless someone
> wants to help change it of course).
>
> 2) Metadata mapping is still working in two modes: a) we discover the
> column families and expose them as byte-array maps (not very useful, but
> works as a "lowest common denominator") and b) the user provides a set of
> SimpleTableDef (which now has a convenient parser btw.:)) and gets his
> table mapping as he wants it.
>
> 3) Querying now has special support for lookup-by-id type queries where we
> will use HBase Get instead of Scan. We also have good support for
> LIMIT/"maxRows", but not OFFSET/"firstRow" (in those cases we will scan
> past the first records on the client side).
>
> 4) Dependencies still seem to be a pain. HBase and Hadoop come in many
> flavours and not all of them are compatible. I doubt there's a lot we can
> do about it, except ask the users to provide their own HBase dependency as
> per their backend version. We should probably thus make all our
> HBase/Hadoop dependencies <optional>true</optional> in order to not
> influence the typical clients.
>
> Kasper
>
>
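Point 3 above says that OFFSET/"firstRow" is handled by scanning past the first records on the client side. A minimal sketch of that idea in plain Java, assuming a 1-based firstRow and using a hypothetical helper named ClientSidePaging (not the module's actual code, and not an HBase or MetaModel API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not taken from MetaModel or HBase.
final class ClientSidePaging {

    // Returns at most maxRows rows, starting at the 1-based firstRow.
    // The skipped rows are still read from the iterator, which is why
    // this approach costs more than a server-side offset would.
    static <T> List<T> page(Iterator<T> rows, int firstRow, int maxRows) {
        if (firstRow < 1 || maxRows < 0) {
            throw new IllegalArgumentException("firstRow must be >= 1 and maxRows >= 0");
        }
        // Scan past the records that come before firstRow.
        for (int position = 1; position < firstRow && rows.hasNext(); position++) {
            rows.next();
        }
        // Collect rows until the limit is reached or the input runs out.
        List<T> result = new ArrayList<>();
        while (result.size() < maxRows && rows.hasNext()) {
            result.add(rows.next());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> scanned = Arrays.asList("row1", "row2", "row3", "row4", "row5");
        System.out.println(page(scanned.iterator(), 3, 2)); // prints [row3, row4]
    }
}
```

The limit can stop the scan early, but the offset cannot avoid reading the skipped rows, which matches the "good support for LIMIT, but not OFFSET" remark.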
> 2014-02-24 17:08 GMT+01:00 Kasper Sørensen <[email protected]>:
>
>> Hi Henry,
>>
>> Yea the Phoenix project is definitely an interesting approach to making MM
>> capable of working with HBase. The only downside to me is that it seems
>> they do a lot of intrusive stuff to HBase like creating new index tables
>> etc... I would normally not "allow" that for a simple connector.
>>
>> Maybe we should simply support both styles. And in the case of Phoenix, I
>> guess we could simply go through the JDBC module of MetaModel and connect
>> via their JDBC driver... Is that maybe a route, do you know?
>>
>> - Kasper
>>
>>
>> 2014-02-24 6:37 GMT+01:00 Henry Saputra <[email protected]>:
>>
>>> We could use the HBase client library from the store I suppose.
>>> What actually worries me is that adding real query support for a
>>> column-based datastore is kind of a big task.
>>> Apache Phoenix tried to do that, so maybe we could leverage the SQL
>>> planner layer to provide the implementation of query execution for the
>>> HBase layer?
>>>
>>> - Henry
>>>
>>>
>>> On Mon, Feb 17, 2014 at 9:33 AM, Kasper Sørensen
>>> <[email protected]> wrote:
>>> > Thanks for the input Henry. With your experience, do you then also
>>> > happen to know of a good thin client-side library? I imagine that we
>>> > could maybe use a REST client instead of the full client we currently
>>> > use. That would save us a ton of dependency-overhead I think. Or is it
>>> > a non-issue in your mind, since HBase users are used to this overhead?
>>> >
>>> >
>>> > 2014-02-16 7:16 GMT+01:00 Henry Saputra <[email protected]>:
>>> >
>>> >> For 1 > I think read-only access to HBase should be OK, because most
>>> >> updates to HBase go either through the HBase client, REST via
>>> >> Stargate [1], or Thrift.
>>> >>
>>> >> For 2 > In Apache Gora we use Avro to map types to columns and
>>> >> generate Java POJOs via the Avro compiler.
>>> >>
>>> >> For 3 > This is the one I am kinda torn on. Apache Phoenix
>>> >> (incubating) tries to provide SQL on HBase [2] via extra indexing and
>>> >> caching. I think this defeats the purpose of having NoSQL databases,
>>> >> which serve a different purpose than relational databases.
>>> >>
>>> >> I am not sure MetaModel should touch NoSQL databases of the
>>> >> column-oriented type. These databases are designed for large data with
>>> >> access primarily via key and not via a query mechanism.
>>> >>
>>> >> Just my 2 cents
>>> >>
>>> >>
>>> >> [1] http://wiki.apache.org/hadoop/Hbase/Stargate
>>> >> [2] http://phoenix.incubator.apache.org/
>>> >>
>>> >> On Fri, Jan 24, 2014 at 11:35 AM, Kasper Sørensen
>>> >> <[email protected]> wrote:
>>> >> > Hi everyone,
>>> >> >
>>> >> > I was looking at our "hbase-module" branch and as much as I like this
>>> >> > idea, I think we've been a bit too idle with the branch. Maybe we
>>> >> > should try to make something final e.g. for a version 4.1.
>>> >> >
>>> >> > So I thought I'd give an overview/status of the module's current
>>> >> > capabilities and its shortcomings. We should figure out if we think
>>> >> > this is good enough for a first version, or if we want to do some
>>> >> > improvements to the module before adding it to our portfolio of
>>> >> > MetaModel modules.
>>> >> >
>>> >> > 1) The module only offers read-only/query access to HBase. That is in
>>> >> > my opinion OK for now; we have several such modules, and write access
>>> >> > is something we can better add later if we straighten out the
>>> >> > remaining topics in this mail.
>>> >> >
>>> >> > 2) With regards to metadata mapping: HBase is different because it
>>> >> > has both column families and, within column families, columns. For
>>> >> > the sake of our view on HBase I would describe column families simply
>>> >> > as "a logical grouping of columns". Column families are fixed within
>>> >> > a table, but rows in a table may contain arbitrary numbers of columns
>>> >> > within each column family. So... You can instantiate the
>>> >> > HBaseDataContext in two ways:
>>> >> >
>>> >> > 2a) You can let MetaModel discover the metadata. This unfortunately
>>> >> > has a severe limitation. We discover the table names and column
>>> >> > families using the HBase API. But the actual columns and their
>>> >> > contents cannot be provided by the API. So instead we simply expose
>>> >> > the column families with a MAP data type. The trouble with this is
>>> >> > that the keys and values of the maps will simply be byte-arrays ...
>>> >> > Usually not very useful! But it's sort of the only thing (as far as I
>>> >> > can see) that's "safe" in HBase, since HBase allows anything (byte
>>> >> > arrays) in its columns.
>>> >> >
>>> >> > 2b) Like in e.g. the MongoDb or CouchDb modules you can provide an
>>> >> > array of tables (SimpleTableDef). That way the user defines the
>>> >> > metadata himself and the implementation assumes that it is correct
>>> >> > (or else it will break). The good thing about this is that the user
>>> >> > can define the proper data types etc. for columns. The user defines
>>> >> > the column family and column name by defining the MetaModel column
>>> >> > name like this: "family:name" (consistent with most HBase tools and
>>> >> > API calls).
>>> >> >
>>> >> > 3) With regards to querying: We've implemented basic query
>>> >> > capabilities using the MetaModel query postprocessor. But not all
>>> >> > queries are very efficient... In addition to, of course, full table
>>> >> > scans, we have optimized support for COUNT queries and for table
>>> >> > scans with maxRows.
>>> >> >
>>> >> > We could rather easily add optimized support for a couple of other
>>> >> > typical queries:
>>> >> >  * lookup record by ID
>>> >> >  * paged table scans (both firstRow and maxRows)
>>> >> >  * queries with simple filters/where items
>>> >> >
>>> >> > 4) With regards to dependencies: The module right now depends on the
>>> >> > artifact called "hbase-client". This dependency has a lot of
>>> >> > transitive dependencies so the size of the module is quite extreme.
>>> >> > As an example, it includes stuff like jetty, jersey, jackson and of
>>> >> > course hadoop... But I am wondering if we can have a thinner client
>>> >> > side than that! If anyone knows if e.g. we can use the REST interface
>>> >> > easily or so, that would maybe be better. I'm not an expert on HBase
>>> >> > though, so please enlighten me!
>>> >> >
>>> >> > Kind regards,
>>> >> > Kasper
>>> >>
>>>
>>
>>
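
Point 2b of the original mail describes naming a MetaModel column as "family:name" to identify both the HBase column family and the column within it. A rough sketch of how such a name could be split into the two byte arrays an HBase lookup needs. The class FamilyColumnName is hypothetical and is not the module's implementation; splitting on the first colon and encoding as UTF-8 are both assumptions made for the example:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of MetaModel or HBase.
final class FamilyColumnName {
    final byte[] family;
    final byte[] qualifier;

    private FamilyColumnName(byte[] family, byte[] qualifier) {
        this.family = family;
        this.qualifier = qualifier;
    }

    // Splits "family:name" on the first colon (an assumption), so any
    // further colons end up in the column name part.
    static FamilyColumnName parse(String columnName) {
        int separator = columnName.indexOf(':');
        if (separator <= 0) {
            throw new IllegalArgumentException(
                    "Expected \"family:name\" but got: " + columnName);
        }
        String family = columnName.substring(0, separator);
        String qualifier = columnName.substring(separator + 1);
        return new FamilyColumnName(
                family.getBytes(StandardCharsets.UTF_8),
                qualifier.getBytes(StandardCharsets.UTF_8));
    }

    String familyAsString() {
        return new String(family, StandardCharsets.UTF_8);
    }

    String qualifierAsString() {
        return new String(qualifier, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        FamilyColumnName column = parse("details:first_name");
        System.out.println(column.familyAsString() + " / " + column.qualifierAsString());
    }
}
```

A name without a family part (no colon, or a leading colon) is rejected here, which mirrors the "or else it will break" caveat in the mail.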
