Re: [DISCUSS] State of the work-in-progress HBase branch

Henry Saputra Mon, 24 Mar 2014 14:45:13 -0700

Ok +1

How do you propose to document this feature? As another page in the
doc svn repo?


- Henry

On Mon, Mar 24, 2014 at 2:42 PM, Kasper Sørensen
<[email protected]> wrote:
> Yep. Or in slightly more technical terms: It means that the
> HBaseDataContext only implements DataContext which has these two
> significant methods:
>
>  * getSchemas()
>  * executeQuery(...)
>
> (Plus a bunch more methods, but those two give you the general impression:
> Explore metadata and fire queries / reads)
> But not UpdateableDataContext, which has the write operations:
>
>  * executeUpdate(...)
>
> Regards,
> Kasper
>
>
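The split Kasper describes above, where HBaseDataContext implements the read interface but not the write interface, can be sketched in plain Java. This is only an illustration: SketchDataContext, SketchUpdateableDataContext and SketchHBaseContext are hypothetical stand-ins with simplified signatures, not the actual MetaModel types.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-ins for the interfaces discussed in the thread.
// The real MetaModel interfaces have more methods and richer types.
final class ReadOnlySketch {

    // Read side: explore metadata and fire queries.
    interface SketchDataContext {
        List<String> getSchemas();

        List<String> executeQuery(String query);
    }

    // Write side: adds update support on top of the read side.
    interface SketchUpdateableDataContext extends SketchDataContext {
        void executeUpdate(Runnable updateScript);
    }

    // A read-only context: it implements only the read interface.
    static final class SketchHBaseContext implements SketchDataContext {
        @Override
        public List<String> getSchemas() {
            return Collections.singletonList("hbase");
        }

        @Override
        public List<String> executeQuery(String query) {
            // Placeholder result; a real implementation would read from HBase.
            return Arrays.asList("row-1", "row-2");
        }
    }

    // Callers can detect write support by checking for the write interface.
    static boolean supportsUpdates(SketchDataContext context) {
        return context instanceof SketchUpdateableDataContext;
    }

    public static void main(String[] args) {
        SketchDataContext context = new SketchHBaseContext();
        System.out.println("schemas: " + context.getSchemas());
        System.out.println("supports updates: " + supportsUpdates(context));
    }
}
```

Because the read-only context never declares executeUpdate, a caller that needs writes finds out at compile time or through the instanceof check, rather than through a runtime failure.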
> 2014-03-24 22:37 GMT+01:00 Henry Saputra <[email protected]>:
>
>> Hmm, what does "read-only" mean here? That you can use it to read data
>> from HBase?
>>
>> - Henry
>>
>> On Mon, Mar 24, 2014 at 2:34 PM, Kasper Sørensen
>> <[email protected]> wrote:
>> > A quick update on this since the module has now been merged into the
>> > master branch:
>> >
>> > 1) Module is still read-only. This is accepted for now (unless someone
>> > wants to help change it of course).
>> >
>> > 2) Metadata mapping is still working in two modes: a) we discover the
>> > column families and expose them as byte-array maps (not very useful, but
>> > works as a "lowest common denominator") and b) the user provides a set of
>> > SimpleTableDef (which now has a convenient parser btw.:)) and gets his
>> > table mapping as he wants it.
>> >
>> > 3) Querying now has special support for lookup-by-id type queries where
>> > we will use HBase Get instead of Scan. We also have good support for
>> > LIMIT/"maxRows", but not OFFSET/"firstRow" (in those cases we will scan
>> > past the first records on the client side).
>> >
>> > 4) Dependencies still seem to be a pain. HBase and Hadoop come in many
>> > flavours and not all of them are compatible. I doubt there's a lot we can
>> > do about it, except ask the users to provide their own HBase dependency
>> > as per their backend version. We should probably thus make all our
>> > HBase/Hadoop dependencies <optional>true</optional> in order to not
>> > influence the typical clients.
>> >
>> > Kasper
>> >
>> >
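Point 3 above says that OFFSET/"firstRow" is handled by scanning past the first records on the client side. A minimal sketch of that idea follows, assuming a 1-based firstRow and using a plain java.util.Iterator in place of an HBase scanner. PagingSketch and page are hypothetical names, not part of the module.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper illustrating client-side OFFSET handling on top of a scan.
final class PagingSketch {

    // Returns at most maxRows rows, starting at the 1-based firstRow.
    static <T> List<T> page(Iterator<T> scan, int firstRow, int maxRows) {
        if (firstRow < 1 || maxRows < 0) {
            throw new IllegalArgumentException("firstRow must be >= 1 and maxRows >= 0");
        }
        // Skip the rows before firstRow. They are still read from the scan,
        // which is why this is less efficient than a server-side offset.
        for (int position = 1; position < firstRow && scan.hasNext(); position++) {
            scan.next();
        }
        // Collect rows until the limit is reached or the scan is exhausted.
        List<T> rows = new ArrayList<>();
        while (rows.size() < maxRows && scan.hasNext()) {
            rows.add(scan.next());
        }
        return rows;
    }
}
```

The limit stops the loop early, so only the skipped rows are wasted work; the cost of the offset grows with firstRow.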
>> > 2014-02-24 17:08 GMT+01:00 Kasper Sørensen <[email protected]>:
>> >
>> >> Hi Henry,
>> >>
>> >> Yea the Phoenix project is definitely an interesting approach to making
>> >> MM capable of working with HBase. The only downside to me is that it
>> >> seems they do a lot of intrusive stuff to HBase like creating new index
>> >> tables etc... I would normally not "allow" that for a simple connector.
>> >>
>> >> Maybe we should simply support both styles. And in the case of Phoenix,
>> >> I guess we could simply go through the JDBC module of MetaModel and
>> >> connect via their JDBC driver... Is that maybe a route, do you know?
>> >>
>> >> - Kasper
>> >>
>> >>
>> >> 2014-02-24 6:37 GMT+01:00 Henry Saputra <[email protected]>:
>> >>
>> >>> We could use the HBase client library from the store I suppose.
>> >>> What actually worries me is that adding real query support for a
>> >>> column-based datastore is kind of a big task.
>> >>> Apache Phoenix tried to do that, so maybe we could leverage the SQL
>> >>> planner layer to provide the implementation of the query execution
>> >>> for the HBase layer?
>> >>>
>> >>> - Henry
>> >>>
>> >>>
>> >>> On Mon, Feb 17, 2014 at 9:33 AM, Kasper Sørensen
>> >>> <[email protected]> wrote:
>> >>> > Thanks for the input Henry. With your experience, do you then also
>> >>> > happen to know of a good thin client-side library? I imagine that we
>> >>> > could maybe use a REST client instead of the full client we currently
>> >>> > use. That would save us a ton of dependency-overhead I think. Or is it
>> >>> > a non-issue in your mind, since HBase users are used to this overhead?
>> >>> >
>> >>> >
>> >>> > 2014-02-16 7:16 GMT+01:00 Henry Saputra <[email protected]>:
>> >>> >
>> >>> >> For 1 > I think adding read-only access to HBase should be ok because
>> >>> >> most updates to HBase go either through the HBase client or through
>> >>> >> REST via Stargate [1] or Thrift.
>> >>> >>
>> >>> >> For 2 > In Apache Gora we use Avro to do type mapping to columns and
>> >>> >> generate Java POJOs via the Avro compiler.
>> >>> >>
>> >>> >> For 3 > This is the one I am kinda torn on. Apache Phoenix
>> >>> >> (incubating) tries to provide SQL on HBase [2] via extra indexing and
>> >>> >> caching. I think this defeats the purpose of having NoSQL databases,
>> >>> >> which serve a different purpose than relational databases.
>> >>> >>
>> >>> >> I am not sure MetaModel should touch NoSQL databases that are more of
>> >>> >> the column-oriented type. These databases are designed for large data,
>> >>> >> with access primarily via key and not via a query mechanism.
>> >>> >>
>> >>> >> Just my 2 cents
>> >>> >>
>> >>> >>
>> >>> >> [1] http://wiki.apache.org/hadoop/Hbase/Stargate
>> >>> >> [2] http://phoenix.incubator.apache.org/
>> >>> >>
>> >>> >> On Fri, Jan 24, 2014 at 11:35 AM, Kasper Sørensen
>> >>> >> <[email protected]> wrote:
>> >>> >> > Hi everyone,
>> >>> >> >
>> >>> >> > I was looking at our "hbase-module" branch and as much as I like
>> >>> >> > this idea, I think we've been a bit too idle with the branch. Maybe
>> >>> >> > we should try to make something final e.g. for a version 4.1.
>> >>> >> > So I thought to give an overview/status of the module's current
>> >>> >> > capabilities and its shortcomings. We should figure out if we think
>> >>> >> > this is good enough for a first version, or if we want to do some
>> >>> >> > improvements to the module before adding it to our portfolio of
>> >>> >> > MetaModel modules.
>> >>> >> >
>> >>> >> > 1) The module only offers read-only/query access to HBase. That is
>> >>> >> > in my opinion OK for now; we have several such modules, and write
>> >>> >> > support is something we can better add later if we straighten out
>> >>> >> > the remaining topics in this mail.
>> >>> >> >
>> >>> >> > 2) With regards to metadata mapping: HBase is different because it
>> >>> >> > has both column families and, within column families, columns. For
>> >>> >> > the sake of our view on HBase I would describe column families
>> >>> >> > simply as "a logical grouping of columns". Column families are fixed
>> >>> >> > within a table, but rows in a table may contain arbitrary numbers of
>> >>> >> > columns within each column family. So... You can instantiate the
>> >>> >> > HBaseDataContext in two ways:
>> >>> >> >
>> >>> >> > 2a) You can let MetaModel discover the metadata. This unfortunately
>> >>> >> > has a severe limitation. We discover the table names and column
>> >>> >> > families using the HBase API. But the actual columns and their
>> >>> >> > contents cannot be provided by the API. So instead we simply expose
>> >>> >> > the column families with a MAP data type. The trouble with this is
>> >>> >> > that the keys and values of the maps will simply be byte-arrays ...
>> >>> >> > Usually not very useful! But it's sort of the only thing (as far as
>> >>> >> > I can see) that's "safe" in HBase, since HBase allows anything (byte
>> >>> >> > arrays) in its columns.
>> >>> >> >
>> >>> >> > 2b) Like in e.g. the MongoDb or CouchDb modules you can provide an
>> >>> >> > array of tables (SimpleTableDef). That way the user defines the
>> >>> >> > metadata himself and the implementation assumes that it is correct
>> >>> >> > (or else it will break). The good thing about this is that the user
>> >>> >> > can define the proper data types etc. for columns. The user defines
>> >>> >> > the column family and column name by defining the MetaModel column
>> >>> >> > name as this: "family:name" (consistent with most HBase tools and
>> >>> >> > API calls).
>> >>> >> >
>> >>> >> > 3) With regards to querying: We've implemented basic query
>> >>> >> > capabilities using the MetaModel query postprocessor. But not all
>> >>> >> > queries are very efficient... In addition to of course full table
>> >>> >> > scans, we have optimized support for COUNT queries and for table
>> >>> >> > scans with maxRows.
>> >>> >> >
>> >>> >> > We could rather easily add optimized support for a couple of other
>> >>> >> > typical queries:
>> >>> >> >  * lookup record by ID
>> >>> >> >  * paged table scans (both firstRow and maxRows)
>> >>> >> >  * queries with simple filters/where items
>> >>> >> >
>> >>> >> > 4) With regards to dependencies: The module right now depends on
>> >>> >> > the artifact called "hbase-client". This dependency has a lot of
>> >>> >> > transitive dependencies so the size of the module is quite extreme.
>> >>> >> > As an example, it includes stuff like jetty, jersey, jackson and of
>> >>> >> > course hadoop... But I am wondering if we can have a thinner client
>> >>> >> > side than that! If anyone knows whether e.g. we can use the REST
>> >>> >> > interface easily or so, that would maybe be better. I'm not an
>> >>> >> > expert on HBase though, so please enlighten me!
>> >>> >> >
>> >>> >> > Kind regards,
>> >>> >> > Kasper
>> >>> >>
>> >>>
>> >>
>> >>
>>
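The "family:name" column naming from point 2b of the original mail can be illustrated with a small helper that splits a MetaModel column name into an HBase column family and a column qualifier. This is a sketch under assumptions: ColumnNameSketch and splitFamilyAndQualifier are hypothetical names, and splitting at the first colon assumes the family itself contains no colon.

```java
// Hypothetical helper illustrating the "family:name" column naming convention.
final class ColumnNameSketch {

    // Splits "family:name" into { family, qualifier } at the first colon.
    static String[] splitFamilyAndQualifier(String columnName) {
        int separator = columnName.indexOf(':');
        if (separator <= 0) {
            // No colon, or nothing before it: the column family is missing.
            throw new IllegalArgumentException(
                    "Expected \"family:name\" but got: " + columnName);
        }
        String family = columnName.substring(0, separator);
        // Everything after the first colon is kept, including further colons.
        String qualifier = columnName.substring(separator + 1);
        return new String[] { family, qualifier };
    }

    public static void main(String[] args) {
        String[] parts = splitFamilyAndQualifier("details:first_name");
        System.out.println("family=" + parts[0] + ", qualifier=" + parts[1]);
    }
}
```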
