Some projects do link back from the homepage to wiki pages. I think the key thing is to have separate docs for each release.
What do you think?

- Henry

On Tue, Mar 25, 2014 at 4:47 AM, Kasper Sørensen <[email protected]> wrote:

Hmm, was kinda hoping we wouldn't have to... But that's just because I am lazy and I prefer "live" (editable online) documentation where possible (that way you can easily react if someone starts pointing at missing parts). I think either way is doable, but you're right that in case we use wiki pages, each wiki page should clearly state which versions it applies to, if it is version-specific.

2014-03-24 23:03 GMT+01:00 Henry Saputra <[email protected]>:

Hmm, seems like we need to bundle the docs with each release. For example, 4.0.0 does not have the HBase store.

Most projects have docs for each release on top of the project homepage, like ZooKeeper http://zookeeper.apache.org/doc/r3.4.6/ or Spark http://spark.apache.org/docs/0.9.0/

Thoughts?

- Henry

On Mon, Mar 24, 2014 at 2:50 PM, Kasper Sørensen <[email protected]> wrote:

Hmm, I suppose a wiki page would be good. We have wiki pages for some of the DataContext implementations already, like Salesforce [1], POJO [2] and Composite [3]... Maybe we should even have a page for *every* DataContext implementation there is, simply for completeness and referenceability of the documentation.

[1] http://wiki.apache.org/metamodel/examples/SalesforceDataContext
[2] http://wiki.apache.org/metamodel/examples/PojoDataContext
[3] http://wiki.apache.org/metamodel/examples/CompositeDataContext

2014-03-24 22:44 GMT+01:00 Henry Saputra <[email protected]>:

Ok, +1.

How do you propose to document this feature? As another page in the doc svn repo?

- Henry

On Mon, Mar 24, 2014 at 2:42 PM, Kasper Sørensen <[email protected]> wrote:

Yep.
Or in slightly more technical terms: it means that the HBaseDataContext only implements DataContext, which has these two significant methods:

* getSchemas()
* executeQuery(...)

(Plus a bunch more methods, but those two give you the general impression: explore metadata and fire queries/reads.) But it does not implement UpdateableDataContext, which has the write operation:

* executeUpdate(...)

Regards,
Kasper

2014-03-24 22:37 GMT+01:00 Henry Saputra <[email protected]>:

Hmm, what does it mean that it's read-only? You can use it to read data from HBase?

- Henry

On Mon, Mar 24, 2014 at 2:34 PM, Kasper Sørensen <[email protected]> wrote:

A quick update on this since the module has now been merged into the master branch:

1) The module is still read-only. This is accepted for now (unless someone wants to help change it, of course).

2) Metadata mapping still works in two modes: a) we discover the column families and expose them as byte-array maps (not very useful, but it works as a "lowest common denominator"), and b) the user provides a set of SimpleTableDefs (which now has a convenient parser, btw :)) and gets his table mapping as he wants it.

3) Querying now has special support for lookup-by-id type queries, where we will use an HBase Get instead of a Scan. We also have good support for LIMIT/"maxRows", but not OFFSET/"firstRow" (in those cases we will scan past the first records on the client side).

4) Dependencies still seem to be a pain. HBase and Hadoop come in many flavours, and not all of them are compatible.
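[Editor's note: the DataContext/UpdateableDataContext split described earlier in this mail can be sketched as follows. The method signatures are simplified placeholders (the real MetaModel API uses Query and Schema types rather than plain strings), so treat this as a shape sketch of the idea, not the actual API.]

```java
// Sketch of the interface split: reads live on DataContext, writes on a
// separate sub-interface. Signatures are hypothetical simplifications.
interface DataContext {
    String[] getSchemas();             // explore metadata
    Object executeQuery(String query); // fire queries / reads
}

// Write support is opt-in. A connector that only implements DataContext
// (like HBaseDataContext at this point) is read-only by construction.
interface UpdateableDataContext extends DataContext {
    void executeUpdate(String update);
}

class ReadOnlyCheck {
    // Callers can detect write support with a plain instanceof check.
    static boolean isUpdateable(DataContext dc) {
        return dc instanceof UpdateableDataContext;
    }
}
```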
I doubt there's a lot we can do about it, except ask users to provide their own HBase dependency matching their backend version. We should probably thus make all our HBase/Hadoop dependencies <optional>true</optional> in order not to influence typical clients.

Kasper

2014-02-24 17:08 GMT+01:00 Kasper Sørensen <[email protected]>:

Hi Henry,

Yeah, the Phoenix project is definitely an interesting approach to making MM capable of working with HBase. The only downside to me is that they seem to do a lot of intrusive stuff to HBase, like creating new index tables etc... I would normally not "allow" that for a simple connector.

Maybe we should simply support both styles. And in the case of Phoenix, I guess we could simply go through the JDBC module of MetaModel and connect via their JDBC driver... Is that maybe a route, do you know?

- Kasper

2014-02-24 6:37 GMT+01:00 Henry Saputra <[email protected]>:

We could use the HBase client library from the store, I suppose. The issue I am actually worried about is that adding real query support for a column-based datastore is kind of a big task. Apache Phoenix tried to do that, so maybe we could leverage its SQL planner layer to provide the implementation of query execution against the HBase layer?

- Henry

On Mon, Feb 17, 2014 at 9:33 AM, Kasper Sørensen <[email protected]> wrote:

Thanks for the input, Henry.
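[Editor's note: the `<optional>true</optional>` approach proposed above would look roughly like this in the module's pom.xml. The version property is illustrative; users would override it with the hbase-client version matching their own backend.]

```xml
<!-- Marked optional so the heavy HBase/Hadoop dependency tree is not
     pulled in transitively by clients that don't use the HBase module.
     Users declare their own hbase-client matching their backend version. -->
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>${hbase.version}</version>
  <optional>true</optional>
</dependency>
```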
With your experience, do you happen to know of a good thin client-side library? I imagine we could maybe use a REST client instead of the full client we currently use. That would save us a ton of dependency overhead, I think. Or is it a non-issue in your mind, since HBase users are used to this overhead?

2014-02-16 7:16 GMT+01:00 Henry Saputra <[email protected]>:

For 1: I think adding read-only access to HBase should be OK, because most updates to HBase go either through the HBase client or via REST through Stargate [1] or Thrift.

For 2: In Apache Gora we use Avro to do type mapping to columns, and we generate POJO Java classes via the Avro compiler.

For 3: This is the one I am kind of torn on. Apache Phoenix (incubating) tries to provide SQL on HBase [2] via extra indexing and caching. I think this defeats the purpose of having NoSQL databases, which serve a different purpose than relational databases.

I am not sure MetaModel should touch the more column-oriented NoSQL databases. These databases are designed for large data with access primarily via key, not via a query mechanism.
Just my 2 cents.

[1] http://wiki.apache.org/hadoop/Hbase/Stargate
[2] http://phoenix.incubator.apache.org/

On Fri, Jan 24, 2014 at 11:35 AM, Kasper Sørensen <[email protected]> wrote:

Hi everyone,

I was looking at our "hbase-module" branch, and as much as I like this idea, I think we've been a bit too idle with the branch. Maybe we should try to make something final, e.g. for a version 4.1.

So I thought I'd give an overview/status of the module's current capabilities and its shortcomings. We should figure out whether we think this is good enough for a first version, or whether we want to make some improvements to the module before adding it to our portfolio of MetaModel modules.

1) The module only offers read-only/query access to HBase. That is in my opinion OK for now; we have several such modules, and this is something we can better add later, once we straighten out the remaining topics in this mail.

2) With regards to metadata mapping: HBase is different because it has column families, and within column families there are columns. For the sake of our view on HBase, I would describe a column family simply as "a logical grouping of columns".
Column families are fixed within a table, but rows in a table may contain arbitrary numbers of columns within each column family. So... you can instantiate the HBaseDataContext in two ways:

2a) You can let MetaModel discover the metadata. This unfortunately has a severe limitation. We discover the table names and column families using the HBase API, but the actual columns and their contents cannot be provided by the API. So instead we simply expose the column families with a MAP data type. The trouble with this is that the keys and values of the maps will simply be byte arrays... usually not very useful! But it's sort of the only thing (as far as I can see) that's "safe" in HBase, since HBase allows anything (byte arrays) in its columns.

2b) Like in e.g. the MongoDB or CouchDB modules, you can provide an array of table definitions (SimpleTableDef). That way the user defines the metadata himself, and the implementation assumes it is correct (or else it will break). The good thing about this is that the user can define the proper data types etc. for columns. The user defines the column family and column name by setting the MetaModel column name like this: "family:name" (consistent with most HBase tools and API calls).
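[Editor's note: the "family:name" convention from 2b can be illustrated with a small parser. This is a hypothetical helper for illustration, not MetaModel code.]

```java
// Hypothetical helper that splits a MetaModel column name of the form
// "family:name" into its HBase column family and column qualifier,
// following the convention described in 2b above.
class HBaseColumnName {
    final String family;
    final String qualifier;

    HBaseColumnName(String metaModelColumnName) {
        int idx = metaModelColumnName.indexOf(':');
        if (idx < 0) {
            throw new IllegalArgumentException(
                    "Expected 'family:name', got: " + metaModelColumnName);
        }
        family = metaModelColumnName.substring(0, idx);
        qualifier = metaModelColumnName.substring(idx + 1);
    }
}
```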
3) With regards to querying: We've implemented basic query capabilities using the MetaModel query postprocessor. But not all queries are very effective... In addition to full table scans, we have optimized support for COUNT queries and for table scans with maxRows.

We could rather easily add optimized support for a couple of other typical queries:

* lookup of a record by ID
* paged table scans (both firstRow and maxRows)
* queries with simple filters/where items

4) With regards to dependencies: The module right now depends on the artifact called "hbase-client". This dependency has a lot of transitive dependencies, so the size of the module is quite extreme. As an example, it includes stuff like Jetty, Jersey, Jackson and of course Hadoop... But I am wondering if we can have a thinner client side than that! If anyone knows whether we can easily use e.g. the REST interface, that would maybe be better. I'm not an expert on HBase though, so please enlighten me!

Kind regards,
Kasper
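[Editor's note: the client-side OFFSET handling mentioned in this thread (rows before "firstRow" are scanned past on the client when the optimization can't be pushed to the server) amounts to something like the sketch below. The class is hypothetical, and a plain Iterator stands in for an HBase result scanner; firstRow is assumed to be 1-based, matching MetaModel's convention.]

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class ClientSidePager {
    // Emulates OFFSET ("firstRow") and LIMIT ("maxRows") on the client:
    // rows before firstRow are fetched and discarded, which is why this is
    // less efficient than a server-side optimization would be.
    static <T> List<T> page(Iterator<T> scanner, int firstRow, int maxRows) {
        List<T> result = new ArrayList<>();
        int seen = 0;
        while (scanner.hasNext() && result.size() < maxRows) {
            T row = scanner.next();
            seen++;
            if (seen >= firstRow) { // firstRow is 1-based
                result.add(row);
            }
        }
        return result;
    }
}
```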
