Re: [DISCUSS] State of the work-in-progress HBase branch

Kasper Sørensen Tue, 25 Mar 2014 12:48:06 -0700

Wondering what other people think here ... And if we go for a documentation
site that is "built" and released, how do we then bootstrap it easily with
the knowledge that is currently in the wiki?



2014-03-25 20:35 GMT+01:00 Henry Saputra <[email protected]>:

> Some projects do link back from homepage to wiki page. I think the
> main key is to have separate docs for each release.
>
> What do you think?
>
> - Henry
>
> On Tue, Mar 25, 2014 at 4:47 AM, Kasper Sørensen
> <[email protected]> wrote:
> > Hmm was kinda hoping we wouldn't have to... But that's just because I am
> > lazy and I prefer "live" (editable online) documentation where possible
> > (that way you can easily react if someone starts pointing at missing
> > parts). I think either way is doable, but you're right that in case we use
> > wiki-pages, each wiki page should clearly state which versions they apply
> > to, if they are version-specific.
> >
> >
> > 2014-03-24 23:03 GMT+01:00 Henry Saputra <[email protected]>:
> >
> >> Hmm seems like we need to bundle the doc for each release. For
> >> example, the 4.0.0 does not have HBase store.
> >>
> >> Most projects have docs for each release on top of project homepage,
> >> like Zookeeper http://zookeeper.apache.org/doc/r3.4.6/ or Spark
> >> http://spark.apache.org/docs/0.9.0/
> >>
> >> Thoughts?
> >>
> >> - Henry
> >>
> >> On Mon, Mar 24, 2014 at 2:50 PM, Kasper Sørensen
> >> <[email protected]> wrote:
> >> > Hmm I suppose a wiki page would be good. I guess we have wiki pages for
> >> > some of the DataContext implementations already like Salesforce [1], POJO
> >> > [2] and Composite [3] ... Maybe we should even have a page for *every*
> >> > DataContext implementation there is, simply for completeness and
> >> > referenceability of documentation.
> >> >
> >> > [1] http://wiki.apache.org/metamodel/examples/SalesforceDataContext
> >> > [2] http://wiki.apache.org/metamodel/examples/PojoDataContext
> >> > [3] http://wiki.apache.org/metamodel/examples/CompositeDataContext
> >> >
> >> >
> >> > 2014-03-24 22:44 GMT+01:00 Henry Saputra <[email protected]>:
> >> >
> >> >> Ok +1
> >> >>
> >> >> How do you propose to document this feature? As another page in the
> >> >> doc svn repo?
> >> >>
> >> >> - Henry
> >> >>
> >> >> On Mon, Mar 24, 2014 at 2:42 PM, Kasper Sørensen
> >> >> <[email protected]> wrote:
> >> >> > Yep. Or in slightly more technical terms: It means that the
> >> >> > HBaseDataContext only implements DataContext which has these two
> >> >> > significant methods:
> >> >> >
> >> >> >  * getSchemas()
> >> >> >  * executeQuery(...)
> >> >> >
> >> >> > (Plus a bunch more methods, but those two give you the general
> >> >> impression:
> >> >> > Explore metadata and fire queries / reads)
> >> >> > But not UpdateableDataContext, which has the write operations:
> >> >> >
> >> >> >  * executeUpdate(...)
> >> >> >
> >> >> > Regards,
> >> >> > Kasper
> >> >> >
> >> >> >
> >> >> > 2014-03-24 22:37 GMT+01:00 Henry Saputra <[email protected]>:
> >> >> >
> >> >> >> Hmm, what does it mean by read only? You can use it to read data from
> >> >> >> HBase?
> >> >> >>
> >> >> >> - Henry
> >> >> >>
> >> >> >> On Mon, Mar 24, 2014 at 2:34 PM, Kasper Sørensen
> >> >> >> <[email protected]> wrote:
> >> >> >> > A quick update on this since the module has now been merged into
> >> >> >> > the master branch:
> >> >> >> >
> >> >> >> > 1) Module is still read-only. This is accepted for now (unless
> >> >> >> > someone wants to help change it of course).
> >> >> >> >
> >> >> >> > 2) Metadata mapping is still working in two modes: a) we discover
> >> >> >> > the column families and expose them as byte-array maps (not very
> >> >> >> > useful, but works as a "lowest common denominator") and b) the user
> >> >> >> > provides a set of SimpleTableDef (which now has a convenient parser
> >> >> >> > btw.:)) and gets his table mapping as he wants it.
> >> >> >> >
> >> >> >> > 3) Querying now has special support for lookup-by-id type queries
> >> >> >> > where we will use HBase Get instead of Scan. We also have good
> >> >> >> > support for LIMIT/"maxRows", but not OFFSET/"firstRow" (in those
> >> >> >> > cases we will scan past the first records on the client side).
> >> >> >> >
> >> >> >> > 4) Dependencies seem to be a pain still. HBase and Hadoop come in
> >> >> >> > many flavours and not all are compatible. I doubt there's a lot we
> >> >> >> > can do about it, except ask the users to provide their own HBase
> >> >> >> > dependency as per their backend version. We should probably thus
> >> >> >> > make all our HBase/Hadoop dependencies <optional>true</optional> in
> >> >> >> > order to not influence the typical clients.
> >> >> >> >
> >> >> >> > Kasper
> >> >> >> >
> >> >> >> >
> >> >> >> > 2014-02-24 17:08 GMT+01:00 Kasper Sørensen <[email protected]>:
> >> >> >> >
> >> >> >> >> Hi Henry,
> >> >> >> >>
> >> >> >> >> Yea the Phoenix project is definitely an interesting approach to
> >> >> >> >> making MM capable of working with HBase. The only downside to me is
> >> >> >> >> that it seems they do a lot of intrusive stuff to HBase like
> >> >> >> >> creating new index tables etc... I would normally not "allow" that
> >> >> >> >> for a simple connector.
> >> >> >> >>
> >> >> >> >> Maybe we should simply support both styles. And in the case of
> >> >> >> >> Phoenix, I guess we could simply go through the JDBC module of
> >> >> >> >> MetaModel and connect via their JDBC driver... Is that maybe a
> >> >> >> >> route, do you know?
> >> >> >> >>
> >> >> >> >> - Kasper
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> 2014-02-24 6:37 GMT+01:00 Henry Saputra <[email protected]>:
> >> >> >> >>
> >> >> >> >>> We could use the HBase client library from the store I suppose.
> >> >> >> >>> The issue I am actually worried about is that adding real query
> >> >> >> >>> support for a column-based datastore is kind of a big task.
> >> >> >> >>> Apache Phoenix tried to do that so maybe we could leverage the SQL
> >> >> >> >>> planner layer to provide the implementation of the query execution
> >> >> >> >>> to the HBase layer?
> >> >> >> >>>
> >> >> >> >>> - Henry
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>> On Mon, Feb 17, 2014 at 9:33 AM, Kasper Sørensen
> >> >> >> >>> <[email protected]> wrote:
> >> >> >> >>> > Thanks for the input Henry. With your experience, do you then
> >> >> >> >>> > also happen to know of a good thin client-side library? I imagine
> >> >> >> >>> > that we could maybe use a REST client instead of the full client
> >> >> >> >>> > we currently use. That would save us a ton of dependency-overhead
> >> >> >> >>> > I think. Or is it a non-issue in your mind, since HBase users are
> >> >> >> >>> > used to this overhead?
> >> >> >> >>> >
> >> >> >> >>> >
> >> >> >> >>> > 2014-02-16 7:16 GMT+01:00 Henry Saputra <[email protected]>:
> >> >> >> >>> >
> >> >> >> >>> >> For 1 > I think adding read only to HBase should be ok because
> >> >> >> >>> >> most updates to HBase go either through the HBase client or REST
> >> >> >> >>> >> via Stargate [1] or Thrift
> >> >> >> >>> >>
> >> >> >> >>> >> For 2 > In Apache Gora we use Avro to do type mapping to column
> >> >> >> >>> >> and generate POJO java via Avro compiler.
> >> >> >> >>> >>
> >> >> >> >>> >> For 3 > This is the one I am kinda torn on. Apache Phoenix
> >> >> >> >>> >> (incubating) tries to provide SQL to HBase [2] via extra indexing
> >> >> >> >>> >> and caching. I think this defeats the purpose of having NoSQL
> >> >> >> >>> >> databases that serve a different purpose than relational
> >> >> >> >>> >> databases.
> >> >> >> >>> >>
> >> >> >> >>> >> I am not sure MetaModel should touch NoSQL databases which are
> >> >> >> >>> >> more like column types. These databases are designed for large
> >> >> >> >>> >> data with access primarily via key and not a query mechanism.
> >> >> >> >>> >>
> >> >> >> >>> >> Just my 2-cent
> >> >> >> >>> >>
> >> >> >> >>> >>
> >> >> >> >>> >> [1] http://wiki.apache.org/hadoop/Hbase/Stargate
> >> >> >> >>> >> [2] http://phoenix.incubator.apache.org/
> >> >> >> >>> >>
> >> >> >> >>> >> On Fri, Jan 24, 2014 at 11:35 AM, Kasper Sørensen
> >> >> >> >>> >> <[email protected]> wrote:
> >> >> >> >>> >> > Hi everyone,
> >> >> >> >>> >> >
> >> >> >> >>> >> > I was looking at our "hbase-module" branch and as much as I
> >> >> >> >>> >> > like this idea, I think we've been a bit too idle with the
> >> >> >> >>> >> > branch. Maybe we should try to make something final e.g. for a
> >> >> >> >>> >> > version 4.1.
> >> >> >> >>> >> >
> >> >> >> >>> >> > So I thought to give an overview/status of the module's current
> >> >> >> >>> >> > capabilities and its shortcomings. We should figure out if we
> >> >> >> >>> >> > think this is good enough for a first version, or if we want to
> >> >> >> >>> >> > do some improvements to the module before adding it to our
> >> >> >> >>> >> > portfolio of MetaModel modules.
> >> >> >> >>> >> >
> >> >> >> >>> >> > 1) The module only offers read-only/query access to HBase. That
> >> >> >> >>> >> > is in my opinion OK for now, we have several such modules, and
> >> >> >> >>> >> > this is something we can better add later if we straighten out
> >> >> >> >>> >> > the remaining topics in this mail.
> >> >> >> >>> >> >
> >> >> >> >>> >> > 2) With regards to metadata mapping: HBase is different because
> >> >> >> >>> >> > it has both column families and in column families there are
> >> >> >> >>> >> > columns. For the sake of our view on HBase I would describe
> >> >> >> >>> >> > column families simply as "a logical grouping of columns".
> >> >> >> >>> >> > Column families are fixed within a table, but rows in a table
> >> >> >> >>> >> > may contain arbitrary numbers of columns within each column
> >> >> >> >>> >> > family. So... You can instantiate the HBaseDataContext in two
> >> >> >> >>> >> > ways:
> >> >> >> >>> >> >
> >> >> >> >>> >> > 2a) You can let MetaModel discover the metadata. This
> >> >> >> >>> >> > unfortunately has a severe limitation. We discover the table
> >> >> >> >>> >> > names and column families using the HBase API. But the actual
> >> >> >> >>> >> > columns and their contents cannot be provided by the API. So
> >> >> >> >>> >> > instead we simply expose the column families with a MAP data
> >> >> >> >>> >> > type. The trouble with this is that the keys and values of the
> >> >> >> >>> >> > maps will simply be byte-arrays ... Usually not very useful!
> >> >> >> >>> >> > But it's sort of the only thing (as far as I can see) that's
> >> >> >> >>> >> > "safe" in HBase, since HBase allows anything (byte arrays) in
> >> >> >> >>> >> > its columns.
> >> >> >> >>> >> >
> >> >> >> >>> >> > 2b) Like in e.g. MongoDb or CouchDb modules you can provide an
> >> >> >> >>> >> > array of tables (SimpleTableDef). That way the user defines the
> >> >> >> >>> >> > metadata himself and the implementation assumes that it is
> >> >> >> >>> >> > correct (or else it will break). The good thing about this is
> >> >> >> >>> >> > that the user can define the proper data types etc. for
> >> >> >> >>> >> > columns. The user defines the column family and column name by
> >> >> >> >>> >> > defining the MetaModel column name as this: "family:name"
> >> >> >> >>> >> > (consistent with most HBase tools and API calls).
> >> >> >> >>> >> >
> >> >> >> >>> >> > 3) With regards to querying: We've implemented basic query
> >> >> >> >>> >> > capabilities using the MetaModel query postprocessor. But not
> >> >> >> >>> >> > all queries are very effective... In addition to of course full
> >> >> >> >>> >> > table scans, we have optimized support of COUNT queries and of
> >> >> >> >>> >> > table scans with maxRows.
> >> >> >> >>> >> >
> >> >> >> >>> >> > We could rather easily add optimized support for a couple of
> >> >> >> >>> >> > other typical queries:
> >> >> >> >>> >> >  * lookup record by ID
> >> >> >> >>> >> >  * paged table scans (both firstRow and maxRows)
> >> >> >> >>> >> >  * queries with simple filters/where items
> >> >> >> >>> >> >
> >> >> >> >>> >> > 4) With regards to dependencies: The module right now depends
> >> >> >> >>> >> > on the artifact called "hbase-client". This dependency has a
> >> >> >> >>> >> > lot of transitive dependencies so the size of the module is
> >> >> >> >>> >> > quite extreme. As an example, it includes stuff like jetty,
> >> >> >> >>> >> > jersey, jackson and of course hadoop... But I am wondering if
> >> >> >> >>> >> > we can have a more thin client-side than that! If anyone knows
> >> >> >> >>> >> > if e.g. we can use the REST interface easily or so, that would
> >> >> >> >>> >> > maybe be better. I'm not an expert on HBase though, so please
> >> >> >> >>> >> > enlighten me!
> >> >> >> >>> >> >
> >> >> >> >>> >> > Kind regards,
> >> >> >> >>> >> > Kasper
> >> >> >> >>> >>
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
>
