A quick update on this since the module has now been merged into the master branch:

1) The module is still read-only. This is accepted for now (unless someone
wants to help change it, of course).

2) Metadata mapping still works in two modes:
a) we discover the column families and expose them as byte-array maps (not
very useful, but it works as a "lowest common denominator"), and
b) the user provides a set of SimpleTableDef (which now has a convenient
parser, btw. :)) and gets the table mapping exactly as he wants it.

3) Querying now has special support for lookup-by-id type queries, where we
use an HBase Get instead of a Scan. We also have good support for
LIMIT/"maxRows", but not for OFFSET/"firstRow" (in those cases we scan past
the first records on the client side).

4) Dependencies still seem to be a pain. HBase and Hadoop come in many
flavours, and not all of them are compatible. I doubt there's a lot we can do
about it, except ask users to provide their own HBase dependency matching
their backend version. We should therefore probably mark all our HBase/Hadoop
dependencies as <optional>true</optional> so that they do not affect typical
clients.

Kasper


2014-02-24 17:08 GMT+01:00 Kasper Sørensen <[email protected]>:

> Hi Henry,
>
> Yea, the Phoenix project is definitely an interesting approach to making MM
> capable of working with HBase. The only downside to me is that it seems
> they do a lot of intrusive stuff to HBase, like creating new index tables
> etc... I would normally not "allow" that for a simple connector.
>
> Maybe we should simply support both styles. And in the case of Phoenix, I
> guess we could simply go through the JDBC module of MetaModel and connect
> via their JDBC driver... Is that maybe a route, do you know?
>
> - Kasper
>
>
> 2014-02-24 6:37 GMT+01:00 Henry Saputra <[email protected]>:
>
>> We could use the HBase client library from the store I suppose.
>> The issue I am actually worried about is that adding real query support
>> for a column-based datastore is kind of a big task.
>> Apache Phoenix tried to do that, so maybe we could leverage its SQL
>> planner layer to provide the implementation of the query execution for
>> the HBase layer?
>>
>> - Henry
>>
>>
>> On Mon, Feb 17, 2014 at 9:33 AM, Kasper Sørensen
>> <[email protected]> wrote:
>> > Thanks for the input Henry. With your experience, do you then also
>> > happen to know of a good thin client-side library? I imagine that we
>> > could maybe use a REST client instead of the full client we currently
>> > use. That would save us a ton of dependency overhead, I think. Or is it
>> > a non-issue in your mind, since HBase users are used to this overhead?
>> >
>> >
>> > 2014-02-16 7:16 GMT+01:00 Henry Saputra <[email protected]>:
>> >
>> >> For 1 > I think adding read-only access to HBase should be OK, because
>> >> most updates to HBase go either through the HBase client, REST via
>> >> Stargate [1], or Thrift.
>> >>
>> >> For 2 > In Apache Gora we use Avro to do the type mapping to columns
>> >> and generate Java POJOs via the Avro compiler.
>> >>
>> >> For 3 > This is the one I am kinda torn on. Apache Phoenix (incubating)
>> >> tries to provide SQL on HBase [2] via extra indexing and caching. I
>> >> think this defeats the purpose of having NoSQL databases, which serve a
>> >> different purpose than relational databases.
>> >>
>> >> I am not sure MetaModel should touch NoSQL databases that are more of
>> >> the column-store type. These databases are designed for large data,
>> >> with access primarily via key and not via a query mechanism.
>> >>
>> >> Just my 2 cents
>> >>
>> >>
>> >> [1] http://wiki.apache.org/hadoop/Hbase/Stargate
>> >> [2] http://phoenix.incubator.apache.org/
>> >>
>> >> On Fri, Jan 24, 2014 at 11:35 AM, Kasper Sørensen
>> >> <[email protected]> wrote:
>> >> > Hi everyone,
>> >> >
>> >> > I was looking at our "hbase-module" branch and, as much as I like
>> >> > this idea, I think we've been a bit too idle with the branch. Maybe
>> >> > we should try to make something final, e.g. for a version 4.1.
>> >> >
>> >> > So I thought I'd give an overview/status of the module's current
>> >> > capabilities and its shortcomings. We should figure out if we think
>> >> > this is good enough for a first version, or if we want to make some
>> >> > improvements to the module before adding it to our portfolio of
>> >> > MetaModel modules.
>> >> >
>> >> > 1) The module only offers read-only/query access to HBase. That is in
>> >> > my opinion OK for now; we have several such modules, and this is
>> >> > something we can better add later if we straighten out the remaining
>> >> > topics in this mail.
>> >> >
>> >> > 2) With regards to metadata mapping: HBase is different because it
>> >> > has both column families and, within column families, columns. For
>> >> > the sake of our view on HBase I would describe column families simply
>> >> > as "a logical grouping of columns". Column families are fixed within
>> >> > a table, but rows in a table may contain arbitrary numbers of columns
>> >> > within each column family. So... You can instantiate the
>> >> > HBaseDataContext in two ways:
>> >> >
>> >> > 2a) You can let MetaModel discover the metadata. This unfortunately
>> >> > has a severe limitation. We discover the table names and column
>> >> > families using the HBase API. But the actual columns and their
>> >> > contents cannot be provided by the API. So instead we simply expose
>> >> > the column families with a MAP data type. The trouble with this is
>> >> > that the keys and values of the maps will simply be byte arrays ...
>> >> > Usually not very useful! But it's sort of the only thing (as far as I
>> >> > can see) that's "safe" in HBase, since HBase allows anything (byte
>> >> > arrays) in its columns.
>> >> >
>> >> > 2b) Like in e.g. the MongoDb or CouchDb modules, you can provide an
>> >> > array of tables (SimpleTableDef).
>> >> > That way the user defines the metadata himself and the
>> >> > implementation assumes that it is correct (or else it will break).
>> >> > The good thing about this is that the user can define the proper
>> >> > data types etc. for columns. The user defines the column family and
>> >> > column name by setting the MetaModel column name like this:
>> >> > "family:name" (consistent with most HBase tools and API calls).
>> >> >
>> >> > 3) With regards to querying: We've implemented basic query
>> >> > capabilities using the MetaModel query postprocessor. But not all
>> >> > queries are very efficient... In addition to full table scans, of
>> >> > course, we have optimized support for COUNT queries and for table
>> >> > scans with maxRows.
>> >> >
>> >> > We could rather easily add optimized support for a couple of other
>> >> > typical queries:
>> >> > * lookup record by ID
>> >> > * paged table scans (both firstRow and maxRows)
>> >> > * queries with simple filters/where items
>> >> >
>> >> > 4) With regards to dependencies: The module right now depends on the
>> >> > artifact called "hbase-client". This dependency has a lot of
>> >> > transitive dependencies, so the size of the module is quite extreme.
>> >> > As an example, it includes stuff like jetty, jersey, jackson and of
>> >> > course hadoop... But I am wondering if we can have a thinner client
>> >> > side than that! If anyone knows whether e.g. we can use the REST
>> >> > interface easily or so, that would maybe be better. I'm not an expert
>> >> > on HBase though, so please enlighten me!
>> >> >
>> >> > Kind regards,
>> >> > Kasper
>> >>
>> >
>
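
The read strategy from point 3 of the update at the top of this thread (Get for lookup-by-id, LIMIT pushed to the scan, OFFSET skipped on the client) could be sketched roughly as below. This is only an illustration and not the module's code: the class HBaseReadPlanSketch and its plan method are hypothetical, nothing here calls the HBase client API, and it assumes that firstRow is 1-based and that -1 stands for "no maxRows".

```java
// Illustrative only: these names are hypothetical and not part of MetaModel or HBase.
final class HBaseReadPlanSketch {

    enum Operation { GET, SCAN }

    final Operation operation;
    final int serverSideLimit; // rows to request from HBase, or -1 for "no limit"
    final int clientSideSkip;  // rows to discard on the client before returning any

    private HBaseReadPlanSketch(Operation operation, int serverSideLimit, int clientSideSkip) {
        this.operation = operation;
        this.serverSideLimit = serverSideLimit;
        this.clientSideSkip = clientSideSkip;
    }

    /**
     * @param rowKeyEquals the row key when the query is a plain lookup-by-id, else null
     * @param firstRow     1-based index of the first row to return (assumption)
     * @param maxRows      maximum number of rows to return, or -1 for "no limit" (assumption)
     */
    static HBaseReadPlanSketch plan(String rowKeyEquals, int firstRow, int maxRows) {
        if (firstRow < 1) {
            throw new IllegalArgumentException("firstRow is 1-based, got " + firstRow);
        }
        int skip = firstRow - 1;
        if (rowKeyEquals != null) {
            // A lookup-by-id matches at most one row, so a single Get is enough.
            return new HBaseReadPlanSketch(Operation.GET, 1, skip);
        }
        // OFFSET is not pushed down: fetch the skipped rows too and drop them client-side.
        int limit = (maxRows < 0) ? -1 : skip + maxRows;
        return new HBaseReadPlanSketch(Operation.SCAN, limit, skip);
    }

    public static void main(String[] args) {
        // Rows 11..15 of a table: scan 15 rows, drop the first 10 on the client.
        HBaseReadPlanSketch page = plan(null, 11, 5);
        System.out.println(page.operation + " limit=" + page.serverSideLimit
                + " skip=" + page.clientSideSkip);
    }
}
```

The cost of the client-side skip grows with the offset: asking for rows 11 to 15 still transfers 15 rows, which is why deep paging stays expensive until firstRow can be handled on the server side.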

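
The "family:name" column naming from 2b in the original mail can be illustrated with a small parser. Again a sketch under assumptions: HBaseColumnNameSketch is a hypothetical helper, not a class from the module, and treating a name without a colon as a reference to a whole column family is a choice made here for illustration only.

```java
// Illustrative only: a hypothetical helper, not taken from the MetaModel HBase module.
final class HBaseColumnNameSketch {

    final String family;
    final String qualifier; // null when the name refers to a whole column family

    private HBaseColumnNameSketch(String family, String qualifier) {
        this.family = family;
        this.qualifier = qualifier;
    }

    static HBaseColumnNameSketch parse(String columnName) {
        if (columnName == null || columnName.isEmpty()) {
            throw new IllegalArgumentException("column name must not be empty");
        }
        // Split on the first colon only: HBase qualifiers are arbitrary bytes
        // and may themselves contain colons, while family names may not.
        int colon = columnName.indexOf(':');
        if (colon < 0) {
            return new HBaseColumnNameSketch(columnName, null);
        }
        if (colon == 0) {
            throw new IllegalArgumentException("missing column family in: " + columnName);
        }
        return new HBaseColumnNameSketch(
                columnName.substring(0, colon), columnName.substring(colon + 1));
    }

    public static void main(String[] args) {
        HBaseColumnNameSketch column = parse("details:first_name");
        System.out.println(column.family + " / " + column.qualifier);
    }
}
```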