Accumulo seems mostly like features we can roll into HBase. Decline.

On Fri, Sep 9, 2011 at 2:50 PM, Andrew Purtell <[email protected]> wrote:
>> From: Duane Moore <[email protected]>
>
>> I will second what Todd and Joey
>> said and reiterate that contributing to open source is not easy for a
>> government contractor, and especially not easy for U.S. government
>> employees.
>
> This is true as a general statement I'm sure.
>
> However, my former life was as an engineer in a DARPA shop with a TS
> clearance. During that time I worked on both closed/classified systems and
> projects such as TrustedBSD (http://www.trustedbsd.org/). Choosing to develop
> an internal alternative rather than work with the HBase project was a
> decision of convenience by someone.
>
> While all appreciate this eventual open sourcing on some level, the outcome
> is hardly optimal, and does not favor in my opinion the existing open source
> community here (HBase) in the short term, and any long term favor is going to
> require work by that community.
>
>> My personal preference for a long while has been to migrate
>> our Accumulo implementation to HBase, but as with any project there are
>> often non-technical considerations for doing so.
>
> I can only hope that open source communities in general will apply a penalty
> for taking the easy way out for such non-technical considerations. We do not
> have to act as beggars. Presumably this open sourcing was not done out of
> charity -- I would be quite surprised, maybe shocked. If government (or
> contractors) want to leverage open source communities for some benefit, the
> least we can do is insist on respectful terms.
>
> Best regards,
>
>
> - Andy
>
> Problems worthy of attack prove their worth by hitting back.
> - Piet Hein (via Tom White)
>
>
> ----- Original Message -----
>> From: Duane Moore <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Cc:
>> Sent: Tuesday, September 6, 2011 9:21 AM
>> Subject: Re: [DISCUSSION] Accumulo, another BigTable clone, has shown up on
>> Apache Incubator as a proposal
>>
>> Hello all,
>>
>> I've been a lurker on the HBase list for a year or so and our company has
>> also been working with the Accumulo implementation during the same time
>> frame. I'd like to respond to Stack's suggestion to focus on the
>> technical merits of the proposal. Since I have some info on the pre-open
>> sourced version of Accumulo, I'd like to share some of our evaluation of
>> the software, primarily from a client perspective (vs. implementation
>> details like logging to NFS vs HDFS).
>>
>> First, I share many of the same concerns of folks who were frustrated that
>> this project seems to duplicate the effort of the open source
>> (particularly HBase) community. However, I will second what Todd and Joey
>> said and reiterate that contributing to open source is not easy for a
>> government contractor, and especially not easy for U.S. government
>> employees. My personal preference for a long while has been to migrate
>> our Accumulo implementation to HBase, but as with any project there are
>> often non-technical considerations for doing so.
>>
>> Below are some notes we took last year on the differences between Accumulo
>> and HBase, with additional notes from me inline. Much of this mirrors
>> what is in the current Accumulo proposal.
>>
>> -----
>>
>> - Column Families
>> In HBase you must specify all column families up front as part of the
>> table schema declaration when creating a table.
>> Accumulo does not have this restriction, you do not declare column
>> families when you create a table. When you insert a new row into the table
>> you can just provide a new column family.
>> ** Note: sounds like from what Stack said, this is close to being OBE?
>>
>>
>> - Aggregation
>> Accumulo offers the ability to specify an aggregator for an individual
>> column family or column. This allows you to keep a row count, or summation
>> of numerical values that may be stored in a particular column. It would
>> appear the function has to operate on the subset of values stored for that
>> column in the table at a particular time since it keeps the aggregate
>> value in memory. So this may not be able to handle certain aggregation
>> functions like 'median' for instance. But functions like sum, max, min,
>> mean, and count should all be supportable.
>> I could not find a comparable feature within HBase, but HBase does offer
>> an atomic function called incrementColumnValue on the HTable class which,
>> it appears, can be leveraged to provide aggregation behavior.
>>
>>
>> - Column Visibility
>> This is the feature in Accumulo that allows tagging of the data at the
>> column level, which would primarily be used for classification markings
>> (in our scenario).
>> If we were to implement the same type of column visibility in HBase that
>> Accumulo supports, we would potentially have several options:
>> -Try to implement column visibility as a patch to HBase. Would be fun, but
>> may be a lot of work.
>> -Since the value of a particular column (cell, actually) is simply a byte
>> array, we could utilize a standard technique of encoding the visibility
>> level/classification in the column value itself.
>> -Since the number of columns is not pre-defined, adopt a convention
>> whereby each column "foo" gets an additional column added by our
>> infrastructure called "foo_visibility".
>> ** Note: We have a requirement to use PKI (digital certificates) for
>> authentication in our service stack. The relationship between PKI and
>> Kerberos currently used for Secure HBase is interesting; not quite sure
>> how the two would fit together in practice.
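
The third option above (a companion "foo_visibility" column for each column "foo") can be sketched roughly as follows. This is only an in-memory model of the naming convention using plain Java maps; it does not use the HBase or Accumulo client APIs. The class name, method names, and the labels "SECRET" and "PUBLIC" are hypothetical, and matching a single label exactly against the reader's authorizations is a simplifying assumption.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// In-memory stand-in for one table row: column name -> value.
// NOT the HBase client API; it only models the "foo_visibility" convention.
class VisibilitySketch {
    // Suffix taken from the convention described in the notes above.
    static final String SUFFIX = "_visibility";

    // Store a value together with a companion column holding its label.
    static void put(Map<String, String> row, String column, String value, String label) {
        row.put(column, value);
        row.put(column + SUFFIX, label);
    }

    // Return only the data columns whose label is among the reader's
    // authorizations. Columns without a companion label are hidden.
    static Map<String, String> read(Map<String, String> row, Set<String> auths) {
        Map<String, String> visible = new TreeMap<>();
        for (Map.Entry<String, String> e : row.entrySet()) {
            String column = e.getKey();
            if (column.endsWith(SUFFIX)) {
                continue; // companion columns are never returned as data
            }
            String label = row.get(column + SUFFIX);
            if (label != null && auths.contains(label)) {
                visible.put(column, e.getValue());
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        Map<String, String> row = new TreeMap<>();
        put(row, "foo", "42", "SECRET");    // hypothetical labels
        put(row, "bar", "hello", "PUBLIC");
        System.out.println(read(row, Set.of("PUBLIC")));
    }
}
```

One consequence of this convention is that the filtering happens wherever read() runs. If that is the client, restricted values still leave the server, which is presumably part of why the patch option is listed first.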
>>
>> - Retrieving Data
>> Accumulo uses a Scanner object for all retrieval operations; a Scanner is
>> obtained from the Connector object. When retrieving all values for a
>> particular row, _each individual cell is returned as a new entry_ by the
>> Scanner iterator.
>> In HBase, you can use a Scan object (org.apache.hadoop.hbase.client.Scan)
>> or you can use a Get object, which allows you to retrieve a single row at
>> a time. In either case, the org.apache.hadoop.hbase.client.Result class is
>> returned, representing all of the requested data for that particular row.
>> In HBase, to set constraints on a query, you set an
>> org.apache.hadoop.hbase.filter.Filter object on the Scan object. Multiple
>> Filters may be set by using the FilterList object. In Accumulo, you call
>> the setScanIterators() method on the Scanner object, which enables the
>> appropriate iterators for use on the server before returning data.
>> ** Note: primary difference here is in the use of server-side iterators,
>> which Andy has correctly pointed out could be implemented via the
>> coprocessor framework. We did some initial investigation into
>> coprocessors to see if we could implement this equivalent functionality,
>> but since we'd been directed to use Accumulo, we didn't have much
>> bandwidth to address this (also coprocessors were in their infancy at the
>> time).
>>
>>
>>
>> -----
>>
>>
>> Hope that helps. Bottom line is that I believe that the features in
>> Accumulo can and ought to be merged into HBase at some point (assuming the
>> technical merits hold up). Looking forward to contributing to that
>> conversation.
>>
>> Thanks,
>> Duane
>>
>> On 9/3/11 2:21 PM, "Stack" <[email protected]> wrote:
>>
>>>
>>> I'd suggest we refocus this thread on how to respond to the Accumulo
>>> proposal (or whether to respond at all), since that's what we 'know'.
>>> I think it'd be useful correcting at least the 'unlikely tos' with
>>> pointers to committed code.
>>>
>>> Code overlap, if any, can be addressed when the code drop happens.
>>>
>>> St.Ack
>>>
>>
>
--
Bradford Stephens,
Founder, Drawn to Scale
http://drawntoscale.com (530) 763-DATA

http://www.drawntoscale.com -- Spire, the scalable database with real-time queries and fulltext search.
http://www.roadtofailure.com -- The Fringes of Scalability, Startups and Computer Science
