Accumulo seems mostly like features we can roll into HBase. Decline.

On Fri, Sep 9, 2011 at 2:50 PM, Andrew Purtell <[email protected]> wrote:
>> From: Duane Moore <[email protected]>
>
>> I will second what Todd and Joey
>> said and reiterate that contributing to open source is not easy for a
>> government contractor, and especially not easy for U.S. government
>> employees.
>
>
> This is true as a general statement I'm sure.
>
> However, my former life was as an engineer in a DARPA shop with a TS 
> clearance. During that time I worked on both closed/classified systems and 
> projects such as TrustedBSD (http://www.trustedbsd.org/). Choosing to develop 
> an internal alternative rather than work with the HBase project was a 
> decision of convenience by someone.
>
> While all appreciate this eventual open sourcing on some level, the outcome 
> is hardly optimal and, in my opinion, does not favor the existing open source 
> community here (HBase) in the short term; any long-term favor is going to 
> require work by that community.
>
>> My personal preference for a long while has been to migrate
>> our Accumulo implementation to HBase, but as with any project there are
>> often non-technical considerations for doing so.
>
>
> I can only hope that open source communities in general will apply a penalty 
> for taking the easy way out for such non-technical considerations. We do not 
> have to act as beggars. Presumably this open sourcing was not done out of 
> charity -- I would be quite surprised, maybe shocked. If government (or 
> contractors) want to leverage open source communities for some benefit, the 
> least we can do is insist on respectful terms.
>
> Best regards,
>
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
> Tom White)
>
>
> ----- Original Message -----
>> From: Duane Moore <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Cc:
>> Sent: Tuesday, September 6, 2011 9:21 AM
>> Subject: Re: [DISCUSSION] Accumulo, another BigTable clone, has shown up on 
>> Apache Incubator as a proposal
>>
>> Hello all,
>>
>> I've been a lurker on the HBase list for a year or so and our company has
>> also been working with the Accumulo implementation during the same time
>> frame.  I'd like to respond to Stack's suggestion to focus on the
>> technical merits of the proposal.  Since I have some info on the pre-open
>> sourced version of Accumulo, I'd like to share some of our evaluation of
>> the software, primarily from a client perspective (vs. implementation
>> details like logging to NFS vs HDFS).
>>
>> First, I share many of the same concerns of folks who were frustrated that
>> this project seems to duplicate the effort of the open source
>> (particularly HBase) community.  However, I will second what Todd and Joey
>> said and reiterate that contributing to open source is not easy for a
>> government contractor, and especially not easy for U.S. government
>> employees.  My personal preference for a long while has been to migrate
>> our Accumulo implementation to HBase, but as with any project there are
>> often non-technical considerations for doing so.
>>
>> Below are some notes we took last year on the differences between Accumulo
>> and HBase, with additional notes from me inline.  Much of this mirrors
>> what is in the current Accumulo proposal.
>>
>> -----
>>
>> - Column Families
>> In HBase you must specify all column families up front as part of the
>> table schema declaration when creating a table.
>> Accumulo does not have this restriction: you do not declare column
>> families when you create a table. When you insert a new row into the table
>> you can just provide a new column family.
>> ** Note: sounds like from what Stack said, this is close to being OBE?
>>
>>
>> - Aggregation
>> Accumulo offers the ability to specify an aggregator for an individual
>> column family or column. This allows you to keep a row count, or summation
>> of numerical values that may be stored in a particular column. It would
>> appear the function has to operate on the subset of values stored for that
>> column in the table at a particular time since it keeps the aggregate
>> value in memory. So this may not be able to handle certain aggregation
>> functions like 'median' for instance. But functions like sum, max, min,
>> mean, and count should all be supportable.
>> I could not find a comparable feature within HBase, but HBase does offer
>> an atomic function called incrementColumnValue on the HTable class which
>> it appears can be leveraged to provide aggregation behavior.
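
A rough sketch of the point above about which aggregation functions fit: sum, max, min, mean, and count can each be folded in one value at a time with constant-size state, while a median would need every value kept around. The class and method names here are hypothetical and are not the Accumulo aggregator interface or any HBase API.

```python
# Hypothetical sketch: NOT the Accumulo aggregator interface or the HBase API.
class RunningAggregate:
    """Folds in one value at a time while keeping constant-size state."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.minimum = None
        self.maximum = None

    def collect(self, value):
        # Each update touches only the stored summary, never earlier values.
        self.count += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    def mean(self):
        # Mean is derivable from two running values (sum and count).
        return self.total / self.count if self.count else None


agg = RunningAggregate()
for v in [4, 1, 7]:
    agg.collect(v)
```

A median has no equivalent summary of this kind, which matches the caveat above: it would need the full set of values rather than a single in-memory aggregate.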
>>
>>
>> - Column Visibility
>> This is the feature in Accumulo that allows tagging of the data at the
>> column level, which would primarily be used for classification markings
>> (in our scenario).
>> If we were to implement the same type of column visibility in HBase that
>> Accumulo supports, we would have potentially several options:
>> -Try to implement column visibility as a patch to HBase. Would be fun, but
>> may be a lot of work.
>> -Since the value of a particular column (cell, actually) is simply a byte
>> array, we could utilize a standard technique of encoding the visibility
>> level/classification in the column value itself.
>> -Since the number of columns is not pre-defined, adopt a convention
>> whereby each column "foo" gets an additional column added by our
>> infrastructure called "foo_visibility".
>> ** Note: We have a requirement to use PKI (digital certificates) for
>> authentication in our service stack. The relationship between PKI and
>> Kerberos currently used for Secure HBase is interesting; not quite sure
>> how the two would fit together in practice.
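
As a minimal sketch of the second and third options listed above, assuming a single visibility label compared for equality (a simplification), the label can either be length-prefixed onto the cell value or stored in a companion "foo_visibility" column. All function names and labels below are hypothetical; nothing here is an HBase or Accumulo API.

```python
import struct

# Hypothetical helpers: NOT part of HBase or Accumulo.

def encode_cell(label: str, payload: bytes) -> bytes:
    """Option 2: prefix the value with a 2-byte length and the label."""
    label_bytes = label.encode("utf-8")
    return struct.pack(">H", len(label_bytes)) + label_bytes + payload


def decode_cell(cell: bytes):
    """Split an encoded cell back into (label, payload)."""
    (label_len,) = struct.unpack(">H", cell[:2])
    label = cell[2:2 + label_len].decode("utf-8")
    return label, cell[2 + label_len:]


def visible_payload(cell: bytes, authorizations: set):
    """Return the payload only if the caller holds the cell's label."""
    label, payload = decode_cell(cell)
    return payload if label in authorizations else None


def with_visibility_columns(row: dict, labels: dict) -> dict:
    """Option 3: add a companion '<column>_visibility' column per label."""
    out = dict(row)
    for column, label in labels.items():
        out[column + "_visibility"] = label.encode("utf-8")
    return out


cell = encode_cell("secret", b"data")
row = with_visibility_columns({"foo": b"data"}, {"foo": "secret"})
```

Either convention leaves enforcement to whatever code reads the cell, so filtering like this on the client is a convention rather than an access control.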
>>
>> - Retrieving Data
>> Accumulo uses a Scanner object for all retrieval operations; a Scanner is
>> obtained from the Connector object. When retrieving all values for a
>> particular row, _each individual cell is returned as a separate entry_ by
>> the Scanner iterator.
>> In HBase, you can use a Scan object (org.apache.hadoop.hbase.client.Scan)
>> or you can use a Get object, which allows you to retrieve a single row at
>> a time. In either case, the org.apache.hadoop.hbase.client.Result class is
>> returned, representing all of the requested data for that particular row.
>> In HBase, to set constraints on a query, you set an
>> org.apache.hadoop.hbase.filter.Filter object on the Scan object. Multiple
>> Filters may be set by using the FilterList object. In Accumulo, you call
>> the setScanIterators() method on the Scanner object, which enables the
>> appropriate iterators for use on the server before returning data.
>> ** Note: primary difference here is in the use of server-side iterators,
>> which Andy has correctly pointed out could be implemented via the
>> coprocessor framework.  We did some initial investigation into
>> coprocessors to see if we could implement this equivalent functionality,
>> but since we'd been directed to use Accumulo, we didn't have much
>> bandwidth to address this (also coprocessors were in their infancy at the
>> time).
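
To illustrate the difference in result shape described above, here is a small sketch that takes a flat, row-sorted stream of per-cell entries (the shape attributed to the Accumulo Scanner) and groups it into one mapping per row (the shape attributed to the HBase Result). The tuple layout, sample data, and function name are assumptions made for illustration and use neither client library.

```python
from itertools import groupby

# Hypothetical flat stream of (row, column, value) entries, sorted by row.
cells = [
    ("row1", "cf:a", b"1"),
    ("row1", "cf:b", b"2"),
    ("row2", "cf:a", b"3"),
]


def group_by_row(entries):
    """Yield (row, {column: value}) pairs from a row-sorted cell stream."""
    for row, group in groupby(entries, key=lambda entry: entry[0]):
        yield row, {column: value for _, column, value in group}


rows = list(group_by_row(cells))
```

The grouping relies on entries for the same row arriving adjacent to each other, which is why the sketch assumes sorted input.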
>>
>>
>>
>> -----
>>
>>
>> Hope that helps.  Bottom line is that I believe that the features in
>> Accumulo can and ought to be merged into HBase at some point (assuming the
>> technical merits hold up).  Looking forward to contributing to that
>> conversation.
>>
>> Thanks,
>> Duane
>>
>> On 9/3/11 2:21 PM, "Stack" <[email protected]> wrote:
>>
>>>
>>> I'd suggest we refocus this thread on how to respond to the Accumulo
>>> proposal (or whether to respond at all), since that's what we 'know'.
>>> I think it'd be useful correcting at least the 'unlikely tos' with
>>> pointers to committed code.
>>>
>>> Code overlap, if any, can be addressed when the code drop happens.
>>>
>>> St.Ack
>>>
>>
>



-- 
Bradford Stephens,
Founder, Drawn to Scale
http://drawntoscale.com
(530) 763-DATA

http://www.drawntoscale.com -- Spire, the scalable database with
real-time queries and fulltext search.

http://www.roadtofailure.com -- The Fringes of Scalability, Startups
and Computer Science
