Re: [jira] [Commented] (ACCUMULO-227) Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation

Aaron Cordova Thu, 22 Dec 2011 13:03:21 -0800

Why is it that none of you seem to consider two keys that differ by timestamp 
to be two different keys?
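
To make the question concrete, here is a minimal Python sketch. It is only an illustration, not Accumulo code: the Key class, its field names, and sort_key are hypothetical, and the newest-first timestamp ordering is an assumption about how versions sort within a cell.

```python
from typing import NamedTuple


class Key(NamedTuple):
    # Hypothetical model of a key; the field names are illustrative only.
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int

    def cell(self):
        """Everything except the timestamp, i.e. what many users call the key."""
        return self[:4]


def sort_key(k):
    # Ascending by cell, then newest timestamp first (assumed ordering).
    return (k.cell(), -k.timestamp)


k1 = Key("a", "foo", "bar", "", 100)
k2 = Key("a", "foo", "bar", "", 200)

# Compared as full keys, the two differ because their timestamps differ.
distinct_as_full_keys = k1 != k2
# Compared without the timestamp, they name the same cell.
same_cell = k1.cell() == k2.cell()
# The timestamp still decides their relative order.
ordered = sorted([k1, k2], key=sort_key)
```

Under that model, k1 and k2 compare as different keys even though they address the same cell.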


On Dec 22, 2011, at 4:00 PM, Adam Fuchs wrote:

> Aaron,
> 
> I think it would be more accurate to describe Accumulo as an underlying
> multi-map with support for aggregation overlays. A map can be thought of as
> a multi-map with an overlay that takes the first of the multiple entries.
> This is in fact the default configuration of Accumulo tables, where the
> VersioningIterator defines this overlay. Other Iterator configurations
> provide different overlays.
> 
> There are two challenges that make it difficult to cast the underlying
> representation as a map. The first is that the definition of uniqueness of
> a Key is a bit muddy. I think that many users consider the uniqueness to
> include row, column family, and column qualifier. Those that use cell-level
> security also include the column visibility. Timestamp doesn't usually make
> it into the uniqueness concept, from a user's perspective, even though that
> affects the sort order of Keys. In fact, most users let Accumulo set the
> timestamp for them. I think your definition of uniqueness takes timestamp
> into account, and from that perspective what we're doing is sort of like
> providing a finer grained timestamp instead of using one timestamp for an
> entire Mutation (or for all Mutations that show up within a millisecond).
> 
> The second challenge is that the overlay is persisted and is not
> reversible. Aggregators don't keep the Keys that they aggregate, so if a
> user wants to replace a Key in the underlying map and have that replacement
> operation be reflected in the overlay, we can't really do that. However, we
> can do that if the underlying store is a multi-map (which is what we do
> now).
> 
> Adam
> 
> On Thu, Dec 22, 2011 at 3:41 PM, Aaron Cordova <[email protected]> wrote:
> 
>> Rather than aggregation functionality being defined as some operation
>> performed across a set of the values of different keys, you're advocating
>> allowing inserting identical keys and aggregating their values as well?
>> This just seems semantically sloppy to me.
>> 
>> These types of changes just incur a cost in terms of understanding for the
>> user. Rather than being able to describe Accumulo as a map, a well defined
>> and understood concept, that also supports aggregations over a set of keys
>> that share a subkey, we would then have to describe Accumulo as a map, most
>> of the time, except when it functions more like a multi-map, in the case of
>> aggregation in the presence of multiple values for the same key ... it's
>> just confusing.
>> 
>> Even with aggregators configured over a table, it still functions as a map
>> - in fact like two maps, one 'underlying' map, in which each key has one
>> value, and an 'aggregate' map, in which keys also have one value, defined as
>> an aggregation over the 'underlying' map. Perhaps one could argue that what
>> I just described could be termed a multi-map, but from the user's point of
>> view, thinking of it as an 'underlying' map, which is how the user sees the
>> table when writing, and an 'aggregate' map, which is how the user sees the
>> table when reading, is cleaner. Users are used to this situation if
>> they've ever used views in a relational database.
>> 
>> For you and John, who are steeped in this field, this distinction, and
>> this change, probably doesn't seem like a big deal. But when telling a new
>> user about Accumulo, being able to explain to them that Accumulo is a map,
>> is very useful. It makes predicting the behavior of Accumulo possible. If
>> users can put identical key-value pairs into a mutation, and if Accumulo
>> treats them as distinct, users' predictions will be wrong.
>> 
>> Feel free to make this change, but just consider the collective cognitive
>> cost incurred by altering the semantics. Earlier you argued that the cost of
>> extending the times at which aggregations are executed to include the client
>> would be too great. Yet making it possible for Accumulo to cease acting like
>> a map sometimes doesn't give you pause?
>> 
>> On Dec 22, 2011, at 2:52 PM, Adam Fuchs wrote:
>> 
>>> Aaron,
>>> 
>>> I have to disagree with you. By default, Accumulo tables are distributed
>>> maps. However, as soon as you configure an aggregator or some other
>>> interesting iterator on a table the semantics for that table change and it
>>> is no longer a "proper" distributed map. Therefore I claim that the basic
>>> tenet to which you refer does not exist as such.
>>> 
>>> Users generally don't set the timestamps in a mutation, and aggregators
>>> certainly don't preserve the keys that they aggregate. Are you suggesting
>>> that modifying the value associated with a key that has already contributed
>>> to a persisted aggregate should have an effect that is dependent on the
>>> original value? So, if I sum a:foo:bar->1 and then a:foo:bar->2 I should
>>> get 2?
>>> 
>>> The fix that is suggested in this ticket just makes the behavior consistent
>>> between the cases of putting two identical entries in one mutation versus
>>> putting the two entries in two mutations. However we account for the
>>> semantics of aggregation, we should be for this change.
>>> 
>>> Adam
>>> 
>>> 
>>> On Thu, Dec 22, 2011 at 12:31 PM, Aaron Cordova (Commented) (JIRA) <
>>> [email protected]> wrote:
>>> 
>>>> 
>>>>  [
>>>> https://issues.apache.org/jira/browse/ACCUMULO-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174913#comment-13174913
>>>> ]
>>>> 
>>>> Aaron Cordova commented on ACCUMULO-227:
>>>> ----------------------------------------
>>>> 
>>>> What the client should expect is that Accumulo will only store/process one
>>>> value per unique key: Accumulo is a distributed map. Even if it's only for
>>>> aggregation's sake, allowing Mutations to submit multiple values per unique
>>>> key and processing all those values, rather than arbitrarily choosing one,
>>>> violates the concept of a map, which will cause more confusion on the part
>>>> of users.
>>>> 
>>>> The right thing to do for users who want to submit lots of values to
>>>> aggregate under a sub key is to insist that they make their cells differ by
>>>> at least one element in the key. Again, aggregating multiple values under
>>>> the same key violates the basic tenet that Accumulo is a map. Aggregation
>>>> is performed across different keys sharing a sub key.
>>>> 
>>>> If having the users generate unique timestamps is a problem, there are
>>>> several strategies for dealing with that. One is to generate random
>>>> timestamps. If aggregation is being done over timestamps, the actual
>>>> timestamp shouldn't matter / ever be interpreted. If there are worries
>>>> about Accumulo doing something undesired with random timestamps, one could
>>>> generate random column qualifiers, etc. and aggregate over those.
>>>> 
>>>> To address what Adam said about versioning - aggregating tables should
>>>> probably turn off the iterator that only keeps the latest version. But that
>>>> has nothing to do with the policy for handling multiple identical cells.
>>>> 
>>>> Finally, I'm not advocating we do anything to support aggregation on the
>>>> client side, but rather leave it up to the application developer to exploit
>>>> any opportunities for aggregation in their application.
>>>> 
>>>> 
>>>>> Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
>>>>> -----------------------------------------------------------------------------------------------
>>>>> 
>>>>>               Key: ACCUMULO-227
>>>>>               URL: https://issues.apache.org/jira/browse/ACCUMULO-227
>>>>>           Project: Accumulo
>>>>>        Issue Type: Improvement
>>>>>        Components: tserver
>>>>>          Reporter: John Vines
>>>>>          Assignee: John Vines
>>>>>           Fix For: 1.5.0
>>>>> 
>>>>> 
>>>>> Currently for isolation we only isolate mutations. This doesn't allow
>>>>> mutations with identical cells within them. We should increase the mutation
>>>>> counts to account for each individual cell instead of each mutation.
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>>
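
For readers following the thread, the two views argued above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Accumulo API: build_multimap, versioning_overlay, and summing_overlay are hypothetical names, and the data is Adam's a:foo:bar example plus one made-up extra cell.

```python
from collections import defaultdict


def build_multimap(entries):
    """Group (row, family, qualifier, timestamp, value) entries by cell.

    Hypothetical stand-in for the underlying store: every entry is kept,
    so one cell can hold several values.
    """
    table = defaultdict(list)
    for row, family, qualifier, timestamp, value in entries:
        table[(row, family, qualifier)].append((timestamp, value))
    for versions in table.values():
        # Newest timestamp first within each cell (assumed ordering).
        versions.sort(key=lambda tv: tv[0], reverse=True)
    return dict(table)


def versioning_overlay(table):
    """Map view: take only the first (newest) entry of each cell."""
    return {cell: versions[0][1] for cell, versions in table.items()}


def summing_overlay(table):
    """Aggregate view: combine every entry of each cell by summing."""
    return {cell: sum(v for _, v in versions) for cell, versions in table.items()}


entries = [
    ("a", "foo", "bar", 1, 1),  # a:foo:bar -> 1
    ("a", "foo", "bar", 2, 2),  # a:foo:bar -> 2, written later
    ("a", "foo", "baz", 1, 5),  # made-up extra cell
]
table = build_multimap(entries)
latest = versioning_overlay(table)
totals = summing_overlay(table)
```

With the versioning overlay the table reads as a map, and a:foo:bar yields 2. With the summing overlay the same underlying entries yield 3, which is only possible because the store kept both entries.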
