Rather than defining aggregation as an operation performed across the values 
of a set of different keys, you're advocating that users also be allowed to 
insert identical keys and have their values aggregated? That seems 
semantically sloppy to me. 

These types of changes incur a cost in terms of understanding for the 
user. Today we can describe Accumulo as a map, a well-defined and 
well-understood concept, that also supports aggregations over a set of keys 
that share a subkey. With this change we would have to describe Accumulo as a 
map most of the time, except when it functions more like a multi-map, as in 
the case of aggregation in the presence of multiple values for the same key. 
It's just confusing.

Even with aggregators configured over a table, it still functions as a map - in 
fact like two maps: one 'underlying' map, in which each key has one value, and 
an 'aggregate' map, in which each key also has one value, defined as an 
aggregation over the 'underlying' map. Perhaps one could argue that what I just 
described could be termed a multi-map, but from the user's point of view it is 
cleaner to think of it as an 'underlying' map, which is how the user sees the 
table when writing, and an 'aggregate' map, which is how the user sees the 
table when reading. Users are used to this situation if they've ever used 
views in a relational database.
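
To make the two-map picture concrete, here is a minimal sketch in plain 
Python. It is only a model of the idea, not the Accumulo API: the key layout 
(row, family, qualifier, timestamp), the names put and aggregate_view, and the 
choice of summation as the aggregation are all assumptions made for 
illustration.

```python
from collections import defaultdict

# Hypothetical model, not the Accumulo API.
# A full key is a (row, family, qualifier, timestamp) tuple.
underlying = {}


def put(key, value):
    """Write path: a plain map, so a repeated full key replaces its value."""
    underlying[key] = value


def aggregate_view(table, subkey_len=3):
    """Read path: one value per subkey, the sum over all keys sharing it."""
    view = defaultdict(int)
    for key, value in table.items():
        view[key[:subkey_len]] += value
    return dict(view)


put(("a", "foo", "bar", 1), 1)
put(("a", "foo", "bar", 2), 2)
put(("a", "foo", "bar", 2), 5)  # same full key: replaces the 2 above

view = aggregate_view(underlying)
```

Both structures stay ordinary maps: 'underlying' holds one value per full key, 
and 'view' holds one value per subkey, much like a view over a base table.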

For you and John, who are steeped in this field, this distinction and this 
change probably don't seem like a big deal. But when telling a new user 
about Accumulo, being able to explain that Accumulo is a map is very 
useful. It makes predicting the behavior of Accumulo possible. If users can put 
identical key-value pairs into a mutation, and if Accumulo treats them as 
distinct, users' predictions will be wrong. 
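
The difference in what a user would predict can be sketched as follows. Again 
this is a plain-Python model rather than Accumulo code: the mutation is 
represented as an ordered list of (key, value) updates, and the map variant 
assumes the last write wins, which is only one possible way of keeping a 
single value per key.

```python
# Hypothetical model of one mutation carrying two updates to the same key.
mutation = [(("a", "foo", "bar"), 1), (("a", "foo", "bar"), 2)]


def apply_as_map(updates):
    """Map semantics: one value per key (assumed here: last write wins)."""
    table = {}
    for key, value in updates:
        table[key] = value
    return table


def apply_as_multimap(updates):
    """Multi-map semantics: every value submitted for a key is kept."""
    table = {}
    for key, value in updates:
        table.setdefault(key, []).append(value)
    return table


# Sum aggregation over each resulting table.
map_sum = sum(apply_as_map(mutation).values())
multimap_sum = sum(sum(vs) for vs in apply_as_multimap(mutation).values())
```

Under the map model the sum sees a single value for the key; under the 
multi-map model it sees both, so the two readings give different totals for 
the same mutation.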

Feel free to make this change, but consider the collective cognitive cost 
incurred by altering the semantics. Earlier you argued that the cost of 
extending the times at which aggregations are executed to include the client 
would be too great. Yet making it possible for Accumulo to cease acting like a 
map sometimes doesn't give you pause?

On Dec 22, 2011, at 2:52 PM, Adam Fuchs wrote:

> Aaron,
> 
> I have to disagree with you. By default, Accumulo tables are distributed
> maps. However, as soon as you configure an aggregator or some other
> interesting iterator on a table the semantics for that table change and it
> is no longer a "proper" distributed map. Therefore I claim that the basic
> tenet to which you refer does not exist as such.
> 
> Users generally don't set the timestamps in a mutation, and aggregators
> certainly don't preserve the keys that they aggregate. Are you suggesting
> that modifying the value associated with a key that has already contributed
> to a persisted aggregate should have an effect that is dependent on the
> original value? So, if I sum a:foo:bar->1 and then a:foo:bar->2 I should
> get 2?
> 
> The fix that is suggested in this ticket just makes the behavior consistent
> between the cases of putting two identical entries in one mutation versus
> putting the two entries in two mutations. However we account for the
> semantics of aggregation we should be for this change.
> 
> Adam
> 
> 
> On Thu, Dec 22, 2011 at 12:31 PM, Aaron Cordova (Commented) (JIRA) <
> [email protected]> wrote:
> 
>> 
>>   [
>> https://issues.apache.org/jira/browse/ACCUMULO-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174913#comment-13174913]
>> 
>> Aaron Cordova commented on ACCUMULO-227:
>> ----------------------------------------
>> 
>> What the client should expect is that Accumulo will only store/process one
>> value per unique key: Accumulo is a distributed map. Even if it's only for
>> aggregation's sake, allowing Mutations to submit multiple values per unique
>> key and processing all those values, rather than arbitrarily choosing one,
>> violates the concept of a map, which will cause more confusion on the part
>> of users.
>> 
>> The right thing to do for users who want to submit lots of values to
>> aggregate under a sub key is to insist that they make their cells differ by
>> at least one element in the key. Again, aggregating multiple values under
>> the same key violates the basic tenet that Accumulo is a map. Aggregation
>> is performed across different keys sharing a sub key.
>> 
>> If having the users generate unique timestamps is a problem, there are
>> several strategies for dealing with that. One is to generate random
>> timestamps. If aggregation is being done over timestamps, the actual
>> timestamp shouldn't matter / ever be interpreted. If there are worries
>> about Accumulo doing something undesired with random timestamps, one could
>> generate random column qualifiers, etc. and aggregate over those.
>> 
>> To address what Adam said about versioning - aggregating tables should
>> probably turn off the iterator that only keeps the latest version. But that
>> has nothing to do with the policy for handling multiple identical cells.
>> 
>> Finally, I'm not advocating we do anything to support aggregation on the
>> client side, but rather leave it up to the application developer to exploit
>> any opportunities for aggregation in their application.
>> 
>> 
>>> Improve in memory map counts to provide cell level uniqueness for
>> repeated columns in  mutation
>>> 
>> -----------------------------------------------------------------------------------------------
>>> 
>>>                Key: ACCUMULO-227
>>>                URL: https://issues.apache.org/jira/browse/ACCUMULO-227
>>>            Project: Accumulo
>>>         Issue Type: Improvement
>>>         Components: tserver
>>>           Reporter: John Vines
>>>           Assignee: John Vines
>>>            Fix For: 1.5.0
>>> 
>>> 
>>> Currently for isolation we only isolate mutations. This doesn't allow
>> mutations with identical cells within it. We should increase the mutation
>> counts to account for each individual cell instead of each mutation.
>> 
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA
>> administrators:
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>> 
>> 
>> 
