So rather than defining aggregation as an operation performed across the values of a set of different keys, you're advocating that we also allow identical keys to be inserted and their values aggregated? This just seems semantically sloppy to me.
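To make the distinction concrete, here is a rough sketch in plain Java (9 or later). It does not use Accumulo's API: the "row:family:qualifier@timestamp" key strings, the sum aggregator, and the class and method names are all made up for illustration. It only contrasts aggregating over a map (one value per full key) with aggregating over a multi-map (every submitted cell counts).

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AggregationSemantics {

    // Hypothetical key layout: "row:family:qualifier@timestamp".
    // The sub-key is everything before '@'.
    static String subKey(String fullKey) {
        int at = fullKey.indexOf('@');
        return at < 0 ? fullKey : fullKey.substring(0, at);
    }

    // Map semantics: one value per full key (the last write wins), then
    // sum the surviving values of all keys that share a sub-key.
    static Map<String, Long> aggregateAsMap(List<Map.Entry<String, Long>> writes) {
        Map<String, Long> underlying = new TreeMap<>();
        for (Map.Entry<String, Long> w : writes) {
            underlying.put(w.getKey(), w.getValue());
        }
        Map<String, Long> aggregate = new TreeMap<>();
        for (Map.Entry<String, Long> e : underlying.entrySet()) {
            aggregate.merge(subKey(e.getKey()), e.getValue(), Long::sum);
        }
        return aggregate;
    }

    // Multi-map semantics: every write counts, even when full keys are identical.
    static Map<String, Long> aggregateAsMultiMap(List<Map.Entry<String, Long>> writes) {
        Map<String, Long> aggregate = new TreeMap<>();
        for (Map.Entry<String, Long> w : writes) {
            aggregate.merge(subKey(w.getKey()), w.getValue(), Long::sum);
        }
        return aggregate;
    }

    public static void main(String[] args) {
        // Two cells with an identical full key.
        List<Map.Entry<String, Long>> identical = List.of(
            Map.entry("a:foo:bar@5", 1L),
            Map.entry("a:foo:bar@5", 2L));
        // Two cells that differ in one key element (the timestamp).
        List<Map.Entry<String, Long>> distinct = List.of(
            Map.entry("a:foo:bar@5", 1L),
            Map.entry("a:foo:bar@6", 2L));

        System.out.println(aggregateAsMap(identical));      // {a:foo:bar=2}
        System.out.println(aggregateAsMultiMap(identical)); // {a:foo:bar=3}
        System.out.println(aggregateAsMap(distinct));       // {a:foo:bar=3}
    }
}
```

Under map semantics the two identical cells collapse to one before the sum runs; making the cells differ in any one key element gives the same total as the multi-map reading without giving up the map model.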
These types of changes just incur a cost in terms of understanding for the user. Today we can describe Accumulo as a map, a well-defined and well-understood concept, that also supports aggregations over a set of keys that share a subkey. With this change we would have to describe Accumulo as a map most of the time, except when it functions more like a multi-map, namely when aggregating in the presence of multiple values for the same key. It's just confusing.

Even with aggregators configured over a table, it still functions as a map. In fact it functions like two maps: an 'underlying' map, in which each key has one value, and an 'aggregate' map, in which each key also has one value, defined as an aggregation over the 'underlying' map. Perhaps one could argue that what I just described could be termed a multi-map, but from the user's point of view it is cleaner to think of an 'underlying' map, which is how the user sees the table when writing, and an 'aggregate' map, which is how the user sees the table when reading. Users are used to this situation if they've ever used views in a relational database.

For you and John, who are steeped in this field, this distinction, and this change, probably doesn't seem like a big deal. But when telling a new user about Accumulo, being able to explain to them that Accumulo is a map is very useful. It makes predicting the behavior of Accumulo possible. If users can put identical key-value pairs into a mutation, and if Accumulo treats them as distinct, users' predictions will be wrong.

Feel free to make this change, but just consider the collective cognitive cost incurred by altering the semantics. Earlier you argued that the cost of extending the times at which aggregations are executed to include the client would be too great. Yet making it possible for Accumulo to cease acting like a map some of the time doesn't give you pause?

On Dec 22, 2011, at 2:52 PM, Adam Fuchs wrote:

> Aaron,
>
> I have to disagree with you.
> By default, Accumulo tables are distributed
> maps. However, as soon as you configure an aggregator or some other
> interesting iterator on a table the semantics for that table change and it
> is no longer a "proper" distributed map. Therefore I claim that the basic
> tenet to which you refer does not exist as such.
>
> Users generally don't set the timestamps in a mutation, and aggregators
> certainly don't preserve the keys that they aggregate. Are you suggesting
> that modifying the value associated with a key that has already contributed
> to a persisted aggregate should have an effect that is dependent on the
> original value? So, if I sum a:foo:bar->1 and then a:foo:bar->2 I should
> get 2?
>
> The fix that is suggested in this ticket just makes the behavior consistent
> between the cases of putting two identical entries in one mutation versus
> putting the two entries in two mutations. However we account for the
> semantics of aggregation we should be for this change.
>
> Adam
>
>
> On Thu, Dec 22, 2011 at 12:31 PM, Aaron Cordova (Commented) (JIRA) <
> [email protected]> wrote:
>
>>
>> [
>> https://issues.apache.org/jira/browse/ACCUMULO-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174913#comment-13174913]
>>
>> Aaron Cordova commented on ACCUMULO-227:
>> ----------------------------------------
>>
>> What the client should expect is that Accumulo will only store/process one
>> value per unique key: Accumulo is a distributed map. Even if it's only for
>> aggregation's sake, allowing Mutations to submit multiple values per unique
>> key and processing all those values, rather than arbitrarily choosing one,
>> violates the concept of a map, which will cause more confusion on the part
>> of users.
>>
>> The right thing to do for users who want to submit lots of values to
>> aggregate under a sub key is to insist that they make their cells differ by
>> at least one element in the key.
>> Again, aggregating multiple values under
>> the same key violates the basic tenet that Accumulo is a map. Aggregation
>> is performed across different keys sharing a sub key.
>>
>> If having the users generate unique timestamps is a problem, there are
>> several strategies for dealing with that. One is to generate random
>> timestamps. If aggregation is being done over timestamps, the actual
>> timestamp shouldn't matter / ever be interpreted. If there are worries
>> about Accumulo doing something undesired with random timestamps, one could
>> generate random column qualifiers, etc. and aggregate over those.
>>
>> To address what Adam said about versioning - aggregating tables should
>> probably turn off the iterator that only keeps the latest version. But that
>> has nothing to do with the policy for handling multiple identical cells.
>>
>> Finally, I'm not advocating we do anything to support aggregation on the
>> client side, but rather leave it up to the application developer to exploit
>> any opportunities for aggregation in their application.
>>
>>
>>> Improve in memory map counts to provide cell level uniqueness for repeated columns in mutation
>>> -----------------------------------------------------------------------------------------------
>>>
>>> Key: ACCUMULO-227
>>> URL: https://issues.apache.org/jira/browse/ACCUMULO-227
>>> Project: Accumulo
>>> Issue Type: Improvement
>>> Components: tserver
>>> Reporter: John Vines
>>> Assignee: John Vines
>>> Fix For: 1.5.0
>>>
>>>
>>> Currently for isolation we only isolate mutations. This doesn't allow
>>> mutations with identical cells within it. We should increase the mutation
>>> counts to account for each individual cell instead of each mutation.
>>
>> --
>> This message is automatically generated by JIRA.
