Why is it that none of you seem to consider two keys that differ by timestamp to be two different keys?
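To make the question concrete, here is a minimal Python sketch of the two views of uniqueness being argued over. It is not Accumulo's API: the `Key` tuple, `sort_order`, and `latest_version` are hypothetical names invented for illustration, and the newest-first timestamp ordering is stated as an assumption in the code.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Hypothetical stand-in for an Accumulo key; field names are illustrative."""
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_order(key: Key):
    # Assumed ordering: ascending on the leading fields, newest timestamp first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


def latest_version(entries):
    """Toy 'versioning' overlay: keep the newest entry per timestamp-less prefix."""
    seen = set()
    result = []
    for key, value in sorted(entries, key=lambda kv: sort_order(kv[0])):
        prefix = key[:4]  # everything except the timestamp
        if prefix not in seen:
            seen.add(prefix)
            result.append((key, value))
    return result


k1 = Key("a", "foo", "bar", "", 1)
k2 = Key("a", "foo", "bar", "", 2)
entries = [(k1, "1"), (k2, "2")]

# With the timestamp counted as part of the key, both entries are kept.
store = dict(entries)
```

In this sketch `store` holds two entries because `k1` and `k2` are different keys, while `latest_version(entries)` returns only the entry for `k2`. That second view is the "map" that users who ignore the timestamp have in mind.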
On Dec 22, 2011, at 4:00 PM, Adam Fuchs wrote:

> Aaron,
>
> I think it would be more accurate to describe Accumulo as an underlying
> multi-map with support for aggregation overlays. A map can be thought of as
> a multi-map with an overlay that takes the first of the multiple entries.
> This is in fact the default configuration of Accumulo tables, where the
> VersioningIterator defines this overlay. Other Iterator configurations
> provide different overlays.
>
> There are two challenges that make it difficult to cast the underlying
> representation as a map. The first is that the definition of uniqueness of
> a Key is a bit muddy. I think that many users consider the uniqueness to
> include row, column family, and column qualifier. Those that use cell-level
> security also include the column visibility. Timestamp doesn't usually make
> it into the uniqueness concept, from a user's perspective, even though that
> affects the sort order of Keys. In fact, most users let Accumulo set the
> timestamp for them. I think your definition of uniqueness takes timestamp
> into account, and from that perspective what we're doing is sort of like
> providing a finer-grained timestamp instead of using one timestamp for an
> entire Mutation (or for all Mutations that show up within a millisecond).
>
> The second challenge is that the overlay is persisted and is not
> reversible. Aggregators don't keep the Keys that they aggregate, so if a
> user wants to replace a Key in the underlying map and have that replacement
> operation be reflected in the overlay, we can't really do that. However, we
> can do that if the underlying store is a multi-map (which is what we do
> now).
>
> Adam
>
> On Thu, Dec 22, 2011 at 3:41 PM, Aaron Cordova <[email protected]> wrote:
>
>> Rather than aggregation functionality being defined as some operation
>> performed across a set of the values of different keys, you're advocating
>> allowing inserting identical keys and aggregating their values as well?
>> This just seems semantically sloppy to me.
>>
>> These types of changes just incur a cost in terms of understanding for the
>> user. Rather than being able to describe Accumulo as a map, a well-defined
>> and understood concept, that also supports aggregations over a set of keys
>> that share a subkey, we would then have to describe Accumulo as a map, most
>> of the time, except when it functions more like a multi-map, in the case of
>> aggregation in the presence of multiple values for the same key ... it's
>> just confusing.
>>
>> Even with aggregators configured over a table, it still functions as a map
>> - in fact like two maps, one 'underlying' map, in which each key has one
>> value, and an 'aggregate' map, in which keys also have one value, defined as
>> an aggregation over the 'underlying' map. Perhaps one could argue that what
>> I just described could be termed a multi-map, but from the user's point of
>> view, thinking of it as an 'underlying' map, which is how the user sees the
>> table when writing, and an 'aggregate' map, which is how the user sees the
>> table when reading, is cleaner. Users are used to this situation if
>> they've ever used views in a relational database.
>>
>> For you and John, who are steeped in this field, this distinction, and
>> this change, probably doesn't seem like a big deal. But when telling a new
>> user about Accumulo, being able to explain to them that Accumulo is a map
>> is very useful. It makes predicting the behavior of Accumulo possible. If
>> users can put identical key-value pairs into a mutation, and if Accumulo
>> treats them as distinct, users' predictions will be wrong.
>>
>> Feel free to make this change, but just consider the collective cognitive
>> cost incurred by altering the semantics. Earlier you argued that the cost of
>> extending the times aggregations are executed to include the client would
>> be too great.
>> Yet making it possible for Accumulo to cease acting like a
>> map sometimes doesn't give you pause?
>>
>> On Dec 22, 2011, at 2:52 PM, Adam Fuchs wrote:
>>
>>> Aaron,
>>>
>>> I have to disagree with you. By default, Accumulo tables are distributed
>>> maps. However, as soon as you configure an aggregator or some other
>>> interesting iterator on a table, the semantics for that table change and it
>>> is no longer a "proper" distributed map. Therefore I claim that the basic
>>> tenet to which you refer does not exist as such.
>>>
>>> Users generally don't set the timestamps in a mutation, and aggregators
>>> certainly don't preserve the keys that they aggregate. Are you suggesting
>>> that modifying the value associated with a key that has already contributed
>>> to a persisted aggregate should have an effect that is dependent on the
>>> original value? So, if I sum a:foo:bar->1 and then a:foo:bar->2 I should
>>> get 2?
>>>
>>> The fix that is suggested in this ticket just makes the behavior consistent
>>> between the cases of putting two identical entries in one mutation versus
>>> putting the two entries in two mutations. However we account for the
>>> semantics of aggregation, we should be for this change.
>>>
>>> Adam
>>>
>>>
>>> On Thu, Dec 22, 2011 at 12:31 PM, Aaron Cordova (Commented) (JIRA) <
>>> [email protected]> wrote:
>>>
>>>> [ https://issues.apache.org/jira/browse/ACCUMULO-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174913#comment-13174913 ]
>>>>
>>>> Aaron Cordova commented on ACCUMULO-227:
>>>> ----------------------------------------
>>>>
>>>> What the client should expect is that Accumulo will only store/process
>>>> one value per unique key: Accumulo is a distributed map.
>>>> Even if it's only for
>>>> aggregation's sake, allowing Mutations to submit multiple values per unique
>>>> key and processing all those values, rather than arbitrarily choosing one,
>>>> violates the concept of a map, which will cause more confusion on the part
>>>> of users.
>>>>
>>>> The right thing to do for users who want to submit lots of values to
>>>> aggregate under a sub key is to insist that they make their cells differ by
>>>> at least one element in the key. Again, aggregating multiple values under
>>>> the same key violates the basic tenet that Accumulo is a map. Aggregation
>>>> is performed across different keys sharing a sub key.
>>>>
>>>> If having the users generate unique timestamps is a problem, there are
>>>> several strategies for dealing with that. One is to generate random
>>>> timestamps. If aggregation is being done over timestamps, the actual
>>>> timestamp shouldn't matter / ever be interpreted. If there are worries
>>>> about Accumulo doing something undesired with random timestamps, one could
>>>> generate random column qualifiers, etc. and aggregate over those.
>>>>
>>>> To address what Adam said about versioning - aggregating tables should
>>>> probably turn off the iterator that only keeps the latest version. But that
>>>> has nothing to do with the policy for handling multiple identical cells.
>>>>
>>>> Finally, I'm not advocating we do anything to support aggregation on the
>>>> client side, but rather leave it up to the application developer to exploit
>>>> any opportunities for aggregation in their application.
>>>>
>>>>> Improve in memory map counts to provide cell level uniqueness for
>>>>> repeated columns in mutation
>>>>> ------------------------------------------------------------------
>>>>>
>>>>>                 Key: ACCUMULO-227
>>>>>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-227
>>>>>             Project: Accumulo
>>>>>          Issue Type: Improvement
>>>>>          Components: tserver
>>>>>            Reporter: John Vines
>>>>>            Assignee: John Vines
>>>>>             Fix For: 1.5.0
>>>>>
>>>>>
>>>>> Currently for isolation we only isolate mutations. This doesn't allow
>>>>> mutations with identical cells within them. We should increase the mutation
>>>>> counts to account for each individual cell instead of each mutation.
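
For readers following the ticket, the change it describes can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the tablet server's code: `insert_mutations`, `summed`, and the `(key, count)` layout of the in-memory map are hypothetical, chosen only to show how a per-mutation count and a per-cell count differ when one mutation carries two identical cells.

```python
def insert_mutations(mutations, per_cell_counts):
    """Toy in-memory map keyed by (key, count); returns its entries.

    Hypothetical model: `count` plays the role of the counter the ticket
    describes, bumped either once per mutation or once per cell.
    """
    memory = {}
    count = 0
    for mutation in mutations:
        if not per_cell_counts:
            count += 1  # one count shared by every cell in the mutation
        for key, value in mutation:
            if per_cell_counts:
                count += 1  # a fresh count for each individual cell
            memory[(key, count)] = value
    return memory


def summed(memory):
    """Toy summing aggregator over everything that survived in memory."""
    totals = {}
    for (key, _count), value in memory.items():
        totals[key] = totals.get(key, 0) + value
    return totals


cell = ("a:foo:bar", 1)
one_mutation = [[cell, cell]]      # two identical cells in one mutation
two_mutations = [[cell], [cell]]   # the same cells split across two mutations
```

With per-mutation counts, `summed(insert_mutations(one_mutation, False))` gives 1 for `a:foo:bar` while the two-mutation case gives 2, because the identical cells collide on the same `(key, count)` slot. With per-cell counts both cases give 2, which is the consistency Adam refers to.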
