> On Dec. 22, 2014, 11:42 p.m., Christopher Tubbs wrote:
> > I'm curious about the expected behavior for when the source data contains
> > deletes. It seems that those deletes will be merged into the destination,
> > clobbering any non-deleted data in the destination. Clobbering (or at
> > least, some sensible merge, based on timestamp) is probably expected
> > behavior for non-deleted data from the source, but deleted data is tricky,
> > because the user has no insight into the current state of deletes in the
> > source. They can "cloneInto" in one instant and have deletes clobber their
> > destination, and then do it again without any clobbering (because the
> > source has since compacted).
> > 
> > To prevent this behavior, it seems you'd need to either 1) force a
> > compaction, or 2) add metadata to the value of the file entry instructing
> > the tserver to ignore deleted data (delete markers and any data they
> > hide) coming from that particular file before merging with the other
> > files, similar to how we re-write timestamps when reading data from
> > bulk-imported files.
> > 
> > This unexpected behavior can occur even if all other parts of the cloneInto
> > feature were working perfectly. It gets even more complicated and
> > unexpected when we consider that the user's view of the data can differ
> > from the underlying data in other ways, as the result of iterators. It may
> > not be obvious that they are injecting the underlying data, and not their
> > view of it, into another table. How do you propose to address that? With
> > additional permissions to ensure that not just anybody can perform this
> > action? With a check to verify that the same iterators exist on the
> > destination table that existed on the source table?
> 
> John Vines wrote:
>     I see your concerns as no different from those for a standard bulk
>     import. And those concerns are currently handled by requiring the user
>     doing the bulk import to be aware of the data they're calling the
>     command on.
> 
> Christopher Tubbs wrote:
>     I agree that they aren't much different from bulk import. However, this
>     is not a "bulkImportFrom" feature, it is a "cloneInto" feature, and it's
>     *very* different behavior than clone. With bulk import we also have a
>     special permission, which can be reserved for those who understand its
>     particulars. It's also the case that typical bulk import use cases
>     involve writing data that will not have deletes, or will not come from
>     a source with a different view than the underlying files.
> 
>     I'm just concerned about user expectations, and how we address them.
> 
> John Vines wrote:
>     What if we require the source table to be compacted down to 1 file per
>     tablet to ease concerns about deletes?
> 
>     And I totally agree with the permissions. Would you be okay with
>     requiring bulk import on the destination, or would you rather it have
>     its own permission?
> 
> John Vines wrote:
>     Actually, compacting down to 1 file is expensive and may not be required
>     if there are no deletes in play. I think this is best left to users of
>     the system to handle according to their needs.
> 
> Christopher Tubbs wrote:
>     Yeah, compacting would be expensive. It's fine to leave it to the
>     user... but we do need to communicate the expected behavior and
>     pitfalls very thoroughly, and perhaps lock the feature down with a
>     special permission, like bulk import has, to prevent users who may not
>     understand the pitfalls from performing this function.
> 
> John Vines wrote:
>     I see a few ways to achieve this; curious what your opinion is.
>     For the destination table I see 2 options:
>     1. Just utilize bulk import, because it has known pitfalls that need
>        to be accommodated
>     2. Introduce a new permission for this
> 
>     And then for the source table, there are a few more options:
>     1. READ
>     2. If we introduce a new permission for the destination table, use
>        that same permission
>     3. A new permission for cloning FROM
> 
> Christopher Tubbs wrote:
>     For the destination table:
> 
>     If we utilize the bulk import permission, I'd say we'd definitely want
>     to change the name of this operation to reflect that it's more of a
>     "bulk import" variation than a "clone" variation, at least from the
>     user's perspective (even though the implementation details are more
>     like clone).
> 
>     For the source table:
> 
>     I don't think READ is sufficient, because you can copy and reveal
>     underlying data that you couldn't normally see with READ. For
>     instance: create a table, apply a majc iterator that drops
>     visibilities, then clone from the old table (which you can READ) into
>     the new table (which you control), compact, and now you can see data
>     you weren't supposed to. You could also bypass any other iterator that
>     would normally be in the read path (though that might already be
>     possible by clobbering iterators with scan-time iterators, anyway). At
>     the very least, you'd have to have the equivalent of ALTER_TABLE on
>     the source in order to get the same level of exposure to that
>     underlying data with today's features.
> 
>     Rather than per-table permissions, we could just add a system-wide
>     permission that locks down the feature. As it stands, this feature
>     allows underlying data to escape the data owner's (table creator's)
>     control by moving it to a different table. It can result in
>     unintuitive behavior and grave security concerns, which go beyond
>     those of bulk import. A system-level permission might be appropriate
>     (or, just require the existing system-level admin permission). In
>     general, a system-level permission kinda seems to make sense anyway,
>     since it involves multiple tables.
> 
> kturner wrote:
>     > I'm curious about the expected behavior for when the source data
>     > contains deletes
> 
>     This is an interesting point. The reverse is also possible: deletes in
>     the destination table may clobber data being merged in from the
>     source. I think the chances of deletes in the dest causing problems
>     are similar to the chances of that happening w/ bulk import. I feel
>     like the chances of deletes from the src causing problems are higher
>     than the chances of that happening w/ bulk import. We need some good
>     documentation to inform users of the pitfalls.
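The clobbering scenario discussed above can be sketched with a simplified model of timestamp-based merge resolution. This is illustrative Java, not Accumulo's actual merge code; the `Entry` record and `read` method are hypothetical names, and real Accumulo resolves far more key structure than a single cell.

```java
import java.util.*;

// Simplified model of merge semantics for a single cell: each table
// holds (timestamp, isDelete, value) entries, and a read returns the
// value of the newest entry unless that entry is a delete marker.
public class DeleteMergeModel {
    // One entry for a cell: timestamp, delete flag, value.
    public record Entry(long ts, boolean delete, String value) {}

    // Resolve a cell after merging entries from two tables: the entry
    // with the highest timestamp wins; a winning delete hides the cell.
    public static Optional<String> read(List<Entry> merged) {
        return merged.stream()
                .max(Comparator.comparingLong(Entry::ts))
                .filter(e -> !e.delete())
                .map(Entry::value);
    }

    public static void main(String[] args) {
        // Destination table has live data at ts=5.
        List<Entry> dest = List.of(new Entry(5, false, "dest-value"));
        // Source table still carries a delete marker at ts=9; its own
        // data is gone, but the marker has not been compacted away yet.
        List<Entry> src = List.of(new Entry(9, true, null));

        List<Entry> merged = new ArrayList<>(dest);
        merged.addAll(src);

        // The source's delete marker clobbers the destination's value.
        System.out.println(read(merged).orElse("(deleted)"));
    }
}
```

This also shows why the behavior is timing-dependent: once the source compacts, the ts=9 marker disappears from its files, and the same cloneInto no longer clobbers anything.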
Makes sense. I'm going to run with that.

- John


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27198/#review65843
-----------------------------------------------------------


On Dec. 22, 2014, 7:59 p.m., John Vines wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27198/
> -----------------------------------------------------------
> 
> (Updated Dec. 22, 2014, 7:59 p.m.)
> 
> 
> Review request for accumulo.
> 
> 
> Bugs: ACCUMULO-3236
>     https://issues.apache.org/jira/browse/ACCUMULO-3236
> 
> 
> Repository: accumulo
> 
> 
> Description
> -------
> 
> Includes all code to support the feature, including thrift changes.
> Includes minor code cleanup to TabletLocator and items in the bulk path
> to remove unused signature items (arguments & exceptions).
> Includes renaming of some bulk import functions to clarify their purpose
> (because they're now multi-purpose).
> 
> The patch is based on 1.6, but we can make it target only 1.7 if we
> choose (this conversation should be taken up on jira, not in RB).
> 
> 
> Diffs
> -----
> 
>   core/src/main/java/org/apache/accumulo/core/client/admin/TableOperations.java 97f538d 
>   core/src/main/java/org/apache/accumulo/core/client/impl/RootTabletLocator.java 97d476b 
>   core/src/main/java/org/apache/accumulo/core/client/impl/TableOperationsImpl.java 07df1bd 
>   core/src/main/java/org/apache/accumulo/core/client/impl/TabletLocator.java e396d82 
>   core/src/main/java/org/apache/accumulo/core/client/impl/TabletLocatorImpl.java c550f15 
>   core/src/main/java/org/apache/accumulo/core/client/impl/TimeoutTabletLocator.java bcbe561 
>   core/src/main/java/org/apache/accumulo/core/client/impl/thrift/TableOperation.java 7716823 
>   core/src/main/java/org/apache/accumulo/core/client/mock/MockTableOperationsImpl.java de19137 
>   core/src/main/java/org/apache/accumulo/core/client/mock/impl/MockTabletLocator.java 35f160f 
>   core/src/main/java/org/apache/accumulo/core/master/thrift/FateOperation.java f65f552 
>   core/src/main/java/org/apache/accumulo/core/tabletserver/thrift/TabletClientService.java 2ba7674 
>   core/src/main/thrift/client.thrift 38a8076 
>   core/src/main/thrift/master.thrift 38e9227 
>   core/src/main/thrift/tabletserver.thrift 25e0b10 
>   core/src/test/java/org/apache/accumulo/core/client/admin/TableOperationsHelperTest.java 1d91574 
>   core/src/test/java/org/apache/accumulo/core/client/impl/TableOperationsHelperTest.java 02838ed 
>   server/base/src/main/java/org/apache/accumulo/server/client/BulkImporter.java 4cc13a9 
>   server/base/src/main/java/org/apache/accumulo/server/client/ClientServiceHandler.java fe17a62 
>   server/base/src/main/java/org/apache/accumulo/server/security/SecurityOperation.java 258080c 
>   server/base/src/test/java/org/apache/accumulo/server/client/BulkImporterTest.java 3680341 
>   server/master/src/main/java/org/apache/accumulo/master/FateServiceHandler.java 5818da3 
>   server/master/src/main/java/org/apache/accumulo/master/tableOps/CloneIntoTable.java PRE-CREATION 
>   server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java 9a07a4a 
>   server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java 3f594cc 
>   test/src/main/java/org/apache/accumulo/test/performance/thrift/NullTserver.java 0591b19 
>   test/src/test/java/org/apache/accumulo/test/functional/CloneIntoIT.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/27198/diff/
> 
> 
> Testing
> -------
> 
> Includes CloneIntoIT, which exercises all permutations of the flags. The
> existing BulkIT still functions as intended, validating that no features
> were lost in refactoring existing code for multi-purposing.
> 
> 
> Thanks,
> 
> John Vines
> 
>
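The visibility-escape concern Christopher raised (READ on the source is not sufficient, because a compaction-time iterator that drops visibilities can expose protected data) can be illustrated with a simplified model. This is not the Accumulo iterator API; `Cell`, `scan`, and `compactDroppingVisibilities` are hypothetical names standing in for column visibilities, authorizations, and a majc-scoped iterator.

```java
import java.util.*;

// Simplified model of the visibility-escape concern: a compaction-time
// transform that strips column visibilities rewrites protected entries
// as world-readable ones, so READ-only access to the source plus
// control of the destination is enough to leak underlying data.
public class VisibilityDropModel {
    public record Cell(String row, String visibility, String value) {}

    // A user can only scan cells whose visibility is in their auths
    // (an empty visibility is readable by anyone).
    public static List<String> scan(List<Cell> table, Set<String> auths) {
        return table.stream()
                .filter(c -> c.visibility().isEmpty() || auths.contains(c.visibility()))
                .map(Cell::value)
                .toList();
    }

    // Models a majc iterator that drops visibilities during compaction.
    public static List<Cell> compactDroppingVisibilities(List<Cell> table) {
        return table.stream()
                .map(c -> new Cell(c.row(), "", c.value()))
                .toList();
    }

    public static void main(String[] args) {
        List<Cell> source = List.of(new Cell("r1", "secret", "hidden-value"));
        Set<String> myAuths = Set.of(); // the user holds no authorizations

        // With READ alone, the protected cell is invisible on the source.
        System.out.println(scan(source, myAuths));

        // Clone the underlying files into a table the user controls, then
        // compact with the visibility-dropping transform: the data escapes.
        List<Cell> cloned = compactDroppingVisibilities(source);
        System.out.println(scan(cloned, myAuths));
    }
}
```

The first scan returns nothing; the second returns the previously protected value, which is the argument for gating the feature behind more than READ (e.g. a system-level permission).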