> On Dec. 22, 2014, 11:42 p.m., Christopher Tubbs wrote:
> > I'm curious about the expected behavior for when the source data contains 
> > deletes. It seems that those deletes will be merged into the destination, 
> > clobbering any non-deleted data in the destination. Clobbering (or at 
> > least, some sensible merge, based on timestamp) is probably expected 
> > behavior for non-deleted data from the source, but deleted data is tricky, 
> > because the user has no insight into the current state of deletes in the 
> > source. They can "cloneInto" in one instant and have deletes clobber their 
> > destination, and then do it again without any clobbering (because the 
> > source has since compacted).
> > 
> > To prevent this behavior, it seems you'd need to either 1) force a 
> > compaction, or 2) add metadata to the value of the file entry instructing 
> > the tserver to ignore deleted data (delete markers and any data it is 
> > hiding) coming from that particular file, before merging with the other 
> > files, similar to how we re-write timestamps when reading data from 
> > bulk-imported files.
> > 
> > This unexpected behavior can occur even if all other parts of the cloneInto 
> > feature were working perfectly. It gets even more complicated and 
> > unexpected when we begin considering that the user's view of the data is 
> > different than what the underlying data actually is in other ways, as the 
> > result of iterators. It may not be obvious that they are injecting the 
> > underlying data, and not their view of it, into another table. How do you 
> > propose to address that? With additional permissions to ensure not just 
> > everybody can perform this action? With a check to verify that the same 
> > iterators exist on the destination table that existed on the source table?
> 
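The delete-suppression idea in option 2 above can be sketched in miniature. This is a toy model in plain Java with simplified keys (row, timestamp, delete flag) — an illustration of the idea only, not Accumulo's SortedKeyValueIterator API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of option 2: before merging a cloned-in file, drop its delete
// markers and the same-row, older-or-equal-timestamp entries they hide.
// Keys are simplified to (row, timestamp, isDelete).
public class DeleteSuppressDemo {
    record Entry(String row, long ts, boolean isDelete) {}

    static List<Entry> suppressDeletes(List<Entry> sourceFile) {
        List<Entry> out = new ArrayList<>();
        for (Entry e : sourceFile) {
            if (e.isDelete()) continue; // drop the marker itself
            boolean hidden = sourceFile.stream().anyMatch(
                d -> d.isDelete() && d.row().equals(e.row()) && d.ts() >= e.ts());
            if (!hidden) out.add(e); // keep only entries no marker covers
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> src = List.of(
            new Entry("rowA", 10, true),   // delete marker at ts=10
            new Entry("rowA", 5, false),   // hidden by the marker
            new Entry("rowB", 7, false));  // unaffected
        System.out.println(suppressDeletes(src)); // only rowB survives
    }
}
```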
> John Vines wrote:
>     I see these concerns as no different from those for a standard bulk 
> import. And those concerns are currently handled by requiring the user doing 
> the bulk import to be aware of the data they're calling the command on.
> 
> Christopher Tubbs wrote:
>     I agree that they aren't much different than bulk import. However, this 
> is not a "bulkImportFrom" feature, it is a "cloneInto" feature, and it's 
> *very* different behavior than clone. With bulk import, also, we have a 
> special permission, that can be reserved to those who understand its 
> particulars. It's also the case that typical bulk import use cases involve 
> writing data that will not have deletes, or will not come from a source with 
> a different view than the underlying files.
>     
>     I'm just concerned about user expectations, and how we address them.
> 
> John Vines wrote:
>     What if we require the source table to be compacted down to 1 file per 
> tablet to ease concerns about deletes?
>     
>     And I totally agree with the permissions. Would you be okay with 
> requiring bulk import on the destination, or would you rather it have its 
> own permission?
> 
> John Vines wrote:
>     Actually, compacting down to 1 file is expensive and may not be required 
> if there are no deletes in play. I think this is best left to users of the 
> system to handle according to their needs.
> 
> Christopher Tubbs wrote:
>     Yeah, compacting would be expensive. It's fine to leave it to the user... 
> but we do need to communicate the expected behavior and pitfalls very 
> thoroughly, and perhaps lock the feature down with a special permission, like 
> bulk import has, to prevent non-permitted users from performing this 
> function, who may not understand the pitfalls.
> 
> John Vines wrote:
>     I see a few ways to achieve this, curious what your opinion is-
>     For the destination table I see 2 options-
>     1. Just utilize bulk import because it has known pitfalls that need to be 
> accommodated
>     2. Introduce new permission for this
>     
>     And then for the source table, there's a few more options-
>     1. READ
>     2. if we introduce a new permission for the destination table, use that 
> same permission
>     3. A new permission for cloning FROM
> 
> Christopher Tubbs wrote:
>     For the destination table:
>     
>     If we utilize the bulk import permission, I'd say we'd definitely want to 
> change the name of this operation to reflect that it's more of a "bulk 
> import" variation than a "clone" variation, at least from the user's 
> perspective (even though the implementation details are more like clone).
>     
>     For the source table:
>     
>     I don't think READ is sufficient, because you can copy and reveal 
> underlying data that you couldn't normally see with READ. For instance, 
> create table, apply a majc iterator which drops visibilities, then clone from 
> old table, which you can READ, into this new table, which you control, 
> compact, and now you can see data you weren't supposed to. You could also 
> bypass any other iterator that would normally be in the read path (though, 
> that might already be possible by clobbering iterators with scan time 
> iterators, anyway). At the very least, you'd have to have the equivalent of 
> ALTER_TABLE on the source in order to get the same level of exposure to that 
> underlying data from the source table with today's features.
>     
>     Rather than per-table permissions, we could just add a system-wide 
> permission that locks down the feature. As it stands, this feature allows 
> getting underlying data and letting it escape the data owner's (table 
> creator's) control, by moving it to a different table. It can result in 
> unintuitive behavior, and grave security concerns, which go beyond that of 
> bulk import. A system-level permission might be appropriate (or, just require 
> the existing system-level admin permission). In general, a system-level 
> permission kinda seems to make sense anyway, since it involves multiple 
> tables.
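The READ-is-insufficient argument above can be shown with a toy model (plain Java, illustrative names only, not the Accumulo client API): a scan honors the reader's authorizations, but a file-level clone copies the raw underlying entries, and a visibility-dropping compaction iterator on a table the reader controls then exposes them:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the escape described above: READ filters scans by the
// reader's authorizations, but a file-level clone moves the raw entries,
// and a compaction iterator on the attacker-controlled destination table
// strips the visibilities. Names are illustrative, not Accumulo API.
public class VisibilityEscapeDemo {
    record Entry(String row, String visibility, String value) {}

    // What a READ-permitted user actually sees when scanning the source.
    static List<Entry> scan(List<Entry> table, List<String> auths) {
        return table.stream().filter(e -> auths.contains(e.visibility())).toList();
    }

    // What a file-level cloneInto moves: every underlying entry, unfiltered.
    static List<Entry> cloneInto(List<Entry> sourceFiles) {
        return new ArrayList<>(sourceFiles);
    }

    // A compaction-time iterator on the destination drops visibilities.
    static List<Entry> stripVisibilities(List<Entry> table) {
        return table.stream().map(e -> new Entry(e.row(), "", e.value())).toList();
    }

    public static void main(String[] args) {
        List<Entry> source = List.of(
            new Entry("rowA", "public", "ok"),
            new Entry("rowB", "secret", "hidden"));
        // "" models the empty visibility, readable by everyone.
        List<String> myAuths = List.of("public", "");

        System.out.println(scan(source, myAuths).size());  // 1: secret hidden
        System.out.println(
            scan(stripVisibilities(cloneInto(source)), myAuths).size()); // 2: exposed
    }
}
```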

> I'm curious about the expected behavior for when the source data contains 
> deletes

This is an interesting point. The reverse is also possible: deletes in the 
destination table may clobber data being merged in from the source. I think 
the chances of deletes in the dest causing problems are similar to the chances 
of it happening w/ bulk import. I feel like the chances of deletes from the 
src causing problems are higher than the chances of that happening w/ bulk 
import. We need some good documentation to inform users of the pitfalls.
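The pitfall in miniature (plain Java, simplified timestamp/delete semantics, not Accumulo code): the merged view resolves each row by newest timestamp, so a delete marker still present in an uncompacted source file clobbers live destination data, while the same cloneInto after the source compacts does not:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration: merging cloned-in files with the destination keeps the
// newest entry per row, so a delete marker on either side can clobber live
// data on the other, and whether the marker is present depends on whether
// the source has compacted since the delete.
public class CloneIntoPitfallDemo {
    record Entry(String row, long ts, boolean isDelete, String value) {}

    // Visible value per row after merging all files.
    static Map<String, String> mergedView(List<List<Entry>> files) {
        Map<String, Entry> newest = new HashMap<>();
        for (List<Entry> f : files)
            for (Entry e : f)
                newest.merge(e.row(), e, (a, b) -> a.ts() >= b.ts() ? a : b);
        Map<String, String> view = new TreeMap<>();
        newest.forEach((row, e) -> { if (!e.isDelete()) view.put(row, e.value()); });
        return view;
    }

    public static void main(String[] args) {
        List<Entry> dest = List.of(new Entry("rowA", 5, false, "kept?"));
        // Before the source compacts, its file still carries a delete marker.
        List<Entry> srcUncompacted = List.of(new Entry("rowA", 10, true, null));
        // After compaction, the marker (and anything it hid) is gone.
        List<Entry> srcCompacted = List.of();

        System.out.println(mergedView(List.of(dest, srcUncompacted))); // {} - clobbered
        System.out.println(mergedView(List.of(dest, srcCompacted)));   // {rowA=kept?}
    }
}
```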


- kturner


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27198/#review65843
-----------------------------------------------------------


On Dec. 22, 2014, 7:59 p.m., John Vines wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27198/
> -----------------------------------------------------------
> 
> (Updated Dec. 22, 2014, 7:59 p.m.)
> 
> 
> Review request for accumulo.
> 
> 
> Bugs: ACCUMULO-3236
>     https://issues.apache.org/jira/browse/ACCUMULO-3236
> 
> 
> Repository: accumulo
> 
> 
> Description
> -------
> 
> Includes all code to support feature, including thrift changes
> Includes minor code cleanup to TableLocator and items in the Bulk path to 
> remove signature items that are unused (arguments & exceptions)
> Includes renaming of some bulk import functions to clarify their purpose 
> (because they're now multi-purpose)
> 
> Patch is based on 1.6, but we can choose to make it target only 1.7 if we 
> choose (this conversation should be taken up on jira, not in RB)
> 
> 
> Diffs
> -----
> 
>   
> core/src/main/java/org/apache/accumulo/core/client/admin/TableOperations.java 
> 97f538d 
>   
> core/src/main/java/org/apache/accumulo/core/client/impl/RootTabletLocator.java
>  97d476b 
>   
> core/src/main/java/org/apache/accumulo/core/client/impl/TableOperationsImpl.java
>  07df1bd 
>   core/src/main/java/org/apache/accumulo/core/client/impl/TabletLocator.java 
> e396d82 
>   
> core/src/main/java/org/apache/accumulo/core/client/impl/TabletLocatorImpl.java
>  c550f15 
>   
> core/src/main/java/org/apache/accumulo/core/client/impl/TimeoutTabletLocator.java
>  bcbe561 
>   
> core/src/main/java/org/apache/accumulo/core/client/impl/thrift/TableOperation.java
>  7716823 
>   
> core/src/main/java/org/apache/accumulo/core/client/mock/MockTableOperationsImpl.java
>  de19137 
>   
> core/src/main/java/org/apache/accumulo/core/client/mock/impl/MockTabletLocator.java
>  35f160f 
>   
> core/src/main/java/org/apache/accumulo/core/master/thrift/FateOperation.java 
> f65f552 
>   
> core/src/main/java/org/apache/accumulo/core/tabletserver/thrift/TabletClientService.java
>  2ba7674 
>   core/src/main/thrift/client.thrift 38a8076 
>   core/src/main/thrift/master.thrift 38e9227 
>   core/src/main/thrift/tabletserver.thrift 25e0b10 
>   
> core/src/test/java/org/apache/accumulo/core/client/admin/TableOperationsHelperTest.java
>  1d91574 
>   
> core/src/test/java/org/apache/accumulo/core/client/impl/TableOperationsHelperTest.java
>  02838ed 
>   
> server/base/src/main/java/org/apache/accumulo/server/client/BulkImporter.java 
> 4cc13a9 
>   
> server/base/src/main/java/org/apache/accumulo/server/client/ClientServiceHandler.java
>  fe17a62 
>   
> server/base/src/main/java/org/apache/accumulo/server/security/SecurityOperation.java
>  258080c 
>   
> server/base/src/test/java/org/apache/accumulo/server/client/BulkImporterTest.java
>  3680341 
>   
> server/master/src/main/java/org/apache/accumulo/master/FateServiceHandler.java
>  5818da3 
>   
> server/master/src/main/java/org/apache/accumulo/master/tableOps/CloneIntoTable.java
>  PRE-CREATION 
>   server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java 
> 9a07a4a 
>   server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java 
> 3f594cc 
>   
> test/src/main/java/org/apache/accumulo/test/performance/thrift/NullTserver.java
>  0591b19 
>   test/src/test/java/org/apache/accumulo/test/functional/CloneIntoIT.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/27198/diff/
> 
> 
> Testing
> -------
> 
> Includes CloneIntoIT, which exercises all permutations of the flags. The 
> existing BulkIT still functions as intended, validating that no functionality 
> was lost in refactoring the existing code for multi-purposing.
> 
> 
> Thanks,
> 
> John Vines
> 
>
