ctubbsii commented on PR #6040:
URL: https://github.com/apache/accumulo/pull/6040#issuecomment-3726217254

   Discussed iterator conflicts today, and here's a summary of some key points:
   
   1. **Conflict within the config**: In configuration, no two iterators at the 
same scope (scan, minc, majc) should be able to have the same priority.
      * This applies only to the complete view of the `TableConfiguration`, 
with all inherited properties from parent configs (namespace, system, ...), so 
it is okay, for example, if a table config set at the namespace level is 
overridden in part at the table level, so that the one single iterator at that 
scope and priority has configuration that spans across two levels of the 
configs. What is important is that the resulting view of the 
`TableConfiguration` when trying to construct an iterator stack, will not show 
any two different iterators at the same scope with the same priority.
      * Checks could be in place when editing table/namespace configuration to 
ensure a priority isn't "doubled up". A user who wishes to replace `iteratorA` 
with `iteratorB` at the same priority would have to remove `iteratorA` before 
adding `iteratorB`, or would have to use `modifyProperties` to atomically 
mutate the properties to remove and add at the same time, in order to avoid an 
error. This alone, however, **does not guarantee** that there isn't a conflict. 
If `iteratorA` had been set at the `namespaceN.tableT`, but `iteratorB` was 
being added to `namespaceN`, we would have to check that there isn't a conflict 
with *any* of the tables in `namespaceN`. That's not exactly practical, so we 
may just want to check that there isn't a conflict at the level being modified, 
and rely on later checks when setting up the iterator stack to verify that 
there isn't a conflict overall.
      * Note: this probably would be easier to deconflict if our iterator 
configs used a different property key scheme that was more overrideable 
atomically, like 
`table.<scope>.iterator.<priority>=class,opt1key=opt1val,opt2key=opt2val,....` 
so that it wouldn't be possible to have conflict between namespace and table 
configs, because one would fully override the other. But, that's not what we 
have today.
   2. **Conflict in user-supplied iterators for a specific scan/compaction**: 
No two iterators in a single client-initiated operation's settings should have 
the same priority.
      * This is for things like scan-time iterators set in the API for a scan, 
and passed over the RPC, rather than iterators set in configuration on the 
server-side.
      * This also applies to any other place where we might be able to specify 
iterators that aren't in the configuration (compactions, conditional mutations, 
etc.)
      * We could check for conflicts set on the scanner easily, but would have 
to rely on the server-side setting up the iterator stack to ensure no conflicts 
between the user iterators and those set on the table.
   3. **Conflict between configured iterators and user-supplied iterators**: 
The complete iterator stack for an operation should not have any iterators 
running at the same priority, regardless of whether it came from the 
configuration or from the client API/RPC request.
      * To address this, we can simply check the full iterator stack when it is 
being constructed on the server-side, and fail the operation if any priorities 
are reused, regardless of where they came from.
      * Alternatively, we could treat one as overriding another, but I don't 
think that's a very good idea.
      * As a follow-on improvement here, we could treat all configured 
iterators as higher priority than all client operation-specific 
(scan-time/compaction-time) iterators:
         1. Instead of ordering three configured iterators and two 
user-supplied iterators by priority alone, as in `C1, U2, C3, U4, C5`, we would 
instead order them as `C1, C3, C5, U2, U4`.
         2. This enables stronger security guarantees by preventing a 
user-supplied iterator from seeing data that is filtered out in a 
administrator-configured iterator.
         3. This prevents bugs that could be caused by a user-supplied iterator 
that transforms data in a way that a subsequent administrator-configured 
iterator won't be able to handle.
         4. This is a behavior change, and may break some people's 
(ill-advised) uses, but I think it is better overall.
         5. This would also open the possibility of having a cleaner 
client-side API, because you don't actually have to specify priority numbers on 
the client. Instead, clients only need to order user-supplied iterators with 
respect to other user-supplied iterators, and won't need a priority number to 
indicate a global ordering that includes the configured iterators for a table. 
So, we could have an API something like: 
`scanner.map(iterator1).map(iterator2).map(iterator3).scan()`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to