ctubbsii commented on PR #6040:
URL: https://github.com/apache/accumulo/pull/6040#issuecomment-3726217254
Discussed iterator conflicts today, and here's a summary of some key points:
1. **Conflict within the config**: In configuration, no two iterators at the
same scope (scan, minc, majc) should be able to have the same priority.
* This applies only to the complete view of the `TableConfiguration`,
with all inherited properties from parent configs (namespace, system, ...), so
it is okay, for example, if a table config set at the namespace level is
overridden in part at the table level, so that the one single iterator at that
scope and priority has configuration that spans across two levels of the
configs. What is important is that the resulting view of the
`TableConfiguration` when trying to construct an iterator stack, will not show
any two different iterators at the same scope with the same priority.
* Checks could be in place when editing table/namespace configuration to
ensure a priority isn't "doubled up". A user who wishes to replace `iteratorA`
with `iteratorB` at the same priority would have to remove `iteratorA` before
adding `iteratorB`, or would have to use `modifyProperties` to atomically
mutate the properties to remove and add at the same time, in order to avoid an
error. This alone, however, **does not guarantee** that there isn't a conflict.
If `iteratorA` had been set at the `namespaceN.tableT`, but `iteratorB` was
being added to `namespaceN`, we would have to check that there isn't a conflict
with *any* of the tables in `namespaceN`. That's not exactly practical, so we
may just want to check that there isn't a conflict at the level being modified,
and rely on later checks when setting up the iterator stack to verify that
there isn't a conflict overall.
* Note: this probably would be easier to deconflict if our iterator
configs used a different property key scheme that was more overrideable
atomically, like
`table.<scope>.iterator.<priority>=class,opt1key=opt1val,opt2key=opt2val,....`
so that it wouldn't be possible to have conflict between namespace and table
configs, because one would fully override the other. But, that's not what we
have today.
2. **Conflict in user-supplied iterators for a specific scan/compaction**:
No two iterators in a single client-initiated operation's settings should have
the same priority.
* This is for things like scan-time iterators set in the API for a scan,
and passed over the RPC, rather than iterators set in configuration on the
server-side.
* This also applies to any other place where we might be able to specify
iterators that aren't in the configuration (compactions, conditional mutations,
etc.)
* We could check for conflicts set on the scanner easily, but would have
to rely on the server-side setting up the iterator stack to ensure no conflicts
between the user iterators and those set on the table.
3. **Conflict between configured iterators and user-supplied iterators**:
The complete iterator stack for an operation should not have any iterators
running at the same priority, regardless of whether it came from the
configuration or from the client API/RPC request.
* To address this, we can simply check the full iterator stack when it is
being constructed on the server-side, and fail the operation if any priorities
are reused, regardless of where they came from.
* Alternatively, we could treat one as overriding another, but I don't
think that's a very good idea.
* As a follow-on improvement here, we could treat all configured
iterators as higher priority than all client operation-specific
(scan-time/compaction-time) iterators:
1. Instead of ordering three configured iterators and two
user-supplied iterators by priority alone, as in `C1, U2, C3, U4, C5`, we would
instead order them as `C1, C3, C5, U2, U4`.
2. This enables stronger security guarantees by preventing a
user-supplied iterator from seeing data that is filtered out in a
administrator-configured iterator.
3. This prevents bugs that could be caused by a user-supplied iterator
that transforms data in a way that a subsequent administrator-configured
iterator won't be able to handle.
4. This is a behavior change, and may break some people's
(ill-advised) uses, but I think it is better overall.
5. This would also open the possibility of having a cleaner
client-side API, because you don't actually have to specify priority numbers on
the client. Instead, clients only need to order user-supplied iterators with
respect to other user-supplied iterators, and won't need a priority number to
indicate a global ordering that includes the configured iterators for a table.
So, we could have an API something like:
`scanner.map(iterator1).map(iterator2).map(iterator3).scan()`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]