There are two options for this.  One is to buffer the row in memory and
encode it in your iterator, as the WholeRowIterator does.  The other is
to use an isolated scanner[1][2], but isolation is not available for batch
scanners.

Accumulo should not tear iterators down until after they return something;
this is the behavior that the WholeRowIterator relies on.  So if your
iterator reads the entire row from its source iterator without returning
anything, then Accumulo will not do anything to the iterator or its data
sources.  An iterator's data sources are the files and, optionally, a
snapshot of the in-memory map.  After the top-level iterator has returned a
key value, it's possible that Accumulo could rebuild the iterator stack with
new data sources (like new files that arrived or a new snapshot of the
in-memory map).  This means you can use the trick of having the top-level
iterator not return anything until a row boundary is seen.
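
To make the trick concrete, here is a minimal sketch in plain Java.  It
does not use the Accumulo iterator API at all: Entry, RowCheckIterator and
badRows are hypothetical names made up for illustration, and the source is
just a sorted java.util.Iterator.  It only shows the shape of the logic:
consume a whole row from the source, keep the per-row count in a local
variable, and hand something back only once the row boundary has been seen.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Illustrative only: a stand-in for a sorted key/value source, not Accumulo.
final class RowBufferingSketch {

    /** Hypothetical minimal entry: a row id plus a column family. */
    static final class Entry {
        final String row;
        final String cf;

        Entry(String row, String cf) {
            this.row = row;
            this.cf = cf;
        }
    }

    /**
     * Emits the id of every row whose number of targetCf entries is not
     * exactly one (missing or duplicated).  Nothing is returned until the
     * whole row has been read from the source.
     */
    static final class RowCheckIterator implements Iterator<String> {
        private final Iterator<Entry> source;
        private final String targetCf;
        private Entry pending; // first entry of the next row, already read
        private String next;   // next row id to emit, or null when done

        RowCheckIterator(Iterator<Entry> source, String targetCf) {
            this.source = source;
            this.targetCf = targetCf;
            this.pending = source.hasNext() ? source.next() : null;
            advance();
        }

        private void advance() {
            next = null;
            while (next == null && pending != null) {
                String row = pending.row;
                int count = 0; // per-row state lives only inside this loop
                // Consume the entire row before deciding anything.
                while (pending != null && pending.row.equals(row)) {
                    if (pending.cf.equals(targetCf)) {
                        count++;
                    }
                    pending = source.hasNext() ? source.next() : null;
                }
                if (count != 1) {
                    next = row; // row boundary reached, safe to emit
                }
            }
        }

        @Override
        public boolean hasNext() {
            return next != null;
        }

        @Override
        public String next() {
            if (next == null) {
                throw new NoSuchElementException();
            }
            String result = next;
            advance();
            return result;
        }
    }

    /** Convenience wrapper: collects all flagged row ids from sorted input. */
    static List<String> badRows(List<Entry> sorted, String targetCf) {
        List<String> out = new ArrayList<>();
        new RowCheckIterator(sorted.iterator(), targetCf).forEachRemaining(out::add);
        return out;
    }
}
```

In a real iterator the same loop would sit behind the top-level iterator's
seek/next handling, and the emitted value would be whatever you want the
client to see for that row.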

For isolated scans, Accumulo will only tear down iterators and use new data
sources on row boundaries.  Enabling isolation on a Scanner[1] will cause
the scanner to throw an isolation exception if a tablet server dies while
the client scanner is in the middle of reading a row.  The
IsolatedScanner[2] wraps a scanner and hides the isolation exception by
buffering rows and rereading them when an isolation exception occurs,
making it easy to use isolated scans.

The WholeRowIterator copes well with a tablet server dying or a data source
changing because it encodes the entire row as a single key value, so once
the client has received it, the client will not request that row again.

[1]: https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/core/client/Scanner.html#enableIsolation()
[2]: https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/core/client/IsolatedScanner.html



On Wed, Jul 5, 2023 at 10:53 AM Logan Jones <lo...@codescratch.com> wrote:

> Hello Mailing List:
>
> I have an iterator that will scan an entire table and only return keys that
> match these criteria:
>
>    1. If a specific CF is "invalid" according to some criteria
>    2. If a specific CF is missing on a row
>    3. If there are multiple entries for a specific CF
>
> #1 would be easy to accomplish with a Filter, however #2 and #3 have proven
> to be more tricky. As I understand the problem, Accumulo can, at any point,
> destroy an iterator and re-call init. I am keeping some internal state
> related to a row (namely a count of how many times I've seen that specific
> CF).
>
> How can I keep the state I need for an entire row?
>
> I've looked at the RowEncodingIterator along with the WholeRowIterator, but
> based on my understanding, it feels like Accumulo should be allowed to
> destroy their state at any time and cause them to effectively break. Is
> there a guarantee that an iterator won't get destroyed mid row?
>
> Thanks,
>
> - Logan
>
