Hi Keith:

Thanks so much for the response. We will base things off the
RowEncodingIterator then.

A few follow-up questions out of curiosity:


   1. It is likely that my iterator will not return very many records
   because I'm hoping we don't have much invalid data. Should I be worried
   about the fact that it's not going to return much data? I guess I should
   expect long running scans. And then, if a t-server dies, it just won't know
   where to pick up from so entries might get re-scanned?
   2. What are the criteria Accumulo uses to decide it's time to re-build
   an entire iterator stack? Feel free to point me at code and I can read it
   from there.

Thanks,

- Logan

On Wed, Jul 5, 2023 at 7:13 PM Keith Turner <ke...@deenlo.com> wrote:

> There are two options for this.  One is to buffer the row in memory and
> encode it in your iterator like the whole row iterator does.  The other is
> to use the isolated scanner[1][2], but this does not work for batch scans.
>
> Accumulo should not tear iterators down until after they return something,
> this is the behavior that the wholerowiterator relies on.  So if your
> iterator reads the entire row from its source iterator without returning
> anything then Accumulo will not do anything to the iterator or its data
> sources.  An iterator's data sources are the files and optionally a snapshot
> of the in memory map.   After the top level iterator has returned a key
> value, it's possible that Accumulo could rebuild the iterator stack with new
> data sources (like new files that arrived or a new snapshot of the in
> memory map).  This means you can use the trick of having the top level
> iterator not return anything until a row boundary is seen.
>
> For isolated scans Accumulo will only tear down iterators and use new data
> sources on row boundaries.   Enabling isolation on scanner[1] will cause
> the scanner to throw an isolation exception if a tablet server dies while
> the client scanner is in the middle of reading a row.  The
> IsolatedScanner[2] wraps a scanner and hides the isolation exception by
> buffering rows and rereading them when an isolation exception occurs,
> making it easy to use isolated scans.
>
> The wholerowiterator handles a tablet server dying or data source changing
> well because it encodes the entire row as a single key value, so if the
> client gets it then it will not request that row again.
>
> [1]: https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/core/client/Scanner.html#enableIsolation()
> [2]: https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/core/client/IsolatedScanner.html
>
>
>
> On Wed, Jul 5, 2023 at 10:53 AM Logan Jones <lo...@codescratch.com> wrote:
>
> > Hello Mailing List:
> >
> > I have an iterator that will scan an entire table and only return keys
> > that match these criteria:
> >
> >    1. If a specific CF is "invalid" according to some criteria
> >    2. If a specific CF is missing on a row
> >    3. If there are multiple entries for a specific CF
> >
> > #1 would be easy to accomplish with a Filter, however #2 and #3 have
> > proven to be more tricky. As I understand the problem, Accumulo can, at
> > any point, destroy an iterator and re-call init. I am keeping some
> > internal state related to a row (namely a count of how many times I've
> > seen that specific CF).
> >
> > How can I keep the state I need for an entire row?
> >
> > I've looked at the RowEncodingIterator along with the WholeRowIterator,
> > but based on my understanding, it feels like Accumulo should be allowed
> > to destroy their state at any time and cause them to effectively break.
> > Is there a guarantee that an iterator won't get destroyed mid row?
> >
> > Thanks,
> >
> > - Logan
> >
>
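
The trick discussed in this thread (keep per-row state and emit nothing until
a row boundary is seen) can be illustrated without any Accumulo classes. The
sketch below is only a plain-Java simulation over an in-memory list that is
assumed to be sorted by row. The names RowCheckSketch, Entry and findBadRows
are hypothetical and are not part of the Accumulo API. It covers criteria #2
(CF missing) and #3 (CF repeated) from the original question and leaves out
#1. A real iterator would more likely extend RowEncodingIterator, as decided
above.

```java
import java.util.ArrayList;
import java.util.List;

// Simulation only: an Entry stands in for an Accumulo key (row + column family).
class RowCheckSketch {

    static final class Entry {
        final String row;
        final String cf;

        Entry(String row, String cf) {
            this.row = row;
            this.cf = cf;
        }
    }

    // Walks entries sorted by row and reports every row in which targetCf
    // is missing or occurs more than once. Nothing is reported for a row
    // until the whole row has been read.
    static List<String> findBadRows(List<Entry> sorted, String targetCf) {
        List<String> bad = new ArrayList<>();
        String currentRow = null;
        int count = 0;
        for (Entry e : sorted) {
            if (!e.row.equals(currentRow)) {
                // Row boundary: the count for the previous row is now final.
                report(bad, currentRow, count);
                currentRow = e.row;
                count = 0;
            }
            if (e.cf.equals(targetCf)) {
                count++;
            }
        }
        // The last row has no following boundary, so flush it explicitly.
        report(bad, currentRow, count);
        return bad;
    }

    private static void report(List<String> bad, String row, int count) {
        if (row == null || count == 1) {
            return; // no row seen yet, or the row is valid
        }
        bad.add(row + (count == 0 ? ":missing" : ":duplicate"));
    }

    public static void main(String[] args) {
        List<Entry> entries = List.of(
            new Entry("r1", "data"), new Entry("r1", "meta"),
            new Entry("r2", "data"),
            new Entry("r3", "meta"), new Entry("r3", "meta"));
        System.out.println(findBadRows(entries, "meta")); // prints [r2:missing, r3:duplicate]
    }
}
```

Because a result is produced only at a row boundary, the counter never has to
survive a teardown that happens after a key value has been returned, which is
the property the WholeRowIterator is described as relying on.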
