Hi Keith: Thanks so much for the response. We will base things off the RowEncodingIterator then.
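For concreteness, here is a minimal sketch of the per-row check, assuming the whole row has already been buffered (as RowEncodingIterator does before it returns anything). It is plain Java with no Accumulo dependencies: `RowCheck`, `Entry`, and `shouldReport` are hypothetical names for illustration, and the validity test is passed in as a predicate because the actual criteria are not spelled out here. In a real iterator this logic would presumably live in a row-level hook such as the `filter` method on RowEncodingIterator, which is worth verifying against the Accumulo version in use.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of Accumulo: decides whether a fully
// buffered row should be reported, using the three criteria from the
// original question.
final class RowCheck {

    // One buffered entry of a row; only the fields the check needs.
    static final class Entry {
        final String columnFamily;
        final String value;

        Entry(String columnFamily, String value) {
            this.columnFamily = columnFamily;
            this.value = value;
        }
    }

    // Returns true when the row should be reported, i.e. when the target
    // CF is (1) present with an invalid value, (2) missing from the row,
    // or (3) present more than once.
    static boolean shouldReport(List<Entry> row, String targetCf,
                                Predicate<String> isValid) {
        int seen = 0;
        boolean invalid = false;
        for (Entry e : row) {
            if (e.columnFamily.equals(targetCf)) {
                seen++;
                if (!isValid.test(e.value)) {
                    invalid = true;
                }
            }
        }
        return invalid || seen != 1;
    }
}
```

Because the count is computed over the complete buffered row, no state has to survive between calls, so an iterator stack rebuild between rows would not affect the result.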
A few follow-up questions out of curiosity:

1. It is likely that my iterator will not return very many records because I'm hoping we don't have much invalid data. Should I be worried about the fact that it's not going to return much data? I guess I should expect long-running scans. And then, if a t-server dies, it just won't know where to pick up from, so entries might get re-scanned?

2. What are the criteria Accumulo uses to decide it's time to rebuild an entire iterator stack? Feel free to point me at code and I can read it from there.

Thanks,

- Logan

On Wed, Jul 5, 2023 at 7:13 PM Keith Turner <ke...@deenlo.com> wrote:

> There are two options for this. One is to buffer the row in memory and
> encode it in your iterator like the whole row iterator does. The other
> is to use the isolated scanner[1][2], but this does not work for batch
> scans.
>
> Accumulo should not tear iterators down until after they return
> something; this is the behavior that the WholeRowIterator relies on. So
> if your iterator reads the entire row from its source iterator without
> returning anything, then Accumulo will not do anything to the iterator
> or its data sources. An iterator's data sources are the files and
> optionally a snapshot of the in-memory map. After the top-level
> iterator has returned a key value, it's possible that Accumulo could
> rebuild the iterator stack with new data sources (like new files that
> arrived or a new snapshot of the in-memory map). This means you can use
> the trick of having the top-level iterator not return anything until a
> row boundary is seen.
>
> For isolated scans, Accumulo will only tear down iterators and use new
> data sources on row boundaries. Enabling isolation on a scanner[1] will
> cause the scanner to throw an isolation exception if a tablet server
> dies while the client scanner is in the middle of reading a row. The
> IsolatedScanner[2] wraps a scanner and hides the isolation exception by
> buffering rows and rereading them when an isolation exception occurs,
> making it easy to use isolated scans.
>
> The WholeRowIterator handles a tablet server dying or a data source
> changing well because it encodes the entire row as a single key value,
> so if the client gets it then it will not request that row again.
>
> [1]: https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/core/client/Scanner.html#enableIsolation()
> [2]: https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/core/client/IsolatedScanner.html
>
> On Wed, Jul 5, 2023 at 10:53 AM Logan Jones <lo...@codescratch.com> wrote:
>
> > Hello Mailing List:
> >
> > I have an iterator that will scan an entire table and only return
> > keys that match these criteria:
> >
> > 1. If a specific CF is "invalid" according to some criteria
> > 2. If a specific CF is missing on a row
> > 3. If there are multiple entries for a specific CF
> >
> > #1 would be easy to accomplish with a Filter, however #2 and #3 have
> > proven to be more tricky. As I understand the problem, Accumulo can,
> > at any point, destroy an iterator and re-call init. I am keeping some
> > internal state related to a row (namely a count of how many times
> > I've seen that specific CF).
> >
> > How can I keep the state I need for an entire row?
> >
> > I've looked at the RowEncodingIterator along with the
> > WholeRowIterator, but based on my understanding, it feels like
> > Accumulo should be allowed to destroy their state at any time and
> > cause them to effectively break. Is there a guarantee that an
> > iterator won't get destroyed mid row?
> >
> > Thanks,
> >
> > - Logan