Hi Mario,

As you gain more experience with Accumulo, feel free to write or modify Accumulo's documentation in the places you find it lacking and send a PR. If you find a topic confusing, many others probably do too.
Cheers,
Dylan

On Fri, Jul 15, 2016 at 4:04 PM, Christopher <[email protected]> wrote:

> Ah, I thought you were doing WholeRowIterator -> RowCounterIterator
> I now understand you're doing WholeRowIterator -> SomeCustomFilter (column
> predicate) -> RowCounterIterator
>
> That's okay to do, but it may be better to have an iterator that creates a
> clone of its source at the beginning of each row, advances to do the
> filtering, and then informs the spawning iterator to either accept or
> reject. This is, admittedly, far more complicated than WholeRowIterator,
> but it can be safer if you have really big rows which don't fit in memory.
>
> To your question about WholeRowIterator, yes, it's fine. The iterator will
> always see sorted data (unless it's sitting on top of another iterator
> which breaks this... which is possible, but not recommended at all), even
> though the client may not. And yes, rows are never split (but if the query
> range doesn't include the full row, it may return early). Their usage is
> orthogonal, and can be used together or not.
>
> On Fri, Jul 15, 2016 at 6:35 PM Mario Pastorelli <
> [email protected]> wrote:
>
>> The WholeRowIterator is for filtering: I need all the columns that the
>> filter requires so that the filter can see whether or not the row matches
>> the query. That's the only proper way I found to implement logic operators
>> on predicates over columns of the same row.
>>
>> Actually I do have a question about WholeRowIterator, while we are
>> talking about them. Do they make sense when used with a BatchScanner? My
>> guess is yes because while the BatchScanner can return data non-sorted to
>> the client, when it is scanning a single tablet the data is sorted. Because
>> the data of the same rowId is never split (right?) then there is no problem
>> in using a WholeRowIterator with a BatchScanner. Is this correct? I really
>> can't find much documentation for Accumulo and the book doesn't help enough.
>>
>> On Sat, Jul 16, 2016 at 12:29 AM, Christopher <[email protected]>
>> wrote:
>>
>>> It'd be more efficient to use the FirstEntryInRowIterator to just grab
>>> one each, rather than the WholeRowIterator which could use up a lot of
>>> memory unnecessarily.
>>>
>>> On Fri, Jul 15, 2016 at 6:20 PM Mario Pastorelli <
>>> [email protected]> wrote:
>>>
>>>> I'm actually using this after a WholeRowIterator, which is used to
>>>> filter rows with the same rowId.
>>>>
>>>> On Fri, Jul 15, 2016 at 10:02 PM, William Slacum <[email protected]>
>>>> wrote:
>>>>
>>>>> The iterator in the gist also counts cells/entries/KV pairs, not
>>>>> unique rows. You'll want to have some way to skip to the next row value if
>>>>> you want the count to be reflective of the number of rows being read.
>>>>>
>>>>> On Fri, Jul 15, 2016 at 3:34 PM, Shawn Walker <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> My read is that you're mistaking the sequence of calls Accumulo will
>>>>>> be making to your iterator. The sequence isn't quite the same as a Java
>>>>>> iterator (initially positioned "before" the first element), and is more
>>>>>> like a C++ iterator:
>>>>>>
>>>>>> 0. Accumulo calls seek(...)
>>>>>> 1. Is there more data? Accumulo calls hasTop(). You return yes.
>>>>>> 2. Ok, so there's data. Accumulo calls getTopKey(), getTopValue() to
>>>>>> retrieve the data. You return a key indicating 0 columns seen (since
>>>>>> next() hasn't yet been called)
>>>>>> 3. First datum done, Accumulo calls next()
>>>>>> ...
>>>>>>
>>>>>> I imagine that if you pull the second item out of your scan result,
>>>>>> it'll have the number you expect. Alternatively, you might consider
>>>>>> performing the count computation during an override of the seek(...)
>>>>>> method, instead of in the next(...) method.
>>>>>>
>>>>>> --
>>>>>> Shawn Walker
>>>>>>
>>>>>> On Fri, Jul 15, 2016 at 2:24 PM, Mario Pastorelli <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I'm trying to create a RowCounterIterator that counts all the rows
>>>>>>> and returns only one key-value with the counter inside. The problem
>>>>>>> is that I can't get it to work. The Scala code is available in the gist
>>>>>>> <https://gist.github.com/melrief/5f2ca248f1a980ddead2f2eeb19e6389>
>>>>>>> together with some pseudo-code of a test. The problem is that if I
>>>>>>> add an entry to my table, this iterator will return 0 instead of 1,
>>>>>>> and apparently the reason is that super.hasTop() is always false.
>>>>>>> I've tried without the iterator and the scanner returns 1 element.
>>>>>>> Any idea of what I'm doing wrong here? Is WrappingIterator the right
>>>>>>> class to extend for this kind of behaviour?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Mario
>>>>>>>
>>>>>>> --
>>>>>>> Mario Pastorelli | TERALYTICS
>>>>>>>
>>>>>>> *software engineer*
>>>>>>>
>>>>>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>>>>>> phone: +41794381682
>>>>>>> email: [email protected]
>>>>>>> www.teralytics.net
>>>>>>>
>>>>>>> Company registration number: CH-020.3.037.709-7 | Trade register
>>>>>>> Canton Zurich
>>>>>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark
>>>>>>> Schmitz, Yann de Vries
>>>>>>>
>>>>>>> This e-mail message contains confidential information which is for
>>>>>>> the sole attention and use of the intended recipient. Please notify us
>>>>>>> at once if you think that it may not be intended for you and delete it
>>>>>>> immediately.
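For anyone following the thread, here is a rough, self-contained sketch in Java of two points made in the replies: the call order Shawn describes (seek, then hasTop, then getTopKey/getTopValue, then next) and William's remark that the count should track distinct rows rather than entries. It deliberately does not use the real Accumulo classes. Entry, ListSource, RowCounter and scan are hypothetical names invented for illustration, and the real seek(...) takes arguments (a range, column families, an inclusive flag) that are omitted here. The count is computed eagerly in seek(), as Shawn suggests, so the first top already carries the total.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of the iterator call order; NOT the real Accumulo API.
// Every class and method name below is hypothetical.
public class RowCountSketch {

    /** A simplified entry: just a row id and a column name. */
    static final class Entry {
        final String row;
        final String column;
        Entry(String row, String column) { this.row = row; this.column = column; }
    }

    /** Stand-in for a source iterator over entries already sorted by row. */
    static final class ListSource {
        private final List<Entry> entries;
        private int pos;
        ListSource(List<Entry> entries) { this.entries = entries; }
        void seek() { pos = 0; }
        boolean hasTop() { return pos < entries.size(); }
        Entry getTop() { return entries.get(pos); }
        void next() { pos++; }
    }

    /** Counts distinct rows during seek(), then exposes a single result. */
    static final class RowCounter {
        private final ListSource source;
        private Long top; // null means there is no top entry
        RowCounter(ListSource source) { this.source = source; }

        void seek() {
            source.seek();
            long rows = 0;
            String previousRow = null;
            // Consume the whole source; bump the count only when the row changes.
            while (source.hasTop()) {
                String row = source.getTop().row;
                if (!row.equals(previousRow)) {
                    rows++;
                    previousRow = row;
                }
                source.next();
            }
            top = rows; // the result is ready before the first hasTop() call
        }
        boolean hasTop() { return top != null; }
        long getTopValue() { return top; }
        void next() { top = null; } // only one result is ever produced
    }

    /** Drives the counter in the order: seek, hasTop, getTopValue, next. */
    static List<Long> scan(List<Entry> entries) {
        RowCounter counter = new RowCounter(new ListSource(entries));
        List<Long> results = new ArrayList<>();
        counter.seek();
        while (counter.hasTop()) {
            results.add(counter.getTopValue());
            counter.next();
        }
        return results;
    }

    public static void main(String[] args) {
        List<Entry> table = Arrays.asList(
            new Entry("row1", "a"), new Entry("row1", "b"), new Entry("row2", "a"));
        System.out.println(scan(table)); // prints [2]
    }
}
```

Three entries spread over two rows give a count of 2, and the value is available on the first read because nothing depends on next() having been called. A real iterator would also need to decide which key to return alongside the count, which this sketch leaves out.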
