Hi Dylan,

Correct me if I'm wrong, but it seems that returning entries in unsorted order would require a seek that needs to go backward (if and when the iterator stack is torn down and recreated). Seeking in a non-forward direction is warned against in the chapter you linked to.
Am I missing something?

Thanks in advance,

Jim

On Sat, May 16, 2015 at 11:03 PM, Dylan Hutchison <dhutc...@mit.edu> wrote:

> Dave,
>
> Check out the new chapter on iterator design
> <https://github.com/apache/accumulo/blob/master/docs/src/main/asciidoc/chapters/iterator_design.txt>
> going into 1.7 (applicable to all versions).
>
> Emitting entries in unsorted order should be ok for scan iterators but
> definitely not for compaction iterators. Compaction iterators will fail
> when the FileSKVWriter sees an entry out of order
> <https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/file/rfile/RFile.java#L370>.
>
> Use cases like these are exciting. Be prepared to debug the tablet server
> if your idea doesn't work.
>
> Cheers, Dylan
>
> On Sat, May 16, 2015 at 11:12 AM, Dave Hardcastle <hardcastle.d...@gmail.com> wrote:
>
>> Thanks James. I asked about the filtering example just to check my
>> understanding was right, but I agree it's probably a corner case.
>>
>> Re the documentation - I don't think the problem is not conforming to the
>> sorted key part. If you had row keys which were integers in increasing
>> order, and in the iterator added a million to each row key and emitted that,
>> then you'd still get problems if there was a reseek (assuming that adding a
>> million took you out of the range). Admittedly I can't see why you'd do
>> that, but I'd read the javadoc, the manual and the Accumulo book carefully
>> and I hadn't picked up that the actual key that is emitted is relevant to
>> the reseek issue.
>>
>> BTW, none of this is meant to reflect badly on the iterator stack -
>> they're really powerful and are one of Accumulo's main selling points.
>>
>> Dave.
>>
>> On 16 May 2015 at 14:55, James Hughes <jn...@virginia.edu> wrote:
>>
>>> Hi Dave,
>>>
>>> I can speak to the first question a little bit.
>>> The one time I saw this, I traced the code and saw that after emitting
>>> a certain number of bytes, the iterator stack was recreated. In that
>>> case, no further keys would have been filtered since the current
>>> key-value pair being emitted would trigger the reset and that key would
>>> be used for the re-seek. I'll apply all caveats to that explanation: it
>>> was Accumulo 1.4, and I didn't learn why the stack was stopped and
>>> recreated or at what other times that may happen.
>>>
>>> On the other hand, one could imagine a tablet server dying in the middle
>>> of returning entries. I have no idea of the details of how Accumulo
>>> handles that. Worst case, you may be right about some reprocessing, but
>>> all this sounds like a corner case.
>>>
>>> For the documentation, writing about implementation details directly may
>>> not be the best way. I'd hope that the documentation would make it clear
>>> that all iterators (even presumed 'top' or 'final' iterators) should
>>> conform to the 'sorted key' part of the contract.
>>>
>>> Thanks,
>>>
>>> Jim
>>>
>>> On Sat, May 16, 2015 at 3:27 AM, Dave Hardcastle <hardcastle.d...@gmail.com> wrote:
>>>
>>>> A couple of follow-up questions...
>>>>
>>>> So, is it true to say that a filtering iterator that is filtering out a
>>>> high percentage of the key-values in a range might have to redo a lot of
>>>> work if a reseek happens? (It's reseeked to the last emitted key, but a lot
>>>> of key-values past that may already have been rejected by the filter.)
>>>>
>>>> Would it be worth making the fact that the reseek happens to the last
>>>> emitted key explicit in the documentation? It seems natural to me to assume
>>>> that the reseek happens to one key past the last read key. I don't think
>>>> the javadoc for the seek() method in SortedKeyValueIterator makes it quite
>>>> clear enough.
>>>>
>>>> Thanks,
>>>>
>>>> Dave.
>>>>
>>>> On 15 May 2015 at 19:32, Eric Newton <eric.new...@gmail.com> wrote:
>>>>
>>>>> is it the same instance of the iterator object
>>>>>
>>>>> No, it is not.
>>>>>
>>>>> On Fri, May 15, 2015 at 2:16 PM, Dave Hardcastle <hardcastle.d...@gmail.com> wrote:
>>>>>
>>>>>> Jim,
>>>>>>
>>>>>> That explains a lot - I knew that the iterator stack could be resumed
>>>>>> in the middle of a range, but didn't realise that it used the last
>>>>>> emitted key to decide where to resume.
>>>>>>
>>>>>> Just so I'm clear, when iterators get stopped and later resumed, is
>>>>>> it the same instance of the iterator object that's restarted (so that I
>>>>>> could store state in there and use that to help the reseek) or is it a
>>>>>> new instance of the iterator that has to be able to resume purely on the
>>>>>> basis of the last emitted key?
>>>>>>
>>>>>> As you say though, it's probably best to stick to modifying values
>>>>>> only.
>>>>>>
>>>>>> Thanks very much,
>>>>>>
>>>>>> Dave.
>>>>>>
>>>>>> On 15 May 2015 at 18:55, James Hughes <jn...@virginia.edu> wrote:
>>>>>>
>>>>>>> Hi Dave,
>>>>>>>
>>>>>>> The big thing to note is that your iterator stack may get stopped
>>>>>>> and torn down for various reasons. As Accumulo recreates the stack, it
>>>>>>> will call 'seek' with the last emitted key in order to resume.
>>>>>>>
>>>>>>> If you are returning keys out of order in an iterator, the 'seek'
>>>>>>> method needs to be able to undo the transformation and call 'seek'
>>>>>>> appropriately. That's not impossible, but it isn't trivial.
>>>>>>>
>>>>>>> In GeoMesa, we did something like that at one point (without having
>>>>>>> a smart 'seek'). I enjoyed two days of debugging trying to figure out
>>>>>>> why medium sized requests would hang. (There was an infinite loop....)
>>>>>>> From that experience, I'd suggest only modifying values.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>> On Fri, May 15, 2015 at 1:26 PM, Dave Hardcastle <hardcastle.d...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've always assumed that the last iterator in the stack can make
>>>>>>>> arbitrary changes to keys and values, including not returning the
>>>>>>>> keys in sorted order. I know that SortedKeyValueIterator says that
>>>>>>>> "anything implementing this interface should return keys in sorted
>>>>>>>> order" - but I don't see a good reason that has to be true for the
>>>>>>>> final iterator. This assumption seems to be backed up by the manual
>>>>>>>> which says that "the only safe way to generate additional data in an
>>>>>>>> iterator is to alter the current key-value pair" - it doesn't say
>>>>>>>> that making arbitrary modifications to the rowkey or key is
>>>>>>>> forbidden.
>>>>>>>>
>>>>>>>> I have a situation where I am making a transformation of the rowkey
>>>>>>>> that may not preserve the ordering of the keys. When I scan for
>>>>>>>> individual ranges I get the correct results. When I scan for two
>>>>>>>> ranges using a BatchScanner, I get lots of data back which is not in
>>>>>>>> the ranges I queried for. I am not explicitly checking that I have
>>>>>>>> not gone beyond the range, but that should not be necessary as I am
>>>>>>>> not doing any seeking, only consuming the key-values I receive.
>>>>>>>>
>>>>>>>> So, my main question is whether the last iterator is allowed to not
>>>>>>>> return keys in sorted order?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Dave.
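
The re-seek behaviour discussed in this thread can be illustrated with a small toy model. This is only a sketch and does not use Accumulo's API: the class `ReseekDemo`, the method `scan`, and the `batchSize` and `maxBatches` parameters are hypothetical names, and "tear down after a fixed number of emitted keys" is an assumption standing in for whatever actually triggers the teardown. The one behaviour taken from the thread is that the recreated iterator resumes from the last emitted key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.function.IntUnaryOperator;

/** Toy model of a scan whose iterator is torn down and re-seeked; not Accumulo code. */
class ReseekDemo {

    /**
     * Scans the sorted source keys and emits transform(key) for each one.
     * After batchSize emitted keys the "iterator" is discarded (assumed
     * trigger), and the next one seeks the source to the first key strictly
     * greater than the last EMITTED key. maxBatches is a safety cap so that
     * a scan which keeps re-seeking backward still terminates.
     */
    static List<Integer> scan(TreeSet<Integer> source, IntUnaryOperator transform,
                              int batchSize, int maxBatches) {
        List<Integer> emitted = new ArrayList<>();
        Integer lastEmitted = null;
        for (int batch = 0; batch < maxBatches; batch++) {
            // Re-seek: the new iterator only knows the last emitted key.
            Iterable<Integer> remaining = (lastEmitted == null)
                    ? source
                    : source.tailSet(lastEmitted, false);
            int count = 0;
            for (int key : remaining) {
                if (count == batchSize) {
                    break; // batch is full: tear the iterator down
                }
                lastEmitted = transform.applyAsInt(key);
                emitted.add(lastEmitted);
                count++;
            }
            if (count < batchSize) {
                break; // source exhausted after the seek: scan is finished
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        TreeSet<Integer> source = new TreeSet<>(Arrays.asList(11, 12, 13, 14, 15));
        // Keys unchanged: every key is returned exactly once.
        System.out.println(scan(source, k -> k, 2, 10));             // [11, 12, 13, 14, 15]
        // Add a million: the re-seek lands past the data, so 13..15 are lost.
        System.out.println(scan(source, k -> k + 1_000_000, 2, 10)); // [1000011, 1000012]
        // Shift keys backward: each re-seek restarts at 11 and repeats forever
        // (stopped here only by the maxBatches cap of 3).
        System.out.println(scan(source, k -> k - 10, 2, 3));         // [1, 2, 1, 2, 1, 2]
    }
}
```

In this model the second call corresponds to the "add a million to each row key" example above, and the third resembles the kind of loop that can make a request hang. An iterator that only modifies values leaves the emitted keys equal to the source keys, which is the first case.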