Hi Dylan,

Correct me if I'm wrong, but it seems that returning entries in unsorted
order would require a seek that needs to go backward (if and when the
iterator stack is torn down and recreated).  Seeking in a non-forward
direction is warned against in the chapter you linked to.
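
To make the concern concrete, here is a minimal sketch in plain Java.  It is
a toy model, not Accumulo code: the class ReseekSketch, the method scan, and
the parameters batch and maxEmits are all hypothetical names.  It assumes the
resume rule described further down this thread, namely that after a teardown
the scan re-seeks to just past the last emitted key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.function.IntUnaryOperator;

// Toy model only: none of these names come from Accumulo.
class ReseekSketch {

    // Scans the sorted source keys in [start, end] and emits transform(key).
    // After every `batch` emitted entries the "stack" is torn down, and the
    // scan resumes at the first source key strictly greater than the last
    // EMITTED key (the assumed resume rule). maxEmits guards against a scan
    // that never makes progress.
    static List<Integer> scan(TreeSet<Integer> source, int start, int end,
                              IntUnaryOperator transform, int batch, int maxEmits) {
        List<Integer> out = new ArrayList<>();
        Integer cursor = source.ceiling(start);
        int sinceTeardown = 0;
        while (cursor != null && cursor <= end && out.size() < maxEmits) {
            int emitted = transform.applyAsInt(cursor);
            out.add(emitted);
            sinceTeardown++;
            if (sinceTeardown == batch) {
                // Teardown: iterator state is lost, only the emitted key survives.
                sinceTeardown = 0;
                cursor = source.higher(emitted);
            } else {
                // Normal advance to the next source key.
                cursor = source.higher(cursor);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeSet<Integer> source = new TreeSet<>(List.of(1, 2, 3, 4, 5, 6));
        // Keys unchanged: the resume point is always correct.
        System.out.println(scan(source, 1, 6, k -> k, 2, 100));
        // Sorted but shifted out of the range: the resume point is past the end.
        System.out.println(scan(source, 1, 6, k -> k + 1_000_000, 2, 100));
        // Adjacent keys swapped, so output is unsorted: the resume point moves backward.
        System.out.println(scan(source, 1, 6, k -> (k % 2 == 1) ? k + 1 : k - 1, 2, 100));
    }
}
```

In this model the first scan returns all six keys, the second stops after two
entries because the resume point lands past the range, and the third re-reads
some source keys and skips others.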

Am I missing something?

Thanks in advance,

Jim

On Sat, May 16, 2015 at 11:03 PM, Dylan Hutchison <dhutc...@mit.edu> wrote:

> Dave,
>
> Check out the new chapter on iterator design
> <https://github.com/apache/accumulo/blob/master/docs/src/main/asciidoc/chapters/iterator_design.txt>
> going into 1.7 (applicable to all versions).
>
> Emitting entries in unsorted order should be ok for scan iterators but
> definitely not for compaction iterators.  Compaction iterators will fail
> when the FileSKVWriter sees an entry out of order
> <https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/file/rfile/RFile.java#L370>.
>
>
> Use cases like these are exciting.  Be prepared to debug the tablet server
> if your idea doesn't work.
>
> Cheers, Dylan
>
>
> On Sat, May 16, 2015 at 11:12 AM, Dave Hardcastle <
> hardcastle.d...@gmail.com> wrote:
>
>> Thanks James. I asked about the filtering example just to check my
>> understanding was right, but I agree it's probably a corner case.
>>
>> Re the documentation - I don't think the problem is not conforming to the
>> sorted key part. If you had row keys which were integers in increasing
>> order, and in the iterator added a million to each row key and emitted that
>> then you'd still get problems if there was a reseek (assuming that adding a
>> million took you out of the range). Admittedly I can't see why you'd do
>> that, but I'd read the javadoc, the manual and the Accumulo book carefully
>> and I hadn't picked up that the actual key that is emitted is relevant to
>> the reseek issue.
>>
>> BTW, none of this is meant to reflect badly on the iterator stack -
>> they're really powerful and are one of Accumulo's main selling points.
>>
>> Dave.
>>
>>
>> On 16 May 2015 at 14:55, James Hughes <jn...@virginia.edu> wrote:
>>
>>> Hi Dave,
>>>
>>> I can speak to the first question a little bit.  The one time I saw
>>> this, I traced the code and saw that after emitting a certain number of
>>> bytes, the iterator stack was recreated.  In that case, no further keys
>>> would have been filtered since the current key-value pair being emitted
>>> would trigger the reset and that key would be used for the re-seek.  I'll
>>> apply all caveats to that explanation: it was Accumulo 1.4, and I didn't
>>> learn why the stack was stopped and recreated or at what other times that
>>> may happen.
>>>
>>> On the other hand, one could imagine a tablet server dying in the middle
>>> of returning entries.  I have no idea of the details of how Accumulo
>>> handles that.  Worst case, you may be right about some reprocessing, but
>>> all this sounds like a corner case.
>>>
>>> For the documentation, writing about implementation details directly may
>>> not be the best way.  I'd hope that the documentation would make it clear
>>> that all iterators (even presumed 'top' or 'final' iterators) should
>>> conform to the 'sorted key' part of the contract.
>>>
>>> Thanks,
>>>
>>> Jim
>>>
>>>
>>> On Sat, May 16, 2015 at 3:27 AM, Dave Hardcastle <
>>> hardcastle.d...@gmail.com> wrote:
>>>
>>>> A couple of follow-up questions...
>>>>
>>>> So, is it true to say that a filtering iterator that filters out a high
>>>> percentage of the key-values in a range might have to redo a lot of work
>>>> if a reseek happens? (The reseek goes to the last emitted key, but a lot
>>>> of key-values past that may already have been rejected by the filter.)
>>>>
>>>> Would it be worth making the fact that the reseek happens to the last
>>>> emitted key explicit in the documentation? It seems natural to me to assume
>>>> that the reseek happens to one key past the last read key. I don't think
>>>> the javadoc for the seek() method in SortedKeyValueIterator makes it quite
>>>> clear enough.
>>>>
>>>> Thanks,
>>>>
>>>> Dave.
>>>>
>>>> On 15 May 2015 at 19:32, Eric Newton <eric.new...@gmail.com> wrote:
>>>>
>>>>> is it the same instance of the iterator object
>>>>>
>>>>>
>>>>> No, it is not.
>>>>>
>>>>> On Fri, May 15, 2015 at 2:16 PM, Dave Hardcastle <
>>>>> hardcastle.d...@gmail.com> wrote:
>>>>>
>>>>>> Jim,
>>>>>>
>>>>>> That explains a lot - I knew that the iterator stack could be resumed
>>>>>> in the middle of a range, but didn't realise that it used the last
>>>>>> emitted key to decide where to resume.
>>>>>>
>>>>>> Just so I'm clear, when iterators get stopped and later resumed, is
>>>>>> it the same instance of the iterator object that's restarted (so that I
>>>>>> could store state in there and use that to help the reseek) or is it a
>>>>>> new instance of the iterator that has to be able to resume purely on
>>>>>> the basis of the last emitted key?
>>>>>>
>>>>>> As you say though, it's probably best to stick to modifying values
>>>>>> only.
>>>>>>
>>>>>> Thanks very much,
>>>>>>
>>>>>> Dave.
>>>>>>
>>>>>> On 15 May 2015 at 18:55, James Hughes <jn...@virginia.edu> wrote:
>>>>>>
>>>>>>> Hi Dave,
>>>>>>>
>>>>>>> The big thing to note is that your iterator stack may get stopped
>>>>>>> and torn down for various reasons.  As Accumulo recreates the stack, it
>>>>>>> will call 'seek' with the last emitted key in order to resume.
>>>>>>>
>>>>>>> If you are returning keys out of order in an iterator, the 'seek'
>>>>>>> method needs to be able to undo the transformation and call 'seek'
>>>>>>> appropriately.  That's not impossible, but it isn't trivial.
>>>>>>>
>>>>>>> In GeoMesa, we did something like that at one point (without having
>>>>>>> a smart 'seek').  I enjoyed two days of debugging trying to figure out
>>>>>>> why medium sized requests would hang.  (There was an infinite loop....)
>>>>>>> From that experience, I'd suggest only modifying values.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 15, 2015 at 1:26 PM, Dave Hardcastle <
>>>>>>> hardcastle.d...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've always assumed that the last iterator in the stack can make
>>>>>>>> arbitrary changes to keys and values, including not returning the keys
>>>>>>>> in sorted order. I know that SortedKeyValueIterator says that "anything
>>>>>>>> implementing this interface should return keys in sorted order" - but I
>>>>>>>> don't see a good reason that has to be true for the final iterator.
>>>>>>>> This assumption seems to be backed up by the manual which says that
>>>>>>>> "the only safe way to generate additional data in an iterator is to
>>>>>>>> alter the current key-value pair" - it doesn't say that making
>>>>>>>> arbitrary modifications to the rowkey or key is forbidden.
>>>>>>>>
>>>>>>>> I have a situation where I am making a transformation of the rowkey
>>>>>>>> that may not preserve the ordering of the keys. When I scan for
>>>>>>>> individual ranges I get the correct results. When I scan for two ranges
>>>>>>>> using a BatchScanner, I get lots of data back which is not in the
>>>>>>>> ranges I queried for. I am not explicitly checking that I have not gone
>>>>>>>> beyond the range, but that should not be necessary as I am not doing
>>>>>>>> any seeking, only consuming the key-values I receive.
>>>>>>>>
>>>>>>>> So, my main question is whether the last iterator is allowed to not
>>>>>>>> return keys in sorted order?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Dave.
>>>>>>>>
