Alright. It's clear now. Thank you guys so much!

> On Nov 22, 2016, at 1:44 PM, Christopher <ctubb...@apache.org> wrote:
> 
> The names of the scanners don't clearly reflect how they behave.
> 
> The regular Scanner is really a sequential scanner. It queries one tablet at 
> a time, sequentially, in-order, for a given range. So, the data it will 
> return is always in-order, and doesn't need to be merged explicitly in the 
> client.
> 
> The BatchScanner is really a parallel scanner, which queries multiple ranges 
> simultaneously, and the API does not have ordering guarantees. So, whichever 
> threads have data first will have their data seen first.
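The ordering difference described above can be illustrated with a small simulation. This is only a sketch in plain Python, not the Accumulo client API: `TABLETS`, `read_tablet`, `sequential_scan`, and `parallel_scan` are hypothetical names made up for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical tablets: each holds a sorted, non-overlapping slice of rows.
TABLETS = [
    [("a", 1), ("c", 2), ("e", 3)],
    [("h", 4), ("j", 5)],
    [("l", 6), ("z", 7)],
]

def read_tablet(tablet):
    """Stand-in for reading one tablet; returns its entries in sorted order."""
    return list(tablet)

def sequential_scan(tablets):
    """Visit tablets one at a time, in key order (Scanner-like behavior)."""
    for tablet in tablets:
        yield from read_tablet(tablet)

def parallel_scan(tablets, threads=3):
    """Read tablets concurrently and yield whichever finishes first
    (BatchScanner-like behavior); the overall order is not guaranteed."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(read_tablet, t) for t in tablets]
        for done in as_completed(futures):
            yield from done.result()

seq = list(sequential_scan(TABLETS))
par = list(parallel_scan(TABLETS))
```

Both calls return the same entries; only the sequential version is guaranteed to return them in sorted order without any client-side merge.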
> 
> Regarding iterators, the server side constructs a "stack" of iterators, based 
> on their priority, and the data traverses this stack before being sent back 
> to the client:
> 
> scan on tserver (system iterators -> user iter 1 -> user iter 2 -> user iter 3) -> client
> 
> Only data coming out of the end of the pipeline is returned to the client. 
> The iterator stack could get torn-down and reconstructed during the lifetime 
> of the scan.
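The iterator stack described above can be sketched as chained generators. This is a plain-Python illustration, not Accumulo's iterator interface: `system_source`, `drop_negative`, `double_values`, and `build_stack` are hypothetical names, and the sketch assumes that a lower priority number places an iterator closer to the data.

```python
def system_source(entries):
    """Stand-in for the system iterators: emits sorted key/value pairs."""
    yield from sorted(entries)

def drop_negative(source):
    """Hypothetical user iterator: filters out entries with negative values."""
    for key, value in source:
        if value >= 0:
            yield key, value

def double_values(source):
    """Hypothetical user iterator: transforms each value."""
    for key, value in source:
        yield key, value * 2

def build_stack(entries, user_iterators):
    """Chain iterators by priority (assumed: lower number runs first).
    Each iterator consumes the output of the one beneath it."""
    stream = system_source(entries)
    for _, iterator_fn in sorted(user_iterators, key=lambda pair: pair[0]):
        stream = iterator_fn(stream)
    return stream

entries = [("b", -1), ("a", 2), ("c", 3)]
stack = build_stack(entries, [(30, double_values), (20, drop_negative)])

# Only what comes out of the top of the stack would reach the client.
result = list(stack)
```

Here `result` is `[("a", 4), ("c", 6)]`: the whole stack runs on the server side, and the client never sees the intermediate output of any single iterator.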
> 
>> On Tue, Nov 22, 2016 at 1:09 PM Yamini Joshi <yamini.1...@gmail.com> wrote:
>> So, for a batch scan, the merge is not required but, for a scan, since it 
>> returns sorted data, data from tserver1 and tserver2 is merged at the client?
>> 
>> I know how to write iterators but I can't visualize the workflow. Let's say 
>> in the same example I have 3 custom iterators to be applied on data: it1, 
>> it2, it3 respectively. When are the iterators applied:
>> 
>> 1. scan on tserver -> client -> it1 on tserver -> client -> it2 on tserver -> client -> it3 on tserver -> client
>> I'm sure this is not the case; it would add a lot of overhead.
>> 
>> 2. scan on tserver -> it1 on tserver -> it2 on tserver -> it3 on tserver -> client
>> Is the processing done in batches?
>> Is data returned to the client when it reaches the max limit for 
>> table.scan.max.memory, even if it is in the middle of the pipeline above?
>> 
>> Best regards,
>> Yamini Joshi
>> 
>> On Tue, Nov 22, 2016 at 11:56 AM, Christopher <ctubb...@apache.org> wrote:
>> That's basically how it works, yes.
>> 
>> 1. The data from tserver1 and tserver2 necessarily comes from at least two 
>> different tablets. This is because tables are divided into discrete, 
>> non-overlapping tablets, and each tablet is hosted only on a single tserver. 
>> So, it is not normally necessary to merge the data from these two sources. 
>> Your application may do a join between the two tablets on the client side, 
>> but that is outside the scope of Accumulo.
>> 
>> 2. Custom iterators can be applied to minc, majc, and scan scopes. I suggest 
>> starting here: 
>> https://accumulo.apache.org/1.8/accumulo_user_manual.html#_iterators
>> 
>> 
>> On Tue, Nov 22, 2016 at 12:05 PM Yamini Joshi <yamini.1...@gmail.com> wrote:
>> Hello all
>> 
>> I am trying to understand Accumulo scan workflow. I've checked the official 
>> docs but I couldn't understand the workflow properly. Could anyone please 
>> tell me if I'm on the right track? For example if I want to scan rows in the 
>> range e-g in a table mytable which is sharded across 3 nodes in the cluster:
>> 
>> Step 1: Client connects to ZooKeeper and gets the location of the root 
>> tablet.
>> Step 2: Client connects to the tserver with the root tablet and gets the 
>> location of mytable.
>> The row distribution is as follows:
>> tserver1    tserver2    tserver3
>> a-g         h-k         l-z
>> 
>> Step 3: Client connects to tserver1 and tserver2. 
>> Step 4: tservers merge and sort data from in-memory maps, minc files and majc 
>> files, apply versioning iterator, seek the requested range and send data 
>> back to the client. 
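The last step above (merge sorted sources, keep the newest version, restrict to the requested range) can be sketched roughly as follows. This is a simplified plain-Python model, not tserver code: the sources, `sort_key`, and `scan` are hypothetical, entries are `(row, timestamp, value)` tuples, and the range filter is applied after the merge purely for brevity.

```python
import heapq
from itertools import groupby, islice

# Hypothetical sources, each already sorted by row, then newest timestamp first.
in_memory_map = [("e", 9, "v3"), ("g", 5, "v1")]
file_1 = [("a", 1, "v1"), ("e", 4, "v2")]
file_2 = [("e", 2, "v1"), ("f", 3, "v1")]

def sort_key(entry):
    """Order by row ascending, then timestamp descending."""
    row, timestamp, _ = entry
    return (row, -timestamp)

def scan(sources, start, end, max_versions=1):
    """Merge the sorted sources, keep the newest max_versions entries per row,
    and emit only rows inside the inclusive range [start, end]."""
    merged = heapq.merge(*sources, key=sort_key)
    for row, versions in groupby(merged, key=lambda entry: entry[0]):
        if start <= row <= end:
            yield from islice(versions, max_versions)

result = list(scan([in_memory_map, file_1, file_2], "e", "g"))
```

For the range e-g, `result` holds one entry per row, and row `e` keeps only its newest version (timestamp 9).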
>>  
>> Is this how a scan works? Also, I have some doubts:
>> 1. Where is the data from tserver1 and tserver2 merged?
>> 2. When and how are custom iterators applied?
>> 
>> 
>> Also, if there is any resource explaining this, please point me to it. I've 
>> found some slides but no detailed explanation.  
>> 
>> 
>> Best regards,
>> Yamini Joshi
>> 
