Alright. It's clear now. Thank you guys so much!

Sent from my iPhone
> On Nov 22, 2016, at 1:44 PM, Christopher <ctubb...@apache.org> wrote:
>
> The names of the scanners don't clearly reflect how they behave.
>
> The regular Scanner is really a sequential scanner. It queries one tablet
> at a time, sequentially, in order, for a given range. So, the data it
> returns is always in order and doesn't need to be merged explicitly in
> the client.
>
> The BatchScanner is really a parallel scanner, which queries multiple
> ranges simultaneously, and the API does not have ordering guarantees. So,
> whichever threads have data first will have their data seen first.
>
> Regarding iterators, the server side constructs a "stack" of iterators,
> based on their priority, and the data traverses this stack before being
> sent back to the client:
>
> scan on tserver (system iterators -> user iter 1 -> user iter 2 ->
> user iter 3) -> client
>
> Only data coming out of the end of the pipeline is returned to the
> client. The iterator stack can be torn down and reconstructed during the
> lifetime of the scan.
>
>> On Tue, Nov 22, 2016 at 1:09 PM Yamini Joshi <yamini.1...@gmail.com> wrote:
>>
>> So, for a batch scan, the merge is not required, but for a scan, since
>> it returns sorted data, data from tserver1 and tserver2 is merged at the
>> client?
>>
>> I know how to write iterators, but I can't visualize the workflow. Let's
>> say, in the same example, I have 3 custom iterators to be applied to the
>> data: it1, it2, it3, respectively. When are the iterators applied?
>>
>> 1. scan on tserver -> client -> it1 on tserver -> client -> it2 on
>> tserver -> client -> it3 on tserver -> client
>> I'm sure this is not the case; it adds a lot of overhead.
>>
>> 2. scan on tserver -> it1 on tserver -> it2 on tserver -> it3 on
>> tserver -> client
>> Is the processing done in batches? Is data returned to the client when
>> it reaches the max limit for table.scan.max.memory, even if it is in the
>> middle of the pipeline above?
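The pipeline Christopher describes (system iterators, then user iterators in priority order, then the client) can be sketched in plain Java. This is only an illustration of the data flow, not Accumulo's actual SortedKeyValueIterator API; the two "user iterators" here (a key filter and an upper-casing transform) are made up for the example.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Plain-Java sketch (NOT Accumulo's SortedKeyValueIterator API) of the
// server-side pipeline: the tserver chains iterators by priority, and only
// data leaving the end of the chain is sent back to the client.
public class IteratorStackSketch {

    // Hypothetical user iterator 1: drop entries whose key sorts before "f".
    static final UnaryOperator<List<String>> IT1 =
            in -> in.stream().filter(kv -> kv.compareTo("f") >= 0).toList();

    // Hypothetical user iterator 2: upper-case the surviving entries.
    static final UnaryOperator<List<String>> IT2 =
            in -> in.stream().map(String::toUpperCase).toList();

    // scan on tserver: (system iterators -> it1 -> it2) -> client
    static List<String> scan(List<String> fromSystemIterators) {
        return IT2.apply(IT1.apply(fromSystemIterators));
    }

    public static void main(String[] args) {
        // Sorted entries as they leave the system iterators.
        System.out.println(scan(List.of("e1", "f1", "g1"))); // prints [F1, G1]
    }
}
```

The point of the sketch is that each entry traverses the whole stack on the server; the client never sees intermediate results from partway down the chain.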
>>
>> Best regards,
>> Yamini Joshi
>>
>> On Tue, Nov 22, 2016 at 11:56 AM, Christopher <ctubb...@apache.org> wrote:
>>
>> That's basically how it works, yes.
>>
>> 1. The data from tserver1 and tserver2 necessarily comes from at least
>> two different tablets. This is because tables are divided into discrete,
>> non-overlapping tablets, and each tablet is hosted on only a single
>> tserver. So, it is not normally necessary to merge the data from these
>> two sources. Your application may do a join between the two tablets on
>> the client side, but that is outside the scope of Accumulo.
>>
>> 2. Custom iterators can be applied at the minc, majc, and scan scopes. I
>> suggest starting here:
>> https://accumulo.apache.org/1.8/accumulo_user_manual.html#_iterators
>>
>> On Tue, Nov 22, 2016 at 12:05 PM Yamini Joshi <yamini.1...@gmail.com> wrote:
>>
>> Hello all,
>>
>> I am trying to understand the Accumulo scan workflow. I've checked the
>> official docs, but I couldn't understand the workflow properly. Could
>> anyone please tell me if I'm on the right track? For example, suppose I
>> want to scan rows in the range e-g in a table mytable which is sharded
>> across 3 nodes in the cluster:
>>
>> Step 1: Client connects to ZooKeeper and gets the location of the root
>> tablet.
>> Step 2: Client connects to the tserver with the root tablet and gets the
>> location of mytable. The row distribution is as follows:
>>
>>   tserver1   tserver2   tserver3
>>   a-g        h-k        l-z
>>
>> Step 3: Client connects to tserver1 and tserver2.
>> Step 4: tservers merge and sort data from in-memory maps, minc files,
>> and majc files, apply the versioning iterator, seek the requested range,
>> and send data back to the client.
>>
>> Is this how a scan works? Also, I have some doubts:
>> 1. Where is the data from tserver1 and tserver2 merged?
>> 2. When and how are custom iterators applied?
>>
>> Also, if there is any resource explaining this, please point me to it.
>> I've found some slides but no detailed explanation.
>>
>> Best regards,
>> Yamini Joshi
>>
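Christopher's point 1 (tablets hold disjoint, ordered key ranges, so a sequential Scanner needs no client-side merge) can be made concrete with a small plain-Java sketch. The tablet contents and split points below are invented for illustration and match the a-g / h-k / l-z layout from the question; this is not Accumulo client code.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of why a sequential Scanner needs no client-side merge:
// tablets cover disjoint, ordered key ranges, so visiting them one at a
// time, in split order, already yields globally sorted results.
public class SequentialScanSketch {

    // Visit each tablet sequentially, like a Scanner, collecting keys in
    // [begin, end]. Results are simply concatenated, never merge-sorted.
    static List<String> scanRange(List<List<String>> tabletsInOrder,
                                  String begin, String end) {
        List<String> results = new ArrayList<>();
        for (List<String> tablet : tabletsInOrder) {
            for (String key : tablet) {
                if (key.compareTo(begin) >= 0 && key.compareTo(end) <= 0) {
                    results.add(key);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> tserver1 = List.of("apple", "dog", "egg", "fig"); // a-g
        List<String> tserver2 = List.of("hat", "kite");                // h-k
        List<String> tserver3 = List.of("lion", "zebra");              // l-z

        // A range spanning the tablets on tserver1 and tserver2: the
        // concatenated output is already sorted, with no merge step.
        System.out.println(
                scanRange(List.of(tserver1, tserver2, tserver3), "e", "l"));
        // prints [egg, fig, hat, kite]
    }
}
```

A BatchScanner, by contrast, would query those tablets in parallel, so entries from tserver2 could arrive before entries from tserver1; that is why it makes no ordering guarantee.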