Re: Accumulo Working

Yamini Joshi Tue, 22 Nov 2016 10:43:51 -0800

I see. So, foa a scan opertaion that span 2 tservers: the client knows that
the ts1 contains range a-h (via communication with zookeeper) and once it
gets the data it requests the next range from ts2?


Also, can the data be sent to the client in the middle of the pipeline
Tserver -> it1 -> it2 -> it3 -> client (if the max limit is reached) or is
it always at the the end of the pipeline?

Best regards,
Yamini Joshi

On Tue, Nov 22, 2016 at 12:36 PM, Josh Elser <[email protected]> wrote:

> Scanners are sequentially communicating with TabletServers, as opposed to
> BatchScanners which do this communication in parallel. Scanners aren't so
> much "merging" data, but requesting it in sorted order from the appropriate
> TabletServer.
>
> All Iterators are applied to some batch of results from a TabletServer
> before the results are sent to the client. So, the action is more Tserver
> -> it1 -> it2 -> it3 -> client. Multiple iterators does not increase the
> number of RPCs.
>
> And yes, data is returned in batches from a TabletServer, constrained by
> the max memory setting you listed. When the boundaries of the Tablet that
> are currently be read from are reached (this is rowId boundaries), the
> batch would also be returned immediately.
>
> Yamini Joshi wrote:
>
>> So, for a batch scan, the merge is not required but, for a scan, since
>> it returns sorted data, data from tserver1 and tserver2 is merged at the
>> client?
>>
>> I know how to write iterators but I can't vsiualize the workflow. Lets
>> say in the same example I have 3 custom iterators to be applied on data:
>> it1, it2, it3 respectively. When are the iterators applied:
>>
>> 1. scan on tserver -> client -> it1 on tserver -> client -> it2 on
>> tserver  -> client -> it3 on tserver -> client
>> I'm sure this is not the case, it adds a lot of overhead
>>
>> 2. scan on tserver ->  it1 on tserver ->  it2 on tserver  -> it3 on
>> tserver -> client
>> The processing is done in batches?
>> Data is returned to the client when it reaches the max limit for
>> table.scan.max.memory even if it is in the middle of the pipeline above?
>>
>> Best regards,
>> Yamini Joshi
>>
>> On Tue, Nov 22, 2016 at 11:56 AM, Christopher <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>>     That's basically how it works, yes.
>>
>>     1. The data from tserver1 and tserver2 necessarily comes from at
>>     least two different tablets. This is because tables are divided into
>>     discrete, non-overlapping tablets, and each tablet is hosted only on
>>     a single tserver. So, it is not normally necessary to merge the data
>>     from these two sources. Your application may do a join between the
>>     two tablets on the client side, but that is outside the scope of
>>     Accumulo.
>>
>>     2. Custom iterators can be applied to minc, majc, and scan scopes. I
>>     suggest starting here:
>>     https://accumulo.apache.org/1.8/accumulo_user_manual.html#_iterators
>>     <https://accumulo.apache.org/1.8/accumulo_user_manual.html#_iterators
>> >
>>
>>
>>     On Tue, Nov 22, 2016 at 12:05 PM Yamini Joshi <[email protected]
>>     <mailto:[email protected]>> wrote:
>>
>>         Hello all
>>
>>         I am trying to understand Accumulo scan workflow. I've checked
>>         the official docs but I couldn't understand the workflow
>>         properly. Could anyone please tell me if I'm on the right track?
>>         For example if I want to scan rows in the range e-g in a table
>>         mytable which is sharded across 3 nodes in the cluster:
>>
>>         Step1: Client connects to the Zookeeper and gets the location of
>>         the root tablet.
>>         Step2: Client connects to tserver with the root tablet and gets
>>         the location of mytable.
>>         the row distribution is as follows:
>>         tserver1             tserver2                   tserver3
>>         a-g                       h-k                            l-z
>>
>>         Step3: Client connects to tserver1 and tserver2.
>>         Step4: tservers merge and sort data from in-memory maps, minc
>>         files and majc files, apply versioning iterator, seek the
>>         requested range and send data back to the client.
>>
>>         Is this how a scan works? Also, I have some doubts:
>>         1. Where is the data from tserver1 and tserver2 merged?
>>         2. when and how are custom iterators applied?
>>
>>
>>         Also, if there is any resource explaining this, please point me
>>         to it. I've found some slides but no detailed explanation.
>>
>>
>>         Best regards,
>>         Yamini Joshi
>>
>>
>>

Re: Accumulo Working

Reply via email to