Re: [External] Re: Q: BatchScanner and parallel (i.e. m/r style) execution

Roberts, Geoffry [USA] Sat, 16 Jan 2021 10:59:45 -0800

Sweet

Thanks


Geoffry Roberts
Lead Technologist
702.290.9098
[email protected]

Booz | Allen | Hamilton
BoozAllen.com

From: Christopher <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Saturday, January 16, 2021 at 1:57 PM
To: accumulo-user <[email protected]>
Subject: Re: [External] Re: Q: BatchScanner and parallel (i.e. m/r style) 
execution

Not to all servers, just to those hosting data in that range. But otherwise, 
yes.

On Sat, Jan 16, 2021 at 1:45 PM Roberts, Geoffry [USA] 
<[email protected]> wrote:
If I have a batch scanner with one large range, and this range spans 
several tservers, Accumulo will distribute it to all tservers, process it 
in parallel, and give me back a single result set?

Geoffry Roberts
Lead Technologist
702.290.9098
[email protected]

Booz | Allen | Hamilton
BoozAllen.com

From: Christopher <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Saturday, January 16, 2021 at 1:39 PM
To: accumulo-user <[email protected]>
Subject: [External] Re: Q: BatchScanner and parallel (i.e. m/r style) execution

A BatchScanner takes multiple ranges, groups them by TServer, and then queries 
TServers in parallel for the ranges that are located in each, returning data in 
its iterator as it comes back (without regard to order).
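
As a rough illustration of the group-then-query-in-parallel behavior described 
above, here is a toy model in plain Java. It is a sketch under stated 
assumptions, not Accumulo's implementation: the class BatchScanModel, the 
locator function, and the string "ranges" are all hypothetical stand-ins, and 
the per-server "scan" just echoes back the ranges it was handed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical toy model; none of these names are Accumulo APIs.
public class BatchScanModel {

    // Group each range under the server that hosts it.
    // The locator is an assumed stand-in for tablet location lookup.
    public static Map<String, List<String>> groupByServer(
            List<String> ranges, Function<String, String> locator) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String range : ranges) {
            grouped.computeIfAbsent(locator.apply(range), k -> new ArrayList<>()).add(range);
        }
        return grouped;
    }

    // Query every server in parallel and collect results in completion order,
    // so the combined result has no guaranteed ordering across servers.
    public static List<String> scanAll(Map<String, List<String>> grouped, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            CompletionService<List<String>> completion = new ExecutorCompletionService<>(pool);
            for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
                completion.submit(() -> {
                    // Simulated per-server scan: tag each range with its server.
                    List<String> out = new ArrayList<>();
                    for (String range : e.getValue()) {
                        out.add(e.getKey() + ":" + range);
                    }
                    return out;
                });
            }
            List<String> results = new ArrayList<>();
            for (int i = 0; i < grouped.size(); i++) {
                results.addAll(completion.take().get());
            }
            return results;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scanning", ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException("simulated scan failed", ex);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Assumed layout: rows before 'm' live on tserver1, the rest on tserver2.
        Function<String, String> locator = r -> r.charAt(0) < 'm' ? "tserver1" : "tserver2";
        Map<String, List<String>> grouped = groupByServer(List.of("a-c", "d-f", "x-z"), locator);
        System.out.println(scanAll(grouped, 2));
    }
}
```

Only servers that actually host one of the requested ranges appear in the 
grouped map, so only those servers are queried.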

If you run the same scan on multiple nodes, the task won't be sub-divided in 
any way... it will just be multiple nodes querying for the same thing. If you 
want, you can sub-divide your ranges in your client code, distribute those 
ranges to different nodes, and have each node scan only its designated range. 
You probably wouldn't use a BatchScanner for that. A regular Scanner would 
suffice. This is how AccumuloInputFormat works, implemented for both Hadoop's 
"mapred" and "mapreduce" APIs.

See more in the Javadocs:

https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/core/client/BatchScanner.html
https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/hadoop/mapred/AccumuloInputFormat.html
https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/hadoop/mapreduce/AccumuloInputFormat.html
https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/core/client/mapred/AccumuloInputFormat.html
https://accumulo.apache.org/docs/2.x/apidocs/org/apache/accumulo/core/client/mapreduce/AccumuloInputFormat.html
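
The client-side sub-division mentioned above (split the ranges, hand each node 
its own share, and have each node scan only that share) might look roughly 
like the sketch below. It is plain Java with no Accumulo dependency, and 
everything in it is an assumption for illustration: the class 
RangePartitioner, the RowRange type, and the round-robin assignment are 
hypothetical. The split points are assumed to be the table's sorted tablet 
split rows, which the Java client can list via TableOperations.listSplits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Accumulo.
public class RangePartitioner {

    // A row range: start is exclusive, end is inclusive, null means unbounded.
    public static final class RowRange {
        public final String startExclusive;
        public final String endInclusive;

        public RowRange(String startExclusive, String endInclusive) {
            this.startExclusive = startExclusive;
            this.endInclusive = endInclusive;
        }

        @Override
        public String toString() {
            return "(" + startExclusive + ", " + endInclusive + "]";
        }
    }

    // Turn N sorted split points into N + 1 contiguous ranges covering the table.
    public static List<RowRange> fromSplits(List<String> sortedSplits) {
        List<RowRange> ranges = new ArrayList<>();
        String previous = null;
        for (String split : sortedSplits) {
            ranges.add(new RowRange(previous, split));
            previous = split;
        }
        ranges.add(new RowRange(previous, null));
        return ranges;
    }

    // Round-robin assignment: worker i takes ranges i, i + workerCount, ...
    // Every worker computes this independently from the same inputs,
    // so no two workers scan the same range.
    public static List<RowRange> rangesFor(int workerIndex, int workerCount, List<RowRange> all) {
        List<RowRange> mine = new ArrayList<>();
        for (int i = workerIndex; i < all.size(); i += workerCount) {
            mine.add(all.get(i));
        }
        return mine;
    }

    public static void main(String[] args) {
        List<RowRange> all = fromSplits(List.of("g", "m", "t"));
        for (int worker = 0; worker < 2; worker++) {
            System.out.println("worker " + worker + " scans " + rangesFor(worker, 2, all));
        }
    }
}
```

Each node would then open a regular Scanner per assigned range. Assigning by 
worker index keeps the nodes from needing to coordinate, at the cost of 
ignoring data locality, which AccumuloInputFormat does take into account.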



On Sat, Jan 16, 2021 at 11:28 AM Roberts, Geoffry [USA] 
<[email protected]> wrote:
All,

Three questions all asking the same thing:

Can an Accumulo scan or batchscan run like a map/reduce job?

I have an Accumulo 2.0 cluster.

In hadoop, I can launch a map/reduce job on the name node and hadoop 
distributes the job over the nodes of the cluster and the job runs in parallel.

In Accumulo, I am calling the batch scanner from some non-Java code that is 
first distributed across the cluster; on each node it then attaches to 
Accumulo and does the scan. It works on a single-node Accumulo, so far so 
good. I need to scale up and run it multi-node. I am concerned that I'll wind 
up running the same scan on each node, which would return an array of 
identical result sets. Am I correct?

Can I somehow get the Hadoop m/r effect in accumulo?

Thanks

Geoffry Roberts
Lead Technologist
702.290.9098
[email protected]

Booz | Allen | Hamilton
BoozAllen.com
