dlmarion commented on PR #2665:
URL: https://github.com/apache/accumulo/pull/2665#issuecomment-1113302935
## Background
Accumulo TabletServers are responsible for:
1. ingesting new data
2. compacting (merging) new and old data into files
3. reading data from files to support system and user scans
4. performing maintenance on Tablets (assignments, merging, splitting,
bulk importing, etc).
To support these activities newly ingested data is hosted in memory
(in-memory maps) until it's written to a file, and blocks of accessed files may
be cached within the TabletServer for better performance. The TabletServer has
configuration properties to control the amout of memory available to the heap,
in-memory maps, and block caches, and the size of the various thread pools that
perform these activities. For example:
tserver.assignment.concurrent.max
tserver.bulk.process.threads
tserver.cache.data.size
tserver.cache.index.size
tserver.cache.summary.size
tserver.compaction.major.concurrent.max
tserver.compaction.minor.concurrent.max
tserver.memory.maps.max
tserver.migrations.concurrent.max
tserver.recovery.concurrent.max
tserver.scan.executors.default.threads
tserver.scan.executors.meta.threads
tserver.scan.files.open.max
tserver.server.threads.minimum
tserver.sort.buffer.size
tserver.summary.partition.threads
tserver.summary.remote.threads
tserver.total.mutation.queue.max
tserver.workq.threads
When a TabletServer exhausts available memory, for whatever reason, an
OutOfMemoryError will be raised and the TabletServer will be terminated. When
this happens all of the running scans on that TabletServer are paused while the
Tablets are re-hosted and then the scans continue on the new TabletServers once
the re-hosting process is complete. If the cause of the TabletServer failure
was due to scans on a particular Tablet, then this process will repeat until
there are no TabletServers remaining or the pattern is identified by a
user/admin and the scan process is terminated.
## Objective
Provide Accumulo users with the ability to run scans without terminating the
TabletServer.
## Possible approaches
1. Run the scan in a separate process
2. Restrict memory usage on a per-scan basis
3. Read directly from files in client side scan code. This approach does
not allow a small number of clients to scale out a large number of expensive
queries to tablet and/or scan servers. It also may lead to an OOM killing a
client process that may be executing multiple concurrent scans. It also does
not allow client to leverage cache of data and metadata on a scan server or
tablet server.
## This approach
Create a separate server process that is used to run user scans and give the
user the option whether or not to use the new server process on a per-scan
basis. Provide the user with the ability to control how many scans will be
affected if this new process dies and how many of these new processes to use
for a single scan.
## Implementation
This PR includes:
1. a new server process called the ScanServer.
2. changes to the Accumulo client
3. changes to the GarbageCollector
4. Ancillary changes
### Scan Server
The ScanServer is a TabletHostingServer that hosts SnapshotTablets and
implements the TabletScanClientService Thrift API. When the ScanServer receives
a request via the scan API, it creates a SnapshotTablet object from the Tablet
metadata (which may be cached), and then uses the ThriftScanClientHandler to
complete the scan operations. The user scan is run using the same code that the
TabletServer uses; the ScanServer is just responsible for ensuring that the
Tablet exists for the scan code. The Tablet hosted within the ScanServer may
not contain the exact same data as the corresponding Tablet hosted by the
TabletServer. The ScanServer does not have any of the Tablet data that may
reside within the in-memory maps and the Tablet may reference files that have
been compacted as Tablet metadata can be cached within the ScanServer (see
Property.SSERV_CACHED_TABLET_METADATA_EXPIRATION). The number of concurrent
scans that the ScanServer will run is configurable (Property.SSERV_SCAN_EXECU
TORS_DEFAULT_THREADS and Property.SSERV_SCAN_EXECUTORS_PREFIX). The ScanServer
has other configuration properties that can be set to allow it to have
different settings than the TabletServer (Thrift, block caches, etc). It is
also possible that a ScanServer may be hosting multiple versions of a
SnapshotTablet in the case where scans are in progress, the TabletMetadata has
expires, and a new scan request arrives.
Scan servers implement a busy timeout parameter on their scan RPCs. The
busytimeout allows a client to specify a configurable time during which the
scan must either start running or throw a busy thrift exception. On the client
side this busy exception can be detected and a different scan server selected.
### Client changes
A new method has been added to the client (ScannerBase.setConsistencyLevel)
to configure the client to use IMMEDIATE (default) or EVENTUAL consistency for
scans. IMMEDIATE means that the user wants to scan all data related to the
Tablet at the time of the scan. To accomplish this the client will send the
scan request to the TabletServer that is hosting the Tablet. This is the
current behavior and is the default configuration, so no code change is
required to have the same behavior. The other possible value, EVENTUAL, means
that the user is willing to relax the data freshness guarantee that the
TabletServer provides and instead potentially improve the chances of their scan
completing when their scan is known to take a long time or require a lot of
memory. When the consistency level is set to EVENTUAL the client uses a
ScanServerDispatcher class to determine which ScanServers to use. The user can
supply their own ScanServerDispatcher implementation
(ClientProperty.SCAN_SERVER_DISPAT
CHER) if they don't want to use the DefaultScanServerDispatcher (see class
javadoc for a description of the behavior). Scans will be sent to the
TabletServer in the event that EVENTUAL consistency is selected for the client
and no ScanServers are running.
#### Default scan server dispatcher
The default scan server dispatcher that executes on the client side has the
following strategy for selecting a scan server.
* It hashes a tablets tableId, end row, and prev endrow. This hash is used
to consitently map the tablet to one of three random scan servers. So for a
given tablet the same three random scan servers are used by different tablets.
* The client sends a request to one of the three scan servers with a small
busytimeout.
* If a busytimeout exception happens, then the default scan server
dispatcher will notice this and it will choose from a larger set of scan
servers.
* The default scan server dispatcher will expand rapidly to randomly
selecting from all scan servers after which point it will start exponentially
increasing the busy timeout.
For example if there are 1000 scan servers and a lot of them are busy, the
default scan dispatcher might do something like the following. This example
shows how it will rapidly increase the set of servers chosen from and then
start rapidly increasing the busy timeout. The reason to start increasing the
busy timeout after observing a lot busy exceptions is that those provide
evidence that the entire cluster of scan servers may be busy. So eventually its
better to just go to a scan server and queue up rather look for a non-busy scan
server.
1. Choose scan server S1 from 3 random scan servers with a busy timeout of
33ms.
2. If a busy exceptions happens. Choose scan server S2 from 21 random scan
servers with a busy timeout of 33ms.
3. If a busy exceptions happens. Choose scan server S3 from 147 random scan
servers with a busy timeout of 33ms.
4. If a busy exceptions happens. Choose scan server S4 from 1000 random
scan servers with a busy timeout of 33ms.
5. If a busy exceptions happens. Choose scan server S5 from 1000 random
scan servers with a busy timeout of 66ms.
6. If a busy exceptions happens. Choose scan server S6 from 1000 random
scan servers with a busy timeout of 132ms.
This default behavior makes tablets sticky to scan servers which is good for
cache utilization and reusing cached tablet metadata. In the case where those
few scan servers are busy the client starts searching for other places to run.
### Garbage Collector changes
The ScanServer inserts entries into a new section (~sserv) of the metadata
table to place a reservation on the file so that the GarbageCollector process
does not remove the files that are being used for the scan. Accordingly
GCEnv.getReferences has been modified to include these file reservations in the
list of active file references. The ScanServer has a background thread that
removes the file reservations from the metadata table after some period of time
after the file is no longer used (see
Property.SSERVER_SCAN_REFERENCE_EXPIRATION_TIME). The Manager has a new
background thread that calls the ScanServerMetadataEntries.clean method on a
periodic basis. Users can use the ScanServerMetadataEntries utility to remove
file reservations that exist in the metadata table with no corresponding
running ScanServer.
In order to avoid race conditions with the Accumulo GC, Scan servers use the
following algorithm when first reading a tablets metadata.
1. Read metadata for tablet
2. Write an ~sserv entries for the tablets files to the metadata table to
prevent GC
3. Read the meadata again and see if it changed. If it did changes delete
the entries from step 2 and go back to step 1.
The above algorithm may be a bit expensive the first time a tablet is
scanned on scan server. However subsequent scans of the same tablet will use
cached tablet metadata for a configurable time and not repeate the above steps.
In the future we may want to look into faster ways of preventing GC of files
used by scan servers.
### Ancillary changes
1. Modifications to scripts (accumulo-cluster, accumulo-service and
accumulo-env.sh) have been made to start/stop one or more ScanServers per host.
2. The shell commands `grep` and `scan` have been modified to accept a
consistency level (`cl`) argument
3. The shell command `listscans` has been modified to include scans
running on ScanServers
4. ZooZap has been modified to remove ScanServer entries in ZooKeeper
5. MiniAccumuloCluster has been modified to include the ability to
start/stop ScanServers (used by the ITs)
6. A new utility (ScanServerMetadataEntries) has been created to cleanup
any dangling scan server file references in the metadata table.
## Shell Example
Below is an example of how this works using the `scan` command in the shell.
```
root@test> createtable test (1)
root@test test> insert a b c d (2)
root@test test> scan (3)
a b:c [] d
root@test test> scan -cl immediate (4)
a b:c [] d
root@test test> scan -cl eventual (5)
root@test test> flush (6)
2022-01-28T18:58:10,693 [shell.Shell] INFO : Flush of table test
initiated...
root@test test> scan (7)
a b:c [] d
root@test test> scan -cl eventual (8)
a b:c [] d
```
In this example, I create a table (1) and insert some data (2). When I run a
scan (3,4) with the immediate consistency level, which happens to be the
default, the client uses the normal code path and issues the scan command
against the Tablet Server. Data is returned because the Tablet Server code path
also returns data that is in the in-memory map. When I scan with the eventual
consistency level (5) no data is returned because the Scan Server only uses the
data in the Tablet's files. When I flush (6) the data to write a file in HDFS,
the subsequent scans with immediate (7) and eventual (8) consistency level
return the data.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]