Roughly (I don’t have the exact command syntax on hand) I make a file that is
then executed by passing it to the shell command. To build the command file:
Use the getsplits command with the number of batches that you want – that can
roughly be calculated as # current tablets / (# tservers * # compaction
slots * comfort factor). You can specify an output file or tee the command
output, something like
* getsplits -t tablename -m 20 -o /tmp/my_splits.txt
This would give you the splits for roughly 20 rounds. Using those splits, the
compact command file then looks like:
compact -w -t tablename -e [first split]
compact -w -t tablename -b [first split] -e [second split]
…
compact -w -t tablename -b [last split]
To do a merge, interleave the merge commands:
compact -w -t tablename -e [first split]
merge -w -t tablename -s 5G -e [first split]
compact -w -t tablename -b [first split] -e [second split]
merge -w -t tablename -s 5G -b [first split] -e [second split]
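The interleaved file above can be generated rather than typed by hand. A minimal sketch, assuming the splits file came from getsplits – the three sample splits here just stand in for real getsplits output, and the table name, paths, and 5G size are placeholders:

```shell
table=tablename
splits=/tmp/my_splits.txt
out=/tmp/compact_merge.txt

# Placeholder splits standing in for real getsplits output.
printf 'g\nn\nt\n' > "$splits"

prev=""
{
  while IFS= read -r s; do
    if [ -z "$prev" ]; then
      # first range: start of table up to the first split
      echo "compact -w -t $table -e $s"
      echo "merge -w -t $table -s 5G -e $s"
    else
      echo "compact -w -t $table -b $prev -e $s"
      echo "merge -w -t $table -s 5G -b $prev -e $s"
    fi
    prev="$s"
  done < "$splits"
  # last range: from the last split to the end of the table
  echo "compact -w -t $table -b $prev"
  echo "merge -w -t $table -s 5G -b $prev"
} > "$out"
```

With 3 splits this emits 4 compact/merge pairs, one per range, matching the pattern above.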
Then just issue the shell command with (login info) -f filename – the shell’s
-f/--execute-file switch runs commands from a file, while -e runs a single command.
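For example (hypothetical credentials and path):

```shell
accumulo shell -u root -p secret -f /tmp/compact_merge.txt
```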
The -w switch pauses each round so that it completes before moving to the next.
The comfort factor is some multiple that increases the number of tablets in each
round. This will over-subscribe the compaction slots – but usually some
compactions of small tablets are quick, and the over-subscription quickly
drops. It is a balancing act: you want fewer rounds, but you also want to limit
the over-subscription period.
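As a worked example of the batch formula (all of these numbers are made up for illustration):

```shell
# Rough round count: tablets / (tservers * slots * comfort factor).
tablets=72000
tservers=100
slots=6
comfort=2
per_round=$(( tservers * slots * comfort ))            # tablets handled per round
rounds=$(( (tablets + per_round - 1) / per_round ))    # ceiling division
echo "$per_round tablets per round -> $rounds rounds"
```

With these numbers that comes out to 1200 tablets per round and 60 rounds.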
You may want to increase the # of compaction slots available – depending on
your hardware and load – I think the default is 3, 6 is not unreasonable.
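If I remember the property name right, in 1.10 that is tserver.compaction.major.concurrent.max, settable from the shell:

```shell
config -s tserver.compaction.major.concurrent.max=6
config -f tserver.compaction.major.concurrent.max   # verify the new value
```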
Using the compact / merge commands with just an end row (for the first range)
and just a begin row (for the last range) ensures that all splits are covered –
don’t mix them up, or you will compact everything.
A few tablets can take much longer if the row ids are not evenly distributed –
the time that each round takes will be the time of its longest compaction. With
larger but fewer rounds, you increase the chance that more of the long poles
will land in the same round and run in parallel, which shortens the total time
needed to complete. Doing it in rounds does take longer overall, though, because
each round may have a long pole that is essentially being compacted serially.
Ed Coleman
From: Ligade, Shailesh [USA] <[email protected]>
Sent: Friday, February 4, 2022 8:28 AM
To: '[email protected]' <[email protected]>
Subject: Re: tablets per tablet server for accumulo 1.10.0
Thank you,
Will a range compaction (compact -t <> --begin-row <> --end-row <>) be faster
than just compact -t <>? My worry is that if I somehow issue 72k compact
commands at once, it will kill the system?
On that point, what is the best way to issue these compact commands, especially
because there are so many of them? I saw accumulo shell -u <> -p <> -e 'compact
...,compact...,compact...' will work, I just don't know how many I can tack onto
one shell command. Is there a better way of doing all this? I mean I want to be
as gentle to my production system and yet as fast as possible.. don't want to
spend days doing compact/merge 🙁
Thanks
-S
________________________________
From: dev1 <[email protected]>
Sent: Tuesday, February 1, 2022 8:53 AM
To: '[email protected]'
<[email protected]<mailto:[email protected]>>
Subject: [External] RE: tablets per tablet server for accumulo 1.10.0
Before. That has the benefit that file sizes are reduced (if data is eligible
for age off) and the merge is operating on current file sizes.
From: Ligade, Shailesh [USA] <[email protected]>
Sent: Tuesday, February 1, 2022 7:49 AM
To: '[email protected]'
<[email protected]<mailto:[email protected]>>
Subject: Re: tablets per tablet server for accumulo 1.10.0
Thank you for explanation!
Once I ran getsplits it was clear that the splits were the culprit, so I need to
do a merge as well as bump the threshold to a higher number as you have suggested.
If I have to perform a major compaction, should i do it before merge or after
merge?
Thanks again,
-S
________________________________
From: dev1 <[email protected]>
Sent: Monday, January 31, 2022 1:14 PM
To: '[email protected]'
<[email protected]<mailto:[email protected]>>
Subject: [External] RE: tablets per tablet server for accumulo 1.10.0
You can get the hdfs size using standard hdfs commands – count or ls. As long
as you have not cloned the table, the size of the hdfs files and the space
occupied by the table are equivalent.
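For example, assuming the default /accumulo volume and a hypothetical table id of 2a (you can get the id from tables -l in the shell):

```shell
hdfs dfs -du -s -h /accumulo/tables/2a   # total bytes under the table's directory
hdfs dfs -count /accumulo/tables/2a      # directory / file counts
```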
You can also get a sense of the referenced files by examining the metadata table –
the file: column family will give you the referenced files. Looking at the
directories, b-xxxxxxx directories are from a bulk import and t-xxxxxxx
directories are assigned to the tablets. Bulk import file names start with
I-xxxxxx; files from compactions will be A-xxxxxx if from a full major
compaction and C-xxxxxx from a partial major compaction, and F-xxxxxx is the
result of a flush (minor compaction). You can look at the entries for the
files – the value holds the file size and the number of entries.
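From the shell, something like this pulls just the file entries for one table (2a is a hypothetical table id):

```shell
scan -t accumulo.metadata -b 2a; -e 2a< -c file
```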
How do you ingest? Bulk or continuous? On a bulk ingest, the imported files
end up in /accumulo/tables/x/b-xxxxx and then are assigned to tablets – the
directories for the tablets will be created, but will be “empty” until a
compaction occurs. A compaction will copy from the files referenced by the
tablets into a new file that will be placed into the corresponding
/accumulo/tables/x/t-xxxxxx directory. When a bulk imported file is no longer
referenced by any tablet, it will get garbage collected; until then the file
will exist and inflate the apparent space used by the table. The compaction
will also remove any data that is past the TTL for the records.
Do you ever run a compaction? With a very large number of tablets, you may
want to run the compaction in parts so that you don’t end up occupying all of
the compaction slots for a long time.
Are you using keys (row ids) that are always increasing? A typical example
would be a date. Say some of your row ids are yyyy-mm-dd-hh and there is a 10
day TTL. What will happen is that new data will continue to create new
tablets and, on compaction, the old tablets will age off and have 0 size. You
can remove the “unused splits” by running a merge. Anything that creates new
row ids that are ordered can do this – new splits are necessary and the
old splits eventually become unnecessary; if the row ids are distributed across
the splits it will not do this. It is not necessarily a problem if this is what
your data looks like, just something that you may want to manage with merges.
There is usually not much benefit in having a large number of tablets for a
single table on a server. You can reduce the number of tablets required by
setting the split threshold to a larger number and then running a merge. This
can be done in sections, and you should run a compaction on each section first.
If you have recently compacted, you can figure out the rough number of tablets
necessary by taking hdfs size / split threshold = number of tablets. If you
increase the split threshold size you will need fewer tablets. You may also
consider setting a split threshold that is larger than your target – say you
decided that 5G was a good target: setting the threshold to 8G during the
merge and then setting it to 5G when completed will cause the table to split –
and it could give you a better distribution of data in the splits.
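Using the 8G-during / 5G-after example, that would look something like:

```shell
config -t tablename -s table.split.threshold=8G
# ... run the compact / merge rounds ...
config -t tablename -s table.split.threshold=5G
```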
This can be done while things are running, but it will be a heavy IO load
(files and on the hdfs namenode) and can take a very long time. What can be
useful is to use the getsplits command with its max-splits option and create
a script that compacts, then merges, a section – using the splits as the
start / end rows for the compact and merge commands.
Ed Coleman
From: Ligade, Shailesh [USA] <[email protected]>
Sent: Monday, January 31, 2022 11:16 AM
To: [email protected]<mailto:[email protected]>
Subject: tablets per tablet server for accumulo 1.10.0
Hello,
table.split.threshold is set to the default 1G (except for metadata and root –
which are set to 64M)
What can cause the tablets per tablet server count to go high? Within a week,
that count jumped from 5k/tablet server to 23k/tablet server, even though the
total size in hdfs has not changed.
Is a high count a cause for concern?
We didn't apply any splits. I did a dumpConfig and checked all my tables and
didn't see splits either.
Is there a way to find tablet size in hdfs? When I look at hdfs
/accumulo/tables/x/ I see some empty folders, meaning not all folders have rf
files. Is that normal?
Thanks in advance!
-S