Dieter De Paepe created HBASE-28445:
---------------------------------------

             Summary: Shared job jars for full backups
                 Key: HBASE-28445
                 URL: https://issues.apache.org/jira/browse/HBASE-28445
             Project: HBase
          Issue Type: Improvement
          Components: backup&restore
    Affects Versions: 2.6.0
            Reporter: Dieter De Paepe


Our YARN clusters are configured with 10GB of temporary local storage.

When investigating an unhealthy YARN NodeManager, we found it had been marked unhealthy 
because the usable space of its local-dirs had dropped below the configured utilization 
threshold (90%). Investigation showed that this was mainly due to over 100 different 
entries in the usercache, all containing the exact same libjars:
{code:java}
yarn@yarn-nodemanager-0:/tmp/yarn/nm-local-dir$ du -s ./usercache/lily/filecache/*
41272   ./usercache/lily/filecache/10
41272   ./usercache/lily/filecache/100
41272   ./usercache/lily/filecache/101
41272   ./usercache/lily/filecache/102
41272   ./usercache/lily/filecache/103
41272   ./usercache/lily/filecache/104
...{code}
{code:java}
yarn@yarn-nodemanager-0:/tmp/yarn/nm-local-dir$ du -s ./usercache/lily/filecache/99/libjars/*
576     ./usercache/lily/filecache/99/libjars/commons-lang3-3.12.0.jar
4496    ./usercache/lily/filecache/99/libjars/hadoop-common-3.3.6-2-lily.jar
1800    ./usercache/lily/filecache/99/libjars/hadoop-mapreduce-client-core-3.3.6-2-lily.jar
100     ./usercache/lily/filecache/99/libjars/hbase-asyncfs-2.6.0-prc-1-lily.jar
2076    ./usercache/lily/filecache/99/libjars/hbase-client-2.6.0-prc-1-lily.jar
876     ./usercache/lily/filecache/99/libjars/hbase-common-2.6.0-prc-1-lily.jar
76      ./usercache/lily/filecache/99/libjars/hbase-hadoop-compat-2.6.0-prc-1-lily.jar
164     ./usercache/lily/filecache/99/libjars/hbase-hadoop2-compat-2.6.0-prc-1-lily.jar
124     ./usercache/lily/filecache/99/libjars/hbase-http-2.6.0-prc-1-lily.jar
436     ./usercache/lily/filecache/99/libjars/hbase-mapreduce-2.6.0-prc-1-lily.jar
32      ./usercache/lily/filecache/99/libjars/hbase-metrics-2.6.0-prc-1-lily.jar
24      ./usercache/lily/filecache/99/libjars/hbase-metrics-api-2.6.0-prc-1-lily.jar
208     ./usercache/lily/filecache/99/libjars/hbase-procedure-2.6.0-prc-1-lily.jar
3208    ./usercache/lily/filecache/99/libjars/hbase-protocol-2.6.0-prc-1-lily.jar
7356    ./usercache/lily/filecache/99/libjars/hbase-protocol-shaded-2.6.0-prc-1-lily.jar
52      ./usercache/lily/filecache/99/libjars/hbase-replication-2.6.0-prc-1-lily.jar
5932    ./usercache/lily/filecache/99/libjars/hbase-server-2.6.0-prc-1-lily.jar
304     ./usercache/lily/filecache/99/libjars/hbase-shaded-gson-4.1.5.jar
4060    ./usercache/lily/filecache/99/libjars/hbase-shaded-miscellaneous-4.1.5.jar
4864    ./usercache/lily/filecache/99/libjars/hbase-shaded-netty-4.1.5.jar
1832    ./usercache/lily/filecache/99/libjars/hbase-shaded-protobuf-4.1.5.jar
20      ./usercache/lily/filecache/99/libjars/hbase-unsafe-4.1.5.jar
108     ./usercache/lily/filecache/99/libjars/hbase-zookeeper-2.6.0-prc-1-lily.jar
120     ./usercache/lily/filecache/99/libjars/metrics-core-3.1.5.jar
128     ./usercache/lily/filecache/99/libjars/opentelemetry-api-1.15.0.jar
48      ./usercache/lily/filecache/99/libjars/opentelemetry-context-1.15.0.jar
32      ./usercache/lily/filecache/99/libjars/opentelemetry-semconv-1.15.0-alpha.jar
524     ./usercache/lily/filecache/99/libjars/protobuf-java-2.5.0.jar
1292    ./usercache/lily/filecache/99/libjars/zookeeper-3.8.3.jar
{code}
Investigating the YARN logs showed that for every HBase table included in a 
full backup, a separate YARN application is started, and each of these 
applications uploads its own copy of these job jars.
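
As far as I understand, the reason the node-local cache cannot de-duplicate these is 
that every job copies its libjars into its own staging directory, so each copy has a 
unique HDFS path, while YARN's localization cache is keyed on the resource path and 
timestamp. If the jars were instead staged once at a fixed HDFS location and referenced 
from there, each node would only localize them a single time. A rough, purely 
hypothetical client-side sketch (the shared path and helper name are made up):
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public final class SharedJobJars {

  // Hypothetical shared location, populated once per HBase/Hadoop version.
  private static final Path SHARED_LIBJARS_DIR = new Path("/hbase/backup/libjars");

  /**
   * Adds every jar under the shared HDFS directory to the job classpath.
   * Because all jobs then reference identical HDFS paths, the NodeManager can
   * localize each jar once per node instead of once per job-specific
   * .staging/job_xxx/libjars copy.
   */
  public static void addSharedDependencyJars(Job job) throws IOException {
    Configuration conf = job.getConfiguration();
    FileSystem fs = SHARED_LIBJARS_DIR.getFileSystem(conf);
    for (FileStatus status : fs.listStatus(SHARED_LIBJARS_DIR)) {
      if (status.isFile() && status.getPath().getName().endsWith(".jar")) {
        job.addFileToClassPath(status.getPath());
      }
    }
  }
}
{code}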

We encountered this on an HBase installation with only a limited number of tables, 
where we were running backup & restore related tests (so this was regular use). But I 
can imagine this would be a real annoyance for HBase installations with hundreds to 
thousands of tables.

I wonder whether it would be possible to use shared job jars (e.g. along the lines 
sketched above) instead of the current per-application upload?
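
One existing mechanism that might cover this is the YARN shared cache: if the cluster 
runs the SharedCacheManager (yarn.sharedcache.enabled=true), the MapReduce client can, 
as far as I know, opt individual resource categories into it. A hedged sketch of what 
the backup jobs could set:
{code:java}
import org.apache.hadoop.mapreduce.Job;

public final class SharedCacheOptIn {

  // Assumes a SharedCacheManager is deployed and yarn.sharedcache.enabled=true
  // on the cluster. Accepted values for the mode include "disabled", "enabled",
  // or a comma-separated subset of "jobjar,libjars,files,archives".
  public static void enableSharedLibjars(Job job) {
    job.getConfiguration().set("mapreduce.job.sharedcache.mode", "libjars");
  }
}
{code}
Whether that fits the backup jobs, or whether a fixed HDFS location as sketched above 
is simpler, is exactly the question.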

(Strangely enough, the mechanisms to clean up this cache weren't triggering as 
expected, but that's probably something that requires its own investigation.)
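
For whoever picks that up: as far as I understand, that cleanup is driven by 
yarn.nodemanager.localizer.cache.target-size-mb and 
yarn.nodemanager.localizer.cache.cleanup.interval-ms, and only evicts entries once the 
cache grows past the target size. A throwaway sketch to dump the effective values on a 
node:
{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Prints the NodeManager settings that govern local cache cleanup, as loaded
// from yarn-default.xml / yarn-site.xml on the classpath.
public class PrintLocalizerCacheSettings {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    for (String key : new String[] {
        "yarn.nodemanager.localizer.cache.target-size-mb",
        "yarn.nodemanager.localizer.cache.cleanup.interval-ms" }) {
      System.out.println(key + " = " + conf.get(key));
    }
  }
}
{code}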



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
