[
https://issues.apache.org/jira/browse/GOBBLIN-2135?focusedWorklogId=934987&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-934987
]
ASF GitHub Bot logged work on GOBBLIN-2135:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 17/Sep/24 00:20
Start Date: 17/Sep/24 00:20
Worklog Time Spent: 10m
Work Description: phet commented on code in PR #4030:
URL: https://github.com/apache/gobblin/pull/4030#discussion_r1762107332
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnHelixUtils.java:
##########
@@ -229,10 +229,11 @@ public static boolean retainKLatestJarCachePaths(Path
parentCachePath, int k, Fi
// Cleanup old cache if necessary
List<FileStatus> jarDirs =
Arrays.stream(fs.exists(parentCachePath) ?
fs.listStatus(parentCachePath) : new
FileStatus[0]).sorted().collect(Collectors.toList());
- if (jarDirs.size() > k) {
- return fs.delete(jarDirs.get(0).getPath(), true);
+ boolean deletesSuccessful = true;
+ for (int i = 0; i < jarDirs.size() - k; i++) {
+ deletesSuccessful = deletesSuccessful &&
fs.delete(jarDirs.get(i).getPath(), true);
Review Comment:
`&=`
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinYarnAppLauncher.java:
##########
@@ -590,6 +590,15 @@ ApplicationId setupAndSubmitApplication() throws
IOException, YarnException, Int
amContainerLaunchContext.setEnvironment(YarnHelixUtils.getEnvironmentVariables(this.yarnConfiguration));
amContainerLaunchContext.setCommands(Lists.newArrayList(buildApplicationMasterCommand(applicationId.toString(),
resource.getMemory())));
+ if (this.jarCacheEnabled) {
+ Path jarCachePath = YarnHelixUtils.calculateJarCachePath(this.config);
+ // Retain at least the current and last month's jars to handle
executions running for ~30 days max
+ boolean cleanedSuccessfully =
YarnHelixUtils.retainKLatestJarCachePaths(jarCachePath.getParent(), 2, this.fs);
Review Comment:
would be great to add this info to a code comment here (in the caller, since
it's the combination of the two being used, so less effective in either's
javadoc). the crux is that retention of K=2 won't save us in cases where K=3
might exceed any FS quotas.
##########
gobblin-utility/src/main/java/org/apache/gobblin/util/filesystem/HdfsJarUploadUtils.java:
##########
@@ -39,15 +39,15 @@ public class HdfsJarUploadUtils {
* given the {@link FileStatus} of a jarFile and the path of directory that
contains jar.
* Snapshot dirs should not be shared, as different jobs may be using
different versions of it.
* @param fs
- * @param localJar
+ * @param localJarPath
* @param unsharedJarsDir
* @param jarCacheDir
* @return
* @throws IOException
*/
- public static Path calculateDestJarFile(FileSystem fs, FileStatus localJar,
Path unsharedJarsDir, Path jarCacheDir) throws IOException {
- Path uploadDir = localJar.getPath().getName().contains("SNAPSHOT") ?
unsharedJarsDir : jarCacheDir;
- Path destJarFile = new Path(fs.makeQualified(uploadDir),
localJar.getPath().getName());
+ public static Path calculateDestJarFilePath(FileSystem fs, String
localJarPath, Path unsharedJarsDir, Path jarCacheDir) throws IOException {
+ Path uploadDir = localJarPath.contains("SNAPSHOT") ? unsharedJarsDir :
jarCacheDir;
Review Comment:
seeing this invoked, `localJarPath` looks more like `localJarBasename`, not
the (full) path.
(or simply `jarName`)
Issue Time Tracking
-------------------
Worklog Id: (was: 934987)
Time Spent: 2h (was: 1h 50m)
> Cache Yarn jars in GobblinYarnAppLauncher
> -----------------------------------------
>
> Key: GOBBLIN-2135
> URL: https://issues.apache.org/jira/browse/GOBBLIN-2135
> Project: Apache Gobblin
> Issue Type: Improvement
> Reporter: William Lo
> Priority: Major
> Time Spent: 2h
> Remaining Estimate: 0h
>
> Gobblin YARN Application Launcher lacks some functionality used in
> MRJobLauncher. One of the biggest gaps in feature parity is the absence of
> jar caching, where MRJobLauncher creates a monthly cache that is
> automatically cleaned up by subsequent executions performed 2 months in
> advance.
> YARN/MR requires uploading jars to HDFS, this step can be quite slow (~15
> mins for a sizeable job to get all the jars), and given that many jobs do
> share the same jars, it makes sense to cache them together and only provide
> YARN the shared path.
> We also want to ensure that SNAPSHOT jars are other files are not uploaded to
> a cache, since they are not immutable unlike jar versions on Artifactory.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)