keith-turner commented on code in PR #35:
URL:
https://github.com/apache/accumulo-classloaders/pull/35#discussion_r2676962925
##########
modules/local-caching-classloader/src/main/java/org/apache/accumulo/classloader/lcc/util/LocalStore.java:
##########
@@ -49,6 +50,42 @@
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
+/**
+ * A simple storage service backed by a local file system for storing
downloaded
+ * {@link ContextDefinition} files and the {@link Resource} objects it
references.
+ * <p>
+ * The layout of the storage area consists of two directories:
+ * <ul>
+ * <li><b>contexts</b> stores a copy of the {@link ContextDefinition} JSON
files for each context,
+ * and exist primarily for user convenience (they aren't used again by this
factory)
+ * <li><b>resources</b> stores a copy of all the {@link Resource} files for
all contexts
Review Comment:
A sentence or two about how hashes in the file names are used to dedupe file
across context would be helpful. Not sure where it should go, but when I read
this sentence it brought that to mind.
##########
modules/local-caching-classloader/src/main/java/org/apache/accumulo/classloader/lcc/util/LocalStore.java:
##########
@@ -49,6 +50,42 @@
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
+/**
+ * A simple storage service backed by a local file system for storing
downloaded
+ * {@link ContextDefinition} files and the {@link Resource} objects it
references.
+ * <p>
+ * The layout of the storage area consists of two directories:
+ * <ul>
+ * <li><b>contexts</b> stores a copy of the {@link ContextDefinition} JSON
files for each context,
+ * and exist primarily for user convenience (they aren't used again by this
factory)
+ * <li><b>resources</b> stores a copy of all the {@link Resource} files for
all contexts
+ * </ul>
+ *
+ * <p>
+ * When downloading any file, the file is first downloaded to a temporary file
with a unique name,
+ * and then atomically renamed, so it works correctly even if there are
multiple processes or
+ * threads doing the same thing. The use of `CREATE_NEW` and the lack of
`REPLACE_EXISTING` when
+ * creating the temporary files is to ensure we do not silently collide with
other processes, when
+ * these temporary files should be unique.
+ *
+ * <p>
+ * When downloading resource files, an additional "in-progress" signal file
derived from the name of
+ * the file being downloaded with the suffix ".downloading" is also used. This
file is updated with
+ * the current process' PID every 5 seconds so long as the file is still being
downloaded. This
+ * serves as a signal to other processes or threads that they can wait instead
of attempting to
+ * download the same file. This avoids duplicate work. If an attempt to
download a resource file is
+ * detected, and it has been updated within the last 30 seconds, it is skipped
and the remaining
+ * files are downloaded before attempting it again. If this signal file hasn't
been updated in the
+ * last 30 seconds, a download will be attempted. Failures to download are
propagated up to the
+ * higher level. `CREATE_NEW` is used when creating this signal file, in order
to atomically detect
+ * race collisions with other threads or processes, and to try to avoid
duplicate work.
Review Comment:
Could shorten all of this to something like the following to give a really
high level overview to help guide someone. The more detailed comments are
good, but maybe could be collocated w/ the more specific code. If collocating
w/ more specific code, then would not need to describe the code itself only the
goal.
```
Downloads make a best effort to avoid multiple processes from concurrently
downloading the same file
using the "in progress" file. However there could be race conditions where
this best effort fails. When
downloading any file, the file is first downloaded to a temporary file with
a unique name, and then
atomically renamed, so it works correctly even if there are multiple
processes or threads concurrently
downloading the same file.
```
##########
modules/local-caching-classloader/README.md:
##########
@@ -20,88 +20,206 @@
-->
# Local Caching ClassLoader
-The LocalCachingContextClassLoaderFactory is an Accumulo
ContextClassLoaderFactory implementation that creates and maintains a
-LocalCachingContext. The
`LocalCachingContextClassLoaderFactory.getClassLoader(String)` method expects
the method
-argument to be a valid `file`, `hdfs`, `http` or `https` URL to a context
definition file.
-
-The context definition file is a JSON formatted file that contains the name of
the context, the interval (in seconds) at which
-the context definition file should be monitored, and a list of classpath
resources. The LocalCachingContextClassLoaderFactory
-creates the LocalCachingContext based on the initial contents of the context
definition file, and updates the classloader
-as changes are noticed based on the monitoring interval. An example of the
context definition file is below.
-
-```
+`LocalCachingContextClassLoaderFactory` implements Accumulo's
+`ContextClassLoaderFactory` SPI. Given a context parameter, supplied as a
+String containing a URL to a remote context definition file, it will produce
+and return `ClassLoader` instances to load classes and resources for a Java
+application, based on the list of resource URLs contained in the remote context
+definition. It will also monitor the URL to the context definition file for any
+changes, at the monitoring interval specified in the context definition file.
+
+## Introduction
+
+This factory allows you to create `ClassLoader` instances that point to locally
+cached copies of remote resource files. In this way, this factory allows you to
+place common resources in a remote location for use across many hosts, but
+without many of the problems that can occur when loading resources from a
+remote location.
+
+This factory uses a storage cache in the local filesystem for any files it
+downloads from a remote URL.
+
+To use this factory, you must store your resource files in a location that can
+be specified by a supported URL, and then you must create a JSON-formatted
+context definition file that contains a monitoring interval (in seconds,
+greater than 0), and a list of resource URLs along with a checksum for each
+resource file. You must then store this context definition file somewhere where
+this factory can download it, and use the URL to that context definition file
+as the `context` parameter for this factory's `getClassLoader(String context)`
+method.
+
+This factory can handle context and resource URLs that use the `file`, `hdfs`,
+`http`, or `https` URL scheme.
+
+Here is an example context definition file:
+
+```json
{
- "contextName": "myContext",
- "monitorIntervalSeconds": 5,
- "resources": [
- {
- "location": "file:/home/user/ClassLoaderTestA/TestA.jar",
- "checksum": "a10883244d70d971ec25cbfa69b6f08f"
- },
- {
- "location": "hdfs://localhost:8020/contextB/TestB.jar",
- "checksum": "a02a3b7026528156fb782dcdecaaa097"
- },
- {
- "location": "http://localhost:80/TestC.jar",
- "checksum": "f464e66f6d07a41c656e8f4679509215"
- }
- ]
+ "monitorIntervalSeconds": 5,
+ "resources": [
+ {
+ "location": "file:/home/user/ClassLoaderTestA/TestA.jar",
+ "checksum": "a10883244d70d971ec25cbfa69b6f08f"
+ },
+ {
+ "location": "hdfs://localhost:8020/contextB/TestB.jar",
+ "checksum": "a02a3b7026528156fb782dcdecaaa097"
+ },
+ {
+ "location": "http://localhost:80/TestC.jar",
+ "checksum": "f464e66f6d07a41c656e8f4679509215"
+ }
+ ]
}
```
+## How it Works
+
+When this factory receives a request for a `ClassLoader` for a given URL, it
+downloads a copy of the context definition file and parses it. If it has
+recently acquired that context definition file, based on the monitoring
+interval from a previous retrieval, it can skip this step and use the
+definition from the earlier retrieval, which is kept up-to-date by the
+background monitoring process that started when it was previously retrieved.
+Once it has the context definition, it then returns a `ClassLoader` instance
+containing the resources defined in that context definition file, first
+downloading any missing resources and verifying them using the checksums in the
+context definition file.
+
+`ClassLoader` instances are stored in a de-deduplicating cache in memory with a
+minimum lifetime of 24 hours. So, no two instances will ever exist in a process
+for the same context definition.
+
+If this context definition had not previously been downloaded, a background
+monitoring task is set up to ensure the URL is watched for any changes to the
+context definition. This monitoring continues for as long as there exists
+`ClassLoader` instances in the system that were constructed from the definition
+file at that URL (at least 24 hours, since that is the minimum time they will
+exist in the de-duplicating cache).
+
+## Local Storage Cache
+
+The local storage cache location is configured by the user by setting the
+Accumulo property named `general.custom.classloader.lcc.cache.dir` to a
+directory on the local filesystem. This location may be specified as an
+absolute path or as a URL representing an absolute path with the `file` scheme.
+
+The selected location should be a persistent location with plenty of space to
+store downloaded resources (usually jar files), and should be writable by all
+the processes which use this factory to share the same resources.
+
+Files downloaded to this cache may be used by multiple threads or processes, so
+be very careful when removing old contents to ensure that they are no longer
+needed.
+
+Do **NOT** use a temporary directory for the local storage cache location.
+
+If a resource file is deleted from the local storage cache while a
+`ClassLoader` exists that references it, that `ClassLoader` may, and probably
+will, stop working correctly.
+
+The layout of the local storage cache is two directories: `contexts` and
+`resources`. Context definition files are stored in the `contexts` directory.
+These are for reference only. This factory will not use them once they are
+written. All of the resources used by all contexts are stored in the `contexts`
+directory.
+
+Both context definition files, and resource files use a naming scheme that
+includes the hash of their contents to avoid conflicts with similarly named
+remote resources.
+
+Multiple contexts, multiple threads, and multiple processes, can use the same
+resource files, once they are downloaded. If the resource file already exists,
+it is assumed that it is safe to use. Checksums are only verified on first
+download.
+
+When a resource file is first downloaded, it downloads to a temporary file, and
+then atomically renames it to the final location. **Important**: this factory
+depends on the local filesystem's ability to atomically move files and will
+only work on filesystems with that capability.
+
+This factory will only download one file at a time for each context. While each
Review Comment:
Could omit the description of how this works and the internal layout of the
directory and move that information to the javadoc for the class that
implements this. For the readme, need to let the user know the important
things about how to correctly configure this local directory. Like the mention
of support for atomic move, not using a temp file system, etc. Focusing on
only those things and omitting the implementation details here makes it more
likely someone will see these important details. When I read about not using a
temp dir that seemed obvious to me after reading it, but not sure if I would
have thought of that before reading it. So there is some really critical
information for using this code correctly that is interleaved with information
that someone does not need to know to use this code correctly. Could
mention/link the javadoc in the readme if anyone wants to know more about why
the constraints are important.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]