Piotr Karwasz created COMPRESS-727:
--------------------------------------
Summary: Provide a symlink-resistant (and concurrency-safe)
archive Extractor
Key: COMPRESS-727
URL: https://issues.apache.org/jira/browse/COMPRESS-727
Project: Commons Compress
Issue Type: Improvement
Reporter: Piotr Karwasz
h2. Summary
{{ArchiveEntry.resolveIn(Path)}} protects against classic Zip-slip by resolving
the entry name against the target directory, normalizing it, and rejecting the
result if it no longer {{startsWith}} the target. This check is purely
*lexical*: it manipulates path strings only. It does not look at the file
system, so it does not defend against *symlink* attacks, and the example
{{Expander}} that drives extraction does not create symlinks at all (so it is
both lossy for legitimate archives and silent about the risk).
This issue proposes a first-class, security-first {{Extractor}} that safely
materializes a whole archive into a directory, including symlinks, and resists
both symlink-slip from the archive itself and concurrent symlink races from
third parties. The API is centered on {{Extractor.newExtractor(Path)}}, which
returns a {{SecureExtractor}} when the platform supports
{{java.nio.file.SecureDirectoryStream}}, or a best-effort {{Extractor}}
otherwise.
h2. Background: what resolveIn does and does not cover
{code:java}
default Path resolveIn(final Path parentPath) throws IOException {
final Path outputFile = parentPath.resolve(getName()).normalize();
if (!outputFile.startsWith(parentPath)) {
throw new ArchiveException("Zip slip ...");
}
return outputFile;
}
{code}
This stops {{../../etc/passwd}} style names. It does *not* stop:
* A symlink *entry inside the archive* (for example {{link -> /etc}} or {{link
-> ../..}}) followed by a later entry written *through* that link
({{link/evil}}). The lexical check on {{target/link/evil}} passes because the
string starts with {{target}}, but {{link}} is a symlink and the write escapes.
This is "symlink-slip", and it defeats the lexical guard even when the target
directory is fully trusted.
* A symlink (or directory-to-symlink swap) planted *concurrently* by another
process between the {{resolveIn}} check and the actual
{{Files.newOutputStream}} call. The check and the write are not atomic with
respect to the intermediate path components, so this is a classic TOCTOU race.
The metadata needed to handle links is already exposed:
* {{TarArchiveEntry}}: {{isSymbolicLink()}}, {{isLink()}} (hard link),
{{getLinkName()}}, plus {{isBlockDevice()}} / {{isCharacterDevice()}} /
{{isFIFO()}} for special types.
* {{ZipArchiveEntry}}: {{isUnixSymlink()}}, with
{{ZipFile.getUnixSymlink(entry)}} returning the link target.
There is currently no use of {{SecureDirectoryStream}},
{{LinkOption.NOFOLLOW_LINKS}}, or {{Files.createSymbolicLink}} anywhere in main
code, so this is greenfield.
h2. Threat model
# *Zip-slip* (path traversal via {{..}} or absolute names). Already handled by
{{resolveIn}}.
# *Symlink-slip from the archive* (trusted target directory). The archive
plants a symlink and then writes through it, or supplies a symlink whose target
points outside the extraction root. Must be handled by an extractor that
controls symlink creation and never follows a link when materializing a child
path.
# *Concurrent symlink race* (untrusted / shared target directory). A third
party with write access to the target tree swaps a path component for a symlink
during extraction. Defeating this requires file-descriptor-relative,
one-component-at-a-time resolution with no-follow semantics, that is,
{{SecureDirectoryStream}}.
h2. Real-life examples of untrusted / shared target directories
These are settings where threat 3 is realistic because another user or process
can write into the extraction tree while extraction runs:
* *Shared CI / build agents*: multiple jobs extract into a common workspace or
temp directory on the same host; one job races another.
* *Multi-user package-manager caches*: extracting into group- or world-writable
shared caches (Maven {{~/.m2}}, Gradle, pip, npm) on a shared build box.
* *Web upload processing*: an upload is extracted into a directory that other
worker processes or requests (possibly other tenants) can also write.
* *Container image / layer extraction*: a layer's symlink redirects a later
layer's write. Container runtimes specifically defend this (for example
Docker/containerd use chroot-style joins, and Go's {{filepath-securejoin}}
exists for exactly this).
* *Background daemons*: antivirus, indexers, or backup agents racing the
extractor over the same tree.
Prior art that hardened against the same classes of bug:
* The original Zip-slip disclosure (2018) prompted the lexical fix that
{{resolveIn}} implements.
* Python's {{tarfile}} extraction filters (PEP 706, shipped in 3.12, hardening
the long-standing CVE-2007-4559) added {{data}} / {{tar}} / {{fully_trusted}}
filters that, among other things, reject symlinks and links escaping the
destination.
* GNU tar performs "secure" extraction by removing an existing symlink before
writing through its name and by deferring symlink creation.
* Container runtimes rely on {{filepath-securejoin}} (Go) to resolve each
component safely against symlink races.
h2. Proposed API
A supported {{Extractor}} type (promoted out of {{archivers.examples}} into,
for example, {{org.apache.commons.compress.archivers}}), with a
capability-detecting factory:
{code:java}
// Picks the strongest implementation the platform/filesystem allows.
public static Extractor newExtractor(Path targetDirectory) throws IOException;
{code}
* On platforms where {{Files.newDirectoryStream(targetDirectory) instanceof
SecureDirectoryStream}} (Unix-like systems, see
{{sun.nio.fs.UnixSecureDirectoryStream}}), {{newExtractor}} returns a
*{{SecureExtractor}}* that is resistant to both symlink-slip and concurrent
symlink races.
* Otherwise (notably Windows, or filesystems whose provider does not return a
secure stream) it returns a base {{Extractor}} that defends symlink-slip from
the archive on a best-effort basis but documents that it cannot fully close the
TOCTOU race.
The {{Extractor}} mirrors the entry points of today's {{Expander}}
({{ArchiveInputStream}}, {{ZipFile}}, {{TarFile}}, {{SevenZFile}}) and adds
configuration:
{code:java}
Extractor extractor = Extractor.newExtractor(targetDir)
.setSymlinkPolicy(SymlinkPolicy.ALLOW_WITHIN_ROOT) // default REJECT
.setSpecialFilePolicy(SpecialFilePolicy.SKIP) // devices, FIFOs
.setOverwrite(false);
extractor.extract(archiveInputStream);
{code}
h3. SecureExtractor design
Walk each entry's parent path *component by component* starting from an open
directory handle on the (validated) extraction root:
* For each intermediate component, open the subdirectory via
{{SecureDirectoryStream.newDirectoryStream(name, LinkOption.NOFOLLOW_LINKS)}}.
If a component is a symlink, refuse (it cannot be a legitimate parent under
no-follow).
* Create files relative to the innermost directory handle with no-follow /
create-new semantics ({{newByteChannel}} with {{NOFOLLOW_LINKS}} +
{{CREATE_NEW}}), so a concurrently planted symlink cannot redirect the write.
* Create directories and symlinks relative to the directory handle, never via
an absolute resolved path.
Because every step is relative to a file descriptor rather than a re-resolved
string path, a third party cannot win the race by swapping a parent component.
h3. Symlink policy
||Policy||Behavior||
|{{REJECT}} (default)|Fail if the archive contains any symlink entry.|
|{{SKIP}}|Silently ignore symlink entries.|
|{{ALLOW_WITHIN_ROOT}}|Create a symlink only if its target resolves to a
location inside the extraction root; reject links escaping the root. Never
follow the new link when writing later entries.|
|{{ALLOW_ALL}}|Create symlinks verbatim (only for fully trusted archives).|
h3. Entry-type handling
||Entry type||Default action||
|Regular file|Extract (no-follow, create-new unless overwrite enabled).|
|Directory|Create (no-follow).|
|Symbolic link|Per {{SymlinkPolicy}} (default {{REJECT}}).|
|Hard link (tar {{isLink()}})|Resolve link target within root, no-follow;
reject if it escapes.|
|Device / FIFO / other special|Per {{SpecialFilePolicy}} (default {{SKIP}};
never created by the secure path).|
h2. Platform support and degradation
* {{SecureDirectoryStream}} is available on Unix-like platforms via the default
file system provider; {{newExtractor}} detects it with an {{instanceof}} check
on a directory stream opened over the target.
* On platforms without it, the base {{Extractor}} still validates each parent
component with {{LinkOption.NOFOLLOW_LINKS}} (via {{Files.readAttributes}}) and
refuses to traverse existing symlinks, which closes symlink-slip from the
archive but remains non-atomic against a concurrent attacker. This residual
limitation must be documented clearly.
h2. Acceptance criteria
* New {{Extractor}} (supported, not under {{examples}}) with
{{newExtractor(Path)}} returning {{SecureExtractor}} when
{{SecureDirectoryStream}} is available, else a best-effort {{Extractor}}.
* Configurable {{SymlinkPolicy}} (default {{REJECT}}) and special-file
handling; safe hard-link resolution.
* {{SecureExtractor}} performs descriptor-relative, component-by-component,
no-follow resolution and create-new writes.
* Entry points covering {{ArchiveInputStream}}, {{ZipFile}}, {{TarFile}}, and
{{SevenZFile}}, matching the existing {{Expander}} surface.
* Tests: symlink-slip archive (link then write-through) is blocked under a
trusted target; an escaping symlink target is rejected; a concurrent
directory-to-symlink swap is defeated by {{SecureExtractor}} (and documented as
not fully defeated by the base extractor); legitimate within-root symlinks
extract correctly when allowed.
* Javadoc plus a "safe extraction" section in the user guide, cross-referencing
{{ArchiveEntry.resolveIn}} and the platform caveats.
* Consider deprecating / redirecting {{Expander}} toward {{Extractor}}.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)