Piotr Karwasz created COMPRESS-727:
--------------------------------------

             Summary: Provide a symlink-resistant (and concurrency-safe) 
archive Extractor
                 Key: COMPRESS-727
                 URL: https://issues.apache.org/jira/browse/COMPRESS-727
             Project: Commons Compress
          Issue Type: Improvement
            Reporter: Piotr Karwasz


h2. Summary

{{ArchiveEntry.resolveIn(Path)}} protects against classic Zip-slip by resolving 
the entry name against the target directory, normalizing it, and rejecting the 
result if it no longer {{startsWith}} the target. This check is purely 
*lexical*: it manipulates path strings only. It does not look at the file 
system, so it does not defend against *symlink* attacks, and the example 
{{Expander}} that drives extraction does not create symlinks at all (so it is 
both lossy for legitimate archives and silent about the risk).

This issue proposes a first-class, security-first {{Extractor}} that safely 
materializes a whole archive into a directory, including symlinks, and resists 
both symlink-slip from the archive itself and concurrent symlink races from 
third parties. The API is centered on {{Extractor.newExtractor(Path)}}, which 
returns a {{SecureExtractor}} when the platform supports 
{{java.nio.file.SecureDirectoryStream}}, or a best-effort {{Extractor}} 
otherwise.

h2. Background: what resolveIn does and does not cover

{code:java}
default Path resolveIn(final Path parentPath) throws IOException {
    final Path outputFile = parentPath.resolve(getName()).normalize();
    if (!outputFile.startsWith(parentPath)) {
        throw new ArchiveException("Zip slip ...");
    }
    return outputFile;
}
{code}

This stops {{../../etc/passwd}} style names. It does *not* stop:

* A symlink *entry inside the archive* (for example {{link -> /etc}} or {{link 
-> ../..}}) followed by a later entry written *through* that link 
({{link/evil}}). The lexical check on {{target/link/evil}} passes because the 
string starts with {{target}}, but {{link}} is a symlink and the write escapes. 
This is "symlink-slip", and it defeats the lexical guard even when the target 
directory is fully trusted.
* A symlink (or directory-to-symlink swap) planted *concurrently* by another 
process between the {{resolveIn}} check and the actual 
{{Files.newOutputStream}} call. The check and the write are not atomic with 
respect to the intermediate path components, so this is a classic TOCTOU race.

The metadata needed to handle links is already exposed:
* {{TarArchiveEntry}}: {{isSymbolicLink()}}, {{isLink()}} (hard link), 
{{getLinkName()}}, plus {{isBlockDevice()}} / {{isCharacterDevice()}} / 
{{isFIFO()}} for special types.
* {{ZipArchiveEntry}}: {{isUnixSymlink()}}, with 
{{ZipFile.getUnixSymlink(entry)}} returning the link target.

There is currently no use of {{SecureDirectoryStream}}, 
{{LinkOption.NOFOLLOW_LINKS}}, or {{Files.createSymbolicLink}} anywhere in main 
code, so this is greenfield.

h2. Threat model

# *Zip-slip* (path traversal via {{..}} or absolute names). Already handled by 
{{resolveIn}}.
# *Symlink-slip from the archive* (trusted target directory). The archive 
plants a symlink and then writes through it, or supplies a symlink whose target 
points outside the extraction root. Must be handled by an extractor that 
controls symlink creation and never follows a link when materializing a child 
path.
# *Concurrent symlink race* (untrusted / shared target directory). A third 
party with write access to the target tree swaps a path component for a symlink 
during extraction. Defeating this requires file-descriptor-relative, 
one-component-at-a-time resolution with no-follow semantics, that is, 
{{SecureDirectoryStream}}.

h2. Real-life examples of untrusted / shared target directories

These are settings where threat 3 is realistic because another user or process 
can write into the extraction tree while extraction runs:

* *Shared CI / build agents*: multiple jobs extract into a common workspace or 
temp directory on the same host; one job races another.
* *Multi-user package-manager caches*: extracting into group- or world-writable 
shared caches (Maven {{~/.m2}}, Gradle, pip, npm) on a shared build box.
* *Web upload processing*: an upload is extracted into a directory that other 
worker processes or requests (possibly other tenants) can also write.
* *Container image / layer extraction*: a layer's symlink redirects a later 
layer's write. Container runtimes specifically defend this (for example 
Docker/containerd use chroot-style joins, and Go's {{filepath-securejoin}} 
exists for exactly this).
* *Background daemons*: antivirus, indexers, or backup agents racing the 
extractor over the same tree.

Prior art that hardened against the same classes of bug:
* The original Zip-slip disclosure (2018) prompted the lexical fix that 
{{resolveIn}} implements.
* Python's {{tarfile}} extraction filters (PEP 706, shipped in 3.12, hardening 
the long-standing CVE-2007-4559) added {{data}} / {{tar}} / {{fully_trusted}} 
filters that, among other things, reject symlinks and links escaping the 
destination.
* GNU tar performs "secure" extraction by removing an existing symlink before 
writing through its name and by deferring symlink creation.
* Container runtimes rely on {{filepath-securejoin}} (Go) to resolve each 
component safely against symlink races.

h2. Proposed API

A supported {{Extractor}} type (promoted out of {{archivers.examples}} into, 
for example, {{org.apache.commons.compress.archivers}}), with a 
capability-detecting factory:

{code:java}
// Picks the strongest implementation the platform/filesystem allows.
public static Extractor newExtractor(Path targetDirectory) throws IOException;
{code}

* On platforms where {{Files.newDirectoryStream(targetDirectory) instanceof 
SecureDirectoryStream}} (Unix-like systems, see 
{{sun.nio.fs.UnixSecureDirectoryStream}}), {{newExtractor}} returns a 
*{{SecureExtractor}}* that is resistant to both symlink-slip and concurrent 
symlink races.
* Otherwise (notably Windows, or filesystems whose provider does not return a 
secure stream) it returns a base {{Extractor}} that defends symlink-slip from 
the archive on a best-effort basis but documents that it cannot fully close the 
TOCTOU race.

The {{Extractor}} mirrors the entry points of today's {{Expander}} 
({{ArchiveInputStream}}, {{ZipFile}}, {{TarFile}}, {{SevenZFile}}) and adds 
configuration:

{code:java}
Extractor extractor = Extractor.newExtractor(targetDir)
        .setSymlinkPolicy(SymlinkPolicy.ALLOW_WITHIN_ROOT) // default REJECT
        .setSpecialFilePolicy(SpecialFilePolicy.SKIP)      // devices, FIFOs
        .setOverwrite(false);
extractor.extract(archiveInputStream);
{code}

h3. SecureExtractor design

Walk each entry's parent path *component by component* starting from an open 
directory handle on the (validated) extraction root:
* For each intermediate component, open the subdirectory via 
{{SecureDirectoryStream.newDirectoryStream(name, LinkOption.NOFOLLOW_LINKS)}}. 
If a component is a symlink, refuse (it cannot be a legitimate parent under 
no-follow).
* Create files relative to the innermost directory handle with no-follow / 
create-new semantics ({{newByteChannel}} with {{NOFOLLOW_LINKS}} + 
{{CREATE_NEW}}), so a concurrently planted symlink cannot redirect the write.
* Create directories and symlinks relative to the directory handle, never via 
an absolute resolved path.
Because every step is relative to a file descriptor rather than a re-resolved 
string path, a third party cannot win the race by swapping a parent component.

h3. Symlink policy

||Policy||Behavior||
|{{REJECT}} (default)|Fail if the archive contains any symlink entry.|
|{{SKIP}}|Silently ignore symlink entries.|
|{{ALLOW_WITHIN_ROOT}}|Create a symlink only if its target resolves to a 
location inside the extraction root; reject links escaping the root. Never 
follow the new link when writing later entries.|
|{{ALLOW_ALL}}|Create symlinks verbatim (only for fully trusted archives).|

h3. Entry-type handling

||Entry type||Default action||
|Regular file|Extract (no-follow, create-new unless overwrite enabled).|
|Directory|Create (no-follow).|
|Symbolic link|Per {{SymlinkPolicy}} (default {{REJECT}}).|
|Hard link (tar {{isLink()}})|Resolve link target within root, no-follow; 
reject if it escapes.|
|Device / FIFO / other special|Per {{SpecialFilePolicy}} (default {{SKIP}}; 
never created by the secure path).|

h2. Platform support and degradation

* {{SecureDirectoryStream}} is available on Unix-like platforms via the default 
file system provider; {{newExtractor}} detects it with an {{instanceof}} check 
on a directory stream opened over the target.
* On platforms without it, the base {{Extractor}} still validates each parent 
component with {{LinkOption.NOFOLLOW_LINKS}} (via {{Files.readAttributes}}) and 
refuses to traverse existing symlinks, which closes symlink-slip from the 
archive but remains non-atomic against a concurrent attacker. This residual 
limitation must be documented clearly.

h2. Acceptance criteria

* New {{Extractor}} (supported, not under {{examples}}) with 
{{newExtractor(Path)}} returning {{SecureExtractor}} when 
{{SecureDirectoryStream}} is available, else a best-effort {{Extractor}}.
* Configurable {{SymlinkPolicy}} (default {{REJECT}}) and special-file 
handling; safe hard-link resolution.
* {{SecureExtractor}} performs descriptor-relative, component-by-component, 
no-follow resolution and create-new writes.
* Entry points covering {{ArchiveInputStream}}, {{ZipFile}}, {{TarFile}}, and 
{{SevenZFile}}, matching the existing {{Expander}} surface.
* Tests: symlink-slip archive (link then write-through) is blocked under a 
trusted target; an escaping symlink target is rejected; a concurrent 
directory-to-symlink swap is defeated by {{SecureExtractor}} (and documented as 
not fully defeated by the base extractor); legitimate within-root symlinks 
extract correctly when allowed.
* Javadoc plus a "safe extraction" section in the user guide, cross-referencing 
{{ArchiveEntry.resolveIn}} and the platform caveats.
* Consider deprecating / redirecting {{Expander}} toward {{Extractor}}.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to