Re: [PR] feat: add length unit support in FileSystem limits [commons-io]

via GitHub Sat, 06 Sep 2025 01:13:03 -0700


ppkarwasz commented on code in PR #781:
URL: https://github.com/apache/commons-io/pull/781#discussion_r2326630312



##########
src/main/java/org/apache/commons/io/FileSystem.java:
##########
@@ -530,4 +623,76 @@ CharSequence trimExtension(final CharSequence cs) {
         final int index = indexOf(cs, '.', 0);
         return index < 0 ? cs : cs.subSequence(0, index);
     }
+
+    private boolean isLegalFileLength(final CharSequence candidate, final Charset charset) {
+        if (candidate == null || candidate.length() == 0) {
+            return false;
+        }
+        if (lengthUnit == LengthUnit.CHARS) {
+            return candidate.length() <= getMaxFileNameLength();
+        }
+        final CharsetEncoder encoder = charset.newEncoder();
+        try {
+            final ByteBuffer buffer = encoder.encode(CharBuffer.wrap(candidate));
+            return buffer.remaining() <= getMaxFileNameLength();
+        } catch (CharacterCodingException e) {
+            // If we can't encode, it's not legal
+            return false;
+        }
+    }
+
+    CharSequence truncateFileName(final CharSequence candidate, final Charset charset) {
+        final int maxFileNameLength = getMaxFileNameLength();
+        // Character-based limit: simple substring if needed.
+        if (lengthUnit == LengthUnit.CHARS) {
+            return candidate.length() <= maxFileNameLength ? candidate : candidate.subSequence(0, maxFileNameLength);
+        }
+
+        // Byte-based limit
+        return truncateByBytes(candidate, charset, maxFileNameLength);
+    }
+
+    static CharSequence truncateByBytes(final CharSequence candidate, final Charset charset, final int maxBytes) {
+        // Byte-based limit
+        final CharsetEncoder encoder = charset.newEncoder()
+                .onMalformedInput(CodingErrorAction.REPORT)
+                .onUnmappableCharacter(CodingErrorAction.REPORT);
+
+        if (!encoder.canEncode(candidate)) {
+            throw new IllegalArgumentException(
+                    "File name contains characters that cannot be encoded with charset " + charset.name());
+        }
+
+        // Fast path: if even the worst-case expansion fits, we're done.
+        if (candidate.length() <= Math.floor(maxBytes / encoder.maxBytesPerChar())) {
+            return candidate;
+        }
+
+        // Slow path: encode into a fixed-size byte buffer.
+        final ByteBuffer out = ByteBuffer.allocate(maxBytes);
+        final CharBuffer in = CharBuffer.wrap(candidate);
+
+        // Encode until the first character that would exceed the byte budget.
+        final CoderResult cr = encoder.encode(in, out, true);
+
+        if (cr.isUnderflow()) {
+            // Entire candidate fit within maxFileNameLength bytes.
+            return candidate;
+        }
+
+        // We ran out of space mid-encode: truncate BEFORE the offending character.
+        return candidate.subSequence(0, in.position());
+    }
+
+    /**
+     * Units of length for the file name and path length limits.
+     *

Review Comment:
   I should have explained it more clearly.
   
   **File name length**
   
   * The maximum length of a *single file or directory name* is defined by the 
**filesystem**, not the OS.
   * On Windows (NTFS) and macOS (APFS, HFS+), names are stored internally as 
**UTF-16 code units**. The limit is expressed in those code units (e.g. 255). 
Because the filesystem enforces UTF-16 internally, the user-visible encoding 
doesn’t affect the limit.
   * On most Linux/UNIX filesystems (ext4, XFS, Btrfs), names are just opaque 
**byte strings** in directory entries. The usual limit is 255 bytes. If your 
system locale is UTF-8, that translates to 255 ASCII characters, but only 85 
`★` characters (3 bytes each).
   
   These are **hard limits**: you simply cannot store a longer name on those 
filesystems.
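
To make the difference between the two units concrete, here is a minimal sketch in plain Java. `NameLengthDemo` and its helper methods are hypothetical names used only for illustration; they are not part of commons-io or of this PR.

```java
import java.nio.charset.StandardCharsets;

final class NameLengthDemo {

    /** Length as a byte-oriented filesystem would count it, assuming names are encoded as UTF-8. */
    static int utf8Bytes(final String name) {
        return name.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Length in UTF-16 code units, which is what String.length() reports. */
    static int utf16Units(final String name) {
        return name.length();
    }

    /** Java 8 compatible stand-in for String.repeat. */
    static String repeat(final String s, final int count) {
        final StringBuilder sb = new StringBuilder(s.length() * count);
        for (int i = 0; i < count; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        final String stars = repeat("\u2605", 85); // U+2605 BLACK STAR, 3 bytes in UTF-8
        System.out.println(utf16Units(stars) + " code units, " + utf8Bytes(stars) + " bytes");
    }
}
```

With 85 stars the name is exactly 255 bytes in UTF-8 but only 85 UTF-16 code units, so one more star exceeds a 255-byte limit while staying far below a limit of 255 code units.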
   
   **Path length**
   
   * The length of an *entire path* (all components combined) is typically an 
**OS-level API restriction**, not a filesystem restriction. The filesystem 
doesn’t know where it has been mounted, so it cannot enforce a total path 
length.
   * On POSIX systems, the constants `PATH_MAX` (often 4096) and `NAME_MAX` 
(255) are defined in bytes, because paths are passed as `char*`.
   * On Windows, path limits are defined in UTF-16 code units. NTFS itself 
allows very long paths (32,767 code units), but older Win32 APIs traditionally 
limited paths to 260 characters unless special prefixes are used.
   * On macOS, it depends on which API you use: the POSIX layer (`open`, 
`getcwd`) works in UTF-8 byte strings and enforces `PATH_MAX` in bytes, while 
Apple’s higher-level APIs often work in UTF-16 and measure in code units.
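
As a rough illustration of why the unit matters for whole paths too, the sketch below measures the same path both ways. `PathLengthDemo`, `Unit`, `measure` and `fits` are hypothetical names, and the limit is passed in by the caller; real limits should come from the platform rather than from constants in this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PathLengthDemo {

    /** Hypothetical unit selector, for illustration only. */
    enum Unit { BYTES, UTF16_CODE_UNITS }

    /**
     * Measures a path in the given unit. Note that String.getBytes substitutes
     * a replacement for unencodable input instead of failing, so this is only
     * a sketch, not a validation routine.
     */
    static int measure(final String path, final Unit unit, final Charset charset) {
        return unit == Unit.BYTES ? path.getBytes(charset).length : path.length();
    }

    /** Returns true if the measured length does not exceed the caller-supplied limit. */
    static boolean fits(final String path, final Unit unit, final Charset charset, final int limit) {
        return measure(path, unit, charset) <= limit;
    }

    public static void main(final String[] args) {
        final StringBuilder sb = new StringBuilder("/tmp/");
        for (int i = 0; i < 100; i++) {
            sb.append('\u2605'); // U+2605 BLACK STAR, 3 bytes in UTF-8
        }
        final String path = sb.toString();
        System.out.println("code units: " + measure(path, Unit.UTF16_CODE_UNITS, StandardCharsets.UTF_8));
        System.out.println("UTF-8 bytes: " + measure(path, Unit.BYTES, StandardCharsets.UTF_8));
    }
}
```

The same path is 105 code units but 305 UTF-8 bytes, so a check against a limit of 260 passes in one unit and fails in the other.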
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]