Re: [PR] feat: add length unit support in FileSystem limits [commons-io]

via GitHub Sat, 06 Sep 2025 14:20:23 -0700


ecki commented on code in PR #781:
URL: https://github.com/apache/commons-io/pull/781#discussion_r2328366872



##########
src/main/java/org/apache/commons/io/FileSystem.java:
##########
@@ -411,30 +424,26 @@ public int getMaxFileNameLength() {
     /**
      * Gets the maximum length for file paths (may include folders).
      *
-     * <p><strong>Note:</strong> This may include folder names as well as the 
file name.
-     * The unit is the same as {@link #getMaxFileNameLength()} and can be 
obtained
-     * from {@link #getLengthUnit()}.</p>
+     * <p>This may include folder names as well as the file name.</p>

Review Comment:
   “May include folder names” sounds a bit unclear to me, is this the path 
length (including separators and roots) for folders and files or is it for path 
components?



##########
src/main/java/org/apache/commons/io/FileSystem.java:
##########
@@ -519,13 +528,11 @@ public boolean isLegalFileName(final CharSequence 
candidate) {
      * @since 2.21.0
      */
     public boolean isLegalFileName(final CharSequence candidate, final Charset 
charset) {
-        if (!isLegalFileLength(candidate, charset)) {
-            return false;
-        }
-        if (isReservedFileName(candidate)) {
-            return false;
-        }
-        return candidate.chars().noneMatch(this::isIllegalFileNameChar);
+        return candidate != null
+                && candidate.length() != 0
+                && nameLengthStrategy.isWithinLimit(candidate, 
getMaxFileNameLength(), charset)

Review Comment:
   I think passing the candidate to the strategy would be better, this could 
then also implements things like “no more than 255 utf16 code units and no more 
than 300mbytes utf8 serialized”



##########
src/main/java/org/apache/commons/io/FileSystem.java:
##########
@@ -107,7 +107,7 @@ public enum FileSystem {
                     "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", 
"LPT8", "LPT9",
                     "LPT\u00b2", "LPT\u00b3", "LPT\u00b9", // Superscript 2 3 
1 in that order
                     "NUL", "PRN"
-            }, true, true, '\\', LengthUnit.CHARS);
+            }, true, true, '\\', NameLengthStrategy.UTF16_CHARS);

Review Comment:
   Maybe differentiate code points and code units (and clusters?) instead of 
using “chars”? Also should it support max path length and path components and 
depth etc? )I think the problem is generally too big tomwddress)



##########
src/main/java/org/apache/commons/io/FileSystem.java:
##########
@@ -342,12 +343,12 @@ private static String replace(final String path, final 
char oldChar, final char
      * @param reservedFileNamesExtensions TODO
      * @param supportsDriveLetter Whether this file system support driver 
letters.
      * @param nameSeparator The name separator, '\\' on Windows, '/' on Linux.
-     * @param lengthUnit The unit of measurement for length limits.
+     * @param nameLengthStrategy The strategy for measuring and truncating 
file and path names.

Review Comment:
   I like this general change, it can cover more limits 



##########
src/main/java/org/apache/commons/io/FileSystem.java:
##########
@@ -624,75 +631,102 @@ CharSequence trimExtension(final CharSequence cs) {
         return index < 0 ? cs : cs.subSequence(0, index);
     }
 
-    private boolean isLegalFileLength(final CharSequence candidate, final 
Charset charset) {
-        if (candidate == null || candidate.length() == 0) {
-            return false;
-        }
-        if (lengthUnit == LengthUnit.CHARS) {
-            return candidate.length() <= getMaxFileNameLength();
-        }
-        final CharsetEncoder encoder = charset.newEncoder();
-        try {
-            final ByteBuffer buffer = 
encoder.encode(CharBuffer.wrap(candidate));
-            return buffer.remaining() <= getMaxFileNameLength();
-        } catch (CharacterCodingException e) {
-            // If we can't encode, it's not legal
-            return false;
-        }
-    }
+    /**
+     * Strategy for measuring and truncating file or path names in different 
units.
+     * Implementations measure length and can truncate to a specified limit.
+     */
+    enum NameLengthStrategy {
+        /** Length measured as encoded bytes. */
+        BYTES {
+            @Override
+            int getLength(final CharSequence value, final Charset charset) {
+                final CharsetEncoder enc = charset.newEncoder()
+                        .onMalformedInput(CodingErrorAction.REPORT)
+                        .onUnmappableCharacter(CodingErrorAction.REPORT);
+                try {
+                    return enc.encode(CharBuffer.wrap(value)).remaining();
+                } catch (CharacterCodingException e) {
+                    // Unencodable ⇒ does not fit any byte limit.
+                    return Integer.MAX_VALUE;
+                }
+            }
 
-    CharSequence truncateFileName(final CharSequence candidate, final Charset 
charset) {
-        final int maxFileNameLength = getMaxFileNameLength();
-        // Character-based limit: simple substring if needed.
-        if (lengthUnit == LengthUnit.CHARS) {
-            return candidate.length() <= maxFileNameLength ? candidate : 
candidate.subSequence(0, maxFileNameLength);
-        }
+            @Override
+            CharSequence truncate(final CharSequence value, final int limit, 
final Charset charset) {
+                final CharsetEncoder encoder = charset.newEncoder()
+                        .onMalformedInput(CodingErrorAction.REPORT)
+                        .onUnmappableCharacter(CodingErrorAction.REPORT);
 
-        // Byte-based limit
-        return truncateByBytes(candidate, charset, maxFileNameLength);
-    }
+                if (!encoder.canEncode(value)) {
+                    throw new IllegalArgumentException(
+                            "The value " + value + " cannot be encoded using " 
+ charset.name());
+                }
 
-    static CharSequence truncateByBytes(final CharSequence candidate, final 
Charset charset, final int maxBytes) {
-        // Byte-based limit
-        final CharsetEncoder encoder = charset.newEncoder()
-                .onMalformedInput(CodingErrorAction.REPORT)
-                .onUnmappableCharacter(CodingErrorAction.REPORT);
+                // Fast path: if even the worst-case expansion fits, we're 
done.
+                if (value.length() <= Math.floor(limit / 
encoder.maxBytesPerChar())) {
+                    return value;
+                }
 
-        if (!encoder.canEncode(candidate)) {
-            throw new IllegalArgumentException(
-                    "File name contains characters that cannot be encoded with 
charset " + charset.name());
-        }
+                // Slow path: encode into a fixed-size byte buffer.
+                final ByteBuffer out = ByteBuffer.allocate(limit);
+                final CharBuffer in = CharBuffer.wrap(value);
 
-        // Fast path: if even the worst-case expansion fits, we're done.
-        if (candidate.length() <= Math.floor(maxBytes / 
encoder.maxBytesPerChar())) {
-            return candidate;
-        }
+                // Encode until the first character that would exceed the byte 
budget.
+                final CoderResult cr = encoder.encode(in, out, true);
+
+                if (cr.isUnderflow()) {
+                    // Entire candidate fit within maxFileNameLength bytes.
+                    return value;
+                }
 
-        // Slow path: encode into a fixed-size byte buffer.
-        final ByteBuffer out = ByteBuffer.allocate(maxBytes);
-        final CharBuffer in = CharBuffer.wrap(candidate);
+                // We ran out of space mid-encode: truncate BEFORE the 
offending character.
+                return value.subSequence(0, in.position());
+            }
+        },
 
-        // Encode until the first character that would exceed the byte budget.
-        final CoderResult cr = encoder.encode(in, out, true);
+        /** Length measured as UTF-16 code units (i.e., {@code 
CharSequence.length()}). */
+        UTF16_CHARS {
+            @Override
+            int getLength(final CharSequence value, final Charset charset) {
+                return value.length();
+            }
 
-        if (cr.isUnderflow()) {
-            // Entire candidate fit within maxFileNameLength bytes.
-            return candidate;
+            @Override
+            CharSequence truncate(final CharSequence value, final int limit, 
final Charset charset) {
+                return value.length() <= limit ? value : value.subSequence(0, 
limit);
+            }
+        };
+
+        /**
+         * Gets the measured length in this strategy’s unit.
+         *
+         * @param value The value to measure, not null.
+         * @param charset The charset to use when measuring in bytes.
+         * @return The length in this strategy’s unit.
+         */
+        abstract int getLength(CharSequence value, Charset charset);
+
+        /**
+         * Tests if the measured length is less or equal the {@code limit}.
+         *
+         * @param value The value to measure, not null.
+         * @param limit The limit to compare to.
+         * @param charset The charset to use when measuring in bytes.
+         * @return {@code true} if the measured length is less or equal the 
{@code limit}, {@code false} otherwise.
+         */
+        final boolean isWithinLimit(final CharSequence value, final int limit, 
final Charset charset) {
+            return getLength(value, charset) <= limit;
         }
 
-        // We ran out of space mid-encode: truncate BEFORE the offending 
character.
-        return candidate.subSequence(0, in.position());
+        /**
+         * Truncates to {@code limit} in this strategy’s unit (no-op if 
already within limit).
+         *
+         * @param value The value to truncate, not null.
+         * @param limit The limit to truncate to.
+         * @param charset The charset to use when measuring in bytes.
+         * @return The truncated value, not null.
+         */
+        abstract CharSequence truncate(CharSequence value, int limit, Charset 
charset);

Review Comment:
   Should truncate at least be able to preserve extensions? Also make sure that 
after truncation file name is still valid. And can truncation also affect 
parents? Should there be a minimum length the remaining name has (don’t 
truncate to a single char, etc)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat: add length unit support in FileSystem limits [commons-io]

Reply via email to