Re: [PR] SOLR-18213 Replace simplemagic with built-in magic-byte file type detection [solr]

via GitHub Sat, 25 Apr 2026 14:12:39 -0700


Copilot commented on code in PR #4336:
URL: https://github.com/apache/solr/pull/4336#discussion_r3142573315



##########
solr/core/src/java/org/apache/solr/util/FileTypeMagicUtil.java:
##########
@@ -97,22 +97,28 @@ public FileVisitResult preVisitDirectory(Path dir, 
BasicFileAttributes attrs) {
    * @return string with content-type or "application/octet-stream" if unknown
    */
   public String guessMimeType(Path file) {
-    try {
-      return guessTypeFallbackToOctetStream(util.findMatch(file.toFile()));
+    try (InputStream in = Files.newInputStream(file)) {
+      return guessMimeType(in);
     } catch (IOException e) {
       throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, e);

Review Comment:
   `guessMimeType(Path)` wraps `IOException` in a `SolrException` without 
including the file path, which makes failures during configset validation 
harder to diagnose (previously the thrown error included the filename). 
Consider including `file` in the exception message (or reusing 
`isFileForbiddenInConfigset(Path)` in `assertConfigSetFolderLegal`).
   ```suggestion
         throw new SolrException(
             SolrException.ErrorCode.SERVER_ERROR,
             "Failed to guess mime type for file " + file,
             e);
   ```



##########
solr/core/src/java/org/apache/solr/util/FileTypeMagicUtil.java:
##########
@@ -199,11 +202,63 @@ public static boolean isFileForbiddenInConfigset(byte[] 
bytes) {
                       
"application/x-java-applet,application/zip,application/x-tar,text/x-shellscript")
                   .split(",")));
 
-  private String guessTypeFallbackToOctetStream(ContentInfo contentInfo) {
-    if (contentInfo == null) {
-      return ContentType.OTHER.getMimeType();
-    } else {
-      return contentInfo.getContentType().getMimeType();
+  /**
+   * Detects JVM class files by the 0xCAFEBABE magic. Kotlin, Scala and Groovy 
use the same bytes.
+   */
+  private static boolean isJavaClass(byte[] b) {
+    return b.length >= 4
+        && (b[0] & 0xFF) == 0xCA
+        && (b[1] & 0xFF) == 0xFE
+        && (b[2] & 0xFF) == 0xBA
+        && (b[3] & 0xFF) == 0xBE;
+  }
+
+  /**
+   * Detects ZIP/JAR archives by the PK magic at offset 0. Handles three 
signatures: PK\x03\x04
+   * (local file header), PK\x05\x06 (empty archive), PK\x07\x08 (data 
descriptor).
+   *
+   * <p>Polyglot files (e.g. JPEG+ZIP) are not detected: their ZIP content is 
appended after the
+   * outer format's end marker and does not appear at offset 0.
+   */
+  private static boolean isZip(byte[] b) {
+    return b.length >= 4
+        && b[0] == 'P'
+        && b[1] == 'K'
+        && ((b[2] == 0x03 && b[3] == 0x04)
+            || (b[2] == 0x05 && b[3] == 0x06)
+            || (b[2] == 0x07 && b[3] == 0x08));
+  }
+
+  /**
+   * Detects TAR archives via the POSIX/GNU ustar magic at offset 257, and 
compressed archives
+   * (gzip, bzip2, xz) which commonly wrap tarballs. V7-format tars have no 
magic and are not
+   * detected, but they are essentially extinct. Plain compressed files 
without a tar payload are
+   * also blocked — they have no legitimate use in a Solr configset.
+   */
+  private static boolean isTar(byte[] b) {
+    if (b.length >= 262
+        && b[257] == 'u'
+        && b[258] == 's'
+        && b[259] == 't'
+        && b[260] == 'a'
+        && b[261] == 'r') {
+      return true;
     }
+    if (b.length >= 2 && (b[0] & 0xFF) == 0x1F && (b[1] & 0xFF) == 0x8B) 
return true; // gzip
+    if (b.length >= 3 && b[0] == 'B' && b[1] == 'Z' && b[2] == 'h') return 
true; // bzip2
+    return b.length >= 6 // xz: FD 37 7A 58 5A 00
+        && (b[0] & 0xFF) == 0xFD
+        && b[1] == '7'
+        && b[2] == 'z'
+        && b[3] == 'X'
+        && b[4] == 'Z'
+        && b[5] == 0x00;

Review Comment:
   `isTar` treats any gzip/bzip2/xz-compressed file as `application/x-tar`, 
even if it’s not a tarball. This is both an inaccurate MIME classification and 
a behavior change vs the previous magic rules (which only detected actual tar 
headers). Consider either (a) detecting only real tar (ustar) here, or (b) 
returning distinct mime types for gzip/bzip2/xz and adding those to the 
forbidden list explicitly if you intend to block them.
   ```suggestion
      * Detects TAR archives via the POSIX/GNU ustar magic at offset 257. 
V7-format tars have no
      * magic and are not detected, but they are essentially extinct.
      */
     private static boolean isTar(byte[] b) {
       return b.length >= 262
           && b[257] == 'u'
           && b[258] == 's'
           && b[259] == 't'
           && b[260] == 'a'
           && b[261] == 'r';
   ```



##########
solr/core/src/java/org/apache/solr/util/FileTypeMagicUtil.java:
##########
@@ -199,11 +202,63 @@ public static boolean isFileForbiddenInConfigset(byte[] 
bytes) {
                       
"application/x-java-applet,application/zip,application/x-tar,text/x-shellscript")
                   .split(",")));
 
-  private String guessTypeFallbackToOctetStream(ContentInfo contentInfo) {
-    if (contentInfo == null) {
-      return ContentType.OTHER.getMimeType();
-    } else {
-      return contentInfo.getContentType().getMimeType();
+  /**
+   * Detects JVM class files by the 0xCAFEBABE magic. Kotlin, Scala and Groovy 
use the same bytes.
+   */
+  private static boolean isJavaClass(byte[] b) {
+    return b.length >= 4
+        && (b[0] & 0xFF) == 0xCA
+        && (b[1] & 0xFF) == 0xFE
+        && (b[2] & 0xFF) == 0xBA
+        && (b[3] & 0xFF) == 0xBE;
+  }
+
+  /**
+   * Detects ZIP/JAR archives by the PK magic at offset 0. Handles three 
signatures: PK\x03\x04
+   * (local file header), PK\x05\x06 (empty archive), PK\x07\x08 (data 
descriptor).
+   *
+   * <p>Polyglot files (e.g. JPEG+ZIP) are not detected: their ZIP content is 
appended after the
+   * outer format's end marker and does not appear at offset 0.
+   */
+  private static boolean isZip(byte[] b) {
+    return b.length >= 4
+        && b[0] == 'P'
+        && b[1] == 'K'
+        && ((b[2] == 0x03 && b[3] == 0x04)
+            || (b[2] == 0x05 && b[3] == 0x06)
+            || (b[2] == 0x07 && b[3] == 0x08));
+  }
+
+  /**
+   * Detects TAR archives via the POSIX/GNU ustar magic at offset 257, and 
compressed archives
+   * (gzip, bzip2, xz) which commonly wrap tarballs. V7-format tars have no 
magic and are not
+   * detected, but they are essentially extinct. Plain compressed files 
without a tar payload are
+   * also blocked — they have no legitimate use in a Solr configset.
+   */
+  private static boolean isTar(byte[] b) {
+    if (b.length >= 262
+        && b[257] == 'u'
+        && b[258] == 's'
+        && b[259] == 't'
+        && b[260] == 'a'
+        && b[261] == 'r') {
+      return true;
     }
+    if (b.length >= 2 && (b[0] & 0xFF) == 0x1F && (b[1] & 0xFF) == 0x8B) 
return true; // gzip
+    if (b.length >= 3 && b[0] == 'B' && b[1] == 'Z' && b[2] == 'h') return 
true; // bzip2
+    return b.length >= 6 // xz: FD 37 7A 58 5A 00
+        && (b[0] & 0xFF) == 0xFD
+        && b[1] == '7'
+        && b[2] == 'z'
+        && b[3] == 'X'
+        && b[4] == 'Z'
+        && b[5] == 0x00;
+  }
+
+  private static boolean isShellScript(byte[] b) {
+    if (b.length < 2) return false;
+    // lookingAt() matches from the start without requiring a full-string 
match.
+    String prefix = new String(b, 0, Math.min(b.length, 64), 
StandardCharsets.ISO_8859_1);
+    return SHELL_SHEBANG.matcher(prefix).lookingAt();

Review Comment:
   Shell-script detection currently builds a `String` and runs a regex just to 
check for `#!`. This can be simplified to a direct byte-prefix check 
(`b[0]=='#' && b[1]=='!'`) to avoid allocations and regex overhead, and would 
better match the intent of the method (since the regex is effectively constant).
   ```suggestion
       return b.length >= 2 && b[0] == '#' && b[1] == '!';
   ```



##########
solr/core/src/java/org/apache/solr/util/FileTypeMagicUtil.java:
##########
@@ -199,11 +202,63 @@ public static boolean isFileForbiddenInConfigset(byte[] 
bytes) {
                       
"application/x-java-applet,application/zip,application/x-tar,text/x-shellscript")
                   .split(",")));
 
-  private String guessTypeFallbackToOctetStream(ContentInfo contentInfo) {
-    if (contentInfo == null) {
-      return ContentType.OTHER.getMimeType();
-    } else {
-      return contentInfo.getContentType().getMimeType();
+  /**
+   * Detects JVM class files by the 0xCAFEBABE magic. Kotlin, Scala and Groovy 
use the same bytes.
+   */
+  private static boolean isJavaClass(byte[] b) {
+    return b.length >= 4
+        && (b[0] & 0xFF) == 0xCA
+        && (b[1] & 0xFF) == 0xFE
+        && (b[2] & 0xFF) == 0xBA
+        && (b[3] & 0xFF) == 0xBE;

Review Comment:
   `isJavaClass` now flags any file starting with `0xCAFEBABE` as a class file. 
The previous implementation also validated the version field, which reduces 
accidental false-positives. Consider additionally checking that `b.length >= 8` 
and that the major/minor version bytes represent a plausible classfile version 
(while still allowing preview classfiles with minor `0xFFFF`).
   ```suggestion
       if (b.length < 8
           || (b[0] & 0xFF) != 0xCA
           || (b[1] & 0xFF) != 0xFE
           || (b[2] & 0xFF) != 0xBA
           || (b[3] & 0xFF) != 0xBE) {
         return false;
       }
   
       final int minorVersion = ((b[4] & 0xFF) << 8) | (b[5] & 0xFF);
       final int majorVersion = ((b[6] & 0xFF) << 8) | (b[7] & 0xFF);
   
       return majorVersion >= 45 && (minorVersion == 0 || minorVersion == 
0xFFFF);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] SOLR-18213 Replace simplemagic with built-in magic-byte file type detection [solr]

Reply via email to