Mingchun Zhao created TIKA-4298:
-----------------------------------

             Summary: Failed to detect charset for zip entry with short non-Unicode file name
                 Key: TIKA-4298
                 URL: https://issues.apache.org/jira/browse/TIKA-4298
             Project: Tika
          Issue Type: Bug
          Components: detector
            Reporter: Mingchun Zhao
         Attachments: testZipEntryNameCharsetShiftSJIS.zip

The Japanese file names extracted from the zip file [^testZipEntryNameCharsetShiftSJIS.zip] are garbled. The file names are encoded in Shift_JIS, but the detect() method in the PackageParser class does not detect the charset correctly.

{code:java}
$ ls -1 testZipEntryNameCharsetShiftSJIS
shiba.png
文章1.txt
文章2.txt
{code}

{code:java}
$ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pkg.PackageParser"/>
<meta name="resourceName" content="testZipEntryNameCharsetShiftSJIS.zip"/>
<meta name="X-TIKA:detectedEncoding" content="ISO-8859-1"/>
<meta name="Content-Length" content="28885"/>
<meta name="X-TIKA:encodingDetector" content="UniversalEncodingDetector"/>
<meta name="Content-Type" content="application/zip"/>
<title/>
</head>
<body><div class="embedded" id="shiba.png"/>
<div class="package-entry"><h1>shiba.png</h1>
</div>
<div class="embedded" id="���1.txt"/>
<div class="package-entry"><h1>���1.txt</h1>
<p>あいうえお
かきくけこ
</p></div>
<div class="embedded" id="���2.txt"/>
<div class="package-entry"><h1>���2.txt</h1>
<p>さしすせそ
たちつてと
</p></div>
</body></html>%
{code}
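
For illustration only, here is a minimal sketch of the underlying problem using plain JDK classes (java.util.zip), not Tika's PackageParser or its detector. The class name ZipEntryNameCharsetDemo and its helper methods are hypothetical. It writes entry names in Shift_JIS, as the attached zip appears to do, and shows that the names only round-trip when the reader is told the right charset. A name such as 文章1.txt contains just four non-ASCII bytes, which is presumably very little evidence for a statistical detector to work with.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; none of this is Tika code.
final class ZipEntryNameCharsetDemo {

    // Decode raw entry-name bytes with an explicitly chosen charset.
    static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    // Build an in-memory zip whose (empty) entries have names encoded in nameCharset.
    static byte[] buildZip(Charset nameCharset, String... names) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, nameCharset)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // List entry names, decoding them with the charset supplied by the caller.
    static List<String> listNames(byte[] zipBytes, Charset nameCharset) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        String name = "\u6587\u7AE01.txt"; // 文章1.txt, written with escapes to avoid source-encoding issues

        byte[] raw = name.getBytes(sjis);
        System.out.println("raw name length in bytes: " + raw.length);
        System.out.println("decoded as Shift_JIS matches: " + decode(raw, sjis).equals(name));
        System.out.println("decoded as UTF-8 matches: "
                + decode(raw, StandardCharsets.UTF_8).equals(name));

        byte[] zip = buildZip(sjis, "shiba.png", name);
        System.out.println("names read with Shift_JIS: " + listNames(zip, sjis));
    }
}
```

The ASCII-only name shiba.png survives any of these decodings, which matches the output above where only the Japanese names are garbled.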