[ https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873949#comment-17873949 ]
Tilman Hausherr commented on TIKA-4298:
---------------------------------------

The problem is that this image might be considered to be a work of art. Your colleague didn't sign an ICLA. IMHO there might be two solutions:
1) you recreate the zip file without the image
2) you change the test so that it loads the zip file from the URL in the ticket.

(2) is done a lot in PDFBox but I haven't seen it in tika.

> Failed to detect charset for zip entry with short non-Unicode file name
> -----------------------------------------------------------------------
>
>                 Key: TIKA-4298
>                 URL: https://issues.apache.org/jira/browse/TIKA-4298
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>            Reporter: Mingchun Zhao
>            Priority: Major
>             Fix For: 3.0.0, 2.9.3
>
>         Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip
>
>
> The Japanese file names extracted from a zip file
> [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The charset of the file
> name is Shift_JIS, but the detect() method within the PackageParser class was
> not able to detect the charset properly.
> {code:java}
> $ ls -1 testZipEntryNameCharsetShiftSJIS
> shiba.png
> 文章1.txt
> 文章2.txt
> {code}
> {code:java}
> $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pkg.PackageParser"/>
> <meta name="resourceName" content="testZipEntryNameCharsetShiftSJIS.zip"/>
> <meta name="X-TIKA:detectedEncoding" content="ISO-8859-1"/>
> <meta name="Content-Length" content="28885"/>
> <meta name="X-TIKA:encodingDetector" content="UniversalEncodingDetector"/>
> <meta name="Content-Type" content="application/zip"/>
> <title/>
> </head>
> <body><div class="embedded" id="shiba.png"/>
> <div class="package-entry"><h1>shiba.png</h1>
> </div>
> <div class="embedded" id="���1.txt"/>
> <div class="package-entry"><h1>���1.txt</h1>
> <p>あいうえお
> かきくけこ
> </p></div>
> <div class="embedded" id="���2.txt"/>
> <div class="package-entry"><h1>���2.txt</h1>
> <p>さしすせそ
> たちつてと
> </p></div>
> </body></html>% {code}
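
For reference, option (1) from the comment (recreating the zip file without the image) could be sketched roughly as below, using only java.util.zip from the JDK. This is a minimal sketch, not the fixture used in the Tika test suite: the class name ShiftJisZipFixture and its helper methods are hypothetical, the entry contents are assumed to be UTF-8 text, and it relies on the JDK shipping the Shift_JIS charset (true for standard full JDK builds). The entry names are written as raw Shift_JIS bytes, which is the situation described in the ticket.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika: builds a small zip whose entry
// names are stored as Shift_JIS bytes, with no image entry.
class ShiftJisZipFixture {

    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    /** Builds a zip in memory; entry names are encoded with Shift_JIS. */
    static byte[] buildZip(Map<String, String> entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes, SHIFT_JIS)) {
            for (Map.Entry<String, String> e : entries.entrySet()) {
                zip.putNextEntry(new ZipEntry(e.getKey()));
                // Assumption: contents are written as UTF-8 text.
                zip.write(e.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Lists the entry names, decoding them with the given charset. */
    static List<String> entryNames(byte[] zipBytes, Charset charset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), charset)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> entries = new LinkedHashMap<>();
        // "\u6587\u7AE0" is the two-character prefix of the file names in the ticket.
        entries.put("\u6587\u7AE01.txt", "\u3042\u3044\u3046\u3048\u304A\n");
        entries.put("\u6587\u7AE02.txt", "\u3055\u3057\u3059\u305B\u305D\n");
        byte[] zip = buildZip(entries);

        // Decoding with the right charset recovers the names; a wrong
        // single-byte charset yields garbled names instead.
        System.out.println(entryNames(zip, SHIFT_JIS));
        System.out.println(entryNames(zip, StandardCharsets.ISO_8859_1));
    }
}
```

Writing the bytes returned by buildZip to a file would give a test resource that contains only the two text entries, so the licensing question about the image would not arise. Whether such a generated file triggers the same detection failure as the original attachment would still need to be checked against the test in the patch.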