Mingchun Zhao created TIKA-4298:
-----------------------------------

             Summary: Failed to detect charset for zip entry with short non-Unicode file name
                 Key: TIKA-4298
                 URL: https://issues.apache.org/jira/browse/TIKA-4298
             Project: Tika
          Issue Type: Bug
          Components: detector
            Reporter: Mingchun Zhao
         Attachments: testZipEntryNameCharsetShiftSJIS.zip

The Japanese file names extracted from the zip file 
[^testZipEntryNameCharsetShiftSJIS.zip] are garbled. The entry names are 
encoded in Shift_JIS, but the detect() method in the PackageParser class 
fails to detect that charset.
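
To illustrate the symptom outside Tika, here is a minimal sketch that uses only java.util.zip from the JDK. It does not reproduce PackageParser's detection logic. The class and method names (ZipEntryNameCharsetDemo, buildZip, listNames) are made up for this example, and the in-memory archive is a stand-in for the attached zip, not a copy of it. It shows that the same entry-name bytes read back correctly with Shift_JIS and turn into mojibake when decoded as ISO-8859-1.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of Tika.
public class ZipEntryNameCharsetDemo {

    // Build a small in-memory zip whose entry names are encoded with nameCharset.
    public static byte[] buildZip(Charset nameCharset, String... names) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf, nameCharset)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write("dummy".getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // List entry names, decoding the raw name bytes with nameCharset.
    public static List<String> listNames(byte[] zip, Charset nameCharset) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis =
                 new ZipInputStream(new ByteArrayInputStream(zip), nameCharset)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        // \u6587\u7AE0 is the two-kanji prefix of the file names in the report.
        byte[] zip = buildZip(sjis, "shiba.png", "\u6587\u7AE01.txt", "\u6587\u7AE02.txt");

        // Correct charset: the original names come back.
        System.out.println(listNames(zip, sjis));
        // Wrong charset: each name byte becomes one Latin-1 character.
        System.out.println(listNames(zip, StandardCharsets.ISO_8859_1));
    }
}
```

The ASCII-only name (shiba.png) survives either decoding, which matches the output below where only the two Japanese names are damaged.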


{code:java}
$ ls -1 testZipEntryNameCharsetShiftSJIS
shiba.png
文章1.txt
文章2.txt
{code}
{code:java}
$ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pkg.PackageParser"/>
<meta name="resourceName" content="testZipEntryNameCharsetShiftSJIS.zip"/>
<meta name="X-TIKA:detectedEncoding" content="ISO-8859-1"/>
<meta name="Content-Length" content="28885"/>
<meta name="X-TIKA:encodingDetector" content="UniversalEncodingDetector"/>
<meta name="Content-Type" content="application/zip"/>
<title/>
</head>
<body><div class="embedded" id="shiba.png"/>
<div class="package-entry"><h1>shiba.png</h1>
</div>
<div class="embedded" id="���1.txt"/>
<div class="package-entry"><h1>���1.txt</h1>
<p>あいうえお&#13;
かきくけこ&#13;
</p></div>
<div class="embedded" id="���2.txt"/>
<div class="package-entry"><h1>���2.txt</h1>
<p>さしすせそ&#13;
たちつてと&#13;
</p></div>
</body></html>
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
