[ 
https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873969#comment-17873969
 ] 

Mingchun Zhao commented on TIKA-4298:
-------------------------------------

Thanks for the info. I've recreated the zip file without the image, so it now 
contains only the two text files with Shift_JIS filenames, as you mentioned.

[https://github.com/apache/tika/pull/1903/commits/860d8db0f2d93bf333c37beee13ba288b4eb1088]

Could you please confirm this? Thanks!

> Failed to detect charset for zip entry with short non-Unicode file name
> -----------------------------------------------------------------------
>
>                 Key: TIKA-4298
>                 URL: https://issues.apache.org/jira/browse/TIKA-4298
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>            Reporter: Mingchun Zhao
>            Priority: Major
>             Fix For: 3.0.0, 2.9.3
>
>         Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip
>
>
> The Japanese file names extracted from the zip file 
> [^testZipEntryNameCharsetShiftSJIS.zip] were garbled. The file names are 
> encoded in Shift_JIS, but the detect() method in the PackageParser class 
> failed to detect the charset.
> {code:java}
> $ ls -1 testZipEntryNameCharsetShiftSJIS
> shiba.png
> 文章1.txt
> 文章2.txt
> {code}
> {code:java}
> $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
> <?xml version="1.0" encoding="UTF-8"?><html 
> xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-TIKA:Parsed-By" 
> content="org.apache.tika.parser.pkg.PackageParser"/>
> <meta name="resourceName" content="testZipEntryNameCharsetShiftSJIS.zip"/>
> <meta name="X-TIKA:detectedEncoding" content="ISO-8859-1"/>
> <meta name="Content-Length" content="28885"/>
> <meta name="X-TIKA:encodingDetector" content="UniversalEncodingDetector"/>
> <meta name="Content-Type" content="application/zip"/>
> <title/>
> </head>
> <body><div class="embedded" id="shiba.png"/>
> <div class="package-entry"><h1>shiba.png</h1>
> </div>
> <div class="embedded" id="���1.txt"/>
> <div class="package-entry"><h1>���1.txt</h1>
> <p>あいうえお&#13;
> かきくけこ&#13;
> </p></div>
> <div class="embedded" id="���2.txt"/>
> <div class="package-entry"><h1>���2.txt</h1>
> <p>さしすせそ&#13;
> たちつてと&#13;
> </p></div>
> </body></html>
> {code}
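
For illustration only, the garbling in the output above can be reproduced with the plain JDK, without Tika. The sketch below is not the PackageParser code or the attached patch; the class name ZipNameDecodeSketch and its decode() helper are hypothetical, and it assumes the JDK in use ships the Shift_JIS charset (standard builds do). It encodes the name 文章1.txt as Shift_JIS, then decodes the same bytes once with the right charset and once with UTF-8. The Japanese part of the name is only four bytes long, which gives a statistical detector very little input to work with.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: shows how the raw bytes of a zip entry name decode
// differently depending on the charset that is assumed for them.
public class ZipNameDecodeSketch {

    // Decode the raw entry-name bytes with an explicitly chosen charset.
    static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        // Assumption: Shift_JIS is available in this JDK (jdk.charsets module).
        Charset shiftJis = Charset.forName("Shift_JIS");

        // "\u6587\u7AE0" is 文章; escapes keep the source file ASCII-only.
        String original = "\u6587\u7AE0" + "1.txt";
        byte[] rawName = original.getBytes(shiftJis);

        // The two kanji take two bytes each in Shift_JIS.
        System.out.println("raw name length in bytes: " + rawName.length);

        // Correct charset: the name round-trips unchanged.
        System.out.println("as Shift_JIS: " + decode(rawName, shiftJis));

        // Wrong charset: the kanji bytes are malformed UTF-8, so the decoder
        // substitutes U+FFFD replacement characters for them.
        System.out.println("as UTF-8:     "
                + decode(rawName, StandardCharsets.UTF_8));
    }
}
```

Decoding with the correct charset restores the name exactly, while the wrong charset yields replacement characters in front of the ASCII suffix 1.txt, similar to the entry names in the tika-app output quoted above.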



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
