[ 
https://issues.apache.org/jira/browse/TIKA-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873949#comment-17873949
 ] 

Tilman Hausherr commented on TIKA-4298:
---------------------------------------

The problem is that this image might be considered a work of art, and your 
colleague didn't sign an ICLA. IMHO there are two possible solutions: (1) 
recreate the zip file without the image, or (2) change the test so that it 
loads the zip file from the URL in the ticket. (2) is done a lot in PDFBox, 
but I haven't seen it in Tika.
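
For option (1), a minimal sketch of how the archive could be rewritten 
without the image while keeping the entry names in their original charset. 
The class ZipEntryFilter and its methods are hypothetical helpers, not part 
of Tika or PDFBox, and the charset is passed in explicitly on the assumption 
that it is already known to be Shift_JIS. Whether the rewritten archive 
still reproduces the detection failure would need to be checked against the 
test.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
final class ZipEntryFilter {

    /** Copies a zip archive, skipping the entry named excludedName. */
    static byte[] withoutEntry(byte[] zipBytes, String excludedName, Charset nameCharset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipInputStream zin =
                     new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset);
             ZipOutputStream zout = new ZipOutputStream(out, nameCharset)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.getName().equals(excludedName)) {
                    continue; // getNextEntry() skips the unread data of this entry
                }
                // Names are written back with nameCharset, so they stay non-Unicode.
                zout.putNextEntry(new ZipEntry(entry.getName()));
                int n;
                while ((n = zin.read(buffer)) != -1) {
                    zout.write(buffer, 0, n);
                }
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Lists entry names, decoding them with nameCharset. */
    static List<String> listNames(byte[] zipBytes, Charset nameCharset) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin =
                     new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a small in-memory archive; a stand-in for the real attachment. */
    static byte[] zipOf(Charset nameCharset, String... names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(out, nameCharset)) {
            for (String name : names) {
                zout.putNextEntry(new ZipEntry(name));
                zout.write(1); // one placeholder byte of content
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

With the real attachment, the bytes of testZipEntryNameCharsetShiftSJIS.zip 
would be passed to withoutEntry together with "shiba.png" and 
Charset.forName("Shift_JIS"); zipOf is only there so the sketch can be tried 
without the file.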

> Failed to detect charset for zip entry with short non-Unicode file name
> -----------------------------------------------------------------------
>
>                 Key: TIKA-4298
>                 URL: https://issues.apache.org/jira/browse/TIKA-4298
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>            Reporter: Mingchun Zhao
>            Priority: Major
>             Fix For: 3.0.0, 2.9.3
>
>         Attachments: TIKA-4298.patch, testZipEntryNameCharsetShiftSJIS.zip
>
>
> The Japanese file names extracted from the zip file 
> [^testZipEntryNameCharsetShiftSJIS.zip] are garbled. The file names are 
> encoded in Shift_JIS, but the detect() method in the PackageParser class 
> does not detect the charset correctly.
> {code:java}
> $ ls -1 testZipEntryNameCharsetShiftSJIS
> shiba.png
> 文章1.txt
> 文章2.txt
> {code}
> {code:java}
> $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pkg.PackageParser"/>
> <meta name="resourceName" content="testZipEntryNameCharsetShiftSJIS.zip"/>
> <meta name="X-TIKA:detectedEncoding" content="ISO-8859-1"/>
> <meta name="Content-Length" content="28885"/>
> <meta name="X-TIKA:encodingDetector" content="UniversalEncodingDetector"/>
> <meta name="Content-Type" content="application/zip"/>
> <title/>
> </head>
> <body><div class="embedded" id="shiba.png"/>
> <div class="package-entry"><h1>shiba.png</h1>
> </div>
> <div class="embedded" id="���1.txt"/>
> <div class="package-entry"><h1>���1.txt</h1>
> <p>あいうえお&#13;
> かきくけこ&#13;
> </p></div>
> <div class="embedded" id="���2.txt"/>
> <div class="package-entry"><h1>���2.txt</h1>
> <p>さしすせそ&#13;
> たちつてと&#13;
> </p></div>
> </body></html>
> {code}
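
The garbling described above can be illustrated outside Tika with plain JDK 
calls. This is only a sketch: the class ZipNameDecoding is hypothetical, it 
assumes the running JDK ships the Shift_JIS charset, and it does not show 
how PackageParser actually detects or decodes entry names. It only shows 
that the same raw name bytes round-trip when decoded as Shift_JIS and are 
damaged when decoded as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not Tika code.
final class ZipNameDecoding {

    // Assumption: Shift_JIS is available in the running JDK.
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    /** Decodes the raw bytes of a zip entry name with the given charset. */
    static String decodeName(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        // "\u6587\u7AE01.txt" is the first Japanese entry name from the report.
        String original = "\u6587\u7AE01.txt";
        byte[] raw = original.getBytes(SHIFT_JIS); // bytes as stored in the archive

        System.out.println("Shift_JIS:  " + decodeName(raw, SHIFT_JIS));
        System.out.println("UTF-8:      " + decodeName(raw, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + decodeName(raw, StandardCharsets.ISO_8859_1));
    }
}
```

Only the Shift_JIS line reproduces the original name. The UTF-8 decoding 
replaces the invalid byte sequences with U+FFFD, which is consistent with 
the replacement characters in the Tika output quoted above.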



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
