[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

ASF GitHub Bot (Jira) Wed, 28 Apr 2021 19:42:08 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335103#comment-17335103
 ]


ASF GitHub Bot commented on TIKA-3374:
--------------------------------------

Ryan421 commented on a change in pull request #433:
URL: https://github.com/apache/tika/pull/433#discussion_r622696337



##########
File path: 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
##########
@@ -392,6 +392,15 @@ private void parseEntry(ArchiveInputStream archive, ArchiveEntry entry,
                             XHTMLContentHandler xhtml)
             throws SAXException, IOException, TikaException {
         String name = entry.getName();
+        
+        //Try to detect charset of archive entry in case of non-unicode filename is used
+        if (entry instanceof ZipArchiveEntry) {
+            detector.setText(((ZipArchiveEntry) entry).getRawName());

Review comment:
       Yes, it is really embarrassing. I will change the parser to extend 
AbstractEncodingDetectorParser and use getEncodingDetector to do the job. 
Thank you so much.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Non-Unicode archive entry name is garbled
> -----------------------------------------
>
>                 Key: TIKA-3374
>                 URL: https://issues.apache.org/jira/browse/TIKA-3374
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.26
>            Reporter: Ryan Liu
>            Priority: Major
>         Attachments: gbk.zip
>
>
> PackageParser retrieves the archive entry name through the commons-compress 
> ArchiveEntry#getName function and has no automatic charset detection for 
> entry names.
> Although one could set the encoding by passing an 
> ArchiveStreamFactory(charset) into the parser context, this is not practical, 
> since many different charsets can be used in an archive file.
> Instead of directly calling entry.getName() in PackageParser#parseEntry(), 
> it is recommended to use entry.getRawName() and apply charset detection to 
> reduce the chance of getting a garbled string.
>  
> The attachment is an example of a zip file that uses a non-Unicode archive 
> entry name.
> The filename in the zip file should be *集团邮件审计系统2021年自动巡检需求文档_V4.0.doc*
> but it is garbled in Tika 1.26 since the PackageParser treats it as Unicode.
>  
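A minimal sketch of the raw-name approach described above, using only the JDK. It is not Tika's encoding detector: it stands in for charset detection with a strict UTF-8 attempt followed by a caller-supplied fallback charset. The class name RawNameDecoder and the method decode are hypothetical; in PackageParser the bytes would come from ZipArchiveEntry#getRawName().

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or commons-compress.
class RawNameDecoder {

    // Decode raw entry-name bytes: strict UTF-8 first, then the fallback charset.
    static String decode(byte[] rawName, Charset fallback) {
        try {
            // REPORT makes the decoder throw on invalid input instead of
            // silently substituting U+FFFD replacement characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(rawName))
                    .toString();
        } catch (CharacterCodingException e) {
            // The bytes are not valid UTF-8, so assume the legacy charset.
            return new String(rawName, fallback);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a non-Unicode entry name: "café" encoded as ISO-8859-1.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
    }
}
```

For the attached gbk.zip the fallback would presumably be Charset.forName("GBK"), assuming the running JDK ships that charset. A real detector would choose the charset from the bytes rather than take it as a parameter, which is the point of the proposed change.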



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
