[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336999#comment-17336999
 ] 

ASF GitHub Bot commented on TIKA-3374:
--

Ryan421 commented on a change in pull request #433:
URL: https://github.com/apache/tika/pull/433#discussion_r623514986



##
File path: 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
##
@@ -392,6 +392,15 @@ private void parseEntry(ArchiveInputStream archive, ArchiveEntry entry,
                             XHTMLContentHandler xhtml)
             throws SAXException, IOException, TikaException {
         String name = entry.getName();
+
+        //Try to detect charset of archive entry in case of non-unicode filename is used
+        if (entry instanceof ZipArchiveEntry) {
+            detector.setText(((ZipArchiveEntry) entry).getRawName());

Review comment:
   No need to be sorry ^^. It was really my fault: I moved the code block 
from our project to here without checking it properly. I really appreciate your 
review and suggestions.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Non-Unicode archive entry name is garbled
> -
>
> Key: TIKA-3374
> URL: https://issues.apache.org/jira/browse/TIKA-3374
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.26
>Reporter: Ryan Liu
>Priority: Major
> Attachments: gbk.zip
>
>
> PackageParser retrieves archive entry names through the commons-compress 
> ArchiveEntry#getName function and has no automatic charset detection for 
> entry names.
> Although one could set the encoding by passing ArchiveStreamFactory(charset) 
> into the parser context, this is not practical because an archive file may 
> use any of a wide range of charsets.
> Instead of calling entry.getName() directly in PackageParser#parseEntry(), 
> it is recommended to use entry.getRawName() and apply charset detection to 
> reduce the chance of getting a garbled string.
>  
> The attachment is an example of a zip file that uses a non-Unicode archive 
> entry name.
> The filename in the zip file should be *集团邮件审计系统2021年自动巡检需求文档_V4.0.doc*
> but is garbled in Tika 1.26 because PackageParser treats it as Unicode.
>  
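
As a rough illustration of the idea in the description above (decode the raw 
entry-name bytes instead of trusting the default decoding), here is a minimal 
JDK-only sketch. It is not the code from the pull request and does not use 
Tika's detector: the class name EntryNameDecoder, the method decodeEntryName 
and the "strict UTF-8 first, then a caller-supplied fallback charset" rule are 
all assumptions made for the example. A real detector would guess the charset 
statistically rather than rely on a fixed fallback.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or commons-compress.
final class EntryNameDecoder {

    // Decode raw entry-name bytes: accept them as UTF-8 only if they are
    // strictly valid UTF-8, otherwise decode with the fallback charset.
    static String decodeEntryName(byte[] rawName, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(rawName))
                    .toString();
        } catch (CharacterCodingException e) {
            // Assumption: the caller knows a plausible legacy charset (e.g. GBK).
            return new String(rawName, fallback);
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // "\u4e2d\u6587.doc" is a short Chinese file name used as sample data.
        byte[] raw = "\u4e2d\u6587.doc".getBytes(gbk);
        System.out.println(decodeEntryName(raw, gbk));
    }
}
```

With commons-compress, the byte array would come from 
ZipArchiveEntry#getRawName(), as in the diff quoted earlier in this message.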



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (TIKA-3376) Improve handling of write limit reached in new /tika json endpoint

2021-04-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335762#comment-17335762
 ] 

Hudson commented on TIKA-3376:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #214 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/214/])
TIKA-3376 -- improve write limit reached handling in new /tika json output 
(tallison: 
[https://github.com/apache/tika/commit/9ac7e759b2007f541375ee2dedc736de5a555ccb])
* (edit) 
tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
* (edit) 
tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaResourceTest.java
* (edit) 
tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaResourceNoStackTest.java
* (edit) 
tika-core/src/main/java/org/apache/tika/exception/WriteLimitReachedException.java
* (edit) 
tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java


> Improve handling of write limit reached in new /tika json endpoint
> --
>
> Key: TIKA-3376
> URL: https://issues.apache.org/jira/browse/TIKA-3376
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> If the server is not started with the -s option (show stacktrace), the new 
> json endpoint for /tika should return 200 with a writelimitreached=true 
> metadata value but no stacktrace.





[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335746#comment-17335746
 ] 

Hudson commented on TIKA-3374:
--

SUCCESS: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #122 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/122/])
TIKA-3374 add encoding detection to zip entry names via Ryan Liu. (tallison: 
[https://github.com/apache/tika/commit/2704f0ee82b7799366aa2eeb02957be7eb7630d2])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/pkg/PackageParserTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* (edit) CHANGES.txt
* (add) tika-parsers/src/test/resources/test-documents/gbk.zip
* (edit) 
tika-parsers/src/test/java/org/apache/tika/config/TikaEncodingDetectorTest.java




[jira] [Commented] (TIKA-3376) Improve handling of write limit reached in new /tika json endpoint

2021-04-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335747#comment-17335747
 ] 

Hudson commented on TIKA-3376:
--

SUCCESS: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #122 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/122/])
TIKA-3376 improve handling of write limit reached in json output from /tika 
endpoint (tallison: 
[https://github.com/apache/tika/commit/32545d471b2ecdc57c64813c40cf834d55dc8f77])
* (edit) 
tika-server/src/test/java/org/apache/tika/server/TikaResourceNoStackTest.java
* (edit) 
tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java




[jira] [Created] (TIKA-3376) Improve handling of write limit reached in new /tika json endpoint

2021-04-29 Thread Tim Allison (Jira)
Tim Allison created TIKA-3376:
-

 Summary: Improve handling of write limit reached in new /tika json 
endpoint
 Key: TIKA-3376
 URL: https://issues.apache.org/jira/browse/TIKA-3376
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


If the server is not started with the -s option (show stacktrace), the new json 
endpoint for /tika should return 200 with a writelimitreached=true metadata 
value but no stacktrace.
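
To make the intended behaviour concrete, here is a small, purely illustrative 
sketch of the decision described above. It is not the tika-server 
implementation: the class WriteLimitSketch, the stand-in exception 
LimitReached and the "content" and "stacktrace" keys are assumptions for the 
example; only the writelimitreached=true value comes from this issue.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names other than "writelimitreached" are invented here.
final class WriteLimitSketch {

    // Stand-in for an exception signalling that the write limit was hit.
    static final class LimitReached extends RuntimeException {
        LimitReached(String message) {
            super(message);
        }
    }

    // Build the metadata that would accompany a 200 response.
    static Map<String, String> toMetadata(String partialText, Throwable error,
                                          boolean showStackTrace) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("content", partialText);
        if (error instanceof LimitReached) {
            metadata.put("writelimitreached", "true");
            // Only expose the stack trace when the -s style flag is set.
            if (showStackTrace) {
                StringWriter sw = new StringWriter();
                error.printStackTrace(new PrintWriter(sw, true));
                metadata.put("stacktrace", sw.toString());
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(
                toMetadata("partial text", new LimitReached("limit hit"), false));
    }
}
```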





[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335680#comment-17335680
 ] 

Hudson commented on TIKA-3374:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #213 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/213/])
TIKA-3374 -- fix up to encoding detection in package parser (tallison: 
[https://github.com/apache/tika/commit/fbac00b1dbe0464a7de379e6edb843973b917c6e])
* (delete) 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/test/resources/org/apache/tika/parser/pkg/gbk.zip
* (edit) 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/test/java/org/apache/tika/parser/pkg/PackageParserTest.java
* (edit) CHANGES.txt
* (add) 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/test/resources/test-documents/gbk.zip
* (edit) 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* (add) 
tika-parsers/tika-parsers-classic/tika-parsers-classic-package/src/test/java/org/apache/tika/parser/pkg/PackageParserTest.java
* (edit) 
tika-parsers/tika-parsers-classic/tika-parsers-classic-package/src/test/java/org/apache/tika/config/TikaEncodingDetectorTest.java




[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-29 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335545#comment-17335545
 ] 

Hudson commented on TIKA-3374:
--

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #212 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/212/])
TIKA-3374 -- apply charset detection for archive entry name (#433) (github: 
[https://github.com/apache/tika/commit/07aa47855dfcbb27d11a996dc7b8cfa04b68493b])
* (edit) 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* (edit) 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/test/java/org/apache/tika/parser/pkg/PackageParserTest.java
* (add) 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/test/resources/org/apache/tika/parser/pkg/gbk.zip




[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335483#comment-17335483
 ] 

ASF GitHub Bot commented on TIKA-3374:
--

tballison merged pull request #433:
URL: https://github.com/apache/tika/pull/433


   








[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335476#comment-17335476
 ] 

ASF GitHub Bot commented on TIKA-3374:
--

tballison commented on a change in pull request #433:
URL: https://github.com/apache/tika/pull/433#discussion_r623052485



##
File path: 
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
##
@@ -392,6 +392,15 @@ private void parseEntry(ArchiveInputStream archive, ArchiveEntry entry,
                             XHTMLContentHandler xhtml)
             throws SAXException, IOException, TikaException {
         String name = entry.getName();
+
+        //Try to detect charset of archive entry in case of non-unicode filename is used
+        if (entry instanceof ZipArchiveEntry) {
+            detector.setText(((ZipArchiveEntry) entry).getRawName());

Review comment:
   Sorry, please forgive me.  I meant embarrassing for me because I figured 
I was missing something!!!










Re: Release 1.27?

2021-04-29 Thread Tim Allison
Thank you Konstantin! I’m not planning on updating POI because ooxml
schemas lite didn’t have enough classes for our unit tests.

Andi recently made some updates on their trunk, and I haven’t had a chance
to confirm those fixes work :(.

If we want to drop the full ooxml schemas into our jar, I can test 5.0.0
with our regression files.

On Thu, Apr 29, 2021 at 4:17 AM Konstantin Gribov  wrote:

> +1 for release
>
> Are you planning to merge TIKA-3164 (update to POI 5.0.0) for this release?
>
> --
> Best regards,
> Konstantin Gribov.
>
>
> On Wed, Apr 28, 2021 at 9:36 PM Oleg Tikhonov 
> wrote:
>
>>  +1
>>
>> On Wed, Apr 28, 2021, 19:22 Tim Allison  wrote:
>>
>> > All,
>> >
>> >   There have been a number of key fixes in 1.x and some security fixes
>> > in some of our dependencies.  Any objections to starting the release
>> > process for 1.27 in the next few weeks?  Any blockers we need to fix
>> > for 1.27?
>> >
>> >  Cheers,
>> >
>> >Tim
>> >
>> > ref: https://issues.apache.org/jira/browse/TIKA-3375
>> >
>>
>


Re: Release 1.27?

2021-04-29 Thread Konstantin Gribov
+1 for release

Are you planning to merge TIKA-3164 (update to POI 5.0.0) for this release?

-- 
Best regards,
Konstantin Gribov.


On Wed, Apr 28, 2021 at 9:36 PM Oleg Tikhonov 
wrote:

>  +1
>
> On Wed, Apr 28, 2021, 19:22 Tim Allison  wrote:
>
> > All,
> >
> >   There have been a number of key fixes in 1.x and some security fixes
> > in some of our dependencies.  Any objections to starting the release
> > process for 1.27 in the next few weeks?  Any blockers we need to fix
> > for 1.27?
> >
> >  Cheers,
> >
> >Tim
> >
> > ref: https://issues.apache.org/jira/browse/TIKA-3375
> >
>


[jira] [Updated] (TIKA-3164) Upgrade to POI 5.0.0 when available

2021-04-29 Thread Konstantin Gribov (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov updated TIKA-3164:

Issue Type: Task  (was: Bug)

> Upgrade to POI 5.0.0 when available
> ---
>
> Key: TIKA-3164
> URL: https://issues.apache.org/jira/browse/TIKA-3164
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>






[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

2021-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335171#comment-17335171
 ] 

ASF GitHub Bot commented on TIKA-3374:
--

Ryan421 commented on pull request #433:
URL: https://github.com/apache/tika/pull/433#issuecomment-828968582


   Add unit test with a dummy EncodingDetector to verify the charset detection 
flow is executed.
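
For readers unfamiliar with the pattern, the kind of test described above can 
be sketched roughly as follows. This is not the test added in the pull 
request: the interface NameCharsetDetector is a simplified, hypothetical 
stand-in (Tika's real EncodingDetector works on input streams), and 
CountingDetector and resolveName are names invented for the example. The 
dummy detector always returns a fixed charset and counts its calls, so a test 
can check that the detection step actually ran.

```java
import java.nio.charset.Charset;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a "dummy detector" test; not Tika's API.
final class DummyDetectorSketch {

    // Simplified detector contract assumed for this example.
    interface NameCharsetDetector {
        Charset detect(byte[] rawName);
    }

    // Dummy detector: always answers with a fixed charset and counts calls.
    static final class CountingDetector implements NameCharsetDetector {
        final AtomicInteger calls = new AtomicInteger();
        private final Charset answer;

        CountingDetector(Charset answer) {
            this.answer = answer;
        }

        @Override
        public Charset detect(byte[] rawName) {
            calls.incrementAndGet();
            return answer;
        }
    }

    // Stand-in for the entry-name step under test: keep the default name
    // unless the detector proposes a charset for the raw bytes.
    static String resolveName(byte[] rawName, String defaultName,
                              NameCharsetDetector detector) {
        Charset detected = detector.detect(rawName);
        return detected == null ? defaultName : new String(rawName, detected);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        CountingDetector detector = new CountingDetector(gbk);
        // "\u4e2d\u6587.doc" is a short Chinese file name used as sample data.
        String name = resolveName("\u4e2d\u6587.doc".getBytes(gbk), "?", detector);
        System.out.println(name + " (detector calls: " + detector.calls.get() + ")");
    }
}
```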





