[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled
[ https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17336999#comment-17336999 ]

ASF GitHub Bot commented on TIKA-3374:
--------------------------------------

Ryan421 commented on a change in pull request #433:
URL: https://github.com/apache/tika/pull/433#discussion_r623514986

## File path: tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java

@@ -392,6 +392,15 @@ private void parseEntry(ArchiveInputStream archive, ArchiveEntry entry, XHTMLContentHandler xhtml)
         throws SAXException, IOException, TikaException {
     String name = entry.getName();
+
+    //Try to detect charset of archive entry in case of non-unicode filename is used
+    if (entry instanceof ZipArchiveEntry) {
+        detector.setText(((ZipArchiveEntry) entry).getRawName());

Review comment:
No need to be sorry ^^. It was really my fault: I moved the code block over from our project without checking it properly. I really appreciate your review and suggestions.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Non-Unicode archive entry name is garbled
> -----------------------------------------
>
> Key: TIKA-3374
> URL: https://issues.apache.org/jira/browse/TIKA-3374
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.26
> Reporter: Ryan Liu
> Priority: Major
> Attachments: gbk.zip
>
> PackageParser retrieves the archive entry name through the commons-compress
> archiver's ArchiveEntry#getName function and has no automatic charset
> detection for entry names.
> Although one could set the encoding by passing ArchiveStreamFactory(charset)
> into the parser context, that is not practical, since an archive file could
> use any charset.
> Instead of calling entry.getName() directly in PackageParser#parseEntry(),
> it is recommended to use entry.getRawName() and apply charset detection, to
> reduce the chance of getting a garbled string.
>
> The attachment is an example of a zip file that uses a non-Unicode archive
> entry name.
> The filename in the zip file should be *集团邮件审计系统2021年自动巡检需求文档_V4.0.doc*,
> but it is garbled in Tika 1.26 because PackageParser treats it as Unicode.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
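
The idea in the issue description (decode the raw entry-name bytes instead of trusting entry.getName()) can be illustrated with a small stand-alone sketch. This is not the code from pull request #433: the pull request feeds the raw name to Tika's charset detector, whereas the sketch below swaps in a much simpler technique, a strict UTF-8 decode followed by a caller-supplied fallback charset. The class and method names (EntryNameDecoder, strictDecode, decodeEntryName) are hypothetical, and it assumes the running JVM ships the GBK charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Tika or commons-compress.
public class EntryNameDecoder {

    /** Decodes strictly; returns null if the bytes are not valid in the given charset. */
    static String strictDecode(byte[] raw, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /**
     * Tries UTF-8 first, then the fallback charset, and finally ISO-8859-1,
     * which maps every byte to a character and so never fails.
     */
    static String decodeEntryName(byte[] rawName, Charset fallback) {
        String utf8 = strictDecode(rawName, StandardCharsets.UTF_8);
        if (utf8 != null) {
            return utf8;
        }
        String alternative = strictDecode(rawName, fallback);
        return alternative != null
                ? alternative
                : new String(rawName, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Assumption: GBK is available in this JVM.
        Charset gbk = Charset.forName("GBK");
        // Unicode escapes for the tail of the file name from the issue
        // ("...需求文档_V4.0.doc"), so the source file encoding does not matter.
        String expected = "\u9700\u6c42\u6587\u6863_V4.0.doc";
        byte[] rawName = expected.getBytes(gbk);
        System.out.println(decodeEntryName(rawName, gbk).equals(expected));
    }
}
```

A fixed fallback only helps when the likely legacy charset is known in advance; statistical detection, as proposed in the issue, avoids that assumption at the cost of occasional misdetection on short names.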
[jira] [Commented] (TIKA-3376) Improve handling of write limit reached in new /tika json endpoint
[ https://issues.apache.org/jira/browse/TIKA-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335762#comment-17335762 ]

Hudson commented on TIKA-3376:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #214 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/214/])
TIKA-3376 -- improve write limit reached handling in new /tika json output (tallison: [https://github.com/apache/tika/commit/9ac7e759b2007f541375ee2dedc736de5a555ccb])
* (edit) tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
* (edit) tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaResourceTest.java
* (edit) tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/TikaResourceNoStackTest.java
* (edit) tika-core/src/main/java/org/apache/tika/exception/WriteLimitReachedException.java
* (edit) tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java

> Improve handling of write limit reached in new /tika json endpoint
> ------------------------------------------------------------------
>
> Key: TIKA-3376
> URL: https://issues.apache.org/jira/browse/TIKA-3376
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> If the server is not started with the -s option (show stacktrace), the new
> json endpoint for /tika should return 200 with a writelimitreached=true
> metadata value but no stacktrace.
[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled
[ https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335746#comment-17335746 ]

Hudson commented on TIKA-3374:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #122 (See [https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/122/])
TIKA-3374 add encoding detection to zip entry names via Ryan Liu. (tallison: [https://github.com/apache/tika/commit/2704f0ee82b7799366aa2eeb02957be7eb7630d2])
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/pkg/PackageParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* (edit) CHANGES.txt
* (add) tika-parsers/src/test/resources/test-documents/gbk.zip
* (edit) tika-parsers/src/test/java/org/apache/tika/config/TikaEncodingDetectorTest.java
[jira] [Commented] (TIKA-3376) Improve handling of write limit reached in new /tika json endpoint
[ https://issues.apache.org/jira/browse/TIKA-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335747#comment-17335747 ]

Hudson commented on TIKA-3376:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #122 (See [https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/122/])
TIKA-3376 improve handling of write limit reached in json output from /tika endpoint (tallison: [https://github.com/apache/tika/commit/32545d471b2ecdc57c64813c40cf834d55dc8f77])
* (edit) tika-server/src/test/java/org/apache/tika/server/TikaResourceNoStackTest.java
* (edit) tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
[jira] [Created] (TIKA-3376) Improve handling of write limit reached in new /tika json endpoint
Tim Allison created TIKA-3376:
---------------------------------

Summary: Improve handling of write limit reached in new /tika json endpoint
Key: TIKA-3376
URL: https://issues.apache.org/jira/browse/TIKA-3376
Project: Tika
Issue Type: Task
Reporter: Tim Allison

If the server is not started with the -s option (show stacktrace), the new json endpoint for /tika should return 200 with a writelimitreached=true metadata value but no stacktrace.
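
The behaviour requested above can be sketched in a few lines of plain Java. This is only an illustration of the rule stated in the issue, not the code in TikaResource: the class, the Response holder and the "content" and "stacktrace" keys are hypothetical, and the issue does not say what the endpoint returns when -s is set, so that branch is an assumption.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the tika-server implementation.
public class WriteLimitResponseSketch {

    /** Hypothetical holder for an HTTP status plus the JSON metadata map. */
    static final class Response {
        final int status;
        final Map<String, String> metadata;

        Response(int status, Map<String, String> metadata) {
            this.status = status;
            this.metadata = metadata;
        }
    }

    /**
     * Builds the response for a parse that hit the write limit.
     * Without showStackTrace (server started without -s) the result is
     * 200 with writelimitreached=true and no stack trace, as the issue asks.
     */
    static Response onWriteLimitReached(String partialContent, Throwable cause,
                                        boolean showStackTrace) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("content", partialContent); // hypothetical key
        metadata.put("writelimitreached", "true"); // key as written in the issue
        if (showStackTrace) {
            // Assumption: with -s the trace is added and the status stays 200.
            StringWriter trace = new StringWriter();
            cause.printStackTrace(new PrintWriter(trace, true));
            metadata.put("stacktrace", trace.toString()); // hypothetical key
        }
        return new Response(200, metadata);
    }

    public static void main(String[] args) {
        Response response = onWriteLimitReached(
                "truncated text", new RuntimeException("write limit reached"), false);
        System.out.println(response.status + " " + response.metadata);
    }
}
```

Keeping the flag as an explicit parameter makes the "no stack trace unless asked" rule easy to cover with a unit test, which is roughly what the TikaResourceNoStackTest file named in the commit notifications suggests.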
[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled
[ https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335680#comment-17335680 ]

Hudson commented on TIKA-3374:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #213 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/213/])
TIKA-3374 -- fix up to encoding detection in package parser (tallison: [https://github.com/apache/tika/commit/fbac00b1dbe0464a7de379e6edb843973b917c6e])
* (delete) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/test/resources/org/apache/tika/parser/pkg/gbk.zip
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/test/java/org/apache/tika/parser/pkg/PackageParserTest.java
* (edit) CHANGES.txt
* (add) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/test/resources/test-documents/gbk.zip
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* (add) tika-parsers/tika-parsers-classic/tika-parsers-classic-package/src/test/java/org/apache/tika/parser/pkg/PackageParserTest.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-package/src/test/java/org/apache/tika/config/TikaEncodingDetectorTest.java
[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled
[ https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335545#comment-17335545 ]

Hudson commented on TIKA-3374:
------------------------------

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #212 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/212/])
TIKA-3374 -- apply charset detection for archive entry name (#433) (github: [https://github.com/apache/tika/commit/07aa47855dfcbb27d11a996dc7b8cfa04b68493b])
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/test/java/org/apache/tika/parser/pkg/PackageParserTest.java
* (add) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/test/resources/org/apache/tika/parser/pkg/gbk.zip
[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled
[ https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335483#comment-17335483 ]

ASF GitHub Bot commented on TIKA-3374:
--------------------------------------

tballison merged pull request #433:
URL: https://github.com/apache/tika/pull/433
[GitHub] [tika] tballison merged pull request #433: [TIKA-3374] Apply charset detection for archive entry name
tballison merged pull request #433:
URL: https://github.com/apache/tika/pull/433
[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled
[ https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335476#comment-17335476 ]

ASF GitHub Bot commented on TIKA-3374:
--------------------------------------

tballison commented on a change in pull request #433:
URL: https://github.com/apache/tika/pull/433#discussion_r623052485

## File path: tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java

@@ -392,6 +392,15 @@ private void parseEntry(ArchiveInputStream archive, ArchiveEntry entry, XHTMLContentHandler xhtml)
         throws SAXException, IOException, TikaException {
     String name = entry.getName();
+
+    //Try to detect charset of archive entry in case of non-unicode filename is used
+    if (entry instanceof ZipArchiveEntry) {
+        detector.setText(((ZipArchiveEntry) entry).getRawName());

Review comment:
Sorry, please forgive me. I meant embarrassing for me because I figured I was missing something!!!
Re: Release 1.27?
Thank you Konstantin!

I’m not planning on updating POI because ooxml schemas lite didn’t have enough classes for our unit tests. Andi recently made some updates on their trunk, and I haven’t had a chance to confirm that those fixes work :(. If we wanted to drop the full ooxml schemas into our jar, I could test 5.0.0 with our regression files.

On Thu, Apr 29, 2021 at 4:17 AM Konstantin Gribov wrote:

> +1 for release
>
> Are you planning to merge TIKA-3164 (update to POI 5.0.0) for this release?
>
> --
> Best regards,
> Konstantin Gribov.
>
>
> On Wed, Apr 28, 2021 at 9:36 PM Oleg Tikhonov wrote:
>
>> +1
>>
>> On Wed, Apr 28, 2021, 19:22 Tim Allison wrote:
>>
>> > All,
>> >
>> > There have been a number of key fixes in 1.x and some security fixes
>> > in some of our dependencies. Any objections to starting the release
>> > process for 1.27 in the next few weeks? Any blockers we need to fix
>> > for 1.27?
>> >
>> > Cheers,
>> >
>> > Tim
>> >
>> > ref: https://issues.apache.org/jira/browse/TIKA-3375
Re: Release 1.27?
+1 for release

Are you planning to merge TIKA-3164 (update to POI 5.0.0) for this release?

--
Best regards,
Konstantin Gribov.


On Wed, Apr 28, 2021 at 9:36 PM Oleg Tikhonov wrote:

> +1
>
> On Wed, Apr 28, 2021, 19:22 Tim Allison wrote:
>
> > All,
> >
> > There have been a number of key fixes in 1.x and some security fixes
> > in some of our dependencies. Any objections to starting the release
> > process for 1.27 in the next few weeks? Any blockers we need to fix
> > for 1.27?
> >
> > Cheers,
> >
> > Tim
> >
> > ref: https://issues.apache.org/jira/browse/TIKA-3375
[jira] [Updated] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov updated TIKA-3164:
------------------------------------
Issue Type: Task (was: Bug)

> Upgrade to POI 5.0.0 when available
> -----------------------------------
>
> Key: TIKA-3164
> URL: https://issues.apache.org/jira/browse/TIKA-3164
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major