[jira] [Commented] (TIKA-2569) Grouped Text boxes in .ppt
[ https://issues.apache.org/jira/browse/TIKA-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358830#comment-16358830 ]

Hudson commented on TIKA-2569:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1434 (See [https://builds.apache.org/job/Tika-trunk/1434/])
TIKA-2569 -- Extract text from grouped text boxes in PPT. (tallison: [https://github.com/apache/tika/commit/4c510d6a9910044825c6ee8df87c419a3370ab4e])
* (add) tika-parsers/src/test/resources/test-documents/testPPT_groups.pptx
* (edit) CHANGES.txt
* (add) tika-parsers/src/test/resources/test-documents/testPPT_groups.ppt
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXSLFExtractorTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java

> Grouped Text boxes in .ppt
> --
>
> Key: TIKA-2569
> URL: https://issues.apache.org/jira/browse/TIKA-2569
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.16
> Reporter: Richard A
> Assignee: Tim Allison
> Priority: Major
> Labels: easyfix
> Attachments: Presentation1.ppt, Presentation1.pptx
>
>
> Grouped Text boxes are unable to be parsed and no content is returned when
> items have been grouped together. This issue does not seem to affect .pptx
> files, only .ppt. The attached documents are the same except the file format.
> It should give a very simple example of a .ppt document where no content will
> be returned.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
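The change to HSLFExtractor.java is not reproduced in this thread. As a rough illustration of the symptom in the issue description, the sketch below shows how a walk over only the top-level shapes of a slide misses text held inside groups, while a recursive walk finds it. All types and method names here (Shape, TextBox, Group, extractFlat, extractRecursive) are hypothetical and are not Apache POI or Tika classes; this is not the committed fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical shape model for illustration only; not Apache POI or Tika API.
class GroupedTextSketch {
    interface Shape {}

    static final class TextBox implements Shape {
        final String text;
        TextBox(String text) { this.text = text; }
    }

    static final class Group implements Shape {
        final List<Shape> children;
        Group(List<Shape> children) { this.children = children; }
    }

    // Flat walk: inspects top-level shapes only, so text inside a group is skipped.
    static List<String> extractFlat(List<Shape> shapes) {
        List<String> out = new ArrayList<>();
        for (Shape s : shapes) {
            if (s instanceof TextBox) {
                out.add(((TextBox) s).text);
            }
        }
        return out;
    }

    // Recursive walk: descends into groups, including nested groups.
    static List<String> extractRecursive(List<Shape> shapes) {
        List<String> out = new ArrayList<>();
        collect(shapes, out);
        return out;
    }

    private static void collect(List<Shape> shapes, List<String> out) {
        for (Shape s : shapes) {
            if (s instanceof TextBox) {
                out.add(((TextBox) s).text);
            } else if (s instanceof Group) {
                collect(((Group) s).children, out);
            }
        }
    }

    public static void main(String[] args) {
        List<Shape> nested = Arrays.asList(new TextBox("Nested B"));
        List<Shape> grouped = Arrays.asList(new TextBox("Grouped A"), new Group(nested));
        List<Shape> slide = Arrays.asList(new TextBox("Title"), new Group(grouped));

        System.out.println(extractFlat(slide));      // prints [Title]
        System.out.println(extractRecursive(slide)); // prints [Title, Grouped A, Nested B]
    }
}
```

A slide whose only text sits inside a group would yield an empty list from the flat walk, which matches the "no content is returned" behaviour reported for the attached .ppt.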
[jira] [Commented] (TIKA-2569) Grouped Text boxes in .ppt
[ https://issues.apache.org/jira/browse/TIKA-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360484#comment-16360484 ]

Richard A commented on TIKA-2569:
-

This looks positive. Is there any visibility of a date/release number that this fix is scheduled to be included in?
[jira] [Commented] (TIKA-2569) Grouped Text boxes in .ppt
[ https://issues.apache.org/jira/browse/TIKA-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360895#comment-16360895 ]

Tim Allison commented on TIKA-2569:
---

No idea on date, but I just cherry-picked this back to the 1.18 release branch so it will be available no matter what the next release is...:)

> Grouped Text boxes in .ppt
> --
>
> Key: TIKA-2569
> URL: https://issues.apache.org/jira/browse/TIKA-2569
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.16
> Reporter: Richard A
> Assignee: Tim Allison
> Priority: Major
> Labels: easyfix
> Fix For: 1.18, 2.0.0
>
> Attachments: Presentation1.ppt, Presentation1.pptx
>
>
> Grouped Text boxes are unable to be parsed and no content is returned when
> items have been grouped together. This issue does not seem to affect .pptx
> files, only .ppt. The attached documents are the same except the file format.
> It should give a very simple example of a .ppt document where no content will
> be returned.
[jira] [Commented] (TIKA-2569) Grouped Text boxes in .ppt
[ https://issues.apache.org/jira/browse/TIKA-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384155#comment-16384155 ]

Tim Allison commented on TIKA-2569:
---

[~BAEApache], if all goes according to plan, we'll start the release process in about a week. The process itself can take a week or two. You can follow our discussion on our dev list: https://lists.apache.org/list.html?dev@tika.apache.org

If you'd like to test 1.18 vs 1.17, you can grab a nightly build from jenkins, e.g. [here|https://builds.apache.org/job/Tika-trunk/1442/org.apache.tika$tika-app/] and use tika-eval to run comparisons: https://wiki.apache.org/tika/TikaEval .

Let us know if you find any regressions before the 1.18 release!
[jira] [Commented] (TIKA-2569) Grouped Text boxes in .ppt
[ https://issues.apache.org/jira/browse/TIKA-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417703#comment-16417703 ]

Tim Allison commented on TIKA-2569:
---

Whoa! This added a huge amount of newly extracted text in our regression corpus. Thank you, [~BAEApache]!